%Copyright 2009 Brian Jaress % %This program is free software: you can redistribute it and/or modify %it under the terms of the GNU General Public License as published by %the Free Software Foundation, either version 3 of the License, or %(at your option) any later version. % %This program is distributed in the hope that it will be useful, %but WITHOUT ANY WARRANTY; without even the implied warranty of %MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the %GNU General Public License for more details. % %You should have received a copy of the GNU General Public License %along with this program. If not, see . \documentclass{article} \usepackage{listings} \usepackage{hyperref} \lstnewenvironment{code}{\lstset{ language=Haskell, basicstyle=\small\ttfamily, frame=single}}{} \newcommand{\kbd}[1]{\texttt{#1}} \begin{document} \title{Taking a Second Look at Haskell:\\ A Thread Pool for the Command Line} \author{Brian Jaress\\ \href{mailto:jaress@hawaii.edu}{jaress@hawaii.edu}\\ \url{http://brian-jaress.livejournal.com}} \date{2009-01-18} \maketitle \begin{abstract} A small program I wrote to see what I think of the controversial Haskell language. The program runs other programs in the manner of a thread pool. The final section contains personal, subjective thoughts on the language. \end{abstract} \section{Introduction} I kept hearing people who love \href{http://haskell.org/}{Haskell} say that it's a fantastic language but an acquired taste, so I thought, ``Maybe I was too hasty. I'll give it another try.'' Not long after that, someone who loved Haskell at first sight \href{http://www.reddit.com/r/programming/comments/7pohw/why_haskell_is_beyond_ready_for_prime_time/c0725wy}{suggested} writing something practical quickly (for a reasonable definition of ``practical'') as a way to gauge whether I knew enough to draw any conclusions. This is the result of following that advice. It includes a very small program I wrote, plus some thoughts on the language. \subsection{What the Program Does} This program is a command line \href{http://en.wikipedia.org/wiki/Thread_pool_pattern} {thread pool}. You give it shell commands and it runs them, with a fixed number\footnote{There's an exception at the end, where there will likely be fewer than that many commands left to run.} running at any one time, while the rest are either finished or waiting to start. You might, for example, have fifty files to convert using a program that converts a single file each time you run it and want to process three at a time.\footnote{Thread pools usually show up as part of a larger program, where they are used to avoid the overhead of creating and destroying threads and to peg the number of simultaneous threads at something not too large or too small. The first purpose doesn't really apply here because the tasks themselves have the same type of overhead associated with a thread, and more of it. The second purpose, however, does apply because doing all the tasks at once can take longer and use more resources than doing a few at a time.} It's practical in the sense that I have a real use for it. \subsection{How to Use the Program} The source code is available at \url{http://www2.hawaii.edu/~jaress/threadPool/threadPool.lhs}. You should compile the program using \href{http://www.haskell.org/ghc/}{GHC}\footnote{I developed and tested the program with GHC version 6.8.2.} and the \kbd{-threaded} option. It takes a single, optional argument which is the number of threads (the default is three). Give it the commands to run, one per line, through standard input. \section{The Code} The thread pool program is mostly a typical, internal thread pool with some input processing. \subsection{Libraries} I didn't use the \href{http://blogs.msdn.com/devdev/archive/2005/10/20/483247.aspx} {software transactional memory} library that everyone raves about. One thing I'm ambivalent about is enough for now. Other than that, I used pretty much what you'd expect: \begin{code} import System.Environment (getArgs) import System.Process (runCommand, waitForProcess) import Control.Concurrent (forkIO, killThread, ThreadId) import Control.Concurrent.MVar import Control.Exception (finally) import Control.Monad (mapM_, replicateM_) import Control.Concurrent.QSem \end{code} \subsection{The Thread Pool} The internal thread pool itself doesn't see the processes it runs---it carries out arbitrary tasks. \begin{code} threadPool :: Int -> [IO ()] -> IO () threadPool threadCount tasks = do allTasksAssigned <- newMVar False tasksPending <- newMVar 0 finalTaskDone <- newEmptyMVar nextTask <- newEmptyMVar \end{code} The pool uses four \kbd{MVar}s\footnote{An \kbd{MVar} is like a synchronized drop-box. A thread trying to put a value into a full \kbd{MVar} will block, as will a thread trying to take a value out of an empy one.} to coordinate the threads. \begin{description} \item[nextTask] where the main thread leaves tasks for worker threads \item[allTasksAssigned] changed by the main thread to indicate it has put the last task in \kbd{nextTask} \item[tasksPending] incremented by the main thread when it assigns a task, decremented by the workers when they complete a task \item[finalTaskDone] used in the (likely) case that workers need to notify the main thread when all tasks are complete \end{description} The gist of what the main thread does is pretty simple. Start the worker threads and put tasks into \kbd{nextTask} until you run out: \begin{code} threadIDs <- forkIOn (max threadCount 1) $ worker tasksPending nextTask allTasksAssigned finalTaskDone doAll [putMVar nextTask task >> modifyMVar_ tasksPending (return.(+1)) | task <- tasks] \end{code} The tricky part is in how the threads interact and coordinate through the \kbd{MVar}s. If the main thread has assigned every task but none are pending, then it knows that all the tasks have been completed and it can kill the workers. If the main thread has assigned every task and some are still pending, it has to wait for the workers to notify it through \kbd{finalTaskDone}. There's no problem with workers sending a signal when none is needed---they'll block, but they've finished the tasks and the main thread doesn't expect a signal because it already knows to kill them. What about workers failing to send a needed signal? The trick here is that the main thread sets \kbd{allTasksAssigned} before it looks for a zero, while the workers decrement before they check \kbd{allTasksAssigned}. If a worker has finished the last task and does not see that all tasks have been assigned, then it has already decremented, and the main thread will see a zero after announcing that all tasks have been assigned\footnote{Even if the worker decrements before the main thread increments, it will work since the main thread increments before it checks \kbd{tasksPending}.}: \begin{code} swapMVar allTasksAssigned True pending <- readMVar tasksPending if pending > 0 then takeMVar finalTaskDone else return () killAll threadIDs where worker tasksPending nextTask allTasksAssigned finalTaskDone = finally (takeMVar nextTask >>= id) $ do pending <- modifyMVar tasksPending (\x -> return (x-1, x-1)) noMore <- readMVar allTasksAssigned if pending == 0 && noMore then putMVar finalTaskDone () else return () worker tasksPending nextTask allTasksAssigned finalTaskDone \end{code} Any serious bugs are most likely in this part of the program. (Corrections to any part of the program are welcome.) That's almost it for the pool itself. The rest is simple helper functions---wrappers for working with multiple values at once. \begin{code} doAll :: [IO ()] -> IO () doAll = foldl1 (>>) killAll = doAll . map killThread forkIOn :: Int -> IO () -> IO [ThreadId] forkIOn count io = if count > 0 then do first <- forkIO io rest <- forkIOn (count - 1) io return (first:rest) else return [] \end{code} \subsection{Handling Input} The number of threads to put in the pool is taken from the first command line argument. It's very quick and dirty option parsing, and the default of three is arbitrary. \begin{code} getThreadCount :: IO (Int) getThreadCount = do args <- getArgs if args == [] then return 3 else return $ read $ head args \end{code} There's one more thing to do, and that's turn each line of input into a task that actually runs an external process and waits for it to finish. Those tasks go right into our internal thread pool, and the program is complete. \begin{code} main = do threadCount <- getThreadCount input <- getContents threadPool threadCount [execute cmd | cmd <- lines input] where execute cmd = do handle <- runCommand cmd waitForProcess handle return () \end{code} \section{Subjective Thoughts on Haskell} Programming in Haskell, at least this time around, was a lot like programming in other languages. I spent most of my time planning the secret handshake between threads and looking up library functions. When I tried code I had just written, I got an unhelpful error message with a helpful line number. It's just like home. My main problem was with syntax, but for an interesting reason. Haskell loves to disguise one thing as another without quite making it work the same way. For example, my most puzzling bug came from writing \kbd{return first:rest} when I should have written \kbd{return (first:rest)} because \kbd{return} is a function, and function application has higher precedence than the \kbd{:} operator. Another result of these partial disguises is that I felt the language encouraging me to do hacky things, to play along with the disguise for a while and then suddenly violate it. Using \kbd{return ()} as a no-op in an \kbd{if} statement or \kbd{x >>= id} for anything at all felt clever but somehow wrong. Such things are sometimes necessary, but Haskell seems to have a little too much fun with them. One surprise was that I ended up using the type system almost exclusively to resolve overloading\footnote{For example, it is often used that way in every line of a \kbd{do} block.} and not for type safety. I didn't feel restricted by it, but it was essentially no help with correctness. My experience with Haskell so far matches the reputation of \href{http://www.perl.org/}{Perl} more so than the reputation of Haskell itself. It's full of little corners and sharp edges, and the more of them you know the more clever little shortcuts you can take. An expert could probably squeeze the whole \kbd{threadPool} function into a few lines. Of course, the whole program would have been a few lines if I got \kbd{threadPool} from a library. I don't say this to knock Haskell for not having a thread pool library; it has many fine libraries, some of them probably not found in other languages. The point is, libraries are the dominant factor. The language itself matters because there's rarely a library for everything you want to do, but the effect is often exaggerated. \kbd{Control.Concurrent} was a much bigger help than, say, referential transparency or monadic IO.\footnote{\kbd{Control.Concurrent} uses both of those features, but equivalent libraries are often available in languages with neither feature.} In my view, the arguments both for and against Haskell are overblown. If you're wondering whether to learn it or use it for a particular project, give it the same consideration as any other language. If you know what you need, \href{http://www.haskell.org/hoogle/} {look for it} in the libraries. \end{document}