%Copyright 2009 Brian Jaress % %This program is free software: you can redistribute it and/or modify %it under the terms of the GNU General Public License as published by %the Free Software Foundation, either version 3 of the License, or %(at your option) any later version. % %This program is distributed in the hope that it will be useful, %but WITHOUT ANY WARRANTY; without even the implied warranty of %MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the %GNU General Public License for more details. % %You should have received a copy of the GNU General Public License %along with this program. If not, see . \documentclass{article} \usepackage{listings} \usepackage{hyperref} \lstnewenvironment{code}{\lstset{ language=Haskell, basicstyle=\small\ttfamily, frame=single}}{} \newcommand{\kbd}[1]{\texttt{#1}} \begin{document} \title{Making a Choice on Haskell: \\ A Revised Thread Pool for the Command Line} \author{Brian Jaress\\ \href{mailto:jaress@hawaii.edu}{jaress@hawaii.edu}\\ \url{http://brian-jaress.livejournal.com}} \date{ 2009-03-14 } \maketitle \begin{abstract} The second version of a small Haskell program that runs other programs in the manner of a thread pool. This version incorporates ideas suggested by experienced Haskell users and some other new features. It also contains updated subjective thoughts on Haskell. \end{abstract} \section{Introduction} A while ago I wrote a small program in Haskell to see what I thought of the language, added my thoughts to the code, and posted it. Experienced haskellers soon \href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/} {posted} critiques and improved versions. I decided to incorporate those ideas into this second version of the program. Only one change has a significant effect on the program's behavior (see section \ref{sections}) but the code is much cleaner. I also updated the comments on my subjective thoughts. The conclusion (section \ref{conclusion}) has an answer to the big question: Do I want to keep using Haskell? \subsection{What the Program Does} This program is a command line \href{http://en.wikipedia.org/wiki/Thread_pool_pattern} {thread pool}. You give it shell commands and it runs them, with a fixed number\footnote{There's an exception at the end, where there will likely be fewer than that many commands left to run.} running at any one time, while the rest are either finished or waiting to start. You might, for example, have fifty files to convert using a program that converts a single file each time you run it and want to process three at a time.\footnote{Thread pools usually show up as part of a larger program, where they are used to avoid the overhead of creating and destroying threads and to peg the number of simultaneous threads at something not too large or too small. The first purpose doesn't really apply here because the tasks themselves have the same type of overhead associated with a thread, and more of it. The second purpose, however, does apply because doing all the tasks at once can take longer and use more resources than doing a few at a time.} \subsection{How to Use the Program} The name of the executable is \kbd{threadpool}. It takes a single, optional argument which is the number of simultaneous threads (the default is three). Give it the commands to run, one per line, through standard input. You may use blank lines to divide the commands into sections. The commands in a section will not be started until all the commands in previous sections are complete. The source code is available \href{http://hackage.haskell.org/cgi-bin/hackage-scripts/package/threadPool} {on HackageDB}. You should compile the program using \href{http://www.haskell.org/ghc/}{GHC} and the \kbd{-threaded} option. \footnote{I developed and tested the program with GHC version 6.8.2.} \section{The Code} The thread pool program is mostly an internal thread pool with some input processing. \subsection{Libraries} This version incorporates two key library \href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/} {suggestions} from Haskell users. The first is to use a \href{http://en.wikipedia.org/wiki/Semaphore_(programming)} {quantity semaphore} from the standard libraries. The second was to use functions for working with multiple monad values from \kbd{Control.Monad}. (See Section \ref{threadPool} for how they are used.) \begin{code} import Control.Concurrent (forkIO) import Control.Concurrent.QSem import Control.Exception (finally) import Control.Monad (mapM_, replicateM_) import Data.List (unfoldr) import System.Environment (getArgs) import System.Process (runCommand, waitForProcess) \end{code} \subsection{The Thread Pool} \label{threadPool} The internal thread pool itself doesn't see the processes it runs---it carries out arbitrary tasks. This version uses a quantity semaphore, per a \href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/c0755up} {suggestion on reddit}.\footnote{Thomas DuBuisson has since posted an internal thread pool library \href{http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Control-Engine} {on HackageDB}. I considered using it, but it's not quite what I needed, despite being very full-featured.} \begin{code} threadPool :: Int -> [IO ()] -> IO () threadPool threadCount tasks = do semaphore <- newQSem (max threadCount 1) mapM_ (delegate semaphore) tasks collect (max threadCount 1) semaphore where collect count = replicateM_ count . waitQSem delegate semaphore task = do waitQSem semaphore forkIO (task `finally` signalQSem semaphore) \end{code} A quantity semaphore, if you're like me and haven't used one before, represents a fixed amount of a resource. You can think of it like a library with a certain number of copies of the same book. If the library owns five copies and three are checked out, you can borrow one of the other two. If all five are out, you have to wait. That captures an essential property of the thread pool, and when the haskellers posted their own solutions there was a definite difference in size and complexity between solutions that used a \kbd{QSem} and those that didn't. The \kbd{threadPool} function creates a quantity semaphore and uses it to control the number of threads working on tasks at the same time. The main thread borrows a unit of the semaphore before giving a task to a worker thread, and the worker returns the unit when the task is done. That way, the main thread waits to give out a new task if the number of pending (assigned but incomplete) tasks equals the size of the quantity semaphore. Once all tasks are assigned, the main thread waits on the entire quantity semaphore to make sure all the workers are done before it moves on. Threads are not reused in this version, so \kbd{threadPool} is not technically a thread pool. Since it still looks like one from the outside, the name remains. \subsection{Handling Input} \subsubsection{Command Line Arguments} The number of simultaneous tasks is taken from the first command line argument, with a default of three. \begin{code} getThreadCount :: IO (Int) getThreadCount = do args <- getArgs if args == [] then return 3 else return $ extract $ reads $ head args where extract [(val, "")] = val extract _ = error "Non-integer threadcount on command line." \end{code} New in this version is a nicer error message. The old version would say ``no parse'' if the argument wasn't an integer. \subsubsection{Sectioned Input} \label{sections} Also new is the ability to divide the list of commands into sections. Before any commands in a new section start, all commands from previous sections will finish. You can achieve the same thing by running the \kbd{threadpool} program multiple times in sequence, but sections are likely to be easier if you are generating the list of commands with another program. The sectioning code ignores leading and trailing separators and collapses multiple contiguous separators into one, similar to blank lines in \LaTeX. I couldn't find a library function that did exactly what I needed, but several were close enough to be stitched together into my own one-off function. \begin{code} sections :: (a -> Bool) -> [a] -> [[a]] sections = unfoldr . splitFirst where splitFirst isSep raw = case break isSep (dropWhile isSep raw) of ([], _) -> Nothing parts -> Just parts \end{code} \subsubsection{Commands into Tasks} \label{main} There's one thing left, and that's to turn the commands into tasks and give each section to a \kbd{threadPool}. It's almost as straightforward as it sounds. The only wrinkle is that, since the separators are not valid commands, the conversion of commands into tasks happens after splitting into sections. That means dipping into the sublists. \begin{code} main = do count <- getThreadCount input <- getContents mapM_ (threadPool count . map execute) (parsed input) where parsed = sections (== "") . lines execute cmd = do handle <- runCommand cmd waitForProcess handle return () \end{code} \section{Further Subjective Thoughts on Haskell} \subsection{Explainability and Flexibility} In the previous version, I mentioned ``little corners and sharp edges'' that an expert can use to be very concise. One commenter \href{http://www.reddit.com/r/programming/comments/7qvlf/taking_a_second_look_at_haskell_pdf/c0752di} {pointed out} that Haskell's mathematical approach helps you avoid cutting yourself. There's some truth to that. I'm not worried about waking up to discover that \kbd{mapM\_} has a new, subtly different meaning that breaks my program, and a principled basis probably makes it easier for experts to remember things and understand each other's code. But there's still the issue of explaining code to a non-expert like me. Haskell has a reputation for being hard to learn, and I think part of that comes from being hard to explain. The \kbd{sections} function (in Section \ref{sections}) is the composition of two curried functions, one of which applies its first argument to its second, repeatedly. Explaining why that works would be a challenge.\footnote{And if you can explain that, try \kbd{mapM\_ (threadPool count . map execute) (parsed input)} from section \ref{main}.} The flip side is that you get a lot of flexibility to make it clear what the code is supposed to do. It's pretty easy to see, without actually tracing through the code of \kbd{sections}, that there's a helper function to split off only the first section and that helper is combined with \kbd{unfoldr} to make a function that splits all sections. The easiest place to see this flexibility might be in the large number of tiny library functions. One \href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/c0754sx} {comment} on the first version was that I hadn't taken advantage of the library functions, particularly in \kbd{Control.Monad}. I had assumed such things weren't in the libraries and that \kbd{Control.Monad} contained little else besides the monad type class. In fact, it's full of shorthand functions defined using one or two operations. I ended up actually sitting down and thinking for a while about whether to use \kbd{mapM\_} or \kbd{forM\_} in the \kbd{threadPool} function. The two are exactly the same, except that the order of the two arguments is reversed. The suggestions I saw generally used \kbd{forM\_}, probably because it puts the function argument second and they supplied it with a multi-line lambda expression. I decided to give the function a name by defining it in a \kbd{where} clause\footnote{I really like \kbd{where} clauses. I highly recommend them to designers of other languages.}, so the multi-line argument wasn't an issue. I went with \kbd{mapM\_} because it matched the order of the definition in the \kbd{where} clause: \kbd{delegate semaphore task} delegates a single task, and \kbd{mapM\_ (delegate semaphore) tasks} delegates a list of tasks. That sort of little choice actually made up most of the work I put into the code for this version. Switching to a quantity semaphore and adding new features took relatively little time, then I went back and revised. Oddly enough, it reminded me of writing something in English. \subsection{Input/Output and the Design of Haskell} After the first version, a few people complained that I had written a program almost entirely concerned with IO. They're quite right that that's what I've done, though slightly less so in this version. To some extent, that gives a different view of Haskell than if I had done something else. However, one of the reasons the thread pool is focused on IO is that Haskell, unlike most other languages, considers thread synchronization to be IO. Writing this program showed me something of why IO in Haskell is different. By extension, it showed me something of the thinking behind the design of Haskell. The difficulty of explaining Haskell code trips me up again. I will only generalize and say that Haskell seems to expect, as a matter of course, ways of thinking that old languages use to impress and new languages shun. \section{Conclusion} \label{conclusion} I keep a shortlist of languages I prefer to use, and I've decided to add Haskell. It's solid and different in a way that complements the other languages on my list. It will take time to become proficient---there's a lot to learn, and I have other things going on---so I don't make the decision lightly. A good part of it is simply the style of the language. There are several languages that I see a lot of value in but haven't added to the list, either because of some specific stylistic decision they made or something I couldn't quite put my finger on. There is, of course, a technical veto for languages that just don't cut it, but so many are left that the final choice is personal. \end{document}