%Copyright 2009 Brian Jaress
%
%This program is free software: you can redistribute it and/or modify
%it under the terms of the GNU General Public License as published by
%the Free Software Foundation, either version 3 of the License, or
%(at your option) any later version.
%
%This program is distributed in the hope that it will be useful,
%but WITHOUT ANY WARRANTY; without even the implied warranty of
%MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%GNU General Public License for more details.
%
%You should have received a copy of the GNU General Public License
%along with this program.  If not, see <http://www.gnu.org/licenses/>.
\documentclass{article}
\usepackage{listings}
\usepackage{hyperref}
\lstnewenvironment{code}{\lstset{
    language=Haskell,
    basicstyle=\small\ttfamily,
    frame=single}}{}
\newcommand{\kbd}[1]{\texttt{#1}}

\begin{document}

\title{Taking a Second Look at Haskell:\\
A Thread Pool for the Command Line}
\author{Brian Jaress\\
\href{mailto:jaress@hawaii.edu}{jaress@hawaii.edu}\\
\url{http://brian-jaress.livejournal.com}}
\date{2009-01-18}
\maketitle

\begin{abstract}
A small program I wrote to see what I think of the controversial Haskell
language.  The program runs other programs in the manner of a thread
pool.  The final section contains personal, subjective thoughts on the
language.
\end{abstract}


\section{Introduction}

I kept hearing people who love \href{http://haskell.org/}{Haskell} say
that it's a fantastic language but an acquired taste, so I thought,
``Maybe I was too hasty.  I'll give it another try.''

Not long after that, someone who loved Haskell at first sight
\href{http://www.reddit.com/r/programming/comments/7pohw/why_haskell_is_beyond_ready_for_prime_time/c0725wy}{suggested}
writing something practical quickly (for a reasonable definition of
``practical'') as a way to gauge whether I knew enough to draw any
conclusions.

This is the result of following that advice.  It includes a very small
program I wrote, plus some thoughts on the language.

\subsection{What the Program Does}

This program is a command line
\href{http://en.wikipedia.org/wiki/Thread_pool_pattern} {thread pool}.
You give it shell commands and it runs them, with a fixed
number\footnote{There's an exception at the end, where there will likely
be fewer than that many commands left to run.} running at any one time,
while the rest are either finished or waiting to start.

You might, for example, have fifty files to convert using a program that
converts a single file each time you run it and want to process three at
a time.\footnote{Thread pools usually show up as part of a larger
program, where they are used to avoid the overhead of creating and
destroying threads and to peg the number of simultaneous threads at
something not too large or too small.  The first purpose doesn't really
apply here because the tasks themselves have the same type of overhead
associated with a thread, and more of it.  The second purpose, however,
does apply because doing all the tasks at once can take longer and use
more resources than doing a few at a time.}

It's practical in the sense that I have a real use for it.


\subsection{How to Use the Program}

The source code is available at
\url{http://www2.hawaii.edu/~jaress/threadPool/threadPool.lhs}.  You
should compile the program using
\href{http://www.haskell.org/ghc/}{GHC}\footnote{I developed and tested
the program with GHC version 6.8.2.} and the \kbd{-threaded} option.

It takes a single, optional argument which is the number of threads (the
default is three).  Give it the commands to run, one per line, through
standard input.


\section{The Code}

The thread pool program is mostly a typical, internal thread pool with
some input processing.


\subsection{Libraries}

I didn't use the
\href{http://blogs.msdn.com/devdev/archive/2005/10/20/483247.aspx}
{software transactional memory} library that everyone raves about.  One
thing I'm ambivalent about is enough for now.

Other than that, I used pretty much what you'd expect:

\begin{code}
import System.Environment (getArgs)
import System.Process (runCommand, waitForProcess)
import Control.Concurrent (forkIO, killThread, ThreadId)
import Control.Concurrent.MVar
import Control.Exception (finally)
import Control.Monad (mapM_, replicateM_)
import Control.Concurrent.QSem
\end{code}


\subsection{The Thread Pool}

The internal thread pool itself doesn't see the processes it runs---it
carries out arbitrary tasks.

\begin{code}
threadPool :: Int -> [IO ()] -> IO ()
threadPool threadCount tasks = do
    allTasksAssigned <- newMVar False
    tasksPending <- newMVar 0
    finalTaskDone <- newEmptyMVar
    nextTask <- newEmptyMVar
\end{code}

The pool uses four \kbd{MVar}s\footnote{An \kbd{MVar} is like a
synchronized drop-box.  A thread trying to put a value into a full
\kbd{MVar} will block, as will a thread trying to take a value out of an
empy one.} to coordinate the threads.

\begin{description}
\item[nextTask] where the main thread leaves tasks for worker threads
\item[allTasksAssigned] changed by the main thread to indicate it has
put the last task in \kbd{nextTask}
\item[tasksPending] incremented by the main thread when it assigns a
task, decremented by the workers when they complete a task
\item[finalTaskDone] used in the (likely) case that workers need to
notify the main thread when all tasks are complete
\end{description}

The gist of what the main thread does is pretty simple.  Start the worker
threads and put tasks into \kbd{nextTask} until you run out:

\begin{code}
    threadIDs <- forkIOn (max threadCount 1) $
        worker tasksPending nextTask
            allTasksAssigned finalTaskDone
    doAll [putMVar nextTask task >>
            modifyMVar_ tasksPending (return.(+1)) |
            task <- tasks]
\end{code}

The tricky part is in how the threads interact and coordinate through
the \kbd{MVar}s.

If the main thread has assigned every task but none are pending, then it
knows that all the tasks have been completed and it can kill the
workers.  If the main thread has assigned every task and some are still
pending, it has to wait for the workers to notify it through
\kbd{finalTaskDone}.

There's no problem with workers sending a signal when none is
needed---they'll block, but they've finished the tasks and the main
thread doesn't expect a signal because it already knows to kill them.

What about workers failing to send a needed signal?  The trick here is
that the main thread sets \kbd{allTasksAssigned} before it looks for a
zero, while the workers decrement before they check
\kbd{allTasksAssigned}.  If a worker has finished the last task and does
not see that all tasks have been assigned, then it has already
decremented, and the main thread will see a zero after announcing that
all tasks have been assigned\footnote{Even if the worker decrements
before the main thread increments, it will work since the main thread
increments before it checks \kbd{tasksPending}.}:

\begin{code}
    swapMVar allTasksAssigned True
    pending <- readMVar tasksPending
    if pending > 0 then takeMVar finalTaskDone else return ()
    killAll threadIDs

    where
        worker tasksPending nextTask
            allTasksAssigned finalTaskDone =
            finally (takeMVar nextTask >>= id) $ do
                pending <- modifyMVar tasksPending
                    (\x -> return (x-1, x-1))

                noMore <- readMVar allTasksAssigned
                if pending == 0 && noMore
                    then putMVar finalTaskDone ()
                    else return ()
                worker tasksPending nextTask
                    allTasksAssigned finalTaskDone
\end{code}

Any serious bugs are most likely in this part of the program.
(Corrections to any part of the program are welcome.)

That's almost it for the pool itself.  The rest is simple helper
functions---wrappers for working with multiple values at once.

\begin{code}
        doAll :: [IO ()] -> IO ()
        doAll = foldl1 (>>)
        killAll = doAll . map killThread

        forkIOn :: Int -> IO () -> IO [ThreadId]
        forkIOn count io =
            if count > 0
                then do
                    first <- forkIO io
                    rest <- forkIOn (count - 1) io
                    return (first:rest)
                else
                    return []
\end{code}


\subsection{Handling Input}

The number of threads to put in the pool is taken from the first command
line argument.  It's very quick and dirty option parsing, and the
default of three is arbitrary.

\begin{code}
getThreadCount :: IO (Int)
getThreadCount = do
    args <- getArgs
    if args == []
        then return 3
        else return $ read $ head args
\end{code}

There's one more thing to do, and that's turn each line of input into a
task that actually runs an external process and waits for it to finish.
Those tasks go right into our internal thread pool, and the program is
complete.

\begin{code}
main = do
    threadCount <- getThreadCount
    input <- getContents
    threadPool threadCount [execute cmd | cmd <- lines input]
        where
            execute cmd = do
                handle <- runCommand cmd
                waitForProcess handle
                return ()
\end{code}


\section{Subjective Thoughts on Haskell}

Programming in Haskell, at least this time around, was a lot like
programming in other languages.  I spent most of my time planning the
secret handshake between threads and looking up library functions.  When
I tried code I had just written, I got an unhelpful error message with a
helpful line number.  It's just like home.

My main problem was with syntax, but for an interesting reason.  Haskell
loves to disguise one thing as another without quite making it work the
same way.  For example, my most puzzling bug came from writing
\kbd{return first:rest} when I should have written \kbd{return
(first:rest)} because \kbd{return} is a function, and function
application has higher precedence than the \kbd{:} operator.

Another result of these partial disguises is that I felt the language
encouraging me to do hacky things, to play along with the disguise for a
while and then suddenly violate it.  Using \kbd{return ()} as a no-op in
an \kbd{if} statement or \kbd{x >>= id} for anything at all felt clever
but somehow wrong.

Such things are sometimes necessary, but Haskell seems to have a little
too much fun with them.

One surprise was that I ended up using the type system almost
exclusively to resolve overloading\footnote{For example, it is often
used that way in every line of a \kbd{do} block.} and not for type
safety.  I didn't feel restricted by it, but it was essentially no help
with correctness.

My experience with Haskell so far matches the reputation of
\href{http://www.perl.org/}{Perl} more so than the reputation of Haskell
itself.  It's full of little corners and sharp edges, and the more of
them you know the more clever little shortcuts you can take.  An expert
could probably squeeze the whole \kbd{threadPool} function into a few
lines.

Of course, the whole program would have been a few lines if I got
\kbd{threadPool} from a library.  I don't say this to knock Haskell for
not having a thread pool library; it has many fine libraries, some of
them probably not found in other languages.  The point is, libraries are
the dominant factor.

The language itself matters because there's rarely a library for
everything you want to do, but the effect is often exaggerated.
\kbd{Control.Concurrent} was a much bigger help than, say, referential
transparency or monadic IO.\footnote{\kbd{Control.Concurrent} uses both
of those features, but equivalent libraries are often available in
languages with neither feature.}

In my view, the arguments both for and against Haskell are overblown.
If you're wondering whether to learn it or use it for a particular
project, give it the same consideration as any other language.  If you
know what you need, \href{http://www.haskell.org/hoogle/} {look for it}
in the libraries.

\end{document}