%Copyright 2009 Brian Jaress
%
%This program is free software: you can redistribute it and/or modify
%it under the terms of the GNU General Public License as published by
%the Free Software Foundation, either version 3 of the License, or
%(at your option) any later version.
%
%This program is distributed in the hope that it will be useful,
%but WITHOUT ANY WARRANTY; without even the implied warranty of
%MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%GNU General Public License for more details.
%
%You should have received a copy of the GNU General Public License
%along with this program.  If not, see <http://www.gnu.org/licenses/>.
\documentclass{article}
\usepackage{listings}
\usepackage{hyperref}
\lstnewenvironment{code}{\lstset{
    language=Haskell,
    basicstyle=\small\ttfamily,
    frame=single}}{}
\newcommand{\kbd}[1]{\texttt{#1}}

\begin{document}

\title{Making a Choice on Haskell:
\\
A Revised Thread Pool for the Command Line}
\author{Brian Jaress\\
\href{mailto:jaress@hawaii.edu}{jaress@hawaii.edu}\\
\url{http://brian-jaress.livejournal.com}}
\date{
2009-03-14
}
\maketitle

\begin{abstract}
The second version of a  small Haskell program that runs other programs
in the manner of a thread pool.  This version incorporates ideas
suggested by experienced Haskell users and some other new features.  It
also contains updated subjective thoughts on Haskell.
\end{abstract}


\section{Introduction}

A while ago I wrote a small program in Haskell to see what I thought of
the language, added my thoughts to the code, and posted it.  Experienced
haskellers soon
\href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/}
{posted} critiques and improved versions.

I decided to incorporate those ideas into this second version of the
program.  Only one change has a significant effect on the program's
behavior (see section \ref{sections}) but the code is much cleaner.

I also updated the comments on my subjective thoughts.  The conclusion
(section \ref{conclusion}) has an answer to the big question: Do I want
to keep using Haskell?

\subsection{What the Program Does}

This program is a command line
\href{http://en.wikipedia.org/wiki/Thread_pool_pattern} {thread pool}.
You give it shell commands and it runs them, with a fixed
number\footnote{There's an exception at the end, where there will likely
be fewer than that many commands left to run.} running at any one time,
while the rest are either finished or waiting to start.

You might, for example, have fifty files to convert using a program that
converts a single file each time you run it and want to process three at
a time.\footnote{Thread pools usually show up as part of a larger
program, where they are used to avoid the overhead of creating and
destroying threads and to peg the number of simultaneous threads at
something not too large or too small.  The first purpose doesn't really
apply here because the tasks themselves have the same type of overhead
associated with a thread, and more of it.  The second purpose, however,
does apply because doing all the tasks at once can take longer and use
more resources than doing a few at a time.}


\subsection{How to Use the Program}

The name of the executable is \kbd{threadpool}.  It takes a single,
optional argument which is the number of simultaneous threads (the
default is three).  Give it the commands to run, one per line, through
standard input.  You may use blank lines to divide the commands into
sections.  The commands in a section will not be started until all the
commands in previous sections are complete.

The source code is available
\href{http://hackage.haskell.org/cgi-bin/hackage-scripts/package/threadPool}
{on HackageDB}.  You should compile the program using
\href{http://www.haskell.org/ghc/}{GHC} and the \kbd{-threaded} option.
\footnote{I developed and tested the program with GHC version 6.8.2.}


\section{The Code}

The thread pool program is mostly an internal thread pool with some
input processing.


\subsection{Libraries}

This version incorporates two key library
\href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/}
{suggestions} from Haskell users.

The first is to use a
\href{http://en.wikipedia.org/wiki/Semaphore_(programming)} {quantity
semaphore} from the standard libraries.  The second was to use functions
for working with multiple monad values from \kbd{Control.Monad}.  (See
Section \ref{threadPool} for how they are used.)

\begin{code}
import Control.Concurrent (forkIO)
import Control.Concurrent.QSem
import Control.Exception (finally)
import Control.Monad (mapM_, replicateM_)
import Data.List (unfoldr)
import System.Environment (getArgs)
import System.Process (runCommand, waitForProcess)
\end{code}


\subsection{The Thread Pool}
\label{threadPool}

The internal thread pool itself doesn't see the processes it runs---it
carries out arbitrary tasks.  This version uses a quantity semaphore,
per a
\href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/c0755up}
{suggestion on reddit}.\footnote{Thomas DuBuisson has since posted an
internal thread pool library
\href{http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Control-Engine}
{on HackageDB}.  I considered using it, but it's not quite what I needed,
despite being very full-featured.}

\begin{code}
threadPool :: Int -> [IO ()] -> IO ()
threadPool threadCount tasks = do
    semaphore <- newQSem (max threadCount 1)
    mapM_ (delegate semaphore) tasks
    collect (max threadCount 1) semaphore
        where
            collect count = replicateM_ count . waitQSem
            delegate semaphore task = do
                waitQSem semaphore
                forkIO (task `finally` signalQSem semaphore)
\end{code}

A quantity semaphore, if you're like me and haven't used one before,
represents a fixed amount of a resource.  You can think of it like a
library with a certain number of copies of the same book.  If the
library owns five copies and three are checked out, you can borrow one
of the other two.  If all five are out, you have to wait.  That captures
an essential property of the thread pool, and when the haskellers posted
their own solutions there was a definite difference in size and
complexity between solutions that used a \kbd{QSem} and those that
didn't.

The \kbd{threadPool} function creates a quantity semaphore and uses it
to control the number of threads working on tasks at the same time.  The
main thread borrows a unit of the semaphore before giving a task to a
worker thread, and the worker returns the unit when the task is done.
That way, the main thread waits to give out a new task if the number of
pending (assigned but incomplete) tasks equals the size of the quantity
semaphore.

Once all tasks are assigned, the main thread waits on the entire
quantity semaphore to make sure all the workers are done before it moves
on.

Threads are not reused in this version, so \kbd{threadPool} is not
technically a thread pool.  Since it still looks like one from the
outside, the name remains.


\subsection{Handling Input}

\subsubsection{Command Line Arguments}

The number of simultaneous tasks is taken from the first command line
argument, with a default of three.

\begin{code}
getThreadCount :: IO (Int)
getThreadCount = do
    args <- getArgs
    if args == []
        then return 3
        else return $ extract $ reads $ head args
    where
        extract [(val, "")] = val
        extract _ = error
            "Non-integer threadcount on command line."
\end{code}

New in this version is a nicer error message.  The old version would say
``no parse'' if the argument wasn't an integer.

\subsubsection{Sectioned Input}
\label{sections}

Also new is the ability to divide the list of commands into sections.
Before any commands in a new section start, all commands from previous
sections will finish.  You can achieve the same thing by running the
\kbd{threadpool} program multiple times in sequence, but sections are
likely to be easier if you are generating the list of commands with
another program.

The sectioning code ignores leading and trailing separators and
collapses multiple contiguous separators into one, similar to blank
lines in \LaTeX.  I couldn't find a library function that did exactly
what I needed, but several were close enough to be stitched together
into my own one-off function.

\begin{code}
sections :: (a -> Bool) -> [a] -> [[a]]
sections = unfoldr . splitFirst
    where
        splitFirst isSep raw =
            case break isSep (dropWhile isSep raw) of
                ([], _) ->  Nothing
                parts   ->  Just parts
\end{code}

\subsubsection{Commands into Tasks}
\label{main}

There's one thing left, and that's to turn the commands into tasks and
give each section to a \kbd{threadPool}.  It's almost as straightforward
as it sounds.  The only wrinkle is that, since the separators are not
valid commands, the conversion of commands into tasks happens after
splitting into sections.  That means dipping into the sublists.

\begin{code}
main = do
    count <- getThreadCount
    input <- getContents
    mapM_ (threadPool count . map execute) (parsed input)
        where
            parsed = sections (== "") . lines
            execute cmd = do
                handle <- runCommand cmd
                waitForProcess handle
                return ()
\end{code}


\section{Further Subjective Thoughts on Haskell}


\subsection{Explainability and Flexibility}

In the previous version, I mentioned ``little corners and sharp edges''
that an expert can use to be very concise.  One commenter
\href{http://www.reddit.com/r/programming/comments/7qvlf/taking_a_second_look_at_haskell_pdf/c0752di}
{pointed out} that Haskell's mathematical approach helps you avoid
cutting yourself.  There's some truth to that.  I'm not worried about
waking up to discover that \kbd{mapM\_} has a new, subtly different
meaning that breaks my program, and a principled basis probably makes it
easier for experts to remember things and understand each other's code.

But there's still the issue of explaining code to a non-expert like me.
Haskell has a reputation for being hard to learn, and I think part of
that comes from being hard to explain.  The \kbd{sections} function (in
Section \ref{sections}) is the composition of two curried functions, one
of which applies its first argument to its second, repeatedly.
Explaining why that works would be a challenge.\footnote{And if you can
explain that, try \kbd{mapM\_ (threadPool count . map execute) (parsed
input)} from section \ref{main}.}

The flip side is that you get a lot of flexibility to make it clear what
the code is supposed to do.  It's pretty easy to see, without actually
tracing through the code of \kbd{sections}, that there's a helper
function to split off only the first section and that helper is combined
with \kbd{unfoldr} to make a function that splits all sections.

The easiest place to see this flexibility might be in the large number
of tiny library functions.  One
\href{http://www.reddit.com/r/haskell/comments/7qvq2/taking_a_second_look_at_haskell_a_thread_pool_for/c0754sx}
{comment} on the first version was that I hadn't taken advantage of the
library functions, particularly in \kbd{Control.Monad}.  I had assumed
such things weren't in the libraries and that \kbd{Control.Monad}
contained little else besides the monad type class.  In fact, it's full
of shorthand functions defined using one or two operations.

I ended up actually sitting down and thinking for a while about whether
to use \kbd{mapM\_} or \kbd{forM\_} in the \kbd{threadPool} function.
The two are exactly the same, except that the order of the two arguments
is reversed.  The suggestions I saw generally used \kbd{forM\_},
probably because it puts the function argument second and they supplied
it with a multi-line lambda expression.  I decided to give the function
a name by defining it in a \kbd{where} clause\footnote{I really like
\kbd{where} clauses.  I highly recommend them to designers of other
languages.}, so the multi-line argument wasn't an issue.  I went with
\kbd{mapM\_} because it matched the order of the definition in the
\kbd{where} clause: \kbd{delegate semaphore task} delegates a single
task, and \kbd{mapM\_ (delegate semaphore) tasks} delegates a list of
tasks.

That sort of little choice actually made up most of the work I put into
the code for this version.  Switching to a quantity semaphore and adding
new features took relatively little time, then I went back and revised.
Oddly enough, it reminded me of writing something in English.


\subsection{Input/Output and the Design of Haskell}

After the first version, a few people complained that I had written a
program almost entirely concerned with IO.  They're quite right that
that's what I've done, though slightly less so in this version.  To some
extent, that gives a different view of Haskell than if I had done
something else.

However, one of the reasons the thread pool is focused on IO is that
Haskell, unlike most other languages, considers thread synchronization
to be IO.  Writing this program showed me something of why IO in Haskell
is different.  By extension, it showed me something of the thinking
behind the design of Haskell.

The difficulty of explaining Haskell code trips me up again.  I
will only generalize and say that Haskell seems to expect, as a matter
of course, ways of thinking that old languages use to impress and new
languages shun.


\section{Conclusion}
\label{conclusion}

I keep a shortlist of languages I prefer to use, and I've decided to add
Haskell.  It's solid and different in a way that complements the other
languages on my list.  It will take time to become proficient---there's
a lot to learn, and I have other things going on---so I don't make the
decision lightly.

A good part of it is simply the style of the language.  There are
several languages that I see a lot of value in but haven't added to the
list, either because of some specific stylistic decision they made or
something I couldn't quite put my finger on.

There is, of course, a technical veto for languages that just don't cut
it, but so many are left that the final choice is personal.

\end{document}