A module with a function to support caching the output of your parallel tasks.
|:: forall input output key m . (MonadIO m, Ord key, Show key, Unbox key, NFData output, Serialize key, Serialize output)|
|=> ParTaskOpts m output|
The parallel task options for running these tasks in parallel
|-> FilePath|
The directory in which to store the cache files ("cache-index" and "cache-payload")
and the log file ("parmap-log"). If you have multiple distinct parMapCache tasks
and you don't want them overlapping, pass a different directory for each.
(This is definitely a good idea, because if your two functions have an identical
|-> (input -> key)|
The function to map inputs to keys
|-> (input -> m output)|
The actual function to calculate an output from an input. Note that despite the NFData instance on output, we do not force the evaluation of output; that is left to you to do inside this function.
|-> [input]|
The list of inputs to process
|-> m (IOVector output)|
The vector of outputs.
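Because the output is not forced for you, one way to make sure the work really happens inside the task function is to evaluate the result to normal form before returning it. The following is a minimal sketch using `force` from `Control.DeepSeq` and `evaluate` from `Control.Exception`; the names `strictTask` and `sumTask` are hypothetical and not part of this module.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Control.Monad.IO.Class (MonadIO, liftIO)

-- | Hypothetical helper (not part of this module): turn a pure function
-- into a task that fully evaluates its result before returning it.
strictTask :: (MonadIO m, NFData output) => (input -> output) -> input -> m output
strictTask f x = liftIO (evaluate (force (f x)))

-- | Example task: the sum is computed inside the task, not lazily later.
sumTask :: MonadIO m => [Int] -> m Int
sumTask = strictTask sum
```

A function shaped like `sumTask` has the form `input -> m output` expected for the task argument, with the evaluation cost paid where the task runs rather than wherever the result is first inspected.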
A function that performs caching (between runs of the same tasks) to help when running the same analysis task many times.
Imagine that you have a program in which you want to do some map-reduce work. The mapping takes a long time, but you are working on the reduce part. You don't want to redo the mapping every time you run your program, so you can use this cache functionality to save the results of the mapping between program runs. Alternatively, you may want to analyse only part of your data at first (for speed), then slowly expand to the rest of the data set. Caching allows you to re-use the results you have already calculated.
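The idea can be illustrated with a small in-memory sketch. This is not how this module stores its cache (it writes files to disk); it only shows the principle that the task runs for inputs whose key has not been seen before, and that growing the input list re-uses earlier results. All names here (`Cache`, `mapWithCache`, `countingSquare`) are hypothetical.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- | Illustration only: the cache is a plain association list.
type Cache key output = [(key, output)]

-- | Hypothetical sketch: run the task only for inputs whose key is not
-- in the cache. Returns the outputs in input order and the updated cache.
mapWithCache
  :: Eq key
  => (input -> key)
  -> (input -> IO output)
  -> Cache key output
  -> [input]
  -> IO ([output], Cache key output)
mapWithCache toKey task = go
  where
    go cache [] = pure ([], cache)
    go cache (x : xs) =
      case lookup (toKey x) cache of
        -- Cache hit: re-use the stored output without running the task.
        Just out -> do
          (outs, cache') <- go cache xs
          pure (out : outs, cache')
        -- Cache miss: run the task and remember its output.
        Nothing -> do
          out <- task x
          (outs, cache') <- go ((toKey x, out) : cache) xs
          pure (out : outs, cache')

-- | Demonstration task that counts how often it is actually run.
countingSquare :: IORef Int -> Int -> IO Int
countingSquare counter n = do
  modifyIORef' counter (+ 1)
  pure (n * n)
```

Running `mapWithCache id (countingSquare counter)` first on `[1, 2, 3]` with an empty cache, and then on `[1, 2, 3, 4]` with the cache returned by the first call, runs the task four times in total rather than seven.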
There are three main concepts in the type signature: input, key and output.
The input type contains all the information needed to perform the task and produce the output. This may involve file handles, functions or anything else.
The key type is generally smaller, and is the smallest possible unique identifier for the corresponding output. This might
be the primary key of a database record, or an input filename. (In some cases input = key, which makes life easy.)
The output type is the output of the task.
In order to serialise the cache to a file, both key and output have to be instances of Serialize.
To allow efficient unboxing of a vector, we require an Unbox instance for key (contact me if you think this is too onerous),
and to ensure strict reading from the cache we require NFData for output.
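As an illustration of how the three types might relate, here is a small sketch. The types `Job` and `Summary` and the function `jobKey` are hypothetical examples, not part of this module. The input carries a function, so it could not be serialised itself, while the key is a plain Int. The Serialize instance that output also needs is omitted to keep the sketch free of extra dependencies.

```haskell
import Control.DeepSeq (NFData (..))

-- | Hypothetical input: everything needed to run one task. It contains
-- a function, so only its key (not the input itself) identifies it in a cache.
data Job = Job
  { jobId        :: Int               -- ^ unique identifier, e.g. a database primary key
  , jobPath      :: FilePath          -- ^ where the raw data lives
  , jobTransform :: Double -> Double  -- ^ per-value processing step
  }

-- | Hypothetical output of a task.
data Summary = Summary
  { summaryCount :: Int
  , summaryTotal :: Double
  } deriving (Eq, Show)

-- | NFData lets the output be evaluated fully (needed for strict reading).
instance NFData Summary where
  rnf (Summary c t) = rnf c `seq` rnf t

-- | The key function: maps an input to its small unique identifier.
jobKey :: Job -> Int
jobKey = jobId
```

Choosing the smallest identifier that still distinguishes outputs keeps the key cheap to compare and store; here that is `jobId` rather than the whole `Job`.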
parMapCache doesn't know when your cache is invalid (e.g. because you've altered the processing algorithm
that you are passing to this function), and will blindly use it if it finds it. It's your responsibility to remove
the cache when it becomes invalid.
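One simple way to do this is to delete the cache files before a run whenever the processing code has changed. The sketch below assumes the cache consists of exactly the two files named in the description of the directory argument ("cache-index" and "cache-payload"); `cacheFiles` and `clearCache` are hypothetical helpers, not part of this module.

```haskell
import Control.Monad (forM_, when)
import System.Directory (doesFileExist, removeFile)
import System.FilePath ((</>))

-- | Cache file names as given in the documentation of the directory
-- argument. Assumption: no other files make up the cache.
cacheFiles :: [FilePath]
cacheFiles = ["cache-index", "cache-payload"]

-- | Hypothetical helper: delete the cache files in the given directory,
-- skipping any that do not exist.
clearCache :: FilePath -> IO ()
clearCache dir =
  forM_ cacheFiles $ \name -> do
    let path = dir </> name
    exists <- doesFileExist path
    when exists (removeFile path)
```

Calling `clearCache` on the cache directory before the next run means every output is recalculated with the new algorithm.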