parallel-tasks-4.0.1.0

Safe Haskell: None

Control.Concurrent.ParallelTasks.Cache

Description

A module with a function to support caching the output of your parallel tasks.

Documentation

parMapCache

Arguments

:: forall input output key m . (MonadIO m, Ord key, Show key, Unbox key, NFData output, Serialize key, Serialize output) 
=> ParTaskOpts m output

The parallel task options for running these tasks in parallel

-> FilePath

The directory in which to store the cache files ("cache-index" and "cache-payload") and the log file ("parmap-log"). If you have multiple distinct parMapCache tasks and you don't want them overlapping, pass a different directory for each. (This is definitely a good idea, because if your two functions have an identical serialised key value, you'll be in all sorts of trouble!)

-> (input -> key)

The function to map inputs to keys

-> (input -> m output)

The actual function to calculate an output from an input. Note that despite the NFData instance on output, we do not force the evaluation of output; that is left to you to do inside this function.

-> [input]

The list of inputs to process

-> m (IOVector output)

The vector of outputs.

A function that performs caching (between runs of the same tasks) to help when running the same analysis task many times.

Imagine that you have a program where you want to do some map-reduce work. The mapping takes a long time, but you are working on the reduce part. You don't want to have to redo the mapping every time you run your program; you can use this cache functionality to save the results of the mapping between program runs. Alternatively, you may want to analyse only part of your data at first (for speed), then slowly expand to the rest of the data set. Caching allows you to re-use the results you have already calculated.

There are three main concepts in the type signature. input is a type containing all the information needed to perform the task and produce the output; this may involve file handles, functions, or anything else. The key type is generally smaller, and is the smallest possible unique identifier for the corresponding output. This might be the primary key of a database record, or an input filename. (Obviously, in some cases, input = key; that makes life easy.) The output type is the output of the task.
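To make the three roles concrete, here is a minimal sketch using only base. The Job type, its fields, and the jobKey and runJob functions are hypothetical names invented for illustration; they are not part of the library. Int is chosen as the key because it is small and unique per job.

```haskell
-- Hypothetical input type: it carries everything the task needs,
-- including a function, so it is richer than the key.
data Job = Job
  { jobId    :: Int            -- small unique identifier, e.g. a primary key
  , jobText  :: String         -- the data to process
  , jobScore :: String -> Int  -- how to process it
  }

-- input -> key: the smallest value that uniquely identifies the output.
jobKey :: Job -> Int
jobKey = jobId

-- input -> m output, here with m = IO and output = Int.
-- ($!) evaluates the Int before it is returned.
runJob :: Job -> IO Int
runJob job = pure $! jobScore job (jobText job)

-- With the library in scope, these would line up with the arguments as
--   parMapCache opts cacheDir jobKey runJob jobs

main :: IO ()
main = do
  let job = Job { jobId = 1, jobText = "a b c", jobScore = length . words }
  out <- runJob job
  print (jobKey job, out)  -- prints (1,3)
```

Keeping the key separate from the input means the input can hold values that cannot be serialised, such as the function field above, while only the key and the output need to be written to the cache.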

In order to serialise the cache to a file, both key and output have to be instances of Serialize. To allow efficient unboxing of a vector, we require an Unbox instance for key (contact me if you think this is too onerous), and to ensure strict reading from the cache we require NFData for output.
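Since the documentation for the task argument says the output is not forced for you, one possible way to force it inside your own function is sketched below, using force from the deepseq package and evaluate from base. The names forcedTask and wordLengths are hypothetical helpers invented for this example, not library functions.

```haskell
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Hypothetical stand-in for an expensive pure computation.
wordLengths :: String -> [Int]
wordLengths = map length . words

-- Lift a pure function into an IO task whose result is fully evaluated
-- (to normal form) before the action returns.
forcedTask :: NFData output => (input -> output) -> input -> IO output
forcedTask f = evaluate . force . f

main :: IO ()
main = forcedTask wordLengths "cache the mapping step" >>= print
-- prints [5,3,7,4]
```

A list result is used here because evaluating it only to weak head normal form would leave most of the work as unevaluated thunks; force evaluates the whole structure, so the work happens inside the task that produced it.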

Remember that parMapCache doesn't know when your cache is invalid (e.g. because you've altered the processing algorithm that you are passing to this function), and will blindly use it if it finds it. It's your responsibility to remove the cache when it becomes invalid.
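One way to manage this by hand is sketched below, assuming the directory and filepath packages. The helpers taskCacheDir and clearCache, the "cache" root, and the version-number scheme are all assumptions made for illustration; the library itself does not provide them. Bumping the version whenever the processing algorithm changes gives a fresh directory, and clearCache removes a stale one.

```haskell
import Control.Monad (when)
import System.Directory (doesDirectoryExist, removeDirectoryRecursive)
import System.FilePath ((</>))

-- One directory per task and per algorithm version (hypothetical scheme),
-- so distinct tasks never share cache files and a version bump starts
-- from an empty cache.
taskCacheDir :: FilePath -> String -> Int -> FilePath
taskCacheDir root taskName version = root </> (taskName ++ "-v" ++ show version)

-- Delete a cache directory and everything in it, if it exists.
-- Note that this also removes the log file stored alongside the cache.
clearCache :: FilePath -> IO ()
clearCache dir = do
  exists <- doesDirectoryExist dir
  when exists (removeDirectoryRecursive dir)

main :: IO ()
main = do
  let dir = taskCacheDir "cache" "word-count" 2
  putStrLn dir
  clearCache dir
```

The resulting path would then be passed as the FilePath argument. Removing the whole directory is the blunt option; it is only appropriate when the directory holds nothing besides the cache and log files.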