Safe Haskell	None
Language	Haskell2010

Algorithm.OptimalBlocks

Synopsis

Documentation

data Blocks Source

The result of the chop function, contains the list of optimal blocks that were found, and any remaining bytes that did not end optimally.

Constructors

Blocks
Fields blocksOptimal :: [OptimalBlock] blocksRemain :: ByteString

Instances

Eq Blocks
Show Blocks
NFData Blocks

data ChunkConfig Source

Parameters to the chop function. windowSize is how many bytes wide the hashing window is. blockSize is the target size of each generated block. Actual blocks will be larger or smaller, but on average, blocks will be about blockSize on reasonably high-entropy data.

Constructors

ChunkConfig
Fields windowSize :: Int blockSize :: Int chunkAlg :: Algorithm

Instances

Show ChunkConfig

newtype OptimalBlock Source

Alias for ByteString, used to indicate that this sequence of bytes ends in an optimal fashion.

Constructors

OptimalBlock
Fields fromOptimal :: ByteString

Instances

Eq OptimalBlock
Ord OptimalBlock
Show OptimalBlock

data Algorithm Source

Constructors

Reference
Old
New

Instances

Eq Algorithm
Ord Algorithm
Show Algorithm

chop Source

Arguments

:: ChunkConfig	chopping parameters
-> ByteString	ByteString to chop
-> Blocks

Chop up a ByteString into blocks of data that are likely to occur in other ByteStrings. This uses roughly the same algorithm that rsync does: calculate a hash of every windowSize-sized sequence of bytes within the given ByteString, and then break it up where the hashes match a certain pattern. Specifically, this function uses BuzzHash (a rolling hash) to make the hash calculations fast, and the pattern it looks for is that the hash's binary form ends with the right number of "ones", where "right" is determined by the given blockSize. The breaks are inserted after the matching windows are found.

split Source

Arguments

:: WindowSize	How many bytes to use when generating the pattern hash
-> BlockShift	log2 of the desired block size
-> ByteString	The ByteString to split
-> (ByteString, ByteString)

Split up a ByteString into a complete block and a remainder. If no break point is found in the given string, the first element of the resulting pair will be empty.

defaultConfig :: ChunkConfig Source

This is an alias of chop' that uses a window size of 128 bytes and a desired block size of 256KiB.

sizedBitmask :: Int -> Word64 Source

Determine the bitmask that will probably give us blocks of size desiredSz. The idea behind this is that if, for example, we want 1MB blocks, then we need a bitmask that will match one window in (1024*1024). This is equivalent to saying that we want the hash's bottom 20 bits to be set (a 1 in 2**20 occurrance). This function's ugly, and uses logarithms and lots of type conversions, but it's only called once per chop' call, so it doesn't have much impact on performance.