Safe Haskell | None |
---|---|
Language | Haskell2010 |
- data Blocks = Blocks {}
- data ChunkConfig = ChunkConfig {}
- newtype OptimalBlock = OptimalBlock {}
- data Algorithm
- chop :: ChunkConfig -> ByteString -> Blocks
- split :: WindowSize -> BlockShift -> ByteString -> (ByteString, ByteString)
- defaultConfig :: ChunkConfig
- sizedBitmask :: Int -> Word64
Documentation
The result of the chop
function, contains the list of optimal blocks
that were found, and any remaining bytes that did not end optimally.
data ChunkConfig Source
Parameters to the chop function. windowSize
is how many bytes wide the
hashing window is. blockSize
is the target size of each generated block.
Actual blocks will be larger or smaller, but on average, blocks will be
about blockSize
on reasonably high-entropy data.
newtype OptimalBlock Source
Alias for ByteString
, used to indicate that this sequence of bytes ends
in an optimal fashion.
:: ChunkConfig | chopping parameters |
-> ByteString | ByteString to chop |
-> Blocks |
Chop up a ByteString
into blocks of data that are likely to occur in
other ByteString
s. This uses roughly the same algorithm that rsync does:
calculate a hash of every windowSize
-sized sequence of bytes within the
given ByteString
, and then break it up where the hashes match a certain
pattern. Specifically, this function uses BuzzHash (a rolling hash) to make
the hash calculations fast, and the pattern it looks for is that the hash's
binary form ends with the right number of "ones", where "right" is determined
by the given blockSize
. The breaks are inserted after the matching windows
are found.
:: WindowSize | How many bytes to use when generating the pattern hash |
-> BlockShift | log2 of the desired block size |
-> ByteString | The ByteString to split |
-> (ByteString, ByteString) |
Split up a ByteString into a complete block and a remainder. If no break point is found in the given string, the first element of the resulting pair will be empty.
defaultConfig :: ChunkConfig Source
This is an alias of chop'
that uses a window size of 128 bytes and a
desired block size of 256KiB.
sizedBitmask :: Int -> Word64 Source
Determine the bitmask that will probably give us blocks of size
desiredSz
. The idea behind this is that if, for example, we want 1MB
blocks, then we need a bitmask that will match one window in (1024*1024).
This is equivalent to saying that we want the hash's bottom 20 bits to be
set (a 1 in 2**20 occurrance). This function's ugly, and uses logarithms and
lots of type conversions, but it's only called once per chop'
call, so it
doesn't have much impact on performance.