Storage.Hashed.Index

Description

This module contains plain tree indexing code. The index itself is a CACHE: you should only ever use it as an optimisation and never as primary storage. In practice, this means that when the index format changes, the application is expected to throw the old index away and build a fresh one. Please note that tracking index validity is out of scope for this library: it is the responsibility of your application. It is advisable that your validity tracking code also checks for format validity (see indexFormatValid) and scraps and re-creates the index when needed.

The index is a binary file that overlays a hashed tree over the working copy. This means that every working file and directory has an entry in the index that contains its path, its hash and its validity data. The validity data is a timestamp plus the file size. The file hashes are sha256 hashes of the files' content.

There are two entry types, a file entry and a directory entry. Both have a common binary format (see Item). The on-disk format is best described by the section Index format below.

For each file, the index has a copy of the file's last modification timestamp, taken at the instant when the hash was computed. This means that when the size and timestamp of a file in the working copy match those in the index, we assume that the hash stored in the index for the given file is valid. These hashes are then exposed in the resulting Tree object, and can be leveraged by e.g. diffTrees to compare many files quickly.
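The validity rule above can be sketched as a small pure check. This is only an illustration under stated assumptions: the Validity record, its field names and the currentHash helper are hypothetical and are not part of this module's API.

```haskell
import Data.Int (Int64)

-- Hypothetical record for the validity data described above:
-- the file size plus the last modification timestamp.
data Validity = Validity
  { vSize  :: Int64  -- file size in bytes
  , vMTime :: Int64  -- last modification time (unit is an assumption)
  } deriving (Eq, Show)

-- The cached hash is trusted only when both size and timestamp match.
cachedHashValid :: Validity -> Validity -> Bool
cachedHashValid indexed working = indexed == working

-- Return the cached hash when it is still valid; otherwise run the
-- (expensive) recomputation supplied by the caller.
currentHash :: Validity -> hash -> Validity -> IO hash -> IO hash
currentHash indexed cached working recompute
  | cachedHashValid indexed working = return cached
  | otherwise                       = recompute
```

The point of the design is that the common case (nothing changed) costs only a metadata comparison and never opens the working file.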

You may have noticed that we also keep hashes of directories. These are assumed to be valid whenever the complete subtree is valid. At any point, as soon as a size or timestamp mismatch is found, the working file in question is opened, its hash (and timestamp and size) is recomputed and updated in-place in the index file (everything lives at a fixed offset and has a fixed size, so this isn't an issue). This is also true of directories: when a file in a directory changes hash, this triggers recomputation of all of its parent directory hashes; moreover this is done efficiently -- each directory is updated at most once during an update run.
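One way to picture the "at most once per directory" property is a post-order walk over an in-memory model. The sketch below is a toy under stated assumptions: the Node type, the combine function (a readable stand-in, not sha256) and refresh are all hypothetical, and the real code works on the index file rather than on a data structure like this.

```haskell
import Data.List (intercalate)

-- Toy model of indexed items (hypothetical, not the library's types).
data Node
  = File String (Maybe String)  -- cached hash; Just h when the file changed and rehashes to h
  | Dir String [Node]           -- cached hash; children

hashOf :: Node -> String
hashOf (File h _) = h
hashOf (Dir h _)  = h

-- Stand-in for hashing a directory listing; NOT a real hash function.
combine :: [String] -> String
combine hs = "dir(" ++ intercalate "," hs ++ ")"

-- Post-order refresh: children first, then the directory itself.
-- Each directory is visited once, so its hash is recomputed at most once,
-- and only when something underneath it was rehashed.
refresh :: Node -> (Node, Bool)
refresh (File h Nothing)   = (File h Nothing, False)
refresh (File _ (Just h')) = (File h' Nothing, True)
refresh (Dir h kids)
  | or changed = (Dir (combine (map hashOf kids')) kids', True)
  | otherwise  = (Dir h kids', False)
  where
    (kids', changed) = unzip (map refresh kids)
```

A directory whose subtree is entirely valid keeps its cached hash untouched, which mirrors the assumption stated above.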

Index format

The Index is organised into "lines" where each line describes a single indexed item. Cf. Item.

The first word on the index "line" is the length of the file path (which is the only variable-length part of the line). Then comes the path itself, then the fixed-length hash (sha256) of the file in question, then two words: one for the size and one called aux, which is used differently for directories and for files.

With directories, this aux holds the offset of the next sibling line in the index, so we can efficiently skip reading the whole subtree starting at a given directory (by just seeking aux bytes forward). The lines are pre-ordered with respect to directory structure -- the directory comes first and after it come all its items. Cf. readIndex'.

For files, the aux field holds a timestamp.
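The pre-ordered layout and the directory aux field can be illustrated with a small pure model. Everything here is a sketch under assumptions: the 8-byte word size, the absence of padding, the use of absolute offsets, and the Entry, lineSize and layout names are all hypothetical; only the 32-byte sha256 digest length is a known fact. Consult the actual index code for the real byte layout.

```haskell
-- Toy model of a working tree (hypothetical).
data Entry = F FilePath | D FilePath [Entry]

wordSize, hashSize :: Int
wordSize = 8   -- assumption: 64-bit words
hashSize = 32  -- a sha256 digest is 32 bytes

-- Length word + path + hash + size word + aux word (no padding assumed,
-- and the path is assumed to take one byte per character).
lineSize :: FilePath -> Int
lineSize path = wordSize + length path + hashSize + 2 * wordSize

-- Lay entries out in pre-order starting at a given offset. Each line is
-- (offset, path, aux) where aux is Just the next sibling's offset for a
-- directory and Nothing for a file (whose aux would hold a timestamp).
layout :: Int -> Entry -> ([(Int, FilePath, Maybe Int)], Int)
layout off (F p)      = ([(off, p, Nothing)], off + lineSize p)
layout off (D p kids) = ((off, p, Just end) : concat kidLines, end)
  where
    (kidLines, end) = go (off + lineSize p) kids
    go o []     = ([], o)
    go o (k:ks) = let (ls, o')  = layout o k
                      (rest, e) = go o' ks
                  in (ls : rest, e)
```

Because a directory line records where its subtree ends, a reader that does not need that subtree can jump straight to the next sibling instead of parsing every line in between.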

Documentation

readIndex :: FilePath -> (Tree IO -> Hash) -> IO Index

Read an index and build up a Tree object from it, referring to the current working directory. The initial Index object returned by readIndex is not directly useful. However, you can use Tree.filter on it. Either way, to obtain the actual Tree object, call updateIndex.

The usual use pattern is this (indexPath, hashTree and predicate are placeholders for your own values):

 do idx <- readIndex indexPath hashTree
    tree <- updateIndex (filter predicate idx)

The resulting tree will be fully expanded.

updateIndexFrom :: FilePath -> (Tree IO -> Hash) -> Tree IO -> IO Index

Adds and removes files in the index to make it match the given Tree object (it is an error for the Tree to contain a file or directory that does not exist in plain form in the current working directory).

indexFormatValid :: FilePath -> IO Bool

Check that a given file is an index file with a format we can handle. You should remove and re-create the index whenever this is not true.
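The "remove and re-create" advice might be wired into an application roughly as follows. This is a hedged sketch: scrapIfStale is a hypothetical helper, and the format check is taken as a parameter so the snippet compiles without this library; in real code you would pass indexFormatValid and then rebuild the index (for instance with updateIndexFrom) whenever the result is False.

```haskell
import Control.Monad (unless)
import System.Directory (doesFileExist, removeFile)

-- Returns True when an index file exists and its format is usable.
-- A file with an unsupported format is deleted, so the caller can
-- re-create it from scratch. The first argument is expected to be
-- indexFormatValid (passed in to keep this sketch self-contained).
scrapIfStale :: (FilePath -> IO Bool) -> FilePath -> IO Bool
scrapIfStale formatValid path = do
  exists <- doesFileExist path
  if not exists
    then return False
    else do
      ok <- formatValid path
      unless ok (removeFile path)
      return ok
```

Since the index is only a cache, deleting it is always safe; the cost is a full rehash of the working copy on the next run.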

updateIndex :: Index -> IO (Tree IO)

type Index = IndexM IO

filter :: FilterTree a m => (AnchoredPath -> TreeItem m -> Bool) -> a m -> a m

Given pred tree, produce a Tree that only has items for which pred returns True. The tree might contain stubs. When expanded, these will be subject to filtering as well.
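The interaction between filtering and stubs can be modelled with a toy tree. The types below (Path, Item, Tree) and filterTree are hypothetical simplifications and not the library's AnchoredPath, TreeItem or Tree; the sketch only illustrates how a filter can be pushed inside an unexpanded stub so that it still applies after expansion.

```haskell
type Path = [String]

-- Toy items: a file, an expanded subtree, or a stub that yields a
-- subtree only when its IO action is run.
data Item = File String | Sub Tree | Stub (IO Tree)

newtype Tree = Tree [(String, Item)]

-- Keep only items accepted by the predicate. Expanded subtrees are
-- filtered immediately; stubs are wrapped so that the same filter is
-- applied to their contents once they are expanded.
filterTree :: (Path -> Item -> Bool) -> Tree -> Tree
filterTree p = go []
  where
    go here (Tree items) = Tree
      [ (name, descend path item)
      | (name, item) <- items
      , let path = here ++ [name]
      , p path item
      ]
    descend path (Sub t)  = Sub (go path t)
    descend path (Stub k) = Stub (fmap (go path) k)
    descend _    f        = f
```

Filtering a stub costs nothing up front: no IO happens until the stub is expanded, at which point the predicate sees each item with its full path.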