System.Posix.ARX.HEREDat

Description

Utilities for encoding arbitrary data as Bourne shell fragments that stream the data to standard output, using HERE documents and simple shell decoders.

Synopsis

Documentation

data Chunk Source

A chunk describes a block of binary data ready for inclusion in a shell script. For many data blocks, no encoding or decoding is necessary; these are stored in a SafeChunk. Those blocks needing byte-translation are stored in an EncodedChunk.

Constructors

SafeChunk !ByteString
EncodedChunk !ByteString !Int !EscapeChar !EscapeChar

Instances

Show Chunk
IsString Chunk

chunk :: ByteString -> Chunk Source

Converts a ByteString into a string safe for inclusion in a shell HERE document and annotates with information to construct a shell decoder for that document, if necessary.

A ByteString with nulls is rewritten in a complicated way. Two escape characters are chosen from a class of ASCII printable characters that look like reasonable escape characters; the two that show up least frequently in the document (including 0 times) become the null replacer and the escaper. All instances of these two characters are rewritten to escape sequences formed with the escaper, while nulls are rewritten to the null replacer. Given the two characters thus chosen, a command line with tr and sed in sequence can be constructed to decode the document.

This encoding doubles the amount of space consumed by the escape characters. In the worst case, where the data is made of all 20 potential escapes, evenly distributed, and one null (so we can't punt on escaping), the data will grow in size by 10 percent. For data that is more evenly distributed over the bytes -- as we might expect of compressed tarballs -- we expect a size growth of two 256ths, or less than 0.8 percent.

encode :: Word8 -> Word8 -> ByteString -> ByteString Source

Given a byte to replace nulls and an escape byte, rewrites the data such that nulls are mapped to the replace byte, replace bytes are mapped to a pair of escape bytes and the escape byte is is mapped to an escape byte followed by an underscore. For example, if the null replace byte is ! and the escape byte is # then all nulls become !, any ! become ## and all # become #_.

This escaping scheme is dictated by the needs of our Sed decoder, which is just two global substitions, one after another. If the escaping were such that, with our characters above, # escaped to ## and ! to #_, then #_ in the input becomes ##_. We want to run the subsitution for # first, to catch this; it produces #_; then Sed feeds the input to the second substitution which unfortunately renders !. In the alternate scheme, the input is encoded #__, the ! decoder runs first and ignores it, then the # decoder runs and catches it. When using a pipeline of stream processors to interpret escape sequences, it seems best to ensure that only the very last processor inserts escape characters, to prevent their further interpretation.

decode :: Word8 -> Word8 -> ByteString -> ByteString Source

Given the byte used to replace nulls and the escape byte, undoes the result of the encode operation -- rewriting null replacers to literal nulls and escape patterns to the original bytes. This function is not intended to be used in practice -- it will be shell commands that unpack the data -- but serves to document the ideas behind decoding as well as offering a way to check the encoder.

data EscapeChar Source

Constructors

EscapeChar !Word8 !ByteString !ByteString !ByteString

Instances

Show EscapeChar

escapes :: [EscapeChar]Source

The candidate escape characters, with the forms to be used in constructed tr and sed commands.

safeForHereDoc :: ByteString -> Bool Source

Many binary strings can be embedded as-is in a HEREDOC, without escaping.

encoded :: Chunk -> Bool Source

Predicate to determine whether data is represented as an encoded chunk or is unencoded.

leastStringNotIn :: ByteString -> ByteString Source

Finds a short hexadecimal string that is not in the input.

A string of length n has at most n - (k - 1) substrings of some fixed, positive length k -- the substring starting at the first byte and extending for k, the substring starting at the second byte and extending for k and so on, on until the end where we have to stop k - 1 short of the last byte. We choose k such that it contains enough hexadecimal digits to enumerate all the substrings; for a 4M input, we want k = 6.

We can take all the hex substrings of length k in the input, sort them, and then find the gaps. We take the least substring in the first gap for our chosen substring. This gives us an O(n log n) algorithm.

The measurable length of a ByteString is at most the maximum Word (since the length function results in an Int); this is one less than 2 to the bit width of a Word (because there is a 0 Word). Thus a Word suffices to enumerate all the possible substrings in a ByteString; and one more. (Substrings are zero-indexed and the length is 1-indexed.) We can leverage this fact to translate all substrings to Word and store them in an unboxed vector, using integer operations to find the least subtring in the first gap. Space usage is linear in the length of the input string; for a 4M string, the sorted vector could consume 32M on 64 bit machines.