Utilities for encoding arbitrary data as Bourne shell fragments that stream the data to standard output, using HERE documents and simple shell decoders.
- data Chunk
- chunk :: ByteString -> Chunk
- encode :: Word8 -> Word8 -> ByteString -> ByteString
- decode :: Word8 -> Word8 -> ByteString -> ByteString
- data EscapeChar = EscapeChar !Word8 !ByteString !ByteString !ByteString
- escapes :: [EscapeChar]
- safeForHereDoc :: ByteString -> Bool
- encoded :: Chunk -> Bool
- leastStringNotIn :: ByteString -> ByteString
A chunk describes a block of binary data ready for inclusion in a shell
script. For many data blocks, no encoding or decoding is necessary; these
are stored in a
SafeChunk. Those blocks needing byte-translation are
stored in an
ByteString into a string safe for inclusion in a shell HERE
document and annotates with information to construct a shell decoder for
that document, if necessary.
ByteString with nulls is rewritten in a complicated way. Two escape
characters are chosen from a class of ASCII printable characters that look
like reasonable escape characters; the two that show up least frequently
in the document (including 0 times) become the null replacer and the
escaper. All instances of these two characters are rewritten to escape
sequences formed with the escaper, while nulls are rewritten to the null
replacer. Given the two characters thus chosen, a command line with
sed in sequence can be constructed to decode the document.
This encoding doubles the amount of space consumed by the escape characters. In the worst case, where the data is made of all 20 potential escapes, evenly distributed, and one null (so we can't punt on escaping), the data will grow in size by 10 percent. For data that is more evenly distributed over the bytes -- as we might expect of compressed tarballs -- we expect a size growth of two 256ths, or less than 0.8 percent.
Given a byte to replace nulls and an escape byte, rewrites the data such
that nulls are mapped to the replace byte, replace bytes are mapped to a
pair of escape bytes and the escape byte is is mapped to an escape byte
followed by an underscore. For example, if the null replace byte is
and the escape byte is
# then all nulls become
## and all
This escaping scheme is dictated by the needs of our Sed decoder, which is
just two global substitions, one after another. If the escaping were such
that, with our characters above,
# escaped to
#_ in the input becomes
##_. We want to run the subsitution
# first, to catch this; it produces
#_; then Sed feeds the input
to the second substitution which unfortunately renders
!. In the
alternate scheme, the input is encoded
! decoder runs first
and ignores it, then the
# decoder runs and catches it. When using a
pipeline of stream processors to interpret escape sequences, it seems best
to ensure that only the very last processor inserts escape characters, to
prevent their further interpretation.
Given the byte used to replace nulls and the escape byte, undoes the result of the encode operation -- rewriting null replacers to literal nulls and escape patterns to the original bytes. This function is not intended to be used in practice -- it will be shell commands that unpack the data -- but serves to document the ideas behind decoding as well as offering a way to check the encoder.
The candidate escape characters, with the forms to be used in constructed
Many binary strings can be embedded as-is in a HEREDOC, without escaping.
Predicate to determine whether data is represented as an encoded chunk or is unencoded.
Finds a short hexadecimal string that is not in the input.
A string of length
n has at most
n - (k - 1) substrings of some fixed,
k -- the substring starting at the first byte and
k, the substring starting at the second byte and extending
k and so on, on until the end where we have to stop
k - 1 short of
the last byte. We choose
k such that it contains enough hexadecimal
digits to enumerate all the substrings; for a 4M input, we want
k = 6.
We can take all the hex substrings of length
k in the input, sort them,
and then find the gaps. We take the least substring in the first gap for
our chosen substring. This gives us an O(n log n) algorithm.
The measurable length of a
ByteString is at most the maximum
(since the length function results in an
Int); this is one less than 2
to the bit width of a
Word (because there is a 0
Word). Thus a
suffices to enumerate all the possible substrings in a
one more. (Substrings are zero-indexed and the length is 1-indexed.) We
can leverage this fact to translate all substrings to
Word and store
them in an unboxed vector, using integer operations to find the least
subtring in the first gap. Space usage is linear in the length of the
input string; for a 4M string, the sorted vector could consume 32M on 64