| Copyright | (c) 2010-2011 Simon Meier (c) 2010 Jasper van der Jeugt |
|---|---|
| License | BSD3-style (see LICENSE) |
| Maintainer | Simon Meier <iridcode@gmail.com> |
| Portability | GHC |
| Safe Haskell | Trustworthy |
| Language | Haskell98 |
Data.ByteString.Builder.Prim
Description
This module provides Builder primitives, which are lower level building
blocks for constructing Builders. You don't need to go down to this level but
it can be slightly faster.
Morally, builder primitives are like functions a -> Builder, that is they
take a value and encode it as a sequence of bytes, represented as a Builder.
Of course their implementation is a bit more specialised.
Builder primitives come in two forms: fixed-size and bounded-size.
- Fixed(-size) primitives are builder primitives that always result in a
sequence of bytes of a fixed length. That is, the length is independent of
the value that is encoded. An example of a fixed size primitive is the
big-endian encoding of a
Word64, which always results in exactly 8 bytes. - Bounded(-size) primitives are builder primitives that always result in a
sequence of bytes that is no larger than a predetermined bound. That is, the
bound is independent of the value that is encoded but the actual length will
depend on the value. An example for a bounded primitive is the UTF-8 encoding
of a
Char, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.
Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.
Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.
The purpose of using builder primitives is to improve the performance of
Builders. These improvements stem from making the two most common steps
performed by a Builder more efficient. We explain these two steps in turn.
The first most common step is the concatenation of two Builders. Internally,
concatenation corresponds to function composition. (Note that Builders can
be seen as difference-lists of buffer-filling functions; cf.
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. )
Function composition is a fast O(1) operation. However, we can use bounded
primitives to remove some of these function compositions altogether, which is
more efficient.
The second most common step performed by a Builder is to fill a buffer using
a bounded primitives, which works as follows. The Builder checks whether
there is enough space left to execute the bounded primitive. If there is, then
the Builder executes the bounded primitive and calls the next Builder with
the updated buffer. Otherwise, the Builder signals its driver that it
requires a new buffer. This buffer must be at least as large as the bound of
the primitive. We can use bounded primitives to reduce the number of
buffer-free checks by fusing the buffer-free checks of consecutive Builders.
We can also use bounded primitives to simplify the control flow for signalling
that a buffer is full by ensuring that we check first that there is enough
space left and only then decide on how to encode a given value.
Let us illustrate these improvements on the CSV-table rendering example from
Data.ByteString.Builder. Its "hot code" is the rendering of a table's
cells, which we implement as follows using only the functions from the
Builder API.
import Data.ByteString.Builder as B renderCell :: Cell -> Builder renderCell (StringC cs) = renderString cs renderCell (IntC i) = B.intDec i renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"' where escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\' escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"' escape c = B.charUtf8 c
Efficient encoding of Ints as decimal numbers is performed by intDec.
Optimization potential exists for the escaping of Strings. The above
implementation has two optimization opportunities. First, the buffer-free
checks of the Builders for escaping double quotes and backslashes can be
fused. Second, the concatenations performed by foldMap can be eliminated.
The following implementation exploits these optimizations.
import qualified Data.ByteString.Builder.Prim as P
import Data.ByteString.Builder.Prim
( condB, liftFixedToBounded, (>*<), (>$<) )
renderString :: String -> Builder
renderString cs =
B.charUtf8 '"' <> primMapListBounded escape cs <> B.charUtf8 '"'
where
escape :: BoundedPrim Char
escape =
condB (== '\\') (fixed2 ('\\', '\\')) $
condB (== '\"') (fixed2 ('\\', '\"')) $
charUtf8
{-# INLINE fixed2 #-}
fixed2 x = liftFixedToBounded $ const x >$< char7 >*< char7
The code should be mostly self-explanatory. The slightly awkward syntax is
because the combinators are written such that the size-bound of the resulting
BoundedPrim can be computed at compile time. We also explicitly inline the
fixed2 primitive, which encodes a fixed tuple of characters, to ensure that
the bound computation happens at compile time. When encoding the following list
of Strings, the optimized implementation of renderString is two times
faster.
maxiStrings :: [String] maxiStrings = take 1000 $ cycle ["hello", "\"1\"", "λ-wörld"]
Most of the performance gain stems from using primMapListBounded, which
encodes a list of values from left-to-right with a BoundedPrim. It exploits
the Builder internals to avoid unnecessary function compositions (i.e.,
concatenations). In the future, we might expect the compiler to perform the
optimizations implemented in primMapListBounded. However, it seems that the
code is currently to complicated for the compiler to see through. Therefore, we
provide the BoundedPrim escape hatch, which allows data structures to provide
very efficient encoding traversals, like primMapListBounded for lists.
Note that BoundedPrims are a bit verbose, but quite versatile. Here is an
example of a BoundedPrim for combined HTML escaping and UTF-8 encoding. It
exploits that the escaped character with the maximal Unicode codepoint is '>'.
{-# INLINE charUtf8HtmlEscaped #-}
charUtf8HtmlEscaped :: BoundedPrim Char
charUtf8HtmlEscaped =
condB (> '>' ) charUtf8 $
condB (== '<' ) (fixed4 ('&',('l',('t',';')))) $ -- <
condB (== '>' ) (fixed4 ('&',('g',('t',';')))) $ -- >
condB (== '&' ) (fixed5 ('&',('a',('m',('p',';'))))) $ -- &
condB (== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $ -- "
condB (== '\'') (fixed5 ('&',('#',('3',('9',';'))))) $ -- '
(liftFixedToBounded char7) -- fallback for Chars smaller than '>'
where
{-# INLINE fixed4 #-}
fixed4 x = liftFixedToBounded $ const x >$<
char7 >*< char7 >*< char7 >*< char7
{-# INLINE fixed5 #-}
fixed5 x = liftFixedToBounded $ const x >$<
char7 >*< char7 >*< char7 >*< char7 >*< char7
This module currently does not expose functions that require the special
properties of fixed-size primitives. They are useful for prefixing Builders
with their size or for implementing chunked encodings. We will expose the
corresponding functions in future releases of this library.
Synopsis
- data BoundedPrim a
- emptyB :: BoundedPrim a
- (>*<) :: Monoidal f => f a -> f b -> f (a, b)
- (>$<) :: Contravariant f => (b -> a) -> f a -> f b
- eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b)
- condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a
- primBounded :: BoundedPrim a -> a -> Builder
- primMapListBounded :: BoundedPrim a -> [a] -> Builder
- primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder
- data FixedPrim a
- emptyF :: FixedPrim a
- liftFixedToBounded :: FixedPrim a -> BoundedPrim a
- primFixed :: FixedPrim a -> a -> Builder
- primMapListFixed :: FixedPrim a -> [a] -> Builder
- primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> Builder
- primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder
- int8 :: FixedPrim Int8
- word8 :: FixedPrim Word8
- int16BE :: FixedPrim Int16
- int32BE :: FixedPrim Int32
- int64BE :: FixedPrim Int64
- word16BE :: FixedPrim Word16
- word32BE :: FixedPrim Word32
- word64BE :: FixedPrim Word64
- floatBE :: FixedPrim Float
- doubleBE :: FixedPrim Double
- int16LE :: FixedPrim Int16
- int32LE :: FixedPrim Int32
- int64LE :: FixedPrim Int64
- word16LE :: FixedPrim Word16
- word32LE :: FixedPrim Word32
- word64LE :: FixedPrim Word64
- floatLE :: FixedPrim Float
- doubleLE :: FixedPrim Double
- intHost :: FixedPrim Int
- int16Host :: FixedPrim Int16
- int32Host :: FixedPrim Int32
- int64Host :: FixedPrim Int64
- wordHost :: FixedPrim Word
- word16Host :: FixedPrim Word16
- word32Host :: FixedPrim Word32
- word64Host :: FixedPrim Word64
- floatHost :: FixedPrim Float
- doubleHost :: FixedPrim Double
- char7 :: FixedPrim Char
- int8Dec :: BoundedPrim Int8
- int16Dec :: BoundedPrim Int16
- int32Dec :: BoundedPrim Int32
- int64Dec :: BoundedPrim Int64
- intDec :: BoundedPrim Int
- word8Dec :: BoundedPrim Word8
- word16Dec :: BoundedPrim Word16
- word32Dec :: BoundedPrim Word32
- word64Dec :: BoundedPrim Word64
- wordDec :: BoundedPrim Word
- word8Hex :: BoundedPrim Word8
- word16Hex :: BoundedPrim Word16
- word32Hex :: BoundedPrim Word32
- word64Hex :: BoundedPrim Word64
- wordHex :: BoundedPrim Word
- int8HexFixed :: FixedPrim Int8
- int16HexFixed :: FixedPrim Int16
- int32HexFixed :: FixedPrim Int32
- int64HexFixed :: FixedPrim Int64
- word8HexFixed :: FixedPrim Word8
- word16HexFixed :: FixedPrim Word16
- word32HexFixed :: FixedPrim Word32
- word64HexFixed :: FixedPrim Word64
- floatHexFixed :: FixedPrim Float
- doubleHexFixed :: FixedPrim Double
- char8 :: FixedPrim Char
- charUtf8 :: BoundedPrim Char
Bounded-size primitives
data BoundedPrim a Source #
A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.
Combinators
The combinators for BoundedPrims are implemented such that the
size of the resulting BoundedPrim can be computed at compile time.
emptyB :: BoundedPrim a Source #
The BoundedPrim that always results in the zero-length sequence.
(>*<) :: Monoidal f => f a -> f b -> f (a, b) infixr 5 Source #
A pairing/concatenation operator for builder primitives, both bounded and fixed size.
For example,
toLazyByteString (primFixed (char7 >*< char7) ('x','y')) = "xy"We can combine multiple primitives using >*< multiple times.
toLazyByteString (primFixed (char7 >*< char7 >*< char7) ('x',('y','z'))) = "xyz"(>$<) :: Contravariant f => (b -> a) -> f a -> f b infixl 4 Source #
A fmap-like operator for builder primitives, both bounded and fixed size.
Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)
We can use it for example to prepend and/or append fixed values to an primitive.
import Data.ByteString.Builder.Prim as P
showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'"
where
fixed3 = P.char7 >*< P.char7 >*< P.char7Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time.
eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b) Source #
Encode an Either value using the first BoundedPrim for Left
values and the second BoundedPrim for Right values.
Note that the functions eitherB, pairB, and contramapB (written below
using >$<) suffice to construct BoundedPrims for all non-recursive
algebraic datatypes. For example,
maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a) maybeB nothing just =maybe(Left ()) Right>$<eitherB nothing just
condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a Source #
Conditionally select a BoundedPrim.
For example, we can implement the ASCII primitive that drops characters with
Unicode codepoints above 127 as follows.
charASCIIDrop =condB(< '\128') (liftFixedToBoundedchar7)emptyB
Builder construction
primBounded :: BoundedPrim a -> a -> Builder Source #
Create a Builder that encodes values with the given BoundedPrim.
We rewrite consecutive uses of primBounded such that the bound-checks are
fused. For example,
primBounded (word32 c1) `mappend` primBounded (word32 c2)
is rewritten such that the resulting Builder checks only once, if ther are
at 8 free bytes, instead of checking twice, if there are 4 free bytes. This
optimization is not observationally equivalent in a strict sense, as it
influences the boundaries of the generated chunks. However, for a user of
this library it is observationally equivalent, as chunk boundaries of a lazy
ByteString can only be observed through the internal interface.
Morevoer, we expect that all primitives write much fewer than 4kb (the
default short buffer size). Hence, it is safe to ignore the additional
memory spilled due to the more agressive buffer wrapping introduced by this
optimization.
primMapListBounded :: BoundedPrim a -> [a] -> Builder Source #
Create a Builder that encodes a list of values consecutively using a
BoundedPrim for each element. This function is more efficient than
mconcat . map (primBounded w)
or
foldMap (primBounded w)
because it moves several variables out of the inner loop.
primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source #
Create a Builder that encodes a sequence generated from a seed value
using a BoundedPrim for each sequence element.
primMapByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #
Create a Builder that encodes each Word8 of a strict ByteString
using a BoundedPrim. For example, we can write a Builder that filters
a strict ByteString as follows.
import qualified Data.ByteString.Builder.Prim as P
filterBS p = P.condB p (P.liftFixedToBounded P.word8) P.emptyB
primMapLazyByteStringBounded :: BoundedPrim Word8 -> ByteString -> Builder Source #
Chunk-wise application of primMapByteStringBounded.
Fixed-size primitives
A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size.
Combinators
The combinators for FixedPrims are implemented such that the
size
of the resulting FixedPrim is computed at compile time.
The (>*<) and (>$<) pairing and mapping operators can be used
with FixedPrim.
liftFixedToBounded :: FixedPrim a -> BoundedPrim a Source #
Lift a FixedPrim to a BoundedPrim.
Builder construction
In terms of expressivity, the function fixedPrim would be sufficient
for constructing Builders from FixedPrims. The fused variants of
this function are provided because they allow for more efficient
implementations. Our compilers are just not smart enough yet; and for some
of the employed optimizations (see the code of primMapByteStringFixed)
they will very likely never be.
Note that functions marked with "Heavy inlining." are forced to be
inlined because they must be specialized for concrete encodings,
but are rather heavy in terms of code size. We recommend to define a
top-level function for every concrete instantiation of such a function in
order to share its code. A typical example is the function
byteStringHex from Data.ByteString.Builder.ASCII,
which is implemented as follows.
byteStringHex :: S.ByteString -> Builder byteStringHex =primMapByteStringFixedword8HexFixed
primMapListFixed :: FixedPrim a -> [a] -> Builder Source #
Encode a list of values from left-to-right with a FixedPrim.
primMapByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #
Heavy inlining. Encode all bytes of a strict ByteString from
left-to-right with a FixedPrim. This function is quite versatile. For
example, we can use it to construct a Builder that maps every byte before
copying it to the buffer to be filled.
mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder mapToBuilder f = primMapByteStringFixed (contramapF f word8)
We can also use it to hex-encode a strict ByteString as shown by the
byteStringHex example above.
primMapLazyByteStringFixed :: FixedPrim Word8 -> ByteString -> Builder Source #
Heavy inlining. Encode all bytes of a lazy ByteString from
left-to-right with a FixedPrim.
Standard encodings of Haskell values
Binary encodings
Big-endian
Little-endian
Non-portable, host-dependent
intHost :: FixedPrim Int Source #
Encode a single native machine Int. The Ints is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the Int
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or integer sized machines, without
conversion.
wordHost :: FixedPrim Word Source #
Encode a single native machine Word. The Words is encoded in host order,
host endian form, for the machine you are on. On a 64 bit machine the Word
is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way
are not portable to different endian or word sized machines, without
conversion.
floatHost :: FixedPrim Float Source #
Encode a Float in native host order and host endianness. Values written
this way are not portable to different endian machines, without conversion.
Character encodings
ASCII
Decimal numbers
Decimal encoding of numbers using ASCII encoded characters.
Hexadecimal numbers
Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,
toLazyByteString (primBounded word16Hex 0x0a10) = "a10"
Note that there is no support for using upper-case characters. Please contact the maintainer if your application cannot work without hexadecimal encodings that use upper-case characters.
Fixed-width hexadecimal numbers
Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,
toLazyByteString (primFixed word16HexFixed 0x0a10) = "0a10"
ISO/IEC 8859-1 (Char8)
The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1.
The Char8 encoding implemented here works by truncating the Unicode
codepoint to 8-bits and encoding them as a single byte. For the codepoints
0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the
Char8 encoding is equivalent to the ASCII encoding on the Unicode
codepoints 0-127. Hence, functions such as intDec can also be used for
encoding Ints as a decimal number with Char8 encoded characters.