bytestring-0.10.9.0: Fast, compact, strict and lazy byte strings with a list interface

Copyright (c) 2010-2011 Simon Meier(c) 2010 Jasper van der Jeugt BSD3-style (see LICENSE) Simon Meier GHC Trustworthy Haskell98

Data.ByteString.Builder.Prim

Description

This module provides Builder primitives, which are lower level building blocks for constructing Builders. You don't need to go down to this level but it can be slightly faster.

Morally, builder primitives are like functions a -> Builder, that is they take a value and encode it as a sequence of bytes, represented as a Builder. Of course their implementation is a bit more specialised.

Builder primitives come in two forms: fixed-size and bounded-size.

• Fixed(-size) primitives are builder primitives that always result in a sequence of bytes of a fixed length. That is, the length is independent of the value that is encoded. An example of a fixed size primitive is the big-endian encoding of a Word64, which always results in exactly 8 bytes.
• Bounded(-size) primitives are builder primitives that always result in a sequence of bytes that is no larger than a predetermined bound. That is, the bound is independent of the value that is encoded but the actual length will depend on the value. An example for a bounded primitive is the UTF-8 encoding of a Char, which can be 1,2,3 or 4 bytes long, so the bound is 4 bytes.

Note that fixed primitives can be considered as a special case of bounded primitives, and we can lift from fixed to bounded.

Because bounded primitives are the more general case, in this documentation we only refer to fixed size primitives where it matters that the resulting sequence of bytes is of a fixed length. Otherwise, we just refer to bounded size primitives.

The purpose of using builder primitives is to improve the performance of Builders. These improvements stem from making the two most common steps performed by a Builder more efficient. We explain these two steps in turn.

The first most common step is the concatenation of two Builders. Internally, concatenation corresponds to function composition. (Note that Builders can be seen as difference-lists of buffer-filling functions; cf. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/dlist. ) Function composition is a fast O(1) operation. However, we can use bounded primitives to remove some of these function compositions altogether, which is more efficient.

The second most common step performed by a Builder is to fill a buffer using a bounded primitives, which works as follows. The Builder checks whether there is enough space left to execute the bounded primitive. If there is, then the Builder executes the bounded primitive and calls the next Builder with the updated buffer. Otherwise, the Builder signals its driver that it requires a new buffer. This buffer must be at least as large as the bound of the primitive. We can use bounded primitives to reduce the number of buffer-free checks by fusing the buffer-free checks of consecutive Builders. We can also use bounded primitives to simplify the control flow for signalling that a buffer is full by ensuring that we check first that there is enough space left and only then decide on how to encode a given value.

Let us illustrate these improvements on the CSV-table rendering example from Data.ByteString.Builder. Its "hot code" is the rendering of a table's cells, which we implement as follows using only the functions from the Builder API.

import Data.ByteString.Builder as B

renderCell :: Cell -> Builder
renderCell (StringC cs) = renderString cs
renderCell (IntC i)     = B.intDec i

renderString :: String -> Builder
renderString cs = B.charUtf8 '"' <> foldMap escape cs <> B.charUtf8 '"'
where
escape '\\' = B.charUtf8 '\\' <> B.charUtf8 '\\'
escape '\"' = B.charUtf8 '\\' <> B.charUtf8 '\"'
escape c    = B.charUtf8 c


Efficient encoding of Ints as decimal numbers is performed by intDec. Optimization potential exists for the escaping of Strings. The above implementation has two optimization opportunities. First, the buffer-free checks of the Builders for escaping double quotes and backslashes can be fused. Second, the concatenations performed by foldMap can be eliminated. The following implementation exploits these optimizations.

import qualified Data.ByteString.Builder.Prim  as P
import           Data.ByteString.Builder.Prim
( condB, liftFixedToBounded, (>*<), (>$<) ) renderString :: String -> Builder renderString cs = B.charUtf8 '"' <> E.encodeListWithB escape cs <> B.charUtf8 '"' where escape :: E.BoundedPrim Char escape = condB (== '\\') (fixed2 ('\\', '\\'))$
condB (== '\"') (fixed2 ('\\', '\"')) $E.charUtf8 {-# INLINE fixed2 #-} fixed2 x = liftFixedToBounded$ const x >$< E.char7 >*< E.char7  The code should be mostly self-explanatory. The slightly awkward syntax is because the combinators are written such that the size-bound of the resulting BoundedPrim can be computed at compile time. We also explicitly inline the fixed2 primitive, which encodes a fixed tuple of characters, to ensure that the bound computation happens at compile time. When encoding the following list of Strings, the optimized implementation of renderString is two times faster. maxiStrings :: [String] maxiStrings = take 1000$ cycle ["hello", "\"1\"", "λ-wörld"]


Most of the performance gain stems from using primMapListBounded, which encodes a list of values from left-to-right with a BoundedPrim. It exploits the Builder internals to avoid unnecessary function compositions (i.e., concatenations). In the future, we might expect the compiler to perform the optimizations implemented in primMapListBounded. However, it seems that the code is currently to complicated for the compiler to see through. Therefore, we provide the BoundedPrim escape hatch, which allows data structures to provide very efficient encoding traversals, like primMapListBounded for lists.

Note that BoundedPrims are a bit verbose, but quite versatile. Here is an example of a BoundedPrim for combined HTML escaping and UTF-8 encoding. It exploits that the escaped character with the maximal Unicode codepoint is '>'.

{-# INLINE charUtf8HtmlEscaped #-}
charUtf8HtmlEscaped :: E.BoundedPrim Char
charUtf8HtmlEscaped =
condB (>  '>' ) E.charUtf8 $condB (== '<' ) (fixed4 ('&',('l',('t',';'))))$        -- &lt;
condB (== '>' ) (fixed4 ('&',('g',('t',';')))) $-- &gt; condB (== '&' ) (fixed5 ('&',('a',('m',('p',';')))))$  -- &amp;
condB (== '"' ) (fixed5 ('&',('#',('3',('4',';'))))) $-- &#34; condB (== '\'') (fixed5 ('&',('#',('3',('9',';')))))$  -- &#39;
(liftFixedToBounded E.char7)         -- fallback for Chars smaller than '>'
where
{-# INLINE fixed4 #-}
fixed4 x = liftFixedToBounded $const x >$<
E.char7 >*< E.char7 >*< E.char7 >*< E.char7

{-# INLINE fixed5 #-}
fixed5 x = liftFixedToBounded $const x >$<
E.char7 >*< E.char7 >*< E.char7 >*< E.char7 >*< E.char7


This module currently does not expose functions that require the special properties of fixed-size primitives. They are useful for prefixing Builders with their size or for implementing chunked encodings. We will expose the corresponding functions in future releases of this library.

Synopsis

# Bounded-size primitives

data BoundedPrim a Source #

A builder primitive that always results in sequence of bytes that is no longer than a pre-determined bound.

## Combinators

The combinators for BoundedPrims are implemented such that the size of the resulting BoundedPrim can be computed at compile time.

The BoundedPrim that always results in the zero-length sequence.

(>*<) :: Monoidal f => f a -> f b -> f (a, b) infixr 5 Source #

A pairing/concatenation operator for builder primitives, both bounded and fixed size.

For example,

toLazyByteString (primFixed (char7 >*< char7) ('x','y')) = "xy"

We can combine multiple primitives using >*< multiple times.

toLazyByteString (primFixed (char7 >*< char7 >*< char7) ('x',('y','z'))) = "xyz"

(>$<) :: Contravariant f => (b -> a) -> f a -> f b infixl 4 Source # A fmap-like operator for builder primitives, both bounded and fixed size. Builder primitives are contravariant so it's like the normal fmap, but backwards (look at the type). (If it helps to remember, the operator symbol is like ($) but backwards.)

We can use it for example to prepend and/or append fixed values to an primitive.

showEncoding ((\x -> ('\'', (x, '\''))) >$< fixed3) 'x' = "'x'" where fixed3 = char7 >*< char7 >*< char7 Note that the rather verbose syntax for composition stems from the requirement to be able to compute the size / size bound at compile time. eitherB :: BoundedPrim a -> BoundedPrim b -> BoundedPrim (Either a b) Source # Encode an Either value using the first BoundedPrim for Left values and the second BoundedPrim for Right values. Note that the functions eitherB, pairB, and contramapB (written below using >$<) suffice to construct BoundedPrims for all non-recursive algebraic datatypes. For example,

maybeB :: BoundedPrim () -> BoundedPrim a -> BoundedPrim (Maybe a)
maybeB nothing just = maybe (Left ()) Right >$< eitherB nothing just  condB :: (a -> Bool) -> BoundedPrim a -> BoundedPrim a -> BoundedPrim a Source # Conditionally select a BoundedPrim. For example, we can implement the ASCII primitive that drops characters with Unicode codepoints above 127 as follows. charASCIIDrop = condB (< '\128') (fromF char7) emptyB  ## Builder construction primBounded :: BoundedPrim a -> a -> Builder Source # Create a Builder that encodes values with the given BoundedPrim. We rewrite consecutive uses of primBounded such that the bound-checks are fused. For example, primBounded (word32 c1) mappend primBounded (word32 c2) is rewritten such that the resulting Builder checks only once, if ther are at 8 free bytes, instead of checking twice, if there are 4 free bytes. This optimization is not observationally equivalent in a strict sense, as it influences the boundaries of the generated chunks. However, for a user of this library it is observationally equivalent, as chunk boundaries of a lazy ByteString can only be observed through the internal interface. Morevoer, we expect that all primitives write much fewer than 4kb (the default short buffer size). Hence, it is safe to ignore the additional memory spilled due to the more agressive buffer wrapping introduced by this optimization. primMapListBounded :: BoundedPrim a -> [a] -> Builder Source # Create a Builder that encodes a list of values consecutively using a BoundedPrim for each element. This function is more efficient than the canonical filter p = B.toLazyByteString . E.encodeLazyByteStringWithF (E.ifF p E.word8) E.emptyF)  mconcat . map (primBounded w) or foldMap (primBounded w) because it moves several variables out of the inner loop. primUnfoldrBounded :: BoundedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source # Create a Builder that encodes a sequence generated from a seed value using a BoundedPrim for each sequence element. Create a Builder that encodes each Word8 of a strict ByteString using a BoundedPrim. For example, we can write a Builder that filters a strict ByteString as follows. import Data.ByteString.Builder.Primas P (word8, condB, emptyB) filterBS p = P.condB p P.word8 P.emptyB Chunk-wise application of primMapByteStringBounded. # Fixed-size primitives data FixedPrim a Source # A builder primitive that always results in a sequence of bytes of a pre-determined, fixed size. ## Combinators The combinators for FixedPrims are implemented such that the size of the resulting FixedPrim is computed at compile time. The '(>*and '($<)' pairing and mapping operators can be used with FixedPrim.

The FixedPrim that always results in the zero-length sequence.

Lift a FixedPrim to a BoundedPrim.

## Builder construction

In terms of expressivity, the function fixedPrim would be sufficient for constructing Builders from FixedPrims. The fused variants of this function are provided because they allow for more efficient implementations. Our compilers are just not smart enough yet; and for some of the employed optimizations (see the code of encodeByteStringWithF) they will very likely never be.

Note that functions marked with "Heavy inlining." are forced to be inlined because they must be specialized for concrete encodings, but are rather heavy in terms of code size. We recommend to define a top-level function for every concrete instantiation of such a function in order to share its code. A typical example is the function byteStringHex from Data.ByteString.Builder.ASCII, which is implemented as follows.

byteStringHex :: S.ByteString -> Builder
byteStringHex = encodeByteStringWithF word8HexFixed


primFixed :: FixedPrim a -> a -> Builder Source #

Encode a value with a FixedPrim.

primMapListFixed :: FixedPrim a -> [a] -> Builder Source #

Encode a list of values from left-to-right with a FixedPrim.

primUnfoldrFixed :: FixedPrim b -> (a -> Maybe (b, a)) -> a -> Builder Source #

Encode a list of values represented as an unfoldr with a FixedPrim.

Heavy inlining. Encode all bytes of a strict ByteString from left-to-right with a FixedPrim. This function is quite versatile. For example, we can use it to construct a Builder that maps every byte before copying it to the buffer to be filled.

mapToBuilder :: (Word8 -> Word8) -> S.ByteString -> Builder
mapToBuilder f = encodeByteStringWithF (contramapF f word8)

We can also use it to hex-encode a strict ByteString as shown by the byteStringHex example above.

Heavy inlining. Encode all bytes of a lazy ByteString from left-to-right with a FixedPrim.

# Standard encodings of Haskell values

## Binary encodings

Encoding single signed bytes as-is.

Encoding single unsigned bytes as-is.

### Big-endian

Encoding Int16s in big endian format.

Encoding Int32s in big endian format.

Encoding Int64s in big endian format.

Encoding Word16s in big endian format.

Encoding Word32s in big endian format.

Encoding Word64s in big endian format.

Encode a Float in big endian format.

Encode a Double in big endian format.

### Little-endian

Encoding Int16s in little endian format.

Encoding Int32s in little endian format.

Encoding Int64s in little endian format.

Encoding Word16s in little endian format.

Encoding Word32s in little endian format.

Encoding Word64s in little endian format.

Encode a Float in little endian format.

Encode a Double in little endian format.

### Non-portable, host-dependent

Encode a single native machine Int. The Ints is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Int is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or integer sized machines, without conversion.

Encoding Int16s in native host order and host endianness.

Encoding Int32s in native host order and host endianness.

Encoding Int64s in native host order and host endianness.

Encode a single native machine Word. The Words is encoded in host order, host endian form, for the machine you are on. On a 64 bit machine the Word is an 8 byte value, on a 32 bit machine, 4 bytes. Values encoded this way are not portable to different endian or word sized machines, without conversion.

Encoding Word16s in native host order and host endianness.

Encoding Word32s in native host order and host endianness.

Encoding Word64s in native host order and host endianness.

Encode a Float in native host order and host endianness. Values written this way are not portable to different endian machines, without conversion.

Encode a Double in native host order and host endianness.

## Character encodings

### ASCII

Encode the least 7-bits of a Char using the ASCII encoding.

#### Decimal numbers

Decimal encoding of numbers using ASCII encoded characters.

Decimal encoding of an Int8.

Decimal encoding of an Int16.

Decimal encoding of an Int32.

Decimal encoding of an Int64.

Decimal encoding of an Int.

Decimal encoding of a Word8.

Decimal encoding of a Word16.

Decimal encoding of a Word32.

Decimal encoding of a Word64.

Decimal encoding of a Word.

Encoding positive integers as hexadecimal numbers using lower-case ASCII characters. The shortest possible representation is used. For example,

toLazyByteString (primBounded word16Hex 0x0a10) = "a10"

Hexadecimal encoding of a Word8.

Hexadecimal encoding of a Word16.

Hexadecimal encoding of a Word32.

Hexadecimal encoding of a Word64.

Hexadecimal encoding of a Word.

Encoding the bytes of fixed-width types as hexadecimal numbers using lower-case ASCII characters. For example,

toLazyByteString (primFixed word16HexFixed 0x0a10) = "0a10"

Encode a Int8 using 2 nibbles (hexadecimal digits).

Encode a Int16 using 4 nibbles.

Encode a Int32 using 8 nibbles.

Encode a Int64 using 16 nibbles.

Encode a Word8 using 2 nibbles (hexadecimal digits).

Encode a Word16 using 4 nibbles.

Encode a Word32 using 8 nibbles.

Encode a Word64 using 16 nibbles.

Encode an IEEE Float using 8 nibbles.

Encode an IEEE Double using 16 nibbles.

### ISO/IEC 8859-1 (Char8)

The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8-bits and encoding them as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that the Char8 encoding is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with Char8 encoded characters.

Char8 encode a Char.

### UTF-8

The UTF-8 encoding can encode all Unicode codepoints. It is equivalent to the ASCII encoding on the Unicode codepoints 0-127. Hence, functions such as intDec can also be used for encoding Ints as a decimal number with UTF-8 encoded characters.

UTF-8 encode a Char.