|Copyright||(c) 2010 Jasper Van der Jeugt (c) 2010 - 2011 Simon Meier|
|License||BSD3-style (see LICENSE)|
|Maintainer||Simon Meier <email@example.com>|
Builders are used to efficiently construct sequences of bytes from
smaller parts. Typically, such a construction is part of the implementation of an encoding, i.e.,
a function for converting Haskell values to sequences of bytes.
Examples of encodings are the generation of the sequence of bytes
representing an HTML document to be sent in an HTTP response by a
web application, or the serialization of a Haskell value using
a fixed binary format.
For an efficient implementation of an encoding,
it is important that (a) little time is spent on converting
the Haskell values to the resulting sequence of bytes and
(b) that the representation of the resulting sequence
is such that it can be consumed efficiently.
Builders support (a) by providing an O(1) concatenation operation
and efficient implementations of basic encodings for Chars, Ints,
and other standard Haskell values.
They support (b) by providing their result as a lazy ByteString,
which is internally just a linked list of pointers to chunks
of consecutive raw memory.
Lazy ByteStrings can be efficiently consumed by functions that
write them to a file or send them over a network socket.
Note that each chunk boundary incurs expensive extra work (e.g., a system call)
that must be amortized over the work spent on consuming the chunk body.
Builders therefore take special care to ensure that the
average chunk size is large enough.
The precise meaning of "large enough" is application-dependent.
The current implementation is tuned
for an average chunk size between 4 KB and 32 KB,
which should suit most applications.
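The effect of chunking can be observed directly. The following is a minimal sketch, assuming a recent bytestring release in which the module is named Data.ByteString.Builder (this document uses the older name Data.ByteString.Lazy.Builder); the names manySmallPieces and chunkSizes are illustrative only.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, charUtf8, intDec, toLazyByteString)
import Data.Monoid ((<>))

-- One hundred thousand small pieces, glued together with O(1) appends.
manySmallPieces :: Builder
manySmallPieces =
  mconcat [intDec i <> charUtf8 '\n' | i <- [1 .. 100000 :: Int]]

-- Sizes (in bytes) of the chunks that make up the resulting lazy ByteString.
chunkSizes :: [Int]
chunkSizes = map S.length (L.toChunks (toLazyByteString manySmallPieces))
```

Inspecting chunkSizes in GHCi should show a short list of large chunks rather than one chunk per piece; the exact sizes depend on the library version.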
As a simple example of an encoding implementation, we show how to efficiently convert the following representation of mixed-data tables to a UTF-8 encoded Comma-Separated-Values (CSV) table.
    data Cell = StringC String | IntC Int
      deriving (Eq, Ord, Show)

    type Row   = [Cell]
    type Table = [Row]
We use the following imports and abbreviate
mappend to simplify reading.
    import qualified Data.ByteString.Lazy               as L
    import           Data.ByteString.Lazy.Builder
    import           Data.ByteString.Lazy.Builder.ASCII (intDec)
    import           Data.Monoid
    import           Data.Foldable                      (foldMap)
    import           Data.List                          (intersperse)

    -- With base >= 4.5, Data.Monoid already exports (<>); omit this definition there.
    infixr 4 <>
    (<>) :: Monoid m => m -> m -> m
    (<>) = mappend
CSV is a character-based representation of tables. For maximal modularity,
we could first render Tables as
Strings and then encode this String
using some Unicode character encoding. However, this sacrifices performance
because the intermediate
String representation is built and thrown away
right afterwards. We get rid of this intermediate
String representation by
fixing the character encoding to UTF-8 and using
Builders to convert
Tables directly to UTF-8 encoded CSV tables represented as lazy ByteStrings.
    encodeUtf8CSV :: Table -> L.ByteString
    encodeUtf8CSV = toLazyByteString . renderTable

    renderTable :: Table -> Builder
    renderTable rs = mconcat [renderRow r <> charUtf8 '\n' | r <- rs]

    renderRow :: Row -> Builder
    renderRow []     = mempty
    renderRow (c:cs) =
        renderCell c <> mconcat [ charUtf8 ',' <> renderCell c' | c' <- cs ]

    renderCell :: Cell -> Builder
    renderCell (StringC cs) = renderString cs
    renderCell (IntC i)     = intDec i

    renderString :: String -> Builder
    renderString cs = charUtf8 '"' <> foldMap escape cs <> charUtf8 '"'
      where
        escape '\\' = charUtf8 '\\' <> charUtf8 '\\'
        escape '\"' = charUtf8 '\\' <> charUtf8 '\"'
        escape c    = charUtf8 c
Note that the ASCII encoding is a subset of the UTF-8 encoding,
which is why we can use the optimized function intDec to encode an
Int as a decimal number with UTF-8 encoded digits.
Using intDec is more efficient than stringUtf8 . show,
as it avoids constructing an intermediate String.
Avoiding this intermediate data structure significantly improves
performance because encoding
Cells is the core operation
for rendering CSV tables.
See Data.ByteString.Lazy.Builder.BasicEncoding for further
information on how to improve the performance of renderString.
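The claim that intDec and stringUtf8 . show produce the same bytes can be checked with a small sketch. It assumes the newer module name Data.ByteString.Builder, which exports intDec directly; viaIntDec, viaShow, and sameBytes are names introduced here for illustration.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (intDec, stringUtf8, toLazyByteString)

-- Direct decimal encoding; no intermediate String is built.
viaIntDec :: Int -> L.ByteString
viaIntDec = toLazyByteString . intDec

-- Encoding through an intermediate String produced by show.
viaShow :: Int -> L.ByteString
viaShow = toLazyByteString . stringUtf8 . show

-- Both routes yield the same bytes on this range of inputs.
sameBytes :: Bool
sameBytes = all (\i -> viaIntDec i == viaShow i) [-1000 .. 1000]
```

The two functions differ only in how much intermediate data they allocate, which is the point made above.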
We demonstrate our UTF-8 CSV encoding function on the following table.
    strings :: [String]
    strings = ["hello", "\"1\"", "λ-wörld"]

    table :: Table
    table = [map StringC strings, map IntC [-3..3]]
The expression encodeUtf8CSV table results in the following lazy ByteString.
Chunk "\"hello\",\"\\\"1\\\"\",\"\206\187-w\195\182rld\"\n-3,-2,-1,0,1,2,3\n" Empty
We can clearly see that we are converting to a binary format. The 'λ' and 'ö' characters, which have a Unicode codepoint above 127, are expanded to their corresponding UTF-8 multi-byte representation.
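The multi-byte expansion can be reproduced for single characters. This is a small sketch, assuming the newer module name Data.ByteString.Builder; utf8Bytes is a helper introduced here, not part of the library.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (charUtf8, toLazyByteString)
import Data.Word (Word8)

-- The UTF-8 bytes produced for a single character.
utf8Bytes :: Char -> [Word8]
utf8Bytes = L.unpack . toLazyByteString . charUtf8

-- utf8Bytes 'h' == [104]        (codepoint below 128: one byte)
-- utf8Bytes 'λ' == [206,187]    (U+03BB: two bytes)
-- utf8Bytes 'ö' == [195,182]    (U+00F6: two bytes)
```

These byte values match the escapes \206\187 and \195\182 in the output shown above.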
We use the
criterion library (http://hackage.haskell.org/package/criterion)
to benchmark the efficiency of our encoding function on the following table.
    import Criterion.Main  -- add this import to the ones above

    maxiTable :: Table
    maxiTable = take 1000 $ cycle table

    main :: IO ()
    main = defaultMain
      [ bench "encodeUtf8CSV maxiTable (original)" $
          whnf (L.length . encodeUtf8CSV) maxiTable
      ]
On a 2.20 GHz Core 2 Duo running 32-bit Linux,
the above code takes 1 ms to generate the 22,500-byte lazy ByteString.
Looking again at the definitions above,
we see that we took care to avoid intermediate data structures,
as otherwise we would sacrifice performance.
For example, the following (arguably simpler) definition of
renderRow is about 20% slower.
    renderRow :: Row -> Builder
    renderRow = mconcat . intersperse (charUtf8 ',') . map renderCell
Similarly, building an intermediate String with ++ and concatMap, as in the following definition of
renderString, should be avoided.
    renderString :: String -> Builder
    renderString cs = stringUtf8 $ "\"" ++ concatMap escape cs ++ "\""
      where
        escape '\\' = "\\\\"
        escape '\"' = "\\\""
        escape c    = return c
Apart from removing intermediate data-structures, encodings can be optimized further by fine-tuning their execution parameters using the functions in Data.ByteString.Lazy.Builder.Extras and their "inner loops" using the functions in Data.ByteString.Lazy.Builder.BasicEncoding.
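As an illustration of tuning execution parameters, the sketch below runs a Builder with smaller buffers than the default. It assumes a recent bytestring release, where the relevant module is named Data.ByteString.Builder.Extra and exports toLazyByteStringWith and safeStrategy; the buffer sizes 128 and 4096 and the name toSmallChunks are arbitrary choices made here.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, stringUtf8, toLazyByteString)
import Data.ByteString.Builder.Extra (safeStrategy, toLazyByteStringWith)

-- Run a Builder with a first buffer of 128 bytes and later buffers of
-- 4096 bytes. The second argument is the lazy ByteString to append
-- after the Builder's output; here it is empty.
toSmallChunks :: Builder -> L.ByteString
toSmallChunks = toLazyByteStringWith (safeStrategy 128 4096) L.empty

-- The bytes are the same as with the default driver; only the chunking differs.
sameContent :: Bool
sameContent =
  toSmallChunks (stringUtf8 (replicate 10000 'x'))
    == toLazyByteString (stringUtf8 (replicate 10000 'x'))
```

Smaller buffers may suit outputs that are usually short; whether they help in a given application is something to measure rather than assume.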
- data Builder
- toLazyByteString :: Builder -> ByteString
- hPutBuilder :: Handle -> Builder -> IO ()
- byteString :: ByteString -> Builder
- lazyByteString :: ByteString -> Builder
- int8 :: Int8 -> Builder
- word8 :: Word8 -> Builder
- int16BE :: Int16 -> Builder
- int32BE :: Int32 -> Builder
- int64BE :: Int64 -> Builder
- word16BE :: Word16 -> Builder
- word32BE :: Word32 -> Builder
- word64BE :: Word64 -> Builder
- floatBE :: Float -> Builder
- doubleBE :: Double -> Builder
- int16LE :: Int16 -> Builder
- int32LE :: Int32 -> Builder
- int64LE :: Int64 -> Builder
- word16LE :: Word16 -> Builder
- word32LE :: Word32 -> Builder
- word64LE :: Word64 -> Builder
- floatLE :: Float -> Builder
- doubleLE :: Double -> Builder
- char7 :: Char -> Builder
- string7 :: String -> Builder
- char8 :: Char -> Builder
- string8 :: String -> Builder
- charUtf8 :: Char -> Builder
- stringUtf8 :: String -> Builder
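The difference between the big-endian (BE) and little-endian (LE) encodings listed above can be seen on a single value. This is a minimal sketch assuming the newer module name Data.ByteString.Builder; encodeBytes and the record header layout are hypothetical and serve only as an example.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder
  (Builder, toLazyByteString, word16LE, word32BE, word32LE, word8)
import Data.Monoid ((<>))
import Data.Word (Word8)

-- Run a Builder and list the bytes it produced.
encodeBytes :: Builder -> [Word8]
encodeBytes = L.unpack . toLazyByteString

bigEndian, littleEndian :: [Word8]
bigEndian    = encodeBytes (word32BE 0x01020304)  -- [1,2,3,4]
littleEndian = encodeBytes (word32LE 0x01020304)  -- [4,3,2,1]

-- A made-up record header: one tag byte followed by a 16-bit
-- little-endian length field.
header :: [Word8]
header = encodeBytes (word8 0xFF <> word16LE 258)  -- [255,2,1]
```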
The Builder type
Builders are buffer-filling functions. They are
executed by a driver that provides them with an actual buffer to
fill. Once called with a buffer, a
Builder fills it and returns a
signal to the driver telling it that it is either done, has filled the
current buffer, or wants to directly insert a reference to a chunk of
memory. In the last two cases, the
Builder also returns a
Builder that the driver can call to fill the next
buffer. Here, we provide the two drivers that satisfy almost all use
cases. See Data.ByteString.Lazy.Builder.Extras, for information
about fine-tuning them.
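The interplay between a buffer-filling function and its driver can be modelled with a toy type. The sketch below is not the library's actual representation: ToyBuilder, Signal, bytes, chunk, and drive are all hypothetical names, buffers are modelled as plain lists, and the driver assumes a buffer size greater than zero.

```haskell
import Data.Word (Word8)

-- What a step reports back to the driver (a toy version of the three signals).
data Signal
  = Done                            -- nothing left to write
  | BufferFull ToyBuilder           -- buffer exhausted; continue with this builder
  | InsertChunk [Word8] ToyBuilder  -- emit this chunk directly, then continue

-- Given the free space of a buffer, return the bytes written and a signal.
newtype ToyBuilder = ToyBuilder { runStep :: Int -> ([Word8], Signal) }

-- Write a list of bytes, spilling into further buffers as needed.
bytes :: [Word8] -> ToyBuilder
bytes ws = ToyBuilder $ \free ->
  case splitAt free ws of
    (now, [])   -> (now, Done)
    (now, rest) -> (now, BufferFull (bytes rest))

-- Ask the driver to insert an existing chunk without copying it into a buffer.
chunk :: [Word8] -> ToyBuilder
chunk c = ToyBuilder $ \_ -> ([], InsertChunk c (bytes []))

-- Driver: hand out fresh buffers of a fixed size and collect the chunks.
drive :: Int -> ToyBuilder -> [[Word8]]
drive size b =
  case runStep b size of
    (out, Done)               -> [out | not (null out)]
    (out, BufferFull next)    -> out : drive size next
    (out, InsertChunk c next) -> [out | not (null out)] ++ [c] ++ drive size next
```

For instance, drive 4 (bytes [1..10]) yields the chunks [1,2,3,4], [5,6,7,8], and [9,10], one per buffer handed out by the driver.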
When using hPutBuilder, it is recommended that the
Handle is set to binary and
BlockBuffering mode. See hSetBinaryMode and hSetBuffering.
hPutBuilder is more efficient than
hPut . toLazyByteString because in
many cases no buffer allocation has to be done. Moreover, the results of
several executions of short
Builders are concatenated in the
Handle's buffer, thereby avoiding unnecessary buffer flushes.
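A typical use of hPutBuilder with the recommended Handle settings might look as follows. This is a sketch assuming the newer module name Data.ByteString.Builder; writeLines is a hypothetical helper, and withBinaryFile from System.IO is used so that the Handle is already in binary mode.

```haskell
import Data.ByteString.Builder (charUtf8, hPutBuilder, stringUtf8)
import Data.Monoid ((<>))
import System.IO
  (BufferMode (BlockBuffering), IOMode (WriteMode), hSetBuffering, withBinaryFile)

-- Write each String as a UTF-8 encoded line to the given file.
writeLines :: FilePath -> [String] -> IO ()
writeLines path ls =
  withBinaryFile path WriteMode $ \h -> do
    -- Block buffering lets many short Builders share one buffer.
    hSetBuffering h (BlockBuffering Nothing)
    mapM_ (\l -> hPutBuilder h (stringUtf8 l <> charUtf8 '\n')) ls
```

Each call to hPutBuilder here executes a short Builder; with block buffering, their output accumulates in the Handle's buffer until it is flushed.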
ASCII (Char7)
The ASCII encoding is a 7-bit encoding. The Char7 encoding implemented here works by truncating the Unicode codepoint to 7 bits, prefixing it with a leading 0, and encoding the resulting 8 bits as a single byte. For the codepoints 0-127 this corresponds to the ASCII encoding. In Data.ByteString.Lazy.Builder.ASCII, we also provide efficient implementations of ASCII-based encodings of numbers (e.g., decimal and hexadecimal encodings).
ISO/IEC 8859-1 (Char8)
The ISO/IEC 8859-1 encoding is an 8-bit encoding often known as Latin-1. The Char8 encoding implemented here works by truncating the Unicode codepoint to 8 bits and encoding the result as a single byte. For the codepoints 0-255 this corresponds to the ISO/IEC 8859-1 encoding. Note that you can also use the functions from Data.ByteString.Lazy.Builder.ASCII, as the ASCII encoding and ISO/IEC 8859-1 are equivalent on the codepoints 0-127.
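The truncation performed by the Char7 and Char8 encodings can be made concrete with a codepoint outside both ranges. This is a minimal sketch assuming the newer module name Data.ByteString.Builder; bytesOf is a helper introduced here.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder (Builder, char7, char8, toLazyByteString)
import Data.Word (Word8)

-- The bytes produced by a given single-character encoding.
bytesOf :: (Char -> Builder) -> Char -> [Word8]
bytesOf enc = L.unpack . toLazyByteString . enc

-- 'A' (U+0041) is below 128, so both encodings agree with ASCII.
asciiA7, asciiA8 :: [Word8]
asciiA7 = bytesOf char7 'A'  -- [65]
asciiA8 = bytesOf char8 'A'  -- [65]

-- '\955' is 'λ' (U+03BB); its codepoint is truncated, losing information.
lambda7, lambda8 :: [Word8]
lambda7 = bytesOf char7 '\955'  -- [59]:  0x3BB truncated to 7 bits is 0x3B
lambda8 = bytesOf char8 '\955'  -- [187]: 0x3BB truncated to 8 bits is 0xBB
```

Because truncation silently maps distinct characters to the same byte, these encodings are only safe when the input is known to lie in the supported range.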
UTF-8
The UTF-8 encoding can encode all Unicode codepoints. We recommend
always using it for encoding Chars and
Strings unless an application
really requires another encoding. Note that you can also use the
functions from Data.ByteString.Lazy.Builder.ASCII for UTF-8 encoding,
as the ASCII encoding is equivalent to the UTF-8 encoding on the Unicode