Streaming Benchmarks
====================

.. image:: https://badges.gitter.im/composewell/gitter.svg?
  :target: https://gitter.im/composewell/streamly
  :alt: Gitter chat

.. image:: https://img.shields.io/hackage/v/streaming-benchmarks.svg?style=flat
  :target: https://hackage.haskell.org/package/streaming-benchmarks
  :alt: Hackage

.. image:: https://travis-ci.org/composewell/streaming-benchmarks.svg?branch=master
  :target: https://travis-ci.org/composewell/streaming-benchmarks
  :alt: Unix Build Status

.. image:: https://ci.appveyor.com/api/projects/status/8d1kgrrw9mmxv5xt?svg=true
  :target: https://ci.appveyor.com/project/harendra-kumar/streaming-benchmarks
  :alt: Windows Build status

.. contents:: Table of Contents
   :depth: 1

This package compares `streamly <https://github.com/composewell/streamly>`_,
a blazing fast streaming library with native support for high level,
declarative and composable concurrency, with popular streaming libraries such
as ``vector``, ``streaming``, ``pipes`` and ``conduit``.

This package was motivated by ``streamly``; however, it is general purpose
and compares more libraries and benchmarks than those shown here. Please send
an email or a pull request if the benchmarking code has a problem or is
unfair to some library in any way.

Benchmarks & Results
--------------------

A stream of one million consecutive numbers is generated using the monadic
unfold API ``unfoldrM``; these elements are then processed using the
streaming combinator under test (e.g. ``map``). The total time to process all
one million elements and the maximum resident set size (rss) are measured and
plotted for each library. The underlying monad for each stream is the IO
monad.

All the libraries are compiled with GHC-8.4.3. All the benchmarks were run on
an Apple MacBook Pro with a single 2.2 GHz Intel Core i7 processor with 4
cores and 16 GB RAM.

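To make the measurement concrete, here is roughly the shape of the ``map``
benchmark described above. This is an illustrative sketch using the
``vector`` package's monadic stream API (also used in the examples later in
this document); the names ``source`` and ``benchMap`` are ours and do not
appear in the repository, and the real harness measures time and rss around
the operation::

    import qualified Data.Vector.Fusion.Stream.Monadic as S

    -- Generate a stream of one million consecutive numbers starting at n.
    source :: Int -> S.Stream IO Int
    source n = S.unfoldrM step n
      where
        step i | i >= n + 1000000 = return Nothing
               | otherwise        = return (Just (i, i + 1))

    -- The "map" benchmark: increment every element and drain the stream.
    benchMap :: Int -> IO ()
    benchMap n = S.mapM_ (\_ -> return ()) (S.map (+ 1) (source n))

The real harness varies the starting value and the operation under test; only
the overall shape is shown here.
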
Highlights
~~~~~~~~~~

* ``streamly`` shows the best overall performance in terms of time as well as
  space. ``streamly`` and ``vector`` show similar performance except for the
  ``append`` operation, where ``streamly`` is much better, and the ``filter``
  operation, where ``vector`` is faster.
* The ``append`` operation scales well only for ``streamly`` and ``conduit``.
  All other libraries show quadratic complexity on this operation.
* ``streaming`` performs slightly better than ``conduit`` when multiple
  operations are composed together, even though in terms of individual
  operations it is slightly worse than ``conduit``.
* ``conduit`` and ``pipes`` show unusually large space utilization for the
  ``take`` and ``drop`` operations (more than 100-150 MiB vs 3 MiB for the
  other libraries).
* ``drinkery`` shows very good performance too, though it is not plotted here
  because of a small issue in measurement and lack of space.
* ``machines`` is roughly 2x slower than the slowest library here, and its
  maximum resident set size is close to 100 MiB for all operations (touching
  300 MiB for ``take``) compared to 3 MiB for all other libraries. We are not
  sure whether something is wrong with the measurements or the benchmarking
  code; the majority of the code is common to all libraries. Any improvements
  to the machines benchmarking code are welcome.

Key Operations
~~~~~~~~~~~~~~

The following diagram plots the time taken by key streaming operations to
process a million stream elements. *Note: the time for streamly and vector is
very low (600-700 microseconds) and therefore can barely be seen in this
graph.*

.. |keyoperations-time| image:: charts-0/KeyOperations-time.svg
  :width: 75%
  :target: charts-0/KeyOperations-time.svg
  :alt: Time Cost of Key Streaming Operations

|keyoperations-time|

For those interested in heap allocations, the following diagram plots the
overall heap allocations during each measurement period, i.e. the total
allocations for processing one million stream elements.

.. |keyoperations-allocated| image:: charts-0/KeyOperations-allocated.svg
  :width: 75%
  :target: charts-0/KeyOperations-allocated.svg
  :alt: Heap allocations for Key Streaming Operations

|keyoperations-allocated|

The following diagram plots the maximum resident set size (rss) during the
measurement of each operation. In plain terms, it is the maximum amount of
physical memory that is utilized at any point during the measurement.

.. |keyoperations-maxrss| image:: charts-0/KeyOperations-maxrss.svg
  :width: 75%
  :target: charts-0/KeyOperations-maxrss.svg
  :alt: Maximum rss for Key Streaming Operations

|keyoperations-maxrss|

+------------------------+----------------------------------------------------+
| Benchmark              | Description                                        |
+========================+====================================================+
| drain                  | discards all the elements in the stream            |
+------------------------+----------------------------------------------------+
| drop-all               | drops all the elements using the ``drop``          |
|                        | operation                                          |
+------------------------+----------------------------------------------------+
| last                   | extracts the last element of the stream            |
+------------------------+----------------------------------------------------+
| fold                   | sums all the numbers in the stream                 |
+------------------------+----------------------------------------------------+
| map                    | increments each number in the stream by 1          |
+------------------------+----------------------------------------------------+
| take-all               | uses ``take`` to retain all the elements in the    |
|                        | stream                                             |
+------------------------+----------------------------------------------------+
| filter-even            | uses ``filter`` to keep even numbers and discard   |
|                        | the odd numbers in the stream                      |
+------------------------+----------------------------------------------------+
| scan                   | scans the stream using the ``+`` operation         |
+------------------------+----------------------------------------------------+
| mapM                   | transforms the stream using a monadic action       |
+------------------------+----------------------------------------------------+
| zip                    | combines corresponding elements of the two streams |
|                        | together                                           |
+------------------------+----------------------------------------------------+

Append Operation
~~~~~~~~~~~~~~~~

A million streams of single elements are created and appended together to
create a stream of a million elements. The total time taken in this operation
is measured. *Note that vector, streaming and pipes show quadratic complexity
(O(n^2)) on this benchmark and do not finish in a reasonable time.* The time
shown in the graph for these libraries is just indicative; the actual time
taken is much higher.

.. |append| image:: charts-0/AppendOperation-time.svg
  :width: 60%
  :target: charts-0/AppendOperation-time.svg
  :alt: Cost of appending a million streams of single elements

|append|

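For a rough idea of the shape of this benchmark, here is an illustrative
sketch with the ``vector`` stream API (again, not the repository code;
``appendBench`` is a hypothetical name)::

    import qualified Data.Vector.Fusion.Stream.Monadic as S

    -- Build a stream of one million elements by appending one million
    -- single-element streams, then drain it.
    appendBench :: Int -> IO ()
    appendBench n =
        S.mapM_ (\_ -> return ())
            (foldr (S.++) S.empty (map S.singleton [n .. n + 999999]))

Appending a large number of streams this way is exactly where several of the
libraries show the quadratic behaviour noted above.
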
toList Operation
~~~~~~~~~~~~~~~~

A stream of a million elements is generated using ``unfoldrM`` and then
converted to a list.

.. |toList| image:: charts-0/toListOperation-time.svg
  :width: 60%
  :target: charts-0/toListOperation-time.svg
  :alt: Cost of converting a stream of a million elements to a list

|toList|

Composing Multiple Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A stream operation, or a combination of stream operations, is performed four
times in a row to measure how the composition scales for each library. A
million elements are passed through this composition. *Note: the time for
streamly and vector is very low (600-700 microseconds) and therefore can
barely be seen in this graph.*

.. |composed| image:: charts-0/ComposedOperations%3A4times-time.svg
  :width: 60%
  :target: charts-0/ComposedOperations%3A4times-time.svg
  :alt: Cost when operations are composed

|composed|

+------------------------+----------------------------------------------------+
| Benchmark              | Description                                        |
+========================+====================================================+
| mapM                   | ``mapM`` four times in a row                       |
+------------------------+----------------------------------------------------+
| all-in-filters         | four filters in a row,                             |
|                        | each allowing all elements in                      |
+------------------------+----------------------------------------------------+
| map-with-all-in-filter | ``map`` followed by ``filter`` composed four times |
|                        | serially                                           |
+------------------------+----------------------------------------------------+

How to Run
----------

To quickly compare packages::

    # Chart all the default packages
    ./run.sh --quick

    # Compare a given list of packages
    # Available package names are: streamly, vector, streaming, pipes,
    # conduit, machines, drinkery, list, pure-vector
    ./run.sh --quick --select "streamly,vector"

    # Show full results for the first package and delta from that for
    # the rest of the packages.
    ./run.sh --quick --select "streamly,vector" --delta

After running you can find the generated charts in the ``charts`` directory.
If you have the patience to wait longer for the results, remove the
``--quick`` option; the results are likely to be a bit more accurate.

The ``list`` package above refers to the standard Haskell lists in the
``base`` package, and ``pure-vector`` is the ``vector`` package using the
pure API instead of the monadic API.

Pedantic Mode
~~~~~~~~~~~~~

Note that if different optimization flags are used on different packages,
performance can sometimes suffer badly because GHC inlining and
specialization do not work optimally. If you want to be absolutely sure that
all packages and dependencies are compiled with the same optimization flags
(``-O2``), use ``run.sh --pedantic``. It will install the stack snapshot in a
private directory under the current directory and build everything fresh with
the GHC flags specified in ``stack-pedantic.yaml``. Be aware that this will
require 1-2 GB of extra disk space.

Adding New Libraries
~~~~~~~~~~~~~~~~~~~~

It is trivial to add a new package; take a look at the benchmark file of any
of the existing streaming packages to see what it looks like. Pull requests
are welcome, I will be happy to help, just join the `gitter chat
<https://gitter.im/composewell/streamly>`_ and ask!

Benchmarking Notes
------------------

Benchmarking is a tricky business. Though the benchmarks have been carefully
designed, there may still be issues with the way benchmarking is being done
or with the way the benchmarks have been coded. If you find that something is
being measured unfairly or incorrectly, please bring it to our notice by
raising an issue, sending an email, or via the gitter chat.

Measurement
~~~~~~~~~~~

``Benchmarking Tool:`` We use the `gauge
<https://hackage.haskell.org/package/gauge>`_ package for measurements
instead of criterion. There were several issues with criterion that we fixed
in gauge to get correct results. Each benchmark is run in a separate process
to avoid any interaction between benchmarks.

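For orientation, benchmarks are declared with gauge's criterion-style API,
roughly along these lines. This is a simplified sketch, not the actual code
in this repository; ``benchMap`` is the hypothetical operation from the
earlier sketch::

    import Gauge (bench, bgroup, defaultMain, nfIO)
    import System.Random (randomRIO)

    main :: IO ()
    main = defaultMain
        [ bgroup "vector"
            -- Feed a (trivially) random starting value to the operation so
            -- that GHC cannot optimize the benchmarked work away.
            [ bench "map" (nfIO (randomRIO (1, 1) >>= benchMap))
            ]
        ]

The actual benchmark groups and names differ; this only shows how gauge is
driven.
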
Benchmarking Code
~~~~~~~~~~~~~~~~~

* ``IO Monad:`` We run the benchmarks in the IO monad so that they are close
  to real life usage. Note that most existing streaming benchmarks use pure
  code or the Identity monad, which may produce entirely different results.
* ``unfoldrM`` is used to generate the stream for two reasons: (1) it is
  monadic, and (2) it keeps the generation overhead low so that the cost of
  the actual streaming operation is amplified. If we generate the stream from
  a list, there is a significant overhead in the generation itself because of
  the intermediate list structure.
* Unless we perform some real IO operation, the operation being benchmarked
  can get completely optimized out in some cases. We generate a random number
  in the IO monad and feed it to the operation being benchmarked to avoid
  that issue.

GHC Inlining
------------

* ``Inlining:`` The GHC simplifier is very fragile, and inlining may affect
  the results in unpredictable ways unless you have spent enough time
  scrutinizing and optimizing everything carefully. Inlining is the biggest
  source of fragility in performance benchmarking. It can easily result in an
  order of magnitude drop in performance just because some operation is not
  correctly inlined. Note that this applies to the benchmarking code as well.
* ``GHC Optimization Flags:`` To make sure we are comparing fairly, we
  compile the benchmarking code, the library code, as well as all
  dependencies using exactly the same GHC flags. GHC inlining and
  specialization optimizations can make the results unpredictable if mixed
  flags are used. See the ``--pedantic`` option of the ``run.sh`` script.
* ``Single file vs multiple files:`` The best way to avoid issues is to have
  all the benchmarking code in a single file. However, in real life that is
  not the case, and we also needed some modularity to scale the benchmarks to
  an arbitrary number of libraries, so we split the code into one file per
  package. As soon as the code was split into multiple files, the performance
  of some libraries dropped, in some cases by 3-4x. Careful sprinkling of
  INLINE pragmas was required to bring it back to the original level (see the
  sketch after this list). Even functions that were just 2 lines of code were
  not automatically inlined.
* When all the code was in a single file, not a single INLINE pragma was
  needed. But when split across multiple files, even functions that were not
  exported from a file needed an INLINE pragma for equivalent performance.
  This is something that GHC may have to look at.
* The effect of inlining varied depending on the library. To make sure that
  we are using the fully optimized combination of inline or non-inline for
  each library, we carefully studied the impact of inlining individual
  operations for each package. The current code is the best we could get for
  each package.
* There is something magical about streamly; we are not sure what it is. Even
  though all other libraries were impacted significantly for many operations,
  streamly seemed almost unaffected by splitting the benchmarking operations
  into a separate file! If we can find out why this is so, we could perhaps
  understand and use GHC inlining in a more predictable manner. Edit: CPS
  seems to be more immune to inlining; as soon as streamly started using a
  direct style, it too became sensitive to inlining.
* This kind of unpredictable, non-uniform impact of moving functions to
  different files shows that we are at the mercy of the GHC simplifier and
  always need to tune performance carefully after refactoring to be sure that
  everything is fine. In other words, benchmarking and optimizing is crucial
  not just for the libraries *but for the users of the libraries as well*.

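As a concrete illustration of the INLINE annotations mentioned above (a
hypothetical example; the module name, function and layout are illustrative,
not the actual repository code), a small per-package wrapper moved into its
own module typically needed an explicit pragma::

    module Benchmarks.Vector (mapBench) where

    import qualified Data.Vector.Fusion.Stream.Monadic as S

    -- Without this pragma GHC may not inline the wrapper across module
    -- boundaries, which can defeat stream fusion and slow the benchmark
    -- down by a large factor.
    {-# INLINE mapBench #-}
    mapBench :: Int -> IO ()
    mapBench n =
        S.mapM_ (\_ -> return ())
            (S.map (+ 1) (S.enumFromStepN n 1 1000000))
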
Streaming Libraries
-------------------

There are two dual paradigms for stream processing in Haskell. In the first
paradigm we represent a stream as a data type and use functions to work on
it. In the second paradigm we represent *stream processors* as data types and
provide them individual data elements to process; there is no explicit
representation of the stream as a data type. In other words, in the first
paradigm we work with a data representation of the stream, and in the second
paradigm we work with a function representation of the stream processor.

Both of these paradigms have equal expressive power. The latter uses monadic
composition for data flow, whereas the former does not need monadic
composition for straight line stream processing and can therefore use it for
higher level composition, e.g. to compose streams in a product style.

To see an example of the first paradigm, let us use the ``vector`` package to
represent a monadic stream of integers as ``Stream IO Int``. This data
representation of the stream is passed explicitly to stream processing
functions like ``filter`` and ``drop`` to manipulate it::

    import qualified Data.Vector.Fusion.Stream.Monadic as S

    stream :: S.Stream IO Int
    stream = S.fromList [1..100]

    main :: IO ()
    main = do
        let str = (S.filter even . S.drop 10) stream
        S.toList str >>= print

Pure lists and vectors are the most basic examples of streams in this
paradigm. The streaming IO libraries just extend the same paradigm to monadic
streaming. The API of these libraries is very similar to that of lists, with
a monad parameter added.

The second paradigm is the direct opposite of the first one: there is no
stream representation in this paradigm; instead, we represent *stream
processors* as data types. A stream processor represents a particular process
rather than data, and we compose processors together to create composite
processors. We can call them stream transducers or simply pipes. Using the
``machines`` package::

    import qualified Data.Machine as S

    producer :: S.SourceT IO Int
    producer = S.enumerateFromTo 1 100

    main :: IO ()
    main = do
        let processor = producer S.~> S.dropping 10 S.~> S.filtered even
        S.runT processor >>= print

Both of these paradigms look almost the same, right? To see the difference,
let's take a look at some types. In the first paradigm we have an explicit
stream type, and the processing functions take the stream as input and
produce the transformed stream::

    stream :: S.Stream IO Int
    filter :: Monad m => (a -> Bool) -> Stream m a -> Stream m a

In the second paradigm there is no stream data type; there are stream
processors, let's call them boxes, that represent a process. We have a
*SourceT* box that represents a single-ended producer and a *Process* box, or
a pipe, that has two ends, an input end and an output end; a ``MachineT``
represents any kind of box. We put these boxes together using the ``~>``
operator and then run the resulting machine using ``runT``::

    producer :: S.SourceT IO Int
    filtered :: (a -> Bool) -> Process a a
    dropping :: Int -> Process a a
    (~>) :: Monad m => MachineT m k b -> ProcessT m b c -> MachineT m k c

Custom pipes can be created using monadic composition and primitives to
receive and send data, usually called ``await`` and ``yield``.

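For example, a tiny custom pipe written with these primitives looks roughly
like this (a sketch using the ``conduit`` package's API; an equivalent can be
written with ``pipes`` or ``machines``)::

    import Control.Monad (when)
    import Data.Conduit (ConduitT, await, yield, runConduit, (.|))
    import qualified Data.Conduit.List as CL

    -- Keep only the even numbers, incrementing each one before sending it
    -- downstream.
    evenPlusOne :: Monad m => ConduitT Int Int m ()
    evenPlusOne = do
        mx <- await
        case mx of
            Nothing -> return ()
            Just x  -> do
                when (even x) (yield (x + 1))
                evenPlusOne

    main :: IO ()
    main =
        -- prints 3, 5, 7, 9 and 11
        runConduit (CL.sourceList [1 .. 10] .| evenPlusOne .| CL.mapM_ print)
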
.. |str| replace:: `streamly <https://github.com/composewell/streamly>`__

+-----------------------------------------------------------------------------+
| Streaming libraries using the direct paradigm.                              |
+------------------------+----------------------------------------------------+
| Library                | Remarks                                            |
+========================+====================================================+
| vector                 | The simplest in this category, provides            |
|                        | transformation and combining of monadic            |
|                        | streams but no monadic composition of streams.     |
|                        | Provides a very simple list-like API.              |
+------------------------+----------------------------------------------------+
| streaming              | * Encodes a return value to be supplied when the   |
|                        |   stream ends. The monad instance passes on the    |
|                        |   streams and combines the return values.          |
|                        | * Functor general                                  |
|                        | * The API is more complicated than vector because  |
|                        |   of the return value and the functor layer.       |
+------------------------+----------------------------------------------------+
| list-t                 | Provides straight line composition of streams      |
|                        | as well as a list-like monadic composition.        |
|                        | The API is simple, just like ``vector``.           |
+------------------------+----------------------------------------------------+
| |str|                  | Like list-t, in addition to straight line          |
|                        | composition it provides a list-like monadic        |
|                        | composition of streams, supports combining streams |
|                        | concurrently, and supports concurrent applicative  |
|                        | and monadic composition.                           |
|                        | The basic API is very much like lists and          |
|                        | almost identical to ``vector`` streams.            |
+------------------------+----------------------------------------------------+

+-----------------------------------------------------------------------------+
| Streaming libraries using the pipes paradigm.                               |
+------------------------+----------------------------------------------------+
| Library                | Remarks                                            |
+========================+====================================================+
| conduit                | ``await`` and ``yield`` data to upstream or        |
|                        | downstream pipes; supports pushing leftovers back. |
+------------------------+----------------------------------------------------+
| pipes                  | ``await`` and ``yield`` data to upstream or        |
|                        | downstream pipes                                   |
+------------------------+----------------------------------------------------+
| machines               | Can await from two sources, left and right.        |
+------------------------+----------------------------------------------------+

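To illustrate the remark about the return value in the ``streaming`` row of
the first table (a sketch assuming the ``Streaming.Prelude`` API), the extra
type parameter ``r`` in ``Stream (Of a) m r`` is threaded through the
computation and handed back alongside the result of a fold::

    import Streaming.Prelude (Of (..))
    import qualified Streaming.Prelude as S

    main :: IO ()
    main = do
        -- S.each builds a Stream (Of Int) IO (); the () is the return value
        -- supplied when the stream ends, and S.sum hands it back alongside
        -- the sum itself.
        total :> r <- S.sum (S.each [1 .. 100 :: Int])
        print (total, r)   -- (5050,())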