# Guidelines for writing well optimized code

Stream operations are always part of a loop. Usually the loop consists of a
stream generation or an unfold operation followed by stream transformation
functions (e.g. map) and then a stream elimination operation or a fold
operation.

The default or most common stream representation used in streamly is
`StreamD` which is a direct style (compare with CPS representation
`StreamK`) stream representation.  All the direct style operations in
a loop "fuse" together to form a tight machine loop eliminating any
intermediate constructors, therefore, reducing allocations and cpu cost.
This elimination of intermediate constructors in a stream loop is known
as stream fusion.

## Writing high performance code using streamly

This section outlines guidelines for writing code using the available
combinators. The guidelines for writing new combinators are provided in
the following section.

Note that there is no absolute benchmark for performance, most of
the time even without following the guidelines you may get excellent
performance for the task at hand. Two important points to keep in mind
before you optimize:

* See if you actually need more performance
* See if the code you are optimizing is in fast path

### Stream Fusion

When your performance requirements are stringent you may want to care
about not breaking stream fusion unnecessarily.  Certain operations do
not fuse and act as a barrier to stream fusion. If these operations are
part of a loop, the loop won't fuse completely. In some cases these may
be necessary but in others it may be possible to replace these with
better fusing operations. These operations include:

* Avoid unnecessary use of stream append operations in code that should
  fuse.  Append operations in general force CPS style breaking stream
  fusion.  Some direct style append operations can fuse but they won't
  scale for more than a few appends.
* Avoid unnecessary use of the stream monad if fusion is important. You
  may use functor or applicative though.
* Concurrent stream operations cannot fuse
* Operations involving exception primitives like catch/throw/mask
  on elements of the stream cannot fuse.
* Stream loops involving FFI calls on the elements of the stream

The haddock documentation includes a note when an operation cannot fuse.

### Unfolds

* Use unfolds especially when higher order operations are involved. For
  example, `unfoldMany` can fuse completely whereas `concatMap` would
  not fuse.
* Use `outerProduct` in `Unfold` module instead of using the monad instance of
  streams to fuse nested loops where performance matters. See the unfold
  benchmarks for an example. `outerProduct` can give you C like nested loop
  performance.

### Inlining

To make sure fusion can occur INLINE any operations that are part of a
stream loop but factored out as separate functions.

### Strictness

Keep the fold and scan accumulators strict. You could also consider
using mutable state in a fold accumulator if the state is large. A
large immutable structure as an accumulator may cause optimization
issues. See the `WordCount` example for an example of this case.

## Writing streamly combinators

To enable stream fusion in a direct style stream all the operations that
are part of a loop must be inlined:

* Use an explicit INLINE pragma on any combinator that could be part of a loop
  to ensure they will be inlined.
* When a higher order function consumes another function then ensure that the
  higher order function is inlined in the same phase or before the function it
  is consuming. Use appropriate inline phases to ensure proper inline ordering
  when required.

### Streams and Unfolds

Stream and Unfold State.  Direct style stream and unfold combinators use a
state data type to represent the internal state of the stream/unfold generator.

* In some cases we may have to add a `FUSE` annotation on the state type to
  ensure that all the internal join points created by GHC are inlined with the
  help of the `fusion-plugin`.
* In rare cases we may need to use a strictness annotation on the state
  to allow fusion. `parseMany` and `splitOnSeq` are such examples.

Step function of a stream or unfold:

* Do not introduce unnecessary states. If there is only one entry point of a
  state then perhaps you want to collapse the state into the state from which
  it is called. More states means more jumps and may affect code locality, and
  efficiency of low level code (e.g. register allocation) because of
  independent placement of the code blocks in different states.

  In some cases we may have to go against the above guideline and introduce a
  separate state even though it is not necessary. See the GroupConsume state in
  foldMany for an example.
* Keep minimum possible data in state. More variables in state may lead
  to poor performance because the state may not fit into registers and the
  spill may cause allocations on each iteration of the loop. Mutable state
  may help in such cases.
* The step function must be annotated such that it gets inlined after the main
  combinator (`INLINE_LATE`).

Multiple yield points or single?:

* A single yield point is usually desirable, however, not always necessary.
  In some cases multiple yield points may in fact be needed for fusion,
  see `splitOnSeq` for an example. Or maybe its fusing because of a
  direct yield instead of going through an indirect common yielding
  state.

Recursion in step function:

* In general, we avoid making the step function recursive. Use the
  `Skip` constructor to remove recursion. Recursive step function can
  introduce optimization barriers that are harder to remove by the GHC
  simplifier.

  However, in some cases it may be better to have a local recursive
  loop. A recursive loop can help us avoid threading around some large
  state values every time. Values that do not change across a loop can
  be factored in the scope outside the loop (static argument transform),
  this way we can create a local loop which is more efficient than
  threading around the state in a larger loop.  See the `splitOnSeq`
  combinator as an example where we use a local recursive loop, it fuses
  and is significantly efficient compared to using `Skip`.

  In a local recursive loop use SPEC and annotate even the rest of the
  loop arguments as strict where needed. We have observed that when the
  arguments were not strict the loop does not fuse (splitOnSeq).

### Fold and Parser drivers

Recursive loop closing operations:

* When writing recursive looping combinators using StreamD (e.g. foldlM' in
  StreamD) use a strict SPEC argument in the recursive loops to ensure that
  "Spec Constructor" optimization removes boxed arguments and reduces
  allocations.

### Fold and Parser combinators

State of a Fold or Parser:

* The accumulator of a fold must be a strict data structure. Use the strict
  data structures provided in `Streamly.Internal.Data` for this purpose or use
  explicit bang annotation to make the data strict.
* In some cases you may need a `FUSE` annotation on the state type to ensure
  that any internal join points created by GHC are always inlined with the help
  of the fusion-plugin.

The step function of a Fold or Parser:

* Never make the fold step recursive, recursion creates an optimization barrier
  for the GHC simplifier. Use `Partial` or `Continue` constructors to avoid
  recursion.

* Sometimes you may need an explicit INLINE on the step function.

### NOINLINE, isolating the closed loop

We want the loop iterations to be optimized and the loop stages to be fused to
generate a tighter loop. However, it is not necessarily optimal to inline the
whole loop itself into a parent function. For example, consider the following
function in the `FileSystem.Handle` benchmarks:

```
{-# NOINLINE readWriteAfter_Stream #-}
readWriteAfter_Stream :: Handle -> Handle -> IO ()
readWriteAfter_Stream inh devNull =
    let readEx = IP.after_ (hClose inh) (S.unfold FH.read inh)
     in S.fold (FH.write devNull) readEx
```

If this is inlined into a parent benchmark group list, this leads to
many times performance degradation. That's because inlining the loop
into a bigger structure interferes with the optimization of the loop
body itself and it may not fuse. Whereas it is desirable to INLINE all
the stages of a loop, it is often not useful to inline the whole loop
itself, in fact we may have to occasionally use a `NOINLINE` so that the
compiler does not inline it.

### NoSpecConstr

It is not always useful to specialize function calls for all constructors, some
constructors may just add to code bloat or add overhead of passing unboxed
values in a loop. In such cases the `NoSpecConstr` annotation can be useful.
See the `parselMx'` function in `StreamD` module.

### StreamK operations

StreamK uses foldr/build fusion to a very limited degree. StreamK is not the
primary representation in streamly but is used for several operations that
cannot scale in StreamD representation. StreamK is relatively immune to compiler
optimizations. In some cases you may need an INLINE pragma to improve the
performance.

### How to debug non-fusing code?

Strip down the code to a minimal version until it starts fusing and then
start building it up from there adding more things incrementally. At
each stage keep checking if the code is fusing. At some point it
won't fuse. See what we added that made the code not fuse. Go through
the guidelines above to check if we did something that that is not
recommended. Or raise an issue for GHC to be fixed if possible.

# GHC Optimizations In Streamly

There are three important levels of optimizations used in streamly:

* CPS style stream representation (StreamK) to direct style (StreamD)
  conversion and vice-versa using rewrite rules
* Stream fusion using case-of-case optimization in StreamD
* foldr/build fusion using rewrite rules in StreamK

## INLINE Phases

Inlining of functions at the right time is crucial for all these optimization
to work.  A missing inline or inline in an incorrect GHC simplifier phase can
adversely impact performance.  We use three builtin phases of GHC simplifier
for inlining i.e. phase 0, 1 and 2. We have defined them as follows in
  `src/inline.hs`:

```
#define INLINE_EARLY  INLINE [2]
#define INLINE_NORMAL INLINE [1]
#define INLINE_LATE   INLINE [0]
```

We also use INLINE [3] in some cases.

## `fromStreamD/toStreamD` Fusion

The combinators in `Streamly.Prelude` are defined in terms of combinators in
`Streamly.Internal.Data.Stream.StreamD` (Direct style streams) or
`Streamly.Internal.Data.Stream.StreamK` (CPS style streams). We convert the
stream from `StreamD` to `StreamK` representation or vice versa in some cases.

Most operations use the StreamD representation, however, stream append
operations and monad instances use StreamK representation because StreamD would
not perform well in these cases. We use rewrite rules to convert from one
representation to another when required. For this reason the combinators in
Streamly.Prelude are written using fromStreamD/fromStreamK etc.

In the first inlining phase (INLINE_EARLY or INLINE) we expand the combinators
in `Streamly.Prelude` into fromStreamD/fromStreamK/toStreamD/toStreamK and
combinators defined in StreamD or StreamK modules. Once we do that
fromStreamD/toStreamD get exposed and we can apply rewrite rules to rewrite
transformations like `fromStreamK .  toStreamK` to `id`. A plain `INLINE`
pragma is usually enough on combinators in `Streamly.Prelude`.

```
{-# RULES "fromStreamK/toStreamK fusion"
  forall s. toStreamK (fromStreamK s) = s #-}
```

Also, we have to prevent fromStreamK and toStreamK themselves from inlining in
this phase so that rewrite rules can be applied on them, therefore, we annotate
these functions with `INLINE_LATE`.

## Fallback Rules

In some cases, if the operation could not fuse we want to use a fallback
rewrite rule in the next phase. For such cases we use the INLINE_EARLY phase
for the first rewrite and the INLINE_NORMAL phase for the fallback rules.

The fallback rules make sure that if we could not fuse the direct style
operations then better use the CPS style operation, because unfused direct
style would have worse performance than the CPS style ops.

```
{-# INLINE_EARLY unfoldr #-}
unfoldr :: (Monad m, IsStream t) => (b -> Maybe (a, b)) -> b -> t m a
unfoldr step seed = fromStreamS (S.unfoldr step seed)
{-# RULES "unfoldr fallback to StreamK" [1]
     forall a b. S.toStreamK (S.unfoldr a b) = K.unfoldr a b #-}
```

## High Level Operation Fusion

Since each high level combinator in `Streamly.Prelude` is wrapped in
`fromStreamD/toStreamD` etc. the combinator fusion cannot work unless we have
removed those and exposed consecutive operations e.g. a `map` followed by
another `map`.  Assuming that redundant `fromStreamK/toStreamK` have been
removed in the `INLINE_EARLY` phase, we can then apply the combinator fusion
rules in the `INLINE_NORMAL` phase.  For example, we can fuse two `map`
operations into a single `map` operation.  Note that now we have exposed the
`StreamD/StreamK` implementations of combinators and the rules would apply on
those.

## Inlining Higher Order Functions

Note that partially applied functions cannot be inlined. So if we have a code
like this:

```
concatMap1 src = runStream $ S.concatMap (S.replicate 3) src
```

We want to ensure that `concatMap` gets inlined before `replicate` so that
`replicate` becomes fully applied before it gets inlined. Currently ensuring
that both of them are inlined in the same phase (`INLINE_NORMAL`) seems to be
enough to achieve that. In general, we should try to ensure that higher order
functions are inlined before or in the same phase as the functions they can
consume as arguments. This means `StreamD` combinators should not be marked
as `INLINE` or `INLINE_EARLY`, instead they should all be marked as
`INLINE_NORMAL` because higher order functions like `concatMap`/`map`/`mapM`
etc are marked as `INLINE_NORMAL`. `StreamD` functions in other modules like
`Streamly.Data.Array.Foreign` should also follow the same rules.

## Stream Fusion

In StreamD combinators, inlining the inner step or loop functions too early
i.e. in the same phase or before the outer function is inlined may block stream
fusion opportunities. Therefore, the inner step functions and folding loops are
marked as INLINE_LATE.

## Specialization

In some cases, the `step` function in `StreamD` does not get specialized when
inlined unless it is provided with an explicit signature or made a lambda, for
example, in the `replicate/replicateM` combinator we need the type annotation
on `i` to get it specialized:

```
    {-# INLINE_LATE step #-}
    step _ (i :: Int) =
        if i <= 0
        then return Stop
        else do
                x <- action
                return $ Yield x (i - 1)
```

`-flate-specialise` also helps in this case.

## Stream and Fold State Data Structures

Since state is an internal data structure threaded around in the loop, it is a
good practice to use strict unboxed fields for state data structures where
possible. In most cases it is not necessary, but in some cases it may affect
fusion and make a difference of 10x performance or more.  For example, using
non-strict fields can increase the code size for internal join points and
functions created during transformations, which can affect the inlining of
these code blocks which in turn can affect stream fusion.

See https://gitlab.haskell.org/ghc/ghc/issues/17075 .