Ticket #5916 (new bug)

Opened 16 months ago

Last modified 3 months ago

runST isn't free

Reported by: tibbe Owned by:
Priority: normal Milestone: 7.6.2
Component: Compiler Version: 7.4.1
Keywords: Cc: carter.schonwald@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

While optimizing some code I discovered that runST isn't free. I had a function on the form:

f ... = ...let x = runST in (f x)...

Manually transforming it into

f ... = runST (g ...)
 where g ... = do
    ...
    x <- ...
    g x
    ...

(The real example is at  https://github.com/tibbe/unordered-containers/commit/58b7815a057b3575c58b746d5971d59d806b1333)

improved performance quite a bit on this, already highly tuned, function. Unfortunately, combining all the calls to runST into one and pulling them "up" is not all good. The code is now less modular, has a somewhat over specified evaluation order, and generally looks more imperative.

The cause of the decrease in performance is that runSTRep cannot be inlined, which causes allocation of closures inline in the code (where none should be necessary.)

The comment next to runSTRep explains why it's implemented the way it is, but I wonder if matters could be improved by making it a primop. If we create a fresh state token every time, instead of reusing realWorld#, it should be impossible for mutable structures to let-float and become CAFs (which is what runSTRep tries to avoid.)

Change History

Changed 16 months ago by simonmar

  • difficulty set to Unknown

I chatted with tibbe a little on IRC about this, and one conclusion is that we should probably try harder to inline runSTRep. It has to be inlined very late in order to avoid bad things happening; indeed we used to try to do that, but it didn't work (see the comment on runSTRep). If we did this, then the allocation of the function closure for passing to runST could disappear, which might be critical in an inner loop.

Changed 16 months ago by simonpj

I can't make sense of the exmaple in the intro. Could you make a tiny test case that demonstrates the problem?

Changed 16 months ago by tibbe

I'm afraid I'm too stuck in my current use case to think of examples from some other domain. This is not a standalone test case, but hopefully it at least makes the problem clear. Imagine we want to write a function:

-- | /O(n)/ Update the element at the given position in this array.
update :: Array e -> Int -> e -> Array e

where Array is some simple wrapper around Array#.

We can write this function using another function, update', in ST.

-- | /O(n)/ Update the element at the given position in this array.
update' :: Array e -> Int -> e -> ST s (Array e)
update' ary idx b = do
    mary <- thaw ary 0 count  -- Copies all elements into a mutable array
    write mary idx b          -- Update the one element
    unsafeFreeze mary         -- Return as an immutable array
  where !count = length ary

update ary idx b = runST (update' ary idx b)

Now, say we have a function that calls update many times in a loop:

data RoseTree a = Rose a (Array (Rose a)) | Nil

insert :: a -> RoseTree a -> RoseTree a
insert x Nil = Rose x ...
insert x (Rose y subtrees) 
    | x == y    = Rose x subtrees
    | otherwise = Rose y $ update subtrees idx (insert x subtreeToUpdate)
  where
    idx = subtreeIndex x y  -- Pick the right subtree to update
    subtreeToUpdate = index subtrees idx
    
index :: Array a -> Int -> a
index = ...

(If you find insert hard to understand, it might be helpful to consider what insert would look like if we were using a tuple of fixed size (e.g. 2 for a binary tree) instead of an Array.)

insert will end up calling runST many times. If we want to avoid the cost of runST at each level of the tree, we could structure the code like so:

insert :: a -> RoseTree a -> RoseTree a
insert x0 t0 = runST (go x0 t0)
    go x Nil = return $ Rose x ...
    go x (Rose y subtrees) 
        | x == y    = return $ Rose x subtrees
        | otherwise = do
            st <- go x subtreeToUpdate
            ary <- update' subtrees idx st
            return $ Rose y ary
      where
        idx = subtreeIndex x y  -- Pick the right subtree to update
        subtreeToUpdate = index subtrees idx

N.B. We have replace the call to update, which embeds a call to runST, with a call to update', which doesn't.

We have now reduced the number of calls to runST from O(log n) to 1, which turns out to be a performance improvement since runST allocates.

Changed 16 months ago by simonpj

Thanks for the example. What I don't understand is why does runST allocate. That's the example I'd like to see. In your code

update ary idx b
  = runST (update' ary idx b)

  = {- inline runST; g is a coercion between ST and STRep -}
     runSTRep (update' ary idx b) |> g)

  = {- inline the wrapper on update'
      runSTRep ($wupdate' ary idx b)

  = {- inline runSTRep, admittedly late -}
     case $wupdate' ary idx b realWorld# of (# s', _ #) -> s'

No allocation. So where am I wrong? A tiny repro case of this phenomenon would be great.

I understand about not wanting to take the runST outwards... apart from anything else it makes it sequential and it should not be!

Changed 16 months ago by simonpj

This is wrong I think, runSTRep is NOINLINE:

    {-# NOINLINE runSTRep #-}
    runSTRep :: (forall s. STRep s a) -> a
    runSTRep st_rep = case st_rep realWorld# of
        (# _, r #) -> r

We thus need to create a closure containing the arguments to $wupdate' and give it as an argument to runSTRep.

Changed 16 months ago by simonpj

Conclusion: inline runSTRep, but very late. The comments suggest this is difficult. Perhaps we could arrange it so full laziness (which is what causes the wrong let floating according to the comment) happens before the last inliner phase. But this feels a bit brittle still, we have to be really careful so any optimizations we run won't result in things becoming shared CAFs.

Changed 16 months ago by tibbe

I locally defined my own versions of runST and runSTRep, which I inlined in the last phase. I looked at the core and validated that in my particular case nothing bad had happened. This let me measure the actual performance difference: 7% on my particular benchmark. In addition it let me move back to nice, clean code.

Changed 12 months ago by pcapriotti

  • milestone set to 7.6.1

Changed 9 months ago by igloo

  • milestone changed from 7.6.1 to 7.6.2

Changed 3 months ago by carter

  • cc carter.schonwald@… added
Note: See TracTickets for help on using tickets.