Ticket #2289 (new bug)

Opened 4 years ago

Last modified 4 months ago

Needless reboxing of values when returning from a tight loop

Reported by: dons Owned by:
Priority: low Milestone: 7.4.1
Component: Compiler Version: 6.8.2
Keywords: boxing, loops, performance Cc: dons@…, johan.tibell@…, michal.terepeta@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

GHC wants to box up strict values when returning from tight inner loops, even when they're immediately taken apart. This leads to redundant instructions in the bodies of tight loops, and more code.

It affects, in particular, loops that result from fusion, which need to be tight, but often return multiple values via unlifted pairs.

Consider this program:

{-# OPTIONS -fexcess-precision #-}
{-# LANGUAGE TypeOperators #-}

import System.Environment
import Text.Printf
import Data.Array.Vector

mean :: UArr Double -> Double
mean arr = s / fromIntegral l
  where
    s :*: l = foldlU k (0 :*: 0) arr :: (Double :*: Int)
    k (s :*: n) x = s+x :*: n+1

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (mean (enumFromToFracU 1 d))

It generates this rather good Core (ghc 6.8.2):

$s$wfold_s1rB :: Double#
               -> Int#
               -> Double#
               -> (# Double, Int #)

$s$wfold_s1rB =
\ (sc_s1rr :: Double#)
  (sc1_s1rs :: Int#)
  (sc2_s1rt :: Double#) ->
  case >## sc_s1rr y_a1pr of wild4_X1no {
    False ->
      $s$wfold_s1rB
        (+## sc_s1rr 1.0)
        (+# sc1_s1rs 1)
        (+## sc2_s1rt sc_s1rr);
    True -> (# D# sc2_s1rt, I# sc1_s1rs #)
  };
} in 
case $s$wfold_s1rB 2.0 1 1.0 of ww_s1qg { (# ww1_s1qi, ww2_s1qj #) ->
case ww1_s1qi of wild4_a1mC { D# x_a1mE ->
case ww2_s1qj of wild5_aP6 { I# x1_aP8 ->
case /## x_a1mE (int2Double# x1_aP8)
of wild21_a1mK { __DEFAULT ->
D# wild21_a1mK

But note, what's this?

    True -> (# D# sc2_s1rt, I# sc1_s1rs #)
  };
} in 
case $s$wfold_s1rB 2.0 1 1.0 of ww_s1qg { (# ww1_s1qi, ww2_s1qj #) ->
case ww1_s1qi of wild4_a1mC { D# x_a1mE ->
case ww2_s1qj of wild5_aP6 { I# x1_aP8 ->
case /## x_a1mE (int2Double# x1_aP8)

The return values of what was a strict pair are boxed, placed in an unboxed tuple, and then immediately unboxed and the division takes place.

Ok, let's isolate this. Here, the boxed return, from the inner loop:

mean_s19V :: Double#
           -> Int#
           -> Double#
           -> (# Double, Int #)

mean_s19V =
\ (ds1_dD3 :: Double#)
  (ds2_dD4 :: Int#)
  (ds3_dD5 :: Double#) ->
  case >## ds1_dD3 d#_aoG of wild4_Xw {
    False ->
      mean_s19V
        (+## ds1_dD3 1.0)
        (+# ds2_dD4 1)
        (+## ds3_dD5 ds1_dD3);
    True -> (# D# ds3_dD5, I# ds2_dD4 #)
  };
} in 
case mean_s19V 2.0 1 1.0 of wild4_Xr { (# ds1_dCV, ds2_dCW #) ->
case ds1_dCV of wild5_Xv { D# x_aoR ->
case ds2_dCW of wild6_Xy { I# y_aoS ->
case /## x_aoR (int2Double# y_aoS) of wild7_XB { __DEFAULT ->
D# wild7_XB

And the inner loop and exit:

s1bd_info:

  -- what's this stuff?
  leaq        32(%r12), %rax
  cmpq        %r15, %rax
  movq        %rax, %r12
  ja  .L17

  -- ok, to business:
  ucomisd     5(%rbx), %xmm5
  ja  .L19
  movapd      %xmm6, %xmm0
  leaq        -32(%rax), %r12
  incq        %rsi
  addsd       %xmm5, %xmm0
  addsd       .LC1(%rip), %xmm5
  movapd      %xmm0, %xmm6
  jmp s1bd_info


.L19:
  movq        %rsi, -16(%rax)
  movq        $base_GHCziBase_Izh_con_info, -24(%rax)
  movq        $base_GHCziFloat_Dzh_con_info, -8(%rax)
  movsd       %xmm6, (%rax)
  leaq        -7(%rax), %rbx
  leaq        -23(%rax), %rsi
  jmp *(%rbp)

Now, I can avoid the reboxing manually:

mean_s19R :: Double#
           -> Int#
           -> Double#
           -> (# Double#, Int# #)

mean_s19R =
\ (ds1_dCZ :: Double#)
  (ds2_dD0 :: Int#)
  (ds3_dD1 :: Double#) ->
  case >## ds1_dCZ d#_aoG of wild4_Xw {
    False ->
      mean_s19R
        (+## ds1_dCZ 1.0)
        (+# ds2_dD0 1)
        (+## ds3_dD1 ds1_dCZ);
    True -> (# ds3_dD1, ds2_dD0 #)
  };
} in 
case mean_s19R 2.0 1 1.0 of wild4_Xr { (# x_aoR, y_aoS #) ->
case /## x_aoR (int2Double# y_aoS) of wild5_Xv { __DEFAULT ->
D# wild5_Xv

And we get:

s1b9_info:
  -- hey , our junk is gone!

  ucomisd     5(%rbx), %xmm5
  ja  .L17
  movapd      %xmm6, %xmm0
  incq        %rsi
  addsd       %xmm5, %xmm0
  addsd       .LC1(%rip), %xmm5
  movapd      %xmm0, %xmm6
  jmp s1b9_info

-- cool, that was it, let's go home:
.L17:
  movapd      %xmm6, %xmm5
  movq        %rsi, %rbx
  jmp *(%rbp)

Which is a much better result. The loop is tighter.

What can be done here?

Change History

Changed 4 years ago by dons

Here's a simple test case:

{-# LANGUAGE TypeOperators #-}
{-# OPTIONS -funbox-strict-fields #-}

import System.Environment
import Text.Printf

data P a b = P !a !b

mean :: Double -> Double -> P Double Int
mean n m = go n 0 0
    where
        go :: Double -> Int -> Double -> P Double Int
        go x l s | x > m      = P s l
                 | otherwise  = go (x+1) (l+1) (s+x)

main = do
    [d] <- map read `fmap` getArgs
    printf "%f\n" (case mean 1 d of
                        (P x y) -> x / fromIntegral y    )

Yields:

$wgo_s1az :: Double# -> Int# -> Double# -> (# Double, Int #)
$wgo_s1az =
    \ (ww_X1au :: Double#)
    (ww1_X1az :: Int#)
    (ww2_X1aE :: Double#) ->
        case >## ww_X1au y_a178 of wild4_X1g {
            False ->
                $wgo_s1az
                (+## ww_X1au 1.0)
                (+# ww1_X1az 1)
                (+## ww2_X1aE ww_X1au);
            True -> (# D# ww2_X1aE, I# ww1_X1az #)
        };

    } in 
        case $wgo_s1az 2.0 1 1.0 of ww_s1a2 { (# ww1_s1a4, ww2_s1a5 #) ->
        case ww1_s1a4 of wild4_a17r { D# x_a17t ->
        case ww2_s1a5 of wild5_aEV { I# x1_aEX ->
        case /## x_a17t (int2Double# x1_aEX)
        of wild21_a17z { __DEFAULT ->
        D# wild21_a17z

Exhibiting the reboxing.

Running this:

$ time ./G 1e9                  
500000000.067109
./G 1e9  2.21s user 0.00s system 99% cpu 2.225 total

While if we prevent the reboxing, by moving the division inside the loop:

$wgo_s1a0 :: Double# -> Int# -> Double# -> Double#
$wgo_s1a0 =
 \ (ww_X19X :: Double#)
   (ww1_X1a2 :: Int#)
   (ww2_X1a7 :: Double#) ->
   case >## ww_X19X y_a16C of wild4_X1g {
     False ->
       $wgo_s1a0
         (+## ww_X19X 1.0)
         (+# ww1_X1a2 1)
         (+## ww2_X1a7 ww_X19X);
     True -> /## ww2_X1a7 (int2Double# ww1_X1a2

We get faster code:

$ time ./G 1e9                  
500000000.067109
./G 1e9  1.84s user 0.01s system 99% cpu 1.861 total

So I suspect the boxing causes a heap check to end up inside the loop (run on every iteration?), and thus a performance loss.

Changed 4 years ago by dons

Note that if we specialise the constructor manually, we can get the correct unboxing: That is, just using strict pairs isn't enough:

data P a b = P {-# UNPACK #-} !a {-# UNPACK #-} !b

won't work.

where we use P only for P Double Int, I'm unable to get the return value unboxed.

If we specialise P, data P = P !Double !Int, then we do get the unpacking, as expected.

This makes strict pairs a little less useful for accumulating state -- we have to specialise them directly ourselves to avoid the reboxing penalty.

Changed 4 years ago by simonpj

  • difficulty set to Unknown

Nice example. I took a little look at it. Two things

First, yes GHC never does a heap-check at the start of an alternative of a primop case; that is, one whose scrutinee is just a primop application. In this example that's bad, because there is no allocation before the conditional, and the hot path does no allocation at all.

The fix is to put the heap check at the start of the alternatives, if no allocation precedes the case itself. This would require a significant (but not drastic) change to the code generator. Happily, it'll be a much easier change when John Dias's new back end comes on stream, we'll hold it till then.

Second, and orthogonally, this loop has a nested CPR property. There is no reason that the CPR analyser can't deal with nested stuff, but it doesn't. There's a nice little project there too.

Either of these would fix this particular example, but both are valuable in their own right.

Simon

Changed 4 years ago by dons

Looking in the CPR analyser, I see the following comment:

The analysis here detects nested CPR information. For example, if a function returns a constructed pair, the first element of which is a constructed int, then the analysis will detect nested CPR information for the int as well. Unfortunately, the current transformations can't take advantage of the nested CPR information. They have (broken now, I think) code which will flatten out nested CPR components and rebuild them in the wrapper, but enabling this would lose laziness. It is possible to make use of the nested info: if we knew that a caller was strict in that position then we could create a specialized version of the function which flattened/reconstructed that position.

It is not known whether this optimisation would be worthwhile.

So, there's some skeleton code in their for the nested CPR stuff. And the CPR analyser seems pretty small too. Would working on this be worthwhile now?

Changed 4 years ago by simonpj

Rats. I'd forgotten about the strictness question:

f :: Int -> (Int,Int)
f x = (g x, h x)

Suppose g and h have the CPR property -- that is, they explicitly return a boxed value. Then it's a mistake to transform to

f x = case (g x, h x) of { (I# r1, I# r2) ->
      (I# r1 ,I# r2) }

because that'd make f too strict. But in your example, g and h are themselves constructors.

My conclusion: for the nested part of CPR analysis we do not want to "look through" function calls, but rather look only for literal constructor applications. I have not thought about how much this'd affect the analysis.

Provided the analysis was modified in this way, it shouldn't be too hard to modify the worker/wrapper part to take account of it.

But NB that CprAnalyse? is dead code; the current analysis is done as part of strictness analysis in DmdAnal?. And the strictness analyser itself needs love and attention. So much to do, so little time.

Simon

Changed 4 years ago by simonpj

(Retrying, having messed up typesetting.)

Rats. I'd forgotten about the strictness question:

f :: Int -> (Int,Int)
f x = (g x, h x)

Suppose g and h have the CPR property -- that is, they explicitly return a boxed value. Then it's a mistake to transform to

f x = case (g x, h x) of { (I# r1, I# r2) ->
      (I# r1 ,I# r2) }

because that'd make f too strict. But in your example, g and h are themselves constructors.

My conclusion: for the nested part of CPR analysis we do not want to "look through" function calls, but rather look only for literal constructor applications. I have not thought about how much this'd affect the analysis.

Provided the analysis was modified in this way, it shouldn't be too hard to modify the worker/wrapper part to take account of it.

But NB that CprAnalyse is dead code; the current analysis is done as part of strictness analysis in DmdAnal. And the strictness analyser itself needs love and attention. So much to do, so little time.

Changed 4 years ago by simonpj

See also #2387, which shows another example of the same phenomenon. (I'm closing #2387 to leave just this one open.)

Simon

Changed 4 years ago by igloo

  • milestone set to 6.10 branch

Changed 4 years ago by dons

Another case we'd like genralised CPR to work is for sum types.

Consider the task of summing integers from a file, one per line. We need to parse each line, returning possibly success, in a tight loop:

import qualified Data.ByteString.Char8 as S

main = print . go 0 =<< S.getContents
  where
    go !n s = case S.readInt s of
                  S.Nothing    -> n
                  S.Just (k,t) -> go (n+k) (S.tail t)

Where

readInt :: ByteString -> Maybe (Int, ByteString)

We'd like the Just/Nothing tag returned in a register, in an ideal world. And the components of the pair as well. Currently we have to monomorphise the type, and flatten it , to get register returns here.

Note that Clean uses a triple of (Bool, Int, String) for this kind of thing.

Changed 4 years ago by simonpj

Don't hold your breath for unboxed sum types. For example, if the Bool is True, the other two fields may be uninitialised, and should not be followed by the GC. I suppose if you were prepared to stubs initialised to (error "Bad"), then you could do this worker/wrapper split for go:

readInt :: String -> Maybe (Int, String)
readInt s = case readInt_w s of
               (# True,  n, s #) -> Nothing
               (# False, n, s #) -> Just (n,s)

readInt_w :: String -> (# Bool, Int, String #)
readInt_w s = case <readInt_body> of
                Just (n,s) -> (# False, n, s #)
                Nothing    -> (# True, error "BAD", error "BAD" #)

Things get harder if there are more constructors in the sum type, or the constructors have more arguments.

Simon

Changed 3 years ago by guest

  • cc changed from dons@galois.com to dons@galois.com

Changed 3 years ago by simonmar

  • architecture changed from Unknown to Unknown/Multiple

Changed 3 years ago by simonmar

  • os changed from Unknown to Unknown/Multiple

Changed 3 years ago by igloo

  • milestone changed from 6.10 branch to 6.12 branch

Changed 3 years ago by simonpj

See also #2387, and #1600

Changed 2 years ago by simonmar

I believe this example fits into the same category. We have a recursive tree traversal in the ST monad that returns an Int, and we want the Int unboxed. Here's the complete code, both the version that doesn't optimise as well as we'd like, and the hand-optimised version:

{-# LANGUAGE BangPatterns, UnboxedTuples, MagicHash #-}
module Test where

import Data.Array.ST
import Control.Monad.ST
import Data.Array.Base
import GHC.ST
import GHC.Exts

data Tree
      = Nil
      | Node {-#UNPACK#-} !Int
                          !Tree
                          !Tree
             {-#UNPACK#-} !Int

#if 0
-- The code we want to write
traverse :: Tree -> STUArray s Int Int -> ST s Int
traverse Nil                     !arr = return 0
traverse (Node item child alt w) !arr = do
  childw <- traverse child arr
  altw   <- traverse alt arr
  itemw <- unsafeRead arr item
  unsafeWrite arr item (itemw + childw + w)
  return $! childw + w + altw
#else
-- The code we have to write
traverse :: Tree -> STUArray s Int Int -> ST s Int
traverse tree arr = ST $ \s -> 
  case traverse' tree arr s of { (# s', i #) -> (# s', I# i #) }
  where
  traverse' Nil                             !arr s  = (# s, 0# #)
  traverse' (Node item child alt w@(I# w#)) !arr s0 =
     case traverse' child arr s0 of { (# s1, childw #) ->
     case traverse' alt arr   s1 of { (# s2, altw   #) ->
     case unsafeRead arr item of { ST f -> case f s2 of { (# s3, I# itemw #) ->
     case unsafeWrite arr item (I# itemw + I# childw + w) of { ST f -> case f s2 of { (# s4, _ #) ->
     (# s4, childw +# w# +# altw #)
     }}}}}}
#endif

Changed 2 years ago by simonmar

  • failure set to Runtime performance bug

Changed 22 months ago by igloo

  • milestone changed from 6.12 branch to 6.12.3

Changed 20 months ago by igloo

  • priority changed from normal to low
  • milestone changed from 6.12.3 to 6.14.1

Changed 14 months ago by igloo

  • milestone changed from 7.0.1 to 7.0.2

Changed 12 months ago by tibbe

  • cc johan.tibell@… added

Changed 11 months ago by igloo

  • milestone changed from 7.0.2 to 7.2.1

Changed 11 months ago by michalt

  • cc michal.terepeta@… added

Changed 4 months ago by igloo

  • milestone changed from 7.2.1 to 7.4.1
Note: See TracTickets for help on using tickets.