Ticket #5505 (closed bug: fixed)

Opened 20 months ago

Last modified 6 months ago

Program runs faster with profiling than without

Reported by: simonpj Owned by:
Priority: high Milestone: 7.4.1
Component: Compiler Version: 7.6.1
Keywords: Cc: doaitse@…, scpmw@…, anton.nik@…, bgamari@…, tomberek@…
Operating System: Windows Architecture: Unknown/Multiple
Type of failure: Runtime performance bug Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

This program, reported by Doaitse, runs nearly twice as fast when it is profiled than when it isn't. You need to install the Utrecht library uulib:

cabal install -p uulib

Now compile and run UULib.hs with and without -prof -auto-all. (It makes little difference whether or not you use -O.) When you run, here are the stats for GHC 7.0.3:

----------------  NOT PROFILED ----------------
bash-3.1$ ./uulib +RTS -s
c:\tmp\uulib.exe +RTS -s 
138000
      51,077,556 bytes allocated in the heap
      19,153,500 bytes copied during GC
       4,914,980 bytes maximum residency (5 sample(s))
         519,828 bytes maximum slop
              12 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:    93 collections,     0 parallel,  4.71s,  4.70s elapsed
  Generation 1:     5 collections,     0 parallel,  2.73s,  2.76s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.05s  (  0.11s elapsed)
  GC    time    7.44s  (  7.45s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.49s  (  7.56s elapsed)

  %GC time      99.4%  (98.6% elapsed)

  Alloc rate    1,091,377,449 bytes per MUT second

  Productivity   0.6% of total user, 0.6% of total elapsed


------------- PROFILED ------------------------
bash-3.1$ ./uulib +RTS -s
c:\tmp\uulib.exe +RTS -s 
138000
      79,692,548 bytes allocated in the heap
      31,874,676 bytes copied during GC
       7,774,720 bytes maximum residency (5 sample(s))
         257,912 bytes maximum slop
              19 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   147 collections,     0 parallel,  2.65s,  2.66s elapsed
  Generation 1:     5 collections,     0 parallel,  1.33s,  1.34s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.09s  (  0.16s elapsed)
  GC    time    3.98s  (  3.99s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.07s  (  4.15s elapsed)

  %GC time      97.7%  (96.1% elapsed)

  Alloc rate    851,407,014 bytes per MUT second

  Productivity   2.3% of total user, 2.3% of total elapsed

The profiled one:

  • Allocates more
  • But runs nearly twice as fast

Looking at the stats, though, you can see that in both cases 99% of execution time spent in the garbage collector. I don't think I've ever seen a program spend that much time in GC.

So I don't really know what's going on, but it's clearly a rather amazing program.

Attachments

UULib.hs Download (0.6 KB) - added by simonpj 20 months ago.
Main.hs Download (1.7 KB) - added by tomberek 6 months ago.

Change History

Changed 20 months ago by simonpj

  Changed 20 months ago by scpmw

It also seems to make no difference whether you use -auto-all or not, ruling out the annotations as the villain. This is most definitely some sort of garbage collection issue. ThreadScope? shows extremely long garbage collection phases - over 0.5 seconds each towards the end.

When I turn my profiling code to look at the garbage collection phases, it gives me >98% time spent directly in evacuate, for whatever that's worth. If only I had a per-line profile now... :-)

  Changed 20 months ago by simonmar

  • owner set to simonmar
  • priority changed from normal to high
  • milestone set to 7.4.1

  Changed 20 months ago by simonmar

Fascinating bug. I'm going to have to sleep on this and see if a fix occurs to me.

For now, you can workaround it by adding +RTS -Z:

$ ./UULib +RTS -s -Z                
./UULib +RTS -s -Z 
138000
     110,831,992 bytes allocated in the heap
      51,956,344 bytes copied during GC
      17,880,048 bytes maximum residency (6 sample(s))
       3,212,328 bytes maximum slop
              34 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   190 collections,     0 parallel,  0.23s,  0.24s elapsed
  Generation 1:     6 collections,     0 parallel,  0.07s,  0.07s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.11s  (  0.11s elapsed)
  GC    time    0.30s  (  0.30s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.41s  (  0.41s elapsed)

Or alternatively avoid using unsafePerformIO (I presume you're using that somewhere in the uulib package?), or change it to unsafeDupablePerformIO.

  Changed 19 months ago by lelf

  • cc anton.nik@… added

  Changed 18 months ago by bgamari

  • cc bgamari@… added

My GGen ( https://github.com/bgamari/GGen) program also exhibits this behavior.

               Memory usage       Runtime
Profiling         20MB              14.15s
No Profiling      13MB              27.86s

I'm currently looking at GGen commit 49dbdd994291e8ef75d0119fa28f84164c5fff77 with the input file found at  http://www.thingiverse.com/download:31703. Pass the input file name on the command line.

  Changed 18 months ago by bgamari

Here are -sstderr logs for both the profiling and non-profiling cases. Note that in my case relatively little time is spent in GC,

With profiling

../../GGen/dist/build/GGen/GGen z_endstop_clamp_and_mount.stl +RTS -sstderr 
  10,355,015,584 bytes allocated in the heap
     196,136,736 bytes copied during GC
       7,760,328 bytes maximum residency (27 sample(s))
         479,656 bytes maximum slop
              20 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 19768 collections,     0 parallel,  0.44s,  0.47s elapsed
  Generation 1:    27 collections,     0 parallel,  0.15s,  0.16s elapsed

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   13.56s  ( 15.97s elapsed)
  GC    time    0.59s  (  0.64s elapsed)
  RP    time    0.00s  (  0.00s elapsed)
  PROF  time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   14.15s  ( 16.60s elapsed)

  %GC time       4.2%  (3.8% elapsed)

  Alloc rate    763,501,989 bytes per MUT second

  Productivity  95.8% of total user, 81.6% of total elapsed

Without profiling

../../GGen/dist/build/GGen/GGen z_endstop_clamp_and_mount.stl +RTS -sstderr 
  18,234,515,824 bytes allocated in the heap
     128,446,200 bytes copied during GC
       5,211,504 bytes maximum residency (23 sample(s))
         325,792 bytes maximum slop
              13 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 34960 collections,     0 parallel,  0.62s,  0.71s elapsed
  Generation 1:    23 collections,     0 parallel,  0.10s,  0.10s elapsed

  INIT  time    0.01s  (  0.00s elapsed)
  MUT   time   27.14s  ( 31.47s elapsed)
  GC    time    0.72s  (  0.81s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   27.86s  ( 32.27s elapsed)

  %GC time       2.6%  (2.5% elapsed)

  Alloc rate    671,650,697 bytes per MUT second

  Productivity  97.4% of total user, 84.1% of total elapsed

follow-up: ↓ 8   Changed 18 months ago by bgamari

Interestingly enough, adding some SPECIALIZE pragmas to a few hotspots (commit b08b2b3680548f47c71a90db20d8b1310392fec9) seems to fix this behavior (runtime of 10s with profiling, 5s without).

in reply to: ↑ 7   Changed 18 months ago by simonmar

Replying to bgamari:

Interestingly enough, adding some SPECIALIZE pragmas to a few hotspots (commit b08b2b3680548f47c71a90db20d8b1310392fec9) seems to fix this behavior (runtime of 10s with profiling, 5s without).

I strongly suspect your bug is different to the original one reported in this ticket. Could you open a new ticket please? Also, I looked at your example, but it has rather a lot of dependencies (cairo, attoparsec, vector-spaces etc.) and not all of them work with GHC HEAD. Is it possible to make a more self-contained test case?

  Changed 18 months ago by bgamari

Given I have pretty much no idea what causes the behavior to begin with, I wouldn't know where to start in constructing a simple test case. Moreover, with the addition of the mentioned pragmas, the bug has effectively vanished.

  Changed 18 months ago by marlowsd@…

commit b04cc4c7b192a50885871d8ec71c25b6b6475d67

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Nov 15 11:38:23 2011 +0000

    Avoid generating chains of indirections in stack squeezing (#5505)

 rts/ThreadPaused.c |  133 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 73 insertions(+), 60 deletions(-)

  Changed 18 months ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

Results for the original UULib test program after the above patch:

> ./UULib +RTS -s
138000
     101,854,752 bytes allocated in the heap
      38,063,512 bytes copied during GC
       9,955,944 bytes maximum residency (5 sample(s))
       2,221,040 bytes maximum slop
              24 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       191 colls,     0 par    0.04s    0.04s     0.0002s    0.0011s
  Gen  1         5 colls,     0 par    0.05s    0.05s     0.0098s    0.0254s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.10s  (  0.10s elapsed)
  GC      time    0.09s  (  0.09s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.20s  (  0.19s elapsed)

  %GC     time      45.5%  (46.1% elapsed)

  Alloc rate    952,278,461 bytes per MUT second

  Productivity  54.3% of total user, 54.9% of total elapsed

  Changed 18 months ago by marlowsd@…

commit 9562f18769b18cd44290d14628dd8d9a45e7d898

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Tue Nov 15 15:44:53 2011 +0000

    further fixes to the #5505 fix.

 rts/ThreadPaused.c |    7 +++----
 1 files changed, 3 insertions(+), 4 deletions(-)

  Changed 6 months ago by tomberek

  • status changed from closed to new
  • cc tomberek@… added
  • resolution fixed deleted
  • failure changed from None/Unknown to Runtime performance bug
  • version changed from 7.0.3 to 7.6.1
  • owner simonmar deleted
  • os changed from Unknown/Multiple to Windows

The Main.hs program (compiled with or without -O2) processes a StateT monad over ST running around a circular list. When run with profiling it is around 35% faster than without.

It has slightly different GC behavior, but it doesn't look significant enough to cause such a large difference.

Changed 6 months ago by tomberek

  Changed 6 months ago by tomberek

./Main.exe +RTS -s
222
CPU time:   1.87s
         172,636 bytes allocated in the heap
             980 bytes copied during GC
          41,860 bytes maximum residency (1 sample(s))
          19,580 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0001s    0.0001s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.87s  (  1.86s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.87s  (  1.86s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    92,219 bytes per MUT second

  Productivity 100.0% of total user, 100.6% of total elapsed


./Main.exe +RTS -s -h
222
CPU time:   1.39s
         172,892 bytes allocated in the heap
         271,180 bytes copied during GC
          78,104 bytes maximum residency (14 sample(s))
          20,200 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1        14 colls,     0 par    0.00s    0.00s     0.0002s    0.0003s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.39s  (  1.37s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    1.39s  (  1.38s elapsed)

  %GC     time       0.0%  (0.2% elapsed)

  Alloc rate    124,525 bytes per MUT second

  Productivity 100.0% of total user, 100.7% of total elapsed

  Changed 6 months ago by simonmar

  • status changed from new to closed
  • difficulty set to Unknown
  • resolution set to fixed

I've moved this to a new ticket, see #7429.

Note: See TracTickets for help on using tickets.