GHC GC rises greatly on -N8 compared to -N7
I've recently been benchmarking some parallel code on an 8-core x86 machine. I have graphed performance for -N1 through to -N8, but I often get a huge spike on -N8. I can produce this fairly reliably with the attached program, that simply spawns 8 threads that each spin writing to a TVar. Here is an example of the timings I get:
time ./Main +RTS -N1
real 0m43.697s
user 0m43.683s
sys 0m0.016s
time ./Main +RTS -N4
real 0m24.424s
user 1m13.621s
sys 0m0.208s
time ./Main +RTS -N7
real 0m18.631s
user 1m42.778s
sys 0m0.260s
time ./Main +RTS -N8
real 0m37.859s
user 3m38.506s
sys 0m0.632s
Spot the odd one out! It's not to do with having exactly 8 threads for 8 cores -- I get this problem with ~150 threads on 8 cores. When I run with -s, it seems like the GC is the reason:
time ./Main +RTS -N7 -s
7,313,502,412 bytes allocated in the heap
108,812,664 bytes copied during GC
11,924 bytes maximum residency (100 sample(s))
61,828 bytes maximum slop
5 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 4207 collections, 0 parallel, 1.18s, 1.17s elapsed
Generation 1: 100 collections, 99 parallel, 0.18s, 0.04s elapsed
Parallel GC work balance: 3.25 (285256 / 87661, ideal 7)
Task 0 (worker) : MUT time: 5.56s ( 11.70s elapsed)
GC time: 0.01s ( 0.01s elapsed)
Task 1 (worker) : MUT time: 6.65s ( 11.70s elapsed)
GC time: 1.16s ( 1.13s elapsed)
Task 2 (worker) : MUT time: 0.03s ( 11.70s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 3 (worker) : MUT time: 0.00s ( 11.70s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 9.93s ( 11.70s elapsed)
GC time: 0.01s ( 0.01s elapsed)
Task 5 (worker) : MUT time: 6.97s ( 11.70s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 6 (worker) : MUT time: 8.25s ( 11.70s elapsed)
GC time: 0.04s ( 0.02s elapsed)
Task 7 (worker) : MUT time: 6.79s ( 11.72s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 8 (worker) : MUT time: 3.70s ( 11.70s elapsed)
GC time: 0.14s ( 0.03s elapsed)
INIT time 0.00s ( 0.00s elapsed)
MUT time 61.91s ( 11.70s elapsed)
GC time 1.35s ( 1.21s elapsed)
EXIT time 0.01s ( 0.02s elapsed)
Total time 63.27s ( 12.93s elapsed)
%GC time 2.1% (9.4% elapsed)
Alloc rate 118,112,344 bytes per MUT second
Productivity 97.9% of total user, 478.9% of total elapsed
recordMutableGen_sync: 383
gc_alloc_block_sync: 2
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 16
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 175
gen[1].steps[0].sync_large_objects: 0
real 0m12.933s
user 1m3.276s
sys 0m0.248s
time ./Main +RTS -N8 -s
7,316,581,044 bytes allocated in the heap
153,424,584 bytes copied during GC
11,964 bytes maximum residency (132 sample(s))
65,916 bytes maximum slop
6 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 4711 collections, 4711 parallel, 136.08s, 21.38s elapsed
Generation 1: 132 collections, 132 parallel, 6.12s, 0.93s elapsed
Parallel GC work balance: 1.01 (38276826 / 38012855, ideal 8)
Task 0 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 30.31s ( 4.74s elapsed)
Task 1 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 25.09s ( 3.68s elapsed)
Task 2 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 0.09s ( 0.01s elapsed)
Task 3 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 0.00s ( 0.00s elapsed)
Task 4 (worker) : MUT time: 0.98s ( 14.56s elapsed)
GC time: 7.62s ( 1.17s elapsed)
Task 5 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 13.09s ( 2.01s elapsed)
Task 6 (worker) : MUT time: 5.81s ( 14.56s elapsed)
GC time: 1.92s ( 0.28s elapsed)
Task 7 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 25.12s ( 4.24s elapsed)
Task 8 (worker) : MUT time: 0.95s ( 14.56s elapsed)
GC time: 8.01s ( 1.20s elapsed)
Task 9 (worker) : MUT time: 0.00s ( 14.56s elapsed)
GC time: 30.94s ( 4.98s elapsed)
INIT time 0.00s ( 0.00s elapsed)
MUT time 65.37s ( 14.56s elapsed)
GC time 142.20s ( 22.31s elapsed)
EXIT time 0.00s ( 0.01s elapsed)
Total time 207.57s ( 36.87s elapsed)
%GC time 68.5% (60.5% elapsed)
Alloc rate 111,922,106 bytes per MUT second
Productivity 31.5% of total user, 177.3% of total elapsed
recordMutableGen_sync: 597
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 35296
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 557515
gen[1].steps[0].sync_large_objects: 0
real 0m36.873s
user 3m27.569s
sys 0m0.524s
See how the GC goes from 1s to 142s, just by virtue of -N8 instead of -N7. Increasing the heap size seems to alleviate the problem, but the difference is so severe (and unexpected, at least to me) that I thought I'd file a ticket. I am happy to try other flags if you want. This is all on GHC 6.10.4.
Trac metadata
Trac field | Value |
---|---|
Version | 6.10.4 |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Runtime System |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture |