Ticket #1589 (closed bug: fixed)

Opened 6 years ago

Last modified 4 months ago

Process creation and communication doesn't scale linearly

Reported by: guest Owned by: simonmar
Priority: normal Milestone: 6.10 branch
Component: Runtime System Version: 6.7
Keywords: Cc: lennart.augustsson@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty: Moderate (less than a day)
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Creating processes (with forkIO) and communicating between them (with putMVar and takeMVar) does not scale linearly. For 10000 processes creation takes 8us, but for 100000 it takes 60us. Even taking the increased GC into account it's highly non-linear.

Attachments

Thread.hs Download (2.1 KB) - added by guest 6 years ago.

Change History

Changed 6 years ago by guest

Changed 6 years ago by igloo

  • milestone set to 6.8 branch

Changed 6 years ago by simonmar

  • owner set to simonmar
  • difficulty changed from Unknown to Moderate (1 day)
  • os changed from Windows to Multiple
  • architecture changed from x86 to Multiple

The problem is GC: try it with -H500m.

In fact there are two problems: both MVars and TSOs are kept in the remembered set of the old generation instead of having a proper write barrier, which means that all MVars and TSOs will be visited in every minor GC.

I've fixed the problem for MVars:

Thu Oct 11 14:55:05 BST 2007  Simon Marlow <simonmar@microsoft.com>
  * Add a proper write barrier for MVars

Now the HEAD is about 30% faster on that example with 100000 threads and the default heap settings. We might consider merging that patch into 6.8.2.

Threads are somewhat more tricky... I've made some progress towards a fix, but it isn't complete yet.

Changed 6 years ago by simonmar

  • milestone changed from 6.8 branch to 6.10 branch

Changed 5 years ago by simonmar

See also #1984

Changed 5 years ago by simonmar

Just for your enjoyment, here is how my current development tree does on this benchmark:

./threads1589 1000000 0 +RTS -sstderr -k200 
     274,891,208 bytes allocated in the heap
     788,755,388 bytes copied during GC
     176,005,076 bytes maximum residency (8 sample(s))
      31,957,908 bytes maximum slop
             387 MB total memory in use (3 MB lost due to fragmentation)

  Generation 0:   518 collections,     0 parallel,  1.83s,  1.84s elapsed
  Generation 1:     8 collections,     0 parallel,  1.36s,  1.76s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.74s  (  0.76s elapsed)
  GC    time    3.19s  (  3.60s elapsed)
  EXIT  time    0.25s  (  0.42s elapsed)
  Total time    4.18s  (  4.61s elapsed)

  %GC time      76.4%  (78.1% elapsed)

  Alloc rate    278,209,693 bytes per MUT second

  Productivity  23.6% of total user, 21.4% of total elapsed

4.6us per thread, and it scales just slightly worse than linearly until my laptop runs out of memory. The test is doing a bit more than just creating threads, it passes a single message along the chain of threads after creating them all, which is why it needs linear heap space and hence won't scale completely linearly.

It'll be a while until I commit this as it's on top of a ton of other stuff I have to get committed, but I'm getting there.

Changed 5 years ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

Here's the results with the current HEAD. GHC doesn't even break a sweat with a million threads:

$ ./1589 1000000  
Creating pipeline with 1000000 processes in it.
Pumping a single message through the pipeline.
Pumping a 100 messages through the pipeline.
       n   create    pump1    pump2 create/n  pump1/n  pump2/n
                s        s        s       us       us       us
 1000000    5.612    1.108   17.969     5.61     1.11     0.18

Creation scales linearly until I run out of memory. You can improve the results even more by making stacks start a bit smaller:

$ ./1589 1000000 +RTS -k500
Creating pipeline with 1000000 processes in it.
Pumping a single message through the pipeline.
Pumping a 100 messages through the pipeline.
       n   create    pump1    pump2 create/n  pump1/n  pump2/n
                s        s        s       us       us       us
 1000000    3.780    1.140   15.009     3.78     1.14     0.15

Changed 5 years ago by simonmar

  • architecture changed from Multiple to Unknown/Multiple

Changed 5 years ago by simonmar

  • os changed from Multiple to Unknown/Multiple

Changed 4 years ago by simonmar

  • difficulty changed from Moderate (1 day) to Moderate (less than a day)

Changed 4 months ago by ezyang

  • failure set to None/Unknown

For reference, the fix was:

commit 04cddd339c000df6d02c90ce59dbffa58d2fe166
Author: Simon Marlow <simonmarhaskell@gmail.com>
Date:   Wed Apr 16 23:39:51 2008 +0000

    Add a write barrier to the TSO link field (#1589)

Changed 4 months ago by ezyang@…

commit 6ff3c31888e614d1f6f78449e41d293612a855be

Author: Edward Z. Yang <ezyang@mit.edu>
Date:   Sun Jan 27 18:25:34 2013 -0800

    Fix documentation bug: TSOs are *not* unconditionally kept on the mutable list.
    
    The bug where TSOs were unconditionally kept on the mutable list was #1589
    which was fixed in 04cddd339c000df6d02c90ce59dbffa58d2fe166.
    Curiously enough, the commit that changed this comment
    0417404f5d1230c9d291ea9f73e2831121c8ec99 occurred *after* this
    change was made; I can only assume Simon Marlow accidentally forgot
    that he had fixed this bug. :-)
    
    Signed-off-by: Edward Z. Yang <ezyang@mit.edu>

 rts/sm/Scav.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
Note: See TracTickets for help on using tickets.