Ticket #1589 (closed bug: fixed)

Opened 5 years ago

Last modified 2 years ago

Process creation and communication doesn't scale linearly

Reported by: guest Owned by: simonmar
Priority: normal Milestone: 6.10 branch
Component: Runtime System Version: 6.7
Keywords: Cc: lennart.augustsson@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Difficulty: Moderate (less than a day)
Test Case: Blocked By:
Blocking: Related Tickets:

Description

Creating processes (with forkIO) and communicating between them (with putMVar and takeMVar) does not scale linearly. For 10000 processes creation takes 8us, but for 100000 it takes 60us. Even taking the increased GC into account it's highly non-linear.

Attachments

Thread.hs Download (2.1 KB) - added by guest 5 years ago.

Change History

Changed 5 years ago by guest

Changed 4 years ago by igloo

  • milestone set to 6.8 branch

Changed 4 years ago by simonmar

  • owner set to simonmar
  • difficulty changed from Unknown to Moderate (1 day)
  • os changed from Windows to Multiple
  • architecture changed from x86 to Multiple

The problem is GC: try it with -H500m.

In fact there are two problems: both MVars and TSOs are kept in the remembered set of the old generation instead of having a proper write barrier, which means that all MVars and TSOs will be visited in every minor GC.

I've fixed the problem for MVars:

Thu Oct 11 14:55:05 BST 2007  Simon Marlow <simonmar@microsoft.com>
  * Add a proper write barrier for MVars

Now the HEAD is about 30% faster on that example with 100000 threads and the default heap settings. We might consider merging that patch into 6.8.2.

Threads are somewhat more tricky... I've made some progress towards a fix, but it isn't complete yet.

Changed 4 years ago by simonmar

  • milestone changed from 6.8 branch to 6.10 branch

Changed 4 years ago by simonmar

See also #1984

Changed 4 years ago by simonmar

Just for your enjoyment, here is how my current development tree does on this benchmark:

./threads1589 1000000 0 +RTS -sstderr -k200 
     274,891,208 bytes allocated in the heap
     788,755,388 bytes copied during GC
     176,005,076 bytes maximum residency (8 sample(s))
      31,957,908 bytes maximum slop
             387 MB total memory in use (3 MB lost due to fragmentation)

  Generation 0:   518 collections,     0 parallel,  1.83s,  1.84s elapsed
  Generation 1:     8 collections,     0 parallel,  1.36s,  1.76s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.74s  (  0.76s elapsed)
  GC    time    3.19s  (  3.60s elapsed)
  EXIT  time    0.25s  (  0.42s elapsed)
  Total time    4.18s  (  4.61s elapsed)

  %GC time      76.4%  (78.1% elapsed)

  Alloc rate    278,209,693 bytes per MUT second

  Productivity  23.6% of total user, 21.4% of total elapsed

4.6us per thread, and it scales just slightly worse than linearly until my laptop runs out of memory. The test is doing a bit more than just creating threads, it passes a single message along the chain of threads after creating them all, which is why it needs linear heap space and hence won't scale completely linearly.

It'll be a while until I commit this as it's on top of a ton of other stuff I have to get committed, but I'm getting there.

Changed 4 years ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

Here's the results with the current HEAD. GHC doesn't even break a sweat with a million threads:

$ ./1589 1000000  
Creating pipeline with 1000000 processes in it.
Pumping a single message through the pipeline.
Pumping a 100 messages through the pipeline.
       n   create    pump1    pump2 create/n  pump1/n  pump2/n
                s        s        s       us       us       us
 1000000    5.612    1.108   17.969     5.61     1.11     0.18

Creation scales linearly until I run out of memory. You can improve the results even more by making stacks start a bit smaller:

$ ./1589 1000000 +RTS -k500
Creating pipeline with 1000000 processes in it.
Pumping a single message through the pipeline.
Pumping a 100 messages through the pipeline.
       n   create    pump1    pump2 create/n  pump1/n  pump2/n
                s        s        s       us       us       us
 1000000    3.780    1.140   15.009     3.78     1.14     0.15

Changed 3 years ago by simonmar

  • architecture changed from Multiple to Unknown/Multiple

Changed 3 years ago by simonmar

  • os changed from Multiple to Unknown/Multiple

Changed 2 years ago by simonmar

  • difficulty changed from Moderate (1 day) to Moderate (less than a day)
Note: See TracTickets for help on using tickets.