Notes on development of monad-par streaming framework and benchmarks.
---------------------------------------------------------------------


[2011.03.19]

Right now I'm seeing a weird behavior where, when run with -threaded, it
prints many of the messages before "new" in the filter stage, but it
doesn't print the messages after "put" in that same stage.

This system allows self-stealing, correct?


[2011.03.19] Quick measurements on wasp:
---------------------------

Throughput of just countupWindowed + measureRate ... 2 threads is
8600 wins/sec * 1024 = 8,806,400 per second = 8.8 MHz.

 * With -N4 it uses ~234% CPU and gets slightly confused, lowering
   throughput to 5800-6100 wins/sec from 8600.

 * It actually does work (i.e. doesn't deadlock) to run it with -N1
   because there's only one kernel.  The throughput is much more
   variable: from 4800-8500 wins/sec.  GC time is tiny in this case, 0.5%.

---------------------------
Throughput of a single amap (+100) kernel:

 * Gets stuck waiting for the first kernel with -N1, as expected (and
   fills memory).

 * On -N2 it gets a rate of between 7937 and 8200 wins/sec, but then it
   gets stuck after about 10 seconds.  I don't fully understand this.
   The two threads should be enough for the two kernels, right?
   Currently Par.hs forks a "replacement thread" to continue working on
   the queue when runParAsync returns.  Therefore there should be enough
   Haskell threads, even if there are only two Par scheduler threads.
   Those two Par workers should be preemptable by the measureRate
   thread....

 * On -N3 it works, and gives 6200-7100 wins/sec throughput.  Uses ~245%
   CPU.  Presumably the two Par worker threads are hammering away and
   the measureRate one is less busy.  But when I left it running for a
   while, memory usage grew to 79% (of 16 GB).

   Strange, -s claims the following.  How can max residency be so low!?

     2,993,287,338,808 bytes allocated in the heap
        17,134,304,736 bytes copied during GC
             3,744,408 bytes maximum residency (152145 sample(s))
             1,870,240 bytes maximum slop
                    14 MB total memory in use (3 MB lost due to fragmentation)

     Generation 0: 4961153 collections, 4961152 parallel, 696.02s, 63.29s elapsed
     Generation 1: 152145 collections, 152145 parallel, 115.00s, 22.66s elapsed

     Parallel GC work balance: 1.08 (2102121164 / 1946757116, ideal 3)

                           MUT time (elapsed)       GC time  (elapsed)
     Task  0 (worker) :    0.02s    (  0.00s)       0.00s    (  0.00s)
     Task  1 (worker) : 1927.40s    (742.55s)       0.00s    (  0.00s)
     Task  2 (worker) : 1927.40s    (742.55s)       0.00s    (  0.00s)
     Task  3 (worker) : 1928.34s    (743.40s)       0.00s    (  0.00s)
     Task  4 (worker) : 1928.34s    (743.40s)       0.00s    (  0.00s)
     Task  5 (bound)  :    0.00s    (  0.00s)       0.11s    (  0.02s)
     Task  6 (worker) : 1928.34s    (743.40s)       0.00s    (  0.00s)
     Task  7 (worker) : 1117.43s    (743.40s)     810.91s    ( 85.93s)

     SPARKS: 0 (0 converted, 0 pruned)

     INIT  time    0.00s  (  0.00s elapsed)
     MUT   time 1117.32s  (743.40s elapsed)
     GC    time  811.02s  ( 85.95s elapsed)
     EXIT  time    0.00s  (  0.01s elapsed)
     Total time 1928.34s  (829.36s elapsed)

     %GC time      42.1%  (10.4% elapsed)

     Alloc rate    2,678,988,417 bytes per MUT second

     Productivity  57.9% of total user, 134.7% of total elapsed

     gc_alloc_block_sync: 6092257
     whitehole_spin: 0
     gen[0].sync_large_objects: 190255
     gen[1].sync_large_objects: 267

   Oh, maybe because of the CArrays all the real storage is outside
   Haskell's heap.

There must be a memory leak in streamMap.  Trying to fix it:

 (1) Factored out 'loop'.  I need to try to ensure that no closure
     holds onto the original head of the stream.  Wow!  That lowered
     throughput a lot (-N3) and drove CPU usage up!  3500 wins/sec
     declining to 300.  And it still leaks.  The key difference seems
     to be passing the extra "fn" argument to loop.

 (2) Hmm... I went back to what I *thought* was the previous form above
     (that leaked).  But now it's getting the good >6000 wins/sec
     throughput and doesn't seem to be leaking.
     It gives memory back to the system, and its memory footprint goes
     up and down.  But now it uses 300% CPU.

     The only difference I can see is that I changed the module export
     decl.  How could this matter if compiling into an executable?
     Nevertheless, maybe this helps it inline....

     Now I can run it for 10 min with minimal memory usage.

-qa seems to help the variance on -N4, i.e. with more workers than
kernels.

---------------------------
Throughput of a single FFT kernel.

 * Oops, maybe this accounts for the difference above between
   leaking/non-leaking.  The FFT version maintains a high >7000 wins/sec
   throughput.  But it leaks memory.  Maybe it's not really doing the
   FFT and is leaking suspensions?

 * Nope... I tried forcing the FFT by inspecting one element of the
   output.  Still leaks.

Well, the answer is simple.  It just can't keep up with a source that
has no backpressure.  To confirm this hypothesis, I ran it with -N1
(with the new yielding source operator).  NOPE, it still leaks.


[2011.03.20] {Added disjoint_working_sets_pipeline}

Ok, first try.  Not only has this not demonstrated the benefit of
pipeline parallelism, right now it isn't showing much of a speedup at
all.

Right now I'm running with default parameters (4 256 10 20000).
Running with "-N4 -qa" I see large variance: between 1.0 and 3.3
seconds.  Seeing the same without -qa.

Running with no extra runtime flags (no -qa):

   nothreads: 2.18s
   1thread:   2.3s
   4threads:  1-3 seconds

The big difference is that when running with N>1 it can spend a lot of
time in GC (e.g. 84%!!).

But I need to stop running on wasp for actual parallel speedup
measurements -- I seem to have problems using wasp for this purpose.
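

Aside on the backpressure hypothesis from the FFT kernel notes above: the
yielding source only gives up the CPU; it never actually blocks when the
consumer falls behind.  For comparison, here is a minimal sketch of a
source that does block.  It is NOT the monad-par stream code: it uses
plain IO threads, and BoundedChan, newBoundedChan, writeBounded,
readBounded and runPipeline are hypothetical names made up for this
illustration.  The capacity must be at least 1.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (QSem, newQSem, signalQSem, waitQSem)
import Control.Monad (forM_, replicateM)

-- Hypothetical bounded channel: an unbounded Chan plus a semaphore
-- that counts the free slots remaining in the buffer.
data BoundedChan a = BoundedChan (Chan a) QSem

newBoundedChan :: Int -> IO (BoundedChan a)
newBoundedChan capacity = do
  ch    <- newChan
  slots <- newQSem capacity
  return (BoundedChan ch slots)

-- Blocks while no slot is free, so a fast producer is throttled to
-- the consumer's pace instead of filling memory.
writeBounded :: BoundedChan a -> a -> IO ()
writeBounded (BoundedChan ch slots) x = do
  waitQSem slots
  writeChan ch x

-- Takes one element, then releases its slot back to the producer.
readBounded :: BoundedChan a -> IO a
readBounded (BoundedChan ch slots) = do
  x <- readChan ch
  signalQSem slots
  return x

-- Stand-in pipeline: a producer thread emits 1..n, and the calling
-- thread consumes all n elements and returns their sum.
runPipeline :: Int -> Int -> IO Int
runPipeline capacity n = do
  bc   <- newBoundedChan capacity
  done <- newEmptyMVar
  _ <- forkIO $ do
         forM_ [1 .. n] (writeBounded bc)
         putMVar done ()
  xs <- replicateM n (readBounded bc)
  takeMVar done
  return (sum xs)

main :: IO ()
main = do
  total <- runPipeline 4 100
  print total
```

With this shape at most `capacity` elements are ever in flight, so if
memory still grows with such a source, the leak is somewhere other than
an unthrottled producer.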