Notes on development of monad-par streaming framework and benchmarks.
---------------------------------------------------------------------

[2011.03.19] Right now I'm seeing a weird behavior where when run
-threaded it prints many of the messages before "new"' in the filter
stage, but it doesn't print the messages after "put" in that same
stage.

This system allows self-stealing, correct?  

[2011.03.19] 

 Quick measurements on wasp:
 ---------------------------
  Throughput of just countupWindowed + measureRate ...
    2 threads is 8600 * 1024 = 8,806,400 = 8.8mHz.

  * With -N4 it uses ~234% CPU and gets sligthly confused, lowering
    throughput to 5800-6100 windows/sec from 8600

  * It actually does work (e.g. doesn't deadlock) to run it with -N1
    because there's only one kernel.  The throughput is much more
    variable.  From 4800-8500 win/sec.  GC time is tiny in this
    case, 0.5%.
 ---------------------------
  Throughput of a single amap (+100) kernel = 

 * Gets stuck waiting for first kernel -N1 as expected (and fills memory).

 * on -N2 it gets a rate of between 7937 and 8200 wins/sec but then
   it gets stuck after about 10 seconds.  I don't fully understand
   this.  The two threads should be enough for the two kernels
   right?  Currently Par.hs forks a "replacement thread" to continue
   working on the queue when runParAsync returns.  Therefore there
   should be enough haskell threads, even if there are only two Par
   scheduler threads.  Those two par workers should be preemptable by
   the measureRate thread....

 * on -N3 it works, and gives 6200-7100 wins/sec throughput.  Uses
   ~245% CPU.  Presumably the two par worker threads are hammering
   away and the measureRate one is less busy.  But if I leave it
   running for a while it grew up to 79% (of 16gb) mem usage.
   Strange, -s claims the following.  How can max resdency be so low!?

       2,993,287,338,808 bytes allocated in the heap
	 17,134,304,736 bytes copied during GC
	      3,744,408 bytes maximum residency (152145 sample(s))
	      1,870,240 bytes maximum slop
		     14 MB total memory in use (3 MB lost due to fragmentation)

	 Generation 0: 4961153 collections, 4961152 parallel, 696.02s, 63.29s elapsed
	 Generation 1: 152145 collections, 152145 parallel, 115.00s, 22.66s elapsed

	 Parallel GC work balance: 1.08 (2102121164 / 1946757116, ideal 3)

			       MUT time (elapsed)       GC time  (elapsed)
	 Task  0 (worker) :    0.02s    (  0.00s)       0.00s    (  0.00s)
	 Task  1 (worker) :  1927.40s    (742.55s)       0.00s    (  0.00s)
	 Task  2 (worker) :  1927.40s    (742.55s)       0.00s    (  0.00s)
	 Task  3 (worker) :  1928.34s    (743.40s)       0.00s    (  0.00s)
	 Task  4 (worker) :  1928.34s    (743.40s)       0.00s    (  0.00s)
	 Task  5 (bound)  :    0.00s    (  0.00s)       0.11s    (  0.02s)
	 Task  6 (worker) :  1928.34s    (743.40s)       0.00s    (  0.00s)
	 Task  7 (worker) :  1117.43s    (743.40s)     810.91s    ( 85.93s)

	 SPARKS: 0 (0 converted, 0 pruned)

	 INIT  time    0.00s  (  0.00s elapsed)
	 MUT   time  1117.32s  (743.40s elapsed)
	 GC    time  811.02s  ( 85.95s elapsed)
	 EXIT  time    0.00s  (  0.01s elapsed)
	 Total time  1928.34s  (829.36s elapsed)

	 %GC time      42.1%  (10.4% elapsed)

	 Alloc rate    2,678,988,417 bytes per MUT second

	 Productivity  57.9% of total user, 134.7% of total elapsed

       gc_alloc_block_sync: 6092257
       whitehole_spin: 0
       gen[0].sync_large_objects: 190255
       gen[1].sync_large_objects: 267

 Oh, maybe because of the CArray's all the real storage is outside haskell's heap.

 There must be a memory leak in streamMap.  Trying to fix it:

   (1) Factored out 'loop'.  I need to try to ensure that no closure
       holds onto the original head of the stream.  Wow!  That
       lowered throughput a lot (-N3) and drove cpu usage up! 3500
       wins/sec declining to 300.  And it still leaks.
	 The key difference seems to be passing the extra "fn" argument to loop.

   (2) Hmm... I went back to what I *thought* was the previous form
       above (that leaked).  But now it's getting the good >6000
       throughput and doesn't seem to be leaking.  It gives memory
       back to the system and goes up and down in mem footprint.
       But now it uses 300% cpu.  The only difference I can see is
       that I changed the module export decl.  How could this matter
       if compiling into an executable?  Nevertheless maybe this
       helps it inline....  Now I can run it for 10 min with minimal
       memory usage.

       -qa seems to help the variance on -N4, i.e. with more workers
	than kernels.

 ---------------------------
  Throughput of a single FFT kernel.

  * Oops, maybe this accounts for the difference above between
    leaking/non-leaking.  The FFT version maintains a high >7000
    wins/sec throughput.  But it leaks memory.  Maybe it's not
    really doing the FFT and is leaking suspensions?

  * Nope... I tried forcing the FFT by inspecting one element of the
    output.  Still leaks.  Well, the answer is simple.  It just
    can't keep up with a source that has no backpressure.  To
    confirm this hypothesis, I ran it with -N1 (with the new
    yielding source operator).  NOPE, it still leaks.


[2011.03.20] {Added disjoint_working_sets_pipeline}

Ok, first try.  Not only has this not demonstrated the benefit of
pipeline parallelism, right now it isn't showing much of a speedup at
all.  Right now I'm running with default parameters (4 256 10 20000)

Running with "-N4 -qa" I see large variance.  Between 1.0 and 3.3
seconds.  Seeing the same without -qa:

Running without no extra runtime flags (no -qa):
  nothreads: 2.18s
  1thread: 2.3s
  4threads: 1-3 seconds.

The big difference is running with N>1 it can spend a lot of time in
GC (e.g. 84%!!).  But I need to stop running on wasp for actual
parallel speedup measurements -- I seem to have problems using wasp
for this purpose.