See haddock in Data.CAS

A few notes on performance results
==================================

[2011.11.12] Initial Measurements
---------------------------------

An initial round of tests gives the following results when executing
100K CAS's from each of four threads on a 3.33GHz Nehalem desktop:

    RAW Haskell CAS:  0.143s        25% productivity,  58MB alloc,  204,693 successes
    'Fake' CAS:       0.9 - 1.42s   89% productivity, 132MB alloc,  104,044 successes
    Foreign CAS:      1.6s          23% productivity,  82MB alloc,  264,821 successes

"Successes" counts the total number of CAS operations that actually
succeeded.  This microbenchmark spends a lot of its time in Gen 0
garbage collection.

Next, a million CAS's per thread:

    RAW Haskell CAS:  0.65 - 1.0s   20% productivity, 406MB alloc  (can stack overflow)
    'Fake' CAS:       14.3 - 17s    270% CPU, 92% productivity, 1.3GB alloc, 1,008,468 successes
    Foreign CAS:      stack overflow after 28.5s ... 300-390% CPU, 15% productivity

After bumping the stack size up, the Foreign version takes a long time
to finish, even after it has printed the sample success bit vectors.
With -K100M: 78s, 6% productivity, 898MB alloc, 3,324,943 successes.
67 seconds elapsed in Gen 0 GC!!

Something odd is going on here.  How could it spend so long in GC for
so little allocation?  With only two threads the Foreign implementation
drops to 22s, but still spends 17.97s in Gen 0 GC.

What about the raw Haskell CAS?  It will also stack overflow with the
current version of the test.  With a 1GB stack it can do 5Mx4 CAS's
(8.9M successful) in 6.7 seconds, and 10M in 17s.  And I am STILL not
seeing the previous segfault with the Raw CAS version...

[2011.11.12] Simple Stack Overflow Fix
--------------------------------------

After making sure that all the (+1)'s in the test are strict, the
stack overflow goes away and the numbers change (Raw does 5Mx4 in 3.3s
instead of 6.7s).  BUT there's still quite a lot of time spent in GC.
Here's 1Mx4 CAS's again:

    RAW Haskell CAS:   0.7s  (23% prod, 0.8s total Gen 0 GC)
    'Fake' CAS:       11.8s  (91% prod, 0.8s total Gen 0 GC)
    Foreign CAS:        52s  ( 6% prod)

Adding -A1M makes a negligible change in runtime for Raw, but it
reduces the number of Gen 0 collections from 484 to 235.

OK, how about testing on a 3.1GHz Westmere?  Wow, just ran into this:

    cc1: internal compiler error: Segmentation fault
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See for instructions.
    make: *** [all] Error 1

On a different machine it worked (default runtime flags, 1Mx4 CAS):

    RAW Haskell CAS:  0.7s
    'Fake' CAS:       8.1s
    Foreign CAS:      46s

The lack of hyperthreading may also be helping.

Note: the primary source of allocation in this example is the
accumulation of the [Bool] lists of successes and failures.  I should
disable those and test again.

[2011.11.13] Testing specialized CAS.Foreign instance
-----------------------------------------------------

All of the above results were for a cell containing an Int, which
would not have triggered the specialized (Storable-based) instance in
Foreign.hs.  There SHOULD be special cases for all word-sized scalars,
but currently there's just one, for Word32.  Let's test that one.

Word32 1Mx4 CAS's:

    RAW Haskell CAS:  0.57s  (13.7% prod)
    Foreign CAS:      0.64s  (15% prod)

Wow, the foreign one is doing as well as the Haskell one, even though
there's some extra silliness in the Foreign.CASRef type (causing a
runtime case dispatch to unpack).
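For reference, here is a minimal sketch of the shape of these
benchmarks.  It is not the real test code: casLike is a hypothetical
stand-in that emulates CAS with atomicModifyIORef' and value equality
(the package's 'Fake' variant compares pointers instead), and the bang
pattern on the success counter is the strict (+1) fix from the
stack-overflow note above.

    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
    import Control.Monad      (forM_, replicateM)
    import Data.IORef
    import Data.Word          (Word32)

    -- Stand-in for CAS: emulate it with atomicModifyIORef' and value
    -- equality.  (The real 'Fake' CAS compares pointers instead.)
    casLike :: Eq a => IORef a -> a -> a -> IO (Bool, a)
    casLike ref old new = atomicModifyIORef' ref $ \cur ->
      if cur == old then (new, (True, cur)) else (cur, (False, cur))

    -- One benchmark thread: n CAS attempts against a shared counter,
    -- counting successes.  The bang pattern is the strict (+1) fix; a
    -- lazy accumulator builds a thunk chain and overflows the stack.
    casThread :: IORef Word32 -> Int -> IO Int
    casThread ref n = go n 0
     where
      go 0 !succs = return succs
      go k !succs = do
        old     <- readIORef ref
        (ok, _) <- casLike ref old (old + 1)
        go (k - 1) (if ok then succs + 1 else succs)

    main :: IO ()
    main = do
      ref   <- newIORef (0 :: Word32)
      dones <- replicateM 4 newEmptyMVar    -- four threads, as above
      forM_ dones $ \mv -> forkIO (casThread ref 1000000 >>= putMVar mv)
      total <- fmap sum (mapM takeMVar dones)
      putStrLn ("Total successful CAS's: " ++ show total)

Build with -threaded and run with +RTS -N4 to actually get the
contention these numbers reflect.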
[2011.11.13] Testing atomicModify based on CAS
----------------------------------------------

An atomicModify based on CAS offers a drop-in replacement that could
improve performance.  I implemented one that retries the CAS until it
has failed a certain number of times ("30" for now, but that needs to
be tuned).

These seem to work well.  They are cheaper than you would expect,
given how long it takes to get successful CAS attempts under
contention.  1Mx4 CAS attempts or atomicModifies:

    0.19s -- 1Mx4 CAS attempts, 1.07M successful
    0.02s -- 1M   atomicModifyIORefCAS on 1 thread
    0.13s -- 1Mx2 atomicModifyIORefCAS on 2 threads
    0.37s -- 1Mx4 atomicModifyIORefCAS on 4 threads

And they are cheaper than the real atomicModifyIORef (which also seems
to have a stack-space problem right now because of its laziness).  But
with a huge stack (1G) it will succeed:

    1.27s -- 1Mx4 atomicModifyIORef

But inserting extra strictness (an evaluate call) to avoid the stack
leak actually makes it slower:

    2.08s -- 1Mx4 atomicModifyIORef, stricter
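To make the retry scheme concrete, here is a sketch under the same
assumptions as the benchmark sketch above (the value-equality casLike
standing in for real CAS; the actual atomicModifyIORefCAS signature
and fallback may differ, and the real version presumably supports the
full (a -> (a, b)) -> IO b interface):

    {-# LANGUAGE BangPatterns #-}
    import Control.Exception (evaluate)
    import Data.IORef

    -- Same value-equality CAS stand-in as in the earlier sketch.
    casLike :: Eq a => IORef a -> a -> a -> IO (Bool, a)
    casLike ref old new = atomicModifyIORef' ref $ \cur ->
      if cur == old then (new, (True, cur)) else (cur, (False, cur))

    -- CAS-based atomicModify: spin on CAS until it has failed 30
    -- times (untuned, per the note above), then fall back to a plain
    -- atomicModify.
    atomicModifyIORefCAS :: Eq a => IORef a -> (a -> a) -> IO ()
    atomicModifyIORefCAS ref f = go (30 :: Int)
     where
      go 0      = atomicModifyIORef' ref (\x -> (f x, ()))  -- give up
      go !tries = do
        old     <- readIORef ref
        new     <- evaluate (f old)   -- force before publishing
        (ok, _) <- casLike ref old new
        if ok then return () else go (tries - 1)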
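And one plausible reading of the "stricter" variant measured last,
with the caveat that exactly where the evaluate call was inserted is
not recorded here:

    import Control.Exception (evaluate)
    import Data.IORef

    -- Strict counterpart of atomicModifyIORef: return the new value
    -- from the update and force it immediately, so no thunk chain
    -- accumulates in the IORef.  This avoids the stack leak but pays
    -- an extra eval on every call, consistent with the slowdown
    -- measured above.
    atomicModifyIORefStrict :: IORef a -> (a -> a) -> IO ()
    atomicModifyIORefStrict ref f = do
      new <- atomicModifyIORef ref (\x -> let y = f x in (y, y))
      _   <- evaluate new
      return ()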