[2013.04.07] {Figuring out install methods, linking, and segfaults}

Ok, here's the weird state of affairs at the moment.

 * Mac OS, build test in-place, compiling directly against Data.Atomics -- WORKS
 * Mac OS, cabal install, then build test -- SEGFAULTS
 * Linux, build test in-place, compiling directly against Data.Atomics -- fails to link, no "cas"
 * Linux, cabal install, then build test -- WORKS (but I thought it segfaulted earlier?)

Weird, right? Well, I'm not surprised by the failed link, because right now the foreign primops are trying to use a function defined in SMP.h inside the GHC RTS. That may be illegitimate. (But at the same time it is EXTERN_INLINE, so shouldn't it just be inlined before link time?) Note that SMP.h is included by Stg.h, which is included by Rts.h. Rts.h is included all over the place, but not directly by Cmm.h.

Anyway, putting in some extra printfs verifies that in the segfaulting case it is returning a ticket of zero, and in fact it is also returning zero as a Haskell pointer, because valgrind shows the segfault as an attempt to dereference 0x0. Just to confirm, I see similar behavior if I hack the Cmm to return an invalid pointer:

    stg_readMutVar2zh
    {
        RET_NP(19, 19);
    }

In fact, with the segfaulting Mac OS config, if I use the above hacked stg_readMutVar2zh, I DO see it receive a ticket of 19. So it IS calling the correct code.

WAIT! There's not much we can infer from the previous, because the failure is odd and transient! Once I restored stg_readMutVar2zh to the original, it now passes with BOTH compile methods on Mac OS. Further, it will happily run hundreds of times without segfaulting. So it appears to be compile-time nondeterminism rather than runtime ;-).

-----------------

AHA! It works when I build like this:

    cabal install --disable-library-profiling --disable-documentation

But NOT with a vanilla cabal install (profiling and documentation on). As expected, it is the profiling that makes the difference. Note that I am NOT building the test itself with profiling; the profiling version of the library should be IRRELEVANT. Why does this cause the segfault?

Well, if the library is cabal-installed WITH profiling, we can at least change the error by changing how Test.exe is built. Building with no options (without -threaded -O2), we see a false positive on casArrayElem, then a segfault on the ticket test:

    Perform a CAS within a MutableArray#
      (Poking at array was ok: 33)
      1st try should succeed: (False,4377552852)
      2nd should fail: (False,4377552852)
      Printing array: 33 33 33 33 33
      Done.
    casmutarray1: [OK]
    Using new 'ticket' based compare and swap:
      YAY, read the IORef, ticket 4382124888
    Test.exe: internal error: MUT_VAR_DIRTY object entered!
        (GHC version 7.6.2 for x86_64_apple_darwin)
        Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug

What if we actually DO build with profiling?

    $ ghc -prof -fforce-recomp Test.hs -o Test.exe
    Test.hs:39:10:
        Dynamic linking required, but this is a non-standard build (eg. prof).
        You need to build the program twice: once the normal way, and then in
        the desired way using -osuf to set the object file suffix.

Oops, I have to disable the use of test-framework-th and Template Haskell, it seems. I did that, and it builds and DOESN'T SEGFAULT. So it's building WITHOUT profiling after installing WITH profiling that breaks!!
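For reference, the "ticket"-based compare-and-swap being exercised here follows this pattern. A minimal sketch against the Data.Atomics ticket API (readForCAS/casIORef/peekTicket); exact signatures in this early version are still in flux:

    import Data.IORef
    import Data.Atomics (readForCAS, casIORef, peekTicket)

    -- Retry until our ticket is still current, then bump the counter.
    incrCAS :: IORef Int -> IO ()
    incrCAS ref = loop =<< readForCAS ref
      where
        loop tick = do
          (success, tick') <- casIORef ref tick (peekTicket tick + 1)
          if success then return () else loop tick'

A segfault in this pattern is consistent with the primop handing back a bogus (null) ticket/pointer, which is exactly what the printfs above showed.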
[2013.04.08] {Further investigating}
------------------------------------

Ok, let's figure out the chain of causation here. HOW is the build picking up anything from those profiling installs if profiling is disabled in the final build-and-link?

Here are the sizes of the relevant binaries from a profiling-enabled install:

    Summing ./lib/atomic-primops-0.1.0.0/ghc-7.6.2/HSatomic-primops-0.1.0.0.o
      44,392 bytes in 1 plain files.
    Summing ./lib/atomic-primops-0.1.0.0/ghc-7.6.2/libHSatomic-primops-0.1.0.0.a
      58,808 bytes in 1 plain files.

The final link command issued to gcc says this:

    '-L/Users/rrnewton/.cabal/lib/atomic-primops-0.1.0.0/ghc-7.6.2' ... '-lHSatomic-primops-0.1.0.0'

There doesn't seem to be anything there that would cause it to pull in the "_p.a" version. How about I manually delete the "*p_*" files and see what happens? Same behavior! So it is NOT using those files directly. And indeed, if I try to actually build the profiling version of Test.exe, it fails to find Data.Atomics (presumably because it fails to find the .p_hi file).

How about a binary difference in the NON-profiling binaries? Here are their sizes from the non-profiling (WORKING) install. YES, there are binary differences:

      44,336 bytes in 1 plain files.
      58,752 bytes in 1 plain files.

How about with ghc-7.4.2?
-------------------------

 * Exact same pattern of behavior: segfaults with atomic-primops installed with profiling and Test.exe built without.

[2013.04.08] {Also trying HPC mode}
-----------------------------------

Under ghc-7.6.2, installing with --enable-library-coverage and then building Test.exe with -fhpc works on my Mac.

Also, note that the cabal-driven test suite is not working right now on Mac or Linux. I think it worked somewhat before I moved it to the testing/ subdir, with a completely separate .cabal for testing/. That setup works on Mac and Linux, subject to the above bug concerning profiling installs.

[2013.04.20] {Cabal-dev poor isolation}
---------------------------------------

I thought I could build against different GHC versions in parallel (ghc-7.6.2, ghc-7.4.2, etc.), but it looks like they interfere. Further, even when running serially I get excessive rebuilding as I switch modes (e.g. "make prof74" then "make prof76"). I need to add some isolation of my own.

[2013.04.20] {Debugging}
------------------------

I fixed a major bug in the primop. I also verified that GC is what causes all_hammer_one to observe fewer successes than ideal (presumably because a collection can move the target object, so a ticket captured beforehand no longer matches and the CAS fails spuriously). Now it passes all tests *some* of the time. Other times I see segfaults on the all_hammer_one test with 10K iterations. The 10K-iteration run regularly sees GCs happen, whereas the 1K one doesn't. Yet it APPEARS that the segfault happens *before* the first GC (i.e. as reported by "+RTS -T"). Is it GC itself that is crashing? [Nope, this was the null pointer error -- FIXED]
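For reference, the shape of an all_hammer_one-style test is roughly the following. This is a hypothetical reconstruction (the real test lives in the suite and differs in detail): N threads hammer one IORef with one-shot CAS attempts, and we count the successes.

    import Control.Concurrent
    import Control.Monad
    import Data.IORef
    import Data.Atomics (readForCAS, casIORef)

    -- If the GC moves the target between readForCAS and casIORef, the
    -- ticket no longer matches and the CAS fails spuriously, which is
    -- why the success count comes out lower than ideal.
    hammerOne :: Int -> Int -> IO Int
    hammerOne nthreads iters = do
      ref     <- newIORef (0 :: Int)
      results <- forM [1 .. nthreads] $ \_ -> do
        mv <- newEmptyMVar
        _  <- forkIO $ do
                n <- foldM (\acc i -> do
                        tick    <- readForCAS ref
                        (ok, _) <- casIORef ref tick i
                        return $! if ok then acc + 1 else acc)
                      (0 :: Int) [1 .. iters]
                putMVar mv n
        return mv
      fmap sum (mapM takeMVar results)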
-----------------

Ok, things are going pretty well. I've been running many tests with 7.4 and 7.6 that pass completely. However, I just switched focus back to the profiling version and got this:

    Installed testing-0.1.0.0
    ./cabal-dev/ghc-7.4.2_prof/bin/test-atomic-primops_ghc-7.4.2 +RTS -N -T -RTS
    test-atomic-primops_ghc-7.4.2: internal error: evacuate(static): strange closure type -385875961
        (GHC version 7.4.2 for x86_64_apple_darwin)
        Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
    make[1]: *** [runtest] Abort trap: 6

On another run I got this:

    test-atomic-primops_ghc-7.4.2: internal error: MUT_VAR_DIRTY object entered!
        (GHC version 7.4.2 for x86_64_apple_darwin)
        Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug

And on 7.6 I got this:

    ./cabal-dev/ghc-7.6.2_prof/bin/test-atomic-primops_ghc-7.6.2 +RTS -N -T -RTS
    test-atomic-primops_ghc-7.6.2: internal error: MVAR object entered!
        (GHC version 7.6.2 for x86_64_apple_darwin)
        Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug
    make[1]: *** [runtest] Abort trap: 6

Uh oh... it seems there's been a regression in the profiling version.

STRANGE -- it looks like it was caused by extra double quotes in the args passed to GHC:

    cabal-dev install "-v --enable-library-profiling --ghc-options=-prof" \
      -s cabal-dev/ghc-7.6.2_prof --disable-documentation \
      --with-ghc=ghc-7.6.2 --program-suffix=_ghc-7.6.2 .

How did that cause it to compile and yet run with an internal failure?

[2013.07.18] {Timing atomic Counter ops}
================================================================================

The foreign/unboxed counter is the fastest for single-threaded repeated increments.

Laptop, MacBook Retina (4 physical cores w/ hyperthreading):

    $ ./dist/build/test-atomic-primops/test-atomic-primops -j 1 +RTS -N4 -RTS -t single
    Timing readIORef/writeIORef on one thread
    SELFTIMED: 0.137 sec
    Final value: 10000000
    RAW_single_thread_repeat_incr: [OK]
    Timing CAS increments on one thread without retries
    SELFTIMED: 0.220 sec
    Final value: 10000000
    CAS_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.510 sec
    CounterReference_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.187 sec
    CounterIORef_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.119 sec
    CounterForeign_single_thread_repeat_incr: [OK]

Desktop, veronica, Westmere 3.1 GHz (4 cores, no HT):

    SELFTIMED: 0.137 sec
    Final value: 10000000
    Timing CAS increments on one thread without retries
    RAW_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.228 sec
    Final value: 10000000
    CAS_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.638 sec
    CounterReference_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.242 sec
    CounterIORef_single_thread_repeat_incr: [OK]
    SELFTIMED: 0.135 sec
    CounterForeign_single_thread_repeat_incr: [OK]

The numbers for boolean flips (nots) are different from those for adds:

    Timing readIORef/writeIORef on one thread
    SELFTIMED: 0.097 sec
    Timing CAS boolean flips on one thread without retries
    RAW_single_thread_repeat_flip: [OK]
    SELFTIMED: 0.136 sec
    Final value: True
    Timing readIORef/writeIORef on one thread
    CAS_single_thread_repeat_flip: [OK]

--------------------------------------------------

With max contention on four threads (all threads hammering the counter), the atomicModifyIORef version falls apart. E.g., for 1M total increments here are the numbers.

Laptop, MacBook Retina (4 physical cores w/ hyperthreading):

    SELFTIMED: 45.768 sec
    CounterReference_concurrent_repeat_incr: [OK]
    SELFTIMED: 0.348 sec
    CounterIORef_concurrent_repeat_incr: [OK]
    SELFTIMED: 0.060 sec
    CounterForeign_concurrent_repeat_incr: [OK]

Desktop, veronica, Westmere 3.1 GHz (4 cores, no HT):

    SELFTIMED: 14.260 sec
    CounterReference_concurrent_repeat_incr: [OK]
    SELFTIMED: 0.426 sec
    CounterIORef_concurrent_repeat_incr: [OK]
    SELFTIMED: 0.066 sec
    CounterForeign_concurrent_repeat_incr: [OK]

That's a MASSIVE difference between the foreign fetch-and-add version and the atomicModifyIORef one, and the gap is MUCH worse on my laptop. The laptop has hyperthreading, but the benchmark only uses four unpinned threads, not eight...
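To make the comparison concrete: the "reference" counter is just an atomicModifyIORef bump, while the foreign counter bumps a raw word in memory with a hardware fetch-and-add. A minimal sketch of the two styles; the C symbol here (hs_fetch_and_add) is hypothetical, standing in for whatever the real Counter backends bind to:

    {-# LANGUAGE ForeignFunctionInterface #-}
    import Control.Monad (void)
    import Data.IORef
    import Foreign.Marshal.Alloc (malloc)
    import Foreign.Ptr
    import Foreign.Storable (poke)

    -- Reference counter: lock-free, but each bump installs a boxed
    -- thunk via CAS on the MutVar; under contention it collapses (the
    -- 45-second number above).
    incrReference :: IORef Int -> IO ()
    incrReference r = atomicModifyIORef r (\n -> (n + 1, ()))

    -- Foreign counter: an unboxed Int in malloc'd memory bumped by a
    -- C-level atomic (e.g. __sync_fetch_and_add).  Hypothetical binding.
    foreign import ccall unsafe "hs_fetch_and_add"
      c_fetchAndAdd :: Ptr Int -> Int -> IO Int

    newForeignCounter :: IO (Ptr Int)
    newForeignCounter = do { p <- malloc; poke p 0; return p }

    incrForeign :: Ptr Int -> IO ()
    incrForeign p = void (c_fetchAndAdd p 1)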
[2013.08.02] {Just observed a failure}

On machine basalt, GHC 7.6.3. But is it reproducible?

    create_and_mutate_twice: [OK]
    n_threads_mutate: [Failed]
    Did the sum end up equal to 120?
    run_barriers: [OK]
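As an aside, the run_barriers test exercises the memory-fence operations. A minimal sketch of the kind of ordering they exist for, assuming the storeLoadBarrier export from Data.Atomics:

    import Data.IORef
    import Data.Atomics (storeLoadBarrier)

    -- Dekker-style handshake: each side sets its own flag, then reads
    -- the other's.  Without a store-load fence between the write and
    -- the read, both sides can observe the other's OLD value.
    handshake :: IORef Bool -> IORef Bool -> IO Bool
    handshake mine theirs = do
      writeIORef mine True
      storeLoadBarrier      -- order our store before our load
      readIORef theirs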