id	summary	reporter	owner	description	type	status	priority	milestone	component	version	resolution	keywords	cc	os	architecture	failure	difficulty	testcase	blockedby	blocking	related
3295	Null deref by threaded runtime scheduler	A1kmm	simonmar	"Using ghc and runtime built from HEAD on Tuesday (although this has been an issue on older builds I tried as well), the ghc runtime crashes on a null deref.

The system is a 24-processor Intel Xeon 2.66 GHz shared memory system running Linux 2.6.16.60-0.34-smp in 64 bit mode (from SUSE 10 SP2).

This only happens when the program is compiled with -threaded (threaded runtime), but doesn't happen with the threaded debug runtime.

It also only happens when +RTS -N2 or a greater number of threads is passed to the runtime. It doesn't happen when -g1 is passed to turn off distributed garbage collection.
{{{
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1241594176 (LWP 20751)]
0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x994370) at rts/Schedule.c:672
672         return (waiting_for_gc || 
(gdb) list
667         //   - another thread is initiating a GC
668         //   - another Task is returning from a foreign call
669         //   - the thread at the head of the run queue cannot be run
670         //     by this Task (it is bound to another Task, or it is unbound
671         //     and this task it bound).
672         return (waiting_for_gc || 
673                 cap->returning_tasks_hd != NULL ||
674                 (!emptyRunQueue(cap) && (task->tso == NULL
675                                          ? cap->run_queue_hd->bound != NULL
676                                          : cap->run_queue_hd->bound != task)));
(gdb) print cap
$1 = (Capability *) 0x0
(gdb) print *task
$2 = {id = 1241594176, cap = 0x0, stopped = 7160552, suspended_tso = 0x0, tso = 0x7fccc4, stat = NoStatus, ret = 0x0, cond = {__data = {__lock = 1, __futex = 0, 
      __total_seq = 1, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x84e340, __nwaiters = 10033872, __broadcast_seq = 0}, 
    __size = ""\001\000\000\000\000\000\000\000\001"", '\0' <repeats 23 times>, ""@�\204\000\000\000\000\000�\032\231\000\000\000\000"", __align = 1}, lock = {__data = {
      __lock = 0, __count = 1, __owner = 0, __nusers = 0, __kind = 22668, __spins = 0, __list = {__prev = 0x5793, __next = 0x85afd}}, 
    __size = ""\000\000\000\000\001"", '\0' <repeats 11 times>, ""\214X\000\000\000\000\000\000\223W\000\000\000\000\000\000�Z\b\000\000\000\000"", __align = 4294967296}, 
  wakeup = 547420, elapsedtimestart = 202, muttimestart = 0, mut_time = 0, mut_etime = 0, gc_time = 0, gc_etime = 10033296, prev = 0x994370, next = 0x0, return_link = 0x0, 
  all_link = 0x0, prev_stack = 0x9945c0}
(gdb) info threads
* 21 Thread 1241594176 (LWP 20751)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x994370) at rts/Schedule.c:672
  20 Thread 1233201472 (LWP 20750)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
  19 Thread 1224808768 (LWP 20749)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b2df0) at rts/Schedule.c:672
  18 Thread 1216416064 (LWP 20748)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b12b0) at rts/Schedule.c:672
  17 Thread 1208023360 (LWP 20747)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9af770) at rts/Schedule.c:672
  16 Thread 1199630656 (LWP 20746)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9adc30) at rts/Schedule.c:672
  15 Thread 1191237952 (LWP 20745)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9ac0f0) at rts/Schedule.c:672
  14 Thread 1182845248 (LWP 20744)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9aa5b0) at rts/Schedule.c:672
  13 Thread 1174452544 (LWP 20743)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a8a70) at rts/Schedule.c:672
  12 Thread 1166059840 (LWP 20742)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a6f30) at rts/Schedule.c:672
  11 Thread 1157667136 (LWP 20741)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a53f0) at rts/Schedule.c:672
  10 Thread 1149274432 (LWP 20740)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a38b0) at rts/Schedule.c:672
  9 Thread 1140881728 (LWP 20739)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a1d70) at rts/Schedule.c:672
  8 Thread 1132489024 (LWP 20738)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a0230) at rts/Schedule.c:672
  7 Thread 1124096320 (LWP 20737)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99e6f0) at rts/Schedule.c:672
  6 Thread 1115703616 (LWP 20736)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99cbb0) at rts/Schedule.c:672
  5 Thread 1107310912 (LWP 20735)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99b070) at rts/Schedule.c:672
  4 Thread 1098918208 (LWP 20734)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x999530) at rts/Schedule.c:672
  3 Thread 1090525504 (LWP 20733)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9979f0) at rts/Schedule.c:672
  2 Thread 1082132800 (LWP 20732)  0x00002aae2fa22548 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
  1 Thread 46927615298704 (LWP 20729)  0x00002aae2fa20553 in pthread_cond_signal@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
}}}
The values of *task and the thread in which the crash occurs is not always consistent, but the stacktrace is. For example, another crash...
{{{
(gdb) bt
#0  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
#1  0x0000000000673205 in workerStart (task=0x9915e0) at rts/Schedule.c:2033
#2  0x00002aae0cd1a143 in start_thread () from /lib64/libpthread.so.0
#3  0x00002aae0ceee8cd in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()
(gdb) print *task
$6 = {id = 1233201472, cap = 0x0, stopped = 11791697, suspended_tso = 0x0, tso = 0x25ed0f1, stat = NoStatus, ret = 0x0, cond = {__data = {__lock = 1, __futex = 0, 
      __total_seq = 1, __wakeup_seq = 0, __woken_seq = 0, __mutex = 0x84e800, __nwaiters = 10033872, __broadcast_seq = 0}, 
    __size = ""\001\000\000\000\000\000\000\000\001"", '\0' <repeats 24 times>, ""�\204\000\000\000\000\000�\032\231\000\000\000\000"", __align = 1}, lock = {__data = {__lock = 0, 
      __count = 1, __owner = 0, __nusers = 0, __kind = 875, __spins = 0, __list = {__prev = 0x33a, __next = 0x2da2b}}, 
    __size = ""\000\000\000\000\001"", '\0' <repeats 11 times>, ""k\003\000\000\000\000\000\000:\003\000\000\000\000\000\000+�\002\000\000\000\000"", __align = 4294967296}, 
  wakeup = 186902, elapsedtimestart = 24, muttimestart = 0, mut_time = 0, mut_etime = 0, gc_time = 0, gc_etime = 10033296, prev = 0x9b4930, next = 0x0, return_link = 0x0, 
  all_link = 0x0, prev_stack = 0x9b4b80}
(gdb) info threads
  22 Thread 1249986880 (LWP 20846)  0x00002aae0cd20548 in __lll_mutex_lock_wait () from /lib64/libpthread.so.0
  21 Thread 1241594176 (LWP 20845)  0x00002aae0cd1e1c6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 20 Thread 1233201472 (LWP 20844)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b4930) at rts/Schedule.c:672
  19 Thread 1224808768 (LWP 20843)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b2df0) at rts/Schedule.c:672
  18 Thread 1216416064 (LWP 20842)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9b12b0) at rts/Schedule.c:672
  17 Thread 1208023360 (LWP 20841)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9af770) at rts/Schedule.c:672
  16 Thread 1199630656 (LWP 20840)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9adc30) at rts/Schedule.c:672
  15 Thread 1191237952 (LWP 20839)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9ac0f0) at rts/Schedule.c:672
  14 Thread 1182845248 (LWP 20838)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9aa5b0) at rts/Schedule.c:672
  13 Thread 1174452544 (LWP 20837)  gcWorkerThread (cap=<value optimized out>) at includes/SpinLock.h:45
  12 Thread 1166059840 (LWP 20836)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a6f30) at rts/Schedule.c:672
  11 Thread 1157667136 (LWP 20835)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a53f0) at rts/Schedule.c:672
  10 Thread 1149274432 (LWP 20834)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a38b0) at rts/Schedule.c:672
  9 Thread 1140881728 (LWP 20833)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a1d70) at rts/Schedule.c:672
  8 Thread 1132489024 (LWP 20832)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9a0230) at rts/Schedule.c:672
  7 Thread 1124096320 (LWP 20831)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99e6f0) at rts/Schedule.c:672
  6 Thread 1115703616 (LWP 20830)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99cbb0) at rts/Schedule.c:672
  5 Thread 1107310912 (LWP 20829)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x99b070) at rts/Schedule.c:672
  4 Thread 1098918208 (LWP 20827)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x999530) at rts/Schedule.c:672
  3 Thread 1090525504 (LWP 20826)  0x000000000067262c in schedule (initialCapability=<value optimized out>, task=0x9979f0) at rts/Schedule.c:672
  2 Thread 1082132800 (LWP 20825)  0x00002aae0cee86e2 in select () from /lib64/libc.so.6
  1 Thread 46927031233680 (LWP 20824)  0x00002aae0cd1c2ef in pthread_mutex_lock () from /lib64/libpthread.so.0
}}}
valgrind memcheck does not report any errors prior to the NULL deref:
{{{
==20983== Thread 2:
==20983== Invalid read of size 8
==20983==    at 0x67262C: schedule (Schedule.c:672)
==20983==    by 0x673204: workerStart (Schedule.c:2033)
==20983==    by 0x58E4142: start_thread (in /lib64/libpthread-2.4.so)
==20983==    by 0x5AB78CC: clone (in /lib64/libc-2.4.so)
==20983==  Address 0x1d0 is not stack'd, malloc'd or (recently) free'd
==20983== 
==20983== Process terminating with default action of signal 11 (SIGSEGV)
}}}
valgrind helgrind reports numerous possible data race conditions, including one between schedule and GarbageCollect just prior to the crash:
{{{
==21106== Possible data race during read of size 8 at 0x5ccc300 by thread #5
==21106==    at 0x672612: schedule (Schedule.c:367)
==21106==    by 0x673204: workerStart (Schedule.c:2033)
==21106==    by 0x4A24A1E: mythread_wrapper (hg_intercepts.c:194)
==21106==    by 0x58E7142: start_thread (in /lib64/libpthread-2.4.so)
==21106==    by 0x5ABA8CC: clone (in /lib64/libc-2.4.so)
==21106==  This conflicts with a previous write of size 8 by thread #1
==21106==    at 0x679B8D: GarbageCollect (SpinLock.h:55)
==21106==    by 0x6712D0: scheduleDoGC (Schedule.c:1522)
==21106==    by 0x672C77: schedule (Schedule.c:621)
==21106==    by 0x66FF84: real_main (RtsMain.c:68)
==21106==    by 0x67009D: hs_main (RtsMain.c:117)
==21106==    by 0x5A17183: (below main) (in /lib64/libc-2.4.so)
}}}

I am still working on whether I can make a small program to reproduce this."	bug	closed	high	6.12.1	Runtime System	6.11	fixed	crash, nullderef, threaded, parallel, GC		Linux	ia64		Unknown				
