Ticket #2411 (closed merge: fixed)

Opened 5 years ago

Last modified 5 years ago

RaiseAsync and STM segfault with stop_at_atomically in some circumstances.

Reported by: sclv Owned by: igloo
Priority: high Milestone: 6.10.1
Component: Runtime System Version: 6.8.3
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

x86_64 6.9 build from June 15. I can't distill it down to a single simple testcase, but the backtrace is very informative:

ASSERTION FAILED: file RaiseAsync.c, line 1015

The assertion in question is "ASSERT(stmGetEnclosingTRec(tso->trec) == NO_TREC)". This assertion is only checked when we hit an ATOMICALLY_FRAME with stop_at_atomically set.

We only do that when Schedule.c is running the garbage collector and raises an exception to retry an invalid STM transaction. I don't follow enough here to offer a real patch, but obviously there is a false assumption that the transaction we're trying to retry does not have an enclosing trec. Why this may be the case is somewhat beyond me. I'm doing various things with unsafeIOToSTM, so I think the answer may be that there is some sort of forking going on with threaded IO which means that threads may have enclosing trecs.

#0  0x00002b134fdb9b45 in raise () from /lib64/libc.so.6
#1  0x00002b134fdbb0e0 in abort () from /lib64/libc.so.6
#2  0x00000000009b40e0 in rtsFatalInternalErrorFn (s=0xa2c670 "ASSERTION FAILED: file %s, line %u\n", 
    ap=0x41000e50) at RtsMessages.c:164
#3  0x00000000009b3ca4 in barf (s=0xa2c670 "ASSERTION FAILED: file %s, line %u\n") at RtsMessages.c:40
#4  0x00000000009b3cfe in _assertFail (filename=0xa3200c "RaiseAsync.c", linenum=1015)
    at RtsMessages.c:55
#5  0x00000000009e4675 in raiseAsync (cap=0xd734a0, tso=0x2b1351bdb000, exception=0x0, 
    stop_at_atomically=rtsTrue, stop_here=0x0) at RaiseAsync.c:1015
#6  0x00000000009e365b in throwToSingleThreaded_ (cap=0xd734a0, tso=0x2b1351bdb000, exception=0x0, 
    stop_at_atomically=rtsTrue, stop_here=0x0) at RaiseAsync.c:73
#7  0x00000000009b6b51 in scheduleDoGC (cap=0xd734a0, task=0xd8e750, force_major=rtsFalse)
    at Schedule.c:2046
#8  0x00000000009b5a9d in schedule (initialCapability=0xd734a0, task=0xd8e750) at Schedule.c:718
#9  0x00000000009b77db in workerStart (task=0xd8e750) at Schedule.c:2537
#10 0x00002b134fb73020 in start_thread () from /lib64/libpthread.so.0
#11 0x00002b134fe4df8d in clone () from /lib64/libc.so.6
#12 0x0000000000000000 in ?? ()

Attachments

stmcrash.hs (2.6 KB) - added by sclv 5 years ago.
Test case
stmc2.hs (0.7 KB) - added by sclv 5 years ago.
simplified testcase

Change History

Changed 5 years ago by sclv

  • component changed from Compiler to Runtime System

Changed 5 years ago by igloo

  • difficulty set to Unknown
  • milestone set to 6.10.1

If you have any sort of testcase then that might help track down the problem.

Changed 5 years ago by sclv

Test case

Changed 5 years ago by sclv

Ok. So I've narrowed it down to a relatively simple test case. It crashes even at -N1, but much sooner at -N4. The backtrace is the same as above.

From this test case it's pretty clear that there's an issue either with catchSTM or directly with RaiseAsync.c or with their interaction, and in fact unsafePerformIO and unsafeIOToSTM are nowhere to be found in this code.

This test case can probably be distilled down to something even simpler that doesn't use any of the typeclass toys I was playing with that ended up here in the first place. But I've taken it about as far as I can today.

As far as I know though, this is only triggered during validation during GC, which is the only time stop_at_atomically is set. So one question I have is why a thread which is already hit by an asynchronous exception should *then* be treated as invalid as well and subject to an attempt to roll back during scheduleDoGC. So it may be that there are two different mechanisms (one for exception handling via catchSTM and one for validation) that end up stepping on one another's toes. Just spinning a theory though.

I should note this bug occurs in 6.8.3 up through HEAD as far as I know.

Changed 5 years ago by sclv

simplified testcase

Changed 5 years ago by sclv

Attaching a very simplified testcase that strips out the typeclass nonsense. It looks to me like catchSTM is almost just plain hopeless. I can't piece through everything, but I note that it starts a nested transaction, while raiseAsync asserts that there is no enclosing transaction on a rollback. So, I could well be misunderstanding this, but it looks like that may be the issue?

Changed 5 years ago by simonmar

  • owner set to simonmar
  • priority changed from normal to high

prioritising crashes

Changed 5 years ago by simonmar

  • owner changed from simonmar to igloo
  • type changed from bug to merge

Fixed:

Fri Sep 26 16:28:06 PDT 2008  Simon Marlow <simonmar@microsoft.com>
  * Fix #2411: missing case for CATCH_STM_FRAME in raiseAsync()
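
For readers following the discussion above, here is a rough toy model of why a missing CATCH_STM_FRAME case could trip the assertion. It is only a sketch under stated assumptions, not the RTS code and not the actual patch: the types TRec and Thread, the functions unwind_to_atomically and scenario, and the handle_catch_stm switch are all hypothetical names made up for illustration. The only things taken from this ticket are the frame names, the idea that catchSTM starts a nested transaction, and the asserted condition that no enclosing trec remains at the ATOMICALLY_FRAME.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for RTS concepts; NOT the real GHC definitions. */
typedef enum { ATOMICALLY_FRAME, CATCH_STM_FRAME, OTHER_FRAME } FrameKind;

typedef struct TRec {
    struct TRec *enclosing;   /* NULL plays the role of NO_TREC */
} TRec;

typedef struct {
    const FrameKind *frames;  /* frames[0] is the top of the stack */
    size_t n_frames;
    TRec *trec;               /* innermost transaction record */
} Thread;

/* Walk from the top of the stack to the first ATOMICALLY_FRAME.
 * Returns true if the thread is left with a top-level transaction
 * (the condition the failing ASSERT expects), false if a nested
 * record is still current or no ATOMICALLY_FRAME was found. */
bool unwind_to_atomically(Thread *t, bool handle_catch_stm)
{
    for (size_t i = 0; i < t->n_frames; i++) {
        switch (t->frames[i]) {
        case CATCH_STM_FRAME:
            if (handle_catch_stm && t->trec != NULL) {
                /* Discard the nested transaction and fall back to its parent. */
                t->trec = t->trec->enclosing;
            }
            break;
        case ATOMICALLY_FRAME:
            return t->trec != NULL && t->trec->enclosing == NULL;
        case OTHER_FRAME:
            break;
        }
    }
    return false;
}

/* Hypothetical scenario: a thread interrupted inside the body of a
 * catchSTM that itself runs inside atomically. */
bool scenario(bool handle_catch_stm)
{
    TRec outer = { NULL };
    TRec inner = { &outer };
    const FrameKind stack[] = { OTHER_FRAME, CATCH_STM_FRAME, ATOMICALLY_FRAME };
    Thread t = { stack, 3, &inner };
    return unwind_to_atomically(&t, handle_catch_stm);
}
```

In this model, scenario(false) reaches the ATOMICALLY_FRAME with the nested record still current, which corresponds to the failed assertion, while scenario(true) pops the nested record at the CATCH_STM_FRAME first and arrives with a top-level transaction.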

Changed 5 years ago by simonmar

  • architecture changed from Unknown to Unknown/Multiple

Changed 5 years ago by simonmar

  • os changed from Unknown to Unknown/Multiple

Changed 5 years ago by igloo

  • status changed from new to closed
  • resolution set to fixed

Merged
