Ticket #2910 (closed merge: fixed)

Opened 4 years ago

Last modified 4 years ago

throwTo can block indefinitely when target thread finishes with exceptions blocked

Reported by: int-e Owned by: igloo
Priority: normal Milestone: 6.10.2
Component: Runtime System Version: 6.10.1
Keywords: Cc:
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Difficulty: Easy (less than 1 hour)
Test Case: Blocked By:
Blocking: Related Tickets:

Description

throwTo can block indefinitely when the target thread has exceptions blocked at thread creation time. The following test program demonstrates this problem.

import Control.Exception
import GHC.Conc
  
main = do
    t1 <- block $ forkIO yield
    t2 <- forkIO $ killThread t1
    threadDelay 1000000
    threadStatus t1 >>= print
    threadStatus t2 >>= print

can print (and does fairly reliably for me)

ThreadFinished
ThreadBlocked BlockedOnException

See also  http://www.haskell.org/pipermail/reactive/2009-January/000197.html

Attachments

2910-more.patch Download (132.5 KB) - added by int-e 4 years ago.
Some other changes that I feel should be made in throwTo()
2910.patch Download (133.0 KB) - added by int-e 4 years ago.
Fix another race between throwTo and finishing threads - version 2, which makes sure that holding the TSO lock actually prevents the race.

Change History

  Changed 4 years ago by simonmar

  • owner set to simonmar
  • difficulty set to Easy (1 hr)
  • os changed from Linux to Unknown/Multiple
  • architecture changed from x86 to Unknown/Multiple
  • milestone set to 6.10.2

Good bug.

  Changed 4 years ago by simonmar

  • owner changed from simonmar to igloo
  • type changed from bug to merge

Fixed:

Tue Jan  6 15:32:54 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * wake up the blocked exception queue on ThreadFinished; fixes #2910

Changed 4 years ago by int-e

Some other changes that I feel should be made in throwTo()

follow-up: ↓ 4   Changed 4 years ago by int-e

As the patches I just attached suggest, this race is not completely fixed. (I'm pretty certain - Conal's TestRace? program locks up without the first patch, but works fine so far with it. I also have a modified version that logs thread creation and throwTo and shows the program lock up with all threads finished except the main thread, which is blocked on an exception.)

The second patch contains changes unrelated to this bug which I'm not 100% certain about - but they felt necessary.

Changed 4 years ago by int-e

Fix another race between throwTo and finishing threads - version 2, which makes sure that holding the TSO lock actually prevents the race.

in reply to: ↑ 3 ; follow-up: ↓ 5   Changed 4 years ago by simonmar

Replying to int-e:

As the patches I just attached suggest, this race is not completely fixed. (I'm pretty certain - Conal's TestRace? program locks up without the first patch, but works fine so far with it. I also have a modified version that logs thread creation and throwTo and shows the program lock up with all threads finished except the main thread, which is blocked on an exception.) The second patch contains changes unrelated to this bug which I'm not 100% certain about - but they felt necessary.

Thanks - it's nice to have someone else looking at this code!

In response to your patches:

hunk ./rts/RaiseAsync.c 418
 	// Unblocking BlockedOnSTM threads requires the TSO to be
 	// locked; see STM.c:unpark_tso().
 	if (target->why_blocked != BlockedOnSTM) {
+	    unlockTSO(target);
 	    goto retry;
 	}
 	if ((target->flags & TSO_BLOCKEX) &&

well spotted.

hunk ./rts/RaiseAsync.c 440
 	// thread is blocking exceptions, and block on its
 	// blocked_exception queue.
 	lockTSO(target);
+	if (target->why_blocked != BlockedOnCCall &&
+ 	    target->why_blocked != BlockedOnCCall_NoUnblockExc) {
+	    unlockTSO(target);
+	    return;
+	}
 	blockedThrowTo(cap,source,target);
 	*out = target;
 	return THROWTO_BLOCKED;

again, well spotted - except that we want goto retry rather than return.

hunk ./rts/RaiseAsync.c 267
 		target = target->_link;
 		goto retry;
 	    }
+	    // The thread may also have finished in the meantime.
+	    if (target->what_next == ThreadKilled
+		|| target->what_next == ThreadComplete) {
+		unlockTSO(target);
+		return THROWTO_SUCCESS;
+	    }
 	    blockedThrowTo(cap,source,target);
 	    *out = target;
 	    return THROWTO_BLOCKED;

lockTSO() doesn't lock the what_next field, only the blocked_exceptions field, so I think this change is not necessary.

hunk ./rts/RaiseAsync.c 555
 void
 awakenBlockedExceptionQueue (Capability *cap, StgTSO *tso)
 {
+    lockTSO(tso);
+    // Taking the tso lock before the following check assures that we
+    // wait for any throwTo that may just be adding a new thread to the
+    // queue. This is essential, because we may not get another chance
+    // to wake up that thread.
     if (tso->blocked_exceptions != END_TSO_QUEUE) {
hunk ./rts/RaiseAsync.c 561
-	lockTSO(tso);
 	awakenBlockedQueue(cap, tso->blocked_exceptions);
 	tso->blocked_exceptions = END_TSO_QUEUE;
hunk ./rts/RaiseAsync.c 563
-	unlockTSO(tso);
     }
hunk ./rts/RaiseAsync.c 564
+    unlockTSO(tso);
 }    

This is not necessary. However, while figuring out why, I did find the real bug. Threads that fall through the cracks and end up on the blocked_exceptions list of a finished or blocked target thread are supposed to be caught by the GC (see comments at line 216 in MarkWeak.c). However, this wasn't working in the case when the target thread had finished, because maybePerformBlockedException() wasn't handling the ThreadComplete or ThreadKilled case, so I've fixed that.

in reply to: ↑ 4   Changed 4 years ago by int-e

Replying to simonmar:

Thanks - it's nice to have someone else looking at this code!

You're welcome.

lockTSO() doesn't lock the what_next field, only the blocked_exceptions field, so I think this change is not necessary.

Ah, I see. But I believe that the change still makes sense - see below.

Threads that fall through the cracks and end up on the blocked_exceptions list of a finished or blocked target thread are supposed to be caught by the GC ...

And that's the part I missed, although admittedly I'm a bit unhappy about waiting for the next GC. Does the RTS perform a GC when it finds no other work? In any case, there'll be some wait.

Now I believe we can prevent this from happening, with those two hunks above. The key idea is that once the what_next field is set to ThreadComplete or ThreadKilled, it will not be modified again.

As you wrote in the comment in scheduleHandleThreadFinished, what_next has already been set when awakenBlockedExceptionQueue is called. So the only scenario we have to prevent is that a thread throwing an exception finds its target running, and then adds itself to the target's blocked_exception queue, with the target thread completing and running awakenExceptionQueue inbetween those two steps.

This can be accomplished by making awakenExceptionQueue take the TSO lock every time, *and* checking whether the target has finished between taking the TSO lock and calling blockedThrowTo for all calls to blockedThrowTo, unless we can prove that the thread cannot finish in the meantime.

My changes only covered the NotBlocked case. I believe that in the Blocked* cases, the thread cannot finish in the meantime (they lock the TSO, directly or indirectly, and then check that the thread is still blocked - which implies that it has not finished), but I'm 100% not certain.

To summarize: We would not use the TSO lock to protect the what_next field - we'd use (or abuse?) it to prevent a specific race between blockedThrowTo and awakenExceptionQueue.

I think the benefits are clear: We avoid one case of the RTS having to wait for a GC.

The cost seems bearable: awakenExceptionQueue is only called when a thread finishes or when it returns from a C call (and in the latter case, we could continue to use the old variant). Both cases aren't exactly fast paths. Then there's a cost in code complexity (the reasoning is fairly tricky) - but that's your judgement call.

  Changed 4 years ago by simonmar

Yes, I realise your patch was aimed at closing the race. I was worried about the cost of doing an unconditional lockTSO on thread exit, and the complexity of the invariant. However, I haven't been able to measure a difference in performance (yet!) so I'll probably go with your version (but also with my fixes, a little extra robustness won't hurt).

The GC runs after 0.3 seconds of non-activity, BTW. This is tunable with the +RTS -I flag.

Also, while playing with Conal's TestRace? program I found two more races, patches to follow. yay!

  Changed 4 years ago by simonmar

More patches to merge:

Wed Jan  7 11:20:26 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * putMVar and takeMVar: add write_barrier() to fix race with throwTo

Wed Jan  7 12:06:52 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * fix a race where the timer signal could remain turned off, leading to deadlock

Wed Jan  7 12:07:34 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * maybePerformBlockedException() should handle ThreadComplete/ThreadKilled
  Part of the fix for #2910

Wed Jan  7 12:08:08 GMT 2009  Bertram Felgenhauer <int-e@gmx.de>
  * Fix two more locking issues in throwTo()

Wed Jan  7 12:11:42 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * add comment

Wed Jan  7 14:05:07 GMT 2009  Simon Marlow <marlowsd@gmail.com>
  * Close the races between throwTo and thread completion
  Any threads we missed were being caught by the GC (possibly the idle
  GC if the system was otherwise inactive), but that's not ideal.  The
  fix (from Bertram Felgenhauer) is to use lockTSO to synchronise,
  imposing an unconditional lockTSO on thread exit.  I couldn't measure
  any performance overhead from doing this, so it seems reasonable.

  Changed 4 years ago by igloo

  • status changed from new to closed
  • resolution set to fixed

All 7 merged

  Changed 4 years ago by simonmar

  • difficulty changed from Easy (1 hr) to Easy (less than 1 hour)
Note: See TracTickets for help on using tickets.