Ticket #367 (new bug: None)

Opened 7 years ago

Last modified 4 months ago

Infinite loops can hang Concurrent Haskell

Reported by: simonpj Owned by:
Priority: lowest Milestone: _|_
Component: Compiler Version: 6.4.1
Keywords: scheduler allocation Cc: ganesh.sittampalam@…, SamB, leon.p.smith@…, pho@…
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by simonmar) (diff)

An infinite loop that does not allocate can hang 
Concurrent Haskell, becuase no thread switching 
occurs.  Demo code below (from Koen Claessen).

Bites occasionally, but not often.

Simon



module Main where

import Control.Concurrent
  ( forkIO
  , threadDelay
  , killThread
  , newEmptyMVar
  , takeMVar
  , putMVar
  )

import Data.IORef

import IO( hFlush, stdout )

timeout :: Int -> a -> IO (Maybe a)
timeout n x =
  do put "Race starts ..."
     resV <- newEmptyMVar
     pidV <- newEmptyMVar

     let waitAndFail =
           do put "Waiting ..."
              threadDelay n
              put "Done waiting!"
              putMVar resV Nothing

         eval =
           do put "Evaluating ..."
              x `seq` put "Done!"
              putMVar resV (Just x)

     -- used "mfix" here before but got non-termination 
problems
     -- (not sure they had anything to do with mfix)
     pid1  <- forkIO $ do pid2 <- takeMVar pidV
                          eval
                          killThread pid2
     pid2  <- forkIO $ do waitAndFail
                          killThread pid1
     putMVar pidV pid2

     put "Blocking ..."
     takeMVar resV

put s =
  do putStrLn s
     hFlush stdout

main =
  do timeout 1 (sum (repeat 1))
<<<

The above program produces the following (expected 
result):

>>>
Race starts ...
Blocking ...
Evaluating ...
Waiting ...
Done waiting!
<<<

If you replace 'sum (repeat 1)' by 'last (repeat 1)' the
program hangs.


Change History

Changed 6 years ago by simonmar

  • difficulty set to Unknown
  • version changed from None to 6.4.1
  • os set to Unknown
  • architecture set to Unknown
  • description modified (diff)

Changed 6 years ago by igloo

  • milestone set to 6.8

One reason this is annoying is that it means you can't have a manager thread running code (which you have no reason to believe won't go into an infinite loop or generate an infinite amount of output) with a timeout.

For example, lambdabot can't just use something like the timeout function above to ensure its modules don't go into an infinite loop. Another example is not being able to give up and return a bad result if you run out of time in something like the ICFP contest.

Changed 5 years ago by guest

  • cc ganesh@… added

Changed 5 years ago by guest

  • cc ganesh.sittampalam@… added; ganesh@… removed

Changed 5 years ago by igloo

  • milestone changed from 6.8 branch to _|_

Changed 4 years ago by simonmar

  • architecture changed from Unknown to Unknown/Multiple

Changed 4 years ago by simonmar

  • os changed from Unknown to Unknown/Multiple

Changed 3 years ago by SamB

This is just a gratuitous comment to add the word "scheduler" to the page.

Changed 3 years ago by SamB

  • keywords scheduler allocation added

Changed 3 years ago by SamB

According to CapabilitiesAndScheduling:

We do have a time-slice mechanism: the timer interrupt (see Timer.c) sets the context_switch flag, which causes the running thread to return to the scheduler the next time a heap check fails (at the end of the current nursery block). When a heap check fails, the thread doesn't necessarily always return to the scheduler: as long as the context_switch flag isn't set, and there is another block in the nursery, it resets Hp and HpLim?? to point to the new block, and continues.

To fix this bug, we need a way for the timer signal handler to *force* the Haskell code to stop in bounded time.

Two ways that come to mind for the handler to force a stop are:

  1. insert a breakpoint (or a special jump?) at some pre-arranged point in any arbitrarily-long allocation-free loop, such that the breakpoint signal handler can safely enter the schedular
  2. use some sort of instruction-by-instruction Call Frame Information, something like that of DWARF 2 and up, to figure out the innermost frame's stack layout.

Approach 1 is kind of icky insofar as it might cause spurious stops in other threads, or worse!

For those unfamiliar with it: DWARF's CFI represents a function of type (IP value × register name) → Maybe (how to find the value of that register in the caller), where Nothing means "that register got clobbered". Obviously, we would also need information about which of the interrupted code's registers and stack slots represented pointers that should be followed by the garbage collector for approach 2 to work.

Any other ideas about how the timer handler can guarantee re-entering the scheduler in bounded time?

There *is* the obvious "check every time around a non-allocating loop" approach, but it seems obvious that that would cost far too much where we can least afford it. (Is this actually true? So many things that seem obvious aren't...)

Changed 3 years ago by SamB

  • cc SamB added
  • owner nobody deleted
  • status changed from assigned to new

Changed 3 years ago by igloo

  • failure set to Incorrect result at runtime

Changed 4 months ago by lpsmith

  • cc leon.p.smith@… added

Changed 4 months ago by PHO

  • cc pho@… added
Note: See TracTickets for help on using tickets.