Ticket #5897 (closed bug: fixed)

Opened 16 months ago

Last modified 12 months ago

GHC runtime task workers are not released with C FFI

Reported by: sanketr Owned by:
Priority: highest Milestone: 7.4.3
Component: Runtime System Version: 7.4.1
Keywords: worker, ffi Cc: sanket.agrawal@…
Operating System: Unknown/Multiple Architecture: x86_64 (amd64)
Type of failure: None/Unknown Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets: #4262

Description

I have a test code which calls C FFI to collect data every n microseconds. The timer event in Haskell code spawns one thread for each C FFI thread. Those C FFI threads call back, and coordinates with calling GHC thread through mvar. What I am consistently seeing is the increase in number of runtime task workers with each iteration of timer event. The attached test code reproduces the issue (please see attached README on how to run it).

I tested these on both Mac and Redhat Linux x86_64 with GHC 7.4.1, and was able to reliably reproduce the issue.

The end result is that if number of C FFI threads is beyond a certain threshold (6 on my quad-core iMac), the number of runtime tasks seem to increase without bounds. For example, here is a sample RTS output from attached test code, with 7 C FFI threads, and 2 GHC threads, after two iterations of the call to C FFI - instead of 4 task workers, there are 10 - tested with "-N3 +RTS -s" on GHC 7.4.1 and Mac 10.7.2 (quad-core iMac):

Parallel GC work balance: nan (0 / 0, ideal 1)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.42s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  2 (worker) :    0.00s    (  0.93s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.00s    (  0.93s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.00s    (  0.93s)       0.00s    (  0.00s)
  Task  5 (worker) :    0.00s    (  0.93s)       0.00s    (  0.00s)
  Task  6 (worker) :    0.00s    (  0.93s)       0.00s    (  0.00s)
  Task  7 (worker) :    0.06s    (  1.00s)       0.00s    (  0.00s)
  Task  8 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  9 (worker) :    0.06s    (  1.43s)       0.00s    (  0.00s)
  Task 10 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)

The culprit seems to be mvar callback by C FFI. If I remove mvar callback, the number of task workers stay constant at 4.

If this is a bug in GHC runtime and not my code, it seems to be a big bug because mvar callback is important for coordination with C FFI threads. This bug might have been in previous versions of GHC as well, but probably not discovered because it seems to require a certain C FFI thread count threshold to kick in.

Attachments

5897.tgz Download (2.4 KB) - added by sanketr 16 months ago.
Test code to reproduce the issue
5897_simpler.tgz Download (2.2 KB) - added by sanketr 16 months ago.
Simpler test code to reproduce the bug

Change History

Changed 16 months ago by sanketr

Test code to reproduce the issue

  Changed 16 months ago by sanketr

Just figured out how to reproduce the segmentation fault/bus error I have been seeing with more complicated code this test code is derived from. The test code will consistently produce bus error or segmentation fault on Mac OS 10.7.2, after twenty or so iterations, if I set nThreads in main of T.hs to say 30. It seems to crash in pthread, during call to lock mutex in sendSignal C code - I wonder if it is related to some kind of mutex leak in interaction between GHC RTS and C FFI. gdb output from core below:

$ gdb ./T /cores/core.2372 
GNU gdb 6.3.50-20050815 (Apple version gdb-1705) (Fri Jul  1 10:50:06 UTC 2011)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries ... done

Reading symbols for shared libraries . done
Reading symbols for shared libraries ........................... done
#0  0x00007fff9321fbca in __psynch_cvwait ()
(gdb)


I can reproduce the crash consistently on quad-core iMac with "-N3" runtime option.

  Changed 16 months ago by sanketr

  • related changed from 4262 to #4262

  Changed 16 months ago by sanketr

In case this helps locate the cause of the issue further, this is what I got once on Mac when causing seg fault (I commented out mutex locking/unlocking in C sendSignal function - so, it was just mvar callback at that point when this error happened):

internal error: ARR_WORDS object entered!

(GHC version 7.4.1 for x86_64_apple_darwin) Please report this as a GHC bug: http://www.haskell.org/ghc/reportabug

  Changed 16 months ago by sanketr

Edit: it wasn't seg fault as I noted above. I was trying to cause seg fault, but got the above error instead, once. Can't reproduce it again.

  Changed 16 months ago by simonmar

  • owner set to simonmar
  • difficulty set to Unknown
  • priority changed from normal to highest
  • milestone set to 7.4.2

  Changed 16 months ago by sanketr

I am attaching simplified test code, the bare minimum that reproduces the bug of unreleased worker threads, and segmentation fault (on mac 10.7.2 x86_64). The c FFI code is now much simpler.

Changed 16 months ago by sanketr

Simpler test code to reproduce the bug

  Changed 16 months ago by simonmar

  • status changed from new to closed
  • resolution set to invalid

I think you're using Data.Vector.Storable unsafely:

getPtr :: (SV.Storable a) => SV.Vector a -> Ptr a
getPtr = unsafeForeignPtrToPtr . (\(x,_,_) -> x) . SV.unsafeToForeignPtr

This is the cause of the segfault. I've fixed the program:

import Control.Concurrent (forkIO, threadDelay, MVar, newEmptyMVar, putMVar, takeMVar, readMVar)
import Control.Monad (mapM, mapM_, forM, forM_)
import Control.Exception
import System.Exit
import Foreign
import Foreign.C

-- a "wrapper" import is a converter for converting a Haskell 
-- function to a foreign function pointer
foreign import ccall "wrapper"
  syncWithCWrap :: IO () -> IO (FunPtr (IO ()))

foreign import ccall safe "mt.h sendSignal"
  sendSignal :: CShort -> IO()

foreign import ccall safe "test.c initThreads"
  initThreads :: CInt -> Ptr (FunPtr (IO())) -> IO()

syncWithC :: MVar CInt -> MVar CInt -> CInt -> IO ()
syncWithC m1 m2 x = do
              putMVar m2 x
              takeMVar m1 -- wait for done signal from timerevent function
              return ()

timerevent :: [MVar CInt] -> [MVar CInt] -> Int -> IO()
timerevent m1 m2 t =  run where
    run = do
    -- pause for t microseconds
    threadDelay t
    print "Processing data"
    forM_ listOfThreads $ \x -> forkIO $ sendSignal x
    -- collect mvar from each C FFI thread
    -- all C threads have been paused by sendSignal above
    mvars <- forM m2 takeMVar
    -- signal each thread to continue
    forM_ m1 (\x -> putMVar x 0)
    print $ "Processed data"
    run
      where 
      listOfThreads = [0..fromIntegral $ (length m1) - 1]

-- getPtr :: (SV.Storable a) => SV.Vector a -> Ptr a
-- getPtr = unsafeForeignPtrToPtr . (\(x,_,_) -> x) . SV.unsafeToForeignPtr

main :: IO ()
main = do
  let nThreads = 30 
  -- create two mvar lists for C FFI threads
  m1 <- mapM (const newEmptyMVar) [1..nThreads] :: IO [MVar CInt]
  m2 <- mapM (const newEmptyMVar) [1..nThreads] :: IO [MVar CInt]
  -- create callback functions for each of C thread - it will call
  -- back syncWithC with no arguments
  fnptrs <- mapM (\(x,y) -> syncWithCWrap $ syncWithC x y 0) (zip m1 m2)
  -- create a storable vector of function ptrs - we will pass ptr to
  -- function ptrs to C FFI
  vfnptrs <- newArray fnptrs
--  let vfnptrs = SV.fromList fnptrs
  -- kick off C FFI - fork in background 
  forkIO $ initThreads nThreads vfnptrs
  -- kick off timer thread to coordinate with C FFI threads - every
  -- ~0.5 seconds, it will sendSignal function in C FFI for each
  -- thread. sendSignal calls back syncWithC
  timerevent m1 m2 500
  return ()

The extra OS threads are created by "safe" foreign calls, namely sendSignal. I don't see any evidence that the RTS is creating more than it needs to.

  Changed 16 months ago by sanketr

  • owner simonmar deleted
  • status changed from closed to new
  • resolution invalid deleted

Simon, thanks for the catch. I forgot that I was getting pointer from a vector without retaining a reference to the vector (by passing it to say, long-lived timer function).

How do I determine that the increasing number of threads is not an issue, and that they will be gc-ed eventually (if RTS output is just showing worker threads that are not alive anymore)? When I run the above fixed code on Linux for a few seconds, I get RTS output like below:

....... head snipped for bug report here because too long .....
 Task 93100 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task 93101 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task 93102 (worker) :    0.00s    (  0.08s)       0.00s    (  0.00s)
  Task 93103 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task 93104 (worker) :    0.00s    (  0.14s)       0.00s    (  0.00s)
  Task 93105 (worker) :    0.00s    (  0.12s)       0.00s    (  0.00s)
  Task 93106 (worker) :    0.00s    (  0.03s)       0.00s    (  0.00s)
  Task 93107 (worker) :    0.00s    (  0.20s)       0.00s    (  0.00s)
  Task 93108 (worker) :    0.03s    (  5.10s)       0.00s    (  0.00s)
  Task 93109 (bound)  :    0.00s    (  0.00s)       0.03s    (  0.03s)

It keeps increasing without any bounds for the short time I run it. If you look at the code, sendSignal is kicked off in FFI, but then it returns, and timerevent resumes execution. So, those worker threads should be freed (eventually). What I want to make sure is that the above observation can be safely ignored, and will like to know how to determine it.

I am re-opening the bug for task worker resolution. Please close it after commenting on my observation - I hope there is no bug here.

  Changed 16 months ago by sanketr

In case it is not clear right away from the code, here is the sequence of execution from timerevent when forking FFI threads:

- timerevent forks sendSignal threads - it forks n threads corresponding to n C FFI threads - timerevent now waits on a list of mvar m2 - list has n elements, one corresponding to each sendSignal thread - each sendSignal thread calls back syncWithC function - puts an element in m2, and waits on m1 - timerevent gets all m2, and puts 0 in list of m1 - Each syncWithC thread (which was called back by sendSignal) gets its corresponding m1, and finishes. sendSignal is done. Those FFI threads are now finished.

So, if sendSignal is finished (otherwise the code will deadlock if mvars are not put/taken in right order), the number of worker threads shouldn't keep increasing with each iteration of timerevent.

  Changed 16 months ago by sanketr

Sorry, comment system mangled the steps. Here it is again:

In case it is not clear right away from the code, here is the sequence of execution from timerevent when forking FFI threads:

- timerevent forks sendSignal threads - it forks n threads corresponding to n C FFI threads
- timerevent now waits on a list of mvar m2 - list has n elements, one corresponding to each sendSignal thread
- each sendSignal thread calls back syncWithC function - puts an element in m2, and waits on m1 
- timerevent gets all m2, and puts 0 in list of m1 
- Each syncWithC thread (which was called back by sendSignal) gets its corresponding m1, and finishes. 
sendSignal is done. Those FFI threads are now finished.

So, if sendSignal is finished (otherwise the code will deadlock if mvars are not put/taken in right order), the number of worker threads shouldn't keep increasing with each iteration of timerevent.

  Changed 16 months ago by marlowsd@…

commit 085c7fe5d4ea6e7b59f944d46ecfeba3755a315b

Author: Simon Marlow <marlowsd@gmail.com>
Date:   Fri Mar 2 10:53:34 2012 +0000

    Drop the per-task timing stats, give a summary only (#5897)
    
    We were keeping around the Task struct (216 bytes) for every worker we
    ever created, even though we only keep a maximum of 6 workers per
    Capability.  These Task structs accumulate and cause a space leak in
    programs that do lots of safe FFI calls; this patch frees the Task
    struct as soon as a worker exits.
    
    One reason we were keeping the Task structs around is because we print
    out per-Task timing stats in +RTS -s, but that isn't terribly useful.
    What is sometimes useful is knowing how *many* Tasks there were.  So
    now I'm printing a single-line summary, this is for the program in
    
      TASKS: 2001 (1 bound, 31 peak workers (2000 total), using -N1)
    
    So although we created 2k tasks overall, there were only 31 workers
    active at any one time (which is exactly what we expect: the program
    makes 30 safe FFI calls concurrently).
    
    This also gives an indication of how many capabilities were being
    used, which is handy if you use +RTS -N without an explicit number.

 rts/RtsMain.c         |    3 +-
 rts/Stats.c           |   42 +++++++++-----------
 rts/Task.c            |  100 ++++++++++++++++++++++---------------------------
 rts/Task.h            |   30 +++-----------
 rts/posix/OSThreads.c |    1 -
 rts/sm/Compact.c      |    2 +-
 6 files changed, 73 insertions(+), 105 deletions(-)

follow-up: ↓ 13   Changed 16 months ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

I looked into your example again. Although the RTS was not keeping all the OS threads around, it was retaining information about each OS thread that had been used as a worker in the past, and this was being printed in the output of +RTS -s. Since keeping around all this information could constitute a space leak, I've dropped it (see the commit above).

The example program above used to leak space over time, now it runs in constant space.

The per-Task stats weren't particularly useful anyway, even for me.

in reply to: ↑ 12   Changed 16 months ago by sanketr

Replying to simonmar:

I looked into your example again. Although the RTS was not keeping all the OS threads around, it was retaining information about each OS thread that had been used as a worker in the past, and this was being printed in the output of +RTS -s. Since keeping around all this information could constitute a space leak, I've dropped it (see the commit above). The example program above used to leak space over time, now it runs in constant space. The per-Task stats weren't particularly useful anyway, even for me.

Terrific! The per-task stats now look much more useful after this fix.

  Changed 12 months ago by sanketr

  • status changed from closed to new
  • resolution fixed deleted

Simon, looks like this fix didn't make it in 7.4.2. The number of task workers keep increasing, and the memory leak is still there.

  Changed 12 months ago by simonmar

  • status changed from new to merge

Sorry, it looks like that was an oversight. We'll merge it into the branch for 7.4.3 (if there is one).

  Changed 12 months ago by simonmar

  • milestone changed from 7.4.2 to 7.4.3

  Changed 12 months ago by pcapriotti

  • status changed from merge to closed
  • resolution set to fixed

Merged as d677a952d666e5e7144e60524efb6989dddeb383 (plus testsuite fix e8fae135f7b7820f7dc213182903e8e4fb5170b6).

Note: See TracTickets for help on using tickets.