Ticket #2848 (closed merge: fixed)

Opened 5 years ago

Last modified 5 years ago

threadDelay can wait forever, next time on January 22, 2009, around 20:00 UTC.

Reported by: tomekz Owned by: igloo
Priority: high Milestone: 6.10.2
Component: Runtime System Version: 6.8.1
Keywords: ticks getourtimeofday threadDelay non-threaded Cc: tomasz.zielonka@…, mariusz.gadarowski@…, przemyslaw.kosiak@…
Operating System: Linux Architecture: x86
Type of failure: Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description (last modified by igloo) (diff)

We(*) have found a serious bug in the non-threaded RTS on 32-bit *nix systems. Calling threadDelay at certain clock-time intervals can make the thread wait forever. This happens regularly, approximately every 49 days 17 hours, or, more accurately, every (2^32 / 1000) seconds. You can reproduce the bug with a simple program, if you set your machine's clock to a specially crafted time.

In this description I assume a 32-bit *nix system, where the type unsigned long has 32 bits.

If you want to see this happen, here is an instruction:

Test.hs:

import Control.Concurrent

main = threadDelay 5000000 -- 5 secs

Running on Linux, as root:

# date --set 'Thu Jan 22 20:20:10 UTC 2009' +%s; strace ./Delay
...
gettimeofday({1232655610, 12122}, NULL) = 0
gettimeofday({1232655610, 12171}, NULL) = 0
gettimeofday({1232655610, 12195}, NULL) = 0
select(0, [], [], NULL, {5, 20000})     = 0 (Timeout)
gettimeofday({1232655615, 32125}, NULL) = 0
select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
gettimeofday({1232655615, 35777}, NULL) = 0
select(0, [], [], NULL, {0, 4000})      = 0 (Timeout)
gettimeofday({1232655615, 40039}, NULL) = 0
select(0, [], [], NULL, {4294, 951296}

The runtime slept for more than 5 seconds, but it didn't wake up the thread, but instead started waiting for an evil looking amount of time.

I've managed to find the exact location of the bug, in function getourtimeofday() in rts/posix/Itimer.c. Other parts of non-threaded RTS, and especially the function awaitEvent(rtsBool) in rts/posix/Select.c, assume that the results of getourtimeofday() use the full range of values in lnat. But this isn't so. Let's take a look at the current version of this function:

lnat
getourtimeofday(void)
{
  struct timeval tv;
  nat interval;
  interval = RtsFlags.MiscFlags.tickInterval;
  if (interval == 0) { interval = 50; }
  gettimeofday(&tv, (struct timezone *) NULL);
  	// cast to lnat because nat may be 64 bit when int is only 32 bit
  return ((lnat)tv.tv_sec * 1000 / interval +
	  (lnat)tv.tv_usec / (interval * 1000));
}

Look at the first half of the expression in the return statement:

(lnat)tv.tv_sec * 1000 / interval

Here, tv.tv_sec is first converted to lnat, than multiplied by 1000, and finally divided by interval. lnat is defined as "unsigned long" - on a 32-bit system this is usually a 32-bit type. The result of multiplying by 1000 will have its higher bits lost and it will be in the range [0 .. 2^32 - 1]. Dividing it by interval will shrink the range to [0 .. (2^32 - 1) / interval]. With default value of tickInterval - which is 50 - this will be [0 .. 85899345]. Adding the second part of the expression can only increase it by (999999 / (interval * 1000)), which is equal to 19 with default settings.

So, for default parameters, the result of getourtimeofday() will be in [0 .. 85899364]. Why is it a problem? It's quite simple: When a thread executes threadDelay the runtime calculates the tick at which it should be woken up. In our example, threadDelay is executed at tick 85899266, and the time of wake-up is probably calculated as 85899266 + 5 * (1000 / 50) = 85899366. This is greater then the biggest value getourtimeofday() can return.

At this point I've probably already given too much detail, so I'll stop. I think the fix is quite easy: just perform the problematic part of computation on a wider integral type and cast it to lnat at the end.

There is also an easy workaround: compile your program with -threaded. The threaded runtime does not use this code, and AFAICT, the time is tracked using a 64-bit number of microseconds.

When I was trying to understand the problem, for some time the culprit for me was the "clever trick" mentionad at the top of rts/posix/Select.c. Now I understand it better and think it's innocent.

We have found this bug in GHC 6.6, and it's still there in the HEAD.

(*) "We" means Mariusz Gądarowski, Przemysław Kosiak, Jakub Bogusz and Tomasz Zielonka - employees of Gemius SA in Poland. Mariusz and Przemek encountered this bug in their haskell system for distributing computations on a cluster. Mariusz and Jakub provided data about the symptoms of the problem - that was invaluable because, as you can now see, the bug was very difficult to reproduce. Tomasz finally spent four hours in solitude to nail the bug, and another hour to write this report. The problem repeated itself for serveral months, and in desperation Mariusz started to think about rewriting the system in C++. Unfortunately, there is still such a posibility, because we have another similar problem to solve. We are more optimistic now, but this event raised some doubts about the development model of GHC. We are afraid that with the current emphasis on adding new features you may be adding bugs faster that you are removing them :-/

Change History

  Changed 5 years ago by tomekz

  • keywords ticks getourtimeofday threadDelay non-threaded added

  Changed 5 years ago by tomekz

  • version changed from 6.10.1 to 6.8.1
  • os changed from Unknown/Multiple to Linux
  • architecture changed from Unknown/Multiple to x86

  Changed 5 years ago by tomekz

  • summary changed from ThreadDelay can wait forever, next time on January 22, 2009, around 20:00 UTC. to threadDelay can wait forever, next time on January 22, 2009, around 20:00 UTC.

  Changed 5 years ago by tomekz

  • cc tomasz.zielonka@…, mariusz.gadarowski@…, przemyslaw.kosiak@… added

  Changed 5 years ago by tomekz

Sorry for messing the report up with superscripts. I wanted to use the character literally, as an exponentiation operator. Is there a way to edit the description now?

  Changed 5 years ago by tomekz

Damn!

  Changed 5 years ago by igloo

  • difficulty set to Unknown
  • description modified (diff)

in reply to: ↑ description   Changed 5 years ago by igloo

Replying to tomekz:

We are more optimistic now, but this event raised some doubts about the development model of GHC. We are afraid that with the current emphasis on adding new features you may be adding bugs faster that you are removing them :-/

If you want to discuss this, I think it would be best to bring it up on the glasgow-haskell-users mailing list, rather than burying it in a bug report where most people won't see it and where it could drown out the actual bug report.

follow-up: ↓ 1   Changed 5 years ago by tomekz

Good point.

Perhaps we are exaggerating. What this bug can tell us is probably not about the development model, but about how much GHC is used for critical, continuously running systems. One would think that someone should already encounter this. On the other hand, the bug won't manifest itself in a threaded runtime, and such systems may tend to use it... In fact, the main components of our system use the threaded RTS, and bug manifested itself in a small program which is the command-line interface to the system.

in reply to: ↑ 9   Changed 5 years ago by dons

tomekz, Galois has a lot of continuously running systems (as there are also in the wider Haskell industrial and open source world (e.g. xmonad!)), and I'm not aware of anyone encountering this. Typically we use the threaded runtime though, if using threadDelay, for any serious system, which might explain the result.

  Changed 5 years ago by tomekz

Ha! I've just found these words from Simon Marlow, in glasgow-haskell-users thread "ghci and ghc -threaded broken with pipes & forking" from March 2007:

In fact you should think of the non-threaded RTS as deprecated. It isn't Haskell'-compliant, for one thing (assuming that Haskell' will probably require non-blocking foreign calls).

I'm hesitant to actually deprecate it, for a few reasons: the threaded RTS is so much more complicated, it might have some adverse performance impliciations, and there are still people who want to run everything in a single OS thread, for whatever reason. But having multiple variants of the RTS is a maintenance and testing headache.

One more reason to deprecate the non-threaded RTS, or at least to pick the threaded by default. I'll add something to that discussion.

  Changed 5 years ago by igloo

  • priority changed from normal to high
  • milestone set to 6.10.2

  Changed 5 years ago by igloo

Thanks for the detailed diagnosis!

  Changed 5 years ago by simonmar

  • owner set to simonmar

  Changed 5 years ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

Thanks for a great bug report. The code is now:

  // Avoid overflow when we multiply seconds by 1000.  See #2848
  return (lnat)((StgWord64)tv.tv_sec * 1000 / interval +
                (StgWord64)tv.tv_usec / (interval * 1000));

fixed by this patch:

Tue Dec  9 10:56:00 GMT 2008  Simon Marlow <marlowsd@gmail.com>
  * Fix #2848: avoid overflow during time calculation

  Changed 5 years ago by simonmar

  • status changed from closed to reopened
  • type changed from bug to merge
  • resolution fixed deleted

oops, still needs to be merged.

  Changed 5 years ago by simonmar

  • owner changed from simonmar to igloo
  • status changed from reopened to new

  Changed 5 years ago by igloo

  • status changed from new to closed
  • resolution set to fixed

Merged

Note: See TracTickets for help on using tickets.