Ticket #5865 (closed bug: fixed)

Opened 15 months ago

Last modified 13 months ago

threadDelay is disrupted by system clock changes

Reported by: joeyadams Owned by: pcapriotti
Priority: high Milestone: 7.6.1
Component: Runtime System Version: 7.4.1
Keywords: Cc: anton.nik@…, tibbe, bos
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Difficulty: Unknown
Test Case: Blocked By:
Blocking: Related Tickets:

Description

threadDelay is sensitive to changes to the system clock. Two behaviors are observed:

  • When the system clock is moved forward, threadDelay will timeout prematurely, because it thinks that amount of time elapsed.
  • When the system clock is moved backward, threadDelay will take longer to complete, because it's waiting for the system clock to catch up.

Whether these behaviors are present depends on both the operating system and the use of -threaded. Here are the configurations I tested:

  • ghc-7.4.1 Linux 64-bit: Disrupted by both forward and backward clock changes
  • ghc-7.4.1 Linux 64-bit, -threaded: Disrupted only by backward clock changes
  • ghc-7.2.2 Windows 32-bit: Not disrupted by clock changes (behaves correctly)
  • ghc-7.2.2 Windows 32-bit, -threaded: Disrupted only by backward clock changes

To reproduce the problem, first compile the attached program. It uses System.Posix.Clock from the "clock" package to get the time, unaffected by system clock changes. Then run it. It will produce output like this:

0: 0s
1: 10s

Set the system clock forward by a minute. On Linux without -threaded, it will print something like this:

5: 50s
6: 52s

6 is printed 2 seconds after 5, because threadDelay didn't wait a full 10 seconds like it should have.

Now set the system clock backward by a minute. On Windows with -threaded, or on Linux with or without -threaded, it will print something like this:

3: 30s
4: 101s

4 is printed over a minute after 3, because threadDelay waited for the system clock to recover the time subtracted by changing the time.

For convenience, here's a way to fix the system clock on Linux after fiddling around with it:

sudo ntpdate pool.ntp.org

Digging through the source, each configuration seems to boil down to the following system calls:

  • Linux: gettimeofday (via getourtimeofday in posix/Itimer.c)
  • Linux, -threaded: gettimeofday, epoll_wait

Perhaps the calls to gettimeofday and GetSystemTimeAsFileTime? should be replaced with monotonic time functions (i.e. ones not affected by system clock changes):

Attachments

threadDelay-clock-change.hs Download (346 bytes) - added by joeyadams 15 months ago.
Uses the "clock" package to print monotonic times
5865-posix-nonthreaded.patch Download (4.2 KB) - added by pcapriotti 14 months ago.
5865-posix-threaded.patch Download (5.6 KB) - added by pcapriotti 14 months ago.
5865-darwin-nonthreaded.patch Download (1.3 KB) - added by pcapriotti 14 months ago.
5865-darwin-threaded.patch Download (2.3 KB) - added by pcapriotti 14 months ago.
5865-win32-threaded.2.patch Download (4.2 KB) - added by pcapriotti 13 months ago.

Change History

Changed 15 months ago by joeyadams

Uses the "clock" package to print monotonic times

  Changed 15 months ago by simonmar

  • priority changed from normal to high
  • difficulty set to Unknown
  • milestone set to 7.4.2

Nice bug report, thanks.

Interestingly I changed the ticker to use CLOCK_MONOTONIC recently, but this is separate from the code that handles threadDelay. See: 6b1098511aaabd2c9503ee7be6da1944466f9cb4

  Changed 15 months ago by simonpj

  • owner set to igloo

  Changed 15 months ago by lelf

  • cc anton.nik@… added

  Changed 14 months ago by igloo

  • owner changed from igloo to pcapriotti

  Changed 14 months ago by pcapriotti

  • status changed from new to patch

Attached patches for Linux.

  Changed 14 months ago by simonmar

  • cc tibbe, bos added

For the nonthreaded patch: couldn't we just modify the definition of getProcessElapsedTime instead of adding a new function getTime? That would seem to require fewer changes, and it looks strange for getTime to call stat_getElapsedTime which calls back to getProcessElapsedTime.

For the threaded patch: are we getting HAVE_CLOCK_GETTIME from ghcautoconf.h? Perhaps it ought to come from the base package's HsBaseConfig.h? Otherwise it looks fine, but we should run it past the IO manager people (tibbe & bos, CC'd).

Changed 14 months ago by pcapriotti

Changed 14 months ago by pcapriotti

  Changed 14 months ago by pcapriotti

Thanks Simon, I addressed your comments and uploaded updated patches.

  Changed 14 months ago by simonmar

Looks great, ok by me to push.

  Changed 14 months ago by tibbe

I don't see how CLOCK_ID is used, otherwise the threaded patch looks fine to me.

  Changed 14 months ago by pcapriotti

It's used in getMonotonicTime, as a parameter to clock_gettime. The idea is to fallback to CLOCK_REALTIME if clock_gettime is available, but there's no monotonic clock.

follow-up: ↓ 14   Changed 14 months ago by pcapriotti

Added patch for win32, using QueryPerformanceCounter instead of GetSystemTimeAsFileTime.

Changed 14 months ago by pcapriotti

Changed 14 months ago by pcapriotti

  Changed 14 months ago by pcapriotti

Added patches for Mac OS X.

  Changed 14 months ago by pcapriotti

tibbe, any comments on the new patches? ok to push?

in reply to: ↑ 11   Changed 14 months ago by joeyadams

Replying to pcapriotti:

Added patch for win32, using QueryPerformanceCounter instead of GetSystemTimeAsFileTime.

Thanks for addressing the problem on Windows, too. However, two things worry me about this patch:

  • The documentation for  QueryPerformanceFrequency implies that some hardware does not support a high-resolution performance counter. On the other hand, I don't know how likely we are to see such systems running Windows 2000 or later. A  post on virtualdub.org has some interesting things to say about QueryPerformanceCounter, too.
  • The return values of QueryPerformanceCounter and QueryPerformanceFrequency aren't even checked. Thus, if one of these functions fail, the program will silently start acting strangely.

We should at the very least check the return value of QueryPerformanceCounter. Ideally, we would fall back to  GetTickCount.

Note that GetTickCount has a wraparound problem due to its 32-bit return value. There is a  GetTickCount64 function, but it is only available for Vista and later.

The easy way to work around the problem: never compare two times more than 49 days apart, and do the subtraction with the original DWORDs to make wraparound work for you. See my  StackOverflow answer about this for more details.

One way to emulate a 64-bit counter is to associate the return value of GetTickCount with a 64-bit value that represents, in a sense, the current time. Something like this:

uint64_t initMonotonicClock(void)
{
    return GetTickCount();
}

uint64_t updateMonotonicClock(uint64_t before)
{
    DWORD now    = GetTickCount();
    DWORD offset = now - (DWORD)before;
    return before + offset;
}

So for every call to getMonotonicUSec, update a global variable using the result of updateMonotonicClock. As long as getMonotonicUSec is called at least once every 49 days, any pair of return values from getMonotonicUSec can be compared without overflow issues (that is, unless the program runs for about 500 million years).

  Changed 14 months ago by pcapriotti

A comment in rts/win32/GetTime.c says:

// getProcessElapsedTime relies on QueryPerformanceFrequency 
// which should be available on any Windows computer thay you
// would want to run Haskell on. Satnam Singh, 5 July 2010.

so I assumed the same would hold for base.

I agree we should check the return value of QPF, though, but I would assume that if QPF returns true, then QPC is not going to fail.

I'm not sure what to make of the virtualdub.org post. If it's really as bad as it says there, should we actually be using timeGetTime, which has >5ms resolution? And is timeGetTime actually monotonic (I understand GetTickCount is not)?

  Changed 14 months ago by pcapriotti

Updated win32 patch so that it falls back to GetTickCount (with a 64-bit accumulator). Using GetTickCount should be perfectly safe, since it's impossible to delay more than 231 μs (less than 6 hours), because threadDelay takes an Int.

Incidentally, GetTickCount also seems unaffected by changes in the system clock, so the only difference would be in accuracy.

  Changed 14 months ago by pcapriotti

Any further comments? If not, I'll merge the patches shortly.

follow-up: ↓ 19   Changed 14 months ago by joeyadams

I don't think the Win32 patch is thread-safe. If two OS threads call threadDelay simultaneously, it's possible for static variables like static HsWord64 previous_count to be invalidated by concurrent updates, since the assignment isn't guaranteed to be atomic (to my knowledge).

in reply to: ↑ 18   Changed 14 months ago by pcapriotti

Replying to joeyadams:

I don't think the Win32 patch is thread-safe. If two OS threads call threadDelay simultaneously, it's possible for static variables like static HsWord64 previous_count to be invalidated by concurrent updates, since the assignment isn't guaranteed to be atomic (to my knowledge).

Right, writes are not atomic. So the initialization of the frequency variable is broken, too. The latter can be fixed by moving it to some global initialization function (startIOManagerThread, perhaps?), but I'm not sure how the GetTickCount accumulator can be fixed. I looked into the Interlocked* family of functions, but they are only supported from XP onwards, which is less then QueryPerformanceCounter itself.

Any suggestions?

  Changed 13 months ago by simonmar

I would use cas() to update the value of previous_count, checking that it still has the value first read, and re-do the GetTickCount() call otherwise.

This does seem a bit like overkill though, and virtually impossible to test properly. I'm tempted to just drop the use of previous_count altogether, and accept that the counter wraps after 49 days on systems where QPC doesn't work (which are in the minority, and likely to get even more rare over time).

  Changed 13 months ago by pcapriotti

The problem with cas() is that is only defined for StgWord, whereas the counter is a 64-bit value.

I would also be in favor of dropping the 64 bit counter for GetTickCount altogether.

Changed 13 months ago by pcapriotti

  Changed 13 months ago by pcapriotti

Posted simplified patch without the 64-bit fallback counter, and with separate initialization.

  Changed 13 months ago by simonmar

Ok by me to commit.

  Changed 13 months ago by p.capriotti@…

commit e57d23d6f11c1262a096f2135e64a6a81d0ca938

Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Sat Mar 24 02:19:43 2012 +0000

    Use monotonic time function for darwin (#5865)

 rts/posix/Clock.h   |    3 +++
 rts/posix/GetTime.c |   11 +++++++++++
 2 files changed, 14 insertions(+), 0 deletions(-)

  Changed 13 months ago by pcapriotti

  • status changed from patch to closed
  • resolution set to fixed
  • milestone changed from 7.4.2 to 7.6.1

Pushed in GHC:

commit 5ba7db93fb634ff8094f42780a14972322446a94
Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Tue Mar 20 11:57:28 2012 +0000

    Use monotonic clock in Select.c (#5865)

and base:

commit a865c6cfc12afd59760a1993eff51f801e27eea1
Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Mon Apr 16 14:47:36 2012 +0100

    Add timer initialization for darwin.

commit 4bea82c27d75845d0816617bcfa8514d02400b77
Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Fri Mar 23 14:26:20 2012 +0000

    Replace getUSecOfDay with monotonic timer (#5865)

commit f87f2853fd3c335bfeccd637e09fa933d55d4f2a
Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Fri Mar 23 17:22:20 2012 +0000

    Define monotonic time function for Darwin.

commit 2f308f5771fc94657a80252ac16b7fac8ee80e9c
Author: Paolo Capriotti <p.capriotti@gmail.com>
Date:   Mon Mar 19 18:00:48 2012 +0000

    Use monotonic time in Event/Manager.hs.
Note: See TracTickets for help on using tickets.