Ticket #4362 (closed bug: fixed)

Opened 3 years ago

Last modified 3 years ago

error in multithreaded program "epollControl: does not exist (No such file or directory)"

Reported by: guest Owned by:
Priority: normal Milestone: 7.0.2
Component: libraries/base Version: 7.1
Keywords: Cc: johan.tibell@…, bos@…, bit@…, mail@…
Operating System: Linux Architecture: x86
Type of failure: Incorrect result at runtime Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description

I wrote a program that is a standard multi threaded tcp server. It listens on a socket, and then it enters a loop where it does an accept for a new connection and then handles the connection on a new thread using forkIO. For each incoming connection, the program reads some data from the socket and then downloads a file from the web using the function simpleHTTP from the HTTP library.

My program runs fine most of the time, but every now and then when I run my program, all of the simpleHTTP calls throw an IOException, with the error:

epollControl: does not exist (No such file or directory)

My program also has an additional background thread, started at program startup with forkIO, that periodically calls simpleHTTP and these calls always work, and they continue to work even as the calls from the connection threads throw the above error.

From looking at the library code, it seems simpleHTTP eventually calls one of the low level socket functions which in turn calls System.Event.Thread.ensureIOManagerIsRunning, which calls System.Event.EPoll.new, which calls epoll_ctl which fails.

epoll_ctl man page tells me that the reason for this error is: "ENOENT: op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not in epfd."

As I said above, this problem is hard to reproduce, it only happens once every 30 or so invocations of my program.

I am using ghc 7.1.20100925 and Linux 2.6.31 I am compiling with -threaded -O2

I'd be happy to give any any other needed information.

Thanks

Attachments

epoll-bug.tar.gz Download (3.1 KB) - added by StefanWehr 3 years ago.
Sample program showing the epollControl errors
epoll-bug.tar.2.gz Download (3.5 KB) - added by StefanWehr 3 years ago.
updated sample programs

Change History

  Changed 3 years ago by simonmar

  • cc johan.tibell@…, bos@… added

Adding Bryan and Johan to the CC list, this looks like a problem in the IO manager.

  Changed 3 years ago by tibbe

Could you please attach a test program together with instructions for running it? It would make our job much easier.

  Changed 3 years ago by guest

I am the original reporter of this bug.

I just want to update you guys that I am working on a short test program that demonstrates this bug, but it is not easy since this bug is tricky to reproduce. Hopefully it will be ready in a few days.

Thanks

  Changed 3 years ago by tibbe

I short test program would be great. Could you also add yourself to the CC list of this bug so we have a way to contact you?

  Changed 3 years ago by bit

  • cc bit@… added

I am the original reporter of this bug.

Adding myself to the CC list as requested. Still working on a test program...

  Changed 3 years ago by igloo

  • status changed from new to infoneeded
  • milestone set to 7.0.2

  Changed 3 years ago by dleuschner

We've probably run into the same problem with 7.0.1 und amd64. We're getting LOTS of those "epollControl: does not exist" messages on every testrun. Unfortunately our client/server system consists of more than 20.000 lines of Haskell code (plus lots of libraries) and we're not quite sure which lines actually contribute to the message. We'll also try to write a simple test program. Would it be helpful if we'd set up a test-system on that you could compile and run the "big" program?

(By the way: With the same code we're also having a problem with GHC 6.12.3 causing the complete program to hang. It seems if all threads stop running. The CPU is idle, there are no garbage collections and "threadDelay" calls never return. This problem was actually the reason why we tried GHC 7.)

  Changed 3 years ago by bos

Yes, being able to compile and run the application in a way that reproduces the problem would be a big help. Maybe not as much as having a small test case, but I'll take whatever I can get :-)

  Changed 3 years ago by dleuschner

We now have the system up and running. I could reproduce the problem on every testrun. Bryan I'll send you an email with the system name, login credentials and instructions on how to reproduce the problem. If anybody else of the GHC/IO-Manager team would like to have a look, I'm glad to provide you with the credentials, if you send me an email to leuschner@…. Thanks a lot! (We know that a small test case would be way better -- we're working on it!)

  Changed 3 years ago by StefanWehr

  • cc mail@… added

  Changed 3 years ago by StefanWehr

dleuschner and myself finally managed to reproduce the problem with two fairly small sample (but not necessarily minimal) programs. I've attached a tar-archive with the sample programs.

The sample programs are a client and a server program:

* The server has three parts:

  1. It forks 10 threads and each threads downloads every 100 milliseconds some data via HTTP.
  2. Another thread reports every 10 seconds the number of successful HTTP downloads
  3. It accepts TCP connections on port 2222. When a clients connects, the server forks a new thread to serve the client. Serving the clients simple means reading some data from the client and then closing the connection.

* The client forks 10 threads which periodically open a TCP connection to the server on port 2222 and send some data.

Here are the first few lines of the logging output from the server:

drghc@docidev ~/ghc-bug $ ./Server
Run 0: 0 successful downloads
Run 1: 450 successful downloads
Run 2: 901 successful downloads
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.factisresearch.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.factisresearch.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.factisresearch.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.amazon.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.heise.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.yahoo.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.factisresearch.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.microsoft.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.heise.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.yahoo.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Thread receiveThread for <socket: 32> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Thread receiveThread for <socket: 10> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.factisresearch.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Thread receiveThread for <socket: 32> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Downloading "http://www.stefanwehr.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Thread receiveThread for <socket: 51> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Thread receiveThread for <socket: 54> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Downloading "http://www.yahoo.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://gmail.com/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.haskell.org/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)
ERROR: Downloading "http://www.yahoo.com/" failed with an exception: epollControl: does not exist (No such file or directory)
Run 3: 1319 successful downloads
ERROR: Thread receiveThread for <socket: 34> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Thread receiveThread for <socket: 58> terminated with an exception: epollControl: invalid argument (Bad file descriptor)
ERROR: Downloading "http://www.google.de/" failed with an exception: epollControl: does not exist (No such file or directory)

Some comments on the log output:

* The clients were started some time after the server outputs the log message

Run 2: 901 successful downloads

Before the clients were connected, the server ran without any errors.

* The error message

ERROR: Thread receiveThread for <socket: 32> terminated with an exception: epollControl: invalid argument (Bad file descriptor)

is in some sense expected because the TPC part of the server relies on this exception to terminate the receive thread for a client.

* Stopping the clients also stops the epollControl errors from happening. Starting the clients again causes the epollControl error to appear again.

We ran the server and the clients on the same machine and we also ran them on different machines, with no noticable difference.

However, we discovered that the epollControll errors disappear if the clients close their sockets explicitly via sClose (see the line with the comment in Clients.hs).

How to run the sample programs: Unpack the archive, execute make, then ./Server and some time later ./Clients.

In case you cannot reproduce the problem on your machine: we were able to reproduce it on the machine mentioned earlier, the code is located in /home/drghc/ghc-bug.

Many thanks for helping!

Changed 3 years ago by StefanWehr

Sample program showing the epollControl errors

  Changed 3 years ago by bos

  • status changed from infoneeded to closed
  • resolution set to invalid

Guys, this isn't related to GHC at all. I ran it under strace, and it's obvious that either the application or the http-enumerator library is at fault. I see activities like this:

12306 recvfrom(17, "1000\r\n><div id=ghe"..., 32752, 0, NULL, NULL) = 1418
12306 close(17)                         = 0

12306 recvfrom(17,  <unfinished ...>

12306 <... recvfrom resumed> 0x7fb0c64a0010, 4096, 0, 0, 0) = -1 EBADF (Bad file descriptor)

See what's happening there? Something is closing the file descriptor, then attempting to receive more data on it, and you're getting exactly the error you'd expect.

  Changed 3 years ago by StefanWehr

  • status changed from closed to new
  • resolution invalid deleted

The system call activities you've observed were caused by the sample program, however, the epoll problem is still there after removing the source for these activities.

I've attached a new version of the sample programs, which fixes two things:

  • It ensures that no socket operation is performed on a socket after it has been closed. I also checked the output of stract: no more recvfrom after a close without an intervening open.
  • It replaces http-enumerator with a very simply HTTP download function.

The server still produces the epoll errors after the clients have been connected. As before, stopping the clients also stops the epoll errors from happening. Starting the clients again causes the epoll error to appear again. Also as before, the epoll errors disappear if the clients close their sockets explicitly via sClose (see the line with the comment in Clients.hs).

I've also run the sample program under MacOSX (which does not support epoll). No problems on this platform.

GHC Version: 7.0.1

OS: Linux 2.6.34-gentoo-r1

Changed 3 years ago by StefanWehr

updated sample programs

follow-up: ↓ 15   Changed 3 years ago by dleuschner

I can't really say anything about epoll because I just looked at it today but I noticed a thing while looking at the strace output of the program Stefan posted. I seems that epoll discards watched file descriptors when they are closed but the IO manager seems to think that the file descriptor is still registered with epoll. I extracted the following sequence concerning file descriptors with index 7:

[pid 27110] accept(12, {sa_family=AF_INET, sin_port=htons(39725), sin_addr=inet_addr("127.0.0.1")}, [16]) = 7
[pid 27112] epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=2392000568567529479}}) = 0
[pid 27112] close(7)                    = 0
[pid 27111] socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
[pid 27111] <... read resumed> "127.0.0.1\tlocalhost\n178.63.197.6"..., 4096) = 70
[pid 27111] epoll_ctl(3, EPOLL_CTL_MOD, 7, {EPOLLIN|EPOLLOUT, {u32=7, u64=21500846801944583}}) = -1 ENOENT (No such file or directory)

In the last line the IO manager seems to assume that file descriptor 7 is still registered with epoll because it was added in line 2 but probably the close in line 3 invalidate or removed the file descriptor. I have written a tiny C testprogram that demonstrates that the "socket -> N", "epoll_ctl add N", "close N", "socket -> N", "epoll_ctl modify N" sequence doesn't work. Maybe that bit of information is helpful. You can find the full strace and Haskell log for the run in ~/ghc-bug/server.trace on our system.

in reply to: ↑ 14   Changed 3 years ago by tibbe

Replying to dleuschner:

I can't really say anything about epoll because I just looked at it today but I noticed a thing while looking at the strace output of the program Stefan posted. I seems that epoll discards watched file descriptors when they are closed but the IO manager seems to think that the file descriptor is still registered with epoll.

This is a known bug (#4514). Bryan has a patch that's almost done.

  Changed 3 years ago by StefanWehr

Note that bug #4514 occurs on Linux and MacOS X, wheres our sample program runs fine under Mac OS X but doesn't work under Linux.

  Changed 3 years ago by dleuschner

  • status changed from new to closed
  • resolution set to fixed

I can confirm that a checkout of the GHC 7 branch with Bryans patches from #4514 applied fix this issue for our test program and our application. Thanks a lot! :-)

I hope a fixed version of GHC will be released soon as we currently cannot use plain 6.12.3 because of #9 and we cannot use plain 7.0.1 because of this issue.

Note: See TracTickets for help on using tickets.