Ticket #2148 (closed bug: fixed)

Opened 5 years ago

Last modified 5 years ago

x86_64 code using several GiB of memory generates: internal error: ASSERTION FAILED: file sm/Storage.c, line 1126

Reported by: twhitehead
Owned by: simonmar
Priority: high
Milestone: 6.10.1
Component: Runtime System
Version: 6.8.2
Keywords:
Cc:
Operating System: Linux
Architecture: x86_64 (amd64)
Type of failure:
Difficulty: Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

I have some fairly simple code that uses ByteStrings to parse FASTA files (http://en.wikipedia.org/wiki/Fasta_format) and generate all the M-length suffixes.

When I set it to hold enough M-length suffixes in memory that several GiB of RAM are consumed, it either segfaults or loops infinitely (the pared-down code I've attached loops infinitely with the given parameters, while my original code segfaulted). I attached gdb to the attached version while it was looping and took a backtrace.

#0  0x0000000000460ef4 in free_list_push_forwards ()
#1  0x0000000000461357 in freeGroup ()
#2  0x000000000046222f in GarbageCollect ()
#3  0x000000000045d33f in scheduleDoGC ()
#4  0x000000000045dd36 in scheduleWaitThread ()
#5  0x000000000045aa01 in real_main ()
#6  0x000000000045aad3 in main ()

Disassembling the free_list_push_forwards routine revealed that it was stuck forever in the while loop because the tail of the passed block descriptor (list) was circular and the passed block descriptor had a larger block count than any of the block descriptors in its circular tail. I also noted that the block descriptors composing the tail were not in non-decreasing order of block count up to the circular point (I don't know if this is a problem though).
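The failure mode described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: `node`, `walk_terminates` and `has_cycle` are hypothetical names, and the code is a simplified stand-in for a size-ordered free list, not the actual RTS source. It shows why a forward walk that skips every group smaller than the one being inserted never stops once the tail is circular and all of its groups are smaller, and how Floyd's tortoise-and-hare check would detect such a tail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a block descriptor. */
typedef struct node {
    unsigned blocks;    /* size of this group, in blocks */
    struct node *link;  /* next group on the list */
} node;

/* Walk forward past every group smaller than `blocks`, giving up after
 * max_steps iterations. Returns true if the walk stopped on its own
 * (end of list, or a group at least as large as `blocks`). */
bool walk_terminates(const node *start, unsigned blocks, size_t max_steps)
{
    assert(max_steps > 0);
    const node *p = start;
    for (size_t i = 0; i < max_steps; i++) {
        if (p == NULL || p->blocks >= blocks)
            return true;
        p = p->link;
    }
    return false;  /* still walking: consistent with a circular tail */
}

/* Floyd's cycle detection: true if following `link` from `start`
 * eventually revisits a node. */
bool has_cycle(const node *start)
{
    const node *slow = start, *fast = start;
    while (fast != NULL && fast->link != NULL) {
        slow = slow->link;
        fast = fast->link->link;
        if (slow == fast)
            return true;
    }
    return false;
}
```

On a well-formed list the walk always ends, at the latest when it reaches NULL. With a circular tail it ends only if some group on the cycle is at least as large as the one being inserted, which is exactly the condition that did not hold in the backtrace above.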

When I compile with -debug, I get the following (the first two lines are program output):

Starting new sorted block at 139609412 (/tmp/PChunk3790.tmp)
Starting new sorted block at 255642977 (/tmp/PChunk3791.tmp)
Bug-debug: internal error: ASSERTION FAILED: file sm/Storage.c, line 1126

    (GHC version 6.8.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

I've attached the Bug.hs code (which is the pared-down version of my code). A gzipped version of the input file (~3GiB) can be found at ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. To generate the above output, I compiled with "ghc -debug -O2 --make Bug" and ran it with "./Bug-debug nr 8 67108864" (i.e., generate blocks of 67108864 8-character suffixes from the fasta file nr).

From what I've seen so far, the problem seems linked to the complexity of the code and the amount of memory used. I've never had any problems as long as I stay under 4GiB of RAM, and the more I pared down the code, the more memory I had to make it use before it would start to demonstrate the problem (with the parameters above, the pared-down code uses just over 16GiB). I could possibly pare the code down further.

I do not know at all what I'm talking about, but from the little I've seen so far, I wonder if it is related to the 64-bit memory space (e.g., having enough blocks that it can actually wrap a 32-bit counter tracking them, etc.).
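A rough sanity check of the 32-bit idea, as a sketch only. It assumes a 4 KiB block size, which is not stated anywhere in this ticket, and `blocks_for` and `truncate_to_32` are hypothetical helpers rather than RTS functions. Under that assumption, 16GiB is only about four million blocks, so a 32-bit count of blocks would not wrap. A 32-bit count of bytes, on the other hand, wraps at exactly 4GiB, which is the threshold below which no problems were seen. This does not establish the cause; it only shows which kind of counter could plausibly overflow.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_BLOCK_SIZE 4096u  /* assumption: 4 KiB blocks */

/* Number of whole blocks needed to cover `bytes`. */
uint64_t blocks_for(uint64_t bytes)
{
    /* Guard against overflow in the round-up below. */
    assert(bytes <= UINT64_MAX - (ASSUMED_BLOCK_SIZE - 1));
    return (bytes + ASSUMED_BLOCK_SIZE - 1) / ASSUMED_BLOCK_SIZE;
}

/* What a 32-bit variable would hold if handed a 64-bit byte count. */
uint32_t truncate_to_32(uint64_t bytes)
{
    return (uint32_t)bytes;  /* keeps only the low 32 bits */
}
```

For example, `blocks_for(16ULL << 30)` is 4194304 (2^22), far below 2^32, while `truncate_to_32(4ULL << 30)` is 0.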

NOTE: I compiled ghc from source from http://www.haskell.org/ghc/dist/stable/dist/ghc-6.8.2-src.tar.bz2.

Attachments

Bug.hs (7.4 KB) - added by twhitehead 5 years ago.
Pared down Haskell source demonstrating problem.
Bug.2.hs (1.6 KB) - added by twhitehead 5 years ago.
Further (greatly) simplified code still exhibiting the problem.
Bug.3.hs (1.0 KB) - added by twhitehead 5 years ago.
Further simplification (now just generates integer lists as a series of (:) thunks)

Change History

Changed 5 years ago by twhitehead

Pared down haskell source demonstrating problem.

  Changed 5 years ago by simonmar

  • owner set to simonmar
  • difficulty set to Unknown

Sounds nasty - I'll investigate.

  Changed 5 years ago by simonmar

  • component changed from Compiler to Runtime System
  • milestone set to 6.8.3

  Changed 5 years ago by twhitehead

I managed to clean up the code even further and still have it exhibit the bug. The new code just generates a list of M N-length substrings (when I said prefixes above, I meant substrings) from the fixed ByteString ['0'..'Z'].

Compiling as above ("ghc --make -O2 -debug Bug") and running with "./Bug 8 134217728" (i.e., N=8 and M=134217728) produces the following after about an hour and three quarters:

Starting block 0
Starting block 1
Memory leak detected
  gen 0 blocks :   57
  gen 1 blocks : 2523057
  nursery      :  128
  allocate()   :    0
  retainer     :    0
  arena blocks :    0
  exec         :    0
  free         : 5830
  total        : 2529072

  in system    : 3555720
Bug_: internal error: ASSERTION FAILED: file sm/Storage.c, line 1206

    (GHC version 6.8.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

I'll attach the simplified code.
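The figures in the leak report above can be cross-checked by hand. The per-category counts add up to the printed total (2529072), and the gap between that total and the "in system" figure is the number of blocks nobody accounts for, which is presumably the mismatch the assertion complains about. The sketch below does that arithmetic; the helper names are hypothetical, and the conversion to GiB assumes 4 KiB blocks, which the ticket does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum the per-category block counts from a leak report. */
uint64_t sum_blocks(const uint64_t *counts, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += counts[i];
    return total;
}

/* Blocks the system holds that no category accounts for. */
uint64_t unaccounted(uint64_t in_system, uint64_t total)
{
    assert(in_system >= total);
    return in_system - total;
}

/* Convert a block count to GiB for a given block size in bytes. */
double blocks_to_gib(uint64_t blocks, uint64_t block_size)
{
    return (double)blocks * (double)block_size / (1024.0 * 1024.0 * 1024.0);
}
```

With the numbers above, 1026648 blocks are unaccounted for, and 3555720 blocks of 4 KiB come to roughly 13.6GiB held by the process at the time of the failure.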

Changed 5 years ago by twhitehead

Further (greatly) simplified code still exhibiting the problem.

  Changed 5 years ago by twhitehead

I just noticed that I didn't read that output closely enough, and my further simplified code is actually dying on a different assert (line 1206 in sm/Storage.c versus 1126).

PS: Thanks very much for looking into these bugs. : )

  Changed 5 years ago by twhitehead

One last simplification. The program now just endlessly generates the integer list [1..M], where M is a command-line argument, from back to front as a sequence of (:) thunks.

Compiling as above ("ghc --make -O2 -debug Bug") and running with "./Bug 134217728" (i.e., M=134217728) produces the following after about two days (the third time generating the sequence):

Starting block 0
Starting block 1
Starting block 2
Bug: internal error: ASSERTION FAILED: file sm/Storage.c, line 1126

    (GHC version 6.8.2 for x86_64_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

I'll attach the simplified code.

Changed 5 years ago by twhitehead

Further simplification (now just generates integer lists as a series of (:) thunks)

  Changed 5 years ago by simonmar

Before investigating this, can you tell me which version of the program fails in the shortest amount of time? Or if you know of any RTS flag settings that make it fail sooner? It's more important to reproduce the error quickly rather than to have the simplest program. In fact for RTS bugs the program itself is sometimes not relevant at all, although having as few dependencies as possible is always good, and all other things being equal a smaller program is always better.

  Changed 5 years ago by twhitehead

The one that fails the quickest so far is Bug.2.hs (3hrs 10min).

As noted above, however, the exact error message is slightly different from the others. I have also not discovered any RTS flags that shorten the time to crash (so far I have only tried -H10g).

  Changed 5 years ago by simonmar

  • priority changed from normal to high

  Changed 5 years ago by simonmar

Sadly I have no machines with enough memory to run this program.

Also, reading back through the comments, Bug.hs apparently crashed in "an hour and three quarters", which would make it quicker than the 3hrs 10min for Bug2.hs, right? Unfortunately, from the memory leak message, it seems the RTS had been using around 13Gb at the time, which is way more than I have in any of my machines. I'll look into getting a new machine, or some more memory.

follow-up: ↓ 11   Changed 5 years ago by twhitehead

Bug2.hs is the 1hr-42min-to-crash code; I just made a mistake when quoting its runtime in the fastest-to-crash-code note above. (I also made the detailed messages above somewhat confusing by always referring to Bug.hs, as I didn't actually add the extensions until I went to upload the files.)

With regard to not having a machine with enough memory, I can get you access to our 32GiB-per-node cluster if that would help.

in reply to: ↑ 10   Changed 5 years ago by simonmar

Replying to twhitehead:

With regard to not having a machine with enough memory, I can get you access to our 32GiB-per-node cluster if that would help.

Debugging remotely is painful, especially with a bug that takes this long to reproduce. I have a machine with enough memory on order which should arrive in a week or so, which will probably be beyond the 6.8.3 cutoff, unfortunately. If you can send me a binary and a core dump from the failure, I might be able to debug it that way - the assertion failure should produce a core dump, but make sure you have core dumps enabled with "ulimit -c unlimited". On some Linux machines you have to disable the apport service too.
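For reference, "ulimit -c unlimited" in the shell raises the soft core-file size limit, and the same thing can be done from inside a process with the POSIX getrlimit/setrlimit calls. The following is only a sketch: `enable_core_dumps` and `core_limit_is_raised` are hypothetical helper names, and the code raises the soft limit to whatever the hard limit allows, which may be less than unlimited on a locked-down machine.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on failure. */
int enable_core_dumps(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;  /* an unprivileged process may go this far */
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Returns non-zero if the soft limit currently equals the hard limit. */
int core_limit_is_raised(void)
{
    struct rlimit rl;
    int rc = getrlimit(RLIMIT_CORE, &rl);
    assert(rc == 0);
    return rc == 0 && rl.rlim_cur == rl.rlim_max;
}
```

A wrapper that calls `enable_core_dumps()` before exec'ing the test binary would have the same effect as setting the limit in the launching shell, since resource limits are inherited across exec.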

  Changed 5 years ago by igloo

  • milestone changed from 6.8.3 to 6.10.1

  Changed 5 years ago by simonmar

  • status changed from new to closed
  • resolution set to fixed

Using today's HEAD, I just ran Bug2 until it used up all my memory (16Gb) and then some (apologies to the other people that were using this machine at the time :-). Last week I was able to reproduce the symptom, so I think it's likely that it was caused by one of the bugs I've fixed recently. I'm therefore going to close this ticket, but if it does still happen for you with 6.10 please re-open.
