Ticket #5111 (closed bug: duplicate)

Opened 2 years ago

Last modified 21 months ago

linux-powerpc : segfault in stage2 compiler

Reported by: erikd
Owned by:
Priority: normal
Milestone: _|_
Component: Compiler
Version: 7.1
Keywords:
Cc: pho@…
Operating System: Linux
Architecture: powerpc
Type of failure: Installing GHC failed
Difficulty:
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

After the patch to fix #4999, the compile gets a bit further and now crashes at:

"inplace/bin/ghc-stage2"   -H32m -O    -package-name vector-0.7.0.1 -hide-all-packages -i -ilibraries/vector/. -ilibraries/vector/dist-install/build -ilibraries/vector/dist-install/build/autogen -Ilibraries/vector/dist-install/build -Ilibraries/vector/dist-install/build/autogen -Ilibraries/vector/include -Ilibraries/vector/internal   -optP-DVECTOR_BOUNDS_CHECKS -optP-include -optPlibraries/vector/dist-install/build/autogen/cabal_macros.h -package base-4.3.1.0 -package primitive-0.3.1  -O2 -XHaskell98 -XCPP -XDeriveDataTypeable -O2 -XGenerics -no-user-package-conf -rtsopts     -odir libraries/vector/dist-install/build -hidir libraries/vector/dist-install/build -stubdir libraries/vector/dist-install/build -hisuf hi -osuf  o -hcsuf hc -c libraries/vector/./Data/Vector/Fusion/Stream/Monadic.hs -o libraries/vector/dist-install/build/Data/Vector/Fusion/Stream/Monadic.o
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package primitive-0.3.1 ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Segmentation fault
make[1]: *** [libraries/vector/dist-install/build/Data/Vector/Fusion/Stream/Monadic.o] Error 139
make: *** [all] Error 2

Compiling with debug on and running under gdb results in:

Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package primitive-0.3.1 ... linking ... done.
Loading package ffi-1.0 ... linking ... done.

Program received signal SIGSEGV, Segmentation fault.
0xf19ea004 in ?? ()
(gdb) bt
#0  0xf19ea004 in ?? ()
#1  0x11a30f58 in schedule (initialCapability=0x93fb0070, task=0x3dc0f1de) at rts/Schedule.c:457
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

So this is crashing in Haskell code, not the RTS.

Tracing this using putStr and friends, I've tracked the bug to TcSplice.runMeta. It seems to crash when a thunk is evaluated just as that function returns. I'm still working on tracing back from there.

Change History

  Changed 2 years ago by simonmar

Did this start failing recently? Can you trace it to a particular patch?

Another way to attack it is to find a smaller program that fails: run the testsuite with stage=1.

  Changed 2 years ago by erikd

I'm not sure when this started happening. I'm pretty sure it broke some time between 6.12.3 and 7.0.1. Initially it was only one problem (bug #4999), but after fixing that I ran into this.

I wouldn't be surprised if the fix for #4999 was incomplete, that is, I made it compile but other problems exist. Is there a way I could test that idea?

  Changed 2 years ago by PHO

  • cc pho@… added

  Changed 2 years ago by erikd

By adding Debug.Trace statements I've tracked this down to function runAnnotation in the file compiler/typecheck/TcSplice.lhs. Specifically this code:

    -- Run the appropriately wrapped expression to get the value of
    -- the annotation and its dictionaries. The return value is of
    -- type AnnotationWrapper by construction, so this conversion is
    -- safe
    flip runMetaAW zonked_wrapped_expr' $ \annotation_wrapper ->
         case annotation_wrapper of
            AnnotationWrapper value | let serialized = toSerialized serializeWithData value ->
                -- Got the value and dictionaries: build the serialized value and
                -- call it a day. We ensure that we seq the entire serialized value
                -- in order that any errors in the user-written code for the
                -- annotation are exposed at this point.  This is also why we are
                -- doing all this stuff inside the context of runMeta: it has the
                -- facilities to deal with user error in a meta-level expression
                seqSerialized serialized `seq` Annotation { 
                    ann_target = target,
                    ann_value = serialized
                }


Tracing a bit further it seems that the segfault occurs when value from the AnnotationWrapper is passed to the toSerialized function, and then typeOf is called on it.

The most obvious explanation is that something got corrupted long before we even hit this code.

  Changed 2 years ago by erikd

I've managed to confirm that this problem has existed at least since git HEAD at 2010/09/01. Since I can't build any version earlier than that, finding when this bug was introduced is unlikely.

  Changed 2 years ago by simonmar

Since it fails in runAnnotation, that probably indicates that the problem is in running interpreted code, and hence a prime suspect is the linker.

  Changed 2 years ago by erikd

Since Simon suggests this is a linker problem, I'm revisiting the file rts/Linker.c.

The Mach-O powerpc linker messes with the misalignment field of the ObjectCode struct in rts/LinkerInternals.h. I understand the reasoning for this with Mach-O object files and I'm trying to figure out if something similar is needed for ELF files on linux-powerpc.

  Changed 2 years ago by erikd

The Mach-O documentation suggests that the alignment is calculated by:

misalignment = (header.sizeofcmds + sizeof(header)) & 0xF;
return misalignment ? (16 - misalignment) : 0;

This is all about finding the start of the actual code within the object file.

My reading of the ELF documentation suggests that the ELF version should look like this:

misalignment = header.e_ehsize & 0xF;
return misalignment ? (16 - misalignment) : 0;

but that still doesn't work. For all the GHC-generated object files I've seen, the .text section immediately follows the ELF header, whose size is specified by e_ehsize.
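As a quick sanity check of the arithmetic in the two snippets above, here is a minimal Python sketch of the same round-up-to-16 calculation. The function name padding_to_16 is made up for illustration, and the header sizes (28 bytes for a 32-bit Mach-O header, 52 bytes for e_ehsize in an ELF32 header) and the sizeofcmds value are assumptions, not figures taken from this ticket or from rts/Linker.c.

```python
def padding_to_16(offset: int) -> int:
    """Bytes needed to round `offset` up to the next 16-byte boundary.

    Mirrors the two-line C fragments quoted in the ticket:
        misalignment = offset & 0xF;
        return misalignment ? (16 - misalignment) : 0;
    """
    misalignment = offset & 0xF
    return (16 - misalignment) if misalignment else 0


# Assumed header sizes for 32-bit targets (not stated in the ticket).
MACH_HEADER_SIZE = 28  # assumed sizeof(header) for a 32-bit Mach-O file
ELF32_EHSIZE = 52      # assumed e_ehsize for an ELF32 object file

# Hypothetical Mach-O load-command size, chosen only for illustration.
example_sizeofcmds = 1000

macho_pad = padding_to_16(example_sizeofcmds + MACH_HEADER_SIZE)
elf_pad = padding_to_16(ELF32_EHSIZE)

print("Mach-O padding:", macho_pad)
print("ELF padding:", elf_pad)
```

Under these assumptions the ELF variant always yields the same constant, since e_ehsize does not vary between object files of the same class.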

  Changed 2 years ago by simonpj

The conclusion here seems to be that GHC's linker doesn't work on PowerPC. This isn't one of our tier-1 supported platforms, but we'd be glad to work with anyone willing to roll up their sleeves and help fix it. Thanks erikd for your work so far.

Simon

  Changed 2 years ago by erikd

I'm willing to do the leg work on this but I currently don't have a good idea of where to go from here.

  Changed 2 years ago by simonmar

Replying to erikd:

I'm willing to do the leg work on this but I currently don't have a good idea of where to go from here.

  • Find a small test case that crashes - one of the GHCi tests, for example.
  • Find out where it is crashing with gdb, and debug it
  • You might want to talk to Greg Wright (gwright at antiope.com) who has been working in this area recently (see #4867)

  Changed 2 years ago by erikd

OK, bug #4867 has some good info.

Finding a GHCi test that crashes is easy: they all crash before GHCi prints the command prompt. Running under gdb I get:

erikd > gdb inplace/lib/ghc-stage2 
GNU gdb (GDB) 7.2-debian
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "powerpc-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2...done.
(gdb) run --interactive -Binplace/lib
Starting program: /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2 --interactive -Binplace/lib
[Thread debugging using libthread_db enabled]
[New Thread 0xf7dff490 (LWP 9874)]
[New Thread 0xf71ff490 (LWP 9875)]
GHCi, version 7.1.20110527: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.

Program received signal SIGSEGV, Segmentation fault.
0xf529d28c in ?? ()
(gdb) bt
#0  0xf529d28c in ?? ()
#1  0x10078ddc in s3b1_info ()
#2  0x11a885a8 in schedule (initialCapability=0x48a19255, task=0x3bf9fffc) at rts/Schedule.c:457
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

which suggests, as before, that it's crashing in Haskell code.

I did however get some useful info from running:

inplace/bin/ghc-stage2 --interactive +RTS -Dl

which spits out a huge pile of stuff and ends with:

Loading package ffi-1.0 ... linking ... resolveObjs: start
initLinker: start
initLinker: idempotent return
resolveObjs: done
done.
lookupSymbol: looking up base_GHCziIOziHandleziFD_stdin_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of base_GHCziIOziHandleziFD_stdin_closure is 0xf53d38d4
lookupSymbol: looking up base_GHCziIOziHandleziFD_stdout_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of base_GHCziIOziHandleziFD_stdout_closure is 0xf53d3880
lookupSymbol: looking up base_GHCziIOziHandleziFD_stderr_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of base_GHCziIOziHandleziFD_stderr_closure is 0xf53d3940
Segmentation fault

That gives me something to chase.

  Changed 2 years ago by erikd

Further debugging shows that the segfault occurs when GHCi calls GhciMonad.turnOffBuffering on stdin, stdout and stderr which are closures loaded from HSbase-4.3.1.0.o.

Still trying to get a handle on what's wrong with the object loading.

  Changed 2 years ago by erikd

Since the problem was occurring in GhciMonad.turnOffBuffering, I ran "gdb -x commands" with the commands file containing:

file inplace/lib/ghc-stage2
set args +RTS -V0 -i0 -RTS -Binplace/lib --interactive
break GhciMonad_turnOffBuffering_closure
break GhciMonad_turnOffBuffering_info
break GhciMonad_turnOffBuffering_srt

Running this, I get a new segfault with the following backtrace:

Breakpoint 1 at 0x11bd4274
Breakpoint 2 at 0x1007a760
Breakpoint 3 at 0x11bd4268
(gdb) r
Starting program: /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2 +RTS -V0 -i0 -RTS -Binplace/lib --interactive
[Thread debugging using libthread_db enabled]
[New Thread 0xf7dff490 (LWP 1748)]

Program received signal SIGSEGV, Segmentation fault.
0x11acd94c in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=2105675784) at includes/rts/storage/ClosureMacros.h:225
225         return info->type != INVALID_OBJECT && info->type < N_CLOSURE_TYPES;
(gdb) bt
#0  0x11acd94c in LOOKS_LIKE_INFO_PTR_NOT_NULL (p=2105675784) at includes/rts/storage/ClosureMacros.h:225
#1  0x11acd9d4 in LOOKS_LIKE_INFO_PTR (p=2105675784) at includes/rts/storage/ClosureMacros.h:230
#2  0x11acda3c in LOOKS_LIKE_CLOSURE_PTR (p=0x11bd4274) at includes/rts/storage/ClosureMacros.h:235
#3  0x11ace45c in evacuate1 (p=0x11bd2c10) at rts/sm/Evac.c:371
#4  0x11aa6d04 in scavenge_large_srt_bitmap (large_srt=0x11bd2d08) at rts/sm/Scav.c:283
#5  0x11aa6dd8 in scavenge_srt (srt=0x11bd2d08, srt_bitmap=65535) at rts/sm/Scav.c:310
#6  0x11aa6f3c in scavenge_fun_srt (info=0x10070f1c) at rts/sm/Scav.c:359
#7  0x11aa9098 in scavenge_static () at rts/sm/Scav.c:1506
#8  0x11aa9b2c in scavenge_loop1 () at rts/sm/Scav.c:1879
#9  0x11a9eaec in scavenge_until_all_done () at rts/sm/GC.c:967
#10 0x11a9d0c0 in GarbageCollect (force_major_gc=rtsFalse, gc_type=2, cap=0x11d63500) at rts/sm/GC.c:371
#11 0x11a8c0e0 in scheduleDoGC (cap=0x11d63500, task=0x11d772d8, force_major=rtsFalse) at rts/Schedule.c:1427
#12 0x11a8a878 in schedule (initialCapability=0x11d63500, task=0x11d772d8) at rts/Schedule.c:547
#13 0x11a8cf7c in scheduleWaitThread (tso=0xf7e03d04, ret=0x0, cap=0x11d63500) at rts/Schedule.c:1914
#14 0x11a7f000 in rts_evalLazyIO (cap=0x11d63500, p=0x11bcca28, ret=0x0) at rts/RtsAPI.c:494
#15 0x11a829b8 in real_main () at rts/RtsMain.c:63
#16 0x11a82ae8 in hs_main (argc=7, argv=0xffffe1f4, main_closure=0x11bcca28) at rts/RtsMain.c:111
#17 0x10003888 in main ()

Debugging continues.

  Changed 2 years ago by erikd

If I run ghc-stage2 under gdb as above and then, when it crashes, find the pid of the ghc-stage2 process and run "cat /proc/$PID/maps", I get:

00100000-00103000 r-xp 00000000 00:00 0                                  [vdso]
0fbd9000-0fbdb000 r-xp 00000000 08:03 7021637                            /usr/lib/gconv/UTF-32.so
0fbdb000-0fbea000 ---p 00002000 08:03 7021637                            /usr/lib/gconv/UTF-32.so
0fbea000-0fbeb000 r--p 00001000 08:03 7021637                            /usr/lib/gconv/UTF-32.so
0fbeb000-0fbec000 rw-p 00002000 08:03 7021637                            /usr/lib/gconv/UTF-32.so
0fbfc000-0fd65000 r-xp 00000000 08:03 10379407                           /lib/libc-2.13.so
0fd65000-0fd75000 ---p 00169000 08:03 10379407                           /lib/libc-2.13.so
0fd75000-0fd79000 r--p 00169000 08:03 10379407                           /lib/libc-2.13.so
0fd79000-0fd7a000 rw-p 0016d000 08:03 10379407                           /lib/libc-2.13.so
0fd7a000-0fd7d000 rw-p 00000000 00:00 0 
0fd8d000-0fe37000 r-xp 00000000 08:03 10379465                           /lib/libm-2.13.so
0fe37000-0fe47000 ---p 000aa000 08:03 10379465                           /lib/libm-2.13.so
0fe47000-0fe4a000 r--p 000aa000 08:03 10379465                           /lib/libm-2.13.so
0fe4a000-0fe4b000 rw-p 000ad000 08:03 10379465                           /lib/libm-2.13.so
0fe5b000-0fec2000 r-xp 00000000 08:03 11019070                           /usr/lib/libgmp.so.10.0.1
0fec2000-0fed2000 ---p 00067000 08:03 11019070                           /usr/lib/libgmp.so.10.0.1
0fed2000-0fed9000 rw-p 00067000 08:03 11019070                           /usr/lib/libgmp.so.10.0.1
0fee9000-0ff01000 r-xp 00000000 08:03 10379476                           /lib/libpthread-2.13.so
0ff01000-0ff10000 ---p 00018000 08:03 10379476                           /lib/libpthread-2.13.so
0ff10000-0ff11000 r--p 00017000 08:03 10379476                           /lib/libpthread-2.13.so
0ff11000-0ff12000 rw-p 00018000 08:03 10379476                           /lib/libpthread-2.13.so
0ff12000-0ff14000 rw-p 00000000 00:00 0 
0ff24000-0ff27000 r-xp 00000000 08:03 10379504                           /lib/libdl-2.13.so
0ff27000-0ff36000 ---p 00003000 08:03 10379504                           /lib/libdl-2.13.so
0ff36000-0ff37000 r--p 00002000 08:03 10379504                           /lib/libdl-2.13.so
0ff37000-0ff38000 rw-p 00003000 08:03 10379504                           /lib/libdl-2.13.so
0ff48000-0ff4a000 r-xp 00000000 08:03 10379415                           /lib/libutil-2.13.so
0ff4a000-0ff59000 ---p 00002000 08:03 10379415                           /lib/libutil-2.13.so
0ff59000-0ff5a000 r--p 00001000 08:03 10379415                           /lib/libutil-2.13.so
0ff5a000-0ff5b000 rw-p 00002000 08:03 10379415                           /lib/libutil-2.13.so
0ff6b000-0ff73000 r-xp 00000000 08:03 10379459                           /lib/librt-2.13.so
0ff73000-0ff82000 ---p 00008000 08:03 10379459                           /lib/librt-2.13.so
0ff82000-0ff83000 r--p 00007000 08:03 10379459                           /lib/librt-2.13.so
0ff83000-0ff84000 rw-p 00008000 08:03 10379459                           /lib/librt-2.13.so
0ff94000-0ffdd000 r-xp 00000000 08:03 10379331                           /lib/libncursesw.so.5.9
0ffdd000-0ffed000 ---p 00049000 08:03 10379331                           /lib/libncursesw.so.5.9
0ffed000-0fff0000 rw-p 00049000 08:03 10379331                           /lib/libncursesw.so.5.9
10000000-11bba000 r-xp 00000000 08:03 8045838                            /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2
11bca000-11d64000 rw-p 01bba000 08:03 8045838                            /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2
11d64000-11da9000 rwxp 00000000 00:00 0                                  [heap]
f7600000-f7601000 ---p 00000000 00:00 0 
f7601000-f7f00000 rw-p 00000000 00:00 0 
f7fc9000-f7fcc000 rw-p 00000000 00:00 0 
f7fd4000-f7fdb000 r--s 00000000 08:03 7021752                            /usr/lib/gconv/gconv-modules.cache
f7fdb000-f7fdd000 rw-p 00000000 00:00 0 
f7fdd000-f7ffd000 r-xp 00000000 08:03 10379426                           /lib/ld-2.13.so
f7ffd000-f7ffe000 r--p 00020000 08:03 10379426                           /lib/ld-2.13.so
f7ffe000-f7fff000 rw-p 00021000 08:03 10379426                           /lib/ld-2.13.so
fffde000-fffff000 rw-p 00000000 00:00 0                                  [stack]

The info pointer in the crash, 2105675784 (0x7d821008), is clearly outside of this memory map.
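The same check can be done mechanically rather than by eye. The following is a rough Python sketch that parses /proc/PID/maps style lines and looks up which region, if any, contains an address. The helper names parse_maps and find_region are hypothetical, and MAPS_EXCERPT is only a handful of lines copied from the listing above, not the full map.

```python
# A few lines copied from the memory map above (a subset, for illustration).
MAPS_EXCERPT = """\
10000000-11bba000 r-xp 00000000 08:03 8045838 /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2
11bca000-11d64000 rw-p 01bba000 08:03 8045838 /home/erikd/Git/ghc-no-ghci/inplace/lib/ghc-stage2
11d64000-11da9000 rwxp 00000000 00:00 0 [heap]
f7601000-f7f00000 rw-p 00000000 00:00 0
fffde000-fffff000 rw-p 00000000 00:00 0 [stack]
"""


def parse_maps(text):
    """Turn maps-style text into (start, end, perms, name) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start_s, end_s = fields[0].split("-")
        perms = fields[1]
        name = " ".join(fields[5:])  # empty string for anonymous mappings
        regions.append((int(start_s, 16), int(end_s, 16), perms, name))
    return regions


def find_region(regions, addr):
    """Return the region containing addr (end exclusive), or None."""
    for region in regions:
        if region[0] <= addr < region[1]:
            return region
    return None


regions = parse_maps(MAPS_EXCERPT)

info_ptr = 2105675784     # p passed to LOOKS_LIKE_INFO_PTR in the backtrace
closure_ptr = 0x11bd4274  # p passed to LOOKS_LIKE_CLOSURE_PTR

print(hex(info_ptr), find_region(regions, info_ptr))
print(hex(closure_ptr), find_region(regions, closure_ptr))
```

To run it against a live process, the excerpt could be replaced by the contents of /proc/$PID/maps read from a file; the parsing assumes the usual whitespace-separated layout shown in the listing.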

  Changed 22 months ago by igloo

  • milestone set to _|_

  Changed 21 months ago by erikd

This is actually a manifestation of bug #2972.

If I compile without vector and dph in the tree, I get a working stage2 compiler with a ghci that segfaults as per #2972.

  Changed 21 months ago by simonmar

Should this bug be closed as a dup then?

  Changed 21 months ago by erikd

  • status changed from new to closed
  • resolution set to duplicate