Ticket #5899 (closed bug: worksforme)

Opened 15 months ago

Last modified 11 months ago

RTS crash w/ strange closure type 603975781 on OS X 10.8

Reported by: dylukes
Owned by:
Priority: high
Milestone: 7.4.2
Component: Runtime System
Version: 7.4.1
Keywords: rts, strange closure, internal error, os x
Cc: chak@…, gdr@…, benl@…, pho@…, anton.nik@…, jhenahan@…, ireney.knapp@…
Operating System: MacOS X
Architecture: x86_64 (amd64)
Type of failure: Runtime crash
Difficulty: Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

On OS X 10.8 (Mountain Lion, the first developer seed), GHC's RTS crashes with strange closure type 603975781, for almost any program compiled with GHC.

As examples, cpphs and cabal's Setup crash, but a simple `main = putStrLn "hello world"' does not.

`runhaskell ...' works. It seems this only manifests in compiled programs.

An example of the output can be found here:  https://gist.github.com/1918845

Attachments

patch0.diff (0.6 KB) - added by dylukes 15 months ago.
Patch that *might* fix the issue. (ty slyfox)
Irene-symbol-order.txt.zip (75.1 KB) - added by lelf 14 months ago.
Irene-crash-report.txt (8.7 KB) - added by lelf 14 months ago.
Irene-debug-output.txt (7.4 KB) - added by lelf 14 months ago.

Change History

follow-up: ↓ 3   Changed 15 months ago by dylukes

Note: OS X Build Number is 12A128p.

  Changed 15 months ago by dylukes

The smallest program I've found that triggers this bug (?) is the following. Note, it does not cause a `strange closure' internal error, it just segfaults. However it might be related. This should be a much easier example to work with.

main = print $ reverse [1,2,3]

However, the following is also simple enough to trigger a segmentation fault...

main = print $ reverse' [1,2,3]
  where reverse' = undefined :: [a] -> [a]

in reply to: ↑ 1   Changed 15 months ago by dylukes

More information:

$ ld -v
@(#)PROGRAM:ld PROJECT:ld64-131.3
configured to support archs: armv6 armv7 i386 x86_64

$ cc -v
Apple clang version 4.0 (tags/Apple/clang-418.0.46) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.0.0
Thread model: posix

$ ghc -v
Glasgow Haskell Compiler, Version 7.4.1, stage 2 booted by GHC version 7.0.4

This GHC was installed from http://www.haskell.org/ghc/dist/7.4.1/ghc-7.4.1-x86_64-apple-darwin.tar.bz2.

Changed 15 months ago by dylukes

Patch that *might* fix the issue. (ty slyfox)

  Changed 15 months ago by chak

  • cc chak@…, benl@… added

  Changed 15 months ago by PHO

  • cc pho@… added

  Changed 15 months ago by lelf

  • cc anton.nik@… added

follow-up: ↓ 9   Changed 15 months ago by dylukes

I initially thought this might only affect 64bit programs compiled by 64bit GHC... but it seems in (rarer) cases it may affect 32bit programs... Though, that may be another issue entirely.

  Changed 15 months ago by simonmar

  • priority changed from normal to highest
  • difficulty set to Unknown
  • milestone set to 7.4.2

in reply to: ↑ 7 ; follow-up: ↓ 10   Changed 15 months ago by dylukes

Turns out it does only affect 64bit programs. This case was where I was building a 64bit executable with 32bit GHC during the bootstrapping process of building GHC. It's just 64bit.

in reply to: ↑ 9   Changed 15 months ago by dylukes

...Actually it turns out it was building a 32bit ghc-cabal:

ghc-cabal: internal error: evacuate(static): strange closure type 16

So... something is broken in 32bit as well.

  Changed 15 months ago by jhenahan

  • cc jhenahan@… added

Just posting to say that I'm also poking around for more data. Will post if I find anything new. Same OS, etc.

As for type 16, I read a thread (I think on this trac, though it may have been the mailing list) that mentioned strange closure type 11 being related to signal 11 (i.e., segfault). Perhaps type 16 is related to signal 16, SIGURG. From man signal:

16    SIGURG    discard signal    urgent condition present on socket

  Changed 14 months ago by dylukes

  • summary changed from GHC RTS crash w/ strange closure type 603975781 on OS X 10.8 to RTS crash w/ strange closure type 603975781 on OS X 10.8

More information and some clarifications:

- I can confirm that updating to 10.8 DP2 or Xcode 4.4 DP2, individually or together, does not fix it.
- This manifests on both the x86_64 and i386 architectures.
- There is an existing issue with binaries compiled on 10.8 not running on 10.7 or earlier. This has been reported to Apple, and is mirrored here: http://openradar.appspot.com/11022559

  Changed 14 months ago by simonmar

  • priority changed from highest to high

Current status on this: we're not treating it as a blocking issue for 7.4.2, as 10.8 is not released yet and we expect to have time for another GHC release before it is.

We are currently blocked on a diagnosis. That means either someone diagnoses it for us, or Ian or I install 10.8 on our respective Macs. Can this be done non-destructively?

  Changed 14 months ago by dylukes

10.8 cannot be installed non-destructively over an existing 10.7, but... you could install 10.8 to a new partition (temporarily).

I would volunteer to do diagnosis, but I am utterly ignorant of how to do so.

  Changed 14 months ago by dr.gigabit

I see the same symptom when running "cabal install syb":

Resolving dependencies...
[1 of 1] Compiling Main             ( /var/folders/gh/3w2hyrhs649b13txtmn7j5wc0000gn/T/syb-0.3.6-18301/syb-0.3.6/Setup.hs, /var/folders/gh/3w2hyrhs649b13txtmn7j5wc0000gn/T/syb-0.3.6-18301/syb-0.3.6/dist/setup/Main.o )

/var/folders/gh/3w2hyrhs649b13txtmn7j5wc0000gn/T/syb-0.3.6-18301/syb-0.3.6/Setup.hs:4:30:
    Warning: In the use of `runTests'
             (imported from Distribution.Simple, but defined in Distribution.Simple.UserHooks):
             Deprecated: "Please use the new testing interface instead!"
Linking /var/folders/gh/3w2hyrhs649b13txtmn7j5wc0000gn/T/syb-0.3.6-18301/syb-0.3.6/dist/setup/setup ...
setup: internal error: evacuate(static): strange closure type 603975781
    (GHC version 7.4.1 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
cabal: Error: some packages failed to install:
syb-0.3.6 failed during the configure step. The exception was:
ExitFailure 6

  Changed 14 months ago by Irene

  • cc ireney.knapp@… added

I tried the sample program above:

main = print $ reverse [1,2,3]

GHC 7.4.1 (from the .pkg version of the prebuilt binaries, but it's probably identical to the tarball version?) compiled successfully but the output crashed; here is the OS X crash report:

Process:         Main [37094]
Path:            /Users/USER/*/Main
Identifier:      Main
Version:         0
Code Type:       X86-64 (Native)
Parent Process:  bash [29186]
User ID:         501

Date/Time:       2012-03-22 20:24:06.768 -0400
OS Version:      Mac OS X 10.8 (12A154q)
Report Version:  10

Interval Since Last Report:          166904 sec
Crashes Since Last Report:           12
Per-App Crashes Since Last Report:   1
Anonymous UUID:                      15C338D1-9CE8-40B1-8287-60D878AF6A68

Crashed Thread:  0  Dispatch queue: com.apple.main-thread

Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x000000022bbe8a30

VM Regions Near 0x22bbe8a30:
    VM_ALLOCATE            000000010bd00000-000000010be00000 [ 1024K] rw-/rwx SM=PRV  
--> 
    MALLOC_TINY            00007fc3d8400000-00007fc3d8411000 [   68K] rw-/rwx SM=COW  

Application Specific Information:
objc[37094]: garbage collection is OFF

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   Main                          	0x000000010bbf0617 stg_ap_pp_fast + 31

Thread 0 crashed with X86 Thread State (64-bit):
  rax: 0x0000000023fff065  rbx: 0x000000010bc0ad58  rcx: 0x000000010bbf0708  rdx: 0x000000010bd041d8
  rdi: 0x000000010bc21e98  rsi: 0x000000010bc087c0  rbp: 0x000000010bd05358  rsp: 0x00007fff540c8ab8
   r8: 0x0000000000000001   r9: 0x0000000000000017  r10: 0x0000000000000001  r11: 0x0000000000000246
  r12: 0x000000010bd041e0  r13: 0x000000010bc21e98  r14: 0x000000010bc087e0  r15: 0x000000010bd050c0
  rip: 0x000000010bbf0617  rfl: 0x0000000000010202  cr2: 0x000000022bbe8a30
Logical CPU: 6

Binary Images:
       0x10bb33000 -        0x10bc07fef +Main (0) <F8E9D66A-B502-3555-B942-53E07F336457> /Users/USER/*/Main
    0x7fff6b733000 -     0x7fff6b7678e7  dyld (209.1) <7F330FEF-C9C5-38D8-9C3D-FBDCC0C28BDA> /usr/lib/dyld
    0x7fff868d6000 -     0x7fff869ed827  libobjc.A.dylib (526) <C3BAF7E1-9924-3714-9001-C1A97AF7448E> /usr/lib/libobjc.A.dylib
    0x7fff869fa000 -     0x7fff86a46ff7  libauto.dylib (185) <EC749301-51DA-3413-97DF-5481A75F974C> /usr/lib/libauto.dylib
    0x7fff86b6b000 -     0x7fff86b70fff  libcompiler_rt.dylib (30) <C865130E-E5D7-33E3-8131-2591703C67EB> /usr/lib/system/libcompiler_rt.dylib
    0x7fff8717a000 -     0x7fff871e2ff7  libc++.1.dylib (61) <5C289258-570C-3D3E-ACAB-88CB1C01804B> /usr/lib/libc++.1.dylib
    0x7fff87b24000 -     0x7fff87b27ff7  libdyld.dylib (209.1) <94E58E38-AC20-36DB-A84E-DAFA8D4E41E2> /usr/lib/system/libdyld.dylib
    0x7fff890e7000 -     0x7fff890e8fff  libremovefile.dylib (23) <D5F8B6CB-1EE1-3A71-858A-F98362786CD9> /usr/lib/system/libremovefile.dylib
    0x7fff89148000 -     0x7fff8914afff  libquarantine.dylib (48) <CC311F4D-83E1-3A88-9328-9FB095DACF32> /usr/lib/system/libquarantine.dylib
    0x7fff898b8000 -     0x7fff898b9fff  libsystem_blocks.dylib (57.2) <7014BC27-D424-3E9B-9535-3CAA6C956337> /usr/lib/system/libsystem_blocks.dylib
    0x7fff89934000 -     0x7fff8994fff7  libsystem_kernel.dylib (2050.2.33) <D93B6B58-F16D-377C-BE81-C4A87BDDF359> /usr/lib/system/libsystem_kernel.dylib
    0x7fff89950000 -     0x7fff89951ff7  libsystem_sandbox.dylib (206) <A1AB71A9-6E45-3C2A-A890-046185233396> /usr/lib/system/libsystem_sandbox.dylib
    0x7fff8a3e6000 -     0x7fff8a3e7ff7  libSystem.B.dylib (169.1) <A1FA6BD6-4F77-38E5-891E-9EB347229419> /usr/lib/libSystem.B.dylib
    0x7fff8a52a000 -     0x7fff8a558ff7  libsystem_m.dylib (3022.4) <C2BB2EF1-B11D-37DE-AF67-50720171F3A0> /usr/lib/system/libsystem_m.dylib
    0x7fff8a559000 -     0x7fff8a5c0fff  libcommonCrypto.dylib (60007) <A95DE414-20D1-3B00-9993-E6B731028556> /usr/lib/system/libcommonCrypto.dylib
    0x7fff8c958000 -     0x7fff8c95dfff  libcache.dylib (53) <C94D138A-1C5A-3855-ADCC-CAE07A94266C> /usr/lib/system/libcache.dylib
    0x7fff8e589000 -     0x7fff8e656fef  libsystem_c.dylib (825.12.1) <626CC4B4-4865-3179-B743-93CEDF4A8802> /usr/lib/system/libsystem_c.dylib
    0x7fff8ec1c000 -     0x7fff8ec23fff  libcopyfile.dylib (89) <8E286594-B745-32B5-89FE-0529963AA219> /usr/lib/system/libcopyfile.dylib
    0x7fff8ec4c000 -     0x7fff8ec70ff7  libc++abi.dylib (23) <5E3B1C2D-9BD1-391A-884C-1F3A69D2351E> /usr/lib/libc++abi.dylib
    0x7fff8ef3d000 -     0x7fff8ef48fff  libsystem_notify.dylib (98.4) <375881A9-6561-31E8-8AAF-0F108C9E52BC> /usr/lib/system/libsystem_notify.dylib
    0x7fff8f20c000 -     0x7fff8f20cfff  libkeymgr.dylib (25) <ACF42B1C-042B-3F24-9754-545E33EB04D7> /usr/lib/system/libkeymgr.dylib
    0x7fff8f2a1000 -     0x7fff8f2a9ff7  libsystem_dnssd.dylib (379.4) <C08FFB68-677D-36DB-A40C-737900E7A76A> /usr/lib/system/libsystem_dnssd.dylib
    0x7fff9072b000 -     0x7fff9072cff7  libdnsinfo.dylib (453.12) <C61AA787-2517-395E-B7FC-657CEAF80455> /usr/lib/system/libdnsinfo.dylib
    0x7fff90c9b000 -     0x7fff90cbcff7  libxpc.dylib (140.21.1) <BDE6735A-54A8-382E-9E46-38132F7D24F4> /usr/lib/system/libxpc.dylib
    0x7fff911a9000 -     0x7fff911b1ff7  liblaunch.dylib (442.7) <445D837C-39DB-30B0-8A54-C7F71CC651A2> /usr/lib/system/liblaunch.dylib
    0x7fff91290000 -     0x7fff91292ff7  libunc.dylib (24) <645FE7EF-A412-30B3-A570-08DC4A7D34B3> /usr/lib/system/libunc.dylib
    0x7fff912c1000 -     0x7fff912f7ff7  libsystem_info.dylib (406.11) <13705DE7-0A3C-33E0-994C-361A36E8596B> /usr/lib/system/libsystem_info.dylib
    0x7fff9162c000 -     0x7fff91632fff  libmacho.dylib (823) <4C09D65D-BB52-32D4-912C-8B298BA3F65F> /usr/lib/system/libmacho.dylib
    0x7fff92181000 -     0x7fff92196ff7  libdispatch.dylib (228.14) <B8EB96A3-6F01-3052-8A88-2010BF33A0E2> /usr/lib/system/libdispatch.dylib
    0x7fff92356000 -     0x7fff92364ff7  libsystem_network.dylib (77.6) <DF53A34A-ED8B-30D8-9CDF-025359B047E0> /usr/lib/system/libsystem_network.dylib
    0x7fff92365000 -     0x7fff9236bff7  libunwind.dylib (35.1) <32CAA2F5-4A69-3DD6-A789-D92D526B5D48> /usr/lib/system/libunwind.dylib
    0x7fff9266f000 -     0x7fff92764fff  libiconv.2.dylib (34) <4E5A84D7-2EF1-351A-BC64-95B15597EA88> /usr/lib/libiconv.2.dylib

External Modification Summary:
  Calls made by other processes targeting this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by this process:
    task_for_pid: 0
    thread_create: 0
    thread_set_state: 0
  Calls made by all processes on this machine:
    task_for_pid: 9468
    thread_create: 0
    thread_set_state: 12

VM Region Summary:
ReadOnly portion of Libraries: Total=58.6M resident=127.7M(218%) swapped_out_or_unallocated=16777216.0T(30040018386944%)
Writable regions: Total=18.6M written=396K(2%) resident=480K(3%) swapped_out=0K(0%) unallocated=18.1M(97%)
 
REGION TYPE                      VIRTUAL
===========                      =======
MALLOC                             9396K
MALLOC guard page                    16K
STACK GUARD                        56.0M
Stack                              8192K
VM_ALLOCATE                        1024K
__DATA                              800K
__LINKEDIT                         52.3M
__TEXT                             6448K
shared memory                        12K
===========                      =======
TOTAL                             133.5M

Model: iMac12,2, BootROM IM121.0047.B1F, 4 processors, Intel Core i7, 3.4 GHz, 4 GB, SMC 1.72f5
Graphics: AMD Radeon HD 6970M, AMD Radeon HD 6970M, PCIe, 1024 MB
Memory Module: BANK 0/DIMM0, 2 GB, DDR3, 1333 MHz, 0x02FE, 0x45424A3230554638424353302D444A2D4620
Memory Module: BANK 1/DIMM0, 2 GB, DDR3, 1333 MHz, 0x02FE, 0x45424A3230554638424353302D444A2D4620
AirPort: spairport_wireless_card_type_airport_extreme (0x168C, 0x9A), Atheros 9380: 4.0.64.8-P2P
Bluetooth: Version 4.0.7b30 exported, 2 service, 18 devices, 0 incoming serial ports
Network Service: Wi-Fi, AirPort, en1
Serial ATA Device: ST31000528AS, 1 TB
Serial ATA Device: HL-DT-STDVDRW  GA32N
USB Device: hub_device, 0x0424  (SMSC), 0x2514, 0xfd100000 / 2
USB Device: Tripp Lite UPS, 0x09ae  (Tripp Lite), 0x2011, 0xfd130000 / 6
USB Device: Video Capture, 0x0fd9, 0x0037, 0xfd140000 / 5
USB Device: Internal Memory Card Reader, apple_vendor_id, 0x8403, 0xfd110000 / 4
USB Device: IR Receiver, apple_vendor_id, 0x8242, 0xfd120000 / 3
USB Device: FaceTime HD Camera (Built-in), apple_vendor_id, 0x850b, 0xfa200000 / 3
USB Device: hub_device, 0x0424  (SMSC), 0x2514, 0xfa100000 / 2
USB Device: BRCM2046 Hub, 0x0a5c  (Broadcom Corp.), 0x4500, 0xfa110000 / 4
USB Device: Bluetooth USB Host Controller, apple_vendor_id, 0x8215, 0xfa111000 / 7
FireWire Device: My Book 111D, WD, 800mbit_speed
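
A quick arithmetic cross-check, offered as an observation rather than a diagnosis: the "strange closure type" 603975781 reported elsewhere in this ticket, written in hexadecimal, is the same value that the crash report above shows in rax. The sketch below uses only the shell's printf; the variable name is purely illustrative.

```shell
# Closure type number from the "strange closure type" error in this ticket.
closure_type=603975781

# Print it as a 64-bit hex value, the way the crash report prints registers.
printf '0x%016x\n' "$closure_type"
# prints 0x0000000023fff065, matching "rax: 0x0000000023fff065" above
```

If that is more than a coincidence, the segfault and the internal error may both come from the same misplaced word being read where an info table was expected; nothing in the ticket confirms this.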

  Changed 14 months ago by simonmar

I managed to reproduce the problem. I know what's going on, but I don't know how to fix it yet. So the problem is that in the binary, the linker has re-ordered some of the contents so that one particular info table is not next to its code any more:

000000010002cb88 t _base_GHCziIOziHandleziInternals_augmentIOError_info_dsp
000000010002cba0 T _base_GHCziIOziHandleziInternals_augmentIOError_info
000000010002cbe8 t _base_GHCziEventziInternal_evtNothing_info_dsp
000000010002cbf8 T _base_GHCziEventziInternal_evtNothing_info
000000010002cc88 t _base_GHCziBase_zd_info_dsp
000000010002cca0 t _base_GHCziIOziHandleziInternals_zdLr3Qxlvl8_info_dsp
000000010002ccb0 T _base_GHCziIOziHandleziInternals_zdLr3Qxlvl8_info
000000010002cd38 T _base_GHCziBase_zd_info

Note the symbol _base_GHCziBase_zd_info_dsp should be adjacent to _base_GHCziBase_zd_info, but the linker has placed some other stuff in between. These _dsp symbols are already special OS X magic that we insert to prevent the linker dropping things on the floor (IIRC, and this is horrible because it means the libraries on OS X have twice as many symbols as other platforms).
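
The adjacency requirement described above can be checked mechanically. The following is a minimal sketch, assuming an address-sorted symbol listing in the three-column form shown in this comment (as produced by `nm -n`); the function name `check_tntc_adjacency` is hypothetical, and the check only covers symbols that have a `_dsp` companion.

```shell
# Hypothetical helper: reads an address-sorted symbol listing on stdin
# (columns: address, type, name) and reports every *_info symbol that
# does not immediately follow its *_info_dsp companion.
check_tntc_adjacency() {
  awk '
    # A pending _dsp symbol expects its _info symbol on the very next line.
    pending != "" {
      if ($3 != pending) print "separated: " pending
      pending = ""
    }
    # Remember which _info symbol should come next (strip the "_dsp" suffix).
    $3 ~ /_info_dsp$/ { pending = substr($3, 1, length($3) - 4) }
    # A _dsp symbol on the last line has no successor at all.
    END { if (pending != "") print "separated: " pending }
  '
}

# The listing from the linked binary quoted in this comment.
check_tntc_adjacency <<'EOF'
000000010002cb88 t _base_GHCziIOziHandleziInternals_augmentIOError_info_dsp
000000010002cba0 T _base_GHCziIOziHandleziInternals_augmentIOError_info
000000010002cbe8 t _base_GHCziEventziInternal_evtNothing_info_dsp
000000010002cbf8 T _base_GHCziEventziInternal_evtNothing_info
000000010002cc88 t _base_GHCziBase_zd_info_dsp
000000010002cca0 t _base_GHCziIOziHandleziInternals_zdLr3Qxlvl8_info_dsp
000000010002ccb0 T _base_GHCziIOziHandleziInternals_zdLr3Qxlvl8_info
000000010002cd38 T _base_GHCziBase_zd_info
EOF
# prints: separated: _base_GHCziBase_zd_info
```

Run over the full symbol table of a linked binary, a check like this would show whether the base package symbol is the only separated pair or one of many.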

These symbols are adjacent in the original object file:

libHSbase-4.5.0.0.a(Base__45.o):
0000000000000028 D _base_GHCziBase_zd_closure
0000000000000018 T _base_GHCziBase_zd_info
0000000000000000 t _base_GHCziBase_zd_info_dsp
                 U _stg_ap_p_fast

It seems like there ought to be a way to disable this behaviour with a linker flag, but I can't find one that works. I've tried -no_order_inits and -no_order_data.

Help from OS X experts greatly appreciated...

  Changed 14 months ago by dylukes

When I went to the LLVM folks to talk about this, they referred me to a recent mailing list discussion last week. As long as there's support for moving forward with proper TNTC support in LLVM, this should get solved as a byproduct.

 http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-March/048195.html

  Changed 14 months ago by Irene

I did more digging - I wanted to verify that this was the only issue breaking us on Mountain Lion, and possibly also find a short-term kludge so that those of us who are using it can continue to develop our Haskell projects while we wait for the release. :) It's important that we know whether there are any other issues hidden by this one, to avoid a situation where this one gets fixed just in time but then something else breaks us and we aren't able to release.

I used the test program

main = print $ f 1 where f = (+1)

and passed the linker an option that forces it to keep the symbols in the same order (which the LLVM people have no interest in supporting as a long-term solution, since it makes various stuff impossible):

$ nm -n Main.o | grep -v '^ \+U ' | sed -e 's/^[0-9a-f]* [a-zA-Z] //' > order.txt 
$ ld Main.o rtsopts.o /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/base-4.3.1.0/libHSbase-4.3.1.0.a /usr/lib/libiconv.dylib /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/integer-gmp-0.2.0.3/libHSinteger-gmp-0.2.0.3.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/ghc-prim-0.2.0.0/libHSghc-prim-0.2.0.0.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/directory-1.1.0.0/libHSdirectory-1.1.0.0.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/filepath-1.2.0.0/libHSfilepath-1.2.0.0.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/unix-2.4.2.0/libHSunix-2.4.2.0.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/old-time-1.0.0.6/libHSold-time-1.0.0.6.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/old-locale-1.0.0.2/libHSold-locale-1.0.0.2.a /usr/lib/libSystem.dylib /usr/lib/crt1.10.6.o /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/libHSffi.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/libHSrts_debug.a /Library/Frameworks/GHC.framework/Versions/7.0.4-x86_64/usr/lib/ghc-7.0.4/libHSrtsmain.a

You'll note that I'm using GHC 7.0.4 for this test. That's because I couldn't find where _main was defined under 7.4.1 and thus couldn't do the manual linking step. (I presume it's generated dynamically...) This has no ramifications for the eventual fix, since of course GHC devs know how to modify GHC to be certain it really is issuing the desired ld command - but I don't. So I had to do it this way.

Anyway, attached, please find three long dumps of information. They are the complete symbol table of the a.out produced by the above command; the complete output of the test program with all the RTS debugging flags turned on; and the crash report produced by the OS. It looks like we still have a problem, even with the symbol order fixed - or did I mess something up?

  Changed 14 months ago by Irene

Hm, not sure how to attach a file. See  http://ireneknapp.com/himitsu/still-a-problem.tar.bz2 for the aforementioned dumps.

  Changed 14 months ago by simonmar

I think you only forced ordering for the Main.o module, and not the libraries or the RTS, correct? I observed the ordering being mangled for one object file in the base package, so you would need to force the correct order for all the symbols in the libraries too.

  Changed 14 months ago by Irene

Yes, that is correct - I didn't think of doing it for libraries and the RTS. I'll put together another test that does, and report back. Good catch!

  Changed 14 months ago by Irene

I did the more thorough test, and the trivial program runs without a crash, producing correct output. Excellent! This means that the problem does indeed consist only of the TNTC thing, which is what I was trying to verify.

follow-up: ↓ 25   Changed 14 months ago by simonmar

One thought occurred to me: maybe if we set the size of the _dsp symbol to be the size of the info table plus the size of the code, that would prevent ld from separating them. But, as far as I can tell, symbols do not have sizes in Mach-O.

I'm a bit bemused at how the linker can get away with reordering code within an object file. The behaviour seems to be inconsistent with the man page for ld, which says

The object files are loaded in the order in which they are specified on the command line. The segments and the sections in those segments will appear in the output file in the order they are encountered in the object files being linked. [...] The use of the -order_file option will alter the layout rules above, and move the symbols specified to start of their section.

which doesn't explicitly say that code within a section will not be reordered, but it strongly implies that.

in reply to: ↑ 24   Changed 14 months ago by chak

Replying to simonmar:

I'm a bit bemused at how the linker can get away with reordering code within an object file. The behaviour seems to be inconsistent with the man page for ld, which says "The object files are loaded in the order in which they are specified on the command line. The segments and the sections in those segments will appear in the output file in the order they are encountered in the object files being linked. [...] The use of the -order_file option will alter the layout rules above, and move the symbols specified to start of their section." which doesn't explicitly say that code within a section will not be reordered, but it strongly implies that.

But don't the -no_order_inits and -no_order_data options make it clear that reordering within sections is happening?

Can't we use -order_file to solve this problem once and for all?

-order_file file

Alters the order in which functions and data are laid out. For each section in the output file, any symbol in that section that are specified in the order file file is moved to the start of its section and laid out in the same order as in the order file file. Order files are text files with one symbol name per line. Lines starting with a # are comments. A symbol name may be optionally preceded with its object file leaf name and a colon (e.g. foo.o:_foo). This is useful for static functions/data that occur in multiple files. A symbol name may also be optionally preceded with the architecture (e.g. ppc:_foo or ppc:foo.o:_foo). This enables you to have one order file that works for multiple architectures. Literal c-strings may be ordered by quoting the string (e.g. "Hello, world\n") in the order file.

In fact, the man page makes it sound as if we can achieve TNTC without any hacks using an order file on OS X.

Or am I missing something?

  Changed 14 months ago by Irene

Well, we can indeed do it with an order file, but it's unwieldy to have to construct one that encompasses every single module being linked, and then it breaks again if linked again, as for example happens if we build a Haskell library that an innocent and well-meaning author tries to link into a C program without GHC's involvement. Actually, I'm not sure how we can make that scenario work even with some sort of way to tell LLVM explicitly about TNTC, since the Mach-O format doesn't have any way to express that constraint, so it will never survive to a second linking.

But the main reason to not use an order file is that the reordering probably actually does provide a good benefit, since (I assume but nobody's actually explained it that I've seen) it does things such as providing additional code locality so as to keep things within the processor's instruction cache more often. It can also do whole-program dead-code removal. This sort of feature, which the LLVM people call link-time optimization (LTO), is defeated if we tell it it can't actually do anything.

  Changed 14 months ago by simonmar

Regarding the documentation, the documentation for -no_order_inits says

When the -order_file option is not used, the linker lays out functions in object file order and it moves all initializer routines to the start of the __text section and terminator routines to the end. Use this option to disable the automatic rearrangement of initializers and terminators.

So this explicitly says that functions are laid out in "object file order" when -order_file is not used. Either the documentation is wrong, or the implementation has a bug.

In principle we could use -order_file, but as Irene says it is difficult to arrange. We don't know the list of symbols before linking because they come from libraries, so we would have to link the program twice: once to find the list of object files, then construct the symbol list by looking up the object files in the libraries, and then link again with -order_file and the constructed list of symbols. This could all be quite slow.
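
The symbol-list construction step described above could look roughly like the following. This is a sketch under assumptions, not something GHC does: the helper name `order_lines` is hypothetical, the input is assumed to be `nm -n` output for a single object file, and the `object.o:symbol` form is the one the ld man page describes for order files. Whether ld honours such entries for the local `_dsp` symbols has not been verified here.

```shell
# Hypothetical helper: turn `nm -n` output for one object file (stdin)
# into order-file lines of the form "leaf.o:_symbol".
# $1 is the object file leaf name.
order_lines() {
  # Defined symbols have three columns (address, type, name);
  # undefined ones ("U _sym") have only two and are skipped.
  awk -v obj="$1" 'NF == 3 { print obj ":" $3 }'
}

# Demo on the Base__45.o symbols quoted earlier in this ticket,
# listed in address order as `nm -n` would print them.
order_lines Base__45.o <<'EOF'
                 U _stg_ap_p_fast
0000000000000000 t _base_GHCziBase_zd_info_dsp
0000000000000018 T _base_GHCziBase_zd_info
0000000000000028 D _base_GHCziBase_zd_closure
EOF
# prints one line per defined symbol, in address order:
#   Base__45.o:_base_GHCziBase_zd_info_dsp
#   Base__45.o:_base_GHCziBase_zd_info
#   Base__45.o:_base_GHCziBase_zd_closure
```

In the two-pass scheme, the output of this step for every object file pulled in by the first link would be concatenated into one file and passed to the second link via -order_file.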

Irene: I don't think reordering is gaining us anything. The linker doesn't have any locality information that it could use to do a sensible reordering. I suspect that all it is doing is filling in gaps caused by alignment with small symbols to reduce the size of the linked binary a tiny bit.

  Changed 13 months ago by simonpj

  • cc gdr@… added

We seem stuck here. Maybe it's even a linker bug.

  • It is documented not to do this
  • Surely other systems also rely on the linker not randomly re-laying out code

We could do with help from a linker expert. Gaby dos Reis perhaps?

Simon

  Changed 12 months ago by lukexi

Hi guys,

I'm very happy to report that this seems to be fixed using the new Haskell Platform release 2012.2.0.0!

The test program above runs perfectly, and my multithreaded server I originally ran into the 'strange closure type' issue with runs wonderfully as well. Hurray!

I'm on ML DP3 (Build 12A206j), with Xcode 4.4 DP4.

lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ echo 'main = print $ reverse [1,2,3]' > Main.hs
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ ghc Main
[1 of 1] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ ./Main 
[3,2,1]
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.4.1
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ ld -v
@(#)PROGRAM:ld  PROJECT:ld64-132.11
configured to support archs: armv6 armv7 i386 x86_64
LTO support using: LLVM version 3.1svn, from Apple Clang 4.0 (build 421.0.31)
lukexi@Luke-Ianninis-MacBook-Air:~/ghctest2$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

  Changed 12 months ago by simonmar

That's good news. However we don't know what made the problem go away, so it's possible it might re-emerge.

Let's keep the ticket open until we can verify whether our 7.4.2 distributions work on 10.8.

  Changed 12 months ago by igloo

But the HP contains exactly the same GHC that didn't work, doesn't it?

Isn't it more likely that the newer versions of OS X stuff fixed the linker?

  Changed 12 months ago by simonmar

I've been updating my Mac to the latest 10.8 and XCode, so hopefully I'll be able to answer that soon.

  Changed 12 months ago by simonmar

  • status changed from new to closed
  • resolution set to worksforme

I just did a validate using an existing installation of GHC 7.0.3 on OS X 10.8 DP3 with XCode 4.4 DP5. There are a few test failures that need to be cleaned up, but the crashes now seem to be gone.

So I presume this was a bug in the linker after all, and Apple fixed it.

  Changed 11 months ago by chak

For what it is worth, I just validated the HEAD with GHC 7.4.1 on ML DP4 with Xcode 4.5 DP. Seems to be fine.
