Ticket #4867 (closed bug: fixed)

Opened 2 years ago

Last modified 2 years ago

ghci displays negative floats incorrectly (was: Incorrect result from trig functions)

Reported by: gwright
Owned by: gwright
Priority: high
Milestone: 7.0.2
Component: GHCi
Version: 7.0.1
Keywords:
Cc: william.knop.nospam@…, arsenm2@…, pho@…
Operating System: MacOS X
Architecture: x86_64 (amd64)
Type of failure: Incorrect result at runtime
Difficulty:
Test Case:
Blocked By:
Blocking:
Related Tickets:

Description

Trigonometric functions give the wrong answer in some cases. I have verified this bug on ghc-7.0.1 and ghc-7.0.1-rc1 on OS X 10.6 64 bit. (The bug may be limited to 64 bit platforms; I've not been able to reproduce it on ghc-6.10.4/OS X 10.5/32 bit.)

Here's an example (ghc-7.0.2-rc1, OS X 10.6, 64 bit):

plumbbob-franklin> inplace/bin/ghc-stage2 --interactive
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> tan (0.5 * pi + 0.01)
-4.563974029425858e214
Prelude> tan (0.5 * pi - 0.01)
99.99666664444354
Prelude> 

The first result (tan (0.5 * pi + 0.01)) is wrong. The correct answer is close to -100.0. The sin and cos functions also give incorrect answers in some cases, e.g.,

Prelude> cos (3.0 * pi)
-3.666940035476786e76

At first glance, the bug seems related to the normalization of the argument (i.e., mapping the argument to the range (-pi/2, pi/2)).
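
As a sanity check on the expected values quoted above, the identity tan (pi/2 + e) = -1 / tan e gives roughly -1/e for small e. The following is only a sketch, meant to be compiled; the helper names expectedTan and relDiff are made up for illustration. It prints Bool results so the outcome is unambiguous.

```haskell
-- Sanity check for the expected values of the expressions in the report.
-- tan (pi/2 + e) = -cot e = -1 / tan e, which is close to -1/e for small e.

-- Hypothetical helper: the value that tan (pi/2 + e) should be close to.
expectedTan :: Double -> Double
expectedTan e = negate (1 / tan e)

-- Relative difference between two doubles.
relDiff :: Double -> Double -> Double
relDiff a b = abs (a - b) / max (abs a) (abs b)

main :: IO ()
main = do
  let e = 0.01
  -- tan just past pi/2 should be close to -100.
  print (relDiff (tan (0.5 * pi + e)) (expectedTan e) < 1e-9)
  -- cos (3 * pi) should be -1 up to rounding.
  print (abs (cos (3.0 * pi) + 1.0) < 1e-12)
```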

This is a nasty one. It ought to be fixed before 7.0.2 goes out.

Attachments

Linker.c.dpatch (45.2 KB) - added by gwright 2 years ago.
cleanup_float_showing.dpatch (67.2 KB) - added by altaic 2 years ago.
new_improved_Linker.c.dpatch (84.3 KB) - added by gwright 2 years ago.

Change History

  Changed 2 years ago by gwright

Hmmm. Not just the regular trig functions are broken, but so is tanh:

Prelude> tanh (-1.0)
-2.5422996573661585e76

(should be about -0.76).

  Changed 2 years ago by gwright

This seems to be a ghci problem. If I compile

--
-- test the tanh function:
--

module Main where

main = do
	print (tanh (-1.0))

I get

plumbbob-franklin> ./foo
-0.7615941559557649

which is the right answer.

  Changed 2 years ago by gwright

  • component changed from Compiler to GHCi

  Changed 2 years ago by guest

I was unable to reproduce under Linux (Ubuntu 10.10 x86_64)

  Changed 2 years ago by gwright

  • os changed from Unknown/Multiple to MacOS X
  • architecture changed from Unknown/Multiple to x86_64 (amd64)

  Changed 2 years ago by gwright

ghc 6.10.4 on FreeBSD 8.1/amd64 doesn't show this bug, so it looks like it is specific to OS X, or only occurs in 7.0.1 or later. If anyone else tries to reproduce this, please note the ghc version.

  Changed 2 years ago by ajd

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.1
$ ghci
GHCi, version 7.0.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> tan (0.5 * pi + 0.01)
-99.99666664444476
Prelude> tan (0.5 * pi - 0.01)
99.99666664444354
Prelude> cos (3*pi)
-1.0
Prelude> tanh (-1)
-0.7615941559557649

Can't reproduce under Linux i686 w/ GHC 7.0.1.

  Changed 2 years ago by igloo

  • priority changed from normal to high
  • milestone set to 7.0.2

  Changed 2 years ago by gwright

Since this bug only seems to happen on OS X/64 bit, I wonder if it isn't related to the recent work on the linker. Perhaps the following has a hint, though I haven't been able to find it yet.

Ran ghc-7.0.1-rc1 as inplace/bin/ghc-stage2 --interactive +RTS -Dl (built with GhcDebugged=YES).

Prelude> cos (-pi)
resolveObjs: start
initLinker: start
initLinker: idempotent return
resolveObjs: done
lookupSymbol: looking up _base_GHCziFloat_zdfFloatingDouble_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdfFloatingDouble_closure is 0x1073bb298
lookupSymbol: looking up _base_GHCziFloat_pi_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_pi_closure is 0x1073b9e38
lookupSymbol: looking up _base_GHCziNum_negate_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziNum_negate_closure is 0x1073c6680
lookupSymbol: looking up _base_GHCziReal_zdp1Fractional_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziReal_zdp1Fractional_closure is 0x1073c8b80
lookupSymbol: looking up _base_GHCziFloat_zdfFloatingDouble_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdfFloatingDouble_closure is 0x1073bb298
lookupSymbol: looking up _base_GHCziFloat_zdp1Floating_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdp1Floating_closure is 0x1073b9e30
lookupSymbol: looking up _base_GHCziFloat_zdfFloatingDouble_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdfFloatingDouble_closure is 0x1073bb298
lookupSymbol: looking up _base_GHCziFloat_cos_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_cos_closure is 0x1073b9e78
lookupSymbol: looking up _ghczmprim_GHCziTypes_ZC_con_info
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _ghczmprim_GHCziTypes_ZC_con_info is 0x10534f9b0
lookupSymbol: looking up _ghczmprim_GHCziTypes_ZMZN_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _ghczmprim_GHCziTypes_ZMZN_closure is 0x105352df0
lookupSymbol: looking up _base_GHCziBase_returnIO_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziBase_returnIO_closure is 0x1073b4cb8
lookupSymbol: looking up _base_GHCziNum_zdp2Num_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziNum_zdp2Num_closure is 0x1073c6660
lookupSymbol: looking up _base_GHCziReal_zdp1Fractional_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziReal_zdp1Fractional_closure is 0x1073c8b80
lookupSymbol: looking up _base_GHCziFloat_zdfFloatingDouble_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdfFloatingDouble_closure is 0x1073bb298
lookupSymbol: looking up _base_GHCziFloat_zdp1Floating_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziFloat_zdp1Floating_closure is 0x1073b9e30
lookupSymbol: looking up _base_SystemziIO_print_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_SystemziIO_print_closure is 0x1073e6a20
lookupSymbol: looking up _base_GHCziBase_thenIO_closure
initLinker: start
initLinker: idempotent return
lookupSymbol: value of _base_GHCziBase_thenIO_closure is 0x1073b4ca0
-2.1804254445564623e-216
Prelude> 

If this doesn't tell anyone anything, I will start poking around with gdb.

  Changed 2 years ago by gwright

  • owner set to gwright

  Changed 2 years ago by gwright

In short, the correct answer is returned by the tanh function in libSystem.B and is placed on the heap. Below is what gdb says.

I start gdb, source the gdbinit macros, and load my gdb_ghc script. The latter is

gwright-macbook> cat gdb_ghc
file /Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib/ghc-stage2
set args +RTS -V0 -i0 -RTS -B/Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib -pgmc /usr/bin/gcc-4.2 -pgma /usr/bin/gcc-4.2 -pgml /usr/bin/gcc-4.2 -pgmP "/usr/bin/gcc-4.2 -E -undef -traditional" --interactive
break tanh

Here's the transcript:

gwright-macbook> gdb
GNU gdb 6.3.50-20050815 (Apple version gdb-1510) (Wed Sep 22 02:45:02 UTC 2010)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin".
(gdb) source gdbinit
(gdb) source gdb_ghc 
Reading symbols for shared libraries ..... done
Breakpoint 1 at 0x3126e978bbfb80
(gdb) run
Starting program: /Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib/ghc-stage2 +RTS -V0 -i0 -RTS -B/Users/gwright/tmp/ghc-7-branch/ghc/inplace/lib -pgmc /usr/bin/gcc-4.2 -pgma /usr/bin/gcc-4.2 -pgml /usr/bin/gcc-4.2 -pgmP "/usr/bin/gcc-4.2 -E -undef -traditional" --interactive
Reading symbols for shared libraries ++++. done
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> tanh(-0.0001)

Breakpoint 1, 0x00007fff80036b80 in tanh$fenv_access_off ()
(gdb) 

Continuing from that point to the end of the tanh function,

(gdb) display/i $rip
1: x/i $rip  0x7fff80036b80 <tanh$fenv_access_off>:	movapd %xmm0,%xmm1
(gdb) stepi
0x00007fff80036b84 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036b84 <tanh$fenv_access_off+4>:	andpd  0x144054(%rip),%xmm0        # 0x7fff8017abe0 <abs_mask>
(gdb) 
0x00007fff80036b8c in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036b8c <tanh$fenv_access_off+12>:	movq   %xmm0,%rax
(gdb) stepi
0x00007fff80036b91 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036b91 <tanh$fenv_access_off+17>:	movapd %xmm1,%xmm7
(gdb) stepi
0x00007fff80036b95 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036b95 <tanh$fenv_access_off+21>:	sub    0x130654(%rip),%rax        # 0x7fff801671f0 <two>
(gdb) stepi
0x00007fff80036b9c in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036b9c <tanh$fenv_access_off+28>:	xorpd  %xmm0,%xmm7
(gdb) stepi
0x00007fff80036ba0 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036ba0 <tanh$fenv_access_off+32>:	cmp    0x130661(%rip),%rax        # 0x7fff80167208 <eighth_m_two+8>
(gdb) stepi
0x00007fff80036ba7 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036ba7 <tanh$fenv_access_off+39>:	ja     0x7fff80036cf4 <tanh$fenv_access_off+372>
(gdb) stepi
0x00007fff80036cf4 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036cf4 <tanh$fenv_access_off+372>:	jg     0x7fff80036e97 <tanh$fenv_access_off+791>
(gdb) stepi
0x00007fff80036cfa in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036cfa <tanh$fenv_access_off+378>:	cmp    0x1304ff(%rip),%rax        # 0x7fff80167200 <eighth_m_two>
(gdb) stepi
0x00007fff80036d01 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036d01 <tanh$fenv_access_off+385>:	jbe    0x7fff80036dd6 <tanh$fenv_access_off+598>
(gdb) stepi
0x00007fff80036dd6 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036dd6 <tanh$fenv_access_off+598>:	cmp    0x13041b(%rip),%rax        # 0x7fff801671f8 <tiny_m_two>
(gdb) stepi
0x00007fff80036ddd in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036ddd <tanh$fenv_access_off+605>:	jbe    0x7fff80036e31 <tanh$fenv_access_off+689>
(gdb) stepi
0x00007fff80036ddf in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036ddf <tanh$fenv_access_off+607>:	lea    0x144a9a(%rip),%rdx        # 0x7fff8017b880 <small_poly>
(gdb) stepi
0x00007fff80036de6 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036de6 <tanh$fenv_access_off+614>:	mulsd  %xmm0,%xmm0
(gdb) stepi
0x00007fff80036dea in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036dea <tanh$fenv_access_off+618>:	movapd %xmm1,%xmm2
(gdb) stepi
0x00007fff80036dee in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036dee <tanh$fenv_access_off+622>:	mulsd  0x40(%rdx),%xmm1
(gdb) stepi
0x00007fff80036df3 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036df3 <tanh$fenv_access_off+627>:	movlhps %xmm0,%xmm0
(gdb) stepi
0x00007fff80036df6 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036df6 <tanh$fenv_access_off+630>:	movapd %xmm0,%xmm3
(gdb) stepi
0x00007fff80036dfa in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036dfa <tanh$fenv_access_off+634>:	movapd %xmm3,%xmm4
(gdb) stepi
0x00007fff80036dfe in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036dfe <tanh$fenv_access_off+638>:	addpd  (%rdx),%xmm0
(gdb) stepi
0x00007fff80036e02 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e02 <tanh$fenv_access_off+642>:	addpd  0x10(%rdx),%xmm3
(gdb) stepi
0x00007fff80036e07 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e07 <tanh$fenv_access_off+647>:	mulpd  %xmm4,%xmm0
(gdb) stepi
0x00007fff80036e0b in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e0b <tanh$fenv_access_off+651>:	mulpd  %xmm4,%xmm3
(gdb) stepi
0x00007fff80036e0f in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e0f <tanh$fenv_access_off+655>:	mulsd  %xmm4,%xmm1
(gdb) stepi
0x00007fff80036e13 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e13 <tanh$fenv_access_off+659>:	addpd  0x20(%rdx),%xmm0
(gdb) stepi
0x00007fff80036e18 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e18 <tanh$fenv_access_off+664>:	addpd  0x30(%rdx),%xmm3
(gdb) stepi
0x00007fff80036e1d in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e1d <tanh$fenv_access_off+669>:	mulpd  %xmm3,%xmm0
(gdb) stepi
0x00007fff80036e21 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e21 <tanh$fenv_access_off+673>:	mulsd  %xmm0,%xmm1
(gdb) stepi
0x00007fff80036e25 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e25 <tanh$fenv_access_off+677>:	movhlps %xmm0,%xmm0
(gdb) stepi
0x00007fff80036e28 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e28 <tanh$fenv_access_off+680>:	mulsd  %xmm1,%xmm0
(gdb) stepi
0x00007fff80036e2c in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e2c <tanh$fenv_access_off+684>:	addsd  %xmm2,%xmm0
(gdb) stepi
0x00007fff80036e30 in tanh$fenv_access_off ()
1: x/i $rip  0x7fff80036e30 <tanh$fenv_access_off+688>:	retq   
(gdb) stepi
0x000000010524fd0c in ?? ()
1: x/i $rip  0x10524fd0c:	lea    -0x10047b(%rip),%rax        # 0x10514f898
(gdb) p $xmm0
$1 = {
  v4_float = {3.08506751, -1.63840696e-06, -0.602399945, -2.4928418e+25}, 
  v2_double = {42.888663036703718, -9.9999999666666668e-05}, 
  v16_int8 = {64, 69, 113, -65, -75, -37, -25, 60, -65, 26, 54, -30, -23, -92, -10, 98}, 
  v8_int16 = {16453, 29119, -18981, -6340, -16614, 14050, -5724, -2462}, 
  v4_int32 = {1078292927, -1243879620, -1088801054, -375064990}, 
  v2_int64 = {4631232860024203068, -4676364914860427678}, 
  uint128 = 0x404571bfb5dbe73cbf1a36e2e9a4f662
}
(gdb) 

The correct answer, about -0.0001, is in $xmm0. Continuing a few more instructions,

(gdb) stepi
0x000000010524fd13 in ?? ()
1: x/i $rip  0x10524fd13:	mov    %rax,-0x8(%r12)
(gdb) stepi
0x000000010524fd18 in ?? ()
1: x/i $rip  0x10524fd18:	movsd  %xmm0,(%r12)
(gdb) stepi
0x000000010524fd1e in ?? ()
1: x/i $rip  0x10524fd1e:	lea    -0x7(%r12),%rbx
(gdb) p8 $r12
0x1044a9f90:	0x1044a9f39
0x1044a9f88:	0x1044a9f49
0x1044a9f80:	0x1044a9f58
0x1044a9f78:	0x10094eb08 <ghczm7zi0zi1zi20101221_CoreSyn_UnfIfGoodArgs_con_info>
0x1044a9f70:	0x0
0x1044a9f68:	0x0
0x1044a9f60:	0x102266d01 <ghczmprim_GHCziTypes_ZMZN_closure+1>
0x1044a9f58:	0xbf1a36e2e9a4f662
(gdb) x/gf $r12
0x1044a9f58:	-9.9999999666666668e-05
(gdb) 

The correct result is now pointed to by $r12, the heap pointer on x86_64.

So as noted earlier, the hyperbolic tangent is computed correctly and the right value ends up on the heap. Whatever goes wrong must happen later.
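
The bit pattern gdb shows at 0x1044a9f58 can also be decoded outside the debugger. This is a small sketch, assuming a base library recent enough to export castWord64ToDouble from GHC.Float (newer than the GHC 7.0 discussed here). It compares the stored bits with the two-term Taylor approximation tanh x ≈ x - x^3/3; the names heapBits and tanhSmall are illustrative.

```haskell
import Data.Word (Word64)
import GHC.Float (castWord64ToDouble)

-- The 8 bytes stored at (%r12) in the gdb session above.
heapBits :: Word64
heapBits = 0xbf1a36e2e9a4f662

-- Two-term Taylor approximation of tanh, adequate for very small arguments.
tanhSmall :: Double -> Double
tanhSmall x = x - x ^ (3 :: Int) / 3

main :: IO ()
main = do
  let stored = castWord64ToDouble heapBits
  -- The stored value is negative and agrees with the approximation.
  print (stored < 0)
  print (abs (stored - tanhSmall (-0.0001)) < 1e-18)
```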

There are about 4600 machine instructions between the end of the tanh and when the (wrong) answer is printed out. The bug's in there somewhere.

It would be helpful if someone could sketch what ghci does when evaluating a statement at the prompt. (At least which files to look in; I didn't find anything about this in the commentary.)

  Changed 2 years ago by gwright

A note on the above: I rebuilt ghc with GhcThreaded = NO before using gdb. This simplifies what goes on after tanh(-0.0001) is computed, but not enough that I've been able to find the bug yet.

follow-up: ↓ 14   Changed 2 years ago by simonmar

Is the bug consistently reproducible, i.e. the same expressions always give the same (right or wrong) answers?

Do you get any failures in the numeric tests?

I would proceed by reducing the example further, e.g. call your own tanh and add some traces.

in reply to: ↑ 13   Changed 2 years ago by gwright

Replying to simonmar:

Is the bug consistently reproducible, i.e. the same expressions always give the same (right or wrong) answers? Do you get any failures in the numeric tests? I would proceed by reducing the example further, e.g. call your own tanh and add some traces.

The same expressions give the same correct or incorrect answers. For example, the result of tanh(x) is wrong for x negative. For positive x, the result is correct.

However, what matters is not whether the argument is positive or negative, but whether the result is positive or negative. For the cos function,

Prelude> cos(pi/2 - 0.001)
9.999998333332927e-4
Prelude> cos(pi/2)
6.123233995736766e-17
Prelude> cos(pi/2 + 0.001)
-5.696722673613253e74
Prelude> 

As the result of the cos becomes negative, the result is wrong.

Hmm. This gives me an idea --- perhaps the bug is in showing the result. Let's try

Prelude> let x = -0.0001
Prelude> x
-5.379511282975455e73
Prelude> 

Yes! But compiling

module Main where

main = do
	let x = -0.001
	print x

and running it gives

plumbbob-franklin> ./foo
-1.0e-3

the right answer.

Perhaps the problem is just showing a negative result:

Prelude> let x = -0.0001
Prelude> let y = 1.2
Prelude> x * y
-4.8047181618589074e73
Prelude> let z = -1.2
Prelude> x * z
1.2e-4
Prelude>
Prelude> (abs x) * y
1.2e-4
Prelude> 

So the value of x is saved correctly; only when the result at the REPL is negative do we get garbage.

Well, now I have a better idea of what to look for.

  Changed 2 years ago by gwright

  • summary changed from Incorrect result from trig functions to ghci displays negative floats incorrectly (was: Incorrect result from trig functions)

The bug appears when I try to display negative floating point numbers:

Prelude> -1
-1
Prelude> -2.0
-8.751772232148878e-77
Prelude> 

I've updated the summary to indicate that this is the real problem.
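
A self-checking test along these lines might look like the sketch below. It relies on the fact that, for finite doubles, read . show should be the identity. The sample values are taken from this ticket; the names roundTrips and samples are made up. On an affected GHCi the negative samples would presumably fail, though that has not been verified here.

```haskell
-- Round-trip check: for finite doubles, `read . show` should be the identity.
roundTrips :: Double -> Bool
roundTrips x = read (show x) == x

-- Values that appear in this ticket, negative and positive.
samples :: [Double]
samples = [-2.0, -1.0, -0.0001, -0.7615941559557649, 1.2e-4, 2.0]

main :: IO ()
main = do
  let failures = filter (not . roundTrips) samples
  -- Report a count rather than the values, so the report itself does not
  -- depend on showing a Double.
  if null failures
    then putStrLn "all samples round-trip"
    else putStrLn ("round-trip failures: " ++ show (length failures))
```

Loading the same file in ghci and running main, then compiling and running it, gives a direct comparison between the interpreter and compiled code.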

  Changed 2 years ago by gwright

The source of this trouble seems to be in formatRealFloat in libraries/base/GHC/Float.lhs.

Here's the top of that function:

formatRealFloat :: (RealFloat a) => FFFormat -> Maybe Int -> a -> String
formatRealFloat fmt decs x
   | isNaN x                   = "NaN"
   | isInfinite x              = if x < 0 then "-Infinity" else "Infinity"
   | x < 0 || isNegativeZero x = '-' : doFmt fmt (floatToDigits (toInteger base) (-x))
   | otherwise                 = doFmt fmt (floatToDigits (toInteger base) x)
 where
  base = 10
  ...

The clue is that the incorrectly displayed numbers always have the correct sign. The guard | x < 0 || isNegativeZero x is satisfied, so the minus sign is prefixed to the formatted digits. In retrospect, if I had paid more attention to the gdb traces I did earlier, I would have seen this.

Still don't know exactly what goes wrong yet.

  Changed 2 years ago by gwright

Some progress -- I've seen the error occur in gdb. I start up gdb, source the gdbinit macros and my gdb_ghc script. I still set a breakpoint on the tanh function, as that makes it easier to get close to the bug. After entering

Prelude> tanh(-1.0)

and hitting the breakpoint, I set a watchpoint on $xmm0 and stepped 4000 machine instructions. Then I single stepped displaying each instruction. Eventually I found:

0x000000010524a8e2 in ?? ()
1: x/i $rip  0x10524a8e2:       movsd  0x281b7c(%rip),%xmm7        # 0x1054cc466
(gdb) 
0x000000010524a8ea in ?? ()
1: x/i $rip  0x10524a8ea:       xorpd  %xmm7,%xmm0
(gdb) 
Watchpoint 2: $xmm0

Old value = {
  v4_float = {0, 0, -1.81539845, -5.54887492e-07}, 
  v2_double = {0, -0.76159415595576485}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -65, -24, 94, -6, -75, 20, -13, -108}, 
  v8_int16 = {0, 0, 0, 0, -16408, 24314, -19180, -3180}, 
  v4_int32 = {0, 0, -1075290374, -1256918124}, 
  v2_int64 = {0, -4618336986995559532}, 
  uint128 = 10732945108776183999
}
New value = {
  v4_float = {0, 0, 2.22554744e-27, -3.5318561e+37}, 
  v2_double = {0, 2.9599490498307242e-216}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 19, 48, 83, 119, -3, -44, -112, -36}, 
  v8_int16 = {0, 0, 0, 0, 4912, 21367, -556, -28452}, 
  v4_int32 = {0, 0, 321934199, -36400932}, 
  v2_int64 = {0, 1382696860427522268}, 
  uint128 = 15893437270084235283
}

This is ghci trying to negate the value in $xmm0. (IEEE 754 uses a sign & magnitude representation; the canonical way to negate a floating value is to XOR it with the sign bit mask.) After the xor, $xmm0 contains the incorrect value I see output from ghci.

What can be concluded is that the sign mask loaded into $xmm7 is wrong. It is:

(gdb) p $xmm7
$1 = {
  v4_float = {0, 0, -6.14059368e-12, 394010.25}, 
  v2_double = {0, -1.1531066037009445e-92}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -84, -40, 13, -115, 72, -64, 99, 72}, 
  v8_int16 = {0, 0, 0, 0, -21288, 3469, 18624, 25416}, 
  v4_int32 = {0, 0, -1395126899, 1220567880}, 
  v2_int64 = {0, -5992024403754327224}, 
  uint128 = 5216224211261839532
}
(gdb) p/x $xmm7
$2 = {
  v4_float = {0x0.000000p+0, 0x0.000000p+0, -0x1.b01b1ap-38, 0x1.80c690p+18}, 
  v2_double = {0x0.0000000000000p+0, -0x1.80d8d48c06348p-306}, 
  v16_int8 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xac, 0xd8, 0xd, 0x8d, 0x48, 0xc0, 0x63, 0x48}, 
  v8_int16 = {0x0, 0x0, 0x0, 0x0, 0xacd8, 0xd8d, 0x48c0, 0x6348}, 
  v4_int32 = {0x0, 0x0, 0xacd80d8d, 0x48c06348}, 
  v2_int64 = {0x0, 0xacd80d8d48c06348}, 
  uint128 = 0x4863c0488d0dd8ac0000000000000000
}
(gdb) p/u $xmm7
$3 = {
  v4_float = {0, 0, 0, 394010}, 
  v2_double = {0, 0}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, 172, 216, 13, 141, 72, 192, 99, 72}, 
  v8_int16 = {0, 0, 0, 0, 44248, 3469, 18624, 25416}, 
  v4_int32 = {0, 0, 2899840397, 1220567880}, 
  v2_int64 = {0, 12454719669955224392}, 
  uint128 = 96222353056234618556517257475063283712
}

It's not obvious what's being loaded into $xmm7. There are 16 files containing the "load xmm7 relative to the instruction pointer, xor with xmm0", 113 places total. Could be a static data alignment error or the linker could be wee-hawed.
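
The xor seen in the watchpoint output can be reproduced on the bit patterns alone. This is a minimal sketch, assuming a base library that exports castDoubleToWord64 and castWord64ToDouble from GHC.Float (newer than GHC 7.0); the names signMask, loadedMask and xorBits are illustrative. It applies both the proper sign mask and the value found in $xmm7 to the "Old value" bits.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- The mask that flips only the IEEE 754 sign bit of a Double.
signMask :: Word64
signMask = 0x8000000000000000

-- The nonzero 64-bit lane of $xmm7 in the session above.
loadedMask :: Word64
loadedMask = 0xacd80d8d48c06348

-- XOR a Double's bit pattern with a mask, as xorpd does for one 64-bit lane.
xorBits :: Word64 -> Double -> Double
xorBits mask = castWord64ToDouble . xor mask . castDoubleToWord64

main :: IO ()
main = do
  -- Bit pattern of tanh (-1.0), taken from the "Old value" dump.
  let old = castWord64ToDouble 0xbfe85efab514f394
  -- With the proper mask, the xor is ordinary negation.
  print (xorBits signMask old == negate old)
  -- With the loaded mask, the result has the bits of the "New value" dump.
  print (castDoubleToWord64 (xorBits loadedMask old) == 0x13305377fdd490dc)
```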

follow-up: ↓ 19   Changed 2 years ago by simonmar

Looks like you're getting closer to nailing this one. If you turn on linker debugging (+RTS -Dl) you should be able to see it resolving the relocation at that address, and that should give you enough information to map it to a source file. Once you have the source file, you can look at the assembly and see what the relocation is supposed to be doing.

in reply to: ↑ 18   Changed 2 years ago by gwright

Replying to simonmar:

Looks like you're getting closer to nailing this one. If you turn on linker debugging (+RTS -Dl) you should be able to see it resolving the relocation at that address, and that should give you enough information to map it to a source file. Once you have the source file, you can look at the assembly and see what the relocation is supposed to be doing.

Yes, turning on linker debugging is the plan. I've needed to add some additional debug statements to find out what happens to the __const section of the __TEXT segment. This is supposed to be non-relocatable static data, which by my reading of the documentation means loaded at a specified offset from the __text section. One of two things is happening: in the statement that loads $xmm7,

        movsd  0x281b7c(%rip),%xmm7

either the offset from the instruction pointer is wrong, or the expected data isn't at that location. In the saved assembly files, I've found a number of places where (-<double>) is computed, and the sign bit mask seems always stored in the __const section of __TEXT. If the data's in the wrong place, I'm guessing either an alignment bug or something really odd that puts it in the wrong section.

  Changed 2 years ago by gwright

Seems to be a bad relocation. In the load statement,

        movsd  0x281b7c(%rip),%xmm7

the memory location at %rip + 0x281b7c is inside the __text section, not the __const section. The register $xmm7 is therefore loaded with 8 bytes of machine code, not the expected constant data. (This does explain why the results are repeatable; we're not loading from uninitialized memory.)
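
For reference, the address arithmetic behind that load can be written out. On x86_64 a RIP-relative displacement is added to the address of the following instruction. The sketch below uses the addresses from the gdb trace earlier in this ticket; the function names are made up, and the alignment check rests on the assumption that a compiler-emitted double constant would normally be at least 8-byte aligned, which the ticket itself does not state.

```haskell
import Data.Word (Word64)

-- Effective address of a RIP-relative operand on x86_64: the displacement is
-- added to the address of the *next* instruction. (The displacement is really
-- a signed 32-bit value; this sketch only handles the non-negative case.)
ripRelativeTarget :: Word64 -> Word64 -> Word64
ripRelativeTarget nextInstr disp = nextInstr + disp

-- True when an address is a multiple of 8.
isAligned8 :: Word64 -> Bool
isAligned8 addr = addr `mod` 8 == 0

main :: IO ()
main = do
  -- movsd sits at 0x10524a8e2; the following instruction is at 0x10524a8ea.
  let target = ripRelativeTarget 0x10524a8ea 0x281b7c
  -- The computed target matches the address in gdb's annotation.
  print (target == 0x1054cc466)
  -- The target is not 8-byte aligned.
  print (not (isAligned8 target))
```

A misaligned target is not proof of a bad relocation on its own, but it is consistent with the load landing in machine code rather than on a constant.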

I'm adding more debugging statements so I can get a complete understanding of what the linker is doing.

  Changed 2 years ago by gwright

Some more poking around reveals that there is a class of relocations that aren't being done. These are the relocations in the __text section to data in the __const section of the __TEXT segment. I'm going to add some more debugging statements to show the result of each relocation, or if the relocation was skipped.

This is finally getting close to being fixed.

  Changed 2 years ago by gwright

Now things are a bit clearer: the 64 bit relocations are computed, but not always applied to the loaded object code. It looks as if another part of the OS X linker (in relocateSection) was never written. This is not entirely surprising, given that there was a similar bug earlier. The fix may be as simple as calling relocateAddress in the right place, but I'm not sure yet.

  Changed 2 years ago by altaic

  • cc william.knop.nospam@… added

  Changed 2 years ago by arsenm

  • cc arsenm2@… added

  Changed 2 years ago by gwright

I was half right: rather than being missing, another piece of the OS X 64 bit linker was just sketched in. This code incorrectly calculated some of the less common relocations. The fix was just to replace the erroneous code with a call to relocateAddress.

Here's what happens now:

gwright-macbook> inplace/bin/ghc-stage2 --interactive
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> -1.0
--1.0
Prelude> 

I don't yet understand the double minus sign in the result. It's probably caused by the messing about I did with Float.lhs when trying to track down the bug. I hope there isn't another real bug here, but you never know.

follow-up: ↓ 27   Changed 2 years ago by altaic

Not sure if you had already discovered this, but it appears the bug is limited to Double, while Float appears to be fine:

Prelude> -1.0 :: Float
-1.0
Prelude> -1.0 :: Double
-3.666940035476786e76

Oddly, the transform that's producing the garbage value is reversible (guessing there'd be loss of bits at extremes):

Prelude> -1.0 :: Double
-3.666940035476786e76
Prelude> -3.666940035476786e76 :: Double
-1.0

in reply to: ↑ 26   Changed 2 years ago by gwright

Replying to altaic:

Not sure if you had already discovered this, but it appears the bug is limited to Double, while Float appears to be fine:

Prelude> -1.0 :: Float
-1.0
Prelude> -1.0 :: Double
-3.666940035476786e76

Oddly, the transform that's producing the garbage value is reversible (guessing there'd be loss of bits at extremes):

Prelude> -1.0 :: Double
-3.666940035476786e76
Prelude> -3.666940035476786e76 :: Double
-1.0

I don't think I noticed the Float/Double difference earlier but with the latest linker patch everything seems OK. I'll check again to verify this.

The failed relocations caused garbage values to be printed by ghci, but the intermediate values in the calculations are OK. The error went like this: when a negative Double is printed, a '-' sign is printed, then the negative of the value is printed. IEEE 754 doubles are negated by xor-ing them with a fixed bitmask, which complements the sign bit. The problem was that the relocation to the bitmask was wrong, so the double was xor-ed with a constant, incorrect mask.
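
This mechanism also accounts for the reversibility noted above, since xor with a constant is its own inverse. The sketch below is a simplified model, not the code in Float.lhs: displayed returns the number a reader would see, and wrongMask is borrowed from the gdb session earlier in this ticket purely for illustration (the mask in any particular build may differ). It assumes a base library that exports castDoubleToWord64 and castWord64ToDouble.

```haskell
import Data.Bits (xor)
import Data.Word (Word64)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

-- Negation as xor with a mask; only 0x8000000000000000 gives true negation.
negateWith :: Word64 -> Double -> Double
negateWith mask = castWord64ToDouble . xor mask . castDoubleToWord64

-- Simplified model of the sign handling: the sign is decided by comparison,
-- the digits come from the (possibly faulty) negation. The result is the
-- number a reader would see: a '-' prefix followed by the "negated" value.
displayed :: Word64 -> Double -> Double
displayed mask x
  | x < 0     = negate (negateWith mask x)
  | otherwise = x

-- A wrong mask, chosen only for illustration.
wrongMask :: Word64
wrongMask = 0xacd80d8d48c06348

main :: IO ()
main = do
  let x = -1.0
      shown = displayed wrongMask x
  print (shown < 0)                            -- the sign is still right
  print (shown /= x)                           -- but the magnitude is garbage
  print (displayed wrongMask shown == x)       -- feeding it back recovers x
  print (displayed 0x8000000000000000 x == x)  -- correct mask: no distortion
```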

  Changed 2 years ago by gwright

I did a fresh pull from the repository and slapped my rts/Linker.c into that tree. The validate script passes (with one or two fewer failures than before) and, firing up ghci, I get:

gwright-macbook> inplace/bin/ghc-stage2 --interactive
GHCi, version 7.0.1.20110203: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> -1.0
-1.0
Prelude> tanh(-1.0)
-0.7615941559557649
Prelude> 

Gladness and rejoicing all around. The double minus sign in my hacked-up tree seems the spurious result of my ham-fisted debugging.

I will send a note to ghc HQ about my patch, which adds quite a bit of debugging information. Perhaps they'll want the bug fix patch separately from the additional debugging and orthography changes.

This should be wrapped up in the next day or two.

Changed 2 years ago by gwright

follow-up: ↓ 46   Changed 2 years ago by gwright

The attached patch fixes this bug. Tested with a fresh pull of ghc-7.0 branch on 8 Feb 2011.

Here's testsuite_summary.txt:

OVERALL SUMMARY for test run started at Wed Feb  9 14:58:03 EST 2011
    2683 total tests, which gave rise to
   10035 test cases, of which
       0 caused framework failures
    7661 were skipped

    2246 expected passes
      91 expected failures
       0 unexpected passes
      37 unexpected failures

Unexpected failures:
   1372(normal)
   1959(normal)
   2578(normal)
   3586(normal)
   4850(normal)
   MethSharing(normal)
   T1969(normal)
   T3007(normal)
   T3064(normal)
   T3294(normal)
   T3736(normal)
   T4059(normal)
   T4801(normal)
   bug1465(normal)
   cabal01(normal)
   cabal04(normal)
   conc070(ghci)
   derefnull(normal)
   driver062a(normal)
   driver062b(normal)
   driver062c(normal)
   driver062d(normal)
   driver062e(normal)
   driver081a(normal)
   driver081b(normal)
   gadt23(normal)
   ghciprog004(normal)
   hs-boot(normal)
   mod179(normal)
   outofmem(normal)
   prog003(ghci)
   recomp004(normal)
   rn.prog006(normal)
   rtsopts001(normal)
   rtsopts002(normal)
   space_leak_001(normal)
   withRtsOpts(normal)

It's mildly interesting how this bug made little difference in the number of failed tests. Almost no ghci tests seem to print negative Doubles as their final answer.

Another patch with the additional debug statements, comments and coding standard upgrades will be attached later. This patch only corrects the relocation bug itself.

  Changed 2 years ago by altaic

Strange, I'm getting the double negative in HEAD.

GHCi, version 7.1.20110210: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> -1.0
--1.0

  Changed 2 years ago by altaic

Hmm, that's odd... Perhaps someone with better GHC-fu can figure out why these results are different:

Prelude GHC.Float> showsPrec 0 (-1.0 :: Double) ""
"--1.0"
Prelude GHC.Float> showSignedFloat showFloat 0 (-1.0 :: Double) ""
"-1.0"

Given that in Float.lhs:

instance  Show Double  where
    showsPrec   x = showSignedFloat showFloat x
    showList = showList__ (showsPrec 0)

  Changed 2 years ago by gwright

Hmm. On the ghc-7.0 branch, I get

gwright-macbook> inplace/bin/ghc-stage2 --interactive
GHCi, version 7.0.1.20110203: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude>  showsPrec 0 (-1.0 :: Double) ""
"-1.0"

I'll take a look at HEAD; looks like something strange is going on. When I saw this happen in the 7.0 branch, I thought it was caused by my attempts to debug Float.lhs. If it's showing up in HEAD that's not the cause.

This is a tad worrisome and should be understood before the 7.0.2 release.

  Changed 2 years ago by gwright

Something strange is still going on. With the test program

gwright-macbook> cat foo.hs
--
-- Test returning -1.0
--

module Main where

main = print (-1.0 :: Double)

on HEAD I get

gwright-macbook> inplace/bin/ghc-stage2 --interactive foo.hs
GHCi, version 7.1.20110210: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
[1 of 1] Compiling Main             ( foo.hs, interpreted )
Ok, modules loaded: Main.
*Main> main
--1.0
*Main> 

Compiling the same program gives

gwright-macbook> inplace/bin/ghc-stage2 --make foo.hs
[1 of 1] Compiling Main             ( foo.hs, foo.o )
Linking foo ...
ld: warning: -read_only_relocs cannot be used with x86_64
gwright-macbook> ./foo
-1.0
gwright-macbook> 

So this appears to be a ghci or linker bug still.

Slightly more distressing is when I tried the above test program on the ghc-7.0 tree that I did most of my debugging work on:

gwright-macbook> inplace/bin/ghc-stage2 --interactive foo.hs
GHCi, version 7.0.1.20101221: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Ok, modules loaded: Main.
Prelude Main> main
--0.0
Prelude Main> 

In this case the answer is entirely wrong.

I went back to gdb and stepped through my original test program until I reached the location where %xmm7 is loaded. The memory contents pointed to by %rip plus the offset seem to contain the correct mask, but I will check again.

I suspect an off-by-one error, or a problem with the size of the relocated object.

  Changed 2 years ago by gwright

OK, I've checked whether %xmm7 is loaded correctly and it is. Using gdb to set a watchpoint on %xmm7, I can see the correct mask to invert a double being loaded:

(gdb) c
Continuing.
-Watchpoint 2: $xmm7

Old value = {
  v4_float = {0, 0, -1.875, 0}, 
  v2_double = {0, -1}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -65, -16, 0, 0, 0, 0, 0, 0}, 
  v8_int16 = {0, 0, 0, 0, -16400, 0, 0, 0}, 
  v4_int32 = {0, 0, -1074790400, 0}, 
  v2_int64 = {0, -4616189618054758400}, 
  uint128 = 61631
}
New value = {
  v4_float = {0, 0, 0, 0}, 
  v2_double = {0, 0}, 
  v16_int8 = {0 <repeats 16 times>}, 
  v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
  v4_int32 = {0, 0, 0, 0}, 
  v2_int64 = {0, 0}, 
  uint128 = 0
}
0x000000010524a8ea in ?? ()
(gdb) c
Continuing.
-Watchpoint 2: $xmm7

Old value = {
  v4_float = {0, 0, 0, 0}, 
  v2_double = {0, 0}, 
  v16_int8 = {0 <repeats 16 times>}, 
  v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
  v4_int32 = {0, 0, 0, 0}, 
  v2_int64 = {0, 0}, 
  uint128 = 0
}
New value = {
  v4_float = {0, 0, -0, 0}, 
  v2_double = {0, -0}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -128, 0, 0, 0, 0, 0, 0, 0}, 
  v8_int16 = {0, 0, 0, 0, -32768, 0, 0, 0}, 
  v4_int32 = {0, 0, -2147483648, 0}, 
  v2_int64 = {0, -9223372036854775808}, 
  uint128 = 128
}

Look at v2_int64: the nonzero part is the representation of the IEEE 754 sign bit mask. To see it in hex,

(gdb) info reg all

<snip>

xmm7           {
  v4_float = {0, 0, -0, 0}, 
  v2_double = {0, -0}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -128, 0, 0, 0, 0, 0, 0, 0}, 
  v8_int16 = {0, 0, 0, 0, -32768, 0, 0, 0}, 
  v4_int32 = {0, 0, -2147483648, 0}, 
  v2_int64 = {0, -9223372036854775808}, 
  uint128 = 128
}       (raw 0x00000000000000800000000000000000)

<snip>

The raw value has hex 8 followed by 15 zeros in the lower half, which is the sign bit mask.
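As a cross-check on the gdb output above, the v2_int64 values can be reinterpreted as doubles. What follows is a minimal C sketch, not code from GHC: the helper names are made up for illustration, and it assumes that double is the 64-bit IEEE 754 format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a signed 64-bit integer (as gdb prints v2_int64) as a double.
   memcpy is used so the conversion does not break aliasing rules. */
double int64_bits_to_double(int64_t bits)
{
    double d;
    /* Assumption: double is 64 bits wide (IEEE 754 binary64). */
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Return the raw bit pattern of a double as an unsigned 64-bit value. */
uint64_t double_to_bits(double d)
{
    uint64_t bits;
    assert(sizeof d == sizeof bits);
    memcpy(&bits, &d, sizeof bits);
    return bits;
}
```

With these helpers, -4616189618054758400 (the old value in the first watchpoint hit) comes back as -1.0, and INT64_MIN (which gdb prints as -9223372036854775808) has the bit pattern 0x8000000000000000, i.e. negative zero, which is the sign bit mask.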

I think this is telling us that one relocation bug has been fixed, but there is probably still another. Not surprising, given the untested and rickety nature of the original code. The only question is whether to close out this ticket and open a new one, or continue working on this under this ticket.

  Changed 2 years ago by altaic

I would have sworn I tested this earlier since I noticed showSignedFloat appears to be unnecessary (showFloat calls formatRealFloat which handles signed RealFloat, although it doesn't call showParen which is necessary for showsPrec).

So, I changed the showChar '-' to showChar 'n' in showSignedFloat with the following results:

Prelude GHC.Float> showSignedFloat showFloat 0 (-1.0 :: Double) ""
"n1.0"
Prelude GHC.Float> showsPrec 0 (-1.0 :: Double) ""
"n-1.0"

Clearly it's not negating properly when showSignedFloat calls showPos (-x), and the extra '-' is coming from formatRealFloat. Is the sign stored in a separate register that's getting clobbered? In StgPrimFloat.c it appears to be so; I scattered ASSERTs around to no avail.

Of course the easy fix would be to clean up the negative checks so only one '-' is possible, however there's definitely a deeper problem. And I still don't understand why calling showSignedFloat showFloat directly is at all different from calling showsPrec.

  Changed 2 years ago by altaic

I'm attaching a patch that cleans up float showing so there's no possibility of displaying duplicate negatives. It fixes the main issue of this ticket, but it does not address the underlying issue of the failed negation.

Changed 2 years ago by altaic

  Changed 2 years ago by altaic

Apologies, that patch changes the signature of showFloat which doesn't break anything in GHC; I don't know about other packages.

  Changed 2 years ago by gwright

The real mystery here is why compiled code shows only a single minus sign, but ghci shows two.

A look at the code in Float.lhs shows that two minus signs ought to be displayed. Start with the instance for Show Double:

instance  Show Double  where
    showsPrec   p = showSignedFloat showFloat p

(In the above, I've changed the x in the original source to p, since the argument is the Int precedence, not the Double value to be displayed.) The definitions of showSignedFloat and showFloat are

showSignedFloat :: (RealFloat a)
  => (a -> ShowS)       -- ^ a function that can show unsigned values
  -> Int                -- ^ the precedence of the enclosing context
  -> a                  -- ^ the value to show
  -> ShowS
showSignedFloat showPos p x
   | x < 0 || isNegativeZero x
       = showParen (p > 6) (showChar '?' . showPos (-x))
   | otherwise = showPos x

-- | Show a signed 'RealFloat' value to full precision
-- using standard decimal notation for arguments whose absolute value lies
-- between @0.1@ and @9,999,999@, and scientific notation otherwise.
showFloat :: (RealFloat a) => a -> ShowS
showFloat x  =  showString (formatRealFloat FFGeneric Nothing x)

The last thing we need is the first bit of formatRealFloat:

formatRealFloat :: (RealFloat a) => FFFormat -> Maybe Int -> a -> String
formatRealFloat fmt decs x
   | isNaN x                   = "NaN"
   | isInfinite x              = if x < 0 then "-Infinity" else "Infinity"
   | x < 0 || isNegativeZero x = '!' : doFmt fmt (floatToDigits (toInteger base) (-x))
   | otherwise                 = doFmt fmt (floatToDigits (toInteger base) x)
 where
   <snip>

Also, in the above I have substituted '!' and '?' for the '-' signs. This will let us see which path through the code is taken.

Now say we want to show a Double. When ghci displays a Double, say -1.0, the Show instance should be invoked as

 showsPrec 0 (-1.0) ""

Expand this out:

showSignedFloat showFloat 0 (-1.0) ""

showSignedFloat sees that the argument is negative, so it prefixes a '?' to the output string. showFloat also sees a negative argument, so it prefixes a '!' to the output string. Normal order evaluation is from the outside in, so we should see '?!' before the numeric value. Let's check:

Prelude GHC.Float> showsPrec 0 (-1.0) ""
"?!1.0"

So it appears that ghci is doing exactly what is specified by the code in Float.lhs.

Here's the real mystery. Let's compile this program, which should produce the same output:

gwright-macbook> cat foo.hs
--
-- Test returning -1.0
--

module Main where

main = putStrLn (show (-1.0))

Compiling and running,

gwright-macbook> inplace/bin/ghc-stage2 --make foo.hs
[1 of 1] Compiling Main             ( foo.hs, foo.o )
Linking foo ...
ld: warning: -read_only_relocs cannot be used with x86_64
gwright-macbook> ./foo
?1.0

So we have used showSignedFloat but what happened to formatRealFloat? Did it get passed an incorrect (positive) argument? Or is the compiler doing some kind of optimization that changes the code?

It seems as if Float.lhs needs to be corrected so it only displays a single '-' sign, but whatever is happening needs to be understood before we make changes, so we don't simply cover up the real bug.

follow-up: ↓ 43   Changed 2 years ago by batterseapower

I think a single minus sign is the expected result, and it is what I get with GHC and GHCi (near HEAD) on OS X 10.6. Note that:

showSignedFloat showFloat 0 (-1.0) ""

Calls showFloat with the last argument negated after outputting a minus sign, so showFloat itself will not get a chance to produce an extra minus sign. (The relevant line of showSignedFloat is Float.lhs:1016) The question seems to be why that negation is not happening on your machine.

This seems unlikely, but is it possible that your GHCi is still somehow finding the version of the libraries that you edited before, and that version accidentally removed the negation?
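The point about the negation can be modelled outside Haskell. The following C sketch is only an analogy with hypothetical names (show_signed and format_real stand in for showSignedFloat and formatRealFloat, with the precedence argument fixed at 0); the negation is passed in as a function so that a negation that silently does nothing can be simulated.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* True for negative numbers and for negative zero. */
static int is_negative(double x)
{
    return x < 0 || (x == 0 && signbit(x));
}

/* Stand-in for formatRealFloat: adds its own '-' if it sees a negative value. */
static int format_real(char *out, size_t n, double x)
{
    if (is_negative(x))
        return snprintf(out, n, "-%.1f", -x);
    return snprintf(out, n, "%.1f", x);
}

/* Stand-in for showSignedFloat: emits '-' and passes the negated value on.
   Returns the number of characters written (excluding the terminator). */
int show_signed(char *out, size_t n, double x, double (*negate)(double))
{
    assert(out != NULL && n >= 2);
    if (is_negative(x)) {
        out[0] = '-';
        return 1 + format_real(out + 1, n - 1, negate(x));
    }
    return format_real(out, n, x);
}

double working_negate(double x) { return -x; }  /* what should happen */
double broken_negate(double x)  { return x; }   /* negation that has no effect */
```

Calling show_signed with -1.0 and working_negate yields "-1.0", while broken_negate yields "--1.0": the second sign comes from the inner formatter seeing a value that is still negative.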

  Changed 2 years ago by altaic

It's happening on both of our machines, and I did a fresh pull from HEAD. What optimizations did you use for the build without the bug?

  Changed 2 years ago by batterseapower

I should clarify that I'm doing a 32 bit build of GHC, so my failure to reproduce the problems is expected. However, I'm trying to point out that the problem does not seem to be with the code in Float.lhs.

follow-up: ↓ 44   Changed 2 years ago by altaic

Agreed. I'm going to have another look at Linker.c.

Optimizations appear to affect the bug quite a bit. The double '-' happened for me with the "devel2" BuildFlavour. With the "quickest" BuildFlavour (no optimizations) I get the following output from GHCi:

Prelude> -1.0 :: Float
-1.0
Prelude> -1.0 :: Double
1.0000000000000253
Prelude> (-1.0 :: Double) + (1.0 :: Double)
2.0000000000000253

So it's not just the failed negation of x in showSignedFloat; it seems that other negations are failing as well, which is unsurprising.

in reply to: ↑ 39   Changed 2 years ago by gwright

Replying to batterseapower:

I think a single minus sign is the expected result, and it is what I get with GHC and GHCi (near HEAD) on OS X 10.6. Note that:

showSignedFloat showFloat 0 (-1.0) ""

Calls showFloat with the last argument negated after outputting a minus sign, so showFloat itself will not get a chance to produce an extra minus sign. (The relevant line of showSignedFloat is Float.lhs:1016) The question seems to be why that negation is not happening on your machine.

Checking again (after coffee) I agree with this. My above argument is wrong. The question is why the negation is not being done. showFloat ought to see a positive argument.

This seems unlikely, but is it possible that your GHCi is still somehow finding the version of the libraries that you edited before, and that version accidentally removed the negation?

No, I can edit the libraries, change the minus signs to other characters, and when I rebuild I see the expected result (i.e., the minus signs are changed to the specified characters).

Thinking about this some more, there are probably more incorrect relocations. Doubles are negated by xor-ing them with a fixed bitmask, 0x8000000000000000. The thing to do is to track down the assembly code corresponding to the negation in showSignedFloat. One thing that would cause this problem is if the loaded bitmask were all zeroes instead of having '1' in the msb. xor-ing with an all zero mask will return the Double value unchanged.
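To illustrate the failure mode described here, a small C sketch that negates a double by xor-ing its bit pattern with a mask. The names are hypothetical and this is not the code GHC generates; it only assumes a 64-bit IEEE 754 double.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The sign bit mask for a 64-bit IEEE 754 double. */
#define SIGN_MASK UINT64_C(0x8000000000000000)

/* XOR the bit pattern of x with mask and return the result as a double. */
double xor_with_mask(double x, uint64_t mask)
{
    uint64_t bits;
    assert(sizeof x == sizeof bits);
    memcpy(&bits, &x, sizeof bits);   /* read the raw bits */
    bits ^= mask;                     /* flip whichever bits the mask selects */
    memcpy(&x, &bits, sizeof x);      /* write them back */
    return x;
}
```

With SIGN_MASK, 1.0 becomes -1.0 and -1.0 becomes 1.0. With a mask of zero, -1.0 comes back unchanged, which is the situation that would leave a second minus sign to be printed.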

in reply to: ↑ 42   Changed 2 years ago by gwright

Replying to altaic:

Agreed. I'm going to have another look at Linker.c.

Optimizations appear to affect the bug quite a bit. The double '-' happened for me with the "devel2" BuildFlavour. With the "quickest" BuildFlavour (no optimizations) I get the following output from GHCi:

Prelude> -1.0 :: Float
-1.0
Prelude> -1.0 :: Double
1.0000000000000253
Prelude> (-1.0 :: Double) + (1.0 :: Double)
2.0000000000000253

So it's not just the failed negation of x in showSignedFloat; it seems that other negations are failing as well, which is unsurprising.

Changes with optimization level point to a relocation issue. As the code changes, the references to constant data move around, and the relocations are different.

  Changed 2 years ago by gwright

Looks like an alignment error. Here's the gdb trace of the negation in showSignedFloat:

0x000000010524a8dd in ?? ()
1: x/i $rip  0x10524a8dd:       movsd  0x18(%rbx),%xmm0
(gdb) 
0x000000010524a8e2 in ?? ()
1: x/i $rip  0x10524a8e2:       movsd  0x3070c2(%rip),%xmm7        # 0x1055519ac
(gdb) 
Watchpoint 2: $xmm7

Old value = {
  v4_float = {0, 0, -1.875, 0}, 
  v2_double = {0, -1}, 
  v16_int8 = {0, 0, 0, 0, 0, 0, 0, 0, -65, -16, 0, 0, 0, 0, 0, 0}, 
  v8_int16 = {0, 0, 0, 0, -16400, 0, 0, 0}, 
  v4_int32 = {0, 0, -1074790400, 0}, 
  v2_int64 = {0, -4616189618054758400}, 
  uint128 = 61631
}
New value = {
  v4_float = {0, 0, 0, 0}, 
  v2_double = {0, 0}, 
  v16_int8 = {0 <repeats 16 times>}, 
  v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0}, 
  v4_int32 = {0, 0, 0, 0}, 
  v2_int64 = {0, 0}, 
  uint128 = 0
}

Note that %xmm7 is loaded with all zeroes. If we look at the memory from which %xmm7 was loaded, we see:

(gdb) p8 0x1055519ac
0x1055519e4:    0x69666e4900000000
0x1055519dc:    0x7974696e69
0x1055519d4:    0x666e492d00000000
0x1055519cc:    0x4e614e00000000
0x1055519c4:    0x65756c615620
0x1055519bc:    0x646162203a6f5464
0x1055519b4:    0x6e756f7280000000
0x1055519ac:    0x0
(gdb) 

We're off by four bytes. The correct mask starts four bytes higher, at 0x1055519b0 (look for 80000000). This is good --- it explains why in one of my builds, the error didn't show up. In that case, by chance the memory was aligned on the correct boundary.
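The four byte offset can be checked by hand from the dump. In this C sketch the byte array is transcribed from the two lowest lines of the gdb output above (x86_64 is little-endian, so each 8-byte word is written low byte first); the array and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 16 bytes starting at 0x1055519ac, taken from the gdb dump. */
const uint8_t dump_at_9ac[16] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,  /* 0x1055519ac: 0x0 */
    0x00, 0x00, 0x00, 0x80, 0x72, 0x6f, 0x75, 0x6e   /* 0x1055519b4: 0x6e756f7280000000 */
};

/* Assemble a little-endian 64-bit value from buf[off .. off+7]. */
uint64_t load_u64_le(const uint8_t *buf, size_t len, size_t off)
{
    uint64_t v = 0;
    assert(off + 8 <= len);   /* the load must stay inside the buffer */
    for (size_t i = 0; i < 8; i++)
        v |= (uint64_t)buf[off + i] << (8 * i);
    return v;
}
```

A load at offset 0 (address 0x1055519ac) gives 0, which is what ended up in %xmm7. A load at offset 4 (address 0x1055519b0) gives 0x8000000000000000, the sign bit mask.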

The question now is whether the relocations need to be corrected, or if the memory images are not being aligned properly when they are copied from the object file.
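For reference, the address arithmetic behind the movsd in the trace can be written out. This is a sketch with hypothetical function names, assuming the usual x86-64 rule that a %rip-relative operand is relative to the address of the next instruction (0x10524a8ea in the earlier watchpoint output). It does not say which side is wrong; it only shows what displacement would have been needed.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of a %rip-relative operand: next instruction + disp32. */
uint64_t rip_relative_target(uint64_t next_rip, int32_t disp32)
{
    return next_rip + (uint64_t)(int64_t)disp32;
}

/* Displacement that makes a %rip-relative operand point at target. */
int32_t disp32_for_target(uint64_t next_rip, uint64_t target)
{
    int64_t d = (int64_t)(target - next_rip);
    assert(d >= INT32_MIN && d <= INT32_MAX);  /* must fit in 32 bits */
    return (int32_t)d;
}
```

With next_rip = 0x10524a8ea, the displacement 0x3070c2 seen in the trace resolves to 0x1055519ac, while reaching the mask at 0x1055519b0 would need 0x3070c6.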

in reply to: ↑ 29   Changed 2 years ago by igloo

Replying to gwright:

Almost no ghci tests seem to print negative Doubles as their final answer.

If you're running the testsuite in fast mode then very few ghci tests will run. If you're mainly interested in ghci tests then try make fulltest WAY=ghci or make WAY=ghci under testsuite/tests/ghc-regress.

There are also some ghci tests in way 'normal', though.

  Changed 2 years ago by gwright

The memory images are aligned properly. This isn't surprising, since the linker is using mmap, and mmap returns page-aligned memory, which satisfies the 16 byte alignment that is the most restrictive the linker requires.

I think the misalignment comes from the way I use the relocateAddress function. A perusal of my favorite bedtime reading, Apple's "ABI Mach-O File Format Reference", tells me that for non-external relocations, the r_symbolnum field of the reloc structure is used as an index to the target section. (A potential gotcha is that the r_symbolnum field counts sections from 1 while the linker counts sections from 0. The code must have been written by a frustrated FORTRAN programmer.) I'm now testing what I think should be the correct relocation calculation using the target section information.
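The counting gotcha can be sketched as follows. This is not code from rts/Linker.c: struct section_info and target_section are hypothetical names, and the only facts relied on are the ones stated above, namely that r_symbolnum counts sections from 1 while the section array is indexed from 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for whatever per-section record the linker keeps. */
struct section_info {
    const char *name;
};

/* Map the 1-based section number of a non-external relocation to an entry in
   a 0-based section array. Returns NULL if the number names no section. */
const struct section_info *target_section(const struct section_info *sections,
                                          uint32_t nsects,
                                          uint32_t r_symbolnum)
{
    assert(sections != NULL);
    if (r_symbolnum == 0 || r_symbolnum > nsects)
        return NULL;
    return &sections[r_symbolnum - 1];   /* 1-based number to 0-based index */
}
```

Forgetting the subtraction would silently pick the following section, so an explicit helper with a range check seems a reasonable way to keep the conversion in one place.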

I'll do a fresh pull, make a minimal patch and run the whole test suite once this works.

  Changed 2 years ago by gwright

The answer to the misalignment is even simpler than I thought. For relocations that aren't external or through the global offset table, the displacement in the code should be left unchanged. I've corrected my change to rts/Linker.c to do this.

These null fixups are a bit odd, but I guess they take care of some corner case.

I've run the entire testsuite with my new patch and get 70 unexpected failures, 12 of which are ghci failures. I think this is better than before, but not dramatically so.

The reason that we're not seeing an impressive reduction in test failures is that the bad relocations were quite uncommon: only 50 relocations were non-external/non-GOT out of the approximately 210000 performed when ghci starts with no files specified on the command line.

Once the updated patch is applied, it's probably time to close this bug.

Changed 2 years ago by gwright

  Changed 2 years ago by igloo

  • status changed from new to closed
  • resolution set to fixed

Applied, thanks!

  Changed 2 years ago by PHO

  • cc pho@… added