| 4 | | |
| 5 | | == Lots of temporary variables == |
| 6 | | |
| 7 | | WONTFIX. Lots of temporary variables (these can tickle other issues when the temporaries are long-lived, but otherwise would be optimized away). You can at least eliminate some of them by looking at the output of `-ddump-opt-cmm`, which utilizes some basic temporary inlining when used with the native backend `-fasm`, but this doesn't currently apply to the GCC or LLVM backends. |
| 8 | | |
| 9 | | ~~At least one major culprit for this is `allocDynClosure`, described in Note `Return a LocalReg`; this pins down the value of the `CmmExpr` to be something for one particular time, but for a vast majority of use-cases the expression is used immediately afterwards. Actually, this is mostly my patches fault, because the extra rewrite means that the inline pass is broken.~~ Fixed in latest version of the pass; we don't quite manage to inline enough but there's only one extra temporary. |
| 10 | | |
| 11 | | Another cause of all of these temporary variables is that the new code generator immediately assigns any variables that were on the stack to temporaries immediately upon entry to a function. This is on purpose. The idea is we optimize these temporary variables away. |
| 12 | | |
| 13 | | == Rewriting stacks == |
| 14 | | |
| 15 | | FIXED. `3586.hs` emits the following code: |
| 16 | | |
| 17 | | {{{ |
| 18 | | Main.$wa_entry() |
| 19 | | { [const Main.$wa_slow-Main.$wa_info;, const 3591;, const 0;, |
| 20 | | const 458752;, const 0;, const 15;] |
| 21 | | } |
| 22 | | c17W: |
| 23 | | _s16B::F64 = F64[Sp + 20]; |
| 24 | | F64[Sp - 8] = _s16B::F64; |
| 25 | | _s16h::I32 = I32[Sp + 16]; |
| 26 | | _s16j::I32 = I32[Sp + 12]; |
| 27 | | _s16y::I32 = I32[Sp + 8]; |
| 28 | | _s16x::I32 = I32[Sp + 4]; |
| 29 | | _s16w::I32 = I32[Sp + 0]; |
| 30 | | if (Sp - 12 < SpLim) goto u1bR; |
| 31 | | Sp = Sp + 32; |
| 32 | | // [SNIP] |
| 33 | | // directEntry else |
| 34 | | // emitCall: Sequel: Return |
| 35 | | F64[Sp - 12] = _s17a::F64; |
| 36 | | I32[Sp - 16] = _s17b::I32; |
| 37 | | I32[Sp - 20] = _s16j::I32; |
| 38 | | I32[Sp - 24] = _s16y::I32; |
| 39 | | I32[Sp - 28] = _s16x::I32; |
| 40 | | I32[Sp - 32] = _s16w::I32; |
| 41 | | Sp = Sp - 32; |
| 42 | | jump Main.$wa_info (); |
| 43 | | u1bR: |
| 44 | | Sp = Sp + 32; |
| 45 | | // outOfLine here |
| 46 | | R1 = Main.$wa_closure; |
| 47 | | F64[Sp - 12] = _s16B::F64; |
| 48 | | I32[Sp - 16] = _s16h::I32; |
| 49 | | I32[Sp - 20] = _s16j::I32; |
| 50 | | I32[Sp - 24] = _s16y::I32; |
| 51 | | I32[Sp - 28] = _s16x::I32; |
| 52 | | I32[Sp - 32] = _s16w::I32; |
| 53 | | Sp = Sp - 32; |
| 54 | | jump stg_gc_fun (); |
| 55 | | }}} |
| 56 | | |
| 57 | | We see that these temporary variables are being repeatedly rewritten to the stack, even when there are no changes. |
| 58 | | |
| 59 | | Since these areas on the stack are all old call areas, one way to fix this is to inline all of the memory references. However, this has certain undesirable properties for other code, so we need to be a little more clever. The key thing to notice is that these accesses are only used once per control flow path, in which case sinking the loads down and then inlining them should be OK (it will increase code size but not execution time.) However, the other difficulty is that the CmmOpt inliner, as it stands, won't inline things that look like this because although the variable is only used once in different branches, the same name is used, so it can't distinguish between the temporaries with mutually exclusive live ranges. Building a more clever inliner with Hoopl is also a bit tricky, because inlining is a forward analysis/transformation, but usage counting is a backwards analysis. |
| 60 | | |
| 61 | | This looks fixed with the patch from April 14. |
| 62 | | |
| 63 | | == Spilling Hp/Sp == |
| 64 | | |
| 65 | | FIXED. `3586.hs` emits the following code: |
| 66 | | |
| 67 | | {{{ |
| 68 | | _c1ao::I32 = Hp - 4; |
| 69 | | I32[Sp - 20] = _c1ao::I32; |
| 70 | | foreign "ccall" |
| 71 | | newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; |
| 72 | | _c1ao::I32 = I32[Sp - 20]; |
| 73 | | }}} |
| 74 | | |
| 75 | | We see `Hp - 4` being allocated to a temp, and then consequently being spilled to the stack even though `newCAF` definitely will not change `Hp`, so we could have floated the expression down. |
| 76 | | |
| 77 | | This seems to happen whenever there's a `newCAF` ccall. |
| 78 | | |
| 79 | | We also seem to reload these values multiple times. |
| 80 | | |
| 81 | | {{{ |
| 82 | | _c7Yt::I32 = Hp - 4; |
| 83 | | I32[Sp - 28] = _c7Yt::I32; |
| 84 | | foreign "ccall" |
| 85 | | newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; |
| 86 | | _c7Yt::I32 = I32[Sp - 28]; |
| 87 | | I32[R1 + 4] = _c7Yt::I32; |
| 88 | | I32[R1] = stg_IND_STATIC_info; |
| 89 | | _c7Yt::I32 = I32[Sp - 28]; <--- totally unnecessary |
| 90 | | I32[Sp - 8] = _c7Yt::I32; |
| 91 | | I32[Sp - 12] = stg_upd_frame_info; |
| 92 | | }}} |
| 93 | | |
| 94 | | ~~We need to not spill across certain foreign calls, but for which calls this is OK for is unclear.~~ Variables stay live across all unsafe foreign calls (foreign calls in the middle), except for the obvious cases (the return registers), so no spilling should happen at all. The liveness analysis is too conservative. |
| 95 | | |
| 96 | | This is not fixed in the April 14 version of the patch... we still need to fix the liveness analysis? I thought I fixed that... that's because the transform did extra spilling for CmmUnsafeForeignCalls. Removed that code, and now it's fixed. |
| 97 | | |
| 98 | | == Up and Down == |
| 99 | | |
| 100 | | FIXED. A frequent pattern is the stack pointer being bumped up and then back down again, for no particular reason. |
| 101 | | |
| 102 | | {{{ |
| 103 | | Sp = Sp + 4; |
| 104 | | Sp = Sp - 4; |
| 105 | | jump block_c7xh_entry (); |
| 106 | | }}} |
| 107 | | |
| 108 | | This is mentioned at the very top of `cmm-notes`. This was a bug in the stack layout code that I have fixed. |
| 109 | | |
| 110 | | == Sp is generally stupid == |
| 111 | | |
| 112 | | FIXED. Here is an optimized C-- sample from `arr016.hs`. |
| 113 | | |
| 114 | | {{{ |
| 115 | | Main.D:Arbitrary_entry() |
| 116 | | { [const 131084;, const 0;, const 15;] |
| 117 | | } |
| 118 | | c7J5: |
| 119 | | _B1::I32 = I32[Sp + 4]; |
| 120 | | _B2::I32 = I32[Sp + 0]; |
| 121 | | if ((Sp + 0) < I32[BaseReg + 84]) goto u7Jf; |
| 122 | | Sp = Sp + 12; |
| 123 | | Hp = Hp + 12; |
| 124 | | if (Hp > I32[BaseReg + 92]) goto c7Jc; |
| 125 | | I32[Hp - 8] = Main.D:Arbitrary_con_info; |
| 126 | | I32[Hp - 4] = _B2::I32; |
| 127 | | I32[Hp + 0] = _B1::I32; |
| 128 | | _c7J4::I32 = Hp - 7; |
| 129 | | R1 = _c7J4::I32; |
| 130 | | Sp = Sp - 4; |
| 131 | | jump (I32[Sp + 0]) (); |
| 132 | | u7Jf: |
| 133 | | Sp = Sp + 12; |
| 134 | | goto c7Jd; |
| 135 | | c7Jc: |
| 136 | | I32[BaseReg + 112] = 12; |
| 137 | | goto c7Jd; |
| 138 | | c7Jd: |
| 139 | | R1 = Main.D:Arbitrary_closure; |
| 140 | | Sp = Sp - 12; |
| 141 | | jump (I32[BaseReg - 4]) (); |
| 142 | | } |
| 143 | | }}} |
| 144 | | |
| 145 | | Compare with the old code: |
| 146 | | |
| 147 | | {{{ |
| 148 | | Main.D:Arbitrary_entry() |
| 149 | | { [const 131084;, const 0;, const 15;] |
| 150 | | } |
| 151 | | c4pX: |
| 152 | | Hp = Hp + 12; |
| 153 | | if (Hp > I32[BaseReg + 92]) goto c4pU; |
| 154 | | I32[Hp - 8] = Main.D:Arbitrary_con_info; |
| 155 | | I32[Hp - 4] = I32[Sp + 0]; |
| 156 | | I32[Hp + 0] = I32[Sp + 4]; |
| 157 | | R1 = Hp - 7; |
| 158 | | Sp = Sp + 8; |
| 159 | | jump (I32[Sp + 0]) (); |
| 160 | | c4pV: |
| 161 | | R1 = Main.D:Arbitrary_closure; |
| 162 | | jump (I32[BaseReg - 4]) (); |
| 163 | | c4pU: |
| 164 | | I32[BaseReg + 112] = 12; |
| 165 | | goto c4pV; |
| 166 | | } |
| 167 | | }}} |
| 168 | | |
| 169 | | You can see the up and down behavior here, but that's been fixed, so ignore it for now. (Update the C--!) The unfixed problem is this (some of the other problems were already addressed): we do an unnecessary stack check on entry to this function. We should eliminate the stack check (and by dead code analysis, the GC call) in such cases. |
| 170 | | |
| 171 | | This pattern essentially happens for every function, since we always assign incoming parameters to temporary variables before doing anything. |
| | 235 | |
| | 236 | == Lots of temporary variables == |
| | 237 | |
| | 238 | WONTFIX. Lots of temporary variables (these can tickle other issues when the temporaries are long-lived, but otherwise would be optimized away). You can at least eliminate some of them by looking at the output of `-ddump-opt-cmm`, which utilizes some basic temporary inlining when used with the native backend `-fasm`, but this doesn't currently apply to the GCC or LLVM backends. |
| | 239 | |
| | 240 | ~~At least one major culprit for this is `allocDynClosure`, described in Note `Return a LocalReg`; this pins down the value of the `CmmExpr` to be something for one particular time, but for a vast majority of use-cases the expression is used immediately afterwards. Actually, this is mostly my patches fault, because the extra rewrite means that the inline pass is broken.~~ Fixed in latest version of the pass; we don't quite manage to inline enough but there's only one extra temporary. |
| | 241 | |
| | 242 | Another cause of all of these temporary variables is that the new code generator immediately assigns any variables that were on the stack to temporaries immediately upon entry to a function. This is on purpose. The idea is we optimize these temporary variables away. |
| | 243 | |
| | 244 | == Rewriting stacks == |
| | 245 | |
| | 246 | FIXED. `3586.hs` emits the following code: |
| | 247 | |
| | 248 | {{{ |
| | 249 | Main.$wa_entry() |
| | 250 | { [const Main.$wa_slow-Main.$wa_info;, const 3591;, const 0;, |
| | 251 | const 458752;, const 0;, const 15;] |
| | 252 | } |
| | 253 | c17W: |
| | 254 | _s16B::F64 = F64[Sp + 20]; |
| | 255 | F64[Sp - 8] = _s16B::F64; |
| | 256 | _s16h::I32 = I32[Sp + 16]; |
| | 257 | _s16j::I32 = I32[Sp + 12]; |
| | 258 | _s16y::I32 = I32[Sp + 8]; |
| | 259 | _s16x::I32 = I32[Sp + 4]; |
| | 260 | _s16w::I32 = I32[Sp + 0]; |
| | 261 | if (Sp - 12 < SpLim) goto u1bR; |
| | 262 | Sp = Sp + 32; |
| | 263 | // [SNIP] |
| | 264 | // directEntry else |
| | 265 | // emitCall: Sequel: Return |
| | 266 | F64[Sp - 12] = _s17a::F64; |
| | 267 | I32[Sp - 16] = _s17b::I32; |
| | 268 | I32[Sp - 20] = _s16j::I32; |
| | 269 | I32[Sp - 24] = _s16y::I32; |
| | 270 | I32[Sp - 28] = _s16x::I32; |
| | 271 | I32[Sp - 32] = _s16w::I32; |
| | 272 | Sp = Sp - 32; |
| | 273 | jump Main.$wa_info (); |
| | 274 | u1bR: |
| | 275 | Sp = Sp + 32; |
| | 276 | // outOfLine here |
| | 277 | R1 = Main.$wa_closure; |
| | 278 | F64[Sp - 12] = _s16B::F64; |
| | 279 | I32[Sp - 16] = _s16h::I32; |
| | 280 | I32[Sp - 20] = _s16j::I32; |
| | 281 | I32[Sp - 24] = _s16y::I32; |
| | 282 | I32[Sp - 28] = _s16x::I32; |
| | 283 | I32[Sp - 32] = _s16w::I32; |
| | 284 | Sp = Sp - 32; |
| | 285 | jump stg_gc_fun (); |
| | 286 | }}} |
| | 287 | |
| | 288 | We see that these temporary variables are being repeatedly rewritten to the stack, even when there are no changes. |
| | 289 | |
| | 290 | Since these areas on the stack are all old call areas, one way to fix this is to inline all of the memory references. However, this has certain undesirable properties for other code, so we need to be a little more clever. The key thing to notice is that these accesses are only used once per control flow path, in which case sinking the loads down and then inlining them should be OK (it will increase code size but not execution time.) However, the other difficulty is that the CmmOpt inliner, as it stands, won't inline things that look like this because although the variable is only used once in different branches, the same name is used, so it can't distinguish between the temporaries with mutually exclusive live ranges. Building a more clever inliner with Hoopl is also a bit tricky, because inlining is a forward analysis/transformation, but usage counting is a backwards analysis. |
| | 291 | |
| | 292 | This looks fixed with the patch from April 14. |
| | 293 | |
| | 294 | == Spilling Hp/Sp == |
| | 295 | |
| | 296 | FIXED. `3586.hs` emits the following code: |
| | 297 | |
| | 298 | {{{ |
| | 299 | _c1ao::I32 = Hp - 4; |
| | 300 | I32[Sp - 20] = _c1ao::I32; |
| | 301 | foreign "ccall" |
| | 302 | newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; |
| | 303 | _c1ao::I32 = I32[Sp - 20]; |
| | 304 | }}} |
| | 305 | |
| | 306 | We see `Hp - 4` being allocated to a temp, and then consequently being spilled to the stack even though `newCAF` definitely will not change `Hp`, so we could have floated the expression down. |
| | 307 | |
| | 308 | This seems to happen whenever there's a `newCAF` ccall. |
| | 309 | |
| | 310 | We also seem to reload these values multiple times. |
| | 311 | |
| | 312 | {{{ |
| | 313 | _c7Yt::I32 = Hp - 4; |
| | 314 | I32[Sp - 28] = _c7Yt::I32; |
| | 315 | foreign "ccall" |
| | 316 | newCAF((BaseReg, PtrHint), (R1, PtrHint))[_unsafe_call_]; |
| | 317 | _c7Yt::I32 = I32[Sp - 28]; |
| | 318 | I32[R1 + 4] = _c7Yt::I32; |
| | 319 | I32[R1] = stg_IND_STATIC_info; |
| | 320 | _c7Yt::I32 = I32[Sp - 28]; <--- totally unnecessary |
| | 321 | I32[Sp - 8] = _c7Yt::I32; |
| | 322 | I32[Sp - 12] = stg_upd_frame_info; |
| | 323 | }}} |
| | 324 | |
| | 325 | ~~We need to not spill across certain foreign calls, but for which calls this is OK for is unclear.~~ Variables stay live across all unsafe foreign calls (foreign calls in the middle), except for the obvious cases (the return registers), so no spilling should happen at all. The liveness analysis is too conservative. |
| | 326 | |
| | 327 | This is not fixed in the April 14 version of the patch... we still need to fix the liveness analysis? I thought I fixed that... that's because the transform did extra spilling for CmmUnsafeForeignCalls. Removed that code, and now it's fixed. |
| | 328 | |
| | 329 | == Up and Down == |
| | 330 | |
| | 331 | FIXED. A frequent pattern is the stack pointer being bumped up and then back down again, for no particular reason. |
| | 332 | |
| | 333 | {{{ |
| | 334 | Sp = Sp + 4; |
| | 335 | Sp = Sp - 4; |
| | 336 | jump block_c7xh_entry (); |
| | 337 | }}} |
| | 338 | |
| | 339 | This is mentioned at the very top of `cmm-notes`. This was a bug in the stack layout code that I have fixed. |
| | 340 | |
| | 341 | == Sp is generally stupid == |
| | 342 | |
| | 343 | FIXED. Here is an optimized C-- sample from `arr016.hs`. |
| | 344 | |
| | 345 | {{{ |
| | 346 | Main.D:Arbitrary_entry() |
| | 347 | { [const 131084;, const 0;, const 15;] |
| | 348 | } |
| | 349 | c7J5: |
| | 350 | _B1::I32 = I32[Sp + 4]; |
| | 351 | _B2::I32 = I32[Sp + 0]; |
| | 352 | if ((Sp + 0) < I32[BaseReg + 84]) goto u7Jf; |
| | 353 | Sp = Sp + 12; |
| | 354 | Hp = Hp + 12; |
| | 355 | if (Hp > I32[BaseReg + 92]) goto c7Jc; |
| | 356 | I32[Hp - 8] = Main.D:Arbitrary_con_info; |
| | 357 | I32[Hp - 4] = _B2::I32; |
| | 358 | I32[Hp + 0] = _B1::I32; |
| | 359 | _c7J4::I32 = Hp - 7; |
| | 360 | R1 = _c7J4::I32; |
| | 361 | Sp = Sp - 4; |
| | 362 | jump (I32[Sp + 0]) (); |
| | 363 | u7Jf: |
| | 364 | Sp = Sp + 12; |
| | 365 | goto c7Jd; |
| | 366 | c7Jc: |
| | 367 | I32[BaseReg + 112] = 12; |
| | 368 | goto c7Jd; |
| | 369 | c7Jd: |
| | 370 | R1 = Main.D:Arbitrary_closure; |
| | 371 | Sp = Sp - 12; |
| | 372 | jump (I32[BaseReg - 4]) (); |
| | 373 | } |
| | 374 | }}} |
| | 375 | |
| | 376 | Compare with the old code: |
| | 377 | |
| | 378 | {{{ |
| | 379 | Main.D:Arbitrary_entry() |
| | 380 | { [const 131084;, const 0;, const 15;] |
| | 381 | } |
| | 382 | c4pX: |
| | 383 | Hp = Hp + 12; |
| | 384 | if (Hp > I32[BaseReg + 92]) goto c4pU; |
| | 385 | I32[Hp - 8] = Main.D:Arbitrary_con_info; |
| | 386 | I32[Hp - 4] = I32[Sp + 0]; |
| | 387 | I32[Hp + 0] = I32[Sp + 4]; |
| | 388 | R1 = Hp - 7; |
| | 389 | Sp = Sp + 8; |
| | 390 | jump (I32[Sp + 0]) (); |
| | 391 | c4pV: |
| | 392 | R1 = Main.D:Arbitrary_closure; |
| | 393 | jump (I32[BaseReg - 4]) (); |
| | 394 | c4pU: |
| | 395 | I32[BaseReg + 112] = 12; |
| | 396 | goto c4pV; |
| | 397 | } |
| | 398 | }}} |
| | 399 | |
| | 400 | You can see the up and down behavior here, but that's been fixed, so ignore it for now. (Update the C--!) The unfixed problem is this (some of the other problems were already addressed): we do an unnecessary stack check on entry to this function. We should eliminate the stack check (and by dead code analysis, the GC call) in such cases. |
| | 401 | |
| | 402 | This pattern essentially happens for every function, since we always assign incoming parameters to temporary variables before doing anything. |
| | 403 | |
| | 404 | == Heap and R1 aliasing == |
| | 405 | |
| | 406 | FIXED. Values on the heap and values from R1 don't necessarily clobber |
| | 407 | each other. allocDynClosure seems like a pretty safe bet they |
| | 408 | don't. But is this true in general? ANSWER: Memory writes with |
| | 409 | Hp are always new allocations, so they don't clobber anything. |
| | 410 | |
| | 411 | {{{ |
| | 412 | _s14Y::F64 = F64[_s1uN::I32 + 8]; |
| | 413 | _s152::F64 = F64[_s1uN::I32 + 16]; |
| | 414 | _s156::F64 = F64[_s1uN::I32 + 24]; |
| | 415 | _s15a::F64 = F64[_s1uN::I32 + 32]; |
| | 416 | _s15e::F64 = F64[_s1uN::I32 + 40]; |
| | 417 | _s15i::F64 = F64[_s1uN::I32 + 48]; |
| | 418 | _s15m::F64 = F64[_s1uN::I32 + 56]; |
| | 419 | _c39O::I32 = Hp - 60; |
| | 420 | // calling allocDynClosure |
| | 421 | // allocDynClosure |
| | 422 | I32[Hp - 60] = sat_s1uQ_info; |
| | 423 | F64[Hp - 52] = _s14Y::F64; |
| | 424 | F64[Hp - 44] = _s152::F64; |
| | 425 | F64[Hp - 36] = _s156::F64; |
| | 426 | F64[Hp - 28] = _s15a::F64; |
| | 427 | F64[Hp - 20] = _s15e::F64; |
| | 428 | F64[Hp - 12] = _s15i::F64; |
| | 429 | F64[Hp - 4] = _s15m::F64; |
| | 430 | }}} |
| | 431 | |