| | 141 | === GHC Heap Check (case merging) === |
| | 142 | |
| | 143 | '''Following is from Roman Leshchinskiy''' |
| | 144 | |
| | 145 | I investigated heap check a bit more and it seems to me that it's largely |
| | 146 | GHC's fault. LLVM does do loop unswitching which correctly pulls out |
| | 147 | loop-invariant heap checks but that happens fairly late in its pipeline |
| | 148 | and heap checks interfere with optimisations before that. |
| | 149 | |
| | 150 | However, we really shouldn't be generating those heap checks in the first |
| | 151 | place. Here is a small example loop: |
| | 152 | |
| | 153 | {{{ |
| | 154 | foo as n = loop 0# 0.0## |
| | 155 | where |
| | 156 | loop i x |
| | 157 | | i >=# n = (# (), D# x #) |
| | 158 | | otherwise = loop (i +# 1#) (x *## indexDoubleArray# as i) |
| | 159 | }}} |
| | 160 | |
| | 161 | This is the C-- that GHC generates: |
| | 162 | |
| | 163 | {{{ |
| | 164 | sep_ret() |
| | 165 | ceO: |
| | 166 | Hp = Hp + 12; |
| | 167 | if (Hp > I32[BaseReg + 92]) goto ceT; |
| | 168 | |
| | 169 | if (%MO_S_Ge_W32(R1, I32[Sp + 16])) goto ceX; |
| | 170 | |
| | 171 | _seA::I32 = R1 + 1; |
| | 172 | F64[Sp + 0] = %MO_F_Mul_W64(F64[Sp + 0], F64[I32[Sp + 12] + ((R1 << 3) + 8)]); |
| | 173 | R1 = _seA::I32; |
| | 174 | Hp = Hp - 12; |
| | 175 | jump sep_info (); |
| | 176 | ceT: |
| | 177 | I32[BaseReg + 112] = 12; |
| | 178 | goto ceR; |
| | 179 | ceR: |
| | 180 | I32[Sp + 8] = sep_info; |
| | 181 | I32[BaseReg + 32] = 131327; |
| | 182 | jump stg_gc_ut (); |
| | 183 | ceX: |
| | 184 | I32[Hp - 8] = GHC.Types.D#_con_info; |
| | 185 | F64[Hp - 4] = F64[Sp + 0]; |
| | 186 | I32[Sp + 16] = Hp - 7; |
| | 187 | R1 = GHC.Unit.()_closure+1; |
| | 188 | Sp = Sp + 16; |
| | 189 | jump (I32[Sp + 4]) (); |
| | 190 | }}} |
| | 191 | |
| | 192 | Note how in each loop iteration, we add 12 to Hp, then do the heap check |
| | 193 | and then subtract 12 from Hp again. I really don't think we should be |
| | 194 | generating that and then relying on LLVM to optimise it away. |
| | 195 | |
| | 196 | This happens because GHC commons up heap checks for case alternatives and |
| | 197 | does just one check before evaluating the case. The relevant comment from |
| | 198 | CgCase.lhs is this: |
| | 199 | |
| | 200 | A more interesting situation is this: |
| | 201 | |
| | 202 | {{{ |
| | 203 | !A!; |
| | 204 | ...A... |
| | 205 | case x# of |
| | 206 | 0# -> !B!; ...B... |
| | 207 | default -> !C!; ...C... |
| | 208 | }}} |
| | 209 | |
| | 210 | where !x! indicates a possible heap-check point. The heap checks |
| | 211 | in the alternatives '''can''' be omitted, in which case the topmost |
| | 212 | heapcheck will take their worst case into account. |
| | 213 | |
| | 214 | This certainly makes sense if A allocates. But with vector-based code at |
| | 215 | least, a lot of the time neither A nor C will allocate '''and''' C will |
| | 216 | tail-call A again so by pushing the heap check into !A!, we are now doing |
| | 217 | it '''in''' the loop rather than at the end. |
| | 218 | |
| | 219 | It seems to me that we should only do this if A actually allocates and |
| | 220 | leave the heap checks in the alternatives if it doesn't (perhaps we could |
| | 221 | also use a common heap check if '''all''' alternatives allocate). I tried to |
| | 222 | hack this and see what happens but found the code in CgCase and friends |
| | 223 | largely incomprehensible. What would I have to change to implement this |
| | 224 | (perhaps controlled by a command line flag) and is it a good idea at all? |