Ticket #5208 (closed feature request: fixed)

Opened 2 years ago

Last modified 2 years ago

Unroll array copy/clone primops in the native and LLVM code generators

Reported by: tibbe Owned by:
Priority: normal Milestone:
Component: Compiler Version: 7.1
Keywords: Cc: johan.tibell@…, dterei
Operating System: Unknown/Multiple Architecture: Unknown/Multiple
Type of failure: None/Unknown Difficulty:
Test Case: Blocked By:
Blocking: Related Tickets:

Description

#4928 added inline primops for copying and cloning arrays. It would be beneficial to create an unrolled loop, instead of a call to memcpy, when the copy is of a statically know small size. I suggest we use an unrolling threshold of 128 bytes, as that's what both GCC and LLVM use for the same optimization.

Attachments

Main.hs Download (244 bytes) - added by dterei 2 years ago.
MemOpsTest.cmm Download (2.3 KB) - added by dterei 2 years ago.
Return.c Download (173 bytes) - added by dterei 2 years ago.
0002-Unroll-memcpy-in-the-X86-backend.patch Download (5.6 KB) - added by tibbe 2 years ago.
0001-Add-test-for-unrolling-memcpy-memset-in-the-x86-back.patch Download (11.7 KB) - added by tibbe 2 years ago.
0001-Use-the-new-memcpy-memmove-memset-MachOps.2.patch Download (5.2 KB) - added by tibbe 2 years ago.

Change History

  Changed 2 years ago by tibbe

  • cc johan.tibell@… added
  • version changed from 7.0.3 to 7.1

Changed 2 years ago by dterei

Changed 2 years ago by dterei

Changed 2 years ago by dterei

  Changed 2 years ago by dterei

OK so you can find a branch here adding MachOp?'s for memcpy, memset and memmove:

 https://github.com/dterei/ghc/tree/memcpy-prim

It uses special LLVM intrinsics in the LLVM backend and just does appropriate calls in the NCG. It should be in working condition and ready to add to GHC basically. Probably should ask Ben Lippmeier to check the SPARC code is fine.

I'm also attaching a test program to this ticket. You can use it like so:

gcc -c Return.c
ghc -c MemOpsTest.cmm
ghc --make Main.hs MemOpsTest.o Return.o
./Main

  Changed 2 years ago by dterei

I'll probably push this stuff to GHC soon but am a little busy right now and may be good to get Ben Lippmeier and Simon Marlow's opinion first

  Changed 2 years ago by simonmar

Looks good to me.

  Changed 2 years ago by tibbe

  • status changed from new to patch

The attached patch makes the array copy primops use the new memcpy/memmmove/memset MachOps. Please review and apply.

Remaining ToDo?: Unroll the MachOps in the x86(-64) backend.

  Changed 2 years ago by tibbe

I've attached a patch that unrolls the primops in the x86 backend, together with a small test. The 0002-Unroll-memcpy-in-the-X86-backend.patch patch depends on 0001-Use-the-new-memcpy-memmove-memset-MachOps.patch. Please review.

follow-up: ↓ 9   Changed 2 years ago by dterei

So it looks fine, here are some comments though:

0002-Unroll-memcpy-in-the-X86-backend.patch

  • '-- TODO: Add movabs instruction and support 64-bit sets.' - Can this be done now rather than later? It doesn't seem like it should take very long. Also I think once you have this done you should be able to unify the go functions of memcpy and memset
  • By saying add the movabs instruction I assume you want to change the memset function to initially store the value to set into a register and then mov that register into each of the memory slots. With this change you could also have memset handle the case where the value to be set is an expression other than a literal.
  • Should 'maxInlineSizeThreshold' be shared rather than duplicated?

0001-Use-the-new-memcpy-memmove-memset-MachOps.patch

Looks fine but would it be worthwhile handling the alignment number in the 'emitMemmoveCall', 'emitMemcpyCall' and 'emitMemsetCall' functions themselves rather than passing it in as argument?

0001-Add-test-for-unrolling-memcpy-memset-in-the-x86-back.patch

I think this testing method works well. I think you should test when the alignment is 8 though as well as 4. Also maybe try to cover a few more cases by using some different sizes and alignments. So create the arrays at size 71 but then test a few times using sizes say 1,64,65,66,67,71 and alignments 1,2,4,8. The unroll code is fairly tricky and so we want an extensive test case.

I'll be happy to help out with any of these points when I have time. Other than extending the test case though I would be happy pushing it to head as it stands.

  Changed 2 years ago by dterei

  • cc dterei added

in reply to: ↑ 7 ; follow-up: ↓ 10   Changed 2 years ago by tibbe

Thanks for reviewing!

Replying to dterei:

'-- TODO: Add movabs instruction and support 64-bit sets.' - Can this be done now rather than later? It doesn't seem like it should take very long. Also I think once you have this done you should be able to unify the go functions of memcpy and memset

The only reason I haven't done so already is that I don't know how to add a new instruction. If someone could give some pointers to all the places that need to be updated after adding a new instruction and whether I need to do something special for 64-bit only instructions I can take a stab at it.

By saying add the movabs instruction I assume you want to change the memset function to initially store the value to set into a register and then mov that register into each of the memory slots. With this change you could also have memset handle the case where the value to be set is an expression other than a literal.

Storing the value in a register is necessary as the mov instructions don't take 64-bit immediates.

Should 'maxInlineSizeThreshold' be shared rather than duplicated?

Probably. I will make it a top-level constant for now. Ideally each architecture should have its own limit and I'd like to see a common place to but architecture specific configuration. The memcpy unrolling optimization should ideally be written once and reused on all architectures. This requires that we abstract over move instructions and the like. This is what LLVM does. It significantly reduces implementation effort.

Looks fine but would it be worthwhile handling the alignment number in the 'emitMemmoveCall', 'emitMemcpyCall' and 'emitMemsetCall' functions themselves rather than passing it in as argument?

I did it this way on purpose as I'd like it to be possible to use the CallishMachOps even when the memory isn't aligned. I think this makes them more generally useful in GHC.

I think this testing method works well. I think you should test when the alignment is 8 though as well as 4. Also maybe try to cover a few more cases by using some different sizes and alignments. So create the arrays at size 71 but then test a few times using sizes say 1,64,65,66,67,71 and alignments 1,2,4,8. The unroll code is fairly tricky and so we want an extensive test case.

I'll try to improve the test case this weekend.

I'll be happy to help out with any of these points when I have time. Other than extending the test case though I would be happy pushing it to head as it stands.

Here's what I suggest: I'll work on improving the tests and fix the minor things. The bigger things (like movabs) are additional optimizations on top of the current ones so I suggest we save those for separate patches.

in reply to: ↑ 9   Changed 2 years ago by dterei

Replying to tibbe:

Probably. I will make it a top-level constant for now. Ideally each architecture should have its own limit and I'd like to see a common place to but architecture specific configuration. The memcpy unrolling optimization should ideally be written once and reused on all architectures. This requires that we abstract over move instructions and the like. This is what LLVM does. It significantly reduces implementation effort.

OK. I'm not sure if it would be worth the effort given we only have two backends (X86 and SPARC) and only really X86 is used.

Here's what I suggest: I'll work on improving the tests and fix the minor things. The bigger things (like movabs) are additional optimizations on top of the current ones so I suggest we save those for separate patches.

Sounds good to me!

Changed 2 years ago by tibbe

Changed 2 years ago by tibbe

  Changed 2 years ago by tibbe

I've addressed the comments that we agreed on and updated the test to be more comprehensive. Please review.

  Changed 2 years ago by dterei

  • status changed from patch to closed
  • resolution set to fixed

Committed:

6c7d2a946a96ed74799cf353f3f62c875f56639b 790063769da85adefa9ad9194e00f69e6ca6fd5c 01c9b2f8ece0a7f1226d0768e811666f792787bc

Note: See TracTickets for help on using tickets.