It should be possible to give --make a -j flag, similar to make's, to tell it to
use multiple processes to build modules. This would allow executables, libraries
and cabal packages to be built faster for people with multiple CPUs.
This seems like a good place to hang my patch to implement ghc --make -jN, which was used for the experiments in the 2005 Haskell Workshop paper on SMP GHC, but almost certainly isn't ready for prime time.
We're not planning this for 6.10. It's more likely that Cabal will get parallel make support first, in which case there's less need for us to tackle this.
Is this a dead ticket? Because I'd love to see it implemented. I'm working on a 32-core machine and compiling large Haskell packages (dozens of modules). A -j flag would make a real difference for me. Granted this isn't the most common case, but I'd expect a significant improvement even for smaller packages on a dual-core machine (and these days, which machine isn't at least that?). Faster builds => happier developers, and all that :-)
Cabal install -j flag solves a different problem. It builds and installs different packages in parallel. This ticket is about GHC being able to build different modules in parallel in a single package, regardless of installing the results.
Can somebody give an idea about the difficulty of this?
From an outsider's view, GHC has a very clear idea about module dependencies and in which order it has to build them so building independent modules in parallel shouldn't be too hard, should it?
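The core of that idea can be sketched as a pure function over the dependency graph. This is only an illustration with toy types, not GHC's real ModuleGraph:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type ModName = String

-- Given each module's direct dependencies and the set of modules that
-- have already finished compiling, list the modules whose dependencies
-- are all done: these are exactly the ones a parallel --make could be
-- compiling simultaneously.
readyModules :: Map.Map ModName [ModName] -> Set.Set ModName -> [ModName]
readyModules deps done =
  [ m
  | (m, ds) <- Map.toList deps
  , m `Set.notMember` done        -- not compiled yet
  , all (`Set.member` done) ds    -- every dependency is compiled
  ]
```

For example, with deps A -> [], B -> [A], C -> [A], D -> [B, C], only A is ready initially; once A finishes, B and C become ready together and can be compiled in parallel.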
I've been working on this for a while, including this year's GSoC. Since the GHC 7.8 feature freeze is imminent, I would like to post a stable subset of my current progress for review and possible inclusion to GHC 7.8. This subset of changes includes the bare minimum required to build multiple modules in parallel, and the parallel upsweep itself. Each patch is for the most part self-explanatory, I hope.
The speedups provided by the parallel upsweep are decent: I can realize a 1.8x speedup when compiling the Cabal library with -j3 -O2, and a 2.4x speedup when compiling 7 independent, relatively large modules with -j3 -O0, for example. The performance/thread ratio seems to peak at -j3 (which instructs the runtime to use 3 capabilities).
The performance of the sequential upsweep is not significantly impacted by these patches: compiling the Cabal library with -O2 takes about 1% longer with the patches than without.
I have not tested these patches on any platform other than x86_64/Linux, but I have no reason to believe that behavior would differ on other platforms. Nonetheless, testing on other platforms is necessary and appreciated.
These changes are well-tested and stable, but there is a single bug that escapes me which can be triggered by the testsuite. When running the testsuite with e.g. EXTRA_HC_OPTS=-j2, there is about a 1/1000 chance that a compiler process will exit with
ghc-stage2: GHC.Event.Manager.loop: state is already Finished
On a separate machine, I don't ever trigger the aforementioned bug, but instead the compiler process on rare occasion never exits. The testsuite script eventually kills the process, causing the particular test to fail with 'exit code non-0'.
These are likely bugs in the IO manager triggered by the changing of the number of capabilities at runtime. If I instead explicitly set the number of capabilities at startup with EXTRA_HC_OPTS=-j2 +RTS -N2 -RTS then neither bug manifests. I don't yet understand the IO manager well enough to fix this issue though.
Other than that though, this feature Just Works. One simply has to pass the -jN flag to GHC and the build will just finish faster.
Questions or comments? I have likely not explained everything I ought to explain.
I have not reviewed in detail -- I hope Simon Marlow may find time to do so. But there are some tricky corners, and I'd love a bit more by way of comments, especially Notes along the lines of Commentary/CodingStyle. Think: how easy will it be for someone else to understand and maintain this in 5 years' time? (I used github to add comments in a couple of places.)
Thanks for doing the work on this! Very exciting. I for one will start testing it.
I started reading the code a bit and have one question. But first, thanks for producing readable, well-documented code. My question has to do with the banal issue of printing stuff out in parallel, which can often be quite ugly.
I like that the normal compilation output is directed to a per-module TQueue. But what about printed exceptions when they occur?
I see that you've got three "[g]bracket" calls in the parallel upsweep, and that you take care to kill worker threads (asynchronously) when an exception occurs. However, I don't see a general catch-all for exceptions at the top of each worker thread (such as in Control.Async). In my experience exceptions on child threads can be a real pain. For example, if a worker thread dies... it looks like other threads may be blocked indefinitely waiting on the result MVar?
Thanks Simon. I will attempt to comment the changes more thoroughly, following the coding style you mentioned. I will update the user manual as well.
Ryan: I don't see where a stray exception could occur. The call to parUpsweep_one is already guarded by a try, and asynchronous exceptions are masked throughout the rest of the worker body, which doesn't seem to do anything that could itself throw an exception.
So it seems to me that the exception handling is already fairly tight: a worker thread should always exit gracefully. I think.
I was expecting something right after the fork. But it looks like the only code that happens outside of that "try" is the newIORef, putMVar, and writeLogQueue... reasonably safe.
Regarding printing, I guess ErrUtils.errorMsg is thread-safe? What I'm thinking about is an error message from the compiler that gets barfed out simultaneously with other normal compiler output from other threads.
errorMsg just calls log_action which will append the stringified exception to the module's LogQueue as with any other compile message, so it should be OK.
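The output discipline described here can be illustrated with a small standalone sketch. It uses a plain Chan from base as a stand-in for GHC's actual LogQueue type (an assumption; the real implementation differs), but the invariant is the same: each worker writes only to its own queue, and a single consumer drains the queues in module order, so output from different modules never interleaves:

```haskell
import Control.Concurrent
import Control.Monad (forM_)

-- Each worker logs only to its own queue; Nothing marks end of log.
compileWorker :: Chan (Maybe String) -> String -> IO ()
compileWorker q modName = do
  writeChan q (Just ("compiling " ++ modName))
  writeChan q (Just ("done " ++ modName))
  writeChan q Nothing

-- A single consumer drains the queues in module order, so messages
-- from different modules cannot interleave on the terminal.
drainInOrder :: [Chan (Maybe String)] -> IO [String]
drainInOrder = fmap concat . mapM drainOne
  where
    drainOne q = do
      msg <- readChan q
      case msg of
        Nothing -> return []
        Just s  -> fmap (s :) (drainOne q)

main :: IO ()
main = do
  let mods = ["M1", "M2"]
  qs <- mapM (const newChan) mods
  forM_ (zip qs mods) $ \(q, m) -> forkIO (compileWorker q m)
  drainInOrder qs >>= mapM_ putStrLn
```

An exception message routed through log_action would simply become one more entry in its module's queue, which is why the scheme stays safe under errors.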
Most of these patches LGTM, but I haven't reviewed the parallel upsweep patch itself very closely yet. I left a few comments on the others, mostly echoing Simon about some additional documentation, and dead code removal.
This looks like it can easily make the 7.8.1 window, however.
This is great! It is probably out of scope for the GSoC, but I'd like to mention:
When you have modules in your project that take a very long time to compile and have many other modules depending on them, it is useful to re-use information about how long each module took to compile last time. This way we can build the dependencies leading towards these "blocker modules" as early as possible.
parcs, you probably have a good overview on how GHC builds things now. Do you think the current state would make it possible to save and re-use such timing information?
I'm impressed, it looks like you've done a great job. Well done.
The parallel upsweep itself would look much nicer written using ParIO from monad-par, but that's something for the future.
Take a careful look at reTypecheckLoop, I'm not sure it's correct (see my inline comment).
The FastString changes need some more commentary, as pointed out by others. There are good comments in the parallel upsweep patch though.
You've obviously been careful to minimize the impact on sequential compilation performance, which is great.
The parallel IO manager bug needs to be fixed before we can merge the patch though. We can't ship it with a bug that causes random compilation failure.
Aside from the issues above, I'm happy with the patch.
nh2: That's an interesting idea. We would just have to persist the timing information somehow (through the interface file maybe) and implement a smart semaphore (replacing QSem) that wakes up the module that would result in the shortest compile time. It certainly sounds possible, at least.
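As a purely illustrative sketch of such a policy (the names and the longest-first heuristic are my assumptions, not a settled design): among the modules currently ready to compile, start the one with the longest recorded compile time from the previous build, so that blocker modules begin as early as possible:

```haskell
import qualified Data.Map as Map
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Persisted timings from the previous build (module name -> seconds).
-- Among the modules currently ready to compile, start the historically
-- slowest one first; modules with no recorded timing default to 0.
pickNext :: Map.Map String Double -> [String] -> Maybe String
pickNext _ [] = Nothing
pickNext recorded ready = Just (maximumBy (comparing cost) ready)
  where
    cost m = Map.findWithDefault 0 m recorded
```

A smarter version would rank by critical-path length through the dependency graph rather than by a module's own compile time alone, but the plumbing would look much the same.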
Simon: Import loops were indeed not handled correctly. I managed to work out a solution that I hope is understandable. It involves augmenting a module's explicit textual dependencies with the implicit dependencies that arise from module loops. Let me know what you think.
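The idea can be sketched in isolation (simplified types; GHC's real representation of module loops differs): whenever a module's textual imports touch any member of a loop, the whole loop becomes an implicit dependency, because a loop can only be compiled as a unit:

```haskell
import Data.List (nub)

-- Augment a module's explicit textual dependencies with the implicit
-- dependencies arising from module loops: if any dependency is a member
-- of a loop (an SCC of mutually recursive modules), the entire loop
-- must be finished before this module can be compiled.
augmentDeps :: [[String]]   -- known module loops
            -> [String]     -- explicit textual dependencies
            -> [String]     -- explicit plus implicit dependencies
augmentDeps loops deps =
  nub (deps ++ concat [ loop | loop <- loops, any (`elem` loop) deps ])
```

So with a loop {A, B}, a module importing only B would implicitly depend on A as well, while a module importing neither is unaffected.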
Other changes I made:

* removed the BinIO constructor, as suggested
* commented the FastString implementation more thoroughly, as suggested
* revised one of the thread-safety changes: originally, I changed newUnique and newUniqueSupply to atomically update the env_us var. But the only reason this was necessary is that the env_us var was shared among interleaved threads created by forkM. So instead of making sure to atomically update this var, I think it is more sensible not to share the env_us var among interleaved threads. This solution should in theory be more efficient as well, as multiple threads no longer potentially contend on the same env_us var.
* enabled buffering of stdout/stderr when compiling modules via GHCi
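The unique-supply change can be illustrated with a toy splittable supply (a stand-in of my own construction, not GHC's real UniqSupply): rather than contending on a shared env_us var, each forked thread takes one branch of an infinite tree whose subtrees hand out disjoint uniques:

```haskell
-- An infinite binary tree of uniques. A node carrying value n with
-- stride s gives its children the values n+s and n+2s with stride 2s,
-- so the two subtrees enumerate disjoint sets of integers.
data Supply = Supply
  { supplyUnique :: Int
  , supplyLeft   :: Supply
  , supplyRight  :: Supply
  }

mkSupply :: Supply
mkSupply = go 0 1
  where
    go n step = Supply n (go (n + step)     (2 * step))
                         (go (n + 2 * step) (2 * step))

-- Take k uniques by walking the leftmost spine of a supply.
takeUniques :: Int -> Supply -> [Int]
takeUniques k s = take k (spine s)
  where spine (Supply n l _) = n : spine l
```

Each forkM'd thread would receive its own branch (supplyLeft or supplyRight), so no synchronization is needed at all to keep uniques distinct across threads.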
Please make a second pass over all the commits, as I did a lot of rebasing and fixing up, and I might have missed something stupid.
nh2: That's an interesting idea. We would just have to persist the timing information somehow (through the interface file maybe)
I don't think the interface file is a good place: I tried to build/improve Haskell build systems in the last months, and interface files not being generated identically for identical inputs was always a problem (e.g. in #8144 (closed)). A dedicated file for this would probably work just as well.
The changes to handle loops look OK to me, but I would test it on a GHC build to be sure. You want to build the whole of ghc/compiler with --make -O2; the build system doesn't do this so you have to make a command line by hand.
By the way, I've had various hard-to-pin-down problems with changing capabilities at runtime myself.
Is it possible in this case to just disable the use of that feature for the initial release? That is, you would have to use +RTS -N to get real speedup. But, hey, it's kind of an advanced compilation feature anyway. In this scenario we could get a lot of testing experience with parallel builds without setNumCapabilities and then combine them when there is higher confidence.
By the way, I've had various hard-to-pin-down problems with changing capabilities at runtime myself.
Is it possible in this case to just disable the use of that feature for the initial release? That is, you would have to use +RTS -N to get real speedup. But, hey, it's kind of an advanced compilation feature anyway. In this scenario we could get a lot of testing experience with parallel builds without setNumCapabilities and then combine them when there is higher confidence.
I don't mind going that route if the issue doesn't get sorted out soon. A subsequent point release of GHC could reinstate the feature (automatically setting the number of capabilities for the user), then. But I think I could fix it on time.
The changes to handle loops look OK to me, but I would test it on a GHC build to be sure. You want to build the whole of ghc/compiler with --make -O2; the build system doesn't do this so you have to make a command line by hand.
At first I couldn't even get GHC to compile itself via --make -O2 without -j (see #8184 (closed)). Since that's been fixed I was able to build GHC via --make -O2 -j after a minor tweak in the code, so the loop handling should be solid now.
Ok, I'm trying to get a decent set of libraries installed to test this well. The very first thing I cabal installed ('text') did get a small speedup.
However, I'm also seeing some excessive system time. This may have nothing to do with the parallel make approach and just be a function of the new IO manager. In fact, if I understand the parallel make design, worker threads should either be running or blocked on MVars. (That's good for avoiding wasted user time as well, unlike work-stealing which burns cycles looking for work.)
I'm running on a 32-core Intel Westmere machine, using this command to install text version 0.11.3.1:
time cabal install text --ghc-options="-j24" --reinstall
Notice that in this simple test I am relying on the setNumCapabilities behavior. Though a quick check confirms that I get the same times with +RTS -N added. Here are the times:
* 1 thread:   real 1m20.028s   user 1m17.921s   sys 0m1.768s
* 2 threads:  real 1m7.417s    user 1m22.818s   sys 0m14.891s
* 4 threads:  real 0m59.528s   user 1m29.110s   sys 0m37.981s
* 8 threads:  real 0m57.219s   user 1m54.461s   sys 1m31.703s
* 16 threads: real 1m6.225s    user 4m46.976s   sys 3m32.661s
* 24 threads: real 1m16.501s   user 9m53.254s   sys 6m3.375s
* 31 threads: real 1m27.445s   user 17m0.314s   sys 8m0.175s
Well, it's nice that final sequential time is not much worse than the one-threaded time!
Have you tested cabal -j on this branch? They *should* compose fine, and I'd love to see what kind of speedup one can get installing the Haskell Platform packages.
Unfortunately, right now I get a whole bunch of undefined reference errors when I try something like this on 4880dfaeafec1fc65568a5445a70ec4286949123:
time cabal-1.18.0 install -j30 --disable-library-profiling --disable-documentation HUnit --reinstall
Btw, I see the same problem with cabal 1.16.0.2. But taking away the -j30 makes it work. I do NOT have the same problem on master (8c99e698) presently, nor on earlier versions of HEAD from 2013.08.04. (Yet, this might have nothing to do with parcs patches, of course. There are another ~46 patches that are on master but not on the ghc-parmakegsoc branch. Fast forwarding the branch will be my next step.)
Notice that, weirdly, this problem occurs even when ghc -j is not used (as above).
For reference here is a sample of the errors:
cabal-1.17.0-HEAD-20130802: Error: some packages failed to install:
HUnit-1.2.5.2 failed during the configure step. The exception was:
user error
(/home/beehive/ryan_scratch/ghc-parGSOC2/libraries/Cabal/Cabal/dist-install/build/libHSCabal-1.18.0.a(Simple__64.o):
In function `chbT_info':
(.text+0x857): undefined reference to `rff5_info'
/home/beehive/ryan_scratch/ghc-parGSOC2/libraries/Cabal/Cabal/dist-install/build/libHSCabal-1.18.0.a(Simple__64.o):
In function `sg4L_info':
...
Have you tested cabal -j on this branch? They *should* compose fine, and I'd love to see what kind of speedup one can get installing the Haskell Platform packages.
Right now this may result in more processes being created than desired. We're working on solving this issue (though the patches won't be ready for 1.18).
Thanks for testing. Most of the time each thread will be either blocked on an MVar or compiling (or figuring out what to block on next), so I'm not sure where the excessive system time is coming from but I assume it's either from the RTS or the IO manager.
I haven't tested cabal with my branch but I don't think my changes are causing the errors you're experiencing. I'm willing to bet that it's due to recent changes pushed to master. (Or maybe you forgot to do git submodule update after checking out my branch?)
akio:
#8209 (closed) is most likely what's being triggered, although I'm not positive.
Is there some way to detect whether GHC has support for parallel --make? E.g. using ghc --info? I'd prefer to use that instead of conditioning on the version.
I haven't tested cabal with my branch but I don't think my changes are causing the errors you're experiencing. I'm willing to bet that it's due to recent changes pushed to master. (Or maybe you forgot to do git submodule update after checking out my branch?)
I'm pretty sure I updated the submodules, but I'll do a fresh build to double check. Except for the excessive system-time the parallel make actually works for me, as long as I don't do cabal -j. I can even install packages with parallel GHC / non-parallel cabal.
Could I get a fix on what other people who are testing this branch are seeing? Are you able to install packages with cabal -j?
Is there some way to detect whether GHC has support for parallel --make? E.g. using ghc --info? I'd prefer to use that instead of conditioning on the version.
What are the advantages of adding an entry to ghc --info over conditioning on the version, in this case?
GHC no longer deadlocks from its use of setNumCapabilities (see #8209 (closed)) but there is still another related issue: GHC sometimes prints
ghc-stage2: GHC.Event.Manager.loop: state is already Finished
What are the advantages of adding an entry to ghc --info over conditioning on the version, in this case?
Right now this is mostly for convenience: to enable supporting both the master version of GHC 7.7 and the parmake branch. In the future this will be less important, unless GHC HQ opts to disable parallel --make on some platforms for some reason.

Implementing this is trivial: just add a ("Supports parallel --make", "YES") tuple to the list returned by compiler/main/DynFlags.compilerInfo. I can write a patch myself if you want.
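If that entry existed (hypothetical at this point; it is only a proposal), a build tool could detect the feature easily, since ghc --info prints a Haskell-readable [(String, String)]:

```haskell
-- Parse the output of `ghc --info` (a Haskell-readable association
-- list) and look for the proposed key. The key name
-- "Supports parallel --make" is the one suggested above; it is not an
-- entry that real GHC emits today.
supportsParallelMake :: String -> Bool
supportsParallelMake infoOutput =
  case lookup "Supports parallel --make" entries of
    Just "YES" -> True
    _          -> False
  where
    entries = read infoOutput :: [(String, String)]
```

A tool would feed it the captured output of ghc --info, e.g. obtained via readProcess "ghc" ["--info"] "".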
Right now this is mostly for convenience: to enable supporting both the master version of GHC 7.7 and the parmake branch. In the future this will be less important, unless GHC HQ opts to disable parallel --make on some platforms for some reason.

Implementing this is trivial: just add a ("Supports parallel --make", "YES") tuple to the list returned by compiler/main/DynFlags.compilerInfo. I can write a patch myself if you want.
Yes, I'm working on this, though I had some delays. One problem is that we've already released Cabal 1.18, so this feature will have to go into Cabal 1.19.
Do I understand correctly that right now one must pass +RTS -Nn to GHC if one wants it to use n OS threads?
Yes, I'm working on this, though I had some delays. One problem is that we've already released Cabal 1.18, so this feature will have to go into Cabal 1.19.
Okay, great.
Do I understand correctly that right now one must pass +RTS -Nn to GHC if one wants it to use n OS threads?
Nope, passing +RTS -Nn is not necessary. The number of capabilities will be set at runtime according to the -jn flag.
I want to raise one more issue on this topic. Nowadays most people implicitly call ghc under layers of tooling (stack, cabal, ghc-mod, haskell-mode, etc).
For example, with stack, at this particular moment in time I can't find a place to stick -j3 such that all my builds on a machine will see it.
Module-level parallelism is an internal implementation issue, and shouldn't change semantics. Thus I think it would be safe for it to be optionally controlled by an environment variable. If I could just set "GHC_MAKE_CORES=3" on a given system, then I could get the desired behavior irrespective of what tools I'm using on top of GHC.
@rrnewton: That may or may not be a good idea. Since tools like stack already parallelize at the package level with their own -j, the best results can probably be achieved when such tools are aware of GHC's -j and use it deliberately (you'd probably not want full -j for both stack and GHC, as that would give you N*N threads and a lot of overhead). So it may be worth pushing for these tools to use GHC's -j instead.
Also, if there was a GHC_MAKE_CORES env var, and -j was also given, which one would take precedence?
To answer one of your questions directly, most of these tools accept a --ghc-options flag, e.g. you can do stack build --ghc-options="-j4".
I think another reason why not all tools use ghc -j is because of #9221.
For me, typical stack builds spend a huge amount of time bottlenecked on just one or two ghc compiles. Yes, we could push towards stack and other tools dynamically provisioning cores to GHC instances. But in the short term GHC doesn't actually get decent parallel scaling anyway (#9221), so I really just want -j2 or -j3. And that limited level of oversubscription I'm not very worried about. (The overall builds spend so little of their time with decent parallel utilization anyway.)
Do we have a policy for priorities of env vars versus flags in general? I think command line flags should take precedence. Then any smart tools that know how to manage "ghc -j" can override any system wide default.
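A precedence rule like the one proposed here (flag beats environment, environment beats the default) is easy to state precisely. GHC_MAKE_CORES is the hypothetical variable from this thread, not something GHC actually reads:

```haskell
import System.Environment (lookupEnv)
import Text.Read (readMaybe)
import Data.Maybe (fromMaybe)

-- Resolve the job count: an explicit -jN flag takes precedence; failing
-- that, consult the (hypothetical) GHC_MAKE_CORES variable; failing
-- that, build sequentially.
resolveJobs :: Maybe Int -> IO Int
resolveJobs (Just n) = return n
resolveJobs Nothing  = do
  mVal <- lookupEnv "GHC_MAKE_CORES"
  return (fromMaybe 1 (mVal >>= readMaybe))
```

A malformed value (e.g. GHC_MAKE_CORES=junk) falls through readMaybe and yields the sequential default, which matches the "environment variables should never break a build" spirit of the proposal.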
Wearing my GHC developer's hat: I don't like environment variables controlling the compiler's behavior, since they make it harder to reproduce or identify issues. It's very helpful if I can tell the user to run cabal build -v, extract the ghc command line, and know that everything affecting ghc's operation is specified on that command line or in the input files. Implicit global state is bad, right?
Of course, in reality there is already implicit global state (versions of installed packages, versions of C compiler/CPP/LLVM/etc.), but the less the better. Compare for example Environment Variables Affecting GCC, which roughly amounts to LANG, TMPDIR and where to look for headers and libraries.
In principle -jN should not affect ghc's behavior, and then I would not really be against adding an environment variable for it, but in practice it does: see #9370, which I think is still unfixed, not to mention the possibility of bugs in -jN itself.
Wearing my GHC user's hat: I don't really see the need for an environment variable here. Doesn't the stack build --ghc-options="-j4" that nh2 mentioned solve your immediate problem? And using this environment variable doesn't strike me as the Right Solution to any problem: if you are invoking ghc directly then you don't need the environment variable, while if you are invoking it from a larger build system then ideally you would make the build system aware of the amount of parallelism ghc is using, and at that point you might as well make the build system aware of the -jN flag as well.
In summary, I don't think there is a good case for adding such an environment variable.