Changes between Version 18 and Version 19 of DataParallel/BenchmarkStatus
- Timestamp: 03/04/09 21:03:27 (4 years ago)
DataParallel/BenchmarkStatus
v18  v19
23   23   || DotP, ref Haskell || 100M elements || – || 810 || 437 || 221 || 209 ||
24   24   || DotP, ref C || 100M elements || – || 458 || 235 || 210 || 210 ||
25        || SMVM, primitives || ?? elems, density ?? || || || || || ||
26        || SMVM, vectorised || ?? elems, density ?? || || || || || ||
     25   || SMVM, primitives || 100kx100k @ density 0.001 || 119/119 || 254/254 || 154/154 || 90/90 || 67/67 ||
     26   || SMVM, vectorised || 100kx100k @ density 0.001 || _|_ || _|_ || _|_ || _|_ || _|_ ||
     27   || SMVM, ref C || 100kx100k @ density 0.001 || 46 || – || – || – || – ||
27   28
28   29   All results are in milliseconds, and the triples report best/average/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against `dph-seq` and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads.
29   30
30        ==== Observations regarding DotP ====
     31   ==== Comments regarding DotP ====
31   32
32   33   Performance is memory bound, and hence, the benchmark stops scaling once the memory bus is saturated. As a consequence, the wall-clock execution time of the Haskell programs and the C reference implementation are the same when all available parallelism is exploited. The parallel DPH library delivers the same single-core performance as the sequential one in this benchmark.
     34
     35   ==== Comments regarding smvm ====
     36
     37   There seems to be a fusion problem in DotP with `dph-par` (even if the version of `zipWithSUP` that uses `splitSD/joinSD` is used); hence the much lower runtime for "N=1" than for "sequential". The vectorised version runs out of memory; maybe because we didn't solve the `bpermute` problem yet.
33   38
34   39   === Execution on greyarea (1x UltraSPARC T2) ===
…    …
43   48   || DotP, ref Haskell || 100M elements || – || 934 || 467 || 238 || 117 || 61 || 65 || 36 ||
44   49   || DotP, ref C || 100M elements || – || 554 || 277 || 142 || 72 || 37 || 22 || 20 ||
45        || SMVM, primitives || ?? elems, density ?? || || || || || || || || ||
46        || SMVM, vectorised || ?? elems, density ?? || || || || || || || || ||
     50   || SMVM, primitives || 100kx100k @ density 0.001 || 1112/1112 || 1926/1926 || 1009/1009 || 797/797 || 463/463 || 326/326 || 189/189 || 207/207 ||
     51   || SMVM, vectorised || 100kx100k @ density 0.001 || || || || || || || || ||
     52   || SMVM, ref C || 100kx100k @ density 0.001 || 600 || – || – || – || – || – || – || – ||
47   53
48   54   All results are in milliseconds, and the triples report best/worst execution time (wall clock) of three runs. The column marked "sequential" reports times when linked against `dph-seq` and the columns marked "P=n" report times when linked against `dph-par` and run in parallel using the specified number of parallel OS threads.
49   55
50        ==== Observations regarding DotP ====
     56   ==== Comments regarding DotP ====
51   57
52   58   The benchmark scales nicely up to the maximum number of hardware threads. Memory latency is largely covered by excess parallelism. It is unclear why the Haskell reference implementation "ref Haskell" falls off at 32 and 64 threads.
     59
     60   ==== Comments regarding smvm ====
     61
     62   As on !LimitingFactor, but it scales much more nicely and improves until using four threads per core. This suggests that memory bandwidth is again a critical factor in this benchmark (this fits well with earlier observations on other architectures). Despite the fusion problem with `dph-par`, the parallel Haskell program, using all 8 cores, still ends up three times faster than the sequential C program.
53   63
54   64   ----
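
The "three times faster" remark in the smvm comments for greyarea can be checked against the table rows. The following is a minimal Python sketch, not part of the wiki page. It assumes that the seven parallel columns correspond to P=1, 2, 4, 8, 16, 32 and 64 threads: the column headers are elided from this diff, so that mapping is an assumption, although it agrees with "four threads per core" on 8 cores. The names `SMVM_REF_C_SEQ_MS`, `SMVM_PRIMITIVES_MS` and `speedup` are illustrative only.

```python
# Timings in milliseconds, copied from the greyarea SMVM rows of the diff.
# ASSUMPTION: the parallel columns are P = 1, 2, 4, 8, 16, 32, 64 threads;
# the header row is not visible in this diff.
SMVM_REF_C_SEQ_MS = 600
SMVM_PRIMITIVES_MS = {1: 1926, 2: 1009, 4: 797, 8: 463, 16: 326, 32: 189, 64: 207}


def speedup(baseline_ms, measured_ms):
    """Return how many times faster measured_ms is than baseline_ms."""
    return baseline_ms / measured_ms


# Thread count with the lowest wall-clock time for "SMVM, primitives".
best_threads = min(SMVM_PRIMITIVES_MS, key=SMVM_PRIMITIVES_MS.get)
best_ms = SMVM_PRIMITIVES_MS[best_threads]

# Parallel Haskell (dph-par) at its best versus the sequential C reference.
vs_c = speedup(SMVM_REF_C_SEQ_MS, best_ms)

print(f"best at P={best_threads}: {best_ms} ms, {vs_c:.2f}x faster than ref C")
```

Under that column assumption the minimum falls at 32 threads, that is, four threads on each of the 8 cores, and the ratio to the 600 ms C reference is a little over three.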
