# perf

[![img](https://img.shields.io/hackage/v/perf.svg)](https://hackage.haskell.org/package/perf) [![img](https://github.com/tonyday567/perf/workflows/haskell-ci.yml/badge.svg)](https://github.com/tonyday567/perf/actions)

# Introduction

`perf` provides some ideas, code and a library for low-level performance measurement for Haskell hacking. The library:

- provides a monad transformer, `PerfT`, as a light-weight wrapper for use on existing code. `PerfT` modifications can be included in a code base, as opposed to performance measurement being a separate code base and process, and any measurement effects can be erased at compile time with `evalPerfT` (see the sketch after this list).
- focuses on fast and accurate measurement.
- is polymorphic in what, exactly, is being measured, so that concepts such as counters, debug checks, time and space performance can share treatment.
- can measure big O for algorithms that can be defined in terms of input size growth.
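As a minimal sketch of the first point, the same instrumented expression can be run measured or with measurement erased. It uses `fap`, `time`, `runPerfT` and `evalPerfT`, all of which appear later in this document; it is an illustration, not a recommended pattern:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Perf

main :: IO ()
main = do
  -- measured: returns the result together with a Map of labelled timings
  (x, measures) <- runPerfT time (fap "sum" sum [1 .. 1000 :: Int])
  print (x, measures)
  -- erased: the same instrumented code, with measurement stripped out
  x' <- evalPerfT time (fap "sum" sum [1 .. 1000 :: Int])
  print x'
```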
# Setup

Note that running perf.org is very slow compared with an external process which accesses the compiled version of the library.

```
:r
:set -Wno-type-defaults
:set -Wno-unused-do-bind
:set -Wno-name-shadowing
:set -XOverloadedStrings
:set -XOverloadedLabels
import Perf
import Data.FormatN
import qualified Data.Text as Text
import qualified Data.Text.IO as Text
import qualified Data.Map.Strict as Map
import Control.Monad
import Data.Bifunctor
import System.Clock
putStrLn "ok"
```

```
[ 1 of 10] Compiling Perf.Stats ( src/Perf/Stats.hs, interpreted ) [Source file changed]
[ 3 of 10] Compiling Perf.Time ( src/Perf/Time.hs, interpreted ) [Source file changed]
[ 6 of 10] Compiling Perf.Measure ( src/Perf/Measure.hs, interpreted ) [Source file changed]
[ 7 of 10] Compiling Perf.Report ( src/Perf/Report.hs, interpreted ) [Source file changed]
[ 8 of 10] Compiling Perf.BigO ( src/Perf/BigO.hs, interpreted ) [Perf.Stats changed]
[10 of 10] Compiling Perf ( src/Perf.hs, interpreted ) [Perf.BigO changed]
Ok, 10 modules reloaded.
ok
```

# System.Clock

The default clock is MonotonicRaw for Linux & macOS, and ThreadCPUTime for Windows.

## resolution

```
getRes Monotonic
getRes Realtime
getRes ProcessCPUTime
getRes ThreadCPUTime
getRes MonotonicRaw
```

```
TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 1000}
TimeSpec {sec = 0, nsec = 42}
TimeSpec {sec = 0, nsec = 42}
```

# Time

## What is a tick?

A fundamental operation of Perf.Time is `tick`, which sandwiches a (strict) function application between two readings of a clock, and returns the time in nanoseconds together with the computation result. In this way, the `Perf` monad can be inserted into the midst of a computation in an attempt to measure performance in situ, as opposed to sitting off in a separate and decontextualised process.

```
:t tick
```

```
tick :: (a -> b) -> a -> IO (Nanos, b)
```

`tick` returns in the IO monad, because reading a clock is an IO effect. A trivial but fundamental point is that performance measurement affects the computation being measured.
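The idea behind `tick` can be sketched with the clock package directly. `tickSketch` is a hypothetical helper, not perf's actual implementation, which also deals with clock choice and evaluation strategies such as the `tickWHNF`, `tickForce` and `tickLazy` variants seen below:

```haskell
import Control.Exception (evaluate)
import System.Clock (Clock (MonotonicRaw), diffTimeSpec, getTime, toNanoSecs)

-- Two clock reads around a forced function application: a sketch of
-- what tick does, returning elapsed nanoseconds and the result.
tickSketch :: (a -> b) -> a -> IO (Integer, b)
tickSketch f a = do
  t0 <- getTime MonotonicRaw
  b <- evaluate (f a) -- force to WHNF inside the timing window
  t1 <- getTime MonotonicRaw
  pure (toNanoSecs (diffTimeSpec t1 t0), b)
```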
## tick_

`tick_` measures the nanoseconds between two immediate clock reads.

```
:t tick_
```

```
tick_ :: IO Nanos
```

```
replicateM 10 tick_
```

```
[1833,500,416,416,416,375,375,416,416,416]
```

## multiple ticks

```
fmap (fmap (fst)) . replicateM 10 $ tick (const ()) ()
```

```
[7000,2333,2000,2208,1958,1959,1959,2000,2000,1959]
```

Here, `const () ()` was evaluated and took 7 microseconds for the first effect, reducing to about 2 microseconds for subsequent effects.

## tickIO

`tickIO` measures the evaluation of an IO value.

```
:t tickIO
```

```
tickIO :: IO a -> IO (Nanos, a)
```

```
fmap (fmap fst) . replicateM 10 $ tickIO (pure ())
```

```
[5541,1625,1458,1833,1375,1416,1375,1375,1375,1375]
```

## sum example

```
fmap (expt (Just 2) . fromIntegral) . fst <$> ticks 10 sum ([1..10000] :: [Double])
```

```
["5.0e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.4e5","2.5e5","2.4e5"]
```

```
ts <- ticks 10000 sum ([1..1000] :: [Double])
print $ average (fmap fromIntegral $ fst ts)
```

```
10747.1975
```

# PerfT

`PerfT` allows for multiple measurement points and is polymorphic in what is being measured. It returns a Map of results held in State.

Compare a lower-level usage of ticks, measuring the average of summing to one thousand over one thousand trials:

```
first (average . fmap fromIntegral) <$> ticks 1000 sum [1..1000]
```

```
(25947.635,500500)
```

… with PerfT usage:

```
second (fmap (average . fmap fromIntegral)) <$> runPerfT (times 1000) (sum |$| [1..1000])
```

```
(500500,fromList [("",26217.098)])
```

Comparing the performance of sum versus a list fusion approach:

```
fmap (average . fmap fromIntegral) <$> (execPerfT (times 1000) $ do; (fap "sum" sum [1..1000]); (fap "fusion" (\x -> sum [1..x]) 1000))
```

```
fromList [("fusion",32871.248),("sum",26924.128)]
```

An IO example:

```haskell
exampleIO' :: IO ()
exampleIO' = do
  txt <- Text.readFile "src/Perf.hs"
  let n = Text.length txt
  Text.putStrLn $ "length of file is: " <> Text.pack (show n)

exampleIO = execPerfT time (do
  txt <- fam "file_read" (Text.readFile "src/Perf.hs")
  n <- fap "length" Text.length txt
  fam "print_result" (Text.putStrLn $ "length of file is: " <> Text.pack (show n)))
```

```
perf-explore --exampleIO
```

```
length of file is: 1794
length of file is: 1794
```

| label1 | label2       | label3 | old result | new result | change      |
|--------|--------------|--------|------------|------------|-------------|
| normal | file-read    | time   | 2.31e5     | 1.28e5     | improvement |
| normal | length       | time   | 2.71e3     | 2.00e3     | improvement |
| normal | print-result | time   | 3.75e4     | 1.32e4     | improvement |
| outer  | file-read    | time   | 6.05e4     | 3.64e4     | improvement |
| outer  | length       | time   | 9.59e2     | 6.25e2     | improvement |
| outer  | outer-total  | time   | 7.39e4     | 4.02e4     | improvement |
| outer  | print-result | time   | 9.79e3     | 1.71e3     | improvement |

# perf-explore

`perf-explore` contains some exploratory routines used to develop `perf`.

```
perf-explore --help
```

```
examples of perf usage

Usage: perf-explore [-n|--runs ARG]
                    [--Monotonic | --Realtime | --ProcessCPUTime |
                      --ThreadCPUTime | --MonotonicRaw]
                    [--best | --median | --average]
                    [--time | --space | --spacetime | --allocation | --count]
                    [-g|--golden ARG] [--nocheck] [-r|--record]
                    [--header | --noheader] [--error ARG] [--warning ARG]
                    [--improved ARG]
                    [--sums | --lengths | --nub | --clocks | --examples |
                      --example | --exampleIO | --noops | --ticks]
                    [-l|--length ARG]
                    [--sumFuse | --sum | --lengthF | --constFuse | --mapInc |
                      --noOp]

  perf exploration

Available options:
  -n,--runs ARG            number of runs to perform
  --best                   report upper decile
  --median                 report median
  --average                report average
  --time                   measure time performance
  --space                  measure space performance
  --spacetime              measure both space and time performance
  --allocation             measure bytes allocated
  --count                  measure count
  -g,--golden ARG          golden file name
  --nocheck                do not check versus the golden file
  -r,--record              record the result to the golden file
  --header                 include headers
  --noheader               dont include headers
  --error ARG              error level
  --warning ARG            warning level
  --improved ARG           improved level
  --sums                   run on sum algorithms
  --lengths                run on length algorithms
  --nub                    nub test
  --clocks                 clock test
  --examples               run on example algorithms
  --example                run on the example algorithm
  --exampleIO              exampleIO test
  --noops                  noops test
  --ticks                  tick test
  -l,--length ARG          length of list
  --sumFuse                fused sum pipeline
  --sum                    sum
  --lengthF                foldr id length
  --constFuse              fused const pipeline
  --mapInc                 fmap (+1)
  --noOp                   const ()
  -h,--help                Show this help text
```

```
fmap averageI <$> execPerfT (times 10000) (sum |$| [1..1000])
```

```
fromList [("",136055.5594)]
```

The equivalent perf-explore invocation is:

```
perf-explore -n 10000 -l 1000 --sum --nocheck
```

| label1 | label2 | results |
|--------|--------|---------|
| sum    | time   | 6.32e3  |

## noops

This no-op experiment is useful for understanding the pure time performance of the machinery around measurement. It can be (re)run with:

```
perf-explore --noops
```

| label1 | label2  | label3 | old result | new result | change      |
|--------|---------|--------|------------|------------|-------------|
| const  | 1st     | time   | 1.72e4     | 8.79e3     | improvement |
| const  | 2nd     | time   | 2.09e2     | 1.25e2     | improvement |
| const  | 3rd     | time   | 1.66e2     | 1.25e2     | improvement |
| const  | 4th     | time   | 2.08e2     | 8.30e1     | improvement |
| const  | average | time   | 2.08e2     | 1.10e2     | improvement |
| const  | best    | time   | 1.31e2     | 6.31e1     | improvement |
| const  | median  | time   | 1.60e2     | 7.76e1     | improvement |
| pure   | 1st     | time   | 1.00e3     | 1.25e2     | improvement |
| pure   | 2nd     | time   | 1.67e2     | 8.30e1     | improvement |
| pure   | 3rd     | time   | 1.66e2     | 8.30e1     | improvement |
| pure   | 4th     | time   | 1.25e2     | 4.20e1     | improvement |
| pure   | average | time   | 1.85e2     | 8.29e1     | improvement |
| pure   | best    | time   | 1.31e2     | 6.37e1     | improvement |
| pure   | median  | time   | 1.63e2     | 7.79e1     | improvement |

## measurement context

Exploration of how the code surrounding measurement affects performance. Note the flat `tickLazy` column in the tables below: a lazy tick never forces the sum, so almost none of the work is captured inside the timing window.

```
perf-explore -n 1000 -l 1000 --ticks --nocheck
```
|               | stepTime | tick   | tickForce | tickForceArgs | tickLazy | tickWHNF | times  |
|---------------|----------|--------|-----------|---------------|----------|----------|--------|
| sumAux        | 3.29e3   | 4.83e3 | 3.29e3    | 3.29e3        | 1.85e2   | 3.92e3   | 3.29e3 |
| sumCata       | 5.86e3   | 5.61e3 | 6.00e3    | 6.12e3        | 1.85e2   | 5.78e3   | 5.86e3 |
| sumCo         | 3.73e3   | 4.63e3 | 3.66e3    | 3.66e3        | 1.90e2   | 4.36e3   | 3.72e3 |
| sumCoCase     | 5.08e3   | 5.10e3 | 4.96e3    | 4.95e3        | 1.85e2   | 5.12e3   | 5.11e3 |
| sumCoGo       | 3.47e3   | 4.74e3 | 4.66e3    | 4.64e3        | 1.85e2   | 4.72e3   | 3.29e3 |
| sumF          | 5.92e3   | 4.85e3 | 4.84e3    | 6.41e3        | 1.85e2   | 4.85e3   | 5.91e3 |
| sumFlip       | 4.54e3   | 4.45e3 | 4.44e3    | 4.44e3        | 1.85e2   | 4.44e3   | 4.26e3 |
| sumFlipLazy   | 4.52e3   | 4.51e3 | 4.47e3    | 4.47e3        | 1.85e2   | 4.49e3   | 4.50e3 |
| sumFoldr      | 5.55e3   | 4.78e3 | 4.71e3    | 4.72e3        | 1.85e2   | 4.77e3   | 5.56e3 |
| sumFuse       | 8.28e2   | 8.33e2 | 8.29e2    | 8.29e2        | 1.86e2   | 8.28e2   | 8.29e2 |
| sumFuseFoldl' | 2.03e3   | 8.29e2 | 8.32e2    | 8.29e2        | 1.84e2   | 8.29e2   | 8.29e2 |
| sumFuseFoldr  | 1.17e3   | 1.17e3 | 1.18e3    | 1.17e3        | 1.84e2   | 1.19e3   | 1.17e3 |
| sumFusePoly   | 8.40e2   | 8.37e2 | 8.35e2    | 8.36e2        | 1.84e2   | 8.40e2   | 8.37e2 |
| sumLambda     | 3.67e3   | 5.03e3 | 3.67e3    | 3.67e3        | 1.85e2   | 3.78e3   | 3.67e3 |
| sumMono       | 3.66e3   | 5.13e3 | 5.12e3    | 7.20e3        | 1.84e2   | 5.13e3   | 3.66e3 |
| sumPoly       | 4.83e3   | 4.85e3 | 4.83e3    | 4.84e3        | 1.86e2   | 4.84e3   | 4.84e3 |
| sumSum        | 4.55e3   | 4.83e3 | 4.53e3    | 4.53e3        | 1.85e2   | 6.02e3   | 4.55e3 |
| sumTail       | 4.54e3   | 7.07e3 | 5.81e3    | 4.96e3        | 3.27e2   | 6.49e3   | 4.43e3 |
| sumTailLazy   | 6.24e3   | 4.41e3 | 6.47e3    | 6.23e3        | 2.03e2   | 5.49e3   | 6.24e3 |
### short list

```
perf-explore -n 10000 -l 10 --median --ticks
```
|               | stepTime | tick   | tickForce | tickForceArgs | tickLazy | tickWHNF | times  |
|---------------|----------|--------|-----------|---------------|----------|----------|--------|
| sumAux        | 2.17e2   | 2.21e2 | 2.18e2    | 2.19e2        | 1.85e2   | 2.17e2   | 2.18e2 |
| sumCata       | 2.16e2   | 2.19e2 | 2.20e2    | 2.21e2        | 1.85e2   | 2.17e2   | 2.18e2 |
| sumCo         | 2.22e2   | 2.34e2 | 2.17e2    | 2.18e2        | 1.85e2   | 2.17e2   | 2.21e2 |
| sumCoCase     | 2.15e2   | 2.32e2 | 2.18e2    | 2.36e2        | 1.91e2   | 2.18e2   | 2.18e2 |
| sumCoGo       | 2.16e2   | 2.23e2 | 2.18e2    | 2.31e2        | 1.87e2   | 2.16e2   | 2.18e2 |
| sumF          | 2.19e2   | 2.30e2 | 2.19e2    | 2.20e2        | 1.86e2   | 2.19e2   | 2.20e2 |
| sumFlip       | 2.16e2   | 2.34e2 | 2.17e2    | 2.16e2        | 1.85e2   | 2.17e2   | 2.17e2 |
| sumFlipLazy   | 2.16e2   | 2.23e2 | 2.16e2    | 2.17e2        | 1.85e2   | 2.17e2   | 2.18e2 |
| sumFoldr      | 2.14e2   | 2.31e2 | 2.17e2    | 2.17e2        | 1.85e2   | 2.17e2   | 2.18e2 |
| sumFuse       | 2.02e2   | 2.03e2 | 2.03e2    | 2.03e2        | 1.85e2   | 2.03e2   | 2.03e2 |
| sumFuseFoldl' | 2.02e2   | 2.03e2 | 2.03e2    | 2.03e2        | 1.85e2   | 2.03e2   | 2.03e2 |
| sumFuseFoldr  | 2.04e2   | 2.04e2 | 2.07e2    | 2.04e2        | 1.94e2   | 2.05e2   | 2.04e2 |
| sumFusePoly   | 2.05e2   | 2.05e2 | 2.05e2    | 2.05e2        | 1.85e2   | 2.05e2   | 2.05e2 |
| sumLambda     | 2.18e2   | 2.39e2 | 2.18e2    | 2.19e2        | 1.84e2   | 2.20e2   | 2.19e2 |
| sumMono       | 2.08e2   | 2.31e2 | 2.08e2    | 2.11e2        | 1.92e2   | 2.09e2   | 2.09e2 |
| sumPoly       | 2.18e2   | 2.32e2 | 2.19e2    | 2.19e2        | 1.85e2   | 2.20e2   | 2.20e2 |
| sumSum        | 2.18e2   | 2.33e2 | 2.19e2    | 2.19e2        | 1.85e2   | 2.20e2   | 2.19e2 |
| sumTail       | 2.52e2   | 4.19e2 | 2.95e2    | 2.60e2        | 2.69e2   | 3.64e2   | 2.42e2 |
| sumTailLazy   | 2.09e2   | 2.42e2 | 2.13e2    | 2.10e2        | 1.90e2   | 2.28e2   | 2.11e2 |
### long list

```
perf-explore -n 100 -l 100000 --best --ticks
```
|               | stepTime | tick   | tickForce | tickForceArgs | tickLazy | tickWHNF | times  |
|---------------|----------|--------|-----------|---------------|----------|----------|--------|
| sumAux        | 7.38e5   | 7.38e5 | 7.36e5    | 7.36e5        | 1.66e2   | 7.38e5   | 7.38e5 |
| sumCata       | 7.40e5   | 7.40e5 | 7.38e5    | 7.39e5        | 1.66e2   | 7.40e5   | 7.40e5 |
| sumCo         | 7.40e5   | 7.41e5 | 7.38e5    | 7.38e5        | 1.66e2   | 7.41e5   | 7.39e5 |
| sumCoCase     | 7.39e5   | 7.39e5 | 7.36e5    | 7.36e5        | 1.66e2   | 7.40e5   | 7.38e5 |
| sumCoGo       | 7.39e5   | 7.39e5 | 7.36e5    | 7.36e5        | 1.66e2   | 7.39e5   | 7.39e5 |
| sumF          | 3.52e5   | 3.52e5 | 3.52e5    | 3.52e5        | 1.66e2   | 3.52e5   | 3.52e5 |
| sumFlip       | 3.75e5   | 3.75e5 | 3.75e5    | 3.75e5        | 1.66e2   | 3.75e5   | 3.75e5 |
| sumFlipLazy   | 3.65e5   | 3.65e5 | 3.65e5    | 3.65e5        | 1.66e2   | 3.65e5   | 3.65e5 |
| sumFoldr      | 7.51e5   | 7.52e5 | 7.47e5    | 7.48e5        | 1.66e2   | 7.51e5   | 7.51e5 |
| sumFuse       | 6.26e4   | 6.26e4 | 6.26e4    | 6.26e4        | 1.66e2   | 6.26e4   | 6.26e4 |
| sumFuseFoldl' | 6.26e4   | 6.26e4 | 6.26e4    | 6.26e4        | 1.66e2   | 6.26e4   | 6.26e4 |
| sumFuseFoldr  | 4.97e5   | 4.95e5 | 4.96e5    | 4.97e5        | 1.66e2   | 4.96e5   | 4.97e5 |
| sumFusePoly   | 6.26e4   | 6.26e4 | 6.26e4    | 6.26e4        | 1.66e2   | 6.26e4   | 6.26e4 |
| sumLambda     | 3.73e5   | 3.71e5 | 3.71e5    | 3.71e5        | 1.66e2   | 3.71e5   | 3.73e5 |
| sumMono       | 3.95e5   | 3.95e5 | 3.95e5    | 3.95e5        | 1.66e2   | 3.95e5   | 3.95e5 |
| sumPoly       | 3.85e5   | 3.85e5 | 3.84e5    | 3.84e5        | 1.66e2   | 3.85e5   | 3.85e5 |
| sumSum        | 4.06e5   | 4.06e5 | 4.06e5    | 4.06e5        | 1.66e2   | 4.06e5   | 4.06e5 |
| sumTail       | 3.06e5   | 3.53e5 | 3.06e5    | 3.06e5        | 1.66e2   | 3.08e5   | 3.06e5 |
| sumTailLazy   | 3.01e5   | 3.01e5 | 3.01e5    | 3.01e5        | 1.66e2   | 3.01e5   | 3.01e5 |
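A single cell of these tables can be approximated directly in GHCi with `ticks`. A sketch: this averages rather than taking the upper decile that `--best` reports, and the numbers will vary by machine and run:

```haskell
import Perf (average, ticks)

-- Roughly reproduce the long-list fused-sum row: 100 runs over a
-- 100000-element list, averaged. Numbers differ across machines/runs.
main :: IO ()
main = do
  (ts, _) <- ticks 100 (\x -> sum [1 .. x :: Double]) 100000
  print (average (fmap fromIntegral ts))
```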
## sums

```
perf-explore -n 1000 -l 1000 --sums
```

| label1        | label2 | old result | new result | change            |
|---------------|--------|------------|------------|-------------------|
| sumAux        | time   | 5.53e3     | 5.21e3     | improvement       |
| sumCata       | time   | 5.18e3     | 4.73e3     | improvement       |
| sumCo         | time   | 6.50e3     | 6.40e3     |                   |
| sumCoCase     | time   | 5.16e3     | 6.03e3     | slightly-degraded |
| sumCoGo       | time   | 6.11e3     | 5.88e3     |                   |
| sumF          | time   | 5.44e3     | 4.16e3     | improvement       |
| sumFlip       | time   | 7.23e3     | 7.07e3     |                   |
| sumFlipLazy   | time   | 6.68e3     | 6.44e3     |                   |
| sumFoldr      | time   | 5.23e3     | 5.00e3     |                   |
| sumFuse       | time   | 6.85e2     | 6.81e2     |                   |
| sumFuseFoldl' | time   | 6.94e2     | 6.78e2     |                   |
| sumFuseFoldr  | time   | 1.04e3     | 1.02e3     |                   |
| sumFusePoly   | time   | 6.96e2     | 6.84e2     |                   |
| sumLambda     | time   | 4.79e3     | 4.77e3     |                   |
| sumMono       | time   | 4.82e3     | 4.84e3     |                   |
| sumPoly       | time   | 4.77e3     | 4.81e3     |                   |
| sumSum        | time   | 4.95e3     | 5.05e3     |                   |
| sumTail       | time   | 7.32e3     | 7.10e3     |                   |
| sumTailLazy   | time   | 6.75e3     | 6.52e3     |                   |

## lengths

```
perf-explore -n 1000 -l 1000 --lengths
```

| label1           | label2 | old result | new result | change      |
|------------------|--------|------------|------------|-------------|
| lengthAux        | time   | 4.44e3     | 3.68e3     | improvement |
| lengthCo         | time   | 4.91e3     | 4.45e3     | improvement |
| lengthCoCase     | time   | 4.90e3     | 4.44e3     | improvement |
| lengthF          | time   | 3.38e3     | 3.21e3     |             |
| lengthFMono      | time   | 4.16e3     | 4.00e3     |             |
| lengthFlip       | time   | 5.49e3     | 4.90e3     | improvement |
| lengthFlipLazy   | time   | 5.32e3     | 4.77e3     | improvement |
| lengthFoldr      | time   | 4.23e3     | 3.90e3     | improvement |
| lengthFoldrConst | time   | 3.98e3     | 3.74e3     | improvement |
| lengthTail       | time   | 6.47e3     | 5.30e3     | improvement |
| lengthTailLazy   | time   | 6.11e3     | 5.34e3     | improvement |

## Space

```
perf-explore -n 10 -l 100000 --space +RTS -T -RTS
```

| label1 | label2       | old result | new result | change |
|--------|--------------|------------|------------|--------|
| sum    | MaxMem       | 4.61e6     | 4.61e6     |        |
| sum    | allocated    | 4.20e5     | 4.20e5     |        |
| sum    | gcLiveBytes  | 2.15e5     | 2.17e5     |        |
| sum    | gcollects    | 1.00e-1    | 1.00e-1    |        |
| sum    | maxLiveBytes | 0.00e0     | 0.00e0     |        |

Data is collected from GHCStats:

- allocated_bytes
- gcs
- gcdetails_live_bytes
- max_live_bytes
- max_mem_in_use_bytes

# Perf.BigO

Perf.BigO represents functionality to determine the complexity order of a computation. We could do a regression and minimise the error term, but we know that the largest run contains the most information; we would need to weight the simulations according to some heuristic. Instead, we:

- estimate the order factor for each possible Order, from N3 to N0, setting the constant factor of the highest-n run to zero, and
- pick the order with the lowest absolute error summed across all the runs.

```
import qualified Prelude as P
import Data.List (nub)
estOrder (\x -> sum $ nub [1..x]) 10 [1,10,100,1000]
```

```
BigOrder {bigOrder = N2, bigFactor = 4.05725, bigConstant = 0.0}
```
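For contrast, dropping the `nub` should be classified as roughly linear. A sketch, mirroring the call above; no particular output is implied, since factors are machine-dependent:

```haskell
import Perf (estOrder)

-- nub makes the computation above quadratic; a plain sum over the same
-- run sizes should come out near linear. Output varies by machine.
main :: IO ()
main = estOrder (\x -> sum [1 .. x]) 10 [1, 10, 100, 1000] >>= print
```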
## Cache speed

The average cost per + operation can get down to about 0.7 cycles, and the register file sustains about 4 accesses per cycle, so a sum pipeline uses roughly 0.7 × 4 ≈ 2.8 register instructions per +.

| Cache             | nsecs | Cycles      |
|-------------------|-------|-------------|
| register          | 0.1   | 4 per cycle |
| L1 Cache access   | 1     | 3-4 cycles  |
| L2 Cache access   | 4     | 11-12 cycles |
| L3 unified access | 14    | 30-40 cycles |
| DRAM hit          | 80    | 195 cycles  |
| L1 miss           | 16    | 40 cycles   |
| L2 miss           | >250  | >600 cycles |