hw-dsv: Unbelievably fast streaming DSV file parser

[ bsd3, csv, data-structures, library, program, simd, succinct-data-structures, text ]

Please see the README on Github at https://github.com/haskell-works/hw-dsv#readme


Flags

Automatic Flags
Name    Description                     Default
avx2    Enable avx2 instruction set     Disabled
bmi2    Enable bmi2 instruction set     Disabled
sse42   Enable SSE 4.2 optimisations.   Enabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions 0.1.0.0, 0.2, 0.2.1, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.3.6, 0.3.7, 0.3.8, 0.4.0, 0.4.1.0, 0.4.1.1, 0.4.1.2
Change log ChangeLog.md
Dependencies appar (>=0.1.8 && <0.2), base (>=4.11 && <5), bits-extra (>=0.0.1.2 && <0.1), bytestring (>=0.10 && <0.12), deepseq (>=1.4 && <1.5), generic-lens (>=2.2 && <2.3), ghc-prim (>=0.4 && <0.10), hedgehog (>=0.5 && <1.3), hw-bits (>=0.7.0.2 && <0.8), hw-dsv, hw-ip (>=2.3.4.2 && <2.5), hw-prim (>=0.6.2.14 && <0.7), hw-rankselect (>=0.12.0.2 && <0.14), hw-rankselect-base (>=0.3.2.0 && <0.4), hw-simd (>=0.1.1.3 && <0.2), lens (>=4.15 && <6), optparse-applicative (>=0.13 && <0.18), resourcet (>=1.1 && <1.3), text (>=1.2.2 && <3), transformers (>=0.4 && <0.7), vector (>=0.12.0.1 && <0.14)
License BSD-3-Clause
Copyright 2018-2020 John Ky
Author John Ky
Maintainer newhoggy@gmail.com
Revised Revision 2 made by newhoggy at 2022-08-31T06:50:03Z
Category Text, CSV, SIMD, Succinct Data Structures, Data Structures
Home page https://github.com/haskell-works/hw-dsv#readme
Bug tracker https://github.com/haskell-works/hw-dsv/issues
Source repo head: git clone https://github.com/haskell-works/hw-dsv
Uploaded by haskellworks at 2022-03-25T14:50:08Z
Reverse Dependencies 3 direct, 3 indirect
Executables hw-dsv
Downloads 8083 total (42 in the last 30 days)
Status Docs available
Last success reported on 2022-03-25

Readme for hw-dsv-0.4.1.1

hw-dsv

Unbelievably fast streaming DSV file parser that reads based on succinct data structures.

This library can use some BMI2 and AVX2 CPU instructions on x86-based CPUs that support them, if compiled with the appropriate flags on ghc-8.4.1 or later.

Compilation & Installation

Pre-requisites:

  • cabal-install-3.0.0.0
  • ghc-8.4.4 or higher

For basic performance, it is sufficient to build, test and benchmark the library as follows. The library will be compiled to use the broadword implementation of rank & select, which has reasonable performance.

cabal v2-configure --enable-tests --enable-benchmarks --disable-documentation
cabal v2-build
cabal v2-test
cabal v2-bench
cabal v2-install --overwrite-policy=always --installdir="$HOME/.local/bin"

Ensure that $HOME/.local/bin is in your PATH if you intend to use the hw-dsv binary.

For best performance, add the bmi2 and avx2 flags to the cabal.project file to target the BMI2 and AVX2 instruction sets.

For slightly older CPUs, remove the avx2 flags from the cabal.project file to target only the BMI2 instruction set.

Stack support

It should be possible to install hw-dsv via stack:

stack install --flag bits-extra:bmi2 --flag hw-rankselect-base:bmi2 --flag hw-rankselect:bmi2 --flag hw-simd:bmi2 --flag hw-simd:avx2 --flag hw-dsv:bmi2 --flag hw-dsv:avx2

Your mileage may vary depending on which snapshot you are using.

The flags should be adjusted for the CPU you are targeting.

Benchmark results

The following benchmarks show the kind of performance gain that can be expected from enabling the BMI2 instruction set on CPUs that support it. Benchmarks were run on a 2.9 GHz Intel Core i7 under macOS High Sierra.

With BMI2 disabled:

$ cat 7g.csv | pv -t -e -b -a | hw-dsv query-lazy -k 1 -k 2 -d , -e '|' > /dev/null
7.08GiB 0:07:25 [16.3MiB/s]

With BMI2 and AVX2 enabled:

$ cat 7gb.csv | pv -t -e -b -a | hw-dsv query-lazy -k 1 -k 2 -d , -e '|' > /dev/null
7.08GiB 0:00:39 [ 181MiB/s]

With only BMI2 enabled:

$ cat 7gb.csv | pv -t -e -b -a | hw-dsv query-lazy -k 1 -k 2 -d , -e '|' > /dev/null
7.08GiB 0:00:43 [ 165MiB/s]
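
To put the reported rates side by side, the following Haskell sketch divides each accelerated throughput by the baseline. The rates are copied from the pv output above; the names baselineRate, bmi2Rate, bmi2Avx2Rate and speedup are illustrative only and are not part of hw-dsv.

```haskell
-- Average throughput reported by pv in the runs above, in MiB/s.
baselineRate, bmi2Rate, bmi2Avx2Rate :: Double
baselineRate = 16.3   -- BMI2 disabled
bmi2Rate     = 165    -- only BMI2 enabled
bmi2Avx2Rate = 181    -- BMI2 and AVX2 enabled

-- | Ratio of an accelerated rate to the baseline rate.
speedup :: Double -> Double -> Double
speedup base rate = rate / base

main :: IO ()
main = do
  putStrLn ("BMI2 only:     " ++ show (speedup baselineRate bmi2Rate) ++ "x")
  putStrLn ("BMI2 and AVX2: " ++ show (speedup baselineRate bmi2Avx2Rate) ++ "x")
```

On these figures, the BMI2-enabled builds come out roughly ten to eleven times faster than the build with BMI2 disabled.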

hw-dsv command line options

The hw-dsv application accepts 1-based column indexes rather than 0-based. The library is 0-based.
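
Because the two conventions differ by one, code that passes user-facing column numbers to the library needs a conversion step. The following is a minimal sketch of that conversion, using plain lists to stay dependency-free; toZeroBased and selectColumns are hypothetical helpers for illustration and are not functions exported by hw-dsv.

```haskell
import Data.Maybe (mapMaybe)

-- | Convert a 1-based column index (application convention) to a
-- 0-based index (library convention). Indexes below 1 are rejected.
toZeroBased :: Int -> Maybe Int
toZeroBased k
  | k >= 1    = Just (k - 1)
  | otherwise = Nothing

-- | Select fields from a row using 1-based column indexes.
-- Invalid or out-of-range indexes are silently dropped.
selectColumns :: [Int] -> [a] -> [a]
selectColumns ks row = mapMaybe pick ks
  where
    pick k = toZeroBased k >>= \i -> lookupAt i row
    lookupAt i xs = case drop i xs of
      (x:_) -> Just x
      []    -> Nothing

main :: IO ()
main = print (selectColumns [1, 2] ["a", "b", "c"])
```

Returning Maybe from toZeroBased keeps an index of 0 or less from silently wrapping into a negative library index.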

Using hw-dsv as a library

{-# LANGUAGE ScopedTypeVariables #-}

module Example where

import qualified Data.ByteString.Lazy                   as LBS
import qualified Data.Vector                            as DV
import qualified HaskellWorks.Data.Dsv.Lazy.Cursor      as SVL
import qualified HaskellWorks.Data.Dsv.Lazy.Cursor.Lazy as SVL

example :: IO ()
example = do
  -- Read the input lazily so that large files can be streamed.
  bs <- LBS.readFile "sample.csv"
  -- Build a cursor over the input, using ',' as the field delimiter.
  let c = SVL.makeCursor ',' bs
  -- Each row is a vector of fields; fields are indexed from 0.
  let rows :: [DV.Vector LBS.ByteString] = SVL.toListVector c

  return ()
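
The example above stops once rows has been built. As a sketch of one way to consume a value of that type, the code below picks a single field from every row with safe indexing. It assumes only the vector and bytestring packages; sampleRows is made-up data standing in for the result of SVL.toListVector, and fieldAt is a hypothetical helper, not part of hw-dsv.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as LBS
import qualified Data.Vector                as DV

-- | Pick the field at a 0-based index from every row.
-- Rows that are too short yield Nothing instead of failing.
fieldAt :: Int -> [DV.Vector LBS.ByteString] -> [Maybe LBS.ByteString]
fieldAt i = map (DV.!? i)

-- | Made-up rows with the same type as the rows in the example above.
sampleRows :: [DV.Vector LBS.ByteString]
sampleRows =
  [ DV.fromList ["id", "name"]
  , DV.fromList ["1", "alice"]
  , DV.fromList ["2"]           -- short row: no second field
  ]

main :: IO ()
main = mapM_ print (fieldAt 1 sampleRows)
```

Using the safe (!?) lookup rather than (!) means a ragged input row produces Nothing instead of a runtime error.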