streamly-archive: Stream data from archives using the streamly library.

Versions: 0.0.1, 0.0.2, 0.1.0, 0.2.0
Change log: ChangeLog.md
Dependencies: base (>=4.7 && <5), bytestring (>=0.10 && <0.11), streamly (>=0.8 && <0.9)
License: BSD-3-Clause
Copyright: 2021 Shlok Datye
Author: Shlok Datye
Maintainer: sd-haskell@quant.is
Category: Archive, Codec, Streaming, Streamly
Home page: https://github.com/shlok/streamly-archive#readme
Bug tracker: https://github.com/shlok/streamly-archive/issues
Source repo (head): git clone https://github.com/shlok/streamly-archive
Uploaded: by shlok at 2021-07-24T17:25:21Z

Readme for streamly-archive-0.1.0

streamly-archive

Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.

Requirements

Install libarchive on your system.

  • Debian Linux: sudo apt-get install libarchive-dev.
  • macOS: brew install libarchive.

Quick start

{-# LANGUAGE ScopedTypeVariables, TypeApplications #-}

module Main where

import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Either (isRight)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import Streamly.External.Archive (Header, headerPathName, readArchive)
import Streamly.Internal.Data.Fold.Types (Fold (..))
import Streamly.Internal.Data.Unfold.Types (Unfold)
import qualified Streamly.Prelude as S

main :: IO ()
main = do
    -- Obtain an unfold for the archive.
    -- For each entry in the archive, we will get a Header followed
    -- by zero or more ByteStrings containing chunks of file data.
    let unf :: Unfold IO Void (Either Header ByteString)
            = readArchive "/path/to/archive.tar.gz"

    -- Create a fold for converting each entry (which, as we saw
    -- above, is a Left followed by zero or more Rights) into a
    -- path and corresponding SHA-256 hash (Nothing for no data).
    let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String)
            = Fold
                (\(mpath, mctx) e ->
                    case e of
                        Left h -> do
                            mpath' <- headerPathName h
                            return (mpath', mctx)
                        Right bs ->
                            return (mpath,
                                Just . (`hashUpdate` bs) $
                                    fromMaybe (hashInit @SHA256) mctx))
                (return (Nothing, Nothing))
                (\(mpath, mctx) ->
                    return (show $ fromJust mpath,
                                show . hashFinalize <$> mctx))

    -- Execute the stream, grouping at the headers (the Lefts) using the
    -- above fold, and output the paths and SHA-256 hashes along the way.
    S.unfold unf undefined
        & S.groupsBy (\e _ -> isRight e) entryFold
        & S.mapM_ print
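
The grouping step above can be modelled on plain lists, which may help when reasoning about how a flat sequence of Lefts and Rights is split into entries. The following is only a sketch using base: Path, Chunk, and groupEntries are hypothetical names introduced for illustration and are not part of this library, and the predicate's argument order follows Data.List.groupBy (group's first element, then candidate), which need not match the order used by S.groupsBy.

```haskell
module Main where

import Data.Either (isRight)
import Data.List (groupBy)

-- Hypothetical stand-ins for the library's Header and data chunks.
type Path = String
type Chunk = String

-- Split a flat sequence into entries. Each Left starts a new entry;
-- the Rights that follow it (possibly none) are that entry's chunks.
-- Data.List.groupBy passes the group's first element and then the
-- candidate, so a group is extended for as long as the candidate is
-- a Right. Leading Rights without a preceding Left are dropped by
-- the pattern match in the comprehension.
groupEntries :: [Either Path Chunk] -> [(Path, [Chunk])]
groupEntries xs =
    [ (p, [c | Right c <- rest])
    | (Left p : rest) <- groupBy (\_ e -> isRight e) xs
    ]

main :: IO ()
main =
    mapM_ print $
        groupEntries
            [ Left "a.txt", Right "he", Right "llo"
            , Left "empty.txt"
            , Left "b.txt", Right "x"
            ]
```

Here the entry for "empty.txt" has an empty chunk list, which corresponds to the Nothing hash ("no data") case in the quick start above.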

Benchmarks

See ./bench/README.md. We find on our machine that (1) reading an archive using this library is just as fast as using plain Haskell IO code; and that (2) both are somewhere between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.

The first finding fulfills the promise of streamly and stream fusion. The difference from C is presumably explained by the marshalling of data into the Haskell world and is currently small enough for our purposes.

Benchmark machine: Linode Dedicated 32GB running Debian 10 (16 CPUs, 640GB storage, 32GB RAM).