streamly-archive: Stream data from archives using the streamly library.

Please see the README on GitHub at https://github.com/shlok/streamly-archive#readme

Properties

Versions	0.0.1, 0.0.2, 0.1.0, 0.2.0
Change log	ChangeLog.md
Dependencies	base (>=4.7 && <5), bytestring (>=0.10 && <0.11), streamly (>=0.8 && <0.9)
License	BSD-3-Clause
Copyright	2021 Shlok Datye
Author	Shlok Datye
Maintainer	sd-haskell@quant.is
Category	Archive, Codec, Streaming, Streamly
Home page	https://github.com/shlok/streamly-archive#readme
Bug tracker	https://github.com/shlok/streamly-archive/issues
Source repo	head: git clone https://github.com/shlok/streamly-archive
Uploaded	by shlok at 2021-07-24T17:20:13Z

Modules

Streamly
- External
  - Streamly.External.Archive
    - Internal
      - Streamly.External.Archive.Internal.Foreign

Downloads

streamly-archive-0.1.0.tar.gz (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

shlok

Readme for streamly-archive-0.1.0

streamly-archive

Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.

Requirements

Install libarchive on your system:

- Debian Linux: sudo apt-get install libarchive-dev
- macOS: brew install libarchive

Quick start

{-# LANGUAGE ScopedTypeVariables, TypeApplications #-}

module Main where

import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Either (isRight)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import Streamly.External.Archive (Header, headerPathName, readArchive)
-- The internal module paths and the Fold constructor used below follow
-- streamly 0.8, the version range this package depends on.
import Streamly.Internal.Data.Fold.Type (Fold (Fold), Step (Partial))
import Streamly.Internal.Data.Unfold.Type (Unfold)
import qualified Streamly.Prelude as S

main :: IO ()
main = do
    -- Obtain an unfold for the archive.
    -- For each entry in the archive, we will get a Header followed
    -- by zero or more ByteStrings containing chunks of file data.
    let unf :: Unfold IO Void (Either Header ByteString)
            = readArchive "/path/to/archive.tar.gz"

    -- Create a fold for converting each entry (which, as we saw
    -- above, is a Left followed by zero or more Rights) into a
    -- path and corresponding SHA-256 hash (Nothing for no data).
    -- Every step returns Partial because the fold never terminates
    -- early; the grouping below decides where each entry ends.
    let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String)
            = Fold
                (\(mpath, mctx) e ->
                    case e of
                        Left h -> do
                            mpath' <- headerPathName h
                            return $ Partial (mpath', mctx)
                        Right bs ->
                            return $ Partial
                                (mpath,
                                Just . (`hashUpdate` bs) $
                                    fromMaybe (hashInit @SHA256) mctx))
                (return $ Partial (Nothing, Nothing))
                (\(mpath, mctx) ->
                    return (show $ fromJust mpath,
                                show . hashFinalize <$> mctx))

    -- Execute the stream, grouping at the headers (the Lefts) using the
    -- above fold, and output the paths and SHA-256 hashes along the way.
    S.unfold unf undefined
        & S.groupsBy (\e _ -> isRight e) entryFold
        & S.mapM_ print
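
The grouping step above (a new entry starts at every Left, and each following Right belongs to it) can be tried out without libarchive. The following is a minimal sketch on plain lists using only base. The String stand-ins for Header and ByteString, and the names Item, groupEntries, and summarize, are hypothetical and not part of streamly-archive; concatenating the chunks stands in for hashing them.

```haskell
module Main where

import Data.Either (isRight, rights)
import Data.List (groupBy)

-- Hypothetical stand-in for Either Header ByteString:
-- Left carries an entry path, Right carries a chunk of file data.
type Item = Either String String

-- Start a new group at every Left; each following Right joins the
-- current group. Data.List.groupBy passes the first element of the
-- group as the first argument and the candidate element as the second,
-- so the predicate inspects its second argument.
groupEntries :: [Item] -> [[Item]]
groupEntries = groupBy (\_ e -> isRight e)

-- Reduce one group to its path and its concatenated data
-- (Nothing for no data, mirroring the fold in the quick start).
summarize :: [Item] -> (String, Maybe String)
summarize items = (path, contents)
  where
    path = case items of
        Left p : _ -> p
        _          -> "<no header>"
    chunks = rights items
    contents
        | null chunks = Nothing
        | otherwise   = Just (concat chunks)

main :: IO ()
main = do
    let items =
            [ Left "a.txt", Right "hel", Right "lo"
            , Left "empty.txt"
            , Left "b.txt", Right "world"
            ]
    mapM_ (print . summarize) (groupEntries items)
```

An entry with no data chunks, such as empty.txt here, yields Nothing, just as the quick start yields no hash for such an entry.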

Benchmarks

See ./bench/README.md. On our machine† we find that (1) reading an archive with this library is just as fast as reading it with plain Haskell IO code, and (2) both are between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.

The first result fulfills the promise of streamly and stream fusion. The differences from C are presumably explained by the marshalling of data into the Haskell world and are currently small enough for our purposes.

† Linode; Debian 10, Dedicated 32GB: 16 CPU, 640GB Storage, 32GB RAM.