streamly-archive: Stream data from archives using the streamly library.



Please see the README on GitHub at https://github.com/shlok/streamly-archive#readme



Properties

Versions: 0.0.1, 0.0.2, 0.1.0
Change log: ChangeLog.md
Dependencies: base (>=4.7 && <5), bytestring (==0.10.*), streamly (==0.8.*)
License: BSD-3-Clause
Copyright: 2021 Shlok Datye
Author: Shlok Datye
Maintainer: sd-haskell@quant.is
Category: Archive, Codec, Streaming, Streamly
Home page: https://github.com/shlok/streamly-archive#readme
Bug tracker: https://github.com/shlok/streamly-archive/issues
Source repo: head: git clone https://github.com/shlok/streamly-archive
Uploaded: by shlok at 2021-07-24T17:20:13Z

Readme for streamly-archive-0.1.0


streamly-archive


Stream data from archives (tar, tar.gz, zip, or any other format supported by libarchive) using the Haskell streamly library.

Requirements

Install libarchive on your system.

Quick start

{-# LANGUAGE ScopedTypeVariables, TypeApplications #-}

module Main where

import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Either (isRight)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import Streamly.External.Archive (Header, headerPathName, readArchive)
import Streamly.Internal.Data.Fold.Types (Fold (..))
import Streamly.Internal.Data.Unfold.Types (Unfold)
import qualified Streamly.Prelude as S

main :: IO ()
main = do
    -- Obtain an unfold for the archive.
    -- For each entry in the archive, we will get a Header followed
    -- by zero or more ByteStrings containing chunks of file data.
    let unf :: Unfold IO Void (Either Header ByteString)
            = readArchive "/path/to/archive.tar.gz"

    -- Create a fold for converting each entry (which, as we saw
    -- above, is a Left followed by zero or more Rights) into a
    -- path and corresponding SHA-256 hash (Nothing for no data).
    let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String)
            = Fold
                (\(mpath, mctx) e ->
                    case e of
                        Left h -> do
                            mpath' <- headerPathName h
                            return (mpath', mctx)
                        Right bs ->
                            return (mpath,
                                Just . (`hashUpdate` bs) $
                                    fromMaybe (hashInit @SHA256) mctx))
                (return (Nothing, Nothing))
                (\(mpath, mctx) ->
                    return (show $ fromJust mpath,
                                show . hashFinalize <$> mctx))

    -- Execute the stream, grouping at the headers (the Lefts) using the
    -- above fold, and output the paths and SHA-256 hashes along the way.
    S.unfold unf undefined
        & S.groupsBy (\e _ -> isRight e) entryFold
        & S.mapM_ print
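
To make the shape of the stream easier to see, here is a minimal sketch of the same grouping idea on plain lists, using only base. It is not the library's API: Path, Chunk, groupEntries, and summarize are hypothetical stand-ins, with a String path in place of Header, a String chunk in place of ByteString, and a byte count in place of the SHA-256 hash.

```haskell
module Main where

import Data.Either (isRight)
import Data.List (foldl')
import Data.Maybe (fromMaybe)

-- Hypothetical stand-ins for Header and ByteString.
type Path = String
type Chunk = String

-- Split the flat stream into entries: the first element starts a group,
-- and every Right that directly follows it belongs to the same group.
groupEntries :: [Either Path Chunk] -> [[Either Path Chunk]]
groupEntries [] = []
groupEntries (x : xs) =
    let (chunks, rest) = span isRight xs
     in (x : chunks) : groupEntries rest

-- Summarize one entry as its path and total chunk length. The second
-- component stays Nothing when the entry carries no data, mirroring the
-- (path, Maybe hash) result of the fold in the quick start.
summarize :: [Either Path Chunk] -> (Maybe Path, Maybe Int)
summarize = foldl' step (Nothing, Nothing)
  where
    step (_, msize) (Left p) = (Just p, msize)
    step (mpath, msize) (Right c) =
        (mpath, Just (fromMaybe 0 msize + length c))

main :: IO ()
main =
    mapM_ (print . summarize) . groupEntries $
        [ Left "a.txt", Right "hel", Right "lo"
        , Left "empty.txt"
        , Left "b.txt", Right "x"
        ]
```

In this sketch the entry "empty.txt" has no Right chunks after its header, so its summary keeps Nothing in the second position, which corresponds to the "Nothing for no data" case described in the quick start comments.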

Benchmarks

See ./bench/README.md. We find on our machine that (1) reading an archive using this library is just as fast as using plain Haskell IO code; and that (2) both are somewhere between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.

Finding (1) fulfills the promise of streamly and stream fusion. The gap to C in finding (2) is presumably explained by the cost of marshalling data into the Haskell world, and it is currently small enough for our purposes.

Benchmark machine: Linode, Debian 10, Dedicated 32GB plan (16 CPUs, 640 GB storage, 32 GB RAM).