# streamly-archive

[![Hackage](https://img.shields.io/hackage/v/streamly-archive.svg?style=flat)](https://hackage.haskell.org/package/streamly-archive)
![CI](https://github.com/shlok/streamly-archive/workflows/CI/badge.svg?branch=master)

Stream data from archives (tar, tar.gz, zip, or any other format [supported by libarchive](https://github.com/libarchive/libarchive/wiki/LibarchiveFormats)) using the Haskell [streamly](https://hackage.haskell.org/package/streamly) library.

## Requirements

Install libarchive on your system.

* Debian Linux: `sudo apt-get install libarchive-dev`.
* macOS: `brew install libarchive`.

## Quick start

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import qualified Streamly.Data.Fold as F
import qualified Streamly.Data.Stream.Prelude as S
import Streamly.External.Archive
  ( Header,
    groupByHeader,
    headerPathName,
    readArchive,
  )
import Streamly.Internal.Data.Fold.Type (Fold (Fold), Step (Partial))
import Streamly.Internal.Data.Unfold.Type (Unfold)

main :: IO ()
main = do
  -- Obtain an unfold for the archive.
  -- For each entry in the archive, we will get a Header followed
  -- by zero or more ByteStrings containing chunks of file data.
  let unf :: Unfold IO Void (Either Header ByteString) =
        readArchive "/path/to/archive.tar.gz"

  -- Create a fold for converting each entry (which, as we saw
  -- above, is a Left followed by zero or more Rights) into a
  -- path and corresponding SHA-256 hash (Nothing for no data).
  let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String) =
        Fold
          ( \(mpath, mctx) e ->
              case e of
                Left h -> do
                  mpath' <- headerPathName h
                  return $ Partial (mpath', mctx)
                Right bs ->
                  return $
                    Partial
                      ( mpath,
                        Just . (`hashUpdate` bs) $
                          fromMaybe (hashInit @SHA256) mctx
                      )
          )
          (return $ Partial (Nothing, Nothing))
          ( \(mpath, mctx) ->
              return
                ( show $ fromJust mpath,
                  show . hashFinalize <$> mctx
                )
          )

  -- Execute the stream, grouping at the headers (the Lefts) using the
  -- above fold, and output the paths and SHA-256 hashes along the way.
  S.unfold unf undefined
    & groupByHeader entryFold
    & S.mapM print
    & S.fold F.drain
```

## Benchmarks

See `./bench/README.md`. We find on our machine that (1) reading an archive using this library is just as fast as using plain Haskell `IO` code; and that (2) both are somewhere between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.

The former fulfills the promise of [streamly](https://hackage.haskell.org/package/streamly) and stream fusion. The differences to C are presumably explained by the marshalling of data into the Haskell world and are currently small enough for our purposes.

April 2023; [Linode](https://linode.com); Debian 11, Dedicated 32GB: 16 CPU, 640GB Storage, 32GB RAM.