| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Data.Warc
Description
WARC (or Web ARChive) is an archival file format widely used to distribute corpora of crawled web content (see, for instance, the Common Crawl corpus). A WARC file consists of a set of records, each of which describes a web request or response.
This module provides a streaming parser and encoder for WARC archives for use
with the pipes package.
- type Warc m a = FreeT (Record m) m (Producer ByteString m a)
- data Record m r = Record {
- recHeader :: RecordHeader
- recContent :: Producer ByteString m r
- }
- parseWarc :: (Functor m, Monad m) => Producer ByteString m a -> Warc m a
- iterRecords :: forall m a. Monad m => (forall b. Record m b -> m b) -> Warc m a -> m (Producer ByteString m a)
- produceRecords :: forall m o a. Monad m => (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) -> Warc m a -> Producer o m (Producer ByteString m a)
- encodeRecord :: Monad m => Record m a -> Producer ByteString m a
- module Data.Warc.Header
Documentation
type Warc m a = FreeT (Record m) m (Producer ByteString m a)
A WARC archive.
This represents a sequence of records followed by whatever data was left over from the parse.
data Record m r
A WARC record.
This represents a single record of a WARC file, consisting of a set of headers and a means of producing the record's body.
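Constructors
| Record | |
Fields
| recHeader :: RecordHeader | the record's headers |
| recContent :: Producer ByteString m r | a producer of the record's body |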
Parsing
parseWarc
Arguments
| :: (Functor m, Monad m) | |
| => Producer ByteString m a | a producer of a stream of WARC content |
| -> Warc m a | the parsed WARC archive |
Parse a WARC archive.
Note that this function does not actually do any parsing itself;
it merely returns a Warc value which can then be run to parse
individual records.
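To make "run" concrete, here is a minimal sketch that steps a Warc value by hand. It assumes that FreeT is the one from Control.Monad.Trans.Free in the free package (this page does not say which module provides it), and firstHeader is a hypothetical helper name, not part of this module.

```haskell
import Control.Monad.Trans.Free (FreeF (..), runFreeT)
import Data.Warc
import qualified Pipes.ByteString as PBS

-- Hypothetical helper: run just enough of the parse to see the first header.
firstHeader :: Monad m => Warc m a -> m (Maybe RecordHeader)
firstHeader warc = do
  -- Input is only consumed here, when the FreeT layer is run;
  -- parseWarc itself merely built the description of the parse.
  step <- runFreeT warc
  return $ case step of
    Pure _leftover      -> Nothing   -- no record was produced
    Free (Record hdr _) -> Just hdr  -- the body is left unconsumed

main :: IO ()
main = do
  mhdr <- firstHeader (parseWarc PBS.stdin)
  putStrLn (maybe "no records" (const "found a record") mhdr)
```

Because the body producer is never run in this sketch, nothing past the first header is forced. To continue to the next record, the body has to be run to completion, since its return value is the rest of the archive.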
iterRecords
Arguments
| :: Monad m | |
| => (forall b. Record m b -> m b) | the action to run on each Record |
| -> Warc m a | the Warc archive |
| -> m (Producer ByteString m a) | returns any leftover data |
Iterate over the Records in a WARC archive.
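As an illustration, the following sketch counts the records of an uncompressed archive read from standard input. countRecords is a hypothetical helper, and the use of an IORef is just one convenient way to accumulate state in IO, not something this module prescribes.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)
import Data.Warc
import Pipes (runEffect, (>->))
import qualified Pipes.ByteString as PBS
import qualified Pipes.Prelude as PP

-- Hypothetical helper: count the records, discarding every body.
countRecords :: Warc IO a -> IO Int
countRecords warc = do
  counter <- newIORef (0 :: Int)
  _leftover <- iterRecords
    (\(Record _hdr body) -> do
        modifyIORef' counter (+ 1)
        -- The action must return the body's result (the 'b' in the type),
        -- so the body is drained to completion even though it is ignored.
        runEffect (body >-> PP.drain))
    warc
  readIORef counter

main :: IO ()
main = countRecords (parseWarc PBS.stdin) >>= print
```

The rank-2 type of the callback (forall b. Record m b -> m b) is what forces each body to be fully consumed: the only way to obtain a b is to run recContent to its end.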
produceRecords
Arguments
| :: Monad m | |
| => (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) | consume the record producing some output |
| -> Warc m a | a WARC archive (see parseWarc) |
| -> Producer o m (Producer ByteString m a) | returns any leftover data |
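This page gives no prose description for this function. Judging from the type alone, it turns each record into a stream of outputs of type o. The sketch below emits one Int per record, the length of its body in bytes. bodySize is a hypothetical name, and the example assumes uncompressed input on standard input.

```haskell
import Control.Monad.Trans.Class (lift)
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)
import Data.Warc
import Pipes (Producer, runEffect, yield, (>->))
import qualified Pipes.ByteString as PBS
import qualified Pipes.Prelude as PP

-- Hypothetical handler: sum the chunk lengths of one record body and
-- yield the total as a single output value.
bodySize :: Monad m => RecordHeader -> Producer ByteString m b -> Producer Int m b
bodySize _hdr body = do
  (n, r) <- lift (PP.fold' (\acc chunk -> acc + BS.length chunk) 0 id body)
  yield n
  return r  -- hand back the body's result so the next record can be parsed

main :: IO ()
main = do
  _leftover <- runEffect (produceRecords bodySize (parseWarc PBS.stdin) >-> PP.print)
  return ()
```

Compared with iterRecords, the handler here lives in Producer o m rather than in m, so results can be streamed downstream without collecting them in mutable state.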
Encoding
encodeRecord :: Monad m => Record m a -> Producer ByteString m a
Encode a Record in WARC format.
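A minimal sketch of combining the encoder with the parser: each parsed record is rebuilt with the Record constructor and re-encoded to standard output. reencode is a hypothetical helper, and the sketch makes no claim that the output is byte-for-byte identical to the input.

```haskell
import qualified Data.ByteString as BS
import Data.ByteString (ByteString)
import Data.Warc
import Pipes (Producer, runEffect, (>->))
import qualified Pipes.ByteString as PBS
import qualified Pipes.Prelude as PP

-- Hypothetical helper: parse an archive and re-encode every record.
-- The result of the outer producer is whatever input was left unparsed.
reencode :: Monad m => Warc m a -> Producer ByteString m (Producer ByteString m a)
reencode = produceRecords (\hdr body -> encodeRecord (Record hdr body))

main :: IO ()
main = do
  _leftover <- runEffect (reencode (parseWarc PBS.stdin) >-> PP.mapM_ BS.putStr)
  return ()
```

A header-rewriting filter would follow the same shape, transforming hdr before passing it to Record.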
Headers
module Data.Warc.Header