Copyright	(c) Robbin C.
License	GPLv3
Maintainer	Robbin C.
Stability	unstable
Portability	portable
Safe Haskell	None
Language	Haskell2010

Codec.Archive.Zim.Parser

Contents

High-level Functions
Searching
Exceptions
ZIM Header
ZIM Directory Entry
ZIM Content
ZIM file format

Description

This is a library for parsing ZIM (http://openzim.org) files. ZIM files contain offline web content (eg, Wikipedia) which can be browsed locally without an Internet connection.

The high-level functions can be used if it is not a problem to re-open and close the ZIM file on each invocation. For simple browsing on a local device, this should suffice. This also works if the underlying ZIM file is changing.

The other functions can be used if the caller opts to have more control over resource management.

Behind the scenes, conduit is used to read from files so memory usage should be constant.

Below is a full example of a Scotty web server that serves a ZIM file (specified on command line) on localhost port 3000:

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.IO.Class (liftIO)
import Data.Text.Lazy (toStrict, fromStrict)
import Data.Text.Encoding (decodeUtf8, encodeUtf8)
import System.Environment (getArgs)
import Network.HTTP.Types.Status (status404)
import Web.Scotty
import Codec.Archive.Zim.Parser (getZimMainPageUrl, getZimUrlContent)

main :: IO ()
main = do
    [fp] <- getArgs
    scotty 3000 $ do
      get "/" (redirectToZimMainPage fp)
      get (regex "^/(./.*)$") (serveZimUrl fp)
      notFound $ text "Invalid URL!"

redirectToZimMainPage :: FilePath -> ActionM ()
redirectToZimMainPage fp = do
    res <- liftIO $ getZimMainPageUrl fp
    case res of
      Nothing -> do
        status status404
        text "This ZIM file has no main page specified!"
      Just url -> redirect . fromStrict $ decodeUtf8 url

serveZimUrl :: FilePath -> ActionM ()
serveZimUrl fp = do
    url <- (encodeUtf8 . toStrict) <$> param "1"
    res <- liftIO $ getZimUrlContent fp url
    case res of
      Nothing -> do
        liftIO . putStrLn $ "Invalid URL: " ++ show url
        status status404
        text $ "Invalid URL!"
      Just (mimeType, content) -> do
        liftIO . putStrLn $ "Serving: " ++ show url
        setHeader "Content-Type" (fromStrict $ decodeUtf8 mimeType)
        raw content

Feedback and contributions are welcome on http://github.com/robbinch/zim-parser.

Synopsis

High-level Functions

The following high-level functions are sufficient to program a simple webserver that serves ZIM files (see example above).

getZimMainPageUrl Source

Arguments

:: FilePath	Path to ZIM file
-> IO (Maybe ByteString)	Returns URL if found

Returns URL of main page in ZIM. This URL can be used for redirecting to the actual page.

getZimUrlContent Source

Arguments

:: FilePath	Path to ZIM file
-> ByteString	URL
-> IO (Maybe (ByteString, ByteString))	Returns (MIME type, content) if found

Returns (MIME type, content) of URL, ready to be served via HTTP. Note that MIME type is a strict bytestring while Content is lazy.

Searching

searchZimDirEntByUrl Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> ByteString	URL to search for
-> IO (Maybe (Int, ZimDirEnt))	Returns (URL Index, Directory Entry) if found.

Search for a Directory Entry given a URL. URL must be prefixed with Namespace (eg. "A/Blue.html" or "I/favicon.png").

searchZimDirEntByTitle Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Char	Namespace to search for
-> ByteString	Title to search for
-> IO (Maybe (Int, ZimDirEnt))	Returns (Title Index, Directory Entry) if found

Search for a Directory Entry given a Title and namespace.

searchZimDirEntByTitlePrefix Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Char	Namespace
-> ByteString	Title Prefix
-> IO (Maybe ((Int, ZimDirEnt), (Int, ZimDirEnt)))	Returns ((Lower Title Index, Lower Directory Entry), (Upper Title Index, Upper Directory Entry)) if found.

Search for lower and upper bounds of Title indices that contains prefix in their title. Eg, if title list comprises [ "A", "Ba", "Bb", "Bc", "C" ] prefix search for "B" will return bounds corresponding to ("Ba", "Bc").

Exceptions

data ZimException Source

Other than the below, ErrorCall can be thrown by LZMA library if there is a problem with decompression.

Constructors

ZimInvalidMagic	ZIM file has invalid magic number (anything other than 72173914).
ZimParseError String	There is an error in parsing.
ZimIncompleteInput	There is insufficient bytes required to parse.
ZimInvalidIndex Int	The given index (URL, title or cluster) is out of bounds for this ZIM file.

Instances

Show ZimException Source
Exception ZimException Source

ZIM Header

data ZimHeader Source

See http://www.openzim.org/wiki/ZIM_file_format#Header for more details.

Constructors

ZimHeader

Fields

zimMagicNumber :: Int: Magic Number of file (somewhat superfluous as getZimHeader will throw an exception if magic number is anything other than 72173914)
zimVersion :: Int: Version of ZIM header
zimUuid :: ByteString: UUID of file
zimArticleCount :: Int: Number of articles
zimClusterCount :: Int: Number of clusters
zimUrlPtrPos :: Integer: Position of sorted URL pointers
zimTitlePtrPos :: Integer: Position of sorted Title pointers
zimClusterPtrPos :: Integer: Position of Cluster pointers
zimMimeListPos :: Integer: Position of MIME list
zimMainPage :: Maybe Int: Index of main page
zimLayoutPage :: Maybe Int: Index of layout page
zimChecksumPos :: Integer: Position of MD5 checksum

Instances

Eq ZimHeader Source
Show ZimHeader Source

getZimHeader Source

Arguments

:: Handle	Handle to ZIM file (eg. previously returned from `withBinaryFile`)
-> IO ZimHeader	Returns ZIM Header

Parses ZIM Header from a file handle. A ZIM Header is used by most of the functions in this module. For better performance or resource management, multiple file handles can be opened with the same ZIM header in order to call functions in parallel. If the underlying ZIM file has changed, a new ZIM header should be parsed.

getZimMimeList Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> IO (Array Int ByteString)	Returns array of MIME types

Parses MIME List from a ZIM header and a file handle.

ZIM Directory Entry

data ZimDirEntType Source

There are 4 types of directory entries. Most content in a ZIM file are usually ZimArticleEntry or ZimRedirectEntry.

Constructors

ZimArticleEntry
ZimRedirectEntry
ZimLinkTarget
ZimDeletedEntry

Instances

Eq ZimDirEntType Source
Show ZimDirEntType Source

data ZimDirEnt Source

See http://www.openzim.org/wiki/ZIM_file_format#Directory_Entries for more details.

Constructors

ZimDirEnt

Fields

zimDeType :: ZimDirEntType: Type of this Directory Entry
zimDeMimeType :: Int: Index into MIME list given by getZimMimeList
zimDeParameterLen :: Int: Parameter Length
zimDeNamespace :: Char: Namespace
zimDeRevision :: Int: Revision
zimDeRedirectIndex :: Maybe Int: Redirect Index (only applicable for ZimRedirectEntry)
zimDeClusterNumber :: Maybe Int: Content is stored in this Cluster Number (only applicable for ZimArticleEntry)
zimDeBlobNumber :: Maybe Int: Content is stored in this Blob Number (only applicable for ZimArticleEntry)
zimDeUrl :: ByteString: URL
zimDeTitle :: ByteString: Title

Instances

Eq ZimDirEnt Source
Show ZimDirEnt Source

getZimDirEntByUrlIndex Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Int	URL index
-> IO ZimDirEnt	Returns a Directory Entry

Returns Directory Entry corresponding to URL index.

getZimDirEntByTitleIndex Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Int	Title index
-> IO ZimDirEnt	Returns a Directory Entry

Returns Directory Entry corresponding to Title index.

ZIM Content

getZimCluster Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Int	Cluster number
-> IO ByteString	Returns a lazy bytestring containing cluster

Returns (decompressed) Cluster corresponding to Cluster number. This can throw ErrorCall if there is an error during decompression.

getZimBlob Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Int	Cluster Number
-> Int	Blob Number
-> IO ByteString	Returns a lazy bytestring containing blob

Returns Blob given Cluster and Blob number.

getZimContentByUrlIndex Source

Arguments

:: ZimHeader	ZIM header
-> Handle	Handle to ZIM file
-> Int	URL index
-> IO ByteString	Returns a lazy bytestring containing content

Returns content given URL index. Redirects are handled automatically

ZIM file format

Following is a short summary of the ZIM file format. The authoritative reference is at http://www.openzim.org/wiki/ZIM_file_format.

1. ZIM header

This is an 80-byte header (see ZimHeader). Among other things, it contains file offsets to the below.

2. List of MIME types

This is a sequence of null-terminated strings (eg. text/html, text/javascript). The last string is zero length, so the end always consists of 2 consecutive null bytes.

3. List of URLs

This is a sequence of 8-byte file offsets, each pointing to a directory entry. This list is sorted by the directory entries' URL.

getZimDirEntByUrlIndex looks up this table to return a directory entry.

4. List of Titles

This is a sequence of 4-byte indices, each pointing to a URL above (which in turn point to a directory entry). This list is sorted by the directory entries' Title.

getZimDirEntByTitleIndex uses this table to return a directory entry.

5. Directory Entries

This is a sequence of Directory Entries (see ZimDirEnt). The first 2 bytes determine the type of this entry, which also determine the length. Contents include:

a. MIME type

This 2-byte field means:

0xffff: This directory entry is a ZimRedirectEntry.
0xfffe: This directory entry is a ZimLinkTarget.
0xfffd: This directory entry is a ZimDeletedEntry.
any other value: This directory entry is a ZimArticleEntry and this index into the MIME list from above determines its MIME type.

b. Namespace

This single character determines the directory entry's namespace. (eg. A for articles, I for images, etc.) The comprehensive list is at http://www.openzim.org/wiki/ZIM_file_format#Namespaces.

c. Cluster and Blob number

Only for ZimArticleEntry, this is the directory entry's Cluster and Blob number. The Cluster number is a 4-byte index into the list of Clusters below. The Blob number refers to a block inside the (decompressed) cluster. Together, they provide the content of this directory entry.

d. URL and Title

These 2 null-terminated strings represent the URL and Title of this directory entry respectively. If the Title is empty, it is taken to be the same as the URL.

6. List of Clusters

This is a list of 8-byte file offsets, each pointing to a cluster in the file. The end of a cluster is also the start of the next cluster. Therefore, the length of a cluster is the difference between the adjacent offsets. For the last cluster, the end is the Checksum file offset, as the Checksum is always the last 16 bytes of a ZIM file.

a. Compression Type

The first byte of the cluster determines if it is uncompressed (eg. PNG image) or compressed with LZMA (eg. HTML).

0 or 1: No compression
4: Compressed with LZMA

b. List of Blobs

This is a list of 4-byte offsets, each pointing inside this cluster. The end of a blob is also the start of the next blob. Therefore, the length of a blob is the difference between the adjacent offsets. The last offset points to the end of the data area so there is always one more offset than blobs.