sequitur-0.1.0.0: Grammar-based compression algorithms SEQUITUR
Copyright(c) Masahiro Sakai 2024
LicenseBSD-style
Maintainermasahiro.sakai@gmail.com
Stabilityprovisional
Portabilitynon-portable
Safe HaskellSafe-Inferred
LanguageHaskell2010

Language.Grammar.Sequitur

Description

SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of original sequence and can be used for data compression.

Example:

  • Input string: abcabcabcabcabc
  • Resulting grammar

    • SAAB
    • ABB
    • Babc

SEQUITUR consumes input symbols one-by-one and append each symbol at the end of the grammar's start production (S in the above example), then substitutes repeating patterns in the given sequence with new rules. SEQUITUR maintains two invariants:

Digram Uniqueness
SEQUITUR ensures that no digram (a.k.a. bigram) occurs more than once in the grammar. If a digram (e.g. ab) occurs twice, SEQUITUR introduce a fresh non-terminal symbol (e.g. M) and a rule (e.g. Mab) and replace original occurences with the newly introduced non-terminals. One exception is the cases where two occurrence overlap.
Rule Utility
If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitute the occurence with the right-hand side of the rule.

References:

Synopsis

Basic type definition

type Grammar a = IntMap [Symbol a] Source #

A grammar is a mappping from non-terminal (left-hand side of the rule) to sequnce of symbols (right hand side of the rule).

Non-terminal is represented as a RuleId.

type RuleId = Int Source #

A non-terminal symbol is represented by an Int.

The number 0 is reserved for the start symbol of the grammar.

data Symbol a Source #

A symbol is either a terminal symbol (from user-specified type) or a non-terminal symbol which we represent using RuleId type.

Constructors

NonTerminal !RuleId 
Terminal !a 

Instances

Instances details
Generic (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Associated Types

type Rep (Symbol a) :: Type -> Type #

Methods

from :: Symbol a -> Rep (Symbol a) x #

to :: Rep (Symbol a) x -> Symbol a #

Show a => Show (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

showsPrec :: Int -> Symbol a -> ShowS #

show :: Symbol a -> String #

showList :: [Symbol a] -> ShowS #

Eq a => Eq (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

(==) :: Symbol a -> Symbol a -> Bool #

(/=) :: Symbol a -> Symbol a -> Bool #

Ord a => Ord (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

compare :: Symbol a -> Symbol a -> Ordering #

(<) :: Symbol a -> Symbol a -> Bool #

(<=) :: Symbol a -> Symbol a -> Bool #

(>) :: Symbol a -> Symbol a -> Bool #

(>=) :: Symbol a -> Symbol a -> Bool #

max :: Symbol a -> Symbol a -> Symbol a #

min :: Symbol a -> Symbol a -> Symbol a #

Hashable a => Hashable (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

hashWithSalt :: Int -> Symbol a -> Int #

hash :: Symbol a -> Int #

type Rep (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

type Rep (Symbol a) = D1 ('MetaData "Symbol" "Language.Grammar.Sequitur" "sequitur-0.1.0.0-Lgle0ocATSo37AWieLgTJ0" 'False) (C1 ('MetaCons "NonTerminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 RuleId)) :+: C1 ('MetaCons "Terminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 a)))

High-level API

encode :: (Eq a, Hashable a) => [a] -> Grammar a Source #

Construct a grammer from a given sequence of symbols using SEQUITUR.

decode :: HasCallStack => Grammar a -> [a] Source #

Reconstruct a input sequence from a grammar.

This is a left-inverse of encode.

This function is implemented as

decode = toList . decodeToSeq

and provided just for convenience. For serious usage, use decodeToSeq or decodeLazy.

decodeLazy :: HasCallStack => Grammar a -> [a] Source #

A variant of decode but you can consume from the beginning before constructing entire sequence.

decodeToSeq :: HasCallStack => Grammar a -> Seq a Source #

A variant of decode with possibly better performance.

decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m Source #

Monoid-based folding over the decoded sequence.

This function is equivalent to the following definition, is more efficent due to the utilization of sharing.b

decodeToMonoid f = mconcat . map f . decode

Low-level monadic API

data Builder s a Source #

Builder denotes a internal state of the SEQUITUR algorithm.

newBuilder :: PrimMonad m => m (Builder (PrimState m) a) Source #

Create a new Builder.

add :: (PrimMonad m, Eq a, Hashable a) => Builder (PrimState m) a -> a -> m () Source #

Add a new symbol to the end of grammar's start production, and perform normalization to keep the invariants of SEQUITUR algorithm.

build :: PrimMonad m => Builder (PrimState m) a -> m (Grammar a) Source #

Retrieve a grammar (as a persistent data structure) from Builder's internal state.