sequitur-0.2.0.0: Grammar-based compression algorithms SEQUITUR
Copyright(c) Masahiro Sakai 2024
LicenseBSD-style
Maintainermasahiro.sakai@gmail.com
Stabilityprovisional
Portabilitynon-portable
Safe HaskellSafe-Inferred
LanguageHaskell2010

Language.Grammar.Sequitur

Description

SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of the original sequence and can be used for data compression.

Example:

  • Input string: abcabcabcabcabc
  • Resulting grammar

    • SAAB
    • ABB
    • Babc

SEQUITUR consumes input symbols one-by-one and appends each symbol at the end of the grammar's start production (S in the above example), then substitutes repeating patterns in the given sequence with new rules. SEQUITUR maintains two invariants:

Digram Uniqueness
SEQUITUR ensures that no digram (a.k.a. bigram) occurs more than once in the grammar. If a digram (e.g. ab) occurs twice, SEQUITUR introduces a fresh non-terminal symbol (e.g. M) and a rule (e.g. Mab) and replaces the original occurrences with the newly introduced non-terminal symbol. One exception is the cases where two occurrences overlap.
Rule Utility
If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitutes the occurrence with the right-hand side of the rule.

References:

Synopsis

Basic type definition

newtype Grammar a Source #

Since a grammar generated by SEQUITUR has exactly one rule for each non-terminal symbol, a grammar is represented as a mapping from non-terminal symbols (left-hand sides of the rules) to sequences of symbols (right-hand side of the rules).

For example, a grammar

  • 01 1 2
  • 12 2
  • 2a b c

is represented as

Grammar (fromList
  [ (0, [NonTerminal 1, NonTerminal 1, NonTerminal 2])
  , (1, [NonTerminal 2, NonTerminal 2])
  , (2, [Terminal 'a', Terminal 'b', Terminal 'c'])
  ])

Since a grammar generated by SEQUITUR produces exactly one sequence, we can identify the grammar with the produced sequence. Therefore, Grammar type is an instance of Foldable, IsList, and IsString.

Constructors

Grammar 

Fields

Instances

Instances details
Foldable Grammar Source #

Since: 0.2.0.0

Instance details

Defined in Language.Grammar.Sequitur

Methods

fold :: Monoid m => Grammar m -> m #

foldMap :: Monoid m => (a -> m) -> Grammar a -> m #

foldMap' :: Monoid m => (a -> m) -> Grammar a -> m #

foldr :: (a -> b -> b) -> b -> Grammar a -> b #

foldr' :: (a -> b -> b) -> b -> Grammar a -> b #

foldl :: (b -> a -> b) -> b -> Grammar a -> b #

foldl' :: (b -> a -> b) -> b -> Grammar a -> b #

foldr1 :: (a -> a -> a) -> Grammar a -> a #

foldl1 :: (a -> a -> a) -> Grammar a -> a #

toList :: Grammar a -> [a] #

null :: Grammar a -> Bool #

length :: Grammar a -> Int #

elem :: Eq a => a -> Grammar a -> Bool #

maximum :: Ord a => Grammar a -> a #

minimum :: Ord a => Grammar a -> a #

sum :: Num a => Grammar a -> a #

product :: Num a => Grammar a -> a #

Functor Grammar Source #

Since: 0.2.0.0

Instance details

Defined in Language.Grammar.Sequitur

Methods

fmap :: (a -> b) -> Grammar a -> Grammar b #

(<$) :: a -> Grammar b -> Grammar a #

IsString (Grammar Char) Source #

Since: 0.2.0.0

Instance details

Defined in Language.Grammar.Sequitur

IsTerminalSymbol a => IsList (Grammar a) Source #

Since: 0.2.0.0

Instance details

Defined in Language.Grammar.Sequitur

Associated Types

type Item (Grammar a) #

Methods

fromList :: [Item (Grammar a)] -> Grammar a #

fromListN :: Int -> [Item (Grammar a)] -> Grammar a #

toList :: Grammar a -> [Item (Grammar a)] #

Show a => Show (Grammar a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

showsPrec :: Int -> Grammar a -> ShowS #

show :: Grammar a -> String #

showList :: [Grammar a] -> ShowS #

Eq a => Eq (Grammar a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

(==) :: Grammar a -> Grammar a -> Bool #

(/=) :: Grammar a -> Grammar a -> Bool #

type Item (Grammar a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

type Item (Grammar a) = a

data Symbol a Source #

A symbol is either a terminal symbol (from a user-specified type) or a non-terminal symbol.

Instances

Instances details
Functor Symbol Source #

Since: 0.2.0.0

Instance details

Defined in Language.Grammar.Sequitur

Methods

fmap :: (a -> b) -> Symbol a -> Symbol b #

(<$) :: a -> Symbol b -> Symbol a #

Generic (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Associated Types

type Rep (Symbol a) :: Type -> Type #

Methods

from :: Symbol a -> Rep (Symbol a) x #

to :: Rep (Symbol a) x -> Symbol a #

Show a => Show (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

showsPrec :: Int -> Symbol a -> ShowS #

show :: Symbol a -> String #

showList :: [Symbol a] -> ShowS #

Eq a => Eq (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

(==) :: Symbol a -> Symbol a -> Bool #

(/=) :: Symbol a -> Symbol a -> Bool #

Ord a => Ord (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

compare :: Symbol a -> Symbol a -> Ordering #

(<) :: Symbol a -> Symbol a -> Bool #

(<=) :: Symbol a -> Symbol a -> Bool #

(>) :: Symbol a -> Symbol a -> Bool #

(>=) :: Symbol a -> Symbol a -> Bool #

max :: Symbol a -> Symbol a -> Symbol a #

min :: Symbol a -> Symbol a -> Symbol a #

Hashable a => Hashable (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

Methods

hashWithSalt :: Int -> Symbol a -> Int #

hash :: Symbol a -> Int #

type Rep (Symbol a) Source # 
Instance details

Defined in Language.Grammar.Sequitur

type Rep (Symbol a) = D1 ('MetaData "Symbol" "Language.Grammar.Sequitur" "sequitur-0.2.0.0-DRyoTT0Z6K7K4tnKJGe036" 'False) (C1 ('MetaCons "NonTerminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 NonTerminalSymbol)) :+: C1 ('MetaCons "Terminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 a)))

type NonTerminalSymbol = Int Source #

Non-terminal symbols are represented by Int.

The number 0 is reserved for the start symbol of the grammar.

type IsTerminalSymbol a = Hashable a Source #

IsTerminalSymbol is a class synonym for absorbing the difference between hashable <1.4.0.0 and >=1.4.0.0.

hashable-1.4.0.0 makes Eq be a superclass of Hashable. Therefore we define

type IsTerminalSymbol a = Hashable a

on hashable >=1.4.0.0, while we define

type IsTerminalSymbol a = (Eq a, Hashable a)

on hashable <1.4.0.0.

Also, developers can temporarily add other classes (e.g. Show) to ease debugging.

Construction

High-level API

Use encode if the entire sequence is given at once and you only need to create a single grammar from it.

encode :: IsTerminalSymbol a => [a] -> Grammar a Source #

Construct a grammar from a given sequence of symbols using SEQUITUR.

fromList and fromString can also be used.

Low-level monadic API

Use these low-level monadic API if the input sequence is given incrementally, or you want to repeatedly construct grammars with newly added inputs.

data Builder s a Source #

Builder denotes an internal state of the SEQUITUR algorithm.

newBuilder :: PrimMonad m => m (Builder (PrimState m) a) Source #

Create a new Builder.

add :: (PrimMonad m, IsTerminalSymbol a) => Builder (PrimState m) a -> a -> m () Source #

Add a new symbol to the end of grammar's start production, and perform normalization to keep the invariants of the SEQUITUR algorithm.

build :: PrimMonad m => Builder (PrimState m) a -> m (Grammar a) Source #

Retrieve a grammar (as a persistent data structure) from the Builder's internal state.

Conversion to other types

decode :: HasCallStack => Grammar a -> [a] Source #

Reconstruct an input sequence from a grammar.

It is lazy in the sense that you can consume from the beginning before constructing the entire sequence. This function is suitable if you just need to access the resulting sequence only once and from beginning to end. If you need to use the resulting sequence in a more complex way, decodeToSeq would be more suitable.

This is a left-inverse of encode, and is equivalent to toList of Foldable class and toList of IsList.

decodeToSeq :: HasCallStack => Grammar a -> Seq a Source #

A variant of decode in which the result type is Seq.

decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m Source #

Monoid-based folding over the decoded sequence.

This function is equivalent to the following definition but is more efficient due to the utilization of sharing.

decodeToMonoid f = mconcat . map f . decode

This is equivalent to foldMap of Foldable class.

decodeNonTerminalsToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> IntMap m Source #

Monoid-based folding over the decoded sequence of each non-terminal symbol.

For example, in the following grammar

g = Grammar (IntMap.fromList
  [ (0, [NonTerminal 1, Terminal 'c', NonTerminal 1])
  , (1, [Terminal 'a', Terminal 'b'])
  ])

non-terminal symbol 0 and 1 produces "abcab" and "ab" respectively. Therefore, decodeNonTerminalsToMonoid f yields

IntMap.fromList
  [ (0, mconcat (map f "abcab"))
  , (1, mconcat (map f "ab"))
  ]