| Copyright | (c) Masahiro Sakai 2024 |
|---|---|
| License | BSD-style |
| Maintainer | masahiro.sakai@gmail.com |
| Stability | provisional |
| Portability | non-portable |
| Safe Haskell | Safe-Inferred |
| Language | Haskell2010 |
Language.Grammar.Sequitur
Description
SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of original sequence and can be used for data compression.
Example:
- Input string:
abcabcabcabcabc Resulting grammar
S→AABA→BBB→abc
SEQUITUR consumes input symbols one-by-one and append each symbol at the end of the
grammar's start production (S in the above example), then substitutes repeating
patterns in the given sequence with new rules. SEQUITUR maintains two invariants:
- Digram Uniqueness
- SEQUITUR ensures that no digram
(a.k.a. bigram) occurs more than once in the grammar. If a digram
(e.g.
ab) occurs twice, SEQUITUR introduce a fresh non-terminal symbol (e.g.M) and a rule (e.g.M→ab) and replace original occurences with the newly introduced non-terminals. One exception is the cases where two occurrence overlap. - Rule Utility
- If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitute the occurence with the right-hand side of the rule.
References:
- Sequitur algorithm - Wikipedia
- sequitur.info
- Nevill-Manning, C.G. and Witten, I.H. (1997) "Identifying Hierarchical Structure in Sequences: A linear-time algorithm," Journal of Artificial Intelligence Research, 7, 67-82.
Synopsis
- type Grammar a = IntMap [Symbol a]
- type RuleId = Int
- data Symbol a
- = NonTerminal !RuleId
- | Terminal !a
- encode :: (Eq a, Hashable a) => [a] -> Grammar a
- decode :: HasCallStack => Grammar a -> [a]
- decodeLazy :: HasCallStack => Grammar a -> [a]
- decodeToSeq :: HasCallStack => Grammar a -> Seq a
- decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m
- data Builder s a
- newBuilder :: PrimMonad m => m (Builder (PrimState m) a)
- add :: (PrimMonad m, Eq a, Hashable a) => Builder (PrimState m) a -> a -> m ()
- build :: PrimMonad m => Builder (PrimState m) a -> m (Grammar a)
Basic type definition
type Grammar a = IntMap [Symbol a] Source #
A grammar is a mappping from non-terminal (left-hand side of the rule) to sequnce of symbols (right hand side of the rule).
Non-terminal is represented as a RuleId.
A non-terminal symbol is represented by an Int.
The number 0 is reserved for the start symbol of the grammar.
A symbol is either a terminal symbol (from user-specified type)
or a non-terminal symbol which we represent using RuleId type.
Constructors
| NonTerminal !RuleId | |
| Terminal !a |
Instances
| Generic (Symbol a) Source # | |
| Show a => Show (Symbol a) Source # | |
| Eq a => Eq (Symbol a) Source # | |
| Ord a => Ord (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur | |
| Hashable a => Hashable (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur | |
| type Rep (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur type Rep (Symbol a) = D1 ('MetaData "Symbol" "Language.Grammar.Sequitur" "sequitur-0.1.0.0-Lgle0ocATSo37AWieLgTJ0" 'False) (C1 ('MetaCons "NonTerminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 RuleId)) :+: C1 ('MetaCons "Terminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 a))) | |
High-level API
encode :: (Eq a, Hashable a) => [a] -> Grammar a Source #
Construct a grammer from a given sequence of symbols using SEQUITUR.
decode :: HasCallStack => Grammar a -> [a] Source #
Reconstruct a input sequence from a grammar.
This is a left-inverse of encode.
This function is implemented as
decode =toList.decodeToSeq
and provided just for convenience.
For serious usage, use decodeToSeq or decodeLazy.
decodeLazy :: HasCallStack => Grammar a -> [a] Source #
A variant of decode but you can consume from the beginning
before constructing entire sequence.
decodeToSeq :: HasCallStack => Grammar a -> Seq a Source #
A variant of decode with possibly better performance.
decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m Source #