Copyright | (c) Masahiro Sakai 2024 |
---|---|
License | BSD-style |
Maintainer | masahiro.sakai@gmail.com |
Stability | provisional |
Portability | non-portable |
Safe Haskell | Safe-Inferred |
Language | Haskell2010 |
Language.Grammar.Sequitur
Description
SEQUITUR is a linear-time, online algorithm for producing a context-free grammar from an input sequence. The resulting grammar is a compact representation of the original sequence and can be used for data compression.
Example:
- Input string:
abcabcabcabcabc
Resulting grammar
S
→AAB
A
→BB
B
→abc
SEQUITUR consumes input symbols one-by-one and appends each symbol at the end of the
grammar's start production (S
in the above example), then substitutes repeating
patterns in the given sequence with new rules. SEQUITUR maintains two invariants:
- Digram Uniqueness
- SEQUITUR ensures that no digram
(a.k.a. bigram) occurs more than once in the grammar. If a digram
(e.g.
ab
) occurs twice, SEQUITUR introduces a fresh non-terminal symbol (e.g.M
) and a rule (e.g.M
→ab
) and replaces the original occurrences with the newly introduced non-terminal symbol. One exception is the cases where two occurrences overlap. - Rule Utility
- If a non-terminal symbol occurs only once, SEQUITUR removes the associated rule and substitutes the occurrence with the right-hand side of the rule.
References:
- Sequitur algorithm - Wikipedia
- sequitur.info
- Nevill-Manning, C.G. and Witten, I.H. (1997) "Identifying Hierarchical Structure in Sequences: A linear-time algorithm," Journal of Artificial Intelligence Research, 7, 67-82.
Synopsis
- newtype Grammar a = Grammar {}
- data Symbol a
- type NonTerminalSymbol = Int
- type IsTerminalSymbol a = Hashable a
- encode :: IsTerminalSymbol a => [a] -> Grammar a
- data Builder s a
- newBuilder :: PrimMonad m => m (Builder (PrimState m) a)
- add :: (PrimMonad m, IsTerminalSymbol a) => Builder (PrimState m) a -> a -> m ()
- build :: PrimMonad m => Builder (PrimState m) a -> m (Grammar a)
- decode :: HasCallStack => Grammar a -> [a]
- decodeToSeq :: HasCallStack => Grammar a -> Seq a
- decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m
- decodeNonTerminalsToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> IntMap m
Basic type definition
Since a grammar generated by SEQUITUR has exactly one rule for each non-terminal symbol, a grammar is represented as a mapping from non-terminal symbols (left-hand sides of the rules) to sequences of symbols (right-hand side of the rules).
For example, a grammar
0
→1 1 2
1
→2 2
2
→a b c
is represented as
Grammar (fromList [ (0, [NonTerminal 1, NonTerminal 1, NonTerminal 2]) , (1, [NonTerminal 2, NonTerminal 2]) , (2, [Terminal 'a', Terminal 'b', Terminal 'c']) ])
Since a grammar generated by SEQUITUR produces exactly one
sequence, we can identify the grammar with the produced
sequence. Therefore, Grammar
type is an instance of Foldable
,
IsList
, and IsString
.
Instances
Foldable Grammar Source # | Since: 0.2.0.0 |
Defined in Language.Grammar.Sequitur Methods fold :: Monoid m => Grammar m -> m # foldMap :: Monoid m => (a -> m) -> Grammar a -> m # foldMap' :: Monoid m => (a -> m) -> Grammar a -> m # foldr :: (a -> b -> b) -> b -> Grammar a -> b # foldr' :: (a -> b -> b) -> b -> Grammar a -> b # foldl :: (b -> a -> b) -> b -> Grammar a -> b # foldl' :: (b -> a -> b) -> b -> Grammar a -> b # foldr1 :: (a -> a -> a) -> Grammar a -> a # foldl1 :: (a -> a -> a) -> Grammar a -> a # elem :: Eq a => a -> Grammar a -> Bool # maximum :: Ord a => Grammar a -> a # minimum :: Ord a => Grammar a -> a # | |
Functor Grammar Source # | Since: 0.2.0.0 |
IsString (Grammar Char) Source # | Since: 0.2.0.0 |
Defined in Language.Grammar.Sequitur Methods fromString :: String -> Grammar Char # | |
IsTerminalSymbol a => IsList (Grammar a) Source # | Since: 0.2.0.0 |
Show a => Show (Grammar a) Source # | |
Eq a => Eq (Grammar a) Source # | |
type Item (Grammar a) Source # | |
Defined in Language.Grammar.Sequitur |
A symbol is either a terminal symbol (from a user-specified type) or a non-terminal symbol.
Constructors
NonTerminal !NonTerminalSymbol | |
Terminal a |
Instances
Functor Symbol Source # | Since: 0.2.0.0 |
Generic (Symbol a) Source # | |
Show a => Show (Symbol a) Source # | |
Eq a => Eq (Symbol a) Source # | |
Ord a => Ord (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur | |
Hashable a => Hashable (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur | |
type Rep (Symbol a) Source # | |
Defined in Language.Grammar.Sequitur type Rep (Symbol a) = D1 ('MetaData "Symbol" "Language.Grammar.Sequitur" "sequitur-0.2.0.0-DRyoTT0Z6K7K4tnKJGe036" 'False) (C1 ('MetaCons "NonTerminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'SourceStrict 'DecidedStrict) (Rec0 NonTerminalSymbol)) :+: C1 ('MetaCons "Terminal" 'PrefixI 'False) (S1 ('MetaSel ('Nothing :: Maybe Symbol) 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 a))) |
type NonTerminalSymbol = Int Source #
Non-terminal symbols are represented by Int
.
The number 0
is reserved for the start symbol of the grammar.
type IsTerminalSymbol a = Hashable a Source #
IsTerminalSymbol
is a class synonym for absorbing the difference
between hashable
<1.4.0.0
and >=1.4.0.0
.
hashable-1.4.0.0
makes Eq
be a superclass of Hashable
.
Therefore we define
type IsTerminalSymbol a = Hashable a
on hashable >=1.4.0.0
, while we define
type IsTerminalSymbol a = (Eq a, Hashable a)
on hashable <1.4.0.0
.
Also, developers can temporarily add other classes (e.g. Show
) to
ease debugging.
Construction
High-level API
Use encode
if the entire sequence is given at once and you
only need to create a single grammar from it.
encode :: IsTerminalSymbol a => [a] -> Grammar a Source #
Construct a grammar from a given sequence of symbols using SEQUITUR.
fromList
and fromString
can also be used.
Low-level monadic API
Use these low-level monadic API if the input sequence is given incrementally, or you want to repeatedly construct grammars with newly added inputs.
add :: (PrimMonad m, IsTerminalSymbol a) => Builder (PrimState m) a -> a -> m () Source #
Add a new symbol to the end of grammar's start production, and perform normalization to keep the invariants of the SEQUITUR algorithm.
build :: PrimMonad m => Builder (PrimState m) a -> m (Grammar a) Source #
Retrieve a grammar (as a persistent data structure) from the Builder
's internal state.
Conversion to other types
decode :: HasCallStack => Grammar a -> [a] Source #
Reconstruct an input sequence from a grammar.
It is lazy in the sense that you can consume from the beginning
before constructing the entire sequence. This function is suitable
if you just need to access the resulting sequence only once and
from beginning to end. If you need to use the resulting sequence in
a more complex way, decodeToSeq
would be more suitable.
This is a left-inverse of encode
, and is equivalent to toList
of Foldable
class and toList
of IsList
.
decodeToSeq :: HasCallStack => Grammar a -> Seq a Source #
decodeToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> m Source #
decodeNonTerminalsToMonoid :: (Monoid m, HasCallStack) => (a -> m) -> Grammar a -> IntMap m Source #
Monoid
-based folding over the decoded sequence of each non-terminal symbol.
For example, in the following grammar
g = Grammar (IntMap.fromList [ (0, [NonTerminal 1, Terminal 'c', NonTerminal 1]) , (1, [Terminal 'a', Terminal 'b']) ])
non-terminal symbol 0
and 1
produces "abcab"
and "ab"
respectively.
Therefore,
yieldsdecodeNonTerminalsToMonoid
f
IntMap.fromList [ (0, mconcat (map f "abcab")) , (1, mconcat (map f "ab")) ]