Tokenizer
===

***WARNING: this package is not yet tested enough. Bugs are very likely here.***

This package provides a solution to two problems:

- splitting an input string into tokens of a specified sort;
- checking that the tokenization of *all possible* strings is unique.

Some examples
---

*If you have trouble understanding the syntax we use, read the two sections below.*

- *parse* makes a `Token` from a string;
- *checkUniqueTokenizing* checks that every string can be split in at most one way into parts such that each part is matched by one of the given tokens;
- *tokenize* tries to split a given string into parts under the same condition.

Here everything is OK

```hs
> checkUniqueTokenizing $ parse <$> ["ab", "bc", "abc"]
Right ()
```

and we can split strings into these tokens with a deterministic result:

```hs
> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabc"
Right [("ab","ab"),("bc","bc"),("abc","abc")]
> tokenize (makeTokenizeMap $ parse <$> ["ab", "bc", "abc"]) "abbcabca"
Left (NoWayTokenize 7 [("ab","ab"),("bc","bc"),("abc","abc")])
```

We can parse `"ab"` either as `"a"` and `"b"`, or as `"ab"`

```hs
> checkUniqueTokenizing $ parse <$> ["ab", "a", "b"]
Left Conflicts:
[("a",a),("b",b)]
[("ab",ab)]
```

we *can* tokenize using this set of tokens, but sometimes it gives us a `TwoWaysTokenize` error:

```hs
> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "bba"
Right [("b","b"),("b","b"),("a","a")]
> tokenize (makeTokenizeMap $ parse <$> ["a", "b", "ab"]) "aab"
Left (TwoWaysTokenize 1 [("a","a"),("ab","ab")] [("a","a"),("a","a"),("b","b")])
```

To solve the problem we can specify that there should be no `b` character after a separate `a` token:

```hs
> checkUniqueTokenizing $ parse <$> ["ab", "a?!b", "b"]
Right ()
> tokenize (makeTokenizeMap $ parse <$> ["a?!b", "b", "ab"]) "aab"
Right [("a?!b","a"),("ab","ab")]
```

A more complex example. The problem here is with the string `"ababab"`:

```hs
> checkUniqueTokenizing $ parse <$> ["ab", "aba", "bab"]
Left Conflicts:
[("ab",ab),("ab",ab),("ab",ab)]
[("aba",aba),("bab",bab)]
```

Here even `"aab"` can be split in two ways: as `aa` followed by `b`, or as `a*b` alone. The current algorithm, however, reports a different conflict: `"aaab"` can be split as `aa` followed by `a*b` (matching `"ab"`), or as `a*b` alone (matching `"aaab"`):

```hs
> checkUniqueTokenizing $ parse <$> ["a*b", "aa", "b"]
Left Conflicts:
[("aa",aa),("a*b",ab)]
[("a*b",aaab)]
```

Try it yourself by executing `cabal repl examples -f examples`.

What is a token?
---

A token is a template for a part of a string. It consists of three parts, each of which places restrictions on the characters of a string the token can match. The main part of a token is its `body`: it describes the characters of the string part matched by the token. The two other parts, `behind` and `ahead`, restrict the symbols that may occur before and after the matched part respectively. Note that they are automatically considered satisfied at the beginning/end of the input. Each part of a token is a list describing consecutive symbols from left to right. In the `behind` and `ahead` parts we can specify, for each position, which symbols can or cannot occur via a `BlackWhiteSet`. In the token's `body` we can restrict not just a single position but a run of consecutive positions: a `BlackWhiteSet` can be marked as `Repeatable` one or some (i.e. one or more) times.
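To make this structure more concrete, here is a minimal Haskell sketch of how such a token *might* be represented. The type and field names are assumptions made for this README; the actual definitions exported from `Text.Tokenizer` may differ:

```hs
import Data.Set (Set)

-- Hypothetical representation for illustration only; the real types
-- live in Text.Tokenizer and may be shaped differently.
data BlackWhiteSet c
  = WhiteSet (Set c) -- only these characters are allowed here
  | BlackSet (Set c) -- all characters *except* these are allowed

data Repeatable c
  = One  (BlackWhiteSet c) -- exactly one character from the set
  | Some (BlackWhiteSet c) -- one or more characters from the set

data Token k c = Token
  { name   :: k                 -- token identifier
  , behind :: [BlackWhiteSet c] -- restrictions on characters before the match
  , body   :: [Repeatable c]    -- the characters the token actually consumes
  , ahead  :: [BlackWhiteSet c] -- restrictions on characters after the match
  }
```

Under such a representation, the token written `a?!b` in the examples above would have a one-element `body` matching `a`, an empty `behind`, and a one-element `ahead` containing a `BlackSet` of `b`.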
Syntax used in examples
---

To make the examples more readable we provide a simple language for describing tokens. We'll use alphabetic characters as symbols and some punctuation for describing the token's structure:

- `{` and `}` for grouping a set of characters containing more than one char;
- `!` means "all except these" (`BlackSet` in this package's terminology);
- `*` after a charset means "one or more characters from this set";
- `?` marks the beginning of a `behind`/`ahead` part;
- `<` and `>` for grouping complex `behind`/`ahead` parts.

The grammar in EBNF:

```ebnf
BlackWhiteSet   := ['!'], (letter | '{', {letter}, '}');
Repeatable      := BlackWhiteSet, ['*'];
ahead_or_behind := '?', (BlackWhiteSet | '<', {BlackWhiteSet}, '>');
body            := Repeatable, {Repeatable};
token           := [ahead_or_behind], body, [ahead_or_behind];
```

A parser for tokens described in this manner is available in `examples/Main.hs`.

Technical details
---

Uniqueness checking is provided by a modification of the Sardinas-Patterson algorithm. The tokenizing process is implemented in the simplest way that is not exponential in the length of the input string.

Usage
---

It's very likely that all you need is exported from `Text.Tokenizer`.

Bug reports and feature requests
---

Feel free to open issues at [the GitHub repo](https://github.com/Lev135/tokenizer/issues).

Contribution
---

I would be very glad of any contribution. There are many ways to improve this library:

- improve documentation and examples;
- add more tests to check that everything works correctly;
- improve performance (I think there are many opportunities here in both algorithms);
- add benchmarks (connected with the previous);
- *(this one is mostly for me :)* improve code readability (I've tried not to make it absolutely terrible, but it's definitely not perfect).

I know that some of these problems (especially code readability) should be fixed by me, but unfortunately I have no time to deal with them now. Maybe the package is too raw to publish, but there are some reasons for me to do so anyway:

- I don't know when it will be improved enough;
- it is needed for my main project ([FineTeX](https://github.com/lev135/FineTeX)).
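Appendix: the classical Sardinas-Patterson test
---

To give a flavour of the uniqueness check mentioned under "Technical details", here is a minimal sketch of the *classical* Sardinas-Patterson test for plain string codes. It supports none of the `behind`/`ahead`/`Repeatable` features of this package (handling those is exactly what the modified algorithm adds), and the function names are invented for this README; it illustrates the idea, not the package's implementation:

```hs
import Data.List (stripPrefix)
import qualified Data.Set as Set

-- All non-empty "dangling suffixes": what remains of a word of `ys`
-- after stripping a word of `xs` from its front.
dangling :: [String] -> [String] -> Set.Set String
dangling xs ys = Set.fromList
  [ w | x <- xs, y <- ys, Just w <- [stripPrefix x y], not (null w) ]

-- A code is uniquely decodable iff no reachable dangling suffix is
-- itself a codeword. We grow the set of reachable suffixes to a
-- fixpoint; all of them are suffixes of codewords, so this terminates.
uniquelyDecodable :: [String] -> Bool
uniquelyDecodable code = go (dangling code code)
  where
    codeSet = Set.fromList code
    go u
      | not (Set.null (u `Set.intersection` codeSet)) = False -- ambiguous
      | u' `Set.isSubsetOf` u                         = True  -- fixpoint
      | otherwise                                     = go u'
      where
        ws = Set.toList u
        u' = Set.unions [u, dangling code ws, dangling ws code]
```

For instance, `uniquelyDecodable ["ab", "bc", "abc"]` is `True` and `uniquelyDecodable ["ab", "a", "b"]` is `False`, matching the first two `checkUniqueTokenizing` results shown above.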