uu-parsinglib-2.4.1: Online, error-correcting parser combinators; monadic and applicative interfaces

Text.ParserCombinators.UU.Examples

Synopsis

Documentation

type Parser a = P (Str Char) aSource

We start out by defining the type of parser we want; by specifying the type of the state we resolve a lot of overloading

run :: Show t => P (Str Char) t -> String -> IO ()Source

The fuction run runs the parser and shows both the result, and the correcting steps which were taken during the parsing process.

pa :: Parser StringSource

Our first two parsers are simple; one recognises a single a character and the other one a single b. Since we will use them later we convert the recognsied character into String so they can be easily combined.

(<++>) :: Parser String -> Parser String -> Parser StringSource

We can now run the parser pa on input "a", which succeeds:

 run pa "a" 
 Result: "a"

If we run the parser pa on the empty input "", the expected symbol in inserted, that the position where it was inserted is reported, and we get information about what was expected at that position:

 run pa ""
 Result: "a"
 Correcting steps: 
    Inserted 'a' at position 0 expecting 'a'

Now let's see what happens if we encounter an unexpected symbol, as in:

 run pa "b"
 Result: "a"
 Correcting steps: 
    Deleted  'b' at position 0 expecting 'a'
    Inserted 'a' at position 1 expecting 'a'

The combinator <++> applies two parsers sequentially to the input and concatenates their results:

 run (pa <++> pa) "aa"@:
 Result: "aa"

wfp :: Parser IntSource

The function pSym is overloaded. The type of its argument determines how to interpret the argument. Thus far we have seen single characters, but we may pass ranges as well as argument:

 run (pList (pSym ('a','z'))) "doaitse"
 Result: "doaitse"

An even more general instance of pSym takes a triple as argument: a predicate, a string indicating what is expected, and the value to insert if nothing can be recognised:

 run (pSym (\t -> 'a' <= t && t <= 'z', "'a'..'z'", 'k')) "1"
 Result: 'k'
 Correcting steps: 
    Deleted  '1' at position 0 expecting 'a'..'z'
    Inserted 'k' at position 1 expecting 'a'..'z'

The parser pCount recognises a sequence of elements, throws away the results of the recognition process ( <$ ), and just returns the number of returned elements. The choice combinator <<|> indicates that preference is to be given to the left alternative if it can make progress. This enables us to specify greedy strategies:

 run (pCount pa) "aaaaa"
 Result: 5

The parsers are instance of the class Monad and hence we can use the result of a previous parser to construct a following one:

 run (do  {l <- pCount pa; pExact l pb}) "aaacabbb"
 Result: ["b","b","b","b"]
 Correcting steps: 
    Deleted  'c' at position 3 expecting one of ['a', 'b']
    Inserted 'b' at position 8 expecting 'b'

The function amb converts an ambigous parser into one which returns all possible parses:

 run (amb ( (++) <$> pa2 <*> pa3 <|> (++) <$> pa3 <*> pa2))  "aaaaa"
 Result: ["aaaaa","aaaaa"]

The applicative style makes it very easy to merge recognsition and computing a result. As an example we parse a sequence of nested well formed parentheses pairs and compute the maximum nesting depth with wfp:

 run wfp "((()))()(())" 
 Result: 3

expr :: Parser IntSource

It is very easy to recognise infix expressions with any number of priorities and operators:

 operators       = [[('+', (+)), ('-', (-))],  [('*' , (*))], [('^', (^))]]
 same_prio  ops  = msum [ op <$ pSym c | (c, op) <- ops]
 expr            = foldr pChainl ( pNatural <|> pParens expr) (map same_prio operators) -- 

which we can call:

 run expr "15-3*5+2^5"
 Result: 32

Note that also here correction takes place:

 run expr "2 + + 3 5"
 Result: 37
 Correcting steps: 
    Deleted  ' ' at position 1 expecting one of ['0'..'9', '^', '*', '-', '+']
    Deleted  ' ' at position 3 expecting one of ['(', '0'..'9']
    Inserted '0' at position 4 expecting '0'..'9'
    Deleted  ' ' at position 5 expecting one of ['(', '0'..'9']
    Deleted  ' ' at position 7 expecting one of ['0'..'9', '^', '*', '-', '+']

spaces :: Parser StringSource

A common case where ambiguity arises is when we e.g. want to recognise identifiers, but only those which are not keywords. The combinator micro inserts steps with a specfied cost in the result of the parser which can be used to disambiguate:

 
 ident ::  Parser String
 ident = ((:) <$> pSym ('a','z') <*> pMunch (\x -> 'a' <= x && x <= 'z') `micro` 2) <* spaces
 idents = pList1 ident
 pKey keyw = pToken keyw `micro` 1 <* spaces
 spaces :: Parser String
 spaces = pMunch (==' ')
 takes_second_alt =   pList ident 
                \<|> (\ c t e -> ["IfThenElse"] ++  c   ++  t  ++  e) 
                    \<$ pKey "if"   <*> pList_ng ident 
                    \<* pKey "then" <*> pList_ng ident
                    \<* pKey "else" <*> pList_ng ident  

A keyword is followed by a small cost 1, which makes sure that identifiers which have a keyword as a prefix win over the keyword. Identifiers are however followed by a cost 2, with as result that in this case the keyword wins. Note that a limitation of this approach is that keywords are only recognised as such when expected!

 test13 = run takes_second_alt "if a then if else c"
 test14 = run takes_second_alt "ifx a then if else c"

with results for test13 and test14:

 Result: ["IfThenElse","a","if","c"]
 Result: ["ifx","a","then","if", "else","c"]

A mistake which is made quite often is to construct a parser which can recognise a sequence of elements using one of the derived combinators (say pList), but where the argument parser can recognise the empty string. The derived combinators check whether this is the case and terminate the parsing process with an error message:

 run (pList spaces) ""
 Result: *** Exception: The combinator pList
  requires that it's argument cannot recognise the empty string
 run (pMaybe spaces) " "
 Result: *** Exception: The combinator pMaybe
 requires that it's argument cannot recognise the empty string

munch :: Parser StringSource

For documentation of pMerged and <||> see the module Text.ParserCombinators.UU.Merge:

 run  (,,) `pMerged` (list_of pDigit <||> list_of pLower <||> list_of pUpper) "1AabCD2D3d"

results in

 Result: ("123","abd","ACDD")

The function

 munch =  pMunch ( `elem` "^=*") 

returns the longest prefix of the input obeying the predicate:

 run munch "==^^**rest" 
 Result: "==^^**"
 Correcting steps: 
    The token 'r' was not consumed by the parsing process.
    The token 'e' was not consumed by the parsing process.
    The token 's' was not consumed by the parsing process.
    The token 't' was not consumed by the parsing process.