Source: http://www.haskell.org/haskellwiki/Parsing_a_simple_imperative_language
This tutorial will present how to parse a subset of a simple imperative
programming language called WHILE (introduced in a book
"Principles of Program Analysis" by Nielson, Nielson and Hankin). It includes
only a few statements and basic boolean/arithmetic expressions, which makes it
a nice material for a tutorial.
== Imports ==
First let's specify the name of the module:
> module Lang where
And then import the necessary libraries:
> import System.IO
> import Control.Monad
> import Text.ParserCombinators.Parsec
> import Text.ParserCombinators.Parsec.Expr
> import Text.ParserCombinators.Parsec.Language
> import qualified Text.ParserCombinators.Parsec.Token as Token
== The language ==
The grammar for expressions is defined as follows:
''a'' ::= ''x'' | ''n'' | - ''a'' | ''a'' ''opa'' ''a''
''b'' ::= true | false | not ''b'' | ''b'' ''opb'' ''b'' | ''a'' ''opr'' ''a''
''opa'' ::= + | - | * | /
''opb'' ::= and | or
''opr'' ::= > | <
Note that we have three groups of operators - arithmetic, booloan and
relational ones.
And now the definition of statements:
''S'' ::= x := ''a'' | skip | ''S1''; ''S2'' | ''( S )'' | if ''b'' then ''S1'' else ''S2'' | while ''b'' do ''S''
We probably want to parse that into some internal representation of the
language (abstract syntax tree). Therefore we need to define the data
structures for the expressions and statements.
== Data structures ==
We need to take care of boolean and arithmetic expressions and the
appropriate operators. First let's look at the boolean expressions:
> data BExpr = BConst Bool
> | Not BExpr
> | And BExpr BExpr
> | Greater AExpr AExpr
> deriving (Show, Eq)
Now we define the types for arithmetic expressions:
> data AExpr = Var String
> | Const Integer
> | Neg AExpr
> | Add AExpr AExpr
> deriving (Show, Eq)
Finally let's take care of the statements:
> data Stmt = Seq [Stmt]
> | Assign String AExpr
> | If BExpr Stmt Stmt
> | While BExpr Stmt
> | Skip
> deriving (Show, Eq)
== Lexer ==
Having all the data structures we can go on with writing the code to do actual
parsing. First of all we create the language definition using Haskell's record
syntax and the constructor emptyDef (from
Text.ParserCombinators.Parsec.Language):
> languageDef =
> emptyDef { Token.commentStart = "/*"
> , Token.commentEnd = "*/"
> , Token.commentLine = "//"
> , Token.identStart = letter
> , Token.identLetter = alphaNum
> , Token.reservedNames = [ "if"
> , "then"
> , "else"
> , "while"
> , "do"
> , "skip"
> , "true"
> , "false"
> , "not"
> , "and"
> , "or"
> ]
> , Token.reservedOpNames = ["+", "-", "*", "/", ":="
> , "<", ">", "and", "or", "not"
> ]
> }
This creates a language definition that accepts the C-style comments, requires
that the identifiers start with a letter, and end with alphanumeric
characters. Moreover there is a number of reserved names, that cannot be used
by the identifiers.
Having the above definition we can create a lexer:
> lexer = Token.makeTokenParser languageDef
lexer contains a number of lexical parsers, that we can us to parse
identifiers, reserved words/operations, etc. Now we can select/extract them in
the following way:
> identifier = Token.identifier lexer -- parses an identifier
> reserved = Token.reserved lexer -- parses a reserved name
> reservedOp = Token.reservedOp lexer -- parses an operator
> parens = Token.parens lexer -- parses surrounding parenthesis:
> -- parens p
> -- takes care of the parenthesis and
> -- uses p to parse what's inside them
> integer = Token.integer lexer -- parses an integer
> semi = Token.semi lexer -- parses a semicolon
> whiteSpace = Token.whiteSpace lexer -- parses whitespace
This isn't really necessary, but should make the code much more readable (also
this is the reason why we used the qualified import of
Text.ParserCombinators.Parsec.Token). Now we can use them to
parse the source code at the token level. One of the nice features of these
parsers is that they take care of all whitespace after the tokens.
== Main parser ==
As already mentioned a program in this language is simply a statement, so the
main parser should basically only parse a statement. But remember to take care of
initial whitespace - our parsers only get rid of whitespace after the tokens!
> whileParser :: Parser Stmt
> whileParser = whiteSpace >> statement
Now because any statement might be actually a sequence of statements separated
by semicolon, we use sepBy1 to parse at least one statement. The
result is a list of statements. We also allow grouping statements by the
parenthesis, which is useful, for instance, in the while loop.
> statement :: Parser Stmt
> statement = parens statement
> <|> sequenceOfStmt
> sequenceOfStmt =
> do list <- (sepBy1 statement' semi)
> -- If there's only one statement return it without using Seq.
> return $ if length list == 1 then head list else Seq list
Now a single statement is quite simple, it's either an if conditional, a while
loop, an assignment or simply a skip statement. We use <|> to
express choice. So a <|> b will first try parser a
and if it fails (but without actually consuming any input) then parser
b will be used. Note: this means that the order is important.
> statement' :: Parser Stmt
> statement' = ifStmt
> <|> whileStmt
> <|> skipStmt
> <|> assignStmt
If you have a parser that might fail after consuming some input, and you still
want to try the next parser, you should look into try combinator.
For instance try p <|> q will try parsing with p and
if it fails, even after consuming the input, the q parser will be
used as if nothing has been consumed by p.
Now let's define the parsers for all the possible statements. This is quite
straightforward as we just use the parsers from the lexer and then use all the
necessary information to create appropriate data structures.
> ifStmt :: Parser Stmt
> ifStmt =
> do reserved "if"
> cond <- bExpression
> reserved "then"
> stmt1 <- statement
> reserved "else"
> stmt2 <- statement
> return $ If cond stmt1 stmt2
> whileStmt :: Parser Stmt
> whileStmt =
> do reserved "while"
> cond <- bExpression
> reserved "do"
> stmt <- statement
> return $ While cond stmt
> assignStmt :: Parser Stmt
> assignStmt =
> do var <- identifier
> reservedOp ":="
> expr <- aExpression
> return $ Assign var expr
> skipStmt :: Parser Stmt
> skipStmt = reserved "skip" >> return Skip
== Expressions ==
What's left is to parse the expressions. Fortunately Parsec provides a very
easy way to do that. Let's define the arithmetic and boolean expressions:
> aExpression :: Parser AExpr
> aExpression = buildExpressionParser aOperators aTerm
> bExpression :: Parser BExpr
> bExpression = buildExpressionParser bOperators bTerm
Now we have to define the lists with operator precedence, associativity and
what constructors to use in each case.
> aOperators = [ [Prefix (reservedOp "-" >> return (Neg )) ]
> , [Infix (reservedOp "+" >> return (Add )) AssocLeft]
> ]
> bOperators = [ [Prefix (reservedOp "not" >> return (Not )) ]
> , [Infix (reservedOp "and" >> return (And )) AssocLeft]
> ]
In case of Prefix operators it is enough to specify which one should be parsed
and what is the associated data constructor. Infix operators are defined
similarly, but it's necessary to add information about associativity. Note
that the operator precedence depends only on the order of the elements in the
list.
Finally we have to define the terms. In case of arithmetic expressions, it is
quite simple:
> aTerm = parens aExpression
> <|> liftM Var identifier
> <|> liftM Const integer
However, the term in a boolean expression is a bit more tricky. In this case,
a term can also be an expression with relational operator consisting of
arithmetic expressions.
> bTerm = parens bExpression
> <|> (reserved "true" >> return (BConst True ))
> <|> (reserved "false" >> return (BConst False))
> <|> rExpression
Therefore we have to define a parser for relational expressions:
> rExpression =
> do a1 <- aExpression
> op <- reservedOp ">"
> a2 <- aExpression
> return $ Greater a1 a2
And that's it. We have a quite simple parser able to parse a few statements and
arithmetic/boolean expressions.
== Notes ==
If you want to experiment with the parser inside ghci, these functions might be
handy:
> parseString :: String -> Stmt
> parseString str =
> case parse whileParser "" str of
> Left e -> error $ show e
> Right r -> r
> parseFile :: String -> IO Stmt
> parseFile file =
> do program <- readFile file
> case parse whileParser "" program of
> Left e -> print e >> fail "parse error"
> Right r -> return r
Now you can simply load the module in ghci and then do
ast <- parseFile "" to parse a file and get the
result if parsing was successful. If you already have a string with
the program, you can use parseString.
[[Category:How to]]