Source: http://www.haskell.org/haskellwiki/Parsing_a_simple_imperative_language This tutorial will present how to parse a subset of a simple imperative programming language called WHILE (introduced in a book "Principles of Program Analysis" by Nielson, Nielson and Hankin). It includes only a few statements and basic boolean/arithmetic expressions, which makes it a nice material for a tutorial. == Imports == First let's specify the name of the module: > module Lang where And then import the necessary libraries: > import System.IO > import Control.Monad > import Text.ParserCombinators.Parsec > import Text.ParserCombinators.Parsec.Expr > import Text.ParserCombinators.Parsec.Language > import qualified Text.ParserCombinators.Parsec.Token as Token == The language == The grammar for expressions is defined as follows: ''a'' ::= ''x'' | ''n'' | - ''a'' | ''a'' ''opa'' ''a'' ''b'' ::= true | false | not ''b'' | ''b'' ''opb'' ''b'' | ''a'' ''opr'' ''a'' ''opa'' ::= + | - | * | / ''opb'' ::= and | or ''opr'' ::= > | < Note that we have three groups of operators - arithmetic, booloan and relational ones. And now the definition of statements: ''S'' ::= x := ''a'' | skip | ''S1''; ''S2'' | ''( S )'' | if ''b'' then ''S1'' else ''S2'' | while ''b'' do ''S'' We probably want to parse that into some internal representation of the language (abstract syntax tree). Therefore we need to define the data structures for the expressions and statements. == Data structures == We need to take care of boolean and arithmetic expressions and the appropriate operators. First let's look at the boolean expressions: > data BExpr = BConst Bool > | Not BExpr > | And BExpr BExpr > | Greater AExpr AExpr > deriving (Show, Eq) Now we define the types for arithmetic expressions: > data AExpr = Var String > | Const Integer > | Neg AExpr > | Add AExpr AExpr > deriving (Show, Eq) Finally let's take care of the statements: > data Stmt = Seq [Stmt] > | Assign String AExpr > | If BExpr Stmt Stmt > | While BExpr Stmt > | Skip > deriving (Show, Eq) == Lexer == Having all the data structures we can go on with writing the code to do actual parsing. First of all we create the language definition using Haskell's record syntax and the constructor emptyDef (from Text.ParserCombinators.Parsec.Language): > languageDef = > emptyDef { Token.commentStart = "/*" > , Token.commentEnd = "*/" > , Token.commentLine = "//" > , Token.identStart = letter > , Token.identLetter = alphaNum > , Token.reservedNames = [ "if" > , "then" > , "else" > , "while" > , "do" > , "skip" > , "true" > , "false" > , "not" > , "and" > , "or" > ] > , Token.reservedOpNames = ["+", "-", "*", "/", ":=" > , "<", ">", "and", "or", "not" > ] > } This creates a language definition that accepts the C-style comments, requires that the identifiers start with a letter, and end with alphanumeric characters. Moreover there is a number of reserved names, that cannot be used by the identifiers. Having the above definition we can create a lexer: > lexer = Token.makeTokenParser languageDef lexer contains a number of lexical parsers, that we can us to parse identifiers, reserved words/operations, etc. Now we can select/extract them in the following way: > identifier = Token.identifier lexer -- parses an identifier > reserved = Token.reserved lexer -- parses a reserved name > reservedOp = Token.reservedOp lexer -- parses an operator > parens = Token.parens lexer -- parses surrounding parenthesis: > -- parens p > -- takes care of the parenthesis and > -- uses p to parse what's inside them > integer = Token.integer lexer -- parses an integer > semi = Token.semi lexer -- parses a semicolon > whiteSpace = Token.whiteSpace lexer -- parses whitespace This isn't really necessary, but should make the code much more readable (also this is the reason why we used the qualified import of Text.ParserCombinators.Parsec.Token). Now we can use them to parse the source code at the token level. One of the nice features of these parsers is that they take care of all whitespace after the tokens. == Main parser == As already mentioned a program in this language is simply a statement, so the main parser should basically only parse a statement. But remember to take care of initial whitespace - our parsers only get rid of whitespace after the tokens! > whileParser :: Parser Stmt > whileParser = whiteSpace >> statement Now because any statement might be actually a sequence of statements separated by semicolon, we use sepBy1 to parse at least one statement. The result is a list of statements. We also allow grouping statements by the parenthesis, which is useful, for instance, in the while loop. > statement :: Parser Stmt > statement = parens statement > <|> sequenceOfStmt > sequenceOfStmt = > do list <- (sepBy1 statement' semi) > -- If there's only one statement return it without using Seq. > return $ if length list == 1 then head list else Seq list Now a single statement is quite simple, it's either an if conditional, a while loop, an assignment or simply a skip statement. We use <|> to express choice. So a <|> b will first try parser a and if it fails (but without actually consuming any input) then parser b will be used. Note: this means that the order is important. > statement' :: Parser Stmt > statement' = ifStmt > <|> whileStmt > <|> skipStmt > <|> assignStmt If you have a parser that might fail after consuming some input, and you still want to try the next parser, you should look into try combinator. For instance try p <|> q will try parsing with p and if it fails, even after consuming the input, the q parser will be used as if nothing has been consumed by p. Now let's define the parsers for all the possible statements. This is quite straightforward as we just use the parsers from the lexer and then use all the necessary information to create appropriate data structures. > ifStmt :: Parser Stmt > ifStmt = > do reserved "if" > cond <- bExpression > reserved "then" > stmt1 <- statement > reserved "else" > stmt2 <- statement > return $ If cond stmt1 stmt2 > whileStmt :: Parser Stmt > whileStmt = > do reserved "while" > cond <- bExpression > reserved "do" > stmt <- statement > return $ While cond stmt > assignStmt :: Parser Stmt > assignStmt = > do var <- identifier > reservedOp ":=" > expr <- aExpression > return $ Assign var expr > skipStmt :: Parser Stmt > skipStmt = reserved "skip" >> return Skip == Expressions == What's left is to parse the expressions. Fortunately Parsec provides a very easy way to do that. Let's define the arithmetic and boolean expressions: > aExpression :: Parser AExpr > aExpression = buildExpressionParser aOperators aTerm > bExpression :: Parser BExpr > bExpression = buildExpressionParser bOperators bTerm Now we have to define the lists with operator precedence, associativity and what constructors to use in each case. > aOperators = [ [Prefix (reservedOp "-" >> return (Neg )) ] > , [Infix (reservedOp "+" >> return (Add )) AssocLeft] > ] > bOperators = [ [Prefix (reservedOp "not" >> return (Not )) ] > , [Infix (reservedOp "and" >> return (And )) AssocLeft] > ] In case of Prefix operators it is enough to specify which one should be parsed and what is the associated data constructor. Infix operators are defined similarly, but it's necessary to add information about associativity. Note that the operator precedence depends only on the order of the elements in the list. Finally we have to define the terms. In case of arithmetic expressions, it is quite simple: > aTerm = parens aExpression > <|> liftM Var identifier > <|> liftM Const integer However, the term in a boolean expression is a bit more tricky. In this case, a term can also be an expression with relational operator consisting of arithmetic expressions. > bTerm = parens bExpression > <|> (reserved "true" >> return (BConst True )) > <|> (reserved "false" >> return (BConst False)) > <|> rExpression Therefore we have to define a parser for relational expressions: > rExpression = > do a1 <- aExpression > op <- reservedOp ">" > a2 <- aExpression > return $ Greater a1 a2 And that's it. We have a quite simple parser able to parse a few statements and arithmetic/boolean expressions. == Notes == If you want to experiment with the parser inside ghci, these functions might be handy: > parseString :: String -> Stmt > parseString str = > case parse whileParser "" str of > Left e -> error $ show e > Right r -> r > parseFile :: String -> IO Stmt > parseFile file = > do program <- readFile file > case parse whileParser "" program of > Left e -> print e >> fail "parse error" > Right r -> return r Now you can simply load the module in ghci and then do ast <- parseFile "" to parse a file and get the result if parsing was successful. If you already have a string with the program, you can use parseString. [[Category:How to]]