You don't normally need to use this Lex module directly - it is called automatically by the parser. (This interface is only exposed for debugging purposes.)
This is a hand-written lexer for tokenising the text of an XML
document so that it is ready for parsing. It attaches position
information in (line,column) format to every token. The main
entry point is xmlLex
. A secondary entry point, xmlReLex
, is
provided for when the parser needs to stuff a string back onto
the front of the text and re-tokenise it (typically when expanding
macros).
As one would expect, the lexer is essentially a small finite state machine.
- xmlLex :: String -> String -> [Token]
- xmlReLex :: Posn -> String -> [Token]
- posInNewCxt :: String -> Maybe Posn -> Posn
- type Token = Either String (Posn, TokenT)
- data Posn = Pn String !Int !Int (Maybe Posn)
- data TokenT
- = TokCommentOpen
- | TokCommentClose
- | TokPIOpen
- | TokPIClose
- | TokSectionOpen
- | TokSectionClose
- | TokSection Section
- | TokSpecialOpen
- | TokSpecial Special
- | TokEndOpen
- | TokEndClose
- | TokAnyOpen
- | TokAnyClose
- | TokSqOpen
- | TokSqClose
- | TokEqual
- | TokQuery
- | TokStar
- | TokPlus
- | TokAmp
- | TokSemi
- | TokHash
- | TokBraOpen
- | TokBraClose
- | TokPipe
- | TokPercent
- | TokComma
- | TokQuote
- | TokName String
- | TokFreeText String
- | TokNull
- data Special
- data Section
Entry points to the lexer
xmlLex :: String -> String -> [Token]Source
The first argument to xmlLex
is the filename (used for source positions,
especially in error messages), and the second is the string content of
the XML file.
xmlReLex :: Posn -> String -> [Token]Source
xmlReLex
is used when the parser expands a macro (PE reference).
The expansion of the macro must be re-lexed as if for the first time.
posInNewCxt :: String -> Maybe Posn -> PosnSource
posInNewCxt name pos
creates a new source position from an old one.
It is used when opening a new file (e.g. a DTD inclusion), to denote
the start of the file name
, but retain the stacked information that
it was included from the old pos
.
Token and position types
type Token = Either String (Posn, TokenT)Source
All tokens are paired up with a source position.
Lexical errors are passed back through the Either
type.
Source positions contain a filename, line, column, and an inclusion point, which is itself another source position, recursively.
The basic token type.
TokCommentOpen | <!-- |
TokCommentClose |
|
TokPIOpen | <? |
TokPIClose | ?> |
TokSectionOpen | <![ |
TokSectionClose | ]]> |
TokSection Section | CDATA INCLUDE IGNORE etc |
TokSpecialOpen | <! |
TokSpecial Special | DOCTYPE ELEMENT ATTLIST etc |
TokEndOpen | </ |
TokEndClose | /> |
TokAnyOpen | < |
TokAnyClose | |
TokSqOpen | [ |
TokSqClose | ] |
TokEqual | = |
TokQuery | ? |
TokStar | * |
TokPlus | + |
TokAmp | & |
TokSemi | ; |
TokHash | # |
TokBraOpen | ( |
TokBraClose | ) |
TokPipe | | |
TokPercent | % |
TokComma | , |
TokQuote | '' or "" |
TokName String | begins with letter, no spaces |
TokFreeText String | any character data |
TokNull | fake token |