luthor-0.0.2: Tools for lexing and utilizing lexemes that integrate with Parsec.

Safe HaskellSafe
LanguageHaskell98

Text.Luthor.Syntax

Contents

Description

Parsec parsers for common tokens. This module is written with bottom-up programming in mind so that is stays flexible. For example, although we export several numerical token parsers, we also export the sub-token parsers they were built out of: numSign, numNatural, numExponent, &c.

Synopsis

Basic Characters and Strings

char :: Stream s m Char => Char -> ParsecT s u m Char

char c parses a single character c. Returns the parsed character (i.e. c).

 semiColon  = char ';'

string :: Stream s m Char => String -> ParsecT s u m String Source

string s parses a sequence of characters given by s. Returns the parsed string (i.e. s). Unlike the Parsec version, this combinator never consumes input on failure.

 adrenalineWord  =  string "fight" 
                <|> string "flight"

charI :: Stream s m Char => Char -> ParsecT s u m Char Source

Parse a single character, case-insensitive. Normalized to lowercase.

stringI :: Stream s m Char => String -> ParsecT s u m String Source

Parse a string, case-insensitive. If this parser fails, it consumes no input. Normalized to lowercase.

anyChar :: Stream s m Char => ParsecT s u m Char

This parser succeeds for any character. Returns the parsed character.

oneOf :: Stream s m Char => [Char] -> ParsecT s u m Char

oneOf cs succeeds if the current character is in the supplied list of characters cs. Returns the parsed character. See also satisfy.

  vowel  = oneOf "aeiou"

noneOf :: Stream s m Char => [Char] -> ParsecT s u m Char

As the dual of oneOf, noneOf cs succeeds if the current character not in the supplied list of characters cs. Returns the parsed character.

 consonant = noneOf "aeiou"

aChar :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char Source

Parse a single char when it satisfies the predicate. Fails when the next input character does not satisfy the predicate.

manyChar :: Stream s m Char => (Char -> Bool) -> ParsecT s u m String Source

Parse zero or more characters satisfying the predicate, c.f. many1Char.

many1Char :: Stream s m Char => (Char -> Bool) -> ParsecT s u m String Source

Parse one or more characters satisfying the predicate, c.f. manyChar.

Common Classes

upAlpha :: Stream s m Char => ParsecT s u m Char Source

Rule UPALPHA from RFC2616 §2.2

loAlpha :: Stream s m Char => ParsecT s u m Char Source

Rule LOALPHA from RFC2616 §2.2

alpha :: Stream s m Char => ParsecT s u m Char Source

Rule ALPHA from RFC2616 §2.2

digit :: Stream s m Char => ParsecT s u m Char Source

Rule DIGIT from RFC2616 §2.2

hexDigit :: Stream s m Char => ParsecT s u m Char

Parses a hexadecimal digit (a digit or a letter between 'a' and 'f' or 'A' and 'F'). Returns the parsed character.

octDigit :: Stream s m Char => ParsecT s u m Char

Parses an octal digit (a character between '0' and '7'). Returns the parsed character.

binDigit :: Stream s m Char => ParsecT s u m Char Source

Parse a binary digit ('0' or '1')

ctl :: Stream s m Char => ParsecT s u m Char Source

Rule CTL from RFC2616 §2.2

asciiText :: Stream s m Char => ParsecT s u m Char Source

A single printable ASCII character, as the TEXT rule from RFC2616 §2.2

uniText :: Stream s m Char => ParsecT s u m Char Source

A single printable unicode character, a generalization of text. See uniPrint.

Common Special Literals

cr :: Stream s m Char => ParsecT s u m () Source

A carriage return (ASCII 13)

lf :: Stream s m Char => ParsecT s u m () Source

A line feed (ASCII 10)

sp :: Stream s m Char => ParsecT s u m () Source

A space (ASCII 32)

ht :: Stream s m Char => ParsecT s u m () Source

A horizonal tab (ASCII 9)

sq :: Stream s m Char => ParsecT s u m () Source

A single quote

dq :: Stream s m Char => ParsecT s u m () Source

A double quote

colon :: Stream s m Char => ParsecT s u m () Source

A colon (:)

semicolon :: Stream s m Char => ParsecT s u m () Source

A semicolon (;)

dot :: Stream s m Char => ParsecT s u m () Source

A period (.)

comma :: Stream s m Char => ParsecT s u m () Source

A comma (,)

ellipsis2 :: Stream s m Char => ParsecT s u m () Source

Two dots (..)

ellipsis3 :: Stream s m Char => ParsecT s u m () Source

Three dots (...)

bsEsc :: Stream s m Char => (Char -> Bool) -> ParsecT s u m Char Source

A backslash-escape: backslash followed by a single character satisfying the predicate.

yes :: Stream s m Char => ParsecT s u m Bool Source

Parse "yes", "y", or "1", case-insensitive. Return True.

no :: Stream s m Char => ParsecT s u m Bool Source

Parse "no", "n", or "0", case-insensitive. Return False.

yesno :: Stream s m Char => ParsecT s u m Bool Source

Parse yes or no and return whether the answer was yes-ful.

Programming Idioms

Whitespace

lws :: Stream s m Char => ParsecT s u m String Source

Short for "linear whitespace": one or more spaces or tabs. Similar to rule LWS from RFC2616 §2.2, but without line folding.

newline :: Stream s m Char => ParsecT s u m () Source

Parse a single line feed or carriage return. Does not succeed at end of file.

lineBreak :: Stream s m Char => ParsecT s u m () Source

Recognize when the parser is at a line break (LF, CR, or end of input) If the break is due to a CR or LF, consume it.

crlf :: Stream s m Char => ParsecT s u m () Source

A carriage return + line feed sequence.

bsnl :: Stream s m Char => ParsecT s u m () Source

Parse a backslash followed by a newline

bsnlwsbs :: Stream s m Char => ParsecT s u m () Source

Parse a backslash followed by a newline, then linear whitespace (lws) and finally another backslash.

data IndentPolicy Source

Determine how the depth of indentation is calculated.

Constructors

DontMix [Char]

Any of the passed Chars can be used, but allow only one kind of character in a line. Depth is number of those characters.

Convert [(Char, Int)]

Allow any mix of of the passed Chars. Calculate depth by assigning a number to each of character by kind and summing.

dentation :: Stream s m Char => IndentPolicy -> ParsecT s u m Int Source

Parse a lineBreak followed by whitespace characters. Return the depth of indentation.

The acceptable whitespace characters and the method by whch to caclulate depth is determined by an IntentPolicy.

WARNING: Do not use this combinator with the Lex module, it will break because Parsec is sissy.

Identifiers

many1Not Source

Arguments

:: Stream s m Char 
=> (Char -> Bool)

Whether a character is allowed in the identifier

-> (Char -> Bool)

Whether the character is disallowed as the first character of the identifier

-> ParsecT s u m String 

Parse one or more characters that satisfy a predicate, but with additional restrictions on the first character parsed.

This is especially useful for identifiers (such as /[a-zA-Z_][a-zA-Z_0-9]*/) and certain kinds of numbers (such as /[1-9][0-9]*/).

identifier = charClass "a-zA-Z0-9_" `many1Not` charClass "0-9"
naturalLiteral = stringToInteger 10 <$> charClass "0-9" `many1Not` (=='0')

sigilized Source

Arguments

:: Stream s m Char 
=> [(Char, sigil)]

The sigils and their corresponding semantics

-> ParsecT s u m a

An identifier parser

-> ParsecT s u m (sigil, a) 

Parse a sigil character immediately followed by an identifier.

data Sigil = Scalar | Array
name = sigilized (zip "$@" [Scalar, Array]) $ many1Not (charClass' "_a-zA-Z0-9") (charClass "0-9")

Punctuation

inParens :: Stream s m Char => ParsecT s u m a -> ParsecT s u m a Source

Parse between open and close parenthesis.

inBrackets :: Stream s m Char => ParsecT s u m a -> ParsecT s u m a Source

Parse between open and close square brackets ([...]).

inBraces :: Stream s m Char => ParsecT s u m a -> ParsecT s u m a Source

Parse between open and close curly braces ({...}).

inAngles :: Stream s m Char => ParsecT s u m a -> ParsecT s u m a Source

Parse between open and close angles (<...>).

Number Parts

numSign :: Stream s m Char => ParsecT s u m Integer Source

Parse a minus or plus sign and return the appropriate multiplier.

numBase :: Stream s m Char => ParsecT s u m Int Source

Parse "0x", "0o", or "0b" case-insensitive and return the appropriate base. If none of these parse, return base 10.

numNatural :: Stream s m Char => Int -> ParsecT s u m Integer Source

Parse many digits in the passed base and return the corresponding integer.

numAfterPoint :: Stream s m Char => Int -> ParsecT s u m Rational Source

Parse many digits in the passed base and return the appropriate rational.

numDenominator :: Stream s m Char => Int -> ParsecT s u m Rational Source

Parse a natural in the passed base and return its reciprocal.

numOptSign :: Stream s m Char => ParsecT s u m Integer Source

Optional sign as numSign, defaults to positive.

numInteger :: Stream s m Char => Int -> ParsecT s u m Integer Source

Parse an optional sign (as numOptSign), then a natural number of the specified base (as in numNatural).

xDigit :: Stream s m Char => Int -> ParsecT s u m Char Source

Parse a digit in the passed base: 2, 8, 10 or 16.

stringToInteger :: Integral n => Int -> String -> n Source

Interpret a string as an integer in the passed base.

stringToMantissa :: Int -> String -> Ratio Integer Source

Interpret a string as a mantissa in the passed base.

Numbers

integer :: Stream s m Char => ParsecT s u m Integer Source

Parse an integer: optional sign, then a number of digits. Bases 10, 16, 8 and 2 are supported with appropriate prefixes as in numBase before the digits.

rational :: Stream s m Char => ParsecT s u m Rational Source

Parse a rational number: an optional sign, then two sequences of digits separated by a slash. Return the ratio of the appropriate sign between the two numbers. Bases 10, 16, 8 and 2 are supported.

scientific :: Stream s m Char => ParsecT s u m Rational Source

Parse a number in scientific notation: an optional sign, then a radix mark, two sequences of digits separated by a dot, and finally an optional exponent, which is an exponent letter, an optional sign and finally one or more digits in the same base.

The base of the exponent is the same as the base of the significand. In base ten, the exponent letter is either e or p, but in other bases, it must be p (since e is already a hexdigit).

Note that digits are required on both sides of the (hexa)decimal point, so neither 0. nor .14 are recognized.

hexOctet :: (Stream s m Char, Integral n) => ParsecT s u m n Source

Parse a two-digit hexadacimal number.

Character Escapes

letterEsc :: Stream s m Char => [(Char, Char)] -> ParsecT s u m Char Source

Parse a backslash and another character; use the passed table to determine the returned character.

decimalEsc :: Stream s m Char => ParsecT s u m Char Source

Escape sequences for any unicode code point. Represented by a backslash + one or more decimal digits. Fails when the code point is not representable in unicode.

asciiEsc :: Stream s m Char => ParsecT s u m Char Source

Escape sequences for bytes (including ASCII). Represented by a backslash + two hexdigits.

loUniEsc :: Stream s m Char => ParsecT s u m Char Source

Escape sequences in the Unicode Basic Multilingual Plane (BMP). C.f. hiUniEsc. Represented by a backslash+lowercase u followed by four hexdigits.

hiUniEsc :: Stream s m Char => ParsecT s u m Char Source

Escape sequences outside the Unicode BMP. C.f. loUniEsc. Represented by a backslash+uppercase U followed by five or six hexdigits totalling at most 0x10FFFF

uniEsc :: Stream s m Char => ParsecT s u m Char Source

A unicode escape, either as loUniEsc or hiUniEsc.

cEscapes :: [(Char, Char)] Source

Common characetr escapes in programming language string literals:

\0 -> ASCII 00 (nul)
\a -> ASCII 07 (alarm/bell)
\b -> ASCII 08 (backspace)
\e -> ASCII 1B (escape)
\f -> ASCII 0C (form feed)
\n -> ASCII 0A (line feed)
\r -> ASCII 0D (carriage return)
\t -> ASCII 09 (horizontal tab)
\v -> ASCII 0B (vertical tab)
\' -> single-quote
\" -> double-quote
\\ -> backslash

String Literals

sqString :: Stream s m Char => ParsecT s u m String Source

Parse a single-quoted string with no escape sequences, except that a single-quote in the string is encoded as two single-quote characters.

This is as single-quoted strings in SQL and Rc (the shell in Plan9). It is an excellent encoding for strings because they are so easy to validate from untrusted input and escape for rendering, whether it is done by human or machine.

dqString :: Stream s m Char => [(Char, Char)] -> ParsecT s u m String Source

Parse a double-quoted string with common backslash escape sequences.

  • We use letterEsc with the passed table of contents. There is no default table.
  • Also, decimalEsc, asciiEsc and uniEsc are allowed.
  • Further, the \& stands for no character: it literally adds nothing to the string in which it appears, but it can be useful as in "\127\&0".
  • Finally, lines can be folded with a backslash-newline-backslash, ignoring any lws between the newline and the second backslash. This is preferred over a simple backslash-newline, as it reduces any need to remember how leading whitespace is treated after a line-fold.

The escapes parsed by letterEsc are preferred to the other escape sequences. Note that there are no escape sequences for backslash or double-quote by default (aside from a numerical escape), so you'll need to include them in the table for letterEsc.

Comments

lineComment :: Stream s m Char => String -> ParsecT s u m String Source

Parse a comment beginning with the passed string and ending at (but not including) a lineBreak.

blockComment Source

Arguments

:: Stream s m Char 
=> String

Start the block comment

-> String

End the block comment

-> ParsecT s u m String 

Parse a non-nesting comment beginning at the first passed string and ending with (and including) the second passed string.

C.f nestingComment.

nestingComment Source

Arguments

:: Stream s m Char 
=> String

Start a comment

-> String

End a comment

-> ParsecT s u m String 

Parse a nesting block comment.

C.f. blockComment.

Character Classes

charClass :: String -> Char -> Bool Source

Match any character in a set.

vowel = charClass "aeiou"

Range notation is supported.

halfAlphabet = charClass "a-nA-N"

To add a literal '-' to a set, place it at the beginning or end of the string.

You may also invert the set by placing a caret at the beginning of the string.

nonVowel = "^aeiou"

To add a literal '^' to a set, place it somewhere other than at the beginning of the string.

uniPrint :: Char -> Bool Source

The class of printable unicode characters, including linear whitesapce.

  • Accepts: Letter, Number, Symbol, Space, Punctuation/Quote, Mark, Format, PrivateUse
  • Does not Accept: LineSeparator, ParagraphSeparator, Control, Surrogate, NotAssigned

uniPrintMinus :: (Char -> Bool) -> Char -> Bool Source

Accepts characters from uniPrint, except those which satisfy the passed predicate.

uniId :: Char -> Bool Source

Accepts a wide variety of unicode characters. This is the largest class of characters which might be used in a programming language that allows unicode identifiers, and probably include a little too much.

  • Accepts: Letter, Mark, Number, Punctuation/Quote, Symbol
  • Does not Accept: Space, LineSeparator, ParagraphSeparator, Control, Format, Surrogate, PrivateUse, NotAssigned

uniIdMinus :: (Char -> Bool) -> Char -> Bool Source

Accepts characters from uniId, except those which satisfy the passed predicate.