gigaparsec-0.3.0.0: Refreshed parsec-style library for compatibility with Scala parsley
License: BSD-3-Clause
Maintainer: Jamie Willis, Gigaparsec Maintainers
Stability: stable
Safe Haskell: Safe
Language: Haskell2010

Text.Gigaparsec.Errors.TokenExtractors

Description

This module contains implementations of token extractors that can be used in the Text.Gigaparsec.Errors.ErrorBuilder to decide how to extract unexpected tokens from the residual input left over from a parse error.

These are common strategies, and one of them is likely to provide what is needed. They are all careful to handle unprintable characters and whitespace sensibly, and account for Unicode codepoints that are wider than a single 16-bit character.

Since: 0.2.5.0

Documentation

data Token Source #

This type represents an extracted token returned by unexpectedToken in ErrorBuilder.

There is deliberately no analogue for EndOfInput because we guarantee that non-empty residual input is provided to token extraction.

Since: 0.2.5.0

Constructors

Raw

This is a token that is directly extracted from the residual input itself.

Fields

  • !String

    the characters extracted verbatim from the input.

Named

This is a token that has been given a name, and is treated like a labelled item.

Fields

  • !String

    the description of the token.

  • !Word

    the amount of residual input this token ate.
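
The shape of this type can be sketched in plain Haskell (a paraphrase of the constructors documented above, not the library source):

```haskell
-- Sketch of the Token type described above; the real definition lives in
-- Text.Gigaparsec.Errors.TokenExtractors.
data Token
  = Raw !String          -- characters taken verbatim from the residual input
  | Named !String !Word  -- a description, and how much input the token ate
  deriving (Eq, Show)
```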

type TokenExtractor Source #

Arguments

 = NonEmpty Char

the remaining input, cs, at point of failure.

-> Word

the input the parser tried to read when it failed (this is not guaranteed to be smaller than the length of cs, but is guaranteed to be greater than 0).

-> Bool

whether this error was generated as part of "lexing", or by a wider parser (see markAsToken).

-> Token

a token extracted from cs that will be used as part of the unexpected message.

Type alias for token extractors, matches the shape of unexpectedToken.

Since: 0.2.5.0
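
As a standalone sketch, the alias and a hand-rolled extractor written against it might look like this (the Token definition is repeated from above; firstCharExtractor is a hypothetical example, not part of the library):

```haskell
import Data.List.NonEmpty (NonEmpty(..))

-- Sketch of the Token shape from above (not the library source).
data Token = Raw !String | Named !String !Word deriving (Eq, Show)

-- Sketch of the alias: residual input, demanded width, lexing flag -> token.
type TokenExtractor = NonEmpty Char -> Word -> Bool -> Token

-- A toy extractor: always report the first character of the residual
-- input, as a named token one position wide.
firstCharExtractor :: TokenExtractor
firstCharExtractor (c :| _) _demanded _lexing = Named (show c) 1
```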

tillNextWhitespace Source #

Arguments

:: Bool

should the extractor cap the token to the amount of input the parser demanded?

-> (Char -> Bool)

what counts as a space character

-> TokenExtractor 

This extractor provides an implementation for unexpectedToken: it will construct a token that extends to the next available whitespace in the remaining input. It can be configured to restrict this token to the minimum of the next whitespace or however much input the parser demanded.

In the case of unprintable characters or whitespace, this extractor will favour reporting a more meaningful name.

Since: 0.2.5.0
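
The behaviour can be approximated in a simplified standalone sketch (the real extractor also renames whitespace and unprintable characters, which is omitted here):

```haskell
import Data.Char (isSpace)
import Data.List.NonEmpty (NonEmpty(..))

data Token = Raw !String | Named !String !Word deriving (Eq, Show)
type TokenExtractor = NonEmpty Char -> Word -> Bool -> Token

-- Simplified sketch of tillNextWhitespace: take characters up to the first
-- "space" (by the supplied predicate), optionally capped at the amount of
-- input the parser demanded.
tillNextWhitespaceSketch :: Bool -> (Char -> Bool) -> TokenExtractor
tillNextWhitespaceSketch cap isSpc (c :| cs) demanded _lexing =
  let tok = takeWhile (not . isSpc) (c : cs)
  in Raw (if cap then take (fromIntegral demanded) tok else tok)
```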

singleChar :: TokenExtractor Source #

This extractor provides an implementation for unexpectedToken: it will unconditionally report the first character in the remaining input as the problematic token.

In the case of unprintable characters or whitespace, this extractor will favour reporting a more meaningful name.

Since: 0.2.5.0
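
A simplified sketch of this behaviour, including the preference for naming whitespace and unprintables (the exact names the library uses may differ):

```haskell
import Data.Char (isPrint, isSpace)
import Data.List.NonEmpty (NonEmpty(..))

data Token = Raw !String | Named !String !Word deriving (Eq, Show)
type TokenExtractor = NonEmpty Char -> Word -> Bool -> Token

-- Simplified sketch of singleChar: always report the first character,
-- preferring a readable name for whitespace and unprintable characters.
singleCharSketch :: TokenExtractor
singleCharSketch (c :| _) _demanded _lexing
  | c == '\n'       = Named "newline" 1
  | isSpace c       = Named "space" 1
  | not (isPrint c) = Named (show c) 1
  | otherwise       = Raw [c]
```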

matchParserDemand :: TokenExtractor Source #

This extractor provides an implementation for unexpectedToken: it will make a token as wide as the amount of input the parser tried to consume when it failed.

In the case of unprintable characters or whitespace, this extractor will favour reporting a more meaningful name.

Since: 0.2.5.0
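
In a simplified sketch, the token is exactly as wide as the demand, clipped to the available input (since the demand is not guaranteed to be smaller than the residual input):

```haskell
import Data.List.NonEmpty (NonEmpty(..))

data Token = Raw !String | Named !String !Word deriving (Eq, Show)
type TokenExtractor = NonEmpty Char -> Word -> Bool -> Token

-- Simplified sketch of matchParserDemand: take as many characters as the
-- parser tried to consume when it failed, clipped to what is available.
matchParserDemandSketch :: TokenExtractor
matchParserDemandSketch (c :| cs) demanded _lexing =
  Raw (take (fromIntegral demanded) (c : cs))
```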

lexToken Source #

Arguments

:: [Parsec String]

The tokens that should be recognised by this extractor: each parser should return the intended name of the token exactly as it should appear in the Named token.

This should include a whitespace parser for "unexpected whitespace". However, with the exception of the whitespace parser, these tokens should not consume trailing (and certainly not leading) whitespace: if using definitions from Text.Gigaparsec.Token.Lexer functionality, the nonlexeme versions of the tokens should be used.

-> TokenExtractor

If the parser failed during the parsing of a token, this function extracts the problematic item from the remaining input.

-> TokenExtractor 

This extractor provides an implementation for unexpectedToken: it will try to parse the residual input to identify a valid lexical token to report.

When parsing a grammar that has a dedicated lexical distinction, it is nice to be able to report problematic tokens relevant to that grammar as opposed to generic input lifted straight from the input stream. The easiest way of doing this would be having a pre-lexing pass and parsing based on tokens, but this is deliberately not how Parsley is designed. Instead, this extractor can attempt to parse the remaining input to identify a token on demand.

If the lexicalError flag of the unexpectedToken function is not set, which would indicate a problem within a token reported by a classical lexer and not the parser, the extractor will try to parse each of the provided tokens in turn: whichever of these tokens matches the most input will be reported as the problematic one, with an earlier token arbitrating ties (lexTokenWithSelect can alter which is chosen). For best effect, these tokens should not consume whitespace (which would otherwise be included at the end of the token!): this means that, if using the Lexer, the functionality in nonlexeme should be used. If none of the given tokens can be parsed, the input until the next valid parsable token (or end of input) is returned as a Raw.

If lexicalError is true, then the given token extractor will be used instead to extract a default token.

Since: 0.2.5.0
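
The longest-match selection logic can be illustrated without real parsers by modelling each token as a simple matcher on the residual input. Matcher, keyword, and lexTokenSketch below are toy stand-ins for illustration only, not the library's implementation (which runs actual Parsec String parsers and falls back to a Raw up to the next parsable token):

```haskell
import Data.Maybe (mapMaybe)

data Token = Raw !String | Named !String !Word deriving (Eq, Show)

-- Toy stand-in for a `Parsec String` token parser: given the input, maybe
-- return the token's name and how many characters it would consume.
type Matcher = String -> Maybe (String, Word)

-- A matcher for a literal keyword, reporting a fixed name.
keyword :: String -> String -> Matcher
keyword name kw input
  | take (length kw) input == kw = Just (name, fromIntegral (length kw))
  | otherwise                    = Nothing

-- Sketch of lexToken's non-lexical branch: run every matcher over the
-- residual input and keep the longest match, with earlier matchers
-- winning ties. Here the fallback simply returns the whole input as Raw.
lexTokenSketch :: [Matcher] -> String -> Token
lexTokenSketch ms input = case mapMaybe ($ input) ms of
  []   -> Raw input
  hits -> uncurry Named (foldr1 longerOrEarlier hits)
  where longerOrEarlier a b = if snd b > snd a then b else a
```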

lexTokenWithSelect Source #

Arguments

:: (NonEmpty (String, Word) -> (String, Word))

If the extractor is successful in identifying tokens that can be parsed from the residual input, this function will select one of them to report back.

-> [Parsec String]

The tokens that should be recognised by this extractor: each parser should return the intended name of the token exactly as it should appear in the Named token.

This should include a whitespace parser for "unexpected whitespace". However, with the exception of the whitespace parser, these tokens should not consume trailing (and certainly not leading) whitespace: if using definitions from Text.Gigaparsec.Token.Lexer functionality, the nonlexeme versions of the tokens should be used.

-> TokenExtractor

If the parser failed during the parsing of a token, this function extracts the problematic item from the remaining input.

-> TokenExtractor 

This extractor provides an implementation for unexpectedToken: it will try to parse the residual input to identify a valid lexical token to report.

When parsing a grammar that has a dedicated lexical distinction, it is nice to be able to report problematic tokens relevant to that grammar as opposed to generic input lifted straight from the input stream. The easiest way of doing this would be having a pre-lexing pass and parsing based on tokens, but this is deliberately not how Parsley is designed. Instead, this extractor can attempt to parse the remaining input to identify a token on demand.

If the lexicalError flag of the unexpectedToken function is not set, which would indicate a problem within a token reported by a classical lexer and not the parser, the extractor will try to parse each of the provided tokens in turn: the given function is used to select which is returned. For best effect, these tokens should not consume whitespace (which would otherwise be included at the end of the token!): this means that, if using the Lexer, the functionality in nonlexeme should be used. If none of the given tokens can be parsed, the input until the next valid parsable token (or end of input) is returned as a Raw.

If lexicalError is true, then the given token extractor will be used instead to extract a default token.

Since: 0.2.5.0
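
A hedged example of a selection function suitable for the first argument of lexTokenWithSelect: selectWidest (a hypothetical name) prefers the widest candidate token, breaking ties toward the earliest one, which mirrors plain lexToken's default choice:

```haskell
import Data.List.NonEmpty (NonEmpty(..))

-- Example selection function: pick the candidate that consumed the most
-- input; on ties, keep the candidate that appeared earliest in the list.
selectWidest :: NonEmpty (String, Word) -> (String, Word)
selectWidest (t :| ts) = foldl pick t ts
  where pick best cand = if snd cand > snd best then cand else best
```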