Safe Haskell | Safe-Inferred |
---|---|
Language | Haskell2010 |
## High-Level Summary
This section provides a high-level summary of this file. For those who know more about compiler-development, the below explanation is likely enough. For everyone else, see the next section.
The parser itself is unaware of indentation, and instead only parses explicit delimiters which are inserted by this layout algorithm (much like Haskell). This is convenient because the actual grammar can be specified apart from the indentation rules. Haskell has a few problematic productions which make it impossible to implement a purely lexical layout algorithm, so it also has an additional (and somewhat contentious) parser error side condition. PureScript does not have these problematic productions (particularly foo, bar :: SomeType syntax in declarations), but it does have a few gotchas of it's own. The algorithm is "non-trivial" to say the least, but it is implemented as a purely lexical delimiter parser on a token-by-token basis, which is highly convenient, since it can be replicated in any language or toolchain. There is likely room to simplify it, but there are some seemingly innocuous things that complicate it.
Naked commas (case, patterns, guards, fundeps) are a constant source of complexity, and indeed too much of this is what prevents Haskell from having such an algorithm. Unquoted properties for layout keywords introduce a domino effect of complexity since we have to mask and unmask any usage of . (also in foralls!) or labels in record literals.
## Detailed Summary
### The Problem
The parser itself is unaware of indentation or other such layout concerns.
Rather than dealing with it explicitly, the parser and its
grammar rules are only aware of normal tokens (e.g. TokLowerName
) and
three special zero-width tokens, TokLayoutStart
, TokLayoutSep
,
and TokLayoutEnd
. This is convenient because the actual grammar
can be specified apart from the indentation rules and other such
layout concerns.
For a simple example, the parser parses all three examples of the code below
using the exact same grammar rules for the let
keyword despite
each example using different indentations levels:
-- Example 1 let foo = 5 x = 2 in foo -- Example 2 let bar = 5 y = 2 in bar -- Example 3 let baz = 5 z= 2 in baz
Each block of code might appear to the parser as a stream of the
following source tokens where the {
sequence represents
TokLayoutStart
, the ;
sequence represents TokLayoutSep
,
and the }
sequence represents TokLayoutEnd
:
- let {foo = 5;x = 2} in foo
- let {bar = 5;y = 2} in bar
- let {baz = 5;z = 2} in baz
For a more complex example, consider commas:
case one, { twoA, twoB }, [ three1 , three2 , do { three3, three4 } <- case arg1, arg2 of Nothing, _ -> { three3: 1, three4: 2 } Just _, Nothing -> { three3: 2, three4: 3 } _, _ -> { three3: 3, three4: 4 } pure $ three3 + three4 ] of
Which of the above 13 commas function as the separators between the
case binders (e.g. one
) in the outermost case ... of
context?
### The Solution
The parser doesn't have to care about layout concerns (e.g. indentation or what starts and ends a context, such as a case binder) because the lexer solves that problem instead.
So, how does the lexer solve this problem? It follows this general algorithm:
1. Lex the source code text into an initial stream of SourceToken
s
that do not have any of the three special tokens mentioned previously.
2. On a token-by-token basis, determine whether the lexer should
1. insert one of the three special tokens,
2. modify the current context (e.g. are we within a case binder?
Are we in a record expression?)
Step 2 is handled via insertLayout
and is essentially a state machine.
The layout delimiters, (e.g. LytCase
, LytBrace
, LytProperty
,
and LytOf
in the next section's example) either stop certain "rules"
from applying or ensure that certain "rules" now apply. By "rules",
we mean whether and where one of the three special tokens are added.
The comments in the source code for the insertLayout
algorithm call
pushing these delimiters onto the stack "masking" and popping them off
as "unmasking". Seeing when a layout delimiter is pushed and popped
are the keys to understanding this algorithm.
### Walking Through an Example
Before showing an example, let's remember a few things.
1. The TokLowerName "case"
token (i.e. a "case" keyword) indicates the start
of a case ... of
context. That context includes case binders (like the
example shown previously) that can get quite complex. When encountered,
we may need to insert one or more of the three special tokens here
until we encounter the terminating TokLowerName "of"
token that
signifies its end.
2. "case" and "of" can also appear as a record field's name. In such a context,
they would not start or end a case ... of
block.
Given the below source code...
case { case: "foo", of: "bar" } of
the lexer would go through something like the following states:
1. Encountered TokLowerName "case"
. Update current context to
"within a case of expression" by pushing the LytCase
delimiter
onto the layout delimiter stack. Insert the case
token
into the stream of source tokens.
2. Encountered TokLeftBrace
. Update current context to
"within a record expression" by pushing the LytBrace
delimiter.
Since we expect a field name to be the next token we see,
which may include a reserved keyword, update the current context again to
"expecting a field name" by pushing the LytProperty
.
delimiter. Insert the {
token into the stream of source tokens.
3. Encountered TokLowerName "case"
. Check the current context.
Since it's a LytProperty
, this is a field name and we shouldn't
assume that the next few tokens will be case binders. However,
since this might be a record with no more fields, update the
current context back to "within a record expression" by popping
the LytProperty
off the layout delimiter stack. Insert the case
token
4. Encountered TokColon
. Insert the :
token
5. Encountered TokLowerName "foo"
. Insert the foo
token.
6. Encountered TokComma
. Check the current context. Since it's a LytBrace
,
we're in a record expression and there is another field. Update the
current context by pushing LytProperty
as we expect a field name again.
7. Encountered TokLowerName "of"
. Check the current context.
Since it's a LytProperty
, this is a field name rather
than the end of a case binder. Thus, we don't expect the next tokens
to be the body
in a case ... of body
expression. However, since
this might be a record with no more fields, update the current context
back to "within a record expression" by popping the LytProperty
off the stack. Insert the of
token.
8. Encountered TokRightBrace
. Check the current context.
Since it's a LytBrace
, this is the end of a record expression.
Update the current context to "within a case of expression"
by popping the LytBrace
off the stack. Insert the }
token.
9. Encountered TokLowername "of"
. Check the current context.
Since it's a LytCase
, this is the end of a case ... of
expression
and the body will follow. Update the current context to
"body of a case of expression" by pushing LytOf
onto the layout stack.
Insert the of
token into the stream of tokens.
Documentation
type LayoutStack = [(SourcePos, LayoutDelim)] Source #
data LayoutDelim Source #
Instances
Show LayoutDelim Source # | |
Defined in Language.PureScript.CST.Layout showsPrec :: Int -> LayoutDelim -> ShowS # show :: LayoutDelim -> String # showList :: [LayoutDelim] -> ShowS # | |
Eq LayoutDelim Source # | |
Defined in Language.PureScript.CST.Layout (==) :: LayoutDelim -> LayoutDelim -> Bool # (/=) :: LayoutDelim -> LayoutDelim -> Bool # | |
Ord LayoutDelim Source # | |
Defined in Language.PureScript.CST.Layout compare :: LayoutDelim -> LayoutDelim -> Ordering # (<) :: LayoutDelim -> LayoutDelim -> Bool # (<=) :: LayoutDelim -> LayoutDelim -> Bool # (>) :: LayoutDelim -> LayoutDelim -> Bool # (>=) :: LayoutDelim -> LayoutDelim -> Bool # max :: LayoutDelim -> LayoutDelim -> LayoutDelim # min :: LayoutDelim -> LayoutDelim -> LayoutDelim # |
isIndented :: LayoutDelim -> Bool Source #
insertLayout :: SourceToken -> SourcePos -> LayoutStack -> (LayoutStack, [SourceToken]) Source #
unwindLayout :: SourcePos -> [Comment LineFeed] -> LayoutStack -> [SourceToken] Source #