-- | -- ## High-Level Summary -- -- This section provides a high-level summary of this file. For those who -- know more about compiler-development, the below explanation is likely enough. -- For everyone else, see the next section. -- -- The parser itself is unaware of indentation, and instead only parses explicit -- delimiters which are inserted by this layout algorithm (much like Haskell). -- This is convenient because the actual grammar can be specified apart from the -- indentation rules. Haskell has a few problematic productions which make it -- impossible to implement a purely lexical layout algorithm, so it also has an -- additional (and somewhat contentious) parser error side condition. PureScript -- does not have these problematic productions (particularly foo, bar :: -- SomeType syntax in declarations), but it does have a few gotchas of it's own. -- The algorithm is "non-trivial" to say the least, but it is implemented as a -- purely lexical delimiter parser on a token-by-token basis, which is highly -- convenient, since it can be replicated in any language or toolchain. There is -- likely room to simplify it, but there are some seemingly innocuous things -- that complicate it. -- -- "Naked" commas (case, patterns, guards, fundeps) are a constant source of -- complexity, and indeed too much of this is what prevents Haskell from having -- such an algorithm. Unquoted properties for layout keywords introduce a domino -- effect of complexity since we have to mask and unmask any usage of . (also in -- foralls!) or labels in record literals. -- -- ## Detailed Summary -- -- ### The Problem -- -- The parser itself is unaware of indentation or other such layout concerns. -- Rather than dealing with it explicitly, the parser and its -- grammar rules are only aware of normal tokens (e.g. @TokLowerName@) and -- three special zero-width tokens, @TokLayoutStart@, @TokLayoutSep@, -- and @TokLayoutEnd@. This is convenient because the actual grammar -- can be specified apart from the indentation rules and other such -- layout concerns. -- -- For a simple example, the parser parses all three examples of the code below -- using the exact same grammar rules for the @let@ keyword despite -- each example using different indentations levels: -- -- @ -- -- Example 1 -- let foo = 5 -- x = 2 in foo -- -- -- Example 2 -- let -- bar = 5 -- y = 2 -- in bar -- -- -- Example 3 -- let baz -- = -- 5 -- z= 2 in baz -- @ -- -- Each block of code might appear to the parser as a stream of the -- following source tokens where the @\{@ sequence represents -- @TokLayoutStart@, the @\;@ sequence represents @TokLayoutSep@, -- and the @\}@ sequence represents @TokLayoutEnd@: -- - @let \{foo = 5\;x = 2\} in foo@ -- - @let \{bar = 5\;y = 2\} in bar@ -- - @let \{baz = 5\;z = 2\} in baz@ -- -- -- For a more complex example, consider commas: -- -- @ -- case one, { twoA, twoB }, [ three1 -- , three2 -- , do -- { three3, three4 } <- case arg1, arg2 of -- Nothing, _ -> { three3: 1, three4: 2 } -- Just _, Nothing -> { three3: 2, three4: 3 } -- _, _ -> { three3: 3, three4: 4 } -- pure $ three3 + three4 -- ] of -- @ -- -- Which of the above 13 commas function as the separaters between the -- case binders (e.g. @one@) in the outermost @case ... of@ context? -- -- ### The Solution -- -- The parser doesn't have to care about layout concerns (e.g. indentation -- or what starts and ends a context, such as a case binder) because the -- lexer solves that problem instead. -- -- So, how does the lexer solve this problem? It follows this general algorithm: -- 1. Lex the source code text into an initial stream of `SourceToken`s -- that do not have any of the three special tokens mentioned previously. -- 2. On a token-by-token basis, determine whether the lexer should -- 1. insert one of the three special tokens, -- 2. modify the current context (e.g. are we within a case binder? -- Are we in a record expression?) -- -- Step 2 is handled via 'insertLayout' and is essentially a state machine. -- The layout delimiters, (e.g. 'LytCase', 'LytBrace', 'LytProperty', -- and 'LytOf' in the next section's example) either stop certain "rules" -- from applying or ensure that certain "rules" now apply. By "rules", -- we mean whether and where one of the three special tokens are added. -- The comments in the source code for the 'insertLayout' algorithm call -- pushing these delimiters onto the stack "masking" and popping them off -- as "unmasking". Seeing when a layout delimiter is pushed and popped -- are the keys to understanding this algorithm. -- -- ### Walking Through an Example -- -- Before showing an example, let's remember a few things. -- 1. The @TokLowerName "case"@ token (i.e. a "case" keyword) indicates the start -- of a @case ... of@ context. That context includes case binders (like the -- example shown previously) that can get quite complex. When encountered, -- we may need to insert one or more of the three special tokens here -- until we encounter the terminating @TokLowerName "of"@ token that -- signifies its end. -- 2. "case" and "of" can also appear as a record field's name. In such a context, -- they would not start or end a @case ... of@ block. -- -- Given the below source code... -- -- @ -- case { case: "foo", of: "bar" } of -- @ -- -- the lexer would go through something like the following states: -- 1. Encountered @TokLowerName "case"@. Update current context to -- "within a case of expression" by pushing the 'LytCase' delimiter -- onto the layout delimiter stack. Insert the @case@ token -- into the stream of source tokens. -- 2. Encountered @TokLeftBrace@. Update current context to -- "within a record expression" by pushing the 'LytBrace' delimiter. -- Since we expect a field name to be the next token we see, -- which may include a reserved keyword, update the current context again to -- "expecting a field name" by pushing the `LytProperty`. -- delimiter. Insert the @{@ token into the stream of source tokens. -- 3. Encountered @TokLowerName "case"@. Check the current context. -- Since it's a `LytProperty`, this is a field name and we shouldn't -- assume that the next few tokens will be case binders. However, -- since this might be a record with no more fields, update the -- current context back to "within a record expression" by popping -- the `LytProperty` off the layout delimiter stack. Insert the @case@ token -- 4. Encountered @TokColon@. Insert the @:@ token -- 5. Encountered @TokLowerName "foo"@. Insert the @foo@ token. -- 6. Encountered @TokComma@. Check the current context. Since it's a `LytBrace`, -- we're in a record expression and there is another field. Update the -- current context by pushing `LytProperty` as we expect a field name again. -- 7. Encountered @TokLowerName "of"@. Check the current context. -- Since it's a `LytProperty`, this is a field name rather -- than the end of a case binder. Thus, we don't expect the next tokens -- to be the @body@ in a @case ... of body@ expression. However, since -- this might be a record with no more fields, update the current context -- back to "within a record expression" by popping the `LytProperty` -- off the stack. Insert the @of@ token. -- 8. Encountered @TokRightBrace@. Check the current context. -- Since it's a `LytBrace`, this is the end of a record expression. -- Update the current context to "within a case of expression" -- by popping the `LytBrace` off the stack. Insert the @}@ token. -- 9. Encountered @TokLowername "of"@. Check the current context. -- Since it's a 'LytCase', this is the end of a @case ... of@ expression -- and the body will follow. Update the current context to -- "body of a case of expression" by pushing 'LytOf' onto the layout stack. -- Insert the @of@ token into the stream of tokens. -- module Language.PureScript.CST.Layout where import Prelude import Data.DList (snoc) import qualified Data.DList as DList import Data.Foldable (find) import Data.Function ((&)) import Language.PureScript.CST.Types type LayoutStack = [(SourcePos, LayoutDelim)] data LayoutDelim = LytRoot | LytTopDecl | LytTopDeclHead | LytDeclGuard | LytCase | LytCaseBinders | LytCaseGuard | LytLambdaBinders | LytParen | LytBrace | LytSquare | LytIf | LytThen | LytProperty | LytForall | LytTick | LytLet | LytLetStmt | LytWhere | LytOf | LytDo | LytAdo deriving (Show, Eq, Ord) isIndented :: LayoutDelim -> Bool isIndented = \case LytLet -> True LytLetStmt -> True LytWhere -> True LytOf -> True LytDo -> True LytAdo -> True _ -> False isTopDecl :: SourcePos -> LayoutStack -> Bool isTopDecl tokPos = \case [(lytPos, LytWhere), (_, LytRoot)] | srcColumn tokPos == srcColumn lytPos -> True _ -> False lytToken :: SourcePos -> Token -> SourceToken lytToken pos = SourceToken ann where ann = TokenAnn { tokRange = SourceRange pos pos , tokLeadingComments = [] , tokTrailingComments = [] } insertLayout :: SourceToken -> SourcePos -> LayoutStack -> (LayoutStack, [SourceToken]) insertLayout src@(SourceToken tokAnn tok) nextPos stack = DList.toList <$> insert (stack, mempty) where tokPos = srcStart $ tokRange tokAnn insert state@(stk, acc) = case tok of -- `data` declarations need masking (LytTopDecl) because the usage of `|` -- should not introduce a LytDeclGard context. TokLowerName [] "data" -> case state & insertDefault of state'@(stk', _) | isTopDecl tokPos stk' -> state' & pushStack tokPos LytTopDecl state' -> state' & popStack (== LytProperty) -- `class` declaration heads need masking (LytTopDeclHead) because the -- usage of commas in functional dependencies. TokLowerName [] "class" -> case state & insertDefault of state'@(stk', _) | isTopDecl tokPos stk' -> state' & pushStack tokPos LytTopDeclHead state' -> state' & popStack (== LytProperty) TokLowerName [] "where" -> case stk of (_, LytTopDeclHead) : stk' -> (stk', acc) & insertToken src & insertStart LytWhere (_, LytProperty) : stk' -> (stk', acc) & insertToken src _ -> state & collapse whereP & insertToken src & insertStart LytWhere where -- `where` always closes do blocks: -- example = do do do do foo where foo = ... -- -- `where` closes layout contexts even when indented at the same level: -- example = case -- Foo -> ... -- Bar -> ... -- where foo = ... whereP _ LytDo = True whereP lytPos lyt = offsideEndP lytPos lyt TokLowerName [] "in" -> case collapse inP state of -- `let/in` is not allowed in `ado` syntax. `in` is treated as a -- delimiter and must always close the `ado`. -- example = ado -- foo <- ... -- let bar = ... -- in ... ((_, LytLetStmt) : (_, LytAdo) : stk', acc') -> (stk', acc') & insertEnd & insertEnd & insertToken src ((_, lyt) : stk', acc') | isIndented lyt -> (stk', acc') & insertEnd & insertToken src _ -> state & insertDefault & popStack (== LytProperty) where inP _ LytLet = False inP _ LytAdo = False inP _ lyt = isIndented lyt TokLowerName [] "let" -> state & insertKwProperty next where next state'@(stk', _) = case stk' of (p, LytDo) : _ | srcColumn p == srcColumn tokPos -> state' & insertStart LytLetStmt (p, LytAdo) : _ | srcColumn p == srcColumn tokPos -> state' & insertStart LytLetStmt _ -> state' & insertStart LytLet TokLowerName _ "do" -> state & insertKwProperty (insertStart LytDo) TokLowerName _ "ado" -> state & insertKwProperty (insertStart LytAdo) -- `case` heads need masking due to commas. TokLowerName [] "case" -> state & insertKwProperty (pushStack tokPos LytCase) TokLowerName [] "of" -> case collapse indentedP state of -- When `of` is matched with a `case`, we are in a case block, and we -- need to mask additional contexts (LytCaseBinders, LytCaseGuards) -- due to commas. ((_, LytCase) : stk', acc') -> (stk', acc') & insertToken src & insertStart LytOf & pushStack nextPos LytCaseBinders state' -> state' & insertDefault & popStack (== LytProperty) -- `if/then/else` is considered a delimiter context. This allows us to -- write chained expressions in `do` blocks without stair-stepping: -- example = do -- foo -- if ... then -- ... -- else if ... then -- ... -- else -- ... TokLowerName [] "if" -> state & insertKwProperty (pushStack tokPos LytIf) TokLowerName [] "then" -> case state & collapse indentedP of ((_, LytIf) : stk', acc') -> (stk', acc') & insertToken src & pushStack tokPos LytThen _ -> state & insertDefault & popStack (== LytProperty) TokLowerName [] "else" -> case state & collapse indentedP of ((_, LytThen) : stk', acc') -> (stk', acc') & insertToken src _ -> -- We don't want to insert a layout separator for top-level `else` in -- instance chains. case state & collapse offsideP of state'@(stk', _) | isTopDecl tokPos stk' -> state' & insertToken src state' -> state' & insertSep & insertToken src & popStack (== LytProperty) -- `forall` binders need masking because the usage of `.` should not -- introduce a LytProperty context. TokForall _ -> state & insertKwProperty (pushStack tokPos LytForall) -- Lambdas need masking because the usage of `->` should not close a -- LytDeclGuard or LytCaseGuard context. TokBackslash -> state & insertDefault & pushStack tokPos LytLambdaBinders TokRightArrow _ -> state & collapse arrowP & popStack guardP & insertToken src where arrowP _ LytDo = True arrowP _ LytOf = False arrowP lytPos lyt = offsideEndP lytPos lyt guardP LytCaseBinders = True guardP LytCaseGuard = True guardP LytLambdaBinders = True guardP _ = False TokEquals -> case state & collapse equalsP of ((_, LytDeclGuard) : stk', acc') -> (stk', acc') & insertToken src _ -> state & insertDefault where equalsP _ LytWhere = True equalsP _ LytLet = True equalsP _ LytLetStmt = True equalsP _ _ = False -- Guards need masking because of commas. TokPipe -> case collapse offsideEndP state of state'@((_, LytOf) : _, _) -> state' & pushStack tokPos LytCaseGuard & insertToken src state'@((_, LytLet) : _, _) -> state' & pushStack tokPos LytDeclGuard & insertToken src state'@((_, LytLetStmt) : _, _) -> state' & pushStack tokPos LytDeclGuard & insertToken src state'@((_, LytWhere) : _, _) -> state' & pushStack tokPos LytDeclGuard & insertToken src _ -> state & insertDefault -- Ticks can either start or end an infix expression. We preemptively -- collapse all indentation contexts in search of a starting delimiter, -- and backtrack if we don't find one. TokTick -> case state & collapse indentedP of ((_, LytTick) : stk', acc') -> (stk', acc') & insertToken src _ -> state & collapse offsideEndP & insertSep & insertToken src & pushStack tokPos LytTick -- In general, commas should close all indented contexts. -- example = [ do foo -- bar, baz ] TokComma -> case state & collapse indentedP of -- If we see a LytBrace, then we are in a record type or literal. -- Record labels need masking so we can use unquoted keywords as labels -- without accidentally littering layout delimiters. state'@((_, LytBrace) : _, _) -> state' & insertToken src & pushStack tokPos LytProperty state' -> state' & insertToken src -- TokDot tokens usually entail property access, which need masking so we -- can use unquoted keywords as labels. TokDot -> case state & insertDefault of ((_, LytForall) : stk', acc') -> (stk', acc') state' -> state' & pushStack tokPos LytProperty TokLeftParen -> state & insertDefault & pushStack tokPos LytParen TokLeftBrace -> state & insertDefault & pushStack tokPos LytBrace & pushStack tokPos LytProperty TokLeftSquare -> state & insertDefault & pushStack tokPos LytSquare TokRightParen -> state & collapse indentedP & popStack (== LytParen) & insertToken src TokRightBrace -> state & collapse indentedP & popStack (== LytProperty) & popStack (== LytBrace) & insertToken src TokRightSquare -> state & collapse indentedP & popStack (== LytSquare) & insertToken src TokString _ _ -> state & insertDefault & popStack (== LytProperty) TokLowerName [] _ -> state & insertDefault & popStack (== LytProperty) TokOperator _ _ -> state & collapse offsideEndP & insertSep & insertToken src _ -> state & insertDefault insertDefault state = state & collapse offsideP & insertSep & insertToken src insertStart lyt state@(stk, _) = -- We only insert a new layout start when it's going to increase indentation. -- This prevents things like the following from parsing: -- instance foo :: Foo where -- foo = 42 case find (isIndented . snd) stk of Just (pos, _) | srcColumn nextPos <= srcColumn pos -> state _ -> state & pushStack nextPos lyt & insertToken (lytToken nextPos TokLayoutStart) insertSep state@(stk, acc) = case stk of -- LytTopDecl is closed by a separator. (lytPos, LytTopDecl) : stk' | sepP lytPos -> (stk', acc) & insertToken sepTok -- LytTopDeclHead can be closed by a separator if there is no `where`. (lytPos, LytTopDeclHead) : stk' | sepP lytPos -> (stk', acc) & insertToken sepTok (lytPos, lyt) : _ | indentSepP lytPos lyt -> case lyt of -- If a separator is inserted in a case block, we need to push an -- additional LytCaseBinders context for comma masking. LytOf -> state & insertToken sepTok & pushStack tokPos LytCaseBinders _ -> state & insertToken sepTok _ -> state where sepTok = lytToken tokPos TokLayoutSep insertKwProperty k state = case state & insertDefault of ((_, LytProperty) : stk', acc') -> (stk', acc') state' -> k state' insertEnd = insertToken (lytToken tokPos TokLayoutEnd) insertToken token (stk, acc) = (stk, acc `snoc` token) pushStack lytPos lyt (stk, acc) = ((lytPos, lyt) : stk, acc) popStack p ((_, lyt) : stk', acc) | p lyt = (stk', acc) popStack _ state = state collapse p = uncurry go where go ((lytPos, lyt) : stk) acc | p lytPos lyt = go stk $ if isIndented lyt then acc `snoc` lytToken tokPos TokLayoutEnd else acc go stk acc = (stk, acc) indentedP = const isIndented offsideP lytPos lyt = isIndented lyt && srcColumn tokPos < srcColumn lytPos offsideEndP lytPos lyt = isIndented lyt && srcColumn tokPos <= srcColumn lytPos indentSepP lytPos lyt = isIndented lyt && sepP lytPos sepP lytPos = srcColumn tokPos == srcColumn lytPos && srcLine tokPos /= srcLine lytPos unwindLayout :: SourcePos -> [Comment LineFeed] -> LayoutStack -> [SourceToken] unwindLayout pos leading = go where go [] = [] go ((_, LytRoot) : _) = [SourceToken (TokenAnn (SourceRange pos pos) leading []) TokEof] go ((_, lyt) : stk) | isIndented lyt = lytToken pos TokLayoutEnd : go stk go (_ : stk) = go stk