ghc-syntax-highlighter: Syntax highlighter for Haskell using lexer of GHC itself

[ bsd3, library, text ]

Syntax highlighter for Haskell using lexer of GHC itself.


Versions: 0.0.1.0, 0.0.2.0, 0.0.3.0
Change log: CHANGELOG.md
Dependencies: base (>=4.11 && <5.0), ghc (>=8.4 && <8.7), text (>=0.2 && <1.3)
License: BSD-3-Clause
Author: Mark Karpov <markkarpov92@gmail.com>
Maintainer: Mark Karpov <markkarpov92@gmail.com>
Category: Text
Home page: https://github.com/mrkkrp/ghc-syntax-highlighter
Bug tracker: https://github.com/mrkkrp/ghc-syntax-highlighter/issues
Source repo: head: git clone https://github.com/mrkkrp/ghc-syntax-highlighter.git
Uploaded: by mrkkrp at Wed Nov 7 19:17:32 UTC 2018
Distributions: LTSHaskell:0.0.3.0, NixOS:0.0.3.0, Stackage:0.0.3.0

Flags

Name  Description                    Default   Type
dev   Turn on development settings.  Disabled  Manual

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Readme for ghc-syntax-highlighter-0.0.3.0

GHC syntax highlighter

This is a syntax highlighter library for Haskell that uses the lexer of GHC itself.

Here is a blog post announcing the package; the readme is mostly derived from it:

Motivation

Parsing Haskell is hard, because Haskell is a complex language with countless features. The only way to get it 100% right is to use the parser of GHC itself. Fortunately, there is now the ghc package, which as of version 8.4.1 exports enough of GHC's source code to allow us to use its lexer.

Alternative approaches, even decent ones like highlight.js, either don't support cutting-edge features or work without sufficient precision, so many tokens end up combined and the result is typically still hard to read.

API

The API is extremely simple:

-- | Token types that are used as tags to mark spans of source code.

data Token
  = KeywordTok         -- ^ Keyword
  | PragmaTok          -- ^ Pragmas
  | SymbolTok          -- ^ Symbols (punctuation that is not an operator)
  | VariableTok        -- ^ Variable name (term level)
  | ConstructorTok     -- ^ Data\/type constructor
  | OperatorTok        -- ^ Operator
  | CharTok            -- ^ Character
  | StringTok          -- ^ String
  | IntegerTok         -- ^ Integer
  | RationalTok        -- ^ Rational number
  | CommentTok         -- ^ Comment (including Haddocks)
  | SpaceTok           -- ^ Space filling
  | OtherTok           -- ^ Something else?
  deriving (Eq, Ord, Enum, Bounded, Show)

-- | Tokenize Haskell source code. If the code cannot be parsed, return
-- 'Nothing'. Otherwise return the original input tagged by 'Token's.
--
-- The parser does not require the input source code to form a valid Haskell
-- program, so as long as the lexer can decompose your input (most of the
-- time), it'll return something in 'Just'.

tokenizeHaskell :: Text -> Maybe [(Token, Text)]

So given a simple program:

module Main (main) where

import Data.Bits

-- | Program's entry point.

main :: IO ()
main = return ()

It outputs something like this:

basicModule :: [(Token, Text)]
basicModule =
  [ (KeywordTok,"module")
  , (SpaceTok," ")
  , (ConstructorTok,"Main")
  , (SpaceTok," ")
  , (SymbolTok,"(")
  , (VariableTok,"main")
  , (SymbolTok,")")
  , (SpaceTok," ")
  , (KeywordTok,"where")
  , (SpaceTok,"\n\n")
  , (KeywordTok,"import")
  , (SpaceTok," ")
  , (ConstructorTok,"Data.Bits")
  , (SpaceTok,"\n\n")
  , (CommentTok,"-- | Program's entry point.")
  , (SpaceTok,"\n\n")
  , (VariableTok,"main")
  , (SpaceTok," ")
  , (SymbolTok,"::")
  , (SpaceTok," ")
  , (ConstructorTok,"IO")
  , (SpaceTok," ")
  , (SymbolTok,"(")
  , (SymbolTok,")")
  , (SpaceTok,"\n")
  , (VariableTok,"main")
  , (SpaceTok," ")
  , (SymbolTok,"=")
  , (SpaceTok," ")
  , (VariableTok,"return")
  , (SpaceTok," ")
  , (SymbolTok,"(")
  , (SymbolTok,")")
  , (SpaceTok,"\n")
  ]

Nothing is rarely, if ever, returned, because the lexer appears capable of interpreting almost any text as a stream of GHC tokens.
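As a rough illustration of what can be done with the tagged output, here is a minimal sketch that turns the result of tokenizeHaskell into HTML. It assumes that tokenizeHaskell and the Token type are imported from the GHC.SyntaxHighlighter module. The helper names (tokenClass, escapeHtml, renderToken, highlightHtml) and the CSS class names are invented for this example and are not part of the package.

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the package exposes its API from this module.
import GHC.SyntaxHighlighter (Token (..), tokenizeHaskell)

-- | Map a token type to a CSS class. The class names are hypothetical;
-- whitespace gets no class and is emitted without a wrapping span.
tokenClass :: Token -> Maybe Text
tokenClass tok = case tok of
  KeywordTok     -> Just "kw"
  PragmaTok      -> Just "pr"
  SymbolTok      -> Just "sy"
  VariableTok    -> Just "va"
  ConstructorTok -> Just "cr"
  OperatorTok    -> Just "op"
  CharTok        -> Just "ch"
  StringTok      -> Just "st"
  IntegerTok     -> Just "it"
  RationalTok    -> Just "ra"
  CommentTok     -> Just "co"
  SpaceTok       -> Nothing
  OtherTok       -> Just "ot"

-- | Escape the characters that are special in HTML text and attributes.
escapeHtml :: Text -> Text
escapeHtml = T.concatMap escapeChar
  where
    escapeChar :: Char -> Text
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '&' = "&amp;"
    escapeChar '"' = "&quot;"
    escapeChar c   = T.singleton c

-- | Render one tagged span, wrapping it in a span element when it has a class.
renderToken :: (Token, Text) -> Text
renderToken (tok, txt) = case tokenClass tok of
  Nothing  -> escapeHtml txt
  Just cls -> "<span class=\"" <> cls <> "\">" <> escapeHtml txt <> "</span>"

-- | Highlight a piece of Haskell source. If the lexer gives up ('Nothing'),
-- fall back to the escaped input without any highlighting.
highlightHtml :: Text -> Text
highlightHtml src = "<pre>" <> body <> "</pre>"
  where
    body = maybe (escapeHtml src) (T.concat . map renderToken) (tokenizeHaskell src)

main :: IO ()
main = TIO.putStrLn (highlightHtml "main :: IO ()\nmain = return ()\n")
```

Because the tokens are the original input cut into tagged pieces, concatenating the rendered spans in order preserves the layout of the source, so no separate whitespace handling is needed.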

How to use it in your blog

It depends on your markdown processor. If you're an mmark user, good news: since version 0.2.1.0, mmark-ext includes the ghcSyntaxHighlighter extension. Thanks to the flexibility of MMark, it's possible to use this highlighter for Haskell and skylighting as a fall-back for everything else. Consult the docs for more information.
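A setup along those lines might look roughly like the sketch below. It assumes the mmark, mmark-ext, and lucid packages, and that the two extensions are exported from Text.MMark.Extension.GhcSyntaxHighlighter and Text.MMark.Extension.Skylighting. The function name renderPost and the file name "post.md" are made up for the example. The order of the extensions in the list may matter for the fall-back behaviour, so check the mmark-ext docs before relying on it.

```haskell
{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import qualified Lucid
import qualified Text.MMark as MMark
-- Assumption: these are the modules that export the two extensions.
import Text.MMark.Extension.GhcSyntaxHighlighter (ghcSyntaxHighlighter)
import Text.MMark.Extension.Skylighting (skylighting)

-- | Parse a markdown document and render it to HTML with both highlighting
-- extensions enabled. Returns 'Nothing' if the markdown fails to parse.
-- (Hypothetical helper; the extension order is an assumption.)
renderPost :: Text -> Maybe TL.Text
renderPost src = case MMark.parse "post.md" src of
  Left _ -> Nothing
  Right doc ->
    Just . Lucid.renderText . MMark.render $
      MMark.useExtensions [ghcSyntaxHighlighter, skylighting] doc

main :: IO ()
main = do
  let post =
        T.unlines
          [ "# Hello"
          , ""
          , "    ```haskell"  -- placeholder line, replaced below
          ]
      -- Build the fenced block explicitly so the fence markers are plain data.
      fence = T.replicate 3 "`"
      input =
        T.unlines
          [ "# Hello"
          , ""
          , fence <> "haskell"
          , "main :: IO ()"
          , "main = return ()"
          , fence
          ]
  case renderPost (if T.null post then post else input) of
    Nothing   -> putStrLn "markdown parse error"
    Just html -> TLIO.putStrLn html
```

Code blocks in other languages would be left to skylighting in this arrangement, while blocks marked as Haskell go through the GHC lexer.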

skylighting is what Pandoc uses. From what I can tell, Pandoc is hardcoded to use only that library for highlighting, so some creativity may be necessary to get it to work.

Contribution

Issues, bugs, and questions may be reported in the GitHub issue tracker for this project.

Pull requests are also welcome.

License

Copyright © 2018 Mark Karpov

Distributed under the BSD 3-clause license.