mmark: Strict markdown processor for writers

Modules

Text
- Text.MMark
  - Text.MMark.Extension

Flags

Manual Flags

Name	Description	Default
dev	Turn on development settings.	Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Downloads

mmark-0.0.4.0.tar.gz [browse] (Cabal source package)
Package description (revised from the package)

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions [RSS]	0.0.1.0, 0.0.1.1, 0.0.2.0, 0.0.2.1, 0.0.3.0, 0.0.3.1, 0.0.3.2, 0.0.4.0, 0.0.4.1, 0.0.4.2, 0.0.4.3, 0.0.5.0, 0.0.5.1, 0.0.5.2, 0.0.5.3, 0.0.5.4, 0.0.5.5, 0.0.5.6, 0.0.6.0, 0.0.6.1, 0.0.6.2, 0.0.7.0, 0.0.7.1, 0.0.7.2, 0.0.7.3, 0.0.7.4, 0.0.7.5, 0.0.7.6, 0.0.8.0
Change log	CHANGELOG.md
Dependencies	aeson (>=0.11 && <1.3), base (>=4.8 && <5.0), case-insensitive (>=1.2 && <1.3), containers (>=0.5 && <0.6), data-default-class, deepseq (>=1.3 && <1.5), dlist (>=0.8 && <0.9), email-validate (>=2.2 && <2.4), foldl (>=1.2 && <1.4), hashable (>=1.0.1.1 && <1.3), html-entity-map (>=0.1 && <0.2), lucid (>=2.6 && <3.0), megaparsec (>=6.3 && <6.4), microlens (>=0.4 && <0.5), microlens-th (>=0.4 && <0.5), modern-uri (>=0.1.1 && <0.2), mtl (>=2.0 && <3.0), parser-combinators (>=0.2 && <1.0), semigroups (>=0.18 && <0.19), text (>=0.2 && <1.3), text-metrics (>=0.3 && <0.4), unordered-containers (>=0.2.5 && <0.3), void (>=0.7 && <0.8), yaml (>=0.8.10 && <0.9) [details]
License	BSD-3-Clause
Author	Mark Karpov <markkarpov92@gmail.com>
Maintainer	Mark Karpov <markkarpov92@gmail.com>
Revised	Revision 2 made by mrkkrp at 2018-01-12T06:36:25Z
Category	Text
Home page	https://github.com/mrkkrp/mmark
Bug tracker	https://github.com/mrkkrp/mmark/issues
Source repo	head: git clone https://github.com/mrkkrp/mmark.git
Uploaded	by mrkkrp at 2017-12-29T12:18:55Z
Distributions	Arch:0.0.7.6, LTSHaskell:0.0.8.0, NixOS:0.0.8.0, Stackage:0.0.8.0
Reverse Dependencies	4 direct, 0 indirect [details]
Downloads	18872 total (149 in the last 30 days)
Rating	2.0 (votes: 1) [estimated by Bayesian average]
Status	Docs available [build log] Last success reported on 2017-12-29 [all 1 reports]

Readme for mmark-0.0.4.0

MMark

Quick start: MMark vs GitHub-flavored markdown
MMark and Common Mark
- Differences in inline parsing
- Other differences
About MMark-specific extensions
Performance
Contribution
License

MMark (read “em-mark”) is a strict markdown processor for writers. “Strict” means that not every input is considered a valid markdown document, and parse errors are possible and even desirable, because they let you spot markup issues without searching for them in the rendered document. If a markdown document passes the MMark parser, it will likely produce HTML without quirks. This makes MMark a good choice for writers and bloggers.

MMark in its current state features:

A parser that produces high-quality error messages and does not choke on the first parse error. It can report multiple parse errors where that makes sense.
An extension system that lets you create extensions that alter the parsed markdown document in some way. Some of them are available in the mmark-ext package.
A lucid-based renderer.

There is also a blog post announcing the project:

https://markkarpov.com/post/announcing-mmark.html

Quick start: MMark vs GitHub-flavored markdown

It's easy to start using MMark if you're used to GitHub-flavored markdown. There are three main differences:

URIs are not automatically recognized; you must enclose them in < and >.

Block quotes require only one >, and they continue as long as the inner content is indented.

This is OK:

> Here goes my block quote.
  And this is the second line of the quote.

This produces two block quotes:

> Here goes my block quote.
> And this is another block quote!

See differences in inline parsing.
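
The block quote rule above can be illustrated with a small sketch. It is not MMark's parser: `startQuote`, `isContinuation` and `blockQuotes` are hypothetical names, and the sketch assumes that quote content starts two columns in (right after `> `) and that lines outside of any quote are simply skipped.

```haskell
-- | If a line opens a block quote, return its content without the marker.
startQuote :: String -> Maybe String
startQuote ('>' : rest) = Just (dropWhile (== ' ') rest)
startQuote _ = Nothing

-- | A continuation line is indented by at least two spaces, that is, up to
-- the column where content starts after "> " (an assumption of this sketch).
isContinuation :: String -> Bool
isContinuation l = take 2 l == "  "

-- | Group lines into block quotes: every line starting with '>' opens a new
-- quote, and the indented lines that follow it belong to that quote.
-- Lines that are not part of any quote are dropped.
blockQuotes :: [String] -> [[String]]
blockQuotes [] = []
blockQuotes (l : ls) =
  case startQuote l of
    Just first ->
      let (cont, rest) = span isContinuation ls
       in (first : map (dropWhile (== ' ')) cont) : blockQuotes rest
    Nothing -> blockQuotes ls
```

Applied to the first example above, `blockQuotes` returns one quote with two lines; applied to the second, it returns two quotes with one line each.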

MMark and Common Mark

MMark mostly tries to follow the Common Mark specification as given here:

http://spec.commonmark.org/0.28/

However, due to the fact that we do not allow inputs that do not make sense, and also try to guard against common mistakes (like writing ##My header and having it rendered as a paragraph starting with hashes) MMark obviously can't follow the specification precisely. In particular, parsing of inlines differs considerably from Common Mark (see below).
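
As a rough illustration of this kind of strictness, here is a minimal sketch of a heading check. It is not MMark's implementation: `checkHeading` and its error messages are hypothetical, and it combines the `##My header` case with the level 1–6 and non-empty rules listed under “Other differences”.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | Check a single line. 'Nothing' means the line does not start with a hash
-- sign, so the heading rule does not apply (this includes lines whose first
-- hash is backslash-escaped). Otherwise the line must be a valid heading.
checkHeading :: String -> Maybe (Either String (Int, String))
checkHeading line =
  case span (== '#') line of
    ("", _) -> Nothing
    (hashes, rest)
      | length hashes > 6 -> Just (Left "heading level must be between 1 and 6")
      | take 1 rest /= " " -> Just (Left "expected a space after the hash signs")
      | null title -> Just (Left "heading must not be empty")
      | otherwise -> Just (Right (length hashes, title))
      where
        -- Heading text with surrounding white space removed.
        title = dropWhileEnd isSpace (dropWhile isSpace rest)
```

With this sketch, `checkHeading "##My header"` yields an error instead of silently becoming a paragraph, while `checkHeading "## My header"` yields a level 2 heading.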

Another difference between Common Mark and MMark is that the latter supports more (pun alert) common markdown extensions out-of-the-box. In particular, MMark supports:

parsing of an optional YAML block
strikeout using ~~this~~ syntax
superscript using ^this^ syntax
subscript using ~this~ syntax
automatic assignment of ids to headers
pipe tables (as on GitHub)

You do not need to enable or tweak anything for these to work; they are built-in features.

Differences in inline parsing

Emphasis and strong emphasis are an especially hairy topic in the Common Mark specification. There are 17 ad-hoc rules defining the interaction between *- and _-based emphasis, and more than half of all Common Mark examples (that's about 300) test just this tricky logic.

Not only is it hard to implement, it's hard for humans to understand too. For example, this input:

*(*foo*)*

results in the following HTML:

<p><em>(<em>foo</em>)</em></p>

(Note the nested emphasis.)

Could it produce something like this instead?

<p><em>(</em>foo<em>)</em></p>

Well, why not? Without remembering those 17 ad-hoc rules, there are going to be a lot of tricky cases where a user won't be able to tell how markdown will be parsed.

I decided to make parsing of emphasis, strong emphasis, and similar constructs like strikethrough, subscript, and superscript more symmetric and less ad-hoc. This is a work in progress and I'm not fully satisfied with the current approach, as it does not allow expressing some combinations of characters and markup, but in 99% of practical cases it is identical to Common Mark, and normal markdown intuitions will work OK for the users.

Let's start by dividing all characters into three groups:

Markup characters, including the following: *, ~, _, `, ^, [, ]. These are used for markup and whenever they appear in a document, they must form valid markup constructions. To be used as ordinary punctuation characters they must be backslash escaped.
Space characters, including space, tab, newline and carriage return.
Other characters, which include all characters not falling into the two groups described above.
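
The three groups can be written down as a small classifier. This is only a sketch: the names `CharClass`, `classify` and `classifyAll` are hypothetical, and the set of markup characters is limited to the ones listed above, which may not be the full set MMark uses.

```haskell
-- | The three groups of characters described above.
data CharClass = Markup | Space | Other
  deriving (Eq, Show)

-- | Classify a character that is not backslash-escaped.
classify :: Char -> CharClass
classify c
  | c `elem` "*~_`^[]" = Markup
  | c `elem` " \t\n\r" = Space
  | otherwise = Other

-- | Classify every character of a string. A backslash followed by a markup
-- character drops the backslash and turns that character into an 'Other'
-- character; any other backslash is kept as an ordinary 'Other' character.
classifyAll :: String -> [(Char, CharClass)]
classifyAll ('\\' : c : rest)
  | classify c == Markup = (c, Other) : classifyAll rest
classifyAll (c : rest) = (c, classify c) : classifyAll rest
classifyAll [] = []
```

Here `classifyAll "a*"` marks the asterisk as markup, while `classifyAll "\\*"` (a backslash followed by an asterisk) yields a single asterisk in the other group.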

Markup characters can be “converted” to other characters via backslash escaping. We'll see how this is useful in a few moments.

We'll call a run of markup characters placed between space characters and other characters a left-flanking delimiter run. These markup characters sort of hang on the left-hand side of a word.

Similarly, we'll call a run of markup characters placed between other characters and space characters a right-flanking delimiter run. These hang on the right-hand side of a word.

Emphasis markup (and other similar things like strikethrough, which we won't mention explicitly anymore for brevity) can start only as a left-flanking delimiter run and end only as a right-flanking delimiter run.

This produces a parse error:

*Something * is not right.
Something __is __ not right.

And this too:

__foo__bar

This means that intra-word emphasis is not supported by this approach. (This is a pity, maybe I should adjust something to allow it.)

There is one more tricky thing. In some cases we want to end emphasis and have a full stop or other punctuation right after it:

Here it *goes*.

You can see that the closing * is not in right-flanking position here, and so this would be a parse error. To avoid this, some punctuation characters that normally appear outside of markup were made “transparent” and thus they are regarded as white space, so the example above parses correctly and works as expected. To put a transparent character inside emphasis, backslash escaping is necessary:

We *\(can\)* have it.

Here ( and ) are transparent punctuation characters, just like ., so they must be turned into other characters to go inside the emphasis. This is a corner case and should not be common in practice.
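
The flanking rules, including transparent punctuation, can be sketched as two predicates over the characters on either side of a delimiter run. This is a simplified model, not MMark's code: all names are hypothetical, the start and end of input are assumed to behave like white space, neighbouring markup characters are not handled, and the transparent set contains only the three characters the text confirms.

```haskell
-- | How a neighbouring character looks to a delimiter run.
data Neighbour = SpaceLike | WordLike
  deriving (Eq, Show)

-- | Punctuation treated as transparent in this sketch. Only '.', '(' and ')'
-- are confirmed by the text; the real set is not reproduced here.
transparent :: String
transparent = ".()"

-- | Classify the character next to a delimiter run. 'Nothing' stands for the
-- start or end of input (assumed to act like white space). The 'Bool' says
-- whether the character was backslash-escaped.
neighbour :: Maybe (Bool, Char) -> Neighbour
neighbour Nothing = SpaceLike
neighbour (Just (escaped, c))
  | c `elem` " \t\n\r" = SpaceLike
  | c `elem` transparent && not escaped = SpaceLike
  | otherwise = WordLike

-- | A run can open emphasis if it sits between a space-like character and a
-- word-like character, and close it in the mirrored situation.
leftFlanking, rightFlanking :: Maybe (Bool, Char) -> Maybe (Bool, Char) -> Bool
leftFlanking before after = neighbour before == SpaceLike && neighbour after == WordLike
rightFlanking before after = neighbour before == WordLike && neighbour after == SpaceLike
```

In this model the closing * in `Here it *goes*.` is right-flanking because the full stop counts as white space, the second * in `*Something * is not right.` is not, and the opening * in `*\(can\)*` is left-flanking only because the parenthesis is escaped.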

So far the main limitation of this approach is the pains with inter-word markup, as in this example:

**We started to work on the *issue*.**

Should we escape . here? On one hand we should, to close **. But if we do, the closing * won't be in right-flanking position anymore. God dammit.

Other differences

Block-level parsing:

If a line starts with hash signs, it is expected to be a valid non-empty header (level 1–6 inclusive). If you want to start a paragraph with hashes, just escape the first hash with a backslash and that will be enough.
Setext headings are not supported for the sake of simplicity.
Fenced code blocks must be explicitly closed by a closing fence. They are not closed by the end of document or by start of another block.
Lists and block quotes are defined by the column at which their content starts. Content belonging to a particular list or block quote should start at the same column (or a greater column, up to the column where indented code blocks start). As a consequence of this, block quotes do not feature “laziness”.
Block quotes are started by a single > character. It's not necessary to put a > character at the beginning of every line belonging to a quote (in fact, this would make every line a separate block quote).
Paragraphs can be interrupted by unordered and ordered lists with any valid starting index.
HTML blocks are not supported because the syntax conflicts with autolinks and the feature is a hack to compensate for the lack of extensibility and customization in the original markdown.

Inline-level parsing:

MMark does not support hard line breaks represented as a double space before a newline. Nevertheless, hard line breaks in the form of a backslash before a newline are supported (these are more explicit too).
All URI references (in links, images, autolinks, etc.) are parsed as per RFC 3986; no support for escaping or for entity and numeric character references is provided. In addition to that, when a URI reference is not enclosed in < and >, the closing parenthesis character ) is not considered part of the URI (use <uri> syntax if you want a closing parenthesis as part of a URI). Since the empty string is a valid URI and it may be confusing in some cases, we also force the user to write <> to represent the empty URI.
Putting links in the text of another link is not allowed, i.e. nested links are not possible.
Putting images in description of other images is not allowed (similarly to the situation with links).
HTML inlines are not supported for the same reason that HTML blocks are not supported.

About MMark-specific extensions

A YAML block must start with three hyphens --- and end with three hyphens ---. It can only be placed at the beginning of a markdown document. Trailing white space after the --- sequences is allowed.
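
The placement rules for the YAML block can be sketched as a function that splits a document into the YAML lines and the remaining body. The names `isFence` and `splitYaml` are hypothetical, and the sketch makes one simplifying assumption: a document with an opening --- but no closing --- is treated as having no YAML block, which is not necessarily what MMark itself does.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- | A fence line is exactly "---", optionally followed by white space.
isFence :: String -> Bool
isFence l = dropWhileEnd isSpace l == "---"

-- | Split a document into the lines of its YAML block (if there is one) and
-- the lines of the body. The block is only recognized on the first line.
splitYaml :: String -> (Maybe [String], [String])
splitYaml doc =
  case lines doc of
    (l : ls)
      | isFence l
      , (yaml, _close : body) <- break isFence ls ->
          (Just yaml, body)
    ls -> (Nothing, ls)
```

For example, a document that begins with `---`, then `title: x`, then `--- ` (with trailing white space) yields `Just ["title: x"]` plus the body, while a `---` pair that appears after the first line is left in the body untouched.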

Performance

I have compared speed and memory consumption of various Haskell markdown libraries by running them on an identical, big-enough markdown document and by rendering it as HTML:

Library	Execution time	Allocated	Max residency	Parsing library
`cmark-0.5.6`	325.5 μs	228,440	9,608	Custom C code
`mmark-0.0.4.0`	8.526 ms	36,282,776	313,632	Megaparsec
`cheapskate-0.1.1`	10.84 ms	44,686,272	799,200	Custom Haskell code
`markdown-0.1.16` †	14.14 ms	69,261,816	699,656	Attoparsec
`pandoc-2.0.5`	38.32 ms	141,868,840	1,471,080	Parsec

Results are ordered from fastest to slowest.
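
To read the table as relative speeds, the execution times can be normalized against one library. The sketch below copies the timings from the table (converted to microseconds); `timings` and `relativeTo` are hypothetical names used only for this illustration.

```haskell
-- | Execution times copied from the table above, converted to microseconds.
timings :: [(String, Double)]
timings =
  [ ("cmark-0.5.6", 325.5)
  , ("mmark-0.0.4.0", 8526)
  , ("cheapskate-0.1.1", 10840)
  , ("markdown-0.1.16", 14140)
  , ("pandoc-2.0.5", 38320)
  ]

-- | Express every timing as a multiple of the named baseline library.
-- Returns 'Nothing' if the baseline is not in the list.
relativeTo :: String -> [(String, Double)] -> Maybe [(String, Double)]
relativeTo name ts = do
  base <- lookup name ts
  pure [(n, t / base) | (n, t) <- ts]
```

With `relativeTo "mmark-0.0.4.0" timings`, mmark sits at 1.0, cmark at roughly 0.04, and pandoc at roughly 4.5.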

† The markdown library is sloppy and parses markdown incorrectly. For example, it parses the following *My * text as an inline containing emphasis, while in reality both asterisks must form flanking delimiter runs to create emphasis, like so *My* text. This allowed markdown to get away with a far simpler approach to parsing at the price that it's not really a valid markdown implementation.

Contribution

Issues, bugs, and questions may be reported in the GitHub issue tracker for this project.

Pull requests are also welcome and will be reviewed quickly.

License

Distributed under the BSD 3-Clause license.