flatparse: High-performance parsing from strict bytestrings

Flatparse is a high-performance parsing library, focusing on programming languages and human-readable data formats. See the README for more information: https://github.com/AndrasKovacs/flatparse.

Modules

FlatParse
- FlatParse.Basic
- Examples
  - BasicLambda
    - FlatParse.Examples.BasicLambda.Lexer
    - FlatParse.Examples.BasicLambda.Parser
- FlatParse.Stateful

Downloads

flatparse-0.1.0.0.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

AndrasKovacs

Candidates

0.1.0.1

Versions [RSS]	0.1.0.0, 0.1.0.1, 0.1.0.2, 0.1.1.1, 0.1.1.2, 0.2.0.0, 0.2.1.0, 0.2.2.0, 0.3.0.0, 0.3.0.1, 0.3.0.2, 0.3.0.3, 0.3.1.0, 0.3.2.0, 0.3.3.0, 0.3.4.0, 0.3.5.0, 0.3.5.1, 0.4.0.0, 0.4.0.1, 0.4.0.2, 0.4.1.0, 0.5.0.0, 0.5.0.1, 0.5.0.2, 0.5.1.0, 0.5.1.1, 0.5.2.0, 0.5.2.1, 0.5.3.0, 0.5.3.1
Dependencies	attoparsec, base (>=4.7 && <5), bytestring, containers, flatparse, gauge, megaparsec, parsec, template-haskell [details]
Tested with	ghc ==8.8.4
License	MIT
Copyright	2021 András Kovács
Author	András Kovács
Maintainer	puttamalac@gmail.com
Uploaded	by AndrasKovacs at 2021-03-15T18:52:05Z
Category	Parsing
Home page	https://github.com/AndrasKovacs/flatparse#readme
Bug tracker	https://github.com/AndrasKovacs/flatparse/issues
Source repo	head: git clone https://github.com/AndrasKovacs/flatparse
Distributions	LTSHaskell:0.5.3.1, NixOS:0.5.3.1, Stackage:0.5.3.1
Reverse Dependencies	11 direct, 30 indirect [details]
Executables	test, bench
Downloads	5354 total (90 in the last 30 days)
Rating	(no votes yet) [estimated by Bayesian average]
Status	Docs not available [build log] All reported builds failed as of 2021-03-15 [all 3 reports]

Readme for flatparse-0.1.0.0

flatparse

flatparse is a high-performance parsing library, focusing on programming languages and human-readable data formats. The "flat" in the name refers to the ByteString parsing input, which has pinned contiguous data, and also to the library internals, which avoid indirections and heap allocations whenever possible.

Features and non-features

- Excellent performance. On microbenchmarks, flatparse is at least 10 times faster than attoparsec or megaparsec. On larger examples with heavier use of source positions and spans and/or indentation parsing, the performance difference grows to 20-30 times. Compile times and executable sizes are also significantly better with flatparse than with megaparsec or attoparsec. flatparse internals make liberal use of unboxed tuples and GHC primops. As a result, pure validators (parsers returning ()) in flatparse are not difficult to implement with zero heap allocation.
- No incremental parsing, and only strict ByteString is supported as input. However, it can still be useful to convert from Text, String or other types to ByteString, and then use flatparse for parsing, since flatparse performance usually more than makes up for the conversion costs.
- Only little-endian 64-bit systems are currently supported. This may change in the future. Getting good performance requires architecture-specific optimizations; I've only considered the most common setting at this point.
- Support for fast source location handling, indentation parsing and informative error messages. flatparse provides a low-level interface to these. Batteries are not included, but it should be possible for users to build custom solutions, which are more sophisticated, but still as fast as possible. In my experience, the included batteries in other libraries often come with major unavoidable overheads, and often we still have to extend existing machinery in order to scale to production features.
- The backtracking model of flatparse is different from that of the parsec libraries, and is closer to the nom library in Rust. The idea is that parser failure is distinguished from parsing error. The former is used for control flow, and we can backtrack from it. The latter is used for unrecoverable errors, and by default it's propagated to the top. flatparse does not track whether parsers have consumed input. In my experience, what we really care about is the failure/error distinction, and in parsec or megaparsec the consumed/non-consumed separation is often muddled and discarded in larger parser implementations. By default, basic flatparse parsers can fail but cannot throw errors, with the exception of the specifically error-throwing operations. Hence, flatparse users have to be mindful about grammar, and explicitly insert errors where it is known that the input can't be valid.
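
The failure/error distinction described in the last item can be illustrated with a small toy model. This is a sketch only, not flatparse's implementation or API: the names Result, orElse, orError and string are hypothetical and chosen for this example, and the real library uses unboxed representations rather than an ordinary data type.

```haskell
import qualified Data.ByteString.Char8 as B

-- Toy three-way result (hypothetical, for illustration only).
data Result e a
  = OK a B.ByteString  -- success: parsed value and remaining input
  | Fail               -- recoverable failure: an alternative may be tried
  | Err e              -- unrecoverable error: propagates to the top
  deriving (Eq, Show)

newtype Parser e a = Parser { runParser :: B.ByteString -> Result e a }

instance Functor (Parser e) where
  fmap f (Parser p) = Parser $ \s -> case p s of
    OK a rest -> OK (f a) rest
    Fail      -> Fail
    Err e     -> Err e

instance Applicative (Parser e) where
  pure a = Parser (OK a)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    OK f rest -> case pa rest of
      OK a rest' -> OK (f a) rest'
      Fail       -> Fail
      Err e      -> Err e
    Fail  -> Fail
    Err e -> Err e

-- Left-biased choice. It backtracks on Fail no matter how much input the
-- first parser looked at, and it never backtracks on Err.
orElse :: Parser e a -> Parser e a -> Parser e a
orElse (Parser p) (Parser q) = Parser $ \s -> case p s of
  Fail -> q s
  r    -> r

-- Turn a failure into an error, marking a point of no return in the grammar.
orError :: Parser e a -> e -> Parser e a
orError (Parser p) e = Parser $ \s -> case p s of
  Fail -> Err e
  r    -> r

-- Match a literal prefix; fails (recoverably) if it is absent.
string :: String -> Parser e ()
string lit = Parser $ \s ->
  let l = B.pack lit
  in if l `B.isPrefixOf` s then OK () (B.drop (B.length l) s) else Fail

-- After "let " has been seen, a missing identifier is an error, not a failure.
letStmt :: Parser String ()
letStmt = string "let " *> (string "x" `orError` "expected identifier")

-- The same grammar without an explicit error: every mismatch is a failure.
letStmtNoError :: Parser String ()
letStmtNoError = string "let " *> string "x"

stmt, stmtNoError :: Parser String ()
stmt        = letStmt `orElse` string "print"
stmtNoError = letStmtNoError `orElse` string "print"
```

In this model, running stmt on "print" succeeds through the second alternative, because the mismatch on "let " is only a failure. Running stmt on "let y" yields Err "expected identifier" and the second alternative is never tried, while stmtNoError on the same input backtracks to the start, tries "print", and ends in a plain Fail.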

flatparse comes in two flavors: FlatParse.Basic and FlatParse.Stateful. Both support a custom error type and a custom reader environment.

- FlatParse.Basic only supports the above features. If you don't need indentation parsing, this is sufficient.
- FlatParse.Stateful additionally supports a built-in Int worth of internal state. This can support a wide range of indentation parsing features. There is a slight overhead in performance and code size compared to Basic. However, in small parsers and microbenchmarks the difference between Basic and Stateful is often reduced to near zero by GHC and LLVM optimization. The difference is more marked if we use the native code backend instead of LLVM.
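
To make the idea of "an Int worth of state" concrete, here is a toy model of a parser that threads a single Int alongside the input and uses it as the current indentation level. It is a sketch under assumptions, not the FlatParse.Stateful API: the names Result, deeper and word are hypothetical, and error and reader support are left out for brevity.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (isAlpha)

-- Toy result carrying the updated Int state (hypothetical, for illustration).
data Result a
  = OK a Int B.ByteString  -- value, updated state, remaining input
  | Fail
  deriving (Eq, Show)

-- A parser receives the current Int state together with the input.
newtype Parser a = Parser { runParser :: Int -> B.ByteString -> Result a }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \n s -> case p n s of
    OK a n' rest -> OK (f a) n' rest
    Fail         -> Fail

instance Applicative Parser where
  pure a = Parser $ \n s -> OK a n s
  Parser pf <*> Parser pa = Parser $ \n s -> case pf n s of
    OK f n' rest -> case pa n' rest of   -- the state flows left to right
      OK a n'' rest' -> OK (f a) n'' rest'
      Fail           -> Fail
    Fail -> Fail

-- Consume leading spaces. Succeed only if the column is strictly greater
-- than the indentation level stored in the state, and store the new level.
deeper :: Parser Int
deeper = Parser $ \n s ->
  let (ws, rest) = B.span (== ' ') s
      col        = B.length ws
  in if col > n then OK col col rest else Fail

-- Parse a non-empty run of letters; the state is passed through unchanged.
word :: Parser B.ByteString
word = Parser $ \n s ->
  let (w, rest) = B.span isAlpha s
  in if B.null w then Fail else OK w n rest

-- A word that must be indented deeper than the enclosing level.
nestedWord :: Parser B.ByteString
nestedWord = deeper *> word
```

With an enclosing level of 2, nestedWord accepts the input "    foo bar", returns "foo", and leaves the state at 4. With an enclosing level of 4 the same input fails, because the line is not indented any deeper.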

The reason for baking a reader into the parsers is that if we need it, it's convenient, and if we don't, then GHC very reliably optimizes unused environments away. In contrast, GHC optimizes much less reliably if we try to wrap the existing Reader from transformers around our parsers.

Tutorial

Informative tutorials are a work in progress. See src/FlatParse/Examples for a lexer/parser example with acceptably good error messages.

Some benchmarks

Execution times are listed below. See the source code in bench. Compiled with GHC 8.8.4 -O2 -fllvm.

benchmark	runtime
s-exp/fpbasic	3.365 ms
s-exp/fpstateful	3.421 ms
s-exp/attoparsec	42.84 ms
s-exp/megaparsec	57.54 ms
s-exp/parsec	179.7 ms
long keyword/fpbasic	216.4 μs
long keyword/fpstateful	299.0 μs
long keyword/attoparsec	5.297 ms
long keyword/megaparsec	3.646 ms
long keyword/parsec	49.18 ms
numeral csv/fpbasic	743.5 μs
numeral csv/fpstateful	848.5 μs
numeral csv/attoparsec	20.64 ms
numeral csv/megaparsec	10.12 ms
numeral csv/parsec	78.52 ms
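
As a quick cross-check of the "at least 10 times faster" claim, the speedups of fpbasic over the other libraries can be computed directly from the table. The snippet below is only a convenience sketch: the numbers are copied from the table and converted to microseconds, and the names benchmarks and speedups are made up for this example.

```haskell
-- Runtimes in microseconds, copied from the table above.
-- Each entry: (benchmark, fpbasic runtime, runtimes of the other libraries).
benchmarks :: [(String, Double, [(String, Double)])]
benchmarks =
  [ ("s-exp",        3365,  [("attoparsec", 42840), ("megaparsec", 57540), ("parsec", 179700)])
  , ("long keyword", 216.4, [("attoparsec", 5297),  ("megaparsec", 3646),  ("parsec", 49180)])
  , ("numeral csv",  743.5, [("attoparsec", 20640), ("megaparsec", 10120), ("parsec", 78520)])
  ]

-- Speedup of fpbasic relative to each other library, per benchmark.
speedups :: [(String, String, Double)]
speedups =
  [ (bench, lib, t / base)
  | (bench, base, others) <- benchmarks
  , (lib, t) <- others
  ]
```

Evaluating mapM_ print speedups in GHCi lists all nine ratios. The smallest one is attoparsec on s-exp at roughly 12.7, so every comparison in this table is above the factor of 10.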

Object file sizes for each module containing the s-exp, long keyword and numeral csv benchmarks.

library	object file size (bytes)
fpbasic	26456
fpstateful	30008
attoparsec	83288
megaparsec	188696
parsec	75880