html-parse: A high-performance HTML tokenizer

[ bsd3, library, text ]

This package provides a fast and reasonably robust HTML5 tokenizer built upon the attoparsec library. The parsing strategy is based upon the HTML5 parsing specification with few deviations.

The package targets use cases similar to those of the venerable tagsoup library but is significantly more efficient, achieving parsing speeds of over 50 megabytes per second on modern hardware with typical web documents.

For instance,

>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [],TagOpen "h1" [Attr "class" "widget"],
ContentText "Hello World",TagClose "h1",TagSelfClose "br" []]
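
Once a document has been tokenized, the resulting list can be processed with ordinary list functions. The following is a minimal sketch, not part of the package itself. It assumes that the tokenizer is exported from a module named Text.HTML.Parser and that tag names, attribute names, and attribute values are Text, neither of which is stated above. The helpers textContent and classValues are hypothetical names introduced only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the module name and export list below are not given in the
-- description above; adjust them to match the installed version.
import           Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Concatenate every ContentText token, ignoring tags and any other
-- token constructors (hypothetical helper).
textContent :: [Token] -> Text
textContent toks = T.concat [t | ContentText t <- toks]

-- | Collect the value of every "class" attribute found on opening or
-- self-closing tags (hypothetical helper).
classValues :: [Token] -> [Text]
classValues toks = [v | tok <- toks, Attr "class" v <- attrsOf tok]
  where
    attrsOf (TagOpen _ as)      = as
    attrsOf (TagSelfClose _ as) = as
    attrsOf _                   = []

main :: IO ()
main = do
  let toks = parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
  TIO.putStrLn (textContent toks)
  mapM_ TIO.putStrLn (classValues toks)
```

Given the token list shown in the example above, this should print the heading text followed by the single class value. Because the tokenizer returns a flat list and does not build a tree, unbalanced markup such as the unclosed div needs no special handling here.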

Dependencies attoparsec (==0.13.*), base (>=4.7 && <4.11), containers (==0.5.*), deepseq (==1.4.*), text (==1.2.*)
License BSD-3-Clause
Copyright (c) 2016 Ben Gamari
Author Ben Gamari
Category Text
Home page
Source repo head: git clone git://
Uploaded by BenGamari at Thu Aug 10 03:52:43 UTC 2017
Distributions NixOS:
Downloads 615 total (13 in the last 30 days)
Rating (no votes yet) [estimated by rule of succession]
Status Docs uploaded by user
Build status unknown [no reports yet]