The html-parse package

[ Tags: benchmark, bsd3, library, text ]

This package provides a fast and reasonably robust HTML5 tokenizer built upon the attoparsec library. The parsing strategy is based upon the HTML5 parsing specification with few deviations.

The package targets use cases similar to those of the venerable tagsoup library, but is significantly more efficient, achieving parsing speeds of over 50 megabytes per second on modern hardware with typical web documents.

For instance,

>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [],TagOpen "h1" [Attr "class" "widget"],
ContentText "Hello World",TagClose "h1",TagSelfClose "br" []]
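
Because the result is an ordinary list of tokens, it can be post-processed with plain list comprehensions. The following is a minimal sketch, not part of the package itself: it assumes that `parseTokens`, `Token (..)` and `Attr (..)` are exported from a module named `Text.HTML.Parser`, and that `parseTokens` takes strict `Text`. The helper names `visibleText` and `classNames` are hypothetical and exist only for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
-- Assumption: the tokenizer lives in Text.HTML.Parser.
import           Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Concatenate the text of every ContentText token, ignoring markup.
-- (Hypothetical helper, not provided by the package.)
visibleText :: [Token] -> Text
visibleText tokens = T.concat [t | ContentText t <- tokens]

-- | Collect the values of all "class" attributes found on opening tags.
-- (Hypothetical helper, not provided by the package.)
classNames :: [Token] -> [Text]
classNames tokens =
  [ value
  | TagOpen _ attrs <- tokens
  , Attr "class" value <- attrs
  ]

main :: IO ()
main = do
  let tokens = parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
  TIO.putStrLn (visibleText tokens)
  mapM_ TIO.putStrLn (classNames tokens)
```

Tokens that fail a pattern in the comprehensions are simply skipped, so unclosed or unexpected tags in the input do not need special handling here.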

Properties

Versions 0.1.0.0, 0.2.0.0, 0.2.0.1
Dependencies attoparsec (==0.13.*), base (>=4.7 && <4.11), containers (==0.5.*), deepseq (==1.4.*), text (==1.2.*)
License BSD3
Copyright (c) 2016 Ben Gamari
Author Ben Gamari
Maintainer ben@smart-cactus.org
Category Text
Home page http://github.com/bgamari/html-parse
Source repository head: git clone git://github.com/bgamari/html-parse
Uploaded Thu Aug 10 03:52:43 UTC 2017 by BenGamari
Distributions NixOS:0.2.0.1
Downloads 195 total (19 in the last 30 days)
Rating 0.0 (0 ratings)
Status Docs uploaded by user
Build status unknown [no reports yet]
