zenacy-html: A standard compliant HTML parsing library

[ library, mit, program, web ]

Zenacy HTML is an HTML parsing and processing library that implements the WHATWG HTML parsing standard. The standard is described as a state machine, which this library implements exactly as spelled out, including all the error handling, recovery, and conformance checks that make it robust in handling any HTML pulled from the web. In addition to parsing, the library provides many processing features to help extract information from web pages, or to rewrite them and render the modified results.


Versions: 2.0.0, 2.0.1, 2.0.2
Change log: CHANGES.md
Dependencies: base (==4.*), bytestring (>=0.10.6.0 && <0.11), containers (>=0.5.7.1 && <0.7), data-default (>=0.7.1.1 && <0.8), dlist (>=0.8 && <1.1), extra (>=1.4 && <1.8), mtl (>=2.1 && <2.3), pretty-show (>=1.6 && <1.11), safe (>=0.3.14 && <0.4), safe-exceptions (>=0.1.5.0 && <0.2), text (>=1.2.2.0 && <1.3), transformers (>=0.5.2 && <0.6), vector (>=0.11 && <0.13), word8 (>=0.1.2 && <0.2), zenacy-html
License: MIT
Copyright: Copyright (C) 2015-2020 Michael P Williams
Author: Michael Williams <mlcfp@icloud.com>
Maintainer: Michael Williams <mlcfp@icloud.com>
Category: Web
Home page: https://github.com/mlcfp/zenacy-html
Source repo: head: git clone https://github.com/mlcfp/zenacy-html.git
Uploaded: by mlcfp at 2020-08-26T14:01:05Z
Distributions: NixOS:2.0.2
Executables: zenacy-html-exe
Downloads: 133 total (7 in the last 30 days)

Readme for zenacy-html-2.0.2

Zenacy HTML

Introduction

The Zenacy HTML parser is an implementation of the HTML parsing standard defined by the WHATWG.

https://html.spec.whatwg.org/multipage/parsing.html

The standard defines a parsing state machine, so it is very prescriptive about how HTML is handled, including many edge cases and error recovery. This library follows the standard closely so that the code can be matched back to the standard, which should keep future updates straightforward.
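
To give a feel for what "a parsing state machine" means, here is a minimal sketch of a tokenizer written in that style. It is not the library's implementation: the Token and State types and the tokenize function are hypothetical names made up for this illustration, only four states loosely named after those in the standard are modelled, and attributes, comments, and character references are ignored (so everything up to the closing ">" is treated as the tag name).

```haskell
import Data.Char (isAsciiLower, isAsciiUpper, toLower)

-- | Tokens produced by the toy tokenizer (hypothetical, not the library's types).
data Token
  = Character Char
  | StartTag String
  | EndTag String
  deriving (Eq, Show)

-- | A tiny subset of tokenizer states, loosely named after the standard's.
data State
  = Data
  | TagOpen
  | EndTagOpen
  | TagName Bool String
  -- ^ True for an end tag; the name is accumulated in reverse.

-- | Run the state machine over the input, one character at a time.
tokenize :: String -> [Token]
tokenize = go Data
  where
    isAsciiAlpha c = isAsciiLower c || isAsciiUpper c

    emit isEnd acc = (if isEnd then EndTag else StartTag) (reverse acc)

    -- Data state: plain text until a '<' is seen.
    go Data ('<' : cs) = go TagOpen cs
    go Data (c : cs) = Character c : go Data cs
    go Data [] = []

    -- Tag open state: decide what the '<' actually starts.
    go TagOpen ('/' : cs) = go EndTagOpen cs
    go TagOpen (c : cs)
      | isAsciiAlpha c = go (TagName False []) (c : cs)
      -- Recovery: not a tag after all, so emit '<' as text and reconsume c.
      | otherwise = Character '<' : go Data (c : cs)
    go TagOpen [] = [Character '<']

    -- End tag open state.
    go EndTagOpen (c : cs)
      | isAsciiAlpha c = go (TagName True []) (c : cs)
      -- Simplification: discard the malformed end tag up to and including '>'.
      | otherwise = go Data (drop 1 (dropWhile (/= '>') (c : cs)))
    go EndTagOpen [] = [Character '<', Character '/']

    -- Tag name state: accumulate the lowercased name until '>'.
    go (TagName isEnd acc) ('>' : cs) = emit isEnd acc : go Data cs
    go (TagName isEnd acc) (c : cs) = go (TagName isEnd (toLower c : acc)) cs
    -- End of input inside a tag: the unfinished tag is dropped.
    go (TagName _ _) [] = []
```

In GHCi, tokenize "<B>a</b>" evaluates to [StartTag "b",Character 'a',EndTag "b"], while the stray "<" in tokenize "1 < 2" comes back as ordinary text. The point of the style is that each recovery rule lives in exactly one state transition, which is what makes it practical to match code against the standard clause by clause.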

One of the main uses of an HTML parser is extracting information from the web. A parser that can handle all the nuances of poorly formatted HTML helps to make this extraction as robust as possible, and this was a key motivation for implementing the parser in this fashion. Additionally, the standard describes the algorithms needed to produce the correct document structure. Applications that are sensitive to the document structure, such as those that extract or rewrite large portions of a web page, may benefit from Zenacy HTML.

The library provides a wide variety of features including:

  • A fully standards-compliant HTML parser
  • HTML Fragment parsing
  • Document rendering
  • A zipper type for document traversal
  • An iterator type for document walking
  • Various functions for processing aspects of HTML
  • Lightweight queries for rewriting

Parsing

The library is designed to be imported unqualified.

import Zenacy.HTML

The htmlParseEasy function can be used to parse an HTML document string and return the document model.

htmlParseEasy "<div>HelloWorld</div>"

The result is shown below. Note that the missing html, head, and body elements were automatically added to the document structure, as required by the standard.

HTMLDocument ""
  [ HTMLElement "html" HTMLNamespaceHTML []
    [ HTMLElement "head" HTMLNamespaceHTML [] []
    , HTMLElement "body" HTMLNamespaceHTML []
      [ HTMLElement "div" HTMLNamespaceHTML []
        [ HTMLText "HelloWorld" ] ] ] ]

The parsed result can also be rendered using htmlRender.

htmlRender $ htmlParseEasy "<div>HelloWorld</div>"

The resulting rendered document appears like so.

<html><head></head><body><div>HelloWorld</div></body></html>

Rewriting

This example illustrates a function that converts span elements to divs.

rewrite :: Text -> Text
rewrite = htmlRender . htmlMapElem f . fromJust . htmlDocHtml . htmlParseEasy
  where
    f x
      | htmlElemHasName "span" x = htmlElemRename "div" x
      | otherwise = x

rewrite "<span>Hello</span><span>World</span>"

Running the above gives the modified document.

<html><head></head><body><div>Hello</div><div>World</div></body></html>

Extraction

The next example shows one way to find all the hyperlinks in a document. This solution recurses over the document elements while ignoring fragments and templates.

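-- Requires the LambdaCase and OverloadedStrings extensions.
extract :: Text -> [Text]
extract = go . htmlParseEasy
  where
    go = \case
      -- Descend from the document root into its children.
      HTMLDocument _ c ->
        concatMap go c
      -- Anchor element: collect its href (if any), then keep descending.
      e@(HTMLElement "a" _ _ c) ->
        case htmlElemAttrFind (htmlAttrHasName "href") e of
          Just (HTMLAttr _ v _) ->
            v : concatMap go c
          Nothing ->
            concatMap go c
      -- Any other element: just descend.
      HTMLElement _ _ _ c ->
        concatMap go c
      -- Text, comments, fragments, templates, and so on are ignored.
      _otherwise ->
        []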

extract "<a href=\"https://example1.com\"></a><a href=\"https://example2.com\"></a>"

The extract function will give the following list.

[ "https://example1.com"
, "https://example2.com"
]

Queries

The library includes a basic query facility implemented as a thin wrapper around an HTMLZipper. Queries match patterns in HTML structures and can be used to extract information or update documents. As a first example, consider the following HTML.

<p>
  <span id="x" class="y z"></span>
  <br>
  <a href="bbb">AAA</a>
  <img>
</p>

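The HTML can be parsed as normal. Note, though, the additional whitespace removal step, which is often important for documents that include indentation, as the one above does. Without it, the whitespace between elements would be present in the document as text nodes.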

fromJust . htmlSpaceRemove . fromJust . htmlDocBody . htmlParseEasy

Now a query function can be defined. This function expects to be given a body element whose first child is a p element, whose first child in turn has an id of x, and whose second sibling after that is an anchor element. If all of those conditions are met, the text content of the anchor is returned.

query :: HTMLNode -> Maybe Text
query = htmlQueryExec $ do
  htmlQueryName "body"
  htmlQueryFirst
  htmlQueryName "p"
  htmlQueryFirst
  htmlQueryId "x"
  htmlQueryNext
  htmlQueryNext
  htmlQueryName "a"
  a <- htmlQueryNode
  htmlQuerySucc $
    fromMaybe "" $ htmlElemText a

Running the query on the parsed document will give the result.

Just "AAA"
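
The way such a query walks a document and fails as soon as one step does not match can be sketched with a toy zipper, where each step either moves the focus or yields Nothing. This is only an illustration under stated assumptions: Node, Crumb, Loc, first, next, named, up, and toyQuery are hypothetical names rather than the library's HTMLZipper or query API, and because the toy nodes carry only a name, the sketch checks for a span element where the query above checks for an id of x.

```haskell
import Control.Monad ((>=>))

-- | A toy document node: an element name and its children.
data Node = Node String [Node]
  deriving (Eq, Show)

-- | One step of context: parent name, left siblings (nearest first),
-- and right siblings.
data Crumb = Crumb String [Node] [Node]

-- | A zipper location: the focused node and the path back to the root.
data Loc = Loc Node [Crumb]

-- | Start at the root of a tree.
fromNode :: Node -> Loc
fromNode n = Loc n []

-- | The node currently in focus.
focus :: Loc -> Node
focus (Loc n _) = n

-- | Move to the first child, failing if there are no children.
first :: Loc -> Maybe Loc
first (Loc (Node n (k : ks)) bs) = Just (Loc k (Crumb n [] ks : bs))
first _ = Nothing

-- | Move to the next sibling, failing at the last sibling or at the root.
next :: Loc -> Maybe Loc
next (Loc x (Crumb n ls (r : rs) : bs)) = Just (Loc r (Crumb n (x : ls) rs : bs))
next _ = Nothing

-- | Move to the parent, rebuilding it from the stored siblings.
up :: Loc -> Maybe Loc
up (Loc x (Crumb n ls rs : bs)) = Just (Loc (Node n (reverse ls ++ x : rs)) bs)
up _ = Nothing

-- | Succeed only if the focused node has the given name.
named :: String -> Loc -> Maybe Loc
named s loc@(Loc (Node n _) _)
  | n == s = Just loc
  | otherwise = Nothing

-- | The same shape as the query above: body, first child p, first child
-- span, two siblings along, which must be an anchor.
toyQuery :: Node -> Maybe Node
toyQuery = fmap focus . steps . fromNode
  where
    steps =
      named "body" >=> first >=> named "p" >=> first
        >=> named "span" >=> next >=> next >=> named "a"

-- | The example document with whitespace already removed.
sample :: Node
sample =
  Node "body"
    [ Node "p" [Node "span" [], Node "br" [], Node "a" [], Node "img" []] ]
```

With these definitions, toyQuery sample evaluates to Just (Node "a" []). If the br element is removed from the sample, the second next lands on the img instead, the final named "a" check fails, and the whole query yields Nothing.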

Queries can also be used to modify documents. In the next example, let's say we would like to find any div whose only content is an img and replace that div with a link. The document could look as follows.

<section><div><img src="aaa"></div></section>
<section><div><img src="bbb"></div></section>
<section><div><img src="ccc"></div></section>

A query function can be defined to match the desired pattern and return the modified element.

query2 :: HTMLNode -> HTMLNode
query2 = htmlQueryTry $ do
  htmlQueryName "div"
  htmlQueryOnly "img"
  a <- htmlQueryNode
  let Just b = htmlElemGetAttr "src" a
  htmlQuerySucc $
    htmlElem "a" [ htmlAttr "href" b ]
      [ htmlText b ]

The query can then be applied to the entire document using htmlMapElem.

htmlMapElem query2

Rendering the mapped query will give the updated content.

<section><a href="aaa">aaa</a></section>
<section><a href="bbb">bbb</a></section>
<section><a href="ccc">ccc</a></section>

Samples

The unit tests include the above samples as well as many other example usages of the library.

Origin

Zenacy HTML was originally developed for Zenacy Reader Technologies LLC starting around 2015 and was used in a web reading SaaS for a few years. The need to understand and handle the wide variety and subtleties of HTML found on the web led to the development of a library that closely follows the standard. The library was tweaked and optimized a bit, and though there is room for more improvement, the result worked quite well in production (a lot of credit goes to the GHC team and the Haskell community for providing such great, fast functional programming tools).