xmlhtml: XML parser and renderer with HTML 5 quirks mode

[ bsd3, library, text, xml ]

Contains renderers and parsers for both XML and HTML 5 document fragments, which share data structures so that it's easy to work with both. Document fragments are pieces of documents that are not constrained by some of the high-level structure rules (in particular, they may contain more than one root element).

Note that this is not a compliant HTML 5 parser. Rather, it is a parser for HTML 5 compliant documents. It does not implement the HTML 5 parsing algorithm, and should generally be expected to perform correctly only on documents that you trust to conform to HTML 5. This is not a suitable library for implementing web crawlers or other software that will be exposed to documents from outside sources. The result is also not the HTML 5 node structure, but rather something closer to the physical structure. For example, omitted start tags are not inserted (and so, their corresponding end tags must also be omitted).



Downloads

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions: 0.1, 0.1.0.1, 0.1.1, 0.1.2, 0.1.3, 0.1.4, 0.1.5, 0.1.5.1, 0.1.5.2, 0.1.6, 0.1.7, 0.2.0, 0.2.0.1, 0.2.0.2, 0.2.0.3, 0.2.0.4, 0.2.1, 0.2.2, 0.2.3, 0.2.3.1, 0.2.3.2, 0.2.3.3, 0.2.3.4, 0.2.3.5, 0.2.4, 0.2.5, 0.2.5.1, 0.2.5.2, 0.2.5.3, 0.2.5.4
Dependencies: base (>=4 && <5), blaze-builder (>=0.2 && <0.3), blaze-html (>=0.3 && <0.4), bytestring (>=0.9 && <0.10), containers (>=0.3 && <0.5), parsec (>=3.0 && <3.1.6), text (>=0.11 && <0.12)
License: BSD-3-Clause
Author: Chris Smith <cdsmith@gmail.com>
Maintainer: Chris Smith <cdsmith@gmail.com>
Revised: Revision 1 made by HerbertValerioRiedel at 2017-08-14T20:57:07Z
Category: Text
Uploaded: by ChrisSmith at 2011-02-06T16:53:22Z
Distributions: Debian:0.2.5.2, FreeBSD:0.2.3.4, LTSHaskell:0.2.5.4, NixOS:0.2.5.4, Stackage:0.2.5.4
Reverse Dependencies: 29 direct, 56 indirect
Downloads: 42473 total (69 in the last 30 days)
Status: Docs uploaded by user
Build status: unknown (no reports yet)

Readme for xmlhtml-0.1.0.1

xmlhtml - XML and HTML 5 parsing and rendering

This library implements both parsers and renderers for XML and HTML 5 document
fragments.  The two share data structures to represent the document tree, so
that you can write code to easily work with either XML or HTML 5.  Convenience
functions are also available to work with the internal data structure in
several natural ways.

Caveats:

- Both parsers are written to parse document fragments, not complete
  documents.  This means that they do not enforce rules about overall
  document structure.  There does not need to be only a single root node,
  and the HTML 5 implementation never inserts any missing start tags.

- The XML parser cannot handle processing instructions or defined entities.
  It will silently drop processing instructions, and will fail if it
  encounters an entity reference for anything but the predefined entities
  (apos, quot, amp, lt, and gt).

- The HTML parser is really an XML parser with HTML 5 quirks mode.  It should
  be just fine for parsing documents that conform to the HTML 5 specification.
  However, it is *not* a compliant HTML 5 parser, as compliant parsers are
  required to be compatible with non-compliant documents in many ways that we
  aren't interested in.  So this is a great basis for a template system, for
  example, but a very poor basis for a web browser or web spider.
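
The caveats above can be probed with a small experiment. The following is a minimal sketch, assuming that `parseXML` takes a source name (used only in error messages) followed by the input `ByteString` and returns `Either String Document`, and that `Document` exposes its top-level nodes through a `docContent` field. The source name `"fragment"` and the sample inputs are placeholders. The program only reports what the parser does, so it can be run against whichever version of the library is installed.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import Text.XmlHtml    (Document (..), parseXML)

-- Summarise what the XML parser makes of an input fragment.
-- Assumption: the first argument to parseXML is just a name for error
-- messages; "fragment" is a placeholder, not a real file.
describe :: ByteString -> String
describe input =
    case parseXML "fragment" input of
        Left err  -> "parse failed: " ++ err
        Right doc -> show (length (docContent doc)) ++ " top-level node(s)"

main :: IO ()
main = mapM_ (putStrLn . describe)
    [ "<a/><b/>"                    -- two root elements in one fragment
    , "<?pi data?><a/>"             -- processing instruction before an element
    , "<p>&amp; is predefined</p>"  -- one of the five predefined entities
    , "<p>&nbsp; is not</p>"        -- entity outside the predefined five
    ]
```

Comparing the four lines of output against the caveats is a quick way to confirm how a given release treats multiple roots, processing instructions, and entity references.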

To get started, just use the parseHTML or parseXML functions from Text.XmlHtml
to parse a ByteString into a document tree.  On the other side, use render to
write the document tree back to a ByteString.
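
As a rough illustration of that round trip, here is a minimal sketch. It assumes that `parseXML` has the type `String -> ByteString -> Either String Document` and that `render` produces a blaze-builder `Builder`, which `toByteString` from `Blaze.ByteString.Builder` then turns into a strict `ByteString`. Check both against the version in use. The name `roundTrip`, the source name `"sample.xml"`, and the sample input are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Blaze.ByteString.Builder (toByteString)
import           Data.ByteString          (ByteString)
import qualified Data.ByteString.Char8    as B
import           Text.XmlHtml             (parseXML, render)

-- Hypothetical input; any ByteString source would do.
sample :: ByteString
sample = "<greeting lang=\"en\">Hello, <b>world</b>!</greeting>"

-- Parse a fragment and render it straight back to a strict ByteString.
-- Assumption: render returns a Builder rather than a ByteString, so it is
-- converted explicitly with toByteString.
roundTrip :: ByteString -> Either String ByteString
roundTrip input = fmap (toByteString . render) (parseXML "sample.xml" input)

main :: IO ()
main =
    case roundTrip sample of
        Left err  -> putStrLn ("parse error: " ++ err)
        Right out -> B.putStrLn out
```

Swapping `parseXML` for `parseHTML` should be the only change needed to run the same round trip on an HTML 5 fragment, since both are described as producing the same document tree type.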

Working with document trees is easily done in two ways.

1. Text.XmlHtml exports the document tree types (notably, Document and Node)
   and functions like getAttribute, setAttribute, tagName, childNodes, etc. for
   working with them.

2. Text.XmlHtml.Cursor exports a zipper for node forests, which you can use to
   navigate and modify the document tree positionally.
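
Both approaches can be sketched side by side. The example below assumes the `Node` constructors `Element` and `TextNode`, the functions named above, and the cursor operations `fromNode`, `firstChild`, `right`, `modifyNode`, and `topNode` from `Text.XmlHtml.Cursor`. The sample nodes and the helper names `retarget` and `markSecond` are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Text.XmlHtml        (Node (..), childNodes, getAttribute,
                                      setAttribute, tagName)
import qualified Text.XmlHtml.Cursor as C

-- Hypothetical sample nodes, built directly from the Node constructors.
link :: Node
link = Element "a" [("href", "/old")] [TextNode "home"]

items :: Node
items = Element "ul" []
    [ Element "li" [] [TextNode "one"]
    , Element "li" [] [TextNode "two"]
    ]

-- Approach 1: plain functions on Node.
retarget :: Node -> Node
retarget = setAttribute "href" "/new"

-- Approach 2: navigate with the zipper, edit in place, then rebuild the tree.
-- Returns Nothing if the root has fewer than two children.
markSecond :: Node -> Maybe Node
markSecond root = do
    first  <- C.firstChild (C.fromNode root)  -- move down to the first child
    second <- C.right first                   -- step to its right sibling
    return (C.topNode (C.modifyNode (setAttribute "class" "selected") second))

main :: IO ()
main = do
    print (tagName link)
    print (getAttribute "href" (retarget link))
    print (fmap (map (getAttribute "class") . childNodes) (markSecond items))
```

The direct functions suit edits where the target node is already in hand. The cursor is more convenient when the edit depends on position, because it keeps the surrounding context and rebuilds the whole tree with `topNode`.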

That's it, basically.  This is hopefully a pretty simple package to use.

TO DO Items:

1. Do something better with character encodings.  For now, they are basically
   ignored, and we just use the byte order mark to distinguish between the
   three required encodings.  We should implement the encoding sniffing rules
   for both XML (the <?xml ... ?> declaration) and HTML 5.

2. Benchmark and improve performance of the parsers and renderers.

3. Ensure that rendering always gives an error rather than writing an invalid
   document. (Is this a good idea?  It does limit rendering speed.)