Copyright | (c) 2020-2021 Sam May |
---|---|
License | MPL-2.0 |
Maintainer | ag.eitilt@gmail.com |
Stability | provisional |
Portability | portable |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
This module and the internal branch it heads implement the "Tree Construction" section of the HTML document parsing specification, operating over the output of the Web.Mangrove.Parse.Tokenize stage to produce a DOM tree representation of a web page. As this library is still in the early stages of development, the representation produced here is not actually a proper DOM implementation, but instead only stores basic parameters in an equivalent (but less-featured) structure. Nonetheless, it is still enough for basic evaluation and unstyled rendering.
Synopsis
- data Tree = Tree {}
- data Node
- data QuirksMode
- data Patch
- data TreeState
- data Encoding
- = Utf8
- | Utf16be
- | Utf16le
- | Big5
- | EucJp
- | EucKr
- | Gb18030
- | Gbk
- | Ibm866
- | Iso2022Jp
- | Iso8859_2
- | Iso8859_3
- | Iso8859_4
- | Iso8859_5
- | Iso8859_6
- | Iso8859_7
- | Iso8859_8
- | Iso8859_8i
- | Iso8859_10
- | Iso8859_13
- | Iso8859_14
- | Iso8859_15
- | Iso8859_16
- | Koi8R
- | Koi8U
- | Macintosh
- | MacintoshCyrillic
- | ShiftJis
- | Windows874
- | Windows1250
- | Windows1251
- | Windows1252
- | Windows1253
- | Windows1254
- | Windows1255
- | Windows1256
- | Windows1257
- | Windows1258
- | Replacement
- | UserDefined
- type NodeIndex = Natural
- data ElementParams = ElementParams {}
- emptyElementParams :: ElementParams
- defaultTreeState :: TreeState
- treeEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TreeState -> TreeState
- treeFragment :: ElementParams -> [(NodeIndex, ElementParams)] -> Maybe QuirksMode -> Maybe Bool -> TreeState -> TreeState
- treeInIFrame :: Bool -> TreeState -> TreeState
- tree :: TreeState -> ByteString -> ([Patch], TreeState)
- treeStep :: TreeState -> ByteString -> ([Patch], TreeState, ByteString)
- finalizeTree :: [Patch] -> TreeState -> Tree
Types
Final
data Tree
DOM: tree
The core concept underlying HTML and related languages: a nested collection
of data and metadata marked up according to several broad categories.
Values may be easily instantiated as updates to emptyTree.
data Node
DOM: node
The sum type of all different classes of behaviour a particular point of data may fill.
Text Text | DOM:
A simple character string to be rendered to the output or to be
processed further, according to which |
Comment Text | DOM:
An author's aside, not intended to be shown to the end user. |
DocumentType DocumentTypeParams | DOM:
Largely vestigial in HTML5, but used in previous versions and
related languages to specify the semantics of |
Element ElementParams | DOM:
Markup instructions directing the behaviour or classifying a portion of the document's content. |
Attribute AttributeParams | DOM:
Metadata allowing finer customization and description of the heavier
|
DocumentFragment | DOM:
As like |
Document QuirksMode | DOM:
The root of a |
data QuirksMode #
Through the long history of HTML browsers, many unique and/or buggy behaviours have become enshrined due to the simple fact that website authors used them. As the standards and the parse engines have continued to develop, three separate degrees of emulation have emerged for that backwards compatibility.
NoQuirks | DOM:
Fully compliant with the modern standard. |
LimitedQuirks | DOM:
Largely compliant with the standard, except for a couple of height calculations. |
FullQuirks | DOM:
Backwards compatibility with 1990s-era technology. |
Instances
Intermediate
data Patch
The atomic, self-contained instruction set describing how to build a final document tree.
data TreeState
The collection of data required to extract a list of semantic atoms from a
binary document stream. Values may be easily instantiated as updates to
defaultTreeState.
data Encoding
Encoding: encoding
All character encoding schemes supported by the HTML standard, defined as a
bidirectional map between characters and binary sequences. Utf8 is
strongly encouraged for new content (including all encoding purposes), but
the others are retained for compatibility with existing pages.
Note that none of these are complete functions, to one degree or another, and that no guarantee is made that the mapping round-trips.
Utf8 | The UTF-8 encoding for Unicode. |
Utf16be | The UTF-16 encoding for Unicode, in big endian order. No encoder is provided for this scheme. |
Utf16le | The UTF-16 encoding for Unicode, in little endian order. No encoder is provided for this scheme. |
Big5 | Big5, primarily covering traditional Chinese characters. |
EucJp | EUC-JP, primarily covering Japanese as the union of JIS-0208 and JIS-0212. |
EucKr | EUC-KR, primarily covering Hangul. |
Gb18030 | The GB18030-2005 extension to GBK, with one tweak for web compatibility, primarily covering both forms of Chinese characters. Note that this encoding also includes a large number of four-byte sequences which aren't listed in the linked visualization. |
Gbk | GBK, primarily covering simplified Chinese characters. In practice, this is just |
Ibm866 | DOS and OS/2 code page for Cyrillic characters. |
Iso2022Jp | A Japanese-focused implementation of the ISO 2022 meta-encoding, including both JIS-0208 and halfwidth katakana. |
Iso8859_2 | Latin-2 (Central European). |
Iso8859_3 | Latin-3 (South European and Esperanto) |
Iso8859_4 | Latin-4 (North European). |
Iso8859_5 | |
Iso8859_6 | |
Iso8859_7 | Latin/Greek (modern monotonic). |
Iso8859_8 | Latin/Hebrew (visual order). |
Iso8859_8i | Latin/Hebrew (logical order). |
Iso8859_10 | Latin-6 (Nordic). |
Iso8859_13 | Latin-7 (Baltic Rim). |
Iso8859_14 | Latin-8 (Celtic). |
Iso8859_15 | Latin-9 (revision of ISO 8859-1 Latin-1, Western European). |
Iso8859_16 | Latin-10 (South-Eastern European). |
Koi8R | KOI-8 specialized for Russian Cyrillic. |
Koi8U | KOI-8 specialized for Ukrainian Cyrillic. |
Macintosh | |
MacintoshCyrillic | Mac OS Cyrillic (as of Mac OS 9.0) |
ShiftJis | The Windows variant (code page 932) of Shift JIS. |
Windows874 | ISO 8859-11 Latin/Thai with Windows extensions in the C1 control character slots. Note that this encoding is always used instead of pure Latin/Thai. |
Windows1250 | The Windows extension and rearrangement of ISO 8859-2 Latin-2. |
Windows1251 | |
Windows1252 | The Windows extension of ISO 8859-1 Latin-1, replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-1. |
Windows1253 | Windows Greek (modern monotonic). |
Windows1254 | The Windows extension of ISO 8859-9 Latin-5 (Turkish), replacing most of the C1 control characters with printable glyphs. Note that this encoding is always used instead of pure Latin-5. |
Windows1255 | The Windows extension and rearrangement of ISO 8859-8 Latin/Hebrew. |
Windows1256 | |
Windows1257 | |
Windows1258 | |
Replacement | The input is reduced to a single U+FFFD replacement character. No encoder is provided for this scheme. |
UserDefined | Non-ASCII bytes ( |
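The remark that these mappings are not complete functions, and need not round-trip, can be illustrated with a toy codec. The sketch below is not one of the library's encodings: `decodeByte`, `encodeChar`, `decodeLossy`, and `roundTrip` are hypothetical names for an ASCII-only mapping, chosen only to show where partiality and lossy substitution come from.

```haskell
import Data.Char (chr, ord)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Toy ASCII-only decoder (hypothetical): bytes from 0x80 upward have no mapping.
decodeByte :: Word8 -> Maybe Char
decodeByte b
    | b < 0x80  = Just (chr (fromIntegral b))
    | otherwise = Nothing

-- Toy ASCII-only encoder (hypothetical): characters above U+007F have no mapping.
encodeChar :: Char -> Maybe Word8
encodeChar c
    | ord c < 0x80 = Just (fromIntegral (ord c))
    | otherwise    = Nothing

-- A lossy decoder substitutes U+FFFD for every undecodable byte.
decodeLossy :: [Word8] -> String
decodeLossy = map (fromMaybe '\xFFFD' . decodeByte)

-- Decode lossily, then try to encode the result again.
roundTrip :: [Word8] -> Maybe [Word8]
roundTrip = traverse encodeChar . decodeLossy

main :: IO ()
main = do
    print (decodeByte 0xE9)          -- Nothing: the decoder is partial
    print (encodeChar '\xE9')        -- Nothing: the encoder is partial
    print (roundTrip [0x68, 0x69])   -- Just [104,105]
    print (roundTrip [0x68, 0xE9])   -- Nothing: U+FFFD cannot be re-encoded
```

The last line is the round-trip failure in miniature: once a byte has been replaced during decoding, the original binary sequence can no longer be recovered.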
Instances
Bounded Encoding | |
Enum Encoding | |
Defined in Web.Willow.Common.Encoding.Common | |
Eq Encoding | |
Ord Encoding | |
Defined in Web.Willow.Common.Encoding.Common | |
Read Encoding | |
Show Encoding | |
Hashable Encoding | |
Defined in Web.Willow.Common.Encoding.Common |
type NodeIndex = Natural Source #
Type-level clarification for an identifier that uniquely identifies each element in the document tree, assigned in rough tree order.
The specification assumes a reference-based and mutable memory model, where each element in, e.g., the stack of open elements not only describes the shape of the node, but also is distinct from all other nodes with the same data. Haskell's memory is much less accessible and more likely to be shared, so an extra datapoint needs to be carried along.
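A small sketch of why that extra datapoint matters. `Elem` is a hypothetical stand-in for ElementParams (the real record's fields are not listed here), and `removeByIndex` is an illustrative helper, not part of the library.

```haskell
import Numeric.Natural (Natural)

type NodeIndex = Natural

-- Hypothetical stand-in for ElementParams; only the tag name is modelled.
newtype Elem = Elem { elemName :: String }
    deriving (Eq, Show)

-- Drop the entry carrying the given index, leaving any structurally
-- identical entries with other indices untouched.
removeByIndex :: NodeIndex -> [(NodeIndex, Elem)] -> [(NodeIndex, Elem)]
removeByIndex i = filter ((/= i) . fst)

main :: IO ()
main = do
    let b1 = Elem "b"
        b2 = Elem "b"
        stack :: [(NodeIndex, Elem)]
        stack = [(3, b2), (2, Elem "p"), (1, b1)]
    -- Structural equality cannot tell the two <b> elements apart ...
    print (b1 == b2)                          -- True
    -- ... but the index picks out exactly one of them.
    print (map fst (removeByIndex 3 stack))   -- [2,1]
```

The list here mimics a stack of open elements, most recent first, in the same `(NodeIndex, ElementParams)` pairing that treeFragment accepts for ancestors.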
data ElementParams #
DOM: Element
The collection of metadata identifying and describing a markup tag used to
associate text or other data with its broader role in the document, or to
indicate a preferred rendering. Values may be easily instantiated as
updates to emptyElementParams.
ElementParams | |
|
Instances
Eq ElementParams | |
Defined in Web.Willow.DOM (==) :: ElementParams -> ElementParams -> Bool # (/=) :: ElementParams -> ElementParams -> Bool # | |
Read ElementParams | |
Defined in Web.Willow.DOM readsPrec :: Int -> ReadS ElementParams # readList :: ReadS [ElementParams] # | |
Show ElementParams | |
Defined in Web.Willow.DOM showsPrec :: Int -> ElementParams -> ShowS # show :: ElementParams -> String # showList :: [ElementParams] -> ShowS # |
emptyElementParams :: ElementParams #
A sane default collection for easy record initialization.
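The "updates to a default value" idiom relies on ordinary Haskell record-update syntax. Because the fields of ElementParams are not listed on this page, the sketch below uses a hypothetical stand-in record (`TagParams`, with invented fields `tagName` and `tagAttributes`) purely to show the shape of the idiom.

```haskell
-- Hypothetical stand-in; these are not the real ElementParams fields.
data TagParams = TagParams
    { tagName       :: String
    , tagAttributes :: [(String, String)]
    } deriving (Eq, Show)

-- Plays the role of emptyElementParams: every field gets a neutral value.
emptyTagParams :: TagParams
emptyTagParams = TagParams { tagName = "", tagAttributes = [] }

-- Only the fields that differ from the default need to be spelled out.
divParams :: TagParams
divParams = emptyTagParams { tagName = "div" }

main :: IO ()
main = print divParams
```

Writing values this way keeps calling code compiling if further fields are added to the record later, since unspecified fields keep their defaults.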
Initialization
defaultTreeState :: TreeState Source #
The collection of data which results in behaviour according to the "initially" instructions in the HTML tree construction algorithm.
treeEncoding :: Either SnifferEnvironment (Maybe Encoding) -> TreeState -> TreeState Source #
Specify the encoding scheme a given parse environment should use to read
from the binary document stream. Note that this will always use the initial
state for the respective decoder; intermediate states as returned by
decodeStep are not supported.
treeFragment
:: ElementParams | HTML:
The node wrapping -- in one way or another -- the embedded document fragment. |
-> [(NodeIndex, ElementParams)] | The ancestors of the context element, most immediate first. |
-> Maybe QuirksMode | The degree of backwards compatibility used in the node document of the context element, if it can be determined. |
-> Maybe Bool | Whether the node document of the context element has been parsed
in a way which would require scripting to be enabled ( |
-> TreeState | |
-> TreeState |
HTML: fragment parsing algorithm
Transform a given parse environment by adding context for an embedded but
separate document fragment. Calling this with an intermediate state
returned by treeStep (as opposed to an initial state from defaultTreeState)
may result in an unexpected tree structure.
treeInIFrame :: Bool -> TreeState -> TreeState Source #
Specify whether the given parse environment should be treated as if the
document were contained within the srcdoc attribute of an <iframe> element
(False by default).
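Each initialization function in this section takes its configuration first and the state last, returning the updated state, so several of them can be chained with ordinary function composition. The sketch below shows that pattern on a hypothetical stand-in type (`MiniState`, `setEncoding`, and `setInIFrame` are invented for illustration, since the fields of TreeState are not exposed).

```haskell
-- Hypothetical stand-in for TreeState.
data MiniState = MiniState
    { miniEncoding :: Maybe String
    , miniInIFrame :: Bool
    } deriving (Eq, Show)

-- Plays the role of defaultTreeState.
defaultMiniState :: MiniState
defaultMiniState = MiniState { miniEncoding = Nothing, miniInIFrame = False }

-- Same argument order as treeEncoding: configuration first, state last.
setEncoding :: Maybe String -> MiniState -> MiniState
setEncoding enc st = st { miniEncoding = enc }

-- Same argument order as treeInIFrame.
setInIFrame :: Bool -> MiniState -> MiniState
setInIFrame flag st = st { miniInIFrame = flag }

-- Because the state is the final argument, the setters chain with (.).
configured :: MiniState
configured = setInIFrame True . setEncoding (Just "utf-8") $ defaultMiniState

main :: IO ()
main = print configured
```

Going by the signatures listed in the synopsis, the corresponding expression against the real API would presumably be `treeInIFrame True . treeEncoding (Right (Just Utf8)) $ defaultTreeState`; that line is untested here.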
Transformations
tree :: TreeState -> ByteString -> ([Patch], TreeState) Source #
HTML: tree construction
Given a starting environment, transform a binary document stream into a
hierarchical markup tree. If the parse fails, returns an empty tree (a
Document node with no children).
treeStep :: TreeState -> ByteString -> ([Patch], TreeState, ByteString) Source #
Parse a minimal number of tokens from a binary document stream, into a state-independent sequence of folding instructions. Returns all data required to seamlessly resume parsing.
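A step function of this shape is typically driven by a loop that feeds the leftover input back in until nothing remains. The sketch below is generic over the state, input, and patch types so that it runs without the library; `Step`, `drain`, and `toyStep` are hypothetical names, and the toy step function merely consumes one character at a time.

```haskell
-- Shape shared with treeStep: consume some input, emit patches, and
-- return the new state together with the unconsumed remainder.
type Step s i p = s -> i -> ([p], s, i)

-- Apply a step function repeatedly until the input is exhausted,
-- concatenating the emitted patches in order. This assumes every step
-- consumes some input; a step that never does would loop forever.
drain :: Step s i p -> (i -> Bool) -> s -> i -> ([p], s)
drain step done = go
  where
    go st input
        | done input = ([], st)
        | otherwise  =
            let (ps, st', rest) = step st input
                (ps', st'')     = go st' rest
            in  (ps ++ ps', st'')

-- Toy step for demonstration: emit one character per call and count calls.
toyStep :: Step Int String Char
toyStep n (c:cs) = ([c], n + 1, cs)
toyStep n []     = ([], n, [])

main :: IO ()
main = print (drain toyStep null 0 "abc")   -- ("abc",3)
```

With the real API, `drain treeStep Data.ByteString.null defaultTreeState input` would have the type `([Patch], TreeState)`, matching the arguments finalizeTree expects. Whether treeStep needs one further call on empty input to flush end-of-stream handling is not stated here, so treat that part as an assumption.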