Haskell Xml Toolbox 7.3: The complete API
Text.XML.HXT.Arrow.ReadDocument
Portability: portable
Stability: experimental
Maintainer: Uwe Schmidt (uwe@fh-wedel.de)
Description

Version : $Id: ReadDocument.hs,v 1.10 2006/11/24 07:41:37 hxml Exp $

Compound arrows for reading an XML/HTML document or an XML/HTML string

Synopsis
readDocument :: Attributes -> String -> IOStateArrow s b XmlTree
readFromDocument :: Attributes -> IOStateArrow s String XmlTree
readString :: Attributes -> String -> IOStateArrow s b XmlTree
readFromString :: Attributes -> IOStateArrow s String XmlTree
hread :: ArrowXml a => a String XmlTree
xread :: ArrowXml a => a String XmlTree
Documentation
readDocument :: Attributes -> String -> IOStateArrow s b XmlTree

the main document input filter

this filter can be configured by an option list, a value of type Attributes

available options:

  • a_parse_html: use HTML parser, else use XML parser (default)
  • a_validate : validate document against DTD (default), else skip validation
  • a_relax_schema : validate document with Relax NG; the option's value is the schema URI. This implies using the XML parser, no validation against the DTD, and canonicalization
  • a_check_namespaces : check namespaces, else skip namespace processing (default)
  • a_canonicalize : canonicalize document (default), else skip canonicalization
  • a_preserve_comment : preserve comments during canonicalization, else remove comments (default)
  • a_remove_whitespace : remove all whitespace used for document indentation, else skip this step (default)
  • a_indent : indent document by inserting whitespace, else skip this step (default)
  • a_issue_warnings : issue warnings when parsing HTML (default), else ignore HTML parser warnings
  • a_issue_errors : issue all error messages on stderr (default), else ignore all error messages
  • a_trace : trace level: values: 0 - 4
  • a_proxy : proxy for http access, e.g. www-cache:3128
  • a_use_curl : use the external program curl for http access, default is native HTTP access
  • a_options_curl : more options for external program curl
  • a_encoding : default document encoding (utf8, isoLatin1, usAscii, iso8859_2, ... , iso8859_16, ...)

All attributes not evaluated by readDocument are stored in the created document root node, so that other modules, e.g. the input/output modules, can easily access the various options.

If the document name is the empty string or a URI of the form "stdin:", the document is read from standard input.

examples:

 readDocument [ ] "test.xml"

reads and validates a document "test.xml", no namespace propagation, only canonicalization is performed

 readDocument [ (a_validate, "0")
              , (a_encoding, isoLatin1)
              ] "test.xml"

reads document "test.xml" without validation, default encoding isoLatin1.

 readDocument [ (a_parse_html, "1")
              , (a_encoding, isoLatin1)
              ] ""

reads an HTML document from standard input; no validation is done when parsing HTML, and the default encoding is isoLatin1.

 readDocument [ (a_parse_html,     "1")
              , (a_proxy,          "www-cache:3128")
              , (a_use_curl,       "1")
              , (a_issue_warnings, "0")
              ] "http://www.haskell.org/"

reads the Haskell homepage with the HTML parser, ignoring any warnings, with http access via the external program curl and the proxy "www-cache" at port 3128

 readDocument [ (a_validate,          "1")
              , (a_check_namespaces,  "1")
              , (a_remove_whitespace, "1")
              , (a_trace,             "2")
              ] "http://www.w3c.org/"

reads the W3C home page (XHTML), validates it and checks namespaces, removes whitespace between tags, and traces activities with level 2

for minimal complete examples see writeDocument and runX, the main starting point for running an XML arrow.
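
As an illustration of how readDocument is typically combined with runX and writeDocument, here is a minimal sketch of a complete program. It assumes that the umbrella module Text.XML.HXT.Arrow re-exports readDocument, writeDocument, runX, the Attributes type and the a_validate key. The names readOptions and copyDocument are hypothetical helpers introduced only for this example.

```haskell
module Main where

import System.Environment (getArgs)
import Text.XML.HXT.Arrow

-- Option list for readDocument: skip DTD validation.
-- (readOptions is a name chosen for this sketch, not part of HXT.)
readOptions :: Attributes
readOptions = [ (a_validate, "0") ]

-- Read the document named by src and write it to dst.
-- The result is the list of trees produced by the arrow.
copyDocument :: String -> String -> IO [XmlTree]
copyDocument src dst
    = runX ( readDocument readOptions src
             >>>
             writeDocument [] dst
           )

main :: IO ()
main = do
    args <- getArgs
    case args of
      [src, dst] -> copyDocument src dst >> return ()
      _          -> putStrLn "usage: copy <source> <destination>"
```

Any of the option lists shown in the examples above can be substituted for readOptions without changing the rest of the program.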

readFromDocument :: Attributes -> IOStateArrow s String XmlTree
the arrow version of readDocument, the arrow input is the source URI
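
Because readFromDocument takes the source URI as arrow input, it can be placed behind an arrow that produces several URIs. The following sketch assumes constL from the list arrow interface is available via Text.XML.HXT.Arrow. The file names and the helper readAll are hypothetical and serve only as illustration.

```haskell
module Main where

import Text.XML.HXT.Arrow

-- Hypothetical input files, used only for illustration.
sources :: [String]
sources = [ "first.xml", "second.xml" ]

-- Emit each URI with constL and feed it into readFromDocument
-- as the arrow input; validation is switched off.
readAll :: [String] -> IO [XmlTree]
readAll uris
    = runX ( constL uris
             >>>
             readFromDocument [ (a_validate, "0") ]
           )

main :: IO ()
main = do
    docs <- readAll sources
    putStrLn ("root nodes returned: " ++ show (length docs))
```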
readString :: Attributes -> String -> IOStateArrow s b XmlTree

read a document that is stored in a normal Haskell String

This is the same function as readDocument, except that the String parameter is the document content itself rather than a document name. All options available for readDocument are applicable for readString.

Default encoding: no decoding is performed; the String argument is taken as a Unicode string.
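
A short sketch of readString applied to an in-memory document may help here. It assumes the arrows deep, isText and getText are exported by Text.XML.HXT.Arrow. The sample content and the helper textOf are made up for this example.

```haskell
module Main where

import Text.XML.HXT.Arrow

-- A small in-memory document; the content is invented for this sketch.
sample :: String
sample = "<greeting lang=\"en\">hello</greeting>"

-- Parse the string without DTD validation, select all text nodes
-- anywhere in the tree, and extract their text.
textOf :: String -> IO [String]
textOf doc
    = runX ( readString [ (a_validate, "0") ] doc
             >>>
             deep isText
             >>>
             getText
           )

main :: IO ()
main = textOf sample >>= mapM_ putStrLn
```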

readFromString :: Attributes -> IOStateArrow s String XmlTree
the arrow version of readString, the arrow input is the document content as a String
hread :: ArrowXml a => a String XmlTree

parse a string as HTML content, substitute all HTML entity refs and canonicalize tree (substitute char refs, ...). Errors are ignored.

A simpler version of readFromString, but with less functionality. It does not run in the IO monad.

xread :: ArrowXml a => a String XmlTree
parse a string as XML content, substitute all predefined XML entity refs and canonicalize tree (substitute char refs, ...)
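
Since xread and hread only require an ArrowXml instance, they can be run as pure list arrows. The sketch below assumes that the LA arrow type with runLA, and the arrows multi, isElem and getName, are available through Text.XML.HXT.Arrow. The helper elementNames is a hypothetical name used for illustration.

```haskell
module Main where

import Text.XML.HXT.Arrow

-- Run xread as a pure list arrow (LA), visit every node of the
-- parsed content, keep the element nodes, and return their names.
elementNames :: String -> [String]
elementNames = runLA ( xread >>> multi isElem >>> getName )

main :: IO ()
main = mapM_ putStrLn (elementNames "<a><b/><c/></a>")
```

Replacing xread with hread in elementNames should give the HTML variant, under the same assumptions.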
Produced by Haddock version 0.8