hxt-7.1: A collection of tools for processing XML with Haskell.
Text.XML.HXT.Arrow.ReadDocument
Portability: portable
Stability: experimental
Maintainer: Uwe Schmidt (uwe@fh-wedel.de)
Description

Version : $Id: ReadDocument.hs,v 1.10 2006/11/24 07:41:37 hxml Exp $

Compound arrows for reading an XML/HTML document or an XML/HTML string

Synopsis
readDocument :: Attributes -> String -> IOStateArrow s b XmlTree
readFromDocument :: Attributes -> IOStateArrow s String XmlTree
readString :: Attributes -> String -> IOStateArrow s b XmlTree
readFromString :: Attributes -> IOStateArrow s String XmlTree
hread :: ArrowXml a => a String XmlTree
xread :: ArrowXml a => a String XmlTree
Documentation
readDocument :: Attributes -> String -> IOStateArrow s b XmlTree

the main document input filter

this filter can be configured by an option list, a value of type Attributes

available options:

  • a_parse_html: use HTML parser, else use XML parser (default)
  • a_validate : validate document against DTD (default), else skip validation
  • a_relax_schema : validate document with Relax NG; the option's value is the schema URI. This implies using the XML parser, no validation against a DTD, and canonicalization
  • a_check_namespaces : check namespaces, else skip namespace processing (default)
  • a_canonicalize : canonicalize document (default), else skip canonicalization
  • a_preserve_comment : preserve comments during canonicalization, else remove comments (default)
  • a_remove_whitespace : remove all whitespace, used for document indentation, else skip this step (default)
  • a_indent : indent document by inserting whitespace, else skip this step (default)
  • a_issue_warnings : issue warnings, when parsing HTML (default), else ignore HTML parser warnings
  • a_issue_errors : issue all error messages on stderr (default), else ignore all error messages
  • a_trace : trace level: values: 0 - 4
  • a_proxy : proxy for http access, e.g. www-cache:3128
  • a_use_curl : use the external program curl for http access, else use native HTTP access (default)
  • a_options_curl : more options for external program curl
  • a_encoding : default document encoding (utf8, isoLatin1, usAscii, ...)

All attributes not evaluated by readDocument are stored in the created document root node, so that other modules (e.g. the input/output modules) can easily access the various options.

If the document name is the empty string or a URI of the form "stdin:", the document is read from standard input.

examples:

 readDocument [ ] "test.xml"

reads and validates a document "test.xml", no namespace propagation, only canonicalization is performed

 readDocument [ (a_validate, "0")
              , (a_encoding, isoLatin1)
              ] "test.xml"

reads document "test.xml" without validation, default encoding isoLatin1.

 readDocument [ (a_parse_html, "1")
              , (a_encoding, isoLatin1)
              ] ""

reads an HTML document from standard input; no validation is done when parsing HTML, default encoding is isoLatin1

 readDocument [ (a_parse_html,     "1")
              , (a_proxy,          "www-cache:3128")
              , (a_use_curl,       "1")
              , (a_issue_warnings, "0")
              ] "http://www.haskell.org/"

reads the Haskell homepage with the HTML parser, ignoring any warnings, with http access via the external program curl and the proxy "www-cache" at port 3128

 readDocument [ (a_validate,          "1")
              , (a_check_namespaces,  "1")
              , (a_remove_whitespace, "1")
              , (a_trace,             "2")
              ] "http://www.w3c.org/"

reads the W3C home page (XHTML), validates it and checks namespaces, removes whitespace between tags, and traces activities with level 2

for minimal complete examples see writeDocument and runX, the main starting point for running an XML arrow.
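The following is a minimal sketch of such a complete program, not the library's own example. It assumes that importing Text.XML.HXT.Arrow brings readDocument, writeDocument, runX, getErrStatus, the option constants v_0 and v_1, and the status constant c_err into scope, and that an empty destination name makes writeDocument use standard output. The helper name copyDocument is hypothetical.

```haskell
module Main where

import System.Exit (ExitCode (..), exitWith)
import Text.XML.HXT.Arrow

-- | Hypothetical helper: read 'src' without DTD validation, write it
--   indented to 'dst', and return the error status of the run.
copyDocument :: String -> String -> IOSArrow b Int
copyDocument src dst =
  readDocument [ (a_validate, v_0) ] src
    >>> writeDocument [ (a_indent, v_1) ] dst
    >>> getErrStatus

main :: IO ()
main = do
  -- "test.xml" is the placeholder file name used in the examples above;
  -- the empty destination is assumed to mean standard output.
  rcs <- runX (copyDocument "test.xml" "")
  -- Treat a missing result or a status of c_err and above as failure.
  if null rcs || any (>= c_err) rcs
    then exitWith (ExitFailure 1)
    else exitWith ExitSuccess
```

Keeping the arrow in a separate function from main makes it possible to reuse the same pipeline with other sources, for example "" to read from standard input.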

readFromDocument :: Attributes -> IOStateArrow s String XmlTree
the arrow version of readDocument, the arrow input is the source URI
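As a hedged illustration of the arrow version: because the URI arrives as the arrow input, it can be produced by an earlier arrow in the pipeline. The sketch below assumes that constL from the list arrow classes is exported by Text.XML.HXT.Arrow; the function name readAll is hypothetical.

```haskell
import Text.XML.HXT.Arrow

-- | Hypothetical helper: feed each URI in the list into readFromDocument
--   and collect the resulting document root nodes.
readAll :: [String] -> IO [XmlTree]
readAll uris =
  runX ( constL uris                              -- emit the URIs one by one
         >>> readFromDocument [ (a_validate, v_0) ]  -- read each without validation
       )
```

With readDocument the source is fixed when the arrow is built; with readFromDocument it is only known when the arrow runs, which suits cases such as following links extracted from another document.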
readString :: Attributes -> String -> IOStateArrow s b XmlTree

read a document that is stored in a normal Haskell String

the same function as readDocument, except that the String parameter is the document content itself rather than the document name. All options available for readDocument are applicable to readString.

Default encoding: no decoding is done; the String argument is taken as a Unicode string
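A small sketch of reading from a String, under the assumption that the selection arrows deep, isElem, hasName, getChildren, and getText are exported by Text.XML.HXT.Arrow. The function name titles and the element name "title" are chosen only for illustration.

```haskell
import Text.XML.HXT.Arrow

-- | Hypothetical helper: parse 'src' as an XML document held in memory
--   and return the text content of every <title> element.
titles :: String -> IO [String]
titles src =
  runX ( readString [ (a_validate, v_0) ] src   -- no DTD, so skip validation
         >>> deep (isElem >>> hasName "title")  -- find <title> elements
         >>> getChildren
         >>> getText                            -- keep text children only
       )
```

Validation is switched off here because an in-memory fragment usually has no DTD, and validation is the default.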

readFromString :: Attributes -> IOStateArrow s String XmlTree
the arrow version of readString, the arrow input is the document content as a String
hread :: ArrowXml a => a String XmlTree

parse a string as HTML content, substitute all HTML entity refs and canonicalize the tree (substitute char refs, ...). Errors are ignored.

A simpler version of readFromString with less functionality. Does not run in the IO monad.

xread :: ArrowXml a => a String XmlTree
parse a string as XML content, substitute all predefined XML entity refs and canonicalize tree (substitute char refs, ...)
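
Since xread only requires an ArrowXml instance, it can be run as a pure list arrow. The sketch below assumes that the list arrow type LA with runLA, and the selector getName, are exported by Text.XML.HXT.Arrow; the function name fragmentNames is hypothetical.

```haskell
import Text.XML.HXT.Arrow

-- | Hypothetical helper: parse a string as XML content without any IO
--   and return the names of the top-level elements found.
fragmentNames :: String -> [String]
fragmentNames =
  runLA ( xread          -- parse the content into a list of trees
          >>> isElem     -- drop text, comments and other non-element nodes
          >>> getName    -- element name as a String
        )
```

The same pattern should work with hread for HTML fragments, since both arrows share the type ArrowXml a => a String XmlTree.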