Holumbus-Searchengine-1.2.2: A search and indexing engine.

Safe Haskell	None

Holumbus.Crawler.Html

Documentation

defaultHtmlCrawlerConfig :: AccumulateDocResult a r -> MergeDocResults r -> CrawlerConfig a r

getHtmlReferences :: ArrowXml a => a XmlTree URI

Collect all HTML references to other documents found in a, frame and iframe elements.

getDocReferences :: ArrowXml a => a XmlTree URI

toAbsRef :: URI -> URI -> URI

Construct an absolute URI from a base URI and a possibly relative URI.
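The resolution step described above can be illustrated with a minimal sketch. It is not the library's code: it assumes URIs are passed as plain strings, it relies on the separate network-uri package (parseURI, parseURIReference, relativeTo, uriToString), and the name toAbsRefSketch is hypothetical. Unlike toAbsRef, it returns a Maybe so that unparsable input is visible.

```haskell
-- Sketch only: needs the network-uri package; not taken from Holumbus.
import Network.URI (parseURI, parseURIReference, relativeTo, uriToString)

-- | Hypothetical helper: resolve a possibly relative reference against
--   an absolute base URI. Yields Nothing if either argument does not parse.
toAbsRefSketch :: String -> String -> Maybe String
toAbsRefSketch base ref = do
  baseUri <- parseURI base           -- the base has to be an absolute URI
  refUri  <- parseURIReference ref   -- the reference may be relative or absolute
  let absUri = refUri `relativeTo` baseUri
  return (uriToString id absUri "")  -- render the resolved URI as a String
```

Under these assumptions, resolving "../img/logo.png" against "http://example.org/docs/index.html" gives Just "http://example.org/img/logo.png", while a reference that is already absolute is returned unchanged.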

computeDocBase :: ArrowXml a => a XmlTree String

Compute the base URI of an HTML page, taking into account an optional base element in the head element of the page.

Stolen from Uwe Schmidt, http://www.haskell.org/haskellwiki/HXT and then stolen back again by Uwe from Holumbus.Utility

getByPath :: ArrowXml a => [String] -> a XmlTree XmlTree

getHtmlTitle :: ArrowXml a => a XmlTree String

getHtmlPlainText :: ArrowXml a => a XmlTree String

getAllText :: ArrowXml a => a XmlTree XmlTree -> a XmlTree String

isHtmlContents :: ArrowXml a => a XmlTree XmlTree

isPdfContents :: ArrowXml a => a XmlTree XmlTree

getTitleOrDocName :: ArrowXml a => a XmlTree String

isElemWithAttr :: ArrowXml a => String -> String -> (String -> Bool) -> a XmlTree XmlTree

application_pdf :: String

normalizeWS :: String -> String

Normalize whitespace by splitting a text into words and joining them back together with unwords.
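The description above corresponds to a standard Prelude idiom. The following is a minimal sketch of that idiom, using only words and unwords; the name normalizeWSSketch is hypothetical and the library's actual definition may differ in detail.

```haskell
-- | Hypothetical stand-in for the behaviour described above:
--   'words' splits on any run of whitespace (spaces, tabs, newlines),
--   'unwords' joins the pieces with exactly one space each.
normalizeWSSketch :: String -> String
normalizeWSSketch = unwords . words
```

For example, normalizeWSSketch "  Hello \n\t world  " evaluates to "Hello world": leading and trailing whitespace disappears and every inner run collapses to one space.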

limitLength :: Int -> String -> String

Take the first n chars of a string; if the input is too long, the cut-off is indicated by "..." at the end.
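One way to read this description is sketched below. The documentation does not say whether the "..." counts towards the limit; this sketch assumes it does, so a truncated result is exactly n characters long when n is at least 3. The name limitLengthSketch is hypothetical and the library's definition may handle these cases differently.

```haskell
-- | Hypothetical stand-in: keep strings of at most n characters as they
--   are, otherwise cut them and append "...".
--   Assumption: the ellipsis is part of the n-character budget.
limitLengthSketch :: Int -> String -> String
limitLengthSketch n s
  | null rest = s                                -- input fits, return unchanged
  | otherwise = take (max 0 (n - 3)) s ++ "..."  -- reserve 3 chars for the marker
  where
    rest = drop n s  -- non-empty exactly when s has more than n characters
```

With this reading, limitLengthSketch 10 "Hello, world!" evaluates to "Hello, ..." and limitLengthSketch 10 "short" returns "short" unchanged. Testing with drop instead of length keeps the function usable on very long or lazily produced strings.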