Holumbus-Searchengine-1.2.1: A search and indexing engine.

Safe Haskell: None

Holumbus.Crawler.Core


Documentation

type MapFold a r = (a -> IO r) -> (r -> r -> IO r) -> [a] -> IO r
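The type above describes a combined map and fold over a list in IO: a per-element action, a combining action, and the input list. As a minimal sketch of what a value of this type could look like, the hypothetical seqMapFold below maps and folds sequentially from left to right. It is not the function the crawler actually uses, and rejecting an empty list is an assumption made here because the type offers no neutral element.

```haskell
import Control.Monad (foldM)

type MapFold a r = (a -> IO r) -> (r -> r -> IO r) -> [a] -> IO r

-- Hypothetical sequential instance of MapFold (not part of Holumbus).
-- Each element is mapped with f, and the results are combined with op,
-- left to right. An empty list is rejected (assumption: no neutral element).
seqMapFold :: MapFold a r
seqMapFold _ _ [] = ioError (userError "seqMapFold: empty list")
seqMapFold f op (x : xs) = do
  r0 <- f x
  foldM step r0 xs
  where
    step acc y = f y >>= op acc

main :: IO ()
main = do
  -- Sum the lengths of three strings: 2 + 3 + 1
  total <- seqMapFold (return . length) (\a b -> return (a + b)) ["ab", "cde", "f"]
  print total -- prints 6
```

Passing the strategy as a parameter of this type would let a caller swap a sequential fold for a parallel one without touching the crawl logic.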

crawlNextDoc :: (NFData a, NFData r) => CrawlerAction a r ()

Crawl a single document, mark it as processed, collect new hrefs, and combine the document result with the accumulator in the state.
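The step described above can be modelled in a few lines. The following is a pure toy sketch, not the CrawlerAction implementation: the CrawlState record, its field names, crawlStep, crawlAll, and toyWeb are all hypothetical, URIs are plain strings, and the real function runs in a monad with IO.

```haskell
import Data.List (nub)
import Data.Maybe (fromMaybe)

-- Hypothetical crawler state: queue, processed URIs, accumulated result.
data CrawlState r = CrawlState
  { toBeProcessed :: [String]
  , alreadyProcessed :: [String]
  , accumulator :: r
  }

-- One step: take the next URI, mark it as processed, enqueue unseen hrefs,
-- and combine the document result with the accumulator.
crawlStep :: (String -> ([String], r)) -> (r -> r -> r) -> CrawlState r -> CrawlState r
crawlStep _ _ st@(CrawlState [] _ _) = st
crawlStep process combine (CrawlState (u : us) done acc) =
  CrawlState (us ++ fresh) (u : done) (combine acc result)
  where
    (hrefs, result) = process u
    known = u : us ++ done
    fresh = nub (filter (`notElem` known) hrefs)

-- Repeat the step until the queue is empty.
crawlAll :: (String -> ([String], r)) -> (r -> r -> r) -> CrawlState r -> CrawlState r
crawlAll process combine st
  | null (toBeProcessed st) = st
  | otherwise = crawlAll process combine (crawlStep process combine st)

-- Hypothetical three-page web: page name and outgoing links.
toyWeb :: [(String, [String])]
toyWeb = [("a", ["b", "c"]), ("b", ["a", "c"]), ("c", [])]

main :: IO ()
main = do
  let process u = (fromMaybe [] (lookup u toyWeb), 1 :: Int)
      final = crawlAll process (+) (CrawlState ["a"] [] 0)
  print (alreadyProcessed final) -- prints ["c","b","a"]
  print (accumulator final)      -- prints 3
```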

processDoc :: URIWithLevel -> CrawlerAction a r (URI, [URIWithLevel], [(URI, a)])

Run the document-processing arrow and prepare the results.

isAllowedByRobots :: URI -> CrawlerAction a r Bool

Filter out URIs rejected by robots.txt.

processDocArrow :: CrawlerConfig c r -> URI -> IOSArrow a (URI, ([URI], [(URI, c)]))

Two results are computed from a document: (1) the list of all hrefs in the contents, and (2) the collected info contained in the page. This result is augmented with the transfer URI so that subsequent functions know the source of the contents. The transfer URI may differ from the input URI, because the HTTP request may have been redirected.

The two listA arrows make the whole arrow deterministic, so it never fails.
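The remark about listA can be illustrated without HXT. The sketch below models a list arrow as a function that returns all of its results, with the empty list standing for failure. The names ListArrow, listA', both, hrefs, and infos are hypothetical stand-ins, so this is a toy model of the idea and not the library's definitions.

```haskell
-- Toy model of a list arrow: all results as a list, [] means failure.
type ListArrow a b = a -> [b]

-- Collect every result of an arrow into one list.
-- The wrapped arrow always yields exactly one result, even if f yields none.
listA' :: ListArrow a b -> ListArrow a [b]
listA' f x = [f x]

-- Pair the results of two arrows; fails as soon as either arrow fails.
both :: ListArrow a b -> ListArrow a c -> ListArrow a (b, c)
both f g x = [(y, z) | y <- f x, z <- g x]

-- Hypothetical extractors over a toy document (a list of words).
hrefs :: ListArrow [String] String
hrefs = filter ((== "http") . take 4)

infos :: ListArrow [String] Int
infos ws = [length ws]

main :: IO ()
main = do
  let doc = ["no", "links", "here"]
  -- Without listA': hrefs fails on this document, so the pair fails too.
  print (both hrefs infos doc)                   -- prints []
  -- With listA': exactly one result, carrying an empty href list.
  print (both (listA' hrefs) (listA' infos) doc) -- prints [([],[3])]
```

In this model a page without links still produces a single result with an empty href list, which is the behaviour the remark describes.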

getLocationReference :: ArrowXml a => a XmlTree String

Compute the real URI in case of a 301 or 302 response (moved permanently or temporarily); otherwise the arrow fails.

getRealDocURI :: ArrowXml a => a XmlTree String

Compute the real URI of the document. In case of a move response this is contained in the "http-location" attribute; otherwise it is the transferURI.
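The fallback logic of the two functions above can be sketched on a plain attribute list instead of an XmlTree. Everything here is a hypothetical stand-in: Attrs, locationReference, and realDocURI are illustrative names, failure of the arrow is modelled as Nothing, and the keys "transfer-Status" and "transfer-URI" are assumptions about how the response status and transfer URI are stored. Only "http-location" is taken from the description above.

```haskell
import Control.Applicative ((<|>))

-- Hypothetical stand-in for the attributes of a document root node.
type Attrs = [(String, String)]

-- Redirect target for a 301 or 302 response; Nothing models arrow failure.
-- The key "transfer-Status" is an assumption.
locationReference :: Attrs -> Maybe String
locationReference attrs = do
  status <- lookup "transfer-Status" attrs
  if status `elem` ["301", "302"]
    then lookup "http-location" attrs
    else Nothing

-- Prefer the redirect target, else fall back to the transfer URI.
-- The key "transfer-URI" is an assumption.
realDocURI :: Attrs -> Maybe String
realDocURI attrs = locationReference attrs <|> lookup "transfer-URI" attrs

main :: IO ()
main = do
  let moved =
        [ ("transfer-Status", "301")
        , ("http-location", "http://example.org/new")
        , ("transfer-URI", "http://example.org/old")
        ]
      plain =
        [ ("transfer-Status", "200")
        , ("transfer-URI", "http://example.org/page")
        ]
  print (realDocURI moved) -- prints Just "http://example.org/new"
  print (realDocURI plain) -- prints Just "http://example.org/page"
```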