Holumbus-Searchengine-1.2.3: A search and indexing engine.

Safe Haskell: None

Holumbus.Crawler.Types

Synopsis

Documentation

type AccumulateDocResult a r = (URI, a) -> r -> IO r

The action to combine the result of a single document with the accumulator for the overall crawler result. This combining function runs in the IO monad so that parts of the result can be stored externally. It is deliberately not a CrawlerAction, otherwise parallel crawling with forkIO would no longer be possible.
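As a sketch of this shape, the accumulator below inserts each crawled document into a result map. Note the simplifying assumptions: URI is modeled as a plain String and Data.Map is used as the result type; neither is the real Holumbus representation.

```haskell
import qualified Data.Map as M

-- Illustrative stand-in for Holumbus's URI type.
type URI = String

-- An accumulating action of shape (URI, a) -> r -> IO r:
-- insert each crawled document's title into a result map.
insertDoc :: (URI, String) -> M.Map URI String -> IO (M.Map URI String)
insertDoc (uri, title) m = return (M.insert uri title m)
```

Because the function runs in IO, it could just as well append to a file or a database instead of updating a pure map.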

type MergeDocResults r = r -> r -> IO r

The folding operator for merging partial results when working with mapFold and parallel crawling.
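A minimal sketch of a matching merge operator, again assuming Data.Map as an illustrative result type:

```haskell
import qualified Data.Map as M

-- A merge operator of shape r -> r -> IO r: combine two partial
-- result maps, preferring entries from the left map on key clashes.
mergeMaps :: Ord k => M.Map k v -> M.Map k v -> IO (M.Map k v)
mergeMaps m1 m2 = return (M.union m1 m2)
```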

type SavePartialResults r = FilePath -> r -> IO r

The operator for saving intermediate results.
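One possible instance of this shape, assuming the result type has a Show instance (the real crawler may use a binary serialisation instead):

```haskell
-- A save operator of shape FilePath -> r -> IO r:
-- serialise the intermediate result to disk and pass it on unchanged.
savePartial :: Show r => FilePath -> r -> IO r
savePartial fp r = do
  writeFile fp (show r)
  return r
```

Returning the result unchanged lets the operator slot into a crawl pipeline without interrupting the flow of partial results.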

type ProcessDocument a = IOSArrow XmlTree a

The extractor function for a single document.

type CrawlerAction a r = ReaderStateIO (CrawlerConfig a r) (CrawlerState r)

The crawler action monad.

theToBeProcessed :: Selector (CrawlerState r) URIsWithLevel

Selector function for the CrawlerState field holding the URIs still to be processed.

theSysConfig :: Selector (CrawlerConfig a r) SysConfig

Selector function for the SysConfig field of CrawlerConfig.
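The Selector type itself is defined elsewhere in Holumbus; as an assumed simplification, it can be thought of as a getter/setter pair. The toy state below stands in for CrawlerState purely for illustration (the real representation may differ):

```haskell
-- Assumed simplification of a getter/setter selector.
data Selector s a = Sel { getS :: s -> a, setS :: a -> s -> s }

-- Toy state standing in for CrawlerState, purely for illustration.
data ToyState = ToyState { toBeProcessed :: [String] } deriving Show

-- A selector for the to-be-processed URIs, in the style of theToBeProcessed.
theToDo :: Selector ToyState [String]
theToDo = Sel toBeProcessed (\us s -> s { toBeProcessed = us })
```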

addSysConfig :: SysConfig -> CrawlerConfig a r -> CrawlerConfig a r

Add attributes for accessing documents.

addRobotsNoFollow :: CrawlerConfig a r -> CrawlerConfig a r

Insert a robots no-follow filter before thePreRefsFilter.

addRobotsNoIndex :: CrawlerConfig a r -> CrawlerConfig a r

Insert a robots no-index filter before thePreRefsFilter.

setCrawlerSaveConf :: Int -> String -> CrawlerConfig a r -> CrawlerConfig a r

Set the save interval in the configuration.

setCrawlerSaveAction :: (FilePath -> CrawlerAction a r ()) -> CrawlerConfig a r -> CrawlerConfig a r

Set the action performed before saving the crawler state.

setCrawlerClickLevel :: Int -> CrawlerConfig a r -> CrawlerConfig a r

Set the max # of steps (clicks) to reach a document.

setCrawlerMaxDocs :: Int -> Int -> Int -> CrawlerConfig a r -> CrawlerConfig a r

Set the max # of documents to be crawled, the max # of documents crawled in parallel, and the max # of parallel threads.

setCrawlerPreRefsFilter :: IOSArrow XmlTree XmlTree -> CrawlerConfig a r -> CrawlerConfig a r

Set the pre-hook filter executed before the hrefs are collected.
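Because every setter above ends in CrawlerConfig a r -> CrawlerConfig a r, configurations can be built by ordinary function composition. The sketch below mimics that pattern with a toy config record; the field names, setter names, and default values are illustrative assumptions, not the real Holumbus API:

```haskell
-- Toy config record mimicking the CrawlerConfig setter style.
data Config = Config
  { saveInterval :: Int   -- documents between state saves
  , clickLevel   :: Int   -- max # of clicks to reach a document
  , maxDocs      :: Int   -- max # of documents to crawl
  } deriving (Eq, Show)

defaultConfig :: Config
defaultConfig = Config 0 maxBound maxBound

setSaveInterval :: Int -> Config -> Config
setSaveInterval n c = c { saveInterval = n }

setClickLevel :: Int -> Config -> Config
setClickLevel n c = c { clickLevel = n }

setMaxDocs :: Int -> Config -> Config
setMaxDocs n c = c { maxDocs = n }

-- Setters compose right-to-left, just like the real config setters.
myConfig :: Config
myConfig = setSaveInterval 1000 . setClickLevel 5 . setMaxDocs 50000 $ defaultConfig
```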

getConf :: Selector (CrawlerConfig a r) v -> CrawlerAction a r v

Load a component from the crawler configuration.