crawlchain- Simulation user crawl paths

Safe HaskellSafe




data CrawlDirective Source

A crawl directive takes a content of a web page and produces crawl actions for links/forms to follow. The general idea is to specify a list of operations that in theory produces a dynamically collected tree of requests which leaves are either dead ends or end results.

Additional, logical branching/combination of Directives is possible with: * Alternatives - evaluate both Directives in order. * Restart - evaluate completely new initial action & chain if the previous combo does not produce end results.


SimpleDirective (String -> [CrawlAction])

access content to find absolute follow-up urls

RelativeDirective (String -> [CrawlAction])

as simple, but found relative urls are completed

FollowUpDirective (CrawlResult -> [CrawlAction])

as simple, but with access to complete result

DelayDirective Int CrawlDirective

wait additional seconds before executing

RetryDirective Int CrawlDirective

if given directive yields no results use add. retries

AlternativeDirective CrawlDirective CrawlDirective

fallback to second argument if first yields no results

RestartChainDirective (CrawlAction, CrawlDirective)

the possibility to start a new chain (when using alternative)

GuardDirective (CrawlAction -> Bool)

not crawling anything, just a blacklisting option

DirectiveSequence [CrawlDirective]

chaining of directives