shpider-0.2.1.1: Web automation library in Haskell.

Network.Shpider

Description

This module exposes the main functionality of shpider. It allows you to write crawlers quickly and, in simple cases, without even reading the page source, e.g.:

 runShpider $ do
    download "http://hackage.haskell.org/packages/archive/pkg-list.html"
    l : _ <- getLinksByText "shpider"
    download $ linkAddress l

Documentation

Crawl Functions

download :: String -> Shpider (ShpiderCode, Page)

Fetch whatever is at this address, and attempt to parse the content into a Page. Return the status code with the parsed content.

sendForm :: Form -> Shpider (ShpiderCode, Page)

Send a form to the URL specified in its action attribute.

Basic Parsing/Decision Support

getLinksByText :: String -> Shpider [Link]

Get all links which match this text.

getLinksByTextRegex :: String -> Shpider [Link]

Get all links whose text matches this regex.

getLinksByAddressRegex :: String -> Shpider [Link]

Get all links whose address matches this regex.

getFormsByAction :: String -> Shpider [Form]

Get all forms whose action matches the given action.
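
A minimal sketch of how getFormsByAction and sendForm might be combined, using only the signatures listed on this page. The helper name submitFirstForm, the example.org URL and the "/login" action are hypothetical placeholders, and the sketch assumes that ShpiderCode and Page are in scope through Network.Shpider. The form is sent as found, without filling in any fields.

```haskell
import Network.Shpider

-- Hypothetical helper: submit the first form on the current page whose
-- action matches. Returns Nothing when no form matches, which avoids a
-- failed pattern match on an empty list.
submitFirstForm :: String -> Shpider (Maybe (ShpiderCode, Page))
submitFirstForm action = do
  forms <- getFormsByAction action
  case forms of
    []         -> return Nothing
    (form : _) -> fmap Just (sendForm form)

main :: IO ()
main = do
  result <- runShpider $ do
    _ <- download "http://example.org/login"  -- placeholder URL
    submitFirstForm "/login"                   -- placeholder action
  putStrLn (maybe "no matching form" (const "form sent") result)
```

Returning a Maybe keeps the "no such form" case explicit instead of relying on a partial pattern such as `f : _ <- getFormsByAction action`.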

currentLinks :: Shpider [Link]

Return the links on the current page.

currentForms :: Shpider [Form]

Return the forms on the current page.

Utilities

parsePage :: String -> String -> Shpider Page

Parse a given URL and source HTML into the Page datatype. This will set the current page.

isAuthorizedDomain :: String -> Shpider Bool

If stayOnDomain has been set to True, isAuthorizedDomain returns True if the given URL is on the domain and False otherwise. If stayOnDomain has not been set to True, it always returns True.
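
The rule above can be modelled as a small pure function. This is only a sketch of the described behaviour, not shpider's implementation: the Settings record, its field names and the authorized function are hypothetical, and treating "on the domain" as a simple URL prefix match is an assumption made for illustration.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical stand-in for the crawler settings; not shpider's state type.
data Settings = Settings
  { stayOnDomainFlag :: Bool    -- mirrors the stayOnDomain option
  , startDomain      :: String  -- assumed: URL prefix that counts as "on the domain"
  }

-- Pure model of the documented rule: when the flag is off every URL is
-- authorized; when it is on, only URLs under the start domain are.
authorized :: Settings -> String -> Bool
authorized s url
  | stayOnDomainFlag s = startDomain s `isPrefixOf` url
  | otherwise          = True

main :: IO ()
main = do
  let strict = Settings True "http://example.org"
      loose  = strict { stayOnDomainFlag = False }
  print (authorized strict "http://example.org/a")  -- True
  print (authorized strict "http://other.net/")     -- False
  print (authorized loose  "http://other.net/")     -- True
```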

withAuthorizedDomain :: String -> Shpider (ShpiderCode, Page) -> Shpider (ShpiderCode, Page)

withAuthorizedDomain executes the given action only if the given URL is on an authorized domain. See isAuthorizedDomain.

haveVisited :: String -> Shpider Bool

If keepTrack has been set, haveVisited returns True if the given URL has been visited.
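
As an illustration, haveVisited could guard a download so that the same URL is not fetched twice. The helper visitOnce is hypothetical and uses only the signatures listed on this page. Enabling keepTrack is not shown here; without it, the second call below would presumably download the page again.

```haskell
import Control.Monad (unless)
import Network.Shpider

-- Hypothetical helper: download a URL only if haveVisited does not
-- report it as already seen. The download result is discarded.
visitOnce :: String -> Shpider ()
visitOnce url = do
  seen <- haveVisited url
  unless seen $ do
    _ <- download url
    return ()

main :: IO ()
main = runShpider $ do
  let url = "http://hackage.haskell.org/packages/archive/pkg-list.html"
  visitOnce url
  visitOnce url  -- skipped only if keepTrack has been enabled
```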