This module exposes the main functionality of shpider. It allows you to quickly write crawlers, and in simple cases without even reading the page source, e.g.

    runShpider $ do
       download "http://hackage.haskell.org/packages/archive/pkg-list.html"
       l : _ <- getLinksByText "shpider"
       download $ linkAddress l
- module Network.Shpider.Code
- module Network.Shpider.State
- module Network.Shpider.URL
- module Network.Shpider.Options
- module Network.Shpider.Forms
- module Network.Shpider.Links
- download :: String -> Shpider (ShpiderCode, Page)
- sendForm :: Form -> Shpider (ShpiderCode, Page)
- getLinksByText :: String -> Shpider [Link]
- getLinksByTextRegex :: String -> Shpider [Link]
- getLinksByAddressRegex :: String -> Shpider [Link]
- getFormsByAction :: String -> Shpider [Form]
- getFormsHasAction :: (String -> Bool) -> Shpider [Form]
- currentLinks :: Shpider [Link]
- currentForms :: Shpider [Form]
- parsePage :: String -> String -> Shpider Page
- isAuthorizedDomain :: String -> Shpider Bool
- withAuthorizedDomain :: String -> Shpider (ShpiderCode, Page) -> Shpider (ShpiderCode, Page)
- haveVisited :: String -> Shpider Bool
Documentation
module Network.Shpider.Code
module Network.Shpider.State
module Network.Shpider.URL
module Network.Shpider.Options
module Network.Shpider.Forms
module Network.Shpider.Links
Crawl Functions
download :: String -> Shpider (ShpiderCode, Page)
Fetch whatever is at this address, and attempt to parse the content into a Page. Return the status code with the parsed content.
sendForm :: Form -> Shpider (ShpiderCode, Page)
Send a form to the URL specified in its action attribute.
Basic Parsing/Decision Support
getLinksByText :: String -> Shpider [Link]
Get all links whose text matches the given text.
getLinksByTextRegex :: String -> Shpider [Link]
Get all links whose text matches this regex.
getLinksByAddressRegex :: String -> Shpider [Link]
Get all links whose address matches this regex.
getFormsByAction :: String -> Shpider [Form]
Get all forms whose action matches the given action.
currentLinks :: Shpider [Link]
Return the links on the current page.
currentForms :: Shpider [Form]
Return the forms on the current page.
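As a rough sketch of how the crawl and parsing functions combine, the helper below downloads a page and returns the addresses of the links whose address matches a regex. It assumes that runShpider hands the result of the block back in IO and that Shpider supports ordinary do-notation, as the example at the top of this module suggests. The names matchingAddresses, url and regex are made up for this illustration.

```haskell
import Network.Shpider

-- Download a page, then list the addresses of links whose href matches a regex.
-- Both arguments come from the caller; nothing here is specific to any site.
matchingAddresses :: String -> String -> IO [String]
matchingAddresses url regex = runShpider $ do
  (_code, _page) <- download url        -- status code and parsed Page (ignored here)
  ls <- getLinksByAddressRegex regex    -- searches the page that was just downloaded
  return (map linkAddress ls)
```

The status code is ignored here for brevity; a real crawler would probably inspect it before trusting the links.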
Utilities
parsePage :: String -> String -> Shpider Page
Parse a given URL and source HTML into the Page datatype. This will set the current page.
isAuthorizedDomain :: String -> Shpider Bool
If stayOnDomain has been set to True, then isAuthorizedDomain returns True if the given URL is on the domain and False otherwise. If stayOnDomain has not been set to True, then it returns True.
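The decision rule for isAuthorizedDomain can be modelled as a small pure function. This is only a sketch and not shpider's implementation: CrawlOptions, its fields and isAuthorizedModel are hypothetical names, and treating "on the domain" as a prefix check against the starting address is an assumption.

```haskell
import Data.List (isPrefixOf)

-- Hypothetical stand-in for the two pieces of crawler state the rule needs.
data CrawlOptions = CrawlOptions
  { stayOnDomainFlag :: Bool    -- mirrors the stayOnDomain setting
  , startAddress     :: String  -- the address the crawl started from
  }

-- Assumption: "on the domain" means the URL begins with the start address.
isAuthorizedModel :: CrawlOptions -> String -> Bool
isAuthorizedModel opts url
  | stayOnDomainFlag opts = startAddress opts `isPrefixOf` url
  | otherwise             = True   -- no restriction when the flag is off
```

With the flag off every URL is accepted, which matches the second sentence of the description above.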
withAuthorizedDomain :: String -> Shpider (ShpiderCode, Page) -> Shpider (ShpiderCode, Page)
withAuthorizedDomain will execute the given action if the given URL is on an authorized domain. See isAuthorizedDomain.