Safe Haskell | None |
---|---|
Language | Haskell2010 |
Scalpel is a web scraping library inspired by libraries like parsec and Perl's Web::Scraper. Scalpel builds on top of Text.HTML.TagSoup to provide a declarative and monadic interface.
There are two general mechanisms provided by this library that are used to build web scrapers: Selectors and Scrapers.
Selectors describe a location within an HTML DOM tree. The simplest selector
that can be written is a plain string value. For example, the selector "div"
matches every div node in a DOM. Selectors can be combined using tag
combinators. The // operator defines nested relationships within a DOM tree.
For example, the selector "div" // "a" matches all anchor tags nested
arbitrarily deep within a div tag.
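As a minimal sketch of the nesting behaviour described above, the following compares a bare tag selector with a `//` selector. The sample document and the names `sampleHtml`, `allAnchors`, and `nestedAnchors` are hypothetical, and the code assumes the API as documented on this page, imported from `Text.HTML.Scalpel`, where plain `String` literals are accepted as selectors.

```haskell
import Text.HTML.Scalpel

-- Hypothetical sample document: one anchor inside a div, one outside.
sampleHtml :: String
sampleHtml = "<div><p><a>inside</a></p></div><a>outside</a>"

-- Selects every anchor, wherever it appears in the document.
allAnchors :: Maybe [String]
allAnchors = scrapeStringLike sampleHtml (texts "a")

-- Selects only anchors nested (at any depth) within a div.
nestedAnchors :: Maybe [String]
nestedAnchors = scrapeStringLike sampleHtml (texts ("div" // "a"))
```

Both values can be inspected in GHCi; the second should omit the anchor that sits outside the div.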
In addition to describing the nested relationships between tags, selectors
can also include predicates on the attributes of a tag. The @: operator
creates a selector that matches a tag based on its name and a list of
conditions on the tag's attributes. An attribute predicate is a function
that takes an attribute and returns a boolean indicating whether the attribute
matches a criterion. There are several attribute operators that can be used
to generate common predicates. The @= operator creates a predicate that
matches the name and value of an attribute exactly. For example, the selector
"div" @: ["id" @= "article"] matches div tags where the id attribute is equal
to "article".
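The selector from the paragraph above can be tried against an in-memory document. This is only a sketch: the sample HTML and the names `sampleHtml` and `articleText` are made up for illustration, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document with two divs that differ only in their id.
sampleHtml :: String
sampleHtml =
    "<div id=\"sidebar\">Links</div><div id=\"article\">Body text</div>"

-- Extracts the inner text of the first div whose id is exactly "article".
articleText :: Maybe String
articleText =
    scrapeStringLike sampleHtml (text ("div" @: ["id" @= "article"]))
```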
Scrapers are values that are parameterized over a selector and produce
a value from an HTML DOM tree. The Scraper type takes two type parameters.
The first is the string-like type that is used to store the text values
within a DOM tree. Any string-like type supported by Text.StringLike is
valid. The second is the type of value that the scraper produces.
There are several scraper primitives that take selectors and extract content from the DOM. Each primitive defined by this library comes in two variants: singular and plural. The singular variants extract the first instance matching the given selector, while the plural variants match every instance.
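To illustrate the singular and plural variants, here is a small sketch that applies `text` and `texts` to the same selector. The list markup and the names `sampleHtml`, `firstItem`, and `allItems` are hypothetical, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document containing three list items.
sampleHtml :: String
sampleHtml = "<ul><li>one</li><li>two</li><li>three</li></ul>"

-- Singular variant: only the first matching li contributes a value.
firstItem :: Maybe String
firstItem = scrapeStringLike sampleHtml (text "li")

-- Plural variant: every matching li contributes a value, in document order.
allItems :: Maybe [String]
allItems = scrapeStringLike sampleHtml (texts "li")
```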
The following example demonstrates most of the features provided by this
library. Suppose you have the following hypothetical HTML located at
"http://example.com/article.html" and you would like to extract a list of
all of the comments.
<html>
  <body>
    <div class='comments'>
      <div class='comment container'>
        <span class='comment author'>Sally</span>
        <div class='comment text'>Woo hoo!</div>
      </div>
      <div class='comment container'>
        <span class='comment author'>Bill</span>
        <img class='comment image' src='http://example.com/cat.gif' />
      </div>
      <div class='comment container'>
        <span class='comment author'>Susan</span>
        <div class='comment text'>WTF!?!</div>
      </div>
    </div>
  </body>
</html>
The following snippet defines a function, allComments, that will download
the web page and extract all of the comments into a list:
import Control.Applicative ((<|>))
import Text.HTML.Scalpel

type Author = String

data Comment
    = TextComment Author String
    | ImageComment Author URL
    deriving (Show, Eq)

allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
  where
    -- Run the comment scraper once per comment container.
    comments :: Scraper String [Comment]
    comments = chroots ("div" @: [hasClass "container"]) comment

    -- A comment is either a text comment or an image comment.
    comment :: Scraper String Comment
    comment = textComment <|> imageComment

    textComment :: Scraper String Comment
    textComment = do
        author      <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div"  @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: Scraper String Comment
    imageComment = do
        author   <- text       $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img"  @: [hasClass "image"]
        return $ ImageComment author imageURL
Complete examples can be found in the examples folder in the scalpel git repository.
- data Selector
- class Selectable s where
- toSelector :: s -> Selector
- data AttributePredicate
- class AttributeName k
- class TagName t
- data Any = Any
- (//) :: (Selectable a, Selectable b) => a -> b -> Selector
- (@:) :: TagName tag => tag -> [AttributePredicate] -> Selector
- (@=) :: AttributeName key => key -> String -> AttributePredicate
- (@=~) :: (AttributeName key, RegexLike re String) => key -> re -> AttributePredicate
- hasClass :: String -> AttributePredicate
- data Scraper str a
- attr :: (Ord str, Show str, StringLike str, Selectable s) => String -> s -> Scraper str str
- attrs :: (Ord str, Show str, StringLike str, Selectable s) => String -> s -> Scraper str [str]
- html :: (Ord str, StringLike str, Selectable s) => s -> Scraper str str
- htmls :: (Ord str, StringLike str, Selectable s) => s -> Scraper str [str]
- text :: (Ord str, StringLike str, Selectable s) => s -> Scraper str str
- texts :: (Ord str, StringLike str, Selectable s) => s -> Scraper str [str]
- chroot :: (Ord str, StringLike str, Selectable s) => s -> Scraper str a -> Scraper str a
- chroots :: (Ord str, StringLike str, Selectable s) => s -> Scraper str a -> Scraper str [a]
- scrape :: (Ord str, StringLike str) => Scraper str a -> [Tag str] -> Maybe a
- scrapeStringLike :: (Ord str, StringLike str) => str -> Scraper str a -> Maybe a
- type URL = String
- scrapeURL :: (Ord str, StringLike str) => URL -> Scraper str a -> IO (Maybe a)
- scrapeURLWithOpts :: (Ord str, StringLike str) => [CurlOption] -> URL -> Scraper str a -> IO (Maybe a)
- scrapeURLWithConfig :: (Ord str, StringLike str) => Config str -> URL -> Scraper str a -> IO (Maybe a)
- data Config str = Config {
- curlOpts :: [CurlOption]
- decoder :: Decoder str
- type Decoder str = CurlResponse_ [(String, String)] ByteString -> str
- defaultDecoder :: StringLike str => Decoder str
- utf8Decoder :: StringLike str => Decoder str
- iso88591Decoder :: StringLike str => Decoder str
Selectors
data Selector
Selector defines a selection of an HTML DOM tree to be operated on by
a web scraper. The selection includes the opening tag that matches the
selection, all of the inner tags, and the corresponding closing tag.
class Selectable s where
The Selectable class defines a class of types that are capable of being
cast into a Selector, which in turn describes a section of an HTML DOM
tree.
toSelector :: s -> Selector
data AttributePredicate
An AttributePredicate is a function that takes an Attribute and
returns a Bool indicating whether the given attribute matches a predicate.
class AttributeName k
The AttributeName class defines a class of types that can be used when
creating Selectors to specify the name of an attribute of a tag. Currently
the only types of this class are String, for matching attributes exactly,
and Any, for matching attributes with any name.
Minimal complete definition: matchKey
class TagName t
The TagName class defines a class of types that can be used when creating
Selectors to specify the name of a tag. Currently the only types of this
class are String, for matching tags exactly, and Any, for matching tags
with any name.
Minimal complete definition: toSelectNode
Wildcards
data Any = Any
Any can be used as a wildcard when constructing selectors to match tags
and attributes with any name.
For example, the selector Any @: [Any @= "foo"] matches all tags that
have any attribute where the value is "foo".
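A short sketch of the wildcard selector from the example above, run against an in-memory document. The markup and the names `sampleHtml` and `fooTagged` are hypothetical, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document: three different tags, two of which carry an
-- attribute whose value is "foo" (under different attribute names).
sampleHtml :: String
sampleHtml =
    "<a name=\"foo\">first</a><b title=\"bar\">second</b><i id=\"foo\">third</i>"

-- Matches any tag that has any attribute with the value "foo".
fooTagged :: Maybe [String]
fooTagged = scrapeStringLike sampleHtml (texts (Any @: [Any @= "foo"]))
```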
Tag combinators
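(//) :: (Selectable a, Selectable b) => a -> b -> Selector infixl 5
The // operator creates a Selector by nesting one selector within another.
For example, "div" // "a" matches all anchor tags nested arbitrarily deep
within a div tag.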
Attribute predicates
(@:) :: TagName tag => tag -> [AttributePredicate] -> Selector infixl 9
The @: operator creates a Selector by combining a TagName with a list
of AttributePredicates.
(@=) :: AttributeName key => key -> String -> AttributePredicate infixl 6
The @= operator creates an AttributePredicate that will match
attributes with the given name and value.
If you are attempting to match a specific class of a tag with potentially
multiple classes, you should use the hasClass utility function.
(@=~) :: (AttributeName key, RegexLike re String) => key -> re -> AttributePredicate infixl 6
The @=~ operator creates an AttributePredicate that will match
attributes with the given name and whose value matches the given regular
expression.
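The following sketch shows one way a regular expression might be supplied to `@=~`. It assumes the regex-tdfa package is available and uses its `Regex` type, which has a `RegexLike Regex String` instance; any other `RegexLike` implementation should work as well. The sample markup and the names `gifPattern`, `sampleHtml`, and `gifSources` are hypothetical.

```haskell
import Text.HTML.Scalpel
import Text.Regex.TDFA (Regex, makeRegex)

-- Assumed regex backend: regex-tdfa. Matches values ending in ".gif".
gifPattern :: Regex
gifPattern = makeRegex "\\.gif$"

-- Hypothetical document with one GIF image and one PNG image.
sampleHtml :: String
sampleHtml = "<img src='cat.gif' /><img src='dog.png' />"

-- Collects the src of every img whose src attribute matches the pattern.
gifSources :: Maybe [String]
gifSources =
    scrapeStringLike sampleHtml (attrs "src" ("img" @: ["src" @=~ gifPattern]))
```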
hasClass :: String -> AttributePredicate
The classes of a tag are defined in HTML as a space-separated list given by
the class attribute. The hasClass function will match a class attribute
if the given class appears anywhere in the space-separated list of classes.
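The difference between an exact `@=` match on the class attribute and `hasClass` can be sketched as follows. The markup and the names `sampleHtml`, `exactMatch`, and `classMatch` are hypothetical, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document: the first div has two classes, the second has one.
sampleHtml :: String
sampleHtml =
    "<div class=\"comment container\">A</div><div class=\"container\">B</div>"

-- Exact match: only a class attribute equal to "container" qualifies.
exactMatch :: Maybe [String]
exactMatch =
    scrapeStringLike sampleHtml (texts ("div" @: ["class" @= "container"]))

-- hasClass: "container" may appear anywhere in the class list.
classMatch :: Maybe [String]
classMatch =
    scrapeStringLike sampleHtml (texts ("div" @: [hasClass "container"]))
```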
Scrapers
Primitives
attr :: (Ord str, Show str, StringLike str, Selectable s) => String -> s -> Scraper str str
The attr function is the singular variant of attrs: it returns the value of
the attribute of the given name for the first opening tag that matches the
given selector.
attrs :: (Ord str, Show str, StringLike str, Selectable s) => String -> s -> Scraper str [str]
The attrs function takes an attribute name and a selector and returns the
value of the attribute of the given name for every opening tag that matches
the given selector.
html :: (Ord str, StringLike str, Selectable s) => s -> Scraper str str
The html function is the singular variant of htmls: it returns the html
string from the first set of tags matching the given selector.
htmls :: (Ord str, StringLike str, Selectable s) => s -> Scraper str [str]
The htmls function takes a selector and returns the html string from every
set of tags matching the given selector.
text :: (Ord str, StringLike str, Selectable s) => s -> Scraper str str
The text function is the singular variant of texts: it returns the inner
text from the first set of tags matching the given selector.
texts :: (Ord str, StringLike str, Selectable s) => s -> Scraper str [str]
The texts function takes a selector and returns the inner text from every
set of tags matching the given selector.
chroot :: (Ord str, StringLike str, Selectable s) => s -> Scraper str a -> Scraper str a
The chroot function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector.
This function will match only the first set of tags matching the selector. To
match every set of tags, use chroots.
chroots :: (Ord str, StringLike str, Selectable s) => s -> Scraper str a -> Scraper str [a]
The chroots function takes a selector and an inner scraper and executes
the inner scraper as if it were scraping a document that consists solely of
the tags corresponding to the selector. The inner scraper is executed for
each set of tags matching the given selector.
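As a sketch of how chroot and chroots differ, the same inner scraper can be run under both. The markup and the names `sampleHtml`, `entry`, `firstEntry`, and `allEntries` are hypothetical, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document with two comment blocks.
sampleHtml :: String
sampleHtml =
    "<div class='comment'><span class='author'>Sally</span><p>Woo hoo!</p></div>"
    ++ "<div class='comment'><span class='author'>Bill</span><p>Nice.</p></div>"

-- Inner scraper: sees only the tags of a single comment block.
entry :: Scraper String (String, String)
entry = do
    author <- text ("span" @: [hasClass "author"])
    body   <- text "p"
    return (author, body)

-- chroot runs the inner scraper on the first matching block only.
firstEntry :: Maybe (String, String)
firstEntry =
    scrapeStringLike sampleHtml (chroot ("div" @: [hasClass "comment"]) entry)

-- chroots runs the inner scraper once per matching block.
allEntries :: Maybe [(String, String)]
allEntries =
    scrapeStringLike sampleHtml (chroots ("div" @: [hasClass "comment"]) entry)
```

Scoping the inner scraper this way keeps each author paired with the body from the same block, which two independent document-wide `texts` calls would not guarantee.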
Executing scrapers
scrapeStringLike :: (Ord str, StringLike str) => str -> Scraper str a -> Maybe a
The scrapeStringLike function parses a StringLike value into a list of
tags and executes a Scraper on it.
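Because scrapeStringLike needs no network access, it is convenient for trying scrapers out on literal input. The sketch below also shows the Maybe result: a singular primitive that finds no match yields Nothing. The markup and the names `sampleHtml`, `heading`, and `missing` are hypothetical, and the code assumes the API as documented on this page.

```haskell
import Text.HTML.Scalpel

-- Hypothetical document with a single heading.
sampleHtml :: String
sampleHtml = "<html><body><h1>Title</h1></body></html>"

-- The selector matches, so the scraper produces a value.
heading :: Maybe String
heading = scrapeStringLike sampleHtml (text "h1")

-- No h2 exists, so the singular primitive fails and the result is Nothing.
missing :: Maybe String
missing = scrapeStringLike sampleHtml (text "h2")
```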
scrapeURLWithOpts :: (Ord str, StringLike str) => [CurlOption] -> URL -> Scraper str a -> IO (Maybe a)
The scrapeURLWithOpts function takes a list of curl options, downloads
the contents of the given URL, and executes a Scraper on it.
scrapeURLWithConfig :: (Ord str, StringLike str) => Config str -> URL -> Scraper str a -> IO (Maybe a)
The scrapeURLWithConfig function takes a Config record, downloads the
contents of the given URL, and executes a Scraper on it.
data Config str
A record type that determines how scrapeURLWithConfig interacts with the
HTTP server and interprets the results.
Constructors
Config
    curlOpts :: [CurlOption]
    decoder :: Decoder str
Instances
StringLike str => Default (Config str)
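A minimal sketch of building a Config by hand and passing it to scrapeURLWithConfig, assuming the record fields are exported as listed in the synopsis. The names `utf8Config` and `pageTitle` are hypothetical, the empty option list is an arbitrary choice, and running `pageTitle` requires network access to the example URL used elsewhere on this page.

```haskell
import Text.HTML.Scalpel

-- No extra curl options; always decode the response body as UTF-8.
utf8Config :: Config String
utf8Config = Config { curlOpts = [], decoder = utf8Decoder }

-- Downloads the page with the configuration above and extracts its title.
pageTitle :: IO (Maybe String)
pageTitle =
    scrapeURLWithConfig utf8Config "http://example.com/article.html" (text "title")
```

Forcing a specific decoder like this may be useful when a server omits or misreports the character set in its `Content-Type` header.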
type Decoder str = CurlResponse_ [(String, String)] ByteString -> str
A function that takes an HTTP response as raw bytes and returns the body as a string-like type.
defaultDecoder :: StringLike str => Decoder str
The default response decoder. This decoder attempts to infer the character set of the HTTP response body from the `Content-Type` header. If this header is not present, then the character set is assumed to be `ISO-8859-1`.
utf8Decoder :: StringLike str => Decoder str
A decoder that will always decode using `UTF-8`.
iso88591Decoder :: StringLike str => Decoder str
A decoder that will always decode using `ISO-8859-1`.