-- |
-- Scalpel is a web scraping library inspired by libraries like parsec and
-- Perl's Web::Scraper. Scalpel builds on top of "Text.HTML.TagSoup" to
-- provide a declarative and monadic interface.
--
-- There are two general mechanisms provided by this library that are used to
-- build web scrapers: Selectors and Scrapers.
--
-- Selectors describe a location within an HTML DOM tree. The simplest selector
-- that can be written is a simple string value. For example, the selector
-- @\"div\"@ matches every single div node in a DOM. Selectors can be combined
-- using tag combinators. The '//' operator is used to define nested
-- relationships within a DOM tree. For example, the selector
-- @\"div\" \/\/ \"a\"@ matches all anchor tags nested arbitrarily deep within
-- a div tag.
--
-- In addition to describing the nested relationships between tags, selectors
-- can also include predicates on the attributes of a tag. The '@:' operator
-- creates a selector that matches a tag based on its name and various
-- conditions on the tag's attributes. An attribute predicate is just a
-- function that takes an attribute and returns a boolean indicating whether
-- the attribute matches some criterion. There are several attribute operators
-- that can be used to generate common predicates. The '@=' operator creates a
-- predicate that matches the name and value of an attribute exactly. For
-- example, the selector @\"div\" \@: [\"id\" \@= \"article\"]@ matches div
-- tags whose id attribute is equal to @\"article\"@.
--
-- Scrapers are values that are parameterized over a selector and produce a
-- value from an HTML DOM tree. The 'Scraper' type takes two type parameters.
-- The first is the string-like type that is used to store the text values
-- within a DOM tree. Any string-like type supported by "Text.StringLike" is
-- valid. The second type is the type of value that the scraper produces.
--
-- There are several scraper primitives that take selectors and extract
-- content from the DOM.
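--
-- As a quick illustration, selectors and scraper primitives can be exercised
-- without any network access via 'scrapeStringLike'. The snippet below is a
-- sketch (the inline HTML string and the @links@ name are made up for this
-- example, and string-literal selectors assume the OverloadedStrings
-- extension):
--
-- > -- Extract the text of every anchor tag nested within a div.
-- > links :: Maybe [String]
-- > links = scrapeStringLike "<div><p><a>click</a></p></div>"
-- >                          (texts ("div" // "a"))
-- > -- links should evaluate to Just ["click"]
--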
-- Each primitive defined by this library comes in two variants: singular and
-- plural. The singular variants extract the first instance matching the given
-- selector, while the plural variants match every instance.
--
-- The following is an example that demonstrates most of the features provided
-- by this library. Suppose you have the following hypothetical HTML located at
-- @\"http://example.com/article.html\"@ and you would like to extract a list
-- of all of the comments.
--
-- > <html>
-- >   <body>
-- >     <div class='comments'>
-- >       <div class='comment container'>
-- >         <span class='comment author'>Sally</span>
-- >         <div class='comment text'>Woo hoo!</div>
-- >       </div>
-- >       <div class='comment container'>
-- >         <span class='comment author'>Bill</span>
-- >         <img class='comment image' src='http://example.com/cat.gif' />
-- >       </div>
-- >       <div class='comment container'>
-- >         <span class='comment author'>Susan</span>
-- >         <div class='comment text'>WTF!?!</div>
-- >       </div>
-- >     </div>
-- >   </body>
-- > </html>
--
-- The following snippet defines a function, @allComments@, that will download
-- the web page and extract all of the comments into a list:
--
-- > type Author = String
-- >
-- > data Comment
-- >     = TextComment Author String
-- >     | ImageComment Author URL
-- >     deriving (Show, Eq)
-- >
-- > allComments :: IO (Maybe [Comment])
-- > allComments = scrapeURL "http://example.com/article.html" comments
-- >    where
-- >    comments :: Scraper String [Comment]
-- >    comments = chroots ("div" @: [hasClass "container"]) comment
-- >
-- >    comment :: Scraper String Comment
-- >    comment = textComment <|> imageComment
-- >
-- >    textComment :: Scraper String Comment
-- >    textComment = do
-- >        author      <- text $ "span" @: [hasClass "author"]
-- >        commentText <- text $ "div"  @: [hasClass "text"]
-- >        return $ TextComment author commentText
-- >
-- >    imageComment :: Scraper String Comment
-- >    imageComment = do
-- >        author   <- text $ "span" @: [hasClass "author"]
-- >        imageURL <- attr "src" $ "img" @: [hasClass "image"]
-- >        return $ ImageComment author imageURL
--
-- Complete examples can be found in the @examples@ folder in the scalpel git
-- repository.
module Text.HTML.Scalpel
    ( -- * Selectors
      Selector
    , AttributePredicate
    , AttributeName (..)
    , TagName (..)
    , tagSelector
    -- ** Wildcards
    , anySelector
    -- ** Tag combinators
    , (//)
    -- ** Attribute predicates
    , (@:)
    , (@=)
    , (@=~)
    , hasClass
    , notP
    , match
    -- * Scrapers
    , Scraper
    -- ** Primitives
    , attr
    , attrs
    , html
    , htmls
    , innerHTML
    , innerHTMLs
    , text
    , texts
    , chroot
    , chroots
    , position
    -- ** Executing scrapers
    , scrape
    , scrapeStringLike
    , URL
    , scrapeURL
    , scrapeURLWithOpts
    , scrapeURLWithConfig
    , Config (..)
    , Decoder
    , defaultDecoder
    , utf8Decoder
    , iso88591Decoder
    ) where

import Text.HTML.Scalpel.Core
import Text.HTML.Scalpel.Internal.Scrape.URL