| Safe Haskell | None |
|---|---|
| Language | Haskell2010 |
Follow.Fetchers.WebScraping
Description
This module defines a fetcher strategy that generates entries by scraping the HTML content retrieved from a URI.
A Selector must be given to indicate where the information
for each entry field should be taken from.
Be aware that scraping an HTML page offers very few consistency
guarantees. Depending on the page structure and the selectors you
give, you could end up with 5 URIs, 4 titles and 6 descriptions. Keep
in mind that the URIs are the leading and limiting field: the number
of URIs decides the number of entries, so in the previous scenario one
Nothing title would be added and one description would be discarded.
Here is an example:
{-# LANGUAGE OverloadedStrings #-}

import Follow
import Follow.Fetchers.WebScraping

selector :: Selector
selector = Selector {
    selURI = Just $ Attr ".title a" "href"
  , selGUID = Just $ Attr ".title a" "href"
  , selTitle = Just $ InnerText ".title a"
  , selDescription = Just $ InnerText ".description"
  , selAuthor = Just $ InnerText ".author"
  , selPublishDate = Nothing
  }

result :: IO [Entry]
result = fetch "http://an_url.com" selector
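The URI-driven alignment described above can be illustrated with a small, self-contained sketch. It does not use the library's own types: `SketchEntry`, `alignTo` and `buildEntries` are hypothetical names introduced only for illustration, and padding missing values at the end of the list is an assumption, since the text only says that a Nothing is added.

```haskell
-- Hypothetical, simplified stand-in for the library's Entry type.
data SketchEntry = SketchEntry
  { seURI         :: String
  , seTitle       :: Maybe String
  , seDescription :: Maybe String
  } deriving (Eq, Show)

-- Pad with Nothing or truncate so that the result has exactly one
-- element per URI. Assumption: missing values are padded at the end.
alignTo :: [a] -> [b] -> [Maybe b]
alignTo uris values =
  zipWith (\_ v -> v) uris (map Just values ++ repeat Nothing)

-- Build one entry per URI; the other fields follow the URI count.
buildEntries :: [String] -> [String] -> [String] -> [SketchEntry]
buildEntries uris titles descriptions =
  zipWith3 SketchEntry uris (alignTo uris titles) (alignTo uris descriptions)
```

With 5 URIs, 4 titles and 6 descriptions, `buildEntries` yields 5 entries: the last one has a `Nothing` title, and the sixth description is dropped.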
Synopsis
- fetch :: (MonadThrow m, MonadIO m, MonadHttp m) => ByteString -> Selector -> Fetched m
- data Selector = Selector {}
- data SelectorItem
- type CSSSelector = Text
- type HTMLAttribute = Text
Documentation
fetch :: (MonadThrow m, MonadIO m, MonadHttp m) => ByteString -> Selector -> Fetched m
Fetches entries from the given URL using the specified selectors.
data Selector
Data type with the selectors to use when scraping each Entry item.
Constructors
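| Selector | |
| Fields: selURI, selGUID, selTitle, selDescription, selAuthor, selPublishDate, each an optional SelectorItem as used in the example above | |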
data SelectorItem
Selector to use when scraping an Entry item.
Constructors
| InnerText CSSSelector | Takes the text that is an immediate child of the tag matched by the given CSS selector. |
| Attr CSSSelector HTMLAttribute | Takes the value of the given attribute in the tag matched by the given CSS selector. |
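The difference between the two constructors can be sketched on a toy document model. This is a minimal illustration, not the module's implementation: `Node`, `Extract` and `extract` are hypothetical names, CSS matching is left out entirely (the node is assumed to be already matched), and returning `Nothing` for a tag without direct text is an assumption.

```haskell
-- Hypothetical toy document model; the real module works on parsed HTML.
data Node
  = TextNode String
  | Element String [(String, String)] [Node] -- tag name, attributes, children

-- Local mirror of the two constructors without the CSS selector argument,
-- because selector matching is out of scope for this sketch.
data Extract
  = InnerTextOf
  | AttrOf String

-- Apply an extraction to a node that has already been matched.
extract :: Extract -> Node -> Maybe String
extract _ (TextNode _) = Nothing
extract InnerTextOf (Element _ _ children) =
  -- Only direct text children count; text in nested tags is ignored.
  case concat [t | TextNode t <- children] of
    "" -> Nothing -- assumption: no direct text means no value
    t  -> Just t
extract (AttrOf name) (Element _ attrs _) = lookup name attrs

-- <a href="/post/1">First post<span> (new)</span></a>
link :: Node
link =
  Element "a" [("href", "/post/1")]
    [TextNode "First post", Element "span" [] [TextNode " (new)"]]
```

Here `extract InnerTextOf link` gives `Just "First post"`, leaving out the nested span, while `extract (AttrOf "href") link` gives `Just "/post/1"`.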
Instances
| Eq SelectorItem | |
| Defined in Follow.Fetchers.WebScraping.Internal | |
| Show SelectorItem | |
| Defined in Follow.Fetchers.WebScraping.Internal. Methods: showsPrec :: Int -> SelectorItem -> ShowS; show :: SelectorItem -> String; showList :: [SelectorItem] -> ShowS | |
| FromJSON SelectorItem | |
| Defined in Follow.Parser | |

Forms accepted by the FromJSON instance:

type: text
options:
  css: .selector

or

type: attr
options:
  css: .link
  name: href

type CSSSelector = Text
A CSS2 selector.
type HTMLAttribute = Text
An HTML attribute name.