scalpel-core-0.6.1: A high-level web scraping library for Haskell.

Safe Haskell: None
Language: Haskell2010

Text.HTML.Scalpel.Core

Contents

Description

Scalpel core provides a subset of the scalpel web scraping library that is intended to have lightweight dependencies and to be free of all non-Haskell dependencies.

Notably, this package does not contain any networking support. Users who want a batteries-included solution should instead depend on scalpel, which does include networking support.

More thorough documentation including example code can be found in the documentation of the scalpel package.

Synopsis

Selectors

data Selector Source #

Selector defines a selection of an HTML DOM tree to be operated on by a web scraper. The selection includes the opening tag that matches the selection, all of the inner tags, and the corresponding closing tag.

Instances
IsString Selector Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Select.Types

data AttributePredicate Source #

An AttributePredicate is a function that takes an Attribute and returns a Bool indicating whether the given attribute matches the predicate.

data AttributeName Source #

The AttributeName type can be used when creating Selectors to specify the name of an attribute of a tag.

data TagName Source #

The TagName type is used when creating a Selector to specify the name of a tag.

Constructors

AnyTag 
TagString String 
Instances
IsString TagName Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Select.Types

Methods

fromString :: String -> TagName #

textSelector :: Selector Source #

A selector which will match all text nodes.

Wildcards

anySelector :: Selector Source #

A selector which will match any node (including tags and bare text).
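A minimal sketch of how these two wildcard selectors behave, assuming scalpel-core is installed and scrapeStringLike (defined later in this module) is used as the runner:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<div>One<b>Two</b></div>"
  -- textSelector matches each bare text node individually.
  print $ scrapeStringLike s (texts textSelector)
  -- anySelector matches any node; here the first match is the whole div.
  print $ scrapeStringLike s (html anySelector)
```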

Tag combinators

(//) :: Selector -> Selector -> Selector infixl 5 Source #

The // operator creates a Selector by nesting one Selector in another. For example, "div" // "a" creates a Selector that matches anchor tags nested arbitrarily deep within a div tag.

atDepth :: Selector -> Int -> Selector infixl 6 Source #

The atDepth operator constrains a Selector to only match when it is at the specified depth below the previous selector.

For example, "div" // "a" `atDepth` 1 creates a Selector that matches anchor tags that are direct children of a div tag.
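A sketch contrasting the two combinators, assuming scalpel-core is installed:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<div><p><a>nested</a></p><a>direct</a></div>"
  -- "div" // "a" matches anchors at any depth under the div.
  print $ scrapeStringLike s (texts ("div" // "a"))
  -- `atDepth` 1 restricts the match to direct children of the div.
  print $ scrapeStringLike s (texts ("div" // "a" `atDepth` 1))
```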

Attribute predicates

(@:) :: TagName -> [AttributePredicate] -> Selector infixl 9 Source #

The @: operator creates a Selector by combining a TagName with a list of AttributePredicates.

(@=) :: AttributeName -> String -> AttributePredicate infixl 6 Source #

The @= operator creates an AttributePredicate that will match attributes with the given name and value.

To match one class of a tag that may have multiple classes, use the hasClass utility function instead.
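A short sketch combining @: and @=, assuming scalpel-core is installed:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<div id=\"a\">first</div><div id=\"b\">second</div>"
  -- Select only the div whose id attribute is exactly "b".
  print $ scrapeStringLike s (text ("div" @: ["id" @= "b"]))
```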

(@=~) :: RegexLike re String => AttributeName -> re -> AttributePredicate infixl 6 Source #

The @=~ operator creates an AttributePredicate that will match attributes with the given name and whose value matches the given regular expression.

hasClass :: String -> AttributePredicate Source #

The classes of a tag are defined in HTML as a space separated list given by the class attribute. The hasClass function will match a class attribute if the given class appears anywhere in the space separated list of classes.
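For instance, the following sketch (assuming scalpel-core is installed) matches a tag by one of its several classes:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<p class=\"intro highlight\">Hi</p><p class=\"intro\">Bye</p>"
  -- hasClass matches when the class appears anywhere in the
  -- space-separated class list, so only the first <p> matches.
  print $ scrapeStringLike s (texts ("p" @: [hasClass "highlight"]))
```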

match :: (String -> String -> Bool) -> AttributePredicate Source #

The match function allows for the creation of arbitrary AttributePredicates. The argument is a function that takes the attribute key followed by the attribute value and returns a boolean indicating if the attribute satisfies the predicate.
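As an illustration, the following sketch (assuming scalpel-core is installed) builds a predicate that matches any attribute whose name starts with "data-":

```haskell
import Data.List (isPrefixOf)
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<a data-id=\"1\">x</a><a href=\"/y\">y</a>"
  -- The predicate receives the attribute key and value; here
  -- only the key is inspected.
  print $ scrapeStringLike s
      (texts ("a" @: [match (\k _ -> "data-" `isPrefixOf` k)]))
```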

Scrapers

data Scraper str a Source #

A value of type Scraper str a defines a web scraper that consumes a list of Tags and optionally produces a value of type a.

Instances
Monad (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

(>>=) :: Scraper str a -> (a -> Scraper str b) -> Scraper str b #

(>>) :: Scraper str a -> Scraper str b -> Scraper str b #

return :: a -> Scraper str a #

fail :: String -> Scraper str a #

Functor (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

fmap :: (a -> b) -> Scraper str a -> Scraper str b #

(<$) :: a -> Scraper str b -> Scraper str a #

MonadFail (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

fail :: String -> Scraper str a #

Applicative (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

pure :: a -> Scraper str a #

(<*>) :: Scraper str (a -> b) -> Scraper str a -> Scraper str b #

liftA2 :: (a -> b -> c) -> Scraper str a -> Scraper str b -> Scraper str c #

(*>) :: Scraper str a -> Scraper str b -> Scraper str b #

(<*) :: Scraper str a -> Scraper str b -> Scraper str a #

Alternative (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

empty :: Scraper str a #

(<|>) :: Scraper str a -> Scraper str a -> Scraper str a #

some :: Scraper str a -> Scraper str [a] #

many :: Scraper str a -> Scraper str [a] #

MonadPlus (Scraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Scrape

Methods

mzero :: Scraper str a #

mplus :: Scraper str a -> Scraper str a -> Scraper str a #

Primitives

attr :: (Show str, StringLike str) => String -> Selector -> Scraper str str Source #

The attr function takes an attribute name and a selector and returns the value of the attribute of the given name for the first opening tag that matches the given selector.

This function will match only the first opening tag that matches the selector; to match every tag, use attrs.
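A sketch contrasting attr with attrs, assuming scalpel-core is installed:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<a href=\"first\">1</a><a href=\"second\">2</a>"
  -- attr returns the attribute of the first matching tag only...
  print $ scrapeStringLike s (attr "href" "a")
  -- ...while attrs returns it for every matching tag.
  print $ scrapeStringLike s (attrs "href" "a")
```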

attrs :: (Show str, StringLike str) => String -> Selector -> Scraper str [str] Source #

The attrs function takes an attribute name and a selector and returns the value of the attribute of the given name for every opening tag (possibly nested) that matches the given selector.

s = "<div id=\"out\"><div id=\"in\"></div></div>"
scrapeStringLike s (attrs "id" "div") == Just ["out", "in"]

html :: StringLike str => Selector -> Scraper str str Source #

The html function takes a selector and returns the html string from the set of tags described by the given selector.

This function will match only the first set of tags matching the selector; to match every set of tags, use htmls.

htmls :: StringLike str => Selector -> Scraper str [str] Source #

The htmls function takes a selector and returns the html string from every set of tags (possibly nested) matching the given selector.

s = "<div><div>A</div></div>"
scrapeStringLike s (htmls "div") == Just ["<div><div>A</div></div>", "<div>A</div>"]

innerHTML :: StringLike str => Selector -> Scraper str str Source #

The innerHTML function takes a selector and returns the inner html string from the set of tags described by the given selector. Inner html here means the html within, but not including, the selected tags.

This function will match only the first set of tags matching the selector; to match every set of tags, use innerHTMLs.

innerHTMLs :: StringLike str => Selector -> Scraper str [str] Source #

The innerHTMLs function takes a selector and returns the inner html string from every set of tags (possibly nested) matching the given selector.

s = "<div><div>A</div></div>"
scrapeStringLike s (innerHTMLs "div") == Just ["<div>A</div>", "A"]

text :: StringLike str => Selector -> Scraper str str Source #

The text function takes a selector and returns the inner text from the set of tags described by the given selector.

This function will match only the first set of tags matching the selector; to match every set of tags, use texts.

texts :: StringLike str => Selector -> Scraper str [str] Source #

The texts function takes a selector and returns the inner text from every set of tags (possibly nested) matching the given selector.

s = "<div>Hello <div>World</div></div>"
scrapeStringLike s (texts "div") == Just ["Hello World", "World"]

chroot :: StringLike str => Selector -> Scraper str a -> Scraper str a Source #

The chroot function takes a selector and an inner scraper and executes the inner scraper as if it were scraping a document that consists solely of the tags corresponding to the selector.

This function will match only the first set of tags matching the selector; to match every set of tags, use chroots.

chroots :: StringLike str => Selector -> Scraper str a -> Scraper str [a] Source #

The chroots function takes a selector and an inner scraper and executes the inner scraper as if it were scraping a document that consists solely of the tags corresponding to the selector. The inner scraper is executed for each set of tags (possibly nested) matching the given selector.

s = "<div><div>A</div></div>"
scrapeStringLike s (chroots "div" (pure 0)) == Just [0, 0]

position :: StringLike str => Scraper str Int Source #

The position function is intended to be used within the do-block of a chroots call. Within the do-block, position returns the index of the current subtree within the list of all subtrees matched by the selector passed to chroots.

For example, consider the following HTML:

<article>
 <p> First paragraph. </p>
 <p> Second paragraph. </p>
 <p> Third paragraph. </p>
</article>

The position function can be used to determine the index of each <p> tag within the article tag by doing the following.

chroots ("article" // "p") $ do
  index   <- position
  content <- text "p"
  return (index, content)

Which will evaluate to the list:

[
  (0, "First paragraph.")
, (1, "Second paragraph.")
, (2, "Third paragraph.")
]

matches :: StringLike str => Selector -> Scraper str () Source #

The matches function takes a selector and succeeds, returning (), if the selector matches any node in the DOM; otherwise the scraper fails.
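A minimal sketch, assuming scalpel-core is installed:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<div><span>x</span></div>"
  -- The span selector matches, so the scraper succeeds with ().
  print $ scrapeStringLike s (matches "span")
  -- No anchor tag exists, so the scraper fails with Nothing.
  print $ scrapeStringLike s (matches "a")
```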

Executing scrapers

scrape :: StringLike str => Scraper str a -> [Tag str] -> Maybe a Source #

The scrape function executes a Scraper on a list of Tags and produces an optional value.

scrapeStringLike :: StringLike str => str -> Scraper str a -> Maybe a Source #

The scrapeStringLike function parses a StringLike value into a list of tags and executes a Scraper on it.
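A sketch contrasting the two entry points, assuming scalpel-core is installed; parseTags comes from the tagsoup package that scalpel-core builds on:

```haskell
import Text.HTML.Scalpel.Core
import Text.HTML.TagSoup (parseTags)

main :: IO ()
main = do
  let doc  = "<h1>Title</h1>"
      tags = parseTags doc
  -- scrape runs on a pre-parsed tag list...
  print $ scrape (text "h1") tags
  -- ...while scrapeStringLike parses and scrapes in one step.
  print $ scrapeStringLike doc (text "h1")
```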

Serial Scraping

data SerialScraper str a Source #

A SerialScraper allows for the application of Scrapers on a sequence of sibling nodes. This allows for use cases like targeting the sibling of a node, or extracting a sequence of sibling nodes (e.g. paragraphs (<p>) under a header (<h2>)).

Conceptually serial scrapers operate on a sequence of tags that correspond to the immediate children of the currently focused node. For example, given the following HTML:

 <article>
   <h1>title</h1>
   <h2>Section 1</h2>
   <p>Paragraph 1.1</p>
   <p>Paragraph 1.2</p>
   <h2>Section 2</h2>
   <p>Paragraph 2.1</p>
   <p>Paragraph 2.2</p>
 </article>

A serial scraper that visits the header and paragraph nodes can be executed with the following:

chroot "article" $ inSerial $ do ...

Each SerialScraper primitive follows the pattern of first moving the focus backward or forward and then extracting content from the new focus. Attempting to extract content from beyond the end of the sequence causes the scraper to fail.

To complete the above example, the article's structure and content can be extracted with the following code:

chroot "article" $ inSerial $ do
    title <- seekNext $ text "h1"
    sections <- many $ do
       section <- seekNext $ text "h2"
       ps <- untilNext (matches "h2") (many $ seekNext $ text "p")
       return (section, ps)
    return (title, sections)

Which will evaluate to:

 ("title", [
   ("Section 1", ["Paragraph 1.1", "Paragraph 1.2"]),
   ("Section 2", ["Paragraph 2.1", "Paragraph 2.2"])
 ])
Instances
Monad (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

(>>=) :: SerialScraper str a -> (a -> SerialScraper str b) -> SerialScraper str b #

(>>) :: SerialScraper str a -> SerialScraper str b -> SerialScraper str b #

return :: a -> SerialScraper str a #

fail :: String -> SerialScraper str a #

Functor (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

fmap :: (a -> b) -> SerialScraper str a -> SerialScraper str b #

(<$) :: a -> SerialScraper str b -> SerialScraper str a #

MonadFail (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

fail :: String -> SerialScraper str a #

Applicative (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

pure :: a -> SerialScraper str a #

(<*>) :: SerialScraper str (a -> b) -> SerialScraper str a -> SerialScraper str b #

liftA2 :: (a -> b -> c) -> SerialScraper str a -> SerialScraper str b -> SerialScraper str c #

(*>) :: SerialScraper str a -> SerialScraper str b -> SerialScraper str b #

(<*) :: SerialScraper str a -> SerialScraper str b -> SerialScraper str a #

Alternative (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

empty :: SerialScraper str a #

(<|>) :: SerialScraper str a -> SerialScraper str a -> SerialScraper str a #

some :: SerialScraper str a -> SerialScraper str [a] #

many :: SerialScraper str a -> SerialScraper str [a] #

MonadPlus (SerialScraper str) Source # 
Instance details

Defined in Text.HTML.Scalpel.Internal.Serial

Methods

mzero :: SerialScraper str a #

mplus :: SerialScraper str a -> SerialScraper str a -> SerialScraper str a #

inSerial :: StringLike str => SerialScraper str a -> Scraper str a Source #

Executes a SerialScraper in the context of a Scraper. The immediate children of the currently focused node are visited serially.

Primitives

stepNext :: StringLike str => Scraper str a -> SerialScraper str a Source #

Move the cursor forward one node and execute the given scraper on the new focused node.

stepBack :: StringLike str => Scraper str a -> SerialScraper str a Source #

Move the cursor back one node and execute the given scraper on the new focused node.
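A minimal sketch of cursor stepping, assuming scalpel-core is installed:

```haskell
import Text.HTML.Scalpel.Core

main :: IO ()
main = do
  let s = "<ul><li>one</li><li>two</li></ul>"
  -- Each stepNext advances the cursor to the next sibling <li>
  -- and runs the inner scraper on it.
  print $ scrapeStringLike s $ chroot "ul" $ inSerial $ do
    a <- stepNext (text "li")
    b <- stepNext (text "li")
    return (a, b)
```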

seekNext :: StringLike str => Scraper str a -> SerialScraper str a Source #

Move the cursor forward until the given scraper is successfully able to execute on the focused node. If the scraper is never successful then the serial scraper will fail.

seekBack :: StringLike str => Scraper str a -> SerialScraper str a Source #

Move the cursor backward until the given scraper is successfully able to execute on the focused node. If the scraper is never successful then the serial scraper will fail.

untilNext :: StringLike str => Scraper str a -> SerialScraper str b -> SerialScraper str b Source #

Create a new serial context by moving the focus forward and collecting nodes until the scraper matches the focused node. The serial scraper is then executed on the collected nodes.

The provided serial scraper is unable to see nodes outside the new restricted context.

untilBack :: StringLike str => Scraper str a -> SerialScraper str b -> SerialScraper str b Source #

Create a new serial context by moving the focus backward and collecting nodes until the scraper matches the focused node. The serial scraper is then executed on the collected nodes.