The HandsomeSoup package

[ Tags: bsd3, library, text ]

See examples and full readme on the Github page: https://github.com/egonSchiele/HandsomeSoup



Properties

Versions 0.1, 0.2, 0.3, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.4, 0.4.2
Dependencies base (<5), containers, HTTP, hxt, MaybeT, mtl, network (<2.6), parsec, transformers [details]
License BSD3
Author Aditya Bhargava
Maintainer bluemangroupie@gmail.com
Category Text
Home page https://github.com/egonSchiele/HandsomeSoup
Uploaded Thu Apr 26 20:11:23 UTC 2012 by AdityaBhargava
Updated Sun May 10 12:34:52 UTC 2015 by AdamBergmark to revision 1
Distributions LTSHaskell:0.4.2, NixOS:0.4.2, Stackage:0.4.2, Tumbleweed:0.4.2
Downloads 5608 total (503 in the last 30 days)
Rating (no votes yet) [estimated by rule of succession]
Status Docs uploaded by user
Build status unknown [no reports yet]
Hackage Matrix CI

Modules


Downloads

Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Maintainer's Corner

For package maintainers and hackage trustees


Readme for HandsomeSoup-0.2


HandsomeSoup

Current Status: Usable but untested (tests coming soon! See todo list).

HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.

It is built on top of HXT and adds a few functions that make it easier to work with HTML.

Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 parser for HXT (it is very close to this right now).

Install

cabal install HandsomeSoup

Example

Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:

import Text.HandsomeSoup
import Text.XML.HXT.Core

main = do
    let doc = fromUrl "http://www.google.com/search?q=egon+schiele"
    links <- runX $ doc >>> css "h3.r a" ! "href"
    mapM_ putStrLn links

What can HandsomeSoup do for you?

Easily parse an online page using fromUrl

let doc = fromUrl "http://example.com"

Or a local page using parseHtml

contents <- readFile [filename]
let doc = parseHtml contents
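Because parseHtml accepts any string, the same code handles file contents or a fetched response body. As a minimal sketch (the helper name `paragraphTexts` and the file name `page.html` are illustrative, not part of the library), here is a complete program that parses HTML and prints the text of every `p` element:

```haskell
import Text.HandsomeSoup (css, parseHtml)
import Text.XML.HXT.Core (getText, runX, (/>), (>>>))

-- Extract the text of every <p> element from an HTML string.
-- parseHtml builds an arrow over the document tree; runX runs
-- the arrow in IO and collects the matches as a list.
paragraphTexts :: String -> IO [String]
paragraphTexts html = runX $ parseHtml html >>> css "p" /> getText

main :: IO ()
main = do
    contents <- readFile "page.html"  -- hypothetical local file
    texts <- paragraphTexts contents
    mapM_ putStrLn texts
```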

Easily extract elements using css

Here are some valid selectors:

doc >>> css "a"
doc >>> css "*"
doc >>> css "a#link1"
doc >>> css "a.foo"
doc >>> css "p > a"
doc >>> css "#container h1"
doc >>> css "img[width]"
doc >>> css "img[width=400]"
doc >>> css "a[class~=bar]"
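A css selector is just an ordinary HXT arrow, so nothing is matched until the composed arrow is run with runX. A small sketch (the helper `countMatches` and the inline HTML are illustrative, not part of the library) exercising a few of the selectors above:

```haskell
import Text.HandsomeSoup (css, parseHtml)
import Text.XML.HXT.Core (runX, (>>>))

-- Count how many nodes a CSS selector matches in an HTML string.
-- css builds an arrow; runX executes it and returns the matches.
countMatches :: String -> String -> IO Int
countMatches selector html =
    length <$> runX (parseHtml html >>> css selector)

main :: IO ()
main = do
    let html = "<div id=\"container\"><h1>Hi</h1>\
               \<a class=\"foo\" href=\"/x\">x</a></div>"
    mapM_ (\sel -> countMatches sel html >>= print)
          ["a", "*", "a.foo", "#container h1"]
```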

Easily get attributes using (!)

doc >>> css "img" ! "src"
doc >>> css "a" ! "href"
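Since (!) chains after any selector arrow, attribute extraction composes like everything else. A short sketch (the helper name `attrsOf` and the inline HTML are illustrative, not part of the library):

```haskell
import Text.HandsomeSoup (css, parseHtml, (!))
import Text.XML.HXT.Core (runX, (>>>))

-- Collect the value of one attribute from every element a
-- selector matches, e.g. every href of every <a>.
attrsOf :: String -> String -> String -> IO [String]
attrsOf selector attr html =
    runX $ parseHtml html >>> css selector ! attr

main :: IO ()
main = do
    let html = "<p><a href=\"/one\">1</a><a href=\"/two\">2</a></p>"
    hrefs <- attrsOf "a" "href" html
    mapM_ putStrLn hrefs
```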

Docs

Find Haddock docs on Hackage.

I also wrote The Complete Guide To Parsing HXT With Haskell.

Credits

Made by Adit.