HandsomeSoup: Work with HTML more easily in HXT

[ bsd3, library, text ]

See examples and the full readme on the GitHub page: https://github.com/egonSchiele/HandsomeSoup


Flags

Automatic Flags
Name           Description                                   Default
network-uri    Get Network.URI from the network-uri package  Enabled
buildexamples  Build examples                                Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.

Versions: 0.1, 0.2, 0.3, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.3.5, 0.4, 0.4.2
Dependencies: base (>=4.6 && <5), containers, HandsomeSoup, HTTP, hxt, hxt-http, mtl, network, network-uri, parsec, transformers (>=0.3)
License: BSD-3-Clause
Author: Aditya Bhargava
Maintainer: bluemangroupie@gmail.com
Category: Text
Home page: https://github.com/egonSchiele/HandsomeSoup
Uploaded: by AdityaBhargava at 2015-06-09T20:16:37Z
Distributions: LTSHaskell:0.4.2, NixOS:0.4.2, Stackage:0.4.2
Reverse Dependencies: 5 direct, 1 indirect
Executables: handsomesoup
Downloads: 13484 total (26 in the last 30 days)
Status: Docs available. Last success reported on 2015-06-11.

Readme for HandsomeSoup-0.4.2

HandsomeSoup

Current Status: Usable and stable. Requires GHC 7.6 or later (the package depends on base >= 4.6). Please file bugs!

HandsomeSoup is the library I wish I had when I started parsing HTML in Haskell.

It is built on top of HXT and adds a few functions that make it easier to work with HTML.

Most importantly, it adds CSS selectors to HXT. The goal of HandsomeSoup is to be a complete CSS2 selector parser for HXT.

Install

cabal install HandsomeSoup

Example

Nokogiri, the HTML parser for Ruby, has an example showing how to scrape Google search results. This is easy in HandsomeSoup:

import Text.XML.HXT.Core
import Text.HandsomeSoup

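main :: IO ()
main = do
    -- fromUrl builds an HXT arrow; the page is fetched and parsed when runX runs it
    let doc = fromUrl "http://www.google.com/search?q=egon+schiele"
    -- select the result links and pull out each href attribute
    links <- runX $ doc >>> css "h3.r a" ! "href"
    mapM_ putStrLn links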

What can HandsomeSoup do for you?

Easily parse an online page using fromUrl

let doc = fromUrl "http://example.com"

Or a local page using parseHtml

contents <- readFile filename  -- filename :: FilePath, the path to a local HTML file
let doc = parseHtml contents
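
A minimal, self-contained sketch of the parseHtml workflow, assuming an inline HTML string stands in for the file contents. The names sampleHtml and hrefsIn are made up for illustration and are not part of HandsomeSoup:

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample document; in practice this would come from readFile.
sampleHtml :: String
sampleHtml = concat
  [ "<html><body><div id=\"container\">"
  , "<h1>Title</h1>"
  , "<p><a class=\"foo\" href=\"/one\">One</a></p>"
  , "<a id=\"link1\" href=\"/two\">Two</a>"
  , "</div></body></html>"
  ]

-- Parse the string and collect the href of every <a> element.
hrefsIn :: String -> IO [String]
hrefsIn html = runX $ parseHtml html >>> css "a" ! "href"

main :: IO ()
main = hrefsIn sampleHtml >>= mapM_ putStrLn
```

Nothing is parsed until runX executes the arrow, so the value returned by parseHtml can be reused with several selectors.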

Easily extract elements using css

Here are some valid selectors:

doc >>> css "a"
doc >>> css "*"
doc >>> css "a#link1"
doc >>> css "a.foo"
doc >>> css "p > a"
doc >>> css "p strong"
doc >>> css "#container h1"
doc >>> css "img[width]"
doc >>> css "img[width=400]"
doc >>> css "a[class~=bar]"
doc >>> css "a:first-child"
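
To see what a few of these selectors match, the following sketch runs them against a small inline document and prints the text found beneath each match. The document page and the helper textOf are hypothetical names chosen for this example; (//>) and getText come from HXT:

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical test page, small enough to check the matches by eye.
page :: String
page = concat
  [ "<html><body><div id=\"container\">"
  , "<h1>Title</h1>"
  , "<p><a class=\"foo\" href=\"/one\">One</a></p>"
  , "<a id=\"link1\" href=\"/two\">Two</a>"
  , "</div></body></html>"
  ]

-- Run one selector and return the text nodes found beneath each match.
textOf :: String -> IO [String]
textOf sel = runX $ parseHtml page >>> css sel //> getText

main :: IO ()
main = mapM_ report ["#container h1", "a.foo", "p > a", "a#link1"]
  where
    report sel = do
      texts <- textOf sel
      putStrLn (sel ++ " -> " ++ show texts)
```

Because css returns an ordinary HXT arrow, it composes with any other HXT arrow in the same way.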

Easily get attributes using (!)

doc >>> css "img" ! "src"
doc >>> css "a" ! "href"
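
When an attribute is needed together with other data from the same element, one option is to combine HXT arrows with (&&&). The sketch below assumes that (!) behaves like composing a selector with HXT's getAttrValue, and uses getAttrValue directly to pair each link's href with its text. The names linkPage and linksWithText are illustrative only:

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical sample document with two links.
linkPage :: String
linkPage = concat
  [ "<html><body>"
  , "<a href=\"/one\">One</a>"
  , "<a href=\"/two\">Two</a>"
  , "</body></html>"
  ]

-- For every <a>, return (href, link text) as a pair.
linksWithText :: String -> IO [(String, String)]
linksWithText html =
  runX $ parseHtml html >>> css "a" >>> (getAttrValue "href" &&& deep getText)

main :: IO ()
main = linksWithText linkPage >>= mapM_ print
```

getAttrValue yields an empty string when the attribute is missing, so elements without an href would still appear in the result.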

Docs

Find Haddock docs on Hackage.

I also wrote The Complete Guide To Parsing HXT With Haskell.

Credits

Made by Adit.