xpathdsv: Command line tool to extract DSV data from HTML and XML with XPATH expressions

[ bsd3, program, text ]


Versions: 0.1.0.0, 0.1.1.0
Dependencies: base (>=4.7 && <5), hxt, hxt-xpath, optparse-applicative, text
License: BSD-3-Clause
Copyright: Daniel Choi 2016
Author: Daniel Choi
Maintainer: dhchoi@gmail.com
Category: Text
Home page: https://github.com/danchoi/xpathdsv#readme
Uploaded: by DanielChoi at 2016-05-24T18:10:36Z
Distributions: NixOS:0.1.1.0
Reverse Dependencies: 1 direct, 0 indirect
Executables: xpathdsv
Downloads: 1346 total (9 in the last 30 days)
Status: Docs not available
Last success reported on 2016-05-24

Readme for xpathdsv-0.1.0.0

xpathdsv

Extract DSV text from HTML and XML using XPATH expressions.

Example

If you have an HTML file like this:

sample.html

<html>
  <head><title>Test</title></head>
  <body>
    <h1>Some links</h1>
    <ul>
      <li><a href="http://news.ycombinator.com">Hacker News</a></li>
      <li><a href="http://yahoo.com">Yahoo</a>
      <li><a href="http://duckduckgo.com">Duck Duck Go</a>
      <li><a href="http://github.com">GitHub</a>
    </ul>
  </body>
</html>

You can extract a list of tab-separated values like this:

xpathdsv '//a' '/a/text()' '/a/@href/text()' < sample.html

Output:

Hacker News	http://news.ycombinator.com
Yahoo	http://yahoo.com
Duck Duck Go	http://duckduckgo.com
GitHub	http://github.com

The first XPATH expression in the command selects the base nodes, and all the following XPATH expressions are applied to each base node in turn. Each base node produces one row of output, and each of the following XPATH expressions generates one column of that row.
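The row-and-column model described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how xpathdsv is implemented: it uses Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset and needs well-formed XML, so the input is a simplified stand-in for sample.html, and plain accessor functions stand in for the column expressions. The names `extract_rows` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed XML stand-in for sample.html (ElementTree is not an HTML parser).
SAMPLE = """
<ul>
  <li><a href="http://news.ycombinator.com">Hacker News</a></li>
  <li><a href="http://yahoo.com">Yahoo</a></li>
</ul>
"""

def extract_rows(xml_text, base_path, column_getters):
    """Return one row per node matched by base_path, one cell per getter."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(base_path):  # each base node becomes a row
        rows.append([getter(node) for getter in column_getters])
    return rows

rows = extract_rows(
    SAMPLE,
    ".//a",  # plays the role of the base expression '//a'
    [
        lambda a: a.text,         # plays the role of '/a/text()'
        lambda a: a.get("href"),  # plays the role of '/a/@href/text()'
    ],
)

for row in rows:
    print("\t".join(row))  # tab-separated, like the tool's default output
```

Adding a third accessor to the list would add a third column to every row, which mirrors passing one more column expression on the xpathdsv command line.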

If you don't specify a text() node at the end of an XPATH expression, you'll get a string representation of a node, which may be useful for debugging:

xpathdsv '//a' '/a' < sample.html

Output:

<a href="http://news.ycombinator.com">Hacker News</a>
<a href="http://yahoo.com">Yahoo</a>
<a href="http://duckduckgo.com">Duck Duck Go</a>
<a href="http://github.com">GitHub</a>

Usage

xpathdsv

Usage: xpathdsv [--xml] [-F OUTPUT-DELIM] [-n NULL-OUTPUT] BASE-XPATH
                [CHILD-XPATH]
  Extract DSV data from HTML or XML with XPath

Available options:
  -h,--help                Show this help text
  --xml                    Parse as XML, rather than HTML.
  -F OUTPUT-DELIM          Default \t
  -n NULL-OUTPUT           Null value output string. Default ""

See https://github.com/danchoi/xpathdsv for more information.
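The effect of the `-F` and `-n` options can be illustrated with a small Python sketch. It assumes, going only by the help text, that `-F` sets the string placed between columns and that `-n` sets the string printed when a column expression matches nothing; the actual behavior of xpathdsv in edge cases is not verified here. The function `format_rows`, its parameters, and the `title` attribute in the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: only the first link carries a title attribute.
SAMPLE = """
<ul>
  <li><a href="http://github.com" title="Code hosting">GitHub</a></li>
  <li><a href="http://yahoo.com">Yahoo</a></li>
</ul>
"""

def format_rows(xml_text, base_path, attrs, delim="\t", null=""):
    """Join the named attributes of each base node with delim.

    delim mirrors the assumed role of -F (default tab);
    null mirrors the assumed role of -n (default empty string).
    """
    root = ET.fromstring(xml_text)
    lines = []
    for node in root.findall(base_path):
        cells = [node.get(name) for name in attrs]  # None when the attribute is missing
        lines.append(delim.join(null if c is None else c for c in cells))
    return lines

lines = format_rows(SAMPLE, ".//a", ["href", "title"], delim=",", null="NULL")
for line in lines:
    print(line)
```

With the defaults left in place, a missing value becomes an empty cell, so two delimiters appear side by side or the line ends in a delimiter; an explicit null string makes such gaps easier to spot.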

Author

Daniel Choi https://github.com/danchoi
