The wp-archivebot package
A MediaWiki wiki's RecentChanges or NewPages page links to every new edit or article. This bot polls the corresponding RSS feed (easier and more reliable than parsing the HTML), follows the links to each new edit or article, and then uses TagSoup to extract every off-wiki link (e.g. to http://cnn.com).
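The link-extraction step can be sketched roughly as follows. This is an illustration, not the package's actual source: the function name `externalLinks`, the `wikiPrefix` parameter, and the sample HTML are assumptions made for the example, and it treats a link as off-wiki when it is an absolute http(s) URL that does not start with the wiki's own prefix.

```haskell
import Data.List (isPrefixOf, nub)
import Text.HTML.TagSoup (Tag (TagOpen), parseTags)

-- | Collect the href targets of <a> tags that are absolute http(s) URLs
-- and do not point back at the wiki itself.
-- 'wikiPrefix' is a hypothetical parameter, e.g. "http://en.wikipedia.org".
externalLinks :: String -> String -> [String]
externalLinks wikiPrefix html =
  nub
    [ url
    | TagOpen "a" attrs <- parseTags html       -- every opening <a> tag
    , ("href", url) <- attrs                    -- its href attribute, if any
    , "http://" `isPrefixOf` url || "https://" `isPrefixOf` url
    , not (wikiPrefix `isPrefixOf` url)         -- drop on-wiki links
    ]

-- | Made-up page fragment: one off-wiki link, one relative link,
-- and one absolute on-wiki link.
sample :: String
sample =
  "<p><a href=\"http://cnn.com/story\">news</a> \
  \<a href=\"/wiki/Haskell\">internal</a> \
  \<a href=\"http://en.wikipedia.org/wiki/Foo\">on-wiki</a></p>"

main :: IO ()
main = mapM_ putStrLn (externalLinks "http://en.wikipedia.org" sample)
```

On the sample fragment only the cnn.com link survives the filter, since relative links and links under the wiki's own prefix are dropped.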
With this list of external links, the bot then fires off requests to http://webcitation.org/, which makes a backup of each page (similar to the Internet Archive, but on-demand).
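As a hedged sketch of what building such an archive request might look like: the endpoint path `/archive` and the `url` and `email` query parameters below are assumptions about WebCite's request format, not something this description confirms, and `percentEncode` and `archiveRequest` are hypothetical helper names. The code only constructs the request URL and performs no network access.

```haskell
import Data.Char (isAlphaNum, isAscii, ord, toUpper)
import Numeric (showHex)

-- | Percent-encode everything except RFC 3986 unreserved characters.
-- Assumes ASCII input; non-ASCII text would need UTF-8 encoding first.
percentEncode :: String -> String
percentEncode = concatMap enc
  where
    enc c
      | isAscii c && (isAlphaNum c || c `elem` "-._~") = [c]
      | otherwise = '%' : pad (map toUpper (showHex (ord c) ""))
    -- Left-pad single hex digits so every escape is two digits long.
    pad [d] = ['0', d]
    pad ds = ds

-- | Build an archive-request URL for one external link.
-- ASSUMPTION: the endpoint and parameter names are illustrative only.
archiveRequest :: String -> String -> String
archiveRequest email url =
  "http://www.webcitation.org/archive?url=" ++ percentEncode url
    ++ "&email=" ++ percentEncode email

main :: IO ()
main = putStrLn (archiveRequest "gwern0@gmail.com" "http://cnn.com/")
```

Escaping the target URL matters here because it is embedded as a query parameter; an unescaped `&` or `?` inside it would otherwise be read as part of the outer request.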
Example: to archive links from every article in the English Wikipedia's RecentChanges:
wp-archivebot gwern0@gmail.com 'http://en.wikipedia.org/w/index.php?title=Special:RecentChanges&feed=rss'
Properties
| Version | 0.1 |
|---|---|
| Dependencies | base (3.*), feed, HTTP, network, parallel, tagsoup |
| License | BSD3 |
| Author | Gwern |
| Maintainer | gwern0@gmail.com |
| Stability | Experimental |
| Category | Network |
| Executables | wp-archivebot |
| Upload date | Thu Jun 4 16:31:50 UTC 2009 |
| Uploaded by | GwernBranwen |
| Built on | ghc-6.10 |
Downloads
- wp-archivebot-0.1.tar.gz (Cabal source package)
- package description (included in the package)