Web Scraping and Archiving

The web forgets quickly. How can we help it remember?

Too Much JavaScript

Saving a web page, or even a whole website, for later used to be as simple as pointing wget at the right URL. Sadly, wget doesn’t work for most modern sites because it doesn’t execute the JavaScript necessary to render them. If a site uses isomorphic JavaScript (or some other form of server-side pre-rendering) it can still work, but that approach isn’t terribly popular on today’s web.
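For sites that do need JavaScript, one workaround is to let a headless browser render the page first and then save the resulting HTML. Here is a minimal sketch using Playwright’s Python API (assuming the playwright package and its bundled browsers are installed; the URL is just a placeholder):

```python
from playwright.sync_api import sync_playwright

# Render the page in a headless browser, then save the post-JavaScript HTML.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Wait until network activity settles so client-side rendering can finish.
    page.goto("https://example.com", wait_until="networkidle")
    with open("snapshot.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    browser.close()
```

Note that this only captures the rendered HTML; tools like Monolith (below) go further and inline images, CSS, and scripts into a single self-contained file.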

Good Starting Tools

  • WebScrapBook (source code): A browser extension that captures web pages to your local device or a backend server for future retrieval, organization, annotation, and editing. This project inherits from the legacy Firefox add-on ScrapBook X.
  • Xidel (source code): Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents. (A rough Python equivalent of the fetch-and-extract step is sketched after this list.)
  • Hurl (source code): Command-line tool that runs and tests HTTP requests defined in plain text
  • Monolith (source code): CLI tool for saving complete web pages as a single HTML file
  • xh: Friendly and fast tool for sending HTTP requests. Basically a Rust rewrite of HTTPie (source code).
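To give a rough idea of the fetch-and-extract step that a tool like Xidel automates, here is a minimal Python sketch (assuming the third-party requests and lxml packages; the URL is a placeholder):

```python
import requests
from lxml import html

# Download the page and parse it into an element tree.
resp = requests.get("https://example.com", timeout=30)
resp.raise_for_status()
tree = html.fromstring(resp.content)

# Extract data with XPath: every link's text and target URL.
for link in tree.xpath("//a[@href]"):
    print(link.text_content().strip(), "->", link.get("href"))
```

The dedicated tools are still worth reaching for: they bundle the richer query languages (XPath 3.0, XQuery 3.0, JSONiq) and output handling that this sketch leaves out.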

Awesome Repos

The following awesome repos have the best listings of scraping/archiving tools I’ve come across:

Anti-Scraping

On the other side are the site owners trying to prevent scraping. It’s useful to understand the techniques you might be running up against:
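For example, one of the most common such techniques is per-client rate limiting: excess requests get rejected, typically with HTTP 429. A minimal token-bucket sketch of that check (the rate and burst values are purely illustrative):

```python
import time
from collections import defaultdict

RATE = 1.0   # tokens refilled per second (illustrative value)
BURST = 10   # maximum bucket size (illustrative value)

# One bucket per client IP: current token count and last refill time.
_buckets = defaultdict(lambda: {"tokens": float(BURST), "last": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Return True if this client may make another request right now."""
    bucket = _buckets[client_ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False  # the server would typically respond with 429 Too Many Requests
```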
