Web Scraping and Archiving
The web forgets quickly. How can we help it remember?
Too Much JavaScript
Saving a webpage, or even a whole website, for later used to be as simple as pointing wget at the right URL. Sadly, wget doesn't work for most modern sites because it doesn't execute the JavaScript needed to render them. If a site uses isomorphic JavaScript (or some other form of server-side pre-rendering), wget can still capture it, but that approach isn't terribly popular on today's web.
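For static sites, the classic wget invocation still holds up; a minimal sketch, with example.com standing in for the target site:

```sh
# Mirror a site for offline viewing: follow links, fetch page assets
# (CSS, images), rewrite links to point at the local copies, and stay
# below the starting directory. Anything rendered client-side by
# JavaScript will still be missing, since wget never executes scripts.
wget --mirror --page-requisites --convert-links --adjust-extension --no-parent https://example.com/
```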
Good Starting Tools
- WebScrapBook (source code): A browser extension that captures web pages to a local device or a backend server for later retrieval, organization, annotation, and editing. It descends from the legacy Firefox add-on ScrapBook X.
- Xidel (source code): Command-line tool to download and extract data from HTML/XML pages or JSON APIs using CSS selectors, XPath 3.0, XQuery 3.0, JSONiq, or pattern matching. It can also create new or transformed XML/HTML/JSON documents (see the examples after this list).
- Hurl (source code): Command-line tool that runs and tests HTTP requests defined in a plain-text format.
- Monolith (source code): CLI tool for saving complete web pages as a single HTML file
- xh: Friendly and fast tool for sending HTTP requests; basically a Rust rewrite of HTTPie (source code).
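To give a feel for the command-line tools above, here is a rough sketch of typical invocations; the URLs and file names are placeholders, and flags may vary between versions:

```sh
# Monolith: save a complete page (CSS, images, etc. inlined) as one HTML file.
monolith https://example.com/article -o article.html

# Xidel: pull data out of a page with an XPath expression,
# here every link target on the page.
xidel https://example.com -e "//a/@href"

# xh: send an HTTP request with HTTPie-style syntax and print the response.
xh get https://example.com/api/status

# Hurl: requests live in a plain-text file; hurl executes them
# and prints the response body.
echo 'GET https://example.com' > check.hurl
hurl check.hurl
```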
Awesome Repos
The following awesome repos have the best listings of scraping/archiving tools I’ve come across:
- awesome-web-archiving: An Awesome List for getting started with web archiving
- awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
- awesome-datahoarding: List of data-hoarding related tools
Anti-Scraping
On the other side are the site owners trying to prevent scraping. It's useful to understand the techniques you might be running up against:
- How-To-Prevent-Scraping: The ultimate guide on preventing Website Scraping