Web Scraping and Archiving
The web forgets quickly. How can we help it remember?
Too Much JavaScript
Saving a webpage, or even a whole website, for later used to be as simple as pointing wget at the right URL. Sadly, wget doesn't work for most modern sites because it doesn't execute the JavaScript needed to render them. If a site uses isomorphic JavaScript (or some other form of server-side pre-rendering), wget can still capture it, but that approach isn't terribly popular on today's web.
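For static sites, the classic wget invocation still holds up; a minimal sketch, with example.com standing in for the target site:

```sh
# Mirror a site for offline viewing: follow links, fetch page assets
# (CSS, images), rewrite links to point at the local copies, and stay
# below the starting directory. Anything rendered client-side by
# JavaScript will still be missing, since wget never executes scripts.
wget --mirror --page-requisites --convert-links --adjust-extension --no-parent https://example.com/
```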
Good Starting Tools
- WebScrapBook (source code): A browser extension that captures web pages to a local device or a backend server for later retrieval, organization, annotation, and editing. It descends from the legacy Firefox add-on ScrapBook X.
- Xidel (source code): Command-line tool to download and extract data from HTML/XML pages or JSON APIs using CSS selectors, XPath 3.0, XQuery 3.0, JSONiq, or pattern matching. It can also create new or transformed XML/HTML/JSON documents (see the examples after this list).
- Hurl (source code): Command-line tool that runs and tests HTTP requests defined in a plain-text format.
- Monolith (source code): CLI tool for saving complete web pages as a single HTML file
- xh: Friendly and fast tool for sending HTTP requests; basically a Rust rewrite of HTTPie (source code).
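To give a feel for the command-line tools above, here is a rough sketch of typical invocations; the URLs and file names are placeholders, and flags may vary between versions:

```sh
# Monolith: save a complete page (CSS, images, etc. inlined) as one HTML file.
monolith https://example.com/article -o article.html

# Xidel: pull data out of a page with an XPath expression,
# here every link target on the page.
xidel https://example.com -e "//a/@href"

# xh: send an HTTP request with HTTPie-style syntax and print the response.
xh get https://example.com/api/status

# Hurl: requests live in a plain-text file; hurl executes them
# and prints the response body.
echo 'GET https://example.com' > check.hurl
hurl check.hurl
```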
Awesome Repos
The following awesome repos have the best listings of scraping/archiving tools I’ve come across:
- awesome-web-archiving: An Awesome List for getting started with web archiving
- awesome-web-scraping: List of libraries, tools and APIs for web scraping and data processing.
- awesome-datahoarding: List of data-hoarding related tools
Anti-Scraping
On the other side are the site owners trying to prevent scraping. It's useful to understand the techniques you might be running up against:
- How-To-Prevent-Scraping: The ultimate guide on preventing Website Scraping