A major pain point in doing data science on web-scraped data is parsing HTML into structured, clean data.
The process is usually:
1. Scrape a web URL.
2. Parse it into “clean” data.
3. Put it in a database.
4. Collect the data for all URLs from the database for analysis.
This might work well for a mature project, but when you want to do a quick POC, the startup costs are too high (setting up databases, pipelines, etc.).
I want to be able to go from step 1 to step 4 with little effort.
Cached Requests
In my experience, step 2 is where most things go wrong. Hence, you want to cache the output of step 1, so you can iterate on and test your parsing pipeline effectively.
The Scraper.cached_requests “library” mimics requests with the same API, i.e. requests.get and requests.post, while caching the response in a database and on the file system. The user makes a request by URL as usual, and the library decides whether to read from disk or from the web. Hence, the UX of step 1 does not change, and steps 3-4 become redundant.
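As a sketch of what that drop-in usage might look like (the import path follows the module name above; the rest simply mirrors the plain requests API):

```python
# Sketch only: assumes Scraper.cached_requests is importable like this and
# mirrors the requests API as described above.
from Scraper import cached_requests

url = "https://example.com/listings?page=1"

# First call goes to the web; metadata is written to the database and the
# payload to disk.
page = cached_requests.get(url)

# A second call (within the configured max age) is served from the cache, so
# you can re-run your parser without hitting the webserver again.
page = cached_requests.get(url)
```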
For example, for my properties project I can parse 40 pages/sec on two threads on a two-year-old laptop, so I just generate my dataset on the fly whenever I need it.
This can be approximately 30x quicker than reading from the web every time, and you do not put extra load on the webserver / proxy service while fixing your parsing 😇.
A byproduct of the caching is that you can resume gracefully from faults or terminations, without losing any ‘scraped’ data.
Code is on my GitHub
How it works
The library uses a SQLAlchemy-backed datastore to cache request metadata.
The URL, headers, status code of the endpoint, and POST data are stored in the database against a primary key; the payload is stored on disk with the primary id as the filename.
If an incoming request matches an entry in the database and that entry is within the specified max age, the payload is served from the cache.
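As an illustration of that flow (this is a sketch, not the library’s actual code; the db helpers and row attributes are hypothetical):

```python
# Illustrative sketch of the cache lookup described above, not the actual
# implementation. Request metadata sits in a SQLAlchemy-backed table; the
# payload lives on disk under the row's primary key.
import os
import time

def cached_get(url, session, db, cache_dir, max_age):
    row = db.find_request(url)  # hypothetical helper: matching row or None
    # created_at is assumed to be a unix timestamp here.
    if row is not None and time.time() - row.created_at < max_age:
        # Cache hit: serve the payload from disk, keyed by the primary id.
        with open(os.path.join(cache_dir, str(row.id)), "rb") as f:
            return f.read()

    # Cache miss or stale entry: fetch from the web and record the result.
    response = session.get(url)
    row_id = db.save_request(  # hypothetical helper: inserts a row, returns its id
        url=url, headers=dict(response.headers), status_code=response.status_code
    )
    with open(os.path.join(cache_dir, str(row_id)), "wb") as f:
        f.write(response.content)
    return response.content
```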
Note: all post() requests are made in a common session (unlike requests).
Usage
Set up a SQL database (see database) and then:
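A minimal sketch of what that might look like (the configuration call and its argument names are illustrative; the get/post calls mirror the requests API as described above):

```python
# Usage sketch; the setup call and its argument names are assumptions, the
# requests-style get/post follow the description above.
from Scraper import cached_requests

# Point the library at a SQL database (any SQLAlchemy connection string) and a
# directory for cached payloads -- hypothetical configuration.
cached_requests.setup(db_url="sqlite:///cache.db", cache_dir="payloads/")

# Use it exactly as you would use requests.
listing = cached_requests.get("https://example.com/properties?page=1")
result = cached_requests.post("https://example.com/search", data={"q": "2 bed flat"})
```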
Multiprocessing
You can use Pool to crawl multiple URLs; see the sketch below.
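Something along these lines (a sketch; the URL list and worker count are just examples):

```python
# Multiprocessing sketch: fan the cached requests out over a Pool of workers.
from multiprocessing import Pool

from Scraper import cached_requests

# Example URL list; replace with whatever you are crawling.
urls = [f"https://example.com/properties?page={i}" for i in range(1, 101)]

def fetch(url):
    # Each worker either reads from the shared cache or goes to the web.
    return cached_requests.get(url)

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        pages = pool.map(fetch, urls)
```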
Future work
I can imagine implementing this as an API service such that all requests made by your “team” are passed through the API.
Then, if someone wants to improve your parsing code, they no longer have to find the location of the file in some S3 bucket (or set up proxies for scraping from the web); they just point the original URL at the API.
If multiple people are working on the same URL (e.g. to extract different kinds of data, or for debugging), they don’t have to scrape their ‘own’ copy of the page.
Code is on my GitHub