Efficient web page parsing

Parsing is a trial-and-error process, yet we do not want to grab the HTML from the web every time the parser changes. I wrote a small library that intelligently caches web pages so that I can work on parsing without worrying about making too many requests over the web.

A major pain point in doing data science on web-scraped data is parsing the HTML into structured, clean data.

The process is usually:

  1. Scrape a web URL.
  2. Parse it into “clean” data.
  3. Put it in a database.
  4. Collect the data for all URLs from the database for analysis.

This might work well for a mature project, but when you want to do a quick POC, the startup costs are too high (setting up databases, pipelines, etc.).

I want to be able to go from step 1 to step 4 with little effort.

Cached Requests

In my experience, step 2 is where most things go wrong. Hence, you want to cache step 1, allowing you to iterate on and test your parsing pipeline effectively.

The Scraper.cached_requests “library” mimics requests with the same API, i.e. requests.get and requests.post, while caching the response in a database and on the file system. The user makes a request by URL as usual, but the library intelligently decides whether to read from disk or from the web. Hence, the UX of step 1 does not change, and steps 3 and 4 become redundant.

E.g. for my properties project, I can parse 40 pages/sec on two threads on a two-year-old laptop, so I just generate my dataset on the fly whenever I need it.

This can be approximately 30x quicker than fetching from the web every time, and you do not put extra load on the web server / proxy service while fixing your parsing 😇.

A byproduct of the caching is that you can resume gracefully from faults or terminations, without losing any ‘scraped’ data.

Code is on my GitHub

How it works

It uses a SQLAlchemy-backed datastore to cache request metadata.

The URL, headers, status code of the endpoint, and POST data are stored in the database against a primary key. The payload is stored on disk with the primary key as the filename.
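
To make the layout concrete, here is a minimal sketch of what such a metadata table could look like; the table and column names are my own illustration, not necessarily the library's actual schema.

# Illustrative metadata table; names are assumptions, not the library's real schema.
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CachedRequest(Base):
    __tablename__ = "cached_requests"

    id = Column(Integer, primary_key=True)  # also used as the payload filename on disk
    url = Column(String, index=True)
    headers = Column(String)                # serialised request headers
    post_data = Column(String)              # serialised POST body, empty for GET requests
    status_code = Column(Integer)
    created_at = Column(DateTime)           # timestamp used for the max-age check

engine = create_engine("sqlite:///cache.db")  # any SQLAlchemy connection string works
Base.metadata.create_all(engine)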

If a request is made that matches an entry in the database, and the entry is within the specified maximum age, then the payload is served from the cache.
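
Conceptually, the read path looks roughly like the toy stand-in below; it keeps the metadata in a plain dict and fetches with urllib, whereas the real library uses the SQL database and requests-style sessions described above.

# Toy stand-in for the read path: check metadata, check age, then serve from disk or the web.
import os
import time
from urllib.request import urlopen

META = {}      # url -> (primary_key, fetch_time); the real library keeps this in SQL
STORE = "tmp"  # payloads live here, named after the primary key

def cached_get(url, max_age_days=30):
    hit = META.get(url)
    if hit and time.time() - hit[1] <= max_age_days * 86400:
        with open(os.path.join(STORE, str(hit[0])), "rb") as f:  # fresh enough: read from disk
            return f.read()
    payload = urlopen(url).read()  # miss or stale: fetch from the web
    key = len(META) + 1
    os.makedirs(STORE, exist_ok=True)
    with open(os.path.join(STORE, str(key)), "wb") as f:  # persist payload under the new key
        f.write(payload)
    META[url] = (key, time.time())
    return payload

Note that with max_age_days=0 any cached copy counts as stale, which is why it forces a fresh fetch in the usage example below.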

Note: all post() requests are made in a common session (unlike requests).

Usage

Set up a SQL database (see database) and then:

from cached_requests import CReq, default_engine

DATABASE = 'postgresql://user@localhost/db'  # sqlalchemy connection string
STORE = "tmp"  # file path to store payload
PROXIES = []  # list of proxies, chosen randomly at each request

requests = CReq(engine=default_engine(DATABASE),
                proxies=PROXIES,
                cache_loc=STORE)

requests.get("http://www.bbc.co.uk")  # will grab from www
requests.get("http://www.bbc.co.uk")  # will grab from disk

requests.get("http://www.bbc.co.uk", max_age_days=0)  # force to get from www

Multiprocessing

You can use a Pool to crawl multiple URLs, as shown below:

from crawler_config import requests  # CReq object
from crawler import crawl_urls

def func(url):
    page = requests.get(url)
    # do something with the page, e.g. extraction
    res = page.xpath('//text()')
    return res

crawl_urls(func, URLS_CRAWL, threads=2)
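
If you prefer not to depend on the crawler helper, a crawl_urls-style function can be approximated with the standard library; the sketch below is my own assumption that the helper simply maps the function over the URLs with a fixed number of workers.

# Minimal stand-in for a crawl_urls-style helper using a standard-library thread pool.
from multiprocessing.pool import ThreadPool

def crawl_urls(func, urls, threads=2):
    with ThreadPool(threads) as pool:
        return pool.map(func, urls)  # run func(url) for each URL, `threads` at a time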

Future work

I can imagine implementing this as an API service so that all requests made by your “team” pass through the API.

Then, if someone wants to improve your parsing code, they no longer have to find the location of the file in some S3 bucket (or set up proxies for scraping from the web); they just point the original URL at the API.

If multiple people are working on the same URL (e.g. to extract different kinds of data or to debug), they don’t have to scrape their ‘own’ copy of the page.
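
As a rough illustration of the idea only, a thin HTTP wrapper around a shared CReq instance might look something like the sketch below; the endpoint, its parameter, and the assumption that the returned object exposes content and status_code like a requests response are all mine, not part of the library.

# Hypothetical caching-proxy endpoint around a shared CReq instance (not an existing service).
from flask import Flask, request as incoming

from crawler_config import requests  # shared CReq object, as in the usage example above

app = Flask(__name__)

@app.route("/fetch")
def fetch():
    url = incoming.args["url"]  # e.g. GET /fetch?url=http://www.bbc.co.uk
    page = requests.get(url)    # served from the shared cache if a fresh copy exists
    return page.content, page.status_code

if __name__ == "__main__":
    app.run()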

Code is on my GitHub

updated 2023-04-09