Polstat: web scraping with Scrapy

The World Wide Web is a huge repository of information, but getting exactly the information you need, when you need it and in the form you need it, is not an easy task. Although information sharing through feeds, online APIs and even semantic annotations is becoming increasingly popular, sharing features tend to be restrictive and are rarely a priority for website owners. In my opinion, economic restrictions or webmaster laziness should not be a blockade for those who want to access and use freely available information in new ways. Websites that do not implement any particular data sharing capabilities (other than their HTML structure, and possibly a news feed) by far dominate the web.

Extracting data directly from the source code of an external webpage tends to be error-prone. As I see it, there are two main reasons for this. The first is related to the code that actually parses the target HTML: it may face sudden changes in layout and content that are totally out of the developer's control. The second is the challenge of scaling. Parsing an array of URLs, possibly on different hosts, in a single process execution is a risky approach without proper quality mechanisms. Moreover, a need soon arises for more sophisticated crawling functionality, such as following links or customized behavior based on given patterns.
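
To make the two problems concrete, here is a minimal sketch using only the Python standard library (not Scrapy). It assumes, purely for illustration, that the price sits in an element with class="price"; that class name, the LayoutChangedError exception and the helper names are my own inventions, not taken from any real site. The point is that a layout change should fail loudly for one page without taking down the whole run.

```python
import re
from html.parser import HTMLParser


class LayoutChangedError(Exception):
    """Raised when a page no longer looks the way the parser expects."""


class PriceParser(HTMLParser):
    """Grabs the first non-empty text inside an element with class="price".

    The class name "price" is a hypothetical example, not a real site's markup.
    """

    def __init__(self):
        super().__init__()
        self._inside = False
        self.price_text = None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.price_text is None and "price" in classes:
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.price_text = data.strip()
            self._inside = False

    def handle_endtag(self, tag):
        # Any closing tag ends the search started by the opening tag.
        self._inside = False


def extract_price(html):
    """Return the price as a float, or raise LayoutChangedError."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    if parser.price_text is None:
        raise LayoutChangedError("no element with class 'price' found")
    # Drop spaces (including non-breaking ones) and accept a decimal comma.
    text = parser.price_text.replace("\xa0", "").replace(" ", "")
    match = re.search(r"\d+(?:,\d+)?", text)
    if match is None:
        raise LayoutChangedError(f"unexpected price format: {parser.price_text!r}")
    return float(match.group().replace(",", "."))


def scrape_all(pages):
    """Parse many pages; one broken page must not stop the others.

    `pages` maps a URL to its already downloaded HTML.
    """
    prices, failures = {}, {}
    for url, html in pages.items():
        try:
            prices[url] = extract_price(html)
        except LayoutChangedError as exc:
            failures[url] = str(exc)
    return prices, failures
```

Calling scrape_all with a dictionary of downloaded pages returns the prices that could be parsed together with a separate record of the pages that failed, so a redesign shows up as a reported failure rather than as silently wrong data. Downloading, retrying and link following are left out on purpose; that is the part a crawling framework takes care of.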

Most of the above challenges are addressed by modern web crawling frameworks. One of the more interesting solutions in terms of simplicity, flexibility and community is the Python-based Scrapy web crawling framework.

To try it out, I wrote a bot that collects wine prices from vinmonopolet.no* every month. This turned out to work quite well, and the result is available at polstat.inevitable.no. Visualizing the historical price development is interesting for two main reasons: (1) advertisements for alcoholic beverages are illegal in this country, and (2) as Vinmonopolet buys large quantities of products from its vendors, supply and demand in the market could cause prices to fluctuate. The following links have more information about this:

http://www.dagbladet.no/2010/01/06/tema/drikke/mat/klikk/9797457/
http://www.klikk.no/mat/drikke/article476105.ece

I hope to return with more posts related to the implementation as well as legal aspects of polstat.

*) Vinmonopolet has the exclusive right to sell wine, spirits and strong beer in Norway.

About magnuspo

Magnus Stoveland is a student at the Norwegian University of Science and Technology.

One Response to Polstat: web scraping with Scrapy

  1. Eirik Askheim says:

    This is very interesting. Would be great if vinmonopolet would let you do this and use the information for other uses.

    Found the page while looking for where to get data for a wine application for Android. I don’t think I will start to scrape vinmonopolet.no for data though…
