Extracting data directly from the source code of an external webpage tends to be error-prone. As I see it, there are two main reasons for this. The first is the code that actually parses the target HTML: it may face sudden changes in layout and content that are entirely outside the developer's control. The second is the challenge of scaling. Parsing a list of URLs, possibly on different hosts, in a single process run is risky without proper error handling and quality checks. Moreover, a need soon arises for more sophisticated crawling functionality, such as following links or customizing behavior based on given patterns.
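To make the two risks a bit more concrete, here is a minimal sketch using only the Python standard library. It is not the code behind polstat; the `price` class name, the "Kr 129,90" price format and the function names are all assumptions made for illustration. The idea is simply to fail loudly when the expected markup is missing, and to handle each page on its own so that one broken page does not abort the whole run.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of the first element with class="price" (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price_text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if self.price_text is None and ("class", "price") in attrs:
            self._in_price = True

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.price_text = data.strip()
            self._in_price = False


def extract_price(html):
    """Return the price as a float, or raise ValueError if the layout is unexpected."""
    parser = PriceParser()
    parser.feed(html)
    if parser.price_text is None:
        raise ValueError("no element with class 'price'; layout may have changed")
    # Assumes a format such as "Kr 129,90" (an assumption, not the real site format).
    cleaned = parser.price_text.replace("Kr", "").replace("\xa0", "")
    cleaned = cleaned.replace(" ", "").replace(",", ".")
    return float(cleaned)


def extract_all(pages):
    """Parse each page independently and collect failures instead of stopping."""
    prices, failures = {}, {}
    for url, html in pages.items():
        try:
            prices[url] = extract_price(html)
        except ValueError as exc:
            failures[url] = str(exc)
    return prices, failures
```

Calling `extract_all` with a dictionary that maps URLs to already downloaded HTML returns the parsed prices together with a separate record of the pages that could not be parsed, which makes a layout change visible instead of silently producing bad data.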
Most of the above challenges are addressed by modern web crawling frameworks. One of the more interesting options in terms of simplicity, flexibility and community is Scrapy, a Python-based web crawling framework.
To try it out, I wrote a bot that collects wine prices from vinmonopolet.no* every month. This turned out to work quite well, and the result is available at polstat.inevitable.no. Visualizing historical price development is interesting for two main reasons: (1) advertisements for alcoholic beverages are illegal in this country, and (2) as Vinmonopolet buys large quantities of products from its vendors, supply and demand in the market could cause prices to fluctuate. The following links have more information about this:
http://www.dagbladet.no/2010/01/06/tema/drikke/mat/klikk/9797457/
http://www.klikk.no/mat/drikke/article476105.ece
I hope to return with more posts related to the implementation as well as legal aspects of polstat.
*) Vinmonopolet has the exclusive rights to sell wine, spirits and strong beer in Norway.
This is very interesting. Would be great if Vinmonopolet would let you do this and use the information for other purposes.
Found the page while looking for where to get data for a wine application for Android. I don’t think I will start to scrape vinmonopolet.no for data though…