Polstat: web scraping with Scrapy

The www is a huge repository of information, but getting the exact information you need, when you need it and in the form you need it, is not easy. Although information sharing through feeds, online APIs and even semantic annotations is becoming increasingly popular, sharing features tend to be restrictive and not a priority for website owners. In my opinion, economic restrictions or webmaster laziness should not be a blockade for those who want to access and use freely available information in new ways. Websites that do not implement any particular data sharing capabilities (other than their HTML structure, and possibly a news feed) by far dominate the web.

Extracting data directly from the source code of an external webpage tends to be error-prone. As I see it, there are two main reasons for this. The first is related to the code that actually parses the target HTML, which may face sudden changes in layout and content that are totally out of the developer's control. The second is the challenge of scaling. Parsing an array of URLs, possibly on different hosts, in one process execution is a risky approach without proper quality mechanisms. Moreover, there is soon a need for more sophisticated crawling functionality, such as following links or customized behavior based on given patterns.
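To make the first problem concrete, here is a minimal sketch using only Python's standard library (not Scrapy). The markup, the "price" class name and the two sample pages are assumptions made up for illustration, not the actual structure of any real site. The point is that the parser should fail loudly when the layout changes, rather than silently store wrong data.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of every <span class="price"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_price(html):
    """Return the single price on the page as a float, or raise if the layout changed."""
    parser = PriceParser()
    parser.feed(html)
    if len(parser.prices) != 1:
        # Fail loudly instead of silently storing wrong data.
        raise ValueError(f"expected exactly one price, found {len(parser.prices)}")
    # Handle a decimal comma such as "129,90".
    return float(parser.prices[0].replace(",", "."))


# Two made-up pages: the layout the parser was written for, and a redesigned one.
old_layout = '<div><span class="price">129,90</span></div>'
new_layout = '<div><b class="amount">129,90</b></div>'

print(extract_price(old_layout))
try:
    extract_price(new_layout)
except ValueError as err:
    print("layout changed:", err)
```

A check like this is the kind of quality mechanism that becomes essential once the same code runs unattended against many URLs.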

Most of the above challenges are addressed in modern web crawling frameworks. One of the more interesting solutions in terms of simplicity, flexibility and community is the Python-based Scrapy web crawling framework.

To check it out, I wrote a bot for collecting wine prices from vinmonopolet.no* every month. This turned out to work quite well, and the result is available at polstat.inevitable.no. Visualizing historical price development is interesting for two main reasons: (1) advertisements for alcoholic beverages are illegal in this country, and (2) as Vinmonopolet buys large quantities of products from its vendors, supply and demand in the market could cause prices to fluctuate. The following links have more information about this:

http://www.dagbladet.no/2010/01/06/tema/drikke/mat/klikk/9797457/
http://www.klikk.no/mat/drikke/article476105.ece

I hope to return with more posts related to the implementation as well as legal aspects of polstat.

*) Vinmonopolet has the exclusive rights to sell wine, spirits and strong beer in Norway.

Posted in Uncategorized | 1 Comment

Vintersportmesse

  1. I show up at the entrance desk. One of the girls behind it asks me to pay a 20 kr entrance fee. Ok, I do so and receive a ticket back. She says I have to place my bag in the “wardrobe” (an area behind her desk).
  2. I pass another girl on my way around the desk. She mentions the same thing, that I have to place my bag behind the desk. “Ok, I already know.” “Great, take this and show it to me when you come back.” She gives me another ticket, identical to the one I already have.
  3. I come to another line of people, registering themselves at a computer stand. When it’s my turn I’m asked to register my email. I do so, and a sticker ticket saying “Visitor” pops out of a printer next to the computer.
  4. I pass the entrance guard without showing either of my two tickets, or the sticker.
    (….)
  5. I arrive back at the desk where I left my bag, having to point it out in a heap of other bags. It has no ID saying it’s mine. The girl smiles and gives it to me.
Posted in life | Leave a comment

A rule-based implementation of many-to-many relations

“In this post I will describe an alternative solution where the database is replaced by a rule engine and object serialization.”

While working on my master's thesis, I’ve come across several many-to-many dependencies between concepts in a domain model I’m currently studying, which focuses on calculating polluting emissions from ships. A common way of implementing this type of dependency between concepts is by using a relational database. This solution typically involves three tables: two of them describe the two concepts, and a mapping table takes care of the relation between them.

uml-manytomany
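As a rough sketch of the conventional three-table layout, here is a version using Python's built-in sqlite3 module. The table and column names are my own assumptions, and only the "Slow Speed"/"CO2" value of 3170 is the example figure quoted in this post; the other rows are placeholders, not real emission data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE engine_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE emission_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    -- The mapping table carries the many-to-many relation and its value.
    CREATE TABLE engine_emission (
        engine_type_id INTEGER REFERENCES engine_type(id),
        emission_type_id INTEGER REFERENCES emission_type(id),
        kg_per_metric_ton REAL,
        PRIMARY KEY (engine_type_id, emission_type_id)
    );
""")
conn.executemany("INSERT INTO engine_type (name) VALUES (?)",
                 [("Slow Speed",), ("Medium Speed",)])
conn.executemany("INSERT INTO emission_type (name) VALUES (?)",
                 [("CO2",), ("NOx",)])


def add_mapping(engine, emission, factor):
    """Link an engine type to an emission type by name."""
    conn.execute("""
        INSERT INTO engine_emission
        SELECT e.id, m.id, ?
        FROM engine_type e, emission_type m
        WHERE e.name = ? AND m.name = ?""", (factor, engine, emission))


def kg_per_metric_ton(engine, emission):
    """Look up the factor through the mapping table, or None if unmapped."""
    row = conn.execute("""
        SELECT x.kg_per_metric_ton
        FROM engine_emission x
        JOIN engine_type e ON e.id = x.engine_type_id
        JOIN emission_type m ON m.id = x.emission_type_id
        WHERE e.name = ? AND m.name = ?""", (engine, emission)).fetchone()
    return row[0] if row else None


add_mapping("Slow Speed", "CO2", 3170.0)    # example value from this post
add_mapping("Slow Speed", "NOx", 1.0)       # placeholder, not real data
add_mapping("Medium Speed", "CO2", 1.0)     # placeholder, not real data

print(kg_per_metric_ton("Slow Speed", "CO2"))
print(kg_per_metric_ton("Medium Speed", "NOx"))
```

Even in this tiny form, every lookup goes through two joins on surrogate keys, which is the indirection the rest of this post tries to avoid.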

In addition, you need a bunch of code to take care of the connection to the database, CRUD operations, and not least to ensure consistency between the object model and the database tables. Setting up all this is time-consuming and boring work, and the result has weaknesses (e.g. what if the relational characteristics of concepts in the model change?).

The following tables show some test data.

enginetypetable
emissiontypetable
mappingtable

It’s not a pleasant job to maintain table data like the above directly, without using SQL calls with a syntax you never remember properly. In this post I will describe an alternative solution where the database is replaced by a rule engine and object serialization.

By using a rule engine such as BRIX Rules, relations are made by freely combining object attributes in a separate graphical interface. A clear advantage of this approach is that the relations are made by using the object attributes directly, not by querying a separate mapping table.

The EngineType class may look something like this, a pretty basic XML-serializable class. The EmissionType class is implemented in the same way.

public class EngineType : IFindable
{
    [XmlAttribute("name")]
    public string Name { get; set; }

    [XmlAttribute("description")]
    public string Description { get; set; }

    public EngineType()
    {
    }
}

Instances of EngineType are serialized to XML, and retrieved by deserialization (reading the XML file). This is done in a generic XMLSerialization class.

   ...
   public static void Main()
   {
       EngineType engineType = XMLSerialization<EngineType>.Find("Slow Speed");
       EmissionType emissionType = XMLSerialization<EmissionType>.Find("CO2");
       Console.WriteLine("KgPerMetricTon=" + Emission.GetKgPerMetricTon(engineType, emissionType));
   }

A WorkingMemory class sets up the connection to the rule engine, much in the same manner as connecting to a database. The rulebase is maintained through the rule system interface, possibly by an external domain expert.

public class Emission
{
    ...

    public static double GetKgPerMetricTon(EngineType engineType, EmissionType emissionType)
    {
        // The third slot is a placeholder that the rule fills in.
        double kgPerMetricTon = 0.0;
        object[] inArgs = new object[] { engineType, emissionType, kgPerMetricTon };
        // Fire the named rule and read the result back from the same slot.
        object[] outArgs = (object[])WorkingMemory.Instance.FireRule(inArgs, "EngineTypeToEmissionType");
        kgPerMetricTon = (double)outArgs[2];
        return kgPerMetricTon;
    }
}

The relation between engine types and emission types is set up as a graphical matrix in BRIX Rules. Each axis cell holds a premise like If EngineType.Name=="Slow Speed" or If EmissionType.Name=="CO2", and each result cell performs a result action, such as assigning a value to a return variable: ?kgPerMetricTonFct:=3170

brixtable

The table above is invoked by the consequent (look at it as a function), which takes the EngineType and EmissionType objects as in-parameters and returns the value of ?kgPerMetricTonFct.

brixproperty

Setting a single attribute value based on attribute values from two classes is a simple case, yet I think it already reveals potential for incorporating externally maintained rules into the business logic of software applications working on domain knowledge like this. I’m working to get more advanced examples up and running, involving rule chaining (rules invoking rules) and more ambitious relations between domain model concepts.
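To illustrate the idea of a rule matrix without a rule engine at hand, here is a small Python stand-in. It is not the BRIX Rules API: the RuleMatrix class and its method names are assumptions of mine, and the only value taken from this post is 3170 for the "Slow Speed"/"CO2" cell. Premises on each axis are plain predicates over object attributes, and a result cell is selected when both of its premises hold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineType:
    name: str


@dataclass(frozen=True)
class EmissionType:
    name: str


class RuleMatrix:
    """A two-axis decision table: row and column premises select a result cell."""

    def __init__(self, default=0.0):
        self.default = default
        self.rows = []    # premises evaluated against the engine type
        self.cols = []    # premises evaluated against the emission type
        self.cells = {}   # (row index, column index) -> result value

    def add_row(self, premise):
        self.rows.append(premise)
        return len(self.rows) - 1

    def add_col(self, premise):
        self.cols.append(premise)
        return len(self.cols) - 1

    def fire(self, engine_type, emission_type):
        """Return the value of the first cell whose two premises both hold."""
        for r, row_premise in enumerate(self.rows):
            if not row_premise(engine_type):
                continue
            for c, col_premise in enumerate(self.cols):
                if col_premise(emission_type) and (r, c) in self.cells:
                    return self.cells[(r, c)]
        return self.default


matrix = RuleMatrix()
slow = matrix.add_row(lambda e: e.name == "Slow Speed")
co2 = matrix.add_col(lambda m: m.name == "CO2")
matrix.cells[(slow, co2)] = 3170.0   # example value from this post

print(matrix.fire(EngineType("Slow Speed"), EmissionType("CO2")))
print(matrix.fire(EngineType("Slow Speed"), EmissionType("NOx")))
```

The relation lives entirely in the premises and cells, so there are no surrogate keys or mapping rows to keep in sync with the object model. The default of 0.0 mirrors the initial value used in GetKgPerMetricTon.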

bl_rules

The figure above tries to depict the abstraction levels in the two solutions. Gray represents data storage, while yellow represents business logic. The size of the ellipses is not accidental: it should give an indication of the amount of code involved.

I’m actually a bit surprised that a rule-based approach such as this is not more common practice already. The answer might be that, as the good old relational database is still needed anyway, why bother to have more dependent components (the rule engine) and to learn a new and unfamiliar computational regime?

Posted in code, development, software | 2 Comments

MSc Thesis

This semester I’m writing my master's thesis in Marine Systems Design. After reading Atle Frenvik Sveen's call to master's students to blog about their theses, I found it high time to post a brief description. Alex has also followed up on this, keep up the good posting (:

The title of my thesis is Benchmarking a rule-based approach in marine domain modeling.

Put more descriptively, I will evaluate the incorporation of rule-based modeling and reasoning in domain-specific software engineering. To get something valuable out of my work I strive to be as concrete and straight-to-the-point as possible (if ever possible).

Ok. Development of modern (engineering) software typically requires extensive domain models capturing core concepts, relations and knowledge elements. These models tend to be large, complex and resource-intensive in both development and maintenance. My motivation is that knowledge representation has techniques that can alleviate this situation. One such technique is rule-based reasoning, which represents a pragmatic, yet promising, regime for an application knowledge platform.

This is nothing new, and the approach has been followed by Microsoft, ILOG (IBM) and DNV. In the open source community, Drools is an interesting project.

So. An important aspect of my thesis is to put a lightweight conventional business model up against a rule-based version of the same model. On this basis I will analyse different aspects of the two candidates. By actually implementing the two versions in code, I will be able to present real-world KPIs. Examples of interesting KPIs are complexity (number of entities, lines of code), maintainability (separation of domain knowledge) and ease of implementation. My hypothesis is that some models can benefit from using rules as part of the computational model, and that rules can thus substitute for a number of relations between entities in a traditional approach, as well as for chained if-statements buried in the business logic code.

A concise description of this subject is found on Martin Fowler's bliki.

(Parts of the text in this post are taken from my thesis definition, formulated by my supervisor Stein Ove Erikstad.)

Posted in development, software | 2 Comments

Intelligent breakdown structures

Combining Frame-based and Rule-based reasoning on breakdown structures.

Breakdown structures are well known in many fields of planning and analysis. Cost, work and component breakdown structures are examples of widely used breakdown structures. Their structure is a tree with sub-nodes deriving from a root node. Each sub-node is a specialization of its super-node(s).

Frames have the same data structure, with each node representing a stereotyped situation. The substance of a node is its slots, which have associated values and procedural attachments that can hold more complex functionality of the node.

The possibility of using frames to model breakdown structures with intelligent content seems quite obvious. To see an example of frame-based knowledge representation, take a look at a course project titled ANIMAL that I carried out recently.

I find intelligent representations of breakdown structures exciting, and I see possibilities to go further with this. Frequently, nodes on different branches of the “breakdown-tree” have dependencies (or any other kind of relations) to each other. To capture this in an intelligent system, I suggest the introduction of a separate rule-base, placed in the position to reason about the breakdown structure as a whole. This enables higher level rules controlling frames’ built-in functionality.
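A minimal sketch of how this could look in Python follows. Everything in it is a made-up illustration: the Frame class, the ship breakdown, the cost figures, the fuel slot and the fuel rule are all assumptions, not part of any existing system. Frames hold slots, inherit slots from their super-nodes, and may carry procedural attachments (callables); a separate rule base then reasons about the tree as a whole, across branches.

```python
class Frame:
    """A node in a breakdown structure with slots inherited from super-nodes."""

    def __init__(self, name, parent=None, **slots):
        self.name = name
        self.parent = parent
        self.children = []
        self.slots = dict(slots)
        if parent is not None:
            parent.children.append(self)

    def get(self, slot):
        """Look up a slot here, then in super-nodes; run procedural attachments."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                # A callable slot is a procedural attachment, evaluated for this node.
                return value(self) if callable(value) else value
            frame = frame.parent
        raise KeyError(slot)

    def walk(self):
        """Yield this frame and all frames below it."""
        yield self
        for child in self.children:
            yield from child.walk()


def total_cost(frame):
    """Procedural attachment: own cost plus the cost of all sub-nodes."""
    return frame.slots.get("own_cost", 0) + sum(c.get("cost") for c in frame.children)


def fuel_rule(root):
    """Cross-branch rule: every node that declares a fuel must declare the same one."""
    fuels = {f.name: f.slots["fuel"] for f in root.walk() if "fuel" in f.slots}
    if len(set(fuels.values())) > 1:
        return [f"fuel mismatch: {fuels}"]
    return []


RULE_BASE = [fuel_rule]


def check(root):
    """Run every rule in the rule base against the whole structure."""
    return [message for rule in RULE_BASE for message in rule(root)]


# Hypothetical breakdown: the engine and the tank sit on different branches.
ship = Frame("ship", cost=total_cost)
hull = Frame("hull", ship, own_cost=100)
tank = Frame("tank", hull, own_cost=20, fuel="LNG")
machinery = Frame("machinery", ship)
engine = Frame("engine", machinery, own_cost=50, fuel="HFO")

print(ship.get("cost"))
print(check(ship))
```

The cost roll-up is local knowledge that each frame inherits from the root, while the fuel rule expresses a dependency that no single frame could check on its own.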

An example architecture related to intelligent breakdown structures


Posted in development, software | 1 Comment

The fall of festival bracelets

After collecting for 4 years, it was time to free my arm from 13 bracelets.

festival_bracelets_on

festival_bracelets_off

Posted in life | 2 Comments

Old and simple desktop monitor


I always like to have the latest version of every piece of software, with one exception that proves the rule: CoolMon 1.0 (freeware) from 2003 is my all-time favorite desktop monitor for Windows. Without stealing space from the desktop, CoolMon displays key information about your computer's goings-on. The picture shows my current config. (Note that the text is displayed with a fully transparent background and appears to be part of the wallpaper.)

Posted in look and feel, software | 3 Comments

Small moments from my semester in Delft

My wonderful time in Delft is soon over. Here are a few pictures capturing small valuable moments.

Marcushof
Marcushof in Delft where I lived for 5 months.

Scooter speed control in Rotterdam
The police perform a speed control of scooters in the center of Rotterdam.

Armin Only April 19. 2008
Armin van Buuren live at Armin Only, Jaarbeurs Utrecht, April 19th.

Cheers in persian
“Cheers” written in Persian

Queensday 2008 in Amsterdam
Queensday 2008 in Amsterdam. Look at the shirt of the DJ (:

Trashcan on fire at Queensday #1
Trashcan on fire at Queensday. The police are helpless.

Trashcan on fire at Queensday #2
The result of the trashcan fire. Moral: don’t call the police in case of fire.

Darwin
The iguana Darwin with flower in his (hair?)

My accommodation
My accommodation (!) in Marcushof.

on the beach in Scheveningen
A wonderful day at the beach in Scheveningen.

Marine propulsion
Marine propulsion in Zaanse Schans.

Magic mushroom
Magic mushrooms are still legal in Holland.

Farewell dinner in Marcushof
Nicla’s farewell dinner in the kitchen, Marcushof 5th floor.

And don’t miss:
The watermovie
Juha’s time in Delft
Erasmus 3mE TUDelft 2008 (thanks to Torben!)

Posted in life | Tagged | 2 Comments

Javamail: secure SMTP

Using a secure SMTP server to send mail with the JavaMail library should not cause big problems. But as it took me some time to tie together threads from several Google searches, I would like to share a simple code example.

import java.util.Date;
import java.util.Properties;
import javax.mail.*;
import javax.mail.internet.*;

public class SecureSmtpExample {
    public static void main(String[] args) {
        // Connect over SSL on port 465 and require authentication.
        Properties props = new Properties();
        props.put("mail.smtp.host", "your.smtp-host.net");
        props.put("mail.smtp.auth", "true");
        props.put("mail.smtp.port", "465");
        props.put("mail.smtp.socketFactory.port", "465");
        props.put("mail.smtp.socketFactory.class",
            "javax.net.ssl.SSLSocketFactory");
        // Do not fall back to an unencrypted socket if SSL fails.
        props.put("mail.smtp.socketFactory.fallback", "false");
        // The authenticator supplies the SMTP credentials on demand.
        Session session = Session.getInstance(props,
            new javax.mail.Authenticator() {
                @Override
                protected PasswordAuthentication getPasswordAuthentication() {
                    return new PasswordAuthentication("username", "password");
                }
            }
        );
        try {
            MimeMessage msg = new MimeMessage(session);
            msg.setFrom(new InternetAddress("from@address.net"));
            msg.setRecipients(Message.RecipientType.TO,
                "receiver@address.com");
            msg.setSubject("Subjectline");
            msg.setSentDate(new Date());
            msg.setText("Mail content");
            Transport.send(msg);
        } catch (MessagingException e) {
            System.out.println("send failed, exception: " + e);
        }
    }
}

Posted in code | 2 Comments

Grab a wiki

For a long time I have searched for “the perfect” wiki software to use for projects. MediaWiki (Wikipedia) is in my opinion way too heavy, and simple stuff like phpwiki lacks functionality.

Today I stumbled upon the neat WikiMatrix, a place to easily compare a huge list of available wiki systems. Through the Wiki Choice Wizard, I finally found a wiki that perfectly suits all my (present) needs: PmWiki.

PmWiki characteristics:

  • highly customizable with add-on support (through the cookbook)
  • a default clean and simple GUI
  • history, printer friendly page generation, file upload etc.
  • a built in search engine
  • simple construction (PHP, data storage in flat files)
  • open source and free to use

My first mission in PmWiki will be a (private) wiki for collecting knowledge in various subjects (courses I attend at the university). I believe organizing facts, formulas, figures, graphs and examples in a wiki while working on assignments and preparing for exams is a unique way to gain knowledge that is easy to find again. Not to mention the options for sharing the information with everyone.

Posted in software, web | 1 Comment