Real estate project – Minutes from kickoff

On Monday the 21st of November, we kicked off our first project, which uses historical sales data from a real estate website (meetup).

Interesting discussions throughout the evening led us to define two parallel tracks. On the one hand, we will try to come up with a semi-structured model for real estate data. On the other hand, we will attempt to apply data analytics to real estate data using algorithms provided by the Apache Mahout community.

Besides being of interest to the community, we reasoned that a semi-structured data model would support integration of real estate data with orthogonal information derived from e.g. OpenStreetMap or OpenBelgium. This will enable us to enrich the existing data with things like

  • distance to n cities,
  • a ‘green index’ that reflects how far a real estate property is from the nearest forest, or
  • a ‘density index’ that reflects the number of houses currently for sale within a radius of 1, 2, 4, 8 or 16 km.
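
As a rough illustration of the first and last enrichments, here is a minimal Python sketch. It assumes each house and city is available as a (latitude, longitude) pair in degrees; the function names `haversine_km` and `density_index` are made up for this example and are not part of any data model we agreed on.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def density_index(house, listings, radii_km=(1, 2, 4, 8, 16)):
    """Count the listings that lie within each radius of `house`.

    `house` is a (lat, lon) pair; `listings` is an iterable of (lat, lon)
    pairs for the other houses currently for sale (hypothetical input).
    """
    distances = [haversine_km(house[0], house[1], lat, lon)
                 for lat, lon in listings]
    return {r: sum(d <= r for d in distances) for r in radii_km}
```

The distance to n cities is then just `haversine_km` applied once per city. This brute-force count is quadratic over all houses, so a larger dataset would probably need a spatial index or a distributed job instead.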

We thought of using HBase or Cassandra as the datastore for this task, but we will only decide on this during follow-up meetings. Judging by the interest poll from the first meetup, there will hopefully be quite a few community members willing to work on this track of the real estate project.

The second track of the real estate project, on the other hand, aims to attract broader interest by producing insights that are relevant to a wider audience than just our community. We touched on three data analytics topics that we could pursue, depending on their feasibility.

  1. Prediction of the price and time of the sale
    Based on archives from real estate companies, we will evaluate how well the price and time of the sale of a house can be predicted. For the skeptical readers, have a look at the Zestimate® price from the U.S. real estate agency.
  2. Text mining on free text descriptions
    It’s obvious that a plastic door correlates with a cheap house, while a granite kitchen correlates with an expensive one. But what other vocabulary-based associations can we derive by performing text mining analysis on the free text fields specified by the seller? (cf. Freakonomics, chapter 2)
  3. Recommendation engine for similar real estate properties
    Finally, by analyzing traffic logs from a real estate website, we should be able to build a recommendation engine that guides visitors to related houses for sale. Think of how, when you search for a Canon 550D, you will surely be shown the camera bag as well.
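
To make the third topic a bit more concrete, below is a minimal Python sketch of an item co-occurrence recommender. It assumes the traffic logs have already been reduced to one list of viewed listing ids per visitor session, which is a hypothetical input format. It is a plain-Python illustration of the general idea, not Mahout's API, and the names `build_cooccurrence` and `recommend` are invented for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two listings were viewed in the same session."""
    counts = defaultdict(Counter)
    for viewed in sessions:
        # Use a set so repeated views within one session count only once.
        for a, b in combinations(sorted(set(viewed)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(counts, listing_id, k=3):
    """Return up to k listings most often viewed together with listing_id."""
    return [other for other, _ in
            counts.get(listing_id, Counter()).most_common(k)]


# Made-up sessions, purely for illustration.
sessions = [
    ["h1", "h2", "h3"],
    ["h1", "h2"],
    ["h2", "h4"],
    ["h1", "h2", "h4"],
]
counts = build_cooccurrence(sessions)
top_for_h1 = recommend(counts, "h1", k=1)
```

Raw co-occurrence counts favor popular listings, so a real engine would likely normalize them with a similarity measure before ranking.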


Keep your eyes on the meetup site and, if interested, join the next events on the real estate project!


2 thoughts on “Real estate project – Minutes from kickoff”

  1. Great idea!

    I have actually been working on this idea for a while. Have a look at

    Based on historical real estate transactions, Immoreus gives an estimate of the value of a house, and also of its rental value in the case of an apartment. The most important driver of the value is of course the location. Therefore a lot of time and effort was spent on modeling this component accurately. You can find a high level visualization on the following page: The actual valuation algorithm produces more accurate results, but for performance reasons the map is generated with lower accuracy.

    Concepts that are used:
    • Support Vector Machines
    • Neural Networks
    • Some other homebrew statistical techniques
    • Plain vanilla MySQL

    I could probably look into MapReduce to speed up some calculations, but a large part can be done offline, so this is not yet a bottleneck.

    Currently, the best way to increase the accuracy is simply to gather more data.

  2. Hi Peter,
    Nice to meet you and to learn about your work! I had never heard about this project before, but it is indeed highly relevant to our project (and to many other people!). Given the name of our group, we mainly intend to approach this challenge from a big data perspective. We have therefore started to manage our data in HBase, which enables us to continuously gather more and more heterogeneous data for each house without database schema constraints.
    We will only tackle the machine learning and text mining problems in a later phase of the project. Again, we will approach these from a big data perspective and make use of the algorithms provided in the Apache Mahout framework.

    You are most welcome to join us and discuss how Immoreus relates to our approach at one of the upcoming meetups on this project! More information @
