Real estate project – Minutes from kickoff

On Monday the 21st of November, we kicked off our first bigdata.be project, using historical sales data from a real estate website (meetup).

Interesting discussions throughout the evening led us to define two parallel tracks. On the one hand, we will try to come up with a semi-structured model for real estate data. On the other hand, we will attempt to apply data analytics to real estate data using algorithms provided by the Apache Mahout community.

Besides being of interest to the bigdata.be community, we reasoned that a semi-structured data model would support integration of real estate data with orthogonal information derived from, e.g., OpenStreetMap or OpenBelgium. This will enable us to enrich the existing data with features such as

  • distance to n cities,
  • a ‘green index’ that reflects how far a real estate property is located from the nearest forest, or
  • a ‘density index’ that reflects the number of houses currently for sale within a radius of 1, 2, 4, 8 or 16 km.
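
As a rough illustration of the distance and density features above, here is a minimal Python sketch. It assumes each property is available as a (latitude, longitude) pair in degrees and uses the haversine formula for great-circle distance; the function names and the choice of formula are ours for illustration, not something decided at the meeting.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def density_index(prop, listings, radii_km=(1, 2, 4, 8, 16)):
    """Count the listings for sale within each radius around a property.

    prop     -- (lat, lon) of the property being enriched
    listings -- iterable of (lat, lon) pairs of other houses for sale
    """
    distances = [haversine_km(prop[0], prop[1], lat, lon)
                 for lat, lon in listings]
    return {r: sum(1 for d in distances if d <= r) for r in radii_km}
```

The same distance function could feed the ‘distance to n cities’ feature by calling it once per city. On a real dataset, a spatial index would likely be needed instead of this brute-force scan over all listings.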

We thought of using HBase or Cassandra as the datastore for this task, but we will only decide on this during follow-up meetings. Judging by the interests poll from the first bigdata.be meetup, there will hopefully be quite a few members of the bigdata.be community willing to work on this track of the real estate project.

The second track of the real estate project, on the other hand, aims to attract broader interest by producing insights that are relevant to a wider audience than just our bigdata.be community. To that end, we touched on three data analytics topics that we could work out further, depending on their feasibility.

  1. Prediction of the price and time of the sale
    Based on archives from real estate companies, we will evaluate how well the price and time of the sale of a house can be predicted. Skeptical readers can have a look at the Zestimate® price from the U.S. real estate website zillow.com.
  2. Text mining on free text descriptions
    It’s obvious that a plastic door correlates with a cheap house, while a granite kitchen correlates with an expensive one. But what other vocabulary-based associations can we derive by performing text mining analysis on the free text fields specified by the seller? (cf. Freakonomics chapter 2)
  3. Recommendation engine for similar real estate properties
    Finally, by analyzing traffic logs from a real estate website, we should be able to build a recommendation engine that guides visitors to related houses for sale. Think of how you search Amazon.co.uk for a Canon 550D and are sure to see a camera bag as well.
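
To make the third topic a bit more concrete, here is a minimal sketch of an item-to-item recommender based on co-views. It is plain Python rather than Apache Mahout, and it assumes the traffic logs have already been grouped into sessions, each a list of viewed listing ids; the function names and the ids in the usage example are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two listings are viewed within the same session."""
    co = defaultdict(Counter)
    for viewed in sessions:
        # Deduplicate so repeated views in one session count only once.
        for a, b in combinations(sorted(set(viewed)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, listing_id, k=3):
    """Return up to k listings most often co-viewed with listing_id."""
    neighbours = co.get(listing_id, Counter())
    return [other for other, _ in neighbours.most_common(k)]
```

For example, with `sessions = [["h1", "h2", "h3"], ["h1", "h2"], ["h2", "h4"]]`, calling `recommend(build_cooccurrence(sessions), "h1")` ranks "h2" ahead of "h3", since "h2" was viewed together with "h1" in two sessions. Raw co-occurrence counts favour popular listings, so a real engine would probably use a normalised similarity measure instead.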

 

Keep your eyes on the bigdata.be meetup site and, if interested, join the next events on the real estate project!