bigdata.be

22nd meetup – Cloudera on HBase and Sqoop

It has been quite a while since we actually posted something on our website. Wow!!! Time really flies.

On April 4th 2014, we had our 22nd meetup already. Klaas Bosteels was able to attract 2 prominent speakers from Cloudera who were touring Europe and presenting at the 2014 Hadoop Summit in Amsterdam.

Jon Hsieh (Software Engineer @ Cloudera and HBase Committer/PMC Member) talked about Apache HBase: Now and the future: Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk will be broken up into two parts. First I’ll talk about how in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, HortonWorks, Yahoo!, and XiaoMi. Second I’ll talk about the features in the newest release 0.96.x and in the upcoming 0.98.x release.
Kate Ting (Technical Account Manager @ Cloudera and Sqoop Committer/PMC Member, co-author of the Apache Sqoop Cookbook) presented Apache Sqoop: Unlock Hadoop: Unlocking data stored in an organization’s RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we’ll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.

And of course, Accenture was so kind to host us at their gorgeous venue in Brussels with that spectacular view! They presented their Big Data Challenge, in which 4 teams of about 5 consultants took a deep dive into big data and data science to solve some practical cases. You can get in touch with their consultants to learn more.

The worked example on predicting public transport delays in real time was particularly interesting. It made my hands itch to start a new BigData.be project!

See you next time!

16th meetup — Scaling Big Data Mining Infrastructure: The Twitter Experience

It’s a bit late notice unfortunately, but we’ll be doing another meetup on July 16th in Ghent, featuring a very promising talk by renowned data geek Jimmy Lin about Twitter’s big data mining infrastructure. Space is limited, so you should head to our corresponding meetup page straight away to reserve your spot.

Talk abstract

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. This talk will discuss the evolution of the Twitter infrastructure and the development of capabilities for data mining on “big data”. We’ll share experiences as a case study, but make recommendations for best practices and point out opportunities for future work.

About the speaker

Jimmy Lin is an associate professor in the iSchool at the University of Maryland, with appointments in the Institute for Advanced Computer Studies (UMIACS) and the Department of Computer Science. He works on “big data”, with a particular focus on large-scale distributed algorithms for text processing. His research lies at the intersection of natural language processing (NLP) and information retrieval (IR). Recently, Jimmy spent an extended sabbatical (from 2010 to 2012) at Twitter working on large-scale data analytics. Previously, he has also done work for Cloudera, the enterprise Hadoop company.

9th meetup – schedule complete

Good news everyone!

The schedule for our 9th meetup is complete. We will have three talks from different areas of the big-data universe:

Davy Suvee will give an introduction to Datomic and some graph-related work he is doing with it.
Kenny Helsens will present a wrap-up of the zimmo.be project.
Gabriel Reid will give an introduction to Apache Crunch, of which he is a committer.

We hope you like this schedule as much as we do, and we hope to see plenty of you!

-BigData.be

9th meetup – call for participation

Hello all,

The friendly folks of NGDATA in Gent will host our 9th meetup. Thanks for that already!

Besides a location, we are always looking for interesting things to discuss during the meetup. Have you read something interesting in the bigdata/nosql space lately? Are you implementing something amazing right now? Do you have a problem that you want to discuss? Let us know!

Looking forward to hearing from you all!

-BigData.be

Meetup 8: Differencing social profiles – a Storm demo case

As our 8th meetup turned into a hackathon on Twitter’s Storm, we devised a small presentation on what our use case was going to be: based on a person’s name, we fetch profile information from various social networks and put the profiles next to each other to highlight the differences.

Meetup 8: Call for participation

Hello all,

The next meetup is already approaching and we are still missing some interesting topics to discuss.

So if you have read something lately that is worth mentioning, or if you’re in the middle of a breakthrough on an interesting brain teaser, or if you are implementing a wonderful project or just doing anything else relevant to our domain, please take a moment to prep some slides and get a discussion going at our 8th meetup!

Looking forward to hearing from you all!

-BigData.be

The 7th meetup or Waiting for CSI Ixelles

Three weeks ago our little community on bigdata had its 7th meetup in Brussels. We think it is a good idea to hold our meetups in different cities, since we are the Belgian bigdata community. (If you can host a meetup in your city, please contact us!) On top of the typical evening traffic chaos and a meeting of all European prime ministers, there was a crime scene (some sort of knife fight) next to our meeting place, which caused some of our participants to arrive a bit later than planned.

Nevertheless, we had a good schedule, consisting of two talks with lots of good interaction between the speakers and the audience.

The first talk was about Storm, a distributed realtime processing framework coming out of Twitter. Daan Gerits gave an introduction to Storm and walked us through an example application he had created for this meetup.

The second talk (by me) was about Apache Giraph, a graph processing framework on top of Apache Hadoop.

If you have been to one of our meetings and you liked it, please spread the word, leave comments here, and consider the “call for papers” for our 8th meetup in July open!

Real estate project – Minutes from kickoff

On Monday the 21st of November, we kicked off our first bigdata.be project using historic sales data from a real estate website (meetup).

Interesting discussions throughout the evening led us to define two parallel tracks. On the one hand, we will try to come up with a semi-structured model for real estate data. On the other hand, we will attempt to apply data analytics on real estate data using algorithms provided by the Apache Mahout community.

Besides being of interest to the bigdata.be community, we reasoned that a semi-structured data model would support integration of real estate data with orthogonal information derived from e.g. OpenStreetMap or OpenBelgium. This will enable us to enrich the existing data with things like

distance to n cities,
a ‘green index’ that correlates how far a real estate property is located from a nearby forest, or
a ‘density index’ that ties in with the number of houses that are for sale at the moment in a radius of 1, 2, 4, 8 or 16 km.
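
As a rough illustration of the last index, the sketch below counts how many listings fall within each radius of a property, using the haversine great-circle distance. The function names, the coordinate tuples and the sample listings are assumptions made up for illustration; they do not come from the project's data or code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def density_index(prop, listings, radii_km=(1, 2, 4, 8, 16)):
    """Count the listings within each radius of `prop` (hypothetical helper)."""
    distances = [haversine_km(prop[0], prop[1], lat, lon) for lat, lon in listings]
    return {r: sum(1 for d in distances if d <= r) for r in radii_km}


# Made-up coordinates around Ghent and Brussels, for illustration only.
listings = [(51.0600, 3.7300), (51.1000, 3.8000), (50.8503, 4.3517)]
print(density_index((51.0543, 3.7174), listings))
```

Computing every pairwise distance like this is fine for a small sample; for a full archive of listings, a spatial index would be the more likely choice.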

We thought of using HBase or Cassandra as datastores to address this task, but we will only decide this during follow-up meetings. Remembering the interests poll from the first bigdata.be meetup, there will hopefully be quite a few members from the bigdata.be community to elaborate this track of the real estate project.

The second track of the real estate project, on the other hand, aims to attract social interest by producing insights that are relevant to a wider audience than just our bigdata.be community. As such, we touched on three topics for data analytics that we could elaborate on, depending on their feasibility.

Prediction of the price and time of the sale
Based on archives from real estate companies, we will evaluate how well the price and time of the sale of a house can be predicted. For the skeptical readers, have a look at the Zestimate® price from zillow.com, a real estate agency from the U.S.
Text mining on free text descriptions
It’s obvious that a plastic door correlates to a cheap house, while a granite kitchen correlates to expensive houses. But what other vocabulary-based associations can we derive by performing text mining analysis on the free text fields specified by the seller? (cfr. Freakonomics chapter 2)
Recommendation engine for similar real estate properties
Finally, by analyzing traffic logs from a real estate website, we should be able to build a recommendation engine that guides visitors to related houses for sale. Think of how you search Amazon.co.uk for a Canon 550D, and you will surely see the camera bag as well.
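
To make the recommendation idea a little more concrete, here is a minimal sketch of one possible approach: counting which properties are viewed together in the same visitor session. The session format, the property ids and the function names are hypothetical, and a real project might well rely on the recommenders in Apache Mahout instead.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(sessions):
    """Map each property id to a Counter of properties viewed in the same session."""
    co = defaultdict(Counter)
    for viewed in sessions:
        # set() so that repeated views within one session count only once
        for a, b in permutations(set(viewed), 2):
            co[a][b] += 1
    return co


def recommend(co, property_id, k=3):
    """Return up to k property ids most often co-viewed with property_id."""
    return [pid for pid, _ in co[property_id].most_common(k)]


# Hypothetical sessions: each list holds the property ids one visitor viewed.
sessions = [["h1", "h2", "h3"], ["h1", "h2"], ["h2", "h4"], ["h1", "h3"]]
co = build_cooccurrence(sessions)
print(recommend(co, "h1", k=2))
```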

Keep your eyes on the bigdata.be meetup site and, if interested, join the next events on the real estate project!

2nd Meetup – BigData project

On Wednesday August 24th 2011, we had our second meetup. As some people cancelled at the last moment, the crowd was not as large as for our first meetup: now, there were just 6 of us.

This meetup was the first time we organised our get-together as an open discussion. Davy Suvee came up with that idea and apparently everybody present enjoyed this format very much. This article is a synopsis of the topics discussed.

On the lookout for a Big Data project

One of the major demands from our community members is to work together on a specific Big Data project to gain hands-on experience. With this in mind, the Big Data Wars thread was started on our groups page. However, instead of organizing a true public challenge, it would be easier and more instructive to just participate in a challenge as a team or to look for a specific project that we can implement ourselves. The candidates are outlined below.

We decided to create a bitbucket account where we can define, plan, work and code on these projects.

Wikipedia’s Participation Challenge

A few days after Daan Gerits launched the BigData Wars idea on the group page, Nathan Bijnens referred to the Wikipedia’s Participation Challenge as a good fit for a BigData.be project. You can read the full description on the kaggle.com page, but the idea is to build a predictive model that allows the Wikimedia Foundation to understand what factors determine editing behaviour and to forecast long term trends in the number of edits to be expected on wikipedia.org. A random sample from the English Wikipedia dataset from the period January 2001 – August 2010 serves as training dataset for the predictive model.

What makes this challenge appealing is that it has a very specific problem description that covers quite a large fraction of the big data complexity:

data availability,
datastore modelling,
defining secondary data,
deriving sampled test data,
building predictive models using various approaches, technologies and algorithms.

However, as the deadline for the project is very close, i.e. Tuesday 20 September 2011, we probably won’t be truly participating, but we can still use the problem definition.

So, if you are interested in participating in this topic, contact us or join in the discussion or the repository:

Other proposals

Two more proposals were made as project topics:

BioMed: As a computational biology researcher at Ghent University, Kenny Helsens proposed to see if he could come up with a suitable project definition in the area of bioinformatics, maybe genome- or proteome-related.
Immo and GIS: Another interesting area for a well-suited project might be the combination of historical data made available by some immo website with GIS-related data, e.g. from OpenStreetMap. A number of interesting problems can be derived, requiring e.g. predictive models. We’ll be contacting some immo websites on this matter.

As these projects are still in their incubation stage, we haven’t yet created any specific web areas for them. However, these might follow very soon, so keep checking back!

Realtime MapReduce

Daan Gerits has been working on some big data and noSQL projects, and was wondering if anybody has experience with bringing Hadoop and/or MR to the realtime playing field, instead of keeping it strictly for batch processing. The basic difference being that you would be able to feed the data crunching algorithms incrementally or by streaming data into the system and have the algorithms merge these into the already (partially) existing result sets.
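
The incremental idea described above can be sketched in a few lines, assuming a simple counting job. The function names and the sample records are hypothetical, and this is plain Python rather than Hadoop or any of the streaming systems mentioned here.

```python
from collections import Counter


def map_reduce_count(records):
    """Batch-style job: count occurrences of each key in a collection of records."""
    return Counter(records)


def merge_increment(result, new_records):
    """Fold a small batch of newly arrived records into an existing result set."""
    result.update(map_reduce_count(new_records))
    return result


# Existing result from a batch run over historical data (made-up records).
result = map_reduce_count(["storm", "hadoop", "hadoop"])

# New records arrive as a stream of micro-batches and are merged in place.
for micro_batch in (["storm"], ["s4", "hadoop"]):
    merge_increment(result, micro_batch)

print(result["hadoop"], result["storm"], result["s4"])
```

This merge only works so easily because counting is associative and commutative: partial counts can be added in any order. Algorithms without that property would need more care before their results could be updated incrementally.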

During a presentation at the SAI on 7 April 2011, Steven Noels and Wim Van Leuven also pointed out that any big data processing system needs a combination of a batch layer with a speed layer to achieve at least eventual accuracy. The speed layer is architecturally the most challenging. However, if we could combine realtime with existing MR algorithms, …

Some suitable technologies and pointers were raised: Yahoo S4, IBM InfoSphere Streams, Datastax Brisk, and Twitter’s Storm. However, no practical experience exists within our community.

Big Data Technology poll

As our meetup group was rather small, we did not redo the technology poll. However, we were wondering if some suitable tool exists to automate the poll via the bigdata.be website or other electronic means, like Google Docs. So if you have any good ideas in this area, please let us know!

See y’all at the next meetup!

The day after …. and more

So it has been a while since we held our first meetup on July 5th, 2011. We had a lively discussion on ideas, wants and won’ts for our young but apparently vibrant community. After some discussion in our group, we decided to set up our meetups on a schedule rotating over Tuesday, Wednesday and Thursday, at intervals of 6-7 weeks.

So, we’ll be scheduling our 2nd meetup for Wednesday August 24th, 2011. Keep an eye on our meetup page.

All ideas for a topic that night are more than welcome!