Data science and R meetup

As announced on our meetup page, we had @JaredLander over from New York for a project at BigBoards. So we rose to the occasion and had him talk about Backends for Big Data in R. This comment from the meetup page says it all!

Nice fast-paced presentation with style from Jared
Marcel Dumont

To complement this “gentle introduction to R” ;-), our second presentation was given by DataCamp. They introduced us to what DataCamp stands for, how their platform is architected and how teachers write courses (completely in R!). Again, your feedback says more than a thousand words …

This presentation impressed me the most. How a couple of students from the KULeuven can start-up their own company and be successful in filling the gap of training services in R
Jean-Jacques DE CLERCQ

(Hey guys @datacamp, if you are reading this from over there in LA at the useR! conference, can we share your slides here?)

What did you learn? What do you think about this meetup?

22nd meetup – Cloudera on HBase and Sqoop

It has been quite a while since we actually posted something on our website. Wow!!! Time really flies.

On April 4th 2014, we had our 22nd meetup already. Klaas Bosteels was able to attract 2 prominent speakers from Cloudera who were touring Europe and presenting at the 2014 Hadoop Summit in Amsterdam.

  1. Jon Hsieh (Software Engineer @ Cloudera and HBase Committer/PMC Member) talked about “Apache HBase: Now and the future”. Abstract: Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk will be broken up into two parts. First I’ll talk about how in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, Hortonworks, Yahoo!, and XiaoMi. Second I’ll talk about the features in the newest release 0.96.x and in the upcoming 0.98.x release.
  2. Kate Ting (Technical Account Manager @ Cloudera and Sqoop Committer/PMC Member, co-author of the Apache Sqoop Cookbook) presented “Apache Sqoop: Unlock Hadoop”. Abstract: Unlocking data stored in an organization’s RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we’ll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.
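
The incremental import mentioned in the Sqoop abstract boils down to remembering the highest value of an ever-increasing column and, on the next run, fetching only the rows above it. Below is a minimal Python sketch of that idea. It is not how Sqoop itself is implemented: an in-memory SQLite table stands in for the relational database, a plain list stands in for the Hadoop side, and the names `events`, `incremental_import` and `imported` are made up for illustration.

```python
import sqlite3


def incremental_import(conn, target, last_value):
    """Copy rows with id > last_value into target; return the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_value,)
    ).fetchall()
    target.extend(rows)
    # Keep the previous mark when nothing new arrived.
    return rows[-1][0] if rows else last_value


# Hypothetical source table standing in for the relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

imported = []  # stand-in for the Hadoop-side copy
mark = incremental_import(conn, imported, 0)  # first run copies everything

conn.execute("INSERT INTO events VALUES (3, 'c')")
mark = incremental_import(conn, imported, mark)  # second run copies only the new row
```

Sqoop's own incremental mode is driven by command-line options such as `--check-column` and `--last-value`, which play the role of the `id` column and the `mark` variable in this sketch.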

And of course, Accenture was kind enough to host us at their gorgeous venue in Brussels with that spectacular view! They presented their Big Data Challenge, in which 4 teams of about 5 consultants each dove deep into big data and data science to solve some practical cases. You can get in touch with their consultants to learn more.

The elaborate example on predicting public transport delays in real time, in particular, was really interesting. It made my hands itch to start a new BigData.be project!

See you next time!

2nd Meetup – BigData project

On Wednesday August 24th 2011, we had our second meetup. As some people cancelled at the last moment, the crowd was not as large as for our first meetup: now, there were just 6 of us.

This meetup was the first time we organised our get-together as an open discussion. Davy Suvee came up with that idea and apparently everybody present enjoyed this format very much. This article is a synopsis of the topics discussed.

On the lookout for a Big Data project

One of the major demands from our community members is to work together on some specific Big Data project to gain hands-on experience. In this respect the Big Data Wars thread was started on our groups page. However, instead of organizing a true public challenge, it would be easier and more instructive to just participate in an existing challenge as a team or to look for a specific project that we can implement ourselves. The candidate projects are outlined below.

We decided to create a Bitbucket account where we can define, plan, work and code on these projects.

Wikipedia’s Participation Challenge

A few days after Daan Gerits launched the BigData Wars idea on the group page, Nathan Bijnens referred to Wikipedia’s Participation Challenge as a good fit for a BigData.be project. You can read the full description on the kaggle.com page, but the idea is to build a predictive model that allows the Wikimedia Foundation to understand what factors determine editing behaviour and to forecast long-term trends in the number of edits to be expected on wikipedia.org. A random sample from the English Wikipedia dataset from the period January 2001 – August 2010 serves as training dataset for the predictive model.

What makes this challenge appealing is that it is a very specific problem description that covers quite a large fraction of the big data complexity:

  • data availability,
  • datastore modelling,
  • defining secondary data,
  • deriving sampled test data,
  • building predictive models using various approaches, technologies and algorithms.

However, as the deadline for the project is very close, i.e. Tuesday 20 September 2011, we probably won’t be truly participating, but we can still use the problem definition.

So, if you are interested in participating in this topic, contact us, or join the discussion or the repository.

 

Other proposals

Two more proposals were made as project topics:

  1. BioMed: As a computational biology researcher at Ghent University, Kenny Helsens proposed to see if he could come up with a suitable project definition in the area of bioinformatics, maybe genome- or proteome-related.
  2. Immo and GIS: Another interesting area for a well-suited project might be the combination of historical data made available by some immo (real estate) website with GIS-related data, e.g. from OpenStreetMap. A number of interesting problems can be derived from this combination, requiring e.g. predictive models. We’ll be contacting some immo websites on this matter.

As these projects are still in their incubation stage, we haven’t yet created any specific web areas for them. However, these might follow very soon, so keep checking back!

Realtime MapReduce

Daan Gerits has been working on some big data and NoSQL projects, and was wondering if anybody has experience with bringing Hadoop and/or MR to the realtime playing field, instead of keeping it strictly for batch processing. The basic difference is that you would be able to feed the data crunching algorithms incrementally, or by streaming data into the system, and have the algorithms merge the new data into the already (partially) existing result sets.
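
To make the idea of merging new data into an existing result set concrete, here is a minimal Python sketch. It assumes a word-count job as the example workload; the function names `map_reduce` and `merge_increment` are hypothetical and not taken from Hadoop or any other framework.

```python
from collections import Counter


def map_reduce(records):
    """Batch-style job: count words over a complete set of records."""
    counts = Counter()
    for record in records:
        counts.update(record.split())
    return counts


def merge_increment(existing, new_records):
    """Run the same job on the new records only and fold the result into existing."""
    existing.update(map_reduce(new_records))
    return existing


result = map_reduce(["big data", "big cluster"])    # initial batch run
result = merge_increment(result, ["data stream"])   # later arrival, no full recompute
```

This only works so simply because counting is associative and commutative: partial results can be added in any order. Jobs that compute something like a median cannot be merged this way without keeping extra state.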

During a presentation at the SAI on 7 April 2011, Steven Noels and Wim Van Leuven also pointed out that any big data processing system needs a combination of a batch layer with a speed layer to achieve at least eventual accuracy. Architecturally, the speed layer is the most challenging of the two. However, if we could combine realtime with existing MR algorithms, …

Some suitable technologies and pointers were raised: Yahoo S4, IBM InfoSphere Streams, Datastax Brisk, and Twitter’s Storm. However, no practical experience exists within our community.
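
As an illustration of how a batch layer and a speed layer can work together, here is a toy Python sketch of a page-view counter. It is an assumption-laden simplification, not the architecture shown in the SAI presentation: the class `LayeredCounter` and its methods are invented for this example, and everything lives in memory.

```python
from collections import Counter


class LayeredCounter:
    """Toy page-view counter with a batch view and a speed view."""

    def __init__(self):
        self.master = []             # append-only log of all events
        self.batch_view = Counter()  # recomputed from scratch, may lag behind
        self.speed_view = Counter()  # covers only events since the last batch run

    def ingest(self, page):
        self.master.append(page)
        self.speed_view[page] += 1   # cheap incremental update

    def run_batch(self):
        self.batch_view = Counter(self.master)  # full recompute over the log
        self.speed_view.clear()                 # batch view has caught up

    def query(self, page):
        # Answer from both layers so recent events are not missed.
        return self.batch_view[page] + self.speed_view[page]


views = LayeredCounter()
views.ingest("home")
views.ingest("home")
views.run_batch()
views.ingest("home")
views.ingest("about")
```

Queries combine both views, so the answer stays current between batch runs. Any error that creeps into the speed view is wiped out at the next full recompute, which is where the eventual accuracy comes from.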

Big Data Technology poll

As our meetup group was rather small, we did not redo the technology poll. However, we were wondering if some suitable tool exists to automate the poll via the bigdata.be website or other electronic means, like Google Docs. So if you have any good ideas in this area, please let us know!

See y’all at the next meetup!

The day after … and more

So it has been a while since we held our first meetup on July 5th, 2011. We had a lively discussion on ideas, wants and won’ts for our young but apparently vibrant community. After some discussion in our group, we prefer to set up our meetups on a rotating schedule over Tuesday, Wednesday and Thursday at an interval of 6-7 weeks.

So, we’re scheduling our 2nd meetup for Wednesday, August 24th, 2011. Keep an eye on our meetup page.

All ideas for a topic that night are more than welcome!

bbuzz – the day after

After 2 intensively immersive days on big data at the Berlin Buzzwords (#bbuzz) conference in Berlin, the bigdata.be crew is back in Belgium again. Those two days were a rollercoaster ride of meeting smart people and listening to great talks on everything big data, NoSQL datastores and scalable search.

There were quite a few bigdata.be people present at the event: Andre Kelpe, Wim Van Leuven and Steven Noels (who presented on Lily) with 2 of his colleagues at Outerthought. Of course we weren’t noticeable amongst the crowd of about 450 data geeks, so we might have missed some other Belgians out there. So, if you were there or know somebody who was, please step forward and get in touch! We want you!

The conference itself was a great lineup of two keynotes and numerous sessions on three tracks (store, scale and search) by the most knowledgeable people in our specific domain. Check out the agenda for Monday and Tuesday! Slides are posted here as they become available.

There were a few presentations that kept simmering through the hallways. Jonathan Gray’s exposé on Realtime Big Data at Facebook with Hadoop and HBase was much discussed because it was a bit controversial. Remember that Apache Cassandra was developed at and open-sourced by Facebook in 2008.

Also, during the second day keynote, Ted Dunning challenged the Hadoop, and more generally, the Apache community. He postulated that its days as a community are over. Too many stakeholders participate with too many conflicting interests. For the first time in its existence, the Apache Software Foundation is confronted with such a large-scale community. A community that is becoming an ecosystem. And the ASF is not the right body to manage it. A provocative but also challenging thought. But what is the right structure to manage that ecosystem? A Linux-like guerrilla approach? Or an overly structured standards body? These are some thoughts that I have been discussing with Lars George, but of course without finding an answer. There’s only a sense of future promises. A future that looks difficult but interesting …

Remember: whether you were there or not, if you are interested in big data, we want you! Get in touch and join!!!

Welcome

Welcome to our Belgian community about big data, NoSQL and anything cloud.

We will soon be organizing the kickoff of this community with our founding members. So keep coming back to see where our initiative is going, join us and hopefully participate actively to evangelize and promote the use of these new technologies throughout Belgium.