45th meetup – opensource big data tools

At Big Data Belgium’s 45th meetup, three Belgian companies announced that they will each open-source an important big data tool: Trumania by Real-Impact Analytics, Dataprism by VRT and Lighthouse by Dataminded.

Trumania

Real-Impact Analytics (RIA) not only offered to host Big Data Belgium’s 45th meetup on Wednesday, January 31, 2018; they have also been working on Trumania, their synthetic data generation tool. Trumania allows RIA not only to load-test their solutions, but also to inject problem situations and validate different scenarios.

Dataprism

Some time ago, VRT introduced their streaming platform VRT.nu. To track the platform’s usage, a data platform was built that captures the video player’s events and user journey events in real time. The platform is built on Apache Kafka and Apache Kafka Streams and runs in production on AWS. VRT has released it as open source under the name Dataprism.

Lighthouse

And finally, Dataminded gathered all their best practices on how to build a data lake. They poured their collective intellect into the open-source library Lighthouse. If you are into Apache Spark and Scala, this is your best starting point.

44th meetup – CI/CD for your data pipelines

Axa was kind enough to host our 44th BigData.be meetup in their new headquarters at Troonplein 1, 1000 Brussel. As you can see in the picture above, the venue is simply amazing. The arena-style seating is a bit of a challenge for the presenters, but it gave the presentations a specific dynamic that was well appreciated.

Schedule

For this meetup, we steered away from our traditional 2-presentations-per-evening style. Instead, we opted for 4 shorter presentations on Continuous Integration (CI) and Continuous Deployment (CD) for full data pipelines.

19h00 - doors
19h30 - CI/CD for full data pipelines (Mehdi OUAZZA - Axa)
19h50 - CI/CD for full data pipelines (Daniel Mescheder - Real Impact Analytics)
20h10 - break
20h20 - Car damage visual detection (Edward De Brouwer - PhD KU Leuven)
20h40 - Managing Nation-Wide Traffic Cameras and Sensors (David Massart - D.E. Solution)
21h00 - networking

Attendance

The announced meetup subject ‘CI/CD for data pipelines’ proved to be a very hot topic: we hit the 100 subscriber mark on our meetup page very quickly and eventually closed with 117 RSVPs. From the picture above, we estimate that about 70 people actually attended.

Finally, we can only finish with a big THANK YOU to Axa for hosting, to the presenters for sharing and to the community for attending, for asking intelligent and interesting questions during the Q&A sessions and for making BigData.be the vibrant community that it is!

All the best for 2018 from BigData Belgium!

O’Reilly Strata Europe 2014 – impressions from a PhD student

Guest post by Vasia Kalavri

I had marked the dates for the Strata + Hadoop World conference ever since I found out I was coming to Barcelona for an internship, about 5 months ago. Yet, as a PhD student, I knew I had no chance of finding a way to pay such a crazy registration fee… unless I constrained my diet to rice and water, a goal that is, realistically speaking, impossible when living in Barcelona, surrounded by such good food.

Then, last Monday morning, while attending the tutorial sessions at papis.io, I received an e-mail with the following content: “1 FREE 2-day pass to enter the conference on Thursday and Friday! Net value €779,00!!!” and, oh my god, it was not a lie! It was a message from the Belgian Big Data community, offering a free 2-day pass to Strata to the first member to reply to this e-mail after 14h00 CET! I immediately looked at my watch: 12h24. I quickly drafted a reply and waited patiently for time to go by. At 13h59, I opened the draft, waited for the last minute to pass and pressed the “send” button, hoping that my mobile data connection would not give up on me. And… bingo! A few minutes later, I was informed that I was indeed the first one to reply. My message had arrived at 14h00s06 :))

In the remainder of this post, I will provide my personal, short and biased summary of the event.

General impressions

My first thought when entering the main room on Thursday morning was “wow this is huge”. I’ve been to several (mostly academic) conferences, but this one had both the largest number of attendees and the most fascinating venue. And, by all means, it looked nothing like an academic conference. At least in the beginning, it was more of a show: fancy lighting, music to introduce the keynote speakers, text-free slides! To be honest, for a moment I thought this would all be a huge marketing campaign and I would waste my time. In the end, I have to admit that I was very happily surprised by the technical level of the talks and by the things I learned. As a systems person, I don’t often get to attend events that focus on use-cases and applications. Getting to hear about real-world use-cases was very inspiring for me!

Favorite Talks

I tried to avoid the business and industry tracks and mostly attended the Hadoop, tools and data science tracks. Among the keynotes, I especially enjoyed Geoff McGrath and Camille Fournier on Thursday and Jordan Tigani on Friday. I would definitely suggest watching them: https://www.youtube.com/playlist?list=PL055Epbe6d5Y8aARKdXVVtJnEttlhsRyf.

Among the rest of the talks, my favorites were:

  • “SAMOA: A Platform for Mining Big Data Streams”, by Gianmarco De Francisci Morales
  • “High-Level Abstractions Make Big Data Useful for Real People” (even though the talk content was quite different from what the title suggests), by Melissa Santos
  • “How Search Can Save Your Hadoop Investment and More”, by Shay Banon
  • “Realtime Data Analysis Patterns”, by Mikio Braun
  • “RT-Giraph: Online graph Mining Simplified”, by Georgios Siganos.

Most of the slides are already available here: http://strataconf.com/strataeu2014/public/schedule/proceedings

Misc and +1’s

  • I was really happy to see so many great female speakers! Keep it up organizers!
  • I got a couple of really nice T-shirts, +1 to Cloudera for providing female sizes!
  • +1 for the food and the great sea view of the banquet room.
  • It turns out I was not the only one with a University affiliation. I met a fellow PhD student from ULB there :))

Overall, Strata was a very enjoyable experience for me. Who knows, I might even consider submitting a talk next year!
Finally, I’m really grateful to BigData.be for the free pass and, of course, to my mobile operator for delivering my e-mail with such great precision!

Short Bio

Vasia Kalavri is a PhD student at KTH, Sweden and UCL, Belgium. She is currently doing an internship at Telefonica Research and lives in Barcelona, Spain. Vasia works in the area of distributed data processing, systems optimization and large-scale graph analysis. She is a committer on Apache Flink (flink.incubator.apache.org) and also contributes to Grafos.ml (grafos.ml).

Website: http://web.ict.kth.se/~kalavri/
Twitter: @vkalavri

Data science and R meetup

As announced on our meetup page, we had @JaredLander over from New York for a project at BigBoards. So we rose to the occasion and had him talk about Backends for Big Data in R. This comment from the meetup page says it all!

Nice fast-paced presentation with style from Jared
Marcel Dumont

To complement this “gentle introduction to R” ;-), our second presentation was given by DataCamp. They introduced us to what DataCamp stands for, how their platform is architected and how teachers write courses (completely in R!). Again, your feedback says more than a thousand words …

This presentation impressed me the most. How a couple of students from the KULeuven can start-up their own company and be successful in filling the gap of training services in R
Jean-Jacques DE CLERCQ

(Hey guys @datacamp, if you are reading this from over there in LA at the useR! conference, can we share your slides here?)

What did you learn? What do you think about this meetup?

23rd meetup – Data science/ElasticSearch/elections

On May 27th we had our 23rd meetup in Ghent kindly hosted by iMinds. We had a healthy mix of technical and business items.

The following presentations were given:

1.  Introduction to the Brussels Data Science Meetup by Philippe Van Impe (30 min)

Philippe gave an introduction to this Meetup, the projects they are working on (with a possible link to big data) and their link to the non-profit organization DataKind (“data for good”).

His presentation can be found here.

2.  Introduction to ElasticSearch by Eric Rodriguez (60 min)

In just under an hour, Eric gave an introduction to all features of ElasticSearch. During the presentation, Toon Vanaght showed how ES is used at data.be.

His presentation can be found here, and you are also invited to check out the Belgian ElasticSearch Meetup group.

3.  Election Bingo by Stijn Beauprez (30 min)

The vk14-bingo.be application can help you make up your mind before voting on Sunday: it gives you insight into the topics our political parties are talking about.

Stijn demoed the application and explained which technologies were used to implement it.

His presentation can be found here.

4.  De Verkiezingen by Philippe Kerremans (10 min)

The Deverkiezingen.be website is another application that uses social media to get insight into the mother of all elections. Philippe explained how they used ElasticSearch and D3.js, among other technologies.

His presentation can be downloaded here: BigDataMeetupDeverkiezingen

Many thanks to iMinds for hosting the location and DataCrunchers for providing drinks.

In the meantime, we now have more than 700 members!

22nd meetup – Cloudera on HBase and Sqoop

It has been quite a while since we actually posted something on our website. Wow!!! Time really flies.

On April 4th 2014, we had our 22nd meetup already. Klaas Bosteels was able to attract 2 prominent speakers from Cloudera who were touring Europe and presenting at the 2014 Hadoop Summit in Amsterdam.

  1. Jon Hsieh (Software Engineer @ Cloudera and HBase Committer/PMC Member) talked about “Apache HBase: Now and the future”
    Apache HBase is a distributed non-relational database that provides low-latency random read/write access to massive quantities of data. This talk will be broken up into two parts. First I’ll talk about how in the past few years, HBase has been deployed in production at companies like Facebook, Pinterest, Groupon, and eBay and about the vibrant community of contributors from around the world, including folks at Cloudera, Salesforce.com, Intel, HortonWorks, Yahoo!, and XiaoMi. Second I’ll talk about the features in the newest release 0.96.x and in the upcoming 0.98.x release.
  2. Kate Ting (Technical Account Manager @ Cloudera and Sqoop Committer/PMC Member, co-author of the Apache Sqoop Cookbook) presented “Apache Sqoop: Unlock Hadoop”
    Unlocking data stored in an organization’s RDBMS and transferring it to Apache Hadoop is a major concern in the big data industry. Apache Sqoop enables users with information stored in existing SQL tables to use new analytic tools like Apache HBase and Apache Hive. This talk will go over how to deploy and apply Sqoop in your environment as well as transferring data from MySQL, Oracle, PostgreSQL, SQL Server, Netezza, Teradata, and other relational systems. In addition, we’ll show you how to keep table data and Hadoop in sync by importing data incrementally as well as how to customize transferred data by calling various database functions.

And of course, Accenture was kind enough to host us at their gorgeous venue in Brussels with that spectacular view! They presented their Big Data Challenge, in which 4 teams of about 5 consultants dove deep into big data and data science to solve some practical cases. You can get in touch with their consultants to know more.

The elaborate example on predicting public transport delays in real time, at the very least, was really interesting. It made my hands itch to start a new BigData.be project!

See you next time!

16th meetup — Scaling Big Data Mining Infrastructure: The Twitter Experience

It’s a bit late notice unfortunately, but we’ll be doing another meetup on July 16th in Ghent, featuring a very promising talk by renowned data geek Jimmy Lin about Twitter’s big data mining infrastructure. Space is limited, so you should head to our corresponding meetup page straight away to reserve your spot.

Talk abstract

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. This talk will discuss the evolution of the Twitter infrastructure and the development of capabilities for data mining on “big data”. We’ll share experiences as a case study, but make recommendations for best practices and point out opportunities for future work.

About the speaker

Jimmy Lin is an associate professor in the iSchool at the University of Maryland, with appointments in the Institute for Advanced Computer Studies (UMIACS) and the Department of Computer Science. He works on “big data”, with a particular focus on large-scale distributed algorithms for text processing. His research lies at the intersection of natural language processing (NLP) and information retrieval (IR). Recently, Jimmy spent an extended sabbatical (from 2010 to 2012) at Twitter working on large-scale data analytics. Previously, he has also done work for Cloudera, the enterprise Hadoop company.

9th meetup – schedule complete

Good news everyone!

The schedule for our 9th meetup is complete. We will have three talks from different areas of the big data universe:

We hope you like this schedule as much as we do, and we hope to see plenty of you there!

-BigData.be

9th meetup – call for participation

Hello all,

The friendly folks at NGDATA in Ghent will host our 9th meetup. Thanks for that already!

Besides a location, we are always looking for interesting things to discuss during the meetup. Have you read something interesting in the bigdata/nosql space lately? Are you implementing something amazing right now? Do you have a problem that you want to discuss? Let us know!

Looking forward to hearing from you all!

-BigData.be

Real estate project – Minutes from kickoff

On Monday the 21st of November, we kicked off our first bigdata.be project using historic sales data from a real estate website (meetup).

Interesting discussions throughout the evening led us to define two parallel tracks. On the one hand, we will try to come up with a semi-structured model for real estate data. On the other hand, we will attempt to apply data analytics to real estate data using algorithms provided by the Apache Mahout community.

Besides being of interest to the bigdata.be community, we reasoned that a semi-structured data model would support integration of real estate data with orthogonal information derived from e.g. OpenStreetMap or OpenBelgium. This will enable us to enrich the existing data with things like

  • distance to n cities,
  • a ‘green index’ that correlates how far a real estate property is located from a nearby forest, or
  • a ‘density index’ that ties in with the number of houses that are for sale at the moment in a radius of 1, 2, 4, 8 or 16 km.
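
As an illustration of the enrichments listed above, here is a minimal Python sketch of the "distance to n cities" and "density index" ideas. It is not code from the project: the function names, the dictionary layout of a listing and the city coordinates (approximate) are assumptions made for this example, and the distance is a plain great-circle (haversine) distance instead of anything derived from OpenStreetMap.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate city-centre coordinates (latitude, longitude), for illustration only.
CITIES = {
    "Brussels": (50.8503, 4.3517),
    "Ghent": (51.0543, 3.7174),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_cities(prop, cities=CITIES):
    """Distance in km from a property to each city in `cities`."""
    return {
        name: haversine_km(prop["lat"], prop["lon"], lat, lon)
        for name, (lat, lon) in cities.items()
    }


def density_index(prop, listings, radii_km=(1, 2, 4, 8, 16)):
    """Number of other listings for sale within each radius around the property."""
    distances = [
        haversine_km(prop["lat"], prop["lon"], other["lat"], other["lon"])
        for other in listings
        if other is not prop  # do not count the property itself
    ]
    return {r: sum(1 for d in distances if d <= r) for r in radii_km}


if __name__ == "__main__":
    house = {"lat": 51.0, "lon": 4.0}
    for_sale = [{"lat": 51.0, "lon": 4.01}, {"lat": 51.0, "lon": 4.2}]
    print(distance_to_cities(house))
    print(density_index(house, for_sale))
```

This brute-force version compares every property with every listing, which is fine for a first exploration; a spatial index would be the natural next step for a full archive.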

We thought of using HBase or Cassandra as datastores to address this task, but we will only decide this during follow-up meetings. Remembering the interests poll from the first bigdata.be meetup, we hope that quite a few members of the bigdata.be community will want to work on this track of the real estate project.

The second track of the real estate project, on the other hand, aims to attract broader interest by producing insights that are relevant to a wider audience than just our bigdata.be community. To that end, we touched on three data analytics topics that we could work out further, depending on their feasibility.

  1. Prediction of the price and time of the sale
    Based on archives from real estate companies, we will evaluate how well the price and time of the sale of a house can be predicted. For skeptical readers, have a look at the Zestimate® price from the U.S. real estate website zillow.com.
  2. Text mining on free text descriptions
    It’s obvious that a plastic door correlates with a cheap house, while a granite kitchen correlates with an expensive one. But what other vocabulary-based associations can we derive by performing text mining analysis on the free text fields specified by the seller? (cf. Freakonomics chapter 2)
  3. Recommendation engine for similar real estate properties
    Finally, by analyzing traffic logs from a real estate website, we should be able to build a recommendation engine that guides visitors to related houses for sale. Think of how you search Amazon.co.uk for a Canon 550D, and you will surely see the camera bag as well.
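
To make the third topic a bit more concrete, the sketch below shows one simple way such a recommendation could work: counting which listings are viewed together in the same visitor session. It is only an illustration in plain Python, not the Apache Mahout algorithms mentioned earlier, and the session format and listing ids are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two listings are viewed within the same session.

    `sessions` is an iterable of lists of listing ids, one list per visitor
    session (a hypothetical format; real traffic logs would need parsing first).
    """
    cooccurrence = defaultdict(Counter)
    for viewed in sessions:
        # Use each listing at most once per session, then count every pair.
        for a, b in combinations(sorted(set(viewed)), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence


def recommend(cooccurrence, listing_id, top_n=3):
    """Return up to `top_n` listings most often viewed together with `listing_id`."""
    neighbours = cooccurrence.get(listing_id, Counter())
    return [other for other, _count in neighbours.most_common(top_n)]


if __name__ == "__main__":
    sessions = [
        ["house-1", "house-2", "house-3"],
        ["house-1", "house-2"],
        ["house-2", "house-4"],
        ["house-1", "house-2", "house-4"],
    ]
    model = build_cooccurrence(sessions)
    print(recommend(model, "house-1", top_n=2))
```

Raw co-occurrence counts favour popular listings, so a real system would probably normalise them with a similarity measure before ranking.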

 

Keep your eyes on the bigdata.be meetup site and, if interested, join the next events on the real estate project!