45th meetup – opensource big data tools

At Big Data Belgium’s 45th meetup, 3 Belgian companies announced that they will each open-source an important big data tool: Trumania by Real-Impact Analytics, Dataprism by VRT and Lighthouse by Dataminded.

Trumania

Real-Impact Analytics not only offered to host Big Data Belgium’s 45th meetup on Wednesday, January 31, 2018, they have also been working on Trumania, their synthetic data generation tool. Trumania allows RIA not only to load-test their solutions, but also to inject problem situations and validate different scenarios.

Dataprism

Some time ago, VRT introduced their streaming platform VRT.nu. To track the platform’s usage, a data platform was built to capture the video player’s events and user journey events in real time. The platform is built on Apache Kafka and Apache Kafka Streams and runs in production on AWS. VRT has released their platform as open source under the name Dataprism.

Lighthouse

And finally, Dataminded gathered all their best practices on how to build a data lake. They poured their collective intellect into the open-source library Lighthouse. If you are into Apache Spark and Scala, this is your best starting point.

44th meetup – CI/CD for your data pipelines

Axa was kind enough to host our 44th BigData.be meetup in their new headquarters at Troonplein 1, 1000 Brussel. As you can see in the picture above, the venue is simply amazing. The arena-style seating was a bit of a challenge for the presenters, but it gave the presentations a specific dynamic that was well appreciated.

Schedule

For this meetup, we steered away from our traditional 2-presentations-per-evening style. Instead, we opted for 4 shorter presentations, centred on Continuous Integration (CI) and Continuous Deployment (CD) for full data pipelines.

19h00 - doors
19h30 - CI/CD for full data pipelines (Mehdi OUAZZA - Axa)
19h50 - CI/CD for full data pipelines (Daniel Mescheder - Real Impact Analytics)
20h10 - break
20h20 - Car damage visual detection (Edward De Brouwer - PhD KU Leuven)
20h40 - Managing Nation-Wide Traffic Cameras and Sensors (David Massart - D.E. Solution)
21h00 - networking

Slidedecks

Attendance

The announced meetup subject ‘CI/CD for data pipelines’ proved to be a very hot topic: we hit the 100-subscriber mark on our meetup page very quickly. Eventually we closed with 117 RSVPs. From the picture above, we estimate that about 70 people actually attended.

Finally, we can only finish with a big THANK YOU to Axa for hosting, to the presenters for sharing and to the community for attending, for asking intelligent and interesting questions during the Q&A sessions and for making BigData.be the vibrant community that it is!

All the best for 2018 from BigData Belgium!

39th meetup on Apache Kafka and Apache Ignite

Winter is coming, so we are picking up our best habits again by organising our 39th BigData.be meetup. We had 2 great presentations lined up for you on Apache Kafka and Apache Ignite. Read on!

Summer is long gone, and so were our BigData.be meetups! Our mistake …

But “Winter is coming!” So we picked up our best habits again. The 39th meetup was on Tuesday, November 8th, 2016 at 19h30.

Cegeka was kind enough to offer the magnificent aula at their Hasselt campus as our venue. Thank you, Rutger Claes, for organising this!

The attendance this time was a cozy bunch of people! It turned out to be a most interesting and interactive meetup.

Really interesting talks by Daan and Mathias; strongly recommended.
–Bjorn

Streams++, A complete streaming framework using Apache Kafka

Daan Gerits, CTO at BigBoards

Daan introduced the most important components in the Apache Kafka ecosystem.

“Build apps, not jobs“

That quote eloquently compares Kafka to the other Big Data application patterns.

The Apache Kafka ecosystem showing Kafka at the center of course, surrounded by from left to right Kafka Security, Schema Repo, Kafka Proxy, Kafka Connect and finally Kafka Streams — The Apache Kafka ecosystem

What is Apache Kafka?

Kafka is a distributed streaming platform. “A message broker with a twist.” What does that mean? Producers put messages on Topics, whereas Consumers process messages from Topics sequentially. A message is a simple byte array, but it carries a timestamp, a key and a value. The Topics themselves, on the other hand, are more like datastores: they are persisted to disk to keep the messages available. Partitioning and replication make Topics highly available. Kafka is fast!
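To make the “message broker with a twist” idea a bit more tangible, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kafka client API: the names Message, Topic and Consumer, the CRC32-based partitioner and the 3-partition default are all assumptions made for illustration. It shows messages carrying a timestamp, key and value, a topic that keeps messages after they are read, and a consumer that reads sequentially from its own offset.

```python
import time
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    timestamp: float
    key: bytes
    value: bytes


@dataclass
class Topic:
    """Toy stand-in for a topic: a fixed number of append-only partitions."""
    name: str
    num_partitions: int = 3
    partitions: list = field(default_factory=list)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, key: bytes, value: bytes) -> tuple:
        # Hypothetical partitioner: the same key always maps to the same
        # partition, so the relative order of its messages is preserved.
        partition = zlib.crc32(key) % self.num_partitions
        self.partitions[partition].append(Message(time.time(), key, value))
        return partition, len(self.partitions[partition]) - 1  # (partition, offset)


class Consumer:
    """Reads one partition sequentially and remembers its own offset."""

    def __init__(self, topic: Topic, partition: int):
        self.topic, self.partition, self.offset = topic, partition, 0

    def poll(self) -> list:
        log = self.topic.partitions[self.partition]
        batch = log[self.offset:]
        self.offset = len(log)  # reading moves the offset, it does not delete
        return batch


topic = Topic("clicks")
p1, o1 = topic.append(b"user-1", b"page=/home")
p2, o2 = topic.append(b"user-1", b"page=/cart")

consumer = Consumer(topic, p1)
first = consumer.poll()   # both messages, in the order they were produced
second = consumer.poll()  # nothing new since the last poll
```

Note that the messages are still in the topic after the consumer has read them. That is the “datastore” side of the twist: a second consumer starting at offset 0 would see the same messages again.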

Kafka Connect is a framework that integrates Kafka with other systems. Its purpose is to make it easy to add new systems to your stream data pipelines. Source Connectors import data from another system by putting them as messages on a Topic. Sink Connectors, on the other hand, read data from a Topic and deliver it to a target system.

Kafka Streams is an open-source solution to build and execute powerful stream processing functions. If you use Kafka for stream data transport, Kafka Streams can immediately add stream processing capabilities. Kafka Streams doesn’t even need a separate compute cluster for distributed processing! A Stream reads messages from a topic, transforms them and puts them on another topic.
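The roles of Connect and Streams described above can be sketched end to end in a few lines of plain Python. This is only a conceptual model under loud assumptions: a topic is represented by an ordinary list, and the functions source_connector, stream and sink_connector are hypothetical names, not part of the Kafka Connect or Kafka Streams APIs.

```python
from typing import Callable, Iterable

Topic = list  # toy stand-in: a topic is an append-only list of (key, value)


def source_connector(rows: Iterable[dict], topic: Topic) -> None:
    """Import records from an external system (here: dicts) into a topic."""
    for row in rows:
        topic.append((row["id"], row))


def stream(source: Topic, transform: Callable, target: Topic) -> None:
    """Read every message from one topic, transform it, write it to another."""
    for key, value in source:
        target.append((key, transform(value)))


def sink_connector(topic: Topic, store: dict) -> None:
    """Deliver the messages of a topic to a target system (here: a dict)."""
    for key, value in topic:
        store[key] = value


raw, enriched, database = [], [], {}

source_connector([{"id": 1, "temp_c": 20.0}, {"id": 2, "temp_c": 25.0}], raw)
stream(raw, lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32}, enriched)
sink_connector(enriched, database)
```

The point of the sketch is the shape of the pipeline: each step only knows about topics, so steps can be added, replaced or scaled without touching the others.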

What is Apache Kafka’s appeal?

Apache Kafka is elegantly designed using only simple constructs:

Connect and Streams are just libraries.
Connect and Stream apps are just simple apps or processes.
Apps are super easy to deploy.
Just spawn many instances of the same app for more throughput.
Kafka apps and orchestration tools are a match made in heaven. Think of Mesos DC/OS, Kubernetes, Docker Swarm and the like.
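Why does spawning more instances of the same app give more throughput? Because the partitions of a topic get divided over the running instances. The sketch below uses a simplified round-robin split to show the idea. The function assign_partitions and the instance names are assumptions for illustration only; Kafka’s real assignment strategies are more elaborate.

```python
def assign_partitions(num_partitions: int, instances: list) -> dict:
    """Divide partition numbers over app instances, round-robin style."""
    assignment = {name: [] for name in instances}
    for partition in range(num_partitions):
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


# One instance has to process all 6 partitions on its own ...
one = assign_partitions(6, ["app-0"])

# ... while 3 instances each take care of only 2 partitions.
three = assign_partitions(6, ["app-0", "app-1", "app-2"])
```

In this model, an instance beyond the number of partitions would end up with an empty list, which is why the partition count puts an upper bound on useful parallelism.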

Here is Daan’s slidedeck …

Apache kafka from Daan Gerits

Sequence data at warp speed with Apache Ignite

Mathias Lavaert, Data Engineer at Dataminded

Mathias and his team had to crack a challenging customer problem:

How to do interactive querying on sequence data at scale?!

We all know that sequential data and the processing thereof are hard to deal with. Typical use cases:

Sales history of a customer in retail business
Browser activity of a visitor on a website
Events generated by a sensor in IoT
DNA sequence of a particular species in genomics

Typical operations on sequential data are: align, diff, down sample, outlier, min/max, avg/med, slope, ….
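For readers who are new to sequence data, a few of these operations can be sketched on a single in-memory sequence with nothing but the Python standard library. The helper names downsample, diff and slope and the sample readings are ours, chosen for illustration; they are not taken from Mathias’ implementation.

```python
from statistics import mean, median


def downsample(seq: list, factor: int) -> list:
    """Average each run of `factor` consecutive values."""
    return [mean(seq[i:i + factor]) for i in range(0, len(seq), factor)]


def diff(seq: list) -> list:
    """Difference between each value and its predecessor."""
    return [b - a for a, b in zip(seq, seq[1:])]


def slope(seq: list) -> float:
    """Least-squares slope of the values against their index 0, 1, 2, ..."""
    n = len(seq)
    x_mean, y_mean = (n - 1) / 2, mean(seq)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(seq))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


readings = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]

summary = {
    "min": min(readings),
    "max": max(readings),
    "avg": mean(readings),
    "med": median(readings),
    "slope": slope(readings),
}
coarse = downsample(readings, 2)
steps = diff(readings)
```

Every one of these operations needs the whole sequence of one key, in order, in one place. That requirement is what makes them easy on a single machine and awkward at scale.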

Let’s try to put Apache Spark to work?

The first plan of attack was to put Apache Spark straight to work. But the typical layout of the sequence by timestamp is far from ideal for analysis, because the required group by is both cumbersome and expensive.

A table showing time series data according to 'observations' layout, i.e. sorted first by key, next by timestamp and pointing then to value — Time series ‘Observations’ layout
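The group by mentioned above can be illustrated on a handful of rows. The sketch assumes an ‘observations’ layout of (key, timestamp, value) rows that arrive in no particular order; the sample sensor data and the to_sequences helper are made up for illustration.

```python
from collections import defaultdict

# One row per observation: (key, timestamp, value).
observations = [
    ("sensor-b", 2, 7.0),
    ("sensor-a", 1, 1.5),
    ("sensor-b", 1, 6.5),
    ("sensor-a", 3, 2.5),
    ("sensor-a", 2, 2.0),
]


def to_sequences(rows: list) -> dict:
    """Group the rows per key and order each group by timestamp."""
    grouped = defaultdict(list)
    for key, timestamp, value in rows:
        grouped[key].append((timestamp, value))
    return {key: [v for _, v in sorted(points)] for key, points in grouped.items()}


sequences = to_sequences(observations)
```

On one machine this is trivial. On a cluster, the rows of one key are typically spread over many nodes, so the same grouping means moving data across the network before any sequence can be analysed.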

First alternative. Spark TS builds on top of Spark to add a set of abstractions for manipulating time series data. But this project is risky: it is immature and maintained by a single developer.

2nd alternative. HuoHua. Right now, it is only a concept laid out in this presentation. HuoHua is Chinese for ‘spark’ 🙂 The authors are a team of 8 engineers at TwoSigma (ed. 1 of the 1st customers of BigBoards).

Either way, time series data does not match well with Spark’s typical data model. It causes too much memory pressure and too many shuffle issues.

Apache Ignite to the rescue!

“Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.”

A jigsaw puzzle containing all Apache Ignite's components: data grid, compute grid, service grid, streaming, hadoop acceleration, advanced clustering, file system, messaging, events and data structures — The Apache Ignite component overview

If you look at the picture above, you can see that Apache Ignite is a lot. Mathias and his team specifically used the Data Grid and Compute Grid components to build their proof of concept.

An image says more than a thousand words …

Apache Ignite's in-memory data grid stores an array of keys and values across the memory of multiple nodes to make the data highly available, using a back-end database as write-through and read-through persistence engine — Apache Ignite’s in-memory data grid

Apache Ignite's compute grid distributes the compute workload across its various nodes, each calculating an intermediate result, and together delivering the solution in a fraction of the time needed for a sequential calculation — Apache Ignite’s compute grid

The advantages of Apache Ignite that appealed to the Dataminded team, are:

Recognizable Java APIs
Computations are simple Java Callables returning a Future<T>. In an interactive environment that is way more flexible than launching Apache Spark.
Data affinity allows Apache Ignite to execute code on the nodes where the data resides. This results in major speed improvements.
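The combination of “a callable that returns a future” and data affinity can be mimicked with Python’s standard concurrent.futures module. To be clear: this is an analogy in Python, not the Apache Ignite API. The classes Node and ToyGrid, the method affinity_call and the CRC32-based affinity function are all hypothetical names invented for this sketch.

```python
import zlib
from concurrent.futures import Future, ThreadPoolExecutor


class Node:
    """Toy cluster node: owns part of the data and a worker to compute on it."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.executor = ThreadPoolExecutor(max_workers=1)


class ToyGrid:
    def __init__(self, node_names: list):
        self.nodes = [Node(name) for name in node_names]

    def node_for(self, key: str) -> Node:
        # Affinity function: the same key always maps to the same node.
        return self.nodes[zlib.crc32(key.encode()) % len(self.nodes)]

    def put(self, key: str, sequence: list) -> None:
        self.node_for(key).data[key] = sequence

    def affinity_call(self, key: str, func) -> Future:
        # Run func on the node that already holds the sequence, so only
        # the small result has to travel back to the caller.
        node = self.node_for(key)
        return node.executor.submit(lambda: (node.name, func(node.data[key])))

    def shutdown(self) -> None:
        for node in self.nodes:
            node.executor.shutdown()


grid = ToyGrid(["node-0", "node-1", "node-2"])
grid.put("customer-42", [10.0, 12.5, 9.0])

future = grid.affinity_call("customer-42", max)  # returns immediately
ran_on, result = future.result()                 # blocks until the node is done
grid.shutdown()
```

Because the caller gets a future back right away, it can fire off many such calls and collect the results as they come in, which suits an interactive environment well.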

The benefits that Apache Ignite brought to Mathias’ problem domain, are:

Processing sequences without the loss of context
Collocating sequences results in 0 network transfer
Analyzing a single sequence is fast and convenient
Very high level of control

Here is Mathias’ slidedeck …

PS: Do scroll to the last page …we love Dataminded’s corporate values!!!

http://bigdata.be/wp-content/uploads/2016/11/Apache-Ignite.pdf

The After-party

After the presentations we had a group discussion on what our members want or expect from the community. We continued this discussion over a beer in the bar across the street. Here are some thoughts:

Do we want to organize and participate in hackathons to get hands-on experience?
Who can propose meaningful ideas for a hackathon?
Do we want to see more business use cases during the meetups?
Other?

That was my 1st BigData.be meetup. I liked it a lot! Engaging talks at a great location/venue. Keep it coming. I’ll plan some time for a beer next time 🙂
–Mikhail Shilkov

After the meetup, Tom Bayens proposed a possible topic for a hackathon: build the best investment strategy using historic and real-time stock quote data. As far as Tom is concerned, there seems to be ‘an ideal match between the necessary calculations and technologies that fit BigData.be’.

Share your own ideas to grow our community!

On Kafka and Hadoop use cases in Europe – 35th meetup

For Big Data Belgium’s first meetup of 2016, we had 2 interesting topics scheduled: Apache Kafka performance and Hadoop use cases in Europe. So, a big thanks goes to both our speakers, but also to Co.Station BXL for hosting us!

Kafka, the Big Data message broker

Wannes De Smet – Sizing Servers

Often described as the heart of any scalable Big Data cluster, Apache Kafka has quickly become the message broker for your environment. As a message broker’s task is to reliably move messages from component A to B (and C), doing so in a scalable and reliable way with millions of messages is no small feat.

Wannes presented a short intro of Kafka, followed by a deep dive through the entire process of reliably producing and consuming messages. Oh, and doing all that in a distributed, highly-available, fault-tolerant manner, of course. He walked us through some of the architectural requirements and operational intricacies (configuration, monitoring, …) of using and operating a Kafka cluster, based on lessons learned from moving a complex stack to Kafka in production.

So if you are still shifting CSV files around, take some time to get to know the ultimate upgrade.

Wannes cannot share his slide deck with us, but I’m pretty sure that he’ll be keen to share his slides with you personally, if you provide the Sizing Servers Research Lab with your input on their Big Data performance research.

Hadoop in the real world: stories from across Europe

Tim Marston, Director, Regional Alliances, EMEA – Hortonworks

In the 2nd presentation of the evening, Tim Marston introduced us to HDP, Hortonworks’ flavour of Hadoop. He highlighted its strengths as a fully open-source system, before going into more detail on various use cases which they implemented across Europe. The slides give all the details.

Hortonworks – Hadoop Stories

Thank you all for being there!

May the data be with you!

Wannes De Smet in front of a well-attended aula presenting on Apache Kafka and its performance

Wannes De Smet on Apache Kafka and its performance

Our 35th meetup was well attended

Big Data and Data Science – 27th meetup

Our 27th meetup, a joint venture with DataScience.be, was a huge success! The goal was to give a thorough introduction to Big Data to the data scientists and business people of both organizations.

In total, 221 participants registered over both communities! Unfortunately, quite a lot of people did not make it last night. That is probably due to the EU summit that was happening yesterday. (Note that the meetup was held at the VUB in Elsene.) But it was still a huge crowd.

Presentations on Big Data for Data Science

Philippe Van Impe, co-organizer of DataScience.be, gave an overview of last year’s activities of the DataScience.be community. He focussed specifically on their data for good and hackathon initiatives. In the presentation, he hid a product placement for BigBoards: on one of the pictures from a hackathon, Kris Peeter’s Hex was visible in the foreground. The Hex was used to do social network analysis!

Next, the DataScience.be team that has been working on the Médecins Sans Frontières (MSF) project presented an overview of their work and results. The team was led by Edward Vanden Berghe. They received a dataset from MSF on the organisation’s donations. The team screened the dataset for donor segments and looked for actionable insights to help MSF improve their revenues.

As 3rd speaker, I gave an introduction to Big Data and what it can mean to organisations, large and small. Finally, I touched on the importance of data science to give meaning to the data.

Daan Gerits took over and got into the details of how to set up a scalable and resilient Big Data architecture.

After the break, Ferdinand Casier and Mathias Verbeke presented their EluciDATA project, which starts in 2015. The goal is to help Belgian companies with data innovation. Any questions or requests for participation can be sent to info@elucidata.be!

And last but not least, Karim Douïeb explained how they are using Spark for call detail record analysis for mobile operators. Really interesting!

The meetup ended at about 21h30 with a Q&A session with all presenters together. Very thoughtful questions were raised by a sharp audience!

Thank you all for participating!!!

Images from the 27th meetup

Data Science on MSF data

Big Data and Data Science meetup

Data Science on MSF data

Big Data for data scientists

EluciDATA to help Belgian companies with Big Data innovation by Agoria and Sirris

EluciDATA by Agoria and Sirris

Karim Douïeb on Spark for Big Data at telcos

Karim Douïeb on Spark at telcos

Drinks after 27th meetup

O’Reilly Strata Europe 2014 – impressions from a PhD student

Guest post by Vasia Kalavri.

I had marked the dates for the Strata + Hadoop World conference since I found out I was coming to Barcelona for an internship, about 5 months ago. Yet, as a PhD student, I knew I had no chance of finding a way to pay such a crazy registration fee… unless, I would constrain my diet to rice and water; a -realistically speaking- impossible goal when living in Barcelona, surrounded by such good food.

Then, last Monday morning, while attending the tutorial sessions at papis.io, I received an e-mail with the following content: “1 FREE 2-day pass to enter the conference on Thursday and Friday! Net value €779,00!!!” and -oh my god- it was not a lie! It was a message from the Belgian Big Data community, offering a free 2-day pass to Strata, to the first member that would send a reply to this e-mail, after 14h00 CET! I immediately looked at my watch: 12h24. I quickly made a draft reply and waited patiently for time to go by. At 13h59, I opened the draft reply, waited for the last minute to pass and pressed the “send” button, hoping that my mobile data connection wouldn’t give up on me. And… bingo! A few minutes later, I was informed that I was indeed the first one to reply. My message had arrived at 14h00s06 :))

In the remainder of this post, I will provide my personal short and biased summary of the event.

General impressions

My first thought when entering the main room on Thursday morning was “wow this is huge”. I’ve been to several -mostly academic- conferences, but this one had both the largest number of attendees and the most fascinating venue. And, by all means, it looked nothing like an academic conference. At least in the beginning, it was more of a show: fancy lighting, music to introduce the keynote speakers, text-free slides! To be honest, for a moment I thought this would all be a huge marketing campaign and I would waste my time. In the end, I have to admit that I was very happily surprised by the technical level of the talks and by the things I learned. As a systems person, I don’t often get to attend events that focus on use-cases and applications. Getting to hear about real-world use-cases was very inspiring for me!

Favorite Talks

I tried to avoid the business and industry tracks and mostly attended the hadoop, tools and data science tracks. Among the keynotes, I especially enjoyed Geoff McGrath and Camille Fournier on Thursday and Jordan Tigani on Friday. I would definitely suggest watching them: https://www.youtube.com/playlist?list=PL055Epbe6d5Y8aARKdXVVtJnEttlhsRyf.

Among the rest of the talks, my favorites were:

“SAMOA: A Platform for Mining Big Data Streams”, by Gianmarco De Francisci Morales
“High-Level Abstractions Make Big Data Useful for Real People” (even though the talk content was quite different than what the title suggests), by Melissa Santos
“How Search Can Save Your Hadoop Investment and More”, by Shay Banon
“Realtime Data Analysis Patterns”, by Mikio Braun
“RT-Giraph: Online graph Mining Simplified”, by Georgios Siganos.

Most of the slides are already available here: http://strataconf.com/strataeu2014/public/schedule/proceedings

Misc and +1’s

I was really happy to see so many great female speakers! Keep it up organizers!
I got a couple of really nice T-shirts, +1 to Cloudera for providing female sizes!
+1 for the food and the great sea view of the banquet room.
It turns out I was not the only one with a University affiliation. I met a fellow PhD student from ULB there :))

Overall, Strata was a very enjoyable experience for me. Who knows, I might even consider submitting a talk next year!
Finally, I’m really grateful to BigData.be for the free pass and, of course, to my mobile operator for delivering my e-mail with such great precision!

Short Bio

Vasia Kalavri is a PhD student at KTH, Sweden and UCL, Belgium. She is currently doing an internship at Telefonica Research and lives in Barcelona, Spain. Vasia is working in the area of distributed data processing, systems optimization and large-scale graph analysis. She is a committer of Apache Flink (flink.incubator.apache.org) and also contributing to Grafos.ml (grafos.ml).

Website: http://web.ict.kth.se/~kalavri/
Twitter: @vkalavri

Strata 2014 – Claim your discount!

This year, the Strata conference takes place from 19 to 21 November 2014 in Barcelona. Besides Barcelona being a gorgeous city, the conference is another reason to visit for anyone with an interest in data! To give you an idea of what Strata is, I pulled a summary from the StrataConf website.

Moreover, we got a discount code! Get the link and code from the sponsors list on our meetup page!

About the O’Reilly Strata Conference

The best minds in data will gather in Barcelona this November for the O’Reilly Strata Conference to learn, connect, and explore the complex issues and exciting opportunities brought to business by big data, data science, and pervasive computing.

The future belongs to those who understand how to collect and use their data successfully. And that future happens at Strata.

Why You Should Attend

Strata Conference is where big data’s most influential business decision makers, strategists, architects, developers, and analysts gather to shape the future of their businesses and technologies. If you want to tap into the opportunity that big data presents, you want to be at Strata.

In a crowded market place of “Big Data” conferences, Strata has firmly established itself as the place where you go to meet people who think and do data science.

At Strata, you’ll:

Be among the first to understand how you can leverage the promise of this huge change, and survive the resulting disruption
Find new ways to leverage your data assets across industries and disciplines
Learn how to take big data from science project to real business application
Discover training, hiring, and career opportunities for data professionals
Meet face-to-face with other innovators and thought leaders

Experience Strata

Strata Conference delivers the nuts-and-bolts foundation for building a data-driven business—the latest on the skills, tools, and technologies you need to make data work—alongside the forward-looking insights and ahead-of-the-curve thinking O’Reilly is known for.

There was a palpable sense of excitement in the air. Obviously most of the attendees were already ‘data’ aficionados, but it’s clear that ‘data’ in various forms is on the radar for governments, large corporations, and the developer communities.

At Strata, you’ll find:

Three days of inspiring keynotes and intensely practical, information-rich sessions exploring the latest advances, case studies, and best practices
A sponsor pavilion with key players and latest technologies
A vibrant “hallway track” for attendees, speakers, journalists, and vendors to debate and discuss important issues
Plenty of events and opportunities to meet other business leaders, data professionals, designers, and developers

About O’Reilly

O’Reilly is followed by venture capitalists, business analysts, news pundits, tech journalists, and thought leaders because we have a knack for knowing what’s important now and what will be important next—and the ability to articulate the seminal narratives about emerging and game-changing technologies.

We don’t say this to brag. We say it to make a point: we’re not easily hypnotized by hype. We’ve seen the bubbles build and burst. For over three decades, we’ve been tapping into a deep network of alpha geeks and thought leaders to recognize the truly disruptive technologies amidst the fluff. So when we invest in a conference, we’re not just following the hype, we’re committed to creating a community around an issue we believe is transformative.

At O’Reilly, we think big data is not just important. We think it’s a game changer. That’s why we created Strata.

O’Reilly’s conferences forge new ties between industry leaders, raise awareness of technology issues we think are interesting and important, and crystallize the critical issues around emerging technologies. Understanding these emerging technologies—and how they will transform the way we do business—has never been more crucial. If you want to understand the challenges and opportunities wrought by big data, you’ll want to attend Strata.

Spark!

More than 80 people showed up at our last meetup, which focused on Spark. Because there are more and more signs that Spark will become the successor to Hadoop MapReduce, we invited some people who are already using Spark in production.

Andy gave an introduction to functional programming and Scala in just 45 minutes, which is definitely not enough to cover all the details. His slides can be found here.

Excellent meetup. The Scala introduction was so quick that it blew my mind but gave me enough information to follow the rest

(Eric Darchis)

Toni Verbeiren gave an introduction to Spark and demonstrated Spark from the command line. Follow the links to his slides and visualization code.

Very interesting mix of Scala, Spark and Use Case

(Peter Vandenabeele)

Gerard Maas showed us how Spark is used in production at Virdata.com, ending with a cool demo of their platform. His slides are available here: Spark-at-Virdata

It was Sparkling! (Radek O)

I am always amazed by the quality of the BigData.be and ScalaBe presentations. Big up to all of you ! (Frederic)

The presentations were recorded by Parleys.com and will be published in a “bigdata.be” channel. We’ll let you know when they become available over there.

Thanks to Ordina for the location and for providing food and drinks.

See you next time, we are always looking for venues and presenters.

Data science and R meetup

As announced on our meetup page, we had @JaredLander over from New York for a project at BigBoards. So we rose to the occasion and had him talk about Backends for Big Data in R. This comment from the meetup page says it all!

Nice fast-paced presentation with style from Jared
—Marcel Dumont

To complement this “gentle introduction to R” ;-), our second presentation was given by DataCamp. They introduced us to what DataCamp stands for, how their platform is architected and how teachers write courses (completely in R!). Again, your feedback says more than a thousand words …

This presentation impressed me the most. How a couple of students from the KULeuven can start-up their own company and be successful in filling the gap of training services in R
—Jean-Jacques DE CLERCQ

(Hey guys @datacamp, if you are reading this from over there in LA at the useR! conference, can we share your slides here?)

What did you learn? What do you think about this meetup?

23rd meetup – Data Science/ElasticSearch/Elections

On May 27th we had our 23rd meetup in Ghent kindly hosted by iMinds. We had a healthy mix of technical and business items.

The following presentations were given:

1. Introduction to the Brussels Data Science Meetup by Philippe Van Impe (30 min)

Philippe gave an introduction to this Meetup, the projects they are working on (with a possible link to big data) and their link to the non-profit DataKind organization (“data for good”).

His presentation can be found here.

2. Introduction to ElasticSearch by Eric Rodriguez (60 min)

In just under one hour, Eric gave an introduction to all features of ElasticSearch. During the presentation, Toon Vanaght showed how ES is used at data.be.

His presentation can be found here and you are also invited to check out the Belgian ElasticSearch Meetup group.

3. Election Bingo by Stijn Beauprez (30 min)

The vk14-bingo.be application can help you make up your mind before voting on Sunday: it gives you insight into the topics our political parties are talking about.

Stijn demo-ed the application and explained which technologies were used to implement this application.

His presentation can be found here.

4. De Verkiezingen by Philippe Kerremans (10 min)

The Deverkiezingen.be website is another application that uses social media to get insight into the mother of all elections. Philippe explained how they used ElasticSearch and D3.js, among other technologies.

His presentation can be downloaded here: BigDataMeetupDeverkiezingen

Many thanks to iMinds for providing the location and to DataCrunchers for providing drinks.

In the meantime, we now have more than 700 members!