39th meetup on Apache Kafka and Apache Ignite

Winter is coming, so we are picking up our best habits again by organising our 39th BigData.be meetup. We had 2 great presentations lined up for you on Apache Kafka and Apache Ignite. Read on!

Summer is already a long time ago, but so were our BigData.be meetups! Our mistake …

But “Winter is coming!” So we picked up our best habits again. The 39th meetup was on Tuesday, November 8th, 2016 at 19h30.

Cegeka was kind enough to offer the magnificent auditorium at their Hasselt campus as our venue. Thank you, Rutger Claes, for organising this!

The attendance this time was a cozy bunch of people! It turned out to be a most interesting and interactive meetup.

Really interesting talks by Daan and Mathias; strongly recommended.
Bjorn

Streams++, A complete streaming framework using Apache Kafka

Daan Gerits, CTO at BigBoards

Daan introduced the most important components in the Apache Kafka ecosystem.

Build apps, not jobs

That quote eloquently compares Kafka to the other Big Data application patterns.

The Apache Kafka ecosystem, with Kafka at the center, surrounded from left to right by Kafka Security, Schema Repo, Kafka Proxy, Kafka Connect and Kafka Streams
The Apache Kafka ecosystem

What is Apache Kafka?

Kafka is a distributed streaming platform: “a message broker with a twist.” What does that mean? Producers put messages on Topics, whereas Consumers process messages from Topics sequentially. A message is a simple byte array that carries a timestamp, a key and a value. The Topics themselves, on the other hand, are more like datastores: they are persisted to disk to keep the messages available. Partitioning and replication make Topics highly available. Kafka is fast!
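To make the “datastore” idea a bit more tangible, here is a minimal in-memory sketch of a partitioned, append-only Topic. It is a toy model, not Kafka's client API: the `Topic` and `Message` classes, the method names and the CRC32-based partition choice are all assumptions made for illustration.

```python
import time
import zlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    key: bytes
    value: bytes
    timestamp: float


@dataclass
class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # Stand-in hash (assumption); a real client uses its own partitioner.
        return zlib.crc32(key) % self.num_partitions

    def produce(self, key: bytes, value: bytes) -> tuple:
        """Append a message and return (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append(Message(key, value, time.time()))
        return p, len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int) -> list:
        """Read messages from `offset` onward; reading removes nothing."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
p1, o1 = topic.produce(b"sensor-1", b"21.5")
p2, o2 = topic.produce(b"sensor-1", b"21.7")
first_read = [m.value for m in topic.consume(p1, 0)]
second_read = [m.value for m in topic.consume(p1, 0)]
```

Messages with the same key land in the same partition and keep their order, and a consumer can re-read from any offset because consuming does not delete anything.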

Kafka Connect is a framework that integrates Kafka with other systems. Its purpose is to make it easy to add new systems to your stream data pipelines. Source Connectors import data from another system by putting them as messages on a Topic. Sink Connectors, on the other hand, read data from a Topic and write them to a target system.

Kafka Streams is an open-source solution to build and execute powerful stream processing functions. If you use Kafka for stream data transport, Kafka Streams can immediately add stream processing capabilities. Kafka Streams doesn’t even need a separate compute cluster for distributed processing! A Stream reads messages from a topic, transforms them and puts the results on another topic.
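The read, transform, write loop can be sketched in a few lines. This is only a conceptual model with plain Python lists standing in for topics; `run_stream` and its parameters are hypothetical names, not part of Kafka Streams.

```python
def run_stream(source, transform, sink, start_offset=0):
    """Apply `transform` to each (key, value) in `source` and append to `sink`.

    Returns the next offset to read, so a later call can resume
    where this one stopped.
    """
    for key, value in source[start_offset:]:
        sink.append((key, transform(value)))
    return len(source)


def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


celsius_topic = [("sensor-1", 20.0), ("sensor-2", 25.0)]
fahrenheit_topic = []
next_offset = run_stream(celsius_topic, to_fahrenheit, fahrenheit_topic)

# New input arrives later; only the unread part is processed.
celsius_topic.append(("sensor-1", 30.0))
next_offset = run_stream(celsius_topic, to_fahrenheit, fahrenheit_topic, next_offset)
```

Because the function only needs an input, an output and an offset, it runs as an ordinary process, which is the point of “build apps, not jobs”.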

What is Apache Kafka’s appeal?

Apache Kafka is elegantly designed using only simple constructs:

  • Connect and Streams are just libraries.
  • Connect and Stream apps are just simple apps or processes.
  • Apps are super easy to deploy.
  • Just spawn many instances of the same app for more throughput.
  • Kafka apps and orchestration tools are a match made in heaven. Think of Mesos DC/OS, Kubernetes, Docker Swarm and the like.
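Why does spawning more instances give more throughput? Roughly because the partitions of a Topic get divided among the running instances. The sketch below assumes a simple round-robin split; `assign_partitions` is a hypothetical helper and does not reproduce Kafka's own assignment strategies.

```python
def assign_partitions(partitions, instances):
    """Spread partitions over instances round-robin (illustrative strategy)."""
    assignment = {instance: [] for instance in instances}
    for i, partition in enumerate(sorted(partitions)):
        owner = instances[i % len(instances)]
        assignment[owner].append(partition)
    return assignment


one_instance = assign_partitions(range(6), ["app-0"])
three_instances = assign_partitions(range(6), ["app-0", "app-1", "app-2"])
too_many = assign_partitions(range(2), ["app-0", "app-1", "app-2"])
```

In this model, going from one to three instances cuts the work per instance to a third, while an instance beyond the number of partitions simply receives nothing to do.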

Here is Daan’s slidedeck …

Sequence data at warp speed with Apache Ignite

Mathias Lavaert, Data Engineer at Dataminded

Mathias and his team had to crack a challenging customer problem:

How to do interactive querying on sequence data at scale?!

We all know that sequential data and the processing thereof are hard to deal with. Typical use cases:

  • Sales history of a customer in retail business
  • Browser activity of a visitor on a website
  • Events generated by a sensor in IoT
  • DNA sequence of a particular species in genomics

Typical operations on sequential data are: align, diff, downsample, outlier detection, min/max, avg/median, slope, …
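A few of these operations are easy to sketch once a sequence sits in one place as a time-ordered list of `(timestamp, value)` pairs. The helper names and the sample data below are made up for illustration; they are not taken from the talk.

```python
from statistics import mean, median


def downsample(seq, bucket):
    """Average the values per time bucket of width `bucket`."""
    buckets = {}
    for t, v in seq:
        buckets.setdefault(t // bucket * bucket, []).append(v)
    return [(start, mean(vs)) for start, vs in sorted(buckets.items())]


def slope(seq):
    """Least-squares slope of value against time."""
    ts = [t for t, _ in seq]
    vs = [v for _, v in seq]
    t_bar, v_bar = mean(ts), mean(vs)
    num = sum((t - t_bar) * (v - v_bar) for t, v in seq)
    den = sum((t - t_bar) ** 2 for t in ts)
    return num / den


seq = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0)]
values = [v for _, v in seq]
summary = {
    "min": min(values),
    "max": max(values),
    "avg": mean(values),
    "med": median(values),
    "slope": slope(seq),
    "down": downsample(seq, 2),
}
```

Each of these needs the whole sequence, in order, which is why the physical layout of the data matters so much in what follows.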

Let’s try to put Apache Spark to work?

The first plan of attack was to put Apache Spark straight to work. But the typical layout of the sequence by timestamp is far from ideal for analysis, because the required group by is both cumbersome and expensive.

A table showing time series data in the 'observations' layout, i.e. sorted first by key, then by timestamp, each row pointing to a value
Time series ‘Observations’ layout
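To see why the group by is needed at all, here is a small sketch of turning ‘observations’ rows into one sequence per key. It runs in a single process on made-up data; in a distributed engine the same step means moving all rows of a key to one place, which is where the cost comes from. `to_sequences` is a hypothetical helper name.

```python
from collections import defaultdict


def to_sequences(observations):
    """Turn (key, timestamp, value) rows into {key: [(timestamp, value), ...]}."""
    sequences = defaultdict(list)
    for key, ts, value in observations:
        sequences[key].append((ts, value))
    # Sort each sequence by timestamp so order-dependent operations work.
    return {key: sorted(points) for key, points in sequences.items()}


observations = [
    ("a", 2, 12.0),
    ("b", 1, 7.0),
    ("a", 1, 10.0),
    ("b", 2, 9.0),
]
sequences = to_sequences(observations)
```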

First alternative. Spark TS builds on top of Spark to add a set of abstractions for manipulating time series data. But depending on this project is risky: it is immature and maintained by a single developer.

Second alternative. HuoHua. Right now, it is only a concept laid out in this presentation. HuoHua is Chinese for ‘spark’ 🙂 The authors are a team of 8 engineers at TwoSigma (ed.: one of the first customers of BigBoards).

Anyway, time series do not match well with Spark’s typical data model. They cause too much memory pressure and too many shuffle issues.

Apache Ignite to the rescue!

Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.

A jigsaw puzzle containing all Apache Ignite's components: data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, file system, messaging, events and data structures
The Apache Ignite component overview

If you look at the picture above, you can see that Apache Ignite covers a lot of ground. Mathias and his team specifically used the Data Grid and Compute Grid components to build their proof of concept.

An image says more than a thousand words …

Apache Ignite's in-memory data grid stores an array of keys and values across the memory of multiple nodes to make the data highly available, using a back-end database as write-through and read-through persistence engine
Apache Ignite’s in-memory data grid
Apache Ignite's compute grid distributes the compute workload across its various nodes, each calculating an intermediate result, together delivering the solution in a fraction of the time of a sequential calculation
Apache Ignite’s compute grid

The advantages of Apache Ignite that appealed to the Dataminded team are:

  • Recognizable Java APIs
  • Computations are simple Java Callables returning a Future<T>. In an interactive environment that is way more flexible than launching Apache Spark.
  • Data affinity allows Apache Ignite to execute code on the nodes where the data resides. This results in major speed improvements.
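The combination of “a callable returning a future” and data affinity can be sketched as follows. This is a toy model written with Python's `concurrent.futures` as an analogue of Java's Callable and Future; `ToyGrid` and `call_where_data_lives` are hypothetical names and not Apache Ignite's API.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class ToyGrid:
    """Toy stand-in for a data + compute grid: each 'node' owns part of the keys."""

    def __init__(self, num_nodes):
        self.stores = [{} for _ in range(num_nodes)]
        # One single-threaded executor per node, so a task runs "on" that node.
        self.executors = [ThreadPoolExecutor(max_workers=1) for _ in range(num_nodes)]

    def node_for(self, key):
        # Stand-in placement rule (assumption): hash the key onto a node.
        return zlib.crc32(key.encode()) % len(self.stores)

    def put(self, key, sequence):
        self.stores[self.node_for(key)][key] = sequence

    def call_where_data_lives(self, key, fn):
        """Submit fn to the node that owns `key`; returns a Future immediately."""
        node = self.node_for(key)
        return self.executors[node].submit(fn, self.stores[node][key])

    def shutdown(self):
        for executor in self.executors:
            executor.shutdown()


grid = ToyGrid(num_nodes=3)
grid.put("customer-42", [12.0, 7.5, 30.0])
future = grid.call_where_data_lives("customer-42", max)
result = future.result()  # blocks until the owning node has finished
grid.shutdown()
```

The function travels to the node that holds the sequence, so the sequence itself never has to cross the network; only the small result does.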

The benefits that Apache Ignite brought to Mathias’ problem domain are:

  • Processing sequences without the loss of context
  • Collocating sequences results in 0 network transfer
  • Analyzing a single sequence is fast and convenient
  • Very high level of control

Here is Mathias’ slidedeck …

PS: Do scroll to the last page …we love Dataminded’s corporate values!!!

http://bigdata.be/wp-content/uploads/2016/11/Apache-Ignite.pdf

The After-party

After the presentations we had a group discussion on what our members want or expect from the community. We continued this discussion over a beer in the bar across the street. Here are some thoughts:

  • Do we want to organize and participate in hackathons to get hands-on experience?
  • Who can propose meaningful ideas to hack on?
  • Do we want to see more business use cases during the meetups?
  • Other?

That was my 1st BigData.be meetup. I liked it a lot! Engaging talks at a great location/venue. Keep it coming. I’ll plan some time for a beer next time 🙂
Mikhail Shilkov

After the meetup, Tom Bayens proposed a possible topic for a hackathon: build the best investment strategy using historic and real-time stock quote data. As far as Tom is concerned, there seems to be an ideal match between the necessary calculations and the technologies that fit BigData.be.

Share your own ideas to grow our community!

O’Reilly Strata Europe 2014 – impressions from a PhD student

Guest post by Vasia Kalavri

I had marked the dates for the Strata + Hadoop World conference since I found out I was coming to Barcelona for an internship, about 5 months ago. Yet, as a PhD student, I knew I had no chance of finding a way to pay such a crazy registration fee… unless I constrained my diet to rice and water, a (realistically speaking) impossible goal when living in Barcelona, surrounded by such good food.

Then, last Monday morning, while attending the tutorial sessions at papis.io, I received an e-mail with the following content: “1 FREE 2-day pass to enter the conference on Thursday and Friday! Net value €779,00!!!” and (oh my god) it was not a lie! It was a message from the Belgian Big Data community, offering a free 2-day pass to Strata to the first member to reply to this e-mail after 14h00 CET! I immediately looked at my watch: 12h24. I quickly made a draft reply and waited patiently for time to go by. At 13h59, I opened the draft reply, waited for the last minute to pass and pressed the “send” button, hoping that my mobile data connection wouldn’t give up on me. And… bingo! A few minutes later, I was informed that I was indeed the first one to reply. My message had arrived at 14h00s06 :))

In the remainder of this post, I will provide my personal, short and biased summary of the event.

General impressions

My first thought when entering the main room on Thursday morning was “wow, this is huge”. I’ve been to several (mostly academic) conferences, but this one had both the largest number of attendees and the most fascinating venue. And, by all means, it looked nothing like an academic conference. At least in the beginning, it was more of a show: fancy lighting, music to introduce the keynote speakers, text-free slides! To be honest, for a moment I thought this would all be a huge marketing campaign and I would waste my time. In the end, I have to admit that I was very happily surprised by the technical level of the talks and by the things I learned. As a systems person, I don’t often get to attend events that focus on use-cases and applications. Getting to hear about real-world use-cases was very inspiring for me!

Favorite Talks

I tried to avoid the business and industry tracks and mostly attended the hadoop, tools and data science tracks. Among the keynotes, I especially enjoyed Geoff McGrath and Camille Fournier on Thursday and Jordan Tigani on Friday. I would definitely suggest watching them: https://www.youtube.com/playlist?list=PL055Epbe6d5Y8aARKdXVVtJnEttlhsRyf.

Among the rest of the talks, my favorites were:

  • “SAMOA: A Platform for Mining Big Data Streams”, by Gianmarco De Francisci Morales
  • “High-Level Abstractions Make Big Data Useful for Real People” (even though the talk content was quite different from what the title suggests), by Melissa Santos
  • “How Search Can Save Your Hadoop Investment and More”, by Shay Banon
  • “Realtime Data Analysis Patterns”, by Mikio Braun
  • “RT-Giraph: Online graph Mining Simplified”, by Georgios Siganos.

Most of the slides are already available here: http://strataconf.com/strataeu2014/public/schedule/proceedings

Misc and +1’s

  • I was really happy to see so many great female speakers! Keep it up, organizers!
  • I got a couple of really nice T-shirts, +1 to Cloudera for providing female sizes!
  • +1 for the food and the great sea view of the banquet room.
  • It turns out I was not the only one with a university affiliation. I met a fellow PhD student from ULB there :))

Overall, Strata was a very enjoyable experience for me. Who knows, I might even consider submitting a talk next year!
Finally, I’m really grateful to BigData.be for the free pass and, of course, to my mobile operator for delivering my e-mail with such great precision!

Short Bio

Vasia Kalavri is a PhD student at KTH, Sweden and UCL, Belgium. She is currently doing an internship at Telefonica Research and lives in Barcelona, Spain. Vasia works in the area of distributed data processing, systems optimization and large-scale graph analysis. She is a committer of Apache Flink (flink.incubator.apache.org) and also contributes to Grafos.ml (grafos.ml).

Website: http://web.ict.kth.se/~kalavri/
Twitter: @vkalavri