Kafka: "git branch" for production streams

I switched from SVN to Git for version control years back largely because it made experimenting a blast. Creating branches, viewing diffs, stashing - it was fun to veer off the highway and explore dirt roads.

Making it easy to experiment on my dev box is great, but rolling those changes to production is scary. Is there a way to compare the performance of application 1.4 (the steady workhorse) vs. application 2.0 (the silky smooth refactor) against live data?

Yes, there is. Let's explore.

Meet Kafka

In the simplest terms - Kafka is a message broker. It receives messages from producers, stores them in topics, and makes those topics available to consumers. Each message is written to disk and replicated within the cluster. Think of it as a super-fast commit log.
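For a concrete feel, here's roughly what publishing to a topic looks like from Ruby - a minimal sketch assuming the ruby-kafka gem and a broker at localhost:9092 (both assumptions on my part, not part of the original setup):

```ruby
require "kafka"

# Connect to the cluster and append one message to the "metrics" topic.
# The broker writes it to the topic's on-disk log and replicates it.
kafka = Kafka.new(["localhost:9092"], client_id: "example-producer")
kafka.deliver_message("cpu_load|0.72", topic: "metrics")
```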

Isn't Kafka just a queuing system?

No. In a traditional queuing system - a message is popped off the queue once it is assigned for processing. The queue is responsible for keeping track of the status of each job.

Kafka is different. Kafka holds on to each message. When a message is read, it isn't destroyed - it remains available to other consumers. Each consumer is responsible for tracking which messages it has already processed (its offset in the topic). Kafka is just the repository.
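Here's a sketch of what that looks like from the consumer's point of view (again assuming the ruby-kafka gem; the group name is made up):

```ruby
require "kafka"

# Each consumer group tracks its own offset in the topic. Reading a message
# never removes it, so any number of groups can process the same stream.
kafka    = Kafka.new(["localhost:9092"], client_id: "example-consumer")
consumer = kafka.consumer(group_id: "group-a")
consumer.subscribe("metrics", start_from_beginning: true)

consumer.each_message do |message|
  puts "offset #{message.offset}: #{message.value}"
end
```

Run a second copy of this script with a different group_id and it will receive the exact same messages, starting from the beginning of the topic.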

Trying Kafka out

The Kafka site has a great quickstart, but instead, I'll create a test that feels more real. In Scout's world, we live & breathe metrics - and often have a need to try out new strategies for time series data. Let's set up a stream of time series data, push it through Kafka, and write the results to separate databases. Just for grins - we'll say that Mongo is our existing database, and Postgres is the one we want to test.

We will send a random metric every 3 seconds - this will be our producer. For our topic, we just need a consistent name - "metrics" - so that each of our consumers knows where to look. Lastly, our databases will be our consumers: one writing to our existing Mongo instance and another writing to our test Postgres instance. I'll use containers to create my environment:

Database containers: one for Mongo and one for Postgres.

Additionally, I've created a few gists to test out the major players - the producer, the Mongo consumer, and the Postgres consumer:
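The gists themselves aren't embedded here, but the three pieces look roughly like this - a single-file sketch assuming the ruby-kafka, mongo, and pg gems (hosts, database names, and the table schema are illustrative, and the original gists may differ):

```ruby
require "kafka"
require "mongo"
require "pg"
require "json"
require "time"

BROKERS = ["localhost:9092"]

# producer.rb - emit a random metric every 3 seconds
def run_producer
  kafka = Kafka.new(BROKERS, client_id: "metrics-producer")
  loop do
    metric = { name: "cpu_load", value: rand.round(3), created_at: Time.now.utc.iso8601 }
    kafka.deliver_message(metric.to_json, topic: "metrics")
    sleep 3
  end
end

# mongo_consumer.rb - write each metric to our existing Mongo instance
def run_mongo_consumer
  kafka    = Kafka.new(BROKERS, client_id: "mongo-consumer")
  mongo    = Mongo::Client.new(["localhost:27017"], database: "metrics_test")
  consumer = kafka.consumer(group_id: "mongo-writers")
  consumer.subscribe("metrics", start_from_beginning: true)

  consumer.each_message do |message|
    mongo[:metrics].insert_one(JSON.parse(message.value))
  end
end

# pg_consumer.rb - write each metric to the Postgres instance we're evaluating
# (assumes a `metrics` table with name, value, and created_at columns)
def run_pg_consumer
  kafka    = Kafka.new(BROKERS, client_id: "pg-consumer")
  pg       = PG.connect(host: "localhost", dbname: "metrics_test",
                        user: "postgres", password: "postgres")
  consumer = kafka.consumer(group_id: "pg-writers")
  consumer.subscribe("metrics", start_from_beginning: true)

  consumer.each_message do |message|
    metric = JSON.parse(message.value)
    pg.exec_params("INSERT INTO metrics (name, value, created_at) VALUES ($1, $2, $3)",
                   [metric["name"], metric["value"], metric["created_at"]])
  end
end

# Each piece runs as its own process, e.g. `ruby kafka_demo.rb producer`
case ARGV[0]
when "producer" then run_producer
when "mongo"    then run_mongo_consumer
when "pg"       then run_pg_consumer
end
```

Each consumer uses its own group ID, so both receive every metric the producer emits.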

Finally, a quick cheatsheet of the Docker commands I used to kick off the containers:
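The original cheatsheet isn't reproduced here, but the commands were roughly of this shape (image names, ports, and environment variables are my assumptions - swap in whichever Kafka image you prefer):

```bash
# Mongo and Postgres from stock Docker Hub images
docker run -d --name mongo    -p 27017:27017 mongo
docker run -d --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=postgres postgres

# A single-node Kafka + Zookeeper; spotify/kafka is one convenient all-in-one image
docker run -d --name kafka -p 2181:2181 -p 9092:9092 \
  -e ADVERTISED_HOST=localhost -e ADVERTISED_PORT=9092 \
  spotify/kafka
```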

You'll need Ruby installed and a couple of gems (listed in the gists). Once you piece this all together, run the producer and each consumer in a separate terminal window. You'll see each consumer happily adding new metrics from the producer to its database.

Whew! A lot of pieces. The heart of it is simple: one producer writing to a single "metrics" topic, and two consumers reading from it independently.

If you take a step back, this sample proves a larger point. We've inserted a shim between our event stream and the applications processing the stream. We've got the same message being processed by two distinct entities. This means we can compare these two processes - without interrupting our existing stream. If one of the processes isn't improving the stack - just shut it down, and try a new one.

This is just the tip of the iceberg. Given the simplicity of Kafka's design, there are many other ways to use it. What if we had one consumer and multiple producers? Sounds like the basis for a log aggregator to me.

How about a website activity tracker? What if we wrote each type of user activity to its own topic (e.g. searches, page views, etc.)? We could set up different consumers - maybe give the marketing department a real-time feed of website activity. Perhaps another consumer for offline processing - all based on the same data.
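Hypothetically, the producing side of that could be as small as routing each event to a topic named after its type (sketched again with ruby-kafka; the topic and event names are made up):

```ruby
require "kafka"
require "json"

kafka = Kafka.new(["localhost:9092"], client_id: "activity-tracker")

# Route each kind of user activity to its own topic. A real-time marketing
# dashboard and an offline batch job can then each consume whichever topics
# they care about, at their own pace.
def track(kafka, type, payload)
  kafka.deliver_message(payload.to_json, topic: type.to_s)
end

track(kafka, :page_views, { user_id: 42, path: "/pricing" })
track(kafka, :searches,   { user_id: 42, query: "kafka vs rabbitmq" })
```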

TL;DR

If you've followed our blog posts for the past couple of months, you'll recognize that we're big fans of tools that give you the most flexibility for your infrastructure. Just like StatsD & Docker - Kafka is another great "Swiss Army knife". It gives you the ability to create a quick "checkpoint" in your data processing flow and try out new stack options without disruption.

Questions? Comments? Contact us here. For more updates on great tooling, follow us on Twitter.