Choosing A Streaming Framework

Venkatesh Iyer · Published in Split Brain · Jul 6, 2017


NOTE: This is not a full-fledged feature comparison of streaming frameworks. It’s more about how to make a decision even without doing one! :P

To start off my project, I had to pick one of the many, many stream processing frameworks: Storm, Heron, Samza, Flink, Spark, Dataflow/Beam, or some other thing that I haven’t heard about yet!

Fortunately, I knew I wouldn’t make a biased decision, because I hadn’t had experience with any of them :)

The biggest constraint I had to abide by was a team principle: own and operate as little infrastructure as possible (small team, lots of other important business-specific stuff to build). This was also the guiding principle behind our decision to move our Hadoop + Spark infrastructure from an AWS EC2 cluster to Google Dataproc.

So that reduced the list to Google Dataflow and Spark Streaming — both of which are available as a managed service on GCP. No matter how good the others are, we are not ready for them yet!

Here’s how the two compare (at a very high level, with the only practical experience being a sample program written in both frameworks):

One thing I know from experience is that no platform service “just works”; you’ve got to make it work for you. So even though that comparison clearly showed Spark Streaming as the lesser candidate, I decided to use it based on one property alone: code reuse.

There is no way we are going to port all our batch processing jobs to Dataflow; but on Spark, once the Streaming boilerplate is built, we’d be expanding it into a lot of dual-mode (batch + streaming) processing. And the fact that, as a team, we are reasonably well versed in regular Spark is an added advantage.
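The dual-mode idea boils down to keeping the transformation logic in one shared function and calling it from both a batch driver and a per-micro-batch streaming driver. Here’s a minimal sketch of that shape; plain Python stands in for the Spark APIs, and all the names (`enrich`, the record fields) are illustrative, not from our actual jobs.

```python
def enrich(records):
    """Shared business logic, written once and reused in both modes."""
    return [
        {"user": r["user"], "cents": r["amount"] * 100}
        for r in records
        if r["amount"] > 0
    ]

def run_batch(dataset):
    """Batch mode: apply the shared logic to the whole dataset at once."""
    return enrich(dataset)

def run_streaming(micro_batches):
    """Streaming mode: apply the same logic to each incoming micro-batch."""
    results = []
    for batch in micro_batches:
        results.extend(enrich(batch))
    return results
```

In real Spark the two drivers would differ (e.g. reading a static source versus a streaming one), but the point stands: only the thin driver layer is mode-specific, while the bulk of the job logic lives in shared functions.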

How big a mistake is this? Only time will tell, or a late 2017 blog post!
