A multi-coloured squirrel may not seem like the most obvious choice of logo for a data processing technology; then again, the team behind Apache Flink have hardly done things by the book. What started out as a university research project evolved into a fully-fledged company, complete with artfully-decapitalised name (data Artisans), and an Apache Software Foundation Top-Level Project. As Flink grew in Dataconomy’s back yard in Berlin, we had the pleasure of meeting some of the data Artisans, and discussing with CEO Kostas Tzoumas how Flink transformed from a research project to a Top-Level Project.


Give us a brief introduction to yourself & your work.

My career has been focused on building innovative data-intensive systems, first in academia and now in practice. I worked on the Stratosphere research project, the ideas of which later created what is now Apache Flink. Last summer, I co-founded data Artisans, a Berlin-based company, to build the next generation of data processing technology under the umbrella of the Apache Flink project.

Apache Flink started out in academia, and has now graduated to an Apache Top-Level project; talk us through your journey.

Indeed, academic research on what is now Flink started back in 2009/2010. While doing research we also put the system out on GitHub as open source and saw a lot of interest in the community. Last April, we proposed Flink to the Apache Software Foundation as an Incubator project. Flink graduated from the Incubator and became a top-level project quite fast, in just 8 months. I am extremely happy about this outcome, which reflects the fast growth of the community and gives a stamp of approval by the ASF to our work on Flink.

Why did you decide to submit Flink to the Apache Software Foundation? What would you consider to be the main benefits of working with ASF?

The ASF provides infrastructure and legal support, and access to people who are very experienced in open source. Most importantly, I feel that the values of community and meritocracy that are central to the ASF help software projects stay healthy. Not that this is the only way to build an open source project, but it seems to me to be a very effective one. The ASF gave us a blueprint to follow, which we might otherwise have had to rediscover through trial and error if we had started the project from scratch without an umbrella organization.

Open-sourcing continues to be a cornerstone of the database tech community; why do you think this is?

This was actually not always the case; for many years the data management field was dominated by closed-source products. The rise of open source in the database community is perhaps an artifact of the popularity of the Hadoop project. I find this great! I believe that enterprises realized that their data infrastructure is so critical for them that they cannot afford to get locked in with a closed solution.

You’ve also established a company, data Artisans. Why did you decide to do this?

We started data Artisans because we believe that Flink can serve as the foundation of the next-generation data processing technology, and the best way to make this happen is through a company that is dedicated to this cause. data Artisans is a company that was born from the Flink community itself.

As well as graduating to a Top Level Project and establishing a company, you’ve also recently made considerable updates to Flink; talk us through these.

Sure. Flink has made significant progress over the last few months. A lot of work has gone into the system internals to improve the reliability and scalability of the system. This work is simply too much to list. As a data point, we recently tested Flink on Google Compute Engine for a recommendation training use case, and managed to scale Flink to very large problem sizes (roughly 6 times the number of movie ratings Netflix reported having in 2012). You can see these results here.

At the same time, many user-facing features were added to the system. The work on Flink Streaming, the part of Flink that ingests streaming data sources, is progressing very fast; a new library for graph processing is being added; and a new effort for Machine Learning functionality on Flink has started.

How does Flink compare/differentiate itself from other ASF projects, such as Apache Spark?

Flink starts from a different point in the system design space from other systems like Spark, Storm, or Tez. A lot of Flink’s features lead to operational benefits such as robust execution without memory configuration, but also end-user benefits such as enabling streaming applications with flexible window semantics. That said, I think that the world of data processing systems does not lend itself to monopolies. We often see Flink alongside Spark alongside MapReduce being used in the same cluster.

How wide has adoption of Flink been so far?

Adoption is growing fast. I know of the first company that is now using Flink in production, and several others are trying it out. Top-level project status is now giving a big boost to adoption.

What’s the roadmap for Apache Flink?

Recently, the Flink community discussed a roadmap for 2015 and posted it on the Flink wiki. The roadmap includes a lot of domain-specific application libraries on top of Flink such as Machine Learning and graph libraries, as well as integration with other projects.

What are you most excited about for Apache Flink & data Artisans in 2015?

I think that 2015 will be the tipping point for Flink. I see the first large-scale users putting the system under serious loads, new use cases such as data streaming being explored on top of the engine, and the project getting into the mainstream.

What do you envision in the near future of database tech, and data science in general?

I think that we will see more dynamic data management techniques go mainstream quite soon. The current workflow of data collection, storage, and analysis is very static and misses the most recent information that typically contains the most value.

I also see a lot more new technologies going operational, and people starting to care more about issues such as debuggability, robustness, and predictability.


(Image credit: Apache Flink)
