Gerd König is a Data Engineer and instructor for YMC AG. YMC is a specialized service provider for web technologies, providing consulting, creation, custom development and safe (highly available) hosting, as well as operations and maintenance services for online projects. Dataconomy recently sat down with Gerd at Berlin Buzzwords to discuss all things Hadoop: the recent trend for SQL-on-Hadoop, his experiences as a Hadoop/Cloudera instructor, and what he thinks the future holds for this technology.
Could you tell us a little bit about yourself?
My name is Gerd Koenig. I work for YMC in Switzerland as a data engineer, mainly focused on building data pipelines, workflows and everything related to Hadoop clusters, as well as operating, planning and sizing clusters. My talk was about the current hype around SQL-on-Hadoop, and it mainly focused on Shark and Spark SQL, these upcoming trends.
There are a lot of other products out there that offer SQL-on-Hadoop, like Impala for instance. What do you think that Shark offers in particular? What’s unique about it?
The biggest benefit you will get from Shark is its in-memory computing. The second biggest is its interaction with other components of the Spark ecosystem, like the machine learning libraries or graph libraries. Those two parts are the biggest benefits.
As part of your work with YMC, I understand that you do some training focused around Hadoop.
Yes. I was certified as a Cloudera administrator trainer, so I teach the administration course and a data analysis course to complement the portfolio.
So when people are coming into the training program, what is usually their biggest weakness or the area they know least about?
The biggest weakness is a lack of understanding of what Hadoop is at all. You have a lot of people coming from the relational database world or the systems engineering world. They have to rethink almost every concept they have used in the last 20 years. This is completely new for them, and it’s amazing for us to convince them to use the Hadoop stuff.
I was reading somewhere that a lot of people who are working with relational databases come from an engineering background, and for Hadoop you need a grounding in applied mathematics as well. Is that something you’d agree with?
You need all of them. You need the engineering people for setting up the clusters and maintaining them. You need, let’s say, the integration guys, who build these data pipelines and workflows and who integrate the third-party tools. And then, of course, come the statistical business analysts, who want to use their existing tools. For them, it doesn’t matter if it’s Hadoop or a relational database. They just want to use their tools and they’re happy.
You also published a paper last month about social media monitoring in big data technologies. Can you tell us more about that?
Yes. The article is about a self-implemented system that crawls websites and stores the contents in a database. There are additional mechanisms to send alerts if a document contains predefined words. So one can set up a list of specific words, and if a document comes in that matches, you get a real-time alert. The underlying techniques use Solr as the indexing mechanism, together with Hadoop and HDFS.
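To make the alerting idea concrete, here is a minimal sketch in Python. It is not the system described in the article: it replaces Solr, Hadoop and HDFS with a simple in-memory check, and the names `WATCHLIST`, `tokenize` and `find_alerts`, as well as the sample words and documents, are hypothetical.

```python
import re

# Hypothetical watchlist (lowercase); the real word list is not given in the interview.
WATCHLIST = {"outage", "recall", "lawsuit"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_alerts(document, watchlist=WATCHLIST):
    """Return the sorted watchlist words that occur in the document."""
    return sorted(set(tokenize(document)) & set(watchlist))


if __name__ == "__main__":
    # Stand-in for freshly crawled pages; a real crawler would supply these.
    crawled = [
        "Product Recall announced after outage.",
        "Quarterly results were published today.",
    ]
    for doc in crawled:
        hits = find_alerts(doc)
        if hits:
            print(f"ALERT {hits}: {doc}")
```

In a setup like the one Gerd describes, the matching step would presumably be handled by the search index rather than by a set intersection, which is an assumption on our part.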
Moving forward, what kind of projects are you working on at the moment? What will you be releasing in the future?
Currently I have been running workshops and proofs of concept, just before starting on a real project. This is the current trend. Companies are going to use Hadoop and the big data stuff, so they need to know the basics. How do we get started? What can we do with this new technology? How can we fit our business requirements to that technology, and how can we integrate our systems and tools into this new world?
So YMC helps with finding the solution, as well as providing the software?
Yes. This is one business case for us: helping customers find a solution, implement it and, for example, build manual implementations on top of that.
There’s been a huge amount of hype over big data over the past couple of years. I think the affordability of the technology means that pretty much everybody has access to it now, and it’s become a lot more ubiquitous. So where do you think this is headed? Do you have any predictions for the future of big data?
It’s hard to predict the future. Hadoop will be well established soon; it is already becoming more and more enterprise-ready. As I mentioned earlier, the most important thing is that users have to adapt to this new framework and to a new way of working with big, unstructured data, and learn how to process and analyze it. This is the biggest challenge, but Hadoop will be much more widely used in the future.
(Image credit: Bootstrap)
YMC’s clients include Swiss Radio and Television (SRF), ETH Zurich, SOS-Kinderdorf Germany and WWF Switzerland.