It’s safe to say that at the present moment, machine learning is big news. In the past week, we’ve seen Tumblr getting in the game, Google making further machine learning acquisitions, and Nervana announcing £3.3 million in funding for machine learning initiatives. When you think of machine learning, Hadoop may not be the first name that comes to mind. But Sean Owen, Cloudera’s Director of Data Science, suggests it should be. In the first part of our interview with Owen, we discuss the relationship between machine learning and Hadoop, the future of Apache Mahout and why machine learning has become such hot property.


Give us some brief background on the relationship between Hadoop and machine learning.

As a platform and a technology, Hadoop in the beginning was just a place to store data, and maybe write some basic queries and do some processing over that data in batch. It evolved over time to the point where companies like Cloudera, and people like me, could perform machine learning tasks on data in Hadoop, and try to figure out what that means and how to do it. In that sense, it’s a story about what enterprises are actually able to get from a standard open-source model like Hadoop today.

Tell us more about your work on machine learning for Cloudera.

My work, and some of the problems that we try to solve at Cloudera, tend to be operational rather than exploratory in nature. A lot of what we see in the press, and a lot of the discussion about data science and machine learning, is exploratory or investigative analytics. You can do these things in Hadoop. When we hear about people using business intelligence tools, and even applying deep learning, that’s interesting, and those are things you can do in Hadoop. But it’s not really what customers need to consume. For example, it’s one thing to tell a customer: ‘you can connect R to a cluster, and you can build a fancy model of customer churn or fraud’. But what they want to do, inevitably, is productionise.

A lot of our work at Cloudera, with regards to Hadoop and machine learning, involves productionising machine learning, rather than, say, inventing new algorithms. It’s a little more prosaic, but I think it actually is important. Take some of these ideas that we hear about every day, like Twitter buying Madbits: this is a great piece of technology, but the rest of the story (how that gets productionised and used at large scale) is the hard part. It’s this puzzle we try to solve for consumers on Hadoop. We have a couple of tools within the Hadoop ecosystem we use for this, including Apache Mahout.

Tell us more about Apache Mahout; I understand they’re moving away from using MapReduce right now?

Yeah – Mahout at this point is a fairly “old” project, just because it has been around for 5 years and that’s a lifetime in this ecosystem. It’s MapReduce-based, and it’s a project whose time is kind of finished, I think. Instead, the focus for people building operational machine learning systems on Hadoop is on projects like Apache Spark. Spark provides a more complete substrate for model building and for model serving at run time or in near real time, and it also includes a small machine learning library.

So for me the focus is mostly on helping people actually build their own algorithms, build their own implementations. This is probably the most important and exciting technology within the platform.

Tell us more about your company Myrrix, which was acquired by Cloudera.

So this happened a year ago. I’m a long-time committer to the Apache Mahout project; maybe two-plus years ago, I decided that I wanted to build a product, in a small company, around a next-generation version of the project. As it happens, the first focus was recommenders; this is one of the most common use cases customers come to us with. This proved to be a powerful idea because at the time, Hadoop was something which you could only use as a batch-mode processing engine to build models offline, to make recommendations offline. Of course, the recommender problem is very much a real-time problem; we need to learn in real-time, we need to answer queries in real-time. That was the focus of the technology that was acquired by Cloudera, and that we built into the open-sourced Oryx project. That continues to evolve; that continues to be where my heart is, and where I think the important gap is still on the platform. We need a good infrastructure and substrate for not just building models, but serving models on a platform like Hadoop, at large scale and in real-time. And that’s something that we’re now rebuilding on top of Spark, as the next-generation platform within Hadoop, for building these kinds of things.

There has been a trend recently of machine learning start-ups being acquired by industry-leading enterprises. Twitter recently acquired Madbits, as you mentioned earlier, Google acquired DNNResearch and Jetpac, and Pinterest acquired Visual Graph. Why do you think machine learning start-ups have become hot property, and why are all the big businesses taking notice?

I’ve wondered the same thing myself. These acquisitions seem to be very similar as well. We’ve got a small group of people without a real product to sell, but they have a very interesting take on a hot topic in research, like deep learning or multi-layered convolutional neural nets. Interestingly, all of these small start-ups have managed to make deep learning do something related to image recognition. Image classification is a problem the tech giants have, so it’s not surprising that they want to buy these technologies at almost any price. In a way, I’m not sure this is reflective of a lot of trends in machine learning and in the industry, even if these are the most visible transactions in this space. All of these seem to follow a similar pattern: tech giants buying a deep learning start-up of a couple of people to enhance some kind of image recognition capability. It’s interesting, but it’s not what 90% of companies out there do when they do machine learning.

Read the second part of our interview with Sean here, where we discuss the future of deep learning and neural networks, and how he foresees the relationship between machine learning and enterprise evolving.

(Image credit: wpdang)
