Dataconomy

Machine Learning and Hadoop: How One of the Most Widely Used Big Data Technologies Has Evolved

by Eileen McNulty
September 8, 2014
in Machine Learning

It’s safe to say that at the present moment, machine learning is big news. In the past week, we’ve seen Tumblr getting in the game, Google making further machine learning acquisitions, and Nervana announcing £3.3 million in funding for its machine learning initiatives. When you think of machine learning, Hadoop may not be the first name that comes to mind. But Sean Owen, Cloudera’s Director of Data Science, suggests it should be. In the first part of our interview with Owen, we discuss the relationship between machine learning and Hadoop, the future of Apache Mahout, and why machine learning has become such hot property.


Give us a brief background into the relationship between Hadoop and machine learning.

As a platform and a technology, Hadoop in the beginning was just a place to store data, and maybe write some basic queries and do some processing over that data in batch. It evolved over time to the point where companies like Cloudera, and people like me, could perform machine learning tasks on data in Hadoop, and try to figure out what that means and how to do it. In that sense, it’s a story about what enterprises are actually able to access from a standard open-source model like Hadoop today.

Tell us more about your work on machine learning for Cloudera.

My work, and some of the problems that we try to solve at Cloudera, tend to be operational rather than exploratory in nature. A lot of what we see in the press, and a lot of the discussion about data science and machine learning, is about exploratory or investigative analytics. You can do these things in Hadoop. When we hear about people using business intelligence tools, and even applying deep learning, that’s interesting, and those are things you can do in Hadoop. But it’s not really what customers need to consume. For example, it’s one thing to tell a customer: ‘you can connect R to a cluster, and you can build a fancy model of customer churn or fraud’. But what they want to do, inevitably, is productionise.




A lot of our work at Cloudera, with regard to Hadoop and machine learning, involves productionising machine learning, rather than, say, inventing new algorithms. It’s a little more prosaic, but I think it actually is important. Take some of these ideas that we hear about every day, like Twitter buying Madbits: this is a great piece of technology, but the rest of the story (how that gets productionised and used at large scale) is the hard part. It’s this puzzle we try to solve for consumers on Hadoop. We have a couple of tools within the Hadoop ecosystem we use for this, including Apache Mahout.

Tell us more about Apache Mahout; I understand they’re moving away from using MapReduce right now?

Yeah. Mahout at this point is a fairly “old” project, just because it has been around for 5 years, and that’s a lifetime in this ecosystem. It’s MapReduce-based, and it’s a project whose time is kind of finished, I think. Instead, the focus for people building operational machine learning systems on Hadoop is on projects like Apache Spark. Spark provides a more complete substrate for model building and for model serving at run time or in near real time, and it also includes a small machine learning library.

So for me the focus is mostly on helping people actually build their own algorithms, build their own implementations. This is probably the most important and exciting technology within the platform.

Tell us more about your company Myrrix, which was acquired by Cloudera.

So this happened a year ago. I’m a long-time committer to the Apache Mahout project; maybe two-plus years ago, I decided that I wanted to build a product, in a small company, around a next-generation version of the project. As it happens, the first focus was recommenders; this is one of the most common use cases customers come to us with. This proved to be a powerful idea, because at the time Hadoop was something you could only use as a batch-mode processing engine, to build models offline and to make recommendations offline. Of course, the recommender problem is very much a real-time problem; we need to learn in real time, and we need to answer queries in real time. That was the focus of the technology that was acquired by Cloudera, and that we built into the open-sourced Oryx project. That continues to evolve; that continues to be where my heart is, and where I think the important gap still is on the platform. We need a good infrastructure and substrate not just for building models, but for serving models on a platform like Hadoop, at large scale and in real time. And that’s something that we’re now rebuilding on top of Spark, as the next-generation platform within Hadoop for building these kinds of things.

There has been a trend of machine learning startups being acquired by industry-leading enterprises recently. Twitter recently acquired Madbits, as you mentioned earlier; Google acquired DNNResearch and Jetpac; and Pinterest acquired Visual Graph. Why do you think machine learning startups have become hot property, and why are all the big businesses taking notice?

I’ve wondered the same thing myself. These acquisitions seem to be very similar as well. We’ve got a small group of people without a real product to sell, but they have a very interesting take on a hot topic in research, like deep learning or multi-layered convolutional neural nets. Interestingly, all of these small startups have managed to make deep learning do something related to image recognition. Image classification is a problem the tech giants have, so it’s not surprising that they want to buy these technologies at almost any price. In a way, I’m not sure this is reflective of a lot of trends in machine learning and in the industry, even if these are the most visible transactions in this space. All of these seem to follow a similar pattern: tech giants buying a deep learning startup, of a couple of people, to enhance some kind of image recognition capability. It’s interesting, but it’s not what 90% of companies out there do when they do machine learning.

Read the second part of our interview with Sean here, where we discuss the future of deep learning and neural networks, and how he foresees the relationship between machine learning and enterprise evolving.

(Image credit: wpdang)

Tags: Apache Spark, Cloudera, Hadoop, image recognition


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
