Cloudera on Why Hadoop Projects Fail

Mark Lewis is the Senior Director of Marketing at Cloudera. Cloudera offers a unified platform for big data: its enterprise data hub, built on Apache Hadoop. The Enterprise Data Hub allows companies to handle storage, access, management, analysis, security and search all within one framework. We recently caught up with Mark at Big Data & Analytics Day to discuss the Enterprise Data Hub, the evolving discipline of data science and why Hadoop projects fail.
 


 

Tell us about the relationship between the Enterprise Data Hub and the data warehouse.

A lot of people think they need to build a big data system as though it’s one entire entity. But it’s not; it’s a number of entities, and there’s a journey in how you get there. Many ask, “If I adopt an enterprise data hub, does that mean I don’t need my data warehouse anymore?” Absolutely not. A great analogy is the digital camera: it is the best at its job, it takes great pictures, you’ve got great focus, and the quality of the lenses is absolutely superb. But how many people actually take a photo with a digital camera? Quite a few. And how many people, in comparison, take a picture with a smartphone? A lot more.

So examine for a moment why that is. Well, a smartphone has a lot of apps built into it, apps that enable you to edit the data and share the data as well. That’s what an enterprise data hub is. It’s bringing the compute power to where the data resides. A data warehouse is a bit like the SLR: it’s really good at what it does, but you need to do something with the data before you can actually operate on it, which is where the ETL process comes in. A digital camera doesn’t work on its own; you need to take the data from the camera, move it to your computer and then do something with it.

In a recent LinkedIn poll of “Why Hadoop Projects Fail?”, a consistent pain point identified by respondents was “No financially compelling use case”. Do you agree?

Yes. I think what this highlights is that big data is a very recent and evolving discipline. In my talk today I asked, “How many people here are right at the beginning and have no idea about big data?” About 15% of the room raised their hands. When I asked how many people in the room are real experts in this field, nobody put their hand up. When I asked, “Who has an element of knowledge but is a bit confused?”, nearly all the room put their hands up. And I’m seeing that from Germany to the UK to Ireland to the US to the Middle East.

What we’ve got here is a new, evolving discipline, and I think that people don’t understand it. So if people are looking to get into this, great! But be prepared to put the effort in to understand and build your discipline, and to gain an overview of all of the different aspects. There are data scientists, there are IT architects, there are people who understand data warehousing. Each of those is a discipline in its own right, and all of that coming together under the enterprise data hub is a relatively new concept.

It’s easy to understand why data scientists feel like everything is landing in their lap. You have the words “big data” attached to you and everyone thinks you know everything about everything. Another analogy I like to use: it’s a bit like being a mediocre violinist who has played with an orchestra once or twice but is not that great. To your non-musical friends, you’re a genius! You know everything about the orchestra. You could conduct it, you can compose, you can write great symphonies! But you know that in the grand scheme of things you’ve got a lot to learn, and I think if people in big data are honest, there are disciplines where they have a lot to learn.

There are experts out there. But the people skills as well as the technology are still evolving. It’s going to be a journey before people fully appreciate and can really use it to its full extent. That’s why companies like Cloudera are helping with training.

Another question we’re often asked is “Why are Cloudera and MongoDB (for instance) marketing themselves as partners? Aren’t they doing exactly the same thing?” Would you care to clear up the confusion about the specific uses of Hadoop and NoSQL databases, and how they work together?

Very simply put, the Hadoop structure is like an operating system. It’s not actually that exciting to look at or do much with. It is a way of just scaling, computing, and storing that data. What you then have are the applications and the modules that then allow you to put data in and extract data out.

The partner companies each have specific uses; some do ETL, some do analytics. These components come together, and each of them is a specialist in its own right. But it is one system. That’s why big data means different things to different people. Someone in an analytics company will say, “We’ve been doing big data for years!” and they’re not wrong. Someone in an ETL company will say, “We’ve been doing analytics data for years,” and they’re not wrong. And somebody who’s simply been running spreadsheets and doing a bit of analysis on a few numbers says they’ve been doing big data, and they’re not completely wrong. But scaling, and handling different volumes of data from different places, is the challenge we’re facing today. That’s where the partner ecosystem comes in.

On the subject of partnerships, would you like to tell us more about the partnership between Intel and Cloudera?

It’s something both Cloudera and Intel are really excited about. The whole thing is about building an engagement together that’s ultimately going to help the customer. So if you take the way that the systems are operating, 95% of the data sent today is based on Intel. We’re working with Intel on helping to improve Hadoop’s performance and the way that it operates at an engineering level.

You can imagine that that’s going to help a lot of people out there get a lot of performance improvement across a lot of systems. And of course being an open source company—as Cloudera is—with the committers that we have, we’re going to contribute that back to the community as well. So that will benefit a lot of people who are involved in Hadoop, not just Cloudera. So it’s a very exciting partnership.





(Featured Image Credit: Cloudera)
