Understanding Big Data: Infrastructure

by Eileen McNulty
July 23, 2014
in Understanding Big Data

Infrastructure is the cornerstone of Big Data architecture. Possessing the right tools for storing, processing and analysing your data is crucial in any Big Data project. In the last installment of “Understanding Big Data”, we provided a general overview of all of the technologies in the Big Data landscape. In this edition, we’ll be closely examining infrastructural approaches: what they are, how they work, and what each approach is best used for.

Table of Contents

  • Hadoop
  • NoSQL
  • Massively Parallel Processing (MPP)
  • Cloud

Hadoop

To recap, Hadoop is essentially an open-source framework for processing, storing and analysing data. The fundamental principle behind Hadoop is that rather than tackling one monolithic block of data all in one go, it’s more efficient to break the data up and distribute it into many parts, allowing the different parts to be processed and analysed concurrently.

When you hear Hadoop discussed, it’s easy to think of it as one vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of different products, largely presided over by the Apache Software Foundation. Some key components include:

  • HDFS: the default storage layer
  • MapReduce: executes a wide range of analytic functions by analysing datasets in parallel before ‘reducing’ the results. The “Map” job distributes a query to different nodes, and the “Reduce” job gathers the results and resolves them into a single value (see the sketch after this list).
  • YARN: responsible for cluster management and scheduling user applications
  • Spark: used on top of HDFS, and promises speeds up to 100 times faster than the two-step MapReduce function in certain applications. Allows data to be loaded in-memory and queried repeatedly, making it particularly apt for machine learning algorithms
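To make the Map and Reduce phases concrete, here is a minimal, single-machine sketch of the classic word-count pattern in Python. It illustrates only the programming model: the function names are our own invention rather than Hadoop APIs, and a real Hadoop job would distribute these phases across cluster nodes.

```python
from collections import defaultdict

# "Map" phase: each mapper turns one chunk of input into (key, value) pairs.
def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

# "Shuffle" phase: group the intermediate values by key across all mappers.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

# "Reduce" phase: resolve each key's list of values into a single result.
def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped.items()}

# Stand-ins for the data blocks that HDFS would spread across nodes.
chunks = ["big data is big", "data is distributed"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))
# {'big': 2, 'data': 2, 'is': 2, 'distributed': 1}
```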

More information about Apache Hadoop add-on components can be found here.


The main advantages of Hadoop are its cost- and time-effectiveness. Cost, because it’s open source: free for anyone to use, and able to run on cheap commodity hardware. Time, because it processes multiple ‘parts’ of the data set concurrently, making it a comparatively fast tool for retrospective, in-depth analysis. However, open source has its drawbacks: the Apache Software Foundation is constantly updating and developing the Hadoop ecosystem, but if you hit a snag with open-source technology, there’s no single go-to source for troubleshooting.

This is where Hadoop-on-Premium packages enter the picture. Hadoop-on-Premium services such as Cloudera, Hortonworks and Splice offer the Hadoop framework with greater security and support, along with added system and data management tools and enterprise capabilities.

NoSQL

NoSQL, which stands for Not Only SQL, is a term used to cover a range of different database technologies. As mentioned in the previous article, unlike their relational predecessors, NoSQL databases are adept at processing dynamic, semi-structured data with low latency, making them better tailored to a Big Data environment.

The different strengths and uses of Hadoop and NoSQL are often described as “analytical” and “operational” respectively. NoSQL is better suited to “operational” tasks: interactive workloads based on selective criteria, where data can be processed in near real-time. Hadoop is better suited to high-throughput, in-depth analysis in retrospect, where the majority or all of the data is harnessed. Since they serve different purposes, Hadoop and NoSQL products are sometimes marketed together. Some NoSQL databases, such as HBase, were primarily designed to work on top of Hadoop.

Some big names in the NoSQL field include Apache Cassandra, MongoDB and Oracle NoSQL. Many of the most widely used NoSQL technologies are open source, meaning security and troubleshooting may be an issue; NoSQL databases also typically place less focus on atomicity and consistency than on performance and scalability. Premium packages of NoSQL databases (such as DataStax for Cassandra) work to address these issues.
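To give a flavour of an “operational” NoSQL workload, the sketch below uses MongoDB’s Python driver, pymongo, to write and read schemaless documents. It assumes a MongoDB server running locally; the database, collection and field names are invented purely for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB server
events = client["demo_db"]["events"]               # hypothetical database/collection

# Documents are schemaless: each insert can carry different fields.
events.insert_one({"user": "alice", "action": "login", "device": "mobile"})
events.insert_one({"user": "bob", "action": "purchase", "amount": 19.99})

# A selective, low-latency lookup: the typical "operational" access pattern.
for doc in events.find({"user": "alice"}):
    print(doc)
```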

Massively Parallel Processing (MPP)

As the name might suggest, MPP technologies process massive amounts of data in parallel. Hundreds (or potentially even thousands) of processors, each with its own operating system and memory, work on different parts of the same programme.
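The shared-nothing principle behind MPP can be sketched in miniature with Python’s multiprocessing module: several worker processes, each with its own memory, compute partial results over their own slice of the data before a final gather step combines them. This is only an analogy for the principle; real MPP appliances coordinate SQL query fragments across dedicated hardware.

```python
from multiprocessing import Pool

def partial_sum(partition):
    # Each worker process has its own memory and its own slice of the data.
    return sum(partition)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # Shared-nothing style: split the data into one partition per worker.
    partitions = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partial_results = pool.map(partial_sum, partitions)
    # A final "gather" step combines the partial results into one answer.
    print(sum(partial_results))  # 499999500000
```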

As mentioned in the previous article, MPP usually runs on expensive data warehouse appliances, whereas Hadoop is most often run on cheap commodity hardware (allowing for inexpensive horizontal scale-out). MPP uses SQL, whereas Hadoop uses Java by default (although Hive, an SQL-like query language for Hadoop, was developed to make using Hadoop slightly easier and less specialist). As with all the technologies in this article, MPP has crossovers with the others; Teradata, an MPP vendor, has an ongoing partnership with Hortonworks (a Hadoop-on-Premium service).

Many of the major players in the MPP market have been acquired by technology vendor behemoths: Netezza, for instance, is owned by IBM, Vertica by HP, and Greenplum by EMC.

Cloud

Cloud computing refers to a broad set of products that are sold as a service and delivered over a network. With other infrastructural approaches, setting up your Big Data architecture means buying hardware and software for each person involved in processing and analysing your data. With cloud computing, your analysts only require access to one application: a web-based service where all of the necessary resources and programmes are hosted. Up-front costs are minimal, as you typically only pay for what you use and scale out from there; Amazon Redshift, for instance, allows you to get started for as little as 25 cents an hour. As well as cost, cloud computing also has an advantage in delivering faster insights, since capacity can be provisioned on demand rather than procured and installed up front.

Of course, having your data hosted by a third party can raise questions about security; many organisations choose to host their confidential information in-house, and use the cloud for less sensitive data.

A lot of big names in IT offer cloud computing solutions. Google has a whole host of cloud computing products, including BigQuery, which is specifically designed for the processing and management of Big Data; Amazon Web Services also has a wide range, including EMR for Hadoop, RDS for MySQL and DynamoDB for NoSQL. There are also vendors such as Infochimps and Mortar dedicated specifically to offering cloud computing solutions.
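As a brief illustration of the pay-for-what-you-use model, the sketch below queries a public dataset with Google’s BigQuery Python client. It assumes a Google Cloud project and credentials are already configured in your environment; BigQuery bills per query (by data scanned) rather than per server, so there is no cluster to provision.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# The client picks up your project and credentials from the environment.
client = bigquery.Client()

# You submit SQL to a hosted service; Google runs it on its own infrastructure.
query = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row.word, row.total)
```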

As you can see, these different technologies are by no means direct competitors; each has its own particular uses and capabilities, and complex architectures will make use of combinations of these approaches, and more. In the next “Understanding Big Data”, we will move beyond processing data and into the realm of advanced analytics: programmes specifically designed to help you harness your data and glean insights from it.

(Image credit: NATS Press Office)



Eileen McNulty-Holmes – Editor


Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.

Email: eileen@dataconomy.com

