Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI toolsNEW
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

What is Hadoop and how does it work?

byNicolas
February 2, 2014
in Tech
Home News Tech
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail
Google Preferred Source

Introducing Hadoop

In the Big data world the sheer volume, velocity and variety of data renders most ordinary technologies ineffective. Thus in order to overcome their helplessness companies like Google and Yahoo! needed to find solutions to manage all the data that their servers were gathering in an efficient, cost effective way.

Hadoop was originally created by a Yahoo! Engineer, Doug Cutting, as a counter-weight to Google’s BigTable. Hadoop was Yahoo!’s attempt to break down the big data problem into small pieces that could be processed in parallel. Hadoop is now an open source project available under Apache License 2.0 and is now widely used to manage large chunks of data successfully by many companies. What follows is a short introduction to how it works.

At its core, Hadoop has two main systems:

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

Hadoop Distributed File System (HDFS):  the storage system for Hadoop spread out over multiple machines as a means to reduce cost and increase reliability.

MapReduce engine: the algorithm that filters, sorts and then uses the database input in some way.

How does HDFS work?

With the Hadoop Distributed File system the data is written once on the server and subsequently read and re-used many times thereafter. When contrasted with the repeated read/write actions of most other file systems it explains part of the speed with which Hadoop operates. As we will see, this is why HDFS is an excellent choice to deal with the high volumes and velocity of data required today.

serversThe way HDFS works is by having a main « NameNode » and multiple « data nodes » on a commodity hardware cluster. All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate « blocks » that are distributed among the various data nodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.

The NameNode is the «smart» node in the cluster. It knows exactly which data node contains which blocks and where the data nodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across different data nodes.

The NameNode operates in a “loosely coupled” way with the data nodes. This means the elements of the cluster can dynamically adapt to the real-time demand of server capacity by adding or subtracting nodes as the system sees fit.

The data nodes constantly communicate with the NameNode to see if they need complete a certain task. The constant communication ensures that the NameNode is aware of each data node’s status at all times. Since the NameNode assigns tasks to the individual datanodes, should it realize that a datanode is not functioning properly it is able to immediately re-assign that node’s task to a different node containing that same data block. Data nodes also communicate with each other so they can cooperate during normal file operations. Clearly the NameNode is critical to the whole system and should be replicated to prevent system failure.

Again, data blocks are replicated across multiple data nodes and access is managed by the NameNode. This means when a data node no longer sends a “life signal” to the NameNode, the NameNode unmaps the data note from the cluster and keeps operating with the other data nodes as if nothing had happened. When this data node comes back to life or a different (new) data node is detected, that new data node is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several data nodes, the failure of one server will not corrupt a file. The degree of replication and the number of data nodes are adjusted when the cluster is implemented and they can be dynamically adjusted while the cluster is operating.

Data integrity is also carefully monitored by HDFS’s many capabilities. HDFS uses transaction logs and validations to ensure integrity across the cluster. Usually there is one NameNode and possibly a data node running on a physical server in the rack, while all other servers run data nodes only.

Hadoop MapReduce in action

Hadoop MapReduce is an implementation of the MapReduce algorithm developed and maintained by the Apache Hadoop project. The general idea of the MapReduce algorithm is to break down the data into smaller manageable pieces, process the data in parallel on your distributed cluster, and subsequently combine it into the desired result or output.

Hadoop MapReduce includes several stages, each with an important set of operations designed to handle big data. The first step is for the program to locate and read the « input file » containing the raw data. Since the file format is arbitrary, the data must be converted to something the program can process. This is the function of « InputFormat » and « RecordReader » (RR). InputFormat decides how to split the file into smaller pieces (using a function called InputSplit). Then the RecordReader transforms the raw data for processing by the map. The result is a sequence of « key » and « value » pairs.

Once the data is in a form acceptable to map, each key-value pair of data is processed by the mapping function. To keep track of and collect the output data, the program uses an « OutputCollector ». Another function called « Reporter » provides information that lets you know when the individual mapping tasks are complete.

Once all the mapping is done, the Reduce function performs its task on each output key-value pair. Finally an OutputFormat feature takes those key-value pairs and organizes the output for writing to HDFS, which is the last step of the program.

Hadoop MapReduce is the heart of the Hadoop system. It is able to process the data in a highly resilient, fault-tolerant manner. Obviously this is just an overview of a larger and growing ecosystem with tools and technologies adapted to manage modern big data problems.


nicNicolas Silvy

Nicolas has previously worked for Salesforce.com in Dublin and with Goldman Sachs International in London. Currently based in Paris, Nicolas is an avid fan of technology and Big Data


Image Credits:
Paul Hammond / Flickr
Tags: HadoophdfsMapReduceopen source

Related Posts

Meta launches 9 smart glasses under its own brand

Meta launches $299 smart glasses under its own brand

June 24, 2026
Samsung unveils UFS 5.0 storage for future Galaxy phones

Samsung unveils UFS 5.0 storage for future Galaxy phones

June 23, 2026
Instagram for TV launches on Samsung TVs in the US

Instagram for TV launches on Samsung TVs in the US

June 23, 2026
Apple releases iOS 27 beta 2 with new “Write with Siri” feature

Apple releases iOS 27 beta 2 with new “Write with Siri” feature

June 23, 2026
Samsung Galaxy S27 Pro leak points to built-in Privacy Display

Samsung Galaxy S27 Pro leak points to built-in Privacy Display

June 22, 2026
Perseverance rover completes a marathon on Mars

Perseverance rover completes a marathon on Mars

June 22, 2026
Please login to join discussion

LATEST NEWS

Meta launches $299 smart glasses under its own brand

Claude Tag brings shared AI assistant to Slack channels

PlayStation 6 leak points to 2027 release window

Samsung unveils UFS 5.0 storage for future Galaxy phones

Getty Images partners with OpenAI to supply licensed visuals for ChatGPT

Instagram for TV launches on Samsung TVs in the US

BEST AI MODELS LEADERBOARD

See the best AI models, ranked by intelligence, benchmark results, speed and token price. Find the most suitable LLMs, Text-to-Image, Image Editing, Text-to-Speech, Text-to-Video and Image-to-Video  artificial intelligence model for your tasks and business.

LATEST TOOLS

Moonbeam

Charisma AI

Essay Writer by Papertyper

Slite

Wonderin AI

Spur

Stenography

Calldesk

MaxAI.me

PhotoRestore

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Whitepapers
    • AI Models Leaderboard
  • AI tools
  • Newsletter
  • + More
    • Glossary
    • Conversations
    • Events
    • About
      • Who we are
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies to improve your experience. You can choose to accept or reject them. Visit our Privacy Policy.