Introducing Hadoop
In the big data world, the sheer volume, velocity and variety of data render most conventional technologies ineffective. To cope, companies like Google and Yahoo! needed to find ways to manage all the data their servers were gathering in an efficient, cost-effective way.
Hadoop was originally created by a Yahoo! engineer, Doug Cutting, inspired by Google's papers on the Google File System and MapReduce. Hadoop was Yahoo!'s attempt to break the big data problem down into small pieces that could be processed in parallel. Hadoop is now an open source project available under the Apache License 2.0 and is widely used by many companies to manage large volumes of data successfully. What follows is a short introduction to how it works.
At its core, Hadoop has two main systems:
Hadoop Distributed File System (HDFS): the storage system for Hadoop spread out over multiple machines as a means to reduce cost and increase reliability.
MapReduce engine: the processing framework that breaks a job into map tasks, which filter and transform the input, and reduce tasks, which aggregate the intermediate results into the final output.
How does HDFS work?
With the Hadoop Distributed File System, data is written once to the cluster and subsequently read and re-used many times. This write-once, read-many pattern, contrasted with the repeated read/write cycles of most other file systems, explains part of the speed with which Hadoop operates. As we will see, this is also why HDFS is an excellent choice for the high volumes and velocity of data required today.
The way HDFS works is by having a main "NameNode" and multiple "data nodes" running on a cluster of commodity hardware. All the nodes are usually organized within the same physical rack in the data center. Data is then broken down into separate "blocks" that are distributed among the various data nodes for storage. Blocks are also replicated across nodes to reduce the likelihood of failure.
The NameNode is the "smart" node in the cluster. It knows exactly which data node contains which blocks and where the data nodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes and replication of data blocks across different data nodes.
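To make this concrete, the sketch below uses the HDFS Java client to ask the NameNode which data nodes currently hold the blocks of a file. The file path is hypothetical and the snippet is only a minimal illustration of the org.apache.hadoop.fs API, not part of any particular deployment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.log");   // hypothetical file
    FileStatus status = fs.getFileStatus(file);

    // The NameNode answers this query from its block map: for each block of the
    // file, it reports the data nodes currently holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset %d, length %d, hosts %s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}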
The NameNode operates in a “loosely coupled” way with the data nodes. This means the elements of the cluster can dynamically adapt to the real-time demand of server capacity by adding or subtracting nodes as the system sees fit.
The data nodes constantly communicate with the NameNode to see whether there is a task they need to complete. This constant communication ensures that the NameNode is aware of each data node's status at all times. Since the NameNode assigns tasks to the individual data nodes, if it notices that a data node is not functioning properly it can immediately re-assign that node's task to a different node holding the same data block. Data nodes also communicate with each other so they can cooperate during normal file operations. Clearly the NameNode is critical to the whole system and should be replicated to prevent system failure.
Again, data blocks are replicated across multiple data nodes and access is managed by the NameNode. This means that when a data node stops sending its "life signal" (heartbeat) to the NameNode, the NameNode unmaps that data node from the cluster and keeps operating with the remaining data nodes as if nothing had happened. When the data node comes back to life, or a different (new) data node is detected, that new data node is (re-)added to the system. That is what makes HDFS resilient and self-healing. Since data blocks are replicated across several data nodes, the failure of one server will not corrupt a file. The degree of replication and the number of data nodes are set when the cluster is implemented, and both can be dynamically adjusted while the cluster is operating.
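As a small illustration of that adjustability, a client can ask the NameNode to change the replication factor of an existing file through the same FileSystem API. This is only a sketch; the path and the factor of 5 are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Ask the NameNode to keep 5 replicas of this (hypothetical) file instead of
    // the cluster default (dfs.replication, commonly 3). The NameNode schedules
    // the extra copies on other data nodes in the background.
    boolean scheduled = fs.setReplication(new Path("/data/example.log"), (short) 5);
    System.out.println("replication change accepted: " + scheduled);

    fs.close();
  }
}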
HDFS also watches over data integrity: it uses transaction logs (the NameNode's edit log) and block checksums to detect corruption across the cluster. Usually there is one NameNode, and possibly a data node, running on a physical server in the rack, while all other servers run data nodes only.
Hadoop MapReduce in action
Hadoop MapReduce is an implementation of the MapReduce algorithm developed and maintained by the Apache Hadoop project. The general idea of the MapReduce algorithm is to break down the data into smaller manageable pieces, process the data in parallel on your distributed cluster, and subsequently combine it into the desired result or output.
Hadoop MapReduce includes several stages, each with an important set of operations designed to handle big data. The first step is for the program to locate and read the "input file" containing the raw data. Since the file format is arbitrary, the data must be converted to something the program can process. This is the job of the "InputFormat" and the "RecordReader" (RR). The InputFormat decides how to split the file into smaller pieces (producing a set of InputSplits), and a RecordReader then transforms the raw data of each split for processing by the map step. The result is a sequence of "key" and "value" pairs.
Once the data is in a form the map step can accept, each key-value pair is processed by the mapping function. To keep track of and collect the output data, the program uses an "OutputCollector". Another interface called "Reporter" reports progress and status, letting you know how the individual map tasks are getting on and when they are complete.
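To give a feel for this, here is a minimal map sketch using the classic org.apache.hadoop.mapred API that OutputCollector and Reporter belong to. The TokenMapper name is illustrative, not from any particular project; it simply emits a (word, 1) pair for every word in its input line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TokenMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // The RecordReader hands us one record at a time: key = byte offset, value = line text.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      output.collect(word, ONE);   // emit an intermediate (word, 1) pair
    }
  }
}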
Once all the mapping is done, the intermediate key-value pairs are shuffled and sorted so that all values sharing the same key end up together, and the Reduce function then performs its task on each key and its list of values. Finally an OutputFormat takes the resulting key-value pairs and organizes the output for writing to HDFS, which is the last step of the program.
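A matching reduce sketch, under the same assumptions (the SumReducer name is again illustrative), simply sums the counts that arrive for each word after the shuffle and sort:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // All intermediate values sharing this key arrive together after the shuffle/sort.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));  // final (word, total) pair, written via the OutputFormat
  }
}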
Hadoop MapReduce is the heart of the Hadoop system. It is able to process the data in a highly resilient, fault-tolerant manner. Obviously this is just an overview of a larger and growing ecosystem with tools and technologies adapted to manage modern big data problems.
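Before closing, here is a hedged driver sketch showing how the stages above are wired together with the same classic API: it configures the InputFormat and OutputFormat discussed earlier and plugs in the illustrative TokenMapper and SumReducer classes. The input and output paths come from the command line and are not specific to any real cluster.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    // TextInputFormat splits the input file into InputSplits, and its RecordReader
    // turns each line into a (byte offset, line text) key-value pair for the mapper.
    conf.setInputFormat(TextInputFormat.class);
    // TextOutputFormat writes the final key-value pairs back to HDFS as text.
    conf.setOutputFormat(TextOutputFormat.class);

    conf.setMapperClass(TokenMapper.class);
    conf.setReducerClass(SumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);   // submit the job and wait for completion
  }
}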
Nicolas has previously worked for Salesforce.com in Dublin and with Goldman Sachs International in London. Currently based in Paris, Nicolas is an avid fan of technology and Big Data.