Fans and supporters of Hadoop have no reason to fear; Hadoop isn’t going away anytime soon. There’s been a great deal of consternation about the future of Hadoop, most of it stemming from the growing popularity of Apache Spark. Some big data experts have even gone so far as to say Apache Spark will soon replace Hadoop in the realm of big data analytics. So are the Spark supporters correct in this assessment? Not necessarily. Apache Spark is a newer technology that’s getting a lot of attention, and the number of Apache Spark users is growing at a considerable pace, but that doesn’t make it Hadoop’s successor. The two technologies certainly have similarities, but their differences set them apart, showing that the right platform depends on the task it will be used for. To say Spark is on its way to dethroning Hadoop is simply premature. If anything, the two look to be complementary in the work they do.
Every discussion surrounding Hadoop should include talk about MapReduce, a parallel processing framework in which jobs are run to process and analyze large sets of data. If an enterprise needs to analyze big data offline, Hadoop is usually the preferred choice. That’s what drew so many businesses and industries to Hadoop in the first place: it could store and analyze big data inexpensively. As Matt Asay of InfoWorld Tech Watch puts it, Hadoop essentially “democratized big data.” Suddenly, businesses had access to information the likes of which they had never had before, and they could put that big data to good use, creating a large number of big data use cases. Hadoop’s batch-processing technology was revolutionary and is still used often today. When it comes to data warehousing and offline analysis jobs that may take hours to complete, it’s tough to go wrong with Hadoop.
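For readers who haven’t seen the MapReduce model up close, here is a minimal single-machine sketch of the idea in plain Python. It is not Hadoop’s actual API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and a real Hadoop job would spread these same phases across many machines in a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for a much larger data set.
    sample = ["big data big insights", "big clusters"]
    print(word_count(sample))
```

Because each line is mapped independently and each key is reduced independently, both phases can in principle be handed out to separate workers, which is what makes the model a good fit for large batch jobs.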
Apache Spark, which was developed as a project independently from Hadoop, offers its own advantages that have made many organizations sit up and take notice. Many supporters say Spark represents a technological evolution of Hadoop’s capabilities. There are several categories where Apache Spark excels. The first and most touted is speed. When processing data in a Hadoop cluster, Spark can run applications much more quickly, anywhere from ten to a hundred times faster in some cases. This capability has basically ushered in the era of real-time big data analytics, sometimes referred to as streaming analytics. Beyond speed, Spark is also relatively easy to use, particularly for developers. Writing applications in a familiar programming language, like Java or Python, makes the process of building apps that much easier. Spark is also quite versatile, meaning it can run on a variety of platforms, such as Hadoop or the cloud. Spark can also access a wide variety of data sources, among them Amazon Web Services’ S3, Cassandra, and Hadoop’s own data store, HDFS.
With Spark’s capabilities in mind, some may wonder why any organization should stick with Hadoop at all. After all, Spark appears to be able to run more complex and sophisticated workloads more quickly. Who wouldn’t want real-time analytics? But the truth is Hadoop and Spark may in fact work better together. If anything, Spark loses some of its effectiveness without Hadoop since it was designed to run on top of it. Hadoop can support both the traditional batch-processing model and the real-time analytics model. Think of Spark as an added feature that can go with Hadoop. When an organization needs interactive data mining, machine learning, or stream processing, Spark is the way to go. For businesses that require more scalable infrastructure, so they can add servers as workloads grow, Hadoop and MapReduce are a better bet. Using both at the same time in a complementary approach gets organizations the best that each has to offer.
Talk of the death of Hadoop always seemed a little hasty, no matter how impressive Spark’s capabilities have been. There’s no denying the advantages that Spark brings to the table, but Hadoop isn’t going to just disappear. Spark was never designed to replace Hadoop anyway. When the two are used in tandem, businesses can gain the advantages of both, effectively increasing the benefits they receive. While the movement toward real-time analytics will continue, Hadoop will still be needed and readily available to companies.
Rick Delgado: I’ve been blessed to have a successful career and have recently taken a step back to pursue my passion of freelance writing. I love to write about new technologies and about keeping ourselves secure in a changing digital landscape. I occasionally write articles for several companies, including Dell.
Photo credit: Ben K Adams / Photo / CC BY-NC-ND