Understanding Big Data: Open Source
The open source technology market is huge. It fuels 1 million unique projects today, and opens up massive opportunities for small and large enterprises alike. It means small companies can deploy technologies in a cost-effective manner, and large enterprises have the means to scale; as John Gallaugher points out, Google has over 1.4 million servers; without open sourced technology, licensing costs on that scale would be huge.
Yet open source is a multi-billion dollar industry. For technology that’s ostensibly free to use and develop, where is the money coming from? In this installment of Understanding Big Data, we’ll be looking at some of the leading open source providers, and how much you can actually get for free.
The Apache Foundation
The Apache Foundation have been providing users with community-led, open-source solutions for 15 years. They currently have nearly 150 top-level projects, covering a vast spectrum of technologies. Notable projects include:
- Hadoop- In our overview of Hadoop, we defined it as an “open-source framework for processing, storing and analysing data.” The fundamental principle behind Hadoop is that, rather than tackling one monolithic block of data all in one go, it’s more efficient to break up and distribute data into many parts, allowing different parts to be processed and analysed concurrently. The Apache Foundation also develops a range of Hadoop integrations, which you can find out about here.
- Lucene- Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The Apache Foundation also developed an enterprise search server, known as Solr, based on Lucene’s search library.
- Storm- Storm is a project currently being incubated by the Apache Foundation. The aim of Storm is to do for real-time processing what Hadoop did for batch processing. The main selling point of Storm is that it’s really, really fast: a benchmark clocked it at over a million tuples processed per second per node. It is also scalable, fault-tolerant and supports a range of programming languages. Current users include Twitter, the Weather Channel and WebMD.
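To make the Hadoop principle described above more concrete, here is a minimal single-machine sketch of the "break up, process concurrently, merge" idea, written in Python. It is not Hadoop's API: the function names (`split_into_chunks`, `count_words`, `word_count`) are hypothetical, and threads on one machine stand in for what Hadoop would spread across many nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_count):
    """Break the data set into roughly equal parts (the 'distribute' step)."""
    size = max(1, -(-len(lines) // chunk_count))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_words(chunk):
    """Process one part on its own, with no knowledge of the other parts."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(lines, workers=4):
    """Count words by processing chunks concurrently, then merging results."""
    chunks = split_into_chunks(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partial_counts:  # merge the partial results
        total.update(partial)
    return total


if __name__ == "__main__":
    sample = ["big data is big", "open source data", "Big ideas"]
    print(word_count(sample, workers=2).most_common(3))
```

Because each chunk is counted independently, the merged result is the same as counting the whole data set in one pass; the potential gain is that the parts can be worked on at the same time.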
Oracle RDBMS is one of the leading RDBMS solutions, although, unlike the other technologies covered here, its core database is proprietary rather than open source. Their SQL-based RDBMS solution, released in 1979, was the first commercially available technology of its kind, and revolutionised relational databases. They introduced partitioning in 1997, internet computing in 1999, and application clusters in 2001.
Their customer list, as you might expect, is extensive; high-profile customers include Vodafone, BT, Aria Systems and Deutsche Börse AG.
A comprehensive and insightful beginner’s guide to Oracle RDBMS can be found here.
MySQL is currently owned by Oracle, after Oracle acquired Sun Microsystems in 2010. Its website claims it’s the most widely-used open-source database solution in the world, although parent company Oracle may have them beat.
MySQL was originally developed in 1994 by Michael Widenius and David Axmark and named after Widenius’ daughter, My. MySQL is ACID compliant, supports transactions and row-level locking, and is highly scalable. It also supports a much broader range of operating systems than many other database management systems; as well as Windows, OS X and Linux, it can run on BSD, UNIX, AmigaOS, iOS, Symbian and Android.
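As a rough illustration of what transaction support means in practice, the sketch below moves money between two accounts so that either both updates happen or neither does. It uses Python's built-in `sqlite3` module purely as a stand-in, since it needs no server; against MySQL you would use a MySQL driver instead. The `accounts` table and the `transfer` and `make_demo_db` helpers are hypothetical names chosen for the example.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with two example accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, from_id, to_id, amount):
    """Move `amount` between two accounts as a single atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (from_id,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
        return True
    except ValueError:
        return False  # the debit above was rolled back


if __name__ == "__main__":
    db = make_demo_db()
    print(transfer(db, 1, 2, 30))   # a transfer the balance can cover
    print(transfer(db, 1, 2, 500))  # an overdraft attempt, which is rolled back
    print(db.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
```

The point of the rollback is that a failed transfer leaves no half-finished state behind: the debit that was already applied is undone along with everything else in the transaction.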
It powers a range of well-known applications like WordPress, phpBB and Drupal, and is used by Wikipedia, Google, Facebook, Twitter, Flickr and YouTube.
There’s been widespread discussion about whether SQL-based technologies can compete in a big data environment, but as Carl W. Olofson of the International Data Corporation (IDC) stated earlier this year: “Without overstating the case, the MySQL movement is still revolutionary and is still young. It’s important to drive people to it and get them excited about it. MySQL is doing the jobs people need done.”
Many of the leading NoSQL databases have open source offerings. These include:
- MongoDB- MongoDB is a document-store database, and is (according to DB-Engines) the 5th most popular database management system in the world.
- Cassandra- Cassandra is a wide-column (or columnar) database focused on performance and scalability, supported by the Apache Software Foundation.
- HBase- HBase is a NoSQL columnar database which is designed to run on top of HDFS. It is modelled after Google’s BigTable and written in Java. It was designed to provide BigTable-like capabilities to Hadoop, such as the columnar data storage model and storage for sparse data.
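The difference between the document-store and wide-column models mentioned above can be sketched with plain Python dictionaries. This is only a conceptual illustration, not the API of MongoDB, Cassandra or HBase; the field names, row keys and the `get_cell` helper are made up for the example, and the wide-column layout loosely follows the BigTable-style "row key, column family, column" arrangement.

```python
# Document store: one self-contained, nested record per entity.
user_document = {
    "_id": "user42",
    "name": "Ada",
    "emails": ["ada@example.com"],
    "address": {"city": "London", "country": "UK"},
}

# Wide-column store: row key -> column family -> column -> value.
# Rows are sparse: a column that has no value is simply not stored.
wide_column_table = {
    "user42": {
        "profile": {"name": "Ada", "city": "London"},
        "activity": {"last_login": "2014-06-01"},
    },
    "user43": {
        "profile": {"name": "Grace"},  # no city, no activity family at all
    },
}


def get_cell(table, row_key, family, column, default=None):
    """Look up one cell, returning `default` for any missing level."""
    return table.get(row_key, {}).get(family, {}).get(column, default)


if __name__ == "__main__":
    print(user_document["address"]["city"])
    print(get_cell(wide_column_table, "user42", "profile", "city"))
    print(get_cell(wide_column_table, "user43", "profile", "city"))
```

In the document model the whole record travels together, while in the wide-column model each row only carries the columns it actually has, which is what makes sparse data cheap to store.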
More information about different NoSQL solutions can be found in this previous installment of “Understanding Big Data”.
Are They Actually Free?
Short answer: yes. All of the technologies mentioned above are available to download and deploy for free. It is worth keeping in mind that there are various open source licensing agreements (such as GPL and the Apache License), which have different provisions and are constantly evolving. It’s undoubtedly worth doing your research; Black Duck’s Knowledge Base is a good place to start looking at the freedoms and limitations of open source licenses.
But you may have noticed some of the companies mentioned above are multi-million dollar enterprises. How do they make this money? By offering “enterprise” editions of their products, with added features.
The Apache Software Foundation is entirely open source. However, there are several external companies which offer “enterprise-class” Hadoop, such as Hortonworks and Cloudera. They offer the Hadoop technologies with added features such as greater security and stability, as well as training for companies unfamiliar with the technology and exclusive integrations with other technologies. There’s a similar industry around Apache Cassandra; DataStax offers Cassandra with added security, search, analytics and management features.
Oracle RDBMS is free, but they offer a vast range of products that aren’t; downloading and deploying their specialist big data management system could cost in the region of $300,000.
The open source MySQL offering is known as the “community edition”; they also offer a Standard Edition for $2,000, an Enterprise Edition for $5,000 and a Cluster Carrier Edition for $10,000. A full breakdown of the features available in each edition can be found here.
Many of the NoSQL solutions, such as Couchbase and MongoDB, operate on a “freemium” model. Their core technology is open sourced, but the enterprise edition with more features and greater security and support is monetised.
So what can you get for free? In summary, quite a lot. You can have access to world-leading database management systems without parting with a penny. But if you do have money to spend, there are plenty of enterprises out there that want to make it easier for you to take the first steps towards crafting a big data architecture.
(Featured Image Credit: Raconteur)
Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.