Choosing a NoSQL

Many people ask my opinion on different NoSQL databases, and they also want to see benchmark numbers. I guess that readers of this post have similar questions. When you start building your next cool cloud application, there are dozens of NoSQL options to choose from. It is natural to ask: Which one is the fastest? Is it AP or CP? What is its sharding and fault tolerance strategy? And so on. But you won’t find any comparison chart or benchmark numbers here. You may be yelling now: “We benchmarked Oracle, DB2, SQL Server, MySQL, and PostgreSQL in the old days, and it was very meaningful and helpful!” Please let me explain.

Yes, it makes sense to benchmark relational databases, because SQL-based relational database products offer essentially the same data model and query language. You are doing an apples-to-apples comparison, and the benchmarks do help us understand which implementation suits which use case. But NoSQL solutions are different animals. They differ in data model (key-value, wide-columnar, object/document, graph, etc.), in CP versus AP, in synchronous versus asynchronous replication, in in-memory versus durable storage, in strong versus eventual consistency, and so on. When comparing them, we are comparing apples to oranges. Benchmarks with contradictory results may just confuse us further.

“But I have to make a choice!”, you cry. Well, a quick (if cheating) way is to look at who created each NoSQL product and what application they built it for. All NoSQL products are great in the sense that they do what they’re supposed to do for their creators. In the old days, we built applications against a database. Today, it seems that people build databases against applications, because companies have such different requirements. That is why Google built BigTable, Amazon.com built Dynamo, and Facebook built Cassandra. The list goes on. For example, 10gen originally built MongoDB for their own web platform. In other words, they built these NoSQL databases for themselves. If your business aligns with one of theirs, congratulations! Just go with the corresponding open source option.

But what if your idea is truly innovative, and you are doing something so wild that no existing solution seems like a good fit? In an environment of rapid technological advancement and ever-changing user requirements, it is not realistic to choose the “best” solution. It is better to think first about minimizing business risk, rather than about technical comparisons. When we embrace cutting-edge technologies, NoSQL included, we have to be careful that the cutting edge doesn’t become the bleeding edge. It is always good to ask whether we have a plan B with minimal migration cost.

If we look back, history may teach us something helpful. Before relational databases, the world of DBMS was somewhat similar to today’s. There were many different data models, systems, and interfaces. Why did relational databases replace these dinosaurs? There are many reasons. Let’s look at this from a programmer’s perspective. With relational databases, I, a software engineer, hardly care what the back end looks like. Whether it’s MySQL, Oracle, or Teradata, all I face is the ubiquitous relational data model and all I use is SQL. Yes, there are always some small differences in data types and SQL syntax among them. But it doesn’t take 10 years to migrate from one to another.

Based on this observation, we should probably first choose a data model that is flexible and expressive. More importantly, this data model should be supported by multiple major solutions from both the CP and AP schools. With this in mind, I am thinking of BigTable’s wide-columnar data model. As we know, key-value pairs are the simplest yet most flexible data model. With the logical concept of columns and column families, the wide-columnar data model also lets us encode document and graph models. Crucially, this data model is supported by both HBase (CP) and Cassandra (AP). Both HBase and Cassandra have very large communities and are used in large-scale production systems. HBase provides strong consistency, tight integration with MapReduce, and in-database computation through coprocessors. Cassandra provides a simple, symmetric architecture and excellent multi-datacenter support. With some abstraction, we can easily switch from one to the other.
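To make the "some abstraction" idea a bit more concrete, here is a minimal Python sketch of what such a layer might look like. Everything in it is my own illustration: the `WideColumnStore` interface, the `InMemoryStore` stand-in, and the helper functions are hypothetical names, not the API of HBase, Cassandra, or any client library. The assumption is that application code talks only to the small interface, so a real HBase-backed or Cassandra-backed class could later be dropped in behind it. The helpers show one possible way to lay a document and a graph on top of column families.

```python
from abc import ABC, abstractmethod


class WideColumnStore(ABC):
    """Hypothetical interface: table -> row key -> column family -> qualifier -> value."""

    @abstractmethod
    def put(self, table, row_key, family, qualifier, value):
        """Write a single cell."""

    @abstractmethod
    def get_family(self, table, row_key, family):
        """Return all cells of one column family in a row as a dict."""


class InMemoryStore(WideColumnStore):
    """Stand-in backend for illustration only; a real backend would wrap a database client."""

    def __init__(self):
        self._cells = {}  # (table, row_key, family) -> {qualifier: value}

    def put(self, table, row_key, family, qualifier, value):
        self._cells.setdefault((table, row_key, family), {})[qualifier] = value

    def get_family(self, table, row_key, family):
        # Return a copy so callers cannot mutate the stored cells.
        return dict(self._cells.get((table, row_key, family), {}))


# Document model: one row per document, one column per (flat) field.
def save_document(store, table, doc_id, doc):
    for field, value in doc.items():
        store.put(table, doc_id, "doc", field, value)


def load_document(store, table, doc_id):
    return store.get_family(table, doc_id, "doc")


# Graph model: one row per node, one column per outgoing edge.
def add_edge(store, table, src, dst, label):
    store.put(table, src, "edges", dst, label)


def neighbors(store, table, node):
    return sorted(store.get_family(table, node, "edges"))


store = InMemoryStore()
save_document(store, "users", "u1", {"name": "Ada", "city": "London"})
add_edge(store, "users", "u1", "u2", "follows")
add_edge(store, "users", "u1", "u3", "follows")
```

Because the helpers depend only on `WideColumnStore`, switching back ends would, under these assumptions, mean writing one new subclass rather than touching application code. A real layer would of course also have to deal with consistency levels, batching, and scans, which this sketch leaves out.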

This is my two cents. What’s your opinion? Please feel free to leave your comment below.

ALSO IN THIS SERIES:

Distributed NoSQL: MongoDB

In this installment of his Understanding NoSQL series, guest contributor Haifeng Li delves deeper into MongoDB, the fifth most popular database in the world. Examining the data model, storage and cluster architecture, Li aims to give us an in-depth understanding of MongoDB’s database technology.

Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.