Whenever you ask a successful company why they wrote their own large distributed system, or put a lot of work into gluing multiple systems together, the reason almost always boils down to something like, “Well, XYZ did nearly everything we needed, except…”. There are a couple of stellar write-ups from experienced systems builders about what they want to see in the future. Having examined many real-world systems, I think there are 10 essential features that any serious database will need in order to keep being considered a serious database.
Massive scalability: If the success of companies like Google, Amazon, and Facebook has taught us anything, it’s the power of big piles of itty-bitty boxes working in concert to complete a task. This means the ability to function on hundreds or thousands of machines, handling failover automatically, and robust replication to multiple datacenters.
Massive parallelism: A distributed system does you no good if you can only do a few things at a time. This is an active area of innovation, but the trend is clearly toward lock-free data structures and multi-version concurrency control (MVCC). The goal is to be able to use all of the CPUs at once to service thousands or millions of transactions per second without having one part of the system waiting on a global lock.
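To make the MVCC idea concrete, here is a deliberately tiny sketch in Python. The class and method names (`MVCCStore`, `begin`, `write`, `read`) are made up for illustration and do not come from any particular database. It also cheats: commits are serialized with a short lock for simplicity, which a real lock-free engine would avoid. The point it tries to show is only that readers work from a snapshot and never wait on writers.

```python
import threading


class MVCCStore:
    """Toy multi-version store (hypothetical): writers append versions,
    readers see the newest version visible at their snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit timestamp
        # Simplification: a short lock guards only the commit step.
        self._commit_lock = threading.Lock()

    def begin(self):
        """Return a snapshot timestamp for a new read transaction."""
        return self._clock

    def write(self, key, value):
        """Commit a new version of key and return its commit timestamp."""
        with self._commit_lock:
            self._clock += 1
            self._versions.setdefault(key, []).append((self._clock, value))
            return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts.
        Takes no lock, so readers never wait on writers."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
snapshot = store.begin()      # a reader starts here
store.write("balance", 250)   # a later writer does not disturb that reader

old_view = store.read("balance", snapshot)
new_view = store.read("balance", store.begin())
```

The reader that began before the second write keeps seeing the old value, while a new reader sees the new one. Old versions are never cleaned up here; a real system would need some form of garbage collection.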
Flexibility: It’s critical to be able to implement databases on commodity hardware with as few specific requirements as possible. That means direct-attached storage instead of centralized (and oversubscribed) SANs, and standard Linux as the operating system on standard network hardware. The “run anywhere” idea can sometimes be taken too far, though. With easy virtualization and containerization, it’s not clear that you even need a Mac OS port, let alone a Windows port. The standard chipset these days is 64-bit x86. ARM is still over the horizon as far as mainstream datacenters are concerned.
Extensibility: User Defined Functions and scripting in general. The imagination of your userbase is always going to be greater than yours. Giving users a way to extend the database in ways you can’t imagine is a powerful tool.
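As one small illustration of what a User Defined Function looks like in practice, the sketch below uses SQLite through Python's standard `sqlite3` module, chosen only because it runs anywhere without a server. The `domain_of` function, the `users` table, and the sample rows are invented for the example.

```python
import sqlite3


def domain_of(email):
    """Hypothetical UDF: return the lowercased domain of an email address."""
    if not email:
        return None
    return email.rsplit("@", 1)[-1].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as domain_of(x).
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("a@Example.com",), ("b@example.com",), ("c@other.org",)],
)

# The database engine did not ship with domain_of; the user supplied it.
rows = conn.execute(
    "SELECT domain_of(email) AS domain, COUNT(*) "
    "FROM users GROUP BY domain ORDER BY domain"
).fetchall()
```

Once registered, the function can be used anywhere an ordinary SQL function can, including in `GROUP BY`, without the database authors having anticipated it.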
Real-time capabilities: Too often, users are given a false choice between storing data efficiently and being able to query it immediately. Immediate access to data as it happens is increasingly valuable.
Ad-hoc analysis: Relatedly, the kinds of real-time queries you can run shouldn’t be limited. Many traditional “real-time” systems pre-aggregate metrics in order to drive graphs. That’s fine, except you literally have to be psychic to set up all the queries you might need before you know you need them. Exploratory, ad-hoc analysis should be just as easy as pre-baked reports. It’s through that exploration that you discover what queries should become pre-baked.
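The difference between pre-baked and ad-hoc can be sketched in a few lines, again using Python's built-in `sqlite3` module as a stand-in for a real analytics database. The `events` table, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (ts INTEGER, page TEXT, country TEXT, latency_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, "/home", "US", 120),
        (2, "/home", "DE", 340),
        (3, "/cart", "US", 95),
        (4, "/cart", "DE", 410),
    ],
)

# Pre-baked: a rollup chosen in advance. It answers "hits per page" and
# nothing else, because country and latency were discarded at rollup time.
conn.execute(
    "CREATE TABLE hits_per_page AS "
    "SELECT page, COUNT(*) AS hits FROM events GROUP BY page"
)
prebaked = conn.execute(
    "SELECT page, hits FROM hits_per_page ORDER BY page"
).fetchall()

# Ad-hoc: a question nobody planned for. It can only be answered because
# the raw events are still there to query.
adhoc = conn.execute(
    "SELECT country, AVG(latency_ms) FROM events "
    "GROUP BY country ORDER BY country"
).fetchall()
```

If only `hits_per_page` had been kept, the latency-by-country question could not be asked at all. Once a query like that proves useful, it becomes a candidate for its own rollup.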
Mature monitoring: Too many database systems, especially the new ones, lack mature tools for monitoring cluster health, system load, data layout, and status.
Compatibility: No database works in a vacuum, not even the ones in space. A database should have easy integration with other datastores, analysis tools, libraries, and existing applications.
Low barrier to entry: Databases should be easy to use. This means both a simple interface and one that builds on skills people already have. Of all the new APIs and query languages that have been pushed onto the market over the last 10 years, I’ve yet to see one that’s strictly better than standards like SQL. A more-flexible data model shouldn’t require a complete rewrite of your applications. You shouldn’t have to learn another language just because the data is over here instead of over there.
Real SQL / relational features: As various NoSQL databases matured over the last 5 years or so, a curious thing happened to their APIs: they started looking more like SQL. It’s not because SQL is the ultimate language. It’s because SQL is based on some pretty fundamental math called relational algebra. Math has a funny habit of being true and useful no matter how many mean things people say about it on their blogs. A database without true relational features built in is at best a fancy filesystem.
We’re not all there yet as an industry, but we’re getting closer. A lot of hype that used to obscure the landscape is falling away as more companies acquire first-hand experience with the reality of real-time distributed analysis.
(image credit: Andrea Goh, CC2.0)