Whenever you ask a successful company why they wrote their own large distributed system, or put a lot of work into gluing multiple systems together, the reason almost always boils down to something like, “Well, XYZ did nearly everything we needed, except…”. There are a couple of stellar write-ups from experienced systems builders about what they want to see in the future. Having examined many real-world systems, I think there are 10 essential features that any serious database will need in order to keep being considered a serious database.

Massive scalability: If the success of companies like Google, Amazon, and Facebook has taught us anything, it’s the power of big piles of itty-bitty boxes working in concert to complete a task. This means being able to run on hundreds or thousands of machines, handle failover automatically, and replicate robustly to multiple datacenters.

Massive parallelism: A distributed system does you no good if you can only do a few things at a time. This is an active area of innovation, but the trend is clearly toward lock-free data structures and multi-version concurrency control (MVCC). The goal is to be able to use all of the CPUs at once to service thousands or millions of transactions per second without having one part of the system waiting on a global lock.
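
To make the MVCC idea concrete, here is a minimal single-process sketch. The class `MVCCStore` and its methods are hypothetical names invented for illustration, not any real database's API. It assumes writers append a new timestamped version instead of overwriting, so a reader holding an older snapshot never waits on a writer. The short lock here serializes writers only; real engines use far more elaborate schemes.

```python
import itertools
import threading


class MVCCStore:
    """Toy multi-version store: writers append versions, readers take no lock."""

    def __init__(self):
        self._versions = {}                  # key -> [(commit_ts, value), ...], oldest first
        self._clock = itertools.count(1)     # source of increasing commit timestamps
        self._last_commit = 0
        self._write_lock = threading.Lock()  # serializes writers only; reads never take it

    def snapshot(self):
        """Return the timestamp that defines a reader's consistent view."""
        return self._last_commit

    def write(self, key, value):
        """Append a new version of key and return its commit timestamp."""
        with self._write_lock:
            ts = next(self._clock)
            self._versions.setdefault(key, []).append((ts, value))
            self._last_commit = ts
            return ts

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
snap = store.snapshot()        # a reader pins its view here
store.write("balance", 250)    # a later write does not disturb that reader

old_view = store.read("balance", snap)              # 100
new_view = store.read("balance", store.snapshot())  # 250
```

The reader that took `snap` keeps seeing 100 even after the second write, which is the property that lets many readers and writers proceed without waiting on each other.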

Flexibility: It’s critical to be able to implement databases on commodity hardware with as few specific requirements as possible. That means direct-attached storage instead of centralized (and oversubscribed) SANs, and standard Linux as the operating system on standard network hardware. The “run anywhere” idea can be taken too far, though. With easy virtualization and containerization, it’s not clear that you even need a Mac OS port, let alone a Windows port. The standard chipset these days is 64-bit x86. ARM is still over the horizon as far as mainstream datacenters are concerned.

Extensibility: User Defined Functions and scripting in general. The imagination of your userbase is always going to be greater than yours. Giving users a way to extend the database in ways you can’t imagine is a powerful tool.
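
As one small illustration of a user-defined function, the sketch below uses SQLite through Python's standard `sqlite3` module, chosen only because it ships with Python. The function `domain_of`, the `users` table, and the sample rows are made up for the example.

```python
import sqlite3


def domain_of(email):
    """Hypothetical UDF: return the lower-cased domain of an email address."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as domain_of(x).
conn.create_function("domain_of", 1, domain_of)

conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [("ann@Example.com",), ("bob@example.com",), ("cy@other.org",)],
)

# The database author never anticipated this grouping; the user supplied it.
rows = conn.execute(
    "SELECT domain_of(email) AS domain, COUNT(*) AS n "
    "FROM users GROUP BY domain ORDER BY domain"
).fetchall()
# rows == [('example.com', 2), ('other.org', 1)]
```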

Real-time capabilities: Too often, users are given a false choice between storing data efficiently and being able to query it immediately. Immediate access to data as it happens is increasingly valuable.

Ad-hoc analysis: Relatedly, the kinds of real-time queries you can run shouldn’t be limited. Many traditional “real-time” systems pre-aggregate metrics in order to drive graphs. That’s fine, except you literally have to be psychic to set up all the queries you might need before you know you need them. Exploratory, ad-hoc analysis should be just as easy as pre-baked reports. It’s through that exploration that you discover what queries should become pre-baked.
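
The difference between a pre-baked rollup and an ad-hoc query can be sketched with a few rows. The `events` table, its columns, and the values below are invented for illustration, and SQLite stands in for whatever database you actually run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (minute INTEGER, page TEXT, country TEXT, latency_ms INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (0, "/home", "US", 120),
        (0, "/cart", "DE", 340),
        (1, "/home", "DE", 95),
        (1, "/cart", "US", 410),
        (1, "/cart", "DE", 380),
    ],
)

# Pre-baked: a rollup decided in advance. It can only ever answer "hits per minute".
conn.execute(
    "CREATE TABLE hits_per_minute AS "
    "SELECT minute, COUNT(*) AS hits FROM events GROUP BY minute"
)
prebaked = conn.execute(
    "SELECT minute, hits FROM hits_per_minute ORDER BY minute"
).fetchall()
# prebaked == [(0, 2), (1, 3)]

# Ad hoc: a question nobody planned for, answered from the raw events.
adhoc = conn.execute(
    "SELECT page, country, AVG(latency_ms) FROM events "
    "GROUP BY page, country HAVING AVG(latency_ms) > 300 "
    "ORDER BY page, country"
).fetchall()
# adhoc == [('/cart', 'DE', 360.0), ('/cart', 'US', 410.0)]
```

The rollup table has already thrown away page, country, and latency, so the second question cannot be answered from it at all.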

Mature monitoring: Too many database systems, especially the new ones, lack mature tools for monitoring cluster health, system load, data layout, and status.

Compatibility: No database works in a vacuum, not even the ones in space. A database should have easy integration with other datastores, analysis tools, libraries, and existing applications.

Low barrier to entry: Databases should be easy to use. This means both simplicity of the interface and familiarity to existing skills. Of all the new APIs and query languages that have been pushed onto the market over the last 10 years, I’ve yet to see one that’s strictly better than standards like SQL. A more-flexible data model shouldn’t require a complete rewrite of your applications. You shouldn’t have to learn another language just because the data is over here instead of over there.

Real SQL / relational features: As various NoSQL databases matured over the last 5 years or so, a curious thing happened to their APIs: they started looking more like SQL. It’s not because SQL is the ultimate language. It’s because SQL is based on some pretty fundamental math called relational algebra. Math has a funny habit of being true and useful no matter how many mean things people say about it on their blogs. A database without true relational features built in is at best a fancy filesystem.
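
For readers who have not met relational algebra directly, here is a rough sketch of three of its core operators (selection, projection, and natural join) over plain Python lists of dicts. The function names and the sample `employees` and `depts` relations are illustrative assumptions; a real engine adds a planner, indexes, and much more.

```python
def select(relation, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


def natural_join(left, right):
    """Natural join: pair rows that agree on every attribute they share."""
    out = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            if all(l[a] == r[a] for a in shared):
                out.append({**l, **r})
    return out


employees = [
    {"name": "Ada", "dept_id": 1},
    {"name": "Lin", "dept_id": 2},
    {"name": "Sam", "dept_id": 1},
]
depts = [
    {"dept_id": 1, "dept": "Storage"},
    {"dept_id": 2, "dept": "Query"},
]

# Roughly: SELECT DISTINCT name FROM employees NATURAL JOIN depts WHERE dept = 'Storage'
result = project(
    select(natural_join(employees, depts), lambda r: r["dept"] == "Storage"),
    ["name"],
)
# result == [{'name': 'Ada'}, {'name': 'Sam'}]
```

Because each operator takes relations in and hands a relation back, they compose freely, which is a large part of why query APIs keep converging on something SQL-shaped.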

We’re not all there yet as an industry, but we’re getting closer. A lot of hype that used to obscure the landscape is falling away as more companies acquire first-hand experience with the reality of real-time distributed analysis.

(image credit: Andrea Goh, CC2.0)

  • Ilya Geller

    There is only one ‘ingredient’ that should be excluded – SQL.

    SQL, Structured Query Language, is a programming language designed for managing data held in a relational database, and was intended to manipulate and retrieve the data. SQL structures EXTERNAL questions in the sense that it was designed to convert incorrectly formulated EXTERNAL questions into the right ones.
    SQL works with (usually manually) structured data, where structured data refers to information with a high – but never absolute! – degree of organization, such that the database is easily searchable by a simple, straightforward search engine.
    SQL structures queries which have nothing in common with the data itself! Actually SQL operates with EXTERNAL descriptions of the data – this is the reason why everybody collects all possible EXTERNAL details and sells them to advertisers. For instance, what you had for breakfast and the color of your socks – these answers are supposed to ‘convert incorrectly formulated EXTERNAL questions into the right ones’.

    I, however, discovered and patented how to structure any data without SQL, the queries – INTERNALLY: Language has its own INTERNAL parsing, indexing and statistics and can be structured INTERNALLY. (For more details please browse on my name ‘Ilya Geller’.)
    For instance, there are two sentences:
    a) ‘Pickwick!’
    b) ‘That, with the view just mentioned, this Association has taken into its serious consideration a proposal, emanating from the aforesaid, Samuel Pickwick, Esq., G.C.M.P.C., and three other Pickwickians hereinafter named, for forming a new branch of United Pickwickians, under the title of The Corresponding Society of the Pickwick Club.’
    Evidently, ‘Pickwick’ has different importance in the two sentences, given the extra information in each. This distinction is reflected in the weights of the phrases that contain ‘Pickwick’: the first has 1, the second – 0.11; the greater weight signifies stronger emotional ‘acuteness’, where the weight refers to the frequency with which a phrase occurs in relation to other phrases.

    SQL cannot produce the above statistics – SQL is obsolete and out of business.
    SQL is what forces everybody to spy on the Internet.
    My INTERNAL technology structures data and allows searching by meaning and sense, without EXTERNAL details.

  • Low barriers to entry can include easy to understand graphs, charts, and other forms of data visualization. While you should have a dedicated team of data analysts, you don’t want the information in the database to get bottlenecked in one department. If users across an organization can log on and pull reports – they can integrate big data into their day-to-day operations and make better, stronger decisions.