Data that can’t be easily queried or read is like a book with the pages glued together: not very useful.
For decades, SQL was the established language used to interact with databases. Then data management requirements necessitated by cloud environments and big data use cases led to new systems. These new systems, collectively termed “NoSQL,” focused first on scalability and availability. They often sacrificed the interactivity and queryability of SQL as a false shortcut to scale.
Giving up SQL was not a necessary trade-off. In fact, this design choice created systems that were harder to use, narrowed their use case profile, and forced users to write complex programs to replicate what would have been simple SQL statements. For example, if you talk to enterprises that were early Hadoop adopters, many complain that using the new technology required the expertise of exclusive, highly skilled programmers, while legacy technologies that were SQL based were more accessible to analysts, data scientists, managers and marketers. The short-term choice to give up SQL caused a lot of pain for users, led to incompatible and proprietary query languages, and in some cases resulted in systems that were severely limited by weak interactivity and queryability.
Understanding the modern database landscape is difficult. There are a large variety of systems, many vendors, and many different and overlapping technologies. It is interesting, though, that across this landscape, almost all have learned that SQL is table stakes. The original NoSQL systems posited that SQL didn’t scale, that SQL didn’t work without relational (table-based) schema, and that SQL interfaces were inappropriate to “modern” use cases. All of these claims have been debunked. Users are driving vendors to embrace SQL. We now see NoSQL vendors advertising their SQL capabilities as competitive differentiators. It is easy to predict that over the next 12 to 18 months we will begin to see marketing literature comparing SQL compatibility across different NoSQL systems. Try explaining that one to your manager.
Let’s look at a few very different systems that have adopted SQL despite different storage designs, different primary use cases, and different data models.
“The system must provide full SQL query support and other functionality users expect from a SQL database. Features like indexes and ad hoc query are not just nice to have, but absolute requirements for our business.” (Google F1 paper). Google implemented a SQL system (F1) to manage its core adwords business. From 2012, F1 has been stated to manage 100s of TB and processes 100,000s of queries per second – a clear demonstration that SQL scales and that SQL features are critical to satisfying business users.
The headline feature of Couchbase 4.0, long a vocal proponent of NoSQL, is a SQL interface. Couchbase brands its version of SQL “N1QL.” But N1QL is nearly indistinguishable from SQL. Couchbase touts SQL benefits alongside its NoSQL messaging: N1QL enables you to query a document database without limitations and without compromise – sort, filter, transform, group, and combine data with a query. That’s right. You can combine data from multiple documents with a JOIN. That flexible data model you were promised? This is it. You’re no longer limited to “single table” and “table per query” data models. (www.couchbase.com). The strawman that users were ever limited to single table and table per query models is certainly frustrating to vendors that have seen and embraced the SQL requirement from day one – but the sentiment is certainly easy to understand.
Apache Cassandra is another prominent “NoSQL” system that has been pushed to deliver SQL as the primary database interface. The project brands its flavor of SQL “CQL” but goes to considerable length to explain that CQL is really just SQL. “CQL v3 offers a model very close to SQL in the sense that data is put in tables containing rows of columns. For that reason, when used in this document, these terms (tables, rows and columns) have the same definition than (sic) they have in SQL.” (https://cassandra.apache.org/doc/cql3/CQL.html#Preamble)
Google first birthed the MapReduce model into the marketplace. Behind the scenes the company quickly realized the shortcomings of MapReduce as a query model and built a large-scale analytic SQL-based system named “Dremel.” Dremel maps a SQL query model onto a document-based storage model at massive scale. Again, Google proves that SQL scales across volume, velocity and problem space. The Hadoop ecosystem has taken up the path, launching multiple SQL-on-Hadoop projects, including Impala, Presto, SparkSQL and Drill.
Why SQL? Data management systems must make data interactive. Interactivity is core to deriving value from the data that is being stored. All of the critical design choices made by database designers are motivated by interactivity. Vendors that put other priorities first are quickly realizing the corner they’ve backed into and are grafting SQL query processors into their core products.
If you have to support queries, why use SQL?
SQL is a Standard: It’s easy to find people who know SQL. It’s easy to connect to standard tools. A multitude of resources are available to learn SQL. Bonus: your editor already has syntax highlighting.
SQL is Declarative: With SQL, the query is written by specifying the form of the results declaratively. It is the job of the database software to understand the best way to access data, operate on it, and turn it into results. Declarative queries insulate the query-author from the underlying physical schema of the data. Compared to non-declarative processing, applications are much less fragile, and tolerate changes to the underlying schema, such as added columns or indexes, with no changes to the query itself.
SQL Scales: “SQL doesn’t scale” was oft cited as a key reason NoSQL exploded. Also frequently heard is, “To solve internet-scale problems, you need to abandon SQL.” Today, both Facebook and Google have publicly sung the praises of their SQL systems. Many NoSQL stores have added SQL or SQL-like query languages without crippling performance. New SQL systems such as VoltDB have shown that millions of SQL operations per second are within the reach of anyone with an Amazon Web Services (AWS) account.
SQL is Flexible: While there are a number of SQL standards, vendors and open-source projects have liberally extended SQL. VoltDB supports UPSERT functionality, JSON extensions and other non-standard SQL our customers have requested, all while supporting all of the standard SQL operations familiar to developers.
In sum, SQL is a proven technology. It is the easiest way to write queries. It is the most familiar way to write queries. It is the most compatible way to write queries. It is the most powerful way to write queries. The ability to query data is at the heart of the database value proposition.
Ryan Betts is CTO at VoltDB. He was one of the initial developers of VoltDB’s commercial product, and values his unique opportunity to closely collaborate with customers, partners and prospects to understand their data management needs and help them to realize the business value of VoltDB and related technologies.
Prior to joining VoltDB in 2008, Ryan was a software engineer at IBM. During a four-and-a-half year tenure, he was responsible for implementing device configuration and monitoring as well as Web service management. Before IBM, Ryan was a software engineer at Lucent Technologies for five years. In that role, he played an integral part in the implementation of an automation framework for acceptance testing of Frame Relay, ATM and IP services, as well as a high-availability upgrade capability and several internal components related to device provisioning. Ryan is an alumnus of Worcester Polytechnic Institute.
(image credit: Ishrona)
How stupid do you think your audience is. SQL is a standard for RELATIONAL data. SQL is proven for RELATIONAL data. Documents, graphs and all the other things you are talking about querying are NOT Relations.
Good point. Yet NoSQL vendors have extended SQL to do just that. For example, N1QL that Ryan references is similar to SQL++, details here: http://forward.ucsd.edu/sqlpp.html. IIRC, the idea is to get this accepted into the SQL standard.
That’s just a claim and I’m not that gullible. If it’s really proper SQL why is it call N1QL. What database vendor does that.
That aside the guy is trying to tell us that SQL can and should be used for everything. That’s patent nonsense. The people who created, XQuery, JSONiq, SParQL, Gremlin and Cypher didn’t do so because they are stupid and/or like to create extra work for themselves.
I don’t know the reason for the name — ask Couchbase or check on Wikipedia which has an explanation of sorts: https://en.wikipedia.org/wiki/N1QL. A developer preview is available and they will ship with their next major release according to press releases. You can get further details here: http://docs.couchbase.com/4.0/n1ql/
I’m not interested in using SQL to query non-relational data. I know many people will want to but it’s not the right tool for that job.
Hi Ihe,
A key benefit of a high level query language is the abstraction it provides from physical storage. You can see this really clearly in the Dremel paper that’s referenced. That system provides a SQL interface on top of non-relational storage. CQL does similarly. (You can see the same between columnar and row relational stores — very different data layouts – same query language.)
I think if you look at the definition of N1QL, CQL, and others, you would be hard pressed to not call them SQL. At the same time, SQL vendors have added more support for non-relational structured data. Many SQL systems now have first class JSON support, for example. There’s a converging trend.
I’m not sure why the NoSQL vendors rename their SQL dialects. I suspect they do this in the short term to avoid questions about level of compatibility. Medium term, as support evolves closer to standards, I suspect the branding of the dialects will fade.
Thanks for providing your viewpoint and the other side of the coin.
Ryan.
Ryan, Absolutely agree with the benefits of high level query languages. However I don’t expect these dialects to ever get close enough to the standard because there are limits to what you can do with a relational language running on semi-structured data.
I don’t think the NoSQL crowd that are making these claims are any smarter than those who looked at applying SQL to semi-structured data 10-15 years ago (if anything the opposite). For all this talk about convergence the NoSQL market is relatively tiny. Applying SQL to the semi-structured data wouldn’t be the first stupid thing they have done and it won’t be the last.
Excerpting from ArangoDB documentation, just to show that not every vendor is in concordance.
“ArangoDB does not support pure SQL, as SQL is too limited to cover the different data models in ArangoDB. For example, a nested list structures inside a document or a graph traversal. Following is an example how you can work with subqueries.
This is easier and more structured than the same procedure with SQL.
There is no way to query such structures as the graph traversal in standard SQL, and deviating from standard SQL does not make much sense.”
Ryan, check this out as well: http://www.dbdebunk.com/2015/07/the-sql-and-nosql-effects-will-they.html
CQL looks like SQL but isn’t ! There are no joins, you decide on the columns you can query at design time (limited where functionality) and you can’t easliy order by. Here’s the point, SQL implements set theory and closure, the NoSQL databases aren’t designed to support either of these. CQL is great, but you really need to undertsand why it’s not SQL when you want to use it.