Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • Login
  • Register
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

Distributed NoSQL: Cassandra

byHaifeng Li
August 25, 2014
in Trends
Home Trends

8006385832_395d4f3815_k

In previous posts Distributed NoSQL: HBase and Accumulo and Distributed NoSQL: Riak, we explored two very different designs of key-value pair databases. In this post we will learn about Apache Cassandra, a hybrid of BigTable’s data model and Dynamo’s system design. With BigTable-like column/column family in mind, Cassandra provides a more flexible data model than Riak. Modeled after Dynamo’s system design, Cassandra has linear scalability and proven fault-tolerance on commodity hardware. Besides, Cassandra’s support for replicating across multiple datacenters is best-in-class. Since many features of Cassandra were already covered in previous posts as they are shared with HBase/Accumulo and Riak, we will focus on the additional unique features in what follows.

Data Model

Cassandra provides a two-dimensional row-column view to the data contained in a keyspace (i.e. table in HBase). Keyspaces are used to group column families together. If you need higher dimension to organize application data, there is the concept of super columns, which are columns that contain columns. However, super column is deprecated because of performance issues. Instead, developers are encouraged to use composite columns that was introduced in version 0.8.1. Before jumping into composite columns, we need to understand column sorting. Just like other key-value pair databases, the data type of keys and values in Cassandra are byte arrays. More interestingly, we can specify how column names will be compared for sort order when results are returned to the client. But why would I want to sort column names? This especially sounds strange to a relational database developer. In a relational database, we usually have tall tables, i.e. millions skinny rows with a handful columns. We could still follow this design in Cassandra although different rows don’t have to share same column set. On the other hand, one wide row could have millions columns in BigTable-like database, actually up to 2 billion columns in Cassandra. In this case, column names are usually part of the data (one may even go with valueless columns, i.e. column names are data themselves and the values are not really meaningful), rather than purely schema. For example, we can build inverted index with terms as the keys, document ids as the column names, and frequency as the value. With wide-row design, it is necessary to compare column names sometimes, for instance, each row is the time series of stock price in a day and the column names are time points. You can use compare_with attribute on a column family to tell Cassandra how to sort the columns. The default is BytesType, which is a straightforward lexical comparison of the bytes in each column. Other options are AsciiType, UTF8Type, LexicalUUIDType, TimeUUIDType, and LongType. You can also specify the fully-qualified class name to a class extending org.apache.cassandra.db.marshal.AbstractType. Again, we are sorting column names, not values. However, sorting column name providing a way to build second index, which is very useful in real world. Now come back to composite columns, which are arbitrary dimensional column names that can have types like CompositeType(UTF8Type, ReversedType(TimeUUIDType), LongType)). It’s also really simple: it is implemented as a comparator so adds very little complexity.

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

Storage

Because of the data model, Cassandra has the concepts of SSTable and MemTable, borrowed from Google BigTable. The details can be found in previous post Distributed NoSQL: HBase and Accumulo. On the other hand, Cassandra stores data in the native file system like Riak. Besides, Cassandra doesn’t require ZooKeeper (or any other third-party components). So it is very easy to configure and run. Actually Cassandra starts only one JVM per node, which brings a lot of simplicity to operation and maintenance. Remember how many moving parts in HBase/Accumulo?

Architecture

Same as Riak, Cassandra employs a ring topology but with more partition options. You can provide any IPartitioner implementation to distribute data on nodes. Out of the box, Cassandra provides RandomPartitioner, OrderPreservingPartitioner, ByteOrderedPartitioner, and CollatingOrderPreservingPartitioner. The default is RandomPartitioner to force equal spacing of tokens around the (MD5) hash space, especially for clusters with a small number of nodes. With OrderPreservingPartitioner the keys themselves are used to place on the ring. It brings data locality but also potential bottleneck on hot spots.

Beyond partitions, Cassandra also supports pluggable replication strategies through IReplicaPlacementStrategy to ensure reliability and fault tolerance. Out of the box, Cassandra provides SimpleStrategy (rack unaware), LocalStrategy (rack aware) and NetworkTopologyStrategy (datacenter aware). In addition to setting the number of replicas, the strategy sets the distribution of the replicas across the nodes in the cluster depending on the cluster’s topology. We are particularly interested in NetworkTopologyStrategy. With it, we can deploy the cluster across multiple data centers and specify how many replicas we want in each data center. If configured properly, Cassandra is able to read locally without incurring cross-datacenter latency, and handles failures nicely. The NetworkTopologyStrategy determines replica placement independently within each data center as follows:

  • The first replica is placed according to the partitioner
  • Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed in different nodes in the same rack.

To achieve this, we need a snitch maps IPs to racks and data centers. It defines how the nodes are grouped together within the overall network topology.

Vector Clock and Last-Write-Wins

Like Riak, Cassandra also has read repairs and active anti-entropy to resolve some consistency issues. However, Cassandra doesn’t have vector clocks and simply uses the last-write-wins approach. For sure, last-write-wins is simple but has problems with updates-based-on-existing-values issues. On the other hand, vector clocks are not adequate either in this case unless the data structure is CRDTs (consistency without concurrency control). It is debatable which approach is better. Meanwhile, Cassandra 2.0 provides compare and set based on Paxos consensus protocol, but misleadlingly labels it as “lightweight transaction”.

Summary

Overall, Cassandra is a very nice system with flexible data model, linear scalability and high availability. A Cassandra cluster is also much easier to set up and run than HBase/Accumulo. Besides, Cassandra offers secondary indexes to allow querying by value, which we didn’t discuss in detail in this post.

ALSO IN THIS SERIES:


4524698671_d27be90ea5_zChoosing a NoSQL

In the first installment of his Understanding NoSQL series, guest contributor Haifeng Li gives us a crucial advice for choosing a NoSQL database. Tips include: ignore benchmarking, and look into which problem the developers were originally trying to solve.
 


0b16661Haifeng Li is the Chief Data Scientist at ADP. He has a proven history in delivering end-to-end solutions, having previously worked for Bloomberg and Motorola. He is a technical strategist with deep understanding of computer theory and emerging technologies. He also has a diverse academic background, researching in fields including machine learning, data mining, computer vision, pattern recognition, NLP, and big data. His personal blog can be found here.


Follow @DataconomyMedia

(Image credit: Yuko Honda)

Tags: CassandraChoosing a NoSQLNoSQL

Related Posts

accessiBe review: How the company helps businesses build more accessible digital experiences

accessiBe review: How the company helps businesses build more accessible digital experiences

July 14, 2025
The creator economy is dead. Welcome to the learner economy

The creator economy is dead. Welcome to the learner economy

July 3, 2025
How a crypto wallet keeps digital coins safe: All you need to know

How a crypto wallet keeps digital coins safe: All you need to know

July 3, 2025
Implementing identity first security for stronger access control

Implementing identity first security for stronger access control

April 15, 2025
Switch 2 vs Switch Lite: Is the upgrade worth it?

Switch 2 vs Switch Lite: Is the upgrade worth it?

April 9, 2025
What is airport theory: A TikTok trend that pushes travel limits

What is airport theory: A TikTok trend that pushes travel limits

March 18, 2025
Please login to join discussion

LATEST NEWS

AI agents can be controlled by malicious commands hidden in images

Xiaomi 17 skips 16, uses Snapdragon 8 Elite Gen 5

There are more women using ChatGPT than men now

Google Gemini tops App Store charts with Nano Banana tool

Cloudflare tracks Anthropic’s Claude crawl-to-refer ratio

Inception Point AI releases 3000 weekly podcast episodes

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

Welcome Back!

Sign In with Google
OR

Login to your account below

Forgotten Password? Sign Up

Create New Account!

Sign Up with Google
OR

Fill the forms below to register

All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy Policy.