In Cassandra, reads are more expensive than writes. A write is appended to a commit log and applied to an in-memory structure called a memtable, which is eventually flushed to disk.

Reads, however, need to query the memtable and potentially multiple SSTables (on-disk files), a more expensive operation. Lots of concurrent reads as users interact with servers can hotspot a partition, which we refer to imaginatively as a “hot partition”. The size of our dataset when combined with these access patterns led to struggles for the cluster.
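To make the cost difference concrete, here is a minimal toy sketch of the write and read paths described above. It is not Cassandra's real API: the class `ToyNode`, the `memtable_limit` parameter, and the dict-based "SSTables" are all assumptions for illustration. Real Cassandra also uses bloom filters and indexes to skip SSTables that cannot contain the key, which this sketch leaves out.

```python
class ToyNode:
    """Toy model of one node's storage engine (hypothetical, not Cassandra's API)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []          # append-only log, kept for durability
        self.memtable = {}            # in-memory key -> value
        self.sstables = []            # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # A write is one append plus one in-memory update: cheap.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a new immutable "SSTable" and start fresh.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        """Return (value, number_of_structures_checked)."""
        checked = 1                   # the memtable is always consulted first
        if key in self.memtable:
            return self.memtable[key], checked
        for table in reversed(self.sstables):   # newest SSTable first
            checked += 1
            if key in table:
                return table[key], checked
        return None, checked


node = ToyNode(memtable_limit=2)
for i in range(6):
    node.write(f"k{i}", i)

print(node.read("k5"))   # (5, 2): found in the newest SSTable
print(node.read("k0"))   # (0, 4): memtable plus all three SSTables checked
```

Every write costs the same two steps, while the cost of a read grows with the number of SSTables it has to consult.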

WHEN IT’S A “YES”

Cassandra is by nature good for heavy write workloads. Inter-node data distribution is quick and writes are cheap, so handling hundreds of thousands of write operations per second is just a regular Tuesday for Cassandra.

If you’re planning to distribute data across ***multiple data centers and cloud availability zones***, Cassandra is a good fit too. Your users in Boston and in Honolulu will access their local data centers (which is faster) but will work with the same pools of data.

Example applications:

  • Sensor data
  • Messaging systems
  • Fraud detection for banks

WHEN IT’S A “NO”

  • No ACID transactions
  • Lots of updates and deletes:
    • Cassandra is incredible at writes, but it’s append-oriented. If you need to *update* a lot, Cassandra’s no good: for each update, it just adds a ‘younger’ version of the data with the same primary key. Imagine how agonizing it can be for reads to find the needed version in a pool of ‘lookalikes.’ What’s more, Cassandra handles deletes similarly: it adds a tombstone to the data without actually deleting it. Thus, reads targeting the same primary key uncover lots of ‘undead’ data instead of just the up-to-date values. From time to time, compaction takes place and the unnecessary data does get deleted, but in between compactions, reads take longer.
  • Lots of scans:
    • Cassandra reads data pretty well, but only as long as you know the primary key of the data you want. If you don’t, Cassandra will have to scan all nodes to find what you need, which will take a while. And if the scan exceeds the configured timeout, it will not complete at all.
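The update and delete behavior described above can be sketched with a small toy model. Everything here is an assumption for illustration: each "SSTable" is a plain dict of `key -> (timestamp, value)`, and `TOMBSTONE`, `read_latest`, and `compact` are hypothetical names, not Cassandra's API. Real Cassandra also keeps tombstones for a grace period (`gc_grace_seconds`) before compaction may drop them, which this sketch ignores.

```python
TOMBSTONE = object()   # marker written in place of a value on delete


def read_latest(sstables, key):
    """Collect every stored version of key; return (value, versions_seen)."""
    versions = [table[key] for table in sstables if key in table]
    if not versions:
        return None, 0
    _, value = max(versions, key=lambda v: v[0])   # newest timestamp wins
    return (None if value is TOMBSTONE else value), len(versions)


def compact(sstables):
    """Merge all tables, keeping only the newest version and dropping tombstones."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return [{k: v for k, v in merged.items() if v[1] is not TOMBSTONE}]


# Three updates to user:1, then an insert and a delete of user:2.
sstables = [
    {"user:1": (1, "a@example.com")},
    {"user:1": (2, "b@example.com"), "user:2": (2, "x@example.com")},
    {"user:1": (3, "c@example.com")},
    {"user:2": (4, TOMBSTONE)},
]

print(read_latest(sstables, "user:1"))    # ('c@example.com', 3)
print(read_latest(sstables, "user:2"))    # (None, 2)

compacted = compact(sstables)
print(read_latest(compacted, "user:1"))   # ('c@example.com', 1)
print(read_latest(compacted, "user:2"))   # (None, 0)
```

Before compaction, a read has to reconcile three versions of `user:1` and still touches the deleted `user:2`; after compaction, one version of `user:1` remains and the tombstoned row is gone.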

Expose your data with REST/gRPC/GraphQL

https://stargate.io/


https://www.threatstack.com/blog/scaling-cassandra-lessons-learned

https://www.cloudwalker.io/2020/05/17/monitoring-cassandra-with-prometheus/

https://shermandigital.com/blog/designing-a-cassandra-data-model/

https://discord.com/blog/how-discord-stores-billions-of-messages

