Hot questions for using Cassandra in Scylla

Question:

In Java Spark, I have a dataframe that has a 'bucket_timestamp' column, which represents the time of the bucket that the row belongs to.

I want to write the dataframe to a Cassandra DB. The data must be written with a TTL, and the TTL should depend on the bucket timestamp: each row's TTL should be calculated as ROW_TTL = CONST_TTL - (CurrentTime - bucket_timestamp), where CONST_TTL is a constant TTL that I configured.

Currently I am writing to Cassandra from Spark with a constant TTL, using the following code:

df.write().format("org.apache.spark.sql.cassandra")
            .options(new HashMap<String, String>() {
                {
                    put("keyspace", "key_space_name");
                    put("table, "table_name");
                    put("spark.cassandra.output.ttl, Long.toString(CONST_TTL)); // Should be depended on bucket_timestamp column
                }
            }).mode(SaveMode.Overwrite).save();

One possible way I thought of is: for each possible bucket_timestamp, filter the data by timestamp, calculate the TTL, and write the filtered data to Cassandra. But this seems very inefficient and not the Spark way. Is there a way in Java Spark to provide a Spark column as the TTL option, so that the TTL differs for each row?

The solution should work with Java and Dataset<Row>: I encountered some solutions for doing this with RDDs in Scala, but I didn't find one for Java and dataframes.

Thanks!


Answer:

From the Spark-Cassandra Connector options (https://github.com/datastax/spark-cassandra-connector/blob/v2.3.0/spark-cassandra-connector/src/main/java/com/datastax/spark/connector/japi/RDDAndDStreamCommonJavaFunctions.java) you can set the TTL as:

  • constant value (withConstantTTL)
  • automatically resolved value (withAutoTTL)
  • column-based value (withPerRowTTL)

In your case you could use the last option and compute the TTL as a new column of the starting Dataset, applying the rule you provided in the question, for example as sketched below.
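
A minimal sketch in Java of that column computation, assuming CONST_TTL is expressed in seconds and bucket_timestamp is a timestamp column (the row_ttl column name is only illustrative):

import static org.apache.spark.sql.functions.*;

// Per-row TTL in seconds: row_ttl = CONST_TTL - (now - bucket_timestamp)
Dataset<Row> withTtl = df.withColumn("row_ttl",
        lit(CONST_TTL).minus(
                unix_timestamp(current_timestamp())
                        .minus(unix_timestamp(col("bucket_timestamp")))));

The resulting column name is what you would then hand to the connector's column-based TTL option (withPerRowTTL) when writing through the RDD Java API linked above.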

For a usage example, you can see the test here: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/it/scala/com/datastax/spark/connector/writer/TableWriterSpec.scala#L612

Question:

I have this table in Cassandra:

CREATE TABLE adress (
adress_id uuid,
adress_name text,
key1 text,
key2 text,
key3 text,
key4 text,
effective_date timestamp,
value text,
active boolean,
PRIMARY KEY ((adress_id, adress_name), key1, key2, key3, key4, effective_date)
) 

As I understand it, Cassandra will distribute the data of the adress table based on the partition key, which is (adress_id, adress_name).

There is a risk when I insert too much data that shares the same (adress_id, adress_name).

I would like to run a check before inserting data; the check works like this:

  1. How much data do I already have in Cassandra for the couple (adress_id, adress_name)? Let's suppose it's 5 MB.
  2. I need to check that the size of the data I'm trying to insert doesn't exceed the Cassandra per-partition limit minus the size of the existing data in Cassandra.

My question is: how do I query Cassandra to get the size of the data for the couple (adress_id, adress_name)? And what is the size limit of a partition in Cassandra?


Answer:

As Alex Ott noted above, you should spend more time on the data model to avoid the possibility of huge partitions in the first place, either by organizing your data differently or by artificially splitting partitions into more pieces (time-series workloads, for example, often put each day's data in a separate partition, as sketched below).
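
Purely as an illustration of that splitting idea (the day column, the adress_by_day table name, and the choice of daily buckets are assumptions, not part of your current model), the table could put each day in its own partition:

CREATE TABLE adress_by_day (
    adress_id uuid,
    adress_name text,
    day date,              -- bucketing column: one partition per (adress_id, adress_name, day)
    key1 text,
    key2 text,
    key3 text,
    key4 text,
    effective_date timestamp,
    value text,
    active boolean,
    PRIMARY KEY ((adress_id, adress_name, day), key1, key2, key3, key4, effective_date)
);

Writers would derive day from effective_date, and a time-range read would query only the relevant day partitions, so no single partition can grow without bound.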

It is technically possible to figure out the existing size of a partition, but it will never be efficient. To understand why, recall how Cassandra stores data. The content of a single partition isn't always stored in the same sstable (on-disk file) - data for the same partition may be spread across multiple files. One file may have a few rows, another file may have a few more rows, a third file may delete or modify some old rows, and so on. To figure out the size of the partition, Cassandra would need to read all this data, merge it together, and measure the size of the result. Cassandra does not normally do this on writes - it just writes the new update to memory (and eventually to a new sstable), without reading the old data first. This is what makes writes in Cassandra so fast - and your idea of reading the entire partition before each write would drastically slow them down.

Finally while Cassandra does not handle huge partitions very well, there is no inherent reason why it never could if the developers wanted to solve this issue. The developers of the Cassandra clone Scylla a worried about this issue, and are working to improve it, but even in Scylla the handling of huge partitions isn't perfect yet. But eventually it will be. Almost - there will always be a limit for the size of a single partition (which, by definition, is stored on a single node) as the size of a single disk. This limit too may become a serious problem if your data model is really broken and you can end up with a terabyte in a single partition.