Hot questions about using blobs in Cassandra

Question:

One problem with blob for me is that, in Java, ByteBuffer (which blob maps to in Cassandra) is not Serializable and hence does not work well with EJBs.

Considering the JSON is fairly large, which is the better type for storing JSON in Cassandra: text or blob?

Does the size of the JSON matter when deciding between blob and text?

In another database, such as Oracle, it would be common to use blob/clob. But in Cassandra, where each cell can hold as much as 2 GB, does it matter?

Please treat this question as a choice between text and blob for this case, rather than resorting to suggestions about whether to use a single column for the JSON.


Answer:

I don't think there's any benefit to storing the literal JSON data as a BLOB in Cassandra. At best your storage costs are identical, and in general the APIs are less convenient for working with BLOB types than for working with strings/text.

For instance, if you're using their Java API, then in order to store the data as a BLOB using a parameterized PreparedStatement, you first need to load it all into a ByteBuffer, for instance by pushing your JSON data through an InputStream.

Unless you're dealing with very large JSON snippets that force you to stream your data anyway, that's a fair bit of extra work to get access to the BLOB type. And what would you gain from it? Essentially nothing.
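To make that extra step concrete, here is a minimal sketch using the DataStax Java driver; the json_docs table, its columns, and the sample JSON are hypothetical, and the sketch simply wraps the raw bytes rather than going through an InputStream:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical table: CREATE TABLE json_docs (id uuid PRIMARY KEY, doc blob)
PreparedStatement ps = session.prepare("INSERT INTO json_docs (id, doc) VALUES (?, ?)");

String json = "{\"name\":\"example\",\"count\":42}";
// The extra step: wrap the raw UTF-8 bytes in a ByteBuffer before binding.
ByteBuffer blob = ByteBuffer.wrap(json.getBytes(StandardCharsets.UTF_8));
session.execute(ps.bind(UUID.randomUUID(), blob));

// With a text column, the String binds directly and no wrapping is needed.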

However, I think there's some merit in asking 'Should I store JSON as text, or gzip it and store the compressed data as a BLOB?'.

And the answer to that comes down to how you've configured Cassandra and your table. In particular, as long as you're using Cassandra version 1.1 or later, your tables have compression enabled by default. That may be adequate, particularly if your JSON data is fairly uniform across each row.

However, Cassandra's built-in compression is applied table-wide, rather than to individual rows. So you may get a better compression ratio by manually compressing your JSON data before storage, writing the compressed bytes into a ByteBuffer, and then shipping the data into Cassandra as a BLOB.
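A minimal sketch of that manual compression step, using the JDK's built-in gzip support (the helper name and the prepared statement are illustrative, not part of any driver API):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Compress the JSON on the client before shipping it to Cassandra as a BLOB.
static byte[] gzipBytes(String json) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
        gzip.write(json.getBytes(StandardCharsets.UTF_8));
    }
    return out.toByteArray();
}

// Usage with a hypothetical prepared statement on a blob column:
// session.execute(ps.bind(id, ByteBuffer.wrap(gzipBytes(json))));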

So it essentially comes down to a tradeoff in terms of storage space vs. programming convenience vs. CPU usage. I would decide the matter as follows:

  1. Is minimizing the amount of storage consumed your biggest concern?
    • If yes, compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #2.
  2. Is Cassandra's built-in compression available and enabled for your table?
    • If no (and if you can't enable the compression), compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #3.
  3. Is the data you'll be storing relatively uniform across each row?
    • Probably for JSON data the answer is 'yes', in which case you should store the data as text and let Cassandra handle the compression;
    • Otherwise proceed to #4.
  4. Do you want efficiency, or convenience?
    • Efficiency: compress the JSON data and store the compressed bytes as a BLOB.
    • Convenience: compress the JSON data, base64-encode the compressed data, and then store the base64-encoded data as text (a short sketch of this follows the list).
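For option 4's convenience route, a short sketch building on the gzipBytes helper above (again, the names are illustrative):

import java.util.Base64;

// Gzip the JSON, then base64-encode it so it can live in an ordinary text column.
String textSafe = Base64.getEncoder().encodeToString(gzipBytes(json));

// Hypothetical statement against a text column:
// session.execute(psText.bind(id, textSafe));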

Question:

I have an old column family which has a column named "value" which was defined as a blob data type. This column usually holds two numbers separated by an underscore, like "421_2".

When I'm using the Python DataStax driver and execute the query, the results come back with that field parsed as a string:

In [21]: session.execute(q)
Out[21]: 
[Row(column1=4776015, value='145_0'),
 Row(column1=4891778, value='114_0'),
 Row(column1=4891780, value='195_0'),
 Row(column1=4893662, value='105_0'),
 Row(column1=4893664, value='115_0'),
 Row(column1=4898493, value='168_0'),
 Row(column1=4945162, value='148_0'),
 Row(column1=4945163, value='131_0'),
 Row(column1=4945168, value='125_0'),
 Row(column1=4945169, value='211_0'),
 Row(column1=4998426, value='463_0')]

When I use the Java driver I get a com.datastax.driver.core.Row object back. When I try to read the value field with, for example, row.getString("value"), I get the expected InvalidTypeException: Column value is of type blob. It seems the only way to read the field is via row.getBytes("value"), and then I get back a java.nio.HeapByteBuffer object.

The problem is, I can't seem to convert this object to a string in an easy fashion. Googling yielded two answers from 2012 that suggest the following:

String string_value = new String(result.getBytes("value"), "UTF-8");

But such a String constructor doesn't seem to exist anymore. So, my questions are:

  1. How do I convert a HeapByteBuffer into a String?
  2. How come the Python driver converted the blob easily and the Java one did not?

Side note: I could debug the Python driver, but currently that seems like too much work for something that should be trivial. (And the fact that no one has asked about it suggests I'm missing something simple here...)


Answer:

An easier way is to change the CQL statement:

select column1, blobastext(value) from YourTable where key = xxx

The second column will then come back as a String.
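If you would rather keep the query as-is, here is a minimal Java-side sketch of decoding the returned buffer, assuming the blob really contains UTF-8 text as it does here (the column name is taken from the question):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Decode a copy (duplicate) so the row's buffer position is left untouched.
ByteBuffer buffer = row.getBytes("value");
String value = StandardCharsets.UTF_8.decode(buffer.duplicate()).toString();

As for the second question: the Python driver isn't really converting anything; under Python 2 it hands the blob back as a byte string (str), which prints as readable text because the stored bytes happen to be ASCII, whereas Java's ByteBuffer carries no implicit text interpretation.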

Question:

I have a lot of row data for events. These events share some common identifiers, but also have data unique to the event.

Since Cassandra does not allow WHERE clauses over fields that are not indexed, is there any advantage to making a column for each of these keys, or would it be easier to just store the JSON data in a blob?

Example

The table has a key on (event_type, timestamp), partitioned by event_type.

event_type "click" might have: ts: 12345, page_uri: "ABC1234", user_id: "1235567", click_location: "1234,564"

event_type "keyboard_input" might have: ts: 1245, page_uri: "ABC345", words_entered: "foobar"

Is there any advantage to storing this as a table with columns:

event_type, ts, page_uri, user_id, click_location, words_entered

vs a table

event_type, ts, blob

where blob is a JSON-encoded map of the rest of the data.


Answer:

A few immediate advantages to storing the data in columns come to mind:

  1. You can add a new index at a later date if you determine you suddenly need to query by another field.
  2. You can update specific fields only without updating the entire row. With a blob you need to read and update the entire blob. Concurrent updates to different fields could be wiped out in the blob case as a result (a sketch of this follows the list).
  3. You can write less data with each update due to #2 above.
  4. You can read less data with each select if you do not need all of the columns.
  5. #3 and #4 above may lead to a performance benefit.
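To make #2 and #3 concrete, a hedged sketch against two hypothetical tables, one with explicit columns and one with a single JSON payload column (the table and column names are made up for illustration):

// Explicit columns: update just the field that changed.
session.execute(
    "UPDATE events_by_type SET user_id = ? WHERE event_type = ? AND ts = ?",
    "1235567", "click", 12345L);

// Single blob/text payload: every change is a read-modify-write of the whole
// document, and two concurrent writers can silently overwrite each other.
Row row = session.execute(
    "SELECT payload FROM events_blob WHERE event_type = ? AND ts = ?",
    "click", 12345L).one();
String json = row.getString("payload");                   // read everything
String updated = json.replace("\"1235567\"", "\"999\"");  // stand-in for a real JSON edit
session.execute(
    "UPDATE events_blob SET payload = ? WHERE event_type = ? AND ts = ?",
    updated, "click", 12345L);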

Question:

I am trying to insert some Java objects as a list into Cassandra. I have a serializable Java class, and I insert the objects as follows:

public void storeFeatures(java.util.Vector features) {
    java.util.UUID uuid = java.util.UUID.randomUUID();
    java.util.List<ByteBuffer> ls = new java.util.ArrayList<ByteBuffer>();
    for (int i = 0; i < features.size(); i++) {
        my.package.Feature f = (my.package.Feature) features.elementAt(i);
        byte[] b = (new ObjectToBytes()).getBytes(f);
        ls.add(java.nio.ByteBuffer.wrap(b));
    }
    session.execute("INSERT INTO mytable (id, features) VALUES (?, ?)", uuid, ls);
}

But somehow I cannot retrieve them with:

public java.util.List getFeatures() {
    java.util.List<ByteBuffer> ret = null;
    ResultSet rows = session.execute("SELECT * FROM imgs limit 10");
    for (Row row : rows) {
        row.getBytes("features");
        ret = row.getList("features", java.nio.ByteBuffer.class);
    }
    return ret;
}

I get

com.datastax.driver.core.exceptions.InvalidTypeException: Column features is of type list<blob>
    at com.datastax.driver.core.ColumnDefinitions.checkType(ColumnDefinitions.java:291)

How can I retrieve list<blob> columns as List<ByteBuffer>?


Answer:

To leave an answer: there was an unnecessary call to row.getBytes("features"), and that is what caused the error. The column is of type list<blob>, so it should be read only with row.getList("features", ByteBuffer.class).
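For completeness, a sketch of the retrieval method from the question with that call removed:

public java.util.List<ByteBuffer> getFeatures() {
    java.util.List<ByteBuffer> ret = null;
    ResultSet rows = session.execute("SELECT * FROM imgs limit 10");
    for (Row row : rows) {
        // list<blob> columns are read with getList, not getBytes.
        ret = row.getList("features", java.nio.ByteBuffer.class);
    }
    return ret;
}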