The reason why Cassandra is not good for ETL
I’ve recently been working with Cassandra; we have had it in production for 3 months now. The main project I’m dealing with is a simple event API that collects event logs into a central data store. We are currently running 5 Cassandra nodes and 4 web server nodes, and we have done a few calls with DataStax for support and to get some questions answered.
The problem
In our project, the main aim is to incrementally dump data from the Cassandra datastore into Vertica, which supports aggregate queries with great query execution performance. We used to handle this with MySQL before migrating to Cassandra. That worked, but MySQL is not good enough for handling 2 billion+ rows in a single table.
Workarounds for migrating to Cassandra
To make time-series ETL work, you need to add 2-3 fields to each column family. In Cassandra, the key field on a column family only allows EQ comparison, which means you can only read data with a query of the following form;
select ... from my_columnfamily where c_date = '2013-11-11';
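For contrast, a range predicate directly on the partition key is rejected outright; here is a sketch against the same hypothetical column family:
-- rejected: only EQ comparison is allowed on the partition key
select * from my_columnfamily where c_date > '2013-11-01';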
This means you need to add a c_date field in order to get the rows inserted on a specific date. However, I don’t want to run the ETL code once per day; I want the data present in Vertica with at most a 10-minute delay. So, if we design the Cassandra table with c_date as the only key field, we can read all the data for that date. But what if I also want a timestamp comparison, to read only the data after a specific point in time?
For that, you need to add one more field: a clustering column of type timeuuid. So, if you are already in production without a timeuuid column, you have almost no chance of changing that. With this design, (c_date, c_id) (c_date being the partition key, c_id the clustering key), you can now run the following CQL query;
select * from my_cf
where
c_date = '2013-11-11' and
c_id > mintimeuuid('2013-11-11 11:00');
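For reference, here is a minimal sketch of what the table definition behind this query could look like; the payload column is purely illustrative:
create table my_cf (
    c_date text,      -- partition key: the day the event was inserted
    c_id timeuuid,    -- clustering key: the event time as a timeuuid
    payload text,     -- illustrative event payload
    primary key (c_date, c_id)
);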
By keeping this redundant data, you gain the ability to dump data incrementally from Cassandra to Vertica.
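In practice, each ETL run only needs to remember the largest c_id it has already shipped to Vertica and resume from there; the timeuuid literal below is just a placeholder for that bookmark:
-- resume from the last timeuuid already loaded into vertica
select * from my_cf
where
c_date = '2013-11-11' and
c_id > 3d4f4680-4aa3-11e3-8f96-0800200c9a66;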
Moreover, if you don’t use cqlsh (Cassandra’s native client, which uses the Thrift protocol), the query may result in timeouts. So, you need to guess which queries are likely to time out (!).
Currently, cqlsh doesn’t support running queries headlessly or returning results as CSV. This brings even more challenges for ETL, because I need the data in CSV format in order to import it.
What we did to overcome this challenge was to change the cqlsh code itself: you have to hack the cqlsh client and add a switch that returns query results in CSV format.
Happy with these workarounds? I’m currently not. But I didn’t write this blog post to trash Cassandra or DataStax. Still, Cassandra IS NOT a MySQL replacement, and tech people need to know that. It’s a fantastic distributed key-value store, but currently it’s nothing more than that. Cassandra’s developers keep saying it’s designed for time-series data or that it’s good at ETL, but it isn’t. It is a scalable distributed key-value store, nothing more.