Skip to main content

Cassandra : Sub-Query Implementation and Cached prepared statement


Introduction:

While familiarizing with Cassandra, I felt the unavailability of Sub-query is polluting my application by iterating over the first query result, and the next round trip is to fetch actual data. The main reason behind is the normalized approach that I have taken while designing column families. The RDBMS style of table design is not fair for non-structured NoSQL data stores. We can keep the reference table contents in a single table as separate columns. But some use cases will not allow as keeping everything in a single column family. (Especially when we consider the performance degradation caused by the compaction when the traffic to a single column family increased since compaction are per column family based).Also if we keep the index structure in a separate table without using the inbuilt secondary index provided by Cassandra. (Separate wide row implementation of index data will allow as to perform equal, range queries against the data), two queries have to be fired. (One for index and then for the actual object). The join/sub-query feature will always useful in these conditions. 

             The performance degradation of my application was due to excessive network usage and client machine CPU usage. If my first query contains 50 results, then another 50 query has to be executed to achieve the sub-query functionality.

SubQuery:

             Since there was no support for Join/Sub-query in Cassandra, I was forced to think about implementing it by myself and, I was so eager to find the performance improvement that brings to my application. I found the implementation is simple since the query that we submit to Cassandra through JDBC driver is reached at the server side as it is. so the only task was to split the query into sub-queries and execute it based on the priority (Right to Left and Child to Parent). The sub-query results should be set into the parent query and execute it at the end. 

             The challenge that I have faced due to the complexity of queries when it contains multiple subcomponents(vertically and horizontally). Logic has been developed to achieve the smooth functioning of sub-queries, that can have N number of vertical and horizontal children.
            The user should be aware of the amount of result each child query is returned, to avoid memory overflow. This is not a full-fledged feature to handle all these kinds of scenarios.

Prepared Statements for SubQuery:

             Later on, I have noticed that my application wants to execute same sub-queries multiple times. When the sub-query structure is more complex, its metadata generation/parsing time will be more at the server side. This leads to the thoughts that, to maintain the prepared statements in the server side. The prepared statement will always help to reduce time consumption. As like normally prepared statement, the query will be prepared only once and can be executed multiple times by simply changing the bind variables. Only the most recently used prepared statements are cached in order to avoid memory leakage.


SubQuery along with IN operator.

             The following query finds the names of the items that have an order placed in NY. The inner query returns a list of items that satisfy the search criteria and the outer query will make use of this list to find matching entries.

SELECT ItemName FROM Items WHERE ItemID IN (SELECT ItemID FROM Orders WHERE place='NY')

SubQuery along with  = operator.

SELECT ItemName FROM Items  WHERE ItemID =
                                       ( SELECT ItemID FROM Orders WHERE place= 'NY');

SubQuery with multiple levels of nesting 

              Below is an example of a complicated query that has multiple levels of sub-queries connected with AND operator. This is to find the items sold to customers in Belgium.

SELECT ItemID, ItemName FROM Items WHERE ItemID=
    (SELECT ItemID  FROM Orders WHERE CustomerID IN
         (SELECT CustomerID FROM Customers WHERE Country = ‘Belgium’))

How to get it?

            I have used  Cassandra-JDBC driver to test this feature. There is no Syntax check for the query in client side, unlike Datastax java driver. The  DataStax Java driver returns syntax error by checking the query from the client side itself. 

The code is published in Git. Here

Newly added one package org/apache/cassandra/cql3/subquery and Altered an existing file CassandraServer.java. I have done the experiment on 3.0 branch.  You can download Cassandra source and build it with these changes.

Comments

  1. The problem here isn't the lack of subqueries, but how you model your data. You're trying to use Cassandra as a relational DB, which it isn't. If you simply denormalized you'd get better performance and wouldn't need to hack subquery support in.

    ReplyDelete
  2. Yes, you are right. But If we stores the index separately in another table as wide row, we can retrieve the object in main table by the keys retrieved from index table in a single query. No need of two round trip here in this case. There are many use cases where can be use this feature to gain more performance.

    ReplyDelete
    Replies
    1. But that doesn't address that the one node, receiving the query, becomes the coordinator?!

      Delete
    2. Yes, the one node that receives the full query will break and execute each of them. So that node will be the coordinator in this case . The advantage is less round-trip from client to Cassandra cluster.

      Delete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Is it possible to write subquery inside lucene index filter?

    where lucene = '{ filter : {type:"match", field:"definition_code", value:"Contract" }}

    At the place of "Contract" I want to write a sub query. Is it possible?

    ReplyDelete

Post a Comment