Saturday, May 10, 2014

Understand cassandra read path by tracing in CQL

In our last article, we explored cassandra 2.0.7 write path and in this article, we will explore cassandra read path. We will again follow the same investigation method we used on write for read too. That is, we will trace the read path by turning on tracing in cqlsh.

Let's start by enabling tracing and consistency to all. Then issue statement select and start to dig into code. Below are the output of the commands executed in cqlsh and output in cassandra system.log
cqlsh:jw_schema1> consistency all;
Consistency level set to ALL.
cqlsh:jw_schema1> tracing on;
Now tracing requests.
cqlsh:jw_schema1> select * from users;

user_id | age | first | last | middle
---------+-----+-----------+-------+--------
4 | 10 | john30003 | smith | junior
3 | 10 | john30003 | smith | junior
5 | 10 | john30003 | smith | junior
2 | 10 | john30003 | smith | junior

(4 rows)

Tracing session: 66a845c0-d5f3-11e3-bd26-a322c40b8b81

activity | timestamp | source | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+---------------+----------------
execute_cql3_query | 22:25:12,732 | <node3_ip> | 0
Message received from /<node3_ip> | 22:25:11,106 | <node2_ip> | 26
Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:11,106 | <node2_ip> | 289
Read 1 live and 0 tombstoned cells | 22:25:11,106 | <node2_ip> | 442
Read 1 live and 0 tombstoned cells | 22:25:11,106 | <node2_ip> | 581
Read 1 live and 0 tombstoned cells | 22:25:11,106 | <node2_ip> | 658
Read 1 live and 0 tombstoned cells | 22:25:11,106 | <node2_ip> | 724
Scanned 4 rows and matched 4 | 22:25:11,106 | <node2_ip> | 760
Enqueuing response to /<node3_ip> | 22:25:11,106 | <node2_ip> | 785
Sending message to /<node3_ip> | 22:25:11,107 | <node2_ip> | 954
Message received from /<node3_ip> | 22:25:12,430 | <node1_ip> | 76
Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:12,431 | <node1_ip> | 1054
Read 1 live and 0 tombstoned cells | 22:25:12,432 | <node1_ip> | 1250
Read 1 live and 0 tombstoned cells | 22:25:12,432 | <node1_ip> | 1399
Read 1 live and 0 tombstoned cells | 22:25:12,432 | <node1_ip> | 1537
Read 1 live and 0 tombstoned cells | 22:25:12,432 | <node1_ip> | 1718
Scanned 4 rows and matched 4 | 22:25:12,432 | <node1_ip> | 1777
Enqueuing response to /<node3_ip> | 22:25:12,432 | <node1_ip> | 1935
Sending message to /<node3_ip> | 22:25:12,433 | <node1_ip> | 2212
Parsing select * from users LIMIT 10000; | 22:25:12,732 | <node3_ip> | 148
Preparing statement | 22:25:12,732 | <node3_ip> | 259
Determining replicas to query | 22:25:12,733 | <node3_ip> | 941
Enqueuing request to /<node2_ip> | 22:25:12,738 | <node3_ip> | 5556
Enqueuing request to /<node1_ip> | 22:25:12,738 | <node3_ip> | 5645
Enqueuing request to <node3_hostname>/<node3_ip> | 22:25:12,738 | <node3_ip> | 5688
Sending message to /<node2_ip> | 22:25:12,738 | <node3_ip> | 5811
Sending message to /192.168.0.2 | 22:25:12,738 | <node3_ip> | 5817
Sending message to /<node1_ip> | 22:25:12,738 | <node3_ip> | 6133
Message received from /<node3_ip> | 22:25:12,739 | <node3_ip> | 6558
Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:12,740 | <node3_ip> | 7294
Read 1 live and 0 tombstoned cells | 22:25:12,740 | <node3_ip> | 7506
Read 1 live and 0 tombstoned cells | 22:25:12,740 | <node3_ip> | 7698
Read 1 live and 0 tombstoned cells | 22:25:12,740 | <node3_ip> | 8222
Read 1 live and 0 tombstoned cells | 22:25:12,741 | <node3_ip> | 8570
Scanned 4 rows and matched 4 | 22:25:12,741 | <node3_ip> | 8634
Enqueuing response to /<node3_ip> | 22:25:12,741 | <node3_ip> | 8689
Sending message to /192.168.0.2 | 22:25:12,741 | <node3_ip> | 8821
Message received from /<node3_ip> | 22:25:12,742 | <node3_ip> | null
Processing response from /<node3_ip> | 22:25:12,742 | <node3_ip> | null
Message received from /<node1_ip> | 22:25:13,029 | <node3_ip> | null
Processing response from /<node1_ip> | 22:25:13,029 | <node3_ip> | null
Message received from /<node2_ip> | 22:25:13,061 | <node3_ip> | null
Processing response from /<node2_ip> | 22:25:13,061 | <node3_ip> | null
Read 5 live and 0 tombstoned cells | 22:25:13,086 | <node3_ip> | 353631
Read 5 live and 0 tombstoned cells | 22:25:13,087 | <node3_ip> | 355232
Read 5 live and 0 tombstoned cells | 22:25:13,093 | <node3_ip> | 360675
Read 5 live and 0 tombstoned cells | 22:25:13,093 | <node3_ip> | 360908
Request complete | 22:25:13,093 | <node3_ip> | 361266

cqlsh:jw_schema1>

TRACE [Thrift:186] 2014-05-07 22:25:12,733 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@43e049 @CL.ALL
DEBUG [Thrift:186] 2014-05-07 22:25:13,604 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-07 22:25:13,605 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@17100f1 @CL.ONE
DEBUG [Thrift:186] 2014-05-07 22:25:13,911 Tracing.java (line 159) request complete
DEBUG [Thrift:186] 2014-05-07 22:25:13,915 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-07 22:25:13,916 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@d34f6 @CL.ONE
DEBUG [Thrift:186] 2014-05-07 22:25:14,227 Tracing.java (line 159) request complete

As write entry path is execute_cql3_query, so is read path. If you trace the code down, it will be too much to even start the discussion. I summarize the points below in tandem with the output of cqlsh tracing and system.log where applicable. Thus it may not be complete but I will give you the link as narration goes so that you can study yourself in detail.

It started at CassandraServer.execute_cql3_query(...)  as indicated in cqlsh tracing output. So basically the work done can be summarize by this line:
cState.getCQLQueryHandler().process(queryString, cState.getQueryState(), new QueryOptions(ThriftConversion.fromThrift(cLevel), Collections.<ByteBuffer>emptyList())).toThriftResult();

If you step into the code above, QueryProcessor.process(...)  which implement the interface QueryHandler  which get a valid CQLStatement. The execution continue by calling method QueryProcessor.processStatement(...). Notice that the logger in this method is shown in cassandra system.log (of cause you need to enable tracing for this class in log4j.properties in order for this line to log successfully). So access checking and validation are perform here. When checking and validation were done, then CQLStatement.execute(...) is executed.

Because we are executing select statement, the correspond class that implement interface CQLStatement is SelectStatement.  Extract from SelectStatement.execute(...)
public ResultMessage.Rows execute(QueryState state, QueryOptions options) throws RequestExecutionException, RequestValidationException
{
ConsistencyLevel cl = options.getConsistency();
List<ByteBuffer> variables = options.getValues();
if (cl == null)
throw new InvalidRequestException("Invalid empty consistency level");

cl.validateForRead(keyspace());

int limit = getLimit(variables);
long now = System.currentTimeMillis();
Pageable command;
if (isKeyRange || usesSecondaryIndexing)
{
command = getRangeCommand(variables, limit, now);
}
else
{
List<ReadCommand> commands = getSliceCommands(variables, limit, now);
command = commands == null ? null : new Pageable.ReadCommands(commands);
}

int pageSize = options.getPageSize();
// A count query will never be paged for the user, but we always page it internally to avoid OOM.
// If we user provided a pageSize we'll use that to page internally (because why not), otherwise we use our default
// Note that if there are some nodes in the cluster with a version less than 2.0, we can't use paging (CASSANDRA-6707).
if (parameters.isCount && pageSize <= 0 && MessagingService.instance().allNodesAtLeast20)
pageSize = DEFAULT_COUNT_PAGE_SIZE;

if (pageSize <= 0 || command == null || !QueryPagers.mayNeedPaging(command, pageSize))
{
return execute(command, cl, variables, limit, now);
}
else
{
QueryPager pager = QueryPagers.pager(command, cl, options.getPagingState());
if (parameters.isCount)
return pageCountQuery(pager, variables, pageSize, now);

// We can't properly do post-query ordering if we page (see #6722)
if (needsPostQueryOrdering())
throw new InvalidRequestException("Cannot page queries with both ORDER BY and a IN restriction on the partition key; you must either remove the "
+ "ORDER BY or the IN and sort client side, or disable paging for this query");

List<Row> page = pager.fetchPage(pageSize);
ResultMessage.Rows msg = processResults(page, variables, limit, now);
if (!pager.isExhausted())
msg.result.metadata.setHasMorePages(pager.state());
return msg;
}
}

The execution continue to get the Pageable Command . Execution continue to private method execute(...). Then method getRangeSlice(...) is called. This is the actual work done to retrieve all the rows. This method implementation does a lot of works and I would recommend you click on the link and study the code yourself to get a better picture.

When the control is returned, the rows are sent for further processing using method processResults(...)  which eventually return the result back to the cassandra client.

As you may have notice, the upper layer execution is similar as to write path execution, until control passed to CQLStatement. That's it for this article, I hope you like it.

No comments:

Post a Comment