Information Technology Blogs: cassandra

Showing posts with label cassandra. Show all posts

Sunday, January 18, 2015

how to improve apache cassandra 1.0.8 read speed

This article is for improve reading speed for apache cassandra 1.0.8. Because the reading improvement determined by many factors, we will investigate all possible areas so the gain will be improve collectively. So you may experience these factors and alter according to suit your node environment to achieve the best result. As the cassandra 1.0 released, the official cited that the read performance has increased up to 400%!

First and foremost, there are numerous articles which I use as a reference has cited copyright, I take no ownership nor credit of their hardwork as that is rightfully belong to them entirely. I only reference their work to improve my knowledge and to help people (like me) who need help and came to read what I share in my article.

Let's split these improvements into two parts, the hardware and the software.

hardware

ssd

ssd disk is way faster than hdd disk in term of reading in multiple magnitude, please read cassandra-benchmark for the benchmark. Although the cassandra using was version 0.8.10, but when cassandra 1.0 was released, read gained tremendous improvement. Then these two improvemetns will be linear gain too. Also, it is recommend to read the aforementioned article as it explain why is the random speed will hurt the read performance for hdd disk.

multiple disks

disks allocation to the commit log should be different than the data directory. Because during data write, data is repeating appending to the commit log. If the data directory is located on different drive, read performance gain should be visible.

software

memtable
If write behaviour has a lot of updates, it is good to look into memtable settings. There are two settings which you can start with

memtable_total_space_in_mb

commitlog_total_space_in_mb

more memory to this settings means the frequent write (update) will be absorded by the memory and thus, reading will be fast too as read start from memtable first before going into sstable. But because this impact system wide, you might want to gradually start to increase it and measure them. Read below for more information on what these two settings are and how to tune them
http://www.datastax.com/docs/1.0/operations/tuning#memtable-sizing
http://www.datastax.com/docs/1.0/configuration/node_configuration#memtable-total-space-in-mb
http://www.slideshare.net/driftx/cassandra-summit-2010-performance-tuning slide 14 and slide 26
http://wiki.apache.org/cassandra/MemtableThresholds

WP-DataStax-Cassandra page 16
Specifically, for read performance, Cassandra 1.0 optimizes queries by using a lighter-weight data structure for representing a row fragment from a read, than for a row fragment in a memtable into which updates accumulates. Also, with named reads, Cassandra 1.0 includes enhancements for deserializing the most recent versions of requested columns. Combined with the other optimizations, this makes reads in Cassandra as fast as writes for many workloads.

data compression

Previously I have done a study on the compression affect improvement read, read here, here, here and here. Please read the links as it provide comprehensive explanation than I could describe here.

compaction

compaction can improve (or impact) the read speed. Citation fromWP-DataStax-Cassandra,

The above process produces exceptionally fast write operations; however it also can lead to data fragmentation across the disk. Read requests may have to combine data from many SStables as well as Memtables to satisfy end user requests for data, and this can increase query response times.

To reduce data fragmentation and reclaim space taken by obsolete data, Cassandra performs "compactions" that merge the most recent data from many different SStables on disk into a new one.

So with my experience, if you trigger compaction (major) through nodetool, during compaction, the read latency will increase, thus impact but when the compaction is done, the read performance is improved.

In this documentation, it explain different compaction strategy to use for read or write workload. So identify how's your environment write and read pattern and always measure it so you know what and when it could went wrong. Choose compaction strategy to suit your data model. For instance, if cassandra is not strong at a point, choose other big data technology. Read here for bad experience encountered.

sstables counts

Keep the sstables counts as low as possible for a column family. Excerpt from FULLTEXT02 page 39.

If a read operation is performed, initially the data are read from the memtable. If data are not in the memtable, then data get read from SSTable. Multiple SSTables may be looked up to find the data. Reading directly from SSTable decreases the performance because there are many SSTable that might need to be looked at hence requires an I/O operation means it requires touching the disk. Compared to SSTable, reading directly from memtable is fast because there is no I/O involved. The more the I/O operations are involved, the more performance will be degraded. Performance can also be increased by increasing the size of memtable [7].

Cassandra uses Bloom filter to judge quickly whether the key exists in the SSTable or not before touching the disk. Bloom filter is a efficient data structure that checks whether element is a member of a set by dividing the memory into buckets. Check each bucket to see if a key is present and if any bucket is empty then key was never inserted before. If there are many SSTables, then lots of I/O operations would be needed to read the data which can definitely decrease the performance. This is because of the fact that I/O operations are expensive and therefore compaction is used to improve read performance. Compaction merge two SSTables and sort to become one SSTable, which eventually decreases the number of SSTables and number of I/O operations, hence increasing the performance [7].

key / row cache

key cache should be enable to reduce the search from touching the disk, especially spinning disk. Excerpt from FULLTEXT02 page 47.

By default key cache is enabled and Cassandra caches 20,000 keys per Column Family (CF). The key cache decreases the Input/Output (I/O) operations because if key cache is not enabled then I/O operation is required in order to figure out the exact location of the row. Key cache holds the exact location of the data belonging to that key.

Row cache holds the entire content of the row in cache. By default, row cache is disabled. The overhead of enabling or increasing the row cache is that it may require more Java Virtual Machine (JVM) heap of Cassandra. By if jna lib is available, then storing row cache off heap is a good option. This article has diagram on how read is perform.

concurrent_read

Excerpt from FULLTEXT02 page 48.

Read performance can also be increased by tuning the concurrent reads. The rule is span 4 threads per Central Processing Units (CPU) core in the cluster. The higher the number of threads spanned for read, the higher performance can be achieved if the machines have got faster I/O.

A word of cautious, I tried increase concurrent_read from 32 to 64 and see some unpredictable behaviour, so it is better you do this in test environment.

decreasing read consistency level

If your business requirement can tolerate of eventual consistency, then decrease from quorum to one will improve read speed as only one node acknowledgement is sufficient to fulfill the read request compare to a certain amount of nodes in quorum.

turn off swap space

When the node start to swap due to shortage of memory, the response of node be it write or read will be visible. Hence it is best to turn off swap, and let the operating system kill or jvm kill itself to oom than the page swap start to happen.

java heap

Citation from OS-8.1.3-Cassandra Installation and Configuration Guide page 33

HEAP_NEWSIZE : Size of young generation. Larger value leads to longer GC pause times while smaller value will typically lead to more expensive GC. Set in conjunction with MAX_HEAP_SIZE.

So tune it carefully since this is pretty low level. Read this article as it mentioned a few garbage collector settings for cassandra and memory footprint.

upgrade

Each release of software improve or fix the previous defect, so is cassandra. If upgrade is viable, you should consider. For instance, to quote Aaron Morton

1.0 has key and row caches defined per CF, 1.1 has global ones which are better utilised and easier to manage. 1.2 moves bloom filters and compression meta off heap which reduces GC, which will help. Things normally get faster.

This is also true.

monitoring

Because data load increase and/or decrease will impact the read response time, it is vital if there is monitoring services running. As cited from this paper, p1724_tilmannrabl_vldb2012 Page 1,

In modern enterprise systems it is not uncommon to have thousands of different metrics that are reported from a single host machine.

So monitor crucial metrics by cassandra, example, cpu, java heap and io should give some indicator if your speed has been reduced.

Whilst these are collected knowledge are from public and free will sharing. Any mistake and errors in this article is mine alone and does not reflect to them. Thank you and I hope you learned something.

Saturday, January 17, 2015

Investigate into Apache Cassandra Memtable updateLiveRatio logging output

Today, we are going to study apache cassandra 1.0.8 on memtable logging its statistics in the cassandra system.log. Example below

WARN [MemoryMeter:1] 2014-10-17 07:38:15,346 Memtable.java (line 176) setting live ratio to minimum of 1.0 instead of 0.16714977001091136
INFO [MemoryMeter:1] 2014-10-17 07:38:15,346 Memtable.java (line 186) CFS(Keyspace='OpsCenter', ColumnFamily='pdps') liveRatio is 1.6474114033116056 (just-counted was 1.0).  calculation took 3ms for 595 columns

What does the above logging output means? Is cassandra instance operating normally since logging level is warn? Before we dive into the code to understand its meaning, let first read and understand what is memtable in cassandra context.

Excerpts from datastax documentation

Cassandra is optimized for write throughput. Cassandra writes are first written to a commit log (for durability), and then to an in-memory table structure called a memtable. A write is successful once it is written to the commit log and memory, so there is very minimal disk I/O at the time of write. Writes are batched in memory and periodically written to disk to a persistent table structure called an SSTable (sorted string table). Memtables and SSTables are maintained per column family. Memtables are organized in sorted order by row key and flushed to SSTables sequentially (no random seeking as in relational databases).

Whenever method updateRatio() from class Memtable is called, the following codes applies.

public void updateLiveRatio()
{
        if (!MemoryMeter.isInitialized())
        {
            // hack for openjdk.  we log a warning about this in the startup script too.
            logger.warn("MemoryMeter uninitialized (jamm not specified as java agent); assuming liveRatio of 10.0.  Usually this means cassandra-env.sh disabled jamm because you are using a buggy JRE; upgrade to the Sun JRE instead");
            cfs.liveRatio = 10.0;
            return;
        }

        Runnable runnable = new Runnable()
        {
            public void run()
            {
                activelyMeasuring = Memtable.this;

                long start = System.currentTimeMillis();
                // ConcurrentSkipListMap has cycles, so measureDeep will have to track a reference to EACH object it visits.
                // So to reduce the memory overhead of doing a measurement, we break it up to row-at-a-time.
                long deepSize = meter.measure(columnFamilies);
                int objects = 0;
                for (Map.Entry<DecoratedKey, ColumnFamily> entry : columnFamilies.entrySet())
                {
                    deepSize += meter.measureDeep(entry.getKey()) + meter.measureDeep(entry.getValue());
                    objects += entry.getValue().getColumnCount();
                }
                double newRatio = (double) deepSize / currentThroughput.get();

                if (newRatio < MIN_SANE_LIVE_RATIO)
                {
                    logger.warn("setting live ratio to minimum of 1.0 instead of {}", newRatio);
                    newRatio = MIN_SANE_LIVE_RATIO;
                }
                if (newRatio > MAX_SANE_LIVE_RATIO)
                {
                    logger.warn("setting live ratio to maximum of 64 instead of {}", newRatio);
                    newRatio = MAX_SANE_LIVE_RATIO;
                }
                cfs.liveRatio = Math.max(cfs.liveRatio, newRatio);

                logger.info("{} liveRatio is {} (just-counted was {}).  calculation took {}ms for {} columns",
                            new Object[]{ cfs, cfs.liveRatio, newRatio, System.currentTimeMillis() - start, objects });
                activelyMeasuring = null;
            }
        };

        try
        {
            meterExecutor.submit(runnable);
        }
        catch (RejectedExecutionException e)
        {
            logger.debug("Meter thread is busy; skipping liveRatio update for {}", cfs);
        }
}

There is a new thread which will be execute by the executor meterExecutor. This executor if it is not busy, will start to measure this column family associated with this memtable. There are two essential metrics that is involve in the arithmetic, key size plus value size and the column count. deepSize is the summation of the column family , its key and its value. A new variable newRatio is calculated with the deepSize divided by the currentThroughput.

Valid newRatio range is between 1.0 to 64.0 inclusive. When calculated newRatio is less than 1.0, first line of log such as above will started to appear in cassandra system log and newRatio will be reset to 1.0. The same check when newRatio exceed 64.0, it will be logged and value for newRatio reset to maximum valid value of 64.0. Then column family live ratio is updated with whichever which is higher, the current or the new calculated newRatio.

So technically, this is nothing to really concern about. It's a measurement of the number of operations has increased as compare to the previous. Also, given the next throughput, as estimated of size of the memtable can be calculated based on the newRatio. As a side note, this method updateRatio() is called after a mutation has been applied to the memtable but before a flush is requested.

That's it for today, I hope you learned something.

Friday, January 16, 2015

operate casandra using jmx in terminal including changing pool size, compacting sstables and key cache

If you operate apache cassandra cluster and if load per node goes huge (like nodetool info show 800GB), compactions become a problem. It's a big problem for apache cassandra 1.0.8 if you have load per node average hover around 600GB to 1TB. The read performance suffers and at times system uptime load goes high. In some instance, I noticed when repair is running, system load goes more than 20. It's not a concern if this is operating well, but the more often you see this, something has gone wrong. Today, I will share my experience on how to operate cassandra when node load is huge and cassandra instance is still running. Often times, there are nice method that is exposed via jmx but to operate remotely, jmx gui client such as jmxconsole is not ideal. Instead, we will using a jmxterm for these operation in apache cassandra 1.0.8. So let's get started.

Changing pool size

So, it is pretty simple, launch it and set to the bean, and then set the CorePoolSize. The steps will be illustrate below.

$ java -jar jmxterm-1.0-alpha-4-uber.jar
$>open localhost:7199   
#Connection to localhost:7199 is opened
$>bean org.apache.cassandra.request:type=ReplicateOnWriteStage
#bean is set to org.apache.cassandra.request:type=ReplicateOnWriteStage
$>get CorePoolSize
#mbean = org.apache.cassandra.request:type=ReplicateOnWriteStage:
CorePoolSize = 32;
$>info
#mbean = org.apache.cassandra.request:type=ReplicateOnWriteStage
#class name = org.apache.cassandra.concurrent.JMXConfigurableThreadPoolExecutor
# attributes
  %0   - ActiveCount (int, r)
  %1   - CompletedTasks (long, r)
  %2   - CorePoolSize (int, rw)
  %3   - CurrentlyBlockedTasks (int, r)
  %4   - PendingTasks (long, r)
  %5   - TotalBlockedTasks (int, r)
#there's no operations
#there's no notifications
$>set CorePoolSize 64
$>get CorePoolSize
#mbean = org.apache.cassandra.request:type=ReplicateOnWriteStage:
CorePoolSize = 64;

Alter key cache

Often times, when there is heap pressure in the jvm, the safety valve kicks in. You can restart the cassandra instance or you can reset the key cache back to the initial value. Assuming your column family name FooBar and keyspace just4fun, then the following are steps to illustrate how is this done.

$>bean org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches
#bean is set to org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches
$>info
#mbean = org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches
#class name = org.apache.cassandra.cache.AutoSavingKeyCache
# attributes
  %0   - Capacity (int, rw)
  %1   - Hits (long, r)
  %2   - RecentHitRate (double, r)
  %3   - Requests (long, r)
  %4   - Size (int, r)
#there's no operations
#there's no notifications
$>
$>get Size
#mbean = org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches:
Size = 122307;

$>set Capacity 250000
#Value of attribute Capacity is set to 250000
$>get Capacity;      
#mbean = org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches:
$>get Capacity 
#mbean = org.apache.cassandra.db:cache=FooBarKeyCache,keyspace=just4fun,type=Caches:
Capacity = 250000;

Compact sstable

Lastly, to compact sstables. It's amazing we have a sstable that as huge as 84GB! So trigger major compaction is not an option here, often time when load per node goes beyond 600GB, compaction took forever, as GC kick in and cpu keep on recollecting heap, making system load goes high. So here, we will select one sstable that is huge and compact that only. You can also select a few sstable and compact them and separate using comma.

$>bean org.apache.cassandra.db:type=CompactionManager
#bean is set to org.apache.cassandra.db:type=CompactionManager
$>run forceUserDefinedCompaction just4fun FooBar-hc-5-Index.db
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#RuntimeMBeanException: java.lang.IllegalArgumentException: FooBar-hc-5-Index.db does not appear to be a data file
$>run forceUserDefinedCompaction just4fun FooBar-hc-401-Data.db 
#calling operation forceUserDefinedCompaction of mbean org.apache.cassandra.db:type=CompactionManager
#operation returns: 
null

The compaction should be started, you can check in cassandra system log or the nodetool compaction. So that's it, I hope you learned something.

Saturday, January 3, 2015

apache cassandra 1.0.8 IncompatibleClassChangeError vtable stub and AssertionError Added column does not sort as the last column

Today we will spend sometime to look into two errors and see if it is really something to concern about. The erros are thrown when apache cassandra version 1.0.8 is running. Okay, let's to the first error.

ERROR [ReadStage:1559] 2012-10-16 20:38:25,336 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ReadStage:1559,5,main]
java.lang.IncompatibleClassChangeError: vtable stub
        at org.apache.cassandra.db.AbstractColumnContainer.getColumn(AbstractColumnContainer.java:134)
        at org.apache.cassandra.db.Memtable$6.computeNext(Memtable.java:402)
        at org.apache.cassandra.db.Memtable$6.computeNext(Memtable.java:384)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
        at org.apache.cassandra.db.CollationController.collectTimeOrderedData(CollationController.java:93)
        at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:62)
        at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1298)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1184)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1151)
        at org.apache.cassandra.db.Table.getRow(Table.java:375)
        at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:58)
        at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:765)
        at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

I guess this is really some fatal errors encountered. Let's check out what is IncompatibleClassChangeError means. From IncompatibleClassChangeError javadoc

Thrown when an incompatible class change has occurred to some class definition. The definition of some class, on which the currently executing method depends, has since changed.

Honestly, this is really odd, this error thrown out of no where and the jvm stopped. I guess nobody change the compiled apache cassandra code during the node instance is running. It certainly sounds odd but I guess when the data hold per node is huge, then strange thing start to happened. Now, let's get into stack trace and follow stack trace path.

Bottom three stack traces are pretty obvious, a new thread was started and execute by the thread pool executor. Then we have class StorageProxy. Within this class, there is a static class LocalReadRunnable which implement the abstract method LocalReadRunnable. It seem like it is trying to read a local node table row. The table eventually make calls to the column family to retrieve column. Tracing even deeper, at line 134 of class AbstractColumnContainer, reveal that no exception is thrown from here. This is like a mystery! :) The stack trace analysis and observed jvm stopped shown something is wrong. Though I am not sure what went wrong but if you have any idea, please discuss it as a comment below.

Next we look to another error.

ERROR [CompactionExecutor:4] 2014-10-22 06:13:00,884 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[CompactionExecutor:4,1,main]
java.lang.AssertionError: Added column does not sort as the last column
        at org.apache.cassandra.db.ArrayBackedSortedColumns.addColumn(ArrayBackedSortedColumns.java:126)
        at org.apache.cassandra.db.AbstractColumnContainer.addColumn(AbstractColumnContainer.java:129)
        at org.apache.cassandra.db.AbstractColumnContainer.addColumn(AbstractColumnContainer.java:124)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:148)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:232)
        at org.apache.cassandra.db.compaction.PrecompactedRow.merge(PrecompactedRow.java:110)
        at org.apache.cassandra.db.compaction.PrecompactedRow.<init>(PrecompactedRow.java:97)
        at org.apache.cassandra.db.compaction.CompactionController.getCompactedRow(CompactionController.java:137)
        at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:102)
        at org.apache.cassandra.db.compaction.CompactionIterable$Reducer.getReduced(CompactionIterable.java:87)
        at org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:118)
        at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:101)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
        at com.google.common.collect.Iterators$7.computeNext(Iterators.java:614)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:173)
        at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:135)
        at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:115)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

Again, this error pop out of nowhere, during node operation, this exception just thrown. So now, we will look into what happened using the stack trace given. Similar to the above, bottom three stack traces are pretty obvious, a new thread was started and execute by the thread pool executor. It is a compaction thread and when compacting sstable, things break. Then row get reduced and columns get deserialized. When columns are re-formation, it throw an exception as the column does not sort. This is another strange phenomenon. It should have been sort before anyway but it is not. The javdoc for this method addColumn

AddColumn throws an exception if the column added does not sort after the last column in the map. The reasoning is that this implementation can get slower if too much insertions are done in unsorted order and right now we only use it when *all* insertion (with this method) are done in sorted order. The assertion throwing is thus a protection against performance regression without knowing about (we can revisit that decision later if we have use cases where most insert are in sorted order but a few are not).

It seem that this is probably reproducible during development environment as a few are not.. I guess this is not as fatal as previous error. That's it for this analysis, please give your inputs or comments if you have workaround.

UPDATE:

It seem that the second error is fix in https://issues.apache.org/jira/browse/CASSANDRA-5856

Sunday, December 21, 2014

apache cassandra 1.0.8 out of memory error unable to create new native thread

If you are using apache cassandra 1.0.8 and having the exception such as below, you may want to further read. Today, we will investigate on what this error means and what can we do to correct this situation.

ERROR [Thread-273] 2012-14-10 16:33:18,328 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-273,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:104)
        at org.apache.cassandra.thrift.CassandraDaemon$ThriftServer.run(CassandraDaemon.java:214)

This is not good, the application crashed with this error during operation. To illustrate this environment, it is running using oracle java 6 with apache cassandra 1.0.8. It has 12GB of java heap assigned with stack size 128k, max user processes 260000 and open files capped at 65536.

Investigate into the java stack trace, reveal that, this error is not thrown by java code but native code. Below is the trace path.

https://github.com/apache/cassandra/blob/cassandra-0.8/src/java/org/apache/cassandra/thrift/CassandraDaemon.java#L214

https://github.com/apache/cassandra/blob/cassandra-1.0.8/src/java/org/apache/cassandra/thrift/CustomTThreadPoolServer.java#L104

ThreadPoolExecutor.java line 657

ThreadPoolExecutor.java line 727

Thread.java line 640

A little explanation before we delve even deeper. Number 3 to 5, is jdk dependent. Hence, if you are using openjdk, the line number may be different. As mentioned early, I'm using oracle jdk. Unfortunately, it is not available online for browsing but you can download the source from oracle site.

Because this is a native call, we will look into code that is not in Java. If the following code looks alien to you, it sure looks alien to me as it is probably written in c++. If you have also notice, this code is taken from openjdk and it is not found in the oracle jdk. Probably it is a closed source but we will not go there. Let's just focus where this error thrown from and why. It is taken from here and the explanation here.

JVM_ENTRY(void, JVM_StartThread(JNIEnv* env, jobject jthread))
  JVMWrapper("JVM_StartThread");
  JavaThread *native_thread = NULL;

  // We cannot hold the Threads_lock when we throw an exception,
  // due to rank ordering issues. Example:  we might need to grab the
  // Heap_lock while we construct the exception.
  bool throw_illegal_thread_state = false;

  // We must release the Threads_lock before we can post a jvmti event
  // in Thread::start.
  {
    // Ensure that the C++ Thread and OSThread structures aren't freed before
    // we operate.
    MutexLocker mu(Threads_lock);

    // Since JDK 5 the java.lang.Thread threadStatus is used to prevent
    // re-starting an already started thread, so we should usually find
    // that the JavaThread is null. However for a JNI attached thread
    // there is a small window between the Thread object being created
    // (with its JavaThread set) and the update to its threadStatus, so we
    // have to check for this
    if (java_lang_Thread::thread(JNIHandles::resolve_non_null(jthread)) != NULL) {
      throw_illegal_thread_state = true;
    } else {
      // We could also check the stillborn flag to see if this thread was already stopped, but
      // for historical reasons we let the thread detect that itself when it starts running

      jlong size =
             java_lang_Thread::stackSize(JNIHandles::resolve_non_null(jthread));
      // Allocate the C++ Thread structure and create the native thread.  The
      // stack size retrieved from java is signed, but the constructor takes
      // size_t (an unsigned type), so avoid passing negative values which would
      // result in really large stacks.
      size_t sz = size > 0 ? (size_t) size : 0;
      native_thread = new JavaThread(&thread_entry, sz);

      // At this point it may be possible that no osthread was created for the
      // JavaThread due to lack of memory. Check for this situation and throw
      // an exception if necessary. Eventually we may want to change this so
      // that we only grab the lock if the thread was created successfully -
      // then we can also do this check and throw the exception in the
      // JavaThread constructor.
      if (native_thread->osthread() != NULL) {
        // Note: the current thread is not being used within "prepare".
        native_thread->prepare(jthread);
      }
    }
  }

  if (throw_illegal_thread_state) {
    THROW(vmSymbols::java_lang_IllegalThreadStateException());
  }

  assert(native_thread != NULL, "Starting null thread?");

  if (native_thread->osthread() == NULL) {
    // No one should hold a reference to the 'native_thread'.
    delete native_thread;
    if (JvmtiExport::should_post_resource_exhausted()) {
      JvmtiExport::post_resource_exhausted(
        JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR | JVMTI_RESOURCE_EXHAUSTED_THREADS,
        "unable to create new native thread");
    }
    THROW_MSG(vmSymbols::java_lang_OutOfMemoryError(),
              "unable to create new native thread");
  }

  Thread::start(native_thread);

JVM_END

As I don't have knowledge in cpp, hence, there is no analysis into this snippet above, but if you understand what it does, I will be happy if you can give your analysis as a comment below of this article. It certainly looks to me that the operating system cannot create a thread at this point due to a few errors, JVMTI_RESOURCE_EXHAUSTED_OOM_ERROR and / or JVMTI_RESOURCE_EXHAUSTED_THREADS. Let's google to find out what is that supposed to mean. Below are some which is interesting.

http://www.oracle.com/technetwork/java/hotspotfaq-138619.html#threads_oom

http://www.onkarjoshi.com/blog/209/using-xss-to-adjust-java-default-thread-stack-size-to-save-memory-and-prevent-stackoverflowerror/

http://www.odi.ch/weblog/posting.php?posting=411

http://www.avricot.com/blog/?post/2010/05/03/Get-started-with-java-JVM-memory-%28heap%2C-stack%2C-xss-xms-xmx-xmn...%29

https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-2.html#jvms-2.5.2

To summarize the analysis from the links above.

stack is created when thread is created and when more threads are created, hence the total of stacks also increased as a result.

A Java Virtual Machine stack stores frames. A Java Virtual Machine stack is analogous to the stack of a conventional language such as C: it holds local variables and partial results, and plays a part in method invocation and return.

Java stack is not within of java heap, hence, even if you increase java heap to the cassandra via parameter -Xms or -Xmx, this error will happen again if the condition is met again in the future.

If Java Virtual Machine stacks can be dynamically expanded, and expansion is attempted but insufficient memory can be made available to effect the expansion, or if insufficient memory can be made available to create the initial Java Virtual Machine stack for a new thread, the Java Virtual Machine throws an OutOfMemoryError.

Until current analysis, it certainly looks to me that when cassandra instance trying to create a new thread, it was not able to. It was not able to because the underlying operating system cannot create the thread due to two errors. It actually looks like the operating system does not have sufficient memory to create the thread, hence increasing -Xms or -Xmx will not solve the problem. Note that the file descriptor set in this case is not met neither as most of the criterias pretty much infinite.

It's pretty interesting to note that, if such error is thrown, to solve the problem is to decrease the -Xss or even the heap -Xms and -Xmx. Although I don't understand the logic behind of such method used, perhaps you should try but I seriously doubt so. If cassandra node has high usage of heap, decreasing heap will only create another type of problem.

If you know or have encountered such problem before and has a good fix, please leave the comment below this article. To end this article, there is currently as of this writing, a discussion happen at cassandra mailing list.

Sunday, November 23, 2014

Investigate into why key cache in apache cassandra 1.0.8 gets reduced

Today, we will investigate into apache cassandra 1.0.8 when and why it reduce configured key cache. If you run the command nodetool cfstats. One of the statistics would probably interest you. I paste the snippet below.

Key cache capacity: 200000
Key cache size: 200000
Key cache hit rate: 0.9655797101449275
Row cache: disabled

After cassandra instance has been running for sometime, and you start to notice that the key cache capacity has gone down.

Key cache capacity: 150000
Key cache size: 150000
Key cache hit rate: 0.962251615630851
Row cache: disabled

As seen above, the initial capacity for this column family has 20,000 total key for cache. Currently, all object (that is 20,000) occupied fully in the key cache assigned. The hit rate is 96% which is very good statistics. So after a while, what had happened and why was it reduce? Let's investigate into the log file.

 WARN [ScheduledTasks:1] 2014-02-02 00:46:46,384 AutoSavingCache.java (line 187) Reducing MyColumnFamily KeyCache capacity from 200000 to 150000 to reduce memory pressure

Apparently memory is not enough at this point of time and the key cache is reduced to free up more memory for the cassandra instance. Let's look at the cassandra yaml file if there is any description for the key cache.

# emergency pressure valve #2: the first time heap usage after a full
# (CMS) garbage collection is above this fraction of the max,
# Cassandra will reduce cache maximum _capacity_ to the given fraction
# of the current _size_.  Should usually be set substantially above
# flush_largest_memtables_at, since that will have less long-term
# impact on the system.  
# 
# Set to 1.0 to disable.  Setting this lower than
# CMSInitiatingOccupancyFraction is not likely to be useful.
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6

There are two configurations that reduce the cache size. When memory heap usage at 85%, key cache is reduced to 60% of its initial value. So now we dive deeper into the code to see what happened. Let's read into class GCInspector.

double usage = (double) memoryUsed / memoryMax;

if (memoryUsed > DatabaseDescriptor.getReduceCacheSizesAt() * memoryMax && !cacheSizesReduced)
{
    cacheSizesReduced = true;
    logger.warn("Heap is " + usage + " full.  You may need to reduce memtable and/or cache sizes.  Cassandra is now reducing cache sizes to free up memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically");
    StorageService.instance.reduceCacheSizes();
}

When memory used is greater than reduce_cache_sizes_at (configured in cassanra.yaml, value at 0.85) multiply maximum memory in the heap and cache has not been reduced before. For example, if jvm is assigned with 8GB of heap, so the if statement evaluation become valid under such arithmetic, memory usage greater than 6.8GB when cache size has not been reduced before.

When the condition become true, StorageService will start to reduce cache size. A simple for loop over all column families to reduce the cache size. As seen here, there are two caches are being reduced. The rowcache and the keycache. Because we did not enable row cache and also not a focus on this study, I'll leave as an exercise for you. The investigation continue on the keyCache.reduceCacheSize();. As the snippet of code below shown.

public void reduceCacheSize()
{
   if (getCapacity() > 0)
   {
       int newCapacity = (int) (DatabaseDescriptor.getReduceCacheCapacityTo() * size());
       logger.warn(String.format("Reducing %s %s capacity from %d to %s to reduce memory pressure",
                              cfName, cacheType, getCapacity(), newCapacity));
       setCapacity(newCapacity);
   }
}

So if the capacity is initially assigned to a value larger than 0, then a new capacity is set. The new capacity is such that, reduce_cache_capacity_to (default at cassandra yaml, 0.60) multiply with the current size of the cache. For example, if the cache is occupied at 20000 x 0.60, the new value will be the new cache capacity at 12000.

This wrap up the investigation. Final thought, because the memory consumption is exceed certain amount of threshold, this emergency pressure valve kicked in. To fix immediate, an increase heap for cassandra instance will solve, but the correct would probably reduce node load or increase node for the cluster. When cache capacity is reduced, expect read become slower too and in data storage perspective, speed and performance is everything and reduced cache is definitely an impact to the cluster.

Friday, July 18, 2014

Cassandra discarding obsolete commit log

Often times, I noticed the log show the following lines in apache cassandra 1.0.8

INFO [COMMIT-LOG-WRITER] 2014-06-30 08:58:55,039 CommitLog.java (line 490) Discarding obsolete commit log:CommitLogSegment(/mnt/cassandra/commitlog/CommitLog-1404095354976.log)

It looks nothing to worry about as the log level is INFO. However, I often see this and just to check to understand what it really is.

This log is written by CommitLog.java.

private void maybeDiscardSegment(CommitLogSegment segment, Iterator<CommitLogSegment> iter)
{
    if (segment.isSafeToDelete() && iter.hasNext())
    {
        logger.info("Discarding obsolete commit log:" + segment);
        FileUtils.deleteAsync(segment.getPath());
        // usually this will be the first (remaining) segment, but not always, if segment A contains
        // writes to a CF that is unflushed but is followed by segment B whose CFs are all flushed.
        iter.remove();
    }
    else
    {
        if (logger.isDebugEnabled())
            logger.debug("Not safe to delete commit log " + segment + "; dirty is " + segment.dirtyString() + "; hasNext: " + iter.hasNext());
    }
}

So why discard segment? From the code documentation

Commit Log tracks every write operation into the system. The aim of the commit log is to be able to successfully recover data that was not stored to disk via the Memtable. Every Commit Log maintains a header represented by the abstraction CommitLogHeader. The header contains a bit array and an array of longs and both the arrays are of size, #column families for the Table, the Commit Log represents.

Whenever a ColumnFamily is written to, for the first time its bit flag is set to one in the CommitLogHeader. When it is flushed to disk by the Memtable its corresponding bit in the header is set to zero. This helps track which CommitLogs can be thrown away as a result of Memtable flushes. Additionally, when a ColumnFamily is flushed and written to disk, its entry in the array of longs is updated with the offset in the Commit Log file where it was written. This helps speed up recovery since we can seek to these offsets and start processing the commit log.

Every Commit Log is rolled over everytime it reaches its threshold in size; the new log inherits the "dirty" bits from the old.

Over time there could be a number of commit logs that would be generated. To allow cleaning up non-active commit logs, whenever we flush a column family and update its bit flag in the active CL, we take the dirty bit array and bitwise & it with the headers of the older logs. If the result is 0, then it is safe to remove the older file. (Since the new CL inherited the old's dirty bitflags, getting a zero for any given bit in the anding means that either the CF was clean in the old CL or it has been flushed since the switch in the new.)

So that's pretty clear why commit log is remove. Tracing the call upward,

CommitLog.maybeDiscardSegment()
  ^
  |
  +--- CommitLog.discardCompletedSegmentsInternal()
    ^
    |
    +--- CommitLog.discardCompletedSegments()
       ^
       |
       +--- ColumnFamilyStore.maybeSwitchMemtable()
       |
       +--- ColumnFamilyStore.truncate()

So whenever a column family is truncated, the commit log is discard (remove) as well which make sense. When maybeSwitchMemtable is executed, that is, memtable is flushed, after that, its segments get discard (remove) as well.

This logging message is just fine as it is part of cassandra operation. That's it for this article.

Friday, May 16, 2014

Learn and experiment with cassandra trigger

In cassandra 2.0, an experimental trigger was introduced and this seem exciting to bring cassandra into a whole new level. Today, by using cassandra 2.0.7 , we are going to learn cassandra trigger. But first, let's understand what conventional database trigger is.

Excerpt from wikipedia,

A database trigger is procedural code that is automatically executed in response to certain events on a particular table or view in a database. The trigger is mostly used for maintaining the integrity of the information on the database. For example, when a new record (representing a new worker) is added to the employees table, new records should also be created in the tables of the taxes, vacations and salaries.

So let's create a table in cassandra and then create a trigger for the table. We will do these execution via cqlsh and the example we are going to follow available in this link. Below are the steps I have taken from studying into the example trigger code.

1. build cassandra jar files in cassandra base directory.
2. build trigger-example.jar from trigger example directory.
3. upload trigger-example.jar to cassandra node directory in /etc/cassandra/triggers
4. copy InvertedIndex.properties to cassandra node directory in /etc/cassandra/
5. make cassandra aware of this jar and properties file addition via nodetool reloadtriggers
nodetool -h localhost reloadtriggers
5. repeat step 3 and 4 for all the nodes in the cluster.
6. create column family invertedindex via cqlsh.
7. create column family standard1 via cqlsh.
8. create trigger via cqlsh CREATE TRIGGER test1 ON "Keyspace1"."Standard1" USING 'org.apache.cassandra.triggers.InvertedIndex';
note that you can also drop trigger via command drop trigger test1 on "Keyspace1"."Standard1"

So that exciting part comes, when I tried to insert, the response keep on complaining key may not be empty, it is strange that we does specify the user_id as our key but it keep on giving error. So what went wrong?

cqlsh:keyspace1> insert into standard1 (user_id, age) values (124, 11);
Bad Request: Key may not be empty

TRACE [Thrift:5] 2014-05-12 22:17:02,492 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.UpdateStatement@164b11c @CL.ONE
DEBUG [Thrift:5] 2014-05-12 22:17:02,493 Tracing.java (line 159) request complete
ERROR [Thrift:5] 2014-05-12 22:17:02,493 CustomTThreadPoolServer.java (line 219) Error occurred during processing of message.
java.lang.RuntimeException: Exception while creating trigger on CF with ID: d04577ab-ecc0-3f57-bb01-6febc9d27803
	at org.apache.cassandra.triggers.TriggerExecutor.executeInternal(TriggerExecutor.java:167)
	at org.apache.cassandra.triggers.TriggerExecutor.execute(TriggerExecutor.java:91)
	at org.apache.cassandra.service.StorageProxy.mutateWithTriggers(StorageProxy.java:525)
	at org.apache.cassandra.cql3.statements.ModificationStatement.executeWithoutCondition(ModificationStatement.java:542)
	at org.apache.cassandra.cql3.statements.ModificationStatement.execute(ModificationStatement.java:526)
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:158)
	at org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:175)
	at org.apache.cassandra.thrift.CassandraServer.execute_cql3_query(CassandraServer.java:1959)
	at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4486)
	at org.apache.cassandra.thrift.Cassandra$Processor$execute_cql3_query.getResult(Cassandra.java:4470)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:201)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.NullPointerException
	at org.apache.cassandra.db.RowMutation.addOrGet(RowMutation.java:133)
	at org.apache.cassandra.db.RowMutation.addOrGet(RowMutation.java:128)
	at org.apache.cassandra.db.RowMutation.addOrGet(RowMutation.java:123)
	at org.apache.cassandra.db.RowMutation.add(RowMutation.java:149)
	at org.apache.cassandra.db.RowMutation.add(RowMutation.java:159)
	at org.apache.cassandra.triggers.InvertedIndex.augment(InvertedIndex.java:46)
	at org.apache.cassandra.triggers.TriggerExecutor.executeInternal(TriggerExecutor.java:159)
	... 15 more
TRACE [Thrift:5] 2014-05-12 22:17:02,495 ThriftSessionManager.java (line 74) ClientState removed for socket ad

So I decided to go into further, and I got it to works after spending hours. Changes below.

1. change to lower letters for InvertedIndex.properties

$ cat /etc/cassandra/InvertedIndex.properties
keyspace=keyspace1
columnfamily=invertedindex

2. rebuild trigger-example.jar file with different augment method implementation and remember deploy this to every node in the cluster and execute command reloadtriggers using nodetool.

import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.cassandra.db.ArrayBackedSortedColumns;
import java.util.Collections;

public Collection<RowMutation> augment(ByteBuffer key, ColumnFamily update)
{
    ColumnFamily extraUpdate = update.cloneMeShallow(ArrayBackedSortedColumns.factory, false);
    extraUpdate.addColumn(new Column(update.metadata().comparator.fromString("v2"),
                                        ByteBufferUtil.bytes(999)));
    RowMutation rm = new RowMutation("keyspace1", key);
    rm.add(extraUpdate);
    return Collections.singletonList(rm);
}

3. drop both column family and recreate again, below are the schema.

cqlsh:keyspace1> desc table invertedindex;

CREATE TABLE invertedindex (
  k int,
  v1 int,
  v2 int,
  PRIMARY KEY (k)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

cqlsh:keyspace1> desc table test_table;

CREATE TABLE test_table (
  k int,
  v1 int,
  v2 int,
  PRIMARY KEY (k)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

and now when insert again into the cf, voila, 999 was auto created and no more exception in the log or cqlsh output!

cqlsh:keyspace1> select * from test_table;

(0 rows)

cqlsh:keyspace1> insert into test_table (k, v1) values (0, 0);
cqlsh:keyspace1> select * from test_table;

 k | v1 | v2
---+----+-----
 0 |  0 | 999

(1 rows)

Conclusion that we can draw is, since it is experimental, that means in the future, trigger is subject to many changes including API and chances that it could fail is higher ;-). It also need cassandra and java knowledge to build trigger at the mean time. Thus, you should not use this in production but that does not mean you cannot try this feature. In fact, cassandra would like to receive feedback on the trigger to improve or make cassandra trigger production ready in the future.

That's it for this article, if you like, please go to the donation page to contribute back as funding will keep us continue to write in the future.

Sunday, May 11, 2014

Store video on cassandra and using hector streaming IO

Today, we are going to learn how to stream in and stream out using hector-client. There are two classes implemented in hector-client which storing binary in chunk and reading binary in chunk. That's pretty neat! The two mentioned classes are

ChunkOutputStream storing binary as blog into cassandra.
ChunkInputStream read binary as blog from cassandra.

Below is a test case coded with the two classes to show how to store and read data using the two mentioned class.

import static org.junit.Assert.*;

import java.io.IOException;
import java.util.Arrays;

import me.prettyprint.cassandra.connection.HConnectionManager;
import me.prettyprint.cassandra.io.ChunkInputStream;
import me.prettyprint.cassandra.io.ChunkOutputStream;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.cassandra.service.ThriftCluster;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.ddl.KeyspaceDefinition;
import me.prettyprint.hector.api.factory.HFactory;

import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class HectorStreamTest {

	private Keyspace keyspace;
	private ThriftCluster cassandraCluster;
	private CassandraHostConfigurator cassandraHostConfigurator;
	protected HConnectionManager connectionManager;
	public static KeyspaceDefinition KEYSPACE_DEV;
	public final static String KEYSPACE = "TestKeyspace";
	public final static String BLOB_CF = "Blob";
	public final static CfDef BLOB_CF_DEF = new CfDef(KEYSPACE, BLOB_CF);

	@BeforeClass
	public static void setUpBeforeClass() throws Exception {
	}

	@AfterClass
	public static void tearDownAfterClass() throws Exception {
	}

	@Before
	public void setUp() throws Exception {
		cassandraHostConfigurator = new CassandraHostConfigurator(
				"192.168.0.2:9160");
		connectionManager = new HConnectionManager("just4fun",
				cassandraHostConfigurator);

		KEYSPACE_DEV = new ThriftKsDef(new KsDef(KEYSPACE,
				"org.apache.cassandra.locator.SimpleStrategy",
				Arrays.asList(new CfDef[] { BLOB_CF_DEF })));
		((ThriftKsDef) KEYSPACE_DEV).setReplicationFactor(1);
		cassandraCluster = new ThriftCluster("just4fun",
				cassandraHostConfigurator);

		keyspace = HFactory.createKeyspace(KEYSPACE, cassandraCluster);

		cassandraCluster.addKeyspace(KEYSPACE_DEV, true);
	}

	@After
	public void tearDown() throws Exception {
		cassandraCluster.dropKeyspace(KEYSPACE);
	}

	@Test
	public void testWriteAndReadStream() throws IOException {
		byte[] value = "hello world, store and read binary as a chunk of blob in and from cassandra.".getBytes();

		// write to cassandra.
		ChunkOutputStream<String> out = new ChunkOutputStream<String>(keyspace, BLOB_CF, "row1", StringSerializer.get(), 2);
		out.write(value);
		out.close();

		// read from cassandra.
		ChunkInputStream<String> in = new ChunkInputStream<String>(keyspace, BLOB_CF, "row1", StringSerializer.get());
		int i = -1;
		int written = 0;

		while ((i = in.read()) != -1) {
			assertSame(value[written++], (byte) i);
			byte[] b = {(byte)i};
			System.out.print(new String(b));
		}

		in.close();
	}

}

Keyspace TestKeyspace is created and table Blob is use to write and read the blob data. The main point is probably on the chunk size in ChunkOutputStream, it is set to 2 but you can give another even number to store the byte. Remember, each byte is represented by two hexadecimal characters, see cqlsh output below for more information. The test method testWriteAndReadStream() store the data in variable value using ChunkOutputStream.write() and remember to close it so that the data actually flush to cassandra or else it will stay in client code. To read from cassandra, is by specifing the row to ChunkInputStream and calling method read() which it will return the data in chunk. When the test is done, keyspace is removed.

cqlsh:TestKeyspace> select * from "Blob";

 key        | column1            | value
------------+--------------------+--------
 0x726f7731 | 0x0000000000000000 | 0x6800
 0x726f7731 | 0x0000000000000001 | 0x6500
 0x726f7731 | 0x0000000000000002 | 0x6c00
 0x726f7731 | 0x0000000000000003 | 0x6c00
 0x726f7731 | 0x0000000000000004 | 0x6f00
 0x726f7731 | 0x0000000000000005 | 0x2000
 0x726f7731 | 0x0000000000000006 | 0x7700
 0x726f7731 | 0x0000000000000007 | 0x6f00
 0x726f7731 | 0x0000000000000008 | 0x7200
 0x726f7731 | 0x0000000000000009 | 0x6c00
 0x726f7731 | 0x000000000000000a | 0x6400
 0x726f7731 | 0x000000000000000b | 0x2c00
 0x726f7731 | 0x000000000000000c | 0x2000
 0x726f7731 | 0x000000000000000d | 0x7300
 0x726f7731 | 0x000000000000000e | 0x7400
 0x726f7731 | 0x000000000000000f | 0x6f00
 0x726f7731 | 0x0000000000000010 | 0x7200
 0x726f7731 | 0x0000000000000011 | 0x6500
 0x726f7731 | 0x0000000000000012 | 0x2000
 0x726f7731 | 0x0000000000000013 | 0x6100
 0x726f7731 | 0x0000000000000014 | 0x6e00
 0x726f7731 | 0x0000000000000015 | 0x6400
 0x726f7731 | 0x0000000000000016 | 0x2000
 0x726f7731 | 0x0000000000000017 | 0x7200
 0x726f7731 | 0x0000000000000018 | 0x6500
 0x726f7731 | 0x0000000000000019 | 0x6100
 0x726f7731 | 0x000000000000001a | 0x6400
 0x726f7731 | 0x000000000000001b | 0x2000
 0x726f7731 | 0x000000000000001c | 0x6200
 0x726f7731 | 0x000000000000001d | 0x6900
 0x726f7731 | 0x000000000000001e | 0x6e00
 0x726f7731 | 0x000000000000001f | 0x6100
 0x726f7731 | 0x0000000000000020 | 0x7200
 0x726f7731 | 0x0000000000000021 | 0x7900
 0x726f7731 | 0x0000000000000022 | 0x2000
 0x726f7731 | 0x0000000000000023 | 0x6100
 0x726f7731 | 0x0000000000000024 | 0x7300
 0x726f7731 | 0x0000000000000025 | 0x2000
 0x726f7731 | 0x0000000000000026 | 0x6100
 0x726f7731 | 0x0000000000000027 | 0x2000
 0x726f7731 | 0x0000000000000028 | 0x6300
 0x726f7731 | 0x0000000000000029 | 0x6800
 0x726f7731 | 0x000000000000002a | 0x7500
 0x726f7731 | 0x000000000000002b | 0x6e00
 0x726f7731 | 0x000000000000002c | 0x6b00
 0x726f7731 | 0x000000000000002d | 0x2000
 0x726f7731 | 0x000000000000002e | 0x6f00
 0x726f7731 | 0x000000000000002f | 0x6600
 0x726f7731 | 0x0000000000000030 | 0x2000
 0x726f7731 | 0x0000000000000031 | 0x6200
 0x726f7731 | 0x0000000000000032 | 0x6c00
 0x726f7731 | 0x0000000000000033 | 0x6f00
 0x726f7731 | 0x0000000000000034 | 0x6200
 0x726f7731 | 0x0000000000000035 | 0x2000
 0x726f7731 | 0x0000000000000036 | 0x6900
 0x726f7731 | 0x0000000000000037 | 0x6e00
 0x726f7731 | 0x0000000000000038 | 0x2000
 0x726f7731 | 0x0000000000000039 | 0x6100
 0x726f7731 | 0x000000000000003a | 0x6e00
 0x726f7731 | 0x000000000000003b | 0x6400
 0x726f7731 | 0x000000000000003c | 0x2000
 0x726f7731 | 0x000000000000003d | 0x6600
 0x726f7731 | 0x000000000000003e | 0x7200
 0x726f7731 | 0x000000000000003f | 0x6f00
 0x726f7731 | 0x0000000000000040 | 0x6d00
 0x726f7731 | 0x0000000000000041 | 0x2000
 0x726f7731 | 0x0000000000000042 | 0x6300
 0x726f7731 | 0x0000000000000043 | 0x6100
 0x726f7731 | 0x0000000000000044 | 0x7300
 0x726f7731 | 0x0000000000000045 | 0x7300
 0x726f7731 | 0x0000000000000046 | 0x6100
 0x726f7731 | 0x0000000000000047 | 0x6e00
 0x726f7731 | 0x0000000000000048 | 0x6400
 0x726f7731 | 0x0000000000000049 | 0x7200
 0x726f7731 | 0x000000000000004a | 0x6100
 0x726f7731 | 0x000000000000004b | 0x2e00

(76 rows)

cqlsh:TestKeyspace>

As basic classes provided by hector-client is in the package, it's a good to have feature to have if you want to stream in and stream out content like audio or video. I have implemented something similar here. The concept is similar, write binary content to cassandra and reconstruct the binary data into the file again.

That's it, hope you like this.

Saturday, May 10, 2014

Understand cassandra read path by tracing in CQL

In our last article, we explored cassandra 2.0.7 write path and in this article, we will explore cassandra read path. We will again follow the same investigation method we used on write for read too. That is, we will trace the read path by turning on tracing in cqlsh.

Let's start by enabling tracing and consistency to all. Then issue statement select and start to dig into code. Below are the output of the commands executed in cqlsh and output in cassandra system.log

cqlsh:jw_schema1> consistency all;
Consistency level set to ALL.
cqlsh:jw_schema1> tracing on;
Now tracing requests.
cqlsh:jw_schema1> select * from users;

 user_id | age | first     | last  | middle
---------+-----+-----------+-------+--------
       4 |  10 | john30003 | smith | junior
       3 |  10 | john30003 | smith | junior
       5 |  10 | john30003 | smith | junior
       2 |  10 | john30003 | smith | junior

(4 rows)

Tracing session: 66a845c0-d5f3-11e3-bd26-a322c40b8b81

 activity                                                                                        | timestamp    | source        | source_elapsed
-------------------------------------------------------------------------------------------------+--------------+---------------+----------------
                                                                              execute_cql3_query | 22:25:12,732 |  <node3_ip>   |              0
                                                             Message received from /<node3_ip>   | 22:25:11,106 |  <node2_ip>   |             26
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:11,106 |  <node2_ip>   |            289
                                                              Read 1 live and 0 tombstoned cells | 22:25:11,106 |  <node2_ip>   |            442
                                                              Read 1 live and 0 tombstoned cells | 22:25:11,106 |  <node2_ip>   |            581
                                                              Read 1 live and 0 tombstoned cells | 22:25:11,106 |  <node2_ip>   |            658
                                                              Read 1 live and 0 tombstoned cells | 22:25:11,106 |  <node2_ip>   |            724
                                                                    Scanned 4 rows and matched 4 | 22:25:11,106 |  <node2_ip>   |            760
                                                             Enqueuing response to /<node3_ip>   | 22:25:11,106 |  <node2_ip>   |            785
                                                                Sending message to /<node3_ip>   | 22:25:11,107 |  <node2_ip>   |            954
                                                             Message received from /<node3_ip>   | 22:25:12,430 |  <node1_ip>   |             76
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:12,431 |  <node1_ip>   |           1054
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,432 |  <node1_ip>   |           1250
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,432 |  <node1_ip>   |           1399
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,432 |  <node1_ip>   |           1537
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,432 |  <node1_ip>   |           1718
                                                                    Scanned 4 rows and matched 4 | 22:25:12,432 |  <node1_ip>   |           1777
                                                             Enqueuing response to /<node3_ip>   | 22:25:12,432 |  <node1_ip>   |           1935
                                                                Sending message to /<node3_ip>   | 22:25:12,433 |  <node1_ip>   |           2212
                                                        Parsing select * from users LIMIT 10000; | 22:25:12,732 |  <node3_ip>   |            148
                                                                             Preparing statement | 22:25:12,732 |  <node3_ip>   |            259
                                                                   Determining replicas to query | 22:25:12,733 |  <node3_ip>   |            941
                                                              Enqueuing request to /<node2_ip>   | 22:25:12,738 |  <node3_ip>   |           5556
                                                             Enqueuing request to /<node1_ip>    | 22:25:12,738 |  <node3_ip>   |           5645
                                             Enqueuing request to <node3_hostname>/<node3_ip>    | 22:25:12,738 |  <node3_ip>   |           5688
                                                                Sending message to /<node2_ip>   | 22:25:12,738 |  <node3_ip>   |           5811
                                                                 Sending message to /192.168.0.2 | 22:25:12,738 |  <node3_ip>   |           5817
                                                               Sending message to /<node1_ip>    | 22:25:12,738 |  <node3_ip>   |           6133
                                                             Message received from /<node3_ip>   | 22:25:12,739 |  <node3_ip>   |           6558
 Executing seq scan across 0 sstables for [min(-9223372036854775808), min(-9223372036854775808)] | 22:25:12,740 |  <node3_ip>   |           7294
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,740 |  <node3_ip>   |           7506
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,740 |  <node3_ip>   |           7698
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,740 |  <node3_ip>   |           8222
                                                              Read 1 live and 0 tombstoned cells | 22:25:12,741 |  <node3_ip>   |           8570
                                                                    Scanned 4 rows and matched 4 | 22:25:12,741 |  <node3_ip>   |           8634
                                                             Enqueuing response to /<node3_ip>   | 22:25:12,741 |  <node3_ip>   |           8689
                                                                 Sending message to /192.168.0.2 | 22:25:12,741 |  <node3_ip>   |           8821
                                                             Message received from /<node3_ip>   | 22:25:12,742 |  <node3_ip>   |           null
                                                          Processing response from /<node3_ip>   | 22:25:12,742 |  <node3_ip>   |           null
                                                            Message received from /<node1_ip>    | 22:25:13,029 |  <node3_ip>   |           null
                                                         Processing response from /<node1_ip>    | 22:25:13,029 |  <node3_ip>   |           null
                                                             Message received from /<node2_ip>   | 22:25:13,061 |  <node3_ip>   |           null
                                                          Processing response from /<node2_ip>   | 22:25:13,061 |  <node3_ip>   |           null
                                                              Read 5 live and 0 tombstoned cells | 22:25:13,086 |  <node3_ip>   |         353631
                                                              Read 5 live and 0 tombstoned cells | 22:25:13,087 |  <node3_ip>   |         355232
                                                              Read 5 live and 0 tombstoned cells | 22:25:13,093 |  <node3_ip>   |         360675
                                                              Read 5 live and 0 tombstoned cells | 22:25:13,093 |  <node3_ip>   |         360908
                                                                                Request complete | 22:25:13,093 |  <node3_ip>   |         361266

cqlsh:jw_schema1> 

TRACE [Thrift:186] 2014-05-07 22:25:12,733 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@43e049 @CL.ALL
DEBUG [Thrift:186] 2014-05-07 22:25:13,604 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-07 22:25:13,605 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@17100f1 @CL.ONE
DEBUG [Thrift:186] 2014-05-07 22:25:13,911 Tracing.java (line 159) request complete
DEBUG [Thrift:186] 2014-05-07 22:25:13,915 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-07 22:25:13,916 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@d34f6 @CL.ONE
DEBUG [Thrift:186] 2014-05-07 22:25:14,227 Tracing.java (line 159) request complete

As write entry path is execute_cql3_query, so is read path. If you trace the code down, it will be too much to even start the discussion. I summarize the points below in tandem with the output of cqlsh tracing and system.log where applicable. Thus it may not be complete but I will give you the link as narration goes so that you can study yourself in detail.

It started at CassandraServer.execute_cql3_query(...) as indicated in cqlsh tracing output. So basically the work done can be summarize by this line:

cState.getCQLQueryHandler().process(queryString, cState.getQueryState(), new QueryOptions(ThriftConversion.fromThrift(cLevel), Collections.<ByteBuffer>emptyList())).toThriftResult();

If you step into the code above, QueryProcessor.process(...) which implement the interface QueryHandler which get a valid CQLStatement. The execution continue by calling method QueryProcessor.processStatement(...). Notice that the logger in this method is shown in cassandra system.log (of cause you need to enable tracing for this class in log4j.properties in order for this line to log successfully). So access checking and validation are perform here. When checking and validation were done, then CQLStatement.execute(...) is executed.

Because we are executing select statement, the correspond class that implement interface CQLStatement is SelectStatement. Extract from SelectStatement.execute(...)

public ResultMessage.Rows execute(QueryState state, QueryOptions options) throws RequestExecutionException, RequestValidationException
{
    ConsistencyLevel cl = options.getConsistency();
    List<ByteBuffer> variables = options.getValues();
    if (cl == null)
        throw new InvalidRequestException("Invalid empty consistency level");

    cl.validateForRead(keyspace());

    int limit = getLimit(variables);
    long now = System.currentTimeMillis();
    Pageable command;
    if (isKeyRange || usesSecondaryIndexing)
    {
        command = getRangeCommand(variables, limit, now);
    }
    else
    {
        List<ReadCommand> commands = getSliceCommands(variables, limit, now);
        command = commands == null ? null : new Pageable.ReadCommands(commands);
    }

    int pageSize = options.getPageSize();
    // A count query will never be paged for the user, but we always page it internally to avoid OOM.
    // If we user provided a pageSize we'll use that to page internally (because why not), otherwise we use our default
    // Note that if there are some nodes in the cluster with a version less than 2.0, we can't use paging (CASSANDRA-6707).
    if (parameters.isCount && pageSize <= 0 && MessagingService.instance().allNodesAtLeast20)
        pageSize = DEFAULT_COUNT_PAGE_SIZE;

    if (pageSize <= 0 || command == null || !QueryPagers.mayNeedPaging(command, pageSize))
    {
        return execute(command, cl, variables, limit, now);
    }
    else
    {
        QueryPager pager = QueryPagers.pager(command, cl, options.getPagingState());
        if (parameters.isCount)
            return pageCountQuery(pager, variables, pageSize, now);

        // We can't properly do post-query ordering if we page (see #6722)
        if (needsPostQueryOrdering())
            throw new InvalidRequestException("Cannot page queries with both ORDER BY and a IN restriction on the partition key; you must either remove the "
                                            + "ORDER BY or the IN and sort client side, or disable paging for this query");

        List<Row> page = pager.fetchPage(pageSize);
        ResultMessage.Rows msg = processResults(page, variables, limit, now);
        if (!pager.isExhausted())
            msg.result.metadata.setHasMorePages(pager.state());
        return msg;
    }
}

The execution continue to get the Pageable Command . Execution continue to private method execute(...). Then method getRangeSlice(...) is called. This is the actual work done to retrieve all the rows. This method implementation does a lot of works and I would recommend you click on the link and study the code yourself to get a better picture.

When the control is returned, the rows are sent for further processing using method processResults(...) which eventually return the result back to the cassandra client.

As you may have notice, the upper layer execution is similar as to write path execution, until control passed to CQLStatement. That's it for this article, I hope you like it.

Friday, May 9, 2014

Understand cassandra write path by tracing in CQL

In this article, we will learn the write path for cassandra 2.0.7. Since cql is the way moving forward, we will start learning write path by focusing on cqlsh. Let's turn on the tracing, consistency to all and insert one row of data. Read output below:

cqlsh:jw_schema1> tracing on;
Now tracing requests.
cqlsh:jw_schema1> consistency all;
Consistency level set to ALL.

cqlsh:jw_schema1> insert into users (user_id, age, first, last, middle) values ('1', 10, 'john30003', 'smith', 'junior');

Tracing session: 03477650-d43f-11e3-bd26-a322c40b8b81

 activity                                                                                                        | timestamp    | source        | source_elapsed
-----------------------------------------------------------------------------------------------------------------+--------------+---------------+----------------
                                                                                              execute_cql3_query | 18:21:25,430 |  <node1_ip> |              0
                                                                             Message received from /<node1_ip>   | 18:21:23,795 |  <node2_ip> |             52
                                                                                  Acquiring switchLock read lock | 18:21:23,795 |  <node2_ip> |            455
                                                                                          Appending to commitlog | 18:21:23,795 |  <node2_ip> |            497
                                                                                        Adding to users memtable | 18:21:23,795 |  <node2_ip> |            613
                                                                             Enqueuing response to /<node1_ip>   | 18:21:23,800 |  <node2_ip> |           5520
                                                                                Sending message to /<node1_ip>   | 18:21:23,801 |  <node2_ip> |           6359
                                                                             Message received from /<node1_ip>   | 18:21:25,121 |  <node3_ip> |             84
                                                                                  Acquiring switchLock read lock | 18:21:25,123 |  <node3_ip> |           1777
                                                                                          Appending to commitlog | 18:21:25,123 |  <node3_ip> |           1826
                                                                                        Adding to users memtable | 18:21:25,123 |  <node3_ip> |           2121
                                                                             Enqueuing response to /<node1_ip>   | 18:21:25,129 |  <node3_ip> |           8278
                                                                                Sending message to /<node1_ip>   | 18:21:25,129 |  <node3_ip> |           8563
 Parsing insert into users (user_id, age, first, last, middle) values ('1', 10, 'john30003', 'smith', 'junior'); | 18:21:25,430 |  <node1_ip> |             93
                                                                                             Preparing statement | 18:21:25,430 |  <node1_ip> |            227
                                                                               Determining replicas for mutation | 18:21:25,433 |  <node1_ip> |           2721
                                                                                Sending message to /<node2_ip>   | 18:21:25,433 |  <node1_ip> |           3525
                                                                               Sending message to /<node3_ip>    | 18:21:25,434 |  <node1_ip> |           3751
                                                                                  Acquiring switchLock read lock | 18:21:25,434 |  <node1_ip> |           3963
                                                                                          Appending to commitlog | 18:21:25,434 |  <node1_ip> |           3992
                                                                                        Adding to users memtable | 18:21:25,434 |  <node1_ip> |           4067
                                                                            Message received from /<node3_ip>    | 18:21:25,730 |  <node1_ip> |         300016
                                                                         Processing response from /<node3_ip>    | 18:21:25,730 |  <node1_ip> |         300178
                                                                             Message received from /<node2_ip>   | 18:21:25,738 |  <node1_ip> |         308225
                                                                          Processing response from /<node2_ip>   | 18:21:25,738 |  <node1_ip> |         308676
                                                                                                Request complete | 18:21:25,738 |  <node1_ip> |         308825

TRACE [Thrift:186] 2014-05-05 18:24:33,825 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.UpdateStatement@17d2390 @CL.ALL
DEBUG [Thrift:186] 2014-05-05 18:24:34,621 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-05 18:24:34,622 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@159d495 @CL.ONE
DEBUG [Thrift:186] 2014-05-05 18:24:34,623 Tracing.java (line 159) request complete
DEBUG [Thrift:186] 2014-05-05 18:24:34,626 CassandraServer.java (line 1955) execute_cql3_query
TRACE [Thrift:186] 2014-05-05 18:24:34,626 QueryProcessor.java (line 153) Process org.apache.cassandra.cql3.statements.SelectStatement@75219b @CL.ONE
DEBUG [Thrift:186] 2014-05-05 18:24:34,629 Tracing.java (line 159) request complete

If you noticed, the entry path will be execute_cql3_query no matter write or read. If you trace the code down, it will be too much to even start the discussion. I summarize the points below in tandem with the output of cqlsh tracing and system.log where applicable. Thus it may not be complete but I will give you the link to the code as narration goes so that you can study yourself in detail.

It started at CassandraServer.execute_cql3_query(...) as indicated in cqlsh tracing output. So basically the work done can be summarize by this line:

cState.getCQLQueryHandler().process(queryString, cState.getQueryState(), new QueryOptions(ThriftConversion.fromThrift(cLevel), Collections.<ByteBuffer>emptyList())).toThriftResult();

If you step into the line above, QueryProcessor.process(...) which implement the interface QueryHandler which get a valid CQLStatement. The execution continue by calling method QueryProcessor.processStatement(...). Notice that the logger in this method is shown in cassandra system.log (of cause you need to enable tracing for this class in log4j.properties in order for this line to log successfully). So access checking and validation are perform here. When checking and validation were done, then CQLStatement.execute(...) is executed. Because we are adding a new row by inserting a new row of data, the correspond class that implement interface CQLStatement is ModificationStatement. Extract from ModificationStatement.execute(...)

public ResultMessage execute(QueryState queryState, QueryOptions options)
throws RequestExecutionException, RequestValidationException
{
    if (options.getConsistency() == null)
        throw new InvalidRequestException("Invalid empty consistency level");

    if (hasConditions() && options.getProtocolVersion() == 1)
        throw new InvalidRequestException("Conditional updates are not supported by the protocol version in use. You need to upgrade to a driver using the native protocol v2.");

    return hasConditions()
         ? executeWithCondition(queryState, options)
         : executeWithoutCondition(queryState, options);
}

The execution continue to the method ModificationStatement.executeWithoutCondition(...) as our insert statement does not contain if not exists. Method getMutations(...) return a collection of mutations to be perform.

The collections of mutation is pass to StorageProxy.mutateWithTriggers(...) for further processing. This column family does not have trigger, so the execution continue to method StorageProxy.mutate() . The description of this method is informative, it write:

Use this method to have these Mutations applied across all replicas. This method will take care of the possibility of a replica being down and hint the data across to some other replica.

So this method basically does saving of data by applying to all replicas. If you trace along this path, you should notice the cqlsh tracing debug output appear along the way.

That's it for this article, for my next article, we will trace for cassandra read path. Thank you.