Sunday, April 27, 2014

code study in cassandra compaction 108 and check what is actually gets remove

Last we covered topic such as compaction via jconsole and general study into compaction and what this article is going to focus is, when compaction happened, what happened to the data that is marked as delete, that is the tombstone?

Continue to where we left in previous article, in the method CompactionTask.execute() , snippet below:
AbstractCompactionIterable ci = DatabaseDescriptor.isMultithreadedCompaction()
? new ParallelCompactionIterable(OperationType.COMPACTION, toCompact, controller)
: new CompactionIterable(OperationType.COMPACTION, toCompact, controller);
CloseableIterator<AbstractCompactedRow> iter = ci.iterator();
Iterator<AbstractCompactedRow> nni = Iterators.filter(iter, Predicates.notNull());

calling ci.iterator() return a new Reducer() where this class will perform remove this row from cache and sstable.

protected class Reducer extends MergeIterator.Reducer<IColumnIterator, AbstractCompactedRow>
{
protected final List<SSTableIdentityIterator> rows = new ArrayList<SSTableIdentityIterator>();

public void reduce(IColumnIterator current)
{
rows.add((SSTableIdentityIterator) current);
}

protected AbstractCompactedRow getReduced()
{
assert !rows.isEmpty();

try
{
AbstractCompactedRow compactedRow = controller.getCompactedRow(new ArrayList<SSTableIdentityIterator>(rows));
if (compactedRow.isEmpty())
{
controller.invalidateCachedRow(compactedRow.key);
return null;
}
else
{
// If the raw is cached, we call removeDeleted on it to have/ coherent query returns. However it would look
// like some deleted columns lived longer than gc_grace + compaction. This can also free up big amount of
// memory on long running instances
controller.removeDeletedInCache(compactedRow.key);
}

return compactedRow;
}
finally
{
rows.clear();
if ((row++ % 1000) == 0)
{
long n = 0;
for (SSTableScanner scanner : scanners)
n += scanner.getFilePointer();
bytesRead = n;
throttle.throttle(bytesRead);
}
}
}
}

The logic is similar and below is the logic to remove the expired column from the standard column family.
private static void removeDeletedStandard(ColumnFamily cf, int gcBefore)
{
Iterator<IColumn> iter = cf.iterator();
while (iter.hasNext())
{
IColumn c = iter.next();
ByteBuffer cname = c.name();
// remove columns if
// (a) the column itself is tombstoned or
// (b) the CF is tombstoned and the column is not newer than it
//
// Note that we need the inequality below for case (a) to be strict for expiring columns
// to work correctly -- see the comment in ExpiringColumn.isMarkedForDelete().
if ((c.isMarkedForDelete() && c.getLocalDeletionTime() < gcBefore)
|| c.timestamp() <= cf.getMarkedForDeleteAt())
{
iter.remove();
}
}
}

So that's pretty obvious. columns and rows get remove if the condition is satisfied.

Last but not least, if you are happy reading this and learn something, please remember to donate too.

Saturday, April 26, 2014

study gc parameters in cassandra 1.0.8

Today we are going to study the GC parameter in the file cassandra-env.sh
. Below are the GC parameter extracted from cassandra 1.0.8 environment file cassandra-env.sh . So let's study them one by one what is the parameter means and what can be change.
# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

# GC logging options -- uncomment to enable
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
# JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
# JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
# JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1"
# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc-`date +%s`.log"

-XX:+UseParNewGC

Use parallel algorithm for young space collection.

-XX:+UseConcMarkSweepGC

Use Concurrent Mark-Sweep GC in the old generation

-XX:SurvivorRatio=8

Ratio of eden/survivor space size. The default value is 8

-XX:MaxTenuringThreshold=1

Max value for tenuring threshold.

-XX:CMSInitiatingOccupancyFraction=75

Percentage CMS generation occupancy to start a CMS collection cycle (A negative value means that CMSTirggerRatio is used).

-XX:+UseCMSInitiatingOccupancyOnly

Only use occupancy as a criterion for starting a CMS collection.

 

-XX:+PrintGCDetails

Print more elaborated GC info

-XX:+PrintGCDateStamps

Print date stamps at garbage collection events (e.g. 2011-09-08T14:20:29.557+0400: [GC... )

-XX:+PrintHeapAtGCPrint

heap layout before and after each GC

-XX:+PrintTenuringDistribution

Print detailed demography of young space after each collection

-XX:+PrintGCApplicationStoppedTime

Print the time the application has been stopped

-XX:+PrintPromotionFailure

Print additional diagnostic information following promotion failure


-XX:PrintFLSStatistics=1

Print additional info concerning free lists


-Xloggc:<file>

Redirects GC output to file instead of console

The first part of GC tuning is geared toward which GC strategy to use in cassandra. The second GC tuning is more toward fine tune GC logging example timestamp, heap layaout, etc. If you want to get even more challenging, I end this article by providing a few good links for your further references.

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html
http://library.blackboard.com/ref/df5b20ed-ce8d-4428-a595-a0091b23dda3/Content/_admin_server_optimize/optimize_non_standard_jvm_arguments.htm

Last but not least, if you are happy reading this and learn something, please remember to donate too.

Friday, April 25, 2014

code dive into cassandra Stage.REPLICATE_ON_WRITE

If you are administrator of a cassandra cluster, sometime you may notice StatusLogger started to flood in cassandra system.log. Example below is the log snippet found in system.log. So what and why this happened? Let us read into the codes.
 INFO [ScheduledTasks:1] 2014-04-17 14:18:00,079 StatusLogger.java (line 65) ReplicateOnWriteStage            17        17         0

StatusLogger started to write about the node thread pools into cassandra system.log under two conditions:

These indications will give an idea that the node is under stress. As you have noticed from system.log, there are many stages involved and with this article, we are going to focus on the metric Stage.REPLICATE_ON_WRITE.

What is replicate on write stage? From the code description, Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Aside from the code description, we are going to understand this stage by studying into the code.

There are 11 stages involved. When CassandraDaemon class kickstarted, StageManager is called and stages were initialized. Of cause, Stage.REPLICATE_ON_WRITE is one of the stages. An JMXConfigurableThreadPoolExecutor object with configuration 32 threads and 60 seconds keep alive is initialized. When this happened, this object is also registered to MBean server.

Apparently replicate on write stage is only trigger by column family with type counter and the code snippet below is the only code that increment replicate on write metric.
private static Runnable counterWriteTask(final IMutation mutation,
final Collection<InetAddress> targets,
final IWriteResponseHandler responseHandler,
final String localDataCenter,
final ConsistencyLevel consistency_level)
{
return new DroppableRunnable(StorageService.Verb.MUTATION)
{
public void runMayThrow() throws IOException
{
assert mutation instanceof CounterMutation;
final CounterMutation cm = (CounterMutation) mutation;

// apply mutation
cm.apply();
responseHandler.response(null);

// then send to replicas, if any
targets.remove(FBUtilities.getBroadcastAddress());
if (cm.shouldReplicateOnWrite() && !targets.isEmpty())
{
// We do the replication on another stage because it involves a read (see CM.makeReplicationMutation)
// and we want to avoid blocking too much the MUTATION stage
StageManager.getStage(Stage.REPLICATE_ON_WRITE).execute(new DroppableRunnable(StorageService.Verb.READ)
{
public void runMayThrow() throws IOException, TimeoutException
{
// send mutation to other replica
sendToHintedEndpoints(cm.makeReplicationMutation(), targets, responseHandler, localDataCenter, consistency_level);
}
});
}
}
};
}

Whenever ThreadPoolExecutor execute the object DroppableRunner, the task will be execute by a thread in the thread pool executor.

Interface IExecutorMBean exposed three metric:

  • getActiveCount

  • getCompletedTasks

  • getPendingTasks


and interface JMXEnabledThreadPoolExecutorMBean exposed two more metrics:

  • getTotalBlockedTasks

  • getCurrentlyBlockedTasks


StatusLogger.log exposed getActiveCount, getPendingTasks and getCurrentlyBlockedTasks, hence the three columns per stage in the system.log output.

getActiveCount
get active count is actually implemented within class ThreadPoolExecutor. Whenever a worker is running a task, it is consider as an active task and this is consider as one count.

getCompletedTasks
get completed tasks were actually a wrapper to ThreadPoolExecutor.getCompletedTaskCount(). Whenever a worker is finished executed a task, this is consider one count.

getTotalBlockedTasks
when DebuggableThreadPoolExecutor object was initialized, a rejected execution handler is set. Whenever within ThreadPoolExecutor reject a command, rejectedExecution() is trigger and executed. So this translate to one reject is equivalent as one count.

That's about it for this article. When I study into this code and write this article, I get amazed on how this code is structured and it is complex. I would really recommend into study ThreadPoolExecutor.java as cassandra stage reference this code throughout.

Last but not least, if you are happy reading this and learn something, please remember to donate too.

Monday, April 21, 2014

Enable or disable sstable compression?

In cassandra 2.0.6, there are a few compression for sstables, the default is LZ4Compressor. There are others such as DeflateCompressor, SnappyCompressor or
do not compress the sstables at all.

You can read more about compression at official documentation as found it here.

With this blog, I will create two scenarios where first scenario is with enable compression and another scenario is without compression. This is the only different for both scenarios.

So I have create 50 thousands insert statement with cql and then insert using by feeding to cqlsh. So first , the schema below with LZ4Compressor compression and leave value for key sstable_compression empty for no compression.
CREATE TABLE users (
user_id text,
age int,
first text,
last text,
middle text,
PRIMARY KEY (user_id)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='storing user data' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX idxAge ON users (age);

CREATE INDEX idxLast ON users (last);

jason@localhost:~$ wc -l data.cql
50000 data.cql
jason@localhost:~$ cqlsh 192.168.0.2 9160 -k jw_schema1 -f data.cql
jason@localhost:~$

so looks good, that we have total rows of 50 thousands.
cqlsh:jw_schema1> select count(*) from users limit 100000;

count
-------
50000

(1 rows)

cqlsh:jw_schema1>

Ran nodetool repair, flush, cleanup and then compact. With compression enable, the sstable count only 1 and the total filesize in this directory is about 4.5MB.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l
total 4576
-rw-r--r-- 1 cassandra cassandra 179 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 599421 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 179 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 598579 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 680 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 971 Apr 15 21:02 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 2387391 Apr 15 21:02 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 62512 Apr 15 21:02 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 938894 Apr 15 21:02 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:02 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 6615 Apr 15 21:02 jw_schema1-users-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users-jb-1-TOC.txt
drwxr-xr-x 2 cassandra cassandra 4096 Apr 15 20:57 snapshots
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

Right now without compression, the total file size is about 11MB. Noticed that, the size is almost double and the sstable count is two.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l
total 10860
-rw-r--r-- 1 cassandra cassandra 48 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 687656 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 78 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 32 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 455238 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 78 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4393 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-TOC.txt
-rw-r--r-- 1 cassandra cassandra 48 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 685677 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 425 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 32 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 453259 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 287 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4393 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-TOC.txt
-rw-r--r-- 1 cassandra cassandra 288 Apr 15 21:23 jw_schema1-users-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 4612770 Apr 15 21:23 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:23 jw_schema1-users-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 37880 Apr 15 21:23 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 564480 Apr 15 21:23 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:23 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 3984 Apr 15 21:23 jw_schema1-users-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 192 Apr 15 21:24 jw_schema1-users-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 3015018 Apr 15 21:24 jw_schema1-users-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:24 jw_schema1-users-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 24648 Apr 15 21:24 jw_schema1-users-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 374414 Apr 15 21:24 jw_schema1-users-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:24 jw_schema1-users-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 2672 Apr 15 21:24 jw_schema1-users-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users-jb-2-TOC.txt

With current hardware setup, which is loaded, with sstable compression enable, at times, the request get rpc timeout but at times, the result is returned. However without compression on sstable, all the requests executed get timeout. Below are the query perform via cqlsh.
cqlsh:jw_schema1> select * from users where age > 95 and last = 'smith' allow filtering;
Request did not complete within rpc_timeout.

Apparently enable compression does improve reading speed and saving disk size.

Saturday, April 19, 2014

Introduction to CRUD on cql 3.0 data type

In a previous article, we covered a basic data definition language,  and in this article, we are going to cover data manipulation language. With cql3, composite data type is pretty interesting compare to sql. Official documentation available here.

We covered all data type in cql 3.0 except counter and now we will create all the available data types that can coexists within a table. Let's do it.
CREATE TABLE dataType (
id uuid,
name ascii,
amount bigint,
binary blob,
isSingle boolean,
lamp decimal,
salary double,
works float,
ip inet,
car int,
email set<text>,
kidsAge map<text,int>,
places list<text>,
description text,
lastUpdate timestamp,
myTimeUUID timeuuid,
longDescription varchar,
spending varint,
PRIMARY KEY (id)
);

cqlsh:jw_schema1> desc table datatype;

CREATE TABLE datatype (
id uuid,
amount bigint,
binary blob,
car int,
description text,
email set<text>,
ip inet,
issingle boolean,
kidsage map<text, int>,
lamp decimal,
lastupdate timestamp,
longdescription text,
mytimeuuid timeuuid,
name ascii,
places list<text>,
salary double,
spending varint,
works float,
PRIMARY KEY (id)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};

looks good, table created and let's continue by insert data into this table!
insert into datatype (id, name, amount, binary, issingle, lamp, salary, works, ip, car, email, kidsage, places, description, lastUpdate, myTimeUUID, longDescription, spending) values (62c36092-82a1-3a00-93d1-46196ee77204, 'jason wee', 123, 0xff, false, 10, 100000, 1, '192.168.0.1', 3, {'a@b.com', 'c@d.com'}, {'juniorA':1, 'juniorB':2}, ['kuala lumpur', 'petaling jaya', 'kepong'], 'hello world', '2014-04-15 00:00:00', maxTimeuuid('2014-04-15 00:05+0000'), 'this is a longer hello world', 123);

cqlsh:jw_schema1> select * from datatype;

id | amount | binary | car | description | email | ip | issingle | kidsage | lamp | lastupdate | longdescription | mytimeuuid | name | places | salary | spending | works
--------------------------------------+--------+--------+-----+-------------+------------------------+-------------+----------+------------------------------+------+--------------------------+------------------------------+--------------------------------------+-----------+---------------------------------------------+--------+----------+-------
62c36092-82a1-3a00-93d1-46196ee77204 | 123 | 0xff | 3 | hello world | {'a@b.com', 'c@d.com'} | 192.168.0.1 | False | {'juniorA': 1, 'juniorB': 2} | 10 | 2014-04-15 00:00:00+0800 | this is a longer hello world | 95fc050f-c431-11e3-7f7f-7f7f7f7f7f7f | jason wee | ['kuala lumpur', 'petaling jaya', 'kepong'] | 1e+05 | 123 | 1

(1 rows)

Goodies, all data were inserted. Let's try update, we start by single update to one column.
cqlsh:jw_schema1> update datatype set amount = 456 where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select amount from datatype where id = 62c36092-82a1-3a00-93d1-46196ee77204;

amount
--------
456

(1 rows)

Looks good too! Now we will update three fields.
cqlsh:jw_schema1> update datatype set binary = 0x68656c6c6f20776f726c64, car = 6, description = 'changed description' where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select binary, car, description from datatype where id = 62c36092-82a1-3a00-93d1-46196ee77204;

binary | car | description
--------------------------+-----+---------------------
0x68656c6c6f20776f726c64 | 6 | changed description

(1 rows)

cqlsh:jw_schema1>

Looks good! The binary data type always prefix with 0x. It is hex representation of hello world. Let's now change the composite data type.
cqlsh:jw_schema1> update datatype set email = {'e@f.com'} where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select email from datatype;

email
-------------
{'e@f.com'}

(1 rows)

Hmm... email field values get overridden. So how do we append it? concat them ! :-)
cqlsh:jw_schema1> update datatype set email = email + {'a@b.com', 'c@d.com'} where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select email from datatype;

email
-----------------------------------
{'a@b.com', 'c@d.com', 'e@f.com'}

(1 rows)

Move on, update data type boolean and IP number.
cqlsh:jw_schema1> update datatype set ip = 'a.b.c.d', issingle = True where id = 62c36092-82a1-3a00-93d1-46196ee77204;
Bad Request: unable to make inetaddress from 'a.b.c.d'
cqlsh:jw_schema1> update datatype set ip = '255.255.255.255', issingle = True where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> update datatype set ip = '255.255.255.255', issingle = Trued where id = 62c36092-82a1-3a00-93d1-46196ee77204;
Bad Request: line 1:61 no viable alternative at input 'where'
cqlsh:jw_schema1> update datatype set ip = '255.255.255.255', issingle = True where id = 62c36092-82a1-3a00-93d1-46196ee77204;

cqlsh:jw_schema1> select ip,issingle from datatype;

ip | issingle
-----------------+----------
255.255.255.255 | True

(1 rows)

Simple checking on IP number and boolean data type is enforce. Let's change map now.
cqlsh:jw_schema1> update datatype set kidsage = {'juniorC':3} where  id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select kidsage from datatype;

kidsage
----------------
{'juniorC': 3}

(1 rows)

cqlsh:jw_schema1> update datatype set kidsage = kidsage + {'juniorA':1, 'juniorB':2} where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select kidsage from datatype;

kidsage
--------------------------------------------
{'juniorA': 1, 'juniorB': 2, 'juniorC': 3}

(1 rows)

Exactly like set behavior, if you need to append the data, you need to concat them using plus sign.
cqlsh:jw_schema1> update datatype set lamp = 12.34, lastupdate = '2014-04-16 20:00', longdescription = 'this is a long long long hello world', mytimeuuid = maxTimeuuid('2014-04-16'), name = 'john smith' where  id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select lamp, lastupdate, longdescription, mytimeuuid, name from datatype;

lamp | lastupdate | longdescription | mytimeuuid | name
-------+--------------------------+--------------------------------------+--------------------------------------+------------
12.34 | 2014-04-16 20:00:00+0800 | this is a long long long hello world | ff72270f-c4b6-11e3-7f7f-7f7f7f7f7f7f | john smith

(1 rows)

cqlsh:jw_schema1>

Everything looks good, timestamp provided by hour and longer text and time uuid can accept only date, everything seem cool.
cqlsh:jw_schema1> update datatype set places = places + ['cheras'], salary = 985621.35, spending = 12355, works = 89.36 where  id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select places, salary, spending, works from datatype;

places | salary | spending | works
-------------------------------------------------------+------------+----------+-------
['kuala lumpur', 'petaling jaya', 'kepong', 'cheras'] | 9.8562e+05 | 12355 | 89.36

(1 rows)

cqlsh:jw_schema1>

Okay, we pretty all cover update all the data type. Let's remove this one row.
cqlsh:jw_schema1> delete amount,binary,car,description,email,ip,issingle,kidsage,lamp,lastupdate,longdescription,mytimeuuid from datatype where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select * from datatype;

id | amount | binary | car | description | email | ip | issingle | kidsage | lamp | lastupdate | longdescription | mytimeuuid | name | places | salary | spending | works
--------------------------------------+--------+--------+------+-------------+-------+------+----------+---------+------+------------+-----------------+------------+------------+-------------------------------------------------------+------------+----------+-------
62c36092-82a1-3a00-93d1-46196ee77204 | null | null | null | null | null | null | null | null | null | null | null | null | john smith | ['kuala lumpur', 'petaling jaya', 'kepong', 'cheras'] | 9.8562e+05 | 12355 | 89.36

(1 rows)

Pretty interesting, we can delete columns within a row. The data is set to null.
cqlsh:jw_schema1> delete from datatype where id = 62c36092-82a1-3a00-93d1-46196ee77204;
cqlsh:jw_schema1> select * from datatype;

(0 rows)

Now we have deleted everything.

  • looks like it is case insensitive, that, is when we created the table name, dataType it is stored as datatype.

  • the composite datatype is definitely nice to have as we don't have to link a few tables.

  • we can also delete a few columns or we can delete the entire row.


 

Friday, April 18, 2014

Should we use cache in application server interface to cassandra?

There are many debates whether should a cache layer should be place in front of cassandra, read more here.

Initially, if your data set store in cassandra is small, well , fine, enable key cache in cassandra or if you are sure enough, row cache perhaps

From application server development point of view, less code to develop and faster development cycle. This could also means one less test and less ambuiguity. Everyone is happy.

But imagine over times, when data store in cassandra grow and (read and write) requests to cassandra are also increase in tandem, perhaps some request will get timeout. Remember that when sstables grows, compaction and repair will take considerable amount of hardware resources (example memory and block device).

So in this situation, it will not be such a bad idea to have a cache layer on application server. To reduce read request to cassandra, one can implement a simple caching layer in the application server. Of cause, the design and requirement is specific to one's need but the general idea is that, by using a caching layer, request to the cassandra server can be reduce and if caching use correctly, the effect can be tremendous.

There is discussion on how this can be implemented.

Sunday, April 13, 2014

Research into cassandra nodetool cfhistograms and interpret statistics

What is nodetool cfhistogram?

According to the official documentation definition: The nodetool cfhistograms command provides statistics about a table, including read/write latency, row size, column count, and number of SSTables.

If you noticed the picture output below, it is entirely different than the cfhistogram output in cassandra 2.0.6 . Apparently output of cfhistograms is simplified and improved! You can find more information about this improvement here. To get the existing way of output, give −−compact to the nodetool as a parameter.



Okay, let's start by issue command nodetool cfhistograms to our cluster.
jason@localhost:~$ nodetool -h localhost cfhistograms jw_schema1 users
jw_schema1/users histograms

SSTables per Read
1 sstables: 997

Write Latency (microseconds)
No Data

Read Latency (microseconds)
103 us: 1
124 us: 15
149 us: 28
179 us: 131
215 us: 306
258 us: 373
310 us: 66
372 us: 17
446 us: 6
535 us: 21
642 us: 10
770 us: 2
924 us: 1
1109 us: 3
1331 us: 1
1597 us: 1
1916 us: 3
2299 us: 0
2759 us: 2
3311 us: 1
3973 us: 0
4768 us: 0
5722 us: 1
6866 us: 0
8239 us: 1
9887 us: 4
11864 us: 1
14237 us: 1
17084 us: 1

Partition Size (bytes)
149 bytes: 3

Cell Count per Partition
5 cells: 3

The statistics is a bit difficult to understand if you do not know what does it mean. Let's begin by studying into the cfhistograms codes.
private void printCfHistograms(String keySpace, String columnFamily, PrintStream output, boolean compactFormat)
{
ColumnFamilyStoreMBean store = this.probe.getCfsProxy(keySpace, columnFamily);

// default is 90 offsets
long[] offsets = new EstimatedHistogram().getBucketOffsets();

long[] rrlh = store.getRecentReadLatencyHistogramMicros();
long[] rwlh = store.getRecentWriteLatencyHistogramMicros();
long[] sprh = store.getRecentSSTablesPerReadHistogram();
long[] ersh = store.getEstimatedRowSizeHistogram();
long[] ecch = store.getEstimatedColumnCountHistogram();

output.println(String.format("%s/%s histograms", keySpace, columnFamily));
output.println("");

if (compactFormat)
{
output.println(String.format("%-10s%10s%18s%18s%18s%18s",
"Offset", "SSTables", "Write Latency", "Read Latency", "Partition Size", "Cell Count"));
output.println(String.format("%-10s%10s%18s%18s%18s%18s",
"", "", "(micros)", "(micros)", "(bytes)", ""));

for (int i = 0; i < offsets.length; i++)
{
output.println(String.format("%-10d%10s%18s%18s%18s%18s",
offsets[i],
(i < sprh.length ? sprh[i] : "0"),
(i < rwlh.length ? rwlh[i] : "0"),
(i < rrlh.length ? rrlh[i] : "0"),
(i < ersh.length ? ersh[i] : "0"),
(i < ecch.length ? ecch[i] : "0")));
}
}
else
{
output.println("SSTables per Read");
printHistogram(sprh, offsets, "sstables", output);

output.println("Write Latency (microseconds)");
printHistogram(rwlh, offsets, "us", output);

output.println("Read Latency (microseconds)");
printHistogram(rrlh, offsets, "us", output);

output.println("Partition Size (bytes)");
printHistogram(ersh, offsets, "bytes", output);

output.println("Cell Count per Partition");
printHistogram(ecch, offsets, "cells", output);
}
}

Essentially a proxy ColumnFamilyStoreMBean is made through jmx ($ jconsole service:jmx:rmi:///jndi/rmi://192.168.0.2:7199/jmxrmi also see picture below) based on the previous keyspace and column family specified in the nodetool parameter. The default bucket offset will always be 90. Thus if you carefully analyzed the row output of the compact statistics, you will noticed exactly 90 rows each time nodetool cfhistogram command is triggered.



You would ask, why would 90 bucket offsets? Well according to the codes documentation:
The series of values to which the counts in `buckets` correspond:
1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 20, etc.
Thus, a `buckets` of [0, 0, 1, 10] would mean we had seen one value of 3 and 10 values of 4.

The series starts at 1 and grows by 1.2 each time (rounding and removing duplicates). It goes from 1
to around 36M by default (creating 90+1 buckets), which will give us timing resolution from microseconds to
36 seconds, with less precision as the numbers get larger.

Each bucket represents values from (previous bucket offset, current offset].

Depending if parameter compact is specified, the output will be different. There are six metrics exposed. We will take a closer look.

  • offset | the bucket offset


Bucket offset from 149 (exclusive) to 179 (inclusive). Essentially this bucket offset contain latency from 149 microseconds until 179 microseconds.




  • SSTables | recent SSTables per read


With each read, total of sstables accessed accountable for. Note that for each nodetool cfhistograms trigger for this keyspace and column family, this metric will be reset.


This metric will increase if there is any call to CollationController.java or CacheService.java




  • Write Latency (micros) | recent write latency histogram in microseconds.


An array representing the latency histogram for write in microseconds. Note that for each nodetool cfhistograms trigger for this keyspace and column family, this metric will be reset.


This metric will increase if there is any call to ColumnFamilyStore.java, StorageProxy.java or WeightedQueue.java .




  • Read Latency (micros) | recent read latency histogram in Microseconds.


An array representing the latency histogram for read in microseconds. Note that for each nodetool cfhistograms trigger for this keyspace and column family, this metric will be reset.




  • Partition Size (bytes ) | estimated row size histogram


As estimation of row size in bytes. Note that for each nodetool cfhistograms trigger for this keyspace and column family, this metric will NOT reset.


The metric is collected by iterating over the sstables, and get the estimated row size in bytes.




  • Cell Count | estimated column count histogram


Estimated number of columns. Note that for each nodetool cfhistograms trigger for this keyspace and column family, this metric will NOT reset.


The metric is collected by iterating over the sstables, and get the estimated column count.


So with these interpretation from the codes, let's take another compact form cfhistogram to interpret the metrics. First, we will make start by make some statistics:
cqlsh:jw_schema1> select * from users where age > 5 and age < 50 and last = 'smith' allow filtering;

jason@localhost:~$ nodetool -h localhost cfhistograms jw_schema1 users -c
jw_schema1/users histograms

Offset SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
1 997 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 1000
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
10 0 0 0 0 0
12 0 0 0 0 0
14 0 0 0 0 0
17 0 0 0 0 0
20 0 0 0 0 0
24 0 0 0 0 0
29 0 0 0 0 0
35 0 0 0 0 0
42 0 0 0 0 0
50 0 0 0 0 0
60 0 0 0 0 0
72 0 0 0 0 0
86 0 0 0 0 0
103 0 0 0 0 0
124 0 0 0 0 0
149 0 0 0 999 0
179 0 0 0 1 0
215 0 0 0 0 0
258 0 0 0 0 0
310 0 0 0 0 0
372 0 0 0 0 0
446 0 0 0 0 0
535 0 0 0 0 0
642 0 0 0 0 0
770 0 0 0 0 0
924 0 0 0 0 0
1109 0 0 0 0 0
1331 0 0 51 0 0
1597 0 0 491 0 0
1916 0 0 95 0 0
2299 0 0 53 0 0
2759 0 0 84 0 0
3311 0 0 95 0 0
3973 0 0 41 0 0
4768 0 0 32 0 0
5722 0 0 25 0 0
6866 0 0 9 0 0
8239 0 0 7 0 0
9887 0 0 6 0 0
11864 0 0 4 0 0
14237 0 0 0 0 0
17084 0 0 2 0 0
20501 0 0 0 0 0
24601 0 0 0 0 0
29521 0 0 0 0 0
35425 0 0 0 0 0
42510 0 0 1 0 0
51012 0 0 0 0 0
61214 0 0 0 0 0
73457 0 0 0 0 0
88148 0 0 0 0 0
105778 0 0 1 0 0
126934 0 0 0 0 0
152321 0 0 0 0 0
182785 0 0 0 0 0
219342 0 0 0 0 0
263210 0 0 0 0 0
315852 0 0 0 0 0
379022 0 0 0 0 0
454826 0 0 0 0 0
545791 0 0 0 0 0
654949 0 0 0 0 0
785939 0 0 0 0 0
943127 0 0 0 0 0
1131752 0 0 0 0 0
1358102 0 0 0 0 0
1629722 0 0 0 0 0
1955666 0 0 0 0 0
2346799 0 0 0 0 0
2816159 0 0 0 0 0
3379391 0 0 0 0 0
4055269 0 0 0 0 0
4866323 0 0 0 0 0
5839588 0 0 0 0 0
7007506 0 0 0 0 0
8409007 0 0 0 0 0
10090808 0 0 0 0 0
12108970 0 0 0 0 0
14530764 0 0 0 0 0
17436917 0 0 0 0 0
20924300 0 0 0 0 0
25109160 0 0 0 0 0


  • There are 51 read requests spend time from 1109 microsecond to 1331 microsecond.

  • 997 sstables were read and spent time 1 microsecond.

  • Because this is a read operation, (cql select statement), there is no write latency involved.

  • The mean size for 999 partition is 149 bytes and another one is 179 bytes.

  • There are 1000 partition with 5 cells.


These metric is good for monitoring if you can poll periodically and plot them into graphs. Note that, those methods covered above, many had been deprecated in this cassandra version and probably in the coming cassandra, it will be removed and that they will have better way of depicting the metric. If you started on older cassandra version for example, pre-cassandra 1.1, the cell is correspond to column whilst partition is correspond to row.

Thank you.