Friday, April 18, 2014

Should we use a cache in the application server interfacing with cassandra?

There are many debates on whether a cache layer should be placed in front of cassandra; read more here.

Initially, if the data set stored in cassandra is small, enabling the key cache in cassandra is fine, or, if you are confident enough, perhaps even the row cache.

From the application server development point of view, that means less code to develop and a faster development cycle. It also means one less component to test and less ambiguity. Everyone is happy.

But imagine over time, as the data stored in cassandra grows and (read and write) requests to cassandra increase in tandem, some requests may start to time out. Remember that as sstables grow, compaction and repair will take a considerable amount of hardware resources (for example, memory and block device I/O).

So in this situation, it is not such a bad idea to have a cache layer in the application server. To reduce read requests to cassandra, one can implement a simple caching layer in the application server. Of course, the design and requirements are specific to one's needs, but the general idea is that a caching layer reduces requests to the cassandra cluster, and if caching is used correctly, the effect can be tremendous.
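
To make the idea concrete, below is a minimal sketch of such a read-through cache using Guava's CacheBuilder (any cache library, or even a plain map with TTL handling, would do). The fetchUserFromCassandra method is a placeholder for whatever client call the application already makes; the cache size and expiry values are illustrative assumptions, not recommendations.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

public class UserCache
{
    // read-through cache: on a miss, load() fetches from cassandra and the result
    // is kept until it expires or is evicted by size.
    private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(100000)                   // bound the heap usage
            .expireAfterWrite(5, TimeUnit.MINUTES) // tolerate slightly stale data
            .build(new CacheLoader<String, String>()
            {
                @Override
                public String load(String userId) throws Exception
                {
                    return fetchUserFromCassandra(userId);
                }
            });

    public String getUser(String userId) throws ExecutionException
    {
        return cache.get(userId); // cache hits never touch cassandra
    }

    // placeholder for the real client call (thrift, hector, the datastax driver, ...)
    private String fetchUserFromCassandra(String userId)
    {
        return "...";
    }
}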

There is discussion on how this can be implemented.

Sunday, April 13, 2014

Research into cassandra nodetool cfhistograms and interpreting the statistics

What is nodetool cfhistograms?

According to the official documentation definition: The nodetool cfhistograms command provides statistics about a table, including read/write latency, row size, column count, and number of SSTables.

If you look at the picture output below, it is entirely different from the cfhistograms output in cassandra 2.0.6. Apparently the output of cfhistograms has been simplified and improved! You can find more information about this improvement here. To get the old style of output, pass --compact to nodetool as a parameter.



Okay, let's start by issuing the command nodetool cfhistograms against our cluster.
jason@localhost:~$ nodetool -h localhost cfhistograms jw_schema1 users
jw_schema1/users histograms

SSTables per Read
1 sstables: 997

Write Latency (microseconds)
No Data

Read Latency (microseconds)
103 us: 1
124 us: 15
149 us: 28
179 us: 131
215 us: 306
258 us: 373
310 us: 66
372 us: 17
446 us: 6
535 us: 21
642 us: 10
770 us: 2
924 us: 1
1109 us: 3
1331 us: 1
1597 us: 1
1916 us: 3
2299 us: 0
2759 us: 2
3311 us: 1
3973 us: 0
4768 us: 0
5722 us: 1
6866 us: 0
8239 us: 1
9887 us: 4
11864 us: 1
14237 us: 1
17084 us: 1

Partition Size (bytes)
149 bytes: 3

Cell Count per Partition
5 cells: 3

The statistics are a bit difficult to understand if you do not know what they mean. Let's begin by studying the cfhistograms code.
private void printCfHistograms(String keySpace, String columnFamily, PrintStream output, boolean compactFormat)
{
    ColumnFamilyStoreMBean store = this.probe.getCfsProxy(keySpace, columnFamily);

    // default is 90 offsets
    long[] offsets = new EstimatedHistogram().getBucketOffsets();

    long[] rrlh = store.getRecentReadLatencyHistogramMicros();
    long[] rwlh = store.getRecentWriteLatencyHistogramMicros();
    long[] sprh = store.getRecentSSTablesPerReadHistogram();
    long[] ersh = store.getEstimatedRowSizeHistogram();
    long[] ecch = store.getEstimatedColumnCountHistogram();

    output.println(String.format("%s/%s histograms", keySpace, columnFamily));
    output.println("");

    if (compactFormat)
    {
        output.println(String.format("%-10s%10s%18s%18s%18s%18s",
                "Offset", "SSTables", "Write Latency", "Read Latency", "Partition Size", "Cell Count"));
        output.println(String.format("%-10s%10s%18s%18s%18s%18s",
                "", "", "(micros)", "(micros)", "(bytes)", ""));

        for (int i = 0; i < offsets.length; i++)
        {
            output.println(String.format("%-10d%10s%18s%18s%18s%18s",
                    offsets[i],
                    (i < sprh.length ? sprh[i] : "0"),
                    (i < rwlh.length ? rwlh[i] : "0"),
                    (i < rrlh.length ? rrlh[i] : "0"),
                    (i < ersh.length ? ersh[i] : "0"),
                    (i < ecch.length ? ecch[i] : "0")));
        }
    }
    else
    {
        output.println("SSTables per Read");
        printHistogram(sprh, offsets, "sstables", output);

        output.println("Write Latency (microseconds)");
        printHistogram(rwlh, offsets, "us", output);

        output.println("Read Latency (microseconds)");
        printHistogram(rrlh, offsets, "us", output);

        output.println("Partition Size (bytes)");
        printHistogram(ersh, offsets, "bytes", output);

        output.println("Cell Count per Partition");
        printHistogram(ecch, offsets, "cells", output);
    }
}

Essentially a ColumnFamilyStoreMBean proxy is obtained through JMX ($ jconsole service:jmx:rmi:///jndi/rmi://192.168.0.2:7199/jmxrmi, also see the picture below) for the keyspace and column family specified in the nodetool parameters. The number of bucket offsets defaults to 90. Thus, if you carefully analyze the row output of the compact statistics, you will notice exactly 90 rows each time the nodetool cfhistograms command is triggered.



You might ask, why 90 bucket offsets? Well, according to the code documentation:
The series of values to which the counts in `buckets` correspond:
1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 17, 20, etc.
Thus, a `buckets` of [0, 0, 1, 10] would mean we had seen one value of 3 and 10 values of 4.

The series starts at 1 and grows by 1.2 each time (rounding and removing duplicates). It goes from 1
to around 36M by default (creating 90+1 buckets), which will give us timing resolution from microseconds to
36 seconds, with less precision as the numbers get larger.

Each bucket represents values from (previous bucket offset, current offset].
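
Just to illustrate the rule quoted above, here is a small sketch of "start at 1, grow by 1.2 each time, round and skip duplicates"; it is not the actual EstimatedHistogram code, only an approximation of the documented behaviour.

public class BucketOffsets
{
    // generate the bucket offsets: start at 1, multiply by 1.2, round,
    // and force progress when rounding would produce a duplicate.
    public static long[] generate(int bucketCount)
    {
        long[] offsets = new long[bucketCount];
        long last = 1;
        offsets[0] = last;
        for (int i = 1; i < bucketCount; i++)
        {
            long next = Math.round(last * 1.2);
            if (next == last)
                next = last + 1;
            offsets[i] = next;
            last = next;
        }
        return offsets;
    }

    public static void main(String[] args)
    {
        for (long offset : generate(90))
            System.out.println(offset);
    }
}

Printing the series reproduces the 90 offsets you will see in the compact output later in this post: 1, 2, 3, ... 103, 124, 149, 179, ... up to 25109160.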

Depending on whether the compact parameter is specified, the output will differ. Six fields are exposed; we will take a closer look at each.

  • offset | the bucket offset


For example, the bucket at offset 179 covers values from 149 (exclusive) to 179 (inclusive); essentially this bucket contains latencies above 149 microseconds up to and including 179 microseconds.




  • SSTables | recent SSTables per read


For each read, the total number of sstables accessed is accounted for. Note that each time nodetool cfhistograms is triggered for this keyspace and column family, this metric is reset.


This metric increases when there are calls through CollationController.java or CacheService.java.




  • Write Latency (micros) | recent write latency histogram in microseconds.


An array representing the write latency histogram in microseconds. Note that each time nodetool cfhistograms is triggered for this keyspace and column family, this metric is reset.


This metric increases when there are calls through ColumnFamilyStore.java, StorageProxy.java or WeightedQueue.java.




  • Read Latency (micros) | recent read latency histogram in microseconds.


An array representing the read latency histogram in microseconds. Note that each time nodetool cfhistograms is triggered for this keyspace and column family, this metric is reset.




  • Partition Size (bytes) | estimated row size histogram


An estimation of row size in bytes. Note that this metric is NOT reset when nodetool cfhistograms is triggered for this keyspace and column family.


The metric is collected by iterating over the sstables and getting the estimated row size in bytes.




  • Cell Count | estimated column count histogram


An estimation of the number of columns. Note that this metric is NOT reset when nodetool cfhistograms is triggered for this keyspace and column family.


The metric is collected by iterating over the sstables and getting the estimated column count.


So with these interpretations from the code, let's take another compact-form cfhistograms output and interpret the metrics. First, we will start by generating some statistics:
cqlsh:jw_schema1> select * from users where age > 5 and age < 50 and last = 'smith' allow filtering;

jason@localhost:~$ nodetool -h localhost cfhistograms jw_schema1 users -c
jw_schema1/users histograms

Offset SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
1 997 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 1000
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
10 0 0 0 0 0
12 0 0 0 0 0
14 0 0 0 0 0
17 0 0 0 0 0
20 0 0 0 0 0
24 0 0 0 0 0
29 0 0 0 0 0
35 0 0 0 0 0
42 0 0 0 0 0
50 0 0 0 0 0
60 0 0 0 0 0
72 0 0 0 0 0
86 0 0 0 0 0
103 0 0 0 0 0
124 0 0 0 0 0
149 0 0 0 999 0
179 0 0 0 1 0
215 0 0 0 0 0
258 0 0 0 0 0
310 0 0 0 0 0
372 0 0 0 0 0
446 0 0 0 0 0
535 0 0 0 0 0
642 0 0 0 0 0
770 0 0 0 0 0
924 0 0 0 0 0
1109 0 0 0 0 0
1331 0 0 51 0 0
1597 0 0 491 0 0
1916 0 0 95 0 0
2299 0 0 53 0 0
2759 0 0 84 0 0
3311 0 0 95 0 0
3973 0 0 41 0 0
4768 0 0 32 0 0
5722 0 0 25 0 0
6866 0 0 9 0 0
8239 0 0 7 0 0
9887 0 0 6 0 0
11864 0 0 4 0 0
14237 0 0 0 0 0
17084 0 0 2 0 0
20501 0 0 0 0 0
24601 0 0 0 0 0
29521 0 0 0 0 0
35425 0 0 0 0 0
42510 0 0 1 0 0
51012 0 0 0 0 0
61214 0 0 0 0 0
73457 0 0 0 0 0
88148 0 0 0 0 0
105778 0 0 1 0 0
126934 0 0 0 0 0
152321 0 0 0 0 0
182785 0 0 0 0 0
219342 0 0 0 0 0
263210 0 0 0 0 0
315852 0 0 0 0 0
379022 0 0 0 0 0
454826 0 0 0 0 0
545791 0 0 0 0 0
654949 0 0 0 0 0
785939 0 0 0 0 0
943127 0 0 0 0 0
1131752 0 0 0 0 0
1358102 0 0 0 0 0
1629722 0 0 0 0 0
1955666 0 0 0 0 0
2346799 0 0 0 0 0
2816159 0 0 0 0 0
3379391 0 0 0 0 0
4055269 0 0 0 0 0
4866323 0 0 0 0 0
5839588 0 0 0 0 0
7007506 0 0 0 0 0
8409007 0 0 0 0 0
10090808 0 0 0 0 0
12108970 0 0 0 0 0
14530764 0 0 0 0 0
17436917 0 0 0 0 0
20924300 0 0 0 0 0
25109160 0 0 0 0 0


  • There are 51 read requests that took between 1109 and 1331 microseconds.

  • 997 reads accessed exactly 1 sstable each.

  • Because this is a read operation (a CQL select statement), there is no write latency involved.

  • The estimated size of 999 partitions falls into the 149-byte bucket, and one more partition falls into the 179-byte bucket.

  • There are 1000 partitions with 5 cells each.


These metrics are good for monitoring if you poll them periodically and plot them into graphs. Note that many of the methods covered above have been deprecated in this cassandra version and will probably be removed in a coming release, replaced by a better way of exposing the metrics. If you started on an older cassandra version, for example pre-1.1, a cell corresponds to a column while a partition corresponds to a row.
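
If you want to poll these histograms yourself and feed them into a graphing tool, a small JMX client is enough. Below is a minimal sketch that reads the RecentReadLatencyHistogramMicros attribute backing the getRecentReadLatencyHistogramMicros() call shown earlier; the MBean object name pattern and the attribute name are assumptions for this cassandra version, so verify them in jconsole first.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CfHistogramPoller
{
    public static void main(String[] args) throws Exception
    {
        // same JMX endpoint that jconsole connects to
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://192.168.0.2:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // assumed object name pattern for the column family store MBean
            ObjectName name = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=jw_schema1,columnfamily=users");

            // recent (resetting) read latency histogram, one count per bucket offset
            long[] readLatency = (long[]) connection.getAttribute(name, "RecentReadLatencyHistogramMicros");

            for (int i = 0; i < readLatency.length; i++)
                if (readLatency[i] > 0)
                    System.out.println("bucket " + i + " : " + readLatency[i]);
        }
        finally
        {
            connector.close();
        }
    }
}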

Thank you.

Saturday, April 12, 2014

Learn and play with cassandra 2.0.6 snapshot and restore

Snapshots appeared in cassandra as early as version 0.4.0 beta. Today, we are going to learn about cassandra snapshots. Note that if you run snapshot on a node in a cluster, it only snapshots that node. If you want to snapshot all nodes in a cluster, it is much more efficient to use a parallel ssh tool such as clusterssh or pssh.

Fundamentally, when a snapshot is executed, cassandra hard-links the current sstables into a snapshot directory. Be aware that if you have a huge node load, the snapshots can eventually require up to twice the disk space on that server (the hard-linked files keep old sstables on disk even after compaction replaces the live ones), and snapshotting a large number of sstables may also spike the I/O activity on that node.

Let's get down to work.

First, ensure the table (column family) has at least some data.
cqlsh:jw_schema1> select * from users;

user_id | age | first | last | middle
---------+-----+-------+----------+--------
3 | 34 | john | smith | a
2 | 35 | olee | smith | b
1 | 33 | dan | bar | c

(3 rows)

Then take a snapshot; for instance, I only take a snapshot of the keyspace jw_schema1 and the table users. What happens is that cassandra flushes the data to sstables before the snapshot is taken. For options such as giving the snapshot a meaningful name, check out nodetool help.
jason@localhost:~$ nodetool -h localhost snapshot jw_schema1 -cf users
Requested creating snapshot for: jw_schema1 and table: users
Snapshot directory: 1397292720524

The snapshot will be stored under the <data_file_directories> that you set in the cassandra.yaml file. So for instance,
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l snapshots/1397292720524/
total 96K
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 54 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 76 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 68 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 58 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 87 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 75 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 45 Apr 12 16:52 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 206 Apr 12 16:52 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 59 Apr 12 16:52 jw_schema1-users-jb-1-Summary.db

If you run md5sum on the data files of the snapshot and the live data, they match exactly, as you would expect since the snapshot files are hard links to the same data.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ md5sum snapshots/1397292720524/*Data*
3d4351d714500417c74de6811b1eae3b snapshots/1397292720524/jw_schema1-users.idxAge-jb-1-Data.db
a430a2d65c0a504fe3ab06344654a89a snapshots/1397292720524/jw_schema1-users.idxLast-jb-1-Data.db
13798e1ffb5ed6a871d768399f54b125 snapshots/1397292720524/jw_schema1-users-jb-1-Data.db
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ md5sum *Data*
3d4351d714500417c74de6811b1eae3b jw_schema1-users.idxAge-jb-1-Data.db
a430a2d65c0a504fe3ab06344654a89a jw_schema1-users.idxLast-jb-1-Data.db
13798e1ffb5ed6a871d768399f54b125 jw_schema1-users-jb-1-Data.db

A snapshot is not meaningful if you cannot restore it back to the node. So from this point onward, we will take a look at how to restore the snapshot back into the node.

Surprisingly, you might expect a restore command such as nodetool restore backup, but there is none. Rather, there are a few ways to restore the snapshot sstables.

  1. You can use sstableloader,

  2. copy the sstables into <data_file_directories>/data/jw_schema1/users/ and refresh by calling loadNewSSTables via jconsole or by using nodetool refresh (see the JMX sketch after this list),

  3. use a node restart method.
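
For the second option, loadNewSSTables does not have to be clicked by hand in jconsole; any JMX client can invoke it, which is also what nodetool refresh does. A minimal sketch, assuming the usual ColumnFamilies MBean name for this keyspace and table (verify the exact object name in jconsole):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RefreshUsersTable
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // assumed MBean name for table users in keyspace jw_schema1
            ObjectName cfs = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=jw_schema1,columnfamily=users");

            // ask cassandra to pick up the sstables copied into the data directory
            connection.invoke(cfs, "loadNewSSTables", new Object[0], new String[0]);
        }
        finally
        {
            connector.close();
        }
    }
}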


It sounds like a lot of work to use either of the first two methods, so I'm just going to try the last method of restoring the snapshot sstables.

In order to verify that our backup can be restored successfully, we are going to simulate a few failures (disk failure, accidental deletion) here.

  1. copy the snapshot backup somewhere else.

  2. shut down cassandra and delete the cassandra data directory.


Okay, let's continue setting up the simulation environment.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ cp -r snapshots/ ~/cassandra/
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo /etc/init.d/cassandra stop
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ cd /var/lib/cassandra/commitlog/
jason@localhost:/var/lib/cassandra/commitlog$ ls
total 2.6M
-rw-r--r-- 1 cassandra cassandra 32M Apr 11 18:52 CommitLog-3-1397213531634.log
-rw-r--r-- 1 cassandra cassandra 32M Apr 12 18:38 CommitLog-3-1397213531633.log
jason@localhost:/var/lib/cassandra/commitlog$ sudo rm -rf *
jason@localhost:/var/lib/cassandra/commitlog$ cd ../data/jw_schema1/users/
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo rm -rf *
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

So we have copied the snapshot to a cassandra directory under the home directory, and we also stopped cassandra and removed all commitlogs and the table users in keyspace jw_schema1. Note that in this case the schema for table users still exists, as the schema is stored in the system keyspace.

And now we will copy the snapshot from the home directory back into cassandra.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo cp -r ~/cassandra/snapshots/1397292720524/jw_schema1-users* .
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls
total 96K
-rw-r--r-- 1 root root 76 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 root root 54 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 root root 68 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 87 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 root root 58 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 root root 75 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 206 Apr 12 18:50 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 root root 45 Apr 12 18:50 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 1 root root 59 Apr 12 18:50 jw_schema1-users-jb-1-Summary.db

So far it looks good. Now if you tail the cassandra system.log and start cassandra, notice that the sstables are being read. If, during this downtime, data that was supposed to be owned by this node was missed, you should now run nodetool repair to make sure the data is in sync.
 INFO [main] 2014-04-12 18:52:32,555 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,568 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users-jb-1 (206 bytes)
INFO [main] 2014-04-12 18:52:32,701 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users.idxLast
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,719 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users.idxLast-jb-1 (87 bytes)
INFO [main] 2014-04-12 18:52:32,802 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users.idxAge
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,810 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users.idxAge-jb-1 (76 bytes)

jason@localhost:~/$ nodetool -h localhost repair jw_schema1 users
[2014-04-12 18:59:57,477] Starting repair command #1, repairing 1280 ranges for keyspace jw_schema1
..
[2014-04-12 19:00:50,800] Repair command #1 finished

Now let's check whether our data is still there.
jason@localhost:~/$ cqlsh 192.168.0.2 9160 -k jw_schema1
Connected to just4fun at 192.168.0.2:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:jw_schema1> select * from users;

user_id | age | first | last | middle
---------+-----+-------+----------+--------
3 | 34 | john | smith | a
2 | 35 | olee | smith | b
1 | 33 | dan | bar | c
(3 rows)

cqlsh:jw_schema1>

All good. :)

In my humble opinion, because cassandra is built with durability and fault tolerance in mind, snapshots are often not strictly needed. Sure, it is fair to argue that someone could delete data accidentally, but if you can prevent that by blocking it at the front end, you can save a lot of backup and restore maintenance cost for the cluster. If you really want to ensure the data is safe, spin up another cluster in another data centre; then the data is protected from disaster. But hey, there is no harm in learning a new tool in case you need it later down the road.

 

Investigating cassandra 1.0.8 compaction

So what happens when you trigger compact via nodetool? In a nutshell, it goes through a series of low-level java calls.

The execution starts in NodeCmd.java and goes through NodeProbe.java, StorageServiceMBean.java, StorageService.java, ColumnFamilyStore.java, CompactionManager.java, AbstractCompactionTask.java and CompactionTask.java.

Once a NodeProbe object is established, the method forceTableCompaction(...) is called. Within NodeProbe there is a StorageServiceMBean, which is the JMX bean interface implemented by the class StorageService.

What forceTableCompaction(...) does is iterate over the column families and start a major compaction. Code snippet below:
public void forceTableCompaction(String tableName, String... columnFamilies) throws IOException, ExecutionException, InterruptedException
{
    for (ColumnFamilyStore cfStore : getValidColumnFamilies(tableName, columnFamilies))
    {
        cfStore.forceMajorCompaction();
    }
}

So it is pretty clear that the execution obtains the valid column families and calls forceMajorCompaction() on each. What actually happens is that, within forceMajorCompaction(), this object (ColumnFamilyStore) is passed to the CompactionManager singleton to perform an operation known as maximal compaction.

Within the CompactionManager class, the cfStore object is processed concurrently; it is wrapped in a task and submitted to an executor. To explain better, let's read the general compaction flow below:
public Future<Object> submitMaximal(final ColumnFamilyStore cfStore, final int gcBefore)
{
    Callable<Object> callable = new Callable<Object>()
    {
        public Object call() throws IOException
        {
            // acquire the write lock long enough to schedule all sstables
            compactionLock.writeLock().lock();
            try
            {
                if (!cfStore.isValid())
                    return this;
                AbstractCompactionStrategy strategy = cfStore.getCompactionStrategy();
                for (AbstractCompactionTask task : strategy.getMaximalTasks(gcBefore))
                {
                    if (!task.markSSTablesForCompaction(0, Integer.MAX_VALUE))
                        return this;
                    try
                    {
                        // downgrade the lock acquisition
                        compactionLock.readLock().lock();
                        compactionLock.writeLock().unlock();
                        try
                        {
                            return task.execute(executor);
                        }
                        finally
                        {
                            compactionLock.readLock().unlock();
                        }
                    }
                    finally
                    {
                        task.unmarkSSTables();
                    }
                }
            }
            finally
            {
                // we probably already downgraded
                if (compactionLock.writeLock().isHeldByCurrentThread())
                    compactionLock.writeLock().unlock();
            }
            return this;
        }
    };
    return executor.submit(callable);
}

To summarize:

  • the compaction write lock is acquired.

  • the cfStore object is checked again to make sure it is still valid.

  • the compaction strategy is retrieved from the cfStore object.

  • the SSTables are marked for compaction.

  • the task is executed on the CompactionExecutor.


Currently there are two types of compaction strategy in this version, SizeTieredCompactionStrategy and LeveledCompactionStrategy; the rest of this discussion is based on SizeTieredCompactionStrategy.

The real compaction work is done here.
public int execute(CompactionExecutorStatsCollector collector) throws IOException
{
    // The collection of sstables passed may be empty (but not null); even if
    // it is not empty, it may compact down to nothing if all rows are deleted.
    assert sstables != null;

    Set<SSTableReader> toCompact = new HashSet<SSTableReader>(sstables);
    if (!isCompactionInteresting(toCompact))
        return 0;

    if (compactionFileLocation == null)
        compactionFileLocation = cfs.table.getDataFileLocation(cfs.getExpectedCompactedFileSize(toCompact));
    if (partialCompactionsAcceptable())
    {
        // If the compaction file path is null that means we have no space left for this compaction.
        // Try again w/o the largest one.
        if (compactionFileLocation == null)
        {
            while (compactionFileLocation == null && toCompact.size() > 1)
            {
                logger.warn("insufficient space to compact all requested files " + StringUtils.join(toCompact, ", "));
                // Note that we have removed files that are still marked as compacting. This suboptimal but ok since the caller will unmark all
                // the sstables at the end.
                toCompact.remove(cfs.getMaxSizeFile(toCompact));
                compactionFileLocation = cfs.table.getDataFileLocation(cfs.getExpectedCompactedFileSize(toCompact));
            }
        }

        if (compactionFileLocation == null)
        {
            logger.warn("insufficient space to compact even the two smallest files, aborting");
            return 0;
        }
    }

    if (DatabaseDescriptor.isSnapshotBeforeCompaction())
        cfs.snapshotWithoutFlush(System.currentTimeMillis() + "-" + "compact-" + cfs.columnFamily);

    // sanity check: all sstables must belong to the same cfs
    for (SSTableReader sstable : toCompact)
        assert sstable.descriptor.cfname.equals(cfs.columnFamily);

    CompactionController controller = new CompactionController(cfs, toCompact, gcBefore, isUserDefined);
    // new sstables from flush can be added during a compaction, but only the compaction can remove them,
    // so in our single-threaded compaction world this is a valid way of determining if we're compacting
    // all the sstables (that existed when we started)
    logger.info("Compacting {}", toCompact);

    long startTime = System.currentTimeMillis();
    long totalkeysWritten = 0;

    long estimatedTotalKeys = Math.max(DatabaseDescriptor.getIndexInterval(), SSTableReader.getApproximateKeyCount(toCompact));
    long estimatedSSTables = Math.max(1, SSTable.getTotalBytes(toCompact) / cfs.getCompactionStrategy().getMaxSSTableSize());
    long keysPerSSTable = (long) Math.ceil((double) estimatedTotalKeys / estimatedSSTables);
    if (logger.isDebugEnabled())
        logger.debug("Expected bloom filter size : " + keysPerSSTable);

    AbstractCompactionIterable ci = DatabaseDescriptor.isMultithreadedCompaction()
                                  ? new ParallelCompactionIterable(OperationType.COMPACTION, toCompact, controller)
                                  : new CompactionIterable(OperationType.COMPACTION, toCompact, controller);
    CloseableIterator<AbstractCompactedRow> iter = ci.iterator();
    Iterator<AbstractCompactedRow> nni = Iterators.filter(iter, Predicates.notNull());
    Map<DecoratedKey, Long> cachedKeys = new HashMap<DecoratedKey, Long>();

    // we can't preheat until the tracker has been set. This doesn't happen until we tell the cfs to
    // replace the old entries. Track entries to preheat here until then.
    Map<SSTableReader, Map<DecoratedKey, Long>> cachedKeyMap = new HashMap<SSTableReader, Map<DecoratedKey, Long>>();

    Collection<SSTableReader> sstables = new ArrayList<SSTableReader>();
    Collection<SSTableWriter> writers = new ArrayList<SSTableWriter>();

    if (collector != null)
        collector.beginCompaction(ci);
    try
    {
        if (!nni.hasNext())
        {
            // don't mark compacted in the finally block, since if there _is_ nondeleted data,
            // we need to sync it (via closeAndOpen) first, so there is no period during which
            // a crash could cause data loss.
            cfs.markCompacted(toCompact);
            return 0;
        }

        SSTableWriter writer = cfs.createCompactionWriter(keysPerSSTable, compactionFileLocation, toCompact);
        writers.add(writer);
        while (nni.hasNext())
        {
            AbstractCompactedRow row = nni.next();
            if (row.isEmpty())
                continue;

            long position = writer.append(row);
            totalkeysWritten++;

            if (DatabaseDescriptor.getPreheatKeyCache())
            {
                for (SSTableReader sstable : toCompact)
                {
                    if (sstable.getCachedPosition(row.key, false) != null)
                    {
                        cachedKeys.put(row.key, position);
                        break;
                    }
                }
            }
            if (!nni.hasNext() || newSSTableSegmentThresholdReached(writer, position))
            {
                SSTableReader toIndex = writer.closeAndOpenReader(getMaxDataAge(toCompact));
                cachedKeyMap.put(toIndex, cachedKeys);
                sstables.add(toIndex);
                if (nni.hasNext())
                {
                    writer = cfs.createCompactionWriter(keysPerSSTable, compactionFileLocation, toCompact);
                    writers.add(writer);
                    cachedKeys = new HashMap<DecoratedKey, Long>();
                }
            }
        }
    }
    catch (Exception e)
    {
        for (SSTableWriter writer : writers)
            writer.abort();
        throw FBUtilities.unchecked(e);
    }
    finally
    {
        iter.close();
        if (collector != null)
            collector.finishCompaction(ci);
    }

    cfs.replaceCompactedSSTables(toCompact, sstables);
    // TODO: this doesn't belong here, it should be part of the reader to load when the tracker is wired up
    for (Entry<SSTableReader, Map<DecoratedKey, Long>> ssTableReaderMapEntry : cachedKeyMap.entrySet())
    {
        SSTableReader key = ssTableReaderMapEntry.getKey();
        for (Entry<DecoratedKey, Long> entry : ssTableReaderMapEntry.getValue().entrySet())
            key.cacheKey(entry.getKey(), entry.getValue());
    }

    long dTime = System.currentTimeMillis() - startTime;
    long startsize = SSTable.getTotalBytes(toCompact);
    long endsize = SSTable.getTotalBytes(sstables);
    double ratio = (double)endsize / (double)startsize;

    StringBuilder builder = new StringBuilder();
    builder.append("[");
    for (SSTableReader reader : sstables)
        builder.append(reader.getFilename()).append(",");
    builder.append("]");

    double mbps = dTime > 0 ? (double)endsize/(1024*1024)/((double)dTime/1000) : 0;
    logger.info(String.format("Compacted to %s. %,d to %,d (~%d%% of original) bytes for %,d keys at %fMB/s. Time: %,dms.",
                              builder.toString(), startsize, endsize, (int) (ratio * 100), totalkeysWritten, mbps, dTime));
    logger.debug(String.format("CF Total Bytes Compacted: %,d", CompactionTask.addToTotalBytesCompacted(endsize)));
    return toCompact.size();
}

That's a lot of work done in this method. :-) I summarize the important points below:

  • check if enough sstables are present to compact.

  • check if the disk space is sufficient for this compaction task.

  • snapshot before compaction happens (if snapshot_before_compaction is enabled).

  • check that the sstables to be compacted belong to the same column family.

  • the CompactionExecutorStatsCollector begins compaction with the AbstractCompactionIterable.

  • create a compaction writer.

  • replace the old sstables with the newly compacted sstables.


I hope you enjoy this write-up.

Sunday, January 26, 2014

Overwrite tmnet DNS in router DIR-615

In the past, I have read about how certain sites were "blocked" in Malaysia, leaving web surfers unable to load some websites. I don't know what all the fuss was about, but it certainly created a lot of talk on social websites. The term blocked would properly apply at the IP level: if access from a machine to an IP address a.b.c.d is unreachable, then that is really blocked. The funny thing is, some ISPs in Malaysia do not actually block the IPs; they merely make some hostnames unresolvable to an IP.

So in this article, I'm going to show you how to get your freedom back. There are many public DNS servers out there; just googling "public DNS" will give you plenty of lists. In fact, Google has a public DNS too. When you choose a DNS server, be sure it is publicly trustworthy, and remember that the latency from your machine to the DNS server matters if you want your web browser to resolve IP addresses quickly.

With that said, I'm going to show you how to overwrite the DNS entries in the tmnet router DIR-615. You need to ssh into your router in order to accomplish this. I have tried many times in the web interface; though it has a text box for you to specify the DNS servers, apparently you cannot set the list you want, it always defaults to the tmnet DNS. This is annoying!

What you need is a terminal with an ssh program installed, username = operator, password = h566UniFi.

After you have ssh'ed into the router, these are all the steps you need to do.
# cat /var/etc/resolv.conf
# This file is generated by gen_resolv.php
nameserver 1.9.1.9
nameserver 202.188.0.133

# sed 's/1.9.1.9/192.168.10.2/' -i /var/etc/resolv.conf
# sed 's/202.188.0.133/8.8.8.8/' -i /var/etc/resolv.conf
# cat resolv.conf
# This file is generated by gen_resolv.php
nameserver 192.168.10.2
nameserver 8.8.8.8

So I have an internal IP here because there is an internal DNS service running; 8.8.8.8 is the Google public DNS.

Well, that's all you need. For a site that previously could not be resolved to an IP, you can check by pinging the hostname; it should now return a valid IP. That's all folks, I hope you learned something and gained back your rights.

Tuesday, January 21, 2014

cassandra components on heap or off heap

A summary of the components in cassandra that are stored off-heap.

cassandra 1.2 and later use off-heap memory for:
  • bloom filter
  • compression offset map
  • row cache

cassandra 2.0 and later use off-heap memory for:
  • bloom filter
  • compression offset map
  • partition summary
  • index samples
  • row cache

Remember, off-heap is not the same thing as non-heap. Off-heap means the object is stored outside the java heap allocated to the jvm, that is, in native memory.
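
As a quick illustration of the difference in plain java (nothing cassandra specific): a direct ByteBuffer lives in native memory outside the jvm heap, while a normal array lives on the heap and is counted against -Xmx.

import java.nio.ByteBuffer;

public class OffHeapExample
{
    public static void main(String[] args)
    {
        // allocated on the java heap, visible to the garbage collector
        byte[] onHeap = new byte[64 * 1024 * 1024];

        // allocated in native memory, outside the java heap; only the small
        // ByteBuffer wrapper object itself lives on the heap
        ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024);

        System.out.println("heap array length   : " + onHeap.length);
        System.out.println("direct buffer bytes : " + offHeap.capacity());
    }
}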

For example, in the Bloom filter:
public static IAllocator newOffHeapAllocator(String offheap_allocator) throws ConfigurationException
{
    if (!offheap_allocator.contains("."))
        offheap_allocator = "org.apache.cassandra.io.util." + offheap_allocator;
    return FBUtilities.construct(offheap_allocator, "off-heap allocator");
}

As you can see, if you want to determine whether an object is stored on-heap or off-heap, you have to read the code. I tried to inspect cassandra 2.0.2 using jconsole, but it only shows the total usage of on-heap and off-heap memory. The cassandra community suggests it is better to read NEWS.txt or their blog for the latest changes on whether a component has moved on or off heap.

http://www.datastax.com/documentation/cassandra/2.0/webhelp/cassandra/operations/ops_tune_jvm_c.html
https://groups.google.com/forum/#!topic/nosql-databases/kFHS6QfZDuE
http://www.yourkit.com/docs/kb/sizes.jsp

Thursday, January 16, 2014

force sstable compaction through jconsole

Happy new year everyone, this is my first article for 2014 and as a start, it is going to be a short and sweet one. :-)

I was working on a project where I had deleted a column, and I remembered that, with tombstones, you need to run a compaction so that the tombstone is actually removed. When I ran nodetool compact, the message below appeared.
INFO 19:15:09,170 Nothing to compact in index1.  Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)

So what this means is that you need to have jconsole running, because forceUserDefinedCompaction can only be invoked through JMX (jconsole being the usual way). When jconsole is connected to your cassandra daemon process, you need to navigate to the CompactionManager MBean.

Then you need to provide two values to this method, as it expects two parameters: the keyspace and the sstables that you wish to compact. You can see how I did it in the attachment.
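
Strictly speaking, jconsole is just one JMX client; the same operation can be invoked programmatically. Below is a minimal sketch of such a call; the CompactionManager MBean name, the parameter order (keyspace first, then the sstable data file) and the example file name are assumptions, so double check them against what jconsole shows you.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UserDefinedCompaction
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ObjectName compactionManager = new ObjectName("org.apache.cassandra.db:type=CompactionManager");

            // the two parameters mentioned above: the keyspace and the sstable(s) to compact
            // (hypothetical example values; use your own keyspace and data file names)
            connection.invoke(compactionManager,
                              "forceUserDefinedCompaction",
                              new Object[] { "lucene1", "index1-hc-2-Data.db" },
                              new String[] { String.class.getName(), String.class.getName() });
        }
        finally
        {
            connector.close();
        }
    }
}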



Once the operation is complete, if you ls the data directory where the sstables are stored, you should see that a new sstable has been generated.

$ ls data/lucene1/
index1-hc-3-Data.db index1-hc-3-Digest.sha1 index1-hc-3-Filter.db index1-hc-3-Index.db index1-hc-3-Statistics.db snapshots

We have a production machine on which a single sstable is more than 25GB in size. When the compaction thresholds are met (default min 4 and max 32), I think the resulting compaction will put a heavy load on the server when it takes place, so it is probably best to run forceUserDefinedCompaction on this single large sstable instead.