
Friday, July 31, 2015

Attempting to understand java garbage collection statistics

If you have been developing large java applications, troubleshooting can at times go as deep as looking into the garbage collector while the application is running. Unfortunately, the statistics are overwhelming when you begin to investigate or try to understand them. At least for me it is pretty tedious, so if you come across this article and can help, please leave a comment.

There is very little documentation describing how these statistics should be interpreted. There is one post on an Oracle blog dated 2006, probably too old to be fully relevant, but it nonetheless analyzes the output line by line. More recent articles from Alexy Ragozin and Poonam Bajaj are worth a look too.

The GC statistics below can be reproduced by adding these parameters to the java command line: -XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1. A hypothetical full invocation is sketched right below, followed by snippets extracted from a production machine, which we will look at line by line.
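The heap sizes, GC log path and application jar here are placeholders, not the actual production settings:

java -Xms12g -Xmx12g -XX:+UseConcMarkSweepGC \
     -XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1 \
     -Xloggc:/var/log/myapp/gc.log -jar myapp.jar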

 Before GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 230400  
 Max  Chunk Size: 230400  
 Number of Blocks: 1  
 Av. Block Size: 230400  
 Tree   Height: 1  
 586945.492: [ParNew  
 Desired survivor size 41943040 bytes, new threshold 1 (max 1)  
 - age  1:  10038008 bytes,  10038008 total  
 : 660426K->10292K(737280K), 0.0353470 secs] 9424156K->8774094K(12500992K)After GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 127053189  
 Max  Chunk Size: 21404293  
 Number of Blocks: 125654  
 Av. Block Size: 1011  
 Tree   Height: 36  
   
   
   
 After GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 230400  
 Max  Chunk Size: 230400  
 Number of Blocks: 1  
 Av. Block Size: 230400  
 Tree   Height: 1  
 , 0.0359540 secs] [Times: user=0.26 sys=0.00, real=0.03 secs]   
 Heap after GC invocations=550778 (full 2090):  
  par new generation  total 737280K, used 10292K [0x00000004fae00000, 0x000000052ce00000, 0x000000052ce00000)  
  eden space 655360K,  0% used [0x00000004fae00000, 0x00000004fae00000, 0x0000000522e00000)  
  from space 81920K, 12% used [0x0000000522e00000, 0x000000052380d360, 0x0000000527e00000)  
  to  space 81920K,  0% used [0x0000000527e00000, 0x0000000527e00000, 0x000000052ce00000)  
  concurrent mark-sweep generation total 11763712K, used 8763801K [0x000000052ce00000, 0x00000007fae00000, 0x00000007fae00000)  
  concurrent-mark-sweep perm gen total 40952K, used 24563K [0x00000007fae00000, 0x00000007fd5fe000, 0x0000000800000000)  
 }  
 Total time for which application threads were stopped: 0.0675660 seconds  
 {Heap before GC invocations=550778 (full 2090):  
  par new generation  total 737280K, used 11677K [0x00000004fae00000, 0x000000052ce00000, 0x000000052ce00000)  
  eden space 655360K,  0% used [0x00000004fae00000, 0x00000004faf5a220, 0x0000000522e00000)  
  from space 81920K, 12% used [0x0000000522e00000, 0x000000052380d360, 0x0000000527e00000)  
  to  space 81920K,  0% used [0x0000000527e00000, 0x0000000527e00000, 0x000000052ce00000)  
  concurrent mark-sweep generation total 11763712K, used 8763801K [0x000000052ce00000, 0x00000007fae00000, 0x00000007fae00000)  
  concurrent-mark-sweep perm gen total 40952K, used 24563K [0x00000007fae00000, 0x00000007fd5fe000, 0x0000000800000000)  

We can summarize the statistics above with the following points.

* the statistics above are generated by Java HotSpot, and the source code that prints them can be found here https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/share/vm/gc_implementation/concurrentMarkSweep/binaryTreeDictionary.cpp#L1098-L1112

* there are two sets of free list statistics, before GC and after GC, and this is a minor (ParNew) collection, not a full GC.

* in the before-GC block, the max chunk size is equal to the total free space and there is only one block, so the free space is a single contiguous chunk, i.e. not fragmented.

* in the first after-GC block, the total free space is 127053189 while the max chunk size is only 21404293, spread over 125654 blocks, so that free list is fairly fragmented.

* the collection took 0.26 seconds of user CPU time and 0.03 seconds of real (wall clock) time.

* after GC, the survivor 'from' space is 12% used.

* after GC, the concurrent mark-sweep generation totals 11,763,712K with 8,763,801K used, whilst the concurrent mark-sweep permanent generation totals 40,952K with only 24,563K used.

* the total time for which application threads were stopped was 0.0675660 seconds.

So we can say this GC snippet looks good: it is not a full GC, no space fills up to 100%, and no failure or error appears anywhere. The total stop time is trivial too, well under a second.

That's it. If you think this analysis is wrong and/or can be improved upon, please leave your message below; I would like to learn more too.

Sunday, May 10, 2015

My journey and experience on upgrading apache cassandra from version 1.0.12 to 1.1.12

If you have read my previous post on upgrading apache cassandra, this is another journey, this time a major upgrade of apache cassandra from version 1.0 to 1.1. In this article, I will share my experience of upgrading cassandra from version 1.0.12 to 1.1.12.

The sstable version used by cassandra 1.0.12 is hd, and you should ensure that the sstables on all nodes are at version hd before proceeding to upgrade to a newer version of cassandra. A quick way to check is sketched below.
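For example, the sstable version is embedded in the data file names, so a hypothetical check could be the following (the data directory path is an assumption, adjust it to your environment):

# any file listed here is not yet at sstable version hd
ls /var/lib/cassandra/data/*/*-Data.db | grep -v -e '-hd-'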

First, let's read some highlights of cassandra 1.1:

  • api version 19.33.0

  • new file cassandra-rackdc.properties, commitlog_archiving.properties

  • new directory structure and new file naming for sstables.

  • more features/improvements to nodetool, such as compactionstats showing the estimated remaining time, calculating the exact size required for cleanup operations, the ability to stop compaction, rangekeysample, getsstables, repair printing progress, etc.

  • global key and row cache.

  • cql 3.0 beta

  • schema change in cassandra for caching: caching is now configured per column family as a schema attribute.

  • libthrift version 0.7.0.

  • sstable hf version.

  • the default compressor becomes the snappy compressor.

  • a lot of improvements to the leveled compaction strategy.

  • sliced_buffer_size_in_kb option has been removed from the cassandra.yaml configuration file (this option was a no-op since 1.0).

  • thread stack size increased to 160k

  • added flag UseTLAB for jvm to improve read speed.

As this is a newer version of cassandra compared to the previous one, it is always good to set up a test node so you can play around and get familiar with it before actually doing the upgrade. With this new node, you can also quickly test your application clients which write to and/or read from the test cassandra node. It is also recommended to do some load testing to see whether the results are what you expect.

If you want to be extremely careful with the upgrade, then reading the code changes between the versions you chose to upgrade between is always recommended. This is the link for this upgrade, and since there are huge differences between the two versions, you should split the diff into pieces as small as possible to read through it. You can learn a lot from experienced coders if you spend time reading their code, and you can pick up new technology too. It is a dauntingly huge task, but if you are willing to spend some time reading, the returns are well worth it.

If you upgrade from 1.0.12 to 1.1.12, cassandra 1.1 is smart enough to move the sstables into the new directory structure. This eases your job: you do not need to move the sstables yourself, because when the new cassandra 1.1.12 instance starts up, it will move them for you.

So you might want to consider preparing the configuration files for your cluster environment beforehand, for example cassandra.yaml, cassandra-env.sh and cassandra.in.sh. By doing this, you decrease the duration of the upgrade process and reduce errors, since you are not editing files by hand but letting an upgrade script symlink them for you. So spend some time writing upgrade and downgrade scripts for the production cluster and test them.

Because the upgrade process will take time (a long time, depending on how many nodes you have in the cluster) and it will tire you out (remember, there will be post-upgrade issues to deal with), I suggest you create an upgrade script to handle the process. The cassandra configuration you prepared earlier will be symlinked automatically by this script. When you do this, you reduce risks such as human error; for a production cluster, you do not want to risk anything, or at least you want to cut the risk to a minimum.

There is official upgrade documentation at datastax, but because your cluster environment might be different, you might want to write your own upgrade steps based on the official documentation and have them peer reviewed so you cover as much as possible. It is best if your peers test them and raise questions which you might not have thought of.

If you are using a monitoring system such as opscenter, spm, jconsole, or your own monitoring system, you want to check whether it supports the newer version of cassandra.

The per column family key cache and row cache have been replaced with a global key cache and a global row cache respectively. These global cache settings can be found in the cassandra.yaml file; if you leave it at the default, you get a key cache of about one million keys. An illustrative cassandra.yaml fragment follows the two lists below. These are some of the new parameters in cassandra 1.1:

  • populate_io_cache_on_flush

  • key_cache_size_in_mb

  • key_cache_save_period

  • row_cache_size_in_mb

  • row_cache_save_period

  • row_cache_provider

  • commitlog_segment_size_in_mb

  • trickle_fsync

  • trickle_fsync_interval_in_kb

  • internode_authenticator

and below are the configurations that were removed:

  • sliced_buffer_size_in_kb

  • thrift_max_message_length_in_mb
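As an illustration of the new global cache settings, a cassandra.yaml fragment might look like the following; the values are examples only, not recommendations, so size them for your own heap and workload:

# illustrative values only
key_cache_size_in_mb: 100
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
row_cache_provider: SerializingCacheProvider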

For the upgrade in production, the following steps were taken:

Pre-upgrade, applied to all nodes in the cluster:
* stop any repair or cleanup on all cassandra nodes and make sure no streaming is happening. Streaming happens when a node bootstraps or when you rebuild a node.

Upgrade steps (a bash sketch of such an upgrade script follows this list):
1. download cassandra 1.1.12 and verify binary is not corrupted.
2. extract the compressed tarball.
3. nodetool snapshot.
4. nodetool drain.
5. stop cassandra if it is not already stopped.
6. symlink new configuration files.
7. start cassandra 1.1.12
8. monitor cassandra system.log
9. check monitoring system.
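A minimal sketch of such an upgrade script is shown below, assuming a tarball install under /opt and a prepared configuration directory; the paths, pid file and start command are assumptions, so adapt them to how your nodes are managed.

#!/bin/bash
# sketch only: upgrade one node from cassandra 1.0.12 to 1.1.12
set -e

NEW_VERSION=1.1.12
INSTALL_DIR=/opt                          # assumption
CONF_DIR=/opt/cassandra-conf-1.1.12       # prepared cassandra.yaml, cassandra-env.sh, etc.

tar -xzf apache-cassandra-$NEW_VERSION-bin.tar.gz -C $INSTALL_DIR

nodetool -h localhost snapshot            # backup current sstables
nodetool -h localhost drain               # flush memtables and stop accepting writes
kill $(cat /var/run/cassandra.pid)        # stop the old instance, if still running

# symlink the prepared configuration into the new install
ln -sfn $CONF_DIR/cassandra.yaml   $INSTALL_DIR/apache-cassandra-$NEW_VERSION/conf/cassandra.yaml
ln -sfn $CONF_DIR/cassandra-env.sh $INSTALL_DIR/apache-cassandra-$NEW_VERSION/conf/cassandra-env.sh

# start the new version, then tail system.log and watch your monitoring system
$INSTALL_DIR/apache-cassandra-$NEW_VERSION/bin/cassandra -p /var/run/cassandra.pid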

If everything looks okay on the first node, it is best to do two nodes next, then continue with the rest of the nodes in a rolling upgrade fashion. After you migrate, you might also notice there are 3 additional column families in cassandra 1.1.

cassandra 1.0 system keyspace has a total of 7 column families

  • HintsColumnFamily

  • IndexInfo

  • LocationInfo

  • Migrations

  • NodeIdInfo

  • Schema

  • Versions

cassandra 1.1 system keyspace has a total of 10 column families.

  • HintsColumnFamily

  • IndexInfo

  • LocationInfo

  • Migrations

  • NodeIdInfo

  • Schema

  • schema_columnfamilies

  • schema_columns

  • schema_keyspaces

  • Versions

If you are using the leveled compaction strategy, those sstables need to be scrubbed accordingly; there are nodetool scrub and the offline sstablescrub tool for this job. If you have defined column families using the counter type, you should upgrade the sstables using nodetool upgradesstables. Hypothetical examples are shown below.
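For example (keyspace and column family names are placeholders):

nodetool -h localhost scrub MyKeyspace MyColumnFamily
nodetool -h localhost upgradesstables MyKeyspace MyCounterColumnFamily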

That's it. If you need professional services for this, please contact me and I will gladly provide professional advice and/or services.

Saturday, May 9, 2015

Light walkthrough on Java Execution Time Measurement Library (JETM)

Today, let's learn about a java library, the Java Execution Time Measurement Library, or JETM. What is JETM?

From the official site
A small and free library, that helps locating performance problems in existing Java applications.

 

JETM enables developers to track down performance issues on demand, either programmatic or declarative with minimal impact on application performance, even in production.

jetm is pretty cool and has a lot of features.

You can follow the tutorial trail here. The following code is taken from one of the tutorials with minor modifications.
// imports assume the JETM 1.2.x package layout
import etm.core.configuration.BasicEtmConfigurator;
import etm.core.configuration.EtmManager;
import etm.core.monitor.EtmMonitor;
import etm.core.monitor.EtmPoint;
import etm.core.renderer.SimpleTextRenderer;

public class BusinessService {

    private static final EtmMonitor etmMonitor = EtmManager.getEtmMonitor();

    public void someMethod() {
        // measure this method, including the nested call below
        EtmPoint point = etmMonitor.createPoint("BusinessService:someMethod");

        try {
            Thread.sleep((long) (10d * Math.random()));
            nestedMethod();
        } catch (InterruptedException e) {

        } finally {
            point.collect();
        }
    }

    public void nestedMethod() {
        EtmPoint point = etmMonitor.createPoint("BusinessService:nestedMethod");

        try {
            Thread.sleep((long) (15d * Math.random()));
        } catch (InterruptedException e) {

        } finally {
            point.collect();
        }
    }

    public static void main(String[] args) {
        BasicEtmConfigurator.configure(true);
        etmMonitor.start();

        BusinessService bizz = new BusinessService();
        bizz.someMethod();
        bizz.someMethod();
        bizz.someMethod();
        bizz.someMethod();
        bizz.nestedMethod();

        etmMonitor.render(new SimpleTextRenderer());
        etmMonitor.stop();
    }
}

Hit the run button in eclipse.
EtmMonitor info [INFO] JETM 1.2.3 started.
|--------------------------------|---|---------|-------|--------|--------|
| Measurement Point | # | Average | Min | Max | Total |
|--------------------------------|---|---------|-------|--------|--------|
| BusinessService:nestedMethod | 1 | 4.121 | 4.121 | 4.121 | 4.121 |
|--------------------------------|---|---------|-------|--------|--------|
| BusinessService:someMethod | 4 | 12.611 | 6.196 | 16.347 | 50.442 |
| BusinessService:nestedMethod | 4 | 5.381 | 0.017 | 10.194 | 21.523 |
|--------------------------------|---|---------|-------|--------|--------|
EtmMonitor info [INFO] Shutting down JETM.

So we see that nestedMethod executed once at the top level, and someMethod executed four times, each of which also called nestedMethod. The result shows the minimum, maximum and average execution time, and the last column shows the total. Pretty neat for a small java library.

 

Friday, May 8, 2015

Elasticsearch no node exception happened in tomcat web container

If you ever get a stack trace like the one below in your web container log file and wonder how to solve it, read on. But first, a little background: an elasticsearch 0.90 cluster, and a client running in a tomcat web container using the elasticsearch java transport client. Both server and client run the same elasticsearch version and the same java version.
16.Feb 6:21:30,830 ERROR WebAppTransportClient [put]: error
org.elasticsearch.client.transport.NoNodeAvailableException: No node available
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:212)
at org.elasticsearch.client.transport.support.InternalTransportClient.execute(InternalTransportClient.java:106)
at org.elasticsearch.client.support.AbstractClient.index(AbstractClient.java:84)
at org.elasticsearch.client.transport.TransportClient.index(TransportClient.java:316)
at org.elasticsearch.action.index.IndexRequestBuilder.doExecute(IndexRequestBuilder.java:324)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:258)
at com.example.elasticsearch.WebAppTransportClient.put(WebAppTransportClient.java:307)
at com.example.threadpool.TaskThread.run(TaskThread.java:38)
at java.lang.Thread.run(Thread.java:662)

This exception disappears once the web container is restarted, but restarting the webapp that often is not a good solution in production. I did some research online and gathered the following pieces of information:

* The default number of channels in each of these class are configured with the configuration prefix of transport.connections_per_node.
https://www.found.no/foundation/elasticsearch-networking/

* If you see NoNodeAvailableException you may have hit a connect timeout of the client. Connect timeout is 30 secs IIRC.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/VyNpCs17aTA/CcXkYvVMYWAJ

* You can set org.elasticsearch.client.transport to TRACE level in your logging configuration (on the client side) to see the failures it has (to connect for example). For more information, you can turn on logging on org.elasticsearch.client.transport.
https://groups.google.com/forum/#!topic/elasticsearch/Mt2x4d5BCGI

* This means that you started to get disconnections between the client (transport) and the server. It will try and reconnect automatically, and possibly manages to do it. For more information, you can turn on logging on org.elasticsearch.client.transport.
* Can you try and increase the timeout and see how it goes? Set client.transport.ping_timeout in the settings you pass to the TransportClient to 10s for example.
* We had the same problem. reason: The application server uses a older version of log4j than ES needed.
http://elasticsearch-users.115913.n3.nabble.com/No-node-available-Exception-td3920119.html

* The correct method is to add the known host addresses with addTransportAddresses() and afterwards check the connectedNodes() method. If it returns empty list, no nodes could be found.
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/elasticsearch/ceH3UIy14jM/XJSFKd8kAXEJ

* the most common case for NoNodeAvailable is the regular pinging that the transport client does fails to do it, so no nodes end up as the list of nodes that the transport client uses. If you will set client.transport (or org.elasticsearch.client.transport if running embedded) to TRACE, you will see the pinging effort and if it failed or not (and the reason for the failures). This might get us further into trying to understand why it happens.
* .put("client.transport.ping_timeout", pingTimeout)
* .put("client.transport.nodes_sampler_interval", pingSamplerInterval).build();
https://groups.google.com/forum/#!msg/elasticsearch/9aSkB0AVrHU/_4kDkjAFKuYJ

* this has nothing to do with migration errors. Your JVM performs a very long GC of 9 seconds which exceeds the default ping timeout of 5 seconds, so ES dropped the connection ,assuming your JVM is just too busy. Try again if you can reproduce it. If yes, increase the timeout to something like 10 seconds, or consider to update your Java version.
http://elasticsearch-users.115913.n3.nabble.com/Migration-errors-0-20-1-to-0-90-td4035165.html

* During long GC the JVM is somehow suspended. So your client can not see it anymore.
http://grokbase.com/t/gg/elasticsearch/136fw0hppp/transport-client-ping-timeout-no-node-available-exception

* You wrote that you have a 0.90.9 cluster but you added 0.90.0 jars to the client. Is that correct?
* Please check:
*
* if your cluster nodes and client node is using exactly the same JVM
* if your cluster and client use exactly the same ES version
* if your cluster and client use the same cluster name
* reasons outside ES: IP blocking, network reachability, network interfaces, IPv4/IPv6 etc.
* Then you should be able to connect with TransportClient.

https://groups.google.com/forum/#!msg/elasticsearch/fYmKjGywe8o/z9Ci5L5WjUAJ

So I tried all the options mentioned, and the problem was solved by adding sniff to the transport client settings; a hypothetical sketch of such a client is shown below. For more information, read here.
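For illustration, here is a minimal sketch of a transport client with sniffing enabled against the 0.90-era java API; the cluster name, host and timeout value are assumptions for the example.

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class SniffingClientFactory {

    public static TransportClient create() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")           // assumption: your cluster name
                .put("client.transport.sniff", true)         // discover the other nodes from the seed node
                .put("client.transport.ping_timeout", "10s") // more tolerant of long GC pauses
                .build();

        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-host-1", 9300));
    }
}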

I hope this will solve your problem too.

Saturday, April 25, 2015

My way of solving tomcat memory leaking issue

Recently, I made a mistake by accidentally committing some careless code that stored objects in a static field, and it went into production, causing heap usage to grow tremendously. Since objects referenced from a static field stay around for as long as the class itself, tomcat had to be restarted often to free up the heap being held. So today, I will share my experience of how I solved it, and I hope it gives you a way to approach this difficult problem.
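To illustrate the kind of mistake, here is a contrived sketch (not the actual production code): a static collection lives as long as the class, so everything added to it can never be garbage collected.

import java.util.ArrayList;
import java.util.List;

public class ReportService {

    // BUG: this static field keeps a strong reference to every report ever built,
    // so the heap keeps growing until tomcat is restarted
    private static final List<byte[]> REPORT_CACHE = new ArrayList<byte[]>();

    public byte[] buildReport() {
        byte[] report = new byte[1024 * 1024]; // pretend this is a generated report
        REPORT_CACHE.add(report);              // added but never removed -> leak
        return report;
    }
}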
First, from start to end, I will summarize the sequence you need to follow to investigate and find the fix.

* CHECK YOUR CODE.
* learn on how to find the memory leak using google.
* one step at a time to trace until you successfully pin down the problem and fix it.

As you can read, there are only three general steps, but I will talk more about each of them below.
CHECK YOUR CODE.

Always check your code by reading and testing! It is best if you have someone experienced to whom you can send your code for inspection. Remember, four eyeballs and two brains are better than two eyeballs and one brain. If you are using an open source project, most probably the library is well tested and you should spend the time investigating your own code instead. It is difficult, especially for a new programmer, but that should not stop you from finding the problem. If you still cannot find it, then start searching for how other people have solved it.
learn on how to find the memory leak using google.
Nobody is perfect or knows everything, so if you are unsure, always google away. Use keywords such as java memory leak, tomcat memory leak, or even best java coding practice. Pay attention to the first 10 links returned by google and read the blog posts or even stackoverflow threads; they will give you knowledge you never knew of. Examples of tools you will need include jstat, jmap, jhat and visualvm, which can give you an idea of what, or even where, the problem might come from; hypothetical command examples are shown below. Remember, reading this material is a way of growing and it takes time, so please be patient at this step, make sure you spend an adequate amount of time, and jot down the important points so you can use them in the final step.
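For example, two hypothetical invocations (the process id and file paths are placeholders): dump the live heap of the running tomcat with jmap, then browse the dump with jhat.

jmap -dump:live,format=b,file=/tmp/tomcat-heap.hprof 12345
jhat -J-Xmx4g /tmp/tomcat-heap.hprof    # then browse http://localhost:7000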

one step at a time to trace until you successfully pin down the problem and fix it.
The final step is essentially repeating step 1 and step 2 slowly to determine the root cause. If you are using a versioning system, you should find the last known good revision and check file by file where the problem was introduced. This is a TEDIOUS and DAUNTING process, but it is effective at finding the root cause.
I used these steps myself while tracking down the tomcat web application memory problem. Thank you, and I hope you can benefit too.

Friday, April 24, 2015

Learning java jstat

Today, we are going to learn a java tool which is incredibly useful if you frequently code java applications. This tool is a monitoring tool known as jstat, and it comes with the jdk. You might ask: why would I need to use jstat, my app runs just fine? For a simple java application, yes, you do not need this monitoring tool. However, if you have a long running application or a big java codebase, and sometimes your java application hangs (pauses/freezes) midway through a run, then you should really start to look into this tool. In this article, I'm going to show you how I use it.

But first, let's understand what jstat is.
The jstat tool displays performance statistics for an instrumented HotSpot Java virtual machine (JVM).

As you are aware, objects that you create in code will eventually be freed from the heap when they are no longer referenced. If you have a lot of objects and heap usage grows, then you can use this monitoring tool to check what is going on with the heap allocation. Okay now, let's look at the command syntax.
jstat [ generalOption | outputOptions vmid [interval[s|ms] [count]] ]

So it is pretty simple: the jstat command followed by a few parameters. The parameters are explained below. You can find the official documentation here.

generalOption
A single general command-line option (-help or -options)

outputOptions
One or more output options, consisting of a single statOption, plus any of the -t, -h, and -J options.

vmid
Virtual machine identifier, a string indicating the target Java virtual machine (JVM). The general syntax is
[protocol:][//]lvmid[@hostname[:port]/servername]
The syntax of the vmid string largely corresponds to the syntax of a URI. The vmid can vary from a simple integer representing a local JVM to a more complex construction
specifying a communications protocol, port number, and other implementation-specific values. See Virtual Machine Identifier for details.

interval[s|ms]
Sampling interval in the specified units, seconds (s) or milliseconds (ms). Default units are milliseconds. Must be a positive integer. If specified, jstat will produce its
output at each interval.

count
Number of samples to display. Default value is infinity; that is, jstat displays statistics until the target JVM terminates or the jstat command is terminated. Must be a
positive integer.

It should be very clear to you if you are a seasoned java coder; if not, take a look at the example below.
[iser@localhost ~]$ jstat -gcutil 12345 1s
S0 S1 E O P YGC YGCT FGC FGCT GCT
10.08 0.00 70.70 69.22 59.49 122328 4380.327 355 43.146 4423.474
10.08 0.00 84.99 69.22 59.49 122328 4380.327 355 43.146 4423.474
0.00 15.62 0.00 69.24 59.49 122329 4380.351 355 43.146 4423.497

So jstat instruments a local jvm with process id 12345 at an interval of 1 second, looping indefinitely. Different types of statistics can be shown; in the example above it shows a summary of garbage collection statistics. If you want to show different types of gc statistics, you can use the command jstat -options, and below is a table summarizing what each option displays.
Option 	                Displays...
class Statistics on the behavior of the class loader.
compiler Statistics of the behavior of the HotSpot Just-in-Time compiler.
gc Statistics of the behavior of the garbage collected heap.
gccapacity Statistics of the capacities of the generations and their corresponding spaces.
gccause Summary of garbage collection statistics (same as -gcutil), with the cause of the last and current (if applicable) garbage collection events.
gcnew Statistics of the behavior of the new generation.
gcnewcapacity Statistics of the sizes of the new generations and its corresponding spaces.
gcold Statistics of the behavior of the old and permanent generations.
gcoldcapacity Statistics of the sizes of the old generation.
gcpermcapacity Statistics of the sizes of the permanent generation.
gcutil Summary of garbage collection statistics.
printcompilation HotSpot compilation method statistics.

Out of all these options, the ones you will probably use most frequently are gcutil, gc and gccapacity. We will look at each of them with an example. Please note that in order to protect the privacy of the user, some information has been removed, but what needs to be presented in this article remains as is.

option gcutil

(screenshot: output of jstat -gcutil)

As can be seen above, the command is jstat with the gcutil option on a java process with id 23483. The statistics are generated at an interval of 1 second. The output has 10 columns, and these columns are explained below.
Column 	Description
S0 Survivor space 0 utilization as a percentage of the space's current capacity.
S1 Survivor space 1 utilization as a percentage of the space's current capacity.
E Eden space utilization as a percentage of the space's current capacity.
O Old space utilization as a percentage of the space's current capacity.
P Permanent space utilization as a percentage of the space's current capacity.
YGC Number of young generation GC events.
YGCT Young generation garbage collection time.
FGC Number of full GC events.
FGCT Full garbage collection time.
GCT Total garbage collection time.

The first five columns depict space utilization as a percentage. The next five depict the number of young generation collections and their time, the number of full garbage collections and their time, and lastly the total garbage collection time. With this screen capture, we see that the eden space fills up quickly and objects get promoted to either survivor space 0 or survivor space 1. At one point, some objects survived and were eventually promoted to the old space, increasing its usage by 0.01% to 5.24%. Note also that YGC increased by one as a result, to 256; this young generation collection took 13 milliseconds. A similar pattern happens again later, and we see that YGC increased by one to 257 with another 13 milliseconds of collection time. In this output there is no change in the full collection count, which is good: only one full collection happened, but with a pause of 94 milliseconds! You might want to keep an eye on the E column so it does not fill up too quickly, and adjust the young generation size in your java app accordingly (an illustrative set of sizing flags is shown below). But for a long term solution, you might want to spend some time finding out which code takes a lot of resources and improve it.
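For example, the young generation can be sized explicitly with flags like these; the numbers are purely illustrative and should be tuned against your own measurements.

java -Xms14g -Xmx14g -Xmn1g -jar myapp.jar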

option gc

(screenshot: output of jstat -gc)

As can be seen above, the command is jstat with the gc option on a java process with id 28276. The statistics are generated at an interval of 1 second. The output has 15 columns, and these columns are explained below.
Column 	Description
S0C Current survivor space 0 capacity (KB).
S1C Current survivor space 1 capacity (KB).
S0U Survivor space 0 utilization (KB).
S1U Survivor space 1 utilization (KB).
EC Current eden space capacity (KB).
EU Eden space utilization (KB).
OC Current old space capacity (KB).
OU Old space utilization (KB).
PC Current permanent space capacity (KB).
PU Permanent space utilization (KB).
YGC Number of young generation GC Events.
YGCT Young generation garbage collection time.
FGC Number of full GC events.
FGCT Full garbage collection time.
GCT Total garbage collection time.

These statistics show the capacities in kilobytes. The first ten columns are pretty easy: each space's current capacity and its utilization. The last five columns are the same as the last five columns of gcutil. Notice that when the EU value gets close to the EC value, a young generation collection happens and objects are promoted to the survivor spaces. Notice also that the OU column grows gradually. These statistics are almost the same as gcutil, except that here they are displayed in kilobytes whereas the gcutil statistics are displayed as percentages.

option gccapacity

(screenshot: output of jstat -gccapacity)

As can be seen above, the command is jstat with the gccapacity option on a java process with id 13080. The statistics are generated at an interval of 1 second. The output has 16 columns, and these columns are explained below.
Column 	Description
NGCMN Minimum new generation capacity (KB).
NGCMX Maximum new generation capacity (KB).
NGC Current new generation capacity (KB).
S0C Current survivor space 0 capacity (KB).
S1C Current survivor space 1 capacity (KB).
EC Current eden space capacity (KB).
OGCMN Minimum old generation capacity (KB).
OGCMX Maximum old generation capacity (KB).
OGC Current old generation capacity (KB).
OC Current old space capacity (KB).
PGCMN Minimum permanent generation capacity (KB).
PGCMX Maximum Permanent generation capacity (KB).
PGC Current Permanent generation capacity (KB).
PC Current Permanent space capacity (KB).
YGC Number of Young generation GC Events.
FGC Number of Full GC Events.

This output is similar to the output of option gc, but with the minimum and maximum capacities of the individual heap generations.

That's it for this article. I will leave three links below for your reference.

http://www.cubrid.org/blog/dev-platform/how-to-monitor-java-garbage-collection/
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html
http://oracle-base.com/articles/misc/monitoring-java-garbage-collection-using-jstat.php

 

Sunday, March 29, 2015

My journey and experience on upgrading apache cassandra 1.0.8 to 1.0.12

At the request of a blog reader, today I will share my experience of upgrading apache cassandra from version 1.0.8 to 1.0.12 on a live production cluster. By sharing this information, I hope that if you also run and/or administer a cassandra cluster, you can learn from my experience and ease your worry or pain.

First, let's lay out the current architecture of this environment.

  • java 6

  • 12 nodes cluster.

  • two spinning disks in raid 0, 32GB total system memory of which 14GB is allocated to the cassandra heap, with 800MB for the young generation. quad core cpu.

  • pretty much stock cassandra.yaml configuration with the following differences: concurrent_writes set to 64, flush_largest_memtables_at set to 0.8, compaction_throughput_mb_per_sec set to 64.

  • average load per node of 500-550GB.


As you can see, this is a pretty ancient cassandra we are using as of this time of writing, but because cassandra has been rock solid serving read/write requests for years, it has stayed in this stable condition, and we leverage the benefit of scaling out, having added nodes from six to nine and eventually to twelve now. Disk failures do happen on the nodes of the cluster, but because cassandra is designed with no single point of failure in mind, we can afford to lose a single node from operation while replacing it. Those were a few of the reasons we stayed with cassandra 1.0 for quite some time.

Because we would probably like to go to cassandra 2.0 and beyond, and java 6 has been EOL for quite some time, it would be wise to upgrade java before cassandra. Because systems are integrated like an ecosystem, it would also be wise to look at the java version used by the client systems that send read/write requests to the cassandra cluster. So make a checklist, brainstorming which clients integrate with the cluster, and then check what stable java 7 releases are currently available. Example:

cassandra 1.0 (cassandra-1.0.12): java minimum 6 and above.
https://github.com/apache/cassandra/tree/cassandra-1.0.12

hector client uses cassandra 2.0.4, so java 7 minimum
https://github.com/hector-client/hector/blob/master/pom.xml

datastax cql driver uses cassandra 2.1.2, so java 7 minimum
https://github.com/datastax/java-driver/blob/2.1/pom.xml

java 7 update release note
http://www.oracle.com/technetwork/java/javase/7u-relnotes-515228.html

features and enhancement
http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html

java 7 in wiki http://en.wikipedia.org/wiki/Java_version_history#Java_SE_7_.28July_28.2C_2011.29

unicode
before upgrading, check whether the data in cassandra would be affected by the different unicode version supported by the new java http://www.herongyang.com/Unicode/Java-Unicode-Version-Supported-in-Java-History.html
http://docs.oracle.com/javase/7/docs/technotes/guides/intl/enhancements.7.html
Early versions of the Java SE 7 release added support for Unicode 5.1.0. The final version of the Java SE 7 release supports Unicode 6.0.0. Unicode 6.0.0 is a major version of the Unicode Standard and adds support for over 2000 additional characters, as well as support for properties and data files.

As of the time of checking, we picked java 7 update 72. Upgrading from java 6 to java 7 update 72 under cassandra 1.0.8 is a painless process, just time consuming, since the load per node is huge and there are many nodes in the cluster. I followed these steps for the java upgrade on the cassandra nodes.

upgrade java for all cassandra nodes (a sketch of the configuration edits in step 1 follows this list)
1. write a script to automatically install java 7 on the node, update the java thread stack size to 256k in cassandra-env.sh, and set JAVA_HOME in cassandra.in.sh to java 7.
2. execute the script in rolling fashion for all the nodes, upgrading one node at a time:
3. stop cassandra.
4. execute the script.
5. start the cassandra instance.
6.0 monitor after the node is up, and then check the monitoring system after the node has been up for 30 minutes, 60 minutes, 1 hour and 2 hours.
6.1 check that your clients can read/write to that one upgraded node.
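A sketch of those configuration edits, assuming the jdk is installed at the path shown (adjust it to where your script actually installs java 7):

# cassandra-env.sh: bump the per-thread stack size
JVM_OPTS="$JVM_OPTS -Xss256k"

# cassandra.in.sh: point cassandra at the new jdk
JAVA_HOME=/usr/lib/jvm/jdk1.7.0_72
export JAVA_HOME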

By now, you can proceed to the next node in the ring, and you can skip step 6.0 once you are sure that it is going to work. One thing I observed is that the gc duration for cassandra on java 7 is down by half compared to java 6! Faster gc could mean more cpu cycles to process other tasks and less stop-the-world time for the cassandra instance.

Let the cluster run with java 7 for a day or two, and if everything is okay, continue to the cassandra upgrade. So which cassandra version should you upgrade to? There are several guidelines I followed.

1. choose ONLY STABLE release for production cluster. How to choose? You should read this link.
2. read NEWS.txt and CHANGES.txt. From time to time, changes to the code base may affect, for example, the sstable format, so pay attention, especially across cassandra major upgrades.
3. read the code difference between the versions you decided to upgrade between, for example for this upgrade: https://github.com/apache/cassandra/compare/cassandra-1.0.8...cassandra-1.0.12
4. read the datastax upgrade notes for the minor version.

I spent a lot of time on step 3; by reading the code differences you realize what has been changed and/or added, and can consider how it will impact your cassandra environment. In order to later upgrade further to cassandra 1.1, you will need to upgrade to the latest version of the series currently deployed. Example here. Once you have read through the above checkpoints, you may have a lot of questions and TODOs, and that will create further work. In the next step, it is best to work through the questions and TODOs you have and verify them on a test cluster before applying anything to the production cluster.

For me, I wrote a few bash scripts, for example the java upgrade mentioned above. I also wrote a script to install a test cluster for the cassandra upgrade. Remember to also write a script to snapshot the data directory using nodetool, and a script to automatically downgrade. When something goes wrong, you can revert using the automatic downgrade script and the backup taken with nodetool snapshot. You will also need to save the configuration files, for example cassandra.in.sh, cassandra-env.sh, cassandra.yaml, or any others in your cluster environment.

With these scripts written and tested, it is best to get acknowledgement from management before proceeding, and it would also be best to have someone who also administers a cassandra cluster with you, just for the good and bad moments. ;-) You can also reach me via the follow button on the home page. :)

upgrade cassandra from 1.0.8 to 1.0.12

  1. stop repair and cleanup in all nodes in the cluster.

  2. write a script to automate the upgrade so you do not panic or waste time, and stay composed during the node upgrade. Trust me, it saves you a lot of time and keeps the process free of human error. The script's contents could be the following:
    - download cassandra 1.0.12, extract it, fix file permissions, etc.
    - back up the current cassandra 1.0.8 using nodetool snapshot. Make sure you name the snapshot something like MyKeyspace-1.0.8-date.
    - drain the node.
    - stop cassandra if it is not yet stopped.
    - update cassandra 1.0.12 with your cluster settings.

  3. check the configuration changed and then start cassandra 1.0.12 new instance.

  4. monitor after the node is up and then check the monitoring system after node elapsed for 30minutes, 60minutes, 1hours and 2hours.

  5. check your client can read/write to that one upgraded node.


By now, you can proceed to the next node in the ring, and you can skip step 4 once you are sure that it is going to work. As the cassandra sstable version changed in 1.0.10 from hc to hd, it is best that all sstables on all nodes are at the hd version before performing the next major upgrade.

That's it for this article. While it may not cover everything and may contain mistakes, if you want to comment, please leave your comment below.

Saturday, March 28, 2015

Investigate into apache cassandra corrupt sstable exception

Today, we will take a look at another apache cassandra 1.0.8 exception. An example of the stack trace is below.
ERROR [SSTableBatchOpen:2] 2015-03-07 06:11:58,544 SSTableReader.java (line 228) Corrupt sstable /var/lib/cassandra/data/MySuperKeyspace/MyColumnFamily-hc-6681=[Index.db, Statistics.db, CompressionInfo.db, Filter.db, Data.db]; skipped
java.io.IOException: Input/output error
at java.io.RandomAccessFile.readBytes0(Native Method)
at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
at org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:128)
at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:302)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:324)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:393)
at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:375)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:186)
at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Before we go into the code base for this stack trace: I had no idea what this was about, and it showed up when the cassandra 1.0.12 instance was booting up. The last thing I remember is that I triggered a user defined compaction twice in cassandra 1.0.8 using the same sstables, and after the first compaction was done, this sstable stayed around forever... for two weeks plus. Then we upgraded cassandra.

Enough said, let's go into the code base and understand what is really meant by a corrupt sstable. The bottom of the stack trace is pretty obvious: a ThreadPoolExecutor executes a future task's run method. Then we are in the apache cassandra codebase, as can be read below in class SSTableReader, method batchOpen(); code snippet:
    public static Collection<SSTableReader> batchOpen(Set<Map.Entry<Descriptor, Set<Component>>> entries,
final Set<DecoratedKey> savedKeys,
final DataTracker tracker,
final CFMetaData metadata,
final IPartitioner partitioner)
{
final Collection<SSTableReader> sstables = new LinkedBlockingQueue<SSTableReader>();

ExecutorService executor = DebuggableThreadPoolExecutor.createWithPoolSize("SSTableBatchOpen", Runtime.getRuntime().availableProcessors());
for (final Map.Entry<Descriptor, Set<Component>> entry : entries)
{
Runnable runnable = new Runnable()
{
public void run()
{
SSTableReader sstable;
try
{
sstable = open(entry.getKey(), entry.getValue(), savedKeys, tracker, metadata, partitioner);
}
catch (IOException ex)
{
logger.error("Corrupt sstable " + entry + "; skipped", ex);
return;
}
sstables.add(sstable);
}
};
executor.submit(runnable);
}

executor.shutdown();
try
{
executor.awaitTermination(7, TimeUnit.DAYS);
}
catch (InterruptedException e)
{
throw new AssertionError(e);
}

return sstables;

}

As can be read above, somewhere within the method open() an IOException is thrown, hence the exception above is logged. Two stack frames up, we read that the sstable load() method executes and calls the ByteBufferUtil.read() method. The read method from class ByteBufferUtil is shown below.
    public static ByteBuffer read(DataInput in, int length) throws IOException
    {
        if (in instanceof FileDataInput)
            return ((FileDataInput) in).readBytes(length);

        byte[] buff = new byte[length];
        in.readFully(buff);
        return ByteBuffer.wrap(buff);
    }

We see that the input is an instance of a FileDataInput stream, and we read bytes of the given length. Since FileDataInput is an interface, the class that implements it here is the RandomAccessReader class, with the method readBytes as follows.
    public ByteBuffer readBytes(int length) throws IOException
    {
        assert length >= 0 : "buffer length should not be negative: " + length;

        byte[] buff = new byte[length];
        readFully(buff); // reading data buffer

        return ByteBuffer.wrap(buff);
    }

Reading bytes of a given length actually means reading fully up to that length, starting from where the current file pointer is pointing. A little further up in the stack trace is the method reBuffer():
    /**
* Read data from file starting from current currentOffset to populate buffer.
* @throws IOException on any I/O error.
*/
protected void reBuffer() throws IOException
{
resetBuffer();

if (bufferOffset >= channel.size())
return;

channel.position(bufferOffset); // setting channel position

int read = 0;

while (read < buffer.length)
{
int n = super.read(buffer, read, buffer.length - read);
if (n < 0)
break;
read += n;
}

validBufferBytes = read;

bytesSinceCacheFlush += read;

if (skipIOCache && bytesSinceCacheFlush >= MAX_BYTES_IN_PAGE_CACHE)
{
// with random I/O we can't control what we are skipping so
// it will be more appropriate to just skip a whole file after
// we reach threshold
CLibrary.trySkipCache(this.fd, 0, 0);
bytesSinceCacheFlush = 0;
}
}

and this method calls the superclass to read another chunk into the buffer. The superclass is RandomAccessFile, method readBytes():
    /**
* Reads a sub array as a sequence of bytes.
* @param b the buffer into which the data is read.
* @param off the start offset of the data.
* @param len the number of bytes to read.
* @exception IOException If an I/O error has occurred.
*/
private int readBytes(byte b[], int off, int len) throws IOException {
Object traceContext = IoTrace.fileReadBegin(path);
int bytesRead = 0;
try {
bytesRead = readBytes0(b, off, len);
} finally {
IoTrace.fileReadEnd(traceContext, bytesRead == -1 ? 0 : bytesRead);
}
return bytesRead;
}

private native int readBytes0(byte b[], int off, int len) throws IOException;

... and we are at the end of this path. It turns out that the call to readBytes0 threw the exception: the lower layer, native (non-java) call is throwing the IOException. You can use nodetool scrub to see if that fixes the problem, but what I did was basically wipe the data directory for that cassandra node and rebuild it. After that, I did not see this message anymore.

That's it for this article. If you want to improve it and/or comment, please leave your input below.

Friday, March 27, 2015

Investigate into apache cassandra get_slice assertion error

Today, we will investigate another error from apache cassandra. The error, as shown below, comes from the cassandra log.
ERROR [Thrift:2] 2015-02-11 11:06:10,837 Cassandra.java (line 3041) Internal error processing get_slice
java.lang.AssertionError
at org.apache.cassandra.locator.TokenMetadata.firstTokenIndex(TokenMetadata.java:518)
at org.apache.cassandra.locator.TokenMetadata.firstToken(TokenMetadata.java:532)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:94)
at org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:1992)
at org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:1986)
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:604)
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:564)
at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:128)
at org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:283)
at org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:365)
at org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:326)
at org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:3033)
at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The bottom three lines are pretty easy: a thread is run by the thread pool executor. As indicated by the code snippet below, a worker process is having trouble processing a request.
    try
{
processor = processorFactory_.getProcessor(client_);
inputTransport = inputTransportFactory_.getTransport(client_);
outputTransport = outputTransportFactory_.getTransport(client_);
inputProtocol = inputProtocolFactory_.getProtocol(inputTransport);
outputProtocol = outputProtocolFactory_.getProtocol(outputTransport);
// we check stopped_ first to make sure we're not supposed to be shutting
// down. this is necessary for graceful shutdown. (but not sufficient,
// since process() can take arbitrarily long waiting for client input.
// See comments at the end of serve().)
while (!stopped_ && processor.process(inputProtocol, outputProtocol))
{
inputProtocol = inputProtocolFactory_.getProtocol(inputTransport);
outputProtocol = outputProtocolFactory_.getProtocol(outputTransport);
}
}

Skipping a few low level byte stream processing steps, we arrive at the class which actually implements the method get_slice. Read the code snippet below.
    public List<ColumnOrSuperColumn> get_slice(ByteBuffer key, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
logger.debug("get_slice");

state().hasColumnFamilyAccess(column_parent.column_family, Permission.READ);
return multigetSliceInternal(state().getKeyspace(), Collections.singletonList(key), column_parent, predicate, consistency_level).get(key);
}

So we see another method is called, multigetSliceInternal. Read the code snippet below, where a few validations are performed on the data.
    private Map<ByteBuffer, List<ColumnOrSuperColumn>> multigetSliceInternal(String keyspace, List<ByteBuffer> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
CFMetaData metadata = ThriftValidation.validateColumnFamily(keyspace, column_parent.column_family);
ThriftValidation.validateColumnParent(metadata, column_parent);
ThriftValidation.validatePredicate(metadata, column_parent, predicate);
ThriftValidation.validateConsistencyLevel(keyspace, consistency_level);

List<ReadCommand> commands = new ArrayList<ReadCommand>();
if (predicate.column_names != null)
{
for (ByteBuffer key: keys)
{
ThriftValidation.validateKey(metadata, key);
commands.add(new SliceByNamesReadCommand(keyspace, key, column_parent, predicate.column_names));
}
}
else
{
SliceRange range = predicate.slice_range;
for (ByteBuffer key: keys)
{
ThriftValidation.validateKey(metadata, key);
commands.add(new SliceFromReadCommand(keyspace, key, column_parent, range.start, range.finish, range.reversed, range.count));
}
}

return getSlice(commands, consistency_level);
}

Then the method getSlice is called, and the method readColumnFamily() is also called, as shown in the code snippet below.
  protected Map<DecoratedKey, ColumnFamily> readColumnFamily(List<ReadCommand> commands, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
// TODO - Support multiple column families per row, right now row only contains 1 column family
Map<DecoratedKey, ColumnFamily> columnFamilyKeyMap = new HashMap<DecoratedKey, ColumnFamily>();

if (consistency_level == ConsistencyLevel.ANY)
{
throw new InvalidRequestException("Consistency level any may not be applied to read operations");
}

List<Row> rows;
try
{
schedule(DatabaseDescriptor.getRpcTimeout());
try
{
rows = StorageProxy.read(commands, consistency_level);
}
finally
{
release();
}
}
catch (TimeoutException e)
{
logger.debug("... timed out");
throw new TimedOutException();
}
catch (IOException e)
{
throw new RuntimeException(e);
}

for (Row row: rows)
{
columnFamilyKeyMap.put(row.key, row.cf);
}
return columnFamilyKeyMap;
}

Another class, StorageProxy, is called to read the row in question; the read method code snippet is below.
    /**
* Performs the actual reading of a row out of the StorageService, fetching
* a specific set of column names from a given column family.
*/
public static List<Row> read(List<ReadCommand> commands, ConsistencyLevel consistency_level)
throws IOException, UnavailableException, TimeoutException, InvalidRequestException
{
if (StorageService.instance.isBootstrapMode())
throw new UnavailableException();
long startTime = System.nanoTime();
List<Row> rows;
try
{
rows = fetchRows(commands, consistency_level);
}
finally
{
readStats.addNano(System.nanoTime() - startTime);
}
return rows;
}

The exception leads this investigation to the fetching of the rows, within the same class, in the method fetchRows; code snippet below.
    /**
* This function executes local and remote reads, and blocks for the results:
*
* 1. Get the replica locations, sorted by response time according to the snitch
* 2. Send a data request to the closest replica, and digest requests to either
* a) all the replicas, if read repair is enabled
* b) the closest R-1 replicas, where R is the number required to satisfy the ConsistencyLevel
* 3. Wait for a response from R replicas
* 4. If the digests (if any) match the data return the data
* 5. else carry out read repair by getting data from all the nodes.
*/
private static List<Row> fetchRows(List<ReadCommand> initialCommands, ConsistencyLevel consistency_level) throws IOException, UnavailableException, TimeoutException
{
List<Row> rows = new ArrayList<Row>(initialCommands.size());
List<ReadCommand> commandsToRetry = Collections.emptyList();

do
{
List<ReadCommand> commands = commandsToRetry.isEmpty() ? initialCommands : commandsToRetry;
ReadCallback<Row>[] readCallbacks = new ReadCallback[commands.size()];

if (!commandsToRetry.isEmpty())
logger.debug("Retrying {} commands", commandsToRetry.size());

// send out read requests
for (int i = 0; i < commands.size(); i++)
{
ReadCommand command = commands.get(i);
assert !command.isDigestQuery();
logger.debug("Command/ConsistencyLevel is {}/{}", command, consistency_level);

List<InetAddress> endpoints = StorageService.instance.getLiveNaturalEndpoints(command.table,
command.key);
DatabaseDescriptor.getEndpointSnitch().sortByProximity(FBUtilities.getBroadcastAddress(), endpoints);

RowDigestResolver resolver = new RowDigestResolver(command.table, command.key);
ReadCallback<Row> handler = getReadCallback(resolver, command, consistency_level, endpoints);
handler.assureSufficientLiveNodes();
assert !handler.endpoints.isEmpty();
readCallbacks[i] = handler;

// The data-request message is sent to dataPoint, the node that will actually get the data for us
InetAddress dataPoint = handler.endpoints.get(0);
if (dataPoint.equals(FBUtilities.getBroadcastAddress()) && OPTIMIZE_LOCAL_REQUESTS)
{
logger.debug("reading data locally");
StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(command, handler));
}
else
{
logger.debug("reading data from {}", dataPoint);
MessagingService.instance().sendRR(command, dataPoint, handler);
}

if (handler.endpoints.size() == 1)
continue;

// send the other endpoints a digest request
ReadCommand digestCommand = command.copy();
digestCommand.setDigestQuery(true);
MessageProducer producer = null;
for (InetAddress digestPoint : handler.endpoints.subList(1, handler.endpoints.size()))
{
if (digestPoint.equals(FBUtilities.getBroadcastAddress()))
{
logger.debug("reading digest locally");
StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(digestCommand, handler));
}
else
{
logger.debug("reading digest from {}", digestPoint);
// (We lazy-construct the digest Message object since it may not be necessary if we
// are doing a local digest read, or no digest reads at all.)
if (producer == null)
producer = new CachingMessageProducer(digestCommand);
MessagingService.instance().sendRR(producer, digestPoint, handler);
}
}
}

// read results and make a second pass for any digest mismatches
List<ReadCommand> repairCommands = null;
List<RepairCallback> repairResponseHandlers = null;
for (int i = 0; i < commands.size(); i++)
{
ReadCallback<Row> handler = readCallbacks[i];
ReadCommand command = commands.get(i);
try
{
long startTime2 = System.currentTimeMillis();
Row row = handler.get();
if (row != null)
{
command.maybeTrim(row);
rows.add(row);
}

if (logger.isDebugEnabled())
logger.debug("Read: " + (System.currentTimeMillis() - startTime2) + " ms.");
}
catch (TimeoutException ex)
{
if (logger.isDebugEnabled())
logger.debug("Read timeout: {}", ex.toString());
throw ex;
}
catch (DigestMismatchException ex)
{
if (logger.isDebugEnabled())
logger.debug("Digest mismatch: {}", ex.toString());
RowRepairResolver resolver = new RowRepairResolver(command.table, command.key);
RepairCallback repairHandler = new RepairCallback(resolver, handler.endpoints);

if (repairCommands == null)
{
repairCommands = new ArrayList<ReadCommand>();
repairResponseHandlers = new ArrayList<RepairCallback>();
}
repairCommands.add(command);
repairResponseHandlers.add(repairHandler);

MessageProducer producer = new CachingMessageProducer(command);
for (InetAddress endpoint : handler.endpoints)
MessagingService.instance().sendRR(producer, endpoint, repairHandler);
}
}

if (commandsToRetry != Collections.EMPTY_LIST)
commandsToRetry.clear();

// read the results for the digest mismatch retries
if (repairResponseHandlers != null)
{
for (int i = 0; i < repairCommands.size(); i++)
{
ReadCommand command = repairCommands.get(i);
RepairCallback handler = repairResponseHandlers.get(i);
// wait for the repair writes to be acknowledged, to minimize impact on any replica that's
// behind on writes in case the out-of-sync row is read multiple times in quick succession
FBUtilities.waitOnFutures(handler.resolver.repairResults, DatabaseDescriptor.getRpcTimeout());

Row row;
try
{
row = handler.get();
}
catch (DigestMismatchException e)
{
throw new AssertionError(e); // full data requested from each node here, no digests should be sent
}

ReadCommand retryCommand = command.maybeGenerateRetryCommand(handler, row);
if (retryCommand != null)
{
logger.debug("issuing retry for read command");
if (commandsToRetry == Collections.EMPTY_LIST)
commandsToRetry = new ArrayList<ReadCommand>();
commandsToRetry.add(retryCommand);
continue;
}

if (row != null)
{
command.maybeTrim(row);
rows.add(row);
}
}
}
} while (!commandsToRetry.isEmpty());

return rows;
}

At this point of the investigation, the documentation of this method, fetchRows, is pretty useful to us.
* This function executes local and remote reads, and blocks for the results:
*
* 1. Get the replica locations, sorted by response time according to the snitch
* 2. Send a data request to the closest replica, and digest requests to either
* a) all the replicas, if read repair is enabled
* b) the closest R-1 replicas, where R is the number required to satisfy the ConsistencyLevel

We see that this method executes reads on both local and remote nodes, and that the problem occurs while determining which nodes are responsible for keeping the row. Let's read the method getLiveNaturalEndpoints(), shown below.
    /**
* This method attempts to return N endpoints that are responsible for storing the
* specified key i.e for replication.
*
* @param key - key for which we need to find the endpoint return value -
* the endpoint responsible for this key
*/
public List<InetAddress> getLiveNaturalEndpoints(String table, ByteBuffer key)
{
return getLiveNaturalEndpoints(table, partitioner.getToken(key));
}

public List<InetAddress> getLiveNaturalEndpoints(String table, Token token)
{
List<InetAddress> liveEps = new ArrayList<InetAddress>();
List<InetAddress> endpoints = Table.open(table).getReplicationStrategy().getNaturalEndpoints(token);

for (InetAddress endpoint : endpoints)
{
if (FailureDetector.instance.isAlive(endpoint))
liveEps.add(endpoint);
}

return liveEps;
}

A little higher up in the stack trace is the abstract class AbstractReplicationStrategy:
    /**
* get the (possibly cached) endpoints that should store the given Token
* Note that while the endpoints are conceptually a Set (no duplicates will be included),
* we return a List to avoid an extra allocation when sorting by proximity later
* @param searchToken the token the natural endpoints are requested for
* @return a copy of the natural endpoints for the given token
*/
public ArrayList<InetAddress> getNaturalEndpoints(Token searchToken)
{
Token keyToken = TokenMetadata.firstToken(tokenMetadata.sortedTokens(), searchToken);
ArrayList<InetAddress> endpoints = getCachedEndpoints(keyToken);
if (endpoints == null)
{
TokenMetadata tokenMetadataClone = tokenMetadata.cloneOnlyTokenMap();
keyToken = TokenMetadata.firstToken(tokenMetadataClone.sortedTokens(), searchToken);
endpoints = new ArrayList<InetAddress>(calculateNaturalEndpoints(searchToken, tokenMetadataClone));
cacheEndpoint(keyToken, endpoints);
}

return new ArrayList<InetAddress>(endpoints);
}

Somehow the ring size is zero (or less). Below is the class TokenMetadata.java and the code snippet where the assertion is thrown:
    public static int firstTokenIndex(final ArrayList ring, Token start, boolean insertMin)
{
assert ring.size() > 0;
// insert the minimum token (at index == -1) if we were asked to include it and it isn't a member of the ring
int i = Collections.binarySearch(ring, start);
if (i < 0)
{
i = (i + 1) * (-1);
if (i >= ring.size())
i = insertMin ? -1 : 0;
}
return i;
}

public static Token firstToken(final ArrayList<Token> ring, Token start)
{
return ring.get(firstTokenIndex(ring, start, false));
}

So something went wrong while reading a row's columns: the list of natural endpoints came back empty because the token ring has no entries. My guess is that gossip was disabled, so the ring metadata is empty. The solution is to enable gossip and then restart the Cassandra instance.
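
If you want to verify this on a running node before restarting, nodetool can help. The exact options and output vary between Cassandra versions, so treat the following as a rough guide rather than exact commands:

# check whether gossip is active and whether the node sees a populated ring
nodetool -h <cassandra_host> info
nodetool -h <cassandra_host> ring

# re-enable gossip if it was disabled
nodetool -h <cassandra_host> enablegossip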

If you think this analysis is not accurate or want to provide more information, please do so by commenting below. Thank you.

Sunday, March 15, 2015

Investigate into elasticsearch already flushing exception

Today, we will take a look at another elasticsearch issue. Below is an example of the message that appeared in the log file on one of my nodes.
[2015-03-02 04:51:30,450][DEBUG][action.admin.indices.flush] [node02] [index_A][1], node[DEF_-ABC], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.flush.FlushRequest@23465ba5]
org.elasticsearch.index.engine.FlushNotAllowedEngineException: [index_A][1] already flushing...
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:825)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:564)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:114)
at org.elasticsearch.action.admin.indices.flush.TransportFlushAction.shardOperation(TransportFlushAction.java:49)
at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$AsyncBroadcastAction$2.run(TransportBroadcastOperationAction.java:225)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

We will investigate the code path that leads to the exception above and decide whether this is a problem. Below is the snippet of the "entry" elasticsearch code.
public void run() {
try {
onOperation(shard, shardIndex, shardOperation(shardRequest));
} catch (Throwable e) {
onOperation(shard, shardIt, shardIndex, e);
}
}

As we read above, this is an asynchronous broadcast action executed while performing an operation. The code execution path tells us that this shard is a local shard and that it encountered an exception during flushing.
    @Override
protected ShardFlushResponse shardOperation(ShardFlushRequest request) throws ElasticSearchException {
IndexShard indexShard = indicesService.indexServiceSafe(request.index()).shardSafe(request.shardId());
indexShard.flush(new Engine.Flush().type(request.full() ? Engine.Flush.Type.NEW_WRITER : Engine.Flush.Type.COMMIT_TRANSLOG).force(request.force()));
return new ShardFlushResponse(request.index(), request.shardId());
}

Tracing up the stack trace, we notice that the flush is performed by the internal index shard class:
    @Override
public void flush(Engine.Flush flush) throws ElasticSearchException {
// we allows flush while recovering, since we allow for operations to happen
// while recovering, and we want to keep the translog at bay (up to deletes, which
// we don't gc).
verifyStartedOrRecovering();
if (logger.isTraceEnabled()) {
logger.trace("flush with {}", flush);
}
long time = System.nanoTime();
engine.flush(flush);
flushMetric.inc(System.nanoTime() - time);
}

As can be read above, we are at the end of the investigation journey. It turns out that the shard's flushing count, once incremented, is greater than one, meaning another flush is already in progress; the count is decremented back and a FlushNotAllowedEngineException is thrown to signal that this shard is currently being flushed.
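
The guard described above boils down to an atomic counter that allows only one flush at a time. Below is a minimal sketch of that pattern, not the actual RobinEngine code (which does far more work); the class and exception names here are just placeholders:

import java.util.concurrent.atomic.AtomicInteger;

public class FlushGuardSketch {

    private final AtomicInteger flushing = new AtomicInteger();

    public void flush() {
        // if another flush is already in flight, back out and signal the caller
        if (flushing.incrementAndGet() > 1) {
            flushing.decrementAndGet();
            throw new IllegalStateException("already flushing...");  // stands in for FlushNotAllowedEngineException
        }
        try {
            // ... do the actual flush work here ...
        } finally {
            flushing.decrementAndGet();
        }
    }
}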

As to why this is happening, it is puzzling, but I have not seen this exception since, and from the investigation above it does not look harmful, since the shard in question is simply already being flushed.

If you think this analysis is not correct or you have better insight, please leave your comment below. Thank you.

Saturday, March 14, 2015

Investigate into elasticsearch indices memory marking shard active or inactive

If you have enabled DEBUG logging for the indices memory controller (indices.memory: DEBUG in logging.yml) in elasticsearch 0.90 to see how the shards behave, you will notice messages such as the ones below appearing quite often in the log file (a sample configuration is shown after the log excerpt).
[2014-12-11 15:22:27,562][DEBUG][indices.memory           ] [es01] recalculating shard indexing buffer (reason=active/inactive[true] created/deleted[false]), total is [4.2gb] with [11] active shards, each shard set to indexing=[391.8mb], translog=[64kb]
[2014-12-11 15:23:57,562][DEBUG][indices.memory ] [es01] marking shard [index_A][7] as inactive (inactive_time[30m]) indexing wise, setting size to [500kb]
[2014-12-11 15:23:57,562][DEBUG][indices.memory ] [es01] marking shard [index_B][0] as inactive (inactive_time[30m]) indexing wise, setting size to [500kb]
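
For reference, turning this logger on is a one-line addition to the logger section of logging.yml; the structure below is only a sketch and your file will contain other entries:

# logging.yml (elasticsearch 0.90) -- only the relevant logger entry is shown
logger:
  indices.memory: DEBUG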

Tracing into the code base, in the class IndexingMemoryController, we notice that a runnable instance of ShardsIndicesStatusChecker is scheduled periodically, with a default interval of 30 seconds. Reading this class, we see that the algorithm loops through the indices service and, for each index service, checks the status of its index shards. Based on the log output above, control goes into the path below, where status.inactiveIndexing is set to true.
    if (!status.inactiveIndexing) {
// mark it as inactive only if enough time has passed and there are no ongoing merges going on...
if ((time - status.time) > inactiveTime.millis() && indexShard.mergeStats().getCurrent() == 0) {
// inactive for this amount of time, mark it
activeToInactiveIndexingShards.add(indexShard);
status.inactiveIndexing = true;
activeInactiveStatusChanges = true;
logger.debug("marking shard [{}][{}] as inactive (inactive_time[{}]) indexing wise, setting size to [{}]", indexShard.shardId().index().name(), indexShard.shardId().id(), inactiveTime, Engine.INACTIVE_SHARD_INDEXING_BUFFER);
}
}

Going further down the code, because the variable activeInactiveStatusChanges has changed to true, the following if statement evaluates to true:
if (shardsCreatedOrDeleted || activeInactiveStatusChanges) {
calcAndSetShardBuffers("active/inactive[" + activeInactiveStatusChanges + "] created/deleted[" + shardsCreatedOrDeleted + "]");
}

Tracing into the method calcAndSetShardBuffers(), we see that the per-shard indexing buffer size is calculated and clamped within a range: the default minimum shard buffer is 4MB and the default maximum is 512MB. A similar calculation applies to the translog buffer size, with a default minimum of 2KB and a default maximum of 64KB (see the sketch after the snippet below). The log message such as the one above is then written using the code below.
logger.debug("recalculating shard indexing buffer (reason={}), total is [{}] with [{}] active shards, each shard set to indexing=[{}], translog=[{}]", reason, indexingBuffer, shardsCount, shardIndexingBufferSize, shardTranslogBufferSize);

From this analysis, it looks to me like elasticsearch is working fine: it checks the index shard status and recalculates the shard and translog buffers accordingly. If you think this analysis is incorrect or you would like to contribute more information, please leave your comment below.

Friday, March 13, 2015

check out java wait, notify and notifyAll method

If you have been programming in java for a while, you have most probably encountered methods such as wait, notify and notifyAll. Example screenshot below.

java_object_notify

So today, we will take a look at these methods, what they are about, and example code showing their usage. First, let's read the javadoc for these methods.

wait()
public final void wait()
throws InterruptedException

 

Causes the current thread to wait until another thread invokes the notify() method or the notifyAll() method for this object. In other words, this method behaves exactly as if it simply performs the call wait(0).

 

The current thread must own this object's monitor. The thread releases ownership of this monitor and waits until another thread notifies threads waiting on this object's monitor to wake up either through a call to the notify method or the notifyAll method. The thread then waits until it can re-obtain ownership of the monitor and resumes execution.

 

As in the one argument version, interrupts and spurious wakeups are possible, and this method should always be used in a loop:

 

synchronized (obj) {
while (<condition does not hold>)
obj.wait();
... // Perform action appropriate to condition
}

 

This method should only be called by a thread that is the owner of this object's monitor. See the notify method for a description of the ways in which a thread can become the owner of a monitor.

 

Throws:
IllegalMonitorStateException - if the current thread is not the owner of the object's monitor.
InterruptedException - if any thread interrupted the current thread before or while the current thread was waiting for a notification. The interrupted status of the current thread is cleared when this exception is thrown.

notify()
public final void notify()

 

Wakes up a single thread that is waiting on this object's monitor. If any threads are waiting on this object, one of them is chosen to be awakened. The choice is arbitrary and occurs at the discretion of the implementation. A thread waits on an object's monitor by calling one of the wait methods.

 

The awakened thread will not be able to proceed until the current thread relinquishes the lock on this object. The awakened thread will compete in the usual manner with any other threads that might be actively competing to synchronize on this object; for example, the awakened thread enjoys no reliable privilege or disadvantage in being the next thread to lock this object.

 

This method should only be called by a thread that is the owner of this object's monitor. A thread becomes the owner of the object's monitor in one of three ways:

 

By executing a synchronized instance method of that object.
By executing the body of a synchronized statement that synchronizes on the object.
For objects of type Class, by executing a synchronized static method of that class.

 

Only one thread at a time can own an object's monitor.

 

Throws:
IllegalMonitorStateException - if the current thread is not the owner of this object's monitor.

notifyAll()
public final void notifyAll()

 

Wakes up all threads that are waiting on this object's monitor. A thread waits on an object's monitor by calling one of the wait methods.

 

The awakened threads will not be able to proceed until the current thread relinquishes the lock on this object. The awakened threads will compete in the usual manner with any other threads that might be actively competing to synchronize on this object; for example, the awakened threads enjoy no reliable privilege or disadvantage in being the next thread to lock this object.

 

This method should only be called by a thread that is the owner of this object's monitor. See the notify method for a description of the ways in which a thread can become the owner of a monitor.

 

Throws:
IllegalMonitorStateException - if the current thread is not the owner of this object's monitor.

A careful reader like you might notice that all of these methods belong to the class Object, and remember that every object in java has Object as its superclass. That answers why these methods exist on every object.

Now, it is pretty clear that these methods are used when two or more threads interact with each other. You might ask when or why you should use them. If you have code where a thread sleeps for a second (or less) and then wakes up to poll whether, say, a queue has been filled, you are wasting resources such as CPU cycles. That is when you should change the code to use these methods instead, as the sketch below illustrates.
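
Here is a small sketch of the difference, assuming a hypothetical shared queue guarded by its own monitor; the first consumer burns CPU waking up to poll, while the second simply blocks until a producer calls notify():

import java.util.LinkedList;
import java.util.Queue;

public class PollingVsWaiting {

    private final Queue<String> queue = new LinkedList<String>();

    // wasteful: wakes up every 100ms just to check whether anything arrived
    String takeByPolling() throws InterruptedException {
        while (true) {
            synchronized (queue) {
                if (!queue.isEmpty())
                    return queue.remove();
            }
            Thread.sleep(100);
        }
    }

    // better: sleeps on the monitor and is woken only when a producer calls notify()
    String takeByWaiting() throws InterruptedException {
        synchronized (queue) {
            while (queue.isEmpty())
                queue.wait();
            return queue.remove();
        }
    }

    void put(String item) {
        synchronized (queue) {
            queue.add(item);
            queue.notify();
        }
    }
}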

Okay, reasons given, now let's read a code example of these methods in use. I googled around and many results give a calculator example, so I will use that as my first example as well. >:)
public class Calculator {

    public static void main(String[] args) {

        ThreadB b = new ThreadB();
        b.start();

        synchronized (b) {
            try {
                System.out.println("Waiting for b to complete...");
                b.wait();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.out.println("total is : " + b.total);
        }

    }

    static class ThreadB extends Thread {
        int total;

        @Override
        public void run() {
            System.out.println("calculate start");
            synchronized (this) {
                for (int i = 0; i < 100; i++) {
                    total += i;
                }
                notify();
            }
            System.out.println("calculate done");
        }
    }

}

output

Waiting for b to complete...
calculate start
calculate done
total is : 4950

As can be read above, we have a Calculator class with a static inner class ThreadB that sums the integers from 0 to 99. Notice that thread b is started in the main method; the main thread then synchronizes on object b and calls wait() on it, so it blocks until it is notified. In thread b's run method, the loop performs the summation and, once that long-running work is done, notify() is called. The total is then printed in the main method. As homework, try modifying the code by removing the synchronized keyword and the wait and notify calls, then run the application again and observe the output: it will not be consistent and the result will not be as expected.
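
One more remark before moving on: the javadoc quoted earlier says wait() should always be called in a loop that re-checks a condition, because the notify can happen before the main thread even starts waiting, and spurious wakeups are possible. Below is a variation of the example above that adds such a guard; the done flag is a name I introduce here for illustration, it is not part of the original example:

public class CalculatorGuarded {

    public static void main(String[] args) {
        ThreadB b = new ThreadB();
        b.start();

        synchronized (b) {
            try {
                while (!b.done)      // loop as the javadoc recommends; also covers the case
                    b.wait();        // where b finished (and notified) before we got here
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.out.println("total is : " + b.total);
        }
    }

    static class ThreadB extends Thread {
        int total;
        boolean done;                // the condition the waiting thread re-checks

        @Override
        public void run() {
            synchronized (this) {
                for (int i = 0; i < 100; i++) {
                    total += i;
                }
                done = true;         // set the condition before notifying
                notify();
            }
        }
    }
}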

Okay, let's go to a more complex example. Imagine a simple clinic where many patients are waiting to be seen by a doctor, and a nurse hands the patients over to the doctor one at a time. With that said, let's lay out and read the code.
import java.util.Queue;
import java.util.Random;

public class Doctor extends Thread {

    private Queue<String> patientQueue;

    public Doctor(Queue<String> patientQueue) {
        this.patientQueue = patientQueue;
    }

    @Override
    public void run() {
        try {
            while (true) {
                synchronized (patientQueue) {
                    if (patientQueue.size() == 0) {
                        patientQueue.wait();
                    }

                    String patient = patientQueue.remove();

                    System.out.println("treating patient: " + patient + " total patient in queue " + patientQueue.size());
                    Thread.sleep(duration(2, 3));
                }

            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    private static int duration(int min, int max) {

        Random rand = new Random();

        // nextInt is normally exclusive of the top value,
        // so add 1 to make it inclusive
        int randomNum = rand.nextInt((max - min) + 1) + min;
        randomNum *= 1000;

        return randomNum;
    }

}

import java.util.Queue;

public class Nurse extends Thread {

    static final int MAX = 10;
    private int i;

    private Queue<String> patientQueue;

    public Nurse(Queue<String> patientQueue) {
        this.patientQueue = patientQueue;
    }

    @Override
    public void run() {
        try {
            while (true) {
                patientAddedToQueue();
            }
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    private synchronized void patientAddedToQueue() throws InterruptedException {
        synchronized (patientQueue) {
            while (patientQueue.size() == MAX) {
                patientQueue.wait(10000);
                System.out.println("done wait");
            }
            String patientName = "patient " + i;
            patientQueue.add(patientName);
            i++;
            System.out.println("adding patient : " + patientName);
            patientQueue.notify();
        }
    }

}

import java.util.LinkedList;
import java.util.Queue;

public class Clinics {

    public static void main(String args[]) {

        Queue<String> patientQueue = new LinkedList<String>();

        Doctor veelan = new Doctor(patientQueue);
        veelan.start();

        Nurse gomathy = new Nurse(patientQueue);
        gomathy.start();

    }

}

Example output
adding patient : patient 0
adding patient : patient 1
adding patient : patient 2
adding patient : patient 3
adding patient : patient 4
adding patient : patient 5
adding patient : patient 6
adding patient : patient 7
adding patient : patient 8
adding patient : patient 9
treating patient: patient 0 total patient in queue 9
treating patient: patient 1 total patient in queue 8
treating patient: patient 2 total patient in queue 7
treating patient: patient 3 total patient in queue 6
treating patient: patient 4 total patient in queue 5
treating patient: patient 5 total patient in queue 4
treating patient: patient 6 total patient in queue 3
treating patient: patient 7 total patient in queue 2
treating patient: patient 8 total patient in queue 1
treating patient: patient 9 total patient in queue 0
done wait
adding patient : patient 10
adding patient : patient 11
adding patient : patient 12
adding patient : patient 13
adding patient : patient 14
adding patient : patient 15
adding patient : patient 16
adding patient : patient 17
adding patient : patient 18
adding patient : patient 19
treating patient: patient 10 total patient in queue 9
treating patient: patient 11 total patient in queue 8

As you can read above, we have three classes: Doctor, Nurse and Clinics. The entry point is the Clinics class, which binds the other two together. A queue is shared between the Doctor and Nurse classes. The Nurse class mocks a nurse "accepting" new patients via the private method patientAddedToQueue: if the queue is full at 10 patients, it waits on the queue for up to 10 seconds, effectively telling the nurse to stop accepting new patients (why a 10-second timeout, I will explain later). If the queue holds fewer than 10 patients, a patient is added to the queue and notify() is called.

In the Doctor class's run method, we read that the object patientQueue is synchronized on. If the queue is empty, the doctor thread waits on patientQueue, since there is no point calling its remove method on an empty queue. ;-) The doctor thread then sleeps for between 2 and 3 seconds to simulate the doctor treating the patient.

Now, imagine nurse gomathy has filled the queue to size 10 and is in the wait state, while doctor veelan gradually processes patientQueue, so the queue slowly drains. Eventually the queue becomes empty as well, and at that point both the gomathy nurse object and the veelan doctor object are in the wait state. Hence, now you may figure out why the 10-second timeout matters: once it expires, gomathy wakes up and starts accepting patients again. :-) Improving the interaction between the nurse and doctor threads is probably another piece of homework for you, the reader. Also try adding one more doctor and changing notify to notifyAll, as sketched below.
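
As a starting point for that homework, the changes are small: the Nurse calls patientQueue.notifyAll() instead of patientQueue.notify() so every waiting doctor gets a chance to wake up, and Clinics starts a second Doctor. Note that with more than one doctor, the Doctor's if (patientQueue.size() == 0) check should also become a while loop, as the javadoc for wait() recommends, so a doctor that wakes up to an already-emptied queue simply goes back to waiting. A sketch of the entry class, assuming the Doctor and Nurse classes above are otherwise reused ("arjun" is just an illustrative name):

import java.util.LinkedList;
import java.util.Queue;

public class ClinicsWithTwoDoctors {

    public static void main(String[] args) {

        Queue<String> patientQueue = new LinkedList<String>();

        // two doctors now share the same queue
        Doctor veelan = new Doctor(patientQueue);
        veelan.start();

        Doctor arjun = new Doctor(patientQueue);
        arjun.start();

        Nurse gomathy = new Nurse(patientQueue);
        gomathy.start();
    }
}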

Well, that's it for this article. I hope you learned something, and remember to contribute back.