
Thursday, June 1, 2017

investigate into apache cassandra 1.2.19 sstable corruption

Previously, I investigated sstable corruption in apache cassandra 1.0.8 and on a fresh node after an upgrade to apache cassandra 1.2. Both of those articles can be found here and here. Today, we will take another look at a running apache cassandra 1.2.19 node that encountered sstable corruption. Below is the stack trace found in the cassandra system.log.

 org.apache.cassandra.io.sstable.CorruptSSTableException: org.apache.cassandra.io.compress.CorruptBlockException: (/var/lib/cassandra/data/<KEYSPACE>/<COLUMN_FAMILY>/<KEYSPACE>-<CF>-ic-112-Data.db): corruption detected, chunk at 19042661 of length 27265.  
     at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:89)  
     at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:45)  
     at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:355)  
     at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)  
     at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)  
     at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:380)  
     at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:391)  
     at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:370)  
     at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:175)  
     at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:155)  
     at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:142)  
     at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)  
     at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:145)  
     at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)  
     at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)  
     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)  
     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)  
     at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:145)  
     at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)  
     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)  
     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)  
     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)  
     at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)  
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)  
     at java.util.concurrent.FutureTask.run(FutureTask.java:262)  
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
     at java.lang.Thread.run(Thread.java:745)  
 Caused by: org.apache.cassandra.io.compress.CorruptBlockException: (/var/lib/cassandra/data/<KEYSPACE>/<COLUMN_FAMILY>/<KEYSPACE>-<CF>-ic-112-Data.db): corruption detected, chunk at 19042661 of length 27265.  
     at org.apache.cassandra.io.compress.CompressedRandomAccessReader.decompressChunk(CompressedRandomAccessReader.java:128)  
     at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:85)  
     ... 27 more  

Okay, let's trace through the stack trace and study what actually causes this and what sstable corruption means.

     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)  
     at java.util.concurrent.FutureTask.run(FutureTask.java:262)  
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)  
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)  
     at java.lang.Thread.run(Thread.java:745)      

Simple: a thread is run by the executor.
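
To make those bottom frames concrete, here is a minimal, self-contained sketch (not Cassandra code) of a task being submitted to a thread pool; the Runnable body is just a stand-in for the background compaction task.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ExecutorSketch
    {
        public static void main(String[] args)
        {
            // the submitted task is wrapped in a FutureTask and picked up by a worker
            // thread, which is exactly what the bottom of the stack trace shows
            ExecutorService executor = Executors.newFixedThreadPool(1);
            executor.submit(() -> System.out.println("background compaction task would run here"));
            executor.shutdown();
        }
    }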

     at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:145)  
     at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)  
     at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)  
     at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)  
     at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)  
     at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)  

We see that a background compaction task is started. Code below.

   // the actual sstables to compact are not determined until we run the BCT; that way, if new sstables  
   // are created between task submission and execution, we execute against the most up-to-date information  
   class BackgroundCompactionTask implements Runnable  
   {  
     private final ColumnFamilyStore cfs;  
   
     BackgroundCompactionTask(ColumnFamilyStore cfs)  
     {  
       this.cfs = cfs;  
     }  
   
     public void run()  
     {  
       compactionLock.readLock().lock();  
       try  
       {  
         logger.debug("Checking {}.{}", cfs.table.name, cfs.columnFamily); // log after we get the lock so we can see delays from that if any  
         if (!cfs.isValid())  
         {  
           logger.debug("Aborting compaction for dropped CF");  
           return;  
         }  
   
         AbstractCompactionStrategy strategy = cfs.getCompactionStrategy();  
         AbstractCompactionTask task = strategy.getNextBackgroundTask(getDefaultGcBefore(cfs));  
         if (task == null)  
         {  
           logger.debug("No tasks available");  
           return;  
         }  
         task.execute(metrics);  
       }  
       finally  
       {  
         compactingCF.remove(cfs);  
         compactionLock.readLock().unlock();  
       }  
       submitBackground(cfs);  
     }  
   }  

     at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:145)  
     at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)  
     at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)  
     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)  
     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)  

Here we see an iterator going over the sstables for compaction.

     at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:175)  
     at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:155)  
     at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:142)  
     at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)  

DecoratedKey key = sstable.decodeKey(ByteBufferUtil.readWithShortLength(dfile));
     
Here we see that what actually gets iterated is the sstable, and in particular its keys.
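
For context, a "read with short length" boils down to reading a two-byte length prefix and then that many bytes. The following is a simplified stand-in for ByteBufferUtil.readWithShortLength, written from that assumption rather than copied from the Cassandra source.

    import java.io.DataInput;
    import java.io.IOException;
    import java.nio.ByteBuffer;

    public final class ShortLengthReadSketch
    {
        // a key is serialized as an unsigned-short length followed by that many bytes
        public static ByteBuffer readWithShortLength(DataInput in) throws IOException
        {
            int length = in.readUnsignedShort(); // the two-byte length prefix
            byte[] bytes = new byte[length];
            in.readFully(bytes);                 // the key bytes themselves
            return ByteBuffer.wrap(bytes);
        }
    }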

     at org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBuffer(CompressedRandomAccessReader.java:89)  
     at org.apache.cassandra.io.compress.CompressedThrottledReader.reBuffer(CompressedThrottledReader.java:45)  
     at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:355)  
     at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)  
     at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)  
     at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:380)  
     at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:391)  
     at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:370)  

Here the code reference moves across a few files due to method overrides; nonetheless, the important part is the method reBuffer().


   @Override  
   protected void reBuffer()  
   {  
     try  
     {  
       decompressChunk(metadata.chunkFor(current));  
     }  
     catch (CorruptBlockException e)  
     {  
       throw new CorruptSSTableException(e, getPath());  
     }  
     catch (IOException e)  
     {  
       throw new FSReadError(e, getPath());  
     }  
   }  
     
   private void decompressChunk(CompressionMetadata.Chunk chunk) throws IOException  
   {  
     if (channel.position() != chunk.offset)  
       channel.position(chunk.offset);  
   
     if (compressed.capacity() < chunk.length)  
       compressed = ByteBuffer.wrap(new byte[chunk.length]);  
     else  
       compressed.clear();  
     compressed.limit(chunk.length);  
   
     if (channel.read(compressed) != chunk.length)  
       throw new CorruptBlockException(getPath(), chunk);  
   
     // technically flip() is unnecessary since all the remaining work uses the raw array, but if that changes  
     // in the future this will save a lot of hair-pulling  
     compressed.flip();  
     try  
     {  
       validBufferBytes = metadata.compressor().uncompress(compressed.array(), 0, chunk.length, buffer, 0);  
     }  
     catch (IOException e)  
     {  
       throw new CorruptBlockException(getPath(), chunk);  
     }  
   
     if (metadata.parameters.getCrcCheckChance() > FBUtilities.threadLocalRandom().nextDouble())  
     {  
       checksum.update(buffer, 0, validBufferBytes);  
   
       if (checksum(chunk) != (int) checksum.getValue())  
         throw new CorruptBlockException(getPath(), chunk);  
   
       // reset checksum object back to the original (blank) state  
       checksum.reset();  
     }  
   
     // buffer offset is always aligned  
     bufferOffset = current & ~(buffer.length - 1);  
   }  

We read that if the chunk cannot be read at its expected length, fails to decompress, or has a stored checksum that does not match the freshly computed CRC32 checksum, this is considered sstable corruption.
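
In other words, the checksum part of the check boils down to something like this minimal sketch (a standalone illustration mirroring the code above, not the Cassandra class itself):

    import java.util.zip.CRC32;

    public class ChunkChecksumSketch
    {
        // recompute CRC32 over the chunk bytes and compare with the checksum stored
        // alongside the chunk; a mismatch is what surfaces as CorruptBlockException
        public static boolean chunkLooksHealthy(byte[] chunkBytes, int length, int storedChecksum)
        {
            CRC32 checksum = new CRC32();
            checksum.update(chunkBytes, 0, length);
            return storedChecksum == (int) checksum.getValue();
        }
    }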

For this type of exception, I remember trying many things, such as the following.

1. try an online nodetool scrub; it did not work.
2. try an offline sstablescrub; it did not work.
3. wipe out the node and rebuild it again; it did not work.

In the end we had to change the hardware altogether. After that, we did not see the problem anymore.

Saturday, March 28, 2015

Investigate into apache cassandra corrupt sstable exception

Today, we will take a look at another apache cassandra 1.0.8 exception. An example stack trace is below.
ERROR [SSTableBatchOpen:2] 2015-03-07 06:11:58,544 SSTableReader.java (line 228) Corrupt sstable /var/lib/cassandra/data/MySuperKeyspace/MyColumnFamily-hc-6681=[Index.db, Statistics.db, CompressionInfo.db, Filter.db, Data.db]; skipped
java.io.IOException: Input/output error
at java.io.RandomAccessFile.readBytes0(Native Method)
at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
at org.apache.cassandra.io.util.RandomAccessReader.reBuffer(RandomAccessReader.java:128)
at org.apache.cassandra.io.util.RandomAccessReader.read(RandomAccessReader.java:302)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:324)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:393)
at org.apache.cassandra.io.sstable.SSTableReader.load(SSTableReader.java:375)
at org.apache.cassandra.io.sstable.SSTableReader.open(SSTableReader.java:186)
at org.apache.cassandra.io.sstable.SSTableReader$1.run(SSTableReader.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Before we go into the code base for this stack trace, I had no idea what it was about; it showed up while the cassandra 1.0.12 instance was booting up. The last thing I remember is that I triggered user defined compaction twice in cassandra 1.0.8 using the same sstables, and after the first compaction was done, this sstable stayed around forever... like for two weeks plus. Then we upgraded cassandra.

Enough said, let's go into the code base and understand what is really meant by a corrupt sstable. The bottom of the stack trace is pretty obvious: a ThreadPoolExecutor executes a future task's run method. From there it enters the apache cassandra codebase, as can be read below in class SSTableReader, method batchOpen(). Code snippet:
    public static Collection<SSTableReader> batchOpen(Set<Map.Entry<Descriptor, Set<Component>>> entries,
                                                      final Set<DecoratedKey> savedKeys,
                                                      final DataTracker tracker,
                                                      final CFMetaData metadata,
                                                      final IPartitioner partitioner)
    {
        final Collection<SSTableReader> sstables = new LinkedBlockingQueue<SSTableReader>();

        ExecutorService executor = DebuggableThreadPoolExecutor.createWithPoolSize("SSTableBatchOpen", Runtime.getRuntime().availableProcessors());
        for (final Map.Entry<Descriptor, Set<Component>> entry : entries)
        {
            Runnable runnable = new Runnable()
            {
                public void run()
                {
                    SSTableReader sstable;
                    try
                    {
                        sstable = open(entry.getKey(), entry.getValue(), savedKeys, tracker, metadata, partitioner);
                    }
                    catch (IOException ex)
                    {
                        logger.error("Corrupt sstable " + entry + "; skipped", ex);
                        return;
                    }
                    sstables.add(sstable);
                }
            };
            executor.submit(runnable);
        }

        executor.shutdown();
        try
        {
            executor.awaitTermination(7, TimeUnit.DAYS);
        }
        catch (InterruptedException e)
        {
            throw new AssertionError(e);
        }

        return sstables;
    }

As can be read above, somewhere within the method open() an IOException is thrown, hence the exception above was logged. Two stack frames up, we read that the sstable load() method executes, followed by the ByteBufferUtil.read() method. The method read from class ByteBufferUtil is shown below.
    public static ByteBuffer read(DataInput in, int length) throws IOException
    {
        if (in instanceof FileDataInput)
            return ((FileDataInput) in).readBytes(length);

        byte[] buff = new byte[length];
        in.readFully(buff);
        return ByteBuffer.wrap(buff);
    }

We see that the input is an instance of FileDataInput, and the bytes are read with the given length. Since FileDataInput is an interface, we read that the class implementing this interface here is RandomAccessReader, with the method readBytes as follows.
    public ByteBuffer readBytes(int length) throws IOException
    {
        assert length >= 0 : "buffer length should not be negative: " + length;

        byte[] buff = new byte[length];
        readFully(buff); // reading data buffer

        return ByteBuffer.wrap(buff);
    }
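
As a small aside, the pattern above (allocate a byte array of the requested length, then readFully into it from wherever the file pointer currently sits) behaves like the following minimal sketch against a plain java.io.RandomAccessFile; the path and offset parameters are purely illustrative.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;

    public class ReadFullySketch
    {
        // hypothetical path and offset, purely for illustration
        public static ByteBuffer readAt(String path, long offset, int length) throws IOException
        {
            try (RandomAccessFile file = new RandomAccessFile(path, "r"))
            {
                file.seek(offset);        // position the file pointer
                byte[] buff = new byte[length];
                file.readFully(buff);     // read exactly 'length' bytes from that position
                return ByteBuffer.wrap(buff);
            }
        }
    }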

So to read bytes with a length is to read fully for that length, starting from wherever the current file pointer is pointing. A little way up the stack trace is the method reBuffer():
    /**
     * Read data from file starting from current currentOffset to populate buffer.
     * @throws IOException on any I/O error.
     */
    protected void reBuffer() throws IOException
    {
        resetBuffer();

        if (bufferOffset >= channel.size())
            return;

        channel.position(bufferOffset); // setting channel position

        int read = 0;

        while (read < buffer.length)
        {
            int n = super.read(buffer, read, buffer.length - read);
            if (n < 0)
                break;
            read += n;
        }

        validBufferBytes = read;

        bytesSinceCacheFlush += read;

        if (skipIOCache && bytesSinceCacheFlush >= MAX_BYTES_IN_PAGE_CACHE)
        {
            // with random I/O we can't control what we are skipping so
            // it will be more appropriate to just skip a whole file after
            // we reach threshold
            CLibrary.trySkipCache(this.fd, 0, 0);
            bytesSinceCacheFlush = 0;
        }
    }

This method calls the superclass to read another chunk into the buffer. In the parent class RandomAccessFile, method readBytes():
    /**
     * Reads a sub array as a sequence of bytes.
     * @param b the buffer into which the data is read.
     * @param off the start offset of the data.
     * @param len the number of bytes to read.
     * @exception IOException If an I/O error has occurred.
     */
    private int readBytes(byte b[], int off, int len) throws IOException {
        Object traceContext = IoTrace.fileReadBegin(path);
        int bytesRead = 0;
        try {
            bytesRead = readBytes0(b, off, len);
        } finally {
            IoTrace.fileReadEnd(traceContext, bytesRead == -1 ? 0 : bytesRead);
        }
        return bytesRead;
    }

    private native int readBytes0(byte b[], int off, int len) throws IOException;

.. and we are at the end of this path. It turns out that the call to readBytes0 threw the exception; the lower layer, native non-java call is what raised the IOException. You can use nodetool scrub to see if that fixes the problem, but what I did was basically wipe the data directory for cassandra and rebuild the node. After that I did not see this message anymore.

That's it for this article. If you want to improve it and/or comment, please leave your input below.

Monday, April 21, 2014

Enable or disable sstable compression?

In cassandra 2.0.6, there are a few compression options for sstables; the default is LZ4Compressor. There are others, such as DeflateCompressor and SnappyCompressor, or you can choose not to compress the sstables at all.

You can read more about compression in the official documentation, found here.

In this blog post, I will create two scenarios: the first with compression enabled and the second without compression. This is the only difference between the two scenarios.

So I have created 50 thousand insert statements in CQL and inserted them by feeding the file to cqlsh (a throwaway generator sketch follows the schema and indexes below). First, the schema below uses LZ4Compressor compression; for the no-compression scenario, leave the value for the key sstable_compression empty.
CREATE TABLE users (
    user_id text,
    age int,
    first text,
    last text,
    middle text,
    PRIMARY KEY (user_id)
) WITH
    bloom_filter_fp_chance=0.010000 AND
    caching='KEYS_ONLY' AND
    comment='storing user data' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=864000 AND
    index_interval=128 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    default_time_to_live=0 AND
    speculative_retry='99.0PERCENTILE' AND
    memtable_flush_period_in_ms=0 AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX idxAge ON users (age);

CREATE INDEX idxLast ON users (last);
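
For reference, data.cql is just 50,000 plain INSERT statements against this users table; a throwaway generator along these lines is enough to reproduce it (the column values here are made up for illustration).

    import java.io.IOException;
    import java.io.PrintWriter;

    public class GenerateDataCql
    {
        public static void main(String[] args) throws IOException
        {
            // write 50,000 insert statements, one per line, to feed into cqlsh -f data.cql
            try (PrintWriter out = new PrintWriter("data.cql"))
            {
                for (int i = 0; i < 50000; i++)
                {
                    out.printf("INSERT INTO users (user_id, age, first, last, middle) "
                               + "VALUES ('%d', %d, 'first%d', 'last%d', 'm');%n",
                               i, 20 + (i % 60), i, i);
                }
            }
        }
    }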

jason@localhost:~$ wc -l data.cql
50000 data.cql
jason@localhost:~$ cqlsh 192.168.0.2 9160 -k jw_schema1 -f data.cql
jason@localhost:~$

So it looks good; we have a total of 50 thousand rows.
cqlsh:jw_schema1> select count(*) from users limit 100000;

count
-------
50000

(1 rows)

cqlsh:jw_schema1>

I ran nodetool repair, flush, cleanup and then compact. With compression enabled, the sstable count is only 1 and the total file size in this directory is about 4.5MB.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l
total 4576
-rw-r--r-- 1 cassandra cassandra 179 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 599421 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 179 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 598579 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 680 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 971 Apr 15 21:02 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 1 cassandra cassandra 2387391 Apr 15 21:02 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 62512 Apr 15 21:02 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 938894 Apr 15 21:02 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:02 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 6615 Apr 15 21:02 jw_schema1-users-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:02 jw_schema1-users-jb-1-TOC.txt
drwxr-xr-x 2 cassandra cassandra 4096 Apr 15 20:57 snapshots
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

Now, without compression, the total file size is about 11MB. Notice that the size is almost double and the sstable count is two.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l
total 10860
-rw-r--r-- 1 cassandra cassandra 48 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 687656 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 78 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 32 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 455238 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 78 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 136 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 1800 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4393 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 68 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxAge-jb-2-TOC.txt
-rw-r--r-- 1 cassandra cassandra 48 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 685677 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 425 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4392 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 32 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 453259 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 16 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 287 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4393 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users.idxLast-jb-2-TOC.txt
-rw-r--r-- 1 cassandra cassandra 288 Apr 15 21:23 jw_schema1-users-jb-1-CRC.db
-rw-r--r-- 1 cassandra cassandra 4612770 Apr 15 21:23 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:23 jw_schema1-users-jb-1-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 37880 Apr 15 21:23 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 cassandra cassandra 564480 Apr 15 21:23 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:23 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 cassandra cassandra 3984 Apr 15 21:23 jw_schema1-users-jb-1-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:23 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 1 cassandra cassandra 192 Apr 15 21:24 jw_schema1-users-jb-2-CRC.db
-rw-r--r-- 1 cassandra cassandra 3015018 Apr 15 21:24 jw_schema1-users-jb-2-Data.db
-rw-r--r-- 1 cassandra cassandra 71 Apr 15 21:24 jw_schema1-users-jb-2-Digest.sha1
-rw-r--r-- 1 cassandra cassandra 24648 Apr 15 21:24 jw_schema1-users-jb-2-Filter.db
-rw-r--r-- 1 cassandra cassandra 374414 Apr 15 21:24 jw_schema1-users-jb-2-Index.db
-rw-r--r-- 1 cassandra cassandra 4391 Apr 15 21:24 jw_schema1-users-jb-2-Statistics.db
-rw-r--r-- 1 cassandra cassandra 2672 Apr 15 21:24 jw_schema1-users-jb-2-Summary.db
-rw-r--r-- 1 cassandra cassandra 79 Apr 15 21:24 jw_schema1-users-jb-2-TOC.txt

With the current hardware setup, which is under load, and with sstable compression enabled, the request at times hits the rpc timeout but at other times the result is returned. Without compression on the sstables, however, all the requests executed hit the timeout. Below is the query performed via cqlsh.
cqlsh:jw_schema1> select * from users where age > 95 and last = 'smith' allow filtering;
Request did not complete within rpc_timeout.

Apparently, enabling compression does improve read speed and saves disk space.

Saturday, April 12, 2014

Learn and play with cassandra 2.0.6 snapshot and restore

Snapshots in cassandra appeared as early as cassandra version 0.4.0 beta. Today, we are going to learn about cassandra snapshots. Note that if you run snapshot on a node in a cluster, it only snapshots that node. If you want to snapshot all the nodes in a cluster, it is much more efficient to use a parallel ssh tool such as clusterssh or pssh.

Fundamentally, when a snapshot is executed, the current sstables are linked into a snapshot directory (they are hard links to the same files, which is why the link count shows as 2 in the listing further down). Be aware that on a node with a huge load this can eventually require up to two times the disk space of that server once the live sstables diverge from the snapshot, and it may spike the I/O activity on that node too if a large number of sstables is being snapshot.
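
Conceptually, a snapshot entry is just another directory entry pointing at the same on-disk file. A minimal sketch of that idea in java.nio terms is below; the paths come from this example and the call is purely illustrative, not what cassandra literally runs.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class HardLinkSketch
    {
        public static void main(String[] args) throws IOException
        {
            // illustrative paths based on this example's keyspace and table
            Path live = Paths.get("/var/lib/cassandra/data/jw_schema1/users/jw_schema1-users-jb-1-Data.db");
            Path snapshot = Paths.get("/var/lib/cassandra/data/jw_schema1/users/snapshots/1397292720524/jw_schema1-users-jb-1-Data.db");

            Files.createDirectories(snapshot.getParent());
            Files.createLink(snapshot, live); // a new directory entry for the same underlying file blocks
        }
    }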

Let's get down to the work.

First, ensure that the table (column family) has at least some data.
cqlsh:jw_schema1> select * from users;

user_id | age | first | last | middle
---------+-----+-------+----------+--------
3 | 34 | john | smith | a
2 | 35 | olee | smith | b
1 | 33 | dan | bar | c

(3 rows)

Then take a snapshot. For instance, I only take a snapshot of this keyspace, jw_schema1, and table users. What happens is that cassandra will flush the data to sstables before the snapshot is taken. For options such as giving the snapshot a meaningful name, check out nodetool help.
jason@localhost:~$ nodetool -h localhost snapshot jw_schema1 -cf users
Requested creating snapshot for: jw_schema1 and table: users
Snapshot directory: 1397292720524

The snapshot made will be stored under the <data_file_directories> that you set in the cassandra.yaml file. So for instance,
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls -l snapshots/1397292720524/
total 96K
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 54 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 76 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 68 Apr 12 16:52 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 58 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 87 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 75 Apr 12 16:52 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 2 cassandra cassandra 16 Apr 12 16:52 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 2 cassandra cassandra 45 Apr 12 16:52 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 2 cassandra cassandra 206 Apr 12 16:52 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 2 cassandra cassandra 43 Apr 12 16:52 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 2 cassandra cassandra 4.3K Apr 12 16:52 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 2 cassandra cassandra 79 Apr 12 16:52 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 2 cassandra cassandra 59 Apr 12 16:52 jw_schema1-users-jb-1-Summary.db

If you md5sum the data files of the snapshot and the live data, they match identically.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ md5sum snapshots/1397292720524/*Data*
3d4351d714500417c74de6811b1eae3b snapshots/1397292720524/jw_schema1-users.idxAge-jb-1-Data.db
a430a2d65c0a504fe3ab06344654a89a snapshots/1397292720524/jw_schema1-users.idxLast-jb-1-Data.db
13798e1ffb5ed6a871d768399f54b125 snapshots/1397292720524/jw_schema1-users-jb-1-Data.db
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ md5sum *Data*
3d4351d714500417c74de6811b1eae3b jw_schema1-users.idxAge-jb-1-Data.db
a430a2d65c0a504fe3ab06344654a89a jw_schema1-users.idxLast-jb-1-Data.db
13798e1ffb5ed6a871d768399f54b125 jw_schema1-users-jb-1-Data.db

A snapshot is not meaningful if you cannot restore it back to the node. So from this point onward, we will take a look at how to restore the snapshot back into the node.

Surprisingly, to restore you would expect a command such as nodetool restore backup, but there is none. Rather, there are a few ways to restore the given snapshot sstables.

  1. You can use sstableloader.

  2. Copy the sstables into <data_file_directories>/data/jw_schema1/users/ and refresh by calling loadNewSSTables via jconsole or using nodetool refresh (a rough JMX sketch for this option follows this list).

  3. Use a node restart method.
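
For option 2, if you would rather script the JMX call than click through jconsole, a rough sketch with a plain JMX client might look like the following; the host, port and mbean object name are assumptions for this example (7199 is the usual cassandra JMX port) and are worth verifying in jconsole first.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class RefreshSketch
    {
        public static void main(String[] args) throws Exception
        {
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://192.168.0.2:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url))
            {
                MBeanServerConnection connection = connector.getMBeanServerConnection();
                // object name as I recall it for the 2.0.x ColumnFamilyStore mbean; verify in jconsole
                ObjectName cfs = new ObjectName("org.apache.cassandra.db:type=ColumnFamilies,keyspace=jw_schema1,columnfamily=users");
                connection.invoke(cfs, "loadNewSSTables", new Object[0], new String[0]);
            }
        }
    }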


It sounds like a lot of work to use either of the first two methods, so I am just going to try the last method of restoring the snapshot sstables.

In order to make sure our backup can really be restored, we are going to simulate a few failures (disk failure, accidental delete) here.

  1. Copy the snapshot backup somewhere else.

  2. Shut down cassandra and delete the cassandra data directories.


Okay, let's continue setting up the simulation environment.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ cp -r snapshots/ ~/cassandra/
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo /etc/init.d/cassandra stop
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ cd /var/lib/cassandra/commitlog/
jason@localhost:/var/lib/cassandra/commitlog$ ls
total 2.6M
-rw-r--r-- 1 cassandra cassandra 32M Apr 11 18:52 CommitLog-3-1397213531634.log
-rw-r--r-- 1 cassandra cassandra 32M Apr 12 18:38 CommitLog-3-1397213531633.log
jason@localhost:/var/lib/cassandra/commitlog$ sudo rm -rf *
jason@localhost:/var/lib/cassandra/commitlog$ cd ../data/jw_schema1/users/
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo rm -rf *
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$

So we have copied the snapshot to a cassandra directory under the home directory, stopped cassandra, and removed all commitlogs and the table users in keyspace jw_schema1. Note that in this case the schema for table users still exists, as the schema is stored in the system keyspace.

Now we will copy the snapshot from the home directory back into the cassandra data directory.
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ sudo cp -r ~/cassandra/snapshots/1397292720524/jw_schema1-users* .
jason@localhost:/var/lib/cassandra/data/jw_schema1/users$ ls
total 96K
-rw-r--r-- 1 root root 76 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Data.db
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Filter.db
-rw-r--r-- 1 root root 54 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Index.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Statistics.db
-rw-r--r-- 1 root root 68 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-Summary.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users.idxAge-jb-1-TOC.txt
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 87 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Data.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Filter.db
-rw-r--r-- 1 root root 58 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Index.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Statistics.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-TOC.txt
-rw-r--r-- 1 root root 75 Apr 12 18:50 jw_schema1-users.idxLast-jb-1-Summary.db
-rw-r--r-- 1 root root 43 Apr 12 18:50 jw_schema1-users-jb-1-CompressionInfo.db
-rw-r--r-- 1 root root 206 Apr 12 18:50 jw_schema1-users-jb-1-Data.db
-rw-r--r-- 1 root root 4.3K Apr 12 18:50 jw_schema1-users-jb-1-Statistics.db
-rw-r--r-- 1 root root 45 Apr 12 18:50 jw_schema1-users-jb-1-Index.db
-rw-r--r-- 1 root root 16 Apr 12 18:50 jw_schema1-users-jb-1-Filter.db
-rw-r--r-- 1 root root 79 Apr 12 18:50 jw_schema1-users-jb-1-TOC.txt
-rw-r--r-- 1 root root 59 Apr 12 18:50 jw_schema1-users-jb-1-Summary.db

So far it looks good. Now, if you tail the cassandra system.log and start cassandra, notice that the sstables are being read. If data that is supposed to be owned by this node was missed during the downtime, you should now run nodetool repair to make sure the data is in sync.
 INFO [main] 2014-04-12 18:52:32,555 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,568 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users-jb-1 (206 bytes)
INFO [main] 2014-04-12 18:52:32,701 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users.idxLast
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,719 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users.idxLast-jb-1 (87 bytes)
INFO [main] 2014-04-12 18:52:32,802 ColumnFamilyStore.java (line 254) Initializing jw_schema1.users.idxAge
INFO [SSTableBatchOpen:1] 2014-04-12 18:52:32,810 SSTableReader.java (line 223) Opening /var/lib/cassandra/data/jw_schema1/users/jw_schema1-users.idxAge-jb-1 (76 bytes)

jason@localhost:~/$ nodetool -h localhost repair jw_schema1 users
[2014-04-12 18:59:57,477] Starting repair command #1, repairing 1280 ranges for keyspace jw_schema1
..
[2014-04-12 19:00:50,800] Repair command #1 finished

Now we will check whether our data is still there.
jason@localhost:~/$ cqlsh 192.168.0.2 9160 -k jw_schema1
Connected to just4fun at 192.168.0.2:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:jw_schema1> select * from users;

user_id | age | first | last | middle
---------+-----+-------+----------+--------
3 | 34 | john | smith | a
2 | 35 | olee | smith | b
1 | 33 | dan | bar | c
(3 rows)

cqlsh:jw_schema1>

All good. :)

In my humble opinion, because cassandra is built with durability and fault tolerance in mind, snapshots are arguably not really needed. Sure, it is fair to argue that someone could delete data accidentally, but if you can prevent that by blocking it at the front end, you can save a lot of cost in terms of cluster backup and restore maintenance. If you really want to ensure the data is safe, spin up another cluster in another data centre; then the data is guaranteed safe from disaster. But hey, there is no harm in learning a new tool in case you might need it later down the road.

 

Thursday, January 16, 2014

force sstable compaction through jconsole

Happy new year everyone. This is my first article for 2014 and, as a start, it is going to be a short and sweet one. :-)

I was working on a project where I had deleted a column, and I remembered that, by the definition of a tombstone, you need to run a compaction so that the tombstone is removed. When I ran nodetool compact, the message below appeared.
INFO 19:15:09,170 Nothing to compact in index1.  Use forceUserDefinedCompaction if you wish to force compaction of single sstables (e.g. for tombstone collection)

What this means is that you need jconsole running, because forceUserDefinedCompaction can only be invoked through JMX, for example via jconsole. When jconsole is connected to your cassandra daemon process, you need to navigate to the CompactionManager mbean.

Then you need to provide two values to this method, as it expects two parameters: the keyspace and the sstables you wish to compact. You can see how I did it in the attachment, and a rough JMX sketch of the same call follows.
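
For those who prefer code over clicking around jconsole, the same invocation through a plain JMX client might look roughly like this; the host, port and data file name are assumptions for illustration, and the two-string signature matches the two parameters described above (newer cassandra versions take only the data file argument).

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class UserDefinedCompactionSketch
    {
        public static void main(String[] args) throws Exception
        {
            JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url))
            {
                MBeanServerConnection connection = connector.getMBeanServerConnection();
                ObjectName compactionManager = new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                // keyspace plus a comma separated list of data files; the file name here is hypothetical
                connection.invoke(compactionManager,
                                  "forceUserDefinedCompaction",
                                  new Object[] { "lucene1", "index1-hc-2-Data.db" },
                                  new String[] { String.class.getName(), String.class.getName() });
            }
        }
    }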



Once the operation is complete, you can navigate to the data directory where the sstables are stored; if you ls that directory, a new sstable should have been generated.

$ ls data/lucene1/
index1-hc-3-Data.db index1-hc-3-Digest.sha1 index1-hc-3-Filter.db index1-hc-3-Index.db index1-hc-3-Statistics.db snapshots

We have a production machine where a single sstable is more than 25GB in size. When the compaction thresholds are met (default min 4 and max 32), I think a compaction involving it will load the server when it takes place, so in this situation it is probably best to run forceUserDefinedCompaction on this single large sstable.