
Saturday, May 3, 2014

What and why the all time blocked count for cassandra pool FlushWriter keeps increasing

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
FlushWriter                       0         0            941         0                53

If you monitor a cassandra cluster, you may often notice that the FlushWriter pool's all time blocked count keeps increasing while the other pools remain at 0. Is this something we should be concerned about?
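
The output above comes from nodetool tpstats, which simply reads these counters over JMX. Below is a minimal sketch of reading the same counters programmatically; the host, port, MBean name and attribute names (TotalBlockedTasks, CurrentlyBlockedTasks) are assumptions based on Cassandra's JMXEnabledThreadPoolExecutorMBean, so verify them against your version with jconsole.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FlushWriterStats
{
    public static void main(String[] args) throws Exception
    {
        // default Cassandra JMX port; adjust host/port for your node
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // the FlushWriter pool is registered under the "internal" domain
            ObjectName flushWriter = new ObjectName("org.apache.cassandra.internal:type=FlushWriter");
            System.out.println("All time blocked: " + mbs.getAttribute(flushWriter, "TotalBlockedTasks"));
            System.out.println("Currently blocked: " + mbs.getAttribute(flushWriter, "CurrentlyBlockedTasks"));
        }
        finally
        {
            connector.close();
        }
    }
}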

Snippet from class ColumnFamilyStore:
/*
 * maybeSwitchMemtable puts Memtable.getSortedContents on the writer executor. When the write is complete,
 * we turn the writer into an SSTableReader and add it to ssTables_ where it is available for reads.
 *
 * There are two other things that maybeSwitchMemtable does.
 * First, it puts the Memtable into memtablesPendingFlush, where it stays until the flush is complete
 * and it's been added as an SSTableReader to ssTables_. Second, it adds an entry to commitLogUpdater
 * that waits for the flush to complete, then calls onMemtableFlush. This allows multiple flushes
 * to happen simultaneously on multicore systems, while still calling onMF in the correct order,
 * which is necessary for replay in case of a restart since CommitLog assumes that when onMF is
 * called, all data up to the given context has been persisted to SSTables.
 */
private static final ExecutorService flushWriter
        = new JMXEnabledThreadPoolExecutor(DatabaseDescriptor.getFlushWriters(),
                                           StageManager.KEEPALIVE,
                                           TimeUnit.SECONDS,
                                           new LinkedBlockingQueue<Runnable>(DatabaseDescriptor.getFlushQueueSize()),
                                           new NamedThreadFactory("FlushWriter"),
                                           "internal");

Just like other pools such as replicate_on_write, FlushWriter is also an instance of JMXEnabledThreadPoolExecutor, governed by two configuration settings which you can alter in cassandra.yaml (see the example after the list).

  • memtable_flush_writers defaults to the number of data_file_directories specified.

  • memtable_flush_queue_size defaults to 4.
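
For example, a cassandra.yaml fragment might look like this (the values here are purely illustrative, not recommendations):

memtable_flush_writers: 2
memtable_flush_queue_size: 8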


Whenever maybeSwitchMemtable is called, memtable.flushAndSignal() is called within it.

Notice that in Memtable.flushAndSignal(), the flush task is submitted to an ExecutorService, which is ultimately the JMXEnabledThreadPoolExecutor constructed for the FlushWriter pool mentioned above. So whenever a task is rejected because the queue is full, the method rejectedExecution() is triggered, which increments the all time blocked count by one.
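
To make the mechanism concrete, here is a minimal sketch, not Cassandra's actual code, of a thread pool whose rejection handler counts a rejected task and then blocks until queue space frees up. This mirrors how the all time blocked counter grows whenever the flush queue is full:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BlockingCountingExecutor extends ThreadPoolExecutor
{
    private final AtomicLong totalBlocked = new AtomicLong();

    public BlockingCountingExecutor(int flushWriters, int queueSize)
    {
        super(flushWriters, flushWriters, 60, TimeUnit.SECONDS,
              new LinkedBlockingQueue<Runnable>(queueSize));
        setRejectedExecutionHandler(new RejectedExecutionHandler()
        {
            public void rejectedExecution(Runnable task, ThreadPoolExecutor executor)
            {
                // queue is full: bump the "all time blocked" counter...
                totalBlocked.incrementAndGet();
                try
                {
                    // ...then block the caller until queue space frees up
                    executor.getQueue().put(task);
                }
                catch (InterruptedException e)
                {
                    throw new RejectedExecutionException(e);
                }
            }
        });
    }

    public long getTotalBlocked()
    {
        return totalBlocked.get();
    }
}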

So that's it. Hopefully you now have an idea of what the all time blocked count for the pool FlushWriter is and why it increases. A steadily growing count is an indication that you should tune the two configuration parameters above in your cassandra.yaml file.

Last, if you learned something and would like to contribute back, please visit our donation page. Thank you.

Friday, May 2, 2014

How often does cassandra minor compaction run and what triggers it

There are two types of compaction in cassandra: minor compaction and major compaction. Today we are going to look into minor compaction and understand when minor compaction is kickstarted.

Following are description snippets shown when you create a column family using cassandra-cli.
- max_compaction_threshold: The maximum number of SSTables allowed before a
minor compaction is forced. Default is 32, setting to 0 disables minor
compactions.

Decreasing this will cause minor compactions to start more frequently and
be less intensive. The min_compaction_threshold and max_compaction_threshold
boundaries are the number of tables Cassandra attempts to merge together at
once.

- min_compaction_threshold: The minimum number of SSTables needed
to start a minor compaction. Default is 4, setting to 0 disables minor
compactions.

Increasing this will cause minor compactions to start less frequently and
be more intensive. The min_compaction_threshold and max_compaction_threshold
boundaries are the number of tables Cassandra attempts to merge together at
once.

So minor compaction is triggered automatically by cassandra, and major compaction is triggered manually via nodetool compact (for example, nodetool compact <keyspace> <column_family>). But when, and what exactly, triggers minor compaction? That's when we need to trace into the codebase.

Because compaction is performed per column family, minor compaction is triggered from the class ColumnFamilyStore. There are two methods in this class that submit the object to the compaction executor to perform a minor compaction.

Depending on the compaction strategy chosen for the column family (the default is SizeTieredCompactionStrategy, which extends AbstractCompactionStrategy), the superclass starts a single thread to perform this background compaction task. It seems that this optional single-threaded task runs every five minutes, as sketched below.
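
As an illustration only (this is not Cassandra's code), such an optional single-threaded periodic check could be wired up like this, with the five-minute interval taken from the observation above:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BackgroundCompactionChecker
{
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // checkTask would look at each column family and submit it for
    // background compaction if it has eligible sstables
    public void start(Runnable checkTask)
    {
        scheduler.scheduleWithFixedDelay(checkTask, 5, 5, TimeUnit.MINUTES);
    }

    public void stop()
    {
        scheduler.shutdown();
    }
}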

When either of the two methods mentioned fires, the ColumnFamilyStore object is submitted to the background for the single thread to perform compaction.
/**
 * Call this whenever a compaction might be needed on the given columnfamily.
 * It's okay to over-call (within reason) since the compactions are single-threaded,
 * and if a call is unnecessary, it will just be no-oped in the bucketing phase.
 */
public Future<Integer> submitBackground(final ColumnFamilyStore cfs)
{
    Callable<Integer> callable = new Callable<Integer>()
    {
        public Integer call() throws IOException
        {
            compactionLock.readLock().lock();
            try
            {
                if (!cfs.isValid())
                    return 0;

                boolean taskExecuted = false;
                AbstractCompactionStrategy strategy = cfs.getCompactionStrategy();
                List<AbstractCompactionTask> tasks = strategy.getBackgroundTasks(getDefaultGcBefore(cfs));
                for (AbstractCompactionTask task : tasks)
                {
                    if (!task.markSSTablesForCompaction())
                        continue;

                    taskExecuted = true;
                    try
                    {
                        task.execute(executor);
                    }
                    finally
                    {
                        task.unmarkSSTables();
                    }
                }
                // newly created sstables might have made other compactions eligible
                if (taskExecuted)
                    submitBackground(cfs);
            }
            finally
            {
                compactionLock.readLock().unlock();
            }
            return 0;
        }
    };
    return executor.submit(callable);
}

Notice that when the method getBackgroundTasks() is called in submitBackground(), the min_compaction_threshold and max_compaction_threshold which you set on the column family are read to determine whether the min_compaction_threshold condition is met, with max_compaction_threshold capping how many sstables are merged at once.
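
To give a feel for what the bucketing phase does, here is a simplified sketch. It is not Cassandra's actual code (the real SizeTieredCompactionStrategy uses a configurable bucket ratio, among other details), but it shows how the two thresholds gate a minor compaction:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SizeTieredBucketingSketch
{
    // group sstables of similar size into buckets; a bucket only becomes
    // a compaction candidate once it holds at least minThreshold tables,
    // and at most maxThreshold tables are merged in one go
    public static List<List<Long>> candidates(List<Long> sstableSizes, int minThreshold, int maxThreshold)
    {
        Collections.sort(sstableSizes);
        List<List<Long>> buckets = new ArrayList<List<Long>>();
        for (Long size : sstableSizes)
        {
            boolean placed = false;
            for (List<Long> bucket : buckets)
            {
                // "similar size" here means within 50% of the bucket average
                double avg = average(bucket);
                if (size >= avg * 0.5 && size <= avg * 1.5)
                {
                    bucket.add(size);
                    placed = true;
                    break;
                }
            }
            if (!placed)
            {
                List<Long> bucket = new ArrayList<Long>();
                bucket.add(size);
                buckets.add(bucket);
            }
        }

        List<List<Long>> result = new ArrayList<List<Long>>();
        for (List<Long> bucket : buckets)
            if (bucket.size() >= minThreshold)
                result.add(bucket.subList(0, Math.min(bucket.size(), maxThreshold)));
        return result;
    }

    private static double average(List<Long> bucket)
    {
        long sum = 0;
        for (Long size : bucket)
            sum += size;
        return (double) sum / bucket.size();
    }
}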

From experience, I don't know why DataStax does not recommend major compaction via nodetool; maybe because the I/O and heap usage spike and may impair the node's request handling and response times. But for me, when the node load goes beyond around 500GB, there may be some stale data left in the big sstables, so it might not be such a bad idea to kickstart a major compaction if the stale data can be removed and the node load brought down.

Last but not least, if you learned something and would like to contribute back, please go to our donation page.