Friday, March 27, 2015

Investigating an Apache Cassandra get_slice assertion error

Today, we will investigate another error from Apache Cassandra. The error, as shown below, appears in the Cassandra log.
ERROR [Thrift:2] 2015-02-11 11:06:10,837 Cassandra.java (line 3041) Internal error processing get_slice
java.lang.AssertionError
at org.apache.cassandra.locator.TokenMetadata.firstTokenIndex(TokenMetadata.java:518)
at org.apache.cassandra.locator.TokenMetadata.firstToken(TokenMetadata.java:532)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getNaturalEndpoints(AbstractReplicationStrategy.java:94)
at org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:1992)
at org.apache.cassandra.service.StorageService.getLiveNaturalEndpoints(StorageService.java:1986)
at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:604)
at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:564)
at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:128)
at org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:283)
at org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:365)
at org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:326)
at org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:3033)
at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The bottom three lines are pretty easy: a thread is run by the thread pool executor. As indicated by the code snippet below, a worker process had trouble processing a request.
try
{
    processor = processorFactory_.getProcessor(client_);
    inputTransport = inputTransportFactory_.getTransport(client_);
    outputTransport = outputTransportFactory_.getTransport(client_);
    inputProtocol = inputProtocolFactory_.getProtocol(inputTransport);
    outputProtocol = outputProtocolFactory_.getProtocol(outputTransport);
    // we check stopped_ first to make sure we're not supposed to be shutting
    // down. this is necessary for graceful shutdown. (but not sufficient,
    // since process() can take arbitrarily long waiting for client input.
    // See comments at the end of serve().)
    while (!stopped_ && processor.process(inputProtocol, outputProtocol))
    {
        inputProtocol = inputProtocolFactory_.getProtocol(inputTransport);
        outputProtocol = outputProtocolFactory_.getProtocol(outputTransport);
    }
}

Skipping a few layers of low-level byte stream processing, we arrive at the class that actually implements the method get_slice. Read the code snippet below.
public List<ColumnOrSuperColumn> get_slice(ByteBuffer key, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
    logger.debug("get_slice");

    state().hasColumnFamilyAccess(column_parent.column_family, Permission.READ);
    return multigetSliceInternal(state().getKeyspace(), Collections.singletonList(key), column_parent, predicate, consistency_level).get(key);
}
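For context, this is the Thrift entry point that a client request ends up in. Below is a minimal sketch of the kind of client call that exercises this get_slice path; the host, port, keyspace, column family and row key here are made-up illustrations, not values taken from the log above.

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class GetSliceSketch
{
    public static void main(String[] args) throws Exception
    {
        // assumed host/port and schema names, adjust to your cluster
        TTransport transport = new TFramedTransport(new TSocket("127.0.0.1", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("my_keyspace");

        // slice the first 100 columns of one row; this request ends up in CassandraServer.get_slice()
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));

        client.get_slice(ByteBuffer.wrap("row-key".getBytes("UTF-8")),
                         new ColumnParent("my_column_family"),
                         predicate,
                         ConsistencyLevel.ONE);

        transport.close();
    }
}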

So we see another method is called, multigetSliceInternal. Read the code snippet below, where a few validations are performed on the data.
private Map<ByteBuffer, List<ColumnOrSuperColumn>> multigetSliceInternal(String keyspace, List<ByteBuffer> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
    CFMetaData metadata = ThriftValidation.validateColumnFamily(keyspace, column_parent.column_family);
    ThriftValidation.validateColumnParent(metadata, column_parent);
    ThriftValidation.validatePredicate(metadata, column_parent, predicate);
    ThriftValidation.validateConsistencyLevel(keyspace, consistency_level);

    List<ReadCommand> commands = new ArrayList<ReadCommand>();
    if (predicate.column_names != null)
    {
        for (ByteBuffer key: keys)
        {
            ThriftValidation.validateKey(metadata, key);
            commands.add(new SliceByNamesReadCommand(keyspace, key, column_parent, predicate.column_names));
        }
    }
    else
    {
        SliceRange range = predicate.slice_range;
        for (ByteBuffer key: keys)
        {
            ThriftValidation.validateKey(metadata, key);
            commands.add(new SliceFromReadCommand(keyspace, key, column_parent, range.start, range.finish, range.reversed, range.count));
        }
    }

    return getSlice(commands, consistency_level);
}
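The branch on predicate.column_names decides which read command is built. As a rough illustration (reusing the Thrift types and imports from the earlier client sketch; the column names are made up), the two predicate shapes a client can send look like this:

// Named-columns predicate -> SliceByNamesReadCommand
SlicePredicate byNames = new SlicePredicate();
byNames.setColumn_names(Arrays.asList(ByteBuffer.wrap("name".getBytes("UTF-8")),
                                      ByteBuffer.wrap("email".getBytes("UTF-8"))));

// Range predicate -> SliceFromReadCommand (empty start/finish means "all columns", capped at count)
SlicePredicate byRange = new SlicePredicate();
byRange.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));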

Then the method getSlice is called, which in turn calls the method readColumnFamily(), as shown in the code snippet below.
protected Map<DecoratedKey, ColumnFamily> readColumnFamily(List<ReadCommand> commands, ConsistencyLevel consistency_level)
throws InvalidRequestException, UnavailableException, TimedOutException
{
    // TODO - Support multiple column families per row, right now row only contains 1 column family
    Map<DecoratedKey, ColumnFamily> columnFamilyKeyMap = new HashMap<DecoratedKey, ColumnFamily>();

    if (consistency_level == ConsistencyLevel.ANY)
    {
        throw new InvalidRequestException("Consistency level any may not be applied to read operations");
    }

    List<Row> rows;
    try
    {
        schedule(DatabaseDescriptor.getRpcTimeout());
        try
        {
            rows = StorageProxy.read(commands, consistency_level);
        }
        finally
        {
            release();
        }
    }
    catch (TimeoutException e)
    {
        logger.debug("... timed out");
        throw new TimedOutException();
    }
    catch (IOException e)
    {
        throw new RuntimeException(e);
    }

    for (Row row: rows)
    {
        columnFamilyKeyMap.put(row.key, row.cf);
    }
    return columnFamilyKeyMap;
}

Another class, StorageProxy, is called to read the row in question; its read method is shown in the code snippet below.
/**
 * Performs the actual reading of a row out of the StorageService, fetching
 * a specific set of column names from a given column family.
 */
public static List<Row> read(List<ReadCommand> commands, ConsistencyLevel consistency_level)
throws IOException, UnavailableException, TimeoutException, InvalidRequestException
{
    if (StorageService.instance.isBootstrapMode())
        throw new UnavailableException();
    long startTime = System.nanoTime();
    List<Row> rows;
    try
    {
        rows = fetchRows(commands, consistency_level);
    }
    finally
    {
        readStats.addNano(System.nanoTime() - startTime);
    }
    return rows;
}

The exception leads this investigation to the fetching of the rows, in the method fetchRows of the same class; code snippet below.
/**
 * This function executes local and remote reads, and blocks for the results:
 *
 * 1. Get the replica locations, sorted by response time according to the snitch
 * 2. Send a data request to the closest replica, and digest requests to either
 *    a) all the replicas, if read repair is enabled
 *    b) the closest R-1 replicas, where R is the number required to satisfy the ConsistencyLevel
 * 3. Wait for a response from R replicas
 * 4. If the digests (if any) match the data return the data
 * 5. else carry out read repair by getting data from all the nodes.
 */
private static List<Row> fetchRows(List<ReadCommand> initialCommands, ConsistencyLevel consistency_level) throws IOException, UnavailableException, TimeoutException
{
    List<Row> rows = new ArrayList<Row>(initialCommands.size());
    List<ReadCommand> commandsToRetry = Collections.emptyList();

    do
    {
        List<ReadCommand> commands = commandsToRetry.isEmpty() ? initialCommands : commandsToRetry;
        ReadCallback<Row>[] readCallbacks = new ReadCallback[commands.size()];

        if (!commandsToRetry.isEmpty())
            logger.debug("Retrying {} commands", commandsToRetry.size());

        // send out read requests
        for (int i = 0; i < commands.size(); i++)
        {
            ReadCommand command = commands.get(i);
            assert !command.isDigestQuery();
            logger.debug("Command/ConsistencyLevel is {}/{}", command, consistency_level);

            List<InetAddress> endpoints = StorageService.instance.getLiveNaturalEndpoints(command.table, command.key);
            DatabaseDescriptor.getEndpointSnitch().sortByProximity(FBUtilities.getBroadcastAddress(), endpoints);

            RowDigestResolver resolver = new RowDigestResolver(command.table, command.key);
            ReadCallback<Row> handler = getReadCallback(resolver, command, consistency_level, endpoints);
            handler.assureSufficientLiveNodes();
            assert !handler.endpoints.isEmpty();
            readCallbacks[i] = handler;

            // The data-request message is sent to dataPoint, the node that will actually get the data for us
            InetAddress dataPoint = handler.endpoints.get(0);
            if (dataPoint.equals(FBUtilities.getBroadcastAddress()) && OPTIMIZE_LOCAL_REQUESTS)
            {
                logger.debug("reading data locally");
                StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(command, handler));
            }
            else
            {
                logger.debug("reading data from {}", dataPoint);
                MessagingService.instance().sendRR(command, dataPoint, handler);
            }

            if (handler.endpoints.size() == 1)
                continue;

            // send the other endpoints a digest request
            ReadCommand digestCommand = command.copy();
            digestCommand.setDigestQuery(true);
            MessageProducer producer = null;
            for (InetAddress digestPoint : handler.endpoints.subList(1, handler.endpoints.size()))
            {
                if (digestPoint.equals(FBUtilities.getBroadcastAddress()))
                {
                    logger.debug("reading digest locally");
                    StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(digestCommand, handler));
                }
                else
                {
                    logger.debug("reading digest from {}", digestPoint);
                    // (We lazy-construct the digest Message object since it may not be necessary if we
                    // are doing a local digest read, or no digest reads at all.)
                    if (producer == null)
                        producer = new CachingMessageProducer(digestCommand);
                    MessagingService.instance().sendRR(producer, digestPoint, handler);
                }
            }
        }

        // read results and make a second pass for any digest mismatches
        List<ReadCommand> repairCommands = null;
        List<RepairCallback> repairResponseHandlers = null;
        for (int i = 0; i < commands.size(); i++)
        {
            ReadCallback<Row> handler = readCallbacks[i];
            ReadCommand command = commands.get(i);
            try
            {
                long startTime2 = System.currentTimeMillis();
                Row row = handler.get();
                if (row != null)
                {
                    command.maybeTrim(row);
                    rows.add(row);
                }

                if (logger.isDebugEnabled())
                    logger.debug("Read: " + (System.currentTimeMillis() - startTime2) + " ms.");
            }
            catch (TimeoutException ex)
            {
                if (logger.isDebugEnabled())
                    logger.debug("Read timeout: {}", ex.toString());
                throw ex;
            }
            catch (DigestMismatchException ex)
            {
                if (logger.isDebugEnabled())
                    logger.debug("Digest mismatch: {}", ex.toString());
                RowRepairResolver resolver = new RowRepairResolver(command.table, command.key);
                RepairCallback repairHandler = new RepairCallback(resolver, handler.endpoints);

                if (repairCommands == null)
                {
                    repairCommands = new ArrayList<ReadCommand>();
                    repairResponseHandlers = new ArrayList<RepairCallback>();
                }
                repairCommands.add(command);
                repairResponseHandlers.add(repairHandler);

                MessageProducer producer = new CachingMessageProducer(command);
                for (InetAddress endpoint : handler.endpoints)
                    MessagingService.instance().sendRR(producer, endpoint, repairHandler);
            }
        }

        if (commandsToRetry != Collections.EMPTY_LIST)
            commandsToRetry.clear();

        // read the results for the digest mismatch retries
        if (repairResponseHandlers != null)
        {
            for (int i = 0; i < repairCommands.size(); i++)
            {
                ReadCommand command = repairCommands.get(i);
                RepairCallback handler = repairResponseHandlers.get(i);
                // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
                // behind on writes in case the out-of-sync row is read multiple times in quick succession
                FBUtilities.waitOnFutures(handler.resolver.repairResults, DatabaseDescriptor.getRpcTimeout());

                Row row;
                try
                {
                    row = handler.get();
                }
                catch (DigestMismatchException e)
                {
                    throw new AssertionError(e); // full data requested from each node here, no digests should be sent
                }

                ReadCommand retryCommand = command.maybeGenerateRetryCommand(handler, row);
                if (retryCommand != null)
                {
                    logger.debug("issuing retry for read command");
                    if (commandsToRetry == Collections.EMPTY_LIST)
                        commandsToRetry = new ArrayList<ReadCommand>();
                    commandsToRetry.add(retryCommand);
                    continue;
                }

                if (row != null)
                {
                    command.maybeTrim(row);
                    rows.add(row);
                }
            }
        }
    } while (!commandsToRetry.isEmpty());

    return rows;
}

At this point of the investigation, the documentation of this method, fetchRows, is pretty useful to us.
* This function executes local and remote reads, and blocks for the results:
*
* 1. Get the replica locations, sorted by response time according to the snitch
* 2. Send a data request to the closest replica, and digest requests to either
* a) all the replicas, if read repair is enabled
* b) the closest R-1 replicas, where R is the number required to satisfy the ConsistencyLevel

We see that this method executes reads on the local and remote nodes, and the problem occurs while getting the nodes that are responsible for keeping the row. Let's read the method getLiveNaturalEndpoints(), shown below.
/**
 * This method attempts to return N endpoints that are responsible for storing the
 * specified key i.e for replication.
 *
 * @param key - key for which we need to find the endpoint return value -
 * the endpoint responsible for this key
 */
public List<InetAddress> getLiveNaturalEndpoints(String table, ByteBuffer key)
{
    return getLiveNaturalEndpoints(table, partitioner.getToken(key));
}

public List<InetAddress> getLiveNaturalEndpoints(String table, Token token)
{
    List<InetAddress> liveEps = new ArrayList<InetAddress>();
    List<InetAddress> endpoints = Table.open(table).getReplicationStrategy().getNaturalEndpoints(token);

    for (InetAddress endpoint : endpoints)
    {
        if (FailureDetector.instance.isAlive(endpoint))
            liveEps.add(endpoint);
    }

    return liveEps;
}
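So the row key is first mapped to a token by the partitioner, and only endpoints the failure detector considers alive are returned. A minimal sketch of that first step, assuming the cluster uses the RandomPartitioner (the default in this era of Cassandra; the row key here is made up):

import java.nio.ByteBuffer;
import org.apache.cassandra.dht.RandomPartitioner;
import org.apache.cassandra.dht.Token;

public class TokenSketch
{
    public static void main(String[] args)
    {
        RandomPartitioner partitioner = new RandomPartitioner();
        // the MD5-based token that decides which replicas own this row key
        Token token = partitioner.getToken(ByteBuffer.wrap("row-key".getBytes()));
        System.out.println(token);
    }
}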

A little further up the stack trace is the abstract class AbstractReplicationStrategy:
/**
 * get the (possibly cached) endpoints that should store the given Token
 * Note that while the endpoints are conceptually a Set (no duplicates will be included),
 * we return a List to avoid an extra allocation when sorting by proximity later
 * @param searchToken the token the natural endpoints are requested for
 * @return a copy of the natural endpoints for the given token
 */
public ArrayList<InetAddress> getNaturalEndpoints(Token searchToken)
{
    Token keyToken = TokenMetadata.firstToken(tokenMetadata.sortedTokens(), searchToken);
    ArrayList<InetAddress> endpoints = getCachedEndpoints(keyToken);
    if (endpoints == null)
    {
        TokenMetadata tokenMetadataClone = tokenMetadata.cloneOnlyTokenMap();
        keyToken = TokenMetadata.firstToken(tokenMetadataClone.sortedTokens(), searchToken);
        endpoints = new ArrayList<InetAddress>(calculateNaturalEndpoints(searchToken, tokenMetadataClone));
        cacheEndpoint(keyToken, endpoints);
    }

    return new ArrayList<InetAddress>(endpoints);
}

Somehow the token ring is empty. Below is the class TokenMetadata.java and the code snippet where the assertion is thrown:
public static int firstTokenIndex(final ArrayList ring, Token start, boolean insertMin)
{
    assert ring.size() > 0;
    // insert the minimum token (at index == -1) if we were asked to include it and it isn't a member of the ring
    int i = Collections.binarySearch(ring, start);
    if (i < 0)
    {
        i = (i + 1) * (-1);
        if (i >= ring.size())
            i = insertMin ? -1 : 0;
    }
    return i;
}

public static Token firstToken(final ArrayList<Token> ring, Token start)
{
    return ring.get(firstTokenIndex(ring, start, false));
}
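To see why an empty ring is the only way to hit this assertion, here is a small self-contained sketch of the same binary-search arithmetic, with Long values standing in for real Token objects (the token values are hypothetical, for illustration only):

import java.util.ArrayList;
import java.util.Collections;

public class FirstTokenIndexSketch
{
    // Same arithmetic as TokenMetadata.firstTokenIndex, with Long standing in for Token.
    static int firstTokenIndex(ArrayList<Long> ring, Long start, boolean insertMin)
    {
        assert ring.size() > 0; // the assertion from the stack trace: only an empty ring fails here
        int i = Collections.binarySearch(ring, start);
        if (i < 0)
        {
            i = (i + 1) * (-1); // convert the "insertion point" encoding into the next-highest index
            if (i >= ring.size())
                i = insertMin ? -1 : 0; // wrap around the ring
        }
        return i;
    }

    public static void main(String[] args)
    {
        ArrayList<Long> ring = new ArrayList<Long>();
        Collections.addAll(ring, 10L, 20L, 30L);
        System.out.println(firstTokenIndex(ring, 25L, false)); // 2 -> token 30 owns 25
        System.out.println(firstTokenIndex(ring, 35L, false)); // 0 -> wraps around to token 10

        // An empty ring, e.g. when the token metadata was never populated,
        // trips the assertion (run with -ea to see it).
        firstTokenIndex(new ArrayList<Long>(), 25L, false);
    }
}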

So something went wrong while reading a row's columns: when the natural endpoints for the key are looked up, the sorted token ring is empty and the assertion fires. My guess is that gossip is disabled, so the ring metadata is empty. The solution is to enable gossip and then restart the Cassandra instance.
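As a rough way to check this guess on the affected node (host is a placeholder; adjust to your environment), the standard nodetool commands can show whether the node sees any tokens and whether gossip is running, and can turn gossip back on:

nodetool -h 127.0.0.1 ring          # should list the tokens the node knows about
nodetool -h 127.0.0.1 info          # reports, among other things, whether gossip is active
nodetool -h 127.0.0.1 enablegossip  # re-enable gossip if it was disabled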

If you think this analysis is not accurate or want to provide more information, please do so by commenting below. Thank you.
