Information Technology Blogs: October 2020

I recently read a very good book on Apache Cassandra 4.0, available here. I recommend it whether you are new to Cassandra or have used it before (since version 1.0) and would like to know what has changed since then. Below are the important points that I think will help me when using Cassandra 4.0 in the future.

---

https://github.com/jeffreyscarpenter/cassandra-guide

Cassandra versions from 3.0 onward require a Java 8 JVM or later, preferably the latest stable version. It has been tested on both the OpenJDK and Oracle’s JDK. Cassandra 4.0 has been compiled and tested against both Java 8 and Java 11. You can check your installed Java version by opening a command prompt and executing java -version.

The committers work hard to ensure that data is readable from one minor dot release to the next and from one major version to the next. The commit log, however, needs to be completely cleared out from version to version (even minor versions).

If you have any previous versions of Cassandra installed, you may want to clear out the data directories for now, just to get up and running. If you’ve messed up your Cassandra installation and want to get started cleanly again, you can delete the data folders.

If you’ve used Cassandra in releases prior to 3.0, you may also be familiar with the command-line client interface known as cassandra-cli. The CLI was removed in the 3.0 release because it depends on the legacy Thrift API, which was deprecated in 3.0 and removed entirely in 4.0.

Cassandra uses a special type of primary key called a composite key (or compound key) to represent groups of related rows, also called partitions. The composite key consists of a partition key, plus an optional set of clustering columns. The partition key is used to determine the nodes on which rows are stored and can itself consist of multiple columns. The clustering columns are used to control how data is sorted for storage within a partition. Cassandra also supports an additional construct called a static column, which is for storing data that is not part of the primary key but is shared by every row in a partition.
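
To make the partition key and clustering column roles concrete, here is a small Python sketch of my own (not from the book). It groups rows by partition key and sorts them by clustering column within each partition. The table layout, column names (hotel_id, room_number, rate), and the helper group_into_partitions are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rows for an illustrative table whose primary key is
# ((hotel_id), room_number): hotel_id is the partition key and
# room_number is the single clustering column.
rows = [
    {"hotel_id": "H1", "room_number": 102, "rate": 120},
    {"hotel_id": "H2", "room_number": 7, "rate": 80},
    {"hotel_id": "H1", "room_number": 101, "rate": 95},
]


def group_into_partitions(rows, partition_key, clustering_columns):
    """Group rows by partition key, then sort each group by clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in partition_key)
        partitions[key].append(row)
    for key in partitions:
        partitions[key].sort(
            key=lambda row: tuple(row[column] for column in clustering_columns)
        )
    return dict(partitions)


partitions = group_into_partitions(rows, ["hotel_id"], ["room_number"])
for key, partition_rows in partitions.items():
    print(key, [row["room_number"] for row in partition_rows])
```

All rows for one hotel land in the same partition, and within it they are kept in room_number order, which is the order a query on that partition would read them back in.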

Insert, Update, and Upsert

Because Cassandra uses an append model, there is no fundamental difference between the insert and update operations. If you insert a row that has the same primary key as an existing row, the row is replaced. If you update a row and the primary key does not exist, Cassandra creates it.
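
The upsert behavior can be pictured with a plain dictionary keyed by primary key. This is only my own mental model, not how Cassandra is implemented: the function name write and the sample columns are hypothetical, and I am assuming that a write overwrites just the columns it supplies.

```python
def write(table, primary_key, **columns):
    """Model both INSERT and UPDATE as the same operation (an upsert).

    Assumption for this sketch: supplied columns overwrite existing values
    and the row is created if the primary key is not present yet.
    """
    row = table.setdefault(primary_key, {})
    row.update(columns)
    return row


table = {}
write(table, ("H1", 101), rate=95)     # "insert": the key does not exist yet
write(table, ("H1", 101), rate=110)    # "insert" again: the value is overwritten
write(table, ("H1", 102), rate=120)    # "update" of a missing key creates the row
print(table)
```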

Remember that TTL is stored on a per-column level for non-primary key columns. There is currently no mechanism for setting TTL at a row level directly after the initial insert; you would instead need to reinsert the row, taking advantage of Cassandra’s upsert behavior. As with the timestamp, there is no way to obtain or set the TTL value of a primary key column, and the TTL can only be set for a column when you provide a value for the column.

Primary Keys Are Forever

After you create a table, there is no way to modify the primary key, because this controls how data is distributed within the cluster, and even more importantly, how it is stored on disk.

Server-Side Denormalization with Materialized Views

Historically, denormalization in Cassandra has required designing and managing multiple tables using techniques we will introduce momentarily. Beginning with the 3.0 release, Cassandra provides an experimental feature known as materialized views which allows you to create multiple denormalized views of data based on a base table design. Cassandra manages materialized views on the server, including the work of keeping the views in sync with the table.

A key goal as you begin creating data models in Cassandra is to minimize the number of partitions that must be searched in order to satisfy a given query. Because the partition is a unit of storage that does not get divided across nodes, a query that searches a single partition will typically yield the best performance.

The CQL SELECT statement does support ORDER BY semantics, but only in the order specified by the clustering columns (ascending or descending).

The Importance of Primary Keys in Cassandra

The design of the primary key is extremely important, as it will determine how much data will be stored in each partition and how that data is organized on disk, which in turn will affect how quickly Cassandra processes read queries.

The queue anti-pattern serves as a reminder that any design that relies on the deletion of data is potentially a poorly performing design.

A rack is a logical set of nodes in close proximity to each other, perhaps on physical machines in a single rack of equipment.

A data center is a logical set of racks, perhaps located in the same building and connected by a reliable network.

The replication factor is set per keyspace. The consistency level is specified per query, by the client. The replication factor indicates how many nodes you want to use to store a value during each write operation. The consistency level specifies how many nodes the client has decided must respond in order to feel confident of a successful read or write operation. The confusion arises because the consistency level is based on the replication factor, not on the number of nodes in the system.
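
A short Python sketch of my own shows why the consistency level depends on the replication factor and not on the cluster size. It uses the usual majority formula for QUORUM (half the replication factor, rounded down, plus one); the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_responses(consistency_level: str, replication_factor: int) -> int:
    """How many replicas must respond for a given consistency level."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[consistency_level]


# The number of nodes in the cluster never appears in the calculation:
# only the replication factor of the keyspace matters.
for rf in (1, 3, 5):
    print(rf, required_responses("QUORUM", rf), required_responses("ALL", rf))
```

With a replication factor of 3, QUORUM needs 2 responses whether the cluster has 3 nodes or 300.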

Since the 2.0 release, Cassandra supports a lightweight transaction (LWT) mechanism that provides linearizable consistency.

The basic Paxos algorithm consists of two stages: prepare/promise and propose/accept.

In early implementations of Cassandra, memtables were stored on the JVM heap, but improvements starting with the 2.1 release have moved some memtable data to native memory, with configuration options to specify the amount of on-heap and native memory available.

The counter cache was added in the 2.1 release to improve counter performance by reducing lock contention for the most frequently accessed counters.

One interesting feature of compaction relates to its intersection with incremental repair. A feature called anticompaction was added in 2.1.

Users with prior experience may recall that Cassandra exposes an administrative operation called major compaction (also known as full compaction) that consolidates multiple SSTables into a single SSTable. While this feature is still available, the utility of performing a major compaction has been greatly reduced over time. In fact, usage is actually discouraged in production environments, as it tends to limit Cassandra’s ability to remove stale data.

Traditionally, SSTables have been streamed one partition at a time. The Cassandra 4.0 release introduced a zero-copy streaming feature to stream SSTables in their entirety using zero-copying APIs of the host operating system. These APIs allow files to be transferred over the network without first copying them into the CPU. This feature is enabled by default and has been estimated to improve streaming speed by a factor of 5.

The system_traces keyspace was added in 1.2 to support request tracing. The system_auth and system_distributed keyspaces were added in 2.2 to support role-based access control (RBAC) and persistence of repair data, respectively. Tables related to schema definition were migrated from system to the system_schema keyspace in 3.0.

Hinted handoffs have traditionally been stored in the system.hints table. As thoughtful developers have noted, the fact that hints are really messages to be kept for a short time and deleted means this usage is really an instance of the well-known anti-pattern of using Cassandra as a queue, which is discussed in Chapter 5. Hint storage was moved to flat files in the 3.0 release.

Because Cassandra partitions data across multiple nodes, each node must maintain its own copy of a secondary index based on the data stored in partitions it owns. For this reason, queries involving a secondary index typically involve more nodes, making them significantly more expensive.

Secondary indexes are not recommended for several specific cases:

• Columns with high cardinality. For example, indexing on the hotel.address column could be very expensive, as the vast majority of addresses are unique.

• Columns with very low data cardinality. For example, it would make little sense to index on the user.title column (from the user table in Chapter 4) in order to support a query for every “Mrs.” in the user table, as this would result in a massive row in the index.

• Columns that are frequently updated or deleted. Indexes built on these columns can generate errors if the amount of deleted data (tombstones) builds up more quickly than the compaction process can handle.

Elimination of the Cluster Object

Previous versions of DataStax drivers supported the concept of a Cluster object used to create Session objects. Recent driver versions (for example, the 4.0 Java driver and later) have combined Cluster and Session into CqlSession.

Because a CqlSession maintains TCP connections to multiple nodes, it is a relatively heavyweight object. In most cases, you’ll want to create a single CqlSession and reuse it throughout your application, rather than continually building up and tearing down CqlSessions. Another acceptable option is to create a CqlSession per keyspace, if your application is accessing multiple keyspaces.

The write path begins when a client initiates a write query to a Cassandra node which serves as the coordinator for this request. The coordinator node uses the partitioner to identify which nodes in the cluster are replicas, according to the replication factor for the keyspace. The coordinator node may itself be a replica, especially if the client is using a token-aware load balancing policy. If the coordinator knows that there are not enough replicas up to satisfy the requested consistency level, it returns an error immediately.
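
The first steps of the write path can be sketched roughly as follows. This is my own simplified illustration, not Cassandra's code: it assumes one token per node (no vnodes), uses MD5 as a stand-in for the real partitioner, places replicas by simply walking the ring clockwise, and all names (token_for, build_ring, replicas_for, check_replicas_available, UnavailableError) are hypothetical.

```python
import hashlib
from bisect import bisect_left


class UnavailableError(Exception):
    """Raised when too few replicas are alive for the requested consistency level."""


def token_for(value: str) -> int:
    """Hash a string to a ring position (MD5 is only an illustrative stand-in)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def build_ring(nodes):
    """Assign one token per node and sort the ring by token."""
    return sorted((token_for(node), node) for node in nodes)


def replicas_for(partition_key, ring, replication_factor):
    """Walk the ring clockwise from the key's token and pick RF distinct nodes."""
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, token_for(partition_key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def check_replicas_available(replicas, live_nodes, required):
    """Fail fast, as the coordinator does, if too few replicas are up."""
    live = [node for node in replicas if node in live_nodes]
    if len(live) < required:
        raise UnavailableError(
            f"need {required} replicas but only {len(live)} are alive"
        )
    return live


nodes = ["node1", "node2", "node3", "node4", "node5"]
ring = build_ring(nodes)
replicas = replicas_for("hotel:AZ123", ring, replication_factor=3)
print(replicas)
print(check_replicas_available(replicas, set(nodes), required=2))
```

If only one of the three replicas were alive, check_replicas_available would raise before any write request is sent, which mirrors the immediate error described above.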

Next, the coordinator node sends simultaneous write requests to all local replicas for the data being written. If the cluster spans multiple data centers, the local coordinator node selects a remote coordinator in each of the other data centers to forward the write to the replicas in that data center. Each of the remote replicas acknowledges the write directly to the original coordinator node.

The DataStax drivers do not provide separate mechanisms for counter batches. Instead, you must simply remember to create batches that include only counter modifications or only non-counter modifications.

A node is considered unresponsive if it does not respond to a query within the time specified by read_request_timeout_in_ms in the configuration file. The default is 5 seconds.

The read repair may be performed either before or after the return to the client. If you are using one of the two stronger consistency levels (QUORUM or ALL), then the read repair happens before data is returned to the client. If the client specifies a weak consistency level (such as ONE), then the read repair is optionally performed in the background after returning to the client. The percentage of reads that result in background repairs for a given table is determined by the read_repair_chance and dclocal_read_repair_chance options for the table (both options were removed in Cassandra 4.0).

The syntax of the WHERE clause involves two rules. First, all elements of the partition key must be identified. Second, a given clustering key may only be restricted if all previous clustering keys are restricted by equality.
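
The two rules can be written down as a small checker. This is only my own sketch of the rules as stated above, not the actual CQL validation logic; the function name validate_where and the columns hotel_id, date, and room_number are hypothetical.

```python
def validate_where(partition_key, clustering_columns, restrictions):
    """Check a WHERE clause against the two rules.

    restrictions maps a column name to the operator applied to it,
    for example {"hotel_id": "=", "date": ">"}.
    Returns a list of human-readable rule violations (empty if valid).
    """
    errors = []

    # Rule 1: every element of the partition key must be identified.
    for column in partition_key:
        if column not in restrictions:
            errors.append(f"partition key column '{column}' is not restricted")

    # Rule 2: a clustering column may be restricted only if all previous
    # clustering columns are restricted by equality.
    all_previous_equal = True
    for column in clustering_columns:
        if column in restrictions and not all_previous_equal:
            errors.append(
                f"clustering column '{column}' is restricted but an earlier "
                "clustering column is not restricted by equality"
            )
        if restrictions.get(column) != "=":
            all_previous_equal = False

    return errors


partition_key = ["hotel_id"]
clustering = ["date", "room_number"]

print(validate_where(partition_key, clustering,
                     {"hotel_id": "=", "date": "=", "room_number": ">"}))
print(validate_where(partition_key, clustering,
                     {"hotel_id": "=", "room_number": "="}))
```

The first call passes both rules; the second skips the date column, so the restriction on room_number is reported.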

While it is possible to change the partitioner on an existing cluster, it’s a complex procedure, and the recommended approach is to migrate data to a new cluster with your preferred partitioner using techniques we discuss in Chapter 15.

Deprecation of Thrift RPC Properties

Historically, Cassandra supported two different client interfaces: the original Thrift API, also known as the Remote Procedure Call (RPC) interface, and the CQL native transport first added in 0.8. For releases through 2.2, both interfaces were supported and enabled by default. Starting with the 3.0 release, Thrift was disabled by default and has been removed entirely as of the 4.0 release. If you’re using an earlier version of Cassandra, know that properties prefixed with rpc generally refer to the Thrift interface.

Timeouts

If you’re building a cluster that spans multiple data centers, it’s a good idea to measure the latency between data centers and tune timeout values in the cassandra.yaml file accordingly.

However, you may wish to reclaim the disk space used by this excess data more quickly to reduce the strain on your cluster. To do this, you can use the nodetool cleanup command. To complete as quickly as possible, you can allocate all compaction threads to the cleanup by adding the -j 0 option. As with the flush command, you can choose to clean up specific keyspaces and tables.

The repair command can be restricted to run in the local data center via the -local option (which you may also specify via the longer form --in-local-dc), or in a named data center via the -dc <name> option (or --in-dc <name>).

Transitioning to Incremental Repair

Incremental repair became the default in the 2.2 release, and you must use the -full option to request a full repair. If you are using a version of Cassandra prior to 2.2, make sure to consult the release documentation for any additional steps to prepare your cluster for incremental repair.

If you’re using the PropertyFileSnitch, you’ll need to add the address of your new node to the properties file on each node and do a rolling restart of the nodes in your cluster. It is recommended that you wait 72 hours before removing the address of the old node to avoid confusing the gossiper.

If the node is down, you’ll have to use the nodetool removenode command instead of decommission. If your cluster uses vnodes, the removenode command causes Cassandra to recalculate new token ranges for the remaining nodes and stream data from current replicas to the new owner of each token range.

Beware the Large Partition

In addition to the nodetool tablehistograms discussed earlier, you can detect large partitions by searching logs for WARN messages that reference “Writing large partition” or “Compacting large partition.” The threshold for warning on compaction of large partitions is set by the compaction_large_partition_warning_threshold_mb property in the cassandra.yaml file.

On the server side, you can configure individual nodes to trace some or all of their queries via the nodetool settraceprobability command. This command takes a number between 0.0 (the default) and 1.0, where 0.0 disables tracing and 1.0 traces every query.

DateTieredCompactionStrategy Deprecated

TWCS replaces the DateTieredCompactionStrategy (DTCS) introduced in the 2.0.11 and 2.1.1 releases, which had similar goals but also some rough edges that made it difficult to use and maintain. DTCS is now considered deprecated as of the 3.8 release. New tables should use TWCS.

| Property name | Default value | Description |
| --- | --- | --- |
| read_request_timeout_in_ms | 5000 (5 seconds) | How long the coordinator waits for read operations to complete |
| range_request_timeout_in_ms | 10000 (10 seconds) | How long the coordinator should wait for range reads to complete |
| write_request_timeout_in_ms | 2000 (2 seconds) | How long the coordinator should wait for writes to complete |
| counter_write_request_timeout_in_ms | 5000 (5 seconds) | How long the coordinator should wait for counter writes to complete |
| cas_contention_timeout_in_ms | 1000 (1 second) | How long a coordinator should continue to retry a lightweight transaction |
| truncate_request_timeout_in_ms | 60000 (1 minute) | How long the coordinator should wait for truncates to complete (including snapshot) |
| streaming_socket_timeout_in_ms | 3600000 (1 hour) | How long a node waits for streaming to complete |
| request_timeout_in_ms | 10000 (10 seconds) | The default timeout for other, miscellaneous operations |

G1GC generally requires fewer tuning decisions; the intended usage is that you need only define the min and max heap size and a pause time goal. A lower pause time will cause GC to occur more frequently.

There has been considerable discussion in the Cassandra community about switching to G1GC as the default. For example, G1GC was originally the default for the Cassandra 3.0 release, but was backed out because it did not perform as well as the CMS for heap sizes smaller than 8 GB. The emerging consensus is that the G1GC performs well without tuning, but the default configuration of ParNew/CMS can result in shorter pauses when properly tuned.

Request throttling

If you’re concerned about a client flooding the cluster with a large number of requests, you can use the Java driver’s request throttling feature to limit the rate of queries to a value you define using configuration options in the advanced.throttler namespace. Queries in excess of the rate are queued until the utilization is back within range. This behavior is mostly transparent from the client perspective, but it is possible to receive a RequestThrottlingException on executing a statement; this indicates that the CqlSession is overloaded and unable to queue the request.
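
The queue-then-reject behavior can be pictured with a tiny throttler of my own. It is not the Java driver's implementation: for simplicity it limits the number of in-flight requests rather than a rate, and the class and method names (SimpleThrottler, ThrottlingError, submit, complete) are hypothetical.

```python
from collections import deque


class ThrottlingError(Exception):
    """Raised when the request can neither start nor be queued."""


class SimpleThrottler:
    def __init__(self, max_in_flight, max_queue_size):
        self.max_in_flight = max_in_flight
        self.max_queue_size = max_queue_size
        self.in_flight = 0
        self.queue = deque()

    def submit(self, request):
        """Return True if the request starts now, False if it was queued."""
        if self.in_flight < self.max_in_flight:
            self.in_flight += 1
            return True
        if len(self.queue) >= self.max_queue_size:
            raise ThrottlingError("throttler queue is full")
        self.queue.append(request)
        return False

    def complete(self):
        """Signal that one request finished; return the next queued request, if any."""
        if self.queue:
            # The freed slot is handed straight to the next queued request.
            return self.queue.popleft()
        self.in_flight -= 1
        return None


throttler = SimpleThrottler(max_in_flight=1, max_queue_size=1)
print(throttler.submit("query-1"))   # starts immediately
print(throttler.submit("query-2"))   # queued behind query-1
try:
    throttler.submit("query-3")      # queue is full, so this one is rejected
except ThrottlingError as error:
    print("rejected:", error)
print(throttler.complete())          # query-1 done, query-2 takes its slot
```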

As of the 4.0 release, Cassandra supports hot reloading of certificates, which enables certificate rotation without downtime. The keystore and truststore settings are reloaded every 10 minutes, or you can force a refresh with the nodetool reloadssl command.

Friday, October 2, 2020

cassandra 4.0 important points