Saturday, November 30, 2013

how does read performance gain from compression?

I read the following interesting discussion on the Cassandra mailing list and thought it was a very good explanation, so I would like to share it.

how does read performance gain from compression?
http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html?pagename=docs&version=1.2&file=index#cassandra/dml/dml_about_reads_c.html

Quote from Artur Kronenberg:
The way I understand it, compression gives you the advantage of using far less IO at the cost of a little more CPU. The bottleneck of reads is usually the IO time needed to read the data from disk. As a figure, we had about 25 reads/s when reading from disk, while we get up to 3000 reads/s when all of it is in cache. So good compression reduces the amount you have to read from disk. You may spend a little more time decompressing data, but that data will be in cache anyway, so it won't matter.
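To make that concrete, compression in Cassandra is configured per table. Below is a minimal sketch of enabling SnappyCompressor with CQL3, using the Cassandra 1.2 syntax from the documentation linked above; the table name users is just a hypothetical example:

    -- enable Snappy compression on an existing table (Cassandra 1.2 / CQL3)
    ALTER TABLE users
      WITH compression = { 'sstable_compression' : 'SnappyCompressor' };

    -- compressed SSTables have a smaller on-disk footprint, so more of the
    -- data set fits in the OS page cache and fewer disk reads are needed.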

Quote from Edward Capriolo:
The big * in the explanation: a smaller file footprint leads to better use of the disk cache; however, decompression adds work for the JVM and increases the churn of objects in the JVM. Additionally, the compression block size might be 4 KB while, for some use cases, a small row may be only 200 bytes. This means that internally a large block might have to be decompressed just to get at the row inside it.

In many use cases compression is a performance win, but not necessarily in all cases. In particular, if you are already doing JVM performance tuning to stop garbage collection pauses, enabling compression could make performance worse.
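If your rows are much smaller than the compression block, one knob to try is chunk_length_kb. The sketch below lowers it and also shows how to turn compression off entirely; the table name users is again hypothetical, and the options follow the Cassandra 1.2 documentation linked above:

    -- shrink the compression chunk so a small-row read decompresses less data
    ALTER TABLE users
      WITH compression = { 'sstable_compression' : 'SnappyCompressor',
                           'chunk_length_kb' : 4 };

    -- if GC pressure from decompression is the bigger problem, disable it
    ALTER TABLE users
      WITH compression = { 'sstable_compression' : '' };

Note that existing SSTables keep their old compression settings until they are rewritten (for example by nodetool upgradesstables); only newly flushed or compacted SSTables pick up the change.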
