IMPORTANT NOTE: This post describes SSTable format mc (3.0.8, 3.9). There are multiple versions of SSTable format which can differ significantly.

This file stores metadata for SSTable. There are 4 types of metadata:

Validation metadata - used to validate SSTable correctness
Compaction metadata - used for compaction
Statistics - some information about SSTable which is loaded into memory and used for faster reads/compactions
Serialization header - keeps information about SSTable schema

General structure

The file is composed of two parts. First part is a table of content which allows quick access to a selected metadata. Second part is a sequence of metadata stored one after the other.

Table of content

First 4 bytes represent signed integer which is a number of metadata entries in the second part of the file. There are 8 bytes stored for each entry. First 4 bytes are a signed integer which describes the type of the metadata. Table below presents what types are available.

Type	Integer representation
Validation metadata	0
Compaction metadata	1
Statistics	2
Serialization header	3

Next 4 bytes represent an offset (signed integer), in the file, at which this metadata entry starts.

Validation metadata entry

This entry starts with a UTF8 string encoded using modified UTF-8 encoding. You can read more about this encoding HERE and HERE. This string is a name of partitioner used to create this SSTable. Next 8 bytes are a double that describes the probability of false positive matches in the bloom filter for this SSTable.

Compaction metadata entry

This entry contains a serialized HyperLogLogPlus which can be used to estimate the number of partition keys in the SSTable. If this is not present then the same estimation can be computed using Summary file. First 4 bytes represent a signed integer that is a length L of the serialized form of HyperLogLogPlus. Then there are exactly L bytes describing the HyperLogLogPlus.

Statistics entry

This entry contains some parts of EstimatedHistogram, StreamingHistogram and CommitLogPosition types. Let's have a look at them first.

EstimatedHistogram

EstimatedHistogram starts with signed integer encoded as 4 bytes which describes number of buckets. Each bucket is stored in two signed longs - 8 bytes each. First long is the offset of the previous bucket (first bucket stores offset of itself). Second long is bucket's value. Each bucket represents values from (previous bucket offset, current offset]. Offset for last bucket is +inf.

StreamingHistogram

StreamingHistogram starts with a signed integer encoded as 4 bytes which describes maximum number of bucktes this historgam can have. Another signed integer which is encoded as 4 bytes follows. It describes the actual number of buckets the StreamingHistogram has. For each bucket, there is a double and a signed long. Each takes 8 bytes. Double is an offset of the bucket and long is its value.

CommitLogPosition

CommitLogPosition consists of a signed long and signed integer. The long represents a segmentId while the integer is a position inside that segment.

Whole entry

The entry starts with an EstimatedHistogram of partition sizes. Then it has an EstimatedHistogram of column counts. Next is CommitLogPosition for upper bound of commit log. This is followed by 2 signed longs (8 bytes each) that represent min and max timepstamp in the SSTable. Next are 2 signed integers (4 bytes each) that represent min and max local deletion time in the SSTable. Then there are two signed integers (4 bytes each) that describe min and max TTL in the SSTable. Next is a double (8 bytes) that stores a compression ration of the SSTable. This is followed by a StreamingHistogarm of tombstones in the SSTable. Then the entry has a signed integer (4 bytes) that stores the level of the SSTable (used probably for leveled compaction). Next is a signed long (8 bytes) that represents the time the SSTable was repaired. This is followed by min and max clustering keys. Each is encoded with the signed integer (4 bytes) that describes the number of key blocks and the blocks itself. A clustering key block is encoded using unsigned short length L (2 bytes) and L bytes that follow it. Then there is a boolean (1 byte) that determins whether the SSTable contains legacy counters. Next are total numbers of columns and rows (8 bytes signed long each). Version MA of SSTable 3.x format ends here and it contains only one commit log position interval - [NONE = new CommitLogPosition(-1, 0), upper bound of commit log]. In other versions there's a CommitLogPosition here that represents lower bound of a commit log. Version MB of SSTable 3.x format ends here and it contains only one commit log position interval - [lower bound of commit log, upper bound of commit log]. In other versions there's a set of commit log positions intervals serialized here. The set starts with a signed integer (4 bytes) that represents the number of intervals. Each interval is a pair of CommitLogPositions.

Serialization header entry

This entry starts with unsigned variant integer that represents min timestamp (64-bit). Then it has an unsigned variant integer that represents min local deletion time (32-bit). Next is unsigned variant integer for min TTL (32-bit). This is followed by a type of partition key. Then the entry has unsigned variant integer that is length of clustering key CL (32-bit). Next there are CL types of clustering key components. Then there are static columns with their types and regular columns with their types.

Columns encoding

Both static and regular columns are encoded in the same way. First there's an unsigned variant integer (32-bit) that is a number of columns. Then for each column there's a byte buffer with a unsigned variant integer (32-bit) length, which is column name, and type of this column.

Type encoding

Type is just a byte buffer with an unsigned variant integer (32-bit) length. It is a UTF-8 string. All leading spaces, tabs and newlines are skipped. Null or empty string is a bytes type. First segment of non-blank characters should contain only alphanumerical characters and special chars like '-', '+', '.', '_', '&'. This is the name of the type. If type name does not contain any '.' then it gets "org.apache.cassandra.db.marshal." prepended to itself. Then an "instance" static field is taken from this class. If the first non-blank character that follows type name is '(' then "getInstance" static method is invoked instead. Remaining string is passed to this method as a parameter. There are following types:

Type	Parametrized
Ascii Type	No
Boolean Type	No
Bytes Type	No
Byte Type	No
ColumnToCollection Type	Yes
Composite Type	Yes
CounterColumn Type	No
Date Type	No
Decimal Type	No
Double Type	No
Duration Type	No
DynamicComposite Type	Yes
Empty Type	No
Float Type	No
Frozen Type	Yes
InetAddress Type	No
Int32 Type	No
Integer Type	No
LexicalUUID Type	No
List Type	Yes
Long Type	No
Map Type	Yes
PartitionerDefinedOrder	Yes
Reversed Type	Yes
Set Type	Yes
Short Type	No
SimpleDate Type	No
Timestamp Type	No
Time Type	No
TimeUUID Type	No
Tuple Type	Yes
User Type	Yes
UTF8 Type	No
UUID Type	No

← Previous Post

ScyllaDB/Cassandra SSTable format part 2

Statistics.db file