Encoding

Four encoding schemes plus LZ4 general-purpose compression, applied per column based on the data's characteristics.

Encoding Schemes

Plain

Raw serialization with no transformation. Baseline for comparison and fallback when no specialized encoding helps.

Run-Length Encoding (RLE)

Stores consecutive repeated values as (value, count) pairs. Effective for sorted columns or low-cardinality data like status codes.

// Input:  [A, A, A, B, B, C, C, C, C]
// Stored: [(A,3), (B,2), (C,4)]
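A minimal sketch of how such an encoder/decoder pair might look, assuming a simple `(value, count)` struct; the actual layout in rle.go may differ.

```go
package main

import "fmt"

// run holds one (value, count) pair — illustrative only.
type run struct {
	Value string
	Count int
}

// rleEncode collapses consecutive repeats into (value, count) pairs.
func rleEncode(values []string) []run {
	var runs []run
	for _, v := range values {
		if n := len(runs); n > 0 && runs[n-1].Value == v {
			runs[n-1].Count++
		} else {
			runs = append(runs, run{Value: v, Count: 1})
		}
	}
	return runs
}

// rleDecode expands the pairs back into the original sequence.
func rleDecode(runs []run) []string {
	var out []string
	for _, r := range runs {
		for i := 0; i < r.Count; i++ {
			out = append(out, r.Value)
		}
	}
	return out
}

func main() {
	in := []string{"A", "A", "A", "B", "B", "C", "C", "C", "C"}
	fmt.Println(rleEncode(in)) // [{A 3} {B 2} {C 4}]
}
```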

Dictionary Encoding

Replaces string values with integer indices into a dictionary. Best for low-cardinality string columns like city names or categories.

// Dictionary: {0: "Paris", 1: "Tokyo", 2: "Beijing"}
// Input:  ["Paris", "Tokyo", "Paris", "Beijing"]
// Stored: [0, 1, 0, 2]
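One way to build the dictionary is to assign indices in first-seen order; the sketch below assumes that scheme and `uint32` codes, which may not match dictionary.go exactly.

```go
package main

import "fmt"

// dictEncode replaces each string with a small integer index,
// assigning indices in first-seen order. Names are illustrative.
func dictEncode(values []string) (dict []string, codes []uint32) {
	index := map[string]uint32{}
	for _, v := range values {
		code, ok := index[v]
		if !ok {
			code = uint32(len(dict))
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return dict, codes
}

func main() {
	dict, codes := dictEncode([]string{"Paris", "Tokyo", "Paris", "Beijing"})
	fmt.Println(dict)  // [Paris Tokyo Beijing]
	fmt.Println(codes) // [0 1 0 2]
}
```

Decoding is a plain index lookup: `dict[codes[i]]` recovers each original string.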

Delta Encoding

Stores the difference between consecutive values. Ideal for monotonically increasing integers like timestamps or auto-increment IDs.

// Input:  [100, 102, 105, 110]
// Stored: [100, +2, +3, +5]   ← smaller values = fewer bits
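A sketch of delta encode/decode over `int64` values — keep the first value, store successive differences, and reverse with a running sum. Function names are assumptions, not the API in delta.go.

```go
package main

import "fmt"

// deltaEncode keeps the first value and stores each subsequent
// value as the difference from its predecessor.
func deltaEncode(values []int64) []int64 {
	if len(values) == 0 {
		return nil
	}
	out := make([]int64, len(values))
	out[0] = values[0]
	for i := 1; i < len(values); i++ {
		out[i] = values[i] - values[i-1]
	}
	return out
}

// deltaDecode reverses the encoding with a running sum.
func deltaDecode(deltas []int64) []int64 {
	out := make([]int64, len(deltas))
	var acc int64
	for i, d := range deltas {
		acc += d
		out[i] = acc
	}
	return out
}

func main() {
	fmt.Println(deltaEncode([]int64{100, 102, 105, 110})) // [100 2 3 5]
}
```

The small deltas can then be bit-packed or LZ4-compressed far more tightly than the raw values.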

LZ4 Compression

General-purpose block compression applied after encoding. Fast decompression with decent ratios — the same tradeoff DuckDB and ClickHouse make.

Pipeline

Encoding is applied as a pipeline: first a type-specific encoding (RLE, dictionary, or delta), then optional LZ4 compression on the result.

Raw column data
      ↓
Type-specific encoding (auto-selected)
      ↓
LZ4 block compression (optional)
      ↓
Stored bytes

The encoder auto-selects the best scheme per column based on cardinality and data type heuristics.
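A selection heuristic along those lines could look like this. The thresholds and scheme names below are assumptions for illustration, not the actual logic in encoder.go.

```go
package main

import "fmt"

// chooseEncoding picks a scheme from simple cardinality and
// run-length heuristics. Thresholds here are illustrative.
func chooseEncoding(values []string) string {
	if len(values) == 0 {
		return "plain"
	}
	distinct := map[string]struct{}{}
	runs := 1
	for i, v := range values {
		distinct[v] = struct{}{}
		if i > 0 && v != values[i-1] {
			runs++
		}
	}
	switch {
	// Long average runs favor RLE.
	case len(values)/runs >= 4:
		return "rle"
	// Few distinct values relative to row count favor dictionary.
	case len(distinct)*10 <= len(values):
		return "dictionary"
	default:
		return "plain"
	}
}

func main() {
	sorted := []string{"ok", "ok", "ok", "ok", "err", "err", "err", "err"}
	fmt.Println(chooseEncoding(sorted)) // rle
}
```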

When Each Encoding Wins

| Encoding    | Best for                 | Example                |
|-------------|--------------------------|------------------------|
| RLE         | Many consecutive repeats | Sorted status column   |
| Dictionary  | Low-cardinality strings  | City names, categories |
| Delta       | Monotonic integers       | Timestamps, row IDs    |
| Plain + LZ4 | Everything else          | Random numeric data    |

Files

| File          | Role                             |
|---------------|----------------------------------|
| encoder.go    | Interface + auto-selection logic |
| plain.go      | No-op baseline encoding          |
| rle.go        | Run-length encoding              |
| dictionary.go | Dictionary encoding for strings  |
| delta.go      | Delta encoding for integers      |
| lz4.go        | LZ4 block compression            |
| pipeline.go   | Encoding pipeline orchestration  |