Encoding

Four encoding schemes plus LZ4 general-purpose compression, applied per column based on the data's characteristics.

Encoding Schemes

Plain

Raw serialization with no transformation. Baseline for comparison and fallback when no specialized encoding helps.

Run-Length Encoding (RLE)

Stores consecutive repeated values as (value, count) pairs. Effective for sorted columns or low-cardinality data like status codes.

// Input:  [A, A, A, B, B, C, C, C, C]
// Stored: [(A,3), (B,2), (C,4)]
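A minimal sketch of how such an encoder/decoder pair might look, assuming a simple `(value, count)` struct; the actual layout in rle.go may differ.

```go
package main

import "fmt"

// run holds one (value, count) pair — illustrative only.
type run struct {
	Value string
	Count int
}

// rleEncode collapses consecutive repeats into (value, count) pairs.
func rleEncode(values []string) []run {
	var runs []run
	for _, v := range values {
		if n := len(runs); n > 0 && runs[n-1].Value == v {
			runs[n-1].Count++
		} else {
			runs = append(runs, run{Value: v, Count: 1})
		}
	}
	return runs
}

// rleDecode expands the pairs back into the original sequence.
func rleDecode(runs []run) []string {
	var out []string
	for _, r := range runs {
		for i := 0; i < r.Count; i++ {
			out = append(out, r.Value)
		}
	}
	return out
}

func main() {
	in := []string{"A", "A", "A", "B", "B", "C", "C", "C", "C"}
	fmt.Println(rleEncode(in)) // [{A 3} {B 2} {C 4}]
}
```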

Dictionary Encoding

Replaces string values with integer indices into a dictionary. Best for low-cardinality string columns like city names or categories.

// Dictionary: {0: "Paris", 1: "Tokyo", 2: "Beijing"}
// Input:  ["Paris", "Tokyo", "Paris", "Beijing"]
// Stored: [0, 1, 0, 2]
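One way to build the dictionary is to assign indices in first-seen order; the sketch below assumes that scheme and `uint32` codes, which may not match dictionary.go exactly.

```go
package main

import "fmt"

// dictEncode replaces each string with a small integer index,
// assigning indices in first-seen order. Names are illustrative.
func dictEncode(values []string) (dict []string, codes []uint32) {
	index := map[string]uint32{}
	for _, v := range values {
		code, ok := index[v]
		if !ok {
			code = uint32(len(dict))
			index[v] = code
			dict = append(dict, v)
		}
		codes = append(codes, code)
	}
	return dict, codes
}

func main() {
	dict, codes := dictEncode([]string{"Paris", "Tokyo", "Paris", "Beijing"})
	fmt.Println(dict)  // [Paris Tokyo Beijing]
	fmt.Println(codes) // [0 1 0 2]
}
```

Decoding is a plain index lookup: `dict[codes[i]]` recovers each original string.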

Delta Encoding

Stores the difference between consecutive values. Ideal for monotonically increasing integers like timestamps or auto-increment IDs.

// Input:  [100, 102, 105, 110]
// Stored: [100, +2, +3, +5]   ← smaller values = fewer bits
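A sketch of delta encode/decode over `int64` values — keep the first value, store successive differences, and reverse with a running sum. Function names are assumptions, not the API in delta.go.

```go
package main

import "fmt"

// deltaEncode keeps the first value and stores each subsequent
// value as the difference from its predecessor.
func deltaEncode(values []int64) []int64 {
	if len(values) == 0 {
		return nil
	}
	out := make([]int64, len(values))
	out[0] = values[0]
	for i := 1; i < len(values); i++ {
		out[i] = values[i] - values[i-1]
	}
	return out
}

// deltaDecode reverses the encoding with a running sum.
func deltaDecode(deltas []int64) []int64 {
	out := make([]int64, len(deltas))
	var acc int64
	for i, d := range deltas {
		acc += d
		out[i] = acc
	}
	return out
}

func main() {
	fmt.Println(deltaEncode([]int64{100, 102, 105, 110})) // [100 2 3 5]
}
```

The small deltas can then be bit-packed or LZ4-compressed far more tightly than the raw values.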

LZ4 Compression

General-purpose block compression applied after encoding. Fast decompression with decent ratios — the same tradeoff DuckDB and ClickHouse make.

Pipeline

Encoding is applied as a pipeline: first a type-specific encoding (RLE, dictionary, or delta), then optional LZ4 compression on the result.

Raw column data
      ↓
Type-specific encoding (auto-selected)
      ↓
LZ4 block compression (optional)
      ↓
Stored bytes

The encoder auto-selects the best scheme per column based on cardinality and data type heuristics.
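A selection heuristic along those lines could look like this. The thresholds and scheme names below are assumptions for illustration, not the actual logic in encoder.go.

```go
package main

import "fmt"

// chooseEncoding picks a scheme from simple cardinality and
// run-length heuristics. Thresholds here are illustrative.
func chooseEncoding(values []string) string {
	if len(values) == 0 {
		return "plain"
	}
	distinct := map[string]struct{}{}
	runs := 1
	for i, v := range values {
		distinct[v] = struct{}{}
		if i > 0 && v != values[i-1] {
			runs++
		}
	}
	switch {
	// Long average runs favor RLE.
	case len(values)/runs >= 4:
		return "rle"
	// Few distinct values relative to row count favor dictionary.
	case len(distinct)*10 <= len(values):
		return "dictionary"
	default:
		return "plain"
	}
}

func main() {
	sorted := []string{"ok", "ok", "ok", "ok", "err", "err", "err", "err"}
	fmt.Println(chooseEncoding(sorted)) // rle
}
```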

When Each Encoding Wins

| Encoding    | Best for                 | Example                |
|-------------|--------------------------|------------------------|
| RLE         | Many consecutive repeats | Sorted status column   |
| Dictionary  | Low-cardinality strings  | City names, categories |
| Delta       | Monotonic integers       | Timestamps, row IDs    |
| Plain + LZ4 | Everything else          | Random numeric data    |

Files

| File          | Role                             |
|---------------|----------------------------------|
| encoder.go    | Interface + auto-selection logic |
| plain.go      | No-op baseline encoding          |
| rle.go        | Run-length encoding              |
| dictionary.go | Dictionary encoding for strings  |
| delta.go      | Delta encoding for integers      |
| lz4.go        | LZ4 block compression            |
| pipeline.go   | Encoding pipeline orchestration  |