Four encoding schemes plus LZ4 general-purpose compression, applied per column based on the column's data characteristics.
Plain: raw serialization with no transformation. It is the baseline for comparison and the fallback when no specialized encoding helps.
RLE (run-length encoding): stores consecutive repeated values as (value, count) pairs. Effective for sorted columns or low-cardinality data like status codes.
```
// Input:  [A, A, A, B, B, C, C, C, C]
// Stored: [(A,3), (B,2), (C,4)]
```
Dictionary: replaces string values with integer indices into a dictionary. Best for low-cardinality string columns like city names or categories.
```
// Dictionary: {0: "Paris", 1: "Tokyo", 2: "Beijing"}
// Input:  ["Paris", "Tokyo", "Paris", "Beijing"]
// Stored: [0, 1, 0, 2]
```
Delta: stores the difference between consecutive values. Ideal for monotonically increasing integers like timestamps or auto-increment IDs.
```
// Input:  [100, 102, 105, 110]
// Stored: [100, +2, +3, +5] ← smaller values = fewer bits
```
LZ4: general-purpose block compression applied after encoding. It trades some compression ratio for fast decompression, the same tradeoff DuckDB and ClickHouse make.
Encoding is applied as a pipeline: first a type-specific encoding (RLE, dictionary, or delta), then optional LZ4 compression on the result.
The encoder auto-selects the best scheme per column based on cardinality and data type heuristics.
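The selection logic might look like the sketch below: count runs and distinct values in one pass, then pick a scheme. The function name and the thresholds are illustrative assumptions, not the heuristics in `encoder.go` (and the delta case, which needs type information, is omitted):

```go
package main

import "fmt"

// chooseEncoding picks a scheme from simple cardinality and run-length
// heuristics. Thresholds here are placeholders for illustration.
func chooseEncoding(values []string) string {
	distinct := map[string]struct{}{}
	runs := 0
	for i, v := range values {
		distinct[v] = struct{}{}
		if i == 0 || v != values[i-1] {
			runs++
		}
	}
	switch {
	case len(values) > 0 && runs*2 <= len(values):
		// Average run length ≥ 2: consecutive repeats favor RLE.
		return "rle"
	case len(values) > 0 && len(distinct)*4 <= len(values):
		// Few distinct values relative to row count: dictionary wins.
		return "dictionary"
	default:
		return "plain"
	}
}

func main() {
	// Sorted status column → long runs → RLE.
	fmt.Println(chooseEncoding([]string{"active", "active", "active", "active", "inactive", "inactive"}))
	// Low cardinality but no runs → dictionary.
	fmt.Println(chooseEncoding([]string{"a", "b", "c", "a", "b", "c", "a", "b", "c", "a", "b", "c"}))
}
```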
| Encoding | Best for | Example |
|---|---|---|
| RLE | Many consecutive repeats | Sorted status column |
| Dictionary | Low-cardinality strings | City names, categories |
| Delta | Monotonic integers | Timestamps, row IDs |
| Plain + LZ4 | Everything else | Random numeric data |
| File | Role |
|---|---|
| encoder.go | Interface + auto-selection logic |
| plain.go | No-op baseline encoding |
| rle.go | Run-length encoding |
| dictionary.go | Dictionary encoding for strings |
| delta.go | Delta encoding for integers |
| lz4.go | LZ4 block compression |
| pipeline.go | Encoding pipeline orchestration |