Column Storage and ColumnStore
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
This document describes the column-oriented storage layer implemented by llkv-column-map, focusing on the ColumnStore struct that manages physical persistence of Arrow columnar data. The column store sits between the table abstraction and the pager interface, translating logical field requests into physical chunk operations.
For the higher-level table API that wraps ColumnStore, see Table Abstraction. For details on the underlying storage backends, see Pager Interface and SIMD Optimization.
Architecture Position
The ColumnStore acts as the bridge between schema-aware tables and raw key-value storage:
Sources: llkv-column-map/README.md:10-46 llkv-table/README.md:19-24
graph TB
Table["llkv-table::Table\nSchema validation\nMVCC injection"]
ColumnStore["llkv-column-map::ColumnStore\nColumn chunking\nLogicalFieldId → PhysicalKey"]
Pager["llkv-storage::Pager\nbatch_get / batch_put\nMemPager / SimdRDrivePager"]
Table -->|append RecordBatch| ColumnStore
Table -->|scan / gather| ColumnStore
ColumnStore -->|BatchGet / BatchPut| Pager
ColumnStore -->|serialized Arrow chunks| Pager
Pager -->|EntryHandle zero-copy| ColumnStore
Logical Field Identification
LogicalFieldId Structure
Each column is identified by a LogicalFieldId that encodes three components:
| Component | Bits | Purpose |
|---|---|---|
| Namespace | High bits | Segregates user data, MVCC metadata, and row-id shadows |
| Table ID | Middle bits | Identifies which table the column belongs to |
| Field ID | Low bits | Distinguishes columns within a table |
This structure prevents collisions when multiple tables share the same physical pager while maintaining clear boundaries between user data and system metadata.
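The exact encoding lives in llkv-column-map; the sketch below assumes illustrative bit widths (8 namespace bits, 32 table-id bits, 24 field-id bits) purely to show how the three components can be packed into a single u64 key.
```rust
/// Illustrative sketch only: the real `LogicalFieldId` is defined in
/// llkv-column-map and may use different widths. Assumed layout here:
/// [ 8 bits namespace | 32 bits table id | 24 bits field id ].
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId(u64);

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    MvccMetadata = 2,
}

impl LogicalFieldId {
    fn new(ns: Namespace, table_id: u32, field_id: u32) -> Self {
        debug_assert!(field_id < (1 << 24), "field id must fit in the low 24 bits");
        LogicalFieldId(((ns as u64) << 56) | ((table_id as u64) << 24) | field_id as u64)
    }

    fn table_id(self) -> u32 {
        (self.0 >> 24) as u32
    }

    fn field_id(self) -> u32 {
        (self.0 & 0x00FF_FFFF) as u32
    }
}
```
Under such a packing, a user column and its row-id shadow share the same table and field identifiers and differ only in the namespace bits.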
Sources: llkv-column-map/README.md:19-22
Namespace Segregation
Three primary namespaces exist:
- User Data : Columns explicitly defined in CREATE TABLE statements
- MVCC Metadata : System columns created_by and deleted_by for transaction visibility
- Row ID Shadow : Parallel storage of row_id values for each column to enable efficient random access
Each namespace maps to distinct LogicalFieldId values, ensuring that MVCC bookkeeping and user data remain isolated in the catalog.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:16-17
Physical Storage Model
PhysicalKey Allocation and Mapping
The column store maintains a catalog that maps each LogicalFieldId to a PhysicalKey (u64 identifier) allocated by the pager:
graph LR
LF1["LogicalFieldId\n(ns=User, table=5, field=2)"]
LF2["LogicalFieldId\n(ns=RowId, table=5, field=2)"]
LF3["LogicalFieldId\n(ns=Mvcc, table=5, field=created_by)"]
PK1["PhysicalKey: 1024\nDescriptor"]
PK2["PhysicalKey: 1025\nData Chunks"]
PK3["PhysicalKey: 2048\nDescriptor"]
PK4["PhysicalKey: 2049\nData Chunks"]
LF1 --> PK1
LF1 --> PK2
LF2 --> PK3
LF2 --> PK4
PK1 -->|ColumnDescriptor| Pager["Pager"]
PK2 -->|Serialized Arrays| Pager
PK3 -->|ColumnDescriptor| Pager
PK4 -->|Serialized Arrays| Pager
Each logical field requires at least two physical keys: one for the column descriptor (metadata about chunks) and one or more for the actual data chunks.
Sources: llkv-column-map/README.md:19-21 llkv-column-map/src/store/projection.rs:461-468
Column Descriptors and Chunks
Column data is split into fixed-size chunks, each serialized as an Arrow array. Metadata about chunks is stored in a ColumnDescriptor:
graph TB
subgraph "Column Descriptor"
Head["head_page_pk: PhysicalKey\nPoints to descriptor chain"]
Meta1["ChunkMetadata[0]\nchunk_pk, row_count\nmin_val_u64, max_val_u64"]
Meta2["ChunkMetadata[1]\n..."]
Meta3["ChunkMetadata[n]\n..."]
Head --> Meta1
Head --> Meta2
Head --> Meta3
end
subgraph "Physical Storage"
Chunk1["PhysicalKey: chunk_pk[0]\nSerialized Arrow Array\nrow_ids: 0..999"]
Chunk2["PhysicalKey: chunk_pk[1]\nSerialized Arrow Array\nrow_ids: 1000..1999"]
end
Meta1 -.->|references| Chunk1
Meta2 -.->|references| Chunk2
The descriptor stores min/max row ID values for each chunk, enabling efficient skip-scan during queries by filtering out chunks that cannot contain requested row IDs.
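A hedged sketch of the shapes implied by the diagram (field names follow the labels above; the real definitions in llkv-column-map differ in detail):
```rust
type PhysicalKey = u64;

/// Per-chunk bookkeeping kept in the descriptor: where the chunk lives and
/// which row IDs it can possibly contain.
struct ChunkMetadata {
    chunk_pk: PhysicalKey,
    row_count: u64,
    min_val_u64: u64, // smallest row ID stored in the chunk
    max_val_u64: u64, // largest row ID stored in the chunk
}

/// The descriptor acts as the per-column index over its chunks.
struct ColumnDescriptor {
    head_page_pk: PhysicalKey,
    chunks: Vec<ChunkMetadata>,
}
```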
Sources: llkv-column-map/src/store/projection.rs:229-236 llkv-column-map/src/store/projection.rs:760-772
Column Catalog Persistence
The catalog mapping LogicalFieldId to PhysicalKey is itself stored in the pager at a well-known root key. On initialization:
- ColumnStore::open() attempts to load the catalog from the pager root
- If not found, an empty catalog is initialized
- All catalog updates are committed atomically during append operations
This design ensures the catalog state remains consistent with persisted data, even after crashes.
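A minimal sketch of this open-or-initialize pattern, assuming a single well-known root key and leaving the real pager types and catalog encoding aside:
```rust
use std::collections::HashMap;

type PhysicalKey = u64;

/// Hypothetical root key for the catalog; the real constant lives in llkv-column-map.
const CATALOG_ROOT_KEY: PhysicalKey = 0;

/// `load` stands in for a pager lookup of a single key.
fn open_catalog<F>(load: F) -> HashMap<u64, PhysicalKey>
where
    F: Fn(PhysicalKey) -> Option<Vec<u8>>,
{
    match load(CATALOG_ROOT_KEY) {
        // A catalog blob exists at the root key: decode it.
        Some(bytes) => decode_catalog(&bytes),
        // Fresh store: start empty; the first append persists the catalog.
        None => HashMap::new(),
    }
}

fn decode_catalog(_bytes: &[u8]) -> HashMap<u64, PhysicalKey> {
    // Decoding of the persisted LogicalFieldId -> PhysicalKey map elided.
    HashMap::new()
}
```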
Sources: llkv-column-map/README.md:38-40
Append Pipeline
RecordBatch Persistence Flow
Sources: llkv-column-map/README.md:24-28
Last-Writer-Wins Semantics
When appending data with row IDs that overlap existing chunks:
- The store identifies which chunks contain conflicting row IDs
- Existing chunks are deserialized and merged with new data
- For duplicate row IDs, the new value overwrites the old
- Rewritten chunks are serialized and committed atomically
This ensures that UPDATE operations (implemented as appends at the table layer) correctly overwrite previous values without requiring separate update logic.
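A simplified sketch of the merge step over (row_id, value) pairs; the real implementation operates on deserialized Arrow arrays rather than plain tuples:
```rust
use std::collections::BTreeMap;

/// Merge existing chunk rows with newly appended rows. When the same row ID
/// appears on both sides, the new value wins (last-writer-wins).
fn merge_last_writer_wins(
    existing: &[(u64, i64)],
    incoming: &[(u64, i64)],
) -> Vec<(u64, i64)> {
    let mut merged: BTreeMap<u64, i64> = existing.iter().copied().collect();
    for &(row_id, value) in incoming {
        // Overwrites the previous value if the row ID already exists.
        merged.insert(row_id, value);
    }
    merged.into_iter().collect()
}
```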
Sources: llkv-column-map/README.md:26-27
Chunking Strategy
Columns are divided into chunks based on:
- Chunk size threshold : Configurable limit on rows per chunk (typically several thousand)
- Row ID ranges : Each chunk covers a contiguous range of row IDs
- Physical key allocation : Each chunk gets a unique physical key from the pager
This chunking enables:
- Parallel scan operations across chunks
- Efficient skip-scan by filtering chunks based on row ID predicates
- Incremental garbage collection of deleted chunks
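As a sketch of the splitting step, assuming row IDs arrive sorted and using an arbitrary rows-per-chunk threshold:
```rust
/// Illustrative chunking: split sorted row IDs into bounded chunks and record
/// the (min, max, row_count) metadata that later drives skip-scan.
fn plan_chunks(sorted_row_ids: &[u64], rows_per_chunk: usize) -> Vec<(u64, u64, usize)> {
    assert!(rows_per_chunk > 0);
    sorted_row_ids
        .chunks(rows_per_chunk)
        .map(|chunk| (chunk[0], *chunk.last().unwrap(), chunk.len()))
        .collect()
}
```
For example, 2,500 sorted row IDs with a threshold of 1,000 produce three chunks whose min/max bounds cover contiguous ranges.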
Sources: llkv-column-map/src/store/projection.rs:760-772
Data Retrieval
Gather Operations
The column store provides two gather strategies for random-access row retrieval:
graph TB
Input["Row IDs: [5, 123, 999]"]
Input --> Sort["Sort and deduplicate"]
Sort --> Filter["Identify intersecting chunks"]
Filter --> Fetch["batch_get(chunk keys)"]
Fetch --> Deserialize["Deserialize Arrow arrays"]
Deserialize --> Gather["Gather requested rows"]
Gather --> Output["RecordBatch"]
Single-Shot Gather
For one-off queries, gather_rows() performs a complete fetch without caching:
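The call itself is not reproduced here; the following sketch models the single-shot flow from the diagram above (sort and deduplicate, skip non-intersecting chunks, gather), using plain vectors in place of Arrow arrays:
```rust
use std::collections::BTreeSet;

/// One decoded chunk: its row-ID bounds plus (row_id, value) pairs.
struct Chunk {
    min_row: u64,
    max_row: u64,
    rows: Vec<(u64, i64)>,
}

/// Single-shot gather: everything is fetched and decoded for this call only.
fn gather_once(chunks: &[Chunk], requested: &[u64]) -> Vec<(u64, i64)> {
    // 1. Sort and deduplicate the requested row IDs.
    let wanted: BTreeSet<u64> = requested.iter().copied().collect();
    let (lo, hi) = match (wanted.iter().next(), wanted.iter().next_back()) {
        (Some(&lo), Some(&hi)) => (lo, hi),
        _ => return Vec::new(),
    };

    let mut out = Vec::new();
    for chunk in chunks {
        // 2. Skip chunks whose row-ID range cannot intersect the request.
        if chunk.max_row < lo || chunk.min_row > hi {
            continue;
        }
        // 3. "Decode" the chunk and gather the matching rows.
        for &(row_id, value) in &chunk.rows {
            if wanted.contains(&row_id) {
                out.push((row_id, value));
            }
        }
    }
    out
}
```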
Sources: llkv-column-map/src/store/projection.rs:245-268
Reusable Context Gather
For repeated queries (e.g., join inner loop), gather_rows_with_reusable_context() amortizes costs:
- Prepare a MultiGatherContext containing column descriptors and scratch buffers
- Call gather repeatedly, reusing:
  - Decoded Arrow chunk arrays (cached in chunk_cache)
  - Row index hash maps (preallocated buffers)
  - Scratch space for row locators
This avoids redundant descriptor fetches and chunk decodes across multiple gather calls.
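A sketch of the amortization pattern; the real MultiGatherContext caches decoded Arrow arrays and descriptor state, whereas this stand-in just maps physical keys to decoded rows:
```rust
use std::collections::HashMap;

type PhysicalKey = u64;

/// Reused across gather calls so each chunk is decoded at most once.
#[derive(Default)]
struct GatherContext {
    chunk_cache: HashMap<PhysicalKey, Vec<(u64, i64)>>,
}

impl GatherContext {
    /// Returns the decoded rows for a chunk, decoding only on the first request.
    fn decoded_chunk<F>(&mut self, pk: PhysicalKey, decode: F) -> &Vec<(u64, i64)>
    where
        F: FnOnce() -> Vec<(u64, i64)>,
    {
        self.chunk_cache.entry(pk).or_insert_with(decode)
    }
}
```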
Sources: llkv-column-map/src/store/projection.rs:516-758
Gather Null Policies
Three policies control null handling:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row ID is not found |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Remove rows where all projected columns are null or missing |
The DropNulls policy is used by MVCC filtering to exclude logically deleted rows.
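As an illustration only (not the crate's API), the three behaviors can be modeled by how a single gathered row is resolved:
```rust
/// Hypothetical stand-in for the policy enum described in the table above.
enum NullPolicy {
    ErrorOnMissing,
    IncludeNulls,
    DropNulls,
}

/// `cols` holds one entry per projected column for a single requested row;
/// `None` means the row ID was not found in that column. `Ok(None)` means the
/// row is dropped from the output.
fn resolve_row(
    policy: &NullPolicy,
    cols: &[Option<i64>],
) -> Result<Option<Vec<Option<i64>>>, String> {
    let all_missing = !cols.is_empty() && cols.iter().all(|v| v.is_none());
    match policy {
        // Any missing column for the requested row is an error.
        NullPolicy::ErrorOnMissing if cols.iter().any(|v| v.is_none()) => {
            Err("requested row id not found".to_string())
        }
        // Drop rows where every projected column is null or missing.
        NullPolicy::DropNulls if all_missing => Ok(None),
        // Otherwise missing values surface as nulls in the output.
        _ => Ok(Some(cols.to_vec())),
    }
}
```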
Sources: llkv-column-map/src/store/projection.rs:39-47
Projection Planning
The projection subsystem prepares multi-column gathers by building a FieldPlan for each requested column.
Each FieldPlan contains:
- value_metas : Metadata for value chunks (actual column data)
- row_metas : Metadata for row ID chunks (parallel row ID storage)
- candidate_indices : Pre-filtered list of chunks that might contain requested rows
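The field names below come directly from the list above; the surrounding types are simplified stand-ins for the definitions in projection.rs:
```rust
type PhysicalKey = u64;

/// Simplified per-chunk metadata used by the plan.
struct ChunkMeta {
    chunk_pk: PhysicalKey,
    min_val_u64: u64,
    max_val_u64: u64,
}

/// Per-column plan assembled before a multi-column gather.
struct FieldPlan {
    /// Metadata for the value chunks (the column's actual data).
    value_metas: Vec<ChunkMeta>,
    /// Metadata for the parallel row-ID shadow chunks.
    row_metas: Vec<ChunkMeta>,
    /// Indices into `value_metas` for chunks that may contain requested rows.
    candidate_indices: Vec<usize>,
}
```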
Sources: llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:229-236
Chunk Intersection Logic
Before fetching chunks, the store filters based on row ID range overlap:
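A minimal sketch of the overlap test, assuming the requested row IDs are sorted so the request is summarized by its first and last ID:
```rust
/// True when a chunk covering [chunk_min, chunk_max] can contain any of the
/// requested row IDs in [request_min, request_max]; disjoint chunks are skipped.
fn chunk_intersects(
    chunk_min: u64,
    chunk_max: u64,
    request_min: u64,
    request_max: u64,
) -> bool {
    chunk_min <= request_max && request_min <= chunk_max
}
```
For example, a chunk covering row IDs 1000..1999 is skipped entirely for a request bounded by 5..999.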
This optimization prevents loading chunks that cannot possibly contain any requested rows.
Sources: llkv-column-map/src/store/projection.rs:774-794
Serialization Format
Zero-Copy Array Persistence
Column chunks are serialized using a custom format optimized for memory-mapped zero-copy reads:
graph TB
subgraph "Serialized Array Format"
Header["Header (24 bytes)\nMAGIC: 'ARR0'\nlayout: Primitive/Varlen/FslFloat32/Struct\ntype_code: PrimType enum\nlen: element count\nextra_a, extra_b: layout-specific"]
Payload["Payload\nRaw Arrow buffer bytes"]
Header --> Payload
end
subgraph "Deserialization (Zero-Copy)"
EntryHandle["EntryHandle from Pager\n(memory-mapped or in-memory)"]
ArrowBuffer["Arrow Buffer\nwraps EntryHandle bytes"]
ArrayData["ArrayData\nreferences Buffer directly"]
EntryHandle --> ArrowBuffer
ArrowBuffer --> ArrayData
end
Payload -.->|stored as| EntryHandle
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
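The header fields above can be sketched as follows; the exact field widths and ordering are assumptions for illustration, not the on-disk definition in serialization.rs:
```rust
/// Assumed illustration of the 24-byte chunk header; the real layout lives in
/// llkv-storage/src/serialization.rs.
const MAGIC: &[u8; 4] = b"ARR0";

fn encode_header(layout: u8, type_code: u8, len: u32, extra_a: u64, extra_b: u32) -> [u8; 24] {
    let mut out = [0u8; 24];
    out[0..4].copy_from_slice(MAGIC);                     // magic tag
    out[4] = layout;                                      // Primitive / Varlen / FslFloat32 / Struct
    out[5] = type_code;                                   // PrimType discriminant
    out[6..8].copy_from_slice(&0u16.to_le_bytes());       // padding (assumed)
    out[8..12].copy_from_slice(&len.to_le_bytes());       // element count
    out[12..20].copy_from_slice(&extra_a.to_le_bytes());  // layout-specific
    out[20..24].copy_from_slice(&extra_b.to_le_bytes());  // layout-specific
    out
}
```
The payload (raw Arrow buffer bytes) follows the header, so a reader can wrap the remainder of the blob in an Arrow Buffer without copying.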
Layout Types
Four layout variants handle different Arrow data types:
| Layout | Use Case | Payload Structure |
|---|---|---|
| Primitive | Fixed-width primitives (Int32, Float64, etc.) | Single values buffer |
| Varlen | Variable-length (Binary, Utf8, LargeBinary) | Offsets buffer + values buffer |
| FslFloat32 | FixedSizeList<Float32> (dense vectors) | Single contiguous Float32 buffer |
| Struct | Nested struct types | Arrow IPC serialized payload |
The FslFloat32 layout is a specialized fast-path for dense vector columns, avoiding nesting overhead.
Sources: llkv-storage/src/serialization.rs:54-135
Why Not Arrow IPC?
The custom format is used instead of standard Arrow IPC for several reasons:
- Minimal headers : No schema objects or framing, reducing file size
- Predictable payloads : Each array occupies one contiguous region, ideal for mmap and SIMD
- True zero-copy : Deserialization produces ArrayData referencing the original mmap directly
- Stable codes : Layout and type tags are explicitly pinned with compile-time checks
The trade-off is reduced generality (e.g., no null bitmaps yet) for better scan performance in this storage engine's access patterns.
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/README.md:10-16
Type Code Stability
The PrimType enum discriminants are compile-time pinned to prevent silent corruption:
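A sketch of the pinning technique, with an assumed subset of variants; the real enum and checks live in serialization.rs:
```rust
/// Illustrative subset of variants; the actual discriminant assignments are
/// defined by llkv-storage, not by this sketch.
#[repr(u8)]
enum PrimType {
    Int32 = 0,
    Int64 = 1,
    Float32 = 2,
    Float64 = 3,
}

// Compile-time pins: if a discriminant drifts (for example after reordering
// variants), these constant assertions fail and the crate no longer builds.
const _: () = assert!(PrimType::Int32 as u8 == 0);
const _: () = assert!(PrimType::Int64 as u8 == 1);
const _: () = assert!(PrimType::Float32 as u8 == 2);
const _: () = assert!(PrimType::Float64 as u8 == 3);
```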
If any discriminant is accidentally changed, the code will fail to compile, preventing data corruption.
Sources: llkv-storage/src/serialization.rs:561-586
Integration with Table Layer
The Table struct (from llkv-table) wraps Arc<ColumnStore> and delegates storage operations:
graph TB
subgraph "Table Responsibilities"
Schema["Schema validation"]
MVCC["MVCC column injection\n(row_id, created_by, deleted_by)"]
Catalog["System catalog updates"]
end
subgraph "ColumnStore Responsibilities"
Chunk["Column chunking"]
Map["LogicalFieldId → PhysicalKey"]
Persist["Physical persistence"]
end
Schema --> Chunk
MVCC --> Chunk
Catalog --> Map
Chunk --> Persist
The table layer focuses on schema enforcement and MVCC semantics, while the column store handles physical storage details.
Sources: llkv-table/README.md:19-24 llkv-column-map/README.md:12-16
Integration with Pager Trait
The column store is generic over any Pager<Blob = EntryHandle>:
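A simplified model of that genericity (the real Pager trait in llkv-storage is richer; the names and signatures here are illustrative):
```rust
/// Simplified model of the pager abstraction for illustration only; the actual
/// trait exposes batched gets/puts over typed keys and EntryHandle blobs.
trait Pager {
    type Blob: AsRef<[u8]>;
    fn batch_get(&self, keys: &[u64]) -> Vec<Option<Self::Blob>>;
    fn batch_put(&mut self, entries: Vec<(u64, Vec<u8>)>);
}

/// Code written against the trait works unchanged over an in-memory pager
/// (transaction staging) or a persistent, memory-mapped pager (committed data).
fn chunk_exists<P: Pager>(pager: &P, physical_key: u64) -> bool {
    pager.batch_get(&[physical_key])[0].is_some()
}
```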
This abstraction allows the same column store code to work with both ephemeral in-memory storage (for transaction staging) and durable persistent storage (for committed data).
Sources: llkv-column-map/README.md:36-39 llkv-storage/README.md:19-22
MVCC Column Storage
MVCC metadata columns are stored using the same column infrastructure as user data, but in a separate namespace:
graph TB
subgraph "User Table 'employees'"
UserCol1["LogicalFieldId\n(User, table=5, field=0)\n'name' column"]
UserCol2["LogicalFieldId\n(User, table=5, field=1)\n'age' column"]
end
subgraph "MVCC Metadata for 'employees'"
MvccCol1["LogicalFieldId\n(Mvcc, table=5, field=created_by)"]
MvccCol2["LogicalFieldId\n(Mvcc, table=5, field=deleted_by)"]
end
subgraph "Row ID Shadow for 'employees'"
RowCol1["LogicalFieldId\n(RowId, table=5, field=0)"]
RowCol2["LogicalFieldId\n(RowId, table=5, field=1)"]
end
UserCol1 & UserCol2 & MvccCol1 & MvccCol2 & RowCol1 & RowCol2 --> Store["ColumnStore"]
This design keeps MVCC bookkeeping transparent to the column store while allowing the table layer to enforce visibility rules by querying MVCC columns.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:32-34
Concurrency and Parallelism
Parallel Scans
The column store supports parallel scanning via Rayon:
- Chunk-level parallelism: Different chunks can be processed concurrently
- Thread pool bounded by the LLKV_MAX_THREADS environment variable
- Lock-free reads: Descriptors and chunks are immutable once written
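A sketch of chunk-level parallelism with Rayon; the pool-size wiring mirrors the LLKV_MAX_THREADS bound but is illustrative rather than the crate's actual setup:
```rust
use rayon::prelude::*;

/// Count rows across chunks in parallel; chunks are immutable once written,
/// so the read path needs no locking.
fn parallel_row_count(chunks: &[Vec<u64>]) -> usize {
    chunks.par_iter().map(|chunk| chunk.len()).sum()
}

fn main() {
    // Bound the Rayon pool from the environment, mirroring LLKV_MAX_THREADS.
    if let Ok(raw) = std::env::var("LLKV_MAX_THREADS") {
        if let Ok(n) = raw.parse::<usize>() {
            rayon::ThreadPoolBuilder::new()
                .num_threads(n)
                .build_global()
                .ok();
        }
    }
    let chunks = vec![vec![1, 2, 3], vec![4, 5]];
    assert_eq!(parallel_row_count(&chunks), 5);
}
```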
Sources: llkv-column-map/README.md:32-34
Catalog Locking
The catalog mapping is protected by an RwLock:
- Readers acquire shared lock during scans/gathers
- Writers acquire exclusive lock during append/create operations
- Lock contention is minimized by holding locks only during catalog lookups, not during chunk I/O
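A sketch of that locking discipline using a standard-library RwLock; the key point is that the guard is dropped before any chunk I/O begins:
```rust
use std::collections::HashMap;
use std::sync::RwLock;

type LogicalFieldId = u64;
type PhysicalKey = u64;

struct Catalog {
    map: RwLock<HashMap<LogicalFieldId, PhysicalKey>>,
}

impl Catalog {
    /// Shared lock, held only for the lookup itself.
    fn lookup(&self, field: LogicalFieldId) -> Option<PhysicalKey> {
        let guard = self.map.read().unwrap();
        let pk = guard.get(&field).copied();
        drop(guard); // lock released here, before any chunk fetch or decode
        pk
    }

    /// Exclusive lock while registering a new column.
    fn register(&self, field: LogicalFieldId, pk: PhysicalKey) {
        self.map.write().unwrap().insert(field, pk);
    }
}
```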
Sources: llkv-column-map/src/store/projection.rs:461-468
Performance Characteristics
Append Performance
| Operation | Complexity | Notes |
|---|---|---|
| Sequential append (no conflicts) | O(n log n) | Dominated by sorting row IDs |
| Append with overwrites | O(n log n + m log m) | m = existing rows in conflict chunks |
| Chunk serialization | O(n) | Linear in data size |
Gather Performance
| Operation | Complexity | Notes |
|---|---|---|
| Random gather (cold) | O(k log c + r) | k = chunks touched, c = total chunks, r = rows fetched |
| Random gather (hot cache) | O(r) | Chunks already decoded |
| Sequential scan | O(n) | Linear in result size |
The chunk skip-scan optimization reduces k by filtering chunks based on min/max row ID metadata.
Sources: llkv-column-map/src/store/projection.rs:774-794