This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Column Storage and ColumnStore
Relevant source files
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/core.rs
- llkv-column-map/src/store/projection.rs
- llkv-column-map/src/store/scan/filter.rs
- llkv-sql/tests/pager_io_tests.rs
This page documents the columnar storage layer implemented in the llkv-column-map crate. The ColumnStore provides Arrow-native persistence of columnar data over a key-value pager, enabling efficient predicate evaluation, multi-column projections, and Last-Write-Wins (LWW) semantics for updates.
For information about the table-level abstraction built on top of ColumnStore, see Table Abstraction. For details about the underlying pager interface, see Pager Interface and SIMD Optimization.
ColumnStore Architecture
The ColumnStore<P> struct (llkv-column-map/src/store/core.rs:60-68) serves as the primary interface for columnar data operations. It is parameterized by a Pager implementation and manages:
- Column catalog : Maps LogicalFieldId to physical storage keys
- Descriptors : Linked lists of chunk metadata for each column
- Data type cache : Avoids repeated descriptor loads for schema queries
- Index manager : Maintains presence and value indexes
- Context pool : Reusable gather contexts to amortize chunk fetch costs
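The sketch below summarizes this layout as a simplified struct. It is illustrative only: the field names, the catalog's map type, and the LogicalFieldId/PhysicalKey aliases are assumptions, not the definitions from core.rs.

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, RwLock};

use arrow::datatypes::DataType;

// Hypothetical stand-ins; the real definitions live in store/core.rs.
type LogicalFieldId = u64;
type PhysicalKey = u64;

/// Simplified sketch of the components listed above. Field names and types
/// are illustrative, not copied from the source.
struct ColumnStore<P> {
    /// Single source of truth: logical field -> physical descriptor key.
    catalog: Arc<RwLock<HashMap<LogicalFieldId, PhysicalKey>>>,
    /// Caches Arrow data types to avoid repeated descriptor loads.
    dtype_cache: RwLock<HashMap<LogicalFieldId, DataType>>,
    /// Bumped after every append; pooled gather contexts compare against it.
    append_epoch: Arc<AtomicU64>,
    /// The underlying key-value pager.
    pager: Arc<P>,
}
```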
Purpose : This diagram shows the internal components of ColumnStore and their relationships. The catalog is the single source of truth for column-to-physical-key mappings, while auxiliary structures provide caching and optimization.
Sources : llkv-column-map/src/store/core.rs:60-68 llkv-column-map/src/store/core.rs:100-124
Storage Organization
Namespaces
Columns are identified by LogicalFieldId, which combines a namespace, table ID, and field ID to prevent collisions between different data categories (llkv-column-map/src/store/core.rs:38-46):
| Namespace | Purpose | Example Usage |
|---|---|---|
| UserData | Regular table columns | User-visible columns from CREATE TABLE |
| RowIdShadow | Internal row ID tracking | Shadow column storing RowId for each value |
| TxnCreatedBy | MVCC creation timestamps | Transaction ID that created each row |
| TxnDeletedBy | MVCC deletion timestamps | Transaction ID that deleted each row |
Each column in the UserData namespace has a corresponding RowIdShadow column that stores the row IDs for non-null values. This pairing enables efficient row-based gathering and filtering.
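As a concrete illustration, the sketch below shows one plausible bit-packing for LogicalFieldId. The field widths and the packing itself are assumptions; the actual layout in llkv-column-map may differ.

```rust
/// One plausible (hypothetical) packing of namespace, table ID, and field ID
/// into a u64; the real LogicalFieldId layout may differ.
#[derive(Clone, Copy)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    TxnCreatedBy = 2,
    TxnDeletedBy = 3,
}

fn logical_field_id(ns: Namespace, table_id: u32, field_id: u32) -> u64 {
    // Assumed widths: 8 bits namespace, 24 bits table, 32 bits field.
    ((ns as u64) << 56) | (((table_id as u64) & 0x00FF_FFFF) << 32) | field_id as u64
}

fn main() {
    // A UserData column and its RowIdShadow twin share table and field IDs,
    // differing only in the namespace bits.
    let values = logical_field_id(Namespace::UserData, 7, 2);
    let shadow = logical_field_id(Namespace::RowIdShadow, 7, 2);
    assert_ne!(values, shadow);
}
```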
Sources : llkv-column-map/src/store/core.rs:38-46
Physical Storage Structure
Data is organized as a hierarchy: ColumnCatalog → ColumnDescriptor → DescriptorPages → ChunkMetadata → Data Chunks.
Purpose : This diagram shows how data is physically organized in storage. The catalog is the entry point, descriptors form a linked list of metadata pages, and each metadata entry points to a serialized Arrow array.
graph TB
CATALOG["ColumnCatalog\nPhysicalKey: CATALOG_ROOT_PKEY"]
subgraph "Per Column"
DESC["ColumnDescriptor\n(head_page_pk, tail_page_pk,\ntotal_row_count)"]
PAGE1["DescriptorPage\n(next_page_pk, entry_count)"]
PAGE2["DescriptorPage\n(next_page_pk, entry_count)"]
end
subgraph "Chunk Metadata"
META1["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META2["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META3["ChunkMetadata"]
end
subgraph "Arrow Data"
CHUNK1["Serialized Arrow Array\n(values)"]
CHUNK2["Serialized Arrow Array\n(values)"]
CHUNK3["Serialized Arrow Array\n(row IDs)"]
end
CATALOG -->|maps LogicalFieldId| DESC
DESC -->|head_page_pk| PAGE1
PAGE1 -->|next_page_pk| PAGE2
PAGE1 --> META1
PAGE1 --> META2
PAGE2 --> META3
META1 -->|chunk_pk| CHUNK1
META2 -->|chunk_pk| CHUNK2
META3 -->|chunk_pk| CHUNK3
Key structures :
- ColumnCatalog (llkv-column-map/src/store/catalog.rs): Root mapping stored at CATALOG_ROOT_PKEY
- ColumnDescriptor : Per-column header with total row count and pointers to metadata pages
- DescriptorPage : Linked list node containing multiple ChunkMetadata entries
- ChunkMetadata (llkv-column-map/src/store/descriptor.rs): Min/max values, row count, serialized size, chunk physical key
The min/max values in ChunkMetadata enable chunk pruning : predicates can skip entire chunks without loading data if the min/max range doesn’t overlap the query predicate.
Sources : llkv-column-map/src/store/core.rs:100-124 llkv-column-map/src/store/descriptor.rs
Core Operations
Append with Last-Write-Wins
The append method (llkv-column-map/src/store/core.rs:787-1633) ingests a RecordBatch with LWW semantics:
- Preprocessing : Sort by rowid if not already sorted
- LWW Rewrite : For each column, identify existing row IDs and overwrite them in-place
- Filter for Append : Remove rewritten rows and nulls from the batch
- Chunk and Persist : Split data into target-sized chunks, serialize as Arrow arrays, persist with metadata
Purpose : This diagram shows the append flow with LWW semantics. Each column is processed independently, rewrites happen first, then new data is chunked and persisted atomically.
sequenceDiagram
participant Caller
participant ColumnStore
participant LWW as "LWW Rewrite"
participant Chunker
participant Pager
Caller->>ColumnStore: append(batch)
ColumnStore->>ColumnStore: Sort by rowid if needed
loop For each column
ColumnStore->>LWW: lww_rewrite_for_field(field_id, row_ids, values)
LWW->>Pager: Load existing chunks
LWW->>LWW: Identify overlapping row IDs
LWW->>Pager: Overwrite matching rows
LWW-->>ColumnStore: rewritten_ids
ColumnStore->>ColumnStore: Filter out rewritten_ids and nulls
ColumnStore->>Chunker: Split to target chunk sizes
loop For each chunk
Chunker->>Pager: alloc() for chunk_pk and rid_pk
Chunker->>Chunker: Serialize Arrow array
Chunker->>Chunker: Compute min/max/size metadata
end
end
ColumnStore->>Pager: batch_put(all_puts)
ColumnStore->>ColumnStore: Increment append_epoch
ColumnStore-->>Caller: Success
The append_epoch counter (llkv-column-map/src/store/core.rs:132-134) is incremented after every append, providing a cache invalidation signal for gather contexts.
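A usage sketch follows. The batch construction is ordinary Arrow; the rowid column name and the final append call are assumptions based on the flow above, not a verified API.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn build_batch() -> Result<RecordBatch, ArrowError> {
    // The rowid column drives LWW: any of these IDs already present in the
    // column are rewritten in place rather than appended a second time.
    let schema = Arc::new(Schema::new(vec![
        Field::new("rowid", DataType::UInt64, false),
        Field::new("name", DataType::Utf8, true),
    ]));
    let row_ids: ArrayRef = Arc::new(UInt64Array::from(vec![10u64, 11]));
    let names: ArrayRef = Arc::new(StringArray::from(vec![Some("alice"), Some("bob")]));
    RecordBatch::try_new(schema, vec![row_ids, names])
    // Hypothetically: store.append(&batch)? would sort by rowid, rewrite
    // overlaps, filter them out, then chunk and persist the remainder.
}
```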
Sources : llkv-column-map/src/store/core.rs:787-1633 llkv-column-map/src/store/core.rs:893-942
Multi-Column Gathering
The gather_rows method (llkv-column-map/src/store/projection.rs:758-785) assembles a RecordBatch from multiple columns given a list of row IDs. It uses a two-phase strategy :
Phase 1: Planning
- Load ColumnDescriptor for each field
- Collect ChunkMetadata for value and row ID chunks
- Build FieldPlan with candidate chunk indices
Phase 2: Execution
- Identify candidate chunks that intersect requested row IDs (chunk pruning via min/max)
- Batch-fetch missing chunks from pager
- For each column, build Arrow array by gathering values from chunks
- Assemble RecordBatch from gathered columns
Purpose : This diagram shows the gather pipeline. Planning loads metadata, execution prunes chunks and assembles the result. The chunk cache avoids redundant pager reads across multiple gather calls.
Context Pooling : The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains a pool of MultiGatherContext objects keyed by field IDs. Each context caches:
- Chunk arrays by physical key
- Row ID index for sparse lookups
- Scratch buffers for row location tracking
- Arrow builders for each column type
By reusing contexts, repeated gather calls avoid allocating temporary structures and can benefit from cached chunks.
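A hedged usage sketch, assuming gather_rows accepts a field-ID slice and a row-ID slice and returns a RecordBatch (the concrete signature lives in projection.rs):

```rust
use arrow::record_batch::RecordBatch;

// ColumnStore, Pager, and LogicalFieldId are this crate's types; the
// gather_rows signature shown here is an assumption for illustration.
fn gather_example<P: Pager>(
    store: &ColumnStore<P>,
    fields: &[LogicalFieldId],
) -> Result<RecordBatch, Box<dyn std::error::Error>> {
    let row_ids: Vec<u64> = vec![100, 101, 205];
    // Repeated calls with the same field set reuse a pooled
    // MultiGatherContext, so chunks fetched now stay cached for later calls.
    let batch = store.gather_rows(fields, &row_ids)?;
    assert_eq!(batch.num_rows(), row_ids.len());
    Ok(batch)
}
```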
Sources : llkv-column-map/src/store/projection.rs:758-927 llkv-column-map/src/store/projection.rs:651-720
Filter Operations
The filter_row_ids method (llkv-column-map/src/store/core.rs:356-372) evaluates a predicate against a column and returns matching row IDs. It uses:
- Type-specific dispatch : FilterDispatch trait routes to specialized implementations (primitive types, strings, booleans)
- Chunk pruning : Skip chunks where min_val and max_val don’t overlap the predicate
- Vectorized evaluation : For string predicates like CONTAINS, use Arrow’s vectorized kernels (llkv-column-map/src/store/scan/filter.rs:337-503)
- Dense encoding : Return results as FilterResult with run-length encoding when possible
For example, the fused string filter (llkv-column-map/src/store/scan/filter.rs:337-503) evaluates multiple CONTAINS predicates in a single pass:
- Load value and row ID chunks in parallel (via Rayon)
- Use Arrow’s contains_utf8_scalar kernel for each pattern
- AND results using a bitmask to avoid per-row branching
- Extract matching row IDs from the final bitmask
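The sketch below mirrors that fused pass in self-contained form. The per-pattern bitmask is built with a plain iterator as a stand-in for the contains_utf8_scalar kernel (whose module path varies across arrow versions); the AND fusion uses arrow's boolean kernel.

```rust
use arrow::array::{BooleanArray, StringArray};
use arrow::compute::and;

/// Fused multi-pattern CONTAINS: one bitmask per pattern, ANDed together so
/// matching rows are decided without per-row branching. The iterator-based
/// mask is a stand-in for Arrow's vectorized contains_utf8_scalar kernel.
fn fused_contains(values: &StringArray, patterns: &[&str]) -> BooleanArray {
    let mut mask: Option<BooleanArray> = None;
    for &pattern in patterns {
        let hits: BooleanArray = values
            .iter()
            .map(|v| v.map(|s| s.contains(pattern)))
            .collect();
        mask = Some(match mask {
            Some(prev) => and(&prev, &hits).expect("equal-length arrays"),
            None => hits,
        });
    }
    // With no patterns, every row matches.
    mask.unwrap_or_else(|| BooleanArray::from(vec![true; values.len()]))
}
```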
Sources : llkv-column-map/src/store/core.rs:356-372 llkv-column-map/src/store/scan/filter.rs:209-503
Optimization Techniques
Chunk Pruning via Min/Max Metadata
Each ChunkMetadata stores min_val_u64 and max_val_u64 (llkv-column-map/src/store/descriptor.rs). During filtering or gathering, chunks are skipped if:
- For predicates: The predicate range doesn’t overlap [min_val, max_val]
- For point lookups: The requested row ID is outside [min_val, max_val]
This optimization is particularly effective for:
- Range predicates on sorted or clustered columns
- Temporal queries on timestamp columns (common in OLAP)
- Primary key lookups
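A minimal sketch of the pruning test, over a simplified subset of the metadata fields named above:

```rust
/// Simplified subset of the chunk metadata described above.
struct ChunkMeta {
    min_val_u64: u64,
    max_val_u64: u64,
}

/// Keep only chunks whose [min, max] range can intersect the predicate
/// range [pred_lo, pred_hi]; the rest are skipped without loading data.
fn prune_chunks<'a>(
    chunks: &'a [ChunkMeta],
    pred_lo: u64,
    pred_hi: u64,
) -> impl Iterator<Item = &'a ChunkMeta> {
    chunks
        .iter()
        .filter(move |m| m.max_val_u64 >= pred_lo && m.min_val_u64 <= pred_hi)
}
```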
Sources : llkv-column-map/src/store/core.rs:679-690 llkv-column-map/src/store/projection.rs:1019-1026
Context Pooling
The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains up to 4 contexts per unique field ID set. This amortizes:
- Chunk cache allocations (reuse FxHashMap<PhysicalKey, ArrayRef>)
- Arrow builder allocations (reuse builders with reserved capacity)
- Scratch buffer allocations (reuse Vec for row indices)
Contexts track an epoch (llkv-column-map/src/store/projection.rs:520-526) that is compared against ColumnStore.append_epoch. If epochs mismatch, the context rebuilds its metadata from the latest descriptors.
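A hedged sketch of that handshake; the context fields and method names here are illustrative, not copied from projection.rs:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct GatherContext {
    /// Epoch observed when this context's metadata was last built.
    epoch: u64,
    // ... cached chunk arrays, builders, scratch buffers ...
}

impl GatherContext {
    fn ensure_fresh(&mut self, store_epoch: &AtomicU64) {
        let current = store_epoch.load(Ordering::Acquire);
        if self.epoch != current {
            // An append landed since this context was last used, so cached
            // chunk metadata may be stale: rebuild from the descriptors.
            self.rebuild_from_descriptors();
            self.epoch = current;
        }
    }

    fn rebuild_from_descriptors(&mut self) {
        // Placeholder: reload ChunkMetadata for each tracked field.
    }
}
```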
Sources : llkv-column-map/src/store/projection.rs:651-720 llkv-column-map/src/store/projection.rs:460-526
Data Type Caching
The DTypeCache (llkv-column-map/src/store/core.rs:64) stores Arrow DataType for each LogicalFieldId to avoid repeated descriptor loads during schema queries. It uses a fingerprint stored in the descriptor to detect type changes (e.g., from ALTER TABLE operations).
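A sketch of the cache shape, assuming a u64 fingerprint and a map keyed by field ID (both assumptions):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

use arrow::datatypes::DataType;

type LogicalFieldId = u64; // stand-in alias

/// Hypothetical cache shape: the fingerprint detects descriptor changes
/// (e.g. from ALTER TABLE) without reloading the descriptor on every query.
struct DTypeCache {
    entries: RwLock<HashMap<LogicalFieldId, (u64, DataType)>>,
}

impl DTypeCache {
    /// Cached type if the fingerprint still matches; None means the caller
    /// must reload the descriptor and reinsert.
    fn get(&self, field: LogicalFieldId, fingerprint: u64) -> Option<DataType> {
        match self.entries.read().unwrap().get(&field) {
            Some((fp, dtype)) if *fp == fingerprint => Some(dtype.clone()),
            _ => None,
        }
    }

    fn insert(&self, field: LogicalFieldId, fingerprint: u64, dtype: DataType) {
        self.entries.write().unwrap().insert(field, (fingerprint, dtype));
    }
}
```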
Sources : llkv-column-map/src/store/core.rs:175-180 llkv-column-map/src/store/core.rs:190-227
Indexes
The IndexManager (llkv-column-map/src/store/core.rs:65) supports two index types:
| Index Type | Purpose | Storage |
|---|---|---|
| Presence | Tracks which row IDs exist | Bitmap or sorted permutation array |
| Value | Enables sorted iteration by value | Permutation array mapping value order to physical order |
Indexes are created via register_index (llkv-column-map/src/store/core.rs:145-147) and maintained automatically during append operations. The has_row_id method (llkv-column-map/src/store/core.rs:663-754) uses presence indexes for fast lookups via binary search.
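The presence lookup reduces to a binary search when the index stores sorted row IDs, as in this minimal sketch:

```rust
/// Minimal sketch: a presence index over sorted row IDs answers
/// has_row_id-style queries with a binary search.
fn has_row_id(sorted_row_ids: &[u64], target: u64) -> bool {
    sorted_row_ids.binary_search(&target).is_ok()
}

fn main() {
    let present = [3u64, 7, 10, 42];
    assert!(has_row_id(&present, 10));
    assert!(!has_row_id(&present, 11));
}
```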
Sources : llkv-column-map/src/store/core.rs:145-165 llkv-column-map/src/store/core.rs:663-754
Data Flow: Append Operation
Purpose : This flowchart shows the detailed append logic. Each column is processed independently (LWW rewrite, filter, chunk), then all writes are flushed atomically.
Sources : llkv-column-map/src/store/core.rs:787-1633
Data Flow: Gather Operation
Purpose : This flowchart shows the gather operation with context reuse. The epoch check ensures cache validity, and the dense pattern optimization avoids hash lookups for contiguous row IDs.
Sources : llkv-column-map/src/store/projection.rs:929-1195
Configuration and Tuning
The ColumnStoreConfig (llkv-column-map/src/store/core.rs:63) provides tuning parameters:
- Target chunk size : Defaults guide chunk splitting during append
- Write hints : Returned via write_hints() (llkv-column-map/src/store/core.rs:126-129) to inform upstream batch sizing
The GatherContextPool (llkv-column-map/src/store/projection.rs:658-663) maintains up to max_per_key contexts per field set (currently 4). Increasing this value trades memory for reduced context allocation frequency in highly concurrent workloads.
Sources : llkv-column-map/src/store/core.rs:126-129 llkv-column-map/src/store/projection.rs:658-663
Thread Safety
ColumnStore is Send + Sync (llkv-column-map/src/store/core.rs:49). Internal state uses:
- Arc<RwLock<ColumnCatalog>> : Catalog allows concurrent reads, exclusive writes
- Arc<Pager> : Pager implementations must be Send + Sync
- Arc<AtomicU64> : Epoch counter supports lock-free reads
This enables concurrent query execution across multiple threads while serializing metadata updates (e.g., append, remove_column).
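A hedged sketch of that pattern, assuming this crate's ColumnStore, Pager, and LogicalFieldId types; the scan body is left to a caller-supplied closure because the predicate API is documented above rather than reproduced here:

```rust
use std::sync::Arc;
use std::thread;

// Because ColumnStore<P> is Send + Sync, one Arc'd instance can serve scans
// from several threads while appends serialize on the catalog's write lock.
fn parallel_scans<P, F>(store: Arc<ColumnStore<P>>, fields: Vec<LogicalFieldId>, scan: F)
where
    P: Pager + Send + Sync,
    F: Fn(&ColumnStore<P>, LogicalFieldId) + Copy + Send,
{
    thread::scope(|s| {
        for field in fields {
            let store = Arc::clone(&store);
            // Each thread might run, e.g., store.filter_row_ids(field, ...).
            s.spawn(move || scan(&store, field));
        }
    });
}
```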
Sources : llkv-column-map/src/store/core.rs:49-85