Storage Layer
Relevant source files
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/core.rs
- llkv-column-map/src/store/projection.rs
- llkv-column-map/src/store/scan/filter.rs
- llkv-sql/tests/pager_io_tests.rs
The Storage Layer manages the persistence and retrieval of columnar data in LLKV. It bridges Apache Arrow’s in-memory columnar format with a key-value storage backend, providing efficient operations for appends, updates, deletes, and scans while maintaining data integrity through MVCC semantics.
This page provides an architectural overview of the storage subsystem. For details on specific components, see:
- Table Abstraction - High-level table operations and API
- Column Storage and ColumnStore - Physical column storage implementation
- Pager Interface and SIMD Optimization - Key-value persistence layer
Sources: llkv-column-map/src/store/core.rs:1-68 llkv-column-map/Cargo.toml:1-35
Architecture Overview
The storage layer consists of three primary components arranged in a hierarchy from logical to physical representation:
```mermaid
graph TB
    subgraph "Arrow Layer"
        RB["RecordBatch\nColumnar Arrays"]
        SCHEMA["Schema\nField Metadata"]
        ARRAYS["Arrow Arrays\n(Int64, Utf8, etc)"]
    end
    subgraph "Column Store Layer"
        CS["ColumnStore\nllkv-column-map"]
        CATALOG["ColumnCatalog\nLogicalFieldId → PhysicalKey"]
        DESC["ColumnDescriptor\nLinked List of Chunks"]
        META["ChunkMetadata\nmin/max, row_count"]
        CHUNKS["Serialized Chunks\nArrow IPC Format"]
    end
    subgraph "Pager Layer"
        PAGER["Pager Trait\nbatch_get/batch_put"]
        BATCH_OPS["BatchGet/BatchPut\nOperations"]
    end
    subgraph "Physical Storage"
        SIMD["simd-r-drive\nMemory-Mapped KV Store"]
        HANDLES["EntryHandle\nByte Blobs"]
    end
    RB --> CS
    SCHEMA --> CS
    ARRAYS --> CHUNKS
    CS --> CATALOG
    CS --> DESC
    DESC --> META
    META --> CHUNKS
    CS --> PAGER
    PAGER --> BATCH_OPS
    BATCH_OPS --> SIMD
    SIMD --> HANDLES
    CHUNKS -.serialized as.-> HANDLES
    CATALOG -.persisted via.-> PAGER
    DESC -.persisted via.-> PAGER
```

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/Cargo.toml:21-33
Columnar Storage Model
Logical Field Identification
Columns are identified by LogicalFieldId, a composite key containing:
| Component | Type | Purpose |
|---|---|---|
| Namespace | LogicalStorageNamespace | Separates user data from internal columns (row IDs, MVCC metadata) |
| Table ID | TableId | Identifies the owning table |
| Field ID | FieldId | Column position within the table schema |
Namespace Values:
- `UserData` - Regular table columns
- `RowIdShadow` - Internal row ID tracking
- `TxnCreatedBy` - MVCC transaction creation timestamps
- `TxnDeletedBy` - MVCC transaction deletion timestamps
This namespacing prevents collisions between user-visible columns and internal bookkeeping while enabling efficient multi-column operations within a table.
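A minimal sketch of such a composite key, with names taken from the table above (the actual `LogicalFieldId` layout in llkv-column-map may be packed differently):

```rust
// Hypothetical sketch of a composite column key; the real LogicalFieldId
// in llkv-column-map may pack these fields differently.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Namespace {
    UserData,
    RowIdShadow,
    TxnCreatedBy,
    TxnDeletedBy,
}

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId {
    namespace: Namespace,
    table_id: u32, // TableId
    field_id: u32, // FieldId
}

fn main() {
    let user_col = LogicalFieldId {
        namespace: Namespace::UserData,
        table_id: 1,
        field_id: 0,
    };
    let shadow_col = LogicalFieldId {
        namespace: Namespace::RowIdShadow,
        ..user_col
    };
    // Same table and field, but the namespace keeps the keys distinct.
    assert_ne!(user_col, shadow_col);
}
```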
Sources: llkv-column-map/src/store/core.rs:36-46
Chunk Organization
Data is stored in chunks, which are serialized Arrow arrays. Each chunk contains:
ChunkMetadata Fields:
| Field | Type | Purpose |
|---|---|---|
| `chunk_pk` | PhysicalKey | Storage location of value array |
| `min_val_u64` / `max_val_u64` | u64 | Value range for predicate pruning |
| `row_count` | u64 | Number of rows in chunk |
| `null_count` | u64 | Number of null values |
| `serialized_bytes` | u64 | Size of serialized array |
| `value_order_perm_pk` | PhysicalKey | Optional sort index for fast lookups |
The metadata enables chunk pruning: evaluating predicates against min/max values to skip entire chunks without loading data.
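A sketch of the pruning test itself, assuming the `min_val_u64`/`max_val_u64` fields from the table above (simplified stand-in for the real `ChunkMetadata`):

```rust
// Simplified stand-in for the real ChunkMetadata.
struct ChunkMetadata {
    min_val_u64: u64,
    max_val_u64: u64,
}

/// A chunk must be loaded only if its [min, max] range can intersect
/// the predicate range [pred_min, pred_max].
fn chunk_may_match(meta: &ChunkMetadata, pred_min: u64, pred_max: u64) -> bool {
    !(meta.max_val_u64 < pred_min || meta.min_val_u64 > pred_max)
}

fn main() {
    let meta = ChunkMetadata { min_val_u64: 100, max_val_u64: 200 };
    assert!(chunk_may_match(&meta, 150, 300)); // overlaps: load chunk
    assert!(!chunk_may_match(&meta, 201, 300)); // disjoint: skip chunk
}
```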
Sources: llkv-column-map/src/store/core.rs:1-6 llkv-column-map/src/store/descriptor.rs
Descriptor Pages
A ColumnDescriptor organizes chunks into a linked list of metadata pages:
Each descriptor page contains:
- Header (`DescriptorPageHeader`): `entry_count`, `next_page_pk`
- Entries: Array of `ChunkMetadata` structures
Appends extend the tail page; when full, a new page is allocated and linked.
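A sketch of that tail-append behavior under assumed in-memory types (the real pages are serialized blobs addressed by `PhysicalKey`):

```rust
// Simplified stand-ins: real pages are serialized and linked by PhysicalKey.
#[derive(Default)]
struct DescriptorPage {
    entries: Vec<u64>,         // stand-in for ChunkMetadata entries
    next_page_pk: Option<u64>, // link to the next page, if any
}

fn append_chunk_meta(pages: &mut Vec<DescriptorPage>, meta: u64, capacity: usize) {
    if let Some(tail) = pages.last_mut() {
        if tail.entries.len() < capacity {
            tail.entries.push(meta); // common case: tail page has room
            return;
        }
    }
    // Tail is full (or no pages exist yet): allocate a fresh page and
    // link the old tail to it. The index stands in for a pager-allocated key.
    let new_pk = pages.len() as u64;
    if let Some(tail) = pages.last_mut() {
        tail.next_page_pk = Some(new_pk);
    }
    pages.push(DescriptorPage { entries: vec![meta], next_page_pk: None });
}
```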
Sources: llkv-column-map/src/store/descriptor.rs
Data Flow: Append Operation
The append path implements Last-Write-Wins semantics for row ID conflicts:
Key Steps:
1. Pre-sort by row ID llkv-column-map/src/store/core.rs:798-847 - Ensures efficient metadata updates and naturally sorted shadow columns
2. LWW Rewrite llkv-column-map/src/store/core.rs:893-901 - Updates existing row IDs in-place
3. Filter llkv-column-map/src/store/core.rs:918-942 - Removes rewritten rows and nulls
4. Chunk & Serialize llkv-column-map/src/store/core.rs:1032-1114 - Splits into target-sized chunks, serializes to Arrow IPC
5. Atomic Batch Put llkv-column-map/src/store/core.rs:1116-1132 - Commits all writes atomically
6. Epoch Increment llkv-column-map/src/store/core.rs:1133-1134 - Invalidates gather context caches
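A condensed sketch of steps 1-3 under assumed in-memory types (the real path operates on Arrow arrays and shadow row-id columns):

```rust
use std::collections::HashSet;

/// Sort incoming rows by row id, route conflicting row ids to LWW
/// rewrites, drop nulls, and keep the rest as fresh appends.
fn prepare_append(
    mut incoming: Vec<(u64, Option<i64>)>, // (row_id, value); None = null
    existing: &HashSet<u64>,
) -> (Vec<(u64, i64)>, Vec<(u64, i64)>) {
    incoming.sort_by_key(|&(row_id, _)| row_id); // step 1: pre-sort
    let (mut rewrites, mut appends) = (Vec::new(), Vec::new());
    for (row_id, value) in incoming {
        let Some(value) = value else { continue }; // nulls are not stored
        if existing.contains(&row_id) {
            rewrites.push((row_id, value)); // step 2: LWW in-place rewrite
        } else {
            appends.push((row_id, value)); // step 3: genuinely new rows
        }
    }
    (rewrites, appends)
}
```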
Sources: llkv-column-map/src/store/core.rs:758-1137
Data Flow: Gather Operation
Gathering assembles a RecordBatch from a list of row IDs by fetching values from chunks:

```mermaid
sequenceDiagram
    participant User
    participant CS as ColumnStore
    participant CTX as MultiGatherContext
    participant Cache as Chunk Cache
    participant Pager
    User->>CS: gather_rows(field_ids, row_ids)
    CS->>CTX: Acquire context from pool
    CS->>CS: Build FieldPlans\n(value_metas, row_metas)
    Note over CS,CTX: Phase 1: Identify Candidate Chunks
    loop For each field
        CS->>CTX: Filter chunks by row_id range
        CTX->>CTX: candidate_indices.push(idx)
    end
    Note over CS,Cache: Phase 2: Fetch Chunks
    CS->>Cache: Check cache for chunk_pks
    CS->>Pager: batch_get(missing_chunks)
    Pager-->>CS: EntryHandle[]
    CS->>CS: deserialize_array(bytes)
    CS->>Cache: Insert arrays
    Note over CS,CTX: Phase 3: Build Row Index
    CS->>CTX: row_index[row_id] = output_idx
    Note over CS,CTX: Phase 4: Gather Values
    loop For each column
        loop For each row_id
            CTX->>Cache: Lookup value array
            CTX->>CTX: Find row in chunk\nvia binary search
            CTX->>CTX: builder.append_value()
        end
        CTX->>CTX: Finish builder → ArrayRef
    end
    CS->>CS: RecordBatch::try_new(schema, arrays)
    CS-->>User: RecordBatch
    CTX->>CS: Return context to pool
```
Optimizations:
| Optimization | Location | Description |
|---|---|---|
| Context Pooling | llkv-column-map/src/store/projection.rs:651-721 | Reuses MultiGatherContext across calls to amortize allocation costs |
| Chunk Caching | llkv-column-map/src/store/projection.rs:1032-1053 | Caches deserialized Arrow arrays by PhysicalKey |
| Candidate Pruning | llkv-column-map/src/store/projection.rs:1015-1027 | Only loads chunks overlapping the requested row ID range |
| Dense Fast Path | llkv-column-map/src/store/projection.rs:984-1011 | For contiguous row IDs, uses offset arithmetic instead of hash lookup |
| Single Chunk Slicing | llkv-column-map/src/store/projection.rs:303-329 | If all rows in one sorted chunk, returns array.slice() |
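A minimal sketch of the per-row lookup in Phase 4, assuming plain vectors in place of Arrow arrays and builders:

```rust
/// Locate each requested row id with a binary search over a chunk's
/// sorted row-id column, then copy out the value.
fn gather_values(
    chunks: &[(Vec<u64>, Vec<i64>)], // (sorted row_ids, values) per chunk
    row_ids: &[u64],
) -> Vec<Option<i64>> {
    row_ids
        .iter()
        .map(|rid| {
            chunks
                .iter()
                .find_map(|(ids, vals)| ids.binary_search(rid).ok().map(|pos| vals[pos]))
        })
        .collect() // None = value missing → surfaced as a null
}
```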
Sources: llkv-column-map/src/store/projection.rs:772-960 llkv-column-map/src/store/core.rs:758-785
Data Flow: Filter Operation
Filtering evaluates predicates and returns matching row IDs:
Filter Execution Path:
1. Descriptor Load llkv-column-map/src/store/scan/filter.rs:1301-1333 - Fetch column descriptor and iterate metadata
2. Metadata Pruning llkv-column-map/src/store/scan/filter.rs:1336-1353 - Skip chunks where `min_val > predicate_max` or `max_val < predicate_min`
3. Chunk Fetch llkv-column-map/src/store/scan/filter.rs:1355-1367 - Batch-get overlapping chunk arrays
4. Visitor Evaluation llkv-column-map/src/store/scan/filter.rs:1369-1394 - Type-specialized visitor applies predicate
5. Result Encoding llkv-column-map/src/store/scan/filter.rs:506-591 - Build `FilterResult` with run-length encoding or sparse list
Visitor Pattern Example (FilterVisitor<T, F>):
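A minimal sketch of the shape, with hypothetical field and method names (the real visitor in llkv-column-map/src/store/scan/filter.rs operates on Arrow arrays):

```rust
use std::marker::PhantomData;

// The predicate closure F is monomorphized together with the value type T,
// so the inner loop compiles to a tight, type-specialized scan rather than
// dispatching per value.
struct FilterVisitor<T, F: Fn(&T) -> bool> {
    predicate: F,
    matches: Vec<u64>, // row ids that satisfied the predicate
    _marker: PhantomData<T>,
}

impl<T, F: Fn(&T) -> bool> FilterVisitor<T, F> {
    fn visit_chunk(&mut self, row_ids: &[u64], values: &[T]) {
        for (rid, value) in row_ids.iter().zip(values) {
            if (self.predicate)(value) {
                self.matches.push(*rid);
            }
        }
    }
}
```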
The visitor pattern enables type-specialized hot loops while maintaining a generic filter interface.
Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:506-591 llkv-column-map/src/store/scan/filter.rs:605-649
Index Management
The storage layer supports two index types for accelerated lookups:
Index Operations:
| Method | Purpose | Location |
|---|---|---|
| `register_index` | Creates and persists an index for a column | llkv-column-map/src/store/core.rs:145-147 |
| `unregister_index` | Removes an index and frees storage | llkv-column-map/src/store/core.rs:163-165 |
| `list_persisted_indexes` | Queries existing indexes for a column | llkv-column-map/src/store/core.rs:408-425 |
Presence Index Usage llkv-column-map/src/store/core.rs:663-754:
- Stored in `ChunkMetadata.value_order_perm_pk`
- Binary search over permuted view of row IDs
- Accelerates `has_row_id` checks
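A one-function sketch of that binary search, assuming a precomputed sort permutation `perm` (hypothetical helper):

```rust
/// perm[i] is the position of the i-th smallest row id, so a binary
/// search works on the permuted view without re-sorting the column.
fn has_row_id(row_ids: &[u64], perm: &[usize], target: u64) -> bool {
    perm.binary_search_by_key(&target, |&i| row_ids[i]).is_ok()
}
```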
Sources: llkv-column-map/src/store/core.rs:136-165 llkv-column-map/src/store/core.rs:663-754
Storage Optimizations
Chunk Size Tuning
The ColumnStoreConfig provides tuning parameters:
| Parameter | Default | Purpose |
|---|---|---|
| `target_chunk_rows` | 8192 | Target rows per chunk for new appends |
| `min_chunk_rows` | 2048 | Minimum before considering compaction |
| `max_chunk_rows` | 32768 | Maximum before forcing a split |
Smaller chunks enable finer-grained pruning but increase metadata overhead. The defaults balance scan performance with storage efficiency.
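A sketch of the config struct implied by the table, with field names assumed from the parameter column (see llkv-column-map/src/store/config.rs for the real definition):

```rust
struct ColumnStoreConfig {
    target_chunk_rows: usize,
    min_chunk_rows: usize,
    max_chunk_rows: usize,
}

impl Default for ColumnStoreConfig {
    fn default() -> Self {
        Self {
            target_chunk_rows: 8192, // chunk size for new appends
            min_chunk_rows: 2048,    // below this, a chunk is compaction-eligible
            max_chunk_rows: 32768,   // above this, a chunk is split
        }
    }
}
```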
Sources: llkv-column-map/src/store/config.rs
Last-Write-Wins Semantics
When appending rows with existing row IDs:
- Scan existing chunks llkv-column-map/src/store/core.rs:893-901 to identify conflicting row IDs
- Overwrite in-place llkv-column-map/src/store/lww.rs within the chunk containing the old value
- Append new rows llkv-column-map/src/store/core.rs:918-942 that don’t conflict
This avoids duplicates and provides update semantics without explicit UPDATE statements.
Sources: llkv-column-map/src/store/core.rs:893-942 llkv-column-map/src/store/lww.rs
Null Handling
Null values are not stored in value chunks llkv-column-map/src/store/core.rs:920-931:
- Filtered out during append
- Represented by absence in the shadow row ID column
- Surfaced as nulls during gather if row ID exists but value is missing
This reduces storage footprint for sparse columns.
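A sketch of how absence maps to NULL at gather time, using assumed in-memory types (hypothetical helper):

```rust
use std::collections::HashMap;

/// A row id present in the shadow column but absent from the value
/// chunk is reported as NULL.
fn value_or_null(
    shadow_row_ids: &[u64],     // sorted shadow row-id column
    values: &HashMap<u64, i64>, // stored (non-null) values
    rid: u64,
) -> Option<Option<i64>> {
    if shadow_row_ids.binary_search(&rid).is_err() {
        return None; // row does not exist at all
    }
    Some(values.get(&rid).copied()) // Some(None) = row exists, value is NULL
}
```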
Sources: llkv-column-map/src/store/core.rs:920-931
Compaction (Future)
The current implementation supports incremental writes but defers compaction:
- Small chunks accumulate over time
- Metadata overhead increases
- Scan performance degrades for highly-updated columns
Future work will implement background compaction to merge small chunks and rebuild optimal chunk sizes.
Sources: llkv-column-map/src/store/compact.rs
Component Relationships
The storage layer integrates with upstream and downstream components:
Key Interfaces:
- Table → ColumnStore: High-level append/gather operations llkv-table/src/table.rs
- ScanBuilder → ColumnStore: Predicate evaluation and filtering llkv-scan/src/scan_builder.rs
- ColumnStore → Pager: Batched key-value operations llkv-storage/src/pager.rs
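A sketch of the batched key-value interface implied by the diagram's `batch_get`/`batch_put` labels (the real trait in llkv-storage/src/pager.rs differs in naming and operation types):

```rust
use std::io;

trait Pager {
    type Handle: AsRef<[u8]>; // e.g. a memory-mapped entry handle

    /// Fetch many physical keys in one round trip.
    fn batch_get(&self, keys: &[u64]) -> io::Result<Vec<Option<Self::Handle>>>;

    /// Persist many (key, blob) pairs atomically.
    fn batch_put(&mut self, entries: Vec<(u64, Vec<u8>)>) -> io::Result<()>;
}
```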
Sources: llkv-column-map/src/store/core.rs:60-68 llkv-storage/src/pager.rs
Thread Safety
ColumnStore is Send + Sync and designed for concurrent access:
- Catalog: `Arc<RwLock<ColumnCatalog>>` - Read-heavy workload, allows concurrent reads
- Caches: `RwLock` for data type and index metadata
- Append Epoch: `AtomicU64` for cache invalidation signaling
- Context Pool: `Mutex<FxHashMap<...>>` for gather context reuse
Concurrent appends to different tables are lock-free at the ColumnStore level. Appends to the same column serialize on the catalog write lock.
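A simplified sketch of that synchronization layout (field types are stand-ins; the real store uses `FxHashMap` and richer value types):

```rust
use std::collections::HashMap;
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Mutex, RwLock,
};

struct ColumnStoreSync {
    catalog: RwLock<HashMap<u64, u64>>, // LogicalFieldId → PhysicalKey
    append_epoch: AtomicU64,            // bumped once per committed append
    context_pool: Mutex<Vec<()>>,       // pooled gather contexts
}

impl ColumnStoreSync {
    fn commit_append(&self) {
        // Appends to the same column serialize here; readers proceed
        // concurrently under the read lock between commits.
        let _catalog = self.catalog.write().unwrap();
        // The epoch bump tells pooled gather contexts to drop stale caches.
        self.append_epoch.fetch_add(1, Ordering::Release);
    }
}
```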
Sources: llkv-column-map/src/store/core.rs:47-68
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Append (sorted) | O(n) | n = rows; includes serialization and pager writes |
| Append (unsorted) | O(n log n) | Requires lexicographic sort |
| Gather (random) | O(m · k) | m = row count, k = avg chunk scan per row |
| Gather (dense) | O(m) | Contiguous row IDs enable offset arithmetic |
| Filter (full scan) | O(c) | c = total chunks; metadata pruning reduces effective c |
| Filter (indexed) | O(log c) | With presence/value indexes |
Batching Benefits llkv-sql/tests/pager_io_tests.rs:18-120:
- INSERT: ~8 allocations, 24 puts for 3 rows (2 columns + row ID + MVCC)
- SELECT: ~36 gets in 23 batches (descriptors + chunks + metadata)
- DELETE: ~6 puts, 46 gets (read for tombstones, write MVCC columns)
Sources: llkv-column-map/src/store/core.rs llkv-sql/tests/pager_io_tests.rs:18-120