Storage Layer

The Storage Layer manages the persistence and retrieval of columnar data in LLKV. It bridges Apache Arrow’s in-memory columnar format with a key-value storage backend, providing efficient operations for appends, updates, deletes, and scans while maintaining data integrity through MVCC semantics.

This page provides an architectural overview of the storage subsystem. For details on specific components, see the dedicated component pages.

Sources: llkv-column-map/src/store/core.rs:1-68 llkv-column-map/Cargo.toml:1-35

Architecture Overview

The storage layer consists of three primary components arranged in a hierarchy from logical to physical representation:

```mermaid
graph TB
    subgraph "Arrow Layer"
        RB["RecordBatch\nColumnar Arrays"]
        SCHEMA["Schema\nField Metadata"]
        ARRAYS["Arrow Arrays\n(Int64, Utf8, etc)"]
    end

    subgraph "Column Store Layer"
        CS["ColumnStore\nllkv-column-map"]
        CATALOG["ColumnCatalog\nLogicalFieldId → PhysicalKey"]
        DESC["ColumnDescriptor\nLinked List of Chunks"]
        META["ChunkMetadata\nmin/max, row_count"]
        CHUNKS["Serialized Chunks\nArrow IPC Format"]
    end

    subgraph "Pager Layer"
        PAGER["Pager Trait\nbatch_get/batch_put"]
        BATCH_OPS["BatchGet/BatchPut\nOperations"]
    end

    subgraph "Physical Storage"
        SIMD["simd-r-drive\nMemory-Mapped KV Store"]
        HANDLES["EntryHandle\nByte Blobs"]
    end

    RB --> CS
    SCHEMA --> CS
    ARRAYS --> CHUNKS

    CS --> CATALOG
    CS --> DESC
    DESC --> META
    META --> CHUNKS

    CS --> PAGER
    PAGER --> BATCH_OPS
    BATCH_OPS --> SIMD
    SIMD --> HANDLES
    CHUNKS -.serialized as.-> HANDLES

    CATALOG -.persisted via.-> PAGER
    DESC -.persisted via.-> PAGER
```

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/Cargo.toml:21-33

Columnar Storage Model

Logical Field Identification

Columns are identified by LogicalFieldId, a composite key containing:

| Component | Type | Purpose |
|---|---|---|
| Namespace | LogicalStorageNamespace | Separates user data from internal columns (row IDs, MVCC metadata) |
| Table ID | TableId | Identifies the owning table |
| Field ID | FieldId | Column position within the table schema |

Namespace Values:

  • UserData - Regular table columns
  • RowIdShadow - Internal row ID tracking
  • TxnCreatedBy - MVCC transaction creation timestamps
  • TxnDeletedBy - MVCC transaction deletion timestamps

This namespacing prevents collisions between user-visible columns and internal bookkeeping while enabling efficient multi-column operations within a table.
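A minimal sketch of how such a composite key can be modeled; the field names and integer widths here are assumptions for illustration, not the exact llkv-column-map definitions:

```rust
// Hypothetical model of LogicalFieldId; the real definitions live in
// llkv-column-map and may use different field names and widths.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum LogicalStorageNamespace {
    UserData,     // regular table columns
    RowIdShadow,  // internal row ID tracking
    TxnCreatedBy, // MVCC creation timestamps
    TxnDeletedBy, // MVCC deletion timestamps
}

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct LogicalFieldId {
    namespace: LogicalStorageNamespace,
    table_id: u64, // TableId
    field_id: u32, // FieldId
}

fn main() {
    // The MVCC bookkeeping column for (table 7, field 2) can never collide
    // with the user-visible column at the same position.
    let user = LogicalFieldId {
        namespace: LogicalStorageNamespace::UserData,
        table_id: 7,
        field_id: 2,
    };
    let mvcc = LogicalFieldId {
        namespace: LogicalStorageNamespace::TxnCreatedBy,
        ..user
    };
    assert_ne!(user, mvcc);
}
```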

Sources: llkv-column-map/src/store/core.rs:36-46

Chunk Organization

Data is stored in chunks, which are serialized Arrow arrays. Each chunk is described by the following ChunkMetadata fields:

| Field | Type | Purpose |
|---|---|---|
| chunk_pk | PhysicalKey | Storage location of value array |
| min_val_u64 / max_val_u64 | u64 | Value range for predicate pruning |
| row_count | u64 | Number of rows in chunk |
| null_count | u64 | Number of null values |
| serialized_bytes | u64 | Size of serialized array |
| value_order_perm_pk | PhysicalKey | Optional sort index for fast lookups |
The metadata enables chunk pruning: evaluating predicates against min/max values to skip entire chunks without loading data.
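A minimal sketch of the pruning test, using the min/max fields from the table above (the real check lives in the scan code and handles typed values and nulls):

```rust
// Illustrative chunk-pruning check; field names follow the table above.
struct ChunkMetadata {
    min_val_u64: u64,
    max_val_u64: u64,
}

/// A chunk can be skipped outright when its value range cannot intersect
/// the predicate range [pred_min, pred_max].
fn can_prune(meta: &ChunkMetadata, pred_min: u64, pred_max: u64) -> bool {
    meta.min_val_u64 > pred_max || meta.max_val_u64 < pred_min
}

fn main() {
    let chunk = ChunkMetadata { min_val_u64: 100, max_val_u64: 200 };
    assert!(can_prune(&chunk, 300, 400));  // disjoint: skip without loading
    assert!(!can_prune(&chunk, 150, 160)); // overlaps: chunk must be scanned
}
```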

Sources: llkv-column-map/src/store/core.rs:1-6 llkv-column-map/src/store/descriptor.rs

Descriptor Pages

A ColumnDescriptor organizes chunks into a linked list of metadata pages.

Each descriptor page contains:

  • Header (DescriptorPageHeader): entry_count, next_page_pk
  • Entries: Array of ChunkMetadata structures

Appends extend the tail page; when full, a new page is allocated and linked.
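A sketch of that tail-extension logic under assumed names and an assumed per-page capacity; the actual layout is defined in store/descriptor.rs:

```rust
// Toy model of descriptor pages; ENTRIES_PER_PAGE and the use of a Vec in
// place of pager-allocated pages are illustrative assumptions.
const ENTRIES_PER_PAGE: usize = 128;

struct ChunkMetadata; // stands in for the real min/max + row_count record

struct DescriptorPage {
    entries: Vec<ChunkMetadata>,
    next_page_pk: Option<u64>, // PhysicalKey of the next page, if linked
}

/// Append chunk metadata: extend the tail page, or allocate and link a new one.
fn append_meta(pages: &mut Vec<DescriptorPage>, meta: ChunkMetadata) {
    if let Some(tail) = pages.last_mut() {
        if tail.entries.len() < ENTRIES_PER_PAGE {
            tail.entries.push(meta);
            return;
        }
    }
    let new_pk = pages.len() as u64; // stand-in for a freshly allocated key
    if let Some(tail) = pages.last_mut() {
        tail.next_page_pk = Some(new_pk); // link the old tail to the new page
    }
    pages.push(DescriptorPage { entries: vec![meta], next_page_pk: None });
}

fn main() {
    let mut pages = Vec::new();
    for _ in 0..(ENTRIES_PER_PAGE + 1) {
        append_meta(&mut pages, ChunkMetadata);
    }
    assert_eq!(pages.len(), 2); // the 129th entry forced a second, linked page
    assert_eq!(pages[0].next_page_pk, Some(1));
}
```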

Sources: llkv-column-map/src/store/descriptor.rs

Data Flow: Append Operation

The append path implements Last-Write-Wins (LWW) semantics for row ID conflicts.

Key Steps:

  1. Pre-sort by row ID (llkv-column-map/src/store/core.rs:798-847): ensures efficient metadata updates and naturally sorted shadow columns
  2. LWW rewrite (llkv-column-map/src/store/core.rs:893-901): updates existing row IDs in place
  3. Filter (llkv-column-map/src/store/core.rs:918-942): removes rewritten rows and nulls
  4. Chunk & serialize (llkv-column-map/src/store/core.rs:1032-1114): splits data into target-sized chunks and serializes them to Arrow IPC (sketched below)
  5. Atomic batch put (llkv-column-map/src/store/core.rs:1116-1132): commits all writes atomically
  6. Epoch increment (llkv-column-map/src/store/core.rs:1133-1134): invalidates gather context caches
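Step 4 boils down to writing each chunk's arrays through Arrow's IPC writer. A minimal sketch using the arrow crate's stream format (an assumption: the store may use a different IPC mode or writer configuration):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Serialize one chunk-sized batch to Arrow IPC bytes.
fn serialize_chunk(batch: &RecordBatch) -> arrow::error::Result<Vec<u8>> {
    let mut buf = Vec::new();
    let mut writer = StreamWriter::try_new(&mut buf, &batch.schema())?;
    writer.write(batch)?;
    writer.finish()?;
    drop(writer); // release the writer's borrow of buf
    Ok(buf)
}

fn main() -> arrow::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let values: ArrayRef = Arc::new(Int64Array::from(vec![1_i64, 2, 3]));
    let batch = RecordBatch::try_new(schema, vec![values])?;
    let bytes = serialize_chunk(&batch)?;
    assert!(!bytes.is_empty()); // these bytes become the pager's value blob
    Ok(())
}
```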

Sources: llkv-column-map/src/store/core.rs:758-1137

Data Flow: Gather Operation

Gathering assembles a RecordBatch from a list of row IDs by fetching values from chunks:

```mermaid
sequenceDiagram
    participant User
    participant CS as ColumnStore
    participant CTX as MultiGatherContext
    participant Cache as Chunk Cache
    participant Pager

    User->>CS: gather_rows(field_ids, row_ids)
    CS->>CTX: Acquire context from pool
    CS->>CS: Build FieldPlans\n(value_metas, row_metas)

    Note over CS,CTX: Phase 1: Identify Candidate Chunks
    loop For each field
        CS->>CTX: Filter chunks by row_id range
        CTX->>CTX: candidate_indices.push(idx)
    end

    Note over CS,Cache: Phase 2: Fetch Chunks
    CS->>Cache: Check cache for chunk_pks
    CS->>Pager: batch_get(missing_chunks)
    Pager-->>CS: EntryHandle[]
    CS->>CS: deserialize_array(bytes)
    CS->>Cache: Insert arrays

    Note over CS,CTX: Phase 3: Build Row Index
    CS->>CTX: row_index[row_id] = output_idx

    Note over CS,CTX: Phase 4: Gather Values
    loop For each column
        loop For each row_id
            CTX->>Cache: Lookup value array
            CTX->>CTX: Find row in chunk\nvia binary search
            CTX->>CTX: builder.append_value()
        end
        CTX->>CTX: Finish builder → ArrayRef
    end

    CS->>CS: RecordBatch::try_new(schema, arrays)
    CS-->>User: RecordBatch
    CTX->>CS: Return context to pool
```

Optimizations:

| Optimization | Location | Description |
|---|---|---|
| Context Pooling | llkv-column-map/src/store/projection.rs:651-721 | Reuses MultiGatherContext across calls to amortize allocation costs |
| Chunk Caching | llkv-column-map/src/store/projection.rs:1032-1053 | Caches deserialized Arrow arrays by PhysicalKey |
| Candidate Pruning | llkv-column-map/src/store/projection.rs:1015-1027 | Only loads chunks overlapping the requested row ID range |
| Dense Fast Path | llkv-column-map/src/store/projection.rs:984-1011 | For contiguous row IDs, uses offset arithmetic instead of hash lookup (sketched below) |
| Single Chunk Slicing | llkv-column-map/src/store/projection.rs:303-329 | If all rows fall in one sorted chunk, returns array.slice() |
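To illustrate the dense fast path: when the requested row IDs form a contiguous run, each output position is pure offset arithmetic. A toy version over a single chunk (the real code in projection.rs works over Arrow arrays and chunk boundaries):

```rust
// Toy dense-path gather over one chunk; illustrative only.
fn is_dense(row_ids: &[u64]) -> bool {
    row_ids.windows(2).all(|w| w[1] == w[0] + 1)
}

/// For a contiguous run of row IDs, the slice offset is just
/// `first_requested - first_in_chunk`; no per-row hash lookups.
fn gather_dense(values: &[i64], chunk_first_row: u64, row_ids: &[u64]) -> Vec<i64> {
    debug_assert!(is_dense(row_ids));
    let start = (row_ids[0] - chunk_first_row) as usize;
    values[start..start + row_ids.len()].to_vec()
}

fn main() {
    let chunk = [10_i64, 20, 30, 40, 50]; // values for row IDs 100..=104
    assert_eq!(gather_dense(&chunk, 100, &[101, 102, 103]), vec![20, 30, 40]);
}
```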

Sources: llkv-column-map/src/store/projection.rs:772-960 llkv-column-map/src/store/core.rs:758-785

Data Flow: Filter Operation

Filtering evaluates predicates and returns matching row IDs.

Filter Execution Path:

  1. Descriptor load (llkv-column-map/src/store/scan/filter.rs:1301-1333): fetch the column descriptor and iterate its chunk metadata
  2. Metadata pruning (llkv-column-map/src/store/scan/filter.rs:1336-1353): skip chunks where min_val > predicate_max or max_val < predicate_min
  3. Chunk fetch (llkv-column-map/src/store/scan/filter.rs:1355-1367): batch-get overlapping chunk arrays
  4. Visitor evaluation (llkv-column-map/src/store/scan/filter.rs:1369-1394): a type-specialized visitor applies the predicate
  5. Result encoding (llkv-column-map/src/store/scan/filter.rs:506-591): build a FilterResult with run-length encoding or a sparse list

Visitor Pattern Example (FilterVisitor<T, F>):
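A hedged sketch of the shape such a visitor takes; the actual FilterVisitor in llkv-column-map/src/store/scan/filter.rs differs in its trait plumbing and result encoding:

```rust
use std::marker::PhantomData;

// Illustrative FilterVisitor<T, F>; monomorphization specializes the hot
// loop for each value type T and predicate closure F.
struct FilterVisitor<T, F: Fn(T) -> bool> {
    predicate: F,
    matches: Vec<u64>, // row IDs that satisfied the predicate
    _ty: PhantomData<T>,
}

impl<T: Copy, F: Fn(T) -> bool> FilterVisitor<T, F> {
    /// Hot loop: the compiler can inline the predicate and keep the scan tight.
    fn visit_chunk(&mut self, row_ids: &[u64], values: &[T]) {
        for (&rid, &v) in row_ids.iter().zip(values) {
            if (self.predicate)(v) {
                self.matches.push(rid);
            }
        }
    }
}

fn main() {
    let mut visitor = FilterVisitor {
        predicate: |v: i64| v > 25,
        matches: Vec::new(),
        _ty: PhantomData,
    };
    visitor.visit_chunk(&[100, 101, 102], &[10_i64, 30, 50]);
    assert_eq!(visitor.matches, vec![101, 102]);
}
```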

The visitor pattern enables type-specialized hot loops while maintaining a generic filter interface.

Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:506-591 llkv-column-map/src/store/scan/filter.rs:605-649

Index Management

The storage layer supports two index types for accelerated lookups:

Index Operations:

| Method | Purpose | Location |
|---|---|---|
| register_index | Creates and persists an index for a column | llkv-column-map/src/store/core.rs:145-147 |
| unregister_index | Removes an index and frees storage | llkv-column-map/src/store/core.rs:163-165 |
| list_persisted_indexes | Queries existing indexes for a column | llkv-column-map/src/store/core.rs:408-425 |

Presence Index Usage (llkv-column-map/src/store/core.rs:663-754), sketched after this list:

  • Stored in ChunkMetadata.value_order_perm_pk
  • Binary search over permuted view of row IDs
  • Accelerates has_row_id checks
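A sketch of a binary search through a sort permutation, the general trick behind such indexes (names are assumptions; the stored permutation is what value_order_perm_pk points at):

```rust
// Illustrative permuted binary search: `perm` maps sorted rank -> position
// in the stored array, so row_ids[perm[i]] is non-decreasing in i.
fn has_row_id(row_ids: &[u64], perm: &[usize], target: u64) -> bool {
    perm.binary_search_by_key(&target, |&i| row_ids[i]).is_ok()
}

fn main() {
    let row_ids = [30_u64, 10, 20]; // stored (append) order
    let perm = [1_usize, 2, 0];     // sorted view: 10, 20, 30
    assert!(has_row_id(&row_ids, &perm, 20));
    assert!(!has_row_id(&row_ids, &perm, 25)); // O(log n), no full scan
}
```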

Sources: llkv-column-map/src/store/core.rs:136-165 llkv-column-map/src/store/core.rs:663-754

Storage Optimizations

Chunk Size Tuning

The ColumnStoreConfig provides tuning parameters:

| Parameter | Default | Purpose |
|---|---|---|
| target_chunk_rows | 8192 | Target rows per chunk for new appends |
| min_chunk_rows | 2048 | Minimum before considering compaction |
| max_chunk_rows | 32768 | Maximum before forcing a split |

Smaller chunks enable finer-grained pruning but increase metadata overhead. The defaults balance scan performance with storage efficiency.
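Assuming ColumnStoreConfig is a plain options struct, the documented defaults look like this (field names are taken from the table; the real struct in store/config.rs may carry more knobs):

```rust
// Assumed shape of ColumnStoreConfig with the documented defaults.
struct ColumnStoreConfig {
    target_chunk_rows: usize, // aim for this many rows per new chunk
    min_chunk_rows: usize,    // below this, a chunk is a compaction candidate
    max_chunk_rows: usize,    // above this, force a split
}

impl Default for ColumnStoreConfig {
    fn default() -> Self {
        Self {
            target_chunk_rows: 8192,
            min_chunk_rows: 2048,
            max_chunk_rows: 32768,
        }
    }
}

fn main() {
    let cfg = ColumnStoreConfig::default();
    assert!(cfg.min_chunk_rows <= cfg.target_chunk_rows);
    assert!(cfg.target_chunk_rows <= cfg.max_chunk_rows);
}
```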

Sources: llkv-column-map/src/store/config.rs

Last-Write-Wins Semantics

When appending rows with existing row IDs:

  1. Scan existing chunks (llkv-column-map/src/store/core.rs:893-901) to identify conflicting row IDs
  2. Overwrite in place (llkv-column-map/src/store/lww.rs) within the chunk containing the old value
  3. Append new rows (llkv-column-map/src/store/core.rs:918-942) that don’t conflict

This avoids duplicates and provides update semantics without explicit UPDATE statements.
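A toy model of these semantics over an ordered map (the real store rewrites serialized Arrow chunks rather than a BTreeMap, and nulls become absence, as the next section describes):

```rust
use std::collections::BTreeMap;

/// Toy Last-Write-Wins append: re-used row IDs overwrite, new ones append,
/// and a null value removes the entry (null == absence).
fn lww_append(column: &mut BTreeMap<u64, i64>, rows: &[(u64, Option<i64>)]) {
    for &(row_id, value) in rows {
        match value {
            Some(v) => {
                column.insert(row_id, v); // overwrite-or-append in one step
            }
            None => {
                column.remove(&row_id);
            }
        }
    }
}

fn main() {
    let mut col = BTreeMap::new();
    lww_append(&mut col, &[(1, Some(10)), (2, Some(20))]);
    lww_append(&mut col, &[(2, Some(99)), (3, Some(30))]); // row 2 rewritten
    assert_eq!(col.get(&2), Some(&99));
    assert_eq!(col.len(), 3); // no duplicate for row 2
}
```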

Sources: llkv-column-map/src/store/core.rs:893-942 llkv-column-map/src/store/lww.rs

Null Handling

Null values are not stored in value chunks (llkv-column-map/src/store/core.rs:920-931):

  • Filtered out during append
  • Represented by absence in the shadow row ID column
  • Surfaced as nulls during gather if row ID exists but value is missing

This reduces storage footprint for sparse columns.
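Continuing the toy model above, gather must distinguish "row exists but value absent" (surfaced as null) from "row does not exist" (the helper names here are hypothetical):

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy gather: a row-existence set plus a sparse value map. A present row
/// with no stored value surfaces as NULL (the inner None).
fn gather_value(
    existing_rows: &BTreeSet<u64>,
    values: &BTreeMap<u64, i64>,
    row_id: u64,
) -> Option<Option<i64>> {
    if !existing_rows.contains(&row_id) {
        return None; // row does not exist at all
    }
    Some(values.get(&row_id).copied()) // Some(None) == null-by-absence
}

fn main() {
    let rows: BTreeSet<u64> = [1, 2].into_iter().collect();
    let values: BTreeMap<u64, i64> = [(1, 10)].into_iter().collect();
    assert_eq!(gather_value(&rows, &values, 1), Some(Some(10)));
    assert_eq!(gather_value(&rows, &values, 2), Some(None)); // null
    assert_eq!(gather_value(&rows, &values, 3), None);       // absent row
}
```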

Sources: llkv-column-map/src/store/core.rs:920-931

Compaction (Future)

The current implementation supports incremental writes but defers compaction:

  • Small chunks accumulate over time
  • Metadata overhead increases
  • Scan performance degrades for heavily updated columns

Future work will implement background compaction to merge small chunks and rebuild optimal chunk sizes.

Sources: llkv-column-map/src/store/compact.rs

Component Relationships

The storage layer integrates with upstream and downstream components:

Key Interfaces:
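The central seam is the Pager trait with its batched get/put operations, shown in the architecture diagram above. A hedged sketch of its shape; the actual signatures live in llkv-storage/src/pager.rs and may differ:

```rust
// Assumed shape of the pager interface; names and error types are
// illustrative, not the real llkv-storage definitions.
type PhysicalKey = u64;

#[derive(Debug)]
struct PagerError;

trait Pager {
    /// Zero-copy handle over stored bytes (e.g. simd-r-drive's EntryHandle).
    type Handle: AsRef<[u8]>;

    /// Fetch many keys in one round trip; None marks a missing key.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Result<Vec<Option<Self::Handle>>, PagerError>;

    /// Persist many key/blob pairs as one atomic batch.
    fn batch_put(&self, puts: &[(PhysicalKey, Vec<u8>)]) -> Result<(), PagerError>;
}

fn main() {} // interface sketch only
```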

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-storage/src/pager.rs

Thread Safety

ColumnStore is Send + Sync and designed for concurrent access:

  • Catalog: Arc<RwLock<ColumnCatalog>> - read-heavy workload, allows concurrent reads
  • Caches: RwLock for data type and index metadata
  • Append Epoch: AtomicU64 for cache invalidation signaling
  • Context Pool: Mutex<FxHashMap<...>> for gather context reuse

Concurrent appends to different tables are lock-free at the ColumnStore level. Appends to the same column serialize on the catalog write lock.
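Put together, the synchronized state looks roughly like this (field names, placeholder types, and the use of std HashMap in place of FxHashMap are assumptions):

```rust
use std::collections::HashMap;
use std::sync::atomic::AtomicU64;
use std::sync::{Arc, Mutex, RwLock};

struct ColumnCatalog;      // LogicalFieldId -> PhysicalKey map (placeholder)
struct MultiGatherContext; // reusable gather scratch space (placeholder)

// Assumed arrangement of ColumnStore's shared state; the real struct in
// core.rs uses FxHashMap and carries additional caches.
struct ColumnStore {
    catalog: Arc<RwLock<ColumnCatalog>>,              // many readers, rare writers
    dtype_cache: RwLock<HashMap<u64, String>>,        // data type / index metadata
    append_epoch: AtomicU64,                          // bumped per append; gather
                                                      // contexts compare epochs to
                                                      // detect staleness
    gather_pool: Mutex<HashMap<u64, MultiGatherContext>>, // checked out per call
}

fn main() {
    // Send + Sync falls out of the field types: a compile-time check.
    fn assert_send_sync<T: Send + Sync>() {}
    assert_send_sync::<ColumnStore>();
}
```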

Sources: llkv-column-map/src/store/core.rs:47-68

Performance Characteristics

| Operation | Complexity | Notes |
|---|---|---|
| Append (sorted) | O(n) | n = rows; includes serialization and pager writes |
| Append (unsorted) | O(n log n) | Requires lexicographic sort |
| Gather (random) | O(m · k) | m = row count, k = avg chunk scan per row |
| Gather (dense) | O(m) | Contiguous row IDs enable offset arithmetic |
| Filter (full scan) | O(c) | c = total chunks; metadata pruning reduces effective c |
| Filter (indexed) | O(log c) | With presence/value indexes |

Batching Benefits (llkv-sql/tests/pager_io_tests.rs:18-120):

  • INSERT: ~8 allocations, 24 puts for 3 rows (2 columns + row ID + MVCC)
  • SELECT: ~36 gets in 23 batches (descriptors + chunks + metadata)
  • DELETE: ~6 puts, 46 gets (read for tombstones, write MVCC columns)

Sources: llkv-column-map/src/store/core.rs llkv-sql/tests/pager_io_tests.rs:18-120
