This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Storage Layer
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.
For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.
Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45
Architecture Overview
The storage layer implements a three-tier architecture where each tier has a distinct responsibility:
```mermaid
graph TB
    subgraph "Schema Layer"
        Table["Table\nllkv-table::Table"]
        Schema["Schema Validation"]
        MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
    end
    subgraph "Column Management Layer"
        ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
        Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
        Chunks["Column Chunks\nArrow RecordBatch segments"]
        Descriptors["ColumnDescriptor\nChunk metadata"]
    end
    subgraph "Physical Persistence Layer"
        Pager["Pager Trait\nllkv-storage::pager::Pager"]
        MemPager["MemPager\nHashMap backend"]
        SimdPager["SimdRDrivePager\nMemory-mapped file"]
    end
    Table --> Schema
    Schema --> MVCC
    MVCC --> ColumnStore
    ColumnStore --> Catalog
    ColumnStore --> Chunks
    ColumnStore --> Descriptors
    Catalog --> Pager
    Chunks --> Pager
    Descriptors --> Pager
    Pager --> MemPager
    Pager --> SimdPager
```
Key Components:
| Layer | Crate | Primary Types | Responsibility |
|---|---|---|---|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |
Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28
Logical vs Physical Addressing
The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:
```mermaid
graph LR
    subgraph "Logical Space"
        LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
        UserNS["Namespace::UserData"]
        RowNS["Namespace::RowIdShadow"]
        MVCCNS["Namespace::TxnMetadata"]
    end
    subgraph "Catalog Mapping"
        CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
        ValuePK["Value PhysicalKey"]
        RowPK["RowId PhysicalKey"]
    end
    subgraph "Physical Space"
        PhysKey["PhysicalKey (u64)"]
        DescBlob["Descriptor Blob\nColumnDescriptor"]
        ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
    end
    LogicalField --> CatalogEntry
    UserNS --> LogicalField
    RowNS --> LogicalField
    MVCCNS --> LogicalField
    CatalogEntry --> ValuePK
    CatalogEntry --> RowPK
    ValuePK --> PhysKey
    RowPK --> PhysKey
    PhysKey --> DescBlob
    PhysKey --> ChunkBlob
```
Namespace Segregation:
LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:
- Namespace::UserData: User-defined columns (e.g., name, age, email)
- Namespace::RowIdShadow: Parallel row-id arrays used for gather operations
- Namespace::TxnMetadata: MVCC columns (created_by, deleted_by)
The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.
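To make the addressing scheme concrete, the sketch below models the three-part LogicalFieldId and the catalog map in plain Rust. The type and namespace names follow the prose above, but the struct layout and the key values are illustrative, not the crate's actual encoding.

```rust
use std::collections::HashMap;

// Namespaces prevent collisions between user columns, row-id shadows,
// and MVCC metadata, as described above.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Namespace {
    UserData,
    RowIdShadow,
    TxnMetadata,
}

// Illustrative three-part address; the real crate packs this differently.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId {
    namespace: Namespace,
    table_id: u32,
    field_id: u32,
}

/// Physical keys are opaque u64 identifiers allocated by the pager.
type PhysicalKey = u64;

fn main() {
    // The catalog maps each logical field to a physical key; the map
    // itself is persisted at pager root key 0.
    let mut catalog: HashMap<LogicalFieldId, PhysicalKey> = HashMap::new();

    let name_col = LogicalFieldId {
        namespace: Namespace::UserData,
        table_id: 1,
        field_id: 0,
    };
    // The parallel row-id shadow shares table_id/field_id but not namespace.
    let name_row_ids = LogicalFieldId {
        namespace: Namespace::RowIdShadow,
        ..name_col
    };

    catalog.insert(name_col, 42);     // value column blob (key chosen arbitrarily)
    catalog.insert(name_row_ids, 43); // parallel row-id array

    assert_ne!(catalog[&name_col], catalog[&name_row_ids]);
}
```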
Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37
Data Persistence Model
Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:
```mermaid
sequenceDiagram
    participant Caller
    participant Table as "Table"
    participant ColumnStore as "ColumnStore"
    participant Serialization as "serialization"
    participant Pager as "Pager"
    Caller->>Table: append(RecordBatch)
    Note over Table: Validate schema\nInject MVCC columns
    Table->>ColumnStore: append(RecordBatch with MVCC)
    Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins
    loop "For each column"
        ColumnStore->>Serialization: serialize_array(Array)
        Serialization-->>ColumnStore: Vec<u8> blob
        ColumnStore->>Pager: batch_put(PhysicalKey, blob)
    end
    Pager-->>ColumnStore: Success
    ColumnStore-->>Table: Success
    Table-->>Caller: Success
```
Chunking Strategy:
Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.
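As a rough illustration of this chunking scheme, the sketch below splits a sorted run of row IDs into 8,192-row chunks and records the per-chunk metadata listed above. The field names (value_key, row_id_key) are assumptions for illustration, not llkv-column-map's actual ColumnDescriptor layout.

```rust
const CHUNK_ROWS: usize = 8_192; // default chunk size from the text

// Per-chunk metadata: row count, min/max row ids, and the physical keys
// for the serialized value array and its parallel row-id array.
struct ChunkMetadata {
    row_count: usize,
    min_row_id: u64,
    max_row_id: u64,
    value_key: u64,
    row_id_key: u64,
}

/// Split a sorted run of row ids into fixed-size chunks, recording the
/// metadata a descriptor would need to prune scans by row-id range.
fn chunk_row_ids(row_ids: &[u64], mut alloc_key: impl FnMut() -> u64) -> Vec<ChunkMetadata> {
    row_ids
        .chunks(CHUNK_ROWS)
        .map(|chunk| ChunkMetadata {
            row_count: chunk.len(),
            min_row_id: chunk[0],
            max_row_id: chunk[chunk.len() - 1],
            value_key: alloc_key(),
            row_id_key: alloc_key(),
        })
        .collect()
}

fn main() {
    let row_ids: Vec<u64> = (0..20_000).collect();
    let mut next = 100u64; // stand-in for pager key allocation
    let chunks = chunk_row_ids(&row_ids, || { next += 1; next });
    assert_eq!(chunks.len(), 3); // 8192 + 8192 + 3616 rows
    assert_eq!(chunks[2].row_count, 3_616);
}
```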
MVCC Column Layout:
Every table's physical storage includes three categories of columns:
- User-defined columns from the schema
- row_id (UInt64): monotonic row identifier
- created_by (UInt64): transaction ID that created the row
- deleted_by (UInt64): transaction ID that deleted the row (0 if live)
These MVCC columns are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.
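A minimal sketch of the injection step, assuming the arrow crate: a helper that appends the three MVCC columns to a user schema. The helper name is hypothetical; Table performs this internally during append.

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Extend a user-defined schema with the three MVCC bookkeeping columns
/// listed above. Illustrative only; not Table's actual API.
fn with_mvcc_columns(user: &Schema) -> Schema {
    let mut fields: Vec<Field> = user.fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.push(Field::new("row_id", DataType::UInt64, false));
    fields.push(Field::new("created_by", DataType::UInt64, false));
    fields.push(Field::new("deleted_by", DataType::UInt64, false)); // 0 = live
    Schema::new(fields)
}

fn main() {
    let user = Schema::new(vec![Field::new("name", DataType::Utf8, true)]);
    let physical = with_mvcc_columns(&user);
    assert_eq!(physical.fields().len(), 4);
}
```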
Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25
Serialization and Zero-Copy Design
The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:
Serialization Format
The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:
| Layout | Type Code | Use Case | Header Fields |
|---|---|---|---|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |
Each serialized blob begins with a 24-byte header:
```
Offset  Field          Type    Description
------  -----          ----    -----------
0-3     Magic          [u8;4]  "ARR0"
4       Layout         u8      Layout discriminant (0-3)
5       PrimType       u8      Type code (layout-specific)
6       Precision/Pad  u8      Decimal precision or padding
7       Scale/Pad      u8      Decimal scale or padding
8-15    Length         u64     Logical element count
16-19   Extra A        u32     Layout-specific (e.g., values_len)
20-23   Extra B        u32     Layout-specific (e.g., offsets_len)
24+     Payload        [u8]    Raw Arrow buffers
```
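A small parser written directly from this layout shows how cheaply the header can be decoded. Little-endian byte order and the field names here are assumptions for illustration; consult llkv-storage/src/serialization.rs for the authoritative definition.

```rust
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,     // logical element count
    extra_a: u32, // layout-specific, e.g. values_len
    extra_b: u32, // layout-specific, e.g. offsets_len
}

/// Decode the 24-byte header described above (little-endian assumed).
fn parse_header(blob: &[u8]) -> Result<BlobHeader, &'static str> {
    if blob.len() < 24 {
        return Err("blob shorter than header");
    }
    if &blob[0..4] != b"ARR0" {
        return Err("bad magic");
    }
    Ok(BlobHeader {
        layout: blob[4],
        prim_type: blob[5],
        precision: blob[6],
        scale: blob[7],
        len: u64::from_le_bytes(blob[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(blob[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(blob[20..24].try_into().unwrap()),
    })
}

fn main() {
    // Build a synthetic header: layout 0, type code 3, 128 elements.
    let mut blob = Vec::new();
    blob.extend_from_slice(b"ARR0");
    blob.extend_from_slice(&[0, 3, 0, 0]);
    blob.extend_from_slice(&128u64.to_le_bytes());
    blob.extend_from_slice(&512u32.to_le_bytes()); // values_len
    blob.extend_from_slice(&0u32.to_le_bytes());
    println!("{:?}", parse_header(&blob).unwrap());
}
```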
Why Custom Format Instead of Arrow IPC:
The custom format achieves three goals:
- Minimal overhead: No schema framing or padding, just raw buffers
- Contiguous payloads: Each array's bytes are adjacent, ideal for SIMD and sequential scans
- True zero-copy: deserialize_array constructs ArrayData directly from EntryHandle buffers without memcpy
Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298
EntryHandle Abstraction
The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:
- as_ref() -> &[u8]: Zero-copy slice view
- as_arrow_buffer() -> Buffer: Wrap as Arrow buffer without copying
```mermaid
graph LR
    File["Persistent File\nsimd_r_drive::DataStore"]
    Mmap["Memory-Mapped Region"]
    EntryHandle["EntryHandle"]
    Buffer["Arrow Buffer"]
    ArrayData["ArrayData"]
    ArrayRef["ArrayRef"]
    File --> Mmap
    Mmap --> EntryHandle
    EntryHandle --> Buffer
    Buffer --> ArrayData
    ArrayData --> ArrayRef
    style File fill:#f9f9f9
    style ArrayRef fill:#f9f9f9
```
When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying.
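The following self-contained sketch shows the same zero-copy pattern using the arrow crate's public API. A Buffer built from bytes stands in for the one EntryHandle::as_arrow_buffer would return over the mmap; once wrapped, ArrayData and the resulting ArrayRef view the same bytes without memcpy.

```rust
use arrow::array::{make_array, Array, ArrayData, Int32Array};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

fn main() {
    // Stand-in for EntryHandle::as_arrow_buffer(): the payload bytes of a
    // serialized Int32 column. No copy happens after this point.
    let payload: Vec<u8> = [1i32, 2, 3, 4]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let buffer = Buffer::from_vec(payload);

    // ArrayData wraps the buffer directly; make_array produces an ArrayRef
    // over the same underlying bytes.
    let data = ArrayData::builder(DataType::Int32)
        .len(4)
        .add_buffer(buffer)
        .build()
        .unwrap();
    let array = make_array(data);
    let ints = array.as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(ints.value(2), 3);
}
```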
Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29
Pager Implementations
The Pager trait defines the interface for batch get/put operations over physical keys:
MemPager
MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:
- Unit tests and benchmarks
- Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
- Temporary namespaces that don't require persistence
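The sketch below models a MemPager-like pager: batch get/put over opaque u64 keys behind an RwLock'd map, mirroring the RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>> described above. Method names are illustrative and a plain HashMap stands in for FxHashMap; llkv's actual Pager trait yields EntryHandle values rather than Arc<Vec<u8>>.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type PhysicalKey = u64;

// Illustrative interface: batch operations over opaque physical keys.
trait Pager {
    fn batch_put(&self, entries: Vec<(PhysicalKey, Vec<u8>)>);
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>>;
}

#[derive(Default)]
struct MemPager {
    map: RwLock<HashMap<PhysicalKey, Arc<Vec<u8>>>>,
}

impl Pager for MemPager {
    fn batch_put(&self, entries: Vec<(PhysicalKey, Vec<u8>)>) {
        let mut map = self.map.write().unwrap();
        for (key, blob) in entries {
            map.insert(key, Arc::new(blob));
        }
    }

    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.map.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}

fn main() {
    let pager = MemPager::default();
    pager.batch_put(vec![(1, b"descriptor".to_vec()), (2, b"chunk".to_vec())]);
    let got = pager.batch_get(&[1, 2, 3]);
    assert!(got[0].is_some() && got[2].is_none());
}
```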
SimdRDrivePager
SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:
| Feature | Implementation |
|---|---|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |
The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.
Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28
Column Storage Operations
The ColumnStore provides three primary operation patterns:
Append Workflow
Last-Writer-Wins Semantics:
When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:
- Identify chunks containing overlapping row_id ranges
- Load those chunks and merge with new data
- Re-serialize merged chunks
- Atomically update descriptors and chunk blobs
This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
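In miniature, the merge step behaves like an upsert keyed by row_id, as in this sketch (real chunks hold Arrow arrays rather than strings):

```rust
use std::collections::BTreeMap;

/// Last-writer-wins merge: new data replaces existing rows on row_id
/// collision; non-overlapping rows are simply appended.
fn lww_merge(existing: &mut BTreeMap<u64, String>, incoming: Vec<(u64, String)>) {
    for (row_id, value) in incoming {
        existing.insert(row_id, value);
    }
}

fn main() {
    let mut chunk: BTreeMap<u64, String> =
        [(1, "a".into()), (2, "b".into())].into_iter().collect();
    lww_merge(&mut chunk, vec![(2, "B".into()), (3, "c".into())]);
    assert_eq!(chunk[&2], "B"); // replaced in place, no tombstone left behind
    assert_eq!(chunk.len(), 3);
}
```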
Sources: llkv-column-map/README.md:24-29
Gather Operations
Gather operations retrieve specific rows by row_id from columnar storage:
Null-Handling Policies:
Gather operations support three policies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |
The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.
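The sketch below mimics the three policies over a single Option-valued column. In the real store, DropNulls omits a row only when all projected columns are null, and gathers operate on Arrow arrays through MultiGatherContext; this simplification is for illustration only.

```rust
use std::collections::HashMap;

enum GatherNullPolicy {
    ErrorOnMissing,
    IncludeNulls,
    DropNulls,
}

/// Gather values for the requested row ids, applying the given null policy.
fn gather(
    column: &HashMap<u64, i64>,
    row_ids: &[u64],
    policy: GatherNullPolicy,
) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::new();
    for id in row_ids {
        match (column.get(id), &policy) {
            (Some(v), _) => out.push(Some(*v)),
            (None, GatherNullPolicy::ErrorOnMissing) => {
                return Err(format!("row_id {id} not found"));
            }
            (None, GatherNullPolicy::IncludeNulls) => out.push(None),
            (None, GatherNullPolicy::DropNulls) => {} // omit the row entirely
        }
    }
    Ok(out)
}

fn main() {
    let column: HashMap<u64, i64> = [(1, 10), (3, 30)].into_iter().collect();
    assert!(gather(&column, &[1, 2], GatherNullPolicy::ErrorOnMissing).is_err());
    assert_eq!(gather(&column, &[1, 2], GatherNullPolicy::IncludeNulls).unwrap().len(), 2);
    assert_eq!(gather(&column, &[1, 2], GatherNullPolicy::DropNulls).unwrap().len(), 1);
}
```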
Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446
Streaming Scans
The ColumnStream type provides paginated, filtered scans over columnar data:
Scans operate in chunks to avoid materializing entire tables:
- Load next chunk of row IDs and MVCC metadata
- Apply MVCC visibility filter (transaction snapshot check)
- Evaluate user predicates on loaded columns
- Gather matching rows into a RecordBatch
- Yield batch to caller
This streaming model enables large result sets to be processed incrementally without exhausting memory.
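A condensed model of one loop iteration, assuming the usual MVCC visibility rule (row created at or before the snapshot and not yet deleted as of it); the real ColumnStream yields Arrow RecordBatch values rather than row structs:

```rust
struct Row {
    row_id: u64,
    created_by: u64,
    deleted_by: u64, // 0 = live
    value: i64,
}

/// Snapshot visibility: created at or before the snapshot, and either
/// never deleted (0) or deleted by a later transaction.
fn visible(row: &Row, snapshot_txn: u64) -> bool {
    row.created_by <= snapshot_txn && (row.deleted_by == 0 || row.deleted_by > snapshot_txn)
}

fn main() {
    let chunk = vec![
        Row { row_id: 1, created_by: 5, deleted_by: 0, value: 10 },
        Row { row_id: 2, created_by: 5, deleted_by: 7, value: 20 }, // deleted before snapshot
        Row { row_id: 3, created_by: 9, deleted_by: 0, value: 30 }, // created after snapshot
    ];
    let snapshot_txn = 8;

    // One iteration of the streaming loop: filter a chunk, then "yield" it.
    let batch: Vec<&Row> = chunk
        .iter()
        .filter(|r| visible(r, snapshot_txn))
        .filter(|r| r.value >= 10) // user predicate
        .collect();
    assert_eq!(batch.len(), 1);
    assert_eq!(batch[0].row_id, 1);
}
```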
Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25
Integration Points
The storage layer is consumed by multiple higher-level components:
Key Integration Patterns:
| Consumer | Usage Pattern |
|---|---|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |
System Catalog Self-Hosting:
The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:
- Runtime opens the pager
- ColumnStore is initialized with the pager
- SysCatalog is constructed, reading metadata from table 0
- User tables are opened using metadata from SysCatalog
This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28