This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Storage Layer

Purpose and Scope

The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.

For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.

Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45

Architecture Overview

The storage layer implements a three-tier architecture where each tier has a distinct responsibility:

Key Components:

```mermaid
graph TB
    subgraph "Schema Layer"
        Table["Table\nllkv-table::Table"]
        Schema["Schema Validation"]
        MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
    end

    subgraph "Column Management Layer"
        ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
        Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
        Chunks["Column Chunks\nArrow RecordBatch segments"]
        Descriptors["ColumnDescriptor\nChunk metadata"]
    end

    subgraph "Physical Persistence Layer"
        Pager["Pager Trait\nllkv-storage::pager::Pager"]
        MemPager["MemPager\nHashMap backend"]
        SimdPager["SimdRDrivePager\nMemory-mapped file"]
    end

    Table --> Schema
    Schema --> MVCC
    MVCC --> ColumnStore
    ColumnStore --> Catalog
    ColumnStore --> Chunks
    ColumnStore --> Descriptors
    Catalog --> Pager
    Chunks --> Pager
    Descriptors --> Pager
    Pager --> MemPager
    Pager --> SimdPager
```

| Layer | Crate | Primary Types | Responsibility |
|-------|-------|---------------|----------------|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |

Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28

Logical vs Physical Addressing

The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:

Namespace Segregation:

```mermaid
graph LR
    subgraph "Logical Space"
        LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
        UserNS["Namespace::UserData"]
        RowNS["Namespace::RowIdShadow"]
        MVCCNS["Namespace::TxnMetadata"]
    end

    subgraph "Catalog Mapping"
        CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
        ValuePK["Value PhysicalKey"]
        RowPK["RowId PhysicalKey"]
    end

    subgraph "Physical Space"
        PhysKey["PhysicalKey (u64)"]
        DescBlob["Descriptor Blob\nColumnDescriptor"]
        ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
    end

    LogicalField --> CatalogEntry
    UserNS --> LogicalField
    RowNS --> LogicalField
    MVCCNS --> LogicalField
    CatalogEntry --> ValuePK
    CatalogEntry --> RowPK
    ValuePK --> PhysKey
    RowPK --> PhysKey
    PhysKey --> DescBlob
    PhysKey --> ChunkBlob
```

LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:

  • Namespace::UserData: User-defined columns (e.g., name, age, email)
  • Namespace::RowIdShadow: Parallel row-id arrays used for gather operations
  • Namespace::TxnMetadata: MVCC columns (created_by, deleted_by)

The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.
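
A minimal sketch of this addressing scheme, assuming a plain HashMap and illustrative field widths (the actual LogicalFieldId layout and catalog encoding in llkv-column-map may differ):

```rust
use std::collections::HashMap;

// Hypothetical namespace tags; the real enum lives in llkv-column-map.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Namespace {
    UserData,    // user-defined columns
    RowIdShadow, // parallel row-id arrays
    TxnMetadata, // created_by / deleted_by columns
}

// Three-part logical address: (namespace, table_id, field_id).
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId {
    namespace: Namespace,
    table_id: u32,
    field_id: u32,
}

// Physical keys are opaque u64 identifiers allocated by the pager.
type PhysicalKey = u64;

fn main() {
    // The catalog maps logical fields to physical keys; the real
    // catalog.map is persisted at pager root key 0.
    let mut catalog: HashMap<LogicalFieldId, PhysicalKey> = HashMap::new();

    let name_col = LogicalFieldId {
        namespace: Namespace::UserData,
        table_id: 1,
        field_id: 0,
    };
    catalog.insert(name_col, 42); // 42: a key handed out by the pager

    assert_eq!(catalog[&name_col], 42);
}
```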

Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37

Data Persistence Model

Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:

Chunking Strategy:

```mermaid
sequenceDiagram
    participant Caller
    participant Table as "Table"
    participant ColumnStore as "ColumnStore"
    participant Serialization as "serialization"
    participant Pager as "Pager"

    Caller->>Table: append(RecordBatch)

    Note over Table: Validate schema\nInject MVCC columns

    Table->>ColumnStore: append(RecordBatch with MVCC)

    Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins

    loop "For each column"
        ColumnStore->>Serialization: serialize_array(Array)
        Serialization-->>ColumnStore: Vec<u8> blob
        ColumnStore->>Pager: batch_put(PhysicalKey, blob)
    end

    Pager-->>ColumnStore: Success
    ColumnStore-->>Table: Success
    Table-->>Caller: Success
```

Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.
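
To make the chunk arithmetic concrete, here is a small illustrative helper; only the 8,192-row default comes from the paragraph above:

```rust
// Split a column of `total_rows` rows into half-open chunk ranges of at
// most CHUNK_ROWS rows each (illustrative helper, not the crate's API).
const CHUNK_ROWS: usize = 8_192;

fn chunk_bounds(total_rows: usize) -> Vec<(usize, usize)> {
    (0..total_rows)
        .step_by(CHUNK_ROWS)
        .map(|start| (start, usize::min(start + CHUNK_ROWS, total_rows)))
        .collect()
}

fn main() {
    let bounds = chunk_bounds(100_000);
    assert_eq!(bounds.len(), 13); // ceil(100_000 / 8_192) chunks
    assert_eq!(bounds[0], (0, 8_192));
    assert_eq!(*bounds.last().unwrap(), (98_304, 100_000)); // short tail chunk
}
```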

MVCC Column Layout:

Every table's physical storage includes the user-defined columns from the schema plus three system columns:

  • row_id (UInt64): monotonic row identifier
  • created_by (UInt64): transaction ID that created the row
  • deleted_by (UInt64): transaction ID that deleted the row (0 if live)

These MVCC columns are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.
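
The resulting layout can be sketched directly with the arrow crate; the real injection happens inside Table::append, so the schema and batch construction below are illustrative:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One user column plus the three system columns described above.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("row_id", DataType::UInt64, false),
        Field::new("created_by", DataType::UInt64, false),
        Field::new("deleted_by", DataType::UInt64, false),
    ]));

    let columns: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["alice", "bob"])),
        Arc::new(UInt64Array::from(vec![1_u64, 2])), // monotonic row ids
        Arc::new(UInt64Array::from(vec![7_u64, 7])), // created by txn 7
        Arc::new(UInt64Array::from(vec![0_u64, 0])), // 0 = still live
    ];

    let batch = RecordBatch::try_new(schema, columns)?;
    assert_eq!(batch.num_columns(), 4);
    Ok(())
}
```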

Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25

Serialization and Zero-Copy Design

The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:

Serialization Format

The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:

| Layout | Type Code | Use Case | Header Fields |
|--------|-----------|----------|---------------|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |

Each serialized blob begins with a 24-byte header:

```
Offset  Field             Type    Description
------  -----             ----    -----------
0-3     Magic             [u8;4]  "ARR0"
4       Layout            u8      Layout discriminant (0-3)
5       PrimType          u8      Type code (layout-specific)
6       Precision/Pad     u8      Decimal precision or padding
7       Scale/Pad         u8      Decimal scale or padding
8-15    Length            u64     Logical element count
16-19   Extra A           u32     Layout-specific (e.g., values_len)
20-23   Extra B           u32     Layout-specific (e.g., offsets_len)
24+     Payload           [u8]    Raw Arrow buffers
```
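
A minimal parser for this header, assuming little-endian field encoding (the struct and function names are illustrative, not the crate's internals):

```rust
// Parsed view of the 24-byte blob header (illustrative types).
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,     // logical element count
    extra_a: u32, // layout-specific, e.g. values_len
    extra_b: u32, // layout-specific, e.g. offsets_len
}

fn parse_header(blob: &[u8]) -> Option<BlobHeader> {
    // Reject blobs that are too short or lack the "ARR0" magic.
    if blob.len() < 24 || &blob[0..4] != b"ARR0" {
        return None;
    }
    Some(BlobHeader {
        layout: blob[4],
        prim_type: blob[5],
        precision: blob[6],
        scale: blob[7],
        len: u64::from_le_bytes(blob[8..16].try_into().ok()?),
        extra_a: u32::from_le_bytes(blob[16..20].try_into().ok()?),
        extra_b: u32::from_le_bytes(blob[20..24].try_into().ok()?),
    })
}
```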

Why Custom Format Instead of Arrow IPC:

The custom format achieves three goals:

  1. Minimal overhead: no schema framing or padding, just raw buffers
  2. Contiguous payloads: each array's bytes are adjacent, ideal for SIMD and sequential scans
  3. True zero-copy: deserialize_array constructs ArrayData directly from EntryHandle buffers without memcpy

Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298

EntryHandle Abstraction

The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:

  • as_ref() -> &[u8]: Zero-copy slice view
  • as_arrow_buffer() -> Buffer: Wrap as Arrow buffer without copying

```mermaid
graph LR
    File["Persistent File\nsimd_r_drive::DataStore"]
    Mmap["Memory-Mapped Region"]
    EntryHandle["EntryHandle"]
    Buffer["Arrow Buffer"]
    ArrayData["ArrayData"]
    ArrayRef["ArrayRef"]

    File --> Mmap
    Mmap --> EntryHandle
    EntryHandle --> Buffer
    Buffer --> ArrayData
    ArrayData --> ArrayRef

    style File fill:#f9f9f9
    style ArrayRef fill:#f9f9f9
```

When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying:
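
A sketch of that path for a UInt64 column using arrow's public builder API; the payload framing below is an assumption, and the real deserialize_array dispatches on the header's layout:

```rust
use arrow::array::{make_array, Array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

// Build a UInt64 array that borrows `payload` directly; no bytes are
// copied. In llkv, `payload` would come from EntryHandle::as_arrow_buffer()
// sliced past the 24-byte header.
fn uint64_from_payload(payload: Buffer, len: usize) -> ArrayRef {
    let data = ArrayData::builder(DataType::UInt64)
        .len(len)
        .add_buffer(payload)
        .build()
        .expect("buffer length must cover len * 8 bytes");
    make_array(data)
}

fn main() {
    // Stand-in for a memory-mapped region: three little-endian u64 values.
    let bytes: Vec<u8> = [1u64, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let array = uint64_from_payload(Buffer::from(bytes), 3);
    assert_eq!(array.len(), 3);
}
```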

Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29

Pager Implementations

The Pager trait defines the interface for batch get/put operations over physical keys:
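
A simplified sketch of the shape of that interface; the actual trait in llkv-storage is generic over its handle type and richer than shown:

```rust
type PhysicalKey = u64;

// Illustrative error type; the real crate defines its own.
type PagerResult<T> = Result<T, String>;

trait Pager {
    // Blob handle returned by reads: a zero-copy view (EntryHandle) for a
    // mmap-backed pager, or an Arc'd buffer for an in-memory one.
    type Blob: AsRef<[u8]>;

    /// Allocate fresh physical keys.
    fn alloc_keys(&self, count: usize) -> PagerResult<Vec<PhysicalKey>>;

    /// Fetch many blobs in one batch.
    fn batch_get(&self, keys: &[PhysicalKey]) -> PagerResult<Vec<Option<Self::Blob>>>;

    /// Persist many blobs in one batch.
    fn batch_put(&self, entries: Vec<(PhysicalKey, Vec<u8>)>) -> PagerResult<()>;
}
```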

MemPager

MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:

  • Unit tests and benchmarks
  • Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
  • Temporary namespaces that don't require persistence
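
A toy pager in the spirit of that description, substituting std's HashMap for FxHashMap:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

// Toy MemPager: heap-backed blobs behind a reader-writer lock.
#[derive(Default)]
struct ToyMemPager {
    next_key: AtomicU64,
    blobs: RwLock<HashMap<u64, Arc<Vec<u8>>>>,
}

impl ToyMemPager {
    fn alloc_keys(&self, count: usize) -> Vec<u64> {
        let first = self.next_key.fetch_add(count as u64, Ordering::Relaxed);
        (first..first + count as u64).collect()
    }

    fn batch_put(&self, entries: Vec<(u64, Vec<u8>)>) {
        let mut map = self.blobs.write().unwrap();
        for (key, blob) in entries {
            map.insert(key, Arc::new(blob)); // replaces any prior blob
        }
    }

    fn batch_get(&self, keys: &[u64]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.blobs.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}

fn main() {
    let pager = ToyMemPager::default();
    let keys = pager.alloc_keys(2);
    pager.batch_put(vec![(keys[0], b"descriptor".to_vec())]);
    assert!(pager.batch_get(&keys)[0].is_some());
    assert!(pager.batch_get(&keys)[1].is_none()); // never written
}
```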

SimdRDrivePager

SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:

| Feature | Implementation |
|---------|----------------|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |

The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.

Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28

Column Storage Operations

The ColumnStore provides three primary operation patterns:

Append Workflow

Last-Writer-Wins Semantics:

When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:

  1. Identify chunks containing overlapping row_id ranges
  2. Load those chunks and merge with new data
  3. Re-serialize merged chunks
  4. Atomically update descriptors and chunk blobs

This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
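
The merge step can be sketched at the row_id level with toy maps (real chunks hold Arrow arrays, not integers):

```rust
use std::collections::BTreeMap;

// Last-writer-wins: rows in the incoming batch replace stored rows that
// share a row_id.
fn merge_last_writer_wins(
    existing: &BTreeMap<u64, i64>,
    incoming: &BTreeMap<u64, i64>,
) -> BTreeMap<u64, i64> {
    let mut merged = existing.clone();
    for (&row_id, &value) in incoming {
        merged.insert(row_id, value); // newer write shadows the old row
    }
    merged
}

fn main() {
    let existing = BTreeMap::from([(1, 10), (2, 20)]);
    let incoming = BTreeMap::from([(2, 99), (3, 30)]); // row 2 rewritten
    let merged = merge_last_writer_wins(&existing, &incoming);
    assert_eq!(merged, BTreeMap::from([(1, 10), (2, 99), (3, 30)]));
}
```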

Sources: llkv-column-map/README.md:24-29

Gather Operations

Gather operations retrieve specific rows by row_id from columnar storage:

Null-Handling Policies:

Gather operations support three policies via GatherNullPolicy:

| Policy | Behavior |
|--------|----------|
| ErrorOnMissing | Return an error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |

The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.
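
The three policies can be sketched as a toy gather over an Option-valued column; the real implementation in store/projection.rs operates on Arrow chunks:

```rust
// Mirrors the three variants described above.
enum GatherNullPolicy {
    ErrorOnMissing,
    IncludeNulls,
    DropNulls,
}

// Toy gather: `column` is indexed by row_id, None marks a missing row.
fn gather(
    column: &[Option<i64>],
    row_ids: &[usize],
    policy: &GatherNullPolicy,
) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::with_capacity(row_ids.len());
    for &rid in row_ids {
        match (column.get(rid).copied().flatten(), policy) {
            (Some(v), _) => out.push(Some(v)),
            (None, GatherNullPolicy::ErrorOnMissing) => {
                return Err(format!("row_id {rid} not found"));
            }
            (None, GatherNullPolicy::IncludeNulls) => out.push(None),
            (None, GatherNullPolicy::DropNulls) => {} // omit the row
        }
    }
    Ok(out)
}

fn main() {
    let col = vec![Some(10), None, Some(30)];
    let got = gather(&col, &[0, 1, 2], &GatherNullPolicy::DropNulls).unwrap();
    assert_eq!(got, vec![Some(10), Some(30)]);
}
```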

Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446

Streaming Scans

The ColumnStream type provides paginated, filtered scans over columnar data:

Scans operate in chunks to avoid materializing entire tables:

  1. Load next chunk of row IDs and MVCC metadata
  2. Apply MVCC visibility filter (transaction snapshot check)
  3. Evaluate user predicates on loaded columns
  4. Gather matching rows into a RecordBatch
  5. Yield batch to caller

This streaming model enables large result sets to be processed incrementally without exhausting memory.
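
A consumer written against any stream of batches illustrates the bounded-memory property; the iterator below stands in for the real ColumnStream / Table::scan_stream API:

```rust
use arrow::record_batch::RecordBatch;

// Process batches one chunk at a time; memory stays bounded by the chunk
// size regardless of how many rows the scan ultimately yields.
fn count_rows(batches: impl IntoIterator<Item = RecordBatch>) -> u64 {
    let mut total = 0u64;
    for batch in batches {
        // Each batch already reflects MVCC visibility and predicate
        // filtering, per the steps above.
        total += batch.num_rows() as u64;
    }
    total
}
```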

Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25

Integration Points

The storage layer is consumed by multiple higher-level components:

Key Integration Patterns:

| Consumer | Usage Pattern |
|----------|---------------|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |

System Catalog Self-Hosting:

The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:

  1. Runtime opens the pager
  2. ColumnStore is initialized with the pager
  3. SysCatalog is constructed, reading metadata from table 0
  4. User tables are opened using metadata from SysCatalog

This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.

Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28