This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Storage Layer
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.
For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.
Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45
Architecture Overview
The storage layer implements a three-tier architecture where each tier has a distinct responsibility:
```mermaid
graph TB
    subgraph "Schema Layer"
        Table["Table\nllkv-table::Table"]
        Schema["Schema Validation"]
        MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
    end
    subgraph "Column Management Layer"
        ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
        Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
        Chunks["Column Chunks\nArrow RecordBatch segments"]
        Descriptors["ColumnDescriptor\nChunk metadata"]
    end
    subgraph "Physical Persistence Layer"
        Pager["Pager Trait\nllkv-storage::pager::Pager"]
        MemPager["MemPager\nHashMap backend"]
        SimdPager["SimdRDrivePager\nMemory-mapped file"]
    end
    Table --> Schema
    Schema --> MVCC
    MVCC --> ColumnStore
    ColumnStore --> Catalog
    ColumnStore --> Chunks
    ColumnStore --> Descriptors
    Catalog --> Pager
    Chunks --> Pager
    Descriptors --> Pager
    Pager --> MemPager
    Pager --> SimdPager
```
Key Components:
| Layer | Crate | Primary Types | Responsibility |
|---|---|---|---|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |
Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28
Logical vs Physical Addressing
The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:
```mermaid
graph LR
    subgraph "Logical Space"
        LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
        UserNS["Namespace::UserData"]
        RowNS["Namespace::RowIdShadow"]
        MVCCNS["Namespace::TxnMetadata"]
    end
    subgraph "Catalog Mapping"
        CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
        ValuePK["Value PhysicalKey"]
        RowPK["RowId PhysicalKey"]
    end
    subgraph "Physical Space"
        PhysKey["PhysicalKey (u64)"]
        DescBlob["Descriptor Blob\nColumnDescriptor"]
        ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
    end
    LogicalField --> CatalogEntry
    UserNS --> LogicalField
    RowNS --> LogicalField
    MVCCNS --> LogicalField
    CatalogEntry --> ValuePK
    CatalogEntry --> RowPK
    ValuePK --> PhysKey
    RowPK --> PhysKey
    PhysKey --> DescBlob
    PhysKey --> ChunkBlob
```
Namespace Segregation:
LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:
- Namespace::UserData: User-defined columns (e.g., name, age, email)
- Namespace::RowIdShadow: Parallel row-id arrays used for gather operations
- Namespace::TxnMetadata: MVCC columns (created_by, deleted_by)
The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.
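To make the addressing scheme concrete, the sketch below models the three-part LogicalFieldId and the catalog map in plain Rust. The type and namespace names follow the prose above, but the struct layout and the key values are illustrative, not the crate's actual encoding.

```rust
use std::collections::HashMap;

// Namespaces prevent collisions between user columns, row-id shadows,
// and MVCC metadata, as described above.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Namespace {
    UserData,
    RowIdShadow,
    TxnMetadata,
}

// Illustrative three-part address; the real crate packs this differently.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId {
    namespace: Namespace,
    table_id: u32,
    field_id: u32,
}

/// Physical keys are opaque u64 identifiers allocated by the pager.
type PhysicalKey = u64;

fn main() {
    // The catalog maps each logical field to a physical key; the map
    // itself is persisted at pager root key 0.
    let mut catalog: HashMap<LogicalFieldId, PhysicalKey> = HashMap::new();

    let name_col = LogicalFieldId {
        namespace: Namespace::UserData,
        table_id: 1,
        field_id: 0,
    };
    // The parallel row-id shadow shares table_id/field_id but not namespace.
    let name_row_ids = LogicalFieldId {
        namespace: Namespace::RowIdShadow,
        ..name_col
    };

    catalog.insert(name_col, 42);     // value column blob (key chosen arbitrarily)
    catalog.insert(name_row_ids, 43); // parallel row-id array

    assert_ne!(catalog[&name_col], catalog[&name_row_ids]);
}
```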
Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37
Data Persistence Model
Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:
```mermaid
sequenceDiagram
    participant Caller
    participant Table as "Table"
    participant ColumnStore as "ColumnStore"
    participant Serialization as "serialization"
    participant Pager as "Pager"
    Caller->>Table: append(RecordBatch)
    Note over Table: Validate schema\nInject MVCC columns
    Table->>ColumnStore: append(RecordBatch with MVCC)
    Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins
    loop "For each column"
        ColumnStore->>Serialization: serialize_array(Array)
        Serialization-->>ColumnStore: Vec<u8> blob
        ColumnStore->>Pager: batch_put(PhysicalKey, blob)
    end
    Pager-->>ColumnStore: Success
    ColumnStore-->>Table: Success
    Table-->>Caller: Success
```
Chunking Strategy:
Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.
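As a rough illustration of this chunking scheme, the sketch below splits a sorted run of row IDs into 8,192-row chunks and records the per-chunk metadata listed above. The field names (value_key, row_id_key) are assumptions for illustration, not llkv-column-map's actual ColumnDescriptor layout.

```rust
const CHUNK_ROWS: usize = 8_192; // default chunk size from the text

// Per-chunk metadata: row count, min/max row ids, and the physical keys
// for the serialized value array and its parallel row-id array.
struct ChunkMetadata {
    row_count: usize,
    min_row_id: u64,
    max_row_id: u64,
    value_key: u64,
    row_id_key: u64,
}

/// Split a sorted run of row ids into fixed-size chunks, recording the
/// metadata a descriptor would need to prune scans by row-id range.
fn chunk_row_ids(row_ids: &[u64], mut alloc_key: impl FnMut() -> u64) -> Vec<ChunkMetadata> {
    row_ids
        .chunks(CHUNK_ROWS)
        .map(|chunk| ChunkMetadata {
            row_count: chunk.len(),
            min_row_id: chunk[0],
            max_row_id: chunk[chunk.len() - 1],
            value_key: alloc_key(),
            row_id_key: alloc_key(),
        })
        .collect()
}

fn main() {
    let row_ids: Vec<u64> = (0..20_000).collect();
    let mut next = 100u64; // stand-in for pager key allocation
    let chunks = chunk_row_ids(&row_ids, || { next += 1; next });
    assert_eq!(chunks.len(), 3); // 8192 + 8192 + 3616 rows
    assert_eq!(chunks[2].row_count, 3_616);
}
```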
MVCC Column Layout:
Every table's physical storage includes three categories of columns:
- User-defined columns from the schema
- row_id (UInt64): monotonic row identifier
- created_by (UInt64): transaction ID that created the row
- deleted_by (UInt64): transaction ID that deleted the row (0 if live)
These MVCC columns are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.
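A minimal sketch of the injection step, assuming the arrow crate: a helper that appends the three MVCC columns to a user schema. The helper name is hypothetical; Table performs this internally during append.

```rust
use arrow::datatypes::{DataType, Field, Schema};

/// Extend a user-defined schema with the three MVCC bookkeeping columns
/// listed above. Illustrative only; not Table's actual API.
fn with_mvcc_columns(user: &Schema) -> Schema {
    let mut fields: Vec<Field> = user.fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.push(Field::new("row_id", DataType::UInt64, false));
    fields.push(Field::new("created_by", DataType::UInt64, false));
    fields.push(Field::new("deleted_by", DataType::UInt64, false)); // 0 = live
    Schema::new(fields)
}

fn main() {
    let user = Schema::new(vec![Field::new("name", DataType::Utf8, true)]);
    let physical = with_mvcc_columns(&user);
    assert_eq!(physical.fields().len(), 4);
}
```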
Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25
Serialization and Zero-Copy Design
The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:
Serialization Format
The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:
| Layout | Type Code | Use Case | Header Fields |
|---|---|---|---|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |
Each serialized blob begins with a 24-byte header:
```
Offset  Field          Type    Description
------  -----          ----    -----------
0-3     Magic          [u8;4]  "ARR0"
4       Layout         u8      Layout discriminant (0-3)
5       PrimType       u8      Type code (layout-specific)
6       Precision/Pad  u8      Decimal precision or padding
7       Scale/Pad      u8      Decimal scale or padding
8-15    Length         u64     Logical element count
16-19   Extra A        u32     Layout-specific (e.g., values_len)
20-23   Extra B        u32     Layout-specific (e.g., offsets_len)
24+     Payload        [u8]    Raw Arrow buffers
```
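A small parser written directly from this layout shows how cheaply the header can be decoded. Little-endian byte order and the field names here are assumptions for illustration; consult llkv-storage/src/serialization.rs for the authoritative definition.

```rust
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,     // logical element count
    extra_a: u32, // layout-specific, e.g. values_len
    extra_b: u32, // layout-specific, e.g. offsets_len
}

/// Decode the 24-byte header described above (little-endian assumed).
fn parse_header(blob: &[u8]) -> Result<BlobHeader, &'static str> {
    if blob.len() < 24 {
        return Err("blob shorter than header");
    }
    if &blob[0..4] != b"ARR0" {
        return Err("bad magic");
    }
    Ok(BlobHeader {
        layout: blob[4],
        prim_type: blob[5],
        precision: blob[6],
        scale: blob[7],
        len: u64::from_le_bytes(blob[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(blob[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(blob[20..24].try_into().unwrap()),
    })
}

fn main() {
    // Build a synthetic header: layout 0, type code 3, 128 elements.
    let mut blob = Vec::new();
    blob.extend_from_slice(b"ARR0");
    blob.extend_from_slice(&[0, 3, 0, 0]);
    blob.extend_from_slice(&128u64.to_le_bytes());
    blob.extend_from_slice(&512u32.to_le_bytes()); // values_len
    blob.extend_from_slice(&0u32.to_le_bytes());
    println!("{:?}", parse_header(&blob).unwrap());
}
```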
Why Custom Format Instead of Arrow IPC:
The custom format achieves three goals:
- Minimal overhead: No schema framing or padding, just raw buffers
- Contiguous payloads: Each array's bytes are adjacent, ideal for SIMD and sequential scans
- True zero-copy: deserialize_array constructs ArrayData directly from EntryHandle buffers without memcpy
Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298
EntryHandle Abstraction
The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:
- as_ref() -> &[u8]: Zero-copy slice view
- as_arrow_buffer() -> Buffer: Wrap as Arrow buffer without copying
```mermaid
graph LR
    File["Persistent File\nsimd_r_drive::DataStore"]
    Mmap["Memory-Mapped Region"]
    EntryHandle["EntryHandle"]
    Buffer["Arrow Buffer"]
    ArrayData["ArrayData"]
    ArrayRef["ArrayRef"]
    File --> Mmap
    Mmap --> EntryHandle
    EntryHandle --> Buffer
    Buffer --> ArrayData
    ArrayData --> ArrayRef
    style File fill:#f9f9f9
    style ArrayRef fill:#f9f9f9
```
When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying.
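The following self-contained sketch shows the same zero-copy pattern using the arrow crate's public API. A Buffer built from bytes stands in for the one EntryHandle::as_arrow_buffer would return over the mmap; once wrapped, ArrayData and the resulting ArrayRef view the same bytes without memcpy.

```rust
use arrow::array::{make_array, Array, ArrayData, Int32Array};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

fn main() {
    // Stand-in for EntryHandle::as_arrow_buffer(): the payload bytes of a
    // serialized Int32 column. No copy happens after this point.
    let payload: Vec<u8> = [1i32, 2, 3, 4]
        .iter()
        .flat_map(|v| v.to_le_bytes())
        .collect();
    let buffer = Buffer::from_vec(payload);

    // ArrayData wraps the buffer directly; make_array produces an ArrayRef
    // over the same underlying bytes.
    let data = ArrayData::builder(DataType::Int32)
        .len(4)
        .add_buffer(buffer)
        .build()
        .unwrap();
    let array = make_array(data);
    let ints = array.as_any().downcast_ref::<Int32Array>().unwrap();
    assert_eq!(ints.value(2), 3);
}
```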
Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29
Pager Implementations
The Pager trait defines the interface for batch get/put operations over physical keys:
MemPager
MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:
- Unit tests and benchmarks
- Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
- Temporary namespaces that don't require persistence
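The sketch below models a MemPager-like pager: batch get/put over opaque u64 keys behind an RwLock'd map, mirroring the RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>> described above. Method names are illustrative and a plain HashMap stands in for FxHashMap; llkv's actual Pager trait yields EntryHandle values rather than Arc<Vec<u8>>.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type PhysicalKey = u64;

// Illustrative interface: batch operations over opaque physical keys.
trait Pager {
    fn batch_put(&self, entries: Vec<(PhysicalKey, Vec<u8>)>);
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>>;
}

#[derive(Default)]
struct MemPager {
    map: RwLock<HashMap<PhysicalKey, Arc<Vec<u8>>>>,
}

impl Pager for MemPager {
    fn batch_put(&self, entries: Vec<(PhysicalKey, Vec<u8>)>) {
        let mut map = self.map.write().unwrap();
        for (key, blob) in entries {
            map.insert(key, Arc::new(blob));
        }
    }

    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.map.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}

fn main() {
    let pager = MemPager::default();
    pager.batch_put(vec![(1, b"descriptor".to_vec()), (2, b"chunk".to_vec())]);
    let got = pager.batch_get(&[1, 2, 3]);
    assert!(got[0].is_some() && got[2].is_none());
}
```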
SimdRDrivePager
SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:
| Feature | Implementation |
|---|---|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |
The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.
Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28
Column Storage Operations
The ColumnStore provides three primary operation patterns:
Append Workflow
Last-Writer-Wins Semantics:
When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:
- Identify chunks containing overlapping row_id ranges
- Load those chunks and merge with new data
- Re-serialize merged chunks
- Atomically update descriptors and chunk blobs
This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
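In miniature, the merge step behaves like an upsert keyed by row_id, as in this sketch (real chunks hold Arrow arrays rather than strings):

```rust
use std::collections::BTreeMap;

/// Last-writer-wins merge: new data replaces existing rows on row_id
/// collision; non-overlapping rows are simply appended.
fn lww_merge(existing: &mut BTreeMap<u64, String>, incoming: Vec<(u64, String)>) {
    for (row_id, value) in incoming {
        existing.insert(row_id, value);
    }
}

fn main() {
    let mut chunk: BTreeMap<u64, String> =
        [(1, "a".into()), (2, "b".into())].into_iter().collect();
    lww_merge(&mut chunk, vec![(2, "B".into()), (3, "c".into())]);
    assert_eq!(chunk[&2], "B"); // replaced in place, no tombstone left behind
    assert_eq!(chunk.len(), 3);
}
```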
Sources: llkv-column-map/README.md:24-29
Gather Operations
Gather operations retrieve specific rows by row_id from columnar storage:
Null-Handling Policies:
Gather operations support three policies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |
The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.
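The sketch below mimics the three policies over a single Option-valued column. In the real store, DropNulls omits a row only when all projected columns are null, and gathers operate on Arrow arrays through MultiGatherContext; this simplification is for illustration only.

```rust
use std::collections::HashMap;

enum GatherNullPolicy {
    ErrorOnMissing,
    IncludeNulls,
    DropNulls,
}

/// Gather values for the requested row ids, applying the given null policy.
fn gather(
    column: &HashMap<u64, i64>,
    row_ids: &[u64],
    policy: GatherNullPolicy,
) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::new();
    for id in row_ids {
        match (column.get(id), &policy) {
            (Some(v), _) => out.push(Some(*v)),
            (None, GatherNullPolicy::ErrorOnMissing) => {
                return Err(format!("row_id {id} not found"));
            }
            (None, GatherNullPolicy::IncludeNulls) => out.push(None),
            (None, GatherNullPolicy::DropNulls) => {} // omit the row entirely
        }
    }
    Ok(out)
}

fn main() {
    let column: HashMap<u64, i64> = [(1, 10), (3, 30)].into_iter().collect();
    assert!(gather(&column, &[1, 2], GatherNullPolicy::ErrorOnMissing).is_err());
    assert_eq!(gather(&column, &[1, 2], GatherNullPolicy::IncludeNulls).unwrap().len(), 2);
    assert_eq!(gather(&column, &[1, 2], GatherNullPolicy::DropNulls).unwrap().len(), 1);
}
```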
Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446
Streaming Scans
The ColumnStream type provides paginated, filtered scans over columnar data:
Scans operate in chunks to avoid materializing entire tables:
- Load next chunk of row IDs and MVCC metadata
- Apply MVCC visibility filter (transaction snapshot check)
- Evaluate user predicates on loaded columns
- Gather matching rows into a RecordBatch
- Yield batch to caller
This streaming model enables large result sets to be processed incrementally without exhausting memory.
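A condensed model of one loop iteration, assuming the usual MVCC visibility rule (row created at or before the snapshot and not yet deleted as of it); the real ColumnStream yields Arrow RecordBatch values rather than row structs:

```rust
struct Row {
    row_id: u64,
    created_by: u64,
    deleted_by: u64, // 0 = live
    value: i64,
}

/// Snapshot visibility: created at or before the snapshot, and either
/// never deleted (0) or deleted by a later transaction.
fn visible(row: &Row, snapshot_txn: u64) -> bool {
    row.created_by <= snapshot_txn && (row.deleted_by == 0 || row.deleted_by > snapshot_txn)
}

fn main() {
    let chunk = vec![
        Row { row_id: 1, created_by: 5, deleted_by: 0, value: 10 },
        Row { row_id: 2, created_by: 5, deleted_by: 7, value: 20 }, // deleted before snapshot
        Row { row_id: 3, created_by: 9, deleted_by: 0, value: 30 }, // created after snapshot
    ];
    let snapshot_txn = 8;

    // One iteration of the streaming loop: filter a chunk, then "yield" it.
    let batch: Vec<&Row> = chunk
        .iter()
        .filter(|r| visible(r, snapshot_txn))
        .filter(|r| r.value >= 10) // user predicate
        .collect();
    assert_eq!(batch.len(), 1);
    assert_eq!(batch[0].row_id, 1);
}
```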
Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25
Integration Points
The storage layer is consumed by multiple higher-level components:
Key Integration Patterns:
| Consumer | Usage Pattern |
|---|---|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |
System Catalog Self-Hosting:
The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:
- Runtime opens the pager
- ColumnStore is initialized with the pager
- SysCatalog is constructed, reading metadata from table 0
- User tables are opened using metadata from SysCatalog
This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28