Column Storage and ColumnStore

Purpose and Scope

This document describes the column-oriented storage layer implemented by llkv-column-map, focusing on the ColumnStore struct that manages physical persistence of Arrow columnar data. The column store sits between the table abstraction and the pager interface, translating logical field requests into physical chunk operations.

For the higher-level table API that wraps ColumnStore, see Table Abstraction. For details on the underlying storage backends, see Pager Interface and SIMD Optimization.

Architecture Position

The ColumnStore acts as the bridge between schema-aware tables and raw key-value storage:

graph TB
    Table["llkv-table::Table\nSchema validation\nMVCC injection"]
    ColumnStore["llkv-column-map::ColumnStore\nColumn chunking\nLogicalFieldId → PhysicalKey"]
    Pager["llkv-storage::Pager\nbatch_get / batch_put\nMemPager / SimdRDrivePager"]
    Table -->|append RecordBatch| ColumnStore
    Table -->|scan / gather| ColumnStore
    ColumnStore -->|BatchGet / BatchPut| Pager
    ColumnStore -->|serialized Arrow chunks| Pager
    Pager -->|EntryHandle zero-copy| ColumnStore

Sources: llkv-column-map/README.md:10-46 llkv-table/README.md:19-24

Logical Field Identification

LogicalFieldId Structure

Each column is identified by a LogicalFieldId that encodes three components:

| Component | Bits | Purpose |
| --- | --- | --- |
| Namespace | High bits | Segregates user data, MVCC metadata, and row-id shadows |
| Table ID | Middle bits | Identifies which table the column belongs to |
| Field ID | Low bits | Distinguishes columns within a table |

This structure prevents collisions when multiple tables share the same physical pager while maintaining clear boundaries between user data and system metadata.

Sources: llkv-column-map/README.md:19-22
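As an illustration of this encoding, the sketch below packs the three components into a u64. The field widths (8-bit namespace, 24-bit table ID, 32-bit field ID) are assumptions chosen for the example, not the crate's actual layout.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId(u64);

#[derive(Clone, Copy, Debug)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    Mvcc = 2,
}

impl LogicalFieldId {
    /// Pack namespace (high 8 bits), table ID (next 24), field ID (low 32).
    fn new(ns: Namespace, table_id: u32, field_id: u32) -> Self {
        debug_assert!(table_id < (1 << 24), "table id must fit in 24 bits");
        LogicalFieldId(((ns as u64) << 56) | ((table_id as u64) << 32) | field_id as u64)
    }

    fn table_id(self) -> u32 {
        ((self.0 >> 32) & 0xFF_FFFF) as u32
    }

    fn field_id(self) -> u32 {
        self.0 as u32
    }
}

fn main() {
    // Same table and field, different namespaces: distinct catalog keys.
    let user = LogicalFieldId::new(Namespace::UserData, 5, 2);
    let shadow = LogicalFieldId::new(Namespace::RowIdShadow, 5, 2);
    assert_ne!(user, shadow);
    assert_eq!(user.table_id(), 5);
    assert_eq!(user.field_id(), 2);
}
```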

Namespace Segregation

Three primary namespaces exist:

  • User Data: Columns explicitly defined in CREATE TABLE statements
  • MVCC Metadata: System columns created_by and deleted_by for transaction visibility
  • Row ID Shadow: Parallel storage of row_id values for each column to enable efficient random access

Each namespace maps to distinct LogicalFieldId values, ensuring that MVCC bookkeeping and user data remain isolated in the catalog.

Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:16-17

Physical Storage Model

PhysicalKey Allocation and Mapping

The column store maintains a catalog that maps each LogicalFieldId to a PhysicalKey (u64 identifier) allocated by the pager:

graph LR
    LF1["LogicalFieldId\n(ns=User, table=5, field=2)"]
    LF2["LogicalFieldId\n(ns=RowId, table=5, field=2)"]
    LF3["LogicalFieldId\n(ns=Mvcc, table=5, field=created_by)"]
    PK1["PhysicalKey: 1024\nDescriptor"]
    PK2["PhysicalKey: 1025\nData Chunks"]
    PK3["PhysicalKey: 2048\nDescriptor"]
    PK4["PhysicalKey: 2049\nData Chunks"]
    LF1 --> PK1
    LF1 --> PK2
    LF2 --> PK3
    LF2 --> PK4
    PK1 -->|ColumnDescriptor| Pager["Pager"]
    PK2 -->|Serialized Arrays| Pager
    PK3 -->|ColumnDescriptor| Pager
    PK4 -->|Serialized Arrays| Pager

Each logical field requires at least two physical keys: one for the column descriptor (metadata about chunks) and one or more for the actual data chunks.

Sources: llkv-column-map/README.md:19-21 llkv-column-map/src/store/projection.rs:461-468

Column Descriptors and Chunks

Column data is split into fixed-size chunks, each serialized as an Arrow array. Metadata about chunks is stored in a ColumnDescriptor:

graph TB
    subgraph "Column Descriptor"
        Head["head_page_pk: PhysicalKey\nPoints to descriptor chain"]
        Meta1["ChunkMetadata[0]\nchunk_pk, row_count\nmin_val_u64, max_val_u64"]
        Meta2["ChunkMetadata[1]\n..."]
        Meta3["ChunkMetadata[n]\n..."]
        Head --> Meta1
        Head --> Meta2
        Head --> Meta3
    end

    subgraph "Physical Storage"
        Chunk1["PhysicalKey: chunk_pk[0]\nSerialized Arrow Array\nrow_ids: 0..999"]
        Chunk2["PhysicalKey: chunk_pk[1]\nSerialized Arrow Array\nrow_ids: 1000..1999"]
    end

    Meta1 -.->|references| Chunk1
    Meta2 -.->|references| Chunk2

The descriptor stores min/max row ID values for each chunk, enabling efficient skip-scan during queries by filtering out chunks that cannot contain requested row IDs.

Sources: llkv-column-map/src/store/projection.rs:229-236 llkv-column-map/src/store/projection.rs:760-772

Column Catalog Persistence

The catalog mapping LogicalFieldId to PhysicalKey is itself stored in the pager at a well-known root key. On initialization:

  1. ColumnStore::open() attempts to load the catalog from the pager root
  2. If not found, an empty catalog is initialized
  3. All catalog updates are committed atomically during append operations

This design ensures the catalog state remains consistent with persisted data, even after crashes.

Sources: llkv-column-map/README.md:38-40
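A minimal sketch of that open-or-initialize sequence, written against a simplified pager trait; CATALOG_ROOT_KEY and deserialize_catalog are illustrative stand-ins for the crate's actual root key and catalog codec:

```rust
use std::collections::HashMap;

type LogicalFieldId = u64; // bit-packed form sketched earlier
type PhysicalKey = u64;

const CATALOG_ROOT_KEY: PhysicalKey = 0; // hypothetical well-known root key

trait Pager {
    fn get(&self, key: PhysicalKey) -> Option<Vec<u8>>;
}

struct ColumnStore<P: Pager> {
    pager: P,
    catalog: HashMap<LogicalFieldId, PhysicalKey>,
}

impl<P: Pager> ColumnStore<P> {
    fn open(pager: P) -> Self {
        // Load the persisted catalog from the well-known root key, or start
        // from an empty mapping when nothing has been committed yet.
        let catalog = pager
            .get(CATALOG_ROOT_KEY)
            .map(|bytes| deserialize_catalog(&bytes))
            .unwrap_or_default();
        ColumnStore { pager, catalog }
    }
}

fn deserialize_catalog(_bytes: &[u8]) -> HashMap<LogicalFieldId, PhysicalKey> {
    HashMap::new() // placeholder for the real catalog decoding
}
```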

Append Pipeline

RecordBatch Persistence Flow

An incoming RecordBatch is split into per-column chunks, merged with any overlapping data under last-writer-wins semantics, serialized, and committed to the pager together with catalog updates in a single atomic batch.

Sources: llkv-column-map/README.md:24-28

Last-Writer-Wins Semantics

When appending data with row IDs that overlap existing chunks:

  1. The store identifies which chunks contain conflicting row IDs
  2. Existing chunks are deserialized and merged with new data
  3. For duplicate row IDs, the new value overwrites the old
  4. Rewritten chunks are serialized and committed atomically

This ensures that UPDATE operations (implemented as appends at the table layer) correctly overwrite previous values without requiring separate update logic.

Sources: llkv-column-map/README.md:26-27
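The core of the merge step can be pictured as a map insertion in which later writes win. The sketch below works on plain (row_id, value) pairs; the real store merges decoded Arrow chunks instead.

```rust
use std::collections::BTreeMap;

/// Merge incoming rows over existing ones; duplicate row ids take the
/// incoming value (last writer wins).
fn merge_lww(existing: &[(u64, i64)], incoming: &[(u64, i64)]) -> Vec<(u64, i64)> {
    let mut merged: BTreeMap<u64, i64> = existing.iter().copied().collect();
    for &(row_id, value) in incoming {
        // Inserting under an existing key overwrites the old value.
        merged.insert(row_id, value);
    }
    merged.into_iter().collect()
}

fn main() {
    let old = [(0, 10), (1, 11), (2, 12)];
    let new = [(1, 99), (3, 13)];
    assert_eq!(merge_lww(&old, &new), vec![(0, 10), (1, 99), (2, 12), (3, 13)]);
}
```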

Chunking Strategy

Columns are divided into chunks based on:

  • Chunk size threshold: Configurable limit on rows per chunk (typically several thousand)
  • Row ID ranges: Each chunk covers a contiguous range of row IDs
  • Physical key allocation: Each chunk gets a unique physical key from the pager

This chunking enables:

  • Parallel scan operations across chunks
  • Efficient skip-scan by filtering chunks based on row ID predicates
  • Incremental garbage collection of deleted chunks

Sources: llkv-column-map/src/store/projection.rs:760-772

Data Retrieval

Gather Operations

The column store provides two gather strategies for random-access row retrieval:

graph TB
    Input["Row IDs: [5, 123, 999]"]
    Input --> Sort["Sort and deduplicate"]
    Sort --> Filter["Identify intersecting chunks"]
    Filter --> Fetch["batch_get(chunk keys)"]
    Fetch --> Deserialize["Deserialize Arrow arrays"]
    Deserialize --> Gather["Gather requested rows"]
    Gather --> Output["RecordBatch"]

Single-Shot Gather

For one-off queries, gather_rows() performs a complete fetch without caching, paying the descriptor-fetch and chunk-decode costs on every call.

Sources: llkv-column-map/src/store/projection.rs:245-268
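A self-contained sketch of the pipeline from the diagram above: sort and deduplicate the requested row IDs, skip chunks whose ranges cannot match, and gather values in order. Chunks are plain vectors here; the real implementation works on serialized Arrow arrays fetched through the pager.

```rust
/// Each chunk is (first_row_id, values) with contiguous row ids per chunk.
fn gather_rows(chunks: &[(u64, Vec<i64>)], row_ids: &[u64]) -> Vec<i64> {
    let mut ids: Vec<u64> = row_ids.to_vec();
    ids.sort_unstable();
    ids.dedup();

    let mut out = Vec::with_capacity(ids.len());
    for &(start, ref values) in chunks {
        let end = start + values.len() as u64;
        // Skip-scan: ignore requested ids that fall outside this chunk's range.
        for &id in ids.iter().filter(|&&id| id >= start && id < end) {
            out.push(values[(id - start) as usize]);
        }
    }
    out
}

fn main() {
    let chunks = vec![
        (0u64, (0..1000).collect::<Vec<i64>>()),
        (1000u64, (1000..2000).collect()),
    ];
    assert_eq!(gather_rows(&chunks, &[5, 123, 999, 1500]), vec![5, 123, 999, 1500]);
}
```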

Reusable Context Gather

For repeated queries (e.g., join inner loop), gather_rows_with_reusable_context() amortizes costs:

  1. Prepare a MultiGatherContext containing column descriptors and scratch buffers
  2. Call gather repeatedly, reusing:
    • Decoded Arrow chunk arrays (cached in chunk_cache)
    • Row index hash maps (preallocated buffers)
    • Scratch space for row locators

This avoids redundant descriptor fetches and chunk decodes across multiple gather calls.

Sources: llkv-column-map/src/store/projection.rs:516-758
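The amortization idea can be sketched as a decode cache keyed by physical key, so repeated gathers skip chunks that were already fetched and decoded. Apart from MultiGatherContext and chunk_cache, which the text above names, the types here are illustrative stand-ins.

```rust
use std::collections::HashMap;

type PhysicalKey = u64;
type DecodedChunk = Vec<i64>; // stand-in for a decoded Arrow array

#[derive(Default)]
struct MultiGatherContext {
    chunk_cache: HashMap<PhysicalKey, DecodedChunk>,
}

impl MultiGatherContext {
    /// Return the decoded chunk, fetching and decoding only on a cache miss.
    fn chunk(
        &mut self,
        pk: PhysicalKey,
        fetch: impl Fn(PhysicalKey) -> DecodedChunk,
    ) -> &DecodedChunk {
        self.chunk_cache.entry(pk).or_insert_with(|| fetch(pk))
    }
}

fn main() {
    let mut ctx = MultiGatherContext::default();
    // First call decodes; the second call for the same key hits the cache.
    ctx.chunk(1025, |pk| { println!("decoding chunk {pk}"); vec![1, 2, 3] });
    ctx.chunk(1025, |pk| { println!("decoding chunk {pk}"); vec![1, 2, 3] });
}
```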

Gather Null Policies

Three policies control null handling:

| Policy | Behavior |
| --- | --- |
| ErrorOnMissing | Return an error if any requested row ID is not found |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Remove rows where all projected columns are null or missing |

The DropNulls policy is used by MVCC filtering to exclude logically deleted rows.

Sources: llkv-column-map/src/store/projection.rs:39-47
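A sketch of how the policies behave, simplified to a single projected column so that "all projected columns are null" collapses to "the row is missing". The enum name is an assumption; the real definition lives in store/projection.rs.

```rust
enum GatherNullPolicy {
    ErrorOnMissing,
    IncludeNulls,
    DropNulls,
}

fn apply_policy(
    rows: Vec<Option<i64>>,
    policy: GatherNullPolicy,
) -> Result<Vec<Option<i64>>, &'static str> {
    match policy {
        GatherNullPolicy::ErrorOnMissing => {
            if rows.iter().any(|r| r.is_none()) {
                Err("requested row id not found")
            } else {
                Ok(rows)
            }
        }
        // Keep missing rows; they surface as nulls in the output.
        GatherNullPolicy::IncludeNulls => Ok(rows),
        // Filter out rows with no surviving values (MVCC-deleted rows).
        GatherNullPolicy::DropNulls => Ok(rows.into_iter().filter(Option::is_some).collect()),
    }
}

fn main() {
    let gathered = vec![Some(1), None, Some(3)];
    assert_eq!(
        apply_policy(gathered, GatherNullPolicy::DropNulls).unwrap(),
        vec![Some(1), Some(3)]
    );
}
```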

Projection Planning

The projection subsystem prepares multi-column gathers by building a FieldPlan for each projected column.

Each FieldPlan contains:

  • value_metas: Metadata for value chunks (actual column data)
  • row_metas: Metadata for row ID chunks (parallel row ID storage)
  • candidate_indices: Pre-filtered list of chunks that might contain requested rows

Sources: llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:229-236
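Sketched as Rust types, with names following the bullets above and simplified element types standing in for the real ones:

```rust
/// Per-chunk metadata, mirroring the descriptor diagram earlier.
struct ChunkMetadata {
    chunk_pk: u64,
    row_count: u64,
    min_val_u64: u64,
    max_val_u64: u64,
}

/// Per-field gather plan, as described in the bullet list above.
struct FieldPlan {
    value_metas: Vec<ChunkMetadata>, // metadata for value chunks
    row_metas: Vec<ChunkMetadata>,   // metadata for row-id shadow chunks
    candidate_indices: Vec<usize>,   // chunks that may contain requested rows
}
```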

Chunk Intersection Logic

Before fetching chunks, the store filters candidate chunks based on row ID range overlap.
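A minimal sketch of the overlap test, assuming requested row IDs arrive sorted and each chunk's metadata carries the min/max row IDs shown in the descriptor diagram:

```rust
struct ChunkMetadata {
    chunk_pk: u64,
    min_val_u64: u64,
    max_val_u64: u64,
}

/// Keep only chunks whose [min, max] range intersects the request range.
fn candidate_indices(metas: &[ChunkMetadata], sorted_row_ids: &[u64]) -> Vec<usize> {
    let (Some(&lo), Some(&hi)) = (sorted_row_ids.first(), sorted_row_ids.last()) else {
        return Vec::new();
    };
    metas
        .iter()
        .enumerate()
        .filter(|(_, m)| m.min_val_u64 <= hi && m.max_val_u64 >= lo)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let metas = [
        ChunkMetadata { chunk_pk: 1025, min_val_u64: 0, max_val_u64: 999 },
        ChunkMetadata { chunk_pk: 1026, min_val_u64: 1000, max_val_u64: 1999 },
    ];
    // Rows 5 and 123 can only live in the first chunk; the second is skipped.
    assert_eq!(candidate_indices(&metas, &[5, 123]), vec![0]);
}
```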

This optimization prevents loading chunks that cannot possibly contain any requested rows.

Sources: llkv-column-map/src/store/projection.rs:774-794

Serialization Format

Zero-Copy Array Persistence

Column chunks are serialized using a custom format optimized for memory-mapped zero-copy reads:

graph TB
    subgraph "Serialized Array Format"
        Header["Header (24 bytes)\nMAGIC: 'ARR0'\nlayout: Primitive/Varlen/FslFloat32/Struct\ntype_code: PrimType enum\nlen: element count\nextra_a, extra_b: layout-specific"]
        Payload["Payload\nRaw Arrow buffer bytes"]
        Header --> Payload
    end

    subgraph "Deserialization (Zero-Copy)"
        EntryHandle["EntryHandle from Pager\n(memory-mapped or in-memory)"]
        ArrowBuffer["Arrow Buffer\nwraps EntryHandle bytes"]
        ArrayData["ArrayData\nreferences Buffer directly"]
        EntryHandle --> ArrowBuffer
        ArrowBuffer --> ArrayData
    end

    Payload -.->|stored as| EntryHandle

Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
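One plausible packing of the 24-byte header is sketched below. The actual field widths and order are defined in llkv-storage/src/serialization.rs; treat this purely as an illustration of the format's shape.

```rust
const MAGIC: [u8; 4] = *b"ARR0";

#[repr(u8)]
#[derive(Clone, Copy)]
enum Layout {
    Primitive = 0,
    Varlen = 1,
    FslFloat32 = 2,
    Struct = 3,
}

/// Write a 24-byte header; the raw Arrow buffer bytes follow immediately.
fn write_header(
    buf: &mut Vec<u8>,
    layout: Layout,
    type_code: u8,
    len: u64,
    extra_a: u32,
    extra_b: u32,
) {
    buf.extend_from_slice(&MAGIC);                 // bytes 0..4: magic
    buf.push(layout as u8);                        // byte 4: layout tag
    buf.push(type_code);                           // byte 5: PrimType code
    buf.extend_from_slice(&[0u8; 2]);              // bytes 6..8: padding (assumed)
    buf.extend_from_slice(&len.to_le_bytes());     // bytes 8..16: element count
    buf.extend_from_slice(&extra_a.to_le_bytes()); // bytes 16..20: layout-specific
    buf.extend_from_slice(&extra_b.to_le_bytes()); // bytes 20..24: layout-specific
}

fn main() {
    let mut buf = Vec::new();
    write_header(&mut buf, Layout::Primitive, 3, 1000, 0, 0);
    assert_eq!(buf.len(), 24); // payload bytes are appended after the header
}
```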

Layout Types

Four layout variants handle different Arrow data types:

| Layout | Use Case | Payload Structure |
| --- | --- | --- |
| Primitive | Fixed-width primitives (Int32, Float64, etc.) | Single values buffer |
| Varlen | Variable-length (Binary, Utf8, LargeBinary) | Offsets buffer + values buffer |
| FslFloat32 | FixedSizeList<Float32> (e.g., embeddings) | Single contiguous Float32 buffer |
| Struct | Nested struct types | Arrow IPC serialized payload |

The FslFloat32 layout is a specialized fast path for dense vector columns, avoiding nesting overhead.

Sources: llkv-storage/src/serialization.rs:54-135

Why Not Arrow IPC?

The custom format is used instead of standard Arrow IPC for several reasons:

  1. Minimal headers: No schema objects or framing, reducing file size
  2. Predictable payloads: Each array occupies one contiguous region, ideal for mmap and SIMD
  3. True zero-copy: Deserialization produces ArrayData referencing the original mmap directly
  4. Stable codes: Layout and type tags are explicitly pinned with compile-time checks

The trade-off is reduced generality (e.g., no null bitmaps yet) for better scan performance in this storage engine's access patterns.

Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/README.md:10-16

Type Code Stability

The PrimType enum discriminants are compile-time pinned to prevent silent corruption.
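A hedged sketch of the pinning technique: give each variant an explicit discriminant, then assert the values in a const block so any accidental renumbering fails the build. The variant names and codes here are illustrative, not the crate's actual list.

```rust
#[repr(u8)]
enum PrimType {
    Int32 = 0,
    Int64 = 1,
    Float32 = 2,
    Float64 = 3,
}

// Evaluated at compile time: changing any discriminant above turns these
// assertions into a build failure instead of silent on-disk corruption.
const _: () = {
    assert!(PrimType::Int32 as u8 == 0);
    assert!(PrimType::Int64 as u8 == 1);
    assert!(PrimType::Float32 as u8 == 2);
    assert!(PrimType::Float64 as u8 == 3);
};
```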

If any discriminant is accidentally changed, the code will fail to compile, preventing data corruption.

Sources: llkv-storage/src/serialization.rs:561-586

Integration with Table Layer

The Table struct (from llkv-table) wraps Arc<ColumnStore> and delegates storage operations:

graph TB
    subgraph "Table Responsibilities"
        Schema["Schema validation"]
        MVCC["MVCC column injection\n(row_id, created_by, deleted_by)"]
        Catalog["System catalog updates"]
    end

    subgraph "ColumnStore Responsibilities"
        Chunk["Column chunking"]
        Map["LogicalFieldId → PhysicalKey"]
        Persist["Physical persistence"]
    end

    Schema --> Chunk
    MVCC --> Chunk
    Catalog --> Map
    Chunk --> Persist

The table layer focuses on schema enforcement and MVCC semantics, while the column store handles physical storage details.

Sources: llkv-table/README.md:19-24 llkv-column-map/README.md:12-16

Integration with Pager Trait

The column store is generic over any Pager<Blob = EntryHandle>.
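In sketch form, with a stand-in trait and blob type (the real definitions live in llkv-storage):

```rust
use std::sync::Arc;

type PhysicalKey = u64;

/// Stand-in for the llkv-storage pager trait.
trait Pager {
    type Blob: AsRef<[u8]>;
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Self::Blob>>;
    fn batch_put(&mut self, entries: Vec<(PhysicalKey, Vec<u8>)>);
}

/// Stand-in for the zero-copy blob handle returned by pagers.
struct EntryHandle(Arc<[u8]>);

impl AsRef<[u8]> for EntryHandle {
    fn as_ref(&self) -> &[u8] {
        &self.0
    }
}

/// The store compiles against any pager whose blob type is EntryHandle,
/// so an in-memory pager and a persistent one are interchangeable here.
struct ColumnStore<P: Pager<Blob = EntryHandle>> {
    pager: P,
}
```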

This abstraction allows the same column store code to work with both ephemeral in-memory storage (for transaction staging) and durable persistent storage (for committed data).

Sources: llkv-column-map/README.md:36-39 llkv-storage/README.md:19-22

MVCC Column Storage

MVCC metadata columns are stored using the same column infrastructure as user data, but in a separate namespace:

graph TB
    subgraph "User Table 'employees'"
        UserCol1["LogicalFieldId\n(User, table=5, field=0)\n'name' column"]
        UserCol2["LogicalFieldId\n(User, table=5, field=1)\n'age' column"]
    end

    subgraph "MVCC Metadata for 'employees'"
        MvccCol1["LogicalFieldId\n(Mvcc, table=5, field=created_by)"]
        MvccCol2["LogicalFieldId\n(Mvcc, table=5, field=deleted_by)"]
    end

    subgraph "Row ID Shadow for 'employees'"
        RowCol1["LogicalFieldId\n(RowId, table=5, field=0)"]
        RowCol2["LogicalFieldId\n(RowId, table=5, field=1)"]
    end

    UserCol1 & UserCol2 & MvccCol1 & MvccCol2 & RowCol1 & RowCol2 --> Store["ColumnStore"]

This design keeps MVCC bookkeeping transparent to the column store while allowing the table layer to enforce visibility rules by querying MVCC columns.

Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:32-34

Concurrency and Parallelism

Parallel Scans

The column store supports parallel scanning via Rayon:

  • Chunk-level parallelism: Different chunks can be processed concurrently
  • Bounded thread pool: The worker pool size is capped by the LLKV_MAX_THREADS environment variable
  • Lock-free reads: Descriptors and chunks are immutable once written

Sources: llkv-column-map/README.md:32-34
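A sketch of chunk-level parallelism with Rayon, mirroring the points above: a pool sized from LLKV_MAX_THREADS and immutable chunks scanned concurrently through shared references. The real scan pipeline differs in detail.

```rust
use rayon::prelude::*;

fn main() {
    // Bound the worker pool from the environment; 0 lets Rayon pick its default.
    let threads = std::env::var("LLKV_MAX_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0);
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build()
        .expect("failed to build thread pool");

    // Immutable chunks are shared by reference only, so the scan needs no locks.
    let chunks: Vec<Vec<i64>> = (0..8)
        .map(|c| (c * 1_000..(c + 1) * 1_000).collect())
        .collect();
    let total: i64 = pool.install(|| {
        chunks
            .par_iter() // chunk-level parallelism
            .map(|chunk| chunk.iter().sum::<i64>())
            .sum()
    });
    println!("scanned sum = {total}");
}
```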

Catalog Locking

The catalog mapping is protected by an RwLock:

  • Readers acquire shared lock during scans/gathers
  • Writers acquire exclusive lock during append/create operations
  • Lock contention is minimized by holding locks only during catalog lookups, not during chunk I/O

Sources: llkv-column-map/src/store/projection.rs:461-468
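The locking discipline can be sketched as copying the physical key out under a short read lock, so no lock is held during chunk I/O:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type LogicalFieldId = u64;
type PhysicalKey = u64;

struct Catalog {
    map: RwLock<HashMap<LogicalFieldId, PhysicalKey>>,
}

impl Catalog {
    /// Shared lock, held only for the lookup; chunk I/O happens after release.
    fn physical_key(&self, field: LogicalFieldId) -> Option<PhysicalKey> {
        self.map.read().unwrap().get(&field).copied()
    }

    /// Exclusive lock on the append/create path.
    fn register(&self, field: LogicalFieldId, pk: PhysicalKey) {
        self.map.write().unwrap().insert(field, pk);
    }
}

fn main() {
    let catalog = Catalog { map: RwLock::new(HashMap::new()) };
    catalog.register(42, 1024);
    // The read guard is dropped before any pager batch_get would run.
    assert_eq!(catalog.physical_key(42), Some(1024));
}
```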

Performance Characteristics

Append Performance

| Operation | Complexity | Notes |
| --- | --- | --- |
| Sequential append (no conflicts) | O(n log n) | Dominated by sorting row IDs |
| Append with overwrites | O(n log n + m log m) | m = existing rows in conflict chunks |
| Chunk serialization | O(n) | Linear in data size |

Gather Performance

| Operation | Complexity | Notes |
| --- | --- | --- |
| Random gather (cold) | O(k log c + r) | k = chunks touched, c = total chunks, r = rows fetched |
| Random gather (hot cache) | O(r) | Chunks already decoded |
| Sequential scan | O(n) | Linear in result size |

The chunk skip-scan optimization reduces k by filtering chunks based on min/max row ID metadata.

Sources: llkv-column-map/src/store/projection.rs:774-794