Column Storage and ColumnStore

This page documents the columnar storage layer implemented in the llkv-column-map crate. The ColumnStore provides Arrow-native persistence of columnar data over a key-value pager, enabling efficient predicate evaluation, multi-column projections, and Last-Write-Wins (LWW) semantics for updates.

For information about the table-level abstraction built on top of ColumnStore, see Table Abstraction. For details about the underlying pager interface, see Pager Interface and SIMD Optimization.

ColumnStore Architecture

The ColumnStore<P> struct (llkv-column-map/src/store/core.rs:60-68) serves as the primary interface for columnar data operations. It is parameterized by a Pager implementation and manages:

  • Column catalog: Maps LogicalFieldId to physical storage keys
  • Descriptors: Linked lists of chunk metadata for each column
  • Data type cache: Avoids repeated descriptor loads for schema queries
  • Index manager: Maintains presence and value indexes
  • Context pool: Reusable gather contexts to amortize chunk fetch costs
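
As a rough sketch, the shape of these components might look as follows; the names and types here are illustrative stand-ins, not the actual definitions in llkv-column-map/src/store/core.rs:

```rust
// Illustrative sketch only; the real field names and types may differ.
use std::collections::HashMap;
use std::sync::{atomic::AtomicU64, Arc, RwLock};

type LogicalFieldId = u64; // see "Namespaces" below for the real structure
type PhysicalKey = u64;    // key understood by the pager

struct ColumnStoreSketch {
    // Single source of truth: column -> physical storage key.
    catalog: Arc<RwLock<HashMap<LogicalFieldId, PhysicalKey>>>,
    // Bumped after every append; invalidates pooled gather contexts.
    append_epoch: Arc<AtomicU64>,
    // Data-type cache, index manager, and gather-context pool omitted.
}
```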

Purpose: This diagram shows the internal components of ColumnStore and their relationships. The catalog is the single source of truth for column-to-physical-key mappings, while auxiliary structures provide caching and optimization.

Sources: llkv-column-map/src/store/core.rs:60-68, llkv-column-map/src/store/core.rs:100-124

Storage Organization

Namespaces

Columns are identified by LogicalFieldId, which combines a namespace, table ID, and field ID to prevent collisions between different data categories (llkv-column-map/src/store/core.rs:38-46):

| Namespace | Purpose | Example Usage |
|---|---|---|
| UserData | Regular table columns | User-visible columns from CREATE TABLE |
| RowIdShadow | Internal row ID tracking | Shadow column storing the RowId for each value |
| TxnCreatedBy | MVCC creation timestamps | Transaction ID that created each row |
| TxnDeletedBy | MVCC deletion timestamps | Transaction ID that deleted each row |

Each column in the UserData namespace has a corresponding RowIdShadow column that stores the row IDs for non-null values. This pairing enables efficient row-based gathering and filtering.
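
A hypothetical composition of LogicalFieldId, shown only to illustrate how the namespace tag prevents collisions; the crate's actual encoding in core.rs:38-46 may pack these fields differently:

```rust
// Hypothetical layout; the real LogicalFieldId encoding may differ.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Namespace {
    UserData,
    RowIdShadow,
    TxnCreatedBy,
    TxnDeletedBy,
}

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct LogicalFieldId {
    namespace: Namespace,
    table_id: u32,
    field_id: u32,
}

// The same (table_id, field_id) pair can live in several namespaces, so
// a UserData column and its RowIdShadow twin never collide in the catalog.
```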

Sources: llkv-column-map/src/store/core.rs:38-46

Physical Storage Structure

Data is organized as a hierarchy: ColumnCatalog → ColumnDescriptor → DescriptorPages → ChunkMetadata → Data Chunks.

Purpose: This diagram shows how data is physically organized in storage. The catalog is the entry point, descriptors form a linked list of metadata pages, and each metadata entry points to a serialized Arrow array.

```mermaid
graph TB
    CATALOG["ColumnCatalog\nPhysicalKey: CATALOG_ROOT_PKEY"]

    subgraph "Per Column"
        DESC["ColumnDescriptor\n(head_page_pk, tail_page_pk,\ntotal_row_count)"]
        PAGE1["DescriptorPage\n(next_page_pk, entry_count)"]
        PAGE2["DescriptorPage\n(next_page_pk, entry_count)"]
    end

    subgraph "Chunk Metadata"
        META1["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
        META2["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
        META3["ChunkMetadata"]
    end

    subgraph "Arrow Data"
        CHUNK1["Serialized Arrow Array\n(values)"]
        CHUNK2["Serialized Arrow Array\n(values)"]
        CHUNK3["Serialized Arrow Array\n(row IDs)"]
    end

    CATALOG -->|maps LogicalFieldId| DESC
    DESC -->|head_page_pk| PAGE1
    PAGE1 -->|next_page_pk| PAGE2
    PAGE1 --> META1
    PAGE1 --> META2
    PAGE2 --> META3
    META1 -->|chunk_pk| CHUNK1
    META2 -->|chunk_pk| CHUNK2
    META3 -->|chunk_pk| CHUNK3
```

Key structures:

  • ColumnCatalog (llkv-column-map/src/store/catalog.rs): Root mapping stored at CATALOG_ROOT_PKEY
  • ColumnDescriptor: Per-column header with total row count and pointers to metadata pages
  • DescriptorPage: Linked list node containing multiple ChunkMetadata entries
  • ChunkMetadata (llkv-column-map/src/store/descriptor.rs): Min/max values, row count, serialized size, chunk physical key

The min/max values in ChunkMetadata enable chunk pruning: predicates can skip entire chunks without loading data if the min/max range doesn't overlap the query predicate.
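
To make the pruning rule concrete, here is a hedged sketch of the metadata and the skip test; field names follow the descriptions above, but the exact layout lives in descriptor.rs:

```rust
// Illustrative only; ChunkMetadata's real layout is defined in
// llkv-column-map/src/store/descriptor.rs.
struct ChunkMetadataSketch {
    chunk_pk: u64,     // physical key of the serialized Arrow array
    row_count: u64,
    min_val_u64: u64,  // smallest value in the chunk
    max_val_u64: u64,  // largest value in the chunk
    serialized_bytes: u64,
}

impl ChunkMetadataSketch {
    /// Skip the chunk when the predicate's closed range [pred_lo, pred_hi]
    /// cannot intersect [min_val, max_val].
    fn can_skip(&self, pred_lo: u64, pred_hi: u64) -> bool {
        pred_hi < self.min_val_u64 || pred_lo > self.max_val_u64
    }
}
```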

Sources: llkv-column-map/src/store/core.rs:100-124, llkv-column-map/src/store/descriptor.rs

Core Operations

Append with Last-Write-Wins

The append method (llkv-column-map/src/store/core.rs:787-1633) ingests a RecordBatch with LWW semantics:

  1. Preprocessing: Sort by rowid if not already sorted
  2. LWW Rewrite: For each column, identify existing row IDs and overwrite them in place
  3. Filter for Append: Remove rewritten rows and nulls from the batch
  4. Chunk and Persist: Split data into target-sized chunks, serialize as Arrow arrays, and persist with metadata

Purpose: This diagram shows the append flow with LWW semantics. Each column is processed independently: rewrites happen first, then new data is chunked and persisted atomically.

```mermaid
sequenceDiagram
    participant Caller
    participant ColumnStore
    participant LWW as "LWW Rewrite"
    participant Chunker
    participant Pager

    Caller->>ColumnStore: append(batch)
    ColumnStore->>ColumnStore: Sort by rowid if needed

    loop For each column
        ColumnStore->>LWW: lww_rewrite_for_field(field_id, row_ids, values)
        LWW->>Pager: Load existing chunks
        LWW->>LWW: Identify overlapping row IDs
        LWW->>Pager: Overwrite matching rows
        LWW-->>ColumnStore: rewritten_ids

        ColumnStore->>ColumnStore: Filter out rewritten_ids and nulls
        ColumnStore->>Chunker: Split to target chunk sizes

        loop For each chunk
            Chunker->>Pager: alloc() for chunk_pk and rid_pk
            Chunker->>Chunker: Serialize Arrow array
            Chunker->>Chunker: Compute min/max/size metadata
        end
    end

    ColumnStore->>Pager: batch_put(all_puts)
    ColumnStore->>ColumnStore: Increment append_epoch
    ColumnStore-->>Caller: Success
```

The append_epoch counter (llkv-column-map/src/store/core.rs:132-134) is incremented after every append, providing a cache invalidation signal for gather contexts.
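
The following sketch illustrates what LWW semantics look like at the call site. The batch layout (a row-ID column plus a value column) follows this page's description; the exact schema conventions and append signature in llkv-column-map may differ:

```rust
use std::sync::Arc;
use arrow::array::{Int64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build a two-column batch: row IDs plus one value column.
fn make_batch(row_ids: Vec<u64>, values: Vec<i64>) -> RecordBatch {
    let schema = Arc::new(Schema::new(vec![
        Field::new("row_id", DataType::UInt64, false),
        Field::new("value", DataType::Int64, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(UInt64Array::from(row_ids)),
            Arc::new(Int64Array::from(values)),
        ],
    )
    .unwrap()
}

// store.append(&make_batch(vec![1, 2, 3], vec![10, 20, 30]))?;
// Re-appending row_id 2 overwrites the old value in place (LWW) rather
// than creating a duplicate row:
// store.append(&make_batch(vec![2], vec![99]))?;
// A subsequent gather of row_id 2 yields 99, not 20.
```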

Sources: llkv-column-map/src/store/core.rs:787-1633, llkv-column-map/src/store/core.rs:893-942

Multi-Column Gathering

The gather_rows method (llkv-column-map/src/store/projection.rs:758-785) assembles a RecordBatch from multiple columns given a list of row IDs. It uses a two-phase strategy:

Phase 1: Planning

  • Load ColumnDescriptor for each field
  • Collect ChunkMetadata for value and row ID chunks
  • Build FieldPlan with candidate chunk indices

Phase 2: Execution

  • Identify candidate chunks that intersect requested row IDs (chunk pruning via min/max)
  • Batch-fetch missing chunks from pager
  • For each column, build Arrow array by gathering values from chunks
  • Assemble RecordBatch from gathered columns

Purpose: This diagram shows the gather pipeline. Planning loads metadata; execution prunes chunks and assembles the result. The chunk cache avoids redundant pager reads across multiple gather calls.

Context Pooling: The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains a pool of MultiGatherContext objects keyed by field IDs. Each context caches:

  • Chunk arrays by physical key
  • Row ID index for sparse lookups
  • Scratch buffers for row location tracking
  • Arrow builders for each column type

By reusing contexts, repeated gather calls avoid allocating temporary structures and can benefit from cached chunks.
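
A minimal sketch of the two-phase shape (plan, then execute with a pooled context); all names below are illustrative stand-ins rather than the crate's API:

```rust
// Illustrative stand-ins; FieldPlan is named on this page, but this
// signature and layout are assumptions.
struct FieldPlan {
    // Chunk indices whose row-ID [min, max] range intersects the request.
    candidate_chunks: Vec<usize>,
}

/// Phase 1: prune to candidate chunks using per-chunk row-ID bounds.
/// Assumes `row_ids` is sorted ascending and non-empty.
fn plan(chunk_bounds: &[(u64, u64)], row_ids: &[u64]) -> FieldPlan {
    let (lo, hi) = (row_ids[0], *row_ids.last().unwrap());
    FieldPlan {
        candidate_chunks: chunk_bounds
            .iter()
            .enumerate()
            .filter(|&(_, &(min, max))| max >= lo && min <= hi)
            .map(|(i, _)| i)
            .collect(),
    }
}

// Phase 2 (not shown): batch-fetch candidate chunks missing from the
// pooled context's cache, gather values into reused Arrow builders, and
// assemble the final RecordBatch.
```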

Sources: llkv-column-map/src/store/projection.rs:758-927, llkv-column-map/src/store/projection.rs:651-720

Filter Operations

The filter_row_ids method (llkv-column-map/src/store/core.rs:356-372) evaluates a predicate against a column and returns matching row IDs. It uses:

  1. Type-specific dispatch: FilterDispatch trait routes to specialized implementations (primitive types, strings, booleans)
  2. Chunk pruning: Skip chunks where min_val and max_val don't overlap the predicate
  3. Vectorized evaluation: For string predicates like CONTAINS, use Arrow's vectorized kernels (llkv-column-map/src/store/scan/filter.rs:337-503)
  4. Dense encoding: Return results as FilterResult with run-length encoding when possible

For example, the fused string filter (llkv-column-map/src/store/scan/filter.rs:337-503) evaluates multiple CONTAINS predicates in a single pass:

  • Load value and row ID chunks in parallel (via Rayon)
  • Use Arrow’s contains_utf8_scalar kernel for each pattern
  • AND results using a bitmask to avoid per-row branching
  • Extract matching row IDs from the final bitmask
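
In the same spirit, here is a hedged sketch of AND-ing per-pattern match masks. The real kernel uses Arrow's contains_utf8_scalar and a fused bitmask; this stand-in substitutes a scalar closure to stay self-contained:

```rust
use arrow::array::{BooleanArray, StringArray};
use arrow::compute::and;

// Evaluate several CONTAINS patterns and AND the masks together.
fn fused_contains(values: &StringArray, patterns: &[&str]) -> BooleanArray {
    let mut acc: Option<BooleanArray> = None;
    for pat in patterns {
        // One pass per pattern; nulls propagate as None.
        let mask: BooleanArray = values
            .iter()
            .map(|v| v.map(|s| s.contains(pat)))
            .collect();
        acc = Some(match acc {
            Some(prev) => and(&prev, &mask).unwrap(),
            None => mask,
        });
    }
    // No patterns: everything matches.
    acc.unwrap_or_else(|| BooleanArray::from(vec![true; values.len()]))
}
```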

Sources: llkv-column-map/src/store/core.rs:356-372, llkv-column-map/src/store/scan/filter.rs:209-503

Optimization Techniques

Chunk Pruning via Min/Max Metadata

Each ChunkMetadata stores min_val_u64 and max_val_u64 (llkv-column-map/src/store/descriptor.rs). During filtering or gathering, chunks are skipped if:

  • For predicates: The predicate range doesn’t overlap [min_val, max_val]
  • For point lookups: The requested row ID is outside [min_val, max_val]

This optimization is particularly effective for:

  • Range predicates on sorted or clustered columns
  • Temporal queries on timestamp columns (common in OLAP)
  • Primary key lookups
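
A worked instance of the overlap test, assuming closed u64 ranges on both sides:

```rust
// May-match test: the chunk survives pruning iff the ranges intersect.
fn chunk_may_match(min_val: u64, max_val: u64, pred_lo: u64, pred_hi: u64) -> bool {
    max_val >= pred_lo && min_val <= pred_hi
}

fn main() {
    // A chunk covering values [100, 500] is pruned for x > 600...
    assert!(!chunk_may_match(100, 500, 601, u64::MAX));
    // ...but must be scanned for 450 <= x <= 700.
    assert!(chunk_may_match(100, 500, 450, 700));
}
```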

Sources: llkv-column-map/src/store/core.rs:679-690, llkv-column-map/src/store/projection.rs:1019-1026

Context Pooling

The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains up to 4 contexts per unique field ID set. This amortizes:

  • Chunk cache allocations (reuse FxHashMap<PhysicalKey, ArrayRef>)
  • Arrow builder allocations (reuse builders with reserved capacity)
  • Scratch buffer allocations (reuse Vec for row indices)

Contexts track an epoch (llkv-column-map/src/store/projection.rs:520-526) that is compared against ColumnStore.append_epoch. If epochs mismatch, the context rebuilds its metadata from the latest descriptors.
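
A sketch of that epoch check; names are illustrative, and the real refresh path reloads descriptors and chunk metadata:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for MultiGatherContext's freshness tracking.
struct GatherContextSketch {
    epoch: u64, // store epoch observed when metadata was last built
}

impl GatherContextSketch {
    fn ensure_fresh(&mut self, store_epoch: &AtomicU64) {
        let current = store_epoch.load(Ordering::Acquire);
        if self.epoch != current {
            // An append landed since this context was built: rebuild
            // metadata from the latest descriptors, then adopt the epoch.
            self.epoch = current;
        }
    }
}
```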

Sources: llkv-column-map/src/store/projection.rs:651-720, llkv-column-map/src/store/projection.rs:460-526

Data Type Caching

The DTypeCache (llkv-column-map/src/store/core.rs:64) stores the Arrow DataType for each LogicalFieldId to avoid repeated descriptor loads during schema queries. It uses a fingerprint stored in the descriptor to detect type changes (e.g., from ALTER TABLE operations).
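
A hedged sketch of a fingerprinted lookup; the actual DTypeCache layout and fingerprint scheme are not shown here:

```rust
use std::collections::HashMap;
use arrow::datatypes::DataType;

type LogicalFieldId = u64; // stand-in

// Cache entries pair the descriptor fingerprint with the resolved type.
struct DTypeCacheSketch {
    cached: HashMap<LogicalFieldId, (u64, DataType)>,
}

impl DTypeCacheSketch {
    /// Return the cached DataType only while the descriptor fingerprint
    /// still matches (i.e. no ALTER TABLE changed the column's type).
    fn get(&self, field: LogicalFieldId, descriptor_fp: u64) -> Option<&DataType> {
        match self.cached.get(&field) {
            Some((fp, dt)) if *fp == descriptor_fp => Some(dt),
            _ => None,
        }
    }
}
```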

Sources: llkv-column-map/src/store/core.rs:175-180, llkv-column-map/src/store/core.rs:190-227

Indexes

The IndexManager (llkv-column-map/src/store/core.rs:65) supports two index types:

| Index Type | Purpose | Storage |
|---|---|---|
| Presence | Tracks which row IDs exist | Bitmap or sorted permutation array |
| Value | Enables sorted iteration by value | Permutation array mapping value order to physical order |

Indexes are created via register_index (llkv-column-map/src/store/core.rs:145-147) and maintained automatically during append operations. The has_row_id method (llkv-column-map/src/store/core.rs:663-754) uses presence indexes for fast lookups via binary search.
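
The binary-search idea behind that lookup, reduced to a sketch (the real presence index may be a bitmap or a sorted permutation, as the table above notes):

```rust
// Presence check over a sorted row-ID slice.
fn has_row_id(sorted_row_ids: &[u64], target: u64) -> bool {
    sorted_row_ids.binary_search(&target).is_ok()
}

fn main() {
    let present = [1u64, 5, 9, 12];
    assert!(has_row_id(&present, 9));
    assert!(!has_row_id(&present, 7));
}
```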

Sources: llkv-column-map/src/store/core.rs:145-165, llkv-column-map/src/store/core.rs:663-754

Data Flow: Append Operation

Purpose: This flowchart shows the detailed append logic. Each column is processed independently (LWW rewrite, filter, chunk), then all writes are flushed atomically.

Sources: llkv-column-map/src/store/core.rs:787-1633

Data Flow: Gather Operation

Purpose: This flowchart shows the gather operation with context reuse. The epoch check ensures cache validity, and the dense pattern optimization avoids hash lookups for contiguous row IDs.

Sources: llkv-column-map/src/store/projection.rs:929-1195

Configuration and Tuning

The ColumnStoreConfig (llkv-column-map/src/store/core.rs:63) provides tuning parameters.

The GatherContextPool (llkv-column-map/src/store/projection.rs:658-663) maintains up to max_per_key contexts per field set (currently 4). Increasing this value trades memory for reduced context allocation frequency in highly concurrent workloads.

Sources: llkv-column-map/src/store/core.rs:126-129, llkv-column-map/src/store/projection.rs:658-663

Thread Safety

ColumnStore is Send + Sync (llkv-column-map/src/store/core.rs:49). Internal state uses:

  • Arc<RwLock<ColumnCatalog>>: Catalog allows concurrent reads, exclusive writes
  • Arc<Pager>: Pager implementations must be Send + Sync
  • Arc<AtomicU64>: Epoch counter supports lock-free reads

This enables concurrent query execution across multiple threads while serializing metadata updates (e.g., append, remove_column).
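
A minimal sketch of that sharing pattern; the concrete types are stand-ins for the crate's internals:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

fn main() {
    let catalog = Arc::new(RwLock::new(HashMap::<u64, u64>::new()));
    let epoch = Arc::new(AtomicU64::new(0));

    // Readers take shared locks concurrently...
    {
        let read_guard = catalog.read().unwrap();
        let _ = read_guard.get(&42);
    }

    // ...while an append takes the exclusive lock, then bumps the epoch
    // so pooled gather contexts know to refresh their metadata.
    catalog.write().unwrap().insert(42, 7);
    epoch.fetch_add(1, Ordering::Release);
}
```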

Sources: llkv-column-map/src/store/core.rs:49-85
