

Table Abstraction


Purpose and Scope

The Table abstraction provides a schema-aware interface for data operations in the LLKV storage layer. It sits between query execution components and the columnar storage engine, validating schemas, injecting MVCC metadata, and translating logical operations into physical column store interactions. This document details the Table struct and its APIs for appending data, scanning rows, and coordinating with the system catalog.

For information about the underlying columnar storage implementation, see Column Storage and ColumnStore. For details on the storage pager abstraction, see Pager Interface and SIMD Optimization. For catalog management APIs, see CatalogManager API and System Catalog and SysCatalog.

Overview

The llkv-table crate provides the primary interface between SQL execution and physical storage. Each Table instance represents a logical table with a defined schema and wraps a reference to a ColumnStore that handles the actual persistence. Tables are responsible for enforcing schema constraints, injecting MVCC metadata columns, and exposing streaming scan APIs that integrate with the query executor.

Sources: llkv-table/README.md:1-57 llkv-table/Cargo.toml:1-60

graph TB
    subgraph "Query Execution Layer"
        RUNTIME["Runtime\nStatement Executor"]
        EXECUTOR["Executor\nQuery Evaluation"]
    end

    subgraph "Table Layer (llkv-table)"
        TABLE["Table\nSchema-aware API"]
        SYSCAT["SysCatalog\nTable 0\nMetadata Store"]
        STREAM["ColumnStream\nStreaming Scans"]
    end

    subgraph "Column Store Layer (llkv-column-map)"
        COLSTORE["ColumnStore\nColumnar Storage"]
        PROJECTION["Projection\nGather Logic"]
    end

    subgraph "Storage Layer (llkv-storage)"
        PAGER["Pager Trait\nBatch Get/Put"]
    end

    RUNTIME -->|CREATE TABLE INSERT UPDATE| TABLE
    EXECUTOR -->|SELECT scan_stream| TABLE

    TABLE -->|validate schema| TABLE
    TABLE -->|inject MVCC cols| TABLE
    TABLE -->|append RecordBatch| COLSTORE
    TABLE -->|gather_rows| COLSTORE

    SYSCAT -->|stores TableMeta ColMeta| COLSTORE

    TABLE -->|scan_stream returns| STREAM
    STREAM -->|fetch batches| COLSTORE

    COLSTORE -->|uses| PROJECTION
    PROJECTION -->|batch_get/put| PAGER

Table Structure and Core Responsibilities

A Table instance encapsulates a schema-validated view over a ColumnStore. The table layer is responsible for:

| Responsibility | Description |
| --- | --- |
| Schema Validation | Ensures all RecordBatch operations match the declared Arrow schema |
| MVCC Injection | Adds system columns (row_id, created_by, deleted_by) to all data |
| Catalog Coordination | Persists and retrieves table/column metadata via SysCatalog (table 0) |
| Data Routing | Translates logical field requests to LogicalFieldId for ColumnStore |
| Streaming Scans | Provides the ColumnStream API for paginated, predicate-pushdown reads |

The table wraps an Arc<ColumnStore> from llkv-column-map, enabling multiple table instances to share the same underlying storage. This design supports efficient metadata queries and concurrent access patterns.
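
A minimal sketch of this shape (field names and stand-in types are illustrative, not the crate's exact definitions):

```rust
use std::sync::Arc;

// Placeholder stand-ins for types provided by other llkv crates.
struct ColumnStore;          // llkv-column-map's columnar storage engine
struct Schema;               // stand-in for arrow::datatypes::Schema

// Illustrative sketch: a Table couples a declared schema with shared storage.
struct Table {
    table_id: u32,           // identifier registered in SysCatalog (table 0)
    schema: Arc<Schema>,     // declared Arrow schema used for validation
    store: Arc<ColumnStore>, // Arc lets many Table instances share one store
}
```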

Sources: llkv-table/README.md:12-40 llkv-column-map/README.md:1-61

MVCC Column Management

Every table in LLKV maintains three system columns alongside user-defined fields:

graph LR
    subgraph "User Schema"
        UC1["name: Utf8"]
        UC2["age: Int32"]
        UC3["email: Utf8"]
    end

    subgraph "System Columns (MVCC)"
        ROW_ID["row_id: UInt64\nMonotonic identifier"]
        CREATED["created_by: UInt64\nTransaction ID"]
        DELETED["deleted_by: UInt64\nDeletion TXN or MAX"]
    end

    COLSTORE["ColumnStore\nLogicalFieldId\nNamespacing"]

    UC1 -.->|stored in namespace USER| COLSTORE
    UC2 -.-> COLSTORE
    UC3 -.-> COLSTORE

    ROW_ID -.->|namespace TXN_METADATA| COLSTORE
    CREATED -.-> COLSTORE
    DELETED -.-> COLSTORE

MVCC Column Semantics

  • row_id: A monotonically increasing UInt64 that uniquely identifies each row within a table. Assigned during append operations and used for row-level operations and correlation.

  • created_by: The transaction ID (UInt64) that created this row version. Set during INSERT or UPDATE operations.

  • deleted_by: The transaction ID that marked this row as deleted, or u64::MAX if the row is still live. UPDATE operations logically delete old versions and insert new ones.

These columns are stored in separate logical namespaces within the ColumnStore to avoid collisions with user-defined columns. The table layer automatically injects these columns during append operations and uses them for visibility filtering during scans.
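
A hedged sketch of the injection step using the Arrow API, assuming a hypothetical `inject_mvcc_columns` helper (the crate's internals may differ):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: extend a user batch with the three MVCC system
/// columns. `next_row_id` is the table's current row-id watermark and
/// `txn_id` is the writing transaction.
fn inject_mvcc_columns(
    batch: &RecordBatch,
    next_row_id: u64,
    txn_id: u64,
) -> Result<RecordBatch, arrow::error::ArrowError> {
    let n = batch.num_rows();
    let row_ids: ArrayRef =
        Arc::new(UInt64Array::from_iter_values(next_row_id..next_row_id + n as u64));
    let created: ArrayRef = Arc::new(UInt64Array::from(vec![txn_id; n]));
    // u64::MAX marks a live (not yet deleted) row version.
    let deleted: ArrayRef = Arc::new(UInt64Array::from(vec![u64::MAX; n]));

    let mut fields: Vec<Field> =
        batch.schema().fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.push(Field::new("row_id", DataType::UInt64, false));
    fields.push(Field::new("created_by", DataType::UInt64, false));
    fields.push(Field::new("deleted_by", DataType::UInt64, false));

    let mut columns = batch.columns().to_vec();
    columns.extend([row_ids, created, deleted]);
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```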

Sources: llkv-table/README.md:15-16 llkv-column-map/README.md:20-28

Data Operations

Append Operations

The Table::append method accepts an Arrow RecordBatch and performs the following steps:

graph TB
    START["Table::append(RecordBatch)"]
    VALIDATE["Validate Schema\nCheck column names/types"]
    INJECT["Inject MVCC Columns\nrow_id, created_by, deleted_by"]
    NAMESPACE["Map to LogicalFieldId\nApply namespace prefixes"]
    PERSIST["ColumnStore::append\nSort by row_id\nLast-writer-wins"]
    COMMIT["Pager::batch_put\nAtomic commit"]

    START --> VALIDATE
    VALIDATE -->|schema mismatch| ERROR["Return Error"]
    VALIDATE -->|valid| INJECT
    INJECT --> NAMESPACE
    NAMESPACE --> PERSIST
    PERSIST --> COMMIT
    COMMIT --> SUCCESS["Return Ok"]

The append pipeline ensures:

  1. Schema consistency: All incoming batches must match the table's declared schema
  2. MVCC tagging: System columns are added with appropriate transaction IDs
  3. Ordering: Rows are sorted by row_id before persistence for efficient scans
  4. Atomicity: Multi-column writes are committed atomically via batch pager operations
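
A condensed sketch of this pipeline, reusing the hypothetical `inject_mvcc_columns` helper from above; the error type and the `persist` callback (standing in for ColumnStore::append) are assumptions, not the crate's exact API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;

// Hypothetical error type for the sketch.
#[derive(Debug)]
enum AppendError {
    SchemaMismatch,
    Arrow(arrow::error::ArrowError),
}

/// Condensed append flow. `schema` and `next_row_id` stand in for Table
/// state; `persist` stands in for the ColumnStore::append call.
fn append(
    schema: &Schema,
    next_row_id: &AtomicU64,
    batch: &RecordBatch,
    txn_id: u64,
    persist: impl Fn(RecordBatch) -> Result<(), AppendError>,
) -> Result<(), AppendError> {
    // 1. Schema consistency: names, types, and nullability must match.
    if batch.schema().as_ref() != schema {
        return Err(AppendError::SchemaMismatch);
    }
    // 2. MVCC tagging: claim a contiguous row_id range, stamp txn metadata.
    let start = next_row_id.fetch_add(batch.num_rows() as u64, Ordering::SeqCst);
    let tagged = inject_mvcc_columns(batch, start, txn_id).map_err(AppendError::Arrow)?;
    // 3 & 4. Sorting by row_id and the atomic batch_put commit happen
    //        inside the column store and pager layers.
    persist(tagged)
}
```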

Sources: llkv-table/README.md:20-30 llkv-column-map/README.md:24-28

Scan Operations

Tables expose streaming scan APIs through the scan_stream method, which returns a ColumnStream for paginated result retrieval:

graph TB
    SCAN["Table::scan_stream\n(projections, filter)"]
    NORMALIZE["Normalize Predicate\nApply De Morgan's laws"]
    COMPILE["Compile to EvalProgram\nStack-based bytecode"]
    STREAM["Create ColumnStream\nLazy evaluation"]
    FETCH["ColumnStream::next_batch\nFetch N rows"]
    FILTER["Apply Predicate\nVectorized evaluation"]
    MVCC["MVCC Filtering\nSnapshot visibility"]
    PROJECT["Gather Projected Cols\ngather_rows()"]
    BATCH["Return RecordBatch"]

    SCAN --> NORMALIZE
    NORMALIZE --> COMPILE
    COMPILE --> STREAM
    STREAM -.->|caller iterates| FETCH
    FETCH --> FILTER
    FILTER --> MVCC
    MVCC --> PROJECT
    PROJECT --> BATCH
    BATCH -.->|next iteration| FETCH

The scan path supports:

  • Predicate pushdown: Filters are compiled to bytecode and evaluated at the column store level
  • Projection: Only requested columns are materialized
  • MVCC filtering: Rows are filtered based on transaction snapshot visibility rules
  • Streaming: Results are produced in fixed-size batches to avoid large memory allocations
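
A sketch of how a caller might consume this API; `Projection`, `Predicate`, `Error`, and the exact `scan_stream`/`next_batch` signatures are illustrative assumptions, not the crate's confirmed surface:

```rust
use arrow::array::{Array, Int32Array};

// Illustrative only: `Table`, `Projection`, `Predicate`, and `Error` stand
// in for the real llkv-table types, whose signatures may differ.
fn sum_adult_ages(table: &Table) -> Result<i64, Error> {
    // Projection: only the "age" column is materialized.
    let projections = vec![Projection::named("age")];
    // Predicate pushdown: compiled once, evaluated inside the scan.
    let filter = Predicate::gt("age", 18);

    let mut stream = table.scan_stream(&projections, &filter)?;
    let mut total = 0i64;
    // Streaming: results arrive as fixed-size batches, never one big buffer.
    while let Some(batch) = stream.next_batch()? {
        let ages = batch.column(0).as_any().downcast_ref::<Int32Array>().unwrap();
        total += ages.iter().flatten().map(i64::from).sum::<i64>();
    }
    Ok(total)
}
```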

Sources: llkv-table/README.md:23-24 llkv-column-map/README.md:30-34

Schema Validation

Schema validation occurs at table creation and during every append operation. The table layer enforces:

| Validation Check | Rule |
| --- | --- |
| Column names | Must match the declared schema exactly (case-sensitive) |
| Data types | Must match the Arrow DataType, including nested types |
| Nullability | Enforced for non-nullable columns |
| Field count | Batches must contain exactly the declared columns |

Schema definitions are persisted in the system catalog (table 0) as TableMeta and ColMeta entries. The catalog stores:

  • Table ID and name
  • Column names, types, and nullability flags
  • Constraint metadata (e.g., PRIMARY KEY, NOT NULL)
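
A minimal sketch of such a check using Arrow's schema equality (a real implementation would produce more granular errors):

```rust
use arrow::datatypes::Schema;
use arrow::record_batch::RecordBatch;

/// Validate an incoming batch against the table's declared schema.
/// Arrow's Schema equality compares field names, data types (including
/// nested types), and nullability, which covers the checks above.
fn validate_batch(declared: &Schema, batch: &RecordBatch) -> Result<(), String> {
    if batch.schema().as_ref() != declared {
        return Err(format!(
            "schema mismatch: expected {declared:?}, got {:?}",
            batch.schema()
        ));
    }
    Ok(())
}
```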

Sources: llkv-table/README.md:14-15 llkv-table/README.md:27-29

System Catalog Integration

The SysCatalog is a special table (table ID 0) that stores metadata for all other tables:

graph TB
    subgraph "System Catalog (Table 0)"
        TABLEMETA["TableMeta Records\ntable_id, name, schema"]
        COLMETA["ColMeta Records\ntable_id, col_name, type"]
    end

    subgraph "User Tables (1..N)"
        TBL1["Table 1\nusers"]
        TBL2["Table 2\norders"]
        TBL3["Table N\nproducts"]
    end

    SYSCAT["SysCatalog API\ncreate_table()\nget_table_meta()\nlist_tables()"]

    SYSCAT -->|reads/writes| TABLEMETA
    SYSCAT -->|reads/writes| COLMETA

    TABLEMETA -->|describes| TBL1
    TABLEMETA -->|describes| TBL2
    TABLEMETA -->|describes| TBL3

    COLMETA -->|defines columns| TBL1
    COLMETA -->|defines columns| TBL2
    COLMETA -->|defines columns| TBL3

The system catalog itself uses the same storage infrastructure as user tables:

  • Stored as Arrow RecordBatches in the ColumnStore
  • Subject to MVCC versioning
  • Persisted through the pager for crash consistency

This self-hosting design ensures metadata operations follow the same transactional semantics as data operations.
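
A hedged sketch of what these catalog records might carry; the field lists follow the diagram above and are assumptions, not the crate's exact definitions:

```rust
// Illustrative shapes for catalog records; the real TableMeta/ColMeta
// definitions in llkv-table may differ.
struct TableMeta {
    table_id: u32,
    name: String,      // e.g. "users"
}

struct ColMeta {
    table_id: u32,     // owning table
    col_name: String,
    data_type: String, // serialized Arrow DataType
    nullable: bool,
    field_id: u32,     // becomes part of the LogicalFieldId
}
```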

Sources: llkv-table/README.md:27-29 llkv-column-map/README.md:10-16

Projection and Gathering

The table layer delegates projection and row gathering to the ColumnStore, which provides specialized APIs for materializing requested columns.

Projection Structure

A Projection describes a single column to retrieve, optionally renaming it in the output schema. Projections are resolved to LogicalFieldId by consulting the catalog, then passed to the ColumnStore for gathering.
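
A sketch of this shape, based only on the description above:

```rust
// Illustrative sketch; the packing behind LogicalFieldId is covered under
// "Logical Field Namespacing" below.
struct LogicalFieldId(u64);      // namespace + table id + field id

// A projection selects one column and may rename it in the output schema.
struct Projection {
    field_id: LogicalFieldId,    // resolved from a field name via the catalog
    alias: Option<String>,       // optional output name override
}
```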

Sources: llkv-column-map/store/projection.rs:49-73

Null Handling Policies

The projection system supports three null-handling strategies via GatherNullPolicy:

| Policy | Behavior |
| --- | --- |
| ErrorOnMissing | Any missing row_id causes an error |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Rows with all-null projected columns are omitted |

These policies enable different executor semantics: INNER JOIN uses ErrorOnMissing, LEFT JOIN uses IncludeNulls, and aggregation pipelines use DropNulls to skip tombstones.
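
The enum itself is small; a sketch matching the table above (doc comments are paraphrased, not the crate's own):

```rust
/// Null-handling strategy for gather operations.
enum GatherNullPolicy {
    /// Fail the gather if any requested row_id is missing.
    ErrorOnMissing,
    /// Emit null entries for missing rows, preserving input length.
    IncludeNulls,
    /// Drop rows whose projected columns are all null (e.g. tombstones).
    DropNulls,
}
```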

Sources: llkv-column-map/store/projection.rs:39-47

Multi-Column Gather Context

For queries that scan the same row set multiple times (e.g., joins, aggregations), the table layer provides MultiGatherContext to amortize fetch costs:

graph TB
    PREPARE["prepare_gather_context\n(field_ids)"]
    CATALOG["Load ColumnDescriptors\nfrom catalog"]
    METAS["Collect ChunkMetadata\nvalue + row chunks"]
    CTX["MultiGatherContext\nplans, cache, scratch"]
    GATHER1["gather_rows_with_reusable_context\n(row_ids_1)"]
    GATHER2["gather_rows_with_reusable_context\n(row_ids_2)"]
    GATHERN["gather_rows_with_reusable_context\n(row_ids_N)"]

    PREPARE --> CATALOG
    CATALOG --> METAS
    METAS --> CTX

    CTX -.->|reuses chunk cache| GATHER1
    CTX -.->|reuses chunk cache| GATHER2
    CTX -.->|reuses chunk cache| GATHERN

    GATHER1 --> BATCH1["RecordBatch 1"]
    GATHER2 --> BATCH2["RecordBatch 2"]
    GATHERN --> BATCHN["RecordBatch N"]

The context caches:

  • Chunk arrays: Deserialized Arrow arrays for reuse across calls
  • Row indices: Hash maps for sparse row lookups
  • Scratch buffers: Pre-allocated vectors for gather operations

This optimization is critical for nested loop joins and multi-pass aggregations where the same columns are accessed repeatedly.
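
A sketch of the intended call pattern; the method names follow the diagram above, but the signatures, `ColumnStore`, `LogicalFieldId`, and `Error` are stand-ins:

```rust
use arrow::record_batch::RecordBatch;

// Illustrative call pattern for repeated gathers over the same columns;
// the exact llkv-column-map signatures may differ.
fn gather_many(
    store: &ColumnStore,
    field_ids: &[LogicalFieldId],
    row_id_lists: &[Vec<u64>],
) -> Result<Vec<RecordBatch>, Error> {
    // One-time setup: load descriptors, chunk metadata, scratch buffers.
    let mut ctx = store.prepare_gather_context(field_ids)?;

    // Each gather reuses the context's chunk cache rather than re-fetching
    // and re-deserializing the same Arrow chunks from the pager.
    row_id_lists
        .iter()
        .map(|row_ids| store.gather_rows_with_reusable_context(&mut ctx, row_ids))
        .collect()
}
```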

Sources: llkv-column-map/store/projection.rs:94-227 llkv-column-map/store/projection.rs:448-510 llkv-column-map/store/projection.rs:516-758

Integration with ColumnStore

The table layer wraps a ColumnStore and translates high-level operations into low-level storage calls:

graph TB
    subgraph "Table Layer (llkv-table)"
        TABLE["Table\nSchema + Arc&lt;ColumnStore&gt;"]
        FIELDMAP["Field Name → LogicalFieldId\nNamespace mapping"]
    end

    subgraph "ColumnStore Layer (llkv-column-map)"
        COLSTORE["ColumnStore\nLogicalFieldId → PhysicalKey"]
        DESCRIPTOR["ColumnDescriptor\nChunk metadata lists"]
        CHUNKS["Column Chunks\nSerialized Arrow arrays"]
    end

    subgraph "Storage Layer (llkv-storage)"
        PAGER["Pager\nbatch_get/batch_put"]
        MEMPAGER["MemPager"]
        SIMDPAGER["SimdRDrivePager"]
    end

    TABLE -->|append batch| COLSTORE
    TABLE -->|scan_stream| COLSTORE
    TABLE -->|gather_rows field_ids| COLSTORE

    FIELDMAP -.->|resolves to| COLSTORE

    COLSTORE -->|maps to| DESCRIPTOR
    DESCRIPTOR -->|points to| CHUNKS

    COLSTORE -->|batch_get| PAGER
    COLSTORE -->|batch_put| PAGER

    PAGER -.->|impl| MEMPAGER
    PAGER -.->|impl| SIMDPAGER

Logical Field Namespacing

Each logical field in a table is assigned a LogicalFieldId that encodes:

  • Namespace: USER, TXN_METADATA, or ROWID_SHADOW
  • Table ID: u32 identifier
  • Field ID: u32 column index

This namespacing prevents collisions between user columns and MVCC metadata while allowing them to share the same physical ColumnStore instance.
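
One plausible bit-packing for such an identifier; the actual encoding in llkv-column-map may allocate bits differently:

```rust
// Illustrative 64-bit packing. Sketch layout (an assumption, not the
// crate's confirmed encoding): 8-bit namespace | 24-bit table | 32-bit field.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct LogicalFieldId(u64);

#[repr(u8)]
enum Namespace {
    User = 0,
    TxnMetadata = 1,
    RowIdShadow = 2,
}

impl LogicalFieldId {
    fn new(ns: Namespace, table_id: u32, field_id: u32) -> Self {
        debug_assert!(table_id < (1 << 24), "sketch limits table ids to 24 bits");
        Self(((ns as u64) << 56) | ((table_id as u64) << 32) | field_id as u64)
    }

    fn field_id(self) -> u32 {
        self.0 as u32 // low 32 bits hold the column index
    }
}
```

Because the namespace lives in the high bits, user columns and MVCC metadata for the same table map to disjoint key ranges in the shared ColumnStore.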

Sources: llkv-column-map/README.md:18-22 llkv-table/README.md:20-22

Zero-Copy Reads

The ColumnStore delegates to the Pager trait for physical storage access. When using SimdRDrivePager (persistent backend), reads are zero-copy: the pager returns EntryHandle wrappers that directly reference memory-mapped regions. This enables SIMD-accelerated scans without buffer allocation or copying.
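
A simplified sketch of the trait's shape as described here; the real llkv-storage trait has more methods and different associated types:

```rust
// Simplified sketch of the pager abstraction; details are assumptions.
trait Pager {
    /// Handle over a stored value. For the memory-mapped backend this can
    /// borrow directly from the mapped region, making reads zero-copy.
    type Handle: AsRef<[u8]>;

    /// Fetch many pages in one call.
    fn batch_get(&self, keys: &[u64]) -> Result<Vec<Self::Handle>, PagerError>;

    /// Persist many pages atomically in one call.
    fn batch_put(&self, entries: &[(u64, Vec<u8>)]) -> Result<(), PagerError>;
}

// Hypothetical error type for the sketch.
struct PagerError;
```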

Sources: llkv-storage/README.md:9-17 llkv-column-map/README.md:36-40

Usage in the Stack

The table abstraction is consumed by:

| Component | Usage |
| --- | --- |
| llkv-runtime | Executes all DML and DDL operations through Table APIs |
| llkv-executor | Relies on scan_stream for SELECT evaluation, joins, and aggregations |
| llkv-sql | Indirectly via llkv-runtime for SQL statement execution |
| llkv-csv | Uses Table::append for bulk CSV ingestion |

The streaming scan API (scan_stream) is particularly important for the executor, which processes query results in fixed-size batches to avoid buffering entire result sets in memory.

Sources: llkv-table/README.md:36-40 llkv-runtime/README.md:36-40 llkv-csv/README.md:14-20