Table Abstraction
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Table abstraction provides a schema-aware interface for data operations in the LLKV storage layer. It sits between query execution components and the columnar storage engine, where it validates schemas, injects MVCC metadata, and translates logical operations into physical column store interactions. This document details the Table struct and its APIs for appending data, scanning rows, and coordinating with the system catalog.
For information about the underlying columnar storage implementation, see Column Storage and ColumnStore. For details on the storage pager abstraction, see Pager Interface and SIMD Optimization. For catalog management APIs, see CatalogManager API and System Catalog and SysCatalog.
Overview
The llkv-table crate provides the primary interface between SQL execution and physical storage. Each Table instance represents a logical table with a defined schema and wraps a reference to a ColumnStore that handles the actual persistence. Tables are responsible for enforcing schema constraints, injecting MVCC metadata columns, and exposing streaming scan APIs that integrate with the query executor.
Sources: llkv-table/README.md:1-57 llkv-table/Cargo.toml:1-60
graph TB
subgraph "Query Execution Layer"
RUNTIME["Runtime\nStatement Executor"]
EXECUTOR["Executor\nQuery Evaluation"]
end
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema-aware API"]
SYSCAT["SysCatalog\nTable 0\nMetadata Store"]
STREAM["ColumnStream\nStreaming Scans"]
end
subgraph "Column Store Layer (llkv-column-map)"
COLSTORE["ColumnStore\nColumnar Storage"]
PROJECTION["Projection\nGather Logic"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager Trait\nBatch Get/Put"]
end
RUNTIME -->|CREATE TABLE INSERT UPDATE| TABLE
EXECUTOR -->|SELECT scan_stream| TABLE
TABLE -->|validate schema| TABLE
TABLE -->|inject MVCC cols| TABLE
TABLE -->|append RecordBatch| COLSTORE
TABLE -->|gather_rows| COLSTORE
SYSCAT -->|stores TableMeta ColMeta| COLSTORE
TABLE -->|scan_stream returns| STREAM
STREAM -->|fetch batches| COLSTORE
COLSTORE -->|uses| PROJECTION
PROJECTION -->|batch_get/put| PAGER
Table Structure and Core Responsibilities
A Table instance encapsulates a schema-validated view over a ColumnStore. The table layer is responsible for:
| Responsibility | Description |
|---|---|
| Schema Validation | Ensures all RecordBatch operations match the declared Arrow schema |
| MVCC Injection | Adds system columns (row_id, created_by, deleted_by) to all data |
| Catalog Coordination | Persists and retrieves table/column metadata via SysCatalog (table 0) |
| Data Routing | Translates logical field requests to LogicalFieldId for ColumnStore |
| Streaming Scans | Provides ColumnStream API for paginated, predicate-pushdown reads |
The table wraps an Arc<ColumnStore> from llkv-column-map, enabling multiple table instances to share the same underlying storage. This design supports efficient metadata queries and concurrent access patterns.
Sources: llkv-table/README.md:12-40 llkv-column-map/README.md:1-61
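As a shape sketch, this pairing can be pictured as a plain struct. Everything below is illustrative: only the schema-plus-Arc<ColumnStore> composition comes from this page, while the field names and unit-struct stand-ins are assumptions, not the crate's actual definitions.

```rust
use std::sync::Arc;

// Unit-struct stand-ins for the real types named above.
struct Schema;      // an Arrow schema in the real crate
struct ColumnStore; // from llkv-column-map

// Hypothetical layout: a Table pairs a declared schema with a shared
// handle to the columnar store.
struct Table {
    table_id: u32,
    schema: Arc<Schema>,
    store: Arc<ColumnStore>,
}

fn main() {
    let table = Table {
        table_id: 1,
        schema: Arc::new(Schema),
        store: Arc::new(ColumnStore),
    };
    // Cloning the Arc lets another Table share the same physical store.
    let _shared = Arc::clone(&table.store);
}
```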
MVCC Column Management
Every table in LLKV maintains three system columns alongside user-defined fields:
graph LR
subgraph "User Schema"
UC1["name: Utf8"]
UC2["age: Int32"]
UC3["email: Utf8"]
end
subgraph "System Columns (MVCC)"
ROW_ID["row_id: UInt64\nMonotonic identifier"]
CREATED["created_by: UInt64\nTransaction ID"]
DELETED["deleted_by: UInt64\nDeletion TXN or MAX"]
end
COLSTORE["ColumnStore\nLogicalFieldId\nNamespacing"]
UC1 -.->|stored in namespace USER| COLSTORE
UC2 -.-> COLSTORE
UC3 -.-> COLSTORE
ROW_ID -.->|namespace TXN_METADATA| COLSTORE
CREATED -.-> COLSTORE
DELETED -.-> COLSTORE
MVCC Column Semantics
- row_id: A monotonically increasing UInt64 that uniquely identifies each row within a table. Assigned during append operations and used for row-level operations and correlation.
- created_by: The transaction ID (UInt64) that created this row version. Set during INSERT or UPDATE operations.
- deleted_by: The transaction ID that marked this row as deleted, or u64::MAX if the row is still live. UPDATE operations logically delete old versions and insert new ones.
These columns are stored in separate logical namespaces within the ColumnStore to avoid collisions with user-defined columns. The table layer automatically injects these columns during append operations and uses them for visibility filtering during scans.
Sources: llkv-table/README.md:15-16 llkv-column-map/README.md:20-28
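These semantics reduce to a single visibility predicate per row version. The following sketch is a simplified model assuming monotonically increasing transaction IDs and a snapshot that sees every transaction up to its own ID; the actual snapshot logic belongs to the runtime layer, not this helper.

```rust
/// Sentinel in `deleted_by` for rows that are still live.
const LIVE: u64 = u64::MAX;

/// A row version is visible to a snapshot if it was created by a
/// transaction the snapshot can see and has not been deleted by one.
fn is_visible(created_by: u64, deleted_by: u64, snapshot_txn: u64) -> bool {
    created_by <= snapshot_txn && (deleted_by == LIVE || deleted_by > snapshot_txn)
}

fn main() {
    // Row created by txn 5, logically deleted by txn 9.
    assert!(is_visible(5, 9, 7));     // visible between creation and deletion
    assert!(!is_visible(5, 9, 10));   // deleted before this snapshot
    assert!(!is_visible(5, LIVE, 4)); // not yet created
    assert!(is_visible(5, LIVE, 5));  // visible from its creating txn onward
}
```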
Data Operations
Append Operations
The Table::append method accepts an Arrow RecordBatch and performs the following steps:
graph TB
START["Table::append(RecordBatch)"]
VALIDATE["Validate Schema\nCheck column names/types"]
INJECT["Inject MVCC Columns\nrow_id, created_by, deleted_by"]
NAMESPACE["Map to LogicalFieldId\nApply namespace prefixes"]
PERSIST["ColumnStore::append\nSort by row_id\nLast-writer-wins"]
COMMIT["Pager::batch_put\nAtomic commit"]
START --> VALIDATE
VALIDATE -->|schema mismatch| ERROR["Return Error"]
VALIDATE -->|valid| INJECT
INJECT --> NAMESPACE
NAMESPACE --> PERSIST
PERSIST --> COMMIT
COMMIT --> SUCCESS["Return Ok"]
The append pipeline ensures:
- Schema consistency: All incoming batches must match the table's declared schema
- MVCC tagging: System columns are added with appropriate transaction IDs
- Ordering: Rows are sorted by row_id before persistence for efficient scans
- Atomicity: Multi-column writes are committed atomically via batch pager operations
Sources: llkv-table/README.md:20-30 llkv-column-map/README.md:24-28
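Constructing an input batch uses the standard Arrow APIs. The sketch below builds a two-row batch for the user schema shown earlier; the append call itself appears only as a comment, because this page documents that Table::append accepts an Arrow RecordBatch but not its exact signature.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Declare only the user-visible columns; the table layer injects
    // row_id, created_by, and deleted_by itself during append.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int32, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["ada", "grace"])),
        Arc::new(Int32Array::from(vec![36, 45])),
    ];
    let batch = RecordBatch::try_new(schema, columns)?;

    // Hypothetical call, signature not spelled out on this page:
    // table.append(&batch)?;
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```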
Scan Operations
Tables expose streaming scan APIs through the scan_stream method, which returns a ColumnStream for paginated result retrieval:
graph TB
SCAN["Table::scan_stream\n(projections, filter)"]
NORMALIZE["Normalize Predicate\nApply De Morgan's laws"]
COMPILE["Compile to EvalProgram\nStack-based bytecode"]
STREAM["Create ColumnStream\nLazy evaluation"]
FETCH["ColumnStream::next_batch\nFetch N rows"]
FILTER["Apply Predicate\nVectorized evaluation"]
MVCC["MVCC Filtering\nSnapshot visibility"]
PROJECT["Gather Projected Cols\ngather_rows()"]
BATCH["Return RecordBatch"]
SCAN --> NORMALIZE
NORMALIZE --> COMPILE
COMPILE --> STREAM
STREAM -.->|caller iterates| FETCH
FETCH --> FILTER
FILTER --> MVCC
MVCC --> PROJECT
PROJECT --> BATCH
BATCH -.->|next iteration| FETCH
The scan path supports:
- Predicate pushdown: Filters are compiled to bytecode and evaluated at the column store level
- Projection: Only requested columns are materialized
- MVCC filtering: Rows are filtered based on transaction snapshot visibility rules
- Streaming: Results are produced in fixed-size batches to avoid large memory allocations
Sources: llkv-table/README.md:23-24 llkv-column-map/README.md:30-34
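The batch-at-a-time contract can be modelled with a small driver loop. In this sketch, ColumnStream and next_batch are named in the diagram above, but their signatures, the Option-based termination, and the fixed batch size are assumptions.

```rust
// Minimal stand-ins modelling the streaming contract: the stream yields
// fixed-size batches until the row set is exhausted.
struct RecordBatch {
    num_rows: usize,
}

struct ColumnStream {
    remaining: usize,
    batch_size: usize,
}

impl ColumnStream {
    fn next_batch(&mut self) -> Option<RecordBatch> {
        if self.remaining == 0 {
            return None;
        }
        let n = self.remaining.min(self.batch_size);
        self.remaining -= n;
        Some(RecordBatch { num_rows: n })
    }
}

fn main() {
    // e.g. obtained from table.scan_stream(projections, filter)
    let mut stream = ColumnStream { remaining: 10_000, batch_size: 4096 };
    let mut total = 0;
    while let Some(batch) = stream.next_batch() {
        total += batch.num_rows; // the executor consumes one batch at a time
    }
    assert_eq!(total, 10_000); // nothing ever buffered the full result set
}
```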
Schema Validation
Schema validation occurs at table creation and during every append operation. The table layer enforces:
| Validation Check | Enforcement Point |
|---|---|
| Column names | Must match declared schema exactly (case-sensitive) |
| Data types | Must match Arrow DataType including nested types |
| Nullability | Enforced for non-nullable columns |
| Field count | Batch must contain exactly the declared columns |
Schema definitions are persisted in the system catalog (table 0) as TableMeta and ColMeta entries. The catalog stores:
- Table ID and name
- Column names, types, and nullability flags
- Constraint metadata (e.g., PRIMARY KEY, NOT NULL)
Sources: llkv-table/README.md:14-15 llkv-table/README.md:27-29
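A rough picture of the two record kinds, with fields taken from the bullet list above; the concrete definitions in llkv-table may differ.

```rust
// Sketches of the catalog records named above, not the real definitions.
struct TableMeta {
    table_id: u32,
    name: String,
}

struct ColMeta {
    table_id: u32,
    col_name: String,
    data_type: String, // an Arrow DataType in the real crate
    nullable: bool,
    primary_key: bool, // constraint metadata, e.g. PRIMARY KEY / NOT NULL
}

fn main() {
    let t = TableMeta { table_id: 1, name: "users".into() };
    let c = ColMeta {
        table_id: t.table_id,
        col_name: "email".into(),
        data_type: "Utf8".into(),
        nullable: false,
        primary_key: false,
    };
    // ColMeta rows join to their TableMeta record by table_id.
    assert_eq!(c.table_id, t.table_id);
}
```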
System Catalog Integration
The SysCatalog is a special table (table ID 0) that stores metadata for all other tables:
graph TB
subgraph "System Catalog (Table 0)"
TABLEMETA["TableMeta Records\ntable_id, name, schema"]
COLMETA["ColMeta Records\ntable_id, col_name, type"]
end
subgraph "User Tables (1..N)"
TBL1["Table 1\nusers"]
TBL2["Table 2\norders"]
TBL3["Table N\nproducts"]
end
TABLEMETA -->|describes| TBL1
TABLEMETA -->|describes| TBL2
TABLEMETA -->|describes| TBL3
COLMETA -->|defines columns| TBL1
COLMETA -->|defines columns| TBL2
COLMETA -->|defines columns| TBL3
SYSCAT["SysCatalog API\ncreate_table()\nget_table_meta()\nlist_tables()"]
SYSCAT -->|reads/writes| TABLEMETA
SYSCAT -->|reads/writes| COLMETA
The system catalog itself uses the same storage infrastructure as user tables:
- Stored as Arrow RecordBatches in the ColumnStore
- Subject to MVCC versioning
- Persisted through the pager for crash consistency
This self-hosting design ensures metadata operations follow the same transactional semantics as data operations.
Sources: llkv-table/README.md:27-29 llkv-column-map/README.md:10-16
Projection and Gathering
The table layer delegates projection and row gathering to the ColumnStore, which provides specialized APIs for materializing requested columns:
Projection Structure
A Projection describes a single column to retrieve, optionally renaming it in the output schema. Projections are resolved to LogicalFieldId by consulting the catalog, then passed to the ColumnStore for gathering.
Sources: llkv-column-map/src/store/projection.rs:49-73
Null Handling Policies
The projection system supports three null-handling strategies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Any missing row_id causes an error |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Rows with all-null projected columns are omitted |
These policies enable different executor semantics: INNER JOIN uses ErrorOnMissing, LEFT JOIN uses IncludeNulls, and aggregation pipelines use DropNulls to skip tombstones.
Sources: llkv-column-map/src/store/projection.rs:39-47
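The variant names below are exactly those in the table above; the plan-kind mapping helper is an illustrative assumption that mirrors the executor semantics just described, not an llkv API.

```rust
// The three documented policies.
#[derive(Debug, PartialEq)]
enum GatherNullPolicy {
    ErrorOnMissing, // absent row_ids are a logic error
    IncludeNulls,   // absent rows surface as nulls
    DropNulls,      // all-null rows are omitted
}

// Hypothetical plan kinds, used only to show the mapping from the text.
enum PlanKind {
    InnerJoin,
    LeftJoin,
    Aggregation,
}

fn policy_for(kind: PlanKind) -> GatherNullPolicy {
    match kind {
        PlanKind::InnerJoin => GatherNullPolicy::ErrorOnMissing,
        PlanKind::LeftJoin => GatherNullPolicy::IncludeNulls,
        PlanKind::Aggregation => GatherNullPolicy::DropNulls, // skip tombstones
    }
}

fn main() {
    assert_eq!(policy_for(PlanKind::LeftJoin), GatherNullPolicy::IncludeNulls);
}
```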
Multi-Column Gather Context
For queries that scan the same row set multiple times (e.g., joins, aggregations), the table layer provides MultiGatherContext to amortize fetch costs:
graph TB
PREPARE["prepare_gather_context\n(field_ids)"]
CATALOG["Load ColumnDescriptors\nfrom catalog"]
METAS["Collect ChunkMetadata\nvalue + row chunks"]
CTX["MultiGatherContext\nplans, cache, scratch"]
GATHER1["gather_rows_with_reusable_context\n(row_ids_1)"]
GATHER2["gather_rows_with_reusable_context\n(row_ids_2)"]
GATHERN["gather_rows_with_reusable_context\n(row_ids_N)"]
PREPARE --> CATALOG
CATALOG --> METAS
METAS --> CTX
CTX -.->|reuses chunk cache| GATHER1
CTX -.->|reuses chunk cache| GATHER2
CTX -.->|reuses chunk cache| GATHERN
GATHER1 --> BATCH1["RecordBatch 1"]
GATHER2 --> BATCH2["RecordBatch 2"]
GATHERN --> BATCHN["RecordBatch N"]
The context caches:
- Chunk arrays: Deserialized Arrow arrays for reuse across calls
- Row indices: Hash maps for sparse row lookups
- Scratch buffers: Pre-allocated vectors for gather operations
This optimization is critical for nested loop joins and multi-pass aggregations where the same columns are accessed repeatedly.
Sources: llkv-column-map/src/store/projection.rs:94-227 llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:516-758
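The prepare-once, gather-many pattern looks roughly like this. The two method names come from the cited projection.rs, while every type and signature below is a simplified stand-in rather than the real API.

```rust
use std::collections::HashMap;

struct RecordBatch;

// Caches chunk arrays, row indices, and scratch buffers across calls.
struct MultiGatherContext {
    chunk_cache: HashMap<u64, Vec<u8>>, // deserialized chunks, keyed by chunk id
    scratch: Vec<u64>,                  // reusable gather buffer
}

struct ColumnStore;

impl ColumnStore {
    fn prepare_gather_context(&self, _field_ids: &[u64]) -> MultiGatherContext {
        // Loads descriptors and chunk metadata once, up front.
        MultiGatherContext { chunk_cache: HashMap::new(), scratch: Vec::new() }
    }

    fn gather_rows_with_reusable_context(
        &self,
        ctx: &mut MultiGatherContext,
        row_ids: &[u64],
    ) -> RecordBatch {
        // Reuses ctx.chunk_cache and ctx.scratch instead of refetching.
        ctx.scratch.clear();
        ctx.scratch.extend_from_slice(row_ids);
        RecordBatch
    }
}

fn main() {
    let store = ColumnStore;
    let mut ctx = store.prepare_gather_context(&[1, 2]);
    // A nested loop join probes the same columns repeatedly:
    for probe in [[10_u64, 11], [42, 43]] {
        let _batch = store.gather_rows_with_reusable_context(&mut ctx, &probe);
    }
}
```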
Integration with ColumnStore
The table layer wraps a ColumnStore and translates high-level operations into low-level storage calls:
graph TB
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema + Arc<ColumnStore>"]
FIELDMAP["Field Name → LogicalFieldId\nNamespace mapping"]
end
subgraph "ColumnStore Layer (llkv-column-map)"
COLSTORE["ColumnStore\nLogicalFieldId → PhysicalKey"]
DESCRIPTOR["ColumnDescriptor\nChunk metadata lists"]
CHUNKS["Column Chunks\nSerialized Arrow arrays"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager\nbatch_get/batch_put"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
TABLE -->|append batch| COLSTORE
TABLE -->|scan_stream| COLSTORE
TABLE -->|gather_rows field_ids| COLSTORE
FIELDMAP -.->|resolves to| COLSTORE
COLSTORE -->|maps to| DESCRIPTOR
DESCRIPTOR -->|points to| CHUNKS
COLSTORE -->|batch_get| PAGER
COLSTORE -->|batch_put| PAGER
PAGER -.->|impl| MEMPAGER
PAGER -.->|impl| SIMDPAGER
Logical Field Namespacing
Each logical field in a table is assigned a LogicalFieldId that encodes:
- Namespace: USER, TXN_METADATA, or ROWID_SHADOW
- Table ID: u32 identifier
- Field ID: u32 column index
This namespacing prevents collisions between user columns and MVCC metadata while allowing them to share the same physical ColumnStore instance.
Sources: llkv-column-map/README.md:18-22 llkv-table/README.md:20-22
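One hypothetical way to encode the three components into a single integer key is straightforward bit packing. The actual layout lives in llkv-column-map and may differ; this sketch only illustrates why namespaced IDs cannot collide.

```rust
// Hypothetical packing, not the crate's real bit layout: namespace in the
// high bits, then table id, then field id.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Namespace {
    User = 0,
    TxnMetadata = 1,
    RowIdShadow = 2,
}

fn pack(ns: Namespace, table_id: u32, field_id: u32) -> u128 {
    ((ns as u128) << 64) | ((table_id as u128) << 32) | field_id as u128
}

fn main() {
    // The same table id and field id never collide across namespaces.
    assert_ne!(
        pack(Namespace::User, 1, 0),
        pack(Namespace::TxnMetadata, 1, 0)
    );
}
```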
Zero-Copy Reads
The ColumnStore delegates to the Pager trait for physical storage access. When using SimdRDrivePager (persistent backend), reads are zero-copy: the pager returns EntryHandle wrappers that directly reference memory-mapped regions. This enables SIMD-accelerated scans without buffer allocation or copying.
Sources: llkv-storage/README.md:9-17 llkv-column-map/README.md:36-40
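A toy rendering of the batch-oriented contract, in the spirit of MemPager. The trait name and the batch_get/batch_put methods are named on this page; the signatures, EntryHandle's body, and the ToyPager type are assumptions, and the real persistent backend returns memory-mapped handles rather than owned buffers.

```rust
use std::collections::HashMap;

type PhysicalKey = u64;

// Stand-in for the zero-copy handle; SimdRDrivePager would wrap a
// memory-mapped region here instead of owning bytes.
struct EntryHandle(Vec<u8>);

trait Pager {
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<EntryHandle>>;
    fn batch_put(&mut self, entries: &[(PhysicalKey, Vec<u8>)]);
}

// Illustrative in-memory implementation.
struct ToyPager {
    pages: HashMap<PhysicalKey, Vec<u8>>,
}

impl Pager for ToyPager {
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<EntryHandle>> {
        keys.iter()
            .map(|k| self.pages.get(k).map(|v| EntryHandle(v.clone())))
            .collect()
    }
    fn batch_put(&mut self, entries: &[(PhysicalKey, Vec<u8>)]) {
        // The real pager commits the whole batch atomically.
        for (k, v) in entries {
            self.pages.insert(*k, v.clone());
        }
    }
}

fn main() {
    let mut pager = ToyPager { pages: HashMap::new() };
    pager.batch_put(&[(7, b"chunk bytes".to_vec())]);
    let got = pager.batch_get(&[7, 8]);
    assert_eq!(got[0].as_ref().map(|h| h.0.len()), Some(11));
    assert!(got[1].is_none());
}
```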
Usage in the Stack
The table abstraction is consumed by:
| Component | Usage |
|---|---|
| llkv-runtime | Executes all DML and DDL operations through Table APIs |
| llkv-executor | Relies on scan_stream for SELECT evaluation, joins, and aggregations |
| llkv-sql | Indirectly via llkv-runtime for SQL statement execution |
| llkv-csv | Uses Table::append for bulk CSV ingestion |
The streaming scan API (scan_stream) is particularly important for the executor, which processes query results in fixed-size batches to avoid buffering entire result sets in memory.
Sources: llkv-table/README.md:36-40 llkv-runtime/README.md:36-40 llkv-csv/README.md:14-20