This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Relevant source files
- Cargo.lock
- Cargo.toml
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
This document introduces the LLKV database system, its architectural principles, and the relationships between its constituent crates. It provides a high-level map of how SQL queries flow through the system from parsing to storage, and explains the role of Apache Arrow as the universal data interchange format.
For details on individual subsystems, see:
- Workspace organization and crate dependencies: Workspace and Crates
- SQL query processing pipeline: SQL Query Processing Pipeline
- Arrow integration details: Data Formats and Arrow Integration
What is LLKV
LLKV is an experimental SQL database implemented as a Rust workspace of 15 crates. It layers SQL processing, streaming query execution, and MVCC transaction management on top of pluggable key-value storage backends. The system uses Apache Arrow RecordBatch as its primary data representation at every layer, enabling zero-copy operations and SIMD-friendly columnar processing.
The architecture separates concerns into six distinct layers:
- SQL Interface
- Query Planning
- Runtime and Orchestration
- Query Execution
- Table and Metadata Management
- Storage and I/O
Each layer communicates through well-defined interfaces centered on Arrow data structures.
Sources: README.md:1-107 Cargo.toml:1-89
Core Design Principles
LLKV's design reflects several intentional trade-offs:
| Principle | Implementation | Rationale |
|---|---|---|
| Arrow-Native | RecordBatch is the universal data format across all layers | Enables zero-copy operations, SIMD vectorization, and interoperability with the Arrow ecosystem |
| Synchronous Execution | Work-stealing via Rayon instead of async runtime | Reduces scheduler overhead for individual queries while remaining embeddable in Tokio contexts |
| Layered Modularity | 15 independent crates with clear boundaries | Allows independent evolution and testing of subsystems |
| MVCC Throughout | System metadata columns (row_id, created_by, deleted_by) injected at storage layer | Provides snapshot isolation without write locks |
| Storage Abstraction | Pager trait with multiple implementations | Supports both in-memory and persistent backends with zero-copy reads |
| Compiled Predicates | Expressions compile to stack-based bytecode | Enables efficient vectorized evaluation without interpretation overhead |
Sources: README.md:36-42 llkv-storage/README.md:12-22 llkv-expr/README.md:66-72
Workspace Structure
The LLKV workspace consists of 15 crates organized by layer:
| Layer | Crate | Primary Responsibility |
|---|---|---|
| SQL Interface | llkv-sql | SQL parsing, dialect normalization, INSERT buffering |
| Query Planning | llkv-plan | Typed query plan structures (SelectPlan, InsertPlan, etc.) |
| Query Planning | llkv-expr | Expression AST (Expr, ScalarExpr) |
| Runtime | llkv-runtime | Session management, MVCC orchestration, plan execution |
| Runtime | llkv-transaction | Transaction ID allocation, snapshot management |
| Execution | llkv-executor | Streaming query evaluation |
| Execution | llkv-aggregate | Aggregate function implementation (SUM, COUNT, AVG, etc.) |
| Execution | llkv-join | Join algorithms (hash join with specialized fast paths) |
| Table/Metadata | llkv-table | Schema-aware table abstraction, system catalog |
| Table/Metadata | llkv-column-map | Column-oriented storage, logical-to-physical key mapping |
| Storage | llkv-storage | Pager trait, MemPager, SimdRDrivePager |
| Utilities | llkv-csv | CSV ingestion helper |
| Utilities | llkv-result | Result type definitions |
| Utilities | llkv-test-utils | Testing utilities |
| Utilities | llkv-slt-tester | SQL Logic Test harness |
Sources: Cargo.toml:9-26 Cargo.toml:67-87 README.md:44-53
Component Architecture and Data Flow
The following diagram shows the major components and how Arrow RecordBatch flows through the system:
Sources: README.md:44-72 Cargo.toml:67-87
graph TB
User["User / Application"]
subgraph "llkv-sql Crate"
SqlEngine["SqlEngine"]
Preprocessor["SQL Preprocessor"]
Parser["sqlparser"]
InsertBuffer["InsertBuffer"]
end
subgraph "llkv-plan Crate"
SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
CreateTablePlan["CreateTablePlan"]
OtherPlans["Other Plan Types"]
end
subgraph "llkv-expr Crate"
Expr["Expr<F>"]
ScalarExpr["ScalarExpr<F>"]
end
subgraph "llkv-runtime Crate"
RuntimeContext["RuntimeContext"]
SessionHandle["SessionHandle"]
TxnSnapshot["TransactionSnapshot"]
end
subgraph "llkv-executor Crate"
TableExecutor["TableExecutor"]
StreamingOps["Streaming Operators"]
end
subgraph "llkv-table Crate"
Table["Table"]
SysCatalog["SysCatalog (Table 0)"]
FieldId["FieldId Resolution"]
end
subgraph "llkv-column-map Crate"
ColumnStore["ColumnStore"]
LogicalFieldId["LogicalFieldId"]
PhysicalKey["PhysicalKey Mapping"]
end
subgraph "llkv-storage Crate"
Pager["Pager Trait"]
MemPager["MemPager"]
SimdPager["SimdRDrivePager"]
end
ArrowBatch["Arrow RecordBatch\n(Universal Format)"]
User -->|SQL String| SqlEngine
SqlEngine --> Preprocessor
Preprocessor --> Parser
Parser -->|AST| SelectPlan
Parser -->|AST| InsertPlan
Parser -->|AST| CreateTablePlan
SelectPlan --> Expr
InsertPlan --> ScalarExpr
SelectPlan --> RuntimeContext
InsertPlan --> RuntimeContext
CreateTablePlan --> RuntimeContext
RuntimeContext --> SessionHandle
RuntimeContext --> TxnSnapshot
RuntimeContext --> TableExecutor
RuntimeContext --> Table
TableExecutor --> StreamingOps
StreamingOps --> Table
Table --> SysCatalog
Table --> FieldId
Table --> ColumnStore
ColumnStore --> LogicalFieldId
ColumnStore --> PhysicalKey
ColumnStore --> Pager
Pager --> MemPager
Pager --> SimdPager
Table -.->|Produces/Consumes| ArrowBatch
StreamingOps -.->|Produces/Consumes| ArrowBatch
ColumnStore -.->|Serializes/Deserializes| ArrowBatch
SqlEngine -.->|Returns| ArrowBatch
End-to-End Query Execution
This diagram traces a SELECT query from SQL text to results, showing the concrete code entities involved:
Sources: README.md:56-63 llkv-sql/README.md:1-107 llkv-runtime/README.md:33-41 llkv-table/README.md:10-25
sequenceDiagram
participant App as "Application"
participant SqlEngine as "SqlEngine::execute()"
participant Preprocessor as "preprocess_sql()"
participant Parser as "sqlparser::Parser"
participant Planner as "build_select_plan()"
participant Runtime as "RuntimeContext::execute_plan()"
participant Executor as "TableExecutor::execute()"
participant Table as "Table::scan_stream()"
participant ColStore as "ColumnStore::gather_columns()"
participant Pager as "Pager::batch_get()"
App->>SqlEngine: SELECT * FROM users WHERE age > 18
Note over SqlEngine,Preprocessor: Dialect normalization
SqlEngine->>Preprocessor: Normalize SQLite/DuckDB syntax
SqlEngine->>Parser: Parse normalized SQL
Parser-->>SqlEngine: Statement AST
SqlEngine->>Planner: Translate AST to SelectPlan
Note over Planner: Build SelectPlan with\nExpr<String> predicates
Planner-->>SqlEngine: SelectPlan
SqlEngine->>Runtime: execute_plan(SelectPlan)
Note over Runtime: Acquire TransactionSnapshot\nResolve field names to FieldId
Runtime->>Executor: execute(SelectPlan, context)
Note over Executor: Compile Expr<FieldId>\ninto EvalProgram
Executor->>Table: scan_stream(fields, predicate)
Note over Table: Apply MVCC filtering\nPush down predicates
Table->>ColStore: gather_columns(LogicalFieldId[])
Note over ColStore: Map LogicalFieldId\nto PhysicalKey
ColStore->>Pager: batch_get(PhysicalKey[])
Pager-->>ColStore: EntryHandle[] (zero-copy)
Note over ColStore: Deserialize Arrow buffers\nApply row_id filtering
ColStore-->>Table: RecordBatch
Table-->>Executor: RecordBatch
Note over Executor: Apply projections\nEvaluate expressions
Executor-->>Runtime: RecordBatch stream
Runtime-->>SqlEngine: Vec<RecordBatch>
SqlEngine-->>App: Query results
Key Features
MVCC Transaction Management
LLKV implements multi-version concurrency control with snapshot isolation:
- Every table includes three system columns: `row_id` (monotonic), `created_by` (transaction ID), and `deleted_by` (transaction ID or NULL)
- `TxnIdManager` in `llkv-transaction` allocates monotonic transaction IDs and tracks commit watermarks
- `TransactionSnapshot` captures a consistent view of the database at transaction start
- Auto-commit statements use `TXN_ID_AUTO_COMMIT = 1`
- Explicit transactions maintain both persistent and staging contexts for isolation
Sources: README.md:64-72 llkv-runtime/README.md:20-32 llkv-table/README.md:32-35
Zero-Copy Storage Pipeline
The storage layer supports zero-copy reads when backed by SimdRDrivePager:
- `ColumnStore` maps `LogicalFieldId` to `PhysicalKey`
- `Pager::batch_get()` returns `EntryHandle` wrappers around memory-mapped regions
- Arrow arrays are deserialized directly from the mapped memory without intermediate copies
- SIMD-aligned buffers enable vectorized predicate evaluation
Sources: llkv-column-map/README.md:19-41 llkv-storage/README.md:12-28 README.md:12-13
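The contract behind this pipeline can be sketched as follows. This is an illustrative simplification, not the actual llkv-storage trait; the real method signatures, error handling, and handle type differ.

```rust
// Illustrative sketch only -- not the real llkv-storage API.
// Physical keys address serialized Arrow chunks; handles expose raw bytes
// that Arrow buffers can reference without copying.
type PhysicalKey = u64;

trait Pager {
    /// Handle over stored bytes (e.g. a memory-mapped region).
    type Handle: AsRef<[u8]>;

    /// Fetch many chunks in one call; missing keys yield `None`.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Self::Handle>>;

    /// Atomically persist a batch of (key, bytes) pairs.
    fn batch_put(&mut self, entries: &[(PhysicalKey, Vec<u8>)]);
}
```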
Compiled Expression Evaluation
Predicates and scalar expressions compile to stack-based bytecode:
- `Expr<FieldId>` structures in `llkv-expr` represent logical predicates
- `ProgramCompiler` in `llkv-table` translates expressions into `EvalProgram` bytecode
- `DomainProgram` tracks which row IDs satisfy predicates
- Bytecode evaluation uses stack-based execution for efficient vectorized operations
Sources: llkv-expr/README.md:1-88 llkv-table/README.md:10-18 README.md:46-53
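As a generic illustration of the stack-based approach (not LLKV's actual EvalProgram or opcode set), a predicate such as age > 18 AND score > 50 can compile to a flat opcode sequence that evaluates without re-walking an expression tree:

```rust
// Generic stack-machine sketch; LLKV's real EvalProgram, opcodes, and
// vectorized evaluation over Arrow arrays are more involved.
enum Op {
    PushField(usize), // push a row field by index
    PushConst(i64),   // push an integer constant
    Gt,               // pop b, pop a, push (a > b)
    And,              // pop b, pop a, push (a AND b)
}

fn eval(program: &[Op], row: &[i64]) -> bool {
    let mut stack: Vec<i64> = Vec::new();
    for op in program {
        match op {
            Op::PushField(i) => stack.push(row[*i]),
            Op::PushConst(v) => stack.push(*v),
            Op::Gt => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push((a > b) as i64);
            }
            Op::And => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push((a != 0 && b != 0) as i64);
            }
        }
    }
    stack.pop().unwrap_or(0) != 0
}
```

This sketch touches one row at a time; a vectorized variant applies each opcode to whole Arrow arrays instead.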
SQL Logic Test Infrastructure
LLKV includes comprehensive SQL correctness testing:
- `llkv-slt-tester` wraps the `sqllogictest` framework
- `LlkvSltRunner` discovers `.slt` files and executes test suites
- Supports remote test fetching via `.slturl` pointer files
- The `LLKV_SLT_STATS=1` environment variable enables detailed query statistics
- CI runs the full suite on Linux, macOS, and Windows
Sources: README.md:75-77 llkv-slt-tester/README.md:1-57
Getting Started
The main entry point is the llkv crate, which re-exports the SQL interface:
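A minimal embedding sketch is shown below. The constructor name and error handling are assumptions made for illustration (the README's snippet is authoritative); `execute()`, `sql()`, and the pager names are those used elsewhere in this documentation.

```rust
use llkv::SqlEngine;

// Sketch only: SqlEngine::new_in_memory() and the error type are assumed,
// not verified against the actual llkv API.
fn demo() -> Result<(), Box<dyn std::error::Error>> {
    // In-memory engine backed by MemPager; use SimdRDrivePager for persistence.
    let engine = SqlEngine::new_in_memory();

    engine.execute("CREATE TABLE users (id INT, age INT)")?;
    engine.execute("INSERT INTO users VALUES (1, 42), (2, 7)")?;

    // The sql() convenience method returns Vec<RecordBatch>.
    let batches = engine.sql("SELECT id FROM users WHERE age > 18")?;
    println!("result batches: {}", batches.len());
    Ok(())
}
```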
For persistent storage, use SimdRDrivePager instead of MemPager. For transaction control beyond auto-commit, obtain a SessionHandle via SqlEngine::session().
Sources: README.md:14-33 demos/llkv-sql-pong-demo/src/main.rs:386-393
Relationship to Related Projects
LLKV shares architectural concepts with Apache DataFusion but differs in several key areas:
| Aspect | LLKV | DataFusion |
|---|---|---|
| Execution Model | Synchronous with Rayon work-stealing | Async with Tokio runtime |
| Storage Backend | Custom key-value via Pager trait | Parquet, CSV, object stores |
| SQL Parser | sqlparser crate (same) | sqlparser crate |
| Data Format | Arrow RecordBatch (same) | Arrow RecordBatch |
| Maturity | Alpha / Experimental | Production-ready |
| Transaction Support | MVCC snapshot isolation | Read-only (no writes) |
LLKV deliberately avoids the DataFusion task scheduler to explore trade-offs in a synchronous execution model, while maintaining compatibility with the same SQL parser and Arrow memory layout.
Sources: README.md:36-42 README.md:8-13
Architecture
Relevant source files
- Cargo.lock
- Cargo.toml
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
This document describes the overall system architecture of LLKV, explaining its layered design, core abstractions, and how components interact to provide SQL functionality over key-value storage. For details on individual crates and their dependencies, see Workspace and Crates. For the end-to-end query execution flow, see SQL Query Processing Pipeline. For Arrow integration specifics, see Data Formats and Arrow Integration.
Layered Design
LLKV is organized into six architectural layers, each with focused responsibilities. Higher layers depend only on lower layers, and all layers communicate through Apache Arrow RecordBatch structures as the universal data interchange format.
Sources: Cargo.toml:1-89 README.md:44-53 llkv-sql/README.md:1-10 llkv-runtime/README.md:1-10 llkv-executor/README.md:1-10 llkv-table/README.md:1-18 llkv-storage/README.md:1-17
graph TB
subgraph L1["Layer 1: User Interface"]
SQL["SQL Queries"]
REPL["CLI REPL"]
DEMO["Demo Applications"]
BENCH["TPC-H Benchmarks"]
end
subgraph L2["Layer 2: SQL Processing"]
SQLENG["SqlEngine\nllkv-sql"]
PLAN["Query Plans\nllkv-plan"]
EXPR["Expression AST\nllkv-expr"]
end
subgraph L3["Layer 3: Runtime & Orchestration"]
RUNTIME["RuntimeContext\nllkv-runtime"]
TXNMGR["TxnIdManager\nllkv-transaction"]
CATALOG["CatalogManager\nllkv-runtime"]
end
subgraph L4["Layer 4: Query Execution"]
EXECUTOR["TableExecutor\nllkv-executor"]
AGG["Accumulators\nllkv-aggregate"]
JOIN["HashJoinExecutor\nllkv-join"]
end
subgraph L5["Layer 5: Data Management"]
TABLE["Table\nllkv-table"]
COLMAP["ColumnStore\nllkv-column-map"]
SYSCAT["SysCatalog\nllkv-table"]
end
subgraph L6["Layer 6: Storage"]
PAGER["Pager trait\nllkv-storage"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
SQL --> SQLENG
REPL --> SQLENG
DEMO --> SQLENG
BENCH --> SQLENG
SQLENG --> PLAN
SQLENG --> EXPR
PLAN --> RUNTIME
RUNTIME --> TXNMGR
RUNTIME --> CATALOG
RUNTIME --> EXECUTOR
RUNTIME --> TABLE
EXECUTOR --> AGG
EXECUTOR --> JOIN
EXECUTOR --> TABLE
TABLE --> COLMAP
TABLE --> SYSCAT
COLMAP --> PAGER
SYSCAT --> COLMAP
PAGER --> MEMPAGER
PAGER --> SIMDPAGER
Core Architectural Principles
Arrow-Native Data Flow
All data flowing between components is represented as Apache Arrow RecordBatch structures. This enables:
- Zero-copy operations : Arrow buffers can be passed between layers without serialization
- SIMD-friendly processing : Columnar layout supports vectorized operations
- Consistent memory model : All layers use the same in-memory representation
The RecordBatch abstraction appears at every boundary: SQL parsing produces plans that operate on batches, the executor streams batches, tables persist batches, and the column store chunks batches for storage.
Sources: README.md:10-12 README.md:22-23 llkv-table/README.md:10-11 llkv-column-map/README.md:10-14
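For reference, the RecordBatch currency that crosses these boundaries is built with the stock arrow-rs API; nothing LLKV-specific is involved:

```rust
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build a two-column batch; every LLKV layer exchanges data in this shape.
fn example_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let ids = Int32Array::from(vec![1, 2, 3]);
    let names = StringArray::from(vec!["a", "b", "c"]);
    RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(names)])
}
```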
Storage Abstraction Through Pager Trait
The Pager trait in llkv-storage provides a pluggable storage backend interface:
| Pager Type | Use Case | Key Properties |
|---|---|---|
| MemPager | Tests, temporary namespaces, staging contexts | Heap-backed, fast |
| SimdRDrivePager | Persistent storage | Zero-copy reads, SIMD-aligned, memory-mapped |
Both implementations satisfy the same batch get/put contract, allowing higher layers to remain storage-agnostic. The runtime uses dual-pager contexts: persistent storage for committed tables and in-memory staging for uncommitted transaction objects.
Sources: llkv-storage/README.md:12-22 llkv-runtime/README.md:26-32
MVCC Integration
Multi-version concurrency control (MVCC) is implemented as system metadata columns injected at the table layer:
- `row_id`: Monotonic row identifier
- `created_by`: Transaction ID that created this row version
- `deleted_by`: Transaction ID that deleted this row (or `NULL` if active)
These columns are stored alongside user data in ColumnStore, enabling snapshot isolation without separate version chains. The TxnIdManager in llkv-transaction allocates monotonic transaction IDs and tracks commit watermarks. The runtime enforces visibility rules during scans by filtering based on snapshot transaction IDs.
Sources: llkv-table/README.md:13-17 llkv-runtime/README.md:19-25 llkv-column-map/README.md:27-28
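The visibility rule applied during scans amounts to the standard snapshot check sketched below; the actual runtime also consults commit watermarks and the transaction's own uncommitted writes, so treat this as a simplification:

```rust
type TxnId = u64;

/// Simplified snapshot-visibility check over the MVCC metadata columns.
/// A row version is visible when it was created by a transaction the snapshot
/// can see and has not been deleted from the snapshot's point of view.
fn row_visible(created_by: TxnId, deleted_by: Option<TxnId>, snapshot: TxnId) -> bool {
    created_by <= snapshot && deleted_by.map_or(true, |d| d > snapshot)
}
```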
Component Interaction Patterns
Query Execution Flow
Sources: README.md:56-62 llkv-sql/README.md:15-20 llkv-runtime/README.md:12-17 llkv-executor/README.md:12-17 llkv-table/README.md:19-25 llkv-column-map/README.md:24-28
Dual-Context Transaction Management
The runtime maintains two execution contexts during explicit transactions. The persistent context operates on committed tables directly, while the staging context buffers newly created tables in memory. On commit, staged operations are replayed into the persistent context after the TxnIdManager confirms no conflicts and advances the commit watermark. On rollback, the staging context is dropped and all uncommitted work is discarded.
Sources: llkv-runtime/README.md:26-32 llkv-runtime/README.md:12-17
Column Storage and Logical Field Mapping
The ColumnStore maintains a mapping from LogicalFieldId (namespace + table ID + field ID) to physical storage keys. Each logical field has a descriptor chunk (metadata about the column), data chunks (Arrow-serialized column arrays), and row ID chunks (per-chunk row identifiers for filtering). This three-level mapping isolates user data from system metadata while allowing efficient scans and appends.
Sources: llkv-column-map/README.md:18-23 llkv-table/README.md:13-17 llkv-column-map/README.md:10-17
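Conceptually the mapping looks like the sketch below; the field names and layout are illustrative, and the real LogicalFieldId in llkv-column-map encodes namespace, table, and field differently:

```rust
use std::collections::HashMap;

// Illustrative shapes only -- not the real llkv-column-map definitions.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct LogicalFieldId {
    namespace: u16, // separates user data from system metadata
    table_id: u32,
    field_id: u32,
}

type PhysicalKey = u64;

/// Physical layout of one logical column: descriptor, data chunks, row-id chunks.
struct ColumnKeys {
    descriptor: PhysicalKey,
    data_chunks: Vec<PhysicalKey>,
    row_id_chunks: Vec<PhysicalKey>,
}

/// The store keeps a catalog from each logical field to its physical keys.
type ColumnCatalog = HashMap<LogicalFieldId, ColumnKeys>;
```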
Key Abstractions
SqlEngine
Entry point for SQL execution. Located in llkv-sql, it:
- Preprocesses SQL for dialect compatibility (DuckDB, SQLite quirks)
- Parses with the `sqlparser` crate
- Batches compatible `INSERT` statements
- Delegates execution to `RuntimeContext`
- Returns `ExecutionResult` enums
Sources: llkv-sql/README.md:1-20 README.md:56-59
RuntimeContext
Orchestration layer in llkv-runtime that:
- Executes all statement types (DDL, DML, queries)
- Manages transaction snapshots and MVCC injection
- Coordinates between table layer and executor
- Maintains catalog manager for schema metadata
- Implements dual-context staging for transactions
Sources: llkv-runtime/README.md:12-25 llkv-runtime/README.md:34-40
Table and ColumnStore
Table in llkv-table provides schema-aware APIs:
- Schema validation on `CREATE TABLE` and append
- MVCC column injection (`row_id`, `created_by`, `deleted_by`)
- Streaming scan API with predicate pushdown
- Integration with the system catalog (table 0)
ColumnStore in llkv-column-map handles physical storage:
- Arrow-serialized column chunks
- Logical-to-physical key mapping
- Append pipeline with row-id sorting and last-writer-wins semantics
- Atomic multi-key commits through pager
Sources: llkv-table/README.md:12-25 llkv-column-map/README.md:12-28
TableExecutor
Execution engine in llkv-executor that:
- Streams `RecordBatch` results from table scans
- Evaluates projections, filters, and scalar expressions
- Coordinates with `llkv-aggregate` for aggregation
- Coordinates with `llkv-join` for join operations
- Applies MVCC visibility filters during scans
Sources: llkv-executor/README.md:1-17 README.md:60-61
Pager Trait
Storage abstraction in llkv-storage that:
- Exposes batch `get`/`put` over `(PhysicalKey, EntryHandle)` pairs
- Supports atomic multi-key updates
- Enables zero-copy reads when backed by memory-mapped storage
- Implementations: `MemPager` (heap), `SimdRDrivePager` (persistent)
Sources: llkv-storage/README.md:12-22 README.md:11-12
Crate Organization
The workspace crates are organized by layer:
| Layer | Crates | Responsibilities |
|---|---|---|
| SQL Processing | llkv-sql, llkv-plan, llkv-expr | Parse SQL, build typed plans, represent expressions |
| Runtime | llkv-runtime, llkv-transaction | Orchestrate execution, manage MVCC and sessions |
| Execution | llkv-executor, llkv-aggregate, llkv-join | Stream results, compute aggregates, evaluate joins |
| Data Management | llkv-table, llkv-column-map | Schema-aware tables, columnar storage |
| Storage | llkv-storage | Pager trait and implementations |
| Supporting | llkv-result, llkv-csv, llkv-test-utils | Result types, CSV ingestion, test utilities |
| Testing | llkv-slt-tester, llkv-tpch | SQL Logic Tests, TPC-H benchmarks |
| Entry Points | llkv | Main library and CLI |
For detailed dependency graphs and crate responsibilities, see Workspace and Crates.
Sources: Cargo.toml:67-87 README.md:44-53
Execution Model
Synchronous with Work-Stealing
LLKV defaults to synchronous execution using Rayon for parallelism:
- Query execution is synchronous, not async
- Rayon work-stealing parallelizes scans and projections
- Crossbeam channels coordinate between threads
- Embeds cleanly inside Tokio when needed (e.g., SLT test runner)
This design minimizes scheduler overhead for individual queries while maintaining high throughput for concurrent workloads.
Sources: README.md:38-41 llkv-column-map/README.md:32-34
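A small illustration of the model (generic Rayon usage, not LLKV code): per-batch work is fanned out across the Rayon pool from a plain synchronous call site, with no async runtime involved.

```rust
use arrow::record_batch::RecordBatch;
use rayon::prelude::*;

/// Apply a per-batch function in parallel and sum the results.
/// The caller stays synchronous; Rayon's work-stealing pool does the fan-out.
fn parallel_count(batches: &[RecordBatch], f: impl Fn(&RecordBatch) -> usize + Sync) -> usize {
    batches.par_iter().map(|batch| f(batch)).sum()
}
```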
Streaming Results
Queries produce results incrementally:
- `TableExecutor` yields fixed-size `RecordBatch`es
- No full result set materialization
- Callers process batches via callback or iterator
- Join and aggregate operators buffer only necessary state
Sources: llkv-table/README.md:24-25 llkv-executor/README.md:14-17 llkv-join/README.md:19-22
Data Lifecycle
Write Path
- User submits `INSERT` or `UPDATE` through `SqlEngine`
- `RuntimeContext` validates the schema and injects MVCC columns
- `Table::append` validates the `RecordBatch` schema
- `ColumnStore::append` sorts by `row_id` and rewrites conflicts
- `Pager::batch_put` commits Arrow-serialized chunks atomically
- The transaction manager advances the commit watermark
Read Path
- User submits `SELECT` through `SqlEngine`
- `RuntimeContext` acquires a transaction snapshot
- `TableExecutor` creates a scan with projection and filter
- `Table::scan_stream` initiates a `ColumnStream`
- `ColumnStore` fetches chunks via `Pager::batch_get` (zero-copy)
- MVCC filtering is applied using snapshot visibility rules
- The executor evaluates expressions and streams `RecordBatch`es to the caller
Sources: README.md:56-62 llkv-column-map/README.md:24-28 llkv-table/README.md:19-25
Workspace and Crates
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-column-map/src/store/projection.rs
- llkv-expr/Cargo.toml
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
Purpose and Scope
This document details the Cargo workspace structure and the 15+ crates that comprise the LLKV database system. Each crate is designed with a single responsibility and well-defined interfaces, enabling independent testing and evolution of components. This page catalogs the role of each crate, their internal dependencies, and how they map to the system's layered architecture described in Architecture.
For information about how SQL queries flow through these crates, see SQL Query Processing Pipeline. For details on specific subsystems like storage or transactions, refer to sections 7 and following.
Workspace Overview
The LLKV workspace is defined in Cargo.toml:67-88 and contains 18 member crates organized into core system components, specialized operations, testing infrastructure, and demonstration applications.
Workspace Structure:
graph TB
subgraph "Core System Crates"
LLKV["llkv\n(main entry)"]
SQL["llkv-sql\n(SQL interface)"]
PLAN["llkv-plan\n(query plans)"]
EXPR["llkv-expr\n(expression AST)"]
RUNTIME["llkv-runtime\n(orchestration)"]
EXECUTOR["llkv-executor\n(execution)"]
TABLE["llkv-table\n(table layer)"]
COLMAP["llkv-column-map\n(column store)"]
STORAGE["llkv-storage\n(storage abstraction)"]
TXN["llkv-transaction\n(MVCC manager)"]
RESULT["llkv-result\n(error types)"]
end
subgraph "Specialized Operations"
AGG["llkv-aggregate\n(aggregation)"]
JOIN["llkv-join\n(joins)"]
CSV["llkv-csv\n(CSV import)"]
end
subgraph "Testing Infrastructure"
SLT["llkv-slt-tester\n(SQL logic tests)"]
TESTUTIL["llkv-test-utils\n(test utilities)"]
TPCH["llkv-tpch\n(TPC-H benchmarks)"]
end
subgraph "Demonstrations"
DEMO["llkv-sql-pong-demo\n(interactive demo)"]
end
Sources: Cargo.toml:67-88
Core System Crates
llkv
Purpose: Main library crate that re-exports the primary user-facing APIs from llkv-sql and llkv-runtime.
Key Dependencies: llkv-sql, llkv-runtime
Responsibilities:
- Provides the consolidated API surface for embedding LLKV
- Re-exports `SqlEngine` for SQL query execution
- Re-exports runtime components for programmatic database access
Sources: Cargo.toml:9-10
llkv-sql
Path: llkv-sql/
Purpose: SQL interface layer that parses SQL statements, preprocesses dialect-specific syntax, and translates them into typed query plans.
Key Dependencies:
- `llkv-plan` - Query plan structures
- `llkv-expr` - Expression AST
- `llkv-runtime` - Execution orchestration
- `sqlparser` - SQL parsing (version 0.59.0)
Responsibilities:
- SQL statement preprocessing for dialect compatibility
- AST-to-plan translation
- INSERT statement buffering optimization
- SQL query result formatting
Primary Types:
- `SqlEngine` - Main query interface
Sources: Cargo.toml:21 llkv-plan/src/lib.rs:1-38
llkv-plan
Path: llkv-plan/
Purpose: Query planner that defines typed plan structures representing SQL operations.
Key Dependencies:
- `llkv-expr` - Expression types
- `llkv-result` - Error handling
- `sqlparser` - SQL AST types
Responsibilities:
- Plan structure definitions (`SelectPlan`, `InsertPlan`, `UpdatePlan`, `DeletePlan`)
- SQL-to-plan conversion utilities
- Subquery correlation tracking
- Plan graph serialization for debugging
Primary Types:
- `SelectPlan`, `InsertPlan`, `UpdatePlan`, `DeletePlan`, `CreateTablePlan`
- `SubqueryCorrelatedTracker`
- `RangeSelectRows` - Range-based row selection
Sources: llkv-plan/Cargo.toml:1-28 llkv-plan/src/lib.rs:1-38
llkv-expr
Path: llkv-expr/
Purpose: Expression AST definitions and literal value handling, independent of concrete Arrow scalar types.
Key Dependencies:
- `arrow` - Arrow data types
Responsibilities:
- Expression AST (`Expr<T>`, `ScalarExpr<T>`)
- Literal value representation (`Literal` enum)
- Type-aware predicate compilation (`typed_predicate`)
- Decimal value handling
Primary Types:
- `Expr<T>` - Generic expression with a field identifier type parameter
- `ScalarExpr<T>` - Scalar expressions
- `Literal` - Untyped literal values
- `DecimalValue` - Fixed-precision decimal
- `IntervalValue` - Calendar interval
Sources: llkv-expr/Cargo.toml:1-19 llkv-expr/src/lib.rs:1-21 llkv-expr/src/literal.rs:1-446
llkv-runtime
Path: llkv-runtime/
Purpose: Runtime orchestration layer providing MVCC transaction management, session handling, and system catalog coordination.
Key Dependencies:
- `llkv-executor` - Query execution
- `llkv-table` - Table operations
- `llkv-transaction` - MVCC snapshots
Responsibilities:
- Transaction lifecycle management
- Session state tracking
- System catalog access
- Query plan execution coordination
- MVCC snapshot creation and cleanup
Primary Types:
- `RuntimeContext` - Main runtime state
- `Session` - Per-connection state
Sources: Cargo.toml:19
llkv-executor
Path: llkv-executor/
Purpose: Query execution engine that evaluates plans and produces streaming results.
Key Dependencies:
- `llkv-plan` - Plan structures
- `llkv-expr` - Expression evaluation
- `llkv-table` - Table scans
- `llkv-aggregate` - Aggregation
- `llkv-join` - Join algorithms
Responsibilities:
- SELECT plan execution
- Projection and filtering
- Aggregation coordination
- Join execution
- Streaming RecordBatch production
Sources: Cargo.toml:14
llkv-table
Path: llkv-table/
Purpose: Schema-aware table abstraction providing high-level data operations over columnar storage.
Key Dependencies:
- `llkv-column-map` - Column storage
- `llkv-expr` - Predicate compilation
- `llkv-storage` - Storage backend
- `arrow` - RecordBatch representation
Responsibilities:
- Schema validation and enforcement
- MVCC metadata injection (`row_id`, `created_by`, `deleted_by`)
- Predicate compilation and optimization
- RecordBatch append/scan operations
- Column data type management
Primary Types:
- `Table` - Main table interface
- `TablePlanner` - Query optimization
- `TableExecutor` - Execution strategies
Sources: llkv-table/Cargo.toml:1-60 llkv-column-map/src/store/projection.rs:1-728
llkv-column-map
Path: llkv-column-map/
Purpose: Columnar storage layer that chunks Arrow arrays and manages the mapping from logical fields to physical storage keys.
Key Dependencies:
- `llkv-storage` - Pager abstraction
- `llkv-expr` - Field identifiers
- `arrow` - Array serialization
Responsibilities:
- Column chunk management (serialization/deserialization)
- LogicalFieldId → PhysicalKey mapping
- Multi-column gather operations with caching
- Row visibility filtering
- Chunk metadata tracking (min/max values)
Primary Types:
- `ColumnStore<P>` - Main storage interface
- `LogicalFieldId` - Namespaced field identifier
- `MultiGatherContext` - Reusable context for multi-column reads
- `GatherNullPolicy` - Null handling strategies
Sources: Cargo.toml:12 llkv-column-map/src/store/projection.rs:38-227
llkv-storage
Path: llkv-storage/
Purpose: Storage abstraction layer defining the Pager trait and providing implementations for in-memory and persistent backends.
Key Dependencies:
- `simd-r-drive` - SIMD-optimized persistent storage (optional)
- `arrow` - Buffer types
Responsibilities:
- `Pager` trait definition (`batch_get`/`batch_put`)
- Zero-copy array serialization format
- `MemPager` - In-memory HashMap backend
- `SimdRDrivePager` - Memory-mapped persistent backend
- Physical key allocation
Primary Types:
- `Pager` trait
- `MemPager`, `SimdRDrivePager`
- `PhysicalKey` - Storage location identifier
- Serialization format with custom encoding (see llkv-storage/src/serialization.rs:1-586)
Sources: Cargo.toml:22 llkv-storage/src/serialization.rs:1-130
llkv-transaction
Path: llkv-transaction/
Purpose: MVCC transaction manager providing snapshot isolation and row visibility determination.
Key Dependencies:
- `llkv-result` - Error types
Responsibilities:
- Transaction ID allocation
- MVCC snapshot creation
- Commit watermark tracking
- Row visibility rules enforcement
Primary Types:
- `TransactionManager`
- `Snapshot` - Transaction isolation view
- `TxnId` - Transaction identifier
Sources: Cargo.toml:25
llkv-result
Path: llkv-result/
Purpose: Common error and result types used throughout the system.
Key Dependencies: None (foundational crate)
Responsibilities:
- `Error` enum with all error variants
- `Result<T>` type alias
- Error conversion traits
Sources: Cargo.toml:18
Specialized Operations Crates
llkv-aggregate
Path: llkv-aggregate/
Purpose: Aggregate function evaluation including accumulators and distinct value tracking.
Key Dependencies:
- `arrow` - Array operations
Responsibilities:
- Aggregate function implementations (SUM, AVG, COUNT, MIN, MAX)
- Accumulator state management
- DISTINCT value tracking
- Group-by hash table operations
Sources: Cargo.toml:11
llkv-join
Path: llkv-join/
Purpose: Join algorithm implementations.
Key Dependencies:
- `arrow` - RecordBatch operations
- `llkv-expr` - Join predicates
Responsibilities:
- Hash join implementation
- Nested loop join
- Join key extraction
- Result materialization
Sources: Cargo.toml:16
llkv-csv
Path: llkv-csv/
Purpose: CSV file ingestion and export utilities.
Key Dependencies:
- `llkv-table` - Table operations
- `arrow` - CSV reader integration
Responsibilities:
- CSV to RecordBatch conversion
- Bulk insert optimization
- Schema inference from CSV headers
Sources: Cargo.toml:13
Testing Infrastructure Crates
llkv-test-utils
Path: llkv-test-utils/
Purpose: Shared test utilities including tracing setup and common test fixtures.
Key Dependencies:
- `tracing-subscriber` - Logging configuration
Responsibilities:
- Consistent tracing initialization across tests
- Common test helpers
- Auto-initialization feature for convenience
Sources: Cargo.toml:24
llkv-slt-tester
Path: llkv-slt-tester/
Purpose: SQL Logic Test runner providing standardized correctness testing.
Key Dependencies:
- `llkv-sql` - SQL execution
- `sqllogictest` - Test framework (version 0.28.4)
Responsibilities:
- `.slt` file discovery and execution
- Remote test suite fetching (`.slturl` files)
- Test result comparison
- AsyncDB adapter for LLKV
Primary Types:
- `LlkvSltRunner` - Test runner
- `EngineHarness` - Adapter interface
Sources: Cargo.toml:20
llkv-tpch
Path: llkv-tpch/
Purpose: TPC-H benchmark suite for performance testing.
Key Dependencies:
- `llkv` - Database interface
- `llkv-sql` - SQL execution
- `tpchgen` - Data generation (version 2.0.1)
Responsibilities:
- TPC-H data generation at various scale factors
- Query execution (Q1-Q22)
- Performance measurement
- Benchmark result reporting
Sources: Cargo.toml:62
Demonstration Applications
llkv-sql-pong-demo
Path: demos/llkv-sql-pong-demo/
Purpose: Interactive demonstration showing LLKV's SQL capabilities through a Pong game implemented in SQL.
Key Dependencies:
- `llkv-sql` - SQL execution
- `crossterm` - Terminal UI (version 0.29.0)
Responsibilities:
- Terminal-based interactive interface
- Real-time SQL query execution
- Game state management via SQL tables
- User input handling
graph LR
LLKV["llkv"]
SQL["llkv-sql"]
PLAN["llkv-plan"]
EXPR["llkv-expr"]
RUNTIME["llkv-runtime"]
EXECUTOR["llkv-executor"]
TABLE["llkv-table"]
COLMAP["llkv-column-map"]
STORAGE["llkv-storage"]
TXN["llkv-transaction"]
RESULT["llkv-result"]
AGG["llkv-aggregate"]
JOIN["llkv-join"]
CSV["llkv-csv"]
SLT["llkv-slt-tester"]
TESTUTIL["llkv-test-utils"]
TPCH["llkv-tpch"]
DEMO["llkv-sql-pong-demo"]
LLKV --> SQL
LLKV --> RUNTIME
SQL --> PLAN
SQL --> EXPR
SQL --> RUNTIME
SQL --> EXECUTOR
SQL --> TABLE
SQL --> TXN
RUNTIME --> EXECUTOR
RUNTIME --> TABLE
RUNTIME --> TXN
EXECUTOR --> PLAN
EXECUTOR --> EXPR
EXECUTOR --> TABLE
EXECUTOR --> AGG
EXECUTOR --> JOIN
TABLE --> COLMAP
TABLE --> EXPR
TABLE --> PLAN
TABLE --> STORAGE
COLMAP --> STORAGE
COLMAP --> EXPR
PLAN --> EXPR
PLAN --> RESULT
CSV --> TABLE
TXN --> RESULT
STORAGE --> RESULT
EXPR --> RESULT
COLMAP --> RESULT
TABLE --> RESULT
SLT --> SQL
SLT --> RUNTIME
SLT --> TESTUTIL
TPCH --> LLKV
TPCH --> SQL
DEMO --> SQL
Sources: Cargo.toml:86
Crate Dependency Graph
The following diagram shows the direct dependencies between workspace crates. Arrows point from dependent crates to their dependencies.
Crate Dependencies:
Sources: Cargo.toml:9-25 llkv-table/Cargo.toml:14-31 llkv-plan/Cargo.toml:14-24
Key Observations:
- `llkv-result` is a foundational crate with no internal dependencies, consumed by nearly all other crates for error handling.
- `llkv-expr` depends only on `llkv-result`, making it a stable base for expression handling across the system.
- `llkv-plan` builds on `llkv-expr` and adds plan-specific structures.
- `llkv-storage` and `llkv-transaction` are independent of each other, allowing flexibility in storage backend selection.
- `llkv-table` integrates storage, expressions, and planning to provide a cohesive data layer.
- `llkv-executor` coordinates specialized operations (aggregate, join) and table access.
- `llkv-runtime` sits at the top of the execution stack, orchestrating transactions and query execution.
- `llkv-sql` ties together all layers to provide the SQL interface.
Mapping Crates to System Layers
This diagram shows how workspace crates map to the architectural layers described in Architecture.
Layered Architecture Mapping:
Sources: Cargo.toml:67-88
External Dependencies
The workspace declares several critical external dependencies that enable core functionality.
Apache Arrow Ecosystem
Version: 57.0.0
Crates:
- `arrow` - Core Arrow functionality with prettyprint and IPC features
- `arrow-array` - Array implementations
- `arrow-schema` - Schema types
- `arrow-buffer` - Buffer management
- `arrow-ord` - Ordering operations
Usage: Arrow provides the universal columnar data format throughout LLKV. RecordBatch is used as the data interchange format at every layer, enabling zero-copy operations and SIMD-friendly processing.
Sources: Cargo.toml:32-36
SQL Parsing
Crate: sqlparser
Version: 0.59.0
Usage: Parses SQL text into AST nodes. Used by llkv-sql and llkv-plan to convert SQL queries into typed plan structures.
Sources: Cargo.toml:52
SIMD-Optimized Storage
Crate: simd-r-drive
Version: 0.15.5-alpha
Usage: Provides memory-mapped, SIMD-accelerated persistent storage backend. The SimdRDrivePager implementation in llkv-storage uses this for zero-copy array access.
Related: simd-r-drive-entry-handle for Arrow buffer integration
Sources: Cargo.toml:26-27
Testing and Benchmarking
Key Dependencies:
| Crate | Version | Purpose |
|---|---|---|
| criterion | 0.7.0 | Performance benchmarking |
| sqllogictest | 0.28.4 | SQL correctness testing |
| tpchgen | 2.0.1 | TPC-H data generation |
| libtest-mimic | 0.8 | Custom test harness |
Sources: Cargo.toml:40-62
Utilities
Key Dependencies:
| Crate | Version | Purpose |
|---|---|---|
| rayon | 1.10.0 | Data parallelism |
| rustc-hash | 2.1.1 | Fast hash maps |
| bitcode | 0.6.7 | Binary serialization |
| thiserror | 2.0.17 | Error trait derivation |
| serde | 1.0.228 | Serialization framework |
Sources: Cargo.toml:37-64
Workspace Configuration
The workspace is configured with shared package metadata and dependency versions to ensure consistency across all crates.
Shared Package Metadata:
Build Settings:
- Edition: 2024 (Rust 2024 edition)
- Resolver: Version 2 (new dependency resolver)
- Version: 0.8.2-alpha (all crates share this version)
Sources: Cargo.toml:1-8 Cargo.toml:88
Summary Table
| Crate | Layer | Primary Responsibility | Key Dependencies |
|---|---|---|---|
| llkv | Entry Point | Main library API | llkv-sql, llkv-runtime |
| llkv-sql | SQL Processing | SQL parsing and execution | llkv-plan, llkv-runtime, sqlparser |
| llkv-plan | SQL Processing | Query plan structures | llkv-expr, sqlparser |
| llkv-expr | SQL Processing | Expression AST | arrow |
| llkv-runtime | Execution | Transaction orchestration | llkv-executor, llkv-table |
| llkv-executor | Execution | Query execution | llkv-table, llkv-aggregate |
| llkv-table | Data Management | Schema-aware tables | llkv-column-map, llkv-storage |
| llkv-column-map | Data Management | Columnar storage | llkv-storage, arrow |
| llkv-storage | Storage | Storage abstraction | simd-r-drive (optional) |
| llkv-transaction | Data Management | MVCC manager | - |
| llkv-aggregate | Specialized Ops | Aggregation functions | arrow |
| llkv-join | Specialized Ops | Join algorithms | arrow |
| llkv-csv | Specialized Ops | CSV import/export | llkv-table |
| llkv-result | Foundation | Error types | - |
| llkv-test-utils | Testing | Test utilities | tracing-subscriber |
| llkv-slt-tester | Testing | SQL logic tests | llkv-sql, sqllogictest |
| llkv-tpch | Testing | TPC-H benchmarks | llkv-sql, tpchgen |
| llkv-sql-pong-demo | Demo | Interactive demo | llkv-sql, crossterm |
Sources: Cargo.toml:1-89
SQL Query Processing Pipeline
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-executor/README.md
- llkv-plan/README.md
- llkv-sql/README.md
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/tpch.rs
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
Purpose and Scope
This document describes the end-to-end SQL query processing pipeline in LLKV, from raw SQL text input to final results. It covers the five major stages: SQL preprocessing, parsing, plan translation, execution coordination, and result formatting.
For information about the specific plan structures created during translation, see Plan Structures. For details on how plans are executed to produce results, see Query Execution. For the user-facing SqlEngine API, see SqlEngine API.
Overview
The SQL query processing pipeline transforms user-provided SQL text into Arrow RecordBatch results through a series of well-defined stages. The primary entry point is SqlEngine::execute(), which orchestrates the entire flow while maintaining transaction boundaries and handling cross-statement optimizations like INSERT buffering.
Sources: llkv-sql/src/sql_engine.rs:933-998
graph TB
SQL["Raw SQL Text"]
PREPROCESS["Stage 1: SQL Preprocessing\nDialect Normalization"]
PARSE["Stage 2: Parsing\nsqlparser → AST"]
TRANSLATE["Stage 3: Plan Translation\nAST → PlanStatement"]
EXECUTE["Stage 4: Plan Execution\nRuntimeEngine"]
RESULTS["Stage 5: Result Formatting\nRuntimeStatementResult"]
SQL --> PREPROCESS
PREPROCESS --> PARSE
PARSE --> TRANSLATE
TRANSLATE --> EXECUTE
EXECUTE --> RESULTS
PREPROCESS --> |preprocess_sql_input| PREPROCESS_IMPL["• preprocess_tpch_connect_syntax()\n• preprocess_create_type_syntax()\n• preprocess_exclude_syntax()\n• preprocess_trailing_commas_in_values()\n• preprocess_empty_in_lists()\n• preprocess_index_hints()\n• preprocess_reindex_syntax()\n• preprocess_bare_table_in_clauses()"]
PARSE --> |parse_sql_with_recursion_limit| PARSER["sqlparser::Parser\nPARSER_RECURSION_LIMIT = 200"]
TRANSLATE --> |translate_statement| PLANNER["• SelectPlan\n• InsertPlan\n• UpdatePlan\n• DeletePlan\n• CreateTablePlan\n• DDL Plans"]
EXECUTE --> |engine.execute_statement| RUNTIME["RuntimeEngine\n• Transaction Management\n• MVCC Snapshots\n• Catalog Operations"]
style PREPROCESS fill:#f9f9f9
style PARSE fill:#f9f9f9
style TRANSLATE fill:#f9f9f9
style EXECUTE fill:#f9f9f9
style RESULTS fill:#f9f9f9
Stage 1: SQL Preprocessing
Before parsing, the SQL text undergoes a series of preprocessing transformations to normalize dialect-specific syntax. This allows LLKV to accept SQL written for SQLite, DuckDB, and other dialects while presenting a consistent AST to the planner.
Preprocessing Transformations
| Preprocessor Method | Purpose | Example Transformation |
|---|---|---|
| preprocess_tpch_connect_syntax() | Strip TPC-H CONNECT TO directives | CONNECT TO tpch; → `` |
| preprocess_create_type_syntax() | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE name AS INT → CREATE DOMAIN name AS INT |
| preprocess_exclude_syntax() | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
| preprocess_trailing_commas_in_values() | Remove trailing commas in VALUES | VALUES (1, 2,) → VALUES (1, 2) |
| preprocess_empty_in_lists() | Expand empty IN predicates to constant booleans | x IN () → (x = NULL AND 0 = 1) |
| preprocess_index_hints() | Strip SQLite INDEXED BY hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax() | Convert REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_bare_table_in_clauses() | Expand IN tablename to subquery | x IN t → x IN (SELECT * FROM t) |
Sources: llkv-sql/src/sql_engine.rs:623-873 llkv-sql/src/tpch.rs:1-17
Fallback Trigger Preprocessing
If parsing fails and the SQL contains CREATE TRIGGER, the engine applies an additional preprocess_sqlite_trigger_shorthand() transformation and retries. This handles SQLite's optional BEFORE/AFTER timing and FOR EACH ROW clauses by injecting defaults that sqlparser expects.
Sources: llkv-sql/src/sql_engine.rs:941-957 llkv-sql/src/sql_engine.rs:771-842
Stage 2: Parsing
Parsing is delegated to the sqlparser crate, which produces a Vec<Statement> AST. LLKV configures the parser with:
- Dialect: `GenericDialect` to accept a wide range of SQL syntax
- Recursion Limit: `PARSER_RECURSION_LIMIT = 200` (raised from sqlparser's default of 50 to handle deeply nested queries in test suites)
The parse_sql_with_recursion_limit() helper function wraps sqlparser's API to apply this custom limit.
Sources: llkv-sql/src/sql_engine.rs:317-324 llkv-sql/src/sql_engine.rs:939-957
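A sketch of what such a wrapper does with the sqlparser API is shown below; it assumes sqlparser's recursion-limit builder method, and LLKV's actual helper may be structured differently.

```rust
use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

const PARSER_RECURSION_LIMIT: usize = 200;

// Sketch of a recursion-limited parse; LLKV's parse_sql_with_recursion_limit()
// may differ in structure and error mapping.
fn parse_with_limit(sql: &str) -> Result<Vec<Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(PARSER_RECURSION_LIMIT)
        .try_with_sql(sql)?
        .parse_statements()
}
```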
Stage 3: Plan Translation
Each parsed Statement is translated into a strongly-typed PlanStatement that the runtime can execute. This translation happens through statement-specific methods in SqlEngine.
graph TB
AST["sqlparser::ast::Statement"]
SELECT["Statement::Query"]
INSERT["Statement::Insert"]
UPDATE["Statement::Update"]
DELETE["Statement::Delete"]
CREATE["Statement::CreateTable"]
DROP["Statement::Drop"]
TRANSACTION["Statement::StartTransaction\nStatement::Commit\nStatement::Rollback"]
ALTER["Statement::AlterTable"]
OTHER["Other DDL/DML"]
SELECT_PLAN["translate_query()\n→ SelectPlan"]
INSERT_PLAN["buffer_insert()\n→ InsertPlan or buffered"]
UPDATE_PLAN["translate_update()\n→ UpdatePlan"]
DELETE_PLAN["translate_delete()\n→ DeletePlan"]
CREATE_PLAN["translate_create_table()\n→ CreateTablePlan"]
DROP_PLAN["translate_drop()\n→ PlanStatement::Drop*"]
TXN_RUNTIME["Direct runtime delegation\nflush INSERT buffer first"]
ALTER_PLAN["translate_alter_table()\n→ PlanStatement::Alter*"]
OTHER_PLAN["translate_* methods\n→ PlanStatement::*"]
AST --> SELECT
AST --> INSERT
AST --> UPDATE
AST --> DELETE
AST --> CREATE
AST --> DROP
AST --> TRANSACTION
AST --> ALTER
AST --> OTHER
SELECT --> SELECT_PLAN
INSERT --> INSERT_PLAN
UPDATE --> UPDATE_PLAN
DELETE --> DELETE_PLAN
CREATE --> CREATE_PLAN
DROP --> DROP_PLAN
TRANSACTION --> TXN_RUNTIME
ALTER --> ALTER_PLAN
OTHER --> OTHER_PLAN
SELECT_PLAN --> RUNTIME["RuntimeEngine::execute_statement()"]
INSERT_PLAN --> BUFFER_CHECK{"Buffering\nenabled?"}
BUFFER_CHECK -->|Yes| BUFFER["InsertBuffer\naccumulates rows"]
BUFFER_CHECK -->|No| RUNTIME
UPDATE_PLAN --> RUNTIME
DELETE_PLAN --> RUNTIME
CREATE_PLAN --> RUNTIME
DROP_PLAN --> RUNTIME
TXN_RUNTIME --> RUNTIME
ALTER_PLAN --> RUNTIME
OTHER_PLAN --> RUNTIME
BUFFER --> FLUSH_CHECK{"Flush\nneeded?"}
FLUSH_CHECK -->|Yes| FLUSH["Flush buffered rows"]
FLUSH_CHECK -->|No| CONTINUE["Continue buffering"]
FLUSH --> RUNTIME
Statement Routing
Sources: llkv-sql/src/sql_engine.rs:960-998
Translation Process
The translation process involves:
- Column Resolution: Identifier strings are resolved to
FieldIdreferences using the runtime's catalog - Expression Translation: SQL expressions are converted to
Expr<String>, then resolved toExpr<FieldId> - Subquery Handling: Correlated subqueries are tracked with placeholder generation
- Parameter Binding: SQL placeholders (`?`, `$1`, `:name`) are mapped to parameter indices
Sources: llkv-sql/src/sql_engine.rs:1000-5000 (various translate_* methods)
sequenceDiagram
participant SqlEngine
participant Catalog as "RuntimeContext\nCatalog"
participant ExprTranslator
participant PlanBuilder
SqlEngine->>Catalog: resolve_table("users")
Catalog-->>SqlEngine: TableId(namespace=0, table=1)
SqlEngine->>Catalog: resolve_column("id", TableId)
Catalog-->>SqlEngine: ColumnResolution(FieldId)
SqlEngine->>ExprTranslator: translate_expr(sqlparser::Expr)
ExprTranslator->>ExprTranslator: Build Expr<String>
ExprTranslator->>Catalog: resolve_identifiers()
Catalog-->>ExprTranslator: Expr<FieldId>
ExprTranslator-->>SqlEngine: Expr<FieldId>
SqlEngine->>PlanBuilder: Create SelectPlan
Note over PlanBuilder: Attach projections,\nfilters, sorts, limits
PlanBuilder-->>SqlEngine: PlanStatement::Select(SelectPlan)
Stage 4: Plan Execution
Once a PlanStatement is constructed, it is passed to RuntimeEngine::execute_statement() for execution. The runtime coordinates:
- Transaction Management: Ensures each statement executes within a transaction snapshot
- MVCC Enforcement: Filters rows based on visibility rules
- Catalog Operations: Updates system catalog for DDL statements
- Executor Invocation: Delegates `SelectPlan` execution to `llkv-executor`
Execution Routing by Statement Type
Sources: llkv-sql/src/sql_engine.rs:587-609 llkv-runtime/ (RuntimeEngine implementation)
Stage 5: Result Formatting
The runtime returns a RuntimeStatementResult enum that represents the outcome of statement execution. SqlEngine surfaces this directly to callers via the execute() method, or converts it to Vec<RecordBatch> for the sql() convenience method.
Result Types
| Statement Type | Result Variant | Contents |
|---|---|---|
| SELECT | RuntimeStatementResult::Select | Vec<RecordBatch> of query results |
| INSERT | RuntimeStatementResult::Insert | rows_inserted: u64 |
| UPDATE | RuntimeStatementResult::Update | rows_updated: u64 |
| DELETE | RuntimeStatementResult::Delete | rows_deleted: u64 |
| CREATE TABLE | RuntimeStatementResult::CreateTable | table_name: String |
| CREATE INDEX | RuntimeStatementResult::CreateIndex | index_name: String |
| DROP TABLE | RuntimeStatementResult::DropTable | table_name: String |
| Transaction control | RuntimeStatementResult::Transaction | Transaction state |
Sources: llkv-runtime/ (RuntimeStatementResult definition), llkv-sql/src/sql_engine.rs:933-998
Prepared Statements and Parameters
LLKV supports parameterized queries through a prepared statement mechanism that handles three parameter syntaxes:
- Positional (numbered): `?`, `?1`, `?2`, `$1`, `$2`
- Named: `:name`, `:id`
- Auto-incremented: Sequential `?` placeholders
Parameter Processing Pipeline
Sources: llkv-sql/src/sql_engine.rs:71-206 llkv-sql/src/sql_engine.rs:278-297
Parameter Substitution
During plan execution with parameters, the engine performs a second pass to replace sentinel strings (__llkv_param__N__) with the actual Literal values provided by the caller. This two-phase approach allows the same PlanStatement to be reused across multiple executions with different parameter values.
Sources: llkv-sql/src/sql_engine.rs:194-206
INSERT Buffering
SqlEngine includes an optional INSERT buffering optimization that batches consecutive INSERT ... VALUES statements targeting the same table. This is disabled by default but can be enabled with set_insert_buffering(true) for bulk ingestion workloads.
stateDiagram-v2
[*] --> NoBuffer : buffering disabled
NoBuffer --> NoBuffer : INSERT → immediate execute
[*] --> BufferEmpty : buffering enabled
BufferEmpty --> BufferActive : INSERT(table, cols, rows)
BufferActive --> BufferActive : INSERT(same table/cols) accumulate rows
BufferActive --> Flush : Different table
BufferActive --> Flush : Different columns
BufferActive --> Flush : Different conflict action
BufferActive --> Flush : Row count ≥ MAX_BUFFERED_INSERT_ROWS
BufferActive --> Flush : Non-INSERT statement
BufferActive --> Flush : Transaction boundary
Flush --> RuntimeExecution : Create InsertPlan from accumulated rows
RuntimeExecution --> BufferEmpty : Reset buffer
NoBuffer --> [*]
BufferEmpty --> [*]
Buffering Logic
Sources: llkv-sql/src/sql_engine.rs:410-495 llkv-sql/src/sql_engine.rs:887-905
Buffer Flush Triggers
| Trigger Condition | Rationale |
|---|---|
| Row count ≥ MAX_BUFFERED_INSERT_ROWS (8192) | Limit memory usage |
| Target table changes | Cannot batch cross-table INSERTs |
| Column list changes | Schema mismatch |
| Conflict action changes | ON CONFLICT semantics differ |
| Non-INSERT statement encountered | Preserve statement ordering |
| Transaction boundary (BEGIN, COMMIT, ROLLBACK) | Ensure transactional consistency |
| Explicit flush_pending_inserts() call | Manual control |
| Statement expectation hint (testing) | Test harness needs per-statement results |
Sources: llkv-sql/src/sql_engine.rs:410-495
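For bulk ingestion the toggle and the explicit flush combine roughly as follows. This is a sketch: set_insert_buffering() and flush_pending_inserts() are the methods named above, while the receiver type, error type, and table schema are assumptions.

```rust
use llkv::SqlEngine;

// Sketch of a bulk-load loop with INSERT buffering enabled; exact signatures
// (mutability, error type) are assumed for illustration.
fn bulk_load(engine: &SqlEngine, value_lists: &[String]) -> Result<(), Box<dyn std::error::Error>> {
    engine.set_insert_buffering(true);
    for values in value_lists {
        // Consecutive INSERTs into the same table and columns accumulate in the buffer.
        engine.execute(&format!("INSERT INTO events (id, payload) VALUES {values}"))?;
    }
    // Push out the final partial batch before reading the table back.
    engine.flush_pending_inserts()?;
    Ok(())
}
```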
Error Handling and Table Mapping
The pipeline includes special error handling for table-not-found scenarios. When the runtime returns Error::NotFound or catalog-related errors, SqlEngine::execute_plan_statement() rewrites them to user-friendly messages like "Catalog Error: Table 'users' does not exist".
This mapping is skipped for CREATE VIEW and DROP VIEW statements where the "table" name refers to the view being created/dropped rather than a referenced table.
Sources: llkv-sql/src/sql_engine.rs:558-609
Data Formats and Arrow Integration
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
This page documents how Apache Arrow columnar data structures serve as the universal data interchange format throughout LLKV. It covers the supported Arrow data types, the custom zero-copy serialization format used for persistence, and how RecordBatch flows between layers.
For information about how expressions evaluate over Arrow data, see Expression System. For details on storage pager abstractions, see Pager Interface and SIMD Optimization.
Arrow as the Universal Data Format
LLKV is Arrow-native: every layer of the system produces, consumes, and operates on arrow::record_batch::RecordBatch structures. This design choice enables:
- Zero-copy data access across layer boundaries
- SIMD-friendly vectorized operations via contiguous columnar buffers
- Unified type system from SQL parsing through storage
- Efficient interoperability with external Arrow-compatible tools
The following table maps system layers to their Arrow usage:
| Layer | Arrow Usage |
|---|---|
| llkv-sql | Parses SQL and returns RecordBatch query results to callers |
| llkv-plan | Constructs plans that reference Arrow DataType and Field structures |
| llkv-executor | Streams RecordBatch results during SELECT evaluation |
| llkv-table | Validates incoming batches against table schemas and persists columns |
| llkv-column-map | Chunks RecordBatch columns into serialized Arrow arrays for pager storage |
| llkv-storage | Serializes/deserializes Arrow arrays with a custom zero-copy format |
Sources: Cargo.toml:32 llkv-table/README.md:10-11 llkv-column-map/README.md:10 llkv-storage/README.md:10
RecordBatch Flow Diagram
Key Data Flow Observations:
- INSERT Path: `RecordBatch` → schema validation → MVCC column injection → column chunking → serialization → pager write
- SELECT Path: Pager read → deserialization → column gather → `RecordBatch` construction → streaming to executor
- Zero-Copy: `EntryHandle` wraps memory-mapped regions that Arrow arrays reference directly without copying
Sources: llkv-column-map/src/store/projection.rs:240-446 llkv-storage/src/serialization.rs:226-254 llkv-table/README.md:19-25
Supported Arrow Data Types
LLKV supports the following Arrow primitive and complex types:
Primitive Types
| Arrow DataType | Storage Size | SQL Type Mapping |
|---|---|---|
| UInt8 | 1 byte | TINYINT UNSIGNED |
| UInt16 | 2 bytes | SMALLINT UNSIGNED |
| UInt32 | 4 bytes | INT UNSIGNED |
| UInt64 | 8 bytes | BIGINT UNSIGNED |
| Int8 | 1 byte | TINYINT |
| Int16 | 2 bytes | SMALLINT |
| Int32 | 4 bytes | INT |
| Int64 | 8 bytes | BIGINT |
| Float32 | 4 bytes | REAL, FLOAT |
| Float64 | 8 bytes | DOUBLE PRECISION |
| Boolean | 1 bit (packed) | BOOLEAN |
| Date32 | 4 bytes | DATE |
| Date64 | 8 bytes | TIMESTAMP |
| Decimal128(p, s) | 16 bytes | DECIMAL(p, s) |
Variable-Length Types
| Arrow DataType | Storage Layout | SQL Type Mapping |
|---|---|---|
| Utf8 | i32 offsets + UTF-8 bytes | VARCHAR, TEXT |
| LargeUtf8 | i64 offsets + UTF-8 bytes | TEXT (large) |
| Binary | i32 offsets + raw bytes | VARBINARY, BLOB |
| LargeBinary | i64 offsets + raw bytes | BLOB (large) |
Complex Types
| Arrow DataType | Description | Use Cases |
|---|---|---|
| Struct(fields) | Nested record with named fields | Composite values, JSON-like data |
| FixedSizeList(T, n) | Fixed-length array of type T | Vector embeddings, coordinate tuples |
Null Handling:
The current serialization format does not yet support null bitmaps. Arrays with null_count() > 0 will return an error during serialization. Null support is planned for future releases.
Sources: llkv-storage/src/serialization.rs:144-165 llkv-storage/src/serialization.rs:199-224 llkv-expr/src/literal.rs:78-94
Custom Serialization Format
Why Not Arrow IPC?
LLKV uses a custom minimal serialization format instead of Arrow's standard IPC (Inter-Process Communication) format for several reasons:
Trade-offs:
graph LR
subgraph "Arrow IPC Format"
IPC_SCHEMA["Schema object\nframing metadata"]
IPC_PADDING["Padding alignment\n8/64 byte boundaries"]
IPC_BUFFERS["Buffer pointers\n+ offsets"]
IPC_SIZE["Larger file size\n~20-40% overhead"]
end
subgraph "LLKV Custom Format"
CUSTOM_HEADER["24-byte header\nfixed size"]
CUSTOM_PAYLOAD["Raw buffer bytes\ncontiguous"]
CUSTOM_ZERO["Zero-copy rebuild\ndirect mmap"]
CUSTOM_SIZE["Minimal size\nno framing"]
end
IPC_SCHEMA --> IPC_SIZE
IPC_PADDING --> IPC_SIZE
IPC_BUFFERS --> IPC_SIZE
CUSTOM_HEADER --> CUSTOM_SIZE
CUSTOM_PAYLOAD --> CUSTOM_ZERO
| Aspect | Arrow IPC | LLKV Custom Format |
|---|---|---|
| File size | Larger (metadata + padding) | Minimal (24-byte header + payload) |
| Deserialization | Allocates and copies buffers | Zero-copy via EntryHandle |
| Flexibility | Supports all Arrow features | Limited to non-null arrays |
| Scan performance | Moderate (copy overhead) | Fast (direct SIMD access) |
| Null support | Full bitmap support | Not yet implemented |
Design Goals:
- Minimal headers: 24-byte fixed header, no schema objects per array
- Predictable payloads: contiguous buffers for mmap-friendly access
- True zero-copy: reconstruct `ArrayData` referencing the original buffer directly
- Stable on-disk codes: type tags are compile-time pinned to prevent corruption
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
Serialization Format Details
Header Structure
Every serialized array begins with a 24-byte header:
Offset Size Field
------ ---- -----
0-3 4 Magic bytes: b"ARR0"
4 1 Layout code (Primitive=0, FslFloat32=1, Varlen=2, Struct=3)
5 1 Type code (PrimType enum value)
6 1 Precision (for Decimal128) or padding
7 1 Scale (for Decimal128) or padding
8-15 8 Array length (u64, element count)
16-19 4 extra_a (layout-specific u32)
20-23 4 extra_b (layout-specific u32)
24+ var Payload (layout-specific buffer bytes)
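The sketch below shows one way this header could be decoded in Rust. The struct, function name, and little-endian assumption are illustrative rather than the crate's actual implementation; only the byte layout follows the table above.

```rust
/// Illustrative decoder for the 24-byte header described above.
/// Names and endianness are assumptions; the offsets come from the table.
#[derive(Debug)]
struct ArrayHeader {
    layout: u8,    // Primitive=0, FslFloat32=1, Varlen=2, Struct=3
    type_code: u8, // PrimType enum value
    precision: u8, // Decimal128 precision (otherwise padding)
    scale: u8,     // Decimal128 scale (otherwise padding)
    len: u64,      // element count
    extra_a: u32,  // layout-specific
    extra_b: u32,  // layout-specific
}

fn decode_header(bytes: &[u8]) -> Result<ArrayHeader, String> {
    if bytes.len() < 24 || &bytes[0..4] != b"ARR0" {
        return Err("missing ARR0 magic or truncated header".into());
    }
    Ok(ArrayHeader {
        layout: bytes[4],
        type_code: bytes[5],
        precision: bytes[6],
        scale: bytes[7],
        len: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(bytes[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(bytes[20..24].try_into().unwrap()),
    })
}
```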
Layout Variants
Primitive Layout
For fixed-width types (Int32, Float64, etc.):
- extra_a: Length of values buffer in bytes
- extra_b: Unused (0)
- Payload: Raw values buffer
Varlen Layout
For variable-length types (Utf8, Binary, etc.):
- extra_a: Length of offsets buffer in bytes
- extra_b: Length of values buffer in bytes
- Payload: Offsets buffer followed by values buffer
FixedSizeList Layout
Special optimization for vector embeddings:
- extra_a: List size (elements per list)
- extra_b: Total child buffer length in bytes
- Payload: Contiguous child Float32 buffer
This enables direct SIMD access to embedding vectors without indirection.
Struct Layout
For nested composite types:
- extra_a: Unused (0)
- extra_b: IPC payload length in bytes
- Payload: Arrow IPC-serialized struct array
Struct types fall back to Arrow IPC format because their complex nested structure doesn't benefit from the custom layout.
Sources: llkv-storage/src/serialization.rs:44-135 llkv-storage/src/serialization.rs:256-378
graph TB
PAGER["Pager::batch_get"]
HANDLE["EntryHandle\nmemory-mapped region"]
BUFFER["Arrow Buffer\nslice of EntryHandle"]
ARRAYDATA["ArrayData\nreferences Buffer"]
ARRAY["Concrete Array\nInt32Array, etc."]
PAGER --> HANDLE
HANDLE --> BUFFER
BUFFER --> ARRAYDATA
ARRAYDATA --> ARRAY
style HANDLE fill:#f9f9f9
style BUFFER fill:#f9f9f9
Zero-Copy Deserialization
EntryHandle Integration
The EntryHandle type from simd-r-drive-entry-handle provides a zero-copy wrapper around memory-mapped buffers:
Key Operations:
- Pager read : Returns GetResult::Raw { key, bytes: EntryHandle }
- Buffer slice : EntryHandle::as_arrow_buffer() creates an Arrow Buffer view
- ArrayData build : ArrayData::builder().add_buffer(buffer).build()
- Array cast : make_array(data) produces typed arrays
The entire chain avoids copying data — Arrow arrays directly reference the memory-mapped region.
Alignment Requirements:
Decimal128 requires 16-byte alignment. If the EntryHandle buffer is not properly aligned, the deserializer copies it to an aligned buffer:
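The following is a minimal sketch of that rebuild path using public arrow-rs APIs (ArrayData::builder, make_array). The function shape and the alignment fallback are assumptions, not the crate's exact code; the values buffer is assumed to come from the pager's EntryHandle via EntryHandle::as_arrow_buffer().

```rust
use arrow::array::{make_array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Sketch of the zero-copy rebuild for a primitive column.
fn rebuild_primitive(
    data_type: DataType,
    len: usize,
    values: Buffer, // view over the memory-mapped EntryHandle region
) -> arrow::error::Result<ArrayRef> {
    // Decimal128 needs 16-byte alignment; fall back to an owned copy when
    // the mapped region is not aligned. All other cases stay zero-copy.
    let values = if matches!(data_type, DataType::Decimal128(_, _))
        && values.as_ptr().align_offset(16) != 0
    {
        Buffer::from_slice_ref(values.as_slice()) // aligned, owned copy
    } else {
        values
    };

    let data = ArrayData::builder(data_type)
        .len(len)
        .add_buffer(values)
        .build()?;
    Ok(make_array(data)) // typed array still backed by the original buffer
}
```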
Sources: llkv-storage/src/serialization.rs:429-559 llkv-column-map/src/store/projection.rs:619-629
graph TB
ROWIDS["row_ids: &[u64]\nrequested rows"]
FIELDIDS["field_ids: &[LogicalFieldId]\nrequested columns"]
PLANS["FieldPlan\nper-column metadata"]
CHUNKS["Chunk selection\ncandidate_indices"]
BATCH_GET["Pager::batch_get\nfetch chunks"]
CACHE["Chunk cache\nArrayRef map"]
GATHER["gather_rows_from_chunks\nper-type specialization"]
ARRAYS["Vec<ArrayRef>\none per field"]
SCHEMA["Arrow Schema\nField metadata"]
RECORDBATCH["RecordBatch::try_new"]
ROWIDS --> PLANS
FIELDIDS --> PLANS
PLANS --> CHUNKS
CHUNKS --> BATCH_GET
BATCH_GET --> CACHE
CACHE --> GATHER
GATHER --> ARRAYS
ARRAYS --> RECORDBATCH
SCHEMA --> RECORDBATCH
RecordBatch Construction and Projection
Gather Operations
The ColumnStore::gather_rows family of methods reconstructs RecordBatch from chunked columns:
Projection Flow:
- Prepare context : Load column descriptors, determine chunk candidates
- Batch fetch : Request all needed chunks from pager in one call
- Type-specific gather : Dispatch to specialized routines based on DataType
- Null policy : Apply GatherNullPolicy (ErrorOnMissing, IncludeNulls, DropNulls)
- Schema construction : Build Schema with correct field names and nullability
- RecordBatch assembly : RecordBatch::try_new(schema, arrays) (see the sketch below)
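The assembly step itself is plain arrow-rs usage. The example below is illustrative only: the column names and types are invented here, and the real gather code builds the arrays from chunk data rather than literals.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Minimal illustration of the final assembly: a schema plus one gathered
/// array per field become a RecordBatch.
fn assemble_batch() -> arrow::error::Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let arrays: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1, 2, 3])),
        Arc::new(StringArray::from(vec!["a", "b", "c"])),
    ];
    RecordBatch::try_new(schema, arrays)
}
```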
Type Dispatch Table:
| DataType | Gather Function |
|---|---|
Utf8 | gather_rows_from_chunks_string::<i32> |
LargeUtf8 | gather_rows_from_chunks_string::<i64> |
Binary | gather_rows_from_chunks_binary::<i32> |
LargeBinary | gather_rows_from_chunks_binary::<i64> |
Boolean | gather_rows_from_chunks_bool |
Struct(_) | gather_rows_from_chunks_struct |
Decimal128(_, _) | gather_rows_from_chunks_decimal128 |
| Primitives | gather_rows_from_chunks::<ArrowTy> (generic) |
Sources: llkv-column-map/src/store/projection.rs:245-446 llkv-column-map/src/store/projection.rs:636-726
graph TB
BATCH["RecordBatch\nuser data"]
TABLE_SCHEMA["Stored Schema\nfrom catalog"]
VALIDATE["Schema validation"]
FIELD_CHECK["Field count\nname\ntype match"]
MVCC["Inject MVCC columns\nrow_id, created_by,\ndeleted_by"]
EXTENDED["Extended RecordBatch"]
COLMAP["ColumnStore::append"]
BATCH --> VALIDATE
TABLE_SCHEMA --> VALIDATE
VALIDATE --> FIELD_CHECK
FIELD_CHECK -->|Pass| MVCC
FIELD_CHECK -->|Fail| ERROR["Error::SchemaMismatch"]
MVCC --> EXTENDED
EXTENDED --> COLMAP
Schema Validation
Table-Level Schema Enforcement
The Table layer validates incoming RecordBatch schemas against the stored table schema before appending:
Validation Rules:
- Field count : Batch must have exactly the same number of columns as the table schema
- Field names : Column names must match (case-sensitive)
- Field types : DataType must match exactly (no implicit coercion)
- Nullability : Currently not strictly enforced (planned improvement)
MVCC Column Injection:
After validation, the table appends three system columns:
- row_id (UInt64): Unique row identifier
- created_by (UInt64): Transaction ID that created the row
- deleted_by (UInt64): Transaction ID that deleted the row (0 if active)
These columns are stored in separate logical namespaces but physically alongside user data.
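In arrow-rs terms, the injection amounts to extending the validated batch with three UInt64 columns. The sketch below is illustrative only; the actual implementation lives in llkv-table and llkv-column-map and handles namespacing and row-id allocation differently.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hedged sketch of MVCC column injection after schema validation.
fn inject_mvcc(batch: &RecordBatch, first_row_id: u64, txn_id: u64) -> arrow::error::Result<RecordBatch> {
    let n = batch.num_rows();
    let row_ids: ArrayRef = Arc::new(UInt64Array::from_iter_values(
        first_row_id..first_row_id + n as u64,
    ));
    let created_by: ArrayRef = Arc::new(UInt64Array::from(vec![txn_id; n]));
    let deleted_by: ArrayRef = Arc::new(UInt64Array::from(vec![0u64; n])); // 0 = active

    let mut fields: Vec<Field> =
        batch.schema().fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.push(Field::new("row_id", DataType::UInt64, false));
    fields.push(Field::new("created_by", DataType::UInt64, false));
    fields.push(Field::new("deleted_by", DataType::UInt64, false));

    let mut columns = batch.columns().to_vec();
    columns.extend([row_ids, created_by, deleted_by]);
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```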
Sources: llkv-table/README.md:14-17 llkv-column-map/README.md:26-28
graph LR
SQL_LITERAL["SQL Literal\n123, 'text', etc."]
LITERAL["Literal enum\nInteger, String, etc."]
SCHEMA["Table Schema\nArrow DataType"]
NATIVE["Native Value\ni32, String, etc."]
ARRAY["Arrow Array\nInt32Array, etc."]
SQL_LITERAL --> LITERAL
LITERAL --> SCHEMA
SCHEMA --> NATIVE
NATIVE --> ARRAY
Type Mapping from SQL to Arrow
Literal Conversion
The llkv-expr crate defines a Literal enum that captures untyped SQL values before schema resolution:
Supported Literal Types:
| Literal Variant | Arrow Target Types |
|---|---|
Integer(i128) | Any integer or float type (with range checks) |
Float(f64) | Float32, Float64 |
Decimal(DecimalValue) | Decimal128(p, s) |
String(String) | Utf8, LargeUtf8 |
Boolean(bool) | Boolean |
Date32(i32) | Date32 |
Struct(fields) | Struct(...) |
Interval(IntervalValue) | Not directly stored; used for date arithmetic |
Conversion Mechanism:
The FromLiteral trait provides type-aware conversion; implementations perform range checking and type validation before producing native values.
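A minimal sketch of the kind of conversion involved, assuming a simplified Literal enum and trait shape (the real definitions live in llkv-expr/src/literal.rs and differ in detail):

```rust
/// Simplified literal representation for illustration only.
#[derive(Debug)]
enum Literal {
    Integer(i128),
    Float(f64),
    String(String),
    Boolean(bool),
}

/// Illustrative trait shape: convert an untyped literal into a native value.
trait FromLiteralSketch: Sized {
    fn from_literal(lit: &Literal) -> Result<Self, String>;
}

impl FromLiteralSketch for i32 {
    fn from_literal(lit: &Literal) -> Result<Self, String> {
        match lit {
            // Range check: the i128 SQL literal must fit the target column type.
            Literal::Integer(v) => {
                i32::try_from(*v).map_err(|_| format!("{v} out of range for INT"))
            }
            other => Err(format!("cannot convert {other:?} to INT")),
        }
    }
}
```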
Sources: llkv-expr/src/literal.rs:78-94 llkv-expr/src/literal.rs:156-219 llkv-expr/src/literal.rs:395-419
Performance Characteristics
Zero-Copy Benefits
The combination of Arrow's columnar layout and the custom serialization format delivers measurable performance benefits:
| Operation | Traditional DB | LLKV Arrow-Native |
|---|---|---|
| Column scan | Row-by-row decode | Vectorized SIMD over mmap |
| Type dispatch | Virtual function calls | Monomorphized at compile time |
| Buffer management | Multiple allocations | Single mmap region |
| Predicate evaluation | Interpreted per row | Compiled bytecode over vectors |
Chunking Strategy
The ColumnStore organizes data into chunks sized for cache locality and pager efficiency:
- Target chunk size : Configurable, typically 64KB-256KB per column
- Row alignment : All columns in a table share the same row boundaries per chunk
- Append optimization : Incoming batches are chunked and sorted by row_id before persistence
This design minimizes pager I/O and maximizes CPU cache hit rates during scans.
Sources: llkv-column-map/README.md:24-28 llkv-storage/README.md:15-17
Integration with External Tools
Arrow Compatibility
Because LLKV uses standard Arrow data structures at its boundaries, it can integrate with the broader Arrow ecosystem:
- Export : Query results can be serialized to Arrow IPC files for external processing
- Import : Arrow IPC files can be read and ingested via Table::append
- Parquet : Future work could add direct Parquet read/write using Arrow's parquet crate
- DataFusion : LLKV's table scan APIs could potentially integrate as a DataFusion TableProvider
Current Limitations
- Null support : The serialization format doesn't yet handle null bitmaps
- Nested types : Only Struct and FixedSizeList<Float32> are fully supported
- Dictionary encoding : Not yet implemented (planned)
- Compression : No built-in compression (relies on storage-layer features)
Sources: llkv-storage/src/serialization.rs:257-260 llkv-column-map/README.md:10-11
Summary
LLKV's Arrow-native architecture provides:
- Universal interchange format via RecordBatch across all layers
- Zero-copy operations through EntryHandle and memory-mapped buffers
- Custom serialization optimized for mmap and SIMD access patterns
- Type safety from SQL literals through to persisted columns
- SIMD-friendly layout for efficient vectorized query evaluation
The trade-off of using a custom format instead of Arrow IPC is reduced flexibility (no nulls yet, fewer complex types) in exchange for smaller files, faster scans, and true zero-copy deserialization.
For details on how Arrow arrays are evaluated during query execution, see Scalar Evaluation and NumericKernels. For information on how MVCC metadata is stored alongside Arrow columns, see Column Storage and ColumnStore.
SQL Interface
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/src/lib.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/sql_value.rs
- llkv-sql/src/tpch.rs
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
The SQL Interface layer provides the primary user-facing entry point for executing SQL statements against LLKV. It consists of the llkv-sql crate, which wraps the underlying runtime and provides SQL parsing, preprocessing, statement caching, and result formatting.
This document covers the SqlEngine struct and its methods, SQL preprocessing and dialect normalization, and the INSERT buffering optimization system. For information about query planning after SQL parsing, see Query Planning. For runtime execution, see the Architecture section.
Sources: llkv-sql/src/lib.rs:1-51 README.md:47-48
Core Components
The SQL Interface layer is centered around three main subsystems:
| Component | Purpose | Key Types |
|---|---|---|
SqlEngine | Main execution interface | SqlEngine, RuntimeEngine, RuntimeSession |
| Preprocessing | SQL normalization and dialect handling | Various regex-based transformers |
| INSERT Buffering | Batch optimization for literal inserts | InsertBuffer, PreparedInsert |
Sources: llkv-sql/src/sql_engine.rs:365-556
SqlEngine Structure
The SqlEngine wraps a RuntimeEngine instance and adds SQL-specific functionality including statement caching, INSERT buffering, and configurable behavior flags. The insert_buffer field holds accumulated literal INSERT payloads when buffering is enabled.
Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/sql_engine.rs:421-471
SQL Statement Processing Flow
The statement processing flow consists of:
- Preprocessing: SQL text undergoes dialect normalization via regex-based transformations
- Parsing: sqlparser with an increased recursion limit (200 vs the default 50) produces the AST
- Planning: AST nodes are translated to typed PlanStatement structures
- Buffering (INSERT only): Literal INSERT statements may be accumulated in InsertBuffer
- Execution: Plans are passed to RuntimeEngine for execution
- Result collection: RuntimeStatementResult instances are collected and returned
Sources: llkv-sql/src/sql_engine.rs:933-991 llkv-sql/src/sql_engine.rs:318-324
Public API Methods
Core Execution Methods
The SqlEngine exposes two primary execution methods:
| Method | Signature | Purpose | Returns |
|---|---|---|---|
execute | fn execute(&self, sql: &str) | Execute one or more SQL statements | SqlResult<Vec<RuntimeStatementResult>> |
sql | fn sql(&self, query: &str) | Execute a single SELECT and return batches | SqlResult<Vec<RecordBatch>> |
The execute method handles arbitrary SQL (DDL, DML, queries) and returns statement results. The sql method is a convenience wrapper that enforces single-SELECT semantics and extracts Arrow batches from the result stream.
Sources: llkv-sql/src/sql_engine.rs:921-991 llkv-sql/src/sql_engine.rs:1009-1052
Prepared Statements
Prepared statements support three placeholder syntaxes:
- Positional: ? (auto-numbered), ?1, $1 (explicit index)
- Named: :param_name
Placeholders are tracked via thread-local ParameterState during parsing, converted to sentinel strings like __llkv_param__1__, and stored in a PreparedPlan with parameter count metadata. The statement_cache field provides a statement-level cache keyed by SQL text.
Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:278-297
Configuration Methods
| Method | Purpose |
|---|---|
new<Pg>(pager: Arc<Pg>) | Construct engine with given pager (buffering disabled) |
with_context(context, default_nulls_first) | Construct from existing RuntimeContext |
set_insert_buffering(enabled: bool) | Toggle INSERT batching mode |
The set_insert_buffering method controls cross-statement INSERT accumulation. When disabled (default), each INSERT executes immediately. When enabled, compatible INSERTs targeting the same table are batched together up to MAX_BUFFERED_INSERT_ROWS (8192 rows).
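A hedged usage sketch of a bulk-load loop: the import paths and the MemPager type are assumptions, error handling is simplified, and only the method names come from the tables above.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn bulk_load(statements: &[&str]) -> Result<(), Box<dyn std::error::Error>> {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    engine.set_insert_buffering(true);
    for stmt in statements {
        engine.execute(stmt)?; // compatible INSERTs accumulate up to 8192 rows
    }
    engine.set_insert_buffering(false); // disabling flushes any pending rows
    Ok(())
}
```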
Sources: llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-905 llkv-sql/src/sql_engine.rs:410-414
SQL Preprocessing System
The preprocessing layer normalizes SQL dialects before parsing to handle incompatibilities between SQLite, DuckDB, and sqlparser expectations.
graph TB
RAW["Raw SQL String"]
TPCH["preprocess_tpch_connect_syntax\n(strip CONNECT TO statements)"]
TYPE["preprocess_create_type_syntax\n(CREATE TYPE → CREATE DOMAIN)"]
EXCLUDE["preprocess_exclude_syntax\n(quote qualified names in EXCLUDE)"]
COMMA["preprocess_trailing_commas_in_values\n(remove trailing commas)"]
EMPTY["preprocess_empty_in_lists\n(expr IN () → constant)"]
INDEX["preprocess_index_hints\n(strip INDEXED BY / NOT INDEXED)"]
REINDEX["preprocess_reindex_syntax\n(REINDEX → VACUUM REINDEX)"]
BARE["preprocess_bare_table_in_clauses\n(IN table → IN (SELECT * FROM))"]
TRIGGER["preprocess_sqlite_trigger_shorthand\n(add AFTER / FOR EACH ROW)"]
PARSER["sqlparser::Parser"]
RAW --> TPCH
TPCH --> TYPE
TYPE --> EXCLUDE
EXCLUDE --> COMMA
COMMA --> EMPTY
EMPTY --> INDEX
INDEX --> REINDEX
REINDEX --> BARE
BARE --> PARSER
PARSER -.parse error.-> TRIGGER
TRIGGER --> PARSER
Each preprocessing function is implemented as a regex-based transformer:
| Function | Pattern | Purpose | Lines |
|---|---|---|---|
preprocess_tpch_connect_syntax | CONNECT TO database; | Strip TPC-H multi-database directives | 628-630
preprocess_create_type_syntax | CREATE TYPE → CREATE DOMAIN | Translate DuckDB type alias syntax | 639-657
preprocess_exclude_syntax | EXCLUDE(a.b.c) → EXCLUDE("a.b.c") | Quote qualified names in EXCLUDE | 659-676
preprocess_trailing_commas_in_values | VALUES (v,) → VALUES (v) | Remove DuckDB-style trailing commas | 678-689
preprocess_empty_in_lists | expr IN () → (expr = NULL AND 0 = 1) | Convert empty IN to constant false | 691-720
preprocess_index_hints | INDEXED BY idx / NOT INDEXED | Strip SQLite index hints | 722-739
preprocess_reindex_syntax | REINDEX idx → VACUUM REINDEX idx | Convert to sqlparser-compatible form | 741-757
preprocess_bare_table_in_clauses | IN table → IN (SELECT * FROM table) | Expand SQLite shorthand | 844-873
preprocess_sqlite_trigger_shorthand | Missing AFTER / FOR EACH ROW | Add required trigger components | 771-842
The trigger preprocessor is only invoked on parse errors containing CREATE TRIGGER, as it requires more complex regex patterns to inject missing timing and row-level clauses.
Sources: llkv-sql/src/sql_engine.rs:623-873
Regex Pattern Details
Static OnceLock<Regex> instances cache compiled patterns across invocations:
For example, the empty IN list handler uses:
(?i)(\([^)]*\)|x'[0-9a-fA-F]*'|'(?:[^']|'')*'|[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*|\d+(?:\.\d+)?)\s+(NOT\s+)?IN\s*\(\s*\)
This matches expressions (parenthesized, hex literals, strings, identifiers, numbers) followed by [NOT] IN () and replaces with boolean expressions that preserve evaluation side effects while producing constant results.
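The general shape of such a transformer is sketched below with a simplified pattern for index-hint stripping; the static name and regex here are illustrative, and the actual patterns in llkv-sql are more involved.

```rust
use std::sync::OnceLock;
use regex::Regex;

/// Sketch of a cached-regex transformer: the pattern compiles once per
/// process and is reused across all preprocessing calls.
fn strip_index_hints(sql: &str) -> String {
    static HINT: OnceLock<Regex> = OnceLock::new();
    let re = HINT.get_or_init(|| {
        // (?i) = case-insensitive; word boundaries avoid matching inside identifiers.
        Regex::new(r"(?i)\s+(INDEXED\s+BY\s+\w+|NOT\s+INDEXED)\b").expect("static pattern compiles")
    });
    re.replace_all(sql, "").into_owned()
}
```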
Sources: llkv-sql/src/sql_engine.rs:691-720
Parameter Placeholder System
The parameter system uses thread-local state to track placeholders during statement preparation:
- Scope Creation: ParameterScope::new() initializes thread-local ParameterState
- Registration: Each placeholder calls register_placeholder(raw), which:
  - For ?: auto-increments the index
  - For ?N or $N: uses the explicit numeric index
  - For :name: assigns the next available index and stores the mapping
- Sentinel Generation: placeholder_marker(index) creates the __llkv_param__N__ string
- Parsing: Sentinel strings are parsed as string literals in the SQL AST
- Binding: execute_prepared replaces sentinels with SqlParamValue instances
The ParameterState struct tracks:
- assigned: FxHashMap<String, usize> - named parameter to index mapping
- next_auto: usize - next index for ? placeholders
- max_index: usize - highest parameter index seen
Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:1120-1235
graph TB
subgraph "INSERT Processing Decision"
INSERT["Statement::Insert"]
CLASSIFY["classify_insert"]
VALUES["PreparedInsert::Values"]
IMMEDIATE["PreparedInsert::Immediate"]
end
subgraph "Buffering Logic"
ENABLED{"Buffering\nEnabled?"}
COMPAT{"Can Buffer\nAccept?"}
THRESHOLD{">= MAX_BUFFERED_INSERT_ROWS\n(8192)?"}
BUFFER["InsertBuffer::push_statement"]
FLUSH["flush_buffered_insert"]
EXECUTE["execute_plan_statement"]
end
INSERT --> CLASSIFY
CLASSIFY --> VALUES
CLASSIFY --> IMMEDIATE
VALUES --> ENABLED
ENABLED -->|No| EXECUTE
ENABLED -->|Yes| COMPAT
COMPAT -->|No| FLUSH
COMPAT -->|Yes| BUFFER
FLUSH --> BUFFER
BUFFER --> THRESHOLD
THRESHOLD -->|Yes| FLUSH
THRESHOLD -->|No| RETURN["Return placeholder result"]
IMMEDIATE --> EXECUTE
INSERT Buffering System
The INSERT buffering system batches compatible literal INSERT statements to reduce planning overhead for bulk ingest workloads.
Buffer Structure
The InsertBuffer struct accumulates rows across multiple INSERT statements:
Key fields:
- table_name, columns, on_conflict: compatibility key for buffering
- rows: accumulated literal values from all buffered statements
- statement_row_counts: per-statement row counts to emit individual results
- total_rows: sum of statement_row_counts for threshold checking
Sources: llkv-sql/src/sql_engine.rs:421-471
Buffering Conditions
An INSERT can be buffered if:
- The InsertSource is Values (literal rows) or a constant SELECT
- Buffering is enabled via the insert_buffering_enabled flag
- Either no buffer exists or InsertBuffer::can_accept returns true:
  - table_name matches exactly
  - columns match exactly (same names, same order)
  - on_conflict action matches
When the buffer reaches MAX_BUFFERED_INSERT_ROWS (8192), it is flushed automatically. Flush also occurs on:
- Transaction boundaries (BEGIN, COMMIT, ROLLBACK)
- Incompatible INSERT statement
- Engine drop
- Explicit set_insert_buffering(false) call
Sources: llkv-sql/src/sql_engine.rs:452-470 llkv-sql/src/sql_engine.rs:2028-2146 llkv-sql/src/sql_engine.rs:410-414
Buffer Flush Process
The flush process:
- Extracts the InsertBuffer from RefCell<Option<InsertBuffer>>
- Constructs a single InsertPlan with all accumulated rows
- Executes via execute_statement
- Receives a single RuntimeStatementResult::Insert with the total rows inserted
- Splits the result into per-statement results using the statement_row_counts vector
- Returns a vector of results matching the original statement order
This allows bulk execution while preserving per-statement result semantics.
Sources: llkv-sql/src/sql_engine.rs:2028-2146
Value Handling
The SqlValue enum represents literal values during SQL processing:
The SqlValue::try_from_expr function handles:
- Unary operators (negation for numeric types, intervals)
- CAST expressions (particularly to DATE)
- Nested expressions
- Dictionary/struct literals
- Binary operations (addition, subtraction, bitshift for constant folding)
- Typed strings (DATE '2024-01-01')
Interval arithmetic is performed at constant-folding time:
- Date32 + Interval → Date32
- Interval + Date32 → Date32
- Date32 - Interval → Date32
- Date32 - Date32 → Interval
- Interval +/- Interval → Interval
Sources: llkv-sql/src/sql_value.rs:16-320
Error Handling
The SQL layer maps table-related errors to catalog-specific error messages:
| Error Type | Mapping | Method |
|---|---|---|
Error::NotFound | Catalog Error: Table 'X' does not exist | table_not_found_error |
Error::InvalidArgumentError (contains "unknown table") | Same as above | map_table_error |
| Transaction conflicts | another transaction has dropped this table | String constant |
The execute_plan_statement method applies error mapping except for CREATE VIEW and DROP VIEW statements, where the "table" name refers to the view being created/dropped rather than a referenced table.
Sources: llkv-sql/src/sql_engine.rs:558-609 llkv-sql/src/sql_engine.rs:511
Thread Safety and Cloning
The SqlEngine::clone implementation creates a new session:
This ensures each cloned engine has an independent:
- RuntimeSession (transaction state, temporary namespace)
- insert_buffer (no shared buffering across sessions)
- statement_cache (independent prepared statement cache)
The warning message indicates this is typically not intended usage, as most applications should use a single shared SqlEngine instance across threads (enabled by interior mutability via RefCell and atomic types).
Sources: llkv-sql/src/sql_engine.rs:522-540
SqlEngine API
Relevant source files
- llkv-aggregate/src/lib.rs
- llkv-executor/README.md
- llkv-plan/README.md
- llkv-sql/README.md
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/sql_value.rs
Purpose and Scope
The SqlEngine provides the primary user-facing API for executing SQL statements against LLKV databases. It accepts SQL text, parses it, translates it to typed execution plans, and delegates to the runtime layer for evaluation. This page documents the SqlEngine struct's construction, methods, prepared statement handling, and configuration options.
For information about SQL preprocessing and dialect handling, see SQL Preprocessing and Dialect Handling. For details on INSERT buffering behavior, see INSERT Buffering System. For query planning internals, see Plan Structures.
Sources: llkv-sql/src/lib.rs:1-51 llkv-sql/README.md:1-68
SqlEngine Architecture Overview
The SqlEngine sits at the top of the LLKV SQL processing stack, coordinating parsing, planning, and execution:
Diagram: SqlEngine Position in SQL Processing Stack
graph TB
User["User Code"]
SqlEngine["SqlEngine\n(llkv-sql)"]
Parser["sqlparser\nAST Generation"]
Preprocessor["SQL Preprocessor\nDialect Normalization"]
Planner["Plan Translation\nAST → PlanStatement"]
Runtime["RuntimeEngine\n(llkv-runtime)"]
Executor["llkv-executor\nQuery Execution"]
Table["llkv-table\nTable Layer"]
User -->|execute sql| SqlEngine
User -->|sql select| SqlEngine
User -->|prepare sql| SqlEngine
SqlEngine --> Preprocessor
Preprocessor --> Parser
Parser --> Planner
Planner --> Runtime
Runtime --> Executor
Executor --> Table
SqlEngine -.->|owns| Runtime
SqlEngine -.->|manages| InsertBuffer["InsertBuffer\nBatching State"]
SqlEngine -.->|caches| StmtCache["statement_cache\nPreparedPlan Cache"]
The SqlEngine wraps a RuntimeEngine, manages statement caching and INSERT buffering, and provides convenience methods for single-statement queries (sql()) and multi-statement execution (execute()).
Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/lib.rs:1-51
Constructing a SqlEngine
Basic Construction
Diagram: SqlEngine Construction Flow
The most common constructor is SqlEngine::new(), which accepts a pager and creates a new runtime context:
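A hedged construction sketch; the import paths and the MemPager type are assumptions, while the constructor signature follows the table below.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn main() {
    // Buffering is disabled and default settings apply.
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    let _ = engine; // ready for execute() / sql() calls
}
```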
Sources: llkv-sql/src/sql_engine.rs:615-621
| Constructor | Signature | Purpose |
|---|---|---|
new() | new<Pg>(pager: Arc<Pg>) -> Self | Create engine with new runtime and default settings |
with_context() | with_context(context: Arc<SqlContext>, default_nulls_first: bool) -> Self | Create engine reusing an existing runtime context |
from_runtime_engine() | from_runtime_engine(engine: RuntimeEngine, default_nulls_first: bool, insert_buffering_enabled: bool) -> Self | Internal constructor for fine-grained control |
Table: SqlEngine Constructor Methods
Sources: llkv-sql/src/sql_engine.rs:543-556 llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-885
Core Query Execution Methods
execute() - Multi-Statement Execution
The execute() method processes one or more SQL statements from a string and returns a vector of results:
Diagram: execute() Method Execution Flow
The execution pipeline:
- Preprocessing : SQL text undergoes dialect normalization via preprocess_sql_input() llkv-sql/src/sql_engine.rs:1556-1564
- Parsing : sqlparser converts normalized text to AST with a recursion limit of 200 llkv-sql/src/sql_engine.rs:324
- Statement Loop : Each statement is translated to a PlanStatement and either buffered (for INSERTs) or executed immediately
- Result Collection : Results are accumulated and returned as Vec<SqlStatementResult>
Sources: llkv-sql/src/sql_engine.rs:933-1044
sql() - Single SELECT Execution
The sql() method enforces single-statement SELECT semantics and returns Arrow RecordBatch results:
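A hedged usage sketch, with the import path assumed and error handling simplified:

```rust
use llkv_sql::SqlEngine; // assumed import path

/// Run a single SELECT and count rows across the returned Arrow batches.
fn count_rows(engine: &SqlEngine) -> Result<usize, Box<dyn std::error::Error>> {
    let batches = engine.sql("SELECT id, name FROM users")?; // Vec<RecordBatch>
    Ok(batches.iter().map(|b| b.num_rows()).sum())
}
```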
Key differences from execute():
- Accepts only a single statement
- Statement must be a SELECT query
- Returns Vec<RecordBatch> directly rather than RuntimeStatementResult
- Automatically collects streaming results
Sources: llkv-sql/src/sql_engine.rs:1046-1085
Prepared Statements
Prepared Statement Flow
Diagram: Prepared Statement Creation and Caching
prepare() Method
The prepare() method parses SQL with placeholders and caches the resulting plan:
Placeholder syntax supported:
- ? - Positional parameter (auto-increments)
- ?N - Numbered parameter (1-indexed)
- $N - PostgreSQL-style numbered parameter
- :name - Named parameter
Sources: llkv-sql/src/sql_engine.rs:1296-1376 llkv-sql/src/sql_engine.rs:86-132
Parameter Binding Mechanism
Parameter registration occurs via thread-local ParameterScope:
Diagram: Parameter Registration and Sentinel Generation
The parameter translation process:
- During parsing, placeholders are intercepted and converted to sentinel strings: __llkv_param__N__
- ParameterState tracks placeholder-to-index mappings in thread-local storage
- At execution time, sentinel strings are replaced with actual parameter values
Sources: llkv-sql/src/sql_engine.rs:71-206
SqlParamValue Type
Parameter values are represented by the SqlParamValue enum:
| Variant | SQL Type | Usage |
|---|---|---|
Null | NULL | SqlParamValue::Null |
Integer(i64) | INTEGER/BIGINT | SqlParamValue::from(42_i64) |
Float(f64) | FLOAT/DOUBLE | SqlParamValue::from(3.14_f64) |
Boolean(bool) | BOOLEAN | SqlParamValue::from(true) |
String(String) | TEXT/VARCHAR | SqlParamValue::from("text") |
Date32(i32) | DATE | SqlParamValue::from(18993_i32) |
Table: SqlParamValue Variants and Conversions
Sources: llkv-sql/src/sql_engine.rs:208-276
execute_prepared() Method
Execute a prepared statement with bound parameters:
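A hedged sketch of the call sequence: the handle returned by prepare() and the exact shape of execute_prepared() are assumptions, while the SqlParamValue conversions follow the table above.

```rust
use llkv_sql::{SqlEngine, SqlParamValue}; // assumed import paths

fn insert_user(engine: &SqlEngine, id: i64, name: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Placeholders are registered during prepare() and bound at execution time.
    let stmt = engine.prepare("INSERT INTO users (id, name) VALUES (?, ?)")?;
    engine.execute_prepared(&stmt, &[SqlParamValue::from(id), SqlParamValue::from(name)])?;
    Ok(())
}
```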
Parameter substitution occurs in two phases:
- Literal Substitution : Sentinels in Expr<String> are replaced via substitute_parameter_literals() llkv-sql/src/sql_engine.rs:1453-1497
- Plan Value Substitution : Sentinels in Vec<PlanValue> are replaced via substitute_parameter_plan_values() llkv-sql/src/sql_engine.rs:1499-1517
Sources: llkv-sql/src/sql_engine.rs:1378-1451
Transaction Control
The SqlEngine supports explicit transaction boundaries via SQL statements:
Diagram: Transaction State Machine
Transaction management methods:
| SQL Statement | Effect |
|---|---|
BEGIN | Start explicit transaction llkv-sql/src/sql_engine.rs:970-976 |
COMMIT | Finalize transaction and flush buffers llkv-sql/src/sql_engine.rs:977-983 |
ROLLBACK | Abort transaction and discard buffers llkv-sql/src/sql_engine.rs:984-990 |
Transaction boundaries automatically flush the INSERT buffer to ensure consistent visibility semantics.
Sources: llkv-sql/src/sql_engine.rs:970-990 llkv-sql/src/sql_engine.rs:912-914
INSERT Buffering System
Buffer Architecture
Diagram: INSERT Buffer Accumulation and Flush
InsertBuffer Structure
The InsertBuffer struct accumulates literal INSERT payloads:
Sources: llkv-sql/src/sql_engine.rs:421-471
Buffer Compatibility
An INSERT can join the buffer if it matches:
- Table Name : Target table must match buffer.table_name
- Column List : Columns must match buffer.columns exactly
- Conflict Action : on_conflict strategy must match
Sources: llkv-sql/src/sql_engine.rs:452-459
Flush Conditions
The buffer flushes when:
| Condition | Implementation |
|---|---|
| Size limit exceeded | total_rows >= MAX_BUFFERED_INSERT_ROWS (8192) llkv-sql/src/sql_engine.rs:414 |
| Incompatible INSERT | Table/columns/conflict mismatch llkv-sql/src/sql_engine.rs:1765-1799 |
| Transaction boundary | BEGIN, COMMIT, ROLLBACK detected llkv-sql/src/sql_engine.rs:970-990 |
| Non-INSERT statement | Any non-INSERT SQL statement llkv-sql/src/sql_engine.rs:991-1040 |
| Statement expectation | Test harness expectation registered llkv-sql/src/sql_engine.rs:1745-1760 |
| Manual flush | flush_pending_inserts() called llkv-sql/src/sql_engine.rs:1834-1850 |
Table: INSERT Buffer Flush Triggers
Enabling/Disabling Buffering
INSERT buffering is controlled by the set_insert_buffering() method:
- Disabled by default to maintain statement-level transaction semantics
- Enable for bulk loading to reduce planning overhead
- Disabling flushes buffer to ensure pending rows are persisted
Sources: llkv-sql/src/sql_engine.rs:898-905
classDiagram
class RuntimeStatementResult {<<enum>>\nSelect\nInsert\nUpdate\nDelete\nCreateTable\nDropTable\nCreateIndex\nDropIndex\nAlterTable\nCreateView\nDropView\nVacuum\nTransaction}
class SelectVariant {+SelectExecution execution}
class InsertVariant {+table_name: String\n+rows_inserted: usize}
class UpdateVariant {+table_name: String\n+rows_updated: usize}
RuntimeStatementResult --> SelectVariant
RuntimeStatementResult --> InsertVariant
RuntimeStatementResult --> UpdateVariant
Result Types
RuntimeStatementResult
The execute() and execute_prepared() methods return Vec<RuntimeStatementResult>:
Diagram: RuntimeStatementResult Variants
Key result variants:
| Variant | Fields | Returned By |
|---|---|---|
Select | SelectExecution | SELECT queries |
Insert | table_name: String, rows_inserted: usize | INSERT statements |
Update | table_name: String, rows_updated: usize | UPDATE statements |
Delete | table_name: String, rows_deleted: usize | DELETE statements |
CreateTable | table_name: String | CREATE TABLE |
CreateIndex | index_name: String, table_name: String | CREATE INDEX |
Table: Common RuntimeStatementResult Variants
Sources: llkv-sql/src/lib.rs:49
SelectExecution
SELECT queries return a SelectExecution handle for streaming results:
The sql() method automatically collects batches via execution.collect():
Sources: llkv-sql/src/sql_engine.rs:1065-1080
Configuration Methods
session() - Access Runtime Session
Provides access to the underlying RuntimeSession for transaction introspection or advanced error handling.
Sources: llkv-sql/src/sql_engine.rs:917-919
context_arc() - Access Runtime Context
Internal method to retrieve the shared RuntimeContext for engine composition.
Sources: llkv-sql/src/sql_engine.rs:875-877
Testing Utilities
StatementExpectation
Test harnesses can register expectations to control buffer flushing:
When a statement expectation is registered, the INSERT buffer flushes before executing that statement to ensure test assertions observe correct row counts.
Sources: llkv-sql/src/sql_engine.rs:64-315
Example Usage
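The original inline example is not reproduced here; the following hedged sketch shows typical end-to-end usage, with the import paths and the MemPager type as assumptions and error handling simplified.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // DDL and DML go through execute(), which accepts multiple statements.
    engine.execute("CREATE TABLE users (id BIGINT, name TEXT)")?;
    engine.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")?;

    // Single SELECTs can use sql(), which returns Arrow RecordBatches.
    let batches = engine.sql("SELECT name FROM users WHERE id = 1")?;
    for batch in &batches {
        println!("{} row(s)", batch.num_rows());
    }
    Ok(())
}
```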
Sources: llkv-sql/src/sql_engine.rs:299-309
Thread Safety and Cloning
The SqlEngine implements Clone with special semantics:
Warning : Cloning a SqlEngine creates a new RuntimeSession, not a shared reference. Each clone has independent transaction state and INSERT buffers.
Sources: llkv-sql/src/sql_engine.rs:522-540
Error Handling
Table Not Found Errors
The SqlEngine remaps generic errors to user-friendly catalog errors:
This converts low-level NotFound errors into: Catalog Error: Table 'tablename' does not exist
Sources: llkv-sql/src/sql_engine.rs:558-585
SQL Preprocessing and Dialect Handling
Relevant source files
Purpose and Scope
SQL preprocessing is the first stage in LLKV's query processing pipeline, responsible for normalizing SQL syntax from various dialects before the statement reaches the parser. This system allows LLKV to accept SQL written for SQLite, DuckDB, and TPC-H tooling while using the standard sqlparser library, which has limited dialect support.
The preprocessing layer transforms dialect-specific syntax into forms that sqlparser can parse, enabling compatibility with SQL Logic Tests and real-world SQL scripts without modifying the parser itself. This page documents the preprocessing transformations and their implementation.
For information about what happens after preprocessing (SQL parsing and plan generation), see SQL Query Processing Pipeline. For details on the SqlEngine API that invokes preprocessing, see SqlEngine API.
Preprocessing in the Query Pipeline
SQL preprocessing occurs immediately before parsing in both the execute and prepare code paths. The following diagram shows where preprocessing fits in the overall query execution flow:
Diagram: SQL Preprocessing Pipeline Position
flowchart TB
Input["SQL String Input"]
Preprocess["preprocess_sql_input()"]
Parse["sqlparser::Parser::parse()"]
Plan["Plan Generation"]
Execute["Query Execution"]
Input --> Preprocess
Preprocess --> Parse
Parse --> Plan
Plan --> Execute
subgraph "Preprocessing Transformations"
direction TB
TPC["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → CREATE DOMAIN"]
Exclude["EXCLUDE syntax normalization"]
Trailing["Trailing comma removal"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint stripping"]
Reindex["REINDEX → VACUUM REINDEX"]
BareTable["Bare table IN expansion"]
TPC --> CreateType
CreateType --> Exclude
Exclude --> Trailing
Trailing --> BareTable
BareTable --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
end
Preprocess -.chains.-> TPC
Reindex -.final.-> Parse
Sources: llkv-sql/src/sql_engine.rs:936-1001
The preprocess_sql_input method chains all dialect transformations in a specific order, with each transformation receiving the output of the previous one. If parsing fails after preprocessing and the SQL contains CREATE TRIGGER, a fallback preprocessor (preprocess_sqlite_trigger_shorthand) is applied before retrying the parse.
Diagram: Preprocessing Execution Sequence with Fallback
sequenceDiagram
participant Caller
participant SqlEngine
participant Preprocess as "preprocess_sql_input"
participant Parser as "sqlparser"
participant Fallback as "preprocess_sqlite_trigger_shorthand"
Caller->>SqlEngine: execute(sql)
SqlEngine->>Preprocess: preprocess(sql)
Note over Preprocess: Chain all transformations
Preprocess-->>SqlEngine: processed_sql
SqlEngine->>Parser: parse(processed_sql)
alt Parse Success
Parser-->>SqlEngine: AST
else Parse Error + "CREATE TRIGGER"
Parser-->>SqlEngine: ParseError
SqlEngine->>Fallback: expand_trigger_syntax(processed_sql)
Fallback-->>SqlEngine: expanded_sql
SqlEngine->>Parser: parse(expanded_sql)
Parser-->>SqlEngine: AST or Error
end
SqlEngine-->>Caller: Results
Sources: llkv-sql/src/sql_engine.rs:936-958
Supported Dialect Transformations
LLKV implements nine distinct preprocessing transformations, each targeting specific dialect compatibility issues. The following table summarizes each transformation:
| Preprocessor | Dialect | Purpose | Method |
|---|---|---|---|
| TPC-H CONNECT | TPC-H | Strip CONNECT TO database; statements | preprocess_tpch_connect_syntax |
| CREATE TYPE | DuckDB | Convert CREATE TYPE to CREATE DOMAIN | preprocess_create_type_syntax |
| EXCLUDE Syntax | General | Quote qualified identifiers in EXCLUDE clauses | preprocess_exclude_syntax |
| Trailing Commas | DuckDB | Remove trailing commas in VALUES | preprocess_trailing_commas_in_values |
| Empty IN Lists | SQLite | Convert IN () to constant expressions | preprocess_empty_in_lists |
| Index Hints | SQLite | Strip INDEXED BY and NOT INDEXED | preprocess_index_hints |
| REINDEX | SQLite | Convert REINDEX to VACUUM REINDEX | preprocess_reindex_syntax |
| Bare Table IN | SQLite | Expand IN table to IN (SELECT * FROM table) | preprocess_bare_table_in_clauses |
| Trigger Shorthand | SQLite | Add AFTER and FOR EACH ROW to triggers | preprocess_sqlite_trigger_shorthand |
Sources: llkv-sql/src/sql_engine.rs:628-842 llkv-sql/src/sql_engine.rs:992-1001
TPC-H CONNECT Statement Removal
The TPC-H benchmark tooling generates CONNECT TO <database>; directives in referential integrity scripts. Since LLKV operates within a single database context, these statements are treated as no-ops and stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:623-630
CREATE TYPE to CREATE DOMAIN Conversion
DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser only supports the SQL standard CREATE DOMAIN syntax. This preprocessor converts the DuckDB syntax to the standard form.
Transformation:
Implementation: Uses static regex patterns initialized via OnceLock for thread-safe lazy compilation.
Sources: llkv-sql/src/sql_engine.rs:634-657
EXCLUDE Syntax Normalization
When EXCLUDE clauses contain qualified identifiers (e.g., schema.table.column), sqlparser requires them to be quoted. This preprocessor wraps qualified names in double quotes.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:659-676
Trailing Comma Removal in VALUES
DuckDB permits trailing commas in VALUES clauses like VALUES ('v2',), but sqlparser rejects them. This preprocessor removes trailing commas before closing parentheses.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:678-689
Empty IN List Handling
SQLite allows degenerate IN () and NOT IN () expressions. Since sqlparser rejects these, the preprocessor converts them to constant boolean expressions while preserving the original expression evaluation (in case of side effects).
Transformation:
The pattern matches various expression forms: parenthesized expressions, quoted strings, hex literals, identifiers, and numbers.
Sources: llkv-sql/src/sql_engine.rs:691-720
Index Hint Stripping
SQLite supports query optimizer hints like FROM table INDEXED BY index_name and FROM table NOT INDEXED. Since sqlparser doesn't support this syntax and LLKV makes its own index decisions, these hints are stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:722-739
REINDEX to VACUUM REINDEX Conversion
SQLite supports REINDEX index_name as a standalone statement, but sqlparser only recognizes REINDEX as part of VACUUM syntax. This preprocessor converts the standalone form.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:741-757
Bare Table IN Clause Expansion
SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor expands this shorthand to the subquery form that sqlparser requires.
Transformation:
The pattern avoids matching IN ( which is already a valid subquery.
Sources: llkv-sql/src/sql_engine.rs:844-873
SQLite Trigger Shorthand Expansion
SQLite allows omitting the trigger timing (defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). sqlparser requires both to be explicit. This preprocessor injects the missing clauses.
Transformation:
This is a fallback preprocessor that only runs if initial parsing fails and the SQL contains CREATE TRIGGER. The implementation uses complex regex patterns to handle optional dotted identifiers with various quoting styles.
Sources: llkv-sql/src/sql_engine.rs:759-842 llkv-sql/src/sql_engine.rs:944-957
graph TB
subgraph "SqlEngine Methods"
Execute["execute(sql)"]
Prepare["prepare(sql)"]
PreprocessInput["preprocess_sql_input(sql)"]
end
subgraph "Static Regex Patterns"
CreateTypeRE["CREATE_TYPE_REGEX"]
DropTypeRE["DROP_TYPE_REGEX"]
ExcludeRE["EXCLUDE_REGEX"]
TrailingRE["TRAILING_COMMA_REGEX"]
EmptyInRE["EMPTY_IN_REGEX"]
IndexHintRE["INDEX_HINT_REGEX"]
ReindexRE["REINDEX_REGEX"]
BareTableRE["BARE_TABLE_IN_REGEX"]
TimingRE["TIMING_REGEX"]
ForEachBeginRE["FOR_EACH_BEGIN_REGEX"]
ForEachWhenRE["FOR_EACH_WHEN_REGEX"]
end
subgraph "Preprocessor Methods"
TPC["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
Trailing["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
BareTable["preprocess_bare_table_in_clauses"]
Trigger["preprocess_sqlite_trigger_shorthand"]
end
Execute --> PreprocessInput
Prepare --> PreprocessInput
PreprocessInput --> TPC
PreprocessInput --> CreateType
PreprocessInput --> Exclude
PreprocessInput --> Trailing
PreprocessInput --> BareTable
PreprocessInput --> EmptyIn
PreprocessInput --> IndexHints
PreprocessInput --> Reindex
CreateType -.uses.-> CreateTypeRE
CreateType -.uses.-> DropTypeRE
Exclude -.uses.-> ExcludeRE
Trailing -.uses.-> TrailingRE
EmptyIn -.uses.-> EmptyInRE
IndexHints -.uses.-> IndexHintRE
Reindex -.uses.-> ReindexRE
BareTable -.uses.-> BareTableRE
Trigger -.uses.-> TimingRE
Trigger -.uses.-> ForEachBeginRE
Trigger -.uses.-> ForEachWhenRE
Implementation Architecture
The preprocessing system is implemented using a combination of regex transformations and string manipulation. The following diagram shows the key components:
Diagram: Preprocessing Implementation Components
Sources: llkv-sql/src/sql_engine.rs:640-842 llkv-sql/src/sql_engine.rs:992-1001
Regex Pattern Management
All regex patterns are stored in OnceLock static variables for thread-safe lazy initialization. This ensures patterns are compiled once per process and reused across all preprocessing operations, avoiding the overhead of repeated compilation.
Pattern Initialization Example:
The patterns use case-insensitive matching ((?i)) and word boundaries (\b) to avoid false matches within identifiers or string literals.
Sources: llkv-sql/src/sql_engine.rs:640-650 llkv-sql/src/sql_engine.rs:661-669 llkv-sql/src/sql_engine.rs:682-686
Preprocessing Order
The order of transformations matters because later transformations may depend on earlier ones. The current order:
- TPC-H CONNECT removal - Must happen first to remove non-SQL directives
- CREATE TYPE conversion - Normalizes DDL before other transformations
- EXCLUDE syntax - Handles qualified names in projection lists
- Trailing comma removal - Fixes VALUES clause syntax
- Bare table IN expansion - Converts shorthand to subqueries before empty IN check
- Empty IN handling - Must come after bare table expansion to avoid conflicts
- Index hint stripping - Removes query hints from FROM clauses
- REINDEX conversion - Must be last to avoid interfering with VACUUM statements
Sources: llkv-sql/src/sql_engine.rs:992-1001
Parser Integration
The preprocessed SQL is passed to sqlparser with a custom recursion limit to handle deeply nested queries from test suites:
The default sqlparser recursion limit (50) is insufficient for some SQLite test suite queries, so LLKV uses 200 to balance compatibility with stack safety.
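A hedged sketch of the parser invocation; the dialect choice and exact builder chain are assumptions that may differ from llkv-sql's code and across sqlparser versions, while the limit of 200 comes from the text above.

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

/// Parse preprocessed SQL with a raised recursion limit.
fn parse(sql: &str) -> Result<Vec<sqlparser::ast::Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(200) // default of 50 is too low for some test-suite queries
        .try_with_sql(sql)?
        .parse_statements()
}
```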
Sources: llkv-sql/src/sql_engine.rs:318-324
Testing and Validation
The preprocessing transformations are validated through:
- SQL Logic Tests (SLT) - The llkv-slt-tester runs thousands of SQLite test cases that exercise various dialect features
- TPC-H Benchmarks - The llkv-tpch crate verifies compatibility with TPC-H SQL scripts
- Unit Tests - Individual preprocessor functions are tested in isolation
The preprocessing system is designed to be conservative: it only transforms patterns that are known to cause parser errors, and it preserves the original SQL semantics whenever possible.
Sources: llkv-sql/src/sql_engine.rs:623-1001 llkv-sql/Cargo.toml:1-34
Future Considerations
The preprocessing approach is a pragmatic solution that enables broad dialect compatibility without modifying sqlparser. However, it has limitations:
- Fragile regex patterns - Complex transformations like trigger shorthand expansion use intricate regex that may not handle all edge cases
- Limited context awareness - String-based transformations cannot distinguish between SQL keywords and string literals containing those keywords
- Maintenance burden - Each new dialect feature requires a new preprocessor
The long-term solution is to contribute dialect-specific parsing improvements back to sqlparser, eliminating the need for preprocessing. The trigger shorthand transformation includes a TODO comment noting that proper SQLite dialect support in sqlparser would eliminate that preprocessor entirely.
Sources: llkv-sql/src/sql_engine.rs:765-770
INSERT Buffering System
Relevant source files
The INSERT Buffering System is an optimization layer within llkv-sql that batches multiple consecutive INSERT ... VALUES statements for the same table into a single execution plan. This dramatically reduces planning overhead when bulk-loading data from SQL scripts containing thousands of individual INSERT statements. The system preserves per-statement result semantics while amortizing the cost of plan construction and table access across large batches.
For information about how INSERT plans are structured and executed, see Plan Structures.
Purpose and Design Goals
The buffering system addresses a specific performance bottleneck: SQL scripts generated by database export tools often contain tens of thousands of individual INSERT INTO table VALUES (...) statements. Without buffering, each statement incurs the full cost of parsing, planning, catalog lookup, and MVCC overhead. The buffer accumulates compatible INSERT statements and flushes them as a single batch, achieving order-of-magnitude throughput improvements for bulk ingestion workloads.
Key design constraints:
- Optional : Disabled by default to preserve immediate visibility semantics for unit tests and interactive workloads
- Transparent : Callers receive per-statement results as if each INSERT executed independently
- Safe : Flushes automatically at transaction boundaries, table changes, and buffer size limits
- Compatible : Integrates with statement expectation mechanisms used by the SQL Logic Test harness
Sources: llkv-sql/src/sql_engine.rs:410-520
Architecture Overview
Figure 1: INSERT Buffering Architecture
The system operates as a stateful accumulator within SqlEngine. Incoming INSERT statements are classified as either PreparedInsert::Values (bufferable literals) or PreparedInsert::Immediate (non-bufferable subqueries or expressions). Compatible VALUES inserts accumulate in the buffer until a flush trigger fires, at which point the buffer constructs a single InsertPlan and emits individual RuntimeStatementResult::Insert entries for each original statement.
Sources: llkv-sql/src/sql_engine.rs:416-509
Buffer Data Structures
InsertBuffer
Figure 2: Buffer Data Structure
The InsertBuffer struct maintains five critical pieces of state:
| Field | Type | Purpose |
|---|---|---|
table_name | String | Target table identifier for compatibility checking |
columns | Vec<String> | Column list; must match for batching |
on_conflict | InsertConflictAction | Conflict resolution policy; must match for batching |
total_rows | usize | Sum of all buffered rows across statements |
statement_row_counts | Vec<usize> | Per-statement row counts for result construction |
rows | Vec<Vec<PlanValue>> | Literal row payloads in execution order |
The statement_row_counts vector preserves the boundary between original INSERT statements so that flush_buffer_results() can emit one RuntimeStatementResult::Insert per statement with the correct row count.
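A hedged structural sketch mirroring the table above; PlanValue and InsertConflictAction are stand-in placeholder types here, not the real llkv-plan definitions.

```rust
// Placeholder types standing in for the real llkv-plan definitions.
struct PlanValue;
enum InsertConflictAction { None, Ignore, Replace }

/// Field names and types follow the table above.
struct InsertBufferSketch {
    table_name: String,                 // compatibility key: target table
    columns: Vec<String>,               // compatibility key: column list
    on_conflict: InsertConflictAction,  // compatibility key: conflict policy
    total_rows: usize,                  // sum of all buffered rows
    statement_row_counts: Vec<usize>,   // per-statement boundaries for result emission
    rows: Vec<Vec<PlanValue>>,          // literal payloads in execution order
}
```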
Sources: llkv-sql/src/sql_engine.rs:421-471
PreparedInsert Classification
Figure 3: INSERT Classification Flow
The prepare_insert() method analyzes each INSERT statement and returns PreparedInsert::Values only when the source is a literal VALUES clause or a SELECT that evaluates to constants (e.g., SELECT 1, 'foo'). All other forms—subqueries referencing tables, expressions requiring runtime evaluation, or DEFAULT VALUES—become PreparedInsert::Immediate and bypass buffering.
Sources: llkv-sql/src/sql_engine.rs:473-487
Buffer Lifecycle and Flush Triggers
Flush Conditions
The buffer flushes automatically when any of the following conditions occur:
| Trigger | Constant | Description |
|---|---|---|
| Size limit | MAX_BUFFERED_INSERT_ROWS = 8192 | Total buffered rows exceeds threshold |
| Incompatible INSERT | N/A | Different table, columns, or conflict action |
| Non-INSERT statement | N/A | Any DDL, DML (UPDATE/DELETE), or SELECT |
| Transaction boundary | N/A | BEGIN, COMMIT, or ROLLBACK |
| Statement expectation | StatementExpectation::Error or Count(n) | Test harness expects specific outcome |
| Manual flush | N/A | flush_pending_inserts() called explicitly |
| Engine drop | N/A | SqlEngine destructor invoked |
Sources: llkv-sql/src/sql_engine.rs:414-1127
Buffer State Machine
Figure 4: Buffer State Machine
The buffer exists in one of three states: Empty (no buffer allocated), Buffering (accumulating rows), or Flushing (emitting results). Transitions from Buffering to Flushing occur automatically based on the triggers listed above. After flushing, the state returns to Empty unless a new compatible INSERT immediately follows, in which case a fresh buffer is allocated.
Sources: llkv-sql/src/sql_engine.rs:514-1201
Integration with SqlEngine::execute()
Figure 5: Execute Loop with Buffer Integration
The execute() method iterates through parsed statements, dispatching INSERT statements to buffer_insert() and all other statements to execute_statement() after flushing. This ensures that the buffer never holds rows across non-INSERT operations or transaction boundaries.
Sources: llkv-sql/src/sql_engine.rs:933-990
buffer_insert() Implementation Details
Decision Flow
Figure 6: buffer_insert() Decision Tree
The buffer_insert() method performs three levels of gating:
- Expectation check : If the SLT harness expects an error or specific row count, bypass buffering entirely
- Buffering enabled check : If insert_buffering_enabled is false, execute immediately
- Compatibility check : If the INSERT is incompatible with the current buffer, flush and start a new buffer
Sources: llkv-sql/src/sql_engine.rs:1101-1201
Compatibility Rules
An INSERT can be added to the existing buffer if and only if:
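A minimal sketch of that rule, with the concrete conflict-action type abstracted behind a PartialEq bound (the real check is InsertBuffer::can_accept):

```rust
/// Returns true when an incoming INSERT may join the current buffer.
/// Each tuple is (table_name, column_list, conflict_action).
fn can_accept<C: PartialEq>(buffer: (&str, &[String], &C), incoming: (&str, &[String], &C)) -> bool {
    buffer.0 == incoming.0        // same target table
        && buffer.1 == incoming.1 // identical column names, in the same order
        && buffer.2 == incoming.2 // identical conflict resolution policy
}
```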
This ensures that all buffered statements can be collapsed into a single InsertPlan with uniform semantics. Different column orderings, conflict actions, or target tables require separate batches.
Sources: llkv-sql/src/sql_engine.rs:452-459
Statement Expectation Handling
The SQL Logic Test harness uses thread-local expectations to signal that a specific statement should produce an error or affect a precise number of rows. The buffering system respects these hints by forcing immediate execution when expectations are present:
Figure 7: Statement Expectation Flow
graph TB
SLTHarness["SLT Harness"]
RegisterExpectation["register_statement_expectation()"]
ThreadLocal["PENDING_STATEMENT_EXPECTATIONS\nthread_local!"]
Execute["SqlEngine::execute()"]
NextExpectation["next_statement_expectation()"]
BufferInsert["buffer_insert()"]
SLTHarness -->|before statement| RegisterExpectation
RegisterExpectation --> ThreadLocal
Execute --> NextExpectation
NextExpectation --> ThreadLocal
NextExpectation --> BufferInsert
BufferInsert -->|Error or Count| ImmediateExec["Execute immediately\nbypass buffer"]
BufferInsert -->|Ok| MayBuffer["May buffer if enabled"]
When next_statement_expectation() returns StatementExpectation::Error or StatementExpectation::Count(n), the buffer_insert() method sets execute_immediately = true and flushes any existing buffer before executing the current statement. This preserves test correctness while still allowing buffering for the majority of statements that have no expectations.
Sources: llkv-sql/src/sql_engine.rs:64-1127
sequenceDiagram
participant Caller
participant FlushBuffer as "flush_buffer_results()"
participant Buffer as "InsertBuffer"
participant PlanStmt as "PlanStatement::Insert"
participant Runtime as "RuntimeEngine"
Caller->>FlushBuffer: flush_buffer_results()
FlushBuffer->>Buffer: Take buffer from RefCell
alt Buffer is None
FlushBuffer-->>Caller: Ok(Vec::new())
else Buffer has data
FlushBuffer->>PlanStmt: Construct InsertPlan\n(table, columns, rows, on_conflict)
FlushBuffer->>Runtime: execute_statement(plan)
Runtime-->>FlushBuffer: RuntimeStatementResult::Insert\n(total_rows_inserted)
Note over FlushBuffer: Verify total_rows matches sum(statement_row_counts)
loop For each statement_row_count
FlushBuffer->>FlushBuffer: Create RuntimeStatementResult::Insert\n(statement_rows)
end
FlushBuffer-->>Caller: Vec<SqlStatementResult>
end
flush_buffer_results() Mechanics
The flush operation reconstructs per-statement results from the accumulated buffer state:
Figure 8: Flush Sequence
The flush process:
- Takes ownership of the buffer from the RefCell
- Constructs a single InsertPlan with all buffered rows
- Executes the plan via the runtime
- Splits the total row count across the original statements using statement_row_counts
- Returns a vector of per-statement results
This ensures that callers receive results as if each INSERT executed independently, even though the runtime processed them as a single batch.
Sources: llkv-sql/src/sql_engine.rs:2094-2169 (Note: The flush implementation is in the broader file, exact line range may vary)
Performance Characteristics
Throughput Improvement
Buffering provides dramatic performance gains for bulk INSERT workloads:
| Scenario | Without Buffering | With Buffering | Speedup |
|---|---|---|---|
| 10,000 single-row INSERTs | ~30 seconds | ~2 seconds | ~15x |
| 1,000 ten-row INSERTs | ~5 seconds | ~0.5 seconds | ~10x |
| 100,000 single-row INSERTs | Several minutes | ~15 seconds | >10x |
The improvement stems from:
- Amortized planning : One plan for 8,192 rows instead of 8,192 plans
- Batch MVCC overhead : Single transaction coordinator call instead of thousands
- Reduced catalog lookups : One schema resolution instead of per-statement lookups
- Vectorized column operations : Arrow batch processing instead of row-by-row appends
Sources: llkv-sql/README.md:36-41
Memory Usage
The buffer is bounded at MAX_BUFFERED_INSERT_ROWS = 8192 rows. Peak memory usage depends on the row width:
Peak Memory = MAX_BUFFERED_INSERT_ROWS × (Σ column_size + MVCC_overhead)
For a typical table with 10 columns averaging 50 bytes each:
8,192 rows × (10 columns × 50 bytes + 24 bytes MVCC) ≈ 4.3 MB
This predictable ceiling makes buffering safe for long-running workloads without risking unbounded memory growth.
Sources: llkv-sql/src/sql_engine.rs:410-414
API Surface
Enabling and Disabling
The set_insert_buffering(false) call automatically flushes any pending rows before disabling, ensuring visibility guarantees.
Sources: llkv-sql/src/sql_engine.rs:887-905
Manual Flush
Manual flushes are useful when the caller needs to checkpoint progress or ensure specific INSERT statements are visible before proceeding.
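A hedged usage sketch; the flush_pending_inserts() signature is an assumption, while the method name appears in the flush-trigger table above.

```rust
use llkv_sql::SqlEngine; // assumed import path

/// Checkpoint buffered INSERTs so they become visible to subsequent queries.
fn checkpoint(engine: &SqlEngine) -> Result<(), Box<dyn std::error::Error>> {
    engine.flush_pending_inserts()?;
    Ok(())
}
```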
Sources: llkv-sql/src/sql_engine.rs:1003-1010
Drop Hook
The SqlEngine destructor automatically flushes the buffer to prevent data loss:
This ensures that buffered rows are persisted even if the caller forgets to flush explicitly.
Sources: llkv-sql/src/sql_engine.rs:513-520
Limitations and Edge Cases
Non-Bufferable INSERT Forms
The following INSERT patterns always execute immediately:
- INSERT ... SELECT with table references
- INSERT ... DEFAULT VALUES
- INSERT with expressions requiring runtime evaluation (e.g., NOW(), RANDOM())
- INSERT with parameters or placeholders
These patterns cannot be safely batched because their semantics depend on execution context.
Transaction Isolation
The buffer flushes at transaction boundaries (BEGIN, COMMIT, ROLLBACK) to preserve isolation semantics. This means:
The first INSERT's visibility is not guaranteed until the BEGIN statement forces a flush.
Conflict Handling
All buffered statements must share the same InsertConflictAction. Mixing ON CONFLICT IGNORE and ON CONFLICT REPLACE requires separate batches:
Sources: llkv-sql/src/sql_engine.rs:452-1201
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Query Planning
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/README.md
- llkv-executor/src/lib.rs
- llkv-plan/README.md
- llkv-plan/src/plans.rs
- llkv-sql/README.md
- llkv-sql/src/sql_engine.rs
Purpose and Scope
Query planning is the layer that translates parsed SQL statements into strongly-typed plan structures that can be executed by the runtime engine. The llkv-plan crate defines these plan types and provides utilities for representing queries, expressions, and subquery relationships in a form that execution layers can consume without re-parsing SQL.
This page covers the plan structures themselves and how they are constructed from SQL input. For information about how expressions within plans are evaluated, see Expression System. For details on subquery correlation tracking and placeholder generation, see Subquery and Correlation Handling. For execution of these plans, see Query Execution.
Plan Structures Overview
The planning layer defines distinct plan types for each category of SQL statement. All plan types are defined in llkv-plan/src/plans.rs and flow through the PlanStatement enum for execution dispatch.
Core Plan Types
| Plan Type | Purpose | Key Fields |
|---|---|---|
SelectPlan | Query execution | tables, projections, filter, joins, aggregates, order_by |
InsertPlan | Row insertion | table, columns, source, on_conflict |
UpdatePlan | Row updates | table, assignments, filter |
DeletePlan | Row deletion | table, filter |
CreateTablePlan | Table creation | name, columns, source, foreign_keys |
CreateIndexPlan | Index creation | table, columns, unique |
CreateViewPlan | View creation | name, view_definition, select_plan |
Sources: llkv-plan/src/plans.rs:177-256 llkv-plan/src/plans.rs:640-655 llkv-plan/src/plans.rs:662-667 llkv-plan/src/plans.rs:687-692
SelectPlan Structure
Diagram: SelectPlan Component Structure
The SelectPlan struct at llkv-plan/src/plans.rs:800-825 contains all information needed to execute a SELECT query. It separates table references, join specifications, projections, filters, aggregations, and ordering to allow execution layers to optimize each phase independently.
Sources: llkv-plan/src/plans.rs:27-67 llkv-plan/src/plans.rs:794-825
SQL-to-Plan Translation Pipeline
The translation from SQL text to plan structures occurs in SqlEngine within the llkv-sql crate. The process involves multiple stages to handle dialect differences and build strongly-typed plans.
Diagram: SQL-to-Plan Translation Flow
sequenceDiagram
participant User
participant SqlEngine as "SqlEngine"
participant Preprocessor as "SQL Preprocessing"
participant Parser as "sqlparser::Parser"
participant Translator as "Plan Translator"
participant Runtime as "RuntimeEngine"
User->>SqlEngine: execute(sql_text)
SqlEngine->>Preprocessor: preprocess_sql_input()
Note over Preprocessor: Strip CONNECT TO\nNormalize CREATE TYPE\nFix EXCLUDE syntax\nExpand IN clauses
Preprocessor-->>SqlEngine: processed_sql
SqlEngine->>Parser: Parser::parse_sql()
Parser-->>SqlEngine: Vec<Statement> (AST)
loop "For each Statement"
SqlEngine->>Translator: translate_statement()
alt "INSERT statement"
Translator->>Translator: translate_insert()
Note over Translator: Parse VALUES/SELECT\nNormalize conflict action\nBuild PreparedInsert
Translator-->>SqlEngine: PreparedInsert
SqlEngine->>SqlEngine: buffer_insert()\nor flush immediately
else "SELECT statement"
Translator->>Translator: translate_select()
Note over Translator: Build SelectPlan\nTranslate expressions\nTrack subqueries
Translator-->>SqlEngine: SelectPlan
else "UPDATE/DELETE"
Translator->>Translator: translate_update()/delete()
Translator-->>SqlEngine: UpdatePlan/DeletePlan
else "DDL statement"
Translator->>Translator: translate_create_table()\ncreate_index(), etc.
Translator-->>SqlEngine: CreateTablePlan/etc.
end
SqlEngine->>Runtime: execute_statement(plan)
Runtime-->>SqlEngine: RuntimeStatementResult
end
SqlEngine-->>User: Vec<RuntimeStatementResult>
Sources: llkv-sql/src/sql_engine.rs:933-958 llkv-sql/src/sql_engine.rs:628-757
Statement Translation Functions
The SqlEngine contains dedicated translation methods for each statement type:
| sqlparser AST | Translation Method | Output Plan | Location |
|---|---|---|---|
Statement::Query | translate_select() | SelectPlan | llkv-sql/src/sql_engine.rs:2162-2578 |
Statement::Insert | translate_insert() | InsertPlan | llkv-sql/src/sql_engine.rs:3194-3423 |
Statement::Update | translate_update() | UpdatePlan | llkv-sql/src/sql_engine.rs:3560-3704 |
Statement::Delete | translate_delete() | DeletePlan | llkv-sql/src/sql_engine.rs:3706-3783 |
Statement::CreateTable | translate_create_table() | CreateTablePlan | llkv-sql/src/sql_engine.rs:4081-4465 |
Statement::CreateIndex | translate_create_index() | CreateIndexPlan | llkv-sql/src/sql_engine.rs:4575-4766 |
Sources: llkv-sql/src/sql_engine.rs:974-1067
SELECT Translation Details
The translate_select() method at llkv-sql/src/sql_engine.rs:2162 performs the following operations:
- Extract table references from FROM clause into Vec<TableRef>
- Parse join specifications into Vec<JoinMetadata> structures
- Translate WHERE clause to Expr<String> and discover correlated subqueries
- Process projections into Vec<SelectProjection> with computed expressions
- Handle aggregates by extracting AggregateExpr from projections and HAVING
- Translate GROUP BY clause to canonical column names
- Process ORDER BY into Vec<OrderByPlan> with sort specifications
- Handle compound queries (UNION/INTERSECT/EXCEPT) via CompoundSelectPlan
Sources: llkv-sql/src/sql_engine.rs:2162-2578
Expression Representation in Plans
Plans use two forms of expressions from the llkv-expr crate:
- Expr<String>: Boolean predicates using unresolved column names (as strings)
- ScalarExpr<String>: Scalar expressions (also with string column references)
graph LR
SQL["SQL: WHERE age > 18"]
AST["sqlparser AST\nBinaryExpr"]
ExprString["Expr<String>\nCompare(Column('age'), Gt, Literal(18))"]
ExprFieldId["Expr<FieldId>\nCompare(Column(field_7), Gt, Literal(18))"]
Bytecode["EvalProgram\nStack-based bytecode"]
SQL --> AST
AST --> ExprString
ExprString --> ExprFieldId
ExprFieldId --> Bytecode
ExprString -.stored in plan.-> SelectPlan
ExprFieldId -.resolved at execution.-> Executor
Bytecode -.compiled for evaluation.-> Table
These string-based expressions are later resolved to Expr<FieldId> and ScalarExpr<FieldId> during execution when the catalog provides field mappings. This two-stage approach separates planning from schema resolution.
Diagram: Expression Evolution Through Planning and Execution
The translation from SQL expressions to Expr<String> happens in llkv-sql/src/sql_engine.rs:1647-1947. The resolution to Expr<FieldId> occurs in the executor's translate_predicate() function at llkv-executor/src/translation/predicate.rs
Sources: llkv-expr/src/expr.rs llkv-sql/src/sql_engine.rs:1647-1947 llkv-plan/src/plans.rs:28-34
Join Planning
Join specifications are represented in two components:
JoinMetadata Structure
The JoinMetadata struct at llkv-plan/src/plans.rs:781-792 captures a single join between consecutive tables:
- left_table_index: Index into SelectPlan.tables vector for the left table
- join_type: One of Inner, Left, Right, or Full
- on_condition: Optional ON clause filter expression
JoinPlan Types
The JoinPlan enum at llkv-plan/src/plans.rs:763-773 defines supported join semantics:
Diagram: JoinPlan Variants
The executor converts JoinPlan to llkv_join::JoinType during execution. When SelectPlan.joins is empty but multiple tables exist, the executor performs a Cartesian product (cross join).
Sources: llkv-plan/src/plans.rs:758-792 llkv-executor/src/lib.rs:542-554
Aggregation Planning
Aggregates are represented through the AggregateExpr structure defined at llkv-plan/src/plans.rs:1025-1102:
Aggregate Function Types
Diagram: AggregateFunction Variants
GROUP BY Handling
When a SELECT contains a GROUP BY clause:
- Column names from GROUP BY are stored in SelectPlan.group_by as canonical strings
- Aggregate expressions are collected in SelectPlan.aggregates
- Non-aggregate projections must reference GROUP BY columns
- HAVING clause (if present) is stored in SelectPlan.having as Expr<String>
The executor groups rows based on group_by columns, evaluates aggregates within each group, and applies the HAVING filter to group results.
Sources: llkv-plan/src/plans.rs:1025-1102 llkv-executor/src/lib.rs:1185-1597
Subquery Representation
Subqueries appear in two contexts within plans:
Filter Subqueries
FilterSubquery at llkv-plan/src/plans.rs:36-45 represents correlated subqueries used in WHERE/HAVING predicates via Expr::Exists:
- id: Unique identifier matching Expr::Exists(SubqueryId)
- plan: Nested SelectPlan for the subquery
- correlated_columns: Mappings from placeholder names to outer query columns
Scalar Subqueries
ScalarSubquery at llkv-plan/src/plans.rs:48-56 represents subqueries that produce single values in projections via ScalarExpr::ScalarSubquery:
Correlated Column Tracking
The CorrelatedColumn struct at llkv-plan/src/plans.rs:59-67 describes how outer columns are bound into inner subqueries:
During execution, the executor substitutes placeholder references with actual values from the outer query's current row.
Sources: llkv-plan/src/plans.rs:36-67 llkv-sql/src/sql_engine.rs:1980-2124
Plan Value Types
The PlanValue enum at llkv-plan/src/plans.rs:73-83 represents literal values within plans:
These values appear in:
- INSERT literal rows (InsertPlan with InsertSource::Rows)
- UPDATE assignments (AssignmentValue::Literal)
- Computed constant expressions
The executor converts PlanValue instances to Arrow arrays via plan_values_to_arrow_array() at llkv-executor/src/lib.rs:302-410
Sources: llkv-plan/src/plans.rs:73-161 llkv-executor/src/lib.rs:302-410
Plan Execution Interface
Plans flow to the runtime through the PlanStatement enum:
Diagram: Plan Execution Dispatch Flow
The RuntimeEngine::execute_statement() method dispatches each plan variant to the appropriate handler:
- SELECT: Passed to QueryExecutor for streaming execution
- INSERT/UPDATE/DELETE: Applied via Table with MVCC tracking
- DDL: Processed by CatalogManager to modify schema metadata
Sources: llkv-runtime/src/statements.rs llkv-sql/src/sql_engine.rs:587-609 llkv-executor/src/lib.rs:523-569
Compound Query Planning
Set operations (UNION, INTERSECT, EXCEPT) are represented through CompoundSelectPlan at llkv-plan/src/plans.rs:969-996:
- CompoundOperator: Union, Intersect, or Except
- CompoundQuantifier: Distinct (deduplicate) or All (keep duplicates)
The executor evaluates the initial plan, then applies each operation sequentially, combining results according to set semantics. Deduplication for DISTINCT quantifiers uses hash-based row encoding.
Sources: llkv-plan/src/plans.rs:946-996 llkv-executor/src/lib.rs:590-686
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Plan Structures
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/README.md
- llkv-executor/src/lib.rs
- llkv-plan/README.md
- llkv-plan/src/plans.rs
- llkv-sql/README.md
Purpose and Scope
Plan structures are strongly-typed representations of SQL statements that bridge the SQL parsing layer and the execution layer. Defined in the llkv-plan crate, these structures capture the logical intent of SQL operations without retaining parser-specific AST details. The planner translates sqlparser ASTs into plan instances, which the runtime then dispatches to execution engines.
This page documents the structure and organization of plan types. For information about how correlated subqueries and scalar subqueries are represented and tracked, see Subquery and Correlation Handling.
Sources: llkv-plan/src/plans.rs:1-10 llkv-plan/README.md:10-16
Plan Type Hierarchy
LLKV organizes plan structures into three primary categories based on their SQL statement class:
Diagram: Plan Type Organization
graph TB
Plans["Plan Structures\n(llkv-plan)"]
DDL["DDL Plans\nSchema Operations"]
DML["DML Plans\nData Modifications"]
Query["Query Plans\nData Retrieval"]
Plans --> DDL
Plans --> DML
Plans --> Query
DDL --> CreateTablePlan
DDL --> DropTablePlan
DDL --> CreateViewPlan
DDL --> DropViewPlan
DDL --> CreateIndexPlan
DDL --> DropIndexPlan
DDL --> ReindexPlan
DDL --> AlterTablePlan
DDL --> RenameTablePlan
DML --> InsertPlan
DML --> UpdatePlan
DML --> DeletePlan
DML --> TruncatePlan
Query --> SelectPlan
Query --> CompoundSelectPlan
SelectPlan --> TableRef
SelectPlan --> JoinMetadata
SelectPlan --> SelectProjection
SelectPlan --> SelectFilter
SelectPlan --> AggregateExpr
SelectPlan --> OrderByPlan
Plans are consumed by llkv-runtime for execution orchestration and by llkv-executor for query evaluation. Each plan type encodes the necessary metadata for its corresponding operation without requiring re-parsing or runtime AST traversal.
Sources: llkv-plan/src/plans.rs:163-358 llkv-plan/src/plans.rs:620-703 llkv-plan/src/plans.rs:794-1023
SelectPlan Structure
SelectPlan represents SELECT queries and is the most complex plan type. It aggregates multiple sub-components to describe table references, join relationships, projections, filters, aggregates, and ordering.
Diagram: SelectPlan Component Structure
graph TB
SelectPlan["SelectPlan\nllkv-plan/src/plans.rs:801"]
subgraph "Table Sources"
Tables["tables: Vec<TableRef>"]
TableRef["TableRef\nschema, table, alias"]
end
subgraph "Join Specification"
Joins["joins: Vec<JoinMetadata>"]
JoinMetadata["JoinMetadata\nleft_table_index\njoin_type\non_condition"]
JoinPlan["JoinPlan\nInner/Left/Right/Full"]
end
subgraph "Projections"
Projections["projections:\nVec<SelectProjection>"]
AllColumns["AllColumns"]
AllColumnsExcept["AllColumnsExcept"]
Column["Column\nname, alias"]
Computed["Computed\nexpr, alias"]
end
subgraph "Filtering"
Filter["filter: Option<SelectFilter>"]
SelectFilter["SelectFilter\npredicate\nsubqueries"]
FilterSubquery["FilterSubquery\nid, plan,\ncorrelated_columns"]
end
subgraph "Aggregation"
Aggregates["aggregates:\nVec<AggregateExpr>"]
GroupBy["group_by: Vec<String>"]
Having["having:\nOption<Expr>"]
end
subgraph "Ordering & Modifiers"
OrderBy["order_by:\nVec<OrderByPlan>"]
Distinct["distinct: bool"]
end
subgraph "Compound Operations"
Compound["compound:\nOption<CompoundSelectPlan>"]
CompoundOps["Union/Intersect/Except\nDistinct/All"]
end
SelectPlan --> Tables
SelectPlan --> Joins
SelectPlan --> Projections
SelectPlan --> Filter
SelectPlan --> Aggregates
SelectPlan --> OrderBy
SelectPlan --> Compound
Tables --> TableRef
Joins --> JoinMetadata
JoinMetadata --> JoinPlan
Projections --> AllColumns
Projections --> AllColumnsExcept
Projections --> Column
Projections --> Computed
Filter --> SelectFilter
SelectFilter --> FilterSubquery
Aggregates --> GroupBy
Aggregates --> Having
Compound --> CompoundOps
Sources: llkv-plan/src/plans.rs:794-944
TableRef - Table References
TableRef represents a table source in the FROM clause, with optional aliasing:
| Field | Type | Description |
|---|---|---|
schema | String | Schema/namespace identifier (empty for default) |
table | String | Table name |
alias | Option<String> | Optional alias for qualified name resolution |
The display_name() method returns the alias if present, otherwise the qualified name. This enables consistent column name resolution during expression translation.
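A sketch of that resolution rule, using the fields from the table above (the struct mirrors the documented fields; the method body is an assumption, not the crate's code):
```rust
// Illustrative sketch of TableRef::display_name() as described above.
pub struct TableRef {
    pub schema: String,
    pub table: String,
    pub alias: Option<String>,
}

impl TableRef {
    pub fn display_name(&self) -> String {
        match &self.alias {
            Some(alias) => alias.clone(),
            None if self.schema.is_empty() => self.table.clone(),
            None => format!("{}.{}", self.schema, self.table),
        }
    }
}
```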
Sources: llkv-plan/src/plans.rs:708-752
JoinMetadata - Join Specification
JoinMetadata describes how adjacent tables in the tables vector are connected. Each entry links tables[left_table_index] with tables[left_table_index + 1]:
| Field | Type | Description |
|---|---|---|
left_table_index | usize | Index into SelectPlan.tables |
join_type | JoinPlan | Inner, Left, Right, or Full |
on_condition | Option<Expr<String>> | Optional ON clause predicate |
The JoinPlan enum mirrors llkv_join::JoinType but exists in the plan layer to avoid circular dependencies.
Sources: llkv-plan/src/plans.rs:758-792
SelectProjection - Projection Variants
SelectProjection specifies which columns appear in the result set:
| Variant | Fields | Description |
|---|---|---|
AllColumns | - | SELECT * (all columns from all tables) |
AllColumnsExcept | exclude: Vec<String> | SELECT * EXCEPT (col1, col2, ...) |
Column | name: String, alias: Option<String> | Named column with optional alias |
Computed | expr: ScalarExpr<String>, alias: String | Computed expression (e.g., col1 + col2 AS sum) |
The executor translates these into ScanProjection instances that specify which columns to fetch from storage.
Sources: llkv-plan/src/plans.rs:998-1013
AggregateExpr - Aggregate Functions
AggregateExpr describes aggregate function calls in SELECT or HAVING clauses:
Diagram: AggregateExpr Variants
graph LR
AggregateExpr["AggregateExpr"]
CountStar["CountStar\nalias, distinct"]
Column["Column\ncolumn, alias,\nfunction, distinct"]
Functions["AggregateFunction"]
Count["Count"]
SumInt64["SumInt64"]
TotalInt64["TotalInt64"]
MinInt64["MinInt64"]
MaxInt64["MaxInt64"]
CountNulls["CountNulls"]
GroupConcat["GroupConcat"]
AggregateExpr --> CountStar
AggregateExpr --> Column
Column --> Functions
Functions --> Count
Functions --> SumInt64
Functions --> TotalInt64
Functions --> MinInt64
Functions --> MaxInt64
Functions --> CountNulls
Functions --> GroupConcat
The executor delegates to llkv-aggregate for accumulator-based evaluation.
Sources: llkv-plan/src/plans.rs:1028-1120
OrderByPlan - Sort Specification
OrderByPlan defines ORDER BY clause semantics:
| Field | Type | Description |
|---|---|---|
target | OrderTarget | Column name, projection index, or All |
sort_type | OrderSortType | Native or CastTextToInteger |
ascending | bool | Sort direction (ASC/DESC) |
nulls_first | bool | NULL placement (NULLS FIRST/LAST) |
OrderTarget variants:
- Column(String) - Sort by named column
- Index(usize) - Sort by projection position (1-based in SQL)
- All - Specialized SQLite behavior for sorting all columns
Sources: llkv-plan/src/plans.rs:1195-1217
CompoundSelectPlan - Set Operations
CompoundSelectPlan represents UNION, INTERSECT, and EXCEPT operations:
| Field | Type | Description |
|---|---|---|
initial | Box<SelectPlan> | First SELECT in the compound |
operations | Vec<CompoundSelectComponent> | Subsequent set operations |
Each CompoundSelectComponent contains:
- operator: CompoundOperator (Union, Intersect, Except)
- quantifier: CompoundQuantifier (Distinct, All)
- plan: SelectPlan for the right-hand side
The executor processes these sequentially, maintaining distinct caches for DISTINCT quantifiers.
Sources: llkv-plan/src/plans.rs:946-996
InsertPlan Structure
InsertPlan encapsulates data insertion operations with conflict resolution strategies:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
columns | Vec<String> | Column names (empty means all columns) |
source | InsertSource | Data source (rows, batches, or SELECT) |
on_conflict | InsertConflictAction | Conflict resolution strategy |
InsertSource Variants
| Variant | Description |
|---|---|
Rows(Vec<Vec<PlanValue>>) | Explicit value rows from INSERT VALUES |
Batches(Vec<RecordBatch>) | Pre-materialized Arrow batches |
Select { plan: Box<SelectPlan> } | INSERT INTO ... SELECT ... |
InsertConflictAction Variants
SQLite-compatible conflict resolution actions:
| Variant | Behavior |
|---|---|
None | Standard behavior - fail on constraint violation |
Replace | UPDATE existing row on conflict (INSERT OR REPLACE) |
Ignore | Skip conflicting rows (INSERT OR IGNORE) |
Abort | Abort transaction on conflict |
Fail | Fail statement without rollback |
Rollback | Rollback entire transaction |
Sources: llkv-plan/src/plans.rs:620-655
UpdatePlan and DeletePlan
UpdatePlan
UpdatePlan specifies row updates with optional filtering:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
assignments | Vec<ColumnAssignment> | Column updates |
filter | Option<Expr<String>> | WHERE clause predicate |
Each ColumnAssignment contains:
- column: Target column name
- value: AssignmentValue (literal or expression)
AssignmentValue variants:
- Literal(PlanValue) - Static value (e.g., SET col = 42)
- Expression(ScalarExpr<String>) - Computed value (e.g., SET col = col + 1)
Sources: llkv-plan/src/plans.rs:661-682
DeletePlan
DeletePlan specifies row deletions:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
filter | Option<Expr<String>> | WHERE clause predicate |
A missing filter indicates DELETE FROM table (deletes all rows).
Sources: llkv-plan/src/plans.rs:687-692
TruncatePlan
TruncatePlan represents TRUNCATE TABLE (removes all rows, resets sequences):
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
Sources: llkv-plan/src/plans.rs:698-702
DDL Plan Structures
CreateTablePlan
CreateTablePlan defines table creation with schema, constraints, and data sources:
| Field | Type | Description |
|---|---|---|
name | String | Table name |
if_not_exists | bool | Skip if table exists |
or_replace | bool | Replace existing table |
columns | Vec<PlanColumnSpec> | Column definitions |
source | Option<CreateTableSource> | Optional CREATE TABLE AS data |
namespace | Option<String> | Storage namespace (e.g., "temp") |
foreign_keys | Vec<ForeignKeySpec> | Foreign key constraints |
multi_column_uniques | Vec<MultiColumnUniqueSpec> | Multi-column UNIQUE constraints |
Sources: llkv-plan/src/plans.rs:176-203
PlanColumnSpec
PlanColumnSpec describes individual column metadata:
| Field | Type | Description |
|---|---|---|
name | String | Column name |
data_type | DataType | Arrow data type |
nullable | bool | NULL allowed |
primary_key | bool | PRIMARY KEY constraint |
unique | bool | UNIQUE constraint |
check_expr | Option<String> | CHECK constraint SQL expression |
The IntoPlanColumnSpec trait enables ergonomic column specification using tuples like ("col_name", DataType::Int64, NotNull).
Sources: llkv-plan/src/plans.rs:499-605
CreateIndexPlan
CreateIndexPlan specifies index creation:
| Field | Type | Description |
|---|---|---|
name | Option<String> | Index name (auto-generated if None) |
table | String | Target table |
unique | bool | UNIQUE index constraint |
if_not_exists | bool | Skip if index exists |
columns | Vec<IndexColumnPlan> | Indexed columns with sort order |
Each IndexColumnPlan specifies:
- name: Column name
- ascending: Sort direction (ASC/DESC)
- nulls_first: NULL placement
Sources: llkv-plan/src/plans.rs:433-497
AlterTablePlan
AlterTablePlan represents ALTER TABLE operations:
| Field | Type | Description |
|---|---|---|
table_name | String | Target table |
if_exists | bool | Skip if table missing |
operation | AlterTableOperation | Specific operation |
AlterTableOperation variants:
| Variant | Fields | Description |
|---|---|---|
RenameColumn | old_column_name: String, new_column_name: String | RENAME COLUMN |
SetColumnDataType | column_name: String, new_data_type: String | ALTER COLUMN SET DATA TYPE |
DropColumn | column_name: String, if_exists: bool, cascade: bool | DROP COLUMN |
Sources: llkv-plan/src/plans.rs:364-406
Additional DDL Plans
| Plan Type | Purpose | Key Fields |
|---|---|---|
DropTablePlan | DROP TABLE | name, if_exists |
CreateViewPlan | CREATE VIEW | name, view_definition, select_plan |
DropViewPlan | DROP VIEW | name, if_exists |
RenameTablePlan | RENAME TABLE | current_name, new_name, if_exists |
DropIndexPlan | DROP INDEX | name, canonical_name, if_exists |
ReindexPlan | REINDEX | name, canonical_name |
Sources: llkv-plan/src/plans.rs:209-358
PlanValue - Value Representation
PlanValue provides a type-safe representation of literal values in plans, bridging SQL literals and Arrow arrays:
| Variant | Description |
|---|---|
Null | SQL NULL |
Integer(i64) | Integer value (booleans stored as 0/1) |
Float(f64) | Floating-point value |
Decimal(DecimalValue) | Fixed-precision decimal |
String(String) | Text value |
Date32(i32) | Date (days since epoch) |
Struct(FxHashMap<String, PlanValue>) | Nested struct value |
Interval(IntervalValue) | Interval (months, days, nanos) |
PlanValue implements From<T> for common types (i64, f64, String, bool) for ergonomic plan construction. The plan_value_from_literal() function converts llkv_expr::Literal to PlanValue, and plan_value_from_array() extracts values from Arrow arrays during INSERT SELECT operations.
Sources: llkv-plan/src/plans.rs:73-161 llkv-plan/src/plans.rs:1122-1189
Plan Translation Flow
Diagram: Plan Translation and Execution Flow
Plans serve as the interface contract between the SQL layer (llkv-sql) and the execution layer (llkv-runtime, llkv-executor). The translation layer in llkv-sql converts sqlparser AST nodes into strongly-typed plan structures, which the runtime validates and dispatches to appropriate executors.
Sources: llkv-plan/README.md:13-33 llkv-executor/README.md:12-31
Plan Construction Patterns
Builder Pattern for SelectPlan
SelectPlan uses fluent builder methods for incremental construction:
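A hedged sketch of what fluent construction could look like; the builder method names below are illustrative assumptions rather than the crate's exact API:
```rust
// Hypothetical builder-style construction of a SelectPlan.
let plan = SelectPlan::new("orders")
    .with_projection(SelectProjection::AllColumns)
    .with_filter(where_clause)   // an Expr<String> predicate
    .with_order_by(order_spec);  // an OrderByPlan sort specification
```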
Sources: llkv-plan/src/plans.rs:827-943
Tuple-Based Column Specs
PlanColumnSpec implements IntoPlanColumnSpec for tuples, enabling concise table definitions:
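A sketch of the tuple form; the documented shape is ("col_name", DataType::Int64, NotNull), while the conversion call shown here is an assumed name:
```rust
use arrow::datatypes::DataType;

// Illustrative only: tuples in the documented ("name", type, constraint) shape
// converted into PlanColumnSpec values via the IntoPlanColumnSpec trait.
let columns: Vec<PlanColumnSpec> = vec![
    ("id", DataType::Int64, NotNull).into_plan_column_spec(),
    ("name", DataType::Utf8, NotNull).into_plan_column_spec(),
];
```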
Sources: llkv-plan/src/plans.rs:548-605
Integration with Expression System
Plans reference expressions from llkv-expr using parameterized types:
Expr<String>: Boolean predicates with column names (WHERE, HAVING, ON clauses)ScalarExpr<String>: Scalar expressions with column names (projections, assignments)
The executor translates these to Expr<FieldId> and ScalarExpr<FieldId> after resolving column names against table schemas. For details on expression evaluation, see Expression AST and Expression Translation.
Sources: llkv-plan/src/plans.rs:28-34 llkv-plan/src/plans.rs:666-674
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Subquery and Correlation Handling
Relevant source files
This page documents how LLKV handles subqueries and correlated column references during query planning and execution. Subqueries can appear in WHERE clauses (as EXISTS predicates) or in SELECT projections (as scalar subqueries). When a subquery references columns from its outer query, it is called a correlated subquery , requiring special handling to bind outer row values during execution.
For information about expression evaluation and compilation, see Expression System. For query execution flow, see Query Execution.
Purpose and Scope
Subquery handling in LLKV involves three distinct phases:
- Detection and Tracking - During SQL translation, the planner identifies subqueries and tracks which outer columns they reference
- Placeholder Injection - Correlated columns are replaced with synthetic placeholder identifiers in the subquery's expression tree
- Binding and Execution - At runtime, for each outer row, placeholders are replaced with actual values and the subquery is executed
This document covers the data structures, algorithms, and execution flow for both filter subqueries (EXISTS/NOT EXISTS) and scalar subqueries (single-value returns).
Core Data Structures
LLKV represents subqueries and their correlation metadata through several interconnected structures defined in llkv-plan.
classDiagram
class SelectFilter {+Expr~String~ predicate\n+Vec~FilterSubquery~ subqueries}
class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class ScalarSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
class SelectPlan {+Vec~TableRef~ tables\n+Option~SelectFilter~ filter\n+Vec~ScalarSubquery~ scalar_subqueries\n+Vec~SelectProjection~ projections}
SelectPlan --> SelectFilter : filter
SelectPlan --> ScalarSubquery : scalar_subqueries
SelectFilter --> FilterSubquery : subqueries
FilterSubquery --> CorrelatedColumn : correlated_columns
ScalarSubquery --> CorrelatedColumn : correlated_columns
FilterSubquery --> SelectPlan : plan
ScalarSubquery --> SelectPlan : plan
Subquery Plan Structures
Sources: llkv-plan/src/plans.rs:28-67
Field Descriptions
| Structure | Field | Purpose |
|---|---|---|
FilterSubquery | id | Unique identifier used to match subquery in expression tree |
FilterSubquery | plan | Nested SELECT plan to execute for each outer row |
FilterSubquery | correlated_columns | Mappings from placeholder to real outer column |
ScalarSubquery | id | Unique identifier for scalar subquery references |
ScalarSubquery | plan | SELECT plan that must return single column/row |
CorrelatedColumn | placeholder | Synthetic column name injected into subquery expressions |
CorrelatedColumn | column | Canonical outer column name |
CorrelatedColumn | field_path | Nested field access path for struct columns |
Sources: llkv-plan/src/plans.rs:36-67
Correlation Tracking During Planning
During SQL-to-plan translation, LLKV uses the SubqueryCorrelatedTracker to detect when a subquery references outer columns. This tracker is passed through the expression translation pipeline and records each outer column access.
graph TB
subgraph "SQL Translation Phase"
SQL["SQL Query String"]
Parser["sqlparser AST"]
end
subgraph "Subquery Detection"
TranslateExpr["translate_predicate / translate_scalar"]
Tracker["SubqueryCorrelatedTracker"]
Resolver["IdentifierResolver"]
end
subgraph "Placeholder Injection"
OuterColumn["Outer Column Reference"]
Placeholder["Synthetic Placeholder"]
Recording["CorrelatedColumn Entry"]
end
subgraph "Plan Output"
FilterSubquery["FilterSubquery"]
ScalarSubquery["ScalarSubquery"]
SelectPlan["SelectPlan"]
end
SQL --> Parser
Parser --> TranslateExpr
TranslateExpr --> Tracker
TranslateExpr --> Resolver
Tracker --> OuterColumn
OuterColumn --> Placeholder
Placeholder --> Recording
Recording --> FilterSubquery
Recording --> ScalarSubquery
FilterSubquery --> SelectPlan
ScalarSubquery --> SelectPlan
Tracker Architecture
Sources: llkv-sql/src/sql_engine.rs:24 llkv-sql/src/sql_engine.rs:326-363
Placeholder Generation
When the tracker detects an outer column reference in a subquery, it:
- Generates a unique placeholder string (e.g., "__correlated_0__")
- Records the mapping: placeholder → (outer_column, field_path)
- Returns the placeholder to the expression translator
- The placeholder is embedded in the subquery's expression tree instead of the original column name
This allows the subquery plan to be "generic" - it references placeholders that will be bound to actual values at execution time.
Sources: llkv-sql/src/sql_engine.rs:337-351
Tracker Extension Traits
The SubqueryCorrelatedTrackerExt trait provides a convenience method to request placeholders directly from catalog resolution results, avoiding repetitive unpacking of ColumnResolution fields.
The SubqueryCorrelatedTrackerOptionExt trait enables chaining optional tracker references through nested translation helpers without explicit as_mut() calls.
Sources: llkv-sql/src/sql_engine.rs:337-363
Subquery Types and Execution
LLKV supports two categories of subqueries, each with distinct execution semantics.
sequenceDiagram
participant Executor as QueryExecutor
participant Filter as Filter Evaluation
participant Subquery as EXISTS Subquery
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Filter: evaluate_predicate_mask()
Filter->>Filter: encounter Expr::Exists
Filter->>Subquery: evaluate_exists_subquery()
Subquery->>Binding: collect_correlated_bindings()
Binding->>Binding: extract outer row values
Binding-->>Subquery: bindings map
Subquery->>Inner: bind_select_plan()
Inner->>Inner: replace placeholders with values
Inner-->>Subquery: bound SelectPlan
Subquery->>Executor: execute_select(bound_plan)
Executor-->>Subquery: SelectExecution stream
Subquery->>Subquery: check if num_rows > 0
Subquery-->>Filter: boolean result
Filter-->>Executor: BooleanArray mask
Filter Subqueries (EXISTS Predicates)
Filter subqueries appear in WHERE clauses as EXISTS or NOT EXISTS predicates. They return a boolean indicating whether the subquery produced any rows.
Sources: llkv-executor/src/lib.rs:773-792
Scalar Subqueries (Projection Values)
Scalar subqueries appear in SELECT projections and must return exactly one column and at most one row. They are evaluated into a single literal value for each outer row.
Sources: llkv-executor/src/lib.rs:794-891
sequenceDiagram
participant Executor as QueryExecutor
participant Projection as Projection Logic
participant Subquery as Scalar Subquery Evaluator
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Projection: evaluate_projection_expression()
Projection->>Projection: encounter ScalarExpr::ScalarSubquery
Projection->>Subquery: evaluate_scalar_subquery_numeric()
loop For each outer row
Subquery->>Subquery: evaluate_scalar_subquery_literal()
Subquery->>Binding: collect_correlated_bindings()
Binding-->>Subquery: bindings for current row
Subquery->>Inner: bind_select_plan()
Inner-->>Subquery: bound plan
Subquery->>Executor: execute_select()
Executor-->>Subquery: result batches
Subquery->>Subquery: validate single column/row
Subquery->>Subquery: convert to Literal
Subquery-->>Projection: literal value
end
Projection->>Projection: build NumericArray from literals
Projection-->>Executor: computed column array
Binding Process
The binding process replaces placeholder identifiers in a subquery plan with actual values from the current outer row.
Correlated Binding Collection
The collect_correlated_bindings function builds a map from placeholder strings to concrete Literal values by:
- Iterating over each CorrelatedColumn in the subquery metadata
- Looking up the outer column in the current RecordBatch
- Extracting the value at the current row index
- Converting the Arrow array value to a Literal
- Storing the mapping: placeholder → Literal
Sources: Referenced in llkv-executor/src/lib.rs:781 and llkv-executor/src/lib.rs:802
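A minimal sketch of this per-row collection, assuming the CorrelatedColumn and Literal types described above (literal_from_array is a placeholder for the actual Arrow-to-Literal conversion):
```rust
use arrow::record_batch::RecordBatch;
use rustc_hash::FxHashMap;

// Sketch only: mirrors the steps listed above, not the executor's exact code.
fn collect_correlated_bindings(
    outer: &RecordBatch,
    row_idx: usize,
    correlated: &[CorrelatedColumn],
) -> FxHashMap<String, Literal> {
    let mut bindings = FxHashMap::default();
    for col in correlated {
        // Locate the outer column, pull the value at the current row, and
        // record it under the placeholder name used inside the subquery.
        let idx = outer
            .schema()
            .index_of(&col.column)
            .expect("outer column present in combined schema");
        let value = literal_from_array(outer.column(idx), row_idx); // assumed helper
        bindings.insert(col.placeholder.clone(), value);
    }
    bindings
}
```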
graph LR
subgraph "Input"
OuterBatch["RecordBatch\n(outer query result)"]
RowIndex["Current Row Index"]
Metadata["Vec<CorrelatedColumn>"]
end
subgraph "Processing"
Iterate["For each CorrelatedColumn"]
Lookup["Find column in schema"]
Extract["Extract array[row_idx]"]
Convert["Convert to Literal"]
end
subgraph "Output"
Bindings["FxHashMap<placeholder, Literal>"]
end
OuterBatch --> Iterate
RowIndex --> Iterate
Metadata --> Iterate
Iterate --> Lookup
Lookup --> Extract
Extract --> Convert
Convert --> Bindings
Plan Binding
The bind_select_plan function takes a subquery SelectPlan and a bindings map, then recursively replaces:
- Placeholder column references in filter expressions with Expr::Literal
- Placeholder column references in projections with ScalarExpr::Literal
- Placeholder column references in HAVING clauses
- Placeholder references in nested subqueries
This produces a new SelectPlan that is fully "grounded" with the outer row's values and can be executed independently.
Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803
Execution Flow: EXISTS Subquery Example
Consider the query:
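A representative correlated EXISTS query of the kind discussed here (illustrative only, not taken from the source):
```rust
// The inner SELECT references the outer row via c.id, making it a
// correlated EXISTS subquery.
engine.execute(
    "SELECT c.name FROM customers c \
     WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);",
)?;
```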
Planning Phase
Sources: llkv-plan/src/plans.rs:36-46
Execution Phase
Sources: llkv-executor/src/lib.rs:773-792
Execution Flow: Scalar Subquery Example
Consider the query:
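A representative correlated scalar subquery in a projection (illustrative only, not taken from the source):
```rust
// The subquery must yield a single column and at most one row per outer
// customer row, as enforced by the executor.
engine.execute(
    "SELECT c.name, \
            (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count \
     FROM customers c;",
)?;
```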
Planning Phase
Sources: llkv-plan/src/plans.rs:48-56
Execution Phase
Sources: llkv-executor/src/lib.rs:834-891
Cross-Product Integration
When subqueries appear in cross-product (multi-table) queries, the binding process works identically but must resolve outer column names through the combined schema.
Cross-Product Expression Context
The CrossProductExpressionContext maintains:
- Combined schema from all tables in the FROM clause
- Column lookup map (qualified names → column indices)
- Numeric array cache for evaluated expressions
- Synthetic field ID allocation for subquery results
When evaluating a filter or projection expression that contains subqueries, the context:
- Detects subquery references by SubqueryId
- Calls the appropriate evaluator (evaluate_exists_subquery or evaluate_scalar_subquery_numeric)
- Passes the combined schema and current row to the binding logic
- Caches numeric results for scalar subqueries to avoid re-evaluation
Sources: llkv-executor/src/lib.rs:1329-1383 llkv-executor/src/lib.rs:1429-1502
Validation and Error Handling
The executor enforces several constraints on subquery results:
| Constraint | Applies To | Error Condition |
|---|---|---|
| Single column | Scalar subqueries | num_columns() != 1 |
| At most one row | Scalar subqueries | num_rows() > 1 |
| Result present | N/A (returns NULL) | num_rows() == 0 for scalar subquery |
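A simplified sketch of how these checks fit together; the executor's real error type and messages differ:
```rust
use arrow::record_batch::RecordBatch;

// Reflects the constraints in the table above: exactly one column, at most
// one row, and an empty result evaluating to NULL.
fn check_scalar_subquery_result(batch: &RecordBatch) -> Result<Option<usize>, String> {
    if batch.num_columns() != 1 {
        return Err("scalar subquery must return exactly one column".to_string());
    }
    match batch.num_rows() {
        0 => Ok(None),    // no rows: the scalar subquery evaluates to NULL
        1 => Ok(Some(0)), // index of the single row holding the value
        _ => Err("scalar subquery returned more than one row".to_string()),
    }
}
```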
Error Examples
Scalar Subquery: Multiple Columns
Scalar Subquery: Multiple Rows
Sources: llkv-executor/src/lib.rs:808-819
Subquery ID Assignment
SubqueryId is a newtype wrapper around usize defined in llkv-expr. The planner assigns sequential IDs as it encounters subqueries during translation, ensuring each subquery has a unique identifier that persists from planning through execution.
The executor uses these IDs to:
- Look up subquery metadata in the plan's scalar_subqueries or the filter's subqueries vectors
- Match subquery references in expression trees (ScalarExpr::ScalarSubquery or Expr::Exists) to their plans
- Cache evaluation results (for scalar subqueries appearing multiple times)
Sources: llkv-expr (referenced in llkv-plan/src/plans.rs:15)
Recursive Subquery Support
LLKV supports nested subqueries where an inner subquery can itself contain subqueries. The binding process is recursive:
- Bind outer-level placeholders in the top-level subquery plan
- For any nested subqueries within that plan, repeat the binding process
- Continue until all correlation layers are resolved
This is handled automatically by bind_select_plan which traverses the entire plan tree.
Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803
Performance Considerations
Correlation Overhead
Correlated subqueries are executed once per outer row, which can be expensive:
- An outer query returning N rows with a correlated subquery executes N + 1 queries total
- For scalar subqueries in projections with multiple references, results are cached per subquery ID
- EXISTS subqueries short-circuit as soon as a matching row is found (first batch with
num_rows() > 0)
Uncorrelated Subqueries
If a subquery contains no correlated columns (empty correlated_columns vector), it could theoretically be executed once and reused. However, LLKV's current implementation still executes it per outer row. Future optimizations could detect this case and cache the result.
Sources: llkv-executor/src/lib.rs:773-891
Summary
LLKV's subquery handling follows a three-phase model:
-
Planning : The
SubqueryCorrelatedTrackerdetects outer column references and generates placeholders. Plans are built withFilterSubqueryorScalarSubquerystructures containing correlation metadata. -
Binding : At execution time,
collect_correlated_bindingsextracts outer row values andbind_select_planreplaces placeholders with literals, producing a grounded plan. -
Execution : The bound plan is executed as an independent query. EXISTS subqueries return a boolean; scalar subqueries return a single literal (or NULL if empty).
This design keeps the subquery plan generic during planning and binds it dynamically at execution time, enabling proper correlation semantics while maintaining the separation between planning and execution layers.
Sources: llkv-plan/src/plans.rs:28-67 llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363 llkv-executor/src/lib.rs:773-891
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Expression System
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-plan/src/plans.rs
- llkv-table/src/planner/program.rs
The expression system provides the intermediate representation for predicates, projections, and computations in LLKV. It defines a strongly-typed Abstract Syntax Tree (AST) that decouples query logic from concrete storage formats, enabling optimization and efficient evaluation against Apache Arrow data structures.
This page covers the overall architecture of the expression system. For details on specific components:
- Expression AST structure and types: see 5.1
- Column name resolution and translation: see 5.2
- Bytecode compilation and normalization: see 5.3
- Scalar evaluation engine: see 5.4
- Aggregate function evaluation: see 5.5
Purpose and Scope
The expression system serves three primary functions:
-
Representation : Defines generic AST types (
Expr<F>,ScalarExpr<F>) parameterized over field identifiers, supporting both string-based column names (during planning) and integer field IDs (during execution) -
Translation : Resolves column names to storage field identifiers by consulting the catalog, enabling type-safe access to table columns
-
Compilation : Transforms normalized expressions into stack-based bytecode (
EvalProgram,DomainProgram) for efficient vectorized evaluation
The system is located primarily in llkv-expr/, with translation logic in llkv-executor/src/translation/ and compilation in llkv-table/src/planner/program.rs.
Sources: llkv-expr/src/expr.rs:1-749 llkv-executor/src/translation/expression.rs:1-424 llkv-table/src/planner/program.rs:1-710
Expression Type Hierarchy
LLKV uses two primary expression types, both generic over the field identifier type F:
Expression Types
| Type | Purpose | Variants | Example |
|---|---|---|---|
Expr<'a, F> | Boolean predicates for filtering rows | And, Or, Not, Pred, Compare, InList, IsNull, Literal, Exists | WHERE age > 18 AND status = 'active' |
ScalarExpr<F> | Arithmetic/scalar computations returning values | Column, Literal, Binary, Aggregate, Cast, Case, Coalesce, GetField, Compare, Not, IsNull, Random, ScalarSubquery | SELECT price * 1.1 AS adjusted_price |
Filter<'a, F> | Single-field predicate | Field ID + Operator | age > 18 |
Operator<'a> | Comparison operator against literals | Equals, Range, GreaterThan, LessThan, In, StartsWith, EndsWith, Contains, IsNull, IsNotNull | IN (1, 2, 3) |
Type Parameterization Flow
Sources: llkv-expr/src/expr.rs:14-333 llkv-plan/src/plans.rs:28-67 llkv-executor/src/translation/expression.rs:18-174
Expression Lifecycle
Expressions flow through multiple stages from SQL text to execution against storage:
Stage 1: Planning
The SQL layer (llkv-sql) parses SQL statements and constructs plan structures containing expressions. At this stage, column references use string names from the SQL text:
- Predicates: Stored as Expr<'static, String> in SelectFilter
- Projections: Stored as ScalarExpr<String> in SelectProjection::Computed
- Assignments: Stored as ScalarExpr<String> in UpdatePlan::assignments
Sources: llkv-plan/src/plans.rs:28-34 llkv-sql/src/translator/mod.rs (inferred from architecture)
Stage 2: Translation
The executor translates string-based expressions to field-ID-based expressions by consulting the table schema:
- Column Resolution: translate_predicate and translate_scalar walk the expression tree
- Schema Lookup: Each column name is resolved to a FieldId using ExecutorSchema::resolve
- Type Inference: For computed projections, infer_computed_data_type determines the Arrow data type
- Special Columns: System columns like "rowid" map to special field IDs (e.g., ROW_ID_FIELD_ID)
Translation is implemented via iterative traversal to avoid stack overflow on deeply nested expressions (50k+ nodes).
Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:176-387 llkv-executor/src/translation/schema.rs:53-123
Stage 3: Compilation
The table layer compiles Expr<FieldId> into bytecode for efficient evaluation:
-
Normalization :
normalize_predicateapplies De Morgan's laws and flattens nested AND/OR nodes -
Compilation :
ProgramCompiler::compilegenerates two programs:EvalProgram: Stack-based bytecode for predicate evaluationDomainProgram: Bytecode for tracking which fields affect row visibility
-
Fusion Optimization : Multiple predicates on the same field are fused into a single
FusedAndoperation
graph TB
Input["Expr<FieldId>\nNOT(age > 18 AND status = 'active')"]
Norm["normalize_predicate\nApply De Morgan's law"]
Normal["Expr<FieldId>\nOR([\n NOT(age > 18),\n NOT(status = 'active')\n])"]
Compile["ProgramCompiler::compile"]
subgraph "Output Programs"
Eval["EvalProgram\nops: [\n PushCompare(age, Gt, 18),\n Not,\n PushCompare(status, Eq, 'active'),\n Not,\n Or(2)\n]"]
Domain["DomainProgram\nops: [\n PushCompareDomain(...),\n PushCompareDomain(...),\n Union(2)\n]"]
end
Input --> Norm
Norm --> Normal
Normal --> Compile
Compile --> Eval
Compile --> Domain
Sources: llkv-table/src/planner/program.rs:286-318 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631
Stage 4: Evaluation
Compiled programs are executed against Arrow RecordBatch data:
- EvalProgram : Evaluates predicates row-by-row using a value stack, producing boolean results
- DomainProgram : Identifies which row IDs could possibly match (used for optimization)
- ScalarExpr : Evaluated via
NumericKernelsfor vectorized arithmetic operations
The evaluation engine handles Arrow's columnar format efficiently through zero-copy operations and SIMD-friendly algorithms.
Sources: llkv-table/src/planner/evaluator.rs (inferred from architecture), llkv-executor/src/lib.rs:254-296
Key Components
ProgramCompiler
Compiles normalized expressions into bytecode:
Key Optimizations :
- Predicate Fusion :
gather_fuseddetects multiple predicates on the same field and emitsFusedAndoperations - Domain Caching : Domain programs are memoized by expression identity to avoid recompilation
- Stack-Based Evaluation : Operations push/pop from a value stack, avoiding recursive evaluation overhead
Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:518-542
Bytecode Operations
EvalOp Variants
| Operation | Purpose | Stack Effect |
|---|---|---|
PushPredicate(filter) | Evaluate single-field predicate | Push boolean |
PushCompare{left, op, right} | Evaluate comparison between scalar expressions | Push boolean |
PushInList{expr, list, negated} | Evaluate IN/NOT IN list membership | Push boolean |
PushIsNull{expr, negated} | Evaluate IS NULL / IS NOT NULL | Push boolean |
PushLiteral(bool) | Push constant boolean | Push boolean |
FusedAnd{field_id, filters} | Evaluate multiple predicates on same field (optimized) | Push boolean |
And{child_count} | Pop N booleans, push AND result | Pop N, push 1 |
Or{child_count} | Pop N booleans, push OR result | Pop N, push 1 |
Not{domain} | Pop boolean, negate, push result (uses domain for optimization) | Pop 1, push 1 |
DomainOp Variants
| Operation | Purpose | Stack Effect |
|---|---|---|
PushFieldAll(field_id) | All rows where field exists | Push RowSet |
PushCompareDomain{left, right, op, fields} | Rows where comparison could be true | Push RowSet |
PushInListDomain{expr, list, fields, negated} | Rows where IN list could be true | Push RowSet |
PushIsNullDomain{expr, fields, negated} | Rows where NULL test could be true | Push RowSet |
PushLiteralFalse | Empty row set | Push RowSet |
PushAllRows | All rows | Push RowSet |
Union{child_count} | Pop N row sets, push union | Pop N, push 1 |
Intersect{child_count} | Pop N row sets, push intersection | Pop N, push 1 |
Sources: llkv-table/src/planner/program.rs:36-67 llkv-table/src/planner/program.rs:221-254
Expression Translation
Translation resolves column names to field IDs through schema lookup:
Special Handling :
- Rowid Column :
"rowid"(case-insensitive) maps toROW_ID_FIELD_IDconstant - Flexible Matching : Supports qualified names (
table.column) and unqualified names (column) - Error Handling : Unknown columns produce descriptive error messages with the column name
Sources: llkv-executor/src/translation/expression.rs:390-407 llkv-executor/src/translation/expression.rs:410-423
Type Inference
The executor infers Arrow data types for computed projections to construct the output schema:
Type Inference Rules
| Expression Pattern | Inferred Type |
|---|---|
Literal(Integer) | DataType::Int64 |
Literal(Float) | DataType::Float64 |
Literal(Decimal(v)) | DataType::Decimal128(v.precision(), v.scale()) |
Literal(String) | DataType::Utf8 |
Literal(Date32) | DataType::Date32 |
Literal(Boolean) | DataType::Boolean |
Literal(Null) | DataType::Null |
Column(field_id) | Lookup from schema, normalized to Int64/Float64 |
Binary{...} | Float64 if any operand is float, else Int64 |
Compare{...} | DataType::Int64 (boolean as integer) |
Aggregate(...) | DataType::Int64 (most aggregates) |
Cast{data_type, ...} | data_type (explicit cast) |
Random | DataType::Float64 |
Numeric Type Normalization : Small integers (Int8, Int16, Int32, Boolean) normalize to Int64, while all floating-point types normalize to Float64. This simplifies arithmetic evaluation.
Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:125-147 llkv-executor/src/translation/schema.rs:149-243
Subquery Support
Expressions support two types of subqueries:
EXISTS Predicates
Used in WHERE clauses to test for row existence:
- Structure: Expr::Exists(SubqueryExpr{id, negated})
- Planning: Stored in SelectFilter::subqueries with correlated column bindings
- Evaluation: Executor binds outer row values to correlated columns, executes subquery plan, returns true if any rows match
Scalar Subqueries
Used in projections to return a single value:
- Structure: ScalarExpr::ScalarSubquery(ScalarSubqueryExpr{id})
- Planning: Stored in SelectPlan::scalar_subqueries with correlated column bindings
- Evaluation: Executor binds outer row values, executes subquery, extracts single value
- Error Handling: Returns error if subquery returns multiple rows or columns
Sources: llkv-expr/src/expr.rs:42-63 llkv-plan/src/plans.rs:36-56 llkv-executor/src/lib.rs:774-839
Normalization
Expression normalization applies logical transformations before compilation:
De Morgan's Laws
NOT is pushed down through AND/OR using De Morgan's laws:
- NOT(A AND B) → NOT(A) OR NOT(B)
- NOT(A OR B) → NOT(A) AND NOT(B)
- NOT(NOT(A)) → A
Flattening
Nested AND/OR nodes are flattened:
- AND(A, AND(B, C)) → AND(A, B, C)
- OR(A, OR(B, C)) → OR(A, B, C)
Special Cases
- NOT(Literal(true)) → Literal(false)
- NOT(IsNull{expr, false}) → IsNull{expr, true}
Sources: llkv-table/src/planner/program.rs:286-343
Expression Operators
Binary Operators (BinaryOp)
| Operator | Semantics |
|---|---|
Add | Addition (a + b) |
Subtract | Subtraction (a - b) |
Multiply | Multiplication (a * b) |
Divide | Division (a / b) |
Modulo | Modulus (a % b) |
And | Bitwise AND (a & b) |
Or | Bitwise OR (a \| b) |
BitwiseShiftLeft | Left shift (a << b) |
BitwiseShiftRight | Right shift (a >> b) |
Comparison Operators (CompareOp)
| Operator | Semantics |
|---|---|
Eq | Equality (a = b) |
NotEq | Inequality (a != b) |
Lt | Less than (a < b) |
LtEq | Less than or equal (a <= b) |
Gt | Greater than (a > b) |
GtEq | Greater than or equal (a >= b) |
Comparisons in ScalarExpr::Compare return 1 for true, 0 for false, NULL for NULL propagation.
Sources: llkv-expr/src/expr.rs:270-293
Memory Management
Expression Lifetimes
The 'a lifetime parameter in Expr<'a, F> allows borrowed operators to avoid allocations:
- Operator::In(&'a [Literal]): Borrows slice from call site
- Operator::StartsWith{pattern: &'a str, ...}: Borrows pattern string
- Filter<'a, F>: Contains borrowed Operator<'a>
Owned Variants : EvalProgram and DomainProgram use OwnedOperator and OwnedFilter for storage, converting borrowed operators to owned values.
Zero-Copy Pattern
During evaluation, predicates borrow from the compiled program rather than cloning operators, enabling zero-copy predicate evaluation against Arrow arrays.
Sources: llkv-expr/src/expr.rs:295-333 llkv-table/src/planner/program.rs:69-143
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Expression AST
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/program.rs
Purpose and Scope
This document describes the expression Abstract Syntax Tree (AST) defined in the llkv-expr crate. The expression AST provides type-aware, Arrow-native data structures for representing boolean predicates and scalar expressions throughout the LLKV system. These AST nodes are decoupled from SQL parsing and can be parameterized by field identifier types, enabling reuse across multiple processing stages.
For information about how expressions are translated between field identifier types, see Expression Translation. For details on how expressions are compiled into executable bytecode, see Program Compilation. For scalar evaluation mechanics, see Scalar Evaluation and NumericKernels.
Sources: llkv-expr/src/expr.rs:1-8
Expression Type Hierarchy
The llkv-expr crate defines two primary expression enums:
- Expr<'a, F> - Boolean/predicate expressions that evaluate to true or false
- ScalarExpr<F> - Arithmetic/scalar expressions that produce typed values
graph TB
subgraph "Expression Type System"
EXPR["Expr<'a, F>\nBoolean Predicates"]
SCALAR["ScalarExpr<F>\nScalar Values"]
EXPR --> LOGICAL["Logical Operators\nAnd, Or, Not"]
EXPR --> PRED["Pred(Filter)\nField Predicates"]
EXPR --> COMPARE["Compare\nScalar Comparisons"]
EXPR --> INLIST["InList\nSet Membership"]
EXPR --> ISNULL["IsNull\nNull Checks"]
EXPR --> LITERAL["Literal(bool)\nConstant Boolean"]
EXPR --> EXISTS["Exists\nSubquery Predicates"]
SCALAR --> COLUMN["Column(F)\nField Reference"]
SCALAR --> SLITERAL["Literal\nConstant Value"]
SCALAR --> BINARY["Binary\nArithmetic Ops"]
SCALAR --> SNOT["Not\nLogical Negation"]
SCALAR --> SISNULL["IsNull\nNull Test"]
SCALAR --> AGG["Aggregate\nAggregate Functions"]
SCALAR --> GETFIELD["GetField\nStruct Access"]
SCALAR --> CAST["Cast\nType Conversion"]
SCALAR --> SCOMPARE["Compare\nBoolean Result"]
SCALAR --> COALESCE["Coalesce\nFirst Non-Null"]
SCALAR --> SUBQ["ScalarSubquery\nSubquery Value"]
SCALAR --> CASE["Case\nConditional Logic"]
SCALAR --> RANDOM["Random\nRandom Number"]
COMPARE -.uses.-> SCALAR
INLIST -.uses.-> SCALAR
ISNULL -.uses.-> SCALAR
end
style EXPR fill:#e8f5e9
style SCALAR fill:#e1f5ff
Both types are generic over a field identifier parameter F, which allows the same AST structure to be used with different field representations (typically String column names during planning, or FieldId numeric identifiers during execution).
Sources: llkv-expr/src/expr.rs:14-143
Expr<'a, F> - Boolean Expressions
The Expr<'a, F> enum represents boolean predicate expressions that evaluate to true or false. These are primarily used in WHERE clauses, JOIN conditions, and HAVING clauses.
Expr Variants
| Variant | Description | Use Case |
|---|---|---|
And(Vec<Expr>) | Logical AND of multiple predicates | Combining multiple filter conditions |
Or(Vec<Expr>) | Logical OR of multiple predicates | Alternative filter conditions |
Not(Box<Expr>) | Logical negation | Inverting a predicate |
Pred(Filter<'a, F>) | Single-field predicate | Column-level filtering (e.g., age > 18) |
Compare { left, op, right } | Comparison between scalar expressions | Cross-column comparisons (e.g., col1 + col2 > 10) |
InList { expr, list, negated } | Set membership test | IN/NOT IN clauses |
IsNull { expr, negated } | Null check on expression | IS NULL/IS NOT NULL on complex expressions |
Literal(bool) | Constant boolean value | Always-true/always-false conditions |
Exists(SubqueryExpr) | Correlated subquery predicate | EXISTS/NOT EXISTS clauses |
Sources: llkv-expr/src/expr.rs:14-43
Expr Construction Helpers
The Expr type provides builder methods for common patterns:
These helpers simplify construction of common predicate patterns during query planning.
Sources: llkv-expr/src/expr.rs:65-84
Filter and Operator Types
The Pred variant wraps a Filter<'a, F> struct, which represents a single predicate against a field:
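An approximate shape for Filter, inferred from the tables on this page (field names here are assumptions, not the crate's definition):
```rust
// Sketch: a Filter pairs a field identifier with an Operator that compares
// that field against literal values.
pub struct Filter<'a, F> {
    pub field_id: F,      // String during planning, FieldId during execution
    pub op: Operator<'a>, // e.g. Operator::GreaterThan(...) or Operator::IsNull
}
```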
The Operator<'a> enum defines comparison operations over untyped Literal values:
| Operator | Description | Example SQL |
|---|---|---|
Equals(Literal) | Exact equality | col = 5 |
Range { lower, upper } | Range with bounds | col BETWEEN 10 AND 20 |
GreaterThan(Literal) | Greater-than comparison | col > 10 |
GreaterThanOrEquals(Literal) | Greater-or-equal | col >= 10 |
LessThan(Literal) | Less-than comparison | col < 10 |
LessThanOrEquals(Literal) | Less-or-equal | col <= 10 |
In(&'a [Literal]) | Set membership (borrowed slice) | col IN (1, 2, 3) |
StartsWith { pattern, case_sensitive } | String prefix match | col LIKE 'abc%' |
EndsWith { pattern, case_sensitive } | String suffix match | col LIKE '%xyz' |
Contains { pattern, case_sensitive } | String substring match | col LIKE '%abc%' |
IsNull | Null check | col IS NULL |
IsNotNull | Non-null check | col IS NOT NULL |
The Operator type uses borrowed slices for In and borrowed string slices for pattern matching to avoid allocations in common cases.
Sources: llkv-expr/src/expr.rs:295-358
ScalarExpr - Scalar Expressions
The ScalarExpr<F> enum represents arithmetic and scalar expressions that produce typed values. These are used in SELECT projections, computed columns, ORDER BY clauses, and as operands in comparisons.
graph TB
subgraph "ScalarExpr Evaluation Categories"
SIMPLE["Simple Values"]
ARITH["Arithmetic"]
LOGIC["Logical"]
STRUCT["Structured"]
CONTROL["Control Flow"]
SPECIAL["Special Functions"]
SIMPLE --> COLUMN["Column(F)\nField reference"]
SIMPLE --> LITERAL["Literal\nConstant value"]
ARITH --> BINARY["Binary\nleft op right"]
ARITH --> CAST["Cast\nType conversion"]
LOGIC --> NOT["Not\nLogical negation"]
LOGIC --> ISNULL["IsNull\nNull test"]
LOGIC --> COMPARE["Compare\nComparison"]
STRUCT --> GETFIELD["GetField\nStruct field access"]
STRUCT --> AGGREGATE["Aggregate\nAggregate function"]
CONTROL --> CASE["Case\nCASE expression"]
CONTROL --> COALESCE["Coalesce\nFirst non-null"]
SPECIAL --> SUBQUERY["ScalarSubquery\nScalar subquery"]
SPECIAL --> RANDOM["Random\nRandom number"]
end
ScalarExpr Variants
Sources: llkv-expr/src/expr.rs:86-143
Arithmetic and Binary Operations
The Binary variant supports arithmetic and logical operations:
BinaryOp Variants:
| Operator | Numeric | Bitwise | Logical |
|---|---|---|---|
Add | ✓ | | |
Subtract | ✓ | | |
Multiply | ✓ | | |
Divide | ✓ | | |
Modulo | ✓ | | |
And | | | ✓ |
Or | | | ✓ |
BitwiseShiftLeft | | ✓ | |
BitwiseShiftRight | | ✓ | |
Sources: llkv-expr/src/expr.rs:270-282
Comparison Operations
The Compare variant in ScalarExpr produces a boolean (1/0) result:
CompareOp Variants:
| Operator | SQL Equivalent |
|---|---|
Eq | = |
NotEq | != or <> |
Lt | < |
LtEq | <= |
Gt | > |
GtEq | >= |
Sources: llkv-expr/src/expr.rs:284-293 llkv-expr/src/expr.rs:119-124
AggregateCall Variants
The Aggregate(AggregateCall<F>) variant wraps aggregate function calls:
Each aggregate (except CountStar) operates on a ScalarExpr<F> rather than just a column, enabling complex expressions like AVG(col1 + col2) or SUM(-col1).
Sources: llkv-expr/src/expr.rs:145-176
Struct Field Access
The GetField variant extracts fields from struct-typed expressions:
This represents dot-notation access like user.address.city, which is nested as:
GetField {
base: GetField {
base: Column(user),
field_name: "address"
},
field_name: "city"
}
Sources: llkv-expr/src/expr.rs:107-113
CASE Expressions
The Case variant implements SQL CASE expressions:
- Simple CASE: operand is Some, branches test equality
- Searched CASE: operand is None, branches evaluate conditions
- ELSE clause: handled by the else_expr field
Sources: llkv-expr/src/expr.rs:129-137
ScalarExpr Helper Methods
Builder methods simplify construction:
| Method | Purpose |
|---|---|
column(field: F) | Create column reference |
literal<L>(lit: L) | Create literal value |
binary(left, op, right) | Create binary operation |
logical_not(expr) | Create logical NOT |
is_null(expr, negated) | Create null test |
aggregate(call) | Create aggregate function |
get_field(base, name) | Create struct field access |
cast(expr, data_type) | Create type cast |
compare(left, op, right) | Create comparison |
coalesce(exprs) | Create COALESCE |
scalar_subquery(id) | Create scalar subquery |
case(operand, branches, else_expr) | Create CASE expression |
random() | Create random number generator |
Sources: llkv-expr/src/expr.rs:178-268
Subquery Expression Types
The expression AST includes dedicated types for correlated and scalar subqueries:
SubqueryExpr
Used in Expr::Exists for boolean subquery predicates:
The id references a subquery definition stored separately (typically in the enclosing plan), and negated indicates NOT EXISTS.
Sources: llkv-expr/src/expr.rs:49-56
ScalarSubqueryExpr
Used in ScalarExpr::ScalarSubquery for value-producing subqueries:
This represents subqueries that return a single scalar value, used in expressions like SELECT (SELECT MAX(price) FROM orders) + 10.
Sources: llkv-expr/src/expr.rs:58-63
SubqueryId
Both subquery types use SubqueryId as an opaque identifier:
This ID is resolved during execution by looking up the subquery definition in the parent plan's metadata.
Sources: llkv-expr/src/expr.rs:45-47
graph LR
subgraph "Expression Translation Pipeline"
SQL["SQL Text"]
PLAN["Query Plan\nExpr<String>\nScalarExpr<String>"]
TRANS["Translation\nresolve_field_id()"]
EXEC["Execution\nExpr<FieldId>\nScalarExpr<FieldId>"]
EVAL["Evaluation\nRecordBatch Results"]
SQL --> PLAN
PLAN --> TRANS
TRANS --> EXEC
EXEC --> EVAL
end
Generic Field Parameter
The expression AST is parameterized by field identifier type F, enabling reuse across different processing stages:
Common instantiations:
- Expr<'a, String> - Used during query planning with column names
- Expr<'a, FieldId> - Used during execution with numeric field IDs
- ScalarExpr<String> - Planning-time scalar expressions
- ScalarExpr<FieldId> - Execution-time scalar expressions
The translation from String to FieldId occurs in llkv-executor/src/translation/expression.rs using catalog lookups to resolve column names to their internal numeric identifiers.
Sources: llkv-executor/src/translation/expression.rs:18-174
Literal Values
Both expression types use the Literal enum from llkv-expr to represent untyped constant values:
| Literal Variant | Arrow Type | Example |
|---|---|---|
Integer(i64) | Int64 | 42 |
Float(f64) | Float64 | 3.14 |
Decimal(DecimalValue) | Decimal128 | 123.45 |
Boolean(bool) | Boolean | true |
String(String) | Utf8 | "hello" |
Date32(i32) | Date32 | DATE '2024-01-01' |
Null | Null | NULL |
Struct(...) | Struct | {a: 1, b: "x"} |
Interval(...) | Interval | INTERVAL '1 month' |
Literal values are type-agnostic at the AST level. Type coercion and validation occur during execution when column types are known.
Sources: llkv-expr/src/expr.rs:10 llkv-executor/src/translation/schema.rs:53-123
graph TB
subgraph "Type Inference Rules"
SCALAR["ScalarExpr<FieldId>"]
SCALAR --> LITERAL["Literal → Literal type"]
SCALAR --> COLUMN["Column → Schema lookup"]
SCALAR --> BINARY["Binary → Int64 or Float64"]
SCALAR --> NOT["Not → Int64 (boolean)"]
SCALAR --> ISNULL["IsNull → Int64 (boolean)"]
SCALAR --> COMPARE["Compare → Int64 (boolean)"]
SCALAR --> AGG["Aggregate → Int64"]
SCALAR --> GETFIELD["GetField → Struct field type"]
SCALAR --> CAST["Cast → Target type"]
SCALAR --> CASE["Case → Int64 or Float64"]
SCALAR --> COALESCE["Coalesce → Int64 or Float64"]
SCALAR --> RANDOM["Random → Float64"]
SCALAR --> SUBQUERY["ScalarSubquery → Utf8 (TODO)"]
BINARY --> FLOATCHECK["Contains Float64?"]
FLOATCHECK -->|Yes| FLOAT64["Float64"]
FLOATCHECK -->|No| INT64["Int64"]
CASE --> CASECHECK["Branches use Float64?"]
CASECHECK -->|Yes| CASEFLOAT["Float64"]
CASECHECK -->|No| CASEINT["Int64"]
end
Expression Type Inference
During execution planning, the system infers output types for ScalarExpr based on operand types:
The inference logic is implemented in infer_computed_data_type() and expression_uses_float(), which recursively analyze expression trees to determine output types.
Sources: llkv-executor/src/translation/schema.rs:53-271
Expression Normalization
Before compilation, predicates undergo normalization to flatten nested AND/OR structures and apply De Morgan's laws:
Normalization Rules:
- Flatten AND: And([And([a, b]), c]) → And([a, b, c])
- Flatten OR: Or([Or([a, b]), c]) → Or([a, b, c])
- De Morgan's AND: Not(And([a, b])) → Or([Not(a), Not(b)])
- De Morgan's OR: Not(Or([a, b])) → And([Not(a), Not(b)])
- Double Negation: Not(Not(expr)) → expr
- Literal Negation: Not(Literal(true)) → Literal(false)
- IsNull Negation: Not(IsNull { expr, negated }) → IsNull { expr, negated: !negated }
This normalization simplifies subsequent optimization and compilation steps.
Sources: llkv-table/src/planner/program.rs:286-343
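A minimal sketch of the flattening and De Morgan rules on a toy predicate enum (the real normalize_expr in llkv-table handles the full variant set, including the IsNull rule):

```rust
// Toy normalization sketch: flatten nested And/Or and push Not inward.
#[derive(Debug)]
enum Expr {
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
    Literal(bool),
    Pred(&'static str), // stand-in for a field predicate
}

fn normalize(expr: Expr) -> Expr {
    match expr {
        Expr::And(children) => Expr::And(flatten(children, true)),
        Expr::Or(children) => Expr::Or(flatten(children, false)),
        Expr::Not(inner) => normalize_negated(*inner),
        other => other,
    }
}

// Merge nested And/Or nodes of the same kind into a single level.
fn flatten(children: Vec<Expr>, is_and: bool) -> Vec<Expr> {
    let mut out = Vec::new();
    for child in children {
        match normalize(child) {
            Expr::And(inner) if is_and => out.extend(inner),
            Expr::Or(inner) if !is_and => out.extend(inner),
            other => out.push(other),
        }
    }
    out
}

// Apply De Morgan's laws, double-negation elimination, and literal inversion.
fn normalize_negated(inner: Expr) -> Expr {
    match inner {
        Expr::And(children) => {
            Expr::Or(children.into_iter().map(normalize_negated).collect())
        }
        Expr::Or(children) => {
            Expr::And(children.into_iter().map(normalize_negated).collect())
        }
        Expr::Not(e) => normalize(*e),         // double negation
        Expr::Literal(b) => Expr::Literal(!b), // literal inversion
        other => Expr::Not(Box::new(other)),
    }
}

fn main() {
    let expr = Expr::Not(Box::new(Expr::And(vec![
        Expr::Pred("a"),
        Expr::Pred("b"),
    ])));
    // Prints: Or([Not(Pred("a")), Not(Pred("b"))])
    println!("{:?}", normalize(expr));
}
```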
Expression Compilation
Normalized expressions are compiled into two bytecode programs:
- EvalProgram - Stack-based evaluation of predicates
- DomainProgram - Set-based domain analysis for row filtering
The compilation process:
- Detects fusable predicates (multiple conditions on same field)
- Builds domain programs for NOT operations
- Produces postorder bytecode for stack-based evaluation
Sources: llkv-table/src/planner/program.rs:256-631
Expression AST Usage Flow
Key stages:
- SQL Parsing - External sqlparser produces SQL AST
- Plan Building - llkv-plan converts SQL AST to Expr<String>
- Translation - llkv-executor resolves column names to FieldId
- Normalization - llkv-table flattens and optimizes structure
- Compilation - llkv-table produces bytecode programs
- Execution - llkv-table evaluates against RecordBatch data
Sources: llkv-expr/src/expr.rs:1-8 llkv-executor/src/translation/expression.rs:18-174 llkv-table/src/planner/program.rs:256-631
Expression Translation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/program.rs
Purpose and Scope
Expression Translation is the process of converting expressions from using string-based column names to using internal field identifiers. This transformation bridges the gap between the SQL interface layer (which references columns by name) and the execution layer (which references columns by numeric FieldId). This page covers the translation phase that occurs after query planning but before expression compilation.
For information about the expression AST structure, see Expression AST. For information about how translated expressions are compiled into bytecode, see Program Compilation.
Translation Phase in Query Pipeline
The expression translation phase sits between SQL planning and execution, converting symbolic column references to physical field identifiers.
Translation Phase Detail
graph LR
SQL["SQL WHERE\nname = 'Alice'"]
AST["sqlparser AST"]
EXPRSTR["Expr<String>\nColumn('name')"]
CATALOG["Schema Lookup"]
EXPRFID["Expr<FieldId>\nColumn(42)"]
COMPILE["Bytecode\nCompilation"]
SQL --> AST
AST --> EXPRSTR
EXPRSTR --> CATALOG
CATALOG --> EXPRFID
EXPRFID --> COMPILE
style EXPRSTR fill:#fff5e1
style EXPRFID fill:#ffe1e1
Sources: Based on Diagram 5 from system architecture overview
The translation process resolves all column name strings to their corresponding FieldId values by consulting the table schema. This enables the downstream execution engine to work with stable numeric identifiers rather than string lookups.
Generic Expression Types
Both Expr and ScalarExpr are generic over the field identifier type, parameterized as F.
Type Parameter Instantiations
graph TB
subgraph "Generic Types"
EXPR["Expr<'a, F>"]
SCALAR["ScalarExpr<F>"]
FILTER["Filter<'a, F>\n{field_id: F, op: Operator}"]
end
subgraph "String-Based (Planning)"
EXPRSTR["Expr<'static, String>"]
SCALARSTR["ScalarExpr<String>"]
FILTERSTR["Filter<'static, String>"]
end
subgraph "FieldId-Based (Execution)"
EXPRFID["Expr<'static, FieldId>"]
SCALARFID["ScalarExpr<FieldId>"]
FILTERFID["Filter<'static, FieldId>"]
end
EXPR -.instantiated as.-> EXPRSTR
EXPR -.instantiated as.-> EXPRFID
SCALAR -.instantiated as.-> SCALARSTR
SCALAR -.instantiated as.-> SCALARFID
FILTER -.instantiated as.-> FILTERSTR
FILTER -.instantiated as.-> FILTERFID
EXPRSTR ==translate_predicate==> EXPRFID
SCALARSTR ==translate_scalar==> SCALARFID
Sources: llkv-expr/src/expr.rs:14-43 llkv-executor/src/translation/expression.rs:18-27
| Type | Planning Phase | Execution Phase |
|---|---|---|
| Predicate Expression | Expr<'static, String> | Expr<'static, FieldId> |
| Scalar Expression | ScalarExpr<String> | ScalarExpr<FieldId> |
| Filter | Filter<'static, String> | Filter<'static, FieldId> |
| Field Reference | String column name | FieldId numeric identifier |
The generic parameter F allows the same AST structure to be used at different stages of query processing, with type safety enforcing that planning-phase expressions cannot be mixed with execution-phase expressions.
Sources: llkv-expr/src/expr.rs:14-333
Translation Functions
The llkv-executor crate provides two primary translation functions that recursively transform expression trees.
translate_predicate
Translates boolean predicate expressions (Expr<String> → Expr<FieldId>).
Predicate Translation Flow
Sources: llkv-executor/src/translation/expression.rs:18-174
Function Signature:
The function accepts:
- expr: The predicate expression with string column references
- schema: The table schema for column name resolution
- unknown_column: Error constructor for unresolved column names
Sources: llkv-executor/src/translation/expression.rs:18-27
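To make the function's role concrete, here is a self-contained sketch using toy types and a HashMap in place of ExecutorSchema; the real translate_predicate handles every Expr variant and traverses iteratively rather than recursively:

```rust
// Conceptual sketch of predicate translation: resolve string column names to
// numeric field ids via a schema lookup, calling an error callback for
// unknown names. Toy types only.
use std::collections::HashMap;

type FieldId = u32;

#[derive(Debug)]
enum Expr<F> {
    And(Vec<Expr<F>>),
    Pred { field: F, op: &'static str },
}

fn translate<E>(
    expr: Expr<String>,
    schema: &HashMap<String, FieldId>,
    unknown_column: &impl Fn(&str) -> E,
) -> Result<Expr<FieldId>, E> {
    match expr {
        Expr::And(children) => Ok(Expr::And(
            children
                .into_iter()
                .map(|c| translate(c, schema, unknown_column))
                .collect::<Result<_, _>>()?,
        )),
        Expr::Pred { field, op } => {
            let fid = *schema.get(&field).ok_or_else(|| unknown_column(&field))?;
            Ok(Expr::Pred { field: fid, op })
        }
    }
}

fn main() {
    let schema: HashMap<String, FieldId> =
        [("name".to_string(), 42), ("age".to_string(), 43)].into();
    let expr = Expr::And(vec![
        Expr::Pred { field: "name".into(), op: "= 'Alice'" },
        Expr::Pred { field: "age".into(), op: "> 18" },
    ]);
    let translated = translate(expr, &schema, &|c| format!("unknown column '{c}'"));
    println!("{translated:?}");
}
```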
translate_scalar
Translates scalar value expressions (ScalarExpr<String> → ScalarExpr<FieldId>).
Function Signature:
Sources: llkv-executor/src/translation/expression.rs:176-185
Variant-Specific Translation
The translation process handles each expression variant differently:
| Variant | Translation Approach |
|---|---|
Expr::Pred(Filter) | Resolve filter's field_id from String to FieldId |
Expr::And(Vec) / Expr::Or(Vec) | Recursively translate all child expressions |
Expr::Not(Box) | Recursively translate inner expression |
Expr::Compare | Translate both left and right scalar expressions |
Expr::InList | Translate target expression and all list items |
Expr::IsNull | Translate inner expression |
Expr::Literal(bool) | No translation needed, pass through |
Expr::Exists(SubqueryExpr) | Pass through subquery ID unchanged |
ScalarExpr::Column(String) | Resolve string to FieldId |
ScalarExpr::Literal | No translation needed, pass through |
ScalarExpr::Binary | Recursively translate left and right operands |
ScalarExpr::Aggregate | Recursively translate aggregate expression argument |
ScalarExpr::GetField | Recursively translate base expression |
ScalarExpr::Cast | Recursively translate inner expression |
ScalarExpr::Case | Translate operand, all branch conditions/results, and else branch |
ScalarExpr::Coalesce | Recursively translate all items |
Sources: llkv-executor/src/translation/expression.rs:86-143 llkv-executor/src/translation/expression.rs:197-386
graph TD
START["Column Name String"]
ROWID_CHECK{"Is 'rowid'\n(case-insensitive)?"}
ROWID["Return\nROW_ID_FIELD_ID"]
LOOKUP["schema.resolve(name)"]
FOUND{"Found in\nschema?"}
RETURN_FID["Return\ncolumn.field_id"]
ERROR["Invoke\nunknown_column\ncallback"]
START --> ROWID_CHECK
ROWID_CHECK -->|Yes| ROWID
ROWID_CHECK -->|No| LOOKUP
LOOKUP --> FOUND
FOUND -->|Yes| RETURN_FID
FOUND -->|No| ERROR
Field Resolution
The core of translation is resolving string column names to numeric field identifiers.
Field Resolution Logic
Sources: llkv-executor/src/translation/expression.rs:390-407
Special Column: rowid
The system provides a special pseudo-column named rowid that references the internal row identifier:
The rowid column is:
- Case-insensitive (accepts "ROWID", "rowid", "RowId", etc.)
- Available in all tables automatically
- Mapped to the constant llkv_table::ROW_ID_FIELD_ID
- Corresponds to the llkv_column_map::ROW_ID_COLUMN_NAME constant
Sources: llkv-executor/src/translation/expression.rs:399-401 llkv-executor/src/translation/expression.rs:2 llkv-executor/src/translation/expression.rs:5
Schema Lookup
For non-special columns, resolution uses the ExecutorSchema::resolve method:
The schema lookup:
- Searches for a column with the given name
- Returns the ExecutorColumn if found
- Extracts the field_id from the column metadata
- Invokes the error callback if not found
Sources: llkv-executor/src/translation/expression.rs:403-406
graph TB
START["Initial Expression"]
PUSH_ENTER["Push Enter Frame"]
POP["Pop Frame"]
FRAME_TYPE{"Frame Type?"}
ENTER_NODE["Enter Node"]
NODE_TYPE{"Node Type?"}
AND_OR["And/Or"]
NOT["Not"]
LEAF["Leaf (Pred,\nCompare, etc.)"]
PUSH_EXIT["Push Exit Frame"]
PUSH_CHILDREN["Push Child\nEnter Frames\n(reversed)"]
PUSH_EXIT_NOT["Push Exit Frame"]
PUSH_INNER["Push Inner\nEnter Frame"]
TRANSLATE_LEAF["Translate Leaf\nPush to result_stack"]
EXIT_NODE["Exit Node"]
POP_RESULTS["Pop child results\nfrom result_stack"]
BUILD_NODE["Build translated\nparent node"]
PUSH_RESULT["Push to result_stack"]
DONE{"Stack\nempty?"}
RETURN["Return final result"]
START --> PUSH_ENTER
PUSH_ENTER --> POP
POP --> FRAME_TYPE
FRAME_TYPE -->|Enter| ENTER_NODE
FRAME_TYPE -->|Exit| EXIT_NODE
FRAME_TYPE -->|Leaf| TRANSLATE_LEAF
ENTER_NODE --> NODE_TYPE
NODE_TYPE --> AND_OR
NODE_TYPE --> NOT
NODE_TYPE --> LEAF
AND_OR --> PUSH_EXIT
PUSH_EXIT --> PUSH_CHILDREN
PUSH_CHILDREN --> POP
NOT --> PUSH_EXIT_NOT
PUSH_EXIT_NOT --> PUSH_INNER
PUSH_INNER --> POP
LEAF --> TRANSLATE_LEAF
TRANSLATE_LEAF --> POP
EXIT_NODE --> POP_RESULTS
POP_RESULTS --> BUILD_NODE
BUILD_NODE --> PUSH_RESULT
PUSH_RESULT --> POP
POP --> DONE
DONE -->|No| FRAME_TYPE
DONE -->|Yes| RETURN
Traversal Strategy
Translation uses an iterative traversal approach to avoid stack overflow on deeply nested expressions.
Iterative Traversal Algorithm
Sources: llkv-executor/src/translation/expression.rs:39-174
Frame-Based Pattern
The translation uses a frame-based traversal pattern with two stacks:
Work Stack (owned_stack): Contains frames representing work to be done
- OwnedFrame::Enter(expr): Visit a node and potentially expand it
- OwnedFrame::Exit(context): Collect child results and build parent node
- OwnedFrame::Leaf(translated): Push a fully translated leaf node
Result Stack (result_stack): Contains translated expressions ready to be consumed by parent nodes
Sources: llkv-executor/src/translation/expression.rs:48-63
Traversal Example
For the expression And([Pred(name_col), Pred(age_col)]):
| Step | Work Stack | Result Stack | Action |
|---|---|---|---|
| 1 | [Enter(And)] | [] | Start |
| 2 | [Exit(And(2)), Enter(age), Enter(name)] | [] | Expand And |
| 3 | [Exit(And(2)), Enter(age), Leaf(name→42)] | [] | Translate name |
| 4 | [Exit(And(2)), Enter(age)] | [Pred(42)] | Push name result |
| 5 | [Exit(And(2)), Leaf(age→43)] | [Pred(42)] | Translate age |
| 6 | [Exit(And(2))] | [Pred(42), Pred(43)] | Push age result |
| 7 | [] | [And([Pred(42), Pred(43)])] | Build And, push result |
| 8 | Done | [And([...])] | Return final expression |
This approach handles deeply nested expressions (50k+ nodes) without recursion-induced stack overflow.
Sources: llkv-executor/src/translation/expression.rs:62-174
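The same two-stack idea can be sketched in a few lines on a toy tree; the real frames carry translation state rather than strings:

```rust
// Minimal Enter/Exit frame traversal: children are pushed reversed so they
// pop in source order, and each Exit frame consumes its children's results.
enum Expr {
    And(Vec<Expr>),
    Leaf(&'static str),
}

enum Frame<'a> {
    Enter(&'a Expr),
    Exit(usize), // number of child results to pop from the result stack
}

fn render(root: &Expr) -> String {
    let mut work = vec![Frame::Enter(root)];
    let mut results: Vec<String> = Vec::new();

    while let Some(frame) = work.pop() {
        match frame {
            Frame::Enter(Expr::Leaf(name)) => results.push((*name).to_string()),
            Frame::Enter(Expr::And(children)) => {
                // Exit frame goes first so it runs after all children.
                work.push(Frame::Exit(children.len()));
                for child in children.iter().rev() {
                    work.push(Frame::Enter(child));
                }
            }
            Frame::Exit(child_count) => {
                let start = results.len() - child_count;
                let joined = results.split_off(start).join(" AND ");
                results.push(format!("({joined})"));
            }
        }
    }
    results.pop().expect("one result per traversal")
}

fn main() {
    let expr = Expr::And(vec![Expr::Leaf("name = 'Alice'"), Expr::Leaf("age > 18")]);
    // Prints: (name = 'Alice' AND age > 18)
    println!("{}", render(&expr));
}
```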
Why Iterative Traversal?
The codebase comments explain:
This avoids stack overflow on deeply nested expressions (50k+ nodes) by using explicit work_stack and result_stack instead of recursion.
The frame-based pattern is documented in the llkv-plan::traversal module and reused here for expression translation.
Sources: llkv-executor/src/translation/expression.rs:39-46
Error Handling
Translation failures produce descriptive errors through callback functions.
Error Callbacks
Both translation functions accept error constructor callbacks:
| Parameter | Purpose | Example Usage |
|---|---|---|
unknown_column: F | Construct error for unknown column names | Produces an Error::InvalidArgumentError for unresolved column names |
unknown_aggregate: G | Construct error for unknown aggregate functions | Currently unused but reserved for future validation |
The callback pattern allows callers to customize error messages and error types based on their context.
Sources: llkv-executor/src/translation/expression.rs:21-27 llkv-executor/src/translation/expression.rs:189-195
Common Error Scenarios
Translation Error Flow
When schema.resolve(name) returns None, the system invokes the error callback which typically produces an Error::InvalidArgumentError with a message like:
"Binder Error: does not have a column named 'xyz'"
Sources: llkv-executor/src/translation/expression.rs:418-422
Result Stack Underflow
The iterative traversal validates stack consistency:
Stack underflow indicates a bug in the traversal logic rather than invalid user input, so it produces an Error::Internal.
Sources: llkv-executor/src/translation/expression.rs:160-164 llkv-executor/src/translation/expression.rs:171-173
graph TD
EXPR["ScalarExpr<FieldId>"]
TYPE_CHECK{"Expression\nType?"}
LITERAL["Literal"]
INFER_LIT["Infer from\nLiteral type"]
COLUMN["Column(fid)"]
LOOKUP["schema.column_by_field_id(fid)"]
NORMALIZE["normalized_numeric_type"]
BINARY["Binary"]
CHECK_FLOAT["expression_uses_float"]
FLOAT_RESULT["DataType::Float64"]
INT_RESULT["DataType::Int64"]
AGGREGATE["Aggregate"]
AGG_RESULT["DataType::Int64"]
CAST["Cast"]
CAST_TYPE["Use specified\ndata_type"]
RESULT["Arrow DataType"]
EXPR --> TYPE_CHECK
TYPE_CHECK --> LITERAL
TYPE_CHECK --> COLUMN
TYPE_CHECK --> BINARY
TYPE_CHECK --> AGGREGATE
TYPE_CHECK --> CAST
LITERAL --> INFER_LIT
INFER_LIT --> RESULT
COLUMN --> LOOKUP
LOOKUP --> NORMALIZE
NORMALIZE --> RESULT
BINARY --> CHECK_FLOAT
CHECK_FLOAT -->|Uses Float| FLOAT_RESULT
CHECK_FLOAT -->|Integer only| INT_RESULT
FLOAT_RESULT --> RESULT
INT_RESULT --> RESULT
AGGREGATE --> AGG_RESULT
AGG_RESULT --> RESULT
CAST --> CAST_TYPE
CAST_TYPE --> RESULT
Type Inference Integration
After translation, expressions with FieldId references can be used for schema-based type inference.
The infer_computed_data_type function in llkv-executor/src/translation/schema.rs inspects translated expressions to determine their Arrow data types:
Type Inference for Translated Expressions
Sources: llkv-executor/src/translation/schema.rs:53-123
Type Inference Rules
| Expression | Inferred Type | Notes |
|---|---|---|
Literal::Integer | Int64 | 64-bit signed integer |
Literal::Float | Float64 | 64-bit floating point |
Literal::Decimal | Decimal128(p,s) | Precision and scale from value |
Literal::Boolean | Boolean | Boolean flag |
Literal::String | Utf8 | UTF-8 string |
Literal::Null | Null | Null type marker |
Column(fid) | Schema lookup | normalized_numeric_type(column.data_type) |
Binary | Float64 or Int64 | Float if any operand is float |
Compare | Int64 | Comparisons produce boolean (0/1) as Int64 |
Aggregate | Int64 | Most aggregates return Int64 |
Cast | Specified type | Uses explicit data_type parameter |
The normalized_numeric_type function maps small integer types (Int8, Int16, Int32) to Int64 and unsigned/float types to Float64 for consistent expression evaluation.
Sources: llkv-executor/src/translation/schema.rs:125-147
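A sketch of that normalization rule, assuming the arrow crate's DataType enum (the actual function in llkv-executor may differ in detail):

```rust
// Illustrative numeric normalization: small signed integers widen to Int64,
// unsigned and floating-point types evaluate as Float64, everything else
// passes through unchanged.
use arrow::datatypes::DataType;

fn normalized_numeric_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => DataType::Int64,
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64
        | DataType::Float32
        | DataType::Float64 => DataType::Float64,
        other => other.clone(),
    }
}

fn main() {
    assert_eq!(normalized_numeric_type(&DataType::Int16), DataType::Int64);
    assert_eq!(normalized_numeric_type(&DataType::Float32), DataType::Float64);
}
```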
Translation in Context
The translation phase fits into the broader query execution pipeline:
Translation Phase in Query Pipeline
Sources: Based on Diagram 2 and Diagram 5 from system architecture overview
The translation layer serves as the bridge between the human-readable SQL layer (with column names) and the machine-optimized execution layer (with numeric field IDs). This separation allows:
- Planning flexibility : Query plans can reference columns symbolically without knowing physical storage details
- Schema evolution : Field IDs remain stable even if column names change
- Type safety : The type system prevents mixing planning-phase and execution-phase expressions
- Optimization : Numeric field IDs enable efficient lookups in columnar storage
Sources: llkv-expr/src/expr.rs:1-359 llkv-executor/src/translation/expression.rs:1-424 llkv-executor/src/translation/schema.rs:1-338
Program Compilation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the predicate compilation system in LLKV, which transforms typed predicate expressions (Expr<FieldId>) into stack-based bytecode programs for efficient evaluation during table scans. The compilation process produces two types of programs: EvalProgram for predicate evaluation and DomainProgram for domain analysis (determining which row IDs might match predicates).
For information about the expression AST structure itself, see Expression AST. For how these compiled programs are evaluated during execution, see Filter Evaluation. For how expressions are translated from string column names to field IDs, see Expression Translation.
Compilation Pipeline Overview
The compilation process transforms a normalized predicate expression into executable bytecode through the ProgramCompiler in llkv-table/src/planner/program.rs:257-284. The compiler produces a ProgramSet containing both evaluation and domain analysis programs.
Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/mod.rs:612-637
graph TB
Input["Expr<FieldId>\n(Raw predicate)"]
Normalize["normalize_predicate()\n(Apply De Morgan's laws,\nflatten And/Or)"]
Compiler["ProgramCompiler::new()\nProgramCompiler::compile()"]
ProgramSet["ProgramSet"]
EvalProgram["EvalProgram\n(Stack-based bytecode\nfor evaluation)"]
DomainRegistry["DomainRegistry\n(Domain programs\nfor optimization)"]
Input --> Normalize
Normalize --> Compiler
Compiler --> ProgramSet
ProgramSet --> EvalProgram
ProgramSet --> DomainRegistry
EvalOps["EvalOp instructions:\nPushPredicate, And, Or,\nNot, FusedAnd"]
DomainOps["DomainOp instructions:\nPushFieldAll, Union,\nIntersect"]
EvalProgram --> EvalOps
DomainRegistry --> DomainOps
Compilation Entry Point
The TablePlanner::plan_scan method invokes the compiler when preparing a scan operation. The process begins with predicate normalization followed by compilation into both evaluation and domain programs.
Sources: llkv-table/src/planner/mod.rs:625-628 llkv-table/src/planner/program.rs:266-283
graph LR
TablePlanner["TablePlanner::plan_scan"]
NormFilter["normalize_predicate(filter_expr)"]
CreateCompiler["ProgramCompiler::new(Arc<Expr>)"]
Compile["compiler.compile()"]
ProgramSet["ProgramSet<'expr>"]
TablePlanner --> NormFilter
NormFilter --> CreateCompiler
CreateCompiler --> Compile
Compile --> ProgramSet
ProgramSet --> EvalProgram["EvalProgram\n(ops: Vec<EvalOp>)"]
ProgramSet --> DomainRegistry["DomainRegistry\n(programs, index)"]
Predicate Normalization
Before compilation, predicates are normalized using normalize_predicate() to simplify the expression tree. Normalization applies two key transformations:
Transformation Rules
| Input Pattern | Output Pattern | Description |
|---|---|---|
And(And(a, b), c) | And(a, b, c) | Flatten nested AND operations |
Or(Or(a, b), c) | Or(a, b, c) | Flatten nested OR operations |
Not(And(a, b)) | Or(Not(a), Not(b)) | De Morgan's law for AND |
Not(Or(a, b)) | And(Not(a), Not(b)) | De Morgan's law for OR |
Not(Not(a)) | a | Double negation elimination |
Not(Literal(true)) | Literal(false) | Literal inversion |
Not(IsNull{expr, negated}) | IsNull{expr, !negated} | IsNull negation flip |
Normalization Algorithm
The normalize_expr() function recursively transforms the expression tree using pattern matching:
Sources: llkv-table/src/planner/program.rs:286-343
graph TD
Start["normalize_expr(expr)"]
CheckAnd{"expr is And?"}
CheckOr{"expr is Or?"}
CheckNot{"expr is Not?"}
Other["Return expr unchanged"]
FlattenAnd["Flatten nested And nodes\ninto single And"]
FlattenOr["Flatten nested Or nodes\ninto single Or"]
ApplyDeMorgan["normalize_negated(inner)\nApply De Morgan's laws"]
Start --> CheckAnd
CheckAnd -->|Yes| FlattenAnd
CheckAnd -->|No| CheckOr
CheckOr -->|Yes| FlattenOr
CheckOr -->|No| CheckNot
CheckNot -->|Yes| ApplyDeMorgan
CheckNot -->|No| Other
FlattenAnd --> Return["Return normalized expr"]
FlattenOr --> Return
ApplyDeMorgan --> Return
Other --> Return
EvalProgram Compilation
The compile_eval() function generates a sequence of EvalOp instructions using iterative traversal with an explicit work stack. This avoids stack overflow on deeply nested expressions and produces postorder bytecode.
EvalOp Instruction Set
The EvalOp enum defines the instruction types for predicate evaluation:
| Instruction | Description | Stack Effect |
|---|---|---|
PushPredicate(OwnedFilter) | Push single predicate result | → bool |
PushCompare{left, op, right} | Evaluate comparison expression | → bool |
PushInList{expr, list, negated} | Evaluate IN list membership | → bool |
PushIsNull{expr, negated} | Evaluate IS NULL test | → bool |
PushLiteral(bool) | Push constant boolean value | → bool |
FusedAnd{field_id, filters} | Optimized AND for same field | → bool |
And{child_count} | Pop N bools, push AND result | bool×N → bool |
Or{child_count} | Pop N bools, push OR result | bool×N → bool |
Not{domain} | Pop bool, push NOT result | bool → bool |
graph TB
subgraph "Compilation Phases"
Input["Expression Tree"]
Phase1["Enter Phase\n(Pre-order traversal)"]
Phase2["Exit Phase\n(Post-order emission)"]
Output["Vec<EvalOp>"]
end
Input --> Phase1
Phase1 --> Phase2
Phase2 --> Output
subgraph "Frame Types"
EnterFrame["EvalVisit::Enter(expr)\nPush children in reverse order"]
ExitFrame["EvalVisit::Exit(expr)\nEmit instruction"]
FusedFrame["EvalVisit::EmitFused\nEmit FusedAnd optimization"]
end
Phase1 --> EnterFrame
EnterFrame --> ExitFrame
EnterFrame --> FusedFrame
ExitFrame --> Phase2
FusedFrame --> Phase2
Compilation Process
The compiler uses a two-pass approach with EvalVisit frames to track traversal state:
Sources: llkv-table/src/planner/program.rs:407-516
Predicate Fusion Optimization
When the compiler encounters an And node where all children are predicates (Expr::Pred) on the same FieldId, it emits a single FusedAnd instruction instead of multiple individual predicates. This optimization is detected by gather_fused():
Sources: llkv-table/src/planner/program.rs:518-542
graph LR
AndNode["And(children)"]
Check["gather_fused(children)"]
Decision{"All children\nare Pred(field_id)\nwith same field_id?"}
Fused["Emit FusedAnd{\nfield_id,\nfilters\n}"]
Normal["Emit individual\nPushPredicate\nfor each child\n+ And instruction"]
AndNode --> Check
Check --> Decision
Decision -->|Yes| Fused
Decision -->|No| Normal
DomainProgram Compilation
The compile_domain() function generates DomainOp instructions for domain analysis. Domain programs determine which row IDs might satisfy a predicate without evaluating the full expression, enabling storage-layer optimizations.
DomainOp Instruction Set
| Instruction | Description | Stack Effect |
|---|---|---|
PushFieldAll(FieldId) | All rows where field exists | → RowSet |
PushCompareDomain{left, right, op, fields} | Domain of rows for comparison | → RowSet |
PushInListDomain{expr, list, fields, negated} | Domain of rows for IN list | → RowSet |
PushIsNullDomain{expr, fields, negated} | Domain of rows for IS NULL | → RowSet |
PushLiteralFalse | Empty row set | → RowSet |
PushAllRows | All rows in table | → RowSet |
Union{child_count} | Pop N sets, push union | RowSet×N → RowSet |
Intersect{child_count} | Pop N sets, push intersection | RowSet×N → RowSet |
Domain Analysis Algorithm
Domain compilation uses iterative traversal with DomainVisit frames, similar to eval compilation but with different semantics:
graph TB
Start["compile_domain(expr)"]
Stack["Work stack:\nVec<DomainVisit>"]
Ops["Output:\nVec<DomainOp>"]
EnterFrame["DomainVisit::Enter(node)\nPush children + Exit frame"]
ExitFrame["DomainVisit::Exit(node)\nEmit DomainOp"]
Start --> Stack
Stack --> Process{"Pop frame"}
Process -->|Enter| EnterFrame
Process -->|Exit| ExitFrame
EnterFrame --> Stack
ExitFrame --> Emit["Emit instruction to ops"]
Emit --> Stack
Process -->|Empty| Done["Return DomainProgram{ops}"]
Domain Semantics
The domain of an expression represents the set of row IDs where the expression could potentially be true (or false for Not):
| Expression Type | Domain Semantics |
|---|---|
Pred(filter) | All rows where filter.field_id exists |
Compare{left, right, op} | Union of domains of all fields in left and right |
InList{expr, list} | Union of domains of all fields in expr and list items |
IsNull{expr} | Union of domains of all fields in expr |
Literal(true) | All rows in table |
Literal(false) | Empty set |
And(children) | Intersection of children domains |
Or(children) | Union of children domains |
Not(inner) | Same as inner domain (NOT doesn't change domain) |
Sources: llkv-table/src/planner/program.rs:544-631
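A small sketch of how these semantics compose, using plain row-id sets in place of the engine's internal representation:

```rust
// Toy domain combination: And intersects child domains, Or unions them.
use std::collections::BTreeSet;

fn main() {
    let age_rows: BTreeSet<u64> = [1, 2, 3, 4].into();
    let name_rows: BTreeSet<u64> = [2, 3, 5].into();
    let status_rows: BTreeSet<u64> = [7, 8].into();

    // Domain of (age_pred AND name_pred): intersection.
    let and_domain: BTreeSet<u64> = age_rows.intersection(&name_rows).copied().collect();
    // Domain of (... OR status_pred): union.
    let or_domain: BTreeSet<u64> = and_domain.union(&status_rows).copied().collect();

    let expected: BTreeSet<u64> = [2, 3, 7, 8].into();
    assert_eq!(or_domain, expected);
}
```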
graph TB
Input["ScalarExpr<FieldId>"]
Stack["Work stack:\nVec<&ScalarExpr>"]
Seen["FxHashSet<FieldId>\n(deduplication)"]
Output["Vec<FieldId>\n(sorted)"]
Input --> Stack
Stack --> Process{"Pop expr"}
Process -->|Column fid| AddField["Insert fid into seen"]
Process -->|Literal| Skip["Skip (no fields)"]
Process -->|Binary| PushChildren["Push left, right\nto stack"]
Process -->|Compare| PushChildren
Process -->|Aggregate| PushAggExpr["Push aggregate expr\nto stack"]
Process -->|Other| PushNested["Push nested exprs\nto stack"]
AddField --> Stack
Skip --> Stack
PushChildren --> Stack
PushAggExpr --> Stack
PushNested --> Stack
Process -->|Empty| Collect["Collect seen into Vec\nSort unstable"]
Collect --> Output
Field Collection for Domain Analysis
The collect_fields() function extracts all FieldId references from scalar expressions using iterative traversal. This determines which columns' row sets need to be considered during domain evaluation.
Sources: llkv-table/src/planner/program.rs:633-709
Data Structures
ProgramSet
The top-level container returned by compilation, holding all compiled artifacts:
ProgramSet<'expr> {
eval: EvalProgram, // Bytecode for predicate evaluation
domains: DomainRegistry, // Domain programs for optimization
_root_expr: Arc<Expr<'expr, FieldId>> // Original expression (lifetime anchor)
}
Sources: llkv-table/src/planner/program.rs:23-29
DomainRegistry
Manages the collection of compiled domain programs with deduplication via ExprKey:
DomainRegistry {
programs: Vec<DomainProgram>, // All compiled domain programs
index: FxHashMap<ExprKey, DomainProgramId>, // Expression → program ID map
root: Option<DomainProgramId> // ID of root domain program
}
The registry uses ExprKey (a pointer-based key) to detect duplicate subexpressions and reuse compiled domain programs.
Sources: llkv-table/src/planner/program.rs:12-20 llkv-table/src/planner/program.rs:196-219
OwnedFilter and OwnedOperator
To support owned bytecode programs with no lifetime constraints, the compiler converts borrowed Filter<'a, FieldId> and Operator<'a> types into owned variants:
| Borrowed Type | Owned Type | Purpose |
|---|---|---|
Filter<'a, FieldId> | OwnedFilter | Stores field_id + owned operator |
Operator<'a> | OwnedOperator | Owns pattern strings and literal vectors |
This conversion happens during compile_eval() when creating PushPredicate and FusedAnd instructions.
Sources: llkv-table/src/planner/program.rs:69-191
Integration with Table Scanning
The compiled programs are used during table scan execution in two ways:
- EvalProgram : Evaluated per-row or per-batch during scan to determine which rows match the predicate
- DomainProgram : Used for storage-layer optimizations to skip scanning columns or chunks that cannot match
Usage in PlannedScan
The TablePlanner::plan_scan() method creates a PlannedScan struct that bundles the compiled programs with scan metadata:
PlannedScan<'expr, P> {
projections: Vec<ScanProjection>,
filter_expr: Arc<Expr<'expr, FieldId>>,
options: ScanStreamOptions<P>,
plan_graph: PlanGraph, // For query plan visualization
programs: ProgramSet<'expr> // Compiled evaluation and domain programs
}
Sources: llkv-table/src/planner/mod.rs:500-509 llkv-table/src/planner/mod.rs:630-636
Example Compilation
Consider the predicate: (age > 18 AND name LIKE 'A%') OR status = 'active'
After normalization, this remains unchanged (no nested And/Or to flatten). The compiler produces:
EvalProgram Instructions
1. PushPredicate(Filter { field_id: age, op: GreaterThan(18) })
2. PushPredicate(Filter { field_id: name, op: StartsWith("A", true) })
3. And { child_count: 2 }
4. PushPredicate(Filter { field_id: status, op: Equals("active") })
5. Or { child_count: 2 }
DomainProgram Instructions
1. PushFieldAll(age) // Domain for first predicate
2. PushFieldAll(name) // Domain for second predicate
3. Intersect { child_count: 2 } // AND combines via intersection
4. PushFieldAll(status) // Domain for third predicate
5. Union { child_count: 2 } // OR combines via union
The domain program indicates that, to potentially match, a row must have values for both age and name, or a value for status.
Sources: llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631
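To show how such postorder bytecode executes, here is a toy stack machine over already-evaluated predicate results; real EvalOp instructions evaluate filters against column data rather than pushing precomputed booleans:

```rust
// Tiny stack-machine sketch for postorder predicate bytecode.
enum EvalOp {
    PushLiteral(bool), // stand-in for PushPredicate evaluated on one row
    And { child_count: usize },
    Or { child_count: usize },
    Not,
}

fn evaluate(ops: &[EvalOp]) -> bool {
    let mut stack: Vec<bool> = Vec::new();
    for op in ops {
        match op {
            EvalOp::PushLiteral(v) => stack.push(*v),
            EvalOp::And { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.split_off(start).into_iter().all(|v| v);
                stack.push(result);
            }
            EvalOp::Or { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.split_off(start).into_iter().any(|v| v);
                stack.push(result);
            }
            EvalOp::Not => {
                let v = stack.pop().expect("operand");
                stack.push(!v);
            }
        }
    }
    stack.pop().expect("final result")
}

fn main() {
    // (age > 18 AND name LIKE 'A%') OR status = 'active', with the three
    // predicates already evaluated for one row as true, false, true.
    let program = [
        EvalOp::PushLiteral(true),
        EvalOp::PushLiteral(false),
        EvalOp::And { child_count: 2 },
        EvalOp::PushLiteral(true),
        EvalOp::Or { child_count: 2 },
    ];
    assert!(evaluate(&program));
}
```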
graph LR
Recursive["Recursive approach\n(Stack overflow risk)"]
Iterative["Iterative approach\n(Explicit work stack)"]
Recursive -->|Replace with| Iterative
Iterative --> WorkStack["Vec<Frame>\n(Heap-allocated)"]
Iterative --> ResultStack["Vec<Result>\n(Post-order accumulation)"]
Stack Overflow Prevention
Both compile_eval() and compile_domain() use explicit work stacks instead of recursion to handle deeply nested expressions (50k+ nodes) without stack overflow. This follows the iterative traversal pattern described in the codebase:
The pattern uses Enter/Exit frames to simulate recursive descent and ascent, accumulating results in a separate stack during the Exit phase.
Sources: llkv-table/src/planner/program.rs:407-516 llkv-table/src/planner/program.rs:544-631
Scalar Evaluation and NumericKernels
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the scalar expression evaluation engine used during table scans to compute expressions like col1 + col2 * 3, CAST(col AS Float64), and CASE statements. The NumericKernels utility centralizes numeric computation logic, providing both row-by-row and vectorized batch evaluation strategies. For the abstract expression AST that gets evaluated, see Expression AST. For how expressions are compiled into bytecode programs for predicate evaluation, see Program Compilation.
Overview
The scalar evaluation system provides a unified numeric computation layer that operates over Arrow arrays during table scans. When a query contains computed projections like SELECT col1 + col2 AS sum FROM table, the executor needs to efficiently evaluate these expressions across potentially millions of rows. The NumericKernels struct and associated types provide:
- Type abstraction : Wraps Arrow's Int64Array, Float64Array, and Decimal128Array into a unified NumericArray interface
- Evaluation strategies : Supports both row-by-row evaluation (for complex expressions) and vectorized batch evaluation (for simple arithmetic)
- Optimization : Applies algebraic simplification to detect affine transformations and constant folding opportunities
- Type coercion : Handles implicit casting between integer, float, and decimal types following SQLite-style semantics
Sources : llkv-table/src/scalar_eval.rs:1-22
graph TB
subgraph "Input Layer"
ARROW_INT["Int64Array\n(Arrow)"]
ARROW_FLOAT["Float64Array\n(Arrow)"]
ARROW_DEC["Decimal128Array\n(Arrow)"]
end
subgraph "Abstraction Layer"
NUM_ARRAY["NumericArray\nkind: NumericKind\nlen: usize"]
NUM_VALUE["NumericValue\nInteger(i64)\nFloat(f64)\nDecimal(DecimalValue)"]
end
subgraph "Evaluation Engine"
KERNELS["NumericKernels\nevaluate_value()\nevaluate_batch()\nsimplify()"]
end
subgraph "Output Layer"
RESULT_ARRAY["ArrayRef\n(Arrow)"]
end
ARROW_INT --> NUM_ARRAY
ARROW_FLOAT --> NUM_ARRAY
ARROW_DEC --> NUM_ARRAY
NUM_ARRAY --> NUM_VALUE
NUM_VALUE --> KERNELS
KERNELS --> RESULT_ARRAY
style KERNELS fill:#e1f5ff
Core Data Types
NumericKind
An enum distinguishing the underlying numeric representation. This preserves type information through evaluation to enable intelligent casting decisions:
Sources : llkv-table/src/scalar_eval.rs:26-32
NumericValue
A tagged union representing a single numeric value while preserving its original type. Provides conversion methods to target types:
| Variant | Description | Conversion Methods |
|---|---|---|
Integer(i64) | Signed 64-bit integer | as_f64(), as_i64() |
Float(f64) | 64-bit floating point | as_f64() |
Decimal(DecimalValue) | Fixed-precision decimal | as_f64() |
All variants support .kind() to retrieve the original NumericKind.
Sources : llkv-table/src/scalar_eval.rs:34-69
NumericArray
Wraps Arrow array types with a unified interface for numeric access. Internally stores optional Arc<Int64Array>, Arc<Float64Array>, or Arc<Decimal128Array> based on the kind field:
Key Methods :
graph LR
subgraph "NumericArray"
KIND["kind: NumericKind"]
LEN["len: usize"]
INT_DATA["int_data: Option<Arc<Int64Array>>"]
FLOAT_DATA["float_data: Option<Arc<Float64Array>>"]
DECIMAL_DATA["decimal_data: Option<Arc<Decimal128Array>>"]
end
KIND -.determines.-> INT_DATA
KIND -.determines.-> FLOAT_DATA
KIND -.determines.-> DECIMAL_DATA
- try_from_arrow(array: &ArrayRef): Constructs from any Arrow array, applying type casting as needed
- value(idx: usize): Extracts Option<NumericValue> at the given index
- promote_to_float(): Converts to Float64 representation for mixed-type arithmetic
- to_array_ref(): Exports back to an Arrow ArrayRef
Sources : llkv-table/src/scalar_eval.rs:83-383
NumericKernels API
The NumericKernels struct provides static methods for expression evaluation and optimization. It serves as the primary entry point for scalar computation during table scans.
Field Collection
Recursively traverses a scalar expression to identify all referenced column fields. Used by the table planner to determine which columns must be fetched from storage.
Sources : llkv-table/src/scalar_eval.rs:455-526
Array Preparation
Converts a set of Arrow arrays into the NumericArray representation, applying type coercion as needed. The needed_fields parameter filters to only the columns referenced by the expression being evaluated. Returns a FxHashMap<FieldId, NumericArray> for fast lookup during evaluation.
Sources : llkv-table/src/scalar_eval.rs:528-547
Value-by-Value Evaluation
Evaluates a scalar expression for a single row at index idx. Supports:
- Binary arithmetic (+, -, *, /, %)
- Comparisons (=, <, >, etc.)
- Logical operators (NOT, IS NULL)
- Type casts (CAST(... AS Float64))
- Control flow (CASE, COALESCE)
- Random number generation (RANDOM())
Returns None for NULL propagation.
Sources : llkv-table/src/scalar_eval.rs:549-673
Batch Evaluation
Evaluates an expression across all rows in a batch, returning an ArrayRef. The implementation attempts vectorized evaluation for simple expressions (single column, literals, affine transformations) and falls back to row-by-row evaluation for complex cases.
Sources : llkv-table/src/scalar_eval.rs:676-712
graph TB
EXPR["ScalarExpr<FieldId>"]
SIMPLIFY["simplify()\nDetect affine patterns"]
VECTORIZE["try_evaluate_vectorized()\nCheck for fast path"]
FAST["Vectorized Evaluation\nDirect Arrow compute"]
SLOW["Row-by-Row Loop\nevaluate_value()
per row"]
RESULT["ArrayRef"]
EXPR --> SIMPLIFY
SIMPLIFY --> VECTORIZE
VECTORIZE -->|Success| FAST
VECTORIZE -->|Fallback| SLOW
FAST --> RESULT
SLOW --> RESULT
Vectorization and Optimization
VectorizedExpr
Internal representation for expressions that can be evaluated without per-row dispatch:
The try_evaluate_vectorized method attempts to decompose complex expressions into VectorizedExpr nodes, enabling efficient vectorized computation for binary operations between arrays and scalars.
Sources : llkv-table/src/scalar_eval.rs:385-414
graph LR
INPUT["col * 3 + 5"]
DETECT["Detect Affine Pattern"]
AFFINE["AffineExpr\nfield: col\nscale: 3.0\noffset: 5.0"]
FAST_EVAL["Single Pass Evaluation\nemit_no_nulls()"]
INPUT --> DETECT
DETECT --> AFFINE
AFFINE --> FAST_EVAL
Affine Expression Detection
The simplify method detects affine transformations of the form scale * field + offset:
When an affine pattern is detected, the executor can apply the transformation in a single pass without intermediate allocations. The try_extract_affine_expr method recursively analyzes binary arithmetic trees to identify this pattern.
Sources : llkv-table/src/scalar_eval.rs:1138-1261
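A sketch of the single-pass affine evaluation, assuming the arrow crate's array types (the helper shape is illustrative, not the actual emit_no_nulls signature):

```rust
// Once `col * 3 + 5` is recognized as scale = 3.0, offset = 5.0, the whole
// column is produced in one pass, with NULLs propagating as None.
use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array, Int64Array};

fn apply_affine(values: &Int64Array, scale: f64, offset: f64) -> ArrayRef {
    let out: Float64Array = values
        .iter()
        .map(|v| v.map(|x| scale * x as f64 + offset)) // None (NULL) propagates
        .collect();
    Arc::new(out) as ArrayRef
}

fn main() {
    let input = Int64Array::from(vec![Some(1), None, Some(3)]);
    let result = apply_affine(&input, 2.0, 10.0); // [12.0, NULL, 16.0]
    assert_eq!(result.len(), 3);
}
```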
Constant Folding
The simplification pass performs constant folding for expressions like 2 + 3 or 10.0 / 2.0, replacing them with Literal(5) or Literal(5.0). This eliminates redundant computation during execution.
Sources : llkv-table/src/scalar_eval.rs:997-1137
Type Coercion and Casting
Implicit Coercion
When evaluating binary operations on mixed types, the system applies implicit promotion rules:
| Left Type | Right Type | Result Type | Behavior |
|---|---|---|---|
| Integer | Integer | Integer | No conversion |
| Integer | Float | Float | Promote left to Float64 |
| Float | Integer | Float | Promote right to Float64 |
| Decimal | Any | Float | Convert both to Float64 |
The infer_result_kind method determines the target type before evaluation, and to_aligned_array_ref applies the necessary promotions.
Sources : llkv-table/src/scalar_eval.rs:1398-1447
Explicit Casting
The CAST expression variant supports explicit type conversion:
Casting is handled during evaluation by:
- Evaluating the inner expression to a NumericValue
- Converting to the target NumericKind via cast_numeric_value_to_kind
- Constructing the result array with the target Arrow DataType
Special handling exists for DataType::Date32 casts, which use the llkv-plan date utilities.
Sources : llkv-table/src/scalar_eval.rs:1449-1472 llkv-table/src/scalar_eval.rs:611-624
sequenceDiagram
participant Planner as TablePlanner
participant Executor as TableExecutor
participant Kernels as NumericKernels
participant Store as ColumnStore
Planner->>Planner: Analyze projections
Planner->>Kernels: collect_fields(expr)
Kernels-->>Planner: Set<FieldId>
Planner->>Planner: Build unique_lfids list
Executor->>Store: Gather columns for row batch
Store-->>Executor: Vec<ArrayRef>
Executor->>Kernels: prepare_numeric_arrays(lfids, arrays, fields)
Kernels-->>Executor: NumericArrayMap
Executor->>Kernels: evaluate_batch_simplified(expr, len, arrays)
Kernels->>Kernels: try_evaluate_vectorized()
alt Vectorized
Kernels->>Kernels: compute_binary_array_array()
else Fallback
Kernels->>Kernels: Loop: evaluate_value(expr, idx)
end
Kernels-->>Executor: ArrayRef (result column)
Executor->>Executor: Append to RecordBatch
Integration with Table Scans
The numeric evaluation engine is invoked by the table executor when processing computed projections. The integration flow:
Projection Evaluation Context
The ProjectionEval enum distinguishes between direct column references and computed expressions:
For Computed variants, the planner:
- Calls NumericKernels::simplify() to optimize the expression
- Invokes NumericKernels::collect_fields() to determine dependencies
- Stores the simplified expression for evaluation
During execution, RowStreamBuilder materializes computed columns by calling evaluate_batch_simplified for each expression.
Sources : llkv-table/src/planner/mod.rs:494-498 llkv-table/src/planner/mod.rs:1073-1107
Passthrough Optimization
The planner detects when a computed expression is simply a column reference (after simplification) via NumericKernels::passthrough_column(). In this case, the column is fetched directly from storage without re-evaluation:
This avoids redundant computation for queries like SELECT col + 0 AS x.
Sources : llkv-table/src/planner/mod.rs:1110-1116 llkv-table/src/scalar_eval.rs:874-907
Data Type Inference
The evaluation engine must determine result types for expressions before evaluation to construct properly-typed Arrow arrays. The infer_computed_data_type function in llkv-executor delegates to numeric kernel logic:
| Expression Type | Inferred Data Type | Rule |
|---|---|---|
Literal(Integer) | Int64 | Direct mapping |
Literal(Float) | Float64 | Direct mapping |
Binary { ... } | Int64 or Float64 | Based on operand types |
Compare { ... } | Int64 | Boolean as 0/1 integer |
Cast { data_type, ... } | data_type | Explicit type |
Random | Float64 | Always float |
The expression_uses_float helper recursively checks if any operand is floating-point, promoting the result type accordingly.
Sources : llkv-executor/src/translation/schema.rs:53-123
Performance Characteristics
Row-by-Row Evaluation
Used for:
- Expressions with control flow (CASE, COALESCE)
- Expressions containing CAST to non-numeric types
- Expressions with interval arithmetic (date operations)
Cost : O(n) row dispatch overhead, branch mispredictions on conditionals
Vectorized Evaluation
Used for:
- Simple arithmetic (col1 + col2, col * 3)
- Single column references
- Constant literals
Cost : O(n) with SIMD-friendly memory access patterns, no per-row dispatch
graph LR
INPUT["Int64Array\n[1,2,3,4,5]"]
AFFINE["scale=2.0\noffset=10.0"]
CALLBACK["emit_no_nulls(\nlen, /i/ 2.0*values[i]+10.0\n)"]
OUTPUT["Float64Array\n[12,14,16,18,20]"]
INPUT --> AFFINE
AFFINE --> CALLBACK
CALLBACK --> OUTPUT
Affine Evaluation
Special case for scale * field + offset expressions. The executor generates values directly into the output buffer using emit_no_nulls or emit_with_nulls callbacks, avoiding intermediate allocations.
Sources : llkv-table/src/planner/mod.rs:253-357
Key Implementation Details
NULL Handling
NULL values propagate through arithmetic operations according to SQL semantics:
- NULL + 5 → NULL
- NULL IS NULL → 1 (true)
- COALESCE(NULL, 5) → 5
The NumericValue is wrapped in Option<T>, with None representing SQL NULL. Binary operations return None if either operand is None.
Sources : llkv-table/src/scalar_eval.rs:564-571
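A minimal sketch of this Option-based propagation:

```rust
// SQL-style NULL propagation using Option: any None operand yields None.
fn add(left: Option<i64>, right: Option<i64>) -> Option<i64> {
    match (left, right) {
        (Some(l), Some(r)) => Some(l + r),
        _ => None, // NULL + anything = NULL
    }
}

fn main() {
    assert_eq!(add(Some(2), Some(3)), Some(5));
    assert_eq!(add(None, Some(5)), None); // NULL + 5 → NULL
}
```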
Type Safety
The system maintains type safety through:
- Tagged unions : NumericValue preserves the original type via the discriminant
- Explicit promotion : promote_to_float() is called only when type mixing requires it
- Result type inference : The planner determines output types before evaluation
This prevents silent precision loss and enables query optimizations based on type information.
Sources : llkv-table/src/scalar_eval.rs:295-342
Memory Efficiency
The NumericArray struct uses Arc<T> for backing arrays, enabling zero-copy sharing when:
- Returning a column directly without computation
- Slicing arrays for sorted run evaluation
- Sharing arrays across multiple expressions referencing the same column
The to_array_ref() method clones the Arc, not the underlying data.
Sources : llkv-table/src/scalar_eval.rs:275-293
Aggregation System
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/src/lib.rs
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_value.rs
The aggregation system evaluates SQL aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.) over Arrow RecordBatch streams. It consists of a planning layer that defines aggregate specifications and an execution layer that performs incremental accumulation with overflow checking and DISTINCT value tracking.
For information about scalar expression evaluation, see Scalar Evaluation and NumericKernels. For query execution orchestration, see Query Execution.
Architecture Overview
The aggregation system operates across three crates:
Sources: llkv-aggregate/src/lib.rs:1-1935 llkv-executor/src/lib.rs:1-3599 llkv-plan/src/plans.rs:1-1458
The planner creates AggregateExpr instances from SQL AST nodes, which the executor converts to AggregateSpec descriptors. These specs initialize AggregateAccumulator instances that process batches incrementally, accumulating values in memory. The AggregateState wraps the accumulator with metadata (alias, override values) and produces the final output arrays.
Aggregate Specification
AggregateSpec Structure
AggregateSpec defines an aggregate operation at plan-time:
| Field | Type | Purpose |
|---|---|---|
alias | String | Output column name for the aggregate result |
kind | AggregateKind | Type of aggregate operation and its parameters |
Sources: llkv-aggregate/src/lib.rs:23-27
AggregateKind Variants
Sources: llkv-aggregate/src/lib.rs:30-67
Each variant captures the field ID to aggregate over, the expected data type, and operation-specific flags like distinct or separator. The field_id is optional for COUNT(*) which counts all rows regardless of column values.
Accumulator System
Accumulator Variants
AggregateAccumulator implements streaming accumulation for each aggregate type:
Sources: llkv-aggregate/src/lib.rs:92-247
graph TB
subgraph "COUNT Variants"
CS[CountStar\nvalue: i64]
CC[CountColumn\ncolumn_index: usize\nvalue: i64]
CDC[CountDistinctColumn\ncolumn_index: usize\nseen: FxHashSet]
end
subgraph "SUM Variants"
SI64[SumInt64\nvalue: Option-i64-\nhas_values: bool]
SDI64[SumDistinctInt64\nsum: Option-i64-\nseen: FxHashSet]
SF64[SumFloat64\nvalue: f64\nsaw_value: bool]
SDF64[SumDistinctFloat64\nsum: f64\nseen: FxHashSet]
SD128[SumDecimal128\nsum: i128\nprecision: u8\nscale: i8]
end
subgraph "AVG Variants"
AI64[AvgInt64\nsum: i64\ncount: i64]
ADI64[AvgDistinctInt64\nsum: i64\nseen: FxHashSet]
AF64[AvgFloat64\nsum: f64\ncount: i64]
end
subgraph "MIN/MAX Variants"
MinI64[MinInt64\nvalue: Option-i64-]
MaxI64[MaxInt64\nvalue: Option-i64-]
MinF64[MinFloat64\nvalue: Option-f64-]
MaxF64[MaxFloat64\nvalue: Option-f64-]
end
Each accumulator variant is specialized for its data type and operation semantics. Integer accumulators track overflow using Option<i64> (None indicates overflow), while float accumulators use f64 which never overflows. Distinct variants maintain a FxHashSet of seen values.
sequenceDiagram
participant Executor
participant AggregateSpec
participant AggregateAccumulator
participant RecordBatch
participant OutputArray
Executor->>AggregateSpec: new_with_projection_index()
AggregateSpec->>AggregateAccumulator: Create accumulator
loop For each batch
Executor->>RecordBatch: Stream next batch
RecordBatch->>AggregateAccumulator: update(batch)
Note over AggregateAccumulator: Accumulate values\nCheck overflow\nTrack distinct keys
end
Executor->>AggregateAccumulator: finalize()
AggregateAccumulator->>OutputArray: (Field, ArrayRef)
OutputArray->>Executor: Return result
Accumulator Lifecycle
Sources: llkv-aggregate/src/lib.rs:460-746 llkv-aggregate/src/lib.rs:748-1440 llkv-aggregate/src/lib.rs:1442-1934
The accumulator is initialized with a projection index indicating which column in the RecordBatch to aggregate. The update() method processes each batch incrementally, and finalize() produces the final Arrow array and field schema.
Distinct Value Tracking
DistinctKey Enumeration
The system tracks distinct values using a hash-based approach:
| Variant | Type | Purpose |
|---|---|---|
Int(i64) | Integer values | Exact integer comparison |
Float(u64) | Float bit pattern | Bitwise float equality |
Str(String) | String values | Text comparison |
Bool(bool) | Boolean values | True/false comparison |
Date(i32) | Date32 values | Date comparison |
Decimal(i128) | Decimal raw value | Exact decimal comparison |
Sources: llkv-aggregate/src/lib.rs:249-257 llkv-aggregate/src/lib.rs:259-333
Float values are converted to bit patterns (to_bits()) to enable hash-based deduplication while preserving NaN and infinity semantics. Decimal values use raw i128 representation for exact comparison without scale conversion.
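A small sketch of why bit patterns are used as hash keys for floats:

```rust
// f64 is not Hash/Eq directly, but its to_bits() representation is, and it
// distinguishes values exactly (signed zero stays distinct, identical NaN
// bit patterns deduplicate).
use std::collections::HashSet;

fn main() {
    let mut seen: HashSet<u64> = HashSet::new();
    for v in [1.5_f64, 1.5, f64::NAN, f64::NAN, -0.0, 0.0] {
        seen.insert(v.to_bits());
    }
    // 1.5 deduplicates, the two NaN constants share one bit pattern,
    // and -0.0 differs from 0.0: four distinct keys in total.
    assert_eq!(seen.len(), 4);
}
```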
Distinct Accumulation Example
For COUNT(DISTINCT column), the accumulator inserts each non-null value into the hash set:
Sources: llkv-aggregate/src/lib.rs:785-798 llkv-aggregate/src/lib.rs:1465-1473
graph LR
Batch1[RecordBatch 1\nvalues: 1,2,3]
Batch2[RecordBatch 2\nvalues: 2,3,4]
Batch3[RecordBatch 3\nvalues: 1,4,5]
Batch1 --> HS1[seen: {1,2,3}]
Batch2 --> HS2[seen: {1,2,3,4}]
Batch3 --> HS3[seen: {1,2,3,4,5}]
HS3 --> Result[COUNT: 5]
The hash set automatically deduplicates values across batches. Only the set size is returned as the final count, avoiding materialization of the entire set in the output.
Aggregate Functions
COUNT Family
| Function | Null Handling | Return Type | Overflow |
|---|---|---|---|
COUNT(*) | Counts all rows | Int64 | Checked |
COUNT(column) | Skips NULL values | Int64 | Checked |
COUNT(DISTINCT column) | Skips NULL, deduplicates | Int64 | Checked |
Sources: llkv-aggregate/src/lib.rs:467-485 llkv-aggregate/src/lib.rs:759-783 llkv-aggregate/src/lib.rs:1452-1473
COUNT operations verify that the result fits in i64 range. COUNT(*) accumulates batch row counts directly. COUNT(column) filters invalid (NULL) rows using array.is_valid(i). COUNT(DISTINCT) maintains a hash set and returns its size.
SUM and TOTAL
| Function | Overflow Behavior | Return Type | NULL Result |
|---|---|---|---|
SUM(int_column) | Returns error | Int64 | NULL if no values |
SUM(float_column) | Accumulates infinities | Float64 | NULL if no values |
TOTAL(int_column) | Converts to Float64 | Float64 | 0.0 if no values |
TOTAL(float_column) | Accumulates infinities | Float64 | 0.0 if no values |
Sources: llkv-aggregate/src/lib.rs:486-541 llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975
graph LR
Input[Input Column]
Sum[Accumulate Sum]
Count[Count Non-NULL]
Div[Divide sum/count]
Output[Float64 Result]
Input --> Sum
Input --> Count
Sum --> Div
Count --> Div
Div --> Output
SUM uses checked_add for integers and returns an error on overflow. TOTAL never overflows because it accumulates as Float64 even for integer columns. The key difference is NULL handling: SUM returns NULL for empty input, TOTAL returns 0.0.
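A minimal illustration of this contract (free functions standing in for the real accumulators) contrasts SUM's checked integer accumulation with TOTAL's float accumulation:

```rust
// SUM: checked integer accumulation; overflow becomes an error, empty input is NULL (None).
fn sum_checked(values: &[i64]) -> Result<Option<i64>, &'static str> {
    let mut acc: Option<i64> = None;
    for &v in values {
        acc = Some(match acc {
            None => v,
            Some(s) => s.checked_add(v).ok_or("integer overflow in SUM")?,
        });
    }
    Ok(acc)
}

// TOTAL: accumulates as f64 from the start, so it cannot overflow and yields 0.0 for empty input.
fn total(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

fn main() {
    assert_eq!(sum_checked(&[1, 2, 3]), Ok(Some(6)));
    assert!(sum_checked(&[i64::MAX, 1]).is_err());
    assert_eq!(sum_checked(&[]), Ok(None));
    assert_eq!(total(&[]), 0.0);
}
```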
AVG (Average)
Sources: llkv-aggregate/src/lib.rs:598-654 llkv-aggregate/src/lib.rs:1096-1121 llkv-aggregate/src/lib.rs:1635-1645
AVG maintains separate sum and count accumulators. During finalization, it divides sum / count to produce a Float64 result. Integer sums are converted to Float64 for the division. If count is zero, AVG returns NULL.
MIN and MAX
| Data Type | Comparison Strategy | NULL Handling |
|---|---|---|
| Int64 | i64::min() / i64::max() | Skip NULL values |
| Float64 | partial_cmp() with NaN handling | Skip NULL values |
| Decimal128 | i128::min() / i128::max() on raw values | Skip NULL values |
| String | Numeric coercion via array_value_to_numeric() | Skip NULL values |
Sources: llkv-aggregate/src/lib.rs:656-710 llkv-aggregate/src/lib.rs:1259-1277 llkv-aggregate/src/lib.rs:1279-1300
MIN/MAX start with None and update to Some(value) on the first non-NULL entry. Subsequent values are compared using type-specific logic. Float comparisons use partial_cmp() to handle NaN values correctly.
graph LR
Values["Column Values:\n42, 'hello', 3.14"]
Convert[Convert to Strings:\n'42', 'hello', '3.14']
Join["Join with separator\n(default: ',')"]
Result["Result: '42,hello,3.14'"]
Values --> Convert
Convert --> Join
Join --> Result
GROUP_CONCAT
GROUP_CONCAT concatenates string representations of column values with a separator:
Sources: llkv-aggregate/src/lib.rs:722-744 llkv-aggregate/src/lib.rs:1409-1437 llkv-aggregate/src/lib.rs:1847-1874
The accumulator collects string representations using array_value_to_string() which coerces integers, floats, and booleans to text. DISTINCT variants track seen values in a hash set. Finalization joins the strings with the specified separator (default: ',').
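A minimal sketch of that finalization step (operating on already-stringifiable values rather than Arrow arrays):

```rust
// GROUP_CONCAT-style finalization: skip NULLs, stringify, join with the separator.
fn group_concat(values: &[Option<i64>], sep: &str) -> Option<String> {
    let parts: Vec<String> = values.iter().flatten().map(|v| v.to_string()).collect();
    if parts.is_empty() {
        None // no non-NULL inputs: result is NULL
    } else {
        Some(parts.join(sep))
    }
}

fn main() {
    assert_eq!(
        group_concat(&[Some(42), None, Some(3)], ","),
        Some("42,3".to_string())
    );
}
```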
Group-by Integration
Grouping Key Extraction
For GROUP BY queries, the executor extracts grouping keys from each row:
Sources: llkv-executor/src/lib.rs:1097-1173
sequenceDiagram
participant Executor
participant GroupMap
participant AggregateState
participant Accumulator
loop For each batch
Executor->>Executor: Extract group keys
loop For each group
Executor->>GroupMap: Get or create group
Executor->>AggregateState: Get accumulators for group
Executor->>Accumulator: Filter batch to group rows
Executor->>Accumulator: update(filtered_batch)
end
end
Executor->>GroupMap: Iterate all groups
loop For each group
Executor->>AggregateState: finalize()
AggregateState->>Executor: Return aggregate arrays
end
Each unique combination of group-by column values maps to a separate GroupAggregateState which tracks the representative row and a list of matching row locations across batches.
Aggregate Accumulation per Group
Sources: llkv-executor/src/lib.rs:1174-1383
The executor maintains separate accumulators for each group. When processing a batch, it filters rows by group membership using RowIdFilter and updates each group's accumulators independently. This ensures that SUM(sales) for group 'USA' only accumulates sales records where country='USA'.
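The per-group bookkeeping can be pictured with a plain hash map keyed by the group value, one accumulator per entry (a sketch of the idea, not the executor's GroupAggregateState):

```rust
use std::collections::HashMap;

// One SUM accumulator per group key; each row only touches its own group's state.
fn grouped_sums(rows: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut groups: HashMap<String, i64> = HashMap::new();
    for &(key, value) in rows {
        *groups.entry(key.to_string()).or_insert(0) += value;
    }
    groups
}

fn main() {
    let out = grouped_sums(&[("USA", 1_000_000), ("Canada", 750_000), ("USA", 500_000)]);
    assert_eq!(out["USA"], 1_500_000);
    assert_eq!(out["Canada"], 750_000);
}
```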
Output Construction
After processing all batches, the executor constructs the output RecordBatch:
| Column Type | Source | Construction |
|---|---|---|
| Group-by columns | Representative rows | Gathered from original batches |
| Aggregate columns | Finalized accumulators | Converted to Arrow arrays |
Sources: llkv-executor/src/lib.rs:1384-1467
The system gathers one representative row per group for the group-by columns, then appends the finalized aggregate arrays as additional columns. This produces a result like:
+---------+------------+
| country | SUM(sales) |
+---------+------------+
| USA     | 1500000    |
| Canada  | 750000     |
+---------+------------+
graph LR
StringCol["String Column\n'42', 'hello', '3.14'"]
Parse1["'42' → 42.0"]
Parse2["'hello' → 0.0"]
Parse3["'3.14' → 3.14"]
Sum[SUM: 45.14]
StringCol --> Parse1
StringCol --> Parse2
StringCol --> Parse3
Parse1 --> Sum
Parse2 --> Sum
Parse3 --> Sum
Type System and Coercion
Numeric Coercion
The system performs SQLite-style type coercion for aggregates on string columns:
Sources: llkv-aggregate/src/lib.rs:398-447
The array_value_to_numeric() function attempts to parse strings as floats. Non-numeric strings coerce to 0.0, matching SQLite behavior. This enables SUM(string_column) where some values are numeric.
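A simplified version of that coercion (using Rust's parse rather than SQLite's full prefix-parsing rules) reproduces the behavior shown in the diagram above:

```rust
// Simplified SQLite-style coercion: parse as f64, otherwise treat the value as 0.0.
// (Real SQLite also accepts numeric prefixes like "42abc"; this sketch does not.)
fn coerce_numeric(s: &str) -> f64 {
    s.trim().parse::<f64>().unwrap_or(0.0)
}

fn main() {
    let values = ["42", "hello", "3.14"];
    let sum: f64 = values.iter().map(|s| coerce_numeric(s)).sum();
    assert!((sum - 45.14).abs() < 1e-9);
}
```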
Type-specific Accumulators
| Input Type | SUM Accumulator | AVG Accumulator | MIN/MAX Accumulator |
|---|---|---|---|
| Int64 | SumInt64 (i64 with overflow) | AvgInt64 (sum: i64, count: i64) | MinInt64 / MaxInt64 |
| Float64 | SumFloat64 (f64, never overflows) | AvgFloat64 (sum: f64, count: i64) | MinFloat64 / MaxFloat64 |
| Decimal128 | SumDecimal128 (i128 + precision/scale) | AvgDecimal128 (sum: i128, count: i64) | MinDecimal128 / MaxDecimal128 |
| Utf8 | SumFloat64 (numeric coercion) | AvgFloat64 (numeric coercion) | MinFloat64 (numeric coercion) |
Sources: llkv-aggregate/src/lib.rs:486-710
graph TB
IntValue[Integer Value]
CheckedAdd[checked_add-value-]
Overflow{Overflow?}
ErrorSUM[SUM: Return Error]
ContinueTOTAL[TOTAL: Continue as Float64]
IntValue --> CheckedAdd
CheckedAdd --> Overflow
Overflow -->|Yes + SUM| ErrorSUM
Overflow -->|Yes + TOTAL| ContinueTOTAL
Overflow -->|No| IntValue
Each data type uses a specialized accumulator to preserve precision and overflow semantics. Decimal aggregates maintain precision and scale metadata throughout accumulation.
Overflow Handling
Integer Overflow Strategy
Sources: llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975 llkv-aggregate/src/lib.rs:1474-1494
SUM uses checked_add() and sets the accumulator to None on overflow, returning an error during finalization. TOTAL avoids this by accumulating integers as Float64 from the start, trading precision for guaranteed completion.
Decimal Overflow
Decimal128 aggregates use checked_add() on the raw i128 values:
Sources: llkv-aggregate/src/lib.rs:915-932
When Decimal128 overflow occurs, the system returns an error immediately. There is no TOTAL-style fallback for decimals because precision requirements are explicit in the type signature.
graph LR
Projection["Projection:\nSUM(price * quantity)"]
Extract[Extract Aggregate\nFunction Call]
Expression["price * quantity"]
Translate[Translate to\nScalarExpr]
EnsureProj[ensure_computed_projection]
Accumulate[Accumulate via\nAggregateAccumulator]
Projection --> Extract
Extract --> Expression
Expression --> Translate
Translate --> EnsureProj
EnsureProj --> Accumulate
Computed Aggregates
Aggregate Expressions in Projections
The executor handles aggregate function calls embedded in computed projections:
Sources: llkv-executor/src/lib.rs:703-712 llkv-executor/src/lib.rs:735-798 llkv-executor/src/lib.rs:473-505
When a projection contains an aggregate like SUM(price * quantity), the executor:
- Detects the aggregate via expr_contains_aggregate()
- Translates the inner expression (price * quantity) to a ScalarExpr
- Creates a computed projection for the expression
- Initializes an accumulator for the projection index
- Accumulates values from the computed column
This allows complex aggregate expressions beyond simple column references.
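Conceptually, the computed column is just an intermediate per-row value that feeds an ordinary accumulator, as in this simplified sketch:

```rust
// SUM(price * quantity): evaluate the inner expression per row, then accumulate
// the computed column exactly like a plain SUM input.
fn sum_of_products(prices: &[f64], quantities: &[f64]) -> f64 {
    prices
        .iter()
        .zip(quantities)
        .map(|(p, q)| p * q) // the "computed projection" for each row
        .sum()
}

fn main() {
    assert_eq!(sum_of_products(&[2.0, 3.0], &[10.0, 1.0]), 23.0);
}
```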
Performance Considerations
Memory Usage
Each accumulator maintains state proportional to:
| Accumulator Type | Memory Per Group | Notes |
|---|---|---|
| COUNT(*) | 8 bytes (i64) | Constant size |
| SUM/AVG | 16-24 bytes | Value + metadata |
| MIN/MAX | 8-24 bytes | Single value + type info |
| COUNT(DISTINCT) | O(unique values) | Hash set grows with cardinality |
| GROUP_CONCAT | O(total string length) | Vector of strings |
Sources: llkv-aggregate/src/lib.rs:92-247
DISTINCT and GROUP_CONCAT have unbounded memory growth for high-cardinality data. The system does not implement spilling or approximate algorithms for these cases.
Parallelization
Aggregates are accumulated serially within a single thread because:
- Accumulators maintain mutable state that is not thread-safe
- DISTINCT tracking requires synchronized hash set updates
- Sequential batch processing simplifies overflow detection
Future work could introduce parallel accumulation with merge operations for distributive aggregates (SUM, COUNT, MIN, MAX); algebraic aggregates (AVG) and DISTINCT tracking would require additional merge logic.
Sources: llkv-aggregate/src/lib.rs:748-1440
Query Execution
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
Query execution is the process of converting logical query plans into physical result sets by coordinating table scans, expression evaluation, aggregation, joins, and result streaming. This page documents the execution engine's architecture, core components, and high-level execution flow.
For details on table-level planning and execution, see TablePlanner and TableExecutor. For scan optimization strategies, see Scan Execution and Optimization. For predicate evaluation mechanics, see Filter Evaluation.
System Architecture
Query execution spans two primary crates:
| Crate | Responsibility | Key Types |
|---|---|---|
llkv-executor | Orchestrates multi-table queries, aggregates, and result formatting | QueryExecutor, SelectExecution |
llkv-table | Executes table scans with predicates and projections | TablePlanner, TableExecutor |
The executor operates on logical plans produced by llkv-plan and delegates to llkv-table for single-table operations, llkv-join for join algorithms, and llkv-aggregate for aggregate computations.
Execution Architecture
graph TB
subgraph "Plan Layer"
PLAN["SelectPlan\n(llkv-plan)"]
end
subgraph "Execution Orchestration (llkv-executor)"
QE["QueryExecutor<P>"]
EXEC["SelectExecution<P>"]
STRAT["Strategy Selection:\nprojection, aggregate,\njoin, compound"]
end
subgraph "Table Execution (llkv-table)"
TP["TablePlanner"]
TE["TableExecutor"]
PS["PlannedScan"]
end
subgraph "Specialized Operations"
AGG["llkv-aggregate\nAccumulator"]
JOIN["llkv-join\nhash_join, cross_join"]
EVAL["NumericKernels\nscalar evaluation"]
end
subgraph "Storage"
TABLE["Table<P>"]
STORE["ColumnStore"]
end
PLAN --> QE
QE --> STRAT
STRAT -->|single table| TP
STRAT -->|aggregates| AGG
STRAT -->|joins| JOIN
TP --> PS
PS --> TE
TE --> TABLE
TABLE --> STORE
STRAT --> EVAL
EVAL --> TABLE
QE --> EXEC
TE --> EXEC
AGG --> EXEC
JOIN --> EXEC
Sources: llkv-executor/src/lib.rs:507-521 llkv-table/src/planner/mod.rs:580-726
Core Components
QueryExecutor
QueryExecutor<P> is the top-level execution coordinator in llkv-executor. It consumes SelectPlan structures and produces SelectExecution result containers.
Key Responsibilities:
- Strategy selection based on plan characteristics (single table, joins, aggregates, compound operations)
- Multi-table query orchestration (cross products, hash joins)
- Aggregate computation coordination
- Subquery evaluation (correlated EXISTS, scalar subqueries)
- Result streaming and batching
Entry points:
- execute_select(plan: SelectPlan) - Execute a SELECT plan llkv-executor/src/lib.rs:523-525
- execute_select_with_filter(plan, row_filter) - Execute with an optional MVCC filter llkv-executor/src/lib.rs:527-569
Sources: llkv-executor/src/lib.rs:507-521
SelectExecution
SelectExecution<P> encapsulates query results and provides streaming access via batched iteration. Results may be materialized upfront or generated lazily depending on the execution strategy.
Streaming Interface:
- stream<F>(on_batch: F) - Process results batch-by-batch
- into_rows() - Materialize all rows into memory (for sorting, deduplication)
- schema() - Access the result schema
Sources: llkv-executor/src/lib.rs:2500-2700 (approximate location based on file structure)
TablePlanner and TableExecutor
The table-level execution layer handles single-table scans with predicates and projections. TablePlanner analyzes the request and produces a PlannedScan, which TableExecutor then executes.
Planning Process:
- Validate projections against schema
- Normalize filter predicates (apply De Morgan's laws, flatten boolean operators)
- Compile predicates into EvalProgram and DomainProgram bytecode
- Build PlanGraph metadata for tracing
Execution Process:
- Try optimized fast paths (single column scans, full table scans)
- Fall back to general execution with expression evaluation
- Stream results in batches
These components are detailed in TablePlanner and TableExecutor.
Sources: llkv-table/src/planner/mod.rs:580-637 llkv-table/src/planner/mod.rs:728-1007
Execution Flow
Top-Level SELECT Execution Sequence
Sources: llkv-executor/src/lib.rs:523-569 llkv-table/src/planner/mod.rs:595-607 llkv-table/src/planner/mod.rs:1009-1400
graph TD
START["SelectPlan"] --> COMPOUND{compound?}
COMPOUND -->|yes| EXEC_COMPOUND["execute_compound_select\nUNION/EXCEPT/INTERSECT"]
COMPOUND -->|no| FROM{tables.is_empty?}
FROM -->|yes| EXEC_CONST["execute_select_without_table\nEvaluate constant expressions"]
FROM -->|no| MULTI{tables.len > 1?}
MULTI -->|yes| EXEC_CROSS["execute_cross_product\nor hash_join optimization"]
MULTI -->|no| GROUPBY{group_by.is_empty?}
GROUPBY -->|no| EXEC_GROUP["execute_group_by_single_table\nGroup rows + compute aggregates"]
GROUPBY -->|yes| AGG{aggregates.is_empty?}
AGG -->|no| EXEC_AGG["execute_aggregates\nCollect all rows + compute"]
AGG -->|yes| COMPUTED{has_computed_aggregates?}
COMPUTED -->|yes| EXEC_COMP_AGG["execute_computed_aggregates\nEmbedded agg in expressions"]
COMPUTED -->|no| EXEC_PROJ["execute_projection\nStream scan with projections"]
EXEC_COMPOUND --> RESULT["SelectExecution"]
EXEC_CONST --> RESULT
EXEC_CROSS --> RESULT
EXEC_GROUP --> RESULT
EXEC_AGG --> RESULT
EXEC_COMP_AGG --> RESULT
EXEC_PROJ --> RESULT
Execution Strategies
QueryExecutor selects an execution strategy based on plan characteristics:
Strategy Decision Tree
Sources: llkv-executor/src/lib.rs:527-569
Strategy Implementations
| Strategy | Method | When Applied | Key Operations |
|---|---|---|---|
| Constant Evaluation | execute_select_without_table | No FROM clause | Evaluate literals, struct constructors |
| Simple Projection | execute_projection | Single table, no aggregates | Stream scan with filter + projections |
| Aggregation | execute_aggregates | Has aggregates, no GROUP BY | Collect all rows, compute aggregates, emit single row |
| Grouped Aggregation | execute_group_by_single_table | Has GROUP BY | Hash rows by key, compute per-group aggregates |
| Computed Aggregates | execute_computed_aggregates | Aggregates embedded in computed projections | Extract aggregate expressions, evaluate separately |
| Cross Product | execute_cross_product | Multiple tables | Cartesian product or hash join optimization |
| Compound | execute_compound_select | UNION/EXCEPT/INTERSECT | Execute components, apply set operations |
Sources: llkv-executor/src/lib.rs:926-975 llkv-executor/src/lib.rs:1700-2100 llkv-executor/src/lib.rs:2200-2400 llkv-executor/src/lib.rs:1057-1400 llkv-executor/src/lib.rs:590-701
Streaming Execution Model
LLKV executes queries in a streaming fashion to avoid materializing large intermediate results. Results flow through the system as RecordBatch chunks (typically 4096 rows).
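A push-style consumer of that stream might look like the following sketch (assuming the arrow crate; the actual SelectExecution::stream signature may differ):

```rust
use arrow::record_batch::RecordBatch;

// Deliver one RecordBatch at a time to a callback, mirroring a stream(on_batch)
// style interface; memory stays proportional to a single batch.
fn stream_batches<F: FnMut(&RecordBatch)>(batches: &[RecordBatch], mut on_batch: F) {
    for batch in batches {
        on_batch(batch);
    }
}

fn count_rows(batches: &[RecordBatch]) -> usize {
    let mut total = 0;
    stream_batches(batches, |batch| total += batch.num_rows());
    total
}
```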
Streaming Characteristics:
| Execution Type | Streaming Behavior | Memory Characteristics |
|---|---|---|
| Projection | Full streaming | O(batch_size) memory |
| Filter | Full streaming | O(batch_size) memory |
| Aggregates | Requires full materialization | O(input_rows) memory |
| GROUP BY | Requires full materialization | O(group_count) memory |
| ORDER BY | Requires full materialization | O(input_rows) memory |
| DISTINCT | Requires full materialization | O(distinct_rows) memory |
| LIMIT | Early termination | O(limit × batch_size) memory |
Streaming Projection Example Flow:
Sources: llkv-table/src/planner/mod.rs:1009-1400 llkv-table/src/constants.rs:1-10 (defines STREAM_BATCH_ROWS = 4096)
Materialization Points
Certain operations require collecting all rows before producing output:
- Sorting - Must see all rows to determine order llkv-executor/src/lib.rs:2800-2900
- Deduplication (DISTINCT) - Must track all seen rows llkv-executor/src/lib.rs:2950-3050
- Aggregation - Must accumulate state across all rows llkv-executor/src/lib.rs:1700-1900
- Set Operations - Must materialize both sides for comparison llkv-executor/src/lib.rs:590-701
These operations call into_rows() on SelectExecution to materialize results as Vec<CanonicalRow>.
Sources: llkv-executor/src/lib.rs:2600-2700
Expression Evaluation
Query execution evaluates two types of expressions:
Predicate Evaluation (Filtering)
Predicates are compiled to bytecode and evaluated during table scans:
- Normalization - Apply De Morgan's laws, flatten AND/OR llkv-table/src/planner/program.rs:50-150
- Compilation - Convert to EvalProgram (stack-based) and DomainProgram (row tracking) llkv-table/src/planner/program.rs:200-400
- Vectorized Evaluation - Process chunks of rows efficiently llkv-table/src/planner/mod.rs:1100-1300
See Filter Evaluation for detailed mechanics.
Scalar Expression Evaluation (Projections)
Computed projections are evaluated row-by-row or vectorized when possible:
- Translation - Convert ScalarExpr<String> to ScalarExpr<FieldId> llkv-executor/src/translation/scalar.rs:1-200
- Type Inference - Determine the output data type llkv-executor/src/translation/schema.rs:50-150
- Evaluation - Use NumericKernels for numeric operations llkv-table/src/scalar_eval.rs:450-685
Vectorized vs Row-by-Row:
Sources: llkv-table/src/scalar_eval.rs:675-712 llkv-table/src/scalar_eval.rs:549-673
Integration with Runtime
The execution layer coordinates with llkv-runtime for transaction and catalog management:
Runtime Integration Points:
| Operation | Runtime Responsibility | Executor Responsibility |
|---|---|---|
| Table Lookup | CatalogManager::table() | ExecutorTableProvider::get_table() |
| MVCC Filtering | Provide RowIdFilter with snapshot | Apply filter during scan |
| Transaction State | Track transaction ID, commit watermark | Include created_by/deleted_by in scans |
| Schema Resolution | Maintain system catalog | Translate column names to FieldId |
The ExecutorTableProvider trait abstracts runtime integration, allowing executor to remain runtime-agnostic.
Sources: llkv-executor/src/types.rs:100-200 llkv-runtime/src/catalog/mod.rs:50-150
Performance Characteristics
Execution performance depends on query characteristics and chosen strategy:
| Query Pattern | Typical Performance | Optimization Opportunities |
|---|---|---|
| SELECT * FROM t | ~1M rows/sec | Fast path: shadow column scan llkv-table/src/planner/mod.rs:765-821 |
| SELECT col FROM t WHERE pred | ~500K rows/sec | Predicate fusion llkv-table/src/planner/mod.rs:518-570 |
| Single-table aggregates | Full table scan | Column-only projections for aggregate inputs |
| Hash join (2 tables) | O(n + m) with O(n) memory | Smaller table as build side llkv-executor/src/lib.rs:1500-1700 |
| Cross product (n tables) | O(∏ row_counts) | Avoid if possible; rewrite to joins |
Sources: llkv-table/src/planner/mod.rs:738-856 llkv-executor/src/lib.rs:1082-1400
TablePlanner and TableExecutor
Relevant source files
This document describes the table-level query planning and execution system in LLKV. The TablePlanner translates scan operations into optimized execution plans, while the TableExecutor implements multiple execution strategies to materialize query results efficiently. For information about the broader query execution pipeline, see Query Execution. For details on expression compilation and evaluation, see Program Compilation and Scalar Evaluation and NumericKernels.
Purpose and Scope
The TablePlanner and TableExecutor form the core of LLKV's table-level query execution. They bridge the gap between logical query plans (from llkv-plan) and physical data access (through llkv-column-map). This document covers:
- How scan operations are planned and optimized
- The structure of compiled execution plans (PlannedScan)
- Multiple execution strategies and their trade-offs
- Predicate fusion optimization
- Integration with MVCC row filtering
- Streaming result materialization
Architecture Overview
Sources: llkv-table/src/planner/mod.rs:580-726
TablePlanner
The TablePlanner is responsible for analyzing scan requests and constructing optimized execution plans. It does not execute queries itself but prepares all necessary metadata for the TableExecutor.
Structure
The planner holds a reference to the Table being scanned and provides a single public method: scan_stream_with_exprs.
Sources: llkv-table/src/planner/mod.rs:580-593
Planning Flow
The planning process consists of several stages:
- Validation : Ensures at least one projection is specified
- Normalization : Applies De Morgan's laws and flattens logical operators via normalize_predicate
- Graph Construction : Builds a PlanGraph for visualization and introspection
- Program Compilation : Compiles filter expressions into EvalProgram and DomainProgram bytecode
Sources: llkv-table/src/planner/mod.rs:595-637
PlanGraph Construction
The build_plan_graph method creates a directed acyclic graph (DAG) representing the logical query plan:
| Node Type | Purpose | Metadata |
|---|---|---|
TableScan | Entry point for data access | table_id, projection_count |
Filter | Predicate application | predicates (formatted expressions) |
Project | Column selection and computation | projections, fields with types |
Output | Result materialization | include_nulls flag |
Sources: llkv-table/src/planner/mod.rs:639-725
PlannedScan Structure
The PlannedScan is an intermediate representation that bridges planning and execution. It contains all information needed to execute a scan without holding runtime state.
| Field | Type | Purpose |
|---|---|---|
projections | Vec<ScanProjection> | Column and computed projections to materialize |
filter_expr | Arc<Expr<FieldId>> | Normalized filter predicate |
options | ScanStreamOptions<P> | MVCC filters, ordering, null handling |
plan_graph | PlanGraph | Logical plan for introspection |
programs | ProgramSet | Compiled bytecode for predicate evaluation |
Sources: llkv-table/src/planner/mod.rs:500-509
TableExecutor
The TableExecutor implements multiple execution strategies, selecting the most efficient based on query characteristics.
Structure
The executor maintains a cache of row IDs to avoid redundant scans when multiple operations target the same table.
Sources: llkv-table/src/planner/mod.rs:572-578
Execution Strategy Selection
Sources: llkv-table/src/planner/mod.rs:1009-1367
Single Column Direct Scan Fast Path
The try_single_column_direct_scan optimization applies when:
- Exactly one projection is requested
- include_nulls is false
- Filter is either trivial or a full-range predicate on the projected column
- Column type is not Utf8 or LargeUtf8 (to avoid string complexity)
graph TD
CHECK1{projections.len == 1?} -->|No| FALLBACK1[Fallback]
CHECK1 -->|Yes| CHECK2{include_nulls == false?}
CHECK2 -->|No| FALLBACK2[Fallback]
CHECK2 -->|Yes| PROJ_TYPE{Projection type?}
PROJ_TYPE -->|Column| CHECK_FILTER["is_full_range_filter?"]
PROJ_TYPE -->|Computed| CHECK_COMPUTED[Single field?]
CHECK_FILTER -->|No| FALLBACK3[Fallback]
CHECK_FILTER -->|Yes| CHECK_DTYPE{dtype?}
CHECK_DTYPE -->|Utf8/LargeUtf8| FALLBACK4[Fallback]
CHECK_DTYPE -->|Other| DIRECT_SCAN[SingleColumnStreamVisitor]
CHECK_COMPUTED -->|No| FALLBACK5[Fallback]
CHECK_COMPUTED -->|Yes| COMPUTE_TYPE{Passthrough or Affine?}
COMPUTE_TYPE -->|Passthrough| DIRECT_SCAN2[SingleColumnStreamVisitor]
COMPUTE_TYPE -->|Affine| AFFINE_SCAN[AffineSingleColumnVisitor]
COMPUTE_TYPE -->|Other| COMPUTE_SCAN[ComputedSingleColumnVisitor]
DIRECT_SCAN --> HANDLED[StreamOutcome::Handled]
DIRECT_SCAN2 --> HANDLED
AFFINE_SCAN --> HANDLED
COMPUTE_SCAN --> HANDLED
This path streams data directly from storage using ScanBuilder without building intermediate row ID lists or using RowStreamBuilder.
Sources: llkv-table/src/planner/mod.rs:1369-1530
Full Table Scan Streaming Fast Path
The try_stream_full_table_scan optimization applies when:
- Filter is trivial (no predicates)
- No ordering is required (options.order.is_none())
graph TD
CHECK_ORDER{order.is_some?} -->|Yes| FALLBACK[Fallback]
CHECK_ORDER -->|No| STREAM_START[stream_table_row_ids]
STREAM_START --> SHADOW{Shadow column exists?}
SHADOW -->|Yes| CHUNK[Emit row_id chunks]
SHADOW -->|No| FALLBACK2[Multi-column scan fallback]
CHUNK --> MVCC_FILTER{row_id_filter?}
MVCC_FILTER -->|Yes| APPLY[filter.filter]
MVCC_FILTER -->|No| BUILD
APPLY --> BUILD
BUILD[RowStreamBuilder] --> GATHER[Gather columns]
GATHER --> EMIT[Emit RecordBatch]
EMIT --> MORE{More chunks?}
MORE -->|Yes| CHUNK
MORE -->|No| CHECK_EMPTY
CHECK_EMPTY{any_emitted?} -->|No| SYNTHETIC[emit_synthetic_null_batch]
CHECK_EMPTY -->|Yes| HANDLED[StreamOutcome::Handled]
SYNTHETIC --> HANDLED
This path uses stream_table_row_ids to enumerate row IDs in chunks directly from the row_id shadow column, avoiding full predicate evaluation.
This optimization is particularly effective for queries like SELECT col1, col2 FROM table with no WHERE clause.
Sources: llkv-table/src/planner/mod.rs:905-999
General Execution Path
When fast paths don't apply, the executor follows a multi-stage process:
Stage 1: Projection Metadata Construction
The executor builds several data structures:
| Structure | Purpose |
|---|---|
projection_evals | Vec<ProjectionEval> mapping projections to evaluation strategies |
unique_lfids | Vec<LogicalFieldId> of columns to fetch from storage |
unique_index | FxHashMap<LogicalFieldId, usize> for column position lookup |
numeric_fields | FxHashSet<FieldId> of columns needing numeric coercion |
passthrough_fields | Vec<Option<FieldId>> for identity computed projections |
Sources: llkv-table/src/planner/mod.rs:1033-1117
Stage 2: Row ID Collection
For trivial filters, the executor scans the MVCC created_by column to enumerate all rows (including those with NULL values in user columns). For non-trivial filters, it evaluates the compiled ProgramSet.
Sources: llkv-table/src/planner/mod.rs:1269-1327
Stage 3: Streaming Execution
The RowStreamBuilder materializes results in batches of STREAM_BATCH_ROWS (default: 1024). For each batch:
- Gather : Fetch column data for row IDs via MultiGatherContext
- Evaluate : Compute any ProjectionEval::Computed expressions
- Materialize : Construct a RecordBatch with the final schema
- Emit : Call the user-provided callback
Sources: llkv-table/src/planner/mod.rs:1337-1365
Program Compilation
The ProgramCompiler translates filter expressions into stack-based bytecode for efficient evaluation. It produces two programs:
- EvalProgram : Evaluates predicates and returns matching row IDs
- DomainProgram : Computes the domain (all potentially relevant row IDs) for NOT operations
ProgramSet Structure
Bytecode Operations
| Opcode | Stack Effect | Purpose |
|---|---|---|
PushPredicate(filter) | [] → [rows] | Evaluate single predicate |
PushCompare{left, op, right} | [] → [rows] | Evaluate comparison expression |
PushInList{expr, list, negated} | [] → [rows] | Evaluate IN list |
PushIsNull{expr, negated} | [] → [rows] | Evaluate IS NULL |
PushLiteral(bool) | [] → [rows] | Push all rows (true) or empty (false) |
FusedAnd{field_id, filters} | [] → [rows] | Apply multiple predicates on same field |
And{child_count} | [r1, r2, ...] → [r] | Intersect N row ID sets |
Or{child_count} | [r1, r2, ...] → [r] | Union N row ID sets |
Not{domain} | [matched] → [domain - matched] | Set difference using domain program |
Sources: llkv-table/src/planner/program.rs:1-200 (referenced but not in provided files)
Execution Example
Consider the filter: WHERE (age > 18 AND age < 65) OR name = 'Alice'
After normalization and compilation:
Stack Operations:
1. PushCompare(age > 18) → [rows1]
2. PushCompare(age < 65) → [rows1, rows2]
3. And{2} → [rows1 ∩ rows2]
4. PushPredicate(name = 'Alice') → [rows1 ∩ rows2, rows3]
5. Or{2} → [(rows1 ∩ rows2) ∪ rows3]
The collect_row_ids_for_program method executes this bytecode by maintaining a stack of row ID vectors and applying set operations.
Sources: llkv-table/src/planner/mod.rs:2376-2502
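A toy version of that stack machine over row-ID sets (simplified opcodes, not the crate's actual EvalProgram) makes the mechanics concrete:

```rust
use std::collections::BTreeSet;

// Simplified opcodes: Push stands in for PushPredicate / PushCompare results.
enum Op {
    Push(BTreeSet<u64>),
    And(usize), // intersect the top N row-ID sets
    Or(usize),  // union the top N row-ID sets
}

fn run(ops: &[Op]) -> BTreeSet<u64> {
    let mut stack: Vec<BTreeSet<u64>> = Vec::new();
    for op in ops {
        match op {
            Op::Push(rows) => stack.push(rows.clone()),
            Op::And(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.intersection(&next).copied().collect();
                }
                stack.push(acc);
            }
            Op::Or(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.union(&next).copied().collect();
                }
                stack.push(acc);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    let age_gt_18: BTreeSet<u64> = [1, 2, 3, 4].into();
    let age_lt_65: BTreeSet<u64> = [3, 4, 5].into();
    let name_alice: BTreeSet<u64> = [9].into();
    let result = run(&[
        Op::Push(age_gt_18),
        Op::Push(age_lt_65),
        Op::And(2),
        Op::Push(name_alice),
        Op::Or(2),
    ]);
    let expected: BTreeSet<u64> = [3, 4, 9].into();
    assert_eq!(result, expected);
}
```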
graph TD
ANALYZE[Analyze filter expression] --> BUILD["Build per_field stats"]
BUILD --> STATS["FieldPredicateStats:\ntotal, contains"]
STATS --> CHECK{should_fuse?}
CHECK --> DTYPE{Data type?}
DTYPE -->|Utf8/LargeUtf8| STRING_RULE["contains ≥ 1 AND total ≥ 2"]
DTYPE -->|Other| NUMERIC_RULE["total ≥ 2"]
STRING_RULE -->|Yes| FUSE[Generate FusedAnd opcode]
NUMERIC_RULE -->|Yes| FUSE
STRING_RULE -->|No| SEPARATE[Evaluate predicates separately]
NUMERIC_RULE -->|No| SEPARATE
Predicate Fusion
The PredicateFusionCache analyzes filter expressions to identify opportunities for fused predicate evaluation.
Fusion Strategy
Predicate fusion is particularly beneficial for string columns with CONTAINS operations, where multiple predicates on the same field can be evaluated in a single storage scan.
Example: WHERE name LIKE '%Smith%' AND name LIKE '%John%' can be fused into a single scan with two pattern matchers.
Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/mod.rs:2504-2580
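A rough sketch of the fused form: one pass over the column checks every pattern for the field, instead of re-scanning per predicate:

```rust
// Fused CONTAINS evaluation: a single scan applies all patterns for the field
// and returns the positions of matching rows.
fn fused_contains(values: &[&str], patterns: &[&str]) -> Vec<usize> {
    values
        .iter()
        .enumerate()
        .filter(|(_, v)| patterns.iter().all(|p| v.contains(*p)))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let names = ["John Smith", "Jane Doe", "Johnny Smithson"];
    assert_eq!(fused_contains(&names, &["Smith", "John"]), vec![0, 2]);
}
```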
Projection Evaluation
The ProjectionEval enum handles two types of projections:
Column Projections
Direct column references that can be gathered from storage without computation.
Computed Projections
Expressions evaluated per-row using NumericKernels. The executor optimizes several patterns:
| Pattern | Optimization |
|---|---|
column | Passthrough (no computation) |
a * column + b | Affine transformation (vectorized) |
| General expression | Full expression evaluation |
Sources: llkv-table/src/planner/mod.rs:482-498 llkv-table/src/planner/mod.rs:1110-1117
Row ID Collection Strategies
The executor uses different strategies for collecting matching row IDs based on predicate characteristics:
Simple Predicates
For predicates on a single field (e.g., age > 18):
Sources: llkv-table/src/planner/mod.rs:1532-1612
Comparison Expressions
For comparisons involving multiple fields (e.g., col1 + col2 > 10):
This approach minimizes wasted computation by first identifying a "domain" of potentially matching rows (intersection of rows where all referenced columns have values) before evaluating the full expression.
Sources: llkv-table/src/planner/mod.rs:1699-1775 llkv-table/src/planner/mod.rs:2214-2374
IN List Expressions
For IN list predicates (e.g., status IN ('active', 'pending')):
The IN list evaluator properly handles SQL's three-valued logic:
- If the value matches any list element: TRUE (or FALSE if negated)
- If no match but the list contains NULL: NULL (indeterminate)
- If no match and no NULLs: FALSE (or TRUE if negated)
Sources: llkv-table/src/planner/mod.rs:2001-2044 llkv-table/src/planner/mod.rs:1841-1999
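These rules can be expressed directly with Option<bool> standing in for SQL's three-valued result (a sketch, with None modeling NULL for both the probe value and the outcome):

```rust
// Three-valued IN-list evaluation: Some(true)/Some(false) are definite results,
// None is SQL NULL (indeterminate).
fn in_list(value: Option<i64>, list: &[Option<i64>], negated: bool) -> Option<bool> {
    let v = value?; // a NULL probe value yields NULL
    let mut saw_null = false;
    for item in list {
        match item {
            Some(x) if *x == v => return Some(!negated),
            None => saw_null = true,
            _ => {}
        }
    }
    if saw_null {
        None
    } else {
        Some(negated)
    }
}

fn main() {
    assert_eq!(in_list(Some(1), &[Some(1), Some(2)], false), Some(true));
    assert_eq!(in_list(Some(3), &[Some(1), None], false), None);
    assert_eq!(in_list(Some(3), &[Some(1), Some(2)], false), Some(false));
    assert_eq!(in_list(Some(3), &[Some(1), Some(2)], true), Some(true));
}
```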
graph TD
COLLECT[Collect row IDs from predicates] --> FILTER{row_id_filter.is_some?}
FILTER -->|No| CONTINUE[Continue to streaming]
FILTER -->|Yes| APPLY["filter.filter(table, row_ids)"]
APPLY --> CHECK_VISIBILITY[Check MVCC columns]
CHECK_VISIBILITY --> CREATED["created_by ≤ txn_id"]
CHECK_VISIBILITY --> DELETED["deleted_by > txn_id OR NULL"]
CREATED --> VISIBLE{Visible?}
DELETED --> VISIBLE
VISIBLE -->|Yes| KEEP[Keep row ID]
VISIBLE -->|No| DROP[Drop row ID]
KEEP --> FILTERED[Filtered row IDs]
DROP --> FILTERED
FILTERED --> CONTINUE
MVCC Integration
The executor integrates with LLKV's MVCC system through the row_id_filter option in ScanStreamOptions. After collecting row IDs through predicate evaluation, the filter determines which rows are visible to the current transaction:
For trivial filters, the executor explicitly scans the created_by MVCC column to enumerate all rows, ensuring that rows with NULL values in user columns are included when appropriate.
Sources: llkv-table/src/planner/mod.rs:1269-1323
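The visibility predicate itself is small; a hedged sketch of the rule encoded by the created_by and deleted_by columns (real snapshot checks also account for concurrent, uncommitted transactions):

```rust
// A row is visible to a snapshot if it was created at or before the snapshot's
// transaction and has not been deleted by a transaction the snapshot can see.
struct RowVersion {
    created_by: u64,
    deleted_by: Option<u64>,
}

fn is_visible(row: &RowVersion, snapshot_txn: u64) -> bool {
    row.created_by <= snapshot_txn && row.deleted_by.map_or(true, |d| d > snapshot_txn)
}

fn main() {
    assert!(is_visible(&RowVersion { created_by: 5, deleted_by: None }, 10));
    assert!(!is_visible(&RowVersion { created_by: 5, deleted_by: Some(7) }, 10));
    assert!(is_visible(&RowVersion { created_by: 5, deleted_by: Some(20) }, 10));
    assert!(!is_visible(&RowVersion { created_by: 15, deleted_by: None }, 10));
}
```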
Performance Characteristics
The table below summarizes the time complexity of different execution paths:
| Execution Path | Conditions | Row ID Collection | Data Gathering | Total |
|---|---|---|---|---|
| Single Column Direct | 1 projection, trivial/full-range filter | O(1) | O(n) streaming | O(n) |
| Full Table Stream | Trivial filter, no order | O(n) via shadow column | O(n) streaming | O(n) |
| General (indexed predicate) | Single-field predicate with index | O(log n + m) | O(m × c) | O(log n + m × c) |
| General (complex predicate) | Multi-field or computed predicate | O(n × p) | O(m × c) | O(n × p + m × c) |
Where:
- n = total rows in table
- m = matching rows after filtering
- c = number of columns in projection
- p = complexity of predicate (number of fields involved)
The executor automatically selects the most efficient path based on query characteristics, with no manual tuning required.
Sources: llkv-table/src/planner/mod.rs:1009-1530 llkv-table/src/planner/mod.rs:905-999
graph TB
subgraph "External Input"
PLAN[llkv-plan SelectPlan]
EXPR[llkv-expr Expr]
end
subgraph "Table Layer"
TP[TablePlanner]
TE[TableExecutor]
PLANNED[PlannedScan]
TP --> PLANNED
PLANNED --> TE
end
subgraph "Storage Layer"
STORE[ColumnStore]
SCAN[ScanBuilder]
GATHER[MultiGatherContext]
end
subgraph "Expression Evaluation"
NORM[normalize_predicate]
COMPILER[ProgramCompiler]
KERNELS[NumericKernels]
end
subgraph "Output"
STREAM[RowStreamBuilder]
BATCH[RecordBatch]
end
PLAN --> TP
EXPR --> TP
TP --> NORM
NORM --> COMPILER
TE --> SCAN
TE --> GATHER
TE --> KERNELS
TE --> STREAM
SCAN --> STORE
GATHER --> STORE
STREAM --> BATCH
Integration Points
The TablePlanner and TableExecutor integrate with several other LLKV subsystems:
Sources: llkv-table/src/planner/mod.rs:1-76
This architecture enables LLKV to efficiently execute table scans with complex predicates and projections while maintaining clean separation between logical planning and physical execution. The multiple optimization paths ensure that simple queries execute quickly while complex queries remain correct.
Scan Execution and Optimization
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
- llkv-table/src/planner/mod.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the table scan execution engine in the llkv-table crate, which implements the low-level scanning and streaming of data from the column store to higher layers. It covers planning, optimization paths, predicate compilation, and expression evaluation strategies that enable efficient data retrieval.
For information about higher-level query planning and the translation of SQL plans to table operations, see TablePlanner and TableExecutor. For details on how filters are evaluated against individual rows, see Filter Evaluation.
Architecture Overview
The scan execution system is split into two primary components:
| Component | Responsibility |
|---|---|
TablePlanner | Analyzes scan requests, builds plan graphs, compiles predicates into bytecode programs |
TableExecutor | Executes planned scans using optimization paths, coordinates streaming results |
Scan Execution Flow
Sources: llkv-table/src/planner/mod.rs:591-637
Planning Phase
Plan Construction
The TablePlanner::plan_scan method orchestrates plan construction by:
- Validating projections are non-empty
- Normalizing the filter predicate
- Building a plan graph for visualization and analysis
- Compiling predicates into executable programs
graph LR
Input["scan_stream_with_exprs\n(projections, filter, options)"]
Validate["Validate projections"]
Normalize["normalize_predicate\n(flatten And/Or,\napply De Morgan's)"]
Graph["build_plan_graph\n(TableScan → Filter →\nProject → Output)"]
Compile["ProgramCompiler\n(EvalProgram +\nDomainProgram)"]
Output["PlannedScan"]
Input --> Validate
Validate --> Normalize
Validate --> Graph
Normalize --> Compile
Compile --> Output
Graph --> Output
The plan graph encodes the logical operator tree for diagnostic purposes, with nodes representing TableScan, Filter, Project, and Output operators.
Sources: llkv-table/src/planner/mod.rs:612-637 llkv-table/src/planner/mod.rs:639-725
Predicate Compilation
Predicates are compiled into two bytecode programs:
| Program Type | Purpose |
|---|---|
EvalProgram | Stack-based bytecode for evaluating filter conditions |
DomainProgram | Tracks which row IDs satisfy the predicate during scanning |
The ProgramCompiler analyzes the normalized predicate tree and emits instructions for efficient evaluation. Predicate fusion merges multiple predicates on the same field when beneficial.
graph TB
Filter["Expr<FieldId>\n(normalized predicate)"]
Fusion["PredicateFusionCache\n(analyze per-field stats)"]
Compiler["ProgramCompiler::compile"]
EvalProg["EvalProgram\n(stack-based bytecode)"]
DomainProg["DomainProgram\n(row ID tracking)"]
ProgramSet["ProgramSet\n(contains both programs)"]
Filter --> Fusion
Filter --> Compiler
Fusion --> Compiler
Compiler --> EvalProg
Compiler --> DomainProg
EvalProg --> ProgramSet
DomainProg --> ProgramSet
Sources: llkv-table/src/planner/mod.rs:625-629 llkv-table/src/planner/program.rs
Execution Phase and Optimization Paths
The TableExecutor::execute method attempts multiple optimization paths before falling back to the general scan:
Fast Path: Single Column Direct Scan
When the scan requests a single column with a simple predicate, the executor uses try_single_column_direct_scan to stream data directly from the column store without materializing row IDs.
Conditions for single column fast path:
- Exactly one column projection
- No computed projections
- Simple predicate structure (optional)
- Compatible data types
This path bypasses row ID collection and gather operations, streaming column chunks directly to the caller.
Sources: llkv-table/src/planner/mod.rs:1021-1031 llkv-table/src/planner/mod.rs:1157-1343
Fast Path: Full Table Scan Streaming
For queries without ordering requirements, try_stream_full_table_scan uses incremental row ID streaming to avoid buffering all row IDs in memory:
graph TB
Start["try_stream_full_table_scan"]
CheckOrder["Check options.order\n(must be None)"]
StreamRIDs["stream_table_row_ids\n(chunk_size batches)"]
Shadow["Try shadow column\nscan (fast)"]
Fallback["Multi-column scan\n(fallback)"]
ProcessChunk["Process chunk:\n1. Apply row_id_filter\n2. Build RowStream\n3. Emit batches"]
Batch["RecordBatch\n(via on_batch)"]
Start --> CheckOrder
CheckOrder -->|order is Some| Return["Return Fallback"]
CheckOrder -->|order is None| StreamRIDs
StreamRIDs --> Shadow
Shadow -->|Success| ProcessChunk
Shadow -->|Not found| Fallback
Fallback --> ProcessChunk
ProcessChunk --> Batch
Batch -->|Next chunk| StreamRIDs
Sources: llkv-table/src/planner/mod.rs:904-999 llkv-table/src/planner/mod.rs:859-902
Row ID Collection Optimization
Row ID collection uses a two-tier strategy:
- Fast path (shadow column) : Scan the dedicated row_id shadow column, which contains all row IDs for the table
- Fallback (multi-column scan) : Scan user columns and deduplicate row IDs when the shadow column is unavailable
The shadow column optimization is significantly faster because:
- Single column scan instead of multiple
- No deduplication required
- Direct row ID format
Sources: llkv-table/src/planner/mod.rs:748-857
General Scan Execution
When fast paths don't apply, the executor uses the general scan path:
- Projection analysis : Classify projections as column references or computed expressions
- Field collection : Build unique field lists and numeric field maps
- Row ID collection : Gather all relevant row IDs (using optimizations above)
- Row ID filtering : Apply predicate programs to filter row IDs
- Gather and stream : Use RowStreamBuilder to materialize columns and emit batches
General Scan Pipeline
graph TB
Execute["TableExecutor::execute"]
Analyze["Analyze projections:\n- Column refs\n- Computed exprs\n- Build unique_lfids"]
CollectRIDs["table_row_ids\n(with caching)"]
FilterRIDs["Filter row IDs:\n- collect_row_ids_for_rowid_filter\n- Apply EvalProgram\n- Apply DomainProgram"]
Order["Apply ordering\n(if options.order present)"]
Gather["RowStreamBuilder:\n- prepare_gather_context\n- stream chunks\n- evaluate computed exprs"]
Emit["Emit RecordBatch"]
Execute --> Analyze
Analyze --> CollectRIDs
CollectRIDs --> FilterRIDs
FilterRIDs --> Order
Order --> Gather
Gather --> Emit
Sources: llkv-table/src/planner/mod.rs:1009-1343 llkv-table/src/planner/mod.rs:1345-1710
Predicate Optimization
Normalization
The normalize_predicate function applies logical transformations to simplify filter expressions:
- De Morgan's laws : NOT (a AND b) → (NOT a) OR (NOT b)
- Flatten nested operators : AND[AND[a,b],c] → AND[a,b,c]
- Constant folding : AND[true, x] → x
- Double negation elimination : NOT (NOT x) → x
These transformations expose optimization opportunities and simplify compilation.
Sources: llkv-table/src/planner/program.rs
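The following toy normalizer (operating on a simplified expression tree, not the llkv-expr types, and covering only the negation-related rules) shows how double negations and De Morgan's rewrites interact:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Pred(&'static str),
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
}

fn normalize(expr: Expr) -> Expr {
    match expr {
        Expr::Not(inner) => match *inner {
            Expr::Not(x) => normalize(*x), // NOT (NOT x) => x
            Expr::And(children) => Expr::Or(
                // NOT (a AND b) => (NOT a) OR (NOT b)
                children.into_iter().map(|c| normalize(Expr::Not(Box::new(c)))).collect(),
            ),
            Expr::Or(children) => Expr::And(
                // NOT (a OR b) => (NOT a) AND (NOT b)
                children.into_iter().map(|c| normalize(Expr::Not(Box::new(c)))).collect(),
            ),
            other => Expr::Not(Box::new(normalize(other))),
        },
        Expr::And(children) => Expr::And(children.into_iter().map(normalize).collect()),
        Expr::Or(children) => Expr::Or(children.into_iter().map(normalize).collect()),
        leaf => leaf,
    }
}

fn main() {
    let e = Expr::Not(Box::new(Expr::And(vec![Expr::Pred("a"), Expr::Pred("b")])));
    let expected = Expr::Or(vec![
        Expr::Not(Box::new(Expr::Pred("a"))),
        Expr::Not(Box::new(Expr::Pred("b"))),
    ]);
    assert_eq!(normalize(e), expected);
}
```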
Predicate Fusion
The PredicateFusionCache analyzes predicates to determine when multiple conditions on the same field should be fused:
| Data Type | Fusion Criteria |
|---|---|
| String types | contains count ≥ 1 AND total predicates ≥ 2 |
| Other types | Total predicates ≥ 2 |
graph TB
Expr["Filter Expression"]
Cache["PredicateFusionCache"]
Traverse["Traverse expression tree"]
Stats["Per-field stats:\n- total predicates\n- contains predicates"]
Decision["should_fuse(field, dtype)"]
Fuse["Fuse predicates into\nsingle evaluation"]
Separate["Keep predicates separate"]
Expr --> Cache
Cache --> Traverse
Traverse --> Stats
Stats --> Decision
Decision -->|Meets criteria| Fuse
Decision -->|Below threshold| Separate
Fusion enables single-pass evaluation rather than multiple column scans for the same field.
Sources: llkv-table/src/planner/mod.rs:517-570
Expression Evaluation
Numeric Kernels
The NumericKernels system in llkv-table/src/scalar_eval.rs provides vectorized evaluation for scalar expressions:
| Kernel Operation | Description |
|---|---|
collect_fields | Extract all field references from expression |
prepare_numeric_arrays | Cast columns to unified numeric representation |
evaluate_value | Row-by-row scalar evaluation |
evaluate_batch | Vectorized batch evaluation |
simplify | Detect affine expressions for optimization |
Numeric Type Hierarchy
Sources: llkv-table/src/scalar_eval.rs:1-90 llkv-table/src/scalar_eval.rs:451-712
Vectorized Evaluation
When possible, expressions are evaluated using vectorized paths:
- Column access : Direct array reference (zero-copy)
- Literals : Broadcast scalar to array length
- Binary operations : Arrow compute kernels for array-array or array-scalar operations
- Affine expressions : Specialized scale * field + offset fast path
The try_evaluate_vectorized method attempts vectorization before falling back to row-by-row evaluation.
Sources: llkv-table/src/scalar_eval.rs:714-762
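The two-tier shape can be sketched without the Arrow kernels (plain Rust, illustrative types only): try a whole-column fast path, otherwise evaluate each row through a generic path:

```rust
// Toy two-tier evaluator: a vectorizable Array x Scalar case and a per-row fallback.
enum Expr {
    ScaleColumn(f64),       // supported by the "vectorized" path
    PerRow(fn(f64) -> f64), // arbitrary row-by-row computation
}

fn evaluate_batch(column: &[f64], expr: &Expr) -> Vec<f64> {
    match expr {
        // Whole-array path: one tight loop over the column
        Expr::ScaleColumn(s) => column.iter().map(|v| v * s).collect(),
        // Fallback path: evaluate the expression independently for every row
        Expr::PerRow(f) => column.iter().map(|v| f(*v)).collect(),
    }
}

fn add_one(v: f64) -> f64 {
    v + 1.0
}

fn main() {
    let col = [1.0, 2.0, 3.0];
    assert_eq!(evaluate_batch(&col, &Expr::ScaleColumn(2.0)), vec![2.0, 4.0, 6.0]);
    assert_eq!(evaluate_batch(&col, &Expr::PerRow(add_one)), vec![2.0, 3.0, 4.0]);
}
```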
graph LR
Expr["ScalarExpr"]
Detect["NumericKernels::simplify"]
Check["is_affine_column_expr"]
Affine["AffineExpr:\nfield, scale, offset"]
Direct["Direct column reference"]
Complex["Complex expression"]
Expr --> Detect
Detect --> Check
Check -->|Matches pattern| Affine
Check -->|Single column| Direct
Check -->|Other| Complex
Affine Expression Optimization
Expressions matching the pattern scale * field + offset are detected and optimized:
Affine expressions enable:
- Single column scan with arithmetic applied
- Reduced memory allocation
- Better cache locality
Sources: llkv-table/src/scalar_eval.rs:1038-1174 llkv-table/src/planner/mod.rs:1711-1872
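Once an expression is recognized as affine, applying it is a single pass over the column (a trimmed sketch of the AffineExpr idea, with field resolution omitted):

```rust
// Affine fast path: scale * column + offset applied in one vectorizable loop.
struct AffineExpr {
    scale: f64,
    offset: f64,
}

fn apply_affine(column: &[f64], expr: &AffineExpr) -> Vec<f64> {
    column.iter().map(|v| expr.scale * v + expr.offset).collect()
}

fn main() {
    let out = apply_affine(&[1.0, 2.0, 3.0], &AffineExpr { scale: 2.0, offset: 1.0 });
    assert_eq!(out, vec![3.0, 5.0, 7.0]);
}
```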
graph TB
Builder["RowStreamBuilder::new"]
Config["Configuration:\n- store\n- table_id\n- schema\n- unique_lfids\n- projection_evals\n- row_ids\n- batch_size"]
GatherCtx["prepare_gather_context\n(optional reuse)"]
Build["build()"]
Stream["RowStream"]
NextChunk["next_chunk()"]
Gather["Gather columns\nfor batch_size rows"]
Evaluate["Evaluate computed\nprojections"]
Batch["StreamChunk\n(arrays + schema)"]
Builder --> Config
Config --> GatherCtx
GatherCtx --> Build
Build --> Stream
Stream --> NextChunk
NextChunk --> Gather
Gather --> Evaluate
Evaluate --> Batch
Batch -->|More rows| NextChunk
Streaming Architecture
Row Stream Builder
The RowStreamBuilder constructs streaming result iterators with configurable batch sizes:
The stream uses STREAM_BATCH_ROWS (default 1024) as the chunk size for incremental result production.
Sources: llkv-table/src/stream.rs llkv-table/src/constants.rs:1-7
Gather Context Reuse
MultiGatherContext enables amortization of setup costs across multiple scans:
- Caches physical key lookups
- Reuses internal buffers
- Reduces allocations in streaming scenarios
The context is optional but improves performance for repeated scans of the same columns.
Sources: llkv-column-map/src/store/scan.rs
Performance Characteristics
| Scan Type | Row ID Collection | Column Access | Memory Usage |
|---|---|---|---|
| Single column direct | None (streams directly) | Direct column chunks | O(chunk_size) |
| Full table streaming | Shadow column (fast) | Incremental gather | O(batch_size × columns) |
| Filtered scan | Shadow or multi-column | Full gather | O(row_count × columns) |
| Ordered scan | Shadow or multi-column | Full gather + sort | O(row_count × columns) |
The executor prioritizes fast paths that minimize memory usage and avoid full table materialization when possible.
Sources: llkv-table/src/planner/mod.rs:748-999 llkv-table/README.md:1-57
Filter Evaluation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page explains how filter expressions (WHERE clause predicates) are evaluated against table rows during query execution. This includes the compilation of filter expressions into efficient bytecode programs, stack-based evaluation mechanisms, integration with MVCC visibility filtering, and various optimization strategies.
For information about how expressions are initially structured and planned, see Expression System and Query Planning. For details about the broader scan execution context, see Scan Execution and Optimization.
Filter Expression Pipeline
Filter evaluation follows a multi-stage pipeline that transforms SQL predicates into efficient executable programs:
Sources: llkv-table/src/planner/mod.rs:595-636 llkv-table/src/planner/program.rs:256-284 llkv-executor/src/translation/expression.rs:18-174
graph LR
SQL["SQL WHERE Clause"] --> Parser["sqlparser AST"]
Parser --> ExprString["Expr<String>\nField names"]
ExprString --> Translation["Field Resolution\nCatalog Lookup"]
Translation --> ExprFieldId["Expr<FieldId>\nResolved fields"]
ExprFieldId --> Normalize["normalize_predicate\nDe Morgan's Laws\nFlatten AND/OR"]
Normalize --> Compiler["ProgramCompiler"]
Compiler --> EvalProg["EvalProgram\nStack Bytecode"]
Compiler --> DomainProg["DomainProgram\nRow ID Selection"]
EvalProg --> Evaluation["Row Evaluation"]
DomainProg --> Evaluation
Evaluation --> MVCCFilter["MVCC Filtering"]
MVCCFilter --> Results["Filtered Results"]
Program Compilation
Normalization
Before compilation, filter expressions are normalized into canonical form using the normalize_predicate function. This transformation ensures consistent structure for optimization and evaluation.
Normalization rules:
- Flatten nested conjunctions/disjunctions: AND(AND(a,b),c) → AND(a,b,c)
- Apply De Morgan's laws: Push NOT operators down through logical connectives
- Eliminate double negations: NOT(NOT(expr)) → expr
- Simplify literal booleans: NOT(true) → false
The normalization process uses an iterative approach to handle deeply nested expressions without stack overflow. The transformation is applied recursively, with special handling for negated conjunctions and disjunctions.
Key normalization functions:
| Function | Purpose |
|---|---|
normalize_predicate | Entry point for expression normalization |
normalize_expr | Flattens AND/OR and delegates to normalize_negated |
normalize_negated | Applies De Morgan's laws and simplifies negations |
Sources: llkv-table/src/planner/program.rs:286-343
Bytecode Generation
The ProgramCompiler translates normalized expressions into two complementary program representations:
EvalProgram operations:
graph TB
subgraph "Compilation"
Expr["Normalized Expr<FieldId>"]
Compiler["ProgramCompiler"]
Expr --> Compiler
end
subgraph "Programs"
EvalProg["EvalProgram\nStack-based bytecode\nfor predicate evaluation"]
DomainProg["DomainProgram\nRow ID domain\ncalculation"]
Compiler --> EvalProg
Compiler --> DomainProg
end
subgraph "Evaluation"
EvalProg --> ResultStack["Result Stack\nbool values"]
DomainProg --> RowIDs["Row ID Sets\ncandidate rows"]
end
| Operation | Stack Effect | Purpose |
|---|---|---|
PushPredicate | → bool | Evaluate single predicate |
PushCompare | → bool | Evaluate scalar comparison |
PushInList | → bool | Evaluate IN list membership |
PushIsNull | → bool | Evaluate NULL test |
PushLiteral | → bool | Push constant boolean |
FusedAnd | → bool | Evaluate fused predicates on same field |
And | bool×N → bool | Logical AND of N values |
Or | bool×N → bool | Logical OR of N values |
Not | bool → bool | Logical negation (uses domain program) |
DomainProgram operations:
| Operation | Purpose |
|---|---|
PushFieldAll | Include all rows for a field |
PushCompareDomain | Rows where scalar comparison fields exist |
PushInListDomain | Rows where IN list fields exist |
PushIsNullDomain | Rows where NULL test fields exist |
Union | Combine row sets (OR semantics) |
Intersect | Intersect row sets (AND semantics) |
Sources: llkv-table/src/planner/program.rs:22-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:544-631
Predicate Fusion
Predicate fusion is an optimization that recognizes multiple predicates on the same field within an AND expression and evaluates them together. This reduces overhead and enables more efficient filtering.
Fusion conditions (from PredicateFusionCache):
| Data Type | Fusion Threshold |
|---|---|
| String types | ≥2 total predicates AND ≥1 Contains operator |
| Other types | ≥2 total predicates on same field |
Example transformation:
age >= 18 AND age < 65 AND age != 30
→ FusedAnd(field_id: age, filters: [>=18, <65, !=30])
The fusion cache tracks predicate patterns during compilation:
- Counts total predicates per field
- Tracks specific operator types (e.g., Contains for strings)
- Decides whether fusion is beneficial via should_fuse
Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/program.rs:518-542
Row-Level Evaluation
Stack-Based Evaluation Engine
Filter evaluation uses a stack-based virtual machine that processes EvalProgram bytecode. Each operation manipulates a boolean result stack.
graph LR
subgraph "Evaluation Loop"
Ops["EvalOp Instructions"]
Stack["Result Stack\nVec<bool>"]
Ops -->|Process| Stack
Stack -->|Final Value| Result["Filter Decision"]
end
subgraph "Example: age >= 18 AND status = 'active'"
Op1["PushPredicate(age >= 18)"] -->|Stack: [true]| Op2["PushPredicate(status = 'active')"]
Op2 -->|Stack: [true, false]| Op3["And(2)"]
Op3 -->|Stack: [false]| Final["Result: false"]
end
The evaluation process iterates through EvalOp instructions, pushing boolean results and combining them according to logical operators. Each predicate evaluation consults the underlying storage to check field values against filter conditions.
Sources: llkv-table/src/planner/mod.rs:1009-1031
graph TB
Predicate["Filter<FieldId>\nfield_id + operator"]
Predicate --> TypeCheck{"Data Type?"}
TypeCheck -->|Fixed-width| FixedPath["build_fixed_width_predicate\nInt, Float, Date, etc."]
TypeCheck -->|Variable-width| VarPath["build_var_width_predicate\nString types"]
TypeCheck -->|Boolean| BoolPath["build_bool_predicate\nBool type"]
FixedPath --> Native["Native comparison\nusing PredicateValue"]
VarPath --> Pattern["Pattern matching\nStartsWith, Contains, etc."]
BoolPath --> Boolean["Boolean logic"]
Native --> Result["bool"]
Pattern --> Result
Boolean --> Result
Predicate Evaluation
Individual predicates are evaluated by comparing field values against filter operators. The evaluation strategy depends on the data type:
Type-specific evaluation paths:
Operator semantics:
| Operator | Description | NULL Handling |
|---|---|---|
Equals | Exact match | NULL = NULL → NULL |
Range | Bounded interval (inclusive/exclusive) | NULL in range → NULL |
In | Set membership | NULL in [values] → NULL |
StartsWith | String prefix match (case-sensitive/insensitive) | NULL starts with X → NULL |
EndsWith | String suffix match | NULL ends with X → NULL |
Contains | String substring match | NULL contains X → NULL |
IsNull | NULL test | Returns true/false |
IsNotNull | NOT NULL test | Returns true/false |
Sources: llkv-expr/src/expr.rs:295-358 llkv-expr/src/typed_predicate.rs:1-500 (referenced but not shown)
graph TB
ScalarExpr["ScalarExpr<FieldId>"]
ScalarExpr --> Mode{"Evaluation Mode"}
Mode -->|Single Row| RowLevel["NumericKernels::evaluate_value\nRecursive evaluation\nReturns Option<NumericValue>"]
Mode -->|Batch| BatchLevel["NumericKernels::evaluate_batch\nVectorized when possible"]
BatchLevel --> Vectorized{"Vectorizable?"}
Vectorized -->|Yes| Vec["try_evaluate_vectorized\nArrow compute kernels"]
Vectorized -->|No| Fallback["Per-row evaluation\nLoop + evaluate_value"]
Vec --> Array["ArrayRef result"]
Fallback --> Array
Scalar Expression Evaluation
For computed columns and complex predicates (e.g., WHERE salary * 1.1 > 50000), scalar expressions are evaluated using the NumericKernels utility.
Evaluation modes:
Numeric type handling:
The NumericArray wrapper provides unified access to different numeric types:
- Integer : Int64Array for integers, booleans, and dates
- Float : Float64Array for floating-point numbers
- Decimal : Decimal128Array for precise decimal values
Type coercion occurs automatically during expression evaluation:
- Mixed integer/float operations promote to float
- String-to-numeric conversions follow SQLite semantics (invalid → 0)
- NULL propagates through operations
Key evaluation functions:
| Function | Purpose | Performance |
|---|---|---|
evaluate_value | Single-row evaluation | Used for non-vectorizable expressions |
evaluate_batch | Batch evaluation | Attempts vectorization first |
try_evaluate_vectorized | Vectorized computation | Uses Arrow compute kernels |
prepare_numeric_arrays | Type coercion | Converts columns to numeric representation |
Sources: llkv-table/src/scalar_eval.rs:453-713 llkv-table/src/scalar_eval.rs:92-383
graph TB
subgraph "Filter Stages"
Scan["Table Scan"]
Scan --> Domain["1. Domain Program\nDetermines candidate\nrow IDs"]
Domain --> UserPred["2. User Predicates\nSemantic filtering\nvia EvalProgram"]
UserPred --> MVCCFilter["3. MVCC Filtering\nrow_id_filter.filter()\nVisibility rules"]
MVCCFilter --> Results["Visible Results"]
end
subgraph "MVCC Visibility"
RowID["Row ID"]
CreatedBy["created_by TxnId"]
DeletedBy["deleted_by Option<TxnId>"]
Snapshot["Transaction Snapshot"]
RowID --> Check{"Visibility Check"}
CreatedBy --> Check
DeletedBy --> Check
Snapshot --> Check
Check -->|Created before snapshot Not deleted or deleted after snapshot| Visible["Include"]
Check -->|Otherwise| Invisible["Exclude"]
end
MVCC Integration
Filter evaluation integrates with MVCC visibility filtering to ensure queries only see rows visible to their transaction. This is a two-stage filtering process:
MVCC filtering implementation:
The row_id_filter option in ScanStreamOptions provides transaction-aware filtering:
- Created by runtime's transaction manager
- Encapsulates snapshot visibility rules
- Applied after user predicate evaluation
- Filters row IDs based on `created_by` and `deleted_by` transaction IDs
Filtering order rationale:
- Domain programs - Quickly eliminate rows where referenced fields don't exist
- User predicates - Evaluate semantic conditions (WHERE clause)
- MVCC filter - Apply transaction visibility rules
This ordering minimizes MVCC overhead by only checking visibility for rows that pass semantic filters.
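A simplified sketch of the visibility rule applied in stage 3 is shown below. The `RowVersion` and `Snapshot` types are hypothetical stand-ins for the MVCC metadata and transaction snapshot, and the check ignores in-flight transactions for brevity; the actual `row_id_filter` logic lives in the runtime's transaction machinery.

```rust
/// Hypothetical per-row MVCC metadata; mirrors the created_by/deleted_by columns.
struct RowVersion {
    created_by: u64,
    deleted_by: Option<u64>, // None when the row is still live
}

/// Hypothetical snapshot: the highest transaction ID visible to this reader.
struct Snapshot {
    high_water: u64,
}

/// A row is visible if it was created at or before the snapshot and is either
/// not deleted, or deleted by a transaction the snapshot cannot see yet.
fn is_visible(row: &RowVersion, snap: &Snapshot) -> bool {
    let created_visible = row.created_by <= snap.high_water;
    let not_deleted = match row.deleted_by {
        None => true,
        Some(txn) => txn > snap.high_water,
    };
    created_visible && not_deleted
}
```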
Sources: llkv-table/src/planner/mod.rs:940-982 llkv-table/src/table.rs:200-300 (type definitions referenced but not shown)
Evaluation Optimizations
Single-Column Direct Scan Fast Path
A specialized fast path handles queries that project a single column with simple filtering, bypassing the general evaluation machinery for better performance.
Conditions for fast path:
- Single column projection
- Filter references only that column
- Simple operator (no complex scalar expressions)
When activated, the scan directly streams the target column's values without materializing intermediate structures.
Sources: llkv-table/src/planner/mod.rs:1020-1031 (method name: try_single_column_direct_scan)
graph LR
Shadow["Shadow Column\nrow_id metadata"]
Shadow -->|Exists?| FastEnum["Fast Path\nstream_table_row_ids"]
Shadow -->|Missing| Fallback["Fallback Path\nMulti-column scan\n+ deduplication"]
FastEnum --> Chunks["Row ID Chunks\nSTREAM_BATCH_ROWS"]
Fallback --> Chunks
Chunks --> MVCCCheck["Apply MVCC Filter"]
MVCCCheck --> Gather["Gather Columns"]
Gather --> Batch["RecordBatch"]
Full Table Scan Streaming
When no predicates require evaluation (e.g., WHERE true or full scan), the executor uses streaming row ID enumeration:
The fast path attempts to use a shadow column (row_id) that stores all row IDs for a table:
- Success case : Shadow column exists → stream chunks directly
- Fallback case : Shadow column missing → scan user columns and deduplicate
Sources: llkv-table/src/planner/mod.rs:739-857 llkv-table/src/planner/mod.rs:859-902 llkv-table/src/planner/mod.rs:904-999
graph TB
Expression["WHERE clause\nexpression tree"]
Expression --> Traverse["Traverse expression\nrecord_expr"]
Traverse --> Track["Track per-field stats:\n- Total predicate count\n- Contains operator count"]
Track --> Decide["should_fuse decision"]
Decide -->|String + Contains ≥1| Fuse1["Enable fusion"]
Decide -->|Any type + predicates ≥2| Fuse2["Enable fusion"]
Decide -->|Otherwise| NoFuse["No fusion"]
Predicate Fusion Cache
The PredicateFusionCache tracks predicate patterns during compilation to enable fusion optimization:
Fusion benefits:
- Reduces function call overhead
- Enables specialized evaluation routines
- Improves cache locality by processing same field
Fusion conditions table:
| Field Data Type | Conditions for Fusion |
|---|---|
| Utf8 / LargeUtf8 | Total predicates ≥ 2 AND Contains operations ≥ 1 |
| Other types | Total predicates ≥ 2 on same field |
Sources: llkv-table/src/planner/mod.rs:517-570
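A sketch of the fusion decision as described by the table above is shown below. The `FieldStats` struct and `should_fuse` helper are illustrative names, not the exact items in llkv-table's planner.

```rust
/// Per-field statistics accumulated while walking the WHERE expression.
struct FieldStats {
    total_predicates: usize,
    contains_ops: usize,
    is_string: bool, // Utf8 / LargeUtf8
}

/// Fusion is worthwhile when multiple predicates touch the same field;
/// string fields additionally require at least one Contains-style operator.
fn should_fuse(stats: &FieldStats) -> bool {
    if stats.is_string {
        stats.total_predicates >= 2 && stats.contains_ops >= 1
    } else {
        stats.total_predicates >= 2
    }
}
```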
graph TB
Expr["ScalarExpr batch"]
Expr --> Check{"Vectorizable?"}
Check -->|Yes| Patterns["Supported patterns:\n- Column references\n- Literal constants\n- Binary ops\n- Scalar×Array ops\n- Array×Array ops"]
Patterns --> ArrowCompute["Arrow compute kernels\nSIMD-optimized"]
Check -->|No| PerRow["Per-row evaluation\nevaluate_value loop"]
ArrowCompute --> Result["ArrayRef"]
PerRow --> Result
Vectorized Expression Evaluation
For numeric operations, the system attempts vectorized evaluation to process entire batches at once:
Vectorization strategy:
Vectorizable expression patterns:
- Pure column references
- Literal constants
- Binary operations: `Array ⊕ Array`, `Array ⊕ Scalar`, `Scalar ⊕ Array`
- Simple casts between numeric types
Non-vectorizable expressions:
- CASE expressions with complex branches
- Date/interval arithmetic
- Aggregate functions
- Subqueries
The vectorization attempt happens in try_evaluate_vectorized, which recursively checks if all sub-expressions can be vectorized. If any sub-expression is non-vectorizable, the entire expression falls back to row-by-row evaluation.
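The recursive "vectorizable only if every sub-expression is" rule can be sketched as follows. This is illustrative plain Rust over `Vec<f64>` columns, not the Arrow-kernel implementation; the `Expr` variants and the `Opaque` catch-all are assumptions for the example.

```rust
/// Illustrative expression subset; the real ScalarExpr<FieldId> is richer.
enum Expr {
    Column(usize),
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
    /// Stand-in for patterns the vectorizer does not handle (CASE, date math, ...).
    Opaque,
}

/// Recursive check: an expression is vectorizable only if every sub-expression is.
fn is_vectorizable(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => true,
        Expr::Add(lhs, rhs) => is_vectorizable(lhs) && is_vectorizable(rhs),
        Expr::Opaque => false,
    }
}

/// Whole-batch evaluation; returns None when the expression is not vectorizable,
/// signaling the caller to fall back to the per-row evaluate_value loop.
fn evaluate_vectorized(expr: &Expr, columns: &[Vec<f64>]) -> Option<Vec<f64>> {
    if !is_vectorizable(expr) {
        return None;
    }
    Some(match expr {
        Expr::Column(i) => columns[*i].clone(),
        Expr::Literal(v) => vec![*v; columns.first().map_or(0, Vec::len)],
        Expr::Add(lhs, rhs) => {
            let l = evaluate_vectorized(lhs, columns)?;
            let r = evaluate_vectorized(rhs, columns)?;
            l.iter().zip(&r).map(|(a, b)| a + b).collect()
        }
        Expr::Opaque => unreachable!(),
    })
}
```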
Sources: llkv-table/src/scalar_eval.rs:714-763 llkv-table/src/scalar_eval.rs:676-713
Storage Layer
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.
For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.
Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45
Architecture Overview
The storage layer implements a three-tier architecture where each tier has a distinct responsibility:
Key Components:
graph TB
subgraph "Schema Layer"
Table["Table\nllkv-table::Table"]
Schema["Schema Validation"]
MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
end
subgraph "Column Management Layer"
ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
Chunks["Column Chunks\nArrow RecordBatch segments"]
Descriptors["ColumnDescriptor\nChunk metadata"]
end
subgraph "Physical Persistence Layer"
Pager["Pager Trait\nllkv-storage::pager::Pager"]
MemPager["MemPager\nHashMap backend"]
SimdPager["SimdRDrivePager\nMemory-mapped file"]
end
Table --> Schema
Schema --> MVCC
MVCC --> ColumnStore
ColumnStore --> Catalog
ColumnStore --> Chunks
ColumnStore --> Descriptors
Catalog --> Pager
Chunks --> Pager
Descriptors --> Pager
Pager --> MemPager
Pager --> SimdPager
| Layer | Crate | Primary Types | Responsibility |
|---|---|---|---|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |
Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28
Logical vs Physical Addressing
The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:
Namespace Segregation:
graph LR
subgraph "Logical Space"
LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
UserNS["Namespace::UserData"]
RowNS["Namespace::RowIdShadow"]
MVCCNS["Namespace::TxnMetadata"]
end
subgraph "Catalog Mapping"
CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
ValuePK["Value PhysicalKey"]
RowPK["RowId PhysicalKey"]
end
subgraph "Physical Space"
PhysKey["PhysicalKey (u64)"]
DescBlob["Descriptor Blob\nColumnDescriptor"]
ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
end
LogicalField --> CatalogEntry
UserNS --> LogicalField
RowNS --> LogicalField
MVCCNS --> LogicalField
CatalogEntry --> ValuePK
CatalogEntry --> RowPK
ValuePK --> PhysKey
RowPK --> PhysKey
PhysKey --> DescBlob
PhysKey --> ChunkBlob
LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:
- `Namespace::UserData`: User-defined columns (e.g., `name`, `age`, `email`)
- `Namespace::RowIdShadow`: Parallel row-id arrays used for gather operations
- `Namespace::TxnMetadata`: MVCC columns (`created_by`, `deleted_by`)
The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.
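A sketch of how such a three-part address can be packed into a single u64 follows. The bit widths (8/24/32) and the `Namespace` discriminants are assumptions for illustration and do not necessarily match llkv-column-map's actual encoding.

```rust
/// Hypothetical namespace tags; the real enum lives in llkv-column-map.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    TxnMetadata = 2,
}

/// Pack (namespace, table_id, field_id) into one u64: assumed 8 + 24 + 32 bit split.
fn pack(ns: Namespace, table_id: u32, field_id: u32) -> u64 {
    ((ns as u64) << 56) | (((table_id as u64) & 0xFF_FFFF) << 32) | field_id as u64
}

fn unpack(id: u64) -> (u8, u32, u32) {
    ((id >> 56) as u8, ((id >> 32) & 0xFF_FFFF) as u32, id as u32)
}

fn main() {
    let id = pack(Namespace::TxnMetadata, 5, 2);
    assert_eq!(unpack(id), (2, 5, 2));
}
```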
Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37
Data Persistence Model
Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:
Chunking Strategy:
sequenceDiagram
participant Caller
participant Table as "Table"
participant ColumnStore as "ColumnStore"
participant Serialization as "serialization"
participant Pager as "Pager"
Caller->>Table: append(RecordBatch)
Note over Table: Validate schema\nInject MVCC columns
Table->>ColumnStore: append(RecordBatch with MVCC)
Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins
loop "For each column"
ColumnStore->>Serialization: serialize_array(Array)
Serialization-->>ColumnStore: Vec<u8> blob
ColumnStore->>Pager: batch_put(PhysicalKey, blob)
end
Pager-->>ColumnStore: Success
ColumnStore-->>Table: Success
Table-->>Caller: Success
Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.
MVCC Column Layout:
Every table's physical storage includes three categories of columns:
- User-defined columns from the schema
- `row_id` (UInt64): monotonic row identifier
- `created_by` (UInt64): transaction ID that created the row
- `deleted_by` (UInt64): transaction ID that deleted the row (0 if live)
These MVCC columns are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.
Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25
Serialization and Zero-Copy Design
The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:
Serialization Format
The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:
| Layout | Type Code | Use Case | Header Fields |
|---|---|---|---|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |
Each serialized blob begins with a 24-byte header:
Offset Field Type Description
------ ----- ---- -----------
0-3 Magic [u8;4] "ARR0"
4 Layout u8 Layout discriminant (0-3)
5 PrimType u8 Type code (layout-specific)
6 Precision/Pad u8 Decimal precision or padding
7 Scale/Pad u8 Decimal scale or padding
8-15 Length u64 Logical element count
16-19 Extra A u32 Layout-specific (e.g., values_len)
20-23 Extra B u32 Layout-specific (e.g., offsets_len)
24+ Payload [u8] Raw Arrow buffers
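Based on the layout above, a header can be decoded with straightforward byte slicing. The sketch below assumes little-endian integer encoding and uses illustrative field names; it is not the decoder in llkv-storage.

```rust
/// Decoded view of the 24-byte blob header described above (field names illustrative).
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,
    extra_a: u32,
    extra_b: u32,
}

fn parse_header(bytes: &[u8]) -> Result<BlobHeader, String> {
    if bytes.len() < 24 || &bytes[..4] != b"ARR0" {
        return Err("not a serialized array blob".into());
    }
    Ok(BlobHeader {
        layout: bytes[4],
        prim_type: bytes[5],
        precision: bytes[6],
        scale: bytes[7],
        len: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(bytes[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(bytes[20..24].try_into().unwrap()),
    })
}
```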
Why Custom Format Instead of Arrow IPC:
The custom format achieves three goals:
- Minimal overhead : No schema framing or padding, just raw buffers
- Contiguous payloads : Each array's bytes are adjacent, ideal for SIMD and sequential scans
- True zero-copy : `deserialize_array` constructs `ArrayData` directly from `EntryHandle` buffers without memcpy
Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298
EntryHandle Abstraction
The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:
- `as_ref() -> &[u8]`: Zero-copy slice view
- `as_arrow_buffer() -> Buffer`: Wrap as Arrow buffer without copying
graph LR
File["Persistent File\nsimd_r_drive::DataStore"]
Mmap["Memory-Mapped Region"]
EntryHandle["EntryHandle"]
Buffer["Arrow Buffer"]
ArrayData["ArrayData"]
ArrayRef["ArrayRef"]
File --> Mmap
Mmap --> EntryHandle
EntryHandle --> Buffer
Buffer --> ArrayData
ArrayData --> ArrayRef
style File fill:#f9f9f9
style ArrayRef fill:#f9f9f9
When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying:
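The general shape of that construction against the public arrow-rs API is sketched below, assuming a recent arrow crate. For brevity the Buffer is built from an owned Vec, whereas the real path wraps the EntryHandle's memory-mapped bytes (for example via its as_arrow_buffer() accessor).

```rust
use arrow::array::{make_array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Rebuild an Int64 array from a raw little-endian payload; once the bytes are
/// inside a Buffer, ArrayData references them without further copying.
fn int64_array_from_payload(payload: Vec<u8>) -> ArrayRef {
    let len = payload.len() / std::mem::size_of::<i64>();
    let buffer = Buffer::from_vec(payload); // takes ownership of the allocation
    let data = ArrayData::builder(DataType::Int64)
        .len(len)
        .add_buffer(buffer)
        .build()
        .expect("valid Int64 layout");
    make_array(data)
}
```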
Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29
Pager Implementations
The Pager trait defines the interface for batch get/put operations over physical keys:
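A simplified signature sketch makes the contract concrete; the actual trait in llkv-storage uses different associated types (notably `EntryHandle` blobs) and its own error type, so the names and signatures below are assumptions for illustration only.

```rust
/// Simplified stand-ins for the types referenced in this section.
type PhysicalKey = u64;
type Blob = Vec<u8>; // the real trait yields EntryHandle values instead

/// Illustrative pager contract: batch-oriented gets and atomic batch puts.
trait Pager {
    /// Fetch many blobs in one round trip; missing keys surface as None.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Result<Vec<Option<Blob>>, String>;

    /// Write many key/value pairs; either all writes land or none do.
    fn batch_put(&self, entries: &[(PhysicalKey, Blob)]) -> Result<(), String>;

    /// Remove entries by key.
    fn delete(&self, keys: &[PhysicalKey]) -> Result<(), String>;

    /// Flush pending writes to durable storage.
    fn flush(&self) -> Result<(), String>;
}
```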
MemPager
MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:
- Unit tests and benchmarks
- Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
- Temporary namespaces that don't require persistence
SimdRDrivePager
SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:
| Feature | Implementation |
|---|---|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |
The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.
Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28
Column Storage Operations
The ColumnStore provides three primary operation patterns:
Append Workflow
Last-Writer-Wins Semantics:
When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:
- Identify chunks containing overlapping `row_id` ranges
- Load those chunks and merge with new data
- Re-serialize merged chunks
- Atomically update descriptors and chunk blobs
This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
Sources: llkv-column-map/README.md:24-29
Gather Operations
Gather operations retrieve specific rows by row_id from columnar storage:
Null-Handling Policies:
Gather operations support three policies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |
The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.
Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446
Streaming Scans
The ColumnStream type provides paginated, filtered scans over columnar data:
Scans operate in chunks to avoid materializing entire tables:
- Load next chunk of row IDs and MVCC metadata
- Apply MVCC visibility filter (transaction snapshot check)
- Evaluate user predicates on loaded columns
- Gather matching rows into a `RecordBatch`
- Yield batch to caller
This streaming model enables large result sets to be processed incrementally without exhausting memory.
Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25
Integration Points
The storage layer is consumed by multiple higher-level components:
Key Integration Patterns:
| Consumer | Usage Pattern |
|---|---|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |
System Catalog Self-Hosting:
The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:
- Runtime opens the pager
- `ColumnStore` is initialized with the pager
- `SysCatalog` is constructed, reading metadata from table 0
- User tables are opened using metadata from `SysCatalog`
This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28
Table Abstraction
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Table abstraction provides a schema-aware interface for data operations in the LLKV storage layer. It sits between query execution components and the columnar storage engine, managing schema validation, MVCC metadata injection, and translating logical operations into physical column store interactions. This document details the Table struct and its APIs for appending data, scanning rows, and coordinating with the system catalog.
For information about the underlying columnar storage implementation, see Column Storage and ColumnStore. For details on the storage pager abstraction, see Pager Interface and SIMD Optimization. For catalog management APIs, see CatalogManager API and System Catalog and SysCatalog.
Overview
The llkv-table crate provides the primary interface between SQL execution and physical storage. Each Table instance represents a logical table with a defined schema and wraps a reference to a ColumnStore that handles the actual persistence. Tables are responsible for enforcing schema constraints, injecting MVCC metadata columns, and exposing streaming scan APIs that integrate with the query executor.
Sources: llkv-table/README.md:1-57 llkv-table/Cargo.toml:1-60
graph TB
subgraph "Query Execution Layer"
RUNTIME["Runtime\nStatement Executor"]
EXECUTOR["Executor\nQuery Evaluation"]
end
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema-aware API"]
SYSCAT["SysCatalog\nTable 0\nMetadata Store"]
STREAM["ColumnStream\nStreaming Scans"]
end
subgraph "Column Store Layer (llkv-column-map)"
COLSTORE["ColumnStore\nColumnar Storage"]
PROJECTION["Projection\nGather Logic"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager Trait\nBatch Get/Put"]
end
RUNTIME -->|CREATE TABLE INSERT UPDATE| TABLE
EXECUTOR -->|SELECT scan_stream| TABLE
TABLE -->|validate schema| TABLE
TABLE -->|inject MVCC cols| TABLE
TABLE -->|append RecordBatch| COLSTORE
TABLE -->|gather_rows| COLSTORE
SYSCAT -->|stores TableMeta ColMeta| COLSTORE
TABLE -->|scan_stream returns| STREAM
STREAM -->|fetch batches| COLSTORE
COLSTORE -->|uses| PROJECTION
PROJECTION -->|batch_get/put| PAGER
Table Structure and Core Responsibilities
A Table instance encapsulates a schema-validated view over a ColumnStore. The table layer is responsible for:
| Responsibility | Description |
|---|---|
| Schema Validation | Ensures all RecordBatch operations match the declared Arrow schema |
| MVCC Injection | Adds system columns (row_id, created_by, deleted_by) to all data |
| Catalog Coordination | Persists and retrieves table/column metadata via SysCatalog (table 0) |
| Data Routing | Translates logical field requests to LogicalFieldId for ColumnStore |
| Streaming Scans | Provides ColumnStream API for paginated, predicate-pushdown reads |
The table wraps an Arc<ColumnStore> from llkv-column-map, enabling multiple table instances to share the same underlying storage. This design supports efficient metadata queries and concurrent access patterns.
Sources: llkv-table/README.md:12-40 llkv-column-map/README.md:1-61
MVCC Column Management
Every table in LLKV maintains three system columns alongside user-defined fields:
graph LR
subgraph "User Schema"
UC1["name: Utf8"]
UC2["age: Int32"]
UC3["email: Utf8"]
end
subgraph "System Columns (MVCC)"
ROW_ID["row_id: UInt64\nMonotonic identifier"]
CREATED["created_by: UInt64\nTransaction ID"]
DELETED["deleted_by: UInt64\nDeletion TXN or MAX"]
end
UC1 -.->|stored in namespace USER| COLSTORE["ColumnStore"]
UC2 -.-> COLSTORE
UC3 -.-> COLSTORE
ROW_ID -.->|namespace TXN_METADATA| COLSTORE
CREATED -.-> COLSTORE
DELETED -.-> COLSTORE
COLSTORE["ColumnStore\nLogicalFieldId\nNamespacing"]
MVCC Column Semantics
- `row_id`: A monotonically increasing `UInt64` that uniquely identifies each row within a table. Assigned during append operations and used for row-level operations and correlation.
- `created_by`: The transaction ID (`UInt64`) that created this row version. Set during `INSERT` or `UPDATE` operations.
- `deleted_by`: The transaction ID that marked this row as deleted, or `u64::MAX` if the row is still live. `UPDATE` operations logically delete old versions and insert new ones.
These columns are stored in separate logical namespaces within the ColumnStore to avoid collisions with user-defined columns. The table layer automatically injects these columns during append operations and uses them for visibility filtering during scans.
Sources: llkv-table/README.md:15-16 llkv-column-map/README.md:20-28
Data Operations
Append Operations
The Table::append method accepts an Arrow RecordBatch and performs the following steps:
graph TB
START["Table::append(RecordBatch)"]
VALIDATE["Validate Schema\nCheck column names/types"]
INJECT["Inject MVCC Columns\nrow_id, created_by, deleted_by"]
NAMESPACE["Map to LogicalFieldId\nApply namespace prefixes"]
PERSIST["ColumnStore::append\nSort by row_id\nLast-writer-wins"]
COMMIT["Pager::batch_put\nAtomic commit"]
START --> VALIDATE
VALIDATE -->|schema mismatch| ERROR["Return Error"]
VALIDATE -->|valid| INJECT
INJECT --> NAMESPACE
NAMESPACE --> PERSIST
PERSIST --> COMMIT
COMMIT --> SUCCESS["Return Ok"]
The append pipeline ensures:
- Schema consistency : All incoming batches must match the table's declared schema
- MVCC tagging : System columns are added with appropriate transaction IDs
- Ordering : Rows are sorted by `row_id` before persistence for efficient scans
- Atomicity : Multi-column writes are committed atomically via batch pager operations
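For example, an append might be driven by a batch built as shown below. Only the RecordBatch construction uses the actual arrow API; the table schema and the final `table.append(&batch)` call shape are assumptions for illustration.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> RecordBatch {
    // User schema only; row_id / created_by / deleted_by are injected by the table layer.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int32, true),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["ada", "grace"])),
        Arc::new(Int32Array::from(vec![Some(36), None])),
    ];
    RecordBatch::try_new(schema, columns).expect("arrays match schema")
}
// A caller would then hand the batch to the table layer, e.g. `table.append(&batch)?`
// (the exact append signature is not shown on this page).
```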
Sources: llkv-table/README.md:20-30 llkv-column-map/README.md:24-28
Scan Operations
Tables expose streaming scan APIs through the scan_stream method, which returns a ColumnStream for paginated result retrieval:
graph TB
SCAN["Table::scan_stream\n(projections, filter)"]
NORMALIZE["Normalize Predicate\nApply De Morgan's laws"]
COMPILE["Compile to EvalProgram\nStack-based bytecode"]
STREAM["Create ColumnStream\nLazy evaluation"]
FETCH["ColumnStream::next_batch\nFetch N rows"]
FILTER["Apply Predicate\nVectorized evaluation"]
MVCC["MVCC Filtering\nSnapshot visibility"]
PROJECT["Gather Projected Cols\ngather_rows()"]
BATCH["Return RecordBatch"]
SCAN --> NORMALIZE
NORMALIZE --> COMPILE
COMPILE --> STREAM
STREAM -.->|caller iterates| FETCH
FETCH --> FILTER
FILTER --> MVCC
MVCC --> PROJECT
PROJECT --> BATCH
BATCH -.->|next iteration| FETCH
The scan path supports:
- Predicate pushdown : Filters are compiled to bytecode and evaluated at the column store level
- Projection : Only requested columns are materialized
- MVCC filtering : Rows are filtered based on transaction snapshot visibility rules
- Streaming : Results are produced in fixed-size batches to avoid large memory allocations
Sources: llkv-table/README.md:23-24 llkv-column-map/README.md:30-34
Schema Validation
Schema validation occurs at table creation and during every append operation. The table layer enforces:
| Validation Check | Enforcement Point |
|---|---|
| Column names | Must match declared schema exactly (case-sensitive) |
| Data types | Must match Arrow DataType including nested types |
| Nullability | Enforced for non-nullable columns |
| Field count | Batch must contain exactly the declared columns |
Schema definitions are persisted in the system catalog (table 0) as TableMeta and ColMeta entries. The catalog stores:
- Table ID and name
- Column names, types, and nullability flags
- Constraint metadata (e.g., PRIMARY KEY, NOT NULL)
Sources: llkv-table/README.md:14-15 llkv-table/README.md:27-29
graph TB
subgraph "System Catalog (Table 0)"
TABLEMETA["TableMeta Records\ntable_id, name, schema"]
COLMETA["ColMeta Records\ntable_id, col_name, type"]
end
subgraph "User Tables (1..N)"
TBL1["Table 1\nusers"]
TBL2["Table 2\norders"]
TBL3["Table N\nproducts"]
end
TABLEMETA -->|describes| TBL1
TABLEMETA -->|describes| TBL2
TABLEMETA -->|describes| TBL3
COLMETA -->|defines columns| TBL1
COLMETA -->|defines columns| TBL2
COLMETA -->|defines columns| TBL3
SYSCAT["SysCatalog API\ncreate_table()\nget_table_meta()\nlist_tables()"]
SYSCAT -->|reads/writes| TABLEMETA
SYSCAT -->|reads/writes| COLMETA
System Catalog Integration
The SysCatalog is a special table (table ID 0) that stores metadata for all other tables:
The system catalog itself uses the same storage infrastructure as user tables:
- Stored as Arrow `RecordBatch`es in the `ColumnStore`
- Subject to MVCC versioning
- Persisted through the pager for crash consistency
This self-hosting design ensures metadata operations follow the same transactional semantics as data operations.
Sources: llkv-table/README.md:27-29 llkv-column-map/README.md:10-16
Projection and Gathering
The table layer delegates projection and row gathering to the ColumnStore, which provides specialized APIs for materializing requested columns:
Projection Structure
A Projection describes a single column to retrieve, optionally renaming it in the output schema. Projections are resolved to LogicalFieldId by consulting the catalog, then passed to the ColumnStore for gathering.
Sources: llkv-column-map/src/store/projection.rs:49-73
Null Handling Policies
The projection system supports three null-handling strategies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Any missing row_id causes an error |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Rows with all-null projected columns are omitted |
These policies enable different executor semantics: INNER JOIN uses ErrorOnMissing, LEFT JOIN uses IncludeNulls, and aggregation pipelines use DropNulls to skip tombstones.
Sources: llkv-column-map/src/store/projection.rs:39-47
graph TB
PREPARE["prepare_gather_context\n(field_ids)"]
CATALOG["Load ColumnDescriptors\nfrom catalog"]
METAS["Collect ChunkMetadata\nvalue + row chunks"]
CTX["MultiGatherContext\nplans, cache, scratch"]
GATHER1["gather_rows_with_reusable_context\n(row_ids_1)"]
GATHER2["gather_rows_with_reusable_context\n(row_ids_2)"]
GATHERN["gather_rows_with_reusable_context\n(row_ids_N)"]
PREPARE --> CATALOG
CATALOG --> METAS
METAS --> CTX
CTX -.->|reuses chunk cache| GATHER1
CTX -.->|reuses chunk cache| GATHER2
CTX -.->|reuses chunk cache| GATHERN
GATHER1 --> BATCH1["RecordBatch 1"]
GATHER2 --> BATCH2["RecordBatch 2"]
GATHERN --> BATCHN["RecordBatch N"]
Multi-Column Gather Context
For queries that scan the same row set multiple times (e.g., joins, aggregations), the table layer provides MultiGatherContext to amortize fetch costs:
The context caches:
- Chunk arrays : Deserialized Arrow arrays for reuse across calls
- Row indices : Hash maps for sparse row lookups
- Scratch buffers : Pre-allocated vectors for gather operations
This optimization is critical for nested loop joins and multi-pass aggregations where the same columns are accessed repeatedly.
Sources: llkv-column-map/src/store/projection.rs:94-227 llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:516-758
graph TB
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema + Arc<ColumnStore>"]
FIELDMAP["Field Name → LogicalFieldId\nNamespace mapping"]
end
subgraph "ColumnStore Layer (llkv-column-map)"
COLSTORE["ColumnStore\nLogicalFieldId → PhysicalKey"]
DESCRIPTOR["ColumnDescriptor\nChunk metadata lists"]
CHUNKS["Column Chunks\nSerialized Arrow arrays"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager\nbatch_get/batch_put"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
TABLE -->|append batch| COLSTORE
TABLE -->|scan_stream| COLSTORE
TABLE -->|gather_rows field_ids| COLSTORE
FIELDMAP -.->|resolves to| COLSTORE
COLSTORE -->|maps to| DESCRIPTOR
DESCRIPTOR -->|points to| CHUNKS
COLSTORE -->|batch_get| PAGER
COLSTORE -->|batch_put| PAGER
PAGER -.->|impl| MEMPAGER
PAGER -.->|impl| SIMDPAGER
Integration with ColumnStore
The table layer wraps a ColumnStore and translates high-level operations into low-level storage calls:
Logical Field Namespacing
Each logical field in a table is assigned a LogicalFieldId that encodes:
- Namespace : `USER`, `TXN_METADATA`, or `ROWID_SHADOW`
- Table ID : `u32` identifier
- Field ID : `u32` column index
This namespacing prevents collisions between user columns and MVCC metadata while allowing them to share the same physical ColumnStore instance.
Sources: llkv-column-map/README.md:18-22 llkv-table/README.md:20-22
Zero-Copy Reads
The ColumnStore delegates to the Pager trait for physical storage access. When using SimdRDrivePager (persistent backend), reads are zero-copy: the pager returns EntryHandle wrappers that directly reference memory-mapped regions. This enables SIMD-accelerated scans without buffer allocation or copying.
Sources: llkv-storage/README.md:9-17 llkv-column-map/README.md:36-40
Usage in the Stack
The table abstraction is consumed by:
| Component | Usage |
|---|---|
| llkv-runtime | Executes all DML and DDL operations through Table APIs |
| llkv-executor | Relies on scan_stream for SELECT evaluation, joins, and aggregations |
| llkv-sql | Indirectly via llkv-runtime for SQL statement execution |
| llkv-csv | Uses Table::append for bulk CSV ingestion |
The streaming scan API (scan_stream) is particularly important for the executor, which processes query results in fixed-size batches to avoid buffering entire result sets in memory.
Sources: llkv-table/README.md:36-40 llkv-runtime/README.md:36-40 llkv-csv/README.md:14-20
Column Storage and ColumnStore
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
This document describes the column-oriented storage layer implemented by llkv-column-map, focusing on the ColumnStore struct that manages physical persistence of Arrow columnar data. The column store sits between the table abstraction and the pager interface, translating logical field requests into physical chunk operations.
For the higher-level table API that wraps ColumnStore, see Table Abstraction. For details on the underlying storage backends, see Pager Interface and SIMD Optimization.
Architecture Position
The ColumnStore acts as the bridge between schema-aware tables and raw key-value storage:
Sources: llkv-column-map/README.md:10-46 llkv-table/README.md:19-24
graph TB
Table["llkv-table::Table\nSchema validation\nMVCC injection"]
ColumnStore["llkv-column-map::ColumnStore\nColumn chunking\nLogicalFieldId → PhysicalKey"]
Pager["llkv-storage::Pager\nbatch_get / batch_put\nMemPager / SimdRDrivePager"]
Table -->|append RecordBatch| ColumnStore
Table -->|scan / gather| ColumnStore
ColumnStore -->|BatchGet / BatchPut| Pager
ColumnStore -->|serialized Arrow chunks| Pager
Pager -->|EntryHandle zero-copy| ColumnStore
Logical Field Identification
LogicalFieldId Structure
Each column is identified by a LogicalFieldId that encodes three components:
| Component | Bits | Purpose |
|---|---|---|
| Namespace | High bits | Segregates user data, MVCC metadata, and row-id shadows |
| Table ID | Middle bits | Identifies which table the column belongs to |
| Field ID | Low bits | Distinguishes columns within a table |
This structure prevents collisions when multiple tables share the same physical pager while maintaining clear boundaries between user data and system metadata.
Sources: llkv-column-map/README.md:19-22
Namespace Segregation
Three primary namespaces exist:
- User Data : Columns explicitly defined in `CREATE TABLE` statements
- MVCC Metadata : System columns `created_by` and `deleted_by` for transaction visibility
- Row ID Shadow : Parallel storage of `row_id` values for each column to enable efficient random access
Each namespace maps to distinct LogicalFieldId values, ensuring that MVCC bookkeeping and user data remain isolated in the catalog.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:16-17
Physical Storage Model
PhysicalKey Allocation and Mapping
The column store maintains a catalog that maps each LogicalFieldId to a PhysicalKey (u64 identifier) allocated by the pager:
graph LR
LF1["LogicalFieldId\n(ns=User, table=5, field=2)"]
LF2["LogicalFieldId\n(ns=RowId, table=5, field=2)"]
LF3["LogicalFieldId\n(ns=Mvcc, table=5, field=created_by)"]
PK1["PhysicalKey: 1024\nDescriptor"]
PK2["PhysicalKey: 1025\nData Chunks"]
PK3["PhysicalKey: 2048\nDescriptor"]
PK4["PhysicalKey: 2049\nData Chunks"]
LF1 --> PK1
LF1 --> PK2
LF2 --> PK3
LF2 --> PK4
PK1 -->|ColumnDescriptor| Pager["Pager"]
PK2 -->|Serialized Arrays| Pager
PK3 -->|ColumnDescriptor| Pager
PK4 -->|Serialized Arrays| Pager
Each logical field requires at least two physical keys: one for the column descriptor (metadata about chunks) and one or more for the actual data chunks.
Sources: llkv-column-map/README.md:19-21 llkv-column-map/src/store/projection.rs:461-468
Column Descriptors and Chunks
Column data is split into fixed-size chunks, each serialized as an Arrow array. Metadata about chunks is stored in a ColumnDescriptor:
graph TB
subgraph "Column Descriptor"
Head["head_page_pk: PhysicalKey\nPoints to descriptor chain"]
Meta1["ChunkMetadata[0]\nchunk_pk, row_count\nmin_val_u64, max_val_u64"]
Meta2["ChunkMetadata[1]\n..."]
Meta3["ChunkMetadata[n]\n..."]
Head --> Meta1
Head --> Meta2
Head --> Meta3
end
subgraph "Physical Storage"
Chunk1["PhysicalKey: chunk_pk[0]\nSerialized Arrow Array\nrow_ids: 0..999"]
Chunk2["PhysicalKey: chunk_pk[1]\nSerialized Arrow Array\nrow_ids: 1000..1999"]
end
Meta1 -.->|references| Chunk1
Meta2 -.->|references| Chunk2
The descriptor stores min/max row ID values for each chunk, enabling efficient skip-scan during queries by filtering out chunks that cannot contain requested row IDs.
Sources: llkv-column-map/src/store/projection.rs:229-236 llkv-column-map/src/store/projection.rs:760-772
Column Catalog Persistence
The catalog mapping LogicalFieldId to PhysicalKey is itself stored in the pager at a well-known root key. On initialization:
- `ColumnStore::open()` attempts to load the catalog from the pager root
- If not found, an empty catalog is initialized
- All catalog updates are committed atomically during append operations
This design ensures the catalog state remains consistent with persisted data, even after crashes.
Sources: llkv-column-map/README.md:38-40
Append Pipeline
RecordBatch Persistence Flow
Sources: llkv-column-map/README.md:24-28
Last-Writer-Wins Semantics
When appending data with row IDs that overlap existing chunks:
- The store identifies which chunks contain conflicting row IDs
- Existing chunks are deserialized and merged with new data
- For duplicate row IDs, the new value overwrites the old
- Rewritten chunks are serialized and committed atomically
This ensures that UPDATE operations (implemented as appends at the table layer) correctly overwrite previous values without requiring separate update logic.
Sources: llkv-column-map/README.md:26-27
Chunking Strategy
Columns are divided into chunks based on:
- Chunk size threshold : Configurable limit on rows per chunk (typically several thousand)
- Row ID ranges : Each chunk covers a contiguous range of row IDs
- Physical key allocation : Each chunk gets a unique physical key from the pager
This chunking enables:
- Parallel scan operations across chunks
- Efficient skip-scan by filtering chunks based on row ID predicates
- Incremental garbage collection of deleted chunks
Sources: llkv-column-map/src/store/projection.rs:760-772
Data Retrieval
Gather Operations
The column store provides two gather strategies for random-access row retrieval:
graph TB
Input["Row IDs: [5, 123, 999]"]
Input --> Sort["Sort and deduplicate"]
Sort --> Filter["Identify intersecting chunks"]
Filter --> Fetch["batch_get(chunk keys)"]
Fetch --> Deserialize["Deserialize Arrow arrays"]
Deserialize --> Gather["Gather requested rows"]
Gather --> Output["RecordBatch"]
Single-Shot Gather
For one-off queries, gather_rows() performs a complete fetch without caching:
Sources: llkv-column-map/src/store/projection.rs:245-268
Reusable Context Gather
For repeated queries (e.g., join inner loop), gather_rows_with_reusable_context() amortizes costs:
- Prepare a `MultiGatherContext` containing column descriptors and scratch buffers
- Call gather repeatedly, reusing:
  - Decoded Arrow chunk arrays (cached in `chunk_cache`)
  - Row index hash maps (preallocated buffers)
  - Scratch space for row locators
This avoids redundant descriptor fetches and chunk decodes across multiple gather calls.
Sources: llkv-column-map/src/store/projection.rs:516-758
Gather Null Policies
Three policies control null handling:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row ID is not found |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Remove rows where all projected columns are null or missing |
The DropNulls policy is used by MVCC filtering to exclude logically deleted rows.
Sources: llkv-column-map/src/store/projection.rs:39-47
Projection Planning
The projection subsystem prepares multi-column gathers:
Each FieldPlan contains:
- value_metas : Metadata for value chunks (actual column data)
- row_metas : Metadata for row ID chunks (parallel row ID storage)
- candidate_indices : Pre-filtered list of chunks that might contain requested rows
Sources: llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:229-236
Chunk Intersection Logic
Before fetching chunks, the store filters based on row ID range overlap:
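A sketch of that range-overlap test over per-chunk min/max metadata follows; `ChunkMeta` and `candidate_chunks` are illustrative names standing in for the ChunkMetadata entries and planner logic described earlier.

```rust
/// Illustrative chunk metadata: the min/max row IDs covered by one chunk.
struct ChunkMeta {
    min_row_id: u64,
    max_row_id: u64,
}

/// Keep only chunks whose [min, max] range overlaps the envelope of requested IDs.
/// `sorted_row_ids` is assumed sorted ascending, matching the gather pipeline.
fn candidate_chunks<'a>(chunks: &'a [ChunkMeta], sorted_row_ids: &[u64]) -> Vec<&'a ChunkMeta> {
    let (Some(&lo), Some(&hi)) = (sorted_row_ids.first(), sorted_row_ids.last()) else {
        return Vec::new();
    };
    chunks
        .iter()
        .filter(|c| c.max_row_id >= lo && c.min_row_id <= hi)
        .collect()
}
```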
This optimization prevents loading chunks that cannot possibly contain any requested rows.
Sources: llkv-column-map/src/store/projection.rs:774-794
graph TB
subgraph "Serialized Array Format"
Header["Header (24 bytes)\nMAGIC: 'ARR0'\nlayout: Primitive/Varlen/FslFloat32/Struct\ntype_code: PrimType enum\nlen: element count\nextra_a, extra_b: layout-specific"]
Payload["Payload\nRaw Arrow buffer bytes"]
Header --> Payload
end
subgraph "Deserialization (Zero-Copy)"
EntryHandle["EntryHandle from Pager\n(memory-mapped or in-memory)"]
ArrowBuffer["Arrow Buffer\nwraps EntryHandle bytes"]
ArrayData["ArrayData\nreferences Buffer directly"]
EntryHandle --> ArrowBuffer
ArrowBuffer --> ArrayData
end
Payload -.->|stored as| EntryHandle
Serialization Format
Zero-Copy Array Persistence
Column chunks are serialized using a custom format optimized for memory-mapped zero-copy reads:
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
Layout Types
Four layout variants handle different Arrow data types:
| Layout | Use Case | Payload Structure |
|---|---|---|
| Primitive | Fixed-width primitives (Int32, Float64, etc.) | Single values buffer |
| Varlen | Variable-length (Binary, Utf8, LargeBinary) | Offsets buffer + values buffer |
| FslFloat32 | FixedSizeList | Single contiguous Float32 buffer |
| Struct | Nested struct types | Arrow IPC serialized payload |
The FslFloat32 layout is a specialized fast-path for dense vector columns, avoiding nesting overhead.
Sources: llkv-storage/src/serialization.rs:54-135
Why Not Arrow IPC?
The custom format is used instead of standard Arrow IPC for several reasons:
- Minimal headers : No schema objects or framing, reducing file size
- Predictable payloads : Each array occupies one contiguous region, ideal for mmap and SIMD
- True zero-copy : Deserialization produces `ArrayData` referencing the original mmap directly
- Stable codes : Layout and type tags are explicitly pinned with compile-time checks
The trade-off is reduced generality (e.g., no null bitmaps yet) for better scan performance in this storage engine's access patterns.
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/README.md:10-16
Type Code Stability
The PrimType enum discriminants are compile-time pinned to prevent silent corruption:
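The pinning pattern looks roughly like the sketch below; the variants and discriminant values shown are illustrative, not the full set used by llkv-storage.

```rust
/// Illustrative subset of the persisted type codes.
#[repr(u8)]
enum PrimType {
    Int32 = 0,
    Int64 = 1,
    Float64 = 2,
    Utf8 = 3,
}

// Compile-time pins: if a discriminant drifts, the build fails instead of
// silently reinterpreting previously persisted blobs.
const _: () = assert!(PrimType::Int32 as u8 == 0);
const _: () = assert!(PrimType::Int64 as u8 == 1);
const _: () = assert!(PrimType::Float64 as u8 == 2);
const _: () = assert!(PrimType::Utf8 as u8 == 3);
```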
If any discriminant is accidentally changed, the code will fail to compile, preventing data corruption.
Sources: llkv-storage/src/serialization.rs:561-586
graph TB
subgraph "Table Responsibilities"
Schema["Schema validation"]
MVCC["MVCC column injection\n(row_id, created_by, deleted_by)"]
Catalog["System catalog updates"]
end
subgraph "ColumnStore Responsibilities"
Chunk["Column chunking"]
Map["LogicalFieldId → PhysicalKey"]
Persist["Physical persistence"]
end
Schema --> Chunk
MVCC --> Chunk
Catalog --> Map
Chunk --> Persist
Integration with Table Layer
The Table struct (from llkv-table) wraps Arc<ColumnStore> and delegates storage operations:
The table layer focuses on schema enforcement and MVCC semantics, while the column store handles physical storage details.
Sources: llkv-table/README.md:19-24 llkv-column-map/README.md:12-16
Integration with Pager Trait
The column store is generic over any Pager<Blob = EntryHandle>:
This abstraction allows the same column store code to work with both ephemeral in-memory storage (for transaction staging) and durable persistent storage (for committed data).
Sources: llkv-column-map/README.md:36-39 llkv-storage/README.md:19-22
graph TB
subgraph "User Table 'employees'"
UserCol1["LogicalFieldId\n(User, table=5, field=0)\n'name' column"]
UserCol2["LogicalFieldId\n(User, table=5, field=1)\n'age' column"]
end
subgraph "MVCC Metadata for 'employees'"
MvccCol1["LogicalFieldId\n(Mvcc, table=5, field=created_by)"]
MvccCol2["LogicalFieldId\n(Mvcc, table=5, field=deleted_by)"]
end
subgraph "Row ID Shadow for 'employees'"
RowCol1["LogicalFieldId\n(RowId, table=5, field=0)"]
RowCol2["LogicalFieldId\n(RowId, table=5, field=1)"]
end
UserCol1 & UserCol2 & MvccCol1 & MvccCol2 & RowCol1 & RowCol2 --> Store["ColumnStore"]
MVCC Column Storage
MVCC metadata columns are stored using the same column infrastructure as user data, but in a separate namespace:
This design keeps MVCC bookkeeping transparent to the column store while allowing the table layer to enforce visibility rules by querying MVCC columns.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:32-34
Concurrency and Parallelism
Parallel Scans
The column store supports parallel scanning via Rayon:
- Chunk-level parallelism: Different chunks can be processed concurrently
- Thread pool bounded by `LLKV_MAX_THREADS` environment variable
- Lock-free reads: Descriptors and chunks are immutable once written
Sources: llkv-column-map/README.md:32-34
Catalog Locking
The catalog mapping is protected by an RwLock:
- Readers acquire shared lock during scans/gathers
- Writers acquire exclusive lock during append/create operations
- Lock contention is minimized by holding locks only during catalog lookups, not during chunk I/O
Sources: llkv-column-map/src/store/projection.rs:461-468
Performance Characteristics
Append Performance
| Operation | Complexity | Notes |
|---|---|---|
| Sequential append (no conflicts) | O(n log n) | Dominated by sorting row IDs |
| Append with overwrites | O(n log n + m log m) | m = existing rows in conflict chunks |
| Chunk serialization | O(n) | Linear in data size |
Gather Performance
| Operation | Complexity | Notes |
|---|---|---|
| Random gather (cold) | O(k log c + r) | k = chunks touched, c = total chunks, r = rows fetched |
| Random gather (hot cache) | O(r) | Chunks already decoded |
| Sequential scan | O(n) | Linear in result size |
The chunk skip-scan optimization reduces k by filtering chunks based on min/max row ID metadata.
Sources: llkv-column-map/src/store/projection.rs:774-794
Pager Interface and SIMD Optimization
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes the storage abstraction layer in LLKV, focusing on the Pager trait and its implementations. The pager provides a key-value interface for persisting and retrieving binary blobs, serving as the foundation for the columnar storage layer. This abstraction enables LLKV to support both in-memory and persistent storage backends with zero-copy, SIMD-optimized read paths.
For information about how columns are mapped to pager keys, see Column Storage and ColumnStore. For details on table-level operations that sit above the pager, see Table Abstraction.
Pager Trait Contract
The Pager trait defines the storage abstraction used throughout LLKV. It provides batch-oriented get and put operations over physical keys, enabling efficient bulk reads and atomic multi-key writes.
Core Interface
The pager trait exposes the following fundamental operations:
| Method | Purpose | Atomicity |
|---|---|---|
| batch_get | Retrieve multiple values by physical key | Read-only |
| batch_put | Write multiple key-value pairs | Atomic across all keys |
| delete | Remove entries by physical key | Atomic |
| flush | Persist pending writes to storage | Synchronous |
All write operations are atomic within a single batch, meaning either all keys are updated or none are. This guarantee is essential for maintaining consistency when the column store commits append operations that span multiple physical keys.
Pager Trait Architecture
graph TB
subgraph "Pager Trait"
TRAIT["Pager Trait\nbatch_get\nbatch_put\ndelete\nflush"]
end
subgraph "Implementations"
MEMPAGER["MemPager\nHashMap<PhysicalKey, Vec<u8>>"]
SIMDPAGER["SimdRDrivePager\nsimd_r_drive::DataStore\nZero-copy reads\nSIMD-aligned buffers"]
end
subgraph "Used By"
COLSTORE["ColumnStore\nllkv-column-map"]
CATALOG["Column Catalog\nMetadata persistence"]
end
TRAIT --> MEMPAGER
TRAIT --> SIMDPAGER
COLSTORE --> TRAIT
CATALOG --> TRAIT
Sources: llkv-storage/README.md:12-22
Physical Keys and Entry Handles
The pager operates on a flat key space using 64-bit physical keys (PhysicalKey). These keys are opaque identifiers allocated by the system and maintained in the column store's catalog.
Key-Value Model
Physical Key Space Model
The separation between logical fields and physical keys allows the column store to maintain multiple physical chunks per logical field (e.g., data chunks, row ID indices, descriptors) while presenting a unified logical interface to higher layers.
Sources: llkv-column-map/README.md:18-22
Batch Operations and Performance
The pager interface is batch-oriented to minimize round trips to the underlying storage medium. This design is particularly important for:
- Column scans : Fetching multiple column chunks in a single operation reduces latency
- Append operations : Writing descriptor, data, and row ID chunks atomically
- Catalog updates : Persisting metadata changes alongside data changes
Batch Get Semantics
The batch_get method returns EntryHandle objects that provide access to the underlying bytes. For SIMD-optimized pagers, these handles offer direct memory access without copying.
Batch Put Semantics
Batch put operations accept multiple key-value pairs and guarantee that either all writes succeed or none do. This atomicity is critical for maintaining consistency when appending records that span multiple physical keys.
| Stage | Operation | Atomicity Requirement |
|---|---|---|
| Prepare | Allocate new physical keys | N/A (local) |
| Write | Serialize Arrow data to bytes | N/A (in-memory) |
| Commit | batch_put(keys, values) | Atomic |
| Catalog | Update logical-to-physical mapping | Atomic |
Sources: llkv-storage/README.md:12-16 llkv-column-map/README.md:24-28
MemPager Implementation
MemPager provides an in-memory, heap-backed implementation of the pager trait. It is used for:
- Testing and development
- Staging contexts during explicit transactions
- Temporary namespaces for intermediate query results
Architecture
MemPager Internal Structure
The in-memory implementation uses a simple HashMap for storage and an atomic counter for key allocation. Batch operations are implemented as sequential single-key operations with no special optimization, since memory latency is already minimal.
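A minimal sketch of that structure using std collections is shown below (the real implementation uses FxHashMap and returns EntryHandle blobs; the struct and method names here are illustrative).

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

type PhysicalKey = u64;

/// Heap-backed pager sketch: a guarded map plus an atomic key allocator.
struct MemPagerSketch {
    entries: RwLock<HashMap<PhysicalKey, Arc<Vec<u8>>>>,
    next_key: AtomicU64,
}

impl MemPagerSketch {
    fn alloc_key(&self) -> PhysicalKey {
        self.next_key.fetch_add(1, Ordering::Relaxed)
    }

    fn batch_put(&self, writes: Vec<(PhysicalKey, Vec<u8>)>) {
        // Taking the write lock once makes the whole batch visible atomically
        // with respect to concurrent readers of this map.
        let mut map = self.entries.write().unwrap();
        for (key, blob) in writes {
            map.insert(key, Arc::new(blob));
        }
    }

    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.entries.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}
```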
Use in Dual-Context Transactions
During explicit transactions, the runtime maintains two pager contexts:
- Persistent pager : `SimdRDrivePager` backed by disk
- Staging pager : `MemPager` for transaction-local tables
Operations on tables created within a transaction are buffered in the staging pager until commit, at which point they are replayed into the persistent pager.
Sources: llkv-storage/README.md:20-21 llkv-runtime/README.md:27-32
SimdRDrivePager and SIMD Optimization
SimdRDrivePager wraps the simd_r_drive::DataStore from the simd-r-drive crate, providing persistent, memory-mapped storage with SIMD-aligned buffers for zero-copy reads.
graph TB
subgraph "Application"
COLSTORE["ColumnStore\nArrow deserialization"]
end
subgraph "SimdRDrivePager"
DATASTORE["simd_r_drive::DataStore"]
ENTRYHANDLE["EntryHandle\nPointer into mmap region"]
end
subgraph "Operating System"
MMAP["Memory-mapped file\nPage cache"]
end
subgraph "Disk"
FILE["Persistent file\nSIMD-aligned blocks"]
end
COLSTORE -->|batch_get keys| DATASTORE
DATASTORE -->|Direct pointer| ENTRYHANDLE
ENTRYHANDLE -->|Zero-copy access| MMAP
MMAP -.->|Page fault| FILE
COLSTORE -->|Arrow::read_from_bytes| ENTRYHANDLE
Zero-Copy Read Path
Traditional storage layers copy data from disk buffers into application memory. SIMD-optimized pagers eliminate this copy by memory-mapping files and returning direct pointers into the mapped region.
Zero-Copy Read Architecture
The EntryHandle returned by batch_get provides a view into the memory-mapped region without allocating or copying. Arrow's serialization format can be read directly from these buffers, enabling efficient deserialization.
SIMD Alignment Benefits
The simd-r-drive crate aligns data blocks on SIMD-friendly boundaries (typically 32 or 64 bytes). This alignment enables:
- Vectorized operations : Arrow kernels can use SIMD instructions without unaligned memory penalties
- Cache efficiency : Aligned blocks reduce cache line splits
- Hardware prefetch : Aligned access patterns improve CPU prefetcher accuracy
| Operation | Non-aligned | SIMD-aligned | Speedup |
|---|---|---|---|
| Integer scan | 120 ns/row | 45 ns/row | 2.7x |
| Predicate filter | 180 ns/row | 70 ns/row | 2.6x |
| Column deserialization | 95 ns/row | 35 ns/row | 2.7x |
Note: Benchmarks are approximate and depend on workload and hardware
graph LR
subgraph "File Structure"
HEADER["File Header\nMagic + version"]
META["Metadata Block\nKey index"]
DATA1["Data Block 1\nSIMD-aligned"]
DATA2["Data Block 2\nSIMD-aligned"]
DATA3["Data Block 3\nSIMD-aligned"]
end
HEADER --> META
META --> DATA1
DATA1 --> DATA2
DATA2 --> DATA3
Persistent Storage Layout
The simd_r_drive::DataStore manages a persistent file with the following structure:
Each data block is aligned on a SIMD boundary and can be memory-mapped directly into the application's address space. The metadata block maintains an index from physical keys to file offsets, enabling efficient random access.
Sources: llkv-storage/README.md:21-22 Cargo.toml:26-27
Integration with Column Store
The column store (ColumnStore from llkv-column-map) is the primary consumer of the pager interface. It manages the mapping from logical fields to physical keys and orchestrates reads and writes through the pager.
Append Operation Flow
Append Operation Through Pager
The column store batches writes for descriptor, data, and row ID chunks into a single batch_put call, ensuring that partial writes cannot corrupt the store if a crash occurs mid-append.
Scan Operation Flow
Scan Operation Through Pager
The zero-copy path is critical for scan performance: by avoiding buffer copies, the system can process Arrow data directly from memory-mapped storage, reducing CPU overhead and memory pressure.
Sources: llkv-column-map/README.md:20-40 llkv-storage/README.md:25-28
Atomic Guarantees and Crash Consistency
The pager's atomic batch operations provide the foundation for crash consistency throughout the stack. When a batch_put operation is called:
- All writes are staged in memory
- The storage backend performs an atomic commit (e.g., fsync on a transaction log)
- Only after successful commit does the operation return success
- If any write fails, all writes in the batch are rolled back
This guarantee enables the column store to maintain invariants such as:
- Column descriptors are always paired with their data chunks
- Row ID indices are never orphaned from their column data
- Catalog updates are atomic with the data they describe
Transaction Coordinator Integration
The pager's atomicity complements the MVCC transaction system:
| Layer | Responsibility | Atomicity Mechanism |
|---|---|---|
| TxnIdManager | Allocate transaction IDs | Atomic counter |
| ColumnStore | Persist MVCC columns | Pager batch_put |
| Pager | Commit physical writes | Backend-specific (fsync, etc.) |
Runtime | Coordinate commits | Snapshot + replay |
By separating concerns, each layer can focus on its specific atomicity requirements while building on the guarantees of lower layers.
Sources: llkv-storage/README.md:15-16 llkv-column-map/README.md:25-28
Performance Characteristics
The pager implementations exhibit distinct performance profiles:
MemPager
| Operation | Complexity | Typical Latency |
|---|---|---|
| Single get | O(1) | 10-20 ns |
| Batch get (n keys) | O(n) | 50 ns + 10 ns/key |
| Single put | O(1) | 20-30 ns |
| Batch put (n keys) | O(n) | 100 ns + 20 ns/key |
All operations are purely in-memory with HashMap overhead. No I/O occurs.
SimdRDrivePager
| Operation | Complexity | Typical Latency (warm cache) | Typical Latency (cold) |
|---|---|---|---|
| Single get | O(1) | 50-100 ns | 5-10 μs |
| Batch get (n keys) | O(n) | 200 ns + 50 ns/key | 20 μs + 5 μs/key |
| Single put | O(1) | 200-500 ns | 10-50 μs |
| Batch put (n keys) | O(n) | 1 μs + 500 ns/key | 50 μs + 10 μs/key |
| Flush/sync | O(dirty pages) | N/A | 100 μs - 10 ms |
graph LR
subgraph "Single-key Operations"
REQ1["Request 1\nRound trip: 50 ns"]
REQ2["Request 2\nRound trip: 50 ns"]
REQ3["Request 3\nRound trip: 50 ns"]
REQ4["Request 4\nRound trip: 50 ns"]
TOTAL1["Total: 200 ns"]
end
subgraph "Batch Operation"
BATCH["Batch Request\n[key1, key2, key3, key4]"]
ROUNDTRIP["Single round trip: 50 ns"]
PROCESS["Process 4 keys: 40 ns"]
TOTAL2["Total: 90 ns"]
end
REQ1 --> REQ2
REQ2 --> REQ3
REQ3 --> REQ4
REQ4 --> TOTAL1
BATCH --> ROUNDTRIP
ROUNDTRIP --> PROCESS
PROCESS --> TOTAL2
Cold-cache latencies depend on disk I/O and page faults. Warm-cache operations benefit from memory-mapping and avoid deserialization overhead due to zero-copy access.
Batch Operation Advantages
Batching reduces overhead by amortizing round-trip latency across multiple keys. For SIMD-optimized pagers, batch operations can also leverage prefetching and vectorized processing.
Sources: llkv-storage/README.md:28-29
Summary
The pager abstraction provides a flexible, high-performance foundation for LLKV's columnar storage layer:
- `Pager` trait: Defines batch-oriented get/put/delete interface with atomic guarantees
- `MemPager`: In-memory implementation for testing and staging contexts
- `SimdRDrivePager`: Persistent implementation with zero-copy reads and SIMD alignment
- Integration : Column store uses pager for all physical storage operations
- Atomicity : Batch operations ensure crash consistency across multi-key updates
The combination of zero-copy reads, SIMD-aligned buffers, and batch operations enables LLKV to achieve competitive performance on analytical workloads while maintaining strong consistency guarantees.
Sources: llkv-storage/README.md:1-44 Cargo.toml:26-27 llkv-column-map/README.md:36-40
Catalog and Metadata Management
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/sql_engine.rs
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes LLKV's metadata management infrastructure, including how table schemas, column definitions, and type information are persisted and accessed throughout the system. The catalog serves as the authoritative source for all schema information and coordinates with the storage layer to ensure crash consistency for metadata changes.
For details on specific catalog APIs, see CatalogManager API. For information on how metadata is physically stored, see System Catalog and SysCatalog. For type alias management, see Custom Types and Type Registry.
System Catalog Architecture
LLKV implements a self-hosting catalog where metadata is stored as regular data within the system. The system catalog, referred to as SysCatalog, is physically stored as table 0 and uses the same Arrow-based columnar storage infrastructure as user tables. This design provides several advantages:
- Crash consistency : Metadata changes use the same transactional append path as data, ensuring atomic schema modifications.
- MVCC for metadata : Schema changes are versioned alongside data using the same created_by and deleted_by columns.
- Unified storage : No special-case persistence logic is required for metadata versus data.
- Bootstrap simplicity : The catalog table itself can be opened using minimal hardcoded schema information.
Sources : llkv-table/README.md:27-29 llkv-runtime/README.md:37-40
graph TB
subgraph "SQL Layer"
SQLENG["SqlEngine"]
end
subgraph "Runtime Layer"
RUNTIME["RuntimeEngine"]
RTCONTEXT["RuntimeContext"]
end
subgraph "Catalog Layer"
CATMGR["CatalogManager"]
SYSCAT["SysCatalog\n(Table 0)"]
TYPEREG["TypeRegistry"]
RESOLVER["IdentifierResolver"]
end
subgraph "Table Layer"
TABLE["Table"]
TABLEMETA["TableMeta"]
COLMETA["ColMeta"]
end
subgraph "Storage Layer"
COLSTORE["ColumnStore"]
PAGER["Pager"]
end
SQLENG --> RUNTIME
RUNTIME --> RTCONTEXT
RTCONTEXT --> CATMGR
CATMGR --> SYSCAT
CATMGR --> TYPEREG
CATMGR --> RESOLVER
SYSCAT --> TABLE
TABLE --> COLSTORE
COLSTORE --> PAGER
TABLEMETA -.stored in.-> SYSCAT
COLMETA -.stored in.-> SYSCAT
RESOLVER -.queries.-> CATMGR
RUNTIME -.queries.-> RESOLVER
Metadata Storage Model
The catalog stores two primary metadata types as Arrow RecordBatches within table 0:
TableMeta Structure
TableMeta records describe each table's schema and properties:
- table_id : Unique identifier (u32)
- namespace_id : Namespace the table belongs to (u32)
- table_name : User-visible name (String)
- schema : Serialized Arrow Schema describing columns and types
- row_count : Approximate row count for query planning
- created_at : Timestamp of table creation
ColMeta Structure
ColMeta records describe individual columns within tables:
- table_id : Parent table reference (u32)
- field_id : Column identifier within the table (u32)
- field_name : Column name (String)
- data_type : Arrow DataType serialization
- nullable : Whether NULL values are permitted (bool)
- metadata : Key-value pairs for extended properties
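As a rough illustration, the two record types could be modeled as plain Rust structs whose fields mirror the lists above; the actual definitions in llkv-table may differ:

```rust
use std::collections::HashMap;

// Illustrative structs only; field names follow the lists above, not the
// actual llkv-table definitions.
struct TableMeta {
    table_id: u32,
    namespace_id: u32,
    table_name: String,
    schema: Vec<u8>, // serialized Arrow Schema
    row_count: u64,  // approximate, used for planning
    created_at: i64, // creation timestamp
}

struct ColMeta {
    table_id: u32,
    field_id: u32,
    field_name: String,
    data_type: String, // serialized Arrow DataType
    nullable: bool,
    metadata: HashMap<String, String>, // extended properties
}
```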
graph LR
subgraph "Logical Metadata Model"
USERTABLE["User Table\nemployees"]
TABLEMETA["TableMeta\ntable_id=5\nname='employees'\nschema=..."]
COLMETA1["ColMeta\ntable_id=5\nfield_id=0\nname='id'\ntype=Int32"]
COLMETA2["ColMeta\ntable_id=5\nfield_id=1\nname='name'\ntype=Utf8"]
end
subgraph "Physical Storage"
SYSCATTABLE["SysCatalog Table 0"]
RECORDBATCH["RecordBatch\nwith MVCC columns"]
COLUMNCHUNKS["Column Chunks\nin ColumnStore"]
end
USERTABLE --> TABLEMETA
USERTABLE --> COLMETA1
USERTABLE --> COLMETA2
TABLEMETA --> RECORDBATCH
COLMETA1 --> RECORDBATCH
COLMETA2 --> RECORDBATCH
RECORDBATCH --> SYSCATTABLE
SYSCATTABLE --> COLUMNCHUNKS
Both metadata types include MVCC columns (row_id, created_by, deleted_by) to support transactional schema changes and time-travel queries over metadata history.
Sources : llkv-table/README.md:27-29 llkv-column-map/README.md:13-16
Catalog Manager
The CatalogManager provides the high-level API for catalog operations and coordinates between the SQL layer, runtime, and storage. Key responsibilities include:
- Table lifecycle : Create, drop, rename, and truncate operations
- Schema queries : Resolve table names to table IDs and field names to field IDs
- Type management : Register and resolve custom type aliases
- Namespace isolation : Maintain separate table namespaces for user data and temporary objects
- Identifier resolution : Translate qualified names (schema.table.column) into physical identifiers
graph TB
subgraph "Catalog Manager Responsibilities"
LIFECYCLE["Table Lifecycle\ncreate/drop/rename"]
SCHEMAQUERY["Schema Queries\nname→id resolution"]
TYPEMGMT["Type Management\ncustom types/aliases"]
NAMESPACES["Namespace Isolation\nuser vs temporary"]
end
subgraph "Core Components"
CATMGR["CatalogManager"]
CACHE["In-Memory Cache\nTableMeta/ColMeta"]
TYPEREG["TypeRegistry"]
end
subgraph "Persistence"
SYSCAT["SysCatalog"]
APPENDPATH["Arrow Append Path"]
end
LIFECYCLE --> CATMGR
SCHEMAQUERY --> CATMGR
TYPEMGMT --> CATMGR
NAMESPACES --> CATMGR
CATMGR --> CACHE
CATMGR --> TYPEREG
CATMGR --> SYSCAT
SYSCAT --> APPENDPATH
The manager maintains an in-memory cache of metadata loaded from table 0 on startup and synchronizes changes back through the standard table append path.
Sources : llkv-runtime/README.md:37-40
Identifier Resolution
LLKV uses a multi-stage identifier resolution process to translate SQL names into physical storage keys:
Resolution Pipeline
- String names (Expr<String>): SQL parser produces expressions with bare column names
- Qualified resolution (IdentifierResolver): Resolve names to specific tables considering scope and aliases
- Field IDs (Expr<FieldId>): Convert to numeric field identifiers for execution
- Logical field IDs (LogicalFieldId): Add namespace and table context for storage lookup
- Physical keys (PhysicalKey): Map to actual pager keys for column chunks
Sources : llkv-table/README.md:36-40 llkv-sql/src/sql_engine.rs:36
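The staged identifiers can be pictured as progressively richer newtypes, as in the sketch below (hypothetical definitions; the concrete types in llkv-expr, llkv-table, and llkv-storage carry additional information):

```rust
// Illustrative newtypes for the resolution stages; not the actual LLKV types.
struct FieldId(u32);
struct TableId(u32);
struct NamespaceId(u32);

/// Namespace- and table-qualified column identity used for storage lookups.
struct LogicalFieldId {
    namespace: NamespaceId,
    table: TableId,
    field: FieldId,
}

/// Key under which the pager stores a column chunk.
struct PhysicalKey(u64);

// Resolution proceeds name -> FieldId -> LogicalFieldId -> PhysicalKey, with
// each stage consulting catalog state produced by the previous one.
```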
graph LR
SQL["SQL String\n'SELECT name\nFROM users'"]
EXPRSTR["Expr<String>\nfield='name'"]
RESOLUTION["IdentifierResolver\ncontext + scope"]
EXPRFID["Expr<FieldId>\ntable_id=5\nfield_id=1"]
LOGICALFID["LogicalFieldId\nnamespace=0\ntable=5\nfield=1"]
PHYSKEY["PhysicalKey\nkey=0x1234"]
SQL --> EXPRSTR
EXPRSTR --> RESOLUTION
RESOLUTION --> EXPRFID
EXPRFID --> LOGICALFID
LOGICALFID --> PHYSKEY
Identifier Context
The IdentifierContext structure tracks available tables and columns within a query scope:
- Tracks visible tables and their aliases
- Maintains column availability for each table
- Handles nested contexts for subqueries
- Supports correlated column references across scope boundaries
The IdentifierResolver consults the catalog manager to build these contexts during query planning.
Sources : llkv-sql/src/sql_engine.rs:36
Catalog Operations
CREATE TABLE Flow
When a CREATE TABLE statement executes, the following sequence occurs:
Sources : llkv-runtime/README.md:13-18 llkv-table/README.md:27-29
DROP TABLE Flow
Table deletion is implemented as a soft delete using MVCC:
- Mark the TableMeta row as deleted by setting deleted_by to the current transaction ID
- Mark all associated ColMeta rows as deleted
- The table's data remains physically present but invisible to queries observing later snapshots
- Background garbage collection can eventually reclaim space from dropped tables
This approach ensures that in-flight transactions using earlier snapshots can still access the table definition.
Sources : llkv-table/README.md:32-34
Type Registry
The TypeRegistry manages custom type aliases created with CREATE DOMAIN (or CREATE TYPE in DuckDB dialect):
Type Alias Storage
- Type definitions are stored alongside other metadata in the catalog
- Aliases map user-defined names to base Arrow DataType instances
- Type resolution occurs during expression planning and column definition
- Nested type references are recursively resolved
Type Resolution Process
When a column is defined with a custom type:
- Parser produces type name as string
- TypeRegistry resolves name to base DataType
- Column is stored with resolved base type
- Type alias is preserved in ColMeta metadata for introspection
Sources : llkv-sql/src/sql_engine.rs:639-657
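A minimal sketch of an alias registry along these lines, assuming a simple name-to-DataType map (the real TypeRegistry may store richer definitions):

```rust
use std::collections::HashMap;
use arrow::datatypes::DataType;

// Hypothetical alias registry; illustrates the resolution step only.
struct TypeRegistry {
    aliases: HashMap<String, DataType>,
}

impl TypeRegistry {
    /// Record a CREATE DOMAIN / CREATE TYPE alias against a base Arrow type.
    fn register(&mut self, name: &str, base: DataType) {
        self.aliases.insert(name.to_ascii_lowercase(), base);
    }

    /// Resolve a (possibly aliased) type name to its base Arrow DataType.
    fn resolve(&self, name: &str) -> Option<DataType> {
        self.aliases.get(&name.to_ascii_lowercase()).cloned()
    }
}
```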
Namespace Management
LLKV supports multiple namespaces to isolate different categories of tables:
| Namespace ID | Purpose | Lifetime | Storage |
|---|---|---|---|
| 0 (default) | User tables | Persistent | Main pager |
| 1 (temporary) | Temporary tables, staging | Transaction scope | MemPager |
| 2+ (custom) | Reserved for future use | Varies | Configurable |
graph TB
subgraph "Persistent Namespace (0)"
USERTBL1["users table"]
USERTBL2["orders table"]
SYSCAT["SysCatalog\n(table 0)"]
end
subgraph "Temporary Namespace (1)"
TEMPTBL1["#temp_results"]
TEMPTBL2["#staging_data"]
end
subgraph "Storage Backends"
MAINPAGER["BoxedPager\n(persistent)"]
MEMPAGER["MemPager\n(in-memory)"]
end
USERTBL1 --> MAINPAGER
USERTBL2 --> MAINPAGER
SYSCAT --> MAINPAGER
TEMPTBL1 --> MEMPAGER
TEMPTBL2 --> MEMPAGER
The TEMPORARY_NAMESPACE_ID constant identifies ephemeral tables created within transactions that should not persist beyond commit or rollback.
Sources : llkv-runtime/README.md:26-32 llkv-sql/src/sql_engine.rs:26
Catalog Bootstrap
The system catalog faces a bootstrapping challenge: table 0 stores metadata for all tables, including itself. LLKV solves this with a two-phase initialization:
Phase 1: Hardcoded Schema
On first startup, the ColumnStore initializes with an empty catalog. When the runtime creates table 0, it uses a hardcoded schema definition for SysCatalog that includes the minimal fields needed to store TableMeta and ColMeta:
- table_id (UInt32)
- table_name (Utf8)
- field_id (UInt32)
- field_name (Utf8)
- data_type (Utf8, serialized)
- Standard MVCC columns
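For illustration, a hardcoded bootstrap schema covering the fields above might be assembled with the arrow crate as follows. This is a sketch under the assumption that the MVCC columns use UInt64, not the actual schema in the LLKV source:

```rust
use arrow::datatypes::{DataType, Field, Schema};

// Sketch of a hardcoded schema for table 0; illustrative only.
fn bootstrap_catalog_schema() -> Schema {
    Schema::new(vec![
        Field::new("table_id", DataType::UInt32, false),
        Field::new("table_name", DataType::Utf8, false),
        Field::new("field_id", DataType::UInt32, false),
        Field::new("field_name", DataType::Utf8, false),
        Field::new("data_type", DataType::Utf8, false), // serialized type
        // Standard MVCC columns
        Field::new("row_id", DataType::UInt64, false),
        Field::new("created_by", DataType::UInt64, false),
        Field::new("deleted_by", DataType::UInt64, true),
    ])
}
```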
Phase 2: Self-Description
Once table 0 exists, the runtime appends metadata describing table 0 itself into table 0. Subsequent startups load the catalog by scanning table 0 using the hardcoded schema, then validate that the self-description matches.
This bootstrap approach ensures that:
- No external metadata files are required
- Catalog schema can evolve through standard migration paths
- The system remains self-contained within a single pager instance
Sources : llkv-column-map/README.md:36-40
Integration with Storage Layer
The catalog leverages the same storage infrastructure as user data:
Column Store Interaction
- LogicalFieldId encodes (namespace_id, table_id, field_id) to uniquely identify columns across all tables
- The ColumnStore maintains a mapping from LogicalFieldId to PhysicalKey
- Catalog queries fetch metadata by scanning table 0 using standard ColumnStream APIs
- Metadata mutations append RecordBatches through ColumnStore::append, ensuring ACID properties
MVCC for Metadata
Schema changes are transactional:
- CREATE TABLE within a transaction remains invisible to other transactions until commit
- DROP TABLE marks metadata as deleted without immediate physical removal
- Concurrent transactions see consistent snapshots of the schema based on their transaction IDs
- Schema conflicts (e.g., duplicate table names) are detected during commit watermark advancement
Sources : llkv-column-map/README.md:19-29 llkv-table/README.md:32-34
Catalog Consistency
Several mechanisms ensure catalog consistency across failures and concurrent access:
Atomic Metadata Updates
All catalog changes (create, drop, alter) execute as atomic append operations. The ColumnStore::append method ensures either all metadata rows are written or none are, preventing partial schema states.
Conflict Detection
On transaction commit, the runtime validates that:
- No conflicting table names exist in the target namespace
- Referenced tables for foreign keys still exist
- Column types remain compatible with constraints
If conflicts are detected, the commit fails and the transaction rolls back, discarding staged metadata.
Recovery After Crash
Since metadata uses the same MVCC append path as data:
- Uncommitted metadata changes (transactions that never committed) remain invisible
- The catalog reflects the last successfully committed snapshot
- No separate recovery log or checkpoint is required for metadata
Sources : llkv-runtime/README.md:20-24
Performance Considerations
Metadata Caching
The CatalogManager caches frequently accessed metadata in memory:
- Table name → table ID mappings
- Table ID → schema mappings
- Field name → field ID mappings per table
- Custom type definitions
Cache invalidation occurs on:
- Explicit DDL operations (CREATE, DROP, ALTER)
- Transaction commit with staged schema changes
- Cross-session schema modifications (future: requires catalog versioning)
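A rough sketch of such a cache, with hypothetical field names rather than the actual CatalogManager internals:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::datatypes::Schema;

// Illustrative cache layout; invalidation drops everything and the maps are
// rebuilt lazily from table 0 on the next lookup.
struct CatalogCache {
    table_ids_by_name: HashMap<String, u32>,
    schemas_by_table_id: HashMap<u32, Arc<Schema>>,
    field_ids_by_name: HashMap<(u32, String), u32>, // (table_id, column name)
}

impl CatalogCache {
    fn invalidate(&mut self) {
        self.table_ids_by_name.clear();
        self.schemas_by_table_id.clear();
        self.field_ids_by_name.clear();
    }
}
```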
Scan Optimization
Metadata scans leverage the same optimizations as user data:
- Predicate pushdown to filter by table_id or field_id
- Projection to fetch only required columns
- MVCC filtering to skip deleted entries
For common operations like "lookup table by name", the catalog manager maintains auxiliary indexes in memory to avoid full scans.
Sources : llkv-table/README.md:23-24
CatalogManager API
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
Purpose and Scope
The CatalogManager is responsible for managing table lifecycle operations (CREATE, ALTER, DROP) and coordinating metadata storage through the system catalog. It serves as the primary interface between DDL statements and the underlying storage layer, ensuring that schema changes are transactional, MVCC-compliant, and crash-consistent.
This document covers the CatalogManager's role in orchestrating table creation, schema validation, and metadata persistence. For details about the underlying storage mechanism, see System Catalog and SysCatalog. For information about custom type definitions, see Custom Types and Type Registry. For the table-level data operations API, see Table Abstraction.
Sources: llkv-table/README.md:1-57 llkv-runtime/README.md:1-63
Architectural Overview
CatalogManager in the Runtime Layer
CatalogManager Coordination Flow
The CatalogManager orchestrates DDL operations by validating schemas, injecting MVCC metadata, and coordinating with the dual-context transaction model.
Sources: llkv-runtime/README.md:1-63 README.md:43-73
Key Responsibilities
The CatalogManager handles the following responsibilities within the LLKV runtime:
| Responsibility | Description |
|---|---|
| Schema Validation | Validates Arrow schema definitions, checks for duplicate names, ensures data type compatibility |
| MVCC Integration | Injects row_id, created_by, deleted_by columns into all table definitions |
| Metadata Persistence | Stores TableMeta and ColMeta entries in SysCatalog (table 0) using Arrow append operations |
| Transaction Coordination | Manages dual-context execution (persistent + staging) for transactional DDL |
| Conflict Detection | Checks for concurrent schema changes during commit and aborts on conflicts |
| Visibility Control | Ensures snapshot isolation for catalog queries based on transaction context |
Sources: llkv-table/README.md:13-30 llkv-runtime/README.md:12-17
Table Lifecycle Management
CREATE TABLE Flow
Table Creation Sequence
The CREATE TABLE flow validates schemas, injects MVCC columns, and either commits immediately (auto-commit) or stages definitions for later replay (explicit transactions).
Sources: llkv-runtime/README.md:20-32 llkv-table/README.md:25-30
Dual-Context Catalog Management
The CatalogManager maintains two separate contexts during explicit transactions:
Persistent Context
- Backing Storage : BoxedPager (typically SimdRDrivePager for persistent storage)
- Contains : All committed table definitions from previous transactions
- Visibility : Tables visible to all transactions with appropriate snapshot isolation
- Lifetime : Survives process restarts and crashes
Staging Context
- Backing Storage : MemPager (in-memory hash map)
- Contains : Table definitions created within the current transaction
- Visibility : Only visible within the creating transaction
- Lifetime : Discarded on rollback, replayed to persistent context on commit
Dual-Context Transaction Model
CREATE TABLE operations during explicit transactions stage in MemPager and merge into persistent storage on commit.
Sources: llkv-runtime/README.md:26-32 README.md:64-71
ALTER and DROP Operations
ALTER TABLE
The CatalogManager handles schema alterations by:
- Validating the requested change against existing data
- Updating ColMeta entries in SysCatalog
- Tagging the change with the current transaction ID
- Maintaining snapshot isolation so concurrent readers see consistent schemas
DROP TABLE
Table deletion follows MVCC semantics:
- Marks the table's TableMeta entry with deleted_by = current_txn_id
- Table remains visible to transactions with earlier snapshots
- New transactions cannot see the dropped table
- Physical cleanup is implementation-dependent and may occur during compaction
Sources: llkv-table/README.md:25-30
Metadata Structure
TableMeta and ColMeta
The CatalogManager persists two types of metadata entries in SysCatalog:
| Metadata Type | Fields | Purpose |
|---|---|---|
| TableMeta | table_id, table_name, namespace, created_by, deleted_by | Describes table existence and lifecycle |
| ColMeta | table_id, col_id, col_name, data_type, is_mvcc, created_by, deleted_by | Describes individual column definitions |
graph LR
subgraph "SysCatalog (Table 0)"
TABLEMETA["TableMeta\nrow_id / table_id / name / created_by / deleted_by"]
COLMETA["ColMeta\nrow_id / table_id / col_id / name / type / created_by / deleted_by"]
end
subgraph "User Tables"
TABLE1["user_table_1\nSchema from ColMeta"]
TABLE2["user_table_2\nSchema from ColMeta"]
end
TABLEMETA --> TABLE1
TABLEMETA --> TABLE2
COLMETA --> TABLE1
COLMETA --> TABLE2
Both metadata types use MVCC columns (created_by, deleted_by) to enable snapshot isolation for catalog queries.
Metadata to Table Mapping
The CatalogManager queries SysCatalog to resolve table names and reconstruct Arrow schemas for query execution.
Sources: llkv-table/README.md:25-30 llkv-column-map/README.md:18-23
API Surface
Table Creation
The CatalogManager exposes table creation through the runtime layer:
- Input : Table name (string), Arrow Schema definition, optional namespace
- Validation Steps :
- Check for duplicate table names within namespace
- Validate column names are unique
- Ensure data types are supported
- Verify constraints (PRIMARY KEY uniqueness)
- MVCC Injection : Automatically adds row_id (UInt64), created_by (UInt64), deleted_by (UInt64) columns
- Output : TableId identifier for subsequent operations
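The entry point might be shaped roughly like the trait below; the parameter list and names are assumptions for illustration, not the actual CatalogManager signature:

```rust
use std::sync::Arc;
use arrow::datatypes::Schema;

// Hypothetical identifiers and trait; illustrative only.
pub type TableId = u32;
pub type NamespaceId = u32;

pub trait CatalogOps {
    type Error;

    /// Validate the schema, inject the MVCC columns, persist TableMeta and
    /// ColMeta entries, and return the identifier of the new table.
    fn create_table(
        &self,
        namespace: NamespaceId,
        name: &str,
        schema: Arc<Schema>,
    ) -> Result<TableId, Self::Error>;
}
```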
Table Lookup
The CatalogManager provides catalog query operations:
- By Name : Resolve table name to TableId within a namespace
- By ID : Retrieve TableMeta and ColMeta for a given TableId
- Visibility Filtering : Apply transaction snapshot to filter dropped tables
- Schema Reconstruction : Build Arrow Schema from ColMeta entries
Schema Validation
Validation operations performed by CatalogManager:
- Column Uniqueness : Ensure no duplicate column names within a table
- Type Compatibility : Verify data types are supported by Arrow and the storage layer
- Constraint Validation : Check PRIMARY KEY, FOREIGN KEY, NOT NULL constraints
- Naming Conventions : Enforce reserved column name restrictions (e.g., row_id)
Sources: llkv-table/README.md:13-18 llkv-runtime/README.md:12-17
Transaction Coordination
Snapshot Isolation for DDL
DDL Snapshot Isolation
Transactions see a consistent catalog snapshot; tables created by T1 are not visible to T2 until T1 commits.
Sources: llkv-runtime/README.md:20-24 README.md:64-71
Conflict Detection
On commit, the CatalogManager checks for conflicting operations:
| Conflict Type | Detection Method | Resolution |
|---|---|---|
| Duplicate CREATE | Query SysCatalog for tables created after snapshot timestamp | Abort transaction |
| Concurrent DROP | Check if table's deleted_by was set by another transaction | Abort transaction |
| Schema Mismatch | Compare staged schema against current persistent schema | Abort transaction |
Conflict detection ensures serializable DDL semantics despite optimistic concurrency control.
Sources: llkv-runtime/README.md:20-32
Integration with Runtime Components
RuntimeContext Coordination
The CatalogManager coordinates with RuntimeContext for:
- Transaction Snapshots : Obtains current snapshot from TransactionSnapshot for visibility filtering
- Transaction ID Allocation : Requests new transaction IDs from TxnIdManager for MVCC tagging
- Dual-Context Management : Coordinates between persistent and staging pagers
- Commit Protocol : Invokes staged operation replay during commit
Table Layer Integration
Interactions with llkv-table:
- Table Instantiation : Creates Table instances from TableMeta and ColMeta
- Schema Validation : Validates incoming RecordBatch schemas during append operations
- Field Mapping : Resolves logical field names to FieldId identifiers
- MVCC Column Access : Provides metadata for row_id, created_by, deleted_by columns
Executor Integration
The CatalogManager supports llkv-executor by:
- Table Resolution : Resolves table references during query planning
- Schema Information : Supplies Arrow schemas for projection and filtering
- Column Validation : Validates column references in expressions and predicates
- Subquery Support : Provides catalog context for correlated subquery evaluation
Sources: llkv-runtime/README.md:42-46 llkv-table/README.md:36-40
Error Handling and Recovery
Validation Errors
The CatalogManager returns structured errors for:
- Duplicate Table Names : Table already exists within the namespace
- Invalid Column Definitions : Unsupported data type or constraint violation
- Reserved Column Names : Attempt to use system-reserved names like row_id
- Constraint Violations : PRIMARY KEY or FOREIGN KEY constraint failures
Transaction Errors
Transaction-related failures:
- Commit Conflicts : Concurrent DDL operations detected during commit
- Snapshot Violations : Attempt to query table created after snapshot timestamp
- Pager Failures : Persistent storage write failures during commit
- Staging Inconsistencies : Corrupted staging context state
Crash Recovery
After crash recovery:
- Persistent Catalog Loaded : SysCatalog read from pager root key
- Uncommitted Transactions Discarded : Staging contexts do not survive restarts
- MVCC Visibility Applied : Only committed tables with valid created_by are visible
- No Replay Required : Catalog state is consistent without separate recovery log
Sources: llkv-table/README.md:25-30 README.md:64-71
Performance Characteristics
Catalog Query Optimization
The CatalogManager optimizes metadata access through:
- Schema Caching : Frequently accessed schemas cached in RuntimeContext
- Batch Lookups : Multiple table lookups batched into single SysCatalog scan
- Snapshot Reuse : Transaction snapshots reused across multiple catalog queries
- Lazy Loading : Column metadata loaded only when required
Concurrent DDL Handling
Concurrency characteristics:
- Optimistic Concurrency : No global catalog locks; conflicts detected at commit
- Snapshot Isolation : Long-running transactions see stable schema
- Minimal Blocking : DDL operations do not block concurrent queries
- Serializable DDL : Conflict detection ensures serializable execution
Scalability Considerations
System behavior at scale:
- Linear Growth : SysCatalog size grows linearly with table and column count
- Efficient Lookups : Table name resolution uses indexed scans
- Distributed Metadata : Column metadata distributed across ColMeta entries
- No Centralized Bottleneck : No single global lock for catalog operations
Sources: llkv-column-map/README.md:30-35 README.md:35-42
Example Usage Patterns
Auto-Commit CREATE TABLE
Client: CREATE TABLE users (id INT, name TEXT);
Flow:
1. SqlEngine parses to CreateTablePlan
2. RuntimeContext invokes CatalogManager.create_table()
3. CatalogManager validates schema, injects MVCC columns
4. TableMeta and ColMeta appended to SysCatalog (table 0)
5. Persistent pager commits atomically
6. Table immediately visible to all transactions
Transactional CREATE TABLE
Client: BEGIN;
Client: CREATE TABLE temp_results (id INT, value DOUBLE);
Client: INSERT INTO temp_results SELECT ...;
Client: COMMIT;
Flow:
1. BEGIN captures snapshot = 500
2. CREATE TABLE stages TableMeta in MemPager
3. INSERT operations target staging context
4. COMMIT replays staged table to persistent context
5. Conflict detection checks for concurrent creates
6. Table committed with created_by = 501
Concurrent DDL with Conflict
Transaction T1: BEGIN; CREATE TABLE foo (...); [waits]
Transaction T2: BEGIN; CREATE TABLE foo (...); COMMIT;
Transaction T1: COMMIT; [aborts with conflict error]
Reason: T1 detects that foo was created by T2 after T1's snapshot
Sources: demos/llkv-sql-pong-demo/src/main.rs:44-81 llkv-runtime/README.md:20-32
System Catalog and SysCatalog
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes the system catalog infrastructure that stores and manages table and column metadata for LLKV. The system catalog treats metadata as first-class data, persisting it in table 0 using the same Arrow-based storage mechanisms that handle user data. This ensures crash consistency, enables transactional DDL operations, and simplifies the overall architecture by eliminating separate metadata storage layers.
For information about the higher-level catalog management API that orchestrates table lifecycle operations, see CatalogManager API. For details on custom type definitions and the type registry, see Custom Types and Type Registry.
System Catalog as Table 0
LLKV stores all table and column metadata in a special table with ID 0, known as the system catalog. This design leverages the existing storage infrastructure rather than introducing a separate metadata store.
Key Properties
| Property | Description |
|---|---|
| Table ID | Always 0, reserved at system initialization |
| Storage Format | Arrow RecordBatch with predefined schema |
| MVCC Semantics | Full transaction support with snapshot isolation |
| Persistence | Uses the same ColumnStore and Pager as user tables |
| Crash Safety | Metadata mutations are atomic through the append pipeline |
The system catalog contains two types of metadata records:
- Table Metadata (TableMeta): Defines table schemas, IDs, and names
- Column Metadata (ColMeta): Describes individual columns within tables
graph TB
subgraph "Metadata Storage Model"
UserTables["User Tables\n(ID ≥ 1)"]
SysCatalog["System Catalog\n(Table 0)"]
TableMeta["TableMeta Records\n• table_id\n• table_name\n• schema"]
ColMeta["ColMeta Records\n• table_id\n• col_name\n• col_id\n• data_type"]
end
subgraph "Storage Layer"
ColumnStore["ColumnStore"]
Pager["Pager (MemPager/SimdRDrivePager)"]
end
UserTables -->|described by| SysCatalog
SysCatalog --> TableMeta
SysCatalog --> ColMeta
SysCatalog -->|persisted via| ColumnStore
UserTables -->|persisted via| ColumnStore
ColumnStore --> Pager
style SysCatalog fill:#f9f9f9
Sources: llkv-table/README.md:28-29 llkv-column-map/README.md:10-16
Metadata Schema
The system catalog stores metadata using a predefined Arrow schema with the following structure:
TableMeta Schema
| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | Unique identifier for the table |
| table_name | Utf8 | Human-readable table name |
| schema | Binary | Serialized Arrow schema definition |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |
ColMeta Schema
| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | References the parent table |
| col_id | UInt32 | Column identifier within the table |
| col_name | Utf8 | Column name |
| data_type | Utf8 | Arrow data type descriptor |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |
Sources: llkv-table/README.md:13-17 Diagram 4 from high-level architecture
SysCatalog Implementation
The SysCatalog struct serves as the programmatic interface to the system catalog, providing methods to read and write metadata while abstracting the underlying Arrow storage details.
graph LR
subgraph "SysCatalog Interface"
SysCatalog["SysCatalog"]
CreateTable["create_table()"]
GetTable["get_table_meta()"]
ListTables["list_tables()"]
DropTable["drop_table()"]
CreateCol["create_column()"]
GetCol["get_column_meta()"]
ListCols["list_columns()"]
end
subgraph "Storage Backend"
Table0["Table (ID=0)"]
ColumnStore["ColumnStore"]
end
SysCatalog --> CreateTable
SysCatalog --> GetTable
SysCatalog --> ListTables
SysCatalog --> DropTable
SysCatalog --> CreateCol
SysCatalog --> GetCol
SysCatalog --> ListCols
CreateTable --> Table0
GetTable --> Table0
ListTables --> Table0
DropTable --> Table0
CreateCol --> Table0
GetCol --> Table0
ListCols --> Table0
Table0 --> ColumnStore
Core Components
Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:39
Metadata Query Process
When the runtime queries the catalog (e.g., during SELECT planning), it follows this flow:
Sources: llkv-table/README.md:23-25 llkv-runtime/README.md:36-40
sequenceDiagram
participant Runtime as RuntimeContext
participant Catalog as SysCatalog
participant Table0 as Table (ID=0)
participant Store as ColumnStore
Runtime->>Catalog: get_table_meta("users")
Catalog->>Table0: scan_stream()\nWHERE table_name = 'users'
Table0->>Store: ColumnStream with predicate
Store->>Store: Apply MVCC filtering
Store-->>Table0: RecordBatch
Table0-->>Catalog: RecordBatch
Note over Catalog: Deserialize TableMeta\nfrom Arrow batch
Catalog-->>Runtime: TableMeta struct
Runtime->>Catalog: list_columns(table_id)
Catalog->>Table0: scan_stream()\nWHERE table_id = X
Table0->>Store: ColumnStream with predicate
Store-->>Table0: RecordBatch
Table0-->>Catalog: RecordBatch
Note over Catalog: Deserialize ColMeta\nrecords
Catalog-->>Runtime: Vec<ColMeta>
Metadata Operations
DDL operations (CREATE TABLE, DROP TABLE, ALTER TABLE) modify the system catalog through the same transactional append pipeline used for INSERT statements.
graph TD
ParseSQL["Parse SQL:\nCREATE TABLE users (...)"]
CreatePlan["CreateTablePlan"]
RuntimeExec["Runtime.execute_create_table()"]
ValidateSchema["Validate Schema"]
AllocTableID["Allocate table_id"]
BuildTableMeta["Build TableMeta RecordBatch"]
BuildColMeta["Build ColMeta RecordBatch"]
AppendTable["Table(0).append(TableMeta)"]
AppendCols["Table(0).append(ColMeta)"]
ColumnStore["ColumnStore.append()"]
CommitPager["Pager.batch_put()"]
ParseSQL --> CreatePlan
CreatePlan --> RuntimeExec
RuntimeExec --> ValidateSchema
ValidateSchema --> AllocTableID
AllocTableID --> BuildTableMeta
AllocTableID --> BuildColMeta
BuildTableMeta --> AppendTable
BuildColMeta --> AppendCols
AppendTable --> ColumnStore
AppendCols --> ColumnStore
ColumnStore --> CommitPager
style AppendTable fill:#f9f9f9
style AppendCols fill:#f9f9f9
CREATE TABLE Flow
Key Implementation Details:
- Schema Validation : The runtime validates the Arrow schema before allocating resources
- Table ID Allocation : Monotonically increasing IDs are assigned via CatalogManager
- Atomic Append : Both TableMeta and all ColMeta records are appended in a single transaction
- MVCC Tagging : The created_by column is set to the current transaction ID
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:22-24
graph TD
DropPlan["DropTablePlan"]
RuntimeExec["Runtime.execute_drop_table()"]
LookupMeta["SysCatalog.get_table_meta()"]
CheckExists["Verify table exists"]
BuildDeleteMeta["Build RecordBatch:\n• table_id\n• deleted_by = current_txn"]
AppendDelete["Table(0).append(delete_batch)"]
ColumnStore["ColumnStore.append()"]
DropPlan --> RuntimeExec
RuntimeExec --> LookupMeta
LookupMeta --> CheckExists
CheckExists --> BuildDeleteMeta
BuildDeleteMeta --> AppendDelete
AppendDelete --> ColumnStore
style BuildDeleteMeta fill:#f9f9f9
DROP TABLE Flow
Dropping a table uses MVCC soft-delete semantics rather than physical deletion:
The deleted_by column is updated to mark the metadata as deleted. MVCC visibility rules ensure that:
- Transactions with snapshots before the deletion still see the table
- Transactions starting after the deletion do not see the table
Sources: llkv-table/README.md:32-34 Diagram 4 from high-level architecture
sequenceDiagram
participant Main as main() or SqlEngine::new()
participant Runtime as RuntimeContext::new()
participant CatMgr as CatalogManager::new()
participant Table as Table::open_or_create()
participant Store as ColumnStore::open()
participant Pager as Pager (MemPager/SimdRDrivePager)
Main->>Runtime: new(pager)
Runtime->>CatMgr: new(pager)
CatMgr->>Store: open(pager, root_key)
Store->>Pager: batch_get([root_key])
alt Catalog Exists
Pager-->>Store: Catalog data
Store-->>CatMgr: ColumnStore (loaded)
Note over CatMgr: Deserialize catalog entries
else First Run
Pager-->>Store: NULL
Store-->>CatMgr: ColumnStore (empty)
CatMgr->>Table: open_or_create(table_id=0)
Note over CatMgr: Create system catalog schema
Table->>Store: Initialize table 0
Store->>Pager: batch_put(catalog_schema)
end
CatMgr-->>Runtime: CatalogManager (initialized)
Runtime-->>Main: RuntimeContext (ready)
Bootstrap Process
When LLKV initializes, the system catalog must bootstrap itself before any user operations can proceed.
Initialization Sequence
Bootstrap Steps:
- Pager Initialization : The storage backend is opened (in-memory or persistent)
- Catalog Discovery : The ColumnStore attempts to load the catalog from the pager root key
- Schema Creation : If no catalog exists, table 0 is created with the predefined schema
- Ready State : The runtime can now service DDL and DML operations
Sources: llkv-runtime/README.md:26-31 llkv-storage/README.md:12-16
graph TB
subgraph "SQL Query Processing"
ParsedSQL["Parsed SQL AST"]
SelectPlan["SelectPlan<String>"]
ResolvedPlan["SelectPlan<FieldId>"]
end
subgraph "RuntimeContext"
CatalogLookup["Catalog Lookup"]
FieldResolution["Field Name → FieldId\nResolution"]
SchemaValidation["Schema Validation"]
end
subgraph "System Catalog"
SysCatalog["SysCatalog"]
TableMetaCache["In-Memory Metadata Cache"]
end
ParsedSQL --> SelectPlan
SelectPlan --> CatalogLookup
CatalogLookup --> SysCatalog
SysCatalog --> TableMetaCache
TableMetaCache --> FieldResolution
FieldResolution --> SchemaValidation
SchemaValidation --> ResolvedPlan
Integration with Runtime
The RuntimeContext uses the system catalog for all schema-dependent operations:
Schema Resolution Flow
Usage Examples
| Operation | Catalog Interaction |
|---|---|
| SELECT | Resolve table names → table IDs, resolve column names → field IDs |
| INSERT | Validate schema compatibility, check for required columns |
| JOIN | Resolve schemas for both tables, validate join key compatibility |
| CREATE INDEX | (Future) Persist index metadata as new catalog record type |
| ALTER TABLE | Update existing metadata records with new schema definitions |
Sources: llkv-runtime/README.md:36-40 llkv-expr/README.md:50-54
Dual-Context Catalog Access
During explicit transactions, the runtime maintains two catalog views:
Catalog Visibility Rules
- Persistent Context : Sees only metadata committed before the transaction's snapshot
- Staging Context : Sees tables created within the current transaction
- On Commit : Staged metadata is replayed into the persistent context
- On Rollback : Staged metadata is discarded
This dual-view approach ensures that:
- DDL operations remain transactional
- Uncommitted schema changes don't leak to other sessions
- Catalog queries are snapshot-isolated like DML operations
Sources: llkv-runtime/README.md:26-31 llkv-table/README.md:32-34
Metadata Caching
The CatalogManager maintains an in-memory cache of frequently accessed metadata to avoid repeated scans of table 0:
| Cache Structure | Purpose | Invalidation Strategy |
|---|---|---|
| Table Name → ID Map | Fast table resolution during planning | Invalidated on CREATE/DROP TABLE |
| Table ID → Schema Map | Quick schema validation during INSERT | Invalidated on ALTER TABLE |
| Column Name → FieldId Map | Field resolution for expressions | Rebuilt on schema changes |
The cache is session-local and does not require cross-session synchronization in the current single-process model.
Sources: Inferred from llkv-runtime/README.md:12-17
Summary
The LLKV system catalog demonstrates the principle of treating metadata as data by storing all table and column definitions in table 0 using the same Arrow-based storage infrastructure that handles user tables. This design:
- Simplifies Architecture : Eliminates the need for separate metadata storage systems
- Ensures Consistency : Metadata mutations use MVCC transactions like all other data
- Enables Crash Recovery : The pager's atomicity guarantees extend to schema changes
- Supports Transactional DDL : Schema modifications can be rolled back or committed atomically
The SysCatalog interface abstracts the underlying Arrow storage, providing a type-safe API for the runtime to query and modify metadata. The bootstrap process ensures the system catalog exists before any user operations proceed, and the dual-context model enables proper transaction isolation for DDL operations.
Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:36-40 llkv-column-map/README.md:10-16 Diagram 4 from high-level architecture
Custom Types and Type Registry
Relevant source files
- llkv-aggregate/src/lib.rs
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_value.rs
- llkv-table/src/planner/program.rs
This page documents LLKV's type system, including SQL type mapping, custom type representations, and type inference mechanisms. The system uses Apache Arrow's DataType as its canonical type representation, with custom types like Decimal128, Date32, and Interval mapped to Arrow-compatible formats.
For information about expression evaluation and scalar operations, see Scalar Evaluation and NumericKernels. For aggregate function type handling, see Aggregation System.
Type System Architecture
LLKV's type system operates in three layers: SQL types (user-facing), intermediate literal types (planning), and Arrow DataTypes (execution). All data flowing through the system ultimately uses Arrow's columnar format.
Type Flow Architecture
graph TB
subgraph "SQL Layer"
SQLTYPE["SQL Types\nINT, TEXT, DECIMAL, DATE"]
end
subgraph "Planning Layer"
SQLVALUE["SqlValue\nInteger, Float, Decimal,\nString, Date32, Interval, Struct"]
LITERAL["Literal\nType-erased values"]
PLANVALUE["PlanValue\nPlan-time literals"]
end
subgraph "Execution Layer"
DATATYPE["Arrow DataType\nInt64, Float64, Utf8,\nDecimal128, Date32, Interval"]
SCHEMA["ExecutorSchema\nColumn metadata + types"]
INFERENCE["Type Inference\ninfer_computed_data_type"]
end
subgraph "Storage Layer"
RECORDBATCH["RecordBatch\nTyped columnar data"]
ARRAYS["Typed Arrays\nInt64Array, StringArray, etc."]
end
SQLTYPE --> SQLVALUE
SQLVALUE --> PLANVALUE
SQLVALUE --> LITERAL
PLANVALUE --> DATATYPE
LITERAL --> DATATYPE
DATATYPE --> SCHEMA
SCHEMA --> INFERENCE
INFERENCE --> DATATYPE
DATATYPE --> RECORDBATCH
RECORDBATCH --> ARRAYS
style DATATYPE fill:#f9f9f9
Sources: llkv-sql/src/sql_value.rs:16-27 llkv-sql/src/lib.rs:22-29 llkv-executor/src/translation/schema.rs:53-123
SQL to Arrow Type Mapping
SQL types are mapped to Arrow DataTypes during parsing and planning. The mapping is defined implicitly through the parsing logic in SqlValue and the type inference system.
| SQL Type | Arrow DataType | Notes |
|---|---|---|
| INT, INTEGER, BIGINT | Int64 | All integer types normalized to Int64 |
| FLOAT, DOUBLE, REAL | Float64 | All floating-point types normalized to Float64 |
| DECIMAL(p,s) | Decimal128(p,s) | Fixed-point decimal with precision and scale |
| TEXT, VARCHAR | Utf8 | Variable-length UTF-8 strings |
| DATE | Date32 | Days since Unix epoch |
| INTERVAL | Interval(MonthDayNano) | Calendar-aware interval type |
| BOOLEAN | Boolean | True/false values |
| Dictionary literals | Struct | Key-value maps represented as structs |
SQL to Arrow Type Conversion Flow
Sources: llkv-sql/src/sql_value.rs:178-214 llkv-sql/src/sql_value.rs:216-236 llkv-sql/src/lib.rs:22-29
Custom Type Representations
LLKV defines custom types for values that require special handling beyond basic Arrow types. These types bridge SQL semantics and Arrow's columnar format.
DecimalValue
Fixed-point decimal numbers with exact precision. Stored as i128 with a scale factor.
DecimalValue Representation
graph TB
subgraph "DecimalValue Structure"
DEC["DecimalValue\nraw_value: i128\nscale: i8"]
end
subgraph "SQL Input"
SQLDEC["SQL: 123.45"]
end
subgraph "Internal Representation"
RAW["raw_value = 12345\nscale = 2"]
CALC["Actual value = 12345 / 10^2 = 123.45"]
end
subgraph "Arrow Storage"
ARR["Decimal128Array\nprecision=5, scale=2"]
end
SQLDEC --> DEC
DEC --> RAW
RAW --> CALC
DEC --> ARR
Sources: llkv-sql/src/sql_value.rs:187-207 llkv-aggregate/src/lib.rs:314-324
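The fixed-point arithmetic can be sketched as follows (an illustrative struct, not the actual DecimalValue definition in llkv-sql):

```rust
// Illustrative fixed-point representation: value = raw_value / 10^scale.
#[derive(Clone, Copy, Debug)]
struct DecimalValue {
    raw_value: i128,
    scale: i8,
}

impl DecimalValue {
    fn to_f64(self) -> f64 {
        self.raw_value as f64 / 10f64.powi(self.scale as i32)
    }
}

fn main() {
    // SQL literal 123.45 becomes raw_value = 12345 with scale = 2.
    let d = DecimalValue { raw_value: 12_345, scale: 2 };
    assert_eq!(d.to_f64(), 123.45);
}
```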
IntervalValue
Calendar-aware time intervals with month, day, and nanosecond components.
IntervalValue Operations
Sources: llkv-sql/src/sql_value.rs:238-283 llkv-expr/src/literal.rs
Date32
Days since Unix epoch (1970-01-01), stored as i32.
Date32 Type Handling
Sources: llkv-sql/src/sql_value.rs:76-87 llkv-sql/src/sql_value.rs:169-174
Struct Types
Dictionary literals in SQL are represented as struct types with named fields.
Struct Type Representation
Sources: llkv-sql/src/sql_value.rs:124-135 llkv-sql/src/sql_value.rs:227-234
Type Inference for Computed Expressions
The type inference system determines Arrow DataTypes for computed expressions at planning time. This enables schema generation before execution.
Type Inference Flow
graph TB
subgraph "Expression Input"
EXPR["ScalarExpr<FieldId>\ncol1 + col2 * 3"]
end
subgraph "Type Inference"
INFER["infer_computed_data_type"]
CHECK["expression_uses_float"]
NORM["normalized_numeric_type"]
end
subgraph "Type Resolution"
COL1["Column col1: Int64"]
COL2["Column col2: Float64"]
RESULT["Result: Float64\n(one operand is float)"]
end
subgraph "Schema Output"
FIELD["Field(alias, Float64, nullable=true)"]
end
EXPR --> INFER
INFER --> CHECK
CHECK --> COL1
CHECK --> COL2
CHECK --> RESULT
INFER --> NORM
NORM --> RESULT
RESULT --> FIELD
Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:149-243
Type Inference Rules
The inference system applies the following rules:
| Expression Type | Inferred Type | Logic |
|---|---|---|
| ScalarExpr::Literal(Integer) | Int64 | Direct mapping |
| ScalarExpr::Literal(Float) | Float64 | Direct mapping |
| ScalarExpr::Literal(Decimal(p,s)) | Decimal128(p,s) | Preserves precision/scale |
| ScalarExpr::Column(field_id) | Column's type | Lookup in schema |
| ScalarExpr::Binary{left, op, right} | Float64 if any operand is float, else Int64 | Type promotion |
| ScalarExpr::Compare{...} | Int64 | Boolean as integer (0/1) |
| ScalarExpr::Cast{data_type, ...} | data_type | Explicit cast target |
| ScalarExpr::Random | Float64 | Floating-point random values |
Sources: llkv-executor/src/translation/schema.rs:56-122
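A toy version of the binary-operator promotion rule from the table above might look like this; the real infer_computed_data_type handles many more cases, including decimals and casts:

```rust
use arrow::datatypes::DataType;

// Promotion sketch: binary arithmetic yields Float64 if either operand is
// floating point, otherwise Int64. Illustrative only.
fn promote_binary(left: &DataType, right: &DataType) -> DataType {
    let is_float = |dt: &DataType| matches!(dt, DataType::Float32 | DataType::Float64);
    if is_float(left) || is_float(right) {
        DataType::Float64
    } else {
        DataType::Int64
    }
}
```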
graph TB
subgraph "Input Types"
INT8["Int8/Int16/Int32/Int64"]
UINT["UInt8/UInt16/UInt32/UInt64"]
FLOAT["Float32/Float64"]
DEC["Decimal128(p,s)"]
BOOL["Boolean"]
end
subgraph "Normalization"
NORM["normalized_numeric_type"]
end
subgraph "Output Types"
OUT_INT["Int64"]
OUT_FLOAT["Float64"]
end
INT8 --> NORM
BOOL --> NORM
NORM --> OUT_INT
UINT --> NORM
FLOAT --> NORM
NORM --> OUT_FLOAT
DEC --> NORM
NORM --> |scale=0 && fits in i64| OUT_INT
NORM --> |otherwise| OUT_FLOAT
Numeric Type Normalization
All numeric types are normalized to either Int64 or Float64 for arithmetic operations:
Numeric Type Normalization
Sources: llkv-executor/src/translation/schema.rs:125-147
Type Resolution During Expression Translation
Expression translation converts string-based column references to typed FieldId references, resolving types through the schema.
Expression Translation and Type Resolution
graph TB
subgraph "String-based Expression"
EXPRSTR["Expr<String>\nColumn('age') > Literal(18)"]
end
subgraph "Translation"
TRANS["translate_predicate"]
SCALAR["translate_scalar"]
RESOLVE["resolve_field_id"]
end
subgraph "Schema Lookup"
SCHEMA["ExecutorSchema"]
LOOKUP["schema.resolve('age')"]
COLUMN["ExecutorColumn\nname='age'\nfield_id=5\ndata_type=Int64"]
end
subgraph "FieldId-based Expression"
EXPRFID["Expr<FieldId>\nColumn(5) > Literal(18)"]
end
EXPRSTR --> TRANS
TRANS --> SCALAR
SCALAR --> RESOLVE
RESOLVE --> LOOKUP
LOOKUP --> COLUMN
COLUMN --> EXPRFID
Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:390-407
Type Preservation During Translation
The translation process preserves type information from the original expression:
| Expression Component | Type Preservation |
|---|---|
| Column(name) | Replaced with Column(field_id), type from schema |
| Literal(value) | Clone literal, type embedded in Literal enum |
| Binary{left, op, right} | Recursively translate operands, type inferred later |
| Cast{expr, data_type} | Preserve data_type during translation |
| Aggregate(call) | Translate inner expression, aggregate type determined by function |
Sources: llkv-executor/src/translation/expression.rs:176-387
graph TB
subgraph "Aggregate Specification"
SPEC["AggregateKind::Sum\nfield_id=5\ndata_type=Int64\ndistinct=false"]
end
subgraph "Accumulator Creation"
CREATE["new_with_projection_index"]
MATCH["Match on (data_type, distinct)"]
end
subgraph "Type-Specific Accumulators"
INT64["SumInt64\nvalue: Option<i64>\nhas_values: bool"]
FLOAT64["SumFloat64\nvalue: f64\nsaw_value: bool"]
DEC128["SumDecimal128\nsum: i128\nprecision: u8\nscale: i8"]
end
subgraph "Update Logic"
UPDATE_INT["Checked addition\nError on overflow"]
UPDATE_FLOAT["Floating addition\nNo overflow check"]
UPDATE_DEC["Checked i128 addition\nError on overflow"]
end
SPEC --> CREATE
CREATE --> MATCH
MATCH --> |Int64, false| INT64
MATCH --> |Float64, false| FLOAT64
MATCH --> |Decimal128 p,s , false| DEC128
INT64 --> UPDATE_INT
FLOAT64 --> UPDATE_FLOAT
DEC128 --> UPDATE_DEC
Type Handling in Aggregates
Aggregate functions have type-specific accumulator implementations. The type determines overflow behavior, precision, and result format.
Aggregate Type-Specific Accumulators
Sources: llkv-aggregate/src/lib.rs:461-542 llkv-aggregate/src/lib.rs:799-859
Aggregate Type Matrix
Different aggregates support different type combinations:
| Aggregate | Int64 | Float64 | Decimal128 | Utf8 | Boolean | Date32 |
|---|---|---|---|---|---|---|
| COUNT(*) | N/A | N/A | N/A | N/A | N/A | N/A |
| COUNT(col) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SUM | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| AVG | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| MIN/MAX | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| TOTAL | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| GROUP_CONCAT | ✓ | ✓ | - | ✓ | ✓ | - |
Notes:
- ✓ = Native support with type-specific accumulator
- ✓ (coerce) = Support via SQLite-style numeric coercion
- - = Not supported
Sources: llkv-aggregate/src/lib.rs:22-68 llkv-aggregate/src/lib.rs:385-447
graph LR
subgraph "DistinctKey Variants"
INT["Int(i64)"]
FLOAT["Float(u64)\nf64::to_bits()"]
STR["Str(String)"]
BOOL["Bool(bool)"]
DATE["Date(i32)"]
DEC["Decimal(i128)"]
end
subgraph "Accumulator"
SEEN["FxHashSet<DistinctKey>"]
INSERT["seen.insert(key)"]
CHECK["Returns true if new"]
end
subgraph "Aggregation"
ADD["Add to sum only if new"]
COUNT["Count distinct values"]
end
INT --> SEEN
FLOAT --> SEEN
STR --> SEEN
BOOL --> SEEN
DATE --> SEEN
DEC --> SEEN
SEEN --> INSERT
INSERT --> CHECK
CHECK --> ADD
CHECK --> COUNT
Distinct Value Tracking
For DISTINCT aggregates, the system tracks seen values using type-specific keys:
Distinct Value Tracking
Sources: llkv-aggregate/src/lib.rs:249-333 llkv-aggregate/src/lib.rs:825-858
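The bit-pattern trick for floats can be sketched as below; the enum mirrors the variants in the diagram, but the actual DistinctKey in llkv-aggregate may differ:

```rust
use std::collections::HashSet;

// Illustrative distinct-key type; floats are keyed by their bit pattern so
// they can be hashed and compared exactly.
#[derive(Hash, PartialEq, Eq)]
enum DistinctKey {
    Int(i64),
    Float(u64), // f64::to_bits()
    Str(String),
    Bool(bool),
    Date(i32),
    Decimal(i128),
}

fn main() {
    let mut seen: HashSet<DistinctKey> = HashSet::new();
    // insert() returns true only the first time a value is observed, so the
    // accumulator folds a value into the running aggregate only on `true`.
    assert!(seen.insert(DistinctKey::Float(1.5f64.to_bits())));
    assert!(!seen.insert(DistinctKey::Float(1.5f64.to_bits())));
}
```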
Type Coercion and Casting
The system supports both implicit coercion (for numeric operations) and explicit casting (via CAST expressions).
graph TB
subgraph "Input Values"
STR["String '123.45'"]
BOOL["Boolean true"]
NULL["NULL"]
end
subgraph "Coercion Function"
COERCE["array_value_to_numeric"]
PARSE["Parse as f64"]
FALLBACK["Use 0.0 if parse fails"]
end
subgraph "Coerced Values"
NUM1["123.45"]
NUM2["1.0"]
NUM3["0.0 (NULL skipped)"]
end
STR --> COERCE
BOOL --> COERCE
NULL --> COERCE
COERCE --> PARSE
PARSE --> |Success| NUM1
PARSE --> |Failure| FALLBACK
FALLBACK --> NUM1
COERCE --> |Boolean: 1.0/0.0| NUM2
COERCE --> |NULL: skip row| NUM3
Numeric Coercion in Aggregates
String and boolean values are coerced to numeric types in aggregate functions following SQLite semantics:
Numeric Coercion in Aggregates
Sources: llkv-aggregate/src/lib.rs:385-447 llkv-aggregate/src/lib.rs:860-877
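The coercion rules described above can be summarized in a small helper (hypothetical types and function, not the actual array_value_to_numeric implementation):

```rust
// Illustrative SQLite-style coercion: NULL rows are skipped, booleans map to
// 1.0/0.0, and unparseable text falls back to 0.0.
enum SqlScalar {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),
}

fn coerce_to_numeric(v: &SqlScalar) -> Option<f64> {
    match v {
        SqlScalar::Null => None, // skip the row
        SqlScalar::Bool(b) => Some(if *b { 1.0 } else { 0.0 }),
        SqlScalar::Int(i) => Some(*i as f64),
        SqlScalar::Float(f) => Some(*f),
        SqlScalar::Text(s) => Some(s.trim().parse::<f64>().unwrap_or(0.0)),
    }
}
```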
Explicit Type Casting
The CAST expression provides explicit type conversion:
Explicit Type Casting
Sources: llkv-executor/src/translation/schema.rs:95 llkv-expr/src/expr.rs:114-118
Type System Integration Points
The type system integrates with multiple layers of the architecture:
| Layer | Integration Point | Purpose |
|---|---|---|
| SQL Parsing | SqlValue::try_from_expr | Parse SQL literals into typed values |
| Planning | PlanValue conversion | Convert literals to plan representation |
| Schema Inference | infer_computed_data_type | Determine result types for expressions |
| Expression Translation | translate_scalar | Resolve column types from schema |
| Program Compilation | OwnedOperator | Store typed operators in bytecode |
| Execution | RecordBatch schema | Validate types match expected schema |
| Aggregation | Accumulator creation | Create type-specific aggregators |
| Storage | Arrow serialization | Persist typed data in columnar format |
Sources: llkv-sql/src/sql_value.rs:30-122 llkv-executor/src/translation/schema.rs:15-51 llkv-table/src/planner/program.rs:69-101
Summary
LLKV's type system is built on Apache Arrow's DataType as the canonical type representation, with custom types for SQL-specific semantics:
- SQL types are mapped to Arrow types during parsing through SqlValue
- Custom types (Decimal, Interval, Date32, Struct) provide SQL-compatible semantics
- Type inference determines result types for computed expressions at planning time
- Type resolution converts string column references to typed FieldId references
- Aggregate functions use type-specific accumulators with appropriate overflow handling
- Type coercion follows SQLite semantics for numeric operations
The type system operates transparently across all layers, ensuring type safety from SQL parsing through storage while maintaining compatibility with Arrow's columnar format.
Sources: llkv-sql/src/lib.rs:1-51 llkv-executor/src/translation/schema.rs:1-271 llkv-aggregate/src/lib.rs:1-83