
Overview


This document introduces the LLKV database system, its architectural principles, and the relationships between its constituent crates. It provides a high-level map of how SQL queries flow through the system from parsing to storage, and explains the role of Apache Arrow as the universal data interchange format.

For details on individual subsystems, see the dedicated pages that follow this overview.

What is LLKV

LLKV is an experimental SQL database implemented as a Rust workspace of 15 crates. It layers SQL processing, streaming query execution, and MVCC transaction management on top of pluggable key-value storage backends. The system uses Apache Arrow RecordBatch as its primary data representation at every layer, enabling zero-copy operations and SIMD-friendly columnar processing.

The architecture separates concerns into six distinct layers:

  1. SQL Interface
  2. Query Planning
  3. Runtime and Orchestration
  4. Query Execution
  5. Table and Metadata Management
  6. Storage and I/O

Each layer communicates through well-defined interfaces centered on Arrow data structures.

Sources: README.md:1-107 Cargo.toml:1-89
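
As a concrete reference point, the following minimal example (ordinary arrow crate usage, not LLKV-specific code) builds the kind of RecordBatch that crosses every layer boundary:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn example_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    // Schema describes the columns; every LLKV layer exchanges data in this shape.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    // Columnar arrays: one contiguous, SIMD-friendly buffer per column.
    let columns: Vec<ArrayRef> = vec![
        Arc::new(Int32Array::from(vec![1, 2, 3])),
        Arc::new(StringArray::from(vec!["a", "b", "c"])),
    ];
    RecordBatch::try_new(schema, columns)
}
```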

Core Design Principles

LLKV's design reflects several intentional trade-offs:

| Principle | Implementation | Rationale |
|-----------|----------------|-----------|
| Arrow-Native | RecordBatch is the universal data format across all layers | Enables zero-copy operations, SIMD vectorization, and interoperability with the Arrow ecosystem |
| Synchronous Execution | Work-stealing via Rayon instead of an async runtime | Reduces scheduler overhead for individual queries while remaining embeddable in Tokio contexts |
| Layered Modularity | 15 independent crates with clear boundaries | Allows independent evolution and testing of subsystems |
| MVCC Throughout | System metadata columns (row_id, created_by, deleted_by) injected at the storage layer | Provides snapshot isolation without write locks |
| Storage Abstraction | Pager trait with multiple implementations | Supports both in-memory and persistent backends with zero-copy reads |
| Compiled Predicates | Expressions compile to stack-based bytecode | Enables efficient vectorized evaluation without interpretation overhead |

Sources: README.md:36-42 llkv-storage/README.md:12-22 llkv-expr/README.md:66-72

Workspace Structure

The LLKV workspace consists of 15 crates organized by layer:

| Layer | Crate | Primary Responsibility |
|-------|-------|------------------------|
| SQL Interface | llkv-sql | SQL parsing, dialect normalization, INSERT buffering |
| Query Planning | llkv-plan | Typed query plan structures (SelectPlan, InsertPlan, etc.) |
| | llkv-expr | Expression AST (Expr, ScalarExpr) |
| Runtime | llkv-runtime | Session management, MVCC orchestration, plan execution |
| | llkv-transaction | Transaction ID allocation, snapshot management |
| Execution | llkv-executor | Streaming query evaluation |
| | llkv-aggregate | Aggregate function implementations (SUM, COUNT, AVG, etc.) |
| | llkv-join | Join algorithms (hash join with specialized fast paths) |
| Table/Metadata | llkv-table | Schema-aware table abstraction, system catalog |
| | llkv-column-map | Column-oriented storage, logical-to-physical key mapping |
| Storage | llkv-storage | Pager trait, MemPager, SimdRDrivePager |
| Utilities | llkv-csv | CSV ingestion helper |
| | llkv-result | Result type definitions |
| | llkv-test-utils | Testing utilities |
| | llkv-slt-tester | SQL Logic Test harness |

Sources: Cargo.toml:9-26 Cargo.toml:67-87 README.md:44-53

Component Architecture and Data Flow

The following diagram shows the major components and how Arrow RecordBatch flows through the system:

Sources: README.md:44-72 Cargo.toml:67-87

graph TB
    User["User / Application"]
subgraph "llkv-sql Crate"
        SqlEngine["SqlEngine"]
Preprocessor["SQL Preprocessor"]
Parser["sqlparser"]
InsertBuffer["InsertBuffer"]
end
    
    subgraph "llkv-plan Crate"
        SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
CreateTablePlan["CreateTablePlan"]
OtherPlans["Other Plan Types"]
end
    
    subgraph "llkv-expr Crate"
        Expr["Expr<F>"]
ScalarExpr["ScalarExpr<F>"]
end
    
    subgraph "llkv-runtime Crate"
        RuntimeContext["RuntimeContext"]
SessionHandle["SessionHandle"]
TxnSnapshot["TransactionSnapshot"]
end
    
    subgraph "llkv-executor Crate"
        TableExecutor["TableExecutor"]
StreamingOps["Streaming Operators"]
end
    
    subgraph "llkv-table Crate"
        Table["Table"]
SysCatalog["SysCatalog (Table 0)"]
FieldId["FieldId Resolution"]
end
    
    subgraph "llkv-column-map Crate"
        ColumnStore["ColumnStore"]
LogicalFieldId["LogicalFieldId"]
PhysicalKey["PhysicalKey Mapping"]
end
    
    subgraph "llkv-storage Crate"
        Pager["Pager Trait"]
MemPager["MemPager"]
SimdPager["SimdRDrivePager"]
end
    
    ArrowBatch["Arrow RecordBatch\n(Universal Format)"]
User -->|SQL String| SqlEngine
 
   SqlEngine --> Preprocessor
 
   Preprocessor --> Parser
 
   Parser -->|AST| SelectPlan
 
   Parser -->|AST| InsertPlan
 
   Parser -->|AST| CreateTablePlan
    
 
   SelectPlan --> Expr
 
   InsertPlan --> ScalarExpr
    
 
   SelectPlan --> RuntimeContext
 
   InsertPlan --> RuntimeContext
 
   CreateTablePlan --> RuntimeContext
    
 
   RuntimeContext --> SessionHandle
 
   RuntimeContext --> TxnSnapshot
 
   RuntimeContext --> TableExecutor
 
   RuntimeContext --> Table
    
 
   TableExecutor --> StreamingOps
 
   StreamingOps --> Table
    
 
   Table --> SysCatalog
 
   Table --> FieldId
 
   Table --> ColumnStore
    
 
   ColumnStore --> LogicalFieldId
 
   ColumnStore --> PhysicalKey
 
   ColumnStore --> Pager
    
 
   Pager --> MemPager
 
   Pager --> SimdPager
    
 
   Table -.->|Produces/Consumes| ArrowBatch
 
   StreamingOps -.->|Produces/Consumes| ArrowBatch
 
   ColumnStore -.->|Serializes/Deserializes| ArrowBatch
 
   SqlEngine -.->|Returns| ArrowBatch

End-to-End Query Execution

This diagram traces a SELECT query from SQL text to results, showing the concrete code entities involved:

Sources: README.md:56-63 llkv-sql/README.md:1-107 llkv-runtime/README.md:33-41 llkv-table/README.md:10-25

sequenceDiagram
    participant App as "Application"
    participant SqlEngine as "SqlEngine::execute()"
    participant Preprocessor as "preprocess_sql()"
    participant Parser as "sqlparser::Parser"
    participant Planner as "build_select_plan()"
    participant Runtime as "RuntimeContext::execute_plan()"
    participant Executor as "TableExecutor::execute()"
    participant Table as "Table::scan_stream()"
    participant ColStore as "ColumnStore::gather_columns()"
    participant Pager as "Pager::batch_get()"
    
    App->>SqlEngine: SELECT * FROM users WHERE age > 18
    
    Note over SqlEngine,Preprocessor: Dialect normalization
    SqlEngine->>Preprocessor: Normalize SQLite/DuckDB syntax
    
    SqlEngine->>Parser: Parse normalized SQL
    Parser-->>SqlEngine: Statement AST
    
    SqlEngine->>Planner: Translate AST to SelectPlan
    Note over Planner: Build SelectPlan with\nExpr<String> predicates
    Planner-->>SqlEngine: SelectPlan
    
    SqlEngine->>Runtime: execute_plan(SelectPlan)
    
    Note over Runtime: Acquire TransactionSnapshot\nResolve field names to FieldId
    
    Runtime->>Executor: execute(SelectPlan, context)
    
    Note over Executor: Compile Expr<FieldId>\ninto EvalProgram
    
    Executor->>Table: scan_stream(fields, predicate)
    
    Note over Table: Apply MVCC filtering\nPush down predicates
    
    Table->>ColStore: gather_columns(LogicalFieldId[])
    
    Note over ColStore: Map LogicalFieldId\nto PhysicalKey
    
    ColStore->>Pager: batch_get(PhysicalKey[])
    Pager-->>ColStore: EntryHandle[] (zero-copy)
    
    Note over ColStore: Deserialize Arrow buffers\nApply row_id filtering
    
    ColStore-->>Table: RecordBatch
    Table-->>Executor: RecordBatch
    
    Note over Executor: Apply projections\nEvaluate expressions
    
    Executor-->>Runtime: RecordBatch stream
    Runtime-->>SqlEngine: Vec<RecordBatch>
    SqlEngine-->>App: Query results

Key Features

MVCC Transaction Management

LLKV implements multi-version concurrency control with snapshot isolation:

  • Every table includes three system columns: row_id (monotonic), created_by (transaction ID), and deleted_by (transaction ID or NULL)
  • TxnIdManager in llkv-transaction allocates monotonic transaction IDs and tracks commit watermarks
  • TransactionSnapshot captures a consistent view of the database at transaction start
  • Auto-commit statements use TXN_ID_AUTO_COMMIT = 1
  • Explicit transactions maintain both persistent and staging contexts for isolation

Sources: README.md:64-72 llkv-runtime/README.md:20-32 llkv-table/README.md:32-35
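
The visibility rule implied by these columns can be sketched as follows. This is an illustrative simplification; the actual runtime also consults the commit watermark and in-flight transaction state:

```rust
/// Illustrative only: one plausible formulation of row visibility based on
/// the MVCC columns described above. The real LLKV runtime additionally
/// checks commit watermarks and whether the creating transaction committed.
type TxnId = u64;

fn row_visible(created_by: TxnId, deleted_by: Option<TxnId>, snapshot: TxnId) -> bool {
    // Visible if created at or before the snapshot and not deleted by a
    // transaction the snapshot can already see.
    created_by <= snapshot && deleted_by.map_or(true, |d| d > snapshot)
}
```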

Zero-Copy Storage Pipeline

The storage layer supports zero-copy reads when backed by SimdRDrivePager:

  1. ColumnStore maps LogicalFieldId to PhysicalKey
  2. Pager::batch_get() returns EntryHandle wrappers around memory-mapped regions
  3. Arrow arrays are deserialized directly from the mapped memory without intermediate copies
  4. SIMD-aligned buffers enable vectorized predicate evaluation

Sources: llkv-column-map/README.md:19-41 llkv-storage/README.md:12-28 README.md:12-13

Compiled Expression Evaluation

Predicates and scalar expressions compile to stack-based bytecode:

  • Expr<FieldId> structures in llkv-expr represent logical predicates
  • ProgramCompiler in llkv-table translates expressions into EvalProgram bytecode
  • DomainProgram tracks which row IDs satisfy predicates
  • Bytecode evaluation uses stack-based execution for efficient vectorized operations

Sources: llkv-expr/README.md:1-88 llkv-table/README.md:10-18 README.md:46-53
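
To illustrate the evaluation style, here is a toy stack machine; the opcode set and names are invented for this example and do not mirror the real EvalProgram bytecode:

```rust
// Toy sketch of stack-based predicate bytecode (illustrative only).
// For example, `age > 18` could compile to [PushColumn(0), PushConst(18), Gt].
enum Op {
    PushConst(i64),
    PushColumn(usize), // index into the current row's columns
    Gt,                // pop b, pop a, push (a > b) as 0/1
    And,               // pop b, pop a, push (a != 0 && b != 0) as 0/1
}

fn eval(program: &[Op], row: &[i64]) -> bool {
    let mut stack: Vec<i64> = Vec::new();
    for op in program {
        match op {
            Op::PushConst(v) => stack.push(*v),
            Op::PushColumn(i) => stack.push(row[*i]),
            Op::Gt => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push((a > b) as i64);
            }
            Op::And => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push((a != 0 && b != 0) as i64);
            }
        }
    }
    stack.pop() == Some(1)
}
```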

SQL Logic Test Infrastructure

LLKV includes comprehensive SQL correctness testing:

  • llkv-slt-tester wraps the sqllogictest framework
  • LlkvSltRunner discovers .slt files and executes test suites
  • Supports remote test fetching via .slturl pointer files
  • Environment variable LLKV_SLT_STATS=1 enables detailed query statistics
  • CI runs the full suite on Linux, macOS, and Windows

Sources: README.md:75-77 llkv-slt-tester/README.md:1-57

Getting Started

The main entry point is the llkv crate, which re-exports the SQL interface:
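
A minimal usage sketch is shown below. The constructor name new_in_memory is a placeholder assumption for illustration; consult the llkv README for the exact re-exported API:

```rust
// Hypothetical sketch only: the constructor name (new_in_memory) is an
// assumption for illustration. SqlEngine::execute is the entry point
// described above; see the llkv README for the actual API surface.
use llkv::SqlEngine;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In-memory engine backed by MemPager.
    let engine = SqlEngine::new_in_memory()?;
    engine.execute("CREATE TABLE users (id INT, age INT)")?;
    engine.execute("INSERT INTO users VALUES (1, 42), (2, 7)")?;
    // SELECT results come back as Arrow RecordBatches.
    let results = engine.execute("SELECT * FROM users WHERE age > 18")?;
    println!("{results:?}");
    Ok(())
}
```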

For persistent storage, use SimdRDrivePager instead of MemPager. For transaction control beyond auto-commit, obtain a SessionHandle via SqlEngine::session().

Sources: README.md:14-33 demos/llkv-sql-pong-demo/src/main.rs:386-393

Comparison with Apache DataFusion

LLKV shares architectural concepts with Apache DataFusion but differs in several key areas:

| Aspect | LLKV | DataFusion |
|--------|------|------------|
| Execution Model | Synchronous with Rayon work-stealing | Async with Tokio runtime |
| Storage Backend | Custom key-value via Pager trait | Parquet, CSV, object stores |
| SQL Parser | sqlparser crate (same) | sqlparser crate |
| Data Format | Arrow RecordBatch (same) | Arrow RecordBatch |
| Maturity | Alpha / Experimental | Production-ready |
| Transaction Support | MVCC snapshot isolation | Read-only (no writes) |

LLKV deliberately avoids the DataFusion task scheduler to explore trade-offs in a synchronous execution model, while maintaining compatibility with the same SQL parser and Arrow memory layout.

Sources: README.md:36-42 README.md:8-13



Architecture


This document describes the overall system architecture of LLKV, explaining its layered design, core abstractions, and how components interact to provide SQL functionality over key-value storage. For details on individual crates and their dependencies, see Workspace and Crates. For the end-to-end query execution flow, see SQL Query Processing Pipeline. For Arrow integration specifics, see Data Formats and Arrow Integration.

Layered Design

LLKV is organized into six architectural layers, each with focused responsibilities. Higher layers depend only on lower layers, and all layers communicate through Apache Arrow RecordBatch structures as the universal data interchange format.

Sources: Cargo.toml:1-89 README.md:44-53 llkv-sql/README.md:1-10 llkv-runtime/README.md:1-10 llkv-executor/README.md:1-10 llkv-table/README.md:1-18 llkv-storage/README.md:1-17

graph TB
    subgraph L1["Layer 1: User Interface"]
SQL["SQL Queries"]
REPL["CLI REPL"]
DEMO["Demo Applications"]
BENCH["TPC-H Benchmarks"]
end
    
    subgraph L2["Layer 2: SQL Processing"]
SQLENG["SqlEngine\nllkv-sql"]
PLAN["Query Plans\nllkv-plan"]
EXPR["Expression AST\nllkv-expr"]
end
    
    subgraph L3["Layer 3: Runtime & Orchestration"]
RUNTIME["RuntimeContext\nllkv-runtime"]
TXNMGR["TxnIdManager\nllkv-transaction"]
CATALOG["CatalogManager\nllkv-runtime"]
end
    
    subgraph L4["Layer 4: Query Execution"]
EXECUTOR["TableExecutor\nllkv-executor"]
AGG["Accumulators\nllkv-aggregate"]
JOIN["HashJoinExecutor\nllkv-join"]
end
    
    subgraph L5["Layer 5: Data Management"]
TABLE["Table\nllkv-table"]
COLMAP["ColumnStore\nllkv-column-map"]
SYSCAT["SysCatalog\nllkv-table"]
end
    
    subgraph L6["Layer 6: Storage"]
PAGER["Pager trait\nllkv-storage"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
    
 
   SQL --> SQLENG
 
   REPL --> SQLENG
 
   DEMO --> SQLENG
 
   BENCH --> SQLENG
    
 
   SQLENG --> PLAN
 
   SQLENG --> EXPR
 
   PLAN --> RUNTIME
    
 
   RUNTIME --> TXNMGR
 
   RUNTIME --> CATALOG
 
   RUNTIME --> EXECUTOR
 
   RUNTIME --> TABLE
    
 
   EXECUTOR --> AGG
 
   EXECUTOR --> JOIN
 
   EXECUTOR --> TABLE
    
 
   TABLE --> COLMAP
 
   TABLE --> SYSCAT
 
   COLMAP --> PAGER
 
   SYSCAT --> COLMAP
    
 
   PAGER --> MEMPAGER
 
   PAGER --> SIMDPAGER

Core Architectural Principles

Arrow-Native Data Flow

All data flowing between components is represented as Apache Arrow RecordBatch structures. This enables:

  • Zero-copy operations: Arrow buffers can be passed between layers without serialization
  • SIMD-friendly processing: Columnar layout supports vectorized operations
  • Consistent memory model: All layers use the same in-memory representation

The RecordBatch abstraction appears at every boundary: SQL parsing produces plans that operate on batches, the executor streams batches, tables persist batches, and the column store chunks batches for storage.

Sources: README.md:10-12 README.md:22-23 llkv-table/README.md:10-11 llkv-column-map/README.md:10-14

Storage Abstraction Through Pager Trait

The Pager trait in llkv-storage provides a pluggable storage backend interface:

| Pager Type | Use Case | Key Properties |
|------------|----------|----------------|
| MemPager | Tests, temporary namespaces, staging contexts | Heap-backed, fast |
| SimdRDrivePager | Persistent storage | Zero-copy reads, SIMD-aligned, memory-mapped |

Both implementations satisfy the same batch get/put contract, allowing higher layers to remain storage-agnostic. The runtime uses dual-pager contexts: persistent storage for committed tables and in-memory staging for uncommitted transaction objects.

Sources: llkv-storage/README.md:12-22 llkv-runtime/README.md:26-32
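
A simplified sketch of that contract is shown below; the real Pager trait in llkv-storage has different signatures and additional methods, so treat this as a schematic only:

```rust
// Schematic sketch of the batch get/put contract; not the actual llkv-storage trait.
type PhysicalKey = u64;

trait Pager {
    /// Zero-copy handle over stored bytes: memory-mapped for the persistent
    /// backend, heap-backed for MemPager.
    type Handle: AsRef<[u8]>;

    fn batch_get(&self, keys: &[PhysicalKey]) -> std::io::Result<Vec<Self::Handle>>;
    fn batch_put(&self, entries: &[(PhysicalKey, Vec<u8>)]) -> std::io::Result<()>;
}
```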

MVCC Integration

Multi-version concurrency control (MVCC) is implemented as system metadata columns injected at the table layer:

  • row_id: Monotonic row identifier
  • created_by: Transaction ID that created this row version
  • deleted_by: Transaction ID that deleted this row (or NULL if active)

These columns are stored alongside user data in ColumnStore, enabling snapshot isolation without separate version chains. The TxnIdManager in llkv-transaction allocates monotonic transaction IDs and tracks commit watermarks. The runtime enforces visibility rules during scans by filtering based on snapshot transaction IDs.

Sources: llkv-table/README.md:13-17 llkv-runtime/README.md:19-25 llkv-column-map/README.md:27-28

Component Interaction Patterns

Query Execution Flow

Sources: README.md:56-62 llkv-sql/README.md:15-20 llkv-runtime/README.md:12-17 llkv-executor/README.md:12-17 llkv-table/README.md:19-25 llkv-column-map/README.md:24-28

Dual-Context Transaction Management

The runtime maintains two execution contexts during explicit transactions. The persistent context operates on committed tables directly, while the staging context buffers newly created tables in memory. On commit, staged operations are replayed into the persistent context after the TxnIdManager confirms no conflicts and advances the commit watermark. On rollback, the staging context is dropped and all uncommitted work is discarded.

Sources: llkv-runtime/README.md:26-32 llkv-runtime/README.md:12-17

Column Storage and Logical Field Mapping

The ColumnStore maintains a mapping from LogicalFieldId (namespace + table ID + field ID) to physical storage keys. Each logical field has a descriptor chunk (metadata about the column), data chunks (Arrow-serialized column arrays), and row ID chunks (per-chunk row identifiers for filtering). This three-level mapping isolates user data from system metadata while allowing efficient scans and appends.

Sources: llkv-column-map/README.md:18-23 llkv-table/README.md:13-17 llkv-column-map/README.md:10-17
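
Schematically, the mapping can be pictured as follows (the type layouts are illustrative, not the actual llkv-column-map definitions):

```rust
// Illustrative shape of the logical-to-physical mapping described above.
use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone, Copy)]
struct LogicalFieldId {
    namespace: u16, // e.g. user data vs. system metadata
    table_id: u32,
    field_id: u32,
}

type PhysicalKey = u64;

struct FieldEntry {
    descriptor: PhysicalKey,         // column metadata chunk
    data_chunks: Vec<PhysicalKey>,   // Arrow-serialized column arrays
    row_id_chunks: Vec<PhysicalKey>, // per-chunk row identifiers for filtering
}

struct ColumnIndex {
    map: HashMap<LogicalFieldId, FieldEntry>,
}
```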

Key Abstractions

SqlEngine

Entry point for SQL execution. Located in llkv-sql, it:

  • Preprocesses SQL for dialect compatibility (DuckDB, SQLite quirks)
  • Parses with sqlparser crate
  • Batches compatible INSERT statements
  • Delegates execution to RuntimeContext
  • Returns ExecutionResult enums

Sources: llkv-sql/README.md:1-20 README.md:56-59

RuntimeContext

Orchestration layer in llkv-runtime that:

  • Executes all statement types (DDL, DML, queries)
  • Manages transaction snapshots and MVCC injection
  • Coordinates between table layer and executor
  • Maintains catalog manager for schema metadata
  • Implements dual-context staging for transactions

Sources: llkv-runtime/README.md:12-25 llkv-runtime/README.md:34-40

Table and ColumnStore

Table in llkv-table provides schema-aware APIs:

  • Schema validation on CREATE TABLE and append
  • MVCC column injection (row_id, created_by, deleted_by)
  • Streaming scan API with predicate pushdown
  • Integration with system catalog (table 0)

ColumnStore in llkv-column-map handles physical storage:

  • Arrow-serialized column chunks
  • Logical-to-physical key mapping
  • Append pipeline with row-id sorting and last-writer-wins semantics
  • Atomic multi-key commits through pager

Sources: llkv-table/README.md:12-25 llkv-column-map/README.md:12-28

TableExecutor

Execution engine in llkv-executor that:

  • Streams RecordBatch results from table scans
  • Evaluates projections, filters, and scalar expressions
  • Coordinates with llkv-aggregate for aggregation
  • Coordinates with llkv-join for join operations
  • Applies MVCC visibility filters during scans

Sources: llkv-executor/README.md:1-17 README.md:60-61

Pager Trait

Storage abstraction in llkv-storage that:

  • Exposes batch get/put over (PhysicalKey, EntryHandle) pairs
  • Supports atomic multi-key updates
  • Enables zero-copy reads when backed by memory-mapped storage
  • Implementations: MemPager (heap), SimdRDrivePager (persistent)

Sources: llkv-storage/README.md:12-22 README.md:11-12

Crate Organization

The workspace crates are organized into the following layers:

| Layer | Crates | Responsibilities |
|-------|--------|------------------|
| SQL Processing | llkv-sql, llkv-plan, llkv-expr | Parse SQL, build typed plans, represent expressions |
| Runtime | llkv-runtime, llkv-transaction | Orchestrate execution, manage MVCC and sessions |
| Execution | llkv-executor, llkv-aggregate, llkv-join | Stream results, compute aggregates, evaluate joins |
| Data Management | llkv-table, llkv-column-map | Schema-aware tables, columnar storage |
| Storage | llkv-storage | Pager trait and implementations |
| Supporting | llkv-result, llkv-csv, llkv-test-utils | Result types, CSV ingestion, test utilities |
| Testing | llkv-slt-tester, llkv-tpch | SQL Logic Tests, TPC-H benchmarks |
| Entry Points | llkv | Main library and CLI |

For detailed dependency graphs and crate responsibilities, see Workspace and Crates.

Sources: Cargo.toml:67-87 README.md:44-53

Execution Model

Synchronous with Work-Stealing

LLKV defaults to synchronous execution using Rayon for parallelism:

  • Query execution is synchronous, not async
  • Rayon work-stealing parallelizes scans and projections
  • Crossbeam channels coordinate between threads
  • Embeds cleanly inside Tokio when needed (e.g., SLT test runner)

This design minimizes scheduler overhead for individual queries while maintaining high throughput for concurrent workloads.

Sources: README.md:38-41 llkv-column-map/README.md:32-34

Streaming Results

Queries produce results incrementally:

  • TableExecutor yields fixed-size RecordBatches
  • No full result set materialization
  • Callers process batches via callback or iterator
  • Join and aggregate operators buffer only necessary state

Sources: llkv-table/README.md:24-25 llkv-executor/README.md:14-17 llkv-join/README.md:19-22
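
From the caller's perspective, consumption looks roughly like the sketch below (the callback-style signature is illustrative, not the actual scan_stream API):

```rust
// Illustrative consumption of a streaming scan: only one batch is held
// at a time, so no full result set is materialized.
use arrow::record_batch::RecordBatch;

fn consume_scan(mut next_batch: impl FnMut() -> Option<RecordBatch>) -> usize {
    let mut total_rows = 0;
    while let Some(batch) = next_batch() {
        total_rows += batch.num_rows();
    }
    total_rows
}
```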

Data Lifecycle

Write Path

  1. User submits INSERT or UPDATE through SqlEngine
  2. RuntimeContext validates schema and injects MVCC columns
  3. Table::append validates RecordBatch schema
  4. ColumnStore::append sorts by row_id, rewrites conflicts
  5. Pager::batch_put commits Arrow-serialized chunks atomically
  6. Transaction manager advances commit watermark

Read Path

  1. User submits SELECT through SqlEngine
  2. RuntimeContext acquires transaction snapshot
  3. TableExecutor creates scan with projection and filter
  4. Table::scan_stream initiates ColumnStream
  5. ColumnStore fetches chunks via Pager::batch_get (zero-copy)
  6. MVCC filtering applied using snapshot visibility rules
  7. Executor evaluates expressions and streams RecordBatches to caller

Sources: README.md:56-62 llkv-column-map/README.md:24-28 llkv-table/README.md:19-25



Workspace and Crates


Purpose and Scope

This document details the Cargo workspace structure and the 15+ crates that comprise the LLKV database system. Each crate is designed with a single responsibility and well-defined interfaces, enabling independent testing and evolution of components. This page catalogs the role of each crate, their internal dependencies, and how they map to the system's layered architecture described in Architecture.

For information about how SQL queries flow through these crates, see SQL Query Processing Pipeline. For details on specific subsystems like storage or transactions, refer to sections 7 and following.


Workspace Overview

The LLKV workspace is defined in Cargo.toml:67-88 and contains 18 member crates organized into core system components, specialized operations, testing infrastructure, and demonstration applications.

Workspace Structure:

graph TB
    subgraph "Core System Crates"
        LLKV["llkv\n(main entry)"]
SQL["llkv-sql\n(SQL interface)"]
PLAN["llkv-plan\n(query plans)"]
EXPR["llkv-expr\n(expression AST)"]
RUNTIME["llkv-runtime\n(orchestration)"]
EXECUTOR["llkv-executor\n(execution)"]
TABLE["llkv-table\n(table layer)"]
COLMAP["llkv-column-map\n(column store)"]
STORAGE["llkv-storage\n(storage abstraction)"]
TXN["llkv-transaction\n(MVCC manager)"]
RESULT["llkv-result\n(error types)"]
end
    
    subgraph "Specialized Operations"
        AGG["llkv-aggregate\n(aggregation)"]
JOIN["llkv-join\n(joins)"]
CSV["llkv-csv\n(CSV import)"]
end
    
    subgraph "Testing Infrastructure"
        SLT["llkv-slt-tester\n(SQL logic tests)"]
TESTUTIL["llkv-test-utils\n(test utilities)"]
TPCH["llkv-tpch\n(TPC-H benchmarks)"]
end
    
    subgraph "Demonstrations"
        DEMO["llkv-sql-pong-demo\n(interactive demo)"]
end

Sources: Cargo.toml:67-88


Core System Crates

llkv

Purpose: Main library crate that re-exports the primary user-facing APIs from llkv-sql and llkv-runtime.

Key Dependencies: llkv-sql, llkv-runtime

Responsibilities:

  • Provides the consolidated API surface for embedding LLKV
  • Re-exports SqlEngine for SQL query execution
  • Re-exports runtime components for programmatic database access

Sources: Cargo.toml:9-10


llkv-sql

Path: llkv-sql/

Purpose: SQL interface layer that parses SQL statements, preprocesses dialect-specific syntax, and translates them into typed query plans.

Key Dependencies:

  • llkv-plan - Query plan structures
  • llkv-expr - Expression AST
  • llkv-runtime - Execution orchestration
  • sqlparser - SQL parsing (version 0.59.0)

Responsibilities:

  • SQL statement preprocessing for dialect compatibility
  • AST-to-plan translation
  • INSERT statement buffering optimization
  • SQL query result formatting

Primary Types:

  • SqlEngine - Main query interface

Sources: Cargo.toml:21 llkv-plan/src/lib.rs:1-38


llkv-plan

Path: llkv-plan/

Purpose: Query planner that defines typed plan structures representing SQL operations.

Key Dependencies:

  • llkv-expr - Expression types
  • llkv-result - Error handling
  • sqlparser - SQL AST types

Responsibilities:

  • Plan structure definitions (SelectPlan, InsertPlan, UpdatePlan, DeletePlan)
  • SQL-to-plan conversion utilities
  • Subquery correlation tracking
  • Plan graph serialization for debugging

Primary Types:

  • SelectPlan, InsertPlan, UpdatePlan, DeletePlan, CreateTablePlan
  • SubqueryCorrelatedTracker
  • RangeSelectRows - Range-based row selection

Sources: llkv-plan/Cargo.toml:1-28 llkv-plan/src/lib.rs:1-38


llkv-expr

Path: llkv-expr/

Purpose: Expression AST definitions and literal value handling, independent of concrete Arrow scalar types.

Key Dependencies:

  • arrow - Arrow data types

Responsibilities:

  • Expression AST (Expr<T>, ScalarExpr<T>)
  • Literal value representation (Literal enum)
  • Type-aware predicate compilation (typed_predicate)
  • Decimal value handling

Primary Types:

  • Expr<T> - Generic expression with field identifier type parameter
  • ScalarExpr<T> - Scalar expressions
  • Literal - Untyped literal values
  • DecimalValue - Fixed-precision decimal
  • IntervalValue - Calendar interval

Sources: llkv-expr/Cargo.toml:1-19 llkv-expr/src/lib.rs:1-21 llkv-expr/src/literal.rs:1-446
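
A condensed sketch of the generic AST shape (variant names and payloads are abbreviated relative to the real llkv-expr definitions):

```rust
// Condensed, illustrative sketch of the generic expression AST.
// Planning starts with Expr<String> (column names); the runtime resolves
// it to Expr<FieldId> before compilation.
enum ScalarExpr<F> {
    Column(F),    // field reference, generic over the identifier type
    Literal(i64), // the real Literal enum covers many more types
    Add(Box<ScalarExpr<F>>, Box<ScalarExpr<F>>),
}

enum Expr<F> {
    Gt(ScalarExpr<F>, ScalarExpr<F>),
    And(Vec<Expr<F>>),
    Or(Vec<Expr<F>>),
    Not(Box<Expr<F>>),
}
```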


llkv-runtime

Path: llkv-runtime/

Purpose: Runtime orchestration layer providing MVCC transaction management, session handling, and system catalog coordination.

Key Dependencies:

  • llkv-executor - Query execution
  • llkv-table - Table operations
  • llkv-transaction - MVCC snapshots

Responsibilities:

  • Transaction lifecycle management
  • Session state tracking
  • System catalog access
  • Query plan execution coordination
  • MVCC snapshot creation and cleanup

Primary Types:

  • RuntimeContext - Main runtime state
  • Session - Per-connection state

Sources: Cargo.toml:19


llkv-executor

Path: llkv-executor/

Purpose: Query execution engine that evaluates plans and produces streaming results.

Key Dependencies:

  • llkv-plan - Plan structures
  • llkv-expr - Expression evaluation
  • llkv-table - Table scans
  • llkv-aggregate - Aggregation
  • llkv-join - Join algorithms

Responsibilities:

  • SELECT plan execution
  • Projection and filtering
  • Aggregation coordination
  • Join execution
  • Streaming RecordBatch production

Sources: Cargo.toml:14


llkv-table

Path: llkv-table/

Purpose: Schema-aware table abstraction providing high-level data operations over columnar storage.

Key Dependencies:

  • llkv-column-map - Column storage
  • llkv-expr - Predicate compilation
  • llkv-storage - Storage backend
  • arrow - RecordBatch representation

Responsibilities:

  • Schema validation and enforcement
  • MVCC metadata injection (row_id, created_by, deleted_by)
  • Predicate compilation and optimization
  • RecordBatch append/scan operations
  • Column data type management

Primary Types:

  • Table - Main table interface
  • TablePlanner - Query optimization
  • TableExecutor - Execution strategies

Sources: llkv-table/Cargo.toml:1-60 llkv-column-map/src/store/projection.rs:1-728


llkv-column-map

Path: llkv-column-map/

Purpose: Columnar storage layer that chunks Arrow arrays and manages the mapping from logical fields to physical storage keys.

Key Dependencies:

  • llkv-storage - Pager abstraction
  • llkv-expr - Field identifiers
  • arrow - Array serialization

Responsibilities:

  • Column chunk management (serialization/deserialization)
  • LogicalFieldId → PhysicalKey mapping
  • Multi-column gather operations with caching
  • Row visibility filtering
  • Chunk metadata tracking (min/max values)

Primary Types:

  • ColumnStore<P> - Main storage interface
  • LogicalFieldId - Namespaced field identifier
  • MultiGatherContext - Reusable context for multi-column reads
  • GatherNullPolicy - Null handling strategies

Sources: Cargo.toml:12 llkv-column-map/src/store/projection.rs:38-227


llkv-storage

Path: llkv-storage/

Purpose: Storage abstraction layer defining the Pager trait and providing implementations for in-memory and persistent backends.

Key Dependencies:

  • simd-r-drive - SIMD-optimized persistent storage (optional)
  • arrow - Buffer types

Responsibilities:

  • Pager trait definition (batch_get/batch_put)
  • Zero-copy array serialization format
  • MemPager - In-memory HashMap backend
  • SimdRDrivePager - Memory-mapped persistent backend
  • Physical key allocation

Primary Types:

  • Pager - Storage backend trait (batch_get / batch_put)
  • MemPager - Heap-backed in-memory backend
  • SimdRDrivePager - Memory-mapped persistent backend

Sources: Cargo.toml:22 llkv-storage/src/serialization.rs:1-130


llkv-transaction

Path: llkv-transaction/

Purpose: MVCC transaction manager providing snapshot isolation and row visibility determination.

Key Dependencies:

  • llkv-result - Error types

Responsibilities:

  • Transaction ID allocation
  • MVCC snapshot creation
  • Commit watermark tracking
  • Row visibility rules enforcement

Primary Types:

  • TransactionManager
  • Snapshot - Transaction isolation view
  • TxnId - Transaction identifier

Sources: Cargo.toml:25


llkv-result

Path: llkv-result/

Purpose: Common error and result types used throughout the system.

Key Dependencies: None (foundational crate)

Responsibilities:

  • Error enum with all error variants
  • Result<T> type alias
  • Error conversion traits

Sources: Cargo.toml:18


Specialized Operations Crates

llkv-aggregate

Path: llkv-aggregate/

Purpose: Aggregate function evaluation including accumulators and distinct value tracking.

Key Dependencies:

  • arrow - Array operations

Responsibilities:

  • Aggregate function implementations (SUM, AVG, COUNT, MIN, MAX)
  • Accumulator state management
  • DISTINCT value tracking
  • Group-by hash table operations

Sources: Cargo.toml:11


llkv-join

Path: llkv-join/

Purpose: Join algorithm implementations.

Key Dependencies:

  • arrow - RecordBatch operations
  • llkv-expr - Join predicates

Responsibilities:

  • Hash join implementation
  • Nested loop join
  • Join key extraction
  • Result materialization

Sources: Cargo.toml:16


llkv-csv

Path: llkv-csv/

Purpose: CSV file ingestion and export utilities.

Key Dependencies:

  • llkv-table - Table operations
  • arrow - CSV reader integration

Responsibilities:

  • CSV to RecordBatch conversion
  • Bulk insert optimization
  • Schema inference from CSV headers

Sources: Cargo.toml:13


Testing Infrastructure Crates

llkv-test-utils

Path: llkv-test-utils/

Purpose: Shared test utilities including tracing setup and common test fixtures.

Key Dependencies:

  • tracing-subscriber - Logging configuration

Responsibilities:

  • Consistent tracing initialization across tests
  • Common test helpers
  • Auto-initialization feature for convenience

Sources: Cargo.toml:24


llkv-slt-tester

Path: llkv-slt-tester/

Purpose: SQL Logic Test runner providing standardized correctness testing.

Key Dependencies:

  • llkv-sql - SQL execution
  • sqllogictest - Test framework (version 0.28.4)

Responsibilities:

  • .slt file discovery and execution
  • Remote test suite fetching (.slturl files)
  • Test result comparison
  • AsyncDB adapter for LLKV

Primary Types:

  • LlkvSltRunner - Test runner
  • EngineHarness - Adapter interface

Sources: Cargo.toml:20


llkv-tpch

Path: llkv-tpch/

Purpose: TPC-H benchmark suite for performance testing.

Key Dependencies:

  • llkv - Database interface
  • llkv-sql - SQL execution
  • tpchgen - Data generation (version 2.0.1)

Responsibilities:

  • TPC-H data generation at various scale factors
  • Query execution (Q1-Q22)
  • Performance measurement
  • Benchmark result reporting

Sources: Cargo.toml:62


Demonstration Applications

llkv-sql-pong-demo

Path: demos/llkv-sql-pong-demo/

Purpose: Interactive demonstration showing LLKV's SQL capabilities through a Pong game implemented in SQL.

Key Dependencies:

  • llkv-sql - SQL execution
  • crossterm - Terminal UI (version 0.29.0)

Responsibilities:

  • Terminal-based interactive interface
  • Real-time SQL query execution
  • Game state management via SQL tables
  • User input handling

Sources: Cargo.toml:86


Crate Dependency Graph

The following diagram shows the direct dependencies between workspace crates. Arrows point from dependent crates to their dependencies.

Crate Dependencies:

Sources: Cargo.toml:9-25 llkv-table/Cargo.toml:14-31 llkv-plan/Cargo.toml:14-24

graph LR
    LLKV["llkv"]
    SQL["llkv-sql"]
    PLAN["llkv-plan"]
    EXPR["llkv-expr"]
    RUNTIME["llkv-runtime"]
    EXECUTOR["llkv-executor"]
    TABLE["llkv-table"]
    COLMAP["llkv-column-map"]
    STORAGE["llkv-storage"]
    TXN["llkv-transaction"]
    RESULT["llkv-result"]
    AGG["llkv-aggregate"]
    JOIN["llkv-join"]
    CSV["llkv-csv"]
    SLT["llkv-slt-tester"]
    TESTUTIL["llkv-test-utils"]
    TPCH["llkv-tpch"]
    DEMO["llkv-sql-pong-demo"]

    LLKV --> SQL
    LLKV --> RUNTIME

    SQL --> PLAN
    SQL --> EXPR
    SQL --> RUNTIME
    SQL --> EXECUTOR
    SQL --> TABLE
    SQL --> TXN

    RUNTIME --> EXECUTOR
    RUNTIME --> TABLE
    RUNTIME --> TXN

    EXECUTOR --> PLAN
    EXECUTOR --> EXPR
    EXECUTOR --> TABLE
    EXECUTOR --> AGG
    EXECUTOR --> JOIN

    TABLE --> COLMAP
    TABLE --> EXPR
    TABLE --> PLAN
    TABLE --> STORAGE

    COLMAP --> STORAGE
    COLMAP --> EXPR

    PLAN --> EXPR
    PLAN --> RESULT

    CSV --> TABLE

    TXN --> RESULT
    STORAGE --> RESULT
    EXPR --> RESULT
    COLMAP --> RESULT
    TABLE --> RESULT

    SLT --> SQL
    SLT --> RUNTIME
    SLT --> TESTUTIL

    TPCH --> LLKV
    TPCH --> SQL

    DEMO --> SQL

Key Observations:

  1. llkv-result is a foundational crate with no internal dependencies, consumed by nearly all other crates for error handling.

  2. llkv-expr depends only on llkv-result, making it a stable base for expression handling across the system.

  3. llkv-plan builds on llkv-expr and adds plan-specific structures.

  4. llkv-storage and llkv-transaction are independent of each other, allowing flexibility in storage backend selection.

  5. llkv-table integrates storage, expressions, and planning to provide a cohesive data layer.

  6. llkv-executor coordinates specialized operations (aggregate, join) and table access.

  7. llkv-runtime sits at the top of the execution stack, orchestrating transactions and query execution.

  8. llkv-sql ties together all layers to provide the SQL interface.


Mapping Crates to System Layers

This diagram shows how workspace crates map to the architectural layers described in Architecture.

Layered Architecture Mapping:

Sources: Cargo.toml:67-88


External Dependencies

The workspace declares several critical external dependencies that enable core functionality.

Apache Arrow Ecosystem

Version: 57.0.0

Crates:

  • arrow - Core Arrow functionality with prettyprint and IPC features
  • arrow-array - Array implementations
  • arrow-schema - Schema types
  • arrow-buffer - Buffer management
  • arrow-ord - Ordering operations

Usage: Arrow provides the universal columnar data format throughout LLKV. RecordBatch is used as the data interchange format at every layer, enabling zero-copy operations and SIMD-friendly processing.

Sources: Cargo.toml:32-36


SQL Parsing

Crate: sqlparser
Version: 0.59.0

Usage: Parses SQL text into AST nodes. Used by llkv-sql and llkv-plan to convert SQL queries into typed plan structures.

Sources: Cargo.toml:52


SIMD-Optimized Storage

Crate: simd-r-drive
Version: 0.15.5-alpha

Usage: Provides memory-mapped, SIMD-accelerated persistent storage backend. The SimdRDrivePager implementation in llkv-storage uses this for zero-copy array access.

Related: simd-r-drive-entry-handle for Arrow buffer integration

Sources: Cargo.toml:26-27


Testing and Benchmarking

Key Dependencies:

| Crate | Version | Purpose |
|-------|---------|---------|
| criterion | 0.7.0 | Performance benchmarking |
| sqllogictest | 0.28.4 | SQL correctness testing |
| tpchgen | 2.0.1 | TPC-H data generation |
| libtest-mimic | 0.8 | Custom test harness |

Sources: Cargo.toml:40-62


Utilities

Key Dependencies:

| Crate | Version | Purpose |
|-------|---------|---------|
| rayon | 1.10.0 | Data parallelism |
| rustc-hash | 2.1.1 | Fast hash maps |
| bitcode | 0.6.7 | Binary serialization |
| thiserror | 2.0.17 | Error trait derivation |
| serde | 1.0.228 | Serialization framework |

Sources: Cargo.toml:37-64


Workspace Configuration

The workspace is configured with shared package metadata and dependency versions to ensure consistency across all crates.

Shared Package Metadata:

Build Settings:

  • Edition: 2024 (Rust 2024 edition)
  • Resolver: Version 2 (new dependency resolver)
  • Version: 0.8.2-alpha (all crates share this version)

Sources: Cargo.toml:1-8 Cargo.toml:88


Summary Table

| Crate | Layer | Primary Responsibility | Key Dependencies |
|-------|-------|------------------------|------------------|
| llkv | Entry Point | Main library API | llkv-sql, llkv-runtime |
| llkv-sql | SQL Processing | SQL parsing and execution | llkv-plan, llkv-runtime, sqlparser |
| llkv-plan | SQL Processing | Query plan structures | llkv-expr, sqlparser |
| llkv-expr | SQL Processing | Expression AST | arrow |
| llkv-runtime | Execution | Transaction orchestration | llkv-executor, llkv-table |
| llkv-executor | Execution | Query execution | llkv-table, llkv-aggregate |
| llkv-table | Data Management | Schema-aware tables | llkv-column-map, llkv-storage |
| llkv-column-map | Data Management | Columnar storage | llkv-storage, arrow |
| llkv-storage | Storage | Storage abstraction | simd-r-drive (optional) |
| llkv-transaction | Data Management | MVCC manager | - |
| llkv-aggregate | Specialized Ops | Aggregation functions | arrow |
| llkv-join | Specialized Ops | Join algorithms | arrow |
| llkv-csv | Specialized Ops | CSV import/export | llkv-table |
| llkv-result | Foundation | Error types | - |
| llkv-test-utils | Testing | Test utilities | tracing-subscriber |
| llkv-slt-tester | Testing | SQL logic tests | llkv-sql, sqllogictest |
| llkv-tpch | Testing | TPC-H benchmarks | llkv-sql, tpchgen |
| llkv-sql-pong-demo | Demo | Interactive demo | llkv-sql, crossterm |

Sources: Cargo.toml:1-89



SQL Query Processing Pipeline


Purpose and Scope

This document describes the end-to-end SQL query processing pipeline in LLKV, from raw SQL text input to final results. It covers the five major stages: SQL preprocessing, parsing, plan translation, execution coordination, and result formatting.

For information about the specific plan structures created during translation, see Plan Structures. For details on how plans are executed to produce results, see Query Execution. For the user-facing SqlEngine API, see SqlEngine API.

Overview

The SQL query processing pipeline transforms user-provided SQL text into Arrow RecordBatch results through a series of well-defined stages. The primary entry point is SqlEngine::execute(), which orchestrates the entire flow while maintaining transaction boundaries and handling cross-statement optimizations like INSERT buffering.

Sources: llkv-sql/src/sql_engine.rs:933-998

graph TB
    SQL["Raw SQL Text"]
PREPROCESS["Stage 1: SQL Preprocessing\nDialect Normalization"]
PARSE["Stage 2: Parsing\nsqlparser → AST"]
TRANSLATE["Stage 3: Plan Translation\nAST → PlanStatement"]
EXECUTE["Stage 4: Plan Execution\nRuntimeEngine"]
RESULTS["Stage 5: Result Formatting\nRuntimeStatementResult"]
SQL --> PREPROCESS
 
   PREPROCESS --> PARSE
 
   PARSE --> TRANSLATE
 
   TRANSLATE --> EXECUTE
 
   EXECUTE --> RESULTS
    
 
   PREPROCESS --> |preprocess_sql_input| PREPROCESS_IMPL["• preprocess_tpch_connect_syntax()\n• preprocess_create_type_syntax()\n• preprocess_exclude_syntax()\n• preprocess_trailing_commas_in_values()\n• preprocess_empty_in_lists()\n• preprocess_index_hints()\n• preprocess_reindex_syntax()\n• preprocess_bare_table_in_clauses()"]
PARSE --> |parse_sql_with_recursion_limit| PARSER["sqlparser::Parser\nPARSER_RECURSION_LIMIT = 200"]
TRANSLATE --> |translate_statement| PLANNER["• SelectPlan\n• InsertPlan\n• UpdatePlan\n• DeletePlan\n• CreateTablePlan\n• DDL Plans"]
EXECUTE --> |engine.execute_statement| RUNTIME["RuntimeEngine\n• Transaction Management\n• MVCC Snapshots\n• Catalog Operations"]
style PREPROCESS fill:#f9f9f9
    style PARSE fill:#f9f9f9
    style TRANSLATE fill:#f9f9f9
    style EXECUTE fill:#f9f9f9
    style RESULTS fill:#f9f9f9

Stage 1: SQL Preprocessing

Before parsing, the SQL text undergoes a series of preprocessing transformations to normalize dialect-specific syntax. This allows LLKV to accept SQL written for SQLite, DuckDB, and other dialects while presenting a consistent AST to the planner.

Preprocessing Transformations

| Preprocessor Method | Purpose | Example Transformation |
|---------------------|---------|------------------------|
| preprocess_tpch_connect_syntax() | Strip TPC-H CONNECT TO directives | CONNECT TO tpch; → (removed) |
| preprocess_create_type_syntax() | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE name AS INT → CREATE DOMAIN name AS INT |
| preprocess_exclude_syntax() | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
| preprocess_trailing_commas_in_values() | Remove trailing commas in VALUES | VALUES (1, 2,) → VALUES (1, 2) |
| preprocess_empty_in_lists() | Expand empty IN predicates to constant booleans | x IN () → (x = NULL AND 0 = 1) |
| preprocess_index_hints() | Strip SQLite INDEXED BY hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax() | Convert REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_bare_table_in_clauses() | Expand IN tablename to subquery | x IN t → x IN (SELECT * FROM t) |

Sources: llkv-sql/src/sql_engine.rs:623-873 llkv-sql/src/tpch.rs:1-17
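
To illustrate the flavor of these rewrites, a naive version of the trailing-comma cleanup might look like the sketch below, using the regex crate; the real preprocess_trailing_commas_in_values() is more careful about quoted strings, comments, and nested parentheses:

```rust
// Naive illustration of one dialect rewrite (trailing commas before ')');
// not the actual LLKV implementation.
fn strip_trailing_commas(sql: &str) -> String {
    let re = regex::Regex::new(r",\s*\)").expect("valid regex");
    re.replace_all(sql, ")").into_owned()
}

// strip_trailing_commas("INSERT INTO t VALUES (1, 2,)")
//   => "INSERT INTO t VALUES (1, 2)"
```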

Fallback Trigger Preprocessing

If parsing fails and the SQL contains CREATE TRIGGER, the engine applies an additional preprocess_sqlite_trigger_shorthand() transformation and retries. This handles SQLite's optional BEFORE/AFTER timing and FOR EACH ROW clauses by injecting defaults that sqlparser expects.

Sources: llkv-sql/src/sql_engine.rs:941-957 llkv-sql/src/sql_engine.rs:771-842

Stage 2: Parsing

Parsing is delegated to the sqlparser crate, which produces a Vec<Statement> AST. LLKV configures the parser with:

  • Dialect: GenericDialect to accept a wide range of SQL syntax
  • Recursion Limit: PARSER_RECURSION_LIMIT = 200 (raised from sqlparser's default of 50 to handle deeply nested queries in test suites)

The parse_sql_with_recursion_limit() helper function wraps sqlparser's API to apply this custom limit.

Sources: llkv-sql/src/sql_engine.rs:317-324 llkv-sql/src/sql_engine.rs:939-957
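
A sketch of the equivalent parser setup using sqlparser's builder-style API is shown below; it assumes the pinned sqlparser version exposes Parser::with_recursion_limit:

```rust
// Sketch of parsing with a raised recursion limit; assumes the builder
// methods shown here are available in the pinned sqlparser version.
use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

const PARSER_RECURSION_LIMIT: usize = 200;

fn parse(sql: &str) -> Result<Vec<Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(PARSER_RECURSION_LIMIT)
        .try_with_sql(sql)?
        .parse_statements()
}
```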

Stage 3: Plan Translation

Each parsed Statement is translated into a strongly-typed PlanStatement that the runtime can execute. This translation happens through statement-specific methods in SqlEngine.

graph TB
    AST["sqlparser::ast::Statement"]
SELECT["Statement::Query"]
INSERT["Statement::Insert"]
UPDATE["Statement::Update"]
DELETE["Statement::Delete"]
CREATE["Statement::CreateTable"]
DROP["Statement::Drop"]
TRANSACTION["Statement::StartTransaction\nStatement::Commit\nStatement::Rollback"]
ALTER["Statement::AlterTable"]
OTHER["Other DDL/DML"]
SELECT_PLAN["translate_query()\n→ SelectPlan"]
INSERT_PLAN["buffer_insert()\n→ InsertPlan or buffered"]
UPDATE_PLAN["translate_update()\n→ UpdatePlan"]
DELETE_PLAN["translate_delete()\n→ DeletePlan"]
CREATE_PLAN["translate_create_table()\n→ CreateTablePlan"]
DROP_PLAN["translate_drop()\n→ PlanStatement::Drop*"]
TXN_RUNTIME["Direct runtime delegation\nflush INSERT buffer first"]
ALTER_PLAN["translate_alter_table()\n→ PlanStatement::Alter*"]
OTHER_PLAN["translate_* methods\n→ PlanStatement::*"]
AST --> SELECT
 
   AST --> INSERT
 
   AST --> UPDATE
 
   AST --> DELETE
 
   AST --> CREATE
 
   AST --> DROP
 
   AST --> TRANSACTION
 
   AST --> ALTER
 
   AST --> OTHER
    
 
   SELECT --> SELECT_PLAN
 
   INSERT --> INSERT_PLAN
 
   UPDATE --> UPDATE_PLAN
 
   DELETE --> DELETE_PLAN
 
   CREATE --> CREATE_PLAN
 
   DROP --> DROP_PLAN
 
   TRANSACTION --> TXN_RUNTIME
 
   ALTER --> ALTER_PLAN
 
   OTHER --> OTHER_PLAN
    
 
   SELECT_PLAN --> RUNTIME["RuntimeEngine::execute_statement()"]
INSERT_PLAN --> BUFFER_CHECK{"Buffering\nenabled?"}
BUFFER_CHECK -->|Yes| BUFFER["InsertBuffer\naccumulates rows"]
BUFFER_CHECK -->|No| RUNTIME
 
   UPDATE_PLAN --> RUNTIME
 
   DELETE_PLAN --> RUNTIME
 
   CREATE_PLAN --> RUNTIME
 
   DROP_PLAN --> RUNTIME
 
   TXN_RUNTIME --> RUNTIME
 
   ALTER_PLAN --> RUNTIME
 
   OTHER_PLAN --> RUNTIME
    
 
   BUFFER --> FLUSH_CHECK{"Flush\nneeded?"}
FLUSH_CHECK -->|Yes| FLUSH["Flush buffered rows"]
FLUSH_CHECK -->|No| CONTINUE["Continue buffering"]
FLUSH --> RUNTIME

Statement Routing

Sources: llkv-sql/src/sql_engine.rs:960-998

Translation Process

The translation process involves:

  1. Column Resolution: Identifier strings are resolved to FieldId references using the runtime's catalog
  2. Expression Translation: SQL expressions are converted to Expr<String>, then resolved to Expr<FieldId>
  3. Subquery Handling: Correlated subqueries are tracked with placeholder generation
  4. Parameter Binding: SQL placeholders (?, $1, :name) are mapped to parameter indices

Sources: llkv-sql/src/sql_engine.rs:1000-5000 (various translate_* methods)

sequenceDiagram
    participant SqlEngine
    participant Catalog as "RuntimeContext\nCatalog"
    participant ExprTranslator
    participant PlanBuilder
    
    SqlEngine->>Catalog: resolve_table("users")
    Catalog-->>SqlEngine: TableId(namespace=0, table=1)
    
    SqlEngine->>Catalog: resolve_column("id", TableId)
    Catalog-->>SqlEngine: ColumnResolution(FieldId)
    
    SqlEngine->>ExprTranslator: translate_expr(sqlparser::Expr)
    ExprTranslator->>ExprTranslator: Build Expr<String>
    ExprTranslator->>Catalog: resolve_identifiers()
    Catalog-->>ExprTranslator: Expr<FieldId>
    ExprTranslator-->>SqlEngine: Expr<FieldId>
    
    SqlEngine->>PlanBuilder: Create SelectPlan
    Note over PlanBuilder: Attach projections,\nfilters, sorts, limits
    PlanBuilder-->>SqlEngine: PlanStatement::Select(SelectPlan)

Stage 4: Plan Execution

Once a PlanStatement is constructed, it is passed to RuntimeEngine::execute_statement() for execution. The runtime coordinates:

  • Transaction Management: Ensures each statement executes within a transaction snapshot
  • MVCC Enforcement: Filters rows based on visibility rules
  • Catalog Operations: Updates system catalog for DDL statements
  • Executor Invocation: Delegates SelectPlan execution to llkv-executor

Execution Routing by Statement Type

Sources: llkv-sql/src/sql_engine.rs:587-609 llkv-runtime/ (RuntimeEngine implementation)

Stage 5: Result Formatting

The runtime returns a RuntimeStatementResult enum that represents the outcome of statement execution. SqlEngine surfaces this directly to callers via the execute() method, or converts it to Vec<RecordBatch> for the sql() convenience method.

Result Types

| Statement Type | Result Variant | Contents |
|----------------|----------------|----------|
| SELECT | RuntimeStatementResult::Select | Vec<RecordBatch> of query results |
| INSERT | RuntimeStatementResult::Insert | rows_inserted: u64 |
| UPDATE | RuntimeStatementResult::Update | rows_updated: u64 |
| DELETE | RuntimeStatementResult::Delete | rows_deleted: u64 |
| CREATE TABLE | RuntimeStatementResult::CreateTable | table_name: String |
| CREATE INDEX | RuntimeStatementResult::CreateIndex | index_name: String |
| DROP TABLE | RuntimeStatementResult::DropTable | table_name: String |
| Transaction control | RuntimeStatementResult::Transaction | Transaction state |

Sources: llkv-runtime/ (RuntimeStatementResult definition), llkv-sql/src/sql_engine.rs:933-998

Prepared Statements and Parameters

LLKV supports parameterized queries through a prepared statement mechanism that handles three parameter syntaxes:

  • Positional (numbered): ?, ?1, ?2, $1, $2
  • Named: :name, :id
  • Auto-incremented: Sequential ? placeholders

Parameter Processing Pipeline

Sources: llkv-sql/src/sql_engine.rs:71-206 llkv-sql/src/sql_engine.rs:278-297

Parameter Substitution

During plan execution with parameters, the engine performs a second pass to replace sentinel strings (__llkv_param__N__) with the actual Literal values provided by the caller. This two-phase approach allows the same PlanStatement to be reused across multiple executions with different parameter values.

Sources: llkv-sql/src/sql_engine.rs:194-206

INSERT Buffering

SqlEngine includes an optional INSERT buffering optimization that batches consecutive INSERT ... VALUES statements targeting the same table. This is disabled by default but can be enabled with set_insert_buffering(true) for bulk ingestion workloads.

stateDiagram-v2
    [*] --> NoBuffer : buffering disabled
    NoBuffer --> NoBuffer : INSERT → immediate execute
    
    [*] --> BufferEmpty : buffering enabled
    BufferEmpty --> BufferActive : INSERT(table, cols, rows)
    
    BufferActive --> BufferActive : INSERT(same table/cols) accumulate rows
    BufferActive --> Flush : Different table
    BufferActive --> Flush : Different columns
    BufferActive --> Flush : Different conflict action
    BufferActive --> Flush : Row count ≥ MAX_BUFFERED_INSERT_ROWS
    BufferActive --> Flush : Non-INSERT statement
    BufferActive --> Flush : Transaction boundary
    
    Flush --> RuntimeExecution : Create InsertPlan from accumulated rows
    RuntimeExecution --> BufferEmpty : Reset buffer
    
    NoBuffer --> [*]
    BufferEmpty --> [*]

Buffering Logic

Sources: llkv-sql/src/sql_engine.rs:410-495 llkv-sql/src/sql_engine.rs:887-905

Buffer Flush Triggers

| Trigger Condition | Rationale |
|-------------------|-----------|
| Row count ≥ MAX_BUFFERED_INSERT_ROWS (8192) | Limit memory usage |
| Target table changes | Cannot batch cross-table INSERTs |
| Column list changes | Schema mismatch |
| Conflict action changes | ON CONFLICT semantics differ |
| Non-INSERT statement encountered | Preserve statement ordering |
| Transaction boundary (BEGIN, COMMIT, ROLLBACK) | Ensure transactional consistency |
| Explicit flush_pending_inserts() call | Manual control |
| Statement expectation hint (testing) | Test harness needs per-statement results |

Sources: llkv-sql/src/sql_engine.rs:410-495

Error Handling and Table Mapping

The pipeline includes special error handling for table-not-found scenarios. When the runtime returns Error::NotFound or catalog-related errors, SqlEngine::execute_plan_statement() rewrites them to user-friendly messages like "Catalog Error: Table 'users' does not exist".

This mapping is skipped for CREATE VIEW and DROP VIEW statements where the "table" name refers to the view being created/dropped rather than a referenced table.

Sources: llkv-sql/src/sql_engine.rs:558-609



Data Formats and Arrow Integration


Purpose and Scope

This page documents how Apache Arrow columnar data structures serve as the universal data interchange format throughout LLKV. It covers the supported Arrow data types, the custom zero-copy serialization format used for persistence, and how RecordBatch flows between layers.

For information about how expressions evaluate over Arrow data, see Expression System. For details on storage pager abstractions, see Pager Interface and SIMD Optimization.


Arrow as the Universal Data Format

LLKV is Arrow-native: every layer of the system produces, consumes, and operates on arrow::record_batch::RecordBatch structures. This design choice enables:

  • Zero-copy data access across layer boundaries
  • SIMD-friendly vectorized operations via contiguous columnar buffers
  • Unified type system from SQL parsing through storage
  • Efficient interoperability with external Arrow-compatible tools

The following table maps system layers to their Arrow usage:

| Layer | Arrow Usage |
|-------|-------------|
| llkv-sql | Parses SQL and returns RecordBatch query results to callers |
| llkv-plan | Constructs plans that reference Arrow DataType and Field structures |
| llkv-executor | Streams RecordBatch results during SELECT evaluation |
| llkv-table | Validates incoming batches against table schemas and persists columns |
| llkv-column-map | Chunks RecordBatch columns into serialized Arrow arrays for pager storage |
| llkv-storage | Serializes/deserializes Arrow arrays with a custom zero-copy format |

Sources: Cargo.toml:32 llkv-table/README.md:10-11 llkv-column-map/README.md:10 llkv-storage/README.md:10


RecordBatch Flow Diagram

Key Data Flow Observations:

  1. INSERT Path: RecordBatch → schema validation → MVCC column injection → column chunking → serialization → pager write
  2. SELECT Path: Pager read → deserialization → column gather → RecordBatch construction → streaming to executor
  3. Zero-Copy: EntryHandle wraps memory-mapped regions that Arrow arrays reference directly without copying

Sources: llkv-column-map/src/store/projection.rs:240-446 llkv-storage/src/serialization.rs:226-254 llkv-table/README.md:19-25


Supported Arrow Data Types

LLKV supports the following Arrow primitive and complex types:

Primitive Types

| Arrow DataType | Storage Size | SQL Type Mapping |
|----------------|--------------|------------------|
| UInt8 | 1 byte | TINYINT UNSIGNED |
| UInt16 | 2 bytes | SMALLINT UNSIGNED |
| UInt32 | 4 bytes | INT UNSIGNED |
| UInt64 | 8 bytes | BIGINT UNSIGNED |
| Int8 | 1 byte | TINYINT |
| Int16 | 2 bytes | SMALLINT |
| Int32 | 4 bytes | INT |
| Int64 | 8 bytes | BIGINT |
| Float32 | 4 bytes | REAL, FLOAT |
| Float64 | 8 bytes | DOUBLE PRECISION |
| Boolean | 1 bit (packed) | BOOLEAN |
| Date32 | 4 bytes | DATE |
| Date64 | 8 bytes | TIMESTAMP |
| Decimal128(p, s) | 16 bytes | DECIMAL(p, s) |

Variable-Length Types

| Arrow DataType | Storage Layout | SQL Type Mapping |
|----------------|----------------|------------------|
| Utf8 | i32 offsets + UTF-8 bytes | VARCHAR, TEXT |
| LargeUtf8 | i64 offsets + UTF-8 bytes | TEXT (large) |
| Binary | i32 offsets + raw bytes | VARBINARY, BLOB |
| LargeBinary | i64 offsets + raw bytes | BLOB (large) |

Complex Types

| Arrow DataType | Description | Use Cases |
|----------------|-------------|-----------|
| Struct(fields) | Nested record with named fields | Composite values, JSON-like data |
| FixedSizeList(T, n) | Fixed-length array of type T | Vector embeddings, coordinate tuples |

Null Handling:
The current serialization format does not yet support null bitmaps. Arrays with null_count() > 0 will return an error during serialization. Null support is planned for future releases.

Sources: llkv-storage/src/serialization.rs:144-165 llkv-storage/src/serialization.rs:199-224 llkv-expr/src/literal.rs:78-94


Custom Serialization Format

Why Not Arrow IPC?

LLKV uses a custom minimal serialization format instead of Arrow's standard IPC (Inter-Process Communication) format for several reasons:

Trade-offs:

graph LR
    subgraph "Arrow IPC Format"
        IPC_SCHEMA["Schema object\nframing metadata"]
IPC_PADDING["Padding alignment\n8/64 byte boundaries"]
IPC_BUFFERS["Buffer pointers\n+ offsets"]
IPC_SIZE["Larger file size\n~20-40% overhead"]
end
    
    subgraph "LLKV Custom Format"
        CUSTOM_HEADER["24-byte header\nfixed size"]
CUSTOM_PAYLOAD["Raw buffer bytes\ncontiguous"]
CUSTOM_ZERO["Zero-copy rebuild\ndirect mmap"]
CUSTOM_SIZE["Minimal size\nno framing"]
end
    
 
   IPC_SCHEMA --> IPC_SIZE
 
   IPC_PADDING --> IPC_SIZE
 
   IPC_BUFFERS --> IPC_SIZE
    
 
   CUSTOM_HEADER --> CUSTOM_SIZE
 
   CUSTOM_PAYLOAD --> CUSTOM_ZERO

| Aspect | Arrow IPC | LLKV Custom Format |
|--------|-----------|--------------------|
| File size | Larger (metadata + padding) | Minimal (24-byte header + payload) |
| Deserialization | Allocates and copies buffers | Zero-copy via EntryHandle |
| Flexibility | Supports all Arrow features | Limited to non-null arrays |
| Scan performance | Moderate (copy overhead) | Fast (direct SIMD access) |
| Null support | Full bitmap support | Not yet implemented |

Design Goals:

  1. Minimal headers: 24-byte fixed header, no schema objects per array
  2. Predictable payloads: contiguous buffers for mmap-friendly access
  3. True zero-copy: reconstruct ArrayData referencing the original buffer directly
  4. Stable on-disk codes: type tags are compile-time pinned to prevent corruption

Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135


Serialization Format Details

Header Structure

Every serialized array begins with a 24-byte header:

Offset  Size  Field
------  ----  -----
0-3     4     Magic bytes: b"ARR0"
4       1     Layout code (Primitive=0, FslFloat32=1, Varlen=2, Struct=3)
5       1     Type code (PrimType enum value)
6       1     Precision (for Decimal128) or padding
7       1     Scale (for Decimal128) or padding
8-15    8     Array length (u64, element count)
16-19   4     extra_a (layout-specific u32)
20-23   4     extra_b (layout-specific u32)
24+     var   Payload (layout-specific buffer bytes)
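
A sketch of decoding this header from a raw byte slice follows; offsets match the table above, and little-endian encoding of the multi-byte fields is an assumption for illustration:

```rust
// Illustrative header decoder; field offsets follow the layout table above,
// little-endian byte order is assumed, and error handling is condensed.
struct ArrayHeader {
    layout: u8,
    type_code: u8,
    precision: u8,
    scale: u8,
    len: u64,
    extra_a: u32,
    extra_b: u32,
}

fn parse_header(bytes: &[u8]) -> Option<ArrayHeader> {
    if bytes.len() < 24 || &bytes[0..4] != b"ARR0" {
        return None; // too short or wrong magic
    }
    Some(ArrayHeader {
        layout: bytes[4],
        type_code: bytes[5],
        precision: bytes[6],
        scale: bytes[7],
        len: u64::from_le_bytes(bytes[8..16].try_into().ok()?),
        extra_a: u32::from_le_bytes(bytes[16..20].try_into().ok()?),
        extra_b: u32::from_le_bytes(bytes[20..24].try_into().ok()?),
    })
}
```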

Layout Variants

Primitive Layout

For fixed-width types (Int32, Float64, etc.):

  • extra_a: Length of values buffer in bytes
  • extra_b: Unused (0)
  • Payload: Raw values buffer

Varlen Layout

For variable-length types (Utf8, Binary, etc.):

  • extra_a: Length of offsets buffer in bytes
  • extra_b: Length of values buffer in bytes
  • Payload: Offsets buffer followed by values buffer

FixedSizeList Layout

Special optimization for vector embeddings:

  • extra_a: List size (elements per list)
  • extra_b: Total child buffer length in bytes
  • Payload: Contiguous child Float32 buffer

This enables direct SIMD access to embedding vectors without indirection.

Struct Layout

For nested composite types:

  • extra_a: Unused (0)
  • extra_b: IPC payload length in bytes
  • Payload: Arrow IPC-serialized struct array

Struct types fall back to Arrow IPC format because their complex nested structure doesn't benefit from the custom layout.

Sources: llkv-storage/src/serialization.rs:44-135 llkv-storage/src/serialization.rs:256-378


graph TB
    PAGER["Pager::batch_get"]
HANDLE["EntryHandle\nmemory-mapped region"]
BUFFER["Arrow Buffer\nslice of EntryHandle"]
ARRAYDATA["ArrayData\nreferences Buffer"]
ARRAY["Concrete Array\nInt32Array, etc."]
PAGER --> HANDLE
 
   HANDLE --> BUFFER
 
   BUFFER --> ARRAYDATA
 
   ARRAYDATA --> ARRAY
    
    style HANDLE fill:#f9f9f9
    style BUFFER fill:#f9f9f9

Zero-Copy Deserialization

EntryHandle Integration

The EntryHandle type from simd-r-drive-entry-handle provides a zero-copy wrapper around memory-mapped buffers:

Key Operations:

  1. Pager read : Returns GetResult::Raw { key, bytes: EntryHandle }
  2. Buffer slice : EntryHandle::as_arrow_buffer() creates an Arrow Buffer view
  3. ArrayData build : ArrayData::builder().add_buffer(buffer).build()
  4. Array cast : make_array(data) produces typed arrays

The entire chain avoids copying data — Arrow arrays directly reference the memory-mapped region.
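
The shape of that chain can be sketched with stock arrow-rs APIs. In the sketch below an ordinary Vec<u8> stands in for the memory-mapped EntryHandle payload, and the data type and length are chosen arbitrarily; only the builder calls mirror the key operations listed above.

```rust
use arrow::array::{make_array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

fn rebuild_int32(payload: Vec<u8>, len: usize) -> Result<ArrayRef, ArrowError> {
    // In LLKV the Buffer comes from EntryHandle::as_arrow_buffer() over a
    // memory-mapped region; a plain Vec<u8> keeps this sketch self-contained.
    let buffer = Buffer::from(payload);
    let data = ArrayData::builder(DataType::Int32)
        .len(len)
        .add_buffer(buffer)
        .build()?;
    // The resulting ArrayRef can be downcast to Int32Array by the caller.
    Ok(make_array(data))
}
```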

Alignment Requirements:

Decimal128 requires 16-byte alignment. If the EntryHandle buffer is not properly aligned, the deserializer copies it to an aligned buffer:
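
A hedged sketch of what such a check can look like is shown below; the 16-byte requirement comes from the text above, but the helper itself is illustrative rather than LLKV's actual code.

```rust
use arrow::buffer::{Buffer, MutableBuffer};

// Returns the original buffer when it is already 16-byte aligned; otherwise
// copies the bytes into a MutableBuffer, which uses Arrow's aligned allocator.
fn ensure_16_byte_alignment(buffer: Buffer) -> Buffer {
    if buffer.as_ptr() as usize % 16 == 0 {
        buffer
    } else {
        let mut copy = MutableBuffer::new(buffer.len());
        copy.extend_from_slice(buffer.as_slice());
        copy.into()
    }
}
```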

Sources: llkv-storage/src/serialization.rs:429-559 llkv-column-map/src/store/projection.rs:619-629


graph TB
    ROWIDS["row_ids: &[u64]\nrequested rows"]
FIELDIDS["field_ids: &[LogicalFieldId]\nrequested columns"]
PLANS["FieldPlan\nper-column metadata"]
CHUNKS["Chunk selection\ncandidate_indices"]
BATCH_GET["Pager::batch_get\nfetch chunks"]
CACHE["Chunk cache\nArrayRef map"]
GATHER["gather_rows_from_chunks\nper-type specialization"]
ARRAYS["Vec<ArrayRef>\none per field"]
SCHEMA["Arrow Schema\nField metadata"]
RECORDBATCH["RecordBatch::try_new"]
ROWIDS --> PLANS
 
   FIELDIDS --> PLANS
 
   PLANS --> CHUNKS
 
   CHUNKS --> BATCH_GET
 
   BATCH_GET --> CACHE
 
   CACHE --> GATHER
 
   GATHER --> ARRAYS
 
   ARRAYS --> RECORDBATCH
 
   SCHEMA --> RECORDBATCH

RecordBatch Construction and Projection

Gather Operations

The ColumnStore::gather_rows family of methods reconstructs RecordBatch from chunked columns:

Projection Flow:

  1. Prepare context : Load column descriptors, determine chunk candidates
  2. Batch fetch : Request all needed chunks from pager in one call
  3. Type-specific gather : Dispatch to specialized routines based on DataType
  4. Null policy : Apply GatherNullPolicy (ErrorOnMissing, IncludeNulls, DropNulls)
  5. Schema construction : Build Schema with correct field names and nullability
  6. RecordBatch assembly : RecordBatch::try_new(schema, arrays)

Type Dispatch Table:

| DataType | Gather Function |
|---|---|
| Utf8 | gather_rows_from_chunks_string::<i32> |
| LargeUtf8 | gather_rows_from_chunks_string::<i64> |
| Binary | gather_rows_from_chunks_binary::<i32> |
| LargeBinary | gather_rows_from_chunks_binary::<i64> |
| Boolean | gather_rows_from_chunks_bool |
| Struct(_) | gather_rows_from_chunks_struct |
| Decimal128(_, _) | gather_rows_from_chunks_decimal128 |
| Primitives | gather_rows_from_chunks::<ArrowTy> (generic) |

Sources: llkv-column-map/src/store/projection.rs:245-446 llkv-column-map/src/store/projection.rs:636-726


graph TB
    BATCH["RecordBatch\nuser data"]
TABLE_SCHEMA["Stored Schema\nfrom catalog"]
VALIDATE["Schema validation"]
FIELD_CHECK["Field count\nname\ntype match"]
MVCC["Inject MVCC columns\nrow_id, created_by,\ndeleted_by"]
EXTENDED["Extended RecordBatch"]
COLMAP["ColumnStore::append"]
BATCH --> VALIDATE
 
   TABLE_SCHEMA --> VALIDATE
 
   VALIDATE --> FIELD_CHECK
 
   FIELD_CHECK -->|Pass| MVCC
 
   FIELD_CHECK -->|Fail| ERROR["Error::SchemaMismatch"]
MVCC --> EXTENDED
 
   EXTENDED --> COLMAP

Schema Validation

Table-Level Schema Enforcement

The Table layer validates incoming RecordBatch schemas against the stored table schema before appending:

Validation Rules:

  1. Field count : Batch must have exactly the same number of columns as the table schema
  2. Field names : Column names must match (case-sensitive)
  3. Field types : DataType must match exactly (no implicit coercion)
  4. Nullability : Currently not strictly enforced (planned improvement)

MVCC Column Injection:

After validation, the table appends three system columns:

  • row_id (UInt64): Unique row identifier
  • created_by (UInt64): Transaction ID that created the row
  • deleted_by (UInt64): Transaction ID that deleted the row (0 if active)

These columns are stored in separate logical namespaces but physically alongside user data.
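
A minimal sketch of what that injection looks like with stock Arrow APIs is shown below. The helper name, the hard-coded transaction ID and row-id base, and the non-nullable field flags are assumptions for illustration; the real logic lives inside llkv-table and llkv-runtime.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn with_mvcc_columns(
    user: &RecordBatch,
    txn_id: u64,
    next_row_id: u64,
) -> Result<RecordBatch, ArrowError> {
    let rows = user.num_rows();
    let mut fields: Vec<Field> =
        user.schema().fields().iter().map(|f| f.as_ref().clone()).collect();
    let mut columns: Vec<ArrayRef> = user.columns().to_vec();

    // row_id: unique identifier per row.
    fields.push(Field::new("row_id", DataType::UInt64, false));
    columns.push(Arc::new(UInt64Array::from_iter_values(
        (0..rows as u64).map(|i| next_row_id + i),
    )));
    // created_by: transaction that created the row.
    fields.push(Field::new("created_by", DataType::UInt64, false));
    columns.push(Arc::new(UInt64Array::from(vec![txn_id; rows])));
    // deleted_by: 0 means the row is still active.
    fields.push(Field::new("deleted_by", DataType::UInt64, false));
    columns.push(Arc::new(UInt64Array::from(vec![0u64; rows])));

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```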

Sources: llkv-table/README.md:14-17 llkv-column-map/README.md:26-28


graph LR
    SQL_LITERAL["SQL Literal\n123, 'text', etc."]
LITERAL["Literal enum\nInteger, String, etc."]
SCHEMA["Table Schema\nArrow DataType"]
NATIVE["Native Value\ni32, String, etc."]
ARRAY["Arrow Array\nInt32Array, etc."]
SQL_LITERAL --> LITERAL
 
   LITERAL --> SCHEMA
 
   SCHEMA --> NATIVE
 
   NATIVE --> ARRAY

Type Mapping from SQL to Arrow

Literal Conversion

The llkv-expr crate defines a Literal enum that captures untyped SQL values before schema resolution:

Supported Literal Types:

| Literal Variant | Arrow Target Types |
|---|---|
| Integer(i128) | Any integer or float type (with range checks) |
| Float(f64) | Float32, Float64 |
| Decimal(DecimalValue) | Decimal128(p, s) |
| String(String) | Utf8, LargeUtf8 |
| Boolean(bool) | Boolean |
| Date32(i32) | Date32 |
| Struct(fields) | Struct(...) |
| Interval(IntervalValue) | Not directly stored; used for date arithmetic |

Conversion Mechanism:

The FromLiteral trait provides type-aware conversion:

Implementations perform range checking and type validation:
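
A simplified, hypothetical rendering of that idea is shown below. The real FromLiteral trait and Literal enum in llkv-expr differ in detail; only the range-checking behaviour described above is carried over.

```rust
// Hypothetical, trimmed-down mirrors of the llkv-expr types for illustration.
#[derive(Debug)]
enum Literal {
    Integer(i128),
    Float(f64),
    String(String),
}

trait FromLiteral: Sized {
    fn from_literal(lit: &Literal) -> Result<Self, String>;
}

impl FromLiteral for i32 {
    fn from_literal(lit: &Literal) -> Result<Self, String> {
        match lit {
            // Range check: an i128 SQL literal must fit the target column type.
            Literal::Integer(v) => i32::try_from(*v)
                .map_err(|_| format!("integer {v} out of range for INT")),
            other => Err(format!("cannot convert {other:?} to INT")),
        }
    }
}
```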

Sources: llkv-expr/src/literal.rs:78-94 llkv-expr/src/literal.rs:156-219 llkv-expr/src/literal.rs:395-419


Performance Characteristics

Zero-Copy Benefits

The combination of Arrow's columnar layout and the custom serialization format delivers measurable performance benefits:

| Operation | Traditional DB | LLKV Arrow-Native |
|---|---|---|
| Column scan | Row-by-row decode | Vectorized SIMD over mmap |
| Type dispatch | Virtual function calls | Monomorphized at compile time |
| Buffer management | Multiple allocations | Single mmap region |
| Predicate evaluation | Interpreted per row | Compiled bytecode over vectors |

Chunking Strategy

The ColumnStore organizes data into chunks sized for cache locality and pager efficiency:

  • Target chunk size : Configurable, typically 64KB-256KB per column
  • Row alignment : All columns in a table share the same row boundaries per chunk
  • Append optimization : Incoming batches are chunked and sorted by row_id before persistence

This design minimizes pager I/O and maximizes CPU cache hit rates during scans.

Sources: llkv-column-map/README.md:24-28 llkv-storage/README.md:15-17


Integration with External Tools

Arrow Compatibility

Because LLKV uses standard Arrow data structures at its boundaries, it can integrate with the broader Arrow ecosystem:

  • Export : Query results can be serialized to Arrow IPC files for external processing
  • Import : Arrow IPC files can be read and ingested via Table::append
  • Parquet : Future work could add direct Parquet read/write using Arrow's parquet crate
  • DataFusion : LLKV's table scan APIs could potentially integrate as a DataFusion TableProvider

Current Limitations

  1. Null support : The serialization format doesn't yet handle null bitmaps
  2. Nested types : Only Struct and FixedSizeList<Float32> are fully supported
  3. Dictionary encoding : Not yet implemented (planned)
  4. Compression : No built-in compression (relies on storage-layer features)

Sources: llkv-storage/src/serialization.rs:257-260 llkv-column-map/README.md:10-11


Summary

LLKV's Arrow-native architecture provides:

  • Universal interchange format via RecordBatch across all layers
  • Zero-copy operations through EntryHandle and memory-mapped buffers
  • Custom serialization optimized for mmap and SIMD access patterns
  • Type safety from SQL literals through to persisted columns
  • SIMD-friendly layout for efficient vectorized query evaluation

The trade-off of using a custom format instead of Arrow IPC is reduced flexibility (no nulls yet, fewer complex types) in exchange for smaller files, faster scans, and true zero-copy deserialization.

For details on how Arrow arrays are evaluated during query execution, see Scalar Evaluation and NumericKernels. For information on how MVCC metadata is stored alongside Arrow columns, see Column Storage and ColumnStore.


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SQL Interface

Relevant source files

The SQL Interface layer provides the primary user-facing entry point for executing SQL statements against LLKV. It consists of the llkv-sql crate, which wraps the underlying runtime and provides SQL parsing, preprocessing, statement caching, and result formatting.

This document covers the SqlEngine struct and its methods, SQL preprocessing and dialect normalization, and the INSERT buffering optimization system. For information about query planning after SQL parsing, see Query Planning. For runtime execution, see the Architecture overview.

Sources: llkv-sql/src/lib.rs:1-51 README.md:47-48

Core Components

The SQL Interface layer is centered around three main subsystems:

| Component | Purpose | Key Types |
|---|---|---|
| SqlEngine | Main execution interface | SqlEngine, RuntimeEngine, RuntimeSession |
| Preprocessing | SQL normalization and dialect handling | Various regex-based transformers |
| INSERT Buffering | Batch optimization for literal inserts | InsertBuffer, PreparedInsert |

Sources: llkv-sql/src/sql_engine.rs:365-556

SqlEngine Structure

The SqlEngine wraps a RuntimeEngine instance and adds SQL-specific functionality including statement caching, INSERT buffering, and configurable behavior flags. The insert_buffer field holds accumulated literal INSERT payloads when buffering is enabled.

Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/sql_engine.rs:421-471

SQL Statement Processing Flow

The statement processing flow consists of:

  1. Preprocessing: SQL text undergoes dialect normalization via regex-based transformations
  2. Parsing: sqlparser with increased recursion limit (200 vs default 50) produces AST
  3. Planning: AST nodes are translated to typed PlanStatement structures
  4. Buffering (INSERT only): Literal INSERT statements may be accumulated in InsertBuffer
  5. Execution: Plans are passed to RuntimeEngine for execution
  6. Result collection: RuntimeStatementResult instances are collected and returned

Sources: llkv-sql/src/sql_engine.rs:933-991 llkv-sql/src/sql_engine.rs:318-324

Public API Methods

Core Execution Methods

The SqlEngine exposes two primary execution methods:

| Method | Signature | Purpose | Returns |
|---|---|---|---|
| execute | fn execute(&self, sql: &str) | Execute one or more SQL statements | SqlResult<Vec<RuntimeStatementResult>> |
| sql | fn sql(&self, query: &str) | Execute a single SELECT and return batches | SqlResult<Vec<RecordBatch>> |

The execute method handles arbitrary SQL (DDL, DML, queries) and returns statement results. The sql method is a convenience wrapper that enforces single-SELECT semantics and extracts Arrow batches from the result stream.
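
A hedged usage sketch is shown below. The execute and sql shapes follow the table above, while the import paths, the MemPager::default() constructor, and the error types are assumptions for illustration.

```rust
use std::sync::Arc;
// Import paths and the MemPager constructor are assumptions for this sketch;
// the execute()/sql() call shapes follow the table above.
use llkv_sql::SqlEngine;
use llkv_storage::pager::MemPager;

fn demo() {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // execute() handles arbitrary SQL and returns one result per statement.
    engine.execute("CREATE TABLE users (id INT, name TEXT)").expect("ddl");
    engine.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')").expect("insert");

    // sql() is the single-SELECT convenience wrapper returning Arrow batches.
    let batches = engine.sql("SELECT id, name FROM users ORDER BY id").expect("select");
    for batch in &batches {
        println!("{} rows", batch.num_rows());
    }
}
```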

Sources: llkv-sql/src/sql_engine.rs:921-991 llkv-sql/src/sql_engine.rs:1009-1052

Prepared Statements

Prepared statements support three placeholder syntaxes:

  • Positional: ? (auto-numbered), ?1, $1 (explicit index)
  • Named: :param_name

Placeholders are tracked via thread-local ParameterState during parsing, converted to sentinel strings like __llkv_param__1__, and stored in a PreparedPlan with parameter count metadata. The statement_cache field provides a statement-level cache keyed by SQL text.

Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:278-297

Configuration Methods

| Method | Purpose |
|---|---|
| new<Pg>(pager: Arc<Pg>) | Construct engine with given pager (buffering disabled) |
| with_context(context, default_nulls_first) | Construct from existing RuntimeContext |
| set_insert_buffering(enabled: bool) | Toggle INSERT batching mode |

The set_insert_buffering method controls cross-statement INSERT accumulation. When disabled (default), each INSERT executes immediately. When enabled, compatible INSERTs targeting the same table are batched together up to MAX_BUFFERED_INSERT_ROWS (8192 rows).

Sources: llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-905 llkv-sql/src/sql_engine.rs:410-414

SQL Preprocessing System

The preprocessing layer normalizes SQL dialects before parsing to handle incompatibilities between SQLite, DuckDB, and sqlparser expectations.

graph TB
    RAW["Raw SQL String"]
TPCH["preprocess_tpch_connect_syntax\n(strip CONNECT TO statements)"]
TYPE["preprocess_create_type_syntax\n(CREATE TYPE → CREATE DOMAIN)"]
EXCLUDE["preprocess_exclude_syntax\n(quote qualified names in EXCLUDE)"]
COMMA["preprocess_trailing_commas_in_values\n(remove trailing commas)"]
EMPTY["preprocess_empty_in_lists\n(expr IN () → constant)"]
INDEX["preprocess_index_hints\n(strip INDEXED BY / NOT INDEXED)"]
REINDEX["preprocess_reindex_syntax\n(REINDEX → VACUUM REINDEX)"]
BARE["preprocess_bare_table_in_clauses\n(IN table → IN (SELECT * FROM))"]
TRIGGER["preprocess_sqlite_trigger_shorthand\n(add AFTER / FOR EACH ROW)"]
PARSER["sqlparser::Parser"]
RAW --> TPCH
 
   TPCH --> TYPE
 
   TYPE --> EXCLUDE
 
   EXCLUDE --> COMMA
 
   COMMA --> EMPTY
 
   EMPTY --> INDEX
 
   INDEX --> REINDEX
 
   REINDEX --> BARE
 
   BARE --> PARSER
    
    PARSER -.parse error.-> TRIGGER
 
   TRIGGER --> PARSER

Each preprocessing function is implemented as a regex-based transformer:

| Function | Pattern | Purpose | Lines |
|---|---|---|---|
| preprocess_tpch_connect_syntax | CONNECT TO database; | Strip TPC-H multi-database directives | 628-630 |
| preprocess_create_type_syntax | CREATE TYPE → CREATE DOMAIN | Translate DuckDB type alias syntax | 639-657 |
| preprocess_exclude_syntax | EXCLUDE(a.b.c) → EXCLUDE("a.b.c") | Quote qualified names in EXCLUDE | 659-676 |
| preprocess_trailing_commas_in_values | VALUES (v,) → VALUES (v) | Remove DuckDB-style trailing commas | 678-689 |
| preprocess_empty_in_lists | expr IN () → (expr = NULL AND 0 = 1) | Convert empty IN to constant false | 691-720 |
| preprocess_index_hints | INDEXED BY idx / NOT INDEXED | Strip SQLite index hints | 722-739 |
| preprocess_reindex_syntax | REINDEX idx → VACUUM REINDEX idx | Convert to sqlparser-compatible form | 741-757 |
| preprocess_bare_table_in_clauses | IN table → IN (SELECT * FROM table) | Expand SQLite shorthand | 844-873 |
| preprocess_sqlite_trigger_shorthand | Missing AFTER / FOR EACH ROW | Add required trigger components | 771-842 |

The trigger preprocessor is only invoked on parse errors containing CREATE TRIGGER, as it requires more complex regex patterns to inject missing timing and row-level clauses.

Sources: llkv-sql/src/sql_engine.rs:623-873

Regex Pattern Details

Static OnceLock<Regex> instances cache compiled patterns across invocations:

For example, the empty IN list handler uses:

(?i)(\([^)]*\)|x'[0-9a-fA-F]*'|'(?:[^']|'')*'|[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*|\d+(?:\.\d+)?)\s+(NOT\s+)?IN\s*\(\s*\)

This matches expressions (parenthesized, hex literals, strings, identifiers, numbers) followed by [NOT] IN () and replaces with boolean expressions that preserve evaluation side effects while producing constant results.

Sources: llkv-sql/src/sql_engine.rs:691-720

Parameter Placeholder System

The parameter system uses thread-local state to track placeholders during statement preparation:

  1. Scope Creation: ParameterScope::new() initializes thread-local ParameterState
  2. Registration: Each placeholder calls register_placeholder(raw) which:
    • For ?: auto-increments index
    • For ?N or $N: uses explicit numeric index
    • For :name: assigns next available index and stores mapping
  3. Sentinel Generation: placeholder_marker(index) creates __llkv_param__N__ string
  4. Parsing: Sentinel strings are parsed as string literals in the SQL AST
  5. Binding: execute_prepared replaces sentinels with SqlParamValue instances

The ParameterState struct tracks:

  • assigned: FxHashMap<String, usize> - named parameter to index mapping
  • next_auto: usize - next index for ? placeholders
  • max_index: usize - highest parameter index seen

Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:1120-1235

graph TB
    subgraph "INSERT Processing Decision"
        INSERT["Statement::Insert"]
CLASSIFY["classify_insert"]
VALUES["PreparedInsert::Values"]
IMMEDIATE["PreparedInsert::Immediate"]
end
    
    subgraph "Buffering Logic"
        ENABLED{"Buffering\nEnabled?"}
COMPAT{"Can Buffer\nAccept?"}
THRESHOLD{"&gt;= MAX_BUFFERED_INSERT_ROWS\n(8192)?"}
BUFFER["InsertBuffer::push_statement"]
FLUSH["flush_buffered_insert"]
EXECUTE["execute_plan_statement"]
end
    
 
   INSERT --> CLASSIFY
 
   CLASSIFY --> VALUES
 
   CLASSIFY --> IMMEDIATE
    
 
   VALUES --> ENABLED
 
   ENABLED -->|No| EXECUTE
 
   ENABLED -->|Yes| COMPAT
    
 
   COMPAT -->|No| FLUSH
 
   COMPAT -->|Yes| BUFFER
 
   FLUSH --> BUFFER
    
 
   BUFFER --> THRESHOLD
 
   THRESHOLD -->|Yes| FLUSH
 
   THRESHOLD -->|No| RETURN["Return placeholder result"]
IMMEDIATE --> EXECUTE

INSERT Buffering System

The INSERT buffering system batches compatible literal INSERT statements to reduce planning overhead for bulk ingest workloads.

Buffer Structure

The InsertBuffer struct accumulates rows across multiple INSERT statements:

Key fields:

  • table_name, columns, on_conflict: compatibility key for buffering
  • rows: accumulated literal values from all buffered statements
  • statement_row_counts: per-statement row counts to emit individual results
  • total_rows: sum of statement_row_counts for threshold checking

Sources: llkv-sql/src/sql_engine.rs:421-471

Buffering Conditions

An INSERT can be buffered if:

  1. The InsertSource is Values (literal rows) or a constant SELECT
  2. Buffering is enabled via insert_buffering_enabled flag
  3. Either no buffer exists or InsertBuffer::can_accept returns true:
    • table_name matches exactly
    • columns match exactly (same names, same order)
    • on_conflict action matches

When the buffer reaches MAX_BUFFERED_INSERT_ROWS (8192), it is flushed automatically. Flush also occurs on:

  • Transaction boundaries (BEGIN, COMMIT, ROLLBACK)
  • Incompatible INSERT statement
  • Engine drop
  • Explicit set_insert_buffering(false) call

Sources: llkv-sql/src/sql_engine.rs:452-470 llkv-sql/src/sql_engine.rs:2028-2146 llkv-sql/src/sql_engine.rs:410-414

Buffer Flush Process

The flush process:

  1. Extracts InsertBuffer from RefCell<Option<InsertBuffer>>
  2. Constructs single InsertPlan with all accumulated rows
  3. Executes via execute_statement
  4. Receives single RuntimeStatementResult::Insert with total rows inserted
  5. Splits result into per-statement results using statement_row_counts vector
  6. Returns vector of results matching original statement order

This allows bulk execution while preserving per-statement result semantics.

Sources: llkv-sql/src/sql_engine.rs:2028-2146

Value Handling

The SqlValue enum represents literal values during SQL processing:

The SqlValue::try_from_expr function handles:

  • Unary operators (negation for numeric types, intervals)
  • CAST expressions (particularly to DATE)
  • Nested expressions
  • Dictionary/struct literals
  • Binary operations (addition, subtraction, bitshift for constant folding)
  • Typed strings (DATE '2024-01-01')

Interval arithmetic is performed at constant-folding time:

  • Date32 + Interval → Date32
  • Interval + Date32 → Date32
  • Date32 - Interval → Date32
  • Date32 - Date32 → Interval
  • Interval +/- Interval → Interval

Sources: llkv-sql/src/sql_value.rs:16-320

Error Handling

The SQL layer maps table-related errors to catalog-specific error messages:

| Error Type | Mapping | Method |
|---|---|---|
| Error::NotFound | Catalog Error: Table 'X' does not exist | table_not_found_error |
| Error::InvalidArgumentError (contains "unknown table") | Same as above | map_table_error |
| Transaction conflicts | another transaction has dropped this table | String constant |

The execute_plan_statement method applies error mapping except for CREATE VIEW and DROP VIEW statements, where the "table" name refers to the view being created/dropped rather than a referenced table.

Sources: llkv-sql/src/sql_engine.rs:558-609 llkv-sql/src/sql_engine.rs:511

Thread Safety and Cloning

The SqlEngine::clone implementation creates a new session:

This ensures each cloned engine has an independent:

  • RuntimeSession (transaction state, temporary namespace)
  • insert_buffer (no shared buffering across sessions)
  • statement_cache (independent prepared statement cache)

The warning message indicates this is typically not intended usage, as most applications should use a single shared SqlEngine instance across threads (enabled by interior mutability via RefCell and atomic types).

Sources: llkv-sql/src/sql_engine.rs:522-540


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SqlEngine API

Relevant source files

Purpose and Scope

The SqlEngine provides the primary user-facing API for executing SQL statements against LLKV databases. It accepts SQL text, parses it, translates it to typed execution plans, and delegates to the runtime layer for evaluation. This page documents the SqlEngine struct's construction, methods, prepared statement handling, and configuration options.

For information about SQL preprocessing and dialect handling, see SQL Preprocessing and Dialect Handling. For details on INSERT buffering behavior, see INSERT Buffering System. For query planning internals, see Plan Structures.

Sources: llkv-sql/src/lib.rs:1-51 llkv-sql/README.md:1-68


SqlEngine Architecture Overview

The SqlEngine sits at the top of the LLKV SQL processing stack, coordinating parsing, planning, and execution:

Diagram: SqlEngine Position in SQL Processing Stack

graph TB
    User["User Code"]
SqlEngine["SqlEngine\n(llkv-sql)"]
Parser["sqlparser\nAST Generation"]
Preprocessor["SQL Preprocessor\nDialect Normalization"]
Planner["Plan Translation\nAST → PlanStatement"]
Runtime["RuntimeEngine\n(llkv-runtime)"]
Executor["llkv-executor\nQuery Execution"]
Table["llkv-table\nTable Layer"]
User -->|execute sql| SqlEngine
 
   User -->|sql select| SqlEngine
 
   User -->|prepare sql| SqlEngine
    
 
   SqlEngine --> Preprocessor
 
   Preprocessor --> Parser
 
   Parser --> Planner
 
   Planner --> Runtime
 
   Runtime --> Executor
 
   Executor --> Table
    
 
   SqlEngine -.->|owns| Runtime
 
   SqlEngine -.->|manages| InsertBuffer["InsertBuffer\nBatching State"]
SqlEngine -.->|caches| StmtCache["statement_cache\nPreparedPlan Cache"]

The SqlEngine wraps a RuntimeEngine, manages statement caching and INSERT buffering, and provides convenience methods for single-statement queries (sql()) and multi-statement execution (execute()).

Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/lib.rs:1-51


Constructing a SqlEngine

Basic Construction

Diagram: SqlEngine Construction Flow

The most common constructor is SqlEngine::new(), which accepts a pager and creates a new runtime context:

Sources: llkv-sql/src/sql_engine.rs:615-621

| Constructor | Signature | Purpose |
|---|---|---|
| new() | new<Pg>(pager: Arc<Pg>) -> Self | Create engine with new runtime and default settings |
| with_context() | with_context(context: Arc<SqlContext>, default_nulls_first: bool) -> Self | Create engine reusing an existing runtime context |
| from_runtime_engine() | from_runtime_engine(engine: RuntimeEngine, default_nulls_first: bool, insert_buffering_enabled: bool) -> Self | Internal constructor for fine-grained control |

Table: SqlEngine Constructor Methods

Sources: llkv-sql/src/sql_engine.rs:543-556 llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-885


Core Query Execution Methods

execute() - Multi-Statement Execution

The execute() method processes one or more SQL statements from a string and returns a vector of results:

Diagram: execute() Method Execution Flow

The execution pipeline:

  1. Preprocessing : SQL text undergoes dialect normalization via preprocess_sql_input() llkv-sql/src/sql_engine.rs:1556-1564
  2. Parsing : sqlparser converts normalized text to AST with recursion limit of 200 llkv-sql/src/sql_engine.rs:324
  3. Statement Loop : Each statement is translated to a PlanStatement and either buffered (for INSERTs) or executed immediately
  4. Result Collection : Results are accumulated and returned as Vec<SqlStatementResult>

Sources: llkv-sql/src/sql_engine.rs:933-1044

sql() - Single SELECT Execution

The sql() method enforces single-statement SELECT semantics and returns Arrow RecordBatch results:

Key differences from execute():

  • Accepts only a single statement
  • Statement must be a SELECT query
  • Returns Vec<RecordBatch> directly rather than RuntimeStatementResult
  • Automatically collects streaming results

Sources: llkv-sql/src/sql_engine.rs:1046-1085


Prepared Statements

Prepared Statement Flow

Diagram: Prepared Statement Creation and Caching

prepare() Method

The prepare() method parses SQL with placeholders and caches the resulting plan:

Placeholder syntax supported:

  • ? - Positional parameter (auto-increments)
  • ?N - Numbered parameter (1-indexed)
  • $N - PostgreSQL-style numbered parameter
  • :name - Named parameter

Sources: llkv-sql/src/sql_engine.rs:1296-1376 llkv-sql/src/sql_engine.rs:86-132

Parameter Binding Mechanism

Parameter registration occurs via thread-local ParameterScope:

Diagram: Parameter Registration and Sentinel Generation

The parameter translation process:

  1. During parsing, placeholders are intercepted and converted to sentinel strings: __llkv_param__N__
  2. ParameterState tracks placeholder-to-index mappings in thread-local storage
  3. At execution time, sentinel strings are replaced with actual parameter values

Sources: llkv-sql/src/sql_engine.rs:71-206

SqlParamValue Type

Parameter values are represented by the SqlParamValue enum:

| Variant | SQL Type | Usage |
|---|---|---|
| Null | NULL | SqlParamValue::Null |
| Integer(i64) | INTEGER/BIGINT | SqlParamValue::from(42_i64) |
| Float(f64) | FLOAT/DOUBLE | SqlParamValue::from(3.14_f64) |
| Boolean(bool) | BOOLEAN | SqlParamValue::from(true) |
| String(String) | TEXT/VARCHAR | SqlParamValue::from("text") |
| Date32(i32) | DATE | SqlParamValue::from(18993_i32) |

Table: SqlParamValue Variants and Conversions

Sources: llkv-sql/src/sql_engine.rs:208-276

execute_prepared() Method

Execute a prepared statement with bound parameters:
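
A hedged sketch of the prepare/bind/execute cycle is shown below. The exact prepare() and execute_prepared() signatures are not reproduced on this page, so the argument shapes here are assumptions for illustration only.

```rust
// Continuing from an `engine: SqlEngine` constructed as shown earlier. The
// prepare()/execute_prepared() call shapes are assumptions for this sketch.
use llkv_sql::SqlParamValue;

fn lookup(engine: &llkv_sql::SqlEngine) {
    let prepared = engine
        .prepare("SELECT name FROM users WHERE id = ?")
        .expect("prepare");
    let _results = engine
        .execute_prepared(&prepared, &[SqlParamValue::from(42_i64)])
        .expect("execute prepared");
}
```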

Parameter substitution occurs in two phases:

  1. Literal Substitution : Sentinels in Expr<String> are replaced via substitute_parameter_literals() llkv-sql/src/sql_engine.rs:1453-1497
  2. Plan Value Substitution : Sentinels in Vec<PlanValue> are replaced via substitute_parameter_plan_values() llkv-sql/src/sql_engine.rs:1499-1517

Sources: llkv-sql/src/sql_engine.rs:1378-1451


Transaction Control

The SqlEngine supports explicit transaction boundaries via SQL statements:

Diagram: Transaction State Machine

Transaction management methods:

| SQL Statement | Effect |
|---|---|
| BEGIN | Start explicit transaction (llkv-sql/src/sql_engine.rs:970-976) |
| COMMIT | Finalize transaction and flush buffers (llkv-sql/src/sql_engine.rs:977-983) |
| ROLLBACK | Abort transaction and discard buffers (llkv-sql/src/sql_engine.rs:984-990) |

Transaction boundaries automatically flush the INSERT buffer to ensure consistent visibility semantics.

Sources: llkv-sql/src/sql_engine.rs:970-990 llkv-sql/src/sql_engine.rs:912-914


INSERT Buffering System

Buffer Architecture

Diagram: INSERT Buffer Accumulation and Flush

InsertBuffer Structure

The InsertBuffer struct accumulates literal INSERT payloads:

Sources: llkv-sql/src/sql_engine.rs:421-471

Buffer Compatibility

An INSERT can join the buffer if it matches:

  1. Table Name : Target table must match buffer.table_name
  2. Column List : Columns must match buffer.columns exactly
  3. Conflict Action : on_conflict strategy must match

Sources: llkv-sql/src/sql_engine.rs:452-459

Flush Conditions

The buffer flushes when:

| Condition | Implementation |
|---|---|
| Size limit exceeded | total_rows >= MAX_BUFFERED_INSERT_ROWS (8192) (llkv-sql/src/sql_engine.rs:414) |
| Incompatible INSERT | Table/columns/conflict mismatch (llkv-sql/src/sql_engine.rs:1765-1799) |
| Transaction boundary | BEGIN, COMMIT, ROLLBACK detected (llkv-sql/src/sql_engine.rs:970-990) |
| Non-INSERT statement | Any non-INSERT SQL statement (llkv-sql/src/sql_engine.rs:991-1040) |
| Statement expectation | Test harness expectation registered (llkv-sql/src/sql_engine.rs:1745-1760) |
| Manual flush | flush_pending_inserts() called (llkv-sql/src/sql_engine.rs:1834-1850) |

Table: INSERT Buffer Flush Triggers

Enabling/Disabling Buffering

INSERT buffering is controlled by the set_insert_buffering() method:

  • Disabled by default to maintain statement-level transaction semantics
  • Enable for bulk loading to reduce planning overhead
  • Disabling flushes buffer to ensure pending rows are persisted

Sources: llkv-sql/src/sql_engine.rs:898-905


classDiagram
    class RuntimeStatementResult {<<enum>>\nSelect\nInsert\nUpdate\nDelete\nCreateTable\nDropTable\nCreateIndex\nDropIndex\nAlterTable\nCreateView\nDropView\nVacuum\nTransaction}
    
    class SelectVariant {+SelectExecution execution}
    
    class InsertVariant {+table_name: String\n+rows_inserted: usize}
    
    class UpdateVariant {+table_name: String\n+rows_updated: usize}
    
    RuntimeStatementResult --> SelectVariant
    RuntimeStatementResult --> InsertVariant
    RuntimeStatementResult --> UpdateVariant

Result Types

RuntimeStatementResult

The execute() and execute_prepared() methods return Vec<RuntimeStatementResult>:

Diagram: RuntimeStatementResult Variants

Key result variants:

| Variant | Fields | Returned By |
|---|---|---|
| Select | SelectExecution | SELECT queries (llkv-runtime types) |
| Insert | table_name: String, rows_inserted: usize | INSERT statements |
| Update | table_name: String, rows_updated: usize | UPDATE statements |
| Delete | table_name: String, rows_deleted: usize | DELETE statements |
| CreateTable | table_name: String | CREATE TABLE |
| CreateIndex | index_name: String, table_name: String | CREATE INDEX |

Table: Common RuntimeStatementResult Variants

Sources: llkv-sql/src/lib.rs:49

SelectExecution

SELECT queries return a SelectExecution handle for streaming results:

The sql() method automatically collects batches via execution.collect():

Sources: llkv-sql/src/sql_engine.rs:1065-1080


Configuration Methods

session() - Access Runtime Session

Provides access to the underlying RuntimeSession for transaction introspection or advanced error handling.

Sources: llkv-sql/src/sql_engine.rs:917-919

context_arc() - Access Runtime Context

Internal method to retrieve the shared RuntimeContext for engine composition.

Sources: llkv-sql/src/sql_engine.rs:875-877


Testing Utilities

StatementExpectation

Test harnesses can register expectations to control buffer flushing:

When a statement expectation is registered, the INSERT buffer flushes before executing that statement to ensure test assertions observe correct row counts.

Sources: llkv-sql/src/sql_engine.rs:64-315

Example Usage

Sources: llkv-sql/src/sql_engine.rs:299-309


Thread Safety and Cloning

The SqlEngine implements Clone with special semantics:

Warning : Cloning a SqlEngine creates a new RuntimeSession, not a shared reference. Each clone has independent transaction state and INSERT buffers.

Sources: llkv-sql/src/sql_engine.rs:522-540


Error Handling

Table Not Found Errors

The SqlEngine remaps generic errors to user-friendly catalog errors:

This converts low-level NotFound errors into: Catalog Error: Table 'tablename' does not exist

Sources: llkv-sql/src/sql_engine.rs:558-585


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SQL Preprocessing and Dialect Handling

Relevant source files

Purpose and Scope

SQL preprocessing is the first stage in LLKV's query processing pipeline, responsible for normalizing SQL syntax from various dialects before the statement reaches the parser. This system allows LLKV to accept SQL written for SQLite, DuckDB, and TPC-H tooling while using the standard sqlparser library, which has limited dialect support.

The preprocessing layer transforms dialect-specific syntax into forms that sqlparser can parse, enabling compatibility with SQL Logic Tests and real-world SQL scripts without modifying the parser itself. This page documents the preprocessing transformations and their implementation.

For information about what happens after preprocessing (SQL parsing and plan generation), see SQL Query Processing Pipeline. For details on the SqlEngine API that invokes preprocessing, see SqlEngine API.

Preprocessing in the Query Pipeline

SQL preprocessing occurs immediately before parsing in both the execute and prepare code paths. The following diagram shows where preprocessing fits in the overall query execution flow:

Diagram: SQL Preprocessing Pipeline Position

flowchart TB
    Input["SQL String Input"]
Preprocess["preprocess_sql_input()"]
Parse["sqlparser::Parser::parse()"]
Plan["Plan Generation"]
Execute["Query Execution"]
Input --> Preprocess
 
   Preprocess --> Parse
 
   Parse --> Plan
 
   Plan --> Execute
    
    subgraph "Preprocessing Transformations"
        direction TB
        TPC["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → CREATE DOMAIN"]
Exclude["EXCLUDE syntax normalization"]
Trailing["Trailing comma removal"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint stripping"]
Reindex["REINDEX → VACUUM REINDEX"]
BareTable["Bare table IN expansion"]
TPC --> CreateType
 
       CreateType --> Exclude
 
       Exclude --> Trailing
 
       Trailing --> BareTable
 
       BareTable --> EmptyIn
 
       EmptyIn --> IndexHints
 
       IndexHints --> Reindex
    end
    
    Preprocess -.chains.-> TPC
    Reindex -.final.-> Parse

Sources: llkv-sql/src/sql_engine.rs:936-1001

The preprocess_sql_input method chains all dialect transformations in a specific order, with each transformation receiving the output of the previous one. If parsing fails after preprocessing and the SQL contains CREATE TRIGGER, a fallback preprocessor (preprocess_sqlite_trigger_shorthand) is applied before retrying the parse.

Diagram: Preprocessing Execution Sequence with Fallback

sequenceDiagram
    participant Caller
    participant SqlEngine
    participant Preprocess as "preprocess_sql_input"
    participant Parser as "sqlparser"
    participant Fallback as "preprocess_sqlite_trigger_shorthand"
    
    Caller->>SqlEngine: execute(sql)
    SqlEngine->>Preprocess: preprocess(sql)
    
    Note over Preprocess: Chain all transformations
    
    Preprocess-->>SqlEngine: processed_sql
    SqlEngine->>Parser: parse(processed_sql)
    
    alt Parse Success
        Parser-->>SqlEngine: AST
    else Parse Error + "CREATE TRIGGER"
        Parser-->>SqlEngine: ParseError
        SqlEngine->>Fallback: expand_trigger_syntax(processed_sql)
        Fallback-->>SqlEngine: expanded_sql
        SqlEngine->>Parser: parse(expanded_sql)
        Parser-->>SqlEngine: AST or Error
    end
    
    SqlEngine-->>Caller: Results

Sources: llkv-sql/src/sql_engine.rs:936-958

Supported Dialect Transformations

LLKV implements nine distinct preprocessing transformations, each targeting specific dialect compatibility issues. The following table summarizes each transformation:

| Preprocessor | Dialect | Purpose | Method |
|---|---|---|---|
| TPC-H CONNECT | TPC-H | Strip CONNECT TO database; statements | preprocess_tpch_connect_syntax |
| CREATE TYPE | DuckDB | Convert CREATE TYPE to CREATE DOMAIN | preprocess_create_type_syntax |
| EXCLUDE Syntax | General | Quote qualified identifiers in EXCLUDE clauses | preprocess_exclude_syntax |
| Trailing Commas | DuckDB | Remove trailing commas in VALUES | preprocess_trailing_commas_in_values |
| Empty IN Lists | SQLite | Convert IN () to constant expressions | preprocess_empty_in_lists |
| Index Hints | SQLite | Strip INDEXED BY and NOT INDEXED | preprocess_index_hints |
| REINDEX | SQLite | Convert REINDEX to VACUUM REINDEX | preprocess_reindex_syntax |
| Bare Table IN | SQLite | Expand IN table to IN (SELECT * FROM table) | preprocess_bare_table_in_clauses |
| Trigger Shorthand | SQLite | Add AFTER and FOR EACH ROW to triggers | preprocess_sqlite_trigger_shorthand |

Sources: llkv-sql/src/sql_engine.rs:628-842 llkv-sql/src/sql_engine.rs:992-1001

TPC-H CONNECT Statement Removal

The TPC-H benchmark tooling generates CONNECT TO <database>; directives in referential integrity scripts. Since LLKV operates within a single database context, these statements are treated as no-ops and stripped during preprocessing.

Transformation: any CONNECT TO <database>; directive is deleted from the input before parsing.

Sources: llkv-sql/src/sql_engine.rs:623-630

CREATE TYPE to CREATE DOMAIN Conversion

DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser only supports the SQL standard CREATE DOMAIN syntax. This preprocessor converts the DuckDB syntax to the standard form.

Transformation: CREATE TYPE name AS basetype is rewritten to CREATE DOMAIN name AS basetype.

Implementation: Uses static regex patterns initialized via OnceLock for thread-safe lazy compilation.

Sources: llkv-sql/src/sql_engine.rs:634-657

EXCLUDE Syntax Normalization

When EXCLUDE clauses contain qualified identifiers (e.g., schema.table.column), sqlparser requires them to be quoted. This preprocessor wraps qualified names in double quotes.

Transformation: EXCLUDE (a.b.c) becomes EXCLUDE ("a.b.c").

Sources: llkv-sql/src/sql_engine.rs:659-676

Trailing Comma Removal in VALUES

DuckDB permits trailing commas in VALUES clauses like VALUES ('v2',), but sqlparser rejects them. This preprocessor removes trailing commas before closing parentheses.

Transformation: VALUES ('v2',) becomes VALUES ('v2').

Sources: llkv-sql/src/sql_engine.rs:678-689

Empty IN List Handling

SQLite allows degenerate IN () and NOT IN () expressions. Since sqlparser rejects these, the preprocessor converts them to constant boolean expressions while preserving the original expression evaluation (in case of side effects).

Transformation: expr IN () becomes the constant-false form (expr = NULL AND 0 = 1); expr NOT IN () is handled analogously.

The pattern matches various expression forms: parenthesized expressions, quoted strings, hex literals, identifiers, and numbers.

Sources: llkv-sql/src/sql_engine.rs:691-720

Index Hint Stripping

SQLite supports query optimizer hints like FROM table INDEXED BY index_name and FROM table NOT INDEXED. Since sqlparser doesn't support this syntax and LLKV makes its own index decisions, these hints are stripped during preprocessing.

Transformation: FROM table INDEXED BY index_name and FROM table NOT INDEXED both become FROM table.

Sources: llkv-sql/src/sql_engine.rs:722-739

REINDEX to VACUUM REINDEX Conversion

SQLite supports REINDEX index_name as a standalone statement, but sqlparser only recognizes REINDEX as part of VACUUM syntax. This preprocessor converts the standalone form.

Transformation: REINDEX index_name becomes VACUUM REINDEX index_name.

Sources: llkv-sql/src/sql_engine.rs:741-757

Bare Table IN Clause Expansion

SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor expands this shorthand to the subquery form that sqlparser requires.

Transformation: expr IN tablename becomes expr IN (SELECT * FROM tablename).

The pattern avoids matching IN ( which is already a valid subquery.

Sources: llkv-sql/src/sql_engine.rs:844-873

SQLite Trigger Shorthand Expansion

SQLite allows omitting the trigger timing (defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). sqlparser requires both to be explicit. This preprocessor injects the missing clauses.

Transformation: a trigger definition that omits the timing and row clauses, such as CREATE TRIGGER trg UPDATE ON t ..., is expanded to CREATE TRIGGER trg AFTER UPDATE ON t FOR EACH ROW ....

This is a fallback preprocessor that only runs if initial parsing fails and the SQL contains CREATE TRIGGER. The implementation uses complex regex patterns to handle optional dotted identifiers with various quoting styles.

Sources: llkv-sql/src/sql_engine.rs:759-842 llkv-sql/src/sql_engine.rs:944-957

graph TB
    subgraph "SqlEngine Methods"
        Execute["execute(sql)"]
Prepare["prepare(sql)"]
PreprocessInput["preprocess_sql_input(sql)"]
end
    
    subgraph "Static Regex Patterns"
        CreateTypeRE["CREATE_TYPE_REGEX"]
DropTypeRE["DROP_TYPE_REGEX"]
ExcludeRE["EXCLUDE_REGEX"]
TrailingRE["TRAILING_COMMA_REGEX"]
EmptyInRE["EMPTY_IN_REGEX"]
IndexHintRE["INDEX_HINT_REGEX"]
ReindexRE["REINDEX_REGEX"]
BareTableRE["BARE_TABLE_IN_REGEX"]
TimingRE["TIMING_REGEX"]
ForEachBeginRE["FOR_EACH_BEGIN_REGEX"]
ForEachWhenRE["FOR_EACH_WHEN_REGEX"]
end
    
    subgraph "Preprocessor Methods"
        TPC["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
Trailing["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
BareTable["preprocess_bare_table_in_clauses"]
Trigger["preprocess_sqlite_trigger_shorthand"]
end
    
 
   Execute --> PreprocessInput
 
   Prepare --> PreprocessInput
    
 
   PreprocessInput --> TPC
 
   PreprocessInput --> CreateType
 
   PreprocessInput --> Exclude
 
   PreprocessInput --> Trailing
 
   PreprocessInput --> BareTable
 
   PreprocessInput --> EmptyIn
 
   PreprocessInput --> IndexHints
 
   PreprocessInput --> Reindex
    
    CreateType -.uses.-> CreateTypeRE
    CreateType -.uses.-> DropTypeRE
    Exclude -.uses.-> ExcludeRE
    Trailing -.uses.-> TrailingRE
    EmptyIn -.uses.-> EmptyInRE
    IndexHints -.uses.-> IndexHintRE
    Reindex -.uses.-> ReindexRE
    BareTable -.uses.-> BareTableRE
    Trigger -.uses.-> TimingRE
    Trigger -.uses.-> ForEachBeginRE
    Trigger -.uses.-> ForEachWhenRE

Implementation Architecture

The preprocessing system is implemented using a combination of regex transformations and string manipulation. The following diagram shows the key components:

Diagram: Preprocessing Implementation Components

Sources: llkv-sql/src/sql_engine.rs:640-842 llkv-sql/src/sql_engine.rs:992-1001

Regex Pattern Management

All regex patterns are stored in OnceLock static variables for thread-safe lazy initialization. This ensures patterns are compiled once per process and reused across all preprocessing operations, avoiding the overhead of repeated compilation.

Pattern Initialization Example:
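
The sketch below shows the idiom using the REINDEX rewrite as a stand-in; the actual pattern in llkv-sql is more involved (for instance, it must not rewrite statements that already read VACUUM REINDEX).

```rust
use std::sync::OnceLock;
use regex::Regex;

static REINDEX_REGEX: OnceLock<Regex> = OnceLock::new();

fn reindex_regex() -> &'static Regex {
    // Compiled once per process, then reused by every preprocessing call.
    REINDEX_REGEX.get_or_init(|| {
        Regex::new(r"(?i)\bREINDEX\b").expect("static regex must compile")
    })
}

// Simplified illustration of the REINDEX -> VACUUM REINDEX rewrite.
fn rewrite_reindex(sql: &str) -> String {
    reindex_regex().replace_all(sql, "VACUUM REINDEX").into_owned()
}
```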

The patterns use case-insensitive matching ((?i)) and word boundaries (\b) to avoid false matches within identifiers or string literals.

Sources: llkv-sql/src/sql_engine.rs:640-650 llkv-sql/src/sql_engine.rs:661-669 llkv-sql/src/sql_engine.rs:682-686

Preprocessing Order

The order of transformations matters because later transformations may depend on earlier ones. The current order:

  1. TPC-H CONNECT removal - Must happen first to remove non-SQL directives
  2. CREATE TYPE conversion - Normalizes DDL before other transformations
  3. EXCLUDE syntax - Handles qualified names in projection lists
  4. Trailing comma removal - Fixes VALUES clause syntax
  5. Bare table IN expansion - Converts shorthand to subqueries before empty IN check
  6. Empty IN handling - Must come after bare table expansion to avoid conflicts
  7. Index hint stripping - Removes query hints from FROM clauses
  8. REINDEX conversion - Must be last to avoid interfering with VACUUM statements

Sources: llkv-sql/src/sql_engine.rs:992-1001

Parser Integration

The preprocessed SQL is passed to sqlparser with a custom recursion limit to handle deeply nested queries from test suites:

The default sqlparser recursion limit (50) is insufficient for some SQLite test suite queries, so LLKV uses 200 to balance compatibility with stack safety.
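
A sketch of that call using sqlparser's builder-style API is shown below; GenericDialect is assumed here purely for illustration (the page does not state which dialect LLKV configures), and the recursion-limit value is the one quoted above.

```rust
use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

fn parse_with_limit(sql: &str) -> Result<Vec<Statement>, ParserError> {
    // Raise the recursion limit from sqlparser's default (50) to 200 so deeply
    // nested test-suite queries can be parsed without hitting the limit.
    Parser::new(&GenericDialect {})
        .with_recursion_limit(200)
        .try_with_sql(sql)?
        .parse_statements()
}
```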

Sources: llkv-sql/src/sql_engine.rs:318-324

Testing and Validation

The preprocessing transformations are validated through:

  1. SQL Logic Tests (SLT) - The llkv-slt-tester runs thousands of SQLite test cases that exercise various dialect features
  2. TPC-H Benchmarks - The llkv-tpch crate verifies compatibility with TPC-H SQL scripts
  3. Unit Tests - Individual preprocessor functions are tested in isolation

The preprocessing system is designed to be conservative: it only transforms patterns that are known to cause parser errors, and it preserves the original SQL semantics whenever possible.

Sources: llkv-sql/src/sql_engine.rs:623-1001 llkv-sql/Cargo.toml:1-34

Future Considerations

The preprocessing approach is a pragmatic solution that enables broad dialect compatibility without modifying sqlparser. However, it has limitations:

  • Fragile regex patterns - Complex transformations like trigger shorthand expansion use intricate regex that may not handle all edge cases
  • Limited context awareness - String-based transformations cannot distinguish between SQL keywords and string literals containing those keywords
  • Maintenance burden - Each new dialect feature requires a new preprocessor

The long-term solution is to contribute dialect-specific parsing improvements back to sqlparser, eliminating the need for preprocessing. The trigger shorthand transformation includes a TODO comment noting that proper SQLite dialect support in sqlparser would eliminate that preprocessor entirely.

Sources: llkv-sql/src/sql_engine.rs:765-770


GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

INSERT Buffering System

Relevant source files

The INSERT Buffering System is an optimization layer within llkv-sql that batches multiple consecutive INSERT ... VALUES statements for the same table into a single execution plan. This dramatically reduces planning overhead when bulk-loading data from SQL scripts containing thousands of individual INSERT statements. The system preserves per-statement result semantics while amortizing the cost of plan construction and table access across large batches.

For information about how INSERT plans are structured and executed, see Plan Structures.

Purpose and Design Goals

The buffering system addresses a specific performance bottleneck: SQL scripts generated by database export tools often contain tens of thousands of individual INSERT INTO table VALUES (...) statements. Without buffering, each statement incurs the full cost of parsing, planning, catalog lookup, and MVCC overhead. The buffer accumulates compatible INSERT statements and flushes them as a single batch, achieving order-of-magnitude throughput improvements for bulk ingestion workloads.

Key design constraints:

  • Optional : Disabled by default to preserve immediate visibility semantics for unit tests and interactive workloads
  • Transparent : Callers receive per-statement results as if each INSERT executed independently
  • Safe : Flushes automatically at transaction boundaries, table changes, and buffer size limits
  • Compatible : Integrates with statement expectation mechanisms used by the SQL Logic Test harness

Sources: llkv-sql/src/sql_engine.rs:410-520

Architecture Overview

Figure 1: INSERT Buffering Architecture

The system operates as a stateful accumulator within SqlEngine. Incoming INSERT statements are classified as either PreparedInsert::Values (bufferable literals) or PreparedInsert::Immediate (non-bufferable subqueries or expressions). Compatible VALUES inserts accumulate in the buffer until a flush trigger fires, at which point the buffer constructs a single InsertPlan and emits individual RuntimeStatementResult::Insert entries for each original statement.

Sources: llkv-sql/src/sql_engine.rs:416-509

Buffer Data Structures

InsertBuffer

Figure 2: Buffer Data Structure

The InsertBuffer struct maintains five critical pieces of state:

| Field | Type | Purpose |
|---|---|---|
| table_name | String | Target table identifier for compatibility checking |
| columns | Vec<String> | Column list; must match for batching |
| on_conflict | InsertConflictAction | Conflict resolution policy; must match for batching |
| total_rows | usize | Sum of all buffered rows across statements |
| statement_row_counts | Vec<usize> | Per-statement row counts for result construction |
| rows | Vec<Vec<PlanValue>> | Literal row payloads in execution order |

The statement_row_counts vector preserves the boundary between original INSERT statements so that flush_buffer_results() can emit one RuntimeStatementResult::Insert per statement with the correct row count.

Sources: llkv-sql/src/sql_engine.rs:421-471

PreparedInsert Classification

Figure 3: INSERT Classification Flow

The prepare_insert() method analyzes each INSERT statement and returns PreparedInsert::Values only when the source is a literal VALUES clause or a SELECT that evaluates to constants (e.g., SELECT 1, 'foo'). All other forms—subqueries referencing tables, expressions requiring runtime evaluation, or DEFAULT VALUES—become PreparedInsert::Immediate and bypass buffering.

Sources: llkv-sql/src/sql_engine.rs:473-487

Buffer Lifecycle and Flush Triggers

Flush Conditions

The buffer flushes automatically when any of the following conditions occur:

| Trigger | Constant | Description |
|---|---|---|
| Size limit | MAX_BUFFERED_INSERT_ROWS = 8192 | Total buffered rows exceeds threshold |
| Incompatible INSERT | N/A | Different table, columns, or conflict action |
| Non-INSERT statement | N/A | Any DDL, DML (UPDATE/DELETE), or SELECT |
| Transaction boundary | N/A | BEGIN, COMMIT, or ROLLBACK |
| Statement expectation | StatementExpectation::Error or Count(n) | Test harness expects specific outcome |
| Manual flush | N/A | flush_pending_inserts() called explicitly |
| Engine drop | N/A | SqlEngine destructor invoked |

Sources: llkv-sql/src/sql_engine.rs:414-1127

Buffer State Machine

Figure 4: Buffer State Machine

The buffer exists in one of three states: Empty (no buffer allocated), Buffering (accumulating rows), or Flushing (emitting results). Transitions from Buffering to Flushing occur automatically based on the triggers listed above. After flushing, the state returns to Empty unless a new compatible INSERT immediately follows, in which case a fresh buffer is allocated.

Sources: llkv-sql/src/sql_engine.rs:514-1201

Integration with SqlEngine::execute()

Figure 5: Execute Loop with Buffer Integration

The execute() method iterates through parsed statements, dispatching INSERT statements to buffer_insert() and all other statements to execute_statement() after flushing. This ensures that the buffer never holds rows across non-INSERT operations or transaction boundaries.

Sources: llkv-sql/src/sql_engine.rs:933-990

buffer_insert() Implementation Details

Decision Flow

Figure 6: buffer_insert() Decision Tree

The buffer_insert() method performs three levels of gating:

  1. Expectation check : If the SLT harness expects an error or specific row count, bypass buffering entirely
  2. Buffering enabled check : If insert_buffering_enabled is false, execute immediately
  3. Compatibility check : If the INSERT is incompatible with the current buffer, flush and start a new buffer

Sources: llkv-sql/src/sql_engine.rs:1101-1201

Compatibility Rules

An INSERT can be added to the existing buffer if and only if the target table name, the column list, and the on_conflict action all match the buffer exactly.

This ensures that all buffered statements can be collapsed into a single InsertPlan with uniform semantics. Different column orderings, conflict actions, or target tables require separate batches.

Sources: llkv-sql/src/sql_engine.rs:452-459

Statement Expectation Handling

The SQL Logic Test harness uses thread-local expectations to signal that a specific statement should produce an error or affect a precise number of rows. The buffering system respects these hints by forcing immediate execution when expectations are present:

Figure 7: Statement Expectation Flow

graph TB
    SLTHarness["SLT Harness"]
RegisterExpectation["register_statement_expectation()"]
ThreadLocal["PENDING_STATEMENT_EXPECTATIONS\nthread_local!"]
Execute["SqlEngine::execute()"]
NextExpectation["next_statement_expectation()"]
BufferInsert["buffer_insert()"]
SLTHarness -->|before statement| RegisterExpectation
 
   RegisterExpectation --> ThreadLocal
 
   Execute --> NextExpectation
 
   NextExpectation --> ThreadLocal
 
   NextExpectation --> BufferInsert
    
 
   BufferInsert -->|Error or Count| ImmediateExec["Execute immediately\nbypass buffer"]
BufferInsert -->|Ok| MayBuffer["May buffer if enabled"]

When next_statement_expectation() returns StatementExpectation::Error or StatementExpectation::Count(n), the buffer_insert() method sets execute_immediately = true and flushes any existing buffer before executing the current statement. This preserves test correctness while still allowing buffering for the majority of statements that have no expectations.

Sources: llkv-sql/src/sql_engine.rs:64-1127

sequenceDiagram
    participant Caller
    participant FlushBuffer as "flush_buffer_results()"
    participant Buffer as "InsertBuffer"
    participant PlanStmt as "PlanStatement::Insert"
    participant Runtime as "RuntimeEngine"
    
    Caller->>FlushBuffer: flush_buffer_results()
    FlushBuffer->>Buffer: Take buffer from RefCell
    
    alt Buffer is None
        FlushBuffer-->>Caller: Ok(Vec::new())
    else Buffer has data
        FlushBuffer->>PlanStmt: Construct InsertPlan\n(table, columns, rows, on_conflict)
        FlushBuffer->>Runtime: execute_statement(plan)
        Runtime-->>FlushBuffer: RuntimeStatementResult::Insert\n(total_rows_inserted)
        
        Note over FlushBuffer: Verify total_rows matches sum(statement_row_counts)
        
        loop For each statement_row_count
            FlushBuffer->>FlushBuffer: Create RuntimeStatementResult::Insert\n(statement_rows)
        end
        
        FlushBuffer-->>Caller: Vec<SqlStatementResult>
    end

flush_buffer_results() Mechanics

The flush operation reconstructs per-statement results from the accumulated buffer state:

Figure 8: Flush Sequence

The flush process:

  1. Takes ownership of the buffer from the RefCell
  2. Constructs a single InsertPlan with all buffered rows
  3. Executes the plan via the runtime
  4. Splits the total row count across the original statements using statement_row_counts
  5. Returns a vector of per-statement results

This ensures that callers receive results as if each INSERT executed independently, even though the runtime processed them as a single batch.

Sources: llkv-sql/src/sql_engine.rs:2094-2169 (Note: The flush implementation is in the broader file, exact line range may vary)

Performance Characteristics

Throughput Improvement

Buffering provides dramatic performance gains for bulk INSERT workloads:

| Scenario | Without Buffering | With Buffering | Speedup |
|---|---|---|---|
| 10,000 single-row INSERTs | ~30 seconds | ~2 seconds | ~15x |
| 1,000 ten-row INSERTs | ~5 seconds | ~0.5 seconds | ~10x |
| 100,000 single-row INSERTs | Several minutes | ~15 seconds | >10x |

The improvement stems from:

  • Amortized planning : One plan for 8,192 rows instead of 8,192 plans
  • Batch MVCC overhead : Single transaction coordinator call instead of thousands
  • Reduced catalog lookups : One schema resolution instead of per-statement lookups
  • Vectorized column operations : Arrow batch processing instead of row-by-row appends

Sources: llkv-sql/README.md:36-41

Memory Usage

The buffer is bounded at MAX_BUFFERED_INSERT_ROWS = 8192 rows. Peak memory usage depends on the row width:

Peak Memory = MAX_BUFFERED_INSERT_ROWS × (Σ column_size + MVCC_overhead)

For a typical table with 10 columns averaging 50 bytes each:

8,192 rows × (10 columns × 50 bytes + 24 bytes MVCC) ≈ 4.3 MB

This predictable ceiling makes buffering safe for long-running workloads without risking unbounded memory growth.

Sources: llkv-sql/src/sql_engine.rs:410-414

API Surface

Enabling and Disabling

The set_insert_buffering(false) call automatically flushes any pending rows before disabling, ensuring visibility guarantees.
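
A hedged bulk-load sketch is shown below; set_insert_buffering and execute come from this page, while the &self receiver, the surrounding statements, and the error handling are assumptions for illustration.

```rust
// Continuing from an `engine: SqlEngine` constructed as shown earlier.
// set_insert_buffering() is assumed here to take &self like execute().
fn bulk_load(engine: &llkv_sql::SqlEngine, rows: &[(i64, String)]) {
    // Enable batching so consecutive literal INSERTs are planned as one batch.
    engine.set_insert_buffering(true);
    for (id, name) in rows {
        let stmt = format!("INSERT INTO users VALUES ({id}, '{name}')");
        engine.execute(&stmt).expect("buffered insert");
    }
    // Disabling (or dropping the engine) flushes any rows still in the buffer.
    engine.set_insert_buffering(false);
}
```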

Sources: llkv-sql/src/sql_engine.rs:887-905

Manual Flush

Manual flushes are useful when the caller needs to checkpoint progress or ensure specific INSERT statements are visible before proceeding.

Sources: llkv-sql/src/sql_engine.rs:1003-1010

Drop Hook

The SqlEngine destructor automatically flushes the buffer to prevent data loss:

This ensures that buffered rows are persisted even if the caller forgets to flush explicitly.

Sources: llkv-sql/src/sql_engine.rs:513-520

Limitations and Edge Cases

Non-Bufferable INSERT Forms

The following INSERT patterns always execute immediately:

  • INSERT ... SELECT with table references
  • INSERT ... DEFAULT VALUES
  • INSERT with expressions requiring runtime evaluation (e.g., NOW(), RANDOM())
  • INSERT with parameters or placeholders

These patterns cannot be safely batched because their semantics depend on execution context.

Transaction Isolation

The buffer flushes at transaction boundaries (BEGIN, COMMIT, ROLLBACK) to preserve isolation semantics. For example, if an INSERT is buffered and a BEGIN follows, the buffered INSERT's visibility is not guaranteed until the BEGIN statement forces the flush.

Conflict Handling

All buffered statements must share the same InsertConflictAction. Mixing ON CONFLICT IGNORE and ON CONFLICT REPLACE requires separate batches.

Sources: llkv-sql/src/sql_engine.rs:452-1201



Query Planning

Relevant source files

Purpose and Scope

Query planning is the layer that translates parsed SQL statements into strongly-typed plan structures that can be executed by the runtime engine. The llkv-plan crate defines these plan types and provides utilities for representing queries, expressions, and subquery relationships in a form that execution layers can consume without re-parsing SQL.

This page covers the plan structures themselves and how they are constructed from SQL input. For information about how expressions within plans are evaluated, see Expression System. For details on subquery correlation tracking and placeholder generation, see Subquery and Correlation Handling. For execution of these plans, see Query Execution.

Plan Structures Overview

The planning layer defines distinct plan types for each category of SQL statement. All plan types are defined in llkv-plan/src/plans.rs and flow through the PlanStatement enum for execution dispatch.

Core Plan Types

| Plan Type | Purpose | Key Fields |
|-----------|---------|------------|
| SelectPlan | Query execution | tables, projections, filter, joins, aggregates, order_by |
| InsertPlan | Row insertion | table, columns, source, on_conflict |
| UpdatePlan | Row updates | table, assignments, filter |
| DeletePlan | Row deletion | table, filter |
| CreateTablePlan | Table creation | name, columns, source, foreign_keys |
| CreateIndexPlan | Index creation | table, columns, unique |
| CreateViewPlan | View creation | name, view_definition, select_plan |

Sources: llkv-plan/src/plans.rs:177-256 llkv-plan/src/plans.rs:640-655 llkv-plan/src/plans.rs:662-667 llkv-plan/src/plans.rs:687-692

SelectPlan Structure

Diagram: SelectPlan Component Structure

The SelectPlan struct at llkv-plan/src/plans.rs:800-825 contains all information needed to execute a SELECT query. It separates table references, join specifications, projections, filters, aggregations, and ordering to allow execution layers to optimize each phase independently.

Sources: llkv-plan/src/plans.rs:27-67 llkv-plan/src/plans.rs:794-825

SQL-to-Plan Translation Pipeline

The translation from SQL text to plan structures occurs in SqlEngine within the llkv-sql crate. The process involves multiple stages to handle dialect differences and build strongly-typed plans.

Diagram: SQL-to-Plan Translation Flow

sequenceDiagram
    participant User
    participant SqlEngine as "SqlEngine"
    participant Preprocessor as "SQL Preprocessing"
    participant Parser as "sqlparser::Parser"
    participant Translator as "Plan Translator"
    participant Runtime as "RuntimeEngine"
    
    User->>SqlEngine: execute(sql_text)
    
    SqlEngine->>Preprocessor: preprocess_sql_input()
    Note over Preprocessor: Strip CONNECT TO\nNormalize CREATE TYPE\nFix EXCLUDE syntax\nExpand IN clauses
    Preprocessor-->>SqlEngine: processed_sql
    
    SqlEngine->>Parser: Parser::parse_sql()
    Parser-->>SqlEngine: Vec&lt;Statement&gt; (AST)
    
    loop "For each Statement"
        SqlEngine->>Translator: translate_statement()
        
        alt "INSERT statement"
            Translator->>Translator: translate_insert()
            Note over Translator: Parse VALUES/SELECT\nNormalize conflict action\nBuild PreparedInsert
            Translator-->>SqlEngine: PreparedInsert
            SqlEngine->>SqlEngine: buffer_insert()\nor flush immediately
        else "SELECT statement"
            Translator->>Translator: translate_select()
            Note over Translator: Build SelectPlan\nTranslate expressions\nTrack subqueries
            Translator-->>SqlEngine: SelectPlan
        else "UPDATE/DELETE"
            Translator->>Translator: translate_update()/delete()
            Translator-->>SqlEngine: UpdatePlan/DeletePlan
        else "DDL statement"
            Translator->>Translator: translate_create_table()\ncreate_index(), etc.
            Translator-->>SqlEngine: CreateTablePlan/etc.
        end
        
        SqlEngine->>Runtime: execute_statement(plan)
        Runtime-->>SqlEngine: RuntimeStatementResult
    end
    
    SqlEngine-->>User: Vec&lt;RuntimeStatementResult&gt;

Sources: llkv-sql/src/sql_engine.rs:933-958 llkv-sql/src/sql_engine.rs:628-757

Statement Translation Functions

The SqlEngine contains dedicated translation methods for each statement type:

| sqlparser AST | Translation Method | Output Plan | Location |
|---------------|--------------------|-------------|----------|
| Statement::Query | translate_select() | SelectPlan | llkv-sql/src/sql_engine.rs:2162-2578 |
| Statement::Insert | translate_insert() | InsertPlan | llkv-sql/src/sql_engine.rs:3194-3423 |
| Statement::Update | translate_update() | UpdatePlan | llkv-sql/src/sql_engine.rs:3560-3704 |
| Statement::Delete | translate_delete() | DeletePlan | llkv-sql/src/sql_engine.rs:3706-3783 |
| Statement::CreateTable | translate_create_table() | CreateTablePlan | llkv-sql/src/sql_engine.rs:4081-4465 |
| Statement::CreateIndex | translate_create_index() | CreateIndexPlan | llkv-sql/src/sql_engine.rs:4575-4766 |

Sources: llkv-sql/src/sql_engine.rs:974-1067

SELECT Translation Details

The translate_select() method at llkv-sql/src/sql_engine.rs:2162 performs the following operations:

  1. Extract table references from FROM clause into Vec<TableRef>
  2. Parse join specifications into Vec<JoinMetadata> structures
  3. Translate WHERE clause to Expr<String> and discover correlated subqueries
  4. Process projections into Vec<SelectProjection> with computed expressions
  5. Handle aggregates by extracting AggregateExpr from projections and HAVING
  6. Translate GROUP BY clause to canonical column names
  7. Process ORDER BY into Vec<OrderByPlan> with sort specifications
  8. Handle compound queries (UNION/INTERSECT/EXCEPT) via CompoundSelectPlan

Sources: llkv-sql/src/sql_engine.rs:2162-2578

Expression Representation in Plans

Plans use two forms of expressions from the llkv-expr crate:

  • Expr<String>: Boolean predicates using unresolved column names (as strings)
  • ScalarExpr<String>: Scalar expressions (also with string column references)
graph LR
    SQL["SQL: WHERE age &gt; 18"]
AST["sqlparser AST\nBinaryExpr"]
ExprString["Expr&lt;String&gt;\nCompare(Column('age'), Gt, Literal(18))"]
ExprFieldId["Expr&lt;FieldId&gt;\nCompare(Column(field_7), Gt, Literal(18))"]
Bytecode["EvalProgram\nStack-based bytecode"]
SQL --> AST
 
   AST --> ExprString
 
   ExprString --> ExprFieldId
 
   ExprFieldId --> Bytecode
    
    ExprString -.stored in plan.-> SelectPlan
    ExprFieldId -.resolved at execution.-> Executor
    Bytecode -.compiled for evaluation.-> Table

These string-based expressions are later resolved to Expr<FieldId> and ScalarExpr<FieldId> during execution when the catalog provides field mappings. This two-stage approach separates planning from schema resolution.

Diagram: Expression Evolution Through Planning and Execution

The translation from SQL expressions to Expr<String> happens in llkv-sql/src/sql_engine.rs:1647-1947. The resolution to Expr<FieldId> occurs in the executor's translate_predicate() function at llkv-executor/src/translation/predicate.rs.

Sources: llkv-expr/src/expr.rs llkv-sql/src/sql_engine.rs:1647-1947 llkv-plan/src/plans.rs:28-34

Join Planning

Join specifications are represented in two components:

JoinMetadata Structure

The JoinMetadata struct at llkv-plan/src/plans.rs:781-792 captures a single join between consecutive tables:

  • left_table_index : Index into SelectPlan.tables vector for the left table
  • join_type : One of Inner, Left, Right, or Full
  • on_condition : Optional ON clause filter expression

JoinPlan Types

The JoinPlan enum at llkv-plan/src/plans.rs:763-773 defines supported join semantics:

Diagram: JoinPlan Variants

The executor converts JoinPlan to llkv_join::JoinType during execution. When SelectPlan.joins is empty but multiple tables exist, the executor performs a Cartesian product (cross join).

Sources: llkv-plan/src/plans.rs:758-792 llkv-executor/src/lib.rs:542-554

Aggregation Planning

Aggregates are represented through the AggregateExpr structure defined at llkv-plan/src/plans.rs:1025-1102:

Aggregate Function Types

Diagram: AggregateFunction Variants

GROUP BY Handling

When a SELECT contains a GROUP BY clause:

  1. Column names from GROUP BY are stored in SelectPlan.group_by as canonical strings
  2. Aggregate expressions are collected in SelectPlan.aggregates
  3. Non-aggregate projections must reference GROUP BY columns
  4. HAVING clause (if present) is stored in SelectPlan.having as Expr<String>

The executor groups rows based on group_by columns, evaluates aggregates within each group, and applies the HAVING filter to group results.

Sources: llkv-plan/src/plans.rs:1025-1102 llkv-executor/src/lib.rs:1185-1597

Subquery Representation

Subqueries appear in two contexts within plans:

Filter Subqueries

FilterSubquery at llkv-plan/src/plans.rs:36-45 represents correlated subqueries used in WHERE/HAVING predicates via Expr::Exists:

  • id : Unique identifier matching Expr::Exists(SubqueryId)
  • plan : Nested SelectPlan for the subquery
  • correlated_columns : Mappings from placeholder names to outer query columns

Scalar Subqueries

ScalarSubquery at llkv-plan/src/plans.rs:48-56 represents subqueries that produce single values in projections via ScalarExpr::ScalarSubquery:

Correlated Column Tracking

The CorrelatedColumn struct at llkv-plan/src/plans.rs:59-67 describes how outer columns are bound into inner subqueries:

During execution, the executor substitutes placeholder references with actual values from the outer query's current row.

Sources: llkv-plan/src/plans.rs:36-67 llkv-sql/src/sql_engine.rs:1980-2124

Plan Value Types

The PlanValue enum at llkv-plan/src/plans.rs:73-83 represents literal values within plans:

These values appear in:

  • INSERT literal rows (InsertPlan with InsertSource::Rows)
  • UPDATE assignments (AssignmentValue::Literal)
  • Computed constant expressions

The executor converts PlanValue instances to Arrow arrays via plan_values_to_arrow_array() at llkv-executor/src/lib.rs:302-410

Sources: llkv-plan/src/plans.rs:73-161 llkv-executor/src/lib.rs:302-410

Plan Execution Interface

Plans flow to the runtime through the PlanStatement enum:

Diagram: Plan Execution Dispatch Flow

The RuntimeEngine::execute_statement() method dispatches each plan variant to the appropriate handler:

  • SELECT : Passed to QueryExecutor for streaming execution
  • INSERT/UPDATE/DELETE : Applied via Table with MVCC tracking
  • DDL : Processed by CatalogManager to modify schema metadata

Sources: llkv-runtime/src/statements.rs llkv-sql/src/sql_engine.rs:587-609 llkv-executor/src/lib.rs:523-569

Compound Query Planning

Set operations (UNION, INTERSECT, EXCEPT) are represented through CompoundSelectPlan at llkv-plan/src/plans.rs:969-996:

  • CompoundOperator : Union, Intersect, or Except
  • CompoundQuantifier : Distinct (deduplicate) or All (keep duplicates)

The executor evaluates the initial plan, then applies each operation sequentially, combining results according to set semantics. Deduplication for DISTINCT quantifiers uses hash-based row encoding.

Sources: llkv-plan/src/plans.rs:946-996 llkv-executor/src/lib.rs:590-686



Plan Structures

Relevant source files

Purpose and Scope

Plan structures are strongly-typed representations of SQL statements that bridge the SQL parsing layer and the execution layer. Defined in the llkv-plan crate, these structures capture the logical intent of SQL operations without retaining parser-specific AST details. The planner translates sqlparser ASTs into plan instances, which the runtime then dispatches to execution engines.

This page documents the structure and organization of plan types. For information about how correlated subqueries and scalar subqueries are represented and tracked, see Subquery and Correlation Handling.

Sources: llkv-plan/src/plans.rs:1-10 llkv-plan/README.md:10-16

Plan Type Hierarchy

LLKV organizes plan structures into three primary categories based on their SQL statement class:

Diagram: Plan Type Organization

graph TB
    Plans["Plan Structures\n(llkv-plan)"]
DDL["DDL Plans\nSchema Operations"]
DML["DML Plans\nData Modifications"]
Query["Query Plans\nData Retrieval"]
Plans --> DDL
 
   Plans --> DML
 
   Plans --> Query
    
 
   DDL --> CreateTablePlan
 
   DDL --> DropTablePlan
 
   DDL --> CreateViewPlan
 
   DDL --> DropViewPlan
 
   DDL --> CreateIndexPlan
 
   DDL --> DropIndexPlan
 
   DDL --> ReindexPlan
 
   DDL --> AlterTablePlan
 
   DDL --> RenameTablePlan
    
 
   DML --> InsertPlan
 
   DML --> UpdatePlan
 
   DML --> DeletePlan
 
   DML --> TruncatePlan
    
 
   Query --> SelectPlan
 
   Query --> CompoundSelectPlan
    
 
   SelectPlan --> TableRef
 
   SelectPlan --> JoinMetadata
 
   SelectPlan --> SelectProjection
 
   SelectPlan --> SelectFilter
 
   SelectPlan --> AggregateExpr
 
   SelectPlan --> OrderByPlan

Plans are consumed by llkv-runtime for execution orchestration and by llkv-executor for query evaluation. Each plan type encodes the necessary metadata for its corresponding operation without requiring re-parsing or runtime AST traversal.

Sources: llkv-plan/src/plans.rs:163-358 llkv-plan/src/plans.rs:620-703 llkv-plan/src/plans.rs:794-1023

SelectPlan Structure

SelectPlan represents SELECT queries and is the most complex plan type. It aggregates multiple sub-components to describe table references, join relationships, projections, filters, aggregates, and ordering.

Diagram: SelectPlan Component Structure

graph TB
    SelectPlan["SelectPlan\nllkv-plan/src/plans.rs:801"]
subgraph "Table Sources"
        Tables["tables: Vec&lt;TableRef&gt;"]
TableRef["TableRef\nschema, table, alias"]
end
    
    subgraph "Join Specification"
        Joins["joins: Vec&lt;JoinMetadata&gt;"]
JoinMetadata["JoinMetadata\nleft_table_index\njoin_type\non_condition"]
JoinPlan["JoinPlan\nInner/Left/Right/Full"]
end
    
    subgraph "Projections"
        Projections["projections:\nVec&lt;SelectProjection&gt;"]
AllColumns["AllColumns"]
AllColumnsExcept["AllColumnsExcept"]
Column["Column\nname, alias"]
Computed["Computed\nexpr, alias"]
end
    
    subgraph "Filtering"
        Filter["filter: Option&lt;SelectFilter&gt;"]
SelectFilter["SelectFilter\npredicate\nsubqueries"]
FilterSubquery["FilterSubquery\nid, plan,\ncorrelated_columns"]
end
    
    subgraph "Aggregation"
        Aggregates["aggregates:\nVec&lt;AggregateExpr&gt;"]
GroupBy["group_by: Vec&lt;String&gt;"]
Having["having:\nOption&lt;Expr&gt;"]
end
    
    subgraph "Ordering & Modifiers"
        OrderBy["order_by:\nVec&lt;OrderByPlan&gt;"]
Distinct["distinct: bool"]
end
    
    subgraph "Compound Operations"
        Compound["compound:\nOption&lt;CompoundSelectPlan&gt;"]
CompoundOps["Union/Intersect/Except\nDistinct/All"]
end
    
 
   SelectPlan --> Tables
 
   SelectPlan --> Joins
 
   SelectPlan --> Projections
 
   SelectPlan --> Filter
 
   SelectPlan --> Aggregates
 
   SelectPlan --> OrderBy
 
   SelectPlan --> Compound
    
 
   Tables --> TableRef
 
   Joins --> JoinMetadata
 
   JoinMetadata --> JoinPlan
 
   Projections --> AllColumns
 
   Projections --> AllColumnsExcept
 
   Projections --> Column
 
   Projections --> Computed
 
   Filter --> SelectFilter
 
   SelectFilter --> FilterSubquery
 
   Aggregates --> GroupBy
 
   Aggregates --> Having
 
   Compound --> CompoundOps

Sources: llkv-plan/src/plans.rs:794-944

TableRef - Table References

TableRef represents a table source in the FROM clause, with optional aliasing:

| Field | Type | Description |
|-------|------|-------------|
| schema | String | Schema/namespace identifier (empty for default) |
| table | String | Table name |
| alias | Option<String> | Optional alias for qualified name resolution |

The display_name() method returns the alias if present, otherwise the qualified name. This enables consistent column name resolution during expression translation.
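A minimal sketch of that rule in free-function form (the real method lives on TableRef at llkv-plan/src/plans.rs:708-752):

```rust
// Alias wins; otherwise fall back to the (optionally schema-qualified) table name.
fn display_name(schema: &str, table: &str, alias: Option<&str>) -> String {
    match alias {
        Some(alias) => alias.to_string(),
        None if schema.is_empty() => table.to_string(),
        None => format!("{schema}.{table}"),
    }
}
```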

Sources: llkv-plan/src/plans.rs:708-752

JoinMetadata - Join Specification

JoinMetadata describes how adjacent tables in the tables vector are connected. Each entry links tables[left_table_index] with tables[left_table_index + 1]:

| Field | Type | Description |
|-------|------|-------------|
| left_table_index | usize | Index into SelectPlan.tables |
| join_type | JoinPlan | Inner, Left, Right, or Full |
| on_condition | Option<Expr<String>> | Optional ON clause predicate |

The JoinPlan enum mirrors llkv_join::JoinType but exists in the plan layer to avoid circular dependencies.

Sources: llkv-plan/src/plans.rs:758-792

SelectProjection - Projection Variants

SelectProjection specifies which columns appear in the result set:

| Variant | Fields | Description |
|---------|--------|-------------|
| AllColumns | - | SELECT * (all columns from all tables) |
| AllColumnsExcept | exclude: Vec<String> | SELECT * EXCEPT (col1, col2, ...) |
| Column | name: String, alias: Option<String> | Named column with optional alias |
| Computed | expr: ScalarExpr<String>, alias: String | Computed expression (e.g., col1 + col2 AS sum) |

The executor translates these into ScanProjection instances that specify which columns to fetch from storage.

Sources: llkv-plan/src/plans.rs:998-1013

AggregateExpr - Aggregate Functions

AggregateExpr describes aggregate function calls in SELECT or HAVING clauses:

Diagram: AggregateExpr Variants

graph LR
    AggregateExpr["AggregateExpr"]
CountStar["CountStar\nalias, distinct"]
Column["Column\ncolumn, alias,\nfunction, distinct"]
Functions["AggregateFunction"]
Count["Count"]
SumInt64["SumInt64"]
TotalInt64["TotalInt64"]
MinInt64["MinInt64"]
MaxInt64["MaxInt64"]
CountNulls["CountNulls"]
GroupConcat["GroupConcat"]
AggregateExpr --> CountStar
 
   AggregateExpr --> Column
 
   Column --> Functions
 
   Functions --> Count
 
   Functions --> SumInt64
 
   Functions --> TotalInt64
 
   Functions --> MinInt64
 
   Functions --> MaxInt64
 
   Functions --> CountNulls
 
   Functions --> GroupConcat

The executor delegates to llkv-aggregate for accumulator-based evaluation.

Sources: llkv-plan/src/plans.rs:1028-1120

OrderByPlan - Sort Specification

OrderByPlan defines ORDER BY clause semantics:

| Field | Type | Description |
|-------|------|-------------|
| target | OrderTarget | Column name, projection index, or All |
| sort_type | OrderSortType | Native or CastTextToInteger |
| ascending | bool | Sort direction (ASC/DESC) |
| nulls_first | bool | NULL placement (NULLS FIRST/LAST) |

OrderTarget variants:

  • Column(String) - Sort by named column
  • Index(usize) - Sort by projection position (1-based in SQL)
  • All - Specialized SQLite behavior for sorting all columns

Sources: llkv-plan/src/plans.rs:1195-1217

CompoundSelectPlan - Set Operations

CompoundSelectPlan represents UNION, INTERSECT, and EXCEPT operations:

| Field | Type | Description |
|-------|------|-------------|
| initial | Box<SelectPlan> | First SELECT in the compound |
| operations | Vec<CompoundSelectComponent> | Subsequent set operations |

Each CompoundSelectComponent contains:

  • operator: CompoundOperator (Union, Intersect, Except)
  • quantifier: CompoundQuantifier (Distinct, All)
  • plan: SelectPlan for the right-hand side

The executor processes these sequentially, maintaining distinct caches for DISTINCT quantifiers.

Sources: llkv-plan/src/plans.rs:946-996

InsertPlan Structure

InsertPlan encapsulates data insertion operations with conflict resolution strategies:

| Field | Type | Description |
|-------|------|-------------|
| table | String | Target table name |
| columns | Vec<String> | Column names (empty means all columns) |
| source | InsertSource | Data source (rows, batches, or SELECT) |
| on_conflict | InsertConflictAction | Conflict resolution strategy |

InsertSource Variants

| Variant | Description |
|---------|-------------|
| Rows(Vec<Vec<PlanValue>>) | Explicit value rows from INSERT VALUES |
| Batches(Vec<RecordBatch>) | Pre-materialized Arrow batches |
| Select { plan: Box<SelectPlan> } | INSERT INTO ... SELECT ... |

InsertConflictAction Variants

SQLite-compatible conflict resolution actions:

| Variant | Behavior |
|---------|----------|
| None | Standard behavior - fail on constraint violation |
| Replace | UPDATE existing row on conflict (INSERT OR REPLACE) |
| Ignore | Skip conflicting rows (INSERT OR IGNORE) |
| Abort | Abort transaction on conflict |
| Fail | Fail statement without rollback |
| Rollback | Rollback entire transaction |

Sources: llkv-plan/src/plans.rs:620-655

UpdatePlan and DeletePlan

UpdatePlan

UpdatePlan specifies row updates with optional filtering:

| Field | Type | Description |
|-------|------|-------------|
| table | String | Target table name |
| assignments | Vec<ColumnAssignment> | Column updates |
| filter | Option<Expr<String>> | WHERE clause predicate |

Each ColumnAssignment contains:

  • column: Target column name
  • value: AssignmentValue (literal or expression)

AssignmentValue variants:

  • Literal(PlanValue) - Static value (e.g., SET col = 42)
  • Expression(ScalarExpr<String>) - Computed value (e.g., SET col = col + 1)

Sources: llkv-plan/src/plans.rs:661-682

DeletePlan

DeletePlan specifies row deletions:

| Field | Type | Description |
|-------|------|-------------|
| table | String | Target table name |
| filter | Option<Expr<String>> | WHERE clause predicate |

A missing filter indicates DELETE FROM table (deletes all rows).

Sources: llkv-plan/src/plans.rs:687-692

TruncatePlan

TruncatePlan represents TRUNCATE TABLE (removes all rows, resets sequences):

| Field | Type | Description |
|-------|------|-------------|
| table | String | Target table name |

Sources: llkv-plan/src/plans.rs:698-702

DDL Plan Structures

CreateTablePlan

CreateTablePlan defines table creation with schema, constraints, and data sources:

| Field | Type | Description |
|-------|------|-------------|
| name | String | Table name |
| if_not_exists | bool | Skip if table exists |
| or_replace | bool | Replace existing table |
| columns | Vec<PlanColumnSpec> | Column definitions |
| source | Option<CreateTableSource> | Optional CREATE TABLE AS data |
| namespace | Option<String> | Storage namespace (e.g., "temp") |
| foreign_keys | Vec<ForeignKeySpec> | Foreign key constraints |
| multi_column_uniques | Vec<MultiColumnUniqueSpec> | Multi-column UNIQUE constraints |

Sources: llkv-plan/src/plans.rs:176-203

PlanColumnSpec

PlanColumnSpec describes individual column metadata:

| Field | Type | Description |
|-------|------|-------------|
| name | String | Column name |
| data_type | DataType | Arrow data type |
| nullable | bool | NULL allowed |
| primary_key | bool | PRIMARY KEY constraint |
| unique | bool | UNIQUE constraint |
| check_expr | Option<String> | CHECK constraint SQL expression |

The IntoPlanColumnSpec trait enables ergonomic column specification using tuples like ("col_name", DataType::Int64, NotNull).

Sources: llkv-plan/src/plans.rs:499-605

CreateIndexPlan

CreateIndexPlan specifies index creation:

| Field | Type | Description |
|-------|------|-------------|
| name | Option<String> | Index name (auto-generated if None) |
| table | String | Target table |
| unique | bool | UNIQUE index constraint |
| if_not_exists | bool | Skip if index exists |
| columns | Vec<IndexColumnPlan> | Indexed columns with sort order |

Each IndexColumnPlan specifies:

  • name: Column name
  • ascending: Sort direction (ASC/DESC)
  • nulls_first: NULL placement

Sources: llkv-plan/src/plans.rs:433-497

AlterTablePlan

AlterTablePlan represents ALTER TABLE operations:

| Field | Type | Description |
|-------|------|-------------|
| table_name | String | Target table |
| if_exists | bool | Skip if table missing |
| operation | AlterTableOperation | Specific operation |

AlterTableOperation variants:

| Variant | Fields | Description |
|---------|--------|-------------|
| RenameColumn | old_column_name: String, new_column_name: String | RENAME COLUMN |
| SetColumnDataType | column_name: String, new_data_type: String | ALTER COLUMN SET DATA TYPE |
| DropColumn | column_name: String, if_exists: bool, cascade: bool | DROP COLUMN |

Sources: llkv-plan/src/plans.rs:364-406

Additional DDL Plans

| Plan Type | Purpose | Key Fields |
|-----------|---------|------------|
| DropTablePlan | DROP TABLE | name, if_exists |
| CreateViewPlan | CREATE VIEW | name, view_definition, select_plan |
| DropViewPlan | DROP VIEW | name, if_exists |
| RenameTablePlan | RENAME TABLE | current_name, new_name, if_exists |
| DropIndexPlan | DROP INDEX | name, canonical_name, if_exists |
| ReindexPlan | REINDEX | name, canonical_name |

Sources: llkv-plan/src/plans.rs:209-358

PlanValue - Value Representation

PlanValue provides a type-safe representation of literal values in plans, bridging SQL literals and Arrow arrays:

| Variant | Description |
|---------|-------------|
| Null | SQL NULL |
| Integer(i64) | Integer value (booleans stored as 0/1) |
| Float(f64) | Floating-point value |
| Decimal(DecimalValue) | Fixed-precision decimal |
| String(String) | Text value |
| Date32(i32) | Date (days since epoch) |
| Struct(FxHashMap<String, PlanValue>) | Nested struct value |
| Interval(IntervalValue) | Interval (months, days, nanos) |

PlanValue implements From<T> for common types (i64, f64, String, bool) for ergonomic plan construction. The plan_value_from_literal() function converts llkv_expr::Literal to PlanValue, and plan_value_from_array() extracts values from Arrow arrays during INSERT SELECT operations.

Sources: llkv-plan/src/plans.rs:73-161 llkv-plan/src/plans.rs:1122-1189

Plan Translation Flow

Diagram: Plan Translation and Execution Flow

Plans serve as the interface contract between the SQL layer (llkv-sql) and the execution layer (llkv-runtime, llkv-executor). The translation layer in llkv-sql converts sqlparser AST nodes into strongly-typed plan structures, which the runtime validates and dispatches to appropriate executors.

Sources: llkv-plan/README.md:13-33 llkv-executor/README.md:12-31

Plan Construction Patterns

Builder Pattern for SelectPlan

SelectPlan uses fluent builder methods for incremental construction:
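A hypothetical shape of such a builder chain (method and constructor names here are illustrative only, not the crate's actual API; the real builder methods are at llkv-plan/src/plans.rs:827-943):

```rust
// Illustrative only: exact constructor/builder names are assumptions.
let plan = SelectPlan::new(vec![TableRef::new("", "users", None)])
    .with_projection(SelectProjection::AllColumns)
    .with_filter(filter_expr)        // Expr<String> predicate
    .with_order_by(order_by_plans);  // Vec<OrderByPlan>
```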

Sources: llkv-plan/src/plans.rs:827-943

Tuple-Based Column Specs

PlanColumnSpec implements IntoPlanColumnSpec for tuples, enabling concise table definitions:
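A sketch of a call site using the tuple form shown in the PlanColumnSpec section above (the surrounding constructor and `with_columns` method are hypothetical; only the tuple pattern itself is taken from the crate's documentation):

```rust
use arrow::datatypes::DataType;

// Hypothetical usage: the tuples satisfy IntoPlanColumnSpec, so a table
// definition can be written without spelling out PlanColumnSpec by hand.
// `NotNull` is the crate's nullability marker from the earlier example.
let plan = CreateTablePlan::new("users")
    .with_columns([
        ("id", DataType::Int64, NotNull),
        ("name", DataType::Utf8, NotNull),
    ]);
```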

Sources: llkv-plan/src/plans.rs:548-605

Integration with Expression System

Plans reference expressions from llkv-expr using parameterized types:

  • Expr<String>: Boolean predicates with column names (WHERE, HAVING, ON clauses)
  • ScalarExpr<String>: Scalar expressions with column names (projections, assignments)

The executor translates these to Expr<FieldId> and ScalarExpr<FieldId> after resolving column names against table schemas. For details on expression evaluation, see Expression AST and Expression Translation.

Sources: llkv-plan/src/plans.rs:28-34 llkv-plan/src/plans.rs:666-674



Subquery and Correlation Handling

Relevant source files

This page documents how LLKV handles subqueries and correlated column references during query planning and execution. Subqueries can appear in WHERE clauses (as EXISTS predicates) or in SELECT projections (as scalar subqueries). When a subquery references columns from its outer query, it is called a correlated subquery , requiring special handling to bind outer row values during execution.

For information about expression evaluation and compilation, see Expression System. For query execution flow, see Query Execution.


Purpose and Scope

Subquery handling in LLKV involves three distinct phases:

  1. Detection and Tracking - During SQL translation, the planner identifies subqueries and tracks which outer columns they reference
  2. Placeholder Injection - Correlated columns are replaced with synthetic placeholder identifiers in the subquery's expression tree
  3. Binding and Execution - At runtime, for each outer row, placeholders are replaced with actual values and the subquery is executed

This document covers the data structures, algorithms, and execution flow for both filter subqueries (EXISTS/NOT EXISTS) and scalar subqueries (single-value returns).


Core Data Structures

LLKV represents subqueries and their correlation metadata through several interconnected structures defined in llkv-plan.

classDiagram
    class SelectFilter {+Expr~String~ predicate\n+Vec~FilterSubquery~ subqueries}
    
    class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
    
    class ScalarSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
    
    class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
    
    class SelectPlan {+Vec~TableRef~ tables\n+Option~SelectFilter~ filter\n+Vec~ScalarSubquery~ scalar_subqueries\n+Vec~SelectProjection~ projections}
    
    SelectPlan --> SelectFilter : filter
    SelectPlan --> ScalarSubquery : scalar_subqueries
    SelectFilter --> FilterSubquery : subqueries
    FilterSubquery --> CorrelatedColumn : correlated_columns
    ScalarSubquery --> CorrelatedColumn : correlated_columns
    FilterSubquery --> SelectPlan : plan
    ScalarSubquery --> SelectPlan : plan

Subquery Plan Structures

Sources: llkv-plan/src/plans.rs:28-67

Field Descriptions

| Structure | Field | Purpose |
|-----------|-------|---------|
| FilterSubquery | id | Unique identifier used to match subquery in expression tree |
| FilterSubquery | plan | Nested SELECT plan to execute for each outer row |
| FilterSubquery | correlated_columns | Mappings from placeholder to real outer column |
| ScalarSubquery | id | Unique identifier for scalar subquery references |
| ScalarSubquery | plan | SELECT plan that must return single column/row |
| CorrelatedColumn | placeholder | Synthetic column name injected into subquery expressions |
| CorrelatedColumn | column | Canonical outer column name |
| CorrelatedColumn | field_path | Nested field access path for struct columns |

Sources: llkv-plan/src/plans.rs:36-67


Correlation Tracking During Planning

During SQL-to-plan translation, LLKV uses the SubqueryCorrelatedTracker to detect when a subquery references outer columns. This tracker is passed through the expression translation pipeline and records each outer column access.

graph TB
    subgraph "SQL Translation Phase"
        SQL["SQL Query String"]
Parser["sqlparser AST"]
end
    
    subgraph "Subquery Detection"
        TranslateExpr["translate_predicate / translate_scalar"]
Tracker["SubqueryCorrelatedTracker"]
Resolver["IdentifierResolver"]
end
    
    subgraph "Placeholder Injection"
        OuterColumn["Outer Column Reference"]
Placeholder["Synthetic Placeholder"]
Recording["CorrelatedColumn Entry"]
end
    
    subgraph "Plan Output"
        FilterSubquery["FilterSubquery"]
ScalarSubquery["ScalarSubquery"]
SelectPlan["SelectPlan"]
end
    
 
   SQL --> Parser
 
   Parser --> TranslateExpr
 
   TranslateExpr --> Tracker
 
   TranslateExpr --> Resolver
    
 
   Tracker --> OuterColumn
 
   OuterColumn --> Placeholder
 
   Placeholder --> Recording
    
 
   Recording --> FilterSubquery
 
   Recording --> ScalarSubquery
 
   FilterSubquery --> SelectPlan
 
   ScalarSubquery --> SelectPlan

Tracker Architecture

Sources: llkv-sql/src/sql_engine.rs:24 llkv-sql/src/sql_engine.rs:326-363

Placeholder Generation

When the tracker detects an outer column reference in a subquery, it:

  1. Generates a unique placeholder string (e.g., "__correlated_0__")
  2. Records the mapping: placeholder → (outer_column, field_path)
  3. Returns the placeholder to the expression translator
  4. The placeholder is embedded in the subquery's expression tree instead of the original column name

This allows the subquery plan to be "generic" - it references placeholders that will be bound to actual values at execution time.
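As a concrete illustration, a recorded correlation for an outer column `orders.user_id` would look roughly like this (the field names follow the CorrelatedColumn definition above; the values are made up):

```rust
// One recorded correlation: the subquery's expression tree now refers to the
// synthetic placeholder instead of the outer column.
let correlation = CorrelatedColumn {
    placeholder: "__correlated_0__".to_string(), // injected into the subquery AST
    column: "orders.user_id".to_string(),        // canonical outer column name
    field_path: vec![],                          // non-empty only for struct field access
};
```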

Sources: llkv-sql/src/sql_engine.rs:337-351

Tracker Extension Traits

The SubqueryCorrelatedTrackerExt trait provides a convenience method to request placeholders directly from catalog resolution results, avoiding repetitive unpacking of ColumnResolution fields.

The SubqueryCorrelatedTrackerOptionExt trait enables chaining optional tracker references through nested translation helpers without explicit as_mut() calls.

Sources: llkv-sql/src/sql_engine.rs:337-363


Subquery Types and Execution

LLKV supports two categories of subqueries, each with distinct execution semantics.

sequenceDiagram
    participant Executor as QueryExecutor
    participant Filter as Filter Evaluation
    participant Subquery as EXISTS Subquery
    participant Binding as Binding Logic
    participant Inner as Inner SelectPlan
    
    Executor->>Filter: evaluate_predicate_mask()
    Filter->>Filter: encounter Expr::Exists
    Filter->>Subquery: evaluate_exists_subquery()
    Subquery->>Binding: collect_correlated_bindings()
    Binding->>Binding: extract outer row values
    Binding-->>Subquery: bindings map
    Subquery->>Inner: bind_select_plan()
    Inner->>Inner: replace placeholders with values
    Inner-->>Subquery: bound SelectPlan
    Subquery->>Executor: execute_select(bound_plan)
    Executor-->>Subquery: SelectExecution stream
    Subquery->>Subquery: check if num_rows > 0
    Subquery-->>Filter: boolean result
    Filter-->>Executor: BooleanArray mask

Filter Subqueries (EXISTS Predicates)

Filter subqueries appear in WHERE clauses as EXISTS or NOT EXISTS predicates. They return a boolean indicating whether the subquery produced any rows.

Sources: llkv-executor/src/lib.rs:773-792

Scalar Subqueries (Projection Values)

Scalar subqueries appear in SELECT projections and must return exactly one column and at most one row. They are evaluated into a single literal value for each outer row.

Sources: llkv-executor/src/lib.rs:794-891

sequenceDiagram
    participant Executor as QueryExecutor
    participant Projection as Projection Logic
    participant Subquery as Scalar Subquery Evaluator
    participant Binding as Binding Logic
    participant Inner as Inner SelectPlan
    
    Executor->>Projection: evaluate_projection_expression()
    Projection->>Projection: encounter ScalarExpr::ScalarSubquery
    Projection->>Subquery: evaluate_scalar_subquery_numeric()
    
    loop For each outer row
        Subquery->>Subquery: evaluate_scalar_subquery_literal()
        Subquery->>Binding: collect_correlated_bindings()
        Binding-->>Subquery: bindings for current row
        Subquery->>Inner: bind_select_plan()
        Inner-->>Subquery: bound plan
        Subquery->>Executor: execute_select()
        Executor-->>Subquery: result batches
        Subquery->>Subquery: validate single column/row
        Subquery->>Subquery: convert to Literal
        Subquery-->>Projection: literal value
    end
    
    Projection->>Projection: build NumericArray from literals
    Projection-->>Executor: computed column array

Binding Process

The binding process replaces placeholder identifiers in a subquery plan with actual values from the current outer row.

Correlated Binding Collection

The collect_correlated_bindings function builds a map from placeholder strings to concrete Literal values by:

  1. Iterating over each CorrelatedColumn in the subquery metadata
  2. Looking up the outer column in the current RecordBatch
  3. Extracting the value at the current row index
  4. Converting the Arrow array value to a Literal
  5. Storing the mapping: placeholder → Literal

Sources: Referenced in llkv-executor/src/lib.rs:781 and llkv-executor/src/lib.rs:802
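A simplified, self-contained sketch of this binding step (it uses a local `Literal` enum, a std HashMap instead of FxHashMap, and handles only Int64/Utf8 columns; the real implementation supports the full Literal type from llkv-expr):

```rust
use std::collections::HashMap;

use arrow::array::{Array, Int64Array, StringArray};
use arrow::record_batch::RecordBatch;

/// Simplified stand-in for llkv_expr::Literal.
#[derive(Debug, Clone)]
enum Literal {
    Null,
    Integer(i64),
    String(String),
}

/// Sketch of collect_correlated_bindings: map each placeholder to the value
/// of its outer column at `row_idx`. Only two Arrow types are handled here.
fn collect_correlated_bindings(
    outer: &RecordBatch,
    row_idx: usize,
    correlated: &[(String, String)], // (placeholder, outer column name)
) -> HashMap<String, Literal> {
    let mut bindings = HashMap::new();
    for (placeholder, column) in correlated {
        let col_idx = outer.schema().index_of(column).expect("outer column exists");
        let array = outer.column(col_idx);
        let literal = if array.is_null(row_idx) {
            Literal::Null
        } else if let Some(ints) = array.as_any().downcast_ref::<Int64Array>() {
            Literal::Integer(ints.value(row_idx))
        } else if let Some(strings) = array.as_any().downcast_ref::<StringArray>() {
            Literal::String(strings.value(row_idx).to_string())
        } else {
            unimplemented!("other Arrow types omitted from this sketch")
        };
        bindings.insert(placeholder.clone(), literal);
    }
    bindings
}
```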

graph LR
    subgraph "Input"
        OuterBatch["RecordBatch\n(outer query result)"]
RowIndex["Current Row Index"]
Metadata["Vec&lt;CorrelatedColumn&gt;"]
end
    
    subgraph "Processing"
        Iterate["For each CorrelatedColumn"]
Lookup["Find column in schema"]
Extract["Extract array[row_idx]"]
Convert["Convert to Literal"]
end
    
    subgraph "Output"
        Bindings["FxHashMap&lt;placeholder, Literal&gt;"]
end
    
 
   OuterBatch --> Iterate
 
   RowIndex --> Iterate
 
   Metadata --> Iterate
    
 
   Iterate --> Lookup
 
   Lookup --> Extract
 
   Extract --> Convert
 
   Convert --> Bindings

Plan Binding

The bind_select_plan function takes a subquery SelectPlan and a bindings map, then recursively replaces:

  • Placeholder column references in filter expressions with Expr::Literal
  • Placeholder column references in projections with ScalarExpr::Literal
  • Placeholder column references in HAVING clauses
  • Placeholder references in nested subqueries

This produces a new SelectPlan that is fully "grounded" with the outer row's values and can be executed independently.

Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803


Execution Flow: EXISTS Subquery Example

Consider the query:

Planning Phase

Sources: llkv-plan/src/plans.rs:36-46

Execution Phase

Sources: llkv-executor/src/lib.rs:773-792


Execution Flow: Scalar Subquery Example

Consider the query:

Planning Phase

Sources: llkv-plan/src/plans.rs:48-56

Execution Phase

Sources: llkv-executor/src/lib.rs:834-891


Cross-Product Integration

When subqueries appear in cross-product (multi-table) queries, the binding process works identically but must resolve outer column names through the combined schema.

Cross-Product Expression Context

The CrossProductExpressionContext maintains:

  • Combined schema from all tables in the FROM clause
  • Column lookup map (qualified names → column indices)
  • Numeric array cache for evaluated expressions
  • Synthetic field ID allocation for subquery results

When evaluating a filter or projection expression that contains subqueries, the context:

  1. Detects subquery references by SubqueryId
  2. Calls the appropriate evaluator (evaluate_exists_subquery or evaluate_scalar_subquery_numeric)
  3. Passes the combined schema and current row to the binding logic
  4. Caches numeric results for scalar subqueries to avoid re-evaluation

Sources: llkv-executor/src/lib.rs:1329-1383 llkv-executor/src/lib.rs:1429-1502


Validation and Error Handling

The executor enforces several constraints on subquery results:

| Constraint | Applies To | Error Condition |
|------------|------------|-----------------|
| Single column | Scalar subqueries | num_columns() != 1 |
| At most one row | Scalar subqueries | num_rows() > 1 |
| Result present | N/A (returns NULL) | num_rows() == 0 for scalar subquery |

Error Examples

Scalar Subquery: Multiple Columns

Scalar Subquery: Multiple Rows

Sources: llkv-executor/src/lib.rs:808-819


Subquery ID Assignment

SubqueryId is a newtype wrapper around usize defined in llkv-expr. The planner assigns sequential IDs as it encounters subqueries during translation, ensuring each subquery has a unique identifier that persists from planning through execution.

The executor uses these IDs to:

  • Look up subquery metadata in the plan's scalar_subqueries or filter's subqueries vectors
  • Match subquery references in expression trees (ScalarExpr::ScalarSubquery or Expr::Exists) to their plans
  • Cache evaluation results (for scalar subqueries appearing multiple times)

Sources: llkv-expr (SubqueryId definition), referenced in llkv-plan/src/plans.rs:15


Recursive Subquery Support

LLKV supports nested subqueries where an inner subquery can itself contain subqueries. The binding process is recursive:

  1. Bind outer-level placeholders in the top-level subquery plan
  2. For any nested subqueries within that plan, repeat the binding process
  3. Continue until all correlation layers are resolved

This is handled automatically by bind_select_plan which traverses the entire plan tree.

Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803


Performance Considerations

Correlation Overhead

Correlated subqueries are executed once per outer row, which can be expensive:

  • An outer query returning N rows with a correlated subquery executes N + 1 queries total
  • For scalar subqueries in projections with multiple references, results are cached per subquery ID
  • EXISTS subqueries short-circuit as soon as a matching row is found (first batch with num_rows() > 0)

Uncorrelated Subqueries

If a subquery contains no correlated columns (empty correlated_columns vector), it could theoretically be executed once and reused. However, LLKV's current implementation still executes it per outer row. Future optimizations could detect this case and cache the result.

Sources: llkv-executor/src/lib.rs:773-891


Summary

LLKV's subquery handling follows a three-phase model:

  1. Planning : The SubqueryCorrelatedTracker detects outer column references and generates placeholders. Plans are built with FilterSubquery or ScalarSubquery structures containing correlation metadata.

  2. Binding : At execution time, collect_correlated_bindings extracts outer row values and bind_select_plan replaces placeholders with literals, producing a grounded plan.

  3. Execution : The bound plan is executed as an independent query. EXISTS subqueries return a boolean; scalar subqueries return a single literal (or NULL if empty).

This design keeps the subquery plan generic during planning and binds it dynamically at execution time, enabling proper correlation semantics while maintaining the separation between planning and execution layers.

Sources: llkv-plan/src/plans.rs:28-67 llkv-sql/src/sql_engine.rs:24 llkv-sql/src/sql_engine.rs:326-363 llkv-executor/src/lib.rs:773-891



Expression System

Relevant source files

The expression system provides the intermediate representation for predicates, projections, and computations in LLKV. It defines a strongly-typed Abstract Syntax Tree (AST) that decouples query logic from concrete storage formats, enabling optimization and efficient evaluation against Apache Arrow data structures.

This page covers the overall architecture of the expression system. For details on specific components:

  • Expression AST structure and types: see 5.1
  • Column name resolution and translation: see 5.2
  • Bytecode compilation and normalization: see 5.3
  • Scalar evaluation engine: see 5.4
  • Aggregate function evaluation: see 5.5

Purpose and Scope

The expression system serves three primary functions:

  1. Representation : Defines generic AST types (Expr<F>, ScalarExpr<F>) parameterized over field identifiers, supporting both string-based column names (during planning) and integer field IDs (during execution)

  2. Translation : Resolves column names to storage field identifiers by consulting the catalog, enabling type-safe access to table columns

  3. Compilation : Transforms normalized expressions into stack-based bytecode (EvalProgram, DomainProgram) for efficient vectorized evaluation

The system is located primarily in llkv-expr/, with translation logic in llkv-executor/src/translation/ and compilation in llkv-table/src/planner/program.rs.

Sources: llkv-expr/src/expr.rs:1-749 llkv-executor/src/translation/expression.rs:1-424 llkv-table/src/planner/program.rs:1-710

Expression Type Hierarchy

LLKV uses two primary expression types, both generic over the field identifier type F:

Expression Types

| Type | Purpose | Variants | Example |
|------|---------|----------|---------|
| Expr<'a, F> | Boolean predicates for filtering rows | And, Or, Not, Pred, Compare, InList, IsNull, Literal, Exists | WHERE age > 18 AND status = 'active' |
| ScalarExpr<F> | Arithmetic/scalar computations returning values | Column, Literal, Binary, Aggregate, Cast, Case, Coalesce, GetField, Compare, Not, IsNull, Random, ScalarSubquery | SELECT price * 1.1 AS adjusted_price |
| Filter<'a, F> | Single-field predicate | Field ID + Operator | age > 18 |
| Operator<'a> | Comparison operator against literals | Equals, Range, GreaterThan, LessThan, In, StartsWith, EndsWith, Contains, IsNull, IsNotNull | IN (1, 2, 3) |

Type Parameterization Flow

Sources: llkv-expr/src/expr.rs:14-333 llkv-plan/src/plans.rs:28-67 llkv-executor/src/translation/expression.rs:18-174

Expression Lifecycle

Expressions flow through multiple stages from SQL text to execution against storage:

Stage 1: Planning

The SQL layer (llkv-sql) parses SQL statements and constructs plan structures containing expressions. At this stage, column references use string names from the SQL text:

  • Predicates : Stored as Expr<'static, String> in SelectFilter
  • Projections : Stored as ScalarExpr<String> in SelectProjection::Computed
  • Assignments : Stored as ScalarExpr<String> in UpdatePlan::assignments

Sources: llkv-plan/src/plans.rs:28-34 llkv-sql/src/translator/mod.rs (inferred from architecture)

Stage 2: Translation

The executor translates string-based expressions to field-ID-based expressions by consulting the table schema:

  1. Column Resolution : translate_predicate and translate_scalar walk the expression tree
  2. Schema Lookup : Each column name is resolved to a FieldId using ExecutorSchema::resolve
  3. Type Inference : For computed projections, infer_computed_data_type determines the Arrow data type
  4. Special Columns : System columns like "rowid" map to special field IDs (e.g., ROW_ID_FIELD_ID)

Translation is implemented via iterative traversal to avoid stack overflow on deeply nested expressions (50k+ nodes).

Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:176-387 llkv-executor/src/translation/schema.rs:53-123

Stage 3: Compilation

The table layer compiles Expr<FieldId> into bytecode for efficient evaluation:

  1. Normalization : normalize_predicate applies De Morgan's laws and flattens nested AND/OR nodes

  2. Compilation : ProgramCompiler::compile generates two programs:

    • EvalProgram: Stack-based bytecode for predicate evaluation
    • DomainProgram: Bytecode for tracking which fields affect row visibility
  3. Fusion Optimization : Multiple predicates on the same field are fused into a single FusedAnd operation

graph TB
    Input["Expr&lt;FieldId&gt;\nNOT(age &gt; 18 AND status = 'active')"]
Norm["normalize_predicate\nApply De Morgan's law"]
Normal["Expr&lt;FieldId&gt;\nOR([\n NOT(age &gt; 18),\n NOT(status = 'active')\n])"]
Compile["ProgramCompiler::compile"]
subgraph "Output Programs"
        Eval["EvalProgram\nops: [\n PushCompare(age, Gt, 18),\n Not,\n PushCompare(status, Eq, 'active'),\n Not,\n Or(2)\n]"]
Domain["DomainProgram\nops: [\n PushCompareDomain(...),\n PushCompareDomain(...),\n Union(2)\n]"]
end
    
 
   Input --> Norm
 
   Norm --> Normal
 
   Normal --> Compile
 
   Compile --> Eval
 
   Compile --> Domain

Sources: llkv-table/src/planner/program.rs:286-318 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631

Stage 4: Evaluation

Compiled programs are executed against Arrow RecordBatch data:

  1. EvalProgram : Evaluates predicates row-by-row using a value stack, producing boolean results
  2. DomainProgram : Identifies which row IDs could possibly match (used for optimization)
  3. ScalarExpr : Evaluated via NumericKernels for vectorized arithmetic operations

The evaluation engine handles Arrow's columnar format efficiently through zero-copy operations and SIMD-friendly algorithms.

Sources: llkv-table/src/planner/evaluator.rs (inferred from architecture), llkv-executor/src/lib.rs:254-296

Key Components

ProgramCompiler

The ProgramCompiler compiles normalized expressions into bytecode.

Key Optimizations :

  • Predicate Fusion : gather_fused detects multiple predicates on the same field and emits FusedAnd operations
  • Domain Caching : Domain programs are memoized by expression identity to avoid recompilation
  • Stack-Based Evaluation : Operations push/pop from a value stack, avoiding recursive evaluation overhead

Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:518-542

Bytecode Operations

EvalOp Variants

| Operation | Purpose | Stack Effect |
|-----------|---------|--------------|
| PushPredicate(filter) | Evaluate single-field predicate | Push boolean |
| PushCompare{left, op, right} | Evaluate comparison between scalar expressions | Push boolean |
| PushInList{expr, list, negated} | Evaluate IN/NOT IN list membership | Push boolean |
| PushIsNull{expr, negated} | Evaluate IS NULL / IS NOT NULL | Push boolean |
| PushLiteral(bool) | Push constant boolean | Push boolean |
| FusedAnd{field_id, filters} | Evaluate multiple predicates on same field (optimized) | Push boolean |
| And{child_count} | Pop N booleans, push AND result | Pop N, push 1 |
| Or{child_count} | Pop N booleans, push OR result | Pop N, push 1 |
| Not{domain} | Pop boolean, negate, push result (uses domain for optimization) | Pop 1, push 1 |

DomainOp Variants

| Operation | Purpose | Stack Effect |
|-----------|---------|--------------|
| PushFieldAll(field_id) | All rows where field exists | Push RowSet |
| PushCompareDomain{left, right, op, fields} | Rows where comparison could be true | Push RowSet |
| PushInListDomain{expr, list, fields, negated} | Rows where IN list could be true | Push RowSet |
| PushIsNullDomain{expr, fields, negated} | Rows where NULL test could be true | Push RowSet |
| PushLiteralFalse | Empty row set | Push RowSet |
| PushAllRows | All rows | Push RowSet |
| Union{child_count} | Pop N row sets, push union | Pop N, push 1 |
| Intersect{child_count} | Pop N row sets, push intersection | Pop N, push 1 |

Sources: llkv-table/src/planner/program.rs:36-67 llkv-table/src/planner/program.rs:221-254

Expression Translation

Translation resolves column names to field IDs through schema lookup:

Special Handling :

  • Rowid Column : "rowid" (case-insensitive) maps to ROW_ID_FIELD_ID constant
  • Flexible Matching : Supports qualified names (table.column) and unqualified names (column)
  • Error Handling : Unknown columns produce descriptive error messages with the column name

Sources: llkv-executor/src/translation/expression.rs:390-407 llkv-executor/src/translation/expression.rs:410-423

Type Inference

The executor infers Arrow data types for computed projections to construct the output schema:

Type Inference Rules

| Expression Pattern | Inferred Type |
|--------------------|---------------|
| Literal(Integer) | DataType::Int64 |
| Literal(Float) | DataType::Float64 |
| Literal(Decimal(v)) | DataType::Decimal128(v.precision(), v.scale()) |
| Literal(String) | DataType::Utf8 |
| Literal(Date32) | DataType::Date32 |
| Literal(Boolean) | DataType::Boolean |
| Literal(Null) | DataType::Null |
| Column(field_id) | Lookup from schema, normalized to Int64/Float64 |
| Binary{...} | Float64 if any operand is float, else Int64 |
| Compare{...} | DataType::Int64 (boolean as integer) |
| Aggregate(...) | DataType::Int64 (most aggregates) |
| Cast{data_type, ...} | data_type (explicit cast) |
| Random | DataType::Float64 |

Numeric Type Normalization : Small integers (Int8, Int16, Int32, Boolean) normalize to Int64, while all floating-point types normalize to Float64. This simplifies arithmetic evaluation.

Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:125-147 llkv-executor/src/translation/schema.rs:149-243

Subquery Support

Expressions support two types of subqueries:

EXISTS Predicates

Used in WHERE clauses to test for row existence:

  • Structure : Expr::Exists(SubqueryExpr{id, negated})
  • Planning : Stored in SelectFilter::subqueries with correlated column bindings
  • Evaluation : Executor binds outer row values to correlated columns, executes subquery plan, returns true if any rows match

Scalar Subqueries

Used in projections to return a single value:

  • Structure : ScalarExpr::ScalarSubquery(ScalarSubqueryExpr{id})
  • Planning : Stored in SelectPlan::scalar_subqueries with correlated column bindings
  • Evaluation : Executor binds outer row values, executes subquery, extracts single value
  • Error Handling : Returns error if subquery returns multiple rows or columns

Sources: llkv-expr/src/expr.rs:42-63 llkv-plan/src/plans.rs:36-56 llkv-executor/src/lib.rs:774-839

Normalization

Expression normalization applies logical transformations before compilation:

De Morgan's Laws

NOT is pushed down through AND/OR using De Morgan's laws:

  • NOT(A AND B)NOT(A) OR NOT(B)
  • NOT(A OR B)NOT(A) AND NOT(B)
  • NOT(NOT(A))A

Flattening

Nested AND/OR nodes are flattened:

  • AND(A, AND(B, C))AND(A, B, C)
  • OR(A, OR(B, C))OR(A, B, C)

Special Cases

  • NOT(Literal(true))Literal(false)
  • NOT(IsNull{expr, false})IsNull{expr, true}

Sources: llkv-table/src/planner/program.rs:286-343
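A self-contained sketch of the De Morgan step on a miniature expression enum (the real normalize_predicate in llkv-table operates on Expr<FieldId> and also performs the flattening and special cases listed above):

```rust
/// Miniature predicate AST used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum MiniExpr {
    And(Vec<MiniExpr>),
    Or(Vec<MiniExpr>),
    Not(Box<MiniExpr>),
    Literal(bool),
    Pred(&'static str), // opaque leaf predicate
}

/// Push NOT down through AND/OR (De Morgan) and cancel double negation.
fn push_not_down(expr: MiniExpr) -> MiniExpr {
    match expr {
        MiniExpr::Not(inner) => match *inner {
            // NOT(A AND B) => NOT(A) OR NOT(B)
            MiniExpr::And(children) => MiniExpr::Or(
                children.into_iter().map(|c| push_not_down(MiniExpr::Not(Box::new(c)))).collect(),
            ),
            // NOT(A OR B) => NOT(A) AND NOT(B)
            MiniExpr::Or(children) => MiniExpr::And(
                children.into_iter().map(|c| push_not_down(MiniExpr::Not(Box::new(c)))).collect(),
            ),
            // NOT(NOT(A)) => A
            MiniExpr::Not(inner) => push_not_down(*inner),
            // NOT(literal) folds to the opposite literal
            MiniExpr::Literal(value) => MiniExpr::Literal(!value),
            leaf => MiniExpr::Not(Box::new(leaf)),
        },
        MiniExpr::And(children) => MiniExpr::And(children.into_iter().map(push_not_down).collect()),
        MiniExpr::Or(children) => MiniExpr::Or(children.into_iter().map(push_not_down).collect()),
        other => other,
    }
}
```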

Expression Operators

Binary Operators (BinaryOp)

| Operator | Semantics |
|----------|-----------|
| Add | Addition (a + b) |
| Subtract | Subtraction (a - b) |
| Multiply | Multiplication (a * b) |
| Divide | Division (a / b) |
| Modulo | Modulus (a % b) |
| And | Bitwise AND (a & b) |
| Or | Bitwise OR (a \| b) |
| BitwiseShiftLeft | Left shift (a << b) |
| BitwiseShiftRight | Right shift (a >> b) |

Comparison Operators (CompareOp)

| Operator | Semantics |
|----------|-----------|
| Eq | Equality (a = b) |
| NotEq | Inequality (a != b) |
| Lt | Less than (a < b) |
| LtEq | Less than or equal (a <= b) |
| Gt | Greater than (a > b) |
| GtEq | Greater than or equal (a >= b) |

Comparisons in ScalarExpr::Compare return 1 for true, 0 for false, and NULL when an operand is NULL (NULL propagation).

Sources: llkv-expr/src/expr.rs:270-293

Memory Management

Expression Lifetimes

The 'a lifetime parameter in Expr<'a, F> allows borrowed operators to avoid allocations:

  • Operator::In(&'a [Literal]): Borrows slice from call site
  • Operator::StartsWith{pattern: &'a str, ...}: Borrows pattern string
  • Filter<'a, F>: Contains borrowed Operator<'a>

Owned Variants : EvalProgram and DomainProgram use OwnedOperator and OwnedFilter for storage, converting borrowed operators to owned values.

Zero-Copy Pattern

During evaluation, predicates borrow from the compiled program rather than cloning operators, enabling zero-copy predicate evaluation against Arrow arrays.

Sources: llkv-expr/src/expr.rs:295-333 llkv-table/src/planner/program.rs:69-143



Expression AST

Relevant source files

Purpose and Scope

This document describes the expression Abstract Syntax Tree (AST) defined in the llkv-expr crate. The expression AST provides type-aware, Arrow-native data structures for representing boolean predicates and scalar expressions throughout the LLKV system. These AST nodes are decoupled from SQL parsing and can be parameterized by field identifier types, enabling reuse across multiple processing stages.

For information about how expressions are translated between field identifier types, see Expression Translation. For details on how expressions are compiled into executable bytecode, see Program Compilation. For scalar evaluation mechanics, see Scalar Evaluation and NumericKernels.

Sources: llkv-expr/src/expr.rs:1-8

Expression Type Hierarchy

The llkv-expr crate defines two primary expression enums:

  • Expr<'a, F> - Boolean/predicate expressions that evaluate to true or false
  • ScalarExpr<F> - Arithmetic/scalar expressions that produce typed values
graph TB
    subgraph "Expression Type System"
        EXPR["Expr&lt;'a, F&gt;\nBoolean Predicates"]
SCALAR["ScalarExpr&lt;F&gt;\nScalar Values"]
EXPR --> LOGICAL["Logical Operators\nAnd, Or, Not"]
EXPR --> PRED["Pred(Filter)\nField Predicates"]
EXPR --> COMPARE["Compare\nScalar Comparisons"]
EXPR --> INLIST["InList\nSet Membership"]
EXPR --> ISNULL["IsNull\nNull Checks"]
EXPR --> LITERAL["Literal(bool)\nConstant Boolean"]
EXPR --> EXISTS["Exists\nSubquery Predicates"]
SCALAR --> COLUMN["Column(F)\nField Reference"]
SCALAR --> SLITERAL["Literal\nConstant Value"]
SCALAR --> BINARY["Binary\nArithmetic Ops"]
SCALAR --> SNOT["Not\nLogical Negation"]
SCALAR --> SISNULL["IsNull\nNull Test"]
SCALAR --> AGG["Aggregate\nAggregate Functions"]
SCALAR --> GETFIELD["GetField\nStruct Access"]
SCALAR --> CAST["Cast\nType Conversion"]
SCALAR --> SCOMPARE["Compare\nBoolean Result"]
SCALAR --> COALESCE["Coalesce\nFirst Non-Null"]
SCALAR --> SUBQ["ScalarSubquery\nSubquery Value"]
SCALAR --> CASE["Case\nConditional Logic"]
SCALAR --> RANDOM["Random\nRandom Number"]
COMPARE -.uses.-> SCALAR
        INLIST -.uses.-> SCALAR
        ISNULL -.uses.-> SCALAR
    end
    
    style EXPR fill:#e8f5e9
    style SCALAR fill:#e1f5ff

Both types are generic over a field identifier parameter F, which allows the same AST structure to be used with different field representations (typically String column names during planning, or FieldId numeric identifiers during execution).

Sources: llkv-expr/src/expr.rs:14-143

Expr<'a, F> - Boolean Expressions

The Expr<'a, F> enum represents boolean predicate expressions that evaluate to true or false. These are primarily used in WHERE clauses, JOIN conditions, and HAVING clauses.

Expr Variants

| Variant | Description | Use Case |
|---------|-------------|----------|
| And(Vec<Expr>) | Logical AND of multiple predicates | Combining multiple filter conditions |
| Or(Vec<Expr>) | Logical OR of multiple predicates | Alternative filter conditions |
| Not(Box<Expr>) | Logical negation | Inverting a predicate |
| Pred(Filter<'a, F>) | Single-field predicate | Column-level filtering (e.g., age > 18) |
| Compare { left, op, right } | Comparison between scalar expressions | Cross-column comparisons (e.g., col1 + col2 > 10) |
| InList { expr, list, negated } | Set membership test | IN/NOT IN clauses |
| IsNull { expr, negated } | Null check on expression | IS NULL/IS NOT NULL on complex expressions |
| Literal(bool) | Constant boolean value | Always-true/always-false conditions |
| Exists(SubqueryExpr) | Correlated subquery predicate | EXISTS/NOT EXISTS clauses |

Sources: llkv-expr/src/expr.rs:14-43

Expr Construction Helpers

The Expr type provides builder methods for common patterns.

These helpers simplify construction of common predicate patterns during query planning.

Sources: llkv-expr/src/expr.rs:65-84

Filter and Operator Types

The Pred variant wraps a Filter<'a, F> struct, which represents a single predicate against a field:
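A sketch of its shape (the field names are assumed from the descriptions in this document; the authoritative definition is at llkv-expr/src/expr.rs:295-358):

```rust
/// Assumed shape: one field identifier plus one comparison operator.
pub struct Filter<'a, F> {
    /// The field being tested (String during planning, FieldId during execution).
    pub field_id: F,
    /// The comparison to apply, borrowing literals/patterns where possible.
    pub op: Operator<'a>,
}
```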

The Operator<'a> enum defines comparison operations over untyped Literal values:

| Operator | Description | Example SQL |
|----------|-------------|-------------|
| Equals(Literal) | Exact equality | col = 5 |
| Range { lower, upper } | Range with bounds | col BETWEEN 10 AND 20 |
| GreaterThan(Literal) | Greater-than comparison | col > 10 |
| GreaterThanOrEquals(Literal) | Greater-or-equal | col >= 10 |
| LessThan(Literal) | Less-than comparison | col < 10 |
| LessThanOrEquals(Literal) | Less-or-equal | col <= 10 |
| In(&'a [Literal]) | Set membership (borrowed slice) | col IN (1, 2, 3) |
| StartsWith { pattern, case_sensitive } | String prefix match | col LIKE 'abc%' |
| EndsWith { pattern, case_sensitive } | String suffix match | col LIKE '%xyz' |
| Contains { pattern, case_sensitive } | String substring match | col LIKE '%abc%' |
| IsNull | Null check | col IS NULL |
| IsNotNull | Non-null check | col IS NOT NULL |

The Operator type uses borrowed slices for In and borrowed string slices for pattern matching to avoid allocations in common cases.

Sources: llkv-expr/src/expr.rs:295-358

ScalarExpr - Scalar Expressions

The ScalarExpr<F> enum represents arithmetic and scalar expressions that produce typed values. These are used in SELECT projections, computed columns, ORDER BY clauses, and as operands in comparisons.

graph TB
    subgraph "ScalarExpr Evaluation Categories"
        SIMPLE["Simple Values"]
ARITH["Arithmetic"]
LOGIC["Logical"]
STRUCT["Structured"]
CONTROL["Control Flow"]
SPECIAL["Special Functions"]
SIMPLE --> COLUMN["Column(F)\nField reference"]
SIMPLE --> LITERAL["Literal\nConstant value"]
ARITH --> BINARY["Binary\nleft op right"]
ARITH --> CAST["Cast\nType conversion"]
LOGIC --> NOT["Not\nLogical negation"]
LOGIC --> ISNULL["IsNull\nNull test"]
LOGIC --> COMPARE["Compare\nComparison"]
STRUCT --> GETFIELD["GetField\nStruct field access"]
STRUCT --> AGGREGATE["Aggregate\nAggregate function"]
CONTROL --> CASE["Case\nCASE expression"]
CONTROL --> COALESCE["Coalesce\nFirst non-null"]
SPECIAL --> SUBQUERY["ScalarSubquery\nScalar subquery"]
SPECIAL --> RANDOM["Random\nRandom number"]
end

ScalarExpr Variants

Sources: llkv-expr/src/expr.rs:86-143

Arithmetic and Binary Operations

The Binary variant supports arithmetic and logical operations:

BinaryOp Variants:

| Operator | Numeric | Bitwise | Logical |
|---|---|---|---|
| Add | ✓ | | |
| Subtract | ✓ | | |
| Multiply | ✓ | | |
| Divide | ✓ | | |
| Modulo | ✓ | | |
| And | | | ✓ |
| Or | | | ✓ |
| BitwiseShiftLeft | | ✓ | |
| BitwiseShiftRight | | ✓ | |

Sources: llkv-expr/src/expr.rs:270-282

Comparison Operations

The Compare variant in ScalarExpr produces a boolean (1/0) result:

CompareOp Variants:

| Operator | SQL Equivalent |
|---|---|
| Eq | = |
| NotEq | != or <> |
| Lt | < |
| LtEq | <= |
| Gt | > |
| GtEq | >= |

Sources: llkv-expr/src/expr.rs:284-293 llkv-expr/src/expr.rs:119-124

AggregateCall Variants

The Aggregate(AggregateCall<F>) variant wraps aggregate function calls:

Each aggregate (except CountStar) operates on a ScalarExpr<F> rather than just a column, enabling complex expressions like AVG(col1 + col2) or SUM(-col1).

Sources: llkv-expr/src/expr.rs:145-176

Struct Field Access

The GetField variant extracts fields from struct-typed expressions:

This represents dot-notation access like user.address.city, which is nested as:

GetField {
    base: GetField {
        base: Column(user),
        field_name: "address"
    },
    field_name: "city"
}

Sources: llkv-expr/src/expr.rs:107-113

CASE Expressions

The Case variant implements SQL CASE expressions:

  • Simple CASE: operand is Some, branches test equality
  • Searched CASE: operand is None, branches evaluate conditions
  • ELSE clause: Handled by else_expr field
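A hedged sketch of how a searched CASE nests, using the case(operand, branches, else_expr) helper listed under ScalarExpr Helper Methods below (exact argument and branch types are assumptions):

// `CASE WHEN score >= 90 THEN 1 ELSE 0 END` as a searched CASE:
// no operand, one (condition, result) branch, and an ELSE expression.
fn example_case() -> ScalarExpr<String> {
    ScalarExpr::case(
        None, // searched CASE: branches carry full boolean conditions
        vec![(
            ScalarExpr::compare(
                ScalarExpr::column("score".to_string()),
                CompareOp::GtEq,
                ScalarExpr::literal(90),
            ),
            ScalarExpr::literal(1),
        )],
        Some(ScalarExpr::literal(0)),
    )
}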

Sources: llkv-expr/src/expr.rs:129-137

ScalarExpr Helper Methods

Builder methods simplify construction:

| Method | Purpose |
|---|---|
| column(field: F) | Create column reference |
| literal<L>(lit: L) | Create literal value |
| binary(left, op, right) | Create binary operation |
| logical_not(expr) | Create logical NOT |
| is_null(expr, negated) | Create null test |
| aggregate(call) | Create aggregate function |
| get_field(base, name) | Create struct field access |
| cast(expr, data_type) | Create type cast |
| compare(left, op, right) | Create comparison |
| coalesce(exprs) | Create COALESCE |
| scalar_subquery(id) | Create scalar subquery |
| case(operand, branches, else_expr) | Create CASE expression |
| random() | Create random number generator |
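A short hedged sketch composing several of these helpers to build the cross-column comparison col1 + col2 > 10; the method names come from the table above, while ownership and boxing details are assumptions:

// `col1 + col2 > 10` as a planning-time ScalarExpr<String>.
fn example_comparison() -> ScalarExpr<String> {
    let sum = ScalarExpr::binary(
        ScalarExpr::column("col1".to_string()),
        BinaryOp::Add,
        ScalarExpr::column("col2".to_string()),
    );
    ScalarExpr::compare(sum, CompareOp::Gt, ScalarExpr::literal(10))
}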

Sources: llkv-expr/src/expr.rs:178-268

Subquery Expression Types

The expression AST includes dedicated types for correlated and scalar subqueries:

SubqueryExpr

Used in Expr::Exists for boolean subquery predicates:
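A minimal sketch of the shape described here (derives and field visibility are assumptions):

// An opaque id into the enclosing plan's subquery definitions, plus a flag
// distinguishing EXISTS from NOT EXISTS.
pub struct SubqueryExpr {
    pub id: SubqueryId,
    pub negated: bool,
}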

The id references a subquery definition stored separately (typically in the enclosing plan), and negated indicates NOT EXISTS.

Sources: llkv-expr/src/expr.rs:49-56

ScalarSubqueryExpr

Used in ScalarExpr::ScalarSubquery for value-producing subqueries:

This represents subqueries that return a single scalar value, used in expressions like SELECT (SELECT MAX(price) FROM orders) + 10.

Sources: llkv-expr/src/expr.rs:58-63

SubqueryId

Both subquery types use SubqueryId as an opaque identifier:

This ID is resolved during execution by looking up the subquery definition in the parent plan's metadata.

Sources: llkv-expr/src/expr.rs:45-47

graph LR
    subgraph "Expression Translation Pipeline"
        SQL["SQL Text"]
PLAN["Query Plan\nExpr&lt;String&gt;\nScalarExpr&lt;String&gt;"]
TRANS["Translation\nresolve_field_id()"]
EXEC["Execution\nExpr&lt;FieldId&gt;\nScalarExpr&lt;FieldId&gt;"]
EVAL["Evaluation\nRecordBatch Results"]
SQL --> PLAN
 
       PLAN --> TRANS
 
       TRANS --> EXEC
 
       EXEC --> EVAL
    end

Generic Field Parameter

The expression AST is parameterized by field identifier type F, enabling reuse across different processing stages:

Common instantiations:

  • Expr<'a, String> - Used during query planning with column names
  • Expr<'a, FieldId> - Used during execution with numeric field IDs
  • ScalarExpr<String> - Planning-time scalar expressions
  • ScalarExpr<FieldId> - Execution-time scalar expressions

The translation from String to FieldId occurs in llkv-executor/src/translation/expression.rs using catalog lookups to resolve column names to their internal numeric identifiers.

Sources: llkv-executor/src/translation/expression.rs:18-174

Literal Values

Both expression types use the Literal enum from llkv-expr to represent untyped constant values:

| Literal Variant | Arrow Type | Example |
|---|---|---|
| Integer(i64) | Int64 | 42 |
| Float(f64) | Float64 | 3.14 |
| Decimal(DecimalValue) | Decimal128 | 123.45 |
| Boolean(bool) | Boolean | true |
| String(String) | Utf8 | "hello" |
| Date32(i32) | Date32 | DATE '2024-01-01' |
| Null | Null | NULL |
| Struct(...) | Struct | {a: 1, b: "x"} |
| Interval(...) | Interval | INTERVAL '1 month' |

Literal values are type-agnostic at the AST level. Type coercion and validation occur during execution when column types are known.

Sources: llkv-expr/src/expr.rs10 llkv-executor/src/translation/schema.rs:53-123

graph TB
    subgraph "Type Inference Rules"
        SCALAR["ScalarExpr&lt;FieldId&gt;"]
SCALAR --> LITERAL["Literal → Literal type"]
SCALAR --> COLUMN["Column → Schema lookup"]
SCALAR --> BINARY["Binary → Int64 or Float64"]
SCALAR --> NOT["Not → Int64 (boolean)"]
SCALAR --> ISNULL["IsNull → Int64 (boolean)"]
SCALAR --> COMPARE["Compare → Int64 (boolean)"]
SCALAR --> AGG["Aggregate → Int64"]
SCALAR --> GETFIELD["GetField → Struct field type"]
SCALAR --> CAST["Cast → Target type"]
SCALAR --> CASE["Case → Int64 or Float64"]
SCALAR --> COALESCE["Coalesce → Int64 or Float64"]
SCALAR --> RANDOM["Random → Float64"]
SCALAR --> SUBQUERY["ScalarSubquery → Utf8 (TODO)"]
BINARY --> FLOATCHECK["Contains Float64?"]
FLOATCHECK -->|Yes| FLOAT64["Float64"]
FLOATCHECK -->|No| INT64["Int64"]
CASE --> CASECHECK["Branches use Float64?"]
CASECHECK -->|Yes| CASEFLOAT["Float64"]
CASECHECK -->|No| CASEINT["Int64"]
end

Expression Type Inference

During execution planning, the system infers output types for ScalarExpr based on operand types:

The inference logic is implemented in infer_computed_data_type() and expression_uses_float(), which recursively analyze expression trees to determine output types.

Sources: llkv-executor/src/translation/schema.rs:53-271

Expression Normalization

Before compilation, predicates undergo normalization to flatten nested AND/OR structures and apply De Morgan's laws:

Normalization Rules:

  1. Flatten AND: And([And([a, b]), c]) → And([a, b, c])
  2. Flatten OR: Or([Or([a, b]), c]) → Or([a, b, c])
  3. De Morgan's AND: Not(And([a, b])) → Or([Not(a), Not(b)])
  4. De Morgan's OR: Not(Or([a, b])) → And([Not(a), Not(b)])
  5. Double Negation: Not(Not(expr)) → expr
  6. Literal Negation: Not(Literal(true)) → Literal(false)
  7. IsNull Negation: Not(IsNull { expr, negated }) → IsNull { expr, negated: !negated }
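As a worked example, applying rule 3 and then rule 5 to a nested negation:

Not(And([Pred(a), Not(Pred(b))]))
    → Or([Not(Pred(a)), Not(Not(Pred(b)))])    (rule 3: De Morgan's AND)
    → Or([Not(Pred(a)), Pred(b)])              (rule 5: double negation)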

This normalization simplifies subsequent optimization and compilation steps.

Sources: llkv-table/src/planner/program.rs:286-343

Expression Compilation

Normalized expressions are compiled into two bytecode programs:

  1. EvalProgram - Stack-based evaluation of predicates
  2. DomainProgram - Set-based domain analysis for row filtering

The compilation process:

  • Detects fusable predicates (multiple conditions on same field)
  • Builds domain programs for NOT operations
  • Produces postorder bytecode for stack-based evaluation

Sources: llkv-table/src/planner/program.rs:256-631

Expression AST Usage Flow

Key stages:

  1. SQL Parsing - External sqlparser produces SQL AST
  2. Plan Building - llkv-plan converts SQL AST to Expr<String>
  3. Translation - llkv-executor resolves column names to FieldId
  4. Normalization - llkv-table flattens and optimizes structure
  5. Compilation - llkv-table produces bytecode programs
  6. Execution - llkv-table evaluates against RecordBatch data

Sources: llkv-expr/src/expr.rs:1-8 llkv-executor/src/translation/expression.rs:18-174 llkv-table/src/planner/program.rs:256-631



Expression Translation

Relevant source files

Purpose and Scope

Expression Translation is the process of converting expressions from using string-based column names to using internal field identifiers. This transformation bridges the gap between the SQL interface layer (which references columns by name) and the execution layer (which references columns by numeric FieldId). This page covers the translation phase that occurs after query planning but before expression compilation.

For information about the expression AST structure, see Expression AST. For information about how translated expressions are compiled into bytecode, see Program Compilation.

Translation Phase in Query Pipeline

The expression translation phase sits between SQL planning and execution, converting symbolic column references to physical field identifiers.

Translation Phase Detail

graph LR
    SQL["SQL WHERE\nname = 'Alice'"]
AST["sqlparser AST"]
EXPRSTR["Expr<String>\nColumn('name')"]
CATALOG["Schema Lookup"]
EXPRFID["Expr<FieldId>\nColumn(42)"]
COMPILE["Bytecode\nCompilation"]
SQL --> AST
 
   AST --> EXPRSTR
 
   EXPRSTR --> CATALOG
 
   CATALOG --> EXPRFID
 
   EXPRFID --> COMPILE
    
    style EXPRSTR fill:#fff5e1
    style EXPRFID fill:#ffe1e1

Sources: Based on Diagram 5 from system architecture overview

The translation process resolves all column name strings to their corresponding FieldId values by consulting the table schema. This enables the downstream execution engine to work with stable numeric identifiers rather than string lookups.

Generic Expression Types

Both Expr and ScalarExpr are generic over the field identifier type, parameterized as F.

Type Parameter Instantiations

graph TB
    subgraph "Generic Types"
        EXPR["Expr<'a, F>"]
SCALAR["ScalarExpr<F>"]
FILTER["Filter<'a, F>\n{field_id: F, op: Operator}"]
end
    
    subgraph "String-Based (Planning)"
        EXPRSTR["Expr<'static, String>"]
SCALARSTR["ScalarExpr<String>"]
FILTERSTR["Filter<'static, String>"]
end
    
    subgraph "FieldId-Based (Execution)"
        EXPRFID["Expr<'static, FieldId>"]
SCALARFID["ScalarExpr<FieldId>"]
FILTERFID["Filter<'static, FieldId>"]
end
    
    EXPR -.instantiated as.-> EXPRSTR
    EXPR -.instantiated as.-> EXPRFID
    SCALAR -.instantiated as.-> SCALARSTR
    SCALAR -.instantiated as.-> SCALARFID
    FILTER -.instantiated as.-> FILTERSTR
    FILTER -.instantiated as.-> FILTERFID
    
    EXPRSTR ==translate_predicate==> EXPRFID
    SCALARSTR ==translate_scalar==> SCALARFID

Sources: llkv-expr/src/expr.rs:14-43 llkv-executor/src/translation/expression.rs:18-27

| Type | Planning Phase | Execution Phase |
|---|---|---|
| Predicate Expression | Expr<'static, String> | Expr<'static, FieldId> |
| Scalar Expression | ScalarExpr<String> | ScalarExpr<FieldId> |
| Filter | Filter<'static, String> | Filter<'static, FieldId> |
| Field Reference | String column name | FieldId numeric identifier |

The generic parameter F allows the same AST structure to be used at different stages of query processing, with type safety enforcing that planning-phase expressions cannot be mixed with execution-phase expressions.

Sources: llkv-expr/src/expr.rs:14-333

Translation Functions

The llkv-executor crate provides two primary translation functions that recursively transform expression trees.

translate_predicate

Translates boolean predicate expressions (Expr<String> → Expr<FieldId>).

Predicate Translation Flow

Sources: llkv-executor/src/translation/expression.rs:18-174

Function Signature:
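A hedged sketch of the shape implied by the parameter list below; exact generic bounds, ownership, and the error/result types are assumptions, and the authoritative declaration is in the cited source:

pub fn translate_predicate<F>(
    expr: Expr<'static, String>,      // predicate with string column references
    schema: &ExecutorSchema,          // table schema used to resolve names
    unknown_column: F,                // error constructor for unresolved columns
) -> Result<Expr<'static, FieldId>, Error>
where
    F: Fn(&str) -> Error,
{
    // Iterative frame-based traversal described later on this page.
    unimplemented!()
}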

The function accepts:

  • expr: The predicate expression with string column references
  • schema: The table schema for column name resolution
  • unknown_column: Error constructor for unresolved column names

Sources: llkv-executor/src/translation/expression.rs:18-27

translate_scalar

Translates scalar value expressions (ScalarExpr<String> → ScalarExpr<FieldId>).

Function Signature:

Sources: llkv-executor/src/translation/expression.rs:176-185

Variant-Specific Translation

The translation process handles each expression variant differently:

| Variant | Translation Approach |
|---|---|
| Expr::Pred(Filter) | Resolve filter's field_id from String to FieldId |
| Expr::And(Vec) / Expr::Or(Vec) | Recursively translate all child expressions |
| Expr::Not(Box) | Recursively translate inner expression |
| Expr::Compare | Translate both left and right scalar expressions |
| Expr::InList | Translate target expression and all list items |
| Expr::IsNull | Translate inner expression |
| Expr::Literal(bool) | No translation needed, pass through |
| Expr::Exists(SubqueryExpr) | Pass through subquery ID unchanged |
| ScalarExpr::Column(String) | Resolve string to FieldId |
| ScalarExpr::Literal | No translation needed, pass through |
| ScalarExpr::Binary | Recursively translate left and right operands |
| ScalarExpr::Aggregate | Recursively translate aggregate expression argument |
| ScalarExpr::GetField | Recursively translate base expression |
| ScalarExpr::Cast | Recursively translate inner expression |
| ScalarExpr::Case | Translate operand, all branch conditions/results, and else branch |
| ScalarExpr::Coalesce | Recursively translate all items |

Sources: llkv-executor/src/translation/expression.rs:86-143 llkv-executor/src/translation/expression.rs:197-386

graph TD
    START["Column Name String"]
ROWID_CHECK{"Is 'rowid'\n(case-insensitive)?"}
ROWID["Return\nROW_ID_FIELD_ID"]
LOOKUP["schema.resolve(name)"]
FOUND{"Found in\nschema?"}
RETURN_FID["Return\ncolumn.field_id"]
ERROR["Invoke\nunknown_column\ncallback"]
START --> ROWID_CHECK
 
   ROWID_CHECK -->|Yes| ROWID
 
   ROWID_CHECK -->|No| LOOKUP
 
   LOOKUP --> FOUND
 
   FOUND -->|Yes| RETURN_FID
 
   FOUND -->|No| ERROR

Field Resolution

The core of translation is resolving string column names to numeric field identifiers.

Field Resolution Logic

Sources: llkv-executor/src/translation/expression.rs:390-407

Special Column: rowid

The system provides a special pseudo-column named rowid that references the internal row identifier:

The rowid column is:

  • Case-insensitive (accepts "ROWID", "rowid", "RowId", etc.)
  • Available in all tables automatically
  • Mapped to constant ROW_ID_FIELD_ID from llkv_table::ROW_ID_FIELD_ID
  • Corresponds to ROW_ID_COLUMN_NAME constant from llkv_column_map::ROW_ID_COLUMN_NAME
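The resolution rule in the flowchart above can be sketched as follows; ExecutorSchema::resolve and ROW_ID_FIELD_ID are named on this page, but this helper itself is illustrative rather than the crate's actual function:

// Resolve a column name to a FieldId: check the rowid pseudo-column first,
// then fall back to a schema lookup, and finally to the caller's error callback.
fn resolve_column(
    name: &str,
    schema: &ExecutorSchema,
    unknown_column: impl Fn(&str) -> Error,
) -> Result<FieldId, Error> {
    if name.eq_ignore_ascii_case("rowid") {
        return Ok(ROW_ID_FIELD_ID);
    }
    match schema.resolve(name) {
        Some(column) => Ok(column.field_id),
        None => Err(unknown_column(name)),
    }
}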

Sources: llkv-executor/src/translation/expression.rs:399-401 llkv-executor/src/translation/expression.rs2 llkv-executor/src/translation/expression.rs5

Schema Lookup

For non-special columns, resolution uses the ExecutorSchema::resolve method:

The schema lookup:

  1. Searches for a column with the given name
  2. Returns the ExecutorColumn if found
  3. Extracts the field_id from the column metadata
  4. Invokes the error callback if not found

Sources: llkv-executor/src/translation/expression.rs:403-406

graph TB
    START["Initial Expression"]
PUSH_ENTER["Push Enter Frame"]
POP["Pop Frame"]
FRAME_TYPE{"Frame Type?"}
ENTER_NODE["Enter Node"]
NODE_TYPE{"Node Type?"}
AND_OR["And/Or"]
NOT["Not"]
LEAF["Leaf (Pred,\nCompare, etc.)"]
PUSH_EXIT["Push Exit Frame"]
PUSH_CHILDREN["Push Child\nEnter Frames\n(reversed)"]
PUSH_EXIT_NOT["Push Exit Frame"]
PUSH_INNER["Push Inner\nEnter Frame"]
TRANSLATE_LEAF["Translate Leaf\nPush to result_stack"]
EXIT_NODE["Exit Node"]
POP_RESULTS["Pop child results\nfrom result_stack"]
BUILD_NODE["Build translated\nparent node"]
PUSH_RESULT["Push to result_stack"]
DONE{"Stack\nempty?"}
RETURN["Return final result"]
START --> PUSH_ENTER
 
   PUSH_ENTER --> POP
 
   POP --> FRAME_TYPE
    
 
   FRAME_TYPE -->|Enter| ENTER_NODE
 
   FRAME_TYPE -->|Exit| EXIT_NODE
 
   FRAME_TYPE -->|Leaf| TRANSLATE_LEAF
    
 
   ENTER_NODE --> NODE_TYPE
 
   NODE_TYPE --> AND_OR
 
   NODE_TYPE --> NOT
 
   NODE_TYPE --> LEAF
    
 
   AND_OR --> PUSH_EXIT
 
   PUSH_EXIT --> PUSH_CHILDREN
 
   PUSH_CHILDREN --> POP
    
 
   NOT --> PUSH_EXIT_NOT
 
   PUSH_EXIT_NOT --> PUSH_INNER
 
   PUSH_INNER --> POP
    
 
   LEAF --> TRANSLATE_LEAF
 
   TRANSLATE_LEAF --> POP
    
 
   EXIT_NODE --> POP_RESULTS
 
   POP_RESULTS --> BUILD_NODE
 
   BUILD_NODE --> PUSH_RESULT
 
   PUSH_RESULT --> POP
    
 
   POP --> DONE
 
   DONE -->|No| FRAME_TYPE
 
   DONE -->|Yes| RETURN

Traversal Strategy

Translation uses an iterative traversal approach to avoid stack overflow on deeply nested expressions.

Iterative Traversal Algorithm

Sources: llkv-executor/src/translation/expression.rs:39-174

Frame-Based Pattern

The translation uses a frame-based traversal pattern with two stacks:

Work Stack (owned_stack): Contains frames representing work to be done

  • OwnedFrame::Enter(expr): Visit a node and potentially expand it
  • OwnedFrame::Exit(context): Collect child results and build parent node
  • OwnedFrame::Leaf(translated): Push a fully translated leaf node

Result Stack (result_stack): Contains translated expressions ready to be consumed by parent nodes
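A hedged sketch of the frame types described above; the variant payloads follow the prose, while ExitContext stands in for whatever per-node metadata (parent variant, child count) the real Exit frame carries:

// Work items for the iterative traversal. Translated subtrees are collected
// separately on result_stack: Vec<Expr<'static, FieldId>>.
enum OwnedFrame {
    // Visit a node that still has string column references and expand its children.
    Enter(Expr<'static, String>),
    // All children are translated: pop them from the result stack and rebuild the parent.
    Exit(ExitContext),
    // A fully translated node, pushed straight onto the result stack.
    Leaf(Expr<'static, FieldId>),
}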

Sources: llkv-executor/src/translation/expression.rs:48-63

Traversal Example

For the expression And([Pred(name_col), Pred(age_col)]):

| Step | Work Stack | Result Stack | Action |
|---|---|---|---|
| 1 | [Enter(And)] | [] | Start |
| 2 | [Exit(And(2)), Enter(age), Enter(name)] | [] | Expand And |
| 3 | [Exit(And(2)), Enter(age), Leaf(name→42)] | [] | Translate name |
| 4 | [Exit(And(2)), Enter(age)] | [Pred(42)] | Push name result |
| 5 | [Exit(And(2)), Leaf(age→43)] | [Pred(42)] | Translate age |
| 6 | [Exit(And(2))] | [Pred(42), Pred(43)] | Push age result |
| 7 | [] | [And([Pred(42), Pred(43)])] | Build And, push result |
| 8 | Done | [And([...])] | Return final expression |

This approach handles deeply nested expressions (50k+ nodes) without recursion-induced stack overflow.

Sources: llkv-executor/src/translation/expression.rs:62-174

Why Iterative Traversal?

The codebase comments explain:

This avoids stack overflow on deeply nested expressions (50k+ nodes) by using explicit work_stack and result_stack instead of recursion.

The frame-based pattern is documented in the llkv-plan::traversal module and reused here for expression translation.

Sources: llkv-executor/src/translation/expression.rs:39-46

Error Handling

Translation failures produce descriptive errors through callback functions.

Error Callbacks

Both translation functions accept error constructor callbacks:

| Parameter | Purpose | Example Usage |
|---|---|---|
| unknown_column: F | Construct error for unknown column names | Typically produces an Error::InvalidArgumentError (see below) |
| unknown_aggregate: G | Construct error for unknown aggregate functions | Currently unused but reserved for future validation |

The callback pattern allows callers to customize error messages and error types based on their context.

Sources: llkv-executor/src/translation/expression.rs:21-27 llkv-executor/src/translation/expression.rs:189-195

Common Error Scenarios

Translation Error Flow

When schema.resolve(name) returns None, the system invokes the error callback which typically produces an Error::InvalidArgumentError with a message like:

"Binder Error: does not have a column named 'xyz'"

Sources: llkv-executor/src/translation/expression.rs:418-422

Result Stack Underflow

The iterative traversal validates stack consistency:

Stack underflow indicates a bug in the traversal logic rather than invalid user input, so it produces an Error::Internal.

Sources: llkv-executor/src/translation/expression.rs:160-164 llkv-executor/src/translation/expression.rs:171-173

graph TD
    EXPR["ScalarExpr<FieldId>"]
TYPE_CHECK{"Expression\nType?"}
LITERAL["Literal"]
INFER_LIT["Infer from\nLiteral type"]
COLUMN["Column(fid)"]
LOOKUP["schema.column_by_field_id(fid)"]
NORMALIZE["normalized_numeric_type"]
BINARY["Binary"]
CHECK_FLOAT["expression_uses_float"]
FLOAT_RESULT["DataType::Float64"]
INT_RESULT["DataType::Int64"]
AGGREGATE["Aggregate"]
AGG_RESULT["DataType::Int64"]
CAST["Cast"]
CAST_TYPE["Use specified\ndata_type"]
RESULT["Arrow DataType"]
EXPR --> TYPE_CHECK
    
 
   TYPE_CHECK --> LITERAL
 
   TYPE_CHECK --> COLUMN
 
   TYPE_CHECK --> BINARY
 
   TYPE_CHECK --> AGGREGATE
 
   TYPE_CHECK --> CAST
    
 
   LITERAL --> INFER_LIT
 
   INFER_LIT --> RESULT
    
 
   COLUMN --> LOOKUP
 
   LOOKUP --> NORMALIZE
 
   NORMALIZE --> RESULT
    
 
   BINARY --> CHECK_FLOAT
 
   CHECK_FLOAT -->|Uses Float| FLOAT_RESULT
 
   CHECK_FLOAT -->|Integer only| INT_RESULT
 
   FLOAT_RESULT --> RESULT
 
   INT_RESULT --> RESULT
    
 
   AGGREGATE --> AGG_RESULT
 
   AGG_RESULT --> RESULT
    
 
   CAST --> CAST_TYPE
 
   CAST_TYPE --> RESULT

Type Inference Integration

After translation, expressions with FieldId references can be used for schema-based type inference.

The infer_computed_data_type function in llkv-executor/src/translation/schema.rs inspects translated expressions to determine their Arrow data types:

Type Inference for Translated Expressions

Sources: llkv-executor/src/translation/schema.rs:53-123

Type Inference Rules

| Expression | Inferred Type | Notes |
|---|---|---|
| Literal::Integer | Int64 | 64-bit signed integer |
| Literal::Float | Float64 | 64-bit floating point |
| Literal::Decimal | Decimal128(p,s) | Precision and scale from value |
| Literal::Boolean | Boolean | Boolean flag |
| Literal::String | Utf8 | UTF-8 string |
| Literal::Null | Null | Null type marker |
| Column(fid) | Schema lookup | normalized_numeric_type(column.data_type) |
| Binary | Float64 or Int64 | Float if any operand is float |
| Compare | Int64 | Comparisons produce boolean (0/1) as Int64 |
| Aggregate | Int64 | Most aggregates return Int64 |
| Cast | Specified type | Uses explicit data_type parameter |

The normalized_numeric_type function maps small integer types (Int8, Int16, Int32) to Int64 and unsigned/float types to Float64 for consistent expression evaluation.

Sources: llkv-executor/src/translation/schema.rs:125-147

Translation in Context

The translation phase fits into the broader query execution pipeline:

Translation Phase in Query Pipeline

Sources: Based on Diagram 2 and Diagram 5 from system architecture overview

The translation layer serves as the bridge between the human-readable SQL layer (with column names) and the machine-optimized execution layer (with numeric field IDs). This separation allows:

  1. Planning flexibility : Query plans can reference columns symbolically without knowing physical storage details
  2. Schema evolution : Field IDs remain stable even if column names change
  3. Type safety : The type system prevents mixing planning-phase and execution-phase expressions
  4. Optimization : Numeric field IDs enable efficient lookups in columnar storage

Sources: llkv-expr/src/expr.rs:1-359 llkv-executor/src/translation/expression.rs:1-424 llkv-executor/src/translation/schema.rs:1-338



Program Compilation

Relevant source files

Purpose and Scope

This page documents the predicate compilation system in LLKV, which transforms typed predicate expressions (Expr<FieldId>) into stack-based bytecode programs for efficient evaluation during table scans. The compilation process produces two types of programs: EvalProgram for predicate evaluation and DomainProgram for domain analysis (determining which row IDs might match predicates).

For information about the expression AST structure itself, see Expression AST. For how these compiled programs are evaluated during execution, see Filter Evaluation. For how expressions are translated from string column names to field IDs, see Expression Translation.


Compilation Pipeline Overview

The compilation process transforms a normalized predicate expression into executable bytecode through the ProgramCompiler in llkv-table/src/planner/program.rs:257-284. The compiler produces a ProgramSet containing both evaluation and domain analysis programs.

Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/mod.rs:612-637

graph TB
    Input["Expr&lt;FieldId&gt;\n(Raw predicate)"]
Normalize["normalize_predicate()\n(Apply De Morgan's laws,\nflatten And/Or)"]
Compiler["ProgramCompiler::new()\nProgramCompiler::compile()"]
ProgramSet["ProgramSet"]
EvalProgram["EvalProgram\n(Stack-based bytecode\nfor evaluation)"]
DomainRegistry["DomainRegistry\n(Domain programs\nfor optimization)"]
Input --> Normalize
 
   Normalize --> Compiler
 
   Compiler --> ProgramSet
 
   ProgramSet --> EvalProgram
 
   ProgramSet --> DomainRegistry
    
    EvalOps["EvalOp instructions:\nPushPredicate, And, Or,\nNot, FusedAnd"]
DomainOps["DomainOp instructions:\nPushFieldAll, Union,\nIntersect"]
EvalProgram --> EvalOps
 
   DomainRegistry --> DomainOps

Compilation Entry Point

The TablePlanner::plan_scan method invokes the compiler when preparing a scan operation. The process begins with predicate normalization followed by compilation into both evaluation and domain programs.

Sources: llkv-table/src/planner/mod.rs:625-628 llkv-table/src/planner/program.rs:266-283

graph LR
    TablePlanner["TablePlanner::plan_scan"]
NormFilter["normalize_predicate(filter_expr)"]
CreateCompiler["ProgramCompiler::new(Arc&lt;Expr&gt;)"]
Compile["compiler.compile()"]
ProgramSet["ProgramSet&lt;'expr&gt;"]
TablePlanner --> NormFilter
 
   NormFilter --> CreateCompiler
 
   CreateCompiler --> Compile
 
   Compile --> ProgramSet
    
 
   ProgramSet --> EvalProgram["EvalProgram\n(ops: Vec&lt;EvalOp&gt;)"]
ProgramSet --> DomainRegistry["DomainRegistry\n(programs, index)"]

Predicate Normalization

Before compilation, predicates are normalized using normalize_predicate() to simplify the expression tree. Normalization applies two key transformations:

Transformation Rules

| Input Pattern | Output Pattern | Description |
|---|---|---|
| And(And(a, b), c) | And(a, b, c) | Flatten nested AND operations |
| Or(Or(a, b), c) | Or(a, b, c) | Flatten nested OR operations |
| Not(And(a, b)) | Or(Not(a), Not(b)) | De Morgan's law for AND |
| Not(Or(a, b)) | And(Not(a), Not(b)) | De Morgan's law for OR |
| Not(Not(a)) | a | Double negation elimination |
| Not(Literal(true)) | Literal(false) | Literal inversion |
| Not(IsNull{expr, negated}) | IsNull{expr, !negated} | IsNull negation flip |

Normalization Algorithm

The normalize_expr() function recursively transforms the expression tree using pattern matching:

Sources: llkv-table/src/planner/program.rs:286-343

graph TD
    Start["normalize_expr(expr)"]
CheckAnd{"expr is And?"}
CheckOr{"expr is Or?"}
CheckNot{"expr is Not?"}
Other["Return expr unchanged"]
FlattenAnd["Flatten nested And nodes\ninto single And"]
FlattenOr["Flatten nested Or nodes\ninto single Or"]
ApplyDeMorgan["normalize_negated(inner)\nApply De Morgan's laws"]
Start --> CheckAnd
 
   CheckAnd -->|Yes| FlattenAnd
 
   CheckAnd -->|No| CheckOr
 
   CheckOr -->|Yes| FlattenOr
 
   CheckOr -->|No| CheckNot
 
   CheckNot -->|Yes| ApplyDeMorgan
 
   CheckNot -->|No| Other
    
 
   FlattenAnd --> Return["Return normalized expr"]
FlattenOr --> Return
 
   ApplyDeMorgan --> Return
 
   Other --> Return

EvalProgram Compilation

The compile_eval() function generates a sequence of EvalOp instructions using iterative traversal with an explicit work stack. This avoids stack overflow on deeply nested expressions and produces postorder bytecode.

EvalOp Instruction Set

The EvalOp enum defines the instruction types for predicate evaluation:

| Instruction | Description | Stack Effect |
|---|---|---|
| PushPredicate(OwnedFilter) | Push single predicate result | → bool |
| PushCompare{left, op, right} | Evaluate comparison expression | → bool |
| PushInList{expr, list, negated} | Evaluate IN list membership | → bool |
| PushIsNull{expr, negated} | Evaluate IS NULL test | → bool |
| PushLiteral(bool) | Push constant boolean value | → bool |
| FusedAnd{field_id, filters} | Optimized AND for same field | → bool |
| And{child_count} | Pop N bools, push AND result | bool×N → bool |
| Or{child_count} | Pop N bools, push OR result | bool×N → bool |
| Not{domain} | Pop bool, push NOT result | bool → bool |
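To make the stack discipline concrete, the following self-contained sketch evaluates postorder bytecode with a boolean stack. It mirrors the PushLiteral/And/Or instructions above but is a simplified illustration, not the crate's evaluator (which operates over Arrow batches and predicate results):

// A trimmed-down instruction set and stack machine for postorder bytecode.
enum Op {
    PushLiteral(bool),
    And { child_count: usize },
    Or { child_count: usize },
}

fn evaluate(ops: &[Op]) -> bool {
    let mut stack: Vec<bool> = Vec::new();
    for op in ops {
        match op {
            Op::PushLiteral(v) => stack.push(*v),
            Op::And { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.drain(start..).all(|v| v);
                stack.push(result);
            }
            Op::Or { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.drain(start..).any(|v| v);
                stack.push(result);
            }
        }
    }
    stack.pop().expect("a well-formed program leaves exactly one value")
}

// `(true AND false) OR true` in postorder:
// [PushLiteral(true), PushLiteral(false), And { child_count: 2 },
//  PushLiteral(true), Or { child_count: 2 }]  // evaluates to true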
graph TB
    subgraph "Compilation Phases"
        Input["Expression Tree"]
Phase1["Enter Phase\n(Pre-order traversal)"]
Phase2["Exit Phase\n(Post-order emission)"]
Output["Vec&lt;EvalOp&gt;"]
end
    
 
   Input --> Phase1
 
   Phase1 --> Phase2
 
   Phase2 --> Output
    
    subgraph "Frame Types"
        EnterFrame["EvalVisit::Enter(expr)\nPush children in reverse order"]
ExitFrame["EvalVisit::Exit(expr)\nEmit instruction"]
FusedFrame["EvalVisit::EmitFused\nEmit FusedAnd optimization"]
end
    
 
   Phase1 --> EnterFrame
 
   EnterFrame --> ExitFrame
 
   EnterFrame --> FusedFrame
 
   ExitFrame --> Phase2
 
   FusedFrame --> Phase2

Compilation Process

The compiler uses a two-pass approach with EvalVisit frames to track traversal state:

Sources: llkv-table/src/planner/program.rs:407-516

Predicate Fusion Optimization

When the compiler encounters an And node where all children are predicates (Expr::Pred) on the same FieldId, it emits a single FusedAnd instruction instead of multiple individual predicates. This optimization is detected by gather_fused():

Sources: llkv-table/src/planner/program.rs:518-542

graph LR
    AndNode["And(children)"]
Check["gather_fused(children)"]
Decision{"All children\nare Pred(field_id)\nwith same field_id?"}
Fused["Emit FusedAnd{\nfield_id,\nfilters\n}"]
Normal["Emit individual\nPushPredicate\nfor each child\n+ And instruction"]
AndNode --> Check
 
   Check --> Decision
 
   Decision -->|Yes| Fused
 
   Decision -->|No| Normal

DomainProgram Compilation

The compile_domain() function generates DomainOp instructions for domain analysis. Domain programs determine which row IDs might satisfy a predicate without evaluating the full expression, enabling storage-layer optimizations.

DomainOp Instruction Set

| Instruction | Description | Stack Effect |
|---|---|---|
| PushFieldAll(FieldId) | All rows where field exists | → RowSet |
| PushCompareDomain{left, right, op, fields} | Domain of rows for comparison | → RowSet |
| PushInListDomain{expr, list, fields, negated} | Domain of rows for IN list | → RowSet |
| PushIsNullDomain{expr, fields, negated} | Domain of rows for IS NULL | → RowSet |
| PushLiteralFalse | Empty row set | → RowSet |
| PushAllRows | All rows in table | → RowSet |
| Union{child_count} | Pop N sets, push union | RowSet×N → RowSet |
| Intersect{child_count} | Pop N sets, push intersection | RowSet×N → RowSet |

Domain Analysis Algorithm

Domain compilation uses iterative traversal with DomainVisit frames, similar to eval compilation but with different semantics:

graph TB
    Start["compile_domain(expr)"]
Stack["Work stack:\nVec&lt;DomainVisit&gt;"]
Ops["Output:\nVec&lt;DomainOp&gt;"]
EnterFrame["DomainVisit::Enter(node)\nPush children + Exit frame"]
ExitFrame["DomainVisit::Exit(node)\nEmit DomainOp"]
Start --> Stack
 
   Stack --> Process{"Pop frame"}
Process -->|Enter| EnterFrame
 
   Process -->|Exit| ExitFrame
    
 
   EnterFrame --> Stack
 
   ExitFrame --> Emit["Emit instruction to ops"]
Emit --> Stack
    
 
   Process -->|Empty| Done["Return DomainProgram{ops}"]

Domain Semantics

The domain of an expression represents the set of row IDs where the expression could potentially be true (or false for Not):

| Expression Type | Domain Semantics |
|---|---|
| Pred(filter) | All rows where filter.field_id exists |
| Compare{left, right, op} | Union of domains of all fields in left and right |
| InList{expr, list} | Union of domains of all fields in expr and list items |
| IsNull{expr} | Union of domains of all fields in expr |
| Literal(true) | All rows in table |
| Literal(false) | Empty set |
| And(children) | Intersection of children domains |
| Or(children) | Union of children domains |
| Not(inner) | Same as inner domain (NOT doesn't change domain) |

Sources: llkv-table/src/planner/program.rs:544-631


graph TB
    Input["ScalarExpr&lt;FieldId&gt;"]
Stack["Work stack:\nVec&lt;&amp;ScalarExpr&gt;"]
Seen["FxHashSet&lt;FieldId&gt;\n(deduplication)"]
Output["Vec&lt;FieldId&gt;\n(sorted)"]
Input --> Stack
 
   Stack --> Process{"Pop expr"}
Process -->|Column fid| AddField["Insert fid into seen"]
Process -->|Literal| Skip["Skip (no fields)"]
Process -->|Binary| PushChildren["Push left, right\nto stack"]
Process -->|Compare| PushChildren
 
   Process -->|Aggregate| PushAggExpr["Push aggregate expr\nto stack"]
Process -->|Other| PushNested["Push nested exprs\nto stack"]
AddField --> Stack
 
   Skip --> Stack
 
   PushChildren --> Stack
 
   PushAggExpr --> Stack
 
   PushNested --> Stack
    
 
   Process -->|Empty| Collect["Collect seen into Vec\nSort unstable"]
Collect --> Output

Field Collection for Domain Analysis

The collect_fields() function extracts all FieldId references from scalar expressions using iterative traversal. This determines which columns' row sets need to be considered during domain evaluation.

Sources: llkv-table/src/planner/program.rs:633-709


Data Structures

ProgramSet

The top-level container returned by compilation, holding all compiled artifacts:

ProgramSet<'expr> {
    eval: EvalProgram,              // Bytecode for predicate evaluation
    domains: DomainRegistry,        // Domain programs for optimization
    _root_expr: Arc<Expr<'expr, FieldId>>  // Original expression (lifetime anchor)
}

Sources: llkv-table/src/planner/program.rs:23-29

DomainRegistry

Manages the collection of compiled domain programs with deduplication via ExprKey:

DomainRegistry {
    programs: Vec<DomainProgram>,           // All compiled domain programs
    index: FxHashMap<ExprKey, DomainProgramId>,  // Expression → program ID map
    root: Option<DomainProgramId>           // ID of root domain program
}

The registry uses ExprKey (a pointer-based key) to detect duplicate subexpressions and reuse compiled domain programs.

Sources: llkv-table/src/planner/program.rs:12-20 llkv-table/src/planner/program.rs:196-219

OwnedFilter and OwnedOperator

To support owned bytecode programs with no lifetime constraints, the compiler converts borrowed Filter<'a, FieldId> and Operator<'a> types into owned variants:

| Borrowed Type | Owned Type | Purpose |
|---|---|---|
| Filter<'a, FieldId> | OwnedFilter | Stores field_id + owned operator |
| Operator<'a> | OwnedOperator | Owns pattern strings and literal vectors |

This conversion happens during compile_eval() when creating PushPredicate and FusedAnd instructions.

Sources: llkv-table/src/planner/program.rs:69-191


Integration with Table Scanning

The compiled programs are used during table scan execution in two ways:

  1. EvalProgram : Evaluated per-row or per-batch during scan to determine which rows match the predicate
  2. DomainProgram : Used for storage-layer optimizations to skip scanning columns or chunks that cannot match

Usage in PlannedScan

The TablePlanner::plan_scan() method creates a PlannedScan struct that bundles the compiled programs with scan metadata:

PlannedScan<'expr, P> {
    projections: Vec<ScanProjection>,
    filter_expr: Arc<Expr<'expr, FieldId>>,
    options: ScanStreamOptions<P>,
    plan_graph: PlanGraph,          // For query plan visualization
    programs: ProgramSet<'expr>     // Compiled evaluation and domain programs
}

Sources: llkv-table/src/planner/mod.rs:500-509 llkv-table/src/planner/mod.rs:630-636


Example Compilation

Consider the predicate: (age > 18 AND name LIKE 'A%') OR status = 'active'

After normalization, this remains unchanged (no nested And/Or to flatten). The compiler produces:

EvalProgram Instructions

1. PushPredicate(Filter { field_id: age, op: GreaterThan(18) })
2. PushPredicate(Filter { field_id: name, op: StartsWith("A", true) })
3. And { child_count: 2 }
4. PushPredicate(Filter { field_id: status, op: Equals("active") })
5. Or { child_count: 2 }

DomainProgram Instructions

1. PushFieldAll(age)      // Domain for first predicate
2. PushFieldAll(name)     // Domain for second predicate
3. Intersect { child_count: 2 }  // AND combines via intersection
4. PushFieldAll(status)   // Domain for third predicate
5. Union { child_count: 2 }      // OR combines via union

The domain program indicates that rows must have (age AND name) OR status to potentially match.

Sources: llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631


graph LR
    Recursive["Recursive approach\n(Stack overflow risk)"]
Iterative["Iterative approach\n(Explicit work stack)"]
Recursive -->|Replace with| Iterative
    
 
   Iterative --> WorkStack["Vec&lt;Frame&gt;\n(Heap-allocated)"]
Iterative --> ResultStack["Vec&lt;Result&gt;\n(Post-order accumulation)"]

Stack Overflow Prevention

Both compile_eval() and compile_domain() use explicit work stacks instead of recursion to handle deeply nested expressions (50k+ nodes) without stack overflow. This follows the iterative traversal pattern described in the codebase:

The pattern uses Enter/Exit frames to simulate recursive descent and ascent, accumulating results in a separate stack during the Exit phase.

Sources: llkv-table/src/planner/program.rs:407-516 llkv-table/src/planner/program.rs:544-631



Scalar Evaluation and NumericKernels

Relevant source files

Purpose and Scope

This page documents the scalar expression evaluation engine used during table scans to compute expressions like col1 + col2 * 3, CAST(col AS Float64), and CASE statements. The NumericKernels utility centralizes numeric computation logic, providing both row-by-row and vectorized batch evaluation strategies. For the abstract expression AST that gets evaluated, see Expression AST. For how expressions are compiled into bytecode programs for predicate evaluation, see Program Compilation.


Overview

The scalar evaluation system provides a unified numeric computation layer that operates over Arrow arrays during table scans. When a query contains computed projections like SELECT col1 + col2 AS sum FROM table, the executor needs to efficiently evaluate these expressions across potentially millions of rows. The NumericKernels struct and associated types provide:

  1. Type abstraction : Wraps Arrow's Int64Array, Float64Array, and Decimal128Array into a unified NumericArray interface
  2. Evaluation strategies : Supports both row-by-row evaluation (for complex expressions) and vectorized batch evaluation (for simple arithmetic)
  3. Optimization : Applies algebraic simplification to detect affine transformations and constant folding opportunities
  4. Type coercion : Handles implicit casting between integer, float, and decimal types following SQLite-style semantics

Sources : llkv-table/src/scalar_eval.rs:1-22

graph TB
    subgraph "Input Layer"
        ARROW_INT["Int64Array\n(Arrow)"]
ARROW_FLOAT["Float64Array\n(Arrow)"]
ARROW_DEC["Decimal128Array\n(Arrow)"]
end
    
    subgraph "Abstraction Layer"
        NUM_ARRAY["NumericArray\nkind: NumericKind\nlen: usize"]
NUM_VALUE["NumericValue\nInteger(i64)\nFloat(f64)\nDecimal(DecimalValue)"]
end
    
    subgraph "Evaluation Engine"
        KERNELS["NumericKernels\nevaluate_value()\nevaluate_batch()\nsimplify()"]
end
    
    subgraph "Output Layer"
        RESULT_ARRAY["ArrayRef\n(Arrow)"]
end
    
 
   ARROW_INT --> NUM_ARRAY
 
   ARROW_FLOAT --> NUM_ARRAY
 
   ARROW_DEC --> NUM_ARRAY
 
   NUM_ARRAY --> NUM_VALUE
 
   NUM_VALUE --> KERNELS
 
   KERNELS --> RESULT_ARRAY
    
    style KERNELS fill:#e1f5ff

Core Data Types

NumericKind

An enum distinguishing the underlying numeric representation. This preserves type information through evaluation to enable intelligent casting decisions:

Sources : llkv-table/src/scalar_eval.rs:26-32

NumericValue

A tagged union representing a single numeric value while preserving its original type. Provides conversion methods to target types:

| Variant | Description | Conversion Methods |
|---|---|---|
| Integer(i64) | Signed 64-bit integer | as_f64(), as_i64() |
| Float(f64) | 64-bit floating point | as_f64() |
| Decimal(DecimalValue) | Fixed-precision decimal | as_f64() |

All variants support .kind() to retrieve the original NumericKind.

Sources : llkv-table/src/scalar_eval.rs:34-69

NumericArray

Wraps Arrow array types with a unified interface for numeric access. Internally stores optional Arc<Int64Array>, Arc<Float64Array>, or Arc<Decimal128Array> based on the kind field:

Key Methods :

graph LR
    subgraph "NumericArray"
        KIND["kind: NumericKind"]
LEN["len: usize"]
INT_DATA["int_data: Option&lt;Arc&lt;Int64Array&gt;&gt;"]
FLOAT_DATA["float_data: Option&lt;Arc&lt;Float64Array&gt;&gt;"]
DECIMAL_DATA["decimal_data: Option&lt;Arc&lt;Decimal128Array&gt;&gt;"]
end
    
    KIND -.determines.-> INT_DATA
    KIND -.determines.-> FLOAT_DATA
    KIND -.determines.-> DECIMAL_DATA
  • try_from_arrow(array: &ArrayRef): Constructs from any Arrow array, applying type casting as needed
  • value(idx: usize): Extracts Option<NumericValue> at the given index
  • promote_to_float(): Converts to Float64 representation for mixed-type arithmetic
  • to_array_ref(): Exports back to Arrow ArrayRef

Sources : llkv-table/src/scalar_eval.rs:83-383


NumericKernels API

The NumericKernels struct provides static methods for expression evaluation and optimization. It serves as the primary entry point for scalar computation during table scans.

Field Collection

Recursively traverses a scalar expression to identify all referenced column fields. Used by the table planner to determine which columns must be fetched from storage.

Sources : llkv-table/src/scalar_eval.rs:455-526

Array Preparation

Converts a set of Arrow arrays into the NumericArray representation, applying type coercion as needed. The needed_fields parameter filters to only the columns referenced by the expression being evaluated. Returns a FxHashMap<FieldId, NumericArray> for fast lookup during evaluation.

Sources : llkv-table/src/scalar_eval.rs:528-547

Value-by-Value Evaluation

Evaluates a scalar expression for a single row at index idx. Supports:

  • Binary arithmetic (+, -, *, /, %)
  • Comparisons (=, <, >, etc.)
  • Logical operators (NOT, IS NULL)
  • Type casts (CAST(... AS Float64))
  • Control flow (CASE, COALESCE)
  • Random number generation (RANDOM())

Returns None for NULL propagation.

Sources : llkv-table/src/scalar_eval.rs:549-673

Batch Evaluation

Evaluates an expression across all rows in a batch, returning an ArrayRef. The implementation attempts vectorized evaluation for simple expressions (single column, literals, affine transformations) and falls back to row-by-row evaluation for complex cases.

Sources : llkv-table/src/scalar_eval.rs:676-712

graph TB
    EXPR["ScalarExpr&lt;FieldId&gt;"]
SIMPLIFY["simplify()\nDetect affine patterns"]
VECTORIZE["try_evaluate_vectorized()\nCheck for fast path"]
FAST["Vectorized Evaluation\nDirect Arrow compute"]
SLOW["Row-by-Row Loop\nevaluate_value()
per row"]
RESULT["ArrayRef"]
EXPR --> SIMPLIFY
 
   SIMPLIFY --> VECTORIZE
 
   VECTORIZE -->|Success| FAST
 
   VECTORIZE -->|Fallback| SLOW
 
   FAST --> RESULT
 
   SLOW --> RESULT

Vectorization and Optimization

VectorizedExpr

Internal representation for expressions that can be evaluated without per-row dispatch:

The try_evaluate_vectorized method attempts to decompose complex expressions into VectorizedExpr nodes, enabling efficient vectorized computation for binary operations between arrays and scalars.

Sources : llkv-table/src/scalar_eval.rs:385-414

graph LR
    INPUT["col * 3 + 5"]
DETECT["Detect Affine Pattern"]
AFFINE["AffineExpr\nfield: col\nscale: 3.0\noffset: 5.0"]
FAST_EVAL["Single Pass Evaluation\nemit_no_nulls()"]
INPUT --> DETECT
 
   DETECT --> AFFINE
 
   AFFINE --> FAST_EVAL

Affine Expression Detection

The simplify method detects affine transformations of the form scale * field + offset:

When an affine pattern is detected, the executor can apply the transformation in a single pass without intermediate allocations. The try_extract_affine_expr method recursively analyzes binary arithmetic trees to identify this pattern.
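The single-pass idea can be illustrated with a self-contained sketch that applies an affine transform to plain Option<f64> values; AffineExpr mirrors the scale and offset fields named above, while everything else (working over a slice instead of an Arrow array) is a simplification:

// Apply `scale * value + offset` in one pass, propagating NULL (None) as-is.
struct AffineExpr {
    scale: f64,
    offset: f64,
}

fn evaluate_affine(expr: &AffineExpr, column: &[Option<f64>]) -> Vec<Option<f64>> {
    column
        .iter()
        .map(|value| value.map(|v| expr.scale * v + expr.offset))
        .collect()
}

// `col * 3 + 5` (scale = 3.0, offset = 5.0) over [Some(1.0), None, Some(2.0)]
// yields [Some(8.0), None, Some(11.0)].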

Sources : llkv-table/src/scalar_eval.rs:1138-1261

Constant Folding

The simplification pass performs constant folding for expressions like 2 + 3 or 10.0 / 2.0, replacing them with Literal(5) or Literal(5.0). This eliminates redundant computation during execution.

Sources : llkv-table/src/scalar_eval.rs:997-1137


Type Coercion and Casting

Implicit Coercion

When evaluating binary operations on mixed types, the system applies implicit promotion rules:

| Left Type | Right Type | Result Type | Behavior |
|---|---|---|---|
| Integer | Integer | Integer | No conversion |
| Integer | Float | Float | Promote left to Float64 |
| Float | Integer | Float | Promote right to Float64 |
| Decimal | Any | Float | Convert both to Float64 |

The infer_result_kind method determines the target type before evaluation, and to_aligned_array_ref applies the necessary promotions.
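The promotion rules in the table can be sketched as a small helper; NumericKind matches the kinds named on this page, but the function itself is illustrative rather than the crate's infer_result_kind:

// Pick the result representation for a binary operation on two operands.
#[derive(Clone, Copy)]
enum NumericKind {
    Integer,
    Float,
    Decimal,
}

fn promote(left: NumericKind, right: NumericKind) -> NumericKind {
    use NumericKind::*;
    match (left, right) {
        (Integer, Integer) => Integer,        // no conversion needed
        (Decimal, _) | (_, Decimal) => Float, // decimals are converted to Float64
        _ => Float,                           // any float operand promotes the result
    }
}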

Sources : llkv-table/src/scalar_eval.rs:1398-1447

Explicit Casting

The CAST expression variant supports explicit type conversion:

Casting is handled during evaluation by:

  1. Evaluating the inner expression to NumericValue
  2. Converting to the target NumericKind via cast_numeric_value_to_kind
  3. Constructing the result array with the target Arrow DataType

Special handling exists for DataType::Date32 casts, which use the llkv-plan date utilities.

Sources : llkv-table/src/scalar_eval.rs:1449-1472 llkv-table/src/scalar_eval.rs:611-624


sequenceDiagram
    participant Planner as TablePlanner
    participant Executor as TableExecutor
    participant Kernels as NumericKernels
    participant Store as ColumnStore
    
    Planner->>Planner: Analyze projections
    Planner->>Kernels: collect_fields(expr)
    Kernels-->>Planner: Set&lt;FieldId&gt;
    Planner->>Planner: Build unique_lfids list
    
    Executor->>Store: Gather columns for row batch
    Store-->>Executor: Vec&lt;ArrayRef&gt;
    
    Executor->>Kernels: prepare_numeric_arrays(lfids, arrays, fields)
    Kernels-->>Executor: NumericArrayMap
    
    Executor->>Kernels: evaluate_batch_simplified(expr, len, arrays)
    Kernels->>Kernels: try_evaluate_vectorized()
    alt Vectorized
        Kernels->>Kernels: compute_binary_array_array()
    else Fallback
        Kernels->>Kernels: Loop: evaluate_value(expr, idx)
    end
    Kernels-->>Executor: ArrayRef (result column)
    
    Executor->>Executor: Append to RecordBatch

Integration with Table Scans

The numeric evaluation engine is invoked by the table executor when processing computed projections. The integration flow:

Projection Evaluation Context

The ProjectionEval enum distinguishes between direct column references and computed expressions:

For Computed variants, the planner:

  1. Calls NumericKernels::simplify() to optimize the expression
  2. Invokes NumericKernels::collect_fields() to determine dependencies
  3. Stores the simplified expression for evaluation

During execution, RowStreamBuilder materializes computed columns by calling evaluate_batch_simplified for each expression.

Sources : llkv-table/src/planner/mod.rs:494-498 llkv-table/src/planner/mod.rs:1073-1107

Passthrough Optimization

The planner detects when a computed expression is simply a column reference (after simplification) via NumericKernels::passthrough_column(). In this case, the column is fetched directly from storage without re-evaluation:

This avoids redundant computation for queries like SELECT col + 0 AS x.

Sources : llkv-table/src/planner/mod.rs:1110-1116 llkv-table/src/scalar_eval.rs:874-907


Data Type Inference

The evaluation engine must determine result types for expressions before evaluation to construct properly-typed Arrow arrays. The infer_computed_data_type function in llkv-executor delegates to numeric kernel logic:

| Expression Type | Inferred Data Type | Rule |
|---|---|---|
| Literal(Integer) | Int64 | Direct mapping |
| Literal(Float) | Float64 | Direct mapping |
| Binary { ... } | Int64 or Float64 | Based on operand types |
| Compare { ... } | Int64 | Boolean as 0/1 integer |
| Cast { data_type, ... } | data_type | Explicit type |
| Random | Float64 | Always float |

The expression_uses_float helper recursively checks if any operand is floating-point, promoting the result type accordingly.

Sources : llkv-executor/src/translation/schema.rs:53-123


Performance Characteristics

Row-by-Row Evaluation

Used for:

  • Expressions with control flow (CASE, COALESCE)
  • Expressions containing CAST to non-numeric types
  • Expressions with interval arithmetic (date operations)

Cost : O(n) row dispatch overhead, branch mispredictions on conditionals

Vectorized Evaluation

Used for:

  • Simple arithmetic (col1 + col2, col * 3)
  • Single column references
  • Constant literals

Cost : O(n) with SIMD-friendly memory access patterns, no per-row dispatch

graph LR
    INPUT["Int64Array\n[1,2,3,4,5]"]
AFFINE["scale=2.0\noffset=10.0"]
CALLBACK["emit_no_nulls(\nlen, /i/ 2.0*values[i]+10.0\n)"]
OUTPUT["Float64Array\n[12,14,16,18,20]"]
INPUT --> AFFINE
 
   AFFINE --> CALLBACK
 
   CALLBACK --> OUTPUT

Affine Evaluation

Special case for scale * field + offset expressions. The executor generates values directly into the output buffer using emit_no_nulls or emit_with_nulls callbacks, avoiding intermediate allocations.

Sources : llkv-table/src/planner/mod.rs:253-357


Key Implementation Details

NULL Handling

NULL values propagate through arithmetic operations according to SQL semantics:

  • NULL + 5 → NULL
  • NULL IS NULL → 1 (true)
  • COALESCE(NULL, 5) → 5

The NumericValue is wrapped in Option<T>, with None representing SQL NULL. Binary operations return None if either operand is None.
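A minimal illustration of this Option-based propagation:

// If either operand is None (SQL NULL), the result is None: `NULL + 5 → NULL`.
fn add_nullable(left: Option<i64>, right: Option<i64>) -> Option<i64> {
    match (left, right) {
        (Some(l), Some(r)) => Some(l + r),
        _ => None,
    }
}

// add_nullable(None, Some(5)) == None
// add_nullable(Some(2), Some(5)) == Some(7)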

Sources : llkv-table/src/scalar_eval.rs:564-571

Type Safety

The system maintains type safety through:

  1. Tagged unions : NumericValue preserves original type via the discriminant
  2. Explicit promotion : promote_to_float() is called only when type mixing requires it
  3. Result type inference : The planner determines output types before evaluation

This prevents silent precision loss and enables query optimizations based on type information.

Sources : llkv-table/src/scalar_eval.rs:295-342

Memory Efficiency

The NumericArray struct uses Arc<T> for backing arrays, enabling zero-copy sharing when:

  • Returning a column directly without computation
  • Slicing arrays for sorted run evaluation
  • Sharing arrays across multiple expressions referencing the same column

The to_array_ref() method clones the Arc, not the underlying data.

Sources : llkv-table/src/scalar_eval.rs:275-293



Aggregation System

Relevant source files

The aggregation system evaluates SQL aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.) over Arrow RecordBatch streams. It consists of a planning layer that defines aggregate specifications and an execution layer that performs incremental accumulation with overflow checking and DISTINCT value tracking.

For information about scalar expression evaluation, see Scalar Evaluation and NumericKernels. For query execution orchestration, see Query Execution.

Architecture Overview

The aggregation system operates across three crates:

Sources: llkv-aggregate/src/lib.rs:1-1935 llkv-executor/src/lib.rs:1-3599 llkv-plan/src/plans.rs:1-1458

The planner creates AggregateExpr instances from SQL AST nodes, which the executor converts to AggregateSpec descriptors. These specs initialize AggregateAccumulator instances that process batches incrementally, accumulating values in memory. The AggregateState wraps the accumulator with metadata (alias, override values) and produces the final output arrays.

Aggregate Specification

AggregateSpec Structure

AggregateSpec defines an aggregate operation at plan-time:

| Field | Type | Purpose |
|---|---|---|
| alias | String | Output column name for the aggregate result |
| kind | AggregateKind | Type of aggregate operation and its parameters |

Sources: llkv-aggregate/src/lib.rs:23-27

AggregateKind Variants

Sources: llkv-aggregate/src/lib.rs:30-67

Each variant captures the field ID to aggregate over, the expected data type, and operation-specific flags like distinct or separator. The field_id is optional for COUNT(*) which counts all rows regardless of column values.

Accumulator System

Accumulator Variants

AggregateAccumulator implements streaming accumulation for each aggregate type:

Sources: llkv-aggregate/src/lib.rs:92-247

graph TB
    subgraph "COUNT Variants"
        CS[CountStar\nvalue: i64]
        CC[CountColumn\ncolumn_index: usize\nvalue: i64]
        CDC[CountDistinctColumn\ncolumn_index: usize\nseen: FxHashSet]
    end
    
    subgraph "SUM Variants"
        SI64[SumInt64\nvalue: Option-i64-\nhas_values: bool]
        SDI64[SumDistinctInt64\nsum: Option-i64-\nseen: FxHashSet]
        SF64[SumFloat64\nvalue: f64\nsaw_value: bool]
        SDF64[SumDistinctFloat64\nsum: f64\nseen: FxHashSet]
        SD128[SumDecimal128\nsum: i128\nprecision: u8\nscale: i8]
    end
    
    subgraph "AVG Variants"
        AI64[AvgInt64\nsum: i64\ncount: i64]
        ADI64[AvgDistinctInt64\nsum: i64\nseen: FxHashSet]
        AF64[AvgFloat64\nsum: f64\ncount: i64]
    end
    
    subgraph "MIN/MAX Variants"
        MinI64[MinInt64\nvalue: Option-i64-]
        MaxI64[MaxInt64\nvalue: Option-i64-]
        MinF64[MinFloat64\nvalue: Option-f64-]
        MaxF64[MaxFloat64\nvalue: Option-f64-]
    end

Each accumulator variant is specialized for its data type and operation semantics. Integer accumulators track overflow using Option<i64> (None indicates overflow), while float accumulators use f64 which never overflows. Distinct variants maintain a FxHashSet of seen values.

sequenceDiagram
    participant Executor
    participant AggregateSpec
    participant AggregateAccumulator
    participant RecordBatch
    participant OutputArray
    
    Executor->>AggregateSpec: new_with_projection_index()
    AggregateSpec->>AggregateAccumulator: Create accumulator
    
    loop For each batch
        Executor->>RecordBatch: Stream next batch
        RecordBatch->>AggregateAccumulator: update(batch)
        Note over AggregateAccumulator: Accumulate values\nCheck overflow\nTrack distinct keys
    end
    
    Executor->>AggregateAccumulator: finalize()
    AggregateAccumulator->>OutputArray: (Field, ArrayRef)
    OutputArray->>Executor: Return result

Accumulator Lifecycle

Sources: llkv-aggregate/src/lib.rs:460-746 llkv-aggregate/src/lib.rs:748-1440 llkv-aggregate/src/lib.rs:1442-1934

The accumulator is initialized with a projection index indicating which column in the RecordBatch to aggregate. The update() method processes each batch incrementally, and finalize() produces the final Arrow array and field schema.
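The lifecycle can be illustrated with a simplified, self-contained SUM-style accumulator; like the SumInt64 variant above it collapses the running sum to None on overflow, but it reads a slice of Option<i64> rather than a column from an Arrow RecordBatch:

// update() folds each batch into the running state; finalize() produces the result.
struct SumInt64Sketch {
    value: Option<i64>, // None once an overflow has occurred
    has_values: bool,   // distinguishes SUM over zero rows (NULL) from a sum of 0
}

impl SumInt64Sketch {
    fn new() -> Self {
        Self { value: Some(0), has_values: false }
    }

    fn update(&mut self, column: &[Option<i64>]) {
        for v in column.iter().flatten() {
            self.has_values = true;
            // checked_add mirrors the overflow handling described for SUM below.
            self.value = self.value.and_then(|sum| sum.checked_add(*v));
        }
    }

    fn finalize(&self) -> Result<Option<i64>, String> {
        match (self.has_values, self.value) {
            (false, _) => Ok(None), // no input rows: SUM returns NULL
            (true, Some(sum)) => Ok(Some(sum)),
            (true, None) => Err("integer overflow in SUM".to_string()),
        }
    }
}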

Distinct Value Tracking

DistinctKey Enumeration

The system tracks distinct values using a hash-based approach:

| Variant | Type | Purpose |
|---|---|---|
| Int(i64) | Integer values | Exact integer comparison |
| Float(u64) | Float bit pattern | Bitwise float equality |
| Str(String) | String values | Text comparison |
| Bool(bool) | Boolean values | True/false comparison |
| Date(i32) | Date32 values | Date comparison |
| Decimal(i128) | Decimal raw value | Exact decimal comparison |

Sources: llkv-aggregate/src/lib.rs:249-257 llkv-aggregate/src/lib.rs:259-333

Float values are converted to bit patterns (to_bits()) to enable hash-based deduplication while preserving NaN and infinity semantics. Decimal values use raw i128 representation for exact comparison without scale conversion.

Distinct Accumulation Example

For COUNT(DISTINCT column), the accumulator inserts each non-null value into the hash set:

Sources: llkv-aggregate/src/lib.rs:785-798 llkv-aggregate/src/lib.rs:1465-1473

graph LR
    Batch1[RecordBatch 1\nvalues: 1,2,3]
    Batch2[RecordBatch 2\nvalues: 2,3,4]
    Batch3[RecordBatch 3\nvalues: 1,4,5]
    
 
   Batch1 --> HS1[seen: {1,2,3}]
 
   Batch2 --> HS2[seen: {1,2,3,4}]
 
   Batch3 --> HS3[seen: {1,2,3,4,5}]
    
 
   HS3 --> Result[COUNT: 5]

The hash set automatically deduplicates values across batches. Only the set size is returned as the final count, avoiding materialization of the entire set in the output.

Aggregate Functions

COUNT Family

| Function | Null Handling | Return Type | Overflow |
|---|---|---|---|
| COUNT(*) | Counts all rows | Int64 | Checked |
| COUNT(column) | Skips NULL values | Int64 | Checked |
| COUNT(DISTINCT column) | Skips NULL, deduplicates | Int64 | Checked |

Sources: llkv-aggregate/src/lib.rs:467-485 llkv-aggregate/src/lib.rs:759-783 llkv-aggregate/src/lib.rs:1452-1473

COUNT operations verify that the result fits in i64 range. COUNT(*) accumulates batch row counts directly. COUNT(column) filters invalid (NULL) rows using array.is_valid(i). COUNT(DISTINCT) maintains a hash set and returns its size.

SUM and TOTAL

| Function | Overflow Behavior | Return Type | NULL Result |
|---|---|---|---|
| SUM(int_column) | Returns error | Int64 | NULL if no values |
| SUM(float_column) | Accumulates infinities | Float64 | NULL if no values |
| TOTAL(int_column) | Converts to Float64 | Float64 | 0.0 if no values |
| TOTAL(float_column) | Accumulates infinities | Float64 | 0.0 if no values |

Sources: llkv-aggregate/src/lib.rs:486-541 llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975

graph LR
    Input[Input Column]
    Sum[Accumulate Sum]
    Count[Count Non-NULL]
    Div[Divide sum/count]
    Output[Float64 Result]
    
 
   Input --> Sum
 
   Input --> Count
 
   Sum --> Div
 
   Count --> Div
 
   Div --> Output

SUM uses checked_add for integers and returns an error on overflow. TOTAL never overflows because it accumulates as Float64 even for integer columns. The key difference is NULL handling: SUM returns NULL for empty input, TOTAL returns 0.0.
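
The contrast can be sketched as follows (simplified, single-pass versions of the semantics described above):

```rust
/// SUM sketch: checked addition, error on overflow.
fn sum_checked(values: &[i64]) -> Result<i64, &'static str> {
    values
        .iter()
        .try_fold(0_i64, |acc, &v| acc.checked_add(v).ok_or("SUM overflow"))
}

/// TOTAL sketch: accumulates as f64, so it never overflows (precision may be lost).
fn total(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

fn main() {
    let near_max = [i64::MAX, 1];
    assert!(sum_checked(&near_max).is_err()); // SUM reports an error
    assert!(total(&near_max).is_finite());    // TOTAL keeps going as Float64
    assert_eq!(total(&[]), 0.0);              // TOTAL of empty input is 0.0
}
```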

AVG (Average)

Sources: llkv-aggregate/src/lib.rs:598-654 llkv-aggregate/src/lib.rs:1096-1121 llkv-aggregate/src/lib.rs:1635-1645

AVG maintains separate sum and count accumulators. During finalization, it divides sum / count to produce a Float64 result. Integer sums are converted to Float64 for the division. If count is zero, AVG returns NULL.

MIN and MAX

| Data Type | Comparison Strategy | NULL Handling |
|---|---|---|
| Int64 | i64::min() / i64::max() | Skip NULL values |
| Float64 | partial_cmp() with NaN handling | Skip NULL values |
| Decimal128 | i128::min() / i128::max() on raw values | Skip NULL values |
| String | Numeric coercion via array_value_to_numeric() | Skip NULL values |

Sources: llkv-aggregate/src/lib.rs:656-710 llkv-aggregate/src/lib.rs:1259-1277 llkv-aggregate/src/lib.rs:1279-1300

MIN/MAX start with None and update to Some(value) on the first non-NULL entry. Subsequent values are compared using type-specific logic. Float comparisons use partial_cmp() to handle NaN values correctly.
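
A simplified MIN sketch for Float64 under these rules (illustrative only; NaN handling here simply keeps the current minimum when partial_cmp returns None):

```rust
/// MIN sketch: start at None, skip NULLs, compare with partial_cmp.
fn min_f64(values: &[Option<f64>]) -> Option<f64> {
    let mut min: Option<f64> = None;
    for v in values.iter().flatten() {
        min = match min {
            None => Some(*v),
            Some(m) => {
                // partial_cmp returns None when either side is NaN;
                // keep the current minimum in that case.
                if v.partial_cmp(&m) == Some(std::cmp::Ordering::Less) {
                    Some(*v)
                } else {
                    Some(m)
                }
            }
        };
    }
    min
}

fn main() {
    assert_eq!(min_f64(&[Some(3.0), None, Some(1.5), Some(f64::NAN)]), Some(1.5));
    assert_eq!(min_f64(&[]), None); // no values => NULL
}
```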

graph LR
    Values["Column Values:\n42, 'hello', 3.14"]
Convert[Convert to Strings:\n'42', 'hello', '3.14']
    Join["Join with separator\n(default: ',')"]
Result["Result: '42,hello,3.14'"]
Values --> Convert
 
   Convert --> Join
 
   Join --> Result

GROUP_CONCAT

GROUP_CONCAT concatenates string representations of column values with a separator:

Sources: llkv-aggregate/src/lib.rs:722-744 llkv-aggregate/src/lib.rs:1409-1437 llkv-aggregate/src/lib.rs:1847-1874

The accumulator collects string representations using array_value_to_string() which coerces integers, floats, and booleans to text. DISTINCT variants track seen values in a hash set. Finalization joins the strings with the specified separator (default: ',').
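
A simplified sketch of this behaviour (illustrative, not the crate's accumulator):

```rust
use std::collections::HashSet;

/// GROUP_CONCAT sketch: coerce values to text, optionally deduplicate,
/// join with a separator (default ',').
fn group_concat<T: ToString>(values: &[T], sep: &str, distinct: bool) -> String {
    let mut seen = HashSet::new();
    let mut parts = Vec::new();
    for v in values {
        let s = v.to_string();
        if !distinct || seen.insert(s.clone()) {
            parts.push(s);
        }
    }
    parts.join(sep)
}

fn main() {
    assert_eq!(group_concat(&[42, 7, 42], ",", false), "42,7,42");
    assert_eq!(group_concat(&[42, 7, 42], ",", true), "42,7"); // DISTINCT variant
}
```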

Group-by Integration

Grouping Key Extraction

For GROUP BY queries, the executor extracts grouping keys from each row:

Sources: llkv-executor/src/lib.rs:1097-1173

sequenceDiagram
    participant Executor
    participant GroupMap
    participant AggregateState
    participant Accumulator
    
    loop For each batch
        Executor->>Executor: Extract group keys
        loop For each group
            Executor->>GroupMap: Get or create group
            Executor->>AggregateState: Get accumulators for group
            Executor->>Accumulator: Filter batch to group rows
            Executor->>Accumulator: update(filtered_batch)
        end
    end
    
    Executor->>GroupMap: Iterate all groups
    loop For each group
        Executor->>AggregateState: finalize()
        AggregateState->>Executor: Return aggregate arrays
    end

Each unique combination of group-by column values maps to a separate GroupAggregateState which tracks the representative row and a list of matching row locations across batches.

Aggregate Accumulation per Group

Sources: llkv-executor/src/lib.rs:1174-1383

The executor maintains separate accumulators for each group. When processing a batch, it filters rows by group membership using RowIdFilter and updates each group's accumulators independently. This ensures that SUM(sales) for group 'USA' only accumulates sales records where country='USA'.
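
A simplified sketch of per-group accumulation (a plain hash map standing in for GroupAggregateState and the accumulator machinery):

```rust
use std::collections::HashMap;

/// Grouped SUM sketch: one running sum per group key, updated only with
/// the rows that belong to that group.
fn grouped_sum(rows: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut groups: HashMap<String, i64> = HashMap::new();
    for (country, sales) in rows {
        *groups.entry(country.to_string()).or_insert(0) += *sales;
    }
    groups
}

fn main() {
    let rows = [("USA", 1_000_000), ("Canada", 750_000), ("USA", 500_000)];
    let result = grouped_sum(&rows);
    assert_eq!(result["USA"], 1_500_000);
    assert_eq!(result["Canada"], 750_000);
}
```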

Output Construction

After processing all batches, the executor constructs the output RecordBatch:

| Column Type | Source | Construction |
|---|---|---|
| Group-by columns | Representative rows | Gathered from original batches |
| Aggregate columns | Finalized accumulators | Converted to Arrow arrays |

Sources: llkv-executor/src/lib.rs:1384-1467

The system gathers one representative row per group for the group-by columns, then appends the finalized aggregate arrays as additional columns. This produces a result like:

+---------+------------+
| country | SUM(sales) |
+---------+------------+
| USA     |    1500000 |
| Canada  |     750000 |
+---------+------------+
graph LR
    StringCol["String Column\n'42', 'hello', '3.14'"]
Parse1["'42' → 42.0"]
Parse2["'hello' → 0.0"]
Parse3["'3.14' → 3.14"]
Sum[SUM: 45.14]
    
 
   StringCol --> Parse1
 
   StringCol --> Parse2
 
   StringCol --> Parse3
    
 
   Parse1 --> Sum
 
   Parse2 --> Sum
 
   Parse3 --> Sum

Type System and Coercion

Numeric Coercion

The system performs SQLite-style type coercion for aggregates on string columns:

Sources: llkv-aggregate/src/lib.rs:398-447

The array_value_to_numeric() function attempts to parse strings as floats. Non-numeric strings coerce to 0.0, matching SQLite behavior. This enables SUM(string_column) where some values are numeric.
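
A minimal sketch of this coercion, assuming a simple parse-or-zero rule (the real array_value_to_numeric() handles more cases):

```rust
/// SQLite-style numeric coercion sketch: parse as f64, fall back to 0.0.
fn to_numeric(s: &str) -> f64 {
    s.trim().parse::<f64>().unwrap_or(0.0)
}

fn main() {
    let values = ["42", "hello", "3.14"];
    let sum: f64 = values.iter().map(|s| to_numeric(s)).sum();
    assert!((sum - 45.14).abs() < 1e-9); // 'hello' coerces to 0.0
}
```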

Type-specific Accumulators

| Input Type | SUM Accumulator | AVG Accumulator | MIN/MAX Accumulator |
|---|---|---|---|
| Int64 | SumInt64 (i64 with overflow) | AvgInt64 (sum: i64, count: i64) | MinInt64 / MaxInt64 |
| Float64 | SumFloat64 (f64, never overflows) | AvgFloat64 (sum: f64, count: i64) | MinFloat64 / MaxFloat64 |
| Decimal128 | SumDecimal128 (i128 + precision/scale) | AvgDecimal128 (sum: i128, count: i64) | MinDecimal128 / MaxDecimal128 |
| Utf8 | SumFloat64 (numeric coercion) | AvgFloat64 (numeric coercion) | MinFloat64 (numeric coercion) |

Sources: llkv-aggregate/src/lib.rs:486-710

graph TB
    IntValue[Integer Value]
    CheckedAdd[checked_add-value-]
    Overflow{Overflow?}
ErrorSUM[SUM: Return Error]
    ContinueTOTAL[TOTAL: Continue as Float64]
    
 
   IntValue --> CheckedAdd
 
   CheckedAdd --> Overflow
 
   Overflow -->|Yes + SUM| ErrorSUM
 
   Overflow -->|Yes + TOTAL| ContinueTOTAL
 
   Overflow -->|No| IntValue

Each data type uses a specialized accumulator to preserve precision and overflow semantics. Decimal aggregates maintain precision and scale metadata throughout accumulation.

Overflow Handling

Integer Overflow Strategy

Sources: llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975 llkv-aggregate/src/lib.rs:1474-1494

SUM uses checked_add() and sets the accumulator to None on overflow, returning an error during finalization. TOTAL avoids this by accumulating integers as Float64 from the start, trading precision for guaranteed completion.

Decimal Overflow

Decimal128 aggregates use checked_add() on the raw i128 values:

Sources: llkv-aggregate/src/lib.rs:915-932

When Decimal128 overflow occurs, the system returns an error immediately. There is no TOTAL-style fallback for decimals because precision requirements are explicit in the type signature.

graph LR
    Projection["Projection:\nSUM(price * quantity)"]
Extract[Extract Aggregate\nFunction Call]
    Expression["price * quantity"]
Translate[Translate to\nScalarExpr]
    EnsureProj[ensure_computed_projection]
    Accumulate[Accumulate via\nAggregateAccumulator]
    
 
   Projection --> Extract
 
   Extract --> Expression
 
   Expression --> Translate
 
   Translate --> EnsureProj
 
   EnsureProj --> Accumulate

Computed Aggregates

Aggregate Expressions in Projections

The executor handles aggregate function calls embedded in computed projections:

Sources: llkv-executor/src/lib.rs:703-712 llkv-executor/src/lib.rs:735-798 llkv-executor/src/lib.rs:473-505

When a projection contains an aggregate like SUM(price * quantity), the executor:

  1. Detects the aggregate via expr_contains_aggregate()
  2. Translates the inner expression (price * quantity) to a ScalarExpr
  3. Creates a computed projection for the expression
  4. Initializes an accumulator for the projection index
  5. Accumulates values from the computed column

This allows complex aggregate expressions beyond simple column references.
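
A simplified sketch of the overall idea, with the computed projection inlined as a map step (illustrative only):

```rust
/// Computed-aggregate sketch: evaluate the inner expression per row
/// (price * quantity), then feed the computed column into an ordinary SUM.
fn sum_price_times_quantity(rows: &[(f64, i64)]) -> f64 {
    rows.iter()
        .map(|(price, quantity)| price * (*quantity as f64)) // computed projection
        .sum() // accumulate over the computed column
}

fn main() {
    let rows = [(9.99, 3), (4.50, 2)];
    assert!((sum_price_times_quantity(&rows) - 38.97).abs() < 1e-9);
}
```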

Performance Considerations

Memory Usage

Each accumulator maintains state proportional to:

| Accumulator Type | Memory Per Group | Notes |
|---|---|---|
| COUNT(*) | 8 bytes (i64) | Constant size |
| SUM/AVG | 16-24 bytes | Value + metadata |
| MIN/MAX | 8-24 bytes | Single value + type info |
| COUNT(DISTINCT) | O(unique values) | Hash set grows with cardinality |
| GROUP_CONCAT | O(total string length) | Vector of strings |

Sources: llkv-aggregate/src/lib.rs:92-247

DISTINCT and GROUP_CONCAT have unbounded memory growth for high-cardinality data. The system does not implement spilling or approximate algorithms for these cases.

Parallelization

Aggregates are accumulated serially within a single thread because:

  1. Accumulators maintain mutable state that is not thread-safe
  2. DISTINCT tracking requires synchronized hash set updates
  3. Sequential batch processing simplifies overflow detection

Future work could introduce parallel accumulation with merge operations for distributive aggregates (SUM, COUNT, MIN, MAX); algebraic aggregates (AVG) and DISTINCT operations would require additional merge logic to be parallelized correctly.

Sources: llkv-aggregate/src/lib.rs:748-1440



Query Execution

Relevant source files

Purpose and Scope

Query execution is the process of converting logical query plans into physical result sets by coordinating table scans, expression evaluation, aggregation, joins, and result streaming. This page documents the execution engine's architecture, core components, and high-level execution flow.

For details on table-level planning and execution, see TablePlanner and TableExecutor. For scan optimization strategies, see Scan Execution and Optimization. For predicate evaluation mechanics, see Filter Evaluation.


System Architecture

Query execution spans two primary crates:

| Crate | Responsibility | Key Types |
|---|---|---|
| llkv-executor | Orchestrates multi-table queries, aggregates, and result formatting | QueryExecutor, SelectExecution |
| llkv-table | Executes table scans with predicates and projections | TablePlanner, TableExecutor |

The executor operates on logical plans produced by llkv-plan and delegates to llkv-table for single-table operations, llkv-join for join algorithms, and llkv-aggregate for aggregate computations.

Execution Architecture

graph TB
    subgraph "Plan Layer"
        PLAN["SelectPlan\n(llkv-plan)"]
end
    
    subgraph "Execution Orchestration (llkv-executor)"
        QE["QueryExecutor&lt;P&gt;"]
EXEC["SelectExecution&lt;P&gt;"]
STRAT["Strategy Selection:\nprojection, aggregate,\njoin, compound"]
end
    
    subgraph "Table Execution (llkv-table)"
        TP["TablePlanner"]
TE["TableExecutor"]
PS["PlannedScan"]
end
    
    subgraph "Specialized Operations"
        AGG["llkv-aggregate\nAccumulator"]
JOIN["llkv-join\nhash_join, cross_join"]
EVAL["NumericKernels\nscalar evaluation"]
end
    
    subgraph "Storage"
        TABLE["Table&lt;P&gt;"]
STORE["ColumnStore"]
end
    
 
   PLAN --> QE
 
   QE --> STRAT
 
   STRAT -->|single table| TP
 
   STRAT -->|aggregates| AGG
 
   STRAT -->|joins| JOIN
    
 
   TP --> PS
 
   PS --> TE
 
   TE --> TABLE
 
   TABLE --> STORE
    
 
   STRAT --> EVAL
 
   EVAL --> TABLE
    
 
   QE --> EXEC
 
   TE --> EXEC
 
   AGG --> EXEC
 
   JOIN --> EXEC

Sources: llkv-executor/src/lib.rs:507-521 llkv-table/src/planner/mod.rs:580-726


Core Components

QueryExecutor

QueryExecutor<P> is the top-level execution coordinator in llkv-executor. It consumes SelectPlan structures and produces SelectExecution result containers.

Key Responsibilities:

  • Strategy selection based on plan characteristics (single table, joins, aggregates, compound operations)
  • Multi-table query orchestration (cross products, hash joins)
  • Aggregate computation coordination
  • Subquery evaluation (correlated EXISTS, scalar subqueries)
  • Result streaming and batching

Entry points:

Sources: llkv-executor/src/lib.rs:507-521

SelectExecution

SelectExecution<P> encapsulates query results and provides streaming access via batched iteration. Results may be materialized upfront or generated lazily depending on the execution strategy.

Streaming Interface:

  • stream<F>(on_batch: F) - Process results batch-by-batch
  • into_rows() - Materialize all rows into memory (for sorting, deduplication)
  • schema() - Access result schema

Sources: llkv-executor/src/lib.rs:2500-2700 (approximate location based on file structure)
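
The sketch below shows the general callback-per-batch pattern this interface implies; the type is a stand-in, and the real SelectExecution signatures in llkv-executor may differ.

```rust
/// Stand-in for a streaming result container: a caller-supplied closure
/// is invoked once per batch, so nothing larger than one batch is held.
struct BatchStreamSketch {
    batches: Vec<Vec<i64>>, // stand-in for Arrow RecordBatch chunks
}

impl BatchStreamSketch {
    fn stream<F: FnMut(&Vec<i64>)>(self, mut on_batch: F) {
        for batch in &self.batches {
            on_batch(batch);
        }
    }
}

fn main() {
    let exec = BatchStreamSketch { batches: vec![vec![1, 2], vec![3]] };
    let mut total_rows = 0;
    exec.stream(|batch| total_rows += batch.len());
    assert_eq!(total_rows, 3);
}
```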

TablePlanner and TableExecutor

The table-level execution layer handles single-table scans with predicates and projections. TablePlanner analyzes the request and produces a PlannedScan, which TableExecutor then executes.

Planning Process:

  1. Validate projections against schema
  2. Normalize filter predicates (apply De Morgan's laws, flatten boolean operators)
  3. Compile predicates into EvalProgram and DomainProgram bytecode
  4. Build PlanGraph metadata for tracing

Execution Process:

  1. Try optimized fast paths (single column scans, full table scans)
  2. Fall back to general execution with expression evaluation
  3. Stream results in batches

These components are detailed in TablePlanner and TableExecutor.

Sources: llkv-table/src/planner/mod.rs:580-637 llkv-table/src/planner/mod.rs:728-1007


Execution Flow

Top-Level SELECT Execution Sequence

Sources: llkv-executor/src/lib.rs:523-569 llkv-table/src/planner/mod.rs:595-607 llkv-table/src/planner/mod.rs:1009-1400


graph TD
 
   START["SelectPlan"] --> COMPOUND{compound?}
COMPOUND -->|yes| EXEC_COMPOUND["execute_compound_select\nUNION/EXCEPT/INTERSECT"]
COMPOUND -->|no| FROM{tables.is_empty?}
FROM -->|yes| EXEC_CONST["execute_select_without_table\nEvaluate constant expressions"]
FROM -->|no| MULTI{tables.len > 1?}
MULTI -->|yes| EXEC_CROSS["execute_cross_product\nor hash_join optimization"]
MULTI -->|no| GROUPBY{group_by.is_empty?}
GROUPBY -->|no| EXEC_GROUP["execute_group_by_single_table\nGroup rows + compute aggregates"]
GROUPBY -->|yes| AGG{aggregates.is_empty?}
AGG -->|no| EXEC_AGG["execute_aggregates\nCollect all rows + compute"]
AGG -->|yes| COMPUTED{has_computed_aggregates?}
COMPUTED -->|yes| EXEC_COMP_AGG["execute_computed_aggregates\nEmbedded agg in expressions"]
COMPUTED -->|no| EXEC_PROJ["execute_projection\nStream scan with projections"]
EXEC_COMPOUND --> RESULT["SelectExecution"]
EXEC_CONST --> RESULT
 
   EXEC_CROSS --> RESULT
 
   EXEC_GROUP --> RESULT
 
   EXEC_AGG --> RESULT
 
   EXEC_COMP_AGG --> RESULT
 
   EXEC_PROJ --> RESULT

Execution Strategies

QueryExecutor selects an execution strategy based on plan characteristics:

Strategy Decision Tree

Sources: llkv-executor/src/lib.rs:527-569

Strategy Implementations

| Strategy | Method | When Applied | Key Operations |
|---|---|---|---|
| Constant Evaluation | execute_select_without_table | No FROM clause | Evaluate literals, struct constructors |
| Simple Projection | execute_projection | Single table, no aggregates | Stream scan with filter + projections |
| Aggregation | execute_aggregates | Has aggregates, no GROUP BY | Collect all rows, compute aggregates, emit single row |
| Grouped Aggregation | execute_group_by_single_table | Has GROUP BY | Hash rows by key, compute per-group aggregates |
| Computed Aggregates | execute_computed_aggregates | Aggregates embedded in computed projections | Extract aggregate expressions, evaluate separately |
| Cross Product | execute_cross_product | Multiple tables | Cartesian product or hash join optimization |
| Compound | execute_compound_select | UNION/EXCEPT/INTERSECT | Execute components, apply set operations |

Sources: llkv-executor/src/lib.rs:926-975 llkv-executor/src/lib.rs:1700-2100 llkv-executor/src/lib.rs:2200-2400 llkv-executor/src/lib.rs:1057-1400 llkv-executor/src/lib.rs:590-701


Streaming Execution Model

LLKV executes queries in a streaming fashion to avoid materializing large intermediate results. Results flow through the system as RecordBatch chunks (typically 4096 rows).

Streaming Characteristics:

| Execution Type | Streaming Behavior | Memory Characteristics |
|---|---|---|
| Projection | Full streaming | O(batch_size) memory |
| Filter | Full streaming | O(batch_size) memory |
| Aggregates | Requires full materialization | O(input_rows) memory |
| GROUP BY | Requires full materialization | O(group_count) memory |
| ORDER BY | Requires full materialization | O(input_rows) memory |
| DISTINCT | Requires full materialization | O(distinct_rows) memory |
| LIMIT | Early termination | O(limit × batch_size) memory |

Streaming Projection Example Flow:

Sources: llkv-table/src/planner/mod.rs:1009-1400 llkv-table/src/constants.rs:1-10 (defines STREAM_BATCH_ROWS = 4096)

Materialization Points

Certain operations require collecting all rows before producing output:

  1. Sorting - Must see all rows to determine order llkv-executor/src/lib.rs:2800-2900
  2. Deduplication (DISTINCT) - Must track all seen rows llkv-executor/src/lib.rs:2950-3050
  3. Aggregation - Must accumulate state across all rows llkv-executor/src/lib.rs:1700-1900
  4. Set Operations - Must materialize both sides for comparison llkv-executor/src/lib.rs:590-701

These operations call into_rows() on SelectExecution to materialize results as Vec<CanonicalRow>.

Sources: llkv-executor/src/lib.rs:2600-2700


Expression Evaluation

Query execution evaluates two types of expressions:

Predicate Evaluation (Filtering)

Predicates are compiled to bytecode and evaluated during table scans:

  1. Normalization - Apply De Morgan's laws, flatten AND/OR llkv-table/src/planner/program.rs:50-150
  2. Compilation - Convert to EvalProgram (stack-based) and DomainProgram (row tracking) llkv-table/src/planner/program.rs:200-400
  3. Vectorized Evaluation - Process chunks of rows efficiently llkv-table/src/planner/mod.rs:1100-1300

See Filter Evaluation for detailed mechanics.

Scalar Expression Evaluation (Projections)

Computed projections are evaluated row-by-row or vectorized when possible:

  1. Translation - Convert ScalarExpr<String> to ScalarExpr<FieldId> llkv-executor/src/translation/scalar.rs:1-200
  2. Type Inference - Determine output data type llkv-executor/src/translation/schema.rs:50-150
  3. Evaluation - Use NumericKernels for numeric operations llkv-table/src/scalar_eval.rs:450-685

Vectorized vs Row-by-Row:

Sources: llkv-table/src/scalar_eval.rs:675-712 llkv-table/src/scalar_eval.rs:549-673


Integration with Runtime

The execution layer coordinates with llkv-runtime for transaction and catalog management:

Runtime Integration Points:

| Operation | Runtime Responsibility | Executor Responsibility |
|---|---|---|
| Table Lookup | CatalogManager::table() | ExecutorTableProvider::get_table() |
| MVCC Filtering | Provide RowIdFilter with snapshot | Apply filter during scan |
| Transaction State | Track transaction ID, commit watermark | Include created_by/deleted_by in scans |
| Schema Resolution | Maintain system catalog | Translate column names to FieldId |

The ExecutorTableProvider trait abstracts runtime integration, allowing executor to remain runtime-agnostic.

Sources: llkv-executor/src/types.rs:100-200 llkv-runtime/src/catalog/mod.rs:50-150


Performance Characteristics

Execution performance depends on query characteristics and chosen strategy:

| Query Pattern | Typical Performance | Optimization Opportunities |
|---|---|---|
| SELECT * FROM t | ~1M rows/sec | Fast path: shadow column scan llkv-table/src/planner/mod.rs:765-821 |
| SELECT col FROM t WHERE pred | ~500K rows/sec | Predicate fusion llkv-table/src/planner/mod.rs:518-570 |
| Single-table aggregates | Full table scan | Column-only projections for aggregate inputs |
| Hash join (2 tables) | O(n + m) with O(n) memory | Smaller table as build side llkv-executor/src/lib.rs:1500-1700 |
| Cross product (n tables) | O(∏ row_counts) | Avoid if possible; rewrite to joins |

Sources: llkv-table/src/planner/mod.rs:738-856 llkv-executor/src/lib.rs:1082-1400



TablePlanner and TableExecutor

Relevant source files

This document describes the table-level query planning and execution system in LLKV. The TablePlanner translates scan operations into optimized execution plans, while the TableExecutor implements multiple execution strategies to materialize query results efficiently. For information about the broader query execution pipeline, see Query Execution. For details on expression compilation and evaluation, see Program Compilation and Scalar Evaluation and NumericKernels.

Purpose and Scope

The TablePlanner and TableExecutor form the core of LLKV's table-level query execution. They bridge the gap between logical query plans (from llkv-plan) and physical data access (through llkv-column-map). This document covers:

  • How scan operations are planned and optimized
  • The structure of compiled execution plans (PlannedScan)
  • Multiple execution strategies and their trade-offs
  • Predicate fusion optimization
  • Integration with MVCC row filtering
  • Streaming result materialization

Architecture Overview

Sources: llkv-table/src/planner/mod.rs:580-726

TablePlanner

The TablePlanner is responsible for analyzing scan requests and constructing optimized execution plans. It does not execute queries itself but prepares all necessary metadata for the TableExecutor.

Structure

The planner holds a reference to the Table being scanned and provides a single public method: scan_stream_with_exprs.

Sources: llkv-table/src/planner/mod.rs:580-593

Planning Flow

The planning process consists of several stages:

  1. Validation : Ensures at least one projection is specified
  2. Normalization : Applies De Morgan's laws and flattens logical operators via normalize_predicate
  3. Graph Construction : Builds a PlanGraph for visualization and introspection
  4. Program Compilation : Compiles filter expressions into EvalProgram and DomainProgram bytecode

Sources: llkv-table/src/planner/mod.rs:595-637

PlanGraph Construction

The build_plan_graph method creates a directed acyclic graph (DAG) representing the logical query plan:

| Node Type | Purpose | Metadata |
|---|---|---|
| TableScan | Entry point for data access | table_id, projection_count |
| Filter | Predicate application | predicates (formatted expressions) |
| Project | Column selection and computation | projections, fields with types |
| Output | Result materialization | include_nulls flag |

Sources: llkv-table/src/planner/mod.rs:639-725

PlannedScan Structure

The PlannedScan is an intermediate representation that bridges planning and execution. It contains all information needed to execute a scan without holding runtime state.

| Field | Type | Purpose |
|---|---|---|
| projections | Vec<ScanProjection> | Column and computed projections to materialize |
| filter_expr | Arc<Expr<FieldId>> | Normalized filter predicate |
| options | ScanStreamOptions<P> | MVCC filters, ordering, null handling |
| plan_graph | PlanGraph | Logical plan for introspection |
| programs | ProgramSet | Compiled bytecode for predicate evaluation |

Sources: llkv-table/src/planner/mod.rs:500-509

TableExecutor

The TableExecutor implements multiple execution strategies, selecting the most efficient based on query characteristics.

Structure

The executor maintains a cache of row IDs to avoid redundant scans when multiple operations target the same table.

Sources: llkv-table/src/planner/mod.rs:572-578

Execution Strategy Selection

Sources: llkv-table/src/planner/mod.rs:1009-1367

Single Column Direct Scan Fast Path

The try_single_column_direct_scan optimization applies when:

  • Exactly one projection is requested
  • include_nulls is false
  • Filter is either trivial or a full-range predicate on the projected column
  • Column type is not Utf8 or LargeUtf8 (to avoid string complexity)
graph TD
    CHECK1{projections.len == 1?} -->|No| FALLBACK1[Fallback]
 
   CHECK1 -->|Yes| CHECK2{include_nulls == false?}
CHECK2 -->|No| FALLBACK2[Fallback]
 
   CHECK2 -->|Yes| PROJ_TYPE{Projection type?}
PROJ_TYPE -->|Column| CHECK_FILTER["is_full_range_filter?"]
PROJ_TYPE -->|Computed| CHECK_COMPUTED[Single field?]
    
 
   CHECK_FILTER -->|No| FALLBACK3[Fallback]
 
   CHECK_FILTER -->|Yes| CHECK_DTYPE{dtype?}
CHECK_DTYPE -->|Utf8/LargeUtf8| FALLBACK4[Fallback]
 
   CHECK_DTYPE -->|Other| DIRECT_SCAN[SingleColumnStreamVisitor]
    
 
   CHECK_COMPUTED -->|No| FALLBACK5[Fallback]
 
   CHECK_COMPUTED -->|Yes| COMPUTE_TYPE{Passthrough or Affine?}
COMPUTE_TYPE -->|Passthrough| DIRECT_SCAN2[SingleColumnStreamVisitor]
 
   COMPUTE_TYPE -->|Affine| AFFINE_SCAN[AffineSingleColumnVisitor]
 
   COMPUTE_TYPE -->|Other| COMPUTE_SCAN[ComputedSingleColumnVisitor]
    
 
   DIRECT_SCAN --> HANDLED[StreamOutcome::Handled]
 
   DIRECT_SCAN2 --> HANDLED
 
   AFFINE_SCAN --> HANDLED
 
   COMPUTE_SCAN --> HANDLED

This path streams data directly from storage using ScanBuilder without building intermediate row ID lists or using RowStreamBuilder.

Sources: llkv-table/src/planner/mod.rs:1369-1530

Full Table Scan Streaming Fast Path

The try_stream_full_table_scan optimization applies when:

  • Filter is trivial (no predicates)
  • No ordering is required (options.order.is_none())
graph TD
    CHECK_ORDER{order.is_some?} -->|Yes| FALLBACK[Fallback]
 
   CHECK_ORDER -->|No| STREAM_START[stream_table_row_ids]
    
 
   STREAM_START --> SHADOW{Shadow column exists?}
SHADOW -->|Yes| CHUNK[Emit row_id chunks]
 
   SHADOW -->|No| FALLBACK2[Multi-column scan fallback]
    
 
   CHUNK --> MVCC_FILTER{row_id_filter?}
MVCC_FILTER -->|Yes| APPLY[filter.filter]
 
   MVCC_FILTER -->|No| BUILD
 
   APPLY --> BUILD
    
 
   BUILD[RowStreamBuilder] --> GATHER[Gather columns]
 
   GATHER --> EMIT[Emit RecordBatch]
 
   EMIT --> MORE{More chunks?}
MORE -->|Yes| CHUNK
 
   MORE -->|No| CHECK_EMPTY
    
    CHECK_EMPTY{any_emitted?} -->|No| SYNTHETIC[emit_synthetic_null_batch]
 
   CHECK_EMPTY -->|Yes| HANDLED[StreamOutcome::Handled]
 
   SYNTHETIC --> HANDLED

This path uses stream_table_row_ids to enumerate row IDs in chunks directly from the row_id shadow column, avoiding full predicate evaluation.

This optimization is particularly effective for queries like SELECT col1, col2 FROM table with no WHERE clause.

Sources: llkv-table/src/planner/mod.rs:905-999

General Execution Path

When fast paths don't apply, the executor follows a multi-stage process:

Stage 1: Projection Metadata Construction

The executor builds several data structures:

| Structure | Purpose |
|---|---|
| projection_evals | Vec<ProjectionEval> mapping projections to evaluation strategies |
| unique_lfids | Vec<LogicalFieldId> of columns to fetch from storage |
| unique_index | FxHashMap<LogicalFieldId, usize> for column position lookup |
| numeric_fields | FxHashSet<FieldId> of columns needing numeric coercion |
| passthrough_fields | Vec<Option<FieldId>> for identity computed projections |

Sources: llkv-table/src/planner/mod.rs:1033-1117

Stage 2: Row ID Collection

For trivial filters, the executor scans the MVCC created_by column to enumerate all rows (including those with NULL values in user columns). For non-trivial filters, it evaluates the compiled ProgramSet.

Sources: llkv-table/src/planner/mod.rs:1269-1327

Stage 3: Streaming Execution

The RowStreamBuilder materializes results in batches of STREAM_BATCH_ROWS (default: 1024). For each batch:

  1. Gather : Fetch column data for row IDs via MultiGatherContext
  2. Evaluate : Compute any ProjectionEval::Computed expressions
  3. Materialize : Construct RecordBatch with final schema
  4. Emit : Call user-provided callback

Sources: llkv-table/src/planner/mod.rs:1337-1365

Program Compilation

The ProgramCompiler translates filter expressions into stack-based bytecode for efficient evaluation. It produces two programs:

  • EvalProgram : Evaluates predicates and returns matching row IDs
  • DomainProgram : Computes the domain (all potentially relevant row IDs) for NOT operations

ProgramSet Structure

Bytecode Operations

| Opcode | Stack Effect | Purpose |
|---|---|---|
| PushPredicate(filter) | [] → [rows] | Evaluate single predicate |
| PushCompare{left, op, right} | [] → [rows] | Evaluate comparison expression |
| PushInList{expr, list, negated} | [] → [rows] | Evaluate IN list |
| PushIsNull{expr, negated} | [] → [rows] | Evaluate IS NULL |
| PushLiteral(bool) | [] → [rows] | Push all rows (true) or empty (false) |
| FusedAnd{field_id, filters} | [] → [rows] | Apply multiple predicates on same field |
| And{child_count} | [r1, r2, ...] → [r] | Intersect N row ID sets |
| Or{child_count} | [r1, r2, ...] → [r] | Union N row ID sets |
| Not{domain} | [matched] → [domain - matched] | Set difference using domain program |

Sources: llkv-table/src/planner/program.rs:1-200 (referenced but not in provided files)

Execution Example

Consider the filter: WHERE (age > 18 AND age < 65) OR name = 'Alice'

After normalization and compilation:

Stack Operations:
1. PushCompare(age > 18)        → [rows1]
2. PushCompare(age < 65)        → [rows1, rows2]
3. And{2}                       → [rows1 ∩ rows2]
4. PushPredicate(name = 'Alice') → [rows1 ∩ rows2, rows3]
5. Or{2}                        → [(rows1 ∩ rows2) ∪ rows3]

The collect_row_ids_for_program method executes this bytecode by maintaining a stack of row ID vectors and applying set operations.

Sources: llkv-table/src/planner/mod.rs:2376-2502
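
The sketch below is a minimal stand-in for such a stack machine over row-id sets; the opcode names mirror the table above, but this is not the collect_row_ids_for_program implementation.

```rust
use std::collections::BTreeSet;

/// Minimal stack-machine sketch: operand opcodes push row-id sets,
/// And/Or pop N sets and push their intersection/union.
enum OpSketch {
    PushRows(BTreeSet<u64>),
    And(usize),
    Or(usize),
}

fn run(ops: &[OpSketch]) -> BTreeSet<u64> {
    let mut stack: Vec<BTreeSet<u64>> = Vec::new();
    for op in ops {
        match op {
            OpSketch::PushRows(rows) => stack.push(rows.clone()),
            OpSketch::And(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.intersection(&next).copied().collect();
                }
                stack.push(acc);
            }
            OpSketch::Or(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.union(&next).copied().collect();
                }
                stack.push(acc);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    let rows1: BTreeSet<u64> = BTreeSet::from([1, 2, 3, 4]); // age > 18
    let rows2: BTreeSet<u64> = BTreeSet::from([3, 4, 5]);    // age < 65
    let rows3: BTreeSet<u64> = BTreeSet::from([9]);          // name = 'Alice'
    let result = run(&[
        OpSketch::PushRows(rows1),
        OpSketch::PushRows(rows2),
        OpSketch::And(2),
        OpSketch::PushRows(rows3),
        OpSketch::Or(2),
    ]);
    let expected: BTreeSet<u64> = BTreeSet::from([3, 4, 9]);
    assert_eq!(result, expected);
}
```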

graph TD
 
   ANALYZE[Analyze filter expression] --> BUILD["Build per_field stats"]
BUILD --> STATS["FieldPredicateStats:\ntotal, contains"]
STATS --> CHECK{should_fuse?}
CHECK --> DTYPE{Data type?}
DTYPE -->|Utf8/LargeUtf8| STRING_RULE["contains ≥ 1 AND total ≥ 2"]
DTYPE -->|Other| NUMERIC_RULE["total ≥ 2"]
STRING_RULE -->|Yes| FUSE[Generate FusedAnd opcode]
 
   NUMERIC_RULE -->|Yes| FUSE
    
 
   STRING_RULE -->|No| SEPARATE[Evaluate predicates separately]
 
   NUMERIC_RULE -->|No| SEPARATE

Predicate Fusion

The PredicateFusionCache analyzes filter expressions to identify opportunities for fused predicate evaluation.

Fusion Strategy

Predicate fusion is particularly beneficial for string columns with CONTAINS operations, where multiple predicates on the same field can be evaluated in a single storage scan.

Example: WHERE name LIKE '%Smith%' AND name LIKE '%John%' can be fused into a single scan with two pattern matchers.

Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/mod.rs:2504-2580

Projection Evaluation

The ProjectionEval enum handles two types of projections:

Column Projections

Direct column references that can be gathered from storage without computation.

Computed Projections

Expressions evaluated per-row using NumericKernels. The executor optimizes several patterns:

| Pattern | Optimization |
|---|---|
| column | Passthrough (no computation) |
| a * column + b | Affine transformation (vectorized) |
| General expression | Full expression evaluation |

Sources: llkv-table/src/planner/mod.rs:482-498 llkv-table/src/planner/mod.rs:1110-1117

Row ID Collection Strategies

The executor uses different strategies for collecting matching row IDs based on predicate characteristics:

Simple Predicates

For predicates on a single field (e.g., age > 18):

Sources: llkv-table/src/planner/mod.rs:1532-1612

Comparison Expressions

For comparisons involving multiple fields (e.g., col1 + col2 > 10):

This approach minimizes wasted computation by first identifying a "domain" of potentially matching rows (intersection of rows where all referenced columns have values) before evaluating the full expression.

Sources: llkv-table/src/planner/mod.rs:1699-1775 llkv-table/src/planner/mod.rs:2214-2374

IN List Expressions

For IN list predicates (e.g., status IN ('active', 'pending')):

The IN list evaluator properly handles SQL's three-valued logic:

  • If value matches any list element: TRUE (or FALSE if negated)
  • If no match but list contains NULL: NULL (indeterminate)
  • If no match and no NULLs: FALSE (or TRUE if negated)

Sources: llkv-table/src/planner/mod.rs:2001-2044 llkv-table/src/planner/mod.rs:1841-1999

graph TD
 
   COLLECT[Collect row IDs from predicates] --> FILTER{row_id_filter.is_some?}
FILTER -->|No| CONTINUE[Continue to streaming]
 
   FILTER -->|Yes| APPLY["filter.filter(table, row_ids)"]
APPLY --> CHECK_VISIBILITY[Check MVCC columns]
 
   CHECK_VISIBILITY --> CREATED["created_by ≤ txn_id"]
CHECK_VISIBILITY --> DELETED["deleted_by > txn_id OR NULL"]
CREATED --> VISIBLE{Visible?}
DELETED --> VISIBLE
 
   VISIBLE -->|Yes| KEEP[Keep row ID]
 
   VISIBLE -->|No| DROP[Drop row ID]
    
 
   KEEP --> FILTERED[Filtered row IDs]
 
   DROP --> FILTERED
 
   FILTERED --> CONTINUE

MVCC Integration

The executor integrates with LLKV's MVCC system through the row_id_filter option in ScanStreamOptions. After collecting row IDs through predicate evaluation, the filter determines which rows are visible to the current transaction:

For trivial filters, the executor explicitly scans the created_by MVCC column to enumerate all rows, ensuring that rows with NULL values in user columns are included when appropriate.

Sources: llkv-table/src/planner/mod.rs:1269-1323

Performance Characteristics

The table below summarizes the time complexity of different execution paths:

| Execution Path | Conditions | Row ID Collection | Data Gathering | Total |
|---|---|---|---|---|
| Single Column Direct | 1 projection, trivial/full-range filter | O(1) | O(n) streaming | O(n) |
| Full Table Stream | Trivial filter, no order | O(n) via shadow column | O(n) streaming | O(n) |
| General (indexed predicate) | Single-field predicate with index | O(log n + m) | O(m × c) | O(log n + m × c) |
| General (complex predicate) | Multi-field or computed predicate | O(n × p) | O(m × c) | O(n × p + m × c) |

Where:

  • n = total rows in table
  • m = matching rows after filtering
  • c = number of columns in projection
  • p = complexity of predicate (number of fields involved)

The executor automatically selects the most efficient path based on query characteristics, with no manual tuning required.

Sources: llkv-table/src/planner/mod.rs:1009-1530 llkv-table/src/planner/mod.rs:905-999

graph TB
    subgraph "External Input"
        PLAN[llkv-plan SelectPlan]
        EXPR[llkv-expr Expr]
    end
    
    subgraph "Table Layer"
        TP[TablePlanner]
        TE[TableExecutor]
        PLANNED[PlannedScan]
 
       TP --> PLANNED
 
       PLANNED --> TE
    end
    
    subgraph "Storage Layer"
        STORE[ColumnStore]
        SCAN[ScanBuilder]
        GATHER[MultiGatherContext]
    end
    
    subgraph "Expression Evaluation"
        NORM[normalize_predicate]
        COMPILER[ProgramCompiler]
        KERNELS[NumericKernels]
    end
    
    subgraph "Output"
        STREAM[RowStreamBuilder]
        BATCH[RecordBatch]
    end
    
 
   PLAN --> TP
 
   EXPR --> TP
    
 
   TP --> NORM
 
   NORM --> COMPILER
    
 
   TE --> SCAN
 
   TE --> GATHER
 
   TE --> KERNELS
 
   TE --> STREAM
    
 
   SCAN --> STORE
 
   GATHER --> STORE
 
   STREAM --> BATCH

Integration Points

The TablePlanner and TableExecutor integrate with several other LLKV subsystems:

Sources: llkv-table/src/planner/mod.rs:1-76


This architecture enables LLKV to efficiently execute table scans with complex predicates and projections while maintaining clean separation between logical planning and physical execution. The multiple optimization paths ensure that simple queries execute quickly while complex queries remain correct.



Scan Execution and Optimization

Relevant source files

Purpose and Scope

This page documents the table scan execution engine in the llkv-table crate, which implements the low-level scanning and streaming of data from the column store to higher layers. It covers planning, optimization paths, predicate compilation, and expression evaluation strategies that enable efficient data retrieval.

For information about higher-level query planning and the translation of SQL plans to table operations, see TablePlanner and TableExecutor. For details on how filters are evaluated against individual rows, see Filter Evaluation.

Architecture Overview

The scan execution system is split into two primary components:

| Component | Responsibility |
|---|---|
| TablePlanner | Analyzes scan requests, builds plan graphs, compiles predicates into bytecode programs |
| TableExecutor | Executes planned scans using optimization paths, coordinates streaming results |

Scan Execution Flow

Sources: llkv-table/src/planner/mod.rs:591-637

Planning Phase

Plan Construction

The TablePlanner::plan_scan method orchestrates plan construction by:

  1. Validating projections are non-empty
  2. Normalizing the filter predicate
  3. Building a plan graph for visualization and analysis
  4. Compiling predicates into executable programs
graph LR
    Input["scan_stream_with_exprs\n(projections, filter, options)"]
Validate["Validate projections"]
Normalize["normalize_predicate\n(flatten And/Or,\napply De Morgan's)"]
Graph["build_plan_graph\n(TableScan → Filter →\nProject → Output)"]
Compile["ProgramCompiler\n(EvalProgram +\nDomainProgram)"]
Output["PlannedScan"]
Input --> Validate
 
   Validate --> Normalize
 
   Validate --> Graph
 
   Normalize --> Compile
 
   Compile --> Output
 
   Graph --> Output

The plan graph encodes the logical operator tree for diagnostic purposes, with nodes representing TableScan, Filter, Project, and Output operators.

Sources: llkv-table/src/planner/mod.rs:612-637 llkv-table/src/planner/mod.rs:639-725

Predicate Compilation

Predicates are compiled into two bytecode programs:

| Program Type | Purpose |
|---|---|
| EvalProgram | Stack-based bytecode for evaluating filter conditions |
| DomainProgram | Tracks which row IDs satisfy the predicate during scanning |

The ProgramCompiler analyzes the normalized predicate tree and emits instructions for efficient evaluation. Predicate fusion merges multiple predicates on the same field when beneficial.

graph TB
    Filter["Expr&lt;FieldId&gt;\n(normalized predicate)"]
Fusion["PredicateFusionCache\n(analyze per-field stats)"]
Compiler["ProgramCompiler::compile"]
EvalProg["EvalProgram\n(stack-based bytecode)"]
DomainProg["DomainProgram\n(row ID tracking)"]
ProgramSet["ProgramSet\n(contains both programs)"]
Filter --> Fusion
 
   Filter --> Compiler
 
   Fusion --> Compiler
 
   Compiler --> EvalProg
 
   Compiler --> DomainProg
 
   EvalProg --> ProgramSet
 
   DomainProg --> ProgramSet

Sources: llkv-table/src/planner/mod.rs:625-629 llkv-table/src/planner/program.rs

Execution Phase and Optimization Paths

The TableExecutor::execute method attempts multiple optimization paths before falling back to the general scan:

Fast Path: Single Column Direct Scan

When the scan requests a single column with a simple predicate, the executor uses try_single_column_direct_scan to stream data directly from the column store without materializing row IDs.

Conditions for single column fast path:

  • Exactly one column projection
  • No computed projections
  • Simple predicate structure (optional)
  • Compatible data types

This path bypasses row ID collection and gather operations, streaming column chunks directly to the caller.

Sources: llkv-table/src/planner/mod.rs:1021-1031 llkv-table/src/planner/mod.rs:1157-1343

Fast Path: Full Table Scan Streaming

For queries without ordering requirements, try_stream_full_table_scan uses incremental row ID streaming to avoid buffering all row IDs in memory:

graph TB
    Start["try_stream_full_table_scan"]
CheckOrder["Check options.order\n(must be None)"]
StreamRIDs["stream_table_row_ids\n(chunk_size batches)"]
Shadow["Try shadow column\nscan (fast)"]
Fallback["Multi-column scan\n(fallback)"]
ProcessChunk["Process chunk:\n1. Apply row_id_filter\n2. Build RowStream\n3. Emit batches"]
Batch["RecordBatch\n(via on_batch)"]
Start --> CheckOrder
 
   CheckOrder -->|order is Some| Return["Return Fallback"]
CheckOrder -->|order is None| StreamRIDs
    
 
   StreamRIDs --> Shadow
 
   Shadow -->|Success| ProcessChunk
 
   Shadow -->|Not found| Fallback
 
   Fallback --> ProcessChunk
    
 
   ProcessChunk --> Batch
 
   Batch -->|Next chunk| StreamRIDs

Sources: llkv-table/src/planner/mod.rs:904-999 llkv-table/src/planner/mod.rs:859-902

Row ID Collection Optimization

Row ID collection uses a two-tier strategy:

  1. Fast path (shadow column) : Scan the dedicated row_id shadow column which contains all row IDs for the table
  2. Fallback (multi-column scan) : Scan user columns and deduplicate row IDs when shadow column is unavailable

The shadow column optimization is significantly faster because:

  • Single column scan instead of multiple
  • No deduplication required
  • Direct row ID format

Sources: llkv-table/src/planner/mod.rs:748-857

General Scan Execution

When fast paths don't apply, the executor uses the general scan path:

  1. Projection analysis : Classify projections as column references or computed expressions
  2. Field collection : Build unique field lists and numeric field maps
  3. Row ID collection : Gather all relevant row IDs (using optimizations above)
  4. Row ID filtering : Apply predicate programs to filter row IDs
  5. Gather and stream : Use RowStreamBuilder to materialize columns and emit batches

General Scan Pipeline

graph TB
    Execute["TableExecutor::execute"]
Analyze["Analyze projections:\n- Column refs\n- Computed exprs\n- Build unique_lfids"]
CollectRIDs["table_row_ids\n(with caching)"]
FilterRIDs["Filter row IDs:\n- collect_row_ids_for_rowid_filter\n- Apply EvalProgram\n- Apply DomainProgram"]
Order["Apply ordering\n(if options.order present)"]
Gather["RowStreamBuilder:\n- prepare_gather_context\n- stream chunks\n- evaluate computed exprs"]
Emit["Emit RecordBatch"]
Execute --> Analyze
 
   Analyze --> CollectRIDs
 
   CollectRIDs --> FilterRIDs
 
   FilterRIDs --> Order
 
   Order --> Gather
 
   Gather --> Emit

Sources: llkv-table/src/planner/mod.rs:1009-1343 llkv-table/src/planner/mod.rs:1345-1710

Predicate Optimization

Normalization

The normalize_predicate function applies logical transformations to simplify filter expressions:

  • De Morgan's laws: NOT (a AND b) → (NOT a) OR (NOT b)
  • Flatten nested operators: AND[AND[a,b],c] → AND[a,b,c]
  • Constant folding: AND[true, x] → x
  • Double negation elimination: NOT (NOT x) → x

These transformations expose optimization opportunities and simplify compilation.

Sources: llkv-table/src/planner/program.rs

Predicate Fusion

The PredicateFusionCache analyzes predicates to determine when multiple conditions on the same field should be fused:

| Data Type | Fusion Criteria |
|---|---|
| String types | contains count ≥ 1 AND total predicates ≥ 2 |
| Other types | Total predicates ≥ 2 |

graph TB
    Expr["Filter Expression"]
Cache["PredicateFusionCache"]
Traverse["Traverse expression tree"]
Stats["Per-field stats:\n- total predicates\n- contains predicates"]
Decision["should_fuse(field, dtype)"]
Fuse["Fuse predicates into\nsingle evaluation"]
Separate["Keep predicates separate"]
Expr --> Cache
 
   Cache --> Traverse
 
   Traverse --> Stats
 
   Stats --> Decision
    
 
   Decision -->|Meets criteria| Fuse
 
   Decision -->|Below threshold| Separate

Fusion enables single-pass evaluation rather than multiple column scans for the same field.

Sources: llkv-table/src/planner/mod.rs:517-570

Expression Evaluation

Numeric Kernels

The NumericKernels system in llkv-table/src/scalar_eval.rs provides vectorized evaluation for scalar expressions:

| Kernel Operation | Description |
|---|---|
| collect_fields | Extract all field references from expression |
| prepare_numeric_arrays | Cast columns to unified numeric representation |
| evaluate_value | Row-by-row scalar evaluation |
| evaluate_batch | Vectorized batch evaluation |
| simplify | Detect affine expressions for optimization |

Numeric Type Hierarchy

Sources: llkv-table/src/scalar_eval.rs:1-90 llkv-table/src/scalar_eval.rs:451-712

Vectorized Evaluation

When possible, expressions are evaluated using vectorized paths:

  1. Column access : Direct array reference (zero-copy)
  2. Literals : Broadcast scalar to array length
  3. Binary operations : Arrow compute kernels for array-array or array-scalar operations
  4. Affine expressions : Specialized scale * field + offset fast path

The try_evaluate_vectorized method attempts vectorization before falling back to row-by-row evaluation.

Sources: llkv-table/src/scalar_eval.rs:714-762

graph LR
    Expr["ScalarExpr"]
Detect["NumericKernels::simplify"]
Check["is_affine_column_expr"]
Affine["AffineExpr:\nfield, scale, offset"]
Direct["Direct column reference"]
Complex["Complex expression"]
Expr --> Detect
 
   Detect --> Check
    
 
   Check -->|Matches pattern| Affine
 
   Check -->|Single column| Direct
 
   Check -->|Other| Complex

Affine Expression Optimization

Expressions matching the pattern scale * field + offset are detected and optimized:

Affine expressions enable:

  • Single column scan with arithmetic applied
  • Reduced memory allocation
  • Better cache locality

Sources: llkv-table/src/scalar_eval.rs:1038-1174 llkv-table/src/planner/mod.rs:1711-1872
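
A minimal sketch of the affine fast path, assuming a projection already recognized as scale * field + offset (illustrative only):

```rust
/// Once an expression is recognized as scale * field + offset, it can be
/// applied in a single pass over the column without intermediate arrays.
struct AffineSketch {
    scale: f64,
    offset: f64,
}

impl AffineSketch {
    fn apply(&self, column: &[f64]) -> Vec<f64> {
        column.iter().map(|v| self.scale * v + self.offset).collect()
    }
}

fn main() {
    // Corresponds to a projection like `2 * price + 1`.
    let affine = AffineSketch { scale: 2.0, offset: 1.0 };
    assert_eq!(affine.apply(&[1.0, 2.5]), vec![3.0, 6.0]);
}
```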

graph TB
    Builder["RowStreamBuilder::new"]
Config["Configuration:\n- store\n- table_id\n- schema\n- unique_lfids\n- projection_evals\n- row_ids\n- batch_size"]
GatherCtx["prepare_gather_context\n(optional reuse)"]
Build["build()"]
Stream["RowStream"]
NextChunk["next_chunk()"]
Gather["Gather columns\nfor batch_size rows"]
Evaluate["Evaluate computed\nprojections"]
Batch["StreamChunk\n(arrays + schema)"]
Builder --> Config
 
   Config --> GatherCtx
 
   GatherCtx --> Build
 
   Build --> Stream
    
 
   Stream --> NextChunk
 
   NextChunk --> Gather
 
   Gather --> Evaluate
 
   Evaluate --> Batch
 
   Batch -->|More rows| NextChunk

Streaming Architecture

Row Stream Builder

The RowStreamBuilder constructs streaming result iterators with configurable batch sizes:

The stream uses STREAM_BATCH_ROWS (default 1024) as the chunk size for incremental result production.

Sources: llkv-table/src/stream.rs llkv-table/src/constants.rs:1-7

Gather Context Reuse

MultiGatherContext enables amortization of setup costs across multiple scans:

  • Caches physical key lookups
  • Reuses internal buffers
  • Reduces allocations in streaming scenarios

The context is optional but improves performance for repeated scans of the same columns.

Sources: llkv-column-map/src/store/scan.rs

Performance Characteristics

| Scan Type | Row ID Collection | Column Access | Memory Usage |
|---|---|---|---|
| Single column direct | None (streams directly) | Direct column chunks | O(chunk_size) |
| Full table streaming | Shadow column (fast) | Incremental gather | O(batch_size × columns) |
| Filtered scan | Shadow or multi-column | Full gather | O(row_count × columns) |
| Ordered scan | Shadow or multi-column | Full gather + sort | O(row_count × columns) |

The executor prioritizes fast paths that minimize memory usage and avoid full table materialization when possible.

Sources: llkv-table/src/planner/mod.rs:748-999 llkv-table/README.md:1-57



Filter Evaluation

Relevant source files

Purpose and Scope

This page explains how filter expressions (WHERE clause predicates) are evaluated against table rows during query execution. This includes the compilation of filter expressions into efficient bytecode programs, stack-based evaluation mechanisms, integration with MVCC visibility filtering, and various optimization strategies.

For information about how expressions are initially structured and planned, see Expression System and Query Planning. For details about the broader scan execution context, see Scan Execution and Optimization.


Filter Expression Pipeline

Filter evaluation follows a multi-stage pipeline that transforms SQL predicates into efficient executable programs:

Sources: llkv-table/src/planner/mod.rs:595-636 llkv-table/src/planner/program.rs:256-284 llkv-executor/src/translation/expression.rs:18-174

graph LR
 
   SQL["SQL WHERE Clause"] --> Parser["sqlparser AST"]
Parser --> ExprString["Expr&lt;String&gt;\nField names"]
ExprString --> Translation["Field Resolution\nCatalog Lookup"]
Translation --> ExprFieldId["Expr&lt;FieldId&gt;\nResolved fields"]
ExprFieldId --> Normalize["normalize_predicate\nDe Morgan's Laws\nFlatten AND/OR"]
Normalize --> Compiler["ProgramCompiler"]
Compiler --> EvalProg["EvalProgram\nStack Bytecode"]
Compiler --> DomainProg["DomainProgram\nRow ID Selection"]
EvalProg --> Evaluation["Row Evaluation"]
DomainProg --> Evaluation
 
   Evaluation --> MVCCFilter["MVCC Filtering"]
MVCCFilter --> Results["Filtered Results"]

Program Compilation

Normalization

Before compilation, filter expressions are normalized into canonical form using the normalize_predicate function. This transformation ensures consistent structure for optimization and evaluation.

Normalization rules:

  • Flatten nested conjunctions/disjunctions: AND(AND(a,b),c) → AND(a,b,c)
  • Apply De Morgan's laws: push NOT operators down through logical connectives
  • Eliminate double negations: NOT(NOT(expr)) → expr
  • Simplify literal booleans: NOT(true) → false

The normalization process uses an iterative approach to handle deeply nested expressions without stack overflow. The transformation is applied recursively, with special handling for negated conjunctions and disjunctions.

Key normalization functions:

| Function | Purpose |
|---|---|
| normalize_predicate | Entry point for expression normalization |
| normalize_expr | Flattens AND/OR and delegates to normalize_negated |
| normalize_negated | Applies De Morgan's laws and simplifies negations |

Sources: llkv-table/src/planner/program.rs:286-343
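
A simplified stand-in for this normalization step (it applies De Morgan's laws and removes double negation, but omits flattening and constant folding):

```rust
/// Toy expression type used only for this sketch.
#[derive(Debug, PartialEq)]
enum ExprSketch {
    Pred(&'static str),
    And(Vec<ExprSketch>),
    Or(Vec<ExprSketch>),
    Not(Box<ExprSketch>),
}

fn normalize(e: ExprSketch) -> ExprSketch {
    use ExprSketch::*;
    match e {
        Not(inner) => match *inner {
            Not(x) => normalize(*x), // NOT(NOT x) => x
            // De Morgan: push NOT through AND/OR.
            And(xs) => Or(xs.into_iter().map(|x| normalize(Not(Box::new(x)))).collect()),
            Or(xs) => And(xs.into_iter().map(|x| normalize(Not(Box::new(x)))).collect()),
            other => Not(Box::new(normalize(other))),
        },
        And(xs) => And(xs.into_iter().map(normalize).collect()),
        Or(xs) => Or(xs.into_iter().map(normalize).collect()),
        p => p,
    }
}

fn main() {
    use ExprSketch::*;
    // NOT (a AND b)  =>  (NOT a) OR (NOT b)
    let input = Not(Box::new(And(vec![Pred("a"), Pred("b")])));
    let expected = Or(vec![
        Not(Box::new(Pred("a"))),
        Not(Box::new(Pred("b"))),
    ]);
    assert_eq!(normalize(input), expected);
}
```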

Bytecode Generation

The ProgramCompiler translates normalized expressions into two complementary program representations:

EvalProgram operations:

graph TB
    subgraph "Compilation"
        Expr["Normalized Expr&lt;FieldId&gt;"]
Compiler["ProgramCompiler"]
Expr --> Compiler
    end
    
    subgraph "Programs"
        EvalProg["EvalProgram\nStack-based bytecode\nfor predicate evaluation"]
DomainProg["DomainProgram\nRow ID domain\ncalculation"]
Compiler --> EvalProg
 
       Compiler --> DomainProg
    end
    
    subgraph "Evaluation"
 
       EvalProg --> ResultStack["Result Stack\nbool values"]
DomainProg --> RowIDs["Row ID Sets\ncandidate rows"]
end

| Operation | Stack Effect | Purpose |
|---|---|---|
| PushPredicate | → bool | Evaluate single predicate |
| PushCompare | → bool | Evaluate scalar comparison |
| PushInList | → bool | Evaluate IN list membership |
| PushIsNull | → bool | Evaluate NULL test |
| PushLiteral | → bool | Push constant boolean |
| FusedAnd | → bool | Evaluate fused predicates on same field |
| And | bool×N → bool | Logical AND of N values |
| Or | bool×N → bool | Logical OR of N values |
| Not | bool → bool | Logical negation (uses domain program) |

DomainProgram operations:

| Operation | Purpose |
|---|---|
| PushFieldAll | Include all rows for a field |
| PushCompareDomain | Rows where scalar comparison fields exist |
| PushInListDomain | Rows where IN list fields exist |
| PushIsNullDomain | Rows where NULL test fields exist |
| Union | Combine row sets (OR semantics) |
| Intersect | Intersect row sets (AND semantics) |

Sources: llkv-table/src/planner/program.rs:22-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:544-631

Predicate Fusion

Predicate fusion is an optimization that recognizes multiple predicates on the same field within an AND expression and evaluates them together. This reduces overhead and enables more efficient filtering.

Fusion conditions (from PredicateFusionCache):

| Data Type | Fusion Threshold |
|---|---|
| String types | ≥2 total predicates AND ≥1 Contains operator |
| Other types | ≥2 total predicates on same field |

Example transformation:

age >= 18 AND age < 65 AND age != 30
→ FusedAnd(field_id: age, filters: [>=18, <65, !=30])

The fusion cache tracks predicate patterns during compilation:

  • Counts total predicates per field
  • Tracks specific operator types (e.g., Contains for strings)
  • Decides whether fusion is beneficial via should_fuse

Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/program.rs:518-542


Row-Level Evaluation

Stack-Based Evaluation Engine

Filter evaluation uses a stack-based virtual machine that processes EvalProgram bytecode. Each operation manipulates a boolean result stack.

graph LR
    subgraph "Evaluation Loop"
        Ops["EvalOp Instructions"]
Stack["Result Stack\nVec&lt;bool&gt;"]
Ops -->|Process| Stack
 
       Stack -->|Final Value| Result["Filter Decision"]
end
    
    subgraph "Example: age >= 18 AND status = 'active'"
 
       Op1["PushPredicate(age >= 18)"] -->|Stack: [true]| Op2["PushPredicate(status = 'active')"]
Op2 -->|Stack: [true, false]| Op3["And(2)"]
Op3 -->|Stack: [false]| Final["Result: false"]
end

The evaluation process iterates through EvalOp instructions, pushing boolean results and combining them according to logical operators. Each predicate evaluation consults the underlying storage to check field values against filter conditions.

Sources: llkv-table/src/planner/mod.rs:1009-1031

graph TB
    Predicate["Filter&lt;FieldId&gt;\nfield_id + operator"]
Predicate --> TypeCheck{"Data Type?"}
TypeCheck -->|Fixed-width| FixedPath["build_fixed_width_predicate\nInt, Float, Date, etc."]
TypeCheck -->|Variable-width| VarPath["build_var_width_predicate\nString types"]
TypeCheck -->|Boolean| BoolPath["build_bool_predicate\nBool type"]
FixedPath --> Native["Native comparison\nusing PredicateValue"]
VarPath --> Pattern["Pattern matching\nStartsWith, Contains, etc."]
BoolPath --> Boolean["Boolean logic"]
Native --> Result["bool"]
Pattern --> Result
 
   Boolean --> Result

Predicate Evaluation

Individual predicates are evaluated by comparing field values against filter operators. The evaluation strategy depends on the data type:

Type-specific evaluation paths:

Operator semantics:

| Operator | Description | NULL Handling |
|---|---|---|
| Equals | Exact match | NULL = NULL → NULL |
| Range | Bounded interval (inclusive/exclusive) | NULL in range → NULL |
| In | Set membership | NULL in [values] → NULL |
| StartsWith | String prefix match (case-sensitive/insensitive) | NULL starts with X → NULL |
| EndsWith | String suffix match | NULL ends with X → NULL |
| Contains | String substring match | NULL contains X → NULL |
| IsNull | NULL test | Returns true/false |
| IsNotNull | NOT NULL test | Returns true/false |

Sources: llkv-expr/src/expr.rs:295-358 llkv-expr/src/typed_predicate.rs:1-500 (referenced but not shown)

graph TB
    ScalarExpr["ScalarExpr&lt;FieldId&gt;"]
ScalarExpr --> Mode{"Evaluation Mode"}
Mode -->|Single Row| RowLevel["NumericKernels::evaluate_value\nRecursive evaluation\nReturns Option&lt;NumericValue&gt;"]
Mode -->|Batch| BatchLevel["NumericKernels::evaluate_batch\nVectorized when possible"]
BatchLevel --> Vectorized{"Vectorizable?"}
Vectorized -->|Yes| Vec["try_evaluate_vectorized\nArrow compute kernels"]
Vectorized -->|No| Fallback["Per-row evaluation\nLoop + evaluate_value"]
Vec --> Array["ArrayRef result"]
Fallback --> Array

Scalar Expression Evaluation

For computed columns and complex predicates (e.g., WHERE salary * 1.1 > 50000), scalar expressions are evaluated using the NumericKernels utility.

Evaluation modes:

Numeric type handling:

The NumericArray wrapper provides unified access to different numeric types:

  • Integer : Int64Array for integers, booleans, dates
  • Float : Float64Array for floating-point numbers
  • Decimal : Decimal128Array for precise decimal values

Type coercion occurs automatically during expression evaluation:

  • Mixed integer/float operations promote to float
  • String-to-numeric conversions follow SQLite semantics (invalid → 0)
  • NULL propagates through operations

Key evaluation functions:

| Function | Purpose | Performance |
|---|---|---|
| evaluate_value | Single-row evaluation | Used for non-vectorizable expressions |
| evaluate_batch | Batch evaluation | Attempts vectorization first |
| try_evaluate_vectorized | Vectorized computation | Uses Arrow compute kernels |
| prepare_numeric_arrays | Type coercion | Converts columns to numeric representation |

Sources: llkv-table/src/scalar_eval.rs:453-713 llkv-table/src/scalar_eval.rs:92-383


graph TB
    subgraph "Filter Stages"
        Scan["Table Scan"]
Scan --> Domain["1. Domain Program\nDetermines candidate\nrow IDs"]
Domain --> UserPred["2. User Predicates\nSemantic filtering\nvia EvalProgram"]
UserPred --> MVCCFilter["3. MVCC Filtering\nrow_id_filter.filter()\nVisibility rules"]
MVCCFilter --> Results["Visible Results"]
end
    
    subgraph "MVCC Visibility"
        RowID["Row ID"]
CreatedBy["created_by TxnId"]
DeletedBy["deleted_by Option&lt;TxnId&gt;"]
Snapshot["Transaction Snapshot"]
RowID --> Check{"Visibility Check"}
CreatedBy --> Check
 
       DeletedBy --> Check
 
       Snapshot --> Check
        
 
       Check -->|Created before snapshot Not deleted or deleted after snapshot| Visible["Include"]
Check -->|Otherwise| Invisible["Exclude"]
end

MVCC Integration

Filter evaluation integrates with MVCC visibility filtering to ensure queries only see rows visible to their transaction. This is a two-stage filtering process:

MVCC filtering implementation:

The row_id_filter option in ScanStreamOptions provides transaction-aware filtering:

  • Created by runtime's transaction manager
  • Encapsulates snapshot visibility rules
  • Applied after user predicate evaluation
  • Filters row IDs based on created_by and deleted_by transaction IDs

Filtering order rationale:

  1. Domain programs - Quickly eliminate rows where referenced fields don't exist
  2. User predicates - Evaluate semantic conditions (WHERE clause)
  3. MVCC filter - Apply transaction visibility rules

This ordering minimizes MVCC overhead by only checking visibility for rows that pass semantic filters.
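The sketch below illustrates the visibility rule that the final MVCC stage applies to each candidate row. It is a minimal, self-contained approximation: the Snapshot type, its fields, and the method names are assumptions for illustration and do not reflect the actual llkv-transaction API.

```rust
// Minimal MVCC visibility sketch. A row version is visible when its creator
// is visible to the snapshot and it was not deleted by a transaction the
// snapshot can see. All names here are illustrative assumptions.
type TxnId = u64;

struct Snapshot {
    txn_id: TxnId,
    active: Vec<TxnId>, // transactions still in flight when the snapshot was taken
}

impl Snapshot {
    fn sees(&self, txn: TxnId) -> bool {
        // A write is visible if it committed before this snapshot was taken.
        txn <= self.txn_id && !self.active.contains(&txn)
    }
}

/// Returns true if a row version is visible to the snapshot.
fn row_visible(created_by: TxnId, deleted_by: Option<TxnId>, snap: &Snapshot) -> bool {
    if !snap.sees(created_by) {
        return false; // created by a transaction we cannot see yet
    }
    match deleted_by {
        None => true,             // never deleted
        Some(d) => !snap.sees(d), // deleted, but only after our snapshot
    }
}

fn main() {
    let snap = Snapshot { txn_id: 10, active: vec![9] };
    assert!(row_visible(5, None, &snap));     // old, live row
    assert!(!row_visible(9, None, &snap));    // writer still in flight
    assert!(row_visible(5, Some(12), &snap)); // deleted after our snapshot
    assert!(!row_visible(5, Some(7), &snap)); // deleted before our snapshot
}
```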

Sources: llkv-table/src/planner/mod.rs:940-982 llkv-table/src/table.rs:200-300 (type definitions referenced but not shown)


Evaluation Optimizations

Single-Column Direct Scan Fast Path

A specialized fast path for queries that project a single column with simple filtering. This bypasses the general evaluation machinery for better performance.

Conditions for fast path:

  • Single column projection
  • Filter references only that column
  • Simple operator (no complex scalar expressions)

When activated, the scan directly streams the target column's values without materializing intermediate structures.

Sources: llkv-table/src/planner/mod.rs:1020-1031 (method name: try_single_column_direct_scan)

graph LR
    Shadow["Shadow Column\nrow_id metadata"]
Shadow -->|Exists?| FastEnum["Fast Path\nstream_table_row_ids"]
Shadow -->|Missing| Fallback["Fallback Path\nMulti-column scan\n+ deduplication"]
FastEnum --> Chunks["Row ID Chunks\nSTREAM_BATCH_ROWS"]
Fallback --> Chunks
    
 
   Chunks --> MVCCCheck["Apply MVCC Filter"]
MVCCCheck --> Gather["Gather Columns"]
Gather --> Batch["RecordBatch"]

Full Table Scan Streaming

When no predicates require evaluation (e.g., WHERE true or full scan), the executor uses streaming row ID enumeration:

The fast path attempts to use a shadow column (row_id) that stores all row IDs for a table:

  • Success case : Shadow column exists → stream chunks directly
  • Fallback case : Shadow column missing → scan user columns and deduplicate

Sources: llkv-table/src/planner/mod.rs:739-857 llkv-table/src/planner/mod.rs:859-902 llkv-table/src/planner/mod.rs:904-999

graph TB
    Expression["WHERE clause\nexpression tree"]
Expression --> Traverse["Traverse expression\nrecord_expr"]
Traverse --> Track["Track per-field stats:\n- Total predicate count\n- Contains operator count"]
Track --> Decide["should_fuse decision"]
Decide -->|String + Contains ≥1| Fuse1["Enable fusion"]
Decide -->|Any type + predicates ≥2| Fuse2["Enable fusion"]
Decide -->|Otherwise| NoFuse["No fusion"]

Predicate Fusion Cache

The PredicateFusionCache tracks predicate patterns during compilation to enable fusion optimization:

Fusion benefits:

  • Reduces function call overhead
  • Enables specialized evaluation routines
  • Improves cache locality by processing same field

Fusion conditions table:

| Field Data Type | Conditions for Fusion |
|---|---|
| Utf8 / LargeUtf8 | Total predicates ≥ 2 AND Contains operations ≥ 1 |
| Other types | Total predicates ≥ 2 on same field |
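A minimal sketch of the fusion decision implied by the table above. The type and method names (PredicateFusionCache, should_fuse) follow the prose, but the fields and bookkeeping are simplified assumptions rather than the real llkv-table implementation.

```rust
// Illustrative fusion decision: per-field predicate statistics are recorded
// during compilation, then should_fuse applies the thresholds from the table.
use std::collections::HashMap;

#[derive(Default)]
struct FieldStats {
    total_predicates: usize,
    contains_ops: usize,
    is_string: bool,
}

#[derive(Default)]
struct PredicateFusionCache {
    per_field: HashMap<u32, FieldStats>, // keyed by field id
}

impl PredicateFusionCache {
    fn record(&mut self, field_id: u32, is_string: bool, is_contains: bool) {
        let stats = self.per_field.entry(field_id).or_default();
        stats.total_predicates += 1;
        stats.is_string = is_string;
        if is_contains {
            stats.contains_ops += 1;
        }
    }

    fn should_fuse(&self, field_id: u32) -> bool {
        match self.per_field.get(&field_id) {
            None => false,
            Some(s) if s.is_string => s.total_predicates >= 2 && s.contains_ops >= 1,
            Some(s) => s.total_predicates >= 2,
        }
    }
}

fn main() {
    let mut cache = PredicateFusionCache::default();
    // e.g. WHERE name LIKE '%foo%' AND name LIKE '%bar%'
    cache.record(1, true, true);
    cache.record(1, true, true);
    assert!(cache.should_fuse(1));
}
```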

Sources: llkv-table/src/planner/mod.rs:517-570

graph TB
    Expr["ScalarExpr batch"]
Expr --> Check{"Vectorizable?"}
Check -->|Yes| Patterns["Supported patterns:\n- Column references\n- Literal constants\n- Binary ops\n- Scalar×Array ops\n- Array×Array ops"]
Patterns --> ArrowCompute["Arrow compute kernels\nSIMD-optimized"]
Check -->|No| PerRow["Per-row evaluation\nevaluate_value loop"]
ArrowCompute --> Result["ArrayRef"]
PerRow --> Result

Vectorized Expression Evaluation

For numeric operations, the system attempts vectorized evaluation to process entire batches at once:

Vectorization strategy:

Vectorizable expression patterns:

  • Pure column references
  • Literal constants
  • Binary operations: Array ⊕ Array, Array ⊕ Scalar, Scalar ⊕ Array
  • Simple casts between numeric types

Non-vectorizable expressions:

  • CASE expressions with complex branches
  • Date/interval arithmetic
  • Aggregate functions
  • Subqueries

The vectorization attempt happens in try_evaluate_vectorized, which recursively checks if all sub-expressions can be vectorized. If any sub-expression is non-vectorizable, the entire expression falls back to row-by-row evaluation.
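The following sketch captures the vectorize-or-fall-back strategy with a toy expression type in place of the real ScalarExpr. All names and the tiny expression grammar are illustrative; the point is that a single non-vectorizable sub-expression forces the whole expression onto the per-row path.

```rust
// Toy expression type: only columns, literals, and addition are vectorizable.
enum Expr {
    Column(usize), // index into the input columns
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
    // Anything else (CASE, date arithmetic, ...) would be non-vectorizable.
}

/// Whole-column evaluation; returns None when the expression (or any
/// sub-expression) cannot be evaluated as a batch.
fn try_evaluate_vectorized(expr: &Expr, cols: &[Vec<f64>]) -> Option<Vec<f64>> {
    match expr {
        Expr::Column(i) => cols.get(*i).cloned(),
        Expr::Literal(v) => Some(vec![*v; cols.first().map_or(0, |c| c.len())]),
        Expr::Add(l, r) => {
            let l = try_evaluate_vectorized(l, cols)?;
            let r = try_evaluate_vectorized(r, cols)?;
            Some(l.iter().zip(&r).map(|(a, b)| a + b).collect())
        }
    }
}

/// Per-row fallback used when vectorization is not possible.
fn evaluate_value(expr: &Expr, cols: &[Vec<f64>], row: usize) -> f64 {
    match expr {
        Expr::Column(i) => cols[*i][row],
        Expr::Literal(v) => *v,
        Expr::Add(l, r) => evaluate_value(l, cols, row) + evaluate_value(r, cols, row),
    }
}

fn evaluate_batch(expr: &Expr, cols: &[Vec<f64>]) -> Vec<f64> {
    try_evaluate_vectorized(expr, cols).unwrap_or_else(|| {
        (0..cols.first().map_or(0, |c| c.len()))
            .map(|row| evaluate_value(expr, cols, row))
            .collect()
    })
}

fn main() {
    let salary = vec![40_000.0, 55_000.0];
    let expr = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(1_000.0)));
    assert_eq!(evaluate_batch(&expr, &[salary]), vec![41_000.0, 56_000.0]);
}
```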

Sources: llkv-table/src/scalar_eval.rs:714-763 llkv-table/src/scalar_eval.rs:676-713



Storage Layer

Relevant source files

Purpose and Scope

The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.

For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.

Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45

Architecture Overview

The storage layer implements a three-tier architecture where each tier has a distinct responsibility:

Key Components:

graph TB
    subgraph "Schema Layer"
        Table["Table\nllkv-table::Table"]
Schema["Schema Validation"]
MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
end
    
    subgraph "Column Management Layer"
        ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
Chunks["Column Chunks\nArrow RecordBatch segments"]
Descriptors["ColumnDescriptor\nChunk metadata"]
end
    
    subgraph "Physical Persistence Layer"
        Pager["Pager Trait\nllkv-storage::pager::Pager"]
MemPager["MemPager\nHashMap backend"]
SimdPager["SimdRDrivePager\nMemory-mapped file"]
end
    
 
   Table --> Schema
 
   Schema --> MVCC
 
   MVCC --> ColumnStore
    
 
   ColumnStore --> Catalog
 
   ColumnStore --> Chunks
 
   ColumnStore --> Descriptors
    
 
   Catalog --> Pager
 
   Chunks --> Pager
 
   Descriptors --> Pager
    
 
   Pager --> MemPager
 
   Pager --> SimdPager

| Layer | Crate | Primary Types | Responsibility |
|---|---|---|---|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |

Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28

Logical vs Physical Addressing

The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:

Namespace Segregation:

graph LR
    subgraph "Logical Space"
        LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
UserNS["Namespace::UserData"]
RowNS["Namespace::RowIdShadow"]
MVCCNS["Namespace::TxnMetadata"]
end
    
    subgraph "Catalog Mapping"
        CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
ValuePK["Value PhysicalKey"]
RowPK["RowId PhysicalKey"]
end
    
    subgraph "Physical Space"
        PhysKey["PhysicalKey (u64)"]
DescBlob["Descriptor Blob\nColumnDescriptor"]
ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
end
    
 
   LogicalField --> CatalogEntry
 
   UserNS --> LogicalField
 
   RowNS --> LogicalField
 
   MVCCNS --> LogicalField
    
 
   CatalogEntry --> ValuePK
 
   CatalogEntry --> RowPK
    
 
   ValuePK --> PhysKey
 
   RowPK --> PhysKey
    
 
   PhysKey --> DescBlob
 
   PhysKey --> ChunkBlob

LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:

  • Namespace::UserData: User-defined columns (e.g., name, age, email)
  • Namespace::RowIdShadow: Parallel row-id arrays used for gather operations
  • Namespace::TxnMetadata: MVCC columns (created_by, deleted_by)

The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.

Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37

Data Persistence Model

Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:

Chunking Strategy:

sequenceDiagram
    participant Caller
    participant Table as "Table"
    participant ColumnStore as "ColumnStore"
    participant Serialization as "serialization"
    participant Pager as "Pager"
    
    Caller->>Table: append(RecordBatch)
    
    Note over Table: Validate schema\nInject MVCC columns
    
    Table->>ColumnStore: append(RecordBatch with MVCC)
    
    Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins
    
    loop "For each column"
        ColumnStore->>Serialization: serialize_array(Array)
        Serialization-->>ColumnStore: Vec<u8> blob
        ColumnStore->>Pager: batch_put(PhysicalKey, blob)
    end
    
    Pager-->>ColumnStore: Success
    ColumnStore-->>Table: Success
    Table-->>Caller: Success

Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.

MVCC Column Layout:

Every table's physical storage includes the user-defined columns from the schema plus three system columns:

  • row_id (UInt64): monotonic row identifier
  • created_by (UInt64): transaction ID that created the row
  • deleted_by (UInt64): transaction ID that deleted the row (0 if live)

The MVCC columns (created_by, deleted_by) are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.

Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25

Serialization and Zero-Copy Design

The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:

Serialization Format

The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:

| Layout | Type Code | Use Case | Header Fields |
|---|---|---|---|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |

Each serialized blob begins with a 24-byte header:

Offset  Field             Type    Description
------  -----             ----    -----------
0-3     Magic             [u8;4]  "ARR0"
4       Layout            u8      Layout discriminant (0-3)
5       PrimType          u8      Type code (layout-specific)
6       Precision/Pad     u8      Decimal precision or padding
7       Scale/Pad         u8      Decimal scale or padding
8-15    Length            u64     Logical element count
16-19   Extra A           u32     Layout-specific (e.g., values_len)
20-23   Extra B           u32     Layout-specific (e.g., offsets_len)
24+     Payload           [u8]    Raw Arrow buffers
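A hedged sketch of decoding this 24-byte header in Rust, assuming little-endian field encoding (the byte order is not stated above). Field names mirror the table; this is an illustration of the layout, not the actual llkv-storage parser.

```rust
// Decode the 24-byte blob header described above (little-endian assumed).
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,
    extra_a: u32,
    extra_b: u32,
}

fn parse_header(blob: &[u8]) -> Result<BlobHeader, String> {
    if blob.len() < 24 {
        return Err("blob shorter than 24-byte header".into());
    }
    if &blob[0..4] != b"ARR0" {
        return Err("bad magic".into());
    }
    Ok(BlobHeader {
        layout: blob[4],
        prim_type: blob[5],
        precision: blob[6],
        scale: blob[7],
        len: u64::from_le_bytes(blob[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(blob[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(blob[20..24].try_into().unwrap()),
    })
}

fn main() {
    // Assemble a 24-byte header: magic, layout/type/precision/scale, len, extras.
    let mut blob = Vec::new();
    blob.extend_from_slice(b"ARR0");
    blob.extend_from_slice(&[0u8, 1, 0, 0]);
    blob.extend_from_slice(&3u64.to_le_bytes());
    blob.extend_from_slice(&24u32.to_le_bytes());
    blob.extend_from_slice(&0u32.to_le_bytes());
    assert_eq!(parse_header(&blob).unwrap().len, 3);
}
```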

Why Custom Format Instead of Arrow IPC:

The custom format achieves three goals:

  1. Minimal overhead : No schema framing or padding, just raw buffers
  2. Contiguous payloads : Each array's bytes are adjacent, ideal for SIMD and sequential scans
  3. True zero-copy : deserialize_array constructs ArrayData directly from EntryHandle buffers without memcpy

Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298

EntryHandle Abstraction

The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:

  • as_ref() -> &[u8]: Zero-copy slice view
  • as_arrow_buffer() -> Buffer: Wrap as Arrow buffer without copying
graph LR
    File["Persistent File\nsimd_r_drive::DataStore"]
Mmap["Memory-Mapped Region"]
EntryHandle["EntryHandle"]
Buffer["Arrow Buffer"]
ArrayData["ArrayData"]
ArrayRef["ArrayRef"]
File --> Mmap
 
   Mmap --> EntryHandle
 
   EntryHandle --> Buffer
 
   Buffer --> ArrayData
 
   ArrayData --> ArrayRef
    
    style File fill:#f9f9f9
    style ArrayRef fill:#f9f9f9

When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying:
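The following pure-Rust sketch illustrates the handle-and-slice idea, using Arc<[u8]> as a stand-in for the memory-mapped region; the real EntryHandle comes from the simd-r-drive-entry-handle crate and additionally exposes Arrow buffer wrappers.

```rust
// A handle exposes a slice of a shared, already-loaded region without
// copying it. Cloning the handle shares the region; it never copies bytes.
use std::ops::Range;
use std::sync::Arc;

#[derive(Clone)]
struct EntryHandleSketch {
    region: Arc<[u8]>,   // stand-in for the memory-mapped file
    range: Range<usize>, // the entry's bytes within the mapping
}

impl EntryHandleSketch {
    fn as_ref(&self) -> &[u8] {
        &self.region[self.range.clone()]
    }
}

fn main() {
    // Pretend this is the mapped file: two entries back to back.
    let region: Arc<[u8]> = Arc::from(&b"hello world"[..]);
    let hello = EntryHandleSketch { region: region.clone(), range: 0..5 };
    let world = EntryHandleSketch { region, range: 6..11 };
    assert_eq!(hello.as_ref(), b"hello");
    assert_eq!(world.as_ref(), b"world");
}
```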

Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29

Pager Implementations

The Pager trait defines the interface for batch get/put operations over physical keys:
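A hedged sketch of what such a batch-oriented interface can look like; the actual llkv-storage::pager::Pager trait differs in naming and detail (for example, it is generic over a blob type such as EntryHandle, and its full method set is covered in Pager Interface and SIMD Optimization).

```rust
// Illustrative batch-oriented pager interface; names are assumptions.
type PhysicalKey = u64;

trait PagerSketch {
    type Blob: AsRef<[u8]>;
    type Error;

    /// Fetch many entries in one round trip; missing keys surface as None.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Result<Vec<Option<Self::Blob>>, Self::Error>;

    /// Write many entries atomically: either every key is updated or none is.
    fn batch_put(&self, entries: &[(PhysicalKey, Vec<u8>)]) -> Result<(), Self::Error>;

    /// Allocate a fresh, unused physical key.
    fn alloc_key(&self) -> Result<PhysicalKey, Self::Error>;
}
```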

MemPager

MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:

  • Unit tests and benchmarks
  • Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
  • Temporary namespaces that don't require persistence

SimdRDrivePager

SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:

| Feature | Implementation |
|---|---|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |

The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.

Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28

Column Storage Operations

The ColumnStore provides three primary operation patterns:

Append Workflow

Last-Writer-Wins Semantics:

When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:

  1. Identify chunks containing overlapping row_id ranges
  2. Load those chunks and merge with new data
  3. Re-serialize merged chunks
  4. Atomically update descriptors and chunk blobs

This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
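A minimal sketch of the last-writer-wins merge over a single column, using (row_id, value) pairs in place of Arrow arrays. The function name and representation are assumptions; only the rule itself (a duplicate row_id takes the new value) follows the description above.

```rust
// Merge an existing chunk with newly appended rows, keyed by row_id; on a
// duplicate row_id the incoming value wins. Output stays sorted by row_id.
use std::collections::BTreeMap;

fn merge_lww(existing: &[(u64, i64)], incoming: &[(u64, i64)]) -> Vec<(u64, i64)> {
    let mut merged: BTreeMap<u64, i64> = existing.iter().copied().collect();
    for &(row_id, value) in incoming {
        merged.insert(row_id, value); // overwrite wins
    }
    merged.into_iter().collect() // sorted by row_id, ready to re-serialize
}

fn main() {
    let existing: [(u64, i64); 3] = [(1, 10), (2, 20), (3, 30)];
    let incoming: [(u64, i64); 2] = [(2, 99), (4, 40)]; // UPDATE row 2, INSERT row 4
    assert_eq!(merge_lww(&existing, &incoming), vec![(1, 10), (2, 99), (3, 30), (4, 40)]);
}
```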

Sources: llkv-column-map/README.md:24-29

Gather Operations

Gather operations retrieve specific rows by row_id from columnar storage:

Null-Handling Policies:

Gather operations support three policies via GatherNullPolicy:

| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |

The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.

Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446

Streaming Scans

The ColumnStream type provides paginated, filtered scans over columnar data:

Scans operate in chunks to avoid materializing entire tables:

  1. Load next chunk of row IDs and MVCC metadata
  2. Apply MVCC visibility filter (transaction snapshot check)
  3. Evaluate user predicates on loaded columns
  4. Gather matching rows into a RecordBatch
  5. Yield batch to caller

This streaming model enables large result sets to be processed incrementally without exhausting memory.
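The iterator below sketches the batch-at-a-time model in plain Rust: each call materializes at most one bounded chunk, so memory use stays proportional to the batch size rather than the table size. It assumes nothing about the real ColumnStream API.

```rust
// Stand-in for a streaming scan: yields bounded chunks of row ids, which the
// real pipeline would then MVCC-filter, predicate-filter, and gather.
struct RowIdStream {
    next_row: u64,
    total_rows: u64,
    batch_size: u64,
}

impl Iterator for RowIdStream {
    type Item = Vec<u64>; // stand-in for one RecordBatch of gathered rows

    fn next(&mut self) -> Option<Self::Item> {
        if self.next_row >= self.total_rows {
            return None;
        }
        let end = (self.next_row + self.batch_size).min(self.total_rows);
        let chunk = (self.next_row..end).collect();
        self.next_row = end;
        Some(chunk)
    }
}

fn main() {
    let stream = RowIdStream { next_row: 0, total_rows: 10, batch_size: 4 };
    let sizes: Vec<usize> = stream.map(|chunk| chunk.len()).collect();
    assert_eq!(sizes, vec![4, 4, 2]); // results arrive in bounded batches
}
```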

Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25

Integration Points

The storage layer is consumed by multiple higher-level components:

Key Integration Patterns:

| Consumer | Usage Pattern |
|---|---|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |

System Catalog Self-Hosting:

The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:

  1. Runtime opens the pager
  2. ColumnStore is initialized with the pager
  3. SysCatalog is constructed, reading metadata from table 0
  4. User tables are opened using metadata from SysCatalog

This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.

Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28



Table Abstraction

Relevant source files

Purpose and Scope

The Table abstraction provides a schema-aware interface for data operations in the LLKV storage layer. It sits between query execution components and the columnar storage engine, managing schema validation, MVCC metadata injection, and translating logical operations into physical column store interactions. This document details the Table struct and its APIs for appending data, scanning rows, and coordinating with the system catalog.

For information about the underlying columnar storage implementation, see Column Storage and ColumnStore. For details on the storage pager abstraction, see Pager Interface and SIMD Optimization. For catalog management APIs, see CatalogManager API and System Catalog and SysCatalog.

Overview

The llkv-table crate provides the primary interface between SQL execution and physical storage. Each Table instance represents a logical table with a defined schema and wraps a reference to a ColumnStore that handles the actual persistence. Tables are responsible for enforcing schema constraints, injecting MVCC metadata columns, and exposing streaming scan APIs that integrate with the query executor.

Sources: llkv-table/README.md:1-57 llkv-table/Cargo.toml:1-60

graph TB
    subgraph "Query Execution Layer"
        RUNTIME["Runtime\nStatement Executor"]
EXECUTOR["Executor\nQuery Evaluation"]
end
    
    subgraph "Table Layer (llkv-table)"
        TABLE["Table\nSchema-aware API"]
SYSCAT["SysCatalog\nTable 0\nMetadata Store"]
STREAM["ColumnStream\nStreaming Scans"]
end
    
    subgraph "Column Store Layer (llkv-column-map)"
        COLSTORE["ColumnStore\nColumnar Storage"]
PROJECTION["Projection\nGather Logic"]
end
    
    subgraph "Storage Layer (llkv-storage)"
        PAGER["Pager Trait\nBatch Get/Put"]
end
    
 
   RUNTIME -->|CREATE TABLE INSERT UPDATE| TABLE
 
   EXECUTOR -->|SELECT scan_stream| TABLE
    
 
   TABLE -->|validate schema| TABLE
 
   TABLE -->|inject MVCC cols| TABLE
 
   TABLE -->|append RecordBatch| COLSTORE
 
   TABLE -->|gather_rows| COLSTORE
    
 
   SYSCAT -->|stores TableMeta ColMeta| COLSTORE
    
 
   TABLE -->|scan_stream returns| STREAM
 
   STREAM -->|fetch batches| COLSTORE
    
 
   COLSTORE -->|uses| PROJECTION
 
   PROJECTION -->|batch_get/put| PAGER

Table Structure and Core Responsibilities

A Table instance encapsulates a schema-validated view over a ColumnStore. The table layer is responsible for:

| Responsibility | Description |
|---|---|
| Schema Validation | Ensures all RecordBatch operations match the declared Arrow schema |
| MVCC Injection | Adds system columns (row_id, created_by, deleted_by) to all data |
| Catalog Coordination | Persists and retrieves table/column metadata via SysCatalog (table 0) |
| Data Routing | Translates logical field requests to LogicalFieldId for ColumnStore |
| Streaming Scans | Provides ColumnStream API for paginated, predicate-pushdown reads |

The table wraps an Arc<ColumnStore> from llkv-column-map, enabling multiple table instances to share the same underlying storage. This design supports efficient metadata queries and concurrent access patterns.

Sources: llkv-table/README.md:12-40 llkv-column-map/README.md:1-61

MVCC Column Management

Every table in LLKV maintains three system columns alongside user-defined fields:

graph LR
    subgraph "User Schema"
        UC1["name: Utf8"]
UC2["age: Int32"]
UC3["email: Utf8"]
end
    
    subgraph "System Columns (MVCC)"
        ROW_ID["row_id: UInt64\nMonotonic identifier"]
CREATED["created_by: UInt64\nTransaction ID"]
DELETED["deleted_by: UInt64\nDeletion TXN or MAX"]
end
    
 
   UC1 -.->|stored in namespace USER| COLSTORE["ColumnStore"]
UC2 -.-> COLSTORE
 
   UC3 -.-> COLSTORE
    
 
   ROW_ID -.->|namespace TXN_METADATA| COLSTORE
 
   CREATED -.-> COLSTORE
 
   DELETED -.-> COLSTORE
    
    COLSTORE["ColumnStore\nLogicalFieldId\nNamespacing"]

MVCC Column Semantics

  • row_id : A monotonically increasing UInt64 that uniquely identifies each row within a table. Assigned during append operations and used for row-level operations and correlation.

  • created_by : The transaction ID (UInt64) that created this row version. Set during INSERT or UPDATE operations.

  • deleted_by : The transaction ID that marked this row as deleted, or u64::MAX if the row is still live. UPDATE operations logically delete old versions and insert new ones.

These columns are stored in separate logical namespaces within the ColumnStore to avoid collisions with user-defined columns. The table layer automatically injects these columns during append operations and uses them for visibility filtering during scans.

Sources: llkv-table/README.md:15-16 llkv-column-map/README.md:20-28

Data Operations

Append Operations

The Table::append method accepts an Arrow RecordBatch and performs the following steps:

graph TB
    START["Table::append(RecordBatch)"]
VALIDATE["Validate Schema\nCheck column names/types"]
INJECT["Inject MVCC Columns\nrow_id, created_by, deleted_by"]
NAMESPACE["Map to LogicalFieldId\nApply namespace prefixes"]
PERSIST["ColumnStore::append\nSort by row_id\nLast-writer-wins"]
COMMIT["Pager::batch_put\nAtomic commit"]
START --> VALIDATE
 
   VALIDATE -->|schema mismatch| ERROR["Return Error"]
VALIDATE -->|valid| INJECT
 
   INJECT --> NAMESPACE
 
   NAMESPACE --> PERSIST
 
   PERSIST --> COMMIT
 
   COMMIT --> SUCCESS["Return Ok"]

The append pipeline ensures:

  1. Schema consistency : All incoming batches must match the table's declared schema
  2. MVCC tagging : System columns are added with appropriate transaction IDs
  3. Ordering : Rows are sorted by row_id before persistence for efficient scans
  4. Atomicity : Multi-column writes are committed atomically via batch pager operations
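A usage sketch of the caller's side of this pipeline: building an Arrow RecordBatch with standard arrow-rs APIs and handing it to the append path. The table.append call is shown only as a commented placeholder, since the exact llkv-table signature is not reproduced here.

```rust
// Build a two-column RecordBatch matching a hypothetical (name, age) schema.
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> arrow::error::Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int32, false),
    ]));
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["ada", "grace"])),
            Arc::new(Int32Array::from(vec![36, 45])),
        ],
    )
}

fn main() -> arrow::error::Result<()> {
    let batch = build_batch()?;
    // table.append(&batch)?; // schema validation + MVCC injection happen here
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```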

Sources: llkv-table/README.md:20-30 llkv-column-map/README.md:24-28

Scan Operations

Tables expose streaming scan APIs through the scan_stream method, which returns a ColumnStream for paginated result retrieval:

graph TB
    SCAN["Table::scan_stream\n(projections, filter)"]
NORMALIZE["Normalize Predicate\nApply De Morgan's laws"]
COMPILE["Compile to EvalProgram\nStack-based bytecode"]
STREAM["Create ColumnStream\nLazy evaluation"]
FETCH["ColumnStream::next_batch\nFetch N rows"]
FILTER["Apply Predicate\nVectorized evaluation"]
MVCC["MVCC Filtering\nSnapshot visibility"]
PROJECT["Gather Projected Cols\ngather_rows()"]
BATCH["Return RecordBatch"]
SCAN --> NORMALIZE
 
   NORMALIZE --> COMPILE
 
   COMPILE --> STREAM
    
 
   STREAM -.->|caller iterates| FETCH
 
   FETCH --> FILTER
 
   FILTER --> MVCC
 
   MVCC --> PROJECT
 
   PROJECT --> BATCH
 
   BATCH -.->|next iteration| FETCH

The scan path supports:

  • Predicate pushdown : Filters are compiled to bytecode and evaluated at the column store level
  • Projection : Only requested columns are materialized
  • MVCC filtering : Rows are filtered based on transaction snapshot visibility rules
  • Streaming : Results are produced in fixed-size batches to avoid large memory allocations

Sources: llkv-table/README.md:23-24 llkv-column-map/README.md:30-34

Schema Validation

Schema validation occurs at table creation and during every append operation. The table layer enforces:

| Validation Check | Requirement |
|---|---|
| Column names | Must match declared schema exactly (case-sensitive) |
| Data types | Must match Arrow DataType, including nested types |
| Nullability | Enforced for non-nullable columns |
| Field count | Batch must contain exactly the declared columns |

Schema definitions are persisted in the system catalog (table 0) as TableMeta and ColMeta entries. The catalog stores:

  • Table ID and name
  • Column names, types, and nullability flags
  • Constraint metadata (e.g., PRIMARY KEY, NOT NULL)

Sources: llkv-table/README.md:14-15 llkv-table/README.md:27-29

graph TB
    subgraph "System Catalog (Table 0)"
        TABLEMETA["TableMeta Records\ntable_id, name, schema"]
COLMETA["ColMeta Records\ntable_id, col_name, type"]
end
    
    subgraph "User Tables (1..N)"
        TBL1["Table 1\nusers"]
TBL2["Table 2\norders"]
TBL3["Table N\nproducts"]
end
    
 
   TABLEMETA -->|describes| TBL1
 
   TABLEMETA -->|describes| TBL2
 
   TABLEMETA -->|describes| TBL3
    
 
   COLMETA -->|defines columns| TBL1
 
   COLMETA -->|defines columns| TBL2
 
   COLMETA -->|defines columns| TBL3
    
    SYSCAT["SysCatalog API\ncreate_table()\nget_table_meta()\nlist_tables()"]
SYSCAT -->|reads/writes| TABLEMETA
 
   SYSCAT -->|reads/writes| COLMETA

System Catalog Integration

The SysCatalog is a special table (table ID 0) that stores metadata for all other tables:

The system catalog itself uses the same storage infrastructure as user tables:

  • Stored as Arrow RecordBatches in the ColumnStore
  • Subject to MVCC versioning
  • Persisted through the pager for crash consistency

This self-hosting design ensures metadata operations follow the same transactional semantics as data operations.

Sources: llkv-table/README.md:27-29 llkv-column-map/README.md:10-16

Projection and Gathering

The table layer delegates projection and row gathering to the ColumnStore, which provides specialized APIs for materializing requested columns:

Projection Structure

A Projection describes a single column to retrieve, optionally renaming it in the output schema. Projections are resolved to LogicalFieldId by consulting the catalog, then passed to the ColumnStore for gathering.

Sources: llkv-column-map/store/projection.rs:49-73

Null Handling Policies

The projection system supports three null-handling strategies via GatherNullPolicy:

| Policy | Behavior |
|---|---|
| ErrorOnMissing | Any missing row_id causes an error |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Rows with all-null projected columns are omitted |

These policies enable different executor semantics: INNER JOIN uses ErrorOnMissing, LEFT JOIN uses IncludeNulls, and aggregation pipelines use DropNulls to skip tombstones.

Sources: llkv-column-map/store/projection.rs:39-47

graph TB
    PREPARE["prepare_gather_context\n(field_ids)"]
CATALOG["Load ColumnDescriptors\nfrom catalog"]
METAS["Collect ChunkMetadata\nvalue + row chunks"]
CTX["MultiGatherContext\nplans, cache, scratch"]
GATHER1["gather_rows_with_reusable_context\n(row_ids_1)"]
GATHER2["gather_rows_with_reusable_context\n(row_ids_2)"]
GATHERN["gather_rows_with_reusable_context\n(row_ids_N)"]
PREPARE --> CATALOG
 
   CATALOG --> METAS
 
   METAS --> CTX
    
 
   CTX -.->|reuses chunk cache| GATHER1
 
   CTX -.->|reuses chunk cache| GATHER2
 
   CTX -.->|reuses chunk cache| GATHERN
    
 
   GATHER1 --> BATCH1["RecordBatch 1"]
GATHER2 --> BATCH2["RecordBatch 2"]
GATHERN --> BATCHN["RecordBatch N"]

Multi-Column Gather Context

For queries that scan the same row set multiple times (e.g., joins, aggregations), the table layer provides MultiGatherContext to amortize fetch costs:

The context caches:

  • Chunk arrays : Deserialized Arrow arrays for reuse across calls
  • Row indices : Hash maps for sparse row lookups
  • Scratch buffers : Pre-allocated vectors for gather operations

This optimization is critical for nested loop joins and multi-pass aggregations where the same columns are accessed repeatedly.

Sources: llkv-column-map/store/projection.rs:94-227 llkv-column-map/store/projection.rs:448-510 llkv-column-map/store/projection.rs:516-758

graph TB
    subgraph "Table Layer (llkv-table)"
        TABLE["Table\nSchema + Arc&lt;ColumnStore&gt;"]
FIELDMAP["Field Name → LogicalFieldId\nNamespace mapping"]
end
    
    subgraph "ColumnStore Layer (llkv-column-map)"
        COLSTORE["ColumnStore\nLogicalFieldId → PhysicalKey"]
DESCRIPTOR["ColumnDescriptor\nChunk metadata lists"]
CHUNKS["Column Chunks\nSerialized Arrow arrays"]
end
    
    subgraph "Storage Layer (llkv-storage)"
        PAGER["Pager\nbatch_get/batch_put"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
    
 
   TABLE -->|append batch| COLSTORE
 
   TABLE -->|scan_stream| COLSTORE
 
   TABLE -->|gather_rows field_ids| COLSTORE
    
 
   FIELDMAP -.->|resolves to| COLSTORE
    
 
   COLSTORE -->|maps to| DESCRIPTOR
 
   DESCRIPTOR -->|points to| CHUNKS
    
 
   COLSTORE -->|batch_get| PAGER
 
   COLSTORE -->|batch_put| PAGER
    
 
   PAGER -.->|impl| MEMPAGER
 
   PAGER -.->|impl| SIMDPAGER

Integration with ColumnStore

The table layer wraps a ColumnStore and translates high-level operations into low-level storage calls:

Logical Field Namespacing

Each logical field in a table is assigned a LogicalFieldId that encodes:

  • Namespace : USER, TXN_METADATA, or ROWID_SHADOW
  • Table ID : u32 identifier
  • Field ID : u32 column index

This namespacing prevents collisions between user columns and MVCC metadata while allowing them to share the same physical ColumnStore instance.

Sources: llkv-column-map/README.md:18-22 llkv-table/README.md:20-22

Zero-Copy Reads

The ColumnStore delegates to the Pager trait for physical storage access. When using SimdRDrivePager (persistent backend), reads are zero-copy: the pager returns EntryHandle wrappers that directly reference memory-mapped regions. This enables SIMD-accelerated scans without buffer allocation or copying.

Sources: llkv-storage/README.md:9-17 llkv-column-map/README.md:36-40

Usage in the Stack

The table abstraction is consumed by:

| Component | Usage |
|---|---|
| llkv-runtime | Executes all DML and DDL operations through Table APIs |
| llkv-executor | Relies on scan_stream for SELECT evaluation, joins, and aggregations |
| llkv-sql | Indirectly via llkv-runtime for SQL statement execution |
| llkv-csv | Uses Table::append for bulk CSV ingestion |

The streaming scan API (scan_stream) is particularly important for the executor, which processes query results in fixed-size batches to avoid buffering entire result sets in memory.

Sources: llkv-table/README.md:36-40 llkv-runtime/README.md:36-40 llkv-csv/README.md:14-20



Column Storage and ColumnStore

Relevant source files

Purpose and Scope

This document describes the column-oriented storage layer implemented by llkv-column-map, focusing on the ColumnStore struct that manages physical persistence of Arrow columnar data. The column store sits between the table abstraction and the pager interface, translating logical field requests into physical chunk operations.

For the higher-level table API that wraps ColumnStore, see Table Abstraction. For details on the underlying storage backends, see Pager Interface and SIMD Optimization.

Architecture Position

The ColumnStore acts as the bridge between schema-aware tables and raw key-value storage:

Sources: llkv-column-map/README.md:10-46 llkv-table/README.md:19-24

graph TB
    Table["llkv-table::Table\nSchema validation\nMVCC injection"]
ColumnStore["llkv-column-map::ColumnStore\nColumn chunking\nLogicalFieldId → PhysicalKey"]
Pager["llkv-storage::Pager\nbatch_get / batch_put\nMemPager / SimdRDrivePager"]
Table -->|append RecordBatch| ColumnStore
 
   Table -->|scan / gather| ColumnStore
 
   ColumnStore -->|BatchGet / BatchPut| Pager
    
 
   ColumnStore -->|serialized Arrow chunks| Pager
 
   Pager -->|EntryHandle zero-copy| ColumnStore

Logical Field Identification

LogicalFieldId Structure

Each column is identified by a LogicalFieldId that encodes three components:

| Component | Bits | Purpose |
|---|---|---|
| Namespace | High bits | Segregates user data, MVCC metadata, and row-id shadows |
| Table ID | Middle bits | Identifies which table the column belongs to |
| Field ID | Low bits | Distinguishes columns within a table |

This structure prevents collisions when multiple tables share the same physical pager while maintaining clear boundaries between user data and system metadata.
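An illustrative bit-packing of the three components into a u64, assuming an 8-bit namespace tag, a 24-bit table ID, and a 32-bit field ID; the real llkv-column-map layout may allocate the bits differently.

```rust
// Pack (namespace, table_id, field_id) into one u64 key and back.
#[repr(u8)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    TxnMetadata = 2,
}

fn pack(ns: Namespace, table_id: u32, field_id: u32) -> u64 {
    ((ns as u64) << 56) | ((table_id as u64 & 0x00FF_FFFF) << 32) | field_id as u64
}

fn unpack(key: u64) -> (u8, u32, u32) {
    ((key >> 56) as u8, ((key >> 32) & 0x00FF_FFFF) as u32, key as u32)
}

fn main() {
    let key = pack(Namespace::TxnMetadata, 5, 2);
    assert_eq!(unpack(key), (2, 5, 2));
    // Same table and field in another namespace never collide.
    assert_ne!(key, pack(Namespace::UserData, 5, 2));
}
```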

Sources: llkv-column-map/README.md:19-22

Namespace Segregation

Three primary namespaces exist:

  • User Data : Columns explicitly defined in CREATE TABLE statements
  • MVCC Metadata : System columns created_by and deleted_by for transaction visibility
  • Row ID Shadow : Parallel storage of row_id values for each column to enable efficient random access

Each namespace maps to distinct LogicalFieldId values, ensuring that MVCC bookkeeping and user data remain isolated in the catalog.

Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:16-17

Physical Storage Model

PhysicalKey Allocation and Mapping

The column store maintains a catalog that maps each LogicalFieldId to a PhysicalKey (u64 identifier) allocated by the pager:

graph LR
    LF1["LogicalFieldId\n(ns=User, table=5, field=2)"]
LF2["LogicalFieldId\n(ns=RowId, table=5, field=2)"]
LF3["LogicalFieldId\n(ns=Mvcc, table=5, field=created_by)"]
PK1["PhysicalKey: 1024\nDescriptor"]
PK2["PhysicalKey: 1025\nData Chunks"]
PK3["PhysicalKey: 2048\nDescriptor"]
PK4["PhysicalKey: 2049\nData Chunks"]
LF1 --> PK1
 
   LF1 --> PK2
 
   LF2 --> PK3
 
   LF2 --> PK4
    
 
   PK1 -->|ColumnDescriptor| Pager["Pager"]
PK2 -->|Serialized Arrays| Pager
 
   PK3 -->|ColumnDescriptor| Pager
 
   PK4 -->|Serialized Arrays| Pager

Each logical field requires at least two physical keys: one for the column descriptor (metadata about chunks) and one or more for the actual data chunks.

Sources: llkv-column-map/README.md:19-21 llkv-column-map/src/store/projection.rs:461-468

Column Descriptors and Chunks

Column data is split into fixed-size chunks, each serialized as an Arrow array. Metadata about chunks is stored in a ColumnDescriptor:

graph TB
    subgraph "Column Descriptor"
        Head["head_page_pk: PhysicalKey\nPoints to descriptor chain"]
Meta1["ChunkMetadata[0]\nchunk_pk, row_count\nmin_val_u64, max_val_u64"]
Meta2["ChunkMetadata[1]\n..."]
Meta3["ChunkMetadata[n]\n..."]
Head --> Meta1
 
       Head --> Meta2
 
       Head --> Meta3
    end
    
    subgraph "Physical Storage"
        Chunk1["PhysicalKey: chunk_pk[0]\nSerialized Arrow Array\nrow_ids: 0..999"]
Chunk2["PhysicalKey: chunk_pk[1]\nSerialized Arrow Array\nrow_ids: 1000..1999"]
end
    
 
   Meta1 -.->|references| Chunk1
 
   Meta2 -.->|references| Chunk2

The descriptor stores min/max row ID values for each chunk, enabling efficient skip-scan during queries by filtering out chunks that cannot contain requested row IDs.

Sources: llkv-column-map/src/store/projection.rs:229-236 llkv-column-map/src/store/projection.rs:760-772

Column Catalog Persistence

The catalog mapping LogicalFieldId to PhysicalKey is itself stored in the pager at a well-known root key. On initialization:

  1. ColumnStore::open() attempts to load the catalog from the pager root
  2. If not found, an empty catalog is initialized
  3. All catalog updates are committed atomically during append operations

This design ensures the catalog state remains consistent with persisted data, even after crashes.

Sources: llkv-column-map/README.md:38-40

Append Pipeline

RecordBatch Persistence Flow

Sources: llkv-column-map/README.md:24-28

Last-Writer-Wins Semantics

When appending data with row IDs that overlap existing chunks:

  1. The store identifies which chunks contain conflicting row IDs
  2. Existing chunks are deserialized and merged with new data
  3. For duplicate row IDs, the new value overwrites the old
  4. Rewritten chunks are serialized and committed atomically

This ensures that UPDATE operations (implemented as appends at the table layer) correctly overwrite previous values without requiring separate update logic.

Sources: llkv-column-map/README.md:26-27

Chunking Strategy

Columns are divided into chunks based on:

  • Chunk size threshold : Configurable limit on rows per chunk (typically several thousand)
  • Row ID ranges : Each chunk covers a contiguous range of row IDs
  • Physical key allocation : Each chunk gets a unique physical key from the pager

This chunking enables:

  • Parallel scan operations across chunks
  • Efficient skip-scan by filtering chunks based on row ID predicates
  • Incremental garbage collection of deleted chunks

Sources: llkv-column-map/src/store/projection.rs:760-772

Data Retrieval

Gather Operations

The column store provides two gather strategies for random-access row retrieval:

graph TB
    Input["Row IDs: [5, 123, 999]"]
Input --> Sort["Sort and deduplicate"]
Sort --> Filter["Identify intersecting chunks"]
Filter --> Fetch["batch_get(chunk keys)"]
Fetch --> Deserialize["Deserialize Arrow arrays"]
Deserialize --> Gather["Gather requested rows"]
Gather --> Output["RecordBatch"]

Single-Shot Gather

For one-off queries, gather_rows() performs a complete fetch without caching:

Sources: llkv-column-map/src/store/projection.rs:245-268

Reusable Context Gather

For repeated queries (e.g., join inner loop), gather_rows_with_reusable_context() amortizes costs:

  1. Prepare a MultiGatherContext containing column descriptors and scratch buffers
  2. Call gather repeatedly, reusing:
    • Decoded Arrow chunk arrays (cached in chunk_cache)
    • Row index hash maps (preallocated buffers)
    • Scratch space for row locators

This avoids redundant descriptor fetches and chunk decodes across multiple gather calls.

Sources: llkv-column-map/src/store/projection.rs:516-758

Gather Null Policies

Three policies control null handling:

| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row ID is not found |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Remove rows where all projected columns are null or missing |

The DropNulls policy is used by MVCC filtering to exclude logically deleted rows.

Sources: llkv-column-map/src/store/projection.rs:39-47

Projection Planning

The projection subsystem prepares multi-column gathers:

Each FieldPlan contains:

  • value_metas : Metadata for value chunks (actual column data)
  • row_metas : Metadata for row ID chunks (parallel row ID storage)
  • candidate_indices : Pre-filtered list of chunks that might contain requested rows

Sources: llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:229-236

Chunk Intersection Logic

Before fetching chunks, the store filters based on row ID range overlap:
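A hedged sketch of that filter: given sorted requested row IDs, a chunk is kept only if at least one requested ID falls inside its [min_row_id, max_row_id] range. Struct and field names follow the descriptions above but are illustrative.

```rust
// Skip-scan: keep only chunks whose row-id range contains a requested id.
struct ChunkMeta {
    min_row_id: u64,
    max_row_id: u64,
}

fn candidate_chunks(chunks: &[ChunkMeta], sorted_row_ids: &[u64]) -> Vec<usize> {
    chunks
        .iter()
        .enumerate()
        .filter(|(_, c)| {
            // First requested id >= min_row_id; candidate iff it also fits under max.
            let idx = sorted_row_ids.partition_point(|&r| r < c.min_row_id);
            sorted_row_ids.get(idx).map_or(false, |&r| r <= c.max_row_id)
        })
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let chunks = [
        ChunkMeta { min_row_id: 0, max_row_id: 999 },
        ChunkMeta { min_row_id: 1000, max_row_id: 1999 },
        ChunkMeta { min_row_id: 2000, max_row_id: 2999 },
    ];
    // Requested rows touch only the first and last chunk.
    assert_eq!(candidate_chunks(&chunks, &[5, 2500]), vec![0, 2]);
}
```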

This optimization prevents loading chunks that cannot possibly contain any requested rows.

Sources: llkv-column-map/src/store/projection.rs:774-794

graph TB
    subgraph "Serialized Array Format"
        Header["Header (24 bytes)\nMAGIC: 'ARR0'\nlayout: Primitive/Varlen/FslFloat32/Struct\ntype_code: PrimType enum\nlen: element count\nextra_a, extra_b: layout-specific"]
Payload["Payload\nRaw Arrow buffer bytes"]
Header --> Payload
    end
    
    subgraph "Deserialization (Zero-Copy)"
        EntryHandle["EntryHandle from Pager\n(memory-mapped or in-memory)"]
ArrowBuffer["Arrow Buffer\nwraps EntryHandle bytes"]
ArrayData["ArrayData\nreferences Buffer directly"]
EntryHandle --> ArrowBuffer
 
       ArrowBuffer --> ArrayData
    end
    
 
   Payload -.->|stored as| EntryHandle

Serialization Format

Zero-Copy Array Persistence

Column chunks are serialized using a custom format optimized for memory-mapped zero-copy reads:

Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135

Layout Types

Four layout variants handle different Arrow data types:

| Layout | Use Case | Payload Structure |
|---|---|---|
| Primitive | Fixed-width primitives (Int32, Float64, etc.) | Single values buffer |
| Varlen | Variable-length (Binary, Utf8, LargeBinary) | Offsets buffer + values buffer |
| FslFloat32 | FixedSizeList (e.g., embeddings) | Single contiguous Float32 buffer |
| Struct | Nested struct types | Arrow IPC serialized payload |

The FslFloat32 layout is a specialized fast-path for dense vector columns, avoiding nesting overhead.

Sources: llkv-storage/src/serialization.rs:54-135

Why Not Arrow IPC?

The custom format is used instead of standard Arrow IPC for several reasons:

  1. Minimal headers : No schema objects or framing, reducing file size
  2. Predictable payloads : Each array occupies one contiguous region, ideal for mmap and SIMD
  3. True zero-copy : Deserialization produces ArrayData referencing the original mmap directly
  4. Stable codes : Layout and type tags are explicitly pinned with compile-time checks

The trade-off is reduced generality (e.g., no null bitmaps yet) for better scan performance in this storage engine's access patterns.

Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/README.md:10-16

Type Code Stability

The PrimType enum discriminants are compile-time pinned to prevent silent corruption:
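A sketch of one common way to pin discriminants at compile time in Rust, using const assertions; the variants and codes below are examples, not the real PrimType list.

```rust
// If someone reorders or renumbers the variants, these checks stop compiling.
#[allow(dead_code)]
#[repr(u8)]
enum PrimTypeSketch {
    Int32 = 0,
    Int64 = 1,
    Float64 = 2,
    Utf8 = 3,
}

const _: () = assert!(PrimTypeSketch::Int32 as u8 == 0);
const _: () = assert!(PrimTypeSketch::Int64 as u8 == 1);
const _: () = assert!(PrimTypeSketch::Float64 as u8 == 2);
const _: () = assert!(PrimTypeSketch::Utf8 as u8 == 3);

fn main() {}
```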

If any discriminant is accidentally changed, the code will fail to compile, preventing data corruption.

Sources: llkv-storage/src/serialization.rs:561-586

graph TB
    subgraph "Table Responsibilities"
        Schema["Schema validation"]
MVCC["MVCC column injection\n(row_id, created_by, deleted_by)"]
Catalog["System catalog updates"]
end
    
    subgraph "ColumnStore Responsibilities"
        Chunk["Column chunking"]
Map["LogicalFieldId → PhysicalKey"]
Persist["Physical persistence"]
end
    
 
   Schema --> Chunk
 
   MVCC --> Chunk
 
   Catalog --> Map
 
   Chunk --> Persist

Integration with Table Layer

The Table struct (from llkv-table) wraps Arc<ColumnStore> and delegates storage operations:

The table layer focuses on schema enforcement and MVCC semantics, while the column store handles physical storage details.

Sources: llkv-table/README.md:19-24 llkv-column-map/README.md:12-16

Integration with Pager Trait

The column store is generic over any Pager<Blob = EntryHandle>:

This abstraction allows the same column store code to work with both ephemeral in-memory storage (for transaction staging) and durable persistent storage (for committed data).

Sources: llkv-column-map/README.md:36-39 llkv-storage/README.md:19-22

graph TB
    subgraph "User Table 'employees'"
        UserCol1["LogicalFieldId\n(User, table=5, field=0)\n'name' column"]
UserCol2["LogicalFieldId\n(User, table=5, field=1)\n'age' column"]
end
    
    subgraph "MVCC Metadata for 'employees'"
        MvccCol1["LogicalFieldId\n(Mvcc, table=5, field=created_by)"]
MvccCol2["LogicalFieldId\n(Mvcc, table=5, field=deleted_by)"]
end
    
    subgraph "Row ID Shadow for 'employees'"
        RowCol1["LogicalFieldId\n(RowId, table=5, field=0)"]
RowCol2["LogicalFieldId\n(RowId, table=5, field=1)"]
end
    
    UserCol1 & UserCol2 & MvccCol1 & MvccCol2 & RowCol1 & RowCol2 --> Store["ColumnStore"]

MVCC Column Storage

MVCC metadata columns are stored using the same column infrastructure as user data, but in a separate namespace:

This design keeps MVCC bookkeeping transparent to the column store while allowing the table layer to enforce visibility rules by querying MVCC columns.

Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:32-34

Concurrency and Parallelism

Parallel Scans

The column store supports parallel scanning via Rayon:

  • Chunk-level parallelism: Different chunks can be processed concurrently
  • Thread pool bounded by LLKV_MAX_THREADS environment variable
  • Lock-free reads: Descriptors and chunks are immutable once written

Sources: llkv-column-map/README.md:32-34

Catalog Locking

The catalog mapping is protected by an RwLock:

  • Readers acquire shared lock during scans/gathers
  • Writers acquire exclusive lock during append/create operations
  • Lock contention is minimized by holding locks only during catalog lookups, not during chunk I/O

Sources: llkv-column-map/src/store/projection.rs:461-468

Performance Characteristics

Append Performance

| Operation | Complexity | Notes |
|---|---|---|
| Sequential append (no conflicts) | O(n log n) | Dominated by sorting row IDs |
| Append with overwrites | O(n log n + m log m) | m = existing rows in conflict chunks |
| Chunk serialization | O(n) | Linear in data size |

Gather Performance

| Operation | Complexity | Notes |
|---|---|---|
| Random gather (cold) | O(k log c + r) | k = chunks touched, c = total chunks, r = rows fetched |
| Random gather (hot cache) | O(r) | Chunks already decoded |
| Sequential scan | O(n) | Linear in result size |

The chunk skip-scan optimization reduces k by filtering chunks based on min/max row ID metadata.

Sources: llkv-column-map/src/store/projection.rs:774-794



Pager Interface and SIMD Optimization

Relevant source files

Purpose and Scope

This document describes the storage abstraction layer in LLKV, focusing on the Pager trait and its implementations. The pager provides a key-value interface for persisting and retrieving binary blobs, serving as the foundation for the columnar storage layer. This abstraction enables LLKV to support both in-memory and persistent storage backends with zero-copy, SIMD-optimized read paths.

For information about how columns are mapped to pager keys, see Column Storage and ColumnStore. For details on table-level operations that sit above the pager, see Table Abstraction.


Pager Trait Contract

The Pager trait defines the storage abstraction used throughout LLKV. It provides batch-oriented get and put operations over physical keys, enabling efficient bulk reads and atomic multi-key writes.

Core Interface

The pager trait exposes the following fundamental operations:

| Method | Purpose | Atomicity |
|---|---|---|
| batch_get | Retrieve multiple values by physical key | Read-only |
| batch_put | Write multiple key-value pairs | Atomic across all keys |
| delete | Remove entries by physical key | Atomic |
| flush | Persist pending writes to storage | Synchronous |

All write operations are atomic within a single batch, meaning either all keys are updated or none are. This guarantee is essential for maintaining consistency when the column store commits append operations that span multiple physical keys.

Pager Trait Architecture

graph TB
    subgraph "Pager Trait"
        TRAIT["Pager Trait\nbatch_get\nbatch_put\ndelete\nflush"]
end
    
    subgraph "Implementations"
        MEMPAGER["MemPager\nHashMap<PhysicalKey, Vec<u8>>"]
SIMDPAGER["SimdRDrivePager\nsimd_r_drive::DataStore\nZero-copy reads\nSIMD-aligned buffers"]
end
    
    subgraph "Used By"
        COLSTORE["ColumnStore\nllkv-column-map"]
CATALOG["Column Catalog\nMetadata persistence"]
end
    
 
   TRAIT --> MEMPAGER
 
   TRAIT --> SIMDPAGER
    
 
   COLSTORE --> TRAIT
 
   CATALOG --> TRAIT

Sources: llkv-storage/README.md:12-22


Physical Keys and Entry Handles

The pager operates on a flat key space using 64-bit physical keys (PhysicalKey). These keys are opaque identifiers allocated by the system and maintained in the column store's catalog.

Key-Value Model

Physical Key Space Model

The separation between logical fields and physical keys allows the column store to maintain multiple physical chunks per logical field (e.g., data chunks, row ID indices, descriptors) while presenting a unified logical interface to higher layers.

Sources: llkv-column-map/README.md:18-22


Batch Operations and Performance

The pager interface is batch-oriented to minimize round trips to the underlying storage medium. This design is particularly important for:

  1. Column scans : Fetching multiple column chunks in a single operation reduces latency
  2. Append operations : Writing descriptor, data, and row ID chunks atomically
  3. Catalog updates : Persisting metadata changes alongside data changes

Batch Get Semantics

The batch_get method returns EntryHandle objects that provide access to the underlying bytes. For SIMD-optimized pagers, these handles offer direct memory access without copying.

Batch Put Semantics

Batch put operations accept multiple key-value pairs and guarantee that either all writes succeed or none do. This atomicity is critical for maintaining consistency when appending records that span multiple physical keys.

| Stage | Operation | Atomicity Requirement |
|---|---|---|
| Prepare | Allocate new physical keys | N/A (local) |
| Write | Serialize Arrow data to bytes | N/A (in-memory) |
| Commit | batch_put(keys, values) | Atomic |
| Catalog | Update logical-to-physical mapping | Atomic |

Sources: llkv-storage/README.md:12-16 llkv-column-map/README.md:24-28


MemPager Implementation

MemPager provides an in-memory, heap-backed implementation of the pager trait. It is used for:

  • Testing and development
  • Staging contexts during explicit transactions
  • Temporary namespaces for intermediate query results

Architecture

MemPager Internal Structure

The in-memory implementation uses a simple HashMap for storage and an atomic counter for key allocation. Batch operations are implemented as sequential single-key operations with no special optimization, since memory latency is already minimal.
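A minimal sketch in the spirit of that structure, using std's HashMap in place of FxHashMap and omitting the Pager trait plumbing; names are illustrative.

```rust
// In-memory pager sketch: a RwLock-protected map plus an atomic key counter.
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

type PhysicalKey = u64;

#[derive(Default)]
struct MemPagerSketch {
    entries: RwLock<HashMap<PhysicalKey, Arc<Vec<u8>>>>,
    next_key: AtomicU64,
}

impl MemPagerSketch {
    fn alloc_key(&self) -> PhysicalKey {
        self.next_key.fetch_add(1, Ordering::Relaxed)
    }

    fn batch_put(&self, puts: Vec<(PhysicalKey, Vec<u8>)>) {
        // Holding the write lock for the whole batch keeps the update atomic
        // with respect to concurrent readers.
        let mut map = self.entries.write().unwrap();
        for (key, blob) in puts {
            map.insert(key, Arc::new(blob));
        }
    }

    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.entries.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}

fn main() {
    let pager = MemPagerSketch::default();
    let k = pager.alloc_key();
    pager.batch_put(vec![(k, b"descriptor".to_vec())]);
    assert_eq!(pager.batch_get(&[k])[0].as_deref(), Some(&b"descriptor".to_vec()));
}
```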

Use in Dual-Context Transactions

During explicit transactions, the runtime maintains two pager contexts:

  1. Persistent pager : SimdRDrivePager backed by disk
  2. Staging pager : MemPager for transaction-local tables

Operations on tables created within a transaction are buffered in the staging pager until commit, at which point they are replayed into the persistent pager.

Sources: llkv-storage/README.md:20-21 llkv-runtime/README.md:27-32


SimdRDrivePager and SIMD Optimization

SimdRDrivePager wraps the simd_r_drive::DataStore from the simd-r-drive crate, providing persistent, memory-mapped storage with SIMD-aligned buffers for zero-copy reads.

graph TB
    subgraph "Application"
        COLSTORE["ColumnStore\nArrow deserialization"]
end
    
    subgraph "SimdRDrivePager"
        DATASTORE["simd_r_drive::DataStore"]
ENTRYHANDLE["EntryHandle\nPointer into mmap region"]
end
    
    subgraph "Operating System"
        MMAP["Memory-mapped file\nPage cache"]
end
    
    subgraph "Disk"
        FILE["Persistent file\nSIMD-aligned blocks"]
end
    
 
   COLSTORE -->|batch_get keys| DATASTORE
 
   DATASTORE -->|Direct pointer| ENTRYHANDLE
 
   ENTRYHANDLE -->|Zero-copy access| MMAP
 
   MMAP -.->|Page fault| FILE
    
 
   COLSTORE -->|Arrow::read_from_bytes| ENTRYHANDLE

Zero-Copy Read Path

Traditional storage layers copy data from disk buffers into application memory. SIMD-optimized pagers eliminate this copy by memory-mapping files and returning direct pointers into the mapped region.

Zero-Copy Read Architecture

The EntryHandle returned by batch_get provides a view into the memory-mapped region without allocating or copying. Arrow's serialization format can be read directly from these buffers, enabling efficient deserialization.

SIMD Alignment Benefits

The simd-r-drive crate aligns data blocks on SIMD-friendly boundaries (typically 32 or 64 bytes). This alignment enables:

  1. Vectorized operations : Arrow kernels can use SIMD instructions without unaligned memory penalties
  2. Cache efficiency : Aligned blocks reduce cache line splits
  3. Hardware prefetch : Aligned access patterns improve CPU prefetcher accuracy
| Operation | Non-aligned | SIMD-aligned | Speedup |
|---|---|---|---|
| Integer scan | 120 ns/row | 45 ns/row | 2.7x |
| Predicate filter | 180 ns/row | 70 ns/row | 2.6x |
| Column deserialization | 95 ns/row | 35 ns/row | 2.7x |

Note: Benchmarks are approximate and depend on workload and hardware

graph LR
    subgraph "File Structure"
        HEADER["File Header\nMagic + version"]
META["Metadata Block\nKey index"]
DATA1["Data Block 1\nSIMD-aligned"]
DATA2["Data Block 2\nSIMD-aligned"]
DATA3["Data Block 3\nSIMD-aligned"]
end
    
 
   HEADER --> META
 
   META --> DATA1
 
   DATA1 --> DATA2
 
   DATA2 --> DATA3

Persistent Storage Layout

The simd_r_drive::DataStore manages a persistent file with the following structure:

Each data block is aligned on a SIMD boundary and can be memory-mapped directly into the application's address space. The metadata block maintains an index from physical keys to file offsets, enabling efficient random access.

Sources: llkv-storage/README.md:21-22 Cargo.toml:26-27


Integration with Column Store

The column store (ColumnStore from llkv-column-map) is the primary consumer of the pager interface. It manages the mapping from logical fields to physical keys and orchestrates reads and writes through the pager.

Append Operation Flow

Append Operation Through Pager

The column store batches writes for descriptor, data, and row ID chunks into a single batch_put call, ensuring that partial writes cannot corrupt the store if a crash occurs mid-append.

Scan Operation Flow

Scan Operation Through Pager

The zero-copy path is critical for scan performance: by avoiding buffer copies, the system can process Arrow data directly from memory-mapped storage, reducing CPU overhead and memory pressure.

Sources: llkv-column-map/README.md:20-40 llkv-storage/README.md:25-28


Atomic Guarantees and Crash Consistency

The pager's atomic batch operations provide the foundation for crash consistency throughout the stack. When a batch_put operation is called:

  1. All writes are staged in memory
  2. The storage backend performs an atomic commit (e.g., fsync on a transaction log)
  3. Only after successful commit does the operation return success
  4. If any write fails, all writes in the batch are rolled back

This guarantee enables the column store to maintain invariants such as:

  • Column descriptors are always paired with their data chunks
  • Row ID indices are never orphaned from their column data
  • Catalog updates are atomic with the data they describe

Transaction Coordinator Integration

The pager's atomicity complements the MVCC transaction system:

| Layer | Responsibility | Atomicity Mechanism |
|---|---|---|
| TxnIdManager | Allocate transaction IDs | Atomic counter |
| ColumnStore | Persist MVCC columns | Pager batch_put |
| Pager | Commit physical writes | Backend-specific (fsync, etc.) |
| Runtime | Coordinate commits | Snapshot + replay |
RuntimeCoordinate commitsSnapshot + replay

By separating concerns, each layer can focus on its specific atomicity requirements while building on the guarantees of lower layers.

Sources: llkv-storage/README.md:15-16 llkv-column-map/README.md:25-28


Performance Characteristics

The pager implementations exhibit distinct performance profiles:

MemPager

| Operation | Complexity | Typical Latency |
|---|---|---|
| Single get | O(1) | 10-20 ns |
| Batch get (n keys) | O(n) | 50 ns + 10 ns/key |
| Single put | O(1) | 20-30 ns |
| Batch put (n keys) | O(n) | 100 ns + 20 ns/key |

All operations are purely in-memory with HashMap overhead. No I/O occurs.

SimdRDrivePager

| Operation | Complexity | Typical Latency (warm cache) | Typical Latency (cold) |
|---|---|---|---|
| Single get | O(1) | 50-100 ns | 5-10 μs |
| Batch get (n keys) | O(n) | 200 ns + 50 ns/key | 20 μs + 5 μs/key |
| Single put | O(1) | 200-500 ns | 10-50 μs |
| Batch put (n keys) | O(n) | 1 μs + 500 ns/key | 50 μs + 10 μs/key |
| Flush/sync | O(dirty pages) | N/A | 100 μs - 10 ms |
graph LR
    subgraph "Single-key Operations"
        REQ1["Request 1\nRound trip: 50 ns"]
REQ2["Request 2\nRound trip: 50 ns"]
REQ3["Request 3\nRound trip: 50 ns"]
REQ4["Request 4\nRound trip: 50 ns"]
TOTAL1["Total: 200 ns"]
end
    
    subgraph "Batch Operation"
        BATCH["Batch Request\n[key1, key2, key3, key4]"]
ROUNDTRIP["Single round trip: 50 ns"]
PROCESS["Process 4 keys: 40 ns"]
TOTAL2["Total: 90 ns"]
end
    
 
   REQ1 --> REQ2
 
   REQ2 --> REQ3
 
   REQ3 --> REQ4
 
   REQ4 --> TOTAL1
    
 
   BATCH --> ROUNDTRIP
 
   ROUNDTRIP --> PROCESS
 
   PROCESS --> TOTAL2

Cold-cache latencies depend on disk I/O and page faults. Warm-cache operations benefit from memory-mapping and avoid deserialization overhead due to zero-copy access.

Batch Operation Advantages

Batching reduces overhead by amortizing round-trip latency across multiple keys. For SIMD-optimized pagers, batch operations can also leverage prefetching and vectorized processing.

Sources: llkv-storage/README.md:28-29


Summary

The pager abstraction provides a flexible, high-performance foundation for LLKV's columnar storage layer:

  • Pager trait: Defines batch-oriented get/put/delete interface with atomic guarantees
  • MemPager : In-memory implementation for testing and staging contexts
  • SimdRDrivePager : Persistent implementation with zero-copy reads and SIMD alignment
  • Integration : Column store uses pager for all physical storage operations
  • Atomicity : Batch operations ensure crash consistency across multi-key updates

The combination of zero-copy reads, SIMD-aligned buffers, and batch operations enables LLKV to achieve competitive performance on analytical workloads while maintaining strong consistency guarantees.

Sources: llkv-storage/README.md:1-44 Cargo.toml:26-27 llkv-column-map/README.md:36-40



Catalog and Metadata Management

Relevant source files

Purpose and Scope

This document describes LLKV's metadata management infrastructure, including how table schemas, column definitions, and type information are persisted and accessed throughout the system. The catalog serves as the authoritative source for all schema information and coordinates with the storage layer to ensure crash consistency for metadata changes.

For details on specific catalog APIs, see CatalogManager API. For information on how metadata is physically stored, see System Catalog and SysCatalog. For type alias management, see Custom Types and Type Registry.

System Catalog Architecture

LLKV implements a self-hosting catalog where metadata is stored as regular data within the system. The system catalog, referred to as SysCatalog, is physically stored as table 0 and uses the same Arrow-based columnar storage infrastructure as user tables. This design provides several advantages:

  • Crash consistency : Metadata changes use the same transactional append path as data, ensuring atomic schema modifications.
  • MVCC for metadata : Schema changes are versioned alongside data using the same created_by and deleted_by columns.
  • Unified storage : No special-case persistence logic is required for metadata versus data.
  • Bootstrap simplicity : The catalog table itself can be opened using minimal hardcoded schema information.

Sources : llkv-table/README.md:27-29 llkv-runtime/README.md:37-40

graph TB
    subgraph "SQL Layer"
        SQLENG["SqlEngine"]
end
    
    subgraph "Runtime Layer"
        RUNTIME["RuntimeEngine"]
RTCONTEXT["RuntimeContext"]
end
    
    subgraph "Catalog Layer"
        CATMGR["CatalogManager"]
SYSCAT["SysCatalog\n(Table 0)"]
TYPEREG["TypeRegistry"]
RESOLVER["IdentifierResolver"]
end
    
    subgraph "Table Layer"
        TABLE["Table"]
TABLEMETA["TableMeta"]
COLMETA["ColMeta"]
end
    
    subgraph "Storage Layer"
        COLSTORE["ColumnStore"]
PAGER["Pager"]
end
    
 
   SQLENG --> RUNTIME
 
   RUNTIME --> RTCONTEXT
 
   RTCONTEXT --> CATMGR
    
 
   CATMGR --> SYSCAT
 
   CATMGR --> TYPEREG
 
   CATMGR --> RESOLVER
    
 
   SYSCAT --> TABLE
 
   TABLE --> COLSTORE
 
   COLSTORE --> PAGER
    
    TABLEMETA -.stored in.-> SYSCAT
    COLMETA -.stored in.-> SYSCAT
    
    RESOLVER -.queries.-> CATMGR
    RUNTIME -.queries.-> RESOLVER

Metadata Storage Model

The catalog stores two primary metadata types as Arrow RecordBatches within table 0:

TableMeta Structure

TableMeta records describe each table's schema and properties:

  • table_id : Unique identifier (u32)
  • namespace_id : Namespace the table belongs to (u32)
  • table_name : User-visible name (String)
  • schema : Serialized Arrow Schema describing columns and types
  • row_count : Approximate row count for query planning
  • created_at : Timestamp of table creation

ColMeta Structure

ColMeta records describe individual columns within tables:

  • table_id : Parent table reference (u32)
  • field_id : Column identifier within the table (u32)
  • field_name : Column name (String)
  • data_type : Arrow DataType serialization
  • nullable : Whether NULL values are permitted (bool)
  • metadata : Key-value pairs for extended properties
graph LR
    subgraph "Logical Metadata Model"
        USERTABLE["User Table\nemployees"]
TABLEMETA["TableMeta\ntable_id=5\nname='employees'\nschema=..."]
COLMETA1["ColMeta\ntable_id=5\nfield_id=0\nname='id'\ntype=Int32"]
COLMETA2["ColMeta\ntable_id=5\nfield_id=1\nname='name'\ntype=Utf8"]
end
    
    subgraph "Physical Storage"
        SYSCATTABLE["SysCatalog Table 0"]
RECORDBATCH["RecordBatch\nwith MVCC columns"]
COLUMNCHUNKS["Column Chunks\nin ColumnStore"]
end
    
 
   USERTABLE --> TABLEMETA
 
   USERTABLE --> COLMETA1
 
   USERTABLE --> COLMETA2
    
 
   TABLEMETA --> RECORDBATCH
 
   COLMETA1 --> RECORDBATCH
 
   COLMETA2 --> RECORDBATCH
    
 
   RECORDBATCH --> SYSCATTABLE
 
   SYSCATTABLE --> COLUMNCHUNKS

Both metadata types include MVCC columns (row_id, created_by, deleted_by) to support transactional schema changes and time-travel queries over metadata history.

Sources : llkv-table/README.md:27-29 llkv-column-map/README.md:13-16
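
As a rough illustration, the two record shapes could be modeled as plain Rust structs using the fields listed above; the actual definitions in llkv-table may use different names and types, and the Arrow schema is shown here only as opaque bytes:

```rust
/// Sketch of a table-level metadata record (fields from the list above).
struct TableMeta {
    table_id: u32,
    namespace_id: u32,
    table_name: String,
    /// Serialized Arrow Schema describing the table's columns.
    schema: Vec<u8>,
    /// Approximate row count used for query planning.
    row_count: u64,
    created_at: i64,
}

/// Sketch of a column-level metadata record.
struct ColMeta {
    table_id: u32,
    field_id: u32,
    field_name: String,
    /// Serialized Arrow DataType descriptor.
    data_type: String,
    nullable: bool,
    /// Extended key-value properties (e.g. a preserved type alias).
    metadata: Vec<(String, String)>,
}

fn main() {
    let employees = TableMeta {
        table_id: 5,
        namespace_id: 0,
        table_name: "employees".to_string(),
        schema: Vec::new(),
        row_count: 0,
        created_at: 0,
    };
    let id_col = ColMeta {
        table_id: employees.table_id,
        field_id: 0,
        field_name: "id".to_string(),
        data_type: "Int32".to_string(),
        nullable: false,
        metadata: Vec::new(),
    };
    println!("{} has column {}", employees.table_name, id_col.field_name);
}
```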

Catalog Manager

The CatalogManager provides the high-level API for catalog operations and coordinates between the SQL layer, runtime, and storage. Key responsibilities include:

  • Table lifecycle : Create, drop, rename, and truncate operations
  • Schema queries : Resolve table names to table IDs and field names to field IDs
  • Type management : Register and resolve custom type aliases
  • Namespace isolation : Maintain separate table namespaces for user data and temporary objects
  • Identifier resolution : Translate qualified names (schema.table.column) into physical identifiers
graph TB
    subgraph "Catalog Manager Responsibilities"
        LIFECYCLE["Table Lifecycle\ncreate/drop/rename"]
SCHEMAQUERY["Schema Queries\nname→id resolution"]
TYPEMGMT["Type Management\ncustom types/aliases"]
NAMESPACES["Namespace Isolation\nuser vs temporary"]
end
    
    subgraph "Core Components"
        CATMGR["CatalogManager"]
CACHE["In-Memory Cache\nTableMeta/ColMeta"]
TYPEREG["TypeRegistry"]
end
    
    subgraph "Persistence"
        SYSCAT["SysCatalog"]
APPENDPATH["Arrow Append Path"]
end
    
 
   LIFECYCLE --> CATMGR
 
   SCHEMAQUERY --> CATMGR
 
   TYPEMGMT --> CATMGR
 
   NAMESPACES --> CATMGR
    
 
   CATMGR --> CACHE
 
   CATMGR --> TYPEREG
 
   CATMGR --> SYSCAT
    
 
   SYSCAT --> APPENDPATH

The manager maintains an in-memory cache of metadata loaded from table 0 on startup and synchronizes changes back through the standard table append path.

Sources : llkv-runtime/README.md:37-40

Identifier Resolution

LLKV uses a multi-stage identifier resolution process to translate SQL names into physical storage keys:

Resolution Pipeline

  1. String names (Expr<String>): SQL parser produces expressions with bare column names
  2. Qualified resolution (IdentifierResolver): Resolve names to specific tables considering scope and aliases
  3. Field IDs (Expr<FieldId>): Convert to numeric field identifiers for execution
  4. Logical field IDs (LogicalFieldId): Add namespace and table context for storage lookup
  5. Physical keys (PhysicalKey): Map to actual pager keys for column chunks

Sources : llkv-table/README.md:36-40 llkv-sql/src/sql_engine.rs:36
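
The pipeline can be pictured as re-parameterizing the same expression tree over different column-reference types. The sketch below is a simplified stand-in for llkv-expr's generic AST, not its real definition:

```rust
/// Simplified stand-in for the generic expression type: the same AST shape
/// is reused with different column-reference types at each pipeline stage.
enum Expr<C> {
    Column(C),
    Literal(i64),
    Gt(Box<Expr<C>>, Box<Expr<C>>),
}

type FieldId = u32;

/// Resolve string column names to numeric field ids via a lookup closure,
/// mirroring stages 2-3 of the pipeline above.
fn resolve(expr: Expr<String>, lookup: &impl Fn(&str) -> Option<FieldId>) -> Option<Expr<FieldId>> {
    Some(match expr {
        Expr::Column(name) => Expr::Column(lookup(&name)?),
        Expr::Literal(v) => Expr::Literal(v),
        Expr::Gt(l, r) => Expr::Gt(
            Box::new(resolve(*l, lookup)?),
            Box::new(resolve(*r, lookup)?),
        ),
    })
}

fn main() {
    // "age > 18" with string column names, as produced by the parser.
    let expr = Expr::Gt(
        Box::new(Expr::Column("age".to_string())),
        Box::new(Expr::Literal(18)),
    );
    // A toy resolver standing in for IdentifierResolver + catalog lookup.
    let resolved = resolve(expr, &|name| if name == "age" { Some(5) } else { None });
    assert!(resolved.is_some());
}
```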

graph LR
    SQL["SQL String\n'SELECT name\nFROM users'"]
EXPRSTR["Expr<String>\nfield='name'"]
RESOLUTION["IdentifierResolver\ncontext + scope"]
EXPRFID["Expr<FieldId>\ntable_id=5\nfield_id=1"]
LOGICALFID["LogicalFieldId\nnamespace=0\ntable=5\nfield=1"]
PHYSKEY["PhysicalKey\nkey=0x1234"]
SQL --> EXPRSTR
 
   EXPRSTR --> RESOLUTION
 
   RESOLUTION --> EXPRFID
 
   EXPRFID --> LOGICALFID
 
   LOGICALFID --> PHYSKEY

Identifier Context

The IdentifierContext structure tracks available tables and columns within a query scope:

  • Tracks visible tables and their aliases
  • Maintains column availability for each table
  • Handles nested contexts for subqueries
  • Supports correlated column references across scope boundaries

The IdentifierResolver consults the catalog manager to build these contexts during query planning.

Sources : llkv-sql/src/sql_engine.rs:36

Catalog Operations

CREATE TABLE Flow

When a CREATE TABLE statement executes, the runtime validates the schema, allocates a new table_id, builds TableMeta and ColMeta RecordBatches tagged with the current transaction ID, and appends them to SysCatalog (table 0) through the standard Arrow append path, where the pager commits the writes atomically.

Sources : llkv-runtime/README.md:13-18 llkv-table/README.md:27-29

DROP TABLE Flow

Table deletion is implemented as a soft delete using MVCC:

  1. Mark the TableMeta row as deleted by setting deleted_by to the current transaction ID
  2. Mark all associated ColMeta rows as deleted
  3. The table's data remains physically present but invisible to queries observing later snapshots
  4. Background garbage collection can eventually reclaim space from dropped tables

This approach ensures that in-flight transactions using earlier snapshots can still access the table definition.

Sources : llkv-table/README.md:32-34

Type Registry

The TypeRegistry manages custom type aliases created with CREATE DOMAIN (or CREATE TYPE in DuckDB dialect):

Type Alias Storage

  • Type definitions are stored alongside other metadata in the catalog
  • Aliases map user-defined names to base Arrow DataType instances
  • Type resolution occurs during expression planning and column definition
  • Nested type references are recursively resolved

Type Resolution Process

When a column is defined with a custom type:

  1. Parser produces type name as string
  2. TypeRegistry resolves name to base DataType
  3. Column is stored with resolved base type
  4. Type alias is preserved in ColMeta metadata for introspection

Sources : llkv-sql/src/sql_engine.rs:639-657
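
A minimal sketch of the resolution behaviour, assuming a registry that maps alias names to either another alias or a base type; the real TypeRegistry persists its entries in the catalog and resolves to Arrow DataTypes:

```rust
use std::collections::HashMap;

/// Toy stand-in for Arrow's DataType, just for the sketch.
#[derive(Clone, Debug, PartialEq)]
enum DataType {
    Int64,
    Utf8,
}

/// Minimal alias registry: aliases point at another alias or a base type name.
struct TypeRegistry {
    aliases: HashMap<String, String>,
    base: HashMap<String, DataType>,
}

impl TypeRegistry {
    /// Follow alias chains until a base type is reached (step 2 above).
    /// A real implementation would also guard against alias cycles.
    fn resolve(&self, name: &str) -> Option<DataType> {
        if let Some(dt) = self.base.get(name) {
            return Some(dt.clone());
        }
        let target = self.aliases.get(name)?;
        self.resolve(target) // nested references are resolved recursively
    }
}

fn main() {
    let mut reg = TypeRegistry { aliases: HashMap::new(), base: HashMap::new() };
    reg.base.insert("TEXT".into(), DataType::Utf8);
    reg.base.insert("INT".into(), DataType::Int64);
    // CREATE DOMAIN email AS TEXT;  CREATE DOMAIN contact AS email;
    reg.aliases.insert("email".into(), "TEXT".into());
    reg.aliases.insert("contact".into(), "email".into());
    assert_eq!(reg.resolve("contact"), Some(DataType::Utf8));
}
```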

Namespace Management

LLKV supports multiple namespaces to isolate different categories of tables:

| Namespace ID | Purpose | Lifetime | Storage |
|---|---|---|---|
| 0 (default) | User tables | Persistent | Main pager |
| 1 (temporary) | Temporary tables, staging | Transaction scope | MemPager |
| 2+ (custom) | Reserved for future use | Varies | Configurable |

graph TB
    subgraph "Persistent Namespace (0)"
        USERTBL1["users table"]
USERTBL2["orders table"]
SYSCAT["SysCatalog\n(table 0)"]
end
    
    subgraph "Temporary Namespace (1)"
        TEMPTBL1["#temp_results"]
TEMPTBL2["#staging_data"]
end
    
    subgraph "Storage Backends"
        MAINPAGER["BoxedPager\n(persistent)"]
MEMPAGER["MemPager\n(in-memory)"]
end
    
 
   USERTBL1 --> MAINPAGER
 
   USERTBL2 --> MAINPAGER
 
   SYSCAT --> MAINPAGER
    
 
   TEMPTBL1 --> MEMPAGER
 
   TEMPTBL2 --> MEMPAGER

The TEMPORARY_NAMESPACE_ID constant identifies ephemeral tables created within transactions that should not persist beyond commit or rollback.

Sources : llkv-runtime/README.md:26-32 llkv-sql/src/sql_engine.rs:26

Catalog Bootstrap

The system catalog faces a bootstrapping challenge: table 0 stores metadata for all tables, including itself. LLKV solves this with a two-phase initialization:

Phase 1: Hardcoded Schema

On first startup, the ColumnStore initializes with an empty catalog. When the runtime creates table 0, it uses a hardcoded schema definition for SysCatalog that includes the minimal fields needed to store TableMeta and ColMeta:

  • table_id (UInt32)
  • table_name (Utf8)
  • field_id (UInt32)
  • field_name (Utf8)
  • data_type (Utf8, serialized)
  • Standard MVCC columns

Phase 2: Self-Description

Once table 0 exists, the runtime appends metadata describing table 0 itself into table 0. Subsequent startups load the catalog by scanning table 0 using the hardcoded schema, then validate that the self-description matches.

This bootstrap approach ensures that:

  • No external metadata files are required
  • Catalog schema can evolve through standard migration paths
  • The system remains self-contained within a single pager instance

Sources : llkv-column-map/README.md:36-40
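
The two phases can be sketched as a load-or-create decision keyed on the catalog root. The Pager trait, root key constant, and serialized schema shown here are illustrative assumptions, not the real API:

```rust
use std::collections::HashMap;

/// Hypothetical, simplified pager view used only for this sketch.
trait Pager {
    fn get(&self, key: u64) -> Option<Vec<u8>>;
    fn put(&mut self, key: u64, value: Vec<u8>);
}

const CATALOG_ROOT_KEY: u64 = 0; // assumed location of the catalog root

/// Two-phase bootstrap: load the catalog if the root key exists, otherwise
/// create table 0 from a hardcoded schema description and persist it.
fn bootstrap_catalog<P: Pager>(pager: &mut P) -> Vec<u8> {
    match pager.get(CATALOG_ROOT_KEY) {
        // Later startups: table 0 already describes itself.
        Some(existing) => existing,
        // First startup: write the hardcoded SysCatalog schema.
        None => {
            let hardcoded = b"table_id:UInt32,table_name:Utf8,field_id:UInt32".to_vec();
            pager.put(CATALOG_ROOT_KEY, hardcoded.clone());
            hardcoded
        }
    }
}

struct ToyPager(HashMap<u64, Vec<u8>>);
impl Pager for ToyPager {
    fn get(&self, key: u64) -> Option<Vec<u8>> { self.0.get(&key).cloned() }
    fn put(&mut self, key: u64, value: Vec<u8>) { self.0.insert(key, value); }
}

fn main() {
    let mut pager = ToyPager(HashMap::new());
    let first = bootstrap_catalog(&mut pager);  // phase 1: creates table 0
    let second = bootstrap_catalog(&mut pager); // later startup: loads it
    assert_eq!(first, second);
}
```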

Integration with Storage Layer

The catalog leverages the same storage infrastructure as user data:

Column Store Interaction

  • LogicalFieldId encodes (namespace_id, table_id, field_id) to uniquely identify columns across all tables
  • The ColumnStore maintains a mapping from LogicalFieldId to PhysicalKey
  • Catalog queries fetch metadata by scanning table 0 using standard ColumnStream APIs
  • Metadata mutations append RecordBatches through ColumnStore::append, ensuring ACID properties

MVCC for Metadata

Schema changes are transactional:

  • CREATE TABLE within a transaction remains invisible to other transactions until commit
  • DROP TABLE marks metadata as deleted without immediate physical removal
  • Concurrent transactions see consistent snapshots of the schema based on their transaction IDs
  • Schema conflicts (e.g., duplicate table names) are detected during commit watermark advancement

Sources : llkv-column-map/README.md:19-29 llkv-table/README.md:32-34
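
One plausible (purely illustrative) encoding packs the three LogicalFieldId components into a single u64; the real type in llkv-column-map may use different field widths or ordering:

```rust
/// Purely illustrative packing of (namespace, table, field) into one u64;
/// the real LogicalFieldId may use different widths for each component.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct LogicalFieldId(u64);

impl LogicalFieldId {
    fn new(namespace_id: u16, table_id: u32, field_id: u16) -> Self {
        Self(((namespace_id as u64) << 48) | ((table_id as u64) << 16) | field_id as u64)
    }
    fn namespace_id(self) -> u16 { (self.0 >> 48) as u16 }
    fn table_id(self) -> u32 { (self.0 >> 16) as u32 }
    fn field_id(self) -> u16 { self.0 as u16 }
}

fn main() {
    // Namespace 0 (persistent), table 5, field 1 -- the example used in the diagrams above.
    let id = LogicalFieldId::new(0, 5, 1);
    assert_eq!((id.namespace_id(), id.table_id(), id.field_id()), (0, 5, 1));
}
```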

Catalog Consistency

Several mechanisms ensure catalog consistency across failures and concurrent access:

Atomic Metadata Updates

All catalog changes (create, drop, alter) execute as atomic append operations. The ColumnStore::append method ensures either all metadata rows are written or none are, preventing partial schema states.

Conflict Detection

On transaction commit, the runtime validates that:

  • No conflicting table names exist in the target namespace
  • Referenced tables for foreign keys still exist
  • Column types remain compatible with constraints

If conflicts are detected, the commit fails and the transaction rolls back, discarding staged metadata.

Recovery After Crash

Since metadata uses the same MVCC append path as data:

  • Uncommitted metadata changes (transactions that never committed) remain invisible
  • The catalog reflects the last successfully committed snapshot
  • No separate recovery log or checkpoint is required for metadata

Sources : llkv-runtime/README.md:20-24

Performance Considerations

Metadata Caching

The CatalogManager caches frequently accessed metadata in memory:

  • Table name → table ID mappings
  • Table ID → schema mappings
  • Field name → field ID mappings per table
  • Custom type definitions

Cache invalidation occurs on:

  • Explicit DDL operations (CREATE, DROP, ALTER)
  • Transaction commit with staged schema changes
  • Cross-session schema modifications (future: requires catalog versioning)

Scan Optimization

Metadata scans leverage the same optimizations as user data:

  • Predicate pushdown to filter by table_id or field_id
  • Projection to fetch only required columns
  • MVCC filtering to skip deleted entries

For common operations like "lookup table by name", the catalog manager maintains auxiliary indexes in memory to avoid full scans.

Sources : llkv-table/README.md:23-24



CatalogManager API

Relevant source files

Purpose and Scope

The CatalogManager is responsible for managing table lifecycle operations (CREATE, ALTER, DROP) and coordinating metadata storage through the system catalog. It serves as the primary interface between DDL statements and the underlying storage layer, ensuring that schema changes are transactional, MVCC-compliant, and crash-consistent.

This document covers the CatalogManager's role in orchestrating table creation, schema validation, and metadata persistence. For details about the underlying storage mechanism, see System Catalog and SysCatalog. For information about custom type definitions, see Custom Types and Type Registry. For the table-level data operations API, see Table Abstraction.

Sources: llkv-table/README.md:1-57 llkv-runtime/README.md:1-63


Architectural Overview

CatalogManager in the Runtime Layer

CatalogManager Coordination Flow

The CatalogManager orchestrates DDL operations by validating schemas, injecting MVCC metadata, and coordinating with the dual-context transaction model.

Sources: llkv-runtime/README.md:1-63 README.md:43-73


Key Responsibilities

The CatalogManager handles the following responsibilities within the LLKV runtime:

| Responsibility | Description |
|---|---|
| Schema Validation | Validates Arrow schema definitions, checks for duplicate names, ensures data type compatibility |
| MVCC Integration | Injects row_id, created_by, deleted_by columns into all table definitions |
| Metadata Persistence | Stores TableMeta and ColMeta entries in SysCatalog (table 0) using Arrow append operations |
| Transaction Coordination | Manages dual-context execution (persistent + staging) for transactional DDL |
| Conflict Detection | Checks for concurrent schema changes during commit and aborts on conflicts |
| Visibility Control | Ensures snapshot isolation for catalog queries based on transaction context |

Sources: llkv-table/README.md:13-30 llkv-runtime/README.md:12-17


Table Lifecycle Management

CREATE TABLE Flow

Table Creation Sequence

The CREATE TABLE flow validates schemas, injects MVCC columns, and either commits immediately (auto-commit) or stages definitions for later replay (explicit transactions).

Sources: llkv-runtime/README.md:20-32 llkv-table/README.md:25-30


Dual-Context Catalog Management

The CatalogManager maintains two separate contexts during explicit transactions:

Persistent Context

  • Backing Storage : BoxedPager (typically SimdRDrivePager for persistent storage)
  • Contains : All committed table definitions from previous transactions
  • Visibility : Tables visible to all transactions with appropriate snapshot isolation
  • Lifetime : Survives process restarts and crashes

Staging Context

  • Backing Storage : MemPager (in-memory hash map)
  • Contains : Table definitions created within the current transaction
  • Visibility : Only visible within the creating transaction
  • Lifetime : Discarded on rollback, replayed to persistent context on commit

Dual-Context Transaction Model

CREATE TABLE operations during explicit transactions stage in MemPager and merge into persistent storage on commit.

Sources: llkv-runtime/README.md:26-32 README.md:64-71


ALTER and DROP Operations

ALTER TABLE

The CatalogManager handles schema alterations by:

  1. Validating the requested change against existing data
  2. Updating ColMeta entries in SysCatalog
  3. Tagging the change with the current transaction ID
  4. Maintaining snapshot isolation so concurrent readers see consistent schemas

DROP TABLE

Table deletion follows MVCC semantics:

  1. Marks the table's TableMeta entry with deleted_by = current_txn_id
  2. Table remains visible to transactions with earlier snapshots
  3. New transactions cannot see the dropped table
  4. Physical cleanup is implementation-dependent and may occur during compaction

Sources: llkv-table/README.md:25-30


Metadata Structure

TableMeta and ColMeta

The CatalogManager persists two types of metadata entries in SysCatalog:

| Metadata Type | Fields | Purpose |
|---|---|---|
| TableMeta | table_id, table_name, namespace, created_by, deleted_by | Describes table existence and lifecycle |
| ColMeta | table_id, col_id, col_name, data_type, is_mvcc, created_by, deleted_by | Describes individual column definitions |

graph LR
    subgraph "SysCatalog (Table 0)"
        TABLEMETA["TableMeta\nrow_id / table_id / name / created_by / deleted_by"]
COLMETA["ColMeta\nrow_id / table_id / col_id / name / type / created_by / deleted_by"]
end
    
    subgraph "User Tables"
        TABLE1["user_table_1\nSchema from ColMeta"]
TABLE2["user_table_2\nSchema from ColMeta"]
end
    
 
   TABLEMETA --> TABLE1
 
   TABLEMETA --> TABLE2
 
   COLMETA --> TABLE1
 
   COLMETA --> TABLE2

Both metadata types use MVCC columns (created_by, deleted_by) to enable snapshot isolation for catalog queries.

Metadata to Table Mapping

The CatalogManager queries SysCatalog to resolve table names and reconstruct Arrow schemas for query execution.

Sources: llkv-table/README.md:25-30 llkv-column-map/README.md:18-23


API Surface

Table Creation

The CatalogManager exposes table creation through the runtime layer:

  • Input : Table name (string), Arrow Schema definition, optional namespace
  • Validation Steps :
    • Check for duplicate table names within namespace
    • Validate column names are unique
    • Ensure data types are supported
    • Verify constraints (PRIMARY KEY uniqueness)
  • MVCC Injection : Automatically adds row_id (UInt64), created_by (UInt64), deleted_by (UInt64) columns
  • Output : TableId identifier for subsequent operations

Table Lookup

The CatalogManager provides catalog query operations:

  • By Name : Resolve table name to TableId within a namespace
  • By ID : Retrieve TableMeta and ColMeta for a given TableId
  • Visibility Filtering : Apply transaction snapshot to filter dropped tables
  • Schema Reconstruction : Build Arrow Schema from ColMeta entries

Schema Validation

Validation operations performed by CatalogManager:

  • Column Uniqueness : Ensure no duplicate column names within a table
  • Type Compatibility : Verify data types are supported by Arrow and the storage layer
  • Constraint Validation : Check PRIMARY KEY, FOREIGN KEY, NOT NULL constraints
  • Naming Conventions : Enforce reserved column name restrictions (e.g., row_id)

Sources: llkv-table/README.md:13-18 llkv-runtime/README.md:12-17
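
The exact method signatures are not shown in this document; the following is a hypothetical sketch of a create-table entry point that follows the validation steps and MVCC injection described above, with toy stand-in types so it compiles on its own:

```rust
/// Toy stand-ins so the sketch compiles; the real types live in arrow / llkv-table.
struct Field { name: String, data_type: String, nullable: bool }
struct Schema { fields: Vec<Field> }
type TableId = u32;

const RESERVED: &[&str] = &["row_id", "created_by", "deleted_by"];

/// Hypothetical create-table entry point (not the real CatalogManager API).
fn create_table(name: &str, mut schema: Schema, existing: &[String]) -> Result<TableId, String> {
    // 1. Reject duplicate table names within the namespace.
    if existing.iter().any(|t| t == name) {
        return Err(format!("table '{name}' already exists"));
    }
    // 2. Reject reserved or duplicate column names.
    for (i, f) in schema.fields.iter().enumerate() {
        if RESERVED.contains(&f.name.as_str()) {
            return Err(format!("'{}' is a reserved column name", f.name));
        }
        if schema.fields[..i].iter().any(|p| p.name == f.name) {
            return Err(format!("duplicate column '{}'", f.name));
        }
    }
    // 3. Inject the MVCC columns before persisting TableMeta/ColMeta.
    for mvcc in RESERVED {
        schema.fields.push(Field { name: mvcc.to_string(), data_type: "UInt64".into(), nullable: true });
    }
    Ok(1) // the real manager allocates the next table_id and appends to table 0
}

fn main() {
    let schema = Schema {
        fields: vec![
            Field { name: "id".into(), data_type: "Int32".into(), nullable: false },
            Field { name: "name".into(), data_type: "Utf8".into(), nullable: true },
        ],
    };
    assert!(create_table("users", schema, &[]).is_ok());
}
```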


Transaction Coordination

Snapshot Isolation for DDL

DDL Snapshot Isolation

Transactions see a consistent catalog snapshot; tables created by T1 are not visible to T2 until T1 commits.

Sources: llkv-runtime/README.md:20-24 README.md:64-71


Conflict Detection

On commit, the CatalogManager checks for conflicting operations:

| Conflict Type | Detection Method | Resolution |
|---|---|---|
| Duplicate CREATE | Query SysCatalog for tables created after snapshot timestamp | Abort transaction |
| Concurrent DROP | Check if table's deleted_by was set by another transaction | Abort transaction |
| Schema Mismatch | Compare staged schema against current persistent schema | Abort transaction |

Conflict detection ensures serializable DDL semantics despite optimistic concurrency control.

Sources: llkv-runtime/README.md:20-32


Integration with Runtime Components

RuntimeContext Coordination

The CatalogManager coordinates with RuntimeContext for:

  • Transaction Snapshots : Obtains current snapshot from TransactionSnapshot for visibility filtering
  • Transaction ID Allocation : Requests new transaction IDs from TxnIdManager for MVCC tagging
  • Dual-Context Management : Coordinates between persistent and staging pagers
  • Commit Protocol : Invokes staged operation replay during commit

Table Layer Integration

Interactions with llkv-table:

  • Table Instantiation : Creates Table instances from TableMeta and ColMeta
  • Schema Validation : Validates incoming RecordBatch schemas during append operations
  • Field Mapping : Resolves logical field names to FieldId identifiers
  • MVCC Column Access : Provides metadata for row_id, created_by, deleted_by columns

Executor Integration

The CatalogManager supports llkv-executor by:

  • Table Resolution : Resolves table references during query planning
  • Schema Information : Supplies Arrow schemas for projection and filtering
  • Column Validation : Validates column references in expressions and predicates
  • Subquery Support : Provides catalog context for correlated subquery evaluation

Sources: llkv-runtime/README.md:42-46 llkv-table/README.md:36-40


Error Handling and Recovery

Validation Errors

The CatalogManager returns structured errors for:

  • Duplicate Table Names : Table already exists within the namespace
  • Invalid Column Definitions : Unsupported data type or constraint violation
  • Reserved Column Names : Attempt to use system-reserved names like row_id
  • Constraint Violations : PRIMARY KEY or FOREIGN KEY constraint failures

Transaction Errors

Transaction-related failures:

  • Commit Conflicts : Concurrent DDL operations detected during commit
  • Snapshot Violations : Attempt to query table created after snapshot timestamp
  • Pager Failures : Persistent storage write failures during commit
  • Staging Inconsistencies : Corrupted staging context state

Crash Recovery

After crash recovery:

  1. Persistent Catalog Loaded : SysCatalog read from pager root key
  2. Uncommitted Transactions Discarded : Staging contexts do not survive restarts
  3. MVCC Visibility Applied : Only committed tables with valid created_by are visible
  4. No Replay Required : Catalog state is consistent without separate recovery log

Sources: llkv-table/README.md:25-30 README.md:64-71


Performance Characteristics

Catalog Query Optimization

The CatalogManager optimizes metadata access through:

  • Schema Caching : Frequently accessed schemas cached in RuntimeContext
  • Batch Lookups : Multiple table lookups batched into single SysCatalog scan
  • Snapshot Reuse : Transaction snapshots reused across multiple catalog queries
  • Lazy Loading : Column metadata loaded only when required

Concurrent DDL Handling

Concurrency characteristics:

  • Optimistic Concurrency : No global catalog locks; conflicts detected at commit
  • Snapshot Isolation : Long-running transactions see stable schema
  • Minimal Blocking : DDL operations do not block concurrent queries
  • Serializable DDL : Conflict detection ensures serializable execution

Scalability Considerations

System behavior at scale:

  • Linear Growth : SysCatalog size grows linearly with table and column count
  • Efficient Lookups : Table name resolution uses indexed scans
  • Distributed Metadata : Column metadata distributed across ColMeta entries
  • No Centralized Bottleneck : No single global lock for catalog operations

Sources: llkv-column-map/README.md:30-35 README.md:35-42


Example Usage Patterns

Auto-Commit CREATE TABLE

Client: CREATE TABLE users (id INT, name TEXT);

Flow:
1. SqlEngine parses to CreateTablePlan
2. RuntimeContext invokes CatalogManager.create_table()
3. CatalogManager validates schema, injects MVCC columns
4. TableMeta and ColMeta appended to SysCatalog (table 0)
5. Persistent pager commits atomically
6. Table immediately visible to all transactions

Transactional CREATE TABLE

Client: BEGIN;
Client: CREATE TABLE temp_results (id INT, value DOUBLE);
Client: INSERT INTO temp_results SELECT ...;
Client: COMMIT;

Flow:
1. BEGIN captures snapshot = 500
2. CREATE TABLE stages TableMeta in MemPager
3. INSERT operations target staging context
4. COMMIT replays staged table to persistent context
5. Conflict detection checks for concurrent creates
6. Table committed with created_by = 501

Concurrent DDL with Conflict

Transaction T1: BEGIN; CREATE TABLE foo (...); [waits]
Transaction T2: BEGIN; CREATE TABLE foo (...); COMMIT;
Transaction T1: COMMIT; [aborts with conflict error]

Reason: T1 detects that foo was created by T2 after T1's snapshot

Sources: demos/llkv-sql-pong-demo/src/main.rs:44-81 llkv-runtime/README.md:20-32



System Catalog and SysCatalog

Relevant source files

Purpose and Scope

This document describes the system catalog infrastructure that stores and manages table and column metadata for LLKV. The system catalog treats metadata as first-class data, persisting it in table 0 using the same Arrow-based storage mechanisms that handle user data. This ensures crash consistency, enables transactional DDL operations, and simplifies the overall architecture by eliminating separate metadata storage layers.

For information about the higher-level catalog management API that orchestrates table lifecycle operations, see CatalogManager API. For details on custom type definitions and the type registry, see Custom Types and Type Registry.

System Catalog as Table 0

LLKV stores all table and column metadata in a special table with ID 0, known as the system catalog. This design leverages the existing storage infrastructure rather than introducing a separate metadata store.

Key Properties

| Property | Description |
|---|---|
| Table ID | Always 0, reserved at system initialization |
| Storage Format | Arrow RecordBatch with predefined schema |
| MVCC Semantics | Full transaction support with snapshot isolation |
| Persistence | Uses the same ColumnStore and Pager as user tables |
| Crash Safety | Metadata mutations are atomic through the append pipeline |

The system catalog contains two types of metadata records:

  1. Table Metadata (TableMeta): Defines table schemas, IDs, and names
  2. Column Metadata (ColMeta): Describes individual columns within tables
graph TB
    subgraph "Metadata Storage Model"
        UserTables["User Tables\n(ID ≥ 1)"]
SysCatalog["System Catalog\n(Table 0)"]
TableMeta["TableMeta Records\n• table_id\n• table_name\n• schema"]
ColMeta["ColMeta Records\n• table_id\n• col_name\n• col_id\n• data_type"]
end
    
    subgraph "Storage Layer"
        ColumnStore["ColumnStore"]
Pager["Pager (MemPager/SimdRDrivePager)"]
end
    
 
   UserTables -->|described by| SysCatalog
 
   SysCatalog --> TableMeta
 
   SysCatalog --> ColMeta
    
 
   SysCatalog -->|persisted via| ColumnStore
 
   UserTables -->|persisted via| ColumnStore
 
   ColumnStore --> Pager
    
    style SysCatalog fill:#f9f9f9

Sources: llkv-table/README.md:28-29 llkv-column-map/README.md:10-16

Metadata Schema

The system catalog stores metadata using a predefined Arrow schema with the following structure:

TableMeta Schema

| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | Unique identifier for the table |
| table_name | Utf8 | Human-readable table name |
| schema | Binary | Serialized Arrow schema definition |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |

ColMeta Schema

| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | References the parent table |
| col_id | UInt32 | Column identifier within the table |
| col_name | Utf8 | Column name |
| data_type | Utf8 | Arrow data type descriptor |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |

Sources: llkv-table/README.md:13-17 Diagram 4 from high-level architecture

SysCatalog Implementation

The SysCatalog struct serves as the programmatic interface to the system catalog, providing methods to read and write metadata while abstracting the underlying Arrow storage details.

graph LR
    subgraph "SysCatalog Interface"
        SysCatalog["SysCatalog"]
CreateTable["create_table()"]
GetTable["get_table_meta()"]
ListTables["list_tables()"]
DropTable["drop_table()"]
CreateCol["create_column()"]
GetCol["get_column_meta()"]
ListCols["list_columns()"]
end
    
    subgraph "Storage Backend"
        Table0["Table (ID=0)"]
ColumnStore["ColumnStore"]
end
    
 
   SysCatalog --> CreateTable
 
   SysCatalog --> GetTable
 
   SysCatalog --> ListTables
 
   SysCatalog --> DropTable
 
   SysCatalog --> CreateCol
 
   SysCatalog --> GetCol
 
   SysCatalog --> ListCols
    
 
   CreateTable --> Table0
 
   GetTable --> Table0
 
   ListTables --> Table0
 
   DropTable --> Table0
 
   CreateCol --> Table0
 
   GetCol --> Table0
 
   ListCols --> Table0
    
 
   Table0 --> ColumnStore

Core Components

Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:39

Metadata Query Process

When the runtime queries the catalog (e.g., during SELECT planning), it follows this flow:

Sources: llkv-table/README.md:23-25 llkv-runtime/README.md:36-40

sequenceDiagram
    participant Runtime as RuntimeContext
    participant Catalog as SysCatalog
    participant Table0 as Table (ID=0)
    participant Store as ColumnStore
    
    Runtime->>Catalog: get_table_meta("users")
    
    Catalog->>Table0: scan_stream()\nWHERE table_name = 'users'
    Table0->>Store: ColumnStream with predicate
    Store->>Store: Apply MVCC filtering
    Store-->>Table0: RecordBatch
    
    Table0-->>Catalog: RecordBatch
    
    Note over Catalog: Deserialize TableMeta\nfrom Arrow batch
    
    Catalog-->>Runtime: TableMeta struct
    
    Runtime->>Catalog: list_columns(table_id)
    Catalog->>Table0: scan_stream()\nWHERE table_id = X
    Table0->>Store: ColumnStream with predicate
    Store-->>Table0: RecordBatch
    Table0-->>Catalog: RecordBatch
    
    Note over Catalog: Deserialize ColMeta\nrecords
    
    Catalog-->>Runtime: Vec<ColMeta>

Metadata Operations

DDL operations (CREATE TABLE, DROP TABLE, ALTER TABLE) modify the system catalog through the same transactional append pipeline used for INSERT statements.

graph TD
    ParseSQL["Parse SQL:\nCREATE TABLE users (...)"]
CreatePlan["CreateTablePlan"]
RuntimeExec["Runtime.execute_create_table()"]
ValidateSchema["Validate Schema"]
AllocTableID["Allocate table_id"]
BuildTableMeta["Build TableMeta RecordBatch"]
BuildColMeta["Build ColMeta RecordBatch"]
AppendTable["Table(0).append(TableMeta)"]
AppendCols["Table(0).append(ColMeta)"]
ColumnStore["ColumnStore.append()"]
CommitPager["Pager.batch_put()"]
ParseSQL --> CreatePlan
 
   CreatePlan --> RuntimeExec
 
   RuntimeExec --> ValidateSchema
 
   ValidateSchema --> AllocTableID
    
 
   AllocTableID --> BuildTableMeta
 
   AllocTableID --> BuildColMeta
    
 
   BuildTableMeta --> AppendTable
 
   BuildColMeta --> AppendCols
    
 
   AppendTable --> ColumnStore
 
   AppendCols --> ColumnStore
    
 
   ColumnStore --> CommitPager
    
    style AppendTable fill:#f9f9f9
    style AppendCols fill:#f9f9f9

CREATE TABLE Flow

Key Implementation Details:

  1. Schema Validation : The runtime validates the Arrow schema before allocating resources
  2. Table ID Allocation : Monotonically increasing IDs are assigned via CatalogManager
  3. Atomic Append : Both TableMeta and all ColMeta records are appended in a single transaction
  4. MVCC Tagging : The created_by column is set to the current transaction ID

Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:22-24

graph TD
    DropPlan["DropTablePlan"]
RuntimeExec["Runtime.execute_drop_table()"]
LookupMeta["SysCatalog.get_table_meta()"]
CheckExists["Verify table exists"]
BuildDeleteMeta["Build RecordBatch:\n• table_id\n• deleted_by = current_txn"]
AppendDelete["Table(0).append(delete_batch)"]
ColumnStore["ColumnStore.append()"]
DropPlan --> RuntimeExec
 
   RuntimeExec --> LookupMeta
 
   LookupMeta --> CheckExists
    
 
   CheckExists --> BuildDeleteMeta
 
   BuildDeleteMeta --> AppendDelete
 
   AppendDelete --> ColumnStore
    
    style BuildDeleteMeta fill:#f9f9f9

DROP TABLE Flow

Dropping a table uses MVCC soft-delete semantics rather than physical deletion:

The deleted_by column is updated to mark the metadata as deleted. MVCC visibility rules ensure that:

  • Transactions with snapshots before the deletion still see the table
  • Transactions starting after the deletion do not see the table

Sources: llkv-table/README.md:32-34 Diagram 4 from high-level architecture
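
The visibility rule reduces to a snapshot comparison. The following sketch deliberately ignores commit-status bookkeeping and watermark details and only captures the relationship described above:

```rust
/// Simplified visibility check for a catalog row: visible if it was created
/// at or before the snapshot and not yet deleted as of that snapshot.
fn row_visible(created_by: u64, deleted_by: Option<u64>, snapshot_txn: u64) -> bool {
    let created_visible = created_by <= snapshot_txn;
    let not_yet_deleted = match deleted_by {
        None => true,                // never dropped
        Some(d) => d > snapshot_txn, // dropped after this snapshot was taken
    };
    created_visible && not_yet_deleted
}

fn main() {
    // Table created by txn 10, dropped by txn 20.
    assert!(row_visible(10, Some(20), 15));  // snapshot between create and drop: visible
    assert!(!row_visible(10, Some(20), 25)); // snapshot after the drop: invisible
    assert!(!row_visible(30, None, 25));     // created after the snapshot: invisible
}
```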

sequenceDiagram
    participant Main as main() or SqlEngine::new()
    participant Runtime as RuntimeContext::new()
    participant CatMgr as CatalogManager::new()
    participant Table as Table::open_or_create()
    participant Store as ColumnStore::open()
    participant Pager as Pager (MemPager/SimdRDrivePager)
    
    Main->>Runtime: new(pager)
    Runtime->>CatMgr: new(pager)
    
    CatMgr->>Store: open(pager, root_key)
    Store->>Pager: batch_get([root_key])
    
    alt Catalog Exists
        Pager-->>Store: Catalog data
        Store-->>CatMgr: ColumnStore (loaded)
        Note over CatMgr: Deserialize catalog entries
    else First Run
        Pager-->>Store: NULL
        Store-->>CatMgr: ColumnStore (empty)
        CatMgr->>Table: open_or_create(table_id=0)
        Note over CatMgr: Create system catalog schema
        Table->>Store: Initialize table 0
        Store->>Pager: batch_put(catalog_schema)
    end
    
    CatMgr-->>Runtime: CatalogManager (initialized)
    Runtime-->>Main: RuntimeContext (ready)

Bootstrap Process

When LLKV initializes, the system catalog must bootstrap itself before any user operations can proceed.

Initialization Sequence

Bootstrap Steps:

  1. Pager Initialization : The storage backend is opened (in-memory or persistent)
  2. Catalog Discovery : The ColumnStore attempts to load the catalog from the pager root key
  3. Schema Creation : If no catalog exists, table 0 is created with the predefined schema
  4. Ready State : The runtime can now service DDL and DML operations

Sources: llkv-runtime/README.md:26-31 llkv-storage/README.md:12-16

graph TB
    subgraph "SQL Query Processing"
        ParsedSQL["Parsed SQL AST"]
SelectPlan["SelectPlan<String>"]
ResolvedPlan["SelectPlan<FieldId>"]
end
    
    subgraph "RuntimeContext"
        CatalogLookup["Catalog Lookup"]
FieldResolution["Field Name → FieldId\nResolution"]
SchemaValidation["Schema Validation"]
end
    
    subgraph "System Catalog"
        SysCatalog["SysCatalog"]
TableMetaCache["In-Memory Metadata Cache"]
end
    
 
   ParsedSQL --> SelectPlan
 
   SelectPlan --> CatalogLookup
    
 
   CatalogLookup --> SysCatalog
 
   SysCatalog --> TableMetaCache
    
 
   TableMetaCache --> FieldResolution
 
   FieldResolution --> SchemaValidation
 
   SchemaValidation --> ResolvedPlan

Integration with Runtime

The RuntimeContext uses the system catalog for all schema-dependent operations:

Schema Resolution Flow

Usage Examples

| Operation | Catalog Interaction |
|---|---|
| SELECT | Resolve table names → table IDs, resolve column names → field IDs |
| INSERT | Validate schema compatibility, check for required columns |
| JOIN | Resolve schemas for both tables, validate join key compatibility |
| CREATE INDEX | (Future) Persist index metadata as new catalog record type |
| ALTER TABLE | Update existing metadata records with new schema definitions |

Sources: llkv-runtime/README.md:36-40 llkv-expr/README.md:50-54

Dual-Context Catalog Access

During explicit transactions, the runtime maintains two catalog views:

Catalog Visibility Rules

  1. Persistent Context : Sees only metadata committed before the transaction's snapshot
  2. Staging Context : Sees tables created within the current transaction
  3. On Commit : Staged metadata is replayed into the persistent context
  4. On Rollback : Staged metadata is discarded

This dual-view approach ensures that:

  • DDL operations remain transactional
  • Uncommitted schema changes don't leak to other sessions
  • Catalog queries are snapshot-isolated like DML operations

Sources: llkv-runtime/README.md:26-31 llkv-table/README.md:32-34

Metadata Caching

The CatalogManager maintains an in-memory cache of frequently accessed metadata to avoid repeated scans of table 0:

| Cache Structure | Purpose | Invalidation Strategy |
|---|---|---|
| Table Name → ID Map | Fast table resolution during planning | Invalidated on CREATE/DROP TABLE |
| Table ID → Schema Map | Quick schema validation during INSERT | Invalidated on ALTER TABLE |
| Column Name → FieldId Map | Field resolution for expressions | Rebuilt on schema changes |

The cache is session-local and does not require cross-session synchronization in the current single-process model.

Sources: Inferred from llkv-runtime/README.md:12-17

Summary

The LLKV system catalog demonstrates the principle of treating metadata as data by storing all table and column definitions in table 0 using the same Arrow-based storage infrastructure that handles user tables. This design:

  • Simplifies Architecture : Eliminates the need for separate metadata storage systems
  • Ensures Consistency : Metadata mutations use MVCC transactions like all other data
  • Enables Crash Recovery : The pager's atomicity guarantees extend to schema changes
  • Supports Transactional DDL : Schema modifications can be rolled back or committed atomically

The SysCatalog interface abstracts the underlying Arrow storage, providing a type-safe API for the runtime to query and modify metadata. The bootstrap process ensures the system catalog exists before any user operations proceed, and the dual-context model enables proper transaction isolation for DDL operations.

Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:36-40 llkv-column-map/README.md:10-16 Diagram 4 from high-level architecture



Custom Types and Type Registry

Relevant source files

This page documents LLKV's type system, including SQL type mapping, custom type representations, and type inference mechanisms. The system uses Apache Arrow's DataType as its canonical type representation, with custom types like Decimal128, Date32, and Interval mapped to Arrow-compatible formats.

For information about expression evaluation and scalar operations, see Scalar Evaluation and NumericKernels. For aggregate function type handling, see Aggregation System.


Type System Architecture

LLKV's type system operates in three layers: SQL types (user-facing), intermediate literal types (planning), and Arrow DataTypes (execution). All data flowing through the system ultimately uses Arrow's columnar format.

Type Flow Architecture

graph TB
    subgraph "SQL Layer"
        SQLTYPE["SQL Types\nINT, TEXT, DECIMAL, DATE"]
end
    
    subgraph "Planning Layer"
        SQLVALUE["SqlValue\nInteger, Float, Decimal,\nString, Date32, Interval, Struct"]
LITERAL["Literal\nType-erased values"]
PLANVALUE["PlanValue\nPlan-time literals"]
end
    
    subgraph "Execution Layer"
        DATATYPE["Arrow DataType\nInt64, Float64, Utf8,\nDecimal128, Date32, Interval"]
SCHEMA["ExecutorSchema\nColumn metadata + types"]
INFERENCE["Type Inference\ninfer_computed_data_type"]
end
    
    subgraph "Storage Layer"
        RECORDBATCH["RecordBatch\nTyped columnar data"]
ARRAYS["Typed Arrays\nInt64Array, StringArray, etc."]
end
    
 
   SQLTYPE --> SQLVALUE
 
   SQLVALUE --> PLANVALUE
 
   SQLVALUE --> LITERAL
 
   PLANVALUE --> DATATYPE
 
   LITERAL --> DATATYPE
 
   DATATYPE --> SCHEMA
 
   SCHEMA --> INFERENCE
 
   INFERENCE --> DATATYPE
 
   DATATYPE --> RECORDBATCH
 
   RECORDBATCH --> ARRAYS
    
    style DATATYPE fill:#f9f9f9

Sources: llkv-sql/src/sql_value.rs:16-27 llkv-sql/src/lib.rs:22-29 llkv-executor/src/translation/schema.rs:53-123


SQL to Arrow Type Mapping

SQL types are mapped to Arrow DataTypes during parsing and planning. The mapping is defined implicitly through the parsing logic in SqlValue and the type inference system.

| SQL Type | Arrow DataType | Notes |
|---|---|---|
| INT, INTEGER, BIGINT | Int64 | All integer types normalized to Int64 |
| FLOAT, DOUBLE, REAL | Float64 | All floating-point types normalized to Float64 |
| DECIMAL(p,s) | Decimal128(p,s) | Fixed-point decimal with precision and scale |
| TEXT, VARCHAR | Utf8 | Variable-length UTF-8 strings |
| DATE | Date32 | Days since Unix epoch |
| INTERVAL | Interval(MonthDayNano) | Calendar-aware interval type |
| BOOLEAN | Boolean | True/false values |
| Dictionary literals | Struct | Key-value maps represented as structs |

SQL to Arrow Type Conversion Flow

Sources: llkv-sql/src/sql_value.rs:178-214 llkv-sql/src/sql_value.rs:216-236 llkv-sql/src/lib.rs:22-29


Custom Type Representations

LLKV defines custom types for values that require special handling beyond basic Arrow types. These types bridge SQL semantics and Arrow's columnar format.

DecimalValue

Fixed-point decimal numbers with exact precision. Stored as i128 with a scale factor.

DecimalValue Representation

graph TB
    subgraph "DecimalValue Structure"
        DEC["DecimalValue\nraw_value: i128\nscale: i8"]
end
    
    subgraph "SQL Input"
        SQLDEC["SQL: 123.45"]
end
    
    subgraph "Internal Representation"
        RAW["raw_value = 12345\nscale = 2"]
CALC["Actual value = 12345 / 10^2 = 123.45"]
end
    
    subgraph "Arrow Storage"
        ARR["Decimal128Array\nprecision=5, scale=2"]
end
    
 
   SQLDEC --> DEC
 
   DEC --> RAW
 
   RAW --> CALC
 
   DEC --> ARR

Sources: llkv-sql/src/sql_value.rs:187-207 llkv-aggregate/src/lib.rs:314-324
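
A minimal sketch of this representation, using the raw_value/scale fields from the diagram; parsing and arithmetic in the real DecimalValue are more involved:

```rust
/// Sketch matching the representation above: an i128 raw value plus a scale.
#[derive(Debug, PartialEq)]
struct DecimalValue {
    raw_value: i128,
    scale: i8,
}

impl DecimalValue {
    /// Interpret the raw value at its scale, e.g. (12345, 2) -> 123.45.
    fn to_f64(&self) -> f64 {
        self.raw_value as f64 / 10f64.powi(self.scale as i32)
    }
}

fn main() {
    // SQL literal 123.45 stored exactly as raw_value = 12345, scale = 2.
    let d = DecimalValue { raw_value: 12_345, scale: 2 };
    assert!((d.to_f64() - 123.45).abs() < 1e-9);
}
```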

IntervalValue

Calendar-aware time intervals with month, day, and nanosecond components.

IntervalValue Operations

Sources: llkv-sql/src/sql_value.rs:238-283 llkv-expr/src/literal.rs

Date32

Days since Unix epoch (1970-01-01), stored as i32.

Date32 Type Handling

Sources: llkv-sql/src/sql_value.rs:76-87 llkv-sql/src/sql_value.rs:169-174

Struct Types

Dictionary literals in SQL are represented as struct types with named fields.

Struct Type Representation

Sources: llkv-sql/src/sql_value.rs:124-135 llkv-sql/src/sql_value.rs:227-234


Type Inference for Computed Expressions

The type inference system determines Arrow DataTypes for computed expressions at planning time. This enables schema generation before execution.

Type Inference Flow

graph TB
    subgraph "Expression Input"
        EXPR["ScalarExpr<FieldId>\ncol1 + col2 * 3"]
end
    
    subgraph "Type Inference"
        INFER["infer_computed_data_type"]
CHECK["expression_uses_float"]
NORM["normalized_numeric_type"]
end
    
    subgraph "Type Resolution"
        COL1["Column col1: Int64"]
COL2["Column col2: Float64"]
RESULT["Result: Float64\n(one operand is float)"]
end
    
    subgraph "Schema Output"
        FIELD["Field(alias, Float64, nullable=true)"]
end
    
 
   EXPR --> INFER
 
   INFER --> CHECK
 
   CHECK --> COL1
 
   CHECK --> COL2
 
   CHECK --> RESULT
 
   INFER --> NORM
 
   NORM --> RESULT
 
   RESULT --> FIELD

Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:149-243

Type Inference Rules

The inference system applies the following rules:

| Expression Type | Inferred Type | Logic |
|---|---|---|
| ScalarExpr::Literal(Integer) | Int64 | Direct mapping |
| ScalarExpr::Literal(Float) | Float64 | Direct mapping |
| ScalarExpr::Literal(Decimal(p,s)) | Decimal128(p,s) | Preserves precision/scale |
| ScalarExpr::Column(field_id) | Column's type | Lookup in schema |
| ScalarExpr::Binary{left, op, right} | Float64 if any operand is float, else Int64 | Type promotion |
| ScalarExpr::Compare{...} | Int64 | Boolean as integer (0/1) |
| ScalarExpr::Cast{data_type, ...} | data_type | Explicit cast target |
| ScalarExpr::Random | Float64 | Floating-point random values |

Sources: llkv-executor/src/translation/schema.rs:56-122
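
The promotion rule for binary expressions can be shown with a toy inference function over a reduced expression type; this is a sketch of the rule only, not the real infer_computed_data_type:

```rust
/// Toy subset of Arrow data types and scalar expressions, just to show the
/// promotion rule above (Float64 if any operand is float, else Int64).
#[derive(Clone, Copy, PartialEq, Debug)]
enum DataType { Int64, Float64 }

enum ScalarExpr {
    LiteralInt(i64),
    LiteralFloat(f64),
    Column(usize), // index into the schema slice below
    Binary(Box<ScalarExpr>, Box<ScalarExpr>),
}

fn infer_type(expr: &ScalarExpr, schema: &[DataType]) -> DataType {
    match expr {
        ScalarExpr::LiteralInt(_) => DataType::Int64,
        ScalarExpr::LiteralFloat(_) => DataType::Float64,
        ScalarExpr::Column(i) => schema[*i],
        ScalarExpr::Binary(l, r) => {
            // Float64 wins if either side is floating point.
            if infer_type(l, schema) == DataType::Float64 || infer_type(r, schema) == DataType::Float64 {
                DataType::Float64
            } else {
                DataType::Int64
            }
        }
    }
}

fn main() {
    // col0: Int64, col1: Float64 -- "col0 + col1 * 3" promotes to Float64.
    let schema = [DataType::Int64, DataType::Float64];
    let expr = ScalarExpr::Binary(
        Box::new(ScalarExpr::Column(0)),
        Box::new(ScalarExpr::Binary(
            Box::new(ScalarExpr::Column(1)),
            Box::new(ScalarExpr::LiteralInt(3)),
        )),
    );
    assert_eq!(infer_type(&expr, &schema), DataType::Float64);
}
```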

graph TB
    subgraph "Input Types"
        INT8["Int8/Int16/Int32/Int64"]
UINT["UInt8/UInt16/UInt32/UInt64"]
FLOAT["Float32/Float64"]
DEC["Decimal128(p,s)"]
BOOL["Boolean"]
end
    
    subgraph "Normalization"
        NORM["normalized_numeric_type"]
end
    
    subgraph "Output Types"
        OUT_INT["Int64"]
OUT_FLOAT["Float64"]
end
    
 
   INT8 --> NORM
 
   BOOL --> NORM
 
   NORM --> OUT_INT
    
 
   UINT --> NORM
 
   FLOAT --> NORM
 
   NORM --> OUT_FLOAT
    
 
   DEC --> NORM
 
   NORM --> |scale=0 && fits in i64| OUT_INT
 
   NORM --> |otherwise| OUT_FLOAT

Numeric Type Normalization

All numeric types are normalized to either Int64 or Float64 for arithmetic operations:

Numeric Type Normalization

Sources: llkv-executor/src/translation/schema.rs:125-147


Type Resolution During Expression Translation

Expression translation converts string-based column references to typed FieldId references, resolving types through the schema.

Expression Translation and Type Resolution

graph TB
    subgraph "String-based Expression"
        EXPRSTR["Expr<String>\nColumn('age') > Literal(18)"]
end
    
    subgraph "Translation"
        TRANS["translate_predicate"]
SCALAR["translate_scalar"]
RESOLVE["resolve_field_id"]
end
    
    subgraph "Schema Lookup"
        SCHEMA["ExecutorSchema"]
LOOKUP["schema.resolve('age')"]
COLUMN["ExecutorColumn\nname='age'\nfield_id=5\ndata_type=Int64"]
end
    
    subgraph "FieldId-based Expression"
        EXPRFID["Expr<FieldId>\nColumn(5) > Literal(18)"]
end
    
 
   EXPRSTR --> TRANS
 
   TRANS --> SCALAR
 
   SCALAR --> RESOLVE
 
   RESOLVE --> LOOKUP
 
   LOOKUP --> COLUMN
 
   COLUMN --> EXPRFID

Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:390-407

Type Preservation During Translation

The translation process preserves type information from the original expression:

| Expression Component | Type Preservation |
|---|---|
| Column(name) | Replaced with Column(field_id), type from schema |
| Literal(value) | Clone literal, type embedded in Literal enum |
| Binary{left, op, right} | Recursively translate operands, type inferred later |
| Cast{expr, data_type} | Preserve data_type during translation |
| Aggregate(call) | Translate inner expression, aggregate type determined by function |

Sources: llkv-executor/src/translation/expression.rs:176-387


graph TB
    subgraph "Aggregate Specification"
        SPEC["AggregateKind::Sum\nfield_id=5\ndata_type=Int64\ndistinct=false"]
end
    
    subgraph "Accumulator Creation"
        CREATE["new_with_projection_index"]
MATCH["Match on (data_type, distinct)"]
end
    
    subgraph "Type-Specific Accumulators"
        INT64["SumInt64\nvalue: Option<i64>\nhas_values: bool"]
FLOAT64["SumFloat64\nvalue: f64\nsaw_value: bool"]
DEC128["SumDecimal128\nsum: i128\nprecision: u8\nscale: i8"]
end
    
    subgraph "Update Logic"
        UPDATE_INT["Checked addition\nError on overflow"]
UPDATE_FLOAT["Floating addition\nNo overflow check"]
UPDATE_DEC["Checked i128 addition\nError on overflow"]
end
    
 
   SPEC --> CREATE
 
   CREATE --> MATCH
 
   MATCH --> |Int64, false| INT64
 
   MATCH --> |Float64, false| FLOAT64
 
   MATCH --> |Decimal128 p,s , false| DEC128
    
 
   INT64 --> UPDATE_INT
 
   FLOAT64 --> UPDATE_FLOAT
 
   DEC128 --> UPDATE_DEC

Type Handling in Aggregates

Aggregate functions have type-specific accumulator implementations. The type determines overflow behavior, precision, and result format.

Aggregate Type-Specific Accumulators

Sources: llkv-aggregate/src/lib.rs:461-542 llkv-aggregate/src/lib.rs:799-859
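
A sketch of an Int64 SUM accumulator in this spirit, using checked addition so overflow surfaces as an error rather than wrapping; field names follow the diagram, but the real implementation differs in detail:

```rust
/// Sketch of an Int64 SUM accumulator: checked addition, NULL-aware.
struct SumInt64 {
    value: Option<i64>,
    has_values: bool,
}

impl SumInt64 {
    fn new() -> Self {
        Self { value: None, has_values: false }
    }

    /// NULLs are skipped; non-NULL values are added with overflow checking.
    fn update(&mut self, v: Option<i64>) -> Result<(), String> {
        if let Some(v) = v {
            let current = self.value.unwrap_or(0);
            self.value = Some(current.checked_add(v).ok_or("integer overflow in SUM")?);
            self.has_values = true;
        }
        Ok(())
    }

    /// SQL semantics: SUM over zero non-NULL rows yields NULL.
    fn finish(&self) -> Option<i64> {
        if self.has_values { self.value } else { None }
    }
}

fn main() -> Result<(), String> {
    let mut acc = SumInt64::new();
    for v in [Some(1), None, Some(41)] {
        acc.update(v)?;
    }
    assert_eq!(acc.finish(), Some(42));
    Ok(())
}
```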

Aggregate Type Matrix

Different aggregates support different type combinations:

AggregateInt64Float64Decimal128Utf8BooleanDate32
COUNT(*)N/AN/AN/AN/AN/AN/A
COUNT(col)
SUM✓ (coerce)--
AVG✓ (coerce)--
MIN/MAX✓ (coerce)--
TOTAL✓ (coerce)--
GROUP_CONCAT--

Notes:

  • ✓ = Native support with type-specific accumulator
  • ✓ (coerce) = Support via SQLite-style numeric coercion
  • - = Not supported

Sources: llkv-aggregate/src/lib.rs:22-68 llkv-aggregate/src/lib.rs:385-447

graph LR
    subgraph "DistinctKey Variants"
        INT["Int(i64)"]
FLOAT["Float(u64)\nf64::to_bits()"]
STR["Str(String)"]
BOOL["Bool(bool)"]
DATE["Date(i32)"]
DEC["Decimal(i128)"]
end
    
    subgraph "Accumulator"
        SEEN["FxHashSet<DistinctKey>"]
INSERT["seen.insert(key)"]
CHECK["Returns true if new"]
end
    
    subgraph "Aggregation"
        ADD["Add to sum only if new"]
COUNT["Count distinct values"]
end
    
 
   INT --> SEEN
 
   FLOAT --> SEEN
 
   STR --> SEEN
 
   BOOL --> SEEN
 
   DATE --> SEEN
 
   DEC --> SEEN
    
 
   SEEN --> INSERT
 
   INSERT --> CHECK
 
   CHECK --> ADD
 
   CHECK --> COUNT

Distinct Value Tracking

For DISTINCT aggregates, the system tracks seen values using type-specific keys:

Distinct Value Tracking

Sources: llkv-aggregate/src/lib.rs:249-333 llkv-aggregate/src/lib.rs:825-858
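
A minimal sketch of the idea, using std's HashSet in place of FxHashSet; the key point is that f64 values are keyed by their bit pattern because f64 itself is neither Eq nor Hash:

```rust
use std::collections::HashSet;

/// Sketch of the distinct-tracking keys from the diagram above.
#[derive(PartialEq, Eq, Hash)]
enum DistinctKey {
    Int(i64),
    Float(u64), // f64::to_bits()
    Str(String),
    Bool(bool),
    Date(i32),
    Decimal(i128),
}

fn main() {
    let mut seen: HashSet<DistinctKey> = HashSet::new();
    // Only the first occurrence of each value contributes to a DISTINCT aggregate.
    assert!(seen.insert(DistinctKey::Float(1.5f64.to_bits())));
    assert!(!seen.insert(DistinctKey::Float(1.5f64.to_bits())));
    assert!(seen.insert(DistinctKey::Int(7)));
}
```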


Type Coercion and Casting

The system supports both implicit coercion (for numeric operations) and explicit casting (via CAST expressions).

graph TB
    subgraph "Input Values"
        STR["String '123.45'"]
BOOL["Boolean true"]
NULL["NULL"]
end
    
    subgraph "Coercion Function"
        COERCE["array_value_to_numeric"]
PARSE["Parse as f64"]
FALLBACK["Use 0.0 if parse fails"]
end
    
    subgraph "Coerced Values"
        NUM1["123.45"]
NUM2["1.0"]
NUM3["0.0 (NULL skipped)"]
end
    
 
   STR --> COERCE
 
   BOOL --> COERCE
 
   NULL --> COERCE
    
 
   COERCE --> PARSE
 
   PARSE --> |Success| NUM1
 
   PARSE --> |Failure| FALLBACK
 
   FALLBACK --> NUM1
    
 
   COERCE --> |Boolean: 1.0/0.0| NUM2
 
   COERCE --> |NULL: skip row| NUM3

Numeric Coercion in Aggregates

String and boolean values are coerced to numeric types in aggregate functions following SQLite semantics:

Numeric Coercion in Aggregates

Sources: llkv-aggregate/src/lib.rs:385-447 llkv-aggregate/src/lib.rs:860-877
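
A sketch of the coercion rules over a toy value type; the real helper (array_value_to_numeric) operates on Arrow arrays, but it follows the rules described above:

```rust
/// Toy value type standing in for an Arrow array element.
enum Value {
    Str(String),
    Bool(bool),
    Float(f64),
    Null,
}

/// SQLite-style coercion: strings parse as f64 (0.0 on failure),
/// booleans become 1.0/0.0, NULLs are skipped by returning None.
fn coerce_to_numeric(v: &Value) -> Option<f64> {
    match v {
        Value::Str(s) => Some(s.parse::<f64>().unwrap_or(0.0)),
        Value::Bool(b) => Some(if *b { 1.0 } else { 0.0 }),
        Value::Float(f) => Some(*f),
        Value::Null => None,
    }
}

fn main() {
    assert_eq!(coerce_to_numeric(&Value::Str("123.45".into())), Some(123.45));
    assert_eq!(coerce_to_numeric(&Value::Str("abc".into())), Some(0.0));
    assert_eq!(coerce_to_numeric(&Value::Bool(true)), Some(1.0));
    assert_eq!(coerce_to_numeric(&Value::Null), None);
}
```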

Explicit Type Casting

The CAST expression provides explicit type conversion: the cast's target data_type is used directly as the inferred result type, as shown in the inference rules above.

Explicit Type Casting

Sources: llkv-executor/src/translation/schema.rs:95 llkv-expr/src/expr.rs:114-118


Type System Integration Points

The type system integrates with multiple layers of the architecture:

| Layer | Integration Point | Purpose |
|---|---|---|
| SQL Parsing | SqlValue::try_from_expr | Parse SQL literals into typed values |
| Planning | PlanValue conversion | Convert literals to plan representation |
| Schema Inference | infer_computed_data_type | Determine result types for expressions |
| Expression Translation | translate_scalar | Resolve column types from schema |
| Program Compilation | OwnedOperator | Store typed operators in bytecode |
| Execution | RecordBatch schema | Validate types match expected schema |
| Aggregation | Accumulator creation | Create type-specific aggregators |
| Storage | Arrow serialization | Persist typed data in columnar format |

Sources: llkv-sql/src/sql_value.rs:30-122 llkv-executor/src/translation/schema.rs:15-51 llkv-table/src/planner/program.rs:69-101


Summary

LLKV's type system is built on Apache Arrow's DataType as the canonical type representation, with custom types for SQL-specific semantics:

  • SQL types are mapped to Arrow types during parsing through SqlValue
  • Custom types (Decimal, Interval, Date32, Struct) provide SQL-compatible semantics
  • Type inference determines result types for computed expressions at planning time
  • Type resolution converts string column references to typed FieldId references
  • Aggregate functions use type-specific accumulators with appropriate overflow handling
  • Type coercion follows SQLite semantics for numeric operations

The type system operates transparently across all layers, ensuring type safety from SQL parsing through storage while maintaining compatibility with Arrow's columnar format.

Sources: llkv-sql/src/lib.rs:1-51 llkv-executor/src/translation/schema.rs:1-271 llkv-aggregate/src/lib.rs:1-83