This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Relevant source files
- Cargo.lock
- Cargo.toml
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
This document introduces the LLKV database system, its architectural principles, and the relationships between its constituent crates. It provides a high-level map of how SQL queries flow through the system from parsing to storage, and explains the role of Apache Arrow as the universal data interchange format.
For details on individual subsystems, see:
- Workspace organization and crate dependencies: Workspace and Crates
- SQL query processing pipeline: SQL Query Processing Pipeline
- Arrow integration details: Data Formats and Arrow Integration
What is LLKV
LLKV is an experimental SQL database implemented as a Rust workspace of 15 crates. It layers SQL processing, streaming query execution, and MVCC transaction management on top of pluggable key-value storage backends. The system uses Apache Arrow RecordBatch as its primary data representation at every layer, enabling zero-copy operations and SIMD-friendly columnar processing.
The architecture separates concerns into six distinct layers:
- SQL Interface
- Query Planning
- Runtime and Orchestration
- Query Execution
- Table and Metadata Management
- Storage and I/O
Each layer communicates through well-defined interfaces centered on Arrow data structures.
Sources: README.md:1-107 Cargo.toml:1-89
Core Design Principles
LLKV's design reflects several intentional trade-offs:
| Principle | Implementation | Rationale |
|---|---|---|
| Arrow-Native | RecordBatch is the universal data format across all layers | Enables zero-copy operations, SIMD vectorization, and interoperability with the Arrow ecosystem |
| Synchronous Execution | Work-stealing via Rayon instead of async runtime | Reduces scheduler overhead for individual queries while remaining embeddable in Tokio contexts |
| Layered Modularity | 15 independent crates with clear boundaries | Allows independent evolution and testing of subsystems |
| MVCC Throughout | System metadata columns (row_id, created_by, deleted_by) injected at storage layer | Provides snapshot isolation without write locks |
| Storage Abstraction | Pager trait with multiple implementations | Supports both in-memory and persistent backends with zero-copy reads |
| Compiled Predicates | Expressions compile to stack-based bytecode | Enables efficient vectorized evaluation without interpretation overhead |
Sources: README.md:36-42 llkv-storage/README.md:12-22 llkv-expr/README.md:66-72
Workspace Structure
The LLKV workspace consists of 15 crates organized by layer:
| Layer | Crate | Primary Responsibility |
|---|---|---|
| SQL Interface | llkv-sql | SQL parsing, dialect normalization, INSERT buffering |
| Query Planning | llkv-plan | Typed query plan structures (SelectPlan, InsertPlan, etc.) |
| Query Planning | llkv-expr | Expression AST (Expr, ScalarExpr) |
| Runtime | llkv-runtime | Session management, MVCC orchestration, plan execution |
| Runtime | llkv-transaction | Transaction ID allocation, snapshot management |
| Execution | llkv-executor | Streaming query evaluation |
| Execution | llkv-aggregate | Aggregate function implementation (SUM, COUNT, AVG, etc.) |
| Execution | llkv-join | Join algorithms (hash join with specialized fast paths) |
| Table/Metadata | llkv-table | Schema-aware table abstraction, system catalog |
| Table/Metadata | llkv-column-map | Column-oriented storage, logical-to-physical key mapping |
| Storage | llkv-storage | Pager trait, MemPager, SimdRDrivePager |
| Utilities | llkv-csv | CSV ingestion helper |
| Utilities | llkv-result | Result type definitions |
| Utilities | llkv-test-utils | Testing utilities |
| Utilities | llkv-slt-tester | SQL Logic Test harness |
Sources: Cargo.toml:9-26 Cargo.toml:67-87 README.md:44-53
Component Architecture and Data Flow
The following diagram shows the major components and how Arrow RecordBatch flows through the system:
Sources: README.md:44-72 Cargo.toml:67-87
graph TB
User["User / Application"]
subgraph "llkv-sql Crate"
SqlEngine["SqlEngine"]
Preprocessor["SQL Preprocessor"]
Parser["sqlparser"]
InsertBuffer["InsertBuffer"]
end
subgraph "llkv-plan Crate"
SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
CreateTablePlan["CreateTablePlan"]
OtherPlans["Other Plan Types"]
end
subgraph "llkv-expr Crate"
Expr["Expr<F>"]
ScalarExpr["ScalarExpr<F>"]
end
subgraph "llkv-runtime Crate"
RuntimeContext["RuntimeContext"]
SessionHandle["SessionHandle"]
TxnSnapshot["TransactionSnapshot"]
end
subgraph "llkv-executor Crate"
TableExecutor["TableExecutor"]
StreamingOps["Streaming Operators"]
end
subgraph "llkv-table Crate"
Table["Table"]
SysCatalog["SysCatalog (Table 0)"]
FieldId["FieldId Resolution"]
end
subgraph "llkv-column-map Crate"
ColumnStore["ColumnStore"]
LogicalFieldId["LogicalFieldId"]
PhysicalKey["PhysicalKey Mapping"]
end
subgraph "llkv-storage Crate"
Pager["Pager Trait"]
MemPager["MemPager"]
SimdPager["SimdRDrivePager"]
end
ArrowBatch["Arrow RecordBatch\n(Universal Format)"]
User -->|SQL String| SqlEngine
SqlEngine --> Preprocessor
Preprocessor --> Parser
Parser -->|AST| SelectPlan
Parser -->|AST| InsertPlan
Parser -->|AST| CreateTablePlan
SelectPlan --> Expr
InsertPlan --> ScalarExpr
SelectPlan --> RuntimeContext
InsertPlan --> RuntimeContext
CreateTablePlan --> RuntimeContext
RuntimeContext --> SessionHandle
RuntimeContext --> TxnSnapshot
RuntimeContext --> TableExecutor
RuntimeContext --> Table
TableExecutor --> StreamingOps
StreamingOps --> Table
Table --> SysCatalog
Table --> FieldId
Table --> ColumnStore
ColumnStore --> LogicalFieldId
ColumnStore --> PhysicalKey
ColumnStore --> Pager
Pager --> MemPager
Pager --> SimdPager
Table -.->|Produces/Consumes| ArrowBatch
StreamingOps -.->|Produces/Consumes| ArrowBatch
ColumnStore -.->|Serializes/Deserializes| ArrowBatch
SqlEngine -.->|Returns| ArrowBatch
End-to-End Query Execution
This diagram traces a SELECT query from SQL text to results, showing the concrete code entities involved:
Sources: README.md:56-63 llkv-sql/README.md:1-107 llkv-runtime/README.md:33-41 llkv-table/README.md:10-25
sequenceDiagram
participant App as "Application"
participant SqlEngine as "SqlEngine::execute()"
participant Preprocessor as "preprocess_sql()"
participant Parser as "sqlparser::Parser"
participant Planner as "build_select_plan()"
participant Runtime as "RuntimeContext::execute_plan()"
participant Executor as "TableExecutor::execute()"
participant Table as "Table::scan_stream()"
participant ColStore as "ColumnStore::gather_columns()"
participant Pager as "Pager::batch_get()"
App->>SqlEngine: SELECT * FROM users WHERE age > 18
Note over SqlEngine,Preprocessor: Dialect normalization
SqlEngine->>Preprocessor: Normalize SQLite/DuckDB syntax
SqlEngine->>Parser: Parse normalized SQL
Parser-->>SqlEngine: Statement AST
SqlEngine->>Planner: Translate AST to SelectPlan
Note over Planner: Build SelectPlan with\nExpr<String> predicates
Planner-->>SqlEngine: SelectPlan
SqlEngine->>Runtime: execute_plan(SelectPlan)
Note over Runtime: Acquire TransactionSnapshot\nResolve field names to FieldId
Runtime->>Executor: execute(SelectPlan, context)
Note over Executor: Compile Expr<FieldId>\ninto EvalProgram
Executor->>Table: scan_stream(fields, predicate)
Note over Table: Apply MVCC filtering\nPush down predicates
Table->>ColStore: gather_columns(LogicalFieldId[])
Note over ColStore: Map LogicalFieldId\nto PhysicalKey
ColStore->>Pager: batch_get(PhysicalKey[])
Pager-->>ColStore: EntryHandle[] (zero-copy)
Note over ColStore: Deserialize Arrow buffers\nApply row_id filtering
ColStore-->>Table: RecordBatch
Table-->>Executor: RecordBatch
Note over Executor: Apply projections\nEvaluate expressions
Executor-->>Runtime: RecordBatch stream
Runtime-->>SqlEngine: Vec<RecordBatch>
SqlEngine-->>App: Query results
Key Features
MVCC Transaction Management
LLKV implements multi-version concurrency control with snapshot isolation:
- Every table includes three system columns: `row_id` (monotonic), `created_by` (transaction ID), and `deleted_by` (transaction ID or NULL)
- `TxnIdManager` in `llkv-transaction` allocates monotonic transaction IDs and tracks commit watermarks
- `TransactionSnapshot` captures a consistent view of the database at transaction start
- Auto-commit statements use `TXN_ID_AUTO_COMMIT = 1`
- Explicit transactions maintain both persistent and staging contexts for isolation
Sources: README.md:64-72 llkv-runtime/README.md:20-32 llkv-table/README.md:32-35
Zero-Copy Storage Pipeline
The storage layer supports zero-copy reads when backed by SimdRDrivePager:
- `ColumnStore` maps `LogicalFieldId` to `PhysicalKey`
- `Pager::batch_get()` returns `EntryHandle` wrappers around memory-mapped regions
- Arrow arrays are deserialized directly from the mapped memory without intermediate copies
- SIMD-aligned buffers enable vectorized predicate evaluation
Sources: llkv-column-map/README.md:19-41 llkv-storage/README.md:12-28 README.md:12-13
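The contract behind this pipeline can be sketched as follows. This is an illustrative simplification, not the actual llkv-storage trait; the real method signatures, error handling, and handle type differ.

```rust
// Illustrative sketch only -- not the real llkv-storage API.
// Physical keys address serialized Arrow chunks; handles expose raw bytes
// that Arrow buffers can reference without copying.
type PhysicalKey = u64;

trait Pager {
    /// Handle over stored bytes (e.g. a memory-mapped region).
    type Handle: AsRef<[u8]>;

    /// Fetch many chunks in one call; missing keys yield `None`.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Self::Handle>>;

    /// Atomically persist a batch of (key, bytes) pairs.
    fn batch_put(&mut self, entries: &[(PhysicalKey, Vec<u8>)]);
}
```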
Compiled Expression Evaluation
Predicates and scalar expressions compile to stack-based bytecode:
- `Expr<FieldId>` structures in `llkv-expr` represent logical predicates
- `ProgramCompiler` in `llkv-table` translates expressions into `EvalProgram` bytecode
- `DomainProgram` tracks which row IDs satisfy predicates
- Bytecode evaluation uses stack-based execution for efficient vectorized operations
Sources: llkv-expr/README.md:1-88 llkv-table/README.md:10-18 README.md:46-53
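As a generic illustration of the stack-based approach (not LLKV's actual EvalProgram or opcode set), a predicate such as age > 18 AND score > 50 can compile to a flat opcode sequence that evaluates without re-walking an expression tree:

```rust
// Generic stack-machine sketch; LLKV's real EvalProgram, opcodes, and
// vectorized evaluation over Arrow arrays are more involved.
enum Op {
    PushField(usize), // push a row field by index
    PushConst(i64),   // push an integer constant
    Gt,               // pop b, pop a, push (a > b)
    And,              // pop b, pop a, push (a AND b)
}

fn eval(program: &[Op], row: &[i64]) -> bool {
    let mut stack: Vec<i64> = Vec::new();
    for op in program {
        match op {
            Op::PushField(i) => stack.push(row[*i]),
            Op::PushConst(v) => stack.push(*v),
            Op::Gt => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push((a > b) as i64);
            }
            Op::And => {
                let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
                stack.push((a != 0 && b != 0) as i64);
            }
        }
    }
    stack.pop().unwrap_or(0) != 0
}
```

This sketch touches one row at a time; a vectorized variant applies each opcode to whole Arrow arrays instead.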
SQL Logic Test Infrastructure
LLKV includes comprehensive SQL correctness testing:
- `llkv-slt-tester` wraps the `sqllogictest` framework
- `LlkvSltRunner` discovers `.slt` files and executes test suites
- Supports remote test fetching via `.slturl` pointer files
- The `LLKV_SLT_STATS=1` environment variable enables detailed query statistics
- CI runs the full suite on Linux, macOS, and Windows
Sources: README.md:75-77 llkv-slt-tester/README.md:1-57
Getting Started
The main entry point is the llkv crate, which re-exports the SQL interface:
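A minimal embedding sketch is shown below. The constructor name and error handling are assumptions made for illustration (the README's snippet is authoritative); `execute()`, `sql()`, and the pager names are those used elsewhere in this documentation.

```rust
use llkv::SqlEngine;

// Sketch only: SqlEngine::new_in_memory() and the error type are assumed,
// not verified against the actual llkv API.
fn demo() -> Result<(), Box<dyn std::error::Error>> {
    // In-memory engine backed by MemPager; use SimdRDrivePager for persistence.
    let engine = SqlEngine::new_in_memory();

    engine.execute("CREATE TABLE users (id INT, age INT)")?;
    engine.execute("INSERT INTO users VALUES (1, 42), (2, 7)")?;

    // The sql() convenience method returns Vec<RecordBatch>.
    let batches = engine.sql("SELECT id FROM users WHERE age > 18")?;
    println!("result batches: {}", batches.len());
    Ok(())
}
```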
For persistent storage, use SimdRDrivePager instead of MemPager. For transaction control beyond auto-commit, obtain a SessionHandle via SqlEngine::session().
Sources: README.md:14-33 demos/llkv-sql-pong-demo/src/main.rs:386-393
Relationship to Related Projects
LLKV shares architectural concepts with Apache DataFusion but differs in several key areas:
| Aspect | LLKV | DataFusion |
|---|---|---|
| Execution Model | Synchronous with Rayon work-stealing | Async with Tokio runtime |
| Storage Backend | Custom key-value via Pager trait | Parquet, CSV, object stores |
| SQL Parser | sqlparser crate (same) | sqlparser crate |
| Data Format | Arrow RecordBatch (same) | Arrow RecordBatch |
| Maturity | Alpha / Experimental | Production-ready |
| Transaction Support | MVCC snapshot isolation | Read-only (no writes) |
LLKV deliberately avoids the DataFusion task scheduler to explore trade-offs in a synchronous execution model, while maintaining compatibility with the same SQL parser and Arrow memory layout.
Sources: README.md:36-42 README.md:8-13
Architecture
Relevant source files
- Cargo.lock
- Cargo.toml
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
This document describes the overall system architecture of LLKV, explaining its layered design, core abstractions, and how components interact to provide SQL functionality over key-value storage. For details on individual crates and their dependencies, see Workspace and Crates. For the end-to-end query execution flow, see SQL Query Processing Pipeline. For Arrow integration specifics, see Data Formats and Arrow Integration.
Layered Design
LLKV is organized into six architectural layers, each with focused responsibilities. Higher layers depend only on lower layers, and all layers communicate through Apache Arrow RecordBatch structures as the universal data interchange format.
Sources: Cargo.toml:1-89 README.md:44-53 llkv-sql/README.md:1-10 llkv-runtime/README.md:1-10 llkv-executor/README.md:1-10 llkv-table/README.md:1-18 llkv-storage/README.md:1-17
graph TB
subgraph L1["Layer 1: User Interface"]
SQL["SQL Queries"]
REPL["CLI REPL"]
DEMO["Demo Applications"]
BENCH["TPC-H Benchmarks"]
end
subgraph L2["Layer 2: SQL Processing"]
SQLENG["SqlEngine\nllkv-sql"]
PLAN["Query Plans\nllkv-plan"]
EXPR["Expression AST\nllkv-expr"]
end
subgraph L3["Layer 3: Runtime & Orchestration"]
RUNTIME["RuntimeContext\nllkv-runtime"]
TXNMGR["TxnIdManager\nllkv-transaction"]
CATALOG["CatalogManager\nllkv-runtime"]
end
subgraph L4["Layer 4: Query Execution"]
EXECUTOR["TableExecutor\nllkv-executor"]
AGG["Accumulators\nllkv-aggregate"]
JOIN["HashJoinExecutor\nllkv-join"]
end
subgraph L5["Layer 5: Data Management"]
TABLE["Table\nllkv-table"]
COLMAP["ColumnStore\nllkv-column-map"]
SYSCAT["SysCatalog\nllkv-table"]
end
subgraph L6["Layer 6: Storage"]
PAGER["Pager trait\nllkv-storage"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
SQL --> SQLENG
REPL --> SQLENG
DEMO --> SQLENG
BENCH --> SQLENG
SQLENG --> PLAN
SQLENG --> EXPR
PLAN --> RUNTIME
RUNTIME --> TXNMGR
RUNTIME --> CATALOG
RUNTIME --> EXECUTOR
RUNTIME --> TABLE
EXECUTOR --> AGG
EXECUTOR --> JOIN
EXECUTOR --> TABLE
TABLE --> COLMAP
TABLE --> SYSCAT
COLMAP --> PAGER
SYSCAT --> COLMAP
PAGER --> MEMPAGER
PAGER --> SIMDPAGER
Core Architectural Principles
Arrow-Native Data Flow
All data flowing between components is represented as Apache Arrow RecordBatch structures. This enables:
- Zero-copy operations : Arrow buffers can be passed between layers without serialization
- SIMD-friendly processing : Columnar layout supports vectorized operations
- Consistent memory model : All layers use the same in-memory representation
The RecordBatch abstraction appears at every boundary: SQL parsing produces plans that operate on batches, the executor streams batches, tables persist batches, and the column store chunks batches for storage.
Sources: README.md:10-12 README.md:22-23 llkv-table/README.md:10-11 llkv-column-map/README.md:10-14
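For reference, the RecordBatch currency that crosses these boundaries is built with the stock arrow-rs API; nothing LLKV-specific is involved:

```rust
use std::sync::Arc;
use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build a two-column batch; every LLKV layer exchanges data in this shape.
fn example_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let ids = Int32Array::from(vec![1, 2, 3]);
    let names = StringArray::from(vec!["a", "b", "c"]);
    RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(names)])
}
```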
Storage Abstraction Through Pager Trait
The Pager trait in llkv-storage provides a pluggable storage backend interface:
| Pager Type | Use Case | Key Properties |
|---|---|---|
| MemPager | Tests, temporary namespaces, staging contexts | Heap-backed, fast |
| SimdRDrivePager | Persistent storage | Zero-copy reads, SIMD-aligned, memory-mapped |
Both implementations satisfy the same batch get/put contract, allowing higher layers to remain storage-agnostic. The runtime uses dual-pager contexts: persistent storage for committed tables and in-memory staging for uncommitted transaction objects.
Sources: llkv-storage/README.md:12-22 llkv-runtime/README.md:26-32
MVCC Integration
Multi-version concurrency control (MVCC) is implemented as system metadata columns injected at the table layer:
- `row_id`: Monotonic row identifier
- `created_by`: Transaction ID that created this row version
- `deleted_by`: Transaction ID that deleted this row (or `NULL` if active)
These columns are stored alongside user data in ColumnStore, enabling snapshot isolation without separate version chains. The TxnIdManager in llkv-transaction allocates monotonic transaction IDs and tracks commit watermarks. The runtime enforces visibility rules during scans by filtering based on snapshot transaction IDs.
Sources: llkv-table/README.md:13-17 llkv-runtime/README.md:19-25 llkv-column-map/README.md:27-28
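The visibility rule applied during scans amounts to the standard snapshot check sketched below; the actual runtime also consults commit watermarks and the transaction's own uncommitted writes, so treat this as a simplification:

```rust
type TxnId = u64;

/// Simplified snapshot-visibility check over the MVCC metadata columns.
/// A row version is visible when it was created by a transaction the snapshot
/// can see and has not been deleted from the snapshot's point of view.
fn row_visible(created_by: TxnId, deleted_by: Option<TxnId>, snapshot: TxnId) -> bool {
    created_by <= snapshot && deleted_by.map_or(true, |d| d > snapshot)
}
```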
Component Interaction Patterns
Query Execution Flow
Sources: README.md:56-62 llkv-sql/README.md:15-20 llkv-runtime/README.md:12-17 llkv-executor/README.md:12-17 llkv-table/README.md:19-25 llkv-column-map/README.md:24-28
Dual-Context Transaction Management
The runtime maintains two execution contexts during explicit transactions. The persistent context operates on committed tables directly, while the staging context buffers newly created tables in memory. On commit, staged operations are replayed into the persistent context after the TxnIdManager confirms no conflicts and advances the commit watermark. On rollback, the staging context is dropped and all uncommitted work is discarded.
Sources: llkv-runtime/README.md:26-32 llkv-runtime/README.md:12-17
Column Storage and Logical Field Mapping
The ColumnStore maintains a mapping from LogicalFieldId (namespace + table ID + field ID) to physical storage keys. Each logical field has a descriptor chunk (metadata about the column), data chunks (Arrow-serialized column arrays), and row ID chunks (per-chunk row identifiers for filtering). This three-level mapping isolates user data from system metadata while allowing efficient scans and appends.
Sources: llkv-column-map/README.md:18-23 llkv-table/README.md:13-17 llkv-column-map/README.md:10-17
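Conceptually the mapping looks like the sketch below; the field names and layout are illustrative, and the real LogicalFieldId in llkv-column-map encodes namespace, table, and field differently:

```rust
use std::collections::HashMap;

// Illustrative shapes only -- not the real llkv-column-map definitions.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct LogicalFieldId {
    namespace: u16, // separates user data from system metadata
    table_id: u32,
    field_id: u32,
}

type PhysicalKey = u64;

/// Physical layout of one logical column: descriptor, data chunks, row-id chunks.
struct ColumnKeys {
    descriptor: PhysicalKey,
    data_chunks: Vec<PhysicalKey>,
    row_id_chunks: Vec<PhysicalKey>,
}

/// The store keeps a catalog from each logical field to its physical keys.
type ColumnCatalog = HashMap<LogicalFieldId, ColumnKeys>;
```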
Key Abstractions
SqlEngine
Entry point for SQL execution. Located in llkv-sql, it:
- Preprocesses SQL for dialect compatibility (DuckDB, SQLite quirks)
- Parses with the `sqlparser` crate
- Batches compatible `INSERT` statements
- Delegates execution to `RuntimeContext`
- Returns `ExecutionResult` enums
Sources: llkv-sql/README.md:1-20 README.md:56-59
RuntimeContext
Orchestration layer in llkv-runtime that:
- Executes all statement types (DDL, DML, queries)
- Manages transaction snapshots and MVCC injection
- Coordinates between table layer and executor
- Maintains catalog manager for schema metadata
- Implements dual-context staging for transactions
Sources: llkv-runtime/README.md:12-25 llkv-runtime/README.md:34-40
Table and ColumnStore
Table in llkv-table provides schema-aware APIs:
- Schema validation on `CREATE TABLE` and append
- MVCC column injection (`row_id`, `created_by`, `deleted_by`)
- Streaming scan API with predicate pushdown
- Integration with the system catalog (table 0)
ColumnStore in llkv-column-map handles physical storage:
- Arrow-serialized column chunks
- Logical-to-physical key mapping
- Append pipeline with row-id sorting and last-writer-wins semantics
- Atomic multi-key commits through pager
Sources: llkv-table/README.md:12-25 llkv-column-map/README.md:12-28
TableExecutor
Execution engine in llkv-executor that:
- Streams `RecordBatch` results from table scans
- Evaluates projections, filters, and scalar expressions
- Coordinates with `llkv-aggregate` for aggregation
- Coordinates with `llkv-join` for join operations
- Applies MVCC visibility filters during scans
Sources: llkv-executor/README.md:1-17 README.md:60-61
Pager Trait
Storage abstraction in llkv-storage that:
- Exposes batch `get`/`put` over `(PhysicalKey, EntryHandle)` pairs
- Supports atomic multi-key updates
- Enables zero-copy reads when backed by memory-mapped storage
- Implementations: `MemPager` (heap), `SimdRDrivePager` (persistent)
Sources: llkv-storage/README.md:12-22 README.md:11-12
Crate Organization
The workspace crates are organized by layer:
| Layer | Crates | Responsibilities |
|---|---|---|
| SQL Processing | llkv-sql, llkv-plan, llkv-expr | Parse SQL, build typed plans, represent expressions |
| Runtime | llkv-runtime, llkv-transaction | Orchestrate execution, manage MVCC and sessions |
| Execution | llkv-executor, llkv-aggregate, llkv-join | Stream results, compute aggregates, evaluate joins |
| Data Management | llkv-table, llkv-column-map | Schema-aware tables, columnar storage |
| Storage | llkv-storage | Pager trait and implementations |
| Supporting | llkv-result, llkv-csv, llkv-test-utils | Result types, CSV ingestion, test utilities |
| Testing | llkv-slt-tester, llkv-tpch | SQL Logic Tests, TPC-H benchmarks |
| Entry Points | llkv | Main library and CLI |
For detailed dependency graphs and crate responsibilities, see Workspace and Crates.
Sources: Cargo.toml:67-87 README.md:44-53
Execution Model
Synchronous with Work-Stealing
LLKV defaults to synchronous execution using Rayon for parallelism:
- Query execution is synchronous, not async
- Rayon work-stealing parallelizes scans and projections
- Crossbeam channels coordinate between threads
- Embeds cleanly inside Tokio when needed (e.g., SLT test runner)
This design minimizes scheduler overhead for individual queries while maintaining high throughput for concurrent workloads.
Sources: README.md:38-41 llkv-column-map/README.md:32-34
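A small illustration of the model (generic Rayon usage, not LLKV code): per-batch work is fanned out across the Rayon pool from a plain synchronous call site, with no async runtime involved.

```rust
use arrow::record_batch::RecordBatch;
use rayon::prelude::*;

/// Apply a per-batch function in parallel and sum the results.
/// The caller stays synchronous; Rayon's work-stealing pool does the fan-out.
fn parallel_count(batches: &[RecordBatch], f: impl Fn(&RecordBatch) -> usize + Sync) -> usize {
    batches.par_iter().map(|batch| f(batch)).sum()
}
```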
Streaming Results
Queries produce results incrementally:
- `TableExecutor` yields fixed-size `RecordBatch`es
- No full result set materialization
- Callers process batches via callback or iterator
- Join and aggregate operators buffer only necessary state
Sources: llkv-table/README.md:24-25 llkv-executor/README.md:14-17 llkv-join/README.md:19-22
Data Lifecycle
Write Path
- User submits `INSERT` or `UPDATE` through `SqlEngine`
- `RuntimeContext` validates the schema and injects MVCC columns
- `Table::append` validates the `RecordBatch` schema
- `ColumnStore::append` sorts by `row_id` and rewrites conflicts
- `Pager::batch_put` commits Arrow-serialized chunks atomically
- The transaction manager advances the commit watermark
Read Path
- User submits `SELECT` through `SqlEngine`
- `RuntimeContext` acquires a transaction snapshot
- `TableExecutor` creates a scan with projection and filter
- `Table::scan_stream` initiates a `ColumnStream`
- `ColumnStore` fetches chunks via `Pager::batch_get` (zero-copy)
- MVCC filtering is applied using snapshot visibility rules
- The executor evaluates expressions and streams `RecordBatch`es to the caller
Sources: README.md:56-62 llkv-column-map/README.md:24-28 llkv-table/README.md:19-25
Workspace and Crates
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-column-map/src/store/projection.rs
- llkv-expr/Cargo.toml
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
Purpose and Scope
This document details the Cargo workspace structure and the 15+ crates that comprise the LLKV database system. Each crate is designed with a single responsibility and well-defined interfaces, enabling independent testing and evolution of components. This page catalogs the role of each crate, their internal dependencies, and how they map to the system's layered architecture described in Architecture.
For information about how SQL queries flow through these crates, see SQL Query Processing Pipeline. For details on specific subsystems like storage or transactions, refer to sections 7 and following.
Workspace Overview
The LLKV workspace is defined in Cargo.toml:67-88 and contains 18 member crates organized into core system components, specialized operations, testing infrastructure, and demonstration applications.
Workspace Structure:
graph TB
subgraph "Core System Crates"
LLKV["llkv\n(main entry)"]
SQL["llkv-sql\n(SQL interface)"]
PLAN["llkv-plan\n(query plans)"]
EXPR["llkv-expr\n(expression AST)"]
RUNTIME["llkv-runtime\n(orchestration)"]
EXECUTOR["llkv-executor\n(execution)"]
TABLE["llkv-table\n(table layer)"]
COLMAP["llkv-column-map\n(column store)"]
STORAGE["llkv-storage\n(storage abstraction)"]
TXN["llkv-transaction\n(MVCC manager)"]
RESULT["llkv-result\n(error types)"]
end
subgraph "Specialized Operations"
AGG["llkv-aggregate\n(aggregation)"]
JOIN["llkv-join\n(joins)"]
CSV["llkv-csv\n(CSV import)"]
end
subgraph "Testing Infrastructure"
SLT["llkv-slt-tester\n(SQL logic tests)"]
TESTUTIL["llkv-test-utils\n(test utilities)"]
TPCH["llkv-tpch\n(TPC-H benchmarks)"]
end
subgraph "Demonstrations"
DEMO["llkv-sql-pong-demo\n(interactive demo)"]
end
Sources: Cargo.toml:67-88
Core System Crates
llkv
Purpose: Main library crate that re-exports the primary user-facing APIs from llkv-sql and llkv-runtime.
Key Dependencies: llkv-sql, llkv-runtime
Responsibilities:
- Provides the consolidated API surface for embedding LLKV
- Re-exports `SqlEngine` for SQL query execution
- Re-exports runtime components for programmatic database access
Sources: Cargo.toml:9-10
llkv-sql
Path: llkv-sql/
Purpose: SQL interface layer that parses SQL statements, preprocesses dialect-specific syntax, and translates them into typed query plans.
Key Dependencies:
- `llkv-plan` - Query plan structures
- `llkv-expr` - Expression AST
- `llkv-runtime` - Execution orchestration
- `sqlparser` - SQL parsing (version 0.59.0)
Responsibilities:
- SQL statement preprocessing for dialect compatibility
- AST-to-plan translation
- INSERT statement buffering optimization
- SQL query result formatting
Primary Types:
- `SqlEngine` - Main query interface
Sources: Cargo.toml:21 llkv-plan/src/lib.rs:1-38
llkv-plan
Path: llkv-plan/
Purpose: Query planner that defines typed plan structures representing SQL operations.
Key Dependencies:
- `llkv-expr` - Expression types
- `llkv-result` - Error handling
- `sqlparser` - SQL AST types
Responsibilities:
- Plan structure definitions (`SelectPlan`, `InsertPlan`, `UpdatePlan`, `DeletePlan`)
- SQL-to-plan conversion utilities
- Subquery correlation tracking
- Plan graph serialization for debugging
Primary Types:
- `SelectPlan`, `InsertPlan`, `UpdatePlan`, `DeletePlan`, `CreateTablePlan`
- `SubqueryCorrelatedTracker`
- `RangeSelectRows` - Range-based row selection
Sources: llkv-plan/Cargo.toml:1-28 llkv-plan/src/lib.rs:1-38
llkv-expr
Path: llkv-expr/
Purpose: Expression AST definitions and literal value handling, independent of concrete Arrow scalar types.
Key Dependencies:
- `arrow` - Arrow data types
Responsibilities:
- Expression AST (`Expr<T>`, `ScalarExpr<T>`)
- Literal value representation (`Literal` enum)
- Type-aware predicate compilation (`typed_predicate`)
- Decimal value handling
Primary Types:
- `Expr<T>` - Generic expression with a field identifier type parameter
- `ScalarExpr<T>` - Scalar expressions
- `Literal` - Untyped literal values
- `DecimalValue` - Fixed-precision decimal
- `IntervalValue` - Calendar interval
Sources: llkv-expr/Cargo.toml:1-19 llkv-expr/src/lib.rs:1-21 llkv-expr/src/literal.rs:1-446
llkv-runtime
Path: llkv-runtime/
Purpose: Runtime orchestration layer providing MVCC transaction management, session handling, and system catalog coordination.
Key Dependencies:
- `llkv-executor` - Query execution
- `llkv-table` - Table operations
- `llkv-transaction` - MVCC snapshots
Responsibilities:
- Transaction lifecycle management
- Session state tracking
- System catalog access
- Query plan execution coordination
- MVCC snapshot creation and cleanup
Primary Types:
- `RuntimeContext` - Main runtime state
- `Session` - Per-connection state
Sources: Cargo.toml:19
llkv-executor
Path: llkv-executor/
Purpose: Query execution engine that evaluates plans and produces streaming results.
Key Dependencies:
- `llkv-plan` - Plan structures
- `llkv-expr` - Expression evaluation
- `llkv-table` - Table scans
- `llkv-aggregate` - Aggregation
- `llkv-join` - Join algorithms
Responsibilities:
- SELECT plan execution
- Projection and filtering
- Aggregation coordination
- Join execution
- Streaming RecordBatch production
Sources: Cargo.toml:14
llkv-table
Path: llkv-table/
Purpose: Schema-aware table abstraction providing high-level data operations over columnar storage.
Key Dependencies:
- `llkv-column-map` - Column storage
- `llkv-expr` - Predicate compilation
- `llkv-storage` - Storage backend
- `arrow` - RecordBatch representation
Responsibilities:
- Schema validation and enforcement
- MVCC metadata injection (`row_id`, `created_by`, `deleted_by`)
- Predicate compilation and optimization
- RecordBatch append/scan operations
- Column data type management
Primary Types:
- `Table` - Main table interface
- `TablePlanner` - Query optimization
- `TableExecutor` - Execution strategies
Sources: llkv-table/Cargo.toml:1-60 llkv-column-map/src/store/projection.rs:1-728
llkv-column-map
Path: llkv-column-map/
Purpose: Columnar storage layer that chunks Arrow arrays and manages the mapping from logical fields to physical storage keys.
Key Dependencies:
- `llkv-storage` - Pager abstraction
- `llkv-expr` - Field identifiers
- `arrow` - Array serialization
Responsibilities:
- Column chunk management (serialization/deserialization)
- LogicalFieldId → PhysicalKey mapping
- Multi-column gather operations with caching
- Row visibility filtering
- Chunk metadata tracking (min/max values)
Primary Types:
- `ColumnStore<P>` - Main storage interface
- `LogicalFieldId` - Namespaced field identifier
- `MultiGatherContext` - Reusable context for multi-column reads
- `GatherNullPolicy` - Null handling strategies
Sources: Cargo.toml:12 llkv-column-map/src/store/projection.rs:38-227
llkv-storage
Path: llkv-storage/
Purpose: Storage abstraction layer defining the Pager trait and providing implementations for in-memory and persistent backends.
Key Dependencies:
- `simd-r-drive` - SIMD-optimized persistent storage (optional)
- `arrow` - Buffer types
Responsibilities:
- `Pager` trait definition (`batch_get`/`batch_put`)
- Zero-copy array serialization format
- `MemPager` - In-memory HashMap backend
- `SimdRDrivePager` - Memory-mapped persistent backend
- Physical key allocation
Primary Types:
- `Pager` trait
- `MemPager`, `SimdRDrivePager`
- `PhysicalKey` - Storage location identifier
- Serialization format with custom encoding (see llkv-storage/src/serialization.rs:1-586)
Sources: Cargo.toml:22 llkv-storage/src/serialization.rs:1-130
llkv-transaction
Path: llkv-transaction/
Purpose: MVCC transaction manager providing snapshot isolation and row visibility determination.
Key Dependencies:
- `llkv-result` - Error types
Responsibilities:
- Transaction ID allocation
- MVCC snapshot creation
- Commit watermark tracking
- Row visibility rules enforcement
Primary Types:
- `TransactionManager`
- `Snapshot` - Transaction isolation view
- `TxnId` - Transaction identifier
Sources: Cargo.toml:25
llkv-result
Path: llkv-result/
Purpose: Common error and result types used throughout the system.
Key Dependencies: None (foundational crate)
Responsibilities:
- `Error` enum with all error variants
- `Result<T>` type alias
- Error conversion traits
Sources: Cargo.toml:18
Specialized Operations Crates
llkv-aggregate
Path: llkv-aggregate/
Purpose: Aggregate function evaluation including accumulators and distinct value tracking.
Key Dependencies:
- `arrow` - Array operations
Responsibilities:
- Aggregate function implementations (SUM, AVG, COUNT, MIN, MAX)
- Accumulator state management
- DISTINCT value tracking
- Group-by hash table operations
Sources: Cargo.toml:11
llkv-join
Path: llkv-join/
Purpose: Join algorithm implementations.
Key Dependencies:
- `arrow` - RecordBatch operations
- `llkv-expr` - Join predicates
Responsibilities:
- Hash join implementation
- Nested loop join
- Join key extraction
- Result materialization
Sources: Cargo.toml:16
llkv-csv
Path: llkv-csv/
Purpose: CSV file ingestion and export utilities.
Key Dependencies:
- `llkv-table` - Table operations
- `arrow` - CSV reader integration
Responsibilities:
- CSV to RecordBatch conversion
- Bulk insert optimization
- Schema inference from CSV headers
Sources: Cargo.toml:13
Testing Infrastructure Crates
llkv-test-utils
Path: llkv-test-utils/
Purpose: Shared test utilities including tracing setup and common test fixtures.
Key Dependencies:
- `tracing-subscriber` - Logging configuration
Responsibilities:
- Consistent tracing initialization across tests
- Common test helpers
- Auto-initialization feature for convenience
Sources: Cargo.toml:24
llkv-slt-tester
Path: llkv-slt-tester/
Purpose: SQL Logic Test runner providing standardized correctness testing.
Key Dependencies:
- `llkv-sql` - SQL execution
- `sqllogictest` - Test framework (version 0.28.4)
Responsibilities:
- `.slt` file discovery and execution
- Remote test suite fetching (`.slturl` files)
- Test result comparison
- AsyncDB adapter for LLKV
Primary Types:
- `LlkvSltRunner` - Test runner
- `EngineHarness` - Adapter interface
Sources: Cargo.toml:20
llkv-tpch
Path: llkv-tpch/
Purpose: TPC-H benchmark suite for performance testing.
Key Dependencies:
- `llkv` - Database interface
- `llkv-sql` - SQL execution
- `tpchgen` - Data generation (version 2.0.1)
Responsibilities:
- TPC-H data generation at various scale factors
- Query execution (Q1-Q22)
- Performance measurement
- Benchmark result reporting
Sources: Cargo.toml:62
Demonstration Applications
llkv-sql-pong-demo
Path: demos/llkv-sql-pong-demo/
Purpose: Interactive demonstration showing LLKV's SQL capabilities through a Pong game implemented in SQL.
Key Dependencies:
- `llkv-sql` - SQL execution
- `crossterm` - Terminal UI (version 0.29.0)
Responsibilities:
- Terminal-based interactive interface
- Real-time SQL query execution
- Game state management via SQL tables
- User input handling
graph LR
LLKV["llkv"]
SQL["llkv-sql"]
PLAN["llkv-plan"]
EXPR["llkv-expr"]
RUNTIME["llkv-runtime"]
EXECUTOR["llkv-executor"]
TABLE["llkv-table"]
COLMAP["llkv-column-map"]
STORAGE["llkv-storage"]
TXN["llkv-transaction"]
RESULT["llkv-result"]
AGG["llkv-aggregate"]
JOIN["llkv-join"]
CSV["llkv-csv"]
SLT["llkv-slt-tester"]
TESTUTIL["llkv-test-utils"]
TPCH["llkv-tpch"]
DEMO["llkv-sql-pong-demo"]
LLKV --> SQL
LLKV --> RUNTIME
SQL --> PLAN
SQL --> EXPR
SQL --> RUNTIME
SQL --> EXECUTOR
SQL --> TABLE
SQL --> TXN
RUNTIME --> EXECUTOR
RUNTIME --> TABLE
RUNTIME --> TXN
EXECUTOR --> PLAN
EXECUTOR --> EXPR
EXECUTOR --> TABLE
EXECUTOR --> AGG
EXECUTOR --> JOIN
TABLE --> COLMAP
TABLE --> EXPR
TABLE --> PLAN
TABLE --> STORAGE
COLMAP --> STORAGE
COLMAP --> EXPR
PLAN --> EXPR
PLAN --> RESULT
CSV --> TABLE
TXN --> RESULT
STORAGE --> RESULT
EXPR --> RESULT
COLMAP --> RESULT
TABLE --> RESULT
SLT --> SQL
SLT --> RUNTIME
SLT --> TESTUTIL
TPCH --> LLKV
TPCH --> SQL
DEMO --> SQL
Sources: Cargo.toml:86
Crate Dependency Graph
The following diagram shows the direct dependencies between workspace crates. Arrows point from dependent crates to their dependencies.
Crate Dependencies:
Sources: Cargo.toml:9-25 llkv-table/Cargo.toml:14-31 llkv-plan/Cargo.toml:14-24
Key Observations:
- `llkv-result` is a foundational crate with no internal dependencies, consumed by nearly all other crates for error handling.
- `llkv-expr` depends only on `llkv-result`, making it a stable base for expression handling across the system.
- `llkv-plan` builds on `llkv-expr` and adds plan-specific structures.
- `llkv-storage` and `llkv-transaction` are independent of each other, allowing flexibility in storage backend selection.
- `llkv-table` integrates storage, expressions, and planning to provide a cohesive data layer.
- `llkv-executor` coordinates specialized operations (aggregate, join) and table access.
- `llkv-runtime` sits at the top of the execution stack, orchestrating transactions and query execution.
- `llkv-sql` ties together all layers to provide the SQL interface.
Mapping Crates to System Layers
This diagram shows how workspace crates map to the architectural layers described in Architecture.
Layered Architecture Mapping:
Sources: Cargo.toml:67-88
External Dependencies
The workspace declares several critical external dependencies that enable core functionality.
Apache Arrow Ecosystem
Version: 57.0.0
Crates:
- `arrow` - Core Arrow functionality with prettyprint and IPC features
- `arrow-array` - Array implementations
- `arrow-schema` - Schema types
- `arrow-buffer` - Buffer management
- `arrow-ord` - Ordering operations
Usage: Arrow provides the universal columnar data format throughout LLKV. RecordBatch is used as the data interchange format at every layer, enabling zero-copy operations and SIMD-friendly processing.
Sources: Cargo.toml:32-36
SQL Parsing
Crate: sqlparser
Version: 0.59.0
Usage: Parses SQL text into AST nodes. Used by llkv-sql and llkv-plan to convert SQL queries into typed plan structures.
Sources: Cargo.toml:52
SIMD-Optimized Storage
Crate: simd-r-drive
Version: 0.15.5-alpha
Usage: Provides memory-mapped, SIMD-accelerated persistent storage backend. The SimdRDrivePager implementation in llkv-storage uses this for zero-copy array access.
Related: simd-r-drive-entry-handle for Arrow buffer integration
Sources: Cargo.toml:26-27
Testing and Benchmarking
Key Dependencies:
| Crate | Version | Purpose |
|---|---|---|
| criterion | 0.7.0 | Performance benchmarking |
| sqllogictest | 0.28.4 | SQL correctness testing |
| tpchgen | 2.0.1 | TPC-H data generation |
| libtest-mimic | 0.8 | Custom test harness |
Sources: Cargo.toml:40-62
Utilities
Key Dependencies:
| Crate | Version | Purpose |
|---|---|---|
| rayon | 1.10.0 | Data parallelism |
| rustc-hash | 2.1.1 | Fast hash maps |
| bitcode | 0.6.7 | Binary serialization |
| thiserror | 2.0.17 | Error trait derivation |
| serde | 1.0.228 | Serialization framework |
Sources: Cargo.toml:37-64
Workspace Configuration
The workspace is configured with shared package metadata and dependency versions to ensure consistency across all crates.
Shared Package Metadata:
Build Settings:
- Edition: 2024 (Rust 2024 edition)
- Resolver: Version 2 (new dependency resolver)
- Version: 0.8.2-alpha (all crates share this version)
Sources: Cargo.toml:1-8 Cargo.toml:88
Summary Table
| Crate | Layer | Primary Responsibility | Key Dependencies |
|---|---|---|---|
| llkv | Entry Point | Main library API | llkv-sql, llkv-runtime |
| llkv-sql | SQL Processing | SQL parsing and execution | llkv-plan, llkv-runtime, sqlparser |
| llkv-plan | SQL Processing | Query plan structures | llkv-expr, sqlparser |
| llkv-expr | SQL Processing | Expression AST | arrow |
| llkv-runtime | Execution | Transaction orchestration | llkv-executor, llkv-table |
| llkv-executor | Execution | Query execution | llkv-table, llkv-aggregate |
| llkv-table | Data Management | Schema-aware tables | llkv-column-map, llkv-storage |
| llkv-column-map | Data Management | Columnar storage | llkv-storage, arrow |
| llkv-storage | Storage | Storage abstraction | simd-r-drive (optional) |
| llkv-transaction | Data Management | MVCC manager | - |
| llkv-aggregate | Specialized Ops | Aggregation functions | arrow |
| llkv-join | Specialized Ops | Join algorithms | arrow |
| llkv-csv | Specialized Ops | CSV import/export | llkv-table |
| llkv-result | Foundation | Error types | - |
| llkv-test-utils | Testing | Test utilities | tracing-subscriber |
| llkv-slt-tester | Testing | SQL logic tests | llkv-sql, sqllogictest |
| llkv-tpch | Testing | TPC-H benchmarks | llkv-sql, tpchgen |
| llkv-sql-pong-demo | Demo | Interactive demo | llkv-sql, crossterm |
Sources: Cargo.toml:1-89
SQL Query Processing Pipeline
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-executor/README.md
- llkv-plan/README.md
- llkv-sql/README.md
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/tpch.rs
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
Purpose and Scope
This document describes the end-to-end SQL query processing pipeline in LLKV, from raw SQL text input to final results. It covers the five major stages: SQL preprocessing, parsing, plan translation, execution coordination, and result formatting.
For information about the specific plan structures created during translation, see Plan Structures. For details on how plans are executed to produce results, see Query Execution. For the user-facing SqlEngine API, see SqlEngine API.
Overview
The SQL query processing pipeline transforms user-provided SQL text into Arrow RecordBatch results through a series of well-defined stages. The primary entry point is SqlEngine::execute(), which orchestrates the entire flow while maintaining transaction boundaries and handling cross-statement optimizations like INSERT buffering.
Sources: llkv-sql/src/sql_engine.rs:933-998
graph TB
SQL["Raw SQL Text"]
PREPROCESS["Stage 1: SQL Preprocessing\nDialect Normalization"]
PARSE["Stage 2: Parsing\nsqlparser → AST"]
TRANSLATE["Stage 3: Plan Translation\nAST → PlanStatement"]
EXECUTE["Stage 4: Plan Execution\nRuntimeEngine"]
RESULTS["Stage 5: Result Formatting\nRuntimeStatementResult"]
SQL --> PREPROCESS
PREPROCESS --> PARSE
PARSE --> TRANSLATE
TRANSLATE --> EXECUTE
EXECUTE --> RESULTS
PREPROCESS --> |preprocess_sql_input| PREPROCESS_IMPL["• preprocess_tpch_connect_syntax()\n• preprocess_create_type_syntax()\n• preprocess_exclude_syntax()\n• preprocess_trailing_commas_in_values()\n• preprocess_empty_in_lists()\n• preprocess_index_hints()\n• preprocess_reindex_syntax()\n• preprocess_bare_table_in_clauses()"]
PARSE --> |parse_sql_with_recursion_limit| PARSER["sqlparser::Parser\nPARSER_RECURSION_LIMIT = 200"]
TRANSLATE --> |translate_statement| PLANNER["• SelectPlan\n• InsertPlan\n• UpdatePlan\n• DeletePlan\n• CreateTablePlan\n• DDL Plans"]
EXECUTE --> |engine.execute_statement| RUNTIME["RuntimeEngine\n• Transaction Management\n• MVCC Snapshots\n• Catalog Operations"]
style PREPROCESS fill:#f9f9f9
style PARSE fill:#f9f9f9
style TRANSLATE fill:#f9f9f9
style EXECUTE fill:#f9f9f9
style RESULTS fill:#f9f9f9
Stage 1: SQL Preprocessing
Before parsing, the SQL text undergoes a series of preprocessing transformations to normalize dialect-specific syntax. This allows LLKV to accept SQL written for SQLite, DuckDB, and other dialects while presenting a consistent AST to the planner.
Preprocessing Transformations
| Preprocessor Method | Purpose | Example Transformation |
|---|---|---|
| preprocess_tpch_connect_syntax() | Strip TPC-H CONNECT TO directives | CONNECT TO tpch; → `` |
| preprocess_create_type_syntax() | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE name AS INT → CREATE DOMAIN name AS INT |
| preprocess_exclude_syntax() | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
| preprocess_trailing_commas_in_values() | Remove trailing commas in VALUES | VALUES (1, 2,) → VALUES (1, 2) |
| preprocess_empty_in_lists() | Expand empty IN predicates to constant booleans | x IN () → (x = NULL AND 0 = 1) |
| preprocess_index_hints() | Strip SQLite INDEXED BY hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax() | Convert REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_bare_table_in_clauses() | Expand IN tablename to subquery | x IN t → x IN (SELECT * FROM t) |
Sources: llkv-sql/src/sql_engine.rs:623-873 llkv-sql/src/tpch.rs:1-17
Fallback Trigger Preprocessing
If parsing fails and the SQL contains CREATE TRIGGER, the engine applies an additional preprocess_sqlite_trigger_shorthand() transformation and retries. This handles SQLite's optional BEFORE/AFTER timing and FOR EACH ROW clauses by injecting defaults that sqlparser expects.
Sources: llkv-sql/src/sql_engine.rs:941-957 llkv-sql/src/sql_engine.rs:771-842
Stage 2: Parsing
Parsing is delegated to the sqlparser crate, which produces a Vec<Statement> AST. LLKV configures the parser with:
- Dialect: `GenericDialect` to accept a wide range of SQL syntax
- Recursion Limit: `PARSER_RECURSION_LIMIT = 200` (raised from sqlparser's default of 50 to handle deeply nested queries in test suites)
The parse_sql_with_recursion_limit() helper function wraps sqlparser's API to apply this custom limit.
Sources: llkv-sql/src/sql_engine.rs:317-324 llkv-sql/src/sql_engine.rs:939-957
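A sketch of what such a wrapper does with the sqlparser API is shown below; it assumes sqlparser's recursion-limit builder method, and LLKV's actual helper may be structured differently.

```rust
use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

const PARSER_RECURSION_LIMIT: usize = 200;

// Sketch of a recursion-limited parse; LLKV's parse_sql_with_recursion_limit()
// may differ in structure and error mapping.
fn parse_with_limit(sql: &str) -> Result<Vec<Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(PARSER_RECURSION_LIMIT)
        .try_with_sql(sql)?
        .parse_statements()
}
```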
Stage 3: Plan Translation
Each parsed Statement is translated into a strongly-typed PlanStatement that the runtime can execute. This translation happens through statement-specific methods in SqlEngine.
graph TB
AST["sqlparser::ast::Statement"]
SELECT["Statement::Query"]
INSERT["Statement::Insert"]
UPDATE["Statement::Update"]
DELETE["Statement::Delete"]
CREATE["Statement::CreateTable"]
DROP["Statement::Drop"]
TRANSACTION["Statement::StartTransaction\nStatement::Commit\nStatement::Rollback"]
ALTER["Statement::AlterTable"]
OTHER["Other DDL/DML"]
SELECT_PLAN["translate_query()\n→ SelectPlan"]
INSERT_PLAN["buffer_insert()\n→ InsertPlan or buffered"]
UPDATE_PLAN["translate_update()\n→ UpdatePlan"]
DELETE_PLAN["translate_delete()\n→ DeletePlan"]
CREATE_PLAN["translate_create_table()\n→ CreateTablePlan"]
DROP_PLAN["translate_drop()\n→ PlanStatement::Drop*"]
TXN_RUNTIME["Direct runtime delegation\nflush INSERT buffer first"]
ALTER_PLAN["translate_alter_table()\n→ PlanStatement::Alter*"]
OTHER_PLAN["translate_* methods\n→ PlanStatement::*"]
AST --> SELECT
AST --> INSERT
AST --> UPDATE
AST --> DELETE
AST --> CREATE
AST --> DROP
AST --> TRANSACTION
AST --> ALTER
AST --> OTHER
SELECT --> SELECT_PLAN
INSERT --> INSERT_PLAN
UPDATE --> UPDATE_PLAN
DELETE --> DELETE_PLAN
CREATE --> CREATE_PLAN
DROP --> DROP_PLAN
TRANSACTION --> TXN_RUNTIME
ALTER --> ALTER_PLAN
OTHER --> OTHER_PLAN
SELECT_PLAN --> RUNTIME["RuntimeEngine::execute_statement()"]
INSERT_PLAN --> BUFFER_CHECK{"Buffering\nenabled?"}
BUFFER_CHECK -->|Yes| BUFFER["InsertBuffer\naccumulates rows"]
BUFFER_CHECK -->|No| RUNTIME
UPDATE_PLAN --> RUNTIME
DELETE_PLAN --> RUNTIME
CREATE_PLAN --> RUNTIME
DROP_PLAN --> RUNTIME
TXN_RUNTIME --> RUNTIME
ALTER_PLAN --> RUNTIME
OTHER_PLAN --> RUNTIME
BUFFER --> FLUSH_CHECK{"Flush\nneeded?"}
FLUSH_CHECK -->|Yes| FLUSH["Flush buffered rows"]
FLUSH_CHECK -->|No| CONTINUE["Continue buffering"]
FLUSH --> RUNTIME
Statement Routing
Sources: llkv-sql/src/sql_engine.rs:960-998
Translation Process
The translation process involves:
- Column Resolution: Identifier strings are resolved to
FieldIdreferences using the runtime's catalog - Expression Translation: SQL expressions are converted to
Expr<String>, then resolved toExpr<FieldId> - Subquery Handling: Correlated subqueries are tracked with placeholder generation
- Parameter Binding: SQL placeholders (`?`, `$1`, `:name`) are mapped to parameter indices
Sources: llkv-sql/src/sql_engine.rs:1000-5000 (various translate_* methods)
sequenceDiagram
participant SqlEngine
participant Catalog as "RuntimeContext\nCatalog"
participant ExprTranslator
participant PlanBuilder
SqlEngine->>Catalog: resolve_table("users")
Catalog-->>SqlEngine: TableId(namespace=0, table=1)
SqlEngine->>Catalog: resolve_column("id", TableId)
Catalog-->>SqlEngine: ColumnResolution(FieldId)
SqlEngine->>ExprTranslator: translate_expr(sqlparser::Expr)
ExprTranslator->>ExprTranslator: Build Expr<String>
ExprTranslator->>Catalog: resolve_identifiers()
Catalog-->>ExprTranslator: Expr<FieldId>
ExprTranslator-->>SqlEngine: Expr<FieldId>
SqlEngine->>PlanBuilder: Create SelectPlan
Note over PlanBuilder: Attach projections,\nfilters, sorts, limits
PlanBuilder-->>SqlEngine: PlanStatement::Select(SelectPlan)
Stage 4: Plan Execution
Once a PlanStatement is constructed, it is passed to RuntimeEngine::execute_statement() for execution. The runtime coordinates:
- Transaction Management: Ensures each statement executes within a transaction snapshot
- MVCC Enforcement: Filters rows based on visibility rules
- Catalog Operations: Updates system catalog for DDL statements
- Executor Invocation: Delegates `SelectPlan` execution to `llkv-executor`
Execution Routing by Statement Type
Sources: llkv-sql/src/sql_engine.rs:587-609 llkv-runtime/ (RuntimeEngine implementation)
Stage 5: Result Formatting
The runtime returns a RuntimeStatementResult enum that represents the outcome of statement execution. SqlEngine surfaces this directly to callers via the execute() method, or converts it to Vec<RecordBatch> for the sql() convenience method.
Result Types
| Statement Type | Result Variant | Contents |
|---|---|---|
| SELECT | RuntimeStatementResult::Select | Vec<RecordBatch> of query results |
| INSERT | RuntimeStatementResult::Insert | rows_inserted: u64 |
| UPDATE | RuntimeStatementResult::Update | rows_updated: u64 |
| DELETE | RuntimeStatementResult::Delete | rows_deleted: u64 |
| CREATE TABLE | RuntimeStatementResult::CreateTable | table_name: String |
| CREATE INDEX | RuntimeStatementResult::CreateIndex | index_name: String |
| DROP TABLE | RuntimeStatementResult::DropTable | table_name: String |
| Transaction control | RuntimeStatementResult::Transaction | Transaction state |
Sources: llkv-runtime/ (RuntimeStatementResult definition), llkv-sql/src/sql_engine.rs:933-998
Prepared Statements and Parameters
LLKV supports parameterized queries through a prepared statement mechanism that handles three parameter syntaxes:
- Positional (numbered): `?`, `?1`, `?2`, `$1`, `$2`
- Named: `:name`, `:id`
- Auto-incremented: Sequential `?` placeholders
Parameter Processing Pipeline
Sources: llkv-sql/src/sql_engine.rs:71-206 llkv-sql/src/sql_engine.rs:278-297
Parameter Substitution
During plan execution with parameters, the engine performs a second pass to replace sentinel strings (__llkv_param__N__) with the actual Literal values provided by the caller. This two-phase approach allows the same PlanStatement to be reused across multiple executions with different parameter values.
Sources: llkv-sql/src/sql_engine.rs:194-206
INSERT Buffering
SqlEngine includes an optional INSERT buffering optimization that batches consecutive INSERT ... VALUES statements targeting the same table. This is disabled by default but can be enabled with set_insert_buffering(true) for bulk ingestion workloads.
stateDiagram-v2
[*] --> NoBuffer : buffering disabled
NoBuffer --> NoBuffer : INSERT → immediate execute
[*] --> BufferEmpty : buffering enabled
BufferEmpty --> BufferActive : INSERT(table, cols, rows)
BufferActive --> BufferActive : INSERT(same table/cols) accumulate rows
BufferActive --> Flush : Different table
BufferActive --> Flush : Different columns
BufferActive --> Flush : Different conflict action
BufferActive --> Flush : Row count ≥ MAX_BUFFERED_INSERT_ROWS
BufferActive --> Flush : Non-INSERT statement
BufferActive --> Flush : Transaction boundary
Flush --> RuntimeExecution : Create InsertPlan from accumulated rows
RuntimeExecution --> BufferEmpty : Reset buffer
NoBuffer --> [*]
BufferEmpty --> [*]
Buffering Logic
Sources: llkv-sql/src/sql_engine.rs:410-495 llkv-sql/src/sql_engine.rs:887-905
Buffer Flush Triggers
| Trigger Condition | Rationale |
|---|---|
| Row count ≥ MAX_BUFFERED_INSERT_ROWS (8192) | Limit memory usage |
| Target table changes | Cannot batch cross-table INSERTs |
| Column list changes | Schema mismatch |
| Conflict action changes | ON CONFLICT semantics differ |
| Non-INSERT statement encountered | Preserve statement ordering |
| Transaction boundary (BEGIN, COMMIT, ROLLBACK) | Ensure transactional consistency |
| Explicit flush_pending_inserts() call | Manual control |
| Statement expectation hint (testing) | Test harness needs per-statement results |
Sources: llkv-sql/src/sql_engine.rs:410-495
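For bulk ingestion the toggle and the explicit flush combine roughly as follows. This is a sketch: set_insert_buffering() and flush_pending_inserts() are the methods named above, while the receiver type, error type, and table schema are assumptions.

```rust
use llkv::SqlEngine;

// Sketch of a bulk-load loop with INSERT buffering enabled; exact signatures
// (mutability, error type) are assumed for illustration.
fn bulk_load(engine: &SqlEngine, value_lists: &[String]) -> Result<(), Box<dyn std::error::Error>> {
    engine.set_insert_buffering(true);
    for values in value_lists {
        // Consecutive INSERTs into the same table and columns accumulate in the buffer.
        engine.execute(&format!("INSERT INTO events (id, payload) VALUES {values}"))?;
    }
    // Push out the final partial batch before reading the table back.
    engine.flush_pending_inserts()?;
    Ok(())
}
```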
Error Handling and Table Mapping
The pipeline includes special error handling for table-not-found scenarios. When the runtime returns Error::NotFound or catalog-related errors, SqlEngine::execute_plan_statement() rewrites them to user-friendly messages like "Catalog Error: Table 'users' does not exist".
This mapping is skipped for CREATE VIEW and DROP VIEW statements where the "table" name refers to the view being created/dropped rather than a referenced table.
Sources: llkv-sql/src/sql_engine.rs:558-609
Data Formats and Arrow Integration
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
This page documents how Apache Arrow columnar data structures serve as the universal data interchange format throughout LLKV. It covers the supported Arrow data types, the custom zero-copy serialization format used for persistence, and how RecordBatch flows between layers.
For information about how expressions evaluate over Arrow data, see Expression System. For details on storage pager abstractions, see Pager Interface and SIMD Optimization.
Arrow as the Universal Data Format
LLKV is Arrow-native: every layer of the system produces, consumes, and operates on arrow::record_batch::RecordBatch structures. This design choice enables:
- Zero-copy data access across layer boundaries
- SIMD-friendly vectorized operations via contiguous columnar buffers
- Unified type system from SQL parsing through storage
- Efficient interoperability with external Arrow-compatible tools
The following table maps system layers to their Arrow usage:
| Layer | Arrow Usage |
|---|---|
| llkv-sql | Parses SQL and returns RecordBatch query results to callers |
| llkv-plan | Constructs plans that reference Arrow DataType and Field structures |
| llkv-executor | Streams RecordBatch results during SELECT evaluation |
| llkv-table | Validates incoming batches against table schemas and persists columns |
| llkv-column-map | Chunks RecordBatch columns into serialized Arrow arrays for pager storage |
| llkv-storage | Serializes/deserializes Arrow arrays with a custom zero-copy format |
Sources: Cargo.toml:32 llkv-table/README.md:10-11 llkv-column-map/README.md:10 llkv-storage/README.md:10
RecordBatch Flow Diagram
Key Data Flow Observations:
- INSERT Path: `RecordBatch` → schema validation → MVCC column injection → column chunking → serialization → pager write
- SELECT Path: Pager read → deserialization → column gather → `RecordBatch` construction → streaming to executor
- Zero-Copy: `EntryHandle` wraps memory-mapped regions that Arrow arrays reference directly without copying
Sources: llkv-column-map/src/store/projection.rs:240-446 llkv-storage/src/serialization.rs:226-254 llkv-table/README.md:19-25
Supported Arrow Data Types
LLKV supports the following Arrow primitive and complex types:
Primitive Types
| Arrow DataType | Storage Size | SQL Type Mapping |
|---|---|---|
| UInt8 | 1 byte | TINYINT UNSIGNED |
| UInt16 | 2 bytes | SMALLINT UNSIGNED |
| UInt32 | 4 bytes | INT UNSIGNED |
| UInt64 | 8 bytes | BIGINT UNSIGNED |
| Int8 | 1 byte | TINYINT |
| Int16 | 2 bytes | SMALLINT |
| Int32 | 4 bytes | INT |
| Int64 | 8 bytes | BIGINT |
| Float32 | 4 bytes | REAL, FLOAT |
| Float64 | 8 bytes | DOUBLE PRECISION |
| Boolean | 1 bit (packed) | BOOLEAN |
| Date32 | 4 bytes | DATE |
| Date64 | 8 bytes | TIMESTAMP |
| Decimal128(p, s) | 16 bytes | DECIMAL(p, s) |
Variable-Length Types
| Arrow DataType | Storage Layout | SQL Type Mapping |
|---|---|---|
| Utf8 | i32 offsets + UTF-8 bytes | VARCHAR, TEXT |
| LargeUtf8 | i64 offsets + UTF-8 bytes | TEXT (large) |
| Binary | i32 offsets + raw bytes | VARBINARY, BLOB |
| LargeBinary | i64 offsets + raw bytes | BLOB (large) |
Complex Types
| Arrow DataType | Description | Use Cases |
|---|---|---|
| Struct(fields) | Nested record with named fields | Composite values, JSON-like data |
| FixedSizeList(T, n) | Fixed-length array of type T | Vector embeddings, coordinate tuples |
Null Handling:
The current serialization format does not yet support null bitmaps. Arrays with null_count() > 0 will return an error during serialization. Null support is planned for future releases.
Sources: llkv-storage/src/serialization.rs:144-165 llkv-storage/src/serialization.rs:199-224 llkv-expr/src/literal.rs:78-94
Custom Serialization Format
Why Not Arrow IPC?
LLKV uses a custom minimal serialization format instead of Arrow's standard IPC (Inter-Process Communication) format for several reasons:
Trade-offs:
graph LR
subgraph "Arrow IPC Format"
IPC_SCHEMA["Schema object\nframing metadata"]
IPC_PADDING["Padding alignment\n8/64 byte boundaries"]
IPC_BUFFERS["Buffer pointers\n+ offsets"]
IPC_SIZE["Larger file size\n~20-40% overhead"]
end
subgraph "LLKV Custom Format"
CUSTOM_HEADER["24-byte header\nfixed size"]
CUSTOM_PAYLOAD["Raw buffer bytes\ncontiguous"]
CUSTOM_ZERO["Zero-copy rebuild\ndirect mmap"]
CUSTOM_SIZE["Minimal size\nno framing"]
end
IPC_SCHEMA --> IPC_SIZE
IPC_PADDING --> IPC_SIZE
IPC_BUFFERS --> IPC_SIZE
CUSTOM_HEADER --> CUSTOM_SIZE
CUSTOM_PAYLOAD --> CUSTOM_ZERO
| Aspect | Arrow IPC | LLKV Custom Format |
|---|---|---|
| File size | Larger (metadata + padding) | Minimal (24-byte header + payload) |
| Deserialization | Allocates and copies buffers | Zero-copy via EntryHandle |
| Flexibility | Supports all Arrow features | Limited to non-null arrays |
| Scan performance | Moderate (copy overhead) | Fast (direct SIMD access) |
| Null support | Full bitmap support | Not yet implemented |
Design Goals:
- Minimal headers: 24-byte fixed header, no schema objects per array
- Predictable payloads: contiguous buffers for mmap-friendly access
- True zero-copy: reconstruct `ArrayData` referencing the original buffer directly
- Stable on-disk codes: type tags are compile-time pinned to prevent corruption
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
Serialization Format Details
Header Structure
Every serialized array begins with a 24-byte header:
Offset Size Field
------ ---- -----
0-3 4 Magic bytes: b"ARR0"
4 1 Layout code (Primitive=0, FslFloat32=1, Varlen=2, Struct=3)
5 1 Type code (PrimType enum value)
6 1 Precision (for Decimal128) or padding
7 1 Scale (for Decimal128) or padding
8-15 8 Array length (u64, element count)
16-19 4 extra_a (layout-specific u32)
20-23 4 extra_b (layout-specific u32)
24+ var Payload (layout-specific buffer bytes)
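The sketch below shows one way this header could be decoded in Rust. The struct, function name, and little-endian assumption are illustrative rather than the crate's actual implementation; only the byte layout follows the table above.

```rust
/// Illustrative decoder for the 24-byte header described above.
/// Names and endianness are assumptions; the offsets come from the table.
#[derive(Debug)]
struct ArrayHeader {
    layout: u8,    // Primitive=0, FslFloat32=1, Varlen=2, Struct=3
    type_code: u8, // PrimType enum value
    precision: u8, // Decimal128 precision (otherwise padding)
    scale: u8,     // Decimal128 scale (otherwise padding)
    len: u64,      // element count
    extra_a: u32,  // layout-specific
    extra_b: u32,  // layout-specific
}

fn decode_header(bytes: &[u8]) -> Result<ArrayHeader, String> {
    if bytes.len() < 24 || &bytes[0..4] != b"ARR0" {
        return Err("missing ARR0 magic or truncated header".into());
    }
    Ok(ArrayHeader {
        layout: bytes[4],
        type_code: bytes[5],
        precision: bytes[6],
        scale: bytes[7],
        len: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(bytes[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(bytes[20..24].try_into().unwrap()),
    })
}
```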
Layout Variants
Primitive Layout
For fixed-width types (Int32, Float64, etc.):
- extra_a: Length of values buffer in bytes
- extra_b: Unused (0)
- Payload: Raw values buffer
Varlen Layout
For variable-length types (Utf8, Binary, etc.):
- extra_a: Length of offsets buffer in bytes
- extra_b: Length of values buffer in bytes
- Payload: Offsets buffer followed by values buffer
FixedSizeList Layout
Special optimization for vector embeddings:
- extra_a: List size (elements per list)
- extra_b: Total child buffer length in bytes
- Payload: Contiguous child Float32 buffer
This enables direct SIMD access to embedding vectors without indirection.
Struct Layout
For nested composite types:
- extra_a: Unused (0)
- extra_b: IPC payload length in bytes
- Payload: Arrow IPC-serialized struct array
Struct types fall back to Arrow IPC format because their complex nested structure doesn't benefit from the custom layout.
Sources: llkv-storage/src/serialization.rs:44-135 llkv-storage/src/serialization.rs:256-378
graph TB
PAGER["Pager::batch_get"]
HANDLE["EntryHandle\nmemory-mapped region"]
BUFFER["Arrow Buffer\nslice of EntryHandle"]
ARRAYDATA["ArrayData\nreferences Buffer"]
ARRAY["Concrete Array\nInt32Array, etc."]
PAGER --> HANDLE
HANDLE --> BUFFER
BUFFER --> ARRAYDATA
ARRAYDATA --> ARRAY
style HANDLE fill:#f9f9f9
style BUFFER fill:#f9f9f9
Zero-Copy Deserialization
EntryHandle Integration
The EntryHandle type from simd-r-drive-entry-handle provides a zero-copy wrapper around memory-mapped buffers:
Key Operations:
- Pager read : Returns GetResult::Raw { key, bytes: EntryHandle }
- Buffer slice : EntryHandle::as_arrow_buffer() creates an Arrow Buffer view
- ArrayData build : ArrayData::builder().add_buffer(buffer).build()
- Array cast : make_array(data) produces typed arrays
The entire chain avoids copying data — Arrow arrays directly reference the memory-mapped region.
Alignment Requirements:
Decimal128 requires 16-byte alignment. If the EntryHandle buffer is not properly aligned, the deserializer copies it to an aligned buffer:
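The following is a minimal sketch of that rebuild path using public arrow-rs APIs (ArrayData::builder, make_array). The function shape and the alignment fallback are assumptions, not the crate's exact code; the values buffer is assumed to come from the pager's EntryHandle via EntryHandle::as_arrow_buffer().

```rust
use arrow::array::{make_array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Sketch of the zero-copy rebuild for a primitive column.
fn rebuild_primitive(
    data_type: DataType,
    len: usize,
    values: Buffer, // view over the memory-mapped EntryHandle region
) -> arrow::error::Result<ArrayRef> {
    // Decimal128 needs 16-byte alignment; fall back to an owned copy when
    // the mapped region is not aligned. All other cases stay zero-copy.
    let values = if matches!(data_type, DataType::Decimal128(_, _))
        && values.as_ptr().align_offset(16) != 0
    {
        Buffer::from_slice_ref(values.as_slice()) // aligned, owned copy
    } else {
        values
    };

    let data = ArrayData::builder(data_type)
        .len(len)
        .add_buffer(values)
        .build()?;
    Ok(make_array(data)) // typed array still backed by the original buffer
}
```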
Sources: llkv-storage/src/serialization.rs:429-559 llkv-column-map/src/store/projection.rs:619-629
graph TB
ROWIDS["row_ids: &[u64]\nrequested rows"]
FIELDIDS["field_ids: &[LogicalFieldId]\nrequested columns"]
PLANS["FieldPlan\nper-column metadata"]
CHUNKS["Chunk selection\ncandidate_indices"]
BATCH_GET["Pager::batch_get\nfetch chunks"]
CACHE["Chunk cache\nArrayRef map"]
GATHER["gather_rows_from_chunks\nper-type specialization"]
ARRAYS["Vec<ArrayRef>\none per field"]
SCHEMA["Arrow Schema\nField metadata"]
RECORDBATCH["RecordBatch::try_new"]
ROWIDS --> PLANS
FIELDIDS --> PLANS
PLANS --> CHUNKS
CHUNKS --> BATCH_GET
BATCH_GET --> CACHE
CACHE --> GATHER
GATHER --> ARRAYS
ARRAYS --> RECORDBATCH
SCHEMA --> RECORDBATCH
RecordBatch Construction and Projection
Gather Operations
The ColumnStore::gather_rows family of methods reconstructs RecordBatch from chunked columns:
Projection Flow:
- Prepare context : Load column descriptors, determine chunk candidates
- Batch fetch : Request all needed chunks from pager in one call
- Type-specific gather : Dispatch to specialized routines based on DataType
- Null policy : Apply GatherNullPolicy (ErrorOnMissing, IncludeNulls, DropNulls)
- Schema construction : Build Schema with correct field names and nullability
- RecordBatch assembly : RecordBatch::try_new(schema, arrays) (see the sketch below)
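The assembly step itself is plain arrow-rs usage. The example below is illustrative only: the column names and types are invented here, and the real gather code builds the arrays from chunk data rather than literals.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Minimal illustration of the final assembly: a schema plus one gathered
/// array per field become a RecordBatch.
fn assemble_batch() -> arrow::error::Result<RecordBatch> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("name", DataType::Utf8, false),
    ]));
    let arrays: Vec<ArrayRef> = vec![
        Arc::new(Int64Array::from(vec![1, 2, 3])),
        Arc::new(StringArray::from(vec!["a", "b", "c"])),
    ];
    RecordBatch::try_new(schema, arrays)
}
```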
Type Dispatch Table:
| DataType | Gather Function |
|---|---|
Utf8 | gather_rows_from_chunks_string::<i32> |
LargeUtf8 | gather_rows_from_chunks_string::<i64> |
Binary | gather_rows_from_chunks_binary::<i32> |
LargeBinary | gather_rows_from_chunks_binary::<i64> |
Boolean | gather_rows_from_chunks_bool |
Struct(_) | gather_rows_from_chunks_struct |
Decimal128(_, _) | gather_rows_from_chunks_decimal128 |
| Primitives | gather_rows_from_chunks::<ArrowTy> (generic) |
Sources: llkv-column-map/src/store/projection.rs:245-446 llkv-column-map/src/store/projection.rs:636-726
graph TB
BATCH["RecordBatch\nuser data"]
TABLE_SCHEMA["Stored Schema\nfrom catalog"]
VALIDATE["Schema validation"]
FIELD_CHECK["Field count\nname\ntype match"]
MVCC["Inject MVCC columns\nrow_id, created_by,\ndeleted_by"]
EXTENDED["Extended RecordBatch"]
COLMAP["ColumnStore::append"]
BATCH --> VALIDATE
TABLE_SCHEMA --> VALIDATE
VALIDATE --> FIELD_CHECK
FIELD_CHECK -->|Pass| MVCC
FIELD_CHECK -->|Fail| ERROR["Error::SchemaMismatch"]
MVCC --> EXTENDED
EXTENDED --> COLMAP
Schema Validation
Table-Level Schema Enforcement
The Table layer validates incoming RecordBatch schemas against the stored table schema before appending:
Validation Rules:
- Field count : Batch must have exactly the same number of columns as the table schema
- Field names : Column names must match (case-sensitive)
- Field types : DataType must match exactly (no implicit coercion)
- Nullability : Currently not strictly enforced (planned improvement)
MVCC Column Injection:
After validation, the table appends three system columns:
- row_id (UInt64): Unique row identifier
- created_by (UInt64): Transaction ID that created the row
- deleted_by (UInt64): Transaction ID that deleted the row (0 if active)
These columns are stored in separate logical namespaces but physically alongside user data.
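In arrow-rs terms, the injection amounts to extending the validated batch with three UInt64 columns. The sketch below is illustrative only; the actual implementation lives in llkv-table and llkv-column-map and handles namespacing and row-id allocation differently.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

/// Hedged sketch of MVCC column injection after schema validation.
fn inject_mvcc(batch: &RecordBatch, first_row_id: u64, txn_id: u64) -> arrow::error::Result<RecordBatch> {
    let n = batch.num_rows();
    let row_ids: ArrayRef = Arc::new(UInt64Array::from_iter_values(
        first_row_id..first_row_id + n as u64,
    ));
    let created_by: ArrayRef = Arc::new(UInt64Array::from(vec![txn_id; n]));
    let deleted_by: ArrayRef = Arc::new(UInt64Array::from(vec![0u64; n])); // 0 = active

    let mut fields: Vec<Field> =
        batch.schema().fields().iter().map(|f| f.as_ref().clone()).collect();
    fields.push(Field::new("row_id", DataType::UInt64, false));
    fields.push(Field::new("created_by", DataType::UInt64, false));
    fields.push(Field::new("deleted_by", DataType::UInt64, false));

    let mut columns = batch.columns().to_vec();
    columns.extend([row_ids, created_by, deleted_by]);
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```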
Sources: llkv-table/README.md:14-17 llkv-column-map/README.md:26-28
graph LR
SQL_LITERAL["SQL Literal\n123, 'text', etc."]
LITERAL["Literal enum\nInteger, String, etc."]
SCHEMA["Table Schema\nArrow DataType"]
NATIVE["Native Value\ni32, String, etc."]
ARRAY["Arrow Array\nInt32Array, etc."]
SQL_LITERAL --> LITERAL
LITERAL --> SCHEMA
SCHEMA --> NATIVE
NATIVE --> ARRAY
Type Mapping from SQL to Arrow
Literal Conversion
The llkv-expr crate defines a Literal enum that captures untyped SQL values before schema resolution:
Supported Literal Types:
| Literal Variant | Arrow Target Types |
|---|---|
Integer(i128) | Any integer or float type (with range checks) |
Float(f64) | Float32, Float64 |
Decimal(DecimalValue) | Decimal128(p, s) |
String(String) | Utf8, LargeUtf8 |
Boolean(bool) | Boolean |
Date32(i32) | Date32 |
Struct(fields) | Struct(...) |
Interval(IntervalValue) | Not directly stored; used for date arithmetic |
Conversion Mechanism:
The FromLiteral trait provides type-aware conversion; implementations perform range checking and type validation before producing native values.
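A minimal sketch of the kind of conversion involved, assuming a simplified Literal enum and trait shape (the real definitions live in llkv-expr/src/literal.rs and differ in detail):

```rust
/// Simplified literal representation for illustration only.
#[derive(Debug)]
enum Literal {
    Integer(i128),
    Float(f64),
    String(String),
    Boolean(bool),
}

/// Illustrative trait shape: convert an untyped literal into a native value.
trait FromLiteralSketch: Sized {
    fn from_literal(lit: &Literal) -> Result<Self, String>;
}

impl FromLiteralSketch for i32 {
    fn from_literal(lit: &Literal) -> Result<Self, String> {
        match lit {
            // Range check: the i128 SQL literal must fit the target column type.
            Literal::Integer(v) => {
                i32::try_from(*v).map_err(|_| format!("{v} out of range for INT"))
            }
            other => Err(format!("cannot convert {other:?} to INT")),
        }
    }
}
```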
Sources: llkv-expr/src/literal.rs:78-94 llkv-expr/src/literal.rs:156-219 llkv-expr/src/literal.rs:395-419
Performance Characteristics
Zero-Copy Benefits
The combination of Arrow's columnar layout and the custom serialization format delivers measurable performance benefits:
| Operation | Traditional DB | LLKV Arrow-Native |
|---|---|---|
| Column scan | Row-by-row decode | Vectorized SIMD over mmap |
| Type dispatch | Virtual function calls | Monomorphized at compile time |
| Buffer management | Multiple allocations | Single mmap region |
| Predicate evaluation | Interpreted per row | Compiled bytecode over vectors |
Chunking Strategy
The ColumnStore organizes data into chunks sized for cache locality and pager efficiency:
- Target chunk size : Configurable, typically 64KB-256KB per column
- Row alignment : All columns in a table share the same row boundaries per chunk
- Append optimization : Incoming batches are chunked and sorted by row_id before persistence
This design minimizes pager I/O and maximizes CPU cache hit rates during scans.
Sources: llkv-column-map/README.md:24-28 llkv-storage/README.md:15-17
Integration with External Tools
Arrow Compatibility
Because LLKV uses standard Arrow data structures at its boundaries, it can integrate with the broader Arrow ecosystem:
- Export : Query results can be serialized to Arrow IPC files for external processing
- Import : Arrow IPC files can be read and ingested via Table::append
- Parquet : Future work could add direct Parquet read/write using Arrow's parquet crate
- DataFusion : LLKV's table scan APIs could potentially integrate as a DataFusion TableProvider
Current Limitations
- Null support : The serialization format doesn't yet handle null bitmaps
- Nested types : Only Struct and FixedSizeList<Float32> are fully supported
- Dictionary encoding : Not yet implemented (planned)
- Compression : No built-in compression (relies on storage-layer features)
Sources: llkv-storage/src/serialization.rs:257-260 llkv-column-map/README.md:10-11
Summary
LLKV's Arrow-native architecture provides:
- Universal interchange format via RecordBatch across all layers
- Zero-copy operations through EntryHandle and memory-mapped buffers
- Custom serialization optimized for mmap and SIMD access patterns
- Type safety from SQL literals through to persisted columns
- SIMD-friendly layout for efficient vectorized query evaluation
The trade-off of using a custom format instead of Arrow IPC is reduced flexibility (no nulls yet, fewer complex types) in exchange for smaller files, faster scans, and true zero-copy deserialization.
For details on how Arrow arrays are evaluated during query execution, see Scalar Evaluation and NumericKernels. For information on how MVCC metadata is stored alongside Arrow columns, see Column Storage and ColumnStore.
SQL Interface
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/src/lib.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/sql_value.rs
- llkv-sql/src/tpch.rs
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
The SQL Interface layer provides the primary user-facing entry point for executing SQL statements against LLKV. It consists of the llkv-sql crate, which wraps the underlying runtime and provides SQL parsing, preprocessing, statement caching, and result formatting.
This document covers the SqlEngine struct and its methods, SQL preprocessing and dialect normalization, and the INSERT buffering optimization system. For information about query planning after SQL parsing, see Query Planning. For runtime execution, see the Architecture section.
Sources: llkv-sql/src/lib.rs:1-51 README.md:47-48
Core Components
The SQL Interface layer is centered around three main subsystems:
| Component | Purpose | Key Types |
|---|---|---|
SqlEngine | Main execution interface | SqlEngine, RuntimeEngine, RuntimeSession |
| Preprocessing | SQL normalization and dialect handling | Various regex-based transformers |
| INSERT Buffering | Batch optimization for literal inserts | InsertBuffer, PreparedInsert |
Sources: llkv-sql/src/sql_engine.rs:365-556
SqlEngine Structure
The SqlEngine wraps a RuntimeEngine instance and adds SQL-specific functionality including statement caching, INSERT buffering, and configurable behavior flags. The insert_buffer field holds accumulated literal INSERT payloads when buffering is enabled.
Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/sql_engine.rs:421-471
SQL Statement Processing Flow
The statement processing flow consists of:
- Preprocessing: SQL text undergoes dialect normalization via regex-based transformations
- Parsing: sqlparser with an increased recursion limit (200 vs the default 50) produces the AST
- Planning: AST nodes are translated to typed PlanStatement structures
- Buffering (INSERT only): Literal INSERT statements may be accumulated in InsertBuffer
- Execution: Plans are passed to RuntimeEngine for execution
- Result collection: RuntimeStatementResult instances are collected and returned
Sources: llkv-sql/src/sql_engine.rs:933-991 llkv-sql/src/sql_engine.rs:318-324
Public API Methods
Core Execution Methods
The SqlEngine exposes two primary execution methods:
| Method | Signature | Purpose | Returns |
|---|---|---|---|
execute | fn execute(&self, sql: &str) | Execute one or more SQL statements | SqlResult<Vec<RuntimeStatementResult>> |
sql | fn sql(&self, query: &str) | Execute a single SELECT and return batches | SqlResult<Vec<RecordBatch>> |
The execute method handles arbitrary SQL (DDL, DML, queries) and returns statement results. The sql method is a convenience wrapper that enforces single-SELECT semantics and extracts Arrow batches from the result stream.
Sources: llkv-sql/src/sql_engine.rs:921-991 llkv-sql/src/sql_engine.rs:1009-1052
Prepared Statements
Prepared statements support three placeholder syntaxes:
- Positional: ? (auto-numbered), ?1, $1 (explicit index)
- Named: :param_name
Placeholders are tracked via thread-local ParameterState during parsing, converted to sentinel strings like __llkv_param__1__, and stored in a PreparedPlan with parameter count metadata. The statement_cache field provides a statement-level cache keyed by SQL text.
Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:278-297
Configuration Methods
| Method | Purpose |
|---|---|
new<Pg>(pager: Arc<Pg>) | Construct engine with given pager (buffering disabled) |
with_context(context, default_nulls_first) | Construct from existing RuntimeContext |
set_insert_buffering(enabled: bool) | Toggle INSERT batching mode |
The set_insert_buffering method controls cross-statement INSERT accumulation. When disabled (default), each INSERT executes immediately. When enabled, compatible INSERTs targeting the same table are batched together up to MAX_BUFFERED_INSERT_ROWS (8192 rows).
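A hedged usage sketch of a bulk-load loop: the import paths and the MemPager type are assumptions, error handling is simplified, and only the method names come from the tables above.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn bulk_load(statements: &[&str]) -> Result<(), Box<dyn std::error::Error>> {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    engine.set_insert_buffering(true);
    for stmt in statements {
        engine.execute(stmt)?; // compatible INSERTs accumulate up to 8192 rows
    }
    engine.set_insert_buffering(false); // disabling flushes any pending rows
    Ok(())
}
```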
Sources: llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-905 llkv-sql/src/sql_engine.rs:410-414
SQL Preprocessing System
The preprocessing layer normalizes SQL dialects before parsing to handle incompatibilities between SQLite, DuckDB, and sqlparser expectations.
graph TB
RAW["Raw SQL String"]
TPCH["preprocess_tpch_connect_syntax\n(strip CONNECT TO statements)"]
TYPE["preprocess_create_type_syntax\n(CREATE TYPE → CREATE DOMAIN)"]
EXCLUDE["preprocess_exclude_syntax\n(quote qualified names in EXCLUDE)"]
COMMA["preprocess_trailing_commas_in_values\n(remove trailing commas)"]
EMPTY["preprocess_empty_in_lists\n(expr IN () → constant)"]
INDEX["preprocess_index_hints\n(strip INDEXED BY / NOT INDEXED)"]
REINDEX["preprocess_reindex_syntax\n(REINDEX → VACUUM REINDEX)"]
BARE["preprocess_bare_table_in_clauses\n(IN table → IN (SELECT * FROM))"]
TRIGGER["preprocess_sqlite_trigger_shorthand\n(add AFTER / FOR EACH ROW)"]
PARSER["sqlparser::Parser"]
RAW --> TPCH
TPCH --> TYPE
TYPE --> EXCLUDE
EXCLUDE --> COMMA
COMMA --> EMPTY
EMPTY --> INDEX
INDEX --> REINDEX
REINDEX --> BARE
BARE --> PARSER
PARSER -.parse error.-> TRIGGER
TRIGGER --> PARSER
Each preprocessing function is implemented as a regex-based transformer:
| Function | Pattern | Purpose | Lines |
|---|---|---|---|
preprocess_tpch_connect_syntax | CONNECT TO database; | Strip TPC-H multi-database directives | 628-630
preprocess_create_type_syntax | CREATE TYPE → CREATE DOMAIN | Translate DuckDB type alias syntax | 639-657
preprocess_exclude_syntax | EXCLUDE(a.b.c) → EXCLUDE("a.b.c") | Quote qualified names in EXCLUDE | 659-676
preprocess_trailing_commas_in_values | VALUES (v,) → VALUES (v) | Remove DuckDB-style trailing commas | 678-689
preprocess_empty_in_lists | expr IN () → (expr = NULL AND 0 = 1) | Convert empty IN to constant false | 691-720
preprocess_index_hints | INDEXED BY idx / NOT INDEXED | Strip SQLite index hints | 722-739
preprocess_reindex_syntax | REINDEX idx → VACUUM REINDEX idx | Convert to sqlparser-compatible form | 741-757
preprocess_bare_table_in_clauses | IN table → IN (SELECT * FROM table) | Expand SQLite shorthand | 844-873
preprocess_sqlite_trigger_shorthand | Missing AFTER / FOR EACH ROW | Add required trigger components | 771-842
The trigger preprocessor is only invoked on parse errors containing CREATE TRIGGER, as it requires more complex regex patterns to inject missing timing and row-level clauses.
Sources: llkv-sql/src/sql_engine.rs:623-873
Regex Pattern Details
Static OnceLock<Regex> instances cache compiled patterns across invocations:
For example, the empty IN list handler uses:
(?i)(\([^)]*\)|x'[0-9a-fA-F]*'|'(?:[^']|'')*'|[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*|\d+(?:\.\d+)?)\s+(NOT\s+)?IN\s*\(\s*\)
This matches expressions (parenthesized, hex literals, strings, identifiers, numbers) followed by [NOT] IN () and replaces with boolean expressions that preserve evaluation side effects while producing constant results.
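The general shape of such a transformer is sketched below with a simplified pattern for index-hint stripping; the static name and regex here are illustrative, and the actual patterns in llkv-sql are more involved.

```rust
use std::sync::OnceLock;
use regex::Regex;

/// Sketch of a cached-regex transformer: the pattern compiles once per
/// process and is reused across all preprocessing calls.
fn strip_index_hints(sql: &str) -> String {
    static HINT: OnceLock<Regex> = OnceLock::new();
    let re = HINT.get_or_init(|| {
        // (?i) = case-insensitive; word boundaries avoid matching inside identifiers.
        Regex::new(r"(?i)\s+(INDEXED\s+BY\s+\w+|NOT\s+INDEXED)\b").expect("static pattern compiles")
    });
    re.replace_all(sql, "").into_owned()
}
```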
Sources: llkv-sql/src/sql_engine.rs:691-720
Parameter Placeholder System
The parameter system uses thread-local state to track placeholders during statement preparation:
- Scope Creation: ParameterScope::new() initializes thread-local ParameterState
- Registration: Each placeholder calls register_placeholder(raw), which:
  - For ?: auto-increments the index
  - For ?N or $N: uses the explicit numeric index
  - For :name: assigns the next available index and stores the mapping
- Sentinel Generation: placeholder_marker(index) creates the __llkv_param__N__ string
- Parsing: Sentinel strings are parsed as string literals in the SQL AST
- Binding: execute_prepared replaces sentinels with SqlParamValue instances
The ParameterState struct tracks:
- assigned: FxHashMap<String, usize> - named parameter to index mapping
- next_auto: usize - next index for ? placeholders
- max_index: usize - highest parameter index seen
Sources: llkv-sql/src/sql_engine.rs:77-206 llkv-sql/src/sql_engine.rs:1120-1235
graph TB
subgraph "INSERT Processing Decision"
INSERT["Statement::Insert"]
CLASSIFY["classify_insert"]
VALUES["PreparedInsert::Values"]
IMMEDIATE["PreparedInsert::Immediate"]
end
subgraph "Buffering Logic"
ENABLED{"Buffering\nEnabled?"}
COMPAT{"Can Buffer\nAccept?"}
THRESHOLD{">= MAX_BUFFERED_INSERT_ROWS\n(8192)?"}
BUFFER["InsertBuffer::push_statement"]
FLUSH["flush_buffered_insert"]
EXECUTE["execute_plan_statement"]
end
INSERT --> CLASSIFY
CLASSIFY --> VALUES
CLASSIFY --> IMMEDIATE
VALUES --> ENABLED
ENABLED -->|No| EXECUTE
ENABLED -->|Yes| COMPAT
COMPAT -->|No| FLUSH
COMPAT -->|Yes| BUFFER
FLUSH --> BUFFER
BUFFER --> THRESHOLD
THRESHOLD -->|Yes| FLUSH
THRESHOLD -->|No| RETURN["Return placeholder result"]
IMMEDIATE --> EXECUTE
INSERT Buffering System
The INSERT buffering system batches compatible literal INSERT statements to reduce planning overhead for bulk ingest workloads.
Buffer Structure
The InsertBuffer struct accumulates rows across multiple INSERT statements:
Key fields:
- table_name, columns, on_conflict: compatibility key for buffering
- rows: accumulated literal values from all buffered statements
- statement_row_counts: per-statement row counts to emit individual results
- total_rows: sum of statement_row_counts for threshold checking
Sources: llkv-sql/src/sql_engine.rs:421-471
Buffering Conditions
An INSERT can be buffered if:
- The InsertSource is Values (literal rows) or a constant SELECT
- Buffering is enabled via the insert_buffering_enabled flag
- Either no buffer exists or InsertBuffer::can_accept returns true:
  - table_name matches exactly
  - columns match exactly (same names, same order)
  - on_conflict action matches
When the buffer reaches MAX_BUFFERED_INSERT_ROWS (8192), it is flushed automatically. Flush also occurs on:
- Transaction boundaries (BEGIN, COMMIT, ROLLBACK)
- Incompatible INSERT statement
- Engine drop
- Explicit set_insert_buffering(false) call
Sources: llkv-sql/src/sql_engine.rs:452-470 llkv-sql/src/sql_engine.rs:2028-2146 llkv-sql/src/sql_engine.rs:410-414
Buffer Flush Process
The flush process:
- Extracts the InsertBuffer from RefCell<Option<InsertBuffer>>
- Constructs a single InsertPlan with all accumulated rows
- Executes via execute_statement
- Receives a single RuntimeStatementResult::Insert with the total rows inserted
- Splits the result into per-statement results using the statement_row_counts vector
- Returns a vector of results matching the original statement order
This allows bulk execution while preserving per-statement result semantics.
Sources: llkv-sql/src/sql_engine.rs:2028-2146
Value Handling
The SqlValue enum represents literal values during SQL processing:
The SqlValue::try_from_expr function handles:
- Unary operators (negation for numeric types, intervals)
- CAST expressions (particularly to DATE)
- Nested expressions
- Dictionary/struct literals
- Binary operations (addition, subtraction, bitshift for constant folding)
- Typed strings (DATE '2024-01-01')
Interval arithmetic is performed at constant-folding time:
- Date32 + Interval → Date32
- Interval + Date32 → Date32
- Date32 - Interval → Date32
- Date32 - Date32 → Interval
- Interval +/- Interval → Interval
Sources: llkv-sql/src/sql_value.rs:16-320
Error Handling
The SQL layer maps table-related errors to catalog-specific error messages:
| Error Type | Mapping | Method |
|---|---|---|
Error::NotFound | Catalog Error: Table 'X' does not exist | table_not_found_error |
Error::InvalidArgumentError (contains "unknown table") | Same as above | map_table_error |
| Transaction conflicts | another transaction has dropped this table | String constant |
The execute_plan_statement method applies error mapping except for CREATE VIEW and DROP VIEW statements, where the "table" name refers to the view being created/dropped rather than a referenced table.
Sources: llkv-sql/src/sql_engine.rs:558-609 llkv-sql/src/sql_engine.rs:511
Thread Safety and Cloning
The SqlEngine::clone implementation creates a new session:
This ensures each cloned engine has an independent:
- RuntimeSession (transaction state, temporary namespace)
- insert_buffer (no shared buffering across sessions)
- statement_cache (independent prepared statement cache)
The warning message indicates this is typically not intended usage, as most applications should use a single shared SqlEngine instance across threads (enabled by interior mutability via RefCell and atomic types).
Sources: llkv-sql/src/sql_engine.rs:522-540
SqlEngine API
Relevant source files
- llkv-aggregate/src/lib.rs
- llkv-executor/README.md
- llkv-plan/README.md
- llkv-sql/README.md
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-sql/src/sql_value.rs
Purpose and Scope
The SqlEngine provides the primary user-facing API for executing SQL statements against LLKV databases. It accepts SQL text, parses it, translates it to typed execution plans, and delegates to the runtime layer for evaluation. This page documents the SqlEngine struct's construction, methods, prepared statement handling, and configuration options.
For information about SQL preprocessing and dialect handling, see SQL Preprocessing and Dialect Handling. For details on INSERT buffering behavior, see INSERT Buffering System. For query planning internals, see Plan Structures.
Sources: llkv-sql/src/lib.rs:1-51 llkv-sql/README.md:1-68
SqlEngine Architecture Overview
The SqlEngine sits at the top of the LLKV SQL processing stack, coordinating parsing, planning, and execution:
Diagram: SqlEngine Position in SQL Processing Stack
graph TB
User["User Code"]
SqlEngine["SqlEngine\n(llkv-sql)"]
Parser["sqlparser\nAST Generation"]
Preprocessor["SQL Preprocessor\nDialect Normalization"]
Planner["Plan Translation\nAST → PlanStatement"]
Runtime["RuntimeEngine\n(llkv-runtime)"]
Executor["llkv-executor\nQuery Execution"]
Table["llkv-table\nTable Layer"]
User -->|execute sql| SqlEngine
User -->|sql select| SqlEngine
User -->|prepare sql| SqlEngine
SqlEngine --> Preprocessor
Preprocessor --> Parser
Parser --> Planner
Planner --> Runtime
Runtime --> Executor
Executor --> Table
SqlEngine -.->|owns| Runtime
SqlEngine -.->|manages| InsertBuffer["InsertBuffer\nBatching State"]
SqlEngine -.->|caches| StmtCache["statement_cache\nPreparedPlan Cache"]
The SqlEngine wraps a RuntimeEngine, manages statement caching and INSERT buffering, and provides convenience methods for single-statement queries (sql()) and multi-statement execution (execute()).
Sources: llkv-sql/src/sql_engine.rs:496-509 llkv-sql/src/lib.rs:1-51
Constructing a SqlEngine
Basic Construction
Diagram: SqlEngine Construction Flow
The most common constructor is SqlEngine::new(), which accepts a pager and creates a new runtime context:
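A hedged construction sketch; the import paths and the MemPager type are assumptions, while the constructor signature follows the table below.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn main() {
    // Buffering is disabled and default settings apply.
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    let _ = engine; // ready for execute() / sql() calls
}
```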
Sources: llkv-sql/src/sql_engine.rs:615-621
| Constructor | Signature | Purpose |
|---|---|---|
new() | new<Pg>(pager: Arc<Pg>) -> Self | Create engine with new runtime and default settings |
with_context() | with_context(context: Arc<SqlContext>, default_nulls_first: bool) -> Self | Create engine reusing an existing runtime context |
from_runtime_engine() | from_runtime_engine(engine: RuntimeEngine, default_nulls_first: bool, insert_buffering_enabled: bool) -> Self | Internal constructor for fine-grained control |
Table: SqlEngine Constructor Methods
Sources: llkv-sql/src/sql_engine.rs:543-556 llkv-sql/src/sql_engine.rs:615-621 llkv-sql/src/sql_engine.rs:879-885
Core Query Execution Methods
execute() - Multi-Statement Execution
The execute() method processes one or more SQL statements from a string and returns a vector of results:
Diagram: execute() Method Execution Flow
The execution pipeline:
- Preprocessing : SQL text undergoes dialect normalization via preprocess_sql_input() llkv-sql/src/sql_engine.rs:1556-1564
- Parsing : sqlparser converts normalized text to AST with a recursion limit of 200 llkv-sql/src/sql_engine.rs:324
- Statement Loop : Each statement is translated to a PlanStatement and either buffered (for INSERTs) or executed immediately
- Result Collection : Results are accumulated and returned as Vec<SqlStatementResult>
Sources: llkv-sql/src/sql_engine.rs:933-1044
sql() - Single SELECT Execution
The sql() method enforces single-statement SELECT semantics and returns Arrow RecordBatch results:
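A hedged usage sketch, with the import path assumed and error handling simplified:

```rust
use llkv_sql::SqlEngine; // assumed import path

/// Run a single SELECT and count rows across the returned Arrow batches.
fn count_rows(engine: &SqlEngine) -> Result<usize, Box<dyn std::error::Error>> {
    let batches = engine.sql("SELECT id, name FROM users")?; // Vec<RecordBatch>
    Ok(batches.iter().map(|b| b.num_rows()).sum())
}
```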
Key differences from execute():
- Accepts only a single statement
- Statement must be a SELECT query
- Returns Vec<RecordBatch> directly rather than RuntimeStatementResult
- Automatically collects streaming results
Sources: llkv-sql/src/sql_engine.rs:1046-1085
Prepared Statements
Prepared Statement Flow
Diagram: Prepared Statement Creation and Caching
prepare() Method
The prepare() method parses SQL with placeholders and caches the resulting plan:
Placeholder syntax supported:
- ? - Positional parameter (auto-increments)
- ?N - Numbered parameter (1-indexed)
- $N - PostgreSQL-style numbered parameter
- :name - Named parameter
Sources: llkv-sql/src/sql_engine.rs:1296-1376 llkv-sql/src/sql_engine.rs:86-132
Parameter Binding Mechanism
Parameter registration occurs via thread-local ParameterScope:
Diagram: Parameter Registration and Sentinel Generation
The parameter translation process:
- During parsing, placeholders are intercepted and converted to sentinel strings: __llkv_param__N__
- ParameterState tracks placeholder-to-index mappings in thread-local storage
- At execution time, sentinel strings are replaced with actual parameter values
Sources: llkv-sql/src/sql_engine.rs:71-206
SqlParamValue Type
Parameter values are represented by the SqlParamValue enum:
| Variant | SQL Type | Usage |
|---|---|---|
Null | NULL | SqlParamValue::Null |
Integer(i64) | INTEGER/BIGINT | SqlParamValue::from(42_i64) |
Float(f64) | FLOAT/DOUBLE | SqlParamValue::from(3.14_f64) |
Boolean(bool) | BOOLEAN | SqlParamValue::from(true) |
String(String) | TEXT/VARCHAR | SqlParamValue::from("text") |
Date32(i32) | DATE | SqlParamValue::from(18993_i32) |
Table: SqlParamValue Variants and Conversions
Sources: llkv-sql/src/sql_engine.rs:208-276
execute_prepared() Method
Execute a prepared statement with bound parameters:
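A hedged sketch of the call sequence: the handle returned by prepare() and the exact shape of execute_prepared() are assumptions, while the SqlParamValue conversions follow the table above.

```rust
use llkv_sql::{SqlEngine, SqlParamValue}; // assumed import paths

fn insert_user(engine: &SqlEngine, id: i64, name: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Placeholders are registered during prepare() and bound at execution time.
    let stmt = engine.prepare("INSERT INTO users (id, name) VALUES (?, ?)")?;
    engine.execute_prepared(&stmt, &[SqlParamValue::from(id), SqlParamValue::from(name)])?;
    Ok(())
}
```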
Parameter substitution occurs in two phases:
- Literal Substitution : Sentinels in Expr<String> are replaced via substitute_parameter_literals() llkv-sql/src/sql_engine.rs:1453-1497
- Plan Value Substitution : Sentinels in Vec<PlanValue> are replaced via substitute_parameter_plan_values() llkv-sql/src/sql_engine.rs:1499-1517
Sources: llkv-sql/src/sql_engine.rs:1378-1451
Transaction Control
The SqlEngine supports explicit transaction boundaries via SQL statements:
Diagram: Transaction State Machine
Transaction management methods:
| SQL Statement | Effect |
|---|---|
BEGIN | Start explicit transaction llkv-sql/src/sql_engine.rs:970-976 |
COMMIT | Finalize transaction and flush buffers llkv-sql/src/sql_engine.rs:977-983 |
ROLLBACK | Abort transaction and discard buffers llkv-sql/src/sql_engine.rs:984-990 |
Transaction boundaries automatically flush the INSERT buffer to ensure consistent visibility semantics.
Sources: llkv-sql/src/sql_engine.rs:970-990 llkv-sql/src/sql_engine.rs:912-914
INSERT Buffering System
Buffer Architecture
Diagram: INSERT Buffer Accumulation and Flush
InsertBuffer Structure
The InsertBuffer struct accumulates literal INSERT payloads:
Sources: llkv-sql/src/sql_engine.rs:421-471
Buffer Compatibility
An INSERT can join the buffer if it matches:
- Table Name : Target table must match buffer.table_name
- Column List : Columns must match buffer.columns exactly
- Conflict Action : on_conflict strategy must match
Sources: llkv-sql/src/sql_engine.rs:452-459
Flush Conditions
The buffer flushes when:
| Condition | Implementation |
|---|---|
| Size limit exceeded | total_rows >= MAX_BUFFERED_INSERT_ROWS (8192) llkv-sql/src/sql_engine.rs:414 |
| Incompatible INSERT | Table/columns/conflict mismatch llkv-sql/src/sql_engine.rs:1765-1799 |
| Transaction boundary | BEGIN, COMMIT, ROLLBACK detected llkv-sql/src/sql_engine.rs:970-990 |
| Non-INSERT statement | Any non-INSERT SQL statement llkv-sql/src/sql_engine.rs:991-1040 |
| Statement expectation | Test harness expectation registered llkv-sql/src/sql_engine.rs:1745-1760 |
| Manual flush | flush_pending_inserts() called llkv-sql/src/sql_engine.rs:1834-1850 |
Table: INSERT Buffer Flush Triggers
Enabling/Disabling Buffering
INSERT buffering is controlled by the set_insert_buffering() method:
- Disabled by default to maintain statement-level transaction semantics
- Enable for bulk loading to reduce planning overhead
- Disabling flushes buffer to ensure pending rows are persisted
Sources: llkv-sql/src/sql_engine.rs:898-905
classDiagram
class RuntimeStatementResult {<<enum>>\nSelect\nInsert\nUpdate\nDelete\nCreateTable\nDropTable\nCreateIndex\nDropIndex\nAlterTable\nCreateView\nDropView\nVacuum\nTransaction}
class SelectVariant {+SelectExecution execution}
class InsertVariant {+table_name: String\n+rows_inserted: usize}
class UpdateVariant {+table_name: String\n+rows_updated: usize}
RuntimeStatementResult --> SelectVariant
RuntimeStatementResult --> InsertVariant
RuntimeStatementResult --> UpdateVariant
Result Types
RuntimeStatementResult
The execute() and execute_prepared() methods return Vec<RuntimeStatementResult>:
Diagram: RuntimeStatementResult Variants
Key result variants:
| Variant | Fields | Returned By |
|---|---|---|
Select | SelectExecution | SELECT queries |
Insert | table_name: String, rows_inserted: usize | INSERT statements |
Update | table_name: String, rows_updated: usize | UPDATE statements |
Delete | table_name: String, rows_deleted: usize | DELETE statements |
CreateTable | table_name: String | CREATE TABLE |
CreateIndex | index_name: String, table_name: String | CREATE INDEX |
Table: Common RuntimeStatementResult Variants
Sources: llkv-sql/src/lib.rs:49
SelectExecution
SELECT queries return a SelectExecution handle for streaming results:
The sql() method automatically collects batches via execution.collect():
Sources: llkv-sql/src/sql_engine.rs:1065-1080
Configuration Methods
session() - Access Runtime Session
Provides access to the underlying RuntimeSession for transaction introspection or advanced error handling.
Sources: llkv-sql/src/sql_engine.rs:917-919
context_arc() - Access Runtime Context
Internal method to retrieve the shared RuntimeContext for engine composition.
Sources: llkv-sql/src/sql_engine.rs:875-877
Testing Utilities
StatementExpectation
Test harnesses can register expectations to control buffer flushing:
When a statement expectation is registered, the INSERT buffer flushes before executing that statement to ensure test assertions observe correct row counts.
Sources: llkv-sql/src/sql_engine.rs:64-315
Example Usage
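The original inline example is not reproduced here; the following hedged sketch shows typical end-to-end usage, with the import paths and the MemPager type as assumptions and error handling simplified.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;           // assumed import path
use llkv_storage::pager::MemPager; // assumed in-memory pager type

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // DDL and DML go through execute(), which accepts multiple statements.
    engine.execute("CREATE TABLE users (id BIGINT, name TEXT)")?;
    engine.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")?;

    // Single SELECTs can use sql(), which returns Arrow RecordBatches.
    let batches = engine.sql("SELECT name FROM users WHERE id = 1")?;
    for batch in &batches {
        println!("{} row(s)", batch.num_rows());
    }
    Ok(())
}
```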
Sources: llkv-sql/src/sql_engine.rs:299-309
Thread Safety and Cloning
The SqlEngine implements Clone with special semantics:
Warning : Cloning a SqlEngine creates a new RuntimeSession, not a shared reference. Each clone has independent transaction state and INSERT buffers.
Sources: llkv-sql/src/sql_engine.rs:522-540
Error Handling
Table Not Found Errors
The SqlEngine remaps generic errors to user-friendly catalog errors:
This converts low-level NotFound errors into: Catalog Error: Table 'tablename' does not exist
Sources: llkv-sql/src/sql_engine.rs:558-585
SQL Preprocessing and Dialect Handling
Relevant source files
Purpose and Scope
SQL preprocessing is the first stage in LLKV's query processing pipeline, responsible for normalizing SQL syntax from various dialects before the statement reaches the parser. This system allows LLKV to accept SQL written for SQLite, DuckDB, and TPC-H tooling while using the standard sqlparser library, which has limited dialect support.
The preprocessing layer transforms dialect-specific syntax into forms that sqlparser can parse, enabling compatibility with SQL Logic Tests and real-world SQL scripts without modifying the parser itself. This page documents the preprocessing transformations and their implementation.
For information about what happens after preprocessing (SQL parsing and plan generation), see SQL Query Processing Pipeline. For details on the SqlEngine API that invokes preprocessing, see SqlEngine API.
Preprocessing in the Query Pipeline
SQL preprocessing occurs immediately before parsing in both the execute and prepare code paths. The following diagram shows where preprocessing fits in the overall query execution flow:
Diagram: SQL Preprocessing Pipeline Position
flowchart TB
Input["SQL String Input"]
Preprocess["preprocess_sql_input()"]
Parse["sqlparser::Parser::parse()"]
Plan["Plan Generation"]
Execute["Query Execution"]
Input --> Preprocess
Preprocess --> Parse
Parse --> Plan
Plan --> Execute
subgraph "Preprocessing Transformations"
direction TB
TPC["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → CREATE DOMAIN"]
Exclude["EXCLUDE syntax normalization"]
Trailing["Trailing comma removal"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint stripping"]
Reindex["REINDEX → VACUUM REINDEX"]
BareTable["Bare table IN expansion"]
TPC --> CreateType
CreateType --> Exclude
Exclude --> Trailing
Trailing --> BareTable
BareTable --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
end
Preprocess -.chains.-> TPC
Reindex -.final.-> Parse
Sources: llkv-sql/src/sql_engine.rs:936-1001
The preprocess_sql_input method chains all dialect transformations in a specific order, with each transformation receiving the output of the previous one. If parsing fails after preprocessing and the SQL contains CREATE TRIGGER, a fallback preprocessor (preprocess_sqlite_trigger_shorthand) is applied before retrying the parse.
Diagram: Preprocessing Execution Sequence with Fallback
sequenceDiagram
participant Caller
participant SqlEngine
participant Preprocess as "preprocess_sql_input"
participant Parser as "sqlparser"
participant Fallback as "preprocess_sqlite_trigger_shorthand"
Caller->>SqlEngine: execute(sql)
SqlEngine->>Preprocess: preprocess(sql)
Note over Preprocess: Chain all transformations
Preprocess-->>SqlEngine: processed_sql
SqlEngine->>Parser: parse(processed_sql)
alt Parse Success
Parser-->>SqlEngine: AST
else Parse Error + "CREATE TRIGGER"
Parser-->>SqlEngine: ParseError
SqlEngine->>Fallback: expand_trigger_syntax(processed_sql)
Fallback-->>SqlEngine: expanded_sql
SqlEngine->>Parser: parse(expanded_sql)
Parser-->>SqlEngine: AST or Error
end
SqlEngine-->>Caller: Results
Sources: llkv-sql/src/sql_engine.rs:936-958
Supported Dialect Transformations
LLKV implements nine distinct preprocessing transformations, each targeting specific dialect compatibility issues. The following table summarizes each transformation:
| Preprocessor | Dialect | Purpose | Method |
|---|---|---|---|
| TPC-H CONNECT | TPC-H | Strip CONNECT TO database; statements | preprocess_tpch_connect_syntax |
| CREATE TYPE | DuckDB | Convert CREATE TYPE to CREATE DOMAIN | preprocess_create_type_syntax |
| EXCLUDE Syntax | General | Quote qualified identifiers in EXCLUDE clauses | preprocess_exclude_syntax |
| Trailing Commas | DuckDB | Remove trailing commas in VALUES | preprocess_trailing_commas_in_values |
| Empty IN Lists | SQLite | Convert IN () to constant expressions | preprocess_empty_in_lists |
| Index Hints | SQLite | Strip INDEXED BY and NOT INDEXED | preprocess_index_hints |
| REINDEX | SQLite | Convert REINDEX to VACUUM REINDEX | preprocess_reindex_syntax |
| Bare Table IN | SQLite | Expand IN table to IN (SELECT * FROM table) | preprocess_bare_table_in_clauses |
| Trigger Shorthand | SQLite | Add AFTER and FOR EACH ROW to triggers | preprocess_sqlite_trigger_shorthand |
Sources: llkv-sql/src/sql_engine.rs:628-842 llkv-sql/src/sql_engine.rs:992-1001
TPC-H CONNECT Statement Removal
The TPC-H benchmark tooling generates CONNECT TO <database>; directives in referential integrity scripts. Since LLKV operates within a single database context, these statements are treated as no-ops and stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:623-630
CREATE TYPE to CREATE DOMAIN Conversion
DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser only supports the SQL standard CREATE DOMAIN syntax. This preprocessor converts the DuckDB syntax to the standard form.
Transformation:
Implementation: Uses static regex patterns initialized via OnceLock for thread-safe lazy compilation.
Sources: llkv-sql/src/sql_engine.rs:634-657
EXCLUDE Syntax Normalization
When EXCLUDE clauses contain qualified identifiers (e.g., schema.table.column), sqlparser requires them to be quoted. This preprocessor wraps qualified names in double quotes.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:659-676
Trailing Comma Removal in VALUES
DuckDB permits trailing commas in VALUES clauses like VALUES ('v2',), but sqlparser rejects them. This preprocessor removes trailing commas before closing parentheses.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:678-689
Empty IN List Handling
SQLite allows degenerate IN () and NOT IN () expressions. Since sqlparser rejects these, the preprocessor converts them to constant boolean expressions while preserving the original expression evaluation (in case of side effects).
Transformation:
The pattern matches various expression forms: parenthesized expressions, quoted strings, hex literals, identifiers, and numbers.
Sources: llkv-sql/src/sql_engine.rs:691-720
Index Hint Stripping
SQLite supports query optimizer hints like FROM table INDEXED BY index_name and FROM table NOT INDEXED. Since sqlparser doesn't support this syntax and LLKV makes its own index decisions, these hints are stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:722-739
REINDEX to VACUUM REINDEX Conversion
SQLite supports REINDEX index_name as a standalone statement, but sqlparser only recognizes REINDEX as part of VACUUM syntax. This preprocessor converts the standalone form.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:741-757
Bare Table IN Clause Expansion
SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor expands this shorthand to the subquery form that sqlparser requires.
Transformation:
The pattern avoids matching IN ( which is already a valid subquery.
Sources: llkv-sql/src/sql_engine.rs:844-873
SQLite Trigger Shorthand Expansion
SQLite allows omitting the trigger timing (defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). sqlparser requires both to be explicit. This preprocessor injects the missing clauses.
Transformation:
This is a fallback preprocessor that only runs if initial parsing fails and the SQL contains CREATE TRIGGER. The implementation uses complex regex patterns to handle optional dotted identifiers with various quoting styles.
Sources: llkv-sql/src/sql_engine.rs:759-842 llkv-sql/src/sql_engine.rs:944-957
graph TB
subgraph "SqlEngine Methods"
Execute["execute(sql)"]
Prepare["prepare(sql)"]
PreprocessInput["preprocess_sql_input(sql)"]
end
subgraph "Static Regex Patterns"
CreateTypeRE["CREATE_TYPE_REGEX"]
DropTypeRE["DROP_TYPE_REGEX"]
ExcludeRE["EXCLUDE_REGEX"]
TrailingRE["TRAILING_COMMA_REGEX"]
EmptyInRE["EMPTY_IN_REGEX"]
IndexHintRE["INDEX_HINT_REGEX"]
ReindexRE["REINDEX_REGEX"]
BareTableRE["BARE_TABLE_IN_REGEX"]
TimingRE["TIMING_REGEX"]
ForEachBeginRE["FOR_EACH_BEGIN_REGEX"]
ForEachWhenRE["FOR_EACH_WHEN_REGEX"]
end
subgraph "Preprocessor Methods"
TPC["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
Trailing["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
BareTable["preprocess_bare_table_in_clauses"]
Trigger["preprocess_sqlite_trigger_shorthand"]
end
Execute --> PreprocessInput
Prepare --> PreprocessInput
PreprocessInput --> TPC
PreprocessInput --> CreateType
PreprocessInput --> Exclude
PreprocessInput --> Trailing
PreprocessInput --> BareTable
PreprocessInput --> EmptyIn
PreprocessInput --> IndexHints
PreprocessInput --> Reindex
CreateType -.uses.-> CreateTypeRE
CreateType -.uses.-> DropTypeRE
Exclude -.uses.-> ExcludeRE
Trailing -.uses.-> TrailingRE
EmptyIn -.uses.-> EmptyInRE
IndexHints -.uses.-> IndexHintRE
Reindex -.uses.-> ReindexRE
BareTable -.uses.-> BareTableRE
Trigger -.uses.-> TimingRE
Trigger -.uses.-> ForEachBeginRE
Trigger -.uses.-> ForEachWhenRE
Implementation Architecture
The preprocessing system is implemented using a combination of regex transformations and string manipulation. The following diagram shows the key components:
Diagram: Preprocessing Implementation Components
Sources: llkv-sql/src/sql_engine.rs:640-842 llkv-sql/src/sql_engine.rs:992-1001
Regex Pattern Management
All regex patterns are stored in OnceLock static variables for thread-safe lazy initialization. This ensures patterns are compiled once per process and reused across all preprocessing operations, avoiding the overhead of repeated compilation.
Pattern Initialization Example:
The patterns use case-insensitive matching ((?i)) and word boundaries (\b) to avoid false matches within identifiers or string literals.
Sources: llkv-sql/src/sql_engine.rs:640-650 llkv-sql/src/sql_engine.rs:661-669 llkv-sql/src/sql_engine.rs:682-686
Preprocessing Order
The order of transformations matters because later transformations may depend on earlier ones. The current order:
- TPC-H CONNECT removal - Must happen first to remove non-SQL directives
- CREATE TYPE conversion - Normalizes DDL before other transformations
- EXCLUDE syntax - Handles qualified names in projection lists
- Trailing comma removal - Fixes VALUES clause syntax
- Bare table IN expansion - Converts shorthand to subqueries before empty IN check
- Empty IN handling - Must come after bare table expansion to avoid conflicts
- Index hint stripping - Removes query hints from FROM clauses
- REINDEX conversion - Must be last to avoid interfering with VACUUM statements
Sources: llkv-sql/src/sql_engine.rs:992-1001
Parser Integration
The preprocessed SQL is passed to sqlparser with a custom recursion limit to handle deeply nested queries from test suites:
The default sqlparser recursion limit (50) is insufficient for some SQLite test suite queries, so LLKV uses 200 to balance compatibility with stack safety.
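A hedged sketch of the parser invocation; the dialect choice and exact builder chain are assumptions that may differ from llkv-sql's code and across sqlparser versions, while the limit of 200 comes from the text above.

```rust
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

/// Parse preprocessed SQL with a raised recursion limit.
fn parse(sql: &str) -> Result<Vec<sqlparser::ast::Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(200) // default of 50 is too low for some test-suite queries
        .try_with_sql(sql)?
        .parse_statements()
}
```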
Sources: llkv-sql/src/sql_engine.rs:318-324
Testing and Validation
The preprocessing transformations are validated through:
- SQL Logic Tests (SLT) - The llkv-slt-tester runs thousands of SQLite test cases that exercise various dialect features
- TPC-H Benchmarks - The llkv-tpch crate verifies compatibility with TPC-H SQL scripts
- Unit Tests - Individual preprocessor functions are tested in isolation
The preprocessing system is designed to be conservative: it only transforms patterns that are known to cause parser errors, and it preserves the original SQL semantics whenever possible.
Sources: llkv-sql/src/sql_engine.rs:623-1001 llkv-sql/Cargo.toml:1-34
Future Considerations
The preprocessing approach is a pragmatic solution that enables broad dialect compatibility without modifying sqlparser. However, it has limitations:
- Fragile regex patterns - Complex transformations like trigger shorthand expansion use intricate regex that may not handle all edge cases
- Limited context awareness - String-based transformations cannot distinguish between SQL keywords and string literals containing those keywords
- Maintenance burden - Each new dialect feature requires a new preprocessor
The long-term solution is to contribute dialect-specific parsing improvements back to sqlparser, eliminating the need for preprocessing. The trigger shorthand transformation includes a TODO comment noting that proper SQLite dialect support in sqlparser would eliminate that preprocessor entirely.
Sources: llkv-sql/src/sql_engine.rs:765-770
INSERT Buffering System
Relevant source files
The INSERT Buffering System is an optimization layer within llkv-sql that batches multiple consecutive INSERT ... VALUES statements for the same table into a single execution plan. This dramatically reduces planning overhead when bulk-loading data from SQL scripts containing thousands of individual INSERT statements. The system preserves per-statement result semantics while amortizing the cost of plan construction and table access across large batches.
For information about how INSERT plans are structured and executed, see Plan Structures.
Purpose and Design Goals
The buffering system addresses a specific performance bottleneck: SQL scripts generated by database export tools often contain tens of thousands of individual INSERT INTO table VALUES (...) statements. Without buffering, each statement incurs the full cost of parsing, planning, catalog lookup, and MVCC overhead. The buffer accumulates compatible INSERT statements and flushes them as a single batch, achieving order-of-magnitude throughput improvements for bulk ingestion workloads.
Key design constraints:
- Optional : Disabled by default to preserve immediate visibility semantics for unit tests and interactive workloads
- Transparent : Callers receive per-statement results as if each INSERT executed independently
- Safe : Flushes automatically at transaction boundaries, table changes, and buffer size limits
- Compatible : Integrates with statement expectation mechanisms used by the SQL Logic Test harness
Sources: llkv-sql/src/sql_engine.rs:410-520
Architecture Overview
Figure 1: INSERT Buffering Architecture
The system operates as a stateful accumulator within SqlEngine. Incoming INSERT statements are classified as either PreparedInsert::Values (bufferable literals) or PreparedInsert::Immediate (non-bufferable subqueries or expressions). Compatible VALUES inserts accumulate in the buffer until a flush trigger fires, at which point the buffer constructs a single InsertPlan and emits individual RuntimeStatementResult::Insert entries for each original statement.
Sources: llkv-sql/src/sql_engine.rs:416-509
Buffer Data Structures
InsertBuffer
Figure 2: Buffer Data Structure
The InsertBuffer struct maintains five critical pieces of state:
| Field | Type | Purpose |
|---|---|---|
table_name | String | Target table identifier for compatibility checking |
columns | Vec<String> | Column list; must match for batching |
on_conflict | InsertConflictAction | Conflict resolution policy; must match for batching |
total_rows | usize | Sum of all buffered rows across statements |
statement_row_counts | Vec<usize> | Per-statement row counts for result construction |
rows | Vec<Vec<PlanValue>> | Literal row payloads in execution order |
The statement_row_counts vector preserves the boundary between original INSERT statements so that flush_buffer_results() can emit one RuntimeStatementResult::Insert per statement with the correct row count.
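A hedged structural sketch mirroring the table above; PlanValue and InsertConflictAction are stand-in placeholder types here, not the real llkv-plan definitions.

```rust
// Placeholder types standing in for the real llkv-plan definitions.
struct PlanValue;
enum InsertConflictAction { None, Ignore, Replace }

/// Field names and types follow the table above.
struct InsertBufferSketch {
    table_name: String,                 // compatibility key: target table
    columns: Vec<String>,               // compatibility key: column list
    on_conflict: InsertConflictAction,  // compatibility key: conflict policy
    total_rows: usize,                  // sum of all buffered rows
    statement_row_counts: Vec<usize>,   // per-statement boundaries for result emission
    rows: Vec<Vec<PlanValue>>,          // literal payloads in execution order
}
```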
Sources: llkv-sql/src/sql_engine.rs:421-471
PreparedInsert Classification
Figure 3: INSERT Classification Flow
The prepare_insert() method analyzes each INSERT statement and returns PreparedInsert::Values only when the source is a literal VALUES clause or a SELECT that evaluates to constants (e.g., SELECT 1, 'foo'). All other forms—subqueries referencing tables, expressions requiring runtime evaluation, or DEFAULT VALUES—become PreparedInsert::Immediate and bypass buffering.
Sources: llkv-sql/src/sql_engine.rs:473-487
Buffer Lifecycle and Flush Triggers
Flush Conditions
The buffer flushes automatically when any of the following conditions occur:
| Trigger | Constant | Description |
|---|---|---|
| Size limit | MAX_BUFFERED_INSERT_ROWS = 8192 | Total buffered rows exceeds threshold |
| Incompatible INSERT | N/A | Different table, columns, or conflict action |
| Non-INSERT statement | N/A | Any DDL, DML (UPDATE/DELETE), or SELECT |
| Transaction boundary | N/A | BEGIN, COMMIT, or ROLLBACK |
| Statement expectation | StatementExpectation::Error or Count(n) | Test harness expects specific outcome |
| Manual flush | N/A | flush_pending_inserts() called explicitly |
| Engine drop | N/A | SqlEngine destructor invoked |
Sources: llkv-sql/src/sql_engine.rs:414-1127
Buffer State Machine
Figure 4: Buffer State Machine
The buffer exists in one of three states: Empty (no buffer allocated), Buffering (accumulating rows), or Flushing (emitting results). Transitions from Buffering to Flushing occur automatically based on the triggers listed above. After flushing, the state returns to Empty unless a new compatible INSERT immediately follows, in which case a fresh buffer is allocated.
Sources: llkv-sql/src/sql_engine.rs:514-1201
Integration with SqlEngine::execute()
Figure 5: Execute Loop with Buffer Integration
The execute() method iterates through parsed statements, dispatching INSERT statements to buffer_insert() and all other statements to execute_statement() after flushing. This ensures that the buffer never holds rows across non-INSERT operations or transaction boundaries.
Sources: llkv-sql/src/sql_engine.rs:933-990
buffer_insert() Implementation Details
Decision Flow
Figure 6: buffer_insert() Decision Tree
The buffer_insert() method performs three levels of gating:
- Expectation check : If the SLT harness expects an error or specific row count, bypass buffering entirely
- Buffering enabled check : If insert_buffering_enabled is false, execute immediately
- Compatibility check : If the INSERT is incompatible with the current buffer, flush and start a new buffer
Sources: llkv-sql/src/sql_engine.rs:1101-1201
Compatibility Rules
An INSERT can be added to the existing buffer if and only if:
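A minimal sketch of that rule, with the concrete conflict-action type abstracted behind a PartialEq bound (the real check is InsertBuffer::can_accept):

```rust
/// Returns true when an incoming INSERT may join the current buffer.
/// Each tuple is (table_name, column_list, conflict_action).
fn can_accept<C: PartialEq>(buffer: (&str, &[String], &C), incoming: (&str, &[String], &C)) -> bool {
    buffer.0 == incoming.0        // same target table
        && buffer.1 == incoming.1 // identical column names, in the same order
        && buffer.2 == incoming.2 // identical conflict resolution policy
}
```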
This ensures that all buffered statements can be collapsed into a single InsertPlan with uniform semantics. Different column orderings, conflict actions, or target tables require separate batches.
Sources: llkv-sql/src/sql_engine.rs:452-459
Statement Expectation Handling
The SQL Logic Test harness uses thread-local expectations to signal that a specific statement should produce an error or affect a precise number of rows. The buffering system respects these hints by forcing immediate execution when expectations are present:
Figure 7: Statement Expectation Flow
graph TB
SLTHarness["SLT Harness"]
RegisterExpectation["register_statement_expectation()"]
ThreadLocal["PENDING_STATEMENT_EXPECTATIONS\nthread_local!"]
Execute["SqlEngine::execute()"]
NextExpectation["next_statement_expectation()"]
BufferInsert["buffer_insert()"]
SLTHarness -->|before statement| RegisterExpectation
RegisterExpectation --> ThreadLocal
Execute --> NextExpectation
NextExpectation --> ThreadLocal
NextExpectation --> BufferInsert
BufferInsert -->|Error or Count| ImmediateExec["Execute immediately\nbypass buffer"]
BufferInsert -->|Ok| MayBuffer["May buffer if enabled"]
When next_statement_expectation() returns StatementExpectation::Error or StatementExpectation::Count(n), the buffer_insert() method sets execute_immediately = true and flushes any existing buffer before executing the current statement. This preserves test correctness while still allowing buffering for the majority of statements that have no expectations.
Sources: llkv-sql/src/sql_engine.rs:64-1127
sequenceDiagram
participant Caller
participant FlushBuffer as "flush_buffer_results()"
participant Buffer as "InsertBuffer"
participant PlanStmt as "PlanStatement::Insert"
participant Runtime as "RuntimeEngine"
Caller->>FlushBuffer: flush_buffer_results()
FlushBuffer->>Buffer: Take buffer from RefCell
alt Buffer is None
FlushBuffer-->>Caller: Ok(Vec::new())
else Buffer has data
FlushBuffer->>PlanStmt: Construct InsertPlan\n(table, columns, rows, on_conflict)
FlushBuffer->>Runtime: execute_statement(plan)
Runtime-->>FlushBuffer: RuntimeStatementResult::Insert\n(total_rows_inserted)
Note over FlushBuffer: Verify total_rows matches sum(statement_row_counts)
loop For each statement_row_count
FlushBuffer->>FlushBuffer: Create RuntimeStatementResult::Insert\n(statement_rows)
end
FlushBuffer-->>Caller: Vec<SqlStatementResult>
end
flush_buffer_results() Mechanics
The flush operation reconstructs per-statement results from the accumulated buffer state:
Figure 8: Flush Sequence
The flush process:
- Takes ownership of the buffer from the RefCell
- Constructs a single InsertPlan with all buffered rows
- Executes the plan via the runtime
- Splits the total row count across the original statements using statement_row_counts
- Returns a vector of per-statement results
This ensures that callers receive results as if each INSERT executed independently, even though the runtime processed them as a single batch.
Sources: llkv-sql/src/sql_engine.rs:2094-2169 (Note: The flush implementation is in the broader file, exact line range may vary)
Performance Characteristics
Throughput Improvement
Buffering provides dramatic performance gains for bulk INSERT workloads:
| Scenario | Without Buffering | With Buffering | Speedup |
|---|---|---|---|
| 10,000 single-row INSERTs | ~30 seconds | ~2 seconds | ~15x |
| 1,000 ten-row INSERTs | ~5 seconds | ~0.5 seconds | ~10x |
| 100,000 single-row INSERTs | Several minutes | ~15 seconds | >10x |
The improvement stems from:
- Amortized planning : One plan for 8,192 rows instead of 8,192 plans
- Batch MVCC overhead : Single transaction coordinator call instead of thousands
- Reduced catalog lookups : One schema resolution instead of per-statement lookups
- Vectorized column operations : Arrow batch processing instead of row-by-row appends
Sources: llkv-sql/README.md:36-41
Memory Usage
The buffer is bounded at MAX_BUFFERED_INSERT_ROWS = 8192 rows. Peak memory usage depends on the row width:
Peak Memory = MAX_BUFFERED_INSERT_ROWS × (Σ column_size + MVCC_overhead)
For a typical table with 10 columns averaging 50 bytes each:
8,192 rows × (10 columns × 50 bytes + 24 bytes MVCC) ≈ 4.3 MB
This predictable ceiling makes buffering safe for long-running workloads without risking unbounded memory growth.
Sources: llkv-sql/src/sql_engine.rs:410-414
API Surface
Enabling and Disabling
The set_insert_buffering(false) call automatically flushes any pending rows before disabling, ensuring visibility guarantees.
Sources: llkv-sql/src/sql_engine.rs:887-905
Manual Flush
Manual flushes are useful when the caller needs to checkpoint progress or ensure specific INSERT statements are visible before proceeding.
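A hedged usage sketch; the flush_pending_inserts() signature is an assumption, while the method name appears in the flush-trigger table above.

```rust
use llkv_sql::SqlEngine; // assumed import path

/// Checkpoint buffered INSERTs so they become visible to subsequent queries.
fn checkpoint(engine: &SqlEngine) -> Result<(), Box<dyn std::error::Error>> {
    engine.flush_pending_inserts()?;
    Ok(())
}
```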
Sources: llkv-sql/src/sql_engine.rs:1003-1010
Drop Hook
The SqlEngine destructor automatically flushes the buffer to prevent data loss:
This ensures that buffered rows are persisted even if the caller forgets to flush explicitly.
Sources: llkv-sql/src/sql_engine.rs:513-520
Limitations and Edge Cases
Non-Bufferable INSERT Forms
The following INSERT patterns always execute immediately:
- INSERT ... SELECT with table references
- INSERT ... DEFAULT VALUES
- INSERT with expressions requiring runtime evaluation (e.g., NOW(), RANDOM())
- INSERT with parameters or placeholders
These patterns cannot be safely batched because their semantics depend on execution context.
Transaction Isolation
The buffer flushes at transaction boundaries (BEGIN, COMMIT, ROLLBACK) to preserve isolation semantics. This means:
The first INSERT's visibility is not guaranteed until the BEGIN statement forces a flush.
Conflict Handling
All buffered statements must share the same InsertConflictAction. Mixing ON CONFLICT IGNORE and ON CONFLICT REPLACE requires separate batches:
Sources: llkv-sql/src/sql_engine.rs:452-1201
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Query Planning
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/README.md
- llkv-executor/src/lib.rs
- llkv-plan/README.md
- llkv-plan/src/plans.rs
- llkv-sql/README.md
- llkv-sql/src/sql_engine.rs
Purpose and Scope
Query planning is the layer that translates parsed SQL statements into strongly-typed plan structures that can be executed by the runtime engine. The llkv-plan crate defines these plan types and provides utilities for representing queries, expressions, and subquery relationships in a form that execution layers can consume without re-parsing SQL.
This page covers the plan structures themselves and how they are constructed from SQL input. For information about how expressions within plans are evaluated, see Expression System. For details on subquery correlation tracking and placeholder generation, see Subquery and Correlation Handling. For execution of these plans, see Query Execution.
Plan Structures Overview
The planning layer defines distinct plan types for each category of SQL statement. All plan types are defined in llkv-plan/src/plans.rs and flow through the PlanStatement enum for execution dispatch.
Core Plan Types
| Plan Type | Purpose | Key Fields |
|---|---|---|
SelectPlan | Query execution | tables, projections, filter, joins, aggregates, order_by |
InsertPlan | Row insertion | table, columns, source, on_conflict |
UpdatePlan | Row updates | table, assignments, filter |
DeletePlan | Row deletion | table, filter |
CreateTablePlan | Table creation | name, columns, source, foreign_keys |
CreateIndexPlan | Index creation | table, columns, unique |
CreateViewPlan | View creation | name, view_definition, select_plan |
Sources: llkv-plan/src/plans.rs:177-256 llkv-plan/src/plans.rs:640-655 llkv-plan/src/plans.rs:662-667 llkv-plan/src/plans.rs:687-692
SelectPlan Structure
Diagram: SelectPlan Component Structure
The SelectPlan struct at llkv-plan/src/plans.rs:800-825 contains all information needed to execute a SELECT query. It separates table references, join specifications, projections, filters, aggregations, and ordering to allow execution layers to optimize each phase independently.
Sources: llkv-plan/src/plans.rs:27-67 llkv-plan/src/plans.rs:794-825
SQL-to-Plan Translation Pipeline
The translation from SQL text to plan structures occurs in SqlEngine within the llkv-sql crate. The process involves multiple stages to handle dialect differences and build strongly-typed plans.
Diagram: SQL-to-Plan Translation Flow
sequenceDiagram
participant User
participant SqlEngine as "SqlEngine"
participant Preprocessor as "SQL Preprocessing"
participant Parser as "sqlparser::Parser"
participant Translator as "Plan Translator"
participant Runtime as "RuntimeEngine"
User->>SqlEngine: execute(sql_text)
SqlEngine->>Preprocessor: preprocess_sql_input()
Note over Preprocessor: Strip CONNECT TO\nNormalize CREATE TYPE\nFix EXCLUDE syntax\nExpand IN clauses
Preprocessor-->>SqlEngine: processed_sql
SqlEngine->>Parser: Parser::parse_sql()
Parser-->>SqlEngine: Vec<Statement> (AST)
loop "For each Statement"
SqlEngine->>Translator: translate_statement()
alt "INSERT statement"
Translator->>Translator: translate_insert()
Note over Translator: Parse VALUES/SELECT\nNormalize conflict action\nBuild PreparedInsert
Translator-->>SqlEngine: PreparedInsert
SqlEngine->>SqlEngine: buffer_insert()\nor flush immediately
else "SELECT statement"
Translator->>Translator: translate_select()
Note over Translator: Build SelectPlan\nTranslate expressions\nTrack subqueries
Translator-->>SqlEngine: SelectPlan
else "UPDATE/DELETE"
Translator->>Translator: translate_update()/delete()
Translator-->>SqlEngine: UpdatePlan/DeletePlan
else "DDL statement"
Translator->>Translator: translate_create_table()\ncreate_index(), etc.
Translator-->>SqlEngine: CreateTablePlan/etc.
end
SqlEngine->>Runtime: execute_statement(plan)
Runtime-->>SqlEngine: RuntimeStatementResult
end
SqlEngine-->>User: Vec<RuntimeStatementResult>
Sources: llkv-sql/src/sql_engine.rs:933-958 llkv-sql/src/sql_engine.rs:628-757
Statement Translation Functions
The SqlEngine contains dedicated translation methods for each statement type:
| sqlparser AST | Translation Method | Output Plan | Location |
|---|---|---|---|
Statement::Query | translate_select() | SelectPlan | llkv-sql/src/sql_engine.rs:2162-2578 |
Statement::Insert | translate_insert() | InsertPlan | llkv-sql/src/sql_engine.rs:3194-3423 |
Statement::Update | translate_update() | UpdatePlan | llkv-sql/src/sql_engine.rs:3560-3704 |
Statement::Delete | translate_delete() | DeletePlan | llkv-sql/src/sql_engine.rs:3706-3783 |
Statement::CreateTable | translate_create_table() | CreateTablePlan | llkv-sql/src/sql_engine.rs:4081-4465 |
Statement::CreateIndex | translate_create_index() | CreateIndexPlan | llkv-sql/src/sql_engine.rs:4575-4766 |
Sources: llkv-sql/src/sql_engine.rs:974-1067
SELECT Translation Details
The translate_select() method at llkv-sql/src/sql_engine.rs:2162 performs the following operations:
- Extract table references from FROM clause into Vec<TableRef>
- Parse join specifications into Vec<JoinMetadata> structures
- Translate WHERE clause to Expr<String> and discover correlated subqueries
- Process projections into Vec<SelectProjection> with computed expressions
- Handle aggregates by extracting AggregateExpr from projections and HAVING
- Translate GROUP BY clause to canonical column names
- Process ORDER BY into Vec<OrderByPlan> with sort specifications
- Handle compound queries (UNION/INTERSECT/EXCEPT) via CompoundSelectPlan
Sources: llkv-sql/src/sql_engine.rs:2162-2578
Expression Representation in Plans
Plans use two forms of expressions from the llkv-expr crate:
- Expr<String>: Boolean predicates using unresolved column names (as strings)
- ScalarExpr<String>: Scalar expressions (also with string column references)
graph LR
SQL["SQL: WHERE age > 18"]
AST["sqlparser AST\nBinaryExpr"]
ExprString["Expr<String>\nCompare(Column('age'), Gt, Literal(18))"]
ExprFieldId["Expr<FieldId>\nCompare(Column(field_7), Gt, Literal(18))"]
Bytecode["EvalProgram\nStack-based bytecode"]
SQL --> AST
AST --> ExprString
ExprString --> ExprFieldId
ExprFieldId --> Bytecode
ExprString -.stored in plan.-> SelectPlan
ExprFieldId -.resolved at execution.-> Executor
Bytecode -.compiled for evaluation.-> Table
These string-based expressions are later resolved to Expr<FieldId> and ScalarExpr<FieldId> during execution when the catalog provides field mappings. This two-stage approach separates planning from schema resolution.
Diagram: Expression Evolution Through Planning and Execution
The translation from SQL expressions to Expr<String> happens in llkv-sql/src/sql_engine.rs:1647-1947. The resolution to Expr<FieldId> occurs in the executor's translate_predicate() function at llkv-executor/src/translation/predicate.rs
Sources: llkv-expr/src/expr.rs llkv-sql/src/sql_engine.rs:1647-1947 llkv-plan/src/plans.rs:28-34
Join Planning
Join specifications are represented in two components:
JoinMetadata Structure
The JoinMetadata struct at llkv-plan/src/plans.rs:781-792 captures a single join between consecutive tables:
- left_table_index: Index into SelectPlan.tables vector for the left table
- join_type: One of Inner, Left, Right, or Full
- on_condition: Optional ON clause filter expression
JoinPlan Types
The JoinPlan enum at llkv-plan/src/plans.rs:763-773 defines supported join semantics:
Diagram: JoinPlan Variants
The executor converts JoinPlan to llkv_join::JoinType during execution. When SelectPlan.joins is empty but multiple tables exist, the executor performs a Cartesian product (cross join).
Sources: llkv-plan/src/plans.rs:758-792 llkv-executor/src/lib.rs:542-554
Aggregation Planning
Aggregates are represented through the AggregateExpr structure defined at llkv-plan/src/plans.rs:1025-1102:
Aggregate Function Types
Diagram: AggregateFunction Variants
GROUP BY Handling
When a SELECT contains a GROUP BY clause:
- Column names from GROUP BY are stored in SelectPlan.group_by as canonical strings
- Aggregate expressions are collected in SelectPlan.aggregates
- Non-aggregate projections must reference GROUP BY columns
- HAVING clause (if present) is stored in SelectPlan.having as Expr<String>
The executor groups rows based on group_by columns, evaluates aggregates within each group, and applies the HAVING filter to group results.
Sources: llkv-plan/src/plans.rs:1025-1102 llkv-executor/src/lib.rs:1185-1597
Subquery Representation
Subqueries appear in two contexts within plans:
Filter Subqueries
FilterSubquery at llkv-plan/src/plans.rs:36-45 represents correlated subqueries used in WHERE/HAVING predicates via Expr::Exists:
- id: Unique identifier matching Expr::Exists(SubqueryId)
- plan: Nested SelectPlan for the subquery
- correlated_columns: Mappings from placeholder names to outer query columns
Scalar Subqueries
ScalarSubquery at llkv-plan/src/plans.rs:48-56 represents subqueries that produce single values in projections via ScalarExpr::ScalarSubquery:
Correlated Column Tracking
The CorrelatedColumn struct at llkv-plan/src/plans.rs:59-67 describes how outer columns are bound into inner subqueries:
During execution, the executor substitutes placeholder references with actual values from the outer query's current row.
Sources: llkv-plan/src/plans.rs:36-67 llkv-sql/src/sql_engine.rs:1980-2124
Plan Value Types
The PlanValue enum at llkv-plan/src/plans.rs:73-83 represents literal values within plans:
These values appear in:
- INSERT literal rows (InsertPlan with InsertSource::Rows)
- UPDATE assignments (AssignmentValue::Literal)
- Computed constant expressions
The executor converts PlanValue instances to Arrow arrays via plan_values_to_arrow_array() at llkv-executor/src/lib.rs:302-410
Sources: llkv-plan/src/plans.rs:73-161 llkv-executor/src/lib.rs:302-410
Plan Execution Interface
Plans flow to the runtime through the PlanStatement enum:
Diagram: Plan Execution Dispatch Flow
The RuntimeEngine::execute_statement() method dispatches each plan variant to the appropriate handler:
- SELECT: Passed to QueryExecutor for streaming execution
- INSERT/UPDATE/DELETE: Applied via Table with MVCC tracking
- DDL: Processed by CatalogManager to modify schema metadata
Sources: llkv-runtime/src/statements.rs llkv-sql/src/sql_engine.rs:587-609 llkv-executor/src/lib.rs:523-569
Compound Query Planning
Set operations (UNION, INTERSECT, EXCEPT) are represented through CompoundSelectPlan at llkv-plan/src/plans.rs:969-996:
- CompoundOperator: Union, Intersect, or Except
- CompoundQuantifier: Distinct (deduplicate) or All (keep duplicates)
The executor evaluates the initial plan, then applies each operation sequentially, combining results according to set semantics. Deduplication for DISTINCT quantifiers uses hash-based row encoding.
Sources: llkv-plan/src/plans.rs:946-996 llkv-executor/src/lib.rs:590-686
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Plan Structures
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/README.md
- llkv-executor/src/lib.rs
- llkv-plan/README.md
- llkv-plan/src/plans.rs
- llkv-sql/README.md
Purpose and Scope
Plan structures are strongly-typed representations of SQL statements that bridge the SQL parsing layer and the execution layer. Defined in the llkv-plan crate, these structures capture the logical intent of SQL operations without retaining parser-specific AST details. The planner translates sqlparser ASTs into plan instances, which the runtime then dispatches to execution engines.
This page documents the structure and organization of plan types. For information about how correlated subqueries and scalar subqueries are represented and tracked, see Subquery and Correlation Handling.
Sources: llkv-plan/src/plans.rs:1-10 llkv-plan/README.md:10-16
Plan Type Hierarchy
LLKV organizes plan structures into three primary categories based on their SQL statement class:
Diagram: Plan Type Organization
graph TB
Plans["Plan Structures\n(llkv-plan)"]
DDL["DDL Plans\nSchema Operations"]
DML["DML Plans\nData Modifications"]
Query["Query Plans\nData Retrieval"]
Plans --> DDL
Plans --> DML
Plans --> Query
DDL --> CreateTablePlan
DDL --> DropTablePlan
DDL --> CreateViewPlan
DDL --> DropViewPlan
DDL --> CreateIndexPlan
DDL --> DropIndexPlan
DDL --> ReindexPlan
DDL --> AlterTablePlan
DDL --> RenameTablePlan
DML --> InsertPlan
DML --> UpdatePlan
DML --> DeletePlan
DML --> TruncatePlan
Query --> SelectPlan
Query --> CompoundSelectPlan
SelectPlan --> TableRef
SelectPlan --> JoinMetadata
SelectPlan --> SelectProjection
SelectPlan --> SelectFilter
SelectPlan --> AggregateExpr
SelectPlan --> OrderByPlan
Plans are consumed by llkv-runtime for execution orchestration and by llkv-executor for query evaluation. Each plan type encodes the necessary metadata for its corresponding operation without requiring re-parsing or runtime AST traversal.
Sources: llkv-plan/src/plans.rs:163-358 llkv-plan/src/plans.rs:620-703 llkv-plan/src/plans.rs:794-1023
SelectPlan Structure
SelectPlan represents SELECT queries and is the most complex plan type. It aggregates multiple sub-components to describe table references, join relationships, projections, filters, aggregates, and ordering.
Diagram: SelectPlan Component Structure
graph TB
SelectPlan["SelectPlan\nllkv-plan/src/plans.rs:801"]
subgraph "Table Sources"
Tables["tables: Vec<TableRef>"]
TableRef["TableRef\nschema, table, alias"]
end
subgraph "Join Specification"
Joins["joins: Vec<JoinMetadata>"]
JoinMetadata["JoinMetadata\nleft_table_index\njoin_type\non_condition"]
JoinPlan["JoinPlan\nInner/Left/Right/Full"]
end
subgraph "Projections"
Projections["projections:\nVec<SelectProjection>"]
AllColumns["AllColumns"]
AllColumnsExcept["AllColumnsExcept"]
Column["Column\nname, alias"]
Computed["Computed\nexpr, alias"]
end
subgraph "Filtering"
Filter["filter: Option<SelectFilter>"]
SelectFilter["SelectFilter\npredicate\nsubqueries"]
FilterSubquery["FilterSubquery\nid, plan,\ncorrelated_columns"]
end
subgraph "Aggregation"
Aggregates["aggregates:\nVec<AggregateExpr>"]
GroupBy["group_by: Vec<String>"]
Having["having:\nOption<Expr>"]
end
subgraph "Ordering & Modifiers"
OrderBy["order_by:\nVec<OrderByPlan>"]
Distinct["distinct: bool"]
end
subgraph "Compound Operations"
Compound["compound:\nOption<CompoundSelectPlan>"]
CompoundOps["Union/Intersect/Except\nDistinct/All"]
end
SelectPlan --> Tables
SelectPlan --> Joins
SelectPlan --> Projections
SelectPlan --> Filter
SelectPlan --> Aggregates
SelectPlan --> OrderBy
SelectPlan --> Compound
Tables --> TableRef
Joins --> JoinMetadata
JoinMetadata --> JoinPlan
Projections --> AllColumns
Projections --> AllColumnsExcept
Projections --> Column
Projections --> Computed
Filter --> SelectFilter
SelectFilter --> FilterSubquery
Aggregates --> GroupBy
Aggregates --> Having
Compound --> CompoundOps
Sources: llkv-plan/src/plans.rs:794-944
TableRef - Table References
TableRef represents a table source in the FROM clause, with optional aliasing:
| Field | Type | Description |
|---|---|---|
schema | String | Schema/namespace identifier (empty for default) |
table | String | Table name |
alias | Option<String> | Optional alias for qualified name resolution |
The display_name() method returns the alias if present, otherwise the qualified name. This enables consistent column name resolution during expression translation.
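A sketch of that resolution rule, using the fields from the table above (the struct mirrors the documented fields; the method body is an assumption, not the crate's code):
```rust
// Illustrative sketch of TableRef::display_name() as described above.
pub struct TableRef {
    pub schema: String,
    pub table: String,
    pub alias: Option<String>,
}

impl TableRef {
    pub fn display_name(&self) -> String {
        match &self.alias {
            Some(alias) => alias.clone(),
            None if self.schema.is_empty() => self.table.clone(),
            None => format!("{}.{}", self.schema, self.table),
        }
    }
}
```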
Sources: llkv-plan/src/plans.rs:708-752
JoinMetadata - Join Specification
JoinMetadata describes how adjacent tables in the tables vector are connected. Each entry links tables[left_table_index] with tables[left_table_index + 1]:
| Field | Type | Description |
|---|---|---|
left_table_index | usize | Index into SelectPlan.tables |
join_type | JoinPlan | Inner, Left, Right, or Full |
on_condition | Option<Expr<String>> | Optional ON clause predicate |
The JoinPlan enum mirrors llkv_join::JoinType but exists in the plan layer to avoid circular dependencies.
Sources: llkv-plan/src/plans.rs:758-792
SelectProjection - Projection Variants
SelectProjection specifies which columns appear in the result set:
| Variant | Fields | Description |
|---|---|---|
AllColumns | - | SELECT * (all columns from all tables) |
AllColumnsExcept | exclude: Vec<String> | SELECT * EXCEPT (col1, col2, ...) |
Column | name: String, alias: Option<String> | Named column with optional alias |
Computed | expr: ScalarExpr<String>, alias: String | Computed expression (e.g., col1 + col2 AS sum) |
The executor translates these into ScanProjection instances that specify which columns to fetch from storage.
Sources: llkv-plan/src/plans.rs:998-1013
AggregateExpr - Aggregate Functions
AggregateExpr describes aggregate function calls in SELECT or HAVING clauses:
Diagram: AggregateExpr Variants
graph LR
AggregateExpr["AggregateExpr"]
CountStar["CountStar\nalias, distinct"]
Column["Column\ncolumn, alias,\nfunction, distinct"]
Functions["AggregateFunction"]
Count["Count"]
SumInt64["SumInt64"]
TotalInt64["TotalInt64"]
MinInt64["MinInt64"]
MaxInt64["MaxInt64"]
CountNulls["CountNulls"]
GroupConcat["GroupConcat"]
AggregateExpr --> CountStar
AggregateExpr --> Column
Column --> Functions
Functions --> Count
Functions --> SumInt64
Functions --> TotalInt64
Functions --> MinInt64
Functions --> MaxInt64
Functions --> CountNulls
Functions --> GroupConcat
The executor delegates to llkv-aggregate for accumulator-based evaluation.
Sources: llkv-plan/src/plans.rs:1028-1120
OrderByPlan - Sort Specification
OrderByPlan defines ORDER BY clause semantics:
| Field | Type | Description |
|---|---|---|
target | OrderTarget | Column name, projection index, or All |
sort_type | OrderSortType | Native or CastTextToInteger |
ascending | bool | Sort direction (ASC/DESC) |
nulls_first | bool | NULL placement (NULLS FIRST/LAST) |
OrderTarget variants:
- Column(String) - Sort by named column
- Index(usize) - Sort by projection position (1-based in SQL)
- All - Specialized SQLite behavior for sorting all columns
Sources: llkv-plan/src/plans.rs:1195-1217
CompoundSelectPlan - Set Operations
CompoundSelectPlan represents UNION, INTERSECT, and EXCEPT operations:
| Field | Type | Description |
|---|---|---|
initial | Box<SelectPlan> | First SELECT in the compound |
operations | Vec<CompoundSelectComponent> | Subsequent set operations |
Each CompoundSelectComponent contains:
- operator: CompoundOperator (Union, Intersect, Except)
- quantifier: CompoundQuantifier (Distinct, All)
- plan: SelectPlan for the right-hand side
The executor processes these sequentially, maintaining distinct caches for DISTINCT quantifiers.
Sources: llkv-plan/src/plans.rs:946-996
InsertPlan Structure
InsertPlan encapsulates data insertion operations with conflict resolution strategies:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
columns | Vec<String> | Column names (empty means all columns) |
source | InsertSource | Data source (rows, batches, or SELECT) |
on_conflict | InsertConflictAction | Conflict resolution strategy |
InsertSource Variants
| Variant | Description |
|---|---|
Rows(Vec<Vec<PlanValue>>) | Explicit value rows from INSERT VALUES |
Batches(Vec<RecordBatch>) | Pre-materialized Arrow batches |
Select { plan: Box<SelectPlan> } | INSERT INTO ... SELECT ... |
InsertConflictAction Variants
SQLite-compatible conflict resolution actions:
| Variant | Behavior |
|---|---|
None | Standard behavior - fail on constraint violation |
Replace | UPDATE existing row on conflict (INSERT OR REPLACE) |
Ignore | Skip conflicting rows (INSERT OR IGNORE) |
Abort | Abort transaction on conflict |
Fail | Fail statement without rollback |
Rollback | Rollback entire transaction |
Sources: llkv-plan/src/plans.rs:620-655
UpdatePlan and DeletePlan
UpdatePlan
UpdatePlan specifies row updates with optional filtering:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
assignments | Vec<ColumnAssignment> | Column updates |
filter | Option<Expr<String>> | WHERE clause predicate |
Each ColumnAssignment contains:
- column: Target column name
- value: AssignmentValue (literal or expression)
AssignmentValue variants:
- Literal(PlanValue) - Static value (e.g., SET col = 42)
- Expression(ScalarExpr<String>) - Computed value (e.g., SET col = col + 1)
Sources: llkv-plan/src/plans.rs:661-682
DeletePlan
DeletePlan specifies row deletions:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
filter | Option<Expr<String>> | WHERE clause predicate |
A missing filter indicates DELETE FROM table (deletes all rows).
Sources: llkv-plan/src/plans.rs:687-692
TruncatePlan
TruncatePlan represents TRUNCATE TABLE (removes all rows, resets sequences):
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
Sources: llkv-plan/src/plans.rs:698-702
DDL Plan Structures
CreateTablePlan
CreateTablePlan defines table creation with schema, constraints, and data sources:
| Field | Type | Description |
|---|---|---|
name | String | Table name |
if_not_exists | bool | Skip if table exists |
or_replace | bool | Replace existing table |
columns | Vec<PlanColumnSpec> | Column definitions |
source | Option<CreateTableSource> | Optional CREATE TABLE AS data |
namespace | Option<String> | Storage namespace (e.g., "temp") |
foreign_keys | Vec<ForeignKeySpec> | Foreign key constraints |
multi_column_uniques | Vec<MultiColumnUniqueSpec> | Multi-column UNIQUE constraints |
Sources: llkv-plan/src/plans.rs:176-203
PlanColumnSpec
PlanColumnSpec describes individual column metadata:
| Field | Type | Description |
|---|---|---|
name | String | Column name |
data_type | DataType | Arrow data type |
nullable | bool | NULL allowed |
primary_key | bool | PRIMARY KEY constraint |
unique | bool | UNIQUE constraint |
check_expr | Option<String> | CHECK constraint SQL expression |
The IntoPlanColumnSpec trait enables ergonomic column specification using tuples like ("col_name", DataType::Int64, NotNull).
Sources: llkv-plan/src/plans.rs:499-605
CreateIndexPlan
CreateIndexPlan specifies index creation:
| Field | Type | Description |
|---|---|---|
name | Option<String> | Index name (auto-generated if None) |
table | String | Target table |
unique | bool | UNIQUE index constraint |
if_not_exists | bool | Skip if index exists |
columns | Vec<IndexColumnPlan> | Indexed columns with sort order |
Each IndexColumnPlan specifies:
- name: Column name
- ascending: Sort direction (ASC/DESC)
- nulls_first: NULL placement
Sources: llkv-plan/src/plans.rs:433-497
AlterTablePlan
AlterTablePlan represents ALTER TABLE operations:
| Field | Type | Description |
|---|---|---|
table_name | String | Target table |
if_exists | bool | Skip if table missing |
operation | AlterTableOperation | Specific operation |
AlterTableOperation variants:
| Variant | Fields | Description |
|---|---|---|
RenameColumn | old_column_name: String, new_column_name: String | RENAME COLUMN |
SetColumnDataType | column_name: String, new_data_type: String | ALTER COLUMN SET DATA TYPE |
DropColumn | column_name: String, if_exists: bool, cascade: bool | DROP COLUMN |
Sources: llkv-plan/src/plans.rs:364-406
Additional DDL Plans
| Plan Type | Purpose | Key Fields |
|---|---|---|
DropTablePlan | DROP TABLE | name, if_exists |
CreateViewPlan | CREATE VIEW | name, view_definition, select_plan |
DropViewPlan | DROP VIEW | name, if_exists |
RenameTablePlan | RENAME TABLE | current_name, new_name, if_exists |
DropIndexPlan | DROP INDEX | name, canonical_name, if_exists |
ReindexPlan | REINDEX | name, canonical_name |
Sources: llkv-plan/src/plans.rs:209-358
PlanValue - Value Representation
PlanValue provides a type-safe representation of literal values in plans, bridging SQL literals and Arrow arrays:
| Variant | Description |
|---|---|
Null | SQL NULL |
Integer(i64) | Integer value (booleans stored as 0/1) |
Float(f64) | Floating-point value |
Decimal(DecimalValue) | Fixed-precision decimal |
String(String) | Text value |
Date32(i32) | Date (days since epoch) |
Struct(FxHashMap<String, PlanValue>) | Nested struct value |
Interval(IntervalValue) | Interval (months, days, nanos) |
PlanValue implements From<T> for common types (i64, f64, String, bool) for ergonomic plan construction. The plan_value_from_literal() function converts llkv_expr::Literal to PlanValue, and plan_value_from_array() extracts values from Arrow arrays during INSERT SELECT operations.
Sources: llkv-plan/src/plans.rs:73-161 llkv-plan/src/plans.rs:1122-1189
Plan Translation Flow
Diagram: Plan Translation and Execution Flow
Plans serve as the interface contract between the SQL layer (llkv-sql) and the execution layer (llkv-runtime, llkv-executor). The translation layer in llkv-sql converts sqlparser AST nodes into strongly-typed plan structures, which the runtime validates and dispatches to appropriate executors.
Sources: llkv-plan/README.md:13-33 llkv-executor/README.md:12-31
Plan Construction Patterns
Builder Pattern for SelectPlan
SelectPlan uses fluent builder methods for incremental construction:
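A hedged sketch of what fluent construction could look like; the builder method names below are illustrative assumptions rather than the crate's exact API:
```rust
// Hypothetical builder-style construction of a SelectPlan.
let plan = SelectPlan::new("orders")
    .with_projection(SelectProjection::AllColumns)
    .with_filter(where_clause)   // an Expr<String> predicate
    .with_order_by(order_spec);  // an OrderByPlan sort specification
```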
Sources: llkv-plan/src/plans.rs:827-943
Tuple-Based Column Specs
PlanColumnSpec implements IntoPlanColumnSpec for tuples, enabling concise table definitions:
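A sketch of the tuple form; the documented shape is ("col_name", DataType::Int64, NotNull), while the conversion call shown here is an assumed name:
```rust
use arrow::datatypes::DataType;

// Illustrative only: tuples in the documented ("name", type, constraint) shape
// converted into PlanColumnSpec values via the IntoPlanColumnSpec trait.
let columns: Vec<PlanColumnSpec> = vec![
    ("id", DataType::Int64, NotNull).into_plan_column_spec(),
    ("name", DataType::Utf8, NotNull).into_plan_column_spec(),
];
```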
Sources: llkv-plan/src/plans.rs:548-605
Integration with Expression System
Plans reference expressions from llkv-expr using parameterized types:
Expr<String>: Boolean predicates with column names (WHERE, HAVING, ON clauses)ScalarExpr<String>: Scalar expressions with column names (projections, assignments)
The executor translates these to Expr<FieldId> and ScalarExpr<FieldId> after resolving column names against table schemas. For details on expression evaluation, see Expression AST and Expression Translation.
Sources: llkv-plan/src/plans.rs:28-34 llkv-plan/src/plans.rs:666-674
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Subquery and Correlation Handling
Relevant source files
This page documents how LLKV handles subqueries and correlated column references during query planning and execution. Subqueries can appear in WHERE clauses (as EXISTS predicates) or in SELECT projections (as scalar subqueries). When a subquery references columns from its outer query, it is called a correlated subquery , requiring special handling to bind outer row values during execution.
For information about expression evaluation and compilation, see Expression System. For query execution flow, see Query Execution.
Purpose and Scope
Subquery handling in LLKV involves three distinct phases:
- Detection and Tracking - During SQL translation, the planner identifies subqueries and tracks which outer columns they reference
- Placeholder Injection - Correlated columns are replaced with synthetic placeholder identifiers in the subquery's expression tree
- Binding and Execution - At runtime, for each outer row, placeholders are replaced with actual values and the subquery is executed
This document covers the data structures, algorithms, and execution flow for both filter subqueries (EXISTS/NOT EXISTS) and scalar subqueries (single-value returns).
Core Data Structures
LLKV represents subqueries and their correlation metadata through several interconnected structures defined in llkv-plan.
classDiagram
class SelectFilter {+Expr~String~ predicate\n+Vec~FilterSubquery~ subqueries}
class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class ScalarSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
class SelectPlan {+Vec~TableRef~ tables\n+Option~SelectFilter~ filter\n+Vec~ScalarSubquery~ scalar_subqueries\n+Vec~SelectProjection~ projections}
SelectPlan --> SelectFilter : filter
SelectPlan --> ScalarSubquery : scalar_subqueries
SelectFilter --> FilterSubquery : subqueries
FilterSubquery --> CorrelatedColumn : correlated_columns
ScalarSubquery --> CorrelatedColumn : correlated_columns
FilterSubquery --> SelectPlan : plan
ScalarSubquery --> SelectPlan : plan
Subquery Plan Structures
Sources: llkv-plan/src/plans.rs:28-67
Field Descriptions
| Structure | Field | Purpose |
|---|---|---|
FilterSubquery | id | Unique identifier used to match subquery in expression tree |
FilterSubquery | plan | Nested SELECT plan to execute for each outer row |
FilterSubquery | correlated_columns | Mappings from placeholder to real outer column |
ScalarSubquery | id | Unique identifier for scalar subquery references |
ScalarSubquery | plan | SELECT plan that must return single column/row |
CorrelatedColumn | placeholder | Synthetic column name injected into subquery expressions |
CorrelatedColumn | column | Canonical outer column name |
CorrelatedColumn | field_path | Nested field access path for struct columns |
Sources: llkv-plan/src/plans.rs:36-67
Correlation Tracking During Planning
During SQL-to-plan translation, LLKV uses the SubqueryCorrelatedTracker to detect when a subquery references outer columns. This tracker is passed through the expression translation pipeline and records each outer column access.
graph TB
subgraph "SQL Translation Phase"
SQL["SQL Query String"]
Parser["sqlparser AST"]
end
subgraph "Subquery Detection"
TranslateExpr["translate_predicate / translate_scalar"]
Tracker["SubqueryCorrelatedTracker"]
Resolver["IdentifierResolver"]
end
subgraph "Placeholder Injection"
OuterColumn["Outer Column Reference"]
Placeholder["Synthetic Placeholder"]
Recording["CorrelatedColumn Entry"]
end
subgraph "Plan Output"
FilterSubquery["FilterSubquery"]
ScalarSubquery["ScalarSubquery"]
SelectPlan["SelectPlan"]
end
SQL --> Parser
Parser --> TranslateExpr
TranslateExpr --> Tracker
TranslateExpr --> Resolver
Tracker --> OuterColumn
OuterColumn --> Placeholder
Placeholder --> Recording
Recording --> FilterSubquery
Recording --> ScalarSubquery
FilterSubquery --> SelectPlan
ScalarSubquery --> SelectPlan
Tracker Architecture
Sources: llkv-sql/src/sql_engine.rs:24 llkv-sql/src/sql_engine.rs:326-363
Placeholder Generation
When the tracker detects an outer column reference in a subquery, it:
- Generates a unique placeholder string (e.g., "__correlated_0__")
- Records the mapping: placeholder → (outer_column, field_path)
- Returns the placeholder to the expression translator
- The placeholder is embedded in the subquery's expression tree instead of the original column name
This allows the subquery plan to be "generic" - it references placeholders that will be bound to actual values at execution time.
Sources: llkv-sql/src/sql_engine.rs:337-351
Tracker Extension Traits
The SubqueryCorrelatedTrackerExt trait provides a convenience method to request placeholders directly from catalog resolution results, avoiding repetitive unpacking of ColumnResolution fields.
The SubqueryCorrelatedTrackerOptionExt trait enables chaining optional tracker references through nested translation helpers without explicit as_mut() calls.
Sources: llkv-sql/src/sql_engine.rs:337-363
Subquery Types and Execution
LLKV supports two categories of subqueries, each with distinct execution semantics.
sequenceDiagram
participant Executor as QueryExecutor
participant Filter as Filter Evaluation
participant Subquery as EXISTS Subquery
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Filter: evaluate_predicate_mask()
Filter->>Filter: encounter Expr::Exists
Filter->>Subquery: evaluate_exists_subquery()
Subquery->>Binding: collect_correlated_bindings()
Binding->>Binding: extract outer row values
Binding-->>Subquery: bindings map
Subquery->>Inner: bind_select_plan()
Inner->>Inner: replace placeholders with values
Inner-->>Subquery: bound SelectPlan
Subquery->>Executor: execute_select(bound_plan)
Executor-->>Subquery: SelectExecution stream
Subquery->>Subquery: check if num_rows > 0
Subquery-->>Filter: boolean result
Filter-->>Executor: BooleanArray mask
Filter Subqueries (EXISTS Predicates)
Filter subqueries appear in WHERE clauses as EXISTS or NOT EXISTS predicates. They return a boolean indicating whether the subquery produced any rows.
Sources: llkv-executor/src/lib.rs:773-792
Scalar Subqueries (Projection Values)
Scalar subqueries appear in SELECT projections and must return exactly one column and at most one row. They are evaluated into a single literal value for each outer row.
Sources: llkv-executor/src/lib.rs:794-891
sequenceDiagram
participant Executor as QueryExecutor
participant Projection as Projection Logic
participant Subquery as Scalar Subquery Evaluator
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Projection: evaluate_projection_expression()
Projection->>Projection: encounter ScalarExpr::ScalarSubquery
Projection->>Subquery: evaluate_scalar_subquery_numeric()
loop For each outer row
Subquery->>Subquery: evaluate_scalar_subquery_literal()
Subquery->>Binding: collect_correlated_bindings()
Binding-->>Subquery: bindings for current row
Subquery->>Inner: bind_select_plan()
Inner-->>Subquery: bound plan
Subquery->>Executor: execute_select()
Executor-->>Subquery: result batches
Subquery->>Subquery: validate single column/row
Subquery->>Subquery: convert to Literal
Subquery-->>Projection: literal value
end
Projection->>Projection: build NumericArray from literals
Projection-->>Executor: computed column array
Binding Process
The binding process replaces placeholder identifiers in a subquery plan with actual values from the current outer row.
Correlated Binding Collection
The collect_correlated_bindings function builds a map from placeholder strings to concrete Literal values by:
- Iterating over each CorrelatedColumn in the subquery metadata
- Looking up the outer column in the current RecordBatch
- Extracting the value at the current row index
- Converting the Arrow array value to a Literal
- Storing the mapping: placeholder → Literal
Sources: Referenced in llkv-executor/src/lib.rs:781 and llkv-executor/src/lib.rs:802
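A minimal sketch of this per-row collection, assuming the CorrelatedColumn and Literal types described above (literal_from_array is a placeholder for the actual Arrow-to-Literal conversion):
```rust
use arrow::record_batch::RecordBatch;
use rustc_hash::FxHashMap;

// Sketch only: mirrors the steps listed above, not the executor's exact code.
fn collect_correlated_bindings(
    outer: &RecordBatch,
    row_idx: usize,
    correlated: &[CorrelatedColumn],
) -> FxHashMap<String, Literal> {
    let mut bindings = FxHashMap::default();
    for col in correlated {
        // Locate the outer column, pull the value at the current row, and
        // record it under the placeholder name used inside the subquery.
        let idx = outer
            .schema()
            .index_of(&col.column)
            .expect("outer column present in combined schema");
        let value = literal_from_array(outer.column(idx), row_idx); // assumed helper
        bindings.insert(col.placeholder.clone(), value);
    }
    bindings
}
```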
graph LR
subgraph "Input"
OuterBatch["RecordBatch\n(outer query result)"]
RowIndex["Current Row Index"]
Metadata["Vec<CorrelatedColumn>"]
end
subgraph "Processing"
Iterate["For each CorrelatedColumn"]
Lookup["Find column in schema"]
Extract["Extract array[row_idx]"]
Convert["Convert to Literal"]
end
subgraph "Output"
Bindings["FxHashMap<placeholder, Literal>"]
end
OuterBatch --> Iterate
RowIndex --> Iterate
Metadata --> Iterate
Iterate --> Lookup
Lookup --> Extract
Extract --> Convert
Convert --> Bindings
Plan Binding
The bind_select_plan function takes a subquery SelectPlan and a bindings map, then recursively replaces:
- Placeholder column references in filter expressions with Expr::Literal
- Placeholder column references in projections with ScalarExpr::Literal
- Placeholder column references in HAVING clauses
- Placeholder references in nested subqueries
This produces a new SelectPlan that is fully "grounded" with the outer row's values and can be executed independently.
Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803
Execution Flow: EXISTS Subquery Example
Consider the query:
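A representative correlated EXISTS query of the kind discussed here (illustrative only, not taken from the source):
```rust
// The inner SELECT references the outer row via c.id, making it a
// correlated EXISTS subquery.
engine.execute(
    "SELECT c.name FROM customers c \
     WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);",
)?;
```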
Planning Phase
Sources: llkv-plan/src/plans.rs:36-46
Execution Phase
Sources: llkv-executor/src/lib.rs:773-792
Execution Flow: Scalar Subquery Example
Consider the query:
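A representative correlated scalar subquery in a projection (illustrative only, not taken from the source):
```rust
// The subquery must yield a single column and at most one row per outer
// customer row, as enforced by the executor.
engine.execute(
    "SELECT c.name, \
            (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS order_count \
     FROM customers c;",
)?;
```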
Planning Phase
Sources: llkv-plan/src/plans.rs:48-56
Execution Phase
Sources: llkv-executor/src/lib.rs:834-891
Cross-Product Integration
When subqueries appear in cross-product (multi-table) queries, the binding process works identically but must resolve outer column names through the combined schema.
Cross-Product Expression Context
The CrossProductExpressionContext maintains:
- Combined schema from all tables in the FROM clause
- Column lookup map (qualified names → column indices)
- Numeric array cache for evaluated expressions
- Synthetic field ID allocation for subquery results
When evaluating a filter or projection expression that contains subqueries, the context:
- Detects subquery references by SubqueryId
- Calls the appropriate evaluator (evaluate_exists_subquery or evaluate_scalar_subquery_numeric)
- Passes the combined schema and current row to the binding logic
- Caches numeric results for scalar subqueries to avoid re-evaluation
Sources: llkv-executor/src/lib.rs:1329-1383 llkv-executor/src/lib.rs:1429-1502
Validation and Error Handling
The executor enforces several constraints on subquery results:
| Constraint | Applies To | Error Condition |
|---|---|---|
| Single column | Scalar subqueries | num_columns() != 1 |
| At most one row | Scalar subqueries | num_rows() > 1 |
| Result present | N/A (returns NULL) | num_rows() == 0 for scalar subquery |
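A simplified sketch of how these checks fit together; the executor's real error type and messages differ:
```rust
use arrow::record_batch::RecordBatch;

// Reflects the constraints in the table above: exactly one column, at most
// one row, and an empty result evaluating to NULL.
fn check_scalar_subquery_result(batch: &RecordBatch) -> Result<Option<usize>, String> {
    if batch.num_columns() != 1 {
        return Err("scalar subquery must return exactly one column".to_string());
    }
    match batch.num_rows() {
        0 => Ok(None),    // no rows: the scalar subquery evaluates to NULL
        1 => Ok(Some(0)), // index of the single row holding the value
        _ => Err("scalar subquery returned more than one row".to_string()),
    }
}
```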
Error Examples
Scalar Subquery: Multiple Columns
Scalar Subquery: Multiple Rows
Sources: llkv-executor/src/lib.rs:808-819
Subquery ID Assignment
SubqueryId is a newtype wrapper around usize defined in llkv-expr. The planner assigns sequential IDs as it encounters subqueries during translation, ensuring each subquery has a unique identifier that persists from planning through execution.
The executor uses these IDs to:
- Look up subquery metadata in the plan's scalar_subqueries or the filter's subqueries vectors
- Match subquery references in expression trees (ScalarExpr::ScalarSubquery or Expr::Exists) to their plans
- Cache evaluation results (for scalar subqueries appearing multiple times)
Sources: llkv-expr (referenced in llkv-plan/src/plans.rs:15)
Recursive Subquery Support
LLKV supports nested subqueries where an inner subquery can itself contain subqueries. The binding process is recursive:
- Bind outer-level placeholders in the top-level subquery plan
- For any nested subqueries within that plan, repeat the binding process
- Continue until all correlation layers are resolved
This is handled automatically by bind_select_plan which traverses the entire plan tree.
Sources: Referenced in llkv-executor/src/lib.rs:782 and llkv-executor/src/lib.rs:803
Performance Considerations
Correlation Overhead
Correlated subqueries are executed once per outer row, which can be expensive:
- An outer query returning N rows with a correlated subquery executes N + 1 queries total
- For scalar subqueries in projections with multiple references, results are cached per subquery ID
- EXISTS subqueries short-circuit as soon as a matching row is found (first batch with
num_rows() > 0)
Uncorrelated Subqueries
If a subquery contains no correlated columns (empty correlated_columns vector), it could theoretically be executed once and reused. However, LLKV's current implementation still executes it per outer row. Future optimizations could detect this case and cache the result.
Sources: llkv-executor/src/lib.rs:773-891
Summary
LLKV's subquery handling follows a three-phase model:
-
Planning : The
SubqueryCorrelatedTrackerdetects outer column references and generates placeholders. Plans are built withFilterSubqueryorScalarSubquerystructures containing correlation metadata. -
Binding : At execution time,
collect_correlated_bindingsextracts outer row values andbind_select_planreplaces placeholders with literals, producing a grounded plan. -
Execution : The bound plan is executed as an independent query. EXISTS subqueries return a boolean; scalar subqueries return a single literal (or NULL if empty).
This design keeps the subquery plan generic during planning and binds it dynamically at execution time, enabling proper correlation semantics while maintaining the separation between planning and execution layers.
Sources: llkv-plan/src/plans.rs:28-67 llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363 llkv-executor/src/lib.rs:773-891
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Expression System
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-plan/src/plans.rs
- llkv-table/src/planner/program.rs
The expression system provides the intermediate representation for predicates, projections, and computations in LLKV. It defines a strongly-typed Abstract Syntax Tree (AST) that decouples query logic from concrete storage formats, enabling optimization and efficient evaluation against Apache Arrow data structures.
This page covers the overall architecture of the expression system. For details on specific components:
- Expression AST structure and types: see 5.1
- Column name resolution and translation: see 5.2
- Bytecode compilation and normalization: see 5.3
- Scalar evaluation engine: see 5.4
- Aggregate function evaluation: see 5.5
Purpose and Scope
The expression system serves three primary functions:
-
Representation : Defines generic AST types (
Expr<F>,ScalarExpr<F>) parameterized over field identifiers, supporting both string-based column names (during planning) and integer field IDs (during execution) -
Translation : Resolves column names to storage field identifiers by consulting the catalog, enabling type-safe access to table columns
-
Compilation : Transforms normalized expressions into stack-based bytecode (
EvalProgram,DomainProgram) for efficient vectorized evaluation
The system is located primarily in llkv-expr/, with translation logic in llkv-executor/src/translation/ and compilation in llkv-table/src/planner/program.rs.
Sources: llkv-expr/src/expr.rs:1-749 llkv-executor/src/translation/expression.rs:1-424 llkv-table/src/planner/program.rs:1-710
Expression Type Hierarchy
LLKV uses two primary expression types, both generic over the field identifier type F:
Expression Types
| Type | Purpose | Variants | Example |
|---|---|---|---|
Expr<'a, F> | Boolean predicates for filtering rows | And, Or, Not, Pred, Compare, InList, IsNull, Literal, Exists | WHERE age > 18 AND status = 'active' |
ScalarExpr<F> | Arithmetic/scalar computations returning values | Column, Literal, Binary, Aggregate, Cast, Case, Coalesce, GetField, Compare, Not, IsNull, Random, ScalarSubquery | SELECT price * 1.1 AS adjusted_price |
Filter<'a, F> | Single-field predicate | Field ID + Operator | age > 18 |
Operator<'a> | Comparison operator against literals | Equals, Range, GreaterThan, LessThan, In, StartsWith, EndsWith, Contains, IsNull, IsNotNull | IN (1, 2, 3) |
Type Parameterization Flow
Sources: llkv-expr/src/expr.rs:14-333 llkv-plan/src/plans.rs:28-67 llkv-executor/src/translation/expression.rs:18-174
Expression Lifecycle
Expressions flow through multiple stages from SQL text to execution against storage:
Stage 1: Planning
The SQL layer (llkv-sql) parses SQL statements and constructs plan structures containing expressions. At this stage, column references use string names from the SQL text:
- Predicates: Stored as Expr<'static, String> in SelectFilter
- Projections: Stored as ScalarExpr<String> in SelectProjection::Computed
- Assignments: Stored as ScalarExpr<String> in UpdatePlan::assignments
Sources: llkv-plan/src/plans.rs:28-34 llkv-sql/src/translator/mod.rs (inferred from architecture)
Stage 2: Translation
The executor translates string-based expressions to field-ID-based expressions by consulting the table schema:
- Column Resolution: translate_predicate and translate_scalar walk the expression tree
- Schema Lookup: Each column name is resolved to a FieldId using ExecutorSchema::resolve
- Type Inference: For computed projections, infer_computed_data_type determines the Arrow data type
- Special Columns: System columns like "rowid" map to special field IDs (e.g., ROW_ID_FIELD_ID)
Translation is implemented via iterative traversal to avoid stack overflow on deeply nested expressions (50k+ nodes).
Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:176-387 llkv-executor/src/translation/schema.rs:53-123
Stage 3: Compilation
The table layer compiles Expr<FieldId> into bytecode for efficient evaluation:
-
Normalization :
normalize_predicateapplies De Morgan's laws and flattens nested AND/OR nodes -
Compilation :
ProgramCompiler::compilegenerates two programs:EvalProgram: Stack-based bytecode for predicate evaluationDomainProgram: Bytecode for tracking which fields affect row visibility
-
Fusion Optimization : Multiple predicates on the same field are fused into a single
FusedAndoperation
graph TB
Input["Expr<FieldId>\nNOT(age > 18 AND status = 'active')"]
Norm["normalize_predicate\nApply De Morgan's law"]
Normal["Expr<FieldId>\nOR([\n NOT(age > 18),\n NOT(status = 'active')\n])"]
Compile["ProgramCompiler::compile"]
subgraph "Output Programs"
Eval["EvalProgram\nops: [\n PushCompare(age, Gt, 18),\n Not,\n PushCompare(status, Eq, 'active'),\n Not,\n Or(2)\n]"]
Domain["DomainProgram\nops: [\n PushCompareDomain(...),\n PushCompareDomain(...),\n Union(2)\n]"]
end
Input --> Norm
Norm --> Normal
Normal --> Compile
Compile --> Eval
Compile --> Domain
Sources: llkv-table/src/planner/program.rs:286-318 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631
Stage 4: Evaluation
Compiled programs are executed against Arrow RecordBatch data:
- EvalProgram : Evaluates predicates row-by-row using a value stack, producing boolean results
- DomainProgram : Identifies which row IDs could possibly match (used for optimization)
- ScalarExpr : Evaluated via
NumericKernelsfor vectorized arithmetic operations
The evaluation engine handles Arrow's columnar format efficiently through zero-copy operations and SIMD-friendly algorithms.
Sources: llkv-table/src/planner/evaluator.rs (inferred from architecture), llkv-executor/src/lib.rs:254-296
Key Components
ProgramCompiler
Compiles normalized expressions into bytecode:
Key Optimizations :
- Predicate Fusion :
gather_fuseddetects multiple predicates on the same field and emitsFusedAndoperations - Domain Caching : Domain programs are memoized by expression identity to avoid recompilation
- Stack-Based Evaluation : Operations push/pop from a value stack, avoiding recursive evaluation overhead
Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:518-542
Bytecode Operations
EvalOp Variants
| Operation | Purpose | Stack Effect |
|---|---|---|
PushPredicate(filter) | Evaluate single-field predicate | Push boolean |
PushCompare{left, op, right} | Evaluate comparison between scalar expressions | Push boolean |
PushInList{expr, list, negated} | Evaluate IN/NOT IN list membership | Push boolean |
PushIsNull{expr, negated} | Evaluate IS NULL / IS NOT NULL | Push boolean |
PushLiteral(bool) | Push constant boolean | Push boolean |
FusedAnd{field_id, filters} | Evaluate multiple predicates on same field (optimized) | Push boolean |
And{child_count} | Pop N booleans, push AND result | Pop N, push 1 |
Or{child_count} | Pop N booleans, push OR result | Pop N, push 1 |
Not{domain} | Pop boolean, negate, push result (uses domain for optimization) | Pop 1, push 1 |
DomainOp Variants
| Operation | Purpose | Stack Effect |
|---|---|---|
PushFieldAll(field_id) | All rows where field exists | Push RowSet |
PushCompareDomain{left, right, op, fields} | Rows where comparison could be true | Push RowSet |
PushInListDomain{expr, list, fields, negated} | Rows where IN list could be true | Push RowSet |
PushIsNullDomain{expr, fields, negated} | Rows where NULL test could be true | Push RowSet |
PushLiteralFalse | Empty row set | Push RowSet |
PushAllRows | All rows | Push RowSet |
Union{child_count} | Pop N row sets, push union | Pop N, push 1 |
Intersect{child_count} | Pop N row sets, push intersection | Pop N, push 1 |
Sources: llkv-table/src/planner/program.rs:36-67 llkv-table/src/planner/program.rs:221-254
Expression Translation
Translation resolves column names to field IDs through schema lookup:
Special Handling :
- Rowid Column :
"rowid"(case-insensitive) maps toROW_ID_FIELD_IDconstant - Flexible Matching : Supports qualified names (
table.column) and unqualified names (column) - Error Handling : Unknown columns produce descriptive error messages with the column name
Sources: llkv-executor/src/translation/expression.rs:390-407 llkv-executor/src/translation/expression.rs:410-423
Type Inference
The executor infers Arrow data types for computed projections to construct the output schema:
Type Inference Rules
| Expression Pattern | Inferred Type |
|---|---|
Literal(Integer) | DataType::Int64 |
Literal(Float) | DataType::Float64 |
Literal(Decimal(v)) | DataType::Decimal128(v.precision(), v.scale()) |
Literal(String) | DataType::Utf8 |
Literal(Date32) | DataType::Date32 |
Literal(Boolean) | DataType::Boolean |
Literal(Null) | DataType::Null |
Column(field_id) | Lookup from schema, normalized to Int64/Float64 |
Binary{...} | Float64 if any operand is float, else Int64 |
Compare{...} | DataType::Int64 (boolean as integer) |
Aggregate(...) | DataType::Int64 (most aggregates) |
Cast{data_type, ...} | data_type (explicit cast) |
Random | DataType::Float64 |
Numeric Type Normalization : Small integers (Int8, Int16, Int32, Boolean) normalize to Int64, while all floating-point types normalize to Float64. This simplifies arithmetic evaluation.
Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:125-147 llkv-executor/src/translation/schema.rs:149-243
Subquery Support
Expressions support two types of subqueries:
EXISTS Predicates
Used in WHERE clauses to test for row existence:
- Structure: Expr::Exists(SubqueryExpr{id, negated})
- Planning: Stored in SelectFilter::subqueries with correlated column bindings
- Evaluation: Executor binds outer row values to correlated columns, executes subquery plan, returns true if any rows match
Scalar Subqueries
Used in projections to return a single value:
- Structure: ScalarExpr::ScalarSubquery(ScalarSubqueryExpr{id})
- Planning: Stored in SelectPlan::scalar_subqueries with correlated column bindings
- Evaluation: Executor binds outer row values, executes subquery, extracts single value
- Error Handling: Returns error if subquery returns multiple rows or columns
Sources: llkv-expr/src/expr.rs:42-63 llkv-plan/src/plans.rs:36-56 llkv-executor/src/lib.rs:774-839
Normalization
Expression normalization applies logical transformations before compilation:
De Morgan's Laws
NOT is pushed down through AND/OR using De Morgan's laws:
- NOT(A AND B) → NOT(A) OR NOT(B)
- NOT(A OR B) → NOT(A) AND NOT(B)
- NOT(NOT(A)) → A
Flattening
Nested AND/OR nodes are flattened:
- AND(A, AND(B, C)) → AND(A, B, C)
- OR(A, OR(B, C)) → OR(A, B, C)
Special Cases
- NOT(Literal(true)) → Literal(false)
- NOT(IsNull{expr, false}) → IsNull{expr, true}
Sources: llkv-table/src/planner/program.rs:286-343
Expression Operators
Binary Operators (BinaryOp)
| Operator | Semantics |
|---|---|
Add | Addition (a + b) |
Subtract | Subtraction (a - b) |
Multiply | Multiplication (a * b) |
Divide | Division (a / b) |
Modulo | Modulus (a % b) |
And | Bitwise AND (a & b) |
Or | Bitwise OR (a \| b) |
BitwiseShiftLeft | Left shift (a << b) |
BitwiseShiftRight | Right shift (a >> b) |
Comparison Operators (CompareOp)
| Operator | Semantics |
|---|---|
Eq | Equality (a = b) |
NotEq | Inequality (a != b) |
Lt | Less than (a < b) |
LtEq | Less than or equal (a <= b) |
Gt | Greater than (a > b) |
GtEq | Greater than or equal (a >= b) |
Comparisons in ScalarExpr::Compare return 1 for true, 0 for false, NULL for NULL propagation.
Sources: llkv-expr/src/expr.rs:270-293
Memory Management
Expression Lifetimes
The 'a lifetime parameter in Expr<'a, F> allows borrowed operators to avoid allocations:
- Operator::In(&'a [Literal]): Borrows slice from call site
- Operator::StartsWith{pattern: &'a str, ...}: Borrows pattern string
- Filter<'a, F>: Contains borrowed Operator<'a>
Owned Variants : EvalProgram and DomainProgram use OwnedOperator and OwnedFilter for storage, converting borrowed operators to owned values.
Zero-Copy Pattern
During evaluation, predicates borrow from the compiled program rather than cloning operators, enabling zero-copy predicate evaluation against Arrow arrays.
Sources: llkv-expr/src/expr.rs:295-333 llkv-table/src/planner/program.rs:69-143
This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Expression AST
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/program.rs
Purpose and Scope
This document describes the expression Abstract Syntax Tree (AST) defined in the llkv-expr crate. The expression AST provides type-aware, Arrow-native data structures for representing boolean predicates and scalar expressions throughout the LLKV system. These AST nodes are decoupled from SQL parsing and can be parameterized by field identifier types, enabling reuse across multiple processing stages.
For information about how expressions are translated between field identifier types, see Expression Translation. For details on how expressions are compiled into executable bytecode, see Program Compilation. For scalar evaluation mechanics, see Scalar Evaluation and NumericKernels.
Sources: llkv-expr/src/expr.rs:1-8
Expression Type Hierarchy
The llkv-expr crate defines two primary expression enums:
- Expr<'a, F> - Boolean/predicate expressions that evaluate to true or false
- ScalarExpr<F> - Arithmetic/scalar expressions that produce typed values
graph TB
subgraph "Expression Type System"
EXPR["Expr<'a, F>\nBoolean Predicates"]
SCALAR["ScalarExpr<F>\nScalar Values"]
EXPR --> LOGICAL["Logical Operators\nAnd, Or, Not"]
EXPR --> PRED["Pred(Filter)\nField Predicates"]
EXPR --> COMPARE["Compare\nScalar Comparisons"]
EXPR --> INLIST["InList\nSet Membership"]
EXPR --> ISNULL["IsNull\nNull Checks"]
EXPR --> LITERAL["Literal(bool)\nConstant Boolean"]
EXPR --> EXISTS["Exists\nSubquery Predicates"]
SCALAR --> COLUMN["Column(F)\nField Reference"]
SCALAR --> SLITERAL["Literal\nConstant Value"]
SCALAR --> BINARY["Binary\nArithmetic Ops"]
SCALAR --> SNOT["Not\nLogical Negation"]
SCALAR --> SISNULL["IsNull\nNull Test"]
SCALAR --> AGG["Aggregate\nAggregate Functions"]
SCALAR --> GETFIELD["GetField\nStruct Access"]
SCALAR --> CAST["Cast\nType Conversion"]
SCALAR --> SCOMPARE["Compare\nBoolean Result"]
SCALAR --> COALESCE["Coalesce\nFirst Non-Null"]
SCALAR --> SUBQ["ScalarSubquery\nSubquery Value"]
SCALAR --> CASE["Case\nConditional Logic"]
SCALAR --> RANDOM["Random\nRandom Number"]
COMPARE -.uses.-> SCALAR
INLIST -.uses.-> SCALAR
ISNULL -.uses.-> SCALAR
end
style EXPR fill:#e8f5e9
style SCALAR fill:#e1f5ff
Both types are generic over a field identifier parameter F, which allows the same AST structure to be used with different field representations (typically String column names during planning, or FieldId numeric identifiers during execution).
Sources: llkv-expr/src/expr.rs:14-143
Expr<'a, F> - Boolean Expressions
The Expr<'a, F> enum represents boolean predicate expressions that evaluate to true or false. These are primarily used in WHERE clauses, JOIN conditions, and HAVING clauses.
Expr Variants
| Variant | Description | Use Case |
|---|---|---|
And(Vec<Expr>) | Logical AND of multiple predicates | Combining multiple filter conditions |
Or(Vec<Expr>) | Logical OR of multiple predicates | Alternative filter conditions |
Not(Box<Expr>) | Logical negation | Inverting a predicate |
Pred(Filter<'a, F>) | Single-field predicate | Column-level filtering (e.g., age > 18) |
Compare { left, op, right } | Comparison between scalar expressions | Cross-column comparisons (e.g., col1 + col2 > 10) |
InList { expr, list, negated } | Set membership test | IN/NOT IN clauses |
IsNull { expr, negated } | Null check on expression | IS NULL/IS NOT NULL on complex expressions |
Literal(bool) | Constant boolean value | Always-true/always-false conditions |
Exists(SubqueryExpr) | Correlated subquery predicate | EXISTS/NOT EXISTS clauses |
Sources: llkv-expr/src/expr.rs:14-43
Expr Construction Helpers
The Expr type provides builder methods for common patterns:
These helpers simplify construction of common predicate patterns during query planning.
Sources: llkv-expr/src/expr.rs:65-84
Filter and Operator Types
The Pred variant wraps a Filter<'a, F> struct, which represents a single predicate against a field:
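An approximate shape for Filter, inferred from the tables on this page (field names here are assumptions, not the crate's definition):
```rust
// Sketch: a Filter pairs a field identifier with an Operator that compares
// that field against literal values.
pub struct Filter<'a, F> {
    pub field_id: F,      // String during planning, FieldId during execution
    pub op: Operator<'a>, // e.g. Operator::GreaterThan(...) or Operator::IsNull
}
```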
The Operator<'a> enum defines comparison operations over untyped Literal values:
| Operator | Description | Example SQL |
|---|---|---|
Equals(Literal) | Exact equality | col = 5 |
Range { lower, upper } | Range with bounds | col BETWEEN 10 AND 20 |
GreaterThan(Literal) | Greater-than comparison | col > 10 |
GreaterThanOrEquals(Literal) | Greater-or-equal | col >= 10 |
LessThan(Literal) | Less-than comparison | col < 10 |
LessThanOrEquals(Literal) | Less-or-equal | col <= 10 |
In(&'a [Literal]) | Set membership (borrowed slice) | col IN (1, 2, 3) |
StartsWith { pattern, case_sensitive } | String prefix match | col LIKE 'abc%' |
EndsWith { pattern, case_sensitive } | String suffix match | col LIKE '%xyz' |
Contains { pattern, case_sensitive } | String substring match | col LIKE '%abc%' |
IsNull | Null check | col IS NULL |
IsNotNull | Non-null check | col IS NOT NULL |
The Operator type uses borrowed slices for In and borrowed string slices for pattern matching to avoid allocations in common cases.
Sources: llkv-expr/src/expr.rs:295-358
ScalarExpr - Scalar Expressions
The ScalarExpr<F> enum represents arithmetic and scalar expressions that produce typed values. These are used in SELECT projections, computed columns, ORDER BY clauses, and as operands in comparisons.
graph TB
subgraph "ScalarExpr Evaluation Categories"
SIMPLE["Simple Values"]
ARITH["Arithmetic"]
LOGIC["Logical"]
STRUCT["Structured"]
CONTROL["Control Flow"]
SPECIAL["Special Functions"]
SIMPLE --> COLUMN["Column(F)\nField reference"]
SIMPLE --> LITERAL["Literal\nConstant value"]
ARITH --> BINARY["Binary\nleft op right"]
ARITH --> CAST["Cast\nType conversion"]
LOGIC --> NOT["Not\nLogical negation"]
LOGIC --> ISNULL["IsNull\nNull test"]
LOGIC --> COMPARE["Compare\nComparison"]
STRUCT --> GETFIELD["GetField\nStruct field access"]
STRUCT --> AGGREGATE["Aggregate\nAggregate function"]
CONTROL --> CASE["Case\nCASE expression"]
CONTROL --> COALESCE["Coalesce\nFirst non-null"]
SPECIAL --> SUBQUERY["ScalarSubquery\nScalar subquery"]
SPECIAL --> RANDOM["Random\nRandom number"]
end
ScalarExpr Variants
Sources: llkv-expr/src/expr.rs:86-143
Arithmetic and Binary Operations
The Binary variant supports arithmetic and logical operations:
BinaryOp Variants:
| Operator | Numeric | Bitwise | Logical |
|---|---|---|---|
Add | ✓ | | |
Subtract | ✓ | | |
Multiply | ✓ | | |
Divide | ✓ | | |
Modulo | ✓ | | |
And | | | ✓ |
Or | | | ✓ |
BitwiseShiftLeft | | ✓ | |
BitwiseShiftRight | | ✓ | |
Sources: llkv-expr/src/expr.rs:270-282
Comparison Operations
The Compare variant in ScalarExpr produces a boolean (1/0) result:
CompareOp Variants:
| Operator | SQL Equivalent |
|---|---|
Eq | = |
NotEq | != or <> |
Lt | < |
LtEq | <= |
Gt | > |
GtEq | >= |
Sources: llkv-expr/src/expr.rs:284-293 llkv-expr/src/expr.rs:119-124
AggregateCall Variants
The Aggregate(AggregateCall<F>) variant wraps aggregate function calls:
Each aggregate (except CountStar) operates on a ScalarExpr<F> rather than just a column, enabling complex expressions like AVG(col1 + col2) or SUM(-col1).
Sources: llkv-expr/src/expr.rs:145-176
Struct Field Access
The GetField variant extracts fields from struct-typed expressions:
This represents dot-notation access like user.address.city, which is nested as:
GetField {
base: GetField {
base: Column(user),
field_name: "address"
},
field_name: "city"
}
Sources: llkv-expr/src/expr.rs:107-113
CASE Expressions
The Case variant implements SQL CASE expressions:
- Simple CASE: operand is Some, branches test equality
- Searched CASE: operand is None, branches evaluate conditions
- ELSE clause: handled by the else_expr field
Sources: llkv-expr/src/expr.rs:129-137
ScalarExpr Helper Methods
Builder methods simplify construction:
| Method | Purpose |
|---|---|
column(field: F) | Create column reference |
literal<L>(lit: L) | Create literal value |
binary(left, op, right) | Create binary operation |
logical_not(expr) | Create logical NOT |
is_null(expr, negated) | Create null test |
aggregate(call) | Create aggregate function |
get_field(base, name) | Create struct field access |
cast(expr, data_type) | Create type cast |
compare(left, op, right) | Create comparison |
coalesce(exprs) | Create COALESCE |
scalar_subquery(id) | Create scalar subquery |
case(operand, branches, else_expr) | Create CASE expression |
random() | Create random number generator |
Sources: llkv-expr/src/expr.rs:178-268
Subquery Expression Types
The expression AST includes dedicated types for correlated and scalar subqueries:
SubqueryExpr
Used in Expr::Exists for boolean subquery predicates:
The id references a subquery definition stored separately (typically in the enclosing plan), and negated indicates NOT EXISTS.
Sources: llkv-expr/src/expr.rs:49-56
ScalarSubqueryExpr
Used in ScalarExpr::ScalarSubquery for value-producing subqueries:
This represents subqueries that return a single scalar value, used in expressions like SELECT (SELECT MAX(price) FROM orders) + 10.
Sources: llkv-expr/src/expr.rs:58-63
SubqueryId
Both subquery types use SubqueryId as an opaque identifier:
This ID is resolved during execution by looking up the subquery definition in the parent plan's metadata.
Sources: llkv-expr/src/expr.rs:45-47
graph LR
subgraph "Expression Translation Pipeline"
SQL["SQL Text"]
PLAN["Query Plan\nExpr<String>\nScalarExpr<String>"]
TRANS["Translation\nresolve_field_id()"]
EXEC["Execution\nExpr<FieldId>\nScalarExpr<FieldId>"]
EVAL["Evaluation\nRecordBatch Results"]
SQL --> PLAN
PLAN --> TRANS
TRANS --> EXEC
EXEC --> EVAL
end
Generic Field Parameter
The expression AST is parameterized by field identifier type F, enabling reuse across different processing stages:
Common instantiations:
- Expr<'a, String> - Used during query planning with column names
- Expr<'a, FieldId> - Used during execution with numeric field IDs
- ScalarExpr<String> - Planning-time scalar expressions
- ScalarExpr<FieldId> - Execution-time scalar expressions
The translation from String to FieldId occurs in llkv-executor/src/translation/expression.rs using catalog lookups to resolve column names to their internal numeric identifiers.
Sources: llkv-executor/src/translation/expression.rs:18-174
Literal Values
Both expression types use the Literal enum from llkv-expr to represent untyped constant values:
| Literal Variant | Arrow Type | Example |
|---|---|---|
Integer(i64) | Int64 | 42 |
Float(f64) | Float64 | 3.14 |
Decimal(DecimalValue) | Decimal128 | 123.45 |
Boolean(bool) | Boolean | true |
String(String) | Utf8 | "hello" |
Date32(i32) | Date32 | DATE '2024-01-01' |
Null | Null | NULL |
Struct(...) | Struct | {a: 1, b: "x"} |
Interval(...) | Interval | INTERVAL '1 month' |
Literal values are type-agnostic at the AST level. Type coercion and validation occur during execution when column types are known.
Sources: llkv-expr/src/expr.rs:10 llkv-executor/src/translation/schema.rs:53-123
graph TB
subgraph "Type Inference Rules"
SCALAR["ScalarExpr<FieldId>"]
SCALAR --> LITERAL["Literal → Literal type"]
SCALAR --> COLUMN["Column → Schema lookup"]
SCALAR --> BINARY["Binary → Int64 or Float64"]
SCALAR --> NOT["Not → Int64 (boolean)"]
SCALAR --> ISNULL["IsNull → Int64 (boolean)"]
SCALAR --> COMPARE["Compare → Int64 (boolean)"]
SCALAR --> AGG["Aggregate → Int64"]
SCALAR --> GETFIELD["GetField → Struct field type"]
SCALAR --> CAST["Cast → Target type"]
SCALAR --> CASE["Case → Int64 or Float64"]
SCALAR --> COALESCE["Coalesce → Int64 or Float64"]
SCALAR --> RANDOM["Random → Float64"]
SCALAR --> SUBQUERY["ScalarSubquery → Utf8 (TODO)"]
BINARY --> FLOATCHECK["Contains Float64?"]
FLOATCHECK -->|Yes| FLOAT64["Float64"]
FLOATCHECK -->|No| INT64["Int64"]
CASE --> CASECHECK["Branches use Float64?"]
CASECHECK -->|Yes| CASEFLOAT["Float64"]
CASECHECK -->|No| CASEINT["Int64"]
end
Expression Type Inference
During execution planning, the system infers output types for ScalarExpr based on operand types:
The inference logic is implemented in infer_computed_data_type() and expression_uses_float(), which recursively analyze expression trees to determine output types.
Sources: llkv-executor/src/translation/schema.rs:53-271
Expression Normalization
Before compilation, predicates undergo normalization to flatten nested AND/OR structures and apply De Morgan's laws:
Normalization Rules:
- Flatten AND: And([And([a, b]), c]) → And([a, b, c])
- Flatten OR: Or([Or([a, b]), c]) → Or([a, b, c])
- De Morgan's AND: Not(And([a, b])) → Or([Not(a), Not(b)])
- De Morgan's OR: Not(Or([a, b])) → And([Not(a), Not(b)])
- Double Negation: Not(Not(expr)) → expr
- Literal Negation: Not(Literal(true)) → Literal(false)
- IsNull Negation: Not(IsNull { expr, negated }) → IsNull { expr, negated: !negated }
This normalization simplifies subsequent optimization and compilation steps.
Sources: llkv-table/src/planner/program.rs:286-343
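A minimal sketch of the flattening and De Morgan rules on a toy predicate enum (the real normalize_expr in llkv-table handles the full variant set, including the IsNull rule):

```rust
// Toy normalization sketch: flatten nested And/Or and push Not inward.
#[derive(Debug)]
enum Expr {
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
    Literal(bool),
    Pred(&'static str), // stand-in for a field predicate
}

fn normalize(expr: Expr) -> Expr {
    match expr {
        Expr::And(children) => Expr::And(flatten(children, true)),
        Expr::Or(children) => Expr::Or(flatten(children, false)),
        Expr::Not(inner) => normalize_negated(*inner),
        other => other,
    }
}

// Merge nested And/Or nodes of the same kind into a single level.
fn flatten(children: Vec<Expr>, is_and: bool) -> Vec<Expr> {
    let mut out = Vec::new();
    for child in children {
        match normalize(child) {
            Expr::And(inner) if is_and => out.extend(inner),
            Expr::Or(inner) if !is_and => out.extend(inner),
            other => out.push(other),
        }
    }
    out
}

// Apply De Morgan's laws, double-negation elimination, and literal inversion.
fn normalize_negated(inner: Expr) -> Expr {
    match inner {
        Expr::And(children) => {
            Expr::Or(children.into_iter().map(normalize_negated).collect())
        }
        Expr::Or(children) => {
            Expr::And(children.into_iter().map(normalize_negated).collect())
        }
        Expr::Not(e) => normalize(*e),         // double negation
        Expr::Literal(b) => Expr::Literal(!b), // literal inversion
        other => Expr::Not(Box::new(other)),
    }
}

fn main() {
    let expr = Expr::Not(Box::new(Expr::And(vec![
        Expr::Pred("a"),
        Expr::Pred("b"),
    ])));
    // Prints: Or([Not(Pred("a")), Not(Pred("b"))])
    println!("{:?}", normalize(expr));
}
```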
Expression Compilation
Normalized expressions are compiled into two bytecode programs:
- EvalProgram - Stack-based evaluation of predicates
- DomainProgram - Set-based domain analysis for row filtering
The compilation process:
- Detects fusable predicates (multiple conditions on same field)
- Builds domain programs for NOT operations
- Produces postorder bytecode for stack-based evaluation
Sources: llkv-table/src/planner/program.rs:256-631
Expression AST Usage Flow
Key stages:
- SQL Parsing - External sqlparser produces SQL AST
- Plan Building - llkv-plan converts SQL AST to Expr<String>
- Translation - llkv-executor resolves column names to FieldId
- Normalization - llkv-table flattens and optimizes structure
- Compilation - llkv-table produces bytecode programs
- Execution - llkv-table evaluates against RecordBatch data
Sources: llkv-expr/src/expr.rs:1-8 llkv-executor/src/translation/expression.rs:18-174 llkv-table/src/planner/program.rs:256-631
Expression Translation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/program.rs
Purpose and Scope
Expression Translation is the process of converting expressions from using string-based column names to using internal field identifiers. This transformation bridges the gap between the SQL interface layer (which references columns by name) and the execution layer (which references columns by numeric FieldId). This page covers the translation phase that occurs after query planning but before expression compilation.
For information about the expression AST structure, see Expression AST. For information about how translated expressions are compiled into bytecode, see Program Compilation.
Translation Phase in Query Pipeline
The expression translation phase sits between SQL planning and execution, converting symbolic column references to physical field identifiers.
Translation Phase Detail
graph LR
SQL["SQL WHERE\nname = 'Alice'"]
AST["sqlparser AST"]
EXPRSTR["Expr<String>\nColumn('name')"]
CATALOG["Schema Lookup"]
EXPRFID["Expr<FieldId>\nColumn(42)"]
COMPILE["Bytecode\nCompilation"]
SQL --> AST
AST --> EXPRSTR
EXPRSTR --> CATALOG
CATALOG --> EXPRFID
EXPRFID --> COMPILE
style EXPRSTR fill:#fff5e1
style EXPRFID fill:#ffe1e1
Sources: Based on Diagram 5 from system architecture overview
The translation process resolves all column name strings to their corresponding FieldId values by consulting the table schema. This enables the downstream execution engine to work with stable numeric identifiers rather than string lookups.
Generic Expression Types
Both Expr and ScalarExpr are generic over the field identifier type, parameterized as F.
Type Parameter Instantiations
graph TB
subgraph "Generic Types"
EXPR["Expr<'a, F>"]
SCALAR["ScalarExpr<F>"]
FILTER["Filter<'a, F>\n{field_id: F, op: Operator}"]
end
subgraph "String-Based (Planning)"
EXPRSTR["Expr<'static, String>"]
SCALARSTR["ScalarExpr<String>"]
FILTERSTR["Filter<'static, String>"]
end
subgraph "FieldId-Based (Execution)"
EXPRFID["Expr<'static, FieldId>"]
SCALARFID["ScalarExpr<FieldId>"]
FILTERFID["Filter<'static, FieldId>"]
end
EXPR -.instantiated as.-> EXPRSTR
EXPR -.instantiated as.-> EXPRFID
SCALAR -.instantiated as.-> SCALARSTR
SCALAR -.instantiated as.-> SCALARFID
FILTER -.instantiated as.-> FILTERSTR
FILTER -.instantiated as.-> FILTERFID
EXPRSTR ==translate_predicate==> EXPRFID
SCALARSTR ==translate_scalar==> SCALARFID
Sources: llkv-expr/src/expr.rs:14-43 llkv-executor/src/translation/expression.rs:18-27
| Type | Planning Phase | Execution Phase |
|---|---|---|
| Predicate Expression | Expr<'static, String> | Expr<'static, FieldId> |
| Scalar Expression | ScalarExpr<String> | ScalarExpr<FieldId> |
| Filter | Filter<'static, String> | Filter<'static, FieldId> |
| Field Reference | String column name | FieldId numeric identifier |
The generic parameter F allows the same AST structure to be used at different stages of query processing, with type safety enforcing that planning-phase expressions cannot be mixed with execution-phase expressions.
Sources: llkv-expr/src/expr.rs:14-333
Translation Functions
The llkv-executor crate provides two primary translation functions that recursively transform expression trees.
translate_predicate
Translates boolean predicate expressions (Expr<String> → Expr<FieldId>).
Predicate Translation Flow
Sources: llkv-executor/src/translation/expression.rs:18-174
Function Signature:
The function accepts:
- expr: The predicate expression with string column references
- schema: The table schema for column name resolution
- unknown_column: Error constructor for unresolved column names
Sources: llkv-executor/src/translation/expression.rs:18-27
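To make the function's role concrete, here is a self-contained sketch using toy types and a HashMap in place of ExecutorSchema; the real translate_predicate handles every Expr variant and traverses iteratively rather than recursively:

```rust
// Conceptual sketch of predicate translation: resolve string column names to
// numeric field ids via a schema lookup, calling an error callback for
// unknown names. Toy types only.
use std::collections::HashMap;

type FieldId = u32;

#[derive(Debug)]
enum Expr<F> {
    And(Vec<Expr<F>>),
    Pred { field: F, op: &'static str },
}

fn translate<E>(
    expr: Expr<String>,
    schema: &HashMap<String, FieldId>,
    unknown_column: &impl Fn(&str) -> E,
) -> Result<Expr<FieldId>, E> {
    match expr {
        Expr::And(children) => Ok(Expr::And(
            children
                .into_iter()
                .map(|c| translate(c, schema, unknown_column))
                .collect::<Result<_, _>>()?,
        )),
        Expr::Pred { field, op } => {
            let fid = *schema.get(&field).ok_or_else(|| unknown_column(&field))?;
            Ok(Expr::Pred { field: fid, op })
        }
    }
}

fn main() {
    let schema: HashMap<String, FieldId> =
        [("name".to_string(), 42), ("age".to_string(), 43)].into();
    let expr = Expr::And(vec![
        Expr::Pred { field: "name".into(), op: "= 'Alice'" },
        Expr::Pred { field: "age".into(), op: "> 18" },
    ]);
    let translated = translate(expr, &schema, &|c| format!("unknown column '{c}'"));
    println!("{translated:?}");
}
```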
translate_scalar
Translates scalar value expressions (ScalarExpr<String> → ScalarExpr<FieldId>).
Function Signature:
Sources: llkv-executor/src/translation/expression.rs:176-185
Variant-Specific Translation
The translation process handles each expression variant differently:
| Variant | Translation Approach |
|---|---|
Expr::Pred(Filter) | Resolve filter's field_id from String to FieldId |
Expr::And(Vec) / Expr::Or(Vec) | Recursively translate all child expressions |
Expr::Not(Box) | Recursively translate inner expression |
Expr::Compare | Translate both left and right scalar expressions |
Expr::InList | Translate target expression and all list items |
Expr::IsNull | Translate inner expression |
Expr::Literal(bool) | No translation needed, pass through |
Expr::Exists(SubqueryExpr) | Pass through subquery ID unchanged |
ScalarExpr::Column(String) | Resolve string to FieldId |
ScalarExpr::Literal | No translation needed, pass through |
ScalarExpr::Binary | Recursively translate left and right operands |
ScalarExpr::Aggregate | Recursively translate aggregate expression argument |
ScalarExpr::GetField | Recursively translate base expression |
ScalarExpr::Cast | Recursively translate inner expression |
ScalarExpr::Case | Translate operand, all branch conditions/results, and else branch |
ScalarExpr::Coalesce | Recursively translate all items |
Sources: llkv-executor/src/translation/expression.rs:86-143 llkv-executor/src/translation/expression.rs:197-386
graph TD
START["Column Name String"]
ROWID_CHECK{"Is 'rowid'\n(case-insensitive)?"}
ROWID["Return\nROW_ID_FIELD_ID"]
LOOKUP["schema.resolve(name)"]
FOUND{"Found in\nschema?"}
RETURN_FID["Return\ncolumn.field_id"]
ERROR["Invoke\nunknown_column\ncallback"]
START --> ROWID_CHECK
ROWID_CHECK -->|Yes| ROWID
ROWID_CHECK -->|No| LOOKUP
LOOKUP --> FOUND
FOUND -->|Yes| RETURN_FID
FOUND -->|No| ERROR
Field Resolution
The core of translation is resolving string column names to numeric field identifiers.
Field Resolution Logic
Sources: llkv-executor/src/translation/expression.rs:390-407
Special Column: rowid
The system provides a special pseudo-column named rowid that references the internal row identifier:
The rowid column is:
- Case-insensitive (accepts "ROWID", "rowid", "RowId", etc.)
- Available in all tables automatically
- Mapped to the constant llkv_table::ROW_ID_FIELD_ID
- Corresponds to the llkv_column_map::ROW_ID_COLUMN_NAME constant
Sources: llkv-executor/src/translation/expression.rs:399-401 llkv-executor/src/translation/expression.rs:2 llkv-executor/src/translation/expression.rs:5
Schema Lookup
For non-special columns, resolution uses the ExecutorSchema::resolve method:
The schema lookup:
- Searches for a column with the given name
- Returns the ExecutorColumn if found
- Extracts the field_id from the column metadata
- Invokes the error callback if not found
Sources: llkv-executor/src/translation/expression.rs:403-406
graph TB
START["Initial Expression"]
PUSH_ENTER["Push Enter Frame"]
POP["Pop Frame"]
FRAME_TYPE{"Frame Type?"}
ENTER_NODE["Enter Node"]
NODE_TYPE{"Node Type?"}
AND_OR["And/Or"]
NOT["Not"]
LEAF["Leaf (Pred,\nCompare, etc.)"]
PUSH_EXIT["Push Exit Frame"]
PUSH_CHILDREN["Push Child\nEnter Frames\n(reversed)"]
PUSH_EXIT_NOT["Push Exit Frame"]
PUSH_INNER["Push Inner\nEnter Frame"]
TRANSLATE_LEAF["Translate Leaf\nPush to result_stack"]
EXIT_NODE["Exit Node"]
POP_RESULTS["Pop child results\nfrom result_stack"]
BUILD_NODE["Build translated\nparent node"]
PUSH_RESULT["Push to result_stack"]
DONE{"Stack\nempty?"}
RETURN["Return final result"]
START --> PUSH_ENTER
PUSH_ENTER --> POP
POP --> FRAME_TYPE
FRAME_TYPE -->|Enter| ENTER_NODE
FRAME_TYPE -->|Exit| EXIT_NODE
FRAME_TYPE -->|Leaf| TRANSLATE_LEAF
ENTER_NODE --> NODE_TYPE
NODE_TYPE --> AND_OR
NODE_TYPE --> NOT
NODE_TYPE --> LEAF
AND_OR --> PUSH_EXIT
PUSH_EXIT --> PUSH_CHILDREN
PUSH_CHILDREN --> POP
NOT --> PUSH_EXIT_NOT
PUSH_EXIT_NOT --> PUSH_INNER
PUSH_INNER --> POP
LEAF --> TRANSLATE_LEAF
TRANSLATE_LEAF --> POP
EXIT_NODE --> POP_RESULTS
POP_RESULTS --> BUILD_NODE
BUILD_NODE --> PUSH_RESULT
PUSH_RESULT --> POP
POP --> DONE
DONE -->|No| FRAME_TYPE
DONE -->|Yes| RETURN
Traversal Strategy
Translation uses an iterative traversal approach to avoid stack overflow on deeply nested expressions.
Iterative Traversal Algorithm
Sources: llkv-executor/src/translation/expression.rs:39-174
Frame-Based Pattern
The translation uses a frame-based traversal pattern with two stacks:
Work Stack (owned_stack): Contains frames representing work to be done
- OwnedFrame::Enter(expr): Visit a node and potentially expand it
- OwnedFrame::Exit(context): Collect child results and build parent node
- OwnedFrame::Leaf(translated): Push a fully translated leaf node
Result Stack (result_stack): Contains translated expressions ready to be consumed by parent nodes
Sources: llkv-executor/src/translation/expression.rs:48-63
Traversal Example
For the expression And([Pred(name_col), Pred(age_col)]):
| Step | Work Stack | Result Stack | Action |
|---|---|---|---|
| 1 | [Enter(And)] | [] | Start |
| 2 | [Exit(And(2)), Enter(age), Enter(name)] | [] | Expand And |
| 3 | [Exit(And(2)), Enter(age), Leaf(name→42)] | [] | Translate name |
| 4 | [Exit(And(2)), Enter(age)] | [Pred(42)] | Push name result |
| 5 | [Exit(And(2)), Leaf(age→43)] | [Pred(42)] | Translate age |
| 6 | [Exit(And(2))] | [Pred(42), Pred(43)] | Push age result |
| 7 | [] | [And([Pred(42), Pred(43)])] | Build And, push result |
| 8 | Done | [And([...])] | Return final expression |
This approach handles deeply nested expressions (50k+ nodes) without recursion-induced stack overflow.
Sources: llkv-executor/src/translation/expression.rs:62-174
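The same two-stack idea can be sketched in a few lines on a toy tree; the real frames carry translation state rather than strings:

```rust
// Minimal Enter/Exit frame traversal: children are pushed reversed so they
// pop in source order, and each Exit frame consumes its children's results.
enum Expr {
    And(Vec<Expr>),
    Leaf(&'static str),
}

enum Frame<'a> {
    Enter(&'a Expr),
    Exit(usize), // number of child results to pop from the result stack
}

fn render(root: &Expr) -> String {
    let mut work = vec![Frame::Enter(root)];
    let mut results: Vec<String> = Vec::new();

    while let Some(frame) = work.pop() {
        match frame {
            Frame::Enter(Expr::Leaf(name)) => results.push((*name).to_string()),
            Frame::Enter(Expr::And(children)) => {
                // Exit frame goes first so it runs after all children.
                work.push(Frame::Exit(children.len()));
                for child in children.iter().rev() {
                    work.push(Frame::Enter(child));
                }
            }
            Frame::Exit(child_count) => {
                let start = results.len() - child_count;
                let joined = results.split_off(start).join(" AND ");
                results.push(format!("({joined})"));
            }
        }
    }
    results.pop().expect("one result per traversal")
}

fn main() {
    let expr = Expr::And(vec![Expr::Leaf("name = 'Alice'"), Expr::Leaf("age > 18")]);
    // Prints: (name = 'Alice' AND age > 18)
    println!("{}", render(&expr));
}
```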
Why Iterative Traversal?
The codebase comments explain:
This avoids stack overflow on deeply nested expressions (50k+ nodes) by using explicit work_stack and result_stack instead of recursion.
The frame-based pattern is documented in the llkv-plan::traversal module and reused here for expression translation.
Sources: llkv-executor/src/translation/expression.rs:39-46
Error Handling
Translation failures produce descriptive errors through callback functions.
Error Callbacks
Both translation functions accept error constructor callbacks:
| Parameter | Purpose | Example Usage |
|---|---|---|
unknown_column: F | Construct error for unknown column names | Produces an Error::InvalidArgumentError for unresolved column names |
unknown_aggregate: G | Construct error for unknown aggregate functions | Currently unused but reserved for future validation |
The callback pattern allows callers to customize error messages and error types based on their context.
Sources: llkv-executor/src/translation/expression.rs:21-27 llkv-executor/src/translation/expression.rs:189-195
Common Error Scenarios
Translation Error Flow
When schema.resolve(name) returns None, the system invokes the error callback which typically produces an Error::InvalidArgumentError with a message like:
"Binder Error: does not have a column named 'xyz'"
Sources: llkv-executor/src/translation/expression.rs:418-422
Result Stack Underflow
The iterative traversal validates stack consistency:
Stack underflow indicates a bug in the traversal logic rather than invalid user input, so it produces an Error::Internal.
Sources: llkv-executor/src/translation/expression.rs:160-164 llkv-executor/src/translation/expression.rs:171-173
graph TD
EXPR["ScalarExpr<FieldId>"]
TYPE_CHECK{"Expression\nType?"}
LITERAL["Literal"]
INFER_LIT["Infer from\nLiteral type"]
COLUMN["Column(fid)"]
LOOKUP["schema.column_by_field_id(fid)"]
NORMALIZE["normalized_numeric_type"]
BINARY["Binary"]
CHECK_FLOAT["expression_uses_float"]
FLOAT_RESULT["DataType::Float64"]
INT_RESULT["DataType::Int64"]
AGGREGATE["Aggregate"]
AGG_RESULT["DataType::Int64"]
CAST["Cast"]
CAST_TYPE["Use specified\ndata_type"]
RESULT["Arrow DataType"]
EXPR --> TYPE_CHECK
TYPE_CHECK --> LITERAL
TYPE_CHECK --> COLUMN
TYPE_CHECK --> BINARY
TYPE_CHECK --> AGGREGATE
TYPE_CHECK --> CAST
LITERAL --> INFER_LIT
INFER_LIT --> RESULT
COLUMN --> LOOKUP
LOOKUP --> NORMALIZE
NORMALIZE --> RESULT
BINARY --> CHECK_FLOAT
CHECK_FLOAT -->|Uses Float| FLOAT_RESULT
CHECK_FLOAT -->|Integer only| INT_RESULT
FLOAT_RESULT --> RESULT
INT_RESULT --> RESULT
AGGREGATE --> AGG_RESULT
AGG_RESULT --> RESULT
CAST --> CAST_TYPE
CAST_TYPE --> RESULT
Type Inference Integration
After translation, expressions with FieldId references can be used for schema-based type inference.
The infer_computed_data_type function in llkv-executor/src/translation/schema.rs inspects translated expressions to determine their Arrow data types:
Type Inference for Translated Expressions
Sources: llkv-executor/src/translation/schema.rs:53-123
Type Inference Rules
| Expression | Inferred Type | Notes |
|---|---|---|
Literal::Integer | Int64 | 64-bit signed integer |
Literal::Float | Float64 | 64-bit floating point |
Literal::Decimal | Decimal128(p,s) | Precision and scale from value |
Literal::Boolean | Boolean | Boolean flag |
Literal::String | Utf8 | UTF-8 string |
Literal::Null | Null | Null type marker |
Column(fid) | Schema lookup | normalized_numeric_type(column.data_type) |
Binary | Float64 or Int64 | Float if any operand is float |
Compare | Int64 | Comparisons produce boolean (0/1) as Int64 |
Aggregate | Int64 | Most aggregates return Int64 |
Cast | Specified type | Uses explicit data_type parameter |
The normalized_numeric_type function maps small integer types (Int8, Int16, Int32) to Int64 and unsigned/float types to Float64 for consistent expression evaluation.
Sources: llkv-executor/src/translation/schema.rs:125-147
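A sketch of that normalization rule, assuming the arrow crate's DataType enum (the actual function in llkv-executor may differ in detail):

```rust
// Illustrative numeric normalization: small signed integers widen to Int64,
// unsigned and floating-point types evaluate as Float64, everything else
// passes through unchanged.
use arrow::datatypes::DataType;

fn normalized_numeric_type(dt: &DataType) -> DataType {
    match dt {
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 => DataType::Int64,
        DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64
        | DataType::Float32
        | DataType::Float64 => DataType::Float64,
        other => other.clone(),
    }
}

fn main() {
    assert_eq!(normalized_numeric_type(&DataType::Int16), DataType::Int64);
    assert_eq!(normalized_numeric_type(&DataType::Float32), DataType::Float64);
}
```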
Translation in Context
The translation phase fits into the broader query execution pipeline:
Translation Phase in Query Pipeline
Sources: Based on Diagram 2 and Diagram 5 from system architecture overview
The translation layer serves as the bridge between the human-readable SQL layer (with column names) and the machine-optimized execution layer (with numeric field IDs). This separation allows:
- Planning flexibility : Query plans can reference columns symbolically without knowing physical storage details
- Schema evolution : Field IDs remain stable even if column names change
- Type safety : The type system prevents mixing planning-phase and execution-phase expressions
- Optimization : Numeric field IDs enable efficient lookups in columnar storage
Sources: llkv-expr/src/expr.rs:1-359 llkv-executor/src/translation/expression.rs:1-424 llkv-executor/src/translation/schema.rs:1-338
Program Compilation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the predicate compilation system in LLKV, which transforms typed predicate expressions (Expr<FieldId>) into stack-based bytecode programs for efficient evaluation during table scans. The compilation process produces two types of programs: EvalProgram for predicate evaluation and DomainProgram for domain analysis (determining which row IDs might match predicates).
For information about the expression AST structure itself, see Expression AST. For how these compiled programs are evaluated during execution, see Filter Evaluation. For how expressions are translated from string column names to field IDs, see Expression Translation.
Compilation Pipeline Overview
The compilation process transforms a normalized predicate expression into executable bytecode through the ProgramCompiler in llkv-table/src/planner/program.rs:257-284. The compiler produces a ProgramSet containing both evaluation and domain analysis programs.
Sources: llkv-table/src/planner/program.rs:257-284 llkv-table/src/planner/mod.rs:612-637
graph TB
Input["Expr<FieldId>\n(Raw predicate)"]
Normalize["normalize_predicate()\n(Apply De Morgan's laws,\nflatten And/Or)"]
Compiler["ProgramCompiler::new()\nProgramCompiler::compile()"]
ProgramSet["ProgramSet"]
EvalProgram["EvalProgram\n(Stack-based bytecode\nfor evaluation)"]
DomainRegistry["DomainRegistry\n(Domain programs\nfor optimization)"]
Input --> Normalize
Normalize --> Compiler
Compiler --> ProgramSet
ProgramSet --> EvalProgram
ProgramSet --> DomainRegistry
EvalOps["EvalOp instructions:\nPushPredicate, And, Or,\nNot, FusedAnd"]
DomainOps["DomainOp instructions:\nPushFieldAll, Union,\nIntersect"]
EvalProgram --> EvalOps
DomainRegistry --> DomainOps
Compilation Entry Point
The TablePlanner::plan_scan method invokes the compiler when preparing a scan operation. The process begins with predicate normalization followed by compilation into both evaluation and domain programs.
Sources: llkv-table/src/planner/mod.rs:625-628 llkv-table/src/planner/program.rs:266-283
graph LR
TablePlanner["TablePlanner::plan_scan"]
NormFilter["normalize_predicate(filter_expr)"]
CreateCompiler["ProgramCompiler::new(Arc<Expr>)"]
Compile["compiler.compile()"]
ProgramSet["ProgramSet<'expr>"]
TablePlanner --> NormFilter
NormFilter --> CreateCompiler
CreateCompiler --> Compile
Compile --> ProgramSet
ProgramSet --> EvalProgram["EvalProgram\n(ops: Vec<EvalOp>)"]
ProgramSet --> DomainRegistry["DomainRegistry\n(programs, index)"]
Predicate Normalization
Before compilation, predicates are normalized using normalize_predicate() to simplify the expression tree. Normalization applies two key transformations:
Transformation Rules
| Input Pattern | Output Pattern | Description |
|---|---|---|
And(And(a, b), c) | And(a, b, c) | Flatten nested AND operations |
Or(Or(a, b), c) | Or(a, b, c) | Flatten nested OR operations |
Not(And(a, b)) | Or(Not(a), Not(b)) | De Morgan's law for AND |
Not(Or(a, b)) | And(Not(a), Not(b)) | De Morgan's law for OR |
Not(Not(a)) | a | Double negation elimination |
Not(Literal(true)) | Literal(false) | Literal inversion |
Not(IsNull{expr, negated}) | IsNull{expr, !negated} | IsNull negation flip |
Normalization Algorithm
The normalize_expr() function recursively transforms the expression tree using pattern matching:
Sources: llkv-table/src/planner/program.rs:286-343
graph TD
Start["normalize_expr(expr)"]
CheckAnd{"expr is And?"}
CheckOr{"expr is Or?"}
CheckNot{"expr is Not?"}
Other["Return expr unchanged"]
FlattenAnd["Flatten nested And nodes\ninto single And"]
FlattenOr["Flatten nested Or nodes\ninto single Or"]
ApplyDeMorgan["normalize_negated(inner)\nApply De Morgan's laws"]
Start --> CheckAnd
CheckAnd -->|Yes| FlattenAnd
CheckAnd -->|No| CheckOr
CheckOr -->|Yes| FlattenOr
CheckOr -->|No| CheckNot
CheckNot -->|Yes| ApplyDeMorgan
CheckNot -->|No| Other
FlattenAnd --> Return["Return normalized expr"]
FlattenOr --> Return
ApplyDeMorgan --> Return
Other --> Return
EvalProgram Compilation
The compile_eval() function generates a sequence of EvalOp instructions using iterative traversal with an explicit work stack. This avoids stack overflow on deeply nested expressions and produces postorder bytecode.
EvalOp Instruction Set
The EvalOp enum defines the instruction types for predicate evaluation:
| Instruction | Description | Stack Effect |
|---|---|---|
PushPredicate(OwnedFilter) | Push single predicate result | → bool |
PushCompare{left, op, right} | Evaluate comparison expression | → bool |
PushInList{expr, list, negated} | Evaluate IN list membership | → bool |
PushIsNull{expr, negated} | Evaluate IS NULL test | → bool |
PushLiteral(bool) | Push constant boolean value | → bool |
FusedAnd{field_id, filters} | Optimized AND for same field | → bool |
And{child_count} | Pop N bools, push AND result | bool×N → bool |
Or{child_count} | Pop N bools, push OR result | bool×N → bool |
Not{domain} | Pop bool, push NOT result | bool → bool |
graph TB
subgraph "Compilation Phases"
Input["Expression Tree"]
Phase1["Enter Phase\n(Pre-order traversal)"]
Phase2["Exit Phase\n(Post-order emission)"]
Output["Vec<EvalOp>"]
end
Input --> Phase1
Phase1 --> Phase2
Phase2 --> Output
subgraph "Frame Types"
EnterFrame["EvalVisit::Enter(expr)\nPush children in reverse order"]
ExitFrame["EvalVisit::Exit(expr)\nEmit instruction"]
FusedFrame["EvalVisit::EmitFused\nEmit FusedAnd optimization"]
end
Phase1 --> EnterFrame
EnterFrame --> ExitFrame
EnterFrame --> FusedFrame
ExitFrame --> Phase2
FusedFrame --> Phase2
Compilation Process
The compiler uses a two-pass approach with EvalVisit frames to track traversal state:
Sources: llkv-table/src/planner/program.rs:407-516
Predicate Fusion Optimization
When the compiler encounters an And node where all children are predicates (Expr::Pred) on the same FieldId, it emits a single FusedAnd instruction instead of multiple individual predicates. This optimization is detected by gather_fused():
Sources: llkv-table/src/planner/program.rs:518-542
graph LR
AndNode["And(children)"]
Check["gather_fused(children)"]
Decision{"All children\nare Pred(field_id)\nwith same field_id?"}
Fused["Emit FusedAnd{\nfield_id,\nfilters\n}"]
Normal["Emit individual\nPushPredicate\nfor each child\n+ And instruction"]
AndNode --> Check
Check --> Decision
Decision -->|Yes| Fused
Decision -->|No| Normal
DomainProgram Compilation
The compile_domain() function generates DomainOp instructions for domain analysis. Domain programs determine which row IDs might satisfy a predicate without evaluating the full expression, enabling storage-layer optimizations.
DomainOp Instruction Set
| Instruction | Description | Stack Effect |
|---|---|---|
PushFieldAll(FieldId) | All rows where field exists | → RowSet |
PushCompareDomain{left, right, op, fields} | Domain of rows for comparison | → RowSet |
PushInListDomain{expr, list, fields, negated} | Domain of rows for IN list | → RowSet |
PushIsNullDomain{expr, fields, negated} | Domain of rows for IS NULL | → RowSet |
PushLiteralFalse | Empty row set | → RowSet |
PushAllRows | All rows in table | → RowSet |
Union{child_count} | Pop N sets, push union | RowSet×N → RowSet |
Intersect{child_count} | Pop N sets, push intersection | RowSet×N → RowSet |
Domain Analysis Algorithm
Domain compilation uses iterative traversal with DomainVisit frames, similar to eval compilation but with different semantics:
graph TB
Start["compile_domain(expr)"]
Stack["Work stack:\nVec<DomainVisit>"]
Ops["Output:\nVec<DomainOp>"]
EnterFrame["DomainVisit::Enter(node)\nPush children + Exit frame"]
ExitFrame["DomainVisit::Exit(node)\nEmit DomainOp"]
Start --> Stack
Stack --> Process{"Pop frame"}
Process -->|Enter| EnterFrame
Process -->|Exit| ExitFrame
EnterFrame --> Stack
ExitFrame --> Emit["Emit instruction to ops"]
Emit --> Stack
Process -->|Empty| Done["Return DomainProgram{ops}"]
Domain Semantics
The domain of an expression represents the set of row IDs where the expression could potentially be true (or false for Not):
| Expression Type | Domain Semantics |
|---|---|
Pred(filter) | All rows where filter.field_id exists |
Compare{left, right, op} | Union of domains of all fields in left and right |
InList{expr, list} | Union of domains of all fields in expr and list items |
IsNull{expr} | Union of domains of all fields in expr |
Literal(true) | All rows in table |
Literal(false) | Empty set |
And(children) | Intersection of children domains |
Or(children) | Union of children domains |
Not(inner) | Same as inner domain (NOT doesn't change domain) |
Sources: llkv-table/src/planner/program.rs:544-631
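A small sketch of how these semantics compose, using plain row-id sets in place of the engine's internal representation:

```rust
// Toy domain combination: And intersects child domains, Or unions them.
use std::collections::BTreeSet;

fn main() {
    let age_rows: BTreeSet<u64> = [1, 2, 3, 4].into();
    let name_rows: BTreeSet<u64> = [2, 3, 5].into();
    let status_rows: BTreeSet<u64> = [7, 8].into();

    // Domain of (age_pred AND name_pred): intersection.
    let and_domain: BTreeSet<u64> = age_rows.intersection(&name_rows).copied().collect();
    // Domain of (... OR status_pred): union.
    let or_domain: BTreeSet<u64> = and_domain.union(&status_rows).copied().collect();

    let expected: BTreeSet<u64> = [2, 3, 7, 8].into();
    assert_eq!(or_domain, expected);
}
```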
graph TB
Input["ScalarExpr<FieldId>"]
Stack["Work stack:\nVec<&ScalarExpr>"]
Seen["FxHashSet<FieldId>\n(deduplication)"]
Output["Vec<FieldId>\n(sorted)"]
Input --> Stack
Stack --> Process{"Pop expr"}
Process -->|Column fid| AddField["Insert fid into seen"]
Process -->|Literal| Skip["Skip (no fields)"]
Process -->|Binary| PushChildren["Push left, right\nto stack"]
Process -->|Compare| PushChildren
Process -->|Aggregate| PushAggExpr["Push aggregate expr\nto stack"]
Process -->|Other| PushNested["Push nested exprs\nto stack"]
AddField --> Stack
Skip --> Stack
PushChildren --> Stack
PushAggExpr --> Stack
PushNested --> Stack
Process -->|Empty| Collect["Collect seen into Vec\nSort unstable"]
Collect --> Output
Field Collection for Domain Analysis
The collect_fields() function extracts all FieldId references from scalar expressions using iterative traversal. This determines which columns' row sets need to be considered during domain evaluation.
Sources: llkv-table/src/planner/program.rs:633-709
Data Structures
ProgramSet
The top-level container returned by compilation, holding all compiled artifacts:
ProgramSet<'expr> {
eval: EvalProgram, // Bytecode for predicate evaluation
domains: DomainRegistry, // Domain programs for optimization
_root_expr: Arc<Expr<'expr, FieldId>> // Original expression (lifetime anchor)
}
Sources: llkv-table/src/planner/program.rs:23-29
DomainRegistry
Manages the collection of compiled domain programs with deduplication via ExprKey:
DomainRegistry {
programs: Vec<DomainProgram>, // All compiled domain programs
index: FxHashMap<ExprKey, DomainProgramId>, // Expression → program ID map
root: Option<DomainProgramId> // ID of root domain program
}
The registry uses ExprKey (a pointer-based key) to detect duplicate subexpressions and reuse compiled domain programs.
Sources: llkv-table/src/planner/program.rs:12-20 llkv-table/src/planner/program.rs:196-219
OwnedFilter and OwnedOperator
To support owned bytecode programs with no lifetime constraints, the compiler converts borrowed Filter<'a, FieldId> and Operator<'a> types into owned variants:
| Borrowed Type | Owned Type | Purpose |
|---|---|---|
Filter<'a, FieldId> | OwnedFilter | Stores field_id + owned operator |
Operator<'a> | OwnedOperator | Owns pattern strings and literal vectors |
This conversion happens during compile_eval() when creating PushPredicate and FusedAnd instructions.
Sources: llkv-table/src/planner/program.rs:69-191
Integration with Table Scanning
The compiled programs are used during table scan execution in two ways:
- EvalProgram : Evaluated per-row or per-batch during scan to determine which rows match the predicate
- DomainProgram : Used for storage-layer optimizations to skip scanning columns or chunks that cannot match
Usage in PlannedScan
The TablePlanner::plan_scan() method creates a PlannedScan struct that bundles the compiled programs with scan metadata:
PlannedScan<'expr, P> {
projections: Vec<ScanProjection>,
filter_expr: Arc<Expr<'expr, FieldId>>,
options: ScanStreamOptions<P>,
plan_graph: PlanGraph, // For query plan visualization
programs: ProgramSet<'expr> // Compiled evaluation and domain programs
}
Sources: llkv-table/src/planner/mod.rs:500-509 llkv-table/src/planner/mod.rs:630-636
Example Compilation
Consider the predicate: (age > 18 AND name LIKE 'A%') OR status = 'active'
After normalization, this remains unchanged (no nested And/Or to flatten). The compiler produces:
EvalProgram Instructions
1. PushPredicate(Filter { field_id: age, op: GreaterThan(18) })
2. PushPredicate(Filter { field_id: name, op: StartsWith("A", true) })
3. And { child_count: 2 }
4. PushPredicate(Filter { field_id: status, op: Equals("active") })
5. Or { child_count: 2 }
DomainProgram Instructions
1. PushFieldAll(age) // Domain for first predicate
2. PushFieldAll(name) // Domain for second predicate
3. Intersect { child_count: 2 } // AND combines via intersection
4. PushFieldAll(status) // Domain for third predicate
5. Union { child_count: 2 } // OR combines via union
The domain program indicates that, to potentially match, a row must have values for both age and name, or a value for status.
Sources: llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:550-631
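To show how such postorder bytecode executes, here is a toy stack machine over already-evaluated predicate results; real EvalOp instructions evaluate filters against column data rather than pushing precomputed booleans:

```rust
// Tiny stack-machine sketch for postorder predicate bytecode.
enum EvalOp {
    PushLiteral(bool), // stand-in for PushPredicate evaluated on one row
    And { child_count: usize },
    Or { child_count: usize },
    Not,
}

fn evaluate(ops: &[EvalOp]) -> bool {
    let mut stack: Vec<bool> = Vec::new();
    for op in ops {
        match op {
            EvalOp::PushLiteral(v) => stack.push(*v),
            EvalOp::And { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.split_off(start).into_iter().all(|v| v);
                stack.push(result);
            }
            EvalOp::Or { child_count } => {
                let start = stack.len() - *child_count;
                let result = stack.split_off(start).into_iter().any(|v| v);
                stack.push(result);
            }
            EvalOp::Not => {
                let v = stack.pop().expect("operand");
                stack.push(!v);
            }
        }
    }
    stack.pop().expect("final result")
}

fn main() {
    // (age > 18 AND name LIKE 'A%') OR status = 'active', with the three
    // predicates already evaluated for one row as true, false, true.
    let program = [
        EvalOp::PushLiteral(true),
        EvalOp::PushLiteral(false),
        EvalOp::And { child_count: 2 },
        EvalOp::PushLiteral(true),
        EvalOp::Or { child_count: 2 },
    ];
    assert!(evaluate(&program));
}
```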
graph LR
Recursive["Recursive approach\n(Stack overflow risk)"]
Iterative["Iterative approach\n(Explicit work stack)"]
Recursive -->|Replace with| Iterative
Iterative --> WorkStack["Vec<Frame>\n(Heap-allocated)"]
Iterative --> ResultStack["Vec<Result>\n(Post-order accumulation)"]
Stack Overflow Prevention
Both compile_eval() and compile_domain() use explicit work stacks instead of recursion to handle deeply nested expressions (50k+ nodes) without stack overflow. This follows the iterative traversal pattern described in the codebase:
The pattern uses Enter/Exit frames to simulate recursive descent and ascent, accumulating results in a separate stack during the Exit phase.
Sources: llkv-table/src/planner/program.rs:407-516 llkv-table/src/planner/program.rs:544-631
Scalar Evaluation and NumericKernels
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the scalar expression evaluation engine used during table scans to compute expressions like col1 + col2 * 3, CAST(col AS Float64), and CASE statements. The NumericKernels utility centralizes numeric computation logic, providing both row-by-row and vectorized batch evaluation strategies. For the abstract expression AST that gets evaluated, see Expression AST. For how expressions are compiled into bytecode programs for predicate evaluation, see Program Compilation.
Overview
The scalar evaluation system provides a unified numeric computation layer that operates over Arrow arrays during table scans. When a query contains computed projections like SELECT col1 + col2 AS sum FROM table, the executor needs to efficiently evaluate these expressions across potentially millions of rows. The NumericKernels struct and associated types provide:
- Type abstraction : Wraps Arrow's Int64Array, Float64Array, and Decimal128Array into a unified NumericArray interface
- Evaluation strategies : Supports both row-by-row evaluation (for complex expressions) and vectorized batch evaluation (for simple arithmetic)
- Optimization : Applies algebraic simplification to detect affine transformations and constant folding opportunities
- Type coercion : Handles implicit casting between integer, float, and decimal types following SQLite-style semantics
Sources : llkv-table/src/scalar_eval.rs:1-22
graph TB
subgraph "Input Layer"
ARROW_INT["Int64Array\n(Arrow)"]
ARROW_FLOAT["Float64Array\n(Arrow)"]
ARROW_DEC["Decimal128Array\n(Arrow)"]
end
subgraph "Abstraction Layer"
NUM_ARRAY["NumericArray\nkind: NumericKind\nlen: usize"]
NUM_VALUE["NumericValue\nInteger(i64)\nFloat(f64)\nDecimal(DecimalValue)"]
end
subgraph "Evaluation Engine"
KERNELS["NumericKernels\nevaluate_value()\nevaluate_batch()\nsimplify()"]
end
subgraph "Output Layer"
RESULT_ARRAY["ArrayRef\n(Arrow)"]
end
ARROW_INT --> NUM_ARRAY
ARROW_FLOAT --> NUM_ARRAY
ARROW_DEC --> NUM_ARRAY
NUM_ARRAY --> NUM_VALUE
NUM_VALUE --> KERNELS
KERNELS --> RESULT_ARRAY
style KERNELS fill:#e1f5ff
Core Data Types
NumericKind
An enum distinguishing the underlying numeric representation. This preserves type information through evaluation to enable intelligent casting decisions:
Sources : llkv-table/src/scalar_eval.rs:26-32
NumericValue
A tagged union representing a single numeric value while preserving its original type. Provides conversion methods to target types:
| Variant | Description | Conversion Methods |
|---|---|---|
Integer(i64) | Signed 64-bit integer | as_f64(), as_i64() |
Float(f64) | 64-bit floating point | as_f64() |
Decimal(DecimalValue) | Fixed-precision decimal | as_f64() |
All variants support .kind() to retrieve the original NumericKind.
Sources : llkv-table/src/scalar_eval.rs:34-69
NumericArray
Wraps Arrow array types with a unified interface for numeric access. Internally stores optional Arc<Int64Array>, Arc<Float64Array>, or Arc<Decimal128Array> based on the kind field:
Key Methods :
graph LR
subgraph "NumericArray"
KIND["kind: NumericKind"]
LEN["len: usize"]
INT_DATA["int_data: Option<Arc<Int64Array>>"]
FLOAT_DATA["float_data: Option<Arc<Float64Array>>"]
DECIMAL_DATA["decimal_data: Option<Arc<Decimal128Array>>"]
end
KIND -.determines.-> INT_DATA
KIND -.determines.-> FLOAT_DATA
KIND -.determines.-> DECIMAL_DATA
- try_from_arrow(array: &ArrayRef): Constructs from any Arrow array, applying type casting as needed
- value(idx: usize): Extracts Option<NumericValue> at the given index
- promote_to_float(): Converts to Float64 representation for mixed-type arithmetic
- to_array_ref(): Exports back to an Arrow ArrayRef
Sources : llkv-table/src/scalar_eval.rs:83-383
NumericKernels API
The NumericKernels struct provides static methods for expression evaluation and optimization. It serves as the primary entry point for scalar computation during table scans.
Field Collection
Recursively traverses a scalar expression to identify all referenced column fields. Used by the table planner to determine which columns must be fetched from storage.
Sources : llkv-table/src/scalar_eval.rs:455-526
Array Preparation
Converts a set of Arrow arrays into the NumericArray representation, applying type coercion as needed. The needed_fields parameter filters to only the columns referenced by the expression being evaluated. Returns a FxHashMap<FieldId, NumericArray> for fast lookup during evaluation.
Sources : llkv-table/src/scalar_eval.rs:528-547
Value-by-Value Evaluation
Evaluates a scalar expression for a single row at index idx. Supports:
- Binary arithmetic (+, -, *, /, %)
- Comparisons (=, <, >, etc.)
- Logical operators (NOT, IS NULL)
- Type casts (CAST(... AS Float64))
- Control flow (CASE, COALESCE)
- Random number generation (RANDOM())
Returns None for NULL propagation.
Sources : llkv-table/src/scalar_eval.rs:549-673
Batch Evaluation
Evaluates an expression across all rows in a batch, returning an ArrayRef. The implementation attempts vectorized evaluation for simple expressions (single column, literals, affine transformations) and falls back to row-by-row evaluation for complex cases.
Sources : llkv-table/src/scalar_eval.rs:676-712
graph TB
EXPR["ScalarExpr<FieldId>"]
SIMPLIFY["simplify()\nDetect affine patterns"]
VECTORIZE["try_evaluate_vectorized()\nCheck for fast path"]
FAST["Vectorized Evaluation\nDirect Arrow compute"]
SLOW["Row-by-Row Loop\nevaluate_value()
per row"]
RESULT["ArrayRef"]
EXPR --> SIMPLIFY
SIMPLIFY --> VECTORIZE
VECTORIZE -->|Success| FAST
VECTORIZE -->|Fallback| SLOW
FAST --> RESULT
SLOW --> RESULT
Vectorization and Optimization
VectorizedExpr
Internal representation for expressions that can be evaluated without per-row dispatch:
The try_evaluate_vectorized method attempts to decompose complex expressions into VectorizedExpr nodes, enabling efficient vectorized computation for binary operations between arrays and scalars.
Sources : llkv-table/src/scalar_eval.rs:385-414
graph LR
INPUT["col * 3 + 5"]
DETECT["Detect Affine Pattern"]
AFFINE["AffineExpr\nfield: col\nscale: 3.0\noffset: 5.0"]
FAST_EVAL["Single Pass Evaluation\nemit_no_nulls()"]
INPUT --> DETECT
DETECT --> AFFINE
AFFINE --> FAST_EVAL
Affine Expression Detection
The simplify method detects affine transformations of the form scale * field + offset:
When an affine pattern is detected, the executor can apply the transformation in a single pass without intermediate allocations. The try_extract_affine_expr method recursively analyzes binary arithmetic trees to identify this pattern.
Sources : llkv-table/src/scalar_eval.rs:1138-1261
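A sketch of the single-pass affine evaluation, assuming the arrow crate's array types (the helper shape is illustrative, not the actual emit_no_nulls signature):

```rust
// Once `col * 3 + 5` is recognized as scale = 3.0, offset = 5.0, the whole
// column is produced in one pass, with NULLs propagating as None.
use std::sync::Arc;
use arrow::array::{ArrayRef, Float64Array, Int64Array};

fn apply_affine(values: &Int64Array, scale: f64, offset: f64) -> ArrayRef {
    let out: Float64Array = values
        .iter()
        .map(|v| v.map(|x| scale * x as f64 + offset)) // None (NULL) propagates
        .collect();
    Arc::new(out) as ArrayRef
}

fn main() {
    let input = Int64Array::from(vec![Some(1), None, Some(3)]);
    let result = apply_affine(&input, 2.0, 10.0); // [12.0, NULL, 16.0]
    assert_eq!(result.len(), 3);
}
```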
Constant Folding
The simplification pass performs constant folding for expressions like 2 + 3 or 10.0 / 2.0, replacing them with Literal(5) or Literal(5.0). This eliminates redundant computation during execution.
Sources : llkv-table/src/scalar_eval.rs:997-1137
Type Coercion and Casting
Implicit Coercion
When evaluating binary operations on mixed types, the system applies implicit promotion rules:
| Left Type | Right Type | Result Type | Behavior |
|---|---|---|---|
| Integer | Integer | Integer | No conversion |
| Integer | Float | Float | Promote left to Float64 |
| Float | Integer | Float | Promote right to Float64 |
| Decimal | Any | Float | Convert both to Float64 |
The infer_result_kind method determines the target type before evaluation, and to_aligned_array_ref applies the necessary promotions.
Sources : llkv-table/src/scalar_eval.rs:1398-1447
Explicit Casting
The CAST expression variant supports explicit type conversion:
Casting is handled during evaluation by:
- Evaluating the inner expression to a NumericValue
- Converting to the target NumericKind via cast_numeric_value_to_kind
- Constructing the result array with the target Arrow DataType
Special handling exists for DataType::Date32 casts, which use the llkv-plan date utilities.
Sources : llkv-table/src/scalar_eval.rs:1449-1472 llkv-table/src/scalar_eval.rs:611-624
sequenceDiagram
participant Planner as TablePlanner
participant Executor as TableExecutor
participant Kernels as NumericKernels
participant Store as ColumnStore
Planner->>Planner: Analyze projections
Planner->>Kernels: collect_fields(expr)
Kernels-->>Planner: Set<FieldId>
Planner->>Planner: Build unique_lfids list
Executor->>Store: Gather columns for row batch
Store-->>Executor: Vec<ArrayRef>
Executor->>Kernels: prepare_numeric_arrays(lfids, arrays, fields)
Kernels-->>Executor: NumericArrayMap
Executor->>Kernels: evaluate_batch_simplified(expr, len, arrays)
Kernels->>Kernels: try_evaluate_vectorized()
alt Vectorized
Kernels->>Kernels: compute_binary_array_array()
else Fallback
Kernels->>Kernels: Loop: evaluate_value(expr, idx)
end
Kernels-->>Executor: ArrayRef (result column)
Executor->>Executor: Append to RecordBatch
Integration with Table Scans
The numeric evaluation engine is invoked by the table executor when processing computed projections. The integration flow:
Projection Evaluation Context
The ProjectionEval enum distinguishes between direct column references and computed expressions:
For Computed variants, the planner:
- Calls NumericKernels::simplify() to optimize the expression
- Invokes NumericKernels::collect_fields() to determine dependencies
- Stores the simplified expression for evaluation
During execution, RowStreamBuilder materializes computed columns by calling evaluate_batch_simplified for each expression.
Sources : llkv-table/src/planner/mod.rs:494-498 llkv-table/src/planner/mod.rs:1073-1107
Passthrough Optimization
The planner detects when a computed expression is simply a column reference (after simplification) via NumericKernels::passthrough_column(). In this case, the column is fetched directly from storage without re-evaluation:
This avoids redundant computation for queries like SELECT col + 0 AS x.
Sources : llkv-table/src/planner/mod.rs:1110-1116 llkv-table/src/scalar_eval.rs:874-907
Data Type Inference
The evaluation engine must determine result types for expressions before evaluation to construct properly-typed Arrow arrays. The infer_computed_data_type function in llkv-executor delegates to numeric kernel logic:
| Expression Type | Inferred Data Type | Rule |
|---|---|---|
Literal(Integer) | Int64 | Direct mapping |
Literal(Float) | Float64 | Direct mapping |
Binary { ... } | Int64 or Float64 | Based on operand types |
Compare { ... } | Int64 | Boolean as 0/1 integer |
Cast { data_type, ... } | data_type | Explicit type |
Random | Float64 | Always float |
The expression_uses_float helper recursively checks if any operand is floating-point, promoting the result type accordingly.
Sources : llkv-executor/src/translation/schema.rs:53-123
Performance Characteristics
Row-by-Row Evaluation
Used for:
- Expressions with control flow (CASE, COALESCE)
- Expressions containing CAST to non-numeric types
- Expressions with interval arithmetic (date operations)
Cost : O(n) row dispatch overhead, branch mispredictions on conditionals
Vectorized Evaluation
Used for:
- Simple arithmetic (col1 + col2, col * 3)
- Single column references
- Constant literals
Cost : O(n) with SIMD-friendly memory access patterns, no per-row dispatch
graph LR
INPUT["Int64Array\n[1,2,3,4,5]"]
AFFINE["scale=2.0\noffset=10.0"]
CALLBACK["emit_no_nulls(\nlen, /i/ 2.0*values[i]+10.0\n)"]
OUTPUT["Float64Array\n[12,14,16,18,20]"]
INPUT --> AFFINE
AFFINE --> CALLBACK
CALLBACK --> OUTPUT
Affine Evaluation
Special case for scale * field + offset expressions. The executor generates values directly into the output buffer using emit_no_nulls or emit_with_nulls callbacks, avoiding intermediate allocations.
Sources : llkv-table/src/planner/mod.rs:253-357
Key Implementation Details
NULL Handling
NULL values propagate through arithmetic operations according to SQL semantics:
- NULL + 5 → NULL
- NULL IS NULL → 1 (true)
- COALESCE(NULL, 5) → 5
The NumericValue is wrapped in Option<T>, with None representing SQL NULL. Binary operations return None if either operand is None.
Sources : llkv-table/src/scalar_eval.rs:564-571
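A minimal sketch of this Option-based propagation:

```rust
// SQL-style NULL propagation using Option: any None operand yields None.
fn add(left: Option<i64>, right: Option<i64>) -> Option<i64> {
    match (left, right) {
        (Some(l), Some(r)) => Some(l + r),
        _ => None, // NULL + anything = NULL
    }
}

fn main() {
    assert_eq!(add(Some(2), Some(3)), Some(5));
    assert_eq!(add(None, Some(5)), None); // NULL + 5 → NULL
}
```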
Type Safety
The system maintains type safety through:
- Tagged unions : NumericValue preserves the original type via the discriminant
- Explicit promotion : promote_to_float() is called only when type mixing requires it
- Result type inference : The planner determines output types before evaluation
This prevents silent precision loss and enables query optimizations based on type information.
Sources : llkv-table/src/scalar_eval.rs:295-342
Memory Efficiency
The NumericArray struct uses Arc<T> for backing arrays, enabling zero-copy sharing when:
- Returning a column directly without computation
- Slicing arrays for sorted run evaluation
- Sharing arrays across multiple expressions referencing the same column
The to_array_ref() method clones the Arc, not the underlying data.
Sources : llkv-table/src/scalar_eval.rs:275-293
Aggregation System
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/src/lib.rs
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_value.rs
The aggregation system evaluates SQL aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.) over Arrow RecordBatch streams. It consists of a planning layer that defines aggregate specifications and an execution layer that performs incremental accumulation with overflow checking and DISTINCT value tracking.
For information about scalar expression evaluation, see Scalar Evaluation and NumericKernels. For query execution orchestration, see Query Execution.
Architecture Overview
The aggregation system operates across three crates:
Sources: llkv-aggregate/src/lib.rs:1-1935 llkv-executor/src/lib.rs:1-3599 llkv-plan/src/plans.rs:1-1458
The planner creates AggregateExpr instances from SQL AST nodes, which the executor converts to AggregateSpec descriptors. These specs initialize AggregateAccumulator instances that process batches incrementally, accumulating values in memory. The AggregateState wraps the accumulator with metadata (alias, override values) and produces the final output arrays.
Aggregate Specification
AggregateSpec Structure
AggregateSpec defines an aggregate operation at plan-time:
| Field | Type | Purpose |
|---|---|---|
alias | String | Output column name for the aggregate result |
kind | AggregateKind | Type of aggregate operation and its parameters |
Sources: llkv-aggregate/src/lib.rs:23-27
AggregateKind Variants
Sources: llkv-aggregate/src/lib.rs:30-67
Each variant captures the field ID to aggregate over, the expected data type, and operation-specific flags like distinct or separator. The field_id is optional for COUNT(*) which counts all rows regardless of column values.
Accumulator System
Accumulator Variants
AggregateAccumulator implements streaming accumulation for each aggregate type:
Sources: llkv-aggregate/src/lib.rs:92-247
graph TB
subgraph "COUNT Variants"
CS[CountStar\nvalue: i64]
CC[CountColumn\ncolumn_index: usize\nvalue: i64]
CDC[CountDistinctColumn\ncolumn_index: usize\nseen: FxHashSet]
end
subgraph "SUM Variants"
SI64[SumInt64\nvalue: Option-i64-\nhas_values: bool]
SDI64[SumDistinctInt64\nsum: Option-i64-\nseen: FxHashSet]
SF64[SumFloat64\nvalue: f64\nsaw_value: bool]
SDF64[SumDistinctFloat64\nsum: f64\nseen: FxHashSet]
SD128[SumDecimal128\nsum: i128\nprecision: u8\nscale: i8]
end
subgraph "AVG Variants"
AI64[AvgInt64\nsum: i64\ncount: i64]
ADI64[AvgDistinctInt64\nsum: i64\nseen: FxHashSet]
AF64[AvgFloat64\nsum: f64\ncount: i64]
end
subgraph "MIN/MAX Variants"
MinI64[MinInt64\nvalue: Option-i64-]
MaxI64[MaxInt64\nvalue: Option-i64-]
MinF64[MinFloat64\nvalue: Option-f64-]
MaxF64[MaxFloat64\nvalue: Option-f64-]
end
Each accumulator variant is specialized for its data type and operation semantics. Integer accumulators track overflow using Option<i64> (None indicates overflow), while float accumulators use f64 which never overflows. Distinct variants maintain a FxHashSet of seen values.
sequenceDiagram
participant Executor
participant AggregateSpec
participant AggregateAccumulator
participant RecordBatch
participant OutputArray
Executor->>AggregateSpec: new_with_projection_index()
AggregateSpec->>AggregateAccumulator: Create accumulator
loop For each batch
Executor->>RecordBatch: Stream next batch
RecordBatch->>AggregateAccumulator: update(batch)
Note over AggregateAccumulator: Accumulate values\nCheck overflow\nTrack distinct keys
end
Executor->>AggregateAccumulator: finalize()
AggregateAccumulator->>OutputArray: (Field, ArrayRef)
OutputArray->>Executor: Return result
Accumulator Lifecycle
Sources: llkv-aggregate/src/lib.rs:460-746 llkv-aggregate/src/lib.rs:748-1440 llkv-aggregate/src/lib.rs:1442-1934
The accumulator is initialized with a projection index indicating which column in the RecordBatch to aggregate. The update() method processes each batch incrementally, and finalize() produces the final Arrow array and field schema.
Distinct Value Tracking
DistinctKey Enumeration
The system tracks distinct values using a hash-based approach:
| Variant | Type | Purpose |
|---|---|---|
Int(i64) | Integer values | Exact integer comparison |
Float(u64) | Float bit pattern | Bitwise float equality |
Str(String) | String values | Text comparison |
Bool(bool) | Boolean values | True/false comparison |
Date(i32) | Date32 values | Date comparison |
Decimal(i128) | Decimal raw value | Exact decimal comparison |
Sources: llkv-aggregate/src/lib.rs:249-257 llkv-aggregate/src/lib.rs:259-333
Float values are converted to bit patterns (to_bits()) to enable hash-based deduplication while preserving NaN and infinity semantics. Decimal values use raw i128 representation for exact comparison without scale conversion.
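A small sketch of why bit patterns are used as hash keys for floats:

```rust
// f64 is not Hash/Eq directly, but its to_bits() representation is, and it
// distinguishes values exactly (signed zero stays distinct, identical NaN
// bit patterns deduplicate).
use std::collections::HashSet;

fn main() {
    let mut seen: HashSet<u64> = HashSet::new();
    for v in [1.5_f64, 1.5, f64::NAN, f64::NAN, -0.0, 0.0] {
        seen.insert(v.to_bits());
    }
    // 1.5 deduplicates, the two NaN constants share one bit pattern,
    // and -0.0 differs from 0.0: four distinct keys in total.
    assert_eq!(seen.len(), 4);
}
```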
Distinct Accumulation Example
For COUNT(DISTINCT column), the accumulator inserts each non-null value into the hash set:
Sources: llkv-aggregate/src/lib.rs:785-798 llkv-aggregate/src/lib.rs:1465-1473
graph LR
Batch1[RecordBatch 1\nvalues: 1,2,3]
Batch2[RecordBatch 2\nvalues: 2,3,4]
Batch3[RecordBatch 3\nvalues: 1,4,5]
Batch1 --> HS1[seen: {1,2,3}]
Batch2 --> HS2[seen: {1,2,3,4}]
Batch3 --> HS3[seen: {1,2,3,4,5}]
HS3 --> Result[COUNT: 5]
The hash set automatically deduplicates values across batches. Only the set size is returned as the final count, avoiding materialization of the entire set in the output.
Aggregate Functions
COUNT Family
| Function | Null Handling | Return Type | Overflow |
|---|---|---|---|
COUNT(*) | Counts all rows | Int64 | Checked |
COUNT(column) | Skips NULL values | Int64 | Checked |
COUNT(DISTINCT column) | Skips NULL, deduplicates | Int64 | Checked |
Sources: llkv-aggregate/src/lib.rs:467-485 llkv-aggregate/src/lib.rs:759-783 llkv-aggregate/src/lib.rs:1452-1473
COUNT operations verify that the result fits in i64 range. COUNT(*) accumulates batch row counts directly. COUNT(column) filters invalid (NULL) rows using array.is_valid(i). COUNT(DISTINCT) maintains a hash set and returns its size.
SUM and TOTAL
| Function | Overflow Behavior | Return Type | NULL Result |
|---|---|---|---|
SUM(int_column) | Returns error | Int64 | NULL if no values |
SUM(float_column) | Accumulates infinities | Float64 | NULL if no values |
TOTAL(int_column) | Converts to Float64 | Float64 | 0.0 if no values |
TOTAL(float_column) | Accumulates infinities | Float64 | 0.0 if no values |
Sources: llkv-aggregate/src/lib.rs:486-541 llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975
graph LR
Input[Input Column]
Sum[Accumulate Sum]
Count[Count Non-NULL]
Div[Divide sum/count]
Output[Float64 Result]
Input --> Sum
Input --> Count
Sum --> Div
Count --> Div
Div --> Output
SUM uses checked_add for integers and returns an error on overflow. TOTAL never overflows because it accumulates as Float64 even for integer columns. The key difference is NULL handling: SUM returns NULL for empty input, TOTAL returns 0.0.
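A minimal illustration of this contract (free functions standing in for the real accumulators) contrasts SUM's checked integer accumulation with TOTAL's float accumulation:

```rust
// SUM: checked integer accumulation; overflow becomes an error, empty input is NULL (None).
fn sum_checked(values: &[i64]) -> Result<Option<i64>, &'static str> {
    let mut acc: Option<i64> = None;
    for &v in values {
        acc = Some(match acc {
            None => v,
            Some(s) => s.checked_add(v).ok_or("integer overflow in SUM")?,
        });
    }
    Ok(acc)
}

// TOTAL: accumulates as f64 from the start, so it cannot overflow and yields 0.0 for empty input.
fn total(values: &[i64]) -> f64 {
    values.iter().map(|&v| v as f64).sum()
}

fn main() {
    assert_eq!(sum_checked(&[1, 2, 3]), Ok(Some(6)));
    assert!(sum_checked(&[i64::MAX, 1]).is_err());
    assert_eq!(sum_checked(&[]), Ok(None));
    assert_eq!(total(&[]), 0.0);
}
```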
AVG (Average)
Sources: llkv-aggregate/src/lib.rs:598-654 llkv-aggregate/src/lib.rs:1096-1121 llkv-aggregate/src/lib.rs:1635-1645
AVG maintains separate sum and count accumulators. During finalization, it divides sum / count to produce a Float64 result. Integer sums are converted to Float64 for the division. If count is zero, AVG returns NULL.
MIN and MAX
| Data Type | Comparison Strategy | NULL Handling |
|---|---|---|
| Int64 | i64::min() / i64::max() | Skip NULL values |
| Float64 | partial_cmp() with NaN handling | Skip NULL values |
| Decimal128 | i128::min() / i128::max() on raw values | Skip NULL values |
| String | Numeric coercion via array_value_to_numeric() | Skip NULL values |
Sources: llkv-aggregate/src/lib.rs:656-710 llkv-aggregate/src/lib.rs:1259-1277 llkv-aggregate/src/lib.rs:1279-1300
MIN/MAX start with None and update to Some(value) on the first non-NULL entry. Subsequent values are compared using type-specific logic. Float comparisons use partial_cmp() to handle NaN values correctly.
graph LR
Values["Column Values:\n42, 'hello', 3.14"]
Convert[Convert to Strings:\n'42', 'hello', '3.14']
Join["Join with separator\n(default: ',')"]
Result["Result: '42,hello,3.14'"]
Values --> Convert
Convert --> Join
Join --> Result
GROUP_CONCAT
GROUP_CONCAT concatenates string representations of column values with a separator:
Sources: llkv-aggregate/src/lib.rs:722-744 llkv-aggregate/src/lib.rs:1409-1437 llkv-aggregate/src/lib.rs:1847-1874
The accumulator collects string representations using array_value_to_string() which coerces integers, floats, and booleans to text. DISTINCT variants track seen values in a hash set. Finalization joins the strings with the specified separator (default: ',').
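A minimal sketch of that finalization step (operating on already-stringifiable values rather than Arrow arrays):

```rust
// GROUP_CONCAT-style finalization: skip NULLs, stringify, join with the separator.
fn group_concat(values: &[Option<i64>], sep: &str) -> Option<String> {
    let parts: Vec<String> = values.iter().flatten().map(|v| v.to_string()).collect();
    if parts.is_empty() {
        None // no non-NULL inputs: result is NULL
    } else {
        Some(parts.join(sep))
    }
}

fn main() {
    assert_eq!(
        group_concat(&[Some(42), None, Some(3)], ","),
        Some("42,3".to_string())
    );
}
```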
Group-by Integration
Grouping Key Extraction
For GROUP BY queries, the executor extracts grouping keys from each row:
Sources: llkv-executor/src/lib.rs:1097-1173
sequenceDiagram
participant Executor
participant GroupMap
participant AggregateState
participant Accumulator
loop For each batch
Executor->>Executor: Extract group keys
loop For each group
Executor->>GroupMap: Get or create group
Executor->>AggregateState: Get accumulators for group
Executor->>Accumulator: Filter batch to group rows
Executor->>Accumulator: update(filtered_batch)
end
end
Executor->>GroupMap: Iterate all groups
loop For each group
Executor->>AggregateState: finalize()
AggregateState->>Executor: Return aggregate arrays
end
Each unique combination of group-by column values maps to a separate GroupAggregateState which tracks the representative row and a list of matching row locations across batches.
Aggregate Accumulation per Group
Sources: llkv-executor/src/lib.rs:1174-1383
The executor maintains separate accumulators for each group. When processing a batch, it filters rows by group membership using RowIdFilter and updates each group's accumulators independently. This ensures that SUM(sales) for group 'USA' only accumulates sales records where country='USA'.
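The per-group bookkeeping can be pictured with a plain hash map keyed by the group value, one accumulator per entry (a sketch of the idea, not the executor's GroupAggregateState):

```rust
use std::collections::HashMap;

// One SUM accumulator per group key; each row only touches its own group's state.
fn grouped_sums(rows: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut groups: HashMap<String, i64> = HashMap::new();
    for &(key, value) in rows {
        *groups.entry(key.to_string()).or_insert(0) += value;
    }
    groups
}

fn main() {
    let out = grouped_sums(&[("USA", 1_000_000), ("Canada", 750_000), ("USA", 500_000)]);
    assert_eq!(out["USA"], 1_500_000);
    assert_eq!(out["Canada"], 750_000);
}
```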
Output Construction
After processing all batches, the executor constructs the output RecordBatch:
| Column Type | Source | Construction |
|---|---|---|
| Group-by columns | Representative rows | Gathered from original batches |
| Aggregate columns | Finalized accumulators | Converted to Arrow arrays |
Sources: llkv-executor/src/lib.rs:1384-1467
The system gathers one representative row per group for the group-by columns, then appends the finalized aggregate arrays as additional columns. This produces a result like:
+---------+------------+
| country | SUM(sales) |
+---------+------------+
| USA     | 1500000    |
| Canada  | 750000     |
+---------+------------+
graph LR
StringCol["String Column\n'42', 'hello', '3.14'"]
Parse1["'42' → 42.0"]
Parse2["'hello' → 0.0"]
Parse3["'3.14' → 3.14"]
Sum[SUM: 45.14]
StringCol --> Parse1
StringCol --> Parse2
StringCol --> Parse3
Parse1 --> Sum
Parse2 --> Sum
Parse3 --> Sum
Type System and Coercion
Numeric Coercion
The system performs SQLite-style type coercion for aggregates on string columns:
Sources: llkv-aggregate/src/lib.rs:398-447
The array_value_to_numeric() function attempts to parse strings as floats. Non-numeric strings coerce to 0.0, matching SQLite behavior. This enables SUM(string_column) where some values are numeric.
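A simplified version of that coercion (using Rust's parse rather than SQLite's full prefix-parsing rules) reproduces the behavior shown in the diagram above:

```rust
// Simplified SQLite-style coercion: parse as f64, otherwise treat the value as 0.0.
// (Real SQLite also accepts numeric prefixes like "42abc"; this sketch does not.)
fn coerce_numeric(s: &str) -> f64 {
    s.trim().parse::<f64>().unwrap_or(0.0)
}

fn main() {
    let values = ["42", "hello", "3.14"];
    let sum: f64 = values.iter().map(|s| coerce_numeric(s)).sum();
    assert!((sum - 45.14).abs() < 1e-9);
}
```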
Type-specific Accumulators
| Input Type | SUM Accumulator | AVG Accumulator | MIN/MAX Accumulator |
|---|---|---|---|
| Int64 | SumInt64 (i64 with overflow) | AvgInt64 (sum: i64, count: i64) | MinInt64 / MaxInt64 |
| Float64 | SumFloat64 (f64, never overflows) | AvgFloat64 (sum: f64, count: i64) | MinFloat64 / MaxFloat64 |
| Decimal128 | SumDecimal128 (i128 + precision/scale) | AvgDecimal128 (sum: i128, count: i64) | MinDecimal128 / MaxDecimal128 |
| Utf8 | SumFloat64 (numeric coercion) | AvgFloat64 (numeric coercion) | MinFloat64 (numeric coercion) |
Sources: llkv-aggregate/src/lib.rs:486-710
graph TB
IntValue[Integer Value]
CheckedAdd[checked_add-value-]
Overflow{Overflow?}
ErrorSUM[SUM: Return Error]
ContinueTOTAL[TOTAL: Continue as Float64]
IntValue --> CheckedAdd
CheckedAdd --> Overflow
Overflow -->|Yes + SUM| ErrorSUM
Overflow -->|Yes + TOTAL| ContinueTOTAL
Overflow -->|No| IntValue
Each data type uses a specialized accumulator to preserve precision and overflow semantics. Decimal aggregates maintain precision and scale metadata throughout accumulation.
Overflow Handling
Integer Overflow Strategy
Sources: llkv-aggregate/src/lib.rs:799-824 llkv-aggregate/src/lib.rs:958-975 llkv-aggregate/src/lib.rs:1474-1494
SUM uses checked_add() and sets the accumulator to None on overflow, returning an error during finalization. TOTAL avoids this by accumulating integers as Float64 from the start, trading precision for guaranteed completion.
Decimal Overflow
Decimal128 aggregates use checked_add() on the raw i128 values:
Sources: llkv-aggregate/src/lib.rs:915-932
When Decimal128 overflow occurs, the system returns an error immediately. There is no TOTAL-style fallback for decimals because precision requirements are explicit in the type signature.
graph LR
Projection["Projection:\nSUM(price * quantity)"]
Extract[Extract Aggregate\nFunction Call]
Expression["price * quantity"]
Translate[Translate to\nScalarExpr]
EnsureProj[ensure_computed_projection]
Accumulate[Accumulate via\nAggregateAccumulator]
Projection --> Extract
Extract --> Expression
Expression --> Translate
Translate --> EnsureProj
EnsureProj --> Accumulate
Computed Aggregates
Aggregate Expressions in Projections
The executor handles aggregate function calls embedded in computed projections:
Sources: llkv-executor/src/lib.rs:703-712 llkv-executor/src/lib.rs:735-798 llkv-executor/src/lib.rs:473-505
When a projection contains an aggregate like SUM(price * quantity), the executor:
- Detects the aggregate via expr_contains_aggregate()
- Translates the inner expression (price * quantity) to a ScalarExpr
- Creates a computed projection for the expression
- Initializes an accumulator for the projection index
- Accumulates values from the computed column
This allows complex aggregate expressions beyond simple column references.
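Conceptually, the computed column is just an intermediate per-row value that feeds an ordinary accumulator, as in this simplified sketch:

```rust
// SUM(price * quantity): evaluate the inner expression per row, then accumulate
// the computed column exactly like a plain SUM input.
fn sum_of_products(prices: &[f64], quantities: &[f64]) -> f64 {
    prices
        .iter()
        .zip(quantities)
        .map(|(p, q)| p * q) // the "computed projection" for each row
        .sum()
}

fn main() {
    assert_eq!(sum_of_products(&[2.0, 3.0], &[10.0, 1.0]), 23.0);
}
```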
Performance Considerations
Memory Usage
Each accumulator maintains state proportional to:
| Accumulator Type | Memory Per Group | Notes |
|---|---|---|
| COUNT(*) | 8 bytes (i64) | Constant size |
| SUM/AVG | 16-24 bytes | Value + metadata |
| MIN/MAX | 8-24 bytes | Single value + type info |
| COUNT(DISTINCT) | O(unique values) | Hash set grows with cardinality |
| GROUP_CONCAT | O(total string length) | Vector of strings |
Sources: llkv-aggregate/src/lib.rs:92-247
DISTINCT and GROUP_CONCAT have unbounded memory growth for high-cardinality data. The system does not implement spilling or approximate algorithms for these cases.
Parallelization
Aggregates are accumulated serially within a single thread because:
- Accumulators maintain mutable state that is not thread-safe
- DISTINCT tracking requires synchronized hash set updates
- Sequential batch processing simplifies overflow detection
Future work could introduce parallel accumulation with merge operations for distributive aggregates (SUM, COUNT, MIN, MAX); algebraic aggregates (AVG) and DISTINCT tracking would require additional merge logic.
Sources: llkv-aggregate/src/lib.rs:748-1440
Query Execution
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
Query execution is the process of converting logical query plans into physical result sets by coordinating table scans, expression evaluation, aggregation, joins, and result streaming. This page documents the execution engine's architecture, core components, and high-level execution flow.
For details on table-level planning and execution, see TablePlanner and TableExecutor. For scan optimization strategies, see Scan Execution and Optimization. For predicate evaluation mechanics, see Filter Evaluation.
System Architecture
Query execution spans two primary crates:
| Crate | Responsibility | Key Types |
|---|---|---|
llkv-executor | Orchestrates multi-table queries, aggregates, and result formatting | QueryExecutor, SelectExecution |
llkv-table | Executes table scans with predicates and projections | TablePlanner, TableExecutor |
The executor operates on logical plans produced by llkv-plan and delegates to llkv-table for single-table operations, llkv-join for join algorithms, and llkv-aggregate for aggregate computations.
Execution Architecture
graph TB
subgraph "Plan Layer"
PLAN["SelectPlan\n(llkv-plan)"]
end
subgraph "Execution Orchestration (llkv-executor)"
QE["QueryExecutor<P>"]
EXEC["SelectExecution<P>"]
STRAT["Strategy Selection:\nprojection, aggregate,\njoin, compound"]
end
subgraph "Table Execution (llkv-table)"
TP["TablePlanner"]
TE["TableExecutor"]
PS["PlannedScan"]
end
subgraph "Specialized Operations"
AGG["llkv-aggregate\nAccumulator"]
JOIN["llkv-join\nhash_join, cross_join"]
EVAL["NumericKernels\nscalar evaluation"]
end
subgraph "Storage"
TABLE["Table<P>"]
STORE["ColumnStore"]
end
PLAN --> QE
QE --> STRAT
STRAT -->|single table| TP
STRAT -->|aggregates| AGG
STRAT -->|joins| JOIN
TP --> PS
PS --> TE
TE --> TABLE
TABLE --> STORE
STRAT --> EVAL
EVAL --> TABLE
QE --> EXEC
TE --> EXEC
AGG --> EXEC
JOIN --> EXEC
Sources: llkv-executor/src/lib.rs:507-521 llkv-table/src/planner/mod.rs:580-726
Core Components
QueryExecutor
QueryExecutor<P> is the top-level execution coordinator in llkv-executor. It consumes SelectPlan structures and produces SelectExecution result containers.
Key Responsibilities:
- Strategy selection based on plan characteristics (single table, joins, aggregates, compound operations)
- Multi-table query orchestration (cross products, hash joins)
- Aggregate computation coordination
- Subquery evaluation (correlated EXISTS, scalar subqueries)
- Result streaming and batching
Entry points:
- execute_select(plan: SelectPlan) - Execute a SELECT plan llkv-executor/src/lib.rs:523-525
- execute_select_with_filter(plan, row_filter) - Execute with an optional MVCC filter llkv-executor/src/lib.rs:527-569
Sources: llkv-executor/src/lib.rs:507-521
SelectExecution
SelectExecution<P> encapsulates query results and provides streaming access via batched iteration. Results may be materialized upfront or generated lazily depending on the execution strategy.
Streaming Interface:
- stream<F>(on_batch: F) - Process results batch-by-batch
- into_rows() - Materialize all rows into memory (for sorting, deduplication)
- schema() - Access the result schema
Sources: llkv-executor/src/lib.rs:2500-2700 (approximate location based on file structure)
TablePlanner and TableExecutor
The table-level execution layer handles single-table scans with predicates and projections. TablePlanner analyzes the request and produces a PlannedScan, which TableExecutor then executes.
Planning Process:
- Validate projections against schema
- Normalize filter predicates (apply De Morgan's laws, flatten boolean operators)
- Compile predicates into EvalProgram and DomainProgram bytecode
- Build PlanGraph metadata for tracing
Execution Process:
- Try optimized fast paths (single column scans, full table scans)
- Fall back to general execution with expression evaluation
- Stream results in batches
These components are detailed in TablePlanner and TableExecutor.
Sources: llkv-table/src/planner/mod.rs:580-637 llkv-table/src/planner/mod.rs:728-1007
Execution Flow
Top-Level SELECT Execution Sequence
Sources: llkv-executor/src/lib.rs:523-569 llkv-table/src/planner/mod.rs:595-607 llkv-table/src/planner/mod.rs:1009-1400
graph TD
START["SelectPlan"] --> COMPOUND{compound?}
COMPOUND -->|yes| EXEC_COMPOUND["execute_compound_select\nUNION/EXCEPT/INTERSECT"]
COMPOUND -->|no| FROM{tables.is_empty?}
FROM -->|yes| EXEC_CONST["execute_select_without_table\nEvaluate constant expressions"]
FROM -->|no| MULTI{tables.len > 1?}
MULTI -->|yes| EXEC_CROSS["execute_cross_product\nor hash_join optimization"]
MULTI -->|no| GROUPBY{group_by.is_empty?}
GROUPBY -->|no| EXEC_GROUP["execute_group_by_single_table\nGroup rows + compute aggregates"]
GROUPBY -->|yes| AGG{aggregates.is_empty?}
AGG -->|no| EXEC_AGG["execute_aggregates\nCollect all rows + compute"]
AGG -->|yes| COMPUTED{has_computed_aggregates?}
COMPUTED -->|yes| EXEC_COMP_AGG["execute_computed_aggregates\nEmbedded agg in expressions"]
COMPUTED -->|no| EXEC_PROJ["execute_projection\nStream scan with projections"]
EXEC_COMPOUND --> RESULT["SelectExecution"]
EXEC_CONST --> RESULT
EXEC_CROSS --> RESULT
EXEC_GROUP --> RESULT
EXEC_AGG --> RESULT
EXEC_COMP_AGG --> RESULT
EXEC_PROJ --> RESULT
Execution Strategies
QueryExecutor selects an execution strategy based on plan characteristics:
Strategy Decision Tree
Sources: llkv-executor/src/lib.rs:527-569
Strategy Implementations
| Strategy | Method | When Applied | Key Operations |
|---|---|---|---|
| Constant Evaluation | execute_select_without_table | No FROM clause | Evaluate literals, struct constructors |
| Simple Projection | execute_projection | Single table, no aggregates | Stream scan with filter + projections |
| Aggregation | execute_aggregates | Has aggregates, no GROUP BY | Collect all rows, compute aggregates, emit single row |
| Grouped Aggregation | execute_group_by_single_table | Has GROUP BY | Hash rows by key, compute per-group aggregates |
| Computed Aggregates | execute_computed_aggregates | Aggregates embedded in computed projections | Extract aggregate expressions, evaluate separately |
| Cross Product | execute_cross_product | Multiple tables | Cartesian product or hash join optimization |
| Compound | execute_compound_select | UNION/EXCEPT/INTERSECT | Execute components, apply set operations |
Sources: llkv-executor/src/lib.rs:926-975 llkv-executor/src/lib.rs:1700-2100 llkv-executor/src/lib.rs:2200-2400 llkv-executor/src/lib.rs:1057-1400 llkv-executor/src/lib.rs:590-701
Streaming Execution Model
LLKV executes queries in a streaming fashion to avoid materializing large intermediate results. Results flow through the system as RecordBatch chunks (typically 4096 rows).
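A push-style consumer of that stream might look like the following sketch (assuming the arrow crate; the actual SelectExecution::stream signature may differ):

```rust
use arrow::record_batch::RecordBatch;

// Deliver one RecordBatch at a time to a callback, mirroring a stream(on_batch)
// style interface; memory stays proportional to a single batch.
fn stream_batches<F: FnMut(&RecordBatch)>(batches: &[RecordBatch], mut on_batch: F) {
    for batch in batches {
        on_batch(batch);
    }
}

fn count_rows(batches: &[RecordBatch]) -> usize {
    let mut total = 0;
    stream_batches(batches, |batch| total += batch.num_rows());
    total
}
```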
Streaming Characteristics:
| Execution Type | Streaming Behavior | Memory Characteristics |
|---|---|---|
| Projection | Full streaming | O(batch_size) memory |
| Filter | Full streaming | O(batch_size) memory |
| Aggregates | Requires full materialization | O(input_rows) memory |
| GROUP BY | Requires full materialization | O(group_count) memory |
| ORDER BY | Requires full materialization | O(input_rows) memory |
| DISTINCT | Requires full materialization | O(distinct_rows) memory |
| LIMIT | Early termination | O(limit × batch_size) memory |
Streaming Projection Example Flow:
Sources: llkv-table/src/planner/mod.rs:1009-1400 llkv-table/src/constants.rs:1-10 (defines STREAM_BATCH_ROWS = 4096)
Materialization Points
Certain operations require collecting all rows before producing output:
- Sorting - Must see all rows to determine order llkv-executor/src/lib.rs:2800-2900
- Deduplication (DISTINCT) - Must track all seen rows llkv-executor/src/lib.rs:2950-3050
- Aggregation - Must accumulate state across all rows llkv-executor/src/lib.rs:1700-1900
- Set Operations - Must materialize both sides for comparison llkv-executor/src/lib.rs:590-701
These operations call into_rows() on SelectExecution to materialize results as Vec<CanonicalRow>.
Sources: llkv-executor/src/lib.rs:2600-2700
Expression Evaluation
Query execution evaluates two types of expressions:
Predicate Evaluation (Filtering)
Predicates are compiled to bytecode and evaluated during table scans:
- Normalization - Apply De Morgan's laws, flatten AND/OR llkv-table/src/planner/program.rs:50-150
- Compilation - Convert to EvalProgram (stack-based) and DomainProgram (row tracking) llkv-table/src/planner/program.rs:200-400
- Vectorized Evaluation - Process chunks of rows efficiently llkv-table/src/planner/mod.rs:1100-1300
See Filter Evaluation for detailed mechanics.
Scalar Expression Evaluation (Projections)
Computed projections are evaluated row-by-row or vectorized when possible:
- Translation - Convert ScalarExpr<String> to ScalarExpr<FieldId> llkv-executor/src/translation/scalar.rs:1-200
- Type Inference - Determine the output data type llkv-executor/src/translation/schema.rs:50-150
- Evaluation - Use NumericKernels for numeric operations llkv-table/src/scalar_eval.rs:450-685
Vectorized vs Row-by-Row:
Sources: llkv-table/src/scalar_eval.rs:675-712 llkv-table/src/scalar_eval.rs:549-673
Integration with Runtime
The execution layer coordinates with llkv-runtime for transaction and catalog management:
Runtime Integration Points:
| Operation | Runtime Responsibility | Executor Responsibility |
|---|---|---|
| Table Lookup | CatalogManager::table() | ExecutorTableProvider::get_table() |
| MVCC Filtering | Provide RowIdFilter with snapshot | Apply filter during scan |
| Transaction State | Track transaction ID, commit watermark | Include created_by/deleted_by in scans |
| Schema Resolution | Maintain system catalog | Translate column names to FieldId |
The ExecutorTableProvider trait abstracts runtime integration, allowing executor to remain runtime-agnostic.
Sources: llkv-executor/src/types.rs:100-200 llkv-runtime/src/catalog/mod.rs:50-150
Performance Characteristics
Execution performance depends on query characteristics and chosen strategy:
| Query Pattern | Typical Performance | Optimization Opportunities |
|---|---|---|
| SELECT * FROM t | ~1M rows/sec | Fast path: shadow column scan llkv-table/src/planner/mod.rs:765-821 |
| SELECT col FROM t WHERE pred | ~500K rows/sec | Predicate fusion llkv-table/src/planner/mod.rs:518-570 |
| Single-table aggregates | Full table scan | Column-only projections for aggregate inputs |
| Hash join (2 tables) | O(n + m) with O(n) memory | Smaller table as build side llkv-executor/src/lib.rs:1500-1700 |
| Cross product (n tables) | O(∏ row_counts) | Avoid if possible; rewrite to joins |
Sources: llkv-table/src/planner/mod.rs:738-856 llkv-executor/src/lib.rs:1082-1400
TablePlanner and TableExecutor
Relevant source files
This document describes the table-level query planning and execution system in LLKV. The TablePlanner translates scan operations into optimized execution plans, while the TableExecutor implements multiple execution strategies to materialize query results efficiently. For information about the broader query execution pipeline, see Query Execution. For details on expression compilation and evaluation, see Program Compilation and Scalar Evaluation and NumericKernels.
Purpose and Scope
The TablePlanner and TableExecutor form the core of LLKV's table-level query execution. They bridge the gap between logical query plans (from llkv-plan) and physical data access (through llkv-column-map). This document covers:
- How scan operations are planned and optimized
- The structure of compiled execution plans (PlannedScan)
- Multiple execution strategies and their trade-offs
- Predicate fusion optimization
- Integration with MVCC row filtering
- Streaming result materialization
Architecture Overview
Sources: llkv-table/src/planner/mod.rs:580-726
TablePlanner
The TablePlanner is responsible for analyzing scan requests and constructing optimized execution plans. It does not execute queries itself but prepares all necessary metadata for the TableExecutor.
Structure
The planner holds a reference to the Table being scanned and provides a single public method: scan_stream_with_exprs.
Sources: llkv-table/src/planner/mod.rs:580-593
Planning Flow
The planning process consists of several stages:
- Validation : Ensures at least one projection is specified
- Normalization : Applies De Morgan's laws and flattens logical operators via normalize_predicate
- Graph Construction : Builds a PlanGraph for visualization and introspection
- Program Compilation : Compiles filter expressions into EvalProgram and DomainProgram bytecode
Sources: llkv-table/src/planner/mod.rs:595-637
PlanGraph Construction
The build_plan_graph method creates a directed acyclic graph (DAG) representing the logical query plan:
| Node Type | Purpose | Metadata |
|---|---|---|
TableScan | Entry point for data access | table_id, projection_count |
Filter | Predicate application | predicates (formatted expressions) |
Project | Column selection and computation | projections, fields with types |
Output | Result materialization | include_nulls flag |
Sources: llkv-table/src/planner/mod.rs:639-725
PlannedScan Structure
The PlannedScan is an intermediate representation that bridges planning and execution. It contains all information needed to execute a scan without holding runtime state.
| Field | Type | Purpose |
|---|---|---|
projections | Vec<ScanProjection> | Column and computed projections to materialize |
filter_expr | Arc<Expr<FieldId>> | Normalized filter predicate |
options | ScanStreamOptions<P> | MVCC filters, ordering, null handling |
plan_graph | PlanGraph | Logical plan for introspection |
programs | ProgramSet | Compiled bytecode for predicate evaluation |
Sources: llkv-table/src/planner/mod.rs:500-509
TableExecutor
The TableExecutor implements multiple execution strategies, selecting the most efficient based on query characteristics.
Structure
The executor maintains a cache of row IDs to avoid redundant scans when multiple operations target the same table.
Sources: llkv-table/src/planner/mod.rs:572-578
Execution Strategy Selection
Sources: llkv-table/src/planner/mod.rs:1009-1367
Single Column Direct Scan Fast Path
The try_single_column_direct_scan optimization applies when:
- Exactly one projection is requested
- include_nulls is false
- Filter is either trivial or a full-range predicate on the projected column
- Column type is not Utf8 or LargeUtf8 (to avoid string complexity)
graph TD
CHECK1{projections.len == 1?} -->|No| FALLBACK1[Fallback]
CHECK1 -->|Yes| CHECK2{include_nulls == false?}
CHECK2 -->|No| FALLBACK2[Fallback]
CHECK2 -->|Yes| PROJ_TYPE{Projection type?}
PROJ_TYPE -->|Column| CHECK_FILTER["is_full_range_filter?"]
PROJ_TYPE -->|Computed| CHECK_COMPUTED[Single field?]
CHECK_FILTER -->|No| FALLBACK3[Fallback]
CHECK_FILTER -->|Yes| CHECK_DTYPE{dtype?}
CHECK_DTYPE -->|Utf8/LargeUtf8| FALLBACK4[Fallback]
CHECK_DTYPE -->|Other| DIRECT_SCAN[SingleColumnStreamVisitor]
CHECK_COMPUTED -->|No| FALLBACK5[Fallback]
CHECK_COMPUTED -->|Yes| COMPUTE_TYPE{Passthrough or Affine?}
COMPUTE_TYPE -->|Passthrough| DIRECT_SCAN2[SingleColumnStreamVisitor]
COMPUTE_TYPE -->|Affine| AFFINE_SCAN[AffineSingleColumnVisitor]
COMPUTE_TYPE -->|Other| COMPUTE_SCAN[ComputedSingleColumnVisitor]
DIRECT_SCAN --> HANDLED[StreamOutcome::Handled]
DIRECT_SCAN2 --> HANDLED
AFFINE_SCAN --> HANDLED
COMPUTE_SCAN --> HANDLED
This path streams data directly from storage using ScanBuilder without building intermediate row ID lists or using RowStreamBuilder.
Sources: llkv-table/src/planner/mod.rs:1369-1530
Full Table Scan Streaming Fast Path
The try_stream_full_table_scan optimization applies when:
- Filter is trivial (no predicates)
- No ordering is required (options.order.is_none())
graph TD
CHECK_ORDER{order.is_some?} -->|Yes| FALLBACK[Fallback]
CHECK_ORDER -->|No| STREAM_START[stream_table_row_ids]
STREAM_START --> SHADOW{Shadow column exists?}
SHADOW -->|Yes| CHUNK[Emit row_id chunks]
SHADOW -->|No| FALLBACK2[Multi-column scan fallback]
CHUNK --> MVCC_FILTER{row_id_filter?}
MVCC_FILTER -->|Yes| APPLY[filter.filter]
MVCC_FILTER -->|No| BUILD
APPLY --> BUILD
BUILD[RowStreamBuilder] --> GATHER[Gather columns]
GATHER --> EMIT[Emit RecordBatch]
EMIT --> MORE{More chunks?}
MORE -->|Yes| CHUNK
MORE -->|No| CHECK_EMPTY
CHECK_EMPTY{any_emitted?} -->|No| SYNTHETIC[emit_synthetic_null_batch]
CHECK_EMPTY -->|Yes| HANDLED[StreamOutcome::Handled]
SYNTHETIC --> HANDLED
This path uses stream_table_row_ids to enumerate row IDs in chunks directly from the row_id shadow column, avoiding full predicate evaluation.
This optimization is particularly effective for queries like SELECT col1, col2 FROM table with no WHERE clause.
Sources: llkv-table/src/planner/mod.rs:905-999
General Execution Path
When fast paths don't apply, the executor follows a multi-stage process:
Stage 1: Projection Metadata Construction
The executor builds several data structures:
| Structure | Purpose |
|---|---|
projection_evals | Vec<ProjectionEval> mapping projections to evaluation strategies |
unique_lfids | Vec<LogicalFieldId> of columns to fetch from storage |
unique_index | FxHashMap<LogicalFieldId, usize> for column position lookup |
numeric_fields | FxHashSet<FieldId> of columns needing numeric coercion |
passthrough_fields | Vec<Option<FieldId>> for identity computed projections |
Sources: llkv-table/src/planner/mod.rs:1033-1117
Stage 2: Row ID Collection
For trivial filters, the executor scans the MVCC created_by column to enumerate all rows (including those with NULL values in user columns). For non-trivial filters, it evaluates the compiled ProgramSet.
Sources: llkv-table/src/planner/mod.rs:1269-1327
Stage 3: Streaming Execution
The RowStreamBuilder materializes results in batches of STREAM_BATCH_ROWS (default: 1024). For each batch:
- Gather : Fetch column data for row IDs via MultiGatherContext
- Evaluate : Compute any ProjectionEval::Computed expressions
- Materialize : Construct a RecordBatch with the final schema
- Emit : Call the user-provided callback
Sources: llkv-table/src/planner/mod.rs:1337-1365
Program Compilation
The ProgramCompiler translates filter expressions into stack-based bytecode for efficient evaluation. It produces two programs:
- EvalProgram : Evaluates predicates and returns matching row IDs
- DomainProgram : Computes the domain (all potentially relevant row IDs) for NOT operations
ProgramSet Structure
Bytecode Operations
| Opcode | Stack Effect | Purpose |
|---|---|---|
PushPredicate(filter) | [] → [rows] | Evaluate single predicate |
PushCompare{left, op, right} | [] → [rows] | Evaluate comparison expression |
PushInList{expr, list, negated} | [] → [rows] | Evaluate IN list |
PushIsNull{expr, negated} | [] → [rows] | Evaluate IS NULL |
PushLiteral(bool) | [] → [rows] | Push all rows (true) or empty (false) |
FusedAnd{field_id, filters} | [] → [rows] | Apply multiple predicates on same field |
And{child_count} | [r1, r2, ...] → [r] | Intersect N row ID sets |
Or{child_count} | [r1, r2, ...] → [r] | Union N row ID sets |
Not{domain} | [matched] → [domain - matched] | Set difference using domain program |
Sources: llkv-table/src/planner/program.rs:1-200 (referenced but not in provided files)
Execution Example
Consider the filter: WHERE (age > 18 AND age < 65) OR name = 'Alice'
After normalization and compilation:
Stack Operations:
1. PushCompare(age > 18) → [rows1]
2. PushCompare(age < 65) → [rows1, rows2]
3. And{2} → [rows1 ∩ rows2]
4. PushPredicate(name = 'Alice') → [rows1 ∩ rows2, rows3]
5. Or{2} → [(rows1 ∩ rows2) ∪ rows3]
The collect_row_ids_for_program method executes this bytecode by maintaining a stack of row ID vectors and applying set operations.
Sources: llkv-table/src/planner/mod.rs:2376-2502
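A toy version of that stack machine over row-ID sets (simplified opcodes, not the crate's actual EvalProgram) makes the mechanics concrete:

```rust
use std::collections::BTreeSet;

// Simplified opcodes: Push stands in for PushPredicate / PushCompare results.
enum Op {
    Push(BTreeSet<u64>),
    And(usize), // intersect the top N row-ID sets
    Or(usize),  // union the top N row-ID sets
}

fn run(ops: &[Op]) -> BTreeSet<u64> {
    let mut stack: Vec<BTreeSet<u64>> = Vec::new();
    for op in ops {
        match op {
            Op::Push(rows) => stack.push(rows.clone()),
            Op::And(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.intersection(&next).copied().collect();
                }
                stack.push(acc);
            }
            Op::Or(n) => {
                let mut acc = stack.pop().unwrap();
                for _ in 1..*n {
                    let next = stack.pop().unwrap();
                    acc = acc.union(&next).copied().collect();
                }
                stack.push(acc);
            }
        }
    }
    stack.pop().unwrap()
}

fn main() {
    let age_gt_18: BTreeSet<u64> = [1, 2, 3, 4].into();
    let age_lt_65: BTreeSet<u64> = [3, 4, 5].into();
    let name_alice: BTreeSet<u64> = [9].into();
    let result = run(&[
        Op::Push(age_gt_18),
        Op::Push(age_lt_65),
        Op::And(2),
        Op::Push(name_alice),
        Op::Or(2),
    ]);
    let expected: BTreeSet<u64> = [3, 4, 9].into();
    assert_eq!(result, expected);
}
```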
graph TD
ANALYZE[Analyze filter expression] --> BUILD["Build per_field stats"]
BUILD --> STATS["FieldPredicateStats:\ntotal, contains"]
STATS --> CHECK{should_fuse?}
CHECK --> DTYPE{Data type?}
DTYPE -->|Utf8/LargeUtf8| STRING_RULE["contains ≥ 1 AND total ≥ 2"]
DTYPE -->|Other| NUMERIC_RULE["total ≥ 2"]
STRING_RULE -->|Yes| FUSE[Generate FusedAnd opcode]
NUMERIC_RULE -->|Yes| FUSE
STRING_RULE -->|No| SEPARATE[Evaluate predicates separately]
NUMERIC_RULE -->|No| SEPARATE
Predicate Fusion
The PredicateFusionCache analyzes filter expressions to identify opportunities for fused predicate evaluation.
Fusion Strategy
Predicate fusion is particularly beneficial for string columns with CONTAINS operations, where multiple predicates on the same field can be evaluated in a single storage scan.
Example: WHERE name LIKE '%Smith%' AND name LIKE '%John%' can be fused into a single scan with two pattern matchers.
Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/mod.rs:2504-2580
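A rough sketch of the fused form: one pass over the column checks every pattern for the field, instead of re-scanning per predicate:

```rust
// Fused CONTAINS evaluation: a single scan applies all patterns for the field
// and returns the positions of matching rows.
fn fused_contains(values: &[&str], patterns: &[&str]) -> Vec<usize> {
    values
        .iter()
        .enumerate()
        .filter(|(_, v)| patterns.iter().all(|p| v.contains(*p)))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let names = ["John Smith", "Jane Doe", "Johnny Smithson"];
    assert_eq!(fused_contains(&names, &["Smith", "John"]), vec![0, 2]);
}
```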
Projection Evaluation
The ProjectionEval enum handles two types of projections:
Column Projections
Direct column references that can be gathered from storage without computation.
Computed Projections
Expressions evaluated per-row using NumericKernels. The executor optimizes several patterns:
| Pattern | Optimization |
|---|---|
column | Passthrough (no computation) |
a * column + b | Affine transformation (vectorized) |
| General expression | Full expression evaluation |
Sources: llkv-table/src/planner/mod.rs:482-498 llkv-table/src/planner/mod.rs:1110-1117
Row ID Collection Strategies
The executor uses different strategies for collecting matching row IDs based on predicate characteristics:
Simple Predicates
For predicates on a single field (e.g., age > 18):
Sources: llkv-table/src/planner/mod.rs:1532-1612
Comparison Expressions
For comparisons involving multiple fields (e.g., col1 + col2 > 10):
This approach minimizes wasted computation by first identifying a "domain" of potentially matching rows (intersection of rows where all referenced columns have values) before evaluating the full expression.
Sources: llkv-table/src/planner/mod.rs:1699-1775 llkv-table/src/planner/mod.rs:2214-2374
IN List Expressions
For IN list predicates (e.g., status IN ('active', 'pending')):
The IN list evaluator properly handles SQL's three-valued logic:
- If the value matches any list element: TRUE (or FALSE if negated)
- If no match but the list contains NULL: NULL (indeterminate)
- If no match and no NULLs: FALSE (or TRUE if negated)
Sources: llkv-table/src/planner/mod.rs:2001-2044 llkv-table/src/planner/mod.rs:1841-1999
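These rules can be expressed directly with Option<bool> standing in for SQL's three-valued result (a sketch, with None modeling NULL for both the probe value and the outcome):

```rust
// Three-valued IN-list evaluation: Some(true)/Some(false) are definite results,
// None is SQL NULL (indeterminate).
fn in_list(value: Option<i64>, list: &[Option<i64>], negated: bool) -> Option<bool> {
    let v = value?; // a NULL probe value yields NULL
    let mut saw_null = false;
    for item in list {
        match item {
            Some(x) if *x == v => return Some(!negated),
            None => saw_null = true,
            _ => {}
        }
    }
    if saw_null {
        None
    } else {
        Some(negated)
    }
}

fn main() {
    assert_eq!(in_list(Some(1), &[Some(1), Some(2)], false), Some(true));
    assert_eq!(in_list(Some(3), &[Some(1), None], false), None);
    assert_eq!(in_list(Some(3), &[Some(1), Some(2)], false), Some(false));
    assert_eq!(in_list(Some(3), &[Some(1), Some(2)], true), Some(true));
}
```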
graph TD
COLLECT[Collect row IDs from predicates] --> FILTER{row_id_filter.is_some?}
FILTER -->|No| CONTINUE[Continue to streaming]
FILTER -->|Yes| APPLY["filter.filter(table, row_ids)"]
APPLY --> CHECK_VISIBILITY[Check MVCC columns]
CHECK_VISIBILITY --> CREATED["created_by ≤ txn_id"]
CHECK_VISIBILITY --> DELETED["deleted_by > txn_id OR NULL"]
CREATED --> VISIBLE{Visible?}
DELETED --> VISIBLE
VISIBLE -->|Yes| KEEP[Keep row ID]
VISIBLE -->|No| DROP[Drop row ID]
KEEP --> FILTERED[Filtered row IDs]
DROP --> FILTERED
FILTERED --> CONTINUE
MVCC Integration
The executor integrates with LLKV's MVCC system through the row_id_filter option in ScanStreamOptions. After collecting row IDs through predicate evaluation, the filter determines which rows are visible to the current transaction:
For trivial filters, the executor explicitly scans the created_by MVCC column to enumerate all rows, ensuring that rows with NULL values in user columns are included when appropriate.
Sources: llkv-table/src/planner/mod.rs:1269-1323
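The visibility predicate itself is small; a hedged sketch of the rule encoded by the created_by and deleted_by columns (real snapshot checks also account for concurrent, uncommitted transactions):

```rust
// A row is visible to a snapshot if it was created at or before the snapshot's
// transaction and has not been deleted by a transaction the snapshot can see.
struct RowVersion {
    created_by: u64,
    deleted_by: Option<u64>,
}

fn is_visible(row: &RowVersion, snapshot_txn: u64) -> bool {
    row.created_by <= snapshot_txn && row.deleted_by.map_or(true, |d| d > snapshot_txn)
}

fn main() {
    assert!(is_visible(&RowVersion { created_by: 5, deleted_by: None }, 10));
    assert!(!is_visible(&RowVersion { created_by: 5, deleted_by: Some(7) }, 10));
    assert!(is_visible(&RowVersion { created_by: 5, deleted_by: Some(20) }, 10));
    assert!(!is_visible(&RowVersion { created_by: 15, deleted_by: None }, 10));
}
```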
Performance Characteristics
The table below summarizes the time complexity of different execution paths:
| Execution Path | Conditions | Row ID Collection | Data Gathering | Total |
|---|---|---|---|---|
| Single Column Direct | 1 projection, trivial/full-range filter | O(1) | O(n) streaming | O(n) |
| Full Table Stream | Trivial filter, no order | O(n) via shadow column | O(n) streaming | O(n) |
| General (indexed predicate) | Single-field predicate with index | O(log n + m) | O(m × c) | O(log n + m × c) |
| General (complex predicate) | Multi-field or computed predicate | O(n × p) | O(m × c) | O(n × p + m × c) |
Where:
- n = total rows in table
- m = matching rows after filtering
- c = number of columns in projection
- p = complexity of predicate (number of fields involved)
The executor automatically selects the most efficient path based on query characteristics, with no manual tuning required.
Sources: llkv-table/src/planner/mod.rs:1009-1530 llkv-table/src/planner/mod.rs:905-999
graph TB
subgraph "External Input"
PLAN[llkv-plan SelectPlan]
EXPR[llkv-expr Expr]
end
subgraph "Table Layer"
TP[TablePlanner]
TE[TableExecutor]
PLANNED[PlannedScan]
TP --> PLANNED
PLANNED --> TE
end
subgraph "Storage Layer"
STORE[ColumnStore]
SCAN[ScanBuilder]
GATHER[MultiGatherContext]
end
subgraph "Expression Evaluation"
NORM[normalize_predicate]
COMPILER[ProgramCompiler]
KERNELS[NumericKernels]
end
subgraph "Output"
STREAM[RowStreamBuilder]
BATCH[RecordBatch]
end
PLAN --> TP
EXPR --> TP
TP --> NORM
NORM --> COMPILER
TE --> SCAN
TE --> GATHER
TE --> KERNELS
TE --> STREAM
SCAN --> STORE
GATHER --> STORE
STREAM --> BATCH
Integration Points
The TablePlanner and TableExecutor integrate with several other LLKV subsystems:
Sources: llkv-table/src/planner/mod.rs:1-76
This architecture enables LLKV to efficiently execute table scans with complex predicates and projections while maintaining clean separation between logical planning and physical execution. The multiple optimization paths ensure that simple queries execute quickly while complex queries remain correct.
Scan Execution and Optimization
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
- llkv-table/src/planner/mod.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page documents the table scan execution engine in the llkv-table crate, which implements the low-level scanning and streaming of data from the column store to higher layers. It covers planning, optimization paths, predicate compilation, and expression evaluation strategies that enable efficient data retrieval.
For information about higher-level query planning and the translation of SQL plans to table operations, see TablePlanner and TableExecutor. For details on how filters are evaluated against individual rows, see Filter Evaluation.
Architecture Overview
The scan execution system is split into two primary components:
| Component | Responsibility |
|---|---|
TablePlanner | Analyzes scan requests, builds plan graphs, compiles predicates into bytecode programs |
TableExecutor | Executes planned scans using optimization paths, coordinates streaming results |
Scan Execution Flow
Sources: llkv-table/src/planner/mod.rs:591-637
Planning Phase
Plan Construction
The TablePlanner::plan_scan method orchestrates plan construction by:
- Validating projections are non-empty
- Normalizing the filter predicate
- Building a plan graph for visualization and analysis
- Compiling predicates into executable programs
graph LR
Input["scan_stream_with_exprs\n(projections, filter, options)"]
Validate["Validate projections"]
Normalize["normalize_predicate\n(flatten And/Or,\napply De Morgan's)"]
Graph["build_plan_graph\n(TableScan → Filter →\nProject → Output)"]
Compile["ProgramCompiler\n(EvalProgram +\nDomainProgram)"]
Output["PlannedScan"]
Input --> Validate
Validate --> Normalize
Validate --> Graph
Normalize --> Compile
Compile --> Output
Graph --> Output
The plan graph encodes the logical operator tree for diagnostic purposes, with nodes representing TableScan, Filter, Project, and Output operators.
Sources: llkv-table/src/planner/mod.rs:612-637 llkv-table/src/planner/mod.rs:639-725
Predicate Compilation
Predicates are compiled into two bytecode programs:
| Program Type | Purpose |
|---|---|
EvalProgram | Stack-based bytecode for evaluating filter conditions |
DomainProgram | Tracks which row IDs satisfy the predicate during scanning |
The ProgramCompiler analyzes the normalized predicate tree and emits instructions for efficient evaluation. Predicate fusion merges multiple predicates on the same field when beneficial.
graph TB
Filter["Expr<FieldId>\n(normalized predicate)"]
Fusion["PredicateFusionCache\n(analyze per-field stats)"]
Compiler["ProgramCompiler::compile"]
EvalProg["EvalProgram\n(stack-based bytecode)"]
DomainProg["DomainProgram\n(row ID tracking)"]
ProgramSet["ProgramSet\n(contains both programs)"]
Filter --> Fusion
Filter --> Compiler
Fusion --> Compiler
Compiler --> EvalProg
Compiler --> DomainProg
EvalProg --> ProgramSet
DomainProg --> ProgramSet
Sources: llkv-table/src/planner/mod.rs:625-629 llkv-table/src/planner/program.rs
Execution Phase and Optimization Paths
The TableExecutor::execute method attempts multiple optimization paths before falling back to the general scan:
Fast Path: Single Column Direct Scan
When the scan requests a single column with a simple predicate, the executor uses try_single_column_direct_scan to stream data directly from the column store without materializing row IDs.
Conditions for single column fast path:
- Exactly one column projection
- No computed projections
- Simple predicate structure (optional)
- Compatible data types
This path bypasses row ID collection and gather operations, streaming column chunks directly to the caller.
Sources: llkv-table/src/planner/mod.rs:1021-1031 llkv-table/src/planner/mod.rs:1157-1343
Fast Path: Full Table Scan Streaming
For queries without ordering requirements, try_stream_full_table_scan uses incremental row ID streaming to avoid buffering all row IDs in memory:
graph TB
Start["try_stream_full_table_scan"]
CheckOrder["Check options.order\n(must be None)"]
StreamRIDs["stream_table_row_ids\n(chunk_size batches)"]
Shadow["Try shadow column\nscan (fast)"]
Fallback["Multi-column scan\n(fallback)"]
ProcessChunk["Process chunk:\n1. Apply row_id_filter\n2. Build RowStream\n3. Emit batches"]
Batch["RecordBatch\n(via on_batch)"]
Start --> CheckOrder
CheckOrder -->|order is Some| Return["Return Fallback"]
CheckOrder -->|order is None| StreamRIDs
StreamRIDs --> Shadow
Shadow -->|Success| ProcessChunk
Shadow -->|Not found| Fallback
Fallback --> ProcessChunk
ProcessChunk --> Batch
Batch -->|Next chunk| StreamRIDs
Sources: llkv-table/src/planner/mod.rs:904-999 llkv-table/src/planner/mod.rs:859-902
Row ID Collection Optimization
Row ID collection uses a two-tier strategy:
- Fast path (shadow column) : Scan the dedicated row_id shadow column, which contains all row IDs for the table
- Fallback (multi-column scan) : Scan user columns and deduplicate row IDs when the shadow column is unavailable
The shadow column optimization is significantly faster because:
- Single column scan instead of multiple
- No deduplication required
- Direct row ID format
Sources: llkv-table/src/planner/mod.rs:748-857
General Scan Execution
When fast paths don't apply, the executor uses the general scan path:
- Projection analysis : Classify projections as column references or computed expressions
- Field collection : Build unique field lists and numeric field maps
- Row ID collection : Gather all relevant row IDs (using optimizations above)
- Row ID filtering : Apply predicate programs to filter row IDs
- Gather and stream : Use RowStreamBuilder to materialize columns and emit batches
General Scan Pipeline
graph TB
Execute["TableExecutor::execute"]
Analyze["Analyze projections:\n- Column refs\n- Computed exprs\n- Build unique_lfids"]
CollectRIDs["table_row_ids\n(with caching)"]
FilterRIDs["Filter row IDs:\n- collect_row_ids_for_rowid_filter\n- Apply EvalProgram\n- Apply DomainProgram"]
Order["Apply ordering\n(if options.order present)"]
Gather["RowStreamBuilder:\n- prepare_gather_context\n- stream chunks\n- evaluate computed exprs"]
Emit["Emit RecordBatch"]
Execute --> Analyze
Analyze --> CollectRIDs
CollectRIDs --> FilterRIDs
FilterRIDs --> Order
Order --> Gather
Gather --> Emit
Sources: llkv-table/src/planner/mod.rs:1009-1343 llkv-table/src/planner/mod.rs:1345-1710
Predicate Optimization
Normalization
The normalize_predicate function applies logical transformations to simplify filter expressions:
- De Morgan's laws : NOT (a AND b) → (NOT a) OR (NOT b)
- Flatten nested operators : AND[AND[a,b],c] → AND[a,b,c]
- Constant folding : AND[true, x] → x
- Double negation elimination : NOT (NOT x) → x
These transformations expose optimization opportunities and simplify compilation.
Sources: llkv-table/src/planner/program.rs
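The following toy normalizer (operating on a simplified expression tree, not the llkv-expr types, and covering only the negation-related rules) shows how double negations and De Morgan's rewrites interact:

```rust
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Pred(&'static str),
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
}

fn normalize(expr: Expr) -> Expr {
    match expr {
        Expr::Not(inner) => match *inner {
            Expr::Not(x) => normalize(*x), // NOT (NOT x) => x
            Expr::And(children) => Expr::Or(
                // NOT (a AND b) => (NOT a) OR (NOT b)
                children.into_iter().map(|c| normalize(Expr::Not(Box::new(c)))).collect(),
            ),
            Expr::Or(children) => Expr::And(
                // NOT (a OR b) => (NOT a) AND (NOT b)
                children.into_iter().map(|c| normalize(Expr::Not(Box::new(c)))).collect(),
            ),
            other => Expr::Not(Box::new(normalize(other))),
        },
        Expr::And(children) => Expr::And(children.into_iter().map(normalize).collect()),
        Expr::Or(children) => Expr::Or(children.into_iter().map(normalize).collect()),
        leaf => leaf,
    }
}

fn main() {
    let e = Expr::Not(Box::new(Expr::And(vec![Expr::Pred("a"), Expr::Pred("b")])));
    let expected = Expr::Or(vec![
        Expr::Not(Box::new(Expr::Pred("a"))),
        Expr::Not(Box::new(Expr::Pred("b"))),
    ]);
    assert_eq!(normalize(e), expected);
}
```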
Predicate Fusion
The PredicateFusionCache analyzes predicates to determine when multiple conditions on the same field should be fused:
| Data Type | Fusion Criteria |
|---|---|
| String types | contains count ≥ 1 AND total predicates ≥ 2 |
| Other types | Total predicates ≥ 2 |
graph TB
Expr["Filter Expression"]
Cache["PredicateFusionCache"]
Traverse["Traverse expression tree"]
Stats["Per-field stats:\n- total predicates\n- contains predicates"]
Decision["should_fuse(field, dtype)"]
Fuse["Fuse predicates into\nsingle evaluation"]
Separate["Keep predicates separate"]
Expr --> Cache
Cache --> Traverse
Traverse --> Stats
Stats --> Decision
Decision -->|Meets criteria| Fuse
Decision -->|Below threshold| Separate
Fusion enables single-pass evaluation rather than multiple column scans for the same field.
Sources: llkv-table/src/planner/mod.rs:517-570
Expression Evaluation
Numeric Kernels
The NumericKernels system in llkv-table/src/scalar_eval.rs provides vectorized evaluation for scalar expressions:
| Kernel Operation | Description |
|---|---|
collect_fields | Extract all field references from expression |
prepare_numeric_arrays | Cast columns to unified numeric representation |
evaluate_value | Row-by-row scalar evaluation |
evaluate_batch | Vectorized batch evaluation |
simplify | Detect affine expressions for optimization |
Numeric Type Hierarchy
Sources: llkv-table/src/scalar_eval.rs:1-90 llkv-table/src/scalar_eval.rs:451-712
Vectorized Evaluation
When possible, expressions are evaluated using vectorized paths:
- Column access : Direct array reference (zero-copy)
- Literals : Broadcast scalar to array length
- Binary operations : Arrow compute kernels for array-array or array-scalar operations
- Affine expressions : Specialized scale * field + offset fast path
The try_evaluate_vectorized method attempts vectorization before falling back to row-by-row evaluation.
Sources: llkv-table/src/scalar_eval.rs:714-762
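The two-tier shape can be sketched without the Arrow kernels (plain Rust, illustrative types only): try a whole-column fast path, otherwise evaluate each row through a generic path:

```rust
// Toy two-tier evaluator: a vectorizable Array x Scalar case and a per-row fallback.
enum Expr {
    ScaleColumn(f64),       // supported by the "vectorized" path
    PerRow(fn(f64) -> f64), // arbitrary row-by-row computation
}

fn evaluate_batch(column: &[f64], expr: &Expr) -> Vec<f64> {
    match expr {
        // Whole-array path: one tight loop over the column
        Expr::ScaleColumn(s) => column.iter().map(|v| v * s).collect(),
        // Fallback path: evaluate the expression independently for every row
        Expr::PerRow(f) => column.iter().map(|v| f(*v)).collect(),
    }
}

fn add_one(v: f64) -> f64 {
    v + 1.0
}

fn main() {
    let col = [1.0, 2.0, 3.0];
    assert_eq!(evaluate_batch(&col, &Expr::ScaleColumn(2.0)), vec![2.0, 4.0, 6.0]);
    assert_eq!(evaluate_batch(&col, &Expr::PerRow(add_one)), vec![2.0, 3.0, 4.0]);
}
```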
graph LR
Expr["ScalarExpr"]
Detect["NumericKernels::simplify"]
Check["is_affine_column_expr"]
Affine["AffineExpr:\nfield, scale, offset"]
Direct["Direct column reference"]
Complex["Complex expression"]
Expr --> Detect
Detect --> Check
Check -->|Matches pattern| Affine
Check -->|Single column| Direct
Check -->|Other| Complex
Affine Expression Optimization
Expressions matching the pattern scale * field + offset are detected and optimized:
Affine expressions enable:
- Single column scan with arithmetic applied
- Reduced memory allocation
- Better cache locality
Sources: llkv-table/src/scalar_eval.rs:1038-1174 llkv-table/src/planner/mod.rs:1711-1872
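Once an expression is recognized as affine, applying it is a single pass over the column (a trimmed sketch of the AffineExpr idea, with field resolution omitted):

```rust
// Affine fast path: scale * column + offset applied in one vectorizable loop.
struct AffineExpr {
    scale: f64,
    offset: f64,
}

fn apply_affine(column: &[f64], expr: &AffineExpr) -> Vec<f64> {
    column.iter().map(|v| expr.scale * v + expr.offset).collect()
}

fn main() {
    let out = apply_affine(&[1.0, 2.0, 3.0], &AffineExpr { scale: 2.0, offset: 1.0 });
    assert_eq!(out, vec![3.0, 5.0, 7.0]);
}
```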
graph TB
Builder["RowStreamBuilder::new"]
Config["Configuration:\n- store\n- table_id\n- schema\n- unique_lfids\n- projection_evals\n- row_ids\n- batch_size"]
GatherCtx["prepare_gather_context\n(optional reuse)"]
Build["build()"]
Stream["RowStream"]
NextChunk["next_chunk()"]
Gather["Gather columns\nfor batch_size rows"]
Evaluate["Evaluate computed\nprojections"]
Batch["StreamChunk\n(arrays + schema)"]
Builder --> Config
Config --> GatherCtx
GatherCtx --> Build
Build --> Stream
Stream --> NextChunk
NextChunk --> Gather
Gather --> Evaluate
Evaluate --> Batch
Batch -->|More rows| NextChunk
Streaming Architecture
Row Stream Builder
The RowStreamBuilder constructs streaming result iterators with configurable batch sizes:
The stream uses STREAM_BATCH_ROWS (default 1024) as the chunk size for incremental result production.
Sources: llkv-table/src/stream.rs llkv-table/src/constants.rs:1-7
Gather Context Reuse
MultiGatherContext enables amortization of setup costs across multiple scans:
- Caches physical key lookups
- Reuses internal buffers
- Reduces allocations in streaming scenarios
The context is optional but improves performance for repeated scans of the same columns.
Sources: llkv-column-map/src/store/scan.rs
Performance Characteristics
| Scan Type | Row ID Collection | Column Access | Memory Usage |
|---|---|---|---|
| Single column direct | None (streams directly) | Direct column chunks | O(chunk_size) |
| Full table streaming | Shadow column (fast) | Incremental gather | O(batch_size × columns) |
| Filtered scan | Shadow or multi-column | Full gather | O(row_count × columns) |
| Ordered scan | Shadow or multi-column | Full gather + sort | O(row_count × columns) |
The executor prioritizes fast paths that minimize memory usage and avoid full table materialization when possible.
Sources: llkv-table/src/planner/mod.rs:748-999 llkv-table/README.md:1-57
Filter Evaluation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page explains how filter expressions (WHERE clause predicates) are evaluated against table rows during query execution. This includes the compilation of filter expressions into efficient bytecode programs, stack-based evaluation mechanisms, integration with MVCC visibility filtering, and various optimization strategies.
For information about how expressions are initially structured and planned, see Expression System and Query Planning. For details about the broader scan execution context, see Scan Execution and Optimization.
Filter Expression Pipeline
Filter evaluation follows a multi-stage pipeline that transforms SQL predicates into efficient executable programs:
Sources: llkv-table/src/planner/mod.rs:595-636 llkv-table/src/planner/program.rs:256-284 llkv-executor/src/translation/expression.rs:18-174
graph LR
SQL["SQL WHERE Clause"] --> Parser["sqlparser AST"]
Parser --> ExprString["Expr<String>\nField names"]
ExprString --> Translation["Field Resolution\nCatalog Lookup"]
Translation --> ExprFieldId["Expr<FieldId>\nResolved fields"]
ExprFieldId --> Normalize["normalize_predicate\nDe Morgan's Laws\nFlatten AND/OR"]
Normalize --> Compiler["ProgramCompiler"]
Compiler --> EvalProg["EvalProgram\nStack Bytecode"]
Compiler --> DomainProg["DomainProgram\nRow ID Selection"]
EvalProg --> Evaluation["Row Evaluation"]
DomainProg --> Evaluation
Evaluation --> MVCCFilter["MVCC Filtering"]
MVCCFilter --> Results["Filtered Results"]
Program Compilation
Normalization
Before compilation, filter expressions are normalized into canonical form using the normalize_predicate function. This transformation ensures consistent structure for optimization and evaluation.
Normalization rules:
- Flatten nested conjunctions/disjunctions: AND(AND(a,b),c) → AND(a,b,c)
- Apply De Morgan's laws: Push NOT operators down through logical connectives
- Eliminate double negations: NOT(NOT(expr)) → expr
- Simplify literal booleans: NOT(true) → false
The normalization process uses an iterative approach to handle deeply nested expressions without stack overflow. The transformation is applied recursively, with special handling for negated conjunctions and disjunctions.
Key normalization functions:
| Function | Purpose |
|---|---|
normalize_predicate | Entry point for expression normalization |
normalize_expr | Flattens AND/OR and delegates to normalize_negated |
normalize_negated | Applies De Morgan's laws and simplifies negations |
Sources: llkv-table/src/planner/program.rs:286-343
Bytecode Generation
The ProgramCompiler translates normalized expressions into two complementary program representations:
EvalProgram operations:
graph TB
subgraph "Compilation"
Expr["Normalized Expr<FieldId>"]
Compiler["ProgramCompiler"]
Expr --> Compiler
end
subgraph "Programs"
EvalProg["EvalProgram\nStack-based bytecode\nfor predicate evaluation"]
DomainProg["DomainProgram\nRow ID domain\ncalculation"]
Compiler --> EvalProg
Compiler --> DomainProg
end
subgraph "Evaluation"
EvalProg --> ResultStack["Result Stack\nbool values"]
DomainProg --> RowIDs["Row ID Sets\ncandidate rows"]
end
| Operation | Stack Effect | Purpose |
|---|---|---|
PushPredicate | → bool | Evaluate single predicate |
PushCompare | → bool | Evaluate scalar comparison |
PushInList | → bool | Evaluate IN list membership |
PushIsNull | → bool | Evaluate NULL test |
PushLiteral | → bool | Push constant boolean |
FusedAnd | → bool | Evaluate fused predicates on same field |
And | bool×N → bool | Logical AND of N values |
Or | bool×N → bool | Logical OR of N values |
Not | bool → bool | Logical negation (uses domain program) |
DomainProgram operations:
| Operation | Purpose |
|---|---|
PushFieldAll | Include all rows for a field |
PushCompareDomain | Rows where scalar comparison fields exist |
PushInListDomain | Rows where IN list fields exist |
PushIsNullDomain | Rows where NULL test fields exist |
Union | Combine row sets (OR semantics) |
Intersect | Intersect row sets (AND semantics) |
Sources: llkv-table/src/planner/program.rs:22-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:544-631
Predicate Fusion
Predicate fusion is an optimization that recognizes multiple predicates on the same field within an AND expression and evaluates them together. This reduces overhead and enables more efficient filtering.
Fusion conditions (from PredicateFusionCache):
| Data Type | Fusion Threshold |
|---|---|
| String types | ≥2 total predicates AND ≥1 Contains operator |
| Other types | ≥2 total predicates on same field |
Example transformation:
age >= 18 AND age < 65 AND age != 30
→ FusedAnd(field_id: age, filters: [>=18, <65, !=30])
The fusion cache tracks predicate patterns during compilation:
- Counts total predicates per field
- Tracks specific operator types (e.g., Contains for strings)
- Decides whether fusion is beneficial via should_fuse
Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/program.rs:518-542
Row-Level Evaluation
Stack-Based Evaluation Engine
Filter evaluation uses a stack-based virtual machine that processes EvalProgram bytecode. Each operation manipulates a boolean result stack.
graph LR
subgraph "Evaluation Loop"
Ops["EvalOp Instructions"]
Stack["Result Stack\nVec<bool>"]
Ops -->|Process| Stack
Stack -->|Final Value| Result["Filter Decision"]
end
subgraph "Example: age >= 18 AND status = 'active'"
Op1["PushPredicate(age >= 18)"] -->|Stack: [true]| Op2["PushPredicate(status = 'active')"]
Op2 -->|Stack: [true, false]| Op3["And(2)"]
Op3 -->|Stack: [false]| Final["Result: false"]
end
The evaluation process iterates through EvalOp instructions, pushing boolean results and combining them according to logical operators. Each predicate evaluation consults the underlying storage to check field values against filter conditions.
Sources: llkv-table/src/planner/mod.rs:1009-1031
graph TB
Predicate["Filter<FieldId>\nfield_id + operator"]
Predicate --> TypeCheck{"Data Type?"}
TypeCheck -->|Fixed-width| FixedPath["build_fixed_width_predicate\nInt, Float, Date, etc."]
TypeCheck -->|Variable-width| VarPath["build_var_width_predicate\nString types"]
TypeCheck -->|Boolean| BoolPath["build_bool_predicate\nBool type"]
FixedPath --> Native["Native comparison\nusing PredicateValue"]
VarPath --> Pattern["Pattern matching\nStartsWith, Contains, etc."]
BoolPath --> Boolean["Boolean logic"]
Native --> Result["bool"]
Pattern --> Result
Boolean --> Result
Predicate Evaluation
Individual predicates are evaluated by comparing field values against filter operators. The evaluation strategy depends on the data type:
Type-specific evaluation paths:
Operator semantics:
| Operator | Description | NULL Handling |
|---|---|---|
Equals | Exact match | NULL = NULL → NULL |
Range | Bounded interval (inclusive/exclusive) | NULL in range → NULL |
In | Set membership | NULL in [values] → NULL |
StartsWith | String prefix match (case-sensitive/insensitive) | NULL starts with X → NULL |
EndsWith | String suffix match | NULL ends with X → NULL |
Contains | String substring match | NULL contains X → NULL |
IsNull | NULL test | Returns true/false |
IsNotNull | NOT NULL test | Returns true/false |
Sources: llkv-expr/src/expr.rs:295-358 llkv-expr/src/typed_predicate.rs:1-500 (referenced but not shown)
graph TB
ScalarExpr["ScalarExpr<FieldId>"]
ScalarExpr --> Mode{"Evaluation Mode"}
Mode -->|Single Row| RowLevel["NumericKernels::evaluate_value\nRecursive evaluation\nReturns Option<NumericValue>"]
Mode -->|Batch| BatchLevel["NumericKernels::evaluate_batch\nVectorized when possible"]
BatchLevel --> Vectorized{"Vectorizable?"}
Vectorized -->|Yes| Vec["try_evaluate_vectorized\nArrow compute kernels"]
Vectorized -->|No| Fallback["Per-row evaluation\nLoop + evaluate_value"]
Vec --> Array["ArrayRef result"]
Fallback --> Array
Scalar Expression Evaluation
For computed columns and complex predicates (e.g., WHERE salary * 1.1 > 50000), scalar expressions are evaluated using the NumericKernels utility.
Evaluation modes:
Numeric type handling:
The NumericArray wrapper provides unified access to different numeric types:
- Integer : Int64Array for integers, booleans, and dates
- Float : Float64Array for floating-point numbers
- Decimal : Decimal128Array for precise decimal values
Type coercion occurs automatically during expression evaluation:
- Mixed integer/float operations promote to float
- String-to-numeric conversions follow SQLite semantics (invalid → 0)
- NULL propagates through operations
Key evaluation functions:
| Function | Purpose | Performance |
|---|---|---|
evaluate_value | Single-row evaluation | Used for non-vectorizable expressions |
evaluate_batch | Batch evaluation | Attempts vectorization first |
try_evaluate_vectorized | Vectorized computation | Uses Arrow compute kernels |
prepare_numeric_arrays | Type coercion | Converts columns to numeric representation |
Sources: llkv-table/src/scalar_eval.rs:453-713 llkv-table/src/scalar_eval.rs:92-383
graph TB
subgraph "Filter Stages"
Scan["Table Scan"]
Scan --> Domain["1. Domain Program\nDetermines candidate\nrow IDs"]
Domain --> UserPred["2. User Predicates\nSemantic filtering\nvia EvalProgram"]
UserPred --> MVCCFilter["3. MVCC Filtering\nrow_id_filter.filter()\nVisibility rules"]
MVCCFilter --> Results["Visible Results"]
end
subgraph "MVCC Visibility"
RowID["Row ID"]
CreatedBy["created_by TxnId"]
DeletedBy["deleted_by Option<TxnId>"]
Snapshot["Transaction Snapshot"]
RowID --> Check{"Visibility Check"}
CreatedBy --> Check
DeletedBy --> Check
Snapshot --> Check
Check -->|Created before snapshot Not deleted or deleted after snapshot| Visible["Include"]
Check -->|Otherwise| Invisible["Exclude"]
end
MVCC Integration
Filter evaluation integrates with MVCC visibility filtering to ensure queries only see rows visible to their transaction. This is a two-stage filtering process:
MVCC filtering implementation:
The row_id_filter option in ScanStreamOptions provides transaction-aware filtering:
- Created by runtime's transaction manager
- Encapsulates snapshot visibility rules
- Applied after user predicate evaluation
- Filters row IDs based on `created_by` and `deleted_by` transaction IDs
Filtering order rationale:
- Domain programs - Quickly eliminate rows where referenced fields don't exist
- User predicates - Evaluate semantic conditions (WHERE clause)
- MVCC filter - Apply transaction visibility rules
This ordering minimizes MVCC overhead by only checking visibility for rows that pass semantic filters.
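A simplified sketch of the visibility rule applied in stage 3 is shown below. The `RowVersion` and `Snapshot` types are hypothetical stand-ins for the MVCC metadata and transaction snapshot, and the check ignores in-flight transactions for brevity; the actual `row_id_filter` logic lives in the runtime's transaction machinery.

```rust
/// Hypothetical per-row MVCC metadata; mirrors the created_by/deleted_by columns.
struct RowVersion {
    created_by: u64,
    deleted_by: Option<u64>, // None when the row is still live
}

/// Hypothetical snapshot: the highest transaction ID visible to this reader.
struct Snapshot {
    high_water: u64,
}

/// A row is visible if it was created at or before the snapshot and is either
/// not deleted, or deleted by a transaction the snapshot cannot see yet.
fn is_visible(row: &RowVersion, snap: &Snapshot) -> bool {
    let created_visible = row.created_by <= snap.high_water;
    let not_deleted = match row.deleted_by {
        None => true,
        Some(txn) => txn > snap.high_water,
    };
    created_visible && not_deleted
}
```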
Sources: llkv-table/src/planner/mod.rs:940-982 llkv-table/src/table.rs:200-300 (type definitions referenced but not shown)
Evaluation Optimizations
Single-Column Direct Scan Fast Path
A specialized fast path handles queries that project a single column with simple filtering, bypassing the general evaluation machinery for better performance.
Conditions for fast path:
- Single column projection
- Filter references only that column
- Simple operator (no complex scalar expressions)
When activated, the scan directly streams the target column's values without materializing intermediate structures.
Sources: llkv-table/src/planner/mod.rs:1020-1031 (method name: try_single_column_direct_scan)
graph LR
Shadow["Shadow Column\nrow_id metadata"]
Shadow -->|Exists?| FastEnum["Fast Path\nstream_table_row_ids"]
Shadow -->|Missing| Fallback["Fallback Path\nMulti-column scan\n+ deduplication"]
FastEnum --> Chunks["Row ID Chunks\nSTREAM_BATCH_ROWS"]
Fallback --> Chunks
Chunks --> MVCCCheck["Apply MVCC Filter"]
MVCCCheck --> Gather["Gather Columns"]
Gather --> Batch["RecordBatch"]
Full Table Scan Streaming
When no predicates require evaluation (e.g., WHERE true or full scan), the executor uses streaming row ID enumeration:
The fast path attempts to use a shadow column (row_id) that stores all row IDs for a table:
- Success case : Shadow column exists → stream chunks directly
- Fallback case : Shadow column missing → scan user columns and deduplicate
Sources: llkv-table/src/planner/mod.rs:739-857 llkv-table/src/planner/mod.rs:859-902 llkv-table/src/planner/mod.rs:904-999
graph TB
Expression["WHERE clause\nexpression tree"]
Expression --> Traverse["Traverse expression\nrecord_expr"]
Traverse --> Track["Track per-field stats:\n- Total predicate count\n- Contains operator count"]
Track --> Decide["should_fuse decision"]
Decide -->|String + Contains ≥1| Fuse1["Enable fusion"]
Decide -->|Any type + predicates ≥2| Fuse2["Enable fusion"]
Decide -->|Otherwise| NoFuse["No fusion"]
Predicate Fusion Cache
The PredicateFusionCache tracks predicate patterns during compilation to enable fusion optimization:
Fusion benefits:
- Reduces function call overhead
- Enables specialized evaluation routines
- Improves cache locality by processing same field
Fusion conditions table:
| Field Data Type | Conditions for Fusion |
|---|---|
| Utf8 / LargeUtf8 | Total predicates ≥ 2 AND Contains operations ≥ 1 |
| Other types | Total predicates ≥ 2 on same field |
Sources: llkv-table/src/planner/mod.rs:517-570
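A sketch of the fusion decision as described by the table above is shown below. The `FieldStats` struct and `should_fuse` helper are illustrative names, not the exact items in llkv-table's planner.

```rust
/// Per-field statistics accumulated while walking the WHERE expression.
struct FieldStats {
    total_predicates: usize,
    contains_ops: usize,
    is_string: bool, // Utf8 / LargeUtf8
}

/// Fusion is worthwhile when multiple predicates touch the same field;
/// string fields additionally require at least one Contains-style operator.
fn should_fuse(stats: &FieldStats) -> bool {
    if stats.is_string {
        stats.total_predicates >= 2 && stats.contains_ops >= 1
    } else {
        stats.total_predicates >= 2
    }
}
```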
graph TB
Expr["ScalarExpr batch"]
Expr --> Check{"Vectorizable?"}
Check -->|Yes| Patterns["Supported patterns:\n- Column references\n- Literal constants\n- Binary ops\n- Scalar×Array ops\n- Array×Array ops"]
Patterns --> ArrowCompute["Arrow compute kernels\nSIMD-optimized"]
Check -->|No| PerRow["Per-row evaluation\nevaluate_value loop"]
ArrowCompute --> Result["ArrayRef"]
PerRow --> Result
Vectorized Expression Evaluation
For numeric operations, the system attempts vectorized evaluation to process entire batches at once:
Vectorization strategy:
Vectorizable expression patterns:
- Pure column references
- Literal constants
- Binary operations: `Array ⊕ Array`, `Array ⊕ Scalar`, `Scalar ⊕ Array`
- Simple casts between numeric types
Non-vectorizable expressions:
- CASE expressions with complex branches
- Date/interval arithmetic
- Aggregate functions
- Subqueries
The vectorization attempt happens in try_evaluate_vectorized, which recursively checks if all sub-expressions can be vectorized. If any sub-expression is non-vectorizable, the entire expression falls back to row-by-row evaluation.
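The recursive "vectorizable only if every sub-expression is" rule can be sketched as follows. This is illustrative plain Rust over `Vec<f64>` columns, not the Arrow-kernel implementation; the `Expr` variants and the `Opaque` catch-all are assumptions for the example.

```rust
/// Illustrative expression subset; the real ScalarExpr<FieldId> is richer.
enum Expr {
    Column(usize),
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
    /// Stand-in for patterns the vectorizer does not handle (CASE, date math, ...).
    Opaque,
}

/// Recursive check: an expression is vectorizable only if every sub-expression is.
fn is_vectorizable(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => true,
        Expr::Add(lhs, rhs) => is_vectorizable(lhs) && is_vectorizable(rhs),
        Expr::Opaque => false,
    }
}

/// Whole-batch evaluation; returns None when the expression is not vectorizable,
/// signaling the caller to fall back to the per-row evaluate_value loop.
fn evaluate_vectorized(expr: &Expr, columns: &[Vec<f64>]) -> Option<Vec<f64>> {
    if !is_vectorizable(expr) {
        return None;
    }
    Some(match expr {
        Expr::Column(i) => columns[*i].clone(),
        Expr::Literal(v) => vec![*v; columns.first().map_or(0, Vec::len)],
        Expr::Add(lhs, rhs) => {
            let l = evaluate_vectorized(lhs, columns)?;
            let r = evaluate_vectorized(rhs, columns)?;
            l.iter().zip(&r).map(|(a, b)| a + b).collect()
        }
        Expr::Opaque => unreachable!(),
    })
}
```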
Sources: llkv-table/src/scalar_eval.rs:714-763 llkv-table/src/scalar_eval.rs:676-713
Storage Layer
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Storage Layer provides the columnar data persistence infrastructure for LLKV. It spans three crates that form a hierarchy: llkv-table provides schema-aware table operations, llkv-column-map manages columnar chunk storage and catalog mappings, and llkv-storage defines the pluggable pager abstraction for physical persistence. Together, these components implement an Arrow-native columnar store with MVCC semantics and zero-copy read paths.
For information about query execution and predicate evaluation that operates on top of this storage layer, see Query Execution. For details on how the runtime orchestrates storage operations within transactions, see Catalog and Metadata Management.
Sources: llkv-table/README.md:1-57 llkv-column-map/README.md:1-62 llkv-storage/README.md:1-45
Architecture Overview
The storage layer implements a three-tier architecture where each tier has a distinct responsibility:
Key Components:
graph TB
subgraph "Schema Layer"
Table["Table\nllkv-table::Table"]
Schema["Schema Validation"]
MVCC["MVCC Injection\nrow_id, created_by, deleted_by"]
end
subgraph "Column Management Layer"
ColumnStore["ColumnStore\nllkv-column-map::ColumnStore"]
Catalog["Column Catalog\nLogicalFieldId → PhysicalKey"]
Chunks["Column Chunks\nArrow RecordBatch segments"]
Descriptors["ColumnDescriptor\nChunk metadata"]
end
subgraph "Physical Persistence Layer"
Pager["Pager Trait\nllkv-storage::pager::Pager"]
MemPager["MemPager\nHashMap backend"]
SimdPager["SimdRDrivePager\nMemory-mapped file"]
end
Table --> Schema
Schema --> MVCC
MVCC --> ColumnStore
ColumnStore --> Catalog
ColumnStore --> Chunks
ColumnStore --> Descriptors
Catalog --> Pager
Chunks --> Pager
Descriptors --> Pager
Pager --> MemPager
Pager --> SimdPager
| Layer | Crate | Primary Types | Responsibility |
|---|---|---|---|
| Schema | llkv-table | Table, SysCatalog | Schema validation, MVCC metadata injection, streaming scans |
| Column Management | llkv-column-map | ColumnStore, LogicalFieldId, ColumnDescriptor | Columnar chunking, catalog mapping, gather operations |
| Physical Persistence | llkv-storage | Pager, MemPager, SimdRDrivePager | Batch get/put over physical keys, zero-copy reads |
Sources: llkv-table/README.md:10-41 llkv-column-map/README.md:10-41 llkv-storage/README.md:12-28
Logical vs Physical Addressing
The storage layer separates logical identifiers from physical storage locations to enable namespace isolation and catalog management:
Namespace Segregation:
graph LR
subgraph "Logical Space"
LogicalField["LogicalFieldId\n(namespace, table_id, field_id)"]
UserNS["Namespace::UserData"]
RowNS["Namespace::RowIdShadow"]
MVCCNS["Namespace::TxnMetadata"]
end
subgraph "Catalog Mapping"
CatalogEntry["catalog.map\nLogicalFieldId → PhysicalKey"]
ValuePK["Value PhysicalKey"]
RowPK["RowId PhysicalKey"]
end
subgraph "Physical Space"
PhysKey["PhysicalKey (u64)"]
DescBlob["Descriptor Blob\nColumnDescriptor"]
ChunkBlob["Chunk Blob\nSerialized Arrow Array"]
end
LogicalField --> CatalogEntry
UserNS --> LogicalField
RowNS --> LogicalField
MVCCNS --> LogicalField
CatalogEntry --> ValuePK
CatalogEntry --> RowPK
ValuePK --> PhysKey
RowPK --> PhysKey
PhysKey --> DescBlob
PhysKey --> ChunkBlob
LogicalFieldId encodes a three-part address: (namespace, table_id, field_id). Namespaces prevent collisions between user columns, MVCC metadata, and row-id shadows:
- `Namespace::UserData`: User-defined columns (e.g., `name`, `age`, `email`)
- `Namespace::RowIdShadow`: Parallel row-id arrays used for gather operations
- `Namespace::TxnMetadata`: MVCC columns (`created_by`, `deleted_by`)
The ColumnStore maintains a catalog.map that translates each LogicalFieldId into a PhysicalKey. Physical keys are opaque u64 identifiers allocated by the pager; the catalog is persisted at pager root key 0.
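A sketch of how such a three-part address can be packed into a single u64 follows. The bit widths (8/24/32) and the `Namespace` discriminants are assumptions for illustration and do not necessarily match llkv-column-map's actual encoding.

```rust
/// Hypothetical namespace tags; the real enum lives in llkv-column-map.
#[repr(u8)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum Namespace {
    UserData = 0,
    RowIdShadow = 1,
    TxnMetadata = 2,
}

/// Pack (namespace, table_id, field_id) into one u64: assumed 8 + 24 + 32 bit split.
fn pack(ns: Namespace, table_id: u32, field_id: u32) -> u64 {
    ((ns as u64) << 56) | (((table_id as u64) & 0xFF_FFFF) << 32) | field_id as u64
}

fn unpack(id: u64) -> (u8, u32, u32) {
    ((id >> 56) as u8, ((id >> 32) & 0xFF_FFFF) as u32, id as u32)
}

fn main() {
    let id = pack(Namespace::TxnMetadata, 5, 2);
    assert_eq!(unpack(id), (2, 5, 2));
}
```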
Sources: llkv-column-map/README.md:18-23 llkv-column-map/src/store/projection.rs:23-37
Data Persistence Model
Data flows through the storage layer as Arrow RecordBatch objects, which are decomposed into per-column chunks for persistence:
Chunking Strategy:
sequenceDiagram
participant Caller
participant Table as "Table"
participant ColumnStore as "ColumnStore"
participant Serialization as "serialization"
participant Pager as "Pager"
Caller->>Table: append(RecordBatch)
Note over Table: Validate schema\nInject MVCC columns
Table->>ColumnStore: append(RecordBatch with MVCC)
Note over ColumnStore: Chunk columns\nSort by row_id\nApply last-writer-wins
loop "For each column"
ColumnStore->>Serialization: serialize_array(Array)
Serialization-->>ColumnStore: Vec<u8> blob
ColumnStore->>Pager: batch_put(PhysicalKey, blob)
end
Pager-->>ColumnStore: Success
ColumnStore-->>Table: Success
Table-->>Caller: Success
Each column is split into fixed-size chunks (default 8,192 rows) to balance scan efficiency and update granularity. Chunks are serialized using a custom Arrow-compatible format optimized for zero-copy reads. The ColumnDescriptor maintains per-chunk metadata including row count, min/max row IDs, and physical keys for value and row-id arrays.
MVCC Column Layout:
Every table's physical storage includes three categories of columns:
- User-defined columns from the schema
- `row_id` (UInt64): monotonic row identifier
- `created_by` (UInt64): transaction ID that created the row
- `deleted_by` (UInt64): transaction ID that deleted the row (0 if live)
These MVCC columns are stored in Namespace::TxnMetadata and participate in the same chunking and persistence workflow as user data.
Sources: llkv-column-map/README.md:24-29 llkv-table/README.md:14-25
Serialization and Zero-Copy Design
The storage layer uses a custom serialization format that enables zero-copy reconstruction of Arrow arrays from memory-mapped regions:
Serialization Format
The format defined in llkv-storage/src/serialization.rs:1-586 supports multiple layouts:
| Layout | Type Code | Use Case | Header Fields |
|---|---|---|---|
| Primitive | PrimType::* | Fixed-width primitives (Int32, UInt64, Float32, etc.) | values_len |
| FslFloat32 | N/A | FixedSizeList for vector embeddings | list_size, child_values_len |
| Varlen | PrimType::Binary, PrimType::Utf8, etc. | Variable-length Binary/String | offsets_len, values_len |
| Struct | N/A | Nested struct types | payload_len (IPC format) |
| Decimal128 | PrimType::Decimal128 | Fixed-precision decimals | precision, scale |
Each serialized blob begins with a 24-byte header:
Offset Field Type Description
------ ----- ---- -----------
0-3 Magic [u8;4] "ARR0"
4 Layout u8 Layout discriminant (0-3)
5 PrimType u8 Type code (layout-specific)
6 Precision/Pad u8 Decimal precision or padding
7 Scale/Pad u8 Decimal scale or padding
8-15 Length u64 Logical element count
16-19 Extra A u32 Layout-specific (e.g., values_len)
20-23 Extra B u32 Layout-specific (e.g., offsets_len)
24+ Payload [u8] Raw Arrow buffers
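Based on the layout above, a header can be decoded with straightforward byte slicing. The sketch below assumes little-endian integer encoding and uses illustrative field names; it is not the decoder in llkv-storage.

```rust
/// Decoded view of the 24-byte blob header described above (field names illustrative).
#[derive(Debug)]
struct BlobHeader {
    layout: u8,
    prim_type: u8,
    precision: u8,
    scale: u8,
    len: u64,
    extra_a: u32,
    extra_b: u32,
}

fn parse_header(bytes: &[u8]) -> Result<BlobHeader, String> {
    if bytes.len() < 24 || &bytes[..4] != b"ARR0" {
        return Err("not a serialized array blob".into());
    }
    Ok(BlobHeader {
        layout: bytes[4],
        prim_type: bytes[5],
        precision: bytes[6],
        scale: bytes[7],
        len: u64::from_le_bytes(bytes[8..16].try_into().unwrap()),
        extra_a: u32::from_le_bytes(bytes[16..20].try_into().unwrap()),
        extra_b: u32::from_le_bytes(bytes[20..24].try_into().unwrap()),
    })
}
```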
Why Custom Format Instead of Arrow IPC:
The custom format achieves three goals:
- Minimal overhead : No schema framing or padding, just raw buffers
- Contiguous payloads : Each array's bytes are adjacent, ideal for SIMD and sequential scans
- True zero-copy : `deserialize_array` constructs `ArrayData` directly from `EntryHandle` buffers without memcpy
Sources: llkv-storage/src/serialization.rs:1-100 llkv-storage/src/serialization.rs:226-298
EntryHandle Abstraction
The Pager trait operates on EntryHandle objects from the simd-r-drive-entry-handle crate. EntryHandle provides:
- `as_ref() -> &[u8]`: Zero-copy slice view
- `as_arrow_buffer() -> Buffer`: Wrap as Arrow buffer without copying
graph LR
File["Persistent File\nsimd_r_drive::DataStore"]
Mmap["Memory-Mapped Region"]
EntryHandle["EntryHandle"]
Buffer["Arrow Buffer"]
ArrayData["ArrayData"]
ArrayRef["ArrayRef"]
File --> Mmap
Mmap --> EntryHandle
EntryHandle --> Buffer
Buffer --> ArrayData
ArrayData --> ArrayRef
style File fill:#f9f9f9
style ArrayRef fill:#f9f9f9
When using SimdRDrivePager, EntryHandle wraps a memory-mapped region backed by a persistent file. The deserialize_array function slices this buffer to construct ArrayData without allocating or copying:
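The general shape of that construction against the public arrow-rs API is sketched below, assuming a recent arrow crate. For brevity the Buffer is built from an owned Vec, whereas the real path wraps the EntryHandle's memory-mapped bytes (for example via its as_arrow_buffer() accessor).

```rust
use arrow::array::{make_array, ArrayData, ArrayRef};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Rebuild an Int64 array from a raw little-endian payload; once the bytes are
/// inside a Buffer, ArrayData references them without further copying.
fn int64_array_from_payload(payload: Vec<u8>) -> ArrayRef {
    let len = payload.len() / std::mem::size_of::<i64>();
    let buffer = Buffer::from_vec(payload); // takes ownership of the allocation
    let data = ArrayData::builder(DataType::Int64)
        .len(len)
        .add_buffer(buffer)
        .build()
        .expect("valid Int64 layout");
    make_array(data)
}
```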
Sources: llkv-storage/src/serialization.rs:429-559 llkv-table/Cargo.toml:29
Pager Implementations
The Pager trait defines the interface for batch get/put operations over physical keys:
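A simplified signature sketch makes the contract concrete; the actual trait in llkv-storage uses different associated types (notably `EntryHandle` blobs) and its own error type, so the names and signatures below are assumptions for illustration only.

```rust
/// Simplified stand-ins for the types referenced in this section.
type PhysicalKey = u64;
type Blob = Vec<u8>; // the real trait yields EntryHandle values instead

/// Illustrative pager contract: batch-oriented gets and atomic batch puts.
trait Pager {
    /// Fetch many blobs in one round trip; missing keys surface as None.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Result<Vec<Option<Blob>>, String>;

    /// Write many key/value pairs; either all writes land or none do.
    fn batch_put(&self, entries: &[(PhysicalKey, Blob)]) -> Result<(), String>;

    /// Remove entries by key.
    fn delete(&self, keys: &[PhysicalKey]) -> Result<(), String>;

    /// Flush pending writes to durable storage.
    fn flush(&self) -> Result<(), String>;
}
```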
MemPager
MemPager provides an in-memory, heap-backed implementation using a RwLock<FxHashMap<PhysicalKey, Arc<Vec<u8>>>>. It is used for:
- Unit tests and benchmarks
- Staging contexts during explicit transactions (see llkv-runtime/README.md:26-32)
- Temporary namespaces that don't require persistence
SimdRDrivePager
SimdRDrivePager wraps simd_r_drive::DataStore, enabling persistent storage with zero-copy reads:
| Feature | Implementation |
|---|---|
| Backing Store | Memory-mapped file via simd-r-drive |
| Alignment | SIMD-optimized (16-byte aligned regions) |
| Concurrency | Multiple readers, single writer (file-level locking) |
| EntryHandle | Zero-copy view into mmap region |
The SimdRDrivePager is instantiated by the runtime when opening a persistent database. All catalog entries, column descriptors, and chunk blobs are accessed through memory-mapped reads, avoiding I/O system calls during query execution.
Sources: llkv-storage/README.md:18-22 llkv-storage/README.md:25-28
Column Storage Operations
The ColumnStore provides three primary operation patterns:
Append Workflow
Last-Writer-Wins Semantics:
When appending a RecordBatch with row_id values that overlap existing rows, ColumnStore::append applies last-writer-wins logic:
- Identify chunks containing overlapping `row_id` ranges
- Load those chunks and merge with new data
- Re-serialize merged chunks
- Atomically update descriptors and chunk blobs
This ensures that INSERT OR REPLACE and UPDATE operations maintain consistent state without tombstone accumulation.
Sources: llkv-column-map/README.md:24-29
Gather Operations
Gather operations retrieve specific rows by row_id from columnar storage:
Null-Handling Policies:
Gather operations support three policies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row_id is not found |
| IncludeNulls | Emit null for missing rows |
| DropNulls | Omit rows where all projected columns are null |
The MultiGatherContext caches chunk data and row indexes across multiple gather calls to amortize deserialization costs during multi-pass operations like joins.
Sources: llkv-column-map/src/store/projection.rs:39-93 llkv-column-map/src/store/projection.rs:245-446
Streaming Scans
The ColumnStream type provides paginated, filtered scans over columnar data:
Scans operate in chunks to avoid materializing entire tables:
- Load next chunk of row IDs and MVCC metadata
- Apply MVCC visibility filter (transaction snapshot check)
- Evaluate user predicates on loaded columns
- Gather matching rows into a `RecordBatch`
- Yield batch to caller
This streaming model enables large result sets to be processed incrementally without exhausting memory.
Sources: llkv-column-map/README.md:30-35 llkv-table/README.md:22-25
Integration Points
The storage layer is consumed by multiple higher-level components:
Key Integration Patterns:
| Consumer | Usage Pattern |
|---|---|
| llkv-runtime | Opens pagers, manages table lifecycle, coordinates MVCC tagging |
| llkv-executor | Streams scans via Table::scan_stream, executes joins and aggregates |
| llkv-transaction | Provides transaction IDs for MVCC columns, enforces snapshot isolation |
| SysCatalog | Persists table and column metadata using the same storage infrastructure |
System Catalog Self-Hosting:
The system catalog (table 0) is itself stored using ColumnStore, creating a bootstrapping dependency:
- Runtime opens the pager
- `ColumnStore` is initialized with the pager
- `SysCatalog` is constructed, reading metadata from table 0
- User tables are opened using metadata from `SysCatalog`
This self-hosting design ensures that catalog operations and user data operations share the same crash recovery and MVCC semantics.
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:26-30 llkv-storage/README.md:25-28
Table Abstraction
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
The Table abstraction provides a schema-aware interface for data operations in the LLKV storage layer. It sits between query execution components and the columnar storage engine, managing schema validation, MVCC metadata injection, and translating logical operations into physical column store interactions. This document details the Table struct and its APIs for appending data, scanning rows, and coordinating with the system catalog.
For information about the underlying columnar storage implementation, see Column Storage and ColumnStore. For details on the storage pager abstraction, see Pager Interface and SIMD Optimization. For catalog management APIs, see CatalogManager API and System Catalog and SysCatalog.
Overview
The llkv-table crate provides the primary interface between SQL execution and physical storage. Each Table instance represents a logical table with a defined schema and wraps a reference to a ColumnStore that handles the actual persistence. Tables are responsible for enforcing schema constraints, injecting MVCC metadata columns, and exposing streaming scan APIs that integrate with the query executor.
Sources: llkv-table/README.md:1-57 llkv-table/Cargo.toml:1-60
graph TB
subgraph "Query Execution Layer"
RUNTIME["Runtime\nStatement Executor"]
EXECUTOR["Executor\nQuery Evaluation"]
end
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema-aware API"]
SYSCAT["SysCatalog\nTable 0\nMetadata Store"]
STREAM["ColumnStream\nStreaming Scans"]
end
subgraph "Column Store Layer (llkv-column-map)"
COLSTORE["ColumnStore\nColumnar Storage"]
PROJECTION["Projection\nGather Logic"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager Trait\nBatch Get/Put"]
end
RUNTIME -->|CREATE TABLE INSERT UPDATE| TABLE
EXECUTOR -->|SELECT scan_stream| TABLE
TABLE -->|validate schema| TABLE
TABLE -->|inject MVCC cols| TABLE
TABLE -->|append RecordBatch| COLSTORE
TABLE -->|gather_rows| COLSTORE
SYSCAT -->|stores TableMeta ColMeta| COLSTORE
TABLE -->|scan_stream returns| STREAM
STREAM -->|fetch batches| COLSTORE
COLSTORE -->|uses| PROJECTION
PROJECTION -->|batch_get/put| PAGER
Table Structure and Core Responsibilities
A Table instance encapsulates a schema-validated view over a ColumnStore. The table layer is responsible for:
| Responsibility | Description |
|---|---|
| Schema Validation | Ensures all RecordBatch operations match the declared Arrow schema |
| MVCC Injection | Adds system columns (row_id, created_by, deleted_by) to all data |
| Catalog Coordination | Persists and retrieves table/column metadata via SysCatalog (table 0) |
| Data Routing | Translates logical field requests to LogicalFieldId for ColumnStore |
| Streaming Scans | Provides ColumnStream API for paginated, predicate-pushdown reads |
The table wraps an Arc<ColumnStore> from llkv-column-map, enabling multiple table instances to share the same underlying storage. This design supports efficient metadata queries and concurrent access patterns.
Sources: llkv-table/README.md:12-40 llkv-column-map/README.md:1-61
MVCC Column Management
Every table in LLKV maintains three system columns alongside user-defined fields:
graph LR
subgraph "User Schema"
UC1["name: Utf8"]
UC2["age: Int32"]
UC3["email: Utf8"]
end
subgraph "System Columns (MVCC)"
ROW_ID["row_id: UInt64\nMonotonic identifier"]
CREATED["created_by: UInt64\nTransaction ID"]
DELETED["deleted_by: UInt64\nDeletion TXN or MAX"]
end
UC1 -.->|stored in namespace USER| COLSTORE["ColumnStore"]
UC2 -.-> COLSTORE
UC3 -.-> COLSTORE
ROW_ID -.->|namespace TXN_METADATA| COLSTORE
CREATED -.-> COLSTORE
DELETED -.-> COLSTORE
COLSTORE["ColumnStore\nLogicalFieldId\nNamespacing"]
MVCC Column Semantics
- `row_id`: A monotonically increasing `UInt64` that uniquely identifies each row within a table. Assigned during append operations and used for row-level operations and correlation.
- `created_by`: The transaction ID (`UInt64`) that created this row version. Set during `INSERT` or `UPDATE` operations.
- `deleted_by`: The transaction ID that marked this row as deleted, or `u64::MAX` if the row is still live. `UPDATE` operations logically delete old versions and insert new ones.
These columns are stored in separate logical namespaces within the ColumnStore to avoid collisions with user-defined columns. The table layer automatically injects these columns during append operations and uses them for visibility filtering during scans.
Sources: llkv-table/README.md:15-16 llkv-column-map/README.md:20-28
Data Operations
Append Operations
The Table::append method accepts an Arrow RecordBatch and performs the following steps:
graph TB
START["Table::append(RecordBatch)"]
VALIDATE["Validate Schema\nCheck column names/types"]
INJECT["Inject MVCC Columns\nrow_id, created_by, deleted_by"]
NAMESPACE["Map to LogicalFieldId\nApply namespace prefixes"]
PERSIST["ColumnStore::append\nSort by row_id\nLast-writer-wins"]
COMMIT["Pager::batch_put\nAtomic commit"]
START --> VALIDATE
VALIDATE -->|schema mismatch| ERROR["Return Error"]
VALIDATE -->|valid| INJECT
INJECT --> NAMESPACE
NAMESPACE --> PERSIST
PERSIST --> COMMIT
COMMIT --> SUCCESS["Return Ok"]
The append pipeline ensures:
- Schema consistency : All incoming batches must match the table's declared schema
- MVCC tagging : System columns are added with appropriate transaction IDs
- Ordering : Rows are sorted by `row_id` before persistence for efficient scans
- Atomicity : Multi-column writes are committed atomically via batch pager operations
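For example, an append might be driven by a batch built as shown below. Only the RecordBatch construction uses the actual arrow API; the table schema and the final `table.append(&batch)` call shape are assumptions for illustration.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> RecordBatch {
    // User schema only; row_id / created_by / deleted_by are injected by the table layer.
    let schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8, false),
        Field::new("age", DataType::Int32, true),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["ada", "grace"])),
        Arc::new(Int32Array::from(vec![Some(36), None])),
    ];
    RecordBatch::try_new(schema, columns).expect("arrays match schema")
}
// A caller would then hand the batch to the table layer, e.g. `table.append(&batch)?`
// (the exact append signature is not shown on this page).
```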
Sources: llkv-table/README.md:20-30 llkv-column-map/README.md:24-28
Scan Operations
Tables expose streaming scan APIs through the scan_stream method, which returns a ColumnStream for paginated result retrieval:
graph TB
SCAN["Table::scan_stream\n(projections, filter)"]
NORMALIZE["Normalize Predicate\nApply De Morgan's laws"]
COMPILE["Compile to EvalProgram\nStack-based bytecode"]
STREAM["Create ColumnStream\nLazy evaluation"]
FETCH["ColumnStream::next_batch\nFetch N rows"]
FILTER["Apply Predicate\nVectorized evaluation"]
MVCC["MVCC Filtering\nSnapshot visibility"]
PROJECT["Gather Projected Cols\ngather_rows()"]
BATCH["Return RecordBatch"]
SCAN --> NORMALIZE
NORMALIZE --> COMPILE
COMPILE --> STREAM
STREAM -.->|caller iterates| FETCH
FETCH --> FILTER
FILTER --> MVCC
MVCC --> PROJECT
PROJECT --> BATCH
BATCH -.->|next iteration| FETCH
The scan path supports:
- Predicate pushdown : Filters are compiled to bytecode and evaluated at the column store level
- Projection : Only requested columns are materialized
- MVCC filtering : Rows are filtered based on transaction snapshot visibility rules
- Streaming : Results are produced in fixed-size batches to avoid large memory allocations
Sources: llkv-table/README.md:23-24 llkv-column-map/README.md:30-34
Schema Validation
Schema validation occurs at table creation and during every append operation. The table layer enforces:
| Validation Check | Enforcement Point |
|---|---|
| Column names | Must match declared schema exactly (case-sensitive) |
| Data types | Must match Arrow DataType including nested types |
| Nullability | Enforced for non-nullable columns |
| Field count | Batch must contain exactly the declared columns |
Schema definitions are persisted in the system catalog (table 0) as TableMeta and ColMeta entries. The catalog stores:
- Table ID and name
- Column names, types, and nullability flags
- Constraint metadata (e.g., PRIMARY KEY, NOT NULL)
Sources: llkv-table/README.md:14-15 llkv-table/README.md:27-29
graph TB
subgraph "System Catalog (Table 0)"
TABLEMETA["TableMeta Records\ntable_id, name, schema"]
COLMETA["ColMeta Records\ntable_id, col_name, type"]
end
subgraph "User Tables (1..N)"
TBL1["Table 1\nusers"]
TBL2["Table 2\norders"]
TBL3["Table N\nproducts"]
end
TABLEMETA -->|describes| TBL1
TABLEMETA -->|describes| TBL2
TABLEMETA -->|describes| TBL3
COLMETA -->|defines columns| TBL1
COLMETA -->|defines columns| TBL2
COLMETA -->|defines columns| TBL3
SYSCAT["SysCatalog API\ncreate_table()\nget_table_meta()\nlist_tables()"]
SYSCAT -->|reads/writes| TABLEMETA
SYSCAT -->|reads/writes| COLMETA
System Catalog Integration
The SysCatalog is a special table (table ID 0) that stores metadata for all other tables:
The system catalog itself uses the same storage infrastructure as user tables:
- Stored as Arrow `RecordBatch`es in the `ColumnStore`
- Subject to MVCC versioning
- Persisted through the pager for crash consistency
This self-hosting design ensures metadata operations follow the same transactional semantics as data operations.
Sources: llkv-table/README.md:27-29 llkv-column-map/README.md:10-16
Projection and Gathering
The table layer delegates projection and row gathering to the ColumnStore, which provides specialized APIs for materializing requested columns:
Projection Structure
A Projection describes a single column to retrieve, optionally renaming it in the output schema. Projections are resolved to LogicalFieldId by consulting the catalog, then passed to the ColumnStore for gathering.
Sources: llkv-column-map/src/store/projection.rs:49-73
Null Handling Policies
The projection system supports three null-handling strategies via GatherNullPolicy:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Any missing row_id causes an error |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Rows with all-null projected columns are omitted |
These policies enable different executor semantics: INNER JOIN uses ErrorOnMissing, LEFT JOIN uses IncludeNulls, and aggregation pipelines use DropNulls to skip tombstones.
Sources: llkv-column-map/src/store/projection.rs:39-47
graph TB
PREPARE["prepare_gather_context\n(field_ids)"]
CATALOG["Load ColumnDescriptors\nfrom catalog"]
METAS["Collect ChunkMetadata\nvalue + row chunks"]
CTX["MultiGatherContext\nplans, cache, scratch"]
GATHER1["gather_rows_with_reusable_context\n(row_ids_1)"]
GATHER2["gather_rows_with_reusable_context\n(row_ids_2)"]
GATHERN["gather_rows_with_reusable_context\n(row_ids_N)"]
PREPARE --> CATALOG
CATALOG --> METAS
METAS --> CTX
CTX -.->|reuses chunk cache| GATHER1
CTX -.->|reuses chunk cache| GATHER2
CTX -.->|reuses chunk cache| GATHERN
GATHER1 --> BATCH1["RecordBatch 1"]
GATHER2 --> BATCH2["RecordBatch 2"]
GATHERN --> BATCHN["RecordBatch N"]
Multi-Column Gather Context
For queries that scan the same row set multiple times (e.g., joins, aggregations), the table layer provides MultiGatherContext to amortize fetch costs:
The context caches:
- Chunk arrays : Deserialized Arrow arrays for reuse across calls
- Row indices : Hash maps for sparse row lookups
- Scratch buffers : Pre-allocated vectors for gather operations
This optimization is critical for nested loop joins and multi-pass aggregations where the same columns are accessed repeatedly.
Sources: llkv-column-map/src/store/projection.rs:94-227 llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:516-758
graph TB
subgraph "Table Layer (llkv-table)"
TABLE["Table\nSchema + Arc<ColumnStore>"]
FIELDMAP["Field Name → LogicalFieldId\nNamespace mapping"]
end
subgraph "ColumnStore Layer (llkv-column-map)"
COLSTORE["ColumnStore\nLogicalFieldId → PhysicalKey"]
DESCRIPTOR["ColumnDescriptor\nChunk metadata lists"]
CHUNKS["Column Chunks\nSerialized Arrow arrays"]
end
subgraph "Storage Layer (llkv-storage)"
PAGER["Pager\nbatch_get/batch_put"]
MEMPAGER["MemPager"]
SIMDPAGER["SimdRDrivePager"]
end
TABLE -->|append batch| COLSTORE
TABLE -->|scan_stream| COLSTORE
TABLE -->|gather_rows field_ids| COLSTORE
FIELDMAP -.->|resolves to| COLSTORE
COLSTORE -->|maps to| DESCRIPTOR
DESCRIPTOR -->|points to| CHUNKS
COLSTORE -->|batch_get| PAGER
COLSTORE -->|batch_put| PAGER
PAGER -.->|impl| MEMPAGER
PAGER -.->|impl| SIMDPAGER
Integration with ColumnStore
The table layer wraps a ColumnStore and translates high-level operations into low-level storage calls:
Logical Field Namespacing
Each logical field in a table is assigned a LogicalFieldId that encodes:
- Namespace : `USER`, `TXN_METADATA`, or `ROWID_SHADOW`
- Table ID : `u32` identifier
- Field ID : `u32` column index
This namespacing prevents collisions between user columns and MVCC metadata while allowing them to share the same physical ColumnStore instance.
Sources: llkv-column-map/README.md:18-22 llkv-table/README.md:20-22
Zero-Copy Reads
The ColumnStore delegates to the Pager trait for physical storage access. When using SimdRDrivePager (persistent backend), reads are zero-copy: the pager returns EntryHandle wrappers that directly reference memory-mapped regions. This enables SIMD-accelerated scans without buffer allocation or copying.
Sources: llkv-storage/README.md:9-17 llkv-column-map/README.md:36-40
Usage in the Stack
The table abstraction is consumed by:
| Component | Usage |
|---|---|
| llkv-runtime | Executes all DML and DDL operations through Table APIs |
| llkv-executor | Relies on scan_stream for SELECT evaluation, joins, and aggregations |
| llkv-sql | Indirectly via llkv-runtime for SQL statement execution |
| llkv-csv | Uses Table::append for bulk CSV ingestion |
The streaming scan API (scan_stream) is particularly important for the executor, which processes query results in fixed-size batches to avoid buffering entire result sets in memory.
Sources: llkv-table/README.md:36-40 llkv-runtime/README.md:36-40 llkv-csv/README.md:14-20
Column Storage and ColumnStore
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-column-map/src/store/projection.rs
- llkv-csv/README.md
- llkv-expr/Cargo.toml
- llkv-expr/README.md
- llkv-expr/src/lib.rs
- llkv-expr/src/literal.rs
- llkv-join/README.md
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-storage/src/serialization.rs
- llkv-table/Cargo.toml
- llkv-table/README.md
Purpose and Scope
This document describes the column-oriented storage layer implemented by llkv-column-map, focusing on the ColumnStore struct that manages physical persistence of Arrow columnar data. The column store sits between the table abstraction and the pager interface, translating logical field requests into physical chunk operations.
For the higher-level table API that wraps ColumnStore, see Table Abstraction. For details on the underlying storage backends, see Pager Interface and SIMD Optimization.
Architecture Position
The ColumnStore acts as the bridge between schema-aware tables and raw key-value storage:
Sources: llkv-column-map/README.md:10-46 llkv-table/README.md:19-24
graph TB
Table["llkv-table::Table\nSchema validation\nMVCC injection"]
ColumnStore["llkv-column-map::ColumnStore\nColumn chunking\nLogicalFieldId → PhysicalKey"]
Pager["llkv-storage::Pager\nbatch_get / batch_put\nMemPager / SimdRDrivePager"]
Table -->|append RecordBatch| ColumnStore
Table -->|scan / gather| ColumnStore
ColumnStore -->|BatchGet / BatchPut| Pager
ColumnStore -->|serialized Arrow chunks| Pager
Pager -->|EntryHandle zero-copy| ColumnStore
Logical Field Identification
LogicalFieldId Structure
Each column is identified by a LogicalFieldId that encodes three components:
| Component | Bits | Purpose |
|---|---|---|
| Namespace | High bits | Segregates user data, MVCC metadata, and row-id shadows |
| Table ID | Middle bits | Identifies which table the column belongs to |
| Field ID | Low bits | Distinguishes columns within a table |
This structure prevents collisions when multiple tables share the same physical pager while maintaining clear boundaries between user data and system metadata.
Sources: llkv-column-map/README.md:19-22
Namespace Segregation
Three primary namespaces exist:
- User Data : Columns explicitly defined in `CREATE TABLE` statements
- MVCC Metadata : System columns `created_by` and `deleted_by` for transaction visibility
- Row ID Shadow : Parallel storage of `row_id` values for each column to enable efficient random access
Each namespace maps to distinct LogicalFieldId values, ensuring that MVCC bookkeeping and user data remain isolated in the catalog.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:16-17
Physical Storage Model
PhysicalKey Allocation and Mapping
The column store maintains a catalog that maps each LogicalFieldId to a PhysicalKey (u64 identifier) allocated by the pager:
graph LR
LF1["LogicalFieldId\n(ns=User, table=5, field=2)"]
LF2["LogicalFieldId\n(ns=RowId, table=5, field=2)"]
LF3["LogicalFieldId\n(ns=Mvcc, table=5, field=created_by)"]
PK1["PhysicalKey: 1024\nDescriptor"]
PK2["PhysicalKey: 1025\nData Chunks"]
PK3["PhysicalKey: 2048\nDescriptor"]
PK4["PhysicalKey: 2049\nData Chunks"]
LF1 --> PK1
LF1 --> PK2
LF2 --> PK3
LF2 --> PK4
PK1 -->|ColumnDescriptor| Pager["Pager"]
PK2 -->|Serialized Arrays| Pager
PK3 -->|ColumnDescriptor| Pager
PK4 -->|Serialized Arrays| Pager
Each logical field requires at least two physical keys: one for the column descriptor (metadata about chunks) and one or more for the actual data chunks.
Sources: llkv-column-map/README.md:19-21 llkv-column-map/src/store/projection.rs:461-468
Column Descriptors and Chunks
Column data is split into fixed-size chunks, each serialized as an Arrow array. Metadata about chunks is stored in a ColumnDescriptor:
graph TB
subgraph "Column Descriptor"
Head["head_page_pk: PhysicalKey\nPoints to descriptor chain"]
Meta1["ChunkMetadata[0]\nchunk_pk, row_count\nmin_val_u64, max_val_u64"]
Meta2["ChunkMetadata[1]\n..."]
Meta3["ChunkMetadata[n]\n..."]
Head --> Meta1
Head --> Meta2
Head --> Meta3
end
subgraph "Physical Storage"
Chunk1["PhysicalKey: chunk_pk[0]\nSerialized Arrow Array\nrow_ids: 0..999"]
Chunk2["PhysicalKey: chunk_pk[1]\nSerialized Arrow Array\nrow_ids: 1000..1999"]
end
Meta1 -.->|references| Chunk1
Meta2 -.->|references| Chunk2
The descriptor stores min/max row ID values for each chunk, enabling efficient skip-scan during queries by filtering out chunks that cannot contain requested row IDs.
Sources: llkv-column-map/src/store/projection.rs:229-236 llkv-column-map/src/store/projection.rs:760-772
Column Catalog Persistence
The catalog mapping LogicalFieldId to PhysicalKey is itself stored in the pager at a well-known root key. On initialization:
- `ColumnStore::open()` attempts to load the catalog from the pager root
- If not found, an empty catalog is initialized
- All catalog updates are committed atomically during append operations
This design ensures the catalog state remains consistent with persisted data, even after crashes.
Sources: llkv-column-map/README.md:38-40
Append Pipeline
RecordBatch Persistence Flow
Sources: llkv-column-map/README.md:24-28
Last-Writer-Wins Semantics
When appending data with row IDs that overlap existing chunks:
- The store identifies which chunks contain conflicting row IDs
- Existing chunks are deserialized and merged with new data
- For duplicate row IDs, the new value overwrites the old
- Rewritten chunks are serialized and committed atomically
This ensures that UPDATE operations (implemented as appends at the table layer) correctly overwrite previous values without requiring separate update logic.
Sources: llkv-column-map/README.md:26-27
Chunking Strategy
Columns are divided into chunks based on:
- Chunk size threshold : Configurable limit on rows per chunk (typically several thousand)
- Row ID ranges : Each chunk covers a contiguous range of row IDs
- Physical key allocation : Each chunk gets a unique physical key from the pager
This chunking enables:
- Parallel scan operations across chunks
- Efficient skip-scan by filtering chunks based on row ID predicates
- Incremental garbage collection of deleted chunks
Sources: llkv-column-map/src/store/projection.rs:760-772
Data Retrieval
Gather Operations
The column store provides two gather strategies for random-access row retrieval:
graph TB
Input["Row IDs: [5, 123, 999]"]
Input --> Sort["Sort and deduplicate"]
Sort --> Filter["Identify intersecting chunks"]
Filter --> Fetch["batch_get(chunk keys)"]
Fetch --> Deserialize["Deserialize Arrow arrays"]
Deserialize --> Gather["Gather requested rows"]
Gather --> Output["RecordBatch"]
Single-Shot Gather
For one-off queries, gather_rows() performs a complete fetch without caching:
Sources: llkv-column-map/src/store/projection.rs:245-268
Reusable Context Gather
For repeated queries (e.g., join inner loop), gather_rows_with_reusable_context() amortizes costs:
- Prepare a `MultiGatherContext` containing column descriptors and scratch buffers
- Call gather repeatedly, reusing:
  - Decoded Arrow chunk arrays (cached in `chunk_cache`)
  - Row index hash maps (preallocated buffers)
  - Scratch space for row locators
This avoids redundant descriptor fetches and chunk decodes across multiple gather calls.
Sources: llkv-column-map/src/store/projection.rs:516-758
Gather Null Policies
Three policies control null handling:
| Policy | Behavior |
|---|---|
| ErrorOnMissing | Return error if any requested row ID is not found |
| IncludeNulls | Missing rows surface as nulls in output arrays |
| DropNulls | Remove rows where all projected columns are null or missing |
The DropNulls policy is used by MVCC filtering to exclude logically deleted rows.
Sources: llkv-column-map/src/store/projection.rs:39-47
Projection Planning
The projection subsystem prepares multi-column gathers:
Each FieldPlan contains:
- value_metas : Metadata for value chunks (actual column data)
- row_metas : Metadata for row ID chunks (parallel row ID storage)
- candidate_indices : Pre-filtered list of chunks that might contain requested rows
Sources: llkv-column-map/src/store/projection.rs:448-510 llkv-column-map/src/store/projection.rs:229-236
Chunk Intersection Logic
Before fetching chunks, the store filters based on row ID range overlap:
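A sketch of that range-overlap test over per-chunk min/max metadata follows; `ChunkMeta` and `candidate_chunks` are illustrative names standing in for the ChunkMetadata entries and planner logic described earlier.

```rust
/// Illustrative chunk metadata: the min/max row IDs covered by one chunk.
struct ChunkMeta {
    min_row_id: u64,
    max_row_id: u64,
}

/// Keep only chunks whose [min, max] range overlaps the envelope of requested IDs.
/// `sorted_row_ids` is assumed sorted ascending, matching the gather pipeline.
fn candidate_chunks<'a>(chunks: &'a [ChunkMeta], sorted_row_ids: &[u64]) -> Vec<&'a ChunkMeta> {
    let (Some(&lo), Some(&hi)) = (sorted_row_ids.first(), sorted_row_ids.last()) else {
        return Vec::new();
    };
    chunks
        .iter()
        .filter(|c| c.max_row_id >= lo && c.min_row_id <= hi)
        .collect()
}
```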
This optimization prevents loading chunks that cannot possibly contain any requested rows.
Sources: llkv-column-map/src/store/projection.rs:774-794
graph TB
subgraph "Serialized Array Format"
Header["Header (24 bytes)\nMAGIC: 'ARR0'\nlayout: Primitive/Varlen/FslFloat32/Struct\ntype_code: PrimType enum\nlen: element count\nextra_a, extra_b: layout-specific"]
Payload["Payload\nRaw Arrow buffer bytes"]
Header --> Payload
end
subgraph "Deserialization (Zero-Copy)"
EntryHandle["EntryHandle from Pager\n(memory-mapped or in-memory)"]
ArrowBuffer["Arrow Buffer\nwraps EntryHandle bytes"]
ArrayData["ArrayData\nreferences Buffer directly"]
EntryHandle --> ArrowBuffer
ArrowBuffer --> ArrayData
end
Payload -.->|stored as| EntryHandle
Serialization Format
Zero-Copy Array Persistence
Column chunks are serialized using a custom format optimized for memory-mapped zero-copy reads:
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/src/serialization.rs:41-135
Layout Types
Four layout variants handle different Arrow data types:
| Layout | Use Case | Payload Structure |
|---|---|---|
| Primitive | Fixed-width primitives (Int32, Float64, etc.) | Single values buffer |
| Varlen | Variable-length (Binary, Utf8, LargeBinary) | Offsets buffer + values buffer |
| FslFloat32 | FixedSizeList | Single contiguous Float32 buffer |
| Struct | Nested struct types | Arrow IPC serialized payload |
The FslFloat32 layout is a specialized fast-path for dense vector columns, avoiding nesting overhead.
Sources: llkv-storage/src/serialization.rs:54-135
Why Not Arrow IPC?
The custom format is used instead of standard Arrow IPC for several reasons:
- Minimal headers : No schema objects or framing, reducing file size
- Predictable payloads : Each array occupies one contiguous region, ideal for mmap and SIMD
- True zero-copy : Deserialization produces `ArrayData` referencing the original mmap directly
- Stable codes : Layout and type tags are explicitly pinned with compile-time checks
The trade-off is reduced generality (e.g., no null bitmaps yet) for better scan performance in this storage engine's access patterns.
Sources: llkv-storage/src/serialization.rs:1-28 llkv-storage/README.md:10-16
Type Code Stability
The PrimType enum discriminants are compile-time pinned to prevent silent corruption:
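The pinning pattern looks roughly like the sketch below; the variants and discriminant values shown are illustrative, not the full set used by llkv-storage.

```rust
/// Illustrative subset of the persisted type codes.
#[repr(u8)]
enum PrimType {
    Int32 = 0,
    Int64 = 1,
    Float64 = 2,
    Utf8 = 3,
}

// Compile-time pins: if a discriminant drifts, the build fails instead of
// silently reinterpreting previously persisted blobs.
const _: () = assert!(PrimType::Int32 as u8 == 0);
const _: () = assert!(PrimType::Int64 as u8 == 1);
const _: () = assert!(PrimType::Float64 as u8 == 2);
const _: () = assert!(PrimType::Utf8 as u8 == 3);
```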
If any discriminant is accidentally changed, the code will fail to compile, preventing data corruption.
Sources: llkv-storage/src/serialization.rs:561-586
graph TB
subgraph "Table Responsibilities"
Schema["Schema validation"]
MVCC["MVCC column injection\n(row_id, created_by, deleted_by)"]
Catalog["System catalog updates"]
end
subgraph "ColumnStore Responsibilities"
Chunk["Column chunking"]
Map["LogicalFieldId → PhysicalKey"]
Persist["Physical persistence"]
end
Schema --> Chunk
MVCC --> Chunk
Catalog --> Map
Chunk --> Persist
Integration with Table Layer
The Table struct (from llkv-table) wraps Arc<ColumnStore> and delegates storage operations:
The table layer focuses on schema enforcement and MVCC semantics, while the column store handles physical storage details.
Sources: llkv-table/README.md:19-24 llkv-column-map/README.md:12-16
Integration with Pager Trait
The column store is generic over any Pager<Blob = EntryHandle>:
This abstraction allows the same column store code to work with both ephemeral in-memory storage (for transaction staging) and durable persistent storage (for committed data).
Sources: llkv-column-map/README.md:36-39 llkv-storage/README.md:19-22
graph TB
subgraph "User Table 'employees'"
UserCol1["LogicalFieldId\n(User, table=5, field=0)\n'name' column"]
UserCol2["LogicalFieldId\n(User, table=5, field=1)\n'age' column"]
end
subgraph "MVCC Metadata for 'employees'"
MvccCol1["LogicalFieldId\n(Mvcc, table=5, field=created_by)"]
MvccCol2["LogicalFieldId\n(Mvcc, table=5, field=deleted_by)"]
end
subgraph "Row ID Shadow for 'employees'"
RowCol1["LogicalFieldId\n(RowId, table=5, field=0)"]
RowCol2["LogicalFieldId\n(RowId, table=5, field=1)"]
end
UserCol1 & UserCol2 & MvccCol1 & MvccCol2 & RowCol1 & RowCol2 --> Store["ColumnStore"]
MVCC Column Storage
MVCC metadata columns are stored using the same column infrastructure as user data, but in a separate namespace:
This design keeps MVCC bookkeeping transparent to the column store while allowing the table layer to enforce visibility rules by querying MVCC columns.
Sources: llkv-column-map/README.md:22-23 llkv-table/README.md:32-34
Concurrency and Parallelism
Parallel Scans
The column store supports parallel scanning via Rayon:
- Chunk-level parallelism: Different chunks can be processed concurrently
- Thread pool bounded by `LLKV_MAX_THREADS` environment variable
- Lock-free reads: Descriptors and chunks are immutable once written
Sources: llkv-column-map/README.md:32-34
Catalog Locking
The catalog mapping is protected by an RwLock:
- Readers acquire shared lock during scans/gathers
- Writers acquire exclusive lock during append/create operations
- Lock contention is minimized by holding locks only during catalog lookups, not during chunk I/O
Sources: llkv-column-map/src/store/projection.rs:461-468
Performance Characteristics
Append Performance
| Operation | Complexity | Notes |
|---|---|---|
| Sequential append (no conflicts) | O(n log n) | Dominated by sorting row IDs |
| Append with overwrites | O(n log n + m log m) | m = existing rows in conflict chunks |
| Chunk serialization | O(n) | Linear in data size |
Gather Performance
| Operation | Complexity | Notes |
|---|---|---|
| Random gather (cold) | O(k log c + r) | k = chunks touched, c = total chunks, r = rows fetched |
| Random gather (hot cache) | O(r) | Chunks already decoded |
| Sequential scan | O(n) | Linear in result size |
The chunk skip-scan optimization reduces k by filtering chunks based on min/max row ID metadata.
Sources: llkv-column-map/src/store/projection.rs:774-794
Pager Interface and SIMD Optimization
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes the storage abstraction layer in LLKV, focusing on the Pager trait and its implementations. The pager provides a key-value interface for persisting and retrieving binary blobs, serving as the foundation for the columnar storage layer. This abstraction enables LLKV to support both in-memory and persistent storage backends with zero-copy, SIMD-optimized read paths.
For information about how columns are mapped to pager keys, see Column Storage and ColumnStore. For details on table-level operations that sit above the pager, see Table Abstraction.
Pager Trait Contract
The Pager trait defines the storage abstraction used throughout LLKV. It provides batch-oriented get and put operations over physical keys, enabling efficient bulk reads and atomic multi-key writes.
Core Interface
The pager trait exposes the following fundamental operations:
| Method | Purpose | Atomicity |
|---|---|---|
| batch_get | Retrieve multiple values by physical key | Read-only |
| batch_put | Write multiple key-value pairs | Atomic across all keys |
| delete | Remove entries by physical key | Atomic |
| flush | Persist pending writes to storage | Synchronous |
All write operations are atomic within a single batch, meaning either all keys are updated or none are. This guarantee is essential for maintaining consistency when the column store commits append operations that span multiple physical keys.
Pager Trait Architecture
graph TB
subgraph "Pager Trait"
TRAIT["Pager Trait\nbatch_get\nbatch_put\ndelete\nflush"]
end
subgraph "Implementations"
MEMPAGER["MemPager\nHashMap<PhysicalKey, Vec<u8>>"]
SIMDPAGER["SimdRDrivePager\nsimd_r_drive::DataStore\nZero-copy reads\nSIMD-aligned buffers"]
end
subgraph "Used By"
COLSTORE["ColumnStore\nllkv-column-map"]
CATALOG["Column Catalog\nMetadata persistence"]
end
TRAIT --> MEMPAGER
TRAIT --> SIMDPAGER
COLSTORE --> TRAIT
CATALOG --> TRAIT
Sources: llkv-storage/README.md:12-22
Physical Keys and Entry Handles
The pager operates on a flat key space using 64-bit physical keys (PhysicalKey). These keys are opaque identifiers allocated by the system and maintained in the column store's catalog.
Key-Value Model
Physical Key Space Model
The separation between logical fields and physical keys allows the column store to maintain multiple physical chunks per logical field (e.g., data chunks, row ID indices, descriptors) while presenting a unified logical interface to higher layers.
Sources: llkv-column-map/README.md:18-22
Batch Operations and Performance
The pager interface is batch-oriented to minimize round trips to the underlying storage medium. This design is particularly important for:
- Column scans : Fetching multiple column chunks in a single operation reduces latency
- Append operations : Writing descriptor, data, and row ID chunks atomically
- Catalog updates : Persisting metadata changes alongside data changes
Batch Get Semantics
The batch_get method returns EntryHandle objects that provide access to the underlying bytes. For SIMD-optimized pagers, these handles offer direct memory access without copying.
Batch Put Semantics
Batch put operations accept multiple key-value pairs and guarantee that either all writes succeed or none do. This atomicity is critical for maintaining consistency when appending records that span multiple physical keys.
| Stage | Operation | Atomicity Requirement |
|---|---|---|
| Prepare | Allocate new physical keys | N/A (local) |
| Write | Serialize Arrow data to bytes | N/A (in-memory) |
| Commit | batch_put(keys, values) | Atomic |
| Catalog | Update logical-to-physical mapping | Atomic |
Sources: llkv-storage/README.md:12-16 llkv-column-map/README.md:24-28
MemPager Implementation
MemPager provides an in-memory, heap-backed implementation of the pager trait. It is used for:
- Testing and development
- Staging contexts during explicit transactions
- Temporary namespaces for intermediate query results
Architecture
MemPager Internal Structure
The in-memory implementation uses a simple HashMap for storage and an atomic counter for key allocation. Batch operations are implemented as sequential single-key operations with no special optimization, since memory latency is already minimal.
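A minimal sketch of that structure using std collections is shown below (the real implementation uses FxHashMap and returns EntryHandle blobs; the struct and method names here are illustrative).

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

type PhysicalKey = u64;

/// Heap-backed pager sketch: a guarded map plus an atomic key allocator.
struct MemPagerSketch {
    entries: RwLock<HashMap<PhysicalKey, Arc<Vec<u8>>>>,
    next_key: AtomicU64,
}

impl MemPagerSketch {
    fn alloc_key(&self) -> PhysicalKey {
        self.next_key.fetch_add(1, Ordering::Relaxed)
    }

    fn batch_put(&self, writes: Vec<(PhysicalKey, Vec<u8>)>) {
        // Taking the write lock once makes the whole batch visible atomically
        // with respect to concurrent readers of this map.
        let mut map = self.entries.write().unwrap();
        for (key, blob) in writes {
            map.insert(key, Arc::new(blob));
        }
    }

    fn batch_get(&self, keys: &[PhysicalKey]) -> Vec<Option<Arc<Vec<u8>>>> {
        let map = self.entries.read().unwrap();
        keys.iter().map(|k| map.get(k).cloned()).collect()
    }
}
```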
Use in Dual-Context Transactions
During explicit transactions, the runtime maintains two pager contexts:
- Persistent pager : `SimdRDrivePager` backed by disk
- Staging pager : `MemPager` for transaction-local tables
Operations on tables created within a transaction are buffered in the staging pager until commit, at which point they are replayed into the persistent pager.
Sources: llkv-storage/README.md:20-21 llkv-runtime/README.md:27-32
SimdRDrivePager and SIMD Optimization
SimdRDrivePager wraps the simd_r_drive::DataStore from the simd-r-drive crate, providing persistent, memory-mapped storage with SIMD-aligned buffers for zero-copy reads.
graph TB
subgraph "Application"
COLSTORE["ColumnStore\nArrow deserialization"]
end
subgraph "SimdRDrivePager"
DATASTORE["simd_r_drive::DataStore"]
ENTRYHANDLE["EntryHandle\nPointer into mmap region"]
end
subgraph "Operating System"
MMAP["Memory-mapped file\nPage cache"]
end
subgraph "Disk"
FILE["Persistent file\nSIMD-aligned blocks"]
end
COLSTORE -->|batch_get keys| DATASTORE
DATASTORE -->|Direct pointer| ENTRYHANDLE
ENTRYHANDLE -->|Zero-copy access| MMAP
MMAP -.->|Page fault| FILE
COLSTORE -->|Arrow::read_from_bytes| ENTRYHANDLE
Zero-Copy Read Path
Traditional storage layers copy data from disk buffers into application memory. SIMD-optimized pagers eliminate this copy by memory-mapping files and returning direct pointers into the mapped region.
Zero-Copy Read Architecture
The EntryHandle returned by batch_get provides a view into the memory-mapped region without allocating or copying. Arrow's serialization format can be read directly from these buffers, enabling efficient deserialization.
SIMD Alignment Benefits
The simd-r-drive crate aligns data blocks on SIMD-friendly boundaries (typically 32 or 64 bytes). This alignment enables:
- Vectorized operations : Arrow kernels can use SIMD instructions without unaligned memory penalties
- Cache efficiency : Aligned blocks reduce cache line splits
- Hardware prefetch : Aligned access patterns improve CPU prefetcher accuracy
| Operation | Non-aligned | SIMD-aligned | Speedup |
|---|---|---|---|
| Integer scan | 120 ns/row | 45 ns/row | 2.7x |
| Predicate filter | 180 ns/row | 70 ns/row | 2.6x |
| Column deserialization | 95 ns/row | 35 ns/row | 2.7x |
Note: Benchmarks are approximate and depend on workload and hardware
graph LR
subgraph "File Structure"
HEADER["File Header\nMagic + version"]
META["Metadata Block\nKey index"]
DATA1["Data Block 1\nSIMD-aligned"]
DATA2["Data Block 2\nSIMD-aligned"]
DATA3["Data Block 3\nSIMD-aligned"]
end
HEADER --> META
META --> DATA1
DATA1 --> DATA2
DATA2 --> DATA3
Persistent Storage Layout
The simd_r_drive::DataStore manages a persistent file with the following structure:
Each data block is aligned on a SIMD boundary and can be memory-mapped directly into the application's address space. The metadata block maintains an index from physical keys to file offsets, enabling efficient random access.
Sources: llkv-storage/README.md:21-22 Cargo.toml:26-27
Integration with Column Store
The column store (ColumnStore from llkv-column-map) is the primary consumer of the pager interface. It manages the mapping from logical fields to physical keys and orchestrates reads and writes through the pager.
Append Operation Flow
Append Operation Through Pager
The column store batches writes for descriptor, data, and row ID chunks into a single batch_put call, ensuring that partial writes cannot corrupt the store if a crash occurs mid-append.
Scan Operation Flow
Scan Operation Through Pager
The zero-copy path is critical for scan performance: by avoiding buffer copies, the system can process Arrow data directly from memory-mapped storage, reducing CPU overhead and memory pressure.
Sources: llkv-column-map/README.md:20-40 llkv-storage/README.md:25-28
Atomic Guarantees and Crash Consistency
The pager's atomic batch operations provide the foundation for crash consistency throughout the stack. When a batch_put operation is called:
- All writes are staged in memory
- The storage backend performs an atomic commit (e.g., fsync on a transaction log)
- Only after successful commit does the operation return success
- If any write fails, all writes in the batch are rolled back
This guarantee enables the column store to maintain invariants such as:
- Column descriptors are always paired with their data chunks
- Row ID indices are never orphaned from their column data
- Catalog updates are atomic with the data they describe
Transaction Coordinator Integration
The pager's atomicity complements the MVCC transaction system:
| Layer | Responsibility | Atomicity Mechanism |
|---|---|---|
| TxnIdManager | Allocate transaction IDs | Atomic counter |
| ColumnStore | Persist MVCC columns | Pager batch_put |
| Pager | Commit physical writes | Backend-specific (fsync, etc.) |
Runtime | Coordinate commits | Snapshot + replay |
By separating concerns, each layer can focus on its specific atomicity requirements while building on the guarantees of lower layers.
Sources: llkv-storage/README.md:15-16 llkv-column-map/README.md:25-28
Performance Characteristics
The pager implementations exhibit distinct performance profiles:
MemPager
| Operation | Complexity | Typical Latency |
|---|---|---|
| Single get | O(1) | 10-20 ns |
| Batch get (n keys) | O(n) | 50 ns + 10 ns/key |
| Single put | O(1) | 20-30 ns |
| Batch put (n keys) | O(n) | 100 ns + 20 ns/key |
All operations are purely in-memory with HashMap overhead. No I/O occurs.
SimdRDrivePager
| Operation | Complexity | Typical Latency (warm cache) | Typical Latency (cold) |
|---|---|---|---|
| Single get | O(1) | 50-100 ns | 5-10 μs |
| Batch get (n keys) | O(n) | 200 ns + 50 ns/key | 20 μs + 5 μs/key |
| Single put | O(1) | 200-500 ns | 10-50 μs |
| Batch put (n keys) | O(n) | 1 μs + 500 ns/key | 50 μs + 10 μs/key |
| Flush/sync | O(dirty pages) | N/A | 100 μs - 10 ms |
graph LR
subgraph "Single-key Operations"
REQ1["Request 1\nRound trip: 50 ns"]
REQ2["Request 2\nRound trip: 50 ns"]
REQ3["Request 3\nRound trip: 50 ns"]
REQ4["Request 4\nRound trip: 50 ns"]
TOTAL1["Total: 200 ns"]
end
subgraph "Batch Operation"
BATCH["Batch Request\n[key1, key2, key3, key4]"]
ROUNDTRIP["Single round trip: 50 ns"]
PROCESS["Process 4 keys: 40 ns"]
TOTAL2["Total: 90 ns"]
end
REQ1 --> REQ2
REQ2 --> REQ3
REQ3 --> REQ4
REQ4 --> TOTAL1
BATCH --> ROUNDTRIP
ROUNDTRIP --> PROCESS
PROCESS --> TOTAL2
Cold-cache latencies depend on disk I/O and page faults. Warm-cache operations benefit from memory-mapping and avoid deserialization overhead due to zero-copy access.
Batch Operation Advantages
Batching reduces overhead by amortizing round-trip latency across multiple keys. For SIMD-optimized pagers, batch operations can also leverage prefetching and vectorized processing.
Sources: llkv-storage/README.md:28-29
Summary
The pager abstraction provides a flexible, high-performance foundation for LLKV's columnar storage layer:
- `Pager` trait: Defines batch-oriented get/put/delete interface with atomic guarantees
- `MemPager`: In-memory implementation for testing and staging contexts
- `SimdRDrivePager`: Persistent implementation with zero-copy reads and SIMD alignment
- Integration : Column store uses pager for all physical storage operations
- Atomicity : Batch operations ensure crash consistency across multi-key updates
The combination of zero-copy reads, SIMD-aligned buffers, and batch operations enables LLKV to achieve competitive performance on analytical workloads while maintaining strong consistency guarantees.
Sources: llkv-storage/README.md:1-44 Cargo.toml:26-27 llkv-column-map/README.md:36-40
Catalog and Metadata Management
Relevant source files
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/sql_engine.rs
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes LLKV's metadata management infrastructure, including how table schemas, column definitions, and type information are persisted and accessed throughout the system. The catalog serves as the authoritative source for all schema information and coordinates with the storage layer to ensure crash consistency for metadata changes.
For details on specific catalog APIs, see CatalogManager API. For information on how metadata is physically stored, see System Catalog and SysCatalog. For type alias management, see Custom Types and Type Registry.
System Catalog Architecture
LLKV implements a self-hosting catalog where metadata is stored as regular data within the system. The system catalog, referred to as SysCatalog, is physically stored as table 0 and uses the same Arrow-based columnar storage infrastructure as user tables. This design provides several advantages:
- Crash consistency : Metadata changes use the same transactional append path as data, ensuring atomic schema modifications.
- MVCC for metadata : Schema changes are versioned alongside data using the same created_by and deleted_by columns.
- Unified storage : No special-case persistence logic is required for metadata versus data.
- Bootstrap simplicity : The catalog table itself can be opened using minimal hardcoded schema information.
Sources : llkv-table/README.md:27-29 llkv-runtime/README.md:37-40
graph TB
subgraph "SQL Layer"
SQLENG["SqlEngine"]
end
subgraph "Runtime Layer"
RUNTIME["RuntimeEngine"]
RTCONTEXT["RuntimeContext"]
end
subgraph "Catalog Layer"
CATMGR["CatalogManager"]
SYSCAT["SysCatalog\n(Table 0)"]
TYPEREG["TypeRegistry"]
RESOLVER["IdentifierResolver"]
end
subgraph "Table Layer"
TABLE["Table"]
TABLEMETA["TableMeta"]
COLMETA["ColMeta"]
end
subgraph "Storage Layer"
COLSTORE["ColumnStore"]
PAGER["Pager"]
end
SQLENG --> RUNTIME
RUNTIME --> RTCONTEXT
RTCONTEXT --> CATMGR
CATMGR --> SYSCAT
CATMGR --> TYPEREG
CATMGR --> RESOLVER
SYSCAT --> TABLE
TABLE --> COLSTORE
COLSTORE --> PAGER
TABLEMETA -.stored in.-> SYSCAT
COLMETA -.stored in.-> SYSCAT
RESOLVER -.queries.-> CATMGR
RUNTIME -.queries.-> RESOLVER
Metadata Storage Model
The catalog stores two primary metadata types as Arrow RecordBatches within table 0:
TableMeta Structure
TableMeta records describe each table's schema and properties:
- table_id : Unique identifier (u32)
- namespace_id : Namespace the table belongs to (u32)
- table_name : User-visible name (String)
- schema : Serialized Arrow Schema describing columns and types
- row_count : Approximate row count for query planning
- created_at : Timestamp of table creation
ColMeta Structure
ColMeta records describe individual columns within tables:
- table_id : Parent table reference (u32)
- field_id : Column identifier within the table (u32)
- field_name : Column name (String)
- data_type : Arrow DataType serialization
- nullable : Whether NULL values are permitted (bool)
- metadata : Key-value pairs for extended properties
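As a rough illustration, the two record types could be modeled as plain Rust structs whose fields mirror the lists above; the actual definitions in llkv-table may differ:

```rust
use std::collections::HashMap;

// Illustrative structs only; field names follow the lists above, not the
// actual llkv-table definitions.
struct TableMeta {
    table_id: u32,
    namespace_id: u32,
    table_name: String,
    schema: Vec<u8>, // serialized Arrow Schema
    row_count: u64,  // approximate, used for planning
    created_at: i64, // creation timestamp
}

struct ColMeta {
    table_id: u32,
    field_id: u32,
    field_name: String,
    data_type: String, // serialized Arrow DataType
    nullable: bool,
    metadata: HashMap<String, String>, // extended properties
}
```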
graph LR
subgraph "Logical Metadata Model"
USERTABLE["User Table\nemployees"]
TABLEMETA["TableMeta\ntable_id=5\nname='employees'\nschema=..."]
COLMETA1["ColMeta\ntable_id=5\nfield_id=0\nname='id'\ntype=Int32"]
COLMETA2["ColMeta\ntable_id=5\nfield_id=1\nname='name'\ntype=Utf8"]
end
subgraph "Physical Storage"
SYSCATTABLE["SysCatalog Table 0"]
RECORDBATCH["RecordBatch\nwith MVCC columns"]
COLUMNCHUNKS["Column Chunks\nin ColumnStore"]
end
USERTABLE --> TABLEMETA
USERTABLE --> COLMETA1
USERTABLE --> COLMETA2
TABLEMETA --> RECORDBATCH
COLMETA1 --> RECORDBATCH
COLMETA2 --> RECORDBATCH
RECORDBATCH --> SYSCATTABLE
SYSCATTABLE --> COLUMNCHUNKS
Both metadata types include MVCC columns (row_id, created_by, deleted_by) to support transactional schema changes and time-travel queries over metadata history.
Sources : llkv-table/README.md:27-29 llkv-column-map/README.md:13-16
Catalog Manager
The CatalogManager provides the high-level API for catalog operations and coordinates between the SQL layer, runtime, and storage. Key responsibilities include:
- Table lifecycle : Create, drop, rename, and truncate operations
- Schema queries : Resolve table names to table IDs and field names to field IDs
- Type management : Register and resolve custom type aliases
- Namespace isolation : Maintain separate table namespaces for user data and temporary objects
- Identifier resolution : Translate qualified names (schema.table.column) into physical identifiers
graph TB
subgraph "Catalog Manager Responsibilities"
LIFECYCLE["Table Lifecycle\ncreate/drop/rename"]
SCHEMAQUERY["Schema Queries\nname→id resolution"]
TYPEMGMT["Type Management\ncustom types/aliases"]
NAMESPACES["Namespace Isolation\nuser vs temporary"]
end
subgraph "Core Components"
CATMGR["CatalogManager"]
CACHE["In-Memory Cache\nTableMeta/ColMeta"]
TYPEREG["TypeRegistry"]
end
subgraph "Persistence"
SYSCAT["SysCatalog"]
APPENDPATH["Arrow Append Path"]
end
LIFECYCLE --> CATMGR
SCHEMAQUERY --> CATMGR
TYPEMGMT --> CATMGR
NAMESPACES --> CATMGR
CATMGR --> CACHE
CATMGR --> TYPEREG
CATMGR --> SYSCAT
SYSCAT --> APPENDPATH
The manager maintains an in-memory cache of metadata loaded from table 0 on startup and synchronizes changes back through the standard table append path.
Sources : llkv-runtime/README.md:37-40
Identifier Resolution
LLKV uses a multi-stage identifier resolution process to translate SQL names into physical storage keys:
Resolution Pipeline
- String names (Expr<String>): SQL parser produces expressions with bare column names
- Qualified resolution (IdentifierResolver): Resolve names to specific tables considering scope and aliases
- Field IDs (Expr<FieldId>): Convert to numeric field identifiers for execution
- Logical field IDs (LogicalFieldId): Add namespace and table context for storage lookup
- Physical keys (PhysicalKey): Map to actual pager keys for column chunks
Sources : llkv-table/README.md:36-40 llkv-sql/src/sql_engine.rs:36
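The staged identifiers can be pictured as progressively richer newtypes, as in the sketch below (hypothetical definitions; the concrete types in llkv-expr, llkv-table, and llkv-storage carry additional information):

```rust
// Illustrative newtypes for the resolution stages; not the actual LLKV types.
struct FieldId(u32);
struct TableId(u32);
struct NamespaceId(u32);

/// Namespace- and table-qualified column identity used for storage lookups.
struct LogicalFieldId {
    namespace: NamespaceId,
    table: TableId,
    field: FieldId,
}

/// Key under which the pager stores a column chunk.
struct PhysicalKey(u64);

// Resolution proceeds name -> FieldId -> LogicalFieldId -> PhysicalKey, with
// each stage consulting catalog state produced by the previous one.
```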
graph LR
SQL["SQL String\n'SELECT name\nFROM users'"]
EXPRSTR["Expr<String>\nfield='name'"]
RESOLUTION["IdentifierResolver\ncontext + scope"]
EXPRFID["Expr<FieldId>\ntable_id=5\nfield_id=1"]
LOGICALFID["LogicalFieldId\nnamespace=0\ntable=5\nfield=1"]
PHYSKEY["PhysicalKey\nkey=0x1234"]
SQL --> EXPRSTR
EXPRSTR --> RESOLUTION
RESOLUTION --> EXPRFID
EXPRFID --> LOGICALFID
LOGICALFID --> PHYSKEY
Identifier Context
The IdentifierContext structure tracks available tables and columns within a query scope:
- Tracks visible tables and their aliases
- Maintains column availability for each table
- Handles nested contexts for subqueries
- Supports correlated column references across scope boundaries
The IdentifierResolver consults the catalog manager to build these contexts during query planning.
Sources : llkv-sql/src/sql_engine.rs:36
Catalog Operations
CREATE TABLE Flow
When a CREATE TABLE statement executes, the following sequence occurs:
Sources : llkv-runtime/README.md:13-18 llkv-table/README.md:27-29
DROP TABLE Flow
Table deletion is implemented as a soft delete using MVCC:
- Mark the TableMeta row as deleted by setting deleted_by to the current transaction ID
- Mark all associated ColMeta rows as deleted
- The table's data remains physically present but invisible to queries observing later snapshots
- Background garbage collection can eventually reclaim space from dropped tables
This approach ensures that in-flight transactions using earlier snapshots can still access the table definition.
Sources : llkv-table/README.md:32-34
Type Registry
The TypeRegistry manages custom type aliases created with CREATE DOMAIN (or CREATE TYPE in DuckDB dialect):
Type Alias Storage
- Type definitions are stored alongside other metadata in the catalog
- Aliases map user-defined names to base Arrow DataType instances
- Type resolution occurs during expression planning and column definition
- Nested type references are recursively resolved
Type Resolution Process
When a column is defined with a custom type:
- Parser produces type name as string
- TypeRegistry resolves name to base DataType
- Column is stored with resolved base type
- Type alias is preserved in ColMeta metadata for introspection
Sources : llkv-sql/src/sql_engine.rs:639-657
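A minimal sketch of an alias registry along these lines, assuming a simple name-to-DataType map (the real TypeRegistry may store richer definitions):

```rust
use std::collections::HashMap;
use arrow::datatypes::DataType;

// Hypothetical alias registry; illustrates the resolution step only.
struct TypeRegistry {
    aliases: HashMap<String, DataType>,
}

impl TypeRegistry {
    /// Record a CREATE DOMAIN / CREATE TYPE alias against a base Arrow type.
    fn register(&mut self, name: &str, base: DataType) {
        self.aliases.insert(name.to_ascii_lowercase(), base);
    }

    /// Resolve a (possibly aliased) type name to its base Arrow DataType.
    fn resolve(&self, name: &str) -> Option<DataType> {
        self.aliases.get(&name.to_ascii_lowercase()).cloned()
    }
}
```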
Namespace Management
LLKV supports multiple namespaces to isolate different categories of tables:
| Namespace ID | Purpose | Lifetime | Storage |
|---|---|---|---|
| 0 (default) | User tables | Persistent | Main pager |
| 1 (temporary) | Temporary tables, staging | Transaction scope | MemPager |
| 2+ (custom) | Reserved for future use | Varies | Configurable |
graph TB
subgraph "Persistent Namespace (0)"
USERTBL1["users table"]
USERTBL2["orders table"]
SYSCAT["SysCatalog\n(table 0)"]
end
subgraph "Temporary Namespace (1)"
TEMPTBL1["#temp_results"]
TEMPTBL2["#staging_data"]
end
subgraph "Storage Backends"
MAINPAGER["BoxedPager\n(persistent)"]
MEMPAGER["MemPager\n(in-memory)"]
end
USERTBL1 --> MAINPAGER
USERTBL2 --> MAINPAGER
SYSCAT --> MAINPAGER
TEMPTBL1 --> MEMPAGER
TEMPTBL2 --> MEMPAGER
The TEMPORARY_NAMESPACE_ID constant identifies ephemeral tables created within transactions that should not persist beyond commit or rollback.
Sources : llkv-runtime/README.md:26-32 llkv-sql/src/sql_engine.rs:26
Catalog Bootstrap
The system catalog faces a bootstrapping challenge: table 0 stores metadata for all tables, including itself. LLKV solves this with a two-phase initialization:
Phase 1: Hardcoded Schema
On first startup, the ColumnStore initializes with an empty catalog. When the runtime creates table 0, it uses a hardcoded schema definition for SysCatalog that includes the minimal fields needed to store TableMeta and ColMeta:
- table_id (UInt32)
- table_name (Utf8)
- field_id (UInt32)
- field_name (Utf8)
- data_type (Utf8, serialized)
- Standard MVCC columns
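For illustration, a hardcoded bootstrap schema covering the fields above might be assembled with the arrow crate as follows. This is a sketch under the assumption that the MVCC columns use UInt64, not the actual schema in the LLKV source:

```rust
use arrow::datatypes::{DataType, Field, Schema};

// Sketch of a hardcoded schema for table 0; illustrative only.
fn bootstrap_catalog_schema() -> Schema {
    Schema::new(vec![
        Field::new("table_id", DataType::UInt32, false),
        Field::new("table_name", DataType::Utf8, false),
        Field::new("field_id", DataType::UInt32, false),
        Field::new("field_name", DataType::Utf8, false),
        Field::new("data_type", DataType::Utf8, false), // serialized type
        // Standard MVCC columns
        Field::new("row_id", DataType::UInt64, false),
        Field::new("created_by", DataType::UInt64, false),
        Field::new("deleted_by", DataType::UInt64, true),
    ])
}
```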
Phase 2: Self-Description
Once table 0 exists, the runtime appends metadata describing table 0 itself into table 0. Subsequent startups load the catalog by scanning table 0 using the hardcoded schema, then validate that the self-description matches.
This bootstrap approach ensures that:
- No external metadata files are required
- Catalog schema can evolve through standard migration paths
- The system remains self-contained within a single pager instance
Sources : llkv-column-map/README.md:36-40
Integration with Storage Layer
The catalog leverages the same storage infrastructure as user data:
Column Store Interaction
- LogicalFieldId encodes (namespace_id, table_id, field_id) to uniquely identify columns across all tables
- The ColumnStore maintains a mapping from LogicalFieldId to PhysicalKey
- Catalog queries fetch metadata by scanning table 0 using standard ColumnStream APIs
- Metadata mutations append RecordBatches through ColumnStore::append, ensuring ACID properties
MVCC for Metadata
Schema changes are transactional:
- CREATE TABLE within a transaction remains invisible to other transactions until commit
- DROP TABLE marks metadata as deleted without immediate physical removal
- Concurrent transactions see consistent snapshots of the schema based on their transaction IDs
- Schema conflicts (e.g., duplicate table names) are detected during commit watermark advancement
Sources : llkv-column-map/README.md:19-29 llkv-table/README.md:32-34
Catalog Consistency
Several mechanisms ensure catalog consistency across failures and concurrent access:
Atomic Metadata Updates
All catalog changes (create, drop, alter) execute as atomic append operations. The ColumnStore::append method ensures either all metadata rows are written or none are, preventing partial schema states.
Conflict Detection
On transaction commit, the runtime validates that:
- No conflicting table names exist in the target namespace
- Referenced tables for foreign keys still exist
- Column types remain compatible with constraints
If conflicts are detected, the commit fails and the transaction rolls back, discarding staged metadata.
Recovery After Crash
Since metadata uses the same MVCC append path as data:
- Uncommitted metadata changes (transactions that never committed) remain invisible
- The catalog reflects the last successfully committed snapshot
- No separate recovery log or checkpoint is required for metadata
Sources : llkv-runtime/README.md:20-24
Performance Considerations
Metadata Caching
The CatalogManager caches frequently accessed metadata in memory:
- Table name → table ID mappings
- Table ID → schema mappings
- Field name → field ID mappings per table
- Custom type definitions
Cache invalidation occurs on:
- Explicit DDL operations (CREATE, DROP, ALTER)
- Transaction commit with staged schema changes
- Cross-session schema modifications (future: requires catalog versioning)
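A rough sketch of such a cache, with hypothetical field names rather than the actual CatalogManager internals:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::datatypes::Schema;

// Illustrative cache layout; invalidation drops everything and the maps are
// rebuilt lazily from table 0 on the next lookup.
struct CatalogCache {
    table_ids_by_name: HashMap<String, u32>,
    schemas_by_table_id: HashMap<u32, Arc<Schema>>,
    field_ids_by_name: HashMap<(u32, String), u32>, // (table_id, column name)
}

impl CatalogCache {
    fn invalidate(&mut self) {
        self.table_ids_by_name.clear();
        self.schemas_by_table_id.clear();
        self.field_ids_by_name.clear();
    }
}
```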
Scan Optimization
Metadata scans leverage the same optimizations as user data:
- Predicate pushdown to filter by table_id or field_id
- Projection to fetch only required columns
- MVCC filtering to skip deleted entries
For common operations like "lookup table by name", the catalog manager maintains auxiliary indexes in memory to avoid full scans.
Sources : llkv-table/README.md:23-24
CatalogManager API
Relevant source files
- README.md
- demos/llkv-sql-pong-demo/src/main.rs
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-sql/src/tpch.rs
- llkv-storage/README.md
- llkv-table/README.md
- llkv-tpch/.gitignore
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT-PRE-FINAL.md
Purpose and Scope
The CatalogManager is responsible for managing table lifecycle operations (CREATE, ALTER, DROP) and coordinating metadata storage through the system catalog. It serves as the primary interface between DDL statements and the underlying storage layer, ensuring that schema changes are transactional, MVCC-compliant, and crash-consistent.
This document covers the CatalogManager's role in orchestrating table creation, schema validation, and metadata persistence. For details about the underlying storage mechanism, see System Catalog and SysCatalog. For information about custom type definitions, see Custom Types and Type Registry. For the table-level data operations API, see Table Abstraction.
Sources: llkv-table/README.md:1-57 llkv-runtime/README.md:1-63
Architectural Overview
CatalogManager in the Runtime Layer
CatalogManager Coordination Flow
The CatalogManager orchestrates DDL operations by validating schemas, injecting MVCC metadata, and coordinating with the dual-context transaction model.
Sources: llkv-runtime/README.md:1-63 README.md:43-73
Key Responsibilities
The CatalogManager handles the following responsibilities within the LLKV runtime:
| Responsibility | Description |
|---|---|
| Schema Validation | Validates Arrow schema definitions, checks for duplicate names, ensures data type compatibility |
| MVCC Integration | Injects row_id, created_by, deleted_by columns into all table definitions |
| Metadata Persistence | Stores TableMeta and ColMeta entries in SysCatalog (table 0) using Arrow append operations |
| Transaction Coordination | Manages dual-context execution (persistent + staging) for transactional DDL |
| Conflict Detection | Checks for concurrent schema changes during commit and aborts on conflicts |
| Visibility Control | Ensures snapshot isolation for catalog queries based on transaction context |
Sources: llkv-table/README.md:13-30 llkv-runtime/README.md:12-17
Table Lifecycle Management
CREATE TABLE Flow
Table Creation Sequence
The CREATE TABLE flow validates schemas, injects MVCC columns, and either commits immediately (auto-commit) or stages definitions for later replay (explicit transactions).
Sources: llkv-runtime/README.md:20-32 llkv-table/README.md:25-30
Dual-Context Catalog Management
The CatalogManager maintains two separate contexts during explicit transactions:
Persistent Context
- Backing Storage : BoxedPager (typically SimdRDrivePager for persistent storage)
- Contains : All committed table definitions from previous transactions
- Visibility : Tables visible to all transactions with appropriate snapshot isolation
- Lifetime : Survives process restarts and crashes
Staging Context
- Backing Storage : MemPager (in-memory hash map)
- Contains : Table definitions created within the current transaction
- Visibility : Only visible within the creating transaction
- Lifetime : Discarded on rollback, replayed to persistent context on commit
Dual-Context Transaction Model
CREATE TABLE operations during explicit transactions stage in MemPager and merge into persistent storage on commit.
Sources: llkv-runtime/README.md:26-32 README.md:64-71
ALTER and DROP Operations
ALTER TABLE
The CatalogManager handles schema alterations by:
- Validating the requested change against existing data
- Updating ColMeta entries in SysCatalog
- Tagging the change with the current transaction ID
- Maintaining snapshot isolation so concurrent readers see consistent schemas
DROP TABLE
Table deletion follows MVCC semantics:
- Marks the table's TableMeta entry with deleted_by = current_txn_id
- Table remains visible to transactions with earlier snapshots
- New transactions cannot see the dropped table
- Physical cleanup is implementation-dependent and may occur during compaction
Sources: llkv-table/README.md:25-30
Metadata Structure
TableMeta and ColMeta
The CatalogManager persists two types of metadata entries in SysCatalog:
| Metadata Type | Fields | Purpose |
|---|---|---|
| TableMeta | table_id, table_name, namespace, created_by, deleted_by | Describes table existence and lifecycle |
| ColMeta | table_id, col_id, col_name, data_type, is_mvcc, created_by, deleted_by | Describes individual column definitions |
graph LR
subgraph "SysCatalog (Table 0)"
TABLEMETA["TableMeta\nrow_id / table_id / name / created_by / deleted_by"]
COLMETA["ColMeta\nrow_id / table_id / col_id / name / type / created_by / deleted_by"]
end
subgraph "User Tables"
TABLE1["user_table_1\nSchema from ColMeta"]
TABLE2["user_table_2\nSchema from ColMeta"]
end
TABLEMETA --> TABLE1
TABLEMETA --> TABLE2
COLMETA --> TABLE1
COLMETA --> TABLE2
Both metadata types use MVCC columns (created_by, deleted_by) to enable snapshot isolation for catalog queries.
Metadata to Table Mapping
The CatalogManager queries SysCatalog to resolve table names and reconstruct Arrow schemas for query execution.
Sources: llkv-table/README.md:25-30 llkv-column-map/README.md:18-23
API Surface
Table Creation
The CatalogManager exposes table creation through the runtime layer:
- Input : Table name (string), Arrow Schema definition, optional namespace
- Validation Steps :
- Check for duplicate table names within namespace
- Validate column names are unique
- Ensure data types are supported
- Verify constraints (PRIMARY KEY uniqueness)
- MVCC Injection : Automatically adds row_id (UInt64), created_by (UInt64), deleted_by (UInt64) columns
- Output : TableId identifier for subsequent operations
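The entry point might be shaped roughly like the trait below; the parameter list and names are assumptions for illustration, not the actual CatalogManager signature:

```rust
use std::sync::Arc;
use arrow::datatypes::Schema;

// Hypothetical identifiers and trait; illustrative only.
pub type TableId = u32;
pub type NamespaceId = u32;

pub trait CatalogOps {
    type Error;

    /// Validate the schema, inject the MVCC columns, persist TableMeta and
    /// ColMeta entries, and return the identifier of the new table.
    fn create_table(
        &self,
        namespace: NamespaceId,
        name: &str,
        schema: Arc<Schema>,
    ) -> Result<TableId, Self::Error>;
}
```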
Table Lookup
The CatalogManager provides catalog query operations:
- By Name : Resolve table name to TableId within a namespace
- By ID : Retrieve TableMeta and ColMeta for a given TableId
- Visibility Filtering : Apply transaction snapshot to filter dropped tables
- Schema Reconstruction : Build Arrow Schema from ColMeta entries
Schema Validation
Validation operations performed by CatalogManager:
- Column Uniqueness : Ensure no duplicate column names within a table
- Type Compatibility : Verify data types are supported by Arrow and the storage layer
- Constraint Validation : Check PRIMARY KEY, FOREIGN KEY, NOT NULL constraints
- Naming Conventions : Enforce reserved column name restrictions (e.g., row_id)
Sources: llkv-table/README.md:13-18 llkv-runtime/README.md:12-17
Transaction Coordination
Snapshot Isolation for DDL
DDL Snapshot Isolation
Transactions see a consistent catalog snapshot; tables created by T1 are not visible to T2 until T1 commits.
Sources: llkv-runtime/README.md:20-24 README.md:64-71
Conflict Detection
On commit, the CatalogManager checks for conflicting operations:
| Conflict Type | Detection Method | Resolution |
|---|---|---|
| Duplicate CREATE | Query SysCatalog for tables created after snapshot timestamp | Abort transaction |
| Concurrent DROP | Check if table's deleted_by was set by another transaction | Abort transaction |
| Schema Mismatch | Compare staged schema against current persistent schema | Abort transaction |
Conflict detection ensures serializable DDL semantics despite optimistic concurrency control.
Sources: llkv-runtime/README.md:20-32
Integration with Runtime Components
RuntimeContext Coordination
The CatalogManager coordinates with RuntimeContext for:
- Transaction Snapshots : Obtains current snapshot from TransactionSnapshot for visibility filtering
- Transaction ID Allocation : Requests new transaction IDs from TxnIdManager for MVCC tagging
- Dual-Context Management : Coordinates between persistent and staging pagers
- Commit Protocol : Invokes staged operation replay during commit
Table Layer Integration
Interactions with llkv-table:
- Table Instantiation : Creates Table instances from TableMeta and ColMeta
- Schema Validation : Validates incoming RecordBatch schemas during append operations
- Field Mapping : Resolves logical field names to FieldId identifiers
- MVCC Column Access : Provides metadata for row_id, created_by, deleted_by columns
Executor Integration
The CatalogManager supports llkv-executor by:
- Table Resolution : Resolves table references during query planning
- Schema Information : Supplies Arrow schemas for projection and filtering
- Column Validation : Validates column references in expressions and predicates
- Subquery Support : Provides catalog context for correlated subquery evaluation
Sources: llkv-runtime/README.md:42-46 llkv-table/README.md:36-40
Error Handling and Recovery
Validation Errors
The CatalogManager returns structured errors for:
- Duplicate Table Names : Table already exists within the namespace
- Invalid Column Definitions : Unsupported data type or constraint violation
- Reserved Column Names : Attempt to use system-reserved names like row_id
- Constraint Violations : PRIMARY KEY or FOREIGN KEY constraint failures
Transaction Errors
Transaction-related failures:
- Commit Conflicts : Concurrent DDL operations detected during commit
- Snapshot Violations : Attempt to query table created after snapshot timestamp
- Pager Failures : Persistent storage write failures during commit
- Staging Inconsistencies : Corrupted staging context state
Crash Recovery
After crash recovery:
- Persistent Catalog Loaded : SysCatalog read from pager root key
- Uncommitted Transactions Discarded : Staging contexts do not survive restarts
- MVCC Visibility Applied : Only committed tables with valid created_by are visible
- No Replay Required : Catalog state is consistent without separate recovery log
Sources: llkv-table/README.md:25-30 README.md:64-71
Performance Characteristics
Catalog Query Optimization
The CatalogManager optimizes metadata access through:
- Schema Caching : Frequently accessed schemas cached in RuntimeContext
- Batch Lookups : Multiple table lookups batched into single SysCatalog scan
- Snapshot Reuse : Transaction snapshots reused across multiple catalog queries
- Lazy Loading : Column metadata loaded only when required
Concurrent DDL Handling
Concurrency characteristics:
- Optimistic Concurrency : No global catalog locks; conflicts detected at commit
- Snapshot Isolation : Long-running transactions see stable schema
- Minimal Blocking : DDL operations do not block concurrent queries
- Serializable DDL : Conflict detection ensures serializable execution
Scalability Considerations
System behavior at scale:
- Linear Growth : SysCatalog size grows linearly with table and column count
- Efficient Lookups : Table name resolution uses indexed scans
- Distributed Metadata : Column metadata distributed across ColMeta entries
- No Centralized Bottleneck : No single global lock for catalog operations
Sources: llkv-column-map/README.md:30-35 README.md:35-42
Example Usage Patterns
Auto-Commit CREATE TABLE
Client: CREATE TABLE users (id INT, name TEXT);
Flow:
1. SqlEngine parses to CreateTablePlan
2. RuntimeContext invokes CatalogManager.create_table()
3. CatalogManager validates schema, injects MVCC columns
4. TableMeta and ColMeta appended to SysCatalog (table 0)
5. Persistent pager commits atomically
6. Table immediately visible to all transactions
Transactional CREATE TABLE
Client: BEGIN;
Client: CREATE TABLE temp_results (id INT, value DOUBLE);
Client: INSERT INTO temp_results SELECT ...;
Client: COMMIT;
Flow:
1. BEGIN captures snapshot = 500
2. CREATE TABLE stages TableMeta in MemPager
3. INSERT operations target staging context
4. COMMIT replays staged table to persistent context
5. Conflict detection checks for concurrent creates
6. Table committed with created_by = 501
Concurrent DDL with Conflict
Transaction T1: BEGIN; CREATE TABLE foo (...); [waits]
Transaction T2: BEGIN; CREATE TABLE foo (...); COMMIT;
Transaction T1: COMMIT; [aborts with conflict error]
Reason: T1 detects that foo was created by T2 after T1's snapshot
Sources: demos/llkv-sql-pong-demo/src/main.rs:44-81 llkv-runtime/README.md:20-32
System Catalog and SysCatalog
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-aggregate/README.md
- llkv-column-map/README.md
- llkv-csv/README.md
- llkv-expr/README.md
- llkv-join/README.md
- llkv-runtime/README.md
- llkv-storage/README.md
- llkv-table/README.md
Purpose and Scope
This document describes the system catalog infrastructure that stores and manages table and column metadata for LLKV. The system catalog treats metadata as first-class data, persisting it in table 0 using the same Arrow-based storage mechanisms that handle user data. This ensures crash consistency, enables transactional DDL operations, and simplifies the overall architecture by eliminating separate metadata storage layers.
For information about the higher-level catalog management API that orchestrates table lifecycle operations, see CatalogManager API. For details on custom type definitions and the type registry, see Custom Types and Type Registry.
System Catalog as Table 0
LLKV stores all table and column metadata in a special table with ID 0, known as the system catalog. This design leverages the existing storage infrastructure rather than introducing a separate metadata store.
Key Properties
| Property | Description |
|---|---|
| Table ID | Always 0, reserved at system initialization |
| Storage Format | Arrow RecordBatch with predefined schema |
| MVCC Semantics | Full transaction support with snapshot isolation |
| Persistence | Uses the same ColumnStore and Pager as user tables |
| Crash Safety | Metadata mutations are atomic through the append pipeline |
The system catalog contains two types of metadata records:
- Table Metadata (TableMeta): Defines table schemas, IDs, and names
- Column Metadata (ColMeta): Describes individual columns within tables
graph TB
subgraph "Metadata Storage Model"
UserTables["User Tables\n(ID ≥ 1)"]
SysCatalog["System Catalog\n(Table 0)"]
TableMeta["TableMeta Records\n• table_id\n• table_name\n• schema"]
ColMeta["ColMeta Records\n• table_id\n• col_name\n• col_id\n• data_type"]
end
subgraph "Storage Layer"
ColumnStore["ColumnStore"]
Pager["Pager (MemPager/SimdRDrivePager)"]
end
UserTables -->|described by| SysCatalog
SysCatalog --> TableMeta
SysCatalog --> ColMeta
SysCatalog -->|persisted via| ColumnStore
UserTables -->|persisted via| ColumnStore
ColumnStore --> Pager
style SysCatalog fill:#f9f9f9
Sources: llkv-table/README.md:28-29 llkv-column-map/README.md:10-16
Metadata Schema
The system catalog stores metadata using a predefined Arrow schema with the following structure:
TableMeta Schema
| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | Unique identifier for the table |
| table_name | Utf8 | Human-readable table name |
| schema | Binary | Serialized Arrow schema definition |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |
ColMeta Schema
| Field Name | Arrow Type | Description |
|---|---|---|
| table_id | UInt32 | References the parent table |
| col_id | UInt32 | Column identifier within the table |
| col_name | Utf8 | Column name |
| data_type | Utf8 | Arrow data type descriptor |
| row_id | UInt64 | MVCC row identifier (auto-injected) |
| created_by | UInt64 | Transaction ID that created this record |
| deleted_by | UInt64 | Transaction ID that deleted this record (NULL if active) |
Sources: llkv-table/README.md:13-17 Diagram 4 from high-level architecture
SysCatalog Implementation
The SysCatalog struct serves as the programmatic interface to the system catalog, providing methods to read and write metadata while abstracting the underlying Arrow storage details.
graph LR
subgraph "SysCatalog Interface"
SysCatalog["SysCatalog"]
CreateTable["create_table()"]
GetTable["get_table_meta()"]
ListTables["list_tables()"]
DropTable["drop_table()"]
CreateCol["create_column()"]
GetCol["get_column_meta()"]
ListCols["list_columns()"]
end
subgraph "Storage Backend"
Table0["Table (ID=0)"]
ColumnStore["ColumnStore"]
end
SysCatalog --> CreateTable
SysCatalog --> GetTable
SysCatalog --> ListTables
SysCatalog --> DropTable
SysCatalog --> CreateCol
SysCatalog --> GetCol
SysCatalog --> ListCols
CreateTable --> Table0
GetTable --> Table0
ListTables --> Table0
DropTable --> Table0
CreateCol --> Table0
GetCol --> Table0
ListCols --> Table0
Table0 --> ColumnStore
Core Components
Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:39
Metadata Query Process
When the runtime queries the catalog (e.g., during SELECT planning), it follows this flow:
Sources: llkv-table/README.md:23-25 llkv-runtime/README.md:36-40
sequenceDiagram
participant Runtime as RuntimeContext
participant Catalog as SysCatalog
participant Table0 as Table (ID=0)
participant Store as ColumnStore
Runtime->>Catalog: get_table_meta("users")
Catalog->>Table0: scan_stream()\nWHERE table_name = 'users'
Table0->>Store: ColumnStream with predicate
Store->>Store: Apply MVCC filtering
Store-->>Table0: RecordBatch
Table0-->>Catalog: RecordBatch
Note over Catalog: Deserialize TableMeta\nfrom Arrow batch
Catalog-->>Runtime: TableMeta struct
Runtime->>Catalog: list_columns(table_id)
Catalog->>Table0: scan_stream()\nWHERE table_id = X
Table0->>Store: ColumnStream with predicate
Store-->>Table0: RecordBatch
Table0-->>Catalog: RecordBatch
Note over Catalog: Deserialize ColMeta\nrecords
Catalog-->>Runtime: Vec<ColMeta>
Metadata Operations
DDL operations (CREATE TABLE, DROP TABLE, ALTER TABLE) modify the system catalog through the same transactional append pipeline used for INSERT statements.
graph TD
ParseSQL["Parse SQL:\nCREATE TABLE users (...)"]
CreatePlan["CreateTablePlan"]
RuntimeExec["Runtime.execute_create_table()"]
ValidateSchema["Validate Schema"]
AllocTableID["Allocate table_id"]
BuildTableMeta["Build TableMeta RecordBatch"]
BuildColMeta["Build ColMeta RecordBatch"]
AppendTable["Table(0).append(TableMeta)"]
AppendCols["Table(0).append(ColMeta)"]
ColumnStore["ColumnStore.append()"]
CommitPager["Pager.batch_put()"]
ParseSQL --> CreatePlan
CreatePlan --> RuntimeExec
RuntimeExec --> ValidateSchema
ValidateSchema --> AllocTableID
AllocTableID --> BuildTableMeta
AllocTableID --> BuildColMeta
BuildTableMeta --> AppendTable
BuildColMeta --> AppendCols
AppendTable --> ColumnStore
AppendCols --> ColumnStore
ColumnStore --> CommitPager
style AppendTable fill:#f9f9f9
style AppendCols fill:#f9f9f9
CREATE TABLE Flow
Key Implementation Details:
- Schema Validation : The runtime validates the Arrow schema before allocating resources
- Table ID Allocation : Monotonically increasing IDs are assigned via CatalogManager
- Atomic Append : Both TableMeta and all ColMeta records are appended in a single transaction
- MVCC Tagging : The created_by column is set to the current transaction ID
Sources: llkv-runtime/README.md:36-40 llkv-table/README.md:22-24
graph TD
DropPlan["DropTablePlan"]
RuntimeExec["Runtime.execute_drop_table()"]
LookupMeta["SysCatalog.get_table_meta()"]
CheckExists["Verify table exists"]
BuildDeleteMeta["Build RecordBatch:\n• table_id\n• deleted_by = current_txn"]
AppendDelete["Table(0).append(delete_batch)"]
ColumnStore["ColumnStore.append()"]
DropPlan --> RuntimeExec
RuntimeExec --> LookupMeta
LookupMeta --> CheckExists
CheckExists --> BuildDeleteMeta
BuildDeleteMeta --> AppendDelete
AppendDelete --> ColumnStore
style BuildDeleteMeta fill:#f9f9f9
DROP TABLE Flow
Dropping a table uses MVCC soft-delete semantics rather than physical deletion:
The deleted_by column is updated to mark the metadata as deleted. MVCC visibility rules ensure that:
- Transactions with snapshots before the deletion still see the table
- Transactions starting after the deletion do not see the table
Sources: llkv-table/README.md:32-34 Diagram 4 from high-level architecture
sequenceDiagram
participant Main as main() or SqlEngine::new()
participant Runtime as RuntimeContext::new()
participant CatMgr as CatalogManager::new()
participant Table as Table::open_or_create()
participant Store as ColumnStore::open()
participant Pager as Pager (MemPager/SimdRDrivePager)
Main->>Runtime: new(pager)
Runtime->>CatMgr: new(pager)
CatMgr->>Store: open(pager, root_key)
Store->>Pager: batch_get([root_key])
alt Catalog Exists
Pager-->>Store: Catalog data
Store-->>CatMgr: ColumnStore (loaded)
Note over CatMgr: Deserialize catalog entries
else First Run
Pager-->>Store: NULL
Store-->>CatMgr: ColumnStore (empty)
CatMgr->>Table: open_or_create(table_id=0)
Note over CatMgr: Create system catalog schema
Table->>Store: Initialize table 0
Store->>Pager: batch_put(catalog_schema)
end
CatMgr-->>Runtime: CatalogManager (initialized)
Runtime-->>Main: RuntimeContext (ready)
Bootstrap Process
When LLKV initializes, the system catalog must bootstrap itself before any user operations can proceed.
Initialization Sequence
Bootstrap Steps:
- Pager Initialization : The storage backend is opened (in-memory or persistent)
- Catalog Discovery : The ColumnStore attempts to load the catalog from the pager root key
- Schema Creation : If no catalog exists, table 0 is created with the predefined schema
- Ready State : The runtime can now service DDL and DML operations
Sources: llkv-runtime/README.md:26-31 llkv-storage/README.md:12-16
graph TB
subgraph "SQL Query Processing"
ParsedSQL["Parsed SQL AST"]
SelectPlan["SelectPlan<String>"]
ResolvedPlan["SelectPlan<FieldId>"]
end
subgraph "RuntimeContext"
CatalogLookup["Catalog Lookup"]
FieldResolution["Field Name → FieldId\nResolution"]
SchemaValidation["Schema Validation"]
end
subgraph "System Catalog"
SysCatalog["SysCatalog"]
TableMetaCache["In-Memory Metadata Cache"]
end
ParsedSQL --> SelectPlan
SelectPlan --> CatalogLookup
CatalogLookup --> SysCatalog
SysCatalog --> TableMetaCache
TableMetaCache --> FieldResolution
FieldResolution --> SchemaValidation
SchemaValidation --> ResolvedPlan
Integration with Runtime
The RuntimeContext uses the system catalog for all schema-dependent operations:
Schema Resolution Flow
Usage Examples
| Operation | Catalog Interaction |
|---|---|
| SELECT | Resolve table names → table IDs, resolve column names → field IDs |
| INSERT | Validate schema compatibility, check for required columns |
| JOIN | Resolve schemas for both tables, validate join key compatibility |
| CREATE INDEX | (Future) Persist index metadata as new catalog record type |
| ALTER TABLE | Update existing metadata records with new schema definitions |
Sources: llkv-runtime/README.md:36-40 llkv-expr/README.md:50-54
Dual-Context Catalog Access
During explicit transactions, the runtime maintains two catalog views:
Catalog Visibility Rules
- Persistent Context : Sees only metadata committed before the transaction's snapshot
- Staging Context : Sees tables created within the current transaction
- On Commit : Staged metadata is replayed into the persistent context
- On Rollback : Staged metadata is discarded
This dual-view approach ensures that:
- DDL operations remain transactional
- Uncommitted schema changes don't leak to other sessions
- Catalog queries are snapshot-isolated like DML operations
Sources: llkv-runtime/README.md:26-31 llkv-table/README.md:32-34
Metadata Caching
The CatalogManager maintains an in-memory cache of frequently accessed metadata to avoid repeated scans of table 0:
| Cache Structure | Purpose | Invalidation Strategy |
|---|---|---|
| Table Name → ID Map | Fast table resolution during planning | Invalidated on CREATE/DROP TABLE |
| Table ID → Schema Map | Quick schema validation during INSERT | Invalidated on ALTER TABLE |
| Column Name → FieldId Map | Field resolution for expressions | Rebuilt on schema changes |
The cache is session-local and does not require cross-session synchronization in the current single-process model.
Sources: Inferred from llkv-runtime/README.md:12-17
Summary
The LLKV system catalog demonstrates the principle of treating metadata as data by storing all table and column definitions in table 0 using the same Arrow-based storage infrastructure that handles user tables. This design:
- Simplifies Architecture : Eliminates the need for separate metadata storage systems
- Ensures Consistency : Metadata mutations use MVCC transactions like all other data
- Enables Crash Recovery : The pager's atomicity guarantees extend to schema changes
- Supports Transactional DDL : Schema modifications can be rolled back or committed atomically
The SysCatalog interface abstracts the underlying Arrow storage, providing a type-safe API for the runtime to query and modify metadata. The bootstrap process ensures the system catalog exists before any user operations proceed, and the dual-context model enables proper transaction isolation for DDL operations.
Sources: llkv-table/README.md:28-29 llkv-runtime/README.md:36-40 llkv-column-map/README.md:10-16 Diagram 4 from high-level architecture
Custom Types and Type Registry
Relevant source files
- llkv-aggregate/src/lib.rs
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_value.rs
- llkv-table/src/planner/program.rs
This page documents LLKV's type system, including SQL type mapping, custom type representations, and type inference mechanisms. The system uses Apache Arrow's DataType as its canonical type representation, with custom types like Decimal128, Date32, and Interval mapped to Arrow-compatible formats.
For information about expression evaluation and scalar operations, see Scalar Evaluation and NumericKernels. For aggregate function type handling, see Aggregation System.
Type System Architecture
LLKV's type system operates in three layers: SQL types (user-facing), intermediate literal types (planning), and Arrow DataTypes (execution). All data flowing through the system ultimately uses Arrow's columnar format.
Type Flow Architecture
graph TB
subgraph "SQL Layer"
SQLTYPE["SQL Types\nINT, TEXT, DECIMAL, DATE"]
end
subgraph "Planning Layer"
SQLVALUE["SqlValue\nInteger, Float, Decimal,\nString, Date32, Interval, Struct"]
LITERAL["Literal\nType-erased values"]
PLANVALUE["PlanValue\nPlan-time literals"]
end
subgraph "Execution Layer"
DATATYPE["Arrow DataType\nInt64, Float64, Utf8,\nDecimal128, Date32, Interval"]
SCHEMA["ExecutorSchema\nColumn metadata + types"]
INFERENCE["Type Inference\ninfer_computed_data_type"]
end
subgraph "Storage Layer"
RECORDBATCH["RecordBatch\nTyped columnar data"]
ARRAYS["Typed Arrays\nInt64Array, StringArray, etc."]
end
SQLTYPE --> SQLVALUE
SQLVALUE --> PLANVALUE
SQLVALUE --> LITERAL
PLANVALUE --> DATATYPE
LITERAL --> DATATYPE
DATATYPE --> SCHEMA
SCHEMA --> INFERENCE
INFERENCE --> DATATYPE
DATATYPE --> RECORDBATCH
RECORDBATCH --> ARRAYS
style DATATYPE fill:#f9f9f9
Sources: llkv-sql/src/sql_value.rs:16-27 llkv-sql/src/lib.rs:22-29 llkv-executor/src/translation/schema.rs:53-123
SQL to Arrow Type Mapping
SQL types are mapped to Arrow DataTypes during parsing and planning. The mapping is defined implicitly through the parsing logic in SqlValue and the type inference system.
| SQL Type | Arrow DataType | Notes |
|---|---|---|
| INT, INTEGER, BIGINT | Int64 | All integer types normalized to Int64 |
| FLOAT, DOUBLE, REAL | Float64 | All floating-point types normalized to Float64 |
| DECIMAL(p,s) | Decimal128(p,s) | Fixed-point decimal with precision and scale |
| TEXT, VARCHAR | Utf8 | Variable-length UTF-8 strings |
| DATE | Date32 | Days since Unix epoch |
| INTERVAL | Interval(MonthDayNano) | Calendar-aware interval type |
| BOOLEAN | Boolean | True/false values |
| Dictionary literals | Struct | Key-value maps represented as structs |
SQL to Arrow Type Conversion Flow
Sources: llkv-sql/src/sql_value.rs:178-214 llkv-sql/src/sql_value.rs:216-236 llkv-sql/src/lib.rs:22-29
Custom Type Representations
LLKV defines custom types for values that require special handling beyond basic Arrow types. These types bridge SQL semantics and Arrow's columnar format.
DecimalValue
Fixed-point decimal numbers with exact precision. Stored as i128 with a scale factor.
DecimalValue Representation
graph TB
subgraph "DecimalValue Structure"
DEC["DecimalValue\nraw_value: i128\nscale: i8"]
end
subgraph "SQL Input"
SQLDEC["SQL: 123.45"]
end
subgraph "Internal Representation"
RAW["raw_value = 12345\nscale = 2"]
CALC["Actual value = 12345 / 10^2 = 123.45"]
end
subgraph "Arrow Storage"
ARR["Decimal128Array\nprecision=5, scale=2"]
end
SQLDEC --> DEC
DEC --> RAW
RAW --> CALC
DEC --> ARR
Sources: llkv-sql/src/sql_value.rs:187-207 llkv-aggregate/src/lib.rs:314-324
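The fixed-point arithmetic can be sketched as follows (an illustrative struct, not the actual DecimalValue definition in llkv-sql):

```rust
// Illustrative fixed-point representation: value = raw_value / 10^scale.
#[derive(Clone, Copy, Debug)]
struct DecimalValue {
    raw_value: i128,
    scale: i8,
}

impl DecimalValue {
    fn to_f64(self) -> f64 {
        self.raw_value as f64 / 10f64.powi(self.scale as i32)
    }
}

fn main() {
    // SQL literal 123.45 becomes raw_value = 12345 with scale = 2.
    let d = DecimalValue { raw_value: 12_345, scale: 2 };
    assert_eq!(d.to_f64(), 123.45);
}
```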
IntervalValue
Calendar-aware time intervals with month, day, and nanosecond components.
IntervalValue Operations
Sources: llkv-sql/src/sql_value.rs:238-283 llkv-expr/src/literal.rs
Date32
Days since Unix epoch (1970-01-01), stored as i32.
Date32 Type Handling
Sources: llkv-sql/src/sql_value.rs:76-87 llkv-sql/src/sql_value.rs:169-174
Struct Types
Dictionary literals in SQL are represented as struct types with named fields.
Struct Type Representation
Sources: llkv-sql/src/sql_value.rs:124-135 llkv-sql/src/sql_value.rs:227-234
Type Inference for Computed Expressions
The type inference system determines Arrow DataTypes for computed expressions at planning time. This enables schema generation before execution.
Type Inference Flow
graph TB
subgraph "Expression Input"
EXPR["ScalarExpr<FieldId>\ncol1 + col2 * 3"]
end
subgraph "Type Inference"
INFER["infer_computed_data_type"]
CHECK["expression_uses_float"]
NORM["normalized_numeric_type"]
end
subgraph "Type Resolution"
COL1["Column col1: Int64"]
COL2["Column col2: Float64"]
RESULT["Result: Float64\n(one operand is float)"]
end
subgraph "Schema Output"
FIELD["Field(alias, Float64, nullable=true)"]
end
EXPR --> INFER
INFER --> CHECK
CHECK --> COL1
CHECK --> COL2
CHECK --> RESULT
INFER --> NORM
NORM --> RESULT
RESULT --> FIELD
Sources: llkv-executor/src/translation/schema.rs:53-123 llkv-executor/src/translation/schema.rs:149-243
Type Inference Rules
The inference system applies the following rules:
| Expression Type | Inferred Type | Logic |
|---|---|---|
| ScalarExpr::Literal(Integer) | Int64 | Direct mapping |
| ScalarExpr::Literal(Float) | Float64 | Direct mapping |
| ScalarExpr::Literal(Decimal(p,s)) | Decimal128(p,s) | Preserves precision/scale |
| ScalarExpr::Column(field_id) | Column's type | Lookup in schema |
| ScalarExpr::Binary{left, op, right} | Float64 if any operand is float, else Int64 | Type promotion |
| ScalarExpr::Compare{...} | Int64 | Boolean as integer (0/1) |
| ScalarExpr::Cast{data_type, ...} | data_type | Explicit cast target |
| ScalarExpr::Random | Float64 | Floating-point random values |
Sources: llkv-executor/src/translation/schema.rs:56-122
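A toy version of the binary-operator promotion rule from the table above might look like this; the real infer_computed_data_type handles many more cases, including decimals and casts:

```rust
use arrow::datatypes::DataType;

// Promotion sketch: binary arithmetic yields Float64 if either operand is
// floating point, otherwise Int64. Illustrative only.
fn promote_binary(left: &DataType, right: &DataType) -> DataType {
    let is_float = |dt: &DataType| matches!(dt, DataType::Float32 | DataType::Float64);
    if is_float(left) || is_float(right) {
        DataType::Float64
    } else {
        DataType::Int64
    }
}
```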
graph TB
subgraph "Input Types"
INT8["Int8/Int16/Int32/Int64"]
UINT["UInt8/UInt16/UInt32/UInt64"]
FLOAT["Float32/Float64"]
DEC["Decimal128(p,s)"]
BOOL["Boolean"]
end
subgraph "Normalization"
NORM["normalized_numeric_type"]
end
subgraph "Output Types"
OUT_INT["Int64"]
OUT_FLOAT["Float64"]
end
INT8 --> NORM
BOOL --> NORM
NORM --> OUT_INT
UINT --> NORM
FLOAT --> NORM
NORM --> OUT_FLOAT
DEC --> NORM
NORM --> |scale=0 && fits in i64| OUT_INT
NORM --> |otherwise| OUT_FLOAT
Numeric Type Normalization
All numeric types are normalized to either Int64 or Float64 for arithmetic operations:
Numeric Type Normalization
Sources: llkv-executor/src/translation/schema.rs:125-147
Type Resolution During Expression Translation
Expression translation converts string-based column references to typed FieldId references, resolving types through the schema.
Expression Translation and Type Resolution
graph TB
subgraph "String-based Expression"
EXPRSTR["Expr<String>\nColumn('age') > Literal(18)"]
end
subgraph "Translation"
TRANS["translate_predicate"]
SCALAR["translate_scalar"]
RESOLVE["resolve_field_id"]
end
subgraph "Schema Lookup"
SCHEMA["ExecutorSchema"]
LOOKUP["schema.resolve('age')"]
COLUMN["ExecutorColumn\nname='age'\nfield_id=5\ndata_type=Int64"]
end
subgraph "FieldId-based Expression"
EXPRFID["Expr<FieldId>\nColumn(5) > Literal(18)"]
end
EXPRSTR --> TRANS
TRANS --> SCALAR
SCALAR --> RESOLVE
RESOLVE --> LOOKUP
LOOKUP --> COLUMN
COLUMN --> EXPRFID
Sources: llkv-executor/src/translation/expression.rs:18-174 llkv-executor/src/translation/expression.rs:390-407
Type Preservation During Translation
The translation process preserves type information from the original expression:
| Expression Component | Type Preservation |
|---|---|
| Column(name) | Replaced with Column(field_id), type from schema |
| Literal(value) | Clone literal, type embedded in Literal enum |
| Binary{left, op, right} | Recursively translate operands, type inferred later |
| Cast{expr, data_type} | Preserve data_type during translation |
| Aggregate(call) | Translate inner expression, aggregate type determined by function |
Sources: llkv-executor/src/translation/expression.rs:176-387
graph TB
subgraph "Aggregate Specification"
SPEC["AggregateKind::Sum\nfield_id=5\ndata_type=Int64\ndistinct=false"]
end
subgraph "Accumulator Creation"
CREATE["new_with_projection_index"]
MATCH["Match on (data_type, distinct)"]
end
subgraph "Type-Specific Accumulators"
INT64["SumInt64\nvalue: Option<i64>\nhas_values: bool"]
FLOAT64["SumFloat64\nvalue: f64\nsaw_value: bool"]
DEC128["SumDecimal128\nsum: i128\nprecision: u8\nscale: i8"]
end
subgraph "Update Logic"
UPDATE_INT["Checked addition\nError on overflow"]
UPDATE_FLOAT["Floating addition\nNo overflow check"]
UPDATE_DEC["Checked i128 addition\nError on overflow"]
end
SPEC --> CREATE
CREATE --> MATCH
MATCH --> |Int64, false| INT64
MATCH --> |Float64, false| FLOAT64
MATCH --> |Decimal128 p,s , false| DEC128
INT64 --> UPDATE_INT
FLOAT64 --> UPDATE_FLOAT
DEC128 --> UPDATE_DEC
Type Handling in Aggregates
Aggregate functions have type-specific accumulator implementations. The type determines overflow behavior, precision, and result format.
Aggregate Type-Specific Accumulators
Sources: llkv-aggregate/src/lib.rs:461-542 llkv-aggregate/src/lib.rs:799-859
Aggregate Type Matrix
Different aggregates support different type combinations:
| Aggregate | Int64 | Float64 | Decimal128 | Utf8 | Boolean | Date32 |
|---|---|---|---|---|---|---|
| COUNT(*) | N/A | N/A | N/A | N/A | N/A | N/A |
| COUNT(col) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SUM | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| AVG | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| MIN/MAX | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| TOTAL | ✓ | ✓ | ✓ | ✓ (coerce) | - | - |
| GROUP_CONCAT | ✓ | ✓ | - | ✓ | ✓ | - |
Notes:
- ✓ = Native support with type-specific accumulator
- ✓ (coerce) = Support via SQLite-style numeric coercion
- - = Not supported
Sources: llkv-aggregate/src/lib.rs:22-68 llkv-aggregate/src/lib.rs:385-447
graph LR
subgraph "DistinctKey Variants"
INT["Int(i64)"]
FLOAT["Float(u64)\nf64::to_bits()"]
STR["Str(String)"]
BOOL["Bool(bool)"]
DATE["Date(i32)"]
DEC["Decimal(i128)"]
end
subgraph "Accumulator"
SEEN["FxHashSet<DistinctKey>"]
INSERT["seen.insert(key)"]
CHECK["Returns true if new"]
end
subgraph "Aggregation"
ADD["Add to sum only if new"]
COUNT["Count distinct values"]
end
INT --> SEEN
FLOAT --> SEEN
STR --> SEEN
BOOL --> SEEN
DATE --> SEEN
DEC --> SEEN
SEEN --> INSERT
INSERT --> CHECK
CHECK --> ADD
CHECK --> COUNT
Distinct Value Tracking
For DISTINCT aggregates, the system tracks seen values using type-specific keys:
Distinct Value Tracking
Sources: llkv-aggregate/src/lib.rs:249-333 llkv-aggregate/src/lib.rs:825-858
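The bit-pattern trick for floats can be sketched as below; the enum mirrors the variants in the diagram, but the actual DistinctKey in llkv-aggregate may differ:

```rust
use std::collections::HashSet;

// Illustrative distinct-key type; floats are keyed by their bit pattern so
// they can be hashed and compared exactly.
#[derive(Hash, PartialEq, Eq)]
enum DistinctKey {
    Int(i64),
    Float(u64), // f64::to_bits()
    Str(String),
    Bool(bool),
    Date(i32),
    Decimal(i128),
}

fn main() {
    let mut seen: HashSet<DistinctKey> = HashSet::new();
    // insert() returns true only the first time a value is observed, so the
    // accumulator folds a value into the running aggregate only on `true`.
    assert!(seen.insert(DistinctKey::Float(1.5f64.to_bits())));
    assert!(!seen.insert(DistinctKey::Float(1.5f64.to_bits())));
}
```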
Type Coercion and Casting
The system supports both implicit coercion (for numeric operations) and explicit casting (via CAST expressions).
graph TB
subgraph "Input Values"
STR["String '123.45'"]
BOOL["Boolean true"]
NULL["NULL"]
end
subgraph "Coercion Function"
COERCE["array_value_to_numeric"]
PARSE["Parse as f64"]
FALLBACK["Use 0.0 if parse fails"]
end
subgraph "Coerced Values"
NUM1["123.45"]
NUM2["1.0"]
NUM3["0.0 (NULL skipped)"]
end
STR --> COERCE
BOOL --> COERCE
NULL --> COERCE
COERCE --> PARSE
PARSE --> |Success| NUM1
PARSE --> |Failure| FALLBACK
FALLBACK --> NUM1
COERCE --> |Boolean: 1.0/0.0| NUM2
COERCE --> |NULL: skip row| NUM3
Numeric Coercion in Aggregates
String and boolean values are coerced to numeric types in aggregate functions following SQLite semantics:
Numeric Coercion in Aggregates
Sources: llkv-aggregate/src/lib.rs:385-447 llkv-aggregate/src/lib.rs:860-877
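The coercion rules described above can be summarized in a small helper (hypothetical types and function, not the actual array_value_to_numeric implementation):

```rust
// Illustrative SQLite-style coercion: NULL rows are skipped, booleans map to
// 1.0/0.0, and unparseable text falls back to 0.0.
enum SqlScalar {
    Null,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),
}

fn coerce_to_numeric(v: &SqlScalar) -> Option<f64> {
    match v {
        SqlScalar::Null => None, // skip the row
        SqlScalar::Bool(b) => Some(if *b { 1.0 } else { 0.0 }),
        SqlScalar::Int(i) => Some(*i as f64),
        SqlScalar::Float(f) => Some(*f),
        SqlScalar::Text(s) => Some(s.trim().parse::<f64>().unwrap_or(0.0)),
    }
}
```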
Explicit Type Casting
The CAST expression provides explicit type conversion:
Explicit Type Casting
Sources: llkv-executor/src/translation/schema.rs:95 llkv-expr/src/expr.rs:114-118
Type System Integration Points
The type system integrates with multiple layers of the architecture:
| Layer | Integration Point | Purpose |
|---|---|---|
| SQL Parsing | SqlValue::try_from_expr | Parse SQL literals into typed values |
| Planning | PlanValue conversion | Convert literals to plan representation |
| Schema Inference | infer_computed_data_type | Determine result types for expressions |
| Expression Translation | translate_scalar | Resolve column types from schema |
| Program Compilation | OwnedOperator | Store typed operators in bytecode |
| Execution | RecordBatch schema | Validate types match expected schema |
| Aggregation | Accumulator creation | Create type-specific aggregators |
| Storage | Arrow serialization | Persist typed data in columnar format |
Sources: llkv-sql/src/sql_value.rs:30-122 llkv-executor/src/translation/schema.rs:15-51 llkv-table/src/planner/program.rs:69-101
Summary
LLKV's type system is built on Apache Arrow's DataType as the canonical type representation, with custom types for SQL-specific semantics:
- SQL types are mapped to Arrow types during parsing through SqlValue
- Custom types (Decimal, Interval, Date32, Struct) provide SQL-compatible semantics
- Type inference determines result types for computed expressions at planning time
- Type resolution converts string column references to typed FieldId references
- Aggregate functions use type-specific accumulators with appropriate overflow handling
- Type coercion follows SQLite semantics for numeric operations
The type system operates transparently across all layers, ensuring type safety from SQL parsing through storage while maintaining compatibility with Arrow's columnar format.
Sources: llkv-sql/src/lib.rs:1-51 llkv-executor/src/translation/schema.rs:1-271 llkv-aggregate/src/lib.rs:1-83