This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Overview
Purpose and Scope
LLKV is an embedded SQL database system implemented in Rust that combines Apache Arrow’s columnar memory format with a key-value storage backend. This document provides a high-level introduction to the system architecture, component organization, and core data flows.
For detailed information about specific subsystems, see:
- Architecture and component organization : Architecture
- SQL query processing : SQL Interface and Query Planning
- Storage implementation : Storage Layer
- Metadata management : Catalog and Metadata Management
System Architecture
LLKV is organized as a Rust workspace containing 15 specialized crates that form a layered architecture. The system processes SQL queries through multiple stages—parsing, planning, execution—before ultimately persisting data in a memory-mapped key-value store.
graph TB
subgraph "User Interface"
SqlEngine["SqlEngine\n(llkv-sql)"]
PreparedStatement["PreparedStatement"]
end
subgraph "Query Processing"
Parser["SQL Parser\n(sqlparser-rs)"]
Planner["Query Planner\n(llkv-plan)"]
ExprSystem["Expression System\n(llkv-expr)"]
end
subgraph "Execution"
QueryExecutor["QueryExecutor\n(llkv-executor)"]
RuntimeEngine["RuntimeEngine\n(llkv-runtime)"]
Aggregate["AggregateAccumulator\n(llkv-aggregate)"]
Join["Hash Join\n(llkv-join)"]
end
subgraph "Data Management"
Table["Table\n(llkv-table)"]
SysCatalog["SysCatalog\nTable ID = 0"]
Scanner["Scanner\n(llkv-scan)"]
TxManager["TransactionManager\n(llkv-transaction)"]
end
subgraph "Storage"
ColumnStore["ColumnStore\n(llkv-column-map)"]
ArrowBatches["RecordBatch\nArrow Arrays"]
Pager["Pager Trait\n(llkv-storage)"]
end
subgraph "Persistence"
SimdRDrive["simd-r-drive\nMemory-Mapped K-V"]
EntryHandle["EntryHandle\nByte Blobs"]
end
SqlEngine --> Parser
Parser --> Planner
Planner --> ExprSystem
Planner --> QueryExecutor
QueryExecutor --> RuntimeEngine
QueryExecutor --> Aggregate
QueryExecutor --> Join
RuntimeEngine --> Table
RuntimeEngine --> TxManager
Table --> SysCatalog
Table --> Scanner
Scanner --> ColumnStore
ColumnStore --> ArrowBatches
ColumnStore --> Pager
Pager --> SimdRDrive
SimdRDrive --> EntryHandle
style SqlEngine fill:#e8e8e8
style ColumnStore fill:#e8e8e8
style SimdRDrive fill:#e8e8e8
Layered Architecture
Diagram: End-to-End System Layering
The system flows from SQL text at the top through progressive layers of abstraction down to persistent storage. Each layer is implemented as one or more dedicated crates with well-defined responsibilities.
Sources: Cargo.toml:1-109 llkv-sql/src/sql_engine.rs:1-100 llkv-executor/src/lib.rs:1-100
Workspace Structure
The LLKV workspace is divided into 15 crates, each handling a specific concern. The following diagram maps crate names to their primary responsibilities:
Diagram: Crate Dependency Structure
graph LR
subgraph "Foundation"
types["llkv-types\nShared types\nLogicalFieldId"]
result["llkv-result\nError handling\nError enum"]
storage["llkv-storage\nPager trait\nMemPager"]
end
subgraph "Expression & Planning"
expr["llkv-expr\nExpression AST\nScalarExpr, Expr"]
plan["llkv-plan\nQuery plans\nSelectPlan, InsertPlan"]
end
subgraph "Execution"
executor["llkv-executor\nQuery execution\nQueryExecutor"]
compute["llkv-compute\nCompute kernels\nNumericKernels"]
aggregate["llkv-aggregate\nAggregation\nAggregateAccumulator"]
join["llkv-join\nJoin ops\nhash_join"]
scan["llkv-scan\nTable scans\nScanner"]
end
subgraph "Data Management"
table["llkv-table\nTable abstraction\nTable, SysCatalog"]
colmap["llkv-column-map\nColumn store\nColumnStore"]
transaction["llkv-transaction\nMVCC\nTransactionManager"]
end
subgraph "User Interface"
sql["llkv-sql\nSQL engine\nSqlEngine"]
runtime["llkv-runtime\nRuntime engine\nRuntimeEngine"]
end
subgraph "Utilities"
csv["llkv-csv\nCSV import/export"]
threading["llkv-threading\nThread pool"]
testutils["llkv-test-utils\nTest helpers"]
slttester["llkv-slt-tester\nSQLite test harness"]
end
types -.-> expr
types -.-> storage
types -.-> table
result -.-> storage
result -.-> table
storage -.-> colmap
expr -.-> plan
expr -.-> compute
plan -.-> executor
colmap -.-> table
table -.-> executor
executor -.-> runtime
runtime -.-> sql
This diagram shows the primary dependency flow between crates. Foundation crates (llkv-types, llkv-result, llkv-storage) provide shared infrastructure. Middle layers handle query planning and execution. Top layers expose the SQL interface.
Sources: Cargo.toml:2-26 Cargo.toml:37-96
Key Components
SQL Interface Layer
The SqlEngine struct in llkv-sql is the primary entry point for executing SQL statements. It handles statement parsing, preprocessing, and orchestrates execution through the runtime layer.
Diagram: SQL Interface Entry Points
graph TD
User["Application Code"]
SqlEngine["SqlEngine::new(pager)"]
Execute["SqlEngine::execute(sql)"]
Sql["SqlEngine::sql(sql)"]
Prepare["SqlEngine::prepare(sql)"]
User --> SqlEngine
SqlEngine --> Execute
SqlEngine --> Sql
SqlEngine --> Prepare
Execute --> Parse["parse_sql_with_recursion_limit"]
Parse --> Preprocess["preprocess_sql_input"]
Preprocess --> BuildPlan["build_*_plan methods"]
BuildPlan --> RuntimeExec["RuntimeEngine::execute_statement"]
Sql --> Execute
Prepare --> PreparedStatement["PreparedStatement"]
The SqlEngine provides three primary methods: execute() for mixed statements, sql() for SELECT queries returning RecordBatch results, and prepare() for parameterized statements.
Sources: llkv-sql/src/sql_engine.rs:440-486 llkv-sql/src/sql_engine.rs:1045-1134 llkv-sql/src/sql_engine.rs:1560-1612
Query Planning
The llkv-plan crate transforms parsed SQL AST into executable plans. Key plan types include:
| Plan Type | Purpose | Key Fields |
|---|---|---|
| SelectPlan | Query execution | projections, tables, filter, group_by, order_by |
| InsertPlan | Data insertion | table, columns, source, on_conflict |
| UpdatePlan | Row updates | table, assignments, filter |
| DeletePlan | Row deletion | table, filter |
| CreateTablePlan | Schema definition | table, columns, constraints |
Sources: llkv-executor/src/lib.rs:31-35 (llkv-plan references)
Expression System
The llkv-expr crate defines two core expression types:
- Expr<F>: Boolean predicate expressions for filtering (used in WHERE clauses)
- ScalarExpr<F>: Scalar value expressions for projections and computations
Both are generic over field identifier type F, allowing translation from string column names to numeric FieldId identifiers.
Sources: llkv-executor/src/lib.rs:23-26
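The general shape of these generic expression types can be sketched as follows; the variant names and the resolution helper below are illustrative assumptions, not the actual llkv-expr definitions.

```rust
// Minimal sketch of a field-generic expression AST (illustrative only; the
// real llkv-expr enums have more variants and different names).
enum ScalarExpr<F> {
    Column(F),    // reference to a column by field identifier
    Literal(i64), // constant value (the real crate supports more types)
    Add(Box<ScalarExpr<F>>, Box<ScalarExpr<F>>),
}

enum Expr<F> {
    Eq(ScalarExpr<F>, ScalarExpr<F>), // boolean predicate: a = b
    And(Vec<Expr<F>>),
}

// Translating string column names to numeric FieldIds is just a mapping over F.
fn resolve(expr: Expr<String>, lookup: &impl Fn(&str) -> u32) -> Expr<u32> {
    match expr {
        Expr::Eq(a, b) => Expr::Eq(resolve_scalar(a, lookup), resolve_scalar(b, lookup)),
        Expr::And(items) => Expr::And(items.into_iter().map(|e| resolve(e, lookup)).collect()),
    }
}

fn resolve_scalar(e: ScalarExpr<String>, lookup: &impl Fn(&str) -> u32) -> ScalarExpr<u32> {
    match e {
        ScalarExpr::Column(name) => ScalarExpr::Column(lookup(&name)),
        ScalarExpr::Literal(v) => ScalarExpr::Literal(v),
        ScalarExpr::Add(a, b) => ScalarExpr::Add(
            Box::new(resolve_scalar(*a, lookup)),
            Box::new(resolve_scalar(*b, lookup)),
        ),
    }
}
```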
Storage Architecture
LLKV stores data in a columnar format using Apache Arrow, persisted through a key-value storage backend:
Diagram: Storage Architecture Layers
graph TB
subgraph "Logical Layer"
RecordBatch["RecordBatch\nArrow columnar format"]
Schema["Schema\nColumn definitions"]
end
subgraph "Column Store Layer"
ColumnStore["ColumnStore"]
ColumnDescriptor["ColumnDescriptor\nLinked list of chunks"]
ChunkMetadata["ChunkMetadata\nmin, max, size, nulls"]
end
subgraph "Physical Layer"
Pager["Pager trait\nbatch_get, batch_put"]
MemPager["MemPager"]
SimdRDrive["simd-r-drive\nMemory-mapped storage"]
end
subgraph "Persistence"
EntryHandle["EntryHandle\nByte blob references"]
PhysicalKeys["Physical keys (u64)"]
end
RecordBatch --> ColumnStore
Schema --> ColumnStore
ColumnStore --> ColumnDescriptor
ColumnDescriptor --> ChunkMetadata
ColumnDescriptor --> Pager
Pager --> MemPager
Pager --> SimdRDrive
MemPager --> EntryHandle
SimdRDrive --> EntryHandle
EntryHandle --> PhysicalKeys
Arrow RecordBatches are decomposed into individual column chunks, each serialized and stored via the Pager trait. The simd-r-drive backend provides memory-mapped, SIMD-optimized key-value operations.
Sources: Cargo.lock:126-143 Cargo.lock:671-687 llkv-executor/src/lib.rs:20 (llkv-column-map references)
Query Execution Flow
The following diagram traces a SELECT query through the execution pipeline:
Diagram: SELECT Query Execution Sequence
Query execution proceeds in two phases: (1) filter evaluation to collect matching row IDs, and (2) column gathering to assemble the final RecordBatch. Metadata-based chunk pruning optimizes filter evaluation by skipping chunks that cannot contain matching rows.
Sources: llkv-sql/src/sql_engine.rs:1596-1612 llkv-executor/src/lib.rs:519-563
Data Model
Tables and Schemas
Every table in LLKV has:
- A unique numeric table_id
- A Schema defining column names, types, and nullability
- Optional constraints (primary key, foreign keys, unique, check)
- Optional indexes (single-column and multi-column)
System Catalog
Table ID 0 is reserved for the SysCatalog, a special table that stores metadata about all other tables, columns, indexes, triggers, and constraints. The catalog is self-describing—it uses the same columnar storage as user tables.
Diagram: System Catalog Structure
graph TB
SysCatalog["SysCatalog (Table 0)"]
TableMeta["TableMeta records"]
ColMeta["ColMeta records"]
IndexMeta["Index metadata"]
ConstraintMeta["Constraint records"]
TriggerMeta["Trigger definitions"]
SysCatalog --> TableMeta
SysCatalog --> ColMeta
SysCatalog --> IndexMeta
SysCatalog --> ConstraintMeta
SysCatalog --> TriggerMeta
UserTable1["User Table 1"]
UserTable2["User Table 2"]
TableMeta -.describes.-> UserTable1
TableMeta -.describes.-> UserTable2
ColMeta -.describes.-> UserTable1
ColMeta -.describes.-> UserTable2
All DDL operations (CREATE TABLE, ALTER TABLE, etc.) modify the system catalog. At startup, the catalog is read to reconstruct the complete database schema.
Sources: llkv-sql/src/sql_engine.rs (SysCatalog references)
Columnar Storage
Data is stored column-wise using Apache Arrow’s in-memory format, with each column divided into chunks. Each chunk contains:
- Serialized Arrow array data
- Row ID bitmap (which rows are present)
- Metadata (min/max values, null count, size)
This organization enables:
- Efficient predicate pushdown (skip chunks via min/max)
- Vectorized operations on decompressed data
- Compaction (merging small chunks)
Sources: llkv-executor/src/lib.rs (ColumnStore and RecordBatch references)
Transaction Support
LLKV implements Multi-Version Concurrency Control (MVCC) using hidden columns:
| Column | Type | Purpose |
|---|---|---|
| __created_by | u64 | Transaction ID that created this row version |
| __deleted_by | u64 | Transaction ID that deleted this row version (or u64::MAX if active) |
The TransactionManager in llkv-transaction coordinates transaction boundaries and assigns transaction IDs. Queries automatically filter rows based on the current transaction’s visibility rules.
Sources: Cargo.toml:24 (llkv-transaction crate)
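As a rough illustration of how these columns drive visibility, the core check reduces to comparisons against the reading transaction's ID. This is a simplified sketch; the actual rules in llkv-transaction also account for commit state and snapshot boundaries.

```rust
/// Simplified MVCC visibility sketch: a row version is visible to a reader if
/// it was created at or before the reader's transaction and has not been
/// deleted by an earlier-or-equal transaction. The real visibility rules in
/// llkv-transaction are more involved.
fn row_visible(created_by: u64, deleted_by: u64, reader_txn_id: u64) -> bool {
    let created_visible = created_by <= reader_txn_id;
    let not_deleted = deleted_by == u64::MAX || deleted_by > reader_txn_id;
    created_visible && not_deleted
}

fn main() {
    // Row created by txn 5, still active (deleted_by sentinel = u64::MAX).
    assert!(row_visible(5, u64::MAX, 10));
    // Row deleted by txn 7 is invisible to txn 10 but visible to txn 6.
    assert!(!row_visible(5, 7, 10));
    assert!(row_visible(5, 7, 6));
}
```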
External Dependencies
LLKV relies on several external crates for core functionality:
| Dependency | Version | Purpose |
|---|---|---|
| arrow | 57.1.0 | Columnar data format, compute kernels |
| sqlparser | 0.59.0 | SQL parsing (supports multiple dialects) |
| simd-r-drive | 0.15.5-alpha | Memory-mapped key-value storage with SIMD optimization |
| rayon | 1.10.0 | Parallel processing (used in joins, aggregations) |
| croaring | 2.5.1 | Bitmap indexes for row ID sets |
Sources: Cargo.toml:40-49 Cargo.lock:126-143 Cargo.lock:671-687
Usage Example
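A minimal sketch of typical usage, based on the SqlEngine methods described above; the constructor argument, error type, and return types shown here are assumptions rather than the exact API.

```rust
// Hypothetical usage sketch based on the API surface described above
// (SqlEngine::new, execute, sql); argument and return types are assumptions.
use llkv_sql::SqlEngine;
use llkv_storage::MemPager;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an engine backed by an in-memory pager.
    let engine = SqlEngine::new(MemPager::default())?;

    // DDL and DML go through execute().
    engine.execute("CREATE TABLE users (id BIGINT, name TEXT)")?;
    engine.execute("INSERT INTO users VALUES (1, 'ada'), (2, 'grace')")?;

    // SELECT queries return Arrow RecordBatches via sql().
    let batches = engine.sql("SELECT id, name FROM users WHERE id > 1")?;
    let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    assert_eq!(total_rows, 1);
    Ok(())
}
```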
Sources: llkv-sql/src/sql_engine.rs:443-485
Summary
LLKV is a layered SQL database system that marries Apache Arrow’s columnar format with key-value storage. The architecture separates concerns across 15 crates, enabling modular development and testing. Queries flow from SQL text through parsing, planning, and execution stages before accessing columnar data persisted in a memory-mapped store. The system supports transactions, indexes, constraints, and SQL features including joins, aggregations, and subqueries.
For deeper exploration of specific subsystems, consult the following sections:
- Architecture - Detailed crate organization and dependencies
- SQL Interface - SQL preprocessing and dialect handling
- Query Execution - Execution strategies and optimizations
- Storage Layer - Column store implementation details
- Catalog and Metadata Management - Schema management and type system
Sources: Cargo.toml:1-109 llkv-sql/src/sql_engine.rs:1-100 llkv-executor/src/lib.rs:1-100
Architecture
Relevant source files
- .github/workflows/build.docs.yml
- Cargo.lock
- Cargo.toml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-executor/Cargo.toml
- llkv-join/Cargo.toml
- llkv-plan/Cargo.toml
- llkv-sql/Cargo.toml
- llkv-table/Cargo.toml
- llkv-test-utils/Cargo.toml
This page describes the overall architectural design of LLKV, including the layered system structure, key design decisions, and how major components interact. For detailed information about individual crates and their responsibilities, see Workspace and Crates. For the end-to-end query execution flow, see SQL Query Processing Pipeline. For details on Arrow integration and data representation, see Data Formats and Arrow Integration.
Architectural Overview
LLKV is a columnar SQL database that stores Apache Arrow data structures directly in a key-value persistence layer. The architecture consists of six distinct layers, each implemented as one or more Rust crates. The system translates SQL statements into query plans, executes those plans against columnar table storage, and persists data using memory-mapped key-value stores.
The core architectural innovation is the llkv-column-map layer, which bridges Apache Arrow’s in-memory columnar format with the simd-r-drive key-value storage backend. This design enables zero-copy operations on columnar data while maintaining ACID properties through the underlying storage engine.
Sources: Cargo.toml:1-109 high-level overview diagrams
System Layers
The following diagram shows the six architectural layers with their implementing crates and key data structures:
graph TB
subgraph "Layer 1: User Interface"
SQL["llkv-sql\nSqlEngine"]
DEMO["llkv-sql-pong-demo"]
TPCH["llkv-tpch"]
CSV["llkv-csv"]
end
subgraph "Layer 2: Query Processing"
PARSER["sqlparser-rs\nParser, Statement"]
PLANNER["llkv-plan\nSelectPlan, InsertPlan"]
EXPR["llkv-expr\nExpr, ScalarExpr"]
end
subgraph "Layer 3: Execution"
EXECUTOR["llkv-executor\nTableExecutor"]
RUNTIME["llkv-runtime\nDatabaseRuntime"]
AGGREGATE["llkv-aggregate\nAccumulator"]
JOIN["llkv-join\nhash_join"]
COMPUTE["llkv-compute\nNumericKernels"]
end
subgraph "Layer 4: Data Management"
TABLE["llkv-table\nTable, SysCatalog"]
TRANSACTION["llkv-transaction\nTransaction"]
SCAN["llkv-scan\nScanOp"]
end
subgraph "Layer 5: Storage - Arrow Native"
COLMAP["llkv-column-map\nColumnStore, ColumnDescriptor"]
STORAGE["llkv-storage\nPager trait"]
ARROW["arrow-array\nRecordBatch, ArrayRef"]
end
subgraph "Layer 6: Persistence - Key-Value"
PAGER["Pager implementations\nbatch_get, batch_put"]
SIMD["simd-r-drive\nRDrive, EntryHandle"]
end
SQL --> PARSER
PARSER --> PLANNER
PLANNER --> EXPR
EXPR --> EXECUTOR
EXECUTOR --> AGGREGATE
EXECUTOR --> JOIN
EXECUTOR --> COMPUTE
EXECUTOR --> RUNTIME
RUNTIME --> TABLE
RUNTIME --> TRANSACTION
TABLE --> SCAN
SCAN --> COLMAP
TABLE --> COLMAP
COLMAP --> ARROW
COLMAP --> STORAGE
STORAGE --> PAGER
PAGER --> SIMD
Each layer has well-defined responsibilities. Layer 1 provides user-facing interfaces. Layer 2 translates SQL into executable plans. Layer 3 executes those plans using specialized operators. Layer 4 manages logical tables and transactions. Layer 5 implements columnar storage using Arrow data structures. Layer 6 provides persistent key-value storage with memory-mapping.
Sources: Cargo.toml:2-26 llkv-sql/Cargo.toml:1-45 llkv-executor/Cargo.toml:1-48 llkv-table/Cargo.toml:1-72
Key Architectural Decisions
Arrow-Native Columnar Storage
The system uses Apache Arrow as its native in-memory data format. All data is represented as RecordBatch instances containing ArrayRef columns. This design decision enables:
- Zero-copy interoperability with Arrow-based analytics tools
- Vectorized computation using Arrow kernels
- Efficient memory layouts for SIMD operations
- Type-safe column operations through Arrow’s schema system
The arrow dependency (version 57.1.0) provides the foundation for all data operations.
Key-Value Persistence Backend
Rather than implementing a custom storage engine, LLKV persists data through the Pager trait abstraction defined in llkv-storage. The primary implementation uses simd-r-drive (version 0.15.5-alpha), a memory-mapped key-value store with SIMD-optimized operations.
This design separates logical data management from physical storage concerns. The Pager trait defines operations:
- alloc() - allocate new storage keys
- batch_get() - retrieve multiple values
- batch_put() - atomically write multiple key-value pairs
- free() - release storage keys
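A minimal sketch of such a batch-oriented trait is shown below; the associated types and exact signatures are assumptions rather than the actual llkv-storage definitions.

```rust
// Illustrative sketch of a batch-oriented pager trait; the real Pager trait
// in llkv-storage differs in its key, value, and error types.
type PhysicalKey = u64;

trait Pager {
    type Blob: AsRef<[u8]>;
    type Error;

    /// Allocate `count` fresh physical keys.
    fn alloc(&self, count: usize) -> Result<Vec<PhysicalKey>, Self::Error>;

    /// Fetch the blobs stored under the given keys, in order.
    fn batch_get(&self, keys: &[PhysicalKey]) -> Result<Vec<Option<Self::Blob>>, Self::Error>;

    /// Atomically persist a set of key/value pairs.
    fn batch_put(&self, entries: &[(PhysicalKey, Vec<u8>)]) -> Result<(), Self::Error>;

    /// Release keys that are no longer referenced.
    fn free(&self, keys: &[PhysicalKey]) -> Result<(), Self::Error>;
}
```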
Modular Crate Organization
The workspace contains 15+ specialized crates, each focused on a specific concern. This modularity enables:
- Independent testing and benchmarking per crate
- Clear dependency boundaries
- Parallel development across subsystems
- Selective feature compilation
Core crates follow a naming convention: llkv-{subsystem}. Foundational crates like llkv-types and llkv-result provide shared types and error handling used throughout the system.
MVCC Transaction Isolation
The system implements Multi-Version Concurrency Control through llkv-transaction. Each table row includes system columns created_by and deleted_by that track transaction visibility. The Transaction struct manages transaction state and visibility rules.
Sources: Cargo.toml:37-96 llkv-storage/Cargo.toml:1-48 llkv-table/Cargo.toml:20-41
Component Organization
The following diagram maps the workspace crates to their architectural roles:
Dependencies flow upward through the layers. Lower-level crates like llkv-types and llkv-storage have no dependencies on higher layers. The llkv-sql crate sits at the top, orchestrating all subsystems.
Sources: Cargo.toml:2-26 Cargo.toml:55-74
Data Flow Architecture
Data flows through the system in two primary patterns: write operations (INSERT, UPDATE, DELETE) and read operations (SELECT).
Write Path
Write operations follow a path from SQL parsing through plan creation, runtime execution, table operations, column store append, chunking/deduplication, serialization, and finally persistence via the Pager trait.
flowchart LR
SQL["SQL Statement"] --> PARSE["sqlparser\nparse()"]
PARSE --> PLAN["llkv-plan\nInsertPlan"]
PLAN --> RUNTIME["DatabaseRuntime\nexecute_insert_plan()"]
RUNTIME --> TABLE["Table\nappend()"]
TABLE --> COLMAP["ColumnStore\nappend()"]
COLMAP --> CHUNK["Chunking\nLWW deduplication"]
CHUNK --> SERIAL["Serialize\nArrow arrays"]
SERIAL --> PAGER["Pager\nbatch_put()"]
PAGER --> KV["simd-r-drive\nEntryHandle"]
The ColumnStore::append() method implements Last-Write-Wins semantics for upserts by detecting duplicate row IDs and replacing older values.
Read Path
Read operations use a two-phase approach: first collecting matching row IDs via predicate evaluation, then gathering column data for those rows. This minimizes data movement by filtering before gathering.
flowchart LR
SQL["SQL Statement"] --> PARSE["sqlparser\nparse()"]
PARSE --> PLAN["llkv-plan\nSelectPlan"]
PLAN --> EXECUTOR["TableExecutor\nexecute_select()"]
EXECUTOR --> FILTER["Phase 1:\nfilter_row_ids()"]
FILTER --> GATHER["Phase 2:\ngather_rows()"]
GATHER --> COLMAP["ColumnStore\ngather()"]
COLMAP --> PAGER["Pager\nbatch_get()"]
PAGER --> DESER["Deserialize\nArrow arrays"]
DESER --> BATCH["RecordBatch\nassembly"]
BATCH --> RESULT["Query Results"]
Sources: llkv-sql/Cargo.toml:20-38 llkv-executor/Cargo.toml:20-42 llkv-table/Cargo.toml:20-41
Storage Architecture: Arrow to Key-Value Bridge
The llkv-column-map crate implements the critical bridge between Arrow’s columnar format and key-value storage:
graph TB
subgraph "Logical Layer"
FIELD["LogicalFieldId\n(table_id, field_name)"]
ROWID["RowId\n(u64)"]
end
subgraph "Column Organization"
CATALOG["ColumnCatalog\nfield_id → ColumnDescriptor"]
DESC["ColumnDescriptor\nlinked list of chunks"]
META["ChunkMetadata\n(min, max, size, null_count)"]
end
subgraph "Physical Storage"
CHUNK["Data Chunks\nserialized Arrow arrays"]
RIDARRAY["RowId Arrays\nsorted u64 arrays"]
PKEY["Physical Keys\n(chunk_pk, rid_pk)"]
end
subgraph "Key-Value Layer"
ENTRY["EntryHandle\nbyte blobs"]
MMAP["Memory-Mapped Files"]
end
FIELD --> CATALOG
CATALOG --> DESC
DESC --> META
META --> CHUNK
CHUNK --> PKEY
DESC --> RIDARRAY
RIDARRAY --> PKEY
PKEY --> ENTRY
ENTRY --> MMAP
ROWID -.used for.-> RIDARRAY
The ColumnCatalog maps logical field identifiers to physical storage. Each column is represented by a ColumnDescriptor that maintains a linked list of data chunks. Each chunk contains:
- A serialized Arrow array (the actual column data)
- A corresponding sorted array of row IDs
- Metadata including min/max values for predicate pushdown
- Physical keys (chunk_pk, rid_pk) pointing to storage
The ChunkMetadata enables chunk pruning during scans: chunks whose min/max ranges don’t overlap with query predicates can be skipped entirely.
Data chunks are stored as serialized byte blobs accessed through EntryHandle instances from simd-r-drive. The storage layer uses memory-mapped files for efficient I/O.
Sources: llkv-column-map/Cargo.toml:1-65 llkv-storage/Cargo.toml:1-48 Cargo.toml:85-86
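The pruning check itself is a simple interval test. The sketch below uses assumed field names and a fixed value type; the real ChunkMetadata in llkv-column-map differs in detail.

```rust
// Illustrative chunk-pruning sketch; field names and the i64 value type are
// assumptions (the real ChunkMetadata also tracks size and null counts).
struct ChunkMetadata {
    min: i64,
    max: i64,
}

/// A chunk can be skipped for a predicate `column > threshold` when even its
/// maximum value fails the predicate, so no row inside can possibly match.
fn can_skip_for_greater_than(meta: &ChunkMetadata, threshold: i64) -> bool {
    meta.max <= threshold
}

/// Symmetric check for `column < threshold`.
fn can_skip_for_less_than(meta: &ChunkMetadata, threshold: i64) -> bool {
    meta.min >= threshold
}

fn main() {
    let chunk = ChunkMetadata { min: 10, max: 99 };
    assert!(can_skip_for_greater_than(&chunk, 100)); // no value exceeds 100
    assert!(!can_skip_for_greater_than(&chunk, 50)); // values above 50 may exist
    assert!(can_skip_for_less_than(&chunk, 10)); // no value is below 10
}
```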
System Catalog
The system catalog (table ID 0) stores metadata about all tables, columns, indexes, and constraints. It is itself stored in the same ColumnStore as user data, creating a self-describing bootstrapped system.
The SysCatalog struct provides typed access to catalog tables. The CatalogManager coordinates table lifecycle operations (CREATE, ALTER, DROP) by manipulating catalog entries.
Table ID ranges partition the namespace:
- ID 0: System catalog
- IDs 1-999: User tables
- IDs 1000+: Information schema views
- IDs 10000+: Temporary tables
This design allows the catalog to leverage the same storage, transaction, and query infrastructure as user data.
Sources: llkv-table/Cargo.toml:20-41
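Expressed in code, the partitioning above amounts to a range match; the enum and helper below are illustrative, not part of the llkv-table API.

```rust
// Illustrative mapping of table IDs to namespaces, per the ranges listed
// above. The enum and function names are hypothetical.
enum TableNamespace {
    SystemCatalog,
    UserTable,
    InformationSchema,
    Temporary,
}

fn classify_table_id(table_id: u64) -> TableNamespace {
    match table_id {
        0 => TableNamespace::SystemCatalog,
        1..=999 => TableNamespace::UserTable,
        1000..=9999 => TableNamespace::InformationSchema,
        _ => TableNamespace::Temporary,
    }
}

fn main() {
    assert!(matches!(classify_table_id(0), TableNamespace::SystemCatalog));
    assert!(matches!(classify_table_id(42), TableNamespace::UserTable));
    assert!(matches!(classify_table_id(1000), TableNamespace::InformationSchema));
    assert!(matches!(classify_table_id(10_042), TableNamespace::Temporary));
}
```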
Workspace and Crates
Relevant source files
- .github/workflows/build.docs.yml
- Cargo.lock
- Cargo.toml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-executor/Cargo.toml
- llkv-join/Cargo.toml
- llkv-plan/Cargo.toml
- llkv-sql/Cargo.toml
- llkv-table/Cargo.toml
- llkv-test-utils/Cargo.toml
Purpose and Scope
This document describes the modular workspace structure of the LLKV codebase, detailing the 22 crates that comprise the system, their responsibilities, and interdependencies. Each crate serves a specific architectural layer, enabling separation of concerns and independent development. For information about the overall system architecture and how these crates fit into the layered design, see Architecture. For details about specific subsystems like SQL processing or storage, refer to the respective sections in the table of contents.
Sources: Cargo.toml:1-109
Workspace Configuration
LLKV uses a Cargo workspace to manage multiple interdependent crates under a unified build system. The workspace is configured with resolver = "2", enabling the newer dependency resolver that provides more predictable builds across different platforms and feature combinations.
All workspace members share common metadata including version 0.8.5-alpha, Apache-2.0 license, and Rust edition 2024. Workspace-level lints enforce quality standards, prohibiting print statements to stdout/stderr and disallowing specific methods like Box::leak.
Sources: Cargo.toml:2-28 Cargo.toml:29-35 Cargo.toml:98-104
Crate Dependency Architecture
Sources: Cargo.toml:37-96 llkv-table/Cargo.toml:20-40 llkv-executor/Cargo.toml:20-41 llkv-sql/Cargo.toml:20-37
Foundation Crates
llkv-types
The llkv-types crate provides the shared type system used throughout LLKV. It defines fundamental data types, field identifiers, and type conversions that enable type-safe operations across all layers. This crate has no internal LLKV dependencies, making it the true foundation upon which other crates build.
Purpose: Shared type definitions and type system infrastructure
Key Exports: Type enums, field identifiers, type conversion utilities
Internal Dependencies: None
External Dependencies: Minimal (bitcode, serde)
Sources: Cargo.toml:74 llkv-table/Cargo.toml:33 llkv-executor/Cargo.toml:35
llkv-result
The llkv-result crate defines error types and result handling patterns used throughout the system. It provides a unified error type hierarchy that enables precise error reporting and recovery across all subsystem boundaries.
Purpose: Centralized error handling and result types
Key Exports: LlkvResult type alias, error enums
Internal Dependencies: None
External Dependencies: thiserror (for error derivation)
Sources: Cargo.toml:64 llkv-table/Cargo.toml:30 llkv-executor/Cargo.toml:30
llkv-storage
The llkv-storage crate abstracts the underlying key-value storage layer through the Pager trait. It provides batch operations (batch_get, batch_put, alloc, free) that enable efficient bulk access to the persistent storage backend. When the simd-r-drive-support feature is enabled, it provides an implementation using the SIMD-optimized memory-mapped key-value store.
Purpose: Storage abstraction layer and pager interface
Key Exports: Pager trait, storage implementations
Internal Dependencies: llkv-types, llkv-result
External Dependencies: simd-r-drive (conditional)
Sources: Cargo.toml:69 llkv-table/Cargo.toml:32 llkv-sql/Cargo.toml:29
Data Management Crates
llkv-column-map
The llkv-column-map crate implements the Arrow-native columnar storage engine. It manages data organized in chunks with metadata (min/max values, null counts) and provides operations like append (with Last-Write-Wins semantics), gather (row ID to RecordBatch assembly), filtering, and compaction. The ColumnStore and ColumnCatalog types bridge Apache Arrow’s in-memory format with the persistent key-value backend.
Purpose: Columnar data storage with Arrow integration
Key Exports: ColumnStore, ColumnCatalog, ColumnDescriptor
Internal Dependencies: llkv-types, llkv-storage
External Dependencies: arrow, croaring (for bitmap indexes)
Sources: Cargo.toml:57 llkv-table/Cargo.toml:26 llkv-executor/Cargo.toml:25 llkv-sql/Cargo.toml:22
llkv-table
The llkv-table crate provides the Table abstraction that represents a logical table with a schema. It integrates the columnar storage with catalog management, MVCC column injection, and scan operations. The SysCatalog within this crate stores metadata about all tables (Table 0 stores information about Tables 1+). The CatalogManager handles table lifecycle operations including create, alter, and drop.
Purpose: Table abstraction, catalog management, and metadata storage
Key Exports: Table, SysCatalog, CatalogManager, TableMeta, ColMeta
Internal Dependencies: llkv-types, llkv-result, llkv-storage, llkv-column-map, llkv-expr, llkv-scan, llkv-compute, llkv-plan
External Dependencies: arrow, arrow-array, arrow-schema, bitcode
Sources: Cargo.toml:70 llkv-table/Cargo.toml:1-72 llkv-executor/Cargo.toml:33 llkv-sql/Cargo.toml:30
llkv-scan
The llkv-scan crate implements table scanning operations with predicate evaluation. It provides efficient row filtering using vectorized operations and bitmap indexes, enabling predicate pushdown to minimize data movement during query execution.
Purpose: Table scan operations and predicate evaluation
Key Exports: Scan iterators, filter evaluation functions
Internal Dependencies: llkv-types, llkv-column-map, llkv-expr
External Dependencies: arrow, croaring
Sources: Cargo.toml:66 llkv-table/Cargo.toml:31 llkv-executor/Cargo.toml:31 llkv-plan/Cargo.toml:26
Expression and Computation Crates
llkv-expr
The llkv-expr crate defines the expression Abstract Syntax Tree (AST) used throughout query processing. It includes Expr for general expressions and ScalarExpr for scalar computations. The crate provides expression translation (string column names to FieldId identifiers), compilation into bytecode (EvalProgram, DomainProgram), and optimization passes.
Purpose: Expression AST, translation, and compilation
Key Exports: Expr, ScalarExpr, EvalProgram, expression translators
Internal Dependencies: llkv-types, llkv-result
External Dependencies: arrow, time (for date/time handling)
Sources: Cargo.toml:61 llkv-table/Cargo.toml:28 llkv-executor/Cargo.toml:27 llkv-sql/Cargo.toml:25 llkv-plan/Cargo.toml:24
llkv-compute
The llkv-compute crate implements vectorized compute kernels for scalar expression evaluation. It provides the NumericKernels system for optimized arithmetic, comparison, and logical operations on Arrow arrays. These kernels leverage SIMD instructions where possible for high-performance data processing.
Purpose: Vectorized compute kernels for expression evaluation
Key Exports: NumericKernels, scalar evaluation functions
Internal Dependencies: llkv-types, llkv-expr
External Dependencies: arrow
Sources: Cargo.toml:58 llkv-table/Cargo.toml:27 llkv-executor/Cargo.toml:26 llkv-sql/Cargo.toml:23 llkv-plan/Cargo.toml:23
Query Processing Crates
llkv-plan
The llkv-plan crate converts SQL Abstract Syntax Trees (from sqlparser-rs) into executable query plans. It defines plan structures including SelectPlan, InsertPlan, UpdatePlan, DeletePlan, and DDL plans. The planner handles subquery correlation tracking, expression binding, and plan optimization.
Purpose: SQL-to-plan conversion and query planning
Key Exports: SelectPlan, InsertPlan, plan builder types
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-compute, llkv-scan, llkv-storage, llkv-column-map
External Dependencies: arrow, sqlparser, regex, time
Sources: Cargo.toml:63 llkv-plan/Cargo.toml:1-42 llkv-executor/Cargo.toml:29 llkv-sql/Cargo.toml:26
llkv-executor
The llkv-executor crate executes query plans to produce results. It implements the TablePlanner and TableExecutor that optimize and execute table-level operations, including full table scans, filtered scans, aggregations, joins, and sorting. The executor uses parallel processing (via rayon) where beneficial and provides streaming result iteration.
Purpose: Query plan execution engine
Key Exports: TableExecutor, TablePlanner, execution functions
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-compute, llkv-plan, llkv-scan, llkv-table, llkv-storage, llkv-column-map, llkv-aggregate, llkv-join, llkv-threading
External Dependencies: arrow, rayon, croaring
Sources: Cargo.toml:60 llkv-executor/Cargo.toml:1-47 llkv-sql/Cargo.toml:24
llkv-aggregate
The llkv-aggregate crate implements aggregate function evaluation for GROUP BY queries. It provides accumulators for functions like SUM, AVG, COUNT, MIN, MAX, and handles DISTINCT aggregation using bitmap sets. The aggregation engine supports both hash-based and sort-based strategies.
Purpose: Aggregate function evaluation and accumulation
Key Exports: Accumulator types, aggregation operators
Internal Dependencies: llkv-types, llkv-expr, llkv-compute
External Dependencies: arrow, croaring
Sources: Cargo.toml:56 llkv-executor/Cargo.toml:24
llkv-join
The llkv-join crate implements table join operations using hash join algorithms. It supports INNER, LEFT, RIGHT, and FULL OUTER joins with optimizations for build/probe side selection and parallel hash table construction. The join implementation integrates with the table scan and filter systems for efficient multi-table queries.
Purpose: Table join operations and algorithms
Key Exports: Join operators, hash join implementation
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-table, llkv-storage, llkv-column-map, llkv-threading
External Dependencies: arrow, rayon, rustc-hash
Sources: Cargo.toml:62 llkv-join/Cargo.toml:1-44 llkv-executor/Cargo.toml:28
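The build/probe structure of a hash join can be sketched independently of Arrow. The function below illustrates the pattern over in-memory pairs only; llkv-join's actual implementation operates on Arrow arrays with parallel hash table construction.

```rust
use std::collections::HashMap;

// Minimal inner hash-join sketch; illustrates the build/probe pattern only,
// not llkv-join's Arrow-based implementation.
fn hash_join<K: std::hash::Hash + Eq + Clone, L: Clone, R: Clone>(
    build: &[(K, L)],
    probe: &[(K, R)],
) -> Vec<(K, L, R)> {
    // Build phase: index the (usually smaller) build side by key.
    let mut table: HashMap<&K, Vec<&L>> = HashMap::new();
    for (k, v) in build {
        table.entry(k).or_default().push(v);
    }
    // Probe phase: stream the probe side and emit matching combinations.
    let mut out = Vec::new();
    for (k, r) in probe {
        if let Some(matches) = table.get(k) {
            for l in matches {
                out.push((k.clone(), (*l).clone(), r.clone()));
            }
        }
    }
    out
}

fn main() {
    let users = [(1_u64, "ada"), (2, "grace")];
    let orders = [(1_u64, 9.99_f64), (1, 4.50), (3, 7.00)];
    let joined = hash_join(&users, &orders);
    assert_eq!(joined.len(), 2); // only user 1 has matching orders
}
```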
Runtime and Transaction Crates
llkv-runtime
The llkv-runtime crate provides the runtime engine that orchestrates query execution with transaction management. It integrates the executor, table management, and transaction systems, providing the high-level API for executing SQL statements within transactional contexts. The runtime manages catalog operations, schema validation, and result formatting.
Purpose: Runtime orchestration and transaction coordination
Key Exports: Runtime engine, execution context
Internal Dependencies: llkv-types, llkv-result, llkv-table, llkv-executor, llkv-transaction, llkv-storage
External Dependencies: arrow
Sources: Cargo.toml:65 llkv-sql/Cargo.toml:28
llkv-transaction
The llkv-transaction crate implements Multi-Version Concurrency Control (MVCC) for transactional semantics. It manages transaction identifiers, version visibility, and isolation. The system injects created_by and deleted_by columns into tables to track row versions, enabling snapshot isolation for concurrent queries.
Purpose: MVCC transaction management
Key Exports: Transaction manager, transaction ID generation
Internal Dependencies: llkv-types
External Dependencies: None (minimal)
Sources: Cargo.toml:73 llkv-sql/Cargo.toml:31
SQL Interface Crates
llkv-sql
The llkv-sql crate provides the primary SQL interface through the SqlEngine type. It handles SQL parsing (using sqlparser-rs), preprocessing (dialect normalization), plan building, and execution. The engine includes an INSERT buffering system that batches multiple INSERT statements for improved bulk insert performance. This is the main entry point for executing SQL statements against LLKV.
Purpose: SQL parsing, preprocessing, and execution interface
Key Exports: SqlEngine, SQL execution methods
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-plan, llkv-executor, llkv-runtime, llkv-table, llkv-storage, llkv-column-map, llkv-compute, llkv-transaction
External Dependencies: arrow, sqlparser, regex
Sources: Cargo.toml:68 llkv-sql/Cargo.toml:1-44
Utility and Support Crates
llkv-csv
The llkv-csv crate provides utilities for importing and exporting CSV data to/from LLKV tables. It handles schema inference, data type conversion, and batch processing for efficient bulk data operations.
Purpose: CSV import/export functionality
Key Exports: CSV reader/writer utilities
Internal Dependencies: llkv-table, llkv-types
External Dependencies: arrow, csv
Sources: Cargo.toml:59
llkv-test-utils
The llkv-test-utils crate provides testing utilities used across the workspace. When the auto-init feature is enabled, it automatically initializes tracing at test binary startup via the ctor crate, simplifying test debugging. This crate is marked as a development dependency in most other crates.
Purpose: Shared test utilities and test infrastructure
Key Exports: Test helpers, tracing initialization
Internal Dependencies: None
External Dependencies: tracing, tracing-subscriber, ctor (optional)
Sources: Cargo.toml:71 llkv-test-utils/Cargo.toml:1-34 llkv-table/Cargo.toml:45
llkv-threading
The llkv-threading crate provides threading utilities and abstractions used by parallelized operations in the executor and join crates. It wraps rayon patterns and provides consistent threading primitives across the codebase.
Purpose: Threading utilities and parallel processing abstractions
Key Exports: Thread pool management, parallel iterators
Internal Dependencies: llkv-types
External Dependencies: rayon, crossbeam-channel
Sources: Cargo.toml:72 llkv-executor/Cargo.toml:34 llkv-join/Cargo.toml:27
Application and Demo Crates
llkv (Main Library)
The root llkv crate serves as the main library crate, aggregating and re-exporting key types and functions from the specialized crates. It provides a unified API surface for external consumers of the LLKV database system.
Purpose: Main library aggregation and public API
Key Exports: Re-exports from llkv-sql and other core crates
Internal Dependencies: llkv-sql and other workspace crates
External Dependencies: Inherited from dependencies
Sources: Cargo.toml:5 Cargo.toml:55
llkv-sql-pong-demo
An interactive terminal-based demonstration application that showcases LLKV’s SQL capabilities through a Pong game scenario. The demo creates tables, inserts data, and executes queries to demonstrate SQL functionality in an engaging way.
Purpose: Interactive SQL demonstration
Key Exports: Demo application binary
Internal Dependencies: llkv-sql
External Dependencies: crossterm (for terminal UI)
Sources: Cargo.toml:4
llkv-slt-tester
The llkv-slt-tester crate implements a SQLLogicTest runner for LLKV. It executes standardized SQL test suites to verify correctness against established database behavior expectations, enabling regression testing and compatibility validation.
Purpose: SQLLogicTest execution and validation
Key Exports: Test runner binary
Internal Dependencies: llkv-sql
External Dependencies: sqllogictest, libtest-mimic
Sources: Cargo.toml:67
llkv-tpch
The llkv-tpch crate implements the TPC-H benchmark suite for LLKV. It generates TPC-H schema and data, executes the 22 standard TPC-H queries, and measures performance metrics. This crate is used for performance evaluation and regression testing.
Purpose: TPC-H benchmark execution
Key Exports: Benchmark runner binary
Internal Dependencies: llkv-sql, llkv-csv
External Dependencies: tpchgen
Sources: Cargo.toml:23
Workspace Dependencies Overview
The following table summarizes key external dependencies used across the workspace:
| Dependency | Version | Purpose |
|---|---|---|
| arrow | 57.1.0 | Columnar data format and operations |
| sqlparser | 0.59.0 | SQL parsing |
| simd-r-drive | 0.15.5-alpha | Key-value storage backend |
| rayon | 1.10.0 | Parallel processing |
| croaring | 2.5.1 | Bitmap indexes |
| bitcode | 0.6.7 | Fast binary serialization |
| time | 0.3.44 | Date/time handling |
| regex | 1.12.2 | Pattern matching |
| rustc-hash | 2.1.1 | Fast hash functions |
| thiserror | 2.0.17 | Error derivation |
Sources: Cargo.toml:37-96
Crate Interdependency Matrix
Sources: Cargo.toml:37-96 llkv-table/Cargo.toml:20-40 llkv-executor/Cargo.toml:20-41 llkv-sql/Cargo.toml:20-37 llkv-join/Cargo.toml:20-32 llkv-plan/Cargo.toml:20-36
Build Configuration and Features
The workspace defines several build profiles and configuration options:
- resolver = “2” : Uses Cargo’s newer dependency resolver for more consistent builds
- edition = “2024” : Uses the latest Rust edition (2024)
- samply profile : Inherits from release with debug symbols enabled for profiling
Workspace lints enforce code quality by denying print statements and specific unsafe operations. Individual crates may enable conditional features like simd-r-drive-support for the storage backend.
Sources: Cargo.toml:27 Cargo.toml:32 Cargo.toml:98-109 llkv-sql/Cargo.toml:40
SQL Query Processing Pipeline
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-executor/src/lib.rs
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-sql/src/sql_engine.rs
- llkv-test-utils/Cargo.toml
Purpose and Scope
This page describes the end-to-end flow of SQL query processing in LLKV, from raw SQL text input to Arrow RecordBatch results. It covers the major pipeline stages: parsing, preprocessing, planning, execution, and result formatting. For detailed information about specific components, see:
- SQL interface and the SqlEngine API: SQL Interface
- Plan structure details: Query Planning
- Expression translation and compilation: Expression System
- Execution implementation details: Query Execution
Pipeline Overview
The SQL query processing pipeline transforms user SQL statements through several stages before producing results:
Sources: llkv-sql/src/sql_engine.rs:1-100; see also the system architecture diagram in the Overview.
flowchart TB
Input["SQL Text\n(User Input)"]
Preprocess["SQL Preprocessing\npreprocess_* functions"]
Parse["SQL Parsing\nsqlparser::Parser"]
AST["sqlparser::ast::Statement"]
Plan["Query Planning\nbuild_*_plan functions"]
PlanStmt["PlanStatement"]
Execute["Execution\nRuntimeEngine::execute_statement"]
Result["RuntimeStatementResult"]
Batches["Vec<RecordBatch>"]
Input --> Preprocess
Preprocess --> Parse
Parse --> AST
AST --> Plan
Plan --> PlanStmt
PlanStmt --> Execute
Execute --> Result
Result --> Batches
style Input fill:#f9f9f9
style Batches fill:#f9f9f9
SQL Preprocessing Stage
Before parsing, LLKV applies several preprocessing transformations to normalize SQL syntax across different dialects (SQLite, DuckDB, PostgreSQL). This stage rewrites incompatible syntax into forms that sqlparser can handle.
Preprocessing Functions
| Function | Purpose | Example Transformation |
|---|---|---|
| preprocess_tpch_connect_syntax | Strip TPC-H CONNECT TO statements | CONNECT TO tpch; → (removed) |
| preprocess_create_type_syntax | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE int8 AS bigint → CREATE DOMAIN int8 AS bigint |
| preprocess_exclude_syntax | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
| preprocess_trailing_commas_in_values | Remove trailing commas in VALUES | VALUES (1,) → VALUES (1) |
| preprocess_empty_in_lists | Expand empty IN lists to boolean expressions | x IN () → (x = NULL AND 0 = 1) |
| preprocess_index_hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax | Convert standalone REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_sqlite_trigger_shorthand | Add explicit timing and FOR EACH ROW | CREATE TRIGGER ... BEGIN → CREATE TRIGGER ... AFTER ... FOR EACH ROW BEGIN |
| preprocess_bare_table_in_clauses | Convert bare table names in IN to subqueries | x IN table → x IN (SELECT * FROM table) |
The preprocessing pipeline is applied sequentially in SqlEngine::execute:
Sources: llkv-sql/src/sql_engine.rs:759-1006 llkv-sql/src/sql_engine.rs:1395-1460
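As a rough illustration of sequential application, chaining the passes might look like the following; the pass bodies are placeholders and the signatures are assumptions rather than llkv-sql's actual implementation.

```rust
// Illustrative sketch of chaining preprocessing passes over raw SQL text.
// The pass names mirror the table above; their bodies here are placeholders.
fn preprocess_trailing_commas_in_values(sql: String) -> String { sql }
fn preprocess_empty_in_lists(sql: String) -> String { sql }
fn preprocess_index_hints(sql: String) -> String { sql }

fn preprocess_sql_input(raw: &str) -> String {
    // Each pass receives the output of the previous one.
    let passes: &[fn(String) -> String] = &[
        preprocess_trailing_commas_in_values,
        preprocess_empty_in_lists,
        preprocess_index_hints,
    ];
    passes.iter().fold(raw.to_string(), |sql, pass| pass(sql))
}

fn main() {
    let normalized = preprocess_sql_input("SELECT 1");
    assert_eq!(normalized, "SELECT 1");
}
```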
Parsing Stage
After preprocessing, the SQL text is parsed using the sqlparser crate with a configurable recursion limit to handle deeply nested queries.
Parser Configuration
The parser is configured with:
- Dialect: GenericDialect (supports multiple SQL dialects)
- Recursion limit: 200 (set via the PARSER_RECURSION_LIMIT constant)
- Parameter tracking: thread-local ParameterScope for prepared statement placeholders
The parser produces a vector of sqlparser::ast::Statement objects. Each statement is then converted to a PlanStatement for execution.
Sources: llkv-sql/src/sql_engine.rs:395-400 llkv-sql/src/sql_engine.rs:1464-1481 llkv-sql/src/sql_engine.rs:223-256
Statement Type Dispatch
Once parsed, statements are dispatched to specialized planning functions based on their type. The SqlEngine maintains an optional InsertBuffer to batch consecutive literal INSERT statements for improved throughput.
Sources: llkv-sql/src/sql_engine.rs:1482-2800 llkv-sql/src/sql_engine.rs:486-547
flowchart TB
AST["sqlparser::ast::Statement"]
Dispatch{"Statement Type"}
CreateTable["build_create_table_plan"]
DropTable["build_drop_table_plan"]
AlterTable["build_alter_table_plan"]
CreateView["build_create_view_plan"]
DropView["build_drop_view_plan"]
CreateIndex["build_create_index_plan"]
DropIndex["build_drop_index_plan"]
Reindex["build_reindex_plan"]
RenameTable["build_rename_table_plan"]
Insert["build_insert_plan"]
InsertBuffer["InsertBuffer::push_statement\n(batching optimization)"]
Update["build_update_plan"]
Delete["build_delete_plan"]
Truncate["build_truncate_plan"]
Select["build_select_plan"]
Explain["wrap in Explain plan"]
Set["handle SET statement"]
Show["handle SHOW statement"]
Begin["handle BEGIN TRANSACTION"]
Commit["handle COMMIT"]
Rollback["handle ROLLBACK"]
Savepoint["handle SAVEPOINT"]
Release["handle RELEASE"]
Vacuum["handle VACUUM"]
Pragma["handle PRAGMA"]
PlanStmt["PlanStatement"]
AST --> Dispatch
Dispatch -->|CREATE TABLE| CreateTable
Dispatch -->|DROP TABLE| DropTable
Dispatch -->|ALTER TABLE| AlterTable
Dispatch -->|CREATE VIEW| CreateView
Dispatch -->|DROP VIEW| DropView
Dispatch -->|CREATE INDEX| CreateIndex
Dispatch -->|DROP INDEX| DropIndex
Dispatch -->|VACUUM REINDEX| Reindex
Dispatch -->|ALTER TABLE RENAME| RenameTable
Dispatch -->|INSERT| Insert
Insert -.may buffer.-> InsertBuffer
Dispatch -->|UPDATE| Update
Dispatch -->|DELETE| Delete
Dispatch -->|TRUNCATE| Truncate
Dispatch -->|SELECT| Select
Dispatch -->|EXPLAIN| Explain
Dispatch -->|SET| Set
Dispatch -->|SHOW| Show
Dispatch -->|BEGIN| Begin
Dispatch -->|COMMIT| Commit
Dispatch -->|ROLLBACK| Rollback
Dispatch -->|SAVEPOINT| Savepoint
Dispatch -->|RELEASE| Release
Dispatch -->|VACUUM| Vacuum
Dispatch -->|PRAGMA| Pragma
CreateTable --> PlanStmt
DropTable --> PlanStmt
AlterTable --> PlanStmt
CreateView --> PlanStmt
DropView --> PlanStmt
CreateIndex --> PlanStmt
DropIndex --> PlanStmt
Reindex --> PlanStmt
RenameTable --> PlanStmt
Insert --> PlanStmt
InsertBuffer --> PlanStmt
Update --> PlanStmt
Delete --> PlanStmt
Truncate --> PlanStmt
Select --> PlanStmt
Explain --> PlanStmt
Planning Stage
The planning stage converts sqlparser::ast::Statement nodes into typed PlanStatement objects. This involves:
- Column resolution: mapping string column names to FieldId identifiers via the system catalog
- Expression translation: converting sqlparser::ast::Expr to llkv_expr::expr::Expr and ScalarExpr
- Subquery tracking: recording correlated subqueries with placeholder bindings
- Validation: ensuring referenced tables and columns exist
SELECT Plan Construction
For SELECT statements, the planner builds a SelectPlan containing:
Sources: llkv-sql/src/sql_engine.rs:2801-3500 llkv-plan/src/plans.rs:705-850
Expression Translation
Expression translation converts SQL expressions into the llkv_expr AST, which uses FieldId instead of string column names. This enables efficient field access during execution.
Sources: llkv-plan/src/translation/expression.rs:1-500 llkv-sql/src/sql_engine.rs:4000-4500
flowchart LR
SQLExpr["sqlparser::ast::Expr\n(column names as strings)"]
Resolver["IdentifierResolver\n(catalog lookups)"]
Translation["translate_scalar / translate_predicate"]
LLKVExpr["llkv_expr::expr::ScalarExpr<FieldId>\nllkv_expr::expr::Expr<FieldId>"]
SQLExpr --> Translation
Resolver -.provides.-> Translation
Translation --> LLKVExpr
flowchart TB
PlanStmt["PlanStatement"]
Execute["RuntimeEngine::execute_statement"]
DDL{"Statement Type"}
CreateTableExec["execute_create_table\n(allocate table_id, register schema)"]
DropTableExec["execute_drop_table\n(remove from catalog)"]
AlterTableExec["execute_alter_table\n(modify schema)"]
CreateViewExec["execute_create_view\n(store view definition)"]
CreateIndexExec["execute_create_index\n(build index structures)"]
InsertExec["execute_insert\n(convert to RecordBatch, append)"]
UpdateExec["execute_update\n(filter + rewrite rows)"]
DeleteExec["execute_delete\n(mark rows deleted via MVCC)"]
TruncateExec["execute_truncate\n(clear all rows)"]
SelectExec["QueryExecutor::execute_select\n(scan, filter, project, aggregate)"]
Result["RuntimeStatementResult"]
PlanStmt --> Execute
Execute --> DDL
DDL -->|CreateTable| CreateTableExec
DDL -->|DropTable| DropTableExec
DDL -->|AlterTable| AlterTableExec
DDL -->|CreateView| CreateViewExec
DDL -->|CreateIndex| CreateIndexExec
DDL -->|Insert| InsertExec
DDL -->|Update| UpdateExec
DDL -->|Delete| DeleteExec
DDL -->|Truncate| TruncateExec
DDL -->|Select| SelectExec
CreateTableExec --> Result
DropTableExec --> Result
AlterTableExec --> Result
CreateViewExec --> Result
CreateIndexExec --> Result
InsertExec --> Result
UpdateExec --> Result
DeleteExec --> Result
TruncateExec --> Result
SelectExec --> Result
Execution Stage
The RuntimeEngine receives a PlanStatement and dispatches it to the appropriate execution handler:
Sources: llkv-runtime/src/lib.rs:1-500 llkv-sql/src/sql_engine.rs:706-745
flowchart TB
SelectPlan["SelectPlan"]
Executor["QueryExecutor::execute_select"]
Strategy{"Execution Strategy"}
NoTable["execute_select_without_table\n(constant projection)"]
Compound["execute_compound_select\n(UNION/EXCEPT/INTERSECT)"]
GroupBy["execute_group_by_single_table\n(hash aggregation)"]
CrossProduct["execute_cross_product\n(nested loop join)"]
Aggregates["execute_aggregates\n(full-table aggregation)"]
Projection["execute_projection\n(scan + filter + project)"]
ScanStream["Table::scan_stream\n(streaming batches)"]
FilterEval["filter_row_ids\n(predicate evaluation)"]
Gather["gather_rows\n(assemble RecordBatch)"]
Sort["lexsort_to_indices\n(ORDER BY)"]
LimitOffset["take + offset\n(LIMIT/OFFSET)"]
SelectExecution["SelectExecution\n(result wrapper)"]
SelectPlan --> Executor
Executor --> Strategy
Strategy -->|tables.is_empty| NoTable
Strategy -->|compound.is_some| Compound
Strategy -->|!group_by.is_empty| GroupBy
Strategy -->|tables.len > 1| CrossProduct
Strategy -->|!aggregates.is_empty| Aggregates
Strategy -->|single table| Projection
Projection --> ScanStream
ScanStream --> FilterEval
FilterEval --> Gather
Gather --> Sort
Sort --> LimitOffset
NoTable --> SelectExecution
Compound --> SelectExecution
GroupBy --> SelectExecution
CrossProduct --> SelectExecution
Aggregates --> SelectExecution
LimitOffset --> SelectExecution
SELECT Execution Flow
SELECT statement execution is the most complex path, involving multiple optimization strategies:
Sources: llkv-executor/src/lib.rs:519-563 llkv-executor/src/scan.rs:1-500
Two-Phase Execution for Filtering
When a WHERE clause is present, execution follows a two-phase approach:
- Phase 1: Row ID Collection
  - Evaluate predicates against stored data
  - Use chunk metadata for pruning (min/max values)
  - Build a bitmap of matching row IDs
- Phase 2: Data Gathering
  - Fetch only projected columns for matching rows
  - Assemble an Arrow RecordBatch from the gathered data
Sources: llkv-executor/src/scan.rs:200-400; see also the SELECT execution diagram in the Overview.
flowchart LR
RuntimeResult["RuntimeStatementResult"]
Branch{"Result Type"}
DDLResult["CreateTable/DropTable/etc\n(metadata only)"]
DMLResult["Insert/Update/Delete\n(row count)"]
SelectResult["Select\n(SelectExecution)"]
SelectExec["SelectExecution"]
Stream["stream(callback)\n(iterate batches)"]
IntoBatches["into_batches()\n(collect all)"]
IntoRows["into_rows()\n(materialize)"]
Batches["Vec<RecordBatch>"]
RuntimeResult --> Branch
Branch -->|DDL| DDLResult
Branch -->|DML| DMLResult
Branch -->|SELECT| SelectResult
SelectResult --> SelectExec
SelectExec --> Stream
SelectExec --> IntoBatches
SelectExec --> IntoRows
Stream --> Batches
IntoBatches --> Batches
Result Formatting
All execution paths ultimately produce a RuntimeStatementResult, which is converted to Arrow RecordBatch objects for SELECT queries:
The SelectExecution type provides three consumption modes:
| Method | Behavior | Use Case |
|---|---|---|
| stream(callback) | Iterate batches without allocating | Large result sets, streaming output |
| into_batches() | Collect all batches into Vec | Moderate result sets, need random access |
| into_rows() | Materialize as Vec<CanonicalRow> | Small result sets, need row-level operations |
Sources: llkv-executor/src/types/execution.rs:1-300 llkv-sql/src/sql_engine.rs:1100-1150
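The snippet below makes the streaming mode concrete using a stand-in type defined locally so the example compiles; the real SelectExecution in llkv-executor may differ in its method signatures.

```rust
use arrow::record_batch::RecordBatch;

// Stand-in for the SelectExecution type described above, defined here only to
// make the consumption pattern concrete; not the llkv-executor definition.
struct SelectExecution {
    batches: Vec<RecordBatch>,
}

impl SelectExecution {
    /// Streaming mode: visit each batch in turn without collecting them.
    fn stream(self, mut on_batch: impl FnMut(&RecordBatch)) {
        for batch in &self.batches {
            on_batch(batch);
        }
    }

    /// Collect mode: hand back every batch at once.
    fn into_batches(self) -> Vec<RecordBatch> {
        self.batches
    }
}

/// Count rows without holding every batch in a separate collection.
fn total_rows(execution: SelectExecution) -> usize {
    let mut rows = 0;
    execution.stream(|batch| rows += batch.num_rows());
    rows
}
```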
Optimization Points in the Pipeline
Several optimization strategies are applied throughout the pipeline:
1. SQL Preprocessing Optimizations
- Empty IN list rewriting: x IN () becomes constant false, enabling early elimination
- Index hint removal: allows the planner to make optimal index choices without user hints
Sources: llkv-sql/src/sql_engine.rs:835-856 llkv-sql/src/sql_engine.rs:858-875
2. INSERT Buffering
- Batching: consecutive literal INSERT statements are buffered until reaching MAX_BUFFERED_INSERT_ROWS (8192)
- Throughput: dramatically reduces planning overhead for bulk ingest workloads
- Semantics: disabled by default to preserve transactional visibility
Sources: llkv-sql/src/sql_engine.rs:486-547 llkv-sql/src/sql_engine.rs:1200-1300
3. Predicate Pushdown
- Chunk pruning : Min/max metadata on column chunks enables skipping irrelevant data
- Vectorized evaluation : SIMD instructions accelerate predicate evaluation within chunks
Sources: llkv-executor/src/scan.rs:300-400
4. Projection Pushdown
- Lazy column loading : Only requested columns are fetched from storage
- Computed projection caching : Identical expressions are evaluated once and reused
Sources: llkv-executor/src/lib.rs:469-501
5. Fast Paths
- Constant SELECT : Queries without tables avoid storage access entirely
- Full table scan : Queries without predicates stream directly without bitmap construction
Sources: llkv-executor/src/lib.rs:533-534
Parameter Binding for Prepared Statements
The pipeline supports parameterized queries via PreparedStatement:
Parameter placeholders are normalized during preprocessing:
- ?, ?N, $N: Positional parameters (1-indexed)
- :name: Named parameters (converted to indices)
Sources: llkv-sql/src/sql_engine.rs:78-283 llkv-sql/src/sql_engine.rs:354-373
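To make the named-to-positional conversion concrete, the sketch below rewrites :name placeholders into indexed ones. It is illustrative only (it ignores string literals, for example) and is not llkv-sql's implementation.

```rust
use std::collections::HashMap;

// Illustrative sketch of normalizing named placeholders (":name") into
// 1-indexed positional parameters ("$N"); not llkv-sql's implementation.
fn normalize_named_params(sql: &str) -> (String, HashMap<String, usize>) {
    let mut out = String::new();
    let mut indices: HashMap<String, usize> = HashMap::new();
    let mut chars = sql.chars().peekable();
    while let Some(c) = chars.next() {
        if c == ':' && chars.peek().map_or(false, |n| n.is_alphabetic()) {
            // Collect the parameter name following the colon.
            let mut name = String::new();
            while let Some(&n) = chars.peek() {
                if n.is_alphanumeric() || n == '_' {
                    name.push(n);
                    chars.next();
                } else {
                    break;
                }
            }
            // Reuse the same index when a name repeats.
            let next_index = indices.len() + 1;
            let idx = *indices.entry(name).or_insert(next_index);
            out.push_str(&format!("${idx}"));
        } else {
            out.push(c);
        }
    }
    (out, indices)
}

fn main() {
    let (sql, map) = normalize_named_params("SELECT * FROM t WHERE a = :x AND b = :y AND c = :x");
    assert_eq!(sql, "SELECT * FROM t WHERE a = $1 AND b = $2 AND c = $1");
    assert_eq!(map.len(), 2);
}
```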
Summary
The SQL query processing pipeline in LLKV follows a clear multi-stage architecture:
- Preprocessing: Normalize SQL dialect differences
- Parsing: Convert text to AST (sqlparser)
- Planning: Build a typed PlanStatement with column resolution
- Execution: Execute via RuntimeEngine and QueryExecutor
- Result Formatting: Return Arrow RecordBatch objects
Key design principles:
- Dialect flexibility: Preprocessing enables SQLite, DuckDB, and PostgreSQL syntax
- Type safety: Early resolution of column names to FieldId prevents runtime errors
- Streaming execution: Large result sets never require full materialization
- Optimization opportunities: Metadata pruning, SIMD evaluation, and projection pushdown reduce data movement
For implementation details of specific stages, refer to the linked subsystem pages.
Data Formats and Arrow Integration
Relevant source files
- Cargo.lock
- Cargo.toml
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/projection.rs
- llkv-csv/src/writer.rs
- llkv-sql/tests/pager_io_tests.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/table.rs
This page explains how Apache Arrow serves as the foundational columnar data format throughout LLKV, covering the types supported, serialization strategies, and integration points across the system. Arrow provides the in-memory representation for all data operations, from SQL query results to storage layer batches.
For information about how Arrow data flows through query execution, see Query Execution. For details on how the storage layer persists Arrow arrays, see Column Storage and ColumnStore.
RecordBatch as the Universal Data Container
The arrow::record_batch::RecordBatch type is the primary data exchange unit across all LLKV layers. A RecordBatch represents a columnar slice of data with a fixed schema, where each column is an ArrayRef (type-erased Arrow array).
Sources: llkv-table/src/table.rs:6-9 llkv-table/examples/test_streaming.rs:1-9
graph TB
RecordBatch["RecordBatch"]
Schema["Schema\n(arrow::datatypes::Schema)"]
Columns["Vec<ArrayRef>\n(arrow::array::ArrayRef)"]
RecordBatch --> Schema
RecordBatch --> Columns
Schema --> Fields["Vec<Field>\nColumn Names + Types"]
Columns --> Array1["Column 0: ArrayRef"]
Columns --> Array2["Column 1: ArrayRef"]
Columns --> Array3["Column N: ArrayRef"]
Array1 --> UInt64Array["UInt64Array\n(row_id column)"]
Array2 --> Int64Array["Int64Array\n(user data)"]
Array3 --> StringArray["StringArray\n(text data)"]
RecordBatch Construction Pattern
All data inserted into tables must arrive as a RecordBatch with specific metadata:
| Required Element | Description | Metadata Key |
|---|---|---|
| row_id column | UInt64Array with unique row identifiers | None (system column) |
| Field ID metadata | Maps each user column to a FieldId | "field_id" |
| Arrow data type | Native Arrow type (Int64, Utf8, etc.) | N/A |
| Column name | Human-readable identifier | Field name |
Sources: llkv-table/src/table.rs:231-438 llkv-table/examples/test_streaming.rs:18-58
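The following sketch shows this construction pattern using the arrow crate directly; the column names and the field_id value of "1" are placeholders, but the metadata key matches the field_id convention described in the next section.

```rust
use std::{collections::HashMap, sync::Arc};
use arrow::array::{ArrayRef, Int64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    // Attach the "field_id" metadata entry to the user column (value is illustrative).
    let mut field_meta = HashMap::new();
    field_meta.insert("field_id".to_string(), "1".to_string());

    let schema = Arc::new(Schema::new(vec![
        // System row_id column: no field_id metadata.
        Field::new("row_id", DataType::UInt64, false),
        Field::new("score", DataType::Int64, true).with_metadata(field_meta),
    ]));

    let columns: Vec<ArrayRef> = vec![
        Arc::new(UInt64Array::from(vec![0u64, 1, 2])),
        Arc::new(Int64Array::from(vec![Some(10), None, Some(30)])),
    ];

    // The resulting batch is the shape that table append paths expect.
    RecordBatch::try_new(schema, columns)
}
```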
Schema and Field Metadata
The arrow::datatypes::Schema describes the structure of a RecordBatch. LLKV extends Arrow’s schema system with custom metadata to track column identity and table ownership.
Field Metadata Convention
Each user column’s Field carries a field_id metadata entry that encodes either:
- User columns : A raw FieldId (0-2047) that gets namespaced to a LogicalFieldId
- MVCC columns : A pre-computed LogicalFieldId for created_by / deleted_by
Sources: llkv-table/src/table.rs:243-320 llkv-table/constants.rs (FIELD_ID_META_KEY)
graph TB
Field["arrow::datatypes::Field"]
FieldName["name: String"]
DataType["data_type: DataType"]
Nullable["nullable: bool"]
Metadata["metadata: HashMap<String, String>"]
Field --> FieldName
Field --> DataType
Field --> Nullable
Field --> Metadata
Metadata --> FieldIdKey["'field_id' → '42'\n(user column)"]
Metadata --> LFidKey["'field_id' → '8589934634'\n(MVCC column)"]
FieldIdKey -.converted to.-> UserLFid["LogicalFieldId::for_user(table_id, 42)"]
LFidKey -.direct use.-> SystemLFid["LogicalFieldId::for_mvcc_created_by(table_id)"]
Schema Reconstruction from Catalog
The Table::schema() method reconstructs an Arrow Schema by querying the system catalog for column names and combining them with type information from the column store:
Sources: llkv-table/src/table.rs:519-549
sequenceDiagram
participant Caller
participant Table
participant ColumnStore
participant SysCatalog
Caller->>Table: schema()
Table->>ColumnStore: user_field_ids_for_table(table_id)
ColumnStore-->>Table: Vec<LogicalFieldId>
Table->>SysCatalog: get_cols_meta(field_ids)
SysCatalog-->>Table: Vec<ColMeta>
Table->>ColumnStore: data_type(lfid) for each
ColumnStore-->>Table: DataType
Table->>Table: Build Field with metadata
Table-->>Caller: Arc<Schema>
Supported Arrow Data Types
LLKV’s column store supports the following Arrow data types for user columns:
| Arrow Type | Rust Type Mapping | Storage Encoding | Notes |
|---|---|---|---|
| UInt64 | u64 | Native | Row IDs, MVCC transaction IDs |
| UInt32 | u32 | Native | - |
| UInt16 | u16 | Native | - |
| UInt8 | u8 | Native | - |
| Int64 | i64 | Native | - |
| Int32 | i32 | Native | - |
| Int16 | i16 | Native | - |
| Int8 | i8 | Native | - |
| Float64 | f64 | IEEE 754 | - |
| Float32 | f32 | IEEE 754 | - |
| Date32 | i32 (days since epoch) | Native | - |
| Date64 | i64 (ms since epoch) | Native | - |
| Decimal128 | i128 | Native + precision/scale metadata | Fixed-point decimal |
| Utf8 | String (i32 offsets) | Length-prefixed UTF-8 | Variable-length strings |
| LargeUtf8 | String (i64 offsets) | Length-prefixed UTF-8 | Large strings |
| Binary | Vec<u8> (i32 offsets) | Length-prefixed bytes | Variable-length binary |
| LargeBinary | Vec<u8> (i64 offsets) | Length-prefixed bytes | Large binary |
| Boolean | bool | Bit-packed | - |
| Struct | Nested fields | Passthrough (no specialized gather) | Limited support |
Sources: llkv-column-map/src/store/projection.rs:135-185 llkv-column-map/src/gather.rs:10-18
Type Dispatch Strategy
The column store uses Rust enums to dispatch operations per data type, ensuring type-safe handling without runtime reflection:
Sources: llkv-column-map/src/store/projection.rs:99-236 llkv-column-map/src/store/projection.rs:237-395
graph TB
ColumnOutputBuilder["ColumnOutputBuilder\n(enum per type)"]
ColumnOutputBuilder --> Utf8["Utf8 { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> LargeUtf8["LargeUtf8 { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> Binary["Binary { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> LargeBinary["LargeBinary { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> Boolean["Boolean { builder, len_capacity }"]
ColumnOutputBuilder --> Decimal128["Decimal128 { builder, precision, scale }"]
ColumnOutputBuilder --> Primitive["Primitive(PrimitiveBuilderKind)"]
ColumnOutputBuilder --> Passthrough["Passthrough (Struct arrays)"]
Primitive --> UInt64Builder["UInt64 { builder, len_capacity }"]
Primitive --> Int64Builder["Int64 { builder, len_capacity }"]
Primitive --> Float64Builder["Float64 { builder, len_capacity }"]
Primitive --> OtherPrimitives["... (10 more primitive types)"]
Array Builders and Construction
Arrow provides builder types (arrow::array::*Builder) for incrementally constructing arrays. LLKV’s projection layer wraps these builders with capacity management to minimize allocations during multi-batch operations.
Capacity Pre-Allocation Strategy
The ColumnOutputBuilder tracks allocated capacity separately from Arrow’s builder to avoid repeated re-allocations:
Sources: llkv-column-map/src/store/projection.rs:189-234 llkv-column-map/src/store/projection.rs:398-457
graph LR
RequestedLen["Requested Len"]
ValueBytesHint["Value Bytes Hint"]
RequestedLen --> Check["len_capacity < len?"]
ValueBytesHint --> Check2["value_capacity < bytes?"]
Check -->|Yes| Reallocate["Recreate Builder\nwith_capacity(len, bytes)"]
Check2 -->|Yes| Reallocate
Check -->|No| Reuse["Reuse Existing Builder"]
Check2 -->|No| Reuse
Reallocate --> UpdateCapacity["Update cached capacities"]
UpdateCapacity --> Append["append_value() or append_null()"]
Reuse --> Append
String and Binary Builder Pattern
Variable-length types require two capacity dimensions:
- Length capacity: Number of values
- Value capacity: Total bytes across all values
Sources: llkv-column-map/src/store/projection.rs:398-411 llkv-column-map/src/store/projection.rs:413-426
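For illustration (this is plain Arrow API, not the ColumnOutputBuilder itself), a string builder is created with both capacity dimensions up front:

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, StringBuilder};

fn main() {
    // Reserve room for 1,024 values totalling roughly 64 KiB of string data.
    let mut builder = StringBuilder::with_capacity(1_024, 64 * 1024);
    builder.append_value("hello");
    builder.append_null();
    let array: ArrayRef = Arc::new(builder.finish());
    assert_eq!(array.len(), 2);
}
```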
graph TB
subgraph "In-Memory Layer"
RecordBatch1["RecordBatch\n(user input)"]
ArrayRef1["ArrayRef\n(Int64Array, StringArray, etc.)"]
end
subgraph "Serialization Layer"
Serialize["serialize_array()\n(Arrow IPC encoding)"]
Deserialize["deserialize_array()\n(Arrow IPC decoding)"]
end
subgraph "Storage Layer"
ChunkBlob["EntryHandle\n(byte blob)"]
Pager["Pager::batch_put()"]
PhysicalKey["PhysicalKey\n(u64 identifier)"]
end
RecordBatch1 --> ArrayRef1
ArrayRef1 --> Serialize
Serialize --> ChunkBlob
ChunkBlob --> Pager
Pager --> PhysicalKey
PhysicalKey -.read.-> Pager
Pager -.read.-> ChunkBlob
ChunkBlob --> Deserialize
Deserialize --> ArrayRef1
Serialization and Storage Integration
Arrow arrays are serialized to byte blobs using Arrow’s IPC format before persistence in the key-value store. The column store maintains a separation between logical Arrow representation and physical storage:
Sources: llkv-column-map/src/serialization.rs (deserialize_array function), llkv-column-map/src/store/projection.rs:17
Chunk Organization
Large columns are split into multiple chunks (default 65,536 rows per chunk). Each chunk stores:
- Value array: Serialized Arrow array with actual data
- Row ID array: Separate UInt64Array mapping row IDs to array indices
Sources: llkv-column-map/src/store/descriptor.rs (ChunkMetadata, ColumnDescriptor), llkv-column-map/src/store/mod.rs (DEFAULT_CHUNK_TARGET_ROWS)
graph TB
GatherNullPolicy["GatherNullPolicy"]
GatherNullPolicy --> ErrorOnMissing["ErrorOnMissing\nMissing row → Error"]
GatherNullPolicy --> IncludeNulls["IncludeNulls\nMissing row → NULL in output"]
GatherNullPolicy --> DropNulls["DropNulls\nMissing row → omit from output"]
ErrorOnMissing -.used by.-> InnerJoin["Inner Joins\nStrict row matches"]
IncludeNulls -.used by.-> OuterJoin["Outer Joins\nNULL padding"]
DropNulls -.used by.-> Projection["Filtered Projections\nRemove incomplete rows"]
Null Handling
Arrow’s nullable columns use a separate validity bitmap. LLKV’s GatherNullPolicy enum controls how nulls propagate during row gathering:
Sources: llkv-column-map/src/store/projection.rs:40-48 llkv-table/src/table.rs:589-645
Null Bitmap Semantics
Arrow arrays maintain a separate NullBuffer (bit-packed). When gathering rows, the system must:
- Check whether the source array position is valid (array.is_valid(idx))
- Append null to the builder if invalid, or the value if valid
Sources: llkv-column-map/src/gather.rs:369-402
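A simplified sketch of that per-row validity check (not the actual gather kernel in llkv-column-map):

```rust
use arrow::array::{Array, Int64Array, Int64Builder};

/// Copy `positions` out of `source`, preserving Arrow validity.
fn gather_int64(source: &Int64Array, positions: &[usize]) -> Int64Array {
    let mut out = Int64Builder::with_capacity(positions.len());
    for &idx in positions {
        if source.is_valid(idx) {
            out.append_value(source.value(idx));
        } else {
            out.append_null();
        }
    }
    out.finish()
}
```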
graph LR
subgraph "CSV Export"
ScanStream["Table::scan_stream()"]
RecordBatch2["RecordBatch stream"]
ArrowCSV["arrow::csv::Writer"]
CSVFile["CSV File"]
ScanStream --> RecordBatch2
RecordBatch2 --> ArrowCSV
ArrowCSV --> CSVFile
end
subgraph "CSV Import"
CSVInput["CSV File"]
Parser["csv-core parser"]
Builders["Array Builders"]
RecordBatch3["RecordBatch"]
TableAppend["Table::append()"]
CSVInput --> Parser
Parser --> Builders
Builders --> RecordBatch3
RecordBatch3 --> TableAppend
end
Arrow Integration Points
CSV Import and Export
The llkv-csv crate uses Arrow’s native CSV writer (arrow::csv::WriterBuilder) for exports and custom parsing for imports:
Sources: llkv-csv/src/writer.rs:196-267 llkv-csv/src/reader.rs
SQL Query Results
All SQL query results are returned as RecordBatch instances, enabling zero-copy integration with Arrow-based tools:
| Statement Type | Output Format | Arrow Schema Source |
|---|---|---|
| SELECT | Stream of RecordBatch | Query projection list |
| INSERT | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| UPDATE | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| DELETE | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| CREATE TABLE | RecordBatch (success) | Empty schema or metadata |
Sources: llkv-runtime/src/lib.rs (RuntimeStatementResult enum), llkv-sql/src/lib.rs (SqlEngine::execute)
graph TB
AggregateExpr["Aggregate Expression\n(SUM, AVG, MIN, MAX)"]
AggregateExpr --> ArrowKernel["arrow::compute kernels\n(sum, min, max)"]
AggregateExpr --> CustomAccumulator["Custom Accumulator\n(COUNT DISTINCT, etc.)"]
ArrowKernel --> PrimitiveArray["PrimitiveArray<T>"]
CustomAccumulator --> AnyArray["ArrayRef (any type)"]
PrimitiveArray -.vectorized ops.-> SIMDPath["SIMD-accelerated compute"]
AnyArray -.type-erased dispatch.-> SlowPath["Generic accumulation"]
Arrow-Native Aggregations
Aggregate functions operate directly on Arrow arrays using arrow::compute kernels where possible:
Sources: llkv-aggregate/src/lib.rs (aggregate accumulators), llkv-compute/src/kernels.rs (NumericKernels)
Data Type Conversion Table
The following table maps SQL types to Arrow types as materialized in the system:
| SQL Type | Arrow Type | Notes |
|---|---|---|
| INTEGER, INT | Int64 | 64-bit signed |
| BIGINT | Int64 | Same as INTEGER |
| SMALLINT | Int32 | Promoted to Int32 internally |
| TINYINT | Int32 | Promoted to Int32 internally |
| FLOAT, REAL | Float32 | Single precision |
| DOUBLE | Float64 | Double precision |
| TEXT, VARCHAR | Utf8 | Variable-length, i32 offsets |
| BLOB | Binary | Variable-length bytes |
| BOOLEAN, BOOL | Boolean | Bit-packed |
| DATE | Date32 | Days since Unix epoch |
| TIMESTAMP | Date64 | Milliseconds since Unix epoch |
| DECIMAL(p, s) | Decimal128(p, s) | Fixed-point, stores precision/scale |
Sources: llkv-plan/src/types.rs (SQL type mapping), llkv-types/src/arrow.rs (DataType conversion)
RecordBatch Flow Summary
Sources: llkv-table/src/table.rs:231-438 llkv-column-map/src/store/mod.rs (append logic)
Performance Considerations
Vectorization Benefits
Arrow’s columnar layout enables SIMD operations in compute kernels:
- Primitive aggregations: arrow::compute::sum, min, and max use SIMD
- Filtering: Bit-packed boolean arrays enable fast predicate evaluation
- Null bitmap checks: Batch validity checks avoid per-row branches
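A small example of the aggregate kernels mentioned above:

```rust
use arrow::array::Int64Array;
use arrow::compute;

fn main() {
    let values = Int64Array::from(vec![Some(1), None, Some(3)]);
    // Nulls are skipped by the kernels; both calls scan the column as a whole.
    assert_eq!(compute::sum(&values), Some(4));
    assert_eq!(compute::max(&values), Some(3));
}
```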
Memory Efficiency
Columnar storage provides superior compression and cache locality:
- Compression ratio: Homogeneous columns compress better than row-based formats
- Cache efficiency: Scanning a single column loads contiguous memory
- Null optimization: Sparse nulls use minimal space (1 bit per value)
graph TB
GatherContextPool["GatherContextPool"]
GatherContextPool --> AcquireContext["acquire(field_ids)"]
AcquireContext --> CheckCache["Check HashMap\nfor cached context"]
CheckCache -->|Hit| ReuseContext["Reuse MultiGatherContext"]
CheckCache -->|Miss| CreateNew["Create new context"]
ReuseContext --> Execute["Execute gather operation"]
CreateNew --> Execute
Execute --> ReleaseContext["Release context back to pool"]
ReleaseContext --> StoreInPool["Store in HashMap\n(max 4 per field set)"]
Builder Reuse Strategy
The MultiGatherContext pools ColumnOutputBuilder instances to avoid repeated allocations during streaming scans:
Sources: llkv-column-map/src/store/projection.rs:651-691
Key Takeaways:
- RecordBatch is the universal data container for all operations
- Field metadata tracks column identity via field_id keys
- Type dispatch uses Rust enums to handle 18+ Arrow data types efficiently
- Serialization uses Arrow IPC format for persistence
- Null handling respects Arrow’s validity bitmaps with configurable policies
- Integration is native: CSV, SQL results, and storage all speak Arrow
SQL Interface
Relevant source files
- llkv-executor/src/lib.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT.md
- llkv-tpch/src/lib.rs
- llkv-tpch/src/main.rs
- llkv-tpch/src/queries.rs
The SQL Interface provides the primary user-facing API for interacting with LLKV databases through SQL statements. This layer is responsible for parsing SQL text, preprocessing dialect-specific syntax, translating Abstract Syntax Trees (AST) into execution plans, and delegating to the runtime engine for actual execution.
For information about query planning and execution, see Query Planning and Query Execution. For details on the underlying runtime that executes plans, see Architecture.
Purpose and Scope
The SQL Interface layer (llkv-sql crate) serves as the entry point for SQL-based database operations. It bridges the gap between SQL text written by users and the Arrow-native columnar storage engine, handling:
- SQL Parsing : Converting SQL strings into Abstract Syntax Trees using sqlparser-rs
- Dialect Normalization : Preprocessing SQL to handle syntax variations from SQLite, DuckDB, and PostgreSQL
- Statement Translation : Converting ASTs into typed execution plans (PlanStatement structures)
- Execution Coordination : Delegating plans to the RuntimeEngine and formatting results
- Transaction Management : Coordinating multi-statement transactions with MVCC support
- Performance Optimization : Batching INSERT statements to reduce planning overhead
The SQL Interface does not execute queries directly—it delegates all data operations to the runtime and executor layers. Its primary responsibility is accurate SQL-to-plan translation while preserving user intent across different SQL dialects.
Sources : llkv-sql/src/lib.rs:1-52 llkv-sql/src/sql_engine.rs:1-60
Architecture Overview
The SQL Interface operates as a stateful wrapper around the RuntimeEngine, maintaining session state, prepared statement caches, and insert buffering state. The core workflow proceeds through several stages:
SQL Interface Processing Pipeline
graph TB
User["User Application"]
SqlEngine["SqlEngine"]
Preprocess["SQL Preprocessing"]
Parser["sqlparser-rs\nGenericDialect"]
Translator["Statement Translation"]
Runtime["RuntimeEngine"]
Results["RecordBatch / RowCount"]
User -->|execute sql| SqlEngine
SqlEngine -->|1. Preprocess| Preprocess
Preprocess -->|Normalized SQL| Parser
Parser -->|AST Statement| Translator
Translator -->|PlanStatement| Runtime
Runtime -->|RuntimeStatementResult| SqlEngine
SqlEngine -->|Vec<RecordBatch>| Results
Results -->|Query Results| User
subgraph "llkv-sql Crate"
SqlEngine
Preprocess
Translator
end
subgraph "External Dependencies"
Parser
end
subgraph "llkv-runtime Crate"
Runtime
end
The SqlEngine maintains several internal subsystems:
- Statement Cache : Thread-safe prepared statement storage (RwLock<FxHashMap<String, Arc<PreparedPlan>>>)
- Insert Buffer : Cross-statement INSERT batching for bulk ingest (RefCell<Option<InsertBuffer>>)
- Session State : Transaction context and configuration flags
- Information Schema Cache : Lazy metadata refresh tracking
Sources : llkv-sql/src/sql_engine.rs:572-621 (see Diagram 1 in the Overview)
SqlEngine Structure
The SqlEngine struct encapsulates all SQL processing state and exposes methods for executing SQL statements:
SqlEngine Class Structure
classDiagram
class SqlEngine {-engine: RuntimeEngine\n-default_nulls_first: AtomicBool\n-insert_buffer: RefCell~Option~InsertBuffer~~\n-insert_buffering_enabled: AtomicBool\n-information_schema_ready: AtomicBool\n-statement_cache: RwLock~FxHashMap~String PreparedPlan~~\n+new(pager) SqlEngine\n+with_context(context, nulls_first) SqlEngine\n+execute(sql) Result~Vec~RuntimeStatementResult~~\n+sql(sql) Result~Vec~RecordBatch~~\n+prepare(sql) Result~PreparedStatement~\n+execute_prepared(stmt, params) Result~Vec~RuntimeStatementResult~~\n+session() RuntimeSession\n+runtime_context() Arc~RuntimeContext~}
class RuntimeEngine {+execute_statement(plan) Result~RuntimeStatementResult~\n+context() Arc~RuntimeContext~}
class PreparedStatement {-inner: Arc~PreparedPlan~\n+parameter_count() usize}
class InsertBuffer {-table_name: String\n-columns: Vec~String~\n-rows: Vec~Vec~PlanValue~~\n-on_conflict: InsertConflictAction\n+can_accept(table, cols, conflict) bool\n+should_flush() bool}
SqlEngine --> RuntimeEngine : delegates to
SqlEngine --> PreparedStatement : creates
SqlEngine --> InsertBuffer : maintains
Sources : llkv-sql/src/sql_engine.rs:572-621 llkv-sql/src/sql_engine.rs:354-373 llkv-sql/src/sql_engine.rs:487-547
Statement Execution Flow
A typical SQL execution proceeds through the following phases:
Statement Execution Sequence
Sources : llkv-sql/src/sql_engine.rs:1533-1633 llkv-sql/src/sql_engine.rs:2134-2285
SQL Preprocessing System
Before parsing, the SQL Interface applies multiple preprocessing passes to normalize syntax variations across different SQL dialects:
| Preprocessor | Purpose | Example Transformation |
|---|---|---|
| TPCH Connect Syntax | Strip TPC-H CONNECT TO directives | CONNECT TO db; → removed |
| CREATE TYPE Syntax | Convert DuckDB type aliases | CREATE TYPE t AS INT → CREATE DOMAIN t AS INT |
| EXCLUDE Syntax | Quote qualified names in EXCLUDE | EXCLUDE (schema.col) → EXCLUDE ("schema.col") |
| Trailing Commas | Remove DuckDB-style trailing commas | VALUES (1,) → VALUES (1) |
| Empty IN Lists | Convert degenerate IN expressions | col IN () → (col = NULL AND 0 = 1) |
| Index Hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| REINDEX Syntax | Convert SQLite REINDEX | REINDEX idx → VACUUM REINDEX idx |
| Trigger Shorthand | Expand SQLite trigger defaults | CREATE TRIGGER tr ... → adds AFTER, FOR EACH ROW |
| Bare Table IN | Expand SQLite table subqueries | col IN tablename → col IN (SELECT * FROM tablename) |
Each preprocessor runs as a regex-based rewrite pass before the SQL text reaches sqlparser-rs. This allows LLKV to accept SQL from multiple dialects while using a single parser.
Sources : llkv-sql/src/sql_engine.rs:759-1006 llkv-sql/src/sql_engine.rs:771-793
Prepared Statements and Parameters
The SQL Interface supports prepared statements with parameterized queries. Parameters use placeholder syntax compatible with multiple dialects:
Parameter Processing Flow
graph LR
subgraph "Parameter Placeholder Formats"
Q1["? \n(anonymous)"]
Q2["?N \n(numbered)"]
Dollar["$N \n(PostgreSQL)"]
Named[":name \n(named)"]
end
subgraph "Parameter Processing"
Register["register_placeholder(raw)"]
State["ParameterState"]
Index["assigned: HashMap<String, usize>"]
end
subgraph "Execution"
Prepare["SqlEngine::prepare(sql)"]
PrepStmt["PreparedStatement"]
Execute["execute_prepared(stmt, params)"]
Bind["Bind params to plan"]
end
Q1 --> Register
Q2 --> Register
Dollar --> Register
Named --> Register
Register --> State
State --> Index
Index --> Prepare
Prepare --> PrepStmt
PrepStmt --> Execute
Execute --> Bind
Parameter placeholders are registered during parsing and replaced with internal sentinel values (__llkv_param__N__). When executing a prepared statement, the sentinel values are substituted with actual parameter values before plan execution.
Sources : llkv-sql/src/sql_engine.rs:78-282 llkv-sql/src/sql_engine.rs:354-373 llkv-sql/src/sql_engine.rs:1707-1773
INSERT Buffering Optimization
To improve bulk insert performance, the SQL Interface can buffer multiple consecutive INSERT ... VALUES statements and execute them as a single batched operation. This dramatically reduces planning overhead for large data loads:
INSERT Buffer State Machine
stateDiagram-v2
[*] --> Empty : engine created
Empty --> Buffering : INSERT with VALUES
Buffering --> Buffering : compatible INSERT
Buffering --> Flushing : incompatible statement
Buffering --> Flushing : buffer threshold reached
Flushing --> Empty : execute batched insert
Flushing --> Buffering : new INSERT after flush
Empty --> [*]
note right of Buffering
Accumulate rows while:
- Same table
- Same columns
- Same conflict action
- Below MAX_BUFFERED_INSERT_ROWS
end note
note right of Flushing
Execute single INSERT with
all accumulated rows
end note
Buffering is disabled by default to preserve per-statement semantics for unit tests. Long-running workloads can enable buffering via set_insert_buffering(true) to achieve significant performance gains on bulk ingests.
Sources : llkv-sql/src/sql_engine.rs:487-547 llkv-sql/src/sql_engine.rs:2134-2285
graph TD
AST["sqlparser::ast::Statement"]
subgraph "DDL Statements"
CreateTable["CreateTable"]
AlterTable["AlterTable"]
DropTable["DropTable"]
CreateView["CreateView"]
CreateIndex["CreateIndex"]
end
subgraph "DML Statements"
Query["Query (SELECT)"]
Insert["Insert"]
Update["Update"]
Delete["Delete"]
end
subgraph "Transaction Control"
Begin["BEGIN"]
Commit["COMMIT"]
Rollback["ROLLBACK"]
end
subgraph "PlanStatement Variants"
SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
UpdatePlan["UpdatePlan"]
DeletePlan["DeletePlan"]
CreateTablePlan["CreateTablePlan"]
AlterTablePlan["AlterTablePlan"]
BeginTxn["BeginTransaction"]
CommitTxn["CommitTransaction"]
end
AST --> CreateTable
AST --> AlterTable
AST --> DropTable
AST --> Query
AST --> Insert
AST --> Update
AST --> Delete
AST --> Begin
AST --> Commit
CreateTable --> CreateTablePlan
AlterTable --> AlterTablePlan
Query --> SelectPlan
Insert --> InsertPlan
Update --> UpdatePlan
Delete --> DeletePlan
Begin --> BeginTxn
Commit --> CommitTxn
Statement Translation Process
Once SQL is parsed into an AST, the SQL Interface translates each Statement variant into a corresponding PlanStatement:
AST to Plan Translation
Translation involves:
- Identifier Resolution : Converting string column names to FieldId values via IdentifierResolver
- Expression Translation : Building typed expression trees from AST nodes
- Type Inference : Determining result types for computed columns
- Validation : Checking for unknown tables, duplicate columns, type mismatches
Sources : llkv-sql/src/sql_engine.rs:2287-3500 llkv-executor/src/lib.rs:89-93
Integration with Runtime and Executor
The SQL Interface delegates all actual execution to lower layers:
Layer Integration Diagram
The SQL Interface owns the RuntimeEngine instance and delegates PlanStatement execution to it. The runtime coordinates with the executor for query execution and the table layer for DDL operations.
Sources : llkv-sql/src/sql_engine.rs:706-745 llkv-runtime/src/lib.rs (implied), and Diagram 1 in the Overview
Example Usage
Basic usage of the SqlEngine:
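A minimal sketch is shown below; the module paths and the MemPager constructor are assumptions, so consult the crate sources for exact signatures.

```rust
use std::sync::Arc;
use llkv_sql::SqlEngine;
use llkv_storage::pager::MemPager; // path assumed

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fresh in-memory database instance.
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // DDL and DML go through execute().
    engine.execute("CREATE TABLE users (id INTEGER, name TEXT)")?;
    engine.execute("INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob')")?;

    // SELECT queries return Arrow RecordBatches via sql().
    for batch in engine.sql("SELECT id, name FROM users ORDER BY id")? {
        println!("{} rows", batch.num_rows());
    }
    Ok(())
}
```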
Sources : llkv-sql/src/sql_engine.rs:443-485 llkv-sql/src/lib.rs:1-52
Key Design Principles
The SQL Interface embodies several architectural decisions:
- Dialect Tolerance : Accept SQL from multiple databases through preprocessing rather than forking the parser
- Lazy Validation : Validate table existence and column types during plan translation, not parsing
- Stateful Optimization : Maintain cross-statement state (insert buffer, prepared statement cache) for performance
- Delegate Execution : Never directly manipulate storage; always route through runtime abstractions
- Arrow Native : Return query results as Arrow RecordBatch structures for zero-copy integration
- Session Isolation : Each SqlEngine instance owns its transaction state and constraint enforcement mode
These principles allow the SQL Interface to serve as a stable, predictable API surface while the underlying execution engine evolves independently.
Sources : llkv-sql/src/sql_engine.rs:1-60 llkv-sql/src/lib.rs:1-52 (see Diagram 2 in the Overview)
SqlEngine API
Relevant source files
- llkv-executor/src/lib.rs
- llkv-sql/src/lib.rs
- llkv-sql/src/sql_engine.rs
- llkv-tpch/Cargo.toml
- llkv-tpch/DRAFT.md
- llkv-tpch/src/lib.rs
- llkv-tpch/src/main.rs
- llkv-tpch/src/queries.rs
This document describes the SqlEngine struct, which is the primary user-facing interface for executing SQL statements in LLKV. The SqlEngine parses SQL text using sqlparser-rs, converts it into execution plans, and delegates to the RuntimeEngine for execution. For details on how SQL syntax is normalized before parsing, see SQL Preprocessing and Dialect Handling. For information about the INSERT batching optimization, see INSERT Buffering System.
Sources: llkv-sql/src/sql_engine.rs:441-486 llkv-sql/src/lib.rs:1-52
SqlEngine Struct and Core Components
The SqlEngine struct wraps a RuntimeEngine and adds SQL parsing, statement caching, INSERT buffering, and dialect preprocessing capabilities.
Diagram: SqlEngine component structure and user interaction
graph TB
SqlEngine["SqlEngine"]
RuntimeEngine["RuntimeEngine"]
RuntimeContext["RuntimeContext<Pager>"]
InsertBuffer["InsertBuffer"]
StatementCache["Statement Cache"]
InfoSchema["Information Schema"]
SqlEngine -->|contains| RuntimeEngine
SqlEngine -->|maintains| InsertBuffer
SqlEngine -->|maintains| StatementCache
SqlEngine -->|tracks| InfoSchema
RuntimeEngine -->|contains| RuntimeContext
RuntimeContext -->|manages| Tables["Tables"]
RuntimeContext -->|manages| Catalog["System Catalog"]
User["User Code"] -->|execute sql| SqlEngine
User -->|prepare sql| SqlEngine
User -->|sql query| SqlEngine
SqlEngine -->|execute_statement| RuntimeEngine
RuntimeEngine -->|Result| SqlEngine
SqlEngine -->|RecordBatch[]| User
Sources: llkv-sql/src/sql_engine.rs:572-620
SqlEngine Fields
| Field | Type | Purpose |
|---|---|---|
| engine | RuntimeEngine | Core execution engine that manages tables, catalog, and transactions |
| default_nulls_first | AtomicBool | Controls default sort order for NULL values in ORDER BY clauses |
| insert_buffer | RefCell<Option<InsertBuffer>> | Accumulates literal INSERT statements for batch execution |
| insert_buffering_enabled | AtomicBool | Feature flag controlling cross-statement INSERT batching |
| information_schema_ready | AtomicBool | Tracks whether the information_schema has been initialized |
| statement_cache | RwLock<FxHashMap<String, Arc<PreparedPlan>>> | Cache for prepared statement plans |
Sources: llkv-sql/src/sql_engine.rs:572-586
Construction and Configuration
Creating a SqlEngine
Diagram: SqlEngine construction paths
Sources: llkv-sql/src/sql_engine.rs:751-757 llkv-sql/src/sql_engine.rs:627-633
Constructor Methods
| Method | Behavior | Use Case |
|---|---|---|
| new<Pg>(pager: Arc<Pg>) | Creates an engine with a new RuntimeContext | Standard constructor for fresh database instances |
| with_context(context, default_nulls_first) | Creates an engine from an existing context | Used when sharing storage across engine instances |
Configuration Options:
- default_nulls_first : Controls whether NULL values sort first (true) or last (false) in ORDER BY clauses when NULLS FIRST / NULLS LAST is not explicitly specified
- INSERT buffering : Disabled by default; enable via set_insert_buffering(true) for bulk loading workloads
Sources: llkv-sql/src/sql_engine.rs:751-757 llkv-sql/src/sql_engine.rs:627-633 llkv-sql/src/sql_engine.rs:1431-1449
Statement Execution
Primary Execution Methods
Diagram: Statement execution flow through SqlEngine
Sources: llkv-sql/src/sql_engine.rs:1109-1194 llkv-sql/src/sql_engine.rs:1196-1236 llkv-sql/src/sql_engine.rs:1715-1765
Execution Method Details
| Method | Return Type | Use Case |
|---|---|---|
| execute(&self, sql: &str) | Vec<RuntimeStatementResult> | Execute one or more SQL statements (DDL, DML, SELECT) |
| sql(&self, query: &str) | Vec<RecordBatch> | Execute a single SELECT query and return Arrow results |
| execute_single(&self, sql: &str) | RuntimeStatementResult | Execute exactly one statement, error if multiple |
| prepare(&self, sql: &str) | PreparedStatement | Parse and cache a parameterized statement |
| execute_prepared(&self, stmt, params) | Vec<RuntimeStatementResult> | Execute a prepared statement with bound parameters |
Sources: llkv-sql/src/sql_engine.rs:1109-1194 llkv-sql/src/sql_engine.rs:1196-1236 llkv-sql/src/sql_engine.rs:1238-1272 llkv-sql/src/sql_engine.rs:1715-1765 llkv-sql/src/sql_engine.rs:1767-1816
RuntimeStatementResult Variants
The RuntimeStatementResult enum represents the outcome of executing different statement types:
| Variant | Fields | Produced By |
|---|---|---|
| CreateTable | table_name: String | CREATE TABLE |
| CreateView | view_name: String | CREATE VIEW |
| CreateIndex | index_name: String, table_name: String | CREATE INDEX |
| DropTable | table_name: String | DROP TABLE |
| DropView | view_name: String | DROP VIEW |
| DropIndex | index_name: String | DROP INDEX |
| Insert | table_name: String, rows_inserted: usize | INSERT |
| Update | table_name: String, rows_updated: usize | UPDATE |
| Delete | table_name: String, rows_deleted: usize | DELETE |
| Select | execution: SelectExecution<P> | SELECT (when using execute()) |
| BeginTransaction | kind: TransactionKind | BEGIN TRANSACTION |
| CommitTransaction | | COMMIT |
| RollbackTransaction | | ROLLBACK |
| AlterTable | table_name: String | ALTER TABLE |
| Truncate | table_name: String | TRUNCATE |
| Reindex | index_name: Option<String> | REINDEX |
Sources: llkv-runtime/src/lib.rs (inferred from context), llkv-sql/src/lib.rs:50
Prepared Statements and Parameter Binding
Parameter Placeholder Syntax
LLKV supports multiple parameter placeholder styles for prepared statements:
| Style | Example | Description |
|---|---|---|
| Positional (?) | SELECT * FROM t WHERE id = ? | Auto-numbered starting from 1 |
| Numbered (?) | SELECT * FROM t WHERE id = ?2 | Explicit position |
| Numbered ($) | SELECT * FROM t WHERE id = $1 | PostgreSQL style |
| Named (:) | SELECT * FROM t WHERE name = :username | Named parameters |
Sources: llkv-sql/src/sql_engine.rs:94-133 llkv-sql/src/sql_engine.rs:258-268
Parameter Binding Flow
Diagram: Prepared statement lifecycle with parameter binding
Sources: llkv-sql/src/sql_engine.rs:1715-1765 llkv-sql/src/sql_engine.rs:1767-1816 llkv-sql/src/sql_engine.rs:223-256
SqlParamValue Types
The SqlParamValue enum represents typed parameter values:
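The exact variant list lives in llkv-sql/src/sql_engine.rs:285-316; a plausible shape, for illustration only, is:

```rust
// Hypothetical sketch of the enum's shape; consult the source for the real variants.
pub enum SqlParamValue {
    Null,
    Integer(i64),
    Float(f64),
    Text(String),
}
```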
Conversion methods:
- as_literal(&self) -> Literal : Converts to the internal Literal type for expression evaluation
- as_plan_value(&self) -> PlanValue : Converts to PlanValue for INSERT operations
Sources: llkv-sql/src/sql_engine.rs:285-316
graph TB
subgraph "Statement Processing"
Parse["parse_statement()"]
Canonical["canonicalize_insert()"]
PreparedInsert["PreparedInsert"]
end
subgraph "Buffer Management"
BufferDecision{"Can Buffer?"}
BufferAdd["buffer.push_statement()"]
BufferFlush["flush_buffer()"]
BufferCheck{"buffer.should_flush()?"}
end
subgraph "Execution"
BuildPlan["InsertPlan::from_buffer"]
Execute["execute_insert_plan()"]
end
Parse -->|INSERT statement| Canonical
Canonical -->|VALUES only| PreparedInsert
PreparedInsert --> BufferDecision
BufferDecision -->|Yes: same table, same columns, same on_conflict| BufferAdd
BufferDecision -->|No: different params| BufferFlush
BufferAdd --> BufferCheck
BufferCheck -->|Yes: >= 8192 rows| BufferFlush
BufferCheck -->|No: continue| Continue["Continue processing"]
BufferFlush --> BuildPlan
BuildPlan --> Execute
Execute -->|Multiple Insert results| Results["Vec<RuntimeStatementResult>"]
INSERT Buffering System
The INSERT buffering system batches multiple literal INSERT ... VALUES statements together to reduce planning overhead during bulk loading operations.
Buffering Architecture
Diagram: INSERT buffering decision tree and execution flow
Sources: llkv-sql/src/sql_engine.rs:486-546 llkv-sql/src/sql_engine.rs:1869-2008
InsertBuffer Configuration
| Constant/Field | Value/Type | Purpose |
|---|---|---|
| MAX_BUFFERED_INSERT_ROWS | 8192 | Maximum rows to accumulate before forcing a flush |
| insert_buffering_enabled | AtomicBool | Global enable/disable flag (default: false) |
| table_name | String | Target table for buffered inserts |
| columns | Vec<String> | Column list (must match across statements) |
| on_conflict | InsertConflictAction | Conflict resolution strategy (must match) |
| statement_row_counts | Vec<usize> | Tracks row counts per original statement |
| rows | Vec<Vec<PlanValue>> | Accumulated literal rows |
Buffer acceptance rules:
- Table name must match exactly
- Column list must match exactly
- on_conflict action must match
- Only literal VALUES are buffered (subqueries flush immediately)
Flush triggers:
- Row count reaches MAX_BUFFERED_INSERT_ROWS
- Next INSERT targets a different table, column list, or conflict action
- Non-INSERT statement encountered
- execute() completes (flush remaining)
- SqlEngine is dropped
Sources: llkv-sql/src/sql_engine.rs:486-546 llkv-sql/src/sql_engine.rs:1431-1449
graph LR
Input["Raw SQL String"]
TPCH["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
TrailingComma["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
TriggerShorthand["preprocess_sqlite_trigger_shorthand"]
BareTable["preprocess_bare_table_in_clauses"]
Parser["sqlparser::Parser"]
AST["Vec<Statement>"]
Input --> TPCH
TPCH --> CreateType
CreateType --> Exclude
Exclude --> TrailingComma
TrailingComma --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
Reindex --> TriggerShorthand
TriggerShorthand --> BareTable
BareTable --> Parser
Parser --> AST
SQL Preprocessing
Before parsing with sqlparser, the SQL text undergoes multiple normalization passes to handle dialect-specific syntax variations. This allows LLKV to accept SQL from SQLite, DuckDB, PostgreSQL, and other sources.
Preprocessing Pipeline
Diagram: SQL preprocessing transformation pipeline
Sources: llkv-sql/src/sql_engine.rs:1024-1103
Preprocessing Transformations
| Method | Transformation | Example |
|---|---|---|
| preprocess_tpch_connect_syntax | Strip CONNECT TO <db>; | CONNECT TO tpch; → (removed) |
| preprocess_create_type_syntax | CREATE TYPE → CREATE DOMAIN | CREATE TYPE myint AS INTEGER → CREATE DOMAIN myint AS INTEGER |
| preprocess_exclude_syntax | Quote qualified EXCLUDE identifiers | EXCLUDE (schema.table.col) → EXCLUDE ("schema.table.col") |
| preprocess_trailing_commas_in_values | Remove trailing commas | VALUES (1, 2,) → VALUES (1, 2) |
| preprocess_empty_in_lists | Convert empty IN to boolean expr | col IN () → (col = NULL AND 0 = 1) |
| preprocess_index_hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax | Convert to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_sqlite_trigger_shorthand | Add explicit timing/FOR EACH ROW | CREATE TRIGGER t INSERT ON t → CREATE TRIGGER t AFTER INSERT ON t FOR EACH ROW |
| preprocess_bare_table_in_clauses | Wrap table in subquery | col IN tablename → col IN (SELECT * FROM tablename) |
Sources: llkv-sql/src/sql_engine.rs:759-1006
Session and Transaction Management
Accessing the RuntimeSession
The SqlEngine::session() method provides access to the underlying RuntimeSession, which manages transaction state and constraint enforcement:
Key session methods:
- begin_transaction(kind) : Start a new transaction
- commit_transaction() : Commit the current transaction
- rollback_transaction() : Abort the current transaction
- set_constraint_enforcement_mode(mode) : Control when constraints are checked
- execute_insert_plan(plan) : Direct plan execution (bypasses SQL parsing)
- execute_select_plan(plan) : Direct SELECT execution
Sources: llkv-sql/src/sql_engine.rs:1412-1424
Constraint Enforcement Modes
The session supports two constraint enforcement modes:
| Mode | Behavior |
|---|---|
| ConstraintEnforcementMode::Immediate | Constraints checked after each statement |
| ConstraintEnforcementMode::Deferred | Constraints checked only at transaction commit |
Deferred mode is useful for bulk loading when referential integrity violations may occur temporarily during the load process.
Sources: llkv-table/src/catalog/manager.rs (inferred from TpchToolkit usage), llkv-tpch/src/lib.rs:274-278
stateDiagram-v2
[*] --> Uninitialized: SqlEngine::new()
Uninitialized --> Checking : Query references information_schema
Checking --> Initializing : ensure_information_schema_ready()
Initializing --> Ready : refresh_information_schema()
Ready --> Ready : Subsequent queries
Ready --> Invalidated : DDL statement (CREATE/DROP TABLE, etc.)
Invalidated --> Checking : Next information_schema query
Checking --> Ready : Already initialized
Information Schema
The SqlEngine lazily initializes the information_schema tables on first access to avoid startup overhead. Querying information_schema.tables, information_schema.columns, or other information schema views triggers initialization.
Information Schema Lifecycle
Diagram: Information schema initialization and invalidation lifecycle
Key methods:
- ensure_information_schema_ready() : Initialize if not already ready
- invalidate_information_schema() : Mark as stale after schema changes
- refresh_information_schema() : Delegate to RuntimeEngine for a rebuild
Invalidation triggers:
- CREATE TABLE, DROP TABLE
- ALTER TABLE
- CREATE VIEW, DROP VIEW
Sources: llkv-sql/src/sql_engine.rs:648-660 llkv-sql/src/sql_engine.rs:706-745
graph LR
Execute["execute_plan_statement"]
RuntimeEngine["RuntimeEngine::execute_statement"]
Error["Error::NotFound or\nInvalidArgumentError"]
MapError["map_table_error"]
UserError["Error::CatalogError:\n'Table does not exist'"]
Execute -->|delegates| RuntimeEngine
RuntimeEngine -->|returns| Error
Execute -->|table name available| MapError
MapError -->|transforms| UserError
UserError -->|propagates| User["User Code"]
Error Handling and Diagnostics
Error Mapping
The SqlEngine transforms generic storage errors into user-friendly SQL errors, particularly for table lookup failures:
Diagram: Error mapping for table lookup failures
Error transformation rules:
- Error::NotFound → Error::CatalogError("Table '...' does not exist")
- Generic "unknown table" messages → Catalog error with table name
- View-related operations (CREATE VIEW, DROP VIEW) skip mapping
Sources: llkv-sql/src/sql_engine.rs:677-745
Statement Expectations (Test Harness)
For test infrastructure, the SqlEngine supports registering expected outcomes for statements:
Expectation types:
- StatementExpectation::Ok : Statement should succeed
- StatementExpectation::Error : Statement should fail
- StatementExpectation::Count(n) : Statement should affect n rows
Sources: llkv-sql/src/sql_engine.rs:65-391
Advanced Usage Examples
Bulk Loading with INSERT Buffering
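A hedged sketch of a bulk-load loop is shown below; the table, the row source, the error type, and the exact SqlEngine generics are simplifications.

```rust
use llkv_sql::SqlEngine;

// Sketch only: SqlEngine may be generic over the pager type in the real crate.
fn bulk_load(engine: &SqlEngine, rows: &[(u64, String)]) -> Result<(), Box<dyn std::error::Error>> {
    // Opt in to cross-statement batching for the duration of the load.
    engine.set_insert_buffering(true);
    for (id, name) in rows {
        engine.execute(&format!("INSERT INTO users (id, name) VALUES ({id}, '{name}')"))?;
    }
    // Turning buffering back off (or dropping the engine) flushes any remaining rows.
    engine.set_insert_buffering(false);
    Ok(())
}
```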
Sources: llkv-sql/src/sql_engine.rs:1431-1449 llkv-sql/src/sql_engine.rs:2010-2027
Prepared Statements with Named Parameters
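A sketch of the prepare/bind/execute cycle; the SqlParamValue constructor and the slice-based parameter passing are illustrative, not verified signatures.

```rust
// Prepare once; :min_id is assigned parameter index 1 internally.
let stmt = engine.prepare("SELECT id, name FROM users WHERE id >= :min_id")?;

// Bind a value for index 1 and execute (variant name is illustrative).
let results = engine.execute_prepared(&stmt, &[SqlParamValue::Integer(10)])?;
```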
Sources: llkv-sql/src/sql_engine.rs:1715-1816
Direct Plan Execution (Advanced)
For performance-critical code paths, bypass SQL parsing by executing plans directly:
Sources: llkv-sql/src/sql_engine.rs:1412-1424 llkv-runtime/src/lib.rs (InsertPlan structure inferred)
Key Constants
| Constant | Value | Purpose |
|---|---|---|
| PARSER_RECURSION_LIMIT | 200 | Maximum recursion depth for sqlparser (prevents stack overflow) |
| MAX_BUFFERED_INSERT_ROWS | 8192 | Flush threshold for INSERT buffering |
| PARAM_SENTINEL_PREFIX | "__llkv_param__" | Internal marker for parameter placeholders |
| PARAM_SENTINEL_SUFFIX | "__" | Internal marker suffix |
| DROPPED_TABLE_TRANSACTION_ERR | "another transaction has dropped this table" | Error message for concurrent table drops |
Sources: llkv-sql/src/sql_engine.rs:78-588
SQL Preprocessing and Dialect Handling
Relevant source files
This document describes the SQL preprocessing layer that normalizes SQL syntax from multiple dialects (SQLite, DuckDB, PostgreSQL) before parsing. The preprocessor transforms dialect-specific syntax into forms that the sqlparser-rs library can parse, enabling broad compatibility across different SQL variants.
For information about the overall SQL execution flow, see SqlEngine API. For details on how SQL parameters are bound to prepared statements, see Plan Structures.
Purpose and Architecture
The SQL preprocessing system transforms SQL text through a series of regex-based rewriting rules before passing it to sqlparser-rs. This architecture allows LLKV to accept SQL from various dialects without forking the parser library or implementing a custom parser.
Key components :
- SqlEngine::preprocess_sql_input() : main preprocessing orchestrator
- Individual preprocessing functions for each dialect feature
- Thread-local parameter state tracking
- Regex-based pattern matching and replacement
flowchart LR
RawSQL["Raw SQL Text\n(SQLite/DuckDB/PostgreSQL)"]
Preprocess["preprocess_sql_input()"]
subgraph "Preprocessing Steps"
TPCH["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → DOMAIN"]
Exclude["EXCLUDE qualifier handling"]
Trailing["Trailing comma removal"]
BareTable["Bare table IN conversion"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint removal"]
Reindex["REINDEX normalization"]
Trigger["Trigger shorthand expansion"]
end
Parser["sqlparser::Parser"]
AST["Statement AST"]
RawSQL --> Preprocess
Preprocess --> TPCH
TPCH --> CreateType
CreateType --> Exclude
Exclude --> Trailing
Trailing --> BareTable
BareTable --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
Reindex --> Trigger
Trigger --> Parser
Parser --> AST
The preprocessing layer sits between raw SQL input and the generic SQL parser:
Sources : llkv-sql/src/sql_engine.rs:1116-1125
Preprocessing Pipeline
The preprocess_sql_input() function applies transformations in a specific order to handle interdependencies between rules:
Sources : llkv-sql/src/sql_engine.rs:1116-1125
flowchart TD
Input["SQL Input String"]
Step1["preprocess_tpch_connect_syntax()"]
Step2["preprocess_create_type_syntax()"]
Step3["preprocess_exclude_syntax()"]
Step4["preprocess_trailing_commas_in_values()"]
Step5["preprocess_bare_table_in_clauses()"]
Step6["preprocess_empty_in_lists()"]
Step7["preprocess_index_hints()"]
Step8["preprocess_reindex_syntax()"]
Output["Normalized SQL"]
Input --> Step1
Step1 --> Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
Step7 --> Step8
Step8 --> Output
Note1["Removes TPC-H CONNECT statements"]
Note2["Converts DuckDB type aliases"]
Note3["Quotes qualified identifiers"]
Note4["Removes trailing commas in VALUES"]
Note5["Wraps bare tables in subqueries"]
Note6["Converts empty IN to constant predicates"]
Note7["Strips SQLite index hints"]
Note8["Converts standalone REINDEX"]
Step1 -.-> Note1
Step2 -.-> Note2
Step3 -.-> Note3
Step4 -.-> Note4
Step5 -.-> Note5
Step6 -.-> Note6
Step7 -.-> Note7
Step8 -.-> Note8
Dialect-Specific Transformations
TPC-H CONNECT Statement Removal
TPC-H benchmark scripts include CONNECT TO database; directives that are no-ops in LLKV (single database system). The preprocessor strips these statements entirely.
Implementation : llkv-sql/src/sql_engine.rs:759-766
Pattern : CONNECT TO <identifier>;
Action : Remove statement
CREATE TYPE → CREATE DOMAIN Conversion
DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser-rs only supports the SQL standard CREATE DOMAIN syntax. The preprocessor performs bidirectional conversion:
CREATE TYPE myint AS INTEGER → CREATE DOMAIN myint AS INTEGER
DROP TYPE myint → DROP DOMAIN myint
Regex patterns :
- CREATE TYPE → CREATE DOMAIN
- DROP TYPE → DROP DOMAIN
Implementation : llkv-sql/src/sql_engine.rs:775-793
Sources : llkv-sql/src/sql_engine.rs:775-793
EXCLUDE Clause Qualified Name Handling
DuckDB allows qualified identifiers in EXCLUDE clauses: SELECT * EXCLUDE (schema.table.col). The preprocessor wraps these in double quotes for parser compatibility:
EXCLUDE (schema.table.col) → EXCLUDE ("schema.table.col")
Pattern : EXCLUDE\s*\(\s*([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+)\s*\)
Implementation : llkv-sql/src/sql_engine.rs:795-812
Sources : llkv-sql/src/sql_engine.rs:795-812
Trailing Comma Removal in VALUES Clauses
DuckDB permits trailing commas in tuple literals: VALUES ('v2',). The preprocessor removes these for parser compatibility:
VALUES (1, 2,) → VALUES (1, 2)
Pattern : ,(\s*)\)
Replacement : $1)
Implementation : llkv-sql/src/sql_engine.rs:814-825
Sources : llkv-sql/src/sql_engine.rs:814-825
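For illustration, a simplified version of such a pass can be written with the regex crate (the production rewrite is more careful around string literals and nesting):

```rust
use regex::Regex;

/// Remove a trailing comma that appears directly before a closing parenthesis.
fn strip_trailing_commas(sql: &str) -> String {
    let re = Regex::new(r",(\s*)\)").expect("static pattern is valid");
    re.replace_all(sql, "$1)").into_owned()
}

// strip_trailing_commas("INSERT INTO t VALUES (1, 2,)") == "INSERT INTO t VALUES (1, 2)"
```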
Empty IN List Handling
SQLite allows degenerate forms expr IN () and expr NOT IN () which sqlparser-rs rejects. The preprocessor converts these to constant boolean expressions while preserving expression evaluation (for potential side effects):
expr IN () → (expr = NULL AND 0 = 1) -- always false
expr NOT IN () → (expr = NULL OR 1 = 1) -- always true
Regex : Matches parenthesized expressions, quoted strings, hex literals, identifiers, or numbers followed by [NOT] IN ()
Implementation : llkv-sql/src/sql_engine.rs:827-856
Sources : llkv-sql/src/sql_engine.rs:827-856
SQLite Index Hint Removal
SQLite query optimizer hints (INDEXED BY index_name, NOT INDEXED) are stripped since LLKV makes its own index selection decisions:
FROM table INDEXED BY idx_name → FROM table
FROM table NOT INDEXED → FROM table
Pattern : \s+(INDEXED\s+BY\s+[a-zA-Z_][a-zA-Z0-9_]*|NOT\s+INDEXED)\b
Implementation : llkv-sql/src/sql_engine.rs:858-875
Sources : llkv-sql/src/sql_engine.rs:858-875
REINDEX Syntax Normalization
SQLite supports standalone REINDEX index_name statements, but sqlparser-rs only recognizes REINDEX within VACUUM statements. The preprocessor converts:
REINDEX my_index → VACUUM REINDEX my_index
Pattern : \bREINDEX\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)\b
Implementation : llkv-sql/src/sql_engine.rs:877-893
Sources : llkv-sql/src/sql_engine.rs:877-893
SQLite Trigger Shorthand Expansion
SQLite allows omitting trigger timing (BEFORE/AFTER, defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). The sqlparser-rs library requires these to be explicit, so the preprocessor injects them when missing.
Transformation example :
CREATE TRIGGER t INSERT ON t → CREATE TRIGGER t AFTER INSERT ON t FOR EACH ROW
Regex approach :
- Detect a missing timing keyword and inject AFTER
- Detect a missing FOR EACH ROW before BEGIN or WHEN and inject it
Implementation : llkv-sql/src/sql_engine.rs:895-978
Note : This is marked as a temporary workaround. The proper fix would be extending sqlparser-rs’s SQLiteDialect::parse_statement to handle these optional clauses.
Sources : llkv-sql/src/sql_engine.rs:895-978
Bare Table Names in IN Clauses
SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor converts this to the explicit subquery form:
col IN users → col IN (SELECT * FROM users)
Pattern : \b(NOT\s+)?IN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)(\s|$|;|,|\))
Avoids : Already-parenthesized expressions (IN (...))
Implementation : llkv-sql/src/sql_engine.rs:980-1009
Sources : llkv-sql/src/sql_engine.rs:980-1009
SQL Parameter System
The parameter system enables prepared statements with placeholder binding. Parameters can use various syntaxes across different SQL dialects:
| Syntax | Example | Description |
|---|---|---|
| ? | WHERE id = ? | Positional, auto-incremented |
| ?N | WHERE id = ?1 | Positional, explicit index |
| $N | WHERE id = $1 | PostgreSQL-style positional |
| :name | WHERE id = :user_id | Named parameter |
Parameter State Management
Thread-local state tracks parameter registration during statement preparation:
Key functions :
- ParameterScope::new() : initializes thread-local state llkv-sql/src/sql_engine.rs:228-237
- register_placeholder(raw: &str) : assigns indices to parameters llkv-sql/src/sql_engine.rs:258-268
- placeholder_marker(index: usize) : generates the internal sentinel llkv-sql/src/sql_engine.rs:270-272
Sources : llkv-sql/src/sql_engine.rs:78-282
flowchart TD
Input["Raw Parameter String"]
CheckAuto{"Is '?' ?"}
IncrementAuto["Increment next_auto\nReturn new index"]
CheckCache{"Already registered?"}
ReturnCached["Return cached index"]
ParseType{"Parameter type"}
ParseNumeric["Parse numeric index\n(?N or $N)"]
AssignNamed["Assign max_index + 1\n(:name)"]
UpdateCache["Store in assigned map\nUpdate max_index"]
Return["Return index"]
Input --> CheckAuto
CheckAuto -- Yes --> IncrementAuto
CheckAuto -- No --> CheckCache
CheckCache -- Yes --> ReturnCached
CheckCache -- No --> ParseType
ParseType -- "?N or $N" --> ParseNumeric
ParseType -- ":name" --> AssignNamed
ParseNumeric --> UpdateCache
AssignNamed --> UpdateCache
UpdateCache --> Return
IncrementAuto --> Return
Parameter Index Assignment
The ParameterState::register() method normalizes different parameter syntaxes to unified indices:
Sources : llkv-sql/src/sql_engine.rs:94-133
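An illustrative reconstruction of that logic follows; field and method names approximate the real ParameterState, so treat this as a sketch rather than the implementation.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ParameterState {
    assigned: HashMap<String, usize>,
    next_auto: usize,
    max_index: usize,
}

impl ParameterState {
    fn register(&mut self, raw: &str) -> usize {
        if raw == "?" {
            // Anonymous placeholders are auto-numbered starting from 1.
            self.next_auto += 1;
            self.max_index = self.max_index.max(self.next_auto);
            return self.next_auto;
        }
        if let Some(&idx) = self.assigned.get(raw) {
            return idx; // Already registered: reuse the cached index.
        }
        // "?N" / "$N" carry an explicit index; ":name" receives max_index + 1.
        let idx = raw[1..].parse::<usize>().unwrap_or(self.max_index + 1);
        self.max_index = self.max_index.max(idx);
        self.assigned.insert(raw.to_string(), idx);
        idx
    }
}
```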
Parameter Sentinel System
During plan building, parameters are represented as sentinel strings that can be recognized and replaced during execution:
| Sentinel Format | Example | Usage |
|---|---|---|
| __llkv_param__N__ | __llkv_param__1__ | Internal representation in plan |
Functions :
- placeholder_marker(index) : generates the sentinel llkv-sql/src/sql_engine.rs:270-272
- literal_placeholder(index) : wraps it in Literal::String llkv-sql/src/sql_engine.rs:274-276
- parse_placeholder_marker(text) : extracts the index from a sentinel llkv-sql/src/sql_engine.rs:278-282
Sources : llkv-sql/src/sql_engine.rs:78-282
Integration with SqlEngine Execution
The preprocessing pipeline integrates with the main SQL execution flow:
Trigger retry logic : If initial parsing fails and the SQL contains CREATE TRIGGER, the engine applies the trigger shorthand preprocessing and retries. This is a fallback for cases where the initial preprocessing pipeline doesn’t catch all trigger variations.
Sources : llkv-sql/src/sql_engine.rs:1057-1083
Recursion Limit Configuration
The parser recursion limit is set higher than sqlparser-rs’s default to accommodate deeply nested SQL expressions common in test suites:
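A sketch of how the limit is applied, assuming sqlparser's builder-style Parser API:

```rust
use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

const PARSER_RECURSION_LIMIT: usize = 200;

fn parse_sql(sql: &str) -> Result<Vec<Statement>, ParserError> {
    Parser::new(&GenericDialect {})
        .with_recursion_limit(PARSER_RECURSION_LIMIT)
        .try_with_sql(sql)?
        .parse_statements()
}
```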
This prevents stack overflows while still protecting against pathological inputs.
Sources : llkv-sql/src/sql_engine.rs:393-400
Design Rationale
Why Regex-Based Preprocessing?
The preprocessing approach uses regex pattern matching rather than extending the parser for several reasons:
- Non-invasive : Does not require forking or patching sqlparser-rs
- Composable : Multiple transformations can be chained independently
- Maintainable : Each dialect feature is isolated in its own function
- Sufficient : Handles syntactic transformations without semantic analysis
Trade-offs
| Approach | Advantages | Disadvantages |
|---|---|---|
| Regex preprocessing | Simple, composable, no parser changes | Fragile to edge cases, string-based |
| Parser extension | Robust, type-safe | Requires maintaining parser fork |
| Custom parser | Full control | High maintenance burden |
The current regex-based approach is pragmatic for the current set of dialect differences. If the number of transformations grows significantly, a proper parser extension may become necessary.
Sources : llkv-sql/src/sql_engine.rs:759-1009
Summary
The SQL preprocessing layer provides broad dialect compatibility through a pipeline of targeted string transformations:
- TPC-H compatibility : Strips CONNECT statements
- DuckDB compatibility : Type aliases, trailing commas, qualified EXCLUDE
- SQLite compatibility : Trigger shorthand, empty IN lists, index hints, bare table IN, REINDEX
- PostgreSQL compatibility : $N parameter syntax
The system is extensible—new dialect features can be added as additional preprocessing functions without disrupting existing transformations.
Sources : llkv-sql/src/sql_engine.rs:759-1125
INSERT Buffering System
Relevant source files
The INSERT Buffering System is a performance optimization that batches multiple consecutive INSERT ... VALUES statements into a single execution operation. This system reduces planning and execution overhead by accumulating literal row data across statement boundaries and flushing them together, achieving significant throughput improvements for bulk ingestion workloads while preserving per-statement result semantics.
For information about SQL statement execution in general, see SQL Interface. For details on how INSERT statements are planned and translated, see Query Planning.
Purpose and Scope
The buffering system operates at the SQL engine layer, intercepting INSERT statements before they reach the runtime executor. It analyzes each INSERT to determine if it contains literal values that can be safely accumulated, or if it requires immediate execution due to complex source expressions (e.g., INSERT ... SELECT). The system maintains buffering state across multiple execute() calls, making it suitable for workloads that stream SQL scripts containing thousands of INSERT statements.
In scope:
- Buffering of INSERT ... VALUES statements with literal row data
- Automatic flushing based on configurable thresholds and statement boundaries
- Per-statement result tracking for compatibility with test harnesses
- Conflict resolution action compatibility (REPLACE, IGNORE, etc.)
Out of scope:
- Buffering of INSERT statements with subqueries or SELECT sources
- Cross-table buffering (only same-table INSERTs are batched)
- Transaction-aware buffering strategies
System Architecture
Sources: llkv-sql/src/sql_engine.rs:1057-1114 llkv-sql/src/sql_engine.rs:1231-1338 llkv-sql/src/sql_engine.rs:1423-1557
Core Data Structures
classDiagram
class InsertBuffer {-String table_name\n-Vec~String~ columns\n-InsertConflictAction on_conflict\n-usize total_rows\n-Vec~usize~ statement_row_counts\n-Vec~Vec~PlanValue~~ rows\n+new() InsertBuffer\n+can_accept() bool\n+push_statement()\n+should_flush() bool}
class PreparedInsert {<<enumeration>>\nValues\nImmediate}
class SqlEngine {-RefCell~Option~InsertBuffer~~ insert_buffer\n-AtomicBool insert_buffering_enabled\n+buffer_insert() BufferedInsertResult\n+flush_buffer_results() Vec~SqlStatementResult~\n+set_insert_buffering()}
SqlEngine --> InsertBuffer : contains
SqlEngine --> PreparedInsert : produces
InsertBuffer
The InsertBuffer struct accumulates literal row data across multiple INSERT statements while tracking per-statement row counts for result reporting.
Field descriptions:
| Field | Type | Purpose |
|---|---|---|
| table_name | String | Target table identifier for compatibility checking |
| columns | Vec<String> | Column list (must match across buffered statements) |
| on_conflict | InsertConflictAction | Conflict resolution strategy (REPLACE, IGNORE, etc.) |
| total_rows | usize | Sum of all buffered rows across statements |
| statement_row_counts | Vec<usize> | Per-statement row counts for result splitting |
| rows | Vec<Vec<PlanValue>> | Flattened literal row data |
Sources: llkv-sql/src/sql_engine.rs:497-547
Configuration and Thresholds
MAX_BUFFERED_INSERT_ROWS
The buffering system enforces a maximum row threshold to prevent unbounded memory growth during bulk ingestion:
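The threshold as described (the real definition lives in llkv-sql/src/sql_engine.rs:490):

```rust
/// Flush the insert buffer once this many literal rows have been accumulated.
const MAX_BUFFERED_INSERT_ROWS: usize = 8192;
```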
When total_rows reaches this threshold, the buffer automatically flushes before accepting additional statements. This value balances memory usage against amortized planning overhead.
Sources: llkv-sql/src/sql_engine.rs:490
Enabling and Disabling Buffering
Buffering is disabled by default to preserve immediate execution semantics for unit tests and applications that depend on synchronous error reporting. Long-running workloads can opt in via the set_insert_buffering method:
Sources: llkv-sql/src/sql_engine.rs:1022-1029 llkv-sql/src/sql_engine.rs:583
sequenceDiagram
participant App
participant SqlEngine
participant InsertBuffer
App->>SqlEngine: set_insert_buffering(true)
SqlEngine->>SqlEngine: Store enabled flag
Note over App,SqlEngine: Buffering now active
App->>SqlEngine: execute("INSERT INTO t VALUES (1)")
SqlEngine->>InsertBuffer: Create or append
App->>SqlEngine: execute("INSERT INTO t VALUES (2)")
SqlEngine->>InsertBuffer: Append to existing
App->>SqlEngine: set_insert_buffering(false)
SqlEngine->>InsertBuffer: flush_buffer_results()
InsertBuffer->>SqlEngine: Return per-statement results
Buffering Decision Flow
The buffer_insert method implements the core buffering logic, determining whether an INSERT should be accumulated or executed immediately:
Immediate Execution Conditions
Buffering is bypassed under the following conditions:
- Statement Expectation is Error or Count: Test harnesses use expectations to validate synchronous error reporting and exact row counts
- Buffering Disabled: The insert_buffering_enabled flag is false (default state)
- Non-literal Source: The INSERT uses a SELECT subquery or other dynamic source
- Buffer Incompatibility: The INSERT targets a different table, column list, or conflict action
Sources: llkv-sql/src/sql_engine.rs:1231-1338
flowchart TD
START["buffer_insert(insert, expectation)"]
CHECK_EXPECT{"expectation ==\nError |Count?"}
CHECK_ENABLED{"buffering_enabled?"}
PREPARE["prepare_insert insert"]
CHECK_TYPE{"PreparedInsert type?"}
CHECK_COMPAT{"buffer.can_accept ?"}
CHECK_FULL{"buffer.should_flush ?"}
EXEC_IMM["Execute immediately Return flushed + current"]
CREATE_BUF["Create new InsertBuffer"]
APPEND_BUF["buffer.push_statement"]
FLUSH_BUF["flush_buffer_results"]
RETURN_PLACE["Return placeholder result"]
START --> CHECK_EXPECT
CHECK_EXPECT -->|Yes|EXEC_IMM
CHECK_EXPECT -->|No|CHECK_ENABLED
CHECK_ENABLED -->|No|EXEC_IMM
CHECK_ENABLED -->|Yes|PREPARE
PREPARE --> CHECK_TYPE
CHECK_TYPE -->|Immediate|EXEC_IMM
CHECK_TYPE -->|Values|CHECK_COMPAT
CHECK_COMPAT -->|No|FLUSH_BUF
CHECK_COMPAT -->|Yes|APPEND_BUF
FLUSH_BUF --> CREATE_BUF
CREATE_BUF --> RETURN_PLACE
APPEND_BUF --> CHECK_FULL
CHECK_FULL -->|Yes|FLUSH_BUF
CHECK_FULL -->|No| RETURN_PLACE
Insert Canonicalization
The prepare_insert method analyzes the INSERT statement and converts it into one of two canonical forms:
PreparedInsert::Values
Literal VALUES clauses (including constant SELECT forms like SELECT 1, 2) are rewritten into Vec<Vec<PlanValue>> rows for buffering:
Conversion logic:
flowchart LR
SQL["INSERT INTO users (id, name)\nVALUES (1, 'Alice')"]
AST["sqlparser::ast::Insert"]
CHECK{"Source type?"}
VALUES["SetExpr::Values"]
SELECT["SetExpr::Select"]
CONST{"Constant SELECT?"}
CONVERT["Convert to Vec~Vec~PlanValue~~"]
PREP_VAL["PreparedInsert::Values"]
PREP_IMM["PreparedInsert::Immediate"]
SQL --> AST
AST --> CHECK
CHECK -->|VALUES| VALUES
CHECK -->|SELECT| SELECT
VALUES --> CONVERT
SELECT --> CONST
CONST -->|Yes| CONVERT
CONST -->|No| PREP_IMM
CONVERT --> PREP_VAL
The method iterates over each row in the VALUES clause, converting SQL expressions to PlanValue variants:
| SQL Type | PlanValue Variant |
|---|---|
| Integer literals | PlanValue::Integer(i64) |
| Float literals | PlanValue::Float(f64) |
| String literals | PlanValue::String(String) |
| NULL | PlanValue::Null |
| Date literals | PlanValue::Date32(i32) |
| DECIMAL literals | PlanValue::Decimal(DecimalValue) |
Sources: llkv-sql/src/sql_engine.rs:1504-1524
flowchart LR
COMPLEX["INSERT INTO t\nSELECT * FROM source"]
PLAN["build_select_plan()"]
WRAP["InsertPlan { source: InsertSource::Select }"]
PREP["PreparedInsert::Immediate"]
COMPLEX --> PLAN
PLAN --> WRAP
WRAP --> PREP
PreparedInsert::Immediate
Non-literal sources (subqueries, complex expressions) are wrapped in a fully-planned InsertPlan for immediate execution:
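Conceptually, the two canonical forms amount to a two-variant enum; the sketch below uses placeholder payload types, since the real definition (llkv-sql/src/sql_engine.rs:555-563) is not reproduced here:

```rust
// Placeholder payloads; the real variants carry the buffered literal rows and
// a fully built llkv-plan InsertPlan respectively.
struct BufferedValues;
struct InsertPlan;

enum PreparedInsert {
    /// Literal rows eligible for buffering.
    Values(BufferedValues),
    /// Dynamic source (e.g. INSERT ... SELECT); executed right away.
    Immediate(InsertPlan),
}
```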
Sources: llkv-sql/src/sql_engine.rs:1543-1552
flowchart TD
START["can_accept(table, columns, on_conflict)"]
CHECK_TABLE{"table_name == table?"}
CHECK_COLS{"columns == columns?"}
CHECK_CONFLICT{"on_conflict == on_conflict?"}
ACCEPT["Return true"]
REJECT["Return false"]
START --> CHECK_TABLE
CHECK_TABLE -->|No| REJECT
CHECK_TABLE -->|Yes| CHECK_COLS
CHECK_COLS -->|No| REJECT
CHECK_COLS -->|Yes| CHECK_CONFLICT
CHECK_CONFLICT -->|No| REJECT
CHECK_CONFLICT -->|Yes| ACCEPT
Buffer Compatibility Checking
The can_accept method ensures that only compatible INSERTs are batched together:
Compatibility requirements:
| Field | Requirement |
|---|---|
table_name | Exact string match (case-sensitive) |
columns | Exact column list match (order matters) |
on_conflict | Same conflict resolution strategy |
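Building on the InsertBuffer sketch shown earlier, the check reduces to three equality comparisons; the body below is an assumption that mirrors the table:

```rust
impl InsertBuffer {
    /// A new statement can join the buffer only when table, column list, and
    /// conflict action all match the statements already buffered.
    fn can_accept(
        &self,
        table_name: &str,
        columns: &[String],
        on_conflict: &InsertConflictAction,
    ) -> bool {
        self.table_name == table_name
            && self.columns == columns
            && &self.on_conflict == on_conflict
    }
}
```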
Sources: llkv-sql/src/sql_engine.rs:528-535
flowchart TD
subgraph "During execute()"
STMT{"Statement Type"}
INSERT["INSERT"]
TRANS["Transaction Boundary\n(BEGIN/COMMIT/ROLLBACK)"]
OTHER["Other Statement\n(SELECT, UPDATE, etc.)"]
end
subgraph "Buffer State"
SIZE{"total_rows >=\nMAX_BUFFERED_INSERT_ROWS?"}
COMPAT{"can_accept()?"}
end
subgraph "Explicit Triggers"
DROP["SqlEngine::drop()"]
DISABLE["set_insert_buffering(false)"]
MANUAL["flush_pending_inserts()"]
end
FLUSH["flush_buffer_results()"]
STMT --> INSERT
STMT --> TRANS
STMT --> OTHER
INSERT --> SIZE
INSERT --> COMPAT
SIZE -->|Yes| FLUSH
COMPAT -->|No| FLUSH
TRANS --> FLUSH
OTHER --> FLUSH
DROP --> FLUSH
DISABLE --> FLUSH
MANUAL --> FLUSH
Flush Triggers
The buffer flushes under multiple conditions to maintain correctness and predictable memory usage:
Automatic Flush Conditions
Sources: llkv-sql/src/sql_engine.rs:544-546 llkv-sql/src/sql_engine.rs:1098-1111 llkv-sql/src/sql_engine.rs:592-596
Manual Flush API
Applications can force a flush via the flush_pending_inserts public method:
This is useful when the application needs to ensure all buffered data is persisted before performing a read operation or checkpoint.
Sources: llkv-sql/src/sql_engine.rs:1132-1134
sequenceDiagram
participant Buffer as InsertBuffer
participant Engine as SqlEngine
participant Runtime as RuntimeEngine
Note over Buffer: Buffer contains 3 statements:\n10, 20, 15 rows
Engine->>Buffer: Take buffer (45 total rows)
Buffer-->>Engine: InsertBuffer instance
Engine->>Engine: Build InsertPlan with all 45 rows
Engine->>Runtime: execute_statement(InsertPlan)
Runtime->>Runtime: Persist all rows atomically
Runtime-->>Engine: RuntimeStatementResult::Insert(45)
Engine->>Engine: Split into per-statement results
Note over Engine: [10 rows, 20 rows, 15 rows]
Engine-->>Engine: Return Vec~SqlStatementResult~
Flush Execution and Result Splitting
When the buffer flushes, all accumulated rows are executed as a single InsertPlan, then the result is split back into per-statement results:
Implementation details:
- The buffer is moved out of the RefCell to prevent reentrancy issues
- A single InsertPlan is constructed with InsertSource::Rows(all_rows)
- The plan executes via RuntimeEngine::execute_statement
- The returned row count is validated against total_rows
- Per-statement results are reconstructed using statement_row_counts
Sources: llkv-sql/src/sql_engine.rs:1342-1407
Result Reconstruction
The flush method maintains the illusion of per-statement execution by reconstructing individual RuntimeStatementResult::Insert instances:
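A simplified sketch of that reconstruction, with a stand-in StatementResult in place of RuntimeStatementResult:

```rust
/// Stand-in for llkv-runtime's RuntimeStatementResult.
#[derive(Debug)]
enum StatementResult {
    Insert { rows: usize },
}

/// Rebuild per-statement results from one combined execution, using the
/// per-statement counts the buffer recorded while appending.
fn split_results(total_inserted: usize, statement_row_counts: &[usize]) -> Vec<StatementResult> {
    // Mirrors the validation step described above.
    debug_assert_eq!(total_inserted, statement_row_counts.iter().sum::<usize>());
    statement_row_counts
        .iter()
        .map(|&rows| StatementResult::Insert { rows })
        .collect()
}
```

With the counts from the sequence diagram above, split_results(45, &[10, 20, 15]) yields three Insert results of 10, 20, and 15 rows.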
This ensures compatibility with test harnesses and applications that inspect per-statement results.
Sources: llkv-sql/src/sql_engine.rs:1389-1406
flowchart TD
EXPECT["register_statement_expectation(expectation)"]
THREAD_LOCAL["PENDING_STATEMENT_EXPECTATIONS\n(thread-local queue)"]
EXEC["execute(sql)"]
NEXT["next_statement_expectation()"]
CHECK{"Expectation Type"}
OK["StatementExpectation::Ok\n(buffer eligible)"]
ERROR["StatementExpectation::Error\n(execute immediately)"]
COUNT["StatementExpectation::Count\n(execute immediately)"]
EXPECT --> THREAD_LOCAL
EXEC --> NEXT
NEXT --> THREAD_LOCAL
THREAD_LOCAL --> CHECK
CHECK --> OK
CHECK --> ERROR
CHECK --> COUNT
Statement Expectations Integration
The SQL Logic Test (SLT) harness uses statement expectations to validate error handling and row counts. The buffering system respects these expectations by bypassing the buffer when necessary:
Expectation types:
| Expectation | Behavior | Reason |
|---|---|---|
Ok | Allow buffering | No specific validation needed |
Error | Force immediate execution | Must observe synchronous error |
Count(n) | Force immediate execution | Must validate exact row count |
Sources: llkv-sql/src/sql_engine.rs:375-391 llkv-sql/src/sql_engine.rs:1240-1250
Performance Characteristics
The buffering system provides substantial performance improvements for bulk ingestion workloads:
Planning Overhead Reduction
Without buffering, each INSERT statement incurs:
- SQL parsing (via sqlparser)
- Plan construction (InsertPlan allocation)
- Schema validation (column name resolution)
- Runtime plan dispatch
With buffering, these costs are amortized across all buffered statements, executing once per flush rather than once per statement.
Batch Size Impact
The default threshold of 8192 rows balances several factors:
| Factor | Impact of Larger Batches | Impact of Smaller Batches |
|---|---|---|
| Memory usage | Increases linearly | Reduces proportionally |
| Planning amortization | Better (fewer flushes) | Worse (more flushes) |
| Latency to visibility | Higher (longer buffering) | Lower (frequent flushes) |
| Write throughput | Generally higher | Generally lower |
Sources: llkv-sql/src/sql_engine.rs:490
sequenceDiagram
participant App
participant SqlEngine
participant Buffer as InsertBuffer
participant Runtime
App->>SqlEngine: execute("BEGIN TRANSACTION")
SqlEngine->>Buffer: flush_buffer_results()
Buffer-->>SqlEngine: (empty, no pending data)
SqlEngine->>Runtime: begin_transaction()
App->>SqlEngine: execute("INSERT INTO t VALUES (1)")
SqlEngine->>Buffer: Buffer statement
App->>SqlEngine: execute("COMMIT")
SqlEngine->>Buffer: flush_buffer_results()
Buffer->>Runtime: execute_statement(InsertPlan)
Runtime-->>Buffer: Success
SqlEngine->>Runtime: commit_transaction()
Integration with Transaction Boundaries
The buffering system automatically flushes at transaction boundaries to maintain ACID semantics:
This ensures that all buffered writes are persisted before the transaction commits, preventing data loss or inconsistency.
Sources: llkv-sql/src/sql_engine.rs:1094-1102
Conflict Resolution Compatibility
The buffer preserves conflict resolution semantics by tracking the on_conflict action:
| Action | SQLite Syntax | Buffering Behavior |
|---|---|---|
None | (default) | Buffers with same action |
Replace | REPLACE INTO / INSERT OR REPLACE | Buffers separately from other actions |
Ignore | INSERT OR IGNORE | Buffers separately from other actions |
Abort | INSERT OR ABORT | Buffers separately from other actions |
Fail | INSERT OR FAIL | Buffers separately from other actions |
Rollback | INSERT OR ROLLBACK | Buffers separately from other actions |
The buffer’s can_accept method ensures that only statements with identical conflict actions are batched together, preserving the semantic behavior of each INSERT.
Sources: llkv-sql/src/sql_engine.rs:1441-1455 llkv-sql/src/sql_engine.rs:528-535
Drop Safety
The SqlEngine implements a Drop handler to ensure that buffered data is persisted when the engine goes out of scope:
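A minimal sketch of what such a handler looks like, assuming the flush_pending_inserts method documented earlier on this page (the real handler is at llkv-sql/src/sql_engine.rs:590-597):

```rust
impl Drop for SqlEngine {
    fn drop(&mut self) {
        // Best-effort flush so rows still sitting in the insert buffer are
        // persisted before the engine is torn down; error handling elided.
        let _ = self.flush_pending_inserts();
    }
}
```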
This prevents data loss when applications terminate without explicitly flushing or disabling buffering.
Sources: llkv-sql/src/sql_engine.rs:590-597
Key Code Entities
| Entity | Location | Role |
|---|---|---|
InsertBuffer | llkv-sql/src/sql_engine.rs:497-547 | Accumulates row data and tracks per-statement counts |
SqlEngine::insert_buffer | llkv-sql/src/sql_engine.rs:576 | Holds current buffer instance |
SqlEngine::insert_buffering_enabled | llkv-sql/src/sql_engine.rs:583 | Global enable/disable flag |
MAX_BUFFERED_INSERT_ROWS | llkv-sql/src/sql_engine.rs:490 | Flush threshold constant |
PreparedInsert | llkv-sql/src/sql_engine.rs:555-563 | Canonical INSERT representation |
buffer_insert | llkv-sql/src/sql_engine.rs:1231-1338 | Main buffering decision logic |
prepare_insert | llkv-sql/src/sql_engine.rs:1423-1557 | INSERT canonicalization |
flush_buffer_results | llkv-sql/src/sql_engine.rs:1342-1407 | Execution and result splitting |
BufferedInsertResult | llkv-sql/src/sql_engine.rs:567-570 | Return type for buffering operations |
Sources: llkv-sql/src/sql_engine.rs:490-1557
Query Planning
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-table/src/resolvers/identifier.rs
- llkv-test-utils/Cargo.toml
Purpose and Scope
The query planning layer converts parsed SQL abstract syntax trees (AST) into structured logical plans that execution engines can process. This document covers the planning architecture, plan types, translation process, and validation steps. For information about the SQL parsing that precedes planning, see SQL Interface. For details about the plan structures themselves, see Plan Structures. For subquery correlation handling, see Subquery and Correlation Handling. For expression compilation and evaluation, see Expression System. For plan execution, see Query Execution.
Sources: llkv-plan/src/lib.rs:1-49 llkv-plan/src/plans.rs:1-6
Architecture Overview
The llkv-plan crate sits between the SQL parsing layer and the execution layer. It receives SQL ASTs from sqlparser-rs and produces logical plan structures consumed by llkv-executor and llkv-runtime. The crate is organized into several focused modules:
| Module | Purpose |
|---|---|
plans | Core plan structures (SelectPlan, InsertPlan, etc.) |
planner | Plan building logic |
translation | SQL AST to plan conversion |
validation | Schema and naming constraint enforcement |
canonical | Canonical value conversion |
conversion | Helper conversions for plan types |
physical | Physical execution plan structures |
table_scan | Table scan plan optimization |
subquery_correlation | Correlated subquery tracking |
traversal | Generic AST traversal utilities |
Sources: llkv-plan/src/lib.rs:1-49 llkv-plan/Cargo.toml:1-43
Planning Process Flow
The planning process begins when the SQL layer invokes translation functions that walk the parsed AST and construct typed plan structures. The process involves multiple phases:
Identifier Resolution : During translation, column references like "users.name" or "u.address.city" must be resolved to canonical column names. The IdentifierResolver in llkv-table handles this by consulting the TableCatalog and applying alias rules.
sequenceDiagram
participant Parser as sqlparser
participant Translator as translation module
participant Validator as validation module
participant Builder as Plan builders
participant Catalog as TableCatalog
participant Output as Logical Plans
Parser->>Translator: Statement AST
Note over Translator: Identify statement type\n(SELECT, INSERT, DDL, etc.)
Translator->>Validator: Check schema references
Validator->>Catalog: Resolve table names
Catalog-->>Validator: TableMetadataView
Validator-->>Translator: Validated identifiers
Translator->>Builder: Construct plan with validated data
alt SELECT statement
Builder->>Builder: Build SelectPlan\n- Parse projections\n- Translate WHERE\n- Extract aggregates\n- Build join metadata
else INSERT statement
Builder->>Builder: Build InsertPlan\n- Parse column list\n- Convert values\n- Set conflict action
else DDL statement
Builder->>Builder: Build DDL Plan\n- Extract schema info\n- Validate constraints
end
Builder->>Validator: Final validation pass
Validator-->>Builder: Approved
Builder->>Output: Typed logical plan
Expression Translation : Predicate and scalar expressions are translated from SQL AST nodes into the llkv-expr expression types (Expr, ScalarExpr). This translation involves converting SQL operators to expression operators, resolving function calls, and tracking subquery placeholders.
Subquery Discovery : The translation layer identifies correlated subqueries in WHERE and SELECT clauses, assigns them unique SubqueryId identifiers, and builds FilterSubquery or ScalarSubquery metadata structures that track which outer columns are referenced.
Sources: llkv-plan/src/plans.rs:1-100 llkv-table/src/resolvers/identifier.rs:1-260
Plan Type Categories
Plans are organized by SQL statement category. Each category has distinct plan structures tailored to its execution requirements:
DDL Plans
Data Definition Language plans modify schema and structure:
DML Plans
Data Manipulation Language plans modify table contents:
Query Plans
SELECT statements produce the most complex plans with nested substructures:
Sources: llkv-plan/src/plans.rs:164-1620
Plan Construction Patterns
Plan structures expose builder-style APIs for incremental construction:
Builder Methods
Most plan types provide fluent builder methods:
CreateTablePlan::new("users")
.with_if_not_exists(true)
.with_columns(vec![
PlanColumnSpec::new("id", DataType::Int64, false)
.with_primary_key(true),
PlanColumnSpec::new("name", DataType::Utf8, true),
])
.with_namespace(Some("main".to_string()))
SelectPlan Construction
SelectPlan is built incrementally as the translator walks the SQL AST:
SelectPlan::new("users")
.with_projections(projections)
.with_filter(Some(SelectFilter {
predicate: expr,
subqueries: vec![],
}))
.with_aggregates(aggregates)
.with_order_by(order_by)
.with_group_by(group_by_columns)
The with_tables constructor supports multi-table queries:
SelectPlan::with_tables(vec![
TableRef::new("main", "orders"),
TableRef::new("main", "customers"),
])
.with_joins(vec![
JoinMetadata {
left_table_index: 0,
join_type: JoinPlan::Inner,
on_condition: Some(join_expr),
},
])
Join Metadata
JoinMetadata replaces the older parallel join_types and join_filters vectors by bundling join information into a single structure per join operation. Each entry describes how tables[i] connects to tables[i+1]:
| Field | Type | Purpose |
|---|---|---|
left_table_index | usize | Index into SelectPlan.tables |
join_type | JoinPlan | INNER, LEFT, RIGHT, or FULL |
on_condition | Option<Expr> | ON clause predicate |
Sources: llkv-plan/src/plans.rs:176-952 llkv-plan/src/plans.rs:756-792
Subquery Plan Structures
Subqueries appear in two contexts and use distinct plan structures:
Filter Subqueries
Used in WHERE clauses with EXISTS predicates:
The SubqueryId links the predicate’s Expr::Exists node to the corresponding FilterSubquery in the subqueries vector.
Scalar Subqueries
Used in SELECT projections to return a single value:
Correlated Column Tracking
When a subquery references outer query columns, the planner injects placeholder column names and tracks the mapping:
CorrelatedColumn {
placeholder: "__subquery_corr_0_users_id", // Injected into subquery
column: "users.id", // Actual outer column
field_path: vec![], // Empty for simple columns
}
For struct field references like users.address.city:
CorrelatedColumn {
placeholder: "__subquery_corr_0_users_address",
column: "users.address",
field_path: vec!["city".to_string()],
}
Sources: llkv-plan/src/plans.rs:23-68 llkv-expr/src/expr.rs:42-65
graph TD
LOGICAL["Logical Plans\n(SelectPlan, InsertPlan, etc.)"]
LOGICAL --> PHYS_BUILD["Physical plan builder\nllkv-plan::physical"]
PHYS_BUILD --> PHYS["PhysicalPlan"]
PHYS --> SCAN["TableScanPlan\nAccess path selection"]
PHYS --> JOIN_EXEC["Join algorithm choice\n(hash join, nested loop)"]
PHYS --> AGG_EXEC["Aggregation strategy"]
PHYS --> SORT_EXEC["Sort algorithm"]
SCAN --> FULL["Full table scan"]
SCAN --> INDEX["Index scan"]
SCAN --> FILTER_SCAN["Filtered scan with pushdown"]
PHYS --> EXEC["llkv-executor\nQuery execution"]
Physical Planning
While logical plans represent what operations to perform, physical plans specify how to execute them with specific algorithms and data access patterns:
The table_scan module provides build_table_scan_plan which analyzes predicates to determine the optimal scan strategy:
TableScanProjectionSpec {
columns: Vec<String>, // Columns to retrieve
filter: Option<Expr>, // Pushed-down filter
order_by: Vec<OrderByPlan>, // Sort requirements
limit: Option<usize>, // Row limit
}
This physical plan metadata guides the executor in choosing between:
- Full scans : No filter, read all rows
- Filter scans : Predicate evaluation during scan
- Index scans : Use indexes when available and beneficial
Sources: llkv-plan/src/physical.rs:1-50 (inferred), llkv-plan/src/table_scan.rs:1-50 (inferred)
Translation and Validation
Translation Process
The translation modules convert SQL AST nodes to plan structures while preserving semantic meaning:
Projection Translation :
- SELECT * → SelectProjection::AllColumns
- SELECT * EXCEPT (x) → SelectProjection::AllColumnsExcept
- SELECT col → SelectProjection::Column { name, alias }
- SELECT expr AS name → SelectProjection::Computed { expr, alias }
Predicate Translation : SQL predicates are converted to Expr trees:
- WHERE a = 1 AND b < 2 → Expr::And([Filter(a=1), Filter(b<2)])
- WHERE x IN (1,2,3) → Expr::Pred(Filter { op: Operator::In(...) })
- WHERE EXISTS (SELECT ...) → Expr::Exists(SubqueryExpr { id, negated })
Aggregate Translation : Aggregate functions map to AggregateExpr variants:
- COUNT(*) → AggregateExpr::CountStar
- SUM(col) → AggregateExpr::Column { function: SumInt64, ... }
- AVG(DISTINCT col) → AggregateExpr::Column { distinct: true, ... }
Validation Helpers
The validation module enforces constraints during plan construction:
| Validation Type | Purpose |
|---|---|
| Schema validation | Verify table and column existence |
| Type compatibility | Check data type conversions |
| Constraint validation | Enforce PRIMARY KEY, UNIQUE, CHECK, FK rules |
| Identifier resolution | Resolve ambiguous column references |
| Aggregate context | Ensure aggregates only in valid contexts |
The validation module provides helper functions that translators invoke at critical points:
- Before plan construction : Validate referenced tables exist
- During expression building : Resolve column identifiers
- After plan assembly : Final consistency checks
Canonical Value Conversion
The canonical module converts PlanValue instances to CanonicalScalar for internal processing:
PlanValue::Integer(42) → CanonicalScalar::Int64(42)
PlanValue::String("abc") → CanonicalScalar::String("abc")
PlanValue::Decimal(dec) → CanonicalScalar::Decimal(dec)
This normalization ensures consistent value representation across plan construction and execution.
Sources: llkv-plan/src/translation.rs:1-50 (inferred), llkv-plan/src/validation.rs:1-50 (inferred), llkv-plan/src/canonical.rs:1-50 (inferred), llkv-plan/src/plans.rs:126-161
Plan Metadata and Auxiliary Types
Plans contain rich metadata to guide execution:
PlanValue
Represents literal values in plans:
ColumnAssignment
UPDATE plans use ColumnAssignment to specify modifications:
ColumnAssignment {
column: "age",
value: AssignmentValue::Literal(PlanValue::Integer(25)),
}
ColumnAssignment {
column: "updated_at",
value: AssignmentValue::Expression(
ScalarExpr::Column("current_timestamp")
),
}
Constraint Specifications
CREATE TABLE plans include constraint metadata:
PlanColumnSpec {
name: "id",
data_type: DataType::Int64,
nullable: false,
primary_key: true,
unique: true,
check_expr: None,
}
ForeignKeySpec {
name: Some("fk_orders_customer"),
columns: vec!["customer_id"],
referenced_table: "customers",
referenced_columns: vec!["id"],
on_delete: ForeignKeyAction::Restrict,
on_update: ForeignKeyAction::NoAction,
}
MultiColumnUniqueSpec {
name: Some("unique_email_username"),
columns: vec!["email", "username"],
}
Sources: llkv-plan/src/plans.rs:73-161 llkv-plan/src/plans.rs:167-427 llkv-plan/src/plans.rs:660-682 llkv-plan/src/plans.rs:499-546
Traversal and Program Compilation
AST Traversal
The traversal module provides generic postorder traversal for deeply nested ASTs:
traverse_postorder(
root_node,
|node| { /* pre-visit logic */ },
|node| { /* post-visit logic */ }
)
This pattern supports:
- Subquery discovery during SELECT translation
- Expression normalization
- Dead code elimination in predicates
Program Compilation
Plans containing expressions are compiled into evaluation programs for efficient execution. The ProgramCompiler converts Expr trees into bytecode-like instruction sequences:
Expr → EvalProgram → Execution
The compiled programs optimize:
- Predicate evaluation order
- Short-circuit boolean logic
- Literal folding
- Domain program generation for chunk pruning
Sources: llkv-plan/src/traversal.rs:1-50 (inferred), llkv-plan/src/lib.rs:38-42
sequenceDiagram
participant Planner as llkv-plan
participant Runtime as llkv-runtime
participant Executor as llkv-executor
participant Table as llkv-table
Planner->>Runtime: Logical Plan
Runtime->>Runtime: Acquire transaction context
alt SELECT Plan
Runtime->>Executor: execute_select(plan, txn)
Executor->>Executor: Build execution pipeline
Executor->>Table: scan with filter
Table-->>Executor: RecordBatch stream
Executor->>Executor: Apply aggregations, joins, sorts
Executor-->>Runtime: Final RecordBatch
else INSERT Plan
Runtime->>Runtime: Convert InsertSource to batches
Runtime->>Table: append(batches)
Table-->>Runtime: Row count
else UPDATE/DELETE Plan
Runtime->>Executor: execute_update/delete(plan, txn)
Executor->>Table: filter + modify
Table-->>Runtime: Row count
else DDL Plan
Runtime->>Table: Modify catalog/schema
Table-->>Runtime: Success
end
Runtime->>Planner: Execution result
Integration with Execution
Plans flow from the planning layer to execution through well-defined interfaces:
The execution layer consumes plan metadata to:
- Determine table access order
- Select join algorithms
- Push down predicates to storage
- Apply aggregations and sorts
- Stream results incrementally
Sources: llkv-plan/src/plans.rs:1-1620 llkv-executor/src/lib.rs:1-50 (inferred), llkv-runtime/src/lib.rs:1-50 (inferred)
Plan Structures
Relevant source files
- llkv-expr/src/expr.rs
- llkv-plan/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-table/src/resolvers/identifier.rs
This document describes the logical plan data structures defined in the llkv-plan crate. These structures represent parsed and validated SQL operations before execution. For information about how plans are created from SQL AST, see Subquery and Correlation Handling. For details on expression compilation and evaluation, see Expression System.
Purpose and Scope
Plan structures are the intermediate representation (IR) between SQL parsing and query execution. The SQL parser produces AST nodes from sqlparser-rs, which the planner translates into strongly-typed plan structures. The executor consumes these plans to produce results. Plans carry all semantic information needed for execution: table references, column specifications, predicates, projections, aggregations, and ordering constraints.
All plan types are defined in llkv-plan/src/plans.rs
Sources: llkv-plan/src/plans.rs:1-1263
Plan Statement Hierarchy
All executable statements are represented by the PlanStatement enum, which serves as the top-level discriminated union of all plan types.
Sources: llkv-plan/src/plans.rs:1244-1263
graph TB
PlanStatement["PlanStatement\n(Top-level enum)"]
subgraph "Transaction Control"
BeginTransaction["BeginTransaction"]
CommitTransaction["CommitTransaction"]
RollbackTransaction["RollbackTransaction"]
end
subgraph "DDL Operations"
CreateTable["CreateTablePlan"]
DropTable["DropTablePlan"]
CreateView["CreateViewPlan"]
DropView["DropViewPlan"]
CreateIndex["CreateIndexPlan"]
DropIndex["DropIndexPlan"]
Reindex["ReindexPlan"]
AlterTable["AlterTablePlan"]
end
subgraph "DML Operations"
Insert["InsertPlan"]
Update["UpdatePlan"]
Delete["DeletePlan"]
Truncate["TruncatePlan"]
end
subgraph "Query Operations"
Select["SelectPlan"]
end
PlanStatement --> BeginTransaction
PlanStatement --> CommitTransaction
PlanStatement --> RollbackTransaction
PlanStatement --> CreateTable
PlanStatement --> DropTable
PlanStatement --> CreateView
PlanStatement --> DropView
PlanStatement --> CreateIndex
PlanStatement --> DropIndex
PlanStatement --> Reindex
PlanStatement --> AlterTable
PlanStatement --> Insert
PlanStatement --> Update
PlanStatement --> Delete
PlanStatement --> Truncate
PlanStatement --> Select
DDL Plan Structures
DDL (Data Definition Language) plans modify database schema: creating, altering, and dropping tables, views, and indexes.
CreateTablePlan
The CreateTablePlan structure defines table creation operations, including optional data sources for CREATE TABLE AS SELECT.
| Field | Type | Description |
|---|---|---|
name | String | Table name |
if_not_exists | bool | Skip creation if table exists |
or_replace | bool | Replace existing table |
columns | Vec<PlanColumnSpec> | Column definitions |
source | Option<CreateTableSource> | Optional data source (batches or SELECT) |
namespace | Option<String> | Optional storage namespace |
foreign_keys | Vec<ForeignKeySpec> | Foreign key constraints |
multi_column_uniques | Vec<MultiColumnUniqueSpec> | Multi-column UNIQUE constraints |
Sources: llkv-plan/src/plans.rs:176-203
PlanColumnSpec
Column specifications carry metadata from the planner to the executor, including type, nullability, and constraints.
| Field | Type | Description |
|---|---|---|
name | String | Column name |
data_type | DataType | Arrow data type |
nullable | bool | Whether NULL values are permitted |
primary_key | bool | Whether this is the primary key |
unique | bool | Whether values must be unique |
check_expr | Option<String> | Optional CHECK constraint SQL |
Sources: llkv-plan/src/plans.rs:503-546
CreateTableSource
The CreateTableSource enum specifies data sources for CREATE TABLE AS operations:
Sources: llkv-plan/src/plans.rs:607-617
AlterTablePlan
The AlterTablePlan structure defines table modification operations:
Sources: llkv-plan/src/plans.rs:364-406
Index Plans
Index management is handled by three plan types:
| Plan Type | Purpose | Key Fields |
|---|---|---|
CreateIndexPlan | Create new index | name, table, unique, columns: Vec<IndexColumnPlan> |
DropIndexPlan | Remove index | name, canonical_name, if_exists |
ReindexPlan | Rebuild index | name, canonical_name |
Sources: llkv-plan/src/plans.rs:433-358
DML Plan Structures
DML (Data Manipulation Language) plans modify table data: inserting, updating, and deleting rows.
graph TB
InsertPlan["InsertPlan"]
InsertSource["InsertSource\n(enum)"]
Rows["Rows\nVec<Vec<PlanValue>>"]
Batches["Batches\nVec<RecordBatch>"]
SelectSource["Select\nBox<SelectPlan>"]
ConflictAction["InsertConflictAction"]
None["None"]
Replace["Replace"]
Ignore["Ignore"]
Abort["Abort"]
Fail["Fail"]
Rollback["Rollback"]
InsertPlan -->|source| InsertSource
InsertPlan -->|on_conflict| ConflictAction
InsertSource --> Rows
InsertSource --> Batches
InsertSource --> SelectSource
ConflictAction --> None
ConflictAction --> Replace
ConflictAction --> Ignore
ConflictAction --> Abort
ConflictAction --> Fail
ConflictAction --> Rollback
InsertPlan
The InsertPlan includes SQLite-style conflict resolution actions (INSERT OR REPLACE, INSERT OR IGNORE, etc.) to handle constraint violations.
Sources: llkv-plan/src/plans.rs:623-656
UpdatePlan
The UpdatePlan structure defines row update operations:
| Field | Type | Description |
|---|---|---|
table | String | Target table name |
assignments | Vec<ColumnAssignment> | Column updates |
filter | Option<Expr<'static, String>> | Optional WHERE predicate |
Each ColumnAssignment maps a column to an AssignmentValue:
Sources: llkv-plan/src/plans.rs:661-681
DeletePlan and TruncatePlan
Both plans remove rows from tables:
- DeletePlan: Removes rows matching an optional filter predicate
- TruncatePlan: Removes all rows (no filter)
Sources: llkv-plan/src/plans.rs:687-702
graph TB
SelectPlan["SelectPlan"]
Tables["tables: Vec<TableRef>\nFROM clause tables"]
Joins["joins: Vec<JoinMetadata>\nJoin specifications"]
Projections["projections: Vec<SelectProjection>\nSELECT list"]
Filter["filter: Option<SelectFilter>\nWHERE clause"]
Having["having: Option<Expr>\nHAVING clause"]
GroupBy["group_by: Vec<String>\nGROUP BY columns"]
OrderBy["order_by: Vec<OrderByPlan>\nORDER BY specs"]
Aggregates["aggregates: Vec<AggregateExpr>\nAggregate functions"]
ScalarSubqueries["scalar_subqueries: Vec<ScalarSubquery>\nScalar subquery plans"]
Compound["compound: Option<CompoundSelectPlan>\nUNION/INTERSECT/EXCEPT"]
SelectPlan --> Tables
SelectPlan --> Joins
SelectPlan --> Projections
SelectPlan --> Filter
SelectPlan --> Having
SelectPlan --> GroupBy
SelectPlan --> OrderBy
SelectPlan --> Aggregates
SelectPlan --> ScalarSubqueries
SelectPlan --> Compound
SelectPlan Structure
The SelectPlan is the most complex plan type, representing SELECT queries with projections, filters, joins, aggregates, ordering, and compound operations.
Core SelectPlan Fields
Sources: llkv-plan/src/plans.rs:800-864
TableRef and JoinMetadata
Table references include optional aliases and schema qualifications:
Join metadata connects consecutive tables in the tables vector:
Sources: llkv-plan/src/plans.rs:708-792
SelectProjection
The SelectProjection enum defines what columns appear in results:
Sources: llkv-plan/src/plans.rs:1007-1021
SelectFilter and Subqueries
The SelectFilter structure bundles predicates with correlated subqueries:
Scalar subqueries (used in projections) follow a similar structure:
Sources: llkv-plan/src/plans.rs:28-67
AggregateExpr
Aggregate expressions define computations over grouped data:
Sources: llkv-plan/src/plans.rs:1036-1128
OrderByPlan
Ordering specifications include target, direction, and null handling:
Sources: llkv-plan/src/plans.rs:1203-1225
graph TB
ExprTypes["Expression Types"]
Expr["Expr<F>\nBoolean expressions"]
ScalarExpr["ScalarExpr<F>\nScalar computations"]
Filter["Filter<F>\nSingle-field predicates"]
Operator["Operator\nComparison operators"]
ExprVariants["Expr variants"]
And["And(Vec<Expr>)"]
Or["Or(Vec<Expr>)"]
Not["Not(Box<Expr>)"]
Pred["Pred(Filter)"]
Compare["Compare{left, op, right}"]
InList["InList{expr, list, negated}"]
IsNull["IsNull{expr, negated}"]
Literal["Literal(bool)"]
Exists["Exists(SubqueryExpr)"]
ScalarVariants["ScalarExpr variants"]
Column["Column(F)"]
Lit["Literal(Literal)"]
Binary["Binary{left, op, right}"]
Aggregate["Aggregate(AggregateCall)"]
Cast["Cast{expr, data_type}"]
Case["Case{operand, branches, else_expr}"]
Coalesce["Coalesce(Vec)"]
ScalarSubquery["ScalarSubquery(ScalarSubqueryExpr)"]
ExprTypes --> Expr
ExprTypes --> ScalarExpr
Expr --> Filter
Filter --> Operator
Expr --> ExprVariants
ExprVariants --> And
ExprVariants --> Or
ExprVariants --> Not
ExprVariants --> Pred
ExprVariants --> Compare
ExprVariants --> InList
ExprVariants --> IsNull
ExprVariants --> Literal
ExprVariants --> Exists
ScalarExpr --> ScalarVariants
ScalarVariants --> Column
ScalarVariants --> Lit
ScalarVariants --> Binary
ScalarVariants --> Aggregate
ScalarVariants --> Cast
ScalarVariants --> Case
ScalarVariants --> Coalesce
ScalarVariants --> ScalarSubquery
Expression Integration
Plans embed expression trees from llkv-expr for predicates, projections, and computed values.
Expression Type Hierarchy
Sources: llkv-expr/src/expr.rs:14-183
Expr - Boolean Predicates
The Expr type represents boolean-valued expressions used in WHERE, HAVING, and ON clauses. The generic parameter F allows different field identifier types (commonly String for column names):
| Variant | Description |
|---|---|
And(Vec<Expr<F>>) | Logical conjunction |
Or(Vec<Expr<F>>) | Logical disjunction |
Not(Box<Expr<F>>) | Logical negation |
Pred(Filter<F>) | Single-field predicate |
Compare{left, op, right} | Scalar comparison (col1 < col2) |
InList{expr, list, negated} | Membership test |
IsNull{expr, negated} | NULL test for complex expressions |
Literal(bool) | Constant true/false |
Exists(SubqueryExpr) | Correlated subquery existence check |
Sources: llkv-expr/src/expr.rs:14-123
ScalarExpr - Scalar Computations
The ScalarExpr type represents scalar-valued expressions used in projections and assignments:
| Variant | Description |
|---|---|
Column(F) | Column reference |
Literal(Literal) | Constant value |
Binary{left, op, right} | Arithmetic/logical operation |
Not(Box<ScalarExpr<F>>) | Logical NOT |
IsNull{expr, negated} | NULL test returning 1/0 |
Aggregate(AggregateCall<F>) | Aggregate function call |
GetField{base, field_name} | Struct field extraction |
Cast{expr, data_type} | Type conversion |
Compare{left, op, right} | Comparison returning 1/0 |
Coalesce(Vec<ScalarExpr<F>>) | First non-NULL value |
ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery result |
Case{operand, branches, else_expr} | CASE expression |
Random | Random number generator |
Sources: llkv-expr/src/expr.rs:125-307
Operator Types
The Operator enum defines comparison and pattern matching operations applied to single fields:
Sources: llkv-expr/src/expr.rs:372-428
graph LR
PlanValue["PlanValue\n(enum)"]
Null["Null"]
Integer["Integer(i64)"]
Float["Float(f64)"]
Decimal["Decimal(DecimalValue)"]
String["String(String)"]
Date32["Date32(i32)"]
Struct["Struct(FxHashMap)"]
Interval["Interval(IntervalValue)"]
PlanValue --> Null
PlanValue --> Integer
PlanValue --> Float
PlanValue --> Decimal
PlanValue --> String
PlanValue --> Date32
PlanValue --> Struct
PlanValue --> Interval
Supporting Types
PlanValue
The PlanValue enum represents runtime values in plans, providing a type-safe wrapper for literals and computed values:
PlanValue implements From<T> for common types, enabling ergonomic construction:
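An illustrative construction; which From implementations exist is defined in llkv-plan/src/plans.rs, and the i64/f64 conversions shown here are assumptions:

```rust
fn literal_row() -> Vec<PlanValue> {
    vec![
        PlanValue::from(42i64),               // assumed From<i64> -> Integer
        PlanValue::from(2.5f64),              // assumed From<f64> -> Float
        PlanValue::String("abc".to_string()), // explicit variant construction
        PlanValue::Null,
    ]
}
```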
Sources: llkv-plan/src/plans.rs:73-161
Foreign Key Specifications
Foreign key constraints are defined by ForeignKeySpec:
Sources: llkv-plan/src/plans.rs:412-427
graph TB
CompoundSelectPlan["CompoundSelectPlan"]
Initial["initial: Box<SelectPlan>\nFirst query"]
Operations["operations: Vec<CompoundSelectComponent>\nSubsequent operations"]
Component["CompoundSelectComponent"]
Operator["operator:\nUnion / Intersect / Except"]
Quantifier["quantifier:\nDistinct / All"]
Plan["plan: SelectPlan\nRight-hand query"]
CompoundSelectPlan --> Initial
CompoundSelectPlan --> Operations
Operations --> Component
Component --> Operator
Component --> Quantifier
Component --> Plan
Compound Queries
Compound queries (UNION, INTERSECT, EXCEPT) are represented by CompoundSelectPlan:
This structure allows arbitrary chains of set operations:
SELECT ... UNION ALL SELECT ... INTERSECT SELECT ...
Sources: llkv-plan/src/plans.rs:954-1004
Plan Construction Example
A typical SELECT plan construction follows this pattern:
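The original listing is not reproduced here; the sketch below uses the builder methods and projection variants shown earlier in this document, with the scalar expression, predicate, and ORDER BY entry elided to placeholder locals (field types are simplified):

```rust
// Sketch only: `age_plus_one` (ScalarExpr), `age_at_least_18` (Expr), and
// `order_by_id` (OrderByPlan) are assumed to have been built elsewhere.
let plan = SelectPlan::new("users")
    .with_projections(vec![
        SelectProjection::Column { name: "id".into(), alias: None },
        SelectProjection::Computed { expr: age_plus_one, alias: "next_age".into() },
    ])
    .with_filter(Some(SelectFilter {
        predicate: age_at_least_18, // WHERE age >= 18
        subqueries: vec![],
    }))
    .with_order_by(vec![order_by_id]); // ORDER BY id
```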
This produces a plan for: SELECT id, age + 1 AS next_age FROM users WHERE age >= 18 ORDER BY id
Sources: llkv-plan/src/plans.rs:831-952
Subquery and Correlation Handling
Relevant source files
- llkv-expr/src/expr.rs
- llkv-plan/src/lib.rs
- llkv-plan/src/plans.rs
- llkv-table/src/resolvers/identifier.rs
Purpose and Scope
This document explains how LLKV handles correlated and uncorrelated subqueries in SQL queries. It covers the expression AST structures for representing subqueries, the plan-level metadata for tracking correlation relationships, and the placeholder system used during query planning.
For information about query plan structures more broadly, see Plan Structures. For expression evaluation and compilation, see Expression System.
Overview
LLKV supports two categories of subqueries:
- Predicate Subqueries - Used in WHERE clauses with EXISTS or NOT EXISTS
- Scalar Subqueries - Used in SELECT projections or expressions, returning a single value
Both types can be either correlated (referencing columns from outer queries) or uncorrelated (self-contained). Correlated subqueries require special handling to capture outer column references and inject them as parameters during evaluation.
The system uses a multi-stage approach:
- Planning Phase : Assign unique SubqueryId values and build metadata
- Translation Phase : Replace correlated column references with placeholders
- Execution Phase : Evaluate subqueries per outer row, binding placeholder values
Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:27-67
Subquery Identification System
SubqueryId
Each subquery within a query plan receives a unique SubqueryId identifier. This allows the executor to distinguish multiple subqueries and manage their evaluation contexts separately.
SubqueryId Assignment Flow
graph TB
Query["Main Query\n(SELECT Plan)"]
Filter["WHERE Clause\n(SelectFilter)"]
Proj["SELECT List\n(Projections)"]
Sub1["EXISTS Subquery\nSubqueryId(0)"]
Sub2["Scalar Subquery\nSubqueryId(1)"]
Sub3["EXISTS Subquery\nSubqueryId(2)"]
Query --> Filter
Query --> Proj
Filter --> Sub1
Filter --> Sub3
Proj --> Sub2
Sub1 -.references.-> Meta1["FilterSubquery\nid=0\nplan + correlations"]
Sub2 -.references.-> Meta2["ScalarSubquery\nid=1\nplan + correlations"]
Sub3 -.references.-> Meta3["FilterSubquery\nid=2\nplan + correlations"]
The planner assigns sequential IDs during query translation. Each subquery expression (Expr::Exists or ScalarExpr::ScalarSubquery) holds its assigned ID, while the full subquery plan and correlation metadata are stored separately in the parent SelectPlan.
| Structure | Location | Purpose |
|---|---|---|
SubqueryId | Expression AST | References a subquery definition |
FilterSubquery | SelectPlan.filter.subqueries | Metadata for EXISTS/NOT EXISTS |
ScalarSubquery | SelectPlan.scalar_subqueries | Metadata for scalar subqueries |
Sources: llkv-expr/src/expr.rs:46-65 llkv-plan/src/plans.rs:36-56
Expression AST Structures
SubqueryExpr for Predicates
The SubqueryExpr structure appears in boolean expressions as Expr::Exists variants. It supports negation for NOT EXISTS semantics.
Example WHERE clause : WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id)
The predicate tree contains Expr::Exists(SubqueryExpr { id: SubqueryId(0), negated: false }), while SelectPlan.filter.subqueries[0] holds the actual subquery plan and correlation data.
Sources: llkv-expr/src/expr.rs:49-56
ScalarSubqueryExpr for Projections
Scalar subqueries return a single value per outer row and appear in ScalarExpr trees. They carry the expected data type for validation.
Example projection : SELECT name, (SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count
The expression is ScalarExpr::ScalarSubquery(ScalarSubqueryExpr { id: SubqueryId(0), data_type: Int64 }), with full plan details in SelectPlan.scalar_subqueries[0].
Sources: llkv-expr/src/expr.rs:58-65 llkv-plan/src/plans.rs:47-56
Plan-Level Metadata Structures
classDiagram
class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
class SelectFilter {+Expr predicate\n+Vec~FilterSubquery~ subqueries}
SelectFilter --> FilterSubquery : contains
FilterSubquery --> CorrelatedColumn : tracks correlations
FilterSubquery --> SelectPlan : nested plan
FilterSubquery
This structure captures all metadata needed to evaluate an EXISTS predicate during query execution.
Field Descriptions :
| Field | Type | Purpose |
|---|---|---|
id | SubqueryId | Unique identifier referenced in predicate AST |
plan | Box<SelectPlan> | Complete logical plan for subquery execution |
correlated_columns | Vec<CorrelatedColumn> | Outer columns referenced inside subquery |
Sources: llkv-plan/src/plans.rs:36-45
ScalarSubquery
Parallel structure for scalar subqueries used in projections:
The plan must produce exactly one row with one column. If the subquery returns zero rows, the executor produces NULL. Multiple rows trigger a runtime error.
Sources: llkv-plan/src/plans.rs:47-56
CorrelatedColumn
Describes a single correlated column reference captured from the outer query:
Example :
- Outer Query : SELECT * FROM customers WHERE ...
- Subquery : SELECT 1 FROM orders WHERE orders.customer_id = customers.id
- CorrelatedColumn : { placeholder: "$corr_0", column: "id", field_path: [] }
During subquery evaluation, the executor binds $corr_0 to the current outer row’s id value.
Sources: llkv-plan/src/plans.rs:58-67
Correlation Tracking System
sequenceDiagram
participant Planner
participant Tracker as SubqueryCorrelatedTracker
participant SubqueryPlan
Planner->>Tracker: new()
Planner->>Tracker: track_column("customers.id")
Tracker-->>Planner: placeholder = "$corr_0"
Planner->>SubqueryPlan: Replace "customers.id" with "$corr_0"
Planner->>Tracker: finalize()
Tracker-->>Planner: Vec<CorrelatedColumn>
Placeholder Naming Convention
The planner module exports SUBQUERY_CORRELATED_PLACEHOLDER_PREFIX which defines the prefix for generated placeholder names. The helper function subquery_correlated_placeholder(index) produces names like $corr_0, $corr_1, etc.
Tracking Workflow :
- Create SubqueryCorrelatedTracker when entering subquery translation
- For each outer column reference, call track_column(canonical_name) → returns placeholder
- Replace column reference in subquery AST with placeholder
- Call finalize() to extract Vec<CorrelatedColumn> for plan metadata
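A self-contained sketch of this workflow; the struct name, internals, and the $corr_N placeholder format are illustrative stand-ins for the real tracker in llkv-plan:

```rust
/// Stand-in for the correlated-column tracker described above.
struct CorrelatedColumnTracker {
    /// (placeholder, canonical outer column) pairs, in discovery order.
    columns: Vec<(String, String)>,
}

impl CorrelatedColumnTracker {
    fn new() -> Self {
        Self { columns: Vec::new() }
    }

    /// Returns the placeholder for an outer column, reusing an existing one
    /// if the same column was already tracked (deduplication).
    fn track_column(&mut self, canonical: &str) -> String {
        if let Some(entry) = self.columns.iter().find(|entry| entry.1 == canonical) {
            return entry.0.clone();
        }
        let placeholder = format!("$corr_{}", self.columns.len());
        self.columns.push((placeholder.clone(), canonical.to_string()));
        placeholder
    }

    /// Yields the collected (placeholder, column) pairs for plan metadata.
    fn finalize(self) -> Vec<(String, String)> {
        self.columns
    }
}
```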
Sources: llkv-plan/src/lib.rs:43-46
SubqueryCorrelatedColumnTracker
This type (referenced in exports) manages the mapping between outer columns and generated placeholders. It ensures each unique outer column gets one placeholder regardless of how many times it’s referenced in the subquery.
Deduplication Example :
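Using the tracker sketch above, two references to the same outer column resolve to one placeholder:

```rust
fn dedup_example() {
    let mut tracker = CorrelatedColumnTracker::new();
    // The subquery references customers.id twice, e.g. in two predicates.
    let first = tracker.track_column("customers.id");
    let second = tracker.track_column("customers.id");
    assert_eq!(first, second);               // both uses share "$corr_0"
    assert_eq!(tracker.finalize().len(), 1); // only one CorrelatedColumn entry
}
```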
The tracker creates only one CorrelatedColumn entry with a single placeholder $corr_0 that both references use.
Sources: llkv-plan/src/lib.rs:44-45
Integration with Plan Structures
SelectPlan Storage
The SelectPlan structure holds subquery metadata in dedicated fields:
Storage Pattern :
| Subquery Type | Storage Location | ID Reference Location |
|---|---|---|
| EXISTS in WHERE | filter.subqueries | filter.predicate → Expr::Exists |
| Scalar in SELECT | scalar_subqueries | projections → ScalarExpr::ScalarSubquery |
| Scalar in WHERE | scalar_subqueries | filter.predicate → comparison expressions |
Sources: llkv-plan/src/plans.rs:800-829
Builder Methods
The SelectPlan provides fluent builder methods for attaching subquery metadata:
The planner typically calls these after translating the main query and all nested subqueries.
Sources: llkv-plan/src/plans.rs:895-909
Identifier Resolution During Correlation
graph TB
OuterQuery["Outer Query\ntable=customers\nalias=c"]
SubqueryScope["Subquery Scope\ntable=orders\nalias=o"]
Identifier["Column Reference:\n'c.id'"]
Resolver["IdentifierResolver"]
OuterQuery --> Context1["IdentifierContext\ntable_id=1\nalias=Some('c')"]
SubqueryScope --> Context2["IdentifierContext\ntable_id=2\nalias=Some('o')"]
Identifier --> Resolver
Resolver --> Decision{"Which scope?"}
Decision -->|Outer| Correlated["Mark as correlated\nGenerate placeholder"]
Decision -->|Inner| Local["Resolve locally"]
IdentifierContext for Outer Scopes
When translating a correlated subquery, the planner maintains an IdentifierContext that tracks which columns belong to outer query scopes vs. the subquery’s own tables.
Resolution Process :
- Check if identifier starts with outer table alias → mark correlated
- Check if identifier matches outer table columns (when no alias) → mark correlated
- Otherwise, resolve within subquery’s own table scope
Sources: llkv-table/src/resolvers/identifier.rs:8-66
ColumnResolution Structure
The resolver produces ColumnResolution objects that distinguish simple column references from nested field access:
For correlated columns, this resolution data flows into CorrelatedColumn.field_path, enabling struct-typed correlation like outer_table.struct_column.nested_field.
Sources: llkv-table/src/resolvers/identifier.rs:68-100
Execution Flow
Evaluation Steps :
- Outer Row Context : For each row from outer query, create evaluation context
- Placeholder Binding : Extract values for correlated columns from outer row
- Subquery Execution : Run inner query with bound placeholder values
- Result Integration : EXISTS → boolean, Scalar → value, integrate into outer query
The executor must evaluate correlated subqueries once per outer row, making them potentially expensive. Uncorrelated subqueries can be evaluated once and cached.
Sources: llkv-plan/src/plans.rs:27-67
Example: Correlated EXISTS Subquery
Consider this SQL query:
Planning Phase Output
SelectPlan Structure :
SelectPlan {
tables: [TableRef { table: "customers", alias: Some("c") }],
projections: [
Column { name: "name", alias: None },
Column { name: "id", alias: None }
],
filter: Some(SelectFilter {
predicate: Expr::Exists(SubqueryExpr {
id: SubqueryId(0),
negated: false
}),
subqueries: [
FilterSubquery {
id: SubqueryId(0),
plan: Box::new(SelectPlan {
tables: [TableRef { table: "orders", alias: Some("o") }],
filter: Some(SelectFilter {
predicate: Expr::And([
Compare {
left: Column("customer_id"),
op: Eq,
right: Column("$corr_0")
},
Compare {
left: Column("status"),
op: Eq,
right: Literal("pending")
}
]),
subqueries: []
}),
projections: [Computed { expr: Literal(1), alias: "1" }]
}),
correlated_columns: [
CorrelatedColumn {
placeholder: "$corr_0",
column: "id",
field_path: []
}
]
}
]
}),
// ... other fields ...
}
Translation Details
| Original Reference | After Translation | Correlation Entry |
|---|---|---|
c.id in subquery WHERE | $corr_0 | { placeholder: "$corr_0", column: "id", field_path: [] } |
The subquery’s filter now compares o.customer_id against the placeholder $corr_0 instead of directly referencing the outer column.
Sources: llkv-plan/src/plans.rs:27-67 llkv-expr/src/expr.rs:49-56
Key Design Decisions
Why Separate FilterSubquery and ScalarSubquery?
Different evaluation semantics:
- EXISTS : Returns boolean, short-circuits on first match
- Scalar : Must verify exactly one row returned, extracts value
Having distinct types allows the executor to apply appropriate validation and optimization strategies.
Why Store SubqueryId in Expression AST?
Decouples expression evaluation from subquery execution context. The expression tree remains lightweight and can be cloned/transformed without carrying full subquery plans. The executor looks up metadata by ID when needed.
Why Use String Placeholders?
String placeholders like $corr_0 integrate naturally with the existing column resolution system. The executor can treat them as special “virtual columns” that get their values from outer row context rather than table scans.
Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:36-67
Expression System
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-test-utils/Cargo.toml
The Expression System provides the foundational Abstract Syntax Tree (AST) and type system for representing predicates, scalar computations, and aggregate functions throughout LLKV’s query processing pipeline. This system decouples expression semantics from concrete Arrow data types through a generic design that allows expressions to be built, translated, optimized, and evaluated independently.
For information about query planning structures that contain these expressions, see Query Planning. For details on how expressions are executed and evaluated, see Query Execution.
Purpose and Scope
This page documents the expression representation layer defined primarily in llkv-expr, covering:
- Boolean predicate expressions (Expr) for WHERE and HAVING clauses
- Scalar arithmetic expressions (ScalarExpr) for projections and computed columns
- Operator types and literal value representations
- Subquery integration points
- Generic field identifier patterns that enable expression translation from column names to internal field IDs
graph TB
subgraph "Boolean Expression Layer"
Expr["Expr<'a, F>\nBoolean Predicates"]
ExprAnd["And(Vec<Expr>)"]
ExprOr["Or(Vec<Expr>)"]
ExprNot["Not(Box<Expr>)"]
ExprPred["Pred(Filter)"]
ExprCompare["Compare{left, op, right}"]
ExprInList["InList{expr, list, negated}"]
ExprIsNull["IsNull{expr, negated}"]
ExprLiteral["Literal(bool)"]
ExprExists["Exists(SubqueryExpr)"]
end
subgraph "Scalar Expression Layer"
ScalarExpr["ScalarExpr<F>\nArithmetic/Values"]
SEColumn["Column(F)"]
SELiteral["Literal(Literal)"]
SEBinary["Binary{left, op, right}"]
SENot["Not(Box<ScalarExpr>)"]
SEIsNull["IsNull{expr, negated}"]
SEAggregate["Aggregate(AggregateCall)"]
SEGetField["GetField{base, field_name}"]
SECast["Cast{expr, data_type}"]
SECompare["Compare{left, op, right}"]
SECoalesce["Coalesce(Vec<ScalarExpr>)"]
SEScalarSubquery["ScalarSubquery(ScalarSubqueryExpr)"]
SECase["Case{operand, branches, else_expr}"]
SERandom["Random"]
end
subgraph "Filter Layer"
Filter["Filter<'a, F>"]
FilterField["field_id: F"]
FilterOp["op: Operator<'a>"]
end
subgraph "Operator Types"
Operator["Operator<'a>"]
OpEquals["Equals(Literal)"]
OpRange["Range{lower, upper}"]
OpGT["GreaterThan(Literal)"]
OpLT["LessThan(Literal)"]
OpIn["In(&'a [Literal])"]
OpStartsWith["StartsWith{pattern, case_sensitive}"]
OpEndsWith["EndsWith{pattern, case_sensitive}"]
OpContains["Contains{pattern, case_sensitive}"]
OpIsNull["IsNull"]
OpIsNotNull["IsNotNull"]
end
Expr --> ExprAnd
Expr --> ExprOr
Expr --> ExprNot
Expr --> ExprPred
Expr --> ExprCompare
Expr --> ExprInList
Expr --> ExprIsNull
Expr --> ExprLiteral
Expr --> ExprExists
ExprPred --> Filter
ExprCompare --> ScalarExpr
ExprInList --> ScalarExpr
ExprIsNull --> ScalarExpr
Filter --> FilterField
Filter --> FilterOp
FilterOp --> Operator
Operator --> OpEquals
Operator --> OpRange
Operator --> OpGT
Operator --> OpLT
Operator --> OpIn
Operator --> OpStartsWith
Operator --> OpEndsWith
Operator --> OpContains
Operator --> OpIsNull
Operator --> OpIsNotNull
ScalarExpr --> SEColumn
ScalarExpr --> SELiteral
ScalarExpr --> SEBinary
ScalarExpr --> SENot
ScalarExpr --> SEIsNull
ScalarExpr --> SEAggregate
ScalarExpr --> SEGetField
ScalarExpr --> SECast
ScalarExpr --> SECompare
ScalarExpr --> SECoalesce
ScalarExpr --> SEScalarSubquery
ScalarExpr --> SECase
ScalarExpr --> SERandom
The Expression System serves as the intermediate representation between SQL parsing and physical execution, allowing optimizations and transformations to occur independently of storage details.
Expression Type Hierarchy
Sources: llkv-expr/src/expr.rs:14-182
Boolean Expressions: Expr<’a, F>
The Expr<'a, F> enum represents logical boolean expressions used in WHERE clauses, HAVING clauses, and JOIN conditions. The generic type parameter F represents field identifiers, allowing the same expression structure to work with string column names during planning and numeric field IDs during execution.
Core Expression Variants
| Variant | Purpose | Example SQL |
|---|---|---|
And(Vec<Expr>) | Logical conjunction | WHERE a > 5 AND b < 10 |
Or(Vec<Expr>) | Logical disjunction | WHERE status = 'active' OR status = 'pending' |
Not(Box<Expr>) | Logical negation | WHERE NOT (price > 100) |
Pred(Filter) | Single field predicate | WHERE age >= 18 |
Compare{left, op, right} | Scalar comparison | WHERE col1 + col2 > 100 |
InList{expr, list, negated} | Set membership | WHERE status IN ('active', 'pending') |
IsNull{expr, negated} | Null testing | WHERE (col1 + col2) IS NULL |
Literal(bool) | Constant boolean | Used for empty IN lists or optimizations |
Exists(SubqueryExpr) | Correlated subquery | WHERE EXISTS (SELECT ...) |
The Expr type provides helper methods for constructing common patterns:
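The helper names are not reproduced here; the self-contained sketch below shows the shapes those helpers produce for a simple conjunction, with payload types simplified (the real Expr is generic over a field type F and borrows literals):

```rust
// Simplified stand-ins: the real Operator carries Literal values and many more
// variants, and Filter/Expr are generic over the field identifier type.
enum Operator {
    GreaterThan(i64),
    LessThan(i64),
}

struct Filter {
    field_id: String,
    op: Operator,
}

enum Expr {
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
    Pred(Filter),
    Literal(bool),
}

/// WHERE a > 5 AND b < 10, built by hand.
fn example_predicate() -> Expr {
    Expr::And(vec![
        Expr::Pred(Filter { field_id: "a".into(), op: Operator::GreaterThan(5) }),
        Expr::Pred(Filter { field_id: "b".into(), op: Operator::LessThan(10) }),
    ])
}
```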
Sources: llkv-expr/src/expr.rs:14-123
Scalar Expressions: ScalarExpr
The ScalarExpr<F> enum represents expressions that compute scalar values, used in SELECT projections, computed columns, and as operands in boolean expressions. These expressions can reference columns, perform arithmetic, call functions, and handle complex nested computations.
Scalar Expression Variants
| Variant | Purpose | Example SQL |
|---|---|---|
Column(F) | Column reference | SELECT price FROM products |
Literal(Literal) | Constant value | SELECT 42 |
Binary{left, op, right} | Arithmetic operation | SELECT price * quantity |
Not(Box<ScalarExpr>) | Logical NOT | SELECT NOT active |
IsNull{expr, negated} | NULL test | SELECT col IS NULL |
Aggregate(AggregateCall) | Aggregate function | SELECT COUNT(*) + 1 |
GetField{base, field_name} | Struct field access | SELECT user.address.city |
Cast{expr, data_type} | Type conversion | SELECT CAST(price AS INTEGER) |
Compare{left, op, right} | Comparison returning 0/1 | SELECT (price > 100) |
Coalesce(Vec<ScalarExpr>) | First non-NULL | SELECT COALESCE(price, 0) |
ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery | SELECT (SELECT MAX(price) FROM items) |
Case{operand, branches, else_expr} | Conditional | SELECT CASE WHEN ... THEN ... END |
Random | Random number in [0.0, 1.0) | SELECT RANDOM() |
Operators and Filters
Binary Arithmetic Operators
The BinaryOp enum defines arithmetic and logical operations:
| Operator | SQL Symbol | Description |
|---|---|---|
Add | + | Addition |
Subtract | - | Subtraction |
Multiply | * | Multiplication |
Divide | / | Division |
Modulo | % | Modulo |
And | AND | Logical AND |
Or | OR | Logical OR |
BitwiseShiftLeft | << | Left shift |
BitwiseShiftRight | >> | Right shift |
Sources: llkv-expr/src/expr.rs:309-338
Comparison Operators
The CompareOp enum defines relational comparisons:
| Operator | SQL Symbol | Description |
|---|---|---|
Eq | = | Equality |
NotEq | != | Inequality |
Lt | < | Less than |
LtEq | <= | Less than or equal |
Gt | > | Greater than |
GtEq | >= | Greater than or equal |
Sources: llkv-expr/src/expr.rs:340-363
Filter Operators
The Filter<'a, F> struct combines a field identifier with an Operator<'a> to represent single-field predicates. The Operator enum supports specialized operations optimized for columnar storage:
| Operator | Purpose | Optimization |
|---|---|---|
Equals(Literal) | Exact match | Hash-based lookup |
Range{lower, upper} | Range query | Min/max chunk pruning |
GreaterThan(Literal) | > comparison | Min/max chunk pruning |
GreaterThanOrEquals(Literal) | >= comparison | Min/max chunk pruning |
LessThan(Literal) | < comparison | Min/max chunk pruning |
LessThanOrEquals(Literal) | <= comparison | Min/max chunk pruning |
In(&'a [Literal]) | Set membership | Borrowed slice for efficiency |
StartsWith{pattern, case_sensitive} | Prefix match | String-optimized |
EndsWith{pattern, case_sensitive} | Suffix match | String-optimized |
Contains{pattern, case_sensitive} | Substring match | String-optimized |
IsNull | NULL test | Null bitmap scan |
IsNotNull | NOT NULL test | Null bitmap scan |
The Operator::Range variant uses Rust’s std::ops::Bound enum to represent inclusive/exclusive bounds, enabling efficient representation of expressions like WHERE age BETWEEN 18 AND 65 or WHERE timestamp >= '2024-01-01'.
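A minimal sketch of that BETWEEN case, assuming the Range{lower, upper} fields hold std::ops::Bound<Literal> values as described above; the module paths and Literal::from conversions are assumptions.
use std::ops::Bound;
use llkv_expr::expr::{Expr, Filter, Operator};
use llkv_expr::literal::Literal;
// WHERE age BETWEEN 18 AND 65 as a single Range predicate with inclusive bounds.
fn working_age() -> Expr<'static, &'static str> {
    Expr::Pred(Filter {
        field_id: "age",
        op: Operator::Range {
            lower: Bound::Included(Literal::from(18)),
            upper: Bound::Included(Literal::from(65)),
        },
    })
}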
Sources: llkv-expr/src/expr.rs:365-428
Literal Value System
The Expression System uses Literal (defined in llkv-expr) to represent untyped constant values in expressions. These are separate from Arrow’s scalar types and provide a lightweight representation that can be coerced to concrete types during execution based on column schemas.
The PlanValue enum (defined in llkv-plan) serves a similar purpose at the plan level, with the conversion function plan_value_from_literal() bridging between the two representations.
Sources: llkv-expr/src/expr.rs:1-10 llkv-plan/src/plans.rs:73-161
graph TB
subgraph "AggregateCall Variants"
AggregateCall["AggregateCall<F>"]
CountStar["CountStar"]
Count["Count{expr, distinct}"]
Sum["Sum{expr, distinct}"]
Total["Total{expr, distinct}"]
Avg["Avg{expr, distinct}"]
Min["Min(Box<ScalarExpr>)"]
Max["Max(Box<ScalarExpr>)"]
CountNulls["CountNulls(Box<ScalarExpr>)"]
GroupConcat["GroupConcat{expr, distinct, separator}"]
end
AggregateCall --> CountStar
AggregateCall --> Count
AggregateCall --> Sum
AggregateCall --> Total
AggregateCall --> Avg
AggregateCall --> Min
AggregateCall --> Max
AggregateCall --> CountNulls
AggregateCall --> GroupConcat
Count --> ScalarExprArg["Box<ScalarExpr<F>>"]
Sum --> ScalarExprArg
Total --> ScalarExprArg
Avg --> ScalarExprArg
GroupConcat --> ScalarExprArg
Aggregate Functions
The AggregateCall<F> enum represents aggregate function calls within scalar expressions. Unlike simple column aggregates, these operate on arbitrary expressions:
| Aggregate | Purpose | Example SQL |
|---|---|---|
CountStar | Count all rows | SELECT COUNT(*) |
Count{expr, distinct} | Count non-null expression values | SELECT COUNT(DISTINCT user_id) |
Sum{expr, distinct} | Sum of expression values | SELECT SUM(price * quantity) |
Total{expr, distinct} | Sum treating NULL as 0 | SELECT TOTAL(amount) |
Avg{expr, distinct} | Average of expression values | SELECT AVG(col1 + col2) |
Min(expr) | Minimum value | SELECT MIN(-price) |
Max(expr) | Maximum value | SELECT MAX(col1 * col2) |
CountNulls(expr) | Count NULL values | SELECT COUNT_NULLS(optional_field) |
GroupConcat{expr, distinct, separator} | Concatenate strings | SELECT GROUP_CONCAT(name, ', ') |
The key distinction is that each aggregate (except CountStar) accepts a Box<ScalarExpr<F>> rather than just a column name, enabling complex expressions like AVG(col1 + col2) or SUM(-price).
Sources: llkv-expr/src/expr.rs:184-215
Subquery Integration
The Expression System provides integration points for correlated and scalar subqueries through specialized types:
SubqueryExpr
Used in boolean Expr::Exists predicates to represent EXISTS or NOT EXISTS subqueries:
Sources: llkv-expr/src/expr.rs:45-56
ScalarSubqueryExpr
Used in ScalarExpr::ScalarSubquery to represent scalar subqueries that return a single value:
Sources: llkv-expr/src/expr.rs:58-65
SubqueryId
A lightweight wrapper around u32 that uniquely identifies a subquery within a query plan:
The actual subquery definitions are stored in the parent plan structure (e.g., SelectPlan::scalar_subqueries or SelectFilter::subqueries), with expressions referencing them by ID. This design allows the same subquery to be referenced multiple times without duplication.
Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:27-67
Generic Field Identifier Pattern
The generic type parameter F in Expr<'a, F>, ScalarExpr<F>, and Filter<'a, F> enables the same expression structure to work at different stages of query processing:
- Planning Stage (F = String) : expressions use human-readable column names from SQL
- Translation Stage (F = FieldId) : expressions use table-local field identifiers
- Execution Stage (F = LogicalFieldId) : expressions use namespace-qualified field identifiers for MVCC columns
This design allows expression trees to be built during SQL parsing, translated to internal identifiers during planning, and evaluated efficiently during execution without changing the fundamental expression structure. The translation process is documented in section Expression Translation.
Sources: llkv-expr/src/expr.rs:14-370
Expression Construction Helpers
Both Expr and ScalarExpr provide builder-style helper methods to simplify expression construction in code:
Boolean Expression Helpers
Sources: llkv-expr/src/expr.rs:67-123
Scalar Expression Helpers
Sources: llkv-expr/src/expr.rs:217-307
Integration with Query Plans
The Expression System integrates with query plans through several key structures defined in llkv-plan:
SelectFilter
Wraps a boolean expression with associated correlated subqueries:
FilterSubquery
Describes a correlated subquery referenced by an EXISTS predicate:
ScalarSubquery
Describes a scalar subquery referenced in a projection expression:
CorrelatedColumn
Maps placeholder column names used in subquery expressions to actual outer query columns:
Plans attach these structures to represent the full expression context, allowing executors to evaluate subqueries with proper outer row bindings.
Sources: llkv-plan/src/plans.rs:27-67
Expression System Data Flow
Sources: llkv-expr/src/expr.rs llkv-plan/src/plans.rs:27-1032
Key Design Principles
Type-Safe Generic Design
The generic field identifier pattern (F in Expr<'a, F>) provides compile-time type safety while allowing the same expression structure to work at different pipeline stages. This eliminates the need for multiple parallel expression type hierarchies.
Deferred Type Resolution
Literal values remain untyped until evaluation time, when they are coerced based on the target column’s Arrow DataType. This allows expressions like WHERE age > 18 to work correctly whether age is Int8, Int32, or Int64.
Zero-Copy Operator Patterns
The Operator::In(&'a [Literal]) variant borrows a slice of literals rather than owning a Vec, enabling stack-allocated IN lists without heap allocation for common small cases.
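The sketch below illustrates that borrowing pattern; the module paths and Literal::from conversions are assumptions, and the helper function is illustrative rather than part of the crate.
use llkv_expr::expr::{Expr, Filter, Operator};
use llkv_expr::literal::Literal;
// The IN list is only borrowed by Operator::In, so a small list can live in a
// local array on the caller's stack instead of a heap-allocated Vec.
fn status_filter<'a>(allowed: &'a [Literal]) -> Expr<'a, &'static str> {
    Expr::Pred(Filter {
        field_id: "status",
        op: Operator::In(allowed),
    })
}
fn example() {
    let allowed = [Literal::from("active"), Literal::from("pending")];
    let _expr = status_filter(&allowed);
}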
Expression Rewriting Support
The is_trivially_true() and is_full_range_for() helper methods enable query optimizers to identify and eliminate redundant predicates without deep tree traversals.
Subquery Indirection
Subqueries are represented by SubqueryId references rather than inline nested plans, allowing the same subquery to be referenced multiple times without duplication and enabling separate optimization of inner and outer queries.
Sources: llkv-expr/src/expr.rs:1-819
Expression AST
Relevant source files
Purpose and Scope
The Expression AST defines the abstract syntax tree for all logical and arithmetic expressions in LLKV. This module provides the core data structures used to represent predicates, scalar computations, aggregations, and subqueries throughout the query processing pipeline. The AST is type-agnostic and decoupled from Arrow’s concrete scalar types, allowing expressions to be constructed, transformed, and optimized before being bound to specific table schemas.
For information about translating expressions from string column names to field identifiers, see Expression Translation. For information about compiling expressions into executable bytecode, see Program Compilation. For information about evaluating expressions against data, see Scalar Evaluation and NumericKernels.
Sources: llkv-expr/src/expr.rs:1-14
Expression System Architecture
The expression system consists of two primary AST types that serve distinct purposes in query processing:
Expr<'a, F> represents boolean-valued expressions used in WHERE clauses, HAVING clauses, and join conditions. These expressions combine logical operators (AND, OR, NOT) with predicates that test column values against literals or ranges.
graph TB
subgraph "Boolean Predicates"
Expr["Expr<'a, F>\nBoolean logic"]
Filter["Filter<'a, F>\nField predicate"]
Operator["Operator<'a>\nComparison operators"]
end
subgraph "Value Expressions"
ScalarExpr["ScalarExpr<F>\nArithmetic expressions"]
BinaryOp["BinaryOp\n+, -, *, /, %"]
CompareOp["CompareOp\n=, !=, <, >"]
AggregateCall["AggregateCall<F>\nCOUNT, SUM, AVG"]
end
subgraph "Supporting Types"
Literal["Literal\nType-agnostic values"]
SubqueryExpr["SubqueryExpr\nEXISTS predicate"]
ScalarSubqueryExpr["ScalarSubqueryExpr\nScalar subquery"]
end
Expr --> Filter
Filter --> Operator
Expr --> ScalarExpr
ScalarExpr --> BinaryOp
ScalarExpr --> CompareOp
ScalarExpr --> AggregateCall
ScalarExpr --> Literal
Expr --> SubqueryExpr
ScalarExpr --> ScalarSubqueryExpr
Operator --> Literal
ScalarExpr<F> represents value-producing expressions used in SELECT projections, computed columns, and UPDATE assignments. These expressions support arithmetic operations, function calls, type casts, and complex nested computations.
Both types are parameterized by a generic field identifier type F, enabling the same AST structures to be used with different column reference schemes (e.g., String names during planning, FieldId integers during execution).
Sources: llkv-expr/src/expr.rs:14-43 llkv-expr/src/expr.rs:125-182
Boolean Predicate Expressions
The Expr<'a, F> enum represents logical expressions that evaluate to boolean values. It forms the foundation for filtering operations throughout the query engine.
graph TD
Expr["Expr<'a, F>"]
And["And(Vec<Expr>)\nLogical conjunction"]
Or["Or(Vec<Expr>)\nLogical disjunction"]
Not["Not(Box<Expr>)\nLogical negation"]
Pred["Pred(Filter)\nField predicate"]
Compare["Compare\nScalar comparison"]
InList["InList\nSet membership"]
IsNull["IsNull\nNULL test"]
Literal["Literal(bool)\nConstant boolean"]
Exists["Exists(SubqueryExpr)\nCorrelated EXISTS"]
Expr --> And
Expr --> Or
Expr --> Not
Expr --> Pred
Expr --> Compare
Expr --> InList
Expr --> IsNull
Expr --> Literal
Expr --> Exists
Expr Variants
| Variant | Purpose | Example SQL |
|---|---|---|
And(Vec<Expr>) | Logical conjunction of sub-expressions | col1 = 5 AND col2 > 10 |
Or(Vec<Expr>) | Logical disjunction of sub-expressions | status = 'active' OR status = 'pending' |
Not(Box<Expr>) | Logical negation | NOT (age < 18) |
Pred(Filter) | Single-field predicate with operator | price < 100 |
Compare | Comparison between two scalar expressions | col1 + col2 > col3 * 2 |
InList | Set membership test | status IN ('active', 'pending') |
IsNull | NULL test for complex expressions | (col1 + col2) IS NULL |
Literal(bool) | Constant true/false value | WHERE true |
Exists(SubqueryExpr) | Correlated subquery existence test | EXISTS (SELECT 1 FROM t2 WHERE ...) |
The And and Or variants accept vectors of expressions, allowing efficient representation of multi-way logical operations without deep nesting. The Pred variant wraps a Filter<'a, F> structure for simple single-column predicates, which can be efficiently pushed down to storage layer scanning operations.
Sources: llkv-expr/src/expr.rs:14-43 llkv-expr/src/expr.rs:67-123
Filter and Operator Types
The Filter<'a, F> structure encapsulates a single predicate against a field, combining a field identifier with an operator:
The Operator<'a> enum defines comparison and pattern-matching operations over untyped Literal values:
| Operator Variant | Description | Example |
|---|---|---|
Equals(Literal) | Exact equality test | status = 'active' |
Range { lower, upper } | Bounded range test | age BETWEEN 18 AND 65 |
GreaterThan(Literal) | Greater than comparison | price > 100.0 |
GreaterThanOrEquals(Literal) | Greater than or equal | quantity >= 10 |
LessThan(Literal) | Less than comparison | age < 18 |
LessThanOrEquals(Literal) | Less than or equal | score <= 100 |
In(&'a [Literal]) | Set membership | status IN ('active', 'pending') |
StartsWith { pattern, case_sensitive } | Prefix match | name LIKE 'John%' |
EndsWith { pattern, case_sensitive } | Suffix match | email LIKE '%@example.com' |
Contains { pattern, case_sensitive } | Substring match | description LIKE '%keyword%' |
IsNull | NULL test | email IS NULL |
IsNotNull | Non-NULL test | email IS NOT NULL |
The In operator accepts a borrowed slice of literals to avoid allocations for small, static IN lists. The pattern-matching operators (StartsWith, EndsWith, Contains) support both case-sensitive and case-insensitive matching.
Sources: llkv-expr/src/expr.rs:365-428
graph TD
ScalarExpr["ScalarExpr<F>"]
Column["Column(F)\nField reference"]
Literal["Literal\nConstant value"]
Binary["Binary\nArithmetic operation"]
Not["Not\nLogical negation"]
IsNull["IsNull\nNULL test"]
Aggregate["Aggregate\nAggregate function"]
GetField["GetField\nStruct field access"]
Cast["Cast\nType conversion"]
Compare["Compare\nBoolean comparison"]
Coalesce["Coalesce\nFirst non-NULL"]
ScalarSubquery["ScalarSubquery\nSubquery result"]
Case["Case\nConditional expression"]
Random["Random\nRandom number"]
ScalarExpr --> Column
ScalarExpr --> Literal
ScalarExpr --> Binary
ScalarExpr --> Not
ScalarExpr --> IsNull
ScalarExpr --> Aggregate
ScalarExpr --> GetField
ScalarExpr --> Cast
ScalarExpr --> Compare
ScalarExpr --> Coalesce
ScalarExpr --> ScalarSubquery
ScalarExpr --> Case
ScalarExpr --> Random
Scalar Value Expressions
The ScalarExpr<F> enum represents expressions that produce scalar values. These are used in SELECT projections, computed columns, ORDER BY clauses, and anywhere a value (rather than a boolean) is needed.
ScalarExpr Variants
| Variant | Purpose | Example SQL |
|---|---|---|
Column(F) | Reference to a table column | price |
Literal(Literal) | Constant value | 42, 'hello', 3.14 |
Binary { left, op, right } | Arithmetic or logical binary operation | price * quantity |
Not(Box<ScalarExpr>) | Logical negation returning 1/0 | NOT active |
IsNull { expr, negated } | NULL test returning 1/0 | col1 IS NULL |
Aggregate(AggregateCall) | Aggregate function call | COUNT(*) + 1 |
GetField { base, field_name } | Struct field extraction | user.address.city |
Cast { expr, data_type } | Explicit type cast | CAST(price AS INTEGER) |
Compare { left, op, right } | Comparison producing boolean result | price > 100 |
Coalesce(Vec<ScalarExpr>) | First non-NULL expression | COALESCE(nickname, username) |
ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery result | (SELECT MAX(price) FROM ...) |
Case { operand, branches, else_expr } | Conditional expression | CASE WHEN x > 0 THEN 1 ELSE -1 END |
Random | Random number generator | RANDOM() |
The Binary variant supports arithmetic operators (+, -, *, /, %) as well as logical operators (AND, OR) and bitwise shift operators (<<, >>). The Compare variant produces boolean results (represented as 1/0 integers) from comparisons like col1 > col2.
Sources: llkv-expr/src/expr.rs:125-307
Binary and Comparison Operators
The BinaryOp enum defines arithmetic and logical binary operators:
The CompareOp enum defines comparison operators for scalar expressions:
These operators enable complex arithmetic expressions like (price * quantity * (1 - discount)) > 1000 and nested logical operations like (col1 + col2) > (col3 * 2) AND status = 'active'.
Sources: llkv-expr/src/expr.rs:309-363
Aggregate Function Calls
The AggregateCall<F> enum represents aggregate function invocations within scalar expressions. Unlike simple column aggregates, these variants operate on full expressions, enabling aggregations like AVG(col1 + col2) or SUM(-price).
graph LR
AggregateCall["AggregateCall<F>"]
CountStar["CountStar\nCOUNT(*)"]
Count["Count { expr, distinct }\nCOUNT(expr)"]
Sum["Sum { expr, distinct }\nSUM(expr)"]
Total["Total { expr, distinct }\nTOTAL(expr)"]
Avg["Avg { expr, distinct }\nAVG(expr)"]
Min["Min(expr)\nMIN(expr)"]
Max["Max(expr)\nMAX(expr)"]
CountNulls["CountNulls(expr)\nCOUNT_NULLS(expr)"]
GroupConcat["GroupConcat { expr, distinct, separator }\nGROUP_CONCAT(expr)"]
AggregateCall --> CountStar
AggregateCall --> Count
AggregateCall --> Sum
AggregateCall --> Total
AggregateCall --> Avg
AggregateCall --> Min
AggregateCall --> Max
AggregateCall --> CountNulls
AggregateCall --> GroupConcat
AggregateCall Variants
| Variant | SQL Equivalent | Distinct Support | Description |
|---|---|---|---|
CountStar | COUNT(*) | No | Count all rows including NULLs |
Count { expr, distinct } | COUNT(expr) | Yes | Count non-NULL expression values |
Sum { expr, distinct } | SUM(expr) | Yes | Sum of expression values |
Total { expr, distinct } | TOTAL(expr) | Yes | Sum returning 0 for empty set (SQLite semantics) |
Avg { expr, distinct } | AVG(expr) | Yes | Average of expression values |
Min(expr) | MIN(expr) | No | Minimum expression value |
Max(expr) | MAX(expr) | No | Maximum expression value |
CountNulls(expr) | N/A | No | Count NULL values in expression |
GroupConcat { expr, distinct, separator } | GROUP_CONCAT(expr) | Yes | Concatenate values with separator |
All variants except CountStar accept a Box<ScalarExpr<F>>, allowing aggregates to operate on computed expressions. For example, SUM(price * quantity) is represented as:
// SUM(price * quantity): the aggregate wraps a computed Binary expression
// rather than a plain column reference.
AggregateCall::Sum {
    expr: Box::new(ScalarExpr::Binary {
        left: Box::new(ScalarExpr::Column("price")),
        op: BinaryOp::Multiply,
        right: Box::new(ScalarExpr::Column("quantity")),
    }),
    // false selects SUM(expr); true would select SUM(DISTINCT expr).
    distinct: false,
}
Sources: llkv-expr/src/expr.rs:184-215 llkv-expr/src/expr.rs:217-307
Subquery Integration
The expression AST supports two forms of subquery integration: boolean EXISTS predicates and scalar subqueries that produce single values.
Boolean EXISTS Subqueries
The SubqueryExpr structure represents a correlated EXISTS predicate within a boolean expression:
The id field references a subquery definition stored separately in the query plan (see FilterSubquery in llkv-plan/src/plans.rs:36-45), while negated indicates whether the SQL used NOT EXISTS. The separation of subquery metadata from the expression tree allows the same subquery to be referenced multiple times without duplication.
Scalar Subqueries
The ScalarSubqueryExpr structure represents a subquery that returns a single scalar value:
graph TB
SelectPlan["SelectPlan"]
ExprFilter["filter: Option<SelectFilter>"]
SelectFilter["SelectFilter"]
Predicate["predicate: Expr<'static, String>"]
FilterSubqueries["subqueries: Vec<FilterSubquery>"]
ScalarSubqueries["scalar_subqueries: Vec<ScalarSubquery>"]
Projections["projections: Vec<SelectProjection>"]
ComputedProj["Computed { expr, alias }"]
ScalarExpr["expr: ScalarExpr<String>"]
SelectPlan --> ExprFilter
ExprFilter --> SelectFilter
SelectFilter --> Predicate
SelectFilter --> FilterSubqueries
SelectPlan --> ScalarSubqueries
SelectPlan --> Projections
Projections --> ComputedProj
ComputedProj --> ScalarExpr
Predicate -.references.-> FilterSubqueries
ScalarExpr -.references.-> ScalarSubqueries
Scalar subqueries appear in value-producing contexts like SELECT (SELECT MAX(price) FROM items) AS max_price or WHERE quantity > (SELECT AVG(quantity) FROM inventory). The data_type field captures the Arrow data type of the subquery’s output column, enabling type checking during expression compilation.
The query planner populates the subqueries field in SelectFilter and the scalar_subqueries field in SelectPlan with complete subquery definitions, while expressions reference them by SubqueryId. This architecture enables efficient subquery execution and correlation tracking during query evaluation.
Sources: llkv-expr/src/expr.rs:45-66 llkv-plan/src/plans.rs:27-67
Literal Values
Expressions reference the Literal type from the llkv-expr crate’s literal module to represent constant values. The literal system is type-agnostic, deferring concrete type resolution until expressions are evaluated against actual table schemas.
The Literal enum (defined in llkv-expr/src/literal.rs) supports:
- Null : SQL NULL value
- Int128(i128) : Integer literals (wide representation to handle all integer sizes)
- Float64(f64) : Floating-point literals
- Decimal128(DecimalValue) : Fixed-precision decimal literals
- String(String) : Text literals
- Boolean(bool) : True/false literals
- Date32(i32) : Date literals (days since Unix epoch)
- Struct(Vec<(String, Literal)>) : Structured data literals
- Interval(IntervalValue) : Time interval literals
The use of Int128 for integer literals allows the expression system to represent values that may exceed i64 range during intermediate computations, with overflow checks deferred to evaluation time. The Decimal128 variant uses the DecimalValue type from llkv-types to maintain precision and scale metadata.
Literals can be constructed through From trait implementations, enabling ergonomic expression building:
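A minimal sketch of that ergonomic construction; which concrete From implementations exist is an assumption, so the conversions shown (integer, float, string, boolean) are illustrative examples only.
use llkv_expr::literal::Literal;
fn literal_examples() {
    let _int: Literal = 42i64.into();    // assumed to produce Literal::Int128
    let _float: Literal = 3.5f64.into(); // assumed to produce Literal::Float64
    let _text: Literal = "hello".into(); // assumed to produce Literal::String
    let _flag: Literal = true.into();    // assumed to produce Literal::Boolean
}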
Sources: llkv-expr/src/expr.rs:10
Type Parameterization and Field References
Both Expr<'a, F> and ScalarExpr<F> are parameterized by a generic type F that represents field identifiers. This design enables the same AST structures to be used at different stages of query processing with different column reference schemes:
graph LR
subgraph "Planning Stage"
ExprString["Expr<'static, String>\nColumn names"]
ScalarString["ScalarExpr<String>\nColumn names"]
end
subgraph "Translation Stage"
Translator["Expression Translator"]
end
subgraph "Execution Stage"
ExprFieldId["Expr<'static, FieldId>\nNumeric IDs"]
ScalarFieldId["ScalarExpr<FieldId>\nNumeric IDs"]
end
ExprString --> Translator
ScalarString --> Translator
Translator --> ExprFieldId
Translator --> ScalarFieldId
Common Type Instantiations
| Stage | Field Type | Use Case | Example Types |
|---|---|---|---|
| Planning | String | SQL parsing and plan construction | Expr<'static, String> |
| Execution | FieldId | Table scanning and evaluation | Expr<'static, FieldId> |
| Testing | u32 or &str | Lightweight tests without full schema | Expr<'static, u32> |
The lifetime parameter 'a in Expr<'a, F> represents the lifetime of borrowed data in Operator::In(&'a [Literal]), allowing filter expressions to reference static IN lists without heap allocation. Most expressions use 'static lifetime, indicating no borrowed data.
Field Reference Translation
The expression translation system (see Expression Translation) converts expressions from string-based column names to numeric FieldId identifiers by walking the AST and resolving names against table schemas. This transformation is represented by the generic parameter substitution:
Expr<'static, String> → Expr<'static, FieldId>
ScalarExpr<String> → ScalarExpr<FieldId>
The generic design allows the same expression evaluation logic to work with both naming schemes without code duplication, while maintaining type safety at compile time.
Sources: llkv-expr/src/expr.rs:14-21 llkv-expr/src/expr.rs:125-134 llkv-expr/src/expr.rs:365-370
Expression Construction Helpers
Both Expr and ScalarExpr provide builder methods for ergonomic AST construction:
Expr Helpers
ScalarExpr Helpers
These helpers simplify expression construction in query planners and translators by providing clearer semantics than direct enum variant construction.
Sources: llkv-expr/src/expr.rs:67-86 llkv-expr/src/expr.rs:217-307
Expression Analysis Methods
The Expr type provides utility methods for analyzing expression structure and optimizing query execution:
The is_trivially_true() method identifies expressions that cannot filter out any rows, such as:
- Unbounded range filters: Operator::Range { lower: Unbounded, upper: Unbounded }
- Literal true values: Expr::Literal(true)
Scan executors use this method to skip predicate evaluation entirely when the filter is guaranteed to match all rows, avoiding unnecessary computation.
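A minimal sketch of that short-circuit, assuming only the is_trivially_true() method named above; the helper function, its signature, and the Vec<bool> selection mask are illustrative, not the executor's actual code.
use llkv_expr::expr::Expr;
// Hypothetical scan helper: build a selection mask for `row_count` rows.
fn selection_mask<F>(filter: &Expr<'_, F>, row_count: usize) -> Vec<bool> {
    if filter.is_trivially_true() {
        // The filter matches every row; skip predicate evaluation entirely.
        return vec![true; row_count];
    }
    // Normal per-chunk predicate evaluation would happen here (omitted).
    todo!("full predicate evaluation is outside this sketch")
}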
Sources: llkv-expr/src/expr.rs:87-123
graph TB
subgraph "Query Plans"
SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
UpdatePlan["UpdatePlan"]
DeletePlan["DeletePlan"]
end
subgraph "Expression Usage"
FilterUsage["filter: Option<SelectFilter>\nWHERE clauses"]
HavingUsage["having: Option<Expr>\nHAVING clauses"]
ProjUsage["projections with ScalarExpr\nSELECT list"]
UpdateExpr["assignments with ScalarExpr\nSET clauses"]
JoinOn["on_condition in JoinMetadata\nJOIN ON clauses"]
end
SelectPlan --> FilterUsage
SelectPlan --> HavingUsage
SelectPlan --> ProjUsage
SelectPlan --> JoinOn
UpdatePlan --> UpdateExpr
UpdatePlan --> FilterUsage
DeletePlan --> FilterUsage
Integration with Query Plans
Expressions integrate deeply with the query planning system defined in llkv-plan:
Expression Usage in Plans
| Plan Type | Expression Field | Expression Type | Purpose |
|---|---|---|---|
SelectPlan | filter | SelectFilter with Expr<'static, String> | WHERE clause filtering |
SelectPlan | having | Expr<'static, String> | HAVING clause filtering |
SelectPlan | projections | SelectProjection::Computed with ScalarExpr<String> | Computed columns |
SelectPlan | joins[].on_condition | Expr<'static, String> | JOIN ON conditions |
UpdatePlan | filter | Expr<'static, String> | WHERE clause for updates |
UpdatePlan | assignments[].value | AssignmentValue::Expression with ScalarExpr<String> | SET clause expressions |
DeletePlan | filter | Expr<'static, String> | WHERE clause for deletes |
All plan-level expressions use String as the field identifier type since plans are constructed during SQL parsing before schema resolution. The query executor translates these to FieldId-based expressions before evaluation.
Sources: llkv-plan/src/plans.rs:27-34 llkv-plan/src/plans.rs:660-692 llkv-plan/src/plans.rs:794-828
Expression Translation
Relevant source files
Purpose and Scope
Expression Translation is the process of converting expressions that reference columns by name (as strings) into expressions that reference columns by numeric field identifiers (FieldId). This translation bridges the gap between the SQL parsing/planning layer—which operates on human-readable column names—and the execution layer, which requires efficient numeric field identifiers for accessing columnar storage.
This page documents the translation mechanisms, key functions, and integration points. For information about the expression AST types themselves, see Expression AST. For details on how translated expressions are compiled into executable programs, see Program Compilation.
Sources: llkv-expr/src/expr.rs:1-819 llkv-executor/src/lib.rs:87-97
The Parameterized Expression Type System
The LLKV expression system uses generic type parameters to support multiple identifier types throughout the query processing pipeline. All expression types are parameterized over a field identifier type F:
| Expression Type | Description | Parameter |
|---|---|---|
Expr<'a, F> | Boolean predicate expression | Field identifier type F |
ScalarExpr<F> | Arithmetic/scalar expression | Field identifier type F |
Filter<'a, F> | Single-field predicate | Field identifier type F |
The parameterization allows the same expression structures to be used with different identifier representations:
- During Planning : Expr<'static, String> and ScalarExpr<String> use column names as parsed from SQL
- During Execution : Expr<'static, FieldId> and ScalarExpr<FieldId> use numeric field identifiers for efficient storage access
graph TD
subgraph "SQL Parsing Layer"
SQL["SQL Query Text"]
PARSER["sqlparser"]
AST["SQL AST"]
end
subgraph "Planning Layer"
PLANNER["Query Planner"]
EXPR_STRING["Expr<String>\nScalarExpr<String>"]
PLAN["SelectPlan"]
end
subgraph "Translation Layer"
TRANSLATOR["translate_scalar\ntranslate_predicate"]
SCHEMA["Schema / Catalog"]
RESOLVER["IdentifierResolver"]
end
subgraph "Execution Layer"
EXPR_FIELDID["Expr<FieldId>\nScalarExpr<FieldId>"]
EVALUATOR["Expression Evaluator"]
STORAGE["Column Store"]
end
SQL --> PARSER
PARSER --> AST
AST --> PLANNER
PLANNER --> EXPR_STRING
EXPR_STRING --> PLAN
PLAN --> TRANSLATOR
SCHEMA --> TRANSLATOR
RESOLVER --> TRANSLATOR
TRANSLATOR --> EXPR_FIELDID
EXPR_FIELDID --> EVALUATOR
EVALUATOR --> STORAGE
This design separates concerns: the planner manipulates human-readable names without needing catalog knowledge, while the executor works with resolved numeric identifiers that map directly to physical storage locations.
Diagram: Expression Translation Flow from SQL to Execution
Sources: llkv-expr/src/expr.rs:14-182 llkv-executor/src/lib.rs:87-97
Core Translation Functions
The translation layer exposes a set of functions for converting string-based expressions to field ID-based expressions. These functions are defined in the llkv-plan crate’s translation module and re-exported by llkv-executor for convenience.
Primary Translation Functions
| Function | Purpose | Signature Pattern |
|---|---|---|
translate_scalar | Translate scalar expressions | (expr: &ScalarExpr<String>, schema, error_fn) -> Result<ScalarExpr<FieldId>> |
translate_scalar_with | Translate with custom resolver | (expr: &ScalarExpr<String>, resolver, error_fn) -> Result<ScalarExpr<FieldId>> |
translate_predicate | Translate filter predicates | (expr: &Expr<String>, schema, error_fn) -> Result<Expr<FieldId>> |
translate_predicate_with | Translate predicate with resolver | (expr: &Expr<String>, resolver, error_fn) -> Result<Expr<FieldId>> |
resolve_field_id_from_schema | Resolve single column name | (name: &str, schema) -> Result<FieldId> |
The _with variants accept an IdentifierResolver reference for more complex scenarios (multi-table queries, subqueries, etc.), while the simpler variants accept a schema directly and construct a resolver internally.
Usage Pattern
Translation functions follow a consistent pattern: they take a string-based expression, schema/resolver information, and an error handler closure. The error handler is invoked when a column name cannot be resolved, allowing callers to customize error messages:
The error closure receives the unresolved column name and returns an appropriate error type. This pattern appears throughout the executor when translating expressions from plans:
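A minimal sketch of that calling pattern, based on the signature pattern in the table above. The translate_predicate function name and the "unknown column … in filter" message come from this page; the Schema and FieldId types, the wrapper function, and the error constructor are assumptions.
use llkv_expr::expr::Expr;
// Translate a WHERE-clause predicate from column names to field IDs.
fn translate_where_clause(
    predicate: &Expr<'static, String>,
    schema: &Schema, // Arrow schema carrying field-id metadata (assumed type)
) -> Result<Expr<'static, FieldId>, Error> {
    translate_predicate(predicate, schema, |name: &str| {
        // Error constructor shown here is illustrative.
        Error::from(format!("unknown column '{name}' in filter"))
    })
}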
Sources: llkv-executor/src/lib.rs:87-97 llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059
Schema-Based Resolution
Column name resolution relies on Arrow schema information to map string identifiers to numeric field IDs. The resolution process handles case-insensitive matching and validates that referenced columns actually exist in the schema.
graph LR
subgraph "Input"
EXPR_STR["ScalarExpr<String>"]
COLUMN_NAME["Column Name: 'user_id'"]
end
subgraph "Resolution Context"
SCHEMA["Arrow Schema"]
FIELDS["Field Definitions"]
METADATA["Field Metadata"]
end
subgraph "Resolution Process"
NORMALIZE["Normalize Name\n(case-insensitive)"]
LOOKUP["Lookup in Schema"]
EXTRACT_ID["Extract FieldId"]
end
subgraph "Output"
EXPR_FIELD["ScalarExpr<FieldId>"]
FIELD_ID["FieldId: 42"]
end
COLUMN_NAME --> NORMALIZE
SCHEMA --> LOOKUP
NORMALIZE --> LOOKUP
LOOKUP --> EXTRACT_ID
EXTRACT_ID --> FIELD_ID
EXPR_STR --> NORMALIZE
EXTRACT_ID --> EXPR_FIELD
Resolution Workflow
Diagram: Column Name to FieldId Resolution
The resolve_field_id_from_schema function performs the core resolution logic. It searches the schema’s field definitions for a matching column name and extracts the associated field ID from the field’s metadata.
Schema Structure
Arrow schemas used during translation contain:
- Field Definitions : Name, data type, nullability
- Field Metadata : Key-value pairs including the numeric field ID
- Nested Field Support : For struct types, schemas may contain nested field hierarchies
The translation process must handle qualified names (e.g., table.column), nested field access (e.g., user.address.city), and alias resolution when applicable.
Sources: llkv-executor/src/lib.rs:87-97
Field Path Resolution for Nested Fields
When expressions reference nested fields within struct types, the translation process must resolve not just the top-level column but the entire field path. This is handled through the IdentifierResolver and ColumnResolution types provided by llkv-table/catalog.
graph TD
subgraph "Input Expression"
NESTED["GetField Expression"]
BASE["base: user"]
FIELD["field_name: 'address'"]
SUBFIELD["field_name: 'city'"]
end
subgraph "Resolver"
RESOLVER["IdentifierResolver"]
CONTEXT["IdentifierContext"]
end
subgraph "Resolution Result"
COL_RES["ColumnResolution"]
COL_NAME["column(): 'user'"]
FIELD_PATH["field_path(): ['address', 'city']"]
FIELD_ID["Resolved FieldId"]
end
NESTED --> RESOLVER
CONTEXT --> RESOLVER
RESOLVER --> COL_RES
COL_RES --> COL_NAME
COL_RES --> FIELD_PATH
COL_RES --> FIELD_ID
ColumnResolution Structure
Diagram: Nested Field Resolution
The ColumnResolution type encapsulates the resolution result, providing:
- The base column name
- The field path for nested access (empty for top-level columns)
- The resolved field ID for storage access
This information is used during correlated subquery tracking and when translating GetField expressions in the scalar expression tree.
Sources: llkv-sql/src/sql_engine.rs:37-38 llkv-sql/src/sql_engine.rs:420-427
Translation in Multi-Table Contexts
When translating expressions for queries involving multiple tables (joins, cross products, subqueries), the translation process must disambiguate column references that may appear in multiple tables. This is handled by the IdentifierResolver which maintains context about available tables and their schemas.
IdentifierContext and Resolution
The IdentifierContext type (from llkv-table/catalog) represents the set of tables and columns available in a given scope. During translation:
- Outer Scope Tracking : For subqueries, outer table contexts are tracked separately
- Column Disambiguation : Qualified names (e.g., table.column) are resolved against the appropriate table
- Ambiguity Detection : Unqualified references to columns that exist in multiple tables produce errors
The translate_predicate_with and translate_scalar_with functions accept an IdentifierResolver reference that encapsulates this context:
Sources: llkv-sql/src/sql_engine.rs:37-38
Error Handling and Diagnostics
Translation failures occur when column names cannot be resolved. The error handling strategy uses caller-provided closures to generate context-specific error messages.
Error Patterns
| Scenario | Error Message Pattern |
|---|---|
| Unknown column in aggregate | "unknown column '{name}' in aggregate expression" |
| Unknown column in WHERE clause | "unknown column '{name}' in filter" |
| Unknown column in cross product | "column '{name}' not found in cross product result" |
| Ambiguous column reference | "column '{name}' is ambiguous" |
The error closure pattern allows the caller to include query-specific context in error messages. This is particularly important for debugging complex queries where the same expression type might be used in multiple contexts.
Resolution Failure Example
When translate_scalar encounters a ScalarExpr::Column(name) variant and the name cannot be found in the schema, it invokes the error closure:
Sources: llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059
graph TB
subgraph "Planning Phase"
SQL["SQL Statement"]
PARSE["Parse & Build Plan"]
PLAN["SelectPlan\nUpdatePlan\netc."]
EXPR_STR["Expressions with\nString identifiers"]
end
subgraph "Execution Preparation"
GET_TABLE["Get Table Handle"]
SCHEMA_FETCH["Fetch Schema"]
TRANSLATE["Translation Functions"]
EXPR_FIELD["Expressions with\nFieldId identifiers"]
end
subgraph "Execution Phase"
BUILD_SCAN["Build ScanProjection"]
COMPILE["Compile to EvalProgram"]
EVALUATE["Evaluate Against Batches"]
RESULTS["RecordBatch Results"]
end
SQL --> PARSE
PARSE --> PLAN
PLAN --> EXPR_STR
PLAN --> GET_TABLE
GET_TABLE --> SCHEMA_FETCH
SCHEMA_FETCH --> TRANSLATE
EXPR_STR --> TRANSLATE
TRANSLATE --> EXPR_FIELD
EXPR_FIELD --> BUILD_SCAN
BUILD_SCAN --> COMPILE
COMPILE --> EVALUATE
EVALUATE --> RESULTS
Integration with Query Execution Pipeline
Expression translation occurs at the boundary between planning and execution. Plans produced by the SQL layer contain string-based expressions, which are translated as execution structures are built.
Translation Points in Execution
Diagram: Translation in the Execution Pipeline
Key Translation Points
- Filter Translation : WHERE clause expressions are translated before being passed to the scan optimizer when building scan plans
- Projection Translation : Computed columns in SELECT projections are translated before evaluation
- Aggregate Translation : Aggregate function arguments are translated to resolve column references
- Join Condition Translation : ON clause expressions for joins are translated in the context of both joined tables
The executor’s ensure_computed_projection function demonstrates this integration. It translates a string-based expression, infers its result data type, and registers it as a computed projection for the scan:
This function encapsulates the full translation workflow: resolve column names, infer types, and prepare the translated expression for execution.
Sources: llkv-executor/src/lib.rs:470-501 llkv-executor/src/lib.rs:87-97
Translation of Complex Expression Types
The translation process must handle all variants of the expression AST, recursively translating nested expressions while preserving structure and semantics.
Recursive Translation Table
| Expression Variant | Translation Strategy |
|---|---|
ScalarExpr::Column | Resolve string to FieldId via schema |
ScalarExpr::Literal | No translation needed (no field references) |
ScalarExpr::Binary | Recursively translate left and right operands |
ScalarExpr::Aggregate | Translate the aggregate’s argument expression |
ScalarExpr::GetField | Translate base expression, preserve field name |
ScalarExpr::Cast | Translate inner expression, preserve target type |
ScalarExpr::Compare | Recursively translate both comparison operands |
ScalarExpr::Coalesce | Translate each expression in the list |
ScalarExpr::Case | Translate operand and all WHEN/THEN/ELSE branches |
ScalarExpr::ScalarSubquery | No translation (contains SubqueryId reference) |
ScalarExpr::Random | No translation (no field references) |
For predicate expressions (Expr<F>):
| Predicate Variant | Translation Strategy |
|---|---|
Expr::And / Expr::Or | Recursively translate all sub-expressions |
Expr::Not | Recursively translate inner expression |
Expr::Pred(Filter) | Translate filter’s field ID, preserve operator |
Expr::Compare | Translate left and right scalar expressions |
Expr::InList | Translate target expression and list elements |
Expr::IsNull | Translate the operand expression |
Expr::Literal | No translation (constant boolean value) |
Expr::Exists | No translation (contains SubqueryId reference) |
The translation process maintains the expression tree structure while substituting field identifiers, ensuring that evaluation semantics remain unchanged.
Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:14-66
Performance Considerations
Expression translation is performed once during query execution setup, not per-row or per-batch. The translated expressions are then compiled into evaluation programs (see Program Compilation) which are reused across all batches in the query result.
Translation Caching
The executor maintains caches to avoid redundant translation work:
- Computed Projection Cache : Stores translated expressions keyed by their string representation to avoid re-translating identical expressions in the same query
- Column Projection Cache : Maps field IDs to projection indices to reuse existing projections when multiple expressions reference the same column
This caching strategy is evident in functions like ensure_computed_projection, which checks the cache before performing translation:
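A hypothetical sketch of that cache-then-translate pattern; the struct, map type, key choice, and method names are illustrative stand-ins rather than the executor's actual fields.
use std::collections::HashMap;
struct ProjectionCache {
    // Keyed by the expression's string form, as described in the text above.
    by_expr_text: HashMap<String, usize>, // value: projection index
}
impl ProjectionCache {
    fn ensure(&mut self, expr_text: &str, mut translate_and_register: impl FnMut() -> usize) -> usize {
        if let Some(&idx) = self.by_expr_text.get(expr_text) {
            return idx; // reuse the already-translated projection
        }
        let idx = translate_and_register(); // translate once, register the projection
        self.by_expr_text.insert(expr_text.to_string(), idx);
        idx
    }
}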
Sources: llkv-executor/src/lib.rs:470-501
Program Compilation
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-test-utils/Cargo.toml
Purpose and Scope
Program compilation transforms expression ASTs into executable bytecode optimized for evaluation. This intermediate representation enables efficient vectorized operations and simplifies the runtime evaluation engine. The compilation phase bridges the gap between high-level expression trees and low-level execution kernels.
This page covers the compilation of ScalarExpr and filter Expr trees into EvalProgram and DomainProgram bytecode respectively. For information about the expression AST structure, see Expression AST. For how compiled programs are evaluated, see Scalar Evaluation and NumericKernels.
Compilation Pipeline Overview
Sources: llkv-expr/src/expr.rs:1-819
Compilation Targets
The compilation system produces two distinct bytecode formats depending on the expression context:
| Program Type | Input AST | Purpose | Output |
|---|---|---|---|
EvalProgram | ScalarExpr<FieldId> | Compute scalar values per row | Arrow array of computed values |
DomainProgram | Expr<FieldId> | Evaluate boolean predicates | Bitmap of matching row IDs |
EvalProgram Structure
EvalProgram compiles scalar expressions into a stack-based bytecode suitable for vectorized evaluation. Each instruction operates on Arrow arrays, producing intermediate or final results:
Sources: llkv-expr/src/expr.rs:125-307
flowchart TB
subgraph ScalarInput["Scalar Expression Tree"]
ROOT["Binary: Add"]
LEFT["Binary: Multiply"]
RIGHT["Cast: Float64"]
COL1["Column: field_id=1"]
LIT1["Literal: 10"]
COL2["Column: field_id=2"]
end
ROOT --> LEFT
ROOT --> RIGHT
LEFT --> COL1
LEFT --> LIT1
RIGHT --> COL2
subgraph Bytecode["EvalProgram Bytecode"]
I1["LOAD_COLUMN field_id=1"]
I2["LOAD_LITERAL 10"]
I3["MULTIPLY"]
I4["LOAD_COLUMN field_id=2"]
I5["CAST Float64"]
I6["ADD"]
end
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> I6
subgraph Execution["Execution"]
STACK["Evaluation Stack\nArray-based operations"]
RESULT["Result Array"]
end
I6 --> STACK
STACK --> RESULT
DomainProgram Structure
DomainProgram compiles filter predicates into bytecode optimized for boolean evaluation and row filtering. The program operates on column metadata and data chunks to identify matching rows:
Sources: llkv-expr/src/expr.rs:14-123 llkv-expr/src/expr.rs:365-428
flowchart TB
subgraph FilterInput["Filter Expression Tree"]
AND["And"]
PRED1["Pred: field_id=1 > 100"]
PRED2["Pred: field_id=2 IN values"]
NOT["Not"]
PRED3["Pred: field_id=3 LIKE pattern"]
end
AND --> PRED1
AND --> NOT
NOT --> PRED2
AND --> PRED3
subgraph DomainBytecode["DomainProgram Bytecode"]
D1["EVAL_PRED field_id=1\nGreaterThan 100"]
D2["EVAL_PRED field_id=2\nIn values"]
D3["NEGATE"]
D4["EVAL_PRED field_id=3\nContains pattern"]
D5["AND_ALL"]
end
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> D5
subgraph BitmapOps["Bitmap Operations"]
B1["Bitmap: field_id=1 matches"]
B2["Bitmap: field_id=2 matches"]
B3["Bitmap: NOT operation"]
B4["Bitmap: field_id=3 matches"]
B5["Bitmap: AND operation"]
FINAL["Final: matching row IDs"]
end
D5 --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B4 --> B5
B5 --> FINAL
Expression Analysis and Type Inference
Before bytecode generation, the compiler analyzes the expression tree to:
- Resolve data types - Each expression node’s output type is inferred from its inputs and operation
- Validate operations - Ensure type compatibility for binary operations and function calls
- Track dependencies - Identify which columns and subqueries the expression requires
- Detect constant subexpressions - Find opportunities for constant folding
Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:309-363
Compilation Optimizations
The compilation phase applies several optimization passes to improve evaluation performance:
Constant Folding
Expressions involving only literal values are evaluated at compile time:
| Original Expression | Optimized Form |
|---|---|
Literal(2) + Literal(3) | Literal(5) |
Literal(10) * Literal(0) | Literal(0) |
Cast(Literal("123"), Int64) | Literal(123) |
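A minimal sketch of folding literal addition, assuming the ScalarExpr, BinaryOp, and Literal shapes described on these pages; the real compiler's folding pass is not shown in this extract, and checked_add is used here simply to skip folding on overflow.
use llkv_expr::expr::{BinaryOp, ScalarExpr};
use llkv_expr::literal::Literal;
// Fold `literal + literal` into a single literal; every other shape passes through.
fn fold_add<F>(expr: ScalarExpr<F>) -> ScalarExpr<F> {
    match expr {
        ScalarExpr::Binary { left, op: BinaryOp::Add, right } => {
            if let (ScalarExpr::Literal(Literal::Int128(a)), ScalarExpr::Literal(Literal::Int128(b))) =
                (left.as_ref(), right.as_ref())
            {
                if let Some(sum) = a.checked_add(*b) {
                    return ScalarExpr::Literal(Literal::Int128(sum));
                }
            }
            ScalarExpr::Binary { left, op: BinaryOp::Add, right }
        }
        other => other,
    }
}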
Expression Simplification
Algebraic identities and boolean logic simplifications reduce instruction count:
| Pattern | Simplified To |
|---|---|
x + 0 | x |
x * 1 | x |
x * 0 | 0 |
And([true, expr]) | expr |
Or([false, expr]) | expr |
Not(Not(expr)) | expr |
Dead Code Elimination
Unreachable code paths in Case expressions are removed:
Sources: llkv-expr/src/expr.rs:169-176
Bytecode Generation
After optimization, the compiler generates bytecode instructions by walking the expression tree in post-order (depth-first). Each expression variant maps to one or more bytecode instructions:
Scalar Expression Instruction Mapping
| Expression Variant | Generated Instructions |
|---|---|
Column(field_id) | LOAD_COLUMN field_id |
Literal(value) | LOAD_LITERAL value |
Binary{left, op, right} | Compile left → Compile right → BINARY_OP op |
Cast{expr, data_type} | Compile expr → CAST data_type |
Compare{left, op, right} | Compile left → Compile right → COMPARE op |
Coalesce(exprs) | Compile all exprs → COALESCE count |
Case{operand, branches, else} | Complex multi-instruction sequence with jump tables |
Random | RANDOM_F64 |
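The concrete EvalProgram instruction type is not shown in this extract; the enum below is a hypothetical, self-contained stand-in that mirrors the mapping table, just to illustrate post-order emission for a stack machine.
// Hypothetical instruction set; names mirror the table above, not the real bytecode.
enum Instr {
    LoadColumn(u32),   // LOAD_COLUMN field_id
    LoadLiteral(i128), // LOAD_LITERAL value (integers only in this sketch)
    BinaryOpAdd,       // BINARY_OP Add
}
// Post-order emission: operands first, then the operator, so a stack machine
// can pop its inputs and push the result.
fn emit_column_plus_constant(field_id: u32, constant: i128, out: &mut Vec<Instr>) {
    out.push(Instr::LoadColumn(field_id));
    out.push(Instr::LoadLiteral(constant));
    out.push(Instr::BinaryOpAdd);
}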
Sources: llkv-expr/src/expr.rs:125-182
Filter Expression Instruction Mapping
| Expression Variant | Generated Instructions |
|---|---|
Pred(Filter{field_id, op}) | EVAL_PREDICATE field_id op |
And(exprs) | Compile all exprs → AND_ALL count |
Or(exprs) | Compile all exprs → OR_ALL count |
Not(expr) | Compile expr → NEGATE |
Compare{left, right, op} | Compile as scalar → TO_BOOLEAN |
InList{expr, list, negated} | Compile expr → Build lookup table → IN_SET negated |
IsNull{expr, negated} | Compile expr → IS_NULL negated |
Literal(bool) | LOAD_BOOL value |
Exists(subquery) | EVAL_SUBQUERY subquery_id |
Sources: llkv-expr/src/expr.rs:14-56
Subquery Handling
Correlated subqueries require special compilation handling. The compiler generates placeholder references that are resolved during execution:
Sources: llkv-expr/src/expr.rs:45-66
flowchart TB
subgraph Outer["Outer Query Expression"]
FILTER["Filter with EXISTS"]
SUBQUERY["Subquery: SubqueryId=1\nCorrelated columns"]
end
subgraph Compilation["Compilation Strategy"]
PLACEHOLDER["Generate EVAL_SUBQUERY\ninstruction with ID"]
CORRELATION["Track correlated\ncolumn mappings"]
DEFER["Defer actual subquery\ncompilation to executor"]
end
subgraph Execution["Runtime Resolution"]
BIND["Bind outer row values\nto correlated placeholders"]
EVAL_SUB["Execute subquery plan\nwith bound values"]
COLLECT["Collect subquery\nresult set"]
BOOLEAN["Convert to boolean\nfor EXISTS/IN"]
end
FILTER --> SUBQUERY
SUBQUERY --> PLACEHOLDER
PLACEHOLDER --> CORRELATION
CORRELATION --> DEFER
DEFER --> BIND
BIND --> EVAL_SUB
EVAL_SUB --> COLLECT
COLLECT --> BOOLEAN
Aggregate Function Compilation
Aggregate expressions within scalar contexts (e.g., COUNT(*) + 1) compile to instructions that reference pre-computed aggregate results:
Sources: llkv-expr/src/expr.rs:184-215
flowchart LR
subgraph AggExpr["Aggregate Expression"]
AGG_CALL["AggregateCall\nCountStar / Sum / Avg"]
end
subgraph Compilation["Compilation"]
REF["Generate AGG_REFERENCE\ninstruction"]
METADATA["Store aggregate metadata\nfunction type, distinct flag"]
end
subgraph PreExecution["Pre-Execution Phase"]
COMPUTE["Executor computes\naggregate values"]
STORE["Store in aggregate\nresult table"]
end
subgraph Evaluation["Expression Evaluation"]
LOOKUP["AGG_REFERENCE instruction\nlooks up pre-computed value"]
BROADCAST["Broadcast scalar result\nto array length"]
CONTINUE["Continue with remaining\nexpression operations"]
end
AGG_CALL --> REF
REF --> METADATA
METADATA --> COMPUTE
COMPUTE --> STORE
STORE --> LOOKUP
LOOKUP --> BROADCAST
BROADCAST --> CONTINUE
Integration with Execution Layer
Compiled programs are executed by the compute kernels layer, which provides vectorized implementations of each bytecode instruction:
Sources: llkv-expr/src/expr.rs:1-819
flowchart TB
subgraph Programs["Compiled Programs"]
EP["EvalProgram"]
DP["DomainProgram"]
end
subgraph ComputeLayer["llkv-compute Kernels"]
ARITHMETIC["Arithmetic Kernels\nadd_arrays, multiply_arrays"]
COMPARISON["Comparison Kernels\ngt_arrays, eq_arrays"]
CAST_K["Cast Kernels\ncast_array"]
LOGICAL["Logical Kernels\nand_bitmaps, or_bitmaps"]
STRING["String Kernels\ncontains, starts_with"]
end
subgraph Execution["Execution Context"]
BATCH["Input RecordBatch\nColumn arrays"]
STACK["Evaluation Stack"]
BITMAP["Bitmap Accumulator"]
end
subgraph Results["Evaluation Results"]
SCALAR_OUT["Array of computed\nscalar values"]
FILTER_OUT["Bitmap of matching\nrow IDs"]
end
EP --> ARITHMETIC
EP --> COMPARISON
EP --> CAST_K
DP --> COMPARISON
DP --> LOGICAL
DP --> STRING
ARITHMETIC --> STACK
COMPARISON --> STACK
CAST_K --> STACK
LOGICAL --> BITMAP
STRING --> BITMAP
BATCH --> STACK
BATCH --> BITMAP
STACK --> SCALAR_OUT
BITMAP --> FILTER_OUT
Compilation Performance Considerations
The compilation phase is designed to amortize its cost over repeated evaluations:
| Scenario | Compilation Strategy |
|---|---|
| One-time query | Compile on demand, minimal optimization |
| Repeated query | Compile once, cache bytecode, reuse across invocations |
| Prepared statement | Pre-compile at preparation time, execute many times |
| Table scan filter | Compile predicate once, apply to all batches |
| Aggregation | Compile aggregate expressions, evaluate per group |
Compilation Cache Strategy
Sources: llkv-expr/src/expr.rs:1-819
Summary
The compilation phase transforms high-level expression ASTs into efficient bytecode programs optimized for vectorized execution. By separating compilation from evaluation, the system achieves:
- Performance : Bytecode enables efficient stack-based evaluation with Arrow kernels
- Reusability : Compiled programs can be cached and reused across query invocations
- Optimization : Multiple optimization passes improve runtime performance
- Type Safety : Type inference and validation occur during compilation, not evaluation
- Maintainability : Clear separation between compilation and execution concerns
The compiled EvalProgram and DomainProgram bytecode formats serve as the bridge between query planning and execution, enabling the query engine to efficiently evaluate complex scalar computations and filter predicates over columnar data.
Scalar Evaluation and NumericKernels
Relevant source files
- llkv-csv/src/writer.rs
- llkv-expr/src/expr.rs
- llkv-plan/src/plans.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/table.rs
Purpose and Scope
This page documents the scalar expression evaluation engine in LLKV, covering how ScalarExpr instances are computed against columnar data to produce results for projections, filters, and computed columns. The evaluation system operates on Apache Arrow arrays and leverages vectorization for performance.
For information about the expression AST structure and variants, see Expression AST. For how expressions are translated from SQL column names to field identifiers, see Expression Translation. For the bytecode compilation pipeline, see Program Compilation. For aggregate function evaluation, see Aggregation System.
ScalarExpr Variants
The ScalarExpr<F> enum represents all scalar computations that can be performed in LLKV. Each variant corresponds to a specific type of operation that produces a single value per row.
Sources: llkv-expr/src/expr.rs:126-182
graph TB
ScalarExpr["ScalarExpr<F>"]
ScalarExpr --> Column["Column(F)\nDirect column reference"]
ScalarExpr --> Literal["Literal(Literal)\nConstant value"]
ScalarExpr --> Binary["Binary\nArithmetic operations"]
ScalarExpr --> Not["Not(Box<ScalarExpr>)\nLogical negation"]
ScalarExpr --> IsNull["IsNull\nNULL test"]
ScalarExpr --> Aggregate["Aggregate(AggregateCall)\nAggregate functions"]
ScalarExpr --> GetField["GetField\nStruct field access"]
ScalarExpr --> Cast["Cast\nType conversion"]
ScalarExpr --> Compare["Compare\nComparison ops"]
ScalarExpr --> Coalesce["Coalesce\nFirst non-null"]
ScalarExpr --> ScalarSubquery["ScalarSubquery\nSubquery result"]
ScalarExpr --> Case["Case\nCASE expression"]
ScalarExpr --> Random["Random\nRandom number"]
Binary --> BinaryOp["BinaryOp:\nAdd, Subtract, Multiply,\nDivide, Modulo,\nAnd, Or,\nBitwiseShiftLeft,\nBitwiseShiftRight"]
Compare --> CompareOp["CompareOp:\nEq, NotEq,\nLt, LtEq,\nGt, GtEq"]
The generic type parameter F represents the field identifier type, which is typically String (for SQL column names) or FieldId (for translated physical column identifiers).
Binary Operations
Binary operations perform arithmetic or logical computations between two scalar expressions. The BinaryOp enum defines the supported operators:
| Operator | Symbol | Description | Example |
|---|---|---|---|
Add | + | Addition | col1 + col2 |
Subtract | - | Subtraction | col1 - 100 |
Multiply | * | Multiplication | price * quantity |
Divide | / | Division | total / count |
Modulo | % | Remainder | id % 10 |
And | AND | Logical AND | flag1 AND flag2 |
Or | OR | Logical OR | status1 OR status2 |
BitwiseShiftLeft | << | Left bit shift | mask << 2 |
BitwiseShiftRight | >> | Right bit shift | flags >> 4 |
Sources: llkv-expr/src/expr.rs:310-338
Binary expressions are constructed recursively, allowing complex nested computations:
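A minimal sketch of such nesting (module paths assumed; the column names are illustrative): price * quantity + 10, where the multiplication is itself a Binary node inside the addition.
use llkv_expr::expr::{BinaryOp, ScalarExpr};
use llkv_expr::literal::Literal;
// Built bottom-up: the inner Multiply node becomes the left operand of Add.
fn line_total() -> ScalarExpr<&'static str> {
    ScalarExpr::Binary {
        left: Box::new(ScalarExpr::Binary {
            left: Box::new(ScalarExpr::Column("price")),
            op: BinaryOp::Multiply,
            right: Box::new(ScalarExpr::Column("quantity")),
        }),
        op: BinaryOp::Add,
        right: Box::new(ScalarExpr::Literal(Literal::from(10))),
    }
}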
Comparison Operations
Comparison operations produce boolean results (represented as 1/0 integers) by comparing two scalar expressions. The CompareOp enum defines six comparison operators:
| Operator | Symbol | Description |
|---|---|---|
Eq | = | Equal to |
NotEq | != | Not equal to |
Lt | < | Less than |
LtEq | <= | Less than or equal |
Gt | > | Greater than |
GtEq | >= | Greater than or equal |
Sources: llkv-expr/src/expr.rs:341-363
Comparisons are represented as a ScalarExpr::Compare variant that contains the left operand, operator, and right operand. When evaluated, they produce a boolean value that follows SQL three-valued logic (true, false, NULL).
Evaluation Pipeline
The evaluation of scalar expressions follows a multi-stage pipeline from SQL text to computed Arrow arrays:
Sources: llkv-expr/src/expr.rs:1-819 llkv-table/src/table.rs:29-30
graph LR
SQL["SQL Expression\n'col1 + col2 * 10'"]
Parse["SQL Parser\nsqlparser-rs"]
Translate["Expression Translation\nString → FieldId"]
Compile["Program Compilation\nScalarExpr → EvalProgram"]
Execute["Evaluation Engine\nVectorized Execution"]
Result["Arrow Array\nComputed Results"]
SQL --> Parse
Parse --> Translate
Translate --> Compile
Compile --> Execute
Execute --> Result
Compile -.uses.-> ProgramCompiler["ProgramCompiler\nllkv-compute"]
Execute -.operates on.-> ArrowArrays["Arrow Arrays\nColumnar Data"]
The key stages are:
- Parsing : SQL expressions are parsed into AST nodes by sqlparser-rs
- Translation : Column names (strings) are resolved to FieldId identifiers
- Compilation : ScalarExpr is compiled into bytecode by ProgramCompiler
- Execution : The bytecode is evaluated against Arrow columnar data
Type System and Casting
LLKV uses Arrow’s type system for scalar evaluation. The Cast variant of ScalarExpr performs explicit type conversions:
Sources: llkv-expr/src/expr.rs:154-157
The data_type field is an Arrow DataType that specifies the target type for the conversion. Type casting follows Arrow’s casting rules and handles conversions between:
- Numeric types (integers, floats, decimals)
- String types (UTF-8)
- Temporal types (Date32, timestamps)
- Boolean types
- Struct types
Invalid casts produce NULL values following SQL semantics.
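As a rough illustration of the underlying behavior, the standalone arrow-rs snippet below uses the arrow crate's safe cast kernel directly (not LLKV's Cast evaluation) to show unparseable values becoming NULL:

```rust
use arrow::array::{Array, Int64Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() {
    let strings = StringArray::from(vec![Some("42"), Some("not a number"), None]);
    // Safe casting: values that cannot be converted become NULL instead of erroring.
    let casted = cast(&strings, &DataType::Int64).unwrap();
    let ints = casted.as_any().downcast_ref::<Int64Array>().unwrap();
    assert_eq!(ints.value(0), 42);
    assert!(ints.is_null(1)); // invalid conversion -> NULL, matching SQL semantics
    assert!(ints.is_null(2)); // NULL stays NULL
}
```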
Null Handling and Propagation
Scalar expressions implement SQL three-valued logic where NULL values propagate through most operations. The IsNull variant provides explicit NULL testing:
Sources: llkv-expr/src/expr.rs:139-142
When negated is false, this evaluates to 1 (true) if the expression is NULL, 0 (false) otherwise. When negated is true, it performs the inverse test (IS NOT NULL).
The Coalesce variant provides NULL-coalescing behavior, returning the first non-NULL value from a list of expressions:
Sources: llkv-expr/src/expr.rs:165
This is used to implement SQL’s COALESCE(expr1, expr2, ...) function.
CASE Expressions
The Case variant implements SQL CASE expressions with optional operand and ELSE branches:
Sources: llkv-expr/src/expr.rs:169-176
This represents both simple and searched CASE expressions:
- Simple CASE : When operand is Some, each branch’s WHEN expression is compared to the operand
- Searched CASE : When operand is None, each branch’s WHEN expression is evaluated as a boolean
The branches vector contains (WHEN, THEN) pairs evaluated in order. If no branch matches, else_expr is returned, or NULL if else_expr is None.
graph TB
subgraph "Row-by-Row Evaluation (Avoided)"
Row1["Row 1:\ncol1=10, col2=5\n→ 10 + 5 = 15"]
Row2["Row 2:\ncol1=20, col2=3\n→ 20 + 3 = 23"]
Row3["Row 3:\ncol1=15, col2=7\n→ 15 + 7 = 22"]
end
subgraph "Vectorized Evaluation (Used)"
Col1["Int64Array\n[10, 20, 15]"]
Col2["Int64Array\n[5, 3, 7]"]
Result["Int64Array\n[15, 23, 22]"]
Col1 --> VectorAdd["Vectorized Add\nSIMD Operations"]
Col2 --> VectorAdd
VectorAdd --> Result
end
style Row1 fill:#f9f9f9
style Row2 fill:#f9f9f9
style Row3 fill:#f9f9f9
Vectorization Strategy
Expression evaluation in LLKV leverages Apache Arrow’s columnar format for vectorized execution. Rather than evaluating expressions row-by-row, operations process entire arrays at once.
Sources: llkv-table/src/table.rs:1-681
Key benefits of vectorization:
- SIMD Instructions : Modern CPUs can process multiple values simultaneously
- Reduced Overhead : Eliminates per-row interpretation overhead
- Cache Efficiency : Columnar layout improves CPU cache utilization
- Arrow Compute Kernels : Leverages highly optimized Arrow implementations
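The standalone arrow-rs snippet below (illustrative, not LLKV source) shows the columnar pattern from the diagram above: the two input arrays are combined in a single pass to produce the result array. Arrow's compute kernels perform the equivalent work with SIMD rather than an explicit loop.

```rust
use arrow::array::Int64Array;

fn main() {
    // The same columns as in the diagram above.
    let col1 = Int64Array::from(vec![10, 20, 15]);
    let col2 = Int64Array::from(vec![5, 3, 7]);

    // One pass over both arrays produces the result column; NULLs propagate.
    let result: Int64Array = col1
        .iter()
        .zip(col2.iter())
        .map(|(a, b)| match (a, b) {
            (Some(a), Some(b)) => Some(a + b),
            _ => None,
        })
        .collect();

    assert_eq!(result.value(0), 15);
    assert_eq!(result.value(1), 23);
    assert_eq!(result.value(2), 22);
}
```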
graph TB
Expr["ScalarExpr::Binary\nop=Add"]
Dispatch["Type Dispatch"]
Expr --> Dispatch
Dispatch --> Int64["Int64Array\nAdd kernel"]
Dispatch --> Int32["Int32Array\nAdd kernel"]
Dispatch --> Float64["Float64Array\nAdd kernel"]
Dispatch --> Decimal128["Decimal128Array\nAdd kernel"]
Int64 --> Result1["Int64Array result"]
Int32 --> Result2["Int32Array result"]
Float64 --> Result3["Float64Array result"]
Decimal128 --> Result4["Decimal128Array result"]
Numeric Type Dispatch
LLKV handles multiple numeric types through Arrow’s type system. The evaluation engine uses Arrow’s primitive type traits to dispatch operations:
Sources: llkv-table/src/table.rs:17-20
The dispatch mechanism uses Arrow’s type system to select the appropriate kernel at evaluation time. The macros llkv_for_each_arrow_numeric, llkv_for_each_arrow_boolean, and llkv_for_each_arrow_string provide type-safe iteration over all supported Arrow types.
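A simplified dispatch example follows; it is generic arrow-rs code standing in for the macro-generated dispatch in LLKV, selecting a type-specialized kernel based on the runtime DataType:

```rust
use arrow::array::{Array, ArrayRef, Float64Array, Int64Array};
use arrow::datatypes::DataType;
use std::sync::Arc;

// Dispatch on the runtime Arrow type, then run a type-specialized kernel.
fn negate(array: &ArrayRef) -> Option<ArrayRef> {
    match array.data_type() {
        DataType::Int64 => {
            let a = array.as_any().downcast_ref::<Int64Array>()?;
            Some(Arc::new(a.iter().map(|v| v.map(|x| -x)).collect::<Int64Array>()))
        }
        DataType::Float64 => {
            let a = array.as_any().downcast_ref::<Float64Array>()?;
            Some(Arc::new(a.iter().map(|v| v.map(|x| -x)).collect::<Float64Array>()))
        }
        _ => None, // types not handled by this sketch
    }
}

fn main() {
    let ints: ArrayRef = Arc::new(Int64Array::from(vec![1, -2, 3]));
    let negated = negate(&ints).unwrap();
    let negated = negated.as_any().downcast_ref::<Int64Array>().unwrap();
    assert_eq!(negated.value(1), 2);
}
```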
Aggregate Functions in Scalar Context
The Aggregate variant of ScalarExpr allows aggregate functions to appear in scalar contexts, such as COUNT(*) + 1:
Sources: llkv-expr/src/expr.rs:145
The AggregateCall enum includes:
- CountStar: Count all rows
- Count { expr, distinct }: Count non-NULL values
- Sum { expr, distinct }: Sum of values
- Total { expr, distinct }: Sum with NULL-safe 0 default
- Avg { expr, distinct }: Average of values
- Min(expr): Minimum value
- Max(expr): Maximum value
- CountNulls(expr): Count NULL values
- GroupConcat { expr, distinct, separator }: String concatenation
Sources: llkv-expr/src/expr.rs:184-215
Each aggregate operates on a ScalarExpr, allowing nested computations like SUM(col1 + col2) or AVG(-price).
Random Number Generation
The Random variant produces pseudo-random floating-point values:
Sources: llkv-expr/src/expr.rs:181
Following PostgreSQL and DuckDB semantics, each evaluation produces a new random value in the range [0.0, 1.0). The generator is seeded automatically and does not expose seed control at the SQL level.
Struct Field Access
The GetField variant extracts fields from struct expressions.
This enables navigation of nested data structures. For example, accessing user.address.city is represented as nested GetField expressions: the innermost references the user column, the next level extracts the address field, and the outermost extracts city.
The evaluation engine resolves field names against Arrow struct schemas at runtime.
Performance Considerations
The scalar evaluation engine includes several optimizations:
Expression Constant Folding
Constant subexpressions are evaluated once during compilation rather than per-row. For example, col1 + (10 * 20) is simplified to col1 + 200 before evaluation.
Predicate Pushdown
When scalar expressions appear in WHERE clauses, they may be pushed down to the storage layer for early filtering. The PredicateFusionCache in llkv-compute caches compiled predicates to avoid recompilation.
Sources: llkv-table/src/table.rs:29
Type Specialization
Arrow kernels are specialized for each numeric type, avoiding generic dispatch overhead in tight loops. This ensures that Int64 + Int64 uses dedicated integer addition instructions rather than polymorphic dispatch.
SIMD Acceleration
The underlying storage layer (simd-r-drive) provides SIMD-optimized operations for bulk data movement and filtering, which complements the vectorized evaluation strategy.
Sources: llkv-storage/pager/MemPager llkv-table/src/table.rs:21
sequenceDiagram
participant SQL as SQL Engine
participant Planner as TablePlanner
participant Scanner as ScanRowStream
participant Compute as Compute Layer
participant Store as ColumnStore
SQL->>Planner: Execute SELECT with expressions
Planner->>Planner: Compile ScalarExpr to bytecode
Planner->>Scanner: Create scan with projections
loop For each batch
Scanner->>Store: Gather column arrays
Store-->>Scanner: Arrow arrays
Scanner->>Compute: Evaluate expressions
Compute->>Compute: Vectorized operations
Compute-->>Scanner: Computed arrays
Scanner-->>SQL: RecordBatch
end
Integration with Scan Pipeline
Scalar expressions are evaluated during table scans through the ScanProjection system:
Sources: llkv-table/src/table.rs:447-488 llkv-scan/execute/execute_scan
The scan pipeline:
- Gathers base column arrays from the ColumnStore
- Passes arrays to the compute layer for expression evaluation
- Assembles computed results into RecordBatch instances
- Streams batches to the caller
This design minimizes memory allocation by processing data in fixed-size batches (typically 1024 rows) rather than materializing entire result sets.
Expression Compilation Flow
The complete compilation flow from SQL to executed results:
Sources: llkv-expr/src/expr.rs:1-819 llkv-plan/src/plans.rs:1-1500 llkv-table/src/table.rs:1-681
This pipeline ensures that expressions are validated, optimized, and compiled before execution begins, minimizing runtime overhead.
Aggregation System
Relevant source files
The Aggregation System implements SQL aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.) across the LLKV query pipeline. It handles both simple aggregates (SELECT COUNT(*) FROM table) and grouped aggregations (SELECT col, SUM(amount) FROM table GROUP BY col), with support for the DISTINCT modifier and expression-based aggregates like SUM(col1 + col2).
For information about scalar expression evaluation (non-aggregate), see Scalar Evaluation and NumericKernels. For query planning that produces aggregate plans, see Plan Structures.
System Architecture
The aggregation system spans four layers in the codebase, each with distinct responsibilities:
Diagram: Aggregation System Layering
graph TB
subgraph "Expression Layer (llkv-expr)"
AGG_CALL["AggregateCall<F>\nCountStar, Count, Sum, Avg, Min, Max"]
SCALAR_EXPR["ScalarExpr<F>\nAggregate(AggregateCall)"]
end
subgraph "Plan Layer (llkv-plan)"
AGG_EXPR["AggregateExpr\nCountStar, Column"]
AGG_FUNC["AggregateFunction\nCount, SumInt64, MinInt64, etc."]
SELECT_PLAN["SelectPlan\naggregates: Vec<AggregateExpr>\ngroup_by: Vec<String>"]
end
subgraph "Execution Layer (llkv-executor)"
EXECUTOR["QueryExecutor::execute_aggregates\nQueryExecutor::execute_group_by_single_table"]
AGG_VALUE["AggregateValue\nNull, Int64, Float64, Decimal128, String"]
GROUP_STATE["GroupAggregateState\nrepresentative_batch_idx\nrepresentative_row\nrow_locations"]
end
subgraph "Accumulator Layer (llkv-aggregate)"
AGG_ACCUMULATOR["AggregateAccumulator\nInterface for aggregate computation"]
AGG_KIND["AggregateKind\nType classification"]
AGG_SPEC["AggregateSpec\nConfiguration"]
AGG_STATE["AggregateState\nRuntime state"]
end
SCALAR_EXPR --> AGG_CALL
SELECT_PLAN --> AGG_EXPR
AGG_EXPR --> AGG_FUNC
EXECUTOR --> AGG_VALUE
EXECUTOR --> GROUP_STATE
EXECUTOR --> AGG_ACCUMULATOR
AGG_ACCUMULATOR --> AGG_KIND
AGG_ACCUMULATOR --> AGG_SPEC
AGG_ACCUMULATOR --> AGG_STATE
AGG_CALL -.translates to.-> AGG_EXPR
SELECT_PLAN -.executes via.-> EXECUTOR
Sources:
- llkv-expr/src/expr.rs:184-215
- llkv-plan/src/plans.rs:1036-1061
- llkv-executor/src/lib.rs:109-151
- llkv-executor/src/lib.rs:19
Expression-Level Aggregates
Aggregate functions are represented in the expression AST via the AggregateCall<F> enum, which enables aggregates to appear within computed projections (e.g., COUNT(*) + 1 or SUM(col1) / AVG(col2)). Each variant captures the specific aggregate semantics:
| Variant | Description | Example SQL |
|---|---|---|
| CountStar | Count all rows (including NULLs) | COUNT(*) |
| Count { expr, distinct } | Count non-NULL values of expression | COUNT(col), COUNT(DISTINCT col) |
| Sum { expr, distinct } | Sum numeric expression values | SUM(amount), SUM(DISTINCT col) |
| Total { expr, distinct } | Sum with NULL-to-zero coercion | TOTAL(amount) |
| Avg { expr, distinct } | Arithmetic mean of expression | AVG(price) |
| Min(expr) | Minimum value | MIN(created_at) |
| Max(expr) | Maximum value | MAX(score) |
| CountNulls(expr) | Count NULL occurrences | COUNT_NULLS(optional_field) |
| GroupConcat { expr, distinct, separator } | Concatenate strings | GROUP_CONCAT(name, ',') |
Each aggregate operates on a ScalarExpr<F>, not just a column name, which allows complex expressions like SUM(price * quantity) or AVG(col1 + col2).
Sources:
Plan-Level Representation
The query planner converts SQL aggregate syntax into AggregateExpr instances stored in SelectPlan::aggregates. The plan layer uses a simplified representation compared to the expression layer:
Diagram: Plan-Level Aggregate Structure
graph LR
SELECT_PLAN["SelectPlan"]
AGG_LIST["aggregates: Vec<AggregateExpr>"]
GROUP_BY["group_by: Vec<String>"]
SELECT_PLAN --> AGG_LIST
SELECT_PLAN --> GROUP_BY
AGG_LIST --> COUNT_STAR["AggregateExpr::CountStar\nalias: String\ndistinct: bool"]
AGG_LIST --> COLUMN["AggregateExpr::Column\ncolumn: String\nalias: String\nfunction: AggregateFunction\ndistinct: bool"]
COLUMN --> FUNC["AggregateFunction::\nCount, SumInt64, TotalInt64,\nMinInt64, MaxInt64,\nCountNulls, GroupConcat"]
Sources:
The planner distinguishes between:
- Non-grouped aggregates : Empty group_by vector, producing a single result row
- Grouped aggregates : Populated group_by vector, producing one row per distinct group
Execution Strategy Selection
The executor chooses different code paths based on query structure, optimizing for common patterns:
Diagram: Aggregate Execution Decision Tree
graph TD
START["QueryExecutor::execute_select"]
CHECK_COMPOUND{"plan.compound.is_some()"}
CHECK_EMPTY_TABLES{"plan.tables.is_empty()"}
CHECK_GROUP_BY{"!plan.group_by.is_empty()"}
CHECK_MULTI_TABLE{"plan.tables.len() > 1"}
CHECK_AGGREGATES{"!plan.aggregates.is_empty()"}
CHECK_COMPUTED{"has_computed_aggregates(&plan)"}
START --> CHECK_COMPOUND
CHECK_COMPOUND -->|Yes| COMPOUND["execute_compound_select"]
CHECK_COMPOUND -->|No| CHECK_EMPTY_TABLES
CHECK_EMPTY_TABLES -->|Yes| NO_TABLE["execute_select_without_table"]
CHECK_EMPTY_TABLES -->|No| CHECK_GROUP_BY
CHECK_GROUP_BY -->|Yes| CHECK_MULTI_TABLE
CHECK_MULTI_TABLE -->|Multi| CROSS_PROD["execute_cross_product"]
CHECK_MULTI_TABLE -->|Single| GROUP_BY_SINGLE["execute_group_by_single_table"]
CHECK_GROUP_BY -->|No| CHECK_MULTI_TABLE_2{"plan.tables.len() > 1"}
CHECK_MULTI_TABLE_2 -->|Yes| CROSS_PROD
CHECK_MULTI_TABLE_2 -->|No| CHECK_AGGREGATES
CHECK_AGGREGATES -->|Yes| EXEC_AGG["execute_aggregates"]
CHECK_AGGREGATES -->|No| CHECK_COMPUTED
CHECK_COMPUTED -->|Yes| COMPUTED_AGG["execute_computed_aggregates"]
CHECK_COMPUTED -->|No| PROJECTION["execute_projection"]
Sources:
Non-Grouped Aggregate Execution
execute_aggregates processes queries without GROUP BY clauses. All rows are treated as a single group:
- Projection Planning : Build ScanProjection list for columns needed by aggregate expressions
- Expression Translation : Convert ScalarExpr<String> to ScalarExpr<FieldId> using table schema
- Data Streaming : Scan table and accumulate values via AggregateAccumulator
- Result Assembly : Finalize accumulators and construct single-row RecordBatch
Sources:
Grouped Aggregate Execution
execute_group_by_single_table handles queries with GROUP BY clauses:
- Full Scan : Load all table rows into memory (required for grouping)
- Group Key Extraction : Evaluate
GROUP BYexpressions for each row, producingGroupKeyValueinstances - Group State Tracking : Build
FxHashMap<Vec<GroupKeyValue>, GroupAggregateState>mapping group keys to row locations - Per-Group Accumulation : For each group, process its rows through aggregate accumulators
- HAVING Filter : Apply post-aggregation filter if present
- Result Construction : Build output
RecordBatchwith one row per group
Sources:
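To make the grouping flow concrete, here is a self-contained sketch using std's HashMap, simple string keys, and an inline SUM; the executor itself keys an FxHashMap by encoded group-key values and tracks per-group row locations rather than summing inline.

```rust
use std::collections::HashMap;

fn main() {
    // (group key, amount) pairs standing in for scanned rows.
    let rows = [("a", 10), ("b", 5), ("a", 7), ("b", 1)];

    // Map each group key to the accumulated SUM(amount).
    let mut groups: HashMap<&str, i64> = HashMap::new();
    for (key, amount) in rows {
        *groups.entry(key).or_insert(0) += amount;
    }

    assert_eq!(groups["a"], 17);
    assert_eq!(groups["b"], 6);
}
```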
Accumulator Interface
The llkv-aggregate crate (imported at llkv-executor/src/lib.rs:19) provides the AggregateAccumulator trait, which abstracts the computation logic for individual aggregate functions. Each accumulator maintains incremental state as it processes rows:
Diagram: Accumulator Lifecycle
sequenceDiagram
participant Executor
participant Accumulator as AggregateAccumulator
participant State as AggregateState
Executor->>Accumulator: new(AggregateSpec)
Accumulator->>State: initialize()
loop For each batch
Executor->>Accumulator: update(batch, row_indices)
Accumulator->>State: accumulate values
end
Executor->>Accumulator: finalize()
Accumulator->>State: compute final value
Accumulator-->>Executor: AggregateValue
Sources:
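A minimal, self-contained sketch of that lifecycle follows. The trait, method signatures, and SumAccumulator type below are illustrative stand-ins, not the llkv-aggregate API, which operates on Arrow batches and AggregateSpec configuration.

```rust
// Illustrative lifecycle only: create, update per batch, finalize once.
trait Accumulator {
    fn update(&mut self, values: &[Option<i64>]);
    fn finalize(self) -> Option<i64>;
}

#[derive(Default)]
struct SumAccumulator {
    sum: i64,
    saw_value: bool,
}

impl Accumulator for SumAccumulator {
    fn update(&mut self, values: &[Option<i64>]) {
        for v in values.iter().flatten() {
            self.sum += *v; // NULLs are skipped, as SUM requires
            self.saw_value = true;
        }
    }
    fn finalize(self) -> Option<i64> {
        self.saw_value.then_some(self.sum) // all-NULL input yields NULL
    }
}

fn main() {
    let mut acc = SumAccumulator::default();
    acc.update(&[Some(3), None, Some(4)]); // first "batch"
    acc.update(&[Some(5)]);                // second "batch"
    assert_eq!(acc.finalize(), Some(12));
}
```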
The executor wraps accumulator results in AggregateValue, which handles type conversions between the accumulator’s output type and the plan’s expected type:
| AggregateValue Variant | Usage |
|---|---|
| Null | No rows matched, or all values were NULL |
| Int64(i64) | Integer aggregates (COUNT, SUM for integers) |
| Float64(f64) | Floating-point aggregates (AVG, SUM for floats) |
| Decimal128 { value: i128, scale: i8 } | Precise decimal aggregates |
| String(String) | String aggregates (GROUP_CONCAT) |
Sources:
graph TD
START["Aggregate with distinct=true"]
INIT["Initialize FxHashSet<Vec<u8>>\nfor distinct tracking"]
LOOP_START["For each input row"]
EXTRACT["Extract aggregate expression value"]
ENCODE["Encode value as byte vector\nusing encode_row()"]
CHECK_SEEN{"Value already\nin set?"}
SKIP["Skip row\n(duplicate)"]
INSERT["Insert into set"]
ACCUMULATE["Pass to accumulator"]
LOOP_END["Next row"]
FINALIZE["Finalize accumulator"]
START --> INIT
INIT --> LOOP_START
LOOP_START --> EXTRACT
EXTRACT --> ENCODE
ENCODE --> CHECK_SEEN
CHECK_SEEN -->|Yes| SKIP
CHECK_SEEN -->|No| INSERT
INSERT --> ACCUMULATE
SKIP --> LOOP_END
ACCUMULATE --> LOOP_END
LOOP_END --> LOOP_START
LOOP_END -.all rows.-> FINALIZE
Distinct Value Tracking
When an aggregate includes the DISTINCT modifier (e.g., COUNT(DISTINCT col)), the executor must deduplicate values before accumulation. This is handled via hash-based tracking:
Diagram: DISTINCT Aggregate Processing
The encode_row function (referenced throughout llkv-executor/src/lib.rs) converts values to a canonical byte representation suitable for hash-based deduplication.
Sources:
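A sketch of the deduplication step, assuming a simple little-endian byte encoding for integer values (the real encode_row handles full typed rows):

```rust
use std::collections::HashSet;

fn main() {
    // Values feeding COUNT(DISTINCT col); NULLs are excluded by COUNT.
    let values = [Some(3i64), Some(5), Some(3), None, Some(5)];

    // Track already-seen values by their byte encoding, as the executor does
    // with encode_row(); plain little-endian bytes stand in for it here.
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut distinct_count = 0u64;
    for v in values.iter().flatten() {
        if seen.insert(v.to_le_bytes().to_vec()) {
            distinct_count += 1; // only first occurrences reach the accumulator
        }
    }

    assert_eq!(distinct_count, 2);
}
```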
graph LR
INPUT["Input Batch"]
subgraph "Expression Evaluation"
TRANSLATE["translate_scalar\n(String → FieldId)"]
EVAL_NUMERIC["NumericKernels::evaluate_numeric"]
RESULT_ARRAY["Computed ArrayRef"]
end
subgraph "Accumulation"
EXTRACT["Extract values from array"]
ACCUMULATE["AggregateAccumulator::update"]
end
INPUT --> TRANSLATE
TRANSLATE --> EVAL_NUMERIC
EVAL_NUMERIC --> RESULT_ARRAY
RESULT_ARRAY --> EXTRACT
EXTRACT --> ACCUMULATE
Expression-Based Aggregates
Unlike simple column aggregates, expression-based aggregates (e.g., SUM(col1 * col2) or AVG(CASE WHEN x > 0 THEN x ELSE 0 END)) require evaluating the expression for each row before accumulating:
Diagram: Expression Aggregate Evaluation
The executor uses ensure_computed_projection to translate expression trees and infer result data types:
Sources:
This helper ensures the expression is added to the scan projection list only once (via caching), avoiding redundant computation when multiple aggregates reference the same expression.
Simple vs Complex Column Extraction
The function try_extract_simple_column optimizes aggregate evaluation by detecting when an aggregate expression is equivalent to a direct column reference:
This optimization allows the executor to skip expression evaluation machinery for common cases, reading column data directly from the column store.
Sources:
graph TD
AGG_VALUE["AggregateValue"]
AS_I64["as_i64() → Option<i64>"]
AS_F64["as_f64() → Option<f64>"]
AGG_VALUE --> AS_I64
AGG_VALUE --> AS_F64
AS_I64 --> NULL_CHECK1{"Null?"}
NULL_CHECK1 -->|Yes| NONE1["None"]
NULL_CHECK1 -->|No| TYPE_CHECK1{"Type?"}
TYPE_CHECK1 -->|Int64| DIRECT_I64["Some(value)"]
TYPE_CHECK1 -->|Float64| TRUNC["Some(value as i64)"]
TYPE_CHECK1 -->|Decimal128| SCALE_DOWN["Some(value / 10^scale)"]
TYPE_CHECK1 -->|String| PARSE_I64["s.parse::<i64>().ok()"]
AS_F64 --> NULL_CHECK2{"Null?"}
NULL_CHECK2 -->|Yes| NONE2["None"]
NULL_CHECK2 -->|No| TYPE_CHECK2{"Type?"}
TYPE_CHECK2 -->|Int64| PROMOTE["Some(value as f64)"]
TYPE_CHECK2 -->|Float64| DIRECT_F64["Some(value)"]
TYPE_CHECK2 -->|Decimal128| DIVIDE["Some(value / 10.0^scale)"]
TYPE_CHECK2 -->|String| PARSE_F64["s.parse::<f64>().ok()"]
Aggregate Result Types and Conversions
AggregateValue provides conversion methods to satisfy different consumer requirements:
Diagram: AggregateValue Type Conversions
Sources:
These conversions enable:
- Order By : Converting aggregate results to sortable numeric types
- HAVING Filters : Evaluating post-aggregate predicates that compare aggregate values
- Nested Aggregates : Using one aggregate’s result in another’s computation (rare, but supported in computed projections)
graph LR
EXPR["GROUP BY expression"]
EVAL["Evaluate for each row"]
subgraph "GroupKeyValue Variants"
NULL_VAL["Null"]
INT_VAL["Int(i64)"]
BOOL_VAL["Bool(bool)"]
STRING_VAL["String(String)"]
end
ENCODE["encode_row()\nVec<GroupKeyValue> → Vec<u8>"]
MAP["FxHashMap<Vec<u8>, GroupAggregateState>"]
EXPR --> EVAL
EVAL --> NULL_VAL
EVAL --> INT_VAL
EVAL --> BOOL_VAL
EVAL --> STRING_VAL
NULL_VAL --> ENCODE
INT_VAL --> ENCODE
BOOL_VAL --> ENCODE
STRING_VAL --> ENCODE
ENCODE --> MAP
Group Key Representation
For grouped aggregations, the executor encodes group-by expressions into GroupKeyValue instances, which form composite keys in the group state map:
Diagram: Group Key Encoding
Sources:
The GroupAggregateState struct tracks which rows belong to each group:
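A sketch of the struct based on the field names shown in the system layering diagram (representative_batch_idx, representative_row, row_locations); the concrete field types here are assumptions:

```rust
// Field names follow the executor's GroupAggregateState as shown in the
// layering diagram; the types are illustrative assumptions.
struct GroupAggregateState {
    /// Batch that supplied the group's representative row (used for GROUP BY output values).
    representative_batch_idx: usize,
    /// Row index of the representative row within that batch.
    representative_row: usize,
    /// Every (batch index, row index) location belonging to this group.
    row_locations: Vec<(usize, usize)>,
}
```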
This representation enables efficient accumulation: for each group, the executor iterates row_locations and passes those rows to the aggregate accumulators.
Sources:
sequenceDiagram
participant Exec as QueryExecutor
participant Scan as Table Scan
participant Accum as Accumulators
participant Eval as NumericKernels
Exec->>Scan: Scan all rows
Scan-->>Exec: RecordBatch
Exec->>Exec: Identify aggregate calls\nin projections
loop For each aggregate
Exec->>Accum: create accumulator
Exec->>Accum: update(batch)
Accum-->>Exec: finalized value
end
Exec->>Exec: Inject aggregate values\nas synthetic columns
loop For each projection
Exec->>Eval: evaluate_numeric()\nwith synthetic columns
Eval-->>Exec: computed ArrayRef
end
Exec->>Exec: Construct final RecordBatch
Computed Aggregates in Projections
When a SELECT list includes computed expressions containing aggregate functions (e.g., SELECT COUNT(*) * 2, SUM(x) + AVG(y)), the executor uses execute_computed_aggregates:
Diagram: Computed Aggregate Flow
Sources:
This execution path:
- Scans the table once to collect all rows
- Evaluates aggregate functions to produce scalar values
- Injects those scalars into a temporary evaluation context as synthetic columns
- Evaluates the projection expressions referencing those synthetic columns
- Assembles the final result batch
This approach allows arbitrary nesting of aggregates within expressions while maintaining correctness.
Performance Considerations
The aggregation system makes several trade-offs:
| Strategy | Benefit | Cost |
|---|---|---|
| AggregateAccumulator trait abstraction | Pluggable aggregate implementations | Indirect call overhead |
| Full batch materialization for GROUP BY | Simple implementation, works for any key type | High memory usage for large result sets |
| Hash-based DISTINCT tracking | Correct deduplication | Memory proportional to cardinality |
| Expression evaluation per row | Supports complex aggregates | Cannot leverage predicate pushdown |
| FxHashMap for grouping | Fast hashing for typical keys | Collision risk with adversarial inputs |
For aggregates over large datasets, consider:
- Predicate pushdown : Filter rows before aggregation
- Projection pruning : Only scan columns needed by aggregate expressions
- Index-assisted aggregation : Use indexes for MIN/MAX when possible (not currently implemented)
Sources:
Query Execution
Relevant source files
- llkv-executor/Cargo.toml
- llkv-executor/src/lib.rs
- llkv-join/Cargo.toml
- llkv-sql/src/sql_engine.rs
- llkv-table/Cargo.toml
Purpose and Scope
This document describes the query execution layer that transforms query plans into result data. The executor sits between the query planner (Query Planning) and the storage layer (Storage Layer), dispatching work to specialized components based on plan characteristics. This page provides a high-level overview of execution architecture and strategy selection. For detailed information about specific execution modes, see TablePlanner and TableExecutor, Scan Execution and Optimization, and Filter Evaluation.
Architecture Overview
The query execution layer is implemented primarily in the llkv-executor crate, with the QueryExecutor struct serving as the main orchestrator. The executor receives SelectPlan structures from the planner and produces SelectExecution results containing Arrow RecordBatch data.
Execution Strategy Dispatch Flow
graph TB
subgraph "Planning Layer"
PLAN["SelectPlan\n(from llkv-plan)"]
end
subgraph "Execution Layer - llkv-executor"
EXECUTOR["QueryExecutor<P>"]
DISPATCH{Execution\nStrategy\nDispatch}
COMPOUND["Compound SELECT\nUNION/EXCEPT/INTERSECT"]
NOTABLE["No-Table Execution\nSELECT constant"]
GROUPBY["Group By Execution\nAggregation + Grouping"]
CROSS["Cross Product\nMultiple Tables"]
AGGREGATE["Aggregate Execution\nSUM/AVG/COUNT"]
PROJECTION["Projection Execution\nColumn Selection"]
end
subgraph "Storage Layer"
PROVIDER["ExecutorTableProvider<P>"]
TABLE["ExecutorTable<P>"]
SCAN["Scan Operations\n(llkv-scan)"]
end
subgraph "Result"
EXECUTION["SelectExecution<P>"]
BATCH["RecordBatch[]\nArrow Data"]
end
PLAN --> EXECUTOR
EXECUTOR --> DISPATCH
DISPATCH -->|compound query| COMPOUND
DISPATCH -->|no FROM clause| NOTABLE
DISPATCH -->|GROUP BY present| GROUPBY
DISPATCH -->|multiple tables| CROSS
DISPATCH -->|aggregates only| AGGREGATE
DISPATCH -->|default path| PROJECTION
EXECUTOR --> PROVIDER
PROVIDER --> TABLE
TABLE --> SCAN
COMPOUND --> EXECUTION
NOTABLE --> EXECUTION
GROUPBY --> EXECUTION
CROSS --> EXECUTION
AGGREGATE --> EXECUTION
PROJECTION --> EXECUTION
EXECUTION --> BATCH
The executor examines plan characteristics to select an appropriate execution strategy. Each strategy is optimized for specific query patterns.
Sources: llkv-executor/src/lib.rs:504-563 llkv-executor/src/lib.rs:584-695
QueryExecutor Structure
The QueryExecutor<P> struct is the primary entry point for executing SELECT queries. It is generic over a Pager type P to support different storage backends.
| Component | Type | Purpose |
|---|---|---|
| provider | Arc<dyn ExecutorTableProvider<P>> | Provides access to tables and their metadata |
The provider abstraction allows the executor to remain decoupled from specific table implementations, enabling testing with mock providers and supporting different storage configurations.
QueryExecutor and Provider Relationship
graph LR
subgraph "Executor Core"
QE["QueryExecutor<P>"]
end
subgraph "Provider Interface"
PROVIDER["ExecutorTableProvider<P>"]
GET_TABLE["get_table(name)\n→ ExecutorTable<P>"]
end
subgraph "Table Interface"
ETABLE["ExecutorTable<P>"]
SCHEMA["schema: ExecutorSchema"]
STORAGE["storage: TableStorageAdapter<P>"]
end
QE --> PROVIDER
PROVIDER --> GET_TABLE
GET_TABLE --> ETABLE
ETABLE --> SCHEMA
ETABLE --> STORAGE
The provider pattern enables dependency injection, allowing the executor to work with different table sources without tight coupling to storage implementations.
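A minimal illustration of that pattern follows. The trait, method, and struct names are deliberately renamed stand-ins (the real ExecutorTableProvider is generic over the pager type and returns richer table handles), but the get_table shape mirrors the diagram above.

```rust
// Illustrative provider abstraction: the executor depends on a trait,
// not a concrete storage backend, so tests can inject a mock provider.
trait TableProvider {
    type Table;
    fn get_table(&self, name: &str) -> Option<Self::Table>;
}

struct QueryExecutorSketch<P: TableProvider> {
    provider: P,
}

impl<P: TableProvider> QueryExecutorSketch<P> {
    fn table_exists(&self, name: &str) -> bool {
        self.provider.get_table(name).is_some()
    }
}

struct MockProvider;
impl TableProvider for MockProvider {
    type Table = ();
    fn get_table(&self, name: &str) -> Option<()> {
        (name == "users").then_some(())
    }
}

fn main() {
    let exec = QueryExecutorSketch { provider: MockProvider };
    assert!(exec.table_exists("users"));
    assert!(!exec.table_exists("missing"));
}
```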
Sources: llkv-executor/src/lib.rs:504-517 llkv-executor/src/types.rs:1-100
Execution Entry Points
The executor provides two primary entry points for executing SELECT plans:
execute_select
Executes a SELECT plan without additional filtering constraints. This is the standard path used when no external row filtering is required.
execute_select_with_filter
Executes a SELECT plan with an optional row ID filter. The RowIdFilter trait allows callers to specify a predicate that determines which row IDs should be considered during execution. This is used for implementing transaction isolation (MVCC filtering) and other row-level visibility constraints.
Both methods extract limit and offset from the plan and apply them to the final SelectExecution result, ensuring consistent pagination behavior across all execution strategies.
Sources: llkv-executor/src/lib.rs:519-563
Execution Strategy Dispatch
The executor examines the characteristics of a SelectPlan to determine the most efficient execution strategy. The dispatch logic follows a priority hierarchy:
Execution Strategy Decision Tree
graph TD
START["execute_select_with_filter(plan)"]
CHECK_COMPOUND{plan.compound\nis Some?}
CHECK_TABLES{plan.tables\nis empty?}
CHECK_GROUPBY{plan.group_by\nnot empty?}
CHECK_MULTI{plan.tables.len()\n> 1?}
CHECK_AGG{plan.aggregates\nnot empty?}
CHECK_COMPUTED{has computed\naggregates?}
COMPOUND["execute_compound_select\nUNION/EXCEPT/INTERSECT"]
NOTABLE["execute_select_without_table\nConstant evaluation"]
GROUPBY_SINGLE["execute_group_by_single_table\nGROUP BY aggregation"]
GROUPBY_CROSS["execute_cross_product\nMulti-table GROUP BY"]
CROSS["execute_cross_product\nCartesian product"]
AGGREGATE["execute_aggregates\nSingle-table aggregates"]
COMPUTED_AGG["execute_computed_aggregates\nAggregates in expressions"]
PROJECTION["execute_projection\nSimple column selection"]
RESULT["SelectExecution<P>\nwith limit/offset"]
START --> CHECK_COMPOUND
CHECK_COMPOUND -->|Yes| COMPOUND
CHECK_COMPOUND -->|No| CHECK_TABLES
CHECK_TABLES -->|Yes| NOTABLE
CHECK_TABLES -->|No| CHECK_GROUPBY
CHECK_GROUPBY -->|Yes| CHECK_MULTI
CHECK_MULTI -->|Yes| GROUPBY_CROSS
CHECK_MULTI -->|No| GROUPBY_SINGLE
CHECK_GROUPBY -->|No| CHECK_MULTI
CHECK_MULTI -->|Yes| CROSS
CHECK_MULTI -->|No| CHECK_AGG
CHECK_AGG -->|Yes| AGGREGATE
CHECK_AGG -->|No| CHECK_COMPUTED
CHECK_COMPUTED -->|Yes| COMPUTED_AGG
CHECK_COMPUTED -->|No| PROJECTION
COMPOUND --> RESULT
NOTABLE --> RESULT
GROUPBY_SINGLE --> RESULT
GROUPBY_CROSS --> RESULT
CROSS --> RESULT
AGGREGATE --> RESULT
COMPUTED_AGG --> RESULT
PROJECTION --> RESULT
The executor prioritizes specialized execution paths over generic ones, enabling optimizations tailored to specific query patterns.
Strategy Descriptions
| Strategy | Plan Characteristics | Implementation |
|---|---|---|
| Compound SELECT | plan.compound.is_some() | Executes UNION, EXCEPT, or INTERSECT operations by evaluating component queries and combining results with deduplication for DISTINCT quantifiers |
| No-Table Execution | plan.tables.is_empty() | Evaluates constant expressions like SELECT 1, 2, 3 without accessing storage |
| Group By (Single Table) | !plan.group_by.is_empty() && plan.tables.len() == 1 | Performs grouped aggregation on a single table with efficient column scanning |
| Group By (Cross Product) | !plan.group_by.is_empty() && plan.tables.len() > 1 | Computes Cartesian product before grouping |
| Cross Product | plan.tables.len() > 1 | Joins multiple tables using nested loop or hash join |
| Aggregate Execution | !plan.aggregates.is_empty() | Computes aggregates (COUNT, SUM, AVG, etc.) over a single table |
| Computed Aggregates | Aggregates within computed expressions | Evaluates expressions containing aggregate functions |
| Projection Execution | Default path | Performs column selection with optional filtering |
Sources: llkv-executor/src/lib.rs:523-563
Result Representation
The executor returns results as SelectExecution<P> instances, which encapsulate one or more Arrow RecordBatch objects along with metadata.
SelectExecution Result Types
graph TB
subgraph "SelectExecution<P>"
EXEC["SelectExecution"]
SCHEMA["schema: Arc<Schema>"]
DISPLAY["display_name: String"]
MODE{Execution\nMode}
end
subgraph "Single Batch Mode"
SINGLE["Single RecordBatch"]
BATCH1["RecordBatch\nMaterialized data"]
end
subgraph "Multi Batch Mode"
MULTI["Vec<RecordBatch>"]
BATCH2["RecordBatch[]\nMultiple batches"]
end
subgraph "Streaming Mode"
STREAM["Scan Stream"]
LAZY["Lazy evaluation"]
ITER["Iterator-based"]
end
subgraph "Post-Processing"
LIMIT["limit: Option<usize>"]
OFFSET["offset: Option<usize>"]
APPLY["Applied during\nmaterialization"]
end
EXEC --> SCHEMA
EXEC --> DISPLAY
EXEC --> MODE
MODE -->|Materialized| SINGLE
MODE -->|Compound/Sorted| MULTI
MODE -->|Large tables| STREAM
SINGLE --> BATCH1
MULTI --> BATCH2
STREAM --> LAZY
STREAM --> ITER
EXEC --> LIMIT
EXEC --> OFFSET
LIMIT --> APPLY
OFFSET --> APPLY
The SelectExecution type supports multiple result modes optimized for different query patterns. The with_limit and with_offset methods attach pagination parameters that are applied when materializing results.
Result Materialization
Callers can materialize results in several ways:
- into_rows() : Converts all batches into a Vec<CanonicalRow> representation, applying limit and offset
- stream(callback) : Invokes a callback for each batch, enabling memory-efficient processing of large result sets
- into_record_batch() : Consolidates results into a single RecordBatch, useful for small result sets
- into_batches() : Returns all batches as a vector
The streaming API is particularly important for large queries where materializing all results at once would exceed memory limits.
Sources: llkv-executor/src/lib.rs:519-563 llkv-executor/src/scan.rs:1-100
Execution Phases
Most execution strategies follow a two-phase pattern optimized for columnar storage:
Phase 1: Row ID Collection
The executor first identifies which rows satisfy the query’s filter predicates without fetching the full column data. This phase produces a bitmap or set of row IDs that match the criteria.
Row ID Collection Phase
sequenceDiagram
participant EX as QueryExecutor
participant TBL as ExecutorTable
participant SCAN as Scan Operations
participant STORE as Column Store
EX->>TBL: filter_row_ids(predicate)
TBL->>SCAN: evaluate_filter
SCAN->>STORE: Load chunk metadata
Note over SCAN,STORE: Chunk pruning using\nmin/max values
SCAN->>STORE: Load matching chunks
SCAN->>SCAN: Vectorized predicate\nevaluation (SIMD)
SCAN-->>TBL: Bitmap of matching row IDs
TBL-->>EX: Row ID set
Predicate evaluation uses chunk metadata to skip irrelevant data (Scan Execution and Optimization) and vectorized kernels for efficient matching (Filter Evaluation).
sequenceDiagram
participant EX as QueryExecutor
participant TBL as ExecutorTable
participant STORE as Column Store
participant PAGER as Storage Pager
EX->>TBL: scan_stream(projections, row_ids)
loop For each projection
TBL->>STORE: gather_rows(field_id, row_ids)
STORE->>STORE: Identify chunks containing\nrequested row IDs
STORE->>PAGER: batch_get(chunk_keys)
PAGER-->>STORE: Chunk data
STORE->>STORE: Construct Arrow array
STORE-->>TBL: ArrayRef
end
TBL->>TBL: Construct RecordBatch\nfrom arrays
TBL-->>EX: RecordBatch
Phase 2: Data Gathering
Once the matching row IDs are known, the executor fetches only the required columns for those specific rows. This minimizes I/O by avoiding unnecessary column reads.
Data Gathering Phase
The gather operation reconstructs Arrow arrays from chunked columnar storage, fetching only the columns referenced in the query’s projections.
Phase 3: Post-Processing
After data gathering, the executor applies sorting, aggregation, or other transformations as required by the plan:
| Operation | When Applied | Implementation |
|---|---|---|
| Sorting | ORDER BY clause present | Uses Arrow’s lexsort_to_indices with custom NULLS FIRST/LAST handling |
| Limiting | LIMIT clause present | Truncates result set to specified row count |
| Offset | OFFSET clause present | Skips specified number of rows before returning results |
| Aggregation | GROUP BY or aggregate functions | Materializes groups and computes aggregate values |
| Distinct | SELECT DISTINCT | Hash-based deduplication using row encoding |
Sources: llkv-executor/src/lib.rs:584-1000 llkv-executor/src/scan.rs:1-500
Subquery Execution
The executor handles subqueries through recursive evaluation, supporting both scalar subqueries and EXISTS predicates.
graph TD
EXPR["Evaluate Expression\ncontaining subquery"]
COLLECT["Collect correlated\ncolumn bindings"]
ENCODE["Encode bindings\nas cache key"]
CHECK{Cache\nhit?}
CACHED["Return cached\nLiteral"]
EXECUTE["Execute subquery\nwith bindings"]
VALIDATE["Validate result:\n1 column, ≤1 row"]
STORE["Store in cache"]
RETURN["Return Literal"]
EXPR --> COLLECT
COLLECT --> ENCODE
ENCODE --> CHECK
CHECK -->|Yes| CACHED
CHECK -->|No| EXECUTE
EXECUTE --> VALIDATE
VALIDATE --> STORE
STORE --> RETURN
CACHED --> RETURN
Scalar Subquery Evaluation
Scalar subqueries are evaluated lazily during expression computation. The executor maintains a cache (scalar_subquery_cache) to avoid re-executing identical subqueries with the same correlated bindings:
Scalar Subquery Evaluation with Caching
The caching mechanism is essential for performance when a subquery is evaluated multiple times in a cross product or aggregate context.
Parallel Subquery Execution
For queries that require evaluating the same correlated subquery across many rows, the executor batches the work and executes it in parallel using Rayon:
let job_results: Vec<ExecutorResult<Literal>> = with_thread_pool(|| {
pending_bindings
.par_iter()
.map(|bindings| self.evaluate_scalar_subquery_with_bindings(subquery, bindings))
.collect()
});
This parallelization significantly reduces execution time for subquery-heavy queries.
Sources: llkv-executor/src/lib.rs:787-961
graph LR
subgraph "SqlEngine"
PARSE["Parse SQL\n(sqlparser)"]
PLAN["Build Plan\n(llkv-plan)"]
EXEC["Execute Plan"]
end
subgraph "RuntimeEngine"
CONTEXT["RuntimeContext"]
SESSION["RuntimeSession"]
CATALOG["CatalogManager"]
end
subgraph "Executor Layer"
QEXEC["QueryExecutor"]
PROVIDER["TableProvider"]
end
PARSE --> PLAN
PLAN --> EXEC
EXEC --> CONTEXT
CONTEXT --> SESSION
CONTEXT --> CATALOG
EXEC --> QEXEC
QEXEC --> PROVIDER
PROVIDER --> CATALOG
Integration with Runtime
The SqlEngine in llkv-sql orchestrates the entire execution pipeline, bridging SQL text to query results:
SqlEngine and Executor Integration
The RuntimeEngine provides the execution context, including transaction state, catalog access, and session configuration, while the QueryExecutor focuses solely on transforming plans into results.
Sources: llkv-sql/src/sql_engine.rs:572-745 llkv-executor/src/lib.rs:504-563
TablePlanner and TableExecutor
Relevant source files
- llkv-csv/src/writer.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/lib.rs
- llkv-table/src/table.rs
Purpose and Scope
This page documents the table-level query execution interface provided by the Table struct and its optimization strategies. The table layer acts as a bridge between high-level query plans (covered in Query Planning) and low-level columnar storage operations (covered in Column Storage and ColumnStore). It orchestrates scan execution, manages field ID namespacing, handles MVCC visibility, and selects execution strategies based on query characteristics.
For details about the lower-level scan execution mechanics and streaming strategies, see Scan Execution and Optimization. For filter expression evaluation, see Filter Evaluation.
Architecture Overview
The Table struct serves as the primary interface for executing queries against table data. Rather than implementing query execution logic directly, it orchestrates lower-level components and applies table-specific concerns like schema management, MVCC column injection, and field ID translation.
graph TB
subgraph "Table Layer"
API["Table API\nscan_stream, filter_row_ids"]
SCHEMA["Schema Management\nField ID → LogicalFieldId"]
MVCC["MVCC Integration\ncreated_by, deleted_by"]
OPTIONS["Execution Options\nScanStreamOptions"]
end
subgraph "Scan Execution Layer"
EXECUTE["execute_scan()\nllkv-scan"]
STREAM["RowStreamBuilder\nBatch streaming"]
FILTER["Filter Evaluation\nPredicate matching"]
end
subgraph "Storage Layer"
STORE["ColumnStore\nPhysical storage"]
GATHER["Column Gathering\nRecordBatch assembly"]
end
API --> SCHEMA
API --> MVCC
API --> OPTIONS
SCHEMA --> EXECUTE
MVCC --> EXECUTE
OPTIONS --> EXECUTE
EXECUTE --> STREAM
STREAM --> FILTER
FILTER --> GATHER
GATHER --> STORE
STORE -.results.-> GATHER
GATHER -.batches.-> STREAM
STREAM -.batches.-> API
Table Layer Responsibilities
Sources: llkv-table/src/table.rs:60-69 llkv-table/src/table.rs:447-488
Table Struct and Core Methods
The Table<P> struct wraps a ColumnStore with table-specific metadata and caching:
| Component | Type | Purpose |
|---|---|---|
| store | Arc<ColumnStore<P>> | Physical columnar storage |
| table_id | TableId | Unique table identifier |
| mvcc_cache | RwLock<Option<MvccColumnCache>> | Cached MVCC column presence |
Primary execution methods:
- scan_stream: Main scan interface accepting flexible projection types
- scan_stream_with_exprs: Direct execution with resolved ScanProjection expressions
- filter_row_ids: Collect row IDs matching a filter predicate
- stream_columns: Stream raw column data without expression evaluation
Sources: llkv-table/src/table.rs:60-69 llkv-table/src/table.rs:447-462 llkv-table/src/table.rs:469-488 llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:589-645
Scan Execution Flow
High-Level Execution Pipeline
Sources: llkv-table/src/table.rs:447-488 llkv-table/examples/performance_benchmark.rs:1-211
Scan Configuration: ScanStreamOptions
ScanStreamOptions controls execution behavior:
| Field | Type | Purpose |
|---|---|---|
| include_nulls | bool | Include rows where all projected columns are null |
| order | Option<ScanOrderSpec> | Ordering specification (ASC/DESC) |
| row_id_filter | Option<Arc<dyn RowIdFilter>> | Pre-filtered row ID set (e.g., MVCC visibility) |
| include_row_ids | bool | Include row_id column in output |
| ranges | Option<Vec<(Bound<u64>, Bound<u64>)>> | Row ID range restrictions |
| driving_column | Option<LogicalFieldId> | Column to drive scan ordering |
Sources: llkv-table/src/table.rs:43-46
Projection Types: ScanProjection
Projections specify what data to retrieve:
- Column(ColumnProjectionInfo): Direct column access with optional alias
  - logical_field_id: Column to read
  - data_type: Expected Arrow data type
  - output_name: Column name in result
- Expression(ProjectionEval): Computed expressions over columns
  - Arithmetic operations, functions, literals
  - Evaluated during result assembly
Sources: llkv-table/src/table.rs:40-46
Optimization Strategies
Strategy Selection Logic
Sources: llkv-table/examples/performance_benchmark.rs:26-79 llkv-table/examples/test_streaming.rs:26-174
Fast Path: Direct Streaming
When conditions are met for the fast path, scan execution bypasses row ID collection and materializes chunks directly from ScanBuilder:
Eligibility criteria:
- Exactly one column projection
- Unbounded range filter (Bound::Unbounded on both ends)
- No null inclusion (include_nulls = false)
- No ordering requirements (order = None)
- No row ID filter (row_id_filter = None)
Performance characteristics:
- 10-100x faster than standard path for large scans
- Zero-copy chunk streaming when possible
- Minimal memory overhead (streaming chunks, not full materialization)
Example usage:
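The original example is not reproduced here; the hedged sketch below is reconstructed only from the scan_stream signature listed under Scan Entry Points (projection list, filter expression, ScanStreamOptions, batch callback). The placeholder constructors are not the real llkv-table API.

```rust
// Hedged sketch only: construction of the projection and filter values is
// abbreviated, and the exact llkv-table API may differ.
//
// let projections = vec![/* single column projection for the scanned field */];
// let filter_expr = /* unbounded predicate over the same field */;
// let options = ScanStreamOptions::default(); // default options assumed
//
// let mut total_rows = 0usize;
// table.scan_stream(projections, &filter_expr, options, |batch| {
//     // Each Arrow RecordBatch streams through this callback without
//     // materializing the full result set.
//     total_rows += batch.num_rows();
// })?;
```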
Sources: llkv-table/examples/test_streaming.rs:65-110 llkv-table/examples/performance_benchmark.rs:145-151
Standard Path: Row ID Collection and Gathering
When fast path conditions are not met, execution uses a two-phase approach:
Phase 1: Row ID Collection
- Evaluate filter expression against the table
- Build a bitmap (Treemap) of matching row IDs
- Filter by MVCC visibility if row_id_filter is provided
- Apply range restrictions if ranges is specified
Phase 2: Column Gathering
- Split row IDs into fixed-size windows
- For each window:
  - Gather all projected columns
  - Evaluate computed expressions
  - Assemble into RecordBatch
  - Stream batch to callback
Batch size: Controlled by STREAM_BATCH_ROWS constant (default varies by workload).
Sources: llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:589-645
Performance Comparison: Layer Overhead
ColumnStore Direct Access vs Table Layer
Measured overhead (1M row single-column scan):
- Direct ColumnStore : ~5-10ms baseline
- Table Layer (fast path) : ~10-20ms (1.5-2x overhead)
- Table Layer (standard path) : ~50-100ms (5-10x overhead)
The fast path optimization significantly reduces the gap by bypassing row ID collection and expression evaluation when not needed.
Sources: llkv-table/examples/direct_comparison.rs:276-399
Row ID Filtering and Collection
RowIdScanCollector
Internal visitor for collecting row IDs that match filter predicates:
Implementation strategy:
- Implements PrimitiveWithRowIdsVisitor and PrimitiveSortedWithRowIdsVisitor
- Accumulates row IDs into a Treemap (compressed bitmap)
- Ignores actual column values (focus on row IDs only)
- Handles both chunked and sorted run formats
Key methods:
- extend_from_array: Add row IDs from a chunk
- extend_from_slice: Add a slice of row IDs
- into_inner: Extract the final bitmap
Sources: llkv-table/src/table.rs:805-858
RowIdChunkEmitter
Streaming alternative that emits row IDs in fixed-size chunks without materializing the full set:
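A self-contained sketch of the chunked emission pattern (a fixed-size buffer flushed to a callback) follows; the real RowIdChunkEmitter works through the column-map visitor traits and also handles sorted runs and error propagation.

```rust
// Emits row IDs to a callback in fixed-size chunks instead of collecting
// them all; a simplified stand-in for RowIdChunkEmitter.
struct ChunkEmitter<F: FnMut(&[u64])> {
    buffer: Vec<u64>,
    chunk_size: usize,
    on_chunk: F,
}

impl<F: FnMut(&[u64])> ChunkEmitter<F> {
    fn new(chunk_size: usize, on_chunk: F) -> Self {
        Self { buffer: Vec::with_capacity(chunk_size), chunk_size, on_chunk }
    }
    fn extend_from_slice(&mut self, row_ids: &[u64]) {
        for &rid in row_ids {
            self.buffer.push(rid);
            if self.buffer.len() == self.chunk_size {
                self.flush();
            }
        }
    }
    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            (self.on_chunk)(&self.buffer);
            self.buffer.clear();
        }
    }
    fn finish(mut self) {
        self.flush(); // emit any trailing partial chunk
    }
}

fn main() {
    let mut chunks = Vec::new();
    let mut emitter = ChunkEmitter::new(3, |chunk: &[u64]| chunks.push(chunk.to_vec()));
    emitter.extend_from_slice(&[1, 2, 3, 4, 5]);
    emitter.finish();
    assert_eq!(chunks, vec![vec![1, 2, 3], vec![4, 5]]);
}
```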
Use case: Memory-efficient processing when full row ID set is not needed.
Features:
- Configurable chunk size
- Optional reverse ordering for sorted runs
- Error propagation via the error field
- Zero allocation when chunks align with storage chunks
Methods:
- extend_from_array: Process a chunk of row IDs
- extend_sorted_run: Handle sorted run with optional reversal
- flush: Emit current buffer
- finish: Final flush and error check
Sources: llkv-table/src/table.rs:860-1017
Integration with MVCC and Transactions
The table layer handles MVCC visibility transparently:
MVCC Column Injection
When appending data, the Table::append method automatically injects MVCC columns if not present:
Columns added:
- created_by (UInt64): Transaction ID that created the row (defaults to 1 for auto-commit)
- deleted_by (UInt64): Transaction ID that deleted the row (defaults to 0 for not deleted)
Field ID assignment:
- MVCC columns get reserved logical field IDs via LogicalFieldId::for_mvcc_created_by and LogicalFieldId::for_mvcc_deleted_by
- Separate from user field IDs to avoid collisions
Caching: The MvccColumnCache stores whether MVCC columns exist to avoid repeated schema inspections.
Sources: llkv-table/src/table.rs:50-56 llkv-table/src/table.rs:231-438
Transaction Visibility Filtering
Query execution can filter by transaction visibility using the row_id_filter option:
The filter is applied during row ID collection phase, ensuring only visible rows are included in results.
Sources: llkv-table/src/table.rs:43-46
graph LR
USER["User Field ID\n(FieldId: u32)"]
META["Field Metadata\n'field_id' key"]
TABLE["Table ID\n(TableId: u32)"]
LOGICAL["LogicalFieldId\n(u64)"]
USER --> COMPOSE
TABLE --> COMPOSE
COMPOSE["LogicalFieldId::for_user()"] --> LOGICAL
META -.annotates.-> USER
Field ID Translation and Namespacing
The table layer translates user-visible field IDs to logical field IDs that include table ID:
Translation Process
Why namespacing?
- Multiple tables can have the same user field IDs
- ColumnStore operates on LogicalFieldId to avoid collisions
- The table layer encapsulates this translation
Key functions:
- LogicalFieldId::for_user(table_id, field_id): Create user data field
- LogicalFieldId::for_mvcc_created_by(table_id): Create MVCC created_by field
- LogicalFieldId::for_mvcc_deleted_by(table_id): Create MVCC deleted_by field
Sources: llkv-table/src/table.rs:231-438 llkv-table/src/table.rs:589-645
Schema Management
Schema Construction
The Table::schema() method builds an Arrow schema from catalog metadata:
Process:
- Query ColumnStore for all logical fields belonging to this table
- Sort fields by field ID for consistent ordering
- Retrieve column names from SysCatalog
- Build Arrow Field with data type and metadata
- Add the row_id field first, then user fields
Schema metadata: Each field includes field_id metadata for round-trip compatibility.
Sources: llkv-table/src/table.rs:519-549
Schema as RecordBatch
For display purposes, schema_recordbatch() formats the schema as a table:
| Column | Type | Contents |
|---|---|---|
| name | Utf8 | Column name |
| field_id | UInt32 | User field ID |
| data_type | Utf8 | Arrow data type string |
Sources: llkv-table/src/table.rs:554-586
Example: CSV Export Integration
The CSV export writer demonstrates table-level query execution in practice:
Export Pipeline
Key steps:
- Convert CsvExportColumn to ScanProjection with field ID translation
- Resolve aliases from catalog or use defaults
- Create unbounded filter for full table scan
- Stream batches directly to Arrow CSV writer
- No intermediate materialization
Sources: llkv-csv/src/writer.rs:139-268
Summary: Optimization Decision Matrix
| Query Characteristic | Fast Path | Standard Path |
|---|---|---|
| Single column projection | ✓ | ✓ |
| Multiple columns | ✗ | ✓ |
| Unbounded filter | ✓ | ✓ |
| Bounded/complex filter | ✗ | ✓ |
| Include nulls | ✗ | ✓ |
| Exclude nulls | ✓ | ✓ |
| No ordering | ✓ | ✓ |
| Ordered results | ✗ | ✓ |
| No MVCC filter | ✓ | ✓ |
| MVCC filtering | ✗ | ✓ |
| No range restrictions | ✓ | ✓ |
| Range restrictions | ✗ | ✓ |
Performance impact:
- Fast path: 1.5-2x slower than direct ColumnStore access
- Standard path: 5-10x slower than direct ColumnStore access
- Still significantly faster than alternatives due to columnar storage and zero-copy optimizations
Sources: llkv-table/examples/performance_benchmark.rs:1-211 llkv-table/examples/direct_comparison.rs:276-399
Scan Execution and Optimization
Relevant source files
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/projection.rs
- llkv-csv/src/writer.rs
- llkv-sql/tests/pager_io_tests.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/table.rs
Purpose and Scope
This document details the table scan execution flow in LLKV, covering how queries retrieve data from columnar storage through optimized fast paths and streaming strategies. The scan subsystem bridges query planning and physical storage, coordinating row ID collection, data gathering, and result materialization while minimizing memory allocations.
For query planning that produces scan operations, see TablePlanner and TableExecutor. For predicate evaluation during filtering, see Filter Evaluation. For the underlying column storage mechanisms, see Column Storage and ColumnStore.
Scan Entry Points
The Table struct provides two primary scan entry points that execute projections with filtering:
| Method | Signature | Purpose |
|---|---|---|
| scan_stream | &self, projections: I, filter_expr: &Expr, options: ScanStreamOptions, on_batch: F | Accepts flexible projection inputs, converts to ScanProjection |
| scan_stream_with_exprs | &self, projections: &[ScanProjection], filter_expr: &Expr, options: ScanStreamOptions, on_batch: F | Direct scan with pre-built ScanProjection values |
Both methods stream results as Arrow RecordBatches to a callback function, avoiding full materialization in memory. The ScanStreamOptions struct controls scan behavior:
The on_batch callback receives each batch as it is produced, allowing streaming processing without memory accumulation.
Sources: llkv-table/src/table.rs:447-488 llkv-scan/src/lib.rs
Two-Phase Scan Execution Model
Two-Phase Architecture
LLKV scans execute in two distinct phases to minimize data movement:
Phase 1: Row ID Collection - Evaluates filter predicates against stored data to produce a set of matching row IDs. This phase:
- Visits column chunks using the visitor pattern
- Applies predicates via metadata-based pruning (min/max values)
- Collects matching row IDs into a
Treemap(bitmap) orVec<RowId> - Avoids loading actual column values during filtering
Phase 2: Data Gathering - Retrieves column values for the matching row IDs identified in Phase 1. This phase:
- Loads only chunks containing matched rows
- Assembles Arrow arrays via type-specific gathering functions
- Streams results in fixed-size batches (default 4096 rows)
- Supports two execution paths: fast streaming or full materialization
This separation ensures predicates are evaluated without loading unnecessary column data, significantly reducing I/O and memory when filters are selective.
Sources: llkv-scan/src/execute.rs llkv-table/src/table.rs:479-488 llkv-scan/src/row_stream.rs
Fast Path Optimizations
The scan executor detects specific patterns that enable zero-copy streaming optimizations:
Fast Path Criteria
graph TD
ScanRequest["Scan Request"]
CheckCols{"Single\nProjection?"}
CheckFilter{"Unbounded\nFilter?"}
CheckNulls{"include_nulls\n= false?"}
CheckOrder{"No ORDER BY?"}
CheckRowFilter{"No row_id_filter?"}
FastPath["✓ Fast Path\nDirect ScanBuilder streaming"]
SlowPath["✗ Materialization Path\nMultiGatherContext gathering"]
ScanRequest --> CheckCols
CheckCols -->|Yes| CheckFilter
CheckCols -->|No| SlowPath
CheckFilter -->|Yes| CheckNulls
CheckFilter -->|No| SlowPath
CheckNulls -->|Yes| CheckOrder
CheckNulls -->|No| SlowPath
CheckOrder -->|Yes| CheckRowFilter
CheckOrder -->|No| SlowPath
CheckRowFilter -->|Yes| FastPath
CheckRowFilter -->|No| SlowPath
The fast path activates when all conditions are met:
| Condition | Why Required |
|---|---|
| Single projection | Multi-column requires coordination across field plans |
| Unbounded filter | Complex filters require row-by-row evaluation |
| include_nulls = false | Null handling requires inspection of each row |
| No ORDER BY | Sorting requires full materialization |
| No row_id_filter | Custom filters require additional filtering logic |
Performance Characteristics
When the fast path activates, scans execute directly via ScanBuilder against the ColumnStore, achieving throughputs exceeding 100M rows/second for simple numeric columns. The materialization path adds 2-5x overhead due to:
- Context preparation and cache management
- Chunk coordinate computation across multiple fields
- Arrow array assembly and concatenation
Test examples demonstrate the performance difference:
- Single-column unbounded scan: ~1-2ms for 1M rows
- Multi-column scan: ~5-10ms for 1M rows
- Complex filters: ~10-20ms for 1M rows
Sources: llkv-scan/src/execute.rs llkv-table/examples/test_streaming.rs:83-110 llkv-table/examples/performance_benchmark.rs:126-151
Row ID Collection Infrastructure
RowIdScanCollector
The RowIdScanCollector accumulates matching row IDs into a Treemap bitmap during Phase 1 filtering:
It implements multiple visitor traits to handle different scan modes:
- PrimitiveVisitor - Ignores value chunks (row IDs not yet available)
- PrimitiveSortedVisitor - Ignores sorted runs
- PrimitiveWithRowIdsVisitor - Collects row IDs from chunks with row ID arrays
- PrimitiveSortedWithRowIdsVisitor - Collects row IDs from sorted runs
The collector extracts row IDs from UInt64Array instances passed alongside value arrays, adding them to the bitmap for efficient deduplication and range queries.
Sources: llkv-table/src/table.rs:805-858
RowIdChunkEmitter
For streaming scenarios, RowIdChunkEmitter emits row IDs in fixed-size chunks to a callback rather than accumulating them:
This approach:
- Minimizes memory usage by processing row IDs in batches
- Enables early termination on errors
- Supports reverse iteration for descending scans
- Optimizes for cases where row IDs arrive in contiguous chunks
The emitter implements the same visitor traits as the collector but invokes the callback when the buffer reaches chunk_size, then clears the buffer for the next batch.
Sources: llkv-table/src/table.rs:860-1017
Visitor Pattern Integration
Both collectors integrate with the column-map scan infrastructure via visitor traits. During a scan:
- ScanBuilder iterates over column chunks
- For each chunk, it detects whether row IDs are available
- It invokes the appropriate visitor method based on chunk characteristics:
  - _chunk_with_rids for unsorted chunks with row IDs
  - _run_with_rids for sorted runs with row IDs
This pattern decouples row ID collection from the storage format, allowing specialized handling for sorted vs. unsorted data.
Sources: llkv-column-map/src/scan/mod.rs llkv-table/src/table.rs:832-858
classDiagram
class MultiGatherContext {+field_infos: FieldInfos\n+plans: FieldPlans\n+chunk_cache: FxHashMap~PhysicalKey, ArrayRef~\n+row_index: FxHashMap~u64, usize~\n+row_scratch: Vec~Option~\n+builders: Vec~ColumnOutputBuilder~\n+epoch: u64\n+new() MultiGatherContext\n+update_field_infos_and_plans()\n+matches_field_ids() bool\n+schema_for_nullability() Schema\n+chunk_span_for_row() Option}
class FieldPlan {+dtype: DataType\n+value_metas: Vec~ChunkMetadata~\n+row_metas: Vec~ChunkMetadata~\n+candidate_indices: Vec~usize~\n+avg_value_bytes_per_row: f64}
class ColumnOutputBuilder {<<enum>>\nUtf8\nBinary\nBoolean\nDecimal128\nPrimitive\nPassthrough}
class GatherContextPool {+inner: Mutex~FxHashMap~\n+max_per_key: usize\n+acquire() GatherContextGuard}
MultiGatherContext --> FieldPlan: contains
MultiGatherContext --> ColumnOutputBuilder: contains
GatherContextPool --> MultiGatherContext: pools
Data Gathering Infrastructure
MultiGatherContext
The MultiGatherContext coordinates multi-column data gathering during Phase 2:
Context Fields
| Field | Purpose |
|---|---|
field_infos | Maps LogicalFieldId to DataType for each projected column |
plans | Contains FieldPlan with chunk metadata for each column |
chunk_cache | Caches decoded Arrow arrays to avoid redundant deserialization |
row_index | Maps row IDs to output indices during gathering |
row_scratch | Scratch buffer mapping output positions to chunk coordinates |
builders | Type-specific Arrow array builders for each output column |
epoch | Tracks storage version for cache invalidation |
Gathering Flow
- Context preparation: Load column descriptors and chunk metadata for all fields
- Candidate selection: Identify chunks containing requested row IDs based on min/max values
- Chunk loading: Batch-fetch identified chunks from storage
- Row mapping: Build index from row ID to output position
- Value extraction: For each output position, locate value in cached chunk and append to builder
- Array finalization: Convert builders to Arrow arrays and assemble a RecordBatch
Sources: llkv-column-map/src/store/projection.rs:460-649
GatherContextPool
The GatherContextPool reuses MultiGatherContext instances across scans to amortize setup costs:
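A sketch of the pooling pattern under simplifying assumptions (a placeholder context type, no epoch tracking, and no guard wrapper); the real pool in projection.rs keys on LogicalFieldId lists and returns a GatherContextGuard:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Placeholder for the real MultiGatherContext (chunk cache, builders, ...).
#[derive(Default)]
struct MultiGatherContext;

/// Contexts are keyed by the projected field IDs; at most `max_per_key` are
/// retained per key (the text below cites a default of 4).
struct GatherContextPool {
    inner: Mutex<HashMap<Vec<u64>, Vec<MultiGatherContext>>>,
    max_per_key: usize,
}

impl GatherContextPool {
    fn new(max_per_key: usize) -> Self {
        Self { inner: Mutex::new(HashMap::new()), max_per_key }
    }

    /// Reuse a pooled context for this projection if one exists.
    fn acquire(&self, field_ids: &[u64]) -> MultiGatherContext {
        let mut pools = self.inner.lock().unwrap();
        pools
            .get_mut(field_ids)
            .and_then(|contexts| contexts.pop())
            .unwrap_or_default()
    }

    /// Return a context after the scan; drop it if the per-key pool is full.
    fn release(&self, field_ids: Vec<u64>, ctx: MultiGatherContext) {
        let mut pools = self.inner.lock().unwrap();
        let slot = pools.entry(field_ids).or_default();
        if slot.len() < self.max_per_key {
            slot.push(ctx); // the real pool also clears the chunk cache here
        }
    }
}

fn main() {
    let pool = GatherContextPool::new(4);
    let ctx = pool.acquire(&[10, 11]);
    pool.release(vec![10, 11], ctx);
}
```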
Pooling Strategy
- Contexts are keyed by the list of projected LogicalFieldId values
- Up to max_per_key contexts are retained per projection set (default: 4)
- Contexts are reset (chunk cache cleared) when returned to the pool
- Epoch tracking invalidates cached chunk arrays when storage changes
When a scan requests a context:
- Pool checks for matching contexts by field ID list
- If found, pops a context from the pool
- If not found or pool empty, creates a new context
- Context is wrapped in GatherContextGuard for automatic return to the pool
This pattern is critical for high-frequency scans where context setup (descriptor loading, metadata parsing) dominates execution time. Pooling reduces per-scan overhead from ~10ms to <1ms for repeated projections.
Sources: llkv-column-map/src/store/projection.rs:651-720
Type-Specific Gathering Functions
LLKV provides specialized gathering functions for each Arrow type to optimize performance:
| Function | Arrow Type | Optimization |
|---|---|---|
gather_rows_from_chunks_string | GenericStringArray<O> | Direct slice reuse for contiguous sorted rows |
gather_rows_from_chunks_binary | GenericBinaryArray<O> | Same as string with binary validation |
gather_rows_from_chunks_bool | BooleanArray | Bitmap-based gathering |
gather_rows_from_chunks_decimal128 | Decimal128Array | Precision-preserving assembly |
gather_rows_from_chunks_struct | StructArray | Recursive gathering per child field |
gather_rows_from_chunks | PrimitiveArray<T> | Fast path for single-chunk contiguous access |
Each function implements a common pattern:
Fast Path Detection
All gathering functions check for a fast path where:
- Only one chunk contains the requested row IDs
- Row IDs are sorted ascending
- Row IDs form a contiguous range in the chunk
When detected, the function returns a slice of the cached chunk array via Arrow’s zero-copy slice() method, avoiding builder allocation and value copying entirely.
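A sketch of the contiguity test that guards this fast path, assuming the chunk exposes its sorted row IDs as a slice; the names are illustrative, and the real code returns an Arrow array slice rather than an (offset, length) pair:

```rust
/// Checks whether `wanted` row IDs form a contiguous, ascending run inside a
/// single chunk whose row IDs are `chunk_row_ids` (assumed sorted ascending).
/// On success, returns the (offset, length) that would be handed to Arrow's
/// zero-copy slice of the cached chunk array.
fn contiguous_slice(chunk_row_ids: &[u64], wanted: &[u64]) -> Option<(usize, usize)> {
    // The requested IDs must themselves be ascending with no gaps.
    let first = *wanted.first()?;
    for (i, &rid) in wanted.iter().enumerate() {
        if rid != first + i as u64 {
            return None;
        }
    }
    // The run must map onto consecutive positions of the chunk.
    let offset = chunk_row_ids.binary_search(&first).ok()?;
    let end = offset + wanted.len();
    if end <= chunk_row_ids.len() && chunk_row_ids[offset..end] == *wanted {
        Some((offset, wanted.len()))
    } else {
        None
    }
}

fn main() {
    let chunk = [100, 101, 102, 103, 104, 105];
    assert_eq!(contiguous_slice(&chunk, &[102, 103, 104]), Some((2, 3)));
    assert_eq!(contiguous_slice(&chunk, &[102, 104]), None); // gap → slow path
}
```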
Sources: llkv-column-map/src/gather.rs:283-403 llkv-column-map/src/gather.rs:405-526 llkv-column-map/src/gather.rs:529-638
flowchart LR
RowIds["Row ID Source\nBitmap or Vec"]
Chunk1["Batch 1\n0..4096"]
Chunk2["Batch 2\n4096..8192"]
ChunkN["Batch N\n(N-1)*4096..M"]
Gather1["gather_rows()"]
Gather2["gather_rows()"]
GatherN["gather_rows()"]
Batch1["RecordBatch 1"]
Batch2["RecordBatch 2"]
BatchN["RecordBatch N"]
Callback["on_batch\ncallback"]
RowIds -->|split into chunks| Chunk1
RowIds --> Chunk2
RowIds --> ChunkN
Chunk1 --> Gather1
Chunk2 --> Gather2
ChunkN --> GatherN
Gather1 --> Batch1
Gather2 --> Batch2
GatherN --> BatchN
Batch1 --> Callback
Batch2 --> Callback
BatchN --> Callback
Streaming Strategies
Batch-Based Streaming
LLKV streams scan results in fixed-size batches to balance memory usage and callback overhead:
Batch Size Configuration
The default batch size is 4096 rows, defined in llkv-table/src/constants.rs:
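Assuming the constant carries the STREAM_BATCH_ROWS name referenced below, the definition amounts to a single line along these lines:

```rust
/// Default number of rows per streamed RecordBatch (value taken from the text
/// above; the exact constant name and location are as referenced below).
pub const STREAM_BATCH_ROWS: usize = 4096;
```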
This value balances:
- Memory footprint : Smaller batches reduce peak memory but increase callback overhead
- Allocation efficiency : Builders pre-allocate for batch size, minimizing reallocation
- Cache locality : Larger batches improve CPU cache utilization during gathering
Streaming Execution
The RowStreamBuilder manages batch-based streaming:
- Accept a RowIdSource (bitmap or vector)
- Split row IDs into windows of STREAM_BATCH_ROWS
- For each window:
  - Call gather_rows_with_reusable_context() with the window's row IDs
  - Invoke the on_batch callback with the resulting RecordBatch
  - Reuse the MultiGatherContext for the next window
This approach ensures constant memory usage regardless of result set size, as only one batch exists in memory at a time.
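The loop structure can be sketched as follows, with gather_batch standing in for gather_rows_with_reusable_context() and plain row ID vectors standing in for Arrow RecordBatches:

```rust
const STREAM_BATCH_ROWS: usize = 4096;

/// Windowed streaming: memory stays bounded by one batch at a time.
fn stream_rows<F>(row_ids: &[u64], mut on_batch: F) -> Result<(), String>
where
    F: FnMut(&[u64]) -> Result<(), String>,
{
    for window in row_ids.chunks(STREAM_BATCH_ROWS) {
        // Gather values for this window only, then hand them to the callback.
        let batch = gather_batch(window);
        on_batch(batch.as_slice())?;
    }
    Ok(())
}

fn gather_batch(window: &[u64]) -> Vec<u64> {
    // Placeholder: the real code fetches column values for these row IDs.
    window.to_vec()
}

fn main() -> Result<(), String> {
    let row_ids: Vec<u64> = (0..10_000).collect();
    let mut batches = 0;
    stream_rows(&row_ids, |_batch| {
        batches += 1;
        Ok(())
    })?;
    assert_eq!(batches, 3); // 4096 + 4096 + 1808 rows
    Ok(())
}
```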
Sources: llkv-table/src/constants.rs llkv-scan/src/row_stream.rs llkv-table/src/table.rs:588-645
Memory Efficiency
The streaming architecture minimizes memory allocations through several mechanisms:
Builder Reuse - Arrow array builders are pre-allocated to the batch size and reused across batches. Builders track their capacity and avoid reallocation when the next batch has the same or smaller size.
Context Pooling - The GatherContextPool retains prepared contexts, eliminating the need to reload descriptors and rebuild field plans for repeated projections.
Chunk Caching - Within a context, decoded Arrow arrays are cached by physical key. Chunks that span multiple batches are deserialized once and reused, reducing I/O and deserialization overhead.
Scratch Buffer Reuse - The row_scratch buffer that maps output indices to chunk coordinates is allocated once per context and cleared (not deallocated) between batches.
Visitor State Management - Row ID collectors and emitters maintain minimal state, typically just a bitmap or small buffer, avoiding large intermediate structures.
These optimizations enable LLKV to sustain multi-million row/second scan rates with sub-100MB memory footprints even for billion-row tables.
Sources: llkv-column-map/src/store/projection.rs:562-568 llkv-column-map/src/store/projection.rs:709-720
graph TB
subgraph "llkv-table layer"
ScanStream["Table::scan_stream()"]
end
subgraph "llkv-scan layer"
ExecuteScan["execute_scan()"]
CheckFast{"Can Use\nFast Path?"}
FastLogic["build_fast_path_stream()"]
SlowLogic["collect_row_ids_for_table()\nthen build_row_stream()"]
FastStream["Direct ScanBuilder\nstreaming"]
SlowStream["RowStreamBuilder\nwith MultiGatherContext"]
end
subgraph "llkv-column-map layer"
ScanBuilder["ScanBuilder::run()"]
ColumnStore["ColumnStore::gather_rows()"]
end
Callback["User-provided\non_batch callback"]
ScanStream --> ExecuteScan
ExecuteScan --> CheckFast
CheckFast -->|Yes| FastLogic
CheckFast -->|No| SlowLogic
FastLogic --> FastStream
SlowLogic --> SlowStream
FastStream --> ScanBuilder
SlowStream --> ColumnStore
ScanBuilder --> Callback
ColumnStore --> Callback
Integration with llkv-scan
The llkv-scan crate provides the execute_scan function that orchestrates the full scan execution:
execute_scan Function Signature
ScanStorage Trait
The ScanStorage trait abstracts over storage implementations, allowing execute_scan to work with both Table and direct ColumnStore access:
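A rough, simplified rendering of that abstraction, reconstructed from the method names shown in the Table Abstraction class diagram later in this wiki; the signatures here are stand-ins (placeholder types, no pager generic, no gather policy):

```rust
type TableId = u32;
type LogicalFieldId = u64;
type RowIdBitmap = std::collections::BTreeSet<u64>; // stand-in for a Treemap

/// Sketch only: the real ScanStorage trait in llkv-scan carries more methods
/// and Arrow-typed results.
trait ScanStorage {
    fn table_id(&self) -> TableId;
    fn total_rows(&self) -> Result<u64, String>;
    /// Fetch one column's values for the given row IDs.
    fn gather_column(&self, field: LogicalFieldId, row_ids: &[u64]) -> Result<Vec<i64>, String>;
    /// Evaluate a predicate against one column and return matching row IDs.
    fn filter_matches(
        &self,
        field: LogicalFieldId,
        predicate: &dyn Fn(i64) -> bool,
    ) -> Result<RowIdBitmap, String>;
}
```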
This abstraction enables unit testing of scan logic without requiring full table setup and allows specialized storage implementations to optimize row ID collection.
Sources: llkv-scan/src/execute.rs llkv-scan/src/storage.rs llkv-table/src/table.rs:479-488
Performance Monitoring
Benchmarking Scan Performance
The repository includes examples that measure scan performance across different scenarios:
| Example | Purpose | Key Measurement |
|---|---|---|
test_streaming.rs | Verify streaming optimization activates | Throughput (rows/sec) |
direct_comparison.rs | Compare ColumnStore vs Table layer overhead | Execution time ratio |
performance_benchmark.rs | Profile different scan configurations | Latency distribution |
Representative Results (1M row table, Int64 column):
| Scenario | Throughput | Notes |
|---|---|---|
| Single column, unbounded | 100-200M rows/sec | Fast path active |
| Multi-column | 20-50M rows/sec | Materialization path |
| Bounded filter | 10-30M rows/sec | Predicate evaluation overhead |
| With nulls | 15-40M rows/sec | Null inspection required |
These benchmarks guide optimization efforts and validate that changes maintain performance characteristics.
Sources: llkv-table/examples/test_streaming.rs llkv-table/examples/direct_comparison.rs:276-399 llkv-table/examples/performance_benchmark.rs:82-211
Pager I/O Instrumentation
The InstrumentedPager wrapper tracks storage operations during scans:
Tracked Metrics
- physical_gets : Total number of key-value reads
- get_batches : Number of batch read operations
- physical_puts : Total number of key-value writes
- put_batches : Number of batch write operations
For a SELECT COUNT(*) on a 3-row table, typical I/O patterns show:
- ~36 physical gets for descriptor and chunk loads
- ~23 batch operations due to pipelined chunk fetching
This instrumentation identifies I/O hotspots and validates that scan optimizations reduce storage access.
Sources: llkv-sql/tests/pager_io_tests.rs:17-73 llkv-storage/src/pager/instrumented.rs
Filter Evaluation
Relevant source files
- llkv-column-map/src/store/core.rs
- llkv-column-map/src/store/scan/filter.rs
- llkv-csv/src/writer.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/table.rs
Purpose and Scope
This page documents how filter expressions are evaluated against table rows to produce sets of matching row IDs. Filter evaluation is a critical component of query execution that bridges the query planning layer (see Query Planning) and the physical data retrieval operations (see Scan Execution and Optimization).
The system evaluates filters in two primary contexts:
- Row ID collection : Applying predicates to columns to determine which rows satisfy conditions
- MVCC filtering : Applying transaction visibility rules to determine which row versions are visible
For information about the expression AST and predicate structures used during filtering, see Expression System.
Filter Evaluation Pipeline
Filter evaluation flows from the table abstraction down through the column store to type-specific visitor implementations:
Sources: llkv-table/src/table.rs:490-496 llkv-column-map/src/store/core.rs:356-372 llkv-column-map/src/store/scan/filter.rs:209-298
flowchart TB
TableFilter["Table::filter_row_ids()"]
CollectRowIds["collect_row_ids_for_table()"]
CompileExpr["Expression Compilation"]
EvalPred["Predicate Evaluation"]
StoreFilter["ColumnStore::filter_row_ids()"]
Dispatch["FilterDispatch::run_filter()"]
PrimitiveFilter["FilterPrimitive::run_filter()"]
Visitor["FilterVisitor<T, F>"]
DescLoad["Load ColumnDescriptor"]
ChunkLoop["For each ChunkMetadata"]
MetaPrune["Check min/max pruning"]
LoadChunk["Load chunk arrays"]
EvalChunk["Evaluate predicate"]
CollectMatches["Collect matching row IDs"]
ReturnBitmap["Return Treemap bitmap"]
TableFilter --> CollectRowIds
CollectRowIds --> CompileExpr
CompileExpr --> EvalPred
EvalPred --> StoreFilter
StoreFilter --> Dispatch
Dispatch --> PrimitiveFilter
PrimitiveFilter --> Visitor
Visitor --> DescLoad
DescLoad --> ChunkLoop
ChunkLoop --> MetaPrune
MetaPrune -->|Skip chunk| ChunkLoop
MetaPrune -->|May match| LoadChunk
LoadChunk --> EvalChunk
EvalChunk --> CollectMatches
CollectMatches --> ChunkLoop
ChunkLoop -->|Done| ReturnBitmap
Table-Level Filter Entry Points
The Table struct provides the primary interface for filter evaluation at the table abstraction layer:
graph TB
subgraph "Table Layer"
FilterRowIds["filter_row_ids(&Expr<FieldId>)"]
CollectRowIds["collect_row_ids_for_table()"]
CompileFilter["FilterCompiler::compile()"]
end
subgraph "Compilation"
TranslateExpr["ExprTranslator::translate()"]
BuildPredicate["build_*_predicate()"]
Predicate["Predicate<T::Value>"]
end
subgraph "ColumnStore Layer"
StoreFilterRowIds["filter_row_ids<T>()"]
FilterMatchesResult["filter_matches<T, F>()"]
end
FilterRowIds --> CollectRowIds
CollectRowIds --> CompileFilter
CompileFilter --> TranslateExpr
TranslateExpr --> BuildPredicate
BuildPredicate --> Predicate
Predicate --> StoreFilterRowIds
Predicate --> FilterMatchesResult
StoreFilterRowIds --> |Vec<u64>| FilterRowIds
FilterMatchesResult --> |FilterResult| CollectRowIds
The filter_row_ids() method at llkv-table/src/table.rs:490-496 converts an expression tree into a bitmap of matching row IDs. It delegates to collect_row_ids_for_table() which compiles the expression and evaluates predicates against the column store.
Key responsibilities:
- Expression translation : Convert Expr<FieldId> to Expr<LogicalFieldId>
- Predicate compilation : Transform operators into typed Predicate structures
- MVCC integration : Filter results by transaction visibility
- Bitmap aggregation : Combine multiple predicate results using set operations
Sources: llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:851-1100
Filter Dispatch System
The filter dispatch system provides type-specific filter implementations through the FilterDispatch trait hierarchy:
Implementation Strategy:
classDiagram
class FilterDispatch {<<trait>>\n+type Value\n+run_filter() Vec~u64~\n+run_fused() Vec~u64~}
class FilterPrimitive {<<trait>>\n+type Native\n+run_filter() Vec~u64~\n+run_all() Vec~u64~\n+run_filter_with_result() FilterResult}
class Utf8Filter~O~ {+run_filter() Vec~u64~\n+run_fused() Vec~u64~}
class UInt64Type
class Int64Type
class Float64Type
class Date32Type
class StringTypes
FilterDispatch <|-- FilterPrimitive : implements
FilterDispatch <|-- Utf8Filter : implements
FilterPrimitive <|-- UInt64Type : specializes
FilterPrimitive <|-- Int64Type : specializes
FilterPrimitive <|-- Float64Type : specializes
FilterPrimitive <|-- Date32Type : specializes
Utf8Filter --> StringTypes : handles
The FilterDispatch trait at llkv-column-map/src/store/scan/filter.rs:209-273 defines the interface for type-specific filtering:
Primitive types implement FilterPrimitive, which provides the default FilterDispatch implementation at llkv-column-map/src/store/scan/filter.rs:275-298. This handles numeric types, booleans, and dates using the visitor pattern.
String types use the specialized Utf8Filter implementation at llkv-column-map/src/store/scan/filter.rs:307-504, which supports vectorized operations like contains and fused multi-predicate evaluation.
Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:307-504
Visitor Pattern for Chunk Traversal
Filter evaluation uses the visitor pattern to traverse chunks efficiently:
The FilterVisitor<T, F> struct at llkv-column-map/src/store/scan/filter.rs:506-591 implements all visitor traits to handle different chunk formats:
- Unsorted chunks : Processes each value individually
- Sorted chunks : Can exploit ordering for early termination
- With row IDs : Matches values to explicit row identifiers
- Sorted with row IDs : Combines both optimizations
The visitor maintains internal state to build a FilterResult:
| Field | Type | Purpose |
|---|---|---|
predicate | F: FnMut(T::Native) -> bool | Predicate closure to evaluate |
runs | Vec<FilterRun> | Run-length encoded matches |
fallback_row_ids | Option<Vec<u64>> | Sparse row ID list |
prev_row_id | Option<u64> | Last seen row ID for run detection |
total_matches | usize | Count of matching rows |
Sources: llkv-column-map/src/store/scan/filter.rs:506-648 llkv-column-map/src/store/scan/filter.rs:692-771
flowchart LR
LoadDesc["Load ColumnDescriptor"]
IterChunks["Iterate ChunkMetadata"]
CheckMin["Check min_val_u64"]
CheckMax["Check max_val_u64"]
PruneDecision{{"Can prune?"}}
SkipChunk["Skip chunk"]
LoadAndEval["Load & evaluate chunk"]
LoadDesc --> IterChunks
IterChunks --> CheckMin
CheckMin --> CheckMax
CheckMax --> PruneDecision
PruneDecision -->|Yes: predicate range doesn't overlap| SkipChunk
PruneDecision -->|No: may contain matching values| LoadAndEval
SkipChunk --> IterChunks
LoadAndEval --> IterChunks
Chunk Metadata Pruning
The filter evaluation pipeline exploits chunk metadata to skip irrelevant data:
The ChunkMetadata structure stores summary statistics for each chunk:
| Field | Type | Purpose |
|---|---|---|
min_val_u64 | u64 | Minimum value in chunk (for numerics) |
max_val_u64 | u64 | Maximum value in chunk (for numerics) |
row_count | u32 | Number of rows in chunk |
chunk_pk | PhysicalKey | Key for chunk data |
value_order_perm_pk | PhysicalKey | Key for sort permutation |
Pruning logic at llkv-column-map/src/store/core.rs:679-690:
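A minimal sketch of the min/max test, using the ChunkMetadata field names from the table above; the real pruning code in core.rs handles more cases than this illustration:

```rust
/// Subset of chunk statistics relevant to pruning (sketch only).
struct ChunkMetadata {
    min_val_u64: u64,
    max_val_u64: u64,
}

/// A chunk can be skipped when its value range cannot intersect the
/// predicate range [pred_min, pred_max].
fn chunk_may_match(meta: &ChunkMetadata, pred_min: u64, pred_max: u64) -> bool {
    !(meta.max_val_u64 < pred_min || meta.min_val_u64 > pred_max)
}

fn main() {
    let chunk = ChunkMetadata { min_val_u64: 100, max_val_u64: 200 };
    assert!(chunk_may_match(&chunk, 150, 300)); // overlaps → must load
    assert!(!chunk_may_match(&chunk, 201, 300)); // disjoint → prune
}
```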
This optimization is particularly effective for:
- Range queries : WHERE col BETWEEN x AND y
- Equality predicates : WHERE col = value
- Sorted data : Natural clustering improves pruning
Sources: llkv-column-map/src/store/core.rs:679-690 llkv-column-map/src/store/descriptor.rs:40-70
String Filtering with Predicate Fusion
String filtering receives special optimization through the Utf8Filter implementation which supports fused multi-predicate evaluation:
Key optimization techniques:
flowchart TB
RunFused["Utf8Filter::run_fused()"]
SeparatePreds["Separate predicates"]
ContainsPreds["Contains predicates"]
OtherPreds["Other predicates"]
LoadChunks["Load value & row_id chunks"]
InitBitmask["Initialize BitMask\n(all bits = 1)"]
FilterNulls["AND with null bitmask"]
LoopContains["For each contains pattern"]
VectorizedContains["Arrow::contains_utf8_scalar()"]
AndMask["AND result into bitmask"]
LoopOther["For each remaining bit"]
EvalOther["Evaluate other predicates"]
CollectRowIds["Collect matching row IDs"]
RunFused --> SeparatePreds
SeparatePreds --> ContainsPreds
SeparatePreds --> OtherPreds
SeparatePreds --> LoadChunks
LoadChunks --> InitBitmask
InitBitmask --> FilterNulls
FilterNulls --> LoopContains
LoopContains --> VectorizedContains
VectorizedContains --> AndMask
AndMask --> LoopContains
LoopContains -->|Done| LoopOther
LoopOther --> EvalOther
EvalOther --> CollectRowIds
- Bitwise filtering using BitMask at llkv-column-map/src/store/scan/filter.rs:32-110:
  - Stores candidate rows as packed u64 words
  - Supports efficient AND operations
  - Tracks candidate count to short-circuit early
- Vectorized contains at llkv-column-map/src/store/scan/filter.rs:441-465:
  - Uses Arrow's SIMD contains_utf8_scalar() kernel
  - Processes entire chunks without row-by-row iteration
  - Returns boolean arrays that AND into the bitmask
- Progressive filtering at llkv-column-map/src/store/scan/filter.rs:421-469:
  - Applies vectorized predicates first to eliminate most rows
  - Only evaluates slower per-row predicates on remaining candidates
  - Short-circuits when candidate count reaches zero
Example scenario: a filter that combines two LIKE patterns with a LENGTH() predicate on the same string column.
The fused evaluation:
- Vectorizes both LIKE patterns using contains
- ANDs the results to get a sparse candidate set
- Only evaluates LENGTH() on remaining rows
Sources: llkv-column-map/src/store/scan/filter.rs:336-504 llkv-column-map/src/store/scan/filter.rs:32-110
Filter Result Encoding
Filter results use run-length encoding to efficiently represent dense matches:
The FilterResult structure at llkv-column-map/src/store/scan/filter.rs:136-183 provides two representations:
Run-length encoding (dense matches):
- Used when matching rows are mostly sequential
- Each FilterRun stores start_row_id and len
- Extremely compact for range queries or sorted scans
- Example: rows [100, 101, 102, 103, 104] → FilterRun { start: 100, len: 5 }
Sparse representation (fallback):
- Used when matches are scattered
- Stores an explicit Vec<u64> of row IDs
- Automatically degrades when out-of-order matches are detected
- Example: rows [100, 200, 150, 300] → [100, 200, 150, 300]
Adaptive strategy at llkv-column-map/src/store/scan/filter.rs:543-590:
The FilterVisitor::record_match() method dynamically chooses encoding:
    If match follows previous (row_id == prev + 1):
        Extend current run
    Else if match is out of order (row_id < prev):
        Convert to sparse representation
    Else:
        Start new run
This ensures optimal encoding regardless of data distribution.
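A compilable rendering of that pseudocode, assuming FilterRun has the start/length shape shown in the encoding examples above; the real FilterVisitor is generic over Arrow types and richer than this sketch:

```rust
/// Run-length entry as described above: a start row ID plus a run length.
#[derive(Debug, PartialEq)]
struct FilterRun {
    start_row_id: u64,
    len: u64,
}

/// Sketch of the adaptive encoder behind record_match(): extend the current
/// run for sequential IDs, start a new run on a forward jump, and fall back
/// to a sparse list as soon as an out-of-order ID is seen.
#[derive(Default)]
struct RunEncoder {
    runs: Vec<FilterRun>,
    fallback_row_ids: Option<Vec<u64>>,
    prev_row_id: Option<u64>,
}

impl RunEncoder {
    fn record_match(&mut self, row_id: u64) {
        if let Some(sparse) = &mut self.fallback_row_ids {
            sparse.push(row_id); // already degraded to the sparse form
        } else if let Some(prev) = self.prev_row_id {
            if row_id == prev + 1 {
                self.runs.last_mut().unwrap().len += 1; // extend current run
            } else if row_id < prev {
                // Out of order: convert existing runs into an explicit list.
                let mut sparse: Vec<u64> = self
                    .runs
                    .drain(..)
                    .flat_map(|r| r.start_row_id..r.start_row_id + r.len)
                    .collect();
                sparse.push(row_id);
                self.fallback_row_ids = Some(sparse);
            } else {
                self.runs.push(FilterRun { start_row_id: row_id, len: 1 });
            }
        } else {
            self.runs.push(FilterRun { start_row_id: row_id, len: 1 });
        }
        self.prev_row_id = Some(row_id);
    }
}

fn main() {
    let mut enc = RunEncoder::default();
    for rid in [100u64, 101, 102, 103, 104] {
        enc.record_match(rid);
    }
    assert_eq!(enc.runs, vec![FilterRun { start_row_id: 100, len: 5 }]);
}
```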
Sources: llkv-column-map/src/store/scan/filter.rs:115-183 llkv-column-map/src/store/scan/filter.rs:543-590
flowchart TB
FilterRowIds["Table::filter_row_ids()"]
CollectUserData["Collect matching user data rows"]
MVCCCheck{{"MVCC enabled?"}}
LoadCreatedBy["Load created_by column"]
LoadDeletedBy["Load deleted_by column"]
FilterCreated["Filter: created_by <= txn_id"]
FilterDeleted["Filter: deleted_by = 0 OR\ndeleted_by > txn_id"]
IntersectSets["Intersect bitmaps"]
ReturnVisible["Return visible rows"]
FilterRowIds --> CollectUserData
CollectUserData --> MVCCCheck
MVCCCheck -->|No| ReturnVisible
MVCCCheck -->|Yes| LoadCreatedBy
LoadCreatedBy --> LoadDeletedBy
LoadDeletedBy --> FilterCreated
FilterCreated --> FilterDeleted
FilterDeleted --> IntersectSets
IntersectSets --> ReturnVisible
MVCC Filtering Integration
Filter evaluation integrates with Multi-Version Concurrency Control (MVCC) to enforce transaction visibility:
MVCC columns are stored in separate logical namespaces:
| Namespace | Column | Purpose |
|---|---|---|
TxnCreatedBy | created_by | Transaction ID that created this row version |
TxnDeletedBy | deleted_by | Transaction ID that deleted this row (0 if active) |
Visibility rules at llkv-table/src/table.rs:1047-1095:
A row version is visible to transaction T if:
- created_by <= T.id (row was created before or by this transaction)
- deleted_by = 0 OR deleted_by > T.id (row not deleted, or deleted after this transaction)
Implementation:
The collect_row_ids_for_table() method applies MVCC filtering after user predicate evaluation:
- Evaluate user predicates on user-data columns → bitmap A
- Evaluate created_by <= txn_id on MVCC column → bitmap B
- Evaluate deleted_by = 0 OR deleted_by > txn_id on MVCC column → bitmap C
- Return intersection: A ∩ B ∩ C
This ensures only transaction-visible row versions are returned to the query executor.
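The per-row visibility rule reduces to a small predicate, sketched here with the MVCC constants documented in the Table Abstraction page:

```rust
/// deleted_by value meaning "not deleted", as documented in this wiki.
const TXN_ID_NONE: u64 = 0;

/// Visibility rule sketch: a row version is visible to transaction `txn_id`
/// if it was created at or before that transaction and is not deleted from
/// its perspective. This mirrors the two bullet points above.
fn row_visible(created_by: u64, deleted_by: u64, txn_id: u64) -> bool {
    created_by <= txn_id && (deleted_by == TXN_ID_NONE || deleted_by > txn_id)
}

fn main() {
    // Created by txn 5, deleted by txn 9: visible to txn 7, not to txn 10.
    assert!(row_visible(5, 9, 7));
    assert!(!row_visible(5, 9, 10));
    // Never deleted: visible to any later transaction.
    assert!(row_visible(1, TXN_ID_NONE, 42));
}
```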
Sources: llkv-table/src/table.rs:851-1100 llkv-table/src/table.rs:231-437
Performance Characteristics
Filter evaluation performance depends on several factors:
| Scenario | Performance | Explanation |
|---|---|---|
| Equality on indexed column | O(log N) | Uses binary search in sorted chunks |
| Range query on sorted data | O(chunks) | Metadata pruning skips most chunks |
| String contains (single) | O(N) | Vectorized SIMD comparison |
| String contains (multiple) | ~O(N) | Fused evaluation with progressive filtering |
| Complex predicates | O(N × P) | Per-row evaluation of P predicates |
| Dense matches | High efficiency | Run-length encoding reduces memory |
| Sparse matches | Moderate overhead | Explicit row ID lists |
Optimization opportunities:
- Chunk-level parallelism : Filter evaluation at llkv-column-map/src/store/scan/filter.rs:381-494 uses Rayon for parallel chunk processing
- Early termination : Metadata pruning can skip 90%+ of chunks for range queries
- Cache efficiency : Sequential chunk traversal has good spatial locality
- SIMD operations : String operations use Arrow’s vectorized kernels
Sources: llkv-column-map/src/store/scan/filter.rs:381-494 llkv-column-map/src/store/core.rs:679-690
Integration with Scan Execution
Filter evaluation feeds into the scan execution pipeline (see Scan Execution and Optimization):
The two-phase execution model:
Phase 1: Filter evaluation (this page)
- Evaluate predicates against columns
- Produce bitmap of matching row IDs
- Minimal data movement (only row ID metadata)
Phase 2: Data gathering (see Scan Execution and Optimization)
- Use row ID bitmap to fetch actual column values
- Assemble into Arrow RecordBatch
- Apply projections and transformations
This separation enables:
- Predicate pushdown : Filter before gathering data
- Projection pruning : Only fetch required columns
- Parallel execution : Filter and gather can overlap
- Memory efficiency : Small bitmaps instead of full data
Sources: llkv-table/src/table.rs:490-496 llkv-scan/src/execute.rs:1-300
Storage Layer
Relevant source files
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/core.rs
- llkv-column-map/src/store/projection.rs
- llkv-column-map/src/store/scan/filter.rs
- llkv-sql/tests/pager_io_tests.rs
The Storage Layer manages the persistence and retrieval of columnar data in LLKV. It bridges Apache Arrow’s in-memory columnar format with a key-value storage backend, providing efficient operations for appends, updates, deletes, and scans while maintaining data integrity through MVCC semantics.
This page provides an architectural overview of the storage subsystem. For details on specific components, see:
- Table Abstraction - High-level table operations and API
- Column Storage and ColumnStore - Physical column storage implementation
- Pager Interface and SIMD Optimization - Key-value persistence layer
Sources: llkv-column-map/src/store/core.rs:1-68 llkv-column-map/Cargo.toml:1-35
Architecture Overview
The storage layer consists of three primary components arranged in a hierarchy from logical to physical representation:
Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/Cargo.toml:21-33
graph TB
subgraph "Arrow Layer"
RB["RecordBatch\nColumnar Arrays"]
SCHEMA["Schema\nField Metadata"]
ARRAYS["Arrow Arrays\n(Int64, Utf8, etc)"]
end
subgraph "Column Store Layer"
CS["ColumnStore\nllkv-column-map"]
CATALOG["ColumnCatalog\nLogicalFieldId → PhysicalKey"]
DESC["ColumnDescriptor\nLinked List of Chunks"]
META["ChunkMetadata\nmin/max, row_count"]
CHUNKS["Serialized Chunks\nArrow IPC Format"]
end
subgraph "Pager Layer"
PAGER["Pager Trait\nbatch_get/batch_put"]
BATCH_OPS["BatchGet/BatchPut\nOperations"]
end
subgraph "Physical Storage"
SIMD["simd-r-drive\nMemory-Mapped KV Store"]
HANDLES["EntryHandle\nByte Blobs"]
end
RB --> CS
SCHEMA --> CS
ARRAYS --> CHUNKS
CS --> CATALOG
CS --> DESC
DESC --> META
META --> CHUNKS
CS --> PAGER
PAGER --> BATCH_OPS
BATCH_OPS --> SIMD
SIMD --> HANDLES
CHUNKS -.serialized as.-> HANDLES
CATALOG -.persisted via.-> PAGER
DESC -.persisted via.-> PAGER
Columnar Storage Model
Logical Field Identification
Columns are identified by LogicalFieldId, a composite key containing:
| Component | Type | Purpose |
|---|---|---|
| Namespace | LogicalStorageNamespace | Separates user data from internal columns (row IDs, MVCC metadata) |
| Table ID | TableId | Identifies the owning table |
| Field ID | FieldId | Column position within the table schema |
Namespace Values:
- UserData - Regular table columns
- RowIdShadow - Internal row ID tracking
- TxnCreatedBy - MVCC transaction creation timestamps
- TxnDeletedBy - MVCC transaction deletion timestamps
This namespacing prevents collisions between user-visible columns and internal bookkeeping while enabling efficient multi-column operations within a table.
Sources: llkv-column-map/src/store/core.rs:36-46
Chunk Organization
Data is stored in chunks , which are serialized Arrow arrays. Each chunk contains:
ChunkMetadata Fields:
| Field | Type | Purpose |
|---|---|---|
chunk_pk | PhysicalKey | Storage location of value array |
min_val_u64 / max_val_u64 | u64 | Value range for predicate pruning |
row_count | u64 | Number of rows in chunk |
null_count | u64 | Number of null values |
serialized_bytes | u64 | Size of serialized array |
value_order_perm_pk | PhysicalKey | Optional sort index for fast lookups |
The metadata enables chunk pruning : evaluating predicates against min/max values to skip entire chunks without loading data.
Sources: llkv-column-map/src/store/core.rs:1-6 llkv-column-map/src/store/descriptor.rs
Descriptor Pages
A ColumnDescriptor organizes chunks into a linked list of metadata pages:
Each descriptor page contains:
- Header (DescriptorPageHeader): entry_count, next_page_pk
- Entries : Array of ChunkMetadata structures
Appends extend the tail page; when full, a new page is allocated and linked.
Sources: llkv-column-map/src/store/descriptor.rs
Data Flow: Append Operation
The append path implements Last-Write-Wins semantics for row ID conflicts:
Key Steps:
- Pre-sort by row ID llkv-column-map/src/store/core.rs:798-847 - Ensures efficient metadata updates and naturally sorted shadow columns
- LWW Rewrite llkv-column-map/src/store/core.rs:893-901 - Updates existing row IDs in-place
- Filter llkv-column-map/src/store/core.rs:918-942 - Removes rewritten rows and nulls
- Chunk & Serialize llkv-column-map/src/store/core.rs:1032-1114 - Splits into target-sized chunks, serializes to Arrow IPC
- Atomic Batch Put llkv-column-map/src/store/core.rs:1116-1132 - Commits all writes atomically
- Epoch Increment llkv-column-map/src/store/core.rs:1133-1134 - Invalidates gather context caches
Sources: llkv-column-map/src/store/core.rs:758-1137
sequenceDiagram
participant User
participant CS as ColumnStore
participant CTX as MultiGatherContext
participant Cache as Chunk Cache
participant Pager
User->>CS: gather_rows(field_ids, row_ids)
CS->>CTX: Acquire context from pool
CS->>CS: Build FieldPlans\n(value_metas, row_metas)
Note over CS,CTX: Phase 1: Identify Candidate Chunks
loop For each field
CS->>CTX: Filter chunks by row_id range
CTX->>CTX: candidate_indices.push(idx)
end
Note over CS,Cache: Phase 2: Fetch Chunks
CS->>Cache: Check cache for chunk_pks
CS->>Pager: batch_get(missing_chunks)
Pager-->>CS: EntryHandle[]
CS->>CS: deserialize_array(bytes)
CS->>Cache: Insert arrays
Note over CS,CTX: Phase 3: Build Row Index
CS->>CTX: row_index[row_id] = output_idx
Note over CS,CTX: Phase 4: Gather Values
loop For each column
loop For each row_id
CTX->>Cache: Lookup value array
CTX->>CTX: Find row in chunk\nvia binary search
CTX->>CTX: builder.append_value()
end
CTX->>CTX: Finish builder → ArrayRef
end
CS->>CS: RecordBatch::try_new(schema, arrays)
CS-->>User: RecordBatch
CTX->>CS: Return context to pool
Data Flow: Gather Operation
Gathering assembles a RecordBatch from a list of row IDs by fetching values from chunks:
Optimizations:
| Optimization | Location | Description |
|---|---|---|
| Context Pooling | llkv-column-map/src/store/projection.rs:651-721 | Reuses MultiGatherContext across calls to amortize allocation costs |
| Chunk Caching | llkv-column-map/src/store/projection.rs:1032-1053 | Caches deserialized Arrow arrays by PhysicalKey |
| Candidate Pruning | llkv-column-map/src/store/projection.rs:1015-1027 | Only loads chunks overlapping the requested row ID range |
| Dense Fast Path | llkv-column-map/src/store/projection.rs:984-1011 | For contiguous row IDs, uses offset arithmetic instead of hash lookup |
| Single Chunk Slicing | llkv-column-map/src/store/projection.rs:303-329 | If all rows in one sorted chunk, returns array.slice() |
Sources: llkv-column-map/src/store/projection.rs:772-960 llkv-column-map/src/store/core.rs:758-785
Data Flow: Filter Operation
Filtering evaluates predicates and returns matching row IDs:
Filter Execution Path:
- Descriptor Load llkv-column-map/src/store/scan/filter.rs:1301-1333 - Fetch column descriptor and iterate metadata
- Metadata Pruning llkv-column-map/src/store/scan/filter.rs:1336-1353 - Skip chunks where min_val > predicate_max or max_val < predicate_min
- Chunk Fetch llkv-column-map/src/store/scan/filter.rs:1355-1367 - Batch-get overlapping chunk arrays
- Visitor Evaluation llkv-column-map/src/store/scan/filter.rs:1369-1394 - Type-specialized visitor applies predicate
- Result Encoding llkv-column-map/src/store/scan/filter.rs:506-591 - Build FilterResult with run-length encoding or sparse list
Visitor Pattern Example (FilterVisitor<T, F>):
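The following condensed sketch is based on the visitor state table in the Filter Evaluation page; it fixes the value type to i64 and keeps only a flat match list, whereas the real FilterVisitor<T, F> is generic over Arrow primitive types and tracks runs:

```rust
/// Sketch: a predicate closure plus minimal match bookkeeping.
struct FilterVisitorSketch<F: FnMut(i64) -> bool> {
    predicate: F,
    matched_row_ids: Vec<u64>, // stands in for runs / fallback_row_ids
    total_matches: usize,
}

impl<F: FnMut(i64) -> bool> FilterVisitorSketch<F> {
    fn new(predicate: F) -> Self {
        Self { predicate, matched_row_ids: Vec::new(), total_matches: 0 }
    }

    /// Hot loop for one chunk: values and their row IDs arrive side by side.
    fn visit_chunk_with_rids(&mut self, values: &[i64], row_ids: &[u64]) {
        for (&value, &rid) in values.iter().zip(row_ids) {
            if (self.predicate)(value) {
                self.matched_row_ids.push(rid);
                self.total_matches += 1;
            }
        }
    }
}

fn main() {
    let mut visitor = FilterVisitorSketch::new(|v| v > 10);
    visitor.visit_chunk_with_rids(&[5, 11, 42], &[100, 101, 102]);
    assert_eq!(visitor.matched_row_ids, vec![101, 102]);
    assert_eq!(visitor.total_matches, 2);
}
```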
The visitor pattern enables type-specialized hot loops while maintaining a generic filter interface.
Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:506-591 llkv-column-map/src/store/scan/filter.rs:605-649
Index Management
The storage layer supports two index types for accelerated lookups:
Index Operations:
| Method | Purpose | Location |
|---|---|---|
register_index | Creates and persists an index for a column | llkv-column-map/src/store/core.rs:145-147 |
unregister_index | Removes an index and frees storage | llkv-column-map/src/store/core.rs:163-165 |
list_persisted_indexes | Queries existing indexes for a column | llkv-column-map/src/store/core.rs:408-425 |
Presence Index Usage llkv-column-map/src/store/core.rs:663-754:
- Stored in ChunkMetadata.value_order_perm_pk
- Binary search over permuted view of row IDs
- Accelerates has_row_id checks
Sources: llkv-column-map/src/store/core.rs:136-165 llkv-column-map/src/store/core.rs:663-754
Storage Optimizations
Chunk Size Tuning
The ColumnStoreConfig provides tuning parameters:
| Parameter | Default | Purpose |
|---|---|---|
target_chunk_rows | 8192 | Target rows per chunk for new appends |
min_chunk_rows | 2048 | Minimum before considering compaction |
max_chunk_rows | 32768 | Maximum before forcing a split |
Smaller chunks enable finer-grained pruning but increase metadata overhead. The defaults balance scan performance with storage efficiency.
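A sketch of a configuration struct mirroring the defaults in the table above; the real ColumnStoreConfig in store/config.rs may carry additional fields:

```rust
/// Chunk-size tuning knobs (sketch with the documented defaults).
struct ColumnStoreConfig {
    target_chunk_rows: usize,
    min_chunk_rows: usize,
    max_chunk_rows: usize,
}

impl Default for ColumnStoreConfig {
    fn default() -> Self {
        Self {
            target_chunk_rows: 8_192,
            min_chunk_rows: 2_048,
            max_chunk_rows: 32_768,
        }
    }
}

fn main() {
    let cfg = ColumnStoreConfig::default();
    // A 100,000-row append splits into ceil(100_000 / 8_192) = 13 chunks.
    let chunks = (100_000 + cfg.target_chunk_rows - 1) / cfg.target_chunk_rows;
    assert_eq!(chunks, 13);
}
```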
Sources: llkv-column-map/src/store/config.rs
Last-Write-Wins Semantics
When appending rows with existing row IDs:
- Scan existing chunks llkv-column-map/src/store/core.rs:893-901 to identify conflicting row IDs
- Overwrite in-place llkv-column-map/src/store/lww.rs within the chunk containing the old value
- Append new rows llkv-column-map/src/store/core.rs:918-942 that don’t conflict
This avoids duplicates and provides update semantics without explicit UPDATE statements.
Sources: llkv-column-map/src/store/core.rs:893-942 llkv-column-map/src/store/lww.rs
Null Handling
Null values are not stored in value chunks llkv-column-map/src/store/core.rs:920-931:
- Filtered out during append
- Represented by absence in the shadow row ID column
- Surfaced as nulls during gather if row ID exists but value is missing
This reduces storage footprint for sparse columns.
Sources: llkv-column-map/src/store/core.rs:920-931
Compaction (Future)
The current implementation supports incremental writes but defers compaction:
- Small chunks accumulate over time
- Metadata overhead increases
- Scan performance degrades for highly-updated columns
Future work will implement background compaction to merge small chunks and rebuild optimal chunk sizes.
Sources: llkv-column-map/src/store/compact.rs
Component Relationships
The storage layer integrates with upstream and downstream components:
Key Interfaces:
- Table → ColumnStore : High-level append/gather operations llkv-table/src/table.rs
- ScanBuilder → ColumnStore : Predicate evaluation and filtering llkv-scan/src/scan_builder.rs
- ColumnStore → Pager : Batched key-value operations llkv-storage/src/pager.rs
Sources: llkv-column-map/src/store/core.rs:60-68 llkv-storage/src/pager.rs
Thread Safety
ColumnStore is Send + Sync and designed for concurrent access:
- Catalog : Arc<RwLock<ColumnCatalog>> - Read-heavy workload, allows concurrent reads
- Caches : RwLock for data type and index metadata
- Append Epoch : AtomicU64 for cache invalidation signaling
- Context Pool : Mutex<FxHashMap<...>> for gather context reuse
Concurrent appends to different tables are lock-free at the ColumnStore level. Appends to the same column serialize on the catalog write lock.
Sources: llkv-column-map/src/store/core.rs:47-68
Performance Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Append (sorted) | O(n) | n = rows; includes serialization and pager writes |
| Append (unsorted) | O(n log n) | Requires lexicographic sort |
| Gather (random) | O(m · k) | m = row count, k = avg chunk scan per row |
| Gather (dense) | O(m) | Contiguous row IDs enable offset arithmetic |
| Filter (full scan) | O(c) | c = total chunks; metadata pruning reduces effective c |
| Filter (indexed) | O(log c) | With presence/value indexes |
Batching Benefits llkv-sql/tests/pager_io_tests.rs:18-120:
- INSERT: ~8 allocations, 24 puts for 3 rows (2 columns + row ID + MVCC)
- SELECT: ~36 gets in 23 batches (descriptors + chunks + metadata)
- DELETE: ~6 puts, 46 gets (read for tombstones, write MVCC columns)
Sources: llkv-column-map/src/store/core.rs llkv-sql/tests/pager_io_tests.rs:18-120
Table Abstraction
Relevant source files
- llkv-csv/src/writer.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/lib.rs
- llkv-table/src/table.rs
The Table struct provides a schema-aware, transactional interface over the columnar storage layer. It sits between the SQL execution layer and the physical storage, managing field ID namespacing, MVCC column injection, schema validation, and catalog integration. The Table abstraction transforms low-level columnar operations into relational table semantics.
For information about the underlying columnar storage, see Column Storage and ColumnStore. For query execution that uses tables, see Query Execution. For metadata management, see System Catalog and SysCatalog.
Purpose and Responsibilities
The Table abstraction serves as the primary interface for relational data operations in LLKV. Its key responsibilities include:
| Responsibility | Description |
|---|---|
| Field ID Namespacing | Maps user FieldId values to table-scoped LogicalFieldId identifiers |
| MVCC Integration | Automatically injects and manages created_by and deleted_by columns |
| Schema Management | Provides Arrow schema generation and validation |
| Catalog Integration | Coordinates with SysCatalog for metadata persistence |
| Scan Operations | Implements streaming scans with projection, filtering, and ordering |
| Index Management | Registers and manages persisted sort indexes |
Sources: llkv-table/src/lib.rs:1-27 llkv-table/src/table.rs:58-69
Table Structure and Creation
Table Struct
The Table struct is a lightweight wrapper that combines a ColumnStore reference with a TableId. The MVCC cache is an optimization to avoid repeated schema introspection during appends.
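A sketch of that shape, reconstructed from the class diagram later on this page and the MvccColumnCache discussion below; the stub types stand in for the real generic definitions (the actual type is Table<P>, parameterized by the pager):

```rust
use std::sync::Arc;

// Placeholder stubs so the sketch stands alone; real definitions live in
// llkv-column-map and llkv-types.
struct ColumnStore;
type TableId = u32;

/// Cached flags: whether the incoming schema already carries the MVCC
/// bookkeeping columns (see MVCC Column Management below).
struct MvccColumnCache {
    has_created_by: bool,
    has_deleted_by: bool,
}

/// Sketch of the Table shape: a shared ColumnStore handle, the owning
/// table's ID, and the lazily-initialized MVCC cache.
struct Table {
    store: Arc<ColumnStore>,
    table_id: TableId,
    mvcc_cache: Option<MvccColumnCache>,
}
```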
Sources: llkv-table/src/table.rs:58-69 llkv-table/src/table.rs:50-56
Table Creation
Tables are created through the CatalogManager which coordinates metadata persistence and storage initialization:
sequenceDiagram
participant User
participant CatalogManager
participant SysCatalog
participant ColumnStore
participant Pager
User->>CatalogManager: create_table_from_columns(name, columns)
CatalogManager->>CatalogManager: Allocate TableId
CatalogManager->>SysCatalog: Register TableMeta
CatalogManager->>SysCatalog: Register ColMeta for each column
CatalogManager->>ColumnStore: Ensure physical storage
ColumnStore->>Pager: Initialize descriptors if needed
CatalogManager->>User: Return Table~P~
The creation process ensures metadata consistency before returning a usable Table handle.
Sources: llkv-table/src/table.rs:77-103
Field ID Namespace Mapping
Logical vs User Field IDs
LLKV uses a two-tier field ID system to prevent collisions between tables:
The mapping occurs during append operations where each field’s metadata is rewritten:
| Component | Type | Example |
|---|---|---|
| User Field ID | FieldId (u32) | 10 |
| Table ID | TableId (u32) | 1 |
| Logical Field ID | LogicalFieldId (u64) | 0x0000_0001_0000_000A |
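The example row above suggests that, for user-data columns, the table ID occupies the upper 32 bits and the field ID the lower 32 bits. A sketch of that packing follows; how the namespace bits are actually encoded is not shown here and is left out of this illustration:

```rust
type TableId = u32;
type FieldId = u32;
type LogicalFieldId = u64;

/// Illustrative packing for user-data columns only, matching the example in
/// the table above (table 1, field 10 → 0x0000_0001_0000_000A).
fn user_logical_field_id(table_id: TableId, field_id: FieldId) -> LogicalFieldId {
    ((table_id as u64) << 32) | field_id as u64
}

fn main() {
    assert_eq!(user_logical_field_id(1, 10), 0x0000_0001_0000_000A);
}
```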
Sources: llkv-table/src/table.rs:304-315
Data Operations
Append Operation
The append method is the primary write operation, handling field ID conversion, MVCC column injection, and catalog updates:
flowchart TD
START["append(batch)"]
CACHE["Get/Initialize\nMvccColumnCache"]
PROCESS["Process each field"]
subgraph "Per Field Processing"
SYSCHECK{"System column?\n(row_id, MVCC)"}
MVCC_ASSIGN["Assign MVCC\nLogicalFieldId"]
USER_CONVERT["Convert user FieldId\nto LogicalFieldId"]
META_UPDATE["Update catalog\nif needed"]
end
INJECT_CREATED{"created_by\npresent?"}
INJECT_DELETED{"deleted_by\npresent?"}
BUILD_CREATED["Build created_by column\nwith TXN_ID_AUTO_COMMIT"]
BUILD_DELETED["Build deleted_by column\nwith TXN_ID_NONE"]
RECONSTRUCT["Reconstruct RecordBatch\nwith LogicalFieldIds"]
STORE["store.append(namespaced_batch)"]
END["Return Result"]
START --> CACHE
CACHE --> PROCESS
PROCESS --> SYSCHECK
SYSCHECK -->|Yes| MVCC_ASSIGN
SYSCHECK -->|No| USER_CONVERT
USER_CONVERT --> META_UPDATE
MVCC_ASSIGN --> INJECT_CREATED
META_UPDATE --> INJECT_CREATED
INJECT_CREATED -->|No| BUILD_CREATED
INJECT_CREATED -->|Yes| INJECT_DELETED
BUILD_CREATED --> INJECT_DELETED
INJECT_DELETED -->|No| BUILD_DELETED
INJECT_DELETED -->|Yes| RECONSTRUCT
BUILD_DELETED --> RECONSTRUCT
RECONSTRUCT --> STORE
STORE --> END
MVCC Column Injection
For non-transactional appends (e.g., CSV import), the system automatically injects MVCC columns:
graph LR
subgraph "Input Batch"
IR["row_id: UInt64"]
IC1["col_a: Int64\nfield_id=10"]
IC2["col_b: Utf8\nfield_id=11"]
end
subgraph "Output Batch"
OR["row_id: UInt64"]
OC1["col_a: Int64\nlfid=0x0000_0001_0000_000A"]
OC2["col_b: Utf8\nlfid=0x0000_0001_0000_000B"]
OCB["created_by: UInt64\nlfid=MVCC_CREATED\nvalue=1"]
ODB["deleted_by: UInt64\nlfid=MVCC_DELETED\nvalue=0"]
end
IR --> OR
IC1 --> OC1
IC2 --> OC2
OC2 --> OCB
OCB --> ODB
The constant values used are:
- TXN_ID_AUTO_COMMIT = 1 for created_by (indicates system-committed, always visible)
- TXN_ID_NONE = 0 for deleted_by (indicates not deleted)
Sources: llkv-table/src/table.rs:231-438 llkv-table/src/table.rs:347-399
graph TB
subgraph "Public API"
SS["scan_stream\n(projections, filter, options, callback)"]
SSE["scan_stream_with_exprs\n(projections, filter, options, callback)"]
FR["filter_row_ids\n(filter_expr)"]
SC["stream_columns\n(logical_fields, row_ids, policy)"]
end
subgraph "Internal Execution"
ES["llkv_scan::execute_scan"]
RSB["RowStreamBuilder"]
CRS["collect_row_ids_for_table"]
end
subgraph "Storage Layer"
CS["ColumnStore::ScanBuilder"]
GR["gather_rows"]
FM["filter_matches"]
end
SS --> SSE
SSE --> ES
FR --> CRS
SC --> RSB
ES --> FM
ES --> GR
RSB --> GR
CRS --> FM
FM --> CS
GR --> CS
Scan Operations
The Table provides multiple scan methods that build on the lower-level ColumnStore scan infrastructure:
Scan Flow with Projections
Sources: llkv-table/src/table.rs:447-488 llkv-table/src/table.rs:490-496
Schema Management
Schema Generation
The schema() method constructs an Arrow Schema from catalog metadata:
Each field in the generated schema includes field_id metadata:
| Schema Component | Source | Example |
|---|---|---|
| Field name | ColMeta.name or generated | "customer_id" or "col_10" |
| Data type | ColumnStore.data_type(lfid) | DataType::Int64 |
| Field ID metadata | field_id key in metadata map | "10" |
| Nullability | Always true for user columns | true |
Sources: llkv-table/src/table.rs:519-549
graph TD
SCHEMA["schema()"]
subgraph "Extract Components"
NAMES["Collect field names"]
FIDS["Extract field_id metadata"]
TYPES["Format data types"]
end
subgraph "Build RecordBatch"
NAME_ARR["StringArray(names)"]
FID_ARR["UInt32Array(field_ids)"]
TYPE_ARR["StringArray(data_types)"]
RB_SCHEMA["Schema(name, field_id, data_type)"]
BATCH["RecordBatch::try_new"]
end
SCHEMA --> NAMES
SCHEMA --> FIDS
SCHEMA --> TYPES
NAMES --> NAME_ARR
FIDS --> FID_ARR
TYPES --> TYPE_ARR
NAME_ARR --> BATCH
FID_ARR --> BATCH
TYPE_ARR --> BATCH
RB_SCHEMA --> BATCH
Schema RecordBatch
For interactive display, the schema_recordbatch() method returns a tabular representation:
Sources: llkv-table/src/table.rs:554-586
stateDiagram-v2
[*] --> Uninitialized : Table created
state Uninitialized {
[*] --> CheckCache : append() called
CheckCache --> Initialize : Cache is None
Initialize --> Store : Scan schema for created_by, deleted_by
Store --> [*] : Cache populated
}
Uninitialized --> Cached : First append
state Cached {
[*] --> ReadCache : Subsequent appends
ReadCache --> UseCached : Return cached values
UseCached --> [*]
}
MVCC Column Management
Column Cache Optimization
The MvccColumnCache avoids repeated string comparisons during append operations:
The cache stores two boolean flags:
| Field | Type | Meaning |
|---|---|---|
has_created_by | bool | Schema includes created_by column |
has_deleted_by | bool | Schema includes deleted_by column |
This optimization is critical for bulk insert performance as it eliminates O(columns) string comparisons per batch.
Sources: llkv-table/src/table.rs:50-56 llkv-table/src/table.rs:175-205
MVCC Column LogicalFieldIds
MVCC columns use specially reserved LogicalFieldId values:
These reserved patterns ensure MVCC columns are stored separately from user data in the ColumnStore.
Sources: llkv-table/src/table.rs:255-259 llkv-table/src/table.rs:360-366 llkv-table/src/table.rs:383-389
Index Management
The Table provides operations to register and manage persisted sort indexes:
Registered indexes are maintained by the ColumnStore and used to optimize range scans and ordered queries.
Sources: llkv-table/src/table.rs:145-173
classDiagram
class RowIdScanCollector {
-Treemap row_ids
+extend_from_array(row_ids)
+extend_from_slice(row_ids, start, len)
+into_inner() Treemap
}
class PrimitiveWithRowIdsVisitor {<<trait>>\n+i64_chunk_with_rids(values, row_ids)\n+utf8_chunk_with_rids(values, row_ids)\n...}
class PrimitiveSortedWithRowIdsVisitor {<<trait>>\n+i64_run_with_rids(values, row_ids, start, len)\n+null_run(row_ids, start, len)\n...}
RowIdScanCollector ..|> PrimitiveWithRowIdsVisitor
RowIdScanCollector ..|> PrimitiveSortedWithRowIdsVisitor
Row ID Collection and Filtering
RowIdScanCollector
The RowIdScanCollector implements visitor traits to collect row IDs during filtered scans:
The collector ignores actual data values and only accumulates row IDs into a Treemap (roaring bitmap).
Sources: llkv-table/src/table.rs:805-858
sequenceDiagram
participant Scan
participant Emitter
participant Callback
participant Buffer
loop For each chunk from ColumnStore
Scan->>Emitter: chunk_with_rids(values, row_ids)
Emitter->>Buffer: Accumulate row_ids
alt Buffer reaches chunk_size
Emitter->>Callback: on_chunk(&row_ids[..chunk_size])
Callback-->>Emitter: Result
Emitter->>Buffer: Clear buffer
end
end
Note over Emitter: Scan complete
Emitter->>Callback: on_chunk(&remaining_row_ids)
Callback-->>Emitter: Result
RowIdChunkEmitter
For streaming scenarios, RowIdChunkEmitter buffers and emits row IDs in fixed-size chunks:
The emitter includes an optimization: when the buffer is empty, it passes slices directly without copying.
Sources: llkv-table/src/table.rs:860-993
classDiagram
class ScanStorage~P~ {<<trait>>\n+table_id() TableId\n+field_data_type(lfid) Result~DataType~\n+total_rows() Result~u64~\n+gather_column(lfid, row_ids, policy) Result~ArrayRef~\n+filter_matches(lfid, predicate) Result~Treemap~\n...}
class Table~P~ {-store: Arc~ColumnStore~P~~\n-table_id: TableId\n+catalog() SysCatalog}
Table~P~ ..|> ScanStorage~P~ : implements
ScanStorage Trait Implementation
The Table implements the ScanStorage trait to integrate with the llkv-scan execution layer:
Key delegations:
| ScanStorage Method | Table Implementation | Delegation Target |
|---|---|---|
table_id() | Direct field access | self.table_id |
field_data_type(lfid) | Forward to store | self.store.data_type(lfid) |
gather_column(...) | Forward to store | self.store.gather_rows(...) |
filter_matches(...) | Build predicate and scan | ColumnStore::ScanBuilder |
stream_row_ids_in_chunks(...) | Use RowIdChunkEmitter | Custom visitor |
Sources: llkv-table/src/table.rs:1019-1232
Integration with Lower Layers
Relationship to ColumnStore
The Table never directly interacts with the Pager; all storage operations go through the ColumnStore.
Sources: llkv-table/src/table.rs:64-69 llkv-table/src/table.rs:647-649
Usage Examples
Creating and Populating a Table
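A compact walkthrough of the documented flow, written against stand-in stubs so it compiles on its own; the method names mirror the prose on this page (create_table_from_columns, append), but the real signatures in llkv-table and llkv-runtime differ in detail:

```rust
// Stand-in stubs; real types come from llkv-table / llkv-runtime.
struct RecordBatch;       // stands in for an Arrow RecordBatch
struct Table;
struct CatalogManager;

impl CatalogManager {
    /// Mirrors the create_table_from_columns(name, columns) call described
    /// in the Table Creation section above.
    fn create_table_from_columns(&self, _name: &str, _columns: &[(&str, &str)]) -> Table {
        Table
    }
}

impl Table {
    /// Mirrors Table::append(batch): field IDs are namespaced and MVCC
    /// columns injected before the batch reaches the ColumnStore.
    fn append(&self, _batch: &RecordBatch) {}
}

fn main() {
    let catalog = CatalogManager;
    // 1. Create the table; metadata is registered in SysCatalog first.
    let table = catalog.create_table_from_columns("users", &[("id", "Int64"), ("name", "Utf8")]);
    // 2. Build an Arrow RecordBatch (elided) and append it.
    let batch = RecordBatch;
    table.append(&batch);
}
```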
Sources: llkv-table/examples/test_streaming.rs:35-60 llkv-csv/src/writer.rs:104-137
Scanning with Projections and Filters
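A stub-based sketch mirroring the scan_stream(projections, filter, options, callback) shape from the scan API diagram above; the parameter types are placeholders (the real filter is an Expr<FieldId> tree, and the options type is richer):

```rust
// Stand-in stubs so the walkthrough compiles on its own.
struct RecordBatch {
    num_rows: usize,
}
#[derive(Default)]
struct ScanStreamOptions; // ordering, null handling, ... elided
struct Table;

impl Table {
    fn scan_stream<F: FnMut(&RecordBatch)>(
        &self,
        _projections: &[u32],   // user field IDs to project
        _filter: &str,          // placeholder for an Expr<FieldId> tree
        _options: ScanStreamOptions,
        mut on_batch: F,
    ) {
        // The real implementation streams batches of up to 4096 rows;
        // here a single dummy batch is emitted so the example runs.
        on_batch(&RecordBatch { num_rows: 3 });
    }
}

fn main() {
    let table = Table;
    let mut total_rows = 0;
    table.scan_stream(&[10], "age > 30", ScanStreamOptions::default(), |batch| {
        total_rows += batch.num_rows;
    });
    assert_eq!(total_rows, 3);
}
```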
Sources: llkv-table/examples/test_streaming.rs:64-110 llkv-table/examples/performance_benchmark.rs:128-151
Performance Characteristics
The Table layer adds minimal overhead to ColumnStore operations:
| Operation | Overhead | Source |
|---|---|---|
| Append (cached MVCC) | Field ID conversion + metadata map construction | ~1-2% per field |
| Append (uncached MVCC) | Additional schema scan | ~5% first append only |
| Single-column scan | Filter compilation + streaming setup | ~1.1x vs direct ColumnStore |
| Multi-column scan | RecordBatch construction per batch | ~1.5-2x vs direct ColumnStore |
| Schema introspection | Catalog lookups + field sorting | O(fields) |
The design prioritizes zero-copy operations and streaming to minimize memory overhead.
Sources: llkv-table/examples/direct_comparison.rs:277-399 llkv-table/examples/performance_benchmark.rs:82-211
Column Storage and ColumnStore
Relevant source files
- llkv-column-map/Cargo.toml
- llkv-column-map/src/gather.rs
- llkv-column-map/src/store/core.rs
- llkv-column-map/src/store/projection.rs
- llkv-column-map/src/store/scan/filter.rs
- llkv-sql/tests/pager_io_tests.rs
This page documents the columnar storage layer implemented in the llkv-column-map crate. The ColumnStore provides Arrow-native persistence of columnar data over a key-value pager, enabling efficient predicate evaluation, multi-column projections, and Last-Write-Wins (LWW) semantics for updates.
For information about the table-level abstraction built on top of ColumnStore, see Table Abstraction. For details about the underlying pager interface, see Pager Interface and SIMD Optimization.
ColumnStore Architecture
The ColumnStore<P> struct (llkv-column-map/src/store/core.rs:60-68) serves as the primary interface for columnar data operations. It is parameterized by a Pager implementation and manages:
- Column catalog : Maps LogicalFieldId to physical storage keys
- Descriptors : Linked lists of chunk metadata for each column
- Data type cache : Avoids repeated descriptor loads for schema queries
- Index manager : Maintains presence and value indexes
- Context pool : Reusable gather contexts to amortize chunk fetch costs
Purpose : This diagram shows the internal components of ColumnStore and their relationships. The catalog is the single source of truth for column-to-physical-key mappings, while auxiliary structures provide caching and optimization.
Sources : llkv-column-map/src/store/core.rs:60-68 llkv-column-map/src/store/core.rs:100-124
Storage Organization
Namespaces
Columns are identified by LogicalFieldId, which combines a namespace, table ID, and field ID to prevent collisions between different data categories (llkv-column-map/src/store/core.rs:38-46):
| Namespace | Purpose | Example Usage |
|---|---|---|
UserData | Regular table columns | User-visible columns from CREATE TABLE |
RowIdShadow | Internal row ID tracking | Shadow column storing RowId for each value |
TxnCreatedBy | MVCC creation timestamps | Transaction ID that created each row |
TxnDeletedBy | MVCC deletion timestamps | Transaction ID that deleted each row |
Each column in the UserData namespace has a corresponding RowIdShadow column that stores the row IDs for non-null values. This pairing enables efficient row-based gathering and filtering.
Sources : llkv-column-map/src/store/core.rs:38-46
Physical Storage Structure
Data is organized as a hierarchy: ColumnCatalog → ColumnDescriptor → DescriptorPages → ChunkMetadata → Data Chunks.
Purpose : This diagram shows how data is physically organized in storage. The catalog is the entry point, descriptors form a linked list of metadata pages, and each metadata entry points to a serialized Arrow array.
graph TB
CATALOG["ColumnCatalog\nPhysicalKey: CATALOG_ROOT_PKEY"]
subgraph "Per Column"
DESC["ColumnDescriptor\n(head_page_pk, tail_page_pk,\ntotal_row_count)"]
PAGE1["DescriptorPage\n(next_page_pk, entry_count)"]
PAGE2["DescriptorPage\n(next_page_pk, entry_count)"]
end
subgraph "Chunk Metadata"
META1["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META2["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META3["ChunkMetadata"]
end
subgraph "Arrow Data"
CHUNK1["Serialized Arrow Array\n(values)"]
CHUNK2["Serialized Arrow Array\n(values)"]
CHUNK3["Serialized Arrow Array\n(row IDs)"]
end
CATALOG -->|maps LogicalFieldId| DESC
DESC -->|head_page_pk| PAGE1
PAGE1 -->|next_page_pk| PAGE2
PAGE1 --> META1
PAGE1 --> META2
PAGE2 --> META3
META1 -->|chunk_pk| CHUNK1
META2 -->|chunk_pk| CHUNK2
META3 -->|chunk_pk| CHUNK3
Key structures :
- ColumnCatalog (llkv-column-map/src/store/catalog.rs): Root mapping stored at CATALOG_ROOT_PKEY
- ColumnDescriptor : Per-column header with total row count and pointers to metadata pages
- DescriptorPage : Linked list node containing multiple ChunkMetadata entries
- ChunkMetadata (llkv-column-map/src/store/descriptor.rs): Min/max values, row count, serialized size, chunk physical key
Core Operations
Append with Last-Write-Wins
The append method (llkv-column-map/src/store/core.rs:787-1633) ingests a RecordBatch with LWW semantics:
- Preprocessing : Sort by
rowidif not already sorted - LWW Rewrite : For each column, identify existing row IDs and overwrite them in-place
- Filter for Append : Remove rewritten rows and nulls from the batch
- Chunk and Persist : Split data into target-sized chunks, serialize as Arrow arrays, persist with metadata
Purpose : This diagram shows the append flow with LWW semantics. Each column is processed independently, rewrites happen first, then new data is chunked and persisted atomically.
sequenceDiagram
participant Caller
participant ColumnStore
participant LWW as "LWW Rewrite"
participant Chunker
participant Pager
Caller->>ColumnStore: append(batch)
ColumnStore->>ColumnStore: Sort by rowid if needed
loop For each column
ColumnStore->>LWW: lww_rewrite_for_field(field_id, row_ids, values)
LWW->>Pager: Load existing chunks
LWW->>LWW: Identify overlapping row IDs
LWW->>Pager: Overwrite matching rows
LWW-->>ColumnStore: rewritten_ids
ColumnStore->>ColumnStore: Filter out rewritten_ids and nulls
ColumnStore->>Chunker: Split to target chunk sizes
loop For each chunk
Chunker->>Pager: alloc() for chunk_pk and rid_pk
Chunker->>Chunker: Serialize Arrow array
Chunker->>Chunker: Compute min/max/size metadata
end
end
ColumnStore->>Pager: batch_put(all_puts)
ColumnStore->>ColumnStore: Increment append_epoch
ColumnStore-->>Caller: Success
The append_epoch counter (llkv-column-map/src/store/core.rs:132-134) is incremented after every append, providing a cache invalidation signal for gather contexts.
Sources : llkv-column-map/src/store/core.rs:787-1633 llkv-column-map/src/store/core.rs:893-942
Multi-Column Gathering
The gather_rows method (llkv-column-map/src/store/projection.rs:758-785) assembles a RecordBatch from multiple columns given a list of row IDs. It uses a two-phase strategy :
Phase 1: Planning
- Load ColumnDescriptor for each field
- Collect ChunkMetadata for value and row ID chunks
- Build FieldPlan with candidate chunk indices
Phase 2: Execution
- Identify candidate chunks that intersect requested row IDs (chunk pruning via min/max)
- Batch-fetch missing chunks from pager
- For each column, build Arrow array by gathering values from chunks
- Assemble RecordBatch from gathered columns
Purpose : This diagram shows the gather pipeline. Planning loads metadata, execution prunes chunks and assembles the result. The chunk cache avoids redundant pager reads across multiple gather calls.
Context Pooling : The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains a pool of MultiGatherContext objects keyed by field IDs. Each context caches:
- Chunk arrays by physical key
- Row ID index for sparse lookups
- Scratch buffers for row location tracking
- Arrow builders for each column type
By reusing contexts, repeated gather calls avoid allocating temporary structures and can benefit from cached chunks.
Sources : llkv-column-map/src/store/projection.rs:758-927 llkv-column-map/src/store/projection.rs:651-720
Filter Operations
The filter_row_ids method (llkv-column-map/src/store/core.rs:356-372) evaluates a predicate against a column and returns matching row IDs. It uses:
- Type-specific dispatch : FilterDispatch trait routes to specialized implementations (primitive types, strings, booleans)
- Chunk pruning : Skip chunks where min_val and max_val don't overlap the predicate
- Vectorized evaluation : For string predicates like CONTAINS, use Arrow's vectorized kernels (llkv-column-map/src/store/scan/filter.rs:337-503)
- Dense encoding : Return results as FilterResult with run-length encoding when possible
For example, the fused string filter (llkv-column-map/src/store/scan/filter.rs:337-503) evaluates multiple CONTAINS predicates in a single pass:
- Load value and row ID chunks in parallel (via Rayon)
- Use Arrow's contains_utf8_scalar kernel for each pattern
- AND results using a bitmask to avoid per-row branching
- Extract matching row IDs from the final bitmask
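The self-contained sketch below mirrors that flow with plain Rust standing in for Arrow's vectorized kernels and Rayon's parallel chunk loading: each pattern is evaluated over the chunk, the per-pattern results are AND-ed into one mask, and matching row IDs are extracted at the end.

```rust
// Evaluate several CONTAINS predicates over one chunk in a fused pass.
fn fused_contains(values: &[&str], row_ids: &[u64], patterns: &[&str]) -> Vec<u64> {
    // Start with every row set; each pattern can only clear bits.
    let mut mask: Vec<bool> = vec![true; values.len()];
    for pattern in patterns {
        for (slot, value) in values.iter().enumerate() {
            // In llkv this inner loop is Arrow's vectorized string kernel.
            mask[slot] &= value.contains(pattern);
        }
    }
    // Extract matching row IDs from the final mask.
    row_ids
        .iter()
        .zip(mask)
        .filter_map(|(rid, keep)| keep.then_some(*rid))
        .collect()
}

fn main() {
    let values = ["hello world", "hello rust", "goodbye"];
    let row_ids = [10, 11, 12];
    assert_eq!(fused_contains(&values, &row_ids, &["hello", "rust"]), vec![11]);
}
```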
Sources : llkv-column-map/src/store/core.rs:356-372 llkv-column-map/src/store/scan/filter.rs:209-503
Optimization Techniques
Chunk Pruning via Min/Max Metadata
Each ChunkMetadata stores min_val_u64 and max_val_u64 (llkv-column-map/src/store/descriptor.rs). During filtering or gathering, chunks are skipped if:
- For predicates: The predicate range doesn't overlap [min_val, max_val]
- For point lookups: The requested row ID is outside [min_val, max_val]
This optimization is particularly effective for:
- Range predicates on sorted or clustered columns
- Temporal queries on timestamp columns (common in OLAP)
- Primary key lookups
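A self-contained sketch of the pruning test is shown below. Field names follow the ChunkMetadata description above, but the struct is a simplified stand-in; the real code also handles types other than u64.

```rust
// Simplified stand-in for the per-chunk statistics kept in ChunkMetadata.
struct ChunkMetadata {
    min_val_u64: u64,
    max_val_u64: u64,
}

/// Returns true when the chunk may contain values in [lo, hi] and must be read;
/// false means the chunk can be skipped without touching the pager.
fn chunk_may_match(meta: &ChunkMetadata, lo: u64, hi: u64) -> bool {
    meta.max_val_u64 >= lo && meta.min_val_u64 <= hi
}

fn main() {
    let chunk = ChunkMetadata { min_val_u64: 100, max_val_u64: 199 };
    assert!(chunk_may_match(&chunk, 150, 300));  // ranges overlap: read the chunk
    assert!(!chunk_may_match(&chunk, 200, 300)); // disjoint: prune the chunk
}
```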
Sources : llkv-column-map/src/store/core.rs:679-690 llkv-column-map/src/store/projection.rs:1019-1026
Context Pooling
The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains up to 4 contexts per unique field ID set. This amortizes:
- Chunk cache allocations (reuse FxHashMap<PhysicalKey, ArrayRef>)
- Arrow builder allocations (reuse builders with reserved capacity)
- Scratch buffer allocations (reuse Vec for row indices)
Contexts track an epoch (llkv-column-map/src/store/projection.rs:520-526) that is compared against ColumnStore.append_epoch. If epochs mismatch, the context rebuilds its metadata from the latest descriptors.
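A toy sketch of that epoch check follows; the field names are illustrative and do not mirror the real MultiGatherContext layout.

```rust
// A pooled gather context remembers the store epoch it was built against and
// rebuilds its cached metadata when the store's append_epoch has moved on.
struct GatherContext {
    epoch: u64,
    cached_chunks: Vec<u64>, // stand-in for FxHashMap<PhysicalKey, ArrayRef>
}

impl GatherContext {
    fn ensure_fresh(&mut self, store_epoch: u64) {
        if self.epoch != store_epoch {
            // Descriptors may have changed since this context was pooled:
            // drop cached chunk state and re-resolve it on the next gather.
            self.cached_chunks.clear();
            self.epoch = store_epoch;
        }
    }
}

fn main() {
    let mut ctx = GatherContext { epoch: 7, cached_chunks: vec![42, 43] };
    ctx.ensure_fresh(7); // epochs match: cache kept
    assert_eq!(ctx.cached_chunks.len(), 2);
    ctx.ensure_fresh(8); // an append happened: cache invalidated
    assert!(ctx.cached_chunks.is_empty());
}
```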
Sources : llkv-column-map/src/store/projection.rs:651-720 llkv-column-map/src/store/projection.rs:460-526
Data Type Caching
The DTypeCache (llkv-column-map/src/store/core.rs:64) stores Arrow DataType for each LogicalFieldId to avoid repeated descriptor loads during schema queries. It uses a fingerprint stored in the descriptor to detect type changes (e.g., from ALTER TABLE operations).
Sources : llkv-column-map/src/store/core.rs:175-180 llkv-column-map/src/store/core.rs:190-227
Indexes
The IndexManager (llkv-column-map/src/store/core.rs:65) supports two index types:
| Index Type | Purpose | Storage |
|---|---|---|
Presence | Tracks which row IDs exist | Bitmap or sorted permutation array |
Value | Enables sorted iteration by value | Permutation array mapping value order to physical order |
Indexes are created via register_index (llkv-column-map/src/store/core.rs:145-147) and maintained automatically during append operations. The has_row_id method (llkv-column-map/src/store/core.rs:663-754) uses presence indexes for fast lookups via binary search.
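A minimal model of the presence-index lookup, assuming the index is kept as a sorted array of row IDs (the real index may instead be a bitmap, as noted in the table above):

```rust
// With row IDs kept as a sorted array, has_row_id reduces to a binary search
// instead of scanning chunk data.
fn has_row_id(sorted_row_ids: &[u64], target: u64) -> bool {
    sorted_row_ids.binary_search(&target).is_ok()
}

fn main() {
    let presence = [3_u64, 17, 42, 99];
    assert!(has_row_id(&presence, 42));
    assert!(!has_row_id(&presence, 50));
}
```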
Sources : llkv-column-map/src/store/core.rs:145-165 llkv-column-map/src/store/core.rs:663-754
Data Flow: Append Operation
Purpose : This flowchart shows the detailed append logic. Each column is processed independently (LWW rewrite, filter, chunk), then all writes are flushed atomically.
Sources : llkv-column-map/src/store/core.rs:787-1633
Data Flow: Gather Operation
Purpose : This flowchart shows the gather operation with context reuse. The epoch check ensures cache validity, and the dense pattern optimization avoids hash lookups for contiguous row IDs.
Sources : llkv-column-map/src/store/projection.rs:929-1195
Configuration and Tuning
The ColumnStoreConfig (llkv-column-map/src/store/core.rs:63) provides tuning parameters:
- Target chunk size : Defaults guide chunk splitting during append
- Write hints : Returned via write_hints() (llkv-column-map/src/store/core.rs:126-129) to inform upstream batch sizing
The GatherContextPool (llkv-column-map/src/store/projection.rs:658-663) maintains up to max_per_key contexts per field set (currently 4). Increasing this value trades memory for reduced context allocation frequency in highly concurrent workloads.
Sources : llkv-column-map/src/store/core.rs:126-129 llkv-column-map/src/store/projection.rs:658-663
Thread Safety
ColumnStore is Send + Sync (llkv-column-map/src/store/core.rs:49). Internal state uses:
- Arc<RwLock<ColumnCatalog>> : Catalog allows concurrent reads, exclusive writes
- Arc<Pager> : Pager implementations must be Send + Sync
- Arc<AtomicU64> : Epoch counter supports lock-free reads
This enables concurrent query execution across multiple threads while serializing metadata updates (e.g., append, remove_column).
Sources : llkv-column-map/src/store/core.rs:49-85
Pager Interface and SIMD Optimization
Purpose and Scope
This document describes the storage abstraction layer that provides persistent key-value storage for the columnar data model. The Pager trait defines a generic interface for batch-oriented storage operations, while the SIMD R-Drive library provides a memory-mapped, SIMD-optimized concrete implementation.
For column-level storage operations built on top of the pager, see Column Storage and ColumnStore. For table-level abstractions, see Table Abstraction.
System Architecture
The storage system follows a layered design where the Pager provides a simple key-value abstraction that the ColumnStore builds upon. All persistence operations flow through batch APIs to minimize I/O overhead and enable atomic commits.
graph TB
subgraph "Column Storage Layer"
ColumnStore["ColumnStore<P: Pager>"]
ColumnCatalog["ColumnCatalog"]
ColumnDescriptor["ColumnDescriptor"]
ChunkMetadata["ChunkMetadata"]
end
subgraph "Pager Abstraction Layer"
PagerTrait["Pager Trait"]
BatchGet["BatchGet"]
BatchPut["BatchPut"]
GetResult["GetResult"]
PhysicalKey["PhysicalKey (u64)"]
end
subgraph "SIMD R-Drive Implementation"
SimdRDrive["simd-r-drive 0.15.5"]
EntryHandle["EntryHandle"]
MemoryMap["Memory-Mapped Files"]
SimdOps["SIMD Operations"]
end
ColumnStore --> PagerTrait
ColumnStore --> BatchGet
ColumnStore --> BatchPut
ColumnCatalog --> PagerTrait
ColumnDescriptor --> PagerTrait
ChunkMetadata --> PhysicalKey
PagerTrait --> SimdRDrive
BatchGet --> PhysicalKey
BatchPut --> PhysicalKey
GetResult --> EntryHandle
SimdRDrive --> EntryHandle
SimdRDrive --> MemoryMap
SimdRDrive --> SimdOps
EntryHandle --> MemoryMap
Storage Layering
Sources: llkv-column-map/src/store/core.rs:1-89 Cargo.toml:85-86
The Pager Trait
The Pager trait defines the storage interface used throughout the system. It abstracts over key-value stores with batch-oriented operations and explicit memory management.
Core Operations
| Operation | Purpose | Batch Size |
|---|---|---|
batch_get(&[BatchGet]) | Retrieve multiple keys atomically | 1-N keys |
batch_put(&[BatchPut]) | Write multiple keys atomically | 1-N keys |
alloc() | Allocate a single physical key | 1 key |
alloc_many(&[usize]) | Allocate multiple physical keys | N keys |
free(PhysicalKey) | Deallocate a single key | 1 key |
free_many(&[PhysicalKey]) | Deallocate multiple keys | N keys |
Type Parameters
The Pager trait is generic over the blob type returned from reads:
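The following is a hedged reconstruction of the trait's shape, based on the operation table above; exact method signatures, generics, and the error type in llkv-storage may differ (std::io::Result and the Missing variant are stand-ins).

```rust
pub type PhysicalKey = u64;

pub enum BatchGet {
    Raw { key: PhysicalKey },
}

pub enum GetResult<B> {
    Raw { key: PhysicalKey, bytes: B },
    // Variant name assumed; the docs only imply a "not found" status.
    Missing { key: PhysicalKey },
}

pub enum BatchPut {
    Raw { key: PhysicalKey, bytes: Vec<u8> },
}

pub trait Pager {
    /// Blob type returned from reads (EntryHandle for SIMD R-Drive).
    type Blob: AsRef<[u8]>;

    fn batch_get(&self, gets: &[BatchGet]) -> std::io::Result<Vec<GetResult<Self::Blob>>>;
    fn batch_put(&self, puts: &[BatchPut]) -> std::io::Result<()>;
    fn alloc(&self) -> std::io::Result<PhysicalKey>;
    fn alloc_many(&self, n: usize) -> std::io::Result<Vec<PhysicalKey>>;
    fn free(&self, key: PhysicalKey) -> std::io::Result<()>;
    fn free_many(&self, keys: &[PhysicalKey]) -> std::io::Result<()>;
}
```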
The SIMD R-Drive implementation uses EntryHandle as the blob type, which provides zero-copy access to memory-mapped regions without allocating new buffers.
Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/src/store/core.rs:89-124
Physical Keys
Physical keys are 64-bit unsigned integers (u64) that uniquely identify storage locations. The system partitions the key space for different purposes:
- Catalog Root : Key
0stores theColumnCatalog - Column Descriptors : User-allocated keys for column metadata
- Data Chunks : User-allocated keys for serialized Arrow arrays
- Index Structures : User-allocated keys for permutations and bitmaps
Sources: llkv-storage/src/constants.rs (implied), llkv-column-map/src/store/core.rs:100-124
sequenceDiagram
participant CS as ColumnStore
participant Pager as Pager
participant SIMD as SIMD R-Drive
participant MM as Memory Map
CS->>Pager: batch_get([BatchGet::Raw{key: 42}, ...])
Pager->>SIMD: batch read request
SIMD->>MM: locate pages for keys
MM->>SIMD: zero-copy EntryHandles
SIMD->>Pager: Vec<GetResult>
Pager->>CS: Results with EntryHandles
CS->>CS: deserialize Arrow arrays
Batch Operations
All storage I/O uses batch operations to minimize system call overhead and enable atomic multi-key transactions.
BatchGet Flow
Sources: llkv-column-map/src/store/core.rs:100-110 llkv-column-map/src/store/core.rs:414-425
The BatchGet enum specifies requests:
| Variant | Purpose |
|---|---|
Raw { key: PhysicalKey } | Retrieve raw bytes for a key |
The GetResult enum returns outcomes:
| Variant | Meaning |
|---|---|
Raw { key: PhysicalKey, bytes: Blob } | Successfully retrieved bytes |
| (others implied by usage) | Key not found, or other statuses |
BatchPut Flow
Sources: llkv-column-map/src/store/core.rs:217-222 llkv-column-map/src/store/core.rs:1117-1125
The BatchPut enum specifies writes:
| Variant | Purpose |
|---|---|
Raw { key: PhysicalKey, bytes: Vec<u8> } | Write raw bytes to a key |
Why Batching Matters
Batching provides three critical benefits:
- Atomicity : All keys in a batch are committed together or not at all
- Performance : Amortizes system call overhead across many keys
- Prefetching : Enables SIMD R-Drive to optimize I/O patterns
The ColumnStore::append method demonstrates this pattern—it stages hundreds of puts (data chunks, descriptors, catalog updates) and commits them atomically at the end:
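The toy example below shows the stage-then-commit shape of that pattern. ToyPager and its HashMap store are illustrative only; the production batch_put is backed by simd-r-drive.

```rust
use std::collections::HashMap;

type PhysicalKey = u64;

struct ToyPager {
    store: HashMap<PhysicalKey, Vec<u8>>,
}

impl ToyPager {
    fn batch_put(&mut self, puts: Vec<(PhysicalKey, Vec<u8>)>) {
        // Apply every staged write together; callers never observe a partial batch.
        for (key, bytes) in puts {
            self.store.insert(key, bytes);
        }
    }
}

fn main() {
    let mut pager = ToyPager { store: HashMap::new() };

    // Stage phase: chunks, descriptors, and the catalog update are only buffered.
    let mut staged: Vec<(PhysicalKey, Vec<u8>)> = Vec::new();
    staged.push((42, b"serialized arrow chunk".to_vec()));
    staged.push((43, b"serialized row-id chunk".to_vec()));
    staged.push((0, b"updated catalog".to_vec()));

    // Commit phase: a single call makes all keys visible at once.
    pager.batch_put(staged);
    assert_eq!(pager.store.len(), 3);
}
```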
Sources: llkv-column-map/src/store/core.rs:1117-1125
Physical Key Allocation
The pager manages a free-list of physical keys. Allocations are persistent—once allocated, a key remains valid until explicitly freed.
graph TD
subgraph "Allocation Operations"
A1["alloc() → PhysicalKey"]
A2["alloc_many(n) → Vec<PhysicalKey>"]
end
subgraph "Deallocation Operations"
F1["free(key)"]
F2["free_many(&[key])"]
end
subgraph "Usage Examples"
Desc["Column Descriptor\n1 key per column"]
Chunks["Data Chunks\nN keys per append"]
Indexes["Indexes\n1 key per index"]
end
A1 --> Desc
A2 --> Chunks
A1 --> Indexes
Desc -.-> F1
Chunks -.-> F2
Allocation Patterns
Sources: llkv-column-map/src/store/core.rs:250-263 llkv-column-map/src/store/core.rs:989-1006
Allocation Examples
The ColumnStore::append method allocates chunk keys in bulk via alloc_many, while ColumnStore::remove_column frees every key associated with a column via free_many; a simplified sketch of both patterns follows.
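This is a minimal sketch of those two call patterns, assuming a pared-down Pager-like trait; the real signatures in llkv-storage may differ.

```rust
type PhysicalKey = u64;

// Pared-down stand-in exposing only the allocation operations discussed here.
trait Pager {
    fn alloc_many(&mut self, n: usize) -> Vec<PhysicalKey>;
    fn free_many(&mut self, keys: &[PhysicalKey]);
}

/// Append path: grab one key per (data chunk, row-id chunk) pair up front,
/// so chunking never round-trips to the pager per chunk.
fn alloc_chunk_keys<P: Pager>(pager: &mut P, chunk_count: usize) -> Vec<(PhysicalKey, PhysicalKey)> {
    let keys = pager.alloc_many(chunk_count * 2);
    keys.chunks(2).map(|pair| (pair[0], pair[1])).collect()
}

/// remove_column path: return every key owned by the column in one call.
fn free_column<P: Pager>(pager: &mut P, descriptor_keys: &[PhysicalKey], chunk_keys: &[PhysicalKey]) {
    let mut all: Vec<PhysicalKey> = Vec::with_capacity(descriptor_keys.len() + chunk_keys.len());
    all.extend_from_slice(descriptor_keys);
    all.extend_from_slice(chunk_keys);
    pager.free_many(&all);
}
```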
Sources: llkv-column-map/src/store/core.rs:990-1005 llkv-column-map/src/store/core.rs:563-587
graph TB
subgraph "User Space"
Pager["Pager API"]
EntryHandle["EntryHandle\n(zero-copy view)"]
end
subgraph "SIMD R-Drive"
PageTable["Page Table\n(key → page mapping)"]
FreeList["Free List\n(available keys)"]
Allocator["Allocator\n(SIMD-optimized)"]
end
subgraph "Operating System"
MMap["mmap()\nsystem call"]
FileBackend["Backing File"]
end
Pager --> EntryHandle
Pager --> PageTable
Pager --> FreeList
PageTable --> Allocator
FreeList --> Allocator
Allocator --> MMap
MMap --> FileBackend
EntryHandle -.references.-> MMap
SIMD R-Drive Implementation
The simd-r-drive crate (version 0.15.5) provides a memory-mapped key-value store optimized with SIMD instructions.
Memory-Mapped Architecture
Sources: Cargo.toml:85-86 llkv-column-map/src/store/core.rs:22-25
EntryHandle: Zero-Copy Blobs
The EntryHandle type implements AsRef<[u8]> and provides direct access to memory-mapped regions without copying data, so Arrow array deserialization reads directly from the mapped memory, as in the sketch below.
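A minimal sketch of that zero-copy access pattern, using a plain Vec<u8> in place of a real EntryHandle:

```rust
// Readers borrow the memory-mapped bytes rather than copying them into a fresh
// buffer before Arrow IPC deserialization.
fn chunk_bytes<H: AsRef<[u8]>>(handle: &H) -> &[u8] {
    // A borrowed view into the mapped region; no allocation or memcpy happens here.
    handle.as_ref()
}

fn main() {
    // Any AsRef<[u8]> works for the sketch; the real blob type is EntryHandle.
    let fake_handle: Vec<u8> = vec![1, 2, 3, 4];
    let view = chunk_bytes(&fake_handle);
    assert_eq!(view.len(), 4);
}
```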
Sources: llkv-column-map/src/store/core.rs:60-89 llkv-column-map/src/store/core.rs:196-210
SIMD Optimization Benefits
The SIMD R-Drive uses vectorized instructions for:
- Page Scanning : Parallel search for keys in page tables
- Free List Management : Vectorized bitmap operations for allocation
- Data Movement : SIMD memcpy for large blobs
While the Rust code in this repository doesn’t directly show SIMD instructions, the external simd-r-drive dependency provides these optimizations transparently through the Pager trait.
Sources: Cargo.toml85 Cargo.lock:671-678
Integration with ColumnStore
The ColumnStore uses the pager for all persistent state, following consistent patterns across operations.
graph LR
Catalog["ColumnCatalog\n(at key 0)"]
Catalog -->|LogicalFieldId → pk_data| DataDesc["Data Descriptor\n(at pk_data)"]
Catalog -->|LogicalFieldId → pk_rid| RidDesc["RowID Descriptor\n(at pk_rid)"]
DataDesc -->|linked list| Page1["Descriptor Page\nChunkMetadata[]"]
Page1 -->|next_page_pk| Page2["Descriptor Page\nChunkMetadata[]"]
RidDesc -->|linked list| Page3["Descriptor Page\nChunkMetadata[]"]
Storage Pattern: Column Descriptors
Each column has two descriptors (one for data, one for row IDs) stored at fixed physical keys tracked in the catalog, as shown in the diagram above.
Sources: llkv-column-map/src/store/core.rs:100-124 llkv-column-map/src/store/core.rs:234-344
graph TB
ChunkMeta["ChunkMetadata"]
ChunkMeta -->|chunk_pk| DataBlob["Serialized Arrow Array\n(at chunk_pk)"]
ChunkMeta -->|value_order_perm_pk| PermBlob["Sort Permutation\n(at perm_pk)"]
DataBlob -.->|deserialized to| ArrayRef["ArrayRef\n(in-memory)"]
Storage Pattern: Data Chunks
Each chunk of Arrow array data is stored at a physical key referenced by ChunkMetadata, as shown in the diagram above.
Sources: llkv-column-map/src/store/core.rs:988-1006 llkv-column-map/src/store/descriptor.rs (implied)
sequenceDiagram
participant App as Application
participant CS as ColumnStore
participant Pager as Pager
App->>CS: append(RecordBatch)
Note over CS: Stage Phase
CS->>CS: serialize data chunks
CS->>CS: serialize row-id chunks
CS->>CS: update descriptors
CS->>CS: update catalog
CS->>CS: accumulate BatchPut[]
Note over CS: Commit Phase
CS->>Pager: batch_put(all_puts)
Pager->>Pager: atomic commit
alt Success
Pager-->>CS: Ok(())
CS->>CS: increment append_epoch
CS-->>App: Ok(())
else Failure
Pager-->>CS: Err(...)
CS-->>App: Err(...)
Note over CS: No changes visible
end
Transaction Boundary Example
The append operation demonstrates how the pager enables atomic multi-key transactions:
Sources: llkv-column-map/src/store/core.rs:787-1126
The key insight: all puts are staged in memory (Vec<BatchPut>) until a single atomic commit at line 1121. This ensures that partial failures leave the store in a consistent state.
Performance Characteristics
Batch Size Recommendations
The ColumnStore follows these heuristics:
| Operation | Typical Batch Size | Justification |
|---|---|---|
| Catalog Load | 1 key | Single root key at startup |
| Descriptor Load | 2 keys | Data + RowID descriptors |
| Chunk Append | 10-100 keys | Multiple columns × chunks per column |
| Chunk Rewrite | 1-10 keys | Targeted LWW updates |
| Full Scan | 100+ keys | Prefetch all chunks for a column |
Sources: llkv-column-map/src/store/core.rs:100-110 llkv-column-map/src/store/core.rs:1117-1125
Memory Mapping Benefits
Memory-mapped storage provides:
- Zero-Copy Deserialization : Arrow arrays reference mapped memory directly
- OS-Level Caching : The kernel manages page cache automatically
- Lazy Loading : Pages are only faulted in when accessed
The EntryHandle type enables these patterns by avoiding buffer copies during reads.
Sources: llkv-column-map/src/store/core.rs:60-85 llkv-column-map/src/store/scan/filter.rs18
Allocation Strategies
The system uses bulk allocation (alloc_many) to reduce pager round-trips.
Sources: llkv-column-map/src/store/core.rs:989-1006
Error Handling and Recovery
Allocation Failures
When allocation fails mid-operation, the system attempts to roll back rather than leave partially written column state behind.
Sources: llkv-column-map/src/store/core.rs:563-587
Batch Operation Atomicity
The pager guarantees that either all keys in a batch_put are written or none are. This prevents partial writes from corrupting the storage state. The ColumnStore relies on this guarantee to implement transactional append and update operations.
Sources: llkv-column-map/src/store/core.rs:1117-1125
Summary
The Pager interface provides a minimal, batch-oriented abstraction over persistent key-value storage. The SIMD R-Drive implementation uses memory-mapped files and vectorized operations to optimize common patterns like bulk reads and zero-copy deserialization. By staging all writes and committing atomically, the ColumnStore builds transactional semantics on top of the simple key-value model.
Key Takeaways:
- All I/O flows through batch operations (batch_get, batch_put)
- Physical keys (u64) uniquely identify all persistent objects
- EntryHandle enables zero-copy access to memory-mapped data
- Bulk allocation (alloc_many) minimizes pager round-trips
- Atomic batch commits enable transactional storage operations
Sources: llkv-column-map/src/store/core.rs:60-1126 Cargo.toml:85-86 llkv-storage (trait definitions implied by usage)
Catalog and Metadata Management
Relevant source files
- llkv-csv/src/writer.rs
- llkv-table/examples/direct_comparison.rs
- llkv-table/examples/performance_benchmark.rs
- llkv-table/examples/test_streaming.rs
- llkv-table/src/lib.rs
- llkv-table/src/table.rs
Purpose and Scope
This document explains the catalog and metadata infrastructure that tracks all tables, columns, indexes, constraints, and custom types in the LLKV system. The catalog provides schema information and manages the lifecycle of database objects.
For details on the table creation and management API, see CatalogManager API. For implementation details of how metadata is stored, see System Catalog and SysCatalog. For custom type definitions and aliases, see Custom Types and Type Registry.
Overview
The LLKV catalog is a self-describing system : all metadata about tables, columns, indexes, and constraints is stored as structured records in Table 0 , which is itself a table managed by the same ColumnStore that manages user data. This bootstrapped design means the catalog uses the same storage primitives as regular tables.
The catalog provides:
- Schema tracking : Table and column metadata with names and types
- Index registry : Persisted sort indexes and multi-column indexes
- Constraint metadata : Primary keys, foreign keys, unique constraints, and check constraints
- Trigger definitions : Event triggers with timing and execution metadata
- Custom type registry : User-defined type aliases
Self-Describing Architecture
Sources: llkv-table/src/lib.rs:1-98 llkv-table/src/table.rs:499-511
Metadata Types
The catalog stores several types of metadata records, each representing a different aspect of database schema and configuration.
Core Metadata Records
| Metadata Type | Description | Key Fields |
|---|---|---|
TableMeta | Table definitions | table_id, name, schema |
ColMeta | Column definitions | col_id, name, flags, default |
SingleColumnIndexEntryMeta | Single-column index registry | field_id, index_kind |
MultiColumnIndexEntryMeta | Multi-column index registry | field_ids, index_kind |
TriggerEntryMeta | Trigger definitions | trigger_id, timing, event |
CustomTypeMeta | User-defined types | type_name, base_type |
Constraint Metadata
Constraints are stored as specialized metadata records that enforce referential integrity and data validation:
| Constraint Type | Metadata Structure |
|---|---|
| Primary Key | PrimaryKeyConstraint - columns forming the primary key |
| Foreign Key | ForeignKeyConstraint - parent/child table references and actions |
| Unique | UniqueConstraint - columns with unique constraint |
| Check | CheckConstraint - validation expression |
Sources: llkv-table/src/lib.rs:81-86 llkv-table/src/lib.rs:56-67
Table ID Ranges
Table IDs are partitioned into reserved ranges to distinguish system tables from user tables and temporary objects.
graph LR
subgraph "Table ID Space"
CATALOG["0\nSystem Catalog"]
USER["1-999\nUser Tables"]
INFOSCHEMA["1000+\nInformation Schema"]
TEMP["10000+\nTemporary Tables"]
end
CATALOG -.special case.-> CATALOG
USER -.normal tables.-> USER
INFOSCHEMA -.system views.-> INFOSCHEMA
TEMP -.session-local.-> TEMP
Reserved Table ID Constants
The is_reserved_table_id() function checks whether a table ID falls in a reserved range (most notably Table 0, the system catalog). User code cannot directly instantiate Table objects for reserved IDs.
Sources: llkv-table/src/lib.rs:75-78 llkv-table/src/table.rs:110-126
Storage Architecture
Catalog as a Table
The system catalog is physically stored as Table 0 in the ColumnStore. Each metadata type (table definitions, column definitions, indexes, etc.) is stored as a column in this special table, with each row representing one metadata record.
graph TB
subgraph "Logical View"
CATALOG["System Catalog API\n(SysCatalog)"]
end
subgraph "Physical Storage"
TABLE0["Table 0"]
COLS["Columns:\n- table_meta\n- col_meta\n- index_meta\n- trigger_meta\n- constraint_meta"]
end
subgraph "ColumnStore Layer"
DESCRIPTORS["Column Descriptors"]
CHUNKS["Data Chunks\n(Serialized Arrow)"]
end
subgraph "Pager Layer"
KVSTORE["Key-Value Store"]
end
CATALOG --> TABLE0
TABLE0 --> COLS
COLS --> DESCRIPTORS
DESCRIPTORS --> CHUNKS
CHUNKS --> KVSTORE
Accessing the Catalog
The Table::catalog() method provides access to the system catalog without exposing the underlying table structure. The SysCatalog type wraps the ColumnStore and provides typed methods for reading and writing metadata.
The get_table_meta() and get_cols_meta() convenience methods delegate to the catalog.
Sources: llkv-table/src/table.rs:499-511
Metadata Operations
Table Creation
Table creation is coordinated by the CatalogManager, which handles metadata persistence, catalog registration, and storage initialization. The Table type provides factory methods that delegate to the catalog manager.
This factory pattern ensures that table creation is properly coordinated across three layers:
- MetadataManager : Assigns table IDs and tracks metadata
- TableCatalog : Maintains name-to-ID mappings
- ColumnStore : Initializes physical storage
Sources: llkv-table/src/table.rs:80-103
Defensive Metadata Persistence
When appending data, the Table::append() method defensively persists column names to the catalog if they’re missing. This ensures metadata consistency even when batches arrive with only field_id metadata and no column names.
This defensive approach handles cases like CSV import where column names are known but may not have been explicitly registered in the catalog.
Sources: llkv-table/src/table.rs:327-344
sequenceDiagram
participant Client
participant Table
participant ColumnStore
participant Catalog
Client->>Table: schema()
Table->>ColumnStore: user_field_ids_for_table(table_id)
ColumnStore-->>Table: logical_fields: [LogicalFieldId]
Table->>Catalog: get_cols_meta(field_ids)
Catalog-->>Table: metas: [ColMeta]
loop "For each field"
Table->>ColumnStore: data_type(lfid)
ColumnStore-->>Table: DataType
Table->>Table: Build Field with metadata
end
Table-->>Client: Arc<Schema>
Schema Resolution
The Table::schema() method constructs an Arrow Schema by querying the catalog for column metadata and combining it with physical data type information from the ColumnStore.
The resulting schema includes:
- row_id field (always first)
- User-defined columns with names from the catalog
- field_id stored in field metadata for each column
Sources: llkv-table/src/table.rs:519-549
Index Registration
The catalog tracks persisted sort indexes for columns, allowing efficient range scans and ordered reads.
Registering Indexes
The table API registers persisted sort indexes for individual columns (llkv-table/src/table.rs:145-173).
Listing Indexes
Registered index metadata is stored in the catalog and used by the query planner to optimize scan operations.
Sources: llkv-table/src/table.rs:145-173
Integration with CSV Import/Export
CSV import and export operations rely on the catalog to resolve column names and field IDs. The CsvWriter queries the catalog when building projections to ensure that columns are properly aliased.
This integration ensures that exported CSV files have human-readable column headers even when the underlying storage uses numeric field IDs.
Sources: llkv-csv/src/writer.rs:320-368
graph TB
subgraph "SQL Layer"
SQLENGINE["SqlEngine"]
PLANNER["Query Planner"]
end
subgraph "Catalog Layer"
CATALOG["CatalogManager"]
METADATA["MetadataManager"]
SYSCAT["SysCatalog"]
end
subgraph "Table Layer"
TABLE["Table"]
CONSTRAINTS["ConstraintService"]
end
subgraph "Storage Layer"
COLSTORE["ColumnStore"]
PAGER["Pager"]
end
SQLENGINE -->|CREATE TABLE| CATALOG
SQLENGINE -->|ALTER TABLE| CATALOG
SQLENGINE -->|DROP TABLE| CATALOG
CATALOG --> METADATA
CATALOG --> SYSCAT
CATALOG --> TABLE
TABLE --> SYSCAT
TABLE --> CONSTRAINTS
TABLE --> COLSTORE
SYSCAT --> COLSTORE
COLSTORE --> PAGER
PLANNER -.resolves schema.-> CATALOG
CONSTRAINTS -.validates.-> SYSCAT
Relationship to Other Systems
The catalog sits at the center of the LLKV architecture, connecting several subsystems:
- SQL Layer : Issues DDL commands that modify the catalog
- Query Planner : Resolves table and column names via the catalog
- Table Layer : Queries metadata during data operations
- Constraint Layer : Uses catalog to track and enforce constraints
- Storage Layer : Physically persists catalog records as Table 0
Sources: llkv-table/src/lib.rs:34-98
CatalogManager API
Purpose and Scope
This page documents the CatalogManager API, which provides the primary interface for table lifecycle management in LLKV. The CatalogManager coordinates table creation, modification, deletion, and schema management operations. It serves as the bridge between high-level DDL operations and the low-level storage of metadata in the system catalog.
For details on how metadata is physically stored, see System Catalog and SysCatalog. For information about custom type definitions, see Custom Types and Type Registry.
Overview
The CatalogManager is the central coordinator for all catalog operations in LLKV. It manages:
- Table ID allocation : Assigns unique identifiers to new tables
- Schema registration : Validates and stores table schemas with Arrow integration
- Field ID mapping : Assigns logical field IDs to columns and maintains resolution
- Index registration : Tracks single-column and multi-column index metadata
- Metadata snapshots : Provides consistent views of catalog state
- DDL coordination : Orchestrates CREATE/ALTER/DROP operations
graph TB
subgraph "High-Level Operations"
DDL["DDL Statements\n(CREATE/ALTER/DROP)"]
QUERY["Query Planning\n(Name Resolution)"]
EXEC["Query Execution\n(Schema Access)"]
end
subgraph "CatalogManager Layer"
CM["CatalogManager"]
SNAPSHOT["TableCatalogSnapshot"]
RESOLVER["FieldResolver"]
RESULT["CreateTableResult"]
end
subgraph "Metadata Structures"
TABLEMETA["TableMeta"]
COLMETA["ColMeta"]
INDEXMETA["Index Descriptors"]
SCHEMA["Arrow Schema"]
end
subgraph "Persistence Layer"
SYSCAT["SysCatalog\n(Table 0)"]
STORE["ColumnStore"]
end
DDL --> CM
QUERY --> CM
EXEC --> CM
CM --> SNAPSHOT
CM --> RESOLVER
CM --> RESULT
CM --> TABLEMETA
CM --> COLMETA
CM --> INDEXMETA
CM --> SCHEMA
TABLEMETA --> SYSCAT
COLMETA --> SYSCAT
INDEXMETA --> SYSCAT
SYSCAT --> STORE
The CatalogManager maintains an in-memory cache of catalog metadata for performance while delegating persistence to SysCatalog (table 0).
Sources: llkv-table/src/lib.rs:1-98
Core Types
CatalogManager
The CatalogManager struct is the primary API for catalog operations. While the exact implementation is in the catalog module, it is exported from the main crate interface.
Key Responsibilities:
- Maintains in-memory catalog cache
- Allocates table and field IDs
- Validates schema changes
- Coordinates with SysCatalog for persistence
- Provides snapshot isolation for metadata reads
CreateTableResult
Returned by table creation operations, this structure contains:
- The newly assigned TableId
- The created Table instance
- Initial field ID mappings
- Registration confirmation
TableCatalogSnapshot
A consistent, immutable view of catalog metadata at a specific point in time. Used to ensure that query planning and execution see a stable view of table schemas even as concurrent DDL operations occur.
Properties:
- Immutable after creation
- Contains table and column metadata
- Includes field ID mappings
- May include index registrations
FieldResolver
Provides mapping from string column names to FieldId identifiers. This is critical for translating SQL column references into the internal field ID system used by the storage layer.
Functionality:
- Resolves qualified names (e.g., table.column)
- Handles field aliases
- Supports case-insensitive lookups (depending on configuration)
- Validates field existence
Sources: llkv-table/src/lib.rs:54-55 llkv-table/src/lib.rs:79-80
Table Lifecycle Operations
CREATE TABLE
The CatalogManager handles table creation through a multi-step process:
- Validation : Checks table name uniqueness and schema validity
- ID Allocation : Assigns a new TableId from the available range
- Field ID Assignment : Maps each column to a unique FieldId
- Schema Storage : Registers Arrow schema with type information
- Metadata Persistence : Writes TableMeta and ColMeta records to the system catalog
- Table Instantiation : Creates a Table instance backed by ColumnStore
Table ID Ranges:
| Range | Purpose | Constant |
|---|---|---|
| 0 | System Catalog | CATALOG_TABLE_ID |
| 1-999 | User Tables | - |
| 1000-9999 | Information Schema | INFORMATION_SCHEMA_TABLE_ID_START |
| 10000+ | Temporary Tables | TEMPORARY_TABLE_ID_START |
The CatalogManager ensures IDs are allocated from the appropriate range based on table type.
DROP TABLE
Table deletion involves:
- Dependency Checking : Validates no foreign keys reference the table
- Metadata Removal : Deletes entries from system catalog
- Storage Cleanup : Marks column store data for cleanup (may be deferred)
- Cache Invalidation : Removes table from in-memory cache
The operation is typically transactional - either all steps succeed or the table remains.
ALTER TABLE Operations
The CatalogManager coordinates schema modifications, though validation is delegated to specialized functions. Operations include:
- ADD COLUMN : Assigns new FieldId, updates schema
- DROP COLUMN : Validates no dependencies, marks column deleted
- ALTER COLUMN TYPE : Validates type compatibility, updates metadata
- RENAME COLUMN : Updates name mappings
sequenceDiagram
participant Client
participant CM as CatalogManager
participant Validator
participant SysCat as SysCatalog
participant Store as ColumnStore
Client->>CM: create_table(name, schema)
CM->>CM: Allocate TableId
CM->>CM: Assign FieldIds
CM->>Validator: Validate schema
Validator-->>CM: OK
CM->>SysCat: Write TableMeta
CM->>SysCat: Write ColMeta records
SysCat-->>CM: Persisted
CM->>Store: Initialize ColumnStore
Store-->>CM: Table handle
CM->>CM: Update cache
CM-->>Client: CreateTableResult
The validate_alter_table_operation function (referenced in exports) performs constraint checks before modifications are committed.
Sources: llkv-table/src/lib.rs:22-27 llkv-table/src/lib.rs:76-78
Schema and Field Management
Field ID Assignment
Every column in LLKV is assigned a unique FieldId at creation time. This numeric identifier:
- Persists across schema changes : Remains stable even if column is renamed
- Enables versioning : Different table versions can reference same field ID
- Optimizes storage : Physical storage keys use field IDs, not string names
- Supports MVCC : System columns like created_by have reserved field IDs
The CatalogManager maintains a monotonic counter per table to allocate field IDs sequentially.
Arrow Schema Integration
The CatalogManager integrates tightly with Apache Arrow schemas:
- Validates Arrow DataType compatibility
- Maps Arrow fields to FieldId assignments
- Stores schema metadata in serialized form
- Reconstructs Arrow Schema from stored metadata
This allows LLKV to leverage Arrow’s type system while maintaining its own field ID system for storage efficiency.
FieldResolver API
The FieldResolver is obtained from a TableCatalogSnapshot and provides:
resolve(column_name: &str) -> Result<FieldId>
resolve_qualified(table_name: &str, column_name: &str) -> Result<FieldId>
get_field_name(field_id: FieldId) -> Option<&str>
get_field_type(field_id: FieldId) -> Option<&DataType>
This bidirectional mapping supports both query translation (name → ID) and result formatting (ID → name).
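A toy stand-in illustrating that bidirectional mapping; the real FieldResolver also handles qualified names, aliases, and configurable case sensitivity, and FieldId is assumed numeric here.

```rust
use std::collections::HashMap;

type FieldId = u32;

struct FieldResolver {
    by_name: HashMap<String, FieldId>,
    by_id: HashMap<FieldId, String>,
}

impl FieldResolver {
    /// Query translation: column name -> field ID.
    fn resolve(&self, column_name: &str) -> Option<FieldId> {
        self.by_name.get(column_name).copied()
    }

    /// Result formatting: field ID -> column name.
    fn get_field_name(&self, field_id: FieldId) -> Option<&str> {
        self.by_id.get(&field_id).map(String::as_str)
    }
}

fn main() {
    let mut by_name = HashMap::new();
    let mut by_id = HashMap::new();
    by_name.insert("email".to_string(), 3);
    by_id.insert(3, "email".to_string());
    let resolver = FieldResolver { by_name, by_id };
    assert_eq!(resolver.resolve("email"), Some(3));
    assert_eq!(resolver.get_field_name(3), Some("email"));
}
```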
Sources: llkv-table/src/lib.rs:3-21 llkv-table/src/lib.rs:54
Index Registration
Single-Column Indexes
The SingleColumnIndexDescriptor and SingleColumnIndexRegistration types manage metadata for indexes on individual columns:
SingleColumnIndexDescriptor:
- Field ID being indexed
- Index type (e.g., BTree, Hash)
- Index-specific parameters
- Creation timestamp
SingleColumnIndexRegistration:
- Links table to index descriptor
- Tracks index state (building, ready, failed)
- Stores index metadata in system catalog
The CatalogManager maintains a registry of active indexes and coordinates their creation and maintenance.
Multi-Column Indexes
For composite indexes spanning multiple columns, the MultiColumnUniqueRegistration type (referenced in exports) provides similar functionality with support for:
- Multiple field IDs in index key
- Column ordering
- Uniqueness constraints
- Compound key generation
Sources: llkv-table/src/lib.rs:55 llkv-table/src/lib.rs:73
Metadata Snapshots
Snapshot Creation
A TableCatalogSnapshot provides a consistent view of catalog state. Snapshots are created:
- On Demand : When planning a query
- Periodically : For long-running operations
- At Transaction Start : For transaction isolation
The snapshot is immutable and won’t reflect concurrent DDL changes, ensuring query planning sees a stable schema.
Snapshot Contents
A snapshot typically includes:
- All TableMeta records (table definitions)
- All ColMeta records (column definitions)
- Field ID mappings for all tables
- Index registrations (optional)
- Custom type definitions (optional)
- Constraint metadata (optional)
Cache Invalidation
When the CatalogManager modifies metadata:
- Updates system catalog (table 0)
- Increments epoch/version counter
- Invalidates stale snapshots
- Updates in-memory cache
Existing snapshots remain valid but represent a previous version. New snapshots will reflect the changes.
Sources: llkv-table/src/lib.rs:54
Integration with SysCatalog
The CatalogManager uses SysCatalog (documented in System Catalog and SysCatalog) as its persistence layer:
Write Operations
- CREATE TABLE : Writes TableMeta and ColMeta records
- ALTER TABLE : Updates existing metadata records
- DROP TABLE : Marks metadata as deleted
- Index Registration : Writes index descriptor records
Read Operations
- Snapshot Creation : Reads all metadata records
- Table Lookup : Queries TableMeta by name or ID
- Field Resolution : Retrieves ColMeta for a table
- Index Discovery : Loads index descriptors
sequenceDiagram
participant Runtime
participant CM as CatalogManager
participant Cache as "In-Memory Cache"
participant SC as SysCatalog
participant Store as ColumnStore
Runtime->>CM: Initialize
CM->>SC: Bootstrap table 0
SC->>Store: Initialize ColumnStore(0)
Store-->>SC: Ready
SC-->>CM: SysCatalog ready
CM->>SC: Read all TableMeta
SC->>Store: scan(TableMeta)
Store-->>SC: RecordBatch
SC-->>CM: Vec<TableMeta>
CM->>SC: Read all ColMeta
SC->>Store: scan(ColMeta)
Store-->>SC: RecordBatch
SC-->>CM: Vec<ColMeta>
CM->>Cache: Populate
Cache-->>CM: Loaded
CM-->>Runtime: CatalogManager ready
Bootstrapping
On system startup:
- SysCatalog initializes (creates table 0 if needed)
- CatalogManager reads all metadata from table 0
- In-memory cache is populated
- System is ready for operations
Sources: llkv-table/src/lib.rs:30-31 llkv-table/src/lib.rs:81-85
Usage Patterns
Creating a Table
// Typical usage pattern (conceptual)
let result = catalog_manager.create_table(
table_name,
schema, // Arrow Schema
table_id_hint // Optional TableId preference
)?;
let table: Table = result.table;
let table_id: TableId = result.table_id;
let field_ids: HashMap<String, FieldId> = result.field_mappings;
Resolving Column Names
// Get a snapshot for consistent reads
let snapshot = catalog_manager.snapshot();
// Resolve column references
let resolver = snapshot.field_resolver(table_id)?;
let field_id = resolver.resolve("column_name")?;
// Use field_id in storage operations
Registering an Index
// Register a single-column index
let descriptor = SingleColumnIndexDescriptor::new(
field_id,
IndexType::BTree,
options
);
catalog_manager.register_index(
table_id,
descriptor
)?;
Checking Metadata Changes
// Capture snapshot version
let snapshot_v1 = catalog_manager.snapshot();
let version_1 = snapshot_v1.version();
// ... DDL operations occur ...
// Create new snapshot and check for changes
let snapshot_v2 = catalog_manager.snapshot();
let version_2 = snapshot_v2.version();
if version_1 != version_2 {
// Metadata has changed, invalidate plans
}
Sources: llkv-table/src/lib.rs:54-89
Thread Safety and Concurrency
The CatalogManager typically uses interior mutability (e.g., RwLock or Mutex) to allow:
- Concurrent Reads : Multiple threads can read snapshots simultaneously
- Exclusive Writes : DDL operations acquire exclusive locks
- Snapshot Isolation : Snapshots remain valid even during concurrent DDL
This design allows high read concurrency while ensuring DDL operations are serialized and atomic.
Related Modules
The CatalogManager coordinates with several related modules:
- catalog module: Contains the implementation (not shown in provided files)
- sys_catalog module: Persistence layer for metadata [llkv-table/src/sys_catalog.rs]
- metadata module: Extended metadata management [llkv-table/src/metadata.rs]
- ddl module: DDL-specific helpers [llkv-table/src/ddl.rs]
- resolvers module: Name resolution utilities [llkv-table/src/resolvers.rs]
- constraints module: Constraint validation [llkv-table/src/constraints.rs]
Sources: llkv-table/src/lib.rs:34-46 llkv-table/src/lib.rs:68-69 llkv-table/src/lib.rs:74 llkv-table/src/lib.rs:79
System Catalog and SysCatalog
Purpose and Scope
The System Catalog (SysCatalog) is LLKV’s metadata repository that stores all table definitions, column schemas, indexes, constraints, triggers, and custom types. It provides a self-describing, bootstrapped metadata system where all metadata is stored as structured records in Table 0 using the same columnar storage infrastructure as user data.
This page covers the system catalog’s architecture, metadata structures, and how it integrates with DDL operations. For the higher-level catalog management API, see CatalogManager API. For custom type definitions and aliases, see Custom Types and Type Registry.
Sources: llkv-table/src/lib.rs:1-32
Self-Describing Architecture
The system catalog is implemented as Table 0, a special reserved table that stores metadata about all other tables (including itself). This creates a bootstrapped, self-referential system where the catalog uses the same storage mechanisms it documents.
Key architectural characteristics:
graph TB
subgraph "Table 0 - System Catalog"
SysCatalog["SysCatalog struct\n(sys_catalog.rs)"]
Table0["Underlying Table\ntable_id = 0"]
ColumnStore0["ColumnStore\nPhysical Storage"]
end
subgraph "Metadata Record Types"
TableMeta["TableMeta Records\ntable_id, name, schema"]
ColMeta["ColMeta Records\nfield_id, table_id, name, type"]
CustomTypeMeta["CustomTypeMeta Records\nCustom type definitions"]
IndexMeta["Index Metadata\nSingle & Multi-Column"]
TriggerMeta["TriggerMeta Records\nBEFORE/AFTER triggers"]
end
subgraph "User Tables"
Table1["User Table 1\ntable_id = 1"]
Table2["User Table 2\ntable_id = 2"]
TableN["User Table N\ntable_id = N"]
end
SysCatalog --> Table0
Table0 --> ColumnStore0
TableMeta --> Table0
ColMeta --> Table0
CustomTypeMeta --> Table0
IndexMeta --> Table0
TriggerMeta --> Table0
TableMeta -.describes.-> Table1
TableMeta -.describes.-> Table2
TableMeta -.describes.-> TableN
ColMeta -.describes columns.-> Table1
ColMeta -.describes columns.-> Table2
| Property | Description |
|---|---|
| Table ID | Always CATALOG_TABLE_ID (0) |
| Self-describing | Metadata about Table 0 is stored in Table 0 itself |
| Schema consistency | Uses the same Table and ColumnStore abstractions as user tables |
| Transactional | All metadata changes are transactional via MVCC |
| Queryable | Can be queried like any other table (with appropriate permissions) |
Sources: llkv-table/src/lib.rs:7-31 llkv-table/src/sys_catalog.rs
Table ID Ranges and Reservations
LLKV partitions the table ID space into reserved ranges for different purposes:
Constants and predicates:
graph LR
subgraph "Table ID Ranges"
Range0["ID 0\nCATALOG_TABLE_ID\n(System Catalog)"]
Range1["IDs 1-999\n(User Tables)"]
Range2["IDs 1000+\nINFORMATION_SCHEMA_TABLE_ID_START\n(Information Schema)"]
Range3["IDs 10000+\nTEMPORARY_TABLE_ID_START\n(Temporary Tables)"]
end
Range0 --> SysCheck{is_reserved_table_id}
Range1 --> UserCheck{User table}
Range2 --> InfoCheck{is_information_schema_table}
Range3 --> TempCheck{Temporary table}
| Constant/Function | Value/Purpose | Location |
|---|---|---|
CATALOG_TABLE_ID | 0 - System catalog table | llkv-table/src/reserved.rs |
INFORMATION_SCHEMA_TABLE_ID_START | 1000 - Information schema views start | llkv-table/src/reserved.rs |
TEMPORARY_TABLE_ID_START | 10000 - Temporary tables start | llkv-table/src/reserved.rs |
is_reserved_table_id(id) | Returns true if ID is reserved | llkv-table/src/reserved.rs |
is_information_schema_table(id) | Returns true if ID is info schema | llkv-table/src/reserved.rs |
This partitioning allows the system to quickly identify table categories and apply appropriate handling (e.g., temporary tables are not persisted across sessions).
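A hedged sketch of how such a classification could look, using the constants from the table above; the real predicates in llkv-table/src/reserved.rs may draw the boundaries differently (this sketch assumes contiguous ranges).

```rust
type TableId = u64;

const CATALOG_TABLE_ID: TableId = 0;
const INFORMATION_SCHEMA_TABLE_ID_START: TableId = 1000;
const TEMPORARY_TABLE_ID_START: TableId = 10_000;

// Classify a table ID into the ranges shown in the table above.
fn classify(id: TableId) -> &'static str {
    match id {
        CATALOG_TABLE_ID => "system catalog",
        id if id >= TEMPORARY_TABLE_ID_START => "temporary table",
        id if id >= INFORMATION_SCHEMA_TABLE_ID_START => "information schema",
        _ => "user table",
    }
}

fn main() {
    assert_eq!(classify(0), "system catalog");
    assert_eq!(classify(42), "user table");
    assert_eq!(classify(1500), "information schema");
    assert_eq!(classify(20_000), "temporary table");
}
```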
Sources: llkv-table/src/lib.rs:23-27 llkv-table/src/lib.rs:75-78 llkv-table/src/reserved.rs
Core Metadata Structures
The system catalog stores multiple types of metadata records, each represented by a specific struct that defines a schema for that metadata type.
TableMeta Structure
TableMeta records describe table-level metadata:
Key operations:
| Method | Purpose | Returns |
|---|---|---|
schema() | Get Arrow schema for table | Schema |
field_ids() | Get ordered list of field IDs | Vec<FieldId> |
table_id | Unique table identifier | TableId |
table_name | Table name (unqualified) | String |
schema_name | Optional schema/namespace | Option<String> |
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs
ColMeta Structure
ColMeta records describe individual column metadata:
Each ColMeta record associates a FieldId (the internal column identifier) with a TableId and provides the column’s name, type, nullability, and position within the table schema.
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs
Index Metadata Structures
Index metadata is stored in two forms:
Index metadata types:
| Structure | Purpose | Key Fields |
|---|---|---|
SingleColumnIndexEntryMeta | Describes a single-column index | table_id, field_id, index_name |
TableSingleColumnIndexMeta | Table’s single-column index map | HashMap<FieldId, IndexDescriptor> |
MultiColumnIndexEntryMeta | Describes a multi-column index | table_id, field_ids[], index_name |
TableMultiColumnIndexMeta | Table’s multi-column index map | Vec<MultiColumnIndexDescriptor> |
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs llkv-table/src/catalog.rs
Trigger Metadata Structures
Triggers are stored with timing and event specifications:
Each trigger is identified by name and associated with a specific table, timing (BEFORE/AFTER), and event (INSERT/UPDATE/DELETE).
Sources: llkv-table/src/sys_catalog.rs
Custom Type Metadata
Custom types and type aliases are stored as CustomTypeMeta records:
This enables user-defined type aliases (e.g., CREATE TYPE email AS VARCHAR(255)) to be persisted and resolved during schema parsing.
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs
graph TB
subgraph "SysCatalog API"
Constructor["new(table: Table)\nWraps Table 0"]
subgraph "Table Metadata"
InsertTable["insert_table_meta(meta)\nAdd new table"]
GetTable["get_table_meta(id)\nRetrieve table info"]
ListTables["list_tables()\nGet all tables"]
UpdateTable["update_table_meta(meta)\nModify table"]
DeleteTable["delete_table_meta(id)\nRemove table"]
end
subgraph "Column Metadata"
InsertCol["insert_col_meta(meta)\nAdd column"]
GetCols["get_col_metas(table_id)\nGet table columns"]
end
subgraph "Index Metadata"
InsertIdx["insert_index_meta(meta)\nRegister index"]
GetIdxs["get_indexes(table_id)\nGet table indexes"]
end
subgraph "Trigger Metadata"
InsertTrigger["insert_trigger_meta(meta)\nCreate trigger"]
GetTriggers["get_triggers(table_id)\nList triggers"]
end
end
Constructor --> InsertTable
Constructor --> GetTable
Constructor --> InsertCol
Constructor --> InsertIdx
Constructor --> InsertTrigger
SysCatalog API and Operations
The SysCatalog struct provides methods for reading and writing metadata records:
Core methods:
| Method Category | Operations | Purpose |
|---|---|---|
| Table operations | insert_table_meta(), get_table_meta(), update_table_meta(), delete_table_meta() | Manage table-level metadata |
| Column operations | insert_col_meta(), get_col_metas(), update_col_meta() | Manage column definitions |
| Index operations | insert_index_meta(), get_indexes(), delete_index() | Register and query indexes |
| Trigger operations | insert_trigger_meta(), get_triggers(), delete_trigger() | Manage trigger definitions |
| Type operations | insert_custom_type(), get_custom_type(), list_custom_types() | Manage custom type definitions |
| Query operations | list_tables(), table_exists(), resolve_field_id() | Query and resolve metadata |
All operations are implemented as append operations on the underlying Table struct, leveraging the same MVCC and transactional semantics as user data.
Sources: llkv-table/src/sys_catalog.rs
Bootstrap Process
The system catalog must be initialized before any other tables can be created. The bootstrap process creates Table 0 with a predefined schema:
Bootstrap steps:
sequenceDiagram
participant Init as "Initialization"
participant CM as "CatalogManager"
participant SysCat as "SysCatalog::new()"
participant Table0 as "Table ID 0"
participant CS as "ColumnStore"
Init->>CM: Create database
CM->>SysCat: Initialize system catalog
Note over SysCat: Define catalog schema\n(TableMeta, ColMeta, etc.)
SysCat->>Table0: Create Table 0 with catalog schema
Table0->>CS: Initialize column storage for catalog
SysCat->>Table0: Insert TableMeta for Table 0\n(self-reference)
Note over SysCat: Catalog is now operational
SysCat-->>CM: Return initialized catalog
CM-->>Init: Ready for user tables
- Schema definition : The catalog schema is hardcoded in SysCatalog, defining the fields for TableMeta, ColMeta, and other metadata types
- Table 0 creation : A Table struct is created with table_id = 0 and the catalog schema
- Self-registration : The first metadata record inserted is TableMeta for Table 0 itself, creating the self-referential loop
- Ready state : Once initialized, the catalog can accept metadata for user tables
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/catalog.rs
Metadata Queries and Resolution
The system catalog supports various query patterns for metadata resolution:
Resolution functions:
| Function | Input | Output | Location |
|---|---|---|---|
resolve_table_name() | Schema name, table name | Option<TableId> | llkv-table/src/resolvers.rs |
canonical_table_name() | Schema name, table name | Canonical string | llkv-table/src/resolvers.rs |
FieldResolver::resolve() | Column name, context | FieldId | llkv-table/src/catalog.rs |
get_table_meta() | TableId | Option<TableMeta> | llkv-table/src/sys_catalog.rs |
get_col_metas() | TableId | Vec<ColMeta> | llkv-table/src/sys_catalog.rs |
These resolution operations enable the query planner to translate SQL identifiers (table names, column names) into internal identifiers (TableId, FieldId) used throughout the execution pipeline.
Sources: llkv-table/src/resolvers.rs llkv-table/src/catalog.rs llkv-table/src/sys_catalog.rs
sequenceDiagram
participant SQL as "SQL Engine"
participant DDL as "DDL Handler\n(CatalogManager)"
participant SysCat as "SysCatalog"
participant Table0 as "Table 0"
SQL->>DDL: CREATE TABLE users (...)
DDL->>DDL: Parse schema\nAllocate TableId
DDL->>DDL: Assign FieldIds to columns
DDL->>SysCat: insert_table_meta(TableMeta)
SysCat->>Table0: Append TableMeta record
loop For each column
DDL->>SysCat: insert_col_meta(ColMeta)
SysCat->>Table0: Append ColMeta record
end
DDL->>DDL: Create user Table with allocated ID
DDL-->>SQL: CreateTableResult
Note over SQL,Table0: Table now visible to queries
Integration with DDL Operations
All DDL operations (CREATE TABLE, ALTER TABLE, DROP TABLE) modify the system catalog:
DDL operation flow:
- CREATE TABLE :
  - Parse schema and allocate TableId
  - Assign FieldId to each column
  - Insert TableMeta and ColMeta records into catalog
  - Create the physical Table with the assigned ID
- ALTER TABLE :
  - Retrieve current TableMeta and ColMeta records
  - Validate operation (see validate_alter_table_operation())
  - Update metadata records (append new versions due to MVCC)
  - Modify the physical table schema
- DROP TABLE :
  - Mark TableMeta as deleted (MVCC soft delete)
  - Mark associated ColMeta records as deleted
  - Remove indexes and constraints
  - Free physical storage pages
Sources: llkv-table/src/ddl.rs llkv-table/src/catalog.rs llkv-table/src/constraints.rs
graph TB
subgraph "Constraint Types"
PK["PrimaryKeyConstraint\nUnique, Not Null"]
Unique["UniqueConstraint\nSingle or Multi-Column"]
FK["ForeignKeyConstraint\nReferential Integrity"]
Check["CheckConstraint\nExpression Validation"]
end
subgraph "Constraint Metadata"
ConstraintRecord["ConstraintRecord\nid, table_id, kind"]
ConstraintService["ConstraintService\nEnforcement Logic"]
end
subgraph "Storage"
MetaManager["MetadataManager"]
SysCatalog["SysCatalog Table 0"]
end
PK --> ConstraintRecord
Unique --> ConstraintRecord
FK --> ConstraintRecord
Check --> ConstraintRecord
ConstraintRecord --> MetaManager
MetaManager --> SysCatalog
ConstraintService --> ConstraintRecord
Constraints and Validation Metadata
The system catalog also stores constraint definitions that enforce data integrity:
Constraint structures:
| Constraint Type | Structure | Key Information |
|---|---|---|
| Primary Key | PrimaryKeyConstraint | Column(s), constraint name |
| Unique | UniqueConstraint | Column(s), constraint name, partial index |
| Foreign Key | ForeignKeyConstraint | Child columns, parent table/columns, ON DELETE/UPDATE actions |
| Check | CheckConstraint | Boolean expression, constraint name |
The ConstraintService validates constraint satisfaction during INSERT, UPDATE, and DELETE operations, querying the catalog for relevant constraint definitions.
Sources: llkv-table/src/constraints.rs llkv-table/src/metadata.rs
graph TB
subgraph "MetadataManager API"
MM["MetadataManager\n(metadata.rs)"]
IndexReg["register_single_column_index()\nregister_multi_column_index()"]
IndexGet["get_single_column_indexes()\nget_multi_column_indexes()"]
FKReg["register_foreign_key()"]
FKGet["get_foreign_keys()"]
FKValidate["validate_foreign_key_rows()"]
ConstraintReg["register_constraint()"]
ConstraintGet["get_constraints()"]
end
subgraph "SysCatalog Operations"
SysCat["SysCatalog"]
TableMetaOps["Table/Column Metadata"]
IndexMetaOps["Index Metadata"]
ConstraintMetaOps["Constraint Metadata"]
end
MM --> SysCat
IndexReg --> IndexMetaOps
IndexGet --> IndexMetaOps
FKReg --> ConstraintMetaOps
FKGet --> ConstraintMetaOps
ConstraintReg --> ConstraintMetaOps
ConstraintGet --> ConstraintMetaOps
IndexMetaOps --> SysCat
ConstraintMetaOps --> SysCat
Metadata Manager Integration
The MetadataManager provides a higher-level interface over SysCatalog for managing complex metadata like indexes and constraints:
The MetadataManager coordinates between the catalog and the runtime enforcement logic, ensuring that metadata changes are properly persisted and that constraints are consistently enforced.
Sources: llkv-table/src/metadata.rs llkv-table/src/catalog.rs
graph LR
subgraph "Transaction T1"
T1Create["CREATE TABLE users"]
T1Meta["Insert TableMeta\ncreated_by = T1"]
end
subgraph "Transaction T2 (concurrent)"
T2Query["SELECT FROM users"]
T2Scan["Scan catalog\nFilter: created_by <= T2\ndeleted_by > T2 OR NULL"]
end
subgraph "Table 0 Storage"
Metadata["TableMeta Records\nwith MVCC columns"]
end
T1Meta --> Metadata
T2Scan --> Metadata
T1Create -.commits.-> T1Meta
T2Query -.reads snapshot.-> T2Scan
MVCC and Transactional Metadata
Because the system catalog is implemented as Table 0 using the same storage layer as user tables, all metadata operations are automatically transactional with MVCC semantics:
MVCC characteristics:
| Property | Behavior |
|---|---|
| Isolation | Each transaction sees a consistent snapshot of metadata |
| Atomicity | DDL operations are atomic (all metadata changes or none) |
| Versioning | Multiple versions of metadata coexist until compaction |
| Soft deletes | Dropped tables remain visible to old transactions |
| Time travel | Historical metadata can be queried (if supported) |
This ensures that concurrent DDL and DML operations do not interfere with each other, and that queries always see a consistent view of the schema.
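As a minimal model of the visibility rule in the diagram above (MvccRow and the u64 transaction IDs are illustrative stand-ins for the real MVCC columns):

```rust
type TxnId = u64;

struct MvccRow {
    created_by: TxnId,
    deleted_by: Option<TxnId>,
}

/// A metadata row is visible to a reader when it was created at or before the
/// reader's snapshot and not yet deleted as of that snapshot.
fn visible(row: &MvccRow, snapshot: TxnId) -> bool {
    row.created_by <= snapshot && row.deleted_by.map_or(true, |d| d > snapshot)
}

fn main() {
    let users_meta = MvccRow { created_by: 5, deleted_by: None };
    assert!(visible(&users_meta, 7));  // created before the snapshot
    assert!(!visible(&users_meta, 3)); // created by a later transaction

    let dropped = MvccRow { created_by: 2, deleted_by: Some(6) };
    assert!(visible(&dropped, 5));     // old snapshot still sees the table
    assert!(!visible(&dropped, 8));    // newer snapshot sees the soft delete
}
```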
Sources: llkv-table/src/table.rs llkv-table/src/sys_catalog.rs
Schema Evolution and Compatibility
The system catalog schema itself is versioned and can evolve over time:
Schema migrations add new columns to the catalog tables (e.g., adding constraint metadata) while maintaining backward compatibility with existing metadata records.
Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs
Summary
The System Catalog (SysCatalog) provides:
- Self-describing metadata : All metadata stored in Table 0 using the same columnar storage as user data
- Comprehensive tracking : Tables, columns, indexes, triggers, constraints, and custom types
- Transactional semantics : MVCC ensures consistent metadata reads and atomic DDL operations
- Efficient resolution : Fast lookup of table names to IDs, field names to IDs
- Extensibility : New metadata types can be added by extending the catalog schema
The catalog bridges the gap between SQL identifiers and internal identifiers, enabling the query processor to operate on TableId and FieldId rather than string names throughout the execution pipeline.
Sources: llkv-table/src/lib.rs llkv-table/src/sys_catalog.rs llkv-table/src/catalog.rs llkv-table/src/metadata.rs
Custom Types and Type Registry
Purpose and Scope
This document describes LLKV’s custom type system, which enables users to define and manage type aliases that extend Apache Arrow’s native type system. Custom types are persisted in the system catalog and provide a mechanism for creating domain-specific type names that map to underlying Arrow DataType definitions.
For information about the system catalog infrastructure that stores custom type metadata, see System Catalog and SysCatalog. For details on how tables use these types in their schemas, see Table Abstraction.
Type System Architecture
LLKV’s type system is built on Apache Arrow’s columnar type system but adds a layer of indirection through custom type definitions. This allows users to create semantic type names (e.g., email_address, currency_amount) that map to specific Arrow types with additional constraints or metadata.
Type Resolution Flow
graph TB
subgraph "SQL Layer"
DDL["CREATE TYPE Statement"]
COLDEF["Column Definition\nwith Custom Type"]
end
subgraph "Type Registry"
SYSCAT["SysCatalog\nTable 0"]
TYPEMETA["CustomTypeMeta\nRecords"]
RESOLVER["Type Resolver"]
end
subgraph "Arrow Type System"
ARROW["Arrow DataType"]
SCHEMA["Arrow Schema"]
end
subgraph "Column Storage"
COLSTORE["ColumnStore"]
DESCRIPTOR["ColumnDescriptor"]
end
DDL --> TYPEMETA
TYPEMETA --> SYSCAT
COLDEF --> RESOLVER
RESOLVER --> TYPEMETA
RESOLVER --> ARROW
ARROW --> SCHEMA
SCHEMA --> COLSTORE
COLSTORE --> DESCRIPTOR
style TYPEMETA fill:#f9f9f9
style SYSCAT fill:#f9f9f9
style RESOLVER fill:#f9f9f9
- User defines custom types via SQL DDL
- Type metadata is stored in the system catalog
- Column definitions reference custom types by name
- Type resolver translates names to Arrow DataTypes
- Physical storage uses Arrow’s native columnar format
Sources: llkv-table/src/lib.rs:82 llkv-table/src/lib.rs:81-85
CustomTypeMeta Structure
CustomTypeMeta is the fundamental metadata structure that describes a custom type definition. It is stored as a record in the system catalog (Table 0) alongside other metadata like TableMeta and ColMeta.
CustomTypeMeta Fields
classDiagram
class CustomTypeMeta {+type_id: TypeId\n+type_name: String\n+base_type: ArrowDataType\n+nullable: bool\n+metadata: HashMap~String,String~\n+created_at: Timestamp\n+created_by: TransactionId\n+deleted_by: Option~TransactionId~}
class SysCatalog {+register_custom_type()\n+get_custom_type()\n+list_custom_types()\n+drop_custom_type()}
class ArrowDataType {<<enumeration>>\nInt64\nUtf8\nDecimal128\nDate32\nTimestamp\nStruct\nList}
class ColumnDescriptor {+field_id: FieldId\n+data_type: DataType}
CustomTypeMeta --> ArrowDataType : maps_to
SysCatalog --> CustomTypeMeta : stores
ColumnDescriptor --> ArrowDataType : uses
| Field | Type | Description |
|---|---|---|
| type_id | TypeId | Unique identifier for the custom type |
| type_name | String | User-defined name (e.g., “email_address”) |
| base_type | ArrowDataType | Underlying Arrow type definition |
| nullable | bool | Whether NULL values are permitted |
| metadata | HashMap<String, String> | Additional type-specific metadata |
| created_at | Timestamp | Type creation timestamp |
| created_by | TransactionId | Transaction that created the type |
| deleted_by | Option<TransactionId> | MVCC deletion marker |
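Reconstructed from the field table above, the record can be approximated as the following Rust struct. The field names follow the table, but the exact definition in llkv-table may differ:

```rust
// Approximate shape of a CustomTypeMeta record, reconstructed from the
// field table above. The real struct in llkv-table may differ in detail.
#![allow(dead_code)]
use std::collections::HashMap;
use arrow_schema::DataType;

type TypeId = u64;
type TransactionId = u64;
type Timestamp = i64; // e.g. microseconds since the Unix epoch (assumption)

struct CustomTypeMeta {
    type_id: TypeId,
    type_name: String,
    base_type: DataType,               // underlying Arrow type
    nullable: bool,
    metadata: HashMap<String, String>, // extensible key/value properties
    created_at: Timestamp,
    created_by: TransactionId,
    deleted_by: Option<TransactionId>, // MVCC soft-delete marker
}

fn main() {
    let email = CustomTypeMeta {
        type_id: 1,
        type_name: "email_address".into(),
        base_type: DataType::Utf8,
        nullable: false,
        metadata: HashMap::from([("max_length".into(), "255".into())]),
        created_at: 0,
        created_by: 1,
        deleted_by: None,
    };
    println!("{} -> {:?}", email.type_name, email.base_type);
}
```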
Sources: llkv-table/src/lib.rs:82
Type Registration and Lifecycle
Custom types are managed through the SysCatalog interface, which provides operations for the complete type lifecycle: registration, retrieval, modification, and deletion.
sequenceDiagram
participant User
participant SqlEngine
participant CatalogManager
participant SysCatalog
participant Table0 as Table 0
User->>SqlEngine: CREATE TYPE email_address AS VARCHAR(255)
SqlEngine->>SqlEngine: Parse DDL statement
SqlEngine->>CatalogManager: register_custom_type(name, base_type)
CatalogManager->>CatalogManager: Validate type name uniqueness
CatalogManager->>CatalogManager: Assign new TypeId
CatalogManager->>SysCatalog: Insert CustomTypeMeta record
SysCatalog->>SysCatalog: Build RecordBatch with metadata
SysCatalog->>SysCatalog: Add MVCC columns (created_by)
SysCatalog->>Table0: append(batch)
Table0->>Table0: Write to ColumnStore
Table0-->>SysCatalog: Success
SysCatalog-->>CatalogManager: TypeId
CatalogManager-->>SqlEngine: Result
SqlEngine-->>User: Type created successfully
Type Registration Flow
Registration Steps
- DDL statement parsed by SQL layer
- CatalogManager validates type name uniqueness
- New TypeId allocated
- CustomTypeMeta record constructed
- Metadata written to system catalog (Table 0)
- MVCC columns (created_by, deleted_by) added automatically
- Type becomes available for schema definitions (a sketch of this flow is shown below)
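The following is a hedged, in-memory sketch of this registration path. The CatalogManager shape and method signature are assumptions made for illustration; the real crate persists the record as a RecordBatch appended to Table 0:

```rust
// Hypothetical sketch of the registration steps listed above.
// CatalogManager fields and method names are assumptions.
use std::collections::HashMap;

type TypeId = u64;

#[derive(Debug, Clone)]
struct CustomTypeDef {
    name: String,
    base_type: String, // a string stands in for an Arrow DataType here
}

struct CatalogManager {
    next_type_id: TypeId,
    // Stand-in for catalog rows persisted in Table 0.
    types: HashMap<String, (TypeId, CustomTypeDef)>,
}

impl CatalogManager {
    fn register_custom_type(&mut self, def: CustomTypeDef) -> Result<TypeId, String> {
        // 1. Validate type name uniqueness.
        if self.types.contains_key(&def.name) {
            return Err(format!("type '{}' already exists", def.name));
        }
        // 2. Allocate a new TypeId.
        let type_id = self.next_type_id;
        self.next_type_id += 1;
        // 3. Construct the metadata record and write it to the catalog.
        //    (In LLKV this becomes a RecordBatch appended to Table 0 with
        //    MVCC columns filled in; here it is just stored in memory.)
        self.types.insert(def.name.clone(), (type_id, def));
        Ok(type_id)
    }
}

fn main() {
    let mut mgr = CatalogManager { next_type_id: 1, types: HashMap::new() };
    let def = CustomTypeDef { name: "email_address".into(), base_type: "Utf8".into() };
    let id = mgr.register_custom_type(def).expect("registration should succeed");
    let (stored_id, stored) = &mgr.types["email_address"];
    assert_eq!(*stored_id, id);
    println!("registered {} ({}) as type {}", stored.name, stored.base_type, id);
}
```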
Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85
Type Lifecycle Operations
Type States
- Registered : Type defined but not yet used in any table schemas
- InUse : One or more columns reference this type
- Modified : Type definition updated (if ALTER TYPE is supported)
- Deprecated : Type soft-deleted via MVCC (deleted_by set)
- Deleted : Type permanently removed from catalog
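These states can be summarized as a small enum. This is purely illustrative and not a type defined by the crate:

```rust
// Illustrative lifecycle states for a custom type; not an actual LLKV enum.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum TypeState {
    Registered, // defined but not yet used in any schema
    InUse,      // referenced by at least one column
    Modified,   // definition updated (if ALTER TYPE is supported)
    Deprecated, // soft-deleted via MVCC (deleted_by set)
    Deleted,    // physically removed from the catalog
}

fn main() {
    let state = TypeState::Registered;
    assert_eq!(state, TypeState::Registered);
}
```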
Sources: llkv-table/src/lib.rs:81-85
Type Resolution and Schema Integration
When creating tables or altering schemas, the type resolver translates custom type names to Arrow DataType instances. This resolution happens during DDL execution and schema validation.
Resolution Process
graph LR
subgraph "Table Definition"
COL1["Column: 'email'\nType: 'email_address'"]
COL2["Column: 'age'\nType: 'INT'"]
end
subgraph "Type Resolution"
RESOLVER["Type Resolver"]
CACHE["Type Cache"]
end
subgraph "System Catalog"
CUSTOM["CustomTypeMeta\nemail_address → Utf8(255)"]
BUILTIN["Built-in Types\nINT → Int32"]
end
subgraph "Arrow Schema"
FIELD1["Field: 'email'\nDataType: Utf8"]
FIELD2["Field: 'age'\nDataType: Int32"]
end
COL1 --> RESOLVER
COL2 --> RESOLVER
RESOLVER --> CACHE
CACHE --> CUSTOM
RESOLVER --> BUILTIN
CUSTOM --> FIELD1
BUILTIN --> FIELD2
FIELD1 --> SCHEMA["Arrow Schema"]
FIELD2 --> SCHEMA
- Column definition specifies type by name
- Type resolver checks cache for previous resolution
- Cache miss triggers lookup in SysCatalog
- Custom type metadata retrieved from Table 0
- Base Arrow DataType extracted
- Type constraints/metadata applied
- Resolved type cached for subsequent use
- Arrow Field constructed with final type (see the resolver sketch below)
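The caching behavior described above can be sketched as follows. The resolver structure and lookup names are assumptions for illustration; only the built-in Arrow DataType values are taken from the Arrow crate:

```rust
// Hypothetical sketch of a caching type resolver. Built-in SQL types
// resolve directly; custom names fall back to a catalog lookup, and the
// result is cached for subsequent columns. Names are illustrative.
use std::collections::HashMap;
use arrow_schema::DataType;

struct TypeResolver {
    cache: HashMap<String, DataType>,
    // Stand-in for CustomTypeMeta rows read from Table 0.
    catalog: HashMap<String, DataType>,
}

impl TypeResolver {
    fn resolve(&mut self, name: &str) -> Option<DataType> {
        // Built-in types short-circuit the catalog entirely.
        match name.to_ascii_uppercase().as_str() {
            "INT" | "INTEGER" => return Some(DataType::Int32),
            "BIGINT" => return Some(DataType::Int64),
            "TEXT" | "VARCHAR" => return Some(DataType::Utf8),
            _ => {}
        }
        // Cache hit avoids touching the catalog.
        if let Some(dt) = self.cache.get(name) {
            return Some(dt.clone());
        }
        // Cache miss: look up the custom type and memoize the result.
        let dt = self.catalog.get(name)?.clone();
        self.cache.insert(name.to_string(), dt.clone());
        Some(dt)
    }
}

fn main() {
    let mut resolver = TypeResolver {
        cache: HashMap::new(),
        catalog: HashMap::from([("email_address".to_string(), DataType::Utf8)]),
    };
    assert_eq!(resolver.resolve("INT"), Some(DataType::Int32));
    assert_eq!(resolver.resolve("email_address"), Some(DataType::Utf8));
    // Second resolution is served from the cache.
    assert_eq!(resolver.resolve("email_address"), Some(DataType::Utf8));
    println!("resolution checks passed");
}
```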
Sources: llkv-table/src/lib.rs:43 llkv-table/src/lib.rs:54
DDL Operations for Custom Types
CREATE TYPE Statement
Processing Steps
| Step | Action | Component |
|---|---|---|
| 1 | Parse SQL | SqlEngine → sqlparser |
| 2 | Extract type definition | SQL preprocessing layer |
| 3 | Validate base type | CatalogManager |
| 4 | Check name uniqueness | SysCatalog query |
| 5 | Allocate TypeId | CatalogManager |
| 6 | Construct metadata | CustomTypeMeta builder |
| 7 | Write to catalog | SysCatalog::append() |
| 8 | Update cache | Type resolver cache |
Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:68
DROP TYPE Statement
Deletion Process
MVCC Soft Delete
- Type records are not physically removed
- deleted_by field set to transaction ID
- Historical queries can still see old type definitions
- New schemas cannot reference deleted types
- Cache invalidation ensures immediate visibility
Sources: llkv-table/src/lib.rs:81-85
Storage in System Catalog
Custom type metadata is stored in the system catalog (Table 0) alongside other metadata types like TableMeta, ColMeta, and constraint information.
System Catalog Schema for CustomTypeMeta
| Column Name | Type | Description |
|---|---|---|
| row_id | RowId | Unique row identifier |
| metadata_type | Utf8 | Discriminator: “CustomType” |
| type_id | UInt64 | Custom type identifier |
| type_name | Utf8 | User-defined type name |
| base_type_json | Utf8 | Serialized Arrow DataType |
| nullable | Boolean | Nullability flag |
| metadata_json | Utf8 | Additional metadata as JSON |
| created_at | Timestamp | Creation timestamp |
| created_by | UInt64 | Creating transaction ID |
| deleted_by | UInt64 (nullable) | Deleting transaction ID (MVCC) |
Storage Characteristics
- Custom types stored in same table as other metadata
- metadata_type column distinguishes record types
- Arrow JSON serialization for base type persistence
- Metadata JSON for extensible properties
- MVCC columns enable temporal queries
- Indexed by type_name for fast lookup (a storage sketch follows below)
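To illustrate how such a record could be laid out physically, the sketch below builds a one-row Arrow RecordBatch whose columns mirror the table above. The column layout and the placeholder serialization string are assumptions derived from this page, not the actual Table 0 schema:

```rust
// Hedged sketch: build a one-row Arrow RecordBatch shaped like the
// catalog columns listed above. The real Table 0 schema may differ.
use std::sync::Arc;

use arrow_array::{
    ArrayRef, BooleanArray, RecordBatch, StringArray, TimestampMicrosecondArray, UInt64Array,
};
use arrow_schema::{ArrowError, DataType, Field, Schema, TimeUnit};

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("row_id", DataType::UInt64, false),
        Field::new("metadata_type", DataType::Utf8, false),
        Field::new("type_id", DataType::UInt64, false),
        Field::new("type_name", DataType::Utf8, false),
        Field::new("base_type_json", DataType::Utf8, false),
        Field::new("nullable", DataType::Boolean, false),
        Field::new("created_at", DataType::Timestamp(TimeUnit::Microsecond, None), false),
        Field::new("created_by", DataType::UInt64, false),
        Field::new("deleted_by", DataType::UInt64, true), // MVCC marker
    ]));

    let columns: Vec<ArrayRef> = vec![
        Arc::new(UInt64Array::from(vec![1u64])),
        Arc::new(StringArray::from(vec!["CustomType"])),
        Arc::new(UInt64Array::from(vec![1u64])),
        Arc::new(StringArray::from(vec!["email_address"])),
        Arc::new(StringArray::from(vec!["Utf8"])), // placeholder for a serialized DataType
        Arc::new(BooleanArray::from(vec![false])),
        Arc::new(TimestampMicrosecondArray::from(vec![0i64])),
        Arc::new(UInt64Array::from(vec![1u64])),
        Arc::new(UInt64Array::from(vec![None::<u64>])), // not deleted
    ];

    let batch = RecordBatch::try_new(schema, columns)?;
    println!("catalog batch: {} row(s), {} column(s)", batch.num_rows(), batch.num_columns());
    Ok(())
}
```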
Sources: llkv-table/src/lib.rs:81-85
Type Catalog Query Examples
Retrieving Custom Type Definition
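Conceptually, retrieving a custom type definition means finding the newest visible catalog record with a matching name, filtering out MVCC-deleted versions. The following sketch illustrates that lookup; the row shape and function name are assumptions, not LLKV's API:

```rust
// Hedged sketch of a catalog lookup: find the latest visible definition
// of a custom type by name, filtering out MVCC-deleted versions.
type TxnId = u64;

#[derive(Debug)]
struct CustomTypeRow {
    type_name: String,
    base_type: String, // stand-in for a serialized Arrow DataType
    created_by: TxnId,
    deleted_by: Option<TxnId>,
}

fn get_custom_type<'a>(
    rows: &'a [CustomTypeRow],
    name: &str,
    snapshot: TxnId,
) -> Option<&'a CustomTypeRow> {
    rows.iter()
        .filter(|r| r.type_name == name)
        .filter(|r| r.created_by <= snapshot)
        .filter(|r| r.deleted_by.map_or(true, |d| d > snapshot))
        .max_by_key(|r| r.created_by) // newest visible version wins
}

fn main() {
    let rows = vec![
        CustomTypeRow {
            type_name: "email_address".into(),
            base_type: "Utf8".into(),
            created_by: 1,
            deleted_by: Some(4), // superseded at txn 4
        },
        CustomTypeRow {
            type_name: "email_address".into(),
            base_type: "Utf8".into(),
            created_by: 4,
            deleted_by: None,
        },
    ];
    let found = get_custom_type(&rows, "email_address", 6).expect("visible definition");
    println!("email_address resolves to {}", found.base_type);
}
```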
Query Optimization
- Type name indexed for fast lookups
- MVCC predicate filters deleted types
- Result caching minimizes catalog queries
- Batch operations for multiple type resolutions
Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85
graph TB
subgraph "User Operations"
CREATE["CREATE TYPE"]
DROP["DROP TYPE"]
ALTER["ALTER TYPE"]
QUERY["Type Resolution"]
end
subgraph "CatalogManager"
API["CatalogManager API"]
VALIDATION["Type Validation"]
DEPENDENCY["Dependency Tracking"]
CACHE_MGR["Cache Manager"]
end
subgraph "SysCatalog"
SYSCAT["SysCatalog"]
TABLE0["Table 0"]
end
subgraph "Type System"
RESOLVER["Type Resolver"]
ARROW["Arrow DataType"]
end
CREATE --> API
DROP --> API
ALTER --> API
QUERY --> API
API --> VALIDATION
API --> DEPENDENCY
API --> CACHE_MGR
VALIDATION --> SYSCAT
DEPENDENCY --> SYSCAT
CACHE_MGR --> RESOLVER
SYSCAT --> TABLE0
RESOLVER --> ARROW
style API fill:#f9f9f9
style SYSCAT fill:#f9f9f9
Integration with CatalogManager
The CatalogManager provides high-level operations for custom type management, coordinating between the SQL layer, type resolver, and system catalog.
CatalogManager Responsibilities
| Function | Description |
|---|---|
| register_custom_type() | Create new type definition |
| get_custom_type() | Retrieve type by name or ID |
| list_custom_types() | Query all non-deleted types |
| drop_custom_type() | Mark type as deleted (MVCC) |
| resolve_type_name() | Translate name to Arrow type |
| check_type_dependencies() | Find columns using type |
| invalidate_type_cache() | Clear cached type definitions |
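A trait-style sketch of this surface, derived from the table above, is shown below; the signatures are assumptions and will differ from the real CatalogManager:

```rust
// Illustrative trait capturing the operations listed above. Signatures
// are assumptions reconstructed from the table, not llkv-table's API.
#![allow(dead_code)]
use std::collections::HashMap;
use arrow_schema::DataType;

type TypeId = u64;
type FieldId = u32;

#[derive(Debug, Clone)]
struct CustomType {
    id: TypeId,
    name: String,
    base_type: DataType,
    metadata: HashMap<String, String>,
}

trait CustomTypeCatalog {
    fn register_custom_type(&mut self, name: &str, base_type: DataType) -> Result<TypeId, String>;
    fn get_custom_type(&self, name: &str) -> Option<CustomType>;
    fn list_custom_types(&self) -> Vec<CustomType>;
    fn drop_custom_type(&mut self, name: &str) -> Result<(), String>;
    fn resolve_type_name(&self, name: &str) -> Option<DataType>;
    fn check_type_dependencies(&self, name: &str) -> Vec<FieldId>;
    fn invalidate_type_cache(&mut self);
}

// Declaration-only sketch; an implementation would wrap SysCatalog.
fn main() {}
```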
Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:68
Type System Best Practices
Naming Conventions
Recommended Type Names
- Use descriptive, domain-specific names: customer_id, email_address, currency_amount
- Avoid generic names that conflict with SQL keywords: text, number, date
- Use snake_case for consistency with column names
- Include units or constraints in the name: price_usd, duration_seconds
Type Reusability
When to Create Custom Types
- Domain concepts appearing in multiple tables
- Types with specific constraints (precision, length)
- Semantic meaning beyond base type
- Types requiring validation or transformation logic
When to Use Base Types
- One-off column definitions
- Standard SQL types without constraints
- Internal implementation columns
Performance Considerations
Cache Behavior
- Type resolution results are cached per session
- First resolution incurs catalog lookup cost
- Subsequent resolutions served from memory
- Cache invalidation on type modifications
Query Impact
- Custom types add one indirection layer
- Physical storage uses base Arrow types
- No runtime performance penalty
- Query plans operate on resolved types
Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85