Overview

Purpose and Scope

LLKV is an embedded SQL database system implemented in Rust that combines Apache Arrow’s columnar memory format with a key-value storage backend. This document provides a high-level introduction to the system architecture, component organization, and core data flows.

For detailed information about specific subsystems, see the dedicated pages that follow, including Architecture, Workspace and Crates, SQL Query Processing Pipeline, and Data Formats and Arrow Integration.


System Architecture

LLKV is organized as a Rust workspace containing 15 specialized crates that form a layered architecture. The system processes SQL queries through multiple stages—parsing, planning, execution—before ultimately persisting data in a memory-mapped key-value store.

graph TB
    subgraph "User Interface"
        SqlEngine["SqlEngine\n(llkv-sql)"]
PreparedStatement["PreparedStatement"]
end
    
    subgraph "Query Processing"
        Parser["SQL Parser\n(sqlparser-rs)"]
Planner["Query Planner\n(llkv-plan)"]
ExprSystem["Expression System\n(llkv-expr)"]
end
    
    subgraph "Execution"
        QueryExecutor["QueryExecutor\n(llkv-executor)"]
RuntimeEngine["RuntimeEngine\n(llkv-runtime)"]
Aggregate["AggregateAccumulator\n(llkv-aggregate)"]
Join["Hash Join\n(llkv-join)"]
end
    
    subgraph "Data Management"
        Table["Table\n(llkv-table)"]
SysCatalog["SysCatalog\nTable ID = 0"]
Scanner["Scanner\n(llkv-scan)"]
TxManager["TransactionManager\n(llkv-transaction)"]
end
    
    subgraph "Storage"
        ColumnStore["ColumnStore\n(llkv-column-map)"]
ArrowBatches["RecordBatch\nArrow Arrays"]
Pager["Pager Trait\n(llkv-storage)"]
end
    
    subgraph "Persistence"
        SimdRDrive["simd-r-drive\nMemory-Mapped K-V"]
EntryHandle["EntryHandle\nByte Blobs"]
end
    
 
   SqlEngine --> Parser
 
   Parser --> Planner
 
   Planner --> ExprSystem
 
   Planner --> QueryExecutor
 
   QueryExecutor --> RuntimeEngine
 
   QueryExecutor --> Aggregate
 
   QueryExecutor --> Join
 
   RuntimeEngine --> Table
 
   RuntimeEngine --> TxManager
 
   Table --> SysCatalog
 
   Table --> Scanner
 
   Scanner --> ColumnStore
 
   ColumnStore --> ArrowBatches
 
   ColumnStore --> Pager
 
   Pager --> SimdRDrive
 
   SimdRDrive --> EntryHandle
    
    style SqlEngine fill:#e8e8e8
    style ColumnStore fill:#e8e8e8
    style SimdRDrive fill:#e8e8e8

Layered Architecture

Diagram: End-to-End System Layering

The system flows from SQL text at the top through progressive layers of abstraction down to persistent storage. Each layer is implemented as one or more dedicated crates with well-defined responsibilities.

Sources: Cargo.toml:1-109 llkv-sql/src/sql_engine.rs:1-100 llkv-executor/src/lib.rs:1-100


Workspace Structure

The LLKV workspace is divided into 15 crates, each handling a specific concern. The following diagram maps crate names to their primary responsibilities:

Diagram: Crate Dependency Structure

graph LR
    subgraph "Foundation"
        types["llkv-types\nShared types\nLogicalFieldId"]
result["llkv-result\nError handling\nError enum"]
storage["llkv-storage\nPager trait\nMemPager"]
end
    
    subgraph "Expression & Planning"
        expr["llkv-expr\nExpression AST\nScalarExpr, Expr"]
plan["llkv-plan\nQuery plans\nSelectPlan, InsertPlan"]
end
    
    subgraph "Execution"
        executor["llkv-executor\nQuery execution\nQueryExecutor"]
compute["llkv-compute\nCompute kernels\nNumericKernels"]
aggregate["llkv-aggregate\nAggregation\nAggregateAccumulator"]
join["llkv-join\nJoin ops\nhash_join"]
scan["llkv-scan\nTable scans\nScanner"]
end
    
    subgraph "Data Management"
        table["llkv-table\nTable abstraction\nTable, SysCatalog"]
colmap["llkv-column-map\nColumn store\nColumnStore"]
transaction["llkv-transaction\nMVCC\nTransactionManager"]
end
    
    subgraph "User Interface"
        sql["llkv-sql\nSQL engine\nSqlEngine"]
runtime["llkv-runtime\nRuntime engine\nRuntimeEngine"]
end
    
    subgraph "Utilities"
        csv["llkv-csv\nCSV import/export"]
threading["llkv-threading\nThread pool"]
testutils["llkv-test-utils\nTest helpers"]
slttester["llkv-slt-tester\nSQLite test harness"]
end
    
 
   types -.-> expr
 
   types -.-> storage
 
   types -.-> table
 
   result -.-> storage
 
   result -.-> table
 
   storage -.-> colmap
 
   expr -.-> plan
 
   expr -.-> compute
 
   plan -.-> executor
 
   colmap -.-> table
 
   table -.-> executor
 
   executor -.-> runtime
 
   runtime -.-> sql

This diagram shows the primary dependency flow between crates. Foundation crates (llkv-types, llkv-result, llkv-storage) provide shared infrastructure. Middle layers handle query planning and execution. Top layers expose the SQL interface.

Sources: Cargo.toml:2-26 Cargo.toml:37-96


Key Components

SQL Interface Layer

The SqlEngine struct in llkv-sql is the primary entry point for executing SQL statements. It handles statement parsing, preprocessing, and orchestrates execution through the runtime layer.

Diagram: SQL Interface Entry Points

graph TD
    User["Application Code"]
SqlEngine["SqlEngine::new(pager)"]
Execute["SqlEngine::execute(sql)"]
Sql["SqlEngine::sql(sql)"]
Prepare["SqlEngine::prepare(sql)"]
User --> SqlEngine
 
   SqlEngine --> Execute
 
   SqlEngine --> Sql
 
   SqlEngine --> Prepare
    
 
   Execute --> Parse["parse_sql_with_recursion_limit"]
Parse --> Preprocess["preprocess_sql_input"]
Preprocess --> BuildPlan["build_*_plan methods"]
BuildPlan --> RuntimeExec["RuntimeEngine::execute_statement"]
Sql --> Execute
 
   Prepare --> PreparedStatement["PreparedStatement"]

The SqlEngine provides three primary methods: execute() for mixed statements, sql() for SELECT queries returning RecordBatch results, and prepare() for parameterized statements.

Sources: llkv-sql/src/sql_engine.rs:440-486 llkv-sql/src/sql_engine.rs:1045-1134 llkv-sql/src/sql_engine.rs:1560-1612

Query Planning

The llkv-plan crate transforms parsed SQL AST into executable plans. Key plan types include:

| Plan Type | Purpose | Key Fields |
|---|---|---|
| SelectPlan | Query execution | projections, tables, filter, group_by, order_by |
| InsertPlan | Data insertion | table, columns, source, on_conflict |
| UpdatePlan | Row updates | table, assignments, filter |
| DeletePlan | Row deletion | table, filter |
| CreateTablePlan | Schema definition | table, columns, constraints |

Sources: llkv-executor/src/lib.rs:31-35 (llkv-plan types referenced)

Expression System

The llkv-expr crate defines two core expression types:

  • Expr<F>: Boolean predicate expressions for filtering (used in WHERE clauses)
  • ScalarExpr<F>: Scalar value expressions for projections and computations

Both are generic over field identifier type F, allowing translation from string column names to numeric FieldId identifiers.
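
A minimal sketch of what such a field-generic predicate type might look like; the variant names and the resolution helper below are illustrative, not the actual llkv-expr definitions:

```rust
// Illustrative only: a predicate AST generic over the field identifier F,
// so the same tree can hold column names (String) or resolved FieldId values.
enum Expr<F> {
    And(Vec<Expr<F>>),
    Or(Vec<Expr<F>>),
    Not(Box<Expr<F>>),
    Compare { field: F, op: CompareOp, value: i64 },
}

enum CompareOp {
    Eq,
    Lt,
    Gt,
}

// Hypothetical translation from string column names to numeric field IDs.
fn resolve(expr: Expr<String>, lookup: &dyn Fn(&str) -> u32) -> Expr<u32> {
    match expr {
        Expr::And(children) => Expr::And(children.into_iter().map(|c| resolve(c, lookup)).collect()),
        Expr::Or(children) => Expr::Or(children.into_iter().map(|c| resolve(c, lookup)).collect()),
        Expr::Not(inner) => Expr::Not(Box::new(resolve(*inner, lookup))),
        Expr::Compare { field, op, value } => Expr::Compare { field: lookup(field.as_str()), op, value },
    }
}
```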

Sources: llkv-executor/src/lib.rs:23-26

graph TB
    subgraph "Logical Layer"
        RecordBatch["RecordBatch\nArrow columnar format"]
Schema["Schema\nColumn definitions"]
end
    
    subgraph "Column Store Layer"
        ColumnStore["ColumnStore"]
ColumnDescriptor["ColumnDescriptor\nLinked list of chunks"]
ChunkMetadata["ChunkMetadata\nmin, max, size, nulls"]
end
    
    subgraph "Physical Layer"
        Pager["Pager trait\nbatch_get, batch_put"]
MemPager["MemPager"]
SimdRDrive["simd-r-drive\nMemory-mapped storage"]
end
    
    subgraph "Persistence"
        EntryHandle["EntryHandle\nByte blob references"]
PhysicalKeys["Physical keys (u64)"]
end
    
 
   RecordBatch --> ColumnStore
 
   Schema --> ColumnStore
 
   ColumnStore --> ColumnDescriptor
 
   ColumnDescriptor --> ChunkMetadata
 
   ColumnDescriptor --> Pager
 
   Pager --> MemPager
 
   Pager --> SimdRDrive
 
   MemPager --> EntryHandle
 
   SimdRDrive --> EntryHandle
 
   EntryHandle --> PhysicalKeys

Storage Architecture

LLKV stores data in a columnar format using Apache Arrow, persisted through a key-value storage backend:

Diagram: Storage Architecture Layers

Arrow RecordBatches are decomposed into individual column chunks, each serialized and stored via the Pager trait. The simd-r-drive backend provides memory-mapped, SIMD-optimized key-value operations.

Sources: Cargo.lock:126-143 Cargo.lock:671-687 llkv-executor/src/lib.rs:20 (llkv-column-map references)


Query Execution Flow

The following diagram traces a SELECT query through the execution pipeline:

Diagram: SELECT Query Execution Sequence

Query execution proceeds in two phases: (1) filter evaluation to collect matching row IDs, and (2) column gathering to assemble the final RecordBatch. Metadata-based chunk pruning optimizes filter evaluation by skipping chunks that cannot contain matching rows.

Sources: llkv-sql/src/sql_engine.rs:1596-1612 llkv-executor/src/lib.rs:519-563


Data Model

Tables and Schemas

Every table in LLKV has:

  • A unique numeric table_id
  • A Schema defining column names, types, and nullability
  • Optional constraints (primary key, foreign keys, unique, check)
  • Optional indexes (single-column and multi-column)
graph TB
    SysCatalog["SysCatalog (Table 0)"]
TableMeta["TableMeta records"]
ColMeta["ColMeta records"]
IndexMeta["Index metadata"]
ConstraintMeta["Constraint records"]
TriggerMeta["Trigger definitions"]
SysCatalog --> TableMeta
 
   SysCatalog --> ColMeta
 
   SysCatalog --> IndexMeta
 
   SysCatalog --> ConstraintMeta
 
   SysCatalog --> TriggerMeta
    
    UserTable1["User Table 1"]
UserTable2["User Table 2"]
TableMeta -.describes.-> UserTable1
    TableMeta -.describes.-> UserTable2
    ColMeta -.describes.-> UserTable1
    ColMeta -.describes.-> UserTable2

System Catalog

Table ID 0 is reserved for the SysCatalog, a special table that stores metadata about all other tables, columns, indexes, triggers, and constraints. The catalog is self-describing—it uses the same columnar storage as user tables.

Diagram: System Catalog Structure

All DDL operations (CREATE TABLE, ALTER TABLE, etc.) modify the system catalog. At startup, the catalog is read to reconstruct the complete database schema.

Sources: llkv-sql/src/sql_engine.rs (references to SysCatalog)

Columnar Storage

Data is stored column-wise using Apache Arrow’s in-memory format, with each column divided into chunks. Each chunk contains:

  • Serialized Arrow array data
  • Row ID bitmap (which rows are present)
  • Metadata (min/max values, null count, size)

This organization enables:

  • Efficient predicate pushdown (skip chunks via min/max)
  • Vectorized operations on decompressed data
  • Compaction (merging small chunks)

Sources: llkv-executor/src/lib.rs (references to ColumnStore and RecordBatch)


Transaction Support

LLKV implements Multi-Version Concurrency Control (MVCC) using hidden columns:

| Column | Type | Purpose |
|---|---|---|
| __created_by | u64 | Transaction ID that created this row version |
| __deleted_by | u64 | Transaction ID that deleted this row version (or u64::MAX if active) |

The TransactionManager in llkv-transaction coordinates transaction boundaries and assigns transaction IDs. Queries automatically filter rows based on the current transaction’s visibility rules.
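
A hedged sketch of the visibility rule these columns enable; the sentinel and the exact rule (including commit-state checks) are assumptions based on the description above, not the llkv-transaction implementation:

```rust
// Assumed sentinel meaning "not deleted", following the table above.
const ACTIVE: u64 = u64::MAX;

// Illustrative visibility check: a row version is visible to `txn_id` if it
// was created at or before the snapshot and has not been deleted by a
// transaction the snapshot can see. Real MVCC also consults commit state.
fn row_visible(created_by: u64, deleted_by: u64, txn_id: u64) -> bool {
    created_by <= txn_id && (deleted_by == ACTIVE || deleted_by > txn_id)
}
```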

Sources: Cargo.toml:24 (llkv-transaction crate)


External Dependencies

LLKV relies on several external crates for core functionality:

| Dependency | Version | Purpose |
|---|---|---|
| arrow | 57.1.0 | Columnar data format, compute kernels |
| sqlparser | 0.59.0 | SQL parsing (supports multiple dialects) |
| simd-r-drive | 0.15.5-alpha | Memory-mapped key-value storage with SIMD optimization |
| rayon | 1.10.0 | Parallel processing (used in joins, aggregations) |
| croaring | 2.5.1 | Bitmap indexes for row ID sets |

Sources: Cargo.toml:40-49 Cargo.lock:126-143 Cargo.lock:671-687


Usage Example
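
The original example block did not survive extraction; the sketch below is assembled from the entry points described above (SqlEngine::new, execute, sql). Module paths, error types, and return shapes are assumptions, not verified signatures.

```rust
use std::sync::Arc;

use llkv_sql::SqlEngine;           // assumed re-export location
use llkv_storage::pager::MemPager; // assumed module path

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create an engine backed by the in-memory pager.
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // DDL and DML statements go through `execute`.
    engine.execute("CREATE TABLE users (id BIGINT, name TEXT)")?;
    engine.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")?;

    // SELECT queries return Arrow RecordBatch results via `sql`.
    let batches = engine.sql("SELECT id, name FROM users WHERE id >= 1")?;
    for batch in &batches {
        println!("rows: {}", batch.num_rows());
    }
    Ok(())
}
```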

Sources: llkv-sql/src/sql_engine.rs:443-485


Summary

LLKV is a layered SQL database system that marries Apache Arrow’s columnar format with key-value storage. The architecture separates concerns across 15 crates, enabling modular development and testing. Queries flow from SQL text through parsing, planning, and execution stages before accessing columnar data persisted in a memory-mapped store. The system supports transactions, indexes, constraints, and SQL features including joins, aggregations, and subqueries.

For deeper exploration of specific subsystems, consult the sections that follow.

Sources: Cargo.toml:1-109 llkv-sql/src/sql_engine.rs:1-100 llkv-executor/src/lib.rs:1-100

Architecture

This page describes the overall architectural design of LLKV, including the layered system structure, key design decisions, and how major components interact. For detailed information about individual crates and their responsibilities, see Workspace and Crates. For the end-to-end query execution flow, see SQL Query Processing Pipeline. For details on Arrow integration and data representation, see Data Formats and Arrow Integration.

Architectural Overview

LLKV is a columnar SQL database that stores Apache Arrow data structures directly in a key-value persistence layer. The architecture consists of six distinct layers, each implemented as one or more Rust crates. The system translates SQL statements into query plans, executes those plans against columnar table storage, and persists data using memory-mapped key-value stores.

The core architectural innovation is the llkv-column-map layer, which bridges Apache Arrow’s in-memory columnar format with the simd-r-drive key-value storage backend. This design enables zero-copy operations on columnar data while maintaining ACID properties through the underlying storage engine.

Sources: Cargo.toml:1-109 high-level overview diagrams

System Layers

The following diagram shows the six architectural layers with their implementing crates and key data structures:

graph TB
    subgraph "Layer 1: User Interface"
        SQL["llkv-sql\nSqlEngine"]
DEMO["llkv-sql-pong-demo"]
TPCH["llkv-tpch"]
CSV["llkv-csv"]
end
    
    subgraph "Layer 2: Query Processing"
        PARSER["sqlparser-rs\nParser, Statement"]
PLANNER["llkv-plan\nSelectPlan, InsertPlan"]
EXPR["llkv-expr\nExpr, ScalarExpr"]
end
    
    subgraph "Layer 3: Execution"
        EXECUTOR["llkv-executor\nTableExecutor"]
RUNTIME["llkv-runtime\nDatabaseRuntime"]
AGGREGATE["llkv-aggregate\nAccumulator"]
JOIN["llkv-join\nhash_join"]
COMPUTE["llkv-compute\nNumericKernels"]
end
    
    subgraph "Layer 4: Data Management"
        TABLE["llkv-table\nTable, SysCatalog"]
TRANSACTION["llkv-transaction\nTransaction"]
SCAN["llkv-scan\nScanOp"]
end
    
    subgraph "Layer 5: Storage - Arrow Native"
        COLMAP["llkv-column-map\nColumnStore, ColumnDescriptor"]
STORAGE["llkv-storage\nPager trait"]
ARROW["arrow-array\nRecordBatch, ArrayRef"]
end
    
    subgraph "Layer 6: Persistence - Key-Value"
        PAGER["Pager implementations\nbatch_get, batch_put"]
SIMD["simd-r-drive\nRDrive, EntryHandle"]
end
    
 
   SQL --> PARSER
 
   PARSER --> PLANNER
 
   PLANNER --> EXPR
 
   EXPR --> EXECUTOR
    
 
   EXECUTOR --> AGGREGATE
 
   EXECUTOR --> JOIN
 
   EXECUTOR --> COMPUTE
 
   EXECUTOR --> RUNTIME
    
 
   RUNTIME --> TABLE
 
   RUNTIME --> TRANSACTION
 
   TABLE --> SCAN
    
 
   SCAN --> COLMAP
 
   TABLE --> COLMAP
 
   COLMAP --> ARROW
 
   COLMAP --> STORAGE
    
 
   STORAGE --> PAGER
 
   PAGER --> SIMD

Each layer has well-defined responsibilities. Layer 1 provides user-facing interfaces. Layer 2 translates SQL into executable plans. Layer 3 executes those plans using specialized operators. Layer 4 manages logical tables and transactions. Layer 5 implements columnar storage using Arrow data structures. Layer 6 provides persistent key-value storage with memory-mapping.

Sources: Cargo.toml:2-26 llkv-sql/Cargo.toml:1-45 llkv-executor/Cargo.toml:1-48 llkv-table/Cargo.toml:1-72

Key Architectural Decisions

Arrow-Native Columnar Storage

The system uses Apache Arrow as its native in-memory data format. All data is represented as RecordBatch instances containing ArrayRef columns. This design decision enables:

  • Zero-copy interoperability with Arrow-based analytics tools
  • Vectorized computation using Arrow kernels
  • Efficient memory layouts for SIMD operations
  • Type-safe column operations through Arrow’s schema system

The arrow dependency (version 57.1.0) provides the foundation for all data operations.

Key-Value Persistence Backend

Rather than implementing a custom storage engine, LLKV persists data through the Pager trait abstraction defined in llkv-storage. The primary implementation uses simd-r-drive (version 0.15.5-alpha), a memory-mapped key-value store with SIMD-optimized operations.

This design separates logical data management from physical storage concerns. The Pager trait defines the following operations (a sketch follows the list):

  • alloc() - allocate new storage keys
  • batch_get() - retrieve multiple values
  • batch_put() - atomically write multiple key-value pairs
  • free() - release storage keys
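
A hedged sketch of what such a trait could look like; the associated types, signatures, and error handling here are assumptions, not the actual llkv-storage definitions:

```rust
// Illustrative pager abstraction over a key-value backend; the real trait
// works with EntryHandle blobs rather than plain byte vectors.
pub trait Pager {
    type Key: Copy;
    type Error;

    /// Allocate fresh physical keys for new entries.
    fn alloc(&self, count: usize) -> Result<Vec<Self::Key>, Self::Error>;
    /// Fetch the blobs stored under the given keys (None for missing keys).
    fn batch_get(&self, keys: &[Self::Key]) -> Result<Vec<Option<Vec<u8>>>, Self::Error>;
    /// Atomically persist a set of key/blob pairs.
    fn batch_put(&self, entries: &[(Self::Key, Vec<u8>)]) -> Result<(), Self::Error>;
    /// Release keys that are no longer referenced.
    fn free(&self, keys: &[Self::Key]) -> Result<(), Self::Error>;
}
```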

Modular Crate Organization

The workspace contains 15+ specialized crates, each focused on a specific concern. This modularity enables:

  • Independent testing and benchmarking per crate
  • Clear dependency boundaries
  • Parallel development across subsystems
  • Selective feature compilation

Core crates follow a naming convention: llkv-{subsystem}. Foundational crates like llkv-types and llkv-result provide shared types and error handling used throughout the system.

MVCC Transaction Isolation

The system implements Multi-Version Concurrency Control through llkv-transaction. Each table row includes system columns created_by and deleted_by that track transaction visibility. The Transaction struct manages transaction state and visibility rules.

Sources: Cargo.toml:37-96 llkv-storage/Cargo.toml:1-48 llkv-table/Cargo.toml:20-41

Component Organization

The following diagram maps the workspace crates to their architectural roles:

Dependencies flow upward through the layers. Lower-level crates like llkv-types and llkv-storage have no dependencies on higher layers. The llkv-sql crate sits at the top, orchestrating all subsystems.

Sources: Cargo.toml:2-26 Cargo.toml:55-74

Data Flow Architecture

Data flows through the system in two primary patterns: write operations (INSERT, UPDATE, DELETE) and read operations (SELECT).

flowchart LR
 
   SQL["SQL Statement"] --> PARSE["sqlparser\nparse()"]
PARSE --> PLAN["llkv-plan\nInsertPlan"]
PLAN --> RUNTIME["DatabaseRuntime\nexecute_insert_plan()"]
RUNTIME --> TABLE["Table\nappend()"]
TABLE --> COLMAP["ColumnStore\nappend()"]
COLMAP --> CHUNK["Chunking\nLWW deduplication"]
CHUNK --> SERIAL["Serialize\nArrow arrays"]
SERIAL --> PAGER["Pager\nbatch_put()"]
PAGER --> KV["simd-r-drive\nEntryHandle"]

Write Path

Write operations follow a path from SQL parsing through plan creation, runtime execution, table operations, column store append, chunking/deduplication, serialization, and finally persistence via the Pager trait.

flowchart LR
 
   SQL["SQL Statement"] --> PARSE["sqlparser\nparse()"]
PARSE --> PLAN["llkv-plan\nSelectPlan"]
PLAN --> EXECUTOR["TableExecutor\nexecute_select()"]
EXECUTOR --> FILTER["Phase 1:\nfilter_row_ids()"]
FILTER --> GATHER["Phase 2:\ngather_rows()"]
GATHER --> COLMAP["ColumnStore\ngather()"]
COLMAP --> PAGER["Pager\nbatch_get()"]
PAGER --> DESER["Deserialize\nArrow arrays"]
DESER --> BATCH["RecordBatch\nassembly"]
BATCH --> RESULT["Query Results"]

The ColumnStore::append() method implements Last-Write-Wins semantics for upserts by detecting duplicate row IDs and replacing older values.
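
A toy illustration of that Last-Write-Wins rule; the real append operates on Arrow arrays and persisted chunks rather than a hash map:

```rust
use std::collections::HashMap;

// Later values for the same row ID replace earlier ones, so replaying the
// incoming batches in order yields the final visible state.
fn lww_append(current: &mut HashMap<u64, i64>, incoming: &[(u64, i64)]) {
    for (row_id, value) in incoming {
        current.insert(*row_id, *value);
    }
}
```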

Read Path

Read operations use a two-phase approach: first collecting matching row IDs via predicate evaluation, then gathering column data for those rows. This minimizes data movement by filtering before gathering.

Sources: llkv-sql/Cargo.toml:20-38 llkv-executor/Cargo.toml:20-42 llkv-table/Cargo.toml:20-41

Storage Architecture: Arrow to Key-Value Bridge

The llkv-column-map crate implements the critical bridge between Arrow’s columnar format and key-value storage:

graph TB
    subgraph "Logical Layer"
        FIELD["LogicalFieldId\n(table_id, field_name)"]
ROWID["RowId\n(u64)"]
end
    
    subgraph "Column Organization"
        CATALOG["ColumnCatalog\nfield_id → ColumnDescriptor"]
DESC["ColumnDescriptor\nlinked list of chunks"]
META["ChunkMetadata\n(min, max, size, null_count)"]
end
    
    subgraph "Physical Storage"
        CHUNK["Data Chunks\nserialized Arrow arrays"]
RIDARRAY["RowId Arrays\nsorted u64 arrays"]
PKEY["Physical Keys\n(chunk_pk, rid_pk)"]
end
    
    subgraph "Key-Value Layer"
        ENTRY["EntryHandle\nbyte blobs"]
MMAP["Memory-Mapped Files"]
end
    
 
   FIELD --> CATALOG
 
   CATALOG --> DESC
 
   DESC --> META
 
   META --> CHUNK
 
   CHUNK --> PKEY
 
   DESC --> RIDARRAY
 
   RIDARRAY --> PKEY
 
   PKEY --> ENTRY
 
   ENTRY --> MMAP
    
    ROWID -.used for.-> RIDARRAY

The ColumnCatalog maps logical field identifiers to physical storage. Each column is represented by a ColumnDescriptor that maintains a linked list of data chunks. Each chunk contains:

  • A serialized Arrow array (the actual column data)
  • A corresponding sorted array of row IDs
  • Metadata including min/max values for predicate pushdown
  • Physical keys (chunk_pk, rid_pk) pointing to storage

The ChunkMetadata enables chunk pruning during scans: chunks whose min/max ranges don’t overlap with query predicates can be skipped entirely.
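
A minimal sketch of that pruning decision, assuming i64 column statistics; the struct and field names are illustrative, not the actual ChunkMetadata type:

```rust
// Illustrative chunk statistics; the real metadata also tracks size and
// null counts, and is typed per Arrow data type.
struct ChunkStats {
    min: i64,
    max: i64,
}

/// Returns true if the chunk could contain rows satisfying `lo <= value <= hi`.
/// Chunks that return false are skipped without being deserialized.
fn may_match_range(stats: &ChunkStats, lo: i64, hi: i64) -> bool {
    stats.max >= lo && stats.min <= hi
}
```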

Data chunks are stored as serialized byte blobs accessed through EntryHandle instances from simd-r-drive. The storage layer uses memory-mapped files for efficient I/O.

Sources: llkv-column-map/Cargo.toml:1-65 llkv-storage/Cargo.toml:1-48 Cargo.toml:85-86

System Catalog

The system catalog (table ID 0) stores metadata about all tables, columns, indexes, and constraints. It is itself stored in the same ColumnStore as user data, creating a self-describing bootstrapped system.

The SysCatalog struct provides typed access to catalog tables. The CatalogManager coordinates table lifecycle operations (CREATE, ALTER, DROP) by manipulating catalog entries.

Table ID ranges partition the namespace:

  • ID 0: System catalog
  • IDs 1-999: User tables
  • IDs 1000+: Information schema views
  • IDs 10000+: Temporary tables

This design allows the catalog to leverage the same storage, transaction, and query infrastructure as user data.
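
A small sketch of that namespace partition; the boundary values come from the list above, while the enum itself is illustrative:

```rust
// Classify a table ID into the ranges described above.
enum TableKind {
    SystemCatalog,     // ID 0
    User,              // IDs 1-999
    InformationSchema, // IDs 1000-9999
    Temporary,         // IDs 10000+
}

fn classify(table_id: u64) -> TableKind {
    match table_id {
        0 => TableKind::SystemCatalog,
        1..=999 => TableKind::User,
        1000..=9999 => TableKind::InformationSchema,
        _ => TableKind::Temporary,
    }
}
```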

Sources: llkv-table/Cargo.toml:20-41

Workspace and Crates

Purpose and Scope

This document describes the modular workspace structure of the LLKV codebase, detailing the 22 crates that comprise the system, their responsibilities, and interdependencies. Each crate serves a specific architectural layer, enabling separation of concerns and independent development. For information about the overall system architecture and how these crates fit into the layered design, see Architecture. For details about specific subsystems like SQL processing or storage, refer to the respective sections in the table of contents.

Sources: Cargo.toml:1-109

Workspace Configuration

LLKV uses a Cargo workspace to manage multiple interdependent crates under a unified build system. The workspace is configured with resolver = "2", enabling the newer dependency resolver that provides more predictable builds across different platforms and feature combinations.

All workspace members share common metadata including version 0.8.5-alpha, Apache-2.0 license, and Rust edition 2024. Workspace-level lints enforce quality standards, prohibiting print statements to stdout/stderr and disallowing specific methods like Box::leak.

Sources: Cargo.toml:2-28 Cargo.toml:29-35 Cargo.toml:98-104

Crate Dependency Architecture

Sources: Cargo.toml:37-96 llkv-table/Cargo.toml:20-40 llkv-executor/Cargo.toml:20-41 llkv-sql/Cargo.toml:20-37

Foundation Crates

llkv-types

The llkv-types crate provides the shared type system used throughout LLKV. It defines fundamental data types, field identifiers, and type conversions that enable type-safe operations across all layers. This crate has no internal LLKV dependencies, making it the true foundation upon which other crates build.

Purpose: Shared type definitions and type system infrastructure
Key Exports: Type enums, field identifiers, type conversion utilities
Internal Dependencies: None
External Dependencies: Minimal (bitcode, serde)

Sources: Cargo.toml:74 llkv-table/Cargo.toml:33 llkv-executor/Cargo.toml:35

llkv-result

The llkv-result crate defines error types and result handling patterns used throughout the system. It provides a unified error type hierarchy that enables precise error reporting and recovery across all subsystem boundaries.

Purpose: Centralized error handling and result types
Key Exports: LlkvResult type alias, error enums
Internal Dependencies: None
External Dependencies: thiserror (for error derivation)

Sources: Cargo.toml:64 llkv-table/Cargo.toml:30 llkv-executor/Cargo.toml:30

llkv-storage

The llkv-storage crate abstracts the underlying key-value storage layer through the Pager trait. It provides batch operations (batch_get, batch_put, alloc, free) that enable efficient bulk access to the persistent storage backend. When the simd-r-drive-support feature is enabled, it provides an implementation using the SIMD-optimized memory-mapped key-value store.

Purpose: Storage abstraction layer and pager interface
Key Exports: Pager trait, storage implementations
Internal Dependencies: llkv-types, llkv-result
External Dependencies: simd-r-drive (conditional)

Sources: Cargo.toml:69 llkv-table/Cargo.toml:32 llkv-sql/Cargo.toml:29

Data Management Crates

llkv-column-map

The llkv-column-map crate implements the Arrow-native columnar storage engine. It manages data organized in chunks with metadata (min/max values, null counts) and provides operations like append (with Last-Write-Wins semantics), gather (row ID to RecordBatch assembly), filtering, and compaction. The ColumnStore and ColumnCatalog types bridge Apache Arrow’s in-memory format with the persistent key-value backend.

Purpose: Columnar data storage with Arrow integration
Key Exports: ColumnStore, ColumnCatalog, ColumnDescriptor
Internal Dependencies: llkv-types, llkv-storage
External Dependencies: arrow, croaring (for bitmap indexes)

Sources: Cargo.toml:57 llkv-table/Cargo.toml:26 llkv-executor/Cargo.toml:25 llkv-sql/Cargo.toml:22

llkv-table

The llkv-table crate provides the Table abstraction that represents a logical table with a schema. It integrates the columnar storage with catalog management, MVCC column injection, and scan operations. The SysCatalog within this crate stores metadata about all tables (Table 0 stores information about Tables 1+). The CatalogManager handles table lifecycle operations including create, alter, and drop.

Purpose: Table abstraction, catalog management, and metadata storage
Key Exports: Table, SysCatalog, CatalogManager, TableMeta, ColMeta
Internal Dependencies: llkv-types, llkv-result, llkv-storage, llkv-column-map, llkv-expr, llkv-scan, llkv-compute, llkv-plan
External Dependencies: arrow, arrow-array, arrow-schema, bitcode

Sources: Cargo.toml:70 llkv-table/Cargo.toml:1-72 llkv-executor/Cargo.toml:33 llkv-sql/Cargo.toml:30

llkv-scan

The llkv-scan crate implements table scanning operations with predicate evaluation. It provides efficient row filtering using vectorized operations and bitmap indexes, enabling predicate pushdown to minimize data movement during query execution.

Purpose: Table scan operations and predicate evaluation
Key Exports: Scan iterators, filter evaluation functions
Internal Dependencies: llkv-types, llkv-column-map, llkv-expr
External Dependencies: arrow, croaring

Sources: Cargo.toml:66 llkv-table/Cargo.toml:31 llkv-executor/Cargo.toml:31 llkv-plan/Cargo.toml:26

Expression and Computation Crates

llkv-expr

The llkv-expr crate defines the expression Abstract Syntax Tree (AST) used throughout query processing. It includes Expr for general expressions and ScalarExpr for scalar computations. The crate provides expression translation (string column names to FieldId identifiers), compilation into bytecode (EvalProgram, DomainProgram), and optimization passes.

Purpose: Expression AST, translation, and compilation
Key Exports: Expr, ScalarExpr, EvalProgram, expression translators
Internal Dependencies: llkv-types, llkv-result
External Dependencies: arrow, time (for date/time handling)

Sources: Cargo.toml:61 llkv-table/Cargo.toml:28 llkv-executor/Cargo.toml:27 llkv-sql/Cargo.toml:25 llkv-plan/Cargo.toml:24

llkv-compute

The llkv-compute crate implements vectorized compute kernels for scalar expression evaluation. It provides the NumericKernels system for optimized arithmetic, comparison, and logical operations on Arrow arrays. These kernels leverage SIMD instructions where possible for high-performance data processing.

Purpose: Vectorized compute kernels for expression evaluation
Key Exports: NumericKernels, scalar evaluation functions
Internal Dependencies: llkv-types, llkv-expr
External Dependencies: arrow

Sources: Cargo.toml:58 llkv-table/Cargo.toml:27 llkv-executor/Cargo.toml:26 llkv-sql/Cargo.toml:23 llkv-plan/Cargo.toml:23

Query Processing Crates

llkv-plan

The llkv-plan crate converts SQL Abstract Syntax Trees (from sqlparser-rs) into executable query plans. It defines plan structures including SelectPlan, InsertPlan, UpdatePlan, DeletePlan, and DDL plans. The planner handles subquery correlation tracking, expression binding, and plan optimization.

Purpose: SQL-to-plan conversion and query planning
Key Exports: SelectPlan, InsertPlan, plan builder types
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-compute, llkv-scan, llkv-storage, llkv-column-map
External Dependencies: arrow, sqlparser, regex, time

Sources: Cargo.toml:63 llkv-plan/Cargo.toml:1-42 llkv-executor/Cargo.toml:29 llkv-sql/Cargo.toml:26

llkv-executor

The llkv-executor crate executes query plans to produce results. It implements the TablePlanner and TableExecutor that optimize and execute table-level operations, including full table scans, filtered scans, aggregations, joins, and sorting. The executor uses parallel processing (via rayon) where beneficial and provides streaming result iteration.

Purpose: Query plan execution engine
Key Exports: TableExecutor, TablePlanner, execution functions
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-compute, llkv-plan, llkv-scan, llkv-table, llkv-storage, llkv-column-map, llkv-aggregate, llkv-join, llkv-threading
External Dependencies: arrow, rayon, croaring

Sources: Cargo.toml:60 llkv-executor/Cargo.toml:1-47 llkv-sql/Cargo.toml:24

llkv-aggregate

The llkv-aggregate crate implements aggregate function evaluation for GROUP BY queries. It provides accumulators for functions like SUM, AVG, COUNT, MIN, MAX, and handles DISTINCT aggregation using bitmap sets. The aggregation engine supports both hash-based and sort-based strategies.

Purpose: Aggregate function evaluation and accumulation
Key Exports: Accumulator types, aggregation operators
Internal Dependencies: llkv-types, llkv-expr, llkv-compute
External Dependencies: arrow, croaring

Sources: Cargo.toml:56 llkv-executor/Cargo.toml:24

llkv-join

The llkv-join crate implements table join operations using hash join algorithms. It supports INNER, LEFT, RIGHT, and FULL OUTER joins with optimizations for build/probe side selection and parallel hash table construction. The join implementation integrates with the table scan and filter systems for efficient multi-table queries.

Purpose: Table join operations and algorithms
Key Exports: Join operators, hash join implementation
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-table, llkv-storage, llkv-column-map, llkv-threading
External Dependencies: arrow, rayon, rustc-hash
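
The build/probe pattern mentioned above can be illustrated with a minimal inner hash join on a u64 key; this sketch is not the llkv-join implementation:

```rust
use std::collections::HashMap;

// Build phase: index the (usually smaller) build side by join key.
// Probe phase: stream the probe side and emit matching pairs.
fn hash_join_inner(build: &[(u64, String)], probe: &[(u64, i64)]) -> Vec<(u64, String, i64)> {
    let mut index: HashMap<u64, Vec<&String>> = HashMap::new();
    for (key, payload) in build {
        index.entry(*key).or_default().push(payload);
    }

    let mut out = Vec::new();
    for (key, value) in probe {
        if let Some(matches) = index.get(key) {
            for payload in matches {
                out.push((*key, (*payload).clone(), *value));
            }
        }
    }
    out
}
```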

Sources: Cargo.toml:62 llkv-join/Cargo.toml:1-44 llkv-executor/Cargo.toml:28

Runtime and Transaction Crates

llkv-runtime

The llkv-runtime crate provides the runtime engine that orchestrates query execution with transaction management. It integrates the executor, table management, and transaction systems, providing the high-level API for executing SQL statements within transactional contexts. The runtime manages catalog operations, schema validation, and result formatting.

Purpose: Runtime orchestration and transaction coordination
Key Exports: Runtime engine, execution context
Internal Dependencies: llkv-types, llkv-result, llkv-table, llkv-executor, llkv-transaction, llkv-storage
External Dependencies: arrow

Sources: Cargo.toml:65 llkv-sql/Cargo.toml:28

llkv-transaction

The llkv-transaction crate implements Multi-Version Concurrency Control (MVCC) for transactional semantics. It manages transaction identifiers, version visibility, and isolation. The system injects created_by and deleted_by columns into tables to track row versions, enabling snapshot isolation for concurrent queries.

Purpose: MVCC transaction management
Key Exports: Transaction manager, transaction ID generation
Internal Dependencies: llkv-types
External Dependencies: None (minimal)

Sources: Cargo.toml:73 llkv-sql/Cargo.toml:31

SQL Interface Crates

llkv-sql

The llkv-sql crate provides the primary SQL interface through the SqlEngine type. It handles SQL parsing (using sqlparser-rs), preprocessing (dialect normalization), plan building, and execution. The engine includes an INSERT buffering system that batches multiple INSERT statements for improved bulk insert performance. This is the main entry point for executing SQL statements against LLKV.

Purpose: SQL parsing, preprocessing, and execution interface
Key Exports: SqlEngine, SQL execution methods
Internal Dependencies: llkv-types, llkv-result, llkv-expr, llkv-plan, llkv-executor, llkv-runtime, llkv-table, llkv-storage, llkv-column-map, llkv-compute, llkv-transaction
External Dependencies: arrow, sqlparser, regex

Sources: Cargo.toml:68 llkv-sql/Cargo.toml:1-44

Utility and Support Crates

llkv-csv

The llkv-csv crate provides utilities for importing and exporting CSV data to/from LLKV tables. It handles schema inference, data type conversion, and batch processing for efficient bulk data operations.

Purpose: CSV import/export functionality
Key Exports: CSV reader/writer utilities
Internal Dependencies: llkv-table, llkv-types
External Dependencies: arrow, csv

Sources: Cargo.toml:59

llkv-test-utils

The llkv-test-utils crate provides testing utilities used across the workspace. When the auto-init feature is enabled, it automatically initializes tracing at test binary startup via the ctor crate, simplifying test debugging. This crate is marked as a development dependency in most other crates.

Purpose: Shared test utilities and test infrastructure
Key Exports: Test helpers, tracing initialization
Internal Dependencies: None
External Dependencies: tracing, tracing-subscriber, ctor (optional)

Sources: Cargo.toml:71 llkv-test-utils/Cargo.toml:1-34 llkv-table/Cargo.toml:45

llkv-threading

The llkv-threading crate provides threading utilities and abstractions used by parallelized operations in the executor and join crates. It wraps rayon patterns and provides consistent threading primitives across the codebase.

Purpose: Threading utilities and parallel processing abstractions
Key Exports: Thread pool management, parallel iterators
Internal Dependencies: llkv-types
External Dependencies: rayon, crossbeam-channel

Sources: Cargo.toml:72 llkv-executor/Cargo.toml:34 llkv-join/Cargo.toml:27

Application and Demo Crates

llkv (Main Library)

The root llkv crate serves as the main library crate, aggregating and re-exporting key types and functions from the specialized crates. It provides a unified API surface for external consumers of the LLKV database system.

Purpose: Main library aggregation and public API
Key Exports: Re-exports from llkv-sql and other core crates
Internal Dependencies: llkv-sql and other workspace crates
External Dependencies: Inherited from dependencies

Sources: Cargo.toml:5 Cargo.toml:55

llkv-sql-pong-demo

An interactive terminal-based demonstration application that showcases LLKV’s SQL capabilities through a ping-pong game scenario. The demo creates tables, inserts data, and executes queries to demonstrate SQL functionality in an engaging way.

Purpose: Interactive SQL demonstration
Key Exports: Demo application binary
Internal Dependencies: llkv-sql
External Dependencies: crossterm (for terminal UI)

Sources: Cargo.toml:4

llkv-slt-tester

The llkv-slt-tester crate implements a SQLLogicTest runner for LLKV. It executes standardized SQL test suites to verify correctness against established database behavior expectations, enabling regression testing and compatibility validation.

Purpose: SQLLogicTest execution and validation
Key Exports: Test runner binary
Internal Dependencies: llkv-sql
External Dependencies: sqllogictest, libtest-mimic

Sources: Cargo.toml:67

llkv-tpch

The llkv-tpch crate implements the TPC-H benchmark suite for LLKV. It generates TPC-H schema and data, executes the 22 standard TPC-H queries, and measures performance metrics. This crate is used for performance evaluation and regression testing.

Purpose: TPC-H benchmark execution
Key Exports: Benchmark runner binary
Internal Dependencies: llkv-sql, llkv-csv
External Dependencies: tpchgen

Sources: Cargo.toml:23

Workspace Dependencies Overview

The following table summarizes key external dependencies used across the workspace:

| Dependency | Version | Purpose |
|---|---|---|
| arrow | 57.1.0 | Columnar data format and operations |
| sqlparser | 0.59.0 | SQL parsing |
| simd-r-drive | 0.15.5-alpha | Key-value storage backend |
| rayon | 1.10.0 | Parallel processing |
| croaring | 2.5.1 | Bitmap indexes |
| bitcode | 0.6.7 | Fast binary serialization |
| time | 0.3.44 | Date/time handling |
| regex | 1.12.2 | Pattern matching |
| rustc-hash | 2.1.1 | Fast hash functions |
| thiserror | 2.0.17 | Error derivation |

Sources: Cargo.toml:37-96

Crate Interdependency Matrix

Sources: Cargo.toml:37-96 llkv-table/Cargo.toml:20-40 llkv-executor/Cargo.toml:20-41 llkv-sql/Cargo.toml:20-37 llkv-join/Cargo.toml:20-32 llkv-plan/Cargo.toml:20-36

Build Configuration and Features

The workspace defines several build profiles and configuration options:

  • resolver = "2" : Uses Cargo's newer dependency resolver for more consistent builds
  • edition = "2024" : Uses the latest Rust edition (2024)
  • samply profile : Inherits from release with debug symbols enabled for profiling

Workspace lints enforce code quality by denying print statements and specific unsafe operations. Individual crates may enable conditional features like simd-r-drive-support for the storage backend.

Sources: Cargo.toml:27 Cargo.toml:32 Cargo.toml:98-109 llkv-sql/Cargo.toml:40

SQL Query Processing Pipeline

Purpose and Scope

This page describes the end-to-end flow of SQL query processing in LLKV, from raw SQL text input to Arrow RecordBatch results. It covers the major pipeline stages: preprocessing, parsing, planning, execution, and result formatting. For detailed information about specific components, see the dedicated subsystem pages.

Pipeline Overview

The SQL query processing pipeline transforms user SQL statements through several stages before producing results:

Sources: llkv-sql/src/sql_engine.rs:1-100 (see also Diagram 2 in the Overview)

flowchart TB
    Input["SQL Text\n(User Input)"]
Preprocess["SQL Preprocessing\npreprocess_* functions"]
Parse["SQL Parsing\nsqlparser::Parser"]
AST["sqlparser::ast::Statement"]
Plan["Query Planning\nbuild_*_plan functions"]
PlanStmt["PlanStatement"]
Execute["Execution\nRuntimeEngine::execute_statement"]
Result["RuntimeStatementResult"]
Batches["Vec&lt;RecordBatch&gt;"]
Input --> Preprocess
 
   Preprocess --> Parse
 
   Parse --> AST
 
   AST --> Plan
 
   Plan --> PlanStmt
 
   PlanStmt --> Execute
 
   Execute --> Result
 
   Result --> Batches
    
    style Input fill:#f9f9f9
    style Batches fill:#f9f9f9

SQL Preprocessing Stage

Before parsing, LLKV applies several preprocessing transformations to normalize SQL syntax across different dialects (SQLite, DuckDB, PostgreSQL). This stage rewrites incompatible syntax into forms that sqlparser can handle.

Preprocessing Functions

| Function | Purpose | Example Transformation |
|---|---|---|
| preprocess_tpch_connect_syntax | Strip TPC-H CONNECT TO statements | CONNECT TO tpch; → (removed) |
| preprocess_create_type_syntax | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE int8 AS bigint → CREATE DOMAIN int8 AS bigint |
| preprocess_exclude_syntax | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
| preprocess_trailing_commas_in_values | Remove trailing commas in VALUES | VALUES (1,) → VALUES (1) |
| preprocess_empty_in_lists | Expand empty IN lists to boolean expressions | x IN () → (x = NULL AND 0 = 1) |
| preprocess_index_hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax | Convert standalone REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_sqlite_trigger_shorthand | Add explicit timing and FOR EACH ROW | CREATE TRIGGER ... BEGIN → CREATE TRIGGER ... AFTER ... FOR EACH ROW BEGIN |
| preprocess_bare_table_in_clauses | Convert bare table names in IN to subqueries | x IN table → x IN (SELECT * FROM table) |

The preprocessing pipeline is applied sequentially in SqlEngine::execute:
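
A hedged sketch of such a sequential rewrite pipeline; the pass names and string-based rewrites below are illustrative stand-ins, not the actual llkv-sql preprocessors:

```rust
// Each pass takes SQL text and returns rewritten SQL text; the engine chains
// them in a fixed order before handing the result to the parser.
fn strip_index_hints(sql: &str) -> String {
    // e.g. "FROM t INDEXED BY idx" -> "FROM t" (a real pass would tokenize).
    sql.replace(" INDEXED BY idx", "")
}

fn drop_trailing_comma_in_values(sql: &str) -> String {
    // e.g. "VALUES (1,)" -> "VALUES (1)"
    sql.replace(",)", ")")
}

fn preprocess_sql(sql: &str) -> String {
    let passes: &[fn(&str) -> String] = &[strip_index_hints, drop_trailing_comma_in_values];
    passes.iter().fold(sql.to_string(), |acc, pass| pass(&acc))
}
```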

Sources: llkv-sql/src/sql_engine.rs:759-1006 llkv-sql/src/sql_engine.rs:1395-1460

Parsing Stage

After preprocessing, the SQL text is parsed using the sqlparser crate with a configurable recursion limit to handle deeply nested queries.

Parser Configuration

The parser is configured with:

  • Dialect : GenericDialect (supports multiple SQL dialects)
  • Recursion limit : 200 (set via PARSER_RECURSION_LIMIT constant)
  • Parameter tracking : Thread-local ParameterScope for prepared statement placeholders

The parser produces a vector of sqlparser::ast::Statement objects. Each statement is then converted to a PlanStatement for execution.

Sources: llkv-sql/src/sql_engine.rs:395-400 llkv-sql/src/sql_engine.rs:1464-1481 llkv-sql/src/sql_engine.rs:223-256

Statement Type Dispatch

Once parsed, statements are dispatched to specialized planning functions based on their type. The SqlEngine maintains an optional InsertBuffer to batch consecutive literal INSERT statements for improved throughput.

Sources: llkv-sql/src/sql_engine.rs:1482-2800 llkv-sql/src/sql_engine.rs:486-547

flowchart TB
    AST["sqlparser::ast::Statement"]
Dispatch{"Statement Type"}
CreateTable["build_create_table_plan"]
DropTable["build_drop_table_plan"]
AlterTable["build_alter_table_plan"]
CreateView["build_create_view_plan"]
DropView["build_drop_view_plan"]
CreateIndex["build_create_index_plan"]
DropIndex["build_drop_index_plan"]
Reindex["build_reindex_plan"]
RenameTable["build_rename_table_plan"]
Insert["build_insert_plan"]
InsertBuffer["InsertBuffer::push_statement\n(batching optimization)"]
Update["build_update_plan"]
Delete["build_delete_plan"]
Truncate["build_truncate_plan"]
Select["build_select_plan"]
Explain["wrap in Explain plan"]
Set["handle SET statement"]
Show["handle SHOW statement"]
Begin["handle BEGIN TRANSACTION"]
Commit["handle COMMIT"]
Rollback["handle ROLLBACK"]
Savepoint["handle SAVEPOINT"]
Release["handle RELEASE"]
Vacuum["handle VACUUM"]
Pragma["handle PRAGMA"]
PlanStmt["PlanStatement"]
AST --> Dispatch
    
 
   Dispatch -->|CREATE TABLE| CreateTable
 
   Dispatch -->|DROP TABLE| DropTable
 
   Dispatch -->|ALTER TABLE| AlterTable
 
   Dispatch -->|CREATE VIEW| CreateView
 
   Dispatch -->|DROP VIEW| DropView
 
   Dispatch -->|CREATE INDEX| CreateIndex
 
   Dispatch -->|DROP INDEX| DropIndex
 
   Dispatch -->|VACUUM REINDEX| Reindex
 
   Dispatch -->|ALTER TABLE RENAME| RenameTable
    
 
   Dispatch -->|INSERT| Insert
    Insert -.may buffer.-> InsertBuffer
 
   Dispatch -->|UPDATE| Update
 
   Dispatch -->|DELETE| Delete
 
   Dispatch -->|TRUNCATE| Truncate
    
 
   Dispatch -->|SELECT| Select
 
   Dispatch -->|EXPLAIN| Explain
    
 
   Dispatch -->|SET| Set
 
   Dispatch -->|SHOW| Show
 
   Dispatch -->|BEGIN| Begin
 
   Dispatch -->|COMMIT| Commit
 
   Dispatch -->|ROLLBACK| Rollback
 
   Dispatch -->|SAVEPOINT| Savepoint
 
   Dispatch -->|RELEASE| Release
 
   Dispatch -->|VACUUM| Vacuum
 
   Dispatch -->|PRAGMA| Pragma
    
 
   CreateTable --> PlanStmt
 
   DropTable --> PlanStmt
 
   AlterTable --> PlanStmt
 
   CreateView --> PlanStmt
 
   DropView --> PlanStmt
 
   CreateIndex --> PlanStmt
 
   DropIndex --> PlanStmt
 
   Reindex --> PlanStmt
 
   RenameTable --> PlanStmt
 
   Insert --> PlanStmt
 
   InsertBuffer --> PlanStmt
 
   Update --> PlanStmt
 
   Delete --> PlanStmt
 
   Truncate --> PlanStmt
 
   Select --> PlanStmt
 
   Explain --> PlanStmt

Planning Stage

The planning stage converts sqlparser::ast::Statement nodes into typed PlanStatement objects. This involves:

  1. Column resolution : Mapping string column names to FieldId identifiers via the system catalog
  2. Expression translation : Converting sqlparser::ast::Expr to llkv_expr::expr::Expr and ScalarExpr
  3. Subquery tracking : Recording correlated subqueries with placeholder bindings
  4. Validation : Ensuring referenced tables and columns exist

SELECT Plan Construction

For SELECT statements, the planner builds a SelectPlan capturing the projections, source tables, filter predicates, grouping, and ordering for the query.

Sources: llkv-sql/src/sql_engine.rs:2801-3500 llkv-plan/src/plans.rs:705-850

Expression Translation

Expression translation converts SQL expressions into the llkv_expr AST, which uses FieldId instead of string column names. This enables efficient field access during execution.

Sources: llkv-plan/src/translation/expression.rs:1-500 llkv-sql/src/sql_engine.rs:4000-4500

flowchart LR
    SQLExpr["sqlparser::ast::Expr\n(column names as strings)"]
Resolver["IdentifierResolver\n(catalog lookups)"]
Translation["translate_scalar / translate_predicate"]
LLKVExpr["llkv_expr::expr::ScalarExpr&lt;FieldId&gt;\nllkv_expr::expr::Expr&lt;FieldId&gt;"]
SQLExpr --> Translation
    Resolver -.provides.-> Translation
 
   Translation --> LLKVExpr
flowchart TB
    PlanStmt["PlanStatement"]
Execute["RuntimeEngine::execute_statement"]
DDL{"Statement Type"}
CreateTableExec["execute_create_table\n(allocate table_id, register schema)"]
DropTableExec["execute_drop_table\n(remove from catalog)"]
AlterTableExec["execute_alter_table\n(modify schema)"]
CreateViewExec["execute_create_view\n(store view definition)"]
CreateIndexExec["execute_create_index\n(build index structures)"]
InsertExec["execute_insert\n(convert to RecordBatch, append)"]
UpdateExec["execute_update\n(filter + rewrite rows)"]
DeleteExec["execute_delete\n(mark rows deleted via MVCC)"]
TruncateExec["execute_truncate\n(clear all rows)"]
SelectExec["QueryExecutor::execute_select\n(scan, filter, project, aggregate)"]
Result["RuntimeStatementResult"]
PlanStmt --> Execute
 
   Execute --> DDL
    
 
   DDL -->|CreateTable| CreateTableExec
 
   DDL -->|DropTable| DropTableExec
 
   DDL -->|AlterTable| AlterTableExec
 
   DDL -->|CreateView| CreateViewExec
 
   DDL -->|CreateIndex| CreateIndexExec
    
 
   DDL -->|Insert| InsertExec
 
   DDL -->|Update| UpdateExec
 
   DDL -->|Delete| DeleteExec
 
   DDL -->|Truncate| TruncateExec
    
 
   DDL -->|Select| SelectExec
    
 
   CreateTableExec --> Result
 
   DropTableExec --> Result
 
   AlterTableExec --> Result
 
   CreateViewExec --> Result
 
   CreateIndexExec --> Result
 
   InsertExec --> Result
 
   UpdateExec --> Result
 
   DeleteExec --> Result
 
   TruncateExec --> Result
 
   SelectExec --> Result

Execution Stage

The RuntimeEngine receives a PlanStatement and dispatches it to the appropriate execution handler:

Sources: llkv-runtime/src/lib.rs:1-500 llkv-sql/src/sql_engine.rs:706-745

flowchart TB
    SelectPlan["SelectPlan"]
Executor["QueryExecutor::execute_select"]
Strategy{"Execution Strategy"}
NoTable["execute_select_without_table\n(constant projection)"]
Compound["execute_compound_select\n(UNION/EXCEPT/INTERSECT)"]
GroupBy["execute_group_by_single_table\n(hash aggregation)"]
CrossProduct["execute_cross_product\n(nested loop join)"]
Aggregates["execute_aggregates\n(full-table aggregation)"]
Projection["execute_projection\n(scan + filter + project)"]
ScanStream["Table::scan_stream\n(streaming batches)"]
FilterEval["filter_row_ids\n(predicate evaluation)"]
Gather["gather_rows\n(assemble RecordBatch)"]
Sort["lexsort_to_indices\n(ORDER BY)"]
LimitOffset["take + offset\n(LIMIT/OFFSET)"]
SelectExecution["SelectExecution\n(result wrapper)"]
SelectPlan --> Executor
 
   Executor --> Strategy
    
 
   Strategy -->|tables.is_empty| NoTable
 
   Strategy -->|compound.is_some| Compound
 
   Strategy -->|!group_by.is_empty| GroupBy
 
   Strategy -->|tables.len > 1| CrossProduct
 
   Strategy -->|!aggregates.is_empty| Aggregates
 
   Strategy -->|single table| Projection
    
 
   Projection --> ScanStream
 
   ScanStream --> FilterEval
 
   FilterEval --> Gather
 
   Gather --> Sort
 
   Sort --> LimitOffset
    
 
   NoTable --> SelectExecution
 
   Compound --> SelectExecution
 
   GroupBy --> SelectExecution
 
   CrossProduct --> SelectExecution
 
   Aggregates --> SelectExecution
 
   LimitOffset --> SelectExecution

SELECT Execution Flow

SELECT statement execution is the most complex path, involving multiple optimization strategies:

Sources: llkv-executor/src/lib.rs:519-563 llkv-executor/src/scan.rs:1-500

Two-Phase Execution for Filtering

When a WHERE clause is present, execution follows a two-phase approach, sketched in code after the list:

  1. Phase 1: Row ID Collection

    • Evaluate predicates against stored data
    • Use chunk metadata for pruning (min/max values)
    • Build bitmap of matching row IDs
  2. Phase 2: Data Gathering

    • Fetch only projected columns for matching rows
    • Assemble Arrow RecordBatch from gathered data
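
The two phases can be illustrated with a self-contained toy over plain vectors; the real executor works on Arrow arrays, roaring bitmaps, and serialized chunks:

```rust
// Toy chunk: a max statistic plus row IDs and values.
struct Chunk {
    max: i64,
    row_ids: Vec<u64>,
    values: Vec<i64>,
}

/// Phase 1: collect row IDs where `value > threshold`, pruning chunks whose
/// max cannot exceed the threshold.
fn filter_row_ids(chunks: &[Chunk], threshold: i64) -> Vec<u64> {
    let mut hits = Vec::new();
    for chunk in chunks {
        if chunk.max <= threshold {
            continue; // pruned using chunk metadata
        }
        for (rid, v) in chunk.row_ids.iter().zip(&chunk.values) {
            if *v > threshold {
                hits.push(*rid);
            }
        }
    }
    hits
}

/// Phase 2: gather the projected values for the matching row IDs only.
fn gather(chunks: &[Chunk], row_ids: &[u64]) -> Vec<i64> {
    row_ids
        .iter()
        .filter_map(|rid| {
            chunks.iter().find_map(|chunk| {
                chunk.row_ids.iter().position(|r| r == rid).map(|i| chunk.values[i])
            })
        })
        .collect()
}
```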

Sources: llkv-executor/src/scan.rs:200-400 (see also Diagram 6 in the Overview)

flowchart LR
    RuntimeResult["RuntimeStatementResult"]
Branch{"Result Type"}
DDLResult["CreateTable/DropTable/etc\n(metadata only)"]
DMLResult["Insert/Update/Delete\n(row count)"]
SelectResult["Select\n(SelectExecution)"]
SelectExec["SelectExecution"]
Stream["stream(callback)\n(iterate batches)"]
IntoBatches["into_batches()\n(collect all)"]
IntoRows["into_rows()\n(materialize)"]
Batches["Vec&lt;RecordBatch&gt;"]
RuntimeResult --> Branch
 
   Branch -->|DDL| DDLResult
 
   Branch -->|DML| DMLResult
 
   Branch -->|SELECT| SelectResult
    
 
   SelectResult --> SelectExec
 
   SelectExec --> Stream
 
   SelectExec --> IntoBatches
 
   SelectExec --> IntoRows
    
 
   Stream --> Batches
 
   IntoBatches --> Batches

Result Formatting

All execution paths ultimately produce a RuntimeStatementResult, which is converted to Arrow RecordBatch objects for SELECT queries:

The SelectExecution type provides three consumption modes:

| Method | Behavior | Use Case |
|---|---|---|
| stream(callback) | Iterate batches without allocating | Large result sets, streaming output |
| into_batches() | Collect all batches into Vec | Moderate result sets, need random access |
| into_rows() | Materialize as Vec<CanonicalRow> | Small result sets, need row-level operations |

Sources: llkv-executor/src/types/execution.rs:1-300 llkv-sql/src/sql_engine.rs:1100-1150

Optimization Points in the Pipeline

Several optimization strategies are applied throughout the pipeline:

1. SQL Preprocessing Optimizations

  • Empty IN list rewriting : x IN () becomes constant false, enabling early elimination
  • Index hint removal : Allows the planner to make optimal index choices without user hints

Sources: llkv-sql/src/sql_engine.rs:835-856 llkv-sql/src/sql_engine.rs:858-875

2. INSERT Buffering

  • Batching : Consecutive literal INSERT statements are buffered until reaching MAX_BUFFERED_INSERT_ROWS (8192); see the sketch after this list
  • Throughput : Dramatically reduces planning overhead for bulk ingest workloads
  • Semantics : Disabled by default to preserve transactional visibility
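
A hedged sketch of the buffering idea; only the MAX_BUFFERED_INSERT_ROWS constant comes from the text, the rest is illustrative:

```rust
// Threshold named above; buffered rows are flushed as one batched INSERT
// plan once it is reached (or when a non-INSERT statement arrives).
const MAX_BUFFERED_INSERT_ROWS: usize = 8192;

#[derive(Default)]
struct InsertBuffer {
    rows: Vec<Vec<String>>, // literal rows awaiting a single batched append
}

impl InsertBuffer {
    /// Returns true when the caller should flush the buffer.
    fn push(&mut self, row: Vec<String>) -> bool {
        self.rows.push(row);
        self.rows.len() >= MAX_BUFFERED_INSERT_ROWS
    }

    fn drain(&mut self) -> Vec<Vec<String>> {
        std::mem::take(&mut self.rows)
    }
}
```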

Sources: llkv-sql/src/sql_engine.rs:486-547 llkv-sql/src/sql_engine.rs:1200-1300

3. Predicate Pushdown

  • Chunk pruning : Min/max metadata on column chunks enables skipping irrelevant data
  • Vectorized evaluation : SIMD instructions accelerate predicate evaluation within chunks

Sources: llkv-executor/src/scan.rs:300-400

4. Projection Pushdown

  • Lazy column loading : Only requested columns are fetched from storage
  • Computed projection caching : Identical expressions are evaluated once and reused

Sources: llkv-executor/src/lib.rs:469-501

5. Fast Paths

  • Constant SELECT : Queries without tables avoid storage access entirely
  • Full table scan : Queries without predicates stream directly without bitmap construction

Sources: llkv-executor/src/lib.rs:533-534

Parameter Binding for Prepared Statements

The pipeline supports parameterized queries via PreparedStatement.

Parameter placeholders are normalized during preprocessing:

  • ?, ?N, $N: Positional parameters (1-indexed)
  • :name: Named parameters (converted to indices)

Sources: llkv-sql/src/sql_engine.rs:78-283 llkv-sql/src/sql_engine.rs:354-373

Summary

The SQL query processing pipeline in LLKV follows a clear multi-stage architecture:

  1. Preprocessing : Normalize SQL dialect differences
  2. Parsing : Convert text to AST (sqlparser)
  3. Planning : Build typed PlanStatement with column resolution
  4. Execution : Execute via RuntimeEngine and QueryExecutor
  5. Result Formatting : Return Arrow RecordBatch objects

Key design principles:

  • Dialect flexibility : Preprocessing enables SQLite, DuckDB, and PostgreSQL syntax
  • Type safety : Early resolution of column names to FieldId prevents runtime errors
  • Streaming execution : Large result sets never require full materialization
  • Optimization opportunities : Metadata pruning, SIMD evaluation, and projection pushdown reduce data movement

For implementation details of specific stages, refer to the linked subsystem pages.

Data Formats and Arrow Integration

This page explains how Apache Arrow serves as the foundational columnar data format throughout LLKV, covering the types supported, serialization strategies, and integration points across the system. Arrow provides the in-memory representation for all data operations, from SQL query results to storage layer batches.

For information about how Arrow data flows through query execution, see Query Execution. For details on how the storage layer persists Arrow arrays, see Column Storage and ColumnStore.


RecordBatch as the Universal Data Container

The arrow::record_batch::RecordBatch type is the primary data exchange unit across all LLKV layers. A RecordBatch represents a columnar slice of data with a fixed schema, where each column is an ArrayRef (type-erased Arrow array).

Sources: llkv-table/src/table.rs:6-9 llkv-table/examples/test_streaming.rs:1-9

graph TB
    RecordBatch["RecordBatch"]
Schema["Schema\n(arrow::datatypes::Schema)"]
Columns["Vec&lt;ArrayRef&gt;\n(arrow::array::ArrayRef)"]
RecordBatch --> Schema
 
   RecordBatch --> Columns
    
 
   Schema --> Fields["Vec&lt;Field&gt;\nColumn Names + Types"]
Columns --> Array1["Column 0: ArrayRef"]
Columns --> Array2["Column 1: ArrayRef"]
Columns --> Array3["Column N: ArrayRef"]
Array1 --> UInt64Array["UInt64Array\n(row_id column)"]
Array2 --> Int64Array["Int64Array\n(user data)"]
Array3 --> StringArray["StringArray\n(text data)"]

RecordBatch Construction Pattern

All data inserted into tables must arrive as a RecordBatch with specific metadata:

| Required Element | Description | Metadata Key |
|---|---|---|
| row_id column | UInt64Array with unique row identifiers | None (system column) |
| Field ID metadata | Maps each user column to a FieldId | "field_id" |
| Arrow data type | Native Arrow type (Int64, Utf8, etc.) | N/A |
| Column name | Human-readable identifier | Field name |

Sources: llkv-table/src/table.rs:231-438 llkv-table/examples/test_streaming.rs:18-58
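The sketch below shows what such a batch can look like when built with the arrow crate. The column name amount and field id 42 are illustrative assumptions; only the row_id column and the "field_id" metadata key follow the convention described above.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> arrow::error::Result<RecordBatch> {
    // System row_id column: non-nullable UInt64 with unique identifiers.
    let row_id_field = Field::new("row_id", DataType::UInt64, false);

    // User column tagged with the "field_id" metadata entry.
    let mut meta = HashMap::new();
    meta.insert("field_id".to_string(), "42".to_string());
    let value_field = Field::new("amount", DataType::Int64, true).with_metadata(meta);

    let schema = Arc::new(Schema::new(vec![row_id_field, value_field]));
    let row_ids: ArrayRef = Arc::new(UInt64Array::from(vec![0_u64, 1, 2]));
    let values: ArrayRef = Arc::new(Int64Array::from(vec![Some(10_i64), None, Some(30)]));

    RecordBatch::try_new(schema, vec![row_ids, values])
}
```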


Schema and Field Metadata

The arrow::datatypes::Schema describes the structure of a RecordBatch. LLKV extends Arrow’s schema system with custom metadata to track column identity and table ownership.

Field Metadata Convention

Each user column’s Field carries a field_id metadata entry that encodes either:

  • User columns: A raw FieldId (0-2047) that gets namespaced to a LogicalFieldId
  • MVCC columns: A pre-computed LogicalFieldId for created_by / deleted_by

Sources: llkv-table/src/table.rs:243-320 llkv-table/constants.rs (FIELD_ID_META_KEY)

graph TB
    Field["arrow::datatypes::Field"]
FieldName["name: String"]
DataType["data_type: DataType"]
Nullable["nullable: bool"]
Metadata["metadata: HashMap&lt;String, String&gt;"]
Field --> FieldName
 
   Field --> DataType
 
   Field --> Nullable
 
   Field --> Metadata
    
 
   Metadata --> FieldIdKey["'field_id' → '42'\n(user column)"]
Metadata --> LFidKey["'field_id' → '8589934634'\n(MVCC column)"]
FieldIdKey -.converted to.-> UserLFid["LogicalFieldId::for_user(table_id, 42)"]
LFidKey -.direct use.-> SystemLFid["LogicalFieldId::for_mvcc_created_by(table_id)"]

Schema Reconstruction from Catalog

The Table::schema() method reconstructs an Arrow Schema by querying the system catalog for column names and combining them with type information from the column store:

Sources: llkv-table/src/table.rs:519-549

sequenceDiagram
    participant Caller
    participant Table
    participant ColumnStore
    participant SysCatalog
    
    Caller->>Table: schema()
    Table->>ColumnStore: user_field_ids_for_table(table_id)
    ColumnStore-->>Table: Vec&lt;LogicalFieldId&gt;
    Table->>SysCatalog: get_cols_meta(field_ids)
    SysCatalog-->>Table: Vec&lt;ColMeta&gt;
    Table->>ColumnStore: data_type(lfid) for each
    ColumnStore-->>Table: DataType
    Table->>Table: Build Field with metadata
    Table-->>Caller: Arc&lt;Schema&gt;

Supported Arrow Data Types

LLKV’s column store supports the following Arrow data types for user columns:

| Arrow Type | Rust Type Mapping | Storage Encoding | Notes |
|---|---|---|---|
| UInt64 | u64 | Native | Row IDs, MVCC transaction IDs |
| UInt32 | u32 | Native | - |
| UInt16 | u16 | Native | - |
| UInt8 | u8 | Native | - |
| Int64 | i64 | Native | - |
| Int32 | i32 | Native | - |
| Int16 | i16 | Native | - |
| Int8 | i8 | Native | - |
| Float64 | f64 | IEEE 754 | - |
| Float32 | f32 | IEEE 754 | - |
| Date32 | i32 (days since epoch) | Native | - |
| Date64 | i64 (ms since epoch) | Native | - |
| Decimal128 | i128 | Native + precision/scale metadata | Fixed-point decimal |
| Utf8 | String (i32 offsets) | Length-prefixed UTF-8 | Variable-length strings |
| LargeUtf8 | String (i64 offsets) | Length-prefixed UTF-8 | Large strings |
| Binary | Vec<u8> (i32 offsets) | Length-prefixed bytes | Variable-length binary |
| LargeBinary | Vec<u8> (i64 offsets) | Length-prefixed bytes | Large binary |
| Boolean | bool | Bit-packed | - |
| Struct | Nested fields | Passthrough (no specialized gather) | Limited support |

Sources: llkv-column-map/src/store/projection.rs:135-185 llkv-column-map/src/gather.rs:10-18

Type Dispatch Strategy

The column store uses Rust enums to dispatch operations per data type, ensuring type-safe handling without runtime reflection:

Sources: llkv-column-map/src/store/projection.rs:99-236 llkv-column-map/src/store/projection.rs:237-395

graph TB
    ColumnOutputBuilder["ColumnOutputBuilder\n(enum per type)"]
ColumnOutputBuilder --> Utf8["Utf8 { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> LargeUtf8["LargeUtf8 { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> Binary["Binary { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> LargeBinary["LargeBinary { builder, len_capacity, value_capacity }"]
ColumnOutputBuilder --> Boolean["Boolean { builder, len_capacity }"]
ColumnOutputBuilder --> Decimal128["Decimal128 { builder, precision, scale }"]
ColumnOutputBuilder --> Primitive["Primitive(PrimitiveBuilderKind)"]
ColumnOutputBuilder --> Passthrough["Passthrough (Struct arrays)"]
Primitive --> UInt64Builder["UInt64 { builder, len_capacity }"]
Primitive --> Int64Builder["Int64 { builder, len_capacity }"]
Primitive --> Float64Builder["Float64 { builder, len_capacity }"]
Primitive --> OtherPrimitives["... (10 more primitive types)"]

Array Builders and Construction

Arrow provides builder types (arrow::array::*Builder) for incrementally constructing arrays. LLKV’s projection layer wraps these builders with capacity management to minimize allocations during multi-batch operations.

Capacity Pre-Allocation Strategy

The ColumnOutputBuilder tracks allocated capacity separately from Arrow’s builder to avoid repeated re-allocations:

Sources: llkv-column-map/src/store/projection.rs:189-234 llkv-column-map/src/store/projection.rs:398-457
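A minimal stand-in for this idea, not LLKV's actual ColumnOutputBuilder, built around arrow's StringBuilder: the wrapper remembers the capacities it last reserved and only recreates the builder when a request exceeds them.

```rust
use arrow::array::StringBuilder;

/// Simplified capacity-tracking wrapper (illustrative only).
struct Utf8Output {
    builder: StringBuilder,
    len_capacity: usize,
    value_capacity: usize,
}

impl Utf8Output {
    fn with_capacity(len: usize, value_bytes: usize) -> Self {
        Self {
            builder: StringBuilder::with_capacity(len, value_bytes),
            len_capacity: len,
            value_capacity: value_bytes,
        }
    }

    /// Recreate the builder only when the cached capacities are too small;
    /// otherwise the existing allocation is reused for the next batch.
    fn ensure_capacity(&mut self, len: usize, value_bytes: usize) {
        if self.len_capacity < len || self.value_capacity < value_bytes {
            self.builder = StringBuilder::with_capacity(len, value_bytes);
            self.len_capacity = len;
            self.value_capacity = value_bytes;
        }
    }

    fn append(&mut self, value: Option<&str>) {
        match value {
            Some(v) => self.builder.append_value(v),
            None => self.builder.append_null(),
        }
    }
}
```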

graph LR
    RequestedLen["Requested Len"]
ValueBytesHint["Value Bytes Hint"]
RequestedLen --> Check["len_capacity < len?"]
ValueBytesHint --> Check2["value_capacity < bytes?"]
Check -->|Yes| Reallocate["Recreate Builder\nwith_capacity(len, bytes)"]
Check2 -->|Yes| Reallocate
    
 
   Check -->|No| Reuse["Reuse Existing Builder"]
Check2 -->|No| Reuse
    
 
   Reallocate --> UpdateCapacity["Update cached capacities"]
UpdateCapacity --> Append["append_value() or append_null()"]
Reuse --> Append

String and Binary Builder Pattern

Variable-length types require two capacity dimensions:

  • Length capacity: Number of values
  • Value capacity: Total bytes across all values

Sources: llkv-column-map/src/store/projection.rs:398-411 llkv-column-map/src/store/projection.rs:413-426


graph TB
    subgraph "In-Memory Layer"
        RecordBatch1["RecordBatch\n(user input)"]
ArrayRef1["ArrayRef\n(Int64Array, StringArray, etc.)"]
end
    
    subgraph "Serialization Layer"
        Serialize["serialize_array()\n(Arrow IPC encoding)"]
Deserialize["deserialize_array()\n(Arrow IPC decoding)"]
end
    
    subgraph "Storage Layer"
        ChunkBlob["EntryHandle\n(byte blob)"]
Pager["Pager::batch_put()"]
PhysicalKey["PhysicalKey\n(u64 identifier)"]
end
    
 
   RecordBatch1 --> ArrayRef1
 
   ArrayRef1 --> Serialize
 
   Serialize --> ChunkBlob
 
   ChunkBlob --> Pager
 
   Pager --> PhysicalKey
    
    PhysicalKey -.read.-> Pager
    Pager -.read.-> ChunkBlob
 
   ChunkBlob --> Deserialize
 
   Deserialize --> ArrayRef1

Serialization and Storage Integration

Arrow arrays are serialized to byte blobs using Arrow’s IPC format before persistence in the key-value store. The column store maintains a separation between logical Arrow representation and physical storage:

Sources: llkv-column-map/src/serialization.rs (deserialize_array function), llkv-column-map/src/store/projection.rs:17
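As a rough illustration of the IPC round trip, the sketch below serializes an ArrayRef by wrapping it in a single-column RecordBatch and writing it through Arrow's stream writer. LLKV's serialize_array / deserialize_array helpers may use a different framing, so treat this as the general idea rather than the exact implementation.

```rust
use std::io::Cursor;
use std::sync::Arc;

use arrow::array::ArrayRef;
use arrow::datatypes::{Field, Schema};
use arrow::error::Result;
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

/// Serialize one array to an in-memory byte blob and read it back.
fn ipc_round_trip(array: ArrayRef) -> Result<ArrayRef> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "values",
        array.data_type().clone(),
        true,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![array])?;

    // Serialize: the resulting bytes are what a Pager would persist.
    let mut bytes = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut bytes, &schema)?;
        writer.write(&batch)?;
        writer.finish()?;
    }

    // Deserialize the blob back into an Arrow array.
    let mut reader = StreamReader::try_new(Cursor::new(bytes), None)?;
    let decoded = reader.next().expect("one batch")?;
    Ok(decoded.column(0).clone())
}
```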

Chunk Organization

Large columns are split into multiple chunks (default 65,536 rows per chunk). Each chunk stores:

  • Value array: Serialized Arrow array with actual data
  • Row ID array: Separate UInt64Array mapping row IDs to array indices

Sources: llkv-column-map/src/store/descriptor.rs (ChunkMetadata, ColumnDescriptor), llkv-column-map/src/store/mod.rs (DEFAULT_CHUNK_TARGET_ROWS)


graph TB
    GatherNullPolicy["GatherNullPolicy"]
GatherNullPolicy --> ErrorOnMissing["ErrorOnMissing\nMissing row → Error"]
GatherNullPolicy --> IncludeNulls["IncludeNulls\nMissing row → NULL in output"]
GatherNullPolicy --> DropNulls["DropNulls\nMissing row → omit from output"]
ErrorOnMissing -.used by.-> InnerJoin["Inner Joins\nStrict row matches"]
IncludeNulls -.used by.-> OuterJoin["Outer Joins\nNULL padding"]
DropNulls -.used by.-> Projection["Filtered Projections\nRemove incomplete rows"]

Null Handling

Arrow’s nullable columns use a separate validity bitmap. LLKV’s GatherNullPolicy enum controls how nulls propagate during row gathering:

Sources: llkv-column-map/src/store/projection.rs:40-48 llkv-table/src/table.rs:589-645

Null Bitmap Semantics

Arrow arrays maintain a separate NullBuffer (bit-packed). When gathering rows, the system must:

  1. Check if the source array position is valid (array.is_valid(idx))
  2. Append null to the builder if invalid, or the value if valid

Sources: llkv-column-map/src/gather.rs:369-402
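A small sketch of that rule for a string column; the real gather code in llkv-column-map is type-dispatched across all supported builders, so this only illustrates the validity check.

```rust
use arrow::array::{Array, StringArray, StringBuilder};

/// Copy the requested positions into the builder, preserving validity.
fn gather_strings(source: &StringArray, positions: &[usize], out: &mut StringBuilder) {
    for &idx in positions {
        if source.is_valid(idx) {
            out.append_value(source.value(idx));
        } else {
            out.append_null();
        }
    }
}
```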


graph LR
    subgraph "CSV Export"
        ScanStream["Table::scan_stream()"]
RecordBatch2["RecordBatch stream"]
ArrowCSV["arrow::csv::Writer"]
CSVFile["CSV File"]
ScanStream --> RecordBatch2
 
       RecordBatch2 --> ArrowCSV
 
       ArrowCSV --> CSVFile
    end
    
    subgraph "CSV Import"
        CSVInput["CSV File"]
Parser["csv-core parser"]
Builders["Array Builders"]
RecordBatch3["RecordBatch"]
TableAppend["Table::append()"]
CSVInput --> Parser
 
       Parser --> Builders
 
       Builders --> RecordBatch3
 
       RecordBatch3 --> TableAppend
    end

Arrow Integration Points

CSV Import and Export

The llkv-csv crate uses Arrow’s native CSV writer (arrow::csv::WriterBuilder) for exports and custom parsing for imports:

Sources: llkv-csv/src/writer.rs:196-267 llkv-csv/src/reader.rs
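A minimal export sketch using Arrow's CSV writer; table setup and the scan_stream call are omitted, and the batches slice stands in for the streamed results.

```rust
use std::fs::File;

use arrow::csv::WriterBuilder;
use arrow::record_batch::RecordBatch;

/// Write already-materialized batches to a CSV file.
fn export_csv(path: &str, batches: &[RecordBatch]) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create(path)?;
    let mut writer = WriterBuilder::new().build(file);
    for batch in batches {
        writer.write(batch)?;
    }
    Ok(())
}
```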

SQL Query Results

All SQL query results are returned as RecordBatch instances, enabling zero-copy integration with Arrow-based tools:

| Statement Type | Output Format | Arrow Schema Source |
|---|---|---|
| SELECT | Stream of RecordBatch | Query projection list |
| INSERT | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| UPDATE | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| DELETE | RecordBatch (row count) | Fixed schema: { affected_rows: UInt64 } |
| CREATE TABLE | RecordBatch (success) | Empty schema or metadata |

Sources: llkv-runtime/src/lib.rs (RuntimeStatementResult enum), llkv-sql/src/lib.rs (SqlEngine::execute)

graph TB
    AggregateExpr["Aggregate Expression\n(SUM, AVG, MIN, MAX)"]
AggregateExpr --> ArrowKernel["arrow::compute kernels\n(sum, min, max)"]
AggregateExpr --> CustomAccumulator["Custom Accumulator\n(COUNT DISTINCT, etc.)"]
ArrowKernel --> PrimitiveArray["PrimitiveArray&lt;T&gt;"]
CustomAccumulator --> AnyArray["ArrayRef (any type)"]
PrimitiveArray -.vectorized ops.-> SIMDPath["SIMD-accelerated compute"]
AnyArray -.type-erased dispatch.-> SlowPath["Generic accumulation"]

Arrow-Native Aggregations

Aggregate functions operate directly on Arrow arrays using arrow::compute kernels where possible:

Sources: llkv-aggregate/src/lib.rs (aggregate accumulators), llkv-compute/src/kernels.rs (NumericKernels)
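For example, summing or taking the minimum of a primitive column goes straight through arrow::compute, which skips nulls and returns None for all-null input:

```rust
use arrow::array::Int64Array;
use arrow::compute;

fn main() {
    let values = Int64Array::from(vec![Some(3_i64), None, Some(7)]);

    // Vectorized kernels operate on the whole column at once.
    let total = compute::sum(&values); // Some(10)
    let smallest = compute::min(&values); // Some(3)

    println!("sum = {total:?}, min = {smallest:?}");
}
```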


Data Type Conversion Table

The following table maps SQL types to Arrow types as materialized in the system:

| SQL Type | Arrow Type | Notes |
|---|---|---|
| INTEGER, INT | Int64 | 64-bit signed |
| BIGINT | Int64 | Same as INTEGER |
| SMALLINT | Int32 | Promoted to Int32 internally |
| TINYINT | Int32 | Promoted to Int32 internally |
| FLOAT, REAL | Float32 | Single precision |
| DOUBLE | Float64 | Double precision |
| TEXT, VARCHAR | Utf8 | Variable-length, i32 offsets |
| BLOB | Binary | Variable-length bytes |
| BOOLEAN, BOOL | Boolean | Bit-packed |
| DATE | Date32 | Days since Unix epoch |
| TIMESTAMP | Date64 | Milliseconds since Unix epoch |
| DECIMAL(p, s) | Decimal128(p, s) | Fixed-point, stores precision/scale |

Sources: llkv-plan/src/types.rs (SQL type mapping), llkv-types/src/arrow.rs (DataType conversion)


RecordBatch Flow Summary

Sources: llkv-table/src/table.rs:231-438 llkv-column-map/src/store/mod.rs (append logic)


Performance Considerations

Vectorization Benefits

Arrow’s columnar layout enables SIMD operations in compute kernels:

  • Primitive aggregations: arrow::compute::sum, min, max use SIMD
  • Filtering: Bit-packed boolean arrays enable fast predicate evaluation
  • Null bitmap checks: Batch validity checks avoid per-row branches

Memory Efficiency

Columnar storage provides superior compression and cache locality:

  • Compression ratio: Homogeneous columns compress better than row-based formats
  • Cache efficiency: Scanning a single column loads contiguous memory
  • Null optimization: Sparse nulls use minimal space (1 bit per value)
graph TB
    GatherContextPool["GatherContextPool"]
GatherContextPool --> AcquireContext["acquire(field_ids)"]
AcquireContext --> CheckCache["Check HashMap\nfor cached context"]
CheckCache -->|Hit| ReuseContext["Reuse MultiGatherContext"]
CheckCache -->|Miss| CreateNew["Create new context"]
ReuseContext --> Execute["Execute gather operation"]
CreateNew --> Execute
    
 
   Execute --> ReleaseContext["Release context back to pool"]
ReleaseContext --> StoreInPool["Store in HashMap\n(max 4 per field set)"]

Builder Reuse Strategy

The MultiGatherContext pools ColumnOutputBuilder instances to avoid repeated allocations during streaming scans:

Sources: llkv-column-map/src/store/projection.rs:651-691


Key Takeaways:

  • RecordBatch is the universal data container for all operations
  • Field metadata tracks column identity via field_id keys
  • Type dispatch uses Rust enums to handle 18+ Arrow data types efficiently
  • Serialization uses Arrow IPC format for persistence
  • Null handling respects Arrow’s validity bitmaps with configurable policies
  • Integration is native: CSV, SQL results, and storage all speak Arrow


SQL Interface


Relevant source files

The SQL Interface provides the primary user-facing API for interacting with LLKV databases through SQL statements. This layer is responsible for parsing SQL text, preprocessing dialect-specific syntax, translating Abstract Syntax Trees (AST) into execution plans, and delegating to the runtime engine for actual execution.

For information about query planning and execution, see Query Planning and Query Execution. For details on the underlying runtime that executes plans, see Architecture.


Purpose and Scope

The SQL Interface layer (llkv-sql crate) serves as the entry point for SQL-based database operations. It bridges the gap between SQL text written by users and the Arrow-native columnar storage engine, handling:

  • SQL Parsing : Converting SQL strings into Abstract Syntax Trees using sqlparser-rs
  • Dialect Normalization : Preprocessing SQL to handle syntax variations from SQLite, DuckDB, and PostgreSQL
  • Statement Translation : Converting ASTs into typed execution plans (PlanStatement structures)
  • Execution Coordination : Delegating plans to the RuntimeEngine and formatting results
  • Transaction Management : Coordinating multi-statement transactions with MVCC support
  • Performance Optimization : Batching INSERT statements to reduce planning overhead

The SQL Interface does not execute queries directly—it delegates all data operations to the runtime and executor layers. Its primary responsibility is accurate SQL-to-plan translation while preserving user intent across different SQL dialects.

Sources : llkv-sql/src/lib.rs:1-52 llkv-sql/src/sql_engine.rs:1-60


Architecture Overview

The SQL Interface operates as a stateful wrapper around the RuntimeEngine, maintaining session state, prepared statement caches, and insert buffering state. The core workflow proceeds through several stages:

SQL Interface Processing Pipeline

graph TB
    User["User Application"]
SqlEngine["SqlEngine"]
Preprocess["SQL Preprocessing"]
Parser["sqlparser-rs\nGenericDialect"]
Translator["Statement Translation"]
Runtime["RuntimeEngine"]
Results["RecordBatch / RowCount"]
User -->|execute sql| SqlEngine
 
   SqlEngine -->|1. Preprocess| Preprocess
 
   Preprocess -->|Normalized SQL| Parser
 
   Parser -->|AST Statement| Translator
 
   Translator -->|PlanStatement| Runtime
 
   Runtime -->|RuntimeStatementResult| SqlEngine
 
   SqlEngine -->|Vec<RecordBatch>| Results
 
   Results -->|Query Results| User
    
    subgraph "llkv-sql Crate"
        SqlEngine
        Preprocess
        Translator
    end
    
    subgraph "External Dependencies"
        Parser
    end
    
    subgraph "llkv-runtime Crate"
        Runtime
    end

The SqlEngine maintains several internal subsystems:

  1. Statement Cache : Thread-safe prepared statement storage (RwLock<FxHashMap<String, Arc<PreparedPlan>>>)
  2. Insert Buffer : Cross-statement INSERT batching for bulk ingest (RefCell<Option<InsertBuffer>>)
  3. Session State : Transaction context and configuration flags
  4. Information Schema Cache : Lazy metadata refresh tracking

Sources : llkv-sql/src/sql_engine.rs:572-621; Diagram 1 from the Overview


SqlEngine Structure

The SqlEngine struct encapsulates all SQL processing state and exposes methods for executing SQL statements:

SqlEngine Class Structure

classDiagram
    class SqlEngine {-engine: RuntimeEngine\n-default_nulls_first: AtomicBool\n-insert_buffer: RefCell~Option~InsertBuffer~~\n-insert_buffering_enabled: AtomicBool\n-information_schema_ready: AtomicBool\n-statement_cache: RwLock~FxHashMap~String PreparedPlan~~\n+new(pager) SqlEngine\n+with_context(context, nulls_first) SqlEngine\n+execute(sql) Result~Vec~RuntimeStatementResult~~\n+sql(sql) Result~Vec~RecordBatch~~\n+prepare(sql) Result~PreparedStatement~\n+execute_prepared(stmt, params) Result~Vec~RuntimeStatementResult~~\n+session() RuntimeSession\n+runtime_context() Arc~RuntimeContext~}
    
    class RuntimeEngine {+execute_statement(plan) Result~RuntimeStatementResult~\n+context() Arc~RuntimeContext~}
    
    class PreparedStatement {-inner: Arc~PreparedPlan~\n+parameter_count() usize}
    
    class InsertBuffer {-table_name: String\n-columns: Vec~String~\n-rows: Vec~Vec~PlanValue~~\n-on_conflict: InsertConflictAction\n+can_accept(table, cols, conflict) bool\n+should_flush() bool}
    
    SqlEngine --> RuntimeEngine : delegates to
    SqlEngine --> PreparedStatement : creates
    SqlEngine --> InsertBuffer : maintains

Sources : llkv-sql/src/sql_engine.rs:572-621 llkv-sql/src/sql_engine.rs:354-373 llkv-sql/src/sql_engine.rs:487-547


Statement Execution Flow

A typical SQL execution proceeds through the following phases:

Statement Execution Sequence

Sources : llkv-sql/src/sql_engine.rs:1533-1633 llkv-sql/src/sql_engine.rs:2134-2285


SQL Preprocessing System

Before parsing, the SQL Interface applies multiple preprocessing passes to normalize syntax variations across different SQL dialects:

| Preprocessor | Purpose | Example Transformation |
|---|---|---|
| TPCH Connect Syntax | Strip TPC-H CONNECT TO directives | CONNECT TO db; → removed |
| CREATE TYPE Syntax | Convert DuckDB type aliases | CREATE TYPE t AS INT → CREATE DOMAIN t AS INT |
| EXCLUDE Syntax | Quote qualified names in EXCLUDE | EXCLUDE (schema.col) → EXCLUDE ("schema.col") |
| Trailing Commas | Remove DuckDB-style trailing commas | VALUES (1,) → VALUES (1) |
| Empty IN Lists | Convert degenerate IN expressions | col IN () → (col = NULL AND 0 = 1) |
| Index Hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| REINDEX Syntax | Convert SQLite REINDEX | REINDEX idx → VACUUM REINDEX idx |
| Trigger Shorthand | Expand SQLite trigger defaults | CREATE TRIGGER tr ... → adds AFTER, FOR EACH ROW |
| Bare Table IN | Expand SQLite table subqueries | col IN tablename → col IN (SELECT * FROM tablename) |

Each preprocessor runs as a regex-based rewrite pass before the SQL text reaches sqlparser-rs. This allows LLKV to accept SQL from multiple dialects while using a single parser.

Sources : llkv-sql/src/sql_engine.rs:759-1006 llkv-sql/src/sql_engine.rs:771-793


Prepared Statements and Parameters

The SQL Interface supports prepared statements with parameterized queries. Parameters use placeholder syntax compatible with multiple dialects:

Parameter Processing Flow

graph LR
    subgraph "Parameter Placeholder Formats"
        Q1["? \n(anonymous)"]
Q2["?N \n(numbered)"]
Dollar["$N \n(PostgreSQL)"]
Named[":name \n(named)"]
end
    
    subgraph "Parameter Processing"
        Register["register_placeholder(raw)"]
State["ParameterState"]
Index["assigned: HashMap<String, usize>"]
end
    
    subgraph "Execution"
        Prepare["SqlEngine::prepare(sql)"]
PrepStmt["PreparedStatement"]
Execute["execute_prepared(stmt, params)"]
Bind["Bind params to plan"]
end
    
 
   Q1 --> Register
 
   Q2 --> Register
 
   Dollar --> Register
 
   Named --> Register
 
   Register --> State
 
   State --> Index
 
   Index --> Prepare
 
   Prepare --> PrepStmt
 
   PrepStmt --> Execute
 
   Execute --> Bind

Parameter placeholders are registered during parsing and replaced with internal sentinel values (__llkv_param__N__). When executing a prepared statement, the sentinel values are substituted with actual parameter values before plan execution.

Sources : llkv-sql/src/sql_engine.rs:78-282 llkv-sql/src/sql_engine.rs:354-373 llkv-sql/src/sql_engine.rs:1707-1773


INSERT Buffering Optimization

To improve bulk insert performance, the SQL Interface can buffer multiple consecutive INSERT ... VALUES statements and execute them as a single batched operation. This dramatically reduces planning overhead for large data loads:

INSERT Buffer State Machine

stateDiagram-v2
    [*] --> Empty : engine created
    Empty --> Buffering : INSERT with VALUES
    Buffering --> Buffering : compatible INSERT
    Buffering --> Flushing : incompatible statement
    Buffering --> Flushing : buffer threshold reached
    Flushing --> Empty : execute batched insert
    Flushing --> Buffering : new INSERT after flush
    Empty --> [*]
    
    note right of Buffering
        Accumulate rows while:
        - Same table
        - Same columns
        - Same conflict action
        - Below MAX_BUFFERED_INSERT_ROWS
    end note
    
    note right of Flushing
        Execute single INSERT with
        all accumulated rows
    end note

Buffering is disabled by default to preserve per-statement semantics for unit tests. Long-running workloads can enable buffering via set_insert_buffering(true) to achieve significant performance gains on bulk ingests.

Sources : llkv-sql/src/sql_engine.rs:487-547 llkv-sql/src/sql_engine.rs:2134-2285


graph TD
    AST["sqlparser::ast::Statement"]
subgraph "DDL Statements"
        CreateTable["CreateTable"]
AlterTable["AlterTable"]
DropTable["DropTable"]
CreateView["CreateView"]
CreateIndex["CreateIndex"]
end
    
    subgraph "DML Statements"
        Query["Query (SELECT)"]
Insert["Insert"]
Update["Update"]
Delete["Delete"]
end
    
    subgraph "Transaction Control"
        Begin["BEGIN"]
Commit["COMMIT"]
Rollback["ROLLBACK"]
end
    
    subgraph "PlanStatement Variants"
        SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
UpdatePlan["UpdatePlan"]
DeletePlan["DeletePlan"]
CreateTablePlan["CreateTablePlan"]
AlterTablePlan["AlterTablePlan"]
BeginTxn["BeginTransaction"]
CommitTxn["CommitTransaction"]
end
    
 
   AST --> CreateTable
 
   AST --> AlterTable
 
   AST --> DropTable
 
   AST --> Query
 
   AST --> Insert
 
   AST --> Update
 
   AST --> Delete
 
   AST --> Begin
 
   AST --> Commit
    
 
   CreateTable --> CreateTablePlan
 
   AlterTable --> AlterTablePlan
 
   Query --> SelectPlan
 
   Insert --> InsertPlan
 
   Update --> UpdatePlan
 
   Delete --> DeletePlan
 
   Begin --> BeginTxn
 
   Commit --> CommitTxn

Statement Translation Process

Once SQL is parsed into an AST, the SQL Interface translates each Statement variant into a corresponding PlanStatement:

AST to Plan Translation

Translation involves:

  1. Identifier Resolution : Converting string column names to FieldId values via IdentifierResolver
  2. Expression Translation : Building typed expression trees from AST nodes
  3. Type Inference : Determining result types for computed columns
  4. Validation : Checking for unknown tables, duplicate columns, type mismatches

Sources : llkv-sql/src/sql_engine.rs:2287-3500 llkv-executor/src/lib.rs:89-93


Integration with Runtime and Executor

The SQL Interface delegates all actual execution to lower layers:

Layer Integration Diagram

The SQL Interface owns the RuntimeEngine instance and delegates PlanStatement execution to it. The runtime coordinates with the executor for query execution and the table layer for DDL operations.

Sources : llkv-sql/src/sql_engine.rs:706-745; llkv-runtime/src/lib.rs (implied); Diagram 1 from the Overview


Example Usage

Basic usage of the SqlEngine:
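A minimal sketch of that flow is shown below. The import paths and the MemPager construction are assumptions for illustration; the method names (new, execute, sql) follow the API described in this wiki.

```rust
use std::sync::Arc;

use llkv_sql::SqlEngine; // import path assumed
use llkv_storage::MemPager; // in-memory pager; type and constructor assumed

fn main() {
    // Create an engine backed by an in-memory pager.
    let engine = SqlEngine::new(Arc::new(MemPager::default()));

    // DDL and DML go through execute(), which returns one result per statement.
    engine
        .execute("CREATE TABLE users (id INTEGER, name TEXT)")
        .expect("create table");
    engine
        .execute("INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob')")
        .expect("insert rows");

    // SELECT results come back as Arrow RecordBatch values.
    let batches = engine
        .sql("SELECT id, name FROM users ORDER BY id")
        .expect("run query");
    for batch in &batches {
        println!("{} rows", batch.num_rows());
    }
}
```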

Sources : llkv-sql/src/sql_engine.rs:443-485 llkv-sql/src/lib.rs:1-52


Key Design Principles

The SQL Interface embodies several architectural decisions:

  1. Dialect Tolerance : Accept SQL from multiple databases through preprocessing rather than forking the parser
  2. Lazy Validation : Validate table existence and column types during plan translation, not parsing
  3. Stateful Optimization : Maintain cross-statement state (insert buffer, prepared statement cache) for performance
  4. Delegate Execution : Never directly manipulate storage; always route through runtime abstractions
  5. Arrow Native : Return query results as Arrow RecordBatch structures for zero-copy integration
  6. Session Isolation : Each SqlEngine instance owns its transaction state and constraint enforcement mode

These principles allow the SQL Interface to serve as a stable, predictable API surface while the underlying execution engine evolves independently.

Sources : llkv-sql/src/sql_engine.rs:1-60 llkv-sql/src/lib.rs:1-52; Diagram 2 from the Overview


SqlEngine API


Relevant source files

This document describes the SqlEngine struct, which is the primary user-facing interface for executing SQL statements in LLKV. The SqlEngine parses SQL text using sqlparser-rs, converts it into execution plans, and delegates to the RuntimeEngine for execution. For details on how SQL syntax is normalized before parsing, see SQL Preprocessing and Dialect Handling. For information about the INSERT batching optimization, see INSERT Buffering System.

Sources: llkv-sql/src/sql_engine.rs:441-486 llkv-sql/src/lib.rs:1-52


SqlEngine Struct and Core Components

The SqlEngine struct wraps a RuntimeEngine and adds SQL parsing, statement caching, INSERT buffering, and dialect preprocessing capabilities.

Diagram: SqlEngine component structure and user interaction

graph TB
    SqlEngine["SqlEngine"]
RuntimeEngine["RuntimeEngine"]
RuntimeContext["RuntimeContext&lt;Pager&gt;"]
InsertBuffer["InsertBuffer"]
StatementCache["Statement Cache"]
InfoSchema["Information Schema"]
SqlEngine -->|contains| RuntimeEngine
 
   SqlEngine -->|maintains| InsertBuffer
 
   SqlEngine -->|maintains| StatementCache
 
   SqlEngine -->|tracks| InfoSchema
 
   RuntimeEngine -->|contains| RuntimeContext
 
   RuntimeContext -->|manages| Tables["Tables"]
RuntimeContext -->|manages| Catalog["System Catalog"]
User["User Code"] -->|execute sql| SqlEngine
 
   User -->|prepare sql| SqlEngine
 
   User -->|sql query| SqlEngine
 
   SqlEngine -->|execute_statement| RuntimeEngine
 
   RuntimeEngine -->|Result| SqlEngine
 
   SqlEngine -->|RecordBatch[]| User

Sources: llkv-sql/src/sql_engine.rs:572-620

SqlEngine Fields

| Field | Type | Purpose |
|---|---|---|
| engine | RuntimeEngine | Core execution engine that manages tables, catalog, and transactions |
| default_nulls_first | AtomicBool | Controls default sort order for NULL values in ORDER BY clauses |
| insert_buffer | RefCell<Option<InsertBuffer>> | Accumulates literal INSERT statements for batch execution |
| insert_buffering_enabled | AtomicBool | Feature flag controlling cross-statement INSERT batching |
| information_schema_ready | AtomicBool | Tracks whether the information_schema has been initialized |
| statement_cache | RwLock<FxHashMap<String, Arc<PreparedPlan>>> | Cache for prepared statement plans |

Sources: llkv-sql/src/sql_engine.rs:572-586


Construction and Configuration

Creating a SqlEngine

Diagram: SqlEngine construction paths

Sources: llkv-sql/src/sql_engine.rs:751-757 llkv-sql/src/sql_engine.rs:627-633

Constructor Methods

| Method | Signature | Description |
|---|---|---|
| new | new<Pg>(pager: Arc<Pg>) | Creates an engine with a new RuntimeContext; standard constructor for fresh database instances |
| with_context | with_context(context, default_nulls_first) | Creates an engine from an existing context; used when sharing storage across engine instances |

Configuration Options:

  • default_nulls_first : Controls whether NULL values sort first (true) or last (false) in ORDER BY clauses when NULLS FIRST/LAST is not explicitly specified
  • INSERT buffering : Disabled by default; enable via set_insert_buffering(true) for bulk loading workloads

Sources: llkv-sql/src/sql_engine.rs:751-757 llkv-sql/src/sql_engine.rs:627-633 llkv-sql/src/sql_engine.rs:1431-1449


Statement Execution

Primary Execution Methods

Diagram: Statement execution flow through SqlEngine

Sources: llkv-sql/src/sql_engine.rs:1109-1194 llkv-sql/src/sql_engine.rs:1196-1236 llkv-sql/src/sql_engine.rs:1715-1765

Execution Method Details

| Method | Return Type | Use Case |
|---|---|---|
| execute(&self, sql: &str) | Vec<RuntimeStatementResult> | Execute one or more SQL statements (DDL, DML, SELECT) |
| sql(&self, query: &str) | Vec<RecordBatch> | Execute a single SELECT query and return Arrow results |
| execute_single(&self, sql: &str) | RuntimeStatementResult | Execute exactly one statement, error if multiple |
| prepare(&self, sql: &str) | PreparedStatement | Parse and cache a parameterized statement |
| execute_prepared(&self, stmt, params) | Vec<RuntimeStatementResult> | Execute a prepared statement with bound parameters |

Sources: llkv-sql/src/sql_engine.rs:1109-1194 llkv-sql/src/sql_engine.rs:1196-1236 llkv-sql/src/sql_engine.rs:1238-1272 llkv-sql/src/sql_engine.rs:1715-1765 llkv-sql/src/sql_engine.rs:1767-1816

RuntimeStatementResult Variants

The RuntimeStatementResult enum represents the outcome of executing different statement types:

| Variant | Fields | Produced By |
|---|---|---|
| CreateTable | table_name: String | CREATE TABLE |
| CreateView | view_name: String | CREATE VIEW |
| CreateIndex | index_name: String, table_name: String | CREATE INDEX |
| DropTable | table_name: String | DROP TABLE |
| DropView | view_name: String | DROP VIEW |
| DropIndex | index_name: String | DROP INDEX |
| Insert | table_name: String, rows_inserted: usize | INSERT |
| Update | table_name: String, rows_updated: usize | UPDATE |
| Delete | table_name: String, rows_deleted: usize | DELETE |
| Select | execution: SelectExecution<P> | SELECT (when using execute()) |
| BeginTransaction | kind: TransactionKind | BEGIN TRANSACTION |
| CommitTransaction | — | COMMIT |
| RollbackTransaction | — | ROLLBACK |
| AlterTable | table_name: String | ALTER TABLE |
| Truncate | table_name: String | TRUNCATE |
| Reindex | index_name: Option<String> | REINDEX |

Sources: llkv-runtime/src/lib.rs (inferred from context), llkv-sql/src/lib.rs:50


Prepared Statements and Parameter Binding

Parameter Placeholder Syntax

LLKV supports multiple parameter placeholder styles for prepared statements:

| Style | Example | Description |
|---|---|---|
| Positional (?) | SELECT * FROM t WHERE id = ? | Auto-numbered starting from 1 |
| Numbered (?) | SELECT * FROM t WHERE id = ?2 | Explicit position |
| Numbered ($) | SELECT * FROM t WHERE id = $1 | PostgreSQL style |
| Named (:) | SELECT * FROM t WHERE name = :username | Named parameters |

Sources: llkv-sql/src/sql_engine.rs:94-133 llkv-sql/src/sql_engine.rs:258-268

Parameter Binding Flow

Diagram: Prepared statement lifecycle with parameter binding

Sources: llkv-sql/src/sql_engine.rs:1715-1765 llkv-sql/src/sql_engine.rs:1767-1816 llkv-sql/src/sql_engine.rs:223-256

SqlParamValue Types

The SqlParamValue enum represents typed parameter values:

Conversion methods:

  • as_literal(&self) -> Literal - Converts to internal Literal type for expression evaluation
  • as_plan_value(&self) -> PlanValue - Converts to PlanValue for INSERT operations

Sources: llkv-sql/src/sql_engine.rs:285-316


graph TB
    subgraph "Statement Processing"
        Parse["parse_statement()"]
Canonical["canonicalize_insert()"]
PreparedInsert["PreparedInsert"]
end
    
    subgraph "Buffer Management"
        BufferDecision{"Can Buffer?"}
BufferAdd["buffer.push_statement()"]
BufferFlush["flush_buffer()"]
BufferCheck{"buffer.should_flush()?"}
end
    
    subgraph "Execution"
        BuildPlan["InsertPlan::from_buffer"]
Execute["execute_insert_plan()"]
end
    
 
   Parse -->|INSERT statement| Canonical
 
   Canonical -->|VALUES only| PreparedInsert
 
   PreparedInsert --> BufferDecision
    
 
   BufferDecision -->|Yes: same table, same columns, same on_conflict| BufferAdd
 
   BufferDecision -->|No: different params| BufferFlush
    
 
   BufferAdd --> BufferCheck
 
   BufferCheck -->|Yes: >= 8192 rows| BufferFlush
 
   BufferCheck -->|No: continue| Continue["Continue processing"]
BufferFlush --> BuildPlan
 
   BuildPlan --> Execute
 
   Execute -->|Multiple Insert results| Results["Vec<RuntimeStatementResult>"]

INSERT Buffering System

The INSERT buffering system batches multiple literal INSERT ... VALUES statements together to reduce planning overhead during bulk loading operations.

Buffering Architecture

Diagram: INSERT buffering decision tree and execution flow

Sources: llkv-sql/src/sql_engine.rs:486-546 llkv-sql/src/sql_engine.rs:1869-2008

InsertBuffer Configuration

| Constant/Field | Value/Type | Purpose |
|---|---|---|
| MAX_BUFFERED_INSERT_ROWS | 8192 | Maximum rows to accumulate before forcing a flush |
| insert_buffering_enabled | AtomicBool | Global enable/disable flag (default: false) |
| table_name | String | Target table for buffered inserts |
| columns | Vec<String> | Column list (must match across statements) |
| on_conflict | InsertConflictAction | Conflict resolution strategy (must match) |
| statement_row_counts | Vec<usize> | Tracks row counts per original statement |
| rows | Vec<Vec<PlanValue>> | Accumulated literal rows |

Buffer acceptance rules:

  1. Table name must match exactly
  2. Column list must match exactly
  3. on_conflict action must match
  4. Only literal VALUES are buffered (subqueries flush immediately)

Flush triggers:

  1. Row count reaches MAX_BUFFERED_INSERT_ROWS
  2. Next INSERT targets different table/columns/conflict action
  3. Non-INSERT statement encountered
  4. execute() completes (flush remaining)
  5. SqlEngine is dropped

Sources: llkv-sql/src/sql_engine.rs:486-546 llkv-sql/src/sql_engine.rs:1431-1449


graph LR
    Input["Raw SQL String"]
TPCH["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
TrailingComma["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
TriggerShorthand["preprocess_sqlite_trigger_shorthand"]
BareTable["preprocess_bare_table_in_clauses"]
Parser["sqlparser::Parser"]
AST["Vec<Statement>"]
Input --> TPCH
 
   TPCH --> CreateType
 
   CreateType --> Exclude
 
   Exclude --> TrailingComma
 
   TrailingComma --> EmptyIn
 
   EmptyIn --> IndexHints
 
   IndexHints --> Reindex
 
   Reindex --> TriggerShorthand
 
   TriggerShorthand --> BareTable
 
   BareTable --> Parser
 
   Parser --> AST

SQL Preprocessing

Before parsing with sqlparser, the SQL text undergoes multiple normalization passes to handle dialect-specific syntax variations. This allows LLKV to accept SQL from SQLite, DuckDB, PostgreSQL, and other sources.

Preprocessing Pipeline

Diagram: SQL preprocessing transformation pipeline

Sources: llkv-sql/src/sql_engine.rs:1024-1103

Preprocessing Transformations

| Method | Transformation | Example |
|---|---|---|
| preprocess_tpch_connect_syntax | Strip CONNECT TO <db>; | CONNECT TO tpch; → (removed) |
| preprocess_create_type_syntax | CREATE TYPE → CREATE DOMAIN | CREATE TYPE myint AS INTEGER → CREATE DOMAIN myint AS INTEGER |
| preprocess_exclude_syntax | Quote qualified EXCLUDE identifiers | EXCLUDE (schema.table.col) → EXCLUDE ("schema.table.col") |
| preprocess_trailing_commas_in_values | Remove trailing commas | VALUES (1, 2,) → VALUES (1, 2) |
| preprocess_empty_in_lists | Convert empty IN to boolean expr | col IN () → (col = NULL AND 0 = 1) |
| preprocess_index_hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
| preprocess_reindex_syntax | Convert to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
| preprocess_sqlite_trigger_shorthand | Add explicit timing/FOR EACH ROW | CREATE TRIGGER t INSERT ON t → CREATE TRIGGER t AFTER INSERT ON t FOR EACH ROW |
| preprocess_bare_table_in_clauses | Wrap table in subquery | col IN tablename → col IN (SELECT * FROM tablename) |

Sources: llkv-sql/src/sql_engine.rs:759-1006


Session and Transaction Management

Accessing the RuntimeSession

The SqlEngine::session() method provides access to the underlying RuntimeSession, which manages transaction state and constraint enforcement:

Key session methods:

  • begin_transaction(kind) - Start a new transaction
  • commit_transaction() - Commit the current transaction
  • rollback_transaction() - Abort the current transaction
  • set_constraint_enforcement_mode(mode) - Control when constraints are checked
  • execute_insert_plan(plan) - Direct plan execution (bypasses SQL parsing)
  • execute_select_plan(plan) - Direct SELECT execution

Sources: llkv-sql/src/sql_engine.rs:1412-1424
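A hedged sketch of multi-statement transaction control, driven here through SQL statements rather than the session API (the session methods listed above expose the same operations programmatically). Import paths and the pager type are assumptions, as in the earlier usage example.

```rust
use std::sync::Arc;

use llkv_sql::SqlEngine; // import path assumed
use llkv_storage::MemPager; // pager type assumed

fn main() {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    engine
        .execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
        .expect("create table");
    engine
        .execute("INSERT INTO accounts VALUES (1, 100), (2, 100)")
        .expect("seed rows");

    // Group the two updates into one transaction; a failure before COMMIT
    // would be undone by issuing ROLLBACK instead.
    engine.execute("BEGIN TRANSACTION").expect("begin");
    engine
        .execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
        .expect("debit");
    engine
        .execute("UPDATE accounts SET balance = balance + 10 WHERE id = 2")
        .expect("credit");
    engine.execute("COMMIT").expect("commit");
}
```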

Constraint Enforcement Modes

The session supports two constraint enforcement modes:

| Mode | Behavior |
|---|---|
| ConstraintEnforcementMode::Immediate | Constraints checked after each statement |
| ConstraintEnforcementMode::Deferred | Constraints checked only at transaction commit |

Deferred mode is useful for bulk loading when referential integrity violations may occur temporarily during the load process.

Sources: llkv-table/src/catalog/manager.rs (inferred from TpchToolkit usage), llkv-tpch/src/lib.rs:274-278


stateDiagram-v2
    [*] --> Uninitialized: SqlEngine::new()
    Uninitialized --> Checking : Query references information_schema
    Checking --> Initializing : ensure_information_schema_ready()
    Initializing --> Ready : refresh_information_schema()
    Ready --> Ready : Subsequent queries
    Ready --> Invalidated : DDL statement (CREATE/DROP TABLE, etc.)
    Invalidated --> Checking : Next information_schema query
    Checking --> Ready : Already initialized

Information Schema

The SqlEngine lazily initializes the information_schema tables on first access to avoid startup overhead. Querying information_schema.tables, information_schema.columns, or other information schema views triggers initialization.

Information Schema Lifecycle

Diagram: Information schema initialization and invalidation lifecycle

Key methods:

  • ensure_information_schema_ready() - Initialize if not already ready
  • invalidate_information_schema() - Mark as stale after schema changes
  • refresh_information_schema() - Delegate to RuntimeEngine for rebuild

Invalidation triggers:

  • CREATE TABLE, DROP TABLE
  • ALTER TABLE
  • CREATE VIEW, DROP VIEW

Sources: llkv-sql/src/sql_engine.rs:648-660 llkv-sql/src/sql_engine.rs:706-745


graph LR
    Execute["execute_plan_statement"]
RuntimeEngine["RuntimeEngine::execute_statement"]
Error["Error::NotFound or\nInvalidArgumentError"]
MapError["map_table_error"]
UserError["Error::CatalogError:\n'Table does not exist'"]
Execute -->|delegates| RuntimeEngine
 
   RuntimeEngine -->|returns| Error
 
   Execute -->|table name available| MapError
 
   MapError -->|transforms| UserError
 
   UserError -->|propagates| User["User Code"]

Error Handling and Diagnostics

Error Mapping

The SqlEngine transforms generic storage errors into user-friendly SQL errors, particularly for table lookup failures:

Diagram: Error mapping for table lookup failures

Error transformation rules:

  1. Error::NotFound → Error::CatalogError("Table '...' does not exist")
  2. Generic “unknown table” messages → Catalog error with table name
  3. View-related operations (CREATE VIEW, DROP VIEW) skip mapping

Sources: llkv-sql/src/sql_engine.rs:677-745

Statement Expectations (Test Harness)

For test infrastructure, the SqlEngine supports registering expected outcomes for statements:

Expectation types:

  • StatementExpectation::Ok - Statement should succeed
  • StatementExpectation::Error - Statement should fail
  • StatementExpectation::Count(n) - Statement should affect n rows

Sources: llkv-sql/src/sql_engine.rs:65-391


Advanced Usage Examples

Bulk Loading with INSERT Buffering
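A sketch of a bulk load with buffering enabled. Import paths, the pager type, and error handling are assumptions; set_insert_buffering and the 8192-row flush threshold are as documented above. Values are interpolated without escaping, which is acceptable only for trusted, illustrative data.

```rust
use std::sync::Arc;

use llkv_sql::SqlEngine; // import path assumed
use llkv_storage::MemPager; // pager type assumed

fn main() {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    engine
        .execute("CREATE TABLE items (id INTEGER, name TEXT)")
        .expect("create table");

    // Opt in to cross-statement INSERT batching for the duration of the load.
    engine.set_insert_buffering(true);
    for id in 0..100_000 {
        // Compatible consecutive INSERT ... VALUES statements are accumulated
        // and flushed together (at the 8192-row threshold or on a boundary).
        engine
            .execute(&format!("INSERT INTO items VALUES ({id}, 'item-{id}')"))
            .expect("buffered insert");
    }
    // Turning buffering back off flushes any remaining buffered rows.
    engine.set_insert_buffering(false);
}
```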

Sources: llkv-sql/src/sql_engine.rs:1431-1449 llkv-sql/src/sql_engine.rs:2010-2027

Prepared Statements with Named Parameters
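A hedged sketch of the prepared-statement flow with a named parameter. The SqlParamValue::String variant, the parameter container (a slice of values), and the import paths are assumptions; prepare, parameter_count, and execute_prepared are the documented entry points.

```rust
use std::sync::Arc;

use llkv_sql::{SqlEngine, SqlParamValue}; // import paths assumed
use llkv_storage::MemPager; // pager type assumed

fn main() {
    let engine = SqlEngine::new(Arc::new(MemPager::default()));
    engine
        .execute("CREATE TABLE users (id INTEGER, name TEXT)")
        .expect("create table");
    engine
        .execute("INSERT INTO users VALUES (1, 'Alice')")
        .expect("insert");

    // Named parameters are normalized to positional indices during preparation.
    let stmt = engine
        .prepare("SELECT id FROM users WHERE name = :username")
        .expect("prepare");
    assert_eq!(stmt.parameter_count(), 1);

    // Hypothetical variant name and container; values are bound in index order.
    let params = vec![SqlParamValue::String("Alice".to_string())];
    let _results = engine.execute_prepared(&stmt, &params).expect("execute");
}
```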

Sources: llkv-sql/src/sql_engine.rs:1715-1816

Direct Plan Execution (Advanced)

For performance-critical code paths, bypass SQL parsing by executing plans directly:

Sources: llkv-sql/src/sql_engine.rs:1412-1424 llkv-runtime/src/lib.rs (InsertPlan structure inferred)


Key Constants

| Constant | Value | Purpose |
|---|---|---|
| PARSER_RECURSION_LIMIT | 200 | Maximum recursion depth for sqlparser (prevents stack overflow) |
| MAX_BUFFERED_INSERT_ROWS | 8192 | Flush threshold for INSERT buffering |
| PARAM_SENTINEL_PREFIX | "__llkv_param__" | Internal marker for parameter placeholders |
| PARAM_SENTINEL_SUFFIX | "__" | Internal marker suffix |
| DROPPED_TABLE_TRANSACTION_ERR | "another transaction has dropped this table" | Error message for concurrent table drops |

Sources: llkv-sql/src/sql_engine.rs:78-588


SQL Preprocessing and Dialect Handling


Relevant source files

This document describes the SQL preprocessing layer that normalizes SQL syntax from multiple dialects (SQLite, DuckDB, PostgreSQL) before parsing. The preprocessor transforms dialect-specific syntax into forms that the sqlparser-rs library can parse, enabling broad compatibility across different SQL variants.

For information about the overall SQL execution flow, see SqlEngine API. For details on how SQL parameters are bound to prepared statements, see Plan Structures.


Purpose and Architecture

The SQL preprocessing system transforms SQL text through a series of regex-based rewriting rules before passing it to sqlparser-rs. This architecture allows LLKV to accept SQL from various dialects without forking the parser library or implementing a custom parser.

Key components :

  • SqlEngine::preprocess_sql_input() - main preprocessing orchestrator
  • Individual preprocessing functions for each dialect feature
  • Thread-local parameter state tracking
  • Regex-based pattern matching and replacement
flowchart LR
    RawSQL["Raw SQL Text\n(SQLite/DuckDB/PostgreSQL)"]
Preprocess["preprocess_sql_input()"]
subgraph "Preprocessing Steps"
        TPCH["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → DOMAIN"]
Exclude["EXCLUDE qualifier handling"]
Trailing["Trailing comma removal"]
BareTable["Bare table IN conversion"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint removal"]
Reindex["REINDEX normalization"]
Trigger["Trigger shorthand expansion"]
end
    
    Parser["sqlparser::Parser"]
AST["Statement AST"]
RawSQL --> Preprocess
 
   Preprocess --> TPCH
 
   TPCH --> CreateType
 
   CreateType --> Exclude
 
   Exclude --> Trailing
 
   Trailing --> BareTable
 
   BareTable --> EmptyIn
 
   EmptyIn --> IndexHints
 
   IndexHints --> Reindex
 
   Reindex --> Trigger
 
   Trigger --> Parser
 
   Parser --> AST

The preprocessing layer sits between raw SQL input and the generic SQL parser:

Sources : llkv-sql/src/sql_engine.rs:1116-1125


Preprocessing Pipeline

The preprocess_sql_input() function applies transformations in a specific order to handle interdependencies between rules:

Sources : llkv-sql/src/sql_engine.rs:1116-1125

flowchart TD
    Input["SQL Input String"]
Step1["preprocess_tpch_connect_syntax()"]
Step2["preprocess_create_type_syntax()"]
Step3["preprocess_exclude_syntax()"]
Step4["preprocess_trailing_commas_in_values()"]
Step5["preprocess_bare_table_in_clauses()"]
Step6["preprocess_empty_in_lists()"]
Step7["preprocess_index_hints()"]
Step8["preprocess_reindex_syntax()"]
Output["Normalized SQL"]
Input --> Step1
 
   Step1 --> Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
 
   Step7 --> Step8
 
   Step8 --> Output
    
    Note1["Removes TPC-H CONNECT statements"]
Note2["Converts DuckDB type aliases"]
Note3["Quotes qualified identifiers"]
Note4["Removes trailing commas in VALUES"]
Note5["Wraps bare tables in subqueries"]
Note6["Converts empty IN to constant predicates"]
Note7["Strips SQLite index hints"]
Note8["Converts standalone REINDEX"]
Step1 -.-> Note1
 
   Step2 -.-> Note2
 
   Step3 -.-> Note3
 
   Step4 -.-> Note4
 
   Step5 -.-> Note5
 
   Step6 -.-> Note6
 
   Step7 -.-> Note7
 
   Step8 -.-> Note8

Dialect-Specific Transformations

TPC-H CONNECT Statement Removal

TPC-H benchmark scripts include CONNECT TO database; directives that are no-ops in LLKV (single database system). The preprocessor strips these statements entirely.

Implementation : llkv-sql/src/sql_engine.rs:759-766

Pattern : CONNECT TO <identifier>;

Action : Remove statement


CREATE TYPE → CREATE DOMAIN Conversion

DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser-rs only supports the SQL standard CREATE DOMAIN syntax. The preprocessor performs bidirectional conversion:

CREATE TYPE myint AS INTEGER  →  CREATE DOMAIN myint AS INTEGER
DROP TYPE myint               →  DROP DOMAIN myint

Regex patterns :

  • CREATE TYPE → CREATE DOMAIN
  • DROP TYPE → DROP DOMAIN

Implementation : llkv-sql/src/sql_engine.rs:775-793

Sources : llkv-sql/src/sql_engine.rs:775-793


EXCLUDE Clause Qualified Name Handling

DuckDB allows qualified identifiers in EXCLUDE clauses: SELECT * EXCLUDE (schema.table.col). The preprocessor wraps these in double quotes for parser compatibility:

EXCLUDE (schema.table.col)  →  EXCLUDE ("schema.table.col")

Pattern : EXCLUDE\s*\(\s*([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+)\s*\)

Implementation : llkv-sql/src/sql_engine.rs:795-812

Sources : llkv-sql/src/sql_engine.rs:795-812


Trailing Comma Removal in VALUES Clauses

DuckDB permits trailing commas in tuple literals: VALUES ('v2',). The preprocessor removes these for parser compatibility:

VALUES (1, 2,)  →  VALUES (1, 2)

Pattern : ,(\s*)\)

Replacement : $1)

Implementation : llkv-sql/src/sql_engine.rs:814-825

Sources : llkv-sql/src/sql_engine.rs:814-825
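The rewrite itself is a one-line use of the regex crate's replace_all; the sketch below mirrors the documented pattern `,(\s*)\)` with replacement `$1)`, but is illustrative rather than the exact code (the real preprocessor compiles its patterns once and chains several passes).

```rust
use regex::Regex;

/// Illustrative version of the trailing-comma rewrite.
fn strip_trailing_commas(sql: &str) -> String {
    let re = Regex::new(r",(\s*)\)").expect("valid pattern");
    re.replace_all(sql, "$1)").into_owned()
}

fn main() {
    assert_eq!(
        strip_trailing_commas("INSERT INTO t VALUES (1, 2,)"),
        "INSERT INTO t VALUES (1, 2)"
    );
}
```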


Empty IN List Handling

SQLite allows degenerate forms expr IN () and expr NOT IN () which sqlparser-rs rejects. The preprocessor converts these to constant boolean expressions while preserving expression evaluation (for potential side effects):

expr IN ()      →  (expr = NULL AND 0 = 1)  -- always false
expr NOT IN ()  →  (expr = NULL OR 1 = 1)   -- always true

Regex : Matches parenthesized expressions, quoted strings, hex literals, identifiers, or numbers followed by [NOT] IN ()

Implementation : llkv-sql/src/sql_engine.rs:827-856

Sources : llkv-sql/src/sql_engine.rs:827-856


SQLite Index Hint Removal

SQLite query optimizer hints (INDEXED BY index_name, NOT INDEXED) are stripped since LLKV makes its own index selection decisions:

FROM table INDEXED BY idx_name  →  FROM table
FROM table NOT INDEXED           →  FROM table

Pattern : \s+(INDEXED\s+BY\s+[a-zA-Z_][a-zA-Z0-9_]*|NOT\s+INDEXED)\b

Implementation : llkv-sql/src/sql_engine.rs:858-875

Sources : llkv-sql/src/sql_engine.rs:858-875


REINDEX Syntax Normalization

SQLite supports standalone REINDEX index_name statements, but sqlparser-rs only recognizes REINDEX within VACUUM statements. The preprocessor converts:

REINDEX my_index  →  VACUUM REINDEX my_index

Pattern : \bREINDEX\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)\b

Implementation : llkv-sql/src/sql_engine.rs:877-893

Sources : llkv-sql/src/sql_engine.rs:877-893


SQLite Trigger Shorthand Expansion

SQLite allows omitting trigger timing (BEFORE/AFTER, defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). The sqlparser-rs library requires these to be explicit, so the preprocessor injects them when missing.

Transformation example :

CREATE TRIGGER t INSERT ON t  →  CREATE TRIGGER t AFTER INSERT ON t FOR EACH ROW

Regex approach :

  1. Detect missing timing keyword and inject AFTER
  2. Detect missing FOR EACH ROW before BEGIN or WHEN and inject it

Implementation : llkv-sql/src/sql_engine.rs:895-978

Note : This is marked as a temporary workaround. The proper fix would be extending sqlparser-rs’s SQLiteDialect::parse_statement to handle these optional clauses.

Sources : llkv-sql/src/sql_engine.rs:895-978


Bare Table Names in IN Clauses

SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor converts this to the explicit subquery form:

col IN users  →  col IN (SELECT * FROM users)

Pattern : \b(NOT\s+)?IN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)(\s|$|;|,|\))

Avoids : Already-parenthesized expressions (IN (...))

Implementation : llkv-sql/src/sql_engine.rs:980-1009

Sources : llkv-sql/src/sql_engine.rs:980-1009


SQL Parameter System

The parameter system enables prepared statements with placeholder binding. Parameters can use various syntaxes across different SQL dialects:

| Syntax | Example | Description |
|---|---|---|
| ? | WHERE id = ? | Positional, auto-incremented |
| ?N | WHERE id = ?1 | Positional, explicit index |
| $N | WHERE id = $1 | PostgreSQL-style positional |
| :name | WHERE id = :user_id | Named parameter |

Parameter State Management

Thread-local state tracks parameter registration during statement preparation:

Key functions :

Sources : llkv-sql/src/sql_engine.rs:78-282

flowchart TD
    Input["Raw Parameter String"]
CheckAuto{"Is '?' ?"}
IncrementAuto["Increment next_auto\nReturn new index"]
CheckCache{"Already registered?"}
ReturnCached["Return cached index"]
ParseType{"Parameter type"}
ParseNumeric["Parse numeric index\n(?N or $N)"]
AssignNamed["Assign max_index + 1\n(:name)"]
UpdateCache["Store in assigned map\nUpdate max_index"]
Return["Return index"]
Input --> CheckAuto
    CheckAuto --
 Yes --> IncrementAuto
    CheckAuto --
 No --> CheckCache
    CheckCache --
 Yes --> ReturnCached
    CheckCache --
 No --> ParseType
    
    ParseType -- "?N or $N" --> ParseNumeric
    ParseType -- ":name" --> AssignNamed
    
 
   ParseNumeric --> UpdateCache
 
   AssignNamed --> UpdateCache
 
   UpdateCache --> Return
 
   IncrementAuto --> Return

Parameter Index Assignment

The ParameterState::register() method normalizes different parameter syntaxes to unified indices:

Sources : llkv-sql/src/sql_engine.rs:94-133
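A simplified stand-in for the registration logic (not the actual ParameterState implementation) that reproduces the behavior in the flowchart above: ? auto-increments, ?N and $N carry explicit indices, and :name receives the next free index and is cached.

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ParamState {
    assigned: HashMap<String, usize>,
    next_auto: usize,
    max_index: usize,
}

impl ParamState {
    fn register(&mut self, raw: &str) -> usize {
        if raw == "?" {
            // Anonymous placeholders auto-increment.
            self.next_auto += 1;
            self.max_index = self.max_index.max(self.next_auto);
            return self.next_auto;
        }
        if let Some(&idx) = self.assigned.get(raw) {
            return idx; // Already registered.
        }
        // `?N` and `$N` carry an explicit index; `:name` gets the next free one.
        let idx = raw
            .strip_prefix('?')
            .or_else(|| raw.strip_prefix('$'))
            .and_then(|n| n.parse::<usize>().ok())
            .unwrap_or(self.max_index + 1);
        self.max_index = self.max_index.max(idx);
        self.assigned.insert(raw.to_string(), idx);
        idx
    }
}

fn main() {
    let mut state = ParamState::default();
    assert_eq!(state.register("?"), 1);
    assert_eq!(state.register("$3"), 3);
    assert_eq!(state.register(":name"), 4);
    assert_eq!(state.register(":name"), 4);
}
```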

Parameter Sentinel System

During plan building, parameters are represented as sentinel strings that can be recognized and replaced during execution:

| Sentinel Format | Example | Usage |
|---|---|---|
| __llkv_param__N__ | __llkv_param__1__ | Internal representation in plan |

Functions :

Sources : llkv-sql/src/sql_engine.rs:78-282


Integration with SqlEngine Execution

The preprocessing pipeline integrates with the main SQL execution flow:

Trigger retry logic : If initial parsing fails and the SQL contains CREATE TRIGGER, the engine applies the trigger shorthand preprocessing and retries. This is a fallback for cases where the initial preprocessing pipeline doesn’t catch all trigger variations.

Sources : llkv-sql/src/sql_engine.rs:1057-1083


Recursion Limit Configuration

The parser recursion limit is set higher than sqlparser-rs’s default to accommodate deeply nested SQL expressions common in test suites:
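The documented value is 200 (see the key-constants table in the SqlEngine API page); the constant itself lives in llkv-sql and its exact type is assumed here.

```rust
// Value as documented; type assumed for illustration.
const PARSER_RECURSION_LIMIT: usize = 200;
```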

This prevents stack overflows while still protecting against pathological inputs.

Sources : llkv-sql/src/sql_engine.rs:393-400


Design Rationale

Why Regex-Based Preprocessing?

The preprocessing approach uses regex pattern matching rather than extending the parser for several reasons:

  1. Non-invasive : Does not require forking or patching sqlparser-rs
  2. Composable : Multiple transformations can be chained independently
  3. Maintainable : Each dialect feature is isolated in its own function
  4. Sufficient : Handles syntactic transformations without semantic analysis

Trade-offs

| Approach | Advantages | Disadvantages |
|---|---|---|
| Regex preprocessing | Simple, composable, no parser changes | Fragile to edge cases, string-based |
| Parser extension | Robust, type-safe | Requires maintaining parser fork |
| Custom parser | Full control | High maintenance burden |

The current regex-based approach is pragmatic for the current set of dialect differences. If the number of transformations grows significantly, a proper parser extension may become necessary.

Sources : llkv-sql/src/sql_engine.rs:759-1009


Summary

The SQL preprocessing layer provides broad dialect compatibility through a pipeline of targeted string transformations:

  • TPC-H compatibility : Strips CONNECT statements
  • DuckDB compatibility : Type aliases, trailing commas, qualified EXCLUDE
  • SQLite compatibility : Trigger shorthand, empty IN lists, index hints, bare table IN, REINDEX
  • PostgreSQL compatibility : $N parameter syntax

The system is extensible—new dialect features can be added as additional preprocessing functions without disrupting existing transformations.

Sources : llkv-sql/src/sql_engine.rs:759-1125


INSERT Buffering System


Relevant source files

The INSERT Buffering System is a performance optimization that batches multiple consecutive INSERT ... VALUES statements into a single execution operation. This system reduces planning and execution overhead by accumulating literal row data across statement boundaries and flushing them together, achieving significant throughput improvements for bulk ingestion workloads while preserving per-statement result semantics.

For information about SQL statement execution in general, see SQL Interface. For details on how INSERT statements are planned and translated, see Query Planning.


Purpose and Scope

The buffering system operates at the SQL engine layer, intercepting INSERT statements before they reach the runtime executor. It analyzes each INSERT to determine if it contains literal values that can be safely accumulated, or if it requires immediate execution due to complex source expressions (e.g., INSERT ... SELECT). The system maintains buffering state across multiple execute() calls, making it suitable for workloads that stream SQL scripts containing thousands of INSERT statements.

In scope:

  • Buffering of INSERT ... VALUES statements with literal row data
  • Automatic flushing based on configurable thresholds and statement boundaries
  • Per-statement result tracking for compatibility with test harnesses
  • Conflict resolution action compatibility (REPLACE, IGNORE, etc.)

Out of scope:

  • Buffering of INSERT statements with subqueries or SELECT sources
  • Cross-table buffering (only same-table INSERTs are batched)
  • Transaction-aware buffering strategies

System Architecture

Sources: llkv-sql/src/sql_engine.rs:1057-1114 llkv-sql/src/sql_engine.rs:1231-1338 llkv-sql/src/sql_engine.rs:1423-1557


Core Data Structures

classDiagram
    class InsertBuffer {-String table_name\n-Vec~String~ columns\n-InsertConflictAction on_conflict\n-usize total_rows\n-Vec~usize~ statement_row_counts\n-Vec~Vec~PlanValue~~ rows\n+new() InsertBuffer\n+can_accept() bool\n+push_statement()\n+should_flush() bool}
    
    class PreparedInsert {<<enumeration>>\nValues\nImmediate}
    
    class SqlEngine {-RefCell~Option~InsertBuffer~~ insert_buffer\n-AtomicBool insert_buffering_enabled\n+buffer_insert() BufferedInsertResult\n+flush_buffer_results() Vec~SqlStatementResult~\n+set_insert_buffering()}
    
    SqlEngine --> InsertBuffer : contains
    SqlEngine --> PreparedInsert : produces

InsertBuffer

The InsertBuffer struct accumulates literal row data across multiple INSERT statements while tracking per-statement row counts for result reporting.

Field descriptions:

| Field | Type | Purpose |
|---|---|---|
| table_name | String | Target table identifier for compatibility checking |
| columns | Vec<String> | Column list (must match across buffered statements) |
| on_conflict | InsertConflictAction | Conflict resolution strategy (REPLACE, IGNORE, etc.) |
| total_rows | usize | Sum of all buffered rows across statements |
| statement_row_counts | Vec<usize> | Per-statement row counts for result splitting |
| rows | Vec<Vec<PlanValue>> | Flattened literal row data |

Sources: llkv-sql/src/sql_engine.rs:497-547


Configuration and Thresholds

MAX_BUFFERED_INSERT_ROWS

The buffering system enforces a maximum row threshold to prevent unbounded memory growth during bulk ingestion:
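The documented threshold is 8192 rows; the constant itself lives in llkv-sql, and its type is assumed here.

```rust
// Value as documented; type assumed for illustration.
const MAX_BUFFERED_INSERT_ROWS: usize = 8192;
```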

When total_rows reaches this threshold, the buffer automatically flushes before accepting additional statements. This value balances memory usage against amortized planning overhead.

Sources: llkv-sql/src/sql_engine.rs:490

Enabling and Disabling Buffering

Buffering is disabled by default to preserve immediate execution semantics for unit tests and applications that depend on synchronous error reporting. Long-running workloads can opt in via the set_insert_buffering method:

Sources: llkv-sql/src/sql_engine.rs:1022-1029 llkv-sql/src/sql_engine.rs:583

sequenceDiagram
    participant App
    participant SqlEngine
    participant InsertBuffer
    
    App->>SqlEngine: set_insert_buffering(true)
    SqlEngine->>SqlEngine: Store enabled flag
    
    Note over App,SqlEngine: Buffering now active
    
    App->>SqlEngine: execute("INSERT INTO t VALUES (1)")
    SqlEngine->>InsertBuffer: Create or append
    App->>SqlEngine: execute("INSERT INTO t VALUES (2)")
    SqlEngine->>InsertBuffer: Append to existing
    
    App->>SqlEngine: set_insert_buffering(false)
    SqlEngine->>InsertBuffer: flush_buffer_results()
    InsertBuffer->>SqlEngine: Return per-statement results

Buffering Decision Flow

The buffer_insert method implements the core buffering logic, determining whether an INSERT should be accumulated or executed immediately:

Immediate Execution Conditions

Buffering is bypassed under the following conditions:

  1. Statement Expectation is Error or Count: Test harnesses use expectations to validate synchronous error reporting and exact row counts
  2. Buffering Disabled: The insert_buffering_enabled flag is false (default state)
  3. Non-literal Source: The INSERT uses a SELECT subquery or other dynamic source
  4. Buffer Incompatibility: The INSERT targets a different table, column list, or conflict action

Sources: llkv-sql/src/sql_engine.rs:1231-1338

flowchart TD
    START["buffer_insert(insert, expectation)"]
CHECK_EXPECT{"expectation ==\nError |Count?"}
CHECK_ENABLED{"buffering_enabled?"}
PREPARE["prepare_insert insert"]
CHECK_TYPE{"PreparedInsert type?"}
CHECK_COMPAT{"buffer.can_accept ?"}
CHECK_FULL{"buffer.should_flush ?"}
EXEC_IMM["Execute immediately Return flushed + current"]
CREATE_BUF["Create new InsertBuffer"]
APPEND_BUF["buffer.push_statement"]
FLUSH_BUF["flush_buffer_results"]
RETURN_PLACE["Return placeholder result"]
START --> CHECK_EXPECT
 CHECK_EXPECT -->|Yes|EXEC_IMM
 CHECK_EXPECT -->|No|CHECK_ENABLED
 CHECK_ENABLED -->|No|EXEC_IMM
 CHECK_ENABLED -->|Yes|PREPARE
 PREPARE --> CHECK_TYPE
 CHECK_TYPE -->|Immediate|EXEC_IMM
 CHECK_TYPE -->|Values|CHECK_COMPAT
 CHECK_COMPAT -->|No|FLUSH_BUF
 CHECK_COMPAT -->|Yes|APPEND_BUF
 FLUSH_BUF --> CREATE_BUF
 CREATE_BUF --> RETURN_PLACE
 APPEND_BUF --> CHECK_FULL
 CHECK_FULL -->|Yes|FLUSH_BUF
 CHECK_FULL -->|No| RETURN_PLACE

Insert Canonicalization

The prepare_insert method analyzes the INSERT statement and converts it into one of two canonical forms:

PreparedInsert::Values

Literal VALUES clauses (including constant SELECT forms like SELECT 1, 2) are rewritten into Vec<Vec<PlanValue>> rows for buffering:

Conversion logic:

flowchart LR
    SQL["INSERT INTO users (id, name)\nVALUES (1, 'Alice')"]
AST["sqlparser::ast::Insert"]
CHECK{"Source type?"}
VALUES["SetExpr::Values"]
SELECT["SetExpr::Select"]
CONST{"Constant SELECT?"}
CONVERT["Convert to Vec~Vec~PlanValue~~"]
PREP_VAL["PreparedInsert::Values"]
PREP_IMM["PreparedInsert::Immediate"]
SQL --> AST
 
   AST --> CHECK
 
   CHECK -->|VALUES| VALUES
 
   CHECK -->|SELECT| SELECT
 
   VALUES --> CONVERT
 
   SELECT --> CONST
 
   CONST -->|Yes| CONVERT
 
   CONST -->|No| PREP_IMM
 
   CONVERT --> PREP_VAL

The method iterates over each row in the VALUES clause, converting SQL expressions to PlanValue variants:

| SQL Type | PlanValue Variant |
| --- | --- |
| Integer literals | PlanValue::Integer(i64) |
| Float literals | PlanValue::Float(f64) |
| String literals | PlanValue::String(String) |
| NULL | PlanValue::Null |
| Date literals | PlanValue::Date32(i32) |
| DECIMAL literals | PlanValue::Decimal(DecimalValue) |
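A simplified sketch of the per-cell conversion; the SqlLiteral enum below is a hypothetical stand-in for the sqlparser AST value nodes, and only the PlanValue variants are taken from the table above:

// Hypothetical input type standing in for sqlparser literal nodes.
enum SqlLiteral {
    Integer(i64),
    Float(f64),
    Text(String),
    Null,
}

// Sketch: convert one VALUES cell into the PlanValue stored in the buffer.
fn to_plan_value(lit: SqlLiteral) -> PlanValue {
    match lit {
        SqlLiteral::Integer(v) => PlanValue::Integer(v),
        SqlLiteral::Float(v) => PlanValue::Float(v),
        SqlLiteral::Text(s) => PlanValue::String(s),
        SqlLiteral::Null => PlanValue::Null,
    }
}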

Sources: llkv-sql/src/sql_engine.rs:1504-1524

flowchart LR
    COMPLEX["INSERT INTO t\nSELECT * FROM source"]
PLAN["build_select_plan()"]
WRAP["InsertPlan { source: InsertSource::Select }"]
PREP["PreparedInsert::Immediate"]
COMPLEX --> PLAN
 
   PLAN --> WRAP
 
   WRAP --> PREP

PreparedInsert::Immediate

Non-literal sources (subqueries, complex expressions) are wrapped in a fully-planned InsertPlan for immediate execution:
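A sketch of the two canonical forms; the exact field layout is an assumption inferred from the class diagram and descriptions above (the real enum is at llkv-sql/src/sql_engine.rs:555-563):

// Sketch: the two canonical forms produced by prepare_insert.
enum PreparedInsert {
    // Literal rows that can be appended to the shared InsertBuffer.
    Values {
        table_name: String,
        columns: Vec<String>,
        on_conflict: InsertConflictAction,
        rows: Vec<Vec<PlanValue>>,
    },
    // Fully planned INSERT (e.g. INSERT ... SELECT) that must execute right away.
    Immediate(InsertPlan),
}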

Sources: llkv-sql/src/sql_engine.rs:1543-1552


flowchart TD
    START["can_accept(table, columns, on_conflict)"]
CHECK_TABLE{"table_name == table?"}
CHECK_COLS{"columns == columns?"}
CHECK_CONFLICT{"on_conflict == on_conflict?"}
ACCEPT["Return true"]
REJECT["Return false"]
START --> CHECK_TABLE
 
   CHECK_TABLE -->|No| REJECT
 
   CHECK_TABLE -->|Yes| CHECK_COLS
 
   CHECK_COLS -->|No| REJECT
 
   CHECK_COLS -->|Yes| CHECK_CONFLICT
 
   CHECK_CONFLICT -->|No| REJECT
 
   CHECK_CONFLICT -->|Yes| ACCEPT

Buffer Compatibility Checking

The can_accept method ensures that only compatible INSERTs are batched together:

Compatibility requirements:

| Field | Requirement |
| --- | --- |
| table_name | Exact string match (case-sensitive) |
| columns | Exact column list match (order matters) |
| on_conflict | Same conflict resolution strategy |
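A sketch of the check, assuming the field names listed earlier and that InsertConflictAction supports equality comparison (the actual implementation is at llkv-sql/src/sql_engine.rs:528-535):

impl InsertBuffer {
    // Sketch: a statement is appended only when it targets the same table,
    // the same column list (in order), and the same conflict action.
    fn can_accept(
        &self,
        table_name: &str,
        columns: &[String],
        on_conflict: &InsertConflictAction,
    ) -> bool {
        self.table_name == table_name
            && self.columns == columns
            && &self.on_conflict == on_conflict
    }
}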

Sources: llkv-sql/src/sql_engine.rs:528-535


flowchart TD
    subgraph "During execute()"
        STMT{"Statement Type"}
INSERT["INSERT"]
TRANS["Transaction Boundary\n(BEGIN/COMMIT/ROLLBACK)"]
OTHER["Other Statement\n(SELECT, UPDATE, etc.)"]
end
    
    subgraph "Buffer State"
        SIZE{"total_rows >=\nMAX_BUFFERED_INSERT_ROWS?"}
COMPAT{"can_accept()?"}
end
    
    subgraph "Explicit Triggers"
        DROP["SqlEngine::drop()"]
DISABLE["set_insert_buffering(false)"]
MANUAL["flush_pending_inserts()"]
end
    
    FLUSH["flush_buffer_results()"]
STMT --> INSERT
 
   STMT --> TRANS
 
   STMT --> OTHER
 
   INSERT --> SIZE
 
   INSERT --> COMPAT
 
   SIZE -->|Yes| FLUSH
 
   COMPAT -->|No| FLUSH
 
   TRANS --> FLUSH
 
   OTHER --> FLUSH
 
   DROP --> FLUSH
 
   DISABLE --> FLUSH
 
   MANUAL --> FLUSH

Flush Triggers

The buffer flushes under multiple conditions to maintain correctness and predictable memory usage:

Automatic Flush Conditions

Sources: llkv-sql/src/sql_engine.rs:544-546 llkv-sql/src/sql_engine.rs:1098-1111 llkv-sql/src/sql_engine.rs:592-596

Manual Flush API

Applications can force a flush via the flush_pending_inserts public method:
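A hedged usage sketch; only flush_pending_inserts and execute are named on this page, and the Result shapes are assumptions:

// Sketch: guarantee buffered rows are persisted before a dependent read.
engine.execute("INSERT INTO metrics VALUES (1, 'cpu', 0.42)")?;
engine.flush_pending_inserts()?;   // force any buffered INSERTs to execute now
let result = engine.execute("SELECT COUNT(*) FROM metrics")?;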

This is useful when the application needs to ensure all buffered data is persisted before performing a read operation or checkpoint.

Sources: llkv-sql/src/sql_engine.rs:1132-1134


sequenceDiagram
    participant Buffer as InsertBuffer
    participant Engine as SqlEngine
    participant Runtime as RuntimeEngine
    
    Note over Buffer: Buffer contains 3 statements:\n10, 20, 15 rows
    
    Engine->>Buffer: Take buffer (45 total rows)
    Buffer-->>Engine: InsertBuffer instance
    
    Engine->>Engine: Build InsertPlan with all 45 rows
    Engine->>Runtime: execute_statement(InsertPlan)
    Runtime->>Runtime: Persist all rows atomically
    Runtime-->>Engine: RuntimeStatementResult::Insert(45)
    
    Engine->>Engine: Split into per-statement results
    Note over Engine: [10 rows, 20 rows, 15 rows]
    
    Engine-->>Engine: Return Vec~SqlStatementResult~

Flush Execution and Result Splitting

When the buffer flushes, all accumulated rows are executed as a single InsertPlan, then the result is split back into per-statement results:

Implementation details:

  1. The buffer is moved out of the RefCell to prevent reentrancy issues
  2. A single InsertPlan is constructed with InsertSource::Rows(all_rows)
  3. The plan executes via RuntimeEngine::execute_statement
  4. The returned row count is validated against total_rows
  5. Per-statement results are reconstructed using statement_row_counts
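A sketch of step 5, assuming the RuntimeStatementResult::Insert(row_count) shape shown in the sequence diagram above:

// Sketch: split one bulk execution back into per-statement results.
// statement_row_counts was recorded as each INSERT entered the buffer.
let mut results = Vec::with_capacity(statement_row_counts.len());
let mut remaining = total_inserted_rows;                 // e.g. 45 reported by the runtime
for count in statement_row_counts {
    assert!(remaining >= count, "runtime reported fewer rows than were buffered");
    remaining -= count;
    results.push(RuntimeStatementResult::Insert(count)); // reads as one INSERT of `count` rows
}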

Sources: llkv-sql/src/sql_engine.rs:1342-1407

Result Reconstruction

The flush method maintains the illusion of per-statement execution by reconstructing individual RuntimeStatementResult::Insert instances:

This ensures compatibility with test harnesses and applications that inspect per-statement results.

Sources: llkv-sql/src/sql_engine.rs:1389-1406


flowchart TD
    EXPECT["register_statement_expectation(expectation)"]
THREAD_LOCAL["PENDING_STATEMENT_EXPECTATIONS\n(thread-local queue)"]
EXEC["execute(sql)"]
NEXT["next_statement_expectation()"]
CHECK{"Expectation Type"}
OK["StatementExpectation::Ok\n(buffer eligible)"]
ERROR["StatementExpectation::Error\n(execute immediately)"]
COUNT["StatementExpectation::Count\n(execute immediately)"]
EXPECT --> THREAD_LOCAL
 
   EXEC --> NEXT
 
   NEXT --> THREAD_LOCAL
 
   THREAD_LOCAL --> CHECK
 
   CHECK --> OK
 
   CHECK --> ERROR
 
   CHECK --> COUNT

Statement Expectations Integration

The SQL Logic Test (SLT) harness uses statement expectations to validate error handling and row counts. The buffering system respects these expectations by bypassing the buffer when necessary:

Expectation types:

| Expectation | Behavior | Reason |
| --- | --- | --- |
| Ok | Allow buffering | No specific validation needed |
| Error | Force immediate execution | Must observe synchronous error |
| Count(n) | Force immediate execution | Must validate exact row count |

Sources: llkv-sql/src/sql_engine.rs:375-391 llkv-sql/src/sql_engine.rs:1240-1250


Performance Characteristics

The buffering system provides substantial performance improvements for bulk ingestion workloads:

Planning Overhead Reduction

Without buffering, each INSERT statement incurs:

  1. SQL parsing (via sqlparser)
  2. Plan construction (InsertPlan allocation)
  3. Schema validation (column name resolution)
  4. Runtime plan dispatch

With buffering, these costs are amortized across all buffered statements, executing once per flush rather than once per statement.

Batch Size Impact

The default threshold of 8192 rows balances several factors:

| Factor | Impact of Larger Batches | Impact of Smaller Batches |
| --- | --- | --- |
| Memory usage | Increases linearly | Reduces proportionally |
| Planning amortization | Better (fewer flushes) | Worse (more flushes) |
| Latency to visibility | Higher (longer buffering) | Lower (frequent flushes) |
| Write throughput | Generally higher | Generally lower |

Sources: llkv-sql/src/sql_engine.rs:490


sequenceDiagram
    participant App
    participant SqlEngine
    participant Buffer as InsertBuffer
    participant Runtime
    
    App->>SqlEngine: execute("BEGIN TRANSACTION")
    SqlEngine->>Buffer: flush_buffer_results()
    Buffer-->>SqlEngine: (empty, no pending data)
    SqlEngine->>Runtime: begin_transaction()
    
    App->>SqlEngine: execute("INSERT INTO t VALUES (1)")
    SqlEngine->>Buffer: Buffer statement
    
    App->>SqlEngine: execute("COMMIT")
    SqlEngine->>Buffer: flush_buffer_results()
    Buffer->>Runtime: execute_statement(InsertPlan)
    Runtime-->>Buffer: Success
    SqlEngine->>Runtime: commit_transaction()

Integration with Transaction Boundaries

The buffering system automatically flushes at transaction boundaries to maintain ACID semantics:

This ensures that all buffered writes are persisted before the transaction commits, preventing data loss or inconsistency.

Sources: llkv-sql/src/sql_engine.rs:1094-1102


Conflict Resolution Compatibility

The buffer preserves conflict resolution semantics by tracking the on_conflict action:

| Action | SQLite Syntax | Buffering Behavior |
| --- | --- | --- |
| None | (default) | Buffers with same action |
| Replace | REPLACE INTO / INSERT OR REPLACE | Buffers separately from other actions |
| Ignore | INSERT OR IGNORE | Buffers separately from other actions |
| Abort | INSERT OR ABORT | Buffers separately from other actions |
| Fail | INSERT OR FAIL | Buffers separately from other actions |
| Rollback | INSERT OR ROLLBACK | Buffers separately from other actions |

The buffer’s can_accept method ensures that only statements with identical conflict actions are batched together, preserving the semantic behavior of each INSERT.

Sources: llkv-sql/src/sql_engine.rs:1441-1455 llkv-sql/src/sql_engine.rs:528-535


Drop Safety

The SqlEngine implements a Drop handler to ensure that buffered data is persisted when the engine goes out of scope:
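A sketch of the drop hook, assuming flush errors can only be reported (not propagated) from Drop:

impl Drop for SqlEngine {
    fn drop(&mut self) {
        // Sketch: persist any rows still sitting in the INSERT buffer.
        if let Err(err) = self.flush_pending_inserts() {
            eprintln!("failed to flush buffered INSERTs on drop: {err}");
        }
    }
}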

This prevents data loss when applications terminate without explicitly flushing or disabling buffering.

Sources: llkv-sql/src/sql_engine.rs:590-597


Key Code Entities

| Entity | Location | Role |
| --- | --- | --- |
| InsertBuffer | llkv-sql/src/sql_engine.rs:497-547 | Accumulates row data and tracks per-statement counts |
| SqlEngine::insert_buffer | llkv-sql/src/sql_engine.rs:576 | Holds current buffer instance |
| SqlEngine::insert_buffering_enabled | llkv-sql/src/sql_engine.rs:583 | Global enable/disable flag |
| MAX_BUFFERED_INSERT_ROWS | llkv-sql/src/sql_engine.rs:490 | Flush threshold constant |
| PreparedInsert | llkv-sql/src/sql_engine.rs:555-563 | Canonical INSERT representation |
| buffer_insert | llkv-sql/src/sql_engine.rs:1231-1338 | Main buffering decision logic |
| prepare_insert | llkv-sql/src/sql_engine.rs:1423-1557 | INSERT canonicalization |
| flush_buffer_results | llkv-sql/src/sql_engine.rs:1342-1407 | Execution and result splitting |
| BufferedInsertResult | llkv-sql/src/sql_engine.rs:567-570 | Return type for buffering operations |

Sources: llkv-sql/src/sql_engine.rs:490-1557


Query Planning

Relevant source files

Purpose and Scope

The query planning layer converts parsed SQL abstract syntax trees (AST) into structured logical plans that execution engines can process. This document covers the planning architecture, plan types, translation process, and validation steps. For information about the SQL parsing that precedes planning, see SQL Interface. For details about the plan structures themselves, see Plan Structures. For subquery correlation handling, see Subquery and Correlation Handling. For expression compilation and evaluation, see Expression System. For plan execution, see Query Execution.

Sources: llkv-plan/src/lib.rs:1-49 llkv-plan/src/plans.rs:1-6

Architecture Overview

The llkv-plan crate sits between the SQL parsing layer and the execution layer. It receives SQL ASTs from sqlparser-rs and produces logical plan structures consumed by llkv-executor and llkv-runtime. The crate is organized into several focused modules:

| Module | Purpose |
| --- | --- |
| plans | Core plan structures (SelectPlan, InsertPlan, etc.) |
| planner | Plan building logic |
| translation | SQL AST to plan conversion |
| validation | Schema and naming constraint enforcement |
| canonical | Canonical value conversion |
| conversion | Helper conversions for plan types |
| physical | Physical execution plan structures |
| table_scan | Table scan plan optimization |
| subquery_correlation | Correlated subquery tracking |
| traversal | Generic AST traversal utilities |

Sources: llkv-plan/src/lib.rs:1-49 llkv-plan/Cargo.toml:1-43

Planning Process Flow

The planning process begins when the SQL layer invokes translation functions that walk the parsed AST and construct typed plan structures. The process involves multiple phases:

Identifier Resolution: During translation, column references like "users.name" or "u.address.city" must be resolved to canonical column names. The IdentifierResolver in llkv-table handles this by consulting the TableCatalog and applying alias rules.

sequenceDiagram
    participant Parser as sqlparser
    participant Translator as translation module
    participant Validator as validation module
    participant Builder as Plan builders
    participant Catalog as TableCatalog
    participant Output as Logical Plans
    
    Parser->>Translator: Statement AST
    
    Note over Translator: Identify statement type\n(SELECT, INSERT, DDL, etc.)
    
    Translator->>Validator: Check schema references
    Validator->>Catalog: Resolve table names
    Catalog-->>Validator: TableMetadataView
    Validator-->>Translator: Validated identifiers
    
    Translator->>Builder: Construct plan with validated data
    
    alt SELECT statement
        Builder->>Builder: Build SelectPlan\n- Parse projections\n- Translate WHERE\n- Extract aggregates\n- Build join metadata
    else INSERT statement
        Builder->>Builder: Build InsertPlan\n- Parse column list\n- Convert values\n- Set conflict action
    else DDL statement
        Builder->>Builder: Build DDL Plan\n- Extract schema info\n- Validate constraints
    end
    
    Builder->>Validator: Final validation pass
    Validator-->>Builder: Approved
    
    Builder->>Output: Typed logical plan

Expression Translation: Predicate and scalar expressions are translated from SQL AST nodes into the llkv-expr expression types (Expr, ScalarExpr). This translation involves converting SQL operators to expression operators, resolving function calls, and tracking subquery placeholders.

Subquery Discovery: The translation layer identifies correlated subqueries in WHERE and SELECT clauses, assigns them unique SubqueryId identifiers, and builds FilterSubquery or ScalarSubquery metadata structures that track which outer columns are referenced.

Sources: llkv-plan/src/plans.rs:1-100 llkv-table/src/resolvers/identifier.rs:1-260

Plan Type Categories

Plans are organized by SQL statement category. Each category has distinct plan structures tailored to its execution requirements:

DDL Plans

Data Definition Language plans modify schema and structure:

DML Plans

Data Manipulation Language plans modify table contents:

Query Plans

SELECT statements produce the most complex plans with nested substructures:

Sources: llkv-plan/src/plans.rs:164-1620

Plan Construction Patterns

Plan structures expose builder-style APIs for incremental construction:

Builder Methods

Most plan types provide fluent builder methods:

CreateTablePlan::new("users")
    .with_if_not_exists(true)
    .with_columns(vec![
        PlanColumnSpec::new("id", DataType::Int64, false)
            .with_primary_key(true),
        PlanColumnSpec::new("name", DataType::Utf8, true),
    ])
    .with_namespace(Some("main".to_string()))

SelectPlan Construction

SelectPlan is built incrementally as the translator walks the SQL AST:

SelectPlan::new("users")
    .with_projections(projections)
    .with_filter(Some(SelectFilter {
        predicate: expr,
        subqueries: vec![],
    }))
    .with_aggregates(aggregates)
    .with_order_by(order_by)
    .with_group_by(group_by_columns)

The with_tables constructor supports multi-table queries:

SelectPlan::with_tables(vec![
    TableRef::new("main", "orders"),
    TableRef::new("main", "customers"),
])
.with_joins(vec![
    JoinMetadata {
        left_table_index: 0,
        join_type: JoinPlan::Inner,
        on_condition: Some(join_expr),
    },
])

Join Metadata

JoinMetadata replaces the older parallel join_types and join_filters vectors by bundling join information into a single structure per join operation. Each entry describes how tables[i] connects to tables[i+1]:

| Field | Type | Purpose |
| --- | --- | --- |
| left_table_index | usize | Index into SelectPlan.tables |
| join_type | JoinPlan | INNER, LEFT, RIGHT, or FULL |
| on_condition | Option<Expr> | ON clause predicate |

Sources: llkv-plan/src/plans.rs:176-952 llkv-plan/src/plans.rs:756-792

Subquery Plan Structures

Subqueries appear in two contexts and use distinct plan structures:

Filter Subqueries

Used in WHERE clauses with EXISTS predicates:

The SubqueryId links the predicate’s Expr::Exists node to the corresponding FilterSubquery in the subqueries vector.

Scalar Subqueries

Used in SELECT projections to return a single value:

Correlated Column Tracking

When a subquery references outer query columns, the planner injects placeholder column names and tracks the mapping:

CorrelatedColumn {
    placeholder: "__subquery_corr_0_users_id",  // Injected into subquery
    column: "users.id",                          // Actual outer column
    field_path: vec![],                          // Empty for simple columns
}

For struct field references like users.address.city:

CorrelatedColumn {
    placeholder: "__subquery_corr_0_users_address",
    column: "users.address",
    field_path: vec!["city".to_string()],
}

Sources: llkv-plan/src/plans.rs:23-68 llkv-expr/src/expr.rs:42-65

graph TD
    LOGICAL["Logical Plans\n(SelectPlan, InsertPlan, etc.)"]
LOGICAL --> PHYS_BUILD["Physical plan builder\nllkv-plan::physical"]
PHYS_BUILD --> PHYS["PhysicalPlan"]
PHYS --> SCAN["TableScanPlan\nAccess path selection"]
PHYS --> JOIN_EXEC["Join algorithm choice\n(hash join, nested loop)"]
PHYS --> AGG_EXEC["Aggregation strategy"]
PHYS --> SORT_EXEC["Sort algorithm"]
SCAN --> FULL["Full table scan"]
SCAN --> INDEX["Index scan"]
SCAN --> FILTER_SCAN["Filtered scan with pushdown"]
PHYS --> EXEC["llkv-executor\nQuery execution"]

Physical Planning

While logical plans represent what operations to perform, physical plans specify how to execute them with specific algorithms and data access patterns:

The table_scan module provides build_table_scan_plan which analyzes predicates to determine the optimal scan strategy:

TableScanProjectionSpec {
    columns: Vec<String>,           // Columns to retrieve
    filter: Option<Expr>,           // Pushed-down filter
    order_by: Vec<OrderByPlan>,     // Sort requirements
    limit: Option<usize>,           // Row limit
}

This physical plan metadata guides the executor in choosing between:

  • Full scans: No filter, read all rows
  • Filter scans: Predicate evaluation during scan
  • Index scans: Use indexes when available and beneficial

Sources: llkv-plan/src/physical.rs:1-50 (inferred), llkv-plan/src/table_scan.rs:1-50 (inferred)

Translation and Validation

Translation Process

The translation modules convert SQL AST nodes to plan structures while preserving semantic meaning:

Projection Translation: SQL projection items map to SelectProjection variants:

  • SELECT * → SelectProjection::AllColumns
  • SELECT * EXCEPT (x) → SelectProjection::AllColumnsExcept
  • SELECT col → SelectProjection::Column { name, alias }
  • SELECT expr AS name → SelectProjection::Computed { expr, alias }

Predicate Translation: SQL predicates are converted to Expr trees:

  • WHERE a = 1 AND b < 2 → Expr::And([Filter(a=1), Filter(b<2)])
  • WHERE x IN (1,2,3) → Expr::Pred(Filter { op: Operator::In(...) })
  • WHERE EXISTS (SELECT ...) → Expr::Exists(SubqueryExpr { id, negated })
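A hedged construction sketch for the first example, using the Expr, Filter, and Operator shapes documented under Expression System; the Literal variant name is an assumption:

// Sketch: WHERE a = 1 AND b < 2 as an llkv-expr predicate tree.
let predicate: Expr<'static, String> = Expr::And(vec![
    Expr::Pred(Filter {
        field_id: "a".to_string(),
        op: Operator::Equals(Literal::Integer(1)),   // Literal variant name assumed
    }),
    Expr::Pred(Filter {
        field_id: "b".to_string(),
        op: Operator::LessThan(Literal::Integer(2)), // Literal variant name assumed
    }),
]);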

Aggregate Translation: Aggregate functions map to AggregateExpr variants:

  • COUNT(*) → AggregateExpr::CountStar
  • SUM(col) → AggregateExpr::Column { function: SumInt64, ... }
  • AVG(DISTINCT col) → AggregateExpr::Column { distinct: true, ... }

Validation Helpers

The validation module enforces constraints during plan construction:

| Validation Type | Purpose |
| --- | --- |
| Schema validation | Verify table and column existence |
| Type compatibility | Check data type conversions |
| Constraint validation | Enforce PRIMARY KEY, UNIQUE, CHECK, FK rules |
| Identifier resolution | Resolve ambiguous column references |
| Aggregate context | Ensure aggregates only in valid contexts |

The validation module provides helper functions that translators invoke at critical points:

  • Before plan construction: Validate referenced tables exist
  • During expression building: Resolve column identifiers
  • After plan assembly: Final consistency checks

Canonical Value Conversion

The canonical module converts PlanValue instances to CanonicalScalar for internal processing:

PlanValue::Integer(42) → CanonicalScalar::Int64(42)
PlanValue::String("abc") → CanonicalScalar::String("abc")
PlanValue::Decimal(dec) → CanonicalScalar::Decimal(dec)

This normalization ensures consistent value representation across plan construction and execution.

Sources: llkv-plan/src/translation.rs:1-50 (inferred), llkv-plan/src/validation.rs:1-50 (inferred), llkv-plan/src/canonical.rs:1-50 (inferred), llkv-plan/src/plans.rs:126-161

Plan Metadata and Auxiliary Types

Plans contain rich metadata to guide execution:

PlanValue

Represents literal values in plans:

ColumnAssignment

UPDATE plans use ColumnAssignment to specify modifications:

ColumnAssignment {
    column: "age",
    value: AssignmentValue::Literal(PlanValue::Integer(25)),
}

ColumnAssignment {
    column: "updated_at",
    value: AssignmentValue::Expression(
        ScalarExpr::Column("current_timestamp")
    ),
}

Constraint Specifications

CREATE TABLE plans include constraint metadata:

PlanColumnSpec {
    name: "id",
    data_type: DataType::Int64,
    nullable: false,
    primary_key: true,
    unique: true,
    check_expr: None,
}

ForeignKeySpec {
    name: Some("fk_orders_customer"),
    columns: vec!["customer_id"],
    referenced_table: "customers",
    referenced_columns: vec!["id"],
    on_delete: ForeignKeyAction::Restrict,
    on_update: ForeignKeyAction::NoAction,
}

MultiColumnUniqueSpec {
    name: Some("unique_email_username"),
    columns: vec!["email", "username"],
}

Sources: llkv-plan/src/plans.rs:73-161 llkv-plan/src/plans.rs:167-427 llkv-plan/src/plans.rs:660-682 llkv-plan/src/plans.rs:499-546

Traversal and Program Compilation

AST Traversal

The traversal module provides generic postorder traversal for deeply nested ASTs:

traverse_postorder(
    root_node,
    |node| { /* pre-visit logic */ },
    |node| { /* post-visit logic */ }
)

This pattern supports:

  • Subquery discovery during SELECT translation
  • Expression normalization
  • Dead code elimination in predicates

Program Compilation

Plans containing expressions are compiled into evaluation programs for efficient execution. The ProgramCompiler converts Expr trees into bytecode-like instruction sequences:

Expr → EvalProgram → Execution

The compiled programs optimize:

  • Predicate evaluation order
  • Short-circuit boolean logic
  • Literal folding
  • Domain program generation for chunk pruning

Sources: llkv-plan/src/traversal.rs:1-50 (inferred), llkv-plan/src/lib.rs:38-42

sequenceDiagram
    participant Planner as llkv-plan
    participant Runtime as llkv-runtime
    participant Executor as llkv-executor
    participant Table as llkv-table
    
    Planner->>Runtime: Logical Plan
    Runtime->>Runtime: Acquire transaction context
    
    alt SELECT Plan
        Runtime->>Executor: execute_select(plan, txn)
        Executor->>Executor: Build execution pipeline
        Executor->>Table: scan with filter
        Table-->>Executor: RecordBatch stream
        Executor->>Executor: Apply aggregations, joins, sorts
        Executor-->>Runtime: Final RecordBatch
    else INSERT Plan
        Runtime->>Runtime: Convert InsertSource to batches
        Runtime->>Table: append(batches)
        Table-->>Runtime: Row count
    else UPDATE/DELETE Plan
        Runtime->>Executor: execute_update/delete(plan, txn)
        Executor->>Table: filter + modify
        Table-->>Runtime: Row count
    else DDL Plan
        Runtime->>Table: Modify catalog/schema
        Table-->>Runtime: Success
    end
    
    Runtime->>Planner: Execution result

Integration with Execution

Plans flow from the planning layer to execution through well-defined interfaces:

The execution layer consumes plan metadata to:

  1. Determine table access order
  2. Select join algorithms
  3. Push down predicates to storage
  4. Apply aggregations and sorts
  5. Stream results incrementally

Sources: llkv-plan/src/plans.rs:1-1620 llkv-executor/src/lib.rs:1-50 (inferred), llkv-runtime/src/lib.rs:1-50 (inferred)


Plan Structures

Relevant source files

This document describes the logical plan data structures defined in the llkv-plan crate. These structures represent parsed and validated SQL operations before execution. For information about how plans are created from SQL AST, see Subquery and Correlation Handling. For details on expression compilation and evaluation, see Expression System.

Purpose and Scope

Plan structures are the intermediate representation (IR) between SQL parsing and query execution. The SQL parser produces AST nodes from sqlparser-rs, which the planner translates into strongly-typed plan structures. The executor consumes these plans to produce results. Plans carry all semantic information needed for execution: table references, column specifications, predicates, projections, aggregations, and ordering constraints.

All plan types are defined in llkv-plan/src/plans.rs

Sources: llkv-plan/src/plans.rs:1-1263

Plan Statement Hierarchy

All executable statements are represented by the PlanStatement enum, which serves as the top-level discriminated union of all plan types.

Sources: llkv-plan/src/plans.rs:1244-1263

graph TB
    PlanStatement["PlanStatement\n(Top-level enum)"]
subgraph "Transaction Control"
        BeginTransaction["BeginTransaction"]
CommitTransaction["CommitTransaction"]
RollbackTransaction["RollbackTransaction"]
end
    
    subgraph "DDL Operations"
        CreateTable["CreateTablePlan"]
DropTable["DropTablePlan"]
CreateView["CreateViewPlan"]
DropView["DropViewPlan"]
CreateIndex["CreateIndexPlan"]
DropIndex["DropIndexPlan"]
Reindex["ReindexPlan"]
AlterTable["AlterTablePlan"]
end
    
    subgraph "DML Operations"
        Insert["InsertPlan"]
Update["UpdatePlan"]
Delete["DeletePlan"]
Truncate["TruncatePlan"]
end
    
    subgraph "Query Operations"
        Select["SelectPlan"]
end
    
 
   PlanStatement --> BeginTransaction
 
   PlanStatement --> CommitTransaction
 
   PlanStatement --> RollbackTransaction
    
 
   PlanStatement --> CreateTable
 
   PlanStatement --> DropTable
 
   PlanStatement --> CreateView
 
   PlanStatement --> DropView
 
   PlanStatement --> CreateIndex
 
   PlanStatement --> DropIndex
 
   PlanStatement --> Reindex
 
   PlanStatement --> AlterTable
    
 
   PlanStatement --> Insert
 
   PlanStatement --> Update
 
   PlanStatement --> Delete
 
   PlanStatement --> Truncate
    
 
   PlanStatement --> Select

DDL Plan Structures

DDL (Data Definition Language) plans modify database schema: creating, altering, and dropping tables, views, and indexes.

CreateTablePlan

The CreateTablePlan structure defines table creation operations, including optional data sources for CREATE TABLE AS SELECT.

| Field | Type | Description |
| --- | --- | --- |
| name | String | Table name |
| if_not_exists | bool | Skip creation if table exists |
| or_replace | bool | Replace existing table |
| columns | Vec<PlanColumnSpec> | Column definitions |
| source | Option<CreateTableSource> | Optional data source (batches or SELECT) |
| namespace | Option<String> | Optional storage namespace |
| foreign_keys | Vec<ForeignKeySpec> | Foreign key constraints |
| multi_column_uniques | Vec<MultiColumnUniqueSpec> | Multi-column UNIQUE constraints |

Sources: llkv-plan/src/plans.rs:176-203

PlanColumnSpec

Column specifications carry metadata from the planner to the executor, including type, nullability, and constraints.

| Field | Type | Description |
| --- | --- | --- |
| name | String | Column name |
| data_type | DataType | Arrow data type |
| nullable | bool | Whether NULL values are permitted |
| primary_key | bool | Whether this is the primary key |
| unique | bool | Whether values must be unique |
| check_expr | Option<String> | Optional CHECK constraint SQL |

Sources: llkv-plan/src/plans.rs:503-546

CreateTableSource

The CreateTableSource enum specifies data sources for CREATE TABLE AS operations:

Sources: llkv-plan/src/plans.rs:607-617

AlterTablePlan

The AlterTablePlan structure defines table modification operations:

Sources: llkv-plan/src/plans.rs:364-406

Index Plans

Index management is handled by three plan types:

| Plan Type | Purpose | Key Fields |
| --- | --- | --- |
| CreateIndexPlan | Create new index | name, table, unique, columns: Vec<IndexColumnPlan> |
| DropIndexPlan | Remove index | name, canonical_name, if_exists |
| ReindexPlan | Rebuild index | name, canonical_name |

Sources: llkv-plan/src/plans.rs:433-358

DML Plan Structures

DML (Data Manipulation Language) plans modify table data: inserting, updating, and deleting rows.

graph TB
    InsertPlan["InsertPlan"]
InsertSource["InsertSource\n(enum)"]
Rows["Rows\nVec&lt;Vec&lt;PlanValue&gt;&gt;"]
Batches["Batches\nVec&lt;RecordBatch&gt;"]
SelectSource["Select\nBox&lt;SelectPlan&gt;"]
ConflictAction["InsertConflictAction"]
None["None"]
Replace["Replace"]
Ignore["Ignore"]
Abort["Abort"]
Fail["Fail"]
Rollback["Rollback"]
InsertPlan -->|source| InsertSource
 
   InsertPlan -->|on_conflict| ConflictAction
    
 
   InsertSource --> Rows
 
   InsertSource --> Batches
 
   InsertSource --> SelectSource
    
 
   ConflictAction --> None
 
   ConflictAction --> Replace
 
   ConflictAction --> Ignore
 
   ConflictAction --> Abort
 
   ConflictAction --> Fail
 
   ConflictAction --> Rollback

InsertPlan

The InsertPlan includes SQLite-style conflict resolution actions (INSERT OR REPLACE, INSERT OR IGNORE, etc.) to handle constraint violations.

Sources: llkv-plan/src/plans.rs:623-656

UpdatePlan

The UpdatePlan structure defines row update operations:

| Field | Type | Description |
| --- | --- | --- |
| table | String | Target table name |
| assignments | Vec<ColumnAssignment> | Column updates |
| filter | Option<Expr<'static, String>> | Optional WHERE predicate |

Each ColumnAssignment maps a column to an AssignmentValue:

Sources: llkv-plan/src/plans.rs:661-681

DeletePlan and TruncatePlan

Both plans remove rows from tables:

  • DeletePlan: Removes rows matching an optional filter predicate
  • TruncatePlan: Removes all rows (no filter)

Sources: llkv-plan/src/plans.rs:687-702

graph TB
    SelectPlan["SelectPlan"]
Tables["tables: Vec&lt;TableRef&gt;\nFROM clause tables"]
Joins["joins: Vec&lt;JoinMetadata&gt;\nJoin specifications"]
Projections["projections: Vec&lt;SelectProjection&gt;\nSELECT list"]
Filter["filter: Option&lt;SelectFilter&gt;\nWHERE clause"]
Having["having: Option&lt;Expr&gt;\nHAVING clause"]
GroupBy["group_by: Vec&lt;String&gt;\nGROUP BY columns"]
OrderBy["order_by: Vec&lt;OrderByPlan&gt;\nORDER BY specs"]
Aggregates["aggregates: Vec&lt;AggregateExpr&gt;\nAggregate functions"]
ScalarSubqueries["scalar_subqueries: Vec&lt;ScalarSubquery&gt;\nScalar subquery plans"]
Compound["compound: Option&lt;CompoundSelectPlan&gt;\nUNION/INTERSECT/EXCEPT"]
SelectPlan --> Tables
 
   SelectPlan --> Joins
 
   SelectPlan --> Projections
 
   SelectPlan --> Filter
 
   SelectPlan --> Having
 
   SelectPlan --> GroupBy
 
   SelectPlan --> OrderBy
 
   SelectPlan --> Aggregates
 
   SelectPlan --> ScalarSubqueries
 
   SelectPlan --> Compound

SelectPlan Structure

The SelectPlan is the most complex plan type, representing SELECT queries with projections, filters, joins, aggregates, ordering, and compound operations.

Core SelectPlan Fields

Sources: llkv-plan/src/plans.rs:800-864

TableRef and JoinMetadata

Table references include optional aliases and schema qualifications:

Join metadata connects consecutive tables in the tables vector:

Sources: llkv-plan/src/plans.rs:708-792

SelectProjection

The SelectProjection enum defines what columns appear in results:

Sources: llkv-plan/src/plans.rs:1007-1021

SelectFilter and Subqueries

The SelectFilter structure bundles predicates with correlated subqueries:

Scalar subqueries (used in projections) follow a similar structure:

Sources: llkv-plan/src/plans.rs:28-67

AggregateExpr

Aggregate expressions define computations over grouped data:

Sources: llkv-plan/src/plans.rs:1036-1128

OrderByPlan

Ordering specifications include target, direction, and null handling:

Sources: llkv-plan/src/plans.rs:1203-1225

graph TB
    ExprTypes["Expression Types"]
Expr["Expr&lt;F&gt;\nBoolean expressions"]
ScalarExpr["ScalarExpr&lt;F&gt;\nScalar computations"]
Filter["Filter&lt;F&gt;\nSingle-field predicates"]
Operator["Operator\nComparison operators"]
ExprVariants["Expr variants"]
And["And(Vec&lt;Expr&gt;)"]
Or["Or(Vec&lt;Expr&gt;)"]
Not["Not(Box&lt;Expr&gt;)"]
Pred["Pred(Filter)"]
Compare["Compare{left, op, right}"]
InList["InList{expr, list, negated}"]
IsNull["IsNull{expr, negated}"]
Literal["Literal(bool)"]
Exists["Exists(SubqueryExpr)"]
ScalarVariants["ScalarExpr variants"]
Column["Column(F)"]
Lit["Literal(Literal)"]
Binary["Binary{left, op, right}"]
Aggregate["Aggregate(AggregateCall)"]
Cast["Cast{expr, data_type}"]
Case["Case{operand, branches, else_expr}"]
Coalesce["Coalesce(Vec)"]
ScalarSubquery["ScalarSubquery(ScalarSubqueryExpr)"]
ExprTypes --> Expr
 
   ExprTypes --> ScalarExpr
 
   Expr --> Filter
 
   Filter --> Operator
    
 
   Expr --> ExprVariants
 
   ExprVariants --> And
 
   ExprVariants --> Or
 
   ExprVariants --> Not
 
   ExprVariants --> Pred
 
   ExprVariants --> Compare
 
   ExprVariants --> InList
 
   ExprVariants --> IsNull
 
   ExprVariants --> Literal
 
   ExprVariants --> Exists
    
 
   ScalarExpr --> ScalarVariants
 
   ScalarVariants --> Column
 
   ScalarVariants --> Lit
 
   ScalarVariants --> Binary
 
   ScalarVariants --> Aggregate
 
   ScalarVariants --> Cast
 
   ScalarVariants --> Case
 
   ScalarVariants --> Coalesce
 
   ScalarVariants --> ScalarSubquery

Expression Integration

Plans embed expression trees from llkv-expr for predicates, projections, and computed values.

Expression Type Hierarchy

Sources: llkv-expr/src/expr.rs:14-183

Expr - Boolean Predicates

The Expr type represents boolean-valued expressions used in WHERE, HAVING, and ON clauses. The generic parameter F allows different field identifier types (commonly String for column names):

| Variant | Description |
| --- | --- |
| And(Vec<Expr<F>>) | Logical conjunction |
| Or(Vec<Expr<F>>) | Logical disjunction |
| Not(Box<Expr<F>>) | Logical negation |
| Pred(Filter<F>) | Single-field predicate |
| Compare{left, op, right} | Scalar comparison (col1 < col2) |
| InList{expr, list, negated} | Membership test |
| IsNull{expr, negated} | NULL test for complex expressions |
| Literal(bool) | Constant true/false |
| Exists(SubqueryExpr) | Correlated subquery existence check |

Sources: llkv-expr/src/expr.rs:14-123

ScalarExpr - Scalar Computations

The ScalarExpr type represents scalar-valued expressions used in projections and assignments:

| Variant | Description |
| --- | --- |
| Column(F) | Column reference |
| Literal(Literal) | Constant value |
| Binary{left, op, right} | Arithmetic/logical operation |
| Not(Box<ScalarExpr<F>>) | Logical NOT |
| IsNull{expr, negated} | NULL test returning 1/0 |
| Aggregate(AggregateCall<F>) | Aggregate function call |
| GetField{base, field_name} | Struct field extraction |
| Cast{expr, data_type} | Type conversion |
| Compare{left, op, right} | Comparison returning 1/0 |
| Coalesce(Vec<ScalarExpr<F>>) | First non-NULL value |
| ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery result |
| Case{operand, branches, else_expr} | CASE expression |
| Random | Random number generator |

Sources: llkv-expr/src/expr.rs:125-307

Operator Types

The Operator enum defines comparison and pattern matching operations applied to single fields:

Sources: llkv-expr/src/expr.rs:372-428

graph LR
    PlanValue["PlanValue\n(enum)"]
Null["Null"]
Integer["Integer(i64)"]
Float["Float(f64)"]
Decimal["Decimal(DecimalValue)"]
String["String(String)"]
Date32["Date32(i32)"]
Struct["Struct(FxHashMap)"]
Interval["Interval(IntervalValue)"]
PlanValue --> Null
 
   PlanValue --> Integer
 
   PlanValue --> Float
 
   PlanValue --> Decimal
 
   PlanValue --> String
 
   PlanValue --> Date32
 
   PlanValue --> Struct
 
   PlanValue --> Interval

Supporting Types

PlanValue

The PlanValue enum represents runtime values in plans, providing a type-safe wrapper for literals and computed values:

PlanValue implements From<T> for common types, enabling ergonomic construction:
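A hedged sketch of the resulting ergonomics; which concrete From implementations exist is an assumption beyond the variants listed above:

// Sketch: building a row of PlanValue literals from plain Rust values.
let row: Vec<PlanValue> = vec![
    PlanValue::from(42_i64),              // assumes From<i64> maps to Integer
    PlanValue::from("Alice".to_string()), // assumes From<String> maps to String
    PlanValue::Null,
];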

Sources: llkv-plan/src/plans.rs:73-161

Foreign Key Specifications

Foreign key constraints are defined by ForeignKeySpec:

Sources: llkv-plan/src/plans.rs:412-427

graph TB
    CompoundSelectPlan["CompoundSelectPlan"]
Initial["initial: Box&lt;SelectPlan&gt;\nFirst query"]
Operations["operations: Vec&lt;CompoundSelectComponent&gt;\nSubsequent operations"]
Component["CompoundSelectComponent"]
Operator["operator:\nUnion / Intersect / Except"]
Quantifier["quantifier:\nDistinct / All"]
Plan["plan: SelectPlan\nRight-hand query"]
CompoundSelectPlan --> Initial
 
   CompoundSelectPlan --> Operations
    
 
   Operations --> Component
 
   Component --> Operator
 
   Component --> Quantifier
 
   Component --> Plan

Compound Queries

Compound queries (UNION, INTERSECT, EXCEPT) are represented by CompoundSelectPlan:

This structure allows arbitrary chains of set operations:

  • SELECT ... UNION ALL SELECT ... INTERSECT SELECT ...

Sources: llkv-plan/src/plans.rs:954-1004

Plan Construction Example

A typical SELECT plan construction follows this pattern:
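A sketch using the builder methods and expression types shown elsewhere on this wiki; the Literal variant name, the boxing of Binary operands, and the OrderByPlan construction are assumptions:

// Sketch: build the plan for the query described below.
let order_by: Vec<OrderByPlan> = Vec::new(); // ORDER BY id spec elided; see plans.rs:1203-1225
let plan = SelectPlan::new("users")
    .with_projections(vec![
        SelectProjection::Column { name: "id".to_string(), alias: None },
        SelectProjection::Computed {
            expr: ScalarExpr::Binary {
                left: Box::new(ScalarExpr::Column("age".to_string())),
                op: BinaryOp::Add,
                right: Box::new(ScalarExpr::Literal(Literal::Integer(1))), // variant name assumed
            },
            alias: "next_age".to_string(),
        },
    ])
    .with_filter(Some(SelectFilter {
        predicate: Expr::Pred(Filter {
            field_id: "age".to_string(),
            op: Operator::GreaterThanOrEquals(Literal::Integer(18)),       // variant name assumed
        }),
        subqueries: vec![],
    }))
    .with_order_by(order_by);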

This produces a plan for: SELECT id, age + 1 AS next_age FROM users WHERE age >= 18 ORDER BY id

Sources: llkv-plan/src/plans.rs:831-952


Subquery and Correlation Handling

Relevant source files

Purpose and Scope

This document explains how LLKV handles correlated and uncorrelated subqueries in SQL queries. It covers the expression AST structures for representing subqueries, the plan-level metadata for tracking correlation relationships, and the placeholder system used during query planning.

For information about query plan structures more broadly, see Plan Structures. For expression evaluation and compilation, see Expression System.


Overview

LLKV supports two categories of subqueries:

  1. Predicate Subqueries - Used in WHERE clauses with EXISTS or NOT EXISTS
  2. Scalar Subqueries - Used in SELECT projections or expressions, returning a single value

Both types can be either correlated (referencing columns from outer queries) or uncorrelated (self-contained). Correlated subqueries require special handling to capture outer column references and inject them as parameters during evaluation.

The system uses a multi-stage approach:

  • Planning Phase : Assign unique SubqueryId values and build metadata
  • Translation Phase : Replace correlated column references with placeholders
  • Execution Phase : Evaluate subqueries per outer row, binding placeholder values

Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:27-67


Subquery Identification System

SubqueryId

Each subquery within a query plan receives a unique SubqueryId identifier. This allows the executor to distinguish multiple subqueries and manage their evaluation contexts separately.

SubqueryId Assignment Flow

graph TB
    Query["Main Query\n(SELECT Plan)"]
Filter["WHERE Clause\n(SelectFilter)"]
Proj["SELECT List\n(Projections)"]
Sub1["EXISTS Subquery\nSubqueryId(0)"]
Sub2["Scalar Subquery\nSubqueryId(1)"]
Sub3["EXISTS Subquery\nSubqueryId(2)"]
Query --> Filter
 
   Query --> Proj
 
   Filter --> Sub1
 
   Filter --> Sub3
 
   Proj --> Sub2
    
    Sub1 -.references.-> Meta1["FilterSubquery\nid=0\nplan + correlations"]
Sub2 -.references.-> Meta2["ScalarSubquery\nid=1\nplan + correlations"]
Sub3 -.references.-> Meta3["FilterSubquery\nid=2\nplan + correlations"]

The planner assigns sequential IDs during query translation. Each subquery expression (Expr::Exists or ScalarExpr::ScalarSubquery) holds its assigned ID, while the full subquery plan and correlation metadata are stored separately in the parent SelectPlan.

| Structure | Location | Purpose |
| --- | --- | --- |
| SubqueryId | Expression AST | References a subquery definition |
| FilterSubquery | SelectPlan.filter.subqueries | Metadata for EXISTS/NOT EXISTS |
| ScalarSubquery | SelectPlan.scalar_subqueries | Metadata for scalar subqueries |

Sources: llkv-expr/src/expr.rs:46-65 llkv-plan/src/plans.rs:36-56


Expression AST Structures

SubqueryExpr for Predicates

The SubqueryExpr structure appears in boolean expressions as Expr::Exists variants. It supports negation for NOT EXISTS semantics.

Example WHERE clause : WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id)

The predicate tree contains Expr::Exists(SubqueryExpr { id: SubqueryId(0), negated: false }), while SelectPlan.filter.subqueries[0] holds the actual subquery plan and correlation data.

Sources: llkv-expr/src/expr.rs:49-56

ScalarSubqueryExpr for Projections

Scalar subqueries return a single value per outer row and appear in ScalarExpr trees. They carry the expected data type for validation.

Example projection : SELECT name, (SELECT COUNT(*) FROM orders WHERE orders.customer_id = customers.id) AS order_count

The expression is ScalarExpr::ScalarSubquery(ScalarSubqueryExpr { id: SubqueryId(0), data_type: Int64 }), with full plan details in SelectPlan.scalar_subqueries[0].

Sources: llkv-expr/src/expr.rs:58-65 llkv-plan/src/plans.rs:47-56


Plan-Level Metadata Structures

classDiagram
    class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
    
    class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
    
    class SelectFilter {+Expr predicate\n+Vec~FilterSubquery~ subqueries}
    
    SelectFilter --> FilterSubquery : contains
    FilterSubquery --> CorrelatedColumn : tracks correlations
    FilterSubquery --> SelectPlan : nested plan

FilterSubquery

This structure captures all metadata needed to evaluate an EXISTS predicate during query execution.

Field Descriptions:

| Field | Type | Purpose |
| --- | --- | --- |
| id | SubqueryId | Unique identifier referenced in predicate AST |
| plan | Box<SelectPlan> | Complete logical plan for subquery execution |
| correlated_columns | Vec<CorrelatedColumn> | Outer columns referenced inside subquery |

Sources: llkv-plan/src/plans.rs:36-45

ScalarSubquery

Parallel structure for scalar subqueries used in projections:

The plan must produce exactly one row with one column. If the subquery returns zero rows, the executor produces NULL. Multiple rows trigger a runtime error.

Sources: llkv-plan/src/plans.rs:47-56

CorrelatedColumn

Describes a single correlated column reference captured from the outer query:

Example:

  • Outer Query: SELECT * FROM customers WHERE ...
  • Subquery: SELECT 1 FROM orders WHERE orders.customer_id = customers.id
  • CorrelatedColumn: { placeholder: "$corr_0", column: "id", field_path: [] }

During subquery evaluation, the executor binds $corr_0 to the current outer row’s id value.

Sources: llkv-plan/src/plans.rs:58-67


Correlation Tracking System

sequenceDiagram
    participant Planner
    participant Tracker as SubqueryCorrelatedTracker
    participant SubqueryPlan
    
    Planner->>Tracker: new()
    Planner->>Tracker: track_column("customers.id")
    Tracker-->>Planner: placeholder = "$corr_0"
    
    Planner->>SubqueryPlan: Replace "customers.id" with "$corr_0"
    
    Planner->>Tracker: finalize()
    Tracker-->>Planner: Vec<CorrelatedColumn>

Placeholder Naming Convention

The planner module exports SUBQUERY_CORRELATED_PLACEHOLDER_PREFIX which defines the prefix for generated placeholder names. The helper function subquery_correlated_placeholder(index) produces names like $corr_0, $corr_1, etc.

Tracking Workflow:

  1. Create SubqueryCorrelatedTracker when entering subquery translation
  2. For each outer column reference, call track_column(canonical_name) → returns placeholder
  3. Replace column reference in subquery AST with placeholder
  4. Call finalize() to extract Vec<CorrelatedColumn> for plan metadata
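A sketch of that workflow, mirroring the sequence diagram above; the tracker type name and exact method signatures are assumptions beyond what this page lists:

// Sketch: capture outer-column references while translating a subquery.
let mut tracker = SubqueryCorrelatedTracker::new();

// "customers.id" belongs to the outer scope but is referenced inside the subquery.
let placeholder = tracker.track_column("customers.id"); // e.g. "$corr_0"

// ... rewrite the subquery AST so it reads `placeholder` instead of customers.id ...

let correlated_columns: Vec<CorrelatedColumn> = tracker.finalize();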

Sources: llkv-plan/src/lib.rs:43-46

SubqueryCorrelatedColumnTracker

This type (referenced in exports) manages the mapping between outer columns and generated placeholders. It ensures each unique outer column gets one placeholder regardless of how many times it’s referenced in the subquery.

Deduplication Example:

The tracker creates only one CorrelatedColumn entry with a single placeholder $corr_0 that both references use.

Sources: llkv-plan/src/lib.rs:44-45


Integration with Plan Structures

SelectPlan Storage

The SelectPlan structure holds subquery metadata in dedicated fields:

Storage Pattern:

| Subquery Type | Storage Location | ID Reference Location |
| --- | --- | --- |
| EXISTS in WHERE | filter.subqueries | filter.predicate → Expr::Exists |
| Scalar in SELECT | scalar_subqueries | projections → ScalarExpr::ScalarSubquery |
| Scalar in WHERE | scalar_subqueries | filter.predicate → comparison expressions |

Sources: llkv-plan/src/plans.rs:800-829

Builder Methods

The SelectPlan provides fluent builder methods for attaching subquery metadata:
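A sketch built from the structures defined earlier on this page; aside from with_filter, the construction details are assumptions:

// Sketch: attach an EXISTS subquery (SubqueryId 0) to the outer plan's filter.
// `orders_subquery_plan` is the already-translated inner SelectPlan.
let correlated_columns = vec![CorrelatedColumn {
    placeholder: "$corr_0".to_string(),
    column: "id".to_string(),
    field_path: vec![],
}];

let plan = SelectPlan::new("customers")
    .with_filter(Some(SelectFilter {
        predicate: Expr::Exists(SubqueryExpr { id: SubqueryId(0), negated: false }),
        subqueries: vec![FilterSubquery {
            id: SubqueryId(0),
            plan: Box::new(orders_subquery_plan),
            correlated_columns,
        }],
    }));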

The planner typically calls these after translating the main query and all nested subqueries.

Sources: llkv-plan/src/plans.rs:895-909


Identifier Resolution During Correlation

graph TB
    OuterQuery["Outer Query\ntable=customers\nalias=c"]
SubqueryScope["Subquery Scope\ntable=orders\nalias=o"]
Identifier["Column Reference:\n'c.id'"]
Resolver["IdentifierResolver"]
OuterQuery --> Context1["IdentifierContext\ntable_id=1\nalias=Some('c')"]
SubqueryScope --> Context2["IdentifierContext\ntable_id=2\nalias=Some('o')"]
Identifier --> Resolver
 
   Resolver --> Decision{"Which scope?"}
Decision -->|Outer| Correlated["Mark as correlated\nGenerate placeholder"]
Decision -->|Inner| Local["Resolve locally"]

IdentifierContext for Outer Scopes

When translating a correlated subquery, the planner maintains an IdentifierContext that tracks which columns belong to outer query scopes vs. the subquery’s own tables.

Resolution Process:

  1. Check if identifier starts with outer table alias → mark correlated
  2. Check if identifier matches outer table columns (when no alias) → mark correlated
  3. Otherwise, resolve within subquery’s own table scope

Sources: llkv-table/src/resolvers/identifier.rs:8-66

ColumnResolution Structure

The resolver produces ColumnResolution objects that distinguish simple column references from nested field access:

For correlated columns, this resolution data flows into CorrelatedColumn.field_path, enabling struct-typed correlation like outer_table.struct_column.nested_field.

Sources: llkv-table/src/resolvers/identifier.rs:68-100


Execution Flow

Evaluation Steps:

  1. Outer Row Context: For each row from the outer query, create an evaluation context
  2. Placeholder Binding: Extract values for correlated columns from the outer row
  3. Subquery Execution: Run the inner query with bound placeholder values
  4. Result Integration: EXISTS → boolean, Scalar → value, integrated into the outer query

The executor must evaluate correlated subqueries once per outer row, making them potentially expensive. Uncorrelated subqueries can be evaluated once and cached.

Sources: llkv-plan/src/plans.rs:27-67


Example: Correlated EXISTS Subquery

Consider a query of this shape: SELECT name, id FROM customers c WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.status = 'pending')

Planning Phase Output

SelectPlan Structure:

SelectPlan {
    tables: [TableRef { table: "customers", alias: Some("c") }],
    projections: [
        Column { name: "name", alias: None },
        Column { name: "id", alias: None }
    ],
    filter: Some(SelectFilter {
        predicate: Expr::Exists(SubqueryExpr {
            id: SubqueryId(0),
            negated: false
        }),
        subqueries: [
            FilterSubquery {
                id: SubqueryId(0),
                plan: Box::new(SelectPlan {
                    tables: [TableRef { table: "orders", alias: Some("o") }],
                    filter: Some(SelectFilter {
                        predicate: Expr::And([
                            Compare { 
                                left: Column("customer_id"), 
                                op: Eq, 
                                right: Column("$corr_0") 
                            },
                            Compare { 
                                left: Column("status"), 
                                op: Eq, 
                                right: Literal("pending") 
                            }
                        ]),
                        subqueries: []
                    }),
                    projections: [Computed { expr: Literal(1), alias: "1" }]
                }),
                correlated_columns: [
                    CorrelatedColumn {
                        placeholder: "$corr_0",
                        column: "id",
                        field_path: []
                    }
                ]
            }
        ]
    }),
    // ... other fields ...
}

Translation Details

| Original Reference | After Translation | Correlation Entry |
| --- | --- | --- |
| c.id in subquery WHERE | $corr_0 | { placeholder: "$corr_0", column: "id", field_path: [] } |

The subquery’s filter now compares o.customer_id against the placeholder $corr_0 instead of directly referencing the outer column.

Sources: llkv-plan/src/plans.rs:27-67 llkv-expr/src/expr.rs:49-56


Key Design Decisions

Why Separate FilterSubquery and ScalarSubquery?

Different evaluation semantics:

  • EXISTS: Returns boolean, short-circuits on first match
  • Scalar: Must verify exactly one row returned, extracts value

Having distinct types allows the executor to apply appropriate validation and optimization strategies.

Why Store SubqueryId in Expression AST?

Decouples expression evaluation from subquery execution context. The expression tree remains lightweight and can be cloned/transformed without carrying full subquery plans. The executor looks up metadata by ID when needed.

Why Use String Placeholders?

String placeholders like $corr_0 integrate naturally with the existing column resolution system. The executor can treat them as special “virtual columns” that get their values from outer row context rather than table scans.

Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:36-67


Expression System

Relevant source files

The Expression System provides the foundational Abstract Syntax Tree (AST) and type system for representing predicates, scalar computations, and aggregate functions throughout LLKV’s query processing pipeline. This system decouples expression semantics from concrete Arrow data types through a generic design that allows expressions to be built, translated, optimized, and evaluated independently.

For information about query planning structures that contain these expressions, see Query Planning. For details on how expressions are executed and evaluated, see Query Execution.

Purpose and Scope

This page documents the expression representation layer defined primarily in llkv-expr, covering:

  • Boolean predicate expressions (Expr) for WHERE and HAVING clauses
  • Scalar arithmetic expressions (ScalarExpr) for projections and computed columns
  • Operator types and literal value representations
  • Subquery integration points
  • Generic field identifier patterns that enable expression translation from column names to internal field IDs
graph TB
    subgraph "Boolean Expression Layer"
        Expr["Expr&lt;'a, F&gt;\nBoolean Predicates"]
ExprAnd["And(Vec&lt;Expr&gt;)"]
ExprOr["Or(Vec&lt;Expr&gt;)"]
ExprNot["Not(Box&lt;Expr&gt;)"]
ExprPred["Pred(Filter)"]
ExprCompare["Compare{left, op, right}"]
ExprInList["InList{expr, list, negated}"]
ExprIsNull["IsNull{expr, negated}"]
ExprLiteral["Literal(bool)"]
ExprExists["Exists(SubqueryExpr)"]
end
    
    subgraph "Scalar Expression Layer"
        ScalarExpr["ScalarExpr&lt;F&gt;\nArithmetic/Values"]
SEColumn["Column(F)"]
SELiteral["Literal(Literal)"]
SEBinary["Binary{left, op, right}"]
SENot["Not(Box&lt;ScalarExpr&gt;)"]
SEIsNull["IsNull{expr, negated}"]
SEAggregate["Aggregate(AggregateCall)"]
SEGetField["GetField{base, field_name}"]
SECast["Cast{expr, data_type}"]
SECompare["Compare{left, op, right}"]
SECoalesce["Coalesce(Vec&lt;ScalarExpr&gt;)"]
SEScalarSubquery["ScalarSubquery(ScalarSubqueryExpr)"]
SECase["Case{operand, branches, else_expr}"]
SERandom["Random"]
end
    
    subgraph "Filter Layer"
        Filter["Filter&lt;'a, F&gt;"]
FilterField["field_id: F"]
FilterOp["op: Operator&lt;'a&gt;"]
end
    
    subgraph "Operator Types"
        Operator["Operator&lt;'a&gt;"]
OpEquals["Equals(Literal)"]
OpRange["Range{lower, upper}"]
OpGT["GreaterThan(Literal)"]
OpLT["LessThan(Literal)"]
OpIn["In(&amp;'a [Literal])"]
OpStartsWith["StartsWith{pattern, case_sensitive}"]
OpEndsWith["EndsWith{pattern, case_sensitive}"]
OpContains["Contains{pattern, case_sensitive}"]
OpIsNull["IsNull"]
OpIsNotNull["IsNotNull"]
end
    
 
   Expr --> ExprAnd
 
   Expr --> ExprOr
 
   Expr --> ExprNot
 
   Expr --> ExprPred
 
   Expr --> ExprCompare
 
   Expr --> ExprInList
 
   Expr --> ExprIsNull
 
   Expr --> ExprLiteral
 
   Expr --> ExprExists
    
 
   ExprPred --> Filter
 
   ExprCompare --> ScalarExpr
 
   ExprInList --> ScalarExpr
 
   ExprIsNull --> ScalarExpr
    
 
   Filter --> FilterField
 
   Filter --> FilterOp
 
   FilterOp --> Operator
    
 
   Operator --> OpEquals
 
   Operator --> OpRange
 
   Operator --> OpGT
 
   Operator --> OpLT
 
   Operator --> OpIn
 
   Operator --> OpStartsWith
 
   Operator --> OpEndsWith
 
   Operator --> OpContains
 
   Operator --> OpIsNull
 
   Operator --> OpIsNotNull
    
 
   ScalarExpr --> SEColumn
 
   ScalarExpr --> SELiteral
 
   ScalarExpr --> SEBinary
 
   ScalarExpr --> SENot
 
   ScalarExpr --> SEIsNull
 
   ScalarExpr --> SEAggregate
 
   ScalarExpr --> SEGetField
 
   ScalarExpr --> SECast
 
   ScalarExpr --> SECompare
 
   ScalarExpr --> SECoalesce
 
   ScalarExpr --> SEScalarSubquery
 
   ScalarExpr --> SECase
 
   ScalarExpr --> SERandom

The Expression System serves as the intermediate representation between SQL parsing and physical execution, allowing optimizations and transformations to occur independently of storage details.

Expression Type Hierarchy

Sources: llkv-expr/src/expr.rs:14-182

Boolean Expressions: Expr<'a, F>

The Expr<'a, F> enum represents logical boolean expressions used in WHERE clauses, HAVING clauses, and JOIN conditions. The generic type parameter F represents field identifiers, allowing the same expression structure to work with string column names during planning and numeric field IDs during execution.

Core Expression Variants

| Variant | Purpose | Example SQL |
| --- | --- | --- |
| And(Vec<Expr>) | Logical conjunction | WHERE a > 5 AND b < 10 |
| Or(Vec<Expr>) | Logical disjunction | WHERE status = 'active' OR status = 'pending' |
| Not(Box<Expr>) | Logical negation | WHERE NOT (price > 100) |
| Pred(Filter) | Single field predicate | WHERE age >= 18 |
| Compare{left, op, right} | Scalar comparison | WHERE col1 + col2 > 100 |
| InList{expr, list, negated} | Set membership | WHERE status IN ('active', 'pending') |
| IsNull{expr, negated} | Null testing | WHERE (col1 + col2) IS NULL |
| Literal(bool) | Constant boolean | Used for empty IN lists or optimizations |
| Exists(SubqueryExpr) | Correlated subquery | WHERE EXISTS (SELECT ...) |

The Expr type provides helper methods for constructing common patterns:

Sources: llkv-expr/src/expr.rs:14-123
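Since the helper signatures are not reproduced here, the following sketch builds an equivalent predicate directly from the enum variants documented above. It is a minimal illustration: the import path and the exact constructor forms are assumptions based on the crate layout described in this wiki.

// Assumed re-exports from the llkv-expr crate.
use llkv_expr::{Expr, Filter, Literal, Operator};

// WHERE age >= 18 AND status = 'active', using String field identifiers.
let predicate: Expr<'static, String> = Expr::And(vec![
    Expr::Pred(Filter {
        field_id: "age".to_string(),
        op: Operator::GreaterThanOrEquals(Literal::Int128(18)),
    }),
    Expr::Pred(Filter {
        field_id: "status".to_string(),
        op: Operator::Equals(Literal::String("active".to_string())),
    }),
]);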

Scalar Expressions: ScalarExpr

The ScalarExpr<F> enum represents expressions that compute scalar values, used in SELECT projections, computed columns, and as operands in boolean expressions. These expressions can reference columns, perform arithmetic, call functions, and handle complex nested computations.

Scalar Expression Variants

| Variant | Purpose | Example SQL |
| --- | --- | --- |
| Column(F) | Column reference | SELECT price FROM products |
| Literal(Literal) | Constant value | SELECT 42 |
| Binary{left, op, right} | Arithmetic operation | SELECT price * quantity |
| Not(Box<ScalarExpr>) | Logical NOT | SELECT NOT active |
| IsNull{expr, negated} | NULL test | SELECT col IS NULL |
| Aggregate(AggregateCall) | Aggregate function | SELECT COUNT(*) + 1 |
| GetField{base, field_name} | Struct field access | SELECT user.address.city |
| Cast{expr, data_type} | Type conversion | SELECT CAST(price AS INTEGER) |
| Compare{left, op, right} | Comparison returning 0/1 | SELECT (price > 100) |
| Coalesce(Vec<ScalarExpr>) | First non-NULL | SELECT COALESCE(price, 0) |
| ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery | SELECT (SELECT MAX(price) FROM items) |
| Case{operand, branches, else_expr} | Conditional | SELECT CASE WHEN ... THEN ... END |
| Random | Random number in [0.0, 1.0) | SELECT RANDOM() |

Operators and Filters

Binary Arithmetic Operators

The BinaryOp enum defines arithmetic and logical operations:

| Operator | SQL Symbol | Description |
| --- | --- | --- |
| Add | + | Addition |
| Subtract | - | Subtraction |
| Multiply | * | Multiplication |
| Divide | / | Division |
| Modulo | % | Modulo |
| And | AND | Logical AND |
| Or | OR | Logical OR |
| BitwiseShiftLeft | << | Left shift |
| BitwiseShiftRight | >> | Right shift |

Sources: llkv-expr/src/expr.rs:309-338

Comparison Operators

The CompareOp enum defines relational comparisons:

| Operator | SQL Symbol | Description |
| --- | --- | --- |
| Eq | = | Equality |
| NotEq | != | Inequality |
| Lt | < | Less than |
| LtEq | <= | Less than or equal |
| Gt | > | Greater than |
| GtEq | >= | Greater than or equal |

Sources: llkv-expr/src/expr.rs:340-363

Filter Operators

The Filter<'a, F> struct combines a field identifier with an Operator<'a> to represent single-field predicates. The Operator enum supports specialized operations optimized for columnar storage:

| Operator | Purpose | Optimization |
| --- | --- | --- |
| Equals(Literal) | Exact match | Hash-based lookup |
| Range{lower, upper} | Range query | Min/max chunk pruning |
| GreaterThan(Literal) | > comparison | Min/max chunk pruning |
| GreaterThanOrEquals(Literal) | >= comparison | Min/max chunk pruning |
| LessThan(Literal) | < comparison | Min/max chunk pruning |
| LessThanOrEquals(Literal) | <= comparison | Min/max chunk pruning |
| In(&'a [Literal]) | Set membership | Borrowed slice for efficiency |
| StartsWith{pattern, case_sensitive} | Prefix match | String-optimized |
| EndsWith{pattern, case_sensitive} | Suffix match | String-optimized |
| Contains{pattern, case_sensitive} | Substring match | String-optimized |
| IsNull | NULL test | Null bitmap scan |
| IsNotNull | NOT NULL test | Null bitmap scan |

The Operator::Range variant uses Rust’s std::ops::Bound enum to represent inclusive/exclusive bounds, enabling efficient representation of expressions like WHERE age BETWEEN 18 AND 65 or WHERE timestamp >= '2024-01-01'.
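For example, a BETWEEN-style predicate might be expressed as follows. This is a hedged sketch assuming the lower and upper fields hold std::ops::Bound<Literal>, matching the Filter and Operator shapes documented above.

use std::ops::Bound;

// WHERE age BETWEEN 18 AND 65, i.e. an inclusive range on both ends.
let filter = Filter {
    field_id: "age".to_string(),
    op: Operator::Range {
        lower: Bound::Included(Literal::Int128(18)),
        upper: Bound::Included(Literal::Int128(65)),
    },
};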

Sources: llkv-expr/src/expr.rs:365-428

Literal Value System

The Expression System uses Literal (defined in llkv-expr) to represent untyped constant values in expressions. These are separate from Arrow’s scalar types and provide a lightweight representation that can be coerced to concrete types during execution based on column schemas.

The PlanValue enum (defined in llkv-plan) serves a similar purpose at the plan level, with the conversion function plan_value_from_literal() bridging between the two representations.

Sources: llkv-expr/src/expr.rs:1-10 llkv-plan/src/plans.rs:73-161

graph TB
    subgraph "AggregateCall Variants"
        AggregateCall["AggregateCall&lt;F&gt;"]
CountStar["CountStar"]
Count["Count{expr, distinct}"]
Sum["Sum{expr, distinct}"]
Total["Total{expr, distinct}"]
Avg["Avg{expr, distinct}"]
Min["Min(Box&lt;ScalarExpr&gt;)"]
Max["Max(Box&lt;ScalarExpr&gt;)"]
CountNulls["CountNulls(Box&lt;ScalarExpr&gt;)"]
GroupConcat["GroupConcat{expr, distinct, separator}"]
end
    
 
   AggregateCall --> CountStar
 
   AggregateCall --> Count
 
   AggregateCall --> Sum
 
   AggregateCall --> Total
 
   AggregateCall --> Avg
 
   AggregateCall --> Min
 
   AggregateCall --> Max
 
   AggregateCall --> CountNulls
 
   AggregateCall --> GroupConcat
    
 
   Count --> ScalarExprArg["Box&lt;ScalarExpr&lt;F&gt;&gt;"]
Sum --> ScalarExprArg
 
   Total --> ScalarExprArg
 
   Avg --> ScalarExprArg
 
   GroupConcat --> ScalarExprArg

Aggregate Functions

The AggregateCall<F> enum represents aggregate function calls within scalar expressions. Unlike simple column aggregates, these operate on arbitrary expressions:

| Aggregate | Purpose | Example SQL |
| --- | --- | --- |
| CountStar | Count all rows | SELECT COUNT(*) |
| Count{expr, distinct} | Count non-null expression values | SELECT COUNT(DISTINCT user_id) |
| Sum{expr, distinct} | Sum of expression values | SELECT SUM(price * quantity) |
| Total{expr, distinct} | Sum treating NULL as 0 | SELECT TOTAL(amount) |
| Avg{expr, distinct} | Average of expression values | SELECT AVG(col1 + col2) |
| Min(expr) | Minimum value | SELECT MIN(-price) |
| Max(expr) | Maximum value | SELECT MAX(col1 * col2) |
| CountNulls(expr) | Count NULL values | SELECT COUNT_NULLS(optional_field) |
| GroupConcat{expr, distinct, separator} | Concatenate strings | SELECT GROUP_CONCAT(name, ', ') |

The key distinction is that each aggregate (except CountStar) accepts a Box<ScalarExpr<F>> rather than just a column name, enabling complex expressions like AVG(col1 + col2) or SUM(-price).
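For instance, SELECT COUNT(*) + 1 can be modeled as an aggregate nested inside a scalar expression. The sketch below writes the variants out directly, using the types documented on this page.

// COUNT(*) + 1 as a scalar expression over String field identifiers.
let expr: ScalarExpr<String> = ScalarExpr::Binary {
    left: Box::new(ScalarExpr::Aggregate(AggregateCall::CountStar)),
    op: BinaryOp::Add,
    right: Box::new(ScalarExpr::Literal(Literal::Int128(1))),
};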

Sources: llkv-expr/src/expr.rs:184-215

Subquery Integration

The Expression System provides integration points for correlated and scalar subqueries through specialized types:

SubqueryExpr

Used in boolean Expr::Exists predicates to represent EXISTS or NOT EXISTS subqueries:

Sources: llkv-expr/src/expr.rs:45-56

ScalarSubqueryExpr

Used in ScalarExpr::ScalarSubquery to represent scalar subqueries that return a single value:

Sources: llkv-expr/src/expr.rs:58-65

SubqueryId

A lightweight wrapper around u32 that uniquely identifies a subquery within a query plan:

The actual subquery definitions are stored in the parent plan structure (e.g., SelectPlan::scalar_subqueries or SelectFilter::subqueries), with expressions referencing them by ID. This design allows the same subquery to be referenced multiple times without duplication.

Sources: llkv-expr/src/expr.rs:45-65 llkv-plan/src/plans.rs:27-67

Generic Field Identifier Pattern

The generic type parameter F in Expr<'a, F>, ScalarExpr<F>, and Filter<'a, F> enables the same expression structure to work at different stages of query processing:

  1. Planning Stage : F = String - Expressions use human-readable column names from SQL
  2. Translation Stage : F = FieldId - Expressions use table-local field identifiers
  3. Execution Stage : F = LogicalFieldId - Expressions use namespace-qualified field identifiers for MVCC columns

This design allows expression trees to be built during SQL parsing, translated to internal identifiers during planning, and evaluated efficiently during execution without changing the fundamental expression structure. The translation process is documented in section Expression Translation.
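The same predicate can therefore be written against either identifier scheme without changing its shape. In this illustrative sketch, u32 stands in for a numeric field identifier.

// Planning: columns referenced by name.
let by_name: Expr<'static, String> =
    Expr::Pred(Filter { field_id: "age".to_string(), op: Operator::IsNotNull });

// Execution/testing: the same predicate with a numeric identifier.
let by_id: Expr<'static, u32> =
    Expr::Pred(Filter { field_id: 7, op: Operator::IsNotNull });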

Sources: llkv-expr/src/expr.rs:14-370

Expression Construction Helpers

Both Expr and ScalarExpr provide builder-style helper methods to simplify expression construction in code:

Boolean Expression Helpers

Sources: llkv-expr/src/expr.rs:67-123

Scalar Expression Helpers

Sources: llkv-expr/src/expr.rs:217-307

Integration with Query Plans

The Expression System integrates with query plans through several key structures defined in llkv-plan:

SelectFilter

Wraps a boolean expression with associated correlated subqueries:

FilterSubquery

Describes a correlated subquery referenced by an EXISTS predicate:

ScalarSubquery

Describes a scalar subquery referenced in a projection expression:

CorrelatedColumn

Maps placeholder column names used in subquery expressions to actual outer query columns:

Plans attach these structures to represent the full expression context, allowing executors to evaluate subqueries with proper outer row bindings.

Sources: llkv-plan/src/plans.rs:27-67

Expression System Data Flow

Sources: llkv-expr/src/expr.rs llkv-plan/src/plans.rs:27-1032

Key Design Principles

Type-Safe Generic Design

The generic field identifier pattern (F in Expr<'a, F>) provides compile-time type safety while allowing the same expression structure to work at different pipeline stages. This eliminates the need for multiple parallel expression type hierarchies.

Deferred Type Resolution

Literal values remain untyped until evaluation time, when they are coerced based on the target column’s Arrow DataType. This allows expressions like WHERE age > 18 to work correctly whether age is Int8, Int32, or Int64.

Zero-Copy Operator Patterns

The Operator::In(&'a [Literal]) variant borrows a slice of literals rather than owning a Vec, enabling stack-allocated IN lists without heap allocation for common small cases.
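A sketch of this zero-copy pattern, assuming the Filter and Operator shapes described above:

// The IN list lives on the stack; Operator::In only borrows it.
let candidates = [Literal::Int128(1), Literal::Int128(2), Literal::Int128(3)];
let filter = Filter {
    field_id: "status_code".to_string(),
    op: Operator::In(&candidates),
};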

Expression Rewriting Support

The is_trivially_true() and is_full_range_for() helper methods enable query optimizers to identify and eliminate redundant predicates without deep tree traversals.

Subquery Indirection

Subqueries are represented by SubqueryId references rather than inline nested plans, allowing the same subquery to be referenced multiple times without duplication and enabling separate optimization of inner and outer queries.

Sources: llkv-expr/src/expr.rs:1-819

Expression AST

Relevant source files

Purpose and Scope

The Expression AST defines the abstract syntax tree for all logical and arithmetic expressions in LLKV. This module provides the core data structures used to represent predicates, scalar computations, aggregations, and subqueries throughout the query processing pipeline. The AST is type-agnostic and decoupled from Arrow’s concrete scalar types, allowing expressions to be constructed, transformed, and optimized before being bound to specific table schemas.

For information about translating expressions from string column names to field identifiers, see Expression Translation. For information about compiling expressions into executable bytecode, see Program Compilation. For information about evaluating expressions against data, see Scalar Evaluation and NumericKernels.

Sources: llkv-expr/src/expr.rs:1-14

Expression System Architecture

The expression system consists of two primary AST types that serve distinct purposes in query processing:

Expr<'a, F> represents boolean-valued expressions used in WHERE clauses, HAVING clauses, and join conditions. These expressions combine logical operators (AND, OR, NOT) with predicates that test column values against literals or ranges.

graph TB
    subgraph "Boolean Predicates"
        Expr["Expr&lt;'a, F&gt;\nBoolean logic"]
Filter["Filter&lt;'a, F&gt;\nField predicate"]
Operator["Operator&lt;'a&gt;\nComparison operators"]
end
    
    subgraph "Value Expressions"
        ScalarExpr["ScalarExpr&lt;F&gt;\nArithmetic expressions"]
BinaryOp["BinaryOp\n+, -, *, /, %"]
CompareOp["CompareOp\n=, !=, &lt;, &gt;"]
AggregateCall["AggregateCall&lt;F&gt;\nCOUNT, SUM, AVG"]
end
    
    subgraph "Supporting Types"
        Literal["Literal\nType-agnostic values"]
SubqueryExpr["SubqueryExpr\nEXISTS predicate"]
ScalarSubqueryExpr["ScalarSubqueryExpr\nScalar subquery"]
end
    
 
   Expr --> Filter
 
   Filter --> Operator
 
   Expr --> ScalarExpr
 
   ScalarExpr --> BinaryOp
 
   ScalarExpr --> CompareOp
 
   ScalarExpr --> AggregateCall
 
   ScalarExpr --> Literal
 
   Expr --> SubqueryExpr
 
   ScalarExpr --> ScalarSubqueryExpr
 
   Operator --> Literal

ScalarExpr<F> represents value-producing expressions used in SELECT projections, computed columns, and UPDATE assignments. These expressions support arithmetic operations, function calls, type casts, and complex nested computations.

Both types are parameterized by a generic field identifier type F, enabling the same AST structures to be used with different column reference schemes (e.g., String names during planning, FieldId integers during execution).

Sources: llkv-expr/src/expr.rs:14-43 llkv-expr/src/expr.rs:125-182

Boolean Predicate Expressions

The Expr<'a, F> enum represents logical expressions that evaluate to boolean values. It forms the foundation for filtering operations throughout the query engine.

graph TD
    Expr["Expr&lt;'a, F&gt;"]
And["And(Vec&lt;Expr&gt;)\nLogical conjunction"]
Or["Or(Vec&lt;Expr&gt;)\nLogical disjunction"]
Not["Not(Box&lt;Expr&gt;)\nLogical negation"]
Pred["Pred(Filter)\nField predicate"]
Compare["Compare\nScalar comparison"]
InList["InList\nSet membership"]
IsNull["IsNull\nNULL test"]
Literal["Literal(bool)\nConstant boolean"]
Exists["Exists(SubqueryExpr)\nCorrelated EXISTS"]
Expr --> And
 
   Expr --> Or
 
   Expr --> Not
 
   Expr --> Pred
 
   Expr --> Compare
 
   Expr --> InList
 
   Expr --> IsNull
 
   Expr --> Literal
 
   Expr --> Exists

Expr Variants

| Variant | Purpose | Example SQL |
| --- | --- | --- |
| And(Vec<Expr>) | Logical conjunction of sub-expressions | col1 = 5 AND col2 > 10 |
| Or(Vec<Expr>) | Logical disjunction of sub-expressions | status = 'active' OR status = 'pending' |
| Not(Box<Expr>) | Logical negation | NOT (age < 18) |
| Pred(Filter) | Single-field predicate with operator | price < 100 |
| Compare | Comparison between two scalar expressions | col1 + col2 > col3 * 2 |
| InList | Set membership test | status IN ('active', 'pending') |
| IsNull | NULL test for complex expressions | (col1 + col2) IS NULL |
| Literal(bool) | Constant true/false value | WHERE true |
| Exists(SubqueryExpr) | Correlated subquery existence test | EXISTS (SELECT 1 FROM t2 WHERE ...) |

The And and Or variants accept vectors of expressions, allowing efficient representation of multi-way logical operations without deep nesting. The Pred variant wraps a Filter<'a, F> structure for simple single-column predicates, which can be efficiently pushed down to storage layer scanning operations.

Sources: llkv-expr/src/expr.rs:14-43 llkv-expr/src/expr.rs:67-123

Filter and Operator Types

The Filter<'a, F> structure encapsulates a single predicate against a field, combining a field identifier with an operator:

The Operator<'a> enum defines comparison and pattern-matching operations over untyped Literal values:

| Operator Variant | Description | Example |
| --- | --- | --- |
| Equals(Literal) | Exact equality test | status = 'active' |
| Range { lower, upper } | Bounded range test | age BETWEEN 18 AND 65 |
| GreaterThan(Literal) | Greater than comparison | price > 100.0 |
| GreaterThanOrEquals(Literal) | Greater than or equal | quantity >= 10 |
| LessThan(Literal) | Less than comparison | age < 18 |
| LessThanOrEquals(Literal) | Less than or equal | score <= 100 |
| In(&'a [Literal]) | Set membership | status IN ('active', 'pending') |
| StartsWith { pattern, case_sensitive } | Prefix match | name LIKE 'John%' |
| EndsWith { pattern, case_sensitive } | Suffix match | email LIKE '%@example.com' |
| Contains { pattern, case_sensitive } | Substring match | description LIKE '%keyword%' |
| IsNull | NULL test | email IS NULL |
| IsNotNull | Non-NULL test | email IS NOT NULL |

The In operator accepts a borrowed slice of literals to avoid allocations for small, static IN lists. The pattern-matching operators (StartsWith, EndsWith, Contains) support both case-sensitive and case-insensitive matching.
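For example, a LIKE 'John%' predicate maps onto the prefix operator roughly as follows. This is a sketch that assumes the pattern field holds an owned string.

// name LIKE 'John%', as a case-sensitive prefix match on the name column.
let filter = Filter {
    field_id: "name".to_string(),
    op: Operator::StartsWith {
        pattern: "John".to_string(),
        case_sensitive: true,
    },
};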

Sources: llkv-expr/src/expr.rs:365-428

graph TD
    ScalarExpr["ScalarExpr&lt;F&gt;"]
Column["Column(F)\nField reference"]
Literal["Literal\nConstant value"]
Binary["Binary\nArithmetic operation"]
Not["Not\nLogical negation"]
IsNull["IsNull\nNULL test"]
Aggregate["Aggregate\nAggregate function"]
GetField["GetField\nStruct field access"]
Cast["Cast\nType conversion"]
Compare["Compare\nBoolean comparison"]
Coalesce["Coalesce\nFirst non-NULL"]
ScalarSubquery["ScalarSubquery\nSubquery result"]
Case["Case\nConditional expression"]
Random["Random\nRandom number"]
ScalarExpr --> Column
 
   ScalarExpr --> Literal
 
   ScalarExpr --> Binary
 
   ScalarExpr --> Not
 
   ScalarExpr --> IsNull
 
   ScalarExpr --> Aggregate
 
   ScalarExpr --> GetField
 
   ScalarExpr --> Cast
 
   ScalarExpr --> Compare
 
   ScalarExpr --> Coalesce
 
   ScalarExpr --> ScalarSubquery
 
   ScalarExpr --> Case
 
   ScalarExpr --> Random

Scalar Value Expressions

The ScalarExpr<F> enum represents expressions that produce scalar values. These are used in SELECT projections, computed columns, ORDER BY clauses, and anywhere a value (rather than a boolean) is needed.

ScalarExpr Variants

| Variant | Purpose | Example SQL |
| --- | --- | --- |
| Column(F) | Reference to a table column | price |
| Literal(Literal) | Constant value | 42, 'hello', 3.14 |
| Binary { left, op, right } | Arithmetic or logical binary operation | price * quantity |
| Not(Box<ScalarExpr>) | Logical negation returning 1/0 | NOT active |
| IsNull { expr, negated } | NULL test returning 1/0 | col1 IS NULL |
| Aggregate(AggregateCall) | Aggregate function call | COUNT(*) + 1 |
| GetField { base, field_name } | Struct field extraction | user.address.city |
| Cast { expr, data_type } | Explicit type cast | CAST(price AS INTEGER) |
| Compare { left, op, right } | Comparison producing boolean result | price > 100 |
| Coalesce(Vec<ScalarExpr>) | First non-NULL expression | COALESCE(nickname, username) |
| ScalarSubquery(ScalarSubqueryExpr) | Scalar subquery result | (SELECT MAX(price) FROM ...) |
| Case { operand, branches, else_expr } | Conditional expression | CASE WHEN x > 0 THEN 1 ELSE -1 END |
| Random | Random number generator | RANDOM() |

The Binary variant supports arithmetic operators (+, -, *, /, %) as well as logical operators (AND, OR) and bitwise shift operators (<<, >>). The Compare variant produces boolean results (represented as 1/0 integers) from comparisons like col1 > col2.
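A sketch of (price * quantity) > 1000 built from these variants, writing the constructors out directly:

// Comparison over an arithmetic sub-expression; evaluates to 1 or 0 per row.
let expr: ScalarExpr<String> = ScalarExpr::Compare {
    left: Box::new(ScalarExpr::Binary {
        left: Box::new(ScalarExpr::Column("price".to_string())),
        op: BinaryOp::Multiply,
        right: Box::new(ScalarExpr::Column("quantity".to_string())),
    }),
    op: CompareOp::Gt,
    right: Box::new(ScalarExpr::Literal(Literal::Int128(1000))),
};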

Sources: llkv-expr/src/expr.rs:125-307

Binary and Comparison Operators

The BinaryOp enum defines arithmetic and logical binary operators:

The CompareOp enum defines comparison operators for scalar expressions:
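A sketch of the two enums' shape, reconstructed from the variant lists documented in this wiki rather than the verbatim source; derive attributes and any additional items are omitted.

// Reconstructed for illustration from the operator tables in this wiki.
pub enum BinaryOp {
    Add, Subtract, Multiply, Divide, Modulo,
    And, Or,
    BitwiseShiftLeft, BitwiseShiftRight,
}

pub enum CompareOp {
    Eq, NotEq, Lt, LtEq, Gt, GtEq,
}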

These operators enable complex arithmetic expressions like (price * quantity * (1 - discount)) > 1000 and nested logical operations like (col1 + col2) > (col3 * 2) AND status = 'active'.

Sources: llkv-expr/src/expr.rs:309-363

Aggregate Function Calls

The AggregateCall<F> enum represents aggregate function invocations within scalar expressions. Unlike simple column aggregates, these variants operate on full expressions, enabling aggregations like AVG(col1 + col2) or SUM(-price).

graph LR
    AggregateCall["AggregateCall&lt;F&gt;"]
CountStar["CountStar\nCOUNT(*)"]
Count["Count { expr, distinct }\nCOUNT(expr)"]
Sum["Sum { expr, distinct }\nSUM(expr)"]
Total["Total { expr, distinct }\nTOTAL(expr)"]
Avg["Avg { expr, distinct }\nAVG(expr)"]
Min["Min(expr)\nMIN(expr)"]
Max["Max(expr)\nMAX(expr)"]
CountNulls["CountNulls(expr)\nCOUNT_NULLS(expr)"]
GroupConcat["GroupConcat { expr, distinct, separator }\nGROUP_CONCAT(expr)"]
AggregateCall --> CountStar
 
   AggregateCall --> Count
 
   AggregateCall --> Sum
 
   AggregateCall --> Total
 
   AggregateCall --> Avg
 
   AggregateCall --> Min
 
   AggregateCall --> Max
 
   AggregateCall --> CountNulls
 
   AggregateCall --> GroupConcat

AggregateCall Variants

| Variant | SQL Equivalent | Distinct Support | Description |
| --- | --- | --- | --- |
| CountStar | COUNT(*) | No | Count all rows including NULLs |
| Count { expr, distinct } | COUNT(expr) | Yes | Count non-NULL expression values |
| Sum { expr, distinct } | SUM(expr) | Yes | Sum of expression values |
| Total { expr, distinct } | TOTAL(expr) | Yes | Sum returning 0 for empty set (SQLite semantics) |
| Avg { expr, distinct } | AVG(expr) | Yes | Average of expression values |
| Min(expr) | MIN(expr) | No | Minimum expression value |
| Max(expr) | MAX(expr) | No | Maximum expression value |
| CountNulls(expr) | N/A | No | Count NULL values in expression |
| GroupConcat { expr, distinct, separator } | GROUP_CONCAT(expr) | Yes | Concatenate values with separator |

All variants except CountStar accept a Box<ScalarExpr<F>>, allowing aggregates to operate on computed expressions. For example, SUM(price * quantity) is represented as:

AggregateCall::Sum {
    expr: Box::new(ScalarExpr::Binary {
        left: Box::new(ScalarExpr::Column("price")),
        op: BinaryOp::Multiply,
        right: Box::new(ScalarExpr::Column("quantity")),
    }),
    distinct: false,
}

Sources: llkv-expr/src/expr.rs:184-215 llkv-expr/src/expr.rs:217-307

Subquery Integration

The expression AST supports two forms of subquery integration: boolean EXISTS predicates and scalar subqueries that produce single values.

Boolean EXISTS Subqueries

The SubqueryExpr structure represents a correlated EXISTS predicate within a boolean expression:

The id field references a subquery definition stored separately in the query plan (see FilterSubquery in llkv-plan/src/plans.rs:36-45), while negated indicates whether the SQL used NOT EXISTS. The separation of subquery metadata from the expression tree allows the same subquery to be referenced multiple times without duplication.

Scalar Subqueries

The ScalarSubqueryExpr structure represents a subquery that returns a single scalar value:

graph TB
    SelectPlan["SelectPlan"]
ExprFilter["filter: Option&lt;SelectFilter&gt;"]
SelectFilter["SelectFilter"]
Predicate["predicate: Expr&lt;'static, String&gt;"]
FilterSubqueries["subqueries: Vec&lt;FilterSubquery&gt;"]
ScalarSubqueries["scalar_subqueries: Vec&lt;ScalarSubquery&gt;"]
Projections["projections: Vec&lt;SelectProjection&gt;"]
ComputedProj["Computed { expr, alias }"]
ScalarExpr["expr: ScalarExpr&lt;String&gt;"]
SelectPlan --> ExprFilter
 
   ExprFilter --> SelectFilter
 
   SelectFilter --> Predicate
 
   SelectFilter --> FilterSubqueries
    
 
   SelectPlan --> ScalarSubqueries
 
   SelectPlan --> Projections
 
   Projections --> ComputedProj
 
   ComputedProj --> ScalarExpr
    
    Predicate -.references.-> FilterSubqueries
    ScalarExpr -.references.-> ScalarSubqueries

Scalar subqueries appear in value-producing contexts like SELECT (SELECT MAX(price) FROM items) AS max_price or WHERE quantity > (SELECT AVG(quantity) FROM inventory). The data_type field captures the Arrow data type of the subquery’s output column, enabling type checking during expression compilation.

The query planner populates the subqueries field in SelectFilter and the scalar_subqueries field in SelectPlan with complete subquery definitions, while expressions reference them by SubqueryId. This architecture enables efficient subquery execution and correlation tracking during query evaluation.

Sources: llkv-expr/src/expr.rs:45-66 llkv-plan/src/plans.rs:27-67

Literal Values

Expressions reference the Literal type from the llkv-expr crate’s literal module to represent constant values. The literal system is type-agnostic, deferring concrete type resolution until expressions are evaluated against actual table schemas.

The Literal enum (defined in llkv-expr/src/literal.rs) supports:

  • Null : SQL NULL value
  • Int128(i128) : Integer literals (wide representation to handle all integer sizes)
  • Float64(f64) : Floating-point literals
  • Decimal128(DecimalValue) : Fixed-precision decimal literals
  • String(String) : Text literals
  • Boolean(bool) : True/false literals
  • Date32(i32) : Date literals (days since Unix epoch)
  • Struct(Vec<(String, Literal)>) : Structured data literals
  • Interval(IntervalValue) : Time interval literals

The use of Int128 for integer literals allows the expression system to represent values that may exceed i64 range during intermediate computations, with overflow checks deferred to evaluation time. The Decimal128 variant uses the DecimalValue type from llkv-types to maintain precision and scale metadata.

Literals can be constructed through From trait implementations, enabling ergonomic expression building:
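A sketch of such construction, assuming From implementations exist for common primitive types; the exact set of impls is not listed here, so the conversions shown are hypothetical.

// Hypothetical conversions relying on assumed From/Into impls for primitives.
let int_lit: Literal = 42i64.into();        // expected to become Literal::Int128(42)
let float_lit: Literal = 3.14f64.into();    // expected to become Literal::Float64(3.14)
let text_lit: Literal = "hello".into();     // expected to become Literal::String("hello".to_string())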

Sources: llkv-expr/src/expr.rs:10

Type Parameterization and Field References

Both Expr<'a, F> and ScalarExpr<F> are parameterized by a generic type F that represents field identifiers. This design enables the same AST structures to be used at different stages of query processing with different column reference schemes:

graph LR
    subgraph "Planning Stage"
        ExprString["Expr&lt;'static, String&gt;\nColumn names"]
ScalarString["ScalarExpr&lt;String&gt;\nColumn names"]
end
    
    subgraph "Translation Stage"
        Translator["Expression Translator"]
end
    
    subgraph "Execution Stage"
        ExprFieldId["Expr&lt;'static, FieldId&gt;\nNumeric IDs"]
ScalarFieldId["ScalarExpr&lt;FieldId&gt;\nNumeric IDs"]
end
    
 
   ExprString --> Translator
 
   ScalarString --> Translator
 
   Translator --> ExprFieldId
 
   Translator --> ScalarFieldId

Common Type Instantiations

| Stage | Field Type | Use Case | Example Types |
| --- | --- | --- | --- |
| Planning | String | SQL parsing and plan construction | Expr<'static, String> |
| Execution | FieldId | Table scanning and evaluation | Expr<'static, FieldId> |
| Testing | u32 or &str | Lightweight tests without full schema | Expr<'static, u32> |

The lifetime parameter 'a in Expr<'a, F> represents the lifetime of borrowed data in Operator::In(&'a [Literal]), allowing filter expressions to reference static IN lists without heap allocation. Most expressions use 'static lifetime, indicating no borrowed data.

Field Reference Translation

The expression translation system (see Expression Translation) converts expressions from string-based column names to numeric FieldId identifiers by walking the AST and resolving names against table schemas. This transformation is represented by the generic parameter substitution:

Expr<'static, String>  →  Expr<'static, FieldId>
ScalarExpr<String>     →  ScalarExpr<FieldId>

The generic design allows the same expression evaluation logic to work with both naming schemes without code duplication, while maintaining type safety at compile time.

Sources: llkv-expr/src/expr.rs:14-21 llkv-expr/src/expr.rs:125-134 llkv-expr/src/expr.rs:365-370

Expression Construction Helpers

Both Expr and ScalarExpr provide builder methods for ergonomic AST construction:

Expr Helpers

ScalarExpr Helpers

These helpers simplify expression construction in query planners and translators by providing clearer semantics than direct enum variant construction.

Sources: llkv-expr/src/expr.rs:67-86 llkv-expr/src/expr.rs:217-307

Expression Analysis Methods

The Expr type provides utility methods for analyzing expression structure and optimizing query execution:

The is_trivially_true() method identifies expressions that cannot filter out any rows, such as:

  • Unbounded range filters: Operator::Range { lower: Unbounded, upper: Unbounded }
  • Literal true values: Expr::Literal(true)

Scan executors use this method to skip predicate evaluation entirely when the filter is guaranteed to match all rows, avoiding unnecessary computation.
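A sketch of how a scan path might use this check:

// If the predicate can never reject a row, skip per-row evaluation entirely.
if predicate.is_trivially_true() {
    // emit every row in the chunk without evaluating the filter
} else {
    // compile and evaluate the predicate as usual
}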

Sources: llkv-expr/src/expr.rs:87-123

graph TB
    subgraph "Query Plans"
        SelectPlan["SelectPlan"]
InsertPlan["InsertPlan"]
UpdatePlan["UpdatePlan"]
DeletePlan["DeletePlan"]
end
    
    subgraph "Expression Usage"
        FilterUsage["filter: Option&lt;SelectFilter&gt;\nWHERE clauses"]
HavingUsage["having: Option&lt;Expr&gt;\nHAVING clauses"]
ProjUsage["projections with ScalarExpr\nSELECT list"]
UpdateExpr["assignments with ScalarExpr\nSET clauses"]
JoinOn["on_condition in JoinMetadata\nJOIN ON clauses"]
end
    
 
   SelectPlan --> FilterUsage
 
   SelectPlan --> HavingUsage
 
   SelectPlan --> ProjUsage
 
   SelectPlan --> JoinOn
    
 
   UpdatePlan --> UpdateExpr
 
   UpdatePlan --> FilterUsage
    
 
   DeletePlan --> FilterUsage

Integration with Query Plans

Expressions integrate deeply with the query planning system defined in llkv-plan:

Expression Usage in Plans

| Plan Type | Expression Field | Expression Type | Purpose |
| --- | --- | --- | --- |
| SelectPlan | filter | SelectFilter with Expr<'static, String> | WHERE clause filtering |
| SelectPlan | having | Expr<'static, String> | HAVING clause filtering |
| SelectPlan | projections | SelectProjection::Computed with ScalarExpr<String> | Computed columns |
| SelectPlan | joins[].on_condition | Expr<'static, String> | JOIN ON conditions |
| UpdatePlan | filter | Expr<'static, String> | WHERE clause for updates |
| UpdatePlan | assignments[].value | AssignmentValue::Expression with ScalarExpr<String> | SET clause expressions |
| DeletePlan | filter | Expr<'static, String> | WHERE clause for deletes |

All plan-level expressions use String as the field identifier type since plans are constructed during SQL parsing before schema resolution. The query executor translates these to FieldId-based expressions before evaluation.

Sources: llkv-plan/src/plans.rs:27-34 llkv-plan/src/plans.rs:660-692 llkv-plan/src/plans.rs:794-828

Expression Translation

Relevant source files

Purpose and Scope

Expression Translation is the process of converting expressions that reference columns by name (as strings) into expressions that reference columns by numeric field identifiers (FieldId). This translation bridges the gap between the SQL parsing/planning layer—which operates on human-readable column names—and the execution layer, which requires efficient numeric field identifiers for accessing columnar storage.

This page documents the translation mechanisms, key functions, and integration points. For information about the expression AST types themselves, see Expression AST. For details on how translated expressions are compiled into executable programs, see Program Compilation.

Sources: llkv-expr/src/expr.rs:1-819 llkv-executor/src/lib.rs:87-97


The Parameterized Expression Type System

The LLKV expression system uses generic type parameters to support multiple identifier types throughout the query processing pipeline. All expression types are parameterized over a field identifier type F:

| Expression Type | Description | Parameter |
| --- | --- | --- |
| Expr<'a, F> | Boolean predicate expression | Field identifier type F |
| ScalarExpr<F> | Arithmetic/scalar expression | Field identifier type F |
| Filter<'a, F> | Single-field predicate | Field identifier type F |

The parameterization allows the same expression structures to be used with different identifier representations:

  • During Planning : Expr<'static, String> and ScalarExpr<String> use column names as parsed from SQL
  • During Execution : Expr<'static, FieldId> and ScalarExpr<FieldId> use numeric field identifiers for efficient storage access
graph TD
    subgraph "SQL Parsing Layer"
        SQL["SQL Query Text"]
PARSER["sqlparser"]
AST["SQL AST"]
end
    
    subgraph "Planning Layer"
        PLANNER["Query Planner"]
EXPR_STRING["Expr&lt;String&gt;\nScalarExpr&lt;String&gt;"]
PLAN["SelectPlan"]
end
    
    subgraph "Translation Layer"
        TRANSLATOR["translate_scalar\ntranslate_predicate"]
SCHEMA["Schema / Catalog"]
RESOLVER["IdentifierResolver"]
end
    
    subgraph "Execution Layer"
        EXPR_FIELDID["Expr&lt;FieldId&gt;\nScalarExpr&lt;FieldId&gt;"]
EVALUATOR["Expression Evaluator"]
STORAGE["Column Store"]
end
    
 
   SQL --> PARSER
 
   PARSER --> AST
 
   AST --> PLANNER
 
   PLANNER --> EXPR_STRING
 
   EXPR_STRING --> PLAN
    
 
   PLAN --> TRANSLATOR
 
   SCHEMA --> TRANSLATOR
 
   RESOLVER --> TRANSLATOR
 
   TRANSLATOR --> EXPR_FIELDID
    
 
   EXPR_FIELDID --> EVALUATOR
 
   EVALUATOR --> STORAGE

This design separates concerns: the planner manipulates human-readable names without needing catalog knowledge, while the executor works with resolved numeric identifiers that map directly to physical storage locations.

Diagram: Expression Translation Flow from SQL to Execution

Sources: llkv-expr/src/expr.rs:14-182 llkv-executor/src/lib.rs:87-97


Core Translation Functions

The translation layer exposes a set of functions for converting string-based expressions to field ID-based expressions. These functions are defined in the llkv-plan crate’s translation module and re-exported by llkv-executor for convenience.

Primary Translation Functions

| Function | Purpose | Signature Pattern |
| --- | --- | --- |
| translate_scalar | Translate scalar expressions | (expr: &ScalarExpr<String>, schema, error_fn) -> Result<ScalarExpr<FieldId>> |
| translate_scalar_with | Translate with custom resolver | (expr: &ScalarExpr<String>, resolver, error_fn) -> Result<ScalarExpr<FieldId>> |
| translate_predicate | Translate filter predicates | (expr: &Expr<String>, schema, error_fn) -> Result<Expr<FieldId>> |
| translate_predicate_with | Translate predicate with resolver | (expr: &Expr<String>, resolver, error_fn) -> Result<Expr<FieldId>> |
| resolve_field_id_from_schema | Resolve single column name | (name: &str, schema) -> Result<FieldId> |

The _with variants accept an IdentifierResolver reference for more complex scenarios (multi-table queries, subqueries, etc.), while the simpler variants accept a schema directly and construct a resolver internally.

Usage Pattern

Translation functions follow a consistent pattern: they take a string-based expression, schema/resolver information, and an error handler closure. The error handler is invoked when a column name cannot be resolved, allowing callers to customize error messages:

The error closure receives the unresolved column name and returns an appropriate error type. This pattern appears throughout the executor when translating expressions from plans:
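A sketch of the calling convention implied by the signature patterns above. The plan_filter, table_schema, and make_error names are hypothetical placeholders; the real error type lives in llkv-result and is not spelled out here.

// WHERE-clause translation: column names to field IDs, with a custom error message.
let translated = translate_predicate(
    &plan_filter,            // Expr<'static, String> taken from the plan
    table_schema.as_ref(),   // Arrow schema carrying field-id metadata
    |name| make_error(format!("unknown column '{name}' in filter")), // hypothetical error helper
)?;
// `translated` is now an Expr<'static, FieldId> ready for compilation.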

Sources: llkv-executor/src/lib.rs:87-97 llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059


Schema-Based Resolution

Column name resolution relies on Arrow schema information to map string identifiers to numeric field IDs. The resolution process handles case-insensitive matching and validates that referenced columns actually exist in the schema.

graph LR
    subgraph "Input"
        EXPR_STR["ScalarExpr&lt;String&gt;"]
COLUMN_NAME["Column Name: 'user_id'"]
end
    
    subgraph "Resolution Context"
        SCHEMA["Arrow Schema"]
FIELDS["Field Definitions"]
METADATA["Field Metadata"]
end
    
    subgraph "Resolution Process"
        NORMALIZE["Normalize Name\n(case-insensitive)"]
LOOKUP["Lookup in Schema"]
EXTRACT_ID["Extract FieldId"]
end
    
    subgraph "Output"
        EXPR_FIELD["ScalarExpr&lt;FieldId&gt;"]
FIELD_ID["FieldId: 42"]
end
    
 
   COLUMN_NAME --> NORMALIZE
 
   SCHEMA --> LOOKUP
 
   NORMALIZE --> LOOKUP
 
   LOOKUP --> EXTRACT_ID
 
   EXTRACT_ID --> FIELD_ID
    
 
   EXPR_STR --> NORMALIZE
 
   EXTRACT_ID --> EXPR_FIELD

Resolution Workflow

Diagram: Column Name to FieldId Resolution

The resolve_field_id_from_schema function performs the core resolution logic. It searches the schema’s field definitions for a matching column name and extracts the associated field ID from the field’s metadata.
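A minimal sketch of that lookup using the arrow-rs Schema API. The "field_id" metadata key and the FieldId alias are assumptions made for illustration, not the actual constants used by llkv-table.

use arrow::datatypes::Schema;

type FieldId = u32; // assumed numeric identifier type

// Case-insensitive name lookup that pulls the field ID out of field metadata.
fn resolve_field_id_sketch(schema: &Schema, name: &str) -> Option<FieldId> {
    schema
        .fields()
        .iter()
        .find(|field| field.name().eq_ignore_ascii_case(name))
        .and_then(|field| field.metadata().get("field_id")) // assumed metadata key
        .and_then(|value| value.parse::<FieldId>().ok())
}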

Schema Structure

Arrow schemas used during translation contain:

  • Field Definitions : Name, data type, nullability
  • Field Metadata : Key-value pairs including the numeric field ID
  • Nested Field Support : For struct types, schemas may contain nested field hierarchies

The translation process must handle qualified names (e.g., table.column), nested field access (e.g., user.address.city), and alias resolution when applicable.

Sources: llkv-executor/src/lib.rs:87-97


Field Path Resolution for Nested Fields

When expressions reference nested fields within struct types, the translation process must resolve not just the top-level column but the entire field path. This is handled through the IdentifierResolver and ColumnResolution types provided by llkv-table/catalog.

graph TD
    subgraph "Input Expression"
        NESTED["GetField Expression"]
BASE["base: user"]
FIELD["field_name: 'address'"]
SUBFIELD["field_name: 'city'"]
end
    
    subgraph "Resolver"
        RESOLVER["IdentifierResolver"]
CONTEXT["IdentifierContext"]
end
    
    subgraph "Resolution Result"
        COL_RES["ColumnResolution"]
COL_NAME["column(): 'user'"]
FIELD_PATH["field_path(): ['address', 'city']"]
FIELD_ID["Resolved FieldId"]
end
    
 
   NESTED --> RESOLVER
 
   CONTEXT --> RESOLVER
 
   RESOLVER --> COL_RES
 
   COL_RES --> COL_NAME
 
   COL_RES --> FIELD_PATH
 
   COL_RES --> FIELD_ID

ColumnResolution Structure

Diagram: Nested Field Resolution

The ColumnResolution type encapsulates the resolution result, providing:

  • The base column name
  • The field path for nested access (empty for top-level columns)
  • The resolved field ID for storage access

This information is used during correlated subquery tracking and when translating GetField expressions in the scalar expression tree.

Sources: llkv-sql/src/sql_engine.rs:37-38 llkv-sql/src/sql_engine.rs:420-427


Translation in Multi-Table Contexts

When translating expressions for queries involving multiple tables (joins, cross products, subqueries), the translation process must disambiguate column references that may appear in multiple tables. This is handled by the IdentifierResolver which maintains context about available tables and their schemas.

IdentifierContext and Resolution

The IdentifierContext type (from llkv-table/catalog) represents the set of tables and columns available in a given scope. During translation:

  1. Outer Scope Tracking : For subqueries, outer table contexts are tracked separately
  2. Column Disambiguation : Qualified names (e.g., table.column) are resolved against the appropriate table
  3. Ambiguity Detection : Unqualified references to columns that exist in multiple tables produce errors

The translate_predicate_with and translate_scalar_with functions accept an IdentifierResolver reference that encapsulates this context:

Sources: llkv-sql/src/sql_engine.rs:37-38


Error Handling and Diagnostics

Translation failures occur when column names cannot be resolved. The error handling strategy uses caller-provided closures to generate context-specific error messages.

Error Patterns

| Scenario | Error Message Pattern |
| --- | --- |
| Unknown column in aggregate | "unknown column '{name}' in aggregate expression" |
| Unknown column in WHERE clause | "unknown column '{name}' in filter" |
| Unknown column in cross product | "column '{name}' not found in cross product result" |
| Ambiguous column reference | "column '{name}' is ambiguous" |

The error closure pattern allows the caller to include query-specific context in error messages. This is particularly important for debugging complex queries where the same expression type might be used in multiple contexts.

Resolution Failure Example

When translate_scalar encounters a ScalarExpr::Column(name) variant and the name cannot be found in the schema, it invokes the error closure:

Sources: llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059


graph TB
    subgraph "Planning Phase"
        SQL["SQL Statement"]
PARSE["Parse & Build Plan"]
PLAN["SelectPlan\nUpdatePlan\netc."]
EXPR_STR["Expressions with\nString identifiers"]
end
    
    subgraph "Execution Preparation"
        GET_TABLE["Get Table Handle"]
SCHEMA_FETCH["Fetch Schema"]
TRANSLATE["Translation Functions"]
EXPR_FIELD["Expressions with\nFieldId identifiers"]
end
    
    subgraph "Execution Phase"
        BUILD_SCAN["Build ScanProjection"]
COMPILE["Compile to EvalProgram"]
EVALUATE["Evaluate Against Batches"]
RESULTS["RecordBatch Results"]
end
    
 
   SQL --> PARSE
 
   PARSE --> PLAN
 
   PLAN --> EXPR_STR
    
 
   PLAN --> GET_TABLE
 
   GET_TABLE --> SCHEMA_FETCH
 
   SCHEMA_FETCH --> TRANSLATE
 
   EXPR_STR --> TRANSLATE
 
   TRANSLATE --> EXPR_FIELD
    
 
   EXPR_FIELD --> BUILD_SCAN
 
   BUILD_SCAN --> COMPILE
 
   COMPILE --> EVALUATE
 
   EVALUATE --> RESULTS

Integration with Query Execution Pipeline

Expression translation occurs at the boundary between planning and execution. Plans produced by the SQL layer contain string-based expressions, which are translated as execution structures are built.

Translation Points in Execution

Diagram: Translation in the Execution Pipeline

Key Translation Points

  1. Filter Translation : When building scan plans, WHERE clause expressions are translated before being passed to the scan optimizer
  2. Projection Translation : Computed columns in SELECT projections are translated before evaluation
  3. Aggregate Translation : Aggregate function arguments are translated to resolve column references
  4. Join Condition Translation : ON clause expressions for joins are translated in the context of both joined tables

The executor’s ensure_computed_projection function demonstrates this integration. It translates a string-based expression, infers its result data type, and registers it as a computed projection for the scan:

This function encapsulates the full translation workflow: resolve column names, infer types, and prepare the translated expression for execution.

Sources: llkv-executor/src/lib.rs:470-501 llkv-executor/src/lib.rs:87-97


Translation of Complex Expression Types

The translation process must handle all variants of the expression AST, recursively translating nested expressions while preserving structure and semantics.

Recursive Translation Table

| Expression Variant | Translation Strategy |
| --- | --- |
| ScalarExpr::Column | Resolve string to FieldId via schema |
| ScalarExpr::Literal | No translation needed (no field references) |
| ScalarExpr::Binary | Recursively translate left and right operands |
| ScalarExpr::Aggregate | Translate the aggregate's argument expression |
| ScalarExpr::GetField | Translate base expression, preserve field name |
| ScalarExpr::Cast | Translate inner expression, preserve target type |
| ScalarExpr::Compare | Recursively translate both comparison operands |
| ScalarExpr::Coalesce | Translate each expression in the list |
| ScalarExpr::Case | Translate operand and all WHEN/THEN/ELSE branches |
| ScalarExpr::ScalarSubquery | No translation (contains SubqueryId reference) |
| ScalarExpr::Random | No translation (no field references) |

For predicate expressions (Expr<F>):

| Predicate Variant | Translation Strategy |
| --- | --- |
| Expr::And / Expr::Or | Recursively translate all sub-expressions |
| Expr::Not | Recursively translate inner expression |
| Expr::Pred(Filter) | Translate filter's field ID, preserve operator |
| Expr::Compare | Translate left and right scalar expressions |
| Expr::InList | Translate target expression and list elements |
| Expr::IsNull | Translate the operand expression |
| Expr::Literal | No translation (constant boolean value) |
| Expr::Exists | No translation (contains SubqueryId reference) |

The translation process maintains the expression tree structure while substituting field identifiers, ensuring that evaluation semantics remain unchanged.
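The recursive shape of this walk can be sketched as follows. It is a simplified illustration covering only a few variants; it assumes Literal and BinaryOp are cloneable and reuses the hypothetical resolve_field_id_sketch helper from the previous section.

// Translate a handful of ScalarExpr variants from String names to FieldIds.
fn translate_scalar_sketch(
    expr: &ScalarExpr<String>,
    schema: &Schema,
) -> Result<ScalarExpr<FieldId>, String> {
    match expr {
        ScalarExpr::Column(name) => resolve_field_id_sketch(schema, name)
            .map(ScalarExpr::Column)
            .ok_or_else(|| format!("unknown column '{name}'")),
        ScalarExpr::Literal(lit) => Ok(ScalarExpr::Literal(lit.clone())),
        ScalarExpr::Binary { left, op, right } => Ok(ScalarExpr::Binary {
            left: Box::new(translate_scalar_sketch(left, schema)?),
            op: op.clone(),
            right: Box::new(translate_scalar_sketch(right, schema)?),
        }),
        // Remaining variants follow the same pattern: recurse into children,
        // leave literals and subquery IDs untouched.
        _ => Err("variant not covered in this sketch".to_string()),
    }
}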

Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:14-66


Performance Considerations

Expression translation is performed once during query execution setup, not per-row or per-batch. The translated expressions are then compiled into evaluation programs (see Program Compilation) which are reused across all batches in the query result.

Translation Caching

The executor maintains caches to avoid redundant translation work:

  • Computed Projection Cache : Stores translated expressions keyed by their string representation to avoid re-translating identical expressions in the same query
  • Column Projection Cache : Maps field IDs to projection indices to reuse existing projections when multiple expressions reference the same column

This caching strategy is evident in functions like ensure_computed_projection, which checks the cache before performing translation:

Sources: llkv-executor/src/lib.rs:470-501

Program Compilation

Relevant source files

Purpose and Scope

Program compilation transforms expression ASTs into executable bytecode optimized for evaluation. This intermediate representation enables efficient vectorized operations and simplifies the runtime evaluation engine. The compilation phase bridges the gap between high-level expression trees and low-level execution kernels.

This page covers the compilation of ScalarExpr and filter Expr trees into EvalProgram and DomainProgram bytecode respectively. For information about the expression AST structure, see Expression AST. For how compiled programs are evaluated, see Scalar Evaluation and NumericKernels.

Compilation Pipeline Overview

Sources: llkv-expr/src/expr.rs:1-819

Compilation Targets

The compilation system produces two distinct bytecode formats depending on the expression context:

| Program Type | Input AST | Purpose | Output |
| --- | --- | --- | --- |
| EvalProgram | ScalarExpr<FieldId> | Compute scalar values per row | Arrow array of computed values |
| DomainProgram | Expr<FieldId> | Evaluate boolean predicates | Bitmap of matching row IDs |

EvalProgram Structure

EvalProgram compiles scalar expressions into a stack-based bytecode suitable for vectorized evaluation. Each instruction operates on Arrow arrays, producing intermediate or final results:

Sources: llkv-expr/src/expr.rs:125-307
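To illustrate the stack discipline (not the actual EvalProgram instruction set), a toy post-order program over scalar values might be evaluated like this; the real bytecode applies each instruction to whole Arrow arrays instead of single values.

// A toy, scalar-only illustration of post-order (stack-based) evaluation.
enum ToyInstr {
    LoadConst(f64),
    LoadColumn(usize), // index into the current row
    Add,
    Multiply,
}

fn eval_toy(program: &[ToyInstr], row: &[f64]) -> Option<f64> {
    let mut stack: Vec<f64> = Vec::new();
    for instr in program {
        match instr {
            ToyInstr::LoadConst(v) => stack.push(*v),
            ToyInstr::LoadColumn(i) => stack.push(*row.get(*i)?),
            ToyInstr::Add => {
                let (b, a) = (stack.pop()?, stack.pop()?);
                stack.push(a + b);
            }
            ToyInstr::Multiply => {
                let (b, a) = (stack.pop()?, stack.pop()?);
                stack.push(a * b);
            }
        }
    }
    stack.pop()
}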

flowchart TB
    subgraph ScalarInput["Scalar Expression Tree"]
ROOT["Binary: Add"]
LEFT["Binary: Multiply"]
RIGHT["Cast: Float64"]
COL1["Column: field_id=1"]
LIT1["Literal: 10"]
COL2["Column: field_id=2"]
end
    
 
   ROOT --> LEFT
 
   ROOT --> RIGHT
 
   LEFT --> COL1
 
   LEFT --> LIT1
 
   RIGHT --> COL2
    
    subgraph Bytecode["EvalProgram Bytecode"]
I1["LOAD_COLUMN field_id=1"]
I2["LOAD_LITERAL 10"]
I3["MULTIPLY"]
I4["LOAD_COLUMN field_id=2"]
I5["CAST Float64"]
I6["ADD"]
end
    
 
   I1 --> I2
 
   I2 --> I3
 
   I3 --> I4
 
   I4 --> I5
 
   I5 --> I6
    
    subgraph Execution["Execution"]
STACK["Evaluation Stack\nArray-based operations"]
RESULT["Result Array"]
end
    
 
   I6 --> STACK
 
   STACK --> RESULT

DomainProgram Structure

DomainProgram compiles filter predicates into bytecode optimized for boolean evaluation and row filtering. The program operates on column metadata and data chunks to identify matching rows:

Sources: llkv-expr/src/expr.rs:14-123 llkv-expr/src/expr.rs:365-428

flowchart TB
    subgraph FilterInput["Filter Expression Tree"]
AND["And"]
PRED1["Pred: field_id=1 > 100"]
PRED2["Pred: field_id=2 IN values"]
NOT["Not"]
PRED3["Pred: field_id=3 LIKE pattern"]
end
    
 
   AND --> PRED1
 
   AND --> NOT
 
   NOT --> PRED2
 
   AND --> PRED3
    
    subgraph DomainBytecode["DomainProgram Bytecode"]
D1["EVAL_PRED field_id=1\nGreaterThan 100"]
D2["EVAL_PRED field_id=2\nIn values"]
D3["NEGATE"]
D4["EVAL_PRED field_id=3\nContains pattern"]
D5["AND_ALL"]
end
    
 
   D1 --> D2
 
   D2 --> D3
 
   D3 --> D4
 
   D4 --> D5
    
    subgraph BitmapOps["Bitmap Operations"]
B1["Bitmap: field_id=1 matches"]
B2["Bitmap: field_id=2 matches"]
B3["Bitmap: NOT operation"]
B4["Bitmap: field_id=3 matches"]
B5["Bitmap: AND operation"]
FINAL["Final: matching row IDs"]
end
    
 
   D5 --> B1
 
   B1 --> B2
 
   B2 --> B3
 
   B3 --> B4
 
   B4 --> B5
 
   B5 --> FINAL

Expression Analysis and Type Inference

Before bytecode generation, the compiler analyzes the expression tree to:

  1. Resolve data types - Each expression node’s output type is inferred from its inputs and operation
  2. Validate operations - Ensure type compatibility for binary operations and function calls
  3. Track dependencies - Identify which columns and subqueries the expression requires
  4. Detect constant subexpressions - Find opportunities for constant folding

Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:309-363

Compilation Optimizations

The compilation phase applies several optimization passes to improve evaluation performance:

Constant Folding

Expressions involving only literal values are evaluated at compile time:

| Original Expression | Optimized Form |
| --- | --- |
| Literal(2) + Literal(3) | Literal(5) |
| Literal(10) * Literal(0) | Literal(0) |
| Cast(Literal("123"), Int64) | Literal(123) |
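A sketch of the folding rule for integer addition, shown in isolation. It assumes a numeric FieldId alias, ignores overflow handling, and covers only the Add operator; a real pass would handle all operators listed above.

// Fold Literal + Literal into a single Literal; recurse into children first.
fn fold_add(expr: ScalarExpr<FieldId>) -> ScalarExpr<FieldId> {
    match expr {
        ScalarExpr::Binary { left, op, right } => {
            let (left, right) = (fold_add(*left), fold_add(*right));
            match (left, right) {
                (ScalarExpr::Literal(Literal::Int128(a)), ScalarExpr::Literal(Literal::Int128(b)))
                    if matches!(&op, BinaryOp::Add) =>
                {
                    ScalarExpr::Literal(Literal::Int128(a + b))
                }
                (l, r) => ScalarExpr::Binary { left: Box::new(l), op, right: Box::new(r) },
            }
        }
        other => other,
    }
}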

Expression Simplification

Algebraic identities and boolean logic simplifications reduce instruction count:

| Pattern | Simplified To |
| --- | --- |
| x + 0 | x |
| x * 1 | x |
| x * 0 | 0 |
| And([true, expr]) | expr |
| Or([false, expr]) | expr |
| Not(Not(expr)) | expr |

Dead Code Elimination

Unreachable code paths in Case expressions are removed:

Sources: llkv-expr/src/expr.rs:169-176

Bytecode Generation

After optimization, the compiler generates bytecode instructions by walking the expression tree in post-order (depth-first). Each expression variant maps to one or more bytecode instructions:

Scalar Expression Instruction Mapping

| Expression Variant | Generated Instructions |
| --- | --- |
| Column(field_id) | LOAD_COLUMN field_id |
| Literal(value) | LOAD_LITERAL value |
| Binary{left, op, right} | Compile left → Compile right → BINARY_OP op |
| Cast{expr, data_type} | Compile expr → CAST data_type |
| Compare{left, op, right} | Compile left → Compile right → COMPARE op |
| Coalesce(exprs) | Compile all exprs → COALESCE count |
| Case{operand, branches, else_expr} | Complex multi-instruction sequence with jump tables |
| Random | RANDOM_F64 |

Sources: llkv-expr/src/expr.rs:125-182

Filter Expression Instruction Mapping

| Expression Variant | Generated Instructions |
| --- | --- |
| Pred(Filter{field_id, op}) | EVAL_PREDICATE field_id op |
| And(exprs) | Compile all exprs → AND_ALL count |
| Or(exprs) | Compile all exprs → OR_ALL count |
| Not(expr) | Compile expr → NEGATE |
| Compare{left, right, op} | Compile as scalar → TO_BOOLEAN |
| InList{expr, list, negated} | Compile expr → Build lookup table → IN_SET negated |
| IsNull{expr, negated} | Compile expr → IS_NULL negated |
| Literal(bool) | LOAD_BOOL value |
| Exists(subquery) | EVAL_SUBQUERY subquery_id |

Sources: llkv-expr/src/expr.rs:14-56

Subquery Handling

Correlated subqueries require special compilation handling. The compiler generates placeholder references that are resolved during execution:

Sources: llkv-expr/src/expr.rs:45-66

flowchart TB
    subgraph Outer["Outer Query Expression"]
FILTER["Filter with EXISTS"]
SUBQUERY["Subquery: SubqueryId=1\nCorrelated columns"]
end
    
    subgraph Compilation["Compilation Strategy"]
PLACEHOLDER["Generate EVAL_SUBQUERY\ninstruction with ID"]
CORRELATION["Track correlated\ncolumn mappings"]
DEFER["Defer actual subquery\ncompilation to executor"]
end
    
    subgraph Execution["Runtime Resolution"]
BIND["Bind outer row values\nto correlated placeholders"]
EVAL_SUB["Execute subquery plan\nwith bound values"]
COLLECT["Collect subquery\nresult set"]
BOOLEAN["Convert to boolean\nfor EXISTS/IN"]
end
    
 
   FILTER --> SUBQUERY
 
   SUBQUERY --> PLACEHOLDER
 
   PLACEHOLDER --> CORRELATION
 
   CORRELATION --> DEFER
    
 
   DEFER --> BIND
 
   BIND --> EVAL_SUB
 
   EVAL_SUB --> COLLECT
 
   COLLECT --> BOOLEAN

Aggregate Function Compilation

Aggregate expressions within scalar contexts (e.g., COUNT(*) + 1) compile to instructions that reference pre-computed aggregate results:

Sources: llkv-expr/src/expr.rs:184-215

flowchart LR
    subgraph AggExpr["Aggregate Expression"]
AGG_CALL["AggregateCall\nCountStar / Sum / Avg"]
end
    
    subgraph Compilation["Compilation"]
REF["Generate AGG_REFERENCE\ninstruction"]
METADATA["Store aggregate metadata\nfunction type, distinct flag"]
end
    
    subgraph PreExecution["Pre-Execution Phase"]
COMPUTE["Executor computes\naggregate values"]
STORE["Store in aggregate\nresult table"]
end
    
    subgraph Evaluation["Expression Evaluation"]
LOOKUP["AGG_REFERENCE instruction\nlooks up pre-computed value"]
BROADCAST["Broadcast scalar result\nto array length"]
CONTINUE["Continue with remaining\nexpression operations"]
end
    
 
   AGG_CALL --> REF
 
   REF --> METADATA
 
   METADATA --> COMPUTE
 
   COMPUTE --> STORE
 
   STORE --> LOOKUP
 
   LOOKUP --> BROADCAST
 
   BROADCAST --> CONTINUE

Integration with Execution Layer

Compiled programs are executed by the compute kernels layer, which provides vectorized implementations of each bytecode instruction:

Sources: llkv-expr/src/expr.rs:1-819

flowchart TB
    subgraph Programs["Compiled Programs"]
EP["EvalProgram"]
DP["DomainProgram"]
end
    
    subgraph ComputeLayer["llkv-compute Kernels"]
ARITHMETIC["Arithmetic Kernels\nadd_arrays, multiply_arrays"]
COMPARISON["Comparison Kernels\ngt_arrays, eq_arrays"]
CAST_K["Cast Kernels\ncast_array"]
LOGICAL["Logical Kernels\nand_bitmaps, or_bitmaps"]
STRING["String Kernels\ncontains, starts_with"]
end
    
    subgraph Execution["Execution Context"]
BATCH["Input RecordBatch\nColumn arrays"]
STACK["Evaluation Stack"]
BITMAP["Bitmap Accumulator"]
end
    
    subgraph Results["Evaluation Results"]
SCALAR_OUT["Array of computed\nscalar values"]
FILTER_OUT["Bitmap of matching\nrow IDs"]
end
    
 
   EP --> ARITHMETIC
 
   EP --> COMPARISON
 
   EP --> CAST_K
    
 
   DP --> COMPARISON
 
   DP --> LOGICAL
 
   DP --> STRING
    
 
   ARITHMETIC --> STACK
 
   COMPARISON --> STACK
 
   CAST_K --> STACK
    
 
   LOGICAL --> BITMAP
 
   STRING --> BITMAP
    
 
   BATCH --> STACK
 
   BATCH --> BITMAP
    
 
   STACK --> SCALAR_OUT
 
   BITMAP --> FILTER_OUT

Compilation Performance Considerations

The compilation phase is designed to amortize its cost over repeated evaluations:

| Scenario | Compilation Strategy |
| --- | --- |
| One-time query | Compile on demand, minimal optimization |
| Repeated query | Compile once, cache bytecode, reuse across invocations |
| Prepared statement | Pre-compile at preparation time, execute many times |
| Table scan filter | Compile predicate once, apply to all batches |
| Aggregation | Compile aggregate expressions, evaluate per group |

Compilation Cache Strategy

Sources: llkv-expr/src/expr.rs:1-819

Summary

The compilation phase transforms high-level expression ASTs into efficient bytecode programs optimized for vectorized execution. By separating compilation from evaluation, the system achieves:

  • Performance : Bytecode enables efficient stack-based evaluation with Arrow kernels
  • Reusability : Compiled programs can be cached and reused across query invocations
  • Optimization : Multiple optimization passes improve runtime performance
  • Type Safety : Type inference and validation occur during compilation, not evaluation
  • Maintainability : Clear separation between compilation and execution concerns

The compiled EvalProgram and DomainProgram bytecode formats serve as the bridge between query planning and execution, enabling the query engine to efficiently evaluate complex scalar computations and filter predicates over columnar data.

Scalar Evaluation and NumericKernels

Relevant source files

Purpose and Scope

This page documents the scalar expression evaluation engine in LLKV, covering how ScalarExpr instances are computed against columnar data to produce results for projections, filters, and computed columns. The evaluation system operates on Apache Arrow arrays and leverages vectorization for performance.

For information about the expression AST structure and variants, see Expression AST. For how expressions are translated from SQL column names to field identifiers, see Expression Translation. For the bytecode compilation pipeline, see Program Compilation. For aggregate function evaluation, see Aggregation System.

ScalarExpr Variants

The ScalarExpr<F> enum represents all scalar computations that can be performed in LLKV. Each variant corresponds to a specific type of operation that produces a single value per row.

Sources: llkv-expr/src/expr.rs:126-182

graph TB
    ScalarExpr["ScalarExpr&lt;F&gt;"]
ScalarExpr --> Column["Column(F)\nDirect column reference"]
ScalarExpr --> Literal["Literal(Literal)\nConstant value"]
ScalarExpr --> Binary["Binary\nArithmetic operations"]
ScalarExpr --> Not["Not(Box&lt;ScalarExpr&gt;)\nLogical negation"]
ScalarExpr --> IsNull["IsNull\nNULL test"]
ScalarExpr --> Aggregate["Aggregate(AggregateCall)\nAggregate functions"]
ScalarExpr --> GetField["GetField\nStruct field access"]
ScalarExpr --> Cast["Cast\nType conversion"]
ScalarExpr --> Compare["Compare\nComparison ops"]
ScalarExpr --> Coalesce["Coalesce\nFirst non-null"]
ScalarExpr --> ScalarSubquery["ScalarSubquery\nSubquery result"]
ScalarExpr --> Case["Case\nCASE expression"]
ScalarExpr --> Random["Random\nRandom number"]
Binary --> BinaryOp["BinaryOp:\nAdd, Subtract, Multiply,\nDivide, Modulo,\nAnd, Or,\nBitwiseShiftLeft,\nBitwiseShiftRight"]
Compare --> CompareOp["CompareOp:\nEq, NotEq,\nLt, LtEq,\nGt, GtEq"]

The generic type parameter F represents the field identifier type, which is typically String (for SQL column names) or FieldId (for translated physical column identifiers).

Binary Operations

Binary operations perform arithmetic or logical computations between two scalar expressions. The BinaryOp enum defines the supported operators:

| Operator | Symbol | Description | Example |
| --- | --- | --- | --- |
| Add | + | Addition | col1 + col2 |
| Subtract | - | Subtraction | col1 - 100 |
| Multiply | * | Multiplication | price * quantity |
| Divide | / | Division | total / count |
| Modulo | % | Remainder | id % 10 |
| And | AND | Logical AND | flag1 AND flag2 |
| Or | OR | Logical OR | status1 OR status2 |
| BitwiseShiftLeft | << | Left bit shift | mask << 2 |
| BitwiseShiftRight | >> | Right bit shift | flags >> 4 |

Sources: llkv-expr/src/expr.rs:310-338

Binary expressions are constructed recursively, allowing complex nested computations:
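
The sketch below shows how such a nested tree is assembled. It uses stand-in ScalarExpr and BinaryOp types that mirror the variants described above; the actual types in llkv-expr are generic over the field identifier F and carry many more variants.

```rust
// Illustrative stand-ins for the ScalarExpr/BinaryOp shapes described above;
// these are not the actual llkv-expr definitions.
#[derive(Debug)]
enum BinaryOp {
    Add,
    Multiply,
}

#[derive(Debug)]
enum ScalarExpr {
    Column(String),
    Literal(i64),
    Binary {
        left: Box<ScalarExpr>,
        op: BinaryOp,
        right: Box<ScalarExpr>,
    },
}

fn main() {
    // Represents `col1 + col2 * 10`; multiplication binds tighter, so it
    // becomes the right-hand child of the addition node.
    let expr = ScalarExpr::Binary {
        left: Box::new(ScalarExpr::Column("col1".into())),
        op: BinaryOp::Add,
        right: Box::new(ScalarExpr::Binary {
            left: Box::new(ScalarExpr::Column("col2".into())),
            op: BinaryOp::Multiply,
            right: Box::new(ScalarExpr::Literal(10)),
        }),
    };
    println!("{expr:?}");
}
```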

Comparison Operations

Comparison operations produce boolean results (represented as 1/0 integers) by comparing two scalar expressions. The CompareOp enum defines six comparison operators:

| Operator | Symbol | Description |
| --- | --- | --- |
| Eq | = | Equal to |
| NotEq | != | Not equal to |
| Lt | < | Less than |
| LtEq | <= | Less than or equal |
| Gt | > | Greater than |
| GtEq | >= | Greater than or equal |

Sources: llkv-expr/src/expr.rs:341-363

Comparisons are represented as a ScalarExpr::Compare variant that contains the left operand, operator, and right operand. When evaluated, they produce a boolean value that follows SQL three-valued logic (true, false, NULL).

Evaluation Pipeline

The evaluation of scalar expressions follows a multi-stage pipeline from SQL text to computed Arrow arrays:

Sources: llkv-expr/src/expr.rs:1-819 llkv-table/src/table.rs:29-30

graph LR
    SQL["SQL Expression\n'col1 + col2 * 10'"]
Parse["SQL Parser\nsqlparser-rs"]
Translate["Expression Translation\nString → FieldId"]
Compile["Program Compilation\nScalarExpr → EvalProgram"]
Execute["Evaluation Engine\nVectorized Execution"]
Result["Arrow Array\nComputed Results"]
SQL --> Parse
 
   Parse --> Translate
 
   Translate --> Compile
 
   Compile --> Execute
 
   Execute --> Result
    
    Compile -.uses.-> ProgramCompiler["ProgramCompiler\nllkv-compute"]
Execute -.operates on.-> ArrowArrays["Arrow Arrays\nColumnar Data"]

The key stages are:

  1. Parsing : SQL expressions are parsed into AST nodes by sqlparser-rs
  2. Translation : Column names (strings) are resolved to FieldId identifiers
  3. Compilation : ScalarExpr is compiled into bytecode by ProgramCompiler
  4. Execution : The bytecode is evaluated against Arrow columnar data

Type System and Casting

LLKV uses Arrow’s type system for scalar evaluation. The Cast variant of ScalarExpr performs explicit type conversions:

Sources: llkv-expr/src/expr.rs:154-157

The data_type field is an Arrow DataType that specifies the target type for the conversion. Type casting follows Arrow’s casting rules and handles conversions between:

  • Numeric types (integers, floats, decimals)
  • String types (UTF-8)
  • Temporal types (Date32, timestamps)
  • Boolean types
  • Struct types

Invalid casts produce NULL values following SQL semantics.
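
Because casting delegates to Arrow's rules, the NULL-on-failure behavior can be observed directly with Arrow's cast kernel. The snippet below is a standalone illustration that assumes the arrow crate is available as a dependency; it exercises Arrow's own kernel, not LLKV's cast code path.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Float64Array, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;

fn main() -> Result<(), arrow::error::ArrowError> {
    // A UTF-8 column where one value cannot be parsed as a number.
    let input: ArrayRef = Arc::new(StringArray::from(vec!["1.5", "not a number"]));

    // Arrow's default (safe) cast turns conversion failures into NULLs,
    // matching the SQL semantics described above.
    let casted = cast(&input, &DataType::Float64)?;
    let floats = casted.as_any().downcast_ref::<Float64Array>().unwrap();

    assert_eq!(floats.value(0), 1.5);
    assert!(floats.is_null(1)); // the unparseable string became NULL
    Ok(())
}
```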

Null Handling and Propagation

Scalar expressions implement SQL three-valued logic where NULL values propagate through most operations. The IsNull variant provides explicit NULL testing:

Sources: llkv-expr/src/expr.rs:139-142

When negated is false, this evaluates to 1 (true) if the expression is NULL, 0 (false) otherwise. When negated is true, it performs the inverse test (IS NOT NULL).

The Coalesce variant provides NULL-coalescing behavior, returning the first non-NULL value from a list of expressions:

Sources: llkv-expr/src/expr.rs:165

This is used to implement SQL’s COALESCE(expr1, expr2, ...) function.

CASE Expressions

The Case variant implements SQL CASE expressions with optional operand and ELSE branches:

Sources: llkv-expr/src/expr.rs:169-176

This represents both simple and searched CASE expressions:

  • Simple CASE : When operand is Some, each branch’s WHEN expression is compared to the operand
  • Searched CASE : When operand is None, each branch’s WHEN expression is evaluated as a boolean

The branches vector contains (WHEN, THEN) pairs evaluated in order. If no branch matches, else_expr is returned, or NULL if else_expr is None.
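
A minimal, self-contained sketch of this shape follows, using stand-in types and a hard-coded evaluation of a simple CASE. The real Case variant in llkv-expr stores ScalarExpr trees for the operand, the WHEN/THEN pairs, and the ELSE expression; a searched CASE would evaluate each WHEN as a boolean instead of comparing it to the operand.

```rust
// Stand-in shape for a CASE expression: `operand` selects simple vs. searched
// CASE, `branches` holds (WHEN, THEN) pairs evaluated in order, and
// `else_expr` supplies the fallback. Not the actual llkv-expr definition.
struct Case {
    operand: Option<i64>,               // Some(_) => simple CASE, None => searched CASE
    branches: Vec<(i64, &'static str)>, // (WHEN value, THEN result)
    else_expr: Option<&'static str>,
}

fn eval_simple(case: &Case) -> Option<&'static str> {
    let op = case.operand?;
    case.branches
        .iter()
        .find(|(when, _)| *when == op)
        .map(|(_, then)| *then)
        .or(case.else_expr)
}

fn main() {
    // CASE status WHEN 1 THEN 'open' WHEN 2 THEN 'closed' ELSE 'unknown' END
    let case = Case {
        operand: Some(2),
        branches: vec![(1, "open"), (2, "closed")],
        else_expr: Some("unknown"),
    };
    assert_eq!(eval_simple(&case), Some("closed"));
}
```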

graph TB
    subgraph "Row-by-Row Evaluation (Avoided)"
        Row1["Row 1:\ncol1=10, col2=5\n→ 10 + 5 = 15"]
Row2["Row 2:\ncol1=20, col2=3\n→ 20 + 3 = 23"]
Row3["Row 3:\ncol1=15, col2=7\n→ 15 + 7 = 22"]
end
    
    subgraph "Vectorized Evaluation (Used)"
        Col1["Int64Array\n[10, 20, 15]"]
Col2["Int64Array\n[5, 3, 7]"]
Result["Int64Array\n[15, 23, 22]"]
Col1 --> VectorAdd["Vectorized Add\nSIMD Operations"]
Col2 --> VectorAdd
 
       VectorAdd --> Result
    end
    
    style Row1 fill:#f9f9f9
    style Row2 fill:#f9f9f9
    style Row3 fill:#f9f9f9

Vectorization Strategy

Expression evaluation in LLKV leverages Apache Arrow’s columnar format for vectorized execution. Rather than evaluating expressions row-by-row, operations process entire arrays at once.

Sources: llkv-table/src/table.rs:1-681

Key benefits of vectorization:

  1. SIMD Instructions : Modern CPUs can process multiple values simultaneously
  2. Reduced Overhead : Eliminates per-row interpretation overhead
  3. Cache Efficiency : Columnar layout improves CPU cache utilization
  4. Arrow Compute Kernels : Leverages highly optimized Arrow implementations
graph TB
    Expr["ScalarExpr::Binary\nop=Add"]
Dispatch["Type Dispatch"]
Expr --> Dispatch
    
 
   Dispatch --> Int64["Int64Array\nAdd kernel"]
Dispatch --> Int32["Int32Array\nAdd kernel"]
Dispatch --> Float64["Float64Array\nAdd kernel"]
Dispatch --> Decimal128["Decimal128Array\nAdd kernel"]
Int64 --> Result1["Int64Array result"]
Int32 --> Result2["Int32Array result"]
Float64 --> Result3["Float64Array result"]
Decimal128 --> Result4["Decimal128Array result"]

Numeric Type Dispatch

LLKV handles multiple numeric types through Arrow’s type system. The evaluation engine uses Arrow’s primitive type traits to dispatch operations:

Sources: llkv-table/src/table.rs:17-20

The dispatch mechanism uses Arrow’s type system to select the appropriate kernel at evaluation time. The macros llkv_for_each_arrow_numeric, llkv_for_each_arrow_boolean, and llkv_for_each_arrow_string provide type-safe iteration over all supported Arrow types.

Aggregate Functions in Scalar Context

The Aggregate variant of ScalarExpr allows aggregate functions to appear in scalar contexts, such as COUNT(*) + 1:

Sources: llkv-expr/src/expr.rs:145

The AggregateCall enum includes:

  • CountStar: Count all rows
  • Count { expr, distinct }: Count non-NULL values
  • Sum { expr, distinct }: Sum of values
  • Total { expr, distinct }: Sum with NULL-safe 0 default
  • Avg { expr, distinct }: Average of values
  • Min(expr): Minimum value
  • Max(expr): Maximum value
  • CountNulls(expr): Count NULL values
  • GroupConcat { expr, distinct, separator }: String concatenation

Sources: llkv-expr/src/expr.rs:184-215

Each aggregate operates on a ScalarExpr, allowing nested computations like SUM(col1 + col2) or AVG(-price).

Random Number Generation

The Random variant produces pseudo-random floating-point values:

Sources: llkv-expr/src/expr.rs:181

Following PostgreSQL and DuckDB semantics, each evaluation produces a new random value in the range [0.0, 1.0). The generator is seeded automatically and does not expose seed control at the SQL level.

Struct Field Access

The GetField variant extracts fields from struct expressions:

This enables navigation of nested data structures. For example, accessing user.address.city would be represented as:
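
A hedged sketch with stand-in types (not the actual llkv-expr definitions): the access is modeled as nested GetField nodes wrapped around the base user column.

```rust
// Illustrative shape for struct-field access; the real GetField variant in
// llkv-expr differs in detail.
#[derive(Debug)]
enum ScalarExpr {
    Column(String),
    GetField { base: Box<ScalarExpr>, field_name: String },
}

fn main() {
    // `user.address.city` as nested GetField nodes over the `user` column.
    let expr = ScalarExpr::GetField {
        base: Box::new(ScalarExpr::GetField {
            base: Box::new(ScalarExpr::Column("user".into())),
            field_name: "address".into(),
        }),
        field_name: "city".into(),
    };
    println!("{expr:?}");
}
```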

The evaluation engine resolves field names against Arrow struct schemas at runtime.

Performance Considerations

The scalar evaluation engine includes several optimizations:

Expression Constant Folding

Constant subexpressions are evaluated once during compilation rather than per-row. For example, col1 + (10 * 20) is simplified to col1 + 200 before evaluation.

Predicate Pushdown

When scalar expressions appear in WHERE clauses, they may be pushed down to the storage layer for early filtering. The PredicateFusionCache in llkv-compute caches compiled predicates to avoid recompilation.

Sources: llkv-table/src/table.rs:29

Type Specialization

Arrow kernels are specialized for each numeric type, avoiding generic dispatch overhead in tight loops. This ensures that Int64 + Int64 uses dedicated integer addition instructions rather than polymorphic dispatch.

SIMD Acceleration

The underlying storage layer (simd-r-drive) provides SIMD-optimized operations for bulk data movement and filtering, which complements the vectorized evaluation strategy.

Sources: llkv-storage/pager/MemPager llkv-table/src/table.rs:21

sequenceDiagram
    participant SQL as SQL Engine
    participant Planner as TablePlanner
    participant Scanner as ScanRowStream
    participant Compute as Compute Layer
    participant Store as ColumnStore
    
    SQL->>Planner: Execute SELECT with expressions
    Planner->>Planner: Compile ScalarExpr to bytecode
    Planner->>Scanner: Create scan with projections
    
    loop For each batch
        Scanner->>Store: Gather column arrays
        Store-->>Scanner: Arrow arrays
        Scanner->>Compute: Evaluate expressions
        Compute->>Compute: Vectorized operations
        Compute-->>Scanner: Computed arrays
        Scanner-->>SQL: RecordBatch
    end

Integration with Scan Pipeline

Scalar expressions are evaluated during table scans through the ScanProjection system:

Sources: llkv-table/src/table.rs:447-488 llkv-scan/execute/execute_scan

The scan pipeline:

  1. Gathers base column arrays from the ColumnStore
  2. Passes arrays to the compute layer for expression evaluation
  3. Assembles computed results into RecordBatch instances
  4. Streams batches to the caller

This design minimizes memory allocation by processing data in fixed-size batches (typically 1024 rows) rather than materializing entire result sets.

Expression Compilation Flow

The complete compilation flow from SQL to executed results:

Sources: llkv-expr/src/expr.rs:1-819 llkv-plan/src/plans.rs:1-1500 llkv-table/src/table.rs:1-681

This pipeline ensures that expressions are validated, optimized, and compiled before execution begins, minimizing runtime overhead.


Aggregation System

Relevant source files

The Aggregation System implements SQL aggregate functions (COUNT, SUM, AVG, MIN, MAX, etc.) across the LLKV query pipeline. It handles both simple aggregates (SELECT COUNT(*) FROM table) and grouped aggregations (SELECT col, SUM(amount) FROM table GROUP BY col), with support for the DISTINCT modifier and expression-based aggregates like SUM(col1 + col2).

For information about scalar expression evaluation (non-aggregate), see Scalar Evaluation and NumericKernels. For query planning that produces aggregate plans, see Plan Structures.


System Architecture

The aggregation system spans four layers in the codebase, each with distinct responsibilities:

Diagram: Aggregation System Layering

graph TB
    subgraph "Expression Layer (llkv-expr)"
        AGG_CALL["AggregateCall&lt;F&gt;\nCountStar, Count, Sum, Avg, Min, Max"]
SCALAR_EXPR["ScalarExpr&lt;F&gt;\nAggregate(AggregateCall)"]
end
    
    subgraph "Plan Layer (llkv-plan)"
        AGG_EXPR["AggregateExpr\nCountStar, Column"]
AGG_FUNC["AggregateFunction\nCount, SumInt64, MinInt64, etc."]
SELECT_PLAN["SelectPlan\naggregates: Vec&lt;AggregateExpr&gt;\ngroup_by: Vec&lt;String&gt;"]
end
    
    subgraph "Execution Layer (llkv-executor)"
        EXECUTOR["QueryExecutor::execute_aggregates\nQueryExecutor::execute_group_by_single_table"]
AGG_VALUE["AggregateValue\nNull, Int64, Float64, Decimal128, String"]
GROUP_STATE["GroupAggregateState\nrepresentative_batch_idx\nrepresentative_row\nrow_locations"]
end
    
    subgraph "Accumulator Layer (llkv-aggregate)"
        AGG_ACCUMULATOR["AggregateAccumulator\nInterface for aggregate computation"]
AGG_KIND["AggregateKind\nType classification"]
AGG_SPEC["AggregateSpec\nConfiguration"]
AGG_STATE["AggregateState\nRuntime state"]
end
    
 
   SCALAR_EXPR --> AGG_CALL
 
   SELECT_PLAN --> AGG_EXPR
 
   AGG_EXPR --> AGG_FUNC
 
   EXECUTOR --> AGG_VALUE
 
   EXECUTOR --> GROUP_STATE
 
   EXECUTOR --> AGG_ACCUMULATOR
 
   AGG_ACCUMULATOR --> AGG_KIND
 
   AGG_ACCUMULATOR --> AGG_SPEC
 
   AGG_ACCUMULATOR --> AGG_STATE
    
    AGG_CALL -.translates to.-> AGG_EXPR
    SELECT_PLAN -.executes via.-> EXECUTOR

Sources:


Expression-Level Aggregates

Aggregate functions are represented in the expression AST via the AggregateCall<F> enum, which enables aggregates to appear within computed projections (e.g., COUNT(*) + 1 or SUM(col1) / AVG(col2)). Each variant captures the specific aggregate semantics:

| Variant | Description | Example SQL |
| --- | --- | --- |
| CountStar | Count all rows (including NULLs) | COUNT(*) |
| Count { expr, distinct } | Count non-NULL values of expression | COUNT(col), COUNT(DISTINCT col) |
| Sum { expr, distinct } | Sum numeric expression values | SUM(amount), SUM(DISTINCT col) |
| Total { expr, distinct } | Sum with NULL-to-zero coercion | TOTAL(amount) |
| Avg { expr, distinct } | Arithmetic mean of expression | AVG(price) |
| Min(expr) | Minimum value | MIN(created_at) |
| Max(expr) | Maximum value | MAX(score) |
| CountNulls(expr) | Count NULL occurrences | COUNT_NULLS(optional_field) |
| GroupConcat { expr, distinct, separator } | Concatenate strings | GROUP_CONCAT(name, ',') |

Each aggregate operates on a ScalarExpr<F>, not just a column name, which allows complex expressions like SUM(price * quantity) or AVG(col1 + col2).

Sources:


Plan-Level Representation

The query planner converts SQL aggregate syntax into AggregateExpr instances stored in SelectPlan::aggregates. The plan layer uses a simplified representation compared to the expression layer:

Diagram: Plan-Level Aggregate Structure

graph LR
    SELECT_PLAN["SelectPlan"]
AGG_LIST["aggregates: Vec&lt;AggregateExpr&gt;"]
GROUP_BY["group_by: Vec&lt;String&gt;"]
SELECT_PLAN --> AGG_LIST
 
   SELECT_PLAN --> GROUP_BY
    
 
   AGG_LIST --> COUNT_STAR["AggregateExpr::CountStar\nalias: String\ndistinct: bool"]
AGG_LIST --> COLUMN["AggregateExpr::Column\ncolumn: String\nalias: String\nfunction: AggregateFunction\ndistinct: bool"]
COLUMN --> FUNC["AggregateFunction::\nCount, SumInt64, TotalInt64,\nMinInt64, MaxInt64,\nCountNulls, GroupConcat"]

Sources:

The planner distinguishes between:

  • Non-grouped aggregates : Empty group_by vector, producing a single result row
  • Grouped aggregates : Populated group_by vector, producing one row per distinct group

Execution Strategy Selection

The executor chooses different code paths based on query structure, optimizing for common patterns:

Diagram: Aggregate Execution Decision Tree

graph TD
    START["QueryExecutor::execute_select"]
CHECK_COMPOUND{"plan.compound.is_some()"}
CHECK_EMPTY_TABLES{"plan.tables.is_empty()"}
CHECK_GROUP_BY{"!plan.group_by.is_empty()"}
CHECK_MULTI_TABLE{"plan.tables.len() > 1"}
CHECK_AGGREGATES{"!plan.aggregates.is_empty()"}
CHECK_COMPUTED{"has_computed_aggregates(&plan)"}
START --> CHECK_COMPOUND
 
   CHECK_COMPOUND -->|Yes| COMPOUND["execute_compound_select"]
CHECK_COMPOUND -->|No| CHECK_EMPTY_TABLES
    
 
   CHECK_EMPTY_TABLES -->|Yes| NO_TABLE["execute_select_without_table"]
CHECK_EMPTY_TABLES -->|No| CHECK_GROUP_BY
    
 
   CHECK_GROUP_BY -->|Yes| CHECK_MULTI_TABLE
 
   CHECK_MULTI_TABLE -->|Multi| CROSS_PROD["execute_cross_product"]
CHECK_MULTI_TABLE -->|Single| GROUP_BY_SINGLE["execute_group_by_single_table"]
CHECK_GROUP_BY -->|No| CHECK_MULTI_TABLE_2{"plan.tables.len() > 1"}
CHECK_MULTI_TABLE_2 -->|Yes| CROSS_PROD
 
   CHECK_MULTI_TABLE_2 -->|No| CHECK_AGGREGATES
    
 
   CHECK_AGGREGATES -->|Yes| EXEC_AGG["execute_aggregates"]
CHECK_AGGREGATES -->|No| CHECK_COMPUTED
    
 
   CHECK_COMPUTED -->|Yes| COMPUTED_AGG["execute_computed_aggregates"]
CHECK_COMPUTED -->|No| PROJECTION["execute_projection"]

Sources:

Non-Grouped Aggregate Execution

execute_aggregates processes queries without GROUP BY clauses. All rows are treated as a single group:

  1. Projection Planning : Build ScanProjection list for columns needed by aggregate expressions
  2. Expression Translation : Convert ScalarExpr<String> to ScalarExpr<FieldId> using table schema
  3. Data Streaming : Scan table and accumulate values via AggregateAccumulator
  4. Result Assembly : Finalize accumulators and construct single-row RecordBatch

Sources:

Grouped Aggregate Execution

execute_group_by_single_table handles queries with GROUP BY clauses:

  1. Full Scan : Load all table rows into memory (required for grouping)
  2. Group Key Extraction : Evaluate GROUP BY expressions for each row, producing GroupKeyValue instances
  3. Group State Tracking : Build FxHashMap<Vec<GroupKeyValue>, GroupAggregateState> mapping group keys to row locations
  4. Per-Group Accumulation : For each group, process its rows through aggregate accumulators
  5. HAVING Filter : Apply post-aggregation filter if present
  6. Result Construction : Build output RecordBatch with one row per group

Sources:


Accumulator Interface

The llkv-aggregate crate (imported at llkv-executor/src/lib.rs:19) provides the AggregateAccumulator trait, which abstracts the computation logic for individual aggregate functions. Each accumulator maintains incremental state as it processes rows:

Diagram: Accumulator Lifecycle

sequenceDiagram
    participant Executor
    participant Accumulator as AggregateAccumulator
    participant State as AggregateState
    
    Executor->>Accumulator: new(AggregateSpec)
    Accumulator->>State: initialize()
    
    loop For each batch
        Executor->>Accumulator: update(batch, row_indices)
        Accumulator->>State: accumulate values
    end
    
    Executor->>Accumulator: finalize()
    Accumulator->>State: compute final value
    Accumulator-->>Executor: AggregateValue

Sources:
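
The lifecycle above can be illustrated with a minimal stand-in accumulator. This is a sketch only: the real AggregateAccumulator in llkv-aggregate is configured from an AggregateSpec and consumes Arrow batches rather than plain slices.

```rust
// Stand-in accumulator interface mirroring the new/update/finalize lifecycle
// shown in the sequence diagram; not the llkv-aggregate API.
trait Accumulator {
    fn update(&mut self, values: &[Option<i64>]);
    fn finalize(self) -> i64;
}

struct CountAccumulator {
    count: i64,
}

impl Accumulator for CountAccumulator {
    fn update(&mut self, values: &[Option<i64>]) {
        // COUNT(col) counts only non-NULL values.
        self.count += values.iter().filter(|v| v.is_some()).count() as i64;
    }

    fn finalize(self) -> i64 {
        self.count
    }
}

fn main() {
    let mut acc = CountAccumulator { count: 0 };
    acc.update(&[Some(1), None, Some(3)]); // first batch
    acc.update(&[Some(4)]);                // second batch
    assert_eq!(acc.finalize(), 3);
}
```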

The executor wraps accumulator results in AggregateValue, which handles type conversions between the accumulator’s output type and the plan’s expected type:

| AggregateValue Variant | Usage |
| --- | --- |
| Null | No rows matched, or all values were NULL |
| Int64(i64) | Integer aggregates (COUNT, SUM for integers) |
| Float64(f64) | Floating-point aggregates (AVG, SUM for floats) |
| Decimal128 { value: i128, scale: i8 } | Precise decimal aggregates |
| String(String) | String aggregates (GROUP_CONCAT) |

Sources:


graph TD
    START["Aggregate with distinct=true"]
INIT["Initialize FxHashSet&lt;Vec&lt;u8&gt;&gt;\nfor distinct tracking"]
LOOP_START["For each input row"]
EXTRACT["Extract aggregate expression value"]
ENCODE["Encode value as byte vector\nusing encode_row()"]
CHECK_SEEN{"Value already\nin set?"}
SKIP["Skip row\n(duplicate)"]
INSERT["Insert into set"]
ACCUMULATE["Pass to accumulator"]
LOOP_END["Next row"]
FINALIZE["Finalize accumulator"]
START --> INIT
 
   INIT --> LOOP_START
 
   LOOP_START --> EXTRACT
 
   EXTRACT --> ENCODE
 
   ENCODE --> CHECK_SEEN
 
   CHECK_SEEN -->|Yes| SKIP
 
   CHECK_SEEN -->|No| INSERT
 
   INSERT --> ACCUMULATE
 
   SKIP --> LOOP_END
 
   ACCUMULATE --> LOOP_END
 
   LOOP_END --> LOOP_START
    LOOP_END -.all rows.-> FINALIZE

Distinct Value Tracking

When an aggregate includes the DISTINCT modifier (e.g., COUNT(DISTINCT col)), the executor must deduplicate values before accumulation. This is handled via hash-based tracking:

Diagram: DISTINCT Aggregate Processing

The encode_row function (referenced throughout llkv-executor/src/lib.rs) converts values to a canonical byte representation suitable for hash-based deduplication.
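
A self-contained sketch of the same idea, using std::collections::HashSet and a stand-in encoding function in place of FxHashSet and encode_row:

```rust
use std::collections::HashSet;

// Stand-in for the executor's encode_row(): produce a canonical byte
// representation that can be hashed for deduplication.
fn encode_value(v: i64) -> Vec<u8> {
    v.to_be_bytes().to_vec()
}

fn main() {
    let rows: Vec<Option<i64>> = vec![Some(3), Some(3), None, Some(7), None];
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut distinct_count = 0u64;

    for value in rows {
        // COUNT(DISTINCT col) ignores NULLs entirely.
        let Some(v) = value else { continue };
        // `insert` returns false for duplicates, so only first occurrences
        // reach the accumulator (here: a simple counter).
        if seen.insert(encode_value(v)) {
            distinct_count += 1;
        }
    }

    assert_eq!(distinct_count, 2);
    println!("COUNT(DISTINCT col) = {distinct_count}");
}
```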

Sources:


graph LR
    INPUT["Input Batch"]
subgraph "Expression Evaluation"
        TRANSLATE["translate_scalar\n(String → FieldId)"]
EVAL_NUMERIC["NumericKernels::evaluate_numeric"]
RESULT_ARRAY["Computed ArrayRef"]
end
    
    subgraph "Accumulation"
        EXTRACT["Extract values from array"]
ACCUMULATE["AggregateAccumulator::update"]
end
    
 
   INPUT --> TRANSLATE
 
   TRANSLATE --> EVAL_NUMERIC
 
   EVAL_NUMERIC --> RESULT_ARRAY
 
   RESULT_ARRAY --> EXTRACT
 
   EXTRACT --> ACCUMULATE

Expression-Based Aggregates

Unlike simple column aggregates, expression-based aggregates (e.g., SUM(col1 * col2) or AVG(CASE WHEN x > 0 THEN x ELSE 0 END)) require evaluating the expression for each row before accumulating:

Diagram: Expression Aggregate Evaluation

The executor uses ensure_computed_projection to translate expression trees and infer result data types:

Sources:

This helper ensures the expression is added to the scan projection list only once (via caching), avoiding redundant computation when multiple aggregates reference the same expression.


Simple vs Complex Column Extraction

The function try_extract_simple_column optimizes aggregate evaluation by detecting when an aggregate expression is equivalent to a direct column reference:

This optimization allows the executor to skip expression evaluation machinery for common cases, reading column data directly from the column store.

Sources:


graph TD
    AGG_VALUE["AggregateValue"]
AS_I64["as_i64() → Option&lt;i64&gt;"]
AS_F64["as_f64() → Option&lt;f64&gt;"]
AGG_VALUE --> AS_I64
 
   AGG_VALUE --> AS_F64
    
 
   AS_I64 --> NULL_CHECK1{"Null?"}
NULL_CHECK1 -->|Yes| NONE1["None"]
NULL_CHECK1 -->|No| TYPE_CHECK1{"Type?"}
TYPE_CHECK1 -->|Int64| DIRECT_I64["Some(value)"]
TYPE_CHECK1 -->|Float64| TRUNC["Some(value as i64)"]
TYPE_CHECK1 -->|Decimal128| SCALE_DOWN["Some(value / 10^scale)"]
TYPE_CHECK1 -->|String| PARSE_I64["s.parse::&lt;i64&gt;().ok()"]
AS_F64 --> NULL_CHECK2{"Null?"}
NULL_CHECK2 -->|Yes| NONE2["None"]
NULL_CHECK2 -->|No| TYPE_CHECK2{"Type?"}
TYPE_CHECK2 -->|Int64| PROMOTE["Some(value as f64)"]
TYPE_CHECK2 -->|Float64| DIRECT_F64["Some(value)"]
TYPE_CHECK2 -->|Decimal128| DIVIDE["Some(value / 10.0^scale)"]
TYPE_CHECK2 -->|String| PARSE_F64["s.parse::&lt;f64&gt;().ok()"]

Aggregate Result Types and Conversions

AggregateValue provides conversion methods to satisfy different consumer requirements:

Diagram: AggregateValue Type Conversions

Sources:

These conversions enable:

  • Order By : Converting aggregate results to sortable numeric types
  • HAVING Filters : Evaluating post-aggregate predicates that compare aggregate values
  • Nested Aggregates : Using one aggregate’s result in another’s computation (rare, but supported in computed projections)
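
The conversion rules can be sketched with a stand-in enum mirroring the variants listed earlier. The as_f64 logic below follows the promote/divide/parse rules from the diagram and is illustrative, not the executor's actual implementation.

```rust
// Stand-in for AggregateValue; the real type lives in llkv-executor.
#[allow(dead_code)]
enum AggregateValue {
    Null,
    Int64(i64),
    Float64(f64),
    Decimal128 { value: i128, scale: i8 },
    String(String),
}

impl AggregateValue {
    fn as_f64(&self) -> Option<f64> {
        match self {
            AggregateValue::Null => None,
            AggregateValue::Int64(v) => Some(*v as f64),           // promote
            AggregateValue::Float64(v) => Some(*v),                // direct
            AggregateValue::Decimal128 { value, scale } => {
                Some(*value as f64 / 10f64.powi(*scale as i32))    // scale down
            }
            AggregateValue::String(s) => s.parse::<f64>().ok(),    // parse
        }
    }
}

fn main() {
    let avg = AggregateValue::Decimal128 { value: 12345, scale: 2 };
    assert_eq!(avg.as_f64(), Some(123.45));
}
```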

graph LR
    EXPR["GROUP BY expression"]
EVAL["Evaluate for each row"]
subgraph "GroupKeyValue Variants"
        NULL_VAL["Null"]
INT_VAL["Int(i64)"]
BOOL_VAL["Bool(bool)"]
STRING_VAL["String(String)"]
end
    
    ENCODE["encode_row()\nVec&lt;GroupKeyValue&gt; → Vec&lt;u8&gt;"]
MAP["FxHashMap&lt;Vec&lt;u8&gt;, GroupAggregateState&gt;"]
EXPR --> EVAL
 
   EVAL --> NULL_VAL
 
   EVAL --> INT_VAL
 
   EVAL --> BOOL_VAL
 
   EVAL --> STRING_VAL
    
 
   NULL_VAL --> ENCODE
 
   INT_VAL --> ENCODE
 
   BOOL_VAL --> ENCODE
 
   STRING_VAL --> ENCODE
    
 
   ENCODE --> MAP

Group Key Representation

For grouped aggregations, the executor encodes group-by expressions into GroupKeyValue instances, which form composite keys in the group state map:

Diagram: Group Key Encoding

Sources:

The GroupAggregateState struct tracks which rows belong to each group:

This representation enables efficient accumulation: for each group, the executor iterates row_locations and passes those rows to the aggregate accumulators.
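
A compact, self-contained sketch of the grouping structure, using std HashMap and plain byte keys in place of FxHashMap, encode_row, and the full GroupAggregateState:

```rust
use std::collections::HashMap;

// Stand-in for GroupAggregateState: each encoded group key maps to the
// (batch, row) locations that belong to that group.
#[derive(Default)]
struct GroupState {
    row_locations: Vec<(usize, usize)>, // (batch_idx, row_idx)
}

fn main() {
    // (group key, batch index, row index) triples produced by the scan.
    let rows = [("a", 0, 0), ("b", 0, 1), ("a", 0, 2), ("b", 1, 0)];
    let mut groups: HashMap<Vec<u8>, GroupState> = HashMap::new();

    for (key, batch, row) in rows {
        groups
            .entry(key.as_bytes().to_vec()) // stand-in for encode_row()
            .or_default()
            .row_locations
            .push((batch, row));
    }

    // Each group's rows would now be fed to its aggregate accumulators.
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[b"a".as_slice()].row_locations, vec![(0, 0), (0, 2)]);
}
```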

Sources:


sequenceDiagram
    participant Exec as QueryExecutor
    participant Scan as Table Scan
    participant Accum as Accumulators
    participant Eval as NumericKernels
    
    Exec->>Scan: Scan all rows
    Scan-->>Exec: RecordBatch
    
    Exec->>Exec: Identify aggregate calls\nin projections
    
    loop For each aggregate
        Exec->>Accum: create accumulator
        Exec->>Accum: update(batch)
        Accum-->>Exec: finalized value
    end
    
    Exec->>Exec: Inject aggregate values\nas synthetic columns
    
    loop For each projection
        Exec->>Eval: evaluate_numeric()\nwith synthetic columns
        Eval-->>Exec: computed ArrayRef
    end
    
    Exec->>Exec: Construct final RecordBatch

Computed Aggregates in Projections

When a SELECT list includes computed expressions containing aggregate functions (e.g., SELECT COUNT(*) * 2, SUM(x) + AVG(y)), the executor uses execute_computed_aggregates:

Diagram: Computed Aggregate Flow

Sources:

This execution path:

  1. Scans the table once to collect all rows
  2. Evaluates aggregate functions to produce scalar values
  3. Injects those scalars into a temporary evaluation context as synthetic columns
  4. Evaluates the projection expressions referencing those synthetic columns
  5. Assembles the final result batch

This approach allows arbitrary nesting of aggregates within expressions while maintaining correctness.
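
Once the aggregate values are known, the projection step reduces to ordinary scalar arithmetic, as in this illustrative sketch (plain Rust values stand in for the synthetic columns and the expression evaluator):

```rust
// Hedged sketch of the "synthetic column" trick: evaluate the aggregates
// once, then evaluate the projection expressions against those scalars.
fn main() {
    let rows = [3_i64, 7, 9, 11];

    // Step 1: evaluate the aggregates over the scanned rows.
    let count_star = rows.len() as i64;
    let sum_x: i64 = rows.iter().sum();

    // Step 2: evaluate the projection `COUNT(*) * 2, SUM(x) + 1` against the
    // injected scalar values (one output row for a non-grouped aggregate).
    let projected = (count_star * 2, sum_x + 1);
    assert_eq!(projected, (8, 31));
}
```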


Performance Considerations

The aggregation system makes several trade-offs:

| Strategy | Benefit | Cost |
| --- | --- | --- |
| AggregateAccumulator trait abstraction | Pluggable aggregate implementations | Indirect call overhead |
| Full batch materialization for GROUP BY | Simple implementation, works for any key type | High memory usage for large result sets |
| Hash-based DISTINCT tracking | Correct deduplication | Memory proportional to cardinality |
| Expression evaluation per row | Supports complex aggregates | Cannot leverage predicate pushdown |
| FxHashMap for grouping | Fast hashing for typical keys | Collision risk with adversarial inputs |

For aggregates over large datasets, consider:

  • Predicate pushdown : Filter rows before aggregation
  • Projection pruning : Only scan columns needed by aggregate expressions
  • Index-assisted aggregation : Use indexes for MIN/MAX when possible (not currently implemented)

Sources:


Query Execution

Relevant source files

Purpose and Scope

This document describes the query execution layer that transforms query plans into result data. The executor sits between the query planner (Query Planning) and the storage layer (Storage Layer), dispatching work to specialized components based on plan characteristics. This page provides a high-level overview of execution architecture and strategy selection. For detailed information about specific execution modes, see TablePlanner and TableExecutor, Scan Execution and Optimization, and Filter Evaluation.

Architecture Overview

The query execution layer is implemented primarily in the llkv-executor crate, with the QueryExecutor struct serving as the main orchestrator. The executor receives SelectPlan structures from the planner and produces SelectExecution results containing Arrow RecordBatch data.

Execution Strategy Dispatch Flow

graph TB
    subgraph "Planning Layer"
        PLAN["SelectPlan\n(from llkv-plan)"]
end
    
    subgraph "Execution Layer - llkv-executor"
        EXECUTOR["QueryExecutor&lt;P&gt;"]
DISPATCH{Execution\nStrategy\nDispatch}
COMPOUND["Compound SELECT\nUNION/EXCEPT/INTERSECT"]
NOTABLE["No-Table Execution\nSELECT constant"]
GROUPBY["Group By Execution\nAggregation + Grouping"]
CROSS["Cross Product\nMultiple Tables"]
AGGREGATE["Aggregate Execution\nSUM/AVG/COUNT"]
PROJECTION["Projection Execution\nColumn Selection"]
end
    
    subgraph "Storage Layer"
        PROVIDER["ExecutorTableProvider&lt;P&gt;"]
TABLE["ExecutorTable&lt;P&gt;"]
SCAN["Scan Operations\n(llkv-scan)"]
end
    
    subgraph "Result"
        EXECUTION["SelectExecution&lt;P&gt;"]
BATCH["RecordBatch[]\nArrow Data"]
end
    
 
   PLAN --> EXECUTOR
 
   EXECUTOR --> DISPATCH
    
 
   DISPATCH -->|compound query| COMPOUND
 
   DISPATCH -->|no FROM clause| NOTABLE
 
   DISPATCH -->|GROUP BY present| GROUPBY
 
   DISPATCH -->|multiple tables| CROSS
 
   DISPATCH -->|aggregates only| AGGREGATE
 
   DISPATCH -->|default path| PROJECTION
    
 
   EXECUTOR --> PROVIDER
 
   PROVIDER --> TABLE
 
   TABLE --> SCAN
    
 
   COMPOUND --> EXECUTION
 
   NOTABLE --> EXECUTION
 
   GROUPBY --> EXECUTION
 
   CROSS --> EXECUTION
 
   AGGREGATE --> EXECUTION
 
   PROJECTION --> EXECUTION
    
 
   EXECUTION --> BATCH

The executor examines plan characteristics to select an appropriate execution strategy. Each strategy is optimized for specific query patterns.

Sources: llkv-executor/src/lib.rs:504-563 llkv-executor/src/lib.rs:584-695

QueryExecutor Structure

The QueryExecutor<P> struct is the primary entry point for executing SELECT queries. It is generic over a Pager type P to support different storage backends.

| Component | Type | Purpose |
| --- | --- | --- |
| provider | Arc<dyn ExecutorTableProvider<P>> | Provides access to tables and their metadata |

The provider abstraction allows the executor to remain decoupled from specific table implementations, enabling testing with mock providers and supporting different storage configurations.

QueryExecutor and Provider Relationship

graph LR
    subgraph "Executor Core"
        QE["QueryExecutor&lt;P&gt;"]
end
    
    subgraph "Provider Interface"
        PROVIDER["ExecutorTableProvider&lt;P&gt;"]
GET_TABLE["get_table(name)\n→ ExecutorTable&lt;P&gt;"]
end
    
    subgraph "Table Interface"
        ETABLE["ExecutorTable&lt;P&gt;"]
SCHEMA["schema: ExecutorSchema"]
STORAGE["storage: TableStorageAdapter&lt;P&gt;"]
end
    
 
   QE --> PROVIDER
 
   PROVIDER --> GET_TABLE
 
   GET_TABLE --> ETABLE
 
   ETABLE --> SCHEMA
 
   ETABLE --> STORAGE

The provider pattern enables dependency injection, allowing the executor to work with different table sources without tight coupling to storage implementations.

Sources: llkv-executor/src/lib.rs:504-517 llkv-executor/src/types.rs:1-100

Execution Entry Points

The executor provides two primary entry points for executing SELECT plans:

execute_select

Executes a SELECT plan without additional filtering constraints. This is the standard path used when no external row filtering is required.

execute_select_with_filter

Executes a SELECT plan with an optional row ID filter. The RowIdFilter trait allows callers to specify a predicate that determines which row IDs should be considered during execution. This is used for implementing transaction isolation (MVCC filtering) and other row-level visibility constraints.

Both methods extract limit and offset from the plan and apply them to the final SelectExecution result, ensuring consistent pagination behavior across all execution strategies.

Sources: llkv-executor/src/lib.rs:519-563

Execution Strategy Dispatch

The executor examines the characteristics of a SelectPlan to determine the most efficient execution strategy. The dispatch logic follows a priority hierarchy:

Execution Strategy Decision Tree

graph TD
    START["execute_select_with_filter(plan)"]
CHECK_COMPOUND{plan.compound\nis Some?}
CHECK_TABLES{plan.tables\nis empty?}
CHECK_GROUPBY{plan.group_by\nnot empty?}
CHECK_MULTI{plan.tables.len()\n&gt; 1?}
CHECK_AGG{plan.aggregates\nnot empty?}
CHECK_COMPUTED{has computed\naggregates?}
COMPOUND["execute_compound_select\nUNION/EXCEPT/INTERSECT"]
NOTABLE["execute_select_without_table\nConstant evaluation"]
GROUPBY_SINGLE["execute_group_by_single_table\nGROUP BY aggregation"]
GROUPBY_CROSS["execute_cross_product\nMulti-table GROUP BY"]
CROSS["execute_cross_product\nCartesian product"]
AGGREGATE["execute_aggregates\nSingle-table aggregates"]
COMPUTED_AGG["execute_computed_aggregates\nAggregates in expressions"]
PROJECTION["execute_projection\nSimple column selection"]
RESULT["SelectExecution&lt;P&gt;\nwith limit/offset"]
START --> CHECK_COMPOUND
 
   CHECK_COMPOUND -->|Yes| COMPOUND
 
   CHECK_COMPOUND -->|No| CHECK_TABLES
    
 
   CHECK_TABLES -->|Yes| NOTABLE
 
   CHECK_TABLES -->|No| CHECK_GROUPBY
    
 
   CHECK_GROUPBY -->|Yes| CHECK_MULTI
 
   CHECK_MULTI -->|Yes| GROUPBY_CROSS
 
   CHECK_MULTI -->|No| GROUPBY_SINGLE
    
 
   CHECK_GROUPBY -->|No| CHECK_MULTI
 
   CHECK_MULTI -->|Yes| CROSS
 
   CHECK_MULTI -->|No| CHECK_AGG
    
 
   CHECK_AGG -->|Yes| AGGREGATE
 
   CHECK_AGG -->|No| CHECK_COMPUTED
    
 
   CHECK_COMPUTED -->|Yes| COMPUTED_AGG
 
   CHECK_COMPUTED -->|No| PROJECTION
    
 
   COMPOUND --> RESULT
 
   NOTABLE --> RESULT
 
   GROUPBY_SINGLE --> RESULT
 
   GROUPBY_CROSS --> RESULT
 
   CROSS --> RESULT
 
   AGGREGATE --> RESULT
 
   COMPUTED_AGG --> RESULT
 
   PROJECTION --> RESULT

The executor prioritizes specialized execution paths over generic ones, enabling optimizations tailored to specific query patterns.

Strategy Descriptions

| Strategy | Plan Characteristics | Implementation |
| --- | --- | --- |
| Compound SELECT | plan.compound.is_some() | Executes UNION, EXCEPT, or INTERSECT operations by evaluating component queries and combining results with deduplication for DISTINCT quantifiers |
| No-Table Execution | plan.tables.is_empty() | Evaluates constant expressions like SELECT 1, 2, 3 without accessing storage |
| Group By (Single Table) | !plan.group_by.is_empty() && plan.tables.len() == 1 | Performs grouped aggregation on a single table with efficient column scanning |
| Group By (Cross Product) | !plan.group_by.is_empty() && plan.tables.len() > 1 | Computes Cartesian product before grouping |
| Cross Product | plan.tables.len() > 1 | Joins multiple tables using nested loop or hash join |
| Aggregate Execution | !plan.aggregates.is_empty() | Computes aggregates (COUNT, SUM, AVG, etc.) over a single table |
| Computed Aggregates | Aggregates within computed expressions | Evaluates expressions containing aggregate functions |
| Projection Execution | Default path | Performs column selection with optional filtering |

Sources: llkv-executor/src/lib.rs:523-563

Result Representation

The executor returns results as SelectExecution<P> instances, which encapsulate one or more Arrow RecordBatch objects along with metadata.

SelectExecution Result Types

graph TB
    subgraph "SelectExecution&lt;P&gt;"
        EXEC["SelectExecution"]
SCHEMA["schema: Arc&lt;Schema&gt;"]
DISPLAY["display_name: String"]
MODE{Execution\nMode}
end
    
    subgraph "Single Batch Mode"
        SINGLE["Single RecordBatch"]
BATCH1["RecordBatch\nMaterialized data"]
end
    
    subgraph "Multi Batch Mode"
        MULTI["Vec&lt;RecordBatch&gt;"]
BATCH2["RecordBatch[]\nMultiple batches"]
end
    
    subgraph "Streaming Mode"
        STREAM["Scan Stream"]
LAZY["Lazy evaluation"]
ITER["Iterator-based"]
end
    
    subgraph "Post-Processing"
        LIMIT["limit: Option&lt;usize&gt;"]
OFFSET["offset: Option&lt;usize&gt;"]
APPLY["Applied during\nmaterialization"]
end
    
 
   EXEC --> SCHEMA
 
   EXEC --> DISPLAY
 
   EXEC --> MODE
    
 
   MODE -->|Materialized| SINGLE
 
   MODE -->|Compound/Sorted| MULTI
 
   MODE -->|Large tables| STREAM
    
 
   SINGLE --> BATCH1
 
   MULTI --> BATCH2
 
   STREAM --> LAZY
 
   STREAM --> ITER
    
 
   EXEC --> LIMIT
 
   EXEC --> OFFSET
 
   LIMIT --> APPLY
 
   OFFSET --> APPLY

The SelectExecution type supports multiple result modes optimized for different query patterns. The with_limit and with_offset methods attach pagination parameters that are applied when materializing results.

Result Materialization

Callers can materialize results in several ways:

  • into_rows() : Converts all batches into a Vec<CanonicalRow> representation, applying limit and offset
  • stream(callback) : Invokes a callback for each batch, enabling memory-efficient processing of large result sets
  • into_record_batch() : Consolidates results into a single RecordBatch, useful for small result sets
  • into_batches() : Returns all batches as a vector

The streaming API is particularly important for large queries where materializing all results at once would exceed memory limits.

Sources: llkv-executor/src/lib.rs:519-563 llkv-executor/src/scan.rs:1-100

Execution Phases

Most execution strategies follow a two-phase pattern optimized for columnar storage:

Phase 1: Row ID Collection

The executor first identifies which rows satisfy the query’s filter predicates without fetching the full column data. This phase produces a bitmap or set of row IDs that match the criteria.

Row ID Collection Phase

sequenceDiagram
    participant EX as QueryExecutor
    participant TBL as ExecutorTable
    participant SCAN as Scan Operations
    participant STORE as Column Store
    
    EX->>TBL: filter_row_ids(predicate)
    TBL->>SCAN: evaluate_filter
    SCAN->>STORE: Load chunk metadata
    
    Note over SCAN,STORE: Chunk pruning using\nmin/max values
    
    SCAN->>STORE: Load matching chunks
    SCAN->>SCAN: Vectorized predicate\nevaluation (SIMD)
    SCAN-->>TBL: Bitmap of matching row IDs
    TBL-->>EX: Row ID set

Predicate evaluation uses chunk metadata to skip irrelevant data (Scan Execution and Optimization) and vectorized kernels for efficient matching (Filter Evaluation).

sequenceDiagram
    participant EX as QueryExecutor
    participant TBL as ExecutorTable
    participant STORE as Column Store
    participant PAGER as Storage Pager
    
    EX->>TBL: scan_stream(projections, row_ids)
    
    loop For each projection
        TBL->>STORE: gather_rows(field_id, row_ids)
        STORE->>STORE: Identify chunks containing\nrequested row IDs
        STORE->>PAGER: batch_get(chunk_keys)
        PAGER-->>STORE: Chunk data
        STORE->>STORE: Construct Arrow array
        STORE-->>TBL: ArrayRef
    end
    
    TBL->>TBL: Construct RecordBatch\nfrom arrays
    TBL-->>EX: RecordBatch

Phase 2: Data Gathering

Once the matching row IDs are known, the executor fetches only the required columns for those specific rows. This minimizes I/O by avoiding unnecessary column reads.

Data Gathering Phase

The gather operation reconstructs Arrow arrays from chunked columnar storage, fetching only the columns referenced in the query’s projections.

Phase 3: Post-Processing

After data gathering, the executor applies sorting, aggregation, or other transformations as required by the plan:

| Operation | When Applied | Implementation |
| --- | --- | --- |
| Sorting | ORDER BY clause present | Uses Arrow's lexsort_to_indices with custom NULLS FIRST/LAST handling |
| Limiting | LIMIT clause present | Truncates result set to specified row count |
| Offset | OFFSET clause present | Skips specified number of rows before returning results |
| Aggregation | GROUP BY or aggregate functions | Materializes groups and computes aggregate values |
| Distinct | SELECT DISTINCT | Hash-based deduplication using row encoding |

Sources: llkv-executor/src/lib.rs:584-1000 llkv-executor/src/scan.rs:1-500

Subquery Execution

The executor handles subqueries through recursive evaluation, supporting both scalar subqueries and EXISTS predicates.

graph TD
    EXPR["Evaluate Expression\ncontaining subquery"]
COLLECT["Collect correlated\ncolumn bindings"]
ENCODE["Encode bindings\nas cache key"]
CHECK{Cache\nhit?}
CACHED["Return cached\nLiteral"]
EXECUTE["Execute subquery\nwith bindings"]
VALIDATE["Validate result:\n1 column, ≤1 row"]
STORE["Store in cache"]
RETURN["Return Literal"]
EXPR --> COLLECT
 
   COLLECT --> ENCODE
 
   ENCODE --> CHECK
 
   CHECK -->|Yes| CACHED
 
   CHECK -->|No| EXECUTE
 
   EXECUTE --> VALIDATE
 
   VALIDATE --> STORE
 
   STORE --> RETURN
 
   CACHED --> RETURN

Scalar Subquery Evaluation

Scalar subqueries are evaluated lazily during expression computation. The executor maintains a cache (scalar_subquery_cache) to avoid re-executing identical subqueries with the same correlated bindings:

Scalar Subquery Evaluation with Caching

The caching mechanism is essential for performance when a subquery is evaluated multiple times in a cross product or aggregate context.

Parallel Subquery Execution

For queries that require evaluating the same correlated subquery across many rows, the executor batches the work and executes it in parallel using Rayon:

let job_results: Vec<ExecutorResult<Literal>> = with_thread_pool(|| {
    pending_bindings
        .par_iter()
        .map(|bindings| self.evaluate_scalar_subquery_with_bindings(subquery, bindings))
        .collect()
});

This parallelization significantly reduces execution time for subquery-heavy queries.

Sources: llkv-executor/src/lib.rs:787-961

graph LR
    subgraph "SqlEngine"
        PARSE["Parse SQL\n(sqlparser)"]
PLAN["Build Plan\n(llkv-plan)"]
EXEC["Execute Plan"]
end
    
    subgraph "RuntimeEngine"
        CONTEXT["RuntimeContext"]
SESSION["RuntimeSession"]
CATALOG["CatalogManager"]
end
    
    subgraph "Executor Layer"
        QEXEC["QueryExecutor"]
PROVIDER["TableProvider"]
end
    
 
   PARSE --> PLAN
 
   PLAN --> EXEC
 
   EXEC --> CONTEXT
 
   CONTEXT --> SESSION
 
   CONTEXT --> CATALOG
    
 
   EXEC --> QEXEC
 
   QEXEC --> PROVIDER
 
   PROVIDER --> CATALOG

Integration with Runtime

The SqlEngine in llkv-sql orchestrates the entire execution pipeline, bridging SQL text to query results:

SqlEngine and Executor Integration

The RuntimeEngine provides the execution context, including transaction state, catalog access, and session configuration, while the QueryExecutor focuses solely on transforming plans into results.

Sources: llkv-sql/src/sql_engine.rs:572-745 llkv-executor/src/lib.rs:504-563


TablePlanner and TableExecutor

Relevant source files

Purpose and Scope

This page documents the table-level query execution interface provided by the Table struct and its optimization strategies. The table layer acts as a bridge between high-level query plans (covered in Query Planning) and low-level columnar storage operations (covered in Column Storage and ColumnStore). It orchestrates scan execution, manages field ID namespacing, handles MVCC visibility, and selects execution strategies based on query characteristics.

For details about the lower-level scan execution mechanics and streaming strategies, see Scan Execution and Optimization. For filter expression evaluation, see Filter Evaluation.

Architecture Overview

The Table struct serves as the primary interface for executing queries against table data. Rather than implementing query execution logic directly, it orchestrates lower-level components and applies table-specific concerns like schema management, MVCC column injection, and field ID translation.

graph TB
    subgraph "Table Layer"
        API["Table API\nscan_stream, filter_row_ids"]
SCHEMA["Schema Management\nField ID → LogicalFieldId"]
MVCC["MVCC Integration\ncreated_by, deleted_by"]
OPTIONS["Execution Options\nScanStreamOptions"]
end
    
    subgraph "Scan Execution Layer"
        EXECUTE["execute_scan()\nllkv-scan"]
STREAM["RowStreamBuilder\nBatch streaming"]
FILTER["Filter Evaluation\nPredicate matching"]
end
    
    subgraph "Storage Layer"
        STORE["ColumnStore\nPhysical storage"]
GATHER["Column Gathering\nRecordBatch assembly"]
end
    
 
   API --> SCHEMA
 
   API --> MVCC
 
   API --> OPTIONS
    
 
   SCHEMA --> EXECUTE
 
   MVCC --> EXECUTE
 
   OPTIONS --> EXECUTE
    
 
   EXECUTE --> STREAM
 
   STREAM --> FILTER
 
   FILTER --> GATHER
 
   GATHER --> STORE
    
    STORE -.results.-> GATHER
    GATHER -.batches.-> STREAM
    STREAM -.batches.-> API

Table Layer Responsibilities

Sources: llkv-table/src/table.rs:60-69 llkv-table/src/table.rs:447-488

Table Struct and Core Methods

The Table<P> struct wraps a ColumnStore with table-specific metadata and caching:

| Component | Type | Purpose |
| --- | --- | --- |
| store | Arc<ColumnStore<P>> | Physical columnar storage |
| table_id | TableId | Unique table identifier |
| mvcc_cache | RwLock<Option<MvccColumnCache>> | Cached MVCC column presence |

Primary execution methods:

  • scan_stream : Main scan interface accepting flexible projection types
  • scan_stream_with_exprs : Direct execution with resolved ScanProjection expressions
  • filter_row_ids : Collect row IDs matching a filter predicate
  • stream_columns : Stream raw column data without expression evaluation

Sources: llkv-table/src/table.rs:60-69 llkv-table/src/table.rs:447-462 llkv-table/src/table.rs:469-488 llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:589-645

Scan Execution Flow

High-Level Execution Pipeline

Sources: llkv-table/src/table.rs:447-488 llkv-table/examples/performance_benchmark.rs:1-211

Scan Configuration: ScanStreamOptions

ScanStreamOptions controls execution behavior:

FieldTypePurpose
include_nullsboolInclude rows where all projected columns are null
orderOption<ScanOrderSpec>Ordering specification (ASC/DESC)
row_id_filterOption<Arc<dyn RowIdFilter>>Pre-filtered row ID set (e.g., MVCC visibility)
include_row_idsboolInclude row_id column in output
rangesOption<Vec<(Bound<u64>, Bound<u64>)>>Row ID range restrictions
driving_columnOption<LogicalFieldId>Column to drive scan ordering

Sources: llkv-table/src/table.rs:43-46

Projection Types: ScanProjection

Projections specify what data to retrieve:

  • Column(ColumnProjectionInfo) : Direct column access with optional alias
    • logical_field_id: Column to read
    • data_type: Expected Arrow data type
    • output_name: Column name in result
  • Expression(ProjectionEval) : Computed expressions over columns
    • Arithmetic operations, functions, literals
    • Evaluated during result assembly

Conversion: A Projection (from llkv-column-map) can be converted to a ScanProjection via the Into trait.

Sources: llkv-table/src/table.rs:40-46

Optimization Strategies

Strategy Selection Logic

Sources: llkv-table/examples/performance_benchmark.rs:26-79 llkv-table/examples/test_streaming.rs:26-174

Fast Path: Direct Streaming

When conditions are met for the fast path, scan execution bypasses row ID collection and materializes chunks directly from ScanBuilder:

Eligibility criteria:

  1. Exactly one column projection
  2. Unbounded range filter (Bound::Unbounded on both ends)
  3. No null inclusion (include_nulls = false)
  4. No ordering requirements (order = None)
  5. No row ID filter (row_id_filter = None)

Performance characteristics:

  • 10-100x faster than standard path for large scans
  • Zero-copy chunk streaming when possible
  • Minimal memory overhead (streaming chunks, not full materialization)

Example usage:
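
A working invocation lives in llkv-table/examples/test_streaming.rs; the self-contained sketch below only illustrates the fast-path contract: a single column is streamed chunk by chunk to a callback, with no intermediate row ID collection or reordering. The function and types here are stand-ins, not the Table::scan_stream API.

```rust
// Illustrative stand-in for fast-path streaming: chunks flow straight
// through to the callback without materializing the full column.
fn scan_single_column(chunks: &[Vec<i64>], mut on_batch: impl FnMut(&[i64])) {
    for chunk in chunks {
        on_batch(chunk);
    }
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![4, 5]];
    let mut total_rows = 0;
    scan_single_column(&chunks, |batch| total_rows += batch.len());
    assert_eq!(total_rows, 5);
}
```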

Sources: llkv-table/examples/test_streaming.rs:65-110 llkv-table/examples/performance_benchmark.rs:145-151

Standard Path: Row ID Collection and Gathering

When fast path conditions are not met, execution uses a two-phase approach:

Phase 1: Row ID Collection

  • Evaluate filter expression against the table
  • Build a bitmap (Treemap) of matching row IDs
  • Filter by MVCC visibility if row_id_filter is provided
  • Apply range restrictions if ranges is specified

Phase 2: Column Gathering

  • Split row IDs into fixed-size windows
  • For each window:
    • Gather all projected columns
    • Evaluate computed expressions
    • Assemble into RecordBatch
    • Stream batch to callback

Batch size: Controlled by STREAM_BATCH_ROWS constant (default varies by workload).

Sources: llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:589-645

Performance Comparison: Layer Overhead

ColumnStore Direct Access vs Table Layer

Measured overhead (1M row single-column scan):

  • Direct ColumnStore : ~5-10ms baseline
  • Table Layer (fast path) : ~10-20ms (1.5-2x overhead)
  • Table Layer (standard path) : ~50-100ms (5-10x overhead)

The fast path optimization significantly reduces the gap by bypassing row ID collection and expression evaluation when not needed.

Sources: llkv-table/examples/direct_comparison.rs:276-399

Row ID Filtering and Collection

RowIdScanCollector

Internal visitor for collecting row IDs that match filter predicates:

Implementation strategy:

  • Implements PrimitiveWithRowIdsVisitor and PrimitiveSortedWithRowIdsVisitor
  • Accumulates row IDs into a Treemap (compressed bitmap)
  • Ignores actual column values (focus on row IDs only)
  • Handles both chunked and sorted run formats

Key methods:

  • extend_from_array: Add row IDs from a chunk
  • extend_from_slice: Add a slice of row IDs
  • into_inner: Extract the final bitmap

Sources: llkv-table/src/table.rs:805-858

RowIdChunkEmitter

Streaming alternative that emits row IDs in fixed-size chunks without materializing the full set:

Use case: Memory-efficient processing when full row ID set is not needed.

Features:

  • Configurable chunk size
  • Optional reverse ordering for sorted runs
  • Error propagation via error field
  • Zero allocation when chunks align with storage chunks

Methods:

  • extend_from_array: Process a chunk of row IDs
  • extend_sorted_run: Handle sorted run with optional reversal
  • flush: Emit current buffer
  • finish: Final flush and error check

Sources: llkv-table/src/table.rs:860-1017

Integration with MVCC and Transactions

The table layer handles MVCC visibility transparently:

MVCC Column Injection

When appending data, the Table::append method automatically injects MVCC columns if not present:

Columns added:

  • created_by (UInt64): Transaction ID that created the row (defaults to 1 for auto-commit)
  • deleted_by (UInt64): Transaction ID that deleted the row (defaults to 0 for not deleted)

Field ID assignment:

  • MVCC columns get reserved logical field IDs via LogicalFieldId::for_mvcc_created_by and LogicalFieldId::for_mvcc_deleted_by
  • Separate from user field IDs to avoid collisions

Caching: The MvccColumnCache stores whether MVCC columns exist to avoid repeated schema inspections.

Sources: llkv-table/src/table.rs:50-56 llkv-table/src/table.rs:231-438

Transaction Visibility Filtering

Query execution can filter by transaction visibility using the row_id_filter option:

The filter is applied during row ID collection phase, ensuring only visible rows are included in results.

Sources: llkv-table/src/table.rs:43-46

graph LR
    USER["User Field ID\n(FieldId: u32)"]
META["Field Metadata\n'field_id' key"]
TABLE["Table ID\n(TableId: u32)"]
LOGICAL["LogicalFieldId\n(u64)"]
USER --> COMPOSE
 
   TABLE --> COMPOSE
 
   COMPOSE["LogicalFieldId::for_user()"] --> LOGICAL
    
    META -.annotates.-> USER

Field ID Translation and Namespacing

The table layer translates user-visible field IDs to logical field IDs that include table ID:

Translation Process

Why namespacing?

  • Multiple tables can have the same user field IDs
  • ColumnStore operates on LogicalFieldId to avoid collisions
  • Table layer encapsulates this translation

Key functions:

  • LogicalFieldId::for_user(table_id, field_id): Create user data field
  • LogicalFieldId::for_mvcc_created_by(table_id): Create MVCC created_by field
  • LogicalFieldId::for_mvcc_deleted_by(table_id): Create MVCC deleted_by field
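
Conceptually, the translation packs both identifiers into one 64-bit value. The sketch below assumes a simple high/low bit split, which is an illustration only; the actual LogicalFieldId encoding in llkv-types differs and also distinguishes MVCC columns from user columns.

```rust
// Hypothetical packing scheme for illustration: table ID in the high 32 bits,
// user field ID in the low 32 bits. The real LogicalFieldId layout may differ.
fn logical_for_user(table_id: u32, field_id: u32) -> u64 {
    ((table_id as u64) << 32) | field_id as u64
}

fn main() {
    // The same user field ID in two different tables maps to two distinct
    // logical field IDs, so ColumnStore never sees a collision.
    assert_ne!(logical_for_user(1, 7), logical_for_user(2, 7));
    assert_eq!(logical_for_user(1, 7), 0x0000_0001_0000_0007);
}
```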

Sources: llkv-table/src/table.rs:231-438 llkv-table/src/table.rs:589-645

Schema Management

Schema Construction

The Table::schema() method builds an Arrow schema from catalog metadata:

Process:

  1. Query ColumnStore for all logical fields belonging to this table
  2. Sort fields by field ID for consistent ordering
  3. Retrieve column names from SysCatalog
  4. Build Arrow Field with data type and metadata
  5. Add row_id field first, then user fields

Schema metadata: Each field includes field_id metadata for round-trip compatibility.

Sources: llkv-table/src/table.rs:519-549

Schema as RecordBatch

For display purposes, schema_recordbatch() formats the schema as a table:

| Column | Type | Contents |
| --- | --- | --- |
| name | Utf8 | Column name |
| field_id | UInt32 | User field ID |
| data_type | Utf8 | Arrow data type string |

Sources: llkv-table/src/table.rs:554-586

Example: CSV Export Integration

The CSV export writer demonstrates table-level query execution in practice:

Export Pipeline

Key steps:

  1. Convert CsvExportColumn to ScanProjection with field ID translation
  2. Resolve aliases from catalog or use defaults
  3. Create unbounded filter for full table scan
  4. Stream batches directly to Arrow CSV writer
  5. No intermediate materialization

Sources: llkv-csv/src/writer.rs:139-268

Summary: Optimization Decision Matrix

| Query Characteristic | Fast Path | Standard Path |
| --- | --- | --- |
| Single column projection | ✓ | |
| Multiple columns | | ✓ |
| Unbounded filter | ✓ | |
| Bounded/complex filter | | ✓ |
| Include nulls | | ✓ |
| Exclude nulls | ✓ | |
| No ordering | ✓ | |
| Ordered results | | ✓ |
| No MVCC filter | ✓ | |
| MVCC filtering | | ✓ |
| No range restrictions | ✓ | |
| Range restrictions | | ✓ |

Performance impact:

  • Fast path: 1.5-2x slower than direct ColumnStore access
  • Standard path: 5-10x slower than direct ColumnStore access
  • Still significantly faster than alternatives due to columnar storage and zero-copy optimizations

Sources: llkv-table/examples/performance_benchmark.rs:1-211 llkv-table/examples/direct_comparison.rs:276-399


Scan Execution and Optimization

Relevant source files

Purpose and Scope

This document details the table scan execution flow in LLKV, covering how queries retrieve data from columnar storage through optimized fast paths and streaming strategies. The scan subsystem bridges query planning and physical storage, coordinating row ID collection, data gathering, and result materialization while minimizing memory allocations.

For query planning that produces scan operations, see TablePlanner and TableExecutor. For predicate evaluation during filtering, see Filter Evaluation. For the underlying column storage mechanisms, see Column Storage and ColumnStore.

Scan Entry Points

The Table struct provides two primary scan entry points that execute projections with filtering:

| Method | Signature | Purpose |
| --- | --- | --- |
| scan_stream | &self, projections: I, filter_expr: &Expr, options: ScanStreamOptions, on_batch: F | Accepts flexible projection inputs, converts to ScanProjection |
| scan_stream_with_exprs | &self, projections: &[ScanProjection], filter_expr: &Expr, options: ScanStreamOptions, on_batch: F | Direct scan with pre-built ScanProjection values |

Both methods stream results as Arrow RecordBatches to a callback function, avoiding full materialization in memory. The ScanStreamOptions struct controls scan behavior:

The on_batch callback receives each batch as it is produced, allowing streaming processing without memory accumulation.

Sources: llkv-table/src/table.rs:447-488 llkv-scan/src/lib.rs

Two-Phase Scan Execution Model

Two-Phase Architecture

LLKV scans execute in two distinct phases to minimize data movement:

Phase 1: Row ID Collection - Evaluates filter predicates against stored data to produce a set of matching row IDs. This phase:

  • Visits column chunks using the visitor pattern
  • Applies predicates via metadata-based pruning (min/max values)
  • Collects matching row IDs into a Treemap (bitmap) or Vec<RowId>
  • Avoids loading actual column values during filtering

Phase 2: Data Gathering - Retrieves column values for the matching row IDs identified in Phase 1. This phase:

  • Loads only chunks containing matched rows
  • Assembles Arrow arrays via type-specific gathering functions
  • Streams results in fixed-size batches (default 4096 rows)
  • Supports two execution paths: fast streaming or full materialization

This separation ensures predicates are evaluated without loading unnecessary column data, significantly reducing I/O and memory when filters are selective.

Sources: llkv-scan/src/execute.rs llkv-table/src/table.rs:479-488 llkv-scan/src/row_stream.rs

Fast Path Optimizations

The scan executor detects specific patterns that enable zero-copy streaming optimizations:

Fast Path Criteria

graph TD
    ScanRequest["Scan Request"]
CheckCols{"Single\nProjection?"}
CheckFilter{"Unbounded\nFilter?"}
CheckNulls{"include_nulls\n= false?"}
CheckOrder{"No ORDER BY?"}
CheckRowFilter{"No row_id_filter?"}
FastPath["✓ Fast Path\nDirect ScanBuilder streaming"]
SlowPath["✗ Materialization Path\nMultiGatherContext gathering"]
ScanRequest --> CheckCols
 
   CheckCols -->|Yes| CheckFilter
 
   CheckCols -->|No| SlowPath
    
 
   CheckFilter -->|Yes| CheckNulls
 
   CheckFilter -->|No| SlowPath
    
 
   CheckNulls -->|Yes| CheckOrder
 
   CheckNulls -->|No| SlowPath
    
 
   CheckOrder -->|Yes| CheckRowFilter
 
   CheckOrder -->|No| SlowPath
    
 
   CheckRowFilter -->|Yes| FastPath
 
   CheckRowFilter -->|No| SlowPath

The fast path activates when all conditions are met:

| Condition | Why Required |
| --- | --- |
| Single projection | Multi-column requires coordination across field plans |
| Unbounded filter | Complex filters require row-by-row evaluation |
| include_nulls = false | Null handling requires inspection of each row |
| No ORDER BY | Sorting requires full materialization |
| No row_id_filter | Custom filters require additional filtering logic |

Performance Characteristics

When the fast path activates, scans execute directly via ScanBuilder against the ColumnStore, achieving throughputs exceeding 100M rows/second for simple numeric columns. The materialization path adds 2-5x overhead due to:

  • Context preparation and cache management
  • Chunk coordinate computation across multiple fields
  • Arrow array assembly and concatenation

Test examples demonstrate the performance difference:

  • Single-column unbounded scan: ~1-2ms for 1M rows
  • Multi-column scan: ~5-10ms for 1M rows
  • Complex filters: ~10-20ms for 1M rows

Sources: llkv-scan/src/execute.rs llkv-table/examples/test_streaming.rs:83-110 llkv-table/examples/performance_benchmark.rs:126-151

Row ID Collection Infrastructure

RowIdScanCollector

The RowIdScanCollector accumulates matching row IDs into a Treemap bitmap during Phase 1 filtering.

It implements multiple visitor traits to handle different scan modes:

  • PrimitiveVisitor - Ignores value chunks (row IDs not yet available)
  • PrimitiveSortedVisitor - Ignores sorted runs
  • PrimitiveWithRowIdsVisitor - Collects row IDs from chunks with row ID arrays
  • PrimitiveSortedWithRowIdsVisitor - Collects row IDs from sorted runs

The collector extracts row IDs from UInt64Array instances passed alongside value arrays, adding them to the bitmap for efficient deduplication and range queries.

Sources: llkv-table/src/table.rs:805-858

RowIdChunkEmitter

For streaming scenarios, RowIdChunkEmitter emits row IDs in fixed-size chunks to a callback rather than accumulating them.

This approach:

  • Minimizes memory usage by processing row IDs in batches
  • Enables early termination on errors
  • Supports reverse iteration for descending scans
  • Optimizes for cases where row IDs arrive in contiguous chunks

The emitter implements the same visitor traits as the collector but invokes the callback when the buffer reaches chunk_size, then clears the buffer for the next batch.

Sources: llkv-table/src/table.rs:860-1017

Visitor Pattern Integration

Both collectors integrate with the column-map scan infrastructure via visitor traits. During a scan:

  1. ScanBuilder iterates over column chunks
  2. For each chunk, it detects whether row IDs are available
  3. It invokes the appropriate visitor method based on chunk characteristics:
    • _chunk_with_rids for unsorted chunks with row IDs
    • _run_with_rids for sorted runs with row IDs
  4. The collector/emitter extracts row IDs and accumulates/emits them

This pattern decouples row ID collection from the storage format, allowing specialized handling for sorted vs. unsorted data.

Sources: llkv-column-map/src/scan/mod.rs llkv-table/src/table.rs:832-858

classDiagram
    class MultiGatherContext {+field_infos: FieldInfos\n+plans: FieldPlans\n+chunk_cache: FxHashMap~PhysicalKey, ArrayRef~\n+row_index: FxHashMap~u64, usize~\n+row_scratch: Vec~Option~\n+builders: Vec~ColumnOutputBuilder~\n+epoch: u64\n+new() MultiGatherContext\n+update_field_infos_and_plans()\n+matches_field_ids() bool\n+schema_for_nullability() Schema\n+chunk_span_for_row() Option}
    
    class FieldPlan {+dtype: DataType\n+value_metas: Vec~ChunkMetadata~\n+row_metas: Vec~ChunkMetadata~\n+candidate_indices: Vec~usize~\n+avg_value_bytes_per_row: f64}
    
    class ColumnOutputBuilder {<<enum>>\nUtf8\nBinary\nBoolean\nDecimal128\nPrimitive\nPassthrough}
    
    class GatherContextPool {+inner: Mutex~FxHashMap~\n+max_per_key: usize\n+acquire() GatherContextGuard}
    
    MultiGatherContext --> FieldPlan: contains
    MultiGatherContext --> ColumnOutputBuilder: contains
    GatherContextPool --> MultiGatherContext: pools

Data Gathering Infrastructure

MultiGatherContext

The MultiGatherContext coordinates multi-column data gathering during Phase 2:

Context Fields

| Field | Purpose |
| --- | --- |
| field_infos | Maps LogicalFieldId to DataType for each projected column |
| plans | Contains FieldPlan with chunk metadata for each column |
| chunk_cache | Caches decoded Arrow arrays to avoid redundant deserialization |
| row_index | Maps row IDs to output indices during gathering |
| row_scratch | Scratch buffer mapping output positions to chunk coordinates |
| builders | Type-specific Arrow array builders for each output column |
| epoch | Tracks storage version for cache invalidation |

Gathering Flow

  1. Context preparation: Load column descriptors and chunk metadata for all fields
  2. Candidate selection: Identify chunks containing requested row IDs based on min/max values
  3. Chunk loading: Batch-fetch identified chunks from storage
  4. Row mapping: Build index from row ID to output position
  5. Value extraction: For each output position, locate value in cached chunk and append to builder
  6. Array finalization: Convert builders to Arrow arrays and assemble RecordBatch

Sources: llkv-column-map/src/store/projection.rs:460-649

GatherContextPool

The GatherContextPool reuses MultiGatherContext instances across scans to amortize setup costs:

Pooling Strategy

  • Contexts are keyed by the list of projected LogicalFieldId values
  • Up to max_per_key contexts are retained per projection set (default: 4)
  • Contexts are reset (chunk cache cleared) when returned to the pool
  • Epoch tracking invalidates cached chunk arrays when storage changes

When a scan requests a context:

  1. Pool checks for matching contexts by field ID list
  2. If found, pops a context from the pool
  3. If not found or pool empty, creates a new context
  4. Context is wrapped in GatherContextGuard for automatic return to pool

This pattern is critical for high-frequency scans where context setup (descriptor loading, metadata parsing) dominates execution time. Pooling reduces per-scan overhead from ~10ms to <1ms for repeated projections.

Sources: llkv-column-map/src/store/projection.rs:651-720

Type-Specific Gathering Functions

LLKV provides specialized gathering functions for each Arrow type to optimize performance:

| Function | Arrow Type | Optimization |
| --- | --- | --- |
| gather_rows_from_chunks_string | GenericStringArray<O> | Direct slice reuse for contiguous sorted rows |
| gather_rows_from_chunks_binary | GenericBinaryArray<O> | Same as string, with binary validation |
| gather_rows_from_chunks_bool | BooleanArray | Bitmap-based gathering |
| gather_rows_from_chunks_decimal128 | Decimal128Array | Precision-preserving assembly |
| gather_rows_from_chunks_struct | StructArray | Recursive gathering per child field |
| gather_rows_from_chunks | PrimitiveArray<T> | Fast path for single-chunk contiguous access |

Each function follows the same overall pattern: detect whether a zero-copy fast path applies, and otherwise gather values through a type-specific builder.

Fast Path Detection

All gathering functions check for a fast path where:

  • Only one chunk contains the requested row IDs
  • Row IDs are sorted ascending
  • Row IDs form a contiguous range in the chunk

When detected, the function returns a slice of the cached chunk array via Arrow’s zero-copy slice() method, avoiding builder allocation and value copying entirely.
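
A simplified sketch of that check (the helper name and coordinate handling are illustrative; the real functions operate on chunk-local positions computed during planning):

```rust
use arrow::array::{Array, ArrayRef};

// Illustrative sketch: if the requested positions form a sorted, contiguous
// range within a single cached chunk, return a zero-copy slice instead of
// copying values through a builder.
fn try_contiguous_slice(chunk: &ArrayRef, positions: &[usize]) -> Option<ArrayRef> {
    let first = *positions.first()?;
    let contiguous = positions
        .iter()
        .enumerate()
        .all(|(i, &pos)| pos == first + i);
    if contiguous && first + positions.len() <= chunk.len() {
        // Arrow's slice() shares the underlying buffers; no values are copied.
        Some(chunk.slice(first, positions.len()))
    } else {
        None
    }
}
```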

Sources: llkv-column-map/src/gather.rs:283-403 llkv-column-map/src/gather.rs:405-526 llkv-column-map/src/gather.rs:529-638

flowchart LR
    RowIds["Row ID Source\nBitmap or Vec"]
Chunk1["Batch 1\n0..4096"]
Chunk2["Batch 2\n4096..8192"]
ChunkN["Batch N\n(N-1)*4096..M"]
Gather1["gather_rows()"]
Gather2["gather_rows()"]
GatherN["gather_rows()"]
Batch1["RecordBatch 1"]
Batch2["RecordBatch 2"]
BatchN["RecordBatch N"]
Callback["on_batch\ncallback"]
RowIds -->|split into chunks| Chunk1
 
   RowIds --> Chunk2
 
   RowIds --> ChunkN
    
 
   Chunk1 --> Gather1
 
   Chunk2 --> Gather2
 
   ChunkN --> GatherN
    
 
   Gather1 --> Batch1
 
   Gather2 --> Batch2
 
   GatherN --> BatchN
    
 
   Batch1 --> Callback
 
   Batch2 --> Callback
 
   BatchN --> Callback

Streaming Strategies

Batch-Based Streaming

LLKV streams scan results in fixed-size batches to balance memory usage and callback overhead:

Batch Size Configuration

The default batch size is 4096 rows, defined in llkv-table/src/constants.rs:
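The declaration is reproduced here only as a sketch; the exact form in the source may differ, but the STREAM_BATCH_ROWS name is the one used later on this page.

```rust
/// Number of rows emitted per streamed RecordBatch.
pub const STREAM_BATCH_ROWS: usize = 4096;
```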

This value balances:

  • Memory footprint : Smaller batches reduce peak memory but increase callback overhead
  • Allocation efficiency : Builders pre-allocate for batch size, minimizing reallocation
  • Cache locality : Larger batches improve CPU cache utilization during gathering

Streaming Execution

The RowStreamBuilder manages batch-based streaming:

  1. Accept a RowIdSource (bitmap or vector)
  2. Split row IDs into windows of STREAM_BATCH_ROWS
  3. For each window:
    • Call gather_rows_with_reusable_context() with the window’s row IDs
    • Invoke the on_batch callback with the resulting RecordBatch
    • Reuse the MultiGatherContext for the next window

This approach ensures constant memory usage regardless of result set size, as only one batch exists in memory at a time.
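
A condensed sketch of that loop (signatures simplified; the real builder also threads the projections, null policy, and error handling through):

```rust
use arrow::record_batch::RecordBatch;

const STREAM_BATCH_ROWS: usize = 4096;

// Hedged sketch: split the Phase 1 row IDs into fixed-size windows and gather
// each window into a RecordBatch. The `gather` closure stands in for
// gather_rows_with_reusable_context(), which reuses one MultiGatherContext
// (chunk cache, scratch buffers) across all windows.
fn stream_in_windows(
    row_ids: &[u64],
    mut gather: impl FnMut(&[u64]) -> RecordBatch,
    mut on_batch: impl FnMut(RecordBatch),
) {
    for window in row_ids.chunks(STREAM_BATCH_ROWS) {
        let batch = gather(window);
        on_batch(batch); // only one batch is alive at a time
    }
}
```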

Sources: llkv-table/src/constants.rs llkv-scan/src/row_stream.rs llkv-table/src/table.rs:588-645

Memory Efficiency

The streaming architecture minimizes memory allocations through several mechanisms:

Builder Reuse - Arrow array builders are pre-allocated to the batch size and reused across batches. Builders track their capacity and avoid reallocation when the next batch has the same or smaller size.

Context Pooling - The GatherContextPool retains prepared contexts, eliminating the need to reload descriptors and rebuild field plans for repeated projections.

Chunk Caching - Within a context, decoded Arrow arrays are cached by physical key. Chunks that span multiple batches are deserialized once and reused, reducing I/O and deserialization overhead.

Scratch Buffer Reuse - The row_scratch buffer that maps output indices to chunk coordinates is allocated once per context and cleared (not deallocated) between batches.

Visitor State Management - Row ID collectors and emitters maintain minimal state, typically just a bitmap or small buffer, avoiding large intermediate structures.

These optimizations enable LLKV to sustain multi-million row/second scan rates with sub-100MB memory footprints even for billion-row tables.

Sources: llkv-column-map/src/store/projection.rs:562-568 llkv-column-map/src/store/projection.rs:709-720

graph TB
    subgraph "llkv-table layer"
        ScanStream["Table::scan_stream()"]
end
    
    subgraph "llkv-scan layer"
        ExecuteScan["execute_scan()"]
CheckFast{"Can Use\nFast Path?"}
FastLogic["build_fast_path_stream()"]
SlowLogic["collect_row_ids_for_table()\nthen build_row_stream()"]
FastStream["Direct ScanBuilder\nstreaming"]
SlowStream["RowStreamBuilder\nwith MultiGatherContext"]
end
    
    subgraph "llkv-column-map layer"
        ScanBuilder["ScanBuilder::run()"]
ColumnStore["ColumnStore::gather_rows()"]
end
    
    Callback["User-provided\non_batch callback"]
ScanStream --> ExecuteScan
    
 
   ExecuteScan --> CheckFast
    
 
   CheckFast -->|Yes| FastLogic
 
   CheckFast -->|No| SlowLogic
    
 
   FastLogic --> FastStream
 
   SlowLogic --> SlowStream
    
 
   FastStream --> ScanBuilder
 
   SlowStream --> ColumnStore
    
 
   ScanBuilder --> Callback
 
   ColumnStore --> Callback

Integration with llkv-scan

The llkv-scan crate provides the execute_scan function that orchestrates full scan execution: it evaluates the fast-path criteria, collects row IDs when materialization is required, and drives either direct ScanBuilder streaming or RowStreamBuilder gathering, as shown in the diagram above.

execute_scan Function Signature

ScanStorage Trait

The ScanStorage trait abstracts over storage implementations, allowing execute_scan to work with both Table and direct ColumnStore access:
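
An abridged sketch of the trait, transcribed from the method summary in the Table Abstraction page (generics, the null-handling policy argument, and several helpers are omitted; exact signatures may differ):

```rust
// Hedged sketch: method names follow the documented trait summary only.
pub trait ScanStorage {
    fn table_id(&self) -> TableId;
    fn field_data_type(&self, lfid: LogicalFieldId) -> Result<DataType>;
    fn total_rows(&self) -> Result<u64>;
    /// Gather one column's values for the given row IDs (null policy elided).
    fn gather_column(&self, lfid: LogicalFieldId, row_ids: &[u64]) -> Result<ArrayRef>;
    /// Evaluate a predicate against one column, returning matching row IDs.
    /// (The predicate type is simplified here.)
    fn filter_matches(&self, lfid: LogicalFieldId, predicate: &Expr<FieldId>) -> Result<Treemap>;
    // ...row-id streaming and other helpers elided
}
```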

This abstraction enables unit testing of scan logic without requiring full table setup and allows specialized storage implementations to optimize row ID collection.

Sources: llkv-scan/src/execute.rs llkv-scan/src/storage.rs llkv-table/src/table.rs:479-488

Performance Monitoring

Benchmarking Scan Performance

The repository includes examples that measure scan performance across different scenarios:

| Example | Purpose | Key Measurement |
| --- | --- | --- |
| test_streaming.rs | Verify streaming optimization activates | Throughput (rows/sec) |
| direct_comparison.rs | Compare ColumnStore vs Table layer overhead | Execution time ratio |
| performance_benchmark.rs | Profile different scan configurations | Latency distribution |

Representative Results (1M row table, Int64 column):

| Scenario | Throughput | Notes |
| --- | --- | --- |
| Single column, unbounded | 100-200M rows/sec | Fast path active |
| Multi-column | 20-50M rows/sec | Materialization path |
| Bounded filter | 10-30M rows/sec | Predicate evaluation overhead |
| With nulls | 15-40M rows/sec | Null inspection required |

These benchmarks guide optimization efforts and validate that changes maintain performance characteristics.

Sources: llkv-table/examples/test_streaming.rs llkv-table/examples/direct_comparison.rs:276-399 llkv-table/examples/performance_benchmark.rs:82-211

Pager I/O Instrumentation

The InstrumentedPager wrapper tracks storage operations during scans:

Tracked Metrics

  • physical_gets: Total number of key-value reads
  • get_batches: Number of batch read operations
  • physical_puts: Total number of key-value writes
  • put_batches: Number of batch write operations

For a SELECT COUNT(*) on a 3-row table, typical I/O patterns show:

  • ~36 physical gets for descriptor and chunk loads
  • ~23 batch operations due to pipelined chunk fetching

This instrumentation identifies I/O hotspots and validates that scan optimizations reduce storage access.

Sources: llkv-sql/tests/pager_io_tests.rs:17-73 llkv-storage/src/pager/instrumented.rs


Filter Evaluation

Relevant source files

Purpose and Scope

This page documents how filter expressions are evaluated against table rows to produce sets of matching row IDs. Filter evaluation is a critical component of query execution that bridges the query planning layer (see Query Planning) and the physical data retrieval operations (see Scan Execution and Optimization).

The system evaluates filters in two primary contexts:

  • Row ID collection : Applying predicates to columns to determine which rows satisfy conditions
  • MVCC filtering : Applying transaction visibility rules to determine which row versions are visible

For information about the expression AST and predicate structures used during filtering, see Expression System.


Filter Evaluation Pipeline

Filter evaluation flows from the table abstraction down through the column store to type-specific visitor implementations:

Sources: llkv-table/src/table.rs:490-496 llkv-column-map/src/store/core.rs:356-372 llkv-column-map/src/store/scan/filter.rs:209-298

flowchart TB
    TableFilter["Table::filter_row_ids()"]
CollectRowIds["collect_row_ids_for_table()"]
CompileExpr["Expression Compilation"]
EvalPred["Predicate Evaluation"]
StoreFilter["ColumnStore::filter_row_ids()"]
Dispatch["FilterDispatch::run_filter()"]
PrimitiveFilter["FilterPrimitive::run_filter()"]
Visitor["FilterVisitor<T, F>"]
DescLoad["Load ColumnDescriptor"]
ChunkLoop["For each ChunkMetadata"]
MetaPrune["Check min/max pruning"]
LoadChunk["Load chunk arrays"]
EvalChunk["Evaluate predicate"]
CollectMatches["Collect matching row IDs"]
ReturnBitmap["Return Treemap bitmap"]
TableFilter --> CollectRowIds
 
   CollectRowIds --> CompileExpr
 
   CompileExpr --> EvalPred
 
   EvalPred --> StoreFilter
    
 
   StoreFilter --> Dispatch
 
   Dispatch --> PrimitiveFilter
 
   PrimitiveFilter --> Visitor
    
 
   Visitor --> DescLoad
 
   DescLoad --> ChunkLoop
 
   ChunkLoop --> MetaPrune
 
   MetaPrune -->|Skip chunk| ChunkLoop
 
   MetaPrune -->|May match| LoadChunk
 
   LoadChunk --> EvalChunk
 
   EvalChunk --> CollectMatches
 
   CollectMatches --> ChunkLoop
    
 
   ChunkLoop -->|Done| ReturnBitmap

Table-Level Filter Entry Points

The Table struct provides the primary interface for filter evaluation at the table abstraction layer:

graph TB
    subgraph "Table Layer"
        FilterRowIds["filter_row_ids(&Expr<FieldId>)"]
CollectRowIds["collect_row_ids_for_table()"]
CompileFilter["FilterCompiler::compile()"]
end
    
    subgraph "Compilation"
        TranslateExpr["ExprTranslator::translate()"]
BuildPredicate["build_*_predicate()"]
Predicate["Predicate<T::Value>"]
end
    
    subgraph "ColumnStore Layer"
        StoreFilterRowIds["filter_row_ids<T>()"]
FilterMatchesResult["filter_matches<T, F>()"]
end
    
 
   FilterRowIds --> CollectRowIds
 
   CollectRowIds --> CompileFilter
 
   CompileFilter --> TranslateExpr
 
   TranslateExpr --> BuildPredicate
 
   BuildPredicate --> Predicate
    
 
   Predicate --> StoreFilterRowIds
 
   Predicate --> FilterMatchesResult
    
 
   StoreFilterRowIds --> |Vec<u64>| FilterRowIds
 
   FilterMatchesResult --> |FilterResult| CollectRowIds

The filter_row_ids() method at llkv-table/src/table.rs:490-496 converts an expression tree into a bitmap of matching row IDs. It delegates to collect_row_ids_for_table() which compiles the expression and evaluates predicates against the column store.

Key responsibilities:

  • Expression translation : Convert Expr<FieldId> to Expr<LogicalFieldId>
  • Predicate compilation : Transform operators into typed Predicate structures
  • MVCC integration : Filter results by transaction visibility
  • Bitmap aggregation : Combine multiple predicate results using set operations

Sources: llkv-table/src/table.rs:490-496 llkv-table/src/table.rs:851-1100


Filter Dispatch System

The filter dispatch system provides type-specific filter implementations through the FilterDispatch trait hierarchy:

Implementation Strategy:

classDiagram
    class FilterDispatch {<<trait>>\n+type Value\n+run_filter() Vec~u64~\n+run_fused() Vec~u64~}
    
    class FilterPrimitive {<<trait>>\n+type Native\n+run_filter() Vec~u64~\n+run_all() Vec~u64~\n+run_filter_with_result() FilterResult}
    
    class Utf8Filter~O~ {+run_filter() Vec~u64~\n+run_fused() Vec~u64~}
    
    class UInt64Type
    class Int64Type
    class Float64Type
    class Date32Type
    class StringTypes
    
    FilterDispatch <|-- FilterPrimitive : implements
    FilterDispatch <|-- Utf8Filter : implements
    
    FilterPrimitive <|-- UInt64Type : specializes
    FilterPrimitive <|-- Int64Type : specializes
    FilterPrimitive <|-- Float64Type : specializes
    FilterPrimitive <|-- Date32Type : specializes
    
    Utf8Filter --> StringTypes : handles

The FilterDispatch trait at llkv-column-map/src/store/scan/filter.rs:209-273 defines the interface for type-specific filtering:
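
An abridged sketch of the trait shape implied by the class diagram above (argument lists are elided; the real trait threads the store, descriptor, and predicate types through its generics):

```rust
// Hedged sketch only: method names match the diagram, signatures do not.
pub trait FilterDispatch {
    type Value;

    /// Evaluate a single predicate over a column and return matching row IDs.
    fn run_filter(/* store, field, predicate, ... */) -> Vec<u64>;

    /// Evaluate several predicates in one pass (used by the Utf8Filter path).
    fn run_fused(/* store, field, predicates, ... */) -> Vec<u64>;
}
```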

Primitive types implement FilterPrimitive which provides the default FilterDispatch implementation at llkv-column-map/src/store/scan/filter.rs:275-298 This handles numeric types, booleans, and dates using the visitor pattern.

String types use the specialized Utf8Filter implementation at llkv-column-map/src/store/scan/filter.rs:307-504 which supports vectorized operations like contains and fused multi-predicate evaluation.

Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:307-504


Visitor Pattern for Chunk Traversal

Filter evaluation uses the visitor pattern to traverse chunks efficiently:

The FilterVisitor<T, F> struct at llkv-column-map/src/store/scan/filter.rs:506-591 implements all visitor traits to handle different chunk formats:

  • Unsorted chunks : Processes each value individually
  • Sorted chunks : Can exploit ordering for early termination
  • With row IDs : Matches values to explicit row identifiers
  • Sorted with row IDs : Combines both optimizations

The visitor maintains internal state to build a FilterResult:

| Field | Type | Purpose |
| --- | --- | --- |
| predicate | F: FnMut(T::Native) -> bool | Predicate closure to evaluate |
| runs | Vec<FilterRun> | Run-length encoded matches |
| fallback_row_ids | Option<Vec<u64>> | Sparse row ID list |
| prev_row_id | Option<u64> | Last seen row ID for run detection |
| total_matches | usize | Count of matching rows |

Sources: llkv-column-map/src/store/scan/filter.rs:506-648 llkv-column-map/src/store/scan/filter.rs:692-771


flowchart LR
    LoadDesc["Load ColumnDescriptor"]
IterChunks["Iterate ChunkMetadata"]
CheckMin["Check min_val_u64"]
CheckMax["Check max_val_u64"]
PruneDecision{{"Can prune?"}}
SkipChunk["Skip chunk"]
LoadAndEval["Load & evaluate chunk"]
LoadDesc --> IterChunks
 
   IterChunks --> CheckMin
 
   CheckMin --> CheckMax
 
   CheckMax --> PruneDecision
    
 
   PruneDecision -->|Yes: predicate range doesn't overlap| SkipChunk
 
   PruneDecision -->|No: may contain matching values| LoadAndEval
    
 
   SkipChunk --> IterChunks
 
   LoadAndEval --> IterChunks

Chunk Metadata Pruning

The filter evaluation pipeline exploits chunk metadata to skip irrelevant data:

The ChunkMetadata structure stores summary statistics for each chunk:

| Field | Type | Purpose |
| --- | --- | --- |
| min_val_u64 | u64 | Minimum value in chunk (for numerics) |
| max_val_u64 | u64 | Maximum value in chunk (for numerics) |
| row_count | u32 | Number of rows in chunk |
| chunk_pk | PhysicalKey | Key for chunk data |
| value_order_perm_pk | PhysicalKey | Key for sort permutation |

Pruning logic at llkv-column-map/src/store/core.rs:679-690:
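A reconstructed sketch of the test (not the exact source): a chunk is skipped when its stored [min, max] range cannot overlap the predicate's range.

```rust
// Hedged sketch: `None` bounds mean the predicate is unbounded on that side.
fn chunk_may_match(
    chunk_min: u64,
    chunk_max: u64,
    pred_min: Option<u64>,
    pred_max: Option<u64>,
) -> bool {
    let entirely_above = pred_max.map_or(false, |hi| chunk_min > hi);
    let entirely_below = pred_min.map_or(false, |lo| chunk_max < lo);
    !(entirely_above || entirely_below)
}
```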

This optimization is particularly effective for:

  • Range queries : WHERE col BETWEEN x AND y
  • Equality predicates : WHERE col = value
  • Sorted data : Natural clustering improves pruning

Sources: llkv-column-map/src/store/core.rs:679-690 llkv-column-map/src/store/descriptor.rs:40-70


String Filtering with Predicate Fusion

String filtering receives special optimization through the Utf8Filter implementation which supports fused multi-predicate evaluation:

Key optimization techniques:

flowchart TB
    RunFused["Utf8Filter::run_fused()"]
SeparatePreds["Separate predicates"]
ContainsPreds["Contains predicates"]
OtherPreds["Other predicates"]
LoadChunks["Load value & row_id chunks"]
InitBitmask["Initialize BitMask\n(all bits = 1)"]
FilterNulls["AND with null bitmask"]
LoopContains["For each contains pattern"]
VectorizedContains["Arrow::contains_utf8_scalar()"]
AndMask["AND result into bitmask"]
LoopOther["For each remaining bit"]
EvalOther["Evaluate other predicates"]
CollectRowIds["Collect matching row IDs"]
RunFused --> SeparatePreds
 
   SeparatePreds --> ContainsPreds
 
   SeparatePreds --> OtherPreds
    
 
   SeparatePreds --> LoadChunks
 
   LoadChunks --> InitBitmask
 
   InitBitmask --> FilterNulls
    
 
   FilterNulls --> LoopContains
 
   LoopContains --> VectorizedContains
 
   VectorizedContains --> AndMask
 
   AndMask --> LoopContains
    
 
   LoopContains -->|Done| LoopOther
 
   LoopOther --> EvalOther
 
   EvalOther --> CollectRowIds
  1. Bitwise filtering using BitMask at llkv-column-map/src/store/scan/filter.rs:32-110:

    • Stores candidate rows as packed u64 words
    • Supports efficient AND operations
    • Tracks candidate count to short-circuit early
  2. Vectorized contains at llkv-column-map/src/store/scan/filter.rs:441-465:

    • Uses Arrow’s SIMD contains_utf8_scalar() kernel
    • Processes entire chunks without row-by-row iteration
    • Returns boolean arrays that AND into the bitmask
  3. Progressive filtering at llkv-column-map/src/store/scan/filter.rs:421-469:

    • Applies vectorized predicates first to eliminate most rows
    • Only evaluates slower per-row predicates on remaining candidates
    • Short-circuits when candidate count reaches zero

Example scenario: a WHERE clause that combines two LIKE patterns with a LENGTH() predicate.

The fused evaluation:

  1. Vectorizes both LIKE patterns using contains
  2. ANDs the results to get a sparse candidate set
  3. Only evaluates LENGTH() on remaining rows

Sources: llkv-column-map/src/store/scan/filter.rs:336-504 llkv-column-map/src/store/scan/filter.rs:32-110


Filter Result Encoding

Filter results use run-length encoding to efficiently represent dense matches:

The FilterResult structure at llkv-column-map/src/store/scan/filter.rs:136-183 provides two representations:

Run-length encoding (dense matches):

  • Used when matching rows are mostly sequential
  • Each FilterRun stores start_row_id and len
  • Extremely compact for range queries or sorted scans
  • Example: rows [100, 101, 102, 103, 104] → FilterRun { start: 100, len: 5 }

Sparse representation (fallback):

  • Used when matches are scattered
  • Stores explicit Vec<u64> of row IDs
  • Automatically degrades when out-of-order matches detected
  • Example: rows [100, 200, 150, 300] → [100, 200, 150, 300]

Adaptive strategy at llkv-column-map/src/store/scan/filter.rs:543-590:

The FilterVisitor::record_match() method dynamically chooses encoding:

If match follows previous (row_id == prev + 1):
    Extend current run
Else if match is out of order (row_id < prev):
    Convert to sparse representation
Else:
    Start new run

This ensures optimal encoding regardless of data distribution.
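
A hedged Rust rendering of that strategy, using the visitor fields listed in the table above (written here as a standalone struct, not the actual FilterVisitor):

```rust
// FilterRun is shown with just start/len; the real layout may differ.
struct FilterRun { start_row_id: u64, len: u64 }

#[derive(Default)]
struct MatchEncoder {
    runs: Vec<FilterRun>,
    fallback_row_ids: Option<Vec<u64>>,
    prev_row_id: Option<u64>,
    total_matches: usize,
}

impl MatchEncoder {
    fn record_match(&mut self, row_id: u64) {
        self.total_matches += 1;
        if let Some(rows) = self.fallback_row_ids.as_mut() {
            rows.push(row_id); // already degraded to the sparse form
        } else if self.prev_row_id == Some(row_id.wrapping_sub(1)) {
            // Sequential match: extend the current run.
            self.runs.last_mut().expect("run exists").len += 1;
        } else if self.prev_row_id.map_or(false, |prev| row_id < prev) {
            // Out-of-order match: degrade all runs to an explicit row ID list.
            let mut rows: Vec<u64> = self
                .runs
                .drain(..)
                .flat_map(|r| r.start_row_id..r.start_row_id + r.len)
                .collect();
            rows.push(row_id);
            self.fallback_row_ids = Some(rows);
        } else {
            // Gap forward (or first match): start a new run.
            self.runs.push(FilterRun { start_row_id: row_id, len: 1 });
        }
        self.prev_row_id = Some(row_id);
    }
}
```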

Sources: llkv-column-map/src/store/scan/filter.rs:115-183 llkv-column-map/src/store/scan/filter.rs:543-590


flowchart TB
    FilterRowIds["Table::filter_row_ids()"]
CollectUserData["Collect matching user data rows"]
MVCCCheck{{"MVCC enabled?"}}
LoadCreatedBy["Load created_by column"]
LoadDeletedBy["Load deleted_by column"]
FilterCreated["Filter: created_by <= txn_id"]
FilterDeleted["Filter: deleted_by = 0 OR\ndeleted_by > txn_id"]
IntersectSets["Intersect bitmaps"]
ReturnVisible["Return visible rows"]
FilterRowIds --> CollectUserData
 
   CollectUserData --> MVCCCheck
    
 
   MVCCCheck -->|No| ReturnVisible
 
   MVCCCheck -->|Yes| LoadCreatedBy
    
 
   LoadCreatedBy --> LoadDeletedBy
 
   LoadDeletedBy --> FilterCreated
 
   FilterCreated --> FilterDeleted
 
   FilterDeleted --> IntersectSets
 
   IntersectSets --> ReturnVisible

MVCC Filtering Integration

Filter evaluation integrates with Multi-Version Concurrency Control (MVCC) to enforce transaction visibility:

MVCC columns are stored in separate logical namespaces:

| Namespace | Column | Purpose |
| --- | --- | --- |
| TxnCreatedBy | created_by | Transaction ID that created this row version |
| TxnDeletedBy | deleted_by | Transaction ID that deleted this row (0 if active) |

Visibility rules at llkv-table/src/table.rs:1047-1095:

A row version is visible to transaction T if:

  1. created_by <= T.id (row was created before or by this transaction)
  2. deleted_by = 0 OR deleted_by > T.id (row not deleted, or deleted after this transaction)

Implementation:

The collect_row_ids_for_table() method applies MVCC filtering after user predicate evaluation:

  1. Evaluate user predicates on user-data columns → bitmap A
  2. Evaluate created_by <= txn_id on MVCC column → bitmap B
  3. Evaluate deleted_by = 0 OR deleted_by > txn_id on MVCC column → bitmap C
  4. Return intersection: A ∩ B ∩ C

This ensures only transaction-visible row versions are returned to the query executor.
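
A minimal sketch of that visibility rule (the constant value is taken from the Table Abstraction page; this is not the actual implementation):

```rust
/// 0 in deleted_by means the row version has not been deleted.
const TXN_ID_NONE: u64 = 0;

// Hedged sketch of the rule stated above.
fn row_visible_to(created_by: u64, deleted_by: u64, txn_id: u64) -> bool {
    created_by <= txn_id && (deleted_by == TXN_ID_NONE || deleted_by > txn_id)
}
```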

Sources: llkv-table/src/table.rs:851-1100 llkv-table/src/table.rs:231-437


Performance Characteristics

Filter evaluation performance depends on several factors:

| Scenario | Performance | Explanation |
| --- | --- | --- |
| Equality on indexed column | O(log N) | Uses binary search in sorted chunks |
| Range query on sorted data | O(chunks) | Metadata pruning skips most chunks |
| String contains (single) | O(N) | Vectorized SIMD comparison |
| String contains (multiple) | ~O(N) | Fused evaluation with progressive filtering |
| Complex predicates | O(N × P) | Per-row evaluation of P predicates |
| Dense matches | High efficiency | Run-length encoding reduces memory |
| Sparse matches | Moderate overhead | Explicit row ID lists |

Optimization opportunities:

  1. Chunk-level parallelism : Filter evaluation at llkv-column-map/src/store/scan/filter.rs:381-494 uses Rayon for parallel chunk processing
  2. Early termination : Metadata pruning can skip 90%+ of chunks for range queries
  3. Cache efficiency : Sequential chunk traversal has good spatial locality
  4. SIMD operations : String operations use Arrow’s vectorized kernels

Sources: llkv-column-map/src/store/scan/filter.rs:381-494 llkv-column-map/src/store/core.rs:679-690


Integration with Scan Execution

Filter evaluation feeds into the scan execution pipeline (see Scan Execution and Optimization):

The two-phase execution model:

Phase 1: Filter evaluation (this page)

  • Evaluate predicates against columns
  • Produce bitmap of matching row IDs
  • Minimal data movement (only row ID metadata)

Phase 2: Data gathering (see Scan Execution and Optimization)

  • Use row ID bitmap to fetch actual column values
  • Assemble into Arrow RecordBatch
  • Apply projections and transformations

This separation enables:

  • Predicate pushdown : Filter before gathering data
  • Projection pruning : Only fetch required columns
  • Parallel execution : Filter and gather can overlap
  • Memory efficiency : Small bitmaps instead of full data

Sources: llkv-table/src/table.rs:490-496 llkv-scan/src/execute.rs:1-300


Storage Layer

Relevant source files

The Storage Layer manages the persistence and retrieval of columnar data in LLKV. It bridges Apache Arrow’s in-memory columnar format with a key-value storage backend, providing efficient operations for appends, updates, deletes, and scans while maintaining data integrity through MVCC semantics.

This page provides an architectural overview of the storage subsystem. For details on specific components, see:

Sources: llkv-column-map/src/store/core.rs:1-68 llkv-column-map/Cargo.toml:1-35

Architecture Overview

The storage layer consists of three primary components arranged in a hierarchy from logical to physical representation:

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/Cargo.toml:21-33

graph TB
    subgraph "Arrow Layer"
        RB["RecordBatch\nColumnar Arrays"]
SCHEMA["Schema\nField Metadata"]
ARRAYS["Arrow Arrays\n(Int64, Utf8, etc)"]
end
    
    subgraph "Column Store Layer"
        CS["ColumnStore\nllkv-column-map"]
CATALOG["ColumnCatalog\nLogicalFieldId → PhysicalKey"]
DESC["ColumnDescriptor\nLinked List of Chunks"]
META["ChunkMetadata\nmin/max, row_count"]
CHUNKS["Serialized Chunks\nArrow IPC Format"]
end
    
    subgraph "Pager Layer"
        PAGER["Pager Trait\nbatch_get/batch_put"]
BATCH_OPS["BatchGet/BatchPut\nOperations"]
end
    
    subgraph "Physical Storage"
        SIMD["simd-r-drive\nMemory-Mapped KV Store"]
HANDLES["EntryHandle\nByte Blobs"]
end
    
 
   RB --> CS
 
   SCHEMA --> CS
 
   ARRAYS --> CHUNKS
    
 
   CS --> CATALOG
 
   CS --> DESC
 
   DESC --> META
 
   META --> CHUNKS
    
 
   CS --> PAGER
 
   PAGER --> BATCH_OPS
 
   BATCH_OPS --> SIMD
 
   SIMD --> HANDLES
    CHUNKS -.serialized as.-> HANDLES
    
    CATALOG -.persisted via.-> PAGER
    DESC -.persisted via.-> PAGER

Columnar Storage Model

Logical Field Identification

Columns are identified by LogicalFieldId, a composite key containing:

| Component | Type | Purpose |
| --- | --- | --- |
| Namespace | LogicalStorageNamespace | Separates user data from internal columns (row IDs, MVCC metadata) |
| Table ID | TableId | Identifies the owning table |
| Field ID | FieldId | Column position within the table schema |

Namespace Values:

  • UserData - Regular table columns
  • RowIdShadow - Internal row ID tracking
  • TxnCreatedBy - MVCC transaction creation timestamps
  • TxnDeletedBy - MVCC transaction deletion timestamps

This namespacing prevents collisions between user-visible columns and internal bookkeeping while enabling efficient multi-column operations within a table.

Sources: llkv-column-map/src/store/core.rs:36-46

Chunk Organization

Data is stored in chunks , which are serialized Arrow arrays. Each chunk contains:

ChunkMetadata Fields:

| Field | Type | Purpose |
| --- | --- | --- |
| chunk_pk | PhysicalKey | Storage location of value array |
| min_val_u64 / max_val_u64 | u64 | Value range for predicate pruning |
| row_count | u64 | Number of rows in chunk |
| null_count | u64 | Number of null values |
| serialized_bytes | u64 | Size of serialized array |
| value_order_perm_pk | PhysicalKey | Optional sort index for fast lookups |

The metadata enables chunk pruning : evaluating predicates against min/max values to skip entire chunks without loading data.

Sources: llkv-column-map/src/store/core.rs:1-6 llkv-column-map/src/store/descriptor.rs

Descriptor Pages

A ColumnDescriptor organizes chunks into a linked list of metadata pages:

Each descriptor page contains:

  • Header (DescriptorPageHeader): entry_count, next_page_pk
  • Entries : Array of ChunkMetadata structures

Appends extend the tail page; when full, a new page is allocated and linked.

Sources: llkv-column-map/src/store/descriptor.rs

Data Flow: Append Operation

The append path implements Last-Write-Wins semantics for row ID conflicts:

Key Steps:

  1. Pre-sort by row ID llkv-column-map/src/store/core.rs:798-847 - Ensures efficient metadata updates and naturally sorted shadow columns
  2. LWW Rewrite llkv-column-map/src/store/core.rs:893-901 - Updates existing row IDs in-place
  3. Filter llkv-column-map/src/store/core.rs:918-942 - Removes rewritten rows and nulls
  4. Chunk & Serialize llkv-column-map/src/store/core.rs:1032-1114 - Splits into target-sized chunks, serializes to Arrow IPC
  5. Atomic Batch Put llkv-column-map/src/store/core.rs:1116-1132 - Commits all writes atomically
  6. Epoch Increment llkv-column-map/src/store/core.rs:1133-1134 - Invalidates gather context caches

Sources: llkv-column-map/src/store/core.rs:758-1137

sequenceDiagram
    participant User
    participant CS as ColumnStore
    participant CTX as MultiGatherContext
    participant Cache as Chunk Cache
    participant Pager
    
    User->>CS: gather_rows(field_ids, row_ids)
    CS->>CTX: Acquire context from pool
    CS->>CS: Build FieldPlans\n(value_metas, row_metas)
    
    Note over CS,CTX: Phase 1: Identify Candidate Chunks
    loop For each field
        CS->>CTX: Filter chunks by row_id range
        CTX->>CTX: candidate_indices.push(idx)
    end
    
    Note over CS,Cache: Phase 2: Fetch Chunks
    CS->>Cache: Check cache for chunk_pks
    CS->>Pager: batch_get(missing_chunks)
    Pager-->>CS: EntryHandle[]
    CS->>CS: deserialize_array(bytes)
    CS->>Cache: Insert arrays
    
    Note over CS,CTX: Phase 3: Build Row Index
    CS->>CTX: row_index[row_id] = output_idx
    
    Note over CS,CTX: Phase 4: Gather Values
    loop For each column
        loop For each row_id
            CTX->>Cache: Lookup value array
            CTX->>CTX: Find row in chunk\nvia binary search
            CTX->>CTX: builder.append_value()
        end
        CTX->>CTX: Finish builder → ArrayRef
    end
    
    CS->>CS: RecordBatch::try_new(schema, arrays)
    CS-->>User: RecordBatch
    CTX->>CS: Return context to pool

Data Flow: Gather Operation

Gathering assembles a RecordBatch from a list of row IDs by fetching values from chunks:

Optimizations:

| Optimization | Location | Description |
| --- | --- | --- |
| Context Pooling | llkv-column-map/src/store/projection.rs:651-721 | Reuses MultiGatherContext across calls to amortize allocation costs |
| Chunk Caching | llkv-column-map/src/store/projection.rs:1032-1053 | Caches deserialized Arrow arrays by PhysicalKey |
| Candidate Pruning | llkv-column-map/src/store/projection.rs:1015-1027 | Only loads chunks overlapping the requested row ID range |
| Dense Fast Path | llkv-column-map/src/store/projection.rs:984-1011 | For contiguous row IDs, uses offset arithmetic instead of hash lookups |
| Single Chunk Slicing | llkv-column-map/src/store/projection.rs:303-329 | If all rows fall in one sorted chunk, returns array.slice() |

Sources: llkv-column-map/src/store/projection.rs:772-960 llkv-column-map/src/store/core.rs:758-785

Data Flow: Filter Operation

Filtering evaluates predicates and returns matching row IDs:

Filter Execution Path:

  1. Descriptor Load llkv-column-map/src/store/scan/filter.rs:1301-1333 - Fetch column descriptor and iterate metadata
  2. Metadata Pruning llkv-column-map/src/store/scan/filter.rs:1336-1353 - Skip chunks where min_val > predicate_max or max_val < predicate_min
  3. Chunk Fetch llkv-column-map/src/store/scan/filter.rs:1355-1367 - Batch-get overlapping chunk arrays
  4. Visitor Evaluation llkv-column-map/src/store/scan/filter.rs:1369-1394 - Type-specialized visitor applies predicate
  5. Result Encoding llkv-column-map/src/store/scan/filter.rs:506-591 - Build FilterResult with run-length encoding or sparse list

Visitor Pattern Example (FilterVisitor<T, F>):

The visitor pattern enables type-specialized hot loops while maintaining a generic filter interface.

Sources: llkv-column-map/src/store/scan/filter.rs:209-298 llkv-column-map/src/store/scan/filter.rs:506-591 llkv-column-map/src/store/scan/filter.rs:605-649

Index Management

The storage layer supports two index types (presence indexes and value-order sort indexes) for accelerated lookups:

Index Operations:

| Method | Purpose | Location |
| --- | --- | --- |
| register_index | Creates and persists an index for a column | llkv-column-map/src/store/core.rs:145-147 |
| unregister_index | Removes an index and frees storage | llkv-column-map/src/store/core.rs:163-165 |
| list_persisted_indexes | Queries existing indexes for a column | llkv-column-map/src/store/core.rs:408-425 |

Presence Index Usage llkv-column-map/src/store/core.rs:663-754:

  • Stored in ChunkMetadata.value_order_perm_pk
  • Binary search over permuted view of row IDs
  • Accelerates has_row_id checks

Sources: llkv-column-map/src/store/core.rs:136-165 llkv-column-map/src/store/core.rs:663-754

Storage Optimizations

Chunk Size Tuning

The ColumnStoreConfig provides tuning parameters:

| Parameter | Default | Purpose |
| --- | --- | --- |
| target_chunk_rows | 8192 | Target rows per chunk for new appends |
| min_chunk_rows | 2048 | Minimum rows before considering compaction |
| max_chunk_rows | 32768 | Maximum rows before forcing a split |

Smaller chunks enable finer-grained pruning but increase metadata overhead. The defaults balance scan performance with storage efficiency.
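
An illustrative shape for those knobs (the actual ColumnStoreConfig in llkv-column-map/src/store/config.rs may carry additional fields):

```rust
// Hedged sketch mirroring the parameter table above.
pub struct ColumnStoreConfig {
    /// Target rows per chunk for new appends (default 8192).
    pub target_chunk_rows: usize,
    /// Minimum rows before a chunk is considered for compaction (default 2048).
    pub min_chunk_rows: usize,
    /// Maximum rows before a chunk is forced to split (default 32768).
    pub max_chunk_rows: usize,
}
```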

Sources: llkv-column-map/src/store/config.rs

Last-Write-Wins Semantics

When appending rows with existing row IDs:

  1. Scan existing chunks llkv-column-map/src/store/core.rs:893-901 to identify conflicting row IDs
  2. Overwrite in-place llkv-column-map/src/store/lww.rs within the chunk containing the old value
  3. Append new rows llkv-column-map/src/store/core.rs:918-942 that don’t conflict

This avoids duplicates and provides update semantics without explicit UPDATE statements.

Sources: llkv-column-map/src/store/core.rs:893-942 llkv-column-map/src/store/lww.rs

Null Handling

Null values are not stored in value chunks llkv-column-map/src/store/core.rs:920-931:

  • Filtered out during append
  • Represented by absence in the shadow row ID column
  • Surfaced as nulls during gather if row ID exists but value is missing

This reduces storage footprint for sparse columns.

Sources: llkv-column-map/src/store/core.rs:920-931

Compaction (Future)

The current implementation supports incremental writes but defers compaction:

  • Small chunks accumulate over time
  • Metadata overhead increases
  • Scan performance degrades for highly-updated columns

Future work will implement background compaction to merge small chunks and rebuild optimal chunk sizes.

Sources: llkv-column-map/src/store/compact.rs

Component Relationships

The storage layer integrates with upstream and downstream components: the Table layer above drives appends, gathers, and filters through ColumnStore, which in turn persists catalog, descriptor, and chunk data through the Pager trait's batch_get/batch_put operations.

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-storage/src/pager.rs

Thread Safety

ColumnStore is Send + Sync and designed for concurrent access:

  • Catalog : Arc<RwLock<ColumnCatalog>> - Read-heavy workload, allows concurrent reads
  • Caches : RwLock for data type and index metadata
  • Append Epoch : AtomicU64 for cache invalidation signaling
  • Context Pool : Mutex<FxHashMap<...>> for gather context reuse

Concurrent appends to different tables are lock-free at the ColumnStore level. Appends to the same column serialize on the catalog write lock.

Sources: llkv-column-map/src/store/core.rs:47-68

Performance Characteristics

| Operation | Complexity | Notes |
| --- | --- | --- |
| Append (sorted) | O(n) | n = rows; includes serialization and pager writes |
| Append (unsorted) | O(n log n) | Requires lexicographic sort |
| Gather (random) | O(m · k) | m = row count, k = avg chunk scan per row |
| Gather (dense) | O(m) | Contiguous row IDs enable offset arithmetic |
| Filter (full scan) | O(c) | c = total chunks; metadata pruning reduces effective c |
| Filter (indexed) | O(log c) | With presence/value indexes |

Batching Benefits llkv-sql/tests/pager_io_tests.rs:18-120:

  • INSERT: ~8 allocations, 24 puts for 3 rows (2 columns + row ID + MVCC)
  • SELECT: ~36 gets in 23 batches (descriptors + chunks + metadata)
  • DELETE: ~6 puts, 46 gets (read for tombstones, write MVCC columns)

Sources: llkv-column-map/src/store/core.rs llkv-sql/tests/pager_io_tests.rs:18-120


Table Abstraction

Relevant source files

The Table struct provides a schema-aware, transactional interface over the columnar storage layer. It sits between the SQL execution layer and the physical storage, managing field ID namespacing, MVCC column injection, schema validation, and catalog integration. The Table abstraction transforms low-level columnar operations into relational table semantics.

For information about the underlying columnar storage, see Column Storage and ColumnStore. For query execution that uses tables, see Query Execution. For metadata management, see System Catalog and SysCatalog.


Purpose and Responsibilities

The Table abstraction serves as the primary interface for relational data operations in LLKV. Its key responsibilities include:

| Responsibility | Description |
| --- | --- |
| Field ID Namespacing | Maps user FieldId values to table-scoped LogicalFieldId identifiers |
| MVCC Integration | Automatically injects and manages created_by and deleted_by columns |
| Schema Management | Provides Arrow schema generation and validation |
| Catalog Integration | Coordinates with SysCatalog for metadata persistence |
| Scan Operations | Implements streaming scans with projection, filtering, and ordering |
| Index Management | Registers and manages persisted sort indexes |

Sources: llkv-table/src/lib.rs:1-27 llkv-table/src/table.rs:58-69


Table Structure and Creation

Table Struct

The Table struct is a lightweight wrapper that combines a ColumnStore reference with a TableId. The MVCC cache is an optimization to avoid repeated schema introspection during appends.

Sources: llkv-table/src/table.rs:58-69 llkv-table/src/table.rs:50-56

Table Creation

Tables are created through the CatalogManager which coordinates metadata persistence and storage initialization:

sequenceDiagram
    participant User
    participant CatalogManager
    participant SysCatalog
    participant ColumnStore
    participant Pager
    
    User->>CatalogManager: create_table_from_columns(name, columns)
    CatalogManager->>CatalogManager: Allocate TableId
    CatalogManager->>SysCatalog: Register TableMeta
    CatalogManager->>SysCatalog: Register ColMeta for each column
    CatalogManager->>ColumnStore: Ensure physical storage
    ColumnStore->>Pager: Initialize descriptors if needed
    CatalogManager->>User: Return Table~P~

The creation process ensures metadata consistency before returning a usable Table handle.

Sources: llkv-table/src/table.rs:77-103


Field ID Namespace Mapping

Logical vs User Field IDs

LLKV uses a two-tier field ID system to prevent collisions between tables:

The mapping occurs during append operations where each field’s metadata is rewritten:

| Component | Type | Example |
| --- | --- | --- |
| User Field ID | FieldId (u32) | 10 |
| Table ID | TableId (u32) | 1 |
| Logical Field ID | LogicalFieldId (u64) | 0x0000_0001_0000_000A |
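
An illustrative packing function, inferred only from the example row above (table 1, field 10, UserData namespace maps to 0x0000_0001_0000_000A); the real encoding of the namespace bits may differ:

```rust
// Hedged sketch: namespace handling is omitted. UserData is assumed to encode
// as zero here, which is consistent with the example but not verified.
fn user_data_logical_field_id(table_id: u32, field_id: u32) -> u64 {
    ((table_id as u64) << 32) | field_id as u64
}

// user_data_logical_field_id(1, 10) == 0x0000_0001_0000_000A
```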

Sources: llkv-table/src/table.rs:304-315


Data Operations

Append Operation

The append method is the primary write operation, handling field ID conversion, MVCC column injection, and catalog updates:

flowchart TD
    START["append(batch)"]
CACHE["Get/Initialize\nMvccColumnCache"]
PROCESS["Process each field"]
subgraph "Per Field Processing"
        SYSCHECK{"System column?\n(row_id, MVCC)"}
MVCC_ASSIGN["Assign MVCC\nLogicalFieldId"]
USER_CONVERT["Convert user FieldId\nto LogicalFieldId"]
META_UPDATE["Update catalog\nif needed"]
end
    
    INJECT_CREATED{"created_by\npresent?"}
INJECT_DELETED{"deleted_by\npresent?"}
BUILD_CREATED["Build created_by column\nwith TXN_ID_AUTO_COMMIT"]
BUILD_DELETED["Build deleted_by column\nwith TXN_ID_NONE"]
RECONSTRUCT["Reconstruct RecordBatch\nwith LogicalFieldIds"]
STORE["store.append(namespaced_batch)"]
END["Return Result"]
START --> CACHE
 
   CACHE --> PROCESS
 
   PROCESS --> SYSCHECK
 
   SYSCHECK -->|Yes| MVCC_ASSIGN
 
   SYSCHECK -->|No| USER_CONVERT
 
   USER_CONVERT --> META_UPDATE
 
   MVCC_ASSIGN --> INJECT_CREATED
 
   META_UPDATE --> INJECT_CREATED
    
 
   INJECT_CREATED -->|No| BUILD_CREATED
 
   INJECT_CREATED -->|Yes| INJECT_DELETED
 
   BUILD_CREATED --> INJECT_DELETED
    
 
   INJECT_DELETED -->|No| BUILD_DELETED
 
   INJECT_DELETED -->|Yes| RECONSTRUCT
 
   BUILD_DELETED --> RECONSTRUCT
    
 
   RECONSTRUCT --> STORE
 
   STORE --> END

MVCC Column Injection

For non-transactional appends (e.g., CSV import), the system automatically injects MVCC columns:

graph LR
    subgraph "Input Batch"
        IR["row_id: UInt64"]
IC1["col_a: Int64\nfield_id=10"]
IC2["col_b: Utf8\nfield_id=11"]
end
    
    subgraph "Output Batch"
        OR["row_id: UInt64"]
OC1["col_a: Int64\nlfid=0x0000_0001_0000_000A"]
OC2["col_b: Utf8\nlfid=0x0000_0001_0000_000B"]
OCB["created_by: UInt64\nlfid=MVCC_CREATED\nvalue=1"]
ODB["deleted_by: UInt64\nlfid=MVCC_DELETED\nvalue=0"]
end
    
 
   IR --> OR
 
   IC1 --> OC1
 
   IC2 --> OC2
 
   OC2 --> OCB
 
   OCB --> ODB

The constant values used are:

  • TXN_ID_AUTO_COMMIT = 1 for created_by (indicates system-committed, always visible)
  • TXN_ID_NONE = 0 for deleted_by (indicates not deleted)

Sources: llkv-table/src/table.rs:231-438 llkv-table/src/table.rs:347-399

graph TB
    subgraph "Public API"
        SS["scan_stream\n(projections, filter, options, callback)"]
SSE["scan_stream_with_exprs\n(projections, filter, options, callback)"]
FR["filter_row_ids\n(filter_expr)"]
SC["stream_columns\n(logical_fields, row_ids, policy)"]
end
    
    subgraph "Internal Execution"
        ES["llkv_scan::execute_scan"]
RSB["RowStreamBuilder"]
CRS["collect_row_ids_for_table"]
end
    
    subgraph "Storage Layer"
        CS["ColumnStore::ScanBuilder"]
GR["gather_rows"]
FM["filter_matches"]
end
    
 
   SS --> SSE
 
   SSE --> ES
 
   FR --> CRS
 
   SC --> RSB
    
 
   ES --> FM
 
   ES --> GR
 
   RSB --> GR
 
   CRS --> FM
    
 
   FM --> CS
 
   GR --> CS

Scan Operations

The Table provides multiple scan methods that build on the lower-level ColumnStore scan infrastructure:

Scan Flow with Projections

Sources: llkv-table/src/table.rs:447-488 llkv-table/src/table.rs:490-496


Schema Management

Schema Generation

The schema() method constructs an Arrow Schema from catalog metadata:

Each field in the generated schema includes field_id metadata:

| Schema Component | Source | Example |
| --- | --- | --- |
| Field name | ColMeta.name or generated | "customer_id" or "col_10" |
| Data type | ColumnStore.data_type(lfid) | DataType::Int64 |
| Field ID metadata | field_id key in metadata map | "10" |
| Nullability | Always true for user columns | true |

Sources: llkv-table/src/table.rs:519-549

graph TD
    SCHEMA["schema()"]
subgraph "Extract Components"
        NAMES["Collect field names"]
FIDS["Extract field_id metadata"]
TYPES["Format data types"]
end
    
    subgraph "Build RecordBatch"
        NAME_ARR["StringArray(names)"]
FID_ARR["UInt32Array(field_ids)"]
TYPE_ARR["StringArray(data_types)"]
RB_SCHEMA["Schema(name, field_id, data_type)"]
BATCH["RecordBatch::try_new"]
end
    
 
   SCHEMA --> NAMES
 
   SCHEMA --> FIDS
 
   SCHEMA --> TYPES
    
 
   NAMES --> NAME_ARR
 
   FIDS --> FID_ARR
 
   TYPES --> TYPE_ARR
    
 
   NAME_ARR --> BATCH
 
   FID_ARR --> BATCH
 
   TYPE_ARR --> BATCH
 
   RB_SCHEMA --> BATCH

Schema RecordBatch

For interactive display, the schema_recordbatch() method returns a tabular representation:

Sources: llkv-table/src/table.rs:554-586


stateDiagram-v2
    [*] --> Uninitialized : Table created
    
    state Uninitialized {
        [*] --> CheckCache : append() called
        CheckCache --> Initialize : Cache is None
        Initialize --> Store : Scan schema for created_by, deleted_by
        Store --> [*] : Cache populated
    }
    
    Uninitialized --> Cached : First append
    
    state Cached {
        [*] --> ReadCache : Subsequent appends
        ReadCache --> UseCached : Return cached values
        UseCached --> [*]
    }

MVCC Column Management

Column Cache Optimization

The MvccColumnCache avoids repeated string comparisons during append operations:

The cache stores two boolean flags:

| Field | Type | Meaning |
| --- | --- | --- |
| has_created_by | bool | Schema includes created_by column |
| has_deleted_by | bool | Schema includes deleted_by column |

This optimization is critical for bulk insert performance as it eliminates O(columns) string comparisons per batch.

Sources: llkv-table/src/table.rs:50-56 llkv-table/src/table.rs:175-205

MVCC Column LogicalFieldIds

MVCC columns use specially reserved LogicalFieldId values in the TxnCreatedBy and TxnDeletedBy namespaces.

These reserved patterns ensure MVCC columns are stored separately from user data in the ColumnStore.

Sources: llkv-table/src/table.rs:255-259 llkv-table/src/table.rs:360-366 llkv-table/src/table.rs:383-389


Index Management

The Table provides operations to register and manage persisted sort indexes:

Registered indexes are maintained by the ColumnStore and used to optimize range scans and ordered queries.

Sources: llkv-table/src/table.rs:145-173


classDiagram
    class RowIdScanCollector {
        -Treemap row_ids
        +extend_from_array(row_ids)
        +extend_from_slice(row_ids, start, len)
        +into_inner() Treemap
    }
    
    class PrimitiveWithRowIdsVisitor {<<trait>>\n+i64_chunk_with_rids(values, row_ids)\n+utf8_chunk_with_rids(values, row_ids)\n...}
    
    class PrimitiveSortedWithRowIdsVisitor {<<trait>>\n+i64_run_with_rids(values, row_ids, start, len)\n+null_run(row_ids, start, len)\n...}
    
    RowIdScanCollector ..|> PrimitiveWithRowIdsVisitor
    RowIdScanCollector ..|> PrimitiveSortedWithRowIdsVisitor

Row ID Collection and Filtering

RowIdScanCollector

The RowIdScanCollector implements visitor traits to collect row IDs during filtered scans:

The collector ignores actual data values and only accumulates row IDs into a Treemap (roaring bitmap).

Sources: llkv-table/src/table.rs:805-858

sequenceDiagram
    participant Scan
    participant Emitter
    participant Callback
    participant Buffer
    
    loop For each chunk from ColumnStore
        Scan->>Emitter: chunk_with_rids(values, row_ids)
        Emitter->>Buffer: Accumulate row_ids
        
        alt Buffer reaches chunk_size
            Emitter->>Callback: on_chunk(&row_ids[..chunk_size])
            Callback-->>Emitter: Result
            Emitter->>Buffer: Clear buffer
        end
    end
    
    Note over Emitter: Scan complete
    Emitter->>Callback: on_chunk(&remaining_row_ids)
    Callback-->>Emitter: Result

RowIdChunkEmitter

For streaming scenarios, RowIdChunkEmitter buffers and emits row IDs in fixed-size chunks:

The emitter includes an optimization: when the buffer is empty, it passes slices directly without copying.

Sources: llkv-table/src/table.rs:860-993


classDiagram
    class ScanStorage~P~ {<<trait>>\n+table_id() TableId\n+field_data_type(lfid) Result~DataType~\n+total_rows() Result~u64~\n+gather_column(lfid, row_ids, policy) Result~ArrayRef~\n+filter_matches(lfid, predicate) Result~Treemap~\n...}
    
    class Table~P~ {-store: Arc~ColumnStore~P~~\n-table_id: TableId\n+catalog() SysCatalog}
    
    Table~P~ ..|> ScanStorage~P~ : implements

ScanStorage Trait Implementation

The Table implements the ScanStorage trait to integrate with the llkv-scan execution layer:

Key delegations:

| ScanStorage Method | Table Implementation | Delegation Target |
| --- | --- | --- |
| table_id() | Direct field access | self.table_id |
| field_data_type(lfid) | Forward to store | self.store.data_type(lfid) |
| gather_column(...) | Forward to store | self.store.gather_rows(...) |
| filter_matches(...) | Build predicate and scan | ColumnStore::ScanBuilder |
| stream_row_ids_in_chunks(...) | Use RowIdChunkEmitter | Custom visitor |

Sources: llkv-table/src/table.rs:1019-1232


Integration with Lower Layers

Relationship to ColumnStore

The Table never directly interacts with the Pager; all storage operations go through the ColumnStore.

Sources: llkv-table/src/table.rs:64-69 llkv-table/src/table.rs:647-649


Usage Examples

Creating and Populating a Table
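
A hedged sketch based on the append and MVCC sections above (the Arrow builder calls are standard; the table handle, field IDs, and the append signature are simplified assumptions, not the bundled example verbatim):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use arrow::array::{Int64Array, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// Build a RecordBatch with an explicit row_id column plus one user column
// tagged with its field_id metadata, then append it to the table. MVCC
// columns (created_by / deleted_by) are injected automatically for this
// non-transactional append.
fn populate<P: Pager>(table: &Table<P>) -> Result<()> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("row_id", DataType::UInt64, false),
        Field::new("value", DataType::Int64, false)
            .with_metadata(HashMap::from([("field_id".into(), "10".into())])),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(UInt64Array::from(vec![0u64, 1, 2])),
            Arc::new(Int64Array::from(vec![100i64, 200, 300])),
        ],
    )?;
    table.append(&batch) // signature assumed; see llkv-table/src/table.rs
}
```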

Sources: llkv-table/examples/test_streaming.rs:35-60 llkv-csv/src/writer.rs:104-137

Scanning with Projections and Filters
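
A hedged sketch of a filtered, projected scan in the same spirit (projection and filter construction are simplified; the downcast assumes the single projected column is Int64 for illustration):

```rust
use arrow::array::{Array, Int64Array};

// Stream one projected column through the callback and sum its values.
fn sum_matching<P: Pager>(
    table: &Table<P>,
    projection: ScanProjection,
    filter_expr: &Expr<FieldId>,
) -> Result<i64> {
    let mut sum = 0i64;
    table.scan_stream(
        [projection],                 // single projection: eligible for the fast path
        filter_expr,
        ScanStreamOptions::default(),
        |batch| {
            let col = batch
                .column(0)
                .as_any()
                .downcast_ref::<Int64Array>()
                .expect("Int64 column");
            sum += col.iter().flatten().sum::<i64>();
        },
    )?;
    Ok(sum)
}
```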

Sources: llkv-table/examples/test_streaming.rs:64-110 llkv-table/examples/performance_benchmark.rs:128-151


Performance Characteristics

The Table layer adds minimal overhead to ColumnStore operations:

| Operation | Overhead Source | Cost |
| --- | --- | --- |
| Append (cached MVCC) | Field ID conversion + metadata map construction | ~1-2% per field |
| Append (uncached MVCC) | Additional schema scan | ~5% (first append only) |
| Single-column scan | Filter compilation + streaming setup | ~1.1x vs direct ColumnStore |
| Multi-column scan | RecordBatch construction per batch | ~1.5-2x vs direct ColumnStore |
| Schema introspection | Catalog lookups + field sorting | O(fields) |

The design prioritizes zero-copy operations and streaming to minimize memory overhead.

Sources: llkv-table/examples/direct_comparison.rs:277-399 llkv-table/examples/performance_benchmark.rs:82-211


Column Storage and ColumnStore

Relevant source files

This page documents the columnar storage layer implemented in the llkv-column-map crate. The ColumnStore provides Arrow-native persistence of columnar data over a key-value pager, enabling efficient predicate evaluation, multi-column projections, and Last-Write-Wins (LWW) semantics for updates.

For information about the table-level abstraction built on top of ColumnStore, see Table Abstraction. For details about the underlying pager interface, see Pager Interface and SIMD Optimization.

ColumnStore Architecture

The ColumnStore<P> struct (llkv-column-map/src/store/core.rs:60-68) serves as the primary interface for columnar data operations. It is parameterized by a Pager implementation and manages:

  • Column catalog : Maps LogicalFieldId to physical storage keys
  • Descriptors : Linked lists of chunk metadata for each column
  • Data type cache : Avoids repeated descriptor loads for schema queries
  • Index manager : Maintains presence and value indexes
  • Context pool : Reusable gather contexts to amortize chunk fetch costs

Purpose : This diagram shows the internal components of ColumnStore and their relationships. The catalog is the single source of truth for column-to-physical-key mappings, while auxiliary structures provide caching and optimization.

Sources : llkv-column-map/src/store/core.rs:60-68 llkv-column-map/src/store/core.rs:100-124

Storage Organization

Namespaces

Columns are identified by LogicalFieldId, which combines a namespace, table ID, and field ID to prevent collisions between different data categories (llkv-column-map/src/store/core.rs:38-46):

| Namespace | Purpose | Example Usage |
| --- | --- | --- |
| UserData | Regular table columns | User-visible columns from CREATE TABLE |
| RowIdShadow | Internal row ID tracking | Shadow column storing the RowId for each value |
| TxnCreatedBy | MVCC creation timestamps | Transaction ID that created each row |
| TxnDeletedBy | MVCC deletion timestamps | Transaction ID that deleted each row |

Each column in the UserData namespace has a corresponding RowIdShadow column that stores the row IDs for non-null values. This pairing enables efficient row-based gathering and filtering.

Sources : llkv-column-map/src/store/core.rs:38-46

Physical Storage Structure

Data is organized as a hierarchy: ColumnCatalog → ColumnDescriptor → DescriptorPages → ChunkMetadata → Data Chunks.

Purpose : This diagram shows how data is physically organized in storage. The catalog is the entry point, descriptors form a linked list of metadata pages, and each metadata entry points to a serialized Arrow array.

graph TB
    CATALOG["ColumnCatalog\nPhysicalKey: CATALOG_ROOT_PKEY"]
subgraph "Per Column"
        DESC["ColumnDescriptor\n(head_page_pk, tail_page_pk,\ntotal_row_count)"]
PAGE1["DescriptorPage\n(next_page_pk, entry_count)"]
PAGE2["DescriptorPage\n(next_page_pk, entry_count)"]
end
    
    subgraph "Chunk Metadata"
        META1["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META2["ChunkMetadata\n(chunk_pk, row_count,\nmin_val, max_val,\nserialized_bytes)"]
META3["ChunkMetadata"]
end
    
    subgraph "Arrow Data"
        CHUNK1["Serialized Arrow Array\n(values)"]
CHUNK2["Serialized Arrow Array\n(values)"]
CHUNK3["Serialized Arrow Array\n(row IDs)"]
end
    
 
   CATALOG -->|maps LogicalFieldId| DESC
 
   DESC -->|head_page_pk| PAGE1
 
   PAGE1 -->|next_page_pk| PAGE2
 
   PAGE1 --> META1
 
   PAGE1 --> META2
 
   PAGE2 --> META3
    
 
   META1 -->|chunk_pk| CHUNK1
 
   META2 -->|chunk_pk| CHUNK2
 
   META3 -->|chunk_pk| CHUNK3

Key structures :

  • ColumnCatalog (llkv-column-map/src/store/catalog.rs): Root mapping stored at CATALOG_ROOT_PKEY
  • ColumnDescriptor : Per-column header with total row count and pointers to metadata pages
  • DescriptorPage : Linked list node containing multiple ChunkMetadata entries
  • ChunkMetadata (llkv-column-map/src/store/descriptor.rs): Min/max values, row count, serialized size, chunk physical key

The min/max values in ChunkMetadata enable chunk pruning : predicates can skip entire chunks without loading data if the min/max range doesn’t overlap the query predicate.

Sources : llkv-column-map/src/store/core.rs:100-124 llkv-column-map/src/store/descriptor.rs

Core Operations

Append with Last-Write-Wins

The append method (llkv-column-map/src/store/core.rs:787-1633) ingests a RecordBatch with LWW semantics:

  1. Preprocessing : Sort by rowid if not already sorted
  2. LWW Rewrite : For each column, identify existing row IDs and overwrite them in-place
  3. Filter for Append : Remove rewritten rows and nulls from the batch
  4. Chunk and Persist : Split data into target-sized chunks, serialize as Arrow arrays, persist with metadata

Purpose : This diagram shows the append flow with LWW semantics. Each column is processed independently, rewrites happen first, then new data is chunked and persisted atomically.

sequenceDiagram
    participant Caller
    participant ColumnStore
    participant LWW as "LWW Rewrite"
    participant Chunker
    participant Pager
    
    Caller->>ColumnStore: append(batch)
    ColumnStore->>ColumnStore: Sort by rowid if needed
    
    loop For each column
        ColumnStore->>LWW: lww_rewrite_for_field(field_id, row_ids, values)
        LWW->>Pager: Load existing chunks
        LWW->>LWW: Identify overlapping row IDs
        LWW->>Pager: Overwrite matching rows
        LWW-->>ColumnStore: rewritten_ids
        
        ColumnStore->>ColumnStore: Filter out rewritten_ids and nulls
        ColumnStore->>Chunker: Split to target chunk sizes
        
        loop For each chunk
            Chunker->>Pager: alloc() for chunk_pk and rid_pk
            Chunker->>Chunker: Serialize Arrow array
            Chunker->>Chunker: Compute min/max/size metadata
        end
    end
    
    ColumnStore->>Pager: batch_put(all_puts)
    ColumnStore->>ColumnStore: Increment append_epoch
    ColumnStore-->>Caller: Success

The append_epoch counter (llkv-column-map/src/store/core.rs:132-134) is incremented after every append, providing a cache invalidation signal for gather contexts.

Sources : llkv-column-map/src/store/core.rs:787-1633 llkv-column-map/src/store/core.rs:893-942

Multi-Column Gathering

The gather_rows method (llkv-column-map/src/store/projection.rs:758-785) assembles a RecordBatch from multiple columns given a list of row IDs. It uses a two-phase strategy :

Phase 1: Planning

  • Load ColumnDescriptor for each field
  • Collect ChunkMetadata for value and row ID chunks
  • Build FieldPlan with candidate chunk indices

Phase 2: Execution

  • Identify candidate chunks that intersect requested row IDs (chunk pruning via min/max)
  • Batch-fetch missing chunks from pager
  • For each column, build Arrow array by gathering values from chunks
  • Assemble RecordBatch from gathered columns

Purpose : This diagram shows the gather pipeline. Planning loads metadata, execution prunes chunks and assembles the result. The chunk cache avoids redundant pager reads across multiple gather calls.

Context Pooling : The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains a pool of MultiGatherContext objects keyed by field IDs. Each context caches:

  • Chunk arrays by physical key
  • Row ID index for sparse lookups
  • Scratch buffers for row location tracking
  • Arrow builders for each column type

By reusing contexts, repeated gather calls avoid allocating temporary structures and can benefit from cached chunks.

Sources : llkv-column-map/src/store/projection.rs:758-927 llkv-column-map/src/store/projection.rs:651-720

Filter Operations

The filter_row_ids method (llkv-column-map/src/store/core.rs:356-372) evaluates a predicate against a column and returns matching row IDs. It uses:

  1. Type-specific dispatch : FilterDispatch trait routes to specialized implementations (primitive types, strings, booleans)
  2. Chunk pruning : Skip chunks where min_val and max_val don’t overlap the predicate
  3. Vectorized evaluation : For string predicates like CONTAINS, use Arrow’s vectorized kernels (llkv-column-map/src/store/scan/filter.rs:337-503)
  4. Dense encoding : Return results as FilterResult with run-length encoding when possible

For example, the fused string filter (llkv-column-map/src/store/scan/filter.rs:337-503) evaluates multiple CONTAINS predicates in a single pass:

  • Load value and row ID chunks in parallel (via Rayon)
  • Use Arrow’s contains_utf8_scalar kernel for each pattern
  • AND results using a bitmask to avoid per-row branching
  • Extract matching row IDs from the final bitmask
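
The production path operates on Arrow StringArray chunks via vectorized kernels; the sketch below only illustrates the fused idea, AND-ing one mask per pattern and then extracting the surviving row IDs, using plain Rust slices:

/// Conceptual sketch of a fused CONTAINS filter over one chunk.
/// The real implementation uses Arrow's vectorized string kernels.
fn fused_contains(values: &[&str], row_ids: &[u64], patterns: &[&str]) -> Vec<u64> {
    // Start with an all-true mask and AND in one mask per pattern.
    let mut mask = vec![true; values.len()];
    for pattern in patterns {
        for (slot, value) in mask.iter_mut().zip(values) {
            *slot = *slot && value.contains(*pattern);
        }
    }
    // Extract the row IDs whose mask bit survived every pattern.
    row_ids
        .iter()
        .zip(&mask)
        .filter_map(|(rid, keep)| keep.then_some(*rid))
        .collect()
}

fn main() {
    let values = ["alpha beta", "beta gamma", "alpha gamma"];
    let row_ids = [10, 11, 12];
    assert_eq!(fused_contains(&values, &row_ids, &["alpha", "gamma"]), vec![12]);
}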

Sources : llkv-column-map/src/store/core.rs:356-372 llkv-column-map/src/store/scan/filter.rs:209-503

Optimization Techniques

Chunk Pruning via Min/Max Metadata

Each ChunkMetadata stores min_val_u64 and max_val_u64 (llkv-column-map/src/store/descriptor.rs). During filtering or gathering, chunks are skipped if:

  • For predicates: The predicate range doesn’t overlap [min_val, max_val]
  • For point lookups: The requested row ID is outside [min_val, max_val]
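
A minimal sketch of the overlap test, using plain u64 statistics as stand-ins for the values stored in ChunkMetadata:

/// Illustrative chunk statistics; the real values come from ChunkMetadata.
struct ChunkStats {
    min_val: u64,
    max_val: u64,
}

/// Returns true when the chunk may contain values in [lo, hi] and must be loaded.
fn chunk_may_match(stats: &ChunkStats, lo: u64, hi: u64) -> bool {
    // Skip the chunk when its [min, max] range does not overlap the predicate range.
    !(stats.max_val < lo || stats.min_val > hi)
}

fn main() {
    let stats = ChunkStats { min_val: 100, max_val: 200 };
    assert!(chunk_may_match(&stats, 150, 300));  // overlaps -> load chunk
    assert!(!chunk_may_match(&stats, 201, 500)); // disjoint -> prune chunk
}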

This optimization is particularly effective for:

  • Range predicates on sorted or clustered columns
  • Temporal queries on timestamp columns (common in OLAP)
  • Primary key lookups

Sources : llkv-column-map/src/store/core.rs:679-690 llkv-column-map/src/store/projection.rs:1019-1026

Context Pooling

The GatherContextPool (llkv-column-map/src/store/projection.rs:651-693) maintains up to 4 contexts per unique field ID set. This amortizes:

  • Chunk cache allocations (reuse FxHashMap<PhysicalKey, ArrayRef>)
  • Arrow builder allocations (reuse builders with reserved capacity)
  • Scratch buffer allocations (reuse Vec for row indices)

Contexts track an epoch (llkv-column-map/src/store/projection.rs:520-526) that is compared against ColumnStore.append_epoch. If epochs mismatch, the context rebuilds its metadata from the latest descriptors.
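
A minimal sketch of that handshake, with an AtomicU64 standing in for ColumnStore.append_epoch and a trimmed-down context type (names are illustrative, not the real API):

use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative stand-in for a pooled gather context.
struct GatherContext {
    epoch: u64,
    // cached chunk arrays, Arrow builders, scratch buffers, ...
}

impl GatherContext {
    /// Rebuild cached metadata if the store has appended since this context was built.
    fn refresh_if_stale(&mut self, store_epoch: &AtomicU64) {
        let current = store_epoch.load(Ordering::Acquire);
        if self.epoch != current {
            // Drop cached chunks / descriptors and reload from the latest state.
            self.epoch = current;
        }
    }
}

fn main() {
    let append_epoch = AtomicU64::new(7);
    let mut ctx = GatherContext { epoch: 3 };
    ctx.refresh_if_stale(&append_epoch);
    assert_eq!(ctx.epoch, 7);
}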

Sources : llkv-column-map/src/store/projection.rs:651-720 llkv-column-map/src/store/projection.rs:460-526

Data Type Caching

The DTypeCache (llkv-column-map/src/store/core.rs:64) stores Arrow DataType for each LogicalFieldId to avoid repeated descriptor loads during schema queries. It uses a fingerprint stored in the descriptor to detect type changes (e.g., from ALTER TABLE operations).

Sources : llkv-column-map/src/store/core.rs:175-180 llkv-column-map/src/store/core.rs:190-227

Indexes

The IndexManager (llkv-column-map/src/store/core.rs:65) supports two index types:

| Index Type | Purpose | Storage |
| --- | --- | --- |
| Presence | Tracks which row IDs exist | Bitmap or sorted permutation array |
| Value | Enables sorted iteration by value | Permutation array mapping value order to physical order |

Indexes are created via register_index (llkv-column-map/src/store/core.rs:145-147) and maintained automatically during append operations. The has_row_id method (llkv-column-map/src/store/core.rs:663-754) uses presence indexes for fast lookups via binary search.
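
Conceptually, a presence lookup reduces to a binary search over sorted row IDs; the sketch below assumes the index is exposed as a plain sorted slice, which is a simplification of the real structure:

/// Conceptual presence lookup: the presence index keeps row IDs in sorted order,
/// so membership reduces to a binary search.
fn has_row_id(sorted_row_ids: &[u64], row_id: u64) -> bool {
    sorted_row_ids.binary_search(&row_id).is_ok()
}

fn main() {
    let presence = [1u64, 4, 9, 16, 25];
    assert!(has_row_id(&presence, 9));
    assert!(!has_row_id(&presence, 10));
}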

Sources : llkv-column-map/src/store/core.rs:145-165 llkv-column-map/src/store/core.rs:663-754

Data Flow: Append Operation

Purpose : This flowchart shows the detailed append logic. Each column is processed independently (LWW rewrite, filter, chunk), then all writes are flushed atomically.

Sources : llkv-column-map/src/store/core.rs:787-1633

Data Flow: Gather Operation

Purpose : This flowchart shows the gather operation with context reuse. The epoch check ensures cache validity, and the dense pattern optimization avoids hash lookups for contiguous row IDs.

Sources : llkv-column-map/src/store/projection.rs:929-1195

Configuration and Tuning

The ColumnStoreConfig (llkv-column-map/src/store/core.rs:63) provides tuning parameters for the store.

The GatherContextPool (llkv-column-map/src/store/projection.rs:658-663) maintains up to max_per_key contexts per field set (currently 4). Increasing this value trades memory for reduced context allocation frequency in highly concurrent workloads.

Sources : llkv-column-map/src/store/core.rs:126-129 llkv-column-map/src/store/projection.rs:658-663

Thread Safety

ColumnStore is Send + Sync (llkv-column-map/src/store/core.rs:49). Internal state uses:

  • Arc<RwLock<ColumnCatalog>>: Catalog allows concurrent reads, exclusive writes
  • Arc<Pager>: Pager implementations must be Send + Sync
  • Arc<AtomicU64>: Epoch counter supports lock-free reads

This enables concurrent query execution across multiple threads while serializing metadata updates (e.g., append, remove_column).

Sources : llkv-column-map/src/store/core.rs:49-85

Pager Interface and SIMD Optimization

Relevant source files

Purpose and Scope

This document describes the storage abstraction layer that provides persistent key-value storage for the columnar data model. The Pager trait defines a generic interface for batch-oriented storage operations, while the SIMD R-Drive library provides a memory-mapped, SIMD-optimized concrete implementation.

For column-level storage operations built on top of the pager, see Column Storage and ColumnStore. For table-level abstractions, see Table Abstraction.


System Architecture

The storage system follows a layered design where the Pager provides a simple key-value abstraction that the ColumnStore builds upon. All persistence operations flow through batch APIs to minimize I/O overhead and enable atomic commits.

graph TB
    subgraph "Column Storage Layer"
        ColumnStore["ColumnStore&lt;P: Pager&gt;"]
ColumnCatalog["ColumnCatalog"]
ColumnDescriptor["ColumnDescriptor"]
ChunkMetadata["ChunkMetadata"]
end
    
    subgraph "Pager Abstraction Layer"
        PagerTrait["Pager Trait"]
BatchGet["BatchGet"]
BatchPut["BatchPut"]
GetResult["GetResult"]
PhysicalKey["PhysicalKey (u64)"]
end
    
    subgraph "SIMD R-Drive Implementation"
        SimdRDrive["simd-r-drive 0.15.5"]
EntryHandle["EntryHandle"]
MemoryMap["Memory-Mapped Files"]
SimdOps["SIMD Operations"]
end
    
 
   ColumnStore --> PagerTrait
 
   ColumnStore --> BatchGet
 
   ColumnStore --> BatchPut
 
   ColumnCatalog --> PagerTrait
 
   ColumnDescriptor --> PagerTrait
 
   ChunkMetadata --> PhysicalKey
    
 
   PagerTrait --> SimdRDrive
 
   BatchGet --> PhysicalKey
 
   BatchPut --> PhysicalKey
 
   GetResult --> EntryHandle
    
 
   SimdRDrive --> EntryHandle
 
   SimdRDrive --> MemoryMap
 
   SimdRDrive --> SimdOps
 
   EntryHandle --> MemoryMap

Storage Layering

Sources: llkv-column-map/src/store/core.rs:1-89 Cargo.toml:85-86


The Pager Trait

The Pager trait defines the storage interface used throughout the system. It abstracts over key-value stores with batch-oriented operations and explicit memory management.

Core Operations

| Operation | Purpose | Batch Size |
| --- | --- | --- |
| batch_get(&[BatchGet]) | Retrieve multiple keys atomically | 1-N keys |
| batch_put(&[BatchPut]) | Write multiple keys atomically | 1-N keys |
| alloc() | Allocate a single physical key | 1 key |
| alloc_many(&[usize]) | Allocate multiple physical keys | N keys |
| free(PhysicalKey) | Deallocate a single key | 1 key |
| free_many(&[PhysicalKey]) | Deallocate multiple keys | N keys |

Type Parameters

The Pager trait is generic over the blob type returned from reads:
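
The real trait is defined in llkv-storage; the sketch below approximates its shape from the operations listed above (the error type, receiver types, and the Missing variant are assumptions, not the exact definition):

// Simplified sketch of the pager interface; the real trait lives in llkv-storage.
type PhysicalKey = u64;
type PagerResult<T> = Result<T, String>; // stand-in for the crate's error type

enum BatchGet {
    Raw { key: PhysicalKey },
}

enum BatchPut {
    Raw { key: PhysicalKey, bytes: Vec<u8> },
}

enum GetResult<B> {
    Raw { key: PhysicalKey, bytes: B },
    Missing { key: PhysicalKey }, // variant name is a guess based on usage
}

trait Pager {
    /// Blob type returned from reads; SIMD R-Drive plugs in a zero-copy EntryHandle here.
    type Blob: AsRef<[u8]>;

    fn batch_get(&self, gets: &[BatchGet]) -> PagerResult<Vec<GetResult<Self::Blob>>>;
    fn batch_put(&self, puts: &[BatchPut]) -> PagerResult<()>;
    fn alloc(&self) -> PagerResult<PhysicalKey>;
    fn alloc_many(&self, n: usize) -> PagerResult<Vec<PhysicalKey>>;
    fn free(&self, key: PhysicalKey) -> PagerResult<()>;
    fn free_many(&self, keys: &[PhysicalKey]) -> PagerResult<()>;
}

fn main() {}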

The SIMD R-Drive implementation uses EntryHandle as the blob type, which provides zero-copy access to memory-mapped regions without allocating new buffers.

Sources: llkv-column-map/src/store/core.rs:60-68 llkv-column-map/src/store/core.rs:89-124

Physical Keys

Physical keys are 64-bit unsigned integers (u64) that uniquely identify storage locations. The system partitions the key space for different purposes:

  • Catalog Root : Key 0 stores the ColumnCatalog
  • Column Descriptors : User-allocated keys for column metadata
  • Data Chunks : User-allocated keys for serialized Arrow arrays
  • Index Structures : User-allocated keys for permutations and bitmaps

Sources: llkv-storage/src/constants.rs (implied), llkv-column-map/src/store/core.rs:100-124


sequenceDiagram
    participant CS as ColumnStore
    participant Pager as Pager
    participant SIMD as SIMD R-Drive
    participant MM as Memory Map
    
    CS->>Pager: batch_get([BatchGet::Raw{key: 42}, ...])
    Pager->>SIMD: batch read request
    SIMD->>MM: locate pages for keys
    MM->>SIMD: zero-copy EntryHandles
    SIMD->>Pager: Vec<GetResult>
    Pager->>CS: Results with EntryHandles
    CS->>CS: deserialize Arrow arrays

Batch Operations

All storage I/O uses batch operations to minimize system call overhead and enable atomic multi-key transactions.

BatchGet Flow

Sources: llkv-column-map/src/store/core.rs:100-110 llkv-column-map/src/store/core.rs:414-425

The BatchGet enum specifies requests:

| Variant | Purpose |
| --- | --- |
| Raw { key: PhysicalKey } | Retrieve raw bytes for a key |

The GetResult enum returns outcomes:

| Variant | Meaning |
| --- | --- |
| Raw { key: PhysicalKey, bytes: Blob } | Successfully retrieved bytes |
| (others implied by usage) | Key not found, or other statuses |

BatchPut Flow

Sources: llkv-column-map/src/store/core.rs:217-222 llkv-column-map/src/store/core.rs:1117-1125

The BatchPut enum specifies writes:

| Variant | Purpose |
| --- | --- |
| Raw { key: PhysicalKey, bytes: Vec<u8> } | Write raw bytes to a key |

Why Batching Matters

Batching provides three critical benefits:

  1. Atomicity : All keys in a batch are committed together or not at all
  2. Performance : Amortizes system call overhead across many keys
  3. Prefetching : Enables SIMD R-Drive to optimize I/O patterns

The ColumnStore::append method demonstrates this pattern—it stages hundreds of puts (data chunks, descriptors, catalog updates) and commits them atomically at the end:
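
The staging-then-commit shape can be sketched as follows (StagedPut and commit_append are illustrative stand-ins, not the real types; the production code stages BatchPut values and calls the pager's batch_put once):

type PhysicalKey = u64;

/// Illustrative staged write; the production code stages BatchPut values instead.
struct StagedPut {
    key: PhysicalKey,
    bytes: Vec<u8>,
}

/// Conceptual shape of the commit path: everything is serialized into a staging
/// buffer first, then the whole batch is handed to the pager in one call.
fn commit_append(
    staged: Vec<StagedPut>,
    batch_put: impl FnOnce(&[StagedPut]) -> Result<(), String>,
) -> Result<(), String> {
    // Nothing becomes visible in storage until this single call succeeds.
    batch_put(&staged)
}

fn main() {
    let staged = vec![
        StagedPut { key: 42, bytes: b"serialized arrow chunk".to_vec() },
        StagedPut { key: 43, bytes: b"descriptor page".to_vec() },
    ];
    commit_append(staged, |puts: &[StagedPut]| {
        println!("atomically committing {} puts", puts.len());
        Ok(())
    })
    .unwrap();
}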

Sources: llkv-column-map/src/store/core.rs:1117-1125


Physical Key Allocation

The pager manages a free-list of physical keys. Allocations are persistent—once allocated, a key remains valid until explicitly freed.

graph TD
    subgraph "Allocation Operations"
        A1["alloc() → PhysicalKey"]
A2["alloc_many(n) → Vec&lt;PhysicalKey&gt;"]
end
    
    subgraph "Deallocation Operations"
        F1["free(key)"]
F2["free_many(&[key])"]
end
    
    subgraph "Usage Examples"
        Desc["Column Descriptor\n1 key per column"]
Chunks["Data Chunks\nN keys per append"]
Indexes["Indexes\n1 key per index"]
end
    
 
   A1 --> Desc
 
   A2 --> Chunks
 
   A1 --> Indexes
    
 
   Desc -.-> F1
 
   Chunks -.-> F2

Allocation Patterns

Sources: llkv-column-map/src/store/core.rs:250-263 llkv-column-map/src/store/core.rs:989-1006

Allocation Examples

The ColumnStore::append method allocates keys in bulk, while ColumnStore::remove_column frees every key associated with a column:
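
Both patterns can be sketched against a toy pager (the free-list below is an illustration, not the SIMD R-Drive allocator):

type PhysicalKey = u64;

/// Toy allocator standing in for the pager's free-list; illustration only.
struct ToyPager {
    next: PhysicalKey,
    freed: Vec<PhysicalKey>,
}

impl ToyPager {
    /// Bulk allocation: one call returns keys for every chunk in the append.
    fn alloc_many(&mut self, n: usize) -> Vec<PhysicalKey> {
        (0..n).map(|_| { let k = self.next; self.next += 1; k }).collect()
    }

    /// Bulk free: dropping a column returns all of its keys in one call.
    fn free_many(&mut self, keys: &[PhysicalKey]) {
        self.freed.extend_from_slice(keys);
    }
}

fn main() {
    let mut pager = ToyPager { next: 1, freed: Vec::new() };
    let chunk_keys = pager.alloc_many(4); // e.g. value + row-id chunks for an append
    pager.free_many(&chunk_keys);         // e.g. remove_column cleanup
    assert_eq!(pager.freed.len(), 4);
}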

Sources: llkv-column-map/src/store/core.rs:990-1005 llkv-column-map/src/store/core.rs:563-587


graph TB
    subgraph "User Space"
        Pager["Pager API"]
EntryHandle["EntryHandle\n(zero-copy view)"]
end
    
    subgraph "SIMD R-Drive"
        PageTable["Page Table\n(key → page mapping)"]
FreeList["Free List\n(available keys)"]
Allocator["Allocator\n(SIMD-optimized)"]
end
    
    subgraph "Operating System"
        MMap["mmap()
system call"]
FileBackend["Backing File"]
end
    
 
   Pager --> EntryHandle
 
   Pager --> PageTable
 
   Pager --> FreeList
 
   PageTable --> Allocator
 
   FreeList --> Allocator
    
 
   Allocator --> MMap
 
   MMap --> FileBackend
    EntryHandle -.references.-> MMap

SIMD R-Drive Implementation

The simd-r-drive crate (version 0.15.5) provides a memory-mapped key-value store optimized with SIMD instructions.

Memory-Mapped Architecture

Sources: Cargo.toml:85-86 llkv-column-map/src/store/core.rs:22-25

EntryHandle: Zero-Copy Blobs

The EntryHandle type implements AsRef<[u8]> and provides direct access to memory-mapped regions without copying data. When deserializing Arrow arrays, the system reads directly from the mapped memory:
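
A minimal sketch of why the AsRef<[u8]> bound is enough for zero-copy reads (the header layout here is invented purely for illustration; real chunks hold serialized Arrow arrays):

/// Works for any zero-copy blob such as EntryHandle; the bytes are borrowed,
/// not copied, so large chunks never pass through an intermediate buffer.
fn chunk_row_count<B: AsRef<[u8]>>(blob: &B) -> usize {
    let bytes: &[u8] = blob.as_ref();
    // Stand-in for Arrow deserialization: pretend the chunk stores a u64 row
    // count in its first eight bytes.
    let mut header = [0u8; 8];
    header.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(header) as usize
}

fn main() {
    // A Vec<u8> also implements AsRef<[u8]>, so it can stand in for EntryHandle here.
    let mut blob = 3u64.to_le_bytes().to_vec();
    blob.extend_from_slice(b"...payload...");
    assert_eq!(chunk_row_count(&blob), 3);
}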

Sources: llkv-column-map/src/store/core.rs:60-89 llkv-column-map/src/store/core.rs:196-210

SIMD Optimization Benefits

The SIMD R-Drive uses vectorized instructions for:

  1. Page Scanning : Parallel search for keys in page tables
  2. Free List Management : Vectorized bitmap operations for allocation
  3. Data Movement : SIMD memcpy for large blobs

While the Rust code in this repository doesn’t directly show SIMD instructions, the external simd-r-drive dependency provides these optimizations transparently through the Pager trait.

Sources: Cargo.toml:85 Cargo.lock:671-678


Integration with ColumnStore

The ColumnStore uses the pager for all persistent state, following consistent patterns across operations.

graph LR
    Catalog["ColumnCatalog\n(at key 0)"]
Catalog -->|LogicalFieldId → pk_data| DataDesc["Data Descriptor\n(at pk_data)"]
Catalog -->|LogicalFieldId → pk_rid| RidDesc["RowID Descriptor\n(at pk_rid)"]
DataDesc -->|linked list| Page1["Descriptor Page\nChunkMetadata[]"]
Page1 -->|next_page_pk| Page2["Descriptor Page\nChunkMetadata[]"]
RidDesc -->|linked list| Page3["Descriptor Page\nChunkMetadata[]"]

Storage Pattern: Column Descriptors

Each column has two descriptors (one for data, one for row IDs) stored at fixed physical keys tracked in the catalog:

Sources: llkv-column-map/src/store/core.rs:100-124 llkv-column-map/src/store/core.rs:234-344

graph TB
    ChunkMeta["ChunkMetadata"]
ChunkMeta -->|chunk_pk| DataBlob["Serialized Arrow Array\n(at chunk_pk)"]
ChunkMeta -->|value_order_perm_pk| PermBlob["Sort Permutation\n(at perm_pk)"]
DataBlob -.->|deserialized to| ArrayRef["ArrayRef\n(in-memory)"]

Storage Pattern: Data Chunks

Each chunk of Arrow array data is stored at a physical key referenced by ChunkMetadata:

Sources: llkv-column-map/src/store/core.rs:988-1006 llkv-column-map/src/store/descriptor.rs (implied)

sequenceDiagram
    participant App as Application
    participant CS as ColumnStore
    participant Pager as Pager
    
    App->>CS: append(RecordBatch)
    
    Note over CS: Stage Phase
    CS->>CS: serialize data chunks
    CS->>CS: serialize row-id chunks
    CS->>CS: update descriptors
    CS->>CS: update catalog
    CS->>CS: accumulate BatchPut[]
    
    Note over CS: Commit Phase
    CS->>Pager: batch_put(all_puts)
    Pager->>Pager: atomic commit
    
    alt Success
        Pager-->>CS: Ok(())
        CS->>CS: increment append_epoch
        CS-->>App: Ok(())
    else Failure
        Pager-->>CS: Err(...)
        CS-->>App: Err(...)
        Note over CS: No changes visible
    end

Transaction Boundary Example

The append operation demonstrates how the pager enables atomic multi-key transactions:

Sources: llkv-column-map/src/store/core.rs:787-1126

The key insight: all puts are staged in memory (Vec<BatchPut>) until a single atomic commit at line 1121. This ensures that partial failures leave the store in a consistent state.


Performance Characteristics

Batch Size Recommendations

The ColumnStore follows these heuristics:

| Operation | Typical Batch Size | Justification |
| --- | --- | --- |
| Catalog Load | 1 key | Single root key at startup |
| Descriptor Load | 2 keys | Data + RowID descriptors |
| Chunk Append | 10-100 keys | Multiple columns × chunks per column |
| Chunk Rewrite | 1-10 keys | Targeted LWW updates |
| Full Scan | 100+ keys | Prefetch all chunks for a column |

Sources: llkv-column-map/src/store/core.rs:100-110 llkv-column-map/src/store/core.rs:1117-1125

Memory Mapping Benefits

Memory-mapped storage provides:

  1. Zero-Copy Deserialization : Arrow arrays reference mapped memory directly
  2. OS-Level Caching : The kernel manages page cache automatically
  3. Lazy Loading : Pages are only faulted in when accessed

The EntryHandle type enables these patterns by avoiding buffer copies during reads.

Sources: llkv-column-map/src/store/core.rs:60-85 llkv-column-map/src/store/scan/filter.rs:18

Allocation Strategies

The system uses bulk allocation to reduce pager round-trips:

Sources: llkv-column-map/src/store/core.rs:989-1006


Error Handling and Recovery

Allocation Failures

When allocation fails mid-operation, the system attempts to roll back:
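
The exact recovery logic is internal to core.rs; one plausible shape of it, sketched here with closures standing in for the pager calls, is to free whatever was already allocated before surfacing the error:

type PhysicalKey = u64;

/// Illustration only: if a later step fails, keys allocated earlier in the
/// operation are handed back to the pager so nothing leaks.
fn allocate_with_rollback(
    alloc_many: impl Fn(usize) -> Result<Vec<PhysicalKey>, String>,
    free_many: impl Fn(&[PhysicalKey]),
    persist: impl Fn(&[PhysicalKey]) -> Result<(), String>,
) -> Result<Vec<PhysicalKey>, String> {
    let keys = alloc_many(8)?;
    if let Err(e) = persist(&keys) {
        free_many(&keys); // roll back the allocation before surfacing the error
        return Err(e);
    }
    Ok(keys)
}

fn main() {
    let result = allocate_with_rollback(
        |n| Ok((0..n as PhysicalKey).collect()),
        |keys: &[PhysicalKey]| println!("rolled back {} keys", keys.len()),
        |_keys: &[PhysicalKey]| Err("write failed".to_string()),
    );
    assert!(result.is_err());
}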

Sources: llkv-column-map/src/store/core.rs:563-587

Batch Operation Atomicity

The pager guarantees that either all keys in a batch_put are written or none are. This prevents partial writes from corrupting the storage state. The ColumnStore relies on this guarantee to implement transactional append and update operations.

Sources: llkv-column-map/src/store/core.rs:1117-1125


Summary

The Pager interface provides a minimal, batch-oriented abstraction over persistent key-value storage. The SIMD R-Drive implementation uses memory-mapped files and vectorized operations to optimize common patterns like bulk reads and zero-copy deserialization. By staging all writes and committing atomically, the ColumnStore builds transactional semantics on top of the simple key-value model.

Key Takeaways:

  • All I/O flows through batch operations (batch_get, batch_put)
  • Physical keys (u64) uniquely identify all persistent objects
  • EntryHandle enables zero-copy access to memory-mapped data
  • Bulk allocation (alloc_many) minimizes pager round-trips
  • Atomic batch commits enable transactional storage operations

Sources: llkv-column-map/src/store/core.rs:60-1126 Cargo.toml:85-86 llkv-storage (trait definitions implied by usage)

Catalog and Metadata Management

Relevant source files

Purpose and Scope

This document explains the catalog and metadata infrastructure that tracks all tables, columns, indexes, constraints, and custom types in the LLKV system. The catalog provides schema information and manages the lifecycle of database objects.

For details on the table creation and management API, see CatalogManager API. For implementation details of how metadata is stored, see System Catalog and SysCatalog. For custom type definitions and aliases, see Custom Types and Type Registry.

Overview

The LLKV catalog is a self-describing system : all metadata about tables, columns, indexes, and constraints is stored as structured records in Table 0 , which is itself a table managed by the same ColumnStore that manages user data. This bootstrapped design means the catalog uses the same storage primitives as regular tables.

The catalog provides:

  • Schema tracking : Table and column metadata with names and types
  • Index registry : Persisted sort indexes and multi-column indexes
  • Constraint metadata : Primary keys, foreign keys, unique constraints, and check constraints
  • Trigger definitions : Event triggers with timing and execution metadata
  • Custom type registry : User-defined type aliases

Self-Describing Architecture

Sources: llkv-table/src/lib.rs:1-98 llkv-table/src/table.rs:499-511

Metadata Types

The catalog stores several types of metadata records, each representing a different aspect of database schema and configuration.

Core Metadata Records

| Metadata Type | Description | Key Fields |
| --- | --- | --- |
| TableMeta | Table definitions | table_id, name, schema |
| ColMeta | Column definitions | col_id, name, flags, default |
| SingleColumnIndexEntryMeta | Single-column index registry | field_id, index_kind |
| MultiColumnIndexEntryMeta | Multi-column index registry | field_ids, index_kind |
| TriggerEntryMeta | Trigger definitions | trigger_id, timing, event |
| CustomTypeMeta | User-defined types | type_name, base_type |

Constraint Metadata

Constraints are stored as specialized metadata records that enforce referential integrity and data validation:

| Constraint Type | Metadata Structure |
| --- | --- |
| Primary Key | PrimaryKeyConstraint - columns forming the primary key |
| Foreign Key | ForeignKeyConstraint - parent/child table references and actions |
| Unique | UniqueConstraint - columns with unique constraint |
| Check | CheckConstraint - validation expression |

Sources: llkv-table/src/lib.rs:81-86 llkv-table/src/lib.rs:56-67

Table ID Ranges

Table IDs are partitioned into reserved ranges to distinguish system tables from user tables and temporary objects.

graph LR
    subgraph "Table ID Space"
        CATALOG["0\nSystem Catalog"]
USER["1-999\nUser Tables"]
INFOSCHEMA["1000+\nInformation Schema"]
TEMP["10000+\nTemporary Tables"]
end
    
    CATALOG -.special case.-> CATALOG
    USER -.normal tables.-> USER
    INFOSCHEMA -.system views.-> INFOSCHEMA
    TEMP -.session-local.-> TEMP

Reserved Table ID Constants

The is_reserved_table_id() function checks whether a table ID is in the reserved range (Table 0). User code cannot directly instantiate Table objects for reserved IDs.

Sources: llkv-table/src/lib.rs:75-78 llkv-table/src/table.rs:110-126

Storage Architecture

Catalog as a Table

The system catalog is physically stored as Table 0 in the ColumnStore. Each metadata type (table definitions, column definitions, indexes, etc.) is stored as a column in this special table, with each row representing one metadata record.

graph TB
    subgraph "Logical View"
        CATALOG["System Catalog API\n(SysCatalog)"]
end
    
    subgraph "Physical Storage"
        TABLE0["Table 0"]
COLS["Columns:\n- table_meta\n- col_meta\n- index_meta\n- trigger_meta\n- constraint_meta"]
end
    
    subgraph "ColumnStore Layer"
        DESCRIPTORS["Column Descriptors"]
CHUNKS["Data Chunks\n(Serialized Arrow)"]
end
    
    subgraph "Pager Layer"
        KVSTORE["Key-Value Store"]
end
    
 
   CATALOG --> TABLE0
 
   TABLE0 --> COLS
 
   COLS --> DESCRIPTORS
 
   DESCRIPTORS --> CHUNKS
 
   CHUNKS --> KVSTORE

Accessing the Catalog

The Table::catalog() method provides access to the system catalog without exposing the underlying table structure. The SysCatalog type wraps the ColumnStore and provides typed methods for reading and writing metadata.

The get_table_meta() and get_cols_meta() convenience methods delegate to the catalog:
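
A simplified sketch of that delegation, using empty stand-in types (the real methods and their signatures live in llkv-table/src/table.rs):

// Stand-ins for illustration; the real types live in llkv-table.
struct TableMeta;
struct ColMeta;

struct SysCatalog;
impl SysCatalog {
    fn table_meta(&self, _table_id: u64) -> Option<TableMeta> { Some(TableMeta) }
    fn col_metas(&self, _table_id: u64) -> Vec<ColMeta> { vec![ColMeta] }
}

struct Table { table_id: u64, catalog: SysCatalog }
impl Table {
    /// Convenience accessors that simply forward to the system catalog.
    fn get_table_meta(&self) -> Option<TableMeta> { self.catalog.table_meta(self.table_id) }
    fn get_cols_meta(&self) -> Vec<ColMeta> { self.catalog.col_metas(self.table_id) }
}

fn main() {
    let table = Table { table_id: 7, catalog: SysCatalog };
    assert!(table.get_table_meta().is_some());
    assert_eq!(table.get_cols_meta().len(), 1);
}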

Sources: llkv-table/src/table.rs:499-511

Metadata Operations

Table Creation

Table creation is coordinated by the CatalogManager, which handles metadata persistence, catalog registration, and storage initialization. The Table type provides factory methods that delegate to the catalog manager:

This factory pattern ensures that table creation is properly coordinated across three layers:

  1. MetadataManager : Assigns table IDs and tracks metadata
  2. TableCatalog : Maintains name-to-ID mappings
  3. ColumnStore : Initializes physical storage

Sources: llkv-table/src/table.rs:80-103

Defensive Metadata Persistence

When appending data, the Table::append() method defensively persists column names to the catalog if they’re missing. This ensures metadata consistency even when batches arrive with only field_id metadata and no column names.

This defensive approach handles cases like CSV import where column names are known but may not have been explicitly registered in the catalog.

Sources: llkv-table/src/table.rs:327-344

sequenceDiagram
    participant Client
    participant Table
    participant ColumnStore
    participant Catalog
    
    Client->>Table: schema()
    Table->>ColumnStore: user_field_ids_for_table(table_id)
    ColumnStore-->>Table: logical_fields: [LogicalFieldId]
    
    Table->>Catalog: get_cols_meta(field_ids)
    Catalog-->>Table: metas: [ColMeta]
    
    loop "For each field"
        Table->>ColumnStore: data_type(lfid)
        ColumnStore-->>Table: DataType
        Table->>Table: Build Field with metadata
    end
    
    Table-->>Client: Arc<Schema>

Schema Resolution

The Table::schema() method constructs an Arrow Schema by querying the catalog for column metadata and combining it with physical data type information from the ColumnStore.

The resulting schema includes:

  • row_id field (always first)
  • User-defined columns with names from the catalog
  • field_id stored in field metadata for each column

Sources: llkv-table/src/table.rs:519-549

Index Registration

The catalog tracks persisted sort indexes for columns, allowing efficient range scans and ordered reads.

Registering Indexes

Listing Indexes

Index metadata is stored in the catalog and used by the query planner to optimize scan operations.

Sources: llkv-table/src/table.rs:145-173

Integration with CSV Import/Export

CSV import and export operations rely on the catalog to resolve column names and field IDs. The CsvWriter queries the catalog when building projections to ensure that columns are properly aliased.

This integration ensures that exported CSV files have human-readable column headers even when the underlying storage uses numeric field IDs.

Sources: llkv-csv/src/writer.rs:320-368

graph TB
    subgraph "SQL Layer"
        SQLENGINE["SqlEngine"]
PLANNER["Query Planner"]
end
    
    subgraph "Catalog Layer"
        CATALOG["CatalogManager"]
METADATA["MetadataManager"]
SYSCAT["SysCatalog"]
end
    
    subgraph "Table Layer"
        TABLE["Table"]
CONSTRAINTS["ConstraintService"]
end
    
    subgraph "Storage Layer"
        COLSTORE["ColumnStore"]
PAGER["Pager"]
end
    
 
   SQLENGINE -->|CREATE TABLE| CATALOG
 
   SQLENGINE -->|ALTER TABLE| CATALOG
 
   SQLENGINE -->|DROP TABLE| CATALOG
    
 
   CATALOG --> METADATA
 
   CATALOG --> SYSCAT
 
   CATALOG --> TABLE
    
 
   TABLE --> SYSCAT
 
   TABLE --> CONSTRAINTS
 
   TABLE --> COLSTORE
    
 
   SYSCAT --> COLSTORE
 
   COLSTORE --> PAGER
    
    PLANNER -.resolves schema.-> CATALOG
    CONSTRAINTS -.validates.-> SYSCAT

Relationship to Other Systems

The catalog sits at the center of the LLKV architecture, connecting several subsystems:

  • SQL Layer : Issues DDL commands that modify the catalog
  • Query Planner : Resolves table and column names via the catalog
  • Table Layer : Queries metadata during data operations
  • Constraint Layer : Uses catalog to track and enforce constraints
  • Storage Layer : Physically persists catalog records as Table 0

Sources: llkv-table/src/lib.rs:34-98

CatalogManager API

Relevant source files

Purpose and Scope

This page documents the CatalogManager API, which provides the primary interface for table lifecycle management in LLKV. The CatalogManager coordinates table creation, modification, deletion, and schema management operations. It serves as the bridge between high-level DDL operations and the low-level storage of metadata in the system catalog.

For details on how metadata is physically stored, see System Catalog and SysCatalog. For information about custom type definitions, see Custom Types and Type Registry.

Overview

The CatalogManager is the central coordinator for all catalog operations in LLKV. It manages:

  • Table ID allocation : Assigns unique identifiers to new tables
  • Schema registration : Validates and stores table schemas with Arrow integration
  • Field ID mapping : Assigns logical field IDs to columns and maintains resolution
  • Index registration : Tracks single-column and multi-column index metadata
  • Metadata snapshots : Provides consistent views of catalog state
  • DDL coordination : Orchestrates CREATE/ALTER/DROP operations
graph TB
    subgraph "High-Level Operations"
        DDL["DDL Statements\n(CREATE/ALTER/DROP)"]
QUERY["Query Planning\n(Name Resolution)"]
EXEC["Query Execution\n(Schema Access)"]
end
    
    subgraph "CatalogManager Layer"
        CM["CatalogManager"]
SNAPSHOT["TableCatalogSnapshot"]
RESOLVER["FieldResolver"]
RESULT["CreateTableResult"]
end
    
    subgraph "Metadata Structures"
        TABLEMETA["TableMeta"]
COLMETA["ColMeta"]
INDEXMETA["Index Descriptors"]
SCHEMA["Arrow Schema"]
end
    
    subgraph "Persistence Layer"
        SYSCAT["SysCatalog\n(Table 0)"]
STORE["ColumnStore"]
end
    
 
   DDL --> CM
 
   QUERY --> CM
 
   EXEC --> CM
    
 
   CM --> SNAPSHOT
 
   CM --> RESOLVER
 
   CM --> RESULT
    
 
   CM --> TABLEMETA
 
   CM --> COLMETA
 
   CM --> INDEXMETA
 
   CM --> SCHEMA
    
 
   TABLEMETA --> SYSCAT
 
   COLMETA --> SYSCAT
 
   INDEXMETA --> SYSCAT
    
 
   SYSCAT --> STORE

The CatalogManager maintains an in-memory cache of catalog metadata for performance while delegating persistence to SysCatalog (table 0).

Sources: llkv-table/src/lib.rs:1-98

Core Types

CatalogManager

The CatalogManager struct is the primary API for catalog operations. While the exact implementation is in the catalog module, it is exported from the main crate interface.

Key Responsibilities:

  • Maintains in-memory catalog cache
  • Allocates table and field IDs
  • Validates schema changes
  • Coordinates with SysCatalog for persistence
  • Provides snapshot isolation for metadata reads

CreateTableResult

Returned by table creation operations, this structure contains:

  • The newly assigned TableId
  • The created Table instance
  • Initial field ID mappings
  • Registration confirmation

TableCatalogSnapshot

A consistent, immutable view of catalog metadata at a specific point in time. Used to ensure that query planning and execution see a stable view of table schemas even as concurrent DDL operations occur.

Properties:

  • Immutable after creation
  • Contains table and column metadata
  • Includes field ID mappings
  • May include index registrations

FieldResolver

Provides mapping from string column names to FieldId identifiers. This is critical for translating SQL column references into the internal field ID system used by the storage layer.

Functionality:

  • Resolves qualified names (e.g., table.column)
  • Handles field aliases
  • Supports case-insensitive lookups (depending on configuration)
  • Validates field existence

Sources: llkv-table/src/lib.rs:54-55 llkv-table/src/lib.rs:79-80

Table Lifecycle Operations

CREATE TABLE

The CatalogManager handles table creation through a multi-step process:

  1. Validation : Checks table name uniqueness and schema validity
  2. ID Allocation : Assigns a new TableId from available range
  3. Field ID Assignment : Maps each column to a unique FieldId
  4. Schema Storage : Registers Arrow schema with type information
  5. Metadata Persistence : Writes TableMeta and ColMeta to system catalog
  6. Table Instantiation : Creates Table instance backed by ColumnStore

Table ID Ranges:

| Range | Purpose | Constant |
| --- | --- | --- |
| 0 | System Catalog | CATALOG_TABLE_ID |
| 1-999 | User Tables | - |
| 1000-9999 | Information Schema | INFORMATION_SCHEMA_TABLE_ID_START |
| 10000+ | Temporary Tables | TEMPORARY_TABLE_ID_START |

The CatalogManager ensures IDs are allocated from the appropriate range based on table type.

DROP TABLE

Table deletion involves:

  1. Dependency Checking : Validates no foreign keys reference the table
  2. Metadata Removal : Deletes entries from system catalog
  3. Storage Cleanup : Marks column store data for cleanup (may be deferred)
  4. Cache Invalidation : Removes table from in-memory cache

The operation is typically transactional - either all steps succeed or the table remains.

ALTER TABLE Operations

The CatalogManager coordinates schema modifications, though validation is delegated to specialized functions. Operations include:

  • ADD COLUMN : Assigns new FieldId, updates schema
  • DROP COLUMN : Validates no dependencies, marks column deleted
  • ALTER COLUMN TYPE : Validates type compatibility, updates metadata
  • RENAME COLUMN : Updates name mappings
sequenceDiagram
    participant Client
    participant CM as CatalogManager
    participant Validator
    participant SysCat as SysCatalog
    participant Store as ColumnStore
    
    Client->>CM: create_table(name, schema)
    CM->>CM: Allocate TableId
    CM->>CM: Assign FieldIds
    CM->>Validator: Validate schema
    Validator-->>CM: OK
    CM->>SysCat: Write TableMeta
    CM->>SysCat: Write ColMeta records
    SysCat-->>CM: Persisted
    CM->>Store: Initialize ColumnStore
    Store-->>CM: Table handle
    CM->>CM: Update cache
    CM-->>Client: CreateTableResult

The validate_alter_table_operation function (referenced in exports) performs constraint checks before modifications are committed.

Sources: llkv-table/src/lib.rs:22-27 llkv-table/src/lib.rs:76-78

Schema and Field Management

Field ID Assignment

Every column in LLKV is assigned a unique FieldId at creation time. This numeric identifier:

  • Persists across schema changes : Remains stable even if column is renamed
  • Enables versioning : Different table versions can reference same field ID
  • Optimizes storage : Physical storage keys use field IDs, not string names
  • Supports MVCC : System columns like created_by have reserved field IDs

The CatalogManager maintains a monotonic counter per table to allocate field IDs sequentially.

Arrow Schema Integration

The CatalogManager integrates tightly with Apache Arrow schemas:

  • Validates Arrow DataType compatibility
  • Maps Arrow fields to FieldId assignments
  • Stores schema metadata in serialized form
  • Reconstructs Arrow Schema from stored metadata

This allows LLKV to leverage Arrow’s type system while maintaining its own field ID system for storage efficiency.

FieldResolver API

The FieldResolver is obtained from a TableCatalogSnapshot and provides:

resolve(column_name: &str) -> Result<FieldId>
resolve_qualified(table_name: &str, column_name: &str) -> Result<FieldId>
get_field_name(field_id: FieldId) -> Option<&str>
get_field_type(field_id: FieldId) -> Option<&DataType>

This bidirectional mapping supports both query translation (name → ID) and result formatting (ID → name).

Sources: llkv-table/src/lib.rs:3-21 llkv-table/src/lib.rs:54

Index Registration

Single-Column Indexes

The SingleColumnIndexDescriptor and SingleColumnIndexRegistration types manage metadata for indexes on individual columns:

SingleColumnIndexDescriptor:

  • Field ID being indexed
  • Index type (e.g., BTree, Hash)
  • Index-specific parameters
  • Creation timestamp

SingleColumnIndexRegistration:

  • Links table to index descriptor
  • Tracks index state (building, ready, failed)
  • Stores index metadata in system catalog

The CatalogManager maintains a registry of active indexes and coordinates their creation and maintenance.

Multi-Column Indexes

For composite indexes spanning multiple columns, the MultiColumnUniqueRegistration type (referenced in exports) provides similar functionality with support for:

  • Multiple field IDs in index key
  • Column ordering
  • Uniqueness constraints
  • Compound key generation

Sources: llkv-table/src/lib.rs:55 llkv-table/src/lib.rs:73

Metadata Snapshots

Snapshot Creation

A TableCatalogSnapshot provides a consistent view of catalog state. Snapshots are created:

  • On Demand : When planning a query
  • Periodically : For long-running operations
  • At Transaction Start : For transaction isolation

The snapshot is immutable and won’t reflect concurrent DDL changes, ensuring query planning sees a stable schema.

Snapshot Contents

A snapshot typically includes:

  • All TableMeta records (table definitions)
  • All ColMeta records (column definitions)
  • Field ID mappings for all tables
  • Index registrations (optional)
  • Custom type definitions (optional)
  • Constraint metadata (optional)

Cache Invalidation

When the CatalogManager modifies metadata:

  1. Updates system catalog (table 0)
  2. Increments epoch/version counter
  3. Invalidates stale snapshots
  4. Updates in-memory cache

Existing snapshots remain valid but represent a previous version. New snapshots will reflect the changes.

Sources: llkv-table/src/lib.rs:54

Integration with SysCatalog

The CatalogManager uses SysCatalog (documented in System Catalog and SysCatalog) as its persistence layer:

Write Operations

  • CREATE TABLE : Writes TableMeta and ColMeta records
  • ALTER TABLE : Updates existing metadata records
  • DROP TABLE : Marks metadata as deleted
  • Index Registration : Writes index descriptor records

Read Operations

  • Snapshot Creation : Reads all metadata records
  • Table Lookup : Queries TableMeta by name or ID
  • Field Resolution : Retrieves ColMeta for a table
  • Index Discovery : Loads index descriptors
sequenceDiagram
    participant Runtime
    participant CM as CatalogManager
    participant Cache as "In-Memory Cache"
    participant SC as SysCatalog
    participant Store as ColumnStore
    
    Runtime->>CM: Initialize
    CM->>SC: Bootstrap table 0
    SC->>Store: Initialize ColumnStore(0)
    Store-->>SC: Ready
    SC-->>CM: SysCatalog ready
    
    CM->>SC: Read all TableMeta
    SC->>Store: scan(TableMeta)
    Store-->>SC: RecordBatch
    SC-->>CM: Vec<TableMeta>
    
    CM->>SC: Read all ColMeta
    SC->>Store: scan(ColMeta)
    Store-->>SC: RecordBatch
    SC-->>CM: Vec<ColMeta>
    
    CM->>Cache: Populate
    Cache-->>CM: Loaded
    
    CM-->>Runtime: CatalogManager ready

Bootstrapping

On system startup:

  1. SysCatalog initializes (creates table 0 if needed)
  2. CatalogManager reads all metadata from table 0
  3. In-memory cache is populated
  4. System is ready for operations

Sources: llkv-table/src/lib.rs:30-31 llkv-table/src/lib.rs:81-85

Usage Patterns

Creating a Table

// Typical usage pattern (conceptual)
let result = catalog_manager.create_table(
    table_name,
    schema,         // Arrow Schema
    table_id_hint   // Optional TableId preference
)?;

let table: Table = result.table;
let table_id: TableId = result.table_id;
let field_ids: HashMap<String, FieldId> = result.field_mappings;

Resolving Column Names

// Get a snapshot for consistent reads
let snapshot = catalog_manager.snapshot();

// Resolve column references
let resolver = snapshot.field_resolver(table_id)?;
let field_id = resolver.resolve("column_name")?;

// Use field_id in storage operations

Registering an Index

// Register a single-column index
let descriptor = SingleColumnIndexDescriptor::new(
    field_id,
    IndexType::BTree,
    options
);

catalog_manager.register_index(
    table_id,
    descriptor
)?;

Checking Metadata Changes

// Capture snapshot version
let snapshot_v1 = catalog_manager.snapshot();
let version_1 = snapshot_v1.version();

// ... DDL operations occur ...

// Create new snapshot and check for changes
let snapshot_v2 = catalog_manager.snapshot();
let version_2 = snapshot_v2.version();

if version_1 != version_2 {
    // Metadata has changed, invalidate plans
}

Sources: llkv-table/src/lib.rs:54-89

Thread Safety and Concurrency

The CatalogManager typically uses interior mutability (e.g., RwLock or Mutex) to allow:

  • Concurrent Reads : Multiple threads can read snapshots simultaneously
  • Exclusive Writes : DDL operations acquire exclusive locks
  • Snapshot Isolation : Snapshots remain valid even during concurrent DDL

This design allows high read concurrency while ensuring DDL operations are serialized and atomic.

Beyond concurrency control, the CatalogManager coordinates with several related modules:

  • catalog module: Contains the implementation (not shown in provided files)
  • sys_catalog module: Persistence layer for metadata [llkv-table/src/sys_catalog.rs]
  • metadata module: Extended metadata management [llkv-table/src/metadata.rs]
  • ddl module: DDL-specific helpers [llkv-table/src/ddl.rs]
  • resolvers module: Name resolution utilities [llkv-table/src/resolvers.rs]
  • constraints module: Constraint validation [llkv-table/src/constraints.rs]

Sources: llkv-table/src/lib.rs:34-46 llkv-table/src/lib.rs:68-69 llkv-table/src/lib.rs:74 llkv-table/src/lib.rs:79

System Catalog and SysCatalog

Relevant source files

Purpose and Scope

The System Catalog (SysCatalog) is LLKV’s metadata repository that stores all table definitions, column schemas, indexes, constraints, triggers, and custom types. It provides a self-describing, bootstrapped metadata system where all metadata is stored as structured records in Table 0 using the same columnar storage infrastructure as user data.

This page covers the system catalog’s architecture, metadata structures, and how it integrates with DDL operations. For the higher-level catalog management API, see CatalogManager API. For custom type definitions and aliases, see Custom Types and Type Registry.

Sources: llkv-table/src/lib.rs:1-32

Self-Describing Architecture

The system catalog is implemented as Table 0, a special reserved table that stores metadata about all other tables (including itself). This creates a bootstrapped, self-referential system where the catalog uses the same storage mechanisms it documents.

Key architectural characteristics:

graph TB
    subgraph "Table 0 - System Catalog"
        SysCatalog["SysCatalog struct\n(sys_catalog.rs)"]
Table0["Underlying Table\ntable_id = 0"]
ColumnStore0["ColumnStore\nPhysical Storage"]
end
    
    subgraph "Metadata Record Types"
        TableMeta["TableMeta Records\ntable_id, name, schema"]
ColMeta["ColMeta Records\nfield_id, table_id, name, type"]
CustomTypeMeta["CustomTypeMeta Records\nCustom type definitions"]
IndexMeta["Index Metadata\nSingle & Multi-Column"]
TriggerMeta["TriggerMeta Records\nBEFORE/AFTER triggers"]
end
    
    subgraph "User Tables"
        Table1["User Table 1\ntable_id = 1"]
Table2["User Table 2\ntable_id = 2"]
TableN["User Table N\ntable_id = N"]
end
    
 
   SysCatalog --> Table0
 
   Table0 --> ColumnStore0
    
 
   TableMeta --> Table0
 
   ColMeta --> Table0
 
   CustomTypeMeta --> Table0
 
   IndexMeta --> Table0
 
   TriggerMeta --> Table0
    
    TableMeta -.describes.-> Table1
    TableMeta -.describes.-> Table2
    TableMeta -.describes.-> TableN
    
    ColMeta -.describes columns.-> Table1
    ColMeta -.describes columns.-> Table2

| Property | Description |
| --- | --- |
| Table ID | Always CATALOG_TABLE_ID (0) |
| Self-describing | Metadata about Table 0 is stored in Table 0 itself |
| Schema consistency | Uses the same Table and ColumnStore abstractions as user tables |
| Transactional | All metadata changes are transactional via MVCC |
| Queryable | Can be queried like any other table (with appropriate permissions) |

Sources: llkv-table/src/lib.rs:7-31 llkv-table/src/sys_catalog.rs

Table ID Ranges and Reservations

LLKV partitions the table ID space into reserved ranges for different purposes:

Constants and predicates:

graph LR
    subgraph "Table ID Ranges"
        Range0["ID 0\nCATALOG_TABLE_ID\n(System Catalog)"]
Range1["IDs 1-999\n(User Tables)"]
Range2["IDs 1000+\nINFORMATION_SCHEMA_TABLE_ID_START\n(Information Schema)"]
Range3["IDs 10000+\nTEMPORARY_TABLE_ID_START\n(Temporary Tables)"]
end
    
 
   Range0 --> SysCheck{is_reserved_table_id}
Range1 --> UserCheck{User table}
Range2 --> InfoCheck{is_information_schema_table}
Range3 --> TempCheck{Temporary table}

| Constant/Function | Value/Purpose | Location |
| --- | --- | --- |
| CATALOG_TABLE_ID | 0 - System catalog table | llkv-table/src/reserved.rs |
| INFORMATION_SCHEMA_TABLE_ID_START | 1000 - Information schema views start | llkv-table/src/reserved.rs |
| TEMPORARY_TABLE_ID_START | 10000 - Temporary tables start | llkv-table/src/reserved.rs |
| is_reserved_table_id(id) | Returns true if ID is reserved | llkv-table/src/reserved.rs |
| is_information_schema_table(id) | Returns true if ID is info schema | llkv-table/src/reserved.rs |

This partitioning allows the system to quickly identify table categories and apply appropriate handling (e.g., temporary tables are not persisted across sessions).

Sources: llkv-table/src/lib.rs:23-27 llkv-table/src/lib.rs:75-78 llkv-table/src/reserved.rs

Core Metadata Structures

The system catalog stores multiple types of metadata records, each represented by a specific struct that defines a schema for that metadata type.

TableMeta Structure

TableMeta records describe table-level metadata:
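
The exact struct lives in sys_catalog.rs; the following is an approximate sketch assembled from the fields listed in the table below (names and types are assumptions):

// Approximate shape only; the real TableMeta is defined in llkv-table.
struct TableMeta {
    table_id: u64,               // TableId
    table_name: String,          // unqualified table name
    schema_name: Option<String>, // optional schema / namespace
    // plus enough schema information to reconstruct the Arrow Schema and the
    // ordered list of FieldIds exposed via schema() / field_ids()
}

fn main() {
    let meta = TableMeta {
        table_id: 1,
        table_name: "users".to_string(),
        schema_name: None,
    };
    assert_eq!(meta.table_id, 1);
}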

Key operations:

| Method | Purpose | Returns |
| --- | --- | --- |
| schema() | Get Arrow schema for table | Schema |
| field_ids() | Get ordered list of field IDs | Vec<FieldId> |
| table_id | Unique table identifier | TableId |
| table_name | Table name (unqualified) | String |
| schema_name | Optional schema/namespace | Option<String> |

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs

ColMeta Structure

ColMeta records describe individual column metadata:

Each ColMeta record associates a FieldId (the internal column identifier) with a TableId and provides the column’s name, type, nullability, and position within the table schema.

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs

Index Metadata Structures

Index metadata is stored in two forms:

Index metadata types:

| Structure | Purpose | Key Fields |
| --- | --- | --- |
| SingleColumnIndexEntryMeta | Describes a single-column index | table_id, field_id, index_name |
| TableSingleColumnIndexMeta | Table's single-column index map | HashMap<FieldId, IndexDescriptor> |
| MultiColumnIndexEntryMeta | Describes a multi-column index | table_id, field_ids[], index_name |
| TableMultiColumnIndexMeta | Table's multi-column index map | Vec<MultiColumnIndexDescriptor> |

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs llkv-table/src/catalog.rs

Trigger Metadata Structures

Triggers are stored with timing and event specifications:

Each trigger is identified by name and associated with a specific table, timing (BEFORE/AFTER), and event (INSERT/UPDATE/DELETE).

Sources: llkv-table/src/sys_catalog.rs

Custom Type Metadata

Custom types and type aliases are stored as CustomTypeMeta records:

This enables user-defined type aliases (e.g., CREATE TYPE email AS VARCHAR(255)) to be persisted and resolved during schema parsing.

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs

graph TB
    subgraph "SysCatalog API"
        Constructor["new(table: Table)\nWraps Table 0"]
subgraph "Table Metadata"
            InsertTable["insert_table_meta(meta)\nAdd new table"]
GetTable["get_table_meta(id)\nRetrieve table info"]
ListTables["list_tables()\nGet all tables"]
UpdateTable["update_table_meta(meta)\nModify table"]
DeleteTable["delete_table_meta(id)\nRemove table"]
end
        
        subgraph "Column Metadata"
            InsertCol["insert_col_meta(meta)\nAdd column"]
GetCols["get_col_metas(table_id)\nGet table columns"]
end
        
        subgraph "Index Metadata"
            InsertIdx["insert_index_meta(meta)\nRegister index"]
GetIdxs["get_indexes(table_id)\nGet table indexes"]
end
        
        subgraph "Trigger Metadata"
            InsertTrigger["insert_trigger_meta(meta)\nCreate trigger"]
GetTriggers["get_triggers(table_id)\nList triggers"]
end
    end
    
 
   Constructor --> InsertTable
 
   Constructor --> GetTable
 
   Constructor --> InsertCol
 
   Constructor --> InsertIdx
 
   Constructor --> InsertTrigger

SysCatalog API and Operations

The SysCatalog struct provides methods for reading and writing metadata records:

Core methods:

| Method Category | Operations | Purpose |
| --- | --- | --- |
| Table operations | insert_table_meta(), get_table_meta(), update_table_meta(), delete_table_meta() | Manage table-level metadata |
| Column operations | insert_col_meta(), get_col_metas(), update_col_meta() | Manage column definitions |
| Index operations | insert_index_meta(), get_indexes(), delete_index() | Register and query indexes |
| Trigger operations | insert_trigger_meta(), get_triggers(), delete_trigger() | Manage trigger definitions |
| Type operations | insert_custom_type(), get_custom_type(), list_custom_types() | Manage custom type definitions |
| Query operations | list_tables(), table_exists(), resolve_field_id() | Query and resolve metadata |

All operations are implemented as append operations on the underlying Table struct, leveraging the same MVCC and transactional semantics as user data.

Sources: llkv-table/src/sys_catalog.rs

Bootstrap Process

The system catalog must be initialized before any other tables can be created. The bootstrap process creates Table 0 with a predefined schema:

Bootstrap steps:

sequenceDiagram
    participant Init as "Initialization"
    participant CM as "CatalogManager"
    participant SysCat as "SysCatalog::new()"
    participant Table0 as "Table ID 0"
    participant CS as "ColumnStore"
    
    Init->>CM: Create database
    CM->>SysCat: Initialize system catalog
    
    Note over SysCat: Define catalog schema\n(TableMeta, ColMeta, etc.)
    
    SysCat->>Table0: Create Table 0 with catalog schema
    Table0->>CS: Initialize column storage for catalog
    
    SysCat->>Table0: Insert TableMeta for Table 0\n(self-reference)
    
    Note over SysCat: Catalog is now operational
    
    SysCat-->>CM: Return initialized catalog
    CM-->>Init: Ready for user tables
  1. Schema definition : The catalog schema is hardcoded in SysCatalog, defining the fields for TableMeta, ColMeta, and other metadata types
  2. Table 0 creation : A Table struct is created with table_id = 0 and the catalog schema
  3. Self-registration : The first metadata record inserted is TableMeta for Table 0 itself, creating the self-referential loop
  4. Ready state : Once initialized, the catalog can accept metadata for user tables

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/catalog.rs

Metadata Queries and Resolution

The system catalog supports various query patterns for metadata resolution:

Resolution functions:

| Function | Input | Output | Location |
| --- | --- | --- | --- |
| resolve_table_name() | Schema name, table name | Option<TableId> | llkv-table/src/resolvers.rs |
| canonical_table_name() | Schema name, table name | Canonical string | llkv-table/src/resolvers.rs |
| FieldResolver::resolve() | Column name, context | FieldId | llkv-table/src/catalog.rs |
| get_table_meta() | TableId | Option<TableMeta> | llkv-table/src/sys_catalog.rs |
| get_col_metas() | TableId | Vec<ColMeta> | llkv-table/src/sys_catalog.rs |

These resolution operations enable the query planner to translate SQL identifiers (table names, column names) into internal identifiers (TableId, FieldId) used throughout the execution pipeline.

Sources: llkv-table/src/resolvers.rs llkv-table/src/catalog.rs llkv-table/src/sys_catalog.rs

sequenceDiagram
    participant SQL as "SQL Engine"
    participant DDL as "DDL Handler\n(CatalogManager)"
    participant SysCat as "SysCatalog"
    participant Table0 as "Table 0"
    
    SQL->>DDL: CREATE TABLE users (...)
    
    DDL->>DDL: Parse schema\nAllocate TableId
    DDL->>DDL: Assign FieldIds to columns
    
    DDL->>SysCat: insert_table_meta(TableMeta)
    SysCat->>Table0: Append TableMeta record
    
    loop For each column
        DDL->>SysCat: insert_col_meta(ColMeta)
        SysCat->>Table0: Append ColMeta record
    end
    
    DDL->>DDL: Create user Table with allocated ID
    
    DDL-->>SQL: CreateTableResult
    
    Note over SQL,Table0: Table now visible to queries

Integration with DDL Operations

All DDL operations (CREATE TABLE, ALTER TABLE, DROP TABLE) modify the system catalog:

DDL operation flow:

  1. CREATE TABLE :

    • Parse schema and allocate TableId
    • Assign FieldId to each column
    • Insert TableMeta and ColMeta records into catalog
    • Create the physical Table with the assigned ID
  2. ALTER TABLE :

    • Retrieve current TableMeta and ColMeta records
    • Validate operation (see validate_alter_table_operation())
    • Update metadata records (append new versions due to MVCC)
    • Modify the physical table schema
  3. DROP TABLE :

    • Mark TableMeta as deleted (MVCC soft delete)
    • Mark associated ColMeta records as deleted
    • Remove indexes and constraints
    • Free physical storage pages

Sources: llkv-table/src/ddl.rs llkv-table/src/catalog.rs llkv-table/src/constraints.rs

graph TB
    subgraph "Constraint Types"
        PK["PrimaryKeyConstraint\nUnique, Not Null"]
Unique["UniqueConstraint\nSingle or Multi-Column"]
FK["ForeignKeyConstraint\nReferential Integrity"]
Check["CheckConstraint\nExpression Validation"]
end
    
    subgraph "Constraint Metadata"
        ConstraintRecord["ConstraintRecord\nid, table_id, kind"]
ConstraintService["ConstraintService\nEnforcement Logic"]
end
    
    subgraph "Storage"
        MetaManager["MetadataManager"]
SysCatalog["SysCatalog Table 0"]
end
    
 
   PK --> ConstraintRecord
 
   Unique --> ConstraintRecord
 
   FK --> ConstraintRecord
 
   Check --> ConstraintRecord
    
 
   ConstraintRecord --> MetaManager
 
   MetaManager --> SysCatalog
    
 
   ConstraintService --> ConstraintRecord

Constraints and Validation Metadata

The system catalog also stores constraint definitions that enforce data integrity:

Constraint structures:

| Constraint Type | Structure | Key Information |
| --- | --- | --- |
| Primary Key | PrimaryKeyConstraint | Column(s), constraint name |
| Unique | UniqueConstraint | Column(s), constraint name, partial index |
| Foreign Key | ForeignKeyConstraint | Child columns, parent table/columns, ON DELETE/UPDATE actions |
| Check | CheckConstraint | Boolean expression, constraint name |

The ConstraintService validates constraint satisfaction during INSERT, UPDATE, and DELETE operations, querying the catalog for relevant constraint definitions.

Sources: llkv-table/src/constraints.rs llkv-table/src/metadata.rs
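
A minimal sketch of how these constraint kinds might be modeled. The `ConstraintKind`, `ReferentialAction`, and `ConstraintRecord` definitions below are hypothetical; the actual structures in llkv-table/src/constraints.rs will differ.

```rust
type TableId = u64;
type FieldId = u32;

enum ReferentialAction { NoAction, Restrict, Cascade, SetNull, SetDefault }

enum ConstraintKind {
    PrimaryKey { columns: Vec<FieldId> },
    Unique { columns: Vec<FieldId>, partial_predicate: Option<String> },
    ForeignKey {
        child_columns: Vec<FieldId>,
        parent_table: TableId,
        parent_columns: Vec<FieldId>,
        on_delete: ReferentialAction,
        on_update: ReferentialAction,
    },
    Check { expression: String },
}

/// Mirrors the ConstraintRecord node in the diagram: id, owning table, kind.
struct ConstraintRecord {
    id: u64,
    table_id: TableId,
    name: Option<String>,
    kind: ConstraintKind,
}
```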

graph TB
    subgraph "MetadataManager API"
        MM["MetadataManager\n(metadata.rs)"]
IndexReg["register_single_column_index()\nregister_multi_column_index()"]
IndexGet["get_single_column_indexes()\nget_multi_column_indexes()"]
FKReg["register_foreign_key()"]
FKGet["get_foreign_keys()"]
FKValidate["validate_foreign_key_rows()"]
ConstraintReg["register_constraint()"]
ConstraintGet["get_constraints()"]
end
    
    subgraph "SysCatalog Operations"
        SysCat["SysCatalog"]
TableMetaOps["Table/Column Metadata"]
IndexMetaOps["Index Metadata"]
ConstraintMetaOps["Constraint Metadata"]
end
    
 
   MM --> SysCat
    
 
   IndexReg --> IndexMetaOps
 
   IndexGet --> IndexMetaOps
    
 
   FKReg --> ConstraintMetaOps
 
   FKGet --> ConstraintMetaOps
    
 
   ConstraintReg --> ConstraintMetaOps
 
   ConstraintGet --> ConstraintMetaOps
    
 
   IndexMetaOps --> SysCat
 
   ConstraintMetaOps --> SysCat

Metadata Manager Integration

The MetadataManager provides a higher-level interface over SysCatalog for managing complex metadata such as indexes and constraints.

The MetadataManager coordinates between the catalog and the runtime enforcement logic, ensuring that metadata changes are properly persisted and that constraints are consistently enforced.

Sources: llkv-table/src/metadata.rs llkv-table/src/catalog.rs
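
The shape of this interface can be sketched as a trait. The names follow the diagram above, but the signatures are illustrative rather than the actual llkv-table/src/metadata.rs API.

```rust
type TableId = u64;
type FieldId = u32;

struct IndexMeta { table_id: TableId, columns: Vec<FieldId>, unique: bool }
struct ForeignKeyMeta { child: (TableId, Vec<FieldId>), parent: (TableId, Vec<FieldId>) }

trait MetadataManagerLike {
    // Index metadata (the register_*/get_* pairs from the diagram).
    fn register_single_column_index(&mut self, table: TableId, column: FieldId, unique: bool);
    fn register_multi_column_index(&mut self, index: IndexMeta);
    fn get_single_column_indexes(&self, table: TableId) -> Vec<IndexMeta>;

    // Foreign-key metadata and row-level validation.
    fn register_foreign_key(&mut self, fk: ForeignKeyMeta);
    fn get_foreign_keys(&self, table: TableId) -> Vec<ForeignKeyMeta>;
}
```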

graph LR
    subgraph "Transaction T1"
        T1Create["CREATE TABLE users"]
T1Meta["Insert TableMeta\ncreated_by = T1"]
end
    
    subgraph "Transaction T2 (concurrent)"
        T2Query["SELECT FROM users"]
T2Scan["Scan catalog\nFilter: created_by <= T2\ndeleted_by > T2 OR NULL"]
end
    
    subgraph "Table 0 Storage"
        Metadata["TableMeta Records\nwith MVCC columns"]
end
    
 
   T1Meta --> Metadata
 
   T2Scan --> Metadata
    
    T1Create -.commits.-> T1Meta
    T2Query -.reads snapshot.-> T2Scan

MVCC and Transactional Metadata

Because the system catalog is implemented as Table 0 using the same storage layer as user tables, all metadata operations are automatically transactional with MVCC semantics:

MVCC characteristics:

| Property | Behavior |
|---|---|
| Isolation | Each transaction sees a consistent snapshot of metadata |
| Atomicity | DDL operations are atomic (all metadata changes or none) |
| Versioning | Multiple versions of metadata coexist until compaction |
| Soft deletes | Dropped tables remain visible to old transactions |
| Time travel | Historical metadata can be queried (if supported) |

This ensures that concurrent DDL and DML operations do not interfere with each other, and that queries always see a consistent view of the schema.

Sources: llkv-table/src/table.rs llkv-table/src/sys_catalog.rs
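
A minimal sketch of the snapshot-visibility rule applied to catalog rows, assuming hypothetical `created_by`/`deleted_by` fields and transaction IDs that order snapshots, as in the diagram above.

```rust
type TxnId = u64;

struct MetaRow { created_by: TxnId, deleted_by: Option<TxnId> }

/// A row is visible to a snapshot if it was created at or before the snapshot
/// and was either never deleted or deleted after the snapshot.
fn visible(row: &MetaRow, snapshot: TxnId) -> bool {
    row.created_by <= snapshot && row.deleted_by.map_or(true, |d| d > snapshot)
}

fn main() {
    let users_meta = MetaRow { created_by: 10, deleted_by: None };
    assert!(!visible(&users_meta, 5));  // snapshot taken before CREATE TABLE committed
    assert!(visible(&users_meta, 12));  // later snapshots see the table

    let dropped = MetaRow { created_by: 10, deleted_by: Some(20) };
    assert!(visible(&dropped, 15));     // older transaction still sees the table
    assert!(!visible(&dropped, 25));    // newer snapshots do not
}
```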

Schema Evolution and Compatibility

The system catalog schema itself is versioned and can evolve over time:

Schema migrations add new columns to the catalog tables (e.g., adding constraint metadata) while maintaining backward compatibility with existing metadata records.

Sources: llkv-table/src/sys_catalog.rs llkv-table/src/metadata.rs

Summary

The System Catalog (SysCatalog) provides:

  • Self-describing metadata: All metadata stored in Table 0 using the same columnar storage as user data
  • Comprehensive tracking: Tables, columns, indexes, triggers, constraints, and custom types
  • Transactional semantics: MVCC ensures consistent metadata reads and atomic DDL operations
  • Efficient resolution: Fast lookup of table names to IDs and field names to IDs
  • Extensibility: New metadata types can be added by extending the catalog schema

The catalog bridges the gap between SQL identifiers and internal identifiers, enabling the query processor to operate on TableId and FieldId rather than string names throughout the execution pipeline.

Sources: llkv-table/src/lib.rs llkv-table/src/sys_catalog.rs llkv-table/src/catalog.rs llkv-table/src/metadata.rs


Custom Types and Type Registry

Relevant source files

Purpose and Scope

This document describes LLKV’s custom type system, which enables users to define and manage type aliases that extend Apache Arrow’s native type system. Custom types are persisted in the system catalog and provide a mechanism for creating domain-specific type names that map to underlying Arrow DataType definitions.

For information about the system catalog infrastructure that stores custom type metadata, see System Catalog and SysCatalog. For details on how tables use these types in their schemas, see Table Abstraction.

Type System Architecture

LLKV’s type system is built on Apache Arrow’s columnar type system but adds a layer of indirection through custom type definitions. This allows users to create semantic type names (e.g., email_address, currency_amount) that map to specific Arrow types with additional constraints or metadata.

Type Resolution Flow

graph TB
    subgraph "SQL Layer"
        DDL["CREATE TYPE Statement"]
COLDEF["Column Definition\nwith Custom Type"]
end
    
    subgraph "Type Registry"
        SYSCAT["SysCatalog\nTable 0"]
TYPEMETA["CustomTypeMeta\nRecords"]
RESOLVER["Type Resolver"]
end
    
    subgraph "Arrow Type System"
        ARROW["Arrow DataType"]
SCHEMA["Arrow Schema"]
end
    
    subgraph "Column Storage"
        COLSTORE["ColumnStore"]
DESCRIPTOR["ColumnDescriptor"]
end
    
 
   DDL --> TYPEMETA
 
   TYPEMETA --> SYSCAT
    
 
   COLDEF --> RESOLVER
 
   RESOLVER --> TYPEMETA
 
   RESOLVER --> ARROW
    
 
   ARROW --> SCHEMA
 
   SCHEMA --> COLSTORE
 
   COLSTORE --> DESCRIPTOR
    
    style TYPEMETA fill:#f9f9f9
    style SYSCAT fill:#f9f9f9
    style RESOLVER fill:#f9f9f9
  • User defines custom types via SQL DDL
  • Type metadata is stored in the system catalog
  • Column definitions reference custom types by name
  • Type resolver translates names to Arrow DataTypes
  • Physical storage uses Arrow’s native columnar format

Sources: llkv-table/src/lib.rs:82 llkv-table/src/lib.rs:81-85

CustomTypeMeta Structure

CustomTypeMeta is the fundamental metadata structure that describes a custom type definition. It is stored as a record in the system catalog (Table 0) alongside other metadata like TableMeta and ColMeta.

CustomTypeMeta Fields

classDiagram
    class CustomTypeMeta {+type_id: TypeId\n+type_name: String\n+base_type: ArrowDataType\n+nullable: bool\n+metadata: HashMap~String,String~\n+created_at: Timestamp\n+created_by: TransactionId\n+deleted_by: Option~TransactionId~}
    
    class SysCatalog {+register_custom_type()\n+get_custom_type()\n+list_custom_types()\n+drop_custom_type()}
    
    class ArrowDataType {<<enumeration>>\nInt64\nUtf8\nDecimal128\nDate32\nTimestamp\nStruct\nList}
    
    class ColumnDescriptor {+field_id: FieldId\n+data_type: DataType}
    
    CustomTypeMeta --> ArrowDataType : maps_to
    SysCatalog --> CustomTypeMeta : stores
    ColumnDescriptor --> ArrowDataType : uses
| Field | Type | Description |
|---|---|---|
| type_id | TypeId | Unique identifier for the custom type |
| type_name | String | User-defined name (e.g., "email_address") |
| base_type | ArrowDataType | Underlying Arrow type definition |
| nullable | bool | Whether NULL values are permitted |
| metadata | HashMap<String, String> | Additional type-specific metadata |
| created_at | Timestamp | Type creation timestamp |
| created_by | TransactionId | Transaction that created the type |
| deleted_by | Option<TransactionId> | MVCC deletion marker |

Sources: llkv-table/src/lib.rs:82
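
Rendered as Rust, the structure might look like the following sketch. It assumes the `arrow` crate for the base type and uses simplified aliases for `TypeId` and `TransactionId`; the concrete definition in llkv-table may differ.

```rust
use std::collections::HashMap;

use arrow::datatypes::DataType;

type TypeId = u64;
type TransactionId = u64;

/// Illustrative only; field names follow the table above.
struct CustomTypeMeta {
    type_id: TypeId,                   // unique identifier for the custom type
    type_name: String,                 // user-defined name, e.g. "email_address"
    base_type: DataType,               // underlying Arrow type definition
    nullable: bool,                    // whether NULL values are permitted
    metadata: HashMap<String, String>, // additional type-specific metadata
    created_at: i64,                   // creation timestamp (epoch micros in this sketch)
    created_by: TransactionId,         // MVCC: transaction that created the type
    deleted_by: Option<TransactionId>, // MVCC: soft-delete marker
}
```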

Type Registration and Lifecycle

Custom types are managed through the SysCatalog interface, which provides operations for the complete type lifecycle: registration, retrieval, modification, and deletion.

sequenceDiagram
    participant User
    participant SqlEngine
    participant CatalogManager
    participant SysCatalog
    participant Table0 as Table 0
    
    User->>SqlEngine: CREATE TYPE email_address AS VARCHAR(255)
    SqlEngine->>SqlEngine: Parse DDL statement
    SqlEngine->>CatalogManager: register_custom_type(name, base_type)
    
    CatalogManager->>CatalogManager: Validate type name uniqueness
    CatalogManager->>CatalogManager: Assign new TypeId
    
    CatalogManager->>SysCatalog: Insert CustomTypeMeta record
    SysCatalog->>SysCatalog: Build RecordBatch with metadata
    SysCatalog->>SysCatalog: Add MVCC columns (created_by)
    
    SysCatalog->>Table0: append(batch)
    Table0->>Table0: Write to ColumnStore
    Table0-->>SysCatalog: Success
    
    SysCatalog-->>CatalogManager: TypeId
    CatalogManager-->>SqlEngine: Result
    SqlEngine-->>User: Type created successfully

Type Registration Flow

Registration Steps

  1. DDL statement parsed by SQL layer
  2. CatalogManager validates type name uniqueness
  3. New TypeId allocated
  4. CustomTypeMeta record constructed
  5. Metadata written to system catalog (Table 0)
  6. MVCC columns (created_by, deleted_by) added automatically
  7. Type becomes available for schema definitions

Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85
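
Steps 2–5 can be sketched with a hypothetical in-memory registry. The real CatalogManager persists a CustomTypeMeta record (with MVCC columns) to Table 0 rather than updating a map.

```rust
use std::collections::HashMap;

type TypeId = u64;

#[derive(Default)]
struct TypeRegistry {
    by_name: HashMap<String, TypeId>,
    next_id: TypeId,
}

impl TypeRegistry {
    fn register(&mut self, name: &str, base_type: &str) -> Result<TypeId, String> {
        // Step 2: reject duplicate names before touching storage.
        if self.by_name.contains_key(name) {
            return Err(format!("type '{name}' already exists"));
        }
        // Step 3: allocate a fresh TypeId.
        let id = self.next_id;
        self.next_id += 1;
        // Steps 4-6 would build a CustomTypeMeta record and append it to the
        // system catalog; here we only record the name-to-id mapping.
        let _ = base_type;
        self.by_name.insert(name.to_string(), id);
        Ok(id)
    }
}

fn main() {
    let mut reg = TypeRegistry::default();
    let id = reg.register("email_address", "Utf8").unwrap();
    assert_eq!(id, 0);
    assert!(reg.register("email_address", "Utf8").is_err()); // uniqueness enforced
}
```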

Type Lifecycle Operations

Type States

  • Registered: Type defined but not yet used in any table schemas
  • InUse: One or more columns reference this type
  • Modified: Type definition updated (if ALTER TYPE is supported)
  • Deprecated: Type soft-deleted via MVCC (deleted_by set)
  • Deleted: Type permanently removed from catalog

Sources: llkv-table/src/lib.rs:81-85

Type Resolution and Schema Integration

When creating tables or altering schemas, the type resolver translates custom type names to Arrow DataType instances. This resolution happens during DDL execution and schema validation.

Resolution Process

graph LR
    subgraph "Table Definition"
        COL1["Column: 'email'\nType: 'email_address'"]
COL2["Column: 'age'\nType: 'INT'"]
end
    
    subgraph "Type Resolution"
        RESOLVER["Type Resolver"]
CACHE["Type Cache"]
end
    
    subgraph "System Catalog"
        CUSTOM["CustomTypeMeta\nemail_address → Utf8(255)"]
BUILTIN["Built-in Types\nINT → Int32"]
end
    
    subgraph "Arrow Schema"
        FIELD1["Field: 'email'\nDataType: Utf8"]
FIELD2["Field: 'age'\nDataType: Int32"]
end
    
 
   COL1 --> RESOLVER
 
   COL2 --> RESOLVER
    
 
   RESOLVER --> CACHE
 
   CACHE --> CUSTOM
 
   RESOLVER --> BUILTIN
    
 
   CUSTOM --> FIELD1
 
   BUILTIN --> FIELD2
    
 
   FIELD1 --> SCHEMA["Arrow Schema"]
FIELD2 --> SCHEMA
  1. Column definition specifies type by name
  2. Type resolver checks cache for previous resolution
  3. Cache miss triggers lookup in SysCatalog
  4. Custom type metadata retrieved from Table 0
  5. Base Arrow DataType extracted
  6. Type constraints/metadata applied
  7. Resolved type cached for subsequent use
  8. Arrow Field constructed with final type

Sources: llkv-table/src/lib.rs:43 llkv-table/src/lib.rs:54
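
A minimal sketch of this resolution path, assuming the `arrow` crate and a hypothetical `TypeResolver` whose in-memory maps stand in for the type cache and the SysCatalog lookup.

```rust
use std::collections::HashMap;

use arrow::datatypes::DataType;

struct TypeResolver {
    cache: HashMap<String, DataType>,   // resolved custom types
    catalog: HashMap<String, DataType>, // stands in for the SysCatalog lookup
}

impl TypeResolver {
    fn resolve(&mut self, name: &str) -> Option<DataType> {
        // Built-in SQL type names resolve directly to Arrow types.
        match name.to_ascii_uppercase().as_str() {
            "INT" | "INTEGER" => return Some(DataType::Int32),
            "BIGINT" => return Some(DataType::Int64),
            "TEXT" | "VARCHAR" => return Some(DataType::Utf8),
            _ => {}
        }
        // Custom types: cache hit first, otherwise consult the catalog and cache the result.
        if let Some(dt) = self.cache.get(name) {
            return Some(dt.clone());
        }
        let dt = self.catalog.get(name)?.clone();
        self.cache.insert(name.to_string(), dt.clone());
        Some(dt)
    }
}

fn main() {
    let mut resolver = TypeResolver {
        cache: HashMap::new(),
        catalog: HashMap::from([("email_address".to_string(), DataType::Utf8)]),
    };
    assert_eq!(resolver.resolve("INT"), Some(DataType::Int32));
    assert_eq!(resolver.resolve("email_address"), Some(DataType::Utf8)); // cached afterwards
}
```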

DDL Operations for Custom Types

CREATE TYPE Statement

Processing Steps

| Step | Action | Component |
|---|---|---|
| 1 | Parse SQL | SqlEngine (sqlparser) |
| 2 | Extract type definition | SQL preprocessing layer |
| 3 | Validate base type | CatalogManager |
| 4 | Check name uniqueness | SysCatalog query |
| 5 | Allocate TypeId | CatalogManager |
| 6 | Construct metadata | CustomTypeMeta builder |
| 7 | Write to catalog | SysCatalog::append() |
| 8 | Update cache | Type resolver cache |

Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:68

DROP TYPE Statement

Deletion Process

MVCC Soft Delete

  • Type records are not physically removed
  • deleted_by field set to transaction ID
  • Historical queries can still see old type definitions
  • New schemas cannot reference deleted types
  • Cache invalidation ensures immediate visibility

Sources: llkv-table/src/lib.rs:81-85

Storage in System Catalog

Custom type metadata is stored in the system catalog (Table 0) alongside other metadata types like TableMeta, ColMeta, and constraint information.

System Catalog Schema for CustomTypeMeta

| Column Name | Type | Description |
|---|---|---|
| row_id | RowId | Unique row identifier |
| metadata_type | Utf8 | Discriminator: "CustomType" |
| type_id | UInt64 | Custom type identifier |
| type_name | Utf8 | User-defined type name |
| base_type_json | Utf8 | Serialized Arrow DataType |
| nullable | Boolean | Nullability flag |
| metadata_json | Utf8 | Additional metadata as JSON |
| created_at | Timestamp | Creation timestamp |
| created_by | UInt64 | Creating transaction ID |
| deleted_by | UInt64 (nullable) | Deleting transaction ID (MVCC) |

Storage Characteristics

  • Custom types stored in same table as other metadata
  • metadata_type column distinguishes record types
  • Arrow JSON serialization for base type persistence
  • Metadata JSON for extensible properties
  • MVCC columns enable temporal queries
  • Indexed by type_name for fast lookup

Sources: llkv-table/src/lib.rs:81-85
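
As a sketch, the column layout above could be expressed as an Arrow schema like the following (assuming the `arrow` crate; the real catalog schema is defined inside llkv-table and may use different physical types):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};

/// Illustrative schema mirroring the CustomTypeMeta columns listed above.
fn custom_type_catalog_schema() -> Arc<Schema> {
    Arc::new(Schema::new(vec![
        Field::new("row_id", DataType::UInt64, false),
        Field::new("metadata_type", DataType::Utf8, false),
        Field::new("type_id", DataType::UInt64, false),
        Field::new("type_name", DataType::Utf8, false),
        Field::new("base_type_json", DataType::Utf8, false),
        Field::new("nullable", DataType::Boolean, false),
        Field::new("metadata_json", DataType::Utf8, true),
        Field::new("created_at", DataType::Timestamp(TimeUnit::Microsecond, None), false),
        Field::new("created_by", DataType::UInt64, false),
        Field::new("deleted_by", DataType::UInt64, true), // NULL until dropped (MVCC)
    ]))
}
```

A schema of this shape would then back the RecordBatch construction when appending catalog rows to Table 0.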

Type Catalog Query Examples

Retrieving Custom Type Definition
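
A hypothetical lookup sketch; `get_custom_type` and the row shape here are illustrative, combining a name filter with the MVCC visibility predicate described earlier.

```rust
type TypeId = u64;
type TxnId = u64;

#[derive(Clone)]
struct CustomTypeRow {
    type_id: TypeId,
    type_name: String,
    base_type_json: String,
    created_by: TxnId,
    deleted_by: Option<TxnId>,
}

/// Scan catalog rows for a name, keeping only rows visible to the snapshot
/// (created_by <= snapshot, deleted_by absent or later than snapshot).
fn get_custom_type(rows: &[CustomTypeRow], name: &str, snapshot: TxnId) -> Option<CustomTypeRow> {
    rows.iter()
        .filter(|r| r.type_name == name)
        .filter(|r| r.created_by <= snapshot)
        .filter(|r| r.deleted_by.map_or(true, |d| d > snapshot))
        .cloned()
        .next()
}
```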

Query Optimization

  • Type name indexed for fast lookups
  • MVCC predicate filters deleted types
  • Result caching minimizes catalog queries
  • Batch operations for multiple type resolutions

Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85

graph TB
    subgraph "User Operations"
        CREATE["CREATE TYPE"]
DROP["DROP TYPE"]
ALTER["ALTER TYPE"]
QUERY["Type Resolution"]
end
    
    subgraph "CatalogManager"
        API["CatalogManager API"]
VALIDATION["Type Validation"]
DEPENDENCY["Dependency Tracking"]
CACHE_MGR["Cache Manager"]
end
    
    subgraph "SysCatalog"
        SYSCAT["SysCatalog"]
TABLE0["Table 0"]
end
    
    subgraph "Type System"
        RESOLVER["Type Resolver"]
ARROW["Arrow DataType"]
end
    
 
   CREATE --> API
 
   DROP --> API
 
   ALTER --> API
 
   QUERY --> API
    
 
   API --> VALIDATION
 
   API --> DEPENDENCY
 
   API --> CACHE_MGR
    
 
   VALIDATION --> SYSCAT
 
   DEPENDENCY --> SYSCAT
 
   CACHE_MGR --> RESOLVER
    
 
   SYSCAT --> TABLE0
 
   RESOLVER --> ARROW
    
    style API fill:#f9f9f9
    style SYSCAT fill:#f9f9f9

Integration with CatalogManager

The CatalogManager provides high-level operations for custom type management, coordinating between the SQL layer, type resolver, and system catalog.

CatalogManager Responsibilities

| Function | Description |
|---|---|
| register_custom_type() | Create new type definition |
| get_custom_type() | Retrieve type by name or ID |
| list_custom_types() | Query all non-deleted types |
| drop_custom_type() | Mark type as deleted (MVCC) |
| resolve_type_name() | Translate name to Arrow type |
| check_type_dependencies() | Find columns using type |
| invalidate_type_cache() | Clear cached type definitions |

Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:68
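
These responsibilities can be summarized as a trait sketch; the signatures are hypothetical and only mirror the function names in the table above.

```rust
use arrow::datatypes::DataType;

type TypeId = u64;

/// Illustrative surface for custom-type management; not the crate's actual API.
trait CustomTypeCatalog {
    fn register_custom_type(&mut self, name: &str, base: DataType) -> Result<TypeId, String>;
    fn get_custom_type(&self, name: &str) -> Option<(TypeId, DataType)>;
    fn list_custom_types(&self) -> Vec<(TypeId, String)>;
    fn drop_custom_type(&mut self, name: &str) -> Result<(), String>; // MVCC soft delete
    fn resolve_type_name(&mut self, name: &str) -> Option<DataType>;  // cached resolution
    fn check_type_dependencies(&self, type_id: TypeId) -> Vec<String>; // columns using the type
    fn invalidate_type_cache(&mut self);
}
```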

Type System Best Practices

Naming Conventions

Recommended Type Names

  • Use descriptive, domain-specific names: customer_id, email_address, currency_amount
  • Avoid generic names that conflict with SQL keywords: text, number, date
  • Use snake_case for consistency with column names
  • Include units or constraints in name: price_usd, duration_seconds

Type Reusability

When to Create Custom Types

  • Domain concepts appearing in multiple tables
  • Types with specific constraints (precision, length)
  • Semantic meaning beyond base type
  • Types requiring validation or transformation logic

When to Use Base Types

  • One-off column definitions
  • Standard SQL types without constraints
  • Internal implementation columns

Performance Considerations

Cache Behavior

  • Type resolution results are cached per session
  • First resolution incurs catalog lookup cost
  • Subsequent resolutions served from memory
  • Cache invalidation on type modifications

Query Impact

  • Custom types add one indirection layer
  • Physical storage uses base Arrow types
  • No runtime performance penalty
  • Query plans operate on resolved types

Sources: llkv-table/src/lib.rs:54 llkv-table/src/lib.rs:81-85
