This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
SQL Query Processing Pipeline
Loading…
SQL Query Processing Pipeline
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-executor/src/lib.rs
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-sql/src/sql_engine.rs
- llkv-test-utils/Cargo.toml
Purpose and Scope
This page describes the end-to-end flow of SQL query processing in LLKV, from raw SQL text input to Arrow RecordBatch results. It covers the major pipeline stages: parsing, preprocessing, planning, execution, and result formatting. For detailed information about specific components, see:
- SQL interface and the
SqlEngineAPI: SQL Interface - Plan structure details: Query Planning
- Expression translation and compilation: Expression System
- Execution implementation details: Query Execution
Pipeline Overview
The SQL query processing pipeline transforms user SQL statements through several stages before producing results:
Sources: llkv-sql/src/sql_engine.rs:1-100 [Diagram 2 from overview](https://github.com/jzombie/rust-llkv/blob/89777726/Diagram 2 from overview)
flowchart TB
Input["SQL Text\n(User Input)"]
Preprocess["SQL Preprocessing\npreprocess_* functions"]
Parse["SQL Parsing\nsqlparser::Parser"]
AST["sqlparser::ast::Statement"]
Plan["Query Planning\nbuild_*_plan functions"]
PlanStmt["PlanStatement"]
Execute["Execution\nRuntimeEngine::execute_statement"]
Result["RuntimeStatementResult"]
Batches["Vec<RecordBatch>"]
Input --> Preprocess
Preprocess --> Parse
Parse --> AST
AST --> Plan
Plan --> PlanStmt
PlanStmt --> Execute
Execute --> Result
Result --> Batches
style Input fill:#f9f9f9
style Batches fill:#f9f9f9
SQL Preprocessing Stage
Before parsing, LLKV applies several preprocessing transformations to normalize SQL syntax across different dialects (SQLite, DuckDB, PostgreSQL). This stage rewrites incompatible syntax into forms that sqlparser can handle.
Preprocessing Functions
| Function | Purpose | Example Transformation |
|---|---|---|
preprocess_tpch_connect_syntax | Strip TPC-H CONNECT TO statements | CONNECT TO tpch; → (removed) |
preprocess_create_type_syntax | Convert CREATE TYPE to CREATE DOMAIN | CREATE TYPE int8 AS bigint → CREATE DOMAIN int8 AS bigint |
preprocess_exclude_syntax | Quote qualified names in EXCLUDE clauses | EXCLUDE (t.col) → EXCLUDE ("t.col") |
preprocess_trailing_commas_in_values | Remove trailing commas in VALUES | VALUES (1,) → VALUES (1) |
preprocess_empty_in_lists | Expand empty IN lists to boolean expressions | x IN () → (x = NULL AND 0 = 1) |
preprocess_index_hints | Strip SQLite index hints | FROM t INDEXED BY idx → FROM t |
preprocess_reindex_syntax | Convert standalone REINDEX to VACUUM REINDEX | REINDEX idx → VACUUM REINDEX idx |
preprocess_sqlite_trigger_shorthand | Add explicit timing and FOR EACH ROW | CREATE TRIGGER ... BEGIN → CREATE TRIGGER ... AFTER ... FOR EACH ROW BEGIN |
preprocess_bare_table_in_clauses | Convert bare table names in IN to subqueries | x IN table → x IN (SELECT * FROM table) |
The preprocessing pipeline is applied sequentially in SqlEngine::execute:
Sources: llkv-sql/src/sql_engine.rs:759-1006 llkv-sql/src/sql_engine.rs:1395-1460
Parsing Stage
After preprocessing, the SQL text is parsed using the sqlparser crate with a configurable recursion limit to handle deeply nested queries.
Parser Configuration
The parser is configured with:
- Dialect :
GenericDialect(supports multiple SQL dialects) - Recursion limit :
200(set viaPARSER_RECURSION_LIMITconstant) - Parameter tracking : Thread-local
ParameterScopefor prepared statement placeholders
The parser produces a vector of sqlparser::ast::Statement objects. Each statement is then converted to a PlanStatement for execution.
Sources: llkv-sql/src/sql_engine.rs:395-400 llkv-sql/src/sql_engine.rs:1464-1481 llkv-sql/src/sql_engine.rs:223-256
Statement Type Dispatch
Once parsed, statements are dispatched to specialized planning functions based on their type. The SqlEngine maintains an optional InsertBuffer to batch consecutive literal INSERT statements for improved throughput.
Sources: llkv-sql/src/sql_engine.rs:1482-2800 llkv-sql/src/sql_engine.rs:486-547
flowchart TB
AST["sqlparser::ast::Statement"]
Dispatch{"Statement Type"}
CreateTable["build_create_table_plan"]
DropTable["build_drop_table_plan"]
AlterTable["build_alter_table_plan"]
CreateView["build_create_view_plan"]
DropView["build_drop_view_plan"]
CreateIndex["build_create_index_plan"]
DropIndex["build_drop_index_plan"]
Reindex["build_reindex_plan"]
RenameTable["build_rename_table_plan"]
Insert["build_insert_plan"]
InsertBuffer["InsertBuffer::push_statement\n(batching optimization)"]
Update["build_update_plan"]
Delete["build_delete_plan"]
Truncate["build_truncate_plan"]
Select["build_select_plan"]
Explain["wrap in Explain plan"]
Set["handle SET statement"]
Show["handle SHOW statement"]
Begin["handle BEGIN TRANSACTION"]
Commit["handle COMMIT"]
Rollback["handle ROLLBACK"]
Savepoint["handle SAVEPOINT"]
Release["handle RELEASE"]
Vacuum["handle VACUUM"]
Pragma["handle PRAGMA"]
PlanStmt["PlanStatement"]
AST --> Dispatch
Dispatch -->|CREATE TABLE| CreateTable
Dispatch -->|DROP TABLE| DropTable
Dispatch -->|ALTER TABLE| AlterTable
Dispatch -->|CREATE VIEW| CreateView
Dispatch -->|DROP VIEW| DropView
Dispatch -->|CREATE INDEX| CreateIndex
Dispatch -->|DROP INDEX| DropIndex
Dispatch -->|VACUUM REINDEX| Reindex
Dispatch -->|ALTER TABLE RENAME| RenameTable
Dispatch -->|INSERT| Insert
Insert -.may buffer.-> InsertBuffer
Dispatch -->|UPDATE| Update
Dispatch -->|DELETE| Delete
Dispatch -->|TRUNCATE| Truncate
Dispatch -->|SELECT| Select
Dispatch -->|EXPLAIN| Explain
Dispatch -->|SET| Set
Dispatch -->|SHOW| Show
Dispatch -->|BEGIN| Begin
Dispatch -->|COMMIT| Commit
Dispatch -->|ROLLBACK| Rollback
Dispatch -->|SAVEPOINT| Savepoint
Dispatch -->|RELEASE| Release
Dispatch -->|VACUUM| Vacuum
Dispatch -->|PRAGMA| Pragma
CreateTable --> PlanStmt
DropTable --> PlanStmt
AlterTable --> PlanStmt
CreateView --> PlanStmt
DropView --> PlanStmt
CreateIndex --> PlanStmt
DropIndex --> PlanStmt
Reindex --> PlanStmt
RenameTable --> PlanStmt
Insert --> PlanStmt
InsertBuffer --> PlanStmt
Update --> PlanStmt
Delete --> PlanStmt
Truncate --> PlanStmt
Select --> PlanStmt
Explain --> PlanStmt
Planning Stage
The planning stage converts sqlparser::ast::Statement nodes into typed PlanStatement objects. This involves:
- Column resolution : Mapping string column names to
FieldIdidentifiers via the system catalog - Expression translation : Converting
sqlparser::ast::Exprtollkv_expr::expr::ExprandScalarExpr - Subquery tracking : Recording correlated subqueries with placeholder bindings
- Validation : Ensuring referenced tables and columns exist
SELECT Plan Construction
For SELECT statements, the planner builds a SelectPlan containing:
Sources: llkv-sql/src/sql_engine.rs:2801-3500 llkv-plan/src/plans.rs:705-850
Expression Translation
Expression translation converts SQL expressions into the llkv_expr AST, which uses FieldId instead of string column names. This enables efficient field access during execution.
Sources: llkv-plan/src/translation/expression.rs:1-500 llkv-sql/src/sql_engine.rs:4000-4500
flowchart LR
SQLExpr["sqlparser::ast::Expr\n(column names as strings)"]
Resolver["IdentifierResolver\n(catalog lookups)"]
Translation["translate_scalar / translate_predicate"]
LLKVExpr["llkv_expr::expr::ScalarExpr<FieldId>\nllkv_expr::expr::Expr<FieldId>"]
SQLExpr --> Translation
Resolver -.provides.-> Translation
Translation --> LLKVExpr
flowchart TB
PlanStmt["PlanStatement"]
Execute["RuntimeEngine::execute_statement"]
DDL{"Statement Type"}
CreateTableExec["execute_create_table\n(allocate table_id, register schema)"]
DropTableExec["execute_drop_table\n(remove from catalog)"]
AlterTableExec["execute_alter_table\n(modify schema)"]
CreateViewExec["execute_create_view\n(store view definition)"]
CreateIndexExec["execute_create_index\n(build index structures)"]
InsertExec["execute_insert\n(convert to RecordBatch, append)"]
UpdateExec["execute_update\n(filter + rewrite rows)"]
DeleteExec["execute_delete\n(mark rows deleted via MVCC)"]
TruncateExec["execute_truncate\n(clear all rows)"]
SelectExec["QueryExecutor::execute_select\n(scan, filter, project, aggregate)"]
Result["RuntimeStatementResult"]
PlanStmt --> Execute
Execute --> DDL
DDL -->|CreateTable| CreateTableExec
DDL -->|DropTable| DropTableExec
DDL -->|AlterTable| AlterTableExec
DDL -->|CreateView| CreateViewExec
DDL -->|CreateIndex| CreateIndexExec
DDL -->|Insert| InsertExec
DDL -->|Update| UpdateExec
DDL -->|Delete| DeleteExec
DDL -->|Truncate| TruncateExec
DDL -->|Select| SelectExec
CreateTableExec --> Result
DropTableExec --> Result
AlterTableExec --> Result
CreateViewExec --> Result
CreateIndexExec --> Result
InsertExec --> Result
UpdateExec --> Result
DeleteExec --> Result
TruncateExec --> Result
SelectExec --> Result
Execution Stage
The RuntimeEngine receives a PlanStatement and dispatches it to the appropriate execution handler:
Sources: llkv-runtime/src/lib.rs:1-500 llkv-sql/src/sql_engine.rs:706-745
flowchart TB
SelectPlan["SelectPlan"]
Executor["QueryExecutor::execute_select"]
Strategy{"Execution Strategy"}
NoTable["execute_select_without_table\n(constant projection)"]
Compound["execute_compound_select\n(UNION/EXCEPT/INTERSECT)"]
GroupBy["execute_group_by_single_table\n(hash aggregation)"]
CrossProduct["execute_cross_product\n(nested loop join)"]
Aggregates["execute_aggregates\n(full-table aggregation)"]
Projection["execute_projection\n(scan + filter + project)"]
ScanStream["Table::scan_stream\n(streaming batches)"]
FilterEval["filter_row_ids\n(predicate evaluation)"]
Gather["gather_rows\n(assemble RecordBatch)"]
Sort["lexsort_to_indices\n(ORDER BY)"]
LimitOffset["take + offset\n(LIMIT/OFFSET)"]
SelectExecution["SelectExecution\n(result wrapper)"]
SelectPlan --> Executor
Executor --> Strategy
Strategy -->|tables.is_empty| NoTable
Strategy -->|compound.is_some| Compound
Strategy -->|!group_by.is_empty| GroupBy
Strategy -->|tables.len > 1| CrossProduct
Strategy -->|!aggregates.is_empty| Aggregates
Strategy -->|single table| Projection
Projection --> ScanStream
ScanStream --> FilterEval
FilterEval --> Gather
Gather --> Sort
Sort --> LimitOffset
NoTable --> SelectExecution
Compound --> SelectExecution
GroupBy --> SelectExecution
CrossProduct --> SelectExecution
Aggregates --> SelectExecution
LimitOffset --> SelectExecution
SELECT Execution Flow
SELECT statement execution is the most complex path, involving multiple optimization strategies:
Sources: llkv-executor/src/lib.rs:519-563 llkv-executor/src/scan.rs:1-500
Two-Phase Execution for Filtering
When a WHERE clause is present, execution follows a two-phase approach:
-
Phase 1: Row ID Collection
- Evaluate predicates against stored data
- Use chunk metadata for pruning (min/max values)
- Build bitmap of matching row IDs
-
Phase 2: Data Gathering
- Fetch only projected columns for matching rows
- Assemble Arrow
RecordBatchfrom gathered data
Sources: llkv-executor/src/scan.rs:200-400 [Diagram 6 from overview](https://github.com/jzombie/rust-llkv/blob/89777726/Diagram 6 from overview)
flowchart LR
RuntimeResult["RuntimeStatementResult"]
Branch{"Result Type"}
DDLResult["CreateTable/DropTable/etc\n(metadata only)"]
DMLResult["Insert/Update/Delete\n(row count)"]
SelectResult["Select\n(SelectExecution)"]
SelectExec["SelectExecution"]
Stream["stream(callback)\n(iterate batches)"]
IntoBatches["into_batches()\n(collect all)"]
IntoRows["into_rows()\n(materialize)"]
Batches["Vec<RecordBatch>"]
RuntimeResult --> Branch
Branch -->|DDL| DDLResult
Branch -->|DML| DMLResult
Branch -->|SELECT| SelectResult
SelectResult --> SelectExec
SelectExec --> Stream
SelectExec --> IntoBatches
SelectExec --> IntoRows
Stream --> Batches
IntoBatches --> Batches
Result Formatting
All execution paths ultimately produce a RuntimeStatementResult, which is converted to Arrow RecordBatch objects for SELECT queries:
The SelectExecution type provides three consumption modes:
| Method | Behavior | Use Case |
|---|---|---|
stream(callback) | Iterate batches without allocating | Large result sets, streaming output |
into_batches() | Collect all batches into Vec | Moderate result sets, need random access |
into_rows() | Materialize as Vec<CanonicalRow> | Small result sets, need row-level operations |
Sources: llkv-executor/src/types/execution.rs:1-300 llkv-sql/src/sql_engine.rs:1100-1150
Optimization Points in the Pipeline
Several optimization strategies are applied throughout the pipeline:
1. SQL Preprocessing Optimizations
- Empty IN list rewriting :
x IN ()becomes constant false, enabling early elimination - Index hint removal : Allows the planner to make optimal index choices without user hints
Sources: llkv-sql/src/sql_engine.rs:835-856 llkv-sql/src/sql_engine.rs:858-875
2. INSERT Buffering
- Batching : Consecutive literal
INSERTstatements are buffered until reachingMAX_BUFFERED_INSERT_ROWS(8192) - Throughput : Dramatically reduces planning overhead for bulk ingest workloads
- Semantics : Disabled by default to preserve transactional visibility
Sources: llkv-sql/src/sql_engine.rs:486-547 llkv-sql/src/sql_engine.rs:1200-1300
3. Predicate Pushdown
- Chunk pruning : Min/max metadata on column chunks enables skipping irrelevant data
- Vectorized evaluation : SIMD instructions accelerate predicate evaluation within chunks
Sources: llkv-executor/src/scan.rs:300-400
4. Projection Pushdown
- Lazy column loading : Only requested columns are fetched from storage
- Computed projection caching : Identical expressions are evaluated once and reused
Sources: llkv-executor/src/lib.rs:469-501
5. Fast Paths
- Constant SELECT : Queries without tables avoid storage access entirely
- Full table scan : Queries without predicates stream directly without bitmap construction
Sources: llkv-executor/src/lib.rs:533-534
Parameter Binding for Prepared Statements
The pipeline supports parameterized queries via PreparedStatement:
Parameter placeholders are normalized during preprocessing:
?,?N,$N: Positional parameters (1-indexed):name: Named parameters (converted to indices)
Sources: llkv-sql/src/sql_engine.rs:78-283 llkv-sql/src/sql_engine.rs:354-373
Summary
The SQL query processing pipeline in LLKV follows a clear multi-stage architecture:
- Preprocessing : Normalize SQL dialect differences
- Parsing : Convert text to AST (
sqlparser) - Planning : Build typed
PlanStatementwith column resolution - Execution : Execute via
RuntimeEngineandQueryExecutor - Result Formatting : Return Arrow
RecordBatchobjects
Key design principles:
- Dialect flexibility : Preprocessing enables SQLite, DuckDB, and PostgreSQL syntax
- Type safety : Early resolution of column names to
FieldIdprevents runtime errors - Streaming execution : Large result sets never require full materialization
- Optimization opportunities : Metadata pruning, SIMD evaluation, and projection pushdown reduce data movement
For implementation details of specific stages, refer to the linked subsystem pages.
Dismiss
Refresh this wiki
Enter email to refresh