This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Subquery and Correlation Handling
Relevant source files
This page documents how LLKV handles subqueries and correlated column references during query planning and execution. Subqueries can appear in WHERE clauses (as EXISTS predicates) or in SELECT projections (as scalar subqueries). When a subquery references columns from its outer query, it is called a correlated subquery , requiring special handling to bind outer row values during execution.
For information about expression evaluation and compilation, see Expression System. For query execution flow, see Query Execution.
Purpose and Scope
Subquery handling in LLKV involves three distinct phases:
- Detection and Tracking - During SQL translation, the planner identifies subqueries and tracks which outer columns they reference
- Placeholder Injection - Correlated columns are replaced with synthetic placeholder identifiers in the subquery's expression tree
- Binding and Execution - At runtime, for each outer row, placeholders are replaced with actual values and the subquery is executed
This document covers the data structures, algorithms, and execution flow for both filter subqueries (EXISTS/NOT EXISTS) and scalar subqueries (single-value returns).
Core Data Structures
LLKV represents subqueries and their correlation metadata through several interconnected structures defined in llkv-plan.
classDiagram
class SelectFilter {+Expr~String~ predicate\n+Vec~FilterSubquery~ subqueries}
class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class ScalarSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
class SelectPlan {+Vec~TableRef~ tables\n+Option~SelectFilter~ filter\n+Vec~ScalarSubquery~ scalar_subqueries\n+Vec~SelectProjection~ projections}
SelectPlan --> SelectFilter : filter
SelectPlan --> ScalarSubquery : scalar_subqueries
SelectFilter --> FilterSubquery : subqueries
FilterSubquery --> CorrelatedColumn : correlated_columns
ScalarSubquery --> CorrelatedColumn : correlated_columns
FilterSubquery --> SelectPlan : plan
ScalarSubquery --> SelectPlan : plan
Subquery Plan Structures
Sources: llkv-plan/src/plans.rs:28-67
Field Descriptions
| Structure | Field | Purpose |
|---|---|---|
FilterSubquery | id | Unique identifier used to match subquery in expression tree |
FilterSubquery | plan | Nested SELECT plan to execute for each outer row |
FilterSubquery | correlated_columns | Mappings from placeholder to real outer column |
ScalarSubquery | id | Unique identifier for scalar subquery references |
ScalarSubquery | plan | SELECT plan that must return single column/row |
CorrelatedColumn | placeholder | Synthetic column name injected into subquery expressions |
CorrelatedColumn | column | Canonical outer column name |
CorrelatedColumn | field_path | Nested field access path for struct columns |
Sources: llkv-plan/src/plans.rs:36-67
Correlation Tracking During Planning
During SQL-to-plan translation, LLKV uses the SubqueryCorrelatedTracker to detect when a subquery references outer columns. This tracker is passed through the expression translation pipeline and records each outer column access.
graph TB
subgraph "SQL Translation Phase"
SQL["SQL Query String"]
Parser["sqlparser AST"]
end
subgraph "Subquery Detection"
TranslateExpr["translate_predicate / translate_scalar"]
Tracker["SubqueryCorrelatedTracker"]
Resolver["IdentifierResolver"]
end
subgraph "Placeholder Injection"
OuterColumn["Outer Column Reference"]
Placeholder["Synthetic Placeholder"]
Recording["CorrelatedColumn Entry"]
end
subgraph "Plan Output"
FilterSubquery["FilterSubquery"]
ScalarSubquery["ScalarSubquery"]
SelectPlan["SelectPlan"]
end
SQL --> Parser
Parser --> TranslateExpr
TranslateExpr --> Tracker
TranslateExpr --> Resolver
Tracker --> OuterColumn
OuterColumn --> Placeholder
Placeholder --> Recording
Recording --> FilterSubquery
Recording --> ScalarSubquery
FilterSubquery --> SelectPlan
ScalarSubquery --> SelectPlan
Tracker Architecture
Sources: llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363
Placeholder Generation
When the tracker detects an outer column reference in a subquery, it:
- Generates a unique placeholder string (e.g.,
"__correlated_0__") - Records the mapping:
placeholder → (outer_column, field_path) - Returns the placeholder to the expression translator
- The placeholder is embedded in the subquery's expression tree instead of the original column name
This allows the subquery plan to be "generic" - it references placeholders that will be bound to actual values at execution time.
Sources: llkv-sql/src/sql_engine.rs:337-351
Tracker Extension Traits
The SubqueryCorrelatedTrackerExt trait provides a convenience method to request placeholders directly from catalog resolution results, avoiding repetitive unpacking of ColumnResolution fields.
The SubqueryCorrelatedTrackerOptionExt trait enables chaining optional tracker references through nested translation helpers without explicit as_mut() calls.
Sources: llkv-sql/src/sql_engine.rs:337-363
Subquery Types and Execution
LLKV supports two categories of subqueries, each with distinct execution semantics.
sequenceDiagram
participant Executor as QueryExecutor
participant Filter as Filter Evaluation
participant Subquery as EXISTS Subquery
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Filter: evaluate_predicate_mask()
Filter->>Filter: encounter Expr::Exists
Filter->>Subquery: evaluate_exists_subquery()
Subquery->>Binding: collect_correlated_bindings()
Binding->>Binding: extract outer row values
Binding-->>Subquery: bindings map
Subquery->>Inner: bind_select_plan()
Inner->>Inner: replace placeholders with values
Inner-->>Subquery: bound SelectPlan
Subquery->>Executor: execute_select(bound_plan)
Executor-->>Subquery: SelectExecution stream
Subquery->>Subquery: check if num_rows > 0
Subquery-->>Filter: boolean result
Filter-->>Executor: BooleanArray mask
Filter Subqueries (EXISTS Predicates)
Filter subqueries appear in WHERE clauses as EXISTS or NOT EXISTS predicates. They return a boolean indicating whether the subquery produced any rows.
Sources: llkv-executor/src/lib.rs:773-792
Scalar Subqueries (Projection Values)
Scalar subqueries appear in SELECT projections and must return exactly one column and at most one row. They are evaluated into a single literal value for each outer row.
Sources: llkv-executor/src/lib.rs:794-891
sequenceDiagram
participant Executor as QueryExecutor
participant Projection as Projection Logic
participant Subquery as Scalar Subquery Evaluator
participant Binding as Binding Logic
participant Inner as Inner SelectPlan
Executor->>Projection: evaluate_projection_expression()
Projection->>Projection: encounter ScalarExpr::ScalarSubquery
Projection->>Subquery: evaluate_scalar_subquery_numeric()
loop For each outer row
Subquery->>Subquery: evaluate_scalar_subquery_literal()
Subquery->>Binding: collect_correlated_bindings()
Binding-->>Subquery: bindings for current row
Subquery->>Inner: bind_select_plan()
Inner-->>Subquery: bound plan
Subquery->>Executor: execute_select()
Executor-->>Subquery: result batches
Subquery->>Subquery: validate single column/row
Subquery->>Subquery: convert to Literal
Subquery-->>Projection: literal value
end
Projection->>Projection: build NumericArray from literals
Projection-->>Executor: computed column array
Binding Process
The binding process replaces placeholder identifiers in a subquery plan with actual values from the current outer row.
Correlated Binding Collection
The collect_correlated_bindings function builds a map from placeholder strings to concrete Literal values by:
- Iterating over each
CorrelatedColumnin the subquery metadata - Looking up the outer column in the current RecordBatch
- Extracting the value at the current row index
- Converting the Arrow array value to a
Literal - Storing the mapping:
placeholder → Literal
Sources: Referenced in llkv-executor/src/lib.rs781 llkv-executor/src/lib.rs802
graph LR
subgraph "Input"
OuterBatch["RecordBatch\n(outer query result)"]
RowIndex["Current Row Index"]
Metadata["Vec<CorrelatedColumn>"]
end
subgraph "Processing"
Iterate["For each CorrelatedColumn"]
Lookup["Find column in schema"]
Extract["Extract array[row_idx]"]
Convert["Convert to Literal"]
end
subgraph "Output"
Bindings["FxHashMap<placeholder, Literal>"]
end
OuterBatch --> Iterate
RowIndex --> Iterate
Metadata --> Iterate
Iterate --> Lookup
Lookup --> Extract
Extract --> Convert
Convert --> Bindings
Plan Binding
The bind_select_plan function takes a subquery SelectPlan and a bindings map, then recursively replaces:
- Placeholder column references in filter expressions with
Expr::Literal - Placeholder column references in projections with
ScalarExpr::Literal - Placeholder column references in HAVING clauses
- Placeholder references in nested subqueries
This produces a new SelectPlan that is fully "grounded" with the outer row's values and can be executed independently.
Sources: Referenced in llkv-executor/src/lib.rs782 llkv-executor/src/lib.rs803
Execution Flow: EXISTS Subquery Example
Consider the query:
Planning Phase
Sources: llkv-plan/src/plans.rs:36-46
Execution Phase
Sources: llkv-executor/src/lib.rs:773-792
Execution Flow: Scalar Subquery Example
Consider the query:
Planning Phase
Sources: llkv-plan/src/plans.rs:48-56
Execution Phase
Sources: llkv-executor/src/lib.rs:834-891
Cross-Product Integration
When subqueries appear in cross-product (multi-table) queries, the binding process works identically but must resolve outer column names through the combined schema.
Cross-Product Expression Context
The CrossProductExpressionContext maintains:
- Combined schema from all tables in the FROM clause
- Column lookup map (qualified names → column indices)
- Numeric array cache for evaluated expressions
- Synthetic field ID allocation for subquery results
When evaluating a filter or projection expression that contains subqueries, the context:
- Detects subquery references by
SubqueryId - Calls the appropriate evaluator (
evaluate_exists_subqueryorevaluate_scalar_subquery_numeric) - Passes the combined schema and current row to the binding logic
- Caches numeric results for scalar subqueries to avoid re-evaluation
Sources: llkv-executor/src/lib.rs:1329-1383 llkv-executor/src/lib.rs:1429-1502
Validation and Error Handling
The executor enforces several constraints on subquery results:
| Constraint | Applies To | Error Condition |
|---|---|---|
| Single column | Scalar subqueries | num_columns() != 1 |
| At most one row | Scalar subqueries | num_rows() > 1 |
| Result present | N/A (returns NULL) | num_rows() == 0 for scalar subquery |
Error Examples
Scalar Subquery: Multiple Columns
Scalar Subquery: Multiple Rows
Sources: llkv-executor/src/lib.rs:808-819
Subquery ID Assignment
SubqueryId is a newtype wrapper around usize defined in llkv-expr. The planner assigns sequential IDs as it encounters subqueries during translation, ensuring each subquery has a unique identifier that persists from planning through execution.
The executor uses these IDs to:
- Look up subquery metadata in the plan's
scalar_subqueriesor filter'ssubqueriesvectors - Match subquery references in expression trees (
ScalarExpr::ScalarSubqueryorExpr::Exists) to their plans - Cache evaluation results (for scalar subqueries appearing multiple times)
Sources: [llkv-expr referenced in llkv-plan/src/plans.rs15](https://github.com/jzombie/rust-llkv/blob/4dc34c1f/llkv-expr referenced in llkv-plan/src/plans.rs#L15-L15)
Recursive Subquery Support
LLKV supports nested subqueries where an inner subquery can itself contain subqueries. The binding process is recursive:
- Bind outer-level placeholders in the top-level subquery plan
- For any nested subqueries within that plan, repeat the binding process
- Continue until all correlation layers are resolved
This is handled automatically by bind_select_plan which traverses the entire plan tree.
Sources: Referenced in llkv-executor/src/lib.rs782 llkv-executor/src/lib.rs803
Performance Considerations
Correlation Overhead
Correlated subqueries are executed once per outer row, which can be expensive:
- An outer query returning N rows with a correlated subquery executes N + 1 queries total
- For scalar subqueries in projections with multiple references, results are cached per subquery ID
- EXISTS subqueries short-circuit as soon as a matching row is found (first batch with
num_rows() > 0)
Uncorrelated Subqueries
If a subquery contains no correlated columns (empty correlated_columns vector), it could theoretically be executed once and reused. However, LLKV's current implementation still executes it per outer row. Future optimizations could detect this case and cache the result.
Sources: llkv-executor/src/lib.rs:773-891
Summary
LLKV's subquery handling follows a three-phase model:
-
Planning : The
SubqueryCorrelatedTrackerdetects outer column references and generates placeholders. Plans are built withFilterSubqueryorScalarSubquerystructures containing correlation metadata. -
Binding : At execution time,
collect_correlated_bindingsextracts outer row values andbind_select_planreplaces placeholders with literals, producing a grounded plan. -
Execution : The bound plan is executed as an independent query. EXISTS subqueries return a boolean; scalar subqueries return a single literal (or NULL if empty).
This design keeps the subquery plan generic during planning and binds it dynamically at execution time, enabling proper correlation semantics while maintaining the separation between planning and execution layers.
Sources: llkv-plan/src/plans.rs:28-67 llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363 llkv-executor/src/lib.rs:773-891