Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Subquery and Correlation Handling

Relevant source files

This page documents how LLKV handles subqueries and correlated column references during query planning and execution. Subqueries can appear in WHERE clauses (as EXISTS predicates) or in SELECT projections (as scalar subqueries). When a subquery references columns from its outer query, it is called a correlated subquery , requiring special handling to bind outer row values during execution.

For information about expression evaluation and compilation, see Expression System. For query execution flow, see Query Execution.


Purpose and Scope

Subquery handling in LLKV involves three distinct phases:

  1. Detection and Tracking - During SQL translation, the planner identifies subqueries and tracks which outer columns they reference
  2. Placeholder Injection - Correlated columns are replaced with synthetic placeholder identifiers in the subquery's expression tree
  3. Binding and Execution - At runtime, for each outer row, placeholders are replaced with actual values and the subquery is executed

This document covers the data structures, algorithms, and execution flow for both filter subqueries (EXISTS/NOT EXISTS) and scalar subqueries (single-value returns).


Core Data Structures

LLKV represents subqueries and their correlation metadata through several interconnected structures defined in llkv-plan.

classDiagram
    class SelectFilter {+Expr~String~ predicate\n+Vec~FilterSubquery~ subqueries}
    
    class FilterSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
    
    class ScalarSubquery {+SubqueryId id\n+Box~SelectPlan~ plan\n+Vec~CorrelatedColumn~ correlated_columns}
    
    class CorrelatedColumn {+String placeholder\n+String column\n+Vec~String~ field_path}
    
    class SelectPlan {+Vec~TableRef~ tables\n+Option~SelectFilter~ filter\n+Vec~ScalarSubquery~ scalar_subqueries\n+Vec~SelectProjection~ projections}
    
    SelectPlan --> SelectFilter : filter
    SelectPlan --> ScalarSubquery : scalar_subqueries
    SelectFilter --> FilterSubquery : subqueries
    FilterSubquery --> CorrelatedColumn : correlated_columns
    ScalarSubquery --> CorrelatedColumn : correlated_columns
    FilterSubquery --> SelectPlan : plan
    ScalarSubquery --> SelectPlan : plan

Subquery Plan Structures

Sources: llkv-plan/src/plans.rs:28-67

Field Descriptions

StructureFieldPurpose
FilterSubqueryidUnique identifier used to match subquery in expression tree
FilterSubqueryplanNested SELECT plan to execute for each outer row
FilterSubquerycorrelated_columnsMappings from placeholder to real outer column
ScalarSubqueryidUnique identifier for scalar subquery references
ScalarSubqueryplanSELECT plan that must return single column/row
CorrelatedColumnplaceholderSynthetic column name injected into subquery expressions
CorrelatedColumncolumnCanonical outer column name
CorrelatedColumnfield_pathNested field access path for struct columns

Sources: llkv-plan/src/plans.rs:36-67


Correlation Tracking During Planning

During SQL-to-plan translation, LLKV uses the SubqueryCorrelatedTracker to detect when a subquery references outer columns. This tracker is passed through the expression translation pipeline and records each outer column access.

graph TB
    subgraph "SQL Translation Phase"
        SQL["SQL Query String"]
Parser["sqlparser AST"]
end
    
    subgraph "Subquery Detection"
        TranslateExpr["translate_predicate / translate_scalar"]
Tracker["SubqueryCorrelatedTracker"]
Resolver["IdentifierResolver"]
end
    
    subgraph "Placeholder Injection"
        OuterColumn["Outer Column Reference"]
Placeholder["Synthetic Placeholder"]
Recording["CorrelatedColumn Entry"]
end
    
    subgraph "Plan Output"
        FilterSubquery["FilterSubquery"]
ScalarSubquery["ScalarSubquery"]
SelectPlan["SelectPlan"]
end
    
 
   SQL --> Parser
 
   Parser --> TranslateExpr
 
   TranslateExpr --> Tracker
 
   TranslateExpr --> Resolver
    
 
   Tracker --> OuterColumn
 
   OuterColumn --> Placeholder
 
   Placeholder --> Recording
    
 
   Recording --> FilterSubquery
 
   Recording --> ScalarSubquery
 
   FilterSubquery --> SelectPlan
 
   ScalarSubquery --> SelectPlan

Tracker Architecture

Sources: llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363

Placeholder Generation

When the tracker detects an outer column reference in a subquery, it:

  1. Generates a unique placeholder string (e.g., "__correlated_0__")
  2. Records the mapping: placeholder → (outer_column, field_path)
  3. Returns the placeholder to the expression translator
  4. The placeholder is embedded in the subquery's expression tree instead of the original column name

This allows the subquery plan to be "generic" - it references placeholders that will be bound to actual values at execution time.

Sources: llkv-sql/src/sql_engine.rs:337-351

Tracker Extension Traits

The SubqueryCorrelatedTrackerExt trait provides a convenience method to request placeholders directly from catalog resolution results, avoiding repetitive unpacking of ColumnResolution fields.

The SubqueryCorrelatedTrackerOptionExt trait enables chaining optional tracker references through nested translation helpers without explicit as_mut() calls.

Sources: llkv-sql/src/sql_engine.rs:337-363


Subquery Types and Execution

LLKV supports two categories of subqueries, each with distinct execution semantics.

sequenceDiagram
    participant Executor as QueryExecutor
    participant Filter as Filter Evaluation
    participant Subquery as EXISTS Subquery
    participant Binding as Binding Logic
    participant Inner as Inner SelectPlan
    
    Executor->>Filter: evaluate_predicate_mask()
    Filter->>Filter: encounter Expr::Exists
    Filter->>Subquery: evaluate_exists_subquery()
    Subquery->>Binding: collect_correlated_bindings()
    Binding->>Binding: extract outer row values
    Binding-->>Subquery: bindings map
    Subquery->>Inner: bind_select_plan()
    Inner->>Inner: replace placeholders with values
    Inner-->>Subquery: bound SelectPlan
    Subquery->>Executor: execute_select(bound_plan)
    Executor-->>Subquery: SelectExecution stream
    Subquery->>Subquery: check if num_rows > 0
    Subquery-->>Filter: boolean result
    Filter-->>Executor: BooleanArray mask

Filter Subqueries (EXISTS Predicates)

Filter subqueries appear in WHERE clauses as EXISTS or NOT EXISTS predicates. They return a boolean indicating whether the subquery produced any rows.

Sources: llkv-executor/src/lib.rs:773-792

Scalar Subqueries (Projection Values)

Scalar subqueries appear in SELECT projections and must return exactly one column and at most one row. They are evaluated into a single literal value for each outer row.

Sources: llkv-executor/src/lib.rs:794-891

sequenceDiagram
    participant Executor as QueryExecutor
    participant Projection as Projection Logic
    participant Subquery as Scalar Subquery Evaluator
    participant Binding as Binding Logic
    participant Inner as Inner SelectPlan
    
    Executor->>Projection: evaluate_projection_expression()
    Projection->>Projection: encounter ScalarExpr::ScalarSubquery
    Projection->>Subquery: evaluate_scalar_subquery_numeric()
    
    loop For each outer row
        Subquery->>Subquery: evaluate_scalar_subquery_literal()
        Subquery->>Binding: collect_correlated_bindings()
        Binding-->>Subquery: bindings for current row
        Subquery->>Inner: bind_select_plan()
        Inner-->>Subquery: bound plan
        Subquery->>Executor: execute_select()
        Executor-->>Subquery: result batches
        Subquery->>Subquery: validate single column/row
        Subquery->>Subquery: convert to Literal
        Subquery-->>Projection: literal value
    end
    
    Projection->>Projection: build NumericArray from literals
    Projection-->>Executor: computed column array

Binding Process

The binding process replaces placeholder identifiers in a subquery plan with actual values from the current outer row.

Correlated Binding Collection

The collect_correlated_bindings function builds a map from placeholder strings to concrete Literal values by:

  1. Iterating over each CorrelatedColumn in the subquery metadata
  2. Looking up the outer column in the current RecordBatch
  3. Extracting the value at the current row index
  4. Converting the Arrow array value to a Literal
  5. Storing the mapping: placeholder → Literal

Sources: Referenced in llkv-executor/src/lib.rs781 llkv-executor/src/lib.rs802

graph LR
    subgraph "Input"
        OuterBatch["RecordBatch\n(outer query result)"]
RowIndex["Current Row Index"]
Metadata["Vec<CorrelatedColumn>"]
end
    
    subgraph "Processing"
        Iterate["For each CorrelatedColumn"]
Lookup["Find column in schema"]
Extract["Extract array[row_idx]"]
Convert["Convert to Literal"]
end
    
    subgraph "Output"
        Bindings["FxHashMap<placeholder, Literal>"]
end
    
 
   OuterBatch --> Iterate
 
   RowIndex --> Iterate
 
   Metadata --> Iterate
    
 
   Iterate --> Lookup
 
   Lookup --> Extract
 
   Extract --> Convert
 
   Convert --> Bindings

Plan Binding

The bind_select_plan function takes a subquery SelectPlan and a bindings map, then recursively replaces:

  • Placeholder column references in filter expressions with Expr::Literal
  • Placeholder column references in projections with ScalarExpr::Literal
  • Placeholder column references in HAVING clauses
  • Placeholder references in nested subqueries

This produces a new SelectPlan that is fully "grounded" with the outer row's values and can be executed independently.

Sources: Referenced in llkv-executor/src/lib.rs782 llkv-executor/src/lib.rs803


Execution Flow: EXISTS Subquery Example

Consider the query:

Planning Phase

Sources: llkv-plan/src/plans.rs:36-46

Execution Phase

Sources: llkv-executor/src/lib.rs:773-792


Execution Flow: Scalar Subquery Example

Consider the query:

Planning Phase

Sources: llkv-plan/src/plans.rs:48-56

Execution Phase

Sources: llkv-executor/src/lib.rs:834-891


Cross-Product Integration

When subqueries appear in cross-product (multi-table) queries, the binding process works identically but must resolve outer column names through the combined schema.

Cross-Product Expression Context

The CrossProductExpressionContext maintains:

  • Combined schema from all tables in the FROM clause
  • Column lookup map (qualified names → column indices)
  • Numeric array cache for evaluated expressions
  • Synthetic field ID allocation for subquery results

When evaluating a filter or projection expression that contains subqueries, the context:

  1. Detects subquery references by SubqueryId
  2. Calls the appropriate evaluator (evaluate_exists_subquery or evaluate_scalar_subquery_numeric)
  3. Passes the combined schema and current row to the binding logic
  4. Caches numeric results for scalar subqueries to avoid re-evaluation

Sources: llkv-executor/src/lib.rs:1329-1383 llkv-executor/src/lib.rs:1429-1502


Validation and Error Handling

The executor enforces several constraints on subquery results:

ConstraintApplies ToError Condition
Single columnScalar subqueriesnum_columns() != 1
At most one rowScalar subqueriesnum_rows() > 1
Result presentN/A (returns NULL)num_rows() == 0 for scalar subquery

Error Examples

Scalar Subquery: Multiple Columns

Scalar Subquery: Multiple Rows

Sources: llkv-executor/src/lib.rs:808-819


Subquery ID Assignment

SubqueryId is a newtype wrapper around usize defined in llkv-expr. The planner assigns sequential IDs as it encounters subqueries during translation, ensuring each subquery has a unique identifier that persists from planning through execution.

The executor uses these IDs to:

  • Look up subquery metadata in the plan's scalar_subqueries or filter's subqueries vectors
  • Match subquery references in expression trees (ScalarExpr::ScalarSubquery or Expr::Exists) to their plans
  • Cache evaluation results (for scalar subqueries appearing multiple times)

Sources: [llkv-expr referenced in llkv-plan/src/plans.rs15](https://github.com/jzombie/rust-llkv/blob/4dc34c1f/llkv-expr referenced in llkv-plan/src/plans.rs#L15-L15)


Recursive Subquery Support

LLKV supports nested subqueries where an inner subquery can itself contain subqueries. The binding process is recursive:

  1. Bind outer-level placeholders in the top-level subquery plan
  2. For any nested subqueries within that plan, repeat the binding process
  3. Continue until all correlation layers are resolved

This is handled automatically by bind_select_plan which traverses the entire plan tree.

Sources: Referenced in llkv-executor/src/lib.rs782 llkv-executor/src/lib.rs803


Performance Considerations

Correlation Overhead

Correlated subqueries are executed once per outer row, which can be expensive:

  • An outer query returning N rows with a correlated subquery executes N + 1 queries total
  • For scalar subqueries in projections with multiple references, results are cached per subquery ID
  • EXISTS subqueries short-circuit as soon as a matching row is found (first batch with num_rows() > 0)

Uncorrelated Subqueries

If a subquery contains no correlated columns (empty correlated_columns vector), it could theoretically be executed once and reused. However, LLKV's current implementation still executes it per outer row. Future optimizations could detect this case and cache the result.

Sources: llkv-executor/src/lib.rs:773-891


Summary

LLKV's subquery handling follows a three-phase model:

  1. Planning : The SubqueryCorrelatedTracker detects outer column references and generates placeholders. Plans are built with FilterSubquery or ScalarSubquery structures containing correlation metadata.

  2. Binding : At execution time, collect_correlated_bindings extracts outer row values and bind_select_plan replaces placeholders with literals, producing a grounded plan.

  3. Execution : The bound plan is executed as an independent query. EXISTS subqueries return a boolean; scalar subqueries return a single literal (or NULL if empty).

This design keeps the subquery plan generic during planning and binds it dynamically at execution time, enabling proper correlation semantics while maintaining the separation between planning and execution layers.

Sources: llkv-plan/src/plans.rs:28-67 llkv-sql/src/sql_engine.rs24 llkv-sql/src/sql_engine.rs:326-363 llkv-executor/src/lib.rs:773-891