This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Expression Translation


Purpose and Scope

Expression Translation is the process of converting expressions that reference columns by name (as strings) into expressions that reference columns by numeric field identifiers (FieldId). This translation bridges the gap between the SQL parsing/planning layer—which operates on human-readable column names—and the execution layer, which requires efficient numeric field identifiers for accessing columnar storage.

This page documents the translation mechanisms, key functions, and integration points. For information about the expression AST types themselves, see Expression AST. For details on how translated expressions are compiled into executable programs, see Program Compilation.

Sources: llkv-expr/src/expr.rs:1-819 llkv-executor/src/lib.rs:87-97


The Parameterized Expression Type System

The LLKV expression system uses generic type parameters to support multiple identifier types throughout the query processing pipeline. All expression types are parameterized over a field identifier type F:

| Expression Type | Description | Parameter |
|---|---|---|
| Expr<'a, F> | Boolean predicate expression | Field identifier type F |
| ScalarExpr<F> | Arithmetic/scalar expression | Field identifier type F |
| Filter<'a, F> | Single-field predicate | Field identifier type F |

The parameterization allows the same expression structures to be used with different identifier representations:

  • During Planning: Expr<'static, String> and ScalarExpr<String> use column names as parsed from SQL
  • During Execution: Expr<'static, FieldId> and ScalarExpr<FieldId> use numeric field identifiers for efficient storage access

This design separates concerns: the planner manipulates human-readable names without needing catalog knowledge, while the executor works with resolved numeric identifiers that map directly to physical storage locations.

graph TD
    subgraph "SQL Parsing Layer"
        SQL["SQL Query Text"]
        PARSER["sqlparser"]
        AST["SQL AST"]
    end

    subgraph "Planning Layer"
        PLANNER["Query Planner"]
        EXPR_STRING["Expr&lt;String&gt;\nScalarExpr&lt;String&gt;"]
        PLAN["SelectPlan"]
    end

    subgraph "Translation Layer"
        TRANSLATOR["translate_scalar\ntranslate_predicate"]
        SCHEMA["Schema / Catalog"]
        RESOLVER["IdentifierResolver"]
    end

    subgraph "Execution Layer"
        EXPR_FIELDID["Expr&lt;FieldId&gt;\nScalarExpr&lt;FieldId&gt;"]
        EVALUATOR["Expression Evaluator"]
        STORAGE["Column Store"]
    end

    SQL --> PARSER
    PARSER --> AST
    AST --> PLANNER
    PLANNER --> EXPR_STRING
    EXPR_STRING --> PLAN

    PLAN --> TRANSLATOR
    SCHEMA --> TRANSLATOR
    RESOLVER --> TRANSLATOR
    TRANSLATOR --> EXPR_FIELDID

    EXPR_FIELDID --> EVALUATOR
    EVALUATOR --> STORAGE

Diagram: Expression Translation Flow from SQL to Execution
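
To make the parameterization concrete, here is a minimal sketch of one generic expression type instantiated with both identifier kinds. The three-variant enum, the u32 alias for FieldId, and the map_columns and planned_expr helpers are all hypothetical simplifications for illustration; they are not the definitions in llkv-expr.

```rust
// Simplified, hypothetical stand-ins for the llkv-expr types; the real enums
// have many more variants.
type FieldId = u32; // assumption: the real FieldId is some numeric type

#[derive(Debug, Clone, PartialEq)]
enum ScalarExpr<F> {
    Column(F),
    Literal(i64),
    Add(Box<ScalarExpr<F>>, Box<ScalarExpr<F>>),
}

impl<F> ScalarExpr<F> {
    /// Rebuild the same tree shape with every column identifier mapped by `f`.
    fn map_columns<G>(&self, f: &impl Fn(&F) -> G) -> ScalarExpr<G> {
        match self {
            ScalarExpr::Column(c) => ScalarExpr::Column(f(c)),
            ScalarExpr::Literal(v) => ScalarExpr::Literal(*v),
            ScalarExpr::Add(l, r) => ScalarExpr::Add(
                Box::new(l.map_columns(f)),
                Box::new(r.map_columns(f)),
            ),
        }
    }
}

/// Planning-side form of `price + 1`: the column is referenced by name.
fn planned_expr() -> ScalarExpr<String> {
    ScalarExpr::Add(
        Box::new(ScalarExpr::Column("price".to_string())),
        Box::new(ScalarExpr::Literal(1)),
    )
}
```

Because the tree shape is independent of F, the only thing that changes between planning and execution in this sketch is the payload of the Column variant.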

Sources: llkv-expr/src/expr.rs:14-182 llkv-executor/src/lib.rs:87-97


Core Translation Functions

The translation layer exposes a set of functions for converting string-based expressions to field ID-based expressions. These functions are defined in the llkv-plan crate’s translation module and re-exported by llkv-executor for convenience.

Primary Translation Functions

| Function | Purpose | Signature Pattern |
|---|---|---|
| translate_scalar | Translate scalar expressions | (expr: &ScalarExpr<String>, schema, error_fn) -> Result<ScalarExpr<FieldId>> |
| translate_scalar_with | Translate with custom resolver | (expr: &ScalarExpr<String>, resolver, error_fn) -> Result<ScalarExpr<FieldId>> |
| translate_predicate | Translate filter predicates | (expr: &Expr<String>, schema, error_fn) -> Result<Expr<FieldId>> |
| translate_predicate_with | Translate predicate with resolver | (expr: &Expr<String>, resolver, error_fn) -> Result<Expr<FieldId>> |
| resolve_field_id_from_schema | Resolve single column name | (name: &str, schema) -> Result<FieldId> |

The _with variants accept an IdentifierResolver reference for more complex scenarios (multi-table queries, subqueries, etc.), while the simpler variants accept a schema directly and construct a resolver internally.

Usage Pattern

Translation functions follow a consistent pattern: they take a string-based expression, schema/resolver information, and an error handler closure. The error handler is invoked when a column name cannot be resolved, allowing callers to customize error messages.

The error closure receives the unresolved column name and returns an appropriate error type. This pattern appears throughout the executor when translating expressions from plans.
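
The calling convention can be sketched as follows. This is an illustration under stated assumptions, not the llkv-plan implementation: the schema is reduced to a name-to-id HashMap, ScalarExpr is cut down to three variants, FieldId is assumed to be u32, and the error type is left generic so the caller's closure decides what an unresolved column becomes.

```rust
use std::collections::HashMap;

type FieldId = u32; // assumption: a numeric alias
/// Hypothetical schema stand-in: column name -> field id.
type Schema = HashMap<String, FieldId>;

#[derive(Debug, PartialEq)]
enum ScalarExpr<F> {
    Column(F),
    Literal(i64),
    Binary { left: Box<ScalarExpr<F>>, op: char, right: Box<ScalarExpr<F>> },
}

/// Translate names to field ids; `error_fn` builds the error for an unknown column.
fn translate_scalar<E>(
    expr: &ScalarExpr<String>,
    schema: &Schema,
    error_fn: &impl Fn(&str) -> E,
) -> Result<ScalarExpr<FieldId>, E> {
    Ok(match expr {
        ScalarExpr::Column(name) => {
            let id = schema
                .get(name)
                .copied()
                .ok_or_else(|| error_fn(name.as_str()))?;
            ScalarExpr::Column(id)
        }
        ScalarExpr::Literal(v) => ScalarExpr::Literal(*v),
        ScalarExpr::Binary { left, op, right } => ScalarExpr::Binary {
            left: Box::new(translate_scalar(left, schema, error_fn)?),
            op: *op,
            right: Box::new(translate_scalar(right, schema, error_fn)?),
        },
    })
}
```

A caller translating a WHERE clause might pass a closure such as |name| format!("unknown column '{name}' in filter"), while an aggregate call site would supply a different message for the same failure.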

Sources: llkv-executor/src/lib.rs:87-97 llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059


Schema-Based Resolution

Column name resolution relies on Arrow schema information to map string identifiers to numeric field IDs. The resolution process handles case-insensitive matching and validates that referenced columns actually exist in the schema.

Resolution Workflow

graph LR
    subgraph "Input"
        EXPR_STR["ScalarExpr&lt;String&gt;"]
        COLUMN_NAME["Column Name: 'user_id'"]
    end

    subgraph "Resolution Context"
        SCHEMA["Arrow Schema"]
        FIELDS["Field Definitions"]
        METADATA["Field Metadata"]
    end

    subgraph "Resolution Process"
        NORMALIZE["Normalize Name\n(case-insensitive)"]
        LOOKUP["Lookup in Schema"]
        EXTRACT_ID["Extract FieldId"]
    end

    subgraph "Output"
        EXPR_FIELD["ScalarExpr&lt;FieldId&gt;"]
        FIELD_ID["FieldId: 42"]
    end

    COLUMN_NAME --> NORMALIZE
    SCHEMA --> LOOKUP
    NORMALIZE --> LOOKUP
    LOOKUP --> EXTRACT_ID
    EXTRACT_ID --> FIELD_ID

    EXPR_STR --> NORMALIZE
    EXTRACT_ID --> EXPR_FIELD

Diagram: Column Name to FieldId Resolution

The resolve_field_id_from_schema function performs the core resolution logic. It searches the schema’s field definitions for a matching column name and extracts the associated field ID from the field’s metadata.
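
The lookup described above can be sketched without the Arrow dependency. Everything here is a stand-in: the Field struct mimics only the name and metadata of an Arrow field, the "field_id" metadata key is an assumption (this page does not name the key the real crates use), and errors are plain strings.

```rust
use std::collections::HashMap;

type FieldId = u32; // assumption: a numeric alias

/// Hypothetical stand-in for an Arrow field: a name plus string metadata.
struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

/// Assumed metadata key; the key used by the real crates is not shown here.
const FIELD_ID_KEY: &str = "field_id";

/// Find the column case-insensitively, then parse its id from the metadata.
fn resolve_field_id_from_schema(name: &str, schema: &[Field]) -> Result<FieldId, String> {
    let field = schema
        .iter()
        .find(|f| f.name.eq_ignore_ascii_case(name))
        .ok_or_else(|| format!("unknown column '{name}'"))?;
    let raw = field
        .metadata
        .get(FIELD_ID_KEY)
        .ok_or_else(|| format!("column '{name}' has no field id metadata"))?;
    raw.parse::<FieldId>()
        .map_err(|e| format!("column '{name}' has an invalid field id: {e}"))
}

/// Test helper: build a field whose metadata carries the given id.
fn field(name: &str, id: &str) -> Field {
    Field {
        name: name.to_string(),
        metadata: HashMap::from([(FIELD_ID_KEY.to_string(), id.to_string())]),
    }
}
```

The sketch separates the two failure modes (column missing, metadata missing or malformed) so each can be reported distinctly.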

Schema Structure

Arrow schemas used during translation contain:

  • Field Definitions: Name, data type, nullability
  • Field Metadata: Key-value pairs including the numeric field ID
  • Nested Field Support: For struct types, schemas may contain nested field hierarchies

The translation process must handle qualified names (e.g., table.column), nested field access (e.g., user.address.city), and alias resolution when applicable.

Sources: llkv-executor/src/lib.rs:87-97


Field Path Resolution for Nested Fields

When expressions reference nested fields within struct types, the translation process must resolve not just the top-level column but the entire field path. This is handled through the IdentifierResolver and ColumnResolution types provided by llkv-table/catalog.

ColumnResolution Structure

graph TD
    subgraph "Input Expression"
        NESTED["GetField Expression"]
        BASE["base: user"]
        FIELD["field_name: 'address'"]
        SUBFIELD["field_name: 'city'"]
    end

    subgraph "Resolver"
        RESOLVER["IdentifierResolver"]
        CONTEXT["IdentifierContext"]
    end

    subgraph "Resolution Result"
        COL_RES["ColumnResolution"]
        COL_NAME["column(): 'user'"]
        FIELD_PATH["field_path(): ['address', 'city']"]
        FIELD_ID["Resolved FieldId"]
    end

    NESTED --> RESOLVER
    CONTEXT --> RESOLVER
    RESOLVER --> COL_RES
    COL_RES --> COL_NAME
    COL_RES --> FIELD_PATH
    COL_RES --> FIELD_ID

Diagram: Nested Field Resolution

The ColumnResolution type encapsulates the resolution result, providing:

  • The base column name
  • The field path for nested access (empty for top-level columns)
  • The resolved field ID for storage access

This information is used during correlated subquery tracking and when translating GetField expressions in the scalar expression tree.
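
A rough sketch of such a resolution result is shown below. The accessor names column() and field_path() come from the diagram; the struct layout, the field_id() accessor, the u32 alias, and the resolve_path helper that splits a dotted reference are assumptions made for illustration and may differ from the types in llkv-table.

```rust
type FieldId = u32; // assumption: a numeric alias

/// Hypothetical shape of a resolution result.
#[derive(Debug, PartialEq)]
struct ColumnResolution {
    column: String,
    field_path: Vec<String>,
    field_id: FieldId,
}

impl ColumnResolution {
    fn column(&self) -> &str {
        &self.column
    }
    /// Empty for a top-level column, otherwise the nested struct fields in order.
    fn field_path(&self) -> &[String] {
        &self.field_path
    }
    fn field_id(&self) -> FieldId {
        self.field_id
    }
}

/// Resolve a dotted reference such as `user.address.city` against the known
/// top-level columns. The first segment selects the column; the rest is the path.
fn resolve_path(reference: &str, columns: &[(&str, FieldId)]) -> Option<ColumnResolution> {
    let mut parts = reference.split('.');
    let base = parts.next()?;
    let (name, id) = columns
        .iter()
        .find(|(n, _)| n.eq_ignore_ascii_case(base))?;
    Some(ColumnResolution {
        column: name.to_string(),
        field_path: parts.map(str::to_string).collect(),
        field_id: *id,
    })
}
```

Note that this sketch treats every dot as struct field access; distinguishing table.column from column.field requires the table context discussed in the next section.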

Sources: llkv-sql/src/sql_engine.rs:37-38 llkv-sql/src/sql_engine.rs:420-427


Translation in Multi-Table Contexts

When translating expressions for queries involving multiple tables (joins, cross products, subqueries), the translation process must disambiguate column references that may appear in multiple tables. This is handled by the IdentifierResolver which maintains context about available tables and their schemas.

IdentifierContext and Resolution

The IdentifierContext type (from llkv-table/catalog) represents the set of tables and columns available in a given scope. During translation:

  1. Outer Scope Tracking: For subqueries, outer table contexts are tracked separately
  2. Column Disambiguation: Qualified names (e.g., table.column) are resolved against the appropriate table
  3. Ambiguity Detection: Unqualified references to columns that exist in multiple tables produce errors

The translate_predicate_with and translate_scalar_with functions accept an IdentifierResolver reference that encapsulates this context.
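
The disambiguation and ambiguity rules can be illustrated with a small scope type. Scope, ResolveError, and the resolve method are hypothetical names for this sketch; they are not the IdentifierResolver or IdentifierContext API, and outer-scope tracking for subqueries is left out.

```rust
type FieldId = u32; // assumption: a numeric alias

#[derive(Debug, PartialEq)]
enum ResolveError {
    Unknown(String),
    Ambiguous(String),
}

/// Hypothetical scope: each visible table with its (column, field id) pairs.
struct Scope<'a> {
    tables: Vec<(&'a str, Vec<(&'a str, FieldId)>)>,
}

impl Scope<'_> {
    /// Resolve `column` or `table.column` to the owning table and field id.
    fn resolve(&self, reference: &str) -> Result<(String, FieldId), ResolveError> {
        let (qualifier, column) = match reference.split_once('.') {
            Some((table, column)) => (Some(table), column),
            None => (None, reference),
        };
        let mut hits = Vec::new();
        for (table, columns) in &self.tables {
            // A qualifier restricts the search to the named table.
            if let Some(q) = qualifier {
                if !q.eq_ignore_ascii_case(table) {
                    continue;
                }
            }
            for (name, id) in columns {
                if name.eq_ignore_ascii_case(column) {
                    hits.push((table.to_string(), *id));
                }
            }
        }
        match hits.len() {
            0 => Err(ResolveError::Unknown(reference.to_string())),
            1 => Ok(hits.remove(0)),
            _ => Err(ResolveError::Ambiguous(reference.to_string())),
        }
    }
}
```

With tables orders(id, total) and users(id, name) in scope, users.id and total resolve uniquely, while a bare id matches both tables and is reported as ambiguous.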

Sources: llkv-sql/src/sql_engine.rs:37-38


Error Handling and Diagnostics

Translation failures occur when column names cannot be resolved. The error handling strategy uses caller-provided closures to generate context-specific error messages.

Error Patterns

| Scenario | Error Message Pattern |
|---|---|
| Unknown column in aggregate | "unknown column '{name}' in aggregate expression" |
| Unknown column in WHERE clause | "unknown column '{name}' in filter" |
| Unknown column in cross product | "column '{name}' not found in cross product result" |
| Ambiguous column reference | "column '{name}' is ambiguous" |

The error closure pattern allows the caller to include query-specific context in error messages. This is particularly important for debugging complex queries where the same expression type might be used in multiple contexts.

Resolution Failure Example

When translate_scalar encounters a ScalarExpr::Column(name) variant and the name cannot be found in the schema, it invokes the error closure with the unresolved name.
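
As a sketch of how one resolution routine can yield the context-specific messages from the table above, the example below pairs a lookup helper with a small closure factory. TranslateError, resolve_column, and unknown_in are hypothetical names; the real executor uses its own error type and builds its closures inline.

```rust
/// Hypothetical error type carrying only a message.
#[derive(Debug, PartialEq)]
struct TranslateError(String);

/// Resolve one column; on failure, let the caller's closure build the error.
fn resolve_column<E>(
    name: &str,
    known: &[(&str, u32)],
    error_fn: impl Fn(&str) -> E,
) -> Result<u32, E> {
    known
        .iter()
        .find(|(n, _)| n.eq_ignore_ascii_case(name))
        .map(|(_, id)| *id)
        .ok_or_else(|| error_fn(name))
}

/// Build an error closure that names the clause being translated.
fn unknown_in(context: &'static str) -> impl Fn(&str) -> TranslateError {
    move |name: &str| TranslateError(format!("unknown column '{name}' in {context}"))
}
```

Passing unknown_in("filter") or unknown_in("aggregate expression") changes only the message, not the resolution logic.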

Sources: llkv-executor/src/lib.rs:485-489 llkv-executor/src/lib.rs:1054-1059


Integration with Query Execution Pipeline

Expression translation occurs at the boundary between planning and execution. Plans produced by the SQL layer contain string-based expressions, which are translated as execution structures are built.

Translation Points in Execution

graph TB
    subgraph "Planning Phase"
        SQL["SQL Statement"]
        PARSE["Parse & Build Plan"]
        PLAN["SelectPlan\nUpdatePlan\netc."]
        EXPR_STR["Expressions with\nString identifiers"]
    end

    subgraph "Execution Preparation"
        GET_TABLE["Get Table Handle"]
        SCHEMA_FETCH["Fetch Schema"]
        TRANSLATE["Translation Functions"]
        EXPR_FIELD["Expressions with\nFieldId identifiers"]
    end

    subgraph "Execution Phase"
        BUILD_SCAN["Build ScanProjection"]
        COMPILE["Compile to EvalProgram"]
        EVALUATE["Evaluate Against Batches"]
        RESULTS["RecordBatch Results"]
    end

    SQL --> PARSE
    PARSE --> PLAN
    PLAN --> EXPR_STR

    PLAN --> GET_TABLE
    GET_TABLE --> SCHEMA_FETCH
    SCHEMA_FETCH --> TRANSLATE
    EXPR_STR --> TRANSLATE
    TRANSLATE --> EXPR_FIELD

    EXPR_FIELD --> BUILD_SCAN
    BUILD_SCAN --> COMPILE
    COMPILE --> EVALUATE
    EVALUATE --> RESULTS

Diagram: Translation in the Execution Pipeline

Key Translation Points

  1. Filter Translation: When building scan plans, WHERE clause expressions are translated before being passed to the scan optimizer
  2. Projection Translation: Computed columns in SELECT projections are translated before evaluation
  3. Aggregate Translation: Aggregate function arguments are translated to resolve column references
  4. Join Condition Translation: ON clause expressions for joins are translated in the context of both joined tables

The executor’s ensure_computed_projection function demonstrates this integration. It translates a string-based expression, infers its result data type, and registers it as a computed projection for the scan.

This function encapsulates the full translation workflow: resolve column names, infer types, and prepare the translated expression for execution.

Sources: llkv-executor/src/lib.rs:470-501 llkv-executor/src/lib.rs:87-97


Translation of Complex Expression Types

The translation process must handle all variants of the expression AST, recursively translating nested expressions while preserving structure and semantics.

Recursive Translation Table

| Expression Variant | Translation Strategy |
|---|---|
| ScalarExpr::Column | Resolve string to FieldId via schema |
| ScalarExpr::Literal | No translation needed (no field references) |
| ScalarExpr::Binary | Recursively translate left and right operands |
| ScalarExpr::Aggregate | Translate the aggregate’s argument expression |
| ScalarExpr::GetField | Translate base expression, preserve field name |
| ScalarExpr::Cast | Translate inner expression, preserve target type |
| ScalarExpr::Compare | Recursively translate both comparison operands |
| ScalarExpr::Coalesce | Translate each expression in the list |
| ScalarExpr::Case | Translate operand and all WHEN/THEN/ELSE branches |
| ScalarExpr::ScalarSubquery | No translation (contains SubqueryId reference) |
| ScalarExpr::Random | No translation (no field references) |

For predicate expressions (Expr<F>):

| Predicate Variant | Translation Strategy |
|---|---|
| Expr::And / Expr::Or | Recursively translate all sub-expressions |
| Expr::Not | Recursively translate inner expression |
| Expr::Pred(Filter) | Translate filter’s field ID, preserve operator |
| Expr::Compare | Translate left and right scalar expressions |
| Expr::InList | Translate target expression and list elements |
| Expr::IsNull | Translate the operand expression |
| Expr::Literal | No translation (constant boolean value) |
| Expr::Exists | No translation (contains SubqueryId reference) |

The translation process maintains the expression tree structure while substituting field identifiers, ensuring that evaluation semantics remain unchanged.
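
The structure-preserving recursion for predicates can be sketched on a reduced enum. The variants below are a hypothetical subset: IsNull holds a bare column rather than a scalar expression, Pred, Compare, and InList are omitted, and FieldId and SubqueryId are assumed to be u32. The point is only to show which arms recurse, which resolve a name, and which pass through unchanged.

```rust
use std::collections::HashMap;

type FieldId = u32; // assumption: a numeric alias
type SubqueryId = u32; // assumption: a numeric alias
/// Hypothetical schema stand-in: column name -> field id.
type Schema = HashMap<String, FieldId>;

#[derive(Debug, PartialEq)]
enum Expr<F> {
    And(Vec<Expr<F>>),
    Or(Vec<Expr<F>>),
    Not(Box<Expr<F>>),
    IsNull(F), // simplified: operand is a bare column here
    Literal(bool),
    Exists(SubqueryId),
}

/// Translate every child, stopping at the first unresolved column.
fn translate_all(items: &[Expr<String>], schema: &Schema) -> Result<Vec<Expr<FieldId>>, String> {
    items.iter().map(|e| translate_predicate(e, schema)).collect()
}

fn translate_predicate(expr: &Expr<String>, schema: &Schema) -> Result<Expr<FieldId>, String> {
    Ok(match expr {
        // Composite nodes recurse and keep their shape.
        Expr::And(items) => Expr::And(translate_all(items, schema)?),
        Expr::Or(items) => Expr::Or(translate_all(items, schema)?),
        Expr::Not(inner) => Expr::Not(Box::new(translate_predicate(inner, schema)?)),
        // Leaf with a column reference: the only place a lookup happens.
        Expr::IsNull(name) => Expr::IsNull(
            *schema
                .get(name)
                .ok_or_else(|| format!("unknown column '{name}' in filter"))?,
        ),
        // Leaves without field references are copied as-is.
        Expr::Literal(b) => Expr::Literal(*b),
        Expr::Exists(id) => Expr::Exists(*id),
    })
}
```

Collecting into Result<Vec<_>, _> means a single unknown column aborts the whole translation, which matches the all-or-nothing behavior implied by the Result-returning signatures above.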

Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:14-66


Performance Considerations

Expression translation is performed once during query execution setup, not per-row or per-batch. The translated expressions are then compiled into evaluation programs (see Program Compilation) which are reused across all batches in the query result.

Translation Caching

The executor maintains caches to avoid redundant translation work:

  • Computed Projection Cache: Stores translated expressions keyed by their string representation to avoid re-translating identical expressions in the same query
  • Column Projection Cache: Maps field IDs to projection indices to reuse existing projections when multiple expressions reference the same column

This caching strategy is evident in functions like ensure_computed_projection, which checks the cache before performing translation.
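
The check-then-translate pattern might look roughly like the sketch below. Only the function name ensure_computed_projection comes from this page; the ProjectionBuilder struct, its fields, the use of plain strings as stand-ins for translated expressions, and the translations counter are assumptions added to make the cache behavior observable.

```rust
use std::collections::HashMap;

/// Hypothetical projection state; the real executor's structures are not shown here.
#[derive(Default)]
struct ProjectionBuilder {
    computed: Vec<String>,         // stand-in for translated expressions
    cache: HashMap<String, usize>, // expression text -> projection index
    translations: usize,           // how many times translation actually ran
}

impl ProjectionBuilder {
    /// Return the projection index for `expr_text`, translating only on a cache miss.
    fn ensure_computed_projection(
        &mut self,
        expr_text: &str,
        translate: impl Fn(&str) -> Result<String, String>,
    ) -> Result<usize, String> {
        if let Some(&index) = self.cache.get(expr_text) {
            return Ok(index); // identical expression already registered
        }
        let translated = translate(expr_text)?;
        self.translations += 1;
        let index = self.computed.len();
        self.computed.push(translated);
        self.cache.insert(expr_text.to_string(), index);
        Ok(index)
    }
}
```

Requesting the same expression text twice returns the same index and leaves the translation count unchanged, so a projection list that repeats an expression pays for translation once.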

Sources: llkv-executor/src/lib.rs:470-501
