Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Program Compilation

Loading…

Program Compilation

Relevant source files

Purpose and Scope

Program compilation transforms expression ASTs into executable bytecode optimized for evaluation. This intermediate representation enables efficient vectorized operations and simplifies the runtime evaluation engine. The compilation phase bridges the gap between high-level expression trees and low-level execution kernels.

This page covers the compilation of ScalarExpr and filter Expr trees into EvalProgram and DomainProgram bytecode respectively. For information about the expression AST structure, see Expression AST. For how compiled programs are evaluated, see Scalar Evaluation and NumericKernels.

Compilation Pipeline Overview

Sources: llkv-expr/src/expr.rs:1-819

Compilation Targets

The compilation system produces two distinct bytecode formats depending on the expression context:

Program TypeInput ASTPurposeOutput
EvalProgramScalarExpr<FieldId>Compute scalar values per rowArrow array of computed values
DomainProgramExpr<FieldId>Evaluate boolean predicatesBitmap of matching row IDs

EvalProgram Structure

EvalProgram compiles scalar expressions into a stack-based bytecode suitable for vectorized evaluation. Each instruction operates on Arrow arrays, producing intermediate or final results:

Sources: llkv-expr/src/expr.rs:125-307

flowchart TB
    subgraph ScalarInput["Scalar Expression Tree"]
ROOT["Binary: Add"]
LEFT["Binary: Multiply"]
RIGHT["Cast: Float64"]
COL1["Column: field_id=1"]
LIT1["Literal: 10"]
COL2["Column: field_id=2"]
end
    
 
   ROOT --> LEFT
 
   ROOT --> RIGHT
 
   LEFT --> COL1
 
   LEFT --> LIT1
 
   RIGHT --> COL2
    
    subgraph Bytecode["EvalProgram Bytecode"]
I1["LOAD_COLUMN field_id=1"]
I2["LOAD_LITERAL 10"]
I3["MULTIPLY"]
I4["LOAD_COLUMN field_id=2"]
I5["CAST Float64"]
I6["ADD"]
end
    
 
   I1 --> I2
 
   I2 --> I3
 
   I3 --> I4
 
   I4 --> I5
 
   I5 --> I6
    
    subgraph Execution["Execution"]
STACK["Evaluation Stack\nArray-based operations"]
RESULT["Result Array"]
end
    
 
   I6 --> STACK
 
   STACK --> RESULT

DomainProgram Structure

DomainProgram compiles filter predicates into bytecode optimized for boolean evaluation and row filtering. The program operates on column metadata and data chunks to identify matching rows:

Sources: llkv-expr/src/expr.rs:14-123 llkv-expr/src/expr.rs:365-428

flowchart TB
    subgraph FilterInput["Filter Expression Tree"]
AND["And"]
PRED1["Pred: field_id=1 > 100"]
PRED2["Pred: field_id=2 IN values"]
NOT["Not"]
PRED3["Pred: field_id=3 LIKE pattern"]
end
    
 
   AND --> PRED1
 
   AND --> NOT
 
   NOT --> PRED2
 
   AND --> PRED3
    
    subgraph DomainBytecode["DomainProgram Bytecode"]
D1["EVAL_PRED field_id=1\nGreaterThan 100"]
D2["EVAL_PRED field_id=2\nIn values"]
D3["NEGATE"]
D4["EVAL_PRED field_id=3\nContains pattern"]
D5["AND_ALL"]
end
    
 
   D1 --> D2
 
   D2 --> D3
 
   D3 --> D4
 
   D4 --> D5
    
    subgraph BitmapOps["Bitmap Operations"]
B1["Bitmap: field_id=1 matches"]
B2["Bitmap: field_id=2 matches"]
B3["Bitmap: NOT operation"]
B4["Bitmap: field_id=3 matches"]
B5["Bitmap: AND operation"]
FINAL["Final: matching row IDs"]
end
    
 
   D5 --> B1
 
   B1 --> B2
 
   B2 --> B3
 
   B3 --> B4
 
   B4 --> B5
 
   B5 --> FINAL

Expression Analysis and Type Inference

Before bytecode generation, the compiler analyzes the expression tree to:

  1. Resolve data types - Each expression node’s output type is inferred from its inputs and operation
  2. Validate operations - Ensure type compatibility for binary operations and function calls
  3. Track dependencies - Identify which columns and subqueries the expression requires
  4. Detect constant subexpressions - Find opportunities for constant folding

Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:309-363

Compilation Optimizations

The compilation phase applies several optimization passes to improve evaluation performance:

Constant Folding

Expressions involving only literal values are evaluated at compile time:

Original ExpressionOptimized Form
Literal(2) + Literal(3)Literal(5)
Literal(10) * Literal(0)Literal(0)
Cast(Literal("123"), Int64)Literal(123)

Expression Simplification

Algebraic identities and boolean logic simplifications reduce instruction count:

PatternSimplified To
x + 0x
x * 1x
x * 00
And([true, expr])expr
Or([false, expr])expr
Not(Not(expr))expr

Dead Code Elimination

Unreachable code paths in Case expressions are removed:

Sources: llkv-expr/src/expr.rs:169-176

Bytecode Generation

After optimization, the compiler generates bytecode instructions by walking the expression tree in post-order (depth-first). Each expression variant maps to one or more bytecode instructions:

Scalar Expression Instruction Mapping

Expression VariantGenerated Instructions
Column(field_id)LOAD_COLUMN field_id
Literal(value)LOAD_LITERAL value
Binary{left, op, right}Compile left → Compile right → BINARY_OP op
Cast{expr, data_type}Compile expr → CAST data_type
Compare{left, op, right}Compile left → Compile right → COMPARE op
Coalesce(exprs)Compile all exprs → COALESCE count
Case{operand, branches, else}Complex multi-instruction sequence with jump tables
RandomRANDOM_F64

Sources: llkv-expr/src/expr.rs:125-182

Filter Expression Instruction Mapping

Expression VariantGenerated Instructions
Pred(Filter{field_id, op})EVAL_PREDICATE field_id op
And(exprs)Compile all exprs → AND_ALL count
Or(exprs)Compile all exprs → OR_ALL count
Not(expr)Compile expr → NEGATE
Compare{left, right, op}Compile as scalar → TO_BOOLEAN
InList{expr, list, negated}Compile expr → Build lookup table → IN_SET negated
IsNull{expr, negated}Compile expr → IS_NULL negated
Literal(bool)LOAD_BOOL value
Exists(subquery)EVAL_SUBQUERY subquery_id

Sources: llkv-expr/src/expr.rs:14-56

Subquery Handling

Correlated subqueries require special compilation handling. The compiler generates placeholder references that are resolved during execution:

Sources: llkv-expr/src/expr.rs:45-66

flowchart TB
    subgraph Outer["Outer Query Expression"]
FILTER["Filter with EXISTS"]
SUBQUERY["Subquery: SubqueryId=1\nCorrelated columns"]
end
    
    subgraph Compilation["Compilation Strategy"]
PLACEHOLDER["Generate EVAL_SUBQUERY\ninstruction with ID"]
CORRELATION["Track correlated\ncolumn mappings"]
DEFER["Defer actual subquery\ncompilation to executor"]
end
    
    subgraph Execution["Runtime Resolution"]
BIND["Bind outer row values\nto correlated placeholders"]
EVAL_SUB["Execute subquery plan\nwith bound values"]
COLLECT["Collect subquery\nresult set"]
BOOLEAN["Convert to boolean\nfor EXISTS/IN"]
end
    
 
   FILTER --> SUBQUERY
 
   SUBQUERY --> PLACEHOLDER
 
   PLACEHOLDER --> CORRELATION
 
   CORRELATION --> DEFER
    
 
   DEFER --> BIND
 
   BIND --> EVAL_SUB
 
   EVAL_SUB --> COLLECT
 
   COLLECT --> BOOLEAN

Aggregate Function Compilation

Aggregate expressions within scalar contexts (e.g., COUNT(*) + 1) compile to instructions that reference pre-computed aggregate results:

Sources: llkv-expr/src/expr.rs:184-215

flowchart LR
    subgraph AggExpr["Aggregate Expression"]
AGG_CALL["AggregateCall\nCountStar / Sum / Avg"]
end
    
    subgraph Compilation["Compilation"]
REF["Generate AGG_REFERENCE\ninstruction"]
METADATA["Store aggregate metadata\nfunction type, distinct flag"]
end
    
    subgraph PreExecution["Pre-Execution Phase"]
COMPUTE["Executor computes\naggregate values"]
STORE["Store in aggregate\nresult table"]
end
    
    subgraph Evaluation["Expression Evaluation"]
LOOKUP["AGG_REFERENCE instruction\nlooks up pre-computed value"]
BROADCAST["Broadcast scalar result\nto array length"]
CONTINUE["Continue with remaining\nexpression operations"]
end
    
 
   AGG_CALL --> REF
 
   REF --> METADATA
 
   METADATA --> COMPUTE
 
   COMPUTE --> STORE
 
   STORE --> LOOKUP
 
   LOOKUP --> BROADCAST
 
   BROADCAST --> CONTINUE

Integration with Execution Layer

Compiled programs are executed by the compute kernels layer, which provides vectorized implementations of each bytecode instruction:

Sources: llkv-expr/src/expr.rs:1-819

flowchart TB
    subgraph Programs["Compiled Programs"]
EP["EvalProgram"]
DP["DomainProgram"]
end
    
    subgraph ComputeLayer["llkv-compute Kernels"]
ARITHMETIC["Arithmetic Kernels\nadd_arrays, multiply_arrays"]
COMPARISON["Comparison Kernels\ngt_arrays, eq_arrays"]
CAST_K["Cast Kernels\ncast_array"]
LOGICAL["Logical Kernels\nand_bitmaps, or_bitmaps"]
STRING["String Kernels\ncontains, starts_with"]
end
    
    subgraph Execution["Execution Context"]
BATCH["Input RecordBatch\nColumn arrays"]
STACK["Evaluation Stack"]
BITMAP["Bitmap Accumulator"]
end
    
    subgraph Results["Evaluation Results"]
SCALAR_OUT["Array of computed\nscalar values"]
FILTER_OUT["Bitmap of matching\nrow IDs"]
end
    
 
   EP --> ARITHMETIC
 
   EP --> COMPARISON
 
   EP --> CAST_K
    
 
   DP --> COMPARISON
 
   DP --> LOGICAL
 
   DP --> STRING
    
 
   ARITHMETIC --> STACK
 
   COMPARISON --> STACK
 
   CAST_K --> STACK
    
 
   LOGICAL --> BITMAP
 
   STRING --> BITMAP
    
 
   BATCH --> STACK
 
   BATCH --> BITMAP
    
 
   STACK --> SCALAR_OUT
 
   BITMAP --> FILTER_OUT

Compilation Performance Considerations

The compilation phase is designed to amortize its cost over repeated evaluations:

ScenarioCompilation Strategy
One-time queryCompile on demand, minimal optimization
Repeated queryCompile once, cache bytecode, reuse across invocations
Prepared statementPre-compile at preparation time, execute many times
Table scan filterCompile predicate once, apply to all batches
AggregationCompile aggregate expressions, evaluate per group

Compilation Cache Strategy

Sources: llkv-expr/src/expr.rs:1-819

Summary

The compilation phase transforms high-level expression ASTs into efficient bytecode programs optimized for vectorized execution. By separating compilation from evaluation, the system achieves:

  • Performance : Bytecode enables efficient stack-based evaluation with Arrow kernels
  • Reusability : Compiled programs can be cached and reused across query invocations
  • Optimization : Multiple optimization passes improve runtime performance
  • Type Safety : Type inference and validation occur during compilation, not evaluation
  • Maintainability : Clear separation between compilation and execution concerns

The compiled EvalProgram and DomainProgram bytecode formats serve as the bridge between query planning and execution, enabling the query engine to efficiently evaluate complex scalar computations and filter predicates over columnar data.

Dismiss

Refresh this wiki

Enter email to refresh