This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Program Compilation
Loading…
Program Compilation
Relevant source files
- .github/workflows/build.docs.yml
- demos/llkv-sql-pong-demo/assets/llkv-sql-pong-screenshot.png
- dev-docs/doc-preview.md
- llkv-expr/src/expr.rs
- llkv-plan/Cargo.toml
- llkv-plan/src/plans.rs
- llkv-sql/Cargo.toml
- llkv-test-utils/Cargo.toml
Purpose and Scope
Program compilation transforms expression ASTs into executable bytecode optimized for evaluation. This intermediate representation enables efficient vectorized operations and simplifies the runtime evaluation engine. The compilation phase bridges the gap between high-level expression trees and low-level execution kernels.
This page covers the compilation of ScalarExpr and filter Expr trees into EvalProgram and DomainProgram bytecode respectively. For information about the expression AST structure, see Expression AST. For how compiled programs are evaluated, see Scalar Evaluation and NumericKernels.
Compilation Pipeline Overview
Sources: llkv-expr/src/expr.rs:1-819
Compilation Targets
The compilation system produces two distinct bytecode formats depending on the expression context:
| Program Type | Input AST | Purpose | Output |
|---|---|---|---|
EvalProgram | ScalarExpr<FieldId> | Compute scalar values per row | Arrow array of computed values |
DomainProgram | Expr<FieldId> | Evaluate boolean predicates | Bitmap of matching row IDs |
EvalProgram Structure
EvalProgram compiles scalar expressions into a stack-based bytecode suitable for vectorized evaluation. Each instruction operates on Arrow arrays, producing intermediate or final results:
Sources: llkv-expr/src/expr.rs:125-307
flowchart TB
subgraph ScalarInput["Scalar Expression Tree"]
ROOT["Binary: Add"]
LEFT["Binary: Multiply"]
RIGHT["Cast: Float64"]
COL1["Column: field_id=1"]
LIT1["Literal: 10"]
COL2["Column: field_id=2"]
end
ROOT --> LEFT
ROOT --> RIGHT
LEFT --> COL1
LEFT --> LIT1
RIGHT --> COL2
subgraph Bytecode["EvalProgram Bytecode"]
I1["LOAD_COLUMN field_id=1"]
I2["LOAD_LITERAL 10"]
I3["MULTIPLY"]
I4["LOAD_COLUMN field_id=2"]
I5["CAST Float64"]
I6["ADD"]
end
I1 --> I2
I2 --> I3
I3 --> I4
I4 --> I5
I5 --> I6
subgraph Execution["Execution"]
STACK["Evaluation Stack\nArray-based operations"]
RESULT["Result Array"]
end
I6 --> STACK
STACK --> RESULT
DomainProgram Structure
DomainProgram compiles filter predicates into bytecode optimized for boolean evaluation and row filtering. The program operates on column metadata and data chunks to identify matching rows:
Sources: llkv-expr/src/expr.rs:14-123 llkv-expr/src/expr.rs:365-428
flowchart TB
subgraph FilterInput["Filter Expression Tree"]
AND["And"]
PRED1["Pred: field_id=1 > 100"]
PRED2["Pred: field_id=2 IN values"]
NOT["Not"]
PRED3["Pred: field_id=3 LIKE pattern"]
end
AND --> PRED1
AND --> NOT
NOT --> PRED2
AND --> PRED3
subgraph DomainBytecode["DomainProgram Bytecode"]
D1["EVAL_PRED field_id=1\nGreaterThan 100"]
D2["EVAL_PRED field_id=2\nIn values"]
D3["NEGATE"]
D4["EVAL_PRED field_id=3\nContains pattern"]
D5["AND_ALL"]
end
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> D5
subgraph BitmapOps["Bitmap Operations"]
B1["Bitmap: field_id=1 matches"]
B2["Bitmap: field_id=2 matches"]
B3["Bitmap: NOT operation"]
B4["Bitmap: field_id=3 matches"]
B5["Bitmap: AND operation"]
FINAL["Final: matching row IDs"]
end
D5 --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B4 --> B5
B5 --> FINAL
Expression Analysis and Type Inference
Before bytecode generation, the compiler analyzes the expression tree to:
- Resolve data types - Each expression node’s output type is inferred from its inputs and operation
- Validate operations - Ensure type compatibility for binary operations and function calls
- Track dependencies - Identify which columns and subqueries the expression requires
- Detect constant subexpressions - Find opportunities for constant folding
Sources: llkv-expr/src/expr.rs:125-182 llkv-expr/src/expr.rs:309-363
Compilation Optimizations
The compilation phase applies several optimization passes to improve evaluation performance:
Constant Folding
Expressions involving only literal values are evaluated at compile time:
| Original Expression | Optimized Form |
|---|---|
Literal(2) + Literal(3) | Literal(5) |
Literal(10) * Literal(0) | Literal(0) |
Cast(Literal("123"), Int64) | Literal(123) |
Expression Simplification
Algebraic identities and boolean logic simplifications reduce instruction count:
| Pattern | Simplified To |
|---|---|
x + 0 | x |
x * 1 | x |
x * 0 | 0 |
And([true, expr]) | expr |
Or([false, expr]) | expr |
Not(Not(expr)) | expr |
Dead Code Elimination
Unreachable code paths in Case expressions are removed:
Sources: llkv-expr/src/expr.rs:169-176
Bytecode Generation
After optimization, the compiler generates bytecode instructions by walking the expression tree in post-order (depth-first). Each expression variant maps to one or more bytecode instructions:
Scalar Expression Instruction Mapping
| Expression Variant | Generated Instructions |
|---|---|
Column(field_id) | LOAD_COLUMN field_id |
Literal(value) | LOAD_LITERAL value |
Binary{left, op, right} | Compile left → Compile right → BINARY_OP op |
Cast{expr, data_type} | Compile expr → CAST data_type |
Compare{left, op, right} | Compile left → Compile right → COMPARE op |
Coalesce(exprs) | Compile all exprs → COALESCE count |
Case{operand, branches, else} | Complex multi-instruction sequence with jump tables |
Random | RANDOM_F64 |
Sources: llkv-expr/src/expr.rs:125-182
Filter Expression Instruction Mapping
| Expression Variant | Generated Instructions |
|---|---|
Pred(Filter{field_id, op}) | EVAL_PREDICATE field_id op |
And(exprs) | Compile all exprs → AND_ALL count |
Or(exprs) | Compile all exprs → OR_ALL count |
Not(expr) | Compile expr → NEGATE |
Compare{left, right, op} | Compile as scalar → TO_BOOLEAN |
InList{expr, list, negated} | Compile expr → Build lookup table → IN_SET negated |
IsNull{expr, negated} | Compile expr → IS_NULL negated |
Literal(bool) | LOAD_BOOL value |
Exists(subquery) | EVAL_SUBQUERY subquery_id |
Sources: llkv-expr/src/expr.rs:14-56
Subquery Handling
Correlated subqueries require special compilation handling. The compiler generates placeholder references that are resolved during execution:
Sources: llkv-expr/src/expr.rs:45-66
flowchart TB
subgraph Outer["Outer Query Expression"]
FILTER["Filter with EXISTS"]
SUBQUERY["Subquery: SubqueryId=1\nCorrelated columns"]
end
subgraph Compilation["Compilation Strategy"]
PLACEHOLDER["Generate EVAL_SUBQUERY\ninstruction with ID"]
CORRELATION["Track correlated\ncolumn mappings"]
DEFER["Defer actual subquery\ncompilation to executor"]
end
subgraph Execution["Runtime Resolution"]
BIND["Bind outer row values\nto correlated placeholders"]
EVAL_SUB["Execute subquery plan\nwith bound values"]
COLLECT["Collect subquery\nresult set"]
BOOLEAN["Convert to boolean\nfor EXISTS/IN"]
end
FILTER --> SUBQUERY
SUBQUERY --> PLACEHOLDER
PLACEHOLDER --> CORRELATION
CORRELATION --> DEFER
DEFER --> BIND
BIND --> EVAL_SUB
EVAL_SUB --> COLLECT
COLLECT --> BOOLEAN
Aggregate Function Compilation
Aggregate expressions within scalar contexts (e.g., COUNT(*) + 1) compile to instructions that reference pre-computed aggregate results:
Sources: llkv-expr/src/expr.rs:184-215
flowchart LR
subgraph AggExpr["Aggregate Expression"]
AGG_CALL["AggregateCall\nCountStar / Sum / Avg"]
end
subgraph Compilation["Compilation"]
REF["Generate AGG_REFERENCE\ninstruction"]
METADATA["Store aggregate metadata\nfunction type, distinct flag"]
end
subgraph PreExecution["Pre-Execution Phase"]
COMPUTE["Executor computes\naggregate values"]
STORE["Store in aggregate\nresult table"]
end
subgraph Evaluation["Expression Evaluation"]
LOOKUP["AGG_REFERENCE instruction\nlooks up pre-computed value"]
BROADCAST["Broadcast scalar result\nto array length"]
CONTINUE["Continue with remaining\nexpression operations"]
end
AGG_CALL --> REF
REF --> METADATA
METADATA --> COMPUTE
COMPUTE --> STORE
STORE --> LOOKUP
LOOKUP --> BROADCAST
BROADCAST --> CONTINUE
Integration with Execution Layer
Compiled programs are executed by the compute kernels layer, which provides vectorized implementations of each bytecode instruction:
Sources: llkv-expr/src/expr.rs:1-819
flowchart TB
subgraph Programs["Compiled Programs"]
EP["EvalProgram"]
DP["DomainProgram"]
end
subgraph ComputeLayer["llkv-compute Kernels"]
ARITHMETIC["Arithmetic Kernels\nadd_arrays, multiply_arrays"]
COMPARISON["Comparison Kernels\ngt_arrays, eq_arrays"]
CAST_K["Cast Kernels\ncast_array"]
LOGICAL["Logical Kernels\nand_bitmaps, or_bitmaps"]
STRING["String Kernels\ncontains, starts_with"]
end
subgraph Execution["Execution Context"]
BATCH["Input RecordBatch\nColumn arrays"]
STACK["Evaluation Stack"]
BITMAP["Bitmap Accumulator"]
end
subgraph Results["Evaluation Results"]
SCALAR_OUT["Array of computed\nscalar values"]
FILTER_OUT["Bitmap of matching\nrow IDs"]
end
EP --> ARITHMETIC
EP --> COMPARISON
EP --> CAST_K
DP --> COMPARISON
DP --> LOGICAL
DP --> STRING
ARITHMETIC --> STACK
COMPARISON --> STACK
CAST_K --> STACK
LOGICAL --> BITMAP
STRING --> BITMAP
BATCH --> STACK
BATCH --> BITMAP
STACK --> SCALAR_OUT
BITMAP --> FILTER_OUT
Compilation Performance Considerations
The compilation phase is designed to amortize its cost over repeated evaluations:
| Scenario | Compilation Strategy |
|---|---|
| One-time query | Compile on demand, minimal optimization |
| Repeated query | Compile once, cache bytecode, reuse across invocations |
| Prepared statement | Pre-compile at preparation time, execute many times |
| Table scan filter | Compile predicate once, apply to all batches |
| Aggregation | Compile aggregate expressions, evaluate per group |
Compilation Cache Strategy
Sources: llkv-expr/src/expr.rs:1-819
Summary
The compilation phase transforms high-level expression ASTs into efficient bytecode programs optimized for vectorized execution. By separating compilation from evaluation, the system achieves:
- Performance : Bytecode enables efficient stack-based evaluation with Arrow kernels
- Reusability : Compiled programs can be cached and reused across query invocations
- Optimization : Multiple optimization passes improve runtime performance
- Type Safety : Type inference and validation occur during compilation, not evaluation
- Maintainability : Clear separation between compilation and execution concerns
The compiled EvalProgram and DomainProgram bytecode formats serve as the bridge between query planning and execution, enabling the query engine to efficiently evaluate complex scalar computations and filter predicates over columnar data.
Dismiss
Refresh this wiki
Enter email to refresh