This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Filter Evaluation
Relevant source files
- llkv-executor/src/translation/expression.rs
- llkv-executor/src/translation/schema.rs
- llkv-expr/src/expr.rs
- llkv-table/src/planner/mod.rs
- llkv-table/src/planner/program.rs
- llkv-table/src/scalar_eval.rs
Purpose and Scope
This page explains how filter expressions (WHERE clause predicates) are evaluated against table rows during query execution. This includes the compilation of filter expressions into efficient bytecode programs, stack-based evaluation mechanisms, integration with MVCC visibility filtering, and various optimization strategies.
For information about how expressions are initially structured and planned, see Expression System and Query Planning. For details about the broader scan execution context, see Scan Execution and Optimization.
Filter Expression Pipeline
Filter evaluation follows a multi-stage pipeline that transforms SQL predicates into efficient executable programs:
graph LR
    SQL["SQL WHERE Clause"] --> Parser["sqlparser AST"]
    Parser --> ExprString["Expr<String>\nField names"]
    ExprString --> Translation["Field Resolution\nCatalog Lookup"]
    Translation --> ExprFieldId["Expr<FieldId>\nResolved fields"]
    ExprFieldId --> Normalize["normalize_predicate\nDe Morgan's Laws\nFlatten AND/OR"]
    Normalize --> Compiler["ProgramCompiler"]
    Compiler --> EvalProg["EvalProgram\nStack Bytecode"]
    Compiler --> DomainProg["DomainProgram\nRow ID Selection"]
    EvalProg --> Evaluation["Row Evaluation"]
    DomainProg --> Evaluation
    Evaluation --> MVCCFilter["MVCC Filtering"]
    MVCCFilter --> Results["Filtered Results"]
Sources: llkv-table/src/planner/mod.rs:595-636 llkv-table/src/planner/program.rs:256-284 llkv-executor/src/translation/expression.rs:18-174
Program Compilation
Normalization
Before compilation, filter expressions are normalized into canonical form using the normalize_predicate function. This transformation ensures consistent structure for optimization and evaluation.
Normalization rules:
- Flatten nested conjunctions/disjunctions: AND(AND(a,b),c) → AND(a,b,c)
- Apply De Morgan's laws: push NOT operators down through logical connectives
- Eliminate double negations: NOT(NOT(expr)) → expr
- Simplify literal booleans: NOT(true) → false
The normalization process uses an iterative approach to handle deeply nested expressions without stack overflow. The transformation is applied recursively, with special handling for negated conjunctions and disjunctions.
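The sketch below illustrates the De Morgan step in isolation, assuming a simplified expression enum; the actual normalize_negated in llkv-table/src/planner/program.rs also handles flattening and literal simplification as part of the larger normalization pass.

```rust
// Minimal sketch of the De Morgan step. `Expr` is a simplified stand-in for
// the expression type in llkv-expr, not the real definition.
enum Expr {
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
    Literal(bool),
    Pred(&'static str), // placeholder leaf predicate
}

// Rewrites NOT(inner) into an equivalent expression with the negation
// pushed one level down.
fn normalize_negated(inner: Expr) -> Expr {
    match inner {
        // NOT(AND(a, b, ...)) => OR(NOT(a), NOT(b), ...)
        Expr::And(children) => Expr::Or(children.into_iter().map(normalize_negated).collect()),
        // NOT(OR(a, b, ...)) => AND(NOT(a), NOT(b), ...)
        Expr::Or(children) => Expr::And(children.into_iter().map(normalize_negated).collect()),
        // NOT(NOT(e)) => e
        Expr::Not(e) => *e,
        // NOT(true) => false, NOT(false) => true
        Expr::Literal(b) => Expr::Literal(!b),
        // A plain predicate stays wrapped in NOT
        leaf => Expr::Not(Box::new(leaf)),
    }
}
```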
Key normalization functions:
| Function | Purpose |
|---|---|
| normalize_predicate | Entry point for expression normalization |
| normalize_expr | Flattens AND/OR and delegates to normalize_negated |
| normalize_negated | Applies De Morgan's laws and simplifies negations |
Sources: llkv-table/src/planner/program.rs:286-343
Bytecode Generation
The ProgramCompiler translates normalized expressions into two complementary program representations:
graph TB
    subgraph "Compilation"
        Expr["Normalized Expr<FieldId>"]
        Compiler["ProgramCompiler"]
        Expr --> Compiler
    end
    subgraph "Programs"
        EvalProg["EvalProgram\nStack-based bytecode\nfor predicate evaluation"]
        DomainProg["DomainProgram\nRow ID domain\ncalculation"]
        Compiler --> EvalProg
        Compiler --> DomainProg
    end
    subgraph "Evaluation"
        EvalProg --> ResultStack["Result Stack\nbool values"]
        DomainProg --> RowIDs["Row ID Sets\ncandidate rows"]
    end
EvalProgram operations:
| Operation | Stack Effect | Purpose |
|---|---|---|
| PushPredicate | → bool | Evaluate single predicate |
| PushCompare | → bool | Evaluate scalar comparison |
| PushInList | → bool | Evaluate IN list membership |
| PushIsNull | → bool | Evaluate NULL test |
| PushLiteral | → bool | Push constant boolean |
| FusedAnd | → bool | Evaluate fused predicates on same field |
| And | bool×N → bool | Logical AND of N values |
| Or | bool×N → bool | Logical OR of N values |
| Not | bool → bool | Logical negation (uses domain program) |
DomainProgram operations:
| Operation | Purpose |
|---|---|
| PushFieldAll | Include all rows for a field |
| PushCompareDomain | Rows where scalar comparison fields exist |
| PushInListDomain | Rows where IN list fields exist |
| PushIsNullDomain | Rows where NULL test fields exist |
| Union | Combine row sets (OR semantics) |
| Intersect | Intersect row sets (AND semantics) |
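To make the two program shapes concrete, here is a hedged sketch of how the operation tables above could be modeled as Rust enums. The variant names mirror the tables, but the payloads are placeholders rather than the actual definitions in llkv-table/src/planner/program.rs.

```rust
// Illustrative shapes only; payloads are placeholders, not the real definitions.
type FieldId = u32;

enum EvalOp {
    PushPredicate(FieldId),     // evaluate a single predicate on a field
    PushCompare(FieldId),       // evaluate a scalar comparison
    PushInList(FieldId),        // evaluate IN-list membership
    PushIsNull(FieldId),        // evaluate a NULL test
    PushLiteral(bool),          // push a constant boolean
    FusedAnd(FieldId),          // fused predicates on the same field
    And(usize),                 // pop N booleans, push their AND
    Or(usize),                  // pop N booleans, push their OR
    Not,                        // negate the top of the stack (uses the domain program)
}

enum DomainOp {
    PushFieldAll(FieldId),      // all rows where the field exists
    PushCompareDomain(FieldId), // rows where comparison fields exist
    PushInListDomain(FieldId),  // rows where IN-list fields exist
    PushIsNullDomain(FieldId),  // rows where NULL-test fields exist
    Union(usize),               // OR semantics over N row-id sets
    Intersect(usize),           // AND semantics over N row-id sets
}

// Conceptually, `age >= 18 AND status = 'active'` compiles to something like:
//   [PushPredicate(age), PushPredicate(status), And(2)]
```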
Sources: llkv-table/src/planner/program.rs:22-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:544-631
Predicate Fusion
Predicate fusion is an optimization that recognizes multiple predicates on the same field within an AND expression and evaluates them together. This reduces per-predicate overhead and enables more efficient filtering.
Fusion conditions (from PredicateFusionCache):
| Data Type | Fusion Threshold |
|---|---|
| String types | ≥2 total predicates AND ≥1 Contains operator |
| Other types | ≥2 total predicates on same field |
Example transformation:
age >= 18 AND age < 65 AND age != 30
→ FusedAnd(field_id: age, filters: [>=18, <65, !=30])
The fusion cache tracks predicate patterns during compilation:
- Counts total predicates per field
- Tracks specific operator types (e.g., Contains for strings)
- Decides whether fusion is beneficial via should_fuse, as in the sketch below
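A hedged sketch of that bookkeeping, with invented struct and field names; the real PredicateFusionCache lives in llkv-table/src/planner/mod.rs and works against the planner's own types.

```rust
use std::collections::HashMap;

// Per-field statistics the fusion cache tracks. Field names are illustrative.
#[derive(Default)]
struct FieldStats {
    total_predicates: usize,
    contains_ops: usize,
}

#[derive(Default)]
struct PredicateFusionCache {
    per_field: HashMap<u32, FieldStats>, // keyed by FieldId
}

impl PredicateFusionCache {
    fn record_predicate(&mut self, field_id: u32, is_contains: bool) {
        let stats = self.per_field.entry(field_id).or_default();
        stats.total_predicates += 1;
        if is_contains {
            stats.contains_ops += 1;
        }
    }

    // Mirrors the thresholds in the table above: string fields need at least
    // one Contains plus a second predicate; other types need two predicates.
    fn should_fuse(&self, field_id: u32, is_string: bool) -> bool {
        match self.per_field.get(&field_id) {
            Some(s) if is_string => s.total_predicates >= 2 && s.contains_ops >= 1,
            Some(s) => s.total_predicates >= 2,
            None => false,
        }
    }
}
```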
Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/program.rs:518-542
Row-Level Evaluation
Stack-Based Evaluation Engine
Filter evaluation uses a stack-based virtual machine that processes EvalProgram bytecode. Each operation manipulates a boolean result stack.
graph LR
subgraph "Evaluation Loop"
Ops["EvalOp Instructions"]
Stack["Result Stack\nVec<bool>"]
Ops -->|Process| Stack
Stack -->|Final Value| Result["Filter Decision"]
end
subgraph "Example: age >= 18 AND status = 'active'"
Op1["PushPredicate(age >= 18)"] -->|Stack: [true]| Op2["PushPredicate(status = 'active')"]
Op2 -->|Stack: [true, false]| Op3["And(2)"]
Op3 -->|Stack: [false]| Final["Result: false"]
end
The evaluation process iterates through EvalOp instructions, pushing boolean results and combining them according to logical operators. Each predicate evaluation consults the underlying storage to check field values against filter conditions.
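The loop below is a minimal sketch of that stack machine, using a pared-down op set and precomputed leaf results in place of real storage lookups; the actual evaluator in llkv-table/src/planner/mod.rs checks each predicate against stored column values and consults the domain program for NOT semantics.

```rust
// Minimal sketch of the stack machine. `Op` is a pared-down stand-in for
// EvalOp: leaf predicates are indices into precomputed per-row results.
enum Op {
    PushLiteral(bool),
    PushPredicate(usize), // index of a leaf predicate result
    And(usize),           // pop N booleans, push their conjunction
    Or(usize),            // pop N booleans, push their disjunction
    Not,                  // pop one boolean, push its negation
}

fn evaluate_row(program: &[Op], leaf_results: &[bool]) -> bool {
    let mut stack: Vec<bool> = Vec::new();
    for op in program {
        match op {
            Op::PushLiteral(b) => stack.push(*b),
            Op::PushPredicate(i) => stack.push(leaf_results[*i]),
            Op::And(n) => {
                let start = stack.len() - n;
                let conj = stack.drain(start..).all(|b| b);
                stack.push(conj);
            }
            Op::Or(n) => {
                let start = stack.len() - n;
                let disj = stack.drain(start..).any(|b| b);
                stack.push(disj);
            }
            Op::Not => {
                let v = stack.pop().expect("operand for Not");
                stack.push(!v);
            }
        }
    }
    stack.pop().unwrap_or(false)
}

// Mirrors the diagram: age >= 18 is true, status = 'active' is false.
// evaluate_row(&[Op::PushPredicate(0), Op::PushPredicate(1), Op::And(2)],
//              &[true, false]) == false
```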
Sources: llkv-table/src/planner/mod.rs:1009-1031
Predicate Evaluation
Individual predicates are evaluated by comparing field values against filter operators. The evaluation strategy depends on the data type.
Type-specific evaluation paths:
graph TB
    Predicate["Filter<FieldId>\nfield_id + operator"]
    Predicate --> TypeCheck{"Data Type?"}
    TypeCheck -->|Fixed-width| FixedPath["build_fixed_width_predicate\nInt, Float, Date, etc."]
    TypeCheck -->|Variable-width| VarPath["build_var_width_predicate\nString types"]
    TypeCheck -->|Boolean| BoolPath["build_bool_predicate\nBool type"]
    FixedPath --> Native["Native comparison\nusing PredicateValue"]
    VarPath --> Pattern["Pattern matching\nStartsWith, Contains, etc."]
    BoolPath --> Boolean["Boolean logic"]
    Native --> Result["bool"]
    Pattern --> Result
    Boolean --> Result
Operator semantics:
| Operator | Description | NULL Handling |
|---|---|---|
| Equals | Exact match | NULL = NULL → NULL |
| Range | Bounded interval (inclusive/exclusive) | NULL in range → NULL |
| In | Set membership | NULL in [values] → NULL |
| StartsWith | String prefix match (case-sensitive/insensitive) | NULL starts with X → NULL |
| EndsWith | String suffix match | NULL ends with X → NULL |
| Contains | String substring match | NULL contains X → NULL |
| IsNull | NULL test | Returns true/false |
| IsNotNull | NOT NULL test | Returns true/false |
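The NULL column follows SQL's three-valued logic: comparing anything with NULL yields NULL (unknown), and a WHERE clause ultimately treats unknown as not-matching. A minimal sketch of that behavior, using Option<bool> as the three truth values (this is an illustration, not the representation used inside llkv):

```rust
// Three-valued logic sketch: None represents SQL NULL / unknown.
fn eval_equals(lhs: Option<i64>, rhs: Option<i64>) -> Option<bool> {
    match (lhs, rhs) {
        (Some(a), Some(b)) => Some(a == b),
        _ => None, // NULL = anything (including NULL) is NULL
    }
}

// IsNull / IsNotNull always yield a definite boolean.
fn eval_is_null(value: Option<i64>) -> Option<bool> {
    Some(value.is_none())
}

// A WHERE clause keeps a row only when the predicate is definitely true.
fn row_passes(result: Option<bool>) -> bool {
    result == Some(true)
}
```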
Sources: llkv-expr/src/expr.rs:295-358 llkv-expr/src/typed_predicate.rs:1-500 (referenced but not shown)
Scalar Expression Evaluation
For computed columns and complex predicates (e.g., WHERE salary * 1.1 > 50000), scalar expressions are evaluated using the NumericKernels utility.
Evaluation modes:
graph TB
    ScalarExpr["ScalarExpr<FieldId>"]
    ScalarExpr --> Mode{"Evaluation Mode"}
    Mode -->|Single Row| RowLevel["NumericKernels::evaluate_value\nRecursive evaluation\nReturns Option<NumericValue>"]
    Mode -->|Batch| BatchLevel["NumericKernels::evaluate_batch\nVectorized when possible"]
    BatchLevel --> Vectorized{"Vectorizable?"}
    Vectorized -->|Yes| Vec["try_evaluate_vectorized\nArrow compute kernels"]
    Vectorized -->|No| Fallback["Per-row evaluation\nLoop + evaluate_value"]
    Vec --> Array["ArrayRef result"]
    Fallback --> Array
Numeric type handling:
The NumericArray wrapper provides unified access to different numeric types:
- Integer: Int64Array for integers, booleans, dates
- Float: Float64Array for floating-point numbers
- Decimal: Decimal128Array for precise decimal values
Type coercion occurs automatically during expression evaluation:
- Mixed integer/float operations promote to float
- String-to-numeric conversions follow SQLite semantics (invalid → 0)
- NULL propagates through operations
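A hedged sketch of those coercion rules, using plain Rust types rather than the NumericArray wrapper; NumericValue here is an illustrative stand-in, not the actual type in llkv-table/src/scalar_eval.rs.

```rust
// Illustrative numeric value; the decimal variant is omitted for brevity.
#[derive(Clone, Copy, Debug, PartialEq)]
enum NumericValue {
    Int(i64),
    Float(f64),
}

// Mixed integer/float arithmetic promotes to float; NULL (None) propagates.
fn add(lhs: Option<NumericValue>, rhs: Option<NumericValue>) -> Option<NumericValue> {
    use NumericValue::*;
    match (lhs?, rhs?) {
        (Int(a), Int(b)) => Some(Int(a + b)),
        (Int(a), Float(b)) => Some(Float(a as f64 + b)),
        (Float(a), Int(b)) => Some(Float(a + b as f64)),
        (Float(a), Float(b)) => Some(Float(a + b)),
    }
}

// Simplified SQLite-style string coercion: a value that parses as an integer
// is used as-is, anything else becomes 0. (SQLite also accepts numeric
// prefixes and floating-point literals, which this sketch omits.)
fn coerce_str(s: &str) -> NumericValue {
    s.trim()
        .parse::<i64>()
        .map(NumericValue::Int)
        .unwrap_or(NumericValue::Int(0))
}
```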
Key evaluation functions:
| Function | Purpose | Performance |
|---|---|---|
| evaluate_value | Single-row evaluation | Used for non-vectorizable expressions |
| evaluate_batch | Batch evaluation | Attempts vectorization first |
| try_evaluate_vectorized | Vectorized computation | Uses Arrow compute kernels |
| prepare_numeric_arrays | Type coercion | Converts columns to numeric representation |
Sources: llkv-table/src/scalar_eval.rs:453-713 llkv-table/src/scalar_eval.rs:92-383
MVCC Integration
Filter evaluation integrates with MVCC visibility filtering to ensure queries only see rows visible to their transaction. Filtering proceeds in stages:
graph TB
    subgraph "Filter Stages"
        Scan["Table Scan"]
        Scan --> Domain["1. Domain Program\nDetermines candidate\nrow IDs"]
        Domain --> UserPred["2. User Predicates\nSemantic filtering\nvia EvalProgram"]
        UserPred --> MVCCFilter["3. MVCC Filtering\nrow_id_filter.filter()\nVisibility rules"]
        MVCCFilter --> Results["Visible Results"]
    end
    subgraph "MVCC Visibility"
        RowID["Row ID"]
        CreatedBy["created_by TxnId"]
        DeletedBy["deleted_by Option<TxnId>"]
        Snapshot["Transaction Snapshot"]
        RowID --> Check{"Visibility Check"}
        CreatedBy --> Check
        DeletedBy --> Check
        Snapshot --> Check
        Check -->|Created before snapshot Not deleted or deleted after snapshot| Visible["Include"]
        Check -->|Otherwise| Invisible["Exclude"]
    end
MVCC filtering implementation:
The row_id_filter option in ScanStreamOptions provides transaction-aware filtering:
- Created by runtime's transaction manager
- Encapsulates snapshot visibility rules
- Applied after user predicate evaluation
- Filters row IDs based on created_by and deleted_by transaction IDs
Filtering order rationale:
1. Domain programs - Quickly eliminate rows where referenced fields don't exist
2. User predicates - Evaluate semantic conditions (WHERE clause)
3. MVCC filter - Apply transaction visibility rules
This ordering minimizes MVCC overhead by only checking visibility for rows that pass semantic filters.
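A hedged sketch of the visibility rule implied by the diagram above; the type and field names (TxnId, the snapshot representation) are illustrative rather than the actual llkv runtime types.

```rust
// Illustrative MVCC metadata; the real snapshot and row_id_filter live in the
// llkv runtime's transaction manager.
type TxnId = u64;

struct RowVersion {
    created_by: TxnId,
    deleted_by: Option<TxnId>,
}

// A row is visible when it was created by a transaction the snapshot can see
// and is either not deleted or was deleted after the snapshot was taken.
fn is_visible(row: &RowVersion, snapshot_txn: TxnId) -> bool {
    let created_visible = row.created_by <= snapshot_txn;
    let not_deleted = match row.deleted_by {
        None => true,
        Some(deleted) => deleted > snapshot_txn,
    };
    created_visible && not_deleted
}
```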
Sources: llkv-table/src/planner/mod.rs:940-982 llkv-table/src/table.rs:200-300 (type definitions referenced but not shown)
Evaluation Optimizations
Single-Column Direct Scan Fast Path
The planner provides a specialized fast path for queries that project a single column with simple filtering, bypassing the general evaluation machinery for better performance.
Conditions for fast path:
- Single column projection
- Filter references only that column
- Simple operator (no complex scalar expressions)
When activated, the scan directly streams the target column's values without materializing intermediate structures.
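A hedged sketch of the eligibility check those conditions suggest; the real try_single_column_direct_scan in llkv-table/src/planner/mod.rs inspects the planner's actual projection and filter structures, and the summary struct here is invented for illustration.

```rust
// Hypothetical plan summary used only for this sketch.
struct ScanPlanSummary {
    projected_fields: Vec<u32>,     // FieldIds in the SELECT list
    filter_fields: Vec<u32>,        // FieldIds referenced by the WHERE clause
    filter_uses_scalar_exprs: bool, // e.g. WHERE salary * 1.1 > 50000
}

fn qualifies_for_single_column_direct_scan(plan: &ScanPlanSummary) -> bool {
    plan.projected_fields.len() == 1
        && plan
            .filter_fields
            .iter()
            .all(|f| *f == plan.projected_fields[0])
        && !plan.filter_uses_scalar_exprs
}
```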
Sources: llkv-table/src/planner/mod.rs:1020-1031 (method name: try_single_column_direct_scan)
Full Table Scan Streaming
When no predicates require evaluation (e.g., WHERE true or full scan), the executor uses streaming row ID enumeration:
graph LR
    Shadow["Shadow Column\nrow_id metadata"]
    Shadow -->|Exists?| FastEnum["Fast Path\nstream_table_row_ids"]
    Shadow -->|Missing| Fallback["Fallback Path\nMulti-column scan\n+ deduplication"]
    FastEnum --> Chunks["Row ID Chunks\nSTREAM_BATCH_ROWS"]
    Fallback --> Chunks
    Chunks --> MVCCCheck["Apply MVCC Filter"]
    MVCCCheck --> Gather["Gather Columns"]
    Gather --> Batch["RecordBatch"]
The fast path attempts to use a shadow column (row_id) that stores all row IDs for a table:
- Success case: Shadow column exists → stream chunks directly
- Fallback case: Shadow column missing → scan user columns and deduplicate
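A sketch of that decision under simplified types; only the shadow row_id column and stream_table_row_ids are named in the text and diagram above, and everything else here is hypothetical.

```rust
// Simplified stand-in for a table's row-id metadata.
struct TableSketch {
    shadow_row_ids: Option<Vec<u64>>,   // stand-in for the row_id shadow column
    user_column_row_ids: Vec<Vec<u64>>, // row ids referenced by each user column
}

fn enumerate_row_ids(table: &TableSketch) -> Vec<u64> {
    match &table.shadow_row_ids {
        // Fast path: the shadow column already stores every row id and can be
        // streamed in STREAM_BATCH_ROWS-sized chunks.
        Some(ids) => ids.clone(),
        // Fallback: scan user columns and deduplicate the row ids they cover.
        None => {
            let mut ids: Vec<u64> = table.user_column_row_ids.concat();
            ids.sort_unstable();
            ids.dedup();
            ids
        }
    }
}
```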
Sources: llkv-table/src/planner/mod.rs:739-857 llkv-table/src/planner/mod.rs:859-902 llkv-table/src/planner/mod.rs:904-999
Predicate Fusion Cache
The PredicateFusionCache tracks predicate patterns during compilation to enable fusion optimization:
graph TB
    Expression["WHERE clause\nexpression tree"]
    Expression --> Traverse["Traverse expression\nrecord_expr"]
    Traverse --> Track["Track per-field stats:\n- Total predicate count\n- Contains operator count"]
    Track --> Decide["should_fuse decision"]
    Decide -->|String + Contains ≥1| Fuse1["Enable fusion"]
    Decide -->|Any type + predicates ≥2| Fuse2["Enable fusion"]
    Decide -->|Otherwise| NoFuse["No fusion"]
Fusion benefits:
- Reduces function call overhead
- Enables specialized evaluation routines
- Improves cache locality by processing same field
Fusion conditions table:
| Field Data Type | Conditions for Fusion |
|---|---|
| Utf8 / LargeUtf8 | Total predicates ≥ 2 AND Contains operations ≥ 1 |
| Other types | Total predicates ≥ 2 on same field |
Sources: llkv-table/src/planner/mod.rs:517-570
Vectorized Expression Evaluation
For numeric operations, the system attempts vectorized evaluation to process entire batches at once.
Vectorization strategy:
graph TB
    Expr["ScalarExpr batch"]
    Expr --> Check{"Vectorizable?"}
    Check -->|Yes| Patterns["Supported patterns:\n- Column references\n- Literal constants\n- Binary ops\n- Scalar×Array ops\n- Array×Array ops"]
    Patterns --> ArrowCompute["Arrow compute kernels\nSIMD-optimized"]
    Check -->|No| PerRow["Per-row evaluation\nevaluate_value loop"]
    ArrowCompute --> Result["ArrayRef"]
    PerRow --> Result
Vectorizable expression patterns:
- Pure column references
- Literal constants
- Binary operations: Array ⊕ Array, Array ⊕ Scalar, Scalar ⊕ Array
- Simple casts between numeric types
Non-vectorizable expressions:
- CASE expressions with complex branches
- Date/interval arithmetic
- Aggregate functions
- Subqueries
The vectorization attempt happens in try_evaluate_vectorized, which recursively checks if all sub-expressions can be vectorized. If any sub-expression is non-vectorizable, the entire expression falls back to row-by-row evaluation.
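A hedged sketch of that recursive check over a simplified expression shape; the real try_evaluate_vectorized in llkv-table/src/scalar_eval.rs works against ScalarExpr<FieldId> and Arrow arrays rather than this illustrative enum.

```rust
// Simplified scalar expression shape for illustration only.
enum Scalar {
    Column(u32),
    Literal(f64),
    Binary { op: char, lhs: Box<Scalar>, rhs: Box<Scalar> },
    Case { /* branches elided */ },
}

// Recursive check: an expression is vectorizable only if every sub-expression
// is one of the supported patterns (columns, literals, binary arithmetic).
// Anything else forces the whole expression onto the per-row path.
fn is_vectorizable(expr: &Scalar) -> bool {
    match expr {
        Scalar::Column(_) | Scalar::Literal(_) => true,
        Scalar::Binary { lhs, rhs, .. } => is_vectorizable(lhs) && is_vectorizable(rhs),
        Scalar::Case { .. } => false, // complex branches fall back to evaluate_value
    }
}
```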
Sources: llkv-table/src/scalar_eval.rs:714-763 llkv-table/src/scalar_eval.rs:676-713