Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Filter Evaluation

Relevant source files

Purpose and Scope

This page explains how filter expressions (WHERE clause predicates) are evaluated against table rows during query execution. This includes the compilation of filter expressions into efficient bytecode programs, stack-based evaluation mechanisms, integration with MVCC visibility filtering, and various optimization strategies.

For information about how expressions are initially structured and planned, see Expression System and Query Planning. For details about the broader scan execution context, see Scan Execution and Optimization.


Filter Expression Pipeline

Filter evaluation follows a multi-stage pipeline that transforms SQL predicates into efficient executable programs:

Sources: llkv-table/src/planner/mod.rs:595-636 llkv-table/src/planner/program.rs:256-284 llkv-executor/src/translation/expression.rs:18-174

graph LR
 
   SQL["SQL WHERE Clause"] --> Parser["sqlparser AST"]
Parser --> ExprString["Expr<String>\nField names"]
ExprString --> Translation["Field Resolution\nCatalog Lookup"]
Translation --> ExprFieldId["Expr<FieldId>\nResolved fields"]
ExprFieldId --> Normalize["normalize_predicate\nDe Morgan's Laws\nFlatten AND/OR"]
Normalize --> Compiler["ProgramCompiler"]
Compiler --> EvalProg["EvalProgram\nStack Bytecode"]
Compiler --> DomainProg["DomainProgram\nRow ID Selection"]
EvalProg --> Evaluation["Row Evaluation"]
DomainProg --> Evaluation
 
   Evaluation --> MVCCFilter["MVCC Filtering"]
MVCCFilter --> Results["Filtered Results"]

Program Compilation

Normalization

Before compilation, filter expressions are normalized into canonical form using the normalize_predicate function. This transformation ensures consistent structure for optimization and evaluation.

Normalization rules:

  • Flatten nested conjunctions/disjunctions: AND(AND(a,b),c)AND(a,b,c)
  • Apply De Morgan's laws: Push NOT operators down through logical connectives
  • Eliminate double negations: NOT(NOT(expr))expr
  • Simplify literal booleans: NOT(true)false

The normalization process uses an iterative approach to handle deeply nested expressions without stack overflow. The transformation is applied recursively, with special handling for negated conjunctions and disjunctions.

Key normalization functions:

FunctionPurpose
normalize_predicateEntry point for expression normalization
normalize_exprFlattens AND/OR and delegates to normalize_negated
normalize_negatedApplies De Morgan's laws and simplifies negations

Sources: llkv-table/src/planner/program.rs:286-343

Bytecode Generation

The ProgramCompiler translates normalized expressions into two complementary program representations:

EvalProgram operations:

graph TB
    subgraph "Compilation"
        Expr["Normalized Expr<FieldId>"]
Compiler["ProgramCompiler"]
Expr --> Compiler
    end
    
    subgraph "Programs"
        EvalProg["EvalProgram\nStack-based bytecode\nfor predicate evaluation"]
DomainProg["DomainProgram\nRow ID domain\ncalculation"]
Compiler --> EvalProg
 
       Compiler --> DomainProg
    end
    
    subgraph "Evaluation"
 
       EvalProg --> ResultStack["Result Stack\nbool values"]
DomainProg --> RowIDs["Row ID Sets\ncandidate rows"]
end
OperationStack EffectPurpose
PushPredicate→ boolEvaluate single predicate
PushCompare→ boolEvaluate scalar comparison
PushInList→ boolEvaluate IN list membership
PushIsNull→ boolEvaluate NULL test
PushLiteral→ boolPush constant boolean
FusedAnd→ boolEvaluate fused predicates on same field
Andbool×N → boolLogical AND of N values
Orbool×N → boolLogical OR of N values
Notbool → boolLogical negation (uses domain program)

DomainProgram operations:

OperationPurpose
PushFieldAllInclude all rows for a field
PushCompareDomainRows where scalar comparison fields exist
PushInListDomainRows where IN list fields exist
PushIsNullDomainRows where NULL test fields exist
UnionCombine row sets (OR semantics)
IntersectIntersect row sets (AND semantics)

Sources: llkv-table/src/planner/program.rs:22-284 llkv-table/src/planner/program.rs:416-516 llkv-table/src/planner/program.rs:544-631

Predicate Fusion

An optimization that recognizes multiple predicates on the same field within an AND expression and evaluates them together. This reduces overhead and enables more efficient filtering.

Fusion conditions (fromPredicateFusionCache):

Data TypeFusion Threshold
String types≥2 total predicates AND ≥1 Contains operator
Other types≥2 total predicates on same field

Example transformation:

age >= 18 AND age < 65 AND age != 30
→ FusedAnd(field_id: age, filters: [>=18, <65, !=30])

The fusion cache tracks predicate patterns during compilation:

  • Counts total predicates per field
  • Tracks specific operator types (e.g., Contains for strings)
  • Decides whether fusion is beneficial via should_fuse

Sources: llkv-table/src/planner/mod.rs:517-570 llkv-table/src/planner/program.rs:518-542


Row-Level Evaluation

Stack-Based Evaluation Engine

Filter evaluation uses a stack-based virtual machine that processes EvalProgram bytecode. Each operation manipulates a boolean result stack.

graph LR
    subgraph "Evaluation Loop"
        Ops["EvalOp Instructions"]
Stack["Result Stack\nVec&lt;bool&gt;"]
Ops -->|Process| Stack
 
       Stack -->|Final Value| Result["Filter Decision"]
end
    
    subgraph "Example: age >= 18 AND status = 'active'"
 
       Op1["PushPredicate(age >= 18)"] -->|Stack: [true]| Op2["PushPredicate(status = 'active')"]
Op2 -->|Stack: [true, false]| Op3["And(2)"]
Op3 -->|Stack: [false]| Final["Result: false"]
end

The evaluation process iterates through EvalOp instructions, pushing boolean results and combining them according to logical operators. Each predicate evaluation consults the underlying storage to check field values against filter conditions.

Sources: llkv-table/src/planner/mod.rs:1009-1031

graph TB
    Predicate["Filter&lt;FieldId&gt;\nfield_id + operator"]
Predicate --> TypeCheck{"Data Type?"}
TypeCheck -->|Fixed-width| FixedPath["build_fixed_width_predicate\nInt, Float, Date, etc."]
TypeCheck -->|Variable-width| VarPath["build_var_width_predicate\nString types"]
TypeCheck -->|Boolean| BoolPath["build_bool_predicate\nBool type"]
FixedPath --> Native["Native comparison\nusing PredicateValue"]
VarPath --> Pattern["Pattern matching\nStartsWith, Contains, etc."]
BoolPath --> Boolean["Boolean logic"]
Native --> Result["bool"]
Pattern --> Result
 
   Boolean --> Result

Predicate Evaluation

Individual predicates are evaluated by comparing field values against filter operators. The evaluation strategy depends on the data type:

Type-specific evaluation paths:

Operator semantics:

OperatorDescriptionNULL Handling
EqualsExact matchNULL = NULL → NULL
RangeBounded interval (inclusive/exclusive)NULL in range → NULL
InSet membershipNULL in [values] → NULL
StartsWithString prefix match (case-sensitive/insensitive)NULL starts with X → NULL
EndsWithString suffix matchNULL ends with X → NULL
ContainsString substring matchNULL contains X → NULL
IsNullNULL testReturns true/false
IsNotNullNOT NULL testReturns true/false

Sources: llkv-expr/src/expr.rs:295-358 llkv-expr/src/typed_predicate.rs:1-500 (referenced but not shown)

graph TB
    ScalarExpr["ScalarExpr&lt;FieldId&gt;"]
ScalarExpr --> Mode{"Evaluation Mode"}
Mode -->|Single Row| RowLevel["NumericKernels::evaluate_value\nRecursive evaluation\nReturns Option&lt;NumericValue&gt;"]
Mode -->|Batch| BatchLevel["NumericKernels::evaluate_batch\nVectorized when possible"]
BatchLevel --> Vectorized{"Vectorizable?"}
Vectorized -->|Yes| Vec["try_evaluate_vectorized\nArrow compute kernels"]
Vectorized -->|No| Fallback["Per-row evaluation\nLoop + evaluate_value"]
Vec --> Array["ArrayRef result"]
Fallback --> Array

Scalar Expression Evaluation

For computed columns and complex predicates (e.g., WHERE salary * 1.1 > 50000), scalar expressions are evaluated using the NumericKernels utility.

Evaluation modes:

Numeric type handling:

The NumericArray wrapper provides unified access to different numeric types:

  • Integer : Int64Array for integers, booleans, dates
  • Float : Float64Array for floating-point numbers
  • Decimal : Decimal128Array for precise decimal values

Type coercion occurs automatically during expression evaluation:

  • Mixed integer/float operations promote to float
  • String-to-numeric conversions follow SQLite semantics (invalid → 0)
  • NULL propagates through operations

Key evaluation functions:

FunctionPurposePerformance
evaluate_valueSingle-row evaluationUsed for non-vectorizable expressions
evaluate_batchBatch evaluationAttempts vectorization first
try_evaluate_vectorizedVectorized computationUses Arrow compute kernels
prepare_numeric_arraysType coercionConverts columns to numeric representation

Sources: llkv-table/src/scalar_eval.rs:453-713 llkv-table/src/scalar_eval.rs:92-383


graph TB
    subgraph "Filter Stages"
        Scan["Table Scan"]
Scan --> Domain["1. Domain Program\nDetermines candidate\nrow IDs"]
Domain --> UserPred["2. User Predicates\nSemantic filtering\nvia EvalProgram"]
UserPred --> MVCCFilter["3. MVCC Filtering\nrow_id_filter.filter()\nVisibility rules"]
MVCCFilter --> Results["Visible Results"]
end
    
    subgraph "MVCC Visibility"
        RowID["Row ID"]
CreatedBy["created_by TxnId"]
DeletedBy["deleted_by Option&lt;TxnId&gt;"]
Snapshot["Transaction Snapshot"]
RowID --> Check{"Visibility Check"}
CreatedBy --> Check
 
       DeletedBy --> Check
 
       Snapshot --> Check
        
 
       Check -->|Created before snapshot Not deleted or deleted after snapshot| Visible["Include"]
Check -->|Otherwise| Invisible["Exclude"]
end

MVCC Integration

Filter evaluation integrates with MVCC visibility filtering to ensure queries only see rows visible to their transaction. This is a two-stage filtering process:

MVCC filtering implementation:

The row_id_filter option in ScanStreamOptions provides transaction-aware filtering:

  • Created by runtime's transaction manager
  • Encapsulates snapshot visibility rules
  • Applied after user predicate evaluation
  • Filters row IDs based on created_by and deleted_by transaction IDs

Filtering order rationale:

  1. Domain programs - Quickly eliminate rows where referenced fields don't exist
  2. User predicates - Evaluate semantic conditions (WHERE clause)
  3. MVCC filter - Apply transaction visibility rules

This ordering minimizes MVCC overhead by only checking visibility for rows that pass semantic filters.

Sources: llkv-table/src/planner/mod.rs:940-982 llkv-table/src/table.rs:200-300 (type definitions referenced but not shown)


Evaluation Optimizations

Single-Column Direct Scan Fast Path

A specialized fast path for queries that project a single column with simple filtering. This bypasses the general evaluation machinery for better performance.

Conditions for fast path:

  • Single column projection
  • Filter references only that column
  • Simple operator (no complex scalar expressions)

When activated, the scan directly streams the target column's values without materializing intermediate structures.

Sources: llkv-table/src/planner/mod.rs:1020-1031 (method name: try_single_column_direct_scan)

graph LR
    Shadow["Shadow Column\nrow_id metadata"]
Shadow -->|Exists?| FastEnum["Fast Path\nstream_table_row_ids"]
Shadow -->|Missing| Fallback["Fallback Path\nMulti-column scan\n+ deduplication"]
FastEnum --> Chunks["Row ID Chunks\nSTREAM_BATCH_ROWS"]
Fallback --> Chunks
    
 
   Chunks --> MVCCCheck["Apply MVCC Filter"]
MVCCCheck --> Gather["Gather Columns"]
Gather --> Batch["RecordBatch"]

Full Table Scan Streaming

When no predicates require evaluation (e.g., WHERE true or full scan), the executor uses streaming row ID enumeration:

The fast path attempts to use a shadow column (row_id) that stores all row IDs for a table:

  • Success case : Shadow column exists → stream chunks directly
  • Fallback case : Shadow column missing → scan user columns and deduplicate

Sources: llkv-table/src/planner/mod.rs:739-857 llkv-table/src/planner/mod.rs:859-902 llkv-table/src/planner/mod.rs:904-999

graph TB
    Expression["WHERE clause\nexpression tree"]
Expression --> Traverse["Traverse expression\nrecord_expr"]
Traverse --> Track["Track per-field stats:\n- Total predicate count\n- Contains operator count"]
Track --> Decide["should_fuse decision"]
Decide -->|String + Contains ≥1| Fuse1["Enable fusion"]
Decide -->|Any type + predicates ≥2| Fuse2["Enable fusion"]
Decide -->|Otherwise| NoFuse["No fusion"]

Predicate Fusion Cache

The PredicateFusionCache tracks predicate patterns during compilation to enable fusion optimization:

Fusion benefits:

  • Reduces function call overhead
  • Enables specialized evaluation routines
  • Improves cache locality by processing same field

Fusion conditions table:

Field Data TypeConditions for Fusion
Utf8 / LargeUtf8Total predicates ≥ 2 AND Contains operations ≥ 1
Other typesTotal predicates ≥ 2 on same field

Sources: llkv-table/src/planner/mod.rs:517-570

graph TB
    Expr["ScalarExpr batch"]
Expr --> Check{"Vectorizable?"}
Check -->|Yes| Patterns["Supported patterns:\n- Column references\n- Literal constants\n- Binary ops\n- Scalar×Array ops\n- Array×Array ops"]
Patterns --> ArrowCompute["Arrow compute kernels\nSIMD-optimized"]
Check -->|No| PerRow["Per-row evaluation\nevaluate_value loop"]
ArrowCompute --> Result["ArrayRef"]
PerRow --> Result

Vectorized Expression Evaluation

For numeric operations, the system attempts vectorized evaluation to process entire batches at once:

Vectorization strategy:

Vectorizable expression patterns:

  • Pure column references
  • Literal constants
  • Binary operations: Array ⊕ Array, Array ⊕ Scalar, Scalar ⊕ Array
  • Simple casts between numeric types

Non-vectorizable expressions:

  • CASE expressions with complex branches
  • Date/interval arithmetic
  • Aggregate functions
  • Subqueries

The vectorization attempt happens in try_evaluate_vectorized, which recursively checks if all sub-expressions can be vectorized. If any sub-expression is non-vectorizable, the entire expression falls back to row-by-row evaluation.

Sources: llkv-table/src/scalar_eval.rs:714-763 llkv-table/src/scalar_eval.rs:676-713