Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Scalar Evaluation and NumericKernels

Loading…

Scalar Evaluation and NumericKernels

Relevant source files

Purpose and Scope

This page documents the scalar expression evaluation engine in LLKV, covering how ScalarExpr instances are computed against columnar data to produce results for projections, filters, and computed columns. The evaluation system operates on Apache Arrow arrays and leverages vectorization for performance.

For information about the expression AST structure and variants, see Expression AST. For how expressions are translated from SQL column names to field identifiers, see Expression Translation. For the bytecode compilation pipeline, see Program Compilation. For aggregate function evaluation, see Aggregation System.

ScalarExpr Variants

The ScalarExpr<F> enum represents all scalar computations that can be performed in LLKV. Each variant corresponds to a specific type of operation that produces a single value per row.

Sources: llkv-expr/src/expr.rs:126-182

graph TB
    ScalarExpr["ScalarExpr&lt;F&gt;"]
ScalarExpr --> Column["Column(F)\nDirect column reference"]
ScalarExpr --> Literal["Literal(Literal)\nConstant value"]
ScalarExpr --> Binary["Binary\nArithmetic operations"]
ScalarExpr --> Not["Not(Box&lt;ScalarExpr&gt;)\nLogical negation"]
ScalarExpr --> IsNull["IsNull\nNULL test"]
ScalarExpr --> Aggregate["Aggregate(AggregateCall)\nAggregate functions"]
ScalarExpr --> GetField["GetField\nStruct field access"]
ScalarExpr --> Cast["Cast\nType conversion"]
ScalarExpr --> Compare["Compare\nComparison ops"]
ScalarExpr --> Coalesce["Coalesce\nFirst non-null"]
ScalarExpr --> ScalarSubquery["ScalarSubquery\nSubquery result"]
ScalarExpr --> Case["Case\nCASE expression"]
ScalarExpr --> Random["Random\nRandom number"]
Binary --> BinaryOp["BinaryOp:\nAdd, Subtract, Multiply,\nDivide, Modulo,\nAnd, Or,\nBitwiseShiftLeft,\nBitwiseShiftRight"]
Compare --> CompareOp["CompareOp:\nEq, NotEq,\nLt, LtEq,\nGt, GtEq"]

The generic type parameter F represents the field identifier type, which is typically String (for SQL column names) or FieldId (for translated physical column identifiers).

Binary Operations

Binary operations perform arithmetic or logical computations between two scalar expressions. The BinaryOp enum defines the supported operators:

OperatorSymbolDescriptionExample
Add+Additioncol1 + col2
Subtract-Subtractioncol1 - 100
Multiply*Multiplicationprice * quantity
Divide/Divisiontotal / count
Modulo%Remainderid % 10
AndANDLogical ANDflag1 AND flag2
OrORLogical ORstatus1 OR status2
BitwiseShiftLeft<<Left bit shiftmask << 2
BitwiseShiftRight>>Right bit shiftflags >> 4

Sources: llkv-expr/src/expr.rs:310-338

Binary expressions are constructed recursively, allowing complex nested computations:

Comparison Operations

Comparison operations produce boolean results (represented as 1/0 integers) by comparing two scalar expressions. The CompareOp enum defines six comparison operators:

OperatorSymbolDescription
Eq=Equal to
NotEq!=Not equal to
Lt<Less than
LtEq<=Less than or equal
Gt>Greater than
GtEq>=Greater than or equal

Sources: llkv-expr/src/expr.rs:341-363

Comparisons are represented as a ScalarExpr::Compare variant that contains the left operand, operator, and right operand. When evaluated, they produce a boolean value that follows SQL three-valued logic (true, false, NULL).

Evaluation Pipeline

The evaluation of scalar expressions follows a multi-stage pipeline from SQL text to computed Arrow arrays:

Sources: llkv-expr/src/expr.rs:1-819 llkv-table/src/table.rs:29-30

graph LR
    SQL["SQL Expression\n'col1 + col2 * 10'"]
Parse["SQL Parser\nsqlparser-rs"]
Translate["Expression Translation\nString → FieldId"]
Compile["Program Compilation\nScalarExpr → EvalProgram"]
Execute["Evaluation Engine\nVectorized Execution"]
Result["Arrow Array\nComputed Results"]
SQL --> Parse
 
   Parse --> Translate
 
   Translate --> Compile
 
   Compile --> Execute
 
   Execute --> Result
    
    Compile -.uses.-> ProgramCompiler["ProgramCompiler\nllkv-compute"]
Execute -.operates on.-> ArrowArrays["Arrow Arrays\nColumnar Data"]

The key stages are:

  1. Parsing : SQL expressions are parsed into AST nodes by sqlparser-rs
  2. Translation : Column names (strings) are resolved to FieldId identifiers
  3. Compilation : ScalarExpr is compiled into bytecode by ProgramCompiler
  4. Execution : The bytecode is evaluated against Arrow columnar data

Type System and Casting

LLKV uses Arrow’s type system for scalar evaluation. The Cast variant of ScalarExpr performs explicit type conversions:

Sources: llkv-expr/src/expr.rs:154-157

The data_type field is an Arrow DataType that specifies the target type for the conversion. Type casting follows Arrow’s casting rules and handles conversions between:

  • Numeric types (integers, floats, decimals)
  • String types (UTF-8)
  • Temporal types (Date32, timestamps)
  • Boolean types
  • Struct types

Invalid casts produce NULL values following SQL semantics.

Null Handling and Propagation

Scalar expressions implement SQL three-valued logic where NULL values propagate through most operations. The IsNull variant provides explicit NULL testing:

Sources: llkv-expr/src/expr.rs:139-142

When negated is false, this evaluates to 1 (true) if the expression is NULL, 0 (false) otherwise. When negated is true, it performs the inverse test (IS NOT NULL).

The Coalesce variant provides NULL-coalescing behavior, returning the first non-NULL value from a list of expressions:

Sources: llkv-expr/src/expr.rs165

This is used to implement SQL’s COALESCE(expr1, expr2, ...) function.

CASE Expressions

The Case variant implements SQL CASE expressions with optional operand and ELSE branches:

Sources: llkv-expr/src/expr.rs:169-176

This represents both simple and searched CASE expressions:

  • Simple CASE : When operand is Some, each branch’s WHEN expression is compared to the operand
  • Searched CASE : When operand is None, each branch’s WHEN expression is evaluated as a boolean

The branches vector contains (WHEN, THEN) pairs evaluated in order. If no branch matches, else_expr is returned, or NULL if else_expr is None.

graph TB
    subgraph "Row-by-Row Evaluation (Avoided)"
        Row1["Row 1:\ncol1=10, col2=5\n→ 10 + 5 = 15"]
Row2["Row 2:\ncol1=20, col2=3\n→ 20 + 3 = 23"]
Row3["Row 3:\ncol1=15, col2=7\n→ 15 + 7 = 22"]
end
    
    subgraph "Vectorized Evaluation (Used)"
        Col1["Int64Array\n[10, 20, 15]"]
Col2["Int64Array\n[5, 3, 7]"]
Result["Int64Array\n[15, 23, 22]"]
Col1 --> VectorAdd["Vectorized Add\nSIMD Operations"]
Col2 --> VectorAdd
 
       VectorAdd --> Result
    end
    
    style Row1 fill:#f9f9f9
    style Row2 fill:#f9f9f9
    style Row3 fill:#f9f9f9

Vectorization Strategy

Expression evaluation in LLKV leverages Apache Arrow’s columnar format for vectorized execution. Rather than evaluating expressions row-by-row, operations process entire arrays at once.

Sources: llkv-table/src/table.rs:1-681

Key benefits of vectorization:

  1. SIMD Instructions : Modern CPUs can process multiple values simultaneously
  2. Reduced Overhead : Eliminates per-row interpretation overhead
  3. Cache Efficiency : Columnar layout improves CPU cache utilization
  4. Arrow Compute Kernels : Leverages highly optimized Arrow implementations
graph TB
    Expr["ScalarExpr::Binary\nop=Add"]
Dispatch["Type Dispatch"]
Expr --> Dispatch
    
 
   Dispatch --> Int64["Int64Array\nAdd kernel"]
Dispatch --> Int32["Int32Array\nAdd kernel"]
Dispatch --> Float64["Float64Array\nAdd kernel"]
Dispatch --> Decimal128["Decimal128Array\nAdd kernel"]
Int64 --> Result1["Int64Array result"]
Int32 --> Result2["Int32Array result"]
Float64 --> Result3["Float64Array result"]
Decimal128 --> Result4["Decimal128Array result"]

Numeric Type Dispatch

LLKV handles multiple numeric types through Arrow’s type system. The evaluation engine uses Arrow’s primitive type traits to dispatch operations:

Sources: llkv-table/src/table.rs:17-20

The dispatch mechanism uses Arrow’s type system to select the appropriate kernel at evaluation time. The macros llkv_for_each_arrow_numeric, llkv_for_each_arrow_boolean, and llkv_for_each_arrow_string provide type-safe iteration over all supported Arrow types.

Aggregate Functions in Scalar Context

The Aggregate variant of ScalarExpr allows aggregate functions to appear in scalar contexts, such as COUNT(*) + 1:

Sources: llkv-expr/src/expr.rs145

The AggregateCall enum includes:

  • CountStar: Count all rows
  • Count { expr, distinct }: Count non-NULL values
  • Sum { expr, distinct }: Sum of values
  • Total { expr, distinct }: Sum with NULL-safe 0 default
  • Avg { expr, distinct }: Average of values
  • Min(expr): Minimum value
  • Max(expr): Maximum value
  • CountNulls(expr): Count NULL values
  • GroupConcat { expr, distinct, separator }: String concatenation

Sources: llkv-expr/src/expr.rs:184-215

Each aggregate operates on a ScalarExpr, allowing nested computations like SUM(col1 + col2) or AVG(-price).

Random Number Generation

The Random variant produces pseudo-random floating-point values:

Sources: llkv-expr/src/expr.rs181

Following PostgreSQL and DuckDB semantics, each evaluation produces a new random value in the range [0.0, 1.0). The generator is seeded automatically and does not expose seed control at the SQL level.\n\n## Struct Field Access\n\nThe GetField variant extracts fields from struct expressions](https://github.com/jzombie/rust-llkv/blob/89777726/0.0, 1.0). The generator is seeded automatically and does not expose seed control at the SQL level.\n\n## Struct Field Access\n\nThe GetField variant extracts fields from struct expressions#LNaN-LNaN)

This enables navigation of nested data structures. For example, accessing user.address.city would be represented as:

The evaluation engine resolves field names against Arrow struct schemas at runtime.

Performance Considerations

The scalar evaluation engine includes several optimizations:

Expression Constant Folding

Constant subexpressions are evaluated once during compilation rather than per-row. For example, col1 + (10 * 20) is simplified to col1 + 200 before evaluation.

Predicate Pushdown

When scalar expressions appear in WHERE clauses, they may be pushed down to the storage layer for early filtering. The PredicateFusionCache in llkv-compute caches compiled predicates to avoid recompilation.

Sources: llkv-table/src/table.rs29

Type Specialization

Arrow kernels are specialized for each numeric type, avoiding generic dispatch overhead in tight loops. This ensures that Int64 + Int64 uses dedicated integer addition instructions rather than polymorphic dispatch.

SIMD Acceleration

The underlying storage layer (simd-r-drive) provides SIMD-optimized operations for bulk data movement and filtering, which complements the vectorized evaluation strategy.

Sources: llkv-storage/pager/MemPager llkv-table/src/table.rs21

sequenceDiagram
    participant SQL as SQL Engine
    participant Planner as TablePlanner
    participant Scanner as ScanRowStream
    participant Compute as Compute Layer
    participant Store as ColumnStore
    
    SQL->>Planner: Execute SELECT with expressions
    Planner->>Planner: Compile ScalarExpr to bytecode
    Planner->>Scanner: Create scan with projections
    
    loop For each batch
        Scanner->>Store: Gather column arrays
        Store-->>Scanner: Arrow arrays
        Scanner->>Compute: Evaluate expressions
        Compute->>Compute: Vectorized operations
        Compute-->>Scanner: Computed arrays
        Scanner-->>SQL: RecordBatch
    end

Integration with Scan Pipeline

Scalar expressions are evaluated during table scans through the ScanProjection system:

Sources: llkv-table/src/table.rs:447-488 llkv-scan/execute/execute_scan

The scan pipeline:

  1. Gathers base column arrays from the ColumnStore
  2. Passes arrays to the compute layer for expression evaluation
  3. Assembles computed results into RecordBatch instances
  4. Streams batches to the caller

This design minimizes memory allocation by processing data in fixed-size batches (typically 1024 rows) rather than materializing entire result sets.

Expression Compilation Flow

The complete compilation flow from SQL to executed results:

Sources: llkv-expr/src/expr.rs:1-819 llkv-plan/src/plans.rs:1-1500 llkv-table/src/table.rs:1-681

This pipeline ensures that expressions are validated, optimized, and compiled before execution begins, minimizing runtime overhead.

Dismiss

Refresh this wiki

Enter email to refresh