Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SQL Preprocessing and Dialect Handling

Relevant source files

Purpose and Scope

SQL preprocessing is the first stage in LLKV's query processing pipeline, responsible for normalizing SQL syntax from various dialects before the statement reaches the parser. This system allows LLKV to accept SQL written for SQLite, DuckDB, and TPC-H tooling while using the standard sqlparser library, which has limited dialect support.

The preprocessing layer transforms dialect-specific syntax into forms that sqlparser can parse, enabling compatibility with SQL Logic Tests and real-world SQL scripts without modifying the parser itself. This page documents the preprocessing transformations and their implementation.

For information about what happens after preprocessing (SQL parsing and plan generation), see SQL Query Processing Pipeline. For details on the SqlEngine API that invokes preprocessing, see SqlEngine API.

Preprocessing in the Query Pipeline

SQL preprocessing occurs immediately before parsing in both the execute and prepare code paths. The following diagram shows where preprocessing fits in the overall query execution flow:

Diagram: SQL Preprocessing Pipeline Position

flowchart TB
    Input["SQL String Input"]
Preprocess["preprocess_sql_input()"]
Parse["sqlparser::Parser::parse()"]
Plan["Plan Generation"]
Execute["Query Execution"]
Input --> Preprocess
 
   Preprocess --> Parse
 
   Parse --> Plan
 
   Plan --> Execute
    
    subgraph "Preprocessing Transformations"
        direction TB
        TPC["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → CREATE DOMAIN"]
Exclude["EXCLUDE syntax normalization"]
Trailing["Trailing comma removal"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint stripping"]
Reindex["REINDEX → VACUUM REINDEX"]
BareTable["Bare table IN expansion"]
TPC --> CreateType
 
       CreateType --> Exclude
 
       Exclude --> Trailing
 
       Trailing --> BareTable
 
       BareTable --> EmptyIn
 
       EmptyIn --> IndexHints
 
       IndexHints --> Reindex
    end
    
    Preprocess -.chains.-> TPC
    Reindex -.final.-> Parse

Sources: llkv-sql/src/sql_engine.rs:936-1001

The preprocess_sql_input method chains all dialect transformations in a specific order, with each transformation receiving the output of the previous one. If parsing fails after preprocessing and the SQL contains CREATE TRIGGER, a fallback preprocessor (preprocess_sqlite_trigger_shorthand) is applied before retrying the parse.

Diagram: Preprocessing Execution Sequence with Fallback

sequenceDiagram
    participant Caller
    participant SqlEngine
    participant Preprocess as "preprocess_sql_input"
    participant Parser as "sqlparser"
    participant Fallback as "preprocess_sqlite_trigger_shorthand"
    
    Caller->>SqlEngine: execute(sql)
    SqlEngine->>Preprocess: preprocess(sql)
    
    Note over Preprocess: Chain all transformations
    
    Preprocess-->>SqlEngine: processed_sql
    SqlEngine->>Parser: parse(processed_sql)
    
    alt Parse Success
        Parser-->>SqlEngine: AST
    else Parse Error + "CREATE TRIGGER"
        Parser-->>SqlEngine: ParseError
        SqlEngine->>Fallback: expand_trigger_syntax(processed_sql)
        Fallback-->>SqlEngine: expanded_sql
        SqlEngine->>Parser: parse(expanded_sql)
        Parser-->>SqlEngine: AST or Error
    end
    
    SqlEngine-->>Caller: Results

Sources: llkv-sql/src/sql_engine.rs:936-958

Supported Dialect Transformations

LLKV implements nine distinct preprocessing transformations, each targeting specific dialect compatibility issues. The following table summarizes each transformation:

PreprocessorDialectPurposeMethod
TPC-H CONNECTTPC-HStrip CONNECT TO database; statementspreprocess_tpch_connect_syntax
CREATE TYPEDuckDBConvert CREATE TYPE to CREATE DOMAINpreprocess_create_type_syntax
EXCLUDE SyntaxGeneralQuote qualified identifiers in EXCLUDE clausespreprocess_exclude_syntax
Trailing CommasDuckDBRemove trailing commas in VALUESpreprocess_trailing_commas_in_values
Empty IN ListsSQLiteConvert IN () to constant expressionspreprocess_empty_in_lists
Index HintsSQLiteStrip INDEXED BY and NOT INDEXEDpreprocess_index_hints
REINDEXSQLiteConvert REINDEX to VACUUM REINDEXpreprocess_reindex_syntax
Bare Table INSQLiteExpand IN table to IN (SELECT * FROM table)preprocess_bare_table_in_clauses
Trigger ShorthandSQLiteAdd AFTER and FOR EACH ROW to triggerspreprocess_sqlite_trigger_shorthand

Sources: llkv-sql/src/sql_engine.rs:628-842 llkv-sql/src/sql_engine.rs:992-1001

TPC-H CONNECT Statement Removal

The TPC-H benchmark tooling generates CONNECT TO <database>; directives in referential integrity scripts. Since LLKV operates within a single database context, these statements are treated as no-ops and stripped during preprocessing.

Transformation:

Sources: llkv-sql/src/sql_engine.rs:623-630

CREATE TYPE to CREATE DOMAIN Conversion

DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser only supports the SQL standard CREATE DOMAIN syntax. This preprocessor converts the DuckDB syntax to the standard form.

Transformation:

Implementation: Uses static regex patterns initialized via OnceLock for thread-safe lazy compilation.

Sources: llkv-sql/src/sql_engine.rs:634-657

EXCLUDE Syntax Normalization

When EXCLUDE clauses contain qualified identifiers (e.g., schema.table.column), sqlparser requires them to be quoted. This preprocessor wraps qualified names in double quotes.

Transformation:

Sources: llkv-sql/src/sql_engine.rs:659-676

Trailing Comma Removal in VALUES

DuckDB permits trailing commas in VALUES clauses like VALUES ('v2',), but sqlparser rejects them. This preprocessor removes trailing commas before closing parentheses.

Transformation:

Sources: llkv-sql/src/sql_engine.rs:678-689

Empty IN List Handling

SQLite allows degenerate IN () and NOT IN () expressions. Since sqlparser rejects these, the preprocessor converts them to constant boolean expressions while preserving the original expression evaluation (in case of side effects).

Transformation:

The pattern matches various expression forms: parenthesized expressions, quoted strings, hex literals, identifiers, and numbers.

Sources: llkv-sql/src/sql_engine.rs:691-720

Index Hint Stripping

SQLite supports query optimizer hints like FROM table INDEXED BY index_name and FROM table NOT INDEXED. Since sqlparser doesn't support this syntax and LLKV makes its own index decisions, these hints are stripped during preprocessing.

Transformation:

Sources: llkv-sql/src/sql_engine.rs:722-739

REINDEX to VACUUM REINDEX Conversion

SQLite supports REINDEX index_name as a standalone statement, but sqlparser only recognizes REINDEX as part of VACUUM syntax. This preprocessor converts the standalone form.

Transformation:

Sources: llkv-sql/src/sql_engine.rs:741-757

Bare Table IN Clause Expansion

SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor expands this shorthand to the subquery form that sqlparser requires.

Transformation:

The pattern avoids matching IN ( which is already a valid subquery.

Sources: llkv-sql/src/sql_engine.rs:844-873

SQLite Trigger Shorthand Expansion

SQLite allows omitting the trigger timing (defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). sqlparser requires both to be explicit. This preprocessor injects the missing clauses.

Transformation:

This is a fallback preprocessor that only runs if initial parsing fails and the SQL contains CREATE TRIGGER. The implementation uses complex regex patterns to handle optional dotted identifiers with various quoting styles.

Sources: llkv-sql/src/sql_engine.rs:759-842 llkv-sql/src/sql_engine.rs:944-957

graph TB
    subgraph "SqlEngine Methods"
        Execute["execute(sql)"]
Prepare["prepare(sql)"]
PreprocessInput["preprocess_sql_input(sql)"]
end
    
    subgraph "Static Regex Patterns"
        CreateTypeRE["CREATE_TYPE_REGEX"]
DropTypeRE["DROP_TYPE_REGEX"]
ExcludeRE["EXCLUDE_REGEX"]
TrailingRE["TRAILING_COMMA_REGEX"]
EmptyInRE["EMPTY_IN_REGEX"]
IndexHintRE["INDEX_HINT_REGEX"]
ReindexRE["REINDEX_REGEX"]
BareTableRE["BARE_TABLE_IN_REGEX"]
TimingRE["TIMING_REGEX"]
ForEachBeginRE["FOR_EACH_BEGIN_REGEX"]
ForEachWhenRE["FOR_EACH_WHEN_REGEX"]
end
    
    subgraph "Preprocessor Methods"
        TPC["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
Trailing["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
BareTable["preprocess_bare_table_in_clauses"]
Trigger["preprocess_sqlite_trigger_shorthand"]
end
    
 
   Execute --> PreprocessInput
 
   Prepare --> PreprocessInput
    
 
   PreprocessInput --> TPC
 
   PreprocessInput --> CreateType
 
   PreprocessInput --> Exclude
 
   PreprocessInput --> Trailing
 
   PreprocessInput --> BareTable
 
   PreprocessInput --> EmptyIn
 
   PreprocessInput --> IndexHints
 
   PreprocessInput --> Reindex
    
    CreateType -.uses.-> CreateTypeRE
    CreateType -.uses.-> DropTypeRE
    Exclude -.uses.-> ExcludeRE
    Trailing -.uses.-> TrailingRE
    EmptyIn -.uses.-> EmptyInRE
    IndexHints -.uses.-> IndexHintRE
    Reindex -.uses.-> ReindexRE
    BareTable -.uses.-> BareTableRE
    Trigger -.uses.-> TimingRE
    Trigger -.uses.-> ForEachBeginRE
    Trigger -.uses.-> ForEachWhenRE

Implementation Architecture

The preprocessing system is implemented using a combination of regex transformations and string manipulation. The following diagram shows the key components:

Diagram: Preprocessing Implementation Components

Sources: llkv-sql/src/sql_engine.rs:640-842 llkv-sql/src/sql_engine.rs:992-1001

Regex Pattern Management

All regex patterns are stored in OnceLock static variables for thread-safe lazy initialization. This ensures patterns are compiled once per process and reused across all preprocessing operations, avoiding the overhead of repeated compilation.

Pattern Initialization Example:

The patterns use case-insensitive matching ((?i)) and word boundaries (\b) to avoid false matches within identifiers or string literals.

Sources: llkv-sql/src/sql_engine.rs:640-650 llkv-sql/src/sql_engine.rs:661-669 llkv-sql/src/sql_engine.rs:682-686

Preprocessing Order

The order of transformations matters because later transformations may depend on earlier ones. The current order:

  1. TPC-H CONNECT removal - Must happen first to remove non-SQL directives
  2. CREATE TYPE conversion - Normalizes DDL before other transformations
  3. EXCLUDE syntax - Handles qualified names in projection lists
  4. Trailing comma removal - Fixes VALUES clause syntax
  5. Bare table IN expansion - Converts shorthand to subqueries before empty IN check
  6. Empty IN handling - Must come after bare table expansion to avoid conflicts
  7. Index hint stripping - Removes query hints from FROM clauses
  8. REINDEX conversion - Must be last to avoid interfering with VACUUM statements

Sources: llkv-sql/src/sql_engine.rs:992-1001

Parser Integration

The preprocessed SQL is passed to sqlparser with a custom recursion limit to handle deeply nested queries from test suites:

The default sqlparser recursion limit (50) is insufficient for some SQLite test suite queries, so LLKV uses 200 to balance compatibility with stack safety.

Sources: llkv-sql/src/sql_engine.rs:318-324

Testing and Validation

The preprocessing transformations are validated through:

  1. SQL Logic Tests (SLT) - The llkv-slt-tester runs thousands of SQLite test cases that exercise various dialect features
  2. TPC-H Benchmarks - The llkv-tpch crate verifies compatibility with TPC-H SQL scripts
  3. Unit Tests - Individual preprocessor functions are tested in isolation

The preprocessing system is designed to be conservative: it only transforms patterns that are known to cause parser errors, and it preserves the original SQL semantics whenever possible.

Sources: llkv-sql/src/sql_engine.rs:623-1001 llkv-sql/Cargo.toml:1-34

Future Considerations

The preprocessing approach is a pragmatic solution that enables broad dialect compatibility without modifying sqlparser. However, it has limitations:

  • Fragile regex patterns - Complex transformations like trigger shorthand expansion use intricate regex that may not handle all edge cases
  • Limited context awareness - String-based transformations cannot distinguish between SQL keywords and string literals containing those keywords
  • Maintenance burden - Each new dialect feature requires a new preprocessor

The long-term solution is to contribute dialect-specific parsing improvements back to sqlparser, eliminating the need for preprocessing. The trigger shorthand transformation includes a TODO comment noting that proper SQLite dialect support in sqlparser would eliminate that preprocessor entirely.

Sources: llkv-sql/src/sql_engine.rs:765-770