This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
SQL Preprocessing and Dialect Handling
Relevant source files
Purpose and Scope
SQL preprocessing is the first stage in LLKV's query processing pipeline, responsible for normalizing SQL syntax from various dialects before the statement reaches the parser. This system allows LLKV to accept SQL written for SQLite, DuckDB, and TPC-H tooling while using the standard sqlparser library, which has limited dialect support.
The preprocessing layer transforms dialect-specific syntax into forms that sqlparser can parse, enabling compatibility with SQL Logic Tests and real-world SQL scripts without modifying the parser itself. This page documents the preprocessing transformations and their implementation.
For information about what happens after preprocessing (SQL parsing and plan generation), see SQL Query Processing Pipeline. For details on the SqlEngine API that invokes preprocessing, see SqlEngine API.
Preprocessing in the Query Pipeline
SQL preprocessing occurs immediately before parsing in both the execute and prepare code paths. The following diagram shows where preprocessing fits in the overall query execution flow:
Diagram: SQL Preprocessing Pipeline Position
flowchart TB
Input["SQL String Input"]
Preprocess["preprocess_sql_input()"]
Parse["sqlparser::Parser::parse()"]
Plan["Plan Generation"]
Execute["Query Execution"]
Input --> Preprocess
Preprocess --> Parse
Parse --> Plan
Plan --> Execute
subgraph "Preprocessing Transformations"
direction TB
TPC["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → CREATE DOMAIN"]
Exclude["EXCLUDE syntax normalization"]
Trailing["Trailing comma removal"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint stripping"]
Reindex["REINDEX → VACUUM REINDEX"]
BareTable["Bare table IN expansion"]
TPC --> CreateType
CreateType --> Exclude
Exclude --> Trailing
Trailing --> BareTable
BareTable --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
end
Preprocess -.chains.-> TPC
Reindex -.final.-> Parse
Sources: llkv-sql/src/sql_engine.rs:936-1001
The preprocess_sql_input method chains all dialect transformations in a specific order, with each transformation receiving the output of the previous one. If parsing fails after preprocessing and the SQL contains CREATE TRIGGER, a fallback preprocessor (preprocess_sqlite_trigger_shorthand) is applied before retrying the parse.
Diagram: Preprocessing Execution Sequence with Fallback
sequenceDiagram
participant Caller
participant SqlEngine
participant Preprocess as "preprocess_sql_input"
participant Parser as "sqlparser"
participant Fallback as "preprocess_sqlite_trigger_shorthand"
Caller->>SqlEngine: execute(sql)
SqlEngine->>Preprocess: preprocess(sql)
Note over Preprocess: Chain all transformations
Preprocess-->>SqlEngine: processed_sql
SqlEngine->>Parser: parse(processed_sql)
alt Parse Success
Parser-->>SqlEngine: AST
else Parse Error + "CREATE TRIGGER"
Parser-->>SqlEngine: ParseError
SqlEngine->>Fallback: expand_trigger_syntax(processed_sql)
Fallback-->>SqlEngine: expanded_sql
SqlEngine->>Parser: parse(expanded_sql)
Parser-->>SqlEngine: AST or Error
end
SqlEngine-->>Caller: Results
Sources: llkv-sql/src/sql_engine.rs:936-958
Supported Dialect Transformations
LLKV implements nine distinct preprocessing transformations, each targeting specific dialect compatibility issues. The following table summarizes each transformation:
| Preprocessor | Dialect | Purpose | Method |
|---|---|---|---|
| TPC-H CONNECT | TPC-H | Strip CONNECT TO database; statements | preprocess_tpch_connect_syntax |
| CREATE TYPE | DuckDB | Convert CREATE TYPE to CREATE DOMAIN | preprocess_create_type_syntax |
| EXCLUDE Syntax | General | Quote qualified identifiers in EXCLUDE clauses | preprocess_exclude_syntax |
| Trailing Commas | DuckDB | Remove trailing commas in VALUES | preprocess_trailing_commas_in_values |
| Empty IN Lists | SQLite | Convert IN () to constant expressions | preprocess_empty_in_lists |
| Index Hints | SQLite | Strip INDEXED BY and NOT INDEXED | preprocess_index_hints |
| REINDEX | SQLite | Convert REINDEX to VACUUM REINDEX | preprocess_reindex_syntax |
| Bare Table IN | SQLite | Expand IN table to IN (SELECT * FROM table) | preprocess_bare_table_in_clauses |
| Trigger Shorthand | SQLite | Add AFTER and FOR EACH ROW to triggers | preprocess_sqlite_trigger_shorthand |
Sources: llkv-sql/src/sql_engine.rs:628-842 llkv-sql/src/sql_engine.rs:992-1001
TPC-H CONNECT Statement Removal
The TPC-H benchmark tooling generates CONNECT TO <database>; directives in referential integrity scripts. Since LLKV operates within a single database context, these statements are treated as no-ops and stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:623-630
CREATE TYPE to CREATE DOMAIN Conversion
DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser only supports the SQL standard CREATE DOMAIN syntax. This preprocessor converts the DuckDB syntax to the standard form.
Transformation:
Implementation: Uses static regex patterns initialized via OnceLock for thread-safe lazy compilation.
Sources: llkv-sql/src/sql_engine.rs:634-657
EXCLUDE Syntax Normalization
When EXCLUDE clauses contain qualified identifiers (e.g., schema.table.column), sqlparser requires them to be quoted. This preprocessor wraps qualified names in double quotes.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:659-676
Trailing Comma Removal in VALUES
DuckDB permits trailing commas in VALUES clauses like VALUES ('v2',), but sqlparser rejects them. This preprocessor removes trailing commas before closing parentheses.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:678-689
Empty IN List Handling
SQLite allows degenerate IN () and NOT IN () expressions. Since sqlparser rejects these, the preprocessor converts them to constant boolean expressions while preserving the original expression evaluation (in case of side effects).
Transformation:
The pattern matches various expression forms: parenthesized expressions, quoted strings, hex literals, identifiers, and numbers.
Sources: llkv-sql/src/sql_engine.rs:691-720
Index Hint Stripping
SQLite supports query optimizer hints like FROM table INDEXED BY index_name and FROM table NOT INDEXED. Since sqlparser doesn't support this syntax and LLKV makes its own index decisions, these hints are stripped during preprocessing.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:722-739
REINDEX to VACUUM REINDEX Conversion
SQLite supports REINDEX index_name as a standalone statement, but sqlparser only recognizes REINDEX as part of VACUUM syntax. This preprocessor converts the standalone form.
Transformation:
Sources: llkv-sql/src/sql_engine.rs:741-757
Bare Table IN Clause Expansion
SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor expands this shorthand to the subquery form that sqlparser requires.
Transformation:
The pattern avoids matching IN ( which is already a valid subquery.
Sources: llkv-sql/src/sql_engine.rs:844-873
SQLite Trigger Shorthand Expansion
SQLite allows omitting the trigger timing (defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). sqlparser requires both to be explicit. This preprocessor injects the missing clauses.
Transformation:
This is a fallback preprocessor that only runs if initial parsing fails and the SQL contains CREATE TRIGGER. The implementation uses complex regex patterns to handle optional dotted identifiers with various quoting styles.
Sources: llkv-sql/src/sql_engine.rs:759-842 llkv-sql/src/sql_engine.rs:944-957
graph TB
subgraph "SqlEngine Methods"
Execute["execute(sql)"]
Prepare["prepare(sql)"]
PreprocessInput["preprocess_sql_input(sql)"]
end
subgraph "Static Regex Patterns"
CreateTypeRE["CREATE_TYPE_REGEX"]
DropTypeRE["DROP_TYPE_REGEX"]
ExcludeRE["EXCLUDE_REGEX"]
TrailingRE["TRAILING_COMMA_REGEX"]
EmptyInRE["EMPTY_IN_REGEX"]
IndexHintRE["INDEX_HINT_REGEX"]
ReindexRE["REINDEX_REGEX"]
BareTableRE["BARE_TABLE_IN_REGEX"]
TimingRE["TIMING_REGEX"]
ForEachBeginRE["FOR_EACH_BEGIN_REGEX"]
ForEachWhenRE["FOR_EACH_WHEN_REGEX"]
end
subgraph "Preprocessor Methods"
TPC["preprocess_tpch_connect_syntax"]
CreateType["preprocess_create_type_syntax"]
Exclude["preprocess_exclude_syntax"]
Trailing["preprocess_trailing_commas_in_values"]
EmptyIn["preprocess_empty_in_lists"]
IndexHints["preprocess_index_hints"]
Reindex["preprocess_reindex_syntax"]
BareTable["preprocess_bare_table_in_clauses"]
Trigger["preprocess_sqlite_trigger_shorthand"]
end
Execute --> PreprocessInput
Prepare --> PreprocessInput
PreprocessInput --> TPC
PreprocessInput --> CreateType
PreprocessInput --> Exclude
PreprocessInput --> Trailing
PreprocessInput --> BareTable
PreprocessInput --> EmptyIn
PreprocessInput --> IndexHints
PreprocessInput --> Reindex
CreateType -.uses.-> CreateTypeRE
CreateType -.uses.-> DropTypeRE
Exclude -.uses.-> ExcludeRE
Trailing -.uses.-> TrailingRE
EmptyIn -.uses.-> EmptyInRE
IndexHints -.uses.-> IndexHintRE
Reindex -.uses.-> ReindexRE
BareTable -.uses.-> BareTableRE
Trigger -.uses.-> TimingRE
Trigger -.uses.-> ForEachBeginRE
Trigger -.uses.-> ForEachWhenRE
Implementation Architecture
The preprocessing system is implemented using a combination of regex transformations and string manipulation. The following diagram shows the key components:
Diagram: Preprocessing Implementation Components
Sources: llkv-sql/src/sql_engine.rs:640-842 llkv-sql/src/sql_engine.rs:992-1001
Regex Pattern Management
All regex patterns are stored in OnceLock static variables for thread-safe lazy initialization. This ensures patterns are compiled once per process and reused across all preprocessing operations, avoiding the overhead of repeated compilation.
Pattern Initialization Example:
The patterns use case-insensitive matching ((?i)) and word boundaries (\b) to avoid false matches within identifiers or string literals.
Sources: llkv-sql/src/sql_engine.rs:640-650 llkv-sql/src/sql_engine.rs:661-669 llkv-sql/src/sql_engine.rs:682-686
Preprocessing Order
The order of transformations matters because later transformations may depend on earlier ones. The current order:
- TPC-H CONNECT removal - Must happen first to remove non-SQL directives
- CREATE TYPE conversion - Normalizes DDL before other transformations
- EXCLUDE syntax - Handles qualified names in projection lists
- Trailing comma removal - Fixes VALUES clause syntax
- Bare table IN expansion - Converts shorthand to subqueries before empty IN check
- Empty IN handling - Must come after bare table expansion to avoid conflicts
- Index hint stripping - Removes query hints from FROM clauses
- REINDEX conversion - Must be last to avoid interfering with VACUUM statements
Sources: llkv-sql/src/sql_engine.rs:992-1001
Parser Integration
The preprocessed SQL is passed to sqlparser with a custom recursion limit to handle deeply nested queries from test suites:
The default sqlparser recursion limit (50) is insufficient for some SQLite test suite queries, so LLKV uses 200 to balance compatibility with stack safety.
Sources: llkv-sql/src/sql_engine.rs:318-324
Testing and Validation
The preprocessing transformations are validated through:
- SQL Logic Tests (SLT) - The llkv-slt-tester runs thousands of SQLite test cases that exercise various dialect features
- TPC-H Benchmarks - The llkv-tpch crate verifies compatibility with TPC-H SQL scripts
- Unit Tests - Individual preprocessor functions are tested in isolation
The preprocessing system is designed to be conservative: it only transforms patterns that are known to cause parser errors, and it preserves the original SQL semantics whenever possible.
Sources: llkv-sql/src/sql_engine.rs:623-1001 llkv-sql/Cargo.toml:1-34
Future Considerations
The preprocessing approach is a pragmatic solution that enables broad dialect compatibility without modifying sqlparser. However, it has limitations:
- Fragile regex patterns - Complex transformations like trigger shorthand expansion use intricate regex that may not handle all edge cases
- Limited context awareness - String-based transformations cannot distinguish between SQL keywords and string literals containing those keywords
- Maintenance burden - Each new dialect feature requires a new preprocessor
The long-term solution is to contribute dialect-specific parsing improvements back to sqlparser, eliminating the need for preprocessing. The trigger shorthand transformation includes a TODO comment noting that proper SQLite dialect support in sqlparser would eliminate that preprocessor entirely.
Sources: llkv-sql/src/sql_engine.rs:765-770