This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
SQL Preprocessing and Dialect Handling
Loading…
SQL Preprocessing and Dialect Handling
Relevant source files
This document describes the SQL preprocessing layer that normalizes SQL syntax from multiple dialects (SQLite, DuckDB, PostgreSQL) before parsing. The preprocessor transforms dialect-specific syntax into forms that the sqlparser-rs library can parse, enabling broad compatibility across different SQL variants.
For information about the overall SQL execution flow, see SqlEngine API. For details on how SQL parameters are bound to prepared statements, see Plan Structures.
Purpose and Architecture
The SQL preprocessing system transforms SQL text through a series of regex-based rewriting rules before passing it to sqlparser-rs. This architecture allows LLKV to accept SQL from various dialects without forking the parser library or implementing a custom parser.
Key components :
SqlEngine::preprocess_sql_input()- main preprocessing orchestrator- Individual preprocessing functions for each dialect feature
- Thread-local parameter state tracking
- Regex-based pattern matching and replacement
flowchart LR
RawSQL["Raw SQL Text\n(SQLite/DuckDB/PostgreSQL)"]
Preprocess["preprocess_sql_input()"]
subgraph "Preprocessing Steps"
TPCH["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → DOMAIN"]
Exclude["EXCLUDE qualifier handling"]
Trailing["Trailing comma removal"]
BareTable["Bare table IN conversion"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint removal"]
Reindex["REINDEX normalization"]
Trigger["Trigger shorthand expansion"]
end
Parser["sqlparser::Parser"]
AST["Statement AST"]
RawSQL --> Preprocess
Preprocess --> TPCH
TPCH --> CreateType
CreateType --> Exclude
Exclude --> Trailing
Trailing --> BareTable
BareTable --> EmptyIn
EmptyIn --> IndexHints
IndexHints --> Reindex
Reindex --> Trigger
Trigger --> Parser
Parser --> AST
The preprocessing layer sits between raw SQL input and the generic SQL parser:
Sources : llkv-sql/src/sql_engine.rs:1116-1125
Preprocessing Pipeline
The preprocess_sql_input() function applies transformations in a specific order to handle interdependencies between rules:
Sources : llkv-sql/src/sql_engine.rs:1116-1125
flowchart TD
Input["SQL Input String"]
Step1["preprocess_tpch_connect_syntax()"]
Step2["preprocess_create_type_syntax()"]
Step3["preprocess_exclude_syntax()"]
Step4["preprocess_trailing_commas_in_values()"]
Step5["preprocess_bare_table_in_clauses()"]
Step6["preprocess_empty_in_lists()"]
Step7["preprocess_index_hints()"]
Step8["preprocess_reindex_syntax()"]
Output["Normalized SQL"]
Input --> Step1
Step1 --> Step2
Step2 --> Step3
Step3 --> Step4
Step4 --> Step5
Step5 --> Step6
Step6 --> Step7
Step7 --> Step8
Step8 --> Output
Note1["Removes TPC-H CONNECT statements"]
Note2["Converts DuckDB type aliases"]
Note3["Quotes qualified identifiers"]
Note4["Removes trailing commas in VALUES"]
Note5["Wraps bare tables in subqueries"]
Note6["Converts empty IN to constant predicates"]
Note7["Strips SQLite index hints"]
Note8["Converts standalone REINDEX"]
Step1 -.-> Note1
Step2 -.-> Note2
Step3 -.-> Note3
Step4 -.-> Note4
Step5 -.-> Note5
Step6 -.-> Note6
Step7 -.-> Note7
Step8 -.-> Note8
Dialect-Specific Transformations
TPC-H CONNECT Statement Removal
TPC-H benchmark scripts include CONNECT TO database; directives that are no-ops in LLKV (single database system). The preprocessor strips these statements entirely.
Implementation : llkv-sql/src/sql_engine.rs:759-766
Pattern : CONNECT TO <identifier>;
Action : Remove statement
CREATE TYPE → CREATE DOMAIN Conversion
DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser-rs only supports the SQL standard CREATE DOMAIN syntax. The preprocessor performs bidirectional conversion:
CREATE TYPE myint AS INTEGER → CREATE DOMAIN myint AS INTEGER
DROP TYPE myint → DROP DOMAIN myint
Regex patterns :
CREATE TYPE→CREATE DOMAINDROP TYPE→DROP DOMAIN
Implementation : llkv-sql/src/sql_engine.rs:775-793
Sources : llkv-sql/src/sql_engine.rs:775-793
EXCLUDE Clause Qualified Name Handling
DuckDB allows qualified identifiers in EXCLUDE clauses: SELECT * EXCLUDE (schema.table.col). The preprocessor wraps these in double quotes for parser compatibility:
EXCLUDE (schema.table.col) → EXCLUDE ("schema.table.col")
Pattern : EXCLUDE\s*\(\s*([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+)\s*\)
Implementation : llkv-sql/src/sql_engine.rs:795-812
Sources : llkv-sql/src/sql_engine.rs:795-812
Trailing Comma Removal in VALUES Clauses
DuckDB permits trailing commas in tuple literals: VALUES ('v2',). The preprocessor removes these for parser compatibility:
VALUES (1, 2,) → VALUES (1, 2)
Pattern : ,(\s*)\)
Replacement : $1)
Implementation : llkv-sql/src/sql_engine.rs:814-825
Sources : llkv-sql/src/sql_engine.rs:814-825
Empty IN List Handling
SQLite allows degenerate forms expr IN () and expr NOT IN () which sqlparser-rs rejects. The preprocessor converts these to constant boolean expressions while preserving expression evaluation (for potential side effects):
expr IN () → (expr = NULL AND 0 = 1) -- always false
expr NOT IN () → (expr = NULL OR 1 = 1) -- always true
Regex : Matches parenthesized expressions, quoted strings, hex literals, identifiers, or numbers followed by [NOT] IN ()
Implementation : llkv-sql/src/sql_engine.rs:827-856
Sources : llkv-sql/src/sql_engine.rs:827-856
SQLite Index Hint Removal
SQLite query optimizer hints (INDEXED BY index_name, NOT INDEXED) are stripped since LLKV makes its own index selection decisions:
FROM table INDEXED BY idx_name → FROM table
FROM table NOT INDEXED → FROM table
Pattern : \s+(INDEXED\s+BY\s+[a-zA-Z_][a-zA-Z0-9_]*|NOT\s+INDEXED)\b
Implementation : llkv-sql/src/sql_engine.rs:858-875
Sources : llkv-sql/src/sql_engine.rs:858-875
REINDEX Syntax Normalization
SQLite supports standalone REINDEX index_name statements, but sqlparser-rs only recognizes REINDEX within VACUUM statements. The preprocessor converts:
REINDEX my_index → VACUUM REINDEX my_index
Pattern : \bREINDEX\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)\b
Implementation : llkv-sql/src/sql_engine.rs:877-893
Sources : llkv-sql/src/sql_engine.rs:877-893
SQLite Trigger Shorthand Expansion
SQLite allows omitting trigger timing (BEFORE/AFTER, defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). The sqlparser-rs library requires these to be explicit, so the preprocessor injects them when missing.
Transformation example :
Regex approach :
- Detect missing timing keyword and inject
AFTER - Detect missing
FOR EACH ROWbeforeBEGINorWHENand inject it
Implementation : llkv-sql/src/sql_engine.rs:895-978
Note : This is marked as a temporary workaround. The proper fix would be extending sqlparser-rs’s SQLiteDialect::parse_statement to handle these optional clauses.
Sources : llkv-sql/src/sql_engine.rs:895-978
Bare Table Names in IN Clauses
SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor converts this to the explicit subquery form:
col IN users → col IN (SELECT * FROM users)
Pattern : \b(NOT\s+)?IN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)(\s|$|;|,|\))
Avoids : Already-parenthesized expressions (IN (...))
Implementation : llkv-sql/src/sql_engine.rs:980-1009
Sources : llkv-sql/src/sql_engine.rs:980-1009
SQL Parameter System
The parameter system enables prepared statements with placeholder binding. Parameters can use various syntaxes across different SQL dialects:
| Syntax | Example | Description |
|---|---|---|
? | WHERE id = ? | Positional, auto-incremented |
?N | WHERE id = ?1 | Positional, explicit index |
$N | WHERE id = $1 | PostgreSQL-style positional |
:name | WHERE id = :user_id | Named parameter |
Parameter State Management
Thread-local state tracks parameter registration during statement preparation:
Key functions :
ParameterScope::new()- initializes thread-local state llkv-sql/src/sql_engine.rs:228-237register_placeholder(raw: &str)- assigns indices to parameters llkv-sql/src/sql_engine.rs:258-268placeholder_marker(index: usize)- generates internal sentinel llkv-sql/src/sql_engine.rs:270-272
Sources : llkv-sql/src/sql_engine.rs:78-282
flowchart TD
Input["Raw Parameter String"]
CheckAuto{"Is '?' ?"}
IncrementAuto["Increment next_auto\nReturn new index"]
CheckCache{"Already registered?"}
ReturnCached["Return cached index"]
ParseType{"Parameter type"}
ParseNumeric["Parse numeric index\n(?N or $N)"]
AssignNamed["Assign max_index + 1\n(:name)"]
UpdateCache["Store in assigned map\nUpdate max_index"]
Return["Return index"]
Input --> CheckAuto
CheckAuto --
Yes --> IncrementAuto
CheckAuto --
No --> CheckCache
CheckCache --
Yes --> ReturnCached
CheckCache --
No --> ParseType
ParseType -- "?N or $N" --> ParseNumeric
ParseType -- ":name" --> AssignNamed
ParseNumeric --> UpdateCache
AssignNamed --> UpdateCache
UpdateCache --> Return
IncrementAuto --> Return
Parameter Index Assignment
The ParameterState::register() method normalizes different parameter syntaxes to unified indices:
Sources : llkv-sql/src/sql_engine.rs:94-133
Parameter Sentinel System
During plan building, parameters are represented as sentinel strings that can be recognized and replaced during execution:
| Sentinel Format | Example | Usage |
|---|---|---|
__llkv_param__N__ | __llkv_param__1__ | Internal representation in plan |
Functions :
placeholder_marker(index)- generates sentinel llkv-sql/src/sql_engine.rs:270-272literal_placeholder(index)- wraps inLiteral::Stringllkv-sql/src/sql_engine.rs:274-276parse_placeholder_marker(text)- extracts index from sentinel llkv-sql/src/sql_engine.rs:278-282
Sources : llkv-sql/src/sql_engine.rs:78-282
Integration with SqlEngine Execution
The preprocessing pipeline integrates with the main SQL execution flow:
Trigger retry logic : If initial parsing fails and the SQL contains CREATE TRIGGER, the engine applies the trigger shorthand preprocessing and retries. This is a fallback for cases where the initial preprocessing pipeline doesn’t catch all trigger variations.
Sources : llkv-sql/src/sql_engine.rs:1057-1083
Recursion Limit Configuration
The parser recursion limit is set higher than sqlparser-rs’s default to accommodate deeply nested SQL expressions common in test suites:
This prevents stack overflows while still protecting against pathological inputs.
Sources : llkv-sql/src/sql_engine.rs:393-400
Design Rationale
Why Regex-Based Preprocessing?
The preprocessing approach uses regex pattern matching rather than extending the parser for several reasons:
- Non-invasive : Does not require forking or patching
sqlparser-rs - Composable : Multiple transformations can be chained independently
- Maintainable : Each dialect feature is isolated in its own function
- Sufficient : Handles syntactic transformations without semantic analysis
Trade-offs
| Approach | Advantages | Disadvantages |
|---|---|---|
| Regex preprocessing | Simple, composable, no parser changes | Fragile to edge cases, string-based |
| Parser extension | Robust, type-safe | Requires maintaining parser fork |
| Custom parser | Full control | High maintenance burden |
The current regex-based approach is pragmatic for the current set of dialect differences. If the number of transformations grows significantly, a proper parser extension may become necessary.
Sources : llkv-sql/src/sql_engine.rs:759-1009
Summary
The SQL preprocessing layer provides broad dialect compatibility through a pipeline of targeted string transformations:
- TPC-H compatibility : Strips
CONNECTstatements - DuckDB compatibility : Type aliases, trailing commas, qualified EXCLUDE
- SQLite compatibility : Trigger shorthand, empty IN lists, index hints, bare table IN, REINDEX
- PostgreSQL compatibility :
$Nparameter syntax
The system is extensible—new dialect features can be added as additional preprocessing functions without disrupting existing transformations.
Sources : llkv-sql/src/sql_engine.rs:759-1125
Dismiss
Refresh this wiki
Enter email to refresh