Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

GitHub

This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

SQL Preprocessing and Dialect Handling

Loading…

SQL Preprocessing and Dialect Handling

Relevant source files

This document describes the SQL preprocessing layer that normalizes SQL syntax from multiple dialects (SQLite, DuckDB, PostgreSQL) before parsing. The preprocessor transforms dialect-specific syntax into forms that the sqlparser-rs library can parse, enabling broad compatibility across different SQL variants.

For information about the overall SQL execution flow, see SqlEngine API. For details on how SQL parameters are bound to prepared statements, see Plan Structures.


Purpose and Architecture

The SQL preprocessing system transforms SQL text through a series of regex-based rewriting rules before passing it to sqlparser-rs. This architecture allows LLKV to accept SQL from various dialects without forking the parser library or implementing a custom parser.

Key components :

  • SqlEngine::preprocess_sql_input() - main preprocessing orchestrator
  • Individual preprocessing functions for each dialect feature
  • Thread-local parameter state tracking
  • Regex-based pattern matching and replacement
flowchart LR
    RawSQL["Raw SQL Text\n(SQLite/DuckDB/PostgreSQL)"]
Preprocess["preprocess_sql_input()"]
subgraph "Preprocessing Steps"
        TPCH["TPC-H CONNECT removal"]
CreateType["CREATE TYPE → DOMAIN"]
Exclude["EXCLUDE qualifier handling"]
Trailing["Trailing comma removal"]
BareTable["Bare table IN conversion"]
EmptyIn["Empty IN list handling"]
IndexHints["Index hint removal"]
Reindex["REINDEX normalization"]
Trigger["Trigger shorthand expansion"]
end
    
    Parser["sqlparser::Parser"]
AST["Statement AST"]
RawSQL --> Preprocess
 
   Preprocess --> TPCH
 
   TPCH --> CreateType
 
   CreateType --> Exclude
 
   Exclude --> Trailing
 
   Trailing --> BareTable
 
   BareTable --> EmptyIn
 
   EmptyIn --> IndexHints
 
   IndexHints --> Reindex
 
   Reindex --> Trigger
 
   Trigger --> Parser
 
   Parser --> AST

The preprocessing layer sits between raw SQL input and the generic SQL parser:

Sources : llkv-sql/src/sql_engine.rs:1116-1125


Preprocessing Pipeline

The preprocess_sql_input() function applies transformations in a specific order to handle interdependencies between rules:

Sources : llkv-sql/src/sql_engine.rs:1116-1125

flowchart TD
    Input["SQL Input String"]
Step1["preprocess_tpch_connect_syntax()"]
Step2["preprocess_create_type_syntax()"]
Step3["preprocess_exclude_syntax()"]
Step4["preprocess_trailing_commas_in_values()"]
Step5["preprocess_bare_table_in_clauses()"]
Step6["preprocess_empty_in_lists()"]
Step7["preprocess_index_hints()"]
Step8["preprocess_reindex_syntax()"]
Output["Normalized SQL"]
Input --> Step1
 
   Step1 --> Step2
 
   Step2 --> Step3
 
   Step3 --> Step4
 
   Step4 --> Step5
 
   Step5 --> Step6
 
   Step6 --> Step7
 
   Step7 --> Step8
 
   Step8 --> Output
    
    Note1["Removes TPC-H CONNECT statements"]
Note2["Converts DuckDB type aliases"]
Note3["Quotes qualified identifiers"]
Note4["Removes trailing commas in VALUES"]
Note5["Wraps bare tables in subqueries"]
Note6["Converts empty IN to constant predicates"]
Note7["Strips SQLite index hints"]
Note8["Converts standalone REINDEX"]
Step1 -.-> Note1
 
   Step2 -.-> Note2
 
   Step3 -.-> Note3
 
   Step4 -.-> Note4
 
   Step5 -.-> Note5
 
   Step6 -.-> Note6
 
   Step7 -.-> Note7
 
   Step8 -.-> Note8

Dialect-Specific Transformations

TPC-H CONNECT Statement Removal

TPC-H benchmark scripts include CONNECT TO database; directives that are no-ops in LLKV (single database system). The preprocessor strips these statements entirely.

Implementation : llkv-sql/src/sql_engine.rs:759-766

Pattern : CONNECT TO <identifier>;

Action : Remove statement


CREATE TYPE → CREATE DOMAIN Conversion

DuckDB uses CREATE TYPE name AS basetype for type aliases, but sqlparser-rs only supports the SQL standard CREATE DOMAIN syntax. The preprocessor performs bidirectional conversion:

CREATE TYPE myint AS INTEGER  →  CREATE DOMAIN myint AS INTEGER
DROP TYPE myint               →  DROP DOMAIN myint

Regex patterns :

  • CREATE TYPECREATE DOMAIN
  • DROP TYPEDROP DOMAIN

Implementation : llkv-sql/src/sql_engine.rs:775-793

Sources : llkv-sql/src/sql_engine.rs:775-793


EXCLUDE Clause Qualified Name Handling

DuckDB allows qualified identifiers in EXCLUDE clauses: SELECT * EXCLUDE (schema.table.col). The preprocessor wraps these in double quotes for parser compatibility:

EXCLUDE (schema.table.col)  →  EXCLUDE ("schema.table.col")

Pattern : EXCLUDE\s*\(\s*([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)+)\s*\)

Implementation : llkv-sql/src/sql_engine.rs:795-812

Sources : llkv-sql/src/sql_engine.rs:795-812


Trailing Comma Removal in VALUES Clauses

DuckDB permits trailing commas in tuple literals: VALUES ('v2',). The preprocessor removes these for parser compatibility:

VALUES (1, 2,)  →  VALUES (1, 2)

Pattern : ,(\s*)\)

Replacement : $1)

Implementation : llkv-sql/src/sql_engine.rs:814-825

Sources : llkv-sql/src/sql_engine.rs:814-825


Empty IN List Handling

SQLite allows degenerate forms expr IN () and expr NOT IN () which sqlparser-rs rejects. The preprocessor converts these to constant boolean expressions while preserving expression evaluation (for potential side effects):

expr IN ()      →  (expr = NULL AND 0 = 1)  -- always false
expr NOT IN ()  →  (expr = NULL OR 1 = 1)   -- always true

Regex : Matches parenthesized expressions, quoted strings, hex literals, identifiers, or numbers followed by [NOT] IN ()

Implementation : llkv-sql/src/sql_engine.rs:827-856

Sources : llkv-sql/src/sql_engine.rs:827-856


SQLite Index Hint Removal

SQLite query optimizer hints (INDEXED BY index_name, NOT INDEXED) are stripped since LLKV makes its own index selection decisions:

FROM table INDEXED BY idx_name  →  FROM table
FROM table NOT INDEXED           →  FROM table

Pattern : \s+(INDEXED\s+BY\s+[a-zA-Z_][a-zA-Z0-9_]*|NOT\s+INDEXED)\b

Implementation : llkv-sql/src/sql_engine.rs:858-875

Sources : llkv-sql/src/sql_engine.rs:858-875


REINDEX Syntax Normalization

SQLite supports standalone REINDEX index_name statements, but sqlparser-rs only recognizes REINDEX within VACUUM statements. The preprocessor converts:

REINDEX my_index  →  VACUUM REINDEX my_index

Pattern : \bREINDEX\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)\b

Implementation : llkv-sql/src/sql_engine.rs:877-893

Sources : llkv-sql/src/sql_engine.rs:877-893


SQLite Trigger Shorthand Expansion

SQLite allows omitting trigger timing (BEFORE/AFTER, defaults to AFTER) and the FOR EACH ROW clause (defaults to row-level triggers). The sqlparser-rs library requires these to be explicit, so the preprocessor injects them when missing.

Transformation example :

Regex approach :

  1. Detect missing timing keyword and inject AFTER
  2. Detect missing FOR EACH ROW before BEGIN or WHEN and inject it

Implementation : llkv-sql/src/sql_engine.rs:895-978

Note : This is marked as a temporary workaround. The proper fix would be extending sqlparser-rs’s SQLiteDialect::parse_statement to handle these optional clauses.

Sources : llkv-sql/src/sql_engine.rs:895-978


Bare Table Names in IN Clauses

SQLite allows expr IN tablename as shorthand for expr IN (SELECT * FROM tablename). The preprocessor converts this to the explicit subquery form:

col IN users  →  col IN (SELECT * FROM users)

Pattern : \b(NOT\s+)?IN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)(\s|$|;|,|\))

Avoids : Already-parenthesized expressions (IN (...))

Implementation : llkv-sql/src/sql_engine.rs:980-1009

Sources : llkv-sql/src/sql_engine.rs:980-1009


SQL Parameter System

The parameter system enables prepared statements with placeholder binding. Parameters can use various syntaxes across different SQL dialects:

SyntaxExampleDescription
?WHERE id = ?Positional, auto-incremented
?NWHERE id = ?1Positional, explicit index
$NWHERE id = $1PostgreSQL-style positional
:nameWHERE id = :user_idNamed parameter

Parameter State Management

Thread-local state tracks parameter registration during statement preparation:

Key functions :

Sources : llkv-sql/src/sql_engine.rs:78-282

flowchart TD
    Input["Raw Parameter String"]
CheckAuto{"Is '?' ?"}
IncrementAuto["Increment next_auto\nReturn new index"]
CheckCache{"Already registered?"}
ReturnCached["Return cached index"]
ParseType{"Parameter type"}
ParseNumeric["Parse numeric index\n(?N or $N)"]
AssignNamed["Assign max_index + 1\n(:name)"]
UpdateCache["Store in assigned map\nUpdate max_index"]
Return["Return index"]
Input --> CheckAuto
    CheckAuto --
 Yes --> IncrementAuto
    CheckAuto --
 No --> CheckCache
    CheckCache --
 Yes --> ReturnCached
    CheckCache --
 No --> ParseType
    
    ParseType -- "?N or $N" --> ParseNumeric
    ParseType -- ":name" --> AssignNamed
    
 
   ParseNumeric --> UpdateCache
 
   AssignNamed --> UpdateCache
 
   UpdateCache --> Return
 
   IncrementAuto --> Return

Parameter Index Assignment

The ParameterState::register() method normalizes different parameter syntaxes to unified indices:

Sources : llkv-sql/src/sql_engine.rs:94-133

Parameter Sentinel System

During plan building, parameters are represented as sentinel strings that can be recognized and replaced during execution:

Sentinel FormatExampleUsage
__llkv_param__N____llkv_param__1__Internal representation in plan

Functions :

Sources : llkv-sql/src/sql_engine.rs:78-282


Integration with SqlEngine Execution

The preprocessing pipeline integrates with the main SQL execution flow:

Trigger retry logic : If initial parsing fails and the SQL contains CREATE TRIGGER, the engine applies the trigger shorthand preprocessing and retries. This is a fallback for cases where the initial preprocessing pipeline doesn’t catch all trigger variations.

Sources : llkv-sql/src/sql_engine.rs:1057-1083


Recursion Limit Configuration

The parser recursion limit is set higher than sqlparser-rs’s default to accommodate deeply nested SQL expressions common in test suites:

This prevents stack overflows while still protecting against pathological inputs.

Sources : llkv-sql/src/sql_engine.rs:393-400


Design Rationale

Why Regex-Based Preprocessing?

The preprocessing approach uses regex pattern matching rather than extending the parser for several reasons:

  1. Non-invasive : Does not require forking or patching sqlparser-rs
  2. Composable : Multiple transformations can be chained independently
  3. Maintainable : Each dialect feature is isolated in its own function
  4. Sufficient : Handles syntactic transformations without semantic analysis

Trade-offs

ApproachAdvantagesDisadvantages
Regex preprocessingSimple, composable, no parser changesFragile to edge cases, string-based
Parser extensionRobust, type-safeRequires maintaining parser fork
Custom parserFull controlHigh maintenance burden

The current regex-based approach is pragmatic for the current set of dialect differences. If the number of transformations grows significantly, a proper parser extension may become necessary.

Sources : llkv-sql/src/sql_engine.rs:759-1009


Summary

The SQL preprocessing layer provides broad dialect compatibility through a pipeline of targeted string transformations:

  • TPC-H compatibility : Strips CONNECT statements
  • DuckDB compatibility : Type aliases, trailing commas, qualified EXCLUDE
  • SQLite compatibility : Trigger shorthand, empty IN lists, index hints, bare table IN, REINDEX
  • PostgreSQL compatibility : $N parameter syntax

The system is extensible—new dialect features can be added as additional preprocessing functions without disrupting existing transformations.

Sources : llkv-sql/src/sql_engine.rs:759-1125

Dismiss

Refresh this wiki

Enter email to refresh