# index.html.md # postgast [GitHub](https://github.com/eddieland/postgast) | [PyPI](https://pypi.org/project/postgast/) BSD-licensed Python bindings to [libpg_query](https://github.com/pganalyze/libpg_query), the PostgreSQL parser extracted as a standalone C library. Parse, deparse, normalize, fingerprint, split, and scan PostgreSQL SQL statements from Python with a minimal dependency footprint — just `protobuf` and the vendored C library. ```python import postgast # Parse a query into an AST tree = postgast.parse("SELECT id, name FROM users WHERE active = true") # Deparse an AST back to SQL sql = postgast.deparse(tree) # Normalize a query (replace constants with placeholders) postgast.normalize("SELECT * FROM users WHERE id = 42") # => "SELECT * FROM users WHERE id = $1" # Fingerprint a query fp = postgast.fingerprint("SELECT * FROM users WHERE id = 42") # Split a multi-statement string postgast.split("SELECT 1; SELECT 2;") # => ["SELECT 1", "SELECT 2"] ``` ## Installation ```bash pip install postgast ``` ## LLM-Friendly Docs Machine-readable documentation is available for AI coding assistants: - [llms.txt](/llms.txt) — concise overview - [llms-full.txt](/llms-full.txt) — complete documentation in a single file These files follow the [llms.txt standard](https://llmstxt.org/). ## Contents * [User Guide](guide.md) * [Parsing](guide.md#parsing) * [Deparsing](guide.md#deparsing) * [Normalization](guide.md#normalization) * [Fingerprinting](guide.md#fingerprinting) * [Splitting](guide.md#splitting) * [Scanning](guide.md#scanning) * [Tree Walking](guide.md#tree-walking) * [AST Helpers](guide.md#ast-helpers) * [DDL Helpers](guide.md#ddl-helpers) * [Error Handling](guide.md#error-handling) * [Common Usage Patterns](examples.md) * [Query Analysis](examples.md#query-analysis) * [Query Monitoring](examples.md#query-monitoring) * [SQL Formatting](examples.md#sql-formatting) * [Batch Processing](examples.md#batch-processing) * [DDL Tooling](examples.md#ddl-tooling) * [Tree Walking](examples.md#tree-walking) * [Working with the Protobuf AST](examples.md#working-with-the-protobuf-ast) * [PL/pgSQL Parsing](examples.md#pl-pgsql-parsing) * [Error Handling](examples.md#error-handling) * [Tokenization](examples.md#tokenization) * [API Reference](api.md) * [Core Functions](api.md#core-functions) * [Tree Walking](api.md#tree-walking) * [AST Helpers](api.md#ast-helpers) * [DDL Helpers](api.md#ddl-helpers) * [Types](api.md#types) * [Exceptions](api.md#exceptions) * [Architecture](architecture.md) * [Why ctypes (not Cython, Rust, or C extensions)](architecture.md#why-ctypes-not-cython-rust-or-c-extensions) * [How the binding layer works](architecture.md#how-the-binding-layer-works) * [Alternatives considered](architecture.md#alternatives-considered) * [Upgrade Policy](upgrade-policy.md) * [Python](upgrade-policy.md#python) * [PostgreSQL (libpg_query)](upgrade-policy.md#postgresql-libpg-query) * [Versioning](upgrade-policy.md#versioning) * [What this means in practice](upgrade-policy.md#what-this-means-in-practice) * [Contributing](contributing.md) * [Quick Start](contributing.md#quick-start) * [Development Commands](contributing.md#development-commands) # api.html.md # API Reference ## Core Functions ### postgast.parse(query) Parse a SQL query into a protobuf AST. Calls libpg_query’s `pg_query_parse_protobuf` to parse the query and returns the deserialized `ParseResult` protobuf message containing the abstract syntax tree. * **Parameters:** **query** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A SQL query string. * **Return type:** `ParseResult` * **Returns:** A `ParseResult` protobuf message with `version` (int) and `stmts` (list of `RawStmt`) fields. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If the query contains a syntax error. ### Example ```pycon >>> tree = parse("SELECT id, name FROM users WHERE active = true") >>> len(tree.stmts) 1 >>> tree.stmts[0].stmt.HasField("select_stmt") True ``` ### postgast.deparse(tree) Convert a protobuf parse tree back into a SQL string. Calls libpg_query’s `pg_query_deparse_protobuf` to convert a `ParseResult` AST back into SQL text. This is the inverse of [`postgast.parse()`](#postgast.parse). #### NOTE The deparsed SQL is canonicalized by libpg_query and may differ from the original query in whitespace, casing, or parenthesization while remaining semantically equivalent. * **Parameters:** **tree** (`ParseResult` | `AstNode`) – A `ParseResult` protobuf message (as returned by [`postgast.parse()`](#postgast.parse)), or a typed `AstNode` wrapper. * **Return type:** [`str`](https://docs.python.org/3/library/stdtypes.html#str) * **Returns:** The deparsed SQL string. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If the parse tree cannot be deparsed. ### Example ```pycon >>> from postgast import parse, deparse >>> tree = parse("SELECT id FROM users") >>> deparse(tree) 'SELECT id FROM users' ``` ### postgast.normalize(query) Normalize a SQL query by replacing literal constants with placeholders. Calls libpg_query’s `pg_query_normalize` to replace literal values (strings, numbers, etc.) with parameter placeholders (`$1`, `$2`, …). This is useful for grouping structurally equivalent queries. * **Parameters:** **query** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A SQL query string. * **Return type:** [`str`](https://docs.python.org/3/library/stdtypes.html#str) * **Returns:** The normalized query with constants replaced by positional placeholders. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If the query cannot be parsed. ### Example ```pycon >>> normalize("SELECT * FROM users WHERE id = 42 AND name = 'Alice'") 'SELECT * FROM users WHERE id = $1 AND name = $2' ``` ### postgast.fingerprint(query) Compute a structural fingerprint of a SQL query. Calls libpg_query’s `pg_query_fingerprint` to produce a hash that identifies structurally equivalent queries regardless of literal values. * **Parameters:** **query** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A SQL query string. * **Return type:** [`FingerprintResult`](#postgast.FingerprintResult) * **Returns:** A `FingerprintResult` containing the numeric fingerprint and its hex string representation. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If the query cannot be parsed. ### Example ```pycon >>> result = fingerprint("SELECT * FROM users WHERE id = 1") >>> result.hex '0ca858a0484f5826' >>> result == fingerprint("SELECT * FROM users WHERE id = 2") True ``` ### postgast.split(sql, , method='parser') Split a multi-statement SQL string into individual statements. Calls the selected libpg_query split function to split the input into individual SQL statements. The `"parser"` method (default) uses the full PostgreSQL parser for improved accuracy, while `"scanner"` uses a faster scanner-based approach that tolerates invalid SQL. * **Parameters:** * **sql** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A SQL string potentially containing multiple statements. * **method** ([`Literal`](https://docs.python.org/3/library/typing.html#typing.Literal)[`'scanner'`, `'parser'`]) – Which libpg_query splitter to use. `"parser"` (default) calls `pg_query_split_with_parser` for improved accuracy on valid SQL. `"scanner"` calls `pg_query_split_with_scanner`, which tolerates malformed SQL but may miss some edge cases. * **Return type:** [`list`](https://docs.python.org/3/library/stdtypes.html#list)[[`str`](https://docs.python.org/3/library/stdtypes.html#str)] * **Returns:** A list of individual SQL statement strings. * **Raises:** * [**PgQueryError**](#postgast.PgQueryError) – If the SQL causes a parse/scanner error. * [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If *method* is not `"scanner"` or `"parser"`. ### Example ```pycon >>> split("SELECT 1; SELECT 2") ['SELECT 1', ' SELECT 2'] >>> split("SELECT 'hello;world'") ["SELECT 'hello;world'"] ``` ### postgast.scan(sql) Tokenize a SQL string into a sequence of scan tokens. Calls libpg_query’s `pg_query_scan` to tokenize the input and returns the deserialized `ScanResult` protobuf message containing a list of `ScanToken` objects with token type, keyword kind, and byte positions. * **Parameters:** **sql** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A SQL string to tokenize. * **Return type:** `ScanResult` * **Returns:** A `ScanResult` protobuf message with `version` (int) and `tokens` (list of `ScanToken`) fields. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If the input contains a scan error (e.g., unterminated string literal). ### Example ```pycon >>> result = scan("SELECT 1") >>> len(result.tokens) 2 >>> result.tokens[0].start, result.tokens[0].end (0, 6) ``` ## Tree Walking ### postgast.walk(node) Depth-first pre-order traversal of a protobuf message tree. Yields `(field_name, message)` tuples for every protobuf message encountered. The *field_name* is the protobuf field name that led to the message (e.g. `"where_clause"`, `"target_list"`), or an empty string for the root. `Node` oneof wrappers are transparently unwrapped so that only concrete message types appear in the output. * **Parameters:** **node** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` instance (`ParseResult`, `SelectStmt`, etc.). * **Yields:** `(field_name, message)` tuples in depth-first pre-order. * **Return type:** [`Generator`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Generator)[[`tuple`](https://docs.python.org/3/library/stdtypes.html#tuple)[[`str`](https://docs.python.org/3/library/stdtypes.html#str), [`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)], [`None`](https://docs.python.org/3/library/constants.html#None), [`None`](https://docs.python.org/3/library/constants.html#None)] ### Example ```pycon >>> from postgast import parse, walk >>> tree = parse("SELECT 1") >>> for field_name, node in walk(tree): ... if field_name: ... print(f"{field_name}: {type(node).__name__}") stmts: RawStmt stmt: SelectStmt target_list: ResTarget val: A_Const ival: Integer ``` ### *class* postgast.Visitor Base class for protobuf parse tree visitors. Subclass and override `visit_` methods (e.g. `visit_SelectStmt`, `visit_ColumnRef`) to handle specific node types. Unhandled types fall through to [`generic_visit()`](#postgast.Visitor.generic_visit) which recurses into children. Call [`visit()`](#postgast.Visitor.visit) on a root message to start traversal: ```default class TableCollector(Visitor): def __init__(self): self.tables = [] def visit_RangeVar(self, node): self.tables.append(node.relname) collector = TableCollector() collector.visit(parse_result) ``` #### visit(node) Dispatch *node* to `visit_` or [`generic_visit()`](#postgast.Visitor.generic_visit). Looks up a method named `visit_` (where `` matches the protobuf descriptor name, e.g. `visit_SelectStmt`). Falls back to [`generic_visit()`](#postgast.Visitor.generic_visit) if no specific handler exists. * **Parameters:** **node** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` instance. * **Return type:** [`None`](https://docs.python.org/3/library/constants.html#None) #### generic_visit(node) Visit all message-typed children of *node*. Override this method to customize the default traversal behavior. Call `super().generic_visit(node)` from a `visit_*` handler to continue recursion into a node’s children after custom processing. * **Parameters:** **node** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` instance. * **Return type:** [`None`](https://docs.python.org/3/library/constants.html#None) ## AST Helpers ### postgast.find_nodes(tree, node_type) Yield all protobuf messages matching *node_type* from a parse tree. Walks the tree in depth-first pre-order (same as [`walk()`](#postgast.walk)) and yields every message that is an instance of *node_type*. * **Parameters:** * **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **node_type** ([`type`](https://docs.python.org/3/library/functions.html#type)[[`TypeVar`](https://docs.python.org/3/library/typing.html#typing.TypeVar)(`_M`, bound= [`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message))]) – Protobuf message class to match (e.g., `RangeVar`). * **Yields:** Matching instances in depth-first pre-order. * **Return type:** [`Generator`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Generator)[[`TypeVar`](https://docs.python.org/3/library/typing.html#typing.TypeVar)(`_M`, bound= [`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)), [`None`](https://docs.python.org/3/library/constants.html#None), [`None`](https://docs.python.org/3/library/constants.html#None)] ### Example ```pycon >>> from postgast import find_nodes, parse >>> from postgast.pg_query_pb2 import RangeVar >>> tree = parse("SELECT * FROM users JOIN orders ON users.id = orders.uid") >>> [n.relname for n in find_nodes(tree, RangeVar)] ['users', 'orders'] ``` ### postgast.extract_tables(tree) Yield table names referenced in a parse tree. Walks all `RangeVar` nodes and yields their names as dot-joined strings (`"schema.table"` when schema-qualified, `"table"` otherwise). Results preserve encounter order and include duplicates. Use `set()` on the result to get unique table names. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **Yields:** Table names in encounter order. * **Return type:** [`Generator`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Generator)[[`str`](https://docs.python.org/3/library/stdtypes.html#str), [`None`](https://docs.python.org/3/library/constants.html#None), [`None`](https://docs.python.org/3/library/constants.html#None)] ### Example ```pycon >>> from postgast import extract_tables, parse >>> tree = parse("SELECT * FROM public.users JOIN orders ON true") >>> list(extract_tables(tree)) ['public.users', 'orders'] ``` ### postgast.extract_columns(tree) Yield column references found in a parse tree. Walks all `ColumnRef` nodes and yields their names as dot-joined strings. `SELECT *` produces `"*"`; `t.*` produces `"t.*"`. Results preserve encounter order and include duplicates. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **Yields:** Column references in encounter order. * **Return type:** [`Generator`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Generator)[[`str`](https://docs.python.org/3/library/stdtypes.html#str), [`None`](https://docs.python.org/3/library/constants.html#None), [`None`](https://docs.python.org/3/library/constants.html#None)] ### Example ```pycon >>> from postgast import extract_columns, parse >>> tree = parse("SELECT u.name, age FROM users u WHERE age > 18") >>> list(extract_columns(tree)) ['u.name', 'age', 'age'] ``` ### postgast.extract_functions(tree) Yield function call names found in a parse tree. Walks all `FuncCall` nodes and yields their names as dot-joined strings (`"schema.func"` when schema-qualified, `"func"` otherwise). Results preserve encounter order and include duplicates. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **Yields:** Function names in encounter order. * **Return type:** [`Generator`](https://docs.python.org/3/library/collections.abc.html#collections.abc.Generator)[[`str`](https://docs.python.org/3/library/stdtypes.html#str), [`None`](https://docs.python.org/3/library/constants.html#None), [`None`](https://docs.python.org/3/library/constants.html#None)] ### Example ```pycon >>> from postgast import extract_functions, parse >>> tree = parse("SELECT lower(name), count(*) FROM users") >>> list(extract_functions(tree)) ['lower', 'count'] ``` ### postgast.extract_function_identity(tree) Return the identity of the first `CREATE FUNCTION` statement in a parse tree. Finds the first `CreateFunctionStmt` node where `is_procedure` is `False` and returns a [`FunctionIdentity`](#postgast.FunctionIdentity) with the schema and function name. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **Return type:** [`FunctionIdentity`](#postgast.FunctionIdentity) | [`None`](https://docs.python.org/3/library/constants.html#None) * **Returns:** A [`FunctionIdentity`](#postgast.FunctionIdentity) or `None` if no matching node is found. ### Example ```pycon >>> from postgast import extract_function_identity, parse >>> sql = "CREATE FUNCTION public.add(a int, b int) RETURNS int LANGUAGE sql AS $$ SELECT a + b $$" >>> identity = extract_function_identity(parse(sql)) >>> identity.schema, identity.name ('public', 'add') ``` ### postgast.extract_trigger_identity(tree) Return the identity of the first `CREATE TRIGGER` statement in a parse tree. Finds the first `CreateTrigStmt` node and returns a [`TriggerIdentity`](#postgast.TriggerIdentity) with the trigger name, schema, and table name. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – Any protobuf `Message` (`ParseResult`, `SelectStmt`, etc.). * **Return type:** [`TriggerIdentity`](#postgast.TriggerIdentity) | [`None`](https://docs.python.org/3/library/constants.html#None) * **Returns:** A [`TriggerIdentity`](#postgast.TriggerIdentity) or `None` if no matching node is found. ### Example ```pycon >>> from postgast import extract_trigger_identity, parse >>> sql = "CREATE TRIGGER my_trg AFTER INSERT ON orders FOR EACH ROW EXECUTE FUNCTION notify()" >>> identity = extract_trigger_identity(parse(sql)) >>> identity.trigger, identity.table ('my_trg', 'orders') ``` ## DDL Helpers ### postgast.set_or_replace(tree) Set `replace = True` on eligible DDL nodes in a parse tree. Walks *tree* and flips the `replace` flag on `CreateFunctionStmt`, `CreateTrigStmt`, and `ViewStmt` nodes where it is currently `False`. * **Parameters:** **tree** ([`Message`](https://googleapis.dev/python/protobuf/latest/google/protobuf/message.html#google.protobuf.message.Message)) – A protobuf `Message` (typically a `ParseResult`). * **Return type:** [`int`](https://docs.python.org/3/library/functions.html#int) * **Returns:** Number of nodes that were modified. ### Example ```pycon >>> from postgast import set_or_replace, parse, deparse >>> tree = parse("CREATE VIEW v AS SELECT 1") >>> set_or_replace(tree) 1 >>> "OR REPLACE" in deparse(tree) True ``` ### postgast.ensure_or_replace(sql) Return *sql* with all eligible `CREATE` statements rewritten to `CREATE OR REPLACE`. Parses the input, sets `replace = True` on `CreateFunctionStmt`, `CreateTrigStmt`, and `ViewStmt` nodes, and deparses back to SQL. * **Parameters:** **sql** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – One or more SQL statements. * **Return type:** [`str`](https://docs.python.org/3/library/stdtypes.html#str) * **Returns:** The rewritten SQL text. * **Raises:** [**PgQueryError**](#postgast.PgQueryError) – If *sql* cannot be parsed. ### Example ```pycon >>> from postgast import ensure_or_replace >>> ensure_or_replace("CREATE VIEW v AS SELECT 1") 'CREATE OR REPLACE VIEW v AS SELECT 1' ``` ### postgast.to_drop(sql) Return the `DROP` statement corresponding to a `CREATE` statement. Parses *sql*, builds a `DropStmt` protobuf from the parsed AST, and deparses it back to SQL. Supports: - `CREATE FUNCTION` / `CREATE PROCEDURE` - `CREATE TRIGGER` - `CREATE VIEW` - `CREATE TABLE` - `CREATE INDEX` - `CREATE SEQUENCE` - `CREATE SCHEMA` - `CREATE EXTENSION` - `CREATE TYPE` (enum, range, composite) - `CREATE MATERIALIZED VIEW ... AS` All `OR REPLACE` and `IF NOT EXISTS` variants are accepted. * **Parameters:** **sql** ([`str`](https://docs.python.org/3/library/stdtypes.html#str)) – A single CREATE statement. * **Return type:** [`str`](https://docs.python.org/3/library/stdtypes.html#str) * **Returns:** The corresponding DROP statement. * **Raises:** * [**ValueError**](https://docs.python.org/3/library/exceptions.html#ValueError) – If *sql* contains zero or more than one statement, or if the statement is not a supported CREATE type. * [**PgQueryError**](#postgast.PgQueryError) – If *sql* is not valid SQL. ### Example ```pycon >>> from postgast import to_drop >>> to_drop("CREATE TABLE public.users (id int)") 'DROP TABLE public.users' ``` ## Types ### *class* postgast.FingerprintResult(fingerprint, hex) Result of fingerprinting a SQL query. * **Parameters:** * **fingerprint** ([*int*](https://docs.python.org/3/library/functions.html#int)) * **hex** ([*str*](https://docs.python.org/3/library/stdtypes.html#str)) #### fingerprint The uint64 numeric hash. #### hex The hexadecimal string representation of the fingerprint. ### *class* postgast.FunctionIdentity(schema, name) Identity parts of a `CREATE FUNCTION` statement. * **Parameters:** * **schema** ([*str*](https://docs.python.org/3/library/stdtypes.html#str) *|* *None*) * **name** ([*str*](https://docs.python.org/3/library/stdtypes.html#str)) #### schema Schema name, or `None` for unqualified functions. #### name Function name. ### *class* postgast.TriggerIdentity(trigger, schema, table) Identity parts of a `CREATE TRIGGER` statement. * **Parameters:** * **trigger** ([*str*](https://docs.python.org/3/library/stdtypes.html#str)) * **schema** ([*str*](https://docs.python.org/3/library/stdtypes.html#str) *|* *None*) * **table** ([*str*](https://docs.python.org/3/library/stdtypes.html#str)) #### trigger Trigger name. #### schema Schema qualifying the target table, or `None`. #### table Target table name. ## Exceptions ### *class* postgast.PgQueryError(message, , cursorpos=0, context=None, funcname=None, filename=None, lineno=0) Structured error raised when libpg_query rejects a SQL statement. Every postgast function that calls into libpg_query ([`parse()`](#postgast.parse), [`deparse()`](#postgast.deparse), [`normalize()`](#postgast.normalize), [`fingerprint()`](#postgast.fingerprint), [`split()`](#postgast.split), [`scan()`](#postgast.scan), `parse_plpgsql()`, and `format_sql()`) may raise this exception. The error carries the same structured fields that the C library provides, so callers can build precise diagnostics (e.g., underlining the offending token) without parsing the message string. `cursorpos` is a **1-based byte offset** into the original SQL string pointing to the token where the error was detected. When it is `0` the position is unknown. Because it counts *bytes*, `e.cursorpos - 1` only equals the corresponding Python string index when the SQL is pure ASCII. For SQL containing multibyte UTF-8 characters (e.g., Unicode identifiers or string literals), index into the UTF-8-encoded `bytes` representation instead: ```default pos = sql.encode("utf-8")[: e.cursorpos - 1].decode("utf-8") char_offset = len(pos) ``` The `funcname`, `filename`, and `lineno` fields refer to the *internal C source* of libpg_query / PostgreSQL’s parser, not to your Python code. They are mainly useful for filing upstream bug reports. * **Parameters:** * **message** ([*str*](https://docs.python.org/3/library/stdtypes.html#str)) * **cursorpos** ([*int*](https://docs.python.org/3/library/functions.html#int)) * **context** ([*str*](https://docs.python.org/3/library/stdtypes.html#str) *|* *None*) * **funcname** ([*str*](https://docs.python.org/3/library/stdtypes.html#str) *|* *None*) * **filename** ([*str*](https://docs.python.org/3/library/stdtypes.html#str) *|* *None*) * **lineno** ([*int*](https://docs.python.org/3/library/functions.html#int)) #### message Human-readable error description from the PostgreSQL parser. #### cursorpos 1-based byte offset in the SQL string where the error was detected (`0` when the position is unavailable). #### context Additional context from the parser (e.g., PL/pgSQL function name), or `None`. #### funcname Internal C function name where the error originated, or `None`. #### filename Internal C source file where the error originated, or `None`. #### lineno Line number in the internal C source file (`0` when unavailable). ### Examples Catch a syntax error and inspect the cursor position: ```pycon >>> from postgast import parse, PgQueryError >>> try: ... parse("SELECT FROM") ... except PgQueryError as e: ... print(e.cursorpos) 12 ``` Use `cursorpos` to highlight the error location (ASCII-safe shortcut): ```pycon >>> from postgast import parse, PgQueryError >>> sql = "SELECT * FORM users" >>> try: ... parse(sql) ... except PgQueryError as e: ... idx = max(e.cursorpos - 1, 0) ... print(sql) ... print(" " * idx + "^") ... print(e.message) SELECT * FORM users ^ syntax error at or near "FORM" ``` For SQL that may contain non-ASCII characters, convert via the encoded bytes to get the correct character offset: ```default sql = "SELECT 'ü' FORM t" try: parse(sql) except PgQueryError as e: idx = len(sql.encode("utf-8")[: e.cursorpos - 1].decode("utf-8")) print(sql) print(" " * idx + "^") ``` # architecture.html.md # Architecture This page covers the main technical decisions behind `postgast` and why we made them. ## Why ctypes (not Cython, Rust, or C extensions) `postgast` binds to `libpg_query` using Python’s built-in `ctypes` module rather than Cython, PyO3/Rust, or a hand-written CPython C extension. This was a deliberate choice. Here’s the reasoning. ### Pure-Python packaging With `ctypes` the only compiled artifact is the vendored `libpg_query` shared library itself. Everything above it (struct definitions, function signatures, error handling, protobuf deserialization) is plain Python. This means: - **No compiler toolchain at install time.** Users never need Cython, a Rust toolchain, or a C compiler to install `postgast`. `pip install postgast` delivers a pre-built wheel containing the shared library and pure-Python code. - **Simpler CI matrix.** Wheels are built by compiling a single C library (`libpg_query`) per platform. There’s no second compilation step for a Python extension module, which removes an entire class of ABI-compatibility issues (limited API, stable ABI, per-interpreter builds, etc.). - **Easier debugging.** Every line between the C boundary and the public API is inspectable Python. You can step through `native.py` with a normal debugger, no mixed C/Python stack frames to deal with. ### Minimal dependency footprint The only runtime dependency is `protobuf`. There’s no build-time dependency on Cython or `setuptools-rust`, no transitive dependency on `cffi`, and no compiled glue code. Fewer moving parts means fewer ways the install can break. ### BSD licensing `pglast`, the most established `libpg_query` wrapper for Python, is licensed under GPLv3. That makes it unusable in many commercial and permissively-licensed projects. By keeping the binding layer to `ctypes` (stdlib) plus `protobuf` (BSD-compatible), `postgast` can ship under the BSD 2-Clause license with no copyleft obligations. ### Trade-offs `ctypes` isn’t free of downsides: - **No compile-time type checking at the C boundary.** If the `libpg_query` struct layout changes between versions, the ctypes bindings break silently at runtime rather than failing to compile. We mitigate this by pinning to a specific `libpg_query` version and testing across platforms in CI. - **Per-call overhead.** Each `ctypes` call has slightly more overhead than a direct C extension call. In practice this is negligible because the real work happens inside `libpg_query` (parsing a full PostgreSQL grammar) and the `ctypes` marshalling cost is dwarfed by the parser itself. - **Manual struct definitions.** The ctypes `Structure` classes in `native.py` must mirror the C structs exactly. It’s a small amount of code (~200 lines) maintained by hand. A Cython `.pxd` or Rust `bindgen` would generate these, but at the cost of the toolchain complexity described above. On balance the simplicity, portability, and licensing benefits outweigh the minor ergonomic costs. ## How the binding layer works All C interop lives in a single internal module, `native.py`. It does three things: 1. **Loads the shared library.** Checks for a vendored copy bundled in the wheel first, then falls back to `ctypes.util.find_library` for system-installed libraries. 2. **Defines ctypes Structure classes** that mirror every `libpg_query` result type (`PgQueryParseResult`, `PgQueryNormalizeResult`, etc.). 3. **Declares function signatures** (`argtypes` / `restype`) for each public C function so that calls are type-checked at the Python level. Higher-level modules (`parse.py`, `deparse.py`, `normalize.py`, …) import `native.lib` and follow a consistent pattern: ```python # Pseudocode (actual code is in each module) result = native.lib.pg_query_parse_protobuf(sql.encode()) try: check_error(result) # raise PgQueryError if result.error is set payload = extract(result) # read return value finally: native.lib.pg_query_free_protobuf_parse_result(result) ``` The `finally` block ensures the C-allocated memory is always freed, even when an error is raised. ### Protobuf deserialization `libpg_query` returns parse trees as serialized Protocol Buffer messages. `postgast` deserializes them using the official `protobuf` library into generated Python message classes (`pg_query_pb2`). This avoids writing a custom deserializer and tracks the upstream `.proto` schema exactly. Binary payloads are read with `ctypes.string_at(data, length)` rather than `c_char_p` because protobuf data can contain embedded null bytes that `c_char_p` would silently truncate. ## Alternatives considered ### Cython Cython would give compile-time type safety at the C boundary and marginally faster call overhead. But it requires a C compiler at wheel-build time *and* introduces a Cython build dependency. For a thin binding layer (~200 lines of struct definitions and function signatures), the added build complexity isn’t justified. ### Rust (PyO3 / maturin) A Rust extension via PyO3 would provide memory safety guarantees and strong typing. But `libpg_query` is a C library, so the Rust layer would still call C via FFI. Adding Rust introduces a second toolchain (`cargo`), a `maturin` build backend, and complicates cross-compilation. The binding layer is too thin to benefit from Rust’s strengths. ### CFFI CFFI is a popular alternative to `ctypes` that offers an ABI mode (similar to `ctypes`) and an API mode (generates a C extension). ABI mode provides no real advantage over `ctypes` for this use case, and API mode reintroduces the C compiler requirement. Staying with `ctypes` avoids adding `cffi` as a dependency. ### Hand-written CPython C extension A C extension would be the fastest option, but it ties the code to CPython internals, requires careful reference counting, and complicates building wheels for multiple Python versions. The performance difference is immaterial for this library’s workload. # contributing.html.md # Contributing See the [contributing guide on GitHub](https://github.com/eddieland/postgast/blob/main/CONTRIBUTING.md) for full details on setting up a development environment, running tests, and submitting pull requests. ## Quick Start ```bash git clone https://github.com/eddieland/postgast.git cd postgast make install # Install dependencies make build-native # Compile vendored libpg_query make test # Run tests ``` ## Development Commands ```bash make fmt # Autoformat (mdformat, codespell, ruff) make lint # Format + type-check (basedpyright) make test # Run tests make coverage # Tests with coverage report make all # install + lint + test ``` # examples.html.md # Common Usage Patterns This page demonstrates practical patterns for common tasks. For API fundamentals, see the [User Guide](guide.md). ## Query Analysis ### Audit which tables a query touches ```python import postgast sql = """ SELECT o.id, c.name, p.title FROM orders o JOIN customers c ON o.customer_id = c.id JOIN products p ON o.product_id = p.id WHERE o.created_at > '2024-01-01' """ tree = postgast.parse(sql) tables = postgast.extract_tables(tree) columns = postgast.extract_columns(tree) functions = postgast.extract_functions(tree) print("Tables:", tables) # ['orders', 'customers', 'products'] print("Columns:", columns) # ['o.id', 'c.name', 'p.title', ...] print("Functions:", functions) # [] ``` ### Detect queries that use subqueries ```python from postgast import parse, find_nodes from postgast.pg_query_pb2 import SelectStmt def has_subquery(sql: str) -> bool: """Return True if the SQL contains a nested SELECT.""" tree = parse(sql) select_count = sum(1 for _ in find_nodes(tree, SelectStmt)) return select_count > 1 has_subquery("SELECT * FROM users") # => False has_subquery("SELECT * FROM (SELECT id FROM users) AS sub") # => True ``` ### Find all function calls in a query ```python import postgast sql = "SELECT upper(name), count(*) FROM users GROUP BY upper(name)" tree = postgast.parse(sql) print(postgast.extract_functions(tree)) # ['upper', 'count', 'upper'] # Unique function names: print(set(postgast.extract_functions(tree))) # {'upper', 'count'} ``` ## Query Monitoring ### Group queries with normalization Replace literal values with positional placeholders so that structurally identical queries collapse into a single group: ```python import postgast queries = [ "SELECT * FROM users WHERE id = 42", "SELECT * FROM users WHERE id = 99", "SELECT * FROM orders WHERE status = 'pending'", ] for q in queries: print(postgast.normalize(q)) # SELECT * FROM users WHERE id = $1 # SELECT * FROM users WHERE id = $1 # SELECT * FROM orders WHERE status = $1 ``` ### Fingerprint queries for deduplication Two queries are structurally equivalent when they have the same fingerprint, regardless of literal values, whitespace, or formatting: ```python import postgast fp1 = postgast.fingerprint("SELECT * FROM users WHERE id = 1") fp2 = postgast.fingerprint("select * from users where id = 999") assert fp1.hex == fp2.hex # same structure fp3 = postgast.fingerprint("SELECT * FROM orders WHERE id = 1") assert fp1.hex != fp3.hex # different table ``` ## SQL Formatting ### Pretty-print SQL ```python import postgast ugly = "select u.id,u.name,o.total from users u join orders o on u.id=o.user_id where o.total>100 order by o.total desc" print(postgast.format_sql(ugly)) ``` Output: ```sql SELECT u.id, u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id WHERE o.total > 100 ORDER BY o.total DESC; ``` ### Format an already-parsed tree `format_sql` also accepts a `ParseResult`, so you can format after making AST modifications: ```python import postgast tree = postgast.parse("CREATE VIEW v AS SELECT 1") postgast.set_or_replace(tree) print(postgast.format_sql(tree)) ``` Output: ```sql CREATE OR REPLACE VIEW v AS SELECT 1; ``` ## Batch Processing ### Process a SQL migration file Use [`split()`](api.md#postgast.split) to break a multi-statement file into individual statements, then analyze each one: ```python import postgast migration = """ CREATE TABLE users ( id serial PRIMARY KEY, name text NOT NULL ); CREATE INDEX idx_users_name ON users (name); INSERT INTO users (name) VALUES ('alice'), ('bob'); """ for stmt in postgast.split(migration): tree = postgast.parse(stmt) tables = postgast.extract_tables(tree) print(f"Tables: {tables!r:30s} SQL: {stmt.strip()[:60]}...") ``` ### Split with tolerance for invalid SQL The `"scanner"` method splits on semicolons without parsing, so it works even when the SQL contains syntax errors: ```python import postgast broken = "SELECT 1; INVALID SYNTAX HERE; SELECT 2" stmts = postgast.split(broken, method="scanner") print(stmts) # ['SELECT 1', ' INVALID SYNTAX HERE', ' SELECT 2'] ``` ## DDL Tooling ### Generate rollback DROP statements Automatically produce `DROP` statements from `CREATE` DDL for migration rollback scripts: ```python import postgast creates = [ "CREATE FUNCTION public.add(a int, b int) RETURNS int LANGUAGE sql AS $$ SELECT a + b $$", "CREATE VIEW active_users AS SELECT * FROM users WHERE active", "CREATE TRIGGER audit_trg BEFORE UPDATE ON users FOR EACH ROW EXECUTE FUNCTION audit()", ] for sql in creates: print(postgast.to_drop(sql)) # DROP FUNCTION public.add(int, int) # DROP VIEW active_users # DROP TRIGGER audit_trg ON users ``` ### Make CREATE statements idempotent Add `OR REPLACE` to `CREATE FUNCTION`, `CREATE TRIGGER`, and `CREATE VIEW` statements so they can be re-run safely: ```python import postgast sql = "CREATE VIEW active_users AS SELECT * FROM users WHERE active" print(postgast.ensure_or_replace(sql)) # CREATE OR REPLACE VIEW active_users AS SELECT * FROM users WHERE active = true # Already idempotent input is unchanged: postgast.ensure_or_replace(postgast.ensure_or_replace(sql)) ``` ## Tree Walking ### Walk the AST to inspect structure [`walk()`](api.md#postgast.walk) yields every node in the tree with its parent field name, useful for debugging or generic transforms: ```python import postgast tree = postgast.parse("SELECT a FROM t WHERE x = 1") for field_name, node in postgast.walk(tree): if field_name: print(f" {field_name}: {type(node).__name__}") ``` Output: ```default stmts: RawStmt stmt: SelectStmt target_list: ResTarget val: ColumnRef fields: String from_clause: RangeVar where_clause: A_Expr lexpr: ColumnRef fields: String rexpr: A_Const ``` ### Collect information with the Visitor pattern Create a [`Visitor`](api.md#postgast.Visitor) subclass with `visit_` methods. Unhandled node types automatically recurse into their children: ```python import postgast class QueryAnalyzer(postgast.Visitor): def __init__(self): self.tables = [] self.columns = [] self.has_where = False def visit_RangeVar(self, node): self.tables.append(node.relname) def visit_ColumnRef(self, node): parts = [] for f in node.fields: inner = getattr(f, f.WhichOneof("node")) if hasattr(inner, "sval"): parts.append(inner.sval) self.columns.append(".".join(parts)) def visit_SelectStmt(self, node): if node.HasField("where_clause"): self.has_where = True self.generic_visit(node) # continue into children tree = postgast.parse("SELECT u.name FROM users u WHERE u.active = true") analyzer = QueryAnalyzer() analyzer.visit(tree) print(analyzer.tables) # ['users'] print(analyzer.columns) # ['u.name', 'u.active'] print(analyzer.has_where) # True ``` ### Control traversal depth Omitting the call to `self.generic_visit(node)` in a handler stops recursion into that node’s children. This lets you skip subtrees: ```python import postgast class TopLevelTables(postgast.Visitor): """Collect tables from the top-level FROM clause only, ignoring subqueries.""" def __init__(self): self.tables = [] def visit_RangeVar(self, node): self.tables.append(node.relname) def visit_SubLink(self, _node): pass # don't recurse into subqueries tree = postgast.parse( "SELECT * FROM orders WHERE customer_id IN (SELECT id FROM vip_customers)" ) v = TopLevelTables() v.visit(tree) print(v.tables) # ['orders'] — vip_customers is skipped ``` ### Use typed AST wrappers Wrap a parse tree with `wrap()` for typed attribute access. Works with `walk_typed()` and `TypedVisitor`: ```python from postgast import parse, wrap, walk_typed, TypedVisitor tree = wrap(parse("SELECT a, b FROM t")) for field_name, node in walk_typed(tree): if field_name: print(f" {field_name}: {type(node).__name__}") ``` ## Working with the Protobuf AST ### Access raw protobuf nodes The parse tree is a standard protobuf `Message`. You can inspect it using all the usual protobuf APIs: ```python import postgast tree = postgast.parse("SELECT id, name FROM users WHERE active = true") # Navigate to the SelectStmt raw_stmt = tree.stmts[0] select = raw_stmt.stmt.select_stmt # Inspect the target list (SELECT columns) for target in select.target_list: col = target.res_target.val.column_ref name = col.fields[0].string.sval print(f"Column: {name}") # Inspect the FROM clause table = select.from_clause[0].range_var print(f"Table: {table.relname}") # 'users' ``` ### Find specific node types [`find_nodes()`](api.md#postgast.find_nodes) filters the walk to a single protobuf message type: ```python from postgast import parse, find_nodes from postgast.pg_query_pb2 import FuncCall, RangeVar tree = parse("SELECT lower(name), count(*) FROM users GROUP BY lower(name)") # All table references for rv in find_nodes(tree, RangeVar): print(f"Table: {rv.relname}") # All function calls for fc in find_nodes(tree, FuncCall): func_name = fc.funcname[0].string.sval print(f"Function: {func_name}") ``` ## PL/pgSQL Parsing ### Parse a PL/pgSQL function body `parse_plpgsql()` returns a structured representation of a PL/pgSQL function’s body, including declarations, assignments, and control flow: ```python import json import postgast sql = """ CREATE FUNCTION greet(name text) RETURNS text LANGUAGE plpgsql AS $$ DECLARE result text; BEGIN result := 'Hello, ' || name; RETURN result; END; $$ """ parsed = postgast.parse_plpgsql(sql) print(json.dumps(parsed, indent=2)) ``` ## Error Handling ### Catch and inspect parse errors All functions raise [`PgQueryError`](api.md#postgast.PgQueryError) on invalid SQL. The exception carries the error message, cursor position, and source location: ```python import postgast try: postgast.parse("SELECT * FORM users") except postgast.PgQueryError as e: print(f"Error: {e.message}") print(f"Position: {e.cursorpos}") # Error: syntax error at or near "users" # Position: 15 ``` ### Validate SQL before execution Use `parse` as a fast syntax check without hitting the database: ```python import postgast def is_valid_sql(sql: str) -> bool: try: postgast.parse(sql) return True except postgast.PgQueryError: return False is_valid_sql("SELECT * FROM users") # True is_valid_sql("SLECT * FORM users") # False is_valid_sql("SELECT 1; DROP TABLE users; --") # True (valid SQL!) ``` ## Tokenization ### Scan SQL into tokens [`scan()`](api.md#postgast.scan) returns the raw token stream, useful for syntax highlighting, keyword detection, or building custom splitters: ```python import postgast result = postgast.scan("SELECT id FROM users WHERE active = true") for token in result.tokens: # Extract the token text using byte positions text = "SELECT id FROM users WHERE active = true"[token.start:token.end] print(f"{text:12s} token={token.token} keyword={token.keyword_kind}") ``` # guide.html.md # User Guide ## Parsing Parse a SQL query into a protobuf AST: ```python import postgast tree = postgast.parse("SELECT id, name FROM users WHERE active = true") # tree.stmts[0] contains the first (and only) statement ``` The returned `ParseResult` is a protobuf message. Each statement in `tree.stmts` is a `RawStmt` wrapping a `Node` oneof. ## Deparsing Convert a parse tree back into SQL text: ```python sql = postgast.deparse(tree) # "SELECT id, name FROM users WHERE active = true" ``` The deparsed SQL is canonicalized by libpg_query and may differ from the original in whitespace, casing, or parenthesization while remaining semantically equivalent. ## Normalization Replace literal constants with positional placeholders: ```python postgast.normalize("SELECT * FROM users WHERE id = 42 AND name = 'alice'") # => "SELECT * FROM users WHERE id = $1 AND name = $2" ``` This is useful for grouping structurally equivalent queries. ## Fingerprinting Compute a structural hash that identifies equivalent queries regardless of literal values: ```python fp = postgast.fingerprint("SELECT * FROM users WHERE id = 42") fp.fingerprint # uint64 hash fp.hex # hex string representation ``` ## Splitting Split a multi-statement SQL string into individual statements: ```python postgast.split("SELECT 1; SELECT 2;") # => ["SELECT 1", "SELECT 2"] ``` Two methods are available via the `method` parameter: - `"parser"` (default) — uses the full PostgreSQL parser for accuracy - `"scanner"` — faster, tolerates invalid SQL ```python postgast.split("SELECT 1; SELECT 2;", method="scanner") ``` ## Scanning Tokenize a SQL string: ```python result = postgast.scan("SELECT 1") for token in result.tokens: print(token.token, token.keyword_kind, token.start, token.end) ``` ## Tree Walking Walk all nodes in a parse tree with depth-first traversal: ```python for field_name, node in postgast.walk(tree): print(type(node).__name__, field_name) ``` Or use the visitor pattern: ```python class TableCollector(postgast.Visitor): def __init__(self): self.tables = [] def visit_RangeVar(self, node): self.tables.append(node.relname) collector = TableCollector() collector.visit(tree) print(collector.tables) ``` ## AST Helpers Extract common information from parse trees: ```python tree = postgast.parse("SELECT u.id, u.name FROM users u JOIN orders o ON u.id = o.user_id") postgast.extract_tables(tree) # ["users", "orders"] postgast.extract_columns(tree) # ["u.id", "u.name", "u.id", "o.user_id"] postgast.extract_functions(tree) # [] ``` Find specific node types: ```python from postgast.pg_query_pb2 import RangeVar for node in postgast.find_nodes(tree, RangeVar): print(node.relname, node.alias.aliasname if node.alias.aliasname else "") ``` ## DDL Helpers Generate `DROP` statements from `CREATE` DDL: ```python postgast.to_drop("CREATE FUNCTION add(a int, b int) RETURNS int AS $$ SELECT a + b $$ LANGUAGE sql") # => "DROP FUNCTION add(int, int)" ``` Rewrite `CREATE` to `CREATE OR REPLACE`: ```python postgast.ensure_or_replace("CREATE FUNCTION add(a int, b int) RETURNS int AS $$ SELECT a + b $$ LANGUAGE sql") # => "CREATE OR REPLACE FUNCTION ..." ``` ## Error Handling All functions raise [`PgQueryError`](api.md#postgast.PgQueryError) on invalid SQL: ```python try: postgast.parse("SELECT FROM") except postgast.PgQueryError as e: print(e.message) # Error message print(e.cursorpos) # Position in the SQL string ``` # upgrade-policy.html.md # Upgrade Policy This page describes how `postgast` tracks upstream dependencies and which Python versions are supported. ## Python `postgast` currently supports **Python 3.10** through the latest GA release (currently 3.14). The minimum supported version will move to **Python 3.12** in an upcoming release. When this happens the major version will be bumped to signal the change. Pin `postgast<2` if you need to stay on Python 3.10 or 3.11. ## PostgreSQL (libpg_query) `postgast` always vendors the latest available version of [libpg_query](https://github.com/pganalyze/libpg_query). The vendored version determines which PostgreSQL grammar is used for parsing, deparsing, and all other operations. Because `postgast` delegates all parsing to `libpg_query`, the PostgreSQL syntax it understands is dictated entirely by the vendored library version. There is no separate PostgreSQL version knob to configure. Backwards-incompatible changes to the PostgreSQL parser are exceedingly rare. In practice a `libpg_query` upgrade means *new* syntax is accepted, not that existing syntax breaks. Still, if you need to pin a specific parser version you can pin the `postgast` version that vendors it. ## Versioning `postgast` follows [Semantic Versioning](https://semver.org/): - **Patch** releases contain bug fixes and `libpg_query` patch updates. - **Minor** releases add new features or upgrade `libpg_query` to a new PostgreSQL major version. - **Major** releases include breaking API changes or Python support-range changes (such as dropping a Python version). ## What this means in practice - **Most users** can track the latest `postgast` release with no issues. `libpg_query` parser upgrades almost never break existing SQL. - **Users on older Python versions** should pin to the last major version that supports their interpreter once the minimum is raised. - **Users who need a specific PostgreSQL parser version** should pin the `postgast` version that vendors the corresponding `libpg_query` release.