More robust SQL query generation #158

edublancas · 2023-02-24T13:14:33Z

Currently, we have a few SQL templates that we use to generate queries (examples: here, and here). All of these templates have double quotes to wrap identifiers such as a table or column names (I added this to support identifiers with spaces). However, this isn't compatible with MySQL (and possibly other databases as well).

We've had reports (CTEs and plotting) where JupySQL fails on MySQL because the default configuration uses backticks and breaks with double quotes.

solution: sqlglot

I did some quick comparison and determined that sqlglot is the best solution: https://github.com/ploomber/contributing/blob/main/notes/sqlalchemy-sqlglot.ipynb

requirements:

we need to validate it produces valid SQL for the percentile_disc use case (namely a SQL query that has percentile_disc([0.25, 0.50]) which is valid in duckdb, let's validate that the output from sqlglot is valid in MySQL, postgres, and sqlite
mapping between sqlalchemy dialect and sqlglot write parameter. sqlalchemy's dialect string might be different from the parameter that sqlglot expects

comments

Note that this is a configurable parameter, and users can configure MySQL (or other databases) to use double quotes and perhaps other characters. So even if we generate SQL statements with the default character, it might fail. For now, we can add a section in our docs explaining this issue (that JupySQL will generate SQL statements with the default delimiter).

The text was updated successfully, but these errors were encountered:

tonykploomber · 2023-02-24T16:47:02Z

Very appreciate for the research, I can work on this immediately

I agree this is more robust way instead of just removing " in our current Template to solve the issue.

@edublancas
One thing to clarify: What's the read parameter we should always set? The read parameter means the query template based on one specific database dialect. According to the example, you are using read="duckdb". I think we should choose the most generic one, what do you think?

The write parameter will be dynamic, according to the current database connection, which we already able to get the current dialect info in previous Telemetry database version collection, it's doable.

# read (source) will be always same, how we write the query template will depend on what dialect we choose
# write (target) will be current database connection 
sqlglot.transpile(query, read="duckdb", write="mysql")

Ref: transpile

edublancas · 2023-02-24T17:03:36Z

One thing to clarify: What's the read parameter we should always set?

I think we can use DuckDB since we can test that one easily. But we should keep an eye on the functions we use because if we use functions that are only available on DuckDB, the transpiled queries will fail as sqlglot does not check that the functions exist in the target dialect.

Down the road, we can decide if we want to keep using DuckDB or switch to another one. I think another good option would be PostgreSQL since many databases attempt to be postgres-compatible.

tonykploomber · 2023-02-24T17:10:22Z

OK let's start with duckdb, our templates are still pretty simple, not fancy queries seen in our codebase yet.

sync-by-unito · 2023-03-13T13:42:25Z

➤ Tony Kuo commented:

We have some blocking issues by sqlglot, need to wait for them and unblock the issue
#164 (comment) ( #164 (comment) )

Move to block now

edublancas added the stash Label used to categorize issues that will be worked on next label Feb 24, 2023

This was referenced Feb 24, 2023

Improve boxplot SQL perf - Part 2 #152

Merged

--save name and --with name is creating an error with the CTE name appearing with double quotes in MySQL #145

Closed

edublancas assigned tonykploomber Feb 24, 2023

edublancas changed the title ~~More robust SQLquery generation~~ More robust SQL query generation Feb 24, 2023

edublancas mentioned this issue Feb 24, 2023

dropping sqlparse #160

Closed

tonykploomber mentioned this issue Feb 24, 2023

Fix: SQL template rendered with Double Quote Issue #162

Closed

3 tasks

tonykploomber linked a pull request Feb 24, 2023 that will close this issue

Add: sqlglot to formalize sql clause #164

Merged

3 tasks

edublancas mentioned this issue Feb 24, 2023

Automatically inferring dependencies in CTE feature #166

Closed

tonykploomber mentioned this issue Feb 27, 2023

Add: sqlglot to formalize sql clause #164

Merged

3 tasks

edublancas closed this as completed in #164 Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More robust SQL query generation #158

More robust SQL query generation #158

edublancas commented Feb 24, 2023 •

edited by sync-by-unito bot

Loading

tonykploomber commented Feb 24, 2023 •

edited

Loading

edublancas commented Feb 24, 2023

tonykploomber commented Feb 24, 2023

sync-by-unito bot commented Mar 13, 2023

More robust SQL query generation #158

More robust SQL query generation #158

Comments

edublancas commented Feb 24, 2023 • edited by sync-by-unito bot Loading

solution: sqlglot

comments

tonykploomber commented Feb 24, 2023 • edited Loading

edublancas commented Feb 24, 2023

tonykploomber commented Feb 24, 2023

sync-by-unito bot commented Mar 13, 2023

edublancas commented Feb 24, 2023 •

edited by sync-by-unito bot

Loading

tonykploomber commented Feb 24, 2023 •

edited

Loading