Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More robust SQL query generation #158

Closed
edublancas opened this issue Feb 24, 2023 · 4 comments · Fixed by #164
Closed

More robust SQL query generation #158

edublancas opened this issue Feb 24, 2023 · 4 comments · Fixed by #164
Assignees
Labels
stash Label used to categorize issues that will be worked on next

Comments

@edublancas
Copy link

edublancas commented Feb 24, 2023

Currently, we have a few SQL templates that we use to generate queries (examples: here, and here). All of these templates have double quotes to wrap identifiers such as a table or column names (I added this to support identifiers with spaces). However, this isn't compatible with MySQL (and possibly other databases as well).

We've had reports (CTEs and plotting) where JupySQL fails on MySQL because the default configuration uses backticks and breaks with double quotes.

solution: sqlglot

I did some quick comparison and determined that sqlglot is the best solution: https://github.com/ploomber/contributing/blob/main/notes/sqlalchemy-sqlglot.ipynb

requirements:

  • we need to validate it produces valid SQL for the percentile_disc use case (namely a SQL query that has percentile_disc([0.25, 0.50]) which is valid in duckdb, let's validate that the output from sqlglot is valid in MySQL, postgres, and sqlite
  • mapping between sqlalchemy dialect and sqlglot write parameter. sqlalchemy's dialect string might be different from the parameter that sqlglot expects

comments

Note that this is a configurable parameter, and users can configure MySQL (or other databases) to use double quotes and perhaps other characters. So even if we generate SQL statements with the default character, it might fail. For now, we can add a section in our docs explaining this issue (that JupySQL will generate SQL statements with the default delimiter).

@edublancas edublancas added the stash Label used to categorize issues that will be worked on next label Feb 24, 2023
@edublancas edublancas changed the title More robust SQLquery generation More robust SQL query generation Feb 24, 2023
@tonykploomber
Copy link

tonykploomber commented Feb 24, 2023

Very appreciate for the research, I can work on this immediately

I agree this is more robust way instead of just removing " in our current Template to solve the issue.

@edublancas
One thing to clarify: What's the read parameter we should always set? The read parameter means the query template based on one specific database dialect. According to the example, you are using read="duckdb". I think we should choose the most generic one, what do you think?

The write parameter will be dynamic, according to the current database connection, which we already able to get the current dialect info in previous Telemetry database version collection, it's doable.

# read (source) will be always same, how we write the query template will depend on what dialect we choose
# write (target) will be current database connection 
sqlglot.transpile(query, read="duckdb", write="mysql")

Ref: transpile

@edublancas
Copy link
Author

One thing to clarify: What's the read parameter we should always set?

I think we can use DuckDB since we can test that one easily. But we should keep an eye on the functions we use because if we use functions that are only available on DuckDB, the transpiled queries will fail as sqlglot does not check that the functions exist in the target dialect.

Down the road, we can decide if we want to keep using DuckDB or switch to another one. I think another good option would be PostgreSQL since many databases attempt to be postgres-compatible.

@tonykploomber
Copy link

OK let's start with duckdb, our templates are still pretty simple, not fancy queries seen in our codebase yet.

@sync-by-unito
Copy link

sync-by-unito bot commented Mar 13, 2023

➤ Tony Kuo commented:

We have some blocking issues by sqlglot, need to wait for them and unblock the issue
#164 (comment) ( #164 (comment) )

Move to block now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stash Label used to categorize issues that will be worked on next
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants