Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(decompile): make the decompiler run on TPCH query 1 #9779

Merged
merged 23 commits into from
Aug 26, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented Aug 6, 2024

Context: the sql -> ibis expression parser/creator runs off of a sqlglot.Plan object

The sqlglot plan optimizer replaces a few items with aliases. In other places, it doesn't use
fully-qualified column names. Both of these prevent us from reliably converting into an Ibis expression.

Description of changes

Dereferencing aggregation operands

The sqlglot planner will pull out computed operands into a separate
section and alias them e.g.

Context:
    Aggregations:
    - SUM("_a_0") AS "sum_disc_price"
    Operands:
    - "lineitem"."l_extendedprice" * (1 - "lineitem"."l_discount") AS _a_0

For the purposes of decompiling, we want these to be inline, so here we
replace those new aliases with the parsed sqlglot expression

Dereferencing projections

The sqlglot planner will (sometimes) alias projections to the aggregate
that precedes it.

- Sort: lineitem (132849388268768)
  Context:
    Key:
      - "l_returnflag"
      - "l_linestatus"
  Projections:
    - lineitem._g0 AS "l_returnflag"
    - lineitem._g1 AS "l_linestatus"
    <snip>
  Dependencies:
  - Aggregate: lineitem (132849388268864)
    Context:
      Aggregations:
        <snip>
      Group:
        - "lineitem"."l_returnflag"  <-- this is _g0
        - "lineitem"."l_linestatus"  <-- this is _g1
        <snip>

These aliases are stored in a dictionary in the aggregate groups, so if
those are pulled out beforehand then we can use them to replace the
aliases in the projections.

Dereferencing sort keys

The sqlglot planner doesn't fully qualify sort keys

- Sort: lineitem (132849388268768)
  Context:
    Key:
      - "l_returnflag"
      - "l_linestatus"

For now we do a naive thing here and prepend the name of the sort
operation itself, which (maybe?) is the name of the parent table.
This is definitely the most brittle part of the existing decompiler

I've mucked around with this a little bit, and while it's a touch hacky, using the deferred operator with the not-fully-referenced sort-keys works well.

This makes parsing TPC-H query 1 break in a different place, so progress?
The `sqlglot` plan optimizer (even with optimizations disabled) replaces
a few items with aliases.  In other places, it doesn't use
fully-qualified column names.  Both of these prevent us from reliably
converting into an Ibis expression.

There are more details in the comments of each helper function, but
overall, this adds the table name to qualify the sort columns, replaces
the projection aliases with the qualified column names, and replaces any
intermediate operands in agg functions with those intermediates.
@gforsyth gforsyth marked this pull request as draft August 6, 2024 20:38
ibis/expr/tests/test_sql.py Outdated Show resolved Hide resolved
ibis/expr/tests/test_sql.py Outdated Show resolved Hide resolved
ibis/expr/sql.py Show resolved Hide resolved
@gforsyth
Copy link
Member Author

gforsyth commented Aug 8, 2024

Ok, this is slightly bonkers, but I'm re-using the values comparison from the regular TPCH tests, then using importlib to load the snapshot we've created, then executing that against TPCH data and comparing it to running the sql query against DuckDB.

I was going to exec the string we create in decompilation, but that doesn't seem to make it available in the pytest context (although it does work in a REPL)

@gforsyth gforsyth marked this pull request as ready for review August 8, 2024 14:44
@gforsyth
Copy link
Member Author

Ok, this is ready for a look. I think we can improve the testing beyond just DuckDB, but for now this is a definite improvement over the current decompiler and I think there is more we can do to improve it.

If we're good with this, I can look into adding more of the TPCH tests

@gforsyth gforsyth mentioned this pull request Aug 22, 2024
1 task
@gforsyth gforsyth requested a review from cpcloud August 22, 2024 18:38
snapshot.assert_match(code, "out_tpch.py")

# Import just-created snapshot
SNAPSHOT_MODULE = f"ibis.backends.duckdb.tests.snapshots.test_decompile_tpch.test_parse_sql_tpch.tpch{tpch_query:02d}.out_tpch"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yikes 😬

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan on making this better in a follow up, I promise.

@@ -5,4 +5,4 @@
name="airlines", schema={"dest": "string", "origin": "string", "arrdelay": "int32"}
)

result = airlines.filter((airlines.dest.cast("int64") == 0) == True)
result = airlines.filter((((airlines.dest.cast("int64") == 0)) == True))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're a lisp now!

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have this function marked as @̶̜̍ë̷͔́x̵̺́p̶̠̈́e̶̠͐r̵̹͐i̴̜̎m̶̩̎e̸̛̹ň̶̺ṯ̷̃å̴͉l̷̼͘ 😅

@gforsyth
Copy link
Member Author

Do we have this function marked as @̶̜̍ë̷͔́x̵̺́p̶̠̈́e̶̠͐r̵̹͐i̴̜̎m̶̩̎e̸̛̹ň̶̺ṯ̷̃å̴͉l̷̼͘ 😅

I am happy to report that we do

@cpcloud cpcloud added the experimental Experimental features label Aug 22, 2024
@gforsyth gforsyth merged commit 0268044 into ibis-project:main Aug 26, 2024
86 checks passed
@gforsyth gforsyth deleted the decompile branch August 26, 2024 13:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental Experimental features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants