Improve Deduplicate Macro to use QUALIFY #543
Comments
I suspect that this will still mean that a Relation has to be passed though, right? In order to be compatible with Redshift + Postgres.
I've implemented this (along with some other non-breaking improvements) in #549.
@codigo-ergo-sum This isn't entirely relevant to this issue but having taken a look at this for other DBs as well I've come to the conclusion that SQL dialects are even more annoying and limiting than I'd thought. The actual ANSI way of doing this deduplication is so rarely supported that I didn't even know it existed! As far as I can tell only Trino and PG13+ actually support it (maybe others). The ANSI way of doing this dedupe is:
    select *
    from {{ relation if relation_alias is none else relation_alias }}
    order by row_number() over (
        partition by {{ group_by }}
        order by {{ order_by }}
    ) fetch first row with ties
And even though Snowflake explicitly claims to support the ANSI
This can be closed now that #548 has been merged, I believe.
Thank you for calling this out @judahrand! Added "Resolves #543" as a comment into #548 for traceability and manually closing this issue.
Describe the feature
The deduplicate macro currently uses a combination of dbt_utils.star and a subquery to work around needing to filter based on the result of a window function but not wanting to return the filtering column used. The QUALIFY keyword, recently introduced in Snowflake and BigQuery, allows for filtering the result of a query directly on a window function in a cleaner way.
This current code:
I think could look like this:
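As a rough sketch of the two shapes being contrasted (this is not the macro's actual source; the table `events` and the columns `id`, `user_id`, and `updated_at` are hypothetical and used only for illustration), the subquery form and the QUALIFY form on Snowflake might look like this:

```sql
-- Subquery form: rank rows in an inner query, filter in the outer query.
-- The outer select has to list columns explicitly (or use a helper such as
-- dbt_utils.star) so that the helper column rn is not returned.
select id, user_id, updated_at
from (
    select
        *,
        row_number() over (
            partition by user_id
            order by updated_at desc
        ) as rn
    from events
) as ranked
where rn = 1;

-- QUALIFY form: filter directly on the window function.
-- No helper column is created, so select * returns only the original columns.
select *
from events
qualify row_number() over (
    partition by user_id
    order by updated_at desc
) = 1;
```

Both queries keep one row per `user_id`, the one with the latest `updated_at`. If several rows tie on `updated_at`, which one survives is arbitrary unless the ordering is extended with a tiebreaker column.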
Additional context
Although BQ does support QUALIFY, it also has issues with window functions over large amounts of data choking on single nodes, which is why the BQ override for the macro uses array_agg instead. Redshift and potentially other databases don't support QUALIFY. But this could at least be overridden more cleanly, and probably with better performance, for Snowflake.
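For comparison, an array_agg-based dedupe on BigQuery could be sketched as below. This is an illustration of the general pattern under assumed names (the table `events`, the columns `user_id` and `updated_at`, and the alias `latest_row` are all hypothetical), not a copy of the override shipped in dbt_utils:

```sql
-- Aggregate each group into an array of whole rows, ordered so the row to
-- keep comes first, then take that single element and expand it back into
-- columns. This avoids a window function over the full table.
select latest_row.*
from (
    select
        array_agg(
            original
            order by updated_at desc
            limit 1
        )[offset(0)] as latest_row
    from events as original
    group by user_id
);
```

Because `limit 1` is applied inside the aggregate, each group only ever materializes one row in its array.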
Who will this benefit?
What kind of use case will this feature be useful for? Please be specific and provide examples, this will help us prioritize properly.
Are you interested in contributing this feature?
Possibly, depends on time and what the dbt_utils build process looks like these days? Last time I submitted a fix a few months ago it definitely required some help from the team in troubleshooting the build process.