Methods to achieve null safety for deduplicate
#815
Closed
4 commits
3eced4d Null safety for `deduplicate` via `row_alias` keyword argument (dbeatty10)
d46676e Null safety for `deduplicate` via `columns` keyword argument (dbeatty10)
fe03f43 Null safety for `deduplicate` when `relation` is not a CTE (dbeatty10)
e13d72d Update caveats (dbeatty10)
```diff
@@ -1,8 +1,40 @@
 {%- macro deduplicate(relation, partition_by, order_by) -%}
-    {{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by)) }}
+    {{ return(adapter.dispatch('deduplicate', 'dbt_utils')(relation, partition_by, order_by, **kwargs)) }}
 {% endmacro %}
 
-{%- macro default__deduplicate(relation, partition_by, order_by) -%}
+{#
+-- ⚠️ This macro drops rows that contain NULL values ⚠️
+
+-- The implementation below uses a natural join which avoids returning an
+-- extra column at the cost of not being null safe.
+
+-- dbt_utils._safe_deduplicate is an alternative that avoids dropping rows
+-- that contain NULL values at the cost of adding an extra column.
+#}
+{%- macro _unsafe_deduplicate(relation, partition_by, order_by) -%}
+
+    {%- set error_message = "
+Warning: the implementation of the `deduplicate` macro for the `{}` adapter is not null safe. \
+
+Set `row_alias` within calls to `deduplicate` to achieve null safety (which will also add it \
+as an extra column to the output).
+
+e.g.,
+{{
+dbt_utils.deduplicate(
+'my_cte',
+partition_by='user_id',
+order_by='version desc',
+row_alias='rn'
+) | indent
+}}
+
+Warning triggered by model: {}.{}
+dbt project / package: {}
+path: {}
+".format(target.type, model.package_name, model.name, model.package_name, model.original_file_path) -%}
+
+    {%- do exceptions.warn(error_message) -%}
 
     with row_numbered as (
         select
@@ -29,6 +61,63 @@
 
 {%- endmacro -%}
 
+{#
+-- For data platforms that don't support QUALIFY or an equivalent, the
+-- best we can do to ensure null safety is to use a window function +
+-- filter (which returns an extra column):
+-- https://modern-sql.com/caniuse/qualify
+#}
+{%- macro _safe_deduplicate(relation, partition_by, order_by, row_alias="rn", columns=none) -%}
+
+    {% if not row_alias %}
+        {% set row_alias = "rn" %}
+    {% endif %}
+
+    with row_numbered as (
+        select
+
+        {% if columns != None %}
+            {% for column in columns %}
+                {{ column }},
+            {% endfor %}
+        {% else %}
+            _inner.*,
+        {% endif %}
+
+            row_number() over (
+                partition by {{ partition_by }}
+                order by {{ order_by }}
+            ) as {{ row_alias }}
+        from {{ relation }} as _inner
+    )
+
+    select *
+    from row_numbered
+    where {{ row_alias }} = 1
+
+{%- endmacro -%}
+
+{#
+-- ⚠️ This macro drops rows that contain NULL values unless one of the following is true:
+-- - `relation` parameter is a non-CTE dbt Relation
+-- - `row_alias` parameter is included
+-- - `columns` parameter is included
+#}
+{%- macro default__deduplicate(relation, partition_by, order_by) -%}
+    {% set row_alias = kwargs.get('row_alias') %}
+    {% set columns = kwargs.get('columns') %}
+
+    {% if relation.is_cte is defined and not relation.is_cte %}
+        {% set columns = dbt_utils.get_filtered_columns_in_relation(relation) %}
+        {{ dbt_utils._safe_deduplicate(relation, partition_by, order_by, columns=columns) }}
+    {% elif row_alias != None or columns != None %}
+        {{ dbt_utils._safe_deduplicate(relation, partition_by, order_by, row_alias=row_alias, columns=columns) }}
+    {% else %}
+        {{ dbt_utils._unsafe_deduplicate(relation, partition_by, order_by) }}
+    {% endif %}
+
+{%- endmacro -%}
+
 -- Redshift has the `QUALIFY` syntax:
 -- https://docs.aws.amazon.com/redshift/latest/dg/r_QUALIFY_clause.html
 {% macro redshift__deduplicate(relation, partition_by, order_by) -%}
```
What databases allow for `minus` or `except` syntax? I know Snowflake does - that could be an option for removing the extra column. Though maybe in that case you'd just use `qualify`.

How would `minus` or `except` work to remove extra column(s)? Do you mean `select * exclude ( <col_name>, <col_name>, ... )`?

This would be the perfect solution if we could rely on it! 💡

But it is not in the SQL standard, and the databases that don't have `qualify` are probably missing `select * exclude (...)` as well. So I don't think we'll be able to reliably use it as part of the default implementation 😢.

`select * exclude (...)`

Snowflake has `select * exclude`, and so does DuckDB.

`select * except (...)`

Because it's not in the standard, other databases use `except` instead of `exclude`. BigQuery uses `except`, as does Databricks.

Sorry, yes I meant `exclude`. What about using the `star` macro with the `except` argument?

The initial implementation in #512 used the `star` macro but it was removed in #548.

I haven't considered the details of how we might be able to bring it back or what those implications would be.

I think we'd still need to handle the case where the `relation` is a CTE name instead of a Relation. That's the case that this draft PR is covering with the `row_alias` parameter. An alternative way to cover it would be a `columns` parameter like suggested here. Allowing the end user to choose between either `row_alias` or `columns` would provide the most optionality.
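To make the null-safety trade-off concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module (SQLite 3.25 or later is assumed, for window functions). The `events` table and its rows are made up for illustration, and the natural-join query only approximates the unsafe approach described in the macro comments; it is not the macro's actual compiled SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table events (user_id integer, version integer, note text);
    insert into events values
        (1, 1, 'old'),
        (1, 2, 'new'),
        (2, 1, null),
        (null, 1, 'anon');
    """
)

# Shared CTE: number the rows within each partition (hypothetical column names).
row_numbered = """
    with row_numbered as (
        select
            _inner.*,
            row_number() over (partition by user_id order by version desc) as rn
        from events as _inner
    )
"""

# Natural-join style: no extra column in the output, but the join compares
# every shared column with "=", so any row containing a NULL never matches.
unsafe_rows = conn.execute(
    row_numbered
    + """
    select distinct data.*
    from events as data
    natural join row_numbered
    where row_numbered.rn = 1
    """
).fetchall()

# Window function + filter: keeps rows that contain NULLs, at the cost of
# returning the helper column `rn`.
safe_rows = conn.execute(
    row_numbered + "select * from row_numbered where rn = 1"
).fetchall()

print("natural join:", unsafe_rows)
print("row_number filter:", safe_rows)
```

In this sketch the natural-join query returns only the row for user 1, while the filter on `rn` returns one row per partition, including the two rows that contain a NULL.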

@graciegoheen your idea about using the `star` macro inspired fe03f43.

It retrieves columns similarly to `dbt_utils.star` IFF `relation.is_cte` is defined and false, i.e. `relation` is a non-CTE dbt Relation.

Otherwise, a user can pass a list of `columns` manually (d46676e). Or they can specify a `row_alias` that is acceptable to them.
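As a rough illustration of why knowing the column names matters, the sketch below (Python with the standard-library `sqlite3` module, SQLite 3.25 or later assumed) looks up the columns of a real table and uses them to build a null-safe query. The helper names `table_columns` and `deduplicate_sql` and the `events` table are hypothetical, and `pragma table_info` merely stands in for the catalog lookup that dbt can do for a real Relation. Unlike `_safe_deduplicate` in the diff, which ends in `select *`, this variation lists the columns explicitly in the final select so that the helper column is not returned.

```python
import sqlite3


def table_columns(conn, table):
    """Return column names from the catalog; empty for names that are not tables (e.g. a CTE)."""
    return [row[1] for row in conn.execute(f"pragma table_info({table})")]


def deduplicate_sql(relation, partition_by, order_by, columns, row_alias="rn"):
    """Build a window-function dedupe query whose output contains only `columns`."""
    column_list = ", ".join(columns)
    return f"""
        with row_numbered as (
            select
                {column_list},
                row_number() over (
                    partition by {partition_by}
                    order by {order_by}
                ) as {row_alias}
            from {relation}
        )
        select {column_list}
        from row_numbered
        where {row_alias} = 1
    """


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table events (user_id integer, version integer, note text);
    insert into events values
        (1, 1, 'old'),
        (1, 2, 'new'),
        (2, 1, null),
        (null, 1, 'anon');
    """
)

# For a real table the column list can be discovered automatically.
columns = table_columns(conn, "events")
deduped = conn.execute(
    deduplicate_sql("events", "user_id", "version desc", columns)
).fetchall()
print(columns)
print(deduped)

# A CTE name is not in the catalog, so the lookup comes back empty and the
# caller would have to supply the column list (or accept the helper column).
print(table_columns(conn, "my_cte"))
```

This mirrors the split described in the thread: automatic column retrieval when `relation` is a real table, and a manually supplied `columns` list or a `row_alias` when it is only a CTE name.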