Documentation Issues with New "deduplicate" macro #542

codigo-ergo-sum · 2022-04-08T22:12:49Z

Describe the bug

The new "deduplicate" macro has a broken link at the top of the page to the documentation section on the front page of the repo. First link is https://github.com/dbt-labs/dbt-utils#deduplicate but then the second link is https://github.com/dbt-labs/dbt-utils#deduplicate-source. Could this be fixed?

Also, the macro itself is a bit confusing. I understand deduplication in general and I'm having a bit of a hard time understanding how this function is intended to be used and what some of the parameters are for. Possible to beef this up?

Steps to reproduce

Same as above.

Expected results

Links work on the documentation page for the repo and the macro is clearly understandable along with all parameters to be used with it, whether required or optional.

Actual results

N/A

Screenshots and log output

N/A

System information

N/A

The output of dbt --version:
N/A

Are you interested in contributing the fix?

Potentially

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2022-04-14T00:12:16Z

@judahrand

Do you have insight on how this deduplicate function is intended to be used and what each of the parameters are for?

Here's my guesses, but would be great to hear your thoughts (and suggestions for word-smithing):

Args:

relation (required): a Relation (a ref or source) whose rows you wish to deduplicate
group_by (required, default=''): attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string).
order_by (optional, default=none): attribute names (or expressions) that determine the priority order of which row should be chosen first if there are duplicates (comma-separated string).
relation_alias (optional, default=''): will prefix all generated fields with an alias (relation_alias.field_name).

codigo-ergo-sum · 2022-04-14T00:45:09Z

Part of what confused me about this is the name "group_by" for the second parameter. That implies aggregation to me, but this is not doing the logical operation of aggregation (with the exception of the BQ-specific version of the macro which has to do an involved workaround due to BQ issues with window functions not being able to successfully process large volumes of data.) This is really doing a window function. It would properly be more helpful to call the second parameter something like "key_columns" or "deduplication_columns." Also, purely from reading the code, I think instead of saying attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string) it seems a bit more intuitive to me to say column names (or expressions) to use to identify a set/window of rows out of which to select one as the deduplicated row.

Also, I'm confused how order_by can be optional. When using row_number() on Snowflake the order_by clause is not optional: https://docs.snowflake.com/en/sql-reference/functions/row_number.html. You don't have to provide it on Redshift, but if you don't, the result is nondeterministic which seems to me to be a really bad idea: https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html

judahrand · 2022-04-14T08:23:16Z

@judahrand

Do you have insight on how this deduplicate function is intended to be used and what each of the parameters are for?

Here's my guesses, but would be great to hear your thoughts (and suggestions for word-smithing):

Args:

relation (required): a Relation (a ref or source) whose rows you wish to deduplicate

group_by (required, default=''): attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string).

order_by (optional, default=none): attribute names (or expressions) that determine the priority order of which row should be chosen first if there are duplicates (comma-separated string).

relation_alias (optional, default=''): will prefix all generated fields with an alias (relation_alias.field_name).

That's all correct with the exception of relation_alias. The relation_alias argument is intended to allow you to reference a CTE rather than the Relation you actually pass. I added this as I noticed that a common use pattern was to filter a relation and then deduplicate (for example to take advantage of table partitions). An example of this can be seen in the tests: https://github.com/dbt-labs/dbt-utils/blob/33299334a305d0acb99ebee8cc2eb6eb2ba5ca31/integration_tests/models/sql/test_deduplicate.sql

judahrand · 2022-04-14T08:24:25Z

Also, I'm confused how order_by can be optional. When using row_number() on Snowflake the order_by clause is not optional: https://docs.snowflake.com/en/sql-reference/functions/row_number.html.

@codigo-ergo-sum Oh, I missed that in the docs! It is optional for all the other DBs but maybe you're right and it shouldn't be optional.

judahrand · 2022-04-14T08:59:39Z

I've just had a look at this and I think we can do away with the confusing relation_alias by writing the default implementation a bit more smartly. I've taken a crack at some of the issues raised here in #548.

judahrand · 2022-04-14T11:01:19Z

It would properly be more helpful to call the second parameter something like "key_columns" or "deduplication_columns."

I'm not sure that including columns in these arguments is a good idea. They may well not just be column names! For example, what is currently called order_by really is an order by clause - it is expected to have asc and desc keywords to set the order.

Similarly, group_by may well contain expressions! For example, I might have a timestamp column buy I actually want to deduplicate per day - so I'd include date(timestamp_col) in group_by, which isn't a column.

Suggesting that these are just column names, I think, would introduce additional confusion. Perhaps we can just change group_by to partition_by? Since that's more accurate.

judahrand · 2022-05-16T17:57:27Z

This can be closed now that #548 has been merged, I believe.

dbeatty10 · 2022-05-17T12:31:15Z

This can be closed now that #548 has been merged, I believe.

Thank you for calling this out @judahrand ! Added "Resolves #542" as a comment into #548 for traceability and manually closing this issue.

codigo-ergo-sum added bug Something isn't working triage labels Apr 8, 2022

judahrand mentioned this issue Apr 14, 2022

Refactor deduplicate() arguments #548

Merged

15 tasks

dbeatty10 closed this as completed May 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documentation Issues with New "deduplicate" macro #542

Documentation Issues with New "deduplicate" macro #542

codigo-ergo-sum commented Apr 8, 2022 •

edited

Loading

dbeatty10 commented Apr 14, 2022

codigo-ergo-sum commented Apr 14, 2022

judahrand commented Apr 14, 2022

judahrand commented Apr 14, 2022

judahrand commented Apr 14, 2022 •

edited

Loading

judahrand commented Apr 14, 2022

judahrand commented May 16, 2022

dbeatty10 commented May 17, 2022

Documentation Issues with New "deduplicate" macro #542

Documentation Issues with New "deduplicate" macro #542

Comments

codigo-ergo-sum commented Apr 8, 2022 • edited Loading

Describe the bug

Steps to reproduce

Expected results

Actual results

Screenshots and log output

System information

Are you interested in contributing the fix?

dbeatty10 commented Apr 14, 2022

codigo-ergo-sum commented Apr 14, 2022

judahrand commented Apr 14, 2022

judahrand commented Apr 14, 2022

judahrand commented Apr 14, 2022 • edited Loading

judahrand commented Apr 14, 2022

judahrand commented May 16, 2022

dbeatty10 commented May 17, 2022

codigo-ergo-sum commented Apr 8, 2022 •

edited

Loading

judahrand commented Apr 14, 2022 •

edited

Loading