Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Documentation Issues with New "deduplicate" macro #542

Closed
codigo-ergo-sum opened this issue Apr 8, 2022 · 8 comments · Fixed by #548
Closed

Documentation Issues with New "deduplicate" macro #542

codigo-ergo-sum opened this issue Apr 8, 2022 · 8 comments · Fixed by #548
Labels
bug Something isn't working triage

Comments

@codigo-ergo-sum
Copy link
Contributor

codigo-ergo-sum commented Apr 8, 2022

Describe the bug

The new "deduplicate" macro has a broken link at the top of the page to the documentation section on the front page of the repo. First link is https://github.com/dbt-labs/dbt-utils#deduplicate but then the second link is https://github.com/dbt-labs/dbt-utils#deduplicate-source. Could this be fixed?

Also, the macro itself is a bit confusing. I understand deduplication in general and I'm having a bit of a hard time understanding how this function is intended to be used and what some of the parameters are for. Possible to beef this up?

Steps to reproduce

Same as above.

Expected results

Links work on the documentation page for the repo and the macro is clearly understandable along with all parameters to be used with it, whether required or optional.

Actual results

N/A

Screenshots and log output

N/A

System information

N/A

The output of dbt --version:
N/A

Are you interested in contributing the fix?

Potentially

@codigo-ergo-sum codigo-ergo-sum added bug Something isn't working triage labels Apr 8, 2022
@dbeatty10
Copy link
Contributor

@judahrand

Do you have insight on how this deduplicate function is intended to be used and what each of the parameters are for?

Here's my guesses, but would be great to hear your thoughts (and suggestions for word-smithing):

Args:

  • relation (required): a Relation (a ref or source) whose rows you wish to deduplicate
  • group_by (required, default=''): attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string).
  • order_by (optional, default=none): attribute names (or expressions) that determine the priority order of which row should be chosen first if there are duplicates (comma-separated string).
  • relation_alias (optional, default=''): will prefix all generated fields with an alias (relation_alias.field_name).

@codigo-ergo-sum
Copy link
Contributor Author

Part of what confused me about this is the name "group_by" for the second parameter. That implies aggregation to me, but this is not doing the logical operation of aggregation (with the exception of the BQ-specific version of the macro which has to do an involved workaround due to BQ issues with window functions not being able to successfully process large volumes of data.) This is really doing a window function. It would properly be more helpful to call the second parameter something like "key_columns" or "deduplication_columns." Also, purely from reading the code, I think instead of saying attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string) it seems a bit more intuitive to me to say column names (or expressions) to use to identify a set/window of rows out of which to select one as the deduplicated row.

Also, I'm confused how order_by can be optional. When using row_number() on Snowflake the order_by clause is not optional: https://docs.snowflake.com/en/sql-reference/functions/row_number.html. You don't have to provide it on Redshift, but if you don't, the result is nondeterministic which seems to me to be a really bad idea: https://docs.aws.amazon.com/redshift/latest/dg/r_WF_ROW_NUMBER.html

@judahrand
Copy link
Contributor

@judahrand

Do you have insight on how this deduplicate function is intended to be used and what each of the parameters are for?

Here's my guesses, but would be great to hear your thoughts (and suggestions for word-smithing):

Args:

  • relation (required): a Relation (a ref or source) whose rows you wish to deduplicate
  • group_by (required, default=''): attribute names (or expressions) that compose a key for which to eliminate any duplicates (comma-separated string).
  • order_by (optional, default=none): attribute names (or expressions) that determine the priority order of which row should be chosen first if there are duplicates (comma-separated string).
  • relation_alias (optional, default=''): will prefix all generated fields with an alias (relation_alias.field_name).

That's all correct with the exception of relation_alias. The relation_alias argument is intended to allow you to reference a CTE rather than the Relation you actually pass. I added this as I noticed that a common use pattern was to filter a relation and then deduplicate (for example to take advantage of table partitions). An example of this can be seen in the tests: https://github.com/dbt-labs/dbt-utils/blob/33299334a305d0acb99ebee8cc2eb6eb2ba5ca31/integration_tests/models/sql/test_deduplicate.sql

@judahrand
Copy link
Contributor

Also, I'm confused how order_by can be optional. When using row_number() on Snowflake the order_by clause is not optional: https://docs.snowflake.com/en/sql-reference/functions/row_number.html.

@codigo-ergo-sum Oh, I missed that in the docs! It is optional for all the other DBs but maybe you're right and it shouldn't be optional.

@judahrand
Copy link
Contributor

judahrand commented Apr 14, 2022

I've just had a look at this and I think we can do away with the confusing relation_alias by writing the default implementation a bit more smartly. I've taken a crack at some of the issues raised here in #548.

@judahrand
Copy link
Contributor

It would properly be more helpful to call the second parameter something like "key_columns" or "deduplication_columns."

I'm not sure that including columns in these arguments is a good idea. They may well not just be column names! For example, what is currently called order_by really is an order by clause - it is expected to have asc and desc keywords to set the order.

Similarly, group_by may well contain expressions! For example, I might have a timestamp column buy I actually want to deduplicate per day - so I'd include date(timestamp_col) in group_by, which isn't a column.

Suggesting that these are just column names, I think, would introduce additional confusion. Perhaps we can just change group_by to partition_by? Since that's more accurate.

@judahrand
Copy link
Contributor

This can be closed now that #548 has been merged, I believe.

@dbeatty10
Copy link
Contributor

This can be closed now that #548 has been merged, I believe.

Thank you for calling this out @judahrand ! Added "Resolves #542" as a comment into #548 for traceability and manually closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants