[ADAP-548] [Feature] Flag to opt out of `order by` when cluster keys are specified #606

jaypeedevlin · 2023-05-12T01:05:15Z

Is this your first time submitting a feature request?

I have read the expectations for open source contributors
I have searched the existing issues, and I could not find an existing issue for this feature
I am requesting a straightforward extension of existing dbt-snowflake functionality, rather than a Big Idea better suited to a discussion

Describe the feature

When we add cluster keys to an incremental model, there is an order by added to the model SQL that sorts the model before the merge.

With extremely large tables (ie those most likely to require clustering), this 'front-loads' some of the clustering into the model run itself. I find this undesirable for two reasons:

It increases the length of our dbt runs as opposed to letting the entire cluster process happen async
It 'conceals' some of the cost associated with clustering — this sort is not part of the clustering credits consumed and it's harder to recognise that cost when understanding the true cost of your clustering on an ongoing basis.

I am proposing to add a flag, where the default value is to keep things as is, and when the flag is True that the order by is not added to the model. I do actually think that a change in default behaviour would be desirable for most folks, but I feel like I'm more likely to get a greenlight on a version of this that doesn't alter behavior.

Describe alternatives you've considered

Overriding the macro or creating a custom materialisation.

Who will this benefit?

Folks with big data™.

Are you interested in contributing this feature?

Yes!

Anything else?

No response

The text was updated successfully, but these errors were encountered:

dbeatty10 · 2023-05-12T14:52:25Z

The order by topic came up recently in #585, so very timely feature request @jaypeedevlin 🤩 !

💡 While we wouldn't take it lightly, we're open to changing the current behavior that adds order by when cluster keys are specified.

For context, discussion surrounding the original clustering implementation for dbt SQL models is in these pull requests:

A key thing for us to do would be to read those earlier discussions to thoroughly understand their rationale. In particular, I think we'd need to understand the performance implications of the (8) combinations of these scenarios:

table vs incremental materialization
automatic_clustering vs. not
order by included vs. not

Could you give a read though the comments in those pull requests (dbt-labs/dbt-core#1591 in particular) and let me know how it affects your thinking?

I need to do a more thorough reading over there as well.

github-actions · 2023-08-11T01:43:32Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

jaypeedevlin · 2023-08-11T03:29:21Z

Commenting to leave this open, I intend to respond soon!

MichaelGFagan · 2023-10-13T14:59:40Z

Adding a comment here to bump this as we have run into the same issue as @jaypeedevlin.

@dbeatty10, I might be missing something, but why would the performance implications be a blocker if the suggestion is to add an optional flag that allows users to alter the default behavior? I think @jaypeedevlin is correct that removing the order by as the default is likely better for the use-cases that get the most benefit from clustering, but our team would much rather have the option to turn off the default behavior rather than having to carry the tech debt burden of overriding the current create_table_as macro/rolling our own materialization.

Edit: Also, from what I can tell, there's no ordering that's done when the incremental is done with the default temp view.

Slanman3755 · 2024-01-22T20:45:07Z

@dbeatty10 I have quickly reviewed the old pull requests and feature requests. I believe the decision to add the "order by" to the cluster key config was an over optimization that was influenced by old snowflake clustering behavior and best practices of other users. I do not see any reason the "order by" should be forced upon a normal DBT user and it should at the very least be optional.

I don't have historical Snowflake documentation in front of me but I believe at some point they recommended "clustering" your own tables by sorting data on insert. Around the time the cluster key feature was added to dbt for Snowflake, a big change was made in how Snowflake clustering works, which was the removal of manual reclustering. Review of the github discussion seems to suggest the "order by" was added in an attempt to supplement the removal of the manual "recluster" command by manually "clustering" the data before it was inserted into the new table by first sorting it.

Snowflake clustering now works through background tasks operated by Snowflake. They call this auto clustering. Snowflake does not expect data to be pre-sorted or clustered when creating or inserting into a clustered table. Snowflake's background process will recluster the table at an indeterminate time later. Theoretically there could be a use case where you could improve query performance between the insert/creation time and the reclustering process execution by pre-sorting. This is very niche though and I do not believe this should be forced upon the user.

The result of the current behavior is that DBT is effectively front-loading the work of clustering by first sorting the data. This can potentially be a very expensive operation. Snowflake auto clustering will automatically recluster and sort this data later (most likely more efficiently and cheaper) so there is no reason for DBT to do so. This is no longer a recommended practice by Snowflake and there is normally no reason for a user to do this unless they are very concerned about query performance before reclustering occurs.

Additionally from a user experience perspective, I believe adding an expensive operation such as an "order by" behind a seemingly unrelated feature is deceptive to the user. Maybe I am biased from my own experience but I am confident that there are many DBT users that have suffered from poor performance due to this "order by" and have no idea it was even added to their query.

Slanman3755 · 2024-01-22T20:47:14Z

Snowflake automatic clustering documentation: https://docs.snowflake.com/en/user-guide/tables-auto-reclustering
tldr; There is no longer a need for DBT to try and cluster/sort for Snowflake.

Slanman3755 · 2024-08-20T20:43:09Z

Still waiting on feedback on this issue.

colin-rogers-dbt · 2024-10-11T19:28:34Z

It seems like we want to change the default eventually to not perform the order by. We would want to do this with a combination of Behavior Flags and a new model/profile config where we default to the existing behavior in the new config and then change the behavior to default to not performing the order by

jaypeedevlin added enhancement New feature or request triage labels May 12, 2023

github-actions bot changed the title ~~[Feature] Flag to opt out of order by when cluster keys are specified~~ [ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified May 12, 2023

dbeatty10 added awaiting_response and removed triage labels May 12, 2023

github-actions bot added the Stale label Aug 11, 2023

github-actions bot added triage and removed awaiting_response labels Aug 11, 2023

dbeatty10 added awaiting_response and removed Stale triage labels Aug 11, 2023

github-actions bot added triage and removed awaiting_response labels Oct 13, 2023

Slanman3755 linked a pull request Feb 27, 2024 that will close this issue

[ADAP-548] Make cluster key sorting optional on table creation #917

Open

4 tasks

amychen1776 removed the triage label Jul 24, 2024

dtiesling mentioned this issue Sep 6, 2024

[Feature] dynamic table cluster_by support #1166

Open

3 tasks

amychen1776 added the performance label Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ADAP-548] [Feature] Flag to opt out of `order by` when cluster keys are specified #606

[ADAP-548] [Feature] Flag to opt out of `order by` when cluster keys are specified #606

jaypeedevlin commented May 12, 2023

dbeatty10 commented May 12, 2023

github-actions bot commented Aug 11, 2023

jaypeedevlin commented Aug 11, 2023

MichaelGFagan commented Oct 13, 2023 •

edited

Loading

Slanman3755 commented Jan 22, 2024 •

edited

Loading

Slanman3755 commented Jan 22, 2024

Slanman3755 commented Aug 20, 2024

colin-rogers-dbt commented Oct 11, 2024

[ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified #606

[ADAP-548] [Feature] Flag to opt out of order by when cluster keys are specified #606

Comments

jaypeedevlin commented May 12, 2023

Is this your first time submitting a feature request?

Describe the feature

Describe alternatives you've considered

Who will this benefit?

Are you interested in contributing this feature?

Anything else?

dbeatty10 commented May 12, 2023

github-actions bot commented Aug 11, 2023

jaypeedevlin commented Aug 11, 2023

MichaelGFagan commented Oct 13, 2023 • edited Loading

Slanman3755 commented Jan 22, 2024 • edited Loading

Slanman3755 commented Jan 22, 2024

Slanman3755 commented Aug 20, 2024

colin-rogers-dbt commented Oct 11, 2024

[ADAP-548] [Feature] Flag to opt out of `order by` when cluster keys are specified #606

[ADAP-548] [Feature] Flag to opt out of `order by` when cluster keys are specified #606

MichaelGFagan commented Oct 13, 2023 •

edited

Loading

Slanman3755 commented Jan 22, 2024 •

edited

Loading