Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove trivial nodes before building subdag #7194

Merged
merged 5 commits into from
Apr 7, 2023

Conversation

ttusing
Copy link
Contributor

@ttusing ttusing commented Mar 20, 2023

Resolves #7195

Description

This implements the "smart trimming" suggested in (#7195).

Local testing on my project:
DBT 1.4.4:

dbt build --select modelname
00:31:26  Running with dbt=1.4.4
00:31:27  Found 307 models, 3708 tests...
00:34:02  
00:34:08 
00:34:09  Concurrency: 16 threads (target='default')
00:34:09  
00:34:09  1 of 179 START sql table model  .................................... [RUN]

Build time: 2 minutes 45 seconds

Version 1.5.0-b4:

dbt build --select modelname
00:39:02  Running with dbt=1.5.0-b4
00:39:05  Found 307 models, 3708 tests...
00:41:14  
00:41:19  
00:41:20  Concurrency: 16 threads (target='default')
00:41:20  
00:41:20  1 of 179 START sql table model  .................................... [RUN]

Build time: 2 minutes 18 seconds

This branch:

dbt build --select modelname
05:28:42  Running with dbt=1.5.0-b4
05:28:45  Found 307 models, 3708 tests...
05:28:45  
05:28:52  
05:28:52  Concurrency: 16 threads (target='default')
05:28:52  
05:28:52  1 of 179 START sql table model  .................................... [RUN]

Build time: 10 seconds

Checklist

@cla-bot
Copy link

cla-bot bot commented Mar 20, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @ttusing

@cla-bot
Copy link

cla-bot bot commented Mar 20, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @ttusing

@ttusing ttusing changed the title remove trial nodes before building subdag Remove trial nodes before building subdag Mar 20, 2023
@ttusing ttusing changed the title Remove trial nodes before building subdag Remove trivial nodes before building subdag Mar 20, 2023
@ttusing ttusing marked this pull request as ready for review March 20, 2023 01:05
@ttusing ttusing requested a review from a team March 20, 2023 01:05
@ttusing ttusing requested a review from a team as a code owner March 20, 2023 01:05
@ttusing ttusing requested review from stu-k and ChenyuLInx March 20, 2023 01:05
@ttusing
Copy link
Contributor Author

ttusing commented Mar 20, 2023

Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA.

In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above above. If you have questions about the CLA, or if you believe you've received this message in error, please reach out through a comment on this PR.

CLA has not been signed by users: @ttusing

Believe this is in error! Filled it out a few minutes before opening PR, might be a timing issue.

remove comment
@cla-bot cla-bot bot added the cla:yes label Mar 20, 2023
core/dbt/graph/graph.py Outdated Show resolved Hide resolved
@peterallenwebb
Copy link
Contributor

Overall this looks like an exciting improvement. We're thinking through whether it is likely to work well for all of the graph topologies that come up in practice, but I'm pretty optimistic about that.

@ttusing ttusing requested review from peterallenwebb and removed request for ChenyuLInx, stu-k and iknox-fa March 25, 2023 08:58
@ttusing
Copy link
Contributor Author

ttusing commented Mar 25, 2023

Overall this looks like an exciting improvement. We're thinking through whether it is likely to work well for all of the graph topologies that come up in practice, but I'm pretty optimistic about that.

I think this is strictly better than random selection in all cases, especially with switching to the product of in_degree and out_degree.

@peterallenwebb
Copy link
Contributor

@iknox-fa Looks like the issues we discussed have been addressed here. What do you think about this PR?

@ttusing
Copy link
Contributor Author

ttusing commented Apr 4, 2023

@iknox-fa Looks like the issues we discussed have been addressed here. What do you think about this PR?

@iknox-fa If part of the hesitation is the pruning part and the sorting is good to go, I can open a separate PR.

@peterallenwebb
Copy link
Contributor

@iknox-fa @jtcohen6 (cc: @leahwicz)

I'd like to approve this PR and get it into the RC for 1.5, and what follows is my case for doing that.

What we know:

  • The code is correct. There is no disagreement on that point.
  • It is 5-10x faster on long runs involving narrow node selections on large graphs, as benchmarked by @ttusing and at least one other user.
  • The code involves heuristic changes to the order in which nodes are removed from the graph, and there is a strong basis for concluding that the heuristic will improve performance across a variety of graph types.

Based on my general algorithmic/performance experience, I'd guess the following:

  • There is a 85% chance that this is faster for all users all of the time.
  • There is a 10% chance that this is faster for users with graphs similar to the ones the change has been tried on, but no change or slightly slower for other graphs.
  • There is 5% chance that the algorithm is significantly slower for some users, even though it dramatically improves performance in other cases.

Since we will have time to test the performance in the wild during the RC interval, I think it is worth a 5% risk of some performance regressions if it lets us dramatically improve the experience of many users in the 95% case. Even if there are performance regressions, I believe we could probably address them by rolling forward instead of back, but rolling back is easy.

@peterallenwebb peterallenwebb dismissed their stale review April 7, 2023 15:01

Changes addressed

@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 7, 2023

@peterallenwebb Thanks for laying out the rationale - I'm down to give this a shot, and merge for inclusion in v1.5.0-rc1.

@leahwicz
Copy link
Contributor

leahwicz commented Apr 7, 2023

@peterallenwebb thanks as well for the analysis. I feel good with getting this in the RC based off of that and monitoring to see if we see any regressions in perf from different DAG shapes.

core/dbt/graph/graph.py Outdated Show resolved Hide resolved
@peterallenwebb peterallenwebb merged commit 7045e11 into dbt-labs:main Apr 7, 2023
@ttusing ttusing deleted the CT-2315/remove_trival_nodes branch April 7, 2023 16:10
emmyoop pushed a commit that referenced this pull request Apr 11, 2023
* remove trial nodes before building subdag

* add changie

* Update graph.py

remove comment

* further optimize by sorting node search by degree

* change degree to product of in and out degree
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2316] [Feature] Improve subset_graph selection by removing trivial nodes and improving removal order
4 participants