Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2316] [Feature] Improve subset_graph selection by removing trivial nodes and improving removal order #7195

Closed
3 tasks done
ttusing opened this issue Mar 20, 2023 · 5 comments · Fixed by #7194
Closed
3 tasks done
Assignees
Labels
enhancement New feature or request performance

Comments

@ttusing
Copy link
Contributor

ttusing commented Mar 20, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

This is a performance suggestion separate from but similar to ideas explored in #7191, #6073 .

I have a DBT project with ~300 models and ~4200 tests. When selecting a single model to build, dbt build takes ~3 minutes to start. This is a major hurdle for development iterations.

I have diagnosed that this is because of the performance of subset_graph using record_timing_info. Using dbt run for the same model builds almost instantly, as does dbt build for the entire DAG.

There appear to be a number of major efficiency gains possible for this function. I propose a "smart trimming" of the DAG before starting the more expensive calculation in the function.

The function currently (randomly?) chooses a node, and if it is not in the selection, connects all immediate dependents and successors, then removes the node and continues iterating. This is presumably to preserve build order when nodes aren't directly connected and perhaps for other implications I haven't considered.

For a variety of reasons described in related issues, this can be expensive and cause the creation of many new edges, which most commonly will be deleted later. I see spikes of memory usage above 1 gig on my local machine when building for a single node and not above 200MB for the full DAG.

#7191 describes some of these situations. For an additional example, one can imagine two wide tables with lots of tests, where removing one of the tables causes all tests in model A to be linked to model B on the scale of tens of thousands or hundreds of thousands of new connections, only for all tests and models to be removed later because they were not in the selector.

We can avoid these expensive operations to add new edges by first addressing the removal of nodes that do not require adding new edges. These trivial nodes are nodes with degree 0 or 1 (and are not part of the selector). If a node is not linked to at least 2 other nodes, removing it definitally does not require constructing new edges. By iterating over removing trivial nodes, the graph will collapse on a much simplier subgraph that the existing algorithm can much more efficiency prune.

Similarly, when moving on to remove non-selected nodes from the DAG, starting with the lowest degree nodes is likely to improve performance by reducing the number of edges added.

In the case of many common simple selectors, like single models with tests or simple paths, this algorithm gives the full intended subgraph without needing to invoke the more expensive algorithm.

Describe alternatives you've considered

There are other approaches to making this function more efficient, including an alternate trimming approach in #6073 (comment). It appears that suggestion ultimately did not end up being accepted into dbt-core for not making efficiency gains. My suggestion should be compatible with most other efficiency suggestions.

Who will this benefit?

  • Users of DBT with many nodes, especially tests
  • Developers running a single model with dbt build during development

Are you interested in contributing this feature?

Yes!

Anything else?

No response

@ttusing ttusing added enhancement New feature or request triage labels Mar 20, 2023
@github-actions github-actions bot changed the title [Feature] Improve subset_graph selection by first removing trivial nodes [CT-2316] [Feature] Improve subset_graph selection by first removing trivial nodes Mar 20, 2023
@ttusing
Copy link
Contributor Author

ttusing commented Mar 20, 2023

I think this can get even better. In addition to nodes with 1 degree, any nodes with 0 parents OR 0 children can be safely removed, even if they have higher degrees.

These are quickly removed in the existing algorithm since they add no edges, but this could help collapse especially big/complex DAGs more efficiently.

@ttusing
Copy link
Contributor Author

ttusing commented Mar 20, 2023

I added an additional change to #7194 to sort nodes by degree before searching after doing the trimming, and it reduced my build from 26 seconds down to 10 (!).

@ttusing
Copy link
Contributor Author

ttusing commented Mar 21, 2023

I've been thinking about/refining this further. Here are some of the big open questions that I have to inform how to address this problem:

  • What are the common graphs shapes and sizes that people have in DBT?
  • In relation to these graphs, what are the common selectors that people are using?
  • Will this issue still be a major slowdown point after moderate optimization, or is this sufficient?
  • How big of a DAG is DBT looking to support at what levels? At what node counts is a slowdown expected for common operations? I imagine the considerations at 5k, 10k, 50k, 100k, and 1m nodes are quite different (we hit DBT cloud memory limits at a much lower level of nodes than I expected, for instance https://getdbt.slack.com/archives/CMZ2V0X8V/p1676927241522179)
  • Is this get_subset_graph functionality actually important in all situations? Would an option for a faster selection that has less functionality be useful?

The current implementation in DBT 1.4 that constructs the graph somewhat randomly being improved by #7194 (removing the trivial nodes and doing 1 sort after by degree) improved my runtimes because of some of these properties:

  • I am running a single node and tests, which is a simple connected graph
  • My project's DAG has a lot of tests, sources, etc., which are trivial nodes here that get pruned quickly

I suspect that a huge proportion of builds that people do and projects share these properties. I suspect most common selectors are:
dbt build (trivial to build DAG for)
dbt build -s model
dbt build -s n+model+m

All of which are connected. I also think this pattern is probably common:
dbt build -s n+modelA+m n+modelB+m

I use something like this in CI for changed nodes and their decedents, and also DBT project evaluator. These situations are not always simply connected.

It seems like there is room for improvement by:

  • expanding the identification of trivial nodes
  • modifying the core loop to detect and trim the most trivial nodes first in more situations.

There are so many ways that this could improved with the knowledge of the selectors beforehand, the node types, etc. I could imagine if a selector is less than say 10% of nodes and is simply connected or a small number of simply connected subgraphs it might be faster to "build up" rather than trim down the DAG.

Rough draft PR in my own repo that implements some of these ideas: https://github.com/ttusing/dbt-core/pull/1/files

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 22, 2023

@ttusing Thank you for all the thought you've put into this already, and for putting up some code to support it!

From a product perspective: I don't know if dbt could (or should!) gracefully scale to DAGs of 1 million nodes. My biggest priority right now is giving teams the ability to split up large monolithic projects into multi-project deployments. Beyond 5k nodes, let alone 500, I believe project complexity inhibits effective collaboration. That said, if there are opportunities to speed up dbt-core / reduce known bottlenecks at scale, we should do it.

From a user perspective: It's especially frustrating to wait a long time when only trying to select a single model (dbt build -s model). If we can find a way to streamline that DAG construction—without introducing functional regressions for more complex node selection (e.g. cases where we need to re-add transitive edges after nodes have been removed)—then it's a change I'd love to include.

I'll leave it to folks from the Core engineering team to take a closer look at the substance of the proposed changes :)

@jtcohen6 jtcohen6 removed the triage label Mar 22, 2023
@ttusing ttusing changed the title [CT-2316] [Feature] Improve subset_graph selection by first removing trivial nodes [CT-2316] [Feature] Improve subset_graph selection by removing trivial nodes and improving removal order Apr 4, 2023
@kostek-pl
Copy link
Contributor

kostek-pl commented Apr 6, 2023

I've checked Tobie's version against our dbt project(11251 models and 34993 tests) and indeed I see a significant improvement. Time between running a single model build and when this single model is being processed decreased from 7 mins to 3 mins

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
4 participants