[Regression] 1.8.2 slower to build than 1.5.9 when tag+ includes many nodes #10434
Comments
Thanks @cajubelt! Is this only for the I suspect this might be due to the large number of tests in your project (per our conversation), and the additional time that dbt spends "linking" the DAG (adding edges from each test on an upstream model -> downstream models, so that they skip on test failure). While it's not immediately clear to me what change we made between v1.5 -> v1.8 to that logic, if you only see this slowdown on
@jtcohen6 yes, I just confirmed that dbt run doesn't have any noticeable slowdown on 1.8 with the same selector.
@cajubelt Okay! So I think the hypothesis is:
If you're up for it, there are two ways to try confirming that hypothesis by profiling dbt's performance:
We have the ability to exclude tests from the build resource types, but that doesn't affect whether or not we pass "add_test_edges" to compile. It would be possible to check the resource types and, if tests are excluded, pass "add_test_edges" = False. Of course, if you do want tests to run, that doesn't help.
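The check suggested in this comment could be sketched roughly as follows. This is only an illustration, not dbt-core's actual code: `should_add_test_edges`, its `test_types` parameter, and the resource type strings are hypothetical names chosen here. The only thing taken from the thread is that compile receives an "add_test_edges" flag.

```python
def should_add_test_edges(resource_types, test_types=("test",)):
    """Return True only if at least one test resource type is selected.

    Hypothetical helper: the name, the signature, and the "test" string
    are assumptions made for this sketch, not dbt-core's API.
    """
    return any(rt in test_types for rt in resource_types)


# With tests selected, the test edges are still needed.
with_tests = should_add_test_edges(["model", "seed", "test"])
# With tests excluded, the expensive edge-adding step could be skipped.
without_tests = should_add_test_edges(["model", "seed"])
```

As the comment notes, this only helps builds that exclude tests entirely.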
We do want to run tests if possible. @jtcohen6 here's a screenshot of a search through py-spy's output for the longest segments of the flame graph. Looks like
[Preview](https://docs-getdbt-com-git-dbeatty10-patch-2-dbt-labs.vercel.app/reference/global-configs/record-timing-info)
## What are you changing in this pull request and why?
Officially documenting the performance profiling approach mentioned in dbt-labs/dbt-core#10434 (comment).
## Checklist
- [x] Review the [Content style guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md) so my content adheres to these guidelines.
@jtcohen6 Not super sure if related to this issue but we're seeing an extreme slowdown when
I tried to figure out what's going on using the
What percentage of your graph (including tests) is selected by the tag, and what % is selected by the selector? I had someone on my team try using the I don't really understand the
If you can share the snakeviz output, that would help. Also, maybe check your memory usage? The last bug I worked on for this graph optimization stuff had performance issues from running out of memory when the DAG got too big.

If my stab in the dark here is in the right direction, maybe there is a pruning that could be done to reduce calling

If your graph has 20k nodes, 1k have the tag, 10k are descendants, and each descendant is on average the descendant of 70% of the tagged nodes, then each of the descendants has to be evaluated ~700 times. 700*10,000 is a lot of calls. After writing this out, this actually seems pretty likely XD. Think an update to

Also if the slow part is will try tests on my own project with ~500 models and ~5000 tests next week.
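To make the estimate in this comment concrete: 700 evaluations for each of 10,000 descendants is 7,000,000 evaluations. Below is a minimal sketch of why sharing one visited set across all tagged nodes could help. It is a toy, stdlib-only model, not dbt's graph code and not necessarily what any PR implements; `reach_per_root` and `reach_shared` are names made up for illustration.

```python
from collections import deque


def reach_per_root(children, roots):
    """Walk the graph once per root, counting every node evaluation."""
    reached, evaluations = set(), 0
    for root in roots:
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for child in children.get(node, ()):
                if child not in seen:
                    seen.add(child)
                    evaluations += 1
                    queue.append(child)
        reached |= seen
    return reached, evaluations


def reach_shared(children, roots):
    """Walk the graph once, with a visited set shared by all roots."""
    seen = set(roots)
    queue = deque(roots)
    evaluations = 0
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                evaluations += 1
                queue.append(child)
    return seen, evaluations


# Toy DAG: three tagged roots all feed the same chain a -> b -> c.
toy = {"r1": ["a"], "r2": ["a"], "r3": ["a"], "a": ["b"], "b": ["c"]}
roots = ["r1", "r2", "r3"]
per_root = reach_per_root(toy, roots)
shared = reach_shared(toy, roots)
```

Both functions reach the same set of nodes, but the per-root walk evaluates each shared descendant once per root, while the shared walk evaluates it once.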
I added a PR! #10526

I believe I was able to recreate the issue in my project, which has: Running

My PR applies the pruning strategy I described. I saw the runtime improve for

I am unsure how much DBT 1.9/main branch contains performance improvements... the total runtime improvement was 66 seconds, but

Feedback welcome, especially if you can test/review the change!
I've reviewed and suggested some changes to @ttusing's PR, but we should be close to a fix on this.
This is now resolved in 1.9+, but I'm re-opening this issue so we can consider backports.
I'm also interested in seeing performance reports from @stefankeidel and @cajubelt. This should help, but checking the edge type seems to be inherently slow. I wonder if the runtime will improve from 20 minutes to like 15, 10, or 5 minutes, and if there are further improvements that can be made in the edge type lookup speed. I could imagine avoiding the edge type lookup altogether somehow with the knowledge that tests cannot depend on tests. So, don't check the edge type when searching children if the node is a test (or if the parent is a test for ancestors), or something like that. Or, build the subgraph with all non-test nodes first.
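The idea of consulting the node type rather than the edge type can be sketched like this. It is a toy model under a stated assumption, namely that every outgoing edge of a test node is one of the added test edges, so a test is never expanded. `descendants_by_node_type` and the dictionaries are hypothetical and do not mirror dbt's graph classes.

```python
from collections import deque


def descendants_by_node_type(children, node_type, roots):
    """Collect descendants without looking at any edge attribute.

    Assumption for this sketch: a test's outgoing edges are all added
    test edges, so the walk simply never expands a test node.
    """
    seen = set(roots)
    queue = deque(roots)
    edges_examined = 0
    while queue:
        node = queue.popleft()
        if node_type.get(node) == "test":
            continue  # one per-node lookup replaces per-edge type checks
        for child in children.get(node, ()):
            edges_examined += 1
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - set(roots), edges_examined


# m1 -> m2 -> m3, with test t1 on m1 and added edges t1 -> m2, t1 -> m3.
toy_children = {"m1": ["m2", "t1"], "m2": ["m3"], "t1": ["m2", "m3"]}
toy_types = {"m1": "model", "m2": "model", "m3": "model", "t1": "test"}
found, examined = descendants_by_node_type(toy_children, toy_types, ["m1"])
```

In this toy graph the walk examines 3 of the 5 edges and still finds every descendant, since the two edges leaving `t1` are never read.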
Just saw the great activity on this after it fell off my radar for a bit, thanks all. Will follow up on Monday with a perf report, I have personal commitments this weekend. We'd love to see a backport to 1.8.
Thanks for the update and amazing work! I am currently traveling on a sabbatical leave so I won’t be able to test this before I’m back in November, but will make sure to do so then!

On 9. Aug 2024, at 23:55, Charlie Andrews-Jubelt wrote:
> Just saw the great activity on this after it fell off my radar for a bit, thanks all. Will follow up on Monday with a perf report, I have personal commitments this weekend.
> We'd love to see a backport to 1.8 since our project is fairly large and we are unlikely to update to 1.9 before some key deadlines this Fall.
So I'm looking at this section: dbt-core/core/dbt/compilation.py Lines 197 to 215 in 63262e9
This adds a huge number of edges. For probably all but pathological DAGs, this is an edge between every test and all nodes downstream of what it's testing. For the example of 5k models and 15k data tests, this might add approximately (5k*15k/2) = 37.5 million edges.

This was added in #7276, and I don't really understand the purpose of this code. @gshank can you explain why it is necessary to add all of these edges? It seems like a huge factor limiting DBT scaling. If the goal is to get

I'm guessing the get edge type function is slow because of the millions of edges. There are far fewer nodes (thousands), so getting node type should probably be way faster, and I am nearly certain nothing can directly depend on tests (only

Somewhat separately, I wonder about a feature enhancement to print simple DAG stats with like
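The edge estimate earlier in this comment can be checked with a small sketch. It models only the behavior as described here (one edge from each test to every node downstream of the model it tests), not the actual code in compilation.py. `descendants`, `count_test_edges`, and the toy data are illustrative names. Reading the division by 2 as "an average model has about half of the models downstream" is an assumption.

```python
def descendants(children, start):
    """Return every node reachable from start (start itself excluded)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def count_test_edges(children, tested_model):
    """Count one added edge per (test, node downstream of its model)."""
    return sum(
        len(descendants(children, model)) for model in tested_model.values()
    )


# Toy chain m1 -> m2 -> m3 -> m4 with one test on each of m1, m2, m3.
chain = {"m1": ["m2"], "m2": ["m3"], "m3": ["m4"]}
tests = {"t1": "m1", "t2": "m2", "t3": "m3"}
toy_edges = count_test_edges(chain, tests)  # 3 + 2 + 1 added edges

# The back-of-the-envelope figure from the comment.
estimate = 5_000 * 15_000 // 2
```

Even in the four-model chain, three tests add six edges, which is how the count grows toward tens of millions at the scale described.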
Ok I did some profiling last night on
Ok, ran it again with
Some stats about my project: 5799 models, 20 snapshots, 91 analyses, 18763 data tests, 186 seeds, 1 operation, 1086 sources, 69 exposures, 1899 macros
Is this a regression in a recent version of dbt-core?
Current Behavior
dbt build -s tag:my_tag+ takes about 20 minutes longer to start on dbt 1.8.2 than it does on 1.5.9 with the same tag. The tag used has a lot of downstream nodes in our project, about 11k. Generally we’re seeing better performance on 1.8 so we were surprised to see this big regression in performance.
Expected/Previous Behavior
Previously building everything downstream of a tag with lots of nodes would take a couple minutes of startup time and then begin running queries against our db. Now it takes 20+ minutes.
Steps To Reproduce
Relevant log output
No response
Environment
Which database adapter are you using with dbt?
bigquery
Additional Context
The reason we need this is that we have a selector used in CI/CD that excludes everything downstream of a tag that is upstream of many nodes. The example given is a simpler version of the original issue we found with that selector. (We tested the simpler version and found it to have the same performance issue.) The selector was something like