Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dxvk] Periodically flush handles for long-running queries #3273

Closed
wants to merge 1 commit into from
Closed

[dxvk] Periodically flush handles for long-running queries #3273

wants to merge 1 commit into from

Conversation

CFSworks
Copy link
Contributor

@CFSworks CFSworks commented Mar 2, 2023

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does.

While this is misuse of the D3D API and is definitely a bug in the app, DXVK ought to tolerate this situation at least as well as D3D itself apparently does. This change will try to free up Vulkan query handles if they're in the available state, tracking their totals on the DxvkGpuQuery object itself instead.

As a bonus, this change should also reduce the CPU time consumed by repeated GetData(...) calls, since it avoids redundantly summing query handle results.

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a
query, and then only afterward decide that they don't actually want
the results, returning the query object back to the engine's internal
query pool without first calling End(...). The query thus continues
to remain active until the engine decides to allocate the query for
another purpose... if it ever does.

While this is misuse of the D3D API and is definitely a bug in the
app, DXVK ought to tolerate this situation at least as well as D3D
itself apparently does. This change will try to free up Vulkan query
handles if they're in the available state, tracking their totals on
the DxvkGpuQuery object itself instead.

As a bonus, this change should also reduce the CPU time consumed by
repeated GetData(...) calls, since it avoids redundantly summing
query handle results.
Copy link
Owner

@doitsujin doitsujin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for debugging that and coming up with a solution; I think the concept is good but I'll want to do some refactoring on the query code before addressing this.

Do you happen to have an apitrace or something to test the changes with? If not, I can hack a quick test app as well in a few minutes as well, would just be nice to see how bad the problem is in practice.

@@ -78,13 +73,44 @@ namespace dxvk {


void DxvkGpuQuery::addQueryHandle(const DxvkGpuQueryHandle& handle) {
// Some apps forget to End() their queries; instead of accumulating query
// handles indefinitely, try to clean them up periodically.
if (m_handles.size() >= 256)
Copy link
Owner

@doitsujin doitsujin Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to do this on every call? 256 queries as a threshold is a lot for a single query object, and games tend to have thousands of query objects. At the same time, we'll very rarely have more than one Vulkan query per D3D query in well-behaved games, so there won't really be any overhead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You guessed it in your edit: the reason is to avoid overhead. Checking a deque's size, comparing against a constant, and conditionally branching over a call is one or two orders of magnitude cheaper than a vkGetQueryPoolResults which will likely only come back with VK_NOT_READY if run on a query handle so recently allocated.

Yeah, 256 is a pretty big threshold! The vast, vast majority of D3D queries out there in the wild are used responsibly and don't come anywhere near that number, which I picked since I didn't want to spend time seeking out the not-as-well-behaved games which do use multiple Vulkan queries per D3D query to make sure this new codepath wasn't causing some regression somewhere. I chose 256 conservatively to be sure that this would only affect "leaked" D3D queries. (And, let's remember that the old threshold was actually infinity, and while games do tend to have thousands of query objects, I only observed about a dozen being leaked in this manner.)

If m_handles is to become a small_vector, it does make sense to make the threshold the same as its preallocated region, however.


mutable DxvkQueryData m_queryData;
mutable DxvkGpuQueryHandle m_handle;
mutable std::deque<DxvkGpuQueryHandle> m_handles;
Copy link
Owner

@doitsujin doitsujin Mar 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly dislike this mutable spam; if methods modify internal state then they shouldn't be const, especially since doing it like this hides potential thread safety concerns.

If anything, this highlights that the current implementation is bugged (in that accesses to m_ended do not enforce memory order).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like mutable either, though I wanted to avoid a pointer to an extra struct that would have just meant yet one more allocation.

Since DxvkGpuQuery::getData() is part of the "public" (D3D layer facing) API, it seemed appropriate to keep it const as it is indeed logically const. If that isn't The DXVK Way™, it's easy enough to make getData() non-const.

I can replace m_ended with an atomic if you'd like, though this is a separate change and I'd rebase this PR to put that commit deeper than this one.

m_handles.pop_front();
}

if (m_handle.queryPool) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably easier to turn this whole mess into a small_vector. The main reason why there are two structures at all was to avoid allocations in the common case (1 query), but we have better ways to deal with this now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The concern that I have here is that the small_vector's equivalent to pop_front() (erase(0)) is O(n), which would make the while loop O(n²). I suppose this could be mitigated if small_vector::erase() got a range overload, which could then be called once after the loop finishes; flushHandles() would be back to O(n).

Either way, that's a separate change so I'd be rebasing the PR to add a commit if I do it.

@CFSworks
Copy link
Contributor Author

CFSworks commented Mar 2, 2023

Thank you for the expedient and thorough review!

Do you happen to have an apitrace or something to test the changes with? If not, I can hack a quick test app as well in a few minutes as well, would just be nice to see how bad the problem is in practice.

The best I have is a RenderDoc capture, but I haven't found a way to get it to replay the Begin(Query ...) events. All of my repro work has been in Halo MCC itself. It renders full frames even with the pause menu active.

But, the problem is pretty bad: RSS growth of ~130MiB/minute at 60FPS and a new query pool (and associated GPU buffer mmap) every two seconds or so.

@vimpostor
Copy link

Some apps (e.g. Halo: MCC) misuse D3D11 queries: they'll Begin(...) a query, and then only afterward decide that they don't actually want the results, returning the query object back to the engine's internal query pool without first calling End(...). The query thus continues to remain active until the engine decides to allocate the query for another purpose... if it ever does.

@CFSworks Just out of curiosity, is this the memory leak that only happens in the Halo 1 campaign?

If yes, than it is pretty awesome that this got fixed in dxvk, because the memory leak even happens on Windows and hasn't been fixed there for months, even though the devs are aware of the issue.

@CFSworks
Copy link
Contributor Author

@CFSworks Just out of curiosity, is this the memory leak that only happens in the Halo 1 campaign?

It sounds about right for the timeframe and everything. I was stuck at home sick for most of February and decided to play Halo 1's campaign to pass the time, but was plagued with OOMs after about 30 minutes of play (and progressively-worsening stuttering, probably from Linux's memory management trying to cope with the leak). I haven't tried the other campaigns or multiplayer, so I can't confirm whether my problem was unique only to Halo 1.

I suppose another way of confirming that it's the same issue would be to run Halo 1 on Windows but with dxvk instead of D3D11, and then see if it resolves the leak?

If yes, than it is pretty awesome that this got fixed in dxvk, because the memory leak even happens on Windows and hasn't been fixed there for months, even though the devs are aware of the issue.

Wow, I had no idea that this was affecting the Windows players too. I'm used to things like this being, "well the game itself is technically bugged because it's misusing the Windows API, but Windows itself seems to tolerate it so [dxvk/wine/whatever] should as well" so I never checked my (wrong) assumption that that would be the case here... Guess D3D11 itself doesn't tolerate this query misuse any better, which means dxvk is doing something better than the library it's trying to emulate! :)

I'd thought about sending in a report to the devs, pointing out their exact misuse of the API and giving some tips to finding the actual culprit code. Perhaps I/we still should? Maybe the reason it's gone unfixed for so long is they don't know where to look?

@vimpostor
Copy link

vimpostor commented May 19, 2023

It sounds about right for the timeframe and everything. I was stuck at home sick for most of February and decided to play Halo 1's campaign to pass the time, but was plagued with OOMs after about 30 minutes of play (and progressively-worsening stuttering, probably from Linux's memory management trying to cope with the leak).

Yeah from your description it's pretty clear that we are indeed talking about the same memory leak, I don't think we have to do any further testing.

I'm used to things like this being, "well the game itself is technically bugged because it's misusing the Windows API, but Windows itself seems to tolerate it so [dxvk/wine/whatever] should as well" so I never checked my (wrong) assumption that that would be the case here...

Wow, pretty impressive that you still were able to find the issue, even without the Halo source code available, while not even 343I was able to find it with the code available :D Reminds me of the time where this happened in GTA V.

I'd thought about sending in a report to the devs, pointing out their exact misuse of the API and giving some tips to finding the actual culprit code. Perhaps I/we still should?

I agree, contacting the devs would be a good idea. Do you want to go ahead? :)

P.S. Stuff like this is why I enjoy the Linux gaming ecosystem. Sometimes it feels like a lot of workarounds, but then occasionally we have a gem like this one, where we are able to fix issues that seem unfixable even on Windows.

@CFSworks
Copy link
Contributor Author

Wow, pretty impressive that you still were able to find the issue, even without the Halo source code available, while not even 343I was able to find it with the code available :D Reminds me of the time where this happened in GTA V.

Counterintuitively, I've identified more of these kinds of bugs in closed-source software than open-source. My guess is it's because they're actually pretty low-hanging fruit when you have the source and know what you need to look for (i.e. you are a user who is actively being bitten by the bug, not a developer who has only seen reports), so in the open-source world, these things get fixed pretty quickly -- long before I come along to encounter them. Even your GTA V example would fit the pattern: building the app with debug information and then using a profiler would have revealed strlen() eating a lot of CPU at load time straight away; no need for the tenacity and skillset of someone like t0st.

But then in the closed-source world, you either need a developer to prioritize that issue enough to take the time to nail it down (hard when there's more issues than developers, harder still when the most skilled devs have even more on their plates, and the developers aren't necessarily also users/players so they might not have quite the same personal motivation to fix it)... or you need the tenacity and skillset of someone like t0st. :)

I agree, contacting the devs would be a good idea. Do you want to go ahead? :)

Well, I replied to the 343I employee on the Reddit thread you linked. My hope is this will generate a few upvotes and make this enough of a "squeaky wheel with an easy fix" that someone over there either finally nails it down or shoots me a message to ask for more details. I'd thought about also opening a support ticket but those teams usually don't/can't escalate something directly to engineering, no matter how technical the reporter is -- any other way you think I should be getting their attention besides just Reddit?

@vimpostor
Copy link

Counterintuitively, I've identified more of these kinds of bugs in closed-source software than open-source.

Yeah, low hanging fruit are definitely easier to get fixed in free software.

any other way you think I should be getting their attention besides just Reddit?

For now this should be enough. If they still don't bother to reply in a few weeks, we can try to contact them in another way.

Personally I am also very interested in how the hell you debugged this, debugging software without source code available always felt like impossible to me.

So there is also the alternative strategy of writing up a blogpost for this, reach Hackernews front page, which will surely get the devs attention then. :D

(Seriously though, don't feel obliged to write a blog post just because I said this)

@vimpostor
Copy link

Looks like this might have indeed helped 343I fix the memory leak upstream: The latest patch fixes the problem:

Resolved a memory leak in Halo: Combat Evolved Campaign missions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants