Performance questions #660

Closed
roninrival opened this issue Jun 15, 2017 · 15 comments
Labels: Feature (an improvement or feature), Unresolved (waiting for a fix or implementation)

Comments

@roninrival

First, let me say that I love Renderdoc - it is superior to even the "big name" tools out there.

In doing some tests to use it on my game, I noticed some significant performance degradation when renderdoc.dll is loaded. It goes from about 100fps down to 30ish on a fairly powerful computer. It doesn't matter whether I launch my game via the UI or load the .dll file from within the game before I create my D3D11 device. I've tried manually disabling all the capture options (i.e. setting them to their default values, especially CaptureAllCmdLists) with no difference. While I'd certainly expect a big spike in frame time during a capture, this happens any time I run my game.

In doing some profiling, the cost is definitely coming from CPU-side draw/d3d calls, with rendering taking as much as 5x as long. Again, with the hooking taking place, I'd understand some performance overhead, even when not actively capturing.

So, I guess my question is: is this expected behavior?

If the answer is 'yes', well, Renderdoc still rocks. However, I'd love to hear, "no, performance overhead should be minimal" and help work through a bug. This is using the latest version of Renderdoc.

Thanks for the tool!

@baldurk
Owner

baldurk commented Jun 15, 2017

Hey, some amount of overhead during idle is expected, and how much you get hit depends strongly on what your application does. It won't depend on how renderdoc is loaded though, or really on any of the capture options (except for CaptureAllCmdLists - but that one will only affect you if you use deferred contexts).

There are some possibilities for optimisation, ranging from relatively minor to more extensive. A couple of other people were asking about this, so recently I made a branch, d3d11_renderstate_opt, which improves some of the hotspots I found while profiling. For me however the overhead was already pretty low - maybe 2ms or so, and this saved about 0.5ms. You're saying the overhead might be 20ms or so (hard to say since you mentioned fps. Never use FPS for performance measurement! 😄) so I'd be interested to hear if you see any wins from trying that branch, and if so then how much. If you get big enough wins, I can see about testing it and merging it in.
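
To illustrate the point about FPS, here is a small sketch that converts the frame rates quoted in this thread into per-frame times. The helper names are made up for this example and are not part of RenderDoc.

```cpp
#include <cassert>

// Convert a frame rate into the time spent on one frame, in milliseconds.
// Hypothetical helper for illustration only.
constexpr double FrameTimeMs(double fps)
{
  return 1000.0 / fps;
}

// Extra milliseconds per frame when dropping from baselineFps to measuredFps.
inline double OverheadMs(double baselineFps, double measuredFps)
{
  assert(baselineFps > 0.0 && measuredFps > 0.0);
  return FrameTimeMs(measuredFps) - FrameTimeMs(baselineFps);
}
```

`OverheadMs(100.0, 30.0)` comes out at roughly 23.3ms, which is in line with the "20ms or so" estimate. The same helper shows why FPS is misleading: a 10fps drop from 100 to 90 costs about 1.1ms per frame, while a 10fps drop from 30 to 20 costs about 16.7ms.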

baldurk added the Unresolved (waiting for a fix or implementation) and Feature (an improvement or feature) labels Jun 15, 2017

@roninrival
Author

That build definitely was an improvement. I went from ~30fps (i.e. ~33ms per frame) to 50fps (i.e. ~20ms per frame). Drawing all instances of one specific type of renderable took about 2.5ms of CPU time with no renderdoc.dll loaded, 12ms with the release build, and 6ms with that branch.

I am doing quite a bit of drawing; about 6000 draw calls with a generated .rdc file of 250MB. So, I'm sure there is bound to be some overhead.

Anyway, thanks for the build. I hope my feedback helps, and I'd be happy to answer any more questions or do some additional tests if you'd like. I hope you do integrate those changes, as they are certainly helpful.

@baldurk
Owner

baldurk commented Jun 16, 2017

Good to hear that it has some decent performance improvements, I'll see about merging it in then.

To find out more I'll need some more information about what is slow - I know about lots of things that aren't optimal but it's not a good idea to optimise without any data.

Are you able to share your program with me so I can profile it? If you can't I understand, but it's worth asking in case it's possible 😄. If not then could you capture an ETW trace of a few seconds of idle use? If you send me that along with your renderdoc.pdb from your local release build I can try to see what is still taking up time and how easily it can be further improved.

@moradin
Contributor

moradin commented Jun 16, 2017

You might want to consider using Unreal 4 for the tests. The same kind of performance problems are visible there. Since it only affects complex scenes (or I should say it's not really measurable in simple cases) the best idea is to download some of the example content that is free and fairly well represents what some games are doing. I think this would give a pretty good idea. You can find example content from past GDCs or even the new open source Unreal Tournament has a few levels that are a good reference.

I know that this might not be the most convenient way but it's a good source of assets that are closer to real production than a programmer test level with 1000 boxes :)

@baldurk
Owner

baldurk commented Jun 16, 2017

I can certainly get latest UE4 and Unreal Tournament and profile that. I was using other shipped games for profiling before but I didn't know which ones were seeing heavy performance loss.

@moradin
Contributor

moradin commented Jun 16, 2017

I wish I could just send you a level from our project at work. We are seeing serious perf drops once we enable the renderdoc plugin in the UE4 editor. Unfortunately, I can't share anything.

@baldurk
Owner

baldurk commented Jun 16, 2017

If you can see about giving me an ETW trace (ideally from that branch above so I don't go investigating slow code that's already fixed), as long as I have matching renderdoc symbols I can investigate. That doesn't reveal any assets or code, but gives me enough to at least know where the time is being taken up. It's not completely anonymised though, as it has some machine details etc.

@moradin
Contributor

moradin commented Jun 16, 2017

I probably won't be able to provide that in the near future, not because I wouldn't want to but I probably won't have time to spend on it. We are just before a few key deliveries in our project. Sorry :/

@baldurk
Owner

baldurk commented Jun 16, 2017

Yeh no worries, just thought I'd mention it. Hopefully I can find the same roadblocks you are running into on a more accessible project.

@baldurk
Owner

baldurk commented Jun 16, 2017

I had a look at UT4 - it has a fairly bad overhead. Here are the numbers from one particular spot on a level that was reasonably complex:

  • Baseline of roughly 8ms
  • v0.34 unmodified was ~27.5ms
  • The branch above was closer to 20ms.
  • One weird thing I noticed was there's some "Razer Chroma" code that was calling LoadLibrary every frame to load some razer DLL. That was unnecessarily triggering renderdoc to iterate over all loaded modules to hook them. Fixing that brought it down to ~17ms or so.

There are some further opportunities for optimisation in different directions. From what I can see though most of the overhead still remains in the active pipeline state tracking. There is one bit of code that is just plain bad and can be improved, which at best will bring it down to ~15ms (but most likely somewhere in between), but beyond that it might take more significant changes.

In general there's no need to track state while not capturing, as it's not used for anything in particular and every state binding is queryable. However there's one big problem with that, which is circular refcounts. D3D's refcounting is a complete mess because the device can query the immediate device context, which can query all bound resources - but any resource can query the device that it was created from. Any user reference to any of these objects means that all of them must be kept alive because the user might decide to query it all back again. Only when there are no external references at all can you delete objects. (This is a more pathological version of the common case where an object that has no user references but is bound to the pipeline cannot be deleted).

Currently to break that cycle, I keep track of all the back-references to the device from all resources bound to the current pipeline state. When those equal the number of references the device has, we know that the cycle is isolated and can destroy itself. The downside is this means a bunch of extra refcounting work for any resource bound to the pipeline, and on top of that it means we can't just ignore pipeline reference counting while idle.
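
The bookkeeping described above can be sketched with a toy model. This is only an illustration under simplifying assumptions: each live resource holds exactly one back-reference on its device, and the tracked count covers resources that are kept alive by the pipeline alone. All type and function names are hypothetical and do not mirror RenderDoc's classes or the real D3D11 interfaces.

```cpp
#include <cassert>

// Toy model only. These types are hypothetical stand-ins.
struct Device
{
  int refs = 0;              // every reference: application-held plus one per live resource
  int pipelineBackRefs = 0;  // subset of refs owned by resources that only the pipeline keeps alive

  // True when nothing outside the device/pipeline cycle holds a reference.
  bool CycleIsIsolated() const { return refs == pipelineBackRefs; }
};

struct Resource
{
  Device *device;
  int appRefs = 1;     // creation hands one reference to the application
  bool bound = false;  // currently bound to the pipeline
  bool alive = true;

  explicit Resource(Device *d) : device(d) { device->refs++; }  // back-reference to the device

  bool PipelineOnly() const { return alive && bound && appRefs == 0; }
};

// Apply a state change and keep the device's tracked count in sync.
// This extra work on every bind and release is the idle cost being discussed.
template <typename Change>
void Update(Resource &r, Change change)
{
  assert(r.alive);
  if(r.PipelineOnly())
    r.device->pipelineBackRefs--;

  change(r);

  if(!r.bound && r.appRefs == 0)
  {
    r.alive = false;   // nothing keeps the resource alive any more
    r.device->refs--;  // so it drops its back-reference
  }
  else if(r.PipelineOnly())
  {
    r.device->pipelineBackRefs++;
  }
}

inline void Bind(Resource &r) { Update(r, [](Resource &x) { x.bound = true; }); }
inline void Unbind(Resource &r) { Update(r, [](Resource &x) { x.bound = false; }); }
inline void AppAddRef(Resource &r) { Update(r, [](Resource &x) { x.appRefs++; }); }
inline void AppRelease(Resource &r) { Update(r, [](Resource &x) { x.appRefs--; }); }
```

In this model, if the application holds one device reference, creates a resource, binds it and then releases its own reference to the resource, `CycleIsIsolated()` stays false because the application still holds the device. Once the application drops its device reference as well, the only remaining reference on the device comes from the bound resource, the two counts match, and the whole cycle could be torn down.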

There's probably a smarter way to solve this that doesn't require tracking pipeline state while idle. I wrote this with correctness not high performance in mind since it's hard enough to be correct.

@moradin
Contributor

moradin commented Jun 17, 2017

Great results!
The razer thing is probably a plugin that could be switched off easily but if you fixed it already then even better.
As for the refcounting, I remember we had discussions about this when I was trying to get it working on DX9 and it's indeed a huge pain. On the other hand, even DX11 deals with it the same way as you do. You probably know this even better than I do, but I think it's worth mentioning.

If you check the debug output in this article, it looks like this:

D3D11 WARNING: Live ID3D11RasterizerState at 0x000000F8DD051190, Refcount: 0, IntRef: 1 [ STATE_CREATION WARNING #437:

As you can see there is a Refcount and an IntRef, the first one being the public and the second one being the internal refcount that is incremented when binding to the pipeline or by some other operations.

Unfortunately MS doesn't provide any more details about how this refcounting is really implemented, and as far as I know there is no interface to access the internal refcount. Even if there were, the counts would change without any callback, so it wouldn't be possible to rely on it.

All in all, it should be possible to reduce the overhead as much as it is reduced in D3D but it would probably take a substantial amount of work to do so.

@baldurk
Owner

baldurk commented Jul 27, 2017

FYI the optimisations on the branch talked about above have been merged into the mainline.

@baldurk
Owner

baldurk commented Mar 6, 2018

I would be curious to hear what people have to say about the performance on the latest v1.0 branch, since it includes the optimisations above as well as some other optimisations in the v1.x series.

@moradin / @roninrival do you have any feedback on that?

@roninrival
Author

I did another test today, with 1.0 build 5ef2d0b. There is definitely a huge improvement over my original tests. Full frame time on the CPU in my test did increase by about 50%, from 10.8ms to 16.1ms. And, like I said before we do a lot of model drawing, so it's understandable to have that type of overhead. Some of my specific model draw calls were about 2.5 times as long, but that's still twice as fast as the original tests (which were about 5x as long).

Thanks for all your hard work! I love being able to insert your .dll and get a gpu capture any time I do a screen capture.

@baldurk
Owner

baldurk commented Mar 15, 2018

Thanks for testing! Good to know that the situation is well improved.

There may be more to do, but some overhead is always going to be inevitable. From my findings last time, the biggest win to get from here, at least for D3D11, would be some kind of opt-in mode to relax the restrictions on refcounting. For example, if an application could promise that it would always completely clear its pipeline state before trying to destroy the device, there would be no need to check for refcounting loops and all of the active state tracking could be disabled. I don't know if any app wants to make such changes in exchange for perf improvements.

I'm going to close this now though so that it doesn't become a horrible meta-bug that stays alive as long as there is ever any kind of performance problem! If someone has a case where they'd like me to investigate the overhead and see what could be improved, please open a new bug ideally with some example program I can test with.

baldurk closed this as completed Mar 15, 2018

3 participants