
Profiling performance of rendering #664

Open
marcus-bigrep opened this issue Jan 13, 2021 · 17 comments

@marcus-bigrep

I deal with large data sets, and sometimes the preview rendering is quite slow, below 1 FPS.
So I have now done some profiling, and the results are strange.
Take the attached 3mf and the code adjustments at the bottom of this issue, slice the sample, and render the gcode preview.
The log states that (preview) rendering takes <0.3 s, but I measure >1 s with a stopwatch (from "start rendering" to "end rendering"). In particular, >1 s is the real rendering time.
Do you know where the time is spent?

https://github.com/Ultimaker/Uranium/blob/master/UM/Qt/Bindings/MainWindow.py

    import time

    def _render(self):
        # Timing instrumentation added for this issue: log wall-clock time
        # around the full render pass.
        tstart = time.perf_counter()
        Logger.log("w", "start rendering")

        if self._full_render_required:
            renderer = self._app.getRenderer()
            view = self._app.getController().getActiveView()
            renderer.beginRendering()
            view.beginRendering()
            renderer.render()
            view.endRendering()
            renderer.endRendering()
            self._full_render_required = False
            self.renderCompleted.emit()
        else:
            self._app.getRenderer().reRenderLast()

        tend = time.perf_counter()
        Logger.log("w", "end rendering %s", str(tend - tstart))

UMS5_BigRepManifold.zip

@nallath
Member

nallath commented Jan 14, 2021

I don't really understand how you timed this. When profiling code like this, I'd recommend using something like https://github.com/pyutils/line_profiler, since it shows how much time was spent on each line.
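
For reference, a minimal sketch of how line_profiler could be hooked onto that method without launching Cura through kernprof; monkey-patching MainWindow._render like this is just one possible way to wire it up, not something the Cura code does itself:

    # Sketch only: wrap MainWindow._render with line_profiler and dump the
    # per-line timings when the application exits. Requires
    # "pip install line_profiler"; where this snippet is executed (e.g. early
    # during startup or from a plugin) is left open here.
    import atexit
    from line_profiler import LineProfiler

    from UM.Qt.Bindings.MainWindow import MainWindow

    _profiler = LineProfiler()
    # Replace the method with a wrapper that records per-line timings.
    MainWindow._render = _profiler(MainWindow._render)
    atexit.register(_profiler.print_stats)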

@marcus-bigrep
Author

Thanks for the advice, I'll do that as well.
However, I don't think the answer lies in digging deeper into this function.
It seems that something in the message handling and Qt takes a lot of time, and that is a big black hole I don't understand.
In other words, it is not about this code; this code is fast.

The bottom line of this investigation:
the log message is handled asynchronously and there is a latency of >0.7 s.
So the question is: what happens during those 0.7 s? Do you have an idea?

@nallath
Member

nallath commented Jan 14, 2021

I'm not sure what part of the message handling that could be. It has to be something with OpenGL, since the rendering is so much faster if you only display a few lines.

But I do have the feeling that you might be on to something here that I missed the last few times I tried profiling. The problem is: I don't know what I missed...

@marcus-bigrep
Author

That is exactly how I stumbled over it: I hacked in some caching of the VBOs.
With that, the render method takes 0.03 s (i.e. 6x faster), but the overall duration is still >1 s.
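
For illustration, a minimal sketch of the kind of VBO caching meant here, written against PyQt5's QOpenGLBuffer; the cache key and where this would hook into Uranium's render pass are left open, and this is not the actual patch referred to above:

    # Sketch only: keep uploaded vertex buffers around instead of re-creating
    # them every frame. Requires a current OpenGL context when called.
    from PyQt5.QtGui import QOpenGLBuffer

    _vbo_cache = {}  # maps a stable key (e.g. layer-data id + version) to a QOpenGLBuffer

    def get_cached_vbo(key, vertex_bytes):
        """Return a VBO for `key`, uploading `vertex_bytes` only the first time."""
        vbo = _vbo_cache.get(key)
        if vbo is None:
            vbo = QOpenGLBuffer(QOpenGLBuffer.VertexBuffer)
            vbo.create()
            vbo.bind()
            vbo.allocate(vertex_bytes, len(vertex_bytes))
            vbo.release()
            _vbo_cache[key] = vbo
        return vbo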

@nallath
Member

nallath commented Jan 14, 2021

I managed to find what is causing it: it's the frame swap. I don't know why that would take so much longer than the actual rendering...

@marcus-bigrep
Author

Where is it in the code?

@nallath
Member

nallath commented Jan 14, 2021

It's in Qt itself. The QQuickWindow goes through a number of render stages: https://doc.qt.io/qt-5/qquickwindow.html#RenderStage-enum

I've been digging around in the Qt documentation, and it seems that the swap stage is when the queued OpenGL calls are actually executed. In that sense it makes sense that the swap takes a long time.

I've hooked it up to Qt Creator to debug this a bit; this is a screenshot of what the scene graph is doing:

[screenshot: Qt Creator scene graph profiler timeline]

The bar before the swap is the actual render (which is 56.8 ms).
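
As an aside, the render/swap split can also be measured from Python by connecting to QQuickWindow's beforeRendering, afterRendering and frameSwapped signals; a rough sketch, where `window` is assumed to be the application's QQuickWindow (Uranium's MainWindow):

    # Sketch only: time the render stage vs. the swap stage of the scene graph.
    # With the threaded render loop these signals come from the render thread,
    # so direct connections are used to avoid queuing delays in the measurement.
    import time
    from PyQt5.QtCore import Qt

    _stamps = {}

    def _before_rendering():
        _stamps["render_start"] = time.perf_counter()

    def _after_rendering():
        _stamps["render_end"] = time.perf_counter()

    def _frame_swapped():
        now = time.perf_counter()
        render = _stamps.get("render_end", now) - _stamps.get("render_start", now)
        swap = now - _stamps.get("render_end", now)
        print("render: %.1f ms, swap: %.1f ms" % (render * 1000, swap * 1000))

    window.beforeRendering.connect(_before_rendering, Qt.DirectConnection)
    window.afterRendering.connect(_after_rendering, Qt.DirectConnection)
    window.frameSwapped.connect(_frame_swapped, Qt.DirectConnection)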

@marcus-bigrep
Author

What surprises me is that caching the VBOs has so little impact on performance.
I am now looking for an OpenGL profiler to get a better understanding.

@nallath
Member

nallath commented Jan 15, 2021

The thing is that we use a shader to get the 3D effect of the layers. We could consider pre-generating the vertices and sending those to be rendered. It might make it faster, at the expense of more pre-computation.

@marcus-bigrep
Author

Still, my gut feeling is that VBO caching should have a significant effect even with the shaders.

But your idea is good, and Cura's old approach was exactly this. I'll try VBO caching with the old approach.
(The issue with the old approach is that the pre-generation is done in Python, which is painfully slow.)

@nallath
Member

nallath commented Jan 15, 2021

We could also do the pre-generation in C++ and make a Python binding.

I do agree with your gut feeling, though. The VBO caching should have more of an effect.

@marcus-bigrep
Author

marcus-bigrep commented Jan 19, 2021

Update from my side:

  1. I have profiled it with NVIDIA Nsight, and >90% of the time goes into rendering the tubes. In particular, this backs up why VBO caching does not save much time.
  2. I was wrong that the tubes were generated on the CPU in a former version. In fact, there is a compatibility mode which renders plain lines; that one is fast, though.

I'll hack in generating the tubes on the CPU now and do some benchmarking.
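
For illustration, a minimal numpy sketch of what generating the tubes on the CPU could look like; it only expands each polyline segment into a flat quad in the XY plane, whereas a real implementation would build full prisms with normals and colors, so the names and layout here are purely illustrative:

    # Sketch only: expand each segment of a polyline into two triangles of the
    # given width. Not Cura's actual geometry; just the general idea.
    import numpy

    def segments_to_quads(points, width):
        """points: (N, 3) float array of polyline vertices -> (6*(N-1), 3) triangle vertices."""
        starts, ends = points[:-1], points[1:]
        directions = ends - starts
        lengths = numpy.linalg.norm(directions[:, :2], axis=1, keepdims=True)
        lengths[lengths == 0] = 1.0  # avoid division by zero for degenerate segments
        # Perpendicular direction in the XY plane, scaled to half the line width.
        offsets = numpy.zeros_like(directions)
        offsets[:, 0] = -directions[:, 1] / lengths[:, 0]
        offsets[:, 1] = directions[:, 0] / lengths[:, 0]
        offsets *= width / 2.0
        a, b = starts + offsets, starts - offsets
        c, d = ends + offsets, ends - offsets
        # Two triangles per segment: (a, b, c) and (b, d, c).
        return numpy.concatenate([a, b, c, b, d, c], axis=1).reshape(-1, 3)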

@Ghostkeeper changed the title from "Profile Rendering" to "Profiling performance of rendering" on Jan 25, 2021
@marcus-bigrep
Author

What I have done now:
I have "exported" the preview to an STL file. This file is huge, >3 GB. Of course, you could store it more efficiently (triangle strips etc.), but colors and vertex normals come on top of that, so you likely end up with even more memory consumption.
My graphics card has 4 GB of RAM and won't manage it.
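
A back-of-envelope check of that claim (assuming a binary STL and a plausible, but made-up, per-vertex layout):

    # Sketch only: estimate the VRAM needed to keep the pre-generated tubes
    # resident, starting from the reported ~3 GB STL export.
    stl_bytes = 3 * 1024**3            # size of the exported file
    triangles = stl_bytes // 50        # binary STL stores 50 bytes per triangle
    vertices = triangles * 3           # non-indexed triangle soup
    bytes_per_vertex = 12 + 12 + 4     # position + normal (3 floats each) + RGBA color
    print(vertices * bytes_per_vertex / 1024**3)  # roughly 5 GB -- more than a 4 GB card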

Long story short, I don't think that creating the tubes up front and then caching them will work out.

For the 1 fps case, one could show a simple view (i.e. lines) while navigating and only render the ordinary view afterwards, once the camera is static. However, my resources are slipping away and I won't manage this in the short term.
@nallath Maybe it's something for the regular Cura development? ;-)

@nallath
Member

nallath commented Jan 27, 2021

Could you share the project file used to create the tubes file (and the tubes file itself)? It would help me convince people that this is worth the effort.

@marcus-bigrep
Author

(1)
Sorry for not answering earlier. I have attached a simple script that generates an STL file of tubes from a txt file.
Furthermore, you have to insert

    with open(r"c:\dev\ttt.txt", "a") as file:
        numpy.savetxt(file, data)

into LayerPolygon.__init__ to dump the data.

tube_to_stl.zip

(2) I had a look at Simplify3D and PrusaSlicer. They are about 10x faster.
We should definitely look into that! IMHO, this is the way to go.

@bremco
Contributor

bremco commented Jul 26, 2021

I made a PR (to both Uranium, for the actual VAO issues, and Cura, for the shader implementation the fix implies) that might partially(?) fix this. I'm not sure whether people get notifications when you just mention an issue, so here is a message as well :-)

@Ghostkeeper
Collaborator

With Bremco's PR I observed a 4-fold performance increase in a large layer view. Whatever benchmarks you have now will be invalidated with the next release. Perhaps we can then consider this issue fixed?
