Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use instanced array buffer instead of UBO for canvas item batching #70065

Merged
merged 1 commit into from
Dec 15, 2022

Conversation

clayjohn
Copy link
Member

This simplifies the generated shader code which increases both performance and compile time on low end devices. However, the main purpose of this PR is to get OpenGL rendering on Apple devices again.

Fixes: #68760

May fix: #68980
May fix: #68819
May fix: #67595

Background

This started as a search for a workaround to the issues we were facing on Apple devices, but as it turns out, this new approach can also be more efficient on some devices, especially low end devices.

Basically what this PR does is switch from using a giant Uniform Buffer Object to using a giant Vertex Buffer Object. Instead of dynamically indexing into an array of uniforms (using instance ID is the index) in the shader, we push instance rate attributes to the shader. This improves codegen as there is no dynamic indexing.

For Uniform Buffers we had the ability to bind a specific range, for Vertex Buffers we don't, but we can specify an offset while binding the buffer. Unfortunately, this means we need to rebind the buffer for every batch. In my testing this doesn't carry a noticeable performance penalty.

By way of background, the approach we were using is often referred to as "vertex pulling" i.e. we are using an index to pull vertex data from some buffer. This contrasts with "vertex pushing" which is when we use attributes to push vertex data to a shader (so the shader only sees the data for the current vertex). On modern desktop hardware vertex pulling can be way more efficient as it often allows you to get by with much less data.

In our case, since we only need information on a per-vertex or per-instance basis, the built in tools in OpenGL for vertex attributes and instance attributes work just fine. There is an overhead to instancing, however, in my testing it appears the benefits outweigh the costs.

Performance analysis

I carried out the best analysis I could of rendering performance. For the most part this means that I profiled GPU operations using the best tools available for the platform. On macOS I was limited to printing FPS which is understandably a poor way of measuring GPU performance.

All measurements were taken with the Project Manager

Platform Device 4.0-dev This PR Notes
Windows RX 6900XT 1.1-1.4ms 1.0ms Renderdoc
PopOS 22.04 i7-1165G7 (Xe graphics) 1.6ms 1.4ms Renderdoc
Windows i5-8265U 1.6GHz (intel HD 620) 1.7-1.9ms 1.5-1.6ms Renderdoc
Windows i5-8265U 1.6GHz (intel HD 620) 0.8ms 0.5ms Intel Graphics Monitor
macOS 12.6.1 2.7 GHz Dual core intel i5 N/A 6-10ms* --print-fps
Android 13 Pixel 4 5.3ms 5.2ms Android GPU Inspector
Android 13 Pixel 7 4.7ms 4.4ms Android GPU Inspector
Lubuntu 20.04 Intel Celeron N2840 2.16GHz** 6.5ms 4.4ms Renderdoc
  • On macOS I forcibly disabled low processor mode and vsync However, It appears some form of vsync was still occurring, as the times it was at 6ish ms it appeared locked at 144 FPS
    ** This device is a 2014 Chromebook with ChromeOS wiped

Given these results, I feel comfortable saying that this PR will run at least as fast as the previous approach and may even run faster on some devices.

Further work

I quickly tried out two improvements that are promising but did not show a performance improvement on my main development device (there may be improvements on other devices though). Noting the approaches here for future reference:

  1. Add a "single instance" mode to the shader so batches with a single instance avoid utilizing the vertex buffer and instead pass their per-instance data in a single uniform. In theory this should be faster as it still avoids dynamic indexing and reduces vertex pressure/bandwidth, however in testing on my main device this resulted in a marginal decrease in performance (perhaps due to all the context switching) (https://github.com/clayjohn/godot/tree/GLES3-SI) Needs testing on bandwidth restricted devices (i.e. mobile)
  2. Use non-instanced variations of draw functions when instance count is 1 (i.e. glDrawArrays instead of glDrawArraysInstanced). Doing so resulted in no performance benefit. This may be because the drivers on my development device optimize for this situation already. This should be tested on a wider variety of hardware.

@qarmin
Copy link
Contributor

qarmin commented Dec 15, 2022

Tested that on Linux fixed grey screen on Intel HD 3000

This simplifies the generated shader code which increases both performance and compile time on low end devices
@clayjohn
Copy link
Member Author

Rebased to fix CI

Copy link
Member

@akien-mga akien-mga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like the right move! Didn't review the code in depth but the best review will be testing in beta 9 :)

@akien-mga akien-mga merged commit 47ef054 into godotengine:master Dec 15, 2022
@akien-mga
Copy link
Member

Thanks!

@lawnjelly
Copy link
Member

I had a little look through this morning and it looked fine from what I saw. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment