Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document performance caveats of RGB image formats versus RGBA #79771

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Calinou
Copy link
Member

@Calinou Calinou commented Jul 21, 2023

For context, @lyuma ran this benchmark on a 4096×4096 texture:

func _process(delta):
	var time_start: int = Time.get_ticks_usec()
	image_texture.update(image)
	var time_end: int = Time.get_ticks_usec()
	print(str(0.001 * (time_end - time_start)) + " ms")

These are the results they got depending on the Image format used:

Format   Time taken to update
RF 21ms
RGF 48ms
RGBF 650ms
RGBAF 100ms
RH 10ms
RGH 20ms
RGBH 800ms
RGBAH 48ms
R8 6.0ms
RG8 12ms
RGB8 51ms
RGBA8 21ms

Counterintuitively, the more memory-intensive RGBA formats always outperform the RGB format by a significant margin (by a 5× factor on average on this test). This is due to memory alignment, as RGB uses 3 components whereas other formats use 1, 2 or 4 components (powers of 2).

Considering how important the difference is, I think it's better to add notes everywhere it's relevant, even if it's a bit redundant.

@Calinou Calinou requested a review from a team as a code owner July 21, 2023 23:19
@Calinou Calinou added enhancement documentation cherrypick:4.0 cherrypick:4.1 Considered for cherry-picking into a future 4.1.x release labels Jul 21, 2023
@Calinou Calinou added this to the 4.2 milestone Jul 21, 2023
@torcado194
Copy link

torcado194 commented Jul 22, 2023

Most of these changes say "Converting to [constant FORMAT_RGBA8] is slow, ... consider using [constant FORMAT_RGBA8]", instead of "[constant FORMAT_RGB8] is slow". Is this correct or am I misunderstanding?

@@ -107,6 +107,7 @@
<param index="0" name="format" type="int" enum="Image.Format" />
<description>
Converts the image's format. See [enum Format] constants.
[b]Note:[/b] Converting to [constant FORMAT_RGBA8] is slow, as it is not aligned to 1, 2 or 4 bytes. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], or better, [constant FORMAT_RG8] or [constant FORMAT_R8] if possible. The same applies to [constant FORMAT_RGBAH] and [constant FORMAT_RGBAF] versus their RGB/RG/R counterparts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[b]Note:[/b] Converting to [constant FORMAT_RGBA8] is slow, as it is not aligned to 1, 2 or 4 bytes. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], or better, [constant FORMAT_RG8] or [constant FORMAT_R8] if possible. The same applies to [constant FORMAT_RGBAH] and [constant FORMAT_RGBAF] versus their RGB/RG/R counterparts.
[b]Note:[/b] Converting to [constant FORMAT_RGB8] is slow, as it is not aligned to 1, 2 or 4 bytes. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], or for images without a blue channel, [constant FORMAT_RG8] or [constant FORMAT_R8] if possible. The same applies to [constant FORMAT_RGBH] and [constant FORMAT_RGBF] versus their RGBA/RG/R counterparts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can at least do without the bit-alignment piece of knowledge.

Suggested change
[b]Note:[/b] Converting to [constant FORMAT_RGBA8] is slow, as it is not aligned to 1, 2 or 4 bytes. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], or better, [constant FORMAT_RG8] or [constant FORMAT_R8] if possible. The same applies to [constant FORMAT_RGBAH] and [constant FORMAT_RGBAF] versus their RGB/RG/R counterparts.
[b]Note:[/b] Converting to [constant FORMAT_RGB8] is [i]very[/i] slow. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], [constant FORMAT_RG8], or [constant FORMAT_R8], which are faster. The same applies to [constant FORMAT_RGBH] and [constant FORMAT_RGBF] compared to their RGBA/RG/R counterparts.

Or

Suggested change
[b]Note:[/b] Converting to [constant FORMAT_RGBA8] is slow, as it is not aligned to 1, 2 or 4 bytes. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8], or better, [constant FORMAT_RG8] or [constant FORMAT_R8] if possible. The same applies to [constant FORMAT_RGBAH] and [constant FORMAT_RGBAF] versus their RGB/RG/R counterparts.
[b]Note:[/b] Converting to [constant FORMAT_RGB8] is [i]very[/i] slow. If you need to frequently convert an image, consider using [constant FORMAT_RGBA8]. If possible, [constant FORMAT_RG8] or [constant FORMAT_R8] are faster. The same applies to [constant FORMAT_RGBH] and [constant FORMAT_RGBF] compared to their RGBA/RG/R counterparts.

Or similar.

Copy link
Member

@aaronfranke aaronfranke Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we can make the text both informative and user-friendly by taking your suggestion but prepending "Due to memory alignment reasons," or appending "This slowness happens due to memory alignment reasons."

@clayjohn
Copy link
Member

Before documenting this, I'd really like to see a performance analysis of what exactly is slow. I don't see an image conversion in the texture update codepath. We need to understand if the slowness is in the GPU driver (in which case we need to document it) or if the slowness is from something we are doing (in which case we may be able to fix the underlying issue)

@bitsawer
Copy link
Member

bitsawer commented Jul 22, 2023

@clayjohn I did some digging during #74238 when we noticed that sometimes custom image mipmaps were wiped out in the renderer/storage: #66848 (comment) There are a few places in texture storage where images are converted, for example this one (same also for FORMAT_RGBF and FORMAT_RGBH):

case Image::FORMAT_RGB8: {
//this format is not mandatory for specification, check if supported first
if (false && RD::get_singleton()->texture_is_format_supported_for_usage(RD::DATA_FORMAT_R8G8B8_UNORM, RD::TEXTURE_USAGE_SAMPLING_BIT | RD::TEXTURE_USAGE_CAN_UPDATE_BIT) && RD::get_singleton()->texture_is_format_supported_for_usage(RD::DATA_FORMAT_R8G8B8_SRGB, RD::TEXTURE_USAGE_SAMPLING_BIT | RD::TEXTURE_USAGE_CAN_UPDATE_BIT)) {
r_format.format = RD::DATA_FORMAT_R8G8B8_UNORM;
r_format.format_srgb = RD::DATA_FORMAT_R8G8B8_SRGB;
} else {
//not supported, reconvert
r_format.format = RD::DATA_FORMAT_R8G8B8A8_UNORM;
r_format.format_srgb = RD::DATA_FORMAT_R8G8B8A8_SRGB;
image->convert(Image::FORMAT_RGBA8);
}
r_format.swizzle_r = RD::TEXTURE_SWIZZLE_R;
r_format.swizzle_g = RD::TEXTURE_SWIZZLE_G;
r_format.swizzle_b = RD::TEXTURE_SWIZZLE_B;
r_format.swizzle_a = RD::TEXTURE_SWIZZLE_ONE;

And a few other places:

Ref<Image> image = Image::create_from_data(tex->width, tex->height, tex->mipmaps > 1, tex->validated_format, data);
ERR_FAIL_COND_V(image->is_empty(), Ref<Image>());
if (tex->format != tex->validated_format) {
image->convert(tex->format);
}

@clayjohn
Copy link
Member

@bitsawer Thanks for pointing that out. Looks like the condition was set to always be false in 42b44f4. Although, it probably doesn't matter much as actual GPU support is really poor
image (linear tiling)

I think we still need to check whether the call to image->convert is to blame though. If so, we can likely optimize that code

@sakrel
Copy link
Contributor

sakrel commented Jul 22, 2023

This is a frame flame graph for RGB8 using the benchmark from the OP with a custom build:

6588a4a - Merge pull request #79661 from sepTN/fix-typo-batch
scons platform=windows arch=x86_64 verbose=yes warnings=no progress=no production=yes debug_symbols=yes target=editor
Found MSVC version 14.3, arch x86_64

image
image

RenderingDeviceVulkan:_texture_update spends most time in _copy_region
image

_copy_region(read_ptr, write_ptr, x, y, region_w, region_h, width, pixel_size);

Image::convert
image
Image new_img(width, height, mipmaps, p_new_format);

image
_convert<3, false, 3, true, false, false>(mip_width, mip_height, rptr, wptr);

Something to note: When I run the benchmark with an official 4.1.1 build, a single iteration takes about 70 ms. With my custom build I get around 70 to 80 ms. I wonder if the difference can be explained by me using MSVC to compile the build? I copied the scons arguments from https://github.com/godotengine/godot-build-scripts/blob/main/build-windows/build.sh
Are there debug symbols for the official builds I can download? I want to rerun the Visual Studio diagnostic tools with the official build.

@Calinou
Copy link
Member Author

Calinou commented Jul 23, 2023

I wonder if the difference can be explained by me using MSVC to compile the build? I

Official builds are compiled with a recent MinGW-GCC using optimize=speed and full LTO (lto=full). These options are aliased as production=yes. However, production=yes does not enable LTCG1 on MSVC because it has known issues.

Are there debug symbols for the official builds I can download? I want to rerun the Visual Studio diagnostic tools with the official build.

Unfortunately, no. This is planned for a future release, but since we use MinGW, you'd have to convert the DWARF debug symbols to PDB format (there's a tool for that out there).

Footnotes

  1. Link-time code generation (MSVC's LTO equivalent).

@YuriSizov YuriSizov modified the milestones: 4.2, 4.x Oct 30, 2023
@YuriSizov YuriSizov removed the cherrypick:4.1 Considered for cherry-picking into a future 4.1.x release label Jan 23, 2024
@Mickeon
Copy link
Contributor

Mickeon commented Nov 12, 2024

Is this still relevant now? I feel like the note in the current PR is particularly excessive.

@aaronfranke
Copy link
Member

@Mickeon Considering how extreme the performance disparity is, a strongly worded note is important.

@clayjohn
Copy link
Member

@Mickeon Considering how extreme the performance disparity is, a strongly worded note is important.

Both of the problematic functions have been massively optimized in the past year. So this needs to be re-evaluated

@Calinou
Copy link
Member Author

Calinou commented Nov 19, 2024

Both of the problematic functions have been massively optimized in the past year. So this needs to be re-evaluated

I ran the benchmark again using an updated MRP on 4.4.dev fd4c29a: test_image_performance.zip

PC specifications
  • CPU: Intel Core i9-13900K
  • GPU: NVIDIA GeForce RTX 4090
  • RAM: 64 GB (2×32 GB DDR5-5800 C30)
  • SSD: Solidigm P44 Pro 2 TB
  • OS: Linux (Fedora 41)

Using an optimized editor build for the tests.

Format Time taken to update
RF 7.59 ms
RGF 13.93 ms
RGBF 136.30 ms
RGBAF 28.75 ms
RH 4.63 ms
RGH 7.84 ms
RGBH 62.48 ms
RGBAH 17.39 ms
R8 3.11 ms
RG8 3.76 ms
RGB8 29.02 ms
RGBA8 7.85 ms

The RGB functions are on average 4 times slower than the RGBA ones, despite storing less data.

@Mickeon
Copy link
Contributor

Mickeon commented Nov 19, 2024

The optimization is incredible, but the stark difference in performance with RGB is still way too noticeable to ignore, yeah.

@clayjohn
Copy link
Member

@Calinou I found your results surprising, so I tested on 4.4 dev4 on my device (i7-1165G7). I suspect the difference comes from the graphics drivers

Format Time taken to update
RF 14.67 ms
RGF 38.26 ms
RGBF 48.72 ms
RGBAF 63.42 ms
RH 5.90 ms
RGH 14.91 ms
RGBH 25.17 ms
RGBAH 39.74 ms
R8 4.43 ms
RG8 5.97 ms
RGB8 54.72 ms
RGBA8 15.48 ms

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants