
BC4/5 fixes and performance improvements #18

Open · wants to merge 4 commits into master
Conversation

@cwoffenden commented Sep 4, 2022

This fixes #17 but goes further:

Lots of text snipped, jump down to the next paragraph. Originally this expanded the internal endpoints to 14-bit, but in testing the RMSE and PSNR were always slightly worse even though the max error was reduced. These errors were higher because they were calculated from the 8-bit PNG file, not the hardware's representation. Ryg's blog entry has a good explanation of the hardware.

I simplified this commit to address the main issue: blocks with two (or only a few) distinct values have errors in hardware because one endpoint is always interpolated (which doesn't occur with an 8-bit software decoder). The fix starts the search radius at zero and works outwards (0, -1, 1, -2, 2, etc.). Further, once we hit zero error we take that block as the best available and exit early.
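
A minimal sketch of that outward search order (my own illustration, not the PR's actual code; `err_for_offset` stands in for whatever the encoder uses to score a trial offset):

```cpp
#include <cstdint>

// Visit candidate endpoint offsets in the order 0, -1, 1, -2, 2, ... so the
// candidates closest to the initial guess are scored first.
template <typename ErrFn>
uint64_t search_outwards(int radius, ErrFn err_for_offset)
{
    uint64_t best_err = UINT64_MAX;
    for (int i = 0; i <= 2 * radius; i++) {
        // Map the loop counter 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ...
        const int offset = (i & 1) ? -((i + 1) / 2) : (i / 2);
        const uint64_t err = err_for_offset(offset);
        if (err < best_err) {
            best_err = err;
            if (best_err == 0)
                break; // nothing can beat zero error: take this block and stop
        }
    }
    return best_err;
}
```

Because offset 0 is scored first, a block whose natural endpoints already decode exactly never pays for the rest of the radius.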

This fixes the original issue, keeps the max error, RMSE and PSNR exactly the same, and improves performance. Some timings, using the default -hr5 radius:

| Image | Format | Encoding before (s) | Processing before (s) | Encoding after (s) | Processing after (s) |
|---|---|---|---|---|---|
| flowers-2048x2048 | BC4 | 0.599 | 0.656 | 0.476 | 0.534 |
| quenza-2048x2048 | BC4 | 0.825 | 0.883 | 0.725 | 0.784 |
| bunny-nmap-2048x2048 | BC5 | 0.446 | 0.510 | 0.214 | 0.271 |
| can-nmap-2048x2048 | BC5 | 0.342 | 0.398 | 0.212 | 0.268 |

All timings are the best of four runs. The biggest improvement was in normal maps: they have large areas with 2-3 values hovering around 127, and since the search radius now grows outwards these are found early on.

This fixes richgel999#17 but goes further, since it provides higher accuracy for other blocks with few values. Two-value blocks are special-cased to use the two endpoints directly. An early out is taken when the error reaches zero.
At 16-bit the worst-case error couldn't be accumulated without overflowing. Also fixed a bug whereby the values6 results were truncated to 8-bit, therefore mostly favouring values8. The return from encode_bc4_hq() is now scaled to the same range as before the changes.
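
A sketch of how the two-value special case could look (hypothetical names and struct; the real encode_bc4_hq() integrates this differently):

```cpp
#include <algorithm>
#include <cstdint>

struct BC4BlockSketch {
    uint8_t endpoints[2];  // alpha0, alpha1
    uint8_t selectors[16]; // 3-bit selectors, kept unpacked for clarity
};

// Returns true and fills 'block' when the 16 inputs use at most two levels.
bool try_encode_two_value(const uint8_t pixels[16], BC4BlockSketch& block)
{
    uint8_t lo = pixels[0], hi = pixels[0];
    for (int i = 1; i < 16; i++) {
        lo = std::min(lo, pixels[i]);
        hi = std::max(hi, pixels[i]);
    }
    for (int i = 0; i < 16; i++)
        if (pixels[i] != lo && pixels[i] != hi)
            return false; // more than two levels: fall back to the search
    // With alpha0 > alpha1 the 8-value mode is selected, and selectors 0/1
    // decode to the raw endpoints, so both levels are reproduced exactly
    // with no interpolation. (A single-level block degenerates to lo == hi
    // and still decodes exactly through selector 0.)
    block.endpoints[0] = hi;
    block.endpoints[1] = lo;
    for (int i = 0; i < 16; i++)
        block.selectors[i] = (pixels[i] == hi) ? 0 : 1;
    return true;
}
```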
@richgel999 (Owner) commented:
Thank you, this looks very valuable. Have you tested these changes on a large amount of content to verify that the output encoding hasn't changed? That's one of my primary concerns initially.

@cwoffenden (Author) commented:
I'll try to find the time to throw a few thousand grayscale and normal maps at it and verify the error metrics and times. The encoded output may differ (e.g. two-value BC4 will always use selectors 0 and 1 instead of a single endpoint plus an interpolation) but the decoded output at 8-bit should be the same, so I could hash the decoded PNG. I don't think I'll be able to do that in the next few weeks though.
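
For reference, a sketch of an 8-bit software decode of the BC4 palette (layout per the format spec; the integer rounding here is an assumption, and hardware interpolates at higher precision, which is exactly the discrepancy this PR works around). It shows why selectors 0 and 1 always round-trip exactly:

```cpp
#include <cstdint>

void bc4_palette(uint8_t a0, uint8_t a1, uint8_t pal[8])
{
    pal[0] = a0; // selector 0: always the raw endpoint
    pal[1] = a1; // selector 1: always the raw endpoint
    if (a0 > a1) {
        // Eight-value mode: six interpolated values between the endpoints.
        for (int i = 1; i <= 6; i++)
            pal[i + 1] = (uint8_t)(((7 - i) * a0 + i * a1) / 7);
    } else {
        // Six-value mode: four interpolations plus the constants 0 and 255.
        for (int i = 1; i <= 4; i++)
            pal[i + 1] = (uint8_t)(((5 - i) * a0 + i * a1) / 5);
        pal[6] = 0;
        pal[7] = 255;
    }
}
```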

@cwoffenden (Author) commented Oct 2, 2022

I have some initial results. I wrote this (rather sprawling) test runner to verify everything:

https://gist.github.com/cwoffenden/98780e9009a2d4f62433ea9f77ef4113

You can give it a directory of PNGs and it'll compress them, then collect the metrics in a CSV file. For example:

```
./runbc7enc.py -b 4 -o /Volumes/Temp -x ./bc7enc -l nfproj-grey-orig.csv -t /Volumes/Work/Assets/Test/Numfum/grey
```

This ran the BC4 encoder on 450-ish greyscale files and recorded the max error, RMSE and PSNR (but ignored the time, just so I could do a quick diff). Here are the results: the original and changed code.

The RMSE and PSNR don't change (probably not enough digits) but the max error does, in an interesting way. There are five differences in this set of files, with four of the five having a lower-by-one max error in the new code. It's interesting because it highlights a potential accidental improvement which I'll look at in the week (better selection of the best block).
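
For a sense of scale (my own back-of-the-envelope, using the standard definitions for 8-bit data):

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\hat{x}_i\right)^2},\qquad \mathrm{PSNR}=20\log_{10}\frac{255}{\mathrm{RMSE}}$$

For a 2048×2048 image N is about 4.2 million, so a handful of per-pixel errors each shifting by one moves the sum of squares by only a few units; the resulting change to RMSE and PSNR sits well below the printed digits, while the max error metric picks it up immediately.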

I'll cover the processing time later when I've thrown more files at it (short version: it's faster, about 20% on average when fed hundreds of normal maps). On Mac it doesn't build with OpenMP (it's not supported out of the box), so I want to wait until I'm back at work to test on other OSes.

I can share the test files with you so you can verify, if you like? I have a classifier that goes through internal projects and pulls out the different texture types.

@cwoffenden (Author) commented:
I ran the same test on approximately 1400 other greyscale files, and out of all of them only two more had a lower max error in the changed code. CSV files here.

It's totally accidental that it swings this way, since I've seen a few normal maps where the max error is lower in the original code. It comes down to taking the summed error and calling the lowest value the best, rather than looking at which of the candidates with equal summed errors have lower average or maximum errors. Specifically here:

```cpp
if (trial_err < best_err)
```

trial_err needs further refinement.
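
One possible shape for that refinement (hypothetical names, not the repository's code): carry the maximum per-pixel error alongside the sum and use it to break ties.

```cpp
#include <cstdint>

struct CandidateErr {
    uint64_t sum; // summed error over the block
    uint32_t max; // worst single-pixel error in the block
};

// Returns true when 'trial' should replace 'best'.
bool is_better(const CandidateErr& trial, const CandidateErr& best)
{
    if (trial.sum != best.sum)
        return trial.sum < best.sum;
    // Equal summed error: prefer the candidate with the lower peak error,
    // the quantity behind the max-error differences in the CSV results.
    return trial.max < best.max;
}
```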

Successfully merging this pull request may close issue #17: BC4/5 blocks with two values per channel are slightly off.