Update bc7e.ispc

OVERVIEW
--------
The biggest image quality improvement is a new way of measuring error that tries to minimize the final frame buffer error after blending by treating color and alpha error together, rather than as independent channels. This really helps eliminate blocking artifacts in some alpha images, with no user-tweaked parameters. (A big downside of user-tweaked parameters is that often different parts of an image need different settings!) The only user setting is how color and alpha are to be used, which is one of four intuitive choices. In addition, the alpha blended and alpha tested error metrics don't even need you to set the alpha weight relative to RGB weights! That's because it optimizes color and alpha together based on the worst case color error after blending with the frame buffer, either in RGB or in YCbCr color space. Since it's minimizing color error after blending, alpha is implicitly handled by the color weights, making it even easier to tune.

I also made some improvements to the color partition selector that reduce overall error.

I improved the perceptual compressor by supporting rotation and by using the YPbPr weights for sRGB (in gamma space) rather than the weights for linear RGB. Using the original linear weights is a simple #if change. The images that we compress are always expected to be in sRGB.

One other improvement is to make bc7e optionally return the error for each block. Apex Legends used this to recompress just the really low quality blocks with the uber quality setting. This gives most of the benefit of uber quality in a fraction of the time.

The per-channel error weights are now floats instead of integers.

I tried to keep the API the same as Rich Geldreich's original BC7E with additions that default to old behavior, so it should be an easy drop-in replacement.


MAJOR DETAILS
-------------
) Added "m_optimize_for" to "ispc::bc7e_compress_block_params". This can be one of the following values:
  BC7E_OPTIMIZE_FOR_INDEPENDENT_CHANNELS -- optimize each color channel independently based on their relative error weights (old behavior)
  BC7E_OPTIMIZE_FOR_COLOR_TIMES_ALPHA -- optimize for color times alpha, where the product replaces or adds to the frame buffer
  BC7E_OPTIMIZE_FOR_ALPHA_BLENDING -- optimize for alpha blending with the frame buffer.
  BC7E_OPTIMIZE_FOR_ALPHA_TEST -- optimize for alpha test.

The default is to keep the old behavior.
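
For example, on the calling side this is just one more field to set (a rough C++ sketch against the generated header; the exact ispc:: namespacing of the enum values and the surrounding setup are assumptions on my part):

    ispc::bc7e_compress_block_params params;
    // ... existing initialization of quality level, weights, etc. ...

    // New: tell the compressor how color and alpha will be used together.
    // (The ispc:: namespacing of the enum values is assumed here.)
    params.m_optimize_for = ispc::BC7E_OPTIMIZE_FOR_ALPHA_BLENDING;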

) When optimizing for alpha blending, we want to minimize the peak squared error between the original and compressed images after blending with every possible frame buffer. The math is pretty simple. If the color is C, the alpha is A, and the frame buffer is F, the final on-screen error is "(C1 A1 + F (1 - A1)) - (C0 A0 + F (1 - A0))", which easily simplifies to "C1 A1 - C0 A0 - F (A1 - A0)". It turns out this is equivalent to "((C1 + C0)/2 - F) * (A1 - A0) + (A1 + A0)/2 * (C1 - C0)". The error is linear in F, so the maximum must be at F = 0 or F = 255. Whether the absolute value is maximized with F = 0 or F = 255 depends on the signs of A1 - A0 and C1 - C0, as well as the magnitude of C1 + C0. The cheapest way to compute which F to use is to try both and keep the maximum.
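
For one channel, a scalar C++ sketch of that worst-case test looks like this (colors are 0-255 and alpha is normalized to 0-1 here; the function name is mine, not the ISPC code):

    #include <algorithm>
    #include <cmath>

    // Worst-case squared on-screen error after blending with any frame buffer
    // value F. err(F) = c1*a1 - c0*a0 - F*(a1 - a0) is linear in F, so its
    // magnitude peaks at F = 0 or F = 255: try both and keep the larger.
    static float max_blended_sq_error(float c0, float a0, float c1, float a1)
    {
        float base = c1 * a1 - c0 * a0;
        float e0   = base;                        // F = 0
        float e255 = base - 255.0f * (a1 - a0);   // F = 255
        float worst = std::max(std::fabs(e0), std::fabs(e255));
        return worst * worst;
    }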

The code uses "C1 A1 - C0 A0 - F (A1 - A0)", since it's the cheapest to compute. But, the last equivalent form in the previous paragraph does give a bit more intuition. The importance of changes in alpha is whatever "A1 - A0" gets multiplied by, and similarly for color. So, the importance of color is proportional to the average alpha value, which is not surprising. But, the importance of alpha is generally proportional to "(C1 + C0) / 2 - F". This is more complex to analyze, since F may be 0 or 255. Because of that, alpha is generally more important when the color is near black or white, and less important when the color is near middle gray. But there are still complex interactions between color and alpha, especially since there are three color channels! So, it really is worth it to model the full complexity of how color and alpha interact; it doesn't seem worth it to try to come up with a per-pixel importance for color and alpha, and certainly not a full-image importance.

) Optimizing for color times alpha is most useful for alpha-weighted additive blending, where you do "C * A + F". The error is simply "(C1 A1 + F) - (C0 A0 + F)", which easily simplifies to "C1 A1 - C0 A0". Note that this is identical to the alpha blended form, except that we always use F = 0. So, these two use the exact same code path. One frame buffer value is always 0, and I just set whether the "other" frame buffer value is 0 or 255.

) Optimizing for alpha test is complicated by the need to support a single image with multiple alpha test thresholds, which artists sometimes use to save memory when one alpha-tested image is known to be a subset of another. To support this accurately, we would have to sum the error for each alpha test threshold, which would really complicate the code. Instead, I added "ispc::bc7e_compress_block_params::m_alpha_test_threshold_min" and "m_alpha_test_threshold_max". The compressor remaps alpha so that it is 0 below the min and 255 above the max, and then just uses the alpha blend error metric. This was much easier to implement. If min == max, this is normal alpha testing with a single threshold; this is the most important case. As min and max get further apart, it still tries to minimize error in alpha, but it isn't as accurate as it would be if it knew where the thresholds were.
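
The remapping itself is roughly this (my sketch; the linear ramp between the two thresholds is my guess at the in-between behavior, and alpha and the thresholds are 0-255 here):

    // Remap alpha for the alpha-test metric: 0 below the min threshold, 255
    // above the max threshold. With min == max this degenerates to a hard
    // step at the single threshold. (The ramp in between is an assumption.)
    static float remap_alpha_for_alpha_test(float a, float min_t, float max_t)
    {
        if (a < min_t)  return 0.0f;
        if (a >= max_t) return 255.0f;
        return 255.0f * (a - min_t) / (max_t - min_t);
    }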

) To improve quality, I made the BC7E compressor optionally return the error for each block. In my calling code, after all compression is done, I gather all blocks that had error greater than a threshold, and then I recompress just those blocks at "uber" quality. This brings quality up and spends compression time where it matters most. About 1% of blocks try uber quality, and about 50% of those improve. Sometimes the improvement is slight, sometimes it is surprisingly better (like 1/20th of the squared error).
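
In calling-code terms the pass looks roughly like this (a sketch with made-up wrapper names, not the actual bc7e entry points):

    #include <cstddef>
    #include <vector>

    // Made-up wrapper around the real compressor call, which can now
    // optionally write the per-block error.
    void compress_blocks(std::size_t first, std::size_t count,
                         bool uber_quality, float* out_block_errors);

    void compress_with_uber_retry(std::size_t num_blocks, float error_threshold)
    {
        std::vector<float> errors(num_blocks);
        compress_blocks(0, num_blocks, /*uber_quality=*/false, errors.data());

        // Recompress only the worst blocks (about 1% of them) at uber quality.
        for (std::size_t i = 0; i < num_blocks; i++)
            if (errors[i] > error_threshold)
                compress_blocks(i, 1, /*uber_quality=*/true, &errors[i]);
    }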



PERCEPTUAL ERROR AND PARTITION SELECTION IMPROVEMENTS
-----------------------------------------------------
) The bc7e compressor has a "perceptual" error mode. This tries to minimize error in YCbCr space. There were some minor issues with it:
  - The perceptual space is implemented as YCrCb, but the standard definition is YCbCr -- it swapped Cr and Cb. This makes a little sense, because Cb is strongest in blue... but if you were to do that, the order CrYCb makes the most sense, since Cr is strongest in red and Y is strongest in green. The only public-facing part of this is how you specify the weights for perceptual error. I decided to keep the non-standard order so as to not break the API.
  - The perceptual space internally is sometimes called "YPbPr", and sometimes uses "l" instead of "y" for luminance. "l" is never a good variable name; it looks too much like "1" and "I". It is not great to use that to hold a value called "y", though it does seem that "l" is more natural than "y" for luminance. I think the standard "Y" for luminance comes from the CIE XYZ color space, where Y is luminance, and I think they picked XYZ as color coordinates because 3D coordinates in math are pretty much always XYZ. CIE XYZ was ratified in the early 1900s, long before computers (depending on what you call a "computer"), much less computer graphics.
  - It used the YCbCr transform for linear RGB, not the transform for non-linear sRGB. The inputs to the compressor are sRGB, so this overemphasizes green and deemphasizes blue. I switched it to the transform from sRGB to sYCC, but left the transform for linear RGB as a #if toggle.
  - Modes 4 and 5 support rotation (where alpha swaps with a color channel), but it skipped that in perceptual mode. I think that's because it couldn't do the proper weights once a color channel was swapped with alpha. If you interpret the channels after rotation as if there was no rotation when calculating error, you can get bizarre error values that make the compressor select some really bad blocks. Since I added rotation support to do alpha-aware optimization, it was pretty easy to let perceptual optimization handle rotation too. Rich Geldreich also wrote on his blog that rotation has a relatively small impact on total MSE and hurts RDO, so he may have also thought it wasn't worth the effort to support. He's absolutely right about it being minor in most images, but it can be a noticeable quality improvement for images that are strongly red, green or blue with a simple alpha channel, such as the blue earth image mentioned below in the minor details.
  - Many BC7 modes use partitions. Instead of just one pair of color endpoints in a 4x4 block, modes 0-3 and 7 have 2 or 3 pairs of endpoints called subsets. Each pixel knows what subset to use based on the partition map, which is chosen from a table of 64 pre-defined partitions (16 partitions for mode 0). To decide what partition works best, the compressor iterates each partition and splits pixels by their subset in that partition, then gets a quick estimate of the error for the colors in each subset. It keeps the N partitions with the lowest error, where N is usually 1, and then does the expensive color fitting to the pixels in each subset. Anyway, there was an issue with the fast estimated colors when using perceptual error. For modes 0-3, it would use the Y Cr Cb weights as if they were R G B weights, respectively. For mode 7, it switched to uniform weights. This could cause it to choose an inferior partition in modes 0-3, but it would then correctly optimize the colors for that subset. I at first fixed this by making modes 0-3 switch to uniform weights as well, but later I made this even better (see next point).

) So, the partition selector needs more discussion. It does quick color estimation of each subset in each partition by finding a color ramp for each subset. In the old code, the color ramp just took the minimum and maximum of RGB (or RGBA) independently, then made one end of the line the mins and the other end the maxs. It then chose which point to use for each color by projecting it as a 0-1 fraction on the line, clamping it, scaling it by the maximum selector value (so * 7 for 3-bit indexes), and rounding. It then got the squared distance of each channel to this chosen point, and weighted those errors by the individual channel weights.

I made three improvements to this.

First, I track the covariance of the red, blue and alpha channels versus green. If the covariance is negative, I swap the start and end of that line for that channel. It is usually the case that all channels are positively correlated, but this is often untrue at edges. For instance, if you have a reddish region with an antialiased boundary with a greenish region, then red and green will have negative correlation. In simple English instead of math speak, the code was assuming that when green got bigger, red/blue/alpha also got bigger. The new code handles those channels getting smaller when green gets bigger. This had the biggest impact on choosing better partitions.
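
For one channel pair, the test is just the sign of a covariance sum (scalar C++ sketch over a 16-pixel block; the real code is ISPC and handles red, blue, and alpha against green):

    // Returns true if red is negatively correlated with green across the
    // block, in which case the red ends of the estimated color ramp should be
    // swapped so red decreases as green increases along the line.
    static bool red_anticorrelated_with_green(const float r[16], const float g[16])
    {
        float mean_r = 0.0f, mean_g = 0.0f;
        for (int i = 0; i < 16; i++) { mean_r += r[i]; mean_g += g[i]; }
        mean_r /= 16.0f; mean_g /= 16.0f;

        float cov = 0.0f;
        for (int i = 0; i < 16; i++)
            cov += (r[i] - mean_r) * (g[i] - mean_g);
        return cov < 0.0f;
    }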

Second, I weighted the dot product to account for the relative weights of the different channels. This is easiest to understand if you think of RGB colors as XYZ coordinates in a 3D space. The error metric finds a weighted sum of the squares of the RGB deltas. But, you can first transform the space by scaling each coordinate by the square root of that channel's weight, and then in this modified space the squared error is exactly equal to the squared Cartesian distance. In this "error space", we can project the query point onto the color line and pick the closest point by rounding, and we're guaranteed to get the point that minimizes squared error. This is usually the same point as if we did the projection in standard RGB space, but not always! That's because the non-uniform weights correspond to a non-uniform scale. So, vectors that are perpendicular in one space are not necessarily perpendicular in the other space. The closest point on the line forms a perpendicular vector, so the closest point on the line in one space is probably not the closest point on the line in the other space. We round the closest point, so these two points will usually round the same way to choose the same index on the color line, but not always. When they round differently, it will be near halfway between two points, so their distances are still expected to be similar. So, using the suboptimal projection will usually have no effect on the partition ranking order. But, it does affect the order sometimes, because using the more accurate projection reduces the total error. Using the correct projection is really cheap, too; it's just 3 or 4 additional multiplies outside the inner loop.
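
In code, the weighted projection is only a small change (scalar C++ sketch; lo/hi are the ramp endpoints, w the per-channel error weights, and max_selector is e.g. 7 for 3-bit indexes):

    #include <algorithm>
    #include <cmath>

    // Project a pixel onto the ramp in the space where each channel is scaled
    // by sqrt(weight), so squared Cartesian distance there equals the weighted
    // squared error. Returns the rounded selector index.
    static int pick_index_weighted(const float px[3], const float lo[3],
                                   const float hi[3], const float w[3],
                                   int max_selector)
    {
        float dot = 0.0f, len_sq = 0.0f;
        for (int c = 0; c < 3; c++)
        {
            float axis = (hi[c] - lo[c]) * w[c];   // w = sqrt(w) * sqrt(w)
            dot    += (px[c] - lo[c]) * axis;
            len_sq += (hi[c] - lo[c]) * axis;
        }
        float t = (len_sq > 0.0f) ? std::clamp(dot / len_sq, 0.0f, 1.0f) : 0.0f;
        return (int)std::lround(t * (float)max_selector);
    }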

Finally, I made perceptual error use weights that are a blend between uniform weights and the YCbCr weights for Y from RGB. I experimentally found that a ratio of 10:1 was the sweet spot (the Y weights are 10x as strong as the uniform weights). This gives better results than just using the Y weights, probably because just using the Y weights ignores all errors in Cb and Cr.

) There is one thing I tried for improving perceptual error that didn't pan out. I tried transforming all pixels to YCbCr space before doing the partition estimates. This made it cheap to estimate error difference directly in YCbCr space, instead of approximately in RGB space. Surprisingly, this actually increased total compression error as measured in YCbCr! My theory is that the line connecting two opposite corners of the color box in RGB is a better estimate of the optimal color line than doing the same thing for the corners of the color box in YCbCr, but I didn't spend any time actually figuring out why this was worse.

) I implemented perceptual error with alpha-aware error. This was mathematically interesting. The error for a pixel should be the worst possible error after alpha blending the pixel with the frame buffer, measured in YCbCr (with weights).

It's best to think of this geometrically, by treating RGB as coordinates in 3D space. The uncompressed color C0 and compressed color C1 are two points. The set of all possible frame buffer colors forms a cube. The error equation "A1 C1 - A0 C0 - (A1 - A0) F" is then a linear combination of two points and a cube, so the resulting shape is still a cube, just translated and scaled. So, in 3D space using RGB coordinates, the set of all possible errors is a solid cube. We then transform this set of errors into 3D space using YCbCr coordinates. This is a linear transform, but it is not orthogonal, so the resulting shape is sheared into a parallelepiped (i.e., a 3D parallelogram).

There's one more transform to consider. To get the error, we take a weighted sum of the squared differences in each YCbCr color channel. If we first multiply each channel of YCbCr by the square root of its respective weight, we would get the error as the sum of the squares of each difference... i.e., the error is just the normal squared Cartesian distance in this new space.

So, the last transformation is non-uniform scale by the square root of the color weights. This remains a parallelepiped. In this final space, squared Cartesian distance is the error. So, the maximal error is just the squared distance to the point on this parallelepiped that is furthest from the origin.

) To find the furthest point, we first find the center of the parallelepiped. This is actually pretty easy. The parallelepiped is a weighted sum of 3 non-orthogonal axis vectors. To get those vectors, first take the YCbCr transform matrix and multiply the rows by the square roots of the YCbCr weights. The axis vectors are now the columns of this matrix, so these vectors just depend on the color transform matrix and the error weights and can be precalculated. To construct the parallelepiped in error space, we multiply each axis by R, G, or B, respectively (scaled by A1 - A0). The "first" corner is the alpha-weighted difference between the transformed points C0 and C1; the rest of the parallelepiped is obtained by adding 0-255 times each axis vector. So, relative to that corner, the center is just 127.5 times each axis vector. You can get this by adding the R, G, B axis vectors together and scaling the sum by 127.5. That's convenient, because the sum of those vectors has a lot of cancellation. In fact, it just moves in the Y direction by the square root of the Y weight! This kind of makes intuitive sense for Y Cb Cr, because Cb and Cr are centered around 0.

Once we have the center of the parallelepiped, we know the furthest point is half a parallelepiped away on each axis. We can tell which way to go by taking the dot product of the axis and the center; the sign of that dot product is the sign we should use for adding the other half of the step on that axis. We do this for each channel of Y Cb Cr, and we arrive at the point that has the most error. The squared distance between this point and the origin is the maximum error, which is what we wanted to find all along.
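
Putting the last few paragraphs together, a scalar C++ sketch of the whole computation might look like this (M is the RGB-to-YCbCr matrix with rows Y, Cb, Cr; w the per-channel error weights; colors 0-255 and alpha 0-1; the axis vectors could be precalculated up to the A1 - A0 factor):

    #include <cmath>

    // Worst-case weighted-YCbCr squared error after alpha blending against
    // any frame buffer color F in [0,255]^3.
    static float max_blended_ycbcr_sq_error(const float M[3][3], const float w[3],
                                            const float c0[3], float a0,
                                            const float c1[3], float a1)
    {
        // Half-extent axis vectors: columns of diag(sqrt(w)) * M, scaled by
        // the alpha difference and by half the frame buffer range.
        float half_axis[3][3];   // [rgb channel][ycbcr component]
        for (int c = 0; c < 3; c++)
            for (int k = 0; k < 3; k++)
                half_axis[c][k] = -127.5f * (a1 - a0) * std::sqrt(w[k]) * M[k][c];

        // Center of the parallelepiped: the transformed F = 0 corner plus half
        // of each axis.
        float center[3];
        for (int k = 0; k < 3; k++)
        {
            float corner = 0.0f;
            for (int c = 0; c < 3; c++)
                corner += std::sqrt(w[k]) * M[k][c] * (a1 * c1[c] - a0 * c0[c]);
            center[k] = corner + half_axis[0][k] + half_axis[1][k] + half_axis[2][k];
        }

        // Step half an axis further from the origin along each axis; the sign
        // of the dot product with the center picks the direction.
        float far_pt[3] = { center[0], center[1], center[2] };
        for (int c = 0; c < 3; c++)
        {
            float d = center[0] * half_axis[c][0] + center[1] * half_axis[c][1]
                    + center[2] * half_axis[c][2];
            float s = (d >= 0.0f) ? 1.0f : -1.0f;
            for (int k = 0; k < 3; k++)
                far_pt[k] += s * half_axis[c][k];
        }
        return far_pt[0] * far_pt[0] + far_pt[1] * far_pt[1] + far_pt[2] * far_pt[2];
    }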



MINOR DETAILS
-------------
) Most of the work was refactoring the BC7E code to make it easier to do a different error metric. The existing error metric was really simple, just basically squaring a vector and dotting it with a constant, so the math was written inline in literally dozens of places rather than being factored into a function.

) One tricky thing is that 2 of the 8 modes in BC7 have separate "scalar" and "vector" parts. Two bits specify which channel is scalar; 0 = alpha, 1 = red, 2 = green, and 3 = blue. They call this the rotation. If the rotation is not 0, then after decompression, it swaps alpha and the corresponding color channel (so the rotation swaps color channels, it does not rotate them, and the official name is something of a misnomer). It is clearly designed for separate color + alpha, with the observation that sometimes it's better for another channel to be independent. For instance, one test image is basically earth from space with a slow alpha gradient. In this image, most of the variation is in the blue channel, so the vector/scalar split is more often RGA|B than it is RGB|A.

The original bc7e compressor handled this by swapping the color and alpha channels before compressing, as well as their weights. This lets the rest of the compressor code not have to worry about rotation at all. This works great when you treat error as a weighted sum of the individual channel weights. If you are trying to emulate alpha blending, and alpha was swapped with blue, it doesn't work so great. So, I had to handle the alpha channel swapping with another channel.

Part of handling this was to make alpha compress before color (scalar before vector), so that vector compression can use the results of scalar compression. The scalar channel always optimizes independently, then the compressed scalar channel is passed to the vector compressor, so that it can measure the error after taking into account alpha blending or color times alpha. This needs to handle rotation, so there are several cases in the vector compressor!

) Compressing alpha independently is the right thing to do for independent channels, but not quite right with alpha-aware error metrics. I still optimize independently as a performance optimization. It's possible that a compression choice that is inferior for the scalar channel on its own is actually superior for the scalar and vector channels collectively. But the chances of this are quite small, the improvement when it does happen is expected to be minor, and getting that improvement is a lot of work (both dev time and CPU time), so I decided it wasn't worth it.

) The BC7E compression code often calculated "floor(x + 0.5)". This should be the same as "round(x)", except when the fraction part of x is exactly 0.5; the "floor" version always rounds this up, but the "round" version rounds to the nearest even value (for example, floor(2.5 + 0.5) = 3, but round(2.5) = 2). SSE2 has an efficient "round" instruction, but you don't get an efficient "floor" instruction until SSE4.1. The compiler can't just turn "floor(x + 0.5)" into "round(x)" because of the difference in rounding some half values. So, I changed all cases of "floor(x + 0.5)" to "round(x)" in the BC7E code.

) The BC7E compression code sometimes initialized "infinite" error as 1e+9 or 1e+10, or 1-10 billion. I ran into cases where really bad blocks with high error weights exceeded this limit. I changed this to FLT_MAX.

) The BC7E code was doing too much work to calculate p bits, and it split cases and loops that could have been merged (see the top of find_optimal_solution in bc7e.ispc, around line 3007). I made this more efficient. I noticed it was also always calculating the alpha channel even when it didn't need to; I tracked this down to "scale_color" in "evaluate_solutions" always transforming all 4 channels, so it needed the alpha channel to be initialized. I made "scale_color" only scale the channels we care about, which does less work for opaque blocks there, and lets us do less work in "find_optimal_solution" too. Not only was initializing alpha doing more math, but needing to initialize alpha was the only reason the various loops over color channels had to be split.

) The partition error estimation functions in BC7E were taking the same parameters as the accurate color compressor, but only using a small subset of the values. I made a new struct to hold just the values it needs, plus any values that only apply to partition error estimation.

) The partition error estimation functions took in a "best so far" value. This was never used. It could be used as an early out in the last part of the function, but that's only saving a fraction of the work, and in ISPC code it only saves that if all lanes in the gang agree to take the early out. But, testing for the early out takes work on every iteration. So, using this as an early out is unlikely to save more time than it costs in ISPC code, so it's not worth passing to the function. I removed it.

) BC7E had some mysterious "disable_faster_part_selection || num_solutions <= 2" bools passed to a function, then a "!disable_faster_part_selection && num_solutions > 2" check several pages of code later. It turns out that, if you have more than 2 partitions, this was asking it to find the best partition without spending extra time to refine the endpoints. Then, if the best partition you found was better than the best solution using another method, it would refine the endpoints of just the winning partition. So, this is a performance optimization. I made a bool "refine_while_choosing" and used this in the two places to better document what it does and how the two uses are connected.

) BC7E had some tables initialized with floats specified with 6 fractional digits. The exact constants need at most 12 mantissa bits, but the 6-digit floats usually needed 23 mantissa bits, because they were usually off by 2 ULPs from the exact constants. I replaced the initializers for this table with a macro that gives the exact values, and is easier to understand. This avoids some potential numerical issues in the least squares solver, where a determinant should have been zero but wasn't. See how "pSelector_weights" is used in "compute_least_squares_endpoints_rgb[a]"; the calculation of the "z" determinant has a high chance of being exact when summing 12-bit floats, and a much lower chance of being exact when summing 23-bit floats. It does an early out check with exactly zero; epsilon issues can make it skip this check, doing extra work and getting nonsense results. This would just waste compression time. It wouldn't crash, and the code only keeps the best results, which would never be this result (except maybe by an incredible freak miracle).

) Made BC7E use float errors instead of integer errors. This is needed to improve compression on low alpha blocks when optimizing for alpha blended error. For example, if alpha is 1 out of 255, then all color errors are basically scaled by 1/255, which can easily round to 0 when cast to an integer. If multiple encodings all round to 0, it can't tell which was actually better.

I could have kept everything as integers and scaled the squared error by 255^2, but instead I chose to use floats for error and let the errors be less than 1. It usually used float errors internally and converted them to integers at the end, so staying in floats is more efficient and doesn't lose any precision.

) Measuring errors as floats lets me turn the error weights into floats as well. This lets callers have easier control over the weights, and avoids a lot of integer-to-float conversions.

) "color_cell_compression()" only tried to do the "uber" stuff if the error metric exceeded num_pixels * 3.5. If you scale the weights by a constant, the error metric is scaled by that same constant, but this threshold is not. The default weights for RGB errors sum to 4, so this is an average squared error of about 0.875, or average error of about 0.9354. But the default weights for perceptual errors summed to 464, so this is an average squared error of about 0.007543, or an average error of about 0.08685 (assuming all channels have equal error). In other words, the default perceptual weights are more than 10x as likely to do the "uber" stuff as the default non-perceptual weights.

I changed the threshold to be numPixels * 0.875f * the sum of the weights. This will keep the default behavior for non-perceptual weights, and will stop making other methods do the extra uber stuff for really low error blocks. This is a performance optimization.
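
In other words, the new threshold is roughly:

    // Scale the "try uber" threshold by the sum of the channel weights, so it
    // behaves the same whether the weights sum to 4 or to 464.
    // 0.875 = 3.5 / 4, the old per-pixel threshold over the default weight sum.
    static float uber_error_threshold(int num_pixels, const float weights[4])
    {
        return (float)num_pixels * 0.875f *
               (weights[0] + weights[1] + weights[2] + weights[3]);
    }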

) "evaluate_solution()" did something odd for 4-bit indices, which is only mode 6. It also only did this for non-perceptual error; perceptual error tries all 16 options.

It rounded to the nearest index assuming linear interpolation, then tried that index and the previous index. If you were to take an index and the previous index, I'd expect to start with "ceil" instead of "round", so that you're trying the integers on either side of the fraction. For example, if you decide you want index 3.4, this would round to 3, so it would try indices 2 and 3. I think it makes more sense to try indices 3 and 4, so I switched it from "round" to "ceil".

Now, the indices aren't actually evenly spaced. This code assumes the fraction "f" used for index "i" is simply "f = i / 15", so you can go backwards using "i = f * 15". In fact, the fraction for an index is defined to be in fixed point with 6 bits of precision, so the actual math is "f = ((i * 64 + 7) / 15) * 0.015625", where this is a mix of standard C++ integer and float math. In all float math it would be "f = round( i * 64 / 15 ) / 64".
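
As a quick sanity check, that integer formula does reproduce the BC7 4-bit interpolation weights (the spec_weights table below is from the BC7 spec):

    #include <cstdio>

    int main()
    {
        // BC7 4-bit interpolation weights, 6-bit fixed point (0..64).
        const int spec_weights[16] = {  0,  4,  9, 13, 17, 21, 26, 30,
                                       34, 38, 43, 47, 51, 55, 60, 64 };
        for (int i = 0; i < 16; i++)
        {
            int   wi = (i * 64 + 7) / 15;   // integer division
            float f  = wi * 0.015625f;      // /64: the fraction actually used
            std::printf("%2d: %2d (spec %2d)  f = %f\n", i, wi, spec_weights[i], f);
        }
        return 0;
    }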

The actual palette lookup is a bit more complex than even this. It first decodes the two end points to 8-bit integers. Then, it interpolates those integers using the 6-bit interpolation factor, and rounds the final result to an 8-bit integer. So, the palette entries may not actually approximate a line very well at all, especially when the end points are near each other but not equal. The code ignores the effects of this second rounding when choosing the indices to try. With non-uniform error weights, I could see situations where the best index is not actually where we predict. I think this is why most modes try all indices, not just the most likely 1 or 2.

Anyway, we can encapsulate the rounding in a random error "e", such that "f = i / 15 + e". It is obvious that "|e| <= 1/128"; rounding introduces up to +/-0.5 error, and we divide that by 64. Going backwards, "i = (f - e) * 15 = 15 f - 15 e".

So, we can use "i0 = round( 15 * f - 15/128 )" and "i1 = round( 15 * f + 15/128 )" to get lower and upper bounds, respectively, for the nearest index. The old code could miss the nearest index by only looking at indices that were too low.

To fix this, I just changed "round" to "ceil". There is never a case where "ceil( 15 * f ) < round( 15 * f + 15/128 )", since "ceil(x) ~= round(x + 0.5)" (they can only differ when x is exactly an integer, so that x + 0.5 lands on a half value). Similarly, there is never a case where "ceil( 15 * f ) - 1 > round( 15 * f - 15/128 )". So, this will now try the two indices on either side of the index predicted by assuming linear interpolation, which will always try the nearest index, and will almost always try the second nearest index.

It may not actually try the second closest index. For example, if "f * 15 = 2.001", it will try indices 2 and 3. It could be the case that the fraction for index 1 is closer than the fraction for index 3 to the ideal fraction. But when this happens, the nearest fraction is always much closer to the ideal fraction than the second nearest fraction, so it should never matter. It only really becomes important to try both fractions when you're close to the transition between indices, which will be when "f * 15" has a fraction somewhat near 0.5. Using "ceil" will always try the two closest options in this case, where "round" would only try the two closest options about half the time.
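
So the candidate selection ends up as something like this (my sketch, not the actual code):

    #include <algorithm>
    #include <cmath>

    // Pick the two 4-bit indices to evaluate for a target fraction f in [0,1],
    // assuming approximately evenly spaced indices. Using ceil instead of
    // round guarantees the two candidates bracket the ideal fractional index.
    static void pick_mode6_index_candidates(float f, int* i0, int* i1)
    {
        int hi = std::clamp((int)std::ceil(f * 15.0f), 0, 15);
        *i1 = hi;
        *i0 = std::max(hi - 1, 0);
    }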

) BC7 mode 6 uses 7777.1 end points, meaning that you get 7 unique bits for each of RGBA and 1 shared LSB bit (the parity or p bit). So, all channels are even or odd. It also has 4-bit indices. So, this is a pretty good format for many blocks, even fully opaque blocks.

However, the BC7E compressor for fully opaque blocks was choosing if the p bit should be 0 or 1. If it chose 0, alpha for the opaque block would change from 255 to 254. When testing the error, it only looked at RGB and not alpha, because it knew alpha was always 255, so the error metric didn't actually register any error for the alpha changing.

My fix is to make the p bits always 1 for mode 6 for fully opaque blocks. This forces alpha to stay 255. It may cause the other colors to compress slightly worse, but keeping alpha 255 is more important for us. This fixes a thin gray line over some dark art. The thin gray line was actually the bright background seeping through the alpha 254 transparency. You can disable this by setting "force_pbits_to_1" to false in "find_optimal_solution".

CoriolisStorm committed Jan 27, 2022