kram - update bc7enc to bc7enc_rdo
This is the maintained codebase.  Still has a bug where all alpha = 255 is mapped to 254.  Will put in a patch for that next.
richgel999/bc7enc#3
alecazam committed Jul 15, 2022
1 parent 4c9cd92 commit cc9a579
Showing 18 changed files with 11,931 additions and 4,636 deletions.
134 changes: 94 additions & 40 deletions build2/kram.xcodeproj/project.pbxproj

Large diffs are not rendered by default.

12 changes: 7 additions & 5 deletions libkram/bc7enc/LICENSE
@@ -1,10 +1,12 @@
The following source code files are available under 2 licenses -- choose whichever you prefer:
rgbcx.h
bc7decomp.cpp/h
bc7enc.c
If you use this software in a product, attribution / credits is requested but not required.

bc7e.ispc uses the Apache 2.0 license and is Copyright (C) 2018-2021 Binomial LLC.
LodePNG is Copyright (c) 2005-2016 Lode Vandevenne. See LodePNG.cpp for its license.

All other source code files in this repo are available under 2 licenses -- choose whichever you prefer.

ALTERNATIVE A - MIT License
Copyright(c) 2020 Richard Geldreich, Jr.
Copyright(c) 2020-2021 Richard Geldreich, Jr.
Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files(the "Software"), to deal in
the Software without restriction, including without limitation the rights to
237 changes: 127 additions & 110 deletions libkram/bc7enc/README.md
@@ -1,146 +1,163 @@
bc7enc - Fast, single source file BC1-5 and BC7/BPTC GPU texture encoders.
bc7enc - Fast BC1-7 GPU texture encoders with Rate Distortion Optimization (RDO)

Features:
- BC1/3 encoder (in [rgbcx.h](https://github.com/richgel999/bc7enc/blob/master/rgbcx.h)) uses a new algorithm (which we've named "prioritized cluster fit") which is 3-4x faster than traditional cluster fit (as implemented in [libsquish](https://github.com/svn2github/libsquish) with SSE2) at the same or slightly higher average quality using scalar CPU instructions. This algorithm is suitable for GPU encoder implementations.
This repo contains fast texture encoders for BC1-7. All formats support a simple post-processing transform on the encoded texture data designed to trade off quality for smaller compressed file sizes using LZ compression. Significant (10-50%) size reductions are possible. The BC7 encoder also supports a "reduced entropy" mode (the -e option), which biases/weights the output in ways that minimally impact quality and yields 5-10% smaller file sizes with no slowdown in encoding time.

The BC1/BC3 encoder also implements [Castano's optimal endpoint rounding improvement](https://gist.github.com/castano/c92c7626f288f9e99e158520b14a61cf).
Currently, the entropy reduction transform is tuned for Deflate, LZHAM, or LZMA. The method used to control the rate-distortion tradeoff is the classic Lagrangian multiplier RDO method, modified to favor MSE on very smooth blocks. Rate is approximated using a fixed Deflate model. The post-processing transform applied to the encoded texture data tries to introduce the longest match it can into every encoded output block. It also tries to continue matches between blocks and (specifically for codecs like LZHAM/LZMA/Zstd) it tries to utilize REP0 (repeat) matches.
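
The Lagrangian selection described above can be sketched as follows. This is a minimal illustration rather than the repo's actual code: `lagrangian_cost`, `pick_candidate`, and the candidate fields `distortion` and `rate_bits` are hypothetical names, the numbers are made up, and both the smooth-block modification and the fixed Deflate rate model are left out.

```python
def lagrangian_cost(distortion, rate_bits, lam):
    """Classic RDO cost: J = D + lambda * R."""
    return distortion + lam * rate_bits


def pick_candidate(candidates, lam):
    """Return the candidate block encoding with the lowest Lagrangian cost."""
    return min(
        candidates,
        key=lambda c: lagrangian_cost(c["distortion"], c["rate_bits"], lam),
    )


# Hypothetical candidates for one block: the untouched encoding, plus two
# variants that trade added error for a cheaper (more LZ-friendly) encoding.
candidates = [
    {"name": "original", "distortion": 0.0, "rate_bits": 128.0},
    {"name": "short_match", "distortion": 12.0, "rate_bits": 90.0},
    {"name": "long_match", "distortion": 40.0, "rate_bits": 30.0},
]

for lam in (0.0, 0.4, 1.0):
    print(lam, pick_candidate(candidates, lam)["name"])
```

With lambda at 0 the untouched block always wins; as lambda grows, candidates with longer matches become cheaper overall despite their higher error.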

rgbcx's BC1 encoder is faster than both AMD Compressonator and libsquish at the same average quality.
You can see examples of the RDO BC7 encoder's current output [here](https://richg42.blogspot.com/2021/02/more-rdo-bc7-encoding.html). Some examples on how to use the command line tool are on my blog, [here](https://richg42.blogspot.com/2021/02/how-to-use-bc7encrdo.html).

- BC7 encoder (in bc7enc.c/.h) has perceptual colorspace metric support, and is very fast compared to ispc_texcomp (see below) for RGB textures. Important: The BC7 encoder included in this repo is still a work in progress. I took bc7enc16 and added more modes for better alpha support, but it needs more testing and development.
This repo contains both [bc7e.ispc](https://github.com/BinomialLLC/bc7e) and its distantly related but weaker 4 mode only non-ispc variant, bc7enc.cpp. By default, if you set SUPPORT_BC7E=TRUE when running cmake, you get bc7e.ispc, otherwise you get bc7enc.cpp. (The -C option forces bc7enc.cpp.) bc7e supports all BC7 modes and features, but doesn't yet support reduced entropy BC7 encoding. bc7enc.cpp supports optional reduced entropy encoding (using -e with the command line tool). RDO BC7 is supported when using either encoder, however.

- Full decoders for BC1-5/7. BC7 decoder is in bc7decomp.cpp/.h, BC1-5 decoders in rgbcx.h.
The next major focus will be improving the default smooth block handling and improving rate distortion performance.

This project is basically a demo of some of the techniques we use in Basis BC7,
which is Binomial's state of the art vectorized BC7 encoder. Basis BC7 is the
highest quality and fastest CPU BC7 encoder available (2-3x faster than
ispc_texcomp). It supports all modes and linear/perceptual colorspace metrics.
Licensees get full ISPC source code so they can customize the codec as needed.
This repo was originally derived from [bc7enc](https://github.com/richgel999/bc7enc) and [bc7e](https://github.com/BinomialLLC/bc7e). Note this repo contains the latest version of bc7e.ispc, which has a determinism bug fix.

bc7enc currently only supports modes 1 and 6 for RGB, and modes 1, 5, 6, and 7 for alpha. The plan is to add all the modes. See the [bc7enc16](https://github.com/richgel999/bc7enc16) project for the previous version (which only supports modes 1 and 6). Note this readme still refers to "bc7enc16", but bc7enc is the same encoder but with more alpha modes.
**Note: If you use this software in a product, attribution / credits is requested but not required. Thanks!**

This codec supports a perceptual mode when encoding BC7, where it computes colorspace error in
weighted YCbCr space (like etc2comp), and it also supports weighted RGBA
metrics. It's particularly strong in perceptual mode, beating the current state of
the art CPU encoder (Intel's ispc_texcomp) by a wide margin when measured by
Luma PSNR, even though it only supports 2 modes and isn't vectorized.
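
For reference, a luma PSNR figure of the kind quoted here can be computed roughly as below. This is a sketch under assumptions: it uses the standard Rec. 709 luma coefficients and the usual 8-bit PSNR definition, and `rec709_luma` and `luma_psnr` are hypothetical helper names. The benchmark tool may round or weight values differently.

```python
import math


def rec709_luma(r, g, b):
    """Rec. 709 luma from 8-bit RGB components."""
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def luma_psnr(original, decoded):
    """PSNR in dB of the luma channel for two equal-length lists of (r, g, b) pixels."""
    if not original or len(original) != len(decoded):
        raise ValueError("pixel lists must be non-empty and the same length")
    squared_error = sum(
        (rec709_luma(*a) - rec709_luma(*b)) ** 2
        for a, b in zip(original, decoded)
    )
    mse = squared_error / len(original)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(255.0 ** 2 / mse)


# A uniform luma error of 1 level gives 20 * log10(255) dB.
example_db = luma_psnr([(0, 0, 0)] * 4, [(1, 1, 1)] * 4)
print(round(example_db, 2))
```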
### Compiling

Why only modes 1 and 6 for opaque BC7?
Because with these two modes you have a complete encoder that supports both
opaque and transparent textures in a small amount (~1400 lines) of
understandable plain C code. Mode 6 excels on smooth blocks, and mode 1 is
strong with complex blocks, and a strong encoder that combines both modes can be
quite high quality. Fast mode 6-only encoders will have noticeable block
artifacts which this codec avoids by fully supporting mode 1.
This build has been tested with MSVC 2019 x64 and clang 6.0.0 under Ubuntu v18.04.

Modes 1 and 6 are typically the most used modes on many textures using other
encoders. Mode 1 has two subsets, 64 possible partitions, and 3-bit indices,
while mode 6 has large 4-bit indices and high precision 7777.1 endpoints. This
codec produces output that is far higher quality than any BC1 encoder, and
approaches (or in perceptual mode exceeds!) the quality of other full BC7
encoders.
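
The layouts of the two modes can be checked with a little arithmetic: every BC7 block is 128 bits, and both modes fill it exactly. The sketch below assumes the standard BC7 layout (mode 1: 2 mode bits, 6 partition bits, 6-bit RGB endpoints, one shared p-bit per subset; mode 6: 7 mode bits, 7-bit RGBA endpoints, one p-bit per endpoint); `mode_bits` is a hypothetical helper.

```python
BLOCK_BITS = 128  # every BC7 block is 16 bytes


def mode_bits(mode_id_bits, partition_bits, subsets, channels,
              endpoint_bits, pbits, index_bits):
    """Total bits used by a BC7 mode with a single index set."""
    # Two endpoints per subset, each with `channels` components.
    endpoints = subsets * 2 * channels * endpoint_bits
    # 16 texels, minus one implicit (always zero) bit per subset anchor index.
    indices = 16 * index_bits - subsets
    return mode_id_bits + partition_bits + endpoints + pbits + indices


# Mode 1: two subsets, 64 partitions (6 bits), 6-bit RGB endpoints,
# one shared p-bit per subset, 3-bit indices.
mode1 = mode_bits(2, 6, 2, 3, 6, 2, 3)

# Mode 6: one subset, 7-bit RGBA endpoints plus one p-bit per endpoint
# (the "7777.1" format), 4-bit indices.
mode6 = mode_bits(7, 0, 1, 4, 7, 2, 4)

print(mode1, mode6)
```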
To compile with bc7e.ispc (on Linux this requires [Intel's ISPC compiler](https://ispc.github.io/downloads.html) to be in your path - recommended):

Why is bc7enc16 so fast in perceptual mode?
Computing error in YCbCr space is more expensive than in RGB space, yet bc7enc16
in perceptual mode is stronger than ispc_texcomp (see the benchmark below) -
even without SSE/AVX vectorization and with only 2 modes to work with!
```
cmake -D SUPPORT_BC7E=TRUE .
make
```

To compile without BC7E:

```
cmake .
make
```

Most BC7 encoders only support linear RGB colorspace metrics, which is a
fundamental weakness. Some support weighted RGB metrics, which is better. With
linear RGB metrics, encoding error is roughly balanced between each channel, and
encoders have to work *very* hard (examining large amounts of RGB search space)
to get overall quality up. With perceptual colorspace metrics, RGB error tends
to become a bit unbalanced, with green quality favored more highly than red and
blue, and blue quality favored the least. A perceptual encoder is tuned to
prefer exploring solutions along the luma axis, where it's much less work to find
solutions with less luma error. bc7enc16 is, as far as I know, the first BC7
codec to support computing error in weighted YCbCr colorspace.
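
A weighted YCbCr error of the kind described above might look like the following sketch. The Rec. 709 conversion constants are standard, but the channel weights `(4.0, 1.0, 1.0)` and the function names are assumptions chosen for illustration, not the values used by bc7enc16.

```python
# Rec. 709 luma coefficients.
KR, KG, KB = 0.2126, 0.7152, 0.0722


def rgb_to_ycbcr(r, g, b):
    """Convert RGB to Rec. 709 Y, Cb, Cr (unscaled, no offsets)."""
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2.0 * (1.0 - KB))
    cr = (r - y) / (2.0 * (1.0 - KR))
    return y, cb, cr


def weighted_ycbcr_error(p, q, weights=(4.0, 1.0, 1.0)):
    """Weighted squared error between two RGB colors, measured in YCbCr.

    The default weights are hypothetical; they simply favor luma over chroma.
    """
    return sum(
        w * (a - b) ** 2
        for w, a, b in zip(weights, rgb_to_ycbcr(*p), rgb_to_ycbcr(*q))
    )


black = (0, 0, 0)
err_r = weighted_ycbcr_error(black, (10, 0, 0))
err_g = weighted_ycbcr_error(black, (0, 10, 0))
err_b = weighted_ycbcr_error(black, (0, 0, 10))
print(err_r, err_g, err_b)
```

Under these assumed weights, the same 10-level step costs the most in green and the least in blue, which matches the imbalance the paragraph describes.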
Note the MSVC and Linux builds enable OpenMP for faster compression.

Note: Most of the timings here (except for the ispc_texcomp "fast" mode timings at the very bottom)
are for the *original* release, before I added several more optimizations. The latest version of
bc7enc16.c is around 8-27% faster than the initial release at same quality (when mode 1 is enabled -
there's no change with just mode 6).
### Examples

Some benchmarks across 31 images (kodim corpus+others):
The [.DDS](https://docs.microsoft.com/en-us/windows/win32/direct3ddds/dx-graphics-dds-pguide) output files can be loaded/viewed using tools like [AMD Compressonator](https://gpuopen.com/compressonator/).

Perceptual (average REC709 Luma PSNR - higher is better quality):
To encode to non-RDO BC7 using BC7E, highest quality, linear RGB(A) metrics:

```
./bc7enc blah.png
```
ispc_texcomp slow vs. bc7enc16 uber4/max_partitions 64
ispc_texcomp: 355.4 secs 48.6 dB
bc7enc16: 122.6 secs 50.0 dB

ispc_texcomp slow vs. bc7enc16 uber0/max_partitions 64
ispc_texcomp: 355.4 secs 48.6 dB
bc7enc16: 38.3 secs 49.6 dB
To encode to non-RDO BC7 using BC7E, highest quality, using perceptual (scaled YCbCr) colorspace error metrics:

ispc_texcomp basic vs. bc7enc16 uber0/max_partitions 16
ispc_texcomp: 100.2 secs 48.3 dB
bc7enc16: 20.8 secs 49.3 dB
```
./bc7enc blah.png -s
```

ispc_texcomp fast vs. bc7enc16 uber0/max_partitions 16
ispc_texcomp: 41.5 secs 48.0 dB
bc7enc16: 20.8 secs 49.3 dB
To encode to RDO BC7 using BC7E, highest quality, lambda=.5, linear metrics (perceptual colorspace metrics are always automatically disabled when -z is specified), with a balance of encoding performance vs. RDO efficiency:

ispc_texcomp ultrafast vs. bc7enc16 uber0/max_partitions 0
ispc_texcomp: 1.9 secs 46.2 dB
bc7enc16: 8.9 secs 48.4 dB
```
./bc7enc blah.png -z.5
```

Non-perceptual (average RGB PSNR):
To encode to RDO BC7 using BC7E, lower baseline quality (-u4) for faster encoding, lambda=.5, and with faster encoding (only inject one match vs two, with a tiny RDO lookback window size of 16 bytes):

ispc_texcomp slow vs. bc7enc16 uber4/max_partitions 64
ispc_texcomp: 355.4 secs 46.8 dB
bc7enc16: 51 secs 46.1 dB
```
./bc7enc blah.png -u4 -z.5 -ze -zc16
```

ispc_texcomp slow vs. bc7enc16 uber0/max_partitions 64
ispc_texcomp: 355.4 secs 46.8 dB
bc7enc16: 29.3 secs 45.8 dB
To encode to non-RDO BC7 using entropy reduced or quantized/weighted BC7 (no slowdown vs. non-RDO bc7enc.cpp for BC7, slightly reduced quality, but 5-10% better LZ compression, only uses 2 or 4 BC7 modes):

ispc_texcomp basic vs. bc7enc16 uber4/max_partitions 64
ispc_texcomp: 99.9 secs 46.5 dB
bc7enc16: 51 secs 46.1 dB
```
./bc7enc blah.png -C -e
```

ispc_texcomp fast vs. bc7enc16 uber1/max_partitions 16
ispc_texcomp: 41.5 secs 46.1 dB
bc7enc16: 19.8 secs 45.5 dB
To encode to RDO BC7 using the entropy reduction transform combined with reduced entropy BC7 encoding, with a slightly larger window size than the default which is 128 bytes:

ispc_texcomp fast vs. bc7enc16 uber0/max_partitions 8
ispc_texcomp: 41.5 secs 46.1 dB
bc7enc16: 10.46 secs 44.4 dB
```
./bc7enc -zc256 blah.png -C -e -z1.0
```
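
The -zc window size matters because an LZ codec can only reference earlier data within its own lookback window; for Deflate the maximum match distance is 32,768 bytes. The sketch below uses Python's `zlib` on synthetic random data (not actual encoder output) to show that a repeated chunk is only compressed away when it lies within that distance. `deflate_savings` is a hypothetical helper.

```python
import random
import zlib


def deflate_savings(gap_bytes, chunk_bytes=4096, seed=1):
    """Bytes saved by zlib when a random chunk repeats after `gap_bytes` of noise."""
    rng = random.Random(seed)
    chunk = bytes(rng.getrandbits(8) for _ in range(chunk_bytes))
    filler = bytes(rng.getrandbits(8) for _ in range(gap_bytes))
    data = chunk + filler + chunk  # second copy sits chunk_bytes + gap_bytes back
    return len(data) - len(zlib.compress(data, 9))


# Repeat distance of 20 KB: inside Deflate's 32 KB window, so the copy is matched.
near = deflate_savings(16 * 1024)

# Repeat distance of 44 KB: outside the window, so nothing is saved.
far = deflate_savings(40 * 1024)

print(near, far)
```

The same reasoning suggests why matches injected by the RDO pass only pay off if the downstream codec can still reach the earlier bytes.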

ispc_texcomp ultrafast vs. bc7enc16 uber0/max_partitions 0
ispc_texcomp: 1.9 secs 42.7 dB
bc7enc16: 3.8 secs 42.7 dB
Same as before, but higher compression (allow 2 matches per block instead of 1):

DirectXTex CPU in "mode 6 only" mode vs. bc7enc16 uber1/max_partitions 0 (mode 6 only), non-perceptual:
DirectXTex: 466.4 secs 41.9 dB
bc7enc16: 6.7 secs 42.8 dB
```
./bc7enc -zc256 blah.png -C -e -z1.0 -zn
```

DirectXTex CPU (default - no 3 subset modes) vs. bc7enc16 uber1/max_partitions 64, non-perceptual:
Same, except disable ultra-smooth block handling:

DirectXTex: 9485.1 secs 45.6 dB
bc7enc16: 36 secs 46.0 dB
```
(Note this version of DirectXTex has a key pbit bugfix which I've submitted but
is still waiting to be accepted. Non-bugfixed versions will be slightly lower
quality.)
```
./bc7enc -zc256 blah.png -C -e -z1.0 -zu
```

UPDATE: To illustrate how strong the mode 1+6 implementation is in bc7enc16, let's compare ispc_texcomp
fast vs. the latest version of bc7enc16 uber4/max_partitions 64:
To encode to RDO BC7 using the entropy reduction transform at lower quality, combined with reduced entropy BC7 encoding, with a slightly larger window size than the default which is 128 bytes:

Without filterbank optimizations:
```
Time RGB PSNR Y PSNR
ispc_texcomp: 41.45 secs 46.09 dB 48.0 dB
bc7enc16: 41.42 secs 46.03 dB 48.2 dB
./bc7enc -zc256 blah.png -C -e -z2.0
```

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, without using reduced entropy BC7 encoding:

With filterbank optimizations enabled:
bc7enc16: 38.78 secs 45.94 dB 48.12 dB
```
They both have virtually the same average RGB PSNR with these settings (.06 dB is basically noise), but
bc7enc16 is just as fast as ispc_texcomp fast, even though it's not vectorized. Interestingly, our Y PSNR is better,
although bc7enc16 wasn't using perceptual metrics in these benchmarks.
./bc7enc -zc1024 blah.png -z1.0
```

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, with a manually specified smooth block max error scale:

```
./bc7enc -zc1024 blah.png -z2.0 -zb30.0
```

To encode to RDO BC7 using the entropy reduction transform at higher effectiveness using a larger window size, using only mode 6 (more block artifacts, but better rate-distortion performance as measured by PSNR):

```
./bc7enc -zc1024 blah.png -6 -z1.0 -e
```

To encode to BC1:
```
./bc7enc -1 blah.png
```

To encode to BC1 with Rate Distortion Optimization (RDO) at lambda=1.0:
```
./bc7enc -1 -z1.0 blah.png
```

The -z option controls lambda, or the rate vs. distortion tradeoff. 0 = maximum quality, higher values=lower bitrates but lower quality. Try values [.25-8].

To encode to BC1 with RDO, with RDO debug output, to monitor the percentage of blocks impacted:
```
./bc7enc -1 -z1.0 -zd blah.png
```

To encode to BC1 with RDO with a higher than default smooth block scale factor:
```
./bc7enc -1 -z1.0 -zb40.0 blah.png
```

Use -zb1.0 to disable smooth block error scaling completely, which increases RDO performance but can result in noticeable artifacts on smooth/flat blocks at higher lambdas.

Use -zc# to control the RDO window size in bytes. Good values to try are 16-8192.
Use -zt to disable RDO multithreading.

To encode to BC1 with RDO at the highest achievable quality/effectiveness (this is extremely slow):

```
./bc7enc -1 -z1.0 -zc32768 blah.png
```

This sets the window size to 32KB (the highest setting that makes sense for Deflate). Window sizes of 2KB (the default) to 8KB are way faster and in practice are almost as effective. The maximum window size setting supported by the command line tool is 64KB, but this would be very slow.

For even higher quality per bit (this is incredibly slow):
```
./bc7enc -1 -z1.0 -zc32768 -zm blah.png
```

### Dependencies
There are no 3rd party code or library dependencies. utils.cpp/.h is only needed by the example command line tool. It uses C++11. The individual .cpp files are designed to be easily dropped into other codebases.

For RDO post-processing of any block-based format: ert.cpp/.h. You provide this function an array of encoded blocks, an array of source/original 32bpp blocks, some parameters, and a pointer to a block decoder function for your format as a callback. It must return false if the passed in block data is invalid. (Make sure you *really* validate the block's data, because the ERT post-processor will inevitably call your callback with invalid blocks.) This transform works on most other texture formats, such as ETC1/2, EAC, and ASTC. The ERT works on block sizes ranging from 1x1 to 12x12. This file has no other dependencies apart from utils.cpp/h.

For BC1-5 encoding/decoding: rgbcx.cpp/.h

For BC7 encoding: bc7enc.cpp/.h

For BC7 decoding: bc7decomp.cpp/.h

This was a multithreaded benchmark (using OpenMP) on a dual Xeon workstation.
ispc_texcomp was called with 64-blocks at a time and used AVX instructions.
Timings are for encoding only.