Releases: NVlabs/tiny-cuda-nn
Version 1.6
Given how many improvements have landed since April, and how long tiny-cuda-nn's current state has been stable, I think it's about time for another release.
Changes Since Last Release
- Multi-GPU support: tiny-cuda-nn can now run on multiple GPUs simultaneously. It is the user's responsibility to ensure that parameters, inputs, outputs, and streams reside on the currently active CUDA device (see the sketch after this list).
  - PyTorch multi-GPU operation works out-of-the-box.
- CMake improvements: When using tiny-cuda-nn as a CMake submodule, its include folders and libraries are now tracked as part of its `PUBLIC` interface. This means the following two lines of CMake are sufficient for a parent project to be able to use tiny-cuda-nn in its CUDA code:
  `add_subdirectory(dependencies/tiny-cuda-nn)`
  `target_link_libraries(<parent project> PUBLIC tiny-cuda-nn)`
- Assorted functionality upgrades:
  - `AdamOptimizer` can now perform weight clipping.
  - A new `CompositeOptimizer` was added (courtesy of @Solonets). It can optimize different parts of the model (such as the encoding and the neural net) using different optimizers, e.g. to use different learning rates.
  - `CompositeEncoding` can now perform sum or product reduction over its nested encodings.
  - Alignment of `Encoding`'s input and output matrices has been simplified and should now work automatically in all cases.
  - Many situations that used to cause undefined behavior are now checked and throw descriptive exceptions.
- Parameter initialization (`model->initialize_params(...)`) and parameter setting (`model->set_params(...)`) have been decoupled. Calling `set_params` is required before a model can be used. Calling `initialize_params` no longer influences the parameters of the model and instead merely returns a set of parameters that serves as a good initial state for training.
- Snapshots are now compatible across `CutlassMLP` and `FullyFusedMLP`, as well as across `float` and `__half` precision. This means snapshots generated on any GPU can be loaded by any other GPU.
- The hash function of `GridEncoding` can now be configured.
- Countless bug fixes and performance improvements.
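To make the device-residency requirement concrete (see the multi-GPU bullet above), here is a minimal sketch of the intended per-GPU discipline. It uses only standard CUDA runtime calls; the tiny-cuda-nn-specific steps are left as comments because their exact signatures are not spelled out in these notes.

```cpp
#include <cuda_runtime.h>

int main() {
	int n_devices = 0;
	cudaGetDeviceCount(&n_devices);

	for (int device = 0; device < n_devices; ++device) {
		// Make `device` the active CUDA device *before* creating anything tied to it.
		cudaSetDevice(device);

		// Streams are per-device; this one belongs to `device`.
		cudaStream_t stream;
		cudaStreamCreate(&stream);

		// Hypothetical tiny-cuda-nn usage (not verbatim API):
		// - construct the model while `device` is active so its parameters live there,
		// - allocate input and output matrices on the same device,
		// - pass `stream` to the training/inference calls.

		cudaStreamSynchronize(stream);
		cudaStreamDestroy(stream);
	}
	return 0;
}
```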
Version 1.5
Changes Since Last Release
- Encodings and neural networks in tiny-cuda-nn now share the same generic API for differentiable objects. This simplifies implementations significantly.
- As part of this generalization, encodings and neural networks can now take and produce row- and column-major matrices (i.e. both AoS and SoA data). Additionally, input data may be strided arbitrarily, which permits slicing of input matrices without copying (see the sketch after this list).
- Added `GridEncoding` support for double-backward, which is useful for e.g. eikonal supervision (courtesy of @ventusff).
- Dropped the dependency on PyEXR / tinyexr in the sample applications (using `imageio` / `stb_image` instead).
- Fixed many bugs, added several performance improvements, and improved compatibility with older GPUs.
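The copy-free slicing mentioned above comes down to treating a sub-batch as a strided view of the same buffer. The sketch below is plain C++ rather than tiny-cuda-nn's matrix API (the `ColumnMajorView` type and `slice_cols` helper are hypothetical names), but it shows why arbitrary strides make slicing free of copies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view type: a column-major matrix is fully described by a
// pointer, its dimensions, and the stride between consecutive columns.
struct ColumnMajorView {
	float* data;
	uint32_t rows;    // number of input dimensions
	uint32_t cols;    // batch size of the view
	uint32_t stride;  // elements between consecutive columns (>= rows)
};

// View columns [first, first + count) of an existing matrix: same storage,
// offset pointer, unchanged stride -- no data is copied.
ColumnMajorView slice_cols(const ColumnMajorView& m, uint32_t first, uint32_t count) {
	return {m.data + size_t(first) * m.stride, m.rows, count, m.stride};
}

int main() {
	const uint32_t n_dims = 3, batch = 8;
	std::vector<float> storage(n_dims * batch, 0.0f);
	ColumnMajorView full{storage.data(), n_dims, batch, n_dims};

	// Second half of the batch as a zero-copy view.
	ColumnMajorView second_half = slice_cols(full, batch / 2, batch / 2);
	(void)second_half;
	return 0;
}
```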
Version 1.4
Changes Since Last Release
Major Changes
- Added a PyTorch extension for using tiny-cuda-nn from within Python.
  - This functionality is considered to be in a "beta" state. Please do report any issues you come across!
  - See this section of the README for installation/usage instructions.
  - Caveat: the overheads of Python/PyTorch can be extensive. For example, the bundled `mlp_learning_an_image` example is ~2x slower through PyTorch than native CUDA. (This is still faster than implementing everything from scratch in Python, but something to be aware of.)
- Significantly reduced memory usage (sometimes 3x lower)
  - Added a GPU memory arena that permits efficient, stream-ordered allocation and de-allocation of temporary buffers. This circumvents the need for pre-allocation, resulting in often 3x lower memory consumption (see the sketch after this list).
  - The memory arena uses the GPU's virtual memory mapper to get its performance without invalidating pointers or shuffling memory around.
- All neural networks in tiny-cuda-nn now additionally support row-major input memory layout. This affords higher performance and lower memory usage when transposition was otherwise required.
  - `GridEncoding` naturally outputs row-major data and is thus sped up by ~20% when followed by a neural network.
- tiny-cuda-nn now runs on older GPUs, down to compute capability 3.7.
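For readers unfamiliar with the term, the sketch below illustrates what "stream-ordered" allocation means. It deliberately uses CUDA's built-in stream-ordered allocator (`cudaMallocAsync` / `cudaFreeAsync`, available since CUDA 11.2) rather than tiny-cuda-nn's own memory arena, which, as noted above, is built on the virtual memory mapper instead; only the ordering semantics are the point here.

```cpp
#include <cuda_runtime.h>

int main() {
	cudaStream_t stream;
	cudaStreamCreate(&stream);

	float* scratch = nullptr;
	size_t n_bytes = size_t(1024) * 1024 * sizeof(float);

	// The allocation is enqueued on `stream`: it becomes usable in stream order,
	// without device-wide synchronization or up-front pre-allocation.
	cudaMallocAsync(reinterpret_cast<void**>(&scratch), n_bytes, stream);

	// ... kernels launched on `stream` may use `scratch` here ...

	// The buffer is likewise returned in stream order.
	cudaFreeAsync(scratch, stream);

	cudaStreamSynchronize(stream);
	cudaStreamDestroy(stream);
	return 0;
}
```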
Minor Changes
- Sped up the input gradient computation of `GridEncoding` by ~3x.
- Sped up `SyncedMultiStream`.
- Fixed incorrect gradients of `SphericalHarmonicsEncoding`.
- Fixed incorrect gradients of `GridEncoding` when `max_level` arguments were provided or `Interpolation::Nearest` was used.
Version 1.3
Changes Since Last Release
Major Changes
- Adds a new encoding: `GridEncoding` (see the configuration sketch after this list)
  - This encoding can be used to train and render neural graphics primitives instantly (see the real-time NeRF flythroughs above).
  - It is based on the concept of trainable multiresolution grids, which can be backed by hash tables, dense storage, or tiled storage.
  - More details can be found in this technical paper.
- tiny-cuda-nn now runs on CUDA 10.2 (previously, CUDA 11 or higher was required)
- tiny-cuda-nn now only requires C++14 (previously C++17)
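For reference (see the `GridEncoding` bullet above), here is a hedged example of how such an encoding is typically configured via the JSON objects tiny-cuda-nn consumes. The key names follow the project's documentation, but treat the exact keys and values as assumptions and consult `DOCUMENTATION.md` for your version.

```cpp
#include <nlohmann/json.hpp>
using nlohmann::json;

int main() {
	// Assumed key names for a hash-table-backed multiresolution grid.
	json grid_encoding_config = {
		{"otype", "HashGrid"},         // multiresolution grid backed by a hash table
		{"n_levels", 16},              // number of resolution levels
		{"n_features_per_level", 2},   // feature dimensions stored per level
		{"log2_hashmap_size", 19},     // 2^19 hash table entries per level
		{"base_resolution", 16},       // resolution of the coarsest level
		{"per_level_scale", 2.0}       // geometric growth factor between levels
	};

	// This object would be passed as the "encoding" part of a larger model
	// config; the factory call itself is omitted here.
	(void)grid_encoding_config;
	return 0;
}
```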
Minor Changes
- This repository now supports continuous integration builds through GitHub Actions.
- Added support for 16-neuron-wide `FullyFusedMLP` networks (see the configuration sketch after this list).
- Added support for nesting of `SyncedMultiStream`.
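Relatedly, here is a hedged sketch of a network configuration that selects the newly supported width; the key names mirror tiny-cuda-nn's documented network config and should be treated as assumptions for this particular release.

```cpp
#include <nlohmann/json.hpp>
using nlohmann::json;

int main() {
	// Assumed key names; "n_neurons": 16 requests the newly supported narrow width.
	json network_config = {
		{"otype", "FullyFusedMLP"},
		{"activation", "ReLU"},
		{"output_activation", "None"},
		{"n_neurons", 16},
		{"n_hidden_layers", 4}
	};
	(void)network_config;
	return 0;
}
```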
Version 1.2
Changes Since Last Release
Major Changes
- Adds three new encodings: (i) `TriangleWave`, (ii) `SphericalHarmonics`, (iii) `Composite`
- Pitched pointers are now used to parameterize the inputs and outputs of all encodings.
  - This feature enables a new `Composite` encoding that can apply basic encodings to different subsets of the input dimensions (see the configuration sketch after this list).
  - This also removes the distinction between "encoded dims" and "passthrough dims". The old behavior of passing through certain dimensions can be achieved by composing with the `Identity` encoding.
- tiny-cuda-nn no longer depends on cuRAND and instead uses an implementation of the PCG32 random number generator (derived from https://github.com/wjakob/pcg32) for all randomness.
- Activation code has been centralized within and across CUTLASS components. All neural network implementations now support all activation functions (except for the `ResNet`, which still only supports `ReLU` activations in its hidden layers).
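To illustrate the `Composite` encoding described above, here is a hedged configuration sketch that applies a basic encoding to the first input dimensions and passes the remainder through via `Identity`. The nesting scheme and key names follow tiny-cuda-nn's documentation but are assumptions as far as this release note is concerned.

```cpp
#include <nlohmann/json.hpp>
using nlohmann::json;

int main() {
	json composite_config = {
		{"otype", "Composite"},
		{"nested", {
			// Encode the first 3 input dimensions with spherical harmonics...
			{{"otype", "SphericalHarmonics"}, {"n_dims_to_encode", 3}, {"degree", 4}},
			// ...and pass the remaining dimensions through unchanged,
			// which replaces the old "passthrough dims" mechanism.
			{{"otype", "Identity"}}
		}}
	};
	(void)composite_config;
	return 0;
}
```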
Minor Changes
- Installed GPUs are now correctly automatically detected and targeted by CMake.
- Samples and benchmarks can now be disabled when tiny-cuda-nn is used as a submodule.
- The required CUDA version has been relaxed. Future plans include compatibility with CUDA 10.2.
Version 1.1
Changes Since Last Release
Major Changes
- tiny-cuda-nn now supports saving and loading snapshots via `Trainer::serialize` and `Trainer::deserialize`. These functions produce an `nlohmann::json` object containing the trained parameters of the model as well as, optionally, the state of the optimizer (to support continued training).
The intended way to efficiently store the resulting json blob to disk is:
std::ofstream f("checkpoint.msgpack", std::ios::out | std::ios::binary);
json::to_msgpack(trainer->serialize(), f);
and to load it again:
std::ifstream f{"checkpoint.msgpack", std::ios::in | std::ios::binary};
trainer->deserialize(json::from_msgpack(f));
- tiny-cuda-nn now supports L1-type losses. Four new losses were added: `L1`, `Relative L1`, `MAPE` (Mean Absolute Percentage Error), and `SMAPE` (Symmetric Mean Absolute Percentage Error). See the configuration sketch after this list.
- `GPUMatrix` has been made much less verbose. Column-major matrices now have the type `GPUMatrix<T>` and row-major matrices `GPUMatrix<T, RM>`. We also introduced a dynamically laid out matrix type: `GPUMatrixDynamic<T>`. As a result, the API for dynamically laid out network outputs is now simplified.
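Below is a hedged sketch of selecting one of the new losses through the same JSON configuration mechanism used elsewhere in the library; the exact `otype` strings are assumptions, so check the documentation for the canonical names.

```cpp
#include <nlohmann/json.hpp>
using nlohmann::json;

int main() {
	// Relative L1: |prediction - target| normalized by the prediction magnitude.
	// "RelativeL1" is the assumed registry name; "L1", "MAPE", and "SMAPE" would
	// be selected analogously.
	json loss_config = {{"otype", "RelativeL1"}};
	(void)loss_config;
	return 0;
}
```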
Minor Changes
- Extends the functionality of `Network` / `NetworkWithInputEncoding` to support features such as extraction of neuron activations or gradients of the output w.r.t. the input.
- Added `Squareplus` and `Softplus` activations to `FullyFusedMLP`.
- CMake now automatically detects the GPU architecture of the system, simplifying the compilation process for Turing and A100 GPUs (see the updated `README.md`).
- Removed `data_factor` from all losses. To achieve the same behavior, please wrap existing losses in a helper class.