
Releases: NVlabs/tiny-cuda-nn

Version 1.6

15 Dec 14:59
8e6e242

Given how many improvements have landed since April, and how long tiny-cuda-nn's current state has remained stable, I think it's about time for another release.

Changes Since Last Release

  • Multi-GPU support: tiny-cuda-nn can now run on multiple GPUs simultaneously. It is the user's responsibility to ensure that parameters, inputs, outputs, and streams reside on the currently active CUDA device; see the sketch after this list.
    • PyTorch multi-GPU operation works out-of-the-box.
  • CMake improvements: When using tiny-cuda-nn as a CMake submodule, its include folders and libraries are now tracked as part of its PUBLIC interface. This means the following two lines of CMake are sufficient for a parent project to be able to use tiny-cuda-nn in its CUDA code:
    add_subdirectory(dependencies/tiny-cuda-nn)
    target_link_libraries(<parent project> PUBLIC tiny-cuda-nn)
  • Assorted functionality upgrades:
    • AdamOptimizer can now perform weight clipping.
    • A new CompositeOptimizer has been added (courtesy of @Solonets). It can optimize different parts of the model (such as the encoding and the neural network) with different optimizers, e.g. to use different learning rates.
    • CompositeEncoding can now perform sum or product reduction over its nested encodings.
    • Alignment of Encoding's input and output matrices has been simplified and should work automatically in all cases now.
    • Many situations that used to cause undefined behavior are now checked and throw descriptive exceptions.
    • Parameter initialization (model->initialize_params(...)) and parameter setting (model->set_params(...)) have been decoupled. Calling set_params is now required before a model can be used. initialize_params no longer influences the parameters of the model; it merely returns a set of parameters that serves as a good initial state for training.
    • Snapshots are now compatible across CutlassMLP and FullyFusedMLP, as well as across float and __half precision. This means snapshots generated from any GPU can be loaded by any other GPU.
    • The hash function of GridEncoding can now be configured.
  • Countless bug fixes and performance improvements.
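
A minimal multi-GPU sketch (illustrative, not part of the release itself): one model instance per GPU, with all allocations and streams created while that GPU is active. The create_from_config helper and GPUMatrix usage follow the README sample; header names, include paths, and the stream-taking inference overload are assumptions and may differ in your checkout.

#include <tiny-cuda-nn/config.h> // assumed header providing create_from_config
#include <json/json.hpp>         // bundled nlohmann::json (assumed include path)
#include <cuda_runtime.h>

void infer_on_device(int device, const nlohmann::json& config) {
    // Make `device` the active CUDA device *before* creating the model,
    // its matrices, and its stream, as required by the note above.
    cudaSetDevice(device);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Illustrative dimensions: 3 input dims, 1 output dim, batch of 16384.
    auto model = tcnn::create_from_config(3, 1, config);

    tcnn::GPUMatrix<float> inputs(3, 16384);
    tcnn::GPUMatrix<float> outputs(1, 16384);
    // ... fill `inputs` with data residing on this device ...
    model.network->inference(stream, inputs, outputs);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}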

Version 1.5

22 Apr 07:20

Changes Since Last Release

  • Encodings and neural networks in tiny-cuda-nn now share the same generic API for differentiable objects. This simplifies implementations significantly.
    • As part of this generalization, encodings and neural networks can now take and produce both row- and column-major matrices (i.e. both AoS and SoA data). Additionally, input data may be arbitrarily strided, which permits slicing of input matrices without copying; see the sketch after this list.
  • Added GridEncoding support for double-backward, which is useful for e.g. eikonal supervision (courtesy of @ventusff).
  • Dropped the dependency on PyEXR / tinyexr in the sample applications (using imageio / stb_image instead).
  • Fixed many bugs, added several performance improvements, and improved compatibility with older GPUs.
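
A sketch of the layout-agnostic interface (illustrative): GPUMatrix<T> is column-major, GPUMatrix<T, RM> is row-major (as introduced in the 1.1 notes below), and both can be handed to the same inference call. The network is left as a template parameter to avoid committing to a particular class signature; header names and the stream-taking overload are assumptions.

#include <tiny-cuda-nn/gpu_matrix.h> // assumed header for GPUMatrix
#include <cuda_runtime.h>
#include <cstdint>

template <typename NETWORK>
void run_both_layouts(NETWORK& network, cudaStream_t stream, uint32_t batch_size) {
    // Column-major (SoA) and row-major (AoS) batches of 3-dimensional inputs.
    tcnn::GPUMatrix<float> inputs_cm(3, batch_size);
    tcnn::GPUMatrix<float, tcnn::RM> inputs_rm(3, batch_size);

    tcnn::GPUMatrix<float> outputs(network.output_width(), batch_size);

    // The same call accepts either layout; no transposition or copy is needed.
    network.inference(stream, inputs_cm, outputs);
    network.inference(stream, inputs_rm, outputs);
}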

Version 1.4

14 Feb 14:53

Changes Since Last Release

Major Changes

  • Added a PyTorch extension for using tiny-cuda-nn from within Python.
    • This functionality is considered to be in a "beta" state. Please do report any issues you come across!
    • See this section of the README for installation/usage instructions.
    • Caveat: the overheads of Python/PyTorch can be extensive. For example, the bundled mlp_learning_an_image example is ~2x slower through PyTorch than native CUDA. (This is still faster than implementing everything from scratch in Python, but something to be aware of.)
  • Significantly reduced memory usage (sometimes 3x lower)
    • Added a GPU memory arena that permits efficient, stream-ordered allocation and de-allocation of temporary buffers. This circumvents the need for pre-allocation, often resulting in 3x lower memory consumption.
    • The memory arena uses the GPU's virtual memory mapper to achieve this performance without invalidating pointers or shuffling memory around; see the sketch after this list.
  • All neural networks in tiny-cuda-nn now additionally support row-major input memory layout. This affords higher performance and lower memory usage in cases that previously required a transposition.
    • GridEncoding naturally outputs row-major data and is thus sped up by ~20% when followed by a neural network.
  • tiny-cuda-nn now runs on older GPUs, down to compute capability 3.7.
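
A hypothetical usage sketch of the arena described above. The allocate_workspace helper name, its signature, and the header are assumptions based on the current sources; check gpu_memory.h in your checkout.

#include <tiny-cuda-nn/gpu_memory.h> // assumed header for the memory arena
#include <cuda_runtime.h>

void scratch_example(cudaStream_t stream) {
    // Request 1 MiB of temporary, stream-ordered workspace from the arena.
    auto workspace = tcnn::allocate_workspace(stream, 1024 * 1024);

    // ... launch kernels on `stream` that read/write workspace.data() ...

    // When `workspace` goes out of scope, its memory is returned to the
    // arena in stream order: no pre-allocation, no pointer invalidation.
}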

Minor Changes

  • Sped up the input gradient computation of GridEncoding by ~3x.
  • Sped up SyncedMultiStream.
  • Fixed incorrect gradients of SphericalHarmonicsEncoding.
  • Fixed incorrect gradients of GridEncoding when max_level arguments were provided or Interpolation::Nearest was used.

Version 1.3

14 Jan 09:47

Changes Since Last Release

Major Changes

  • Adds a new encoding: GridEncoding (see the sketch after this list)
  • tiny-cuda-nn now runs on CUDA 10.2 (previously required CUDA 11 and higher)
  • tiny-cuda-nn now only requires C++14 (previously C++17)
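
An illustrative way to instantiate the new encoding via a JSON config and the create_encoding factory used throughout the samples. Key names follow the repository's DOCUMENTATION.md; the values, include paths, and exact return type are assumptions.

#include <tiny-cuda-nn/encoding.h> // assumed header providing create_encoding
#include <json/json.hpp>           // bundled nlohmann::json (assumed include path)

auto make_grid_encoding() {
    // Multi-resolution hash grid; all values below are illustrative defaults.
    nlohmann::json config = {
        {"otype", "HashGrid"},
        {"n_levels", 16},
        {"n_features_per_level", 2},
        {"log2_hashmap_size", 19},
        {"base_resolution", 16},
        {"per_level_scale", 2.0f},
    };

    // network_precision_t is the library's default compute precision
    // (__half or float, depending on the build).
    return tcnn::create_encoding<tcnn::network_precision_t>(/*n_dims_to_encode=*/3, config);
}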

Minor Changes

  • This repository now supports continuous integration builds through GitHub Actions.
  • Added support for a 16-neuron-wide FullyFusedMLP
  • Added support for nesting of SyncedMultiStream

Version 1.2

15 Dec 14:43

Changes Since Last Release

Major Changes

  • Adds three new encodings: (i) TriangleWave, (ii) SphericalHarmonics, (iii) Composite
  • Pitched pointers are now used to parameterize inputs and outputs of all encodings.
    • This feature enables a new Composite encoding that can apply basic encodings to different subsets of input dimensions.
    • This also removes the distinction between "encoded dims" and "passthrough dims". The old behavior of passing through certain dimensions can be achieved by composing with the Identity encoding; see the sketch after this list.
  • tiny-cuda-nn no longer depends on cuRAND and instead uses an implementation of the PCG32 random number generator (derived from https://github.com/wjakob/pcg32) for all randomness.
  • Activation code has been centralized within and across CUTLASS components. All neural network implementations now support all activation functions (except for the ResNet, which still only supports ReLU activations in its hidden layers).
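
As an illustration of the Composite encoding replacing the old passthrough mechanism, a config of the following shape frequency-encodes the first three input dimensions and passes the rest through unchanged. Key names follow the repository's documentation; the values and include path are illustrative assumptions.

#include <json/json.hpp> // bundled nlohmann::json (assumed include path)

// Frequency-encode the first 3 dimensions; the remaining dimensions go
// through the Identity encoding, i.e. they are passed through unchanged.
const nlohmann::json composite_config = nlohmann::json::parse(R"({
    "otype": "Composite",
    "nested": [
        {"otype": "Frequency", "n_dims_to_encode": 3, "n_frequencies": 12},
        {"otype": "Identity"}
    ]
})");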

Minor Changes

  • Installed GPUs are now automatically detected and correctly targeted by CMake.
  • Samples and benchmarks can now be disabled when tiny-cuda-nn is used as a submodule.
  • The required CUDA version has been relaxed. Future plans include compatibility with CUDA 10.2.

Version 1.1

30 Oct 08:50

Changes Since Last Release

Major Changes

  • tiny-cuda-nn now supports saving and loading snapshots via Trainer::serialize and Trainer::deserialize. These functions produce a nlohmann::json object containing the trained parameters of the model as well as, optionally, the state of the optimizer (to support continued training).

The intended way to efficiently store the resulting json blob to disk is:

// json refers to nlohmann::json (bundled with tiny-cuda-nn); requires <fstream>.
std::ofstream f("checkpoint.msgpack", std::ios::out | std::ios::binary);
json::to_msgpack(trainer->serialize(), f);

and to load it again:

std::ifstream f{"checkpoint.msgpack", std::ios::in | std::ios::binary};
trainer->deserialize(json::from_msgpack(f));
  • tiny-cuda-nn now supports L1-type losses. Four new losses were added: L1, Relative L1, MAPE (Mean Absolute Percentage Error), and SMAPE (Symmetric Mean Absolute Percentage Error); see the sketch after this list.
  • GPUMatrix has been made much less verbose. Column-major matrices now have the type GPUMatrix<T> and row-major matrices GPUMatrix<T, RM>. We also introduced a dynamically laid out matrix type: GPUMatrixDynamic<T>. As a result, the API for dynamically laid out network outputs is now simplified.
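
A brief sketch of selecting one of the new losses by config via the create_loss factory; the factory name, header, and include path are assumptions based on the current sources.

#include <tiny-cuda-nn/loss.h> // assumed header providing create_loss
#include <json/json.hpp>       // bundled nlohmann::json (assumed include path)

auto make_relative_l1_loss() {
    // Other new options: "L1", "MAPE", "SMAPE".
    nlohmann::json loss_config = {{"otype", "RelativeL1"}};
    return tcnn::create_loss<tcnn::network_precision_t>(loss_config);
}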

Minor Changes

  • Extends the functionality of Network/NetworkWithInputEncoding to support features such as extraction of neuron activations or gradients of the output w.r.t. the input.
  • Added Squareplus and Softplus activations to FullyFusedMLP.
  • CMake now automatically detects the GPU architecture of the system, simplifying the compilation process for Turing and A100 GPUs (see the updated README.md).
  • Removed data_factor from all losses. To achieve the same behavior, please wrap existing losses in a helper class.