Skip to content

Commit

Permalink
Merge branch 'branch-24.06' into jawe/ops-2621
Browse files Browse the repository at this point in the history
  • Loading branch information
jameslamb authored Apr 19, 2024
2 parents ba03497 + 9e6db74 commit 9dfd907
Show file tree
Hide file tree
Showing 87 changed files with 260 additions and 226 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -14,14 +14,14 @@ repos:
rev: 5.12.0
hooks:
- id: isort
args: ["--settings-path=python/pyproject.toml"]
args: ["--settings-path=python/rmm/pyproject.toml"]
files: python/.*
types_or: [python, cython, pyi]
- repo: https://github.com/ambv/black
rev: 22.3.0
hooks:
- id: black
args: ["--config=python/pyproject.toml"]
args: ["--config=python/rmm/pyproject.toml"]
- repo: https://github.com/MarcoGorelli/cython-lint
rev: v0.15.0
hooks:
Expand Down
54 changes: 54 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,57 @@
# RMM 24.04.00 (10 Apr 2024)

## 🚨 Breaking Changes

- Accept stream argument in DeviceMemoryResource allocate/deallocate ([#1494](https://github.com/rapidsai/rmm/pull/1494)) [@wence-](https://github.com/wence-)
- Replace all internal usage of `get_upstream` with `get_upstream_resource` ([#1491](https://github.com/rapidsai/rmm/pull/1491)) [@miscco](https://github.com/miscco)
- Deprecate rmm::mr::device_memory_resource::supports_streams() ([#1452](https://github.com/rapidsai/rmm/pull/1452)) [@harrism](https://github.com/harrism)
- Remove deprecated rmm::detail::available_device_memory ([#1438](https://github.com/rapidsai/rmm/pull/1438)) [@harrism](https://github.com/harrism)
- Make device_memory_resource::supports_streams() not pure virtual. Remove derived implementations and calls in RMM ([#1437](https://github.com/rapidsai/rmm/pull/1437)) [@harrism](https://github.com/harrism)
- Deprecate rmm::mr::device_memory_resource::get_mem_info() and supports_get_mem_info(). ([#1436](https://github.com/rapidsai/rmm/pull/1436)) [@harrism](https://github.com/harrism)

## 🐛 Bug Fixes

- Fix search path for torch allocator in editable installs and ensure CUDA support is available ([#1498](https://github.com/rapidsai/rmm/pull/1498)) [@vyasr](https://github.com/vyasr)
- Accept stream argument in DeviceMemoryResource allocate/deallocate ([#1494](https://github.com/rapidsai/rmm/pull/1494)) [@wence-](https://github.com/wence-)
- Run STATISTICS_TEST and TRACKING_TEST in serial to avoid OOM errors. ([#1487](https://github.com/rapidsai/rmm/pull/1487)) [@bdice](https://github.com/bdice)

## 📖 Documentation

- Pin to recent breathe, to prevent getting an unsupported sphinx version. ([#1495](https://github.com/rapidsai/rmm/pull/1495)) [@bdice](https://github.com/bdice)

## 🚀 New Features

- Replace all internal usage of `get_upstream` with `get_upstream_resource` ([#1491](https://github.com/rapidsai/rmm/pull/1491)) [@miscco](https://github.com/miscco)
- Add complete set of resource ref aliases ([#1479](https://github.com/rapidsai/rmm/pull/1479)) [@nvdbaranec](https://github.com/nvdbaranec)
- Automate include grouping using clang-format ([#1463](https://github.com/rapidsai/rmm/pull/1463)) [@harrism](https://github.com/harrism)
- Add `get_upstream_resource` to resource adaptors ([#1456](https://github.com/rapidsai/rmm/pull/1456)) [@miscco](https://github.com/miscco)
- Deprecate rmm::mr::device_memory_resource::supports_streams() ([#1452](https://github.com/rapidsai/rmm/pull/1452)) [@harrism](https://github.com/harrism)
- Remove duplicated memory_resource_tests ([#1451](https://github.com/rapidsai/rmm/pull/1451)) [@miscco](https://github.com/miscco)
- Change `rmm::exec_policy` to take `async_resource_ref` ([#1449](https://github.com/rapidsai/rmm/pull/1449)) [@miscco](https://github.com/miscco)
- Change `device_scalar` to take `async_resource_ref` ([#1447](https://github.com/rapidsai/rmm/pull/1447)) [@miscco](https://github.com/miscco)
- Add device_async_resource_ref convenience alias ([#1441](https://github.com/rapidsai/rmm/pull/1441)) [@harrism](https://github.com/harrism)
- Remove deprecated rmm::detail::available_device_memory ([#1438](https://github.com/rapidsai/rmm/pull/1438)) [@harrism](https://github.com/harrism)
- Make device_memory_resource::supports_streams() not pure virtual. Remove derived implementations and calls in RMM ([#1437](https://github.com/rapidsai/rmm/pull/1437)) [@harrism](https://github.com/harrism)
- Deprecate rmm::mr::device_memory_resource::get_mem_info() and supports_get_mem_info(). ([#1436](https://github.com/rapidsai/rmm/pull/1436)) [@harrism](https://github.com/harrism)
- Support CUDA 12.2 ([#1419](https://github.com/rapidsai/rmm/pull/1419)) [@jameslamb](https://github.com/jameslamb)

## 🛠️ Improvements

- Use `conda env create --yes` instead of `--force` ([#1509](https://github.com/rapidsai/rmm/pull/1509)) [@bdice](https://github.com/bdice)
- Add upper bound to prevent usage of NumPy 2 ([#1501](https://github.com/rapidsai/rmm/pull/1501)) [@bdice](https://github.com/bdice)
- Remove hard-coding of RAPIDS version where possible ([#1496](https://github.com/rapidsai/rmm/pull/1496)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA)
- Require NumPy 1.23+ ([#1488](https://github.com/rapidsai/rmm/pull/1488)) [@jakirkham](https://github.com/jakirkham)
- Use `rmm::device_async_resource_ref` in multi_stream_allocation benchmark ([#1482](https://github.com/rapidsai/rmm/pull/1482)) [@miscco](https://github.com/miscco)
- Update devcontainers to CUDA Toolkit 12.2 ([#1470](https://github.com/rapidsai/rmm/pull/1470)) [@trxcllnt](https://github.com/trxcllnt)
- Add support for Python 3.11 ([#1469](https://github.com/rapidsai/rmm/pull/1469)) [@jameslamb](https://github.com/jameslamb)
- target branch-24.04 for GitHub Actions workflows ([#1468](https://github.com/rapidsai/rmm/pull/1468)) [@jameslamb](https://github.com/jameslamb)
- [FEA]: Use `std::optional` instead of `thrust::optional` ([#1464](https://github.com/rapidsai/rmm/pull/1464)) [@miscco](https://github.com/miscco)
- Add environment-agnostic scripts for running ctests and pytests ([#1462](https://github.com/rapidsai/rmm/pull/1462)) [@trxcllnt](https://github.com/trxcllnt)
- Ensure that `ctest` is called with `--no-tests=error`. ([#1460](https://github.com/rapidsai/rmm/pull/1460)) [@bdice](https://github.com/bdice)
- Update ops-bot.yaml ([#1458](https://github.com/rapidsai/rmm/pull/1458)) [@AyodeAwe](https://github.com/AyodeAwe)
- Adopt the `rmm::device_async_resource_ref` alias ([#1454](https://github.com/rapidsai/rmm/pull/1454)) [@miscco](https://github.com/miscco)
- Refactor error.hpp out of detail ([#1439](https://github.com/rapidsai/rmm/pull/1439)) [@lamarrr](https://github.com/lamarrr)

# RMM 24.02.00 (12 Feb 2024)

## 🚨 Breaking Changes
Expand Down
3 changes: 2 additions & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -114,7 +114,8 @@ endif()
# * add tests --------------------------------------------------------------------------------------

if(BUILD_TESTS AND CMAKE_PROJECT_NAME STREQUAL PROJECT_NAME)
include(cmake/thirdparty/get_gtest.cmake)
include(${rapids-cmake-dir}/cpm/gtest.cmake)
rapids_cpm_gtest(BUILD_STATIC)
include(CTest) # calls enable_testing()

add_subdirectory(tests)
Expand Down
167 changes: 108 additions & 59 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -207,38 +207,7 @@ alignment argument. All allocations are required to be aligned to at least 256B.
`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream
on which to perform the (de)allocation.

## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream`
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.

## `cuda_stream_pool`

`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
`rmm::cuda_stream_view` that represent identical CUDA streams.

### Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
either of these functions from other threads. They are _not_ thread safe with respect to
construction and destruction of the memory resource object.

Note that a class `thread_safe_resource_adapter` is provided which can be used to adapt a memory
resource that is not thread safe to be thread safe (as described above). This adapter is not needed
with any current RMM device memory resources.

### Stream-ordered Memory Allocation
## Stream-ordered Memory Allocation

`rmm::mr::device_memory_resource` is a base class that provides stream-ordered memory allocation.
This allows optimizations such as re-using memory deallocated on the same stream without the
Expand Down Expand Up @@ -270,39 +239,39 @@ For further information about stream-ordered memory allocation semantics, read
Allocator](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/)
on the NVIDIA Developer Blog.

### Available Resources
## Available Device Resources

RMM provides several `device_memory_resource` derived classes to satisfy various user requirements.
For more detailed information about these resources, see their respective documentation.

#### `cuda_memory_resource`
### `cuda_memory_resource`

Allocates and frees device memory using `cudaMalloc` and `cudaFree`.

#### `managed_memory_resource`
### `managed_memory_resource`

Allocates and frees device memory using `cudaMallocManaged` and `cudaFree`.

Note that `managed_memory_resource` cannot be used with NVIDIA Virtual GPU Software (vGPU, for use
with virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by
NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu).

#### `pool_memory_resource`
### `pool_memory_resource`

A coalescing, best-fit pool sub-allocator.

#### `fixed_size_memory_resource`
### `fixed_size_memory_resource`

A memory resource that can only allocate a single fixed size. Average allocation and deallocation
cost is constant.

#### `binning_memory_resource`
### `binning_memory_resource`

Configurable to use multiple upstream memory resources for allocations that fall within different
bin sizes. Often configured with multiple bins backed by `fixed_size_memory_resource`s and a single
`pool_memory_resource` for allocations larger than the largest bin size.

### Default Resources and Per-device Resources
## Default Resources and Per-device Resources

RMM users commonly need to configure a `device_memory_resource` object to use for all allocations
where another resource has not explicitly been provided. A common example is configuring a
Expand All @@ -327,7 +296,7 @@ Accessing and modifying the default resource is done through two functions:
`get_current_device_resource()`
- For more explicit control, you can use `set_per_device_resource()`, which takes a device ID.

#### Example
### Example

```c++
rmm::mr::cuda_memory_resource cuda_mr;
Expand All @@ -339,7 +308,7 @@ rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device re
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`
```
#### Multiple Devices
### Multiple Devices
A `device_memory_resource` should only be used when the active CUDA device is the same device
that was active when the `device_memory_resource` was created. Otherwise behavior is undefined.
Expand Down Expand Up @@ -367,36 +336,116 @@ for(int i = 0; i < N; ++i) {

Note that the CUDA device that is current when creating a `device_memory_resource` must also be
current any time that `device_memory_resource` is used to deallocate memory, including in a
destructor. This affects RAII classes like `rmm::device_buffer` and `rmm::device_uvector`. Here's an
(incorrect) example that assumes the above example loop has been run to create a
`pool_memory_resource` for each device. A correct example adds a call to `cudaSetDevice(0)` on the
line of the error comment.
destructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store
(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the
constructor is called, and then ensuring that the stored device is active whenever an allocation or
deallocation is performed (including in the destructor). The user must therefore only ensure that
the device active during _creation_ of an `rmm::device_buffer` matches the active device of the
memory resource being used.

Here is an incorrect example that creates a memory resource on device zero and then uses it to
allocate a `device_buffer` on device one:

```c++
{
RMM_CUDA_TRY(cudaSetDevice(0));
rmm::device_buffer buf_a(16);

auto mr = rmm::mr::cuda_memory_resource{};
{
RMM_CUDA_TRY(cudaSetDevice(1));
rmm::device_buffer buf_b(16);
// Invalid, current device is 1, but MR is only valid for device 0
rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
}
}
```

// Error: when buf_a is destroyed, the current device must be 0, but it is 1
A correct example creates the device buffer with device zero active. After that it is safe to switch
devices and let the buffer go out of scope and destruct with a different device active. For example,
this code is correct:

```c++
{
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
RMM_CUDA_TRY(cudaSetDevice(1));
...
// No need to switch back to device 0 before ~buf runs
}
```

### Allocators
#### Use of `rmm::device_vector` with multiple devices

`rmm:device_vector` uses an `rmm::mr::thrust_allocator` to enable `thrust::device_vector` to
allocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory
resource apply: the active device must match the active device at resource construction time. To
facilitate use in an RAII setting, `rmm::mr::thrust_allocator` records the active device at
construction time and ensures that device is active whenever it allocates or deallocates memory.
Usage of `rmm::device_vector` with multiple devices is therefore the same as `rmm::device_buffer`.
One must _create_ `device_vector`s with the correct device active, but it is safe to destroy them
with a different active device.

For example, recapitulating the previous example using `rmm::device_vector`:

```c++
{
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
RMM_CUDA_TRY(cudaSetDevice(1));
...
// No need to switch back to device 0 before ~vec runs
}
```

> [!NOTE]
> Although allocation and deallocation in the `thrust_allocator` run with the correct active device,
> modification of `rmm::device_vector` might necessitate a kernel launch, and this must run with the
> correct device active. For example, `.resize()` might both allocate _and_ launch a kernel to
> initialize new elements: the user must arrange for this kernel launch to occur with the correct
> device for the memory resource active.
## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream`
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.

## `cuda_stream_pool`

`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
`rmm::cuda_stream_view` that represent identical CUDA streams.

## Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
either of these functions from other threads. They are _not_ thread safe with respect to
construction and destruction of the memory resource object.

Note that a class `thread_safe_resource_adapter` is provided which can be used to adapt a memory
resource that is not thread safe to be thread safe (as described above). This adapter is not needed
with any current RMM device memory resources.

## Allocators

C++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object.
RMM provides several `Allocator` and `Allocator`-like classes.

#### `polymorphic_allocator`
### `polymorphic_allocator`

A [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator).
Unlike the standard C++ `Allocator` interface, the `allocate` and `deallocate` functions take a `cuda_stream_view` indicating the stream on which the (de)allocation occurs.

#### `stream_allocator_adaptor`
### `stream_allocator_adaptor`

`stream_allocator_adaptor` can be used to adapt a stream-ordered allocator to present a standard `Allocator` interface to consumers that may not be designed to work with a stream-ordered interface.

Expand All @@ -415,7 +464,7 @@ auto p = adapted.allocate(100);
adapted.deallocate(p,100);
```

#### `thrust_allocator`
### `thrust_allocator`

`thrust_allocator` is a device memory allocator that uses the strongly typed `thrust::device_ptr`, making it usable with containers like `thrust::device_vector`.

Expand Down Expand Up @@ -497,13 +546,13 @@ Similar to `device_memory_resource`, it has two key functions for (de)allocation
Unlike `device_memory_resource`, the `host_memory_resource` interface and behavior is identical to
`std::pmr::memory_resource`.
### Available Resources
## Available Host Resources
#### `new_delete_resource`
### `new_delete_resource`
Uses the global `operator new` and `operator delete` to allocate host memory.
#### `pinned_memory_resource`
### `pinned_memory_resource`
Allocates "pinned" host memory using `cuda(Malloc/Free)Host`.
Expand Down Expand Up @@ -611,7 +660,7 @@ resources are detectable with Compute Sanitizer Memcheck.
It may be possible in the future to add support for memory bounds checking with other memory
resources using NVTX APIs.
## Using RMM in Python Code
# Using RMM in Python
There are two ways to use RMM in Python code:
Expand All @@ -622,7 +671,7 @@ There are two ways to use RMM in Python code:
RMM provides a `MemoryResource` abstraction to control _how_ device
memory is allocated in both the above uses.
### DeviceBuffers
## DeviceBuffer
A DeviceBuffer represents an **untyped, uninitialized device memory
allocation**. DeviceBuffers can be created by providing the
Expand Down Expand Up @@ -662,7 +711,7 @@ host:
array([1., 2., 3.])
```

### MemoryResource objects
## MemoryResource objects

`MemoryResource` objects are used to configure how device memory allocations are made by
RMM.
Expand Down
Loading

0 comments on commit 9dfd907

Please sign in to comment.