Refactor RCCL install guide into several pages (#1427)
* Refactor RCCL install guide into several pages

* Changes from code review and new docker guide

* Add missing entries to ToC

* Minor fixes

* Fix help strings

* Edits after review and remove extra white space
amd-jnovotny authored Nov 27, 2024
1 parent e42f10a commit bf7c130
Showing 8 changed files with 309 additions and 162 deletions.
41 changes: 0 additions & 41 deletions README.md
@@ -127,11 +127,6 @@ $ mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG
For more information on rccl-tests options, refer to the [Usage](https://github.com/ROCm/rccl-tests#usage) section of rccl-tests.
## Enabling peer-to-peer transport
To enable peer-to-peer access on machines with PCIe-connected GPUs, you must set the HSA environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1`, in addition to using GPUs that support peer-to-peer access and proper large BAR addressing support.
## Tests
RCCL includes unit tests implemented with the Googletest framework. The unit tests require Googletest 1.10 or higher to build and execute properly (installed with the -d option to install.sh).
@@ -152,31 +147,6 @@ will run only AllReduce correctness tests with float16 datatype. A list of avail
There are also other performance and error-checking tests for RCCL. These are maintained separately at https://github.com/ROCm/rccl-tests.
See the rccl-tests README for more information on how to build and run those tests.
## NPKit
RCCL integrates [NPKit](https://github.com/microsoft/npkit), a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
Please check [NPKit sample workflow for RCCL](https://github.com/microsoft/NPKit/tree/main/rccl_samples) as a fully automated usage example. It also provides good templates for the following manual instructions.
To manually build RCCL with NPKit enabled, pass `-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"` with cmake command. All NPKit compile-time switches are declared in the RCCL code base as macros with prefix `ENABLE_NPKIT_`, and they control which information will be collected. Also note that currently NPKit only supports collecting non-overlapped events on GPU, and `-DNPKIT_FLAGS` should follow this rule.
To manually run RCCL with NPKit enabled, environment variable `NPKIT_DUMP_DIR` needs to be set as the NPKit event dump directory. Also note that currently NPKit only supports 1 GPU per process.
To manually analyze NPKit dump results, please leverage [npkit_trace_generator.py](https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py).
## MSCCL/MSCCL++
RCCL integrates [MSCCL](https://github.com/Azure/msccl) and [MSCCL++](https://github.com/microsoft/mscclpp) to leverage the highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us in this project.
MSCCL uses XMLs for different collective algorithms on different architectures. RCCL collectives can leverage those algorithms once the corresponding XML has been provided by the user. The XML files contain the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, the users may have to enable this by setting `RCCL_MSCCL_FORCE_ENABLE=1`. By default, MSCCL will only be used if every rank belongs to a unique process; to disable this restriction for multi-threaded or single-threaded configurations, set `RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1`.
On the other hand, RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. Users need to set the RCCL environment variable `RCCL_MSCCLPP_ENABLE=1` to run RCCL workload with MSCCL++ support. It is also possible to set the message size threshold for using MSCCL++ by using the environment variable `RCCL_MSCCLPP_THRESHOLD`. Once `RCCL_MSCCLPP_THRESHOLD` (the default value is 1MB) is set, RCCL will invoke MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
If some restrictions are not met, it will fall back to MSCCL or RCCL. The following are restrictions on using MSCCL++:
- Message size must be a non-zero multiple of 32 bytes
- Does not support `hipMallocManaged` buffers
- Allreduce only supports `float16`, `int32`, `uint32`, `float32`, and `bfloat16` data types
- Allreduce only supports the `sum` op
## Library and API Documentation
Please refer to the [RCCL Documentation Site](https://rocm.docs.amd.com/projects/rccl/en/latest/) for current documentation.
@@ -191,17 +161,6 @@ pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
### Improving performance on MI300 when using less than 8 GPUs
On a system with 8\*MI300X GPUs, each pair of GPUs are connected with dedicated XGMI links in a fully-connected topology. So, for collective operations, one can achieve good performance when all 8 GPUs (and all XGMI links) are used. When using less than 8 GPUs, one can only achieve a fraction of the potential bandwidth on the system.
But, if your workload warrants using less than 8 MI300 GPUs on a system, you can set the run-time variable `NCCL_MIN_NCHANNELS` to increase the number of channels.\
E.g.: `export NCCL_MIN_NCHANNELS=32`
Increasing the number of channels can be beneficial to performance, but it also increases GPU utilization for collective operations.
Additionally, we have pre-defined higher number of channels when using only 2 GPUs or 4 GPUs on a 8\*MI300 system. Here, RCCL will use **32 channels** for the 2 MI300 GPUs scenario and **24 channels** for the 4 MI300 GPUs scenario.
## Copyright
All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
103 changes: 103 additions & 0 deletions docs/how-to/rccl-usage-tips.rst
@@ -0,0 +1,103 @@
.. meta::
:description: Usage tips for the RCCL library of collective communication primitives
:keywords: RCCL, ROCm, library, API, peer-to-peer, transport

.. _rccl-usage-tips:


*****************************************
RCCL usage tips
*****************************************

This topic describes some of the more common RCCL extensions, such as NPKit and MSCCL, and provides tips on how to
configure and customize the application.

NPKit
=====

RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that
enables the collection of fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
See the `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ for
a fully-automated usage example. It also provides useful templates for the following manual instructions.

To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` to the ``cmake`` command.
All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ``ENABLE_NPKIT_``.
These switches control the information that is collected.

.. note::

   NPKit only supports the collection of non-overlapped events on the GPU.
   The ``-DNPKIT_FLAGS`` settings must follow this rule.

To manually run RCCL with NPKit enabled, set the environment variable ``NPKIT_DUMP_DIR``
to the NPKit event dump directory. NPKit only supports one GPU per process.
To manually analyze the NPKit dump results, use `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
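
Taken together, the manual steps above can be sketched as a shell session. This is a hedged sketch, not a verified recipe: the elided ``ENABLE_NPKIT_...`` switch, the application name, and the trace-generator invocation are illustrative placeholders.

.. code-block:: shell

   # Sketch of the manual NPKit workflow; switch and flag usage below are
   # placeholders, not verified values.
   cmake -DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_..." ..   # pick real switches from the RCCL code base
   make -j

   export NPKIT_DUMP_DIR=/tmp/npkit-dump    # any writable directory
   mkdir -p "$NPKIT_DUMP_DIR"
   ./my_rccl_app                            # placeholder app; one GPU per process

   python3 npkit_trace_generator.py "$NPKIT_DUMP_DIR"   # see the script for exact usage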

MSCCL/MSCCL++
=============

RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to
leverage these highly efficient GPU-GPU communication primitives for collective operations.
Microsoft Corporation collaborated with AMD for this project.

MSCCL uses XMLs for different collective algorithms on different architectures.
RCCL collectives can leverage these algorithms after the user provides the corresponding XML.
The XML files contain sequences of send-recv and reduction operations for the kernel to run.

MSCCL is enabled by default on the AMD Instinct™ MI300X accelerator. On other platforms, users might have to enable it
using the setting ``RCCL_MSCCL_FORCE_ENABLE=1``. By default, MSCCL is only used if every rank belongs
to a unique process. To disable this restriction for multi-threaded or single-threaded configurations,
use the setting ``RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1``.
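
As a quick reference, the settings above might be combined in a launch script as follows. This is a sketch; the launcher invocation and application name are placeholders.

.. code-block:: shell

   # Force-enable MSCCL on platforms where it is not on by default:
   export RCCL_MSCCL_FORCE_ENABLE=1
   # Allow MSCCL when ranks do not each belong to a unique process:
   export RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1
   mpirun -np 8 ./my_rccl_app   # placeholder application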

RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels
for certain message sizes. MSCCL++ support is available whenever MSCCL support is available.
To run an RCCL workload with MSCCL++ support, set the following RCCL environment variable:

.. code-block:: shell

   RCCL_MSCCLPP_ENABLE=1

To set the message size threshold for using MSCCL++, use the environment variable ``RCCL_MSCCLPP_THRESHOLD``,
which has a default value of 1MB. After ``RCCL_MSCCLPP_THRESHOLD`` has been set,
RCCL invokes MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
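
For example, to raise the threshold to 4 MiB, the value can be computed in bytes explicitly. This is a sketch that assumes the variable is interpreted in bytes.

.. code-block:: shell

   export RCCL_MSCCLPP_ENABLE=1
   # 4 * 1024 * 1024 = 4194304 bytes (4 MiB); computed explicitly for clarity
   export RCCL_MSCCLPP_THRESHOLD=$((4 * 1024 * 1024))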

The following restrictions apply when using MSCCL++. If these restrictions are not met,
operations fall back to using MSCCL or RCCL.

* The message size must be a non-zero multiple of 32 bytes
* It does not support ``hipMallocManaged`` buffers
* Allreduce only supports the ``float16``, ``int32``, ``uint32``, ``float32``, and ``bfloat16`` data types
* Allreduce only supports the sum operation

Enabling peer-to-peer transport
===============================

To enable peer-to-peer access on machines with PCIe-connected GPUs,
set the HSA environment variable as follows:

.. code-block:: shell

   HSA_FORCE_FINE_GRAIN_PCIE=1

This feature requires GPUs that support peer-to-peer access along with
proper large BAR addressing support.

Improving performance on the MI300X accelerator when using fewer than 8 GPUs
============================================================================

On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
in a fully-connected topology. Collective operations can therefore achieve good performance when
all 8 accelerators (and all XGMI links) are used. When fewer than 8 accelerators are used,
only a fraction of the potential bandwidth on the system is available.
If your workload warrants using fewer than 8 MI300X accelerators on a system,
you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:

.. code-block:: shell

   export NCCL_MIN_NCHANNELS=32

Increasing the number of channels can benefit performance, but it also increases
GPU utilization for collective operations.
Additionally, RCCL pre-defines a higher number of channels when only 2 or
4 accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32 channels
with two MI300X accelerators and 24 channels with four MI300X accelerators.
16 changes: 11 additions & 5 deletions docs/index.rst
@@ -20,14 +20,20 @@ The RCCL public repository is located at `<https://github.com/ROCm/rccl>`_.

.. grid-item-card:: Install

* :ref:`RCCL installation guide <install>`

.. grid:: 2
:gutter: 3
* :doc:`Installing RCCL using the install script <./install/installation>`
* :doc:`Running RCCL using Docker <./install/docker-install>`
* :doc:`Building and installing RCCL from source code <./install/building-installing>`

.. grid-item-card:: How to

* :ref:`using-nccl`
* :doc:`Using the NCCL Net plugin <./how-to/using-nccl>`
* :doc:`RCCL usage tips <./how-to/rccl-usage-tips>`


.. grid-item-card:: Examples

* `RCCL Tuner plugin examples <https://github.com/ROCm/rccl/tree/develop/ext-tuner/example>`_
* `NCCL Net plugin examples <https://github.com/ROCm/rccl/tree/develop/ext-net/example>`_

.. grid-item-card:: API reference

102 changes: 102 additions & 0 deletions docs/install/building-installing.rst
@@ -0,0 +1,102 @@
.. meta::
:description: Information on how to build the RCCL library from source code
:keywords: RCCL, ROCm, library, API, build, install

.. _building-from-source:

*********************************************
Building and installing RCCL from source code
*********************************************

To build RCCL directly from the source code, follow these steps. This guide also includes
instructions explaining how to test the build.
For information on using the quick start install script to build RCCL, see :doc:`installation`.

Requirements
============

The following prerequisites are required to build RCCL:

1. ROCm-supported GPUs
2. The ROCm stack installed on the system, including the :doc:`HIP runtime <hip:index>` and the HIP-Clang compiler.

Building the library using CMake
--------------------------------

To build the library from source, follow these steps:

.. code-block:: shell

   git clone --recursive https://github.com/ROCm/rccl.git
   cd rccl
   mkdir build
   cd build
   cmake ..
   make -j 16 # Or some other suitable number of parallel jobs

If you have already cloned the repository, you can check out the external submodules manually.

.. code-block:: shell

   git submodule update --init --recursive --depth=1

You can substitute a different installation path by providing the path as a parameter
to ``CMAKE_INSTALL_PREFIX``, for example:

.. code-block:: shell

   cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..

.. note::

   Ensure ROCm CMake is installed using the command ``apt install rocm-cmake``.


Building and installing the RCCL package
----------------------------------------

After you have cloned the repository and built the library as described in the previous section,
use this command to build the package:

.. code-block:: shell

   cd rccl/build
   make package
   sudo dpkg -i *.deb

.. note::

   The RCCL package install process requires ``sudo`` or root access because it creates a directory
   named ``rccl`` in ``/opt/rocm/``. This is an optional step. RCCL can be used directly by including the path containing ``librccl.so``.
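
If you skip the package install, the following is a minimal sketch of pointing an application at the build tree; the paths and application name are illustrative.

.. code-block:: shell

   # Use the freshly built library without installing the package:
   export LD_LIBRARY_PATH="$HOME/rccl/build:$LD_LIBRARY_PATH"   # directory containing librccl.so
   ./my_rccl_app   # placeholder application linked against librccl.so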

Testing RCCL
============

The RCCL unit tests are implemented using the Googletest framework. These unit tests require Googletest 1.10
or higher to build and run (this dependency can be installed using the ``-d`` option for ``install.sh``).
To run the RCCL unit tests, go to the ``build`` folder and the ``test`` subfolder,
then run the appropriate RCCL unit test executables.

The RCCL unit test names follow this format:

.. code-block:: shell

   CollectiveCall.[Type of test]

You can filter the RCCL unit tests using environment variables
and the ``--gtest_filter`` command-line flag:

.. code-block:: shell

   UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"

This command runs only the ``AllReduce`` correctness tests with the ``bfloat16`` datatype.
A list of the available environment variables for filtering appears at the top of every run.
See the `Googletest documentation <https://google.github.io/googletest/advanced.html#running-a-subset-of-the-tests>`_
for more information on how to form advanced filters.
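
As an illustration of a more advanced filter, patterns before a ``-`` select tests and patterns after it exclude them. The ``OutOfPlace`` name below is hypothetical and used only to show the syntax.

.. code-block:: shell

   # Run all AllReduce tests except those whose name matches *OutOfPlace*
   # ("OutOfPlace" is a hypothetical test name illustrating the exclusion syntax):
   ./rccl-UnitTests --gtest_filter="AllReduce.*:-*OutOfPlace*"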

There are also other performance and error-checking tests for RCCL. They are maintained separately at `<https://github.com/ROCm/rccl-tests>`_.

.. note::

   For more information on how to build and run rccl-tests, see the `rccl-tests README file <https://github.com/ROCm/rccl-tests/blob/develop/README.md>`_.
45 changes: 45 additions & 0 deletions docs/install/docker-install.rst
@@ -0,0 +1,45 @@
.. meta::
:description: Instruction on how to install the RCCL library for collective communication primitives using Docker
:keywords: RCCL, ROCm, library, API, install, Docker

.. _install-docker:

*****************************************
Running RCCL using Docker
*****************************************

To use Docker to run RCCL, Docker must already be installed on the system.
To build the Docker image and run the container, follow these steps.

#. Build the Docker image

By default, the Dockerfile uses ``docker.io/rocm/dev-ubuntu-22.04:latest`` as the base Docker image.
It then installs RCCL and rccl-tests (in both cases, it uses the version from the RCCL ``develop`` branch).

Use this command to build the Docker image:

.. code-block:: shell

   docker build -t rccl-tests -f Dockerfile.ubuntu --pull .

The base Docker image, rccl repository, and rccl-tests repository can be modified
by passing ``--build-arg`` options to the ``docker build`` command above. For example, to use a different base Docker image,
use this command:

.. code-block:: shell

   docker build -t rccl-tests -f Dockerfile.ubuntu --build-arg="ROCM_IMAGE_NAME=rocm/dev-ubuntu-20.04" --build-arg="ROCM_IMAGE_TAG=6.2" --pull .

#. Launch an interactive Docker container on a system with AMD GPUs:

.. code-block:: shell

   docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rccl-tests /bin/bash

To run, for example, the ``all_reduce_perf`` test from rccl-tests on 8 AMD GPUs from inside the Docker container, use this command:

.. code-block:: shell

   mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG=VERSION /workspace/rccl-tests/build/all_reduce_perf -b 1 -e 16G -f 2 -g 1

For more information on the rccl-tests options, see the `Usage guidelines <https://github.com/ROCm/rccl-tests#usage>`_ in the GitHub repository.