From 7e2952c99efba9ac26c6173c2433bcf6b02fae6d Mon Sep 17 00:00:00 2001
From: Jeffrey Novotny <jnovotny@amd.com>
Date: Tue, 10 Dec 2024 15:22:29 -0500
Subject: [PATCH] Cherry-pick recent docs fixes to release-staging/rocm-rel-6.4
 (#1452)

* Refactor RCCL install guide into several pages (#1427)

* Refactor RCCL install guide into several pages

* Changes from code review and new docker guide

* Add missing entries to ToC

* Minor fixes

* Fix help strings

* Edits after review and remove extra white space

(cherry picked from commit bf7c1306313c080fbb82bbe9753faeb4a88a5055)

* Update rccl changelog for 6.3.1 (#1433)

* Update rccl changelog for 6.3.1

* Fix version number

* Correct RCCL release version

* Added details to 6.3.0 changelog

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
(cherry picked from commit e42f10a361227f39aada80fc577c47812114e046)

* Modify cmake instruction in build from source (#1445)

(cherry picked from commit 28594b26b3e62670a159f714b61f651cc5e02434)

* Add RCCL debugging guide (#1420)

* Add RCCL debugging guide

* Changes from external review

* More edits from internal review

* Additional edits

* Minor correction

* More changes after external review

* Integrate index and ToC changes with incoming merge changes

* Integrate feedback from management review

* Minor edits from the internal review

(cherry picked from commit 6d34fb76321600d5693b24f1edc875605c5cc638)
---
 CHANGELOG.md                         |  16 +-
 README.md                            |  43 +----
 docs/how-to/rccl-usage-tips.rst      | 103 +++++++++++
 docs/how-to/troubleshooting-rccl.rst | 249 +++++++++++++++++++++++++++
 docs/index.rst                       |  17 +-
 docs/install/building-installing.rst | 103 +++++++++++
 docs/install/docker-install.rst      |  45 +++++
 docs/install/installation.rst        | 147 ++++------------
 docs/sphinx/_toc.yml.in              |  15 ++
 install.sh                           |   4 +-
 10 files changed, 577 insertions(+), 165 deletions(-)
 create mode 100644 docs/how-to/rccl-usage-tips.rst
 create mode 100644 docs/how-to/troubleshooting-rccl.rst
 create mode 100644 docs/install/building-installing.rst
 create mode 100644 docs/install/docker-install.rst

diff --git a/CHANGELOG.md b/CHANGELOG.md
index af9b7194c..745717bf2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -2,16 +2,28 @@
 
 Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io)
 
+## RCCL 2.21.5 for ROCm 6.3.1
+
+### Added
+
+### Changed
+
+* Enhanced user documentation
+
+### Resolved issues
+
+* Corrected user help strings in `install.sh`
+
 ## RCCL 2.21.5 for ROCm 6.3.0
 
 ### Added
 
-* MSCCL++ integration for specific contexts
+* MSCCL++ integration for AllReduce and AllGather on gfx942
 * Performance collection to rccl_replayer
 * Tuner Plugin example for MI300
 * Tuning table for large number of nodes
 * Support for amdclang++
-* New Rome model
+* Allow NIC ID remapping using `NCCL_RINGS_REMAP` environment variable
 
 ### Changed
 
diff --git a/README.md b/README.md
index b266ab96a..475c1957e 100644
--- a/README.md
+++ b/README.md
@@ -81,7 +81,7 @@ $ git submodule update --init --recursive --depth=1
 ```
 You may substitute an installation path of your own choosing by passing `CMAKE_INSTALL_PREFIX`. For example:
 ```shell
-$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..
+$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install -DCMAKE_BUILD_TYPE=Release ..
 ```
 Note: ensure rocm-cmake is installed, `apt install rocm-cmake`.
 
@@ -127,11 +127,6 @@ $ mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG
 
 For more information on rccl-tests options, refer to the [Usage](https://github.com/ROCm/rccl-tests#usage) section of rccl-tests.
 
-
-## Enabling peer-to-peer transport
-
-In order to enable peer-to-peer access on machines with PCIe-connected GPUs, the HSA environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1` is required to be set, on top of requiring GPUs that support peer-to-peer access and proper large BAR addressing support.
-
 ## Tests
 
 There are rccl unit tests implemented with the Googletest framework in RCCL.  The rccl unit tests require Googletest 1.10 or higher to build and execute properly (installed with the -d option to install.sh).
@@ -152,31 +147,6 @@ will run only AllReduce correctness tests with float16 datatype. A list of avail
 There are also other performance and error-checking tests for RCCL.  These are maintained separately at https://github.com/ROCm/rccl-tests.
 See the rccl-tests README for more information on how to build and run those tests.
 
-## NPKit
-
-RCCL integrates [NPKit](https://github.com/microsoft/npkit), a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
-
-Please check [NPKit sample workflow for RCCL](https://github.com/microsoft/NPKit/tree/main/rccl_samples) as a fully automated usage example. It also provides good templates for the following manual instructions.
-
-To manually build RCCL with NPKit enabled, pass `-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"` with cmake command. All NPKit compile-time switches are declared in the RCCL code base as macros with prefix `ENABLE_NPKIT_`, and they control which information will be collected. Also note that currently NPKit only supports collecting non-overlapped events on GPU, and `-DNPKIT_FLAGS` should follow this rule.
-
-To manually run RCCL with NPKit enabled, environment variable `NPKIT_DUMP_DIR` needs to be set as the NPKit event dump directory. Also note that currently NPKit only supports 1 GPU per process.
-
-To manually analyze NPKit dump results, please leverage [npkit_trace_generator.py](https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py).
-
-## MSCCL/MSCCL++
-RCCL integrates [MSCCL](https://github.com/Azure/msccl) and [MSCCL++](https://github.com/microsoft/mscclpp) to leverage the highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us in this project.
-
-MSCCL uses XMLs for different collective algorithms on different architectures. RCCL collectives can leverage those algorithms once the corresponding XML has been provided by the user. The XML files contain the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, the users may have to enable this by setting `RCCL_MSCCL_FORCE_ENABLE=1`. By default, MSCCL will only be used if every rank belongs to a unique process; to disable this restriction for multi-threaded or single-threaded configurations, set `RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1`.
-
-On the other hand, RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. Users need to set the RCCL environment variable `RCCL_MSCCLPP_ENABLE=1` to run RCCL workload with MSCCL++ support. It is also possible to set the message size threshold for using MSCCL++ by using the environment variable `RCCL_MSCCLPP_THRESHOLD`. Once `RCCL_MSCCLPP_THRESHOLD` (the default value is 1MB) is set, RCCL will invoke MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
-
-If some restrictions are not met, it will fall back to MSCCL or RCCL. The following are restrictions on using MSCCL++:
-- Message size must be a non-zero multiple of 32 bytes
-- Does not support `hipMallocManaged` buffers
-- Allreduce only supports `float16`, `int32`, `uint32`, `float32`, and `bfloat16` data types
-- Allreduce only supports the `sum` op
-
 ## Library and API Documentation
 
 Please refer to the [RCCL Documentation Site](https://rocm.docs.amd.com/projects/rccl/en/latest/) for current documentation.
@@ -191,17 +161,6 @@ pip3 install -r sphinx/requirements.txt
 python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
 ```
 
-### Improving performance on MI300 when using less than 8 GPUs
-
-On a system with 8\*MI300X GPUs, each pair of GPUs are connected with dedicated XGMI links in a fully-connected topology. So, for collective operations, one can achieve good performance when all 8 GPUs (and all XGMI links) are used. When using less than 8 GPUs, one can only achieve a fraction of the potential bandwidth on the system.
-
-But, if your workload warrants using less than 8 MI300 GPUs on a system, you can set the run-time variable `NCCL_MIN_NCHANNELS` to increase the number of channels.\
-E.g.: `export NCCL_MIN_NCHANNELS=32`
-
-Increasing the number of channels can be beneficial to performance, but it also increases GPU utilization for collective operations.
-
-Additionally, we have pre-defined higher number of channels when using only 2 GPUs or 4 GPUs on a 8\*MI300 system. Here, RCCL will use **32 channels** for the 2 MI300 GPUs scenario and **24 channels** for the 4 MI300 GPUs scenario.
-
 ## Copyright
 
 All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
diff --git a/docs/how-to/rccl-usage-tips.rst b/docs/how-to/rccl-usage-tips.rst
new file mode 100644
index 000000000..59abd32c6
--- /dev/null
+++ b/docs/how-to/rccl-usage-tips.rst
@@ -0,0 +1,103 @@
+.. meta::
+   :description: Usage tips for the RCCL library of collective communication primitives
+   :keywords: RCCL, ROCm, library, API, peer-to-peer, transport
+
+.. _rccl-usage-tips:
+
+
+*****************************************
+RCCL usage tips
+*****************************************
+
+This topic describes some of the more common RCCL extensions, such as NPKit and MSCCL, and provides tips on how to
+configure and customize the application.
+
+NPKit
+=====
+
+RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that
+enables the collection of fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
+See the `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ for
+a fully-automated usage example. It also provides useful templates for the following manual instructions.
+
+To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` to the ``cmake`` command. 
+All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ``ENABLE_NPKIT_``.
+These switches control the information that is collected.
+
+.. note::
+   
+   NPKit only supports the collection of non-overlapped events on the GPU.
+   The ``-DNPKIT_FLAGS`` settings must follow this rule.
+
+To manually run RCCL with NPKit enabled, set the environment variable ``NPKIT_DUMP_DIR``
+to the NPKit event dump directory. NPKit only supports one GPU per process.
+To manually analyze the NPKit dump results, use `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
+
+MSCCL/MSCCL++
+=============
+
+RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to
+leverage these highly efficient GPU-GPU communication primitives for collective operations.
+Microsoft Corporation collaborated with AMD for this project.
+
+MSCCL uses XMLs for different collective algorithms on different architectures. 
+RCCL collectives can leverage these algorithms after the user provides the corresponding XML.
+The XML files contain sequences of send-recv and reduction operations for the kernel to run.
+
+MSCCL is enabled by default on the AMD Instinct™ MI300X accelerator. On other platforms, users might have to enable it
+using the setting ``RCCL_MSCCL_FORCE_ENABLE=1``. By default, MSCCL is only used if every rank belongs
+to a unique process. To disable this restriction for multi-threaded or single-threaded configurations,
+use the setting ``RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1``.
+
+RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels
+for certain message sizes. MSCCL++ support is available whenever MSCCL support is available.
+To run a RCCL workload with MSCCL++ support, set the following RCCL environment variable:
+
+.. code-block:: shell
+
+   RCCL_MSCCLPP_ENABLE=1
+
+To set the message size threshold for using MSCCL++, use the environment variable ``RCCL_MSCCLPP_THRESHOLD``,
+which has a default value of 1MB. After ``RCCL_MSCCLPP_THRESHOLD`` has been set,
+RCCL invokes MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
+
+The following restrictions apply when using MSCCL++. If these restrictions are not met,
+operations fall back to using MSCCL or RCCL.
+
+*  The message size must be a non-zero multiple of 32 bytes
+*  It does not support ``hipMallocManaged`` buffers
+*  Allreduce only supports the ``float16``, ``int32``, ``uint32``, ``float32``, and ``bfloat16`` data types
+*  Allreduce only supports the sum operation
+
+Enabling peer-to-peer transport
+===============================
+
+To enable peer-to-peer access on machines with PCIe-connected GPUs,
+set the HSA environment variable as follows:
+
+.. code-block:: shell
+
+   HSA_FORCE_FINE_GRAIN_PCIE=1
+
+This feature requires GPUs that support peer-to-peer access along with
+proper large BAR addressing support.
+
+Improving performance on the MI300X accelerator when using fewer than 8 GPUs
+============================================================================
+
+On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
+in a fully-connected topology. Collective operations can achieve good performance when
+all 8 accelerators (and all XGMI links) are used. When fewer than 8 accelerators are used, they can only achieve a fraction
+of the potential bandwidth on the system.
+However, if your workload warrants using fewer than 8 MI300X accelerators on a system,
+you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
+
+.. code-block:: shell
+
+   export NCCL_MIN_NCHANNELS=32
+
+Increasing the number of channels can benefit performance, but it also increases
+GPU utilization for collective operations.
+Additionally, RCCL pre-defines a higher number of channels when only 2 or
+4 accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
+and 24 channels with four MI300X accelerators.
\ No newline at end of file
diff --git a/docs/how-to/troubleshooting-rccl.rst b/docs/how-to/troubleshooting-rccl.rst
new file mode 100644
index 000000000..c4168da25
--- /dev/null
+++ b/docs/how-to/troubleshooting-rccl.rst
@@ -0,0 +1,249 @@
+.. meta::
+   :description: A guide to troubleshooting the RCCL library of multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
+   :keywords: RCCL, ROCm, library, API, debug
+
+.. _troubleshooting-rccl:
+
+*********************
+Troubleshooting RCCL
+*********************
+
+This topic explains the steps to troubleshoot functional and performance issues with RCCL.
+While debugging, collect the output from the commands in this guide. This data
+can be used as supporting information when submitting an issue report to AMD.
+
+.. _debugging-system-info:
+
+Collecting system information
+=============================
+
+Collect this information about the ROCm version, GPU/accelerator, platform, and configuration.
+
+*  Verify the ROCm version. This might be a release version or a
+   mainline or staging version. Use this command to display the version:
+
+   .. code:: shell
+
+      cat /opt/rocm/.info/version
+
+   Run the following command and collect the output:
+
+   .. code:: shell
+
+      rocm_agent_enumerator
+
+   Also, collect the name of the GPU or accelerator:
+
+   .. code:: shell
+
+      rocminfo
+
+*  Run these ``rocm-smi`` commands to display the system topology.
+
+   .. code:: shell
+
+      rocm-smi
+      rocm-smi --showtopo
+      rocm-smi --showdriverversion
+
+*  Determine the values of the ``PATH`` and ``LD_LIBRARY_PATH`` environment variables.
+
+   .. code:: shell
+
+      echo $PATH
+      echo $LD_LIBRARY_PATH
+
+*  Collect the HIP configuration.
+
+   .. code:: shell
+
+      /opt/rocm/bin/hipconfig --full
+
+*  Verify the network settings and setup. Use the ``ibv_devinfo`` command 
+   to display information about the available RDMA devices and determine 
+   whether they are installed and functioning properly. Run ``rdma link``
+   to print a summary of the network links.
+
+   .. code:: shell
+
+      ibv_devinfo
+      rdma link
+
+Isolating the issue
+-------------------
+
+The problem might be a general issue or specific to the architecture or system.
+To narrow down the issue, collect information about the GPU or accelerator and other
+details about the platform and system. Some issues to consider include:
+
+*  Is ROCm running on:
+
+   *  A bare-metal setup
+   *  In a Docker container (determine the name of the Docker image)
+   *  In an SR-IOV virtualized environment
+   *  Some combination of these configurations
+
+*  Is the problem only seen on a specific GPU architecture?
+*  Is it only seen on a specific system type?
+*  Is it happening on a single node or multinode setup?
+*  Use the following troubleshooting techniques to attempt to isolate the issue.
+
+   *  Build or run the develop branch version of RCCL and see if the problem persists.
+   *  Try an earlier RCCL version (minor or major).
+   *  If you recently changed the ROCm runtime configuration, KFD/driver, or compiler,
+      rerun the test with the previous configuration.
+
+.. _collecting-rccl-info:
+
+Collecting RCCL information
+=============================
+
+Collect the following information about the RCCL installation and configuration.
+
+*  Run the ``ldd`` command to list any dynamic dependencies for RCCL.
+
+   .. code:: shell
+
+      ldd <specify-path-to-librccl.so>
+
+*  Determine the RCCL version. This might be the pre-packaged component in
+   ``/opt/rocm/lib`` or a version that was built from source. To verify the RCCL version,
+   enter the following command, then run either rccl-tests or an e2e application.
+
+   .. code:: shell
+
+      export NCCL_DEBUG=VERSION
+
+*  Run rccl-tests and collect the results. For information on how to build and run rccl-tests, see the
+   `rccl-tests GitHub <https://github.com/ROCm/rccl-tests/blob/develop/README.md>`_.
+
+*  Collect the RCCL logging information. Enable the debug logs, 
+   then run rccl-tests or any e2e workload to collect the logs. Use the 
+   following command to enable the logs.
+
+   .. code:: shell
+
+      export NCCL_DEBUG=INFO
+
+.. _use-rccl-replayer:
+
+Using the RCCL Replayer
+------------------------
+
+The RCCL Replayer is a debugging tool designed to analyze and replay the collective logs obtained from RCCL runs. 
+It can be helpful when trying to reproduce problems, because it uses dummy data and doesn't have any dependencies 
+on non-RCCL calls. For more information, 
+see `RCCL Replayer GitHub documentation <https://github.com/ROCm/rccl/tree/develop/tools/rccl_replayer>`_.
+
+You must build the RCCL Replayer before you can use it. To build it, run these commands. Ensure ``MPI_DIR`` is set to 
+the path where MPI is installed.
+
+.. code:: shell
+
+   cd rccl/tools/rccl_replayer
+   MPI_DIR=/path/to/mpi make
+
+To use the RCCL Replayer, follow these steps: 
+
+#. Collect the per-rank logs from the RCCL run by adding the following environment variables.
+   This prevents any race conditions that might cause ranks to interrupt the output from other ranks.
+
+   .. code:: shell
+
+      NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG_FILE=some_name_here.%h.%p.log
+
+#. Combine all the logs into a single file. This will become the input to the RCCL Replayer.
+
+   .. code:: shell
+
+      cat some_name_here.*.log > some_name_here.log
+
+#. Run the RCCL Replayer using the following command. Replace ``<numProcesses>`` with the number of MPI processes to 
+   run, ``</path/to/logfile>`` with the path to the collective log file generated during 
+   the RCCL runs, and ``<numGpusPerMpiRank>`` with the number of GPUs per MPI rank used in the application.
+
+   .. code:: shell
+
+      mpirun -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
+
+   In a multi-node application environment, you can replay the collective logs on multiple nodes
+   using the following command:
+
+   .. code:: shell
+
+      mpirun --hostfile <path/to/hostfile.txt> -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
+
+   .. note::
+
+      Depending on the MPI library you're using, you might need to modify the ``mpirun`` command.
+
+.. _analyze-performance-info:
+
+Analyzing performance issues
+=============================
+
+If the problem involves the performance of an e2e workload, try the following 
+microbenchmarks and collect the results. Follow the instructions in the subsequent sections
+to run these benchmarks and provide the results to the support team.
+
+*  TransferBench
+*  RCCL Unit Tests
+*  rccl-tests
+  
+Collect the TransferBench data
+---------------------------------
+
+TransferBench allows you to benchmark simultaneous copies between
+user-specified devices. For more information, 
+see the :doc:`TransferBench documentation <transferbench:index>`.
+
+To collect the TransferBench data, follow these steps:
+
+#. Clone the TransferBench Git repository.
+
+   .. code:: shell
+
+      git clone https://github.com/ROCm/TransferBench.git 
+
+#. Change to the new directory and build the component.
+
+   .. code:: shell
+
+      cd TransferBench
+      make
+
+#. Run the TransferBench utility with the following parameters and save the results.
+
+   .. code:: shell
+
+      USE_FINE_GRAIN=1 GFX_UNROLL=2 ./TransferBench a2a 64M 8
+
+Collect the RCCL microbenchmark data
+-------------------------------------
+
+To use the RCCL tests to collect the RCCL benchmark data, follow these steps:
+
+#. Disable NUMA auto-balancing using the following command:
+
+   .. code:: shell
+
+      sudo sysctl kernel.numa_balancing=0
+
+   Run the following command to verify the setting. The expected output is ``0``.
+
+   .. code:: shell
+
+      cat /proc/sys/kernel/numa_balancing
+
+#. Build MPI, RCCL, and rccl-tests. To download and install MPI, see either 
+   `OpenMPI <https://www.open-mpi.org/software/ompi/v5.0/>`_ or `MPICH <https://www.mpich.org/>`_.
+   To learn how to build and run rccl-tests, see the `rccl-tests GitHub <https://github.com/ROCm/rccl-tests/blob/develop/README.md>`_.
+
+#. Run rccl-tests with MPI and collect the performance numbers.
+
+RCCL and NCCL comparisons
+=============================
+
+If you are also using NVIDIA hardware or NCCL and notice a performance gap between the two systems,
+collect the system and performance data on the NVIDIA platform. 
+Provide both sets of data to the support team.
diff --git a/docs/index.rst b/docs/index.rst
index 34125e180..ba3950383 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -20,14 +20,21 @@ The RCCL public repository is located at `<https://github.com/ROCm/rccl>`_.
 
   .. grid-item-card:: Install
 
-    * :ref:`RCCL installation guide <install>`
-
-.. grid:: 2
-  :gutter: 3
+    * :doc:`Installing RCCL using the install script <./install/installation>`
+    * :doc:`Running RCCL using Docker <./install/docker-install>`
+    * :doc:`Building and installing RCCL from source code <./install/building-installing>`
 
   .. grid-item-card:: How to
 
-    * :ref:`using-nccl`
+    * :doc:`Using the NCCL Net plugin <./how-to/using-nccl>`
+    * :doc:`Troubleshoot RCCL <./how-to/troubleshooting-rccl>`
+    * :doc:`RCCL usage tips <./how-to/rccl-usage-tips>`
+
+
+  .. grid-item-card:: Examples
+
+    * `RCCL Tuner plugin examples <https://github.com/ROCm/rccl/tree/develop/ext-tuner/example>`_
+    * `NCCL Net plugin examples <https://github.com/ROCm/rccl/tree/develop/ext-net/example>`_
        
   .. grid-item-card:: API reference
 
diff --git a/docs/install/building-installing.rst b/docs/install/building-installing.rst
new file mode 100644
index 000000000..da33e1076
--- /dev/null
+++ b/docs/install/building-installing.rst
@@ -0,0 +1,103 @@
+.. meta::
+   :description: Information on how to build the RCCL library from source code
+   :keywords: RCCL, ROCm, library, API, build, install
+
+.. _building-from-source:
+
+*********************************************
+Building and installing RCCL from source code
+*********************************************
+
+To build RCCL directly from the source code, follow these steps. This guide also includes
+instructions explaining how to test the build.
+For information on using the quick start install script to build RCCL, see :doc:`installation`.
+
+Requirements
+============
+
+The following prerequisites are required to build RCCL:
+
+1. ROCm-supported GPUs
+2. The ROCm stack installed on the system, including the :doc:`HIP runtime <hip:index>` and the HIP-Clang compiler.
+
+Building the library using CMake:
+---------------------------------
+
+To build the library from source, follow these steps:
+
+.. code-block:: shell
+
+    git clone --recursive https://github.com/ROCm/rccl.git
+    cd rccl
+    mkdir build
+    cd build
+    cmake ..
+    make -j 16      # Or some other suitable number of parallel jobs
+
+If you have already cloned the repository, you can check out the external submodules manually.
+
+.. code-block:: shell
+
+    git submodule update --init --recursive --depth=1
+
+You can substitute a different installation path by providing the path as a parameter
+to ``CMAKE_INSTALL_PREFIX``, for example:
+
+.. code-block:: shell
+
+    cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install -DCMAKE_BUILD_TYPE=Release ..
+
+.. note::
+
+    Ensure ROCm CMake is installed using the command ``apt install rocm-cmake``. By default,
+    CMake builds the component in debug mode unless ``-DCMAKE_BUILD_TYPE`` is specified.
+
+
+Building and installing the RCCL package:
+----------------------------------------------
+
+After you have cloned the repository and built the library as described in the previous section,
+use these commands to build and install the package:
+
+.. code-block:: shell
+
+    cd rccl/build
+    make package
+    sudo dpkg -i *.deb
+
+.. note::
+   
+   The RCCL package install process requires ``sudo`` or root access because it creates a directory
+   named ``rccl`` in ``/opt/rocm/``. This is an optional step. RCCL can be used directly by including the path containing ``librccl.so``.
+
+Testing RCCL
+============
+
+The RCCL unit tests are implemented using the Googletest framework in RCCL. These unit tests require Googletest 1.10
+or higher to build and run (this dependency can be installed using the ``-d`` option for ``install.sh``).
+To run the RCCL unit tests, go to the ``build`` folder and the ``test`` subfolder,
+then run the appropriate RCCL unit test executables.
+
+The RCCL unit test names follow this format:
+
+.. code-block:: shell
+
+    CollectiveCall.[Type of test]
+
+Filtering of the RCCL unit tests can be done using environment variables
+and by passing the ``--gtest_filter`` command line flag:
+
+.. code-block:: shell
+
+    UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"
+
+This command runs only the ``AllReduce`` correctness tests with the ``bfloat16`` datatype and the ``prod`` reduction operation.
+A list of the available environment variables for filtering appears at the top of every run.
+See the `Googletest documentation <https://google.github.io/googletest/advanced.html#running-a-subset-of-the-tests>`_
+for more information on how to form advanced filters.
+
+There are also other performance and error-checking tests for RCCL. They are maintained separately at `<https://github.com/ROCm/rccl-tests>`_.
+
+.. note::
+
+    For more information on how to build and run rccl-tests, see the `rccl-tests README file <https://github.com/ROCm/rccl-tests/blob/develop/README.md>`_.
diff --git a/docs/install/docker-install.rst b/docs/install/docker-install.rst
new file mode 100644
index 000000000..3b0c780bb
--- /dev/null
+++ b/docs/install/docker-install.rst
@@ -0,0 +1,45 @@
+.. meta::
+   :description: Instructions on how to install the RCCL library for collective communication primitives using Docker
+   :keywords: RCCL, ROCm, library, API, install, Docker
+
+.. _install-docker:
+
+*****************************************
+Running RCCL using Docker
+*****************************************
+
+To use Docker to run RCCL, Docker must already be installed on the system.
+To build the Docker image and run the container, follow these steps.
+
+#. Build the Docker image
+
+   By default, the Dockerfile uses ``docker.io/rocm/dev-ubuntu-22.04:latest`` as the base Docker image.
+   It then installs RCCL and rccl-tests (in both cases, it uses the version from the RCCL ``develop`` branch).
+
+   Use this command to build the Docker image:
+
+   .. code-block:: shell
+
+      docker build -t rccl-tests -f Dockerfile.ubuntu --pull .
+
+   The base Docker image, rccl repository, and rccl-tests repository can be modified
+   by using ``--build-arg`` in the ``docker build`` command above. For example, to use a different base Docker image,
+   use this command:
+
+   .. code-block:: shell
+
+      docker build -t rccl-tests -f Dockerfile.ubuntu --build-arg="ROCM_IMAGE_NAME=rocm/dev-ubuntu-20.04" --build-arg="ROCM_IMAGE_TAG=6.2" --pull .
+
+#. Launch an interactive Docker container on a system with AMD GPUs:
+
+   .. code-block:: shell
+
+      docker run -it --rm --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --network=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rccl-tests /bin/bash
+
+To run, for example, the ``all_reduce_perf`` test from rccl-tests on 8 AMD GPUs from inside the Docker container, use this command:
+
+.. code-block:: shell
+
+   mpirun --allow-run-as-root -np 8 --mca pml ucx --mca btl ^openib -x NCCL_DEBUG=VERSION /workspace/rccl-tests/build/all_reduce_perf -b 1 -e 16G -f 2 -g 1
+
+For more information on the rccl-tests options, see the `Usage guidelines <https://github.com/ROCm/rccl-tests#usage>`_ in the GitHub repository.
\ No newline at end of file
diff --git a/docs/install/installation.rst b/docs/install/installation.rst
index 8ce01b379..ce613c344 100644
--- a/docs/install/installation.rst
+++ b/docs/install/installation.rst
@@ -1,45 +1,58 @@
 .. meta::
-   :description: RCCL is a stand-alone library that provides multi-GPU and multi-node collective communication primitives optimized for AMD GPUs
-   :keywords: RCCL, ROCm, library, API
+   :description: Instructions on how to install the RCCL library for collective communication primitives using the quick start install script
+   :keywords: RCCL, ROCm, library, API, install
 
 .. _install:
 
-***************
-Installing RCCL
-***************
+*****************************************
+Installing RCCL using the install script
+*****************************************
+
+To quickly install RCCL using the install script, follow these steps.
+For instructions on building RCCL from the source code, see :doc:`building-installing`.
+For additional tips, see :doc:`../how-to/rccl-usage-tips`.
 
 Requirements
 ============
 
-1. ROCm supported GPUs
-2. ROCm stack installed on the system (HIP runtime & HIP-Clang)
+The following prerequisites are required to use RCCL:
+
+1. ROCm-supported GPUs
+2. The ROCm stack must be installed on the system, including the :doc:`HIP runtime <hip:index>` and the HIP-Clang compiler.
+
+Quick start RCCL build
+======================
 
-Quickstart RCCL Build
-=====================
+RCCL directly depends on the HIP runtime plus the HIP-Clang compiler, which are part of the ROCm software stack.
+For ROCm installation instructions, see :doc:`rocm-install-on-linux:install/native-install/index`.
 
-RCCL directly depends on HIP runtime plus the HIP-Clang compiler, which are part of the ROCm software stack.
-For ROCm installation instructions, see https://github.com/ROCm/ROCm.
-The root of this repository has a helper script ``install.sh`` to build and install RCCL with a single command. It hard-codes configurations that can be specified through invoking cmake directly, but it's a great way to get started quickly and can serve as an example of how to build/install RCCL.
+Use the `install.sh helper script <https://github.com/ROCm/rccl/blob/develop/install.sh>`_,
+located in the root directory of the RCCL repository,
+to build and install RCCL with a single command. The script hard-codes configuration options that you could otherwise
+set by invoking CMake directly, but it's a quick way to get started and provides an
+example of how to build and install RCCL.
 
-To build the library using the install script:
+Building the library using the install script
 ----------------------------------------------
 
+To build the library using the install script, use this command:
+
 .. code-block:: shell
 
     ./install.sh
 
-For more info on build options/flags when using the install script, use the following:
+For more information on the build options and flags for the install script, run the following command:
 
 .. code-block:: shell
 
     ./install.sh --help
 
-RCCL build & installation helper script options:
+The RCCL build and installation helper script options are as follows:
 
 .. code-block:: shell
 
        --address-sanitizer     Build with address sanitizer enabled
-    -d|--dependencies          Install RCCL depdencencies
+    -d|--dependencies          Install RCCL dependencies
        --debug                 Build debug library
        --enable_backtrace      Build with custom backtrace support
        --disable-colltrace     Build without collective trace
@@ -50,7 +63,7 @@ RCCL build & installation helper script options:
     -i|--install               Install RCCL library (see --prefix argument below)
     -j|--jobs                  Specify how many parallel compilation jobs to run ($nproc by default)
     -l|--local_gpu_only        Only compile for local GPU architecture
-       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, seperate by ';' (builds for all supported GPU architectures by default)
+       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, separate by ';' (builds for all supported GPU architectures by default)
        --no_clean              Don't delete files if they already exist
        --npkit-enable          Compile with npkit enabled
        --openmp-test-enable    Enable OpenMP in rccl unit tests
@@ -66,101 +79,7 @@ RCCL build & installation helper script options:
        --verbose               Show compile commands
 
 .. tip::
-    By default, RCCL builds for all GPU targets defined in ``DEFAULT_GPUS`` in `CMakeLists.txt <https://github.com/ROCm/rccl/blob/develop/CMakeLists.txt>`_. To target specific GPU(s), and potentially reduce build time, use ``--amdgpu_targets`` as a ``;`` separated string listing GPU(s) to target.
-
-Manual build
-============
-
-To build the library using CMake:
----------------------------------
-
-.. code-block:: shell
-
-    $ git clone https://github.com/ROCm/rccl.git
-    $ cd rccl
-    $ mkdir build
-    $ cd build
-    $ cmake ..
-    $ make -j 16      # Or some other suitable number of parallel jobs
-
-You may substitute an installation path of your own choosing by passing ``CMAKE_INSTALL_PREFIX``. For example:
-
-.. code-block:: shell
-
-    $ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..
-
-.. note::
-    Ensure rocm-cmake is installed, ``apt install rocm-cmake``.
-
-
-To build the RCCL package and install package:
-----------------------------------------------
-
-Assuming you have already cloned this repository and built the library as shown in the previous section:
-
-.. code-block:: shell
-
-    $ cd rccl/build
-    $ make package
-    $ sudo dpkg -i *.deb
-
-RCCL package install requires sudo/root access because it creates a directory called "rccl" under ``/opt/rocm/``. This is an optional step and RCCL can be used directly by including the path containing ``librccl.so``.
-
-Enabling peer-to-peer transport
-===============================
-
-In order to enable peer-to-peer access on machines with PCIe-connected GPUs, the HSA environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` is required to be set, on top of requiring GPUs that support peer-to-peer access and proper large BAR addressing support.
-
-Testing RCCL
-============
-
-There are rccl unit tests implemented with the Googletest framework in RCCL.  The rccl unit tests require Googletest 1.10 or higher to build and execute properly (installed with the ``-d`` option to ``install.sh``).
-To invoke the rccl unit tests, go to the build folder, then the test subfolder, and execute the appropriate rccl unit test executable(s).
-
-rccl unit test names are now of the format:
-
-.. code-block:: shell
-
-    CollectiveCall.[Type of test]
-
-Filtering of rccl unit tests should be done with environment variable and by passing the ``--gtest_filter`` command line flag:
-
-.. code-block:: shell
-
-    UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"
-
-This will run only ``AllReduce`` correctness tests with float16 datatype. A list of available filtering environment variables appears at the top of every run. See https://google.github.io/googletest/advanced.html#running-a-subset-of-the-tests for more information on how to form more advanced filters.
-
-There are also other performance and error-checking tests for RCCL.  These are maintained separately at https://github.com/ROCm/rccl-tests.
-
-.. note::
-    See the `rccl-tests/README <https://github.com/ROCm/rccl-tests/blob/develop/README.md>`_ for more information on how to build and run those tests.
-
-NPKit
-=====
-
-RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
-Please check `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ as a fully automated usage example. It also provides good templates for the following manual instructions.
-To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` with ``cmake`` command. All NPKit compile-time switches are declared in the RCCL code base as macros with prefix ``ENABLE_NPKIT_``, and they control which information will be collected. Also note that currently NPKit only supports collecting non-overlapped events on GPU, and ``-DNPKIT_FLAGS`` should follow this rule.
-
-To manually run RCCL with NPKit enabled, environment variable ``NPKIT_DUMP_DIR`` needs to be set as the NPKit event dump directory. Also note that currently NPKit only supports 1 GPU per process.
-To manually analyze NPKit dump results, please leverage `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
-
-MSCCL/MSCCL++
-=============
-
-RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to leverage the highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us in this project.
-
-MSCCL uses XMLs for different collective algorithms on different architectures. RCCL collectives can leverage those algorithms once the corresponding XML has been provided by the user. The XML files contain the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, the users may have to enable this by setting ``RCCL_MSCCL_FORCE_ENABLE=1``.
-On the other hand, RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. Users need to set the RCCL environment variable ``RCCL_ENABLE_MSCCLPP=1`` to run RCCL workload with MSCCL++ support. It is also possible to set the message size threshold for using MSCCL++ by using the environment variable ``RCCL_MSCCLPP_THRESHOLD``. Once ``RCCL_MSCCLPP_THRESHOLD`` (the default value is 1MB) is set, RCCL will invoke MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
-
-Improving performance on MI300 when using less than 8 GPUs
-==========================================================
-
-On a system with 8\*MI300X GPUs, each pair of GPUs are connected with dedicated XGMI links in a fully-connected topology. So, for collective operations, one can achieve good performance when all 8 GPUs (and all XGMI links) are used. When using less than 8 GPUs, one can only achieve a fraction of the potential bandwidth on the system.
-But, if your workload warrants using less than 8 MI300 GPUs on a system, you can set the run-time variable `NCCL_MIN_NCHANNELS` to increase the number of channels. 
-
-For example: ``export NCCL_MIN_NCHANNELS=32``
 
-Increasing the number of channels can be beneficial to performance, but it also increases GPU utilization for collective operations.
-Additionally, we have pre-defined higher number of channels when using only 2 GPUs or 4 GPUs on a 8\*MI300 system. Here, RCCL will use **32 channels** for the 2 MI300 GPUs scenario and **24 channels** for the 4 MI300 GPUs scenario.
+    By default, the RCCL install script builds for all the GPU targets defined in ``DEFAULT_GPUS`` in `CMakeLists.txt <https://github.com/ROCm/rccl/blob/develop/CMakeLists.txt>`_.
+    To target specific GPUs and potentially reduce the build time, pass ``--amdgpu_targets`` along with
+    a semicolon-separated (``;``) list of the GPU targets.
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 7d991867e..5ced6215f 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -9,10 +9,25 @@ subtrees:
   entries:
   - file: install/installation
     title: Installation guide
+  - file: install/docker-install
+    title: Running RCCL using Docker
+  - file: install/building-installing
+    title: Building and installing from source
 
 - caption: How to 
   entries:
   - file: how-to/using-nccl
+    title: Using the NCCL Net plugin
+  - file: how-to/troubleshooting-rccl
+    title: Troubleshoot RCCL
+  - file: how-to/rccl-usage-tips
+
+- caption: Examples
+  entries:
+  - url: https://github.com/ROCm/rccl/tree/develop/ext-tuner/example
+    title: RCCL Tuner plugin examples
+  - url: https://github.com/ROCm/rccl/tree/develop/ext-net/example
+    title: NCCL Net plugin examples
 
 - caption: API reference 
   entries:
diff --git a/install.sh b/install.sh
index 892f2bbee..ab57e0e2e 100755
--- a/install.sh
+++ b/install.sh
@@ -41,7 +41,7 @@ function display_help()
     echo "RCCL build & installation helper script"
     echo " Options:"
     echo "       --address-sanitizer     Build with address sanitizer enabled"
-    echo "    -d|--dependencies          Install RCCL depdencencies"
+    echo "    -d|--dependencies          Install RCCL dependencies"
     echo "       --debug                 Build debug library"
     echo "       --enable_backtrace      Build with custom backtrace support"
     echo "       --disable-colltrace     Build without collective trace"
@@ -52,7 +52,7 @@ function display_help()
     echo "    -i|--install               Install RCCL library (see --prefix argument below)"
     echo "    -j|--jobs                  Specify how many parallel compilation jobs to run ($num_parallel_jobs by default)"
     echo "    -l|--local_gpu_only        Only compile for local GPU architecture"
-    echo "       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, seperate by ';' (builds for all supported GPU architectures by default)"
+    echo "       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, separate by ';' (builds for all supported GPU architectures by default)"
     echo "       --no_clean              Don't delete files if they already exist"
     echo "       --npkit-enable          Compile with npkit enabled"
     echo "       --openmp-test-enable    Enable OpenMP in rccl unit tests"