Build from Source

Overview

This document provides instructions for building TensorRT-LLM from source code on Linux.

We recommend installing TensorRT-LLM directly whenever possible. Building from source is necessary for users who require the best performance or debugging capabilities, or when the GNU C++11 ABI is required.

We recommend the use of Docker to build and run TensorRT-LLM. Instructions to install an environment to run Docker containers for the NVIDIA platform can be found here.

Fetch the Sources

The first step to build TensorRT-LLM is to fetch the sources:

# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull

Note: There are two options to create the TensorRT-LLM Docker image. The approximate disk space required to build the image is 63 GB.

Option 1: Build TensorRT-LLM in One Step

TensorRT-LLM contains a simple command to create a Docker image:

make -C docker release_build

It is possible to add the optional argument CUDA_ARCHS="<list of architectures in CMake format>" to specify which architectures should be supported by TensorRT-LLM. It restricts the supported GPU architectures but helps reduce compilation time:

# Restrict the compilation to Ada and Hopper architectures.
make -C docker release_build CUDA_ARCHS="89-real;90-real"

Once the image is built, the Docker container can be executed using:

make -C docker release_run

The make command supports the LOCAL_USER=1 argument to switch to the local user account instead of root inside the container, as shown below. The TensorRT-LLM examples are installed in the /app/tensorrt_llm/examples directory.
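
For example, to run the release container under the local user account (a usage sketch combining the release_run target with the LOCAL_USER option described above):

# Run the release container as the local user instead of root.
make -C docker release_run LOCAL_USER=1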

Option 2: Build Step-by-step

For users looking for more flexibility, TensorRT-LLM has commands to create and run a development container in which TensorRT-LLM can be built.

Create the Container

On Systems with GNU make

The following command creates a Docker image for development:

make -C docker build

The image will be tagged locally with tensorrt_llm/devel:latest. To run the container, use the following command:

make -C docker run

For users who prefer to work with their own user account in that container instead of root, the option LOCAL_USER=1 must be added to the command above:

make -C docker run LOCAL_USER=1

On Systems Without GNU make

On systems without GNU make or shell support, the Docker image for development can be built using:

docker build --pull  \
             --target devel \
             --file docker/Dockerfile.multi \
             --tag tensorrt_llm/devel:latest \
             .

The container can then be run using:

docker run --rm -it \
           --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
           --volume ${PWD}:/code/tensorrt_llm \
           --workdir /code/tensorrt_llm \
           tensorrt_llm/devel:latest

Build TensorRT-LLM

Once in the container, TensorRT-LLM can be built from source using:

# To build the TensorRT-LLM code.
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

# Deploy TensorRT-LLM in your environment.
pip install ./build/tensorrt_llm*.whl

By default, build_wheel.py enables incremental builds. To clean the build directory, add the --clean option:

python3 ./scripts/build_wheel.py --clean  --trt_root /usr/local/tensorrt

It is possible to restrict the compilation of TensorRT-LLM to specific CUDA architectures. For that purpose, the build_wheel.py script accepts a semicolon-separated list of CUDA architectures, as shown in the following example:

# Build TensorRT-LLM for Ampere.
python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --trt_root /usr/local/tensorrt

The list of supported architectures can be found in the CMakeLists.txt file.

Build the Python Bindings for the C++ Runtime

The C++ Runtime, in particular Executor and GptSession, can be exposed to Python via bindings. This feature is enabled by the default build options:

python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

After installing the resulting wheel as described above, the C++ Runtime bindings will be available in the tensorrt_llm.bindings package. Running help on this package in a Python interpreter provides an overview of the relevant classes. The examples and unit tests can also be consulted to understand the API.
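
For instance, the bindings can be inspected directly from the shell after the wheel is installed (a minimal sketch; the exact set of classes shown depends on the build):

# Print the documented classes and functions of the C++ runtime bindings.
python3 -c "import tensorrt_llm.bindings as bindings; help(bindings)"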

This feature will not be enabled when building only the C++ runtime.

Link with the TensorRT-LLM C++ Runtime

The build_wheel.py script will also compile the library containing the C++ runtime of TensorRT-LLM. If Python support and torch modules are not required, the script provides the option --cpp_only which restricts the build to the C++ runtime only:

python3 ./scripts/build_wheel.py --cuda_architectures "80-real;86-real" --cpp_only --clean

This is particularly useful to avoid linking problems which may be introduced by particular versions of torch related to the dual ABI support of GCC. The option --clean will remove the build directory before building. The default build directory is cpp/build, which may be overridden using the option --build_dir. Run build_wheel.py --help for an overview of all supported options.
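
As an illustration, the build directory can be redirected while building only the C++ runtime (the path below is just an example, not a required location):

# Build the C++ runtime only, writing build artifacts to a custom directory.
python3 ./scripts/build_wheel.py --cpp_only --clean --build_dir /tmp/tensorrt_llm_build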

The shared library can be found in the following location:

cpp/build/tensorrt_llm/libtensorrt_llm.so

In addition, one needs to link against the library containing the LLM plugins for TensorRT available here:

cpp/build/tensorrt_llm/plugins/libnvinfer_plugin_tensorrt_llm.so
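
A sketch of a link command against both libraries (the compiler, object files, and additional dependencies such as TensorRT and CUDA libraries will vary by project):

# Example link flags; adjust paths and add TensorRT/CUDA libraries as needed.
g++ my_app.o \
    -Lcpp/build/tensorrt_llm -ltensorrt_llm \
    -Lcpp/build/tensorrt_llm/plugins -lnvinfer_plugin_tensorrt_llm \
    -o my_app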

Supported C++ Header Files

When using TensorRT-LLM, you need to add the cpp and cpp/include directories to the project's include paths. Only header files contained in cpp/include are part of the supported API and may be directly included. Other headers contained under cpp should not be included directly since they might change in future versions.
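
For example, when compiling a translation unit that uses the public API, both directories can be added to the include path (a sketch, assuming the TensorRT-LLM checkout lives at /code/tensorrt_llm):

# Add the supported public headers (cpp/include) and the cpp directory to the include path.
g++ -c my_app.cpp \
    -I/code/tensorrt_llm/cpp/include \
    -I/code/tensorrt_llm/cpp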

For examples of how to use the C++ runtime, see the unit tests in gptSessionTest.cpp and the related CMakeLists.txt file.