
Intel® XPU Backend for Triton*

This is the development repository of Intel® XPU Backend for Triton*, a new OpenAI Triton backend for Intel GPUs. Triton is a language and compiler for writing highly efficient custom deep learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs. Intel® XPU Backend for Triton* is a module used by Triton to provide a reasonable tradeoff between performance and productivity on Intel GPUs.

Setup Guide

Intel® XPU Backend for Triton* serves as a backend for OpenAI Triton. There are three options for installation: install from a whl package, build from source, or build with the provided Dockerfile. Please follow one of the options below for setup.

Prerequisites

Intel® XPU Backend for Triton* requires the following two dependency packages:

  1. PyTorch.
  2. Intel® Extension for PyTorch* .

Please follow the installation guide for Intel® Extension for PyTorch* for the detailed installation process of both PyTorch and Intel® Extension for PyTorch*. Make sure that the associated driver and the Intel® oneAPI Base Toolkit are installed correctly.
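Once both packages are installed, you can run a quick sanity check to confirm that PyTorch can see the XPU device. The snippet below is a minimal sketch; it assumes that importing intel_extension_for_pytorch registers the torch.xpu namespace, as described in the Intel® Extension for PyTorch* documentation:

import torch
# Importing intel_extension_for_pytorch registers the 'xpu' device with PyTorch
import intel_extension_for_pytorch

print(torch.__version__)
print(intel_extension_for_pytorch.__version__)
# Both calls below come from the torch.xpu namespace added by Intel® Extension for PyTorch*
print(torch.xpu.is_available())    # expected to print True on a correctly configured machine
print(torch.xpu.device_count())    # number of visible Intel GPUs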

Option 1: Install From whl Packages

This is the simplest installation method.

Download the latest .whl package matching your Python version. Wheels are provided for CPython and PyPy; in most cases your interpreter is CPython. You can check your Python implementation with the following command:

python -c "import platform;print(platform.python_implementation())"

Then download the corresponding .whl from the release page and install it locally, for example:

wget https://github.com/intel/intel-xpu-backend-for-triton/releases/download/v2.1.0_rc1/triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

pip install triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
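After installation, a short check in a Python session confirms that the package is importable and reports the installed version (2.1.0 for the wheel above):

import triton
print(triton.__version__)   # expected to match the installed wheel, e.g. 2.1.0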

Option 2: Build From the Source

# Clone OpenAI/Triton
git clone https://github.com/openai/triton.git
cd triton
# Clone submodules
git submodule sync && git submodule update --init --recursive --jobs 0

Since the project is under active development, it is recommended to check out the latest commit of intel-xpu-backend-for-triton:

cd third_party/intel_xpu_backend
git checkout main && git pull

Now build Triton with the Intel XPU backend enabled. Note that it is important to make sure the Triton repo is checked out at the pinned commit, which is the latest commit tested to work.

# cd to the triton root folder and check out the pinned commit
cd ../..
git checkout `cat third_party/intel_xpu_backend/triton_hash.txt`
# Build triton with XPU backend enabled
cd python
TRITON_CODEGEN_INTEL_XPU_BACKEND=1 python setup.py develop

We also provide a detailed page for the overall build process, covering all source build methods; refer to build_from_source.md for more detail. If you encounter any problems, please check the Possible-Build-Bugs page first.

Option 3: Build With Dockerfile

# load dependency version variables
source triton/third_party/intel_xpu_backend/.github/ci_pins/version.txt
# cd to the docker folder and build the image from the Dockerfile
cd triton/third_party/intel_xpu_backend/docker
DOCKER_BUILDKIT=1 docker build \
                 --build-arg http_proxy=${http_proxy} \
                 --build-arg https_proxy=${https_proxy} \
                 --build-arg PT_REPO=$torch_repo \
                 --build-arg PT_BRANCH=$torch_branch \
                 --build-arg PT_COMMIT=$torch_commit \
                 --build-arg IPEX_REPO=$ipex_repo \
                 --build-arg IPEX_BRANCH=$ipex_branch \
                 --build-arg IPEX_COMMIT=$ipex_commit \
                 --build-arg BASEKIT_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh \
                 -t triton:xpu \
                 -f Dockerfile \
                 --target image .
# create a container from the image
docker run -id --name $USER --privileged --env https_proxy=${https_proxy} --env http_proxy=${http_proxy} --net host --shm-size 2G triton:xpu
# env check in container
docker exec -ti $USER bash -c "source /opt/intel/oneapi/setvars.sh;python -c 'import torch,intel_extension_for_pytorch,triton'"
## (optional) run E2E test in container
docker exec -ti $USER bash -c "source /opt/intel/oneapi/setvars.sh ;\
                              cd /workspace/pytorch && wget -O inductor_xpu_test.sh https://raw.githubusercontent.com/intel/intel-xpu-backend-for-triton/main/.github/scripts/inductor_xpu_test.sh ;\
                               pip install pandas && bash inductor_xpu_test.sh huggingface amp_bf16 inference accuracy xpu 1 static 1 0 DebertaForMaskedLM
                              "

Usage Guide

Code Modifications

Intel® XPU Backend for Triton* only requires minor code changes. The user needs to do the following two things:

  1. Add import intel_extension_for_pytorch for xpu support.
  2. Put the tensors and models on the XPU by calling to('xpu'). In some cases the PyTorch API needs to be changed as well; please refer to the API Documentation of Intel® Extension for PyTorch* for more detail.

The following examples show modifications for the user code.

Example 1 : Triton Kernel

This example is a modified version of the Vector Add Triton kernel. Please refer to Vector Add for detailed comments and an explanation of the code semantics.

Compared to the original code, the following modifications are made:

  1. Add import intel_extension_for_pytorch for xpu support.
  2. Put the tensors on the XPU and use the XPU-specific API for manual_seed.

import torch
# Need to import intel_extension_for_pytorch for xpu support
import intel_extension_for_pytorch

import triton
import triton.language as tl


@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    # Put the tensor to xpu
    output = torch.empty_like(x).xpu()
    assert x.is_xpu and y.is_xpu and output.is_xpu
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)

    return output

# For manual_seed, use the XPU-specific API
torch.xpu.manual_seed(0)
size = 512
# Tensors need to be placed on the XPU
x = torch.rand(size, device='xpu')
y = torch.rand(size, device='xpu')
output_torch = x + y
output_triton = add(x, y)
print(output_torch)
print(output_triton)
print(
    f'The maximum difference between torch and triton is '
    f'{torch.max(torch.abs(output_torch - output_triton))}'
)
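
To get a rough performance number for the kernel above, you can reuse Triton's built-in benchmarking helper. This is a sketch only; it assumes triton.testing.do_bench supports the XPU device in this backend the same way it does for CUDA in upstream Triton:

import triton.testing

# Time the Triton kernel and the native torch addition on the same XPU tensors
ms_triton = triton.testing.do_bench(lambda: add(x, y))
ms_torch = triton.testing.do_bench(lambda: x + y)
# do_bench reports timings in milliseconds; the exact return shape depends on the Triton version
print('triton add:', ms_triton, 'torch add:', ms_torch)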

Example 2 : End-to-End Model

Triton is transparent to end-to-end models. One can simply use torch.compile with the default inductor backend, which automatically generates Triton kernels and benefits from them.

import torch
# Need to import intel_extension_for_pytorch for xpu support
import intel_extension_for_pytorch
from torch._dynamo.testing import rand_strided

from torch.nn import *
class simpleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # tensors inside model should be on xpu
        self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32)

    def forward(self, x):
        z = x + self.y
        return z

# tensors passed to the model should be on xpu
x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32)
xpu_model = simpleModel()
# Call torch.compile for optimization
optimized_mod = torch.compile(xpu_model)

graph_result = optimized_mod(x)
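
As a quick correctness check, you can compare the compiled output with eager execution on the same input using PyTorch's standard comparison helper:

# Run the uncompiled model on the same input and compare results
eager_result = xpu_model(x)
torch.testing.assert_close(graph_result, eager_result)
print("compiled and eager outputs match")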

More Examples on Tests

If you wish to take a look at more examples, please refer to the Unit Tests and End-to-End Benchmark Tests.

Performance Analysis Guide

There are several ways to do performance analysis. We recommend using torch.profiler for end-to-end performance analysis and Intel® VTune™ Profiler for more detailed kernel analysis. We provide comprehensive guides for both:

  1. end_to_end_tests#profiling settings section for using torch.profiler.
  2. VTune Profiling Guide for kernel analysis.

Note that you need to explicitly set TRITON_XPU_PROFILE=1 to enable kernel profiling:

export TRITON_XPU_PROFILE=1
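
For a quick wall-clock measurement outside of torch.profiler or VTune, keep in mind that XPU kernels execute asynchronously, so the device must be synchronized before reading the clock. The helper below is a minimal sketch; it assumes the torch.xpu.synchronize() API provided by Intel® Extension for PyTorch*, and time_fn is a hypothetical helper name:

import time

import torch
import intel_extension_for_pytorch  # provides the torch.xpu namespace

def time_fn(fn, warmup=10, iters=100):
    # Warm up to exclude one-time JIT compilation and caching costs
    for _ in range(warmup):
        fn()
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Wait for all queued XPU work to finish before stopping the timer
    torch.xpu.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # average milliseconds per call

# Example: time the vector add kernel from Example 1
# print(time_fn(lambda: add(x, y)))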

Known Limitations

For known limitations, please refer to the wiki page.

Contributing

Contributions from the community are warmly welcomed; please refer to the contribution guidelines and code of conduct.

License

MIT License, as found in the LICENSE file.

Security

See Intel's Security Center for information on how to report a potential security issue or vulnerability.

See also: Security Policy
