This is the development repository of Intel® XPU Backend for Triton*, a new OpenAI Triton backend for Intel GPUs. Triton is a language and compiler for writing highly efficient custom deep learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs. Intel® XPU Backend for Triton* is a module used by Triton to provide a reasonable tradeoff between performance and productivity on Intel GPUs.
Intel® XPU Backend for Triton* serves as a backend for OpenAI Triton. There are two options for installation: install from a prebuilt whl package, or build from source. Follow either option below for setup.
Intel® XPU Backend for Triton* requires the following two dependency packages:
Please follow the installation guide for Intel® Extension for PyTorch* for the detailed installation process of BOTH PyTorch and Intel® Extension for PyTorch*. Make sure that the associated GPU driver and the Intel® oneAPI Base Toolkit are installed correctly.
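In a typical setup, the Intel® oneAPI Base Toolkit environment is activated before building or running anything. The command below assumes the default installation prefix /opt/intel/oneapi; adjust the path if your toolkit is installed elsewhere:
source /opt/intel/oneapi/setvars.sh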
Installing from a prebuilt wheel is the simplest way to get started. Download the latest .whl that matches your Python version and implementation. Wheels are provided for CPython and PyPy; by default, your interpreter is most likely CPython. You can check your Python implementation with the following command:
python -c "import platform;print(platform.python_implementation())"
Then download the corresponding .whl from the release page and install it locally, for example:
wget https://github.com/intel/intel-xpu-backend-for-triton/releases/download/v2.1.0_rc1/triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
pip install triton-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
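To verify the installation, you can run the same import check that the Docker flow below uses (this assumes PyTorch and Intel® Extension for PyTorch* are already installed as described above):
python -c "import torch,intel_extension_for_pytorch,triton"
If the command exits without errors, the wheel is installed correctly.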
To build from source instead, first clone the OpenAI Triton repository and its submodules:
# Clone OpenAI/Triton
git clone https://github.com/openai/triton.git
cd triton
# Clone submodules
git submodule sync && git submodule update --init --recursive --jobs 0
Since the project is under active development, it is recommended to check out the latest commit of `intel-xpu-backend-for-triton`:
cd third_party/intel_xpu_backend
git checkout main && git pull
Now build Triton with the Intel XPU backend enabled. Note that it is important to make sure the `triton` repository is checked out at the pinned commit, which is the latest commit tested to work:
# cd to triton root folder and checkout to pinned commit
cd ../..
git checkout `cat third_party/intel_xpu_backend/triton_hash.txt`
# Build triton with XPU backend enabled
cd python
TRITON_CODEGEN_INTEL_XPU_BACKEND=1 python setup.py develop
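Once the build finishes, a quick check that the XPU device is visible to PyTorch can confirm the environment (a sketch; torch.xpu.is_available() is provided by Intel® Extension for PyTorch*):
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.is_available())"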
We also provide a detailed page covering the overall build process, including all source-build methods; refer to build_from_source.md for more detail. If you encounter any problems, please check the Possible-Build-Bugs page first.
Alternatively, you can build a Docker image that includes the required dependencies:
# set the dependency version variables (PyTorch and IPEX repos, branches, commits)
source triton/third_party/intel_xpu_backend/.github/ci_pins/version.txt
# cd to the docker folder and build the image from the Dockerfile
cd triton/third_party/intel_xpu_backend/docker
DOCKER_BUILDKIT=1 docker build \
--build-arg http_proxy=${http_proxy} \
--build-arg https_proxy=${https_proxy} \
--build-arg PT_REPO=$torch_repo \
--build-arg PT_BRANCH=$torch_branch \
--build-arg PT_COMMIT=$torch_commit \
--build-arg IPEX_REPO=$ipex_repo \
--build-arg IPEX_BRANCH=$ipex_branch \
--build-arg IPEX_COMMIT=$ipex_commit \
--build-arg BASEKIT_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/20f4e6a1-6b0b-4752-b8c1-e5eacba10e01/l_BaseKit_p_2024.0.0.49564_offline.sh \
-t triton:xpu \
-f Dockerfile \
--target image .
# create a container from the image
docker run -id --name $USER --privileged --env https_proxy=${https_proxy} --env http_proxy=${http_proxy} --net host --shm-size 2G triton:xpu
# env check in container
docker exec -ti $USER bash -c "source /opt/intel/oneapi/setvars.sh;python -c 'import torch,intel_extension_for_pytorch,triton'"
## (optional) run E2E test in container
docker exec -ti $USER bash -c "source /opt/intel/oneapi/setvars.sh ;\
cd /workspace/pytorch && wget -O inductor_xpu_test.sh https://raw.githubusercontent.com/intel/intel-xpu-backend-for-triton/main/.github/scripts/inductor_xpu_test.sh ;\
pip install pandas && bash inductor_xpu_test.sh huggingface amp_bf16 inference accuracy xpu 1 static 1 0 DebertaForMaskedLM
"
Intel® XPU Backend for Triton* requires only minor code changes. The user needs to do the following two things:
- Add `import intel_extension_for_pytorch` for XPU support.
- Put tensors and models on the XPU by calling `to('xpu')`, as sketched below.

In some cases the PyTorch API needs to be changed; please refer to the API Documentation of Intel® Extension for PyTorch* for more detail.
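A minimal sketch of these two changes (it assumes a working PyTorch + Intel® Extension for PyTorch* XPU setup; the shapes and the Linear layer are arbitrary illustrations):
import torch
# Importing intel_extension_for_pytorch registers the 'xpu' device with PyTorch
import intel_extension_for_pytorch

# Put tensors and models on the XPU by calling to('xpu')
x = torch.randn(8, 16).to('xpu')
model = torch.nn.Linear(16, 4).to('xpu')
print(model(x).device)  # expected: xpu:0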
The following examples show these modifications in complete user code.
This example is a modified version of the Vector Add Triton kernel. Please refer to Vector Add for detailed comments and an illustration of the code semantics.
Compared to the original code, the following modifications are made:
- Add `import intel_extension_for_pytorch` for XPU support.
- Put the tensors on the XPU and use the XPU API for `manual_seed`.
import torch
# Need to import intel_extension_for_pytorch for xpu support
import intel_extension_for_pytorch
import triton
import triton.language as tl
@triton.jit
def add_kernel(
    x_ptr,
    y_ptr,
    output_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    # Put the tensor to xpu
    output = torch.empty_like(x).xpu()
    assert x.is_xpu and y.is_xpu and output.is_xpu
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
# For manual_seed, needs to use API for XPU
torch.xpu.manual_seed(0)
size = 512
# For tensors, needs to be put on XPU
x = torch.rand(size, device='xpu')
y = torch.rand(size, device='xpu')
output_torch = x + y
output_triton = add(x, y)
print(output_torch)
print(output_triton)
print(
    f'The maximum difference between torch and triton is '
    f'{torch.max(torch.abs(output_torch - output_triton))}'
)
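To get a rough wall-clock timing for the kernel above, you can time a call between device synchronizations (a sketch; it assumes torch.xpu.synchronize() is available, which Intel® Extension for PyTorch* provides analogously to torch.cuda.synchronize(), and absolute numbers depend on your hardware):
import time

# Warm up once so that kernel compilation time is excluded from the measurement
add(x, y)
torch.xpu.synchronize()

start = time.perf_counter()
output_triton = add(x, y)
torch.xpu.synchronize()  # wait for the XPU kernel to finish before stopping the timer
print(f'add kernel took {(time.perf_counter() - start) * 1e3:.3f} ms')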
Triton is transparent for end-to-end models. One can simply use `torch.compile` with `inductor` as the backend (the default); it will automatically generate Triton kernels and benefit from them.
import torch
# Need to import intel_extension_for_pytorch for xpu support
import intel_extension_for_pytorch
from torch._dynamo.testing import rand_strided
from torch.nn import *
class simpleModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # tensors inside model should be on xpu
        self.y = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32)

    def forward(self, x):
        z = x + self.y
        return z
# tensors passed to the model should be on xpu
x = rand_strided((32, 8), (8, 1), device='xpu:0', dtype=torch.float32)
xpu_model = simpleModel()
# Call torch.compile for optimization
optimized_mod = torch.compile(xpu_model)
graph_result = optimized_mod(x)
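A quick way to sanity-check the compiled model is to compare it against eager execution (a short sketch reusing the objects defined above):
# Run the same input through the un-compiled model and compare the results
eager_result = xpu_model(x)
print(torch.allclose(graph_result, eager_result))  # expected: True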
If you wish to take a look at more examples, please refer to the Unit Tests and End-to-End Benchmark Tests.
There are several ways of doing performance analysis. We recommend using `torch.profiler` for end-to-end performance analysis and the Intel® VTune™ Profiler for more detailed kernel analysis. We provide comprehensive guides for both:
- the end_to_end_tests#profiling settings section for using `torch.profiler`;
- the VTune Profiling Guide for kernel analysis.

Note that the user needs to set `TRITON_XPU_PROFILE=1` explicitly to enable kernel profiling:
export TRITON_XPU_PROFILE=1
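As a minimal `torch.profiler` sketch for the vector-add example above (the activities and options needed to capture XPU device time depend on your PyTorch and Intel® Extension for PyTorch* versions, so treat this as a starting point and consult the linked guides for the exact settings):
import torch
# Needed so that the 'xpu' device and the kernels above are available
import intel_extension_for_pytorch
from torch.profiler import profile, ProfilerActivity

# Profile a single call of the Triton add kernel defined earlier
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    output_triton = add(x, y)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))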
Please refer to the wiki page for known limitations.
Contributions from the community are warmly welcomed; please refer to the contribution guidelines and code of conduct.
MIT License, as found in the LICENSE file.
See Intel's Security Center for information on how to report a potential security issue or vulnerability.
See also: Security Policy