Awesome-CUDA-Triton-HPC


🔥🔥🔥 This repository lists some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.

Contents

Official Version

  • CUDA : CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). (A minimal kernel sketch appears just after this list.)

  • NVIDIA/cuda-python : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. nvidia.github.io/cuda-python/

  • cuBLAS : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance. (A basic GEMM call is sketched after this list.)

  • cuDNN : The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization. (A small descriptor-based example is sketched after this list.)

  • CUTLASS : CUDA Templates for Linear Algebra Subroutines. CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. (A device-level GEMM sketch follows this list.)

  • TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt

  • TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM

  • Triton : Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware. triton-lang.org/

  • TVM : Open deep learning compiler stack for cpu, gpu and specialized accelerators. tvm.apache.org/

  • MLIR : Multi-Level Intermediate Representation Compiler Framework. The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.
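To make the CUDA entry above concrete, here is a minimal kernel sketch: a vector add using unified memory. The file name, problem size, and launch configuration are arbitrary illustration choices, not anything prescribed by the CUDA toolkit.

```cuda
// vec_add.cu — assumed build line: nvcc vec_add.cu -o vec_add
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one element: c[i] = a[i] + b[i].
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));   // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    vecAdd<<<(n + block - 1) / block, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```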
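The cuBLAS entry above amounts to calling BLAS routines on device memory through a library handle. A minimal single-precision GEMM sketch follows; matrix sizes and contents are arbitrary, and the build line in the comment is an assumption (any CUDA toolkit that ships libcublas should work).

```cuda
// gemm_cublas.cu — assumed build line: nvcc gemm_cublas.cu -lcublas
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;                             // C = alpha*A*B + beta*C, all n x n
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS stores matrices column-major; with constant inputs the layout does not matter here.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %g (expected %g)\n", hC[0], 2.0 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```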
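The cuDNN entry above describes a library of descriptor-driven primitives. A full convolution call needs several more descriptors, so for brevity the sketch below runs a ReLU activation through the classic descriptor-based API; the tensor shape and values are arbitrary, and newer cuDNN releases also offer a graph API not shown here.

```cuda
// relu_cudnn.cu — assumed build line: nvcc relu_cudnn.cu -lcudnn
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cudnn.h>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // A tiny 1x1x2x3 float tensor, just for illustration.
    const int n = 1, c = 1, h = 2, w = 3, count = n * c * h * w;
    std::vector<float> hx = {-2.f, -1.f, 0.f, 1.f, 2.f, 3.f}, hy(count);
    float *dx, *dy;
    cudaMalloc(&dx, count * sizeof(float));
    cudaMalloc(&dy, count * sizeof(float));
    cudaMemcpy(dx, hx.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the tensor layout (NCHW, float32).
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    // Describe and run a ReLU activation: y = max(x, 0).
    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, dx, &beta, desc, dy);

    cudaMemcpy(hy.data(), dy, count * sizeof(float), cudaMemcpyDeviceToHost);
    for (float v : hy) printf("%g ", v);   // expected: 0 0 0 1 2 3
    printf("\n");

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    cudaFree(dx); cudaFree(dy);
    return 0;
}
```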
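The CUTLASS entry above is a header-only template library; the sketch below follows the single-precision, column-major device-level GEMM shown in the CUTLASS quick-start material. Template defaults and the Arguments layout can differ between CUTLASS versions, so treat this as an assumption-laden outline rather than a version-exact example.

```cuda
// gemm_cutlass.cu — assumed build line: nvcc -std=c++17 -I<cutlass>/include gemm_cutlass.cu
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cutlass/gemm/device/gemm.h>

int main() {
    const int M = 64, N = 64, K = 64;            // arbitrary illustration sizes
    std::vector<float> hA(M * K, 1.0f), hB(K * N, 2.0f), hC(M * N, 0.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), hC.size() * sizeof(float), cudaMemcpyHostToDevice);

    // Single-precision GEMM with all operands column-major and default tile shapes.
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::ColumnMajor,   // A
        float, cutlass::layout::ColumnMajor,   // B
        float, cutlass::layout::ColumnMajor>;  // C / D

    // D = alpha * A * B + beta * C, with C reused as the output D.
    Gemm::Arguments args({M, N, K}, {dA, M}, {dB, K}, {dC, M}, {dC, M}, {1.0f, 0.0f});
    Gemm gemm_op;
    cutlass::Status status = gemm_op(args);
    cudaDeviceSynchronize();

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("status=%d  C[0]=%g (expected %g)\n", int(status), hC[0], 2.0 * K);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```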

Awesome List

Learning Resources

Frameworks

  • CUDA Frameworks

    • GPU Interface

      • CPP Version
        • CCCL : CUDA C++ Core Libraries. The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.

        • HIP : HIP: C++ Heterogeneous-Compute Interface for Portability. HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code. rocmdocs.amd.com/projects/HIP/

      • Python Version
      • Rust Version
      • Julia Version
    • Performance Benchmark

      • FlagPerf : FlagPerf is an open-source software platform for benchmarking AI chips. It is an integrated AI hardware evaluation engine built jointly by the Beijing Academy of Artificial Intelligence (BAAI) and AI hardware vendors, aiming to establish an industry-practice-oriented metric system that evaluates the real capability of AI hardware under full software-stack combinations (model + framework + compiler).

      • te42kyfo/gpu-benches : collection of benchmarks to measure basic GPU capabilities.

    • Scientific Computing Framework

      • cuBLAS : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.

      • CUTLASS : CUDA Templates for Linear Algebra Subroutines.

      • MUTLASS : MUSA Templates for Linear Algebra Subroutines.

      • MatX : MatX - GPU-Accelerated Numerical Computing in Modern C++. An efficient C++17 GPU numerical computing library with Python-like syntax. nvidia.github.io/MatX

      • CuPy : NumPy & SciPy for GPU. cupy.dev

      • GenericLinearAlgebra.jl : Generic numerical linear algebra in Julia.

      • custos-math : This crate provides CUDA, OpenCL, CPU (and Stack) based matrix operations using custos.

    • Machine Learning Framework

      • cuDNN : The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.

      • PyTorch : Tensors and Dynamic neural networks in Python with strong GPU acceleration. pytorch.org

      • MooreThreads/torch_musa : torch_musa is an open-source repository based on PyTorch that can make full use of the computing power of MooreThreads graphics cards.

      • PaddlePaddle : PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning). www.paddlepaddle.org/

      • flashlight/flashlight : A C++ standalone library for machine learning. fl.readthedocs.io/en/latest/

      • NVlabs/tiny-cuda-nn : Lightning fast C++/CUDA neural network framework.

      • yhwang-hub/dl_model_infer : This is a C++ AI inference library. Currently it only supports inference with TensorRT models; follow-up plans add C++ inference for frameworks such as OpenVINO, NCNN, and MNN. Pre- and post-processing come in two versions, a C++ version and a CUDA version, and the CUDA version is recommended. The repository provides accelerated deployment examples of popular deep learning CV models, with CUDA C support for dynamic-batch image processing, inference, decoding, and NMS.

    • AI Inference Framework

      • LLM Inference and Serving Engine
        • TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt

        • TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM

        • vLLM : A high-throughput and memory-efficient inference and serving engine for LLMs. docs.vllm.ai

        • MLC LLM : Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. mlc.ai/mlc-llm

        • Lamini : Lamini: The LLM engine for rapidly customizing models 🦙.

        • datawhalechina/self-llm : An open-source LLM usage guide (《开源大模型食用指南》) for quickly deploying open-source large language models in a Linux environment, with deployment tutorials written for Chinese users.

        • ninehills/llm-inference-benchmark : LLM Inference benchmark.

      • C Implementation
        • llm.c : LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.

        • llama2.c : Inference Llama 2 in one file of pure C. Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file (run.c).

      • CPP Implementation
      • Mojo Implementation
      • Rust Implementation
      • Zig Implementation

        • llama2.zig : Inference Llama 2 in one file of pure Zig.

        • renerocksai/gpt4all.zig : ZIG build for a terminal-based chat client for an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa.

        • EugenHotaj/zig_inference : Neural Network Inference Engine in Zig.

      • Go Implementation
        • Ollama : Get up and running with Llama 2, Mistral, Gemma, and other large language models. ollama.com

        • go-skynet/LocalAI : 🤖 Self-hosted, community-driven, local OpenAI-compatible API. Drop-in replacement for OpenAI, running LLMs on consumer-grade hardware. Free Open Source OpenAI alternative. No GPU required. LocalAI is an API to run ggml-compatible models: llama, gpt4all, rwkv, whisper, vicuna, koala, gpt4all-j, cerebras, falcon, dolly, starcoder, and many others. localai.io

    • Multi-GPU Framework

    • Robotics Framework

      • Cupoch : Robotics with GPU computing.
    • ZKP and Web3 Framework

  • Triton Frameworks

    • Triton Machine Learning Framework

      • BobMcDear/attorch : A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
    • Triton LLM Operator Library

      • Liger-Kernel : Efficient Triton Kernels for LLM Training. arxiv.org/pdf/2410.10989

      • FlagGems : FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.

      • linxihui/dkernel : This repo contains customized CUDA kernels written in OpenAI Triton. As of now, it contains the sparse attention kernel used in phi-3-small models. The sparse attention is also supported in vLLM for efficient inference.

    • Triton Inference Framework

  • MLIR Frameworks

    • MLIR GPU Programming

      • 'gpu' Dialect : This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL.

      • 'amdgpu' Dialect : The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics.

    • MLIR FFI Bindings

      • pyMLIR : Python interface for MLIR - the Multi-Level Intermediate Representation. pyMLIR is a full Python interface to parse, process, and output MLIR files according to the syntax described in the MLIR documentation. pyMLIR supports the basic dialects and can be extended with other dialects.
    • MLIR Machine Learning Framework

      • Torch-MLIR : The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

      • ONNX-MLIR : Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure.

      • TPU-MLIR : Machine learning compiler based on MLIR for Sophgo TPU. TPU-MLIR is an open-source machine-learning compiler based on MLIR for TPUs. This project provides a complete toolchain that converts pre-trained neural networks from different frameworks into bmodel binary files that can run efficiently on TPUs.

      • IREE : IREE: Intermediate Representation Execution Environment. A retargetable MLIR-based machine learning compiler and runtime toolkit. iree.dev/

      • ByteIR : The ByteIR Project is a ByteDance model compilation solution. ByteIR includes compiler, runtime, and frontends, and provides an end-to-end model compilation solution. byteir.ai

      • Xilinx/mlir-aie : An MLIR-based toolchain for AMD AI Engine-enabled devices. This repository contains an MLIR-based toolchain for AI Engine-enabled devices, such as AMD Ryzen™ AI and Versal™.

  • HPC Frameworks

    • BLAS : Basic Linear Algebra Subprograms. The BLAS are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. (A small CBLAS call is sketched just after this list.)

    • LAPACK : LAPACK development repository. LAPACK — Linear Algebra PACKage. LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.

    • OpenBLAS : OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. www.openblas.net

    • BLIS : BLAS-like Library Instantiation Software Framework.

    • NumPy : The fundamental package for scientific computing with Python. numpy.org

    • SciPy : SciPy library main repository. SciPy (pronounced "Sigh Pie") is an open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. scipy.org

    • Gonum : Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more. www.gonum.org/

    • YichengDWu/matmul.mojo : High Performance Matrix Multiplication in Pure Mojo 🔥. Matmul.🔥 is a high-performance multi-threaded implementation of the BLIS algorithm in pure Mojo 🔥.
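As a small illustration of the Level 3 BLAS interface described in the BLAS entry above, the program below multiplies two tiny matrices through the CBLAS binding. The cblas_dgemm routine itself is standard; the header name and the OpenBLAS link flag in the comment are assumptions about the installed BLAS.

```cuda
// blas_gemm.cpp — assumed build line: g++ blas_gemm.cpp -lopenblas (any CBLAS-providing BLAS works)
#include <cstdio>
#include <cblas.h>

int main() {
    // Row-major 2x3 * 3x2 -> 2x2, i.e. a Level 3 BLAS call: C = alpha*A*B + beta*C.
    const int M = 2, N = 2, K = 3;
    double A[M * K] = {1, 2, 3,
                       4, 5, 6};
    double B[K * N] = {7,  8,
                       9, 10,
                      11, 12};
    double C[M * N] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  // expected: 58 64 / 139 154
    return 0;
}
```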

Applications

Blogs

Videos

Interview
