Backend Optimization

This post collects resources on backend optimization techniques for machine learning.

CPU

GEMM

How to Optimize GEMM
How to optimize GEMM on CPU
GEMM: From Pure C to SSE Optimized Micro Kernels
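How to Use TVM to Quickly Implement a GEMM That Outperforms NumPy
x64 CPU GEMM Optimization (Mastering SIMD Instruction Programming)
CPU High-Performance Computing 1 - SGEMM Performance Bottleneck Analysis and Solutions
High-Performance Computing in Machine Learning (1): CPU Optimization
High-Performance Computing in Machine Learning (2): SSE Optimization
How Do the Experts Implement Matrix Multiplication Elegantly?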

CUDA

Elementwise operation

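GPU Optimization Made Simple Series: Elementwise Optimization and an Introduction to the CUDA Toolchain
Efficient, Easy to Use, and Extensible, I Want It All: Design and Optimization of the OneFlow CUDA Elementwise Template Library
[BBuf's CUDA Notes] 1: Analyzing OneFlow's Element-Wise Operator Implementation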

Reduction

Chapter 10 of Programming Massively Parallel Processors
How to Implement an Efficient Softmax CUDA Kernel? A OneFlow Performance Optimization Write-up
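Classic Problems in CUDA High-Performance Computing (1): Reduction
Studying CUDA WarpReduce
GPU Optimization Made Simple Series: Reduce Optimization
A Brief Discussion of CUDA Reduce
Introduction to CUDA Programming (4): Parallel Reduction Algorithms
Introduction to CUDA Programming (5): More Efficient Parallel Reduction Algorithms
Introduction to CUDA Programming (6): Further Optimization via Loop Unrolling
PyTorch CUDA Source Code Analysis - BlockReduceSum
[BBuf's CUDA Notes] 8: Comparing the Softmax CUDA Implementations of OneFlow and FasterTransformer¹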

Scan

Chapter 11 of Programming Massively Parallel Processors
A Brief Analysis of Efficient CUDA Scan Algorithms
Studying the CUB Scan Algorithm
Classic Problems in CUDA High-Performance Computing (2): Prefix Sum
Scan Primitives for GPU Computing

GEMM/GEMV

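An Incomplete Guide to Traditional CUDA GEMM
The Right Way to Get Started with CUDA: how-to-optimize-gemm
The Ultimate Guide to Optimizing CUDA Matrix Multiplication²
CUDA SGEMM Matrix Multiplication Optimization Notes: From Beginner to cuBLAS
CUDA GEMM Theoretical Performance Analysis and Kernel Optimization
GPU Optimization Made Simple Series: GEMM Optimization (1)
GPU Optimization Made Simple Series: GEMM Optimization (2)
GPU Optimization Made Simple Series: GEMM Optimization (3)
How to Develop Machine Learning Systems: High-Performance GPU Matrix Multiplication
CUDA Ampere Tensor Core HGEMM Matrix Multiplication Optimization Notes: Up To 131 TFLOPS!
On the Matrix Multiplication Functions in cuBLAS
A Step-by-Step Derivation of the Optimal Parallel Strategy for Distributed Matrix Multiplication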
Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU³
GPU Optimization Made Simple Series: GEMV Optimization
Sparse Matrix-Vector Multiplication with CUDA
GPU Optimization Made Simple Series: SpMV Optimization
Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores
Sparse GPU Kernels for Deep Learning
GPU Kernels for Block-Sparse Weights
Block Sparse Matrix-Vector Multiplication with CUDA⁴

Convolution

Chapters 7 and 16 of Programming Massively Parallel Processors
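Implementation Principles of the MegEngine TensorCore Convolution Operator
Performance Optimization of Convolutional Neural Networks
MEC, an Improvement on Im2Col+GEMM: A More Efficient Convolution Strategy
MegEngine Inference Convolution Optimization: Im2col and Winograd
A Detailed Hand-Written Implementation of a CUDA Convolution Operator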

Layer

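CUDA Optimization: LayerNorm Performance Optimization in Practice
[BBuf's CUDA Notes] 2: Analyzing OneFlow's BatchNorm-Related Operator Implementations
[BBuf's CUDA Notes] 6: Summary of CUDA Optimization Techniques in the FasterTransformer Encoder (BERT)
[BBuf's CUDA Notes] 7: Summary of CUDA Optimization Techniques in the FasterTransformer Decoder (GPT)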

Miscellaneous

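[BBuf's CUDA Notes] 4: Three Efficient and Practical CUDA Algorithm Implementations (OneFlow ElementWise Template, FastAtomicAdd Template, OneFlow UpsampleNearest2d Template)
How to Implement a Permute/Transpose Operator 6x Faster than PyTorch?⁵
Implementing the Unfold and Fold Operators in OneFlow⁶
Example: Hand-Writing a CUDA Operator to Speed Up PyTorch 20x (for a Particular Operator)
CUDA GroupNorm NHWC Optimization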

Framework

The Journey of an Operator in a Deep Learning Framework
OneFlow Source Code Analysis: The Automatic Differentiation Mechanism

Profiling

GPU Optimization Made Simple Series: Elementwise Optimization and an Introduction to the CUDA Toolchain

Customized PyTorch kernel

Official tutorial
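Teaching You to Implement PyTorch Operators in CUDA, Like Teaching a Girlfriend
A Tutorial on Custom PyTorch CUDA Operators with Runtime Analysis
Three Ways to Compile and Call Custom CUDA Operators in PyTorch, Explained
Custom Backward Propagation in PyTorch in Three Minutes
Reading the PyTorch Source, cpp_extension: The Full Flow of Implementing and Calling C++/CUDA Operators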

Footnotes

¹ [BBuf's CUDA Notes] 9: Using New Bing (ChatGPT) to Analyze OneFlow's Softmax-Related Fusion Optimizations
² The code is available at https://github.com/niuhope/cuda_sgemm.
³ The code is now part of CUTLASS.
⁴ Implementing Block Sparse Matrix-Vector Multiplication (BSpMV) with CUDA
⁵ The code is available at https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/ep/cuda/primitive/permute.cu.
⁶ PyTorch nn.Unfold generalizes the $\verb|im2col|$ operation.
