Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86 optimization for gemm int8 #5763

Merged
merged 75 commits into from
Dec 17, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
75 commits
Select commit Hold shift + click to select a range
8076f4e
x86 sse2/xop/avx/avx2 optimization for gemm int8
nihui Oct 28, 2024
9c85f62
apply code-format changes
nihui Oct 28, 2024
952bedd
Merge branch 'Tencent:master' into gemm-quantize-x86
nihui Oct 29, 2024
502d58b
w
nihui Nov 4, 2024
5721ca7
apply code-format changes
nihui Nov 4, 2024
2c4bf75
w
nihui Nov 7, 2024
58cfbb8
apply code-format changes
nihui Nov 7, 2024
c49bb4f
fix
nihui Nov 7, 2024
c8b2c31
add c
nihui Nov 14, 2024
99a3a11
apply code-format changes
nihui Nov 14, 2024
3184d8b
Merge branch 'master' into gemm-quantize-x86
nihui Nov 14, 2024
d314a14
fix
nihui Nov 14, 2024
a321287
fix avx
nihui Nov 14, 2024
459cf4c
fix avx
nihui Nov 14, 2024
1bec95e
divps
nihui Nov 14, 2024
de33f23
dispatch avxvnni
nihui Nov 14, 2024
b2b0b96
Merge branch 'master' into gemm-quantize-x86
nihui Nov 28, 2024
aa727e4
skip round problem
nihui Nov 28, 2024
2143260
fix for x86 32bit
nihui Nov 28, 2024
ea43b93
apply code-format changes
nihui Nov 28, 2024
f32b4b4
no sse fix
nihui Nov 29, 2024
15ebb61
comp avxvnni
nihui Nov 29, 2024
1069d2f
fix vs2019 ice
nihui Nov 29, 2024
f4e41e1
comp
nihui Nov 29, 2024
310992b
comp++
nihui Nov 29, 2024
cd35861
apply code-format changes
nihui Nov 29, 2024
e13ab3e
opt 8x8
nihui Nov 29, 2024
6a48a9e
Merge branch 'master' into gemm-quantize-x86
nihui Dec 4, 2024
9a8d97c
f
nihui Dec 4, 2024
8ec4d6f
Merge branch 'master' into gemm-quantize-x86
nihui Dec 5, 2024
99dd447
w
nihui Dec 5, 2024
a41edb0
apply code-format changes
nihui Dec 5, 2024
7d1958a
w
nihui Dec 6, 2024
e68283e
opt packa avx512
nihui Dec 6, 2024
9b626fa
opt packb avx512
nihui Dec 9, 2024
2756894
opt packa packb avx
nihui Dec 9, 2024
0313c56
apply code-format changes
nihui Dec 9, 2024
8ba8efe
opt packa packb avx
nihui Dec 9, 2024
ca25494
f
nihui Dec 10, 2024
bda5f00
apply code-format changes
nihui Dec 9, 2024
5d53b4d
apply code-format changes
nihui Dec 10, 2024
652e0ee
w
nihui Dec 10, 2024
d9f6518
w
nihui Dec 10, 2024
cac1e24
w
nihui Dec 10, 2024
c6f8b83
w
nihui Dec 10, 2024
88e9e91
w
nihui Dec 11, 2024
87f94c0
w
nihui Dec 11, 2024
3710641
dispatch packb avx2
nihui Dec 11, 2024
b38c5bf
cc
nihui Dec 11, 2024
383cb5f
w--
nihui Dec 11, 2024
a3bf66f
a
nihui Dec 11, 2024
6347dd1
aa
nihui Dec 11, 2024
82f1a2b
skip more halfway cases
nihui Dec 12, 2024
494f042
a
nihui Dec 12, 2024
4feceac
ua fix
nihui Dec 12, 2024
0df3241
f
nihui Dec 12, 2024
fcdae8f
f
nihui Dec 12, 2024
e52daad
sde on ubuntu24
nihui Dec 12, 2024
9d22cd6
fix avxvnni dispatch
nihui Dec 12, 2024
65d185f
gcov14
nihui Dec 12, 2024
7a7d918
Merge branch 'master' into gemm-quantize-x86
nihui Dec 16, 2024
165f2d4
opt avxvnniint8
nihui Dec 16, 2024
2c08968
apply code-format changes
nihui Dec 16, 2024
9d759d9
avxvnniint8 without wshift
nihui Dec 16, 2024
0d0abf2
fix
nihui Dec 16, 2024
cdba870
fix dispatch
nihui Dec 16, 2024
fa88315
ooops
nihui Dec 16, 2024
301f902
ooops
nihui Dec 16, 2024
b69bb0f
fix
nihui Dec 16, 2024
f3f1fb5
opt unpack aligned cvt
nihui Dec 16, 2024
33781f0
opt avx512 scatter
nihui Dec 16, 2024
786247d
opt++
nihui Dec 16, 2024
8126a5e
f
nihui Dec 16, 2024
2df658c
cc
nihui Dec 16, 2024
0c76a3d
cc
nihui Dec 17, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
57 changes: 0 additions & 57 deletions .github/workflows/linux-x64-cpu-gcc-sde.yml

This file was deleted.

85 changes: 85 additions & 0 deletions .github/workflows/linux-x64-sde.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
name: linux-x64-sde
on:
push:
branches: [master]
paths:
- '.github/workflows/linux-x64-sde.yml'
- 'CMakeLists.txt'
- 'cmake/**'
- 'src/*'
- 'src/layer/*'
- 'src/layer/x86/**'
- 'tests/**'
- 'tools/**'
- '!tools/pnnx/**'
- 'examples/**'
pull_request:
branches: [master]
paths:
- '.github/workflows/linux-x64-sde.yml'
- 'CMakeLists.txt'
- 'cmake/**'
- 'src/*'
- 'src/layer/*'
- 'src/layer/x86/**'
- 'tests/**'
- 'tools/**'
- '!tools/pnnx/**'
- 'examples/**'
concurrency:
group: linux-x64-sde-${{ github.ref }}
cancel-in-progress: true
permissions:
contents: read

jobs:
gcc-sde:
runs-on: ubuntu-24.04
steps:
- uses: actions/checkout@v4
- name: update
run: sudo apt-get update
- name: gcc14
run: sudo apt-get install gcc-14 g++-14
- name: Setup SDE binaries
uses: petarpetrovt/setup-sde@v2.4
- name: build
env:
CC: gcc-14
CXX: g++-14
run: |
mkdir build && cd build
cmake -DNCNN_BUILD_TESTS=ON ..
cmake --build . -j $(nproc)
- name: test-p4p
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-p4p;--" ctest --output-on-failure -j $(nproc)
- name: test-snb
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-snb;--" ctest --output-on-failure -j $(nproc)
- name: test-hsw
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-hsw;--" ctest --output-on-failure -j $(nproc)
- name: test-adl
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-adl;--" ctest --output-on-failure -j $(nproc)
- name: test-arl
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-arl;--" ctest --output-on-failure -j $(nproc)
- name: test-skx
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-skx;--" ctest --output-on-failure -j $(nproc)
- name: test-spr
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-spr;--" ctest --output-on-failure -j $(nproc)
- name: test-gnr
run: |
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-gnr;--" ctest --output-on-failure -j $(nproc)
69 changes: 40 additions & 29 deletions .github/workflows/test-coverage.yml
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ jobs:
LD_LIBRARY_PATH: /data/action/install/lib64
run: |
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_VULKAN=ON -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_XOP=OFF -DNCNN_AVXVNNI=OFF -DNCNN_AVX512=ON -DNCNN_AVX512VNNI=ON -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..
cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_VULKAN=ON -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_XOP=OFF -DNCNN_AVXVNNI=OFF -DNCNN_AVXNECONVERT=OFF -DNCNN_AVX512=OFF -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..
cmake --build . -j 4
- name: test
env:
Expand All @@ -54,61 +54,72 @@ jobs:
lcov --list lcov.info

- name: codecov
id: codecov
continue-on-error: true
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
disable_search: true
plugins: noop
files: build/lcov.info
- name: set the status
if: always()
run: |
if ${{ steps.codecov.outcome=='success' }}; then
echo fine
else
exit 1
fi

linux-gcc-x64-avx512-spr:
runs-on: ubuntu-22.04
linux-gcc-x64-sde:
name: linux-gcc-sde-${{ matrix.cpu }}
runs-on: ubuntu-24.04
strategy:
fail-fast: false
matrix:
include:
- { cpu: hsw, AVX2: ON, AVXVNNI: OFF, AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }
- { cpu: adl, AVX2: ON, AVXVNNI: ON, AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }
- { cpu: arl, AVX2: ON, AVXVNNI: ON, AVXVNNIINT8: ON, AVXNECONVERT: ON, AVX512: OFF, AVX512VNNI: OFF, AVX512BF16: OFF, AVX512FP16: OFF }
- { cpu: spr, AVX2: ON, AVXVNNI: OFF, AVXVNNIINT8: OFF, AVXNECONVERT: OFF, AVX512: ON, AVX512VNNI: ON, AVX512BF16: ON, AVX512FP16: ON }
steps:
- uses: actions/checkout@v4
- name: update
run: sudo apt-get update
- name: gcc12
run: sudo apt-get install gcc-12 g++-12
- name: gcc14
run: sudo apt-get install gcc-14 g++-14
- name: lcov
run: sudo apt-get install lcov
- name: Setup SDE binaries
uses: petarpetrovt/setup-sde@v2.4
- name: build-avx512-spr
- name: build
env:
CC: gcc-12
CXX: g++-12
CC: gcc-14
CXX: g++-14
run: |
mkdir build-avx512-spr && cd build-avx512-spr
cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_AVX2=ON -DNCNN_AVX512=ON -DNCNN_AVX512VNNI=ON -DNCNN_AVX512BF16=ON -DNCNN_AVX512FP16=ON -DNCNN_XOP=OFF -DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..
cmake --build . -j 2
- name: test-avx512-spr
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=debug -DNCNN_COVERAGE=ON -DNCNN_RUNTIME_CPU=OFF \
-DNCNN_AVX=ON \
-DNCNN_F16C=ON \
-DNCNN_XOP=OFF \
-DNCNN_AVX2=${{ matrix.AVX2 }} \
-DNCNN_AVXVNNI=${{ matrix.AVXVNNI }} \
-DNCNN_AVXVNNIINT8=${{ matrix.AVXVNNIINT8 }} \
-DNCNN_AVXNECONVERT=${{ matrix.AVXNECONVERT }} \
-DNCNN_AVX512=${{ matrix.AVX512 }} \
-DNCNN_AVX512VNNI=${{ matrix.AVX512VNNI }} \
-DNCNN_AVX512BF16=${{ matrix.AVX512BF16 }} \
-DNCNN_AVX512FP16=${{ matrix.AVX512FP16 }} \
-DNCNN_OPENMP=OFF -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..
cmake --build . -j $(nproc)
- name: test
run: |
cd build-avx512-spr
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-spr;--" ctest --output-on-failure -j 2
cd build
TESTS_EXECUTABLE_LOADER=$SDE_PATH/sde64 TESTS_EXECUTABLE_LOADER_ARGUMENTS="-${{ matrix.cpu }};--" ctest --output-on-failure -j $(nproc)
- name: lcov-collect
run: |
cd build-avx512-spr
lcov --gcov-tool gcov-12 -d ./src -c -o lcov.info
cd build
lcov --gcov-tool gcov-14 -d ./src -c -o lcov.info
lcov -r lcov.info '/usr/*' -o lcov.info
lcov -r lcov.info '*/build-avx512-spr/*' -o lcov.info
lcov -r lcov.info '*/build/*' -o lcov.info
lcov --list lcov.info
- name: codecov-avx512-spr
- name: codecov
uses: codecov/codecov-action@v5
with:
token: ${{ secrets.CODECOV_TOKEN }}
disable_search: true
plugins: noop
files: build-avx512-spr/lcov.info
files: build/lcov.info

linux-gcc-riscv64-rvv:
strategy:
Expand Down
4 changes: 2 additions & 2 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -508,7 +508,7 @@ else()
check_cxx_compiler_flag("/arch:AVX512" NCNN_COMPILER_SUPPORT_X86_AVX512)

set(CMAKE_REQUIRED_FLAGS "/arch:AVX2")
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpwssd_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpwssd_avx_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)

set(CMAKE_REQUIRED_FLAGS "/arch:AVX2")
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpbssd_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)
Expand Down Expand Up @@ -545,7 +545,7 @@ else()
check_cxx_compiler_flag("/arch:AVX512 -mfma -mf16c -mavx512cd -mavx512bw -mavx512dq -mavx512vl" NCNN_COMPILER_SUPPORT_X86_AVX512)

set(CMAKE_REQUIRED_FLAGS "/arch:AVX2 -mfma -mf16c -mavxvnni")
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpwssd_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpwssd_avx_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI)

set(CMAKE_REQUIRED_FLAGS "/arch:AVX2 -mfma -mf16c -mavxvnni -mavxvnniint8")
check_cxx_source_compiles("#include <immintrin.h>\nint main() { __m256i _s, _a, _b; _s = _mm256_dpbssd_epi32(_s, _a, _b); return 0; }" NCNN_COMPILER_SUPPORT_X86_AVX_VNNI_INT8)
Expand Down
11 changes: 1 addition & 10 deletions src/layer/gemm.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -220,9 +220,7 @@ static void gemm_transB_int8(const Mat& A_int8, const Mat& BT_int8, const Mat& A
const int N = BT_int8.h;
const int K = A_int8.w; // assert A_int8.w == BT_int8.w

// NCNN_LOGE("naive ds %f %f", A_int8_scales[0], BT_int8_scale);

// #pragma omp parallel for num_threads(opt.num_threads)
#pragma omp parallel for num_threads(opt.num_threads)
for (int i = 0; i < M; i++)
{
const int out_hstep = top_blob.dims == 3 ? (int)top_blob.cstep : top_blob.w;
Expand All @@ -232,16 +230,13 @@ static void gemm_transB_int8(const Mat& A_int8, const Mat& BT_int8, const Mat& A

const float descale = 1.f / (A_int8_scales[i] * BT_int8_scale);

// NCNN_LOGE("descale %f", descale);

for (int j = 0; j < N; j++)
{
const signed char* ptrBT = BT_int8.row<const signed char>(j);

int sum = 0;
for (int k = 0; k < K; k++)
{
// NCNN_LOGE("ptrA[%d] %d", k, ptrA[k]);
sum += ptrA[k] * ptrBT[k];
}

Expand Down Expand Up @@ -501,8 +496,6 @@ int Gemm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& t
absmax = std::max(absmax, (float)fabs(ptr[k]));
}

// NCNN_LOGE("A[%d] absmax %f", i, absmax);

float A_int8_scale = absmax == 0.f ? 1.f : 127.f / absmax;
A_int8_scales[i] = A_int8_scale;

Expand Down Expand Up @@ -534,8 +527,6 @@ int Gemm::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& t
}
}

// NCNN_LOGE("B0 absmax %f", absmax);

B_int8_scale = absmax == 0.f ? 1.f : 127.f / absmax;

for (int i = 0; i < B0_int8.h; i++)
Expand Down
Loading
Loading