G2

G2 is a TPC (Tensor Processor Core) optimized attention library for the transformer decoder on Intel Gaudi2.

Usage

Build from source

  • Clone the source code inside an Intel Gaudi2 docker container:
git clone https://github.com/ZhaiFeiyue/g2.git
  • Build and install:
pip install .

Install from GitHub

pip install git+https://github.com/ZhaiFeiyue/g2.git#egg=g2attn

Run tests

cd tests
./run.sh

Motivation

The QK BMM of the transformer decoder can be illustrated by the following picture (figure: qk_bmm).

The Q shape is [B, M, 1, H], the K shape is [B, M, T, H], and the output shape is [B, M, 1, T] (a shape sketch follows the list below), where:

  • B is the batch size.
  • M is the number of heads.
  • H is the head dimension.
  • T is the number of cached tokens for K.
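For reference, the two decode-time BMMs look like this in plain PyTorch. This is only an illustrative sketch of the shapes; the tensor names, example sizes, and the softmax in between are assumptions for illustration, not code from this repo.

```python
import torch

# Illustrative sizes: batch, heads, cached tokens, head dim (assumptions, not repo defaults).
B, M, T, H = 64, 32, 1024, 128

q = torch.randn(B, M, 1, H, dtype=torch.bfloat16)  # single new query token
k = torch.randn(B, M, T, H, dtype=torch.bfloat16)  # cached keys
v = torch.randn(B, M, T, H, dtype=torch.bfloat16)  # cached values

# QK BMM: [B, M, 1, H] x [B, M, H, T] -> [B, M, 1, T]
score = torch.matmul(q, k.transpose(-1, -2))

# ScoreV BMM: [B, M, 1, T] x [B, M, T, H] -> [B, M, 1, H]
out = torch.matmul(score.softmax(dim=-1), v)

print(score.shape)  # torch.Size([64, 32, 1, 1024])
print(out.shape)    # torch.Size([64, 32, 1, 128])
```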

Intel Gaudi2 is a systolic-array-based AI accelerator with a peak of ~410 TFLOPS for BF16. But when computing the QK BMM (and likewise the ScoreV BMM), the effective throughput is only about 3.2 TFLOPS, since only one row of Q is valid, as shown below.

(figure: mme1)

This project aims to tackle the above problem by leveraging the compute throughput of the TPCs.

Benchmark

Llama2-7B QK BMM

| BS | Head | KV length | Head Dim | Dtype | TPC latency (us) | MME latency (us) |
|----|------|-----------|----------|-------|------------------|------------------|
| 64 | 32   | 128       | 128      | BF16  | 63               | 41               |
| 64 | 32   | 256       | 128      | BF16  | 62               | 77               |
| 64 | 32   | 512       | 128      | BF16  | 126              | 69               |
| 64 | 32   | 1024      | 128      | BF16  | 244              | 192              |
| 64 | 32   | 2048      | 128      | BF16  | 498              | 467              |
| 64 | 32   | 4096      | 128      | BF16  | 1007             | 935              |
| 64 | 32   | 8192      | 128      | BF16  | 2011             | 1894             |

Llama2-7B ScoreV BMM

| BS | Head | KV length | Head Dim | Dtype | TPC latency (us) | MME latency (us) |
|----|------|-----------|----------|-------|------------------|------------------|
| 64 | 32   | 128       | 128      | BF16  | 63               | 42               |
| 64 | 32   | 256       | 128      | BF16  | 123              | 81               |
| 64 | 32   | 512       | 128      | BF16  | 247              | 160              |
| 64 | 32   | 1024      | 128      | BF16  | 482              | 311              |
| 64 | 32   | 2048      | 128      | BF16  | 981              | 639              |
| 64 | 32   | 4096      | 128      | BF16  | 1965             | 1279             |
| 64 | 32   | 8192      | 128      | BF16  | 3911             | 2539             |
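The tables above report per-call latency. A rough way to time one of these BMM shapes in plain PyTorch is sketched below; this is not the benchmark script behind the reported numbers (that is tests/run.sh), the iteration count and device string are assumptions, and on Gaudi2 the tensors would live on the "hpu" device with proper synchronization around the timed region.

```python
import time
import torch

def time_qk_bmm(B=64, M=32, T=2048, H=128, iters=100, device="cpu"):
    # Build tensors with the QK BMM shapes from the table above.
    q = torch.randn(B, M, 1, H, dtype=torch.bfloat16, device=device)
    k = torch.randn(B, M, T, H, dtype=torch.bfloat16, device=device)

    # Warm up once so one-time setup costs are excluded.
    torch.matmul(q, k.transpose(-1, -2))

    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(q, k.transpose(-1, -2))
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e6  # average latency in microseconds

if __name__ == "__main__":
    print(f"QK BMM latency: {time_qk_bmm():.1f} us")
```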

Known issues

  • Releases newer than 1.12 are not supported.
  • ScoreV BMM performance is poor.
