Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions and have a new bug or useful enhancement to share.
Expected Behavior
Flash attention should deliver good performance on ROCm, comparable to running without it.
Current Behavior
Without flash attention, performance is good, as expected. With flash attention enabled, there is a several-minute period of high CPU usage before token generation starts, followed by extremely slow token generation, also with high CPU usage. I see the same behavior with the Vulkan backend: good performance without flash attention, high CPU usage and slow generation with it.
I have tested llama.cpp directly (the same revision that llama-cpp-python vendors) with -fa, and it works without the high CPU usage. It is therefore likely that llama-cpp-python is setting some option incorrectly.
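For comparison, the direct llama.cpp run that behaves correctly was along these lines (model path and generation parameters are placeholders, not my exact invocation):

./llama-cli -m ./model.gguf -ngl 99 -fa -n 128 -p "Hello"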
Environment and Context
llama-cpp-python 0.2.85 built with hipBLAS, running on a Radeon RX 7900 XT via ROCm 6.1.2 on Ubuntu 22.04.
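For reference, a hipBLAS build of that version can be installed roughly like this (these are the standard flags from the llama-cpp-python README; my exact invocation may have differed):

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python==0.2.85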
Name: llama_cpp_python
Version: 0.2.85
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python
Author:
Author-email: Andrei Betlen <abetlen@gmail.com>
License: MIT
Location: /home/chase/.local/lib/python3.10/site-packages
Requires: diskcache, jinja2, numpy, typing-extensions
Required-by: dir-assistant
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X3D 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
CPU max MHz: 5660.0000
CPU min MHz: 400.0000
BogoMIPS: 8800.73
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 384 KiB (12 instances)
L1i: 384 KiB (12 instances)
L2: 12 MiB (12 instances)
L3: 128 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; Safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
Operating System:
Linux timberwolf 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.12
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Failure Information (for bugs)
The failure is slow performance rather than a crash, so there is no error output to attach.
Steps to Reproduce
Run any llama-cpp-python completion with flash_attn=True (see the sketch below).
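A minimal repro along these lines (the model path is a placeholder; every GGUF model I tried behaves the same):

from llama_cpp import Llama

# flash_attn=True triggers the long CPU-bound stall before generation and
# the slow token output; dropping it (or passing False) restores normal speed.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    n_gpu_layers=-1,                   # offload all layers to the GPU
    flash_attn=True,
)
out = llm.create_completion("Hello", max_tokens=32)
print(out["choices"][0]["text"])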
Failure Logs
N/A