Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions and have a new bug or useful enhancement to share.
Expected Behavior
Flash attention should deliver good performance on ROCm, comparable to running without it.
Current Behavior
Without flash attention, performance is good, as expected. With flash attention enabled, there is a several-minute period of high CPU usage before token generation starts, followed by extremely slow token generation, also with high CPU usage. I see the same behavior with the Vulkan backend: good performance without flash attention, high CPU usage and slow generation with it.
I have tested llama.cpp directly (the same revision that llama-cpp-python vendors) with -fa, and it works without the high CPU usage. It is therefore likely that llama-cpp-python is setting some option incorrectly.
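For comparison, the direct llama.cpp run that behaves correctly was along these lines (model path and generation parameters are placeholders, not my exact invocation):

./llama-cli -m ./model.gguf -ngl 99 -fa -n 128 -p "Hello"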
Environment and Context
llama-cpp-python 0.2.85 built with hipBLAS, running on a Radeon RX 7900 XT via ROCm 6.1.2 on Ubuntu 22.04.
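For reference, a hipBLAS build of that version can be installed roughly like this (these are the standard flags from the llama-cpp-python README; my exact invocation may have differed):

CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install --no-cache-dir --force-reinstall llama-cpp-python==0.2.85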
Name: llama_cpp_python
Version: 0.2.85
Summary: Python bindings for the llama.cpp library
Home-page: https://github.com/abetlen/llama-cpp-python
Author:
Author-email: Andrei Betlen <abetlen@gmail.com>
License: MIT
Location: /home/chase/.local/lib/python3.10/site-packages
Requires: diskcache, jinja2, numpy, typing-extensions
Required-by: dir-assistant
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X3D 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
CPU max MHz: 5660.0000
CPU min MHz: 400.0000
BogoMIPS: 8800.73
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 384 KiB (12 instances)
L1i: 384 KiB (12 instances)
L2: 12 MiB (12 instances)
L3: 128 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Mitigation; Safe RET
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Not affected
Tsx async abort: Not affected
Operating System:
Linux timberwolf 6.5.0-45-generic #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 15 16:40:02 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.12
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Failure Information (for bugs)
The failure is slow performance rather than a crash, so there is no error output to attach.
Steps to Reproduce
Run any llama-cpp-python completion with flash_attn=True (see the sketch below).
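A minimal repro along these lines (the model path is a placeholder; every GGUF model I tried behaves the same):

from llama_cpp import Llama

# flash_attn=True triggers the long CPU-bound stall before generation and
# the slow token output; dropping it (or passing False) restores normal speed.
llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder
    n_gpu_layers=-1,                   # offload all layers to the GPU
    flash_attn=True,
)
out = llm.create_completion("Hello", max_tokens=32)
print(out["choices"][0]["text"])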
Failure Logs
N/A