
gfx1010 optimizations #8085

Merged

Conversation

@daniandtheweb
Contributor

@daniandtheweb commented Jun 24, 2024

Reading @IMbackK's PR #8082, I noticed that RDNA1 cards can also benefit from a small performance gain just by adjusting the same values as in that PR.

| model | size | params | backend | ngl | test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | ROCm | 99 | pp512 | 163.86 ± 0.18 | 276.60 ± 0.41 | 1.69 |

This is still far from the pre-#7716 performance (RDNA1 cards suffered a ~50% performance drop with that change), but it's still a good improvement.

Thanks again to @IMbackK for his PR, as I wouldn't have noticed this without it.

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 24, 2024
@daniandtheweb changed the title from "Gfx1010 optimizations" to "gfx1010 optimizations" on Jun 24, 2024
@mofosyne added the Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) label on Jun 24, 2024
@daniandtheweb force-pushed the gfx1010_optimizations branch 3 times, most recently from 1d1754f to 46923c6, on July 3, 2024 02:23
@JohannesGaessler
Collaborator

JohannesGaessler commented Jul 3, 2024

> I've only managed to get this far, as setting the return value of get_mmq_y_device to 32 boosts performance to 800 t/s but completely breaks inference, outputting only gibberish.

get_mmq_y_host and get_mmq_y_device need to return consistent values. If you only reduce get_mmq_y_device then only part of the matrix multiplication will be done, which of course leads to a much faster (but wrong) kernel. As of right now both functions are located next to each other in mmq.cuh. If you want, you can try tuning the values again; otherwise I would just merge this PR as-is.
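A generic CUDA sketch of the failure mode described above (the kernel and tile-size helpers here are hypothetical, not the ggml code): the host sizes the launch grid with its tile height while the kernel strides with its own, so a smaller device value silently skips most of the rows.

#include <cuda_runtime.h>

constexpr int tile_rows_host()              { return 128; } // what the launch code assumes
constexpr __device__ int tile_rows_device() { return 32;  } // mismatched on purpose

__global__ void scale_rows(float * x, const int nrows, const int ncols) {
    const int row0 = blockIdx.y * tile_rows_device();        // kernel uses its own tile height
    for (int i = 0; i < tile_rows_device(); ++i) {
        const int row = row0 + i;
        if (row >= nrows) {
            return;
        }
        for (int col = threadIdx.x; col < ncols; col += blockDim.x) {
            x[row*ncols + col] *= 2.0f;
        }
    }
}

// Host launch (sketch): the grid is sized with tile_rows_host(), so with 128 vs 32
// only a quarter of the rows are ever touched - fast, but wrong.
// dim3 grid(1, (nrows + tile_rows_host() - 1) / tile_rows_host());
// scale_rows<<<grid, 256>>>(d_x, nrows, ncols);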

@daniandtheweb
Contributor Author

Thanks for the information. I'll test it more thoroughly in a couple of hours, but for now, using a value of 64 instead of the master branch's default of 128, I get 275 t/s for prompt processing.

@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

The performance boost is consistently 275 t/s and the output works fine; however, I'm having some trouble adding a check in get_mmq_y_host, since checking whether RDNA1 is defined fails.

static constexpr int get_mmq_y_host(const int cc) {
#if defined(RDNA1)
    return 64;
#else
    return int8_mma_available(cc) || cc >= CC_VOLTA ? 128 : 64;
#endif // defined(RDNA1)
}

static constexpr __device__ int get_mmq_y_device() {
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
#if defined(RDNA1)
    return 64;
#else
    return 128;
#endif // defined RDNA1
#else
#if __CUDA_ARCH__ >= CC_VOLTA
    return 128;
#else
    return 64;
#endif // __CUDA_ARCH__ >= CC_VOLTA
#endif // defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
}

@JohannesGaessler any idea about how to solve this?

@sorasoras

Does this PR affect RDNA3? I could really use some optimizations.

@daniandtheweb
Contributor Author

> Does this PR affect RDNA3? I could really use some optimizations.

Not at all; this PR just tunes some parameters for Navi 10 that are already tuned for the 7000 series.

@JohannesGaessler
Collaborator

I would patch it like this:

static constexpr int get_mmq_y_host(const int cc) {
    return cc >= CC_OFFSET_AMD ? (cc >= CC_RDNA2 ? 128 : 64) : (cc >= CC_VOLTA ? 128 : 64);
}

@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

@JohannesGaessler I've applied the change. I still think this isn't the best way to do it: if different values are needed for different cards, this can get messy. A normal if statement might keep the line from getting too long, but I haven't been able to make it work that way.

#if cc >= CC_OFFSET_AMD
#if cc == CC_RDNA1
    return 64;
#else
    return 128;
#endif // cc == CC_RDNA1
#else
#if int8_mma_available(cc) || cc >= CC_VOLTA // checking int8_mma_available(cc) causes the whole if statement to fail.
    return 128;
#else
    return 64;
#endif // int8_mma_available(cc) || cc >= CC_VOLTA
#endif // cc >= CC_OFFSET_AMD

EDIT: Apparently all the issues I've been having are caused by the check on int8_mma_available not working as intended. Just removing it from the if check makes everything work again.
Is that check needed here, or is cc >= CC_VOLTA the only necessary check?

@JohannesGaessler
Collaborator

> I still think this isn't the best way to do it: if different values are needed for different cards, this can get messy. A normal if statement might keep the line from getting too long, but I haven't been able to make it work that way.

Yes, constexpr functions do not allow conditional statements under the C++11 standard.
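As a hypothetical illustration only (it assumes the build could use C++14 relaxed constexpr, which this PR does not change, and reuses the CC_* constants from this thread; the function name is made up), the same selection written with plain if statements would look like this:

static constexpr int get_mmq_y_host_if(const int cc) { // hypothetical name, not in the codebase
    if (cc >= CC_OFFSET_AMD) {
        return cc == CC_RDNA1 ? 64 : 128;
    }
    return cc >= CC_VOLTA ? 128 : 64;
}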

> Apparently all the issues I've been having are caused by the check on int8_mma_available not working as intended.

It is working as intended; you are just not using it as intended. The lowercase constexpr functions are intended to be used in host code to determine feature availability at runtime. The uppercase preprocessor macros are intended to be used in device code to determine feature availability at compile time. But if you have a separate check for AMD, you don't need INT8_MMA_AVAILABLE anyway. On NVIDIA, INT8_MMA_AVAILABLE is effectively __CUDA_ARCH__ >= CC_TURING, which already implies __CUDA_ARCH__ >= CC_VOLTA, so it adds nothing on top of the cc >= CC_VOLTA check.
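A minimal sketch of that split, assuming int8_mma_available, INT8_MMA_AVAILABLE and the CC_* constants from the project's CUDA headers; the pick_tile_* names are made up for illustration:

// Host side: runtime decision based on the compute capability queried for the device.
static int pick_tile_host(const int cc) {
    return int8_mma_available(cc) ? 128 : 64;
}

// Device side: compile-time decision, resolved per architecture by the preprocessor.
static __device__ int pick_tile_device() {
#ifdef INT8_MMA_AVAILABLE
    return 128;
#else
    return 64;
#endif // INT8_MMA_AVAILABLE
}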

@daniandtheweb
Contributor Author

I see. In that case, would it be better to keep it like this:

return cc >= CC_OFFSET_AMD ? (cc == CC_RDNA1 ? 64 : 128) : (cc >= CC_VOLTA ? 128 : 64);

Or go like this:

#if cc >= CC_OFFSET_AMD
#if cc == CC_RDNA1
    return 64;
#else
    return 128;
#endif // cc == CC_RDNA1
#else
#if cc >= CC_VOLTA
    return 128;
#else
    return 64;
#endif // cc >= CC_VOLTA
#endif // cc >= CC_OFFSET_AMD

?

@JohannesGaessler
Collaborator

You cannot do the second one: cc is a function parameter and is not defined in the preprocessor phase where the macros are resolved. The first one would be correct for your particular case, but the version I posted earlier would be better, because AMD architectures older than RDNA1 are likely to be closer to the hardware you are tuning for than the hardware I used (an RX 6800).

@daniandtheweb
Contributor Author

Okay. I actually wanted to check specifically for RDNA1 because I wasn't sure what effect it could have on PR #8082.

@JohannesGaessler
Collaborator

Sorry, it actually has to be done the way you had it, with an RDNA1 check. On AMD you cannot do a simple check against a number because there is no sensible value for __CUDA_ARCH__.
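Putting the two sides of this exchange together, a plausible sketch of the matched pair (it reuses the declarations from mmq.cuh shown earlier in the thread; the merged commit may differ in detail):

static constexpr int get_mmq_y_host(const int cc) {
    // Host: runtime check against the reported compute capability.
    return cc >= CC_OFFSET_AMD ? (cc == CC_RDNA1 ? 64 : 128) : (cc >= CC_VOLTA ? 128 : 64);
}

static constexpr __device__ int get_mmq_y_device() {
    // Device: compile-time check, since __CUDA_ARCH__ has no sensible value on AMD.
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
#if defined(RDNA1)
    return 64;
#else
    return 128;
#endif // defined(RDNA1)
#else
#if __CUDA_ARCH__ >= CC_VOLTA
    return 128;
#else
    return 64;
#endif // __CUDA_ARCH__ >= CC_VOLTA
#endif // defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
}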

@daniandtheweb
Contributor Author

daniandtheweb commented Jul 3, 2024

I think this may be ready to merge, once all the checks are completed. Thanks for the tips on how to improve it.

@JohannesGaessler merged commit d23287f into ggerganov:master on Jul 3, 2024 (49 checks passed)
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 4, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 5, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 6, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 7, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 11, 2024