Introduce bfloat16 support #6412

jart · 2024-03-31T14:45:56Z

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───┐
    0b0000000000000000 brain16

This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left.

      ┌sign
      │
      │   ┌exponent
      │   │
      │   │      ┌mantissa
      │   │      │
      │┌──┴───┐┌─┴───────────────────┐
    0b00000000000000000000000000000000 IEEE binary32

The issue is that converting weights from bf16 to fp16 will cause 3 bits of knowledge to be lost. There is currently no way to evaluate models like Mistral at full fidelity, without f32, using llama.cpp.

      ┌sign
      │
      │  ┌exponent
      │  │
      │  │    ┌mantissa
      │  │    │
      │┌─┴─┐┌─┴──────┐
    0b0000000000000000 IEEE binary16

This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512F, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16

github-actions · 2024-03-31T15:22:45Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3: 523 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=8964.0ms p(90)=25761.73ms fails=0, finish reason: stop=523 truncated=0
Prompt processing (pp): avg=237.3tk/s p(90)=697.6tk/s total=203.61tk/s
Token generation (tg): avg=101.36tk/s p(90)=283.09tk/s total=132.93tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=bf16 commit=44d5c7070f3b33714c3d92b6e3c757e00877b4e1

Time series

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 523 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1712200211 --> 1712200833
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 93.17, 93.17, 93.17, 93.17, 93.17, 389.84, 389.84, 389.84, 389.84, 389.84, 417.01, 417.01, 417.01, 417.01, 417.01, 458.22, 458.22, 458.22, 458.22, 458.22, 479.5, 479.5, 479.5, 479.5, 479.5, 486.64, 486.64, 486.64, 486.64, 486.64, 514.45, 514.45, 514.45, 514.45, 514.45, 515.56, 515.56, 515.56, 515.56, 515.56, 528.8, 528.8, 528.8, 528.8, 528.8, 535.36, 535.36, 535.36, 535.36, 535.36, 555.53, 555.53, 555.53, 555.53, 555.53, 593.43, 593.43, 593.43, 593.43, 593.43, 607.51, 607.51, 607.51, 607.51, 607.51, 600.68, 600.68, 600.68, 600.68, 600.68, 607.12, 607.12, 607.12, 607.12, 607.12, 610.54, 610.54, 610.54, 610.54, 610.54, 611.02, 611.02, 611.02, 611.02, 611.02, 617.75, 617.75, 617.75, 617.75, 617.75, 628.59, 628.59, 628.59, 628.59, 628.59, 629.34, 629.34, 629.34, 629.34, 629.34, 629.21, 629.21, 629.21, 629.21, 629.21, 634.61, 634.61, 634.61, 634.61, 634.61, 637.1, 637.1, 637.1, 637.1, 637.1, 639.9, 639.9, 639.9, 639.9, 639.9, 638.96, 638.96, 638.96, 638.96, 638.96, 641.08, 641.08, 641.08, 641.08, 641.08, 641.61, 641.61, 641.61, 641.61, 641.61, 642.78, 642.78, 642.78, 642.78, 642.78, 649.25, 649.25, 649.25, 649.25, 649.25, 651.33, 651.33, 651.33, 651.33, 651.33, 654.43, 654.43, 654.43, 654.43, 654.43, 653.13, 653.13, 653.13, 653.13, 653.13, 652.4, 652.4, 652.4, 652.4, 652.4, 655.7, 655.7, 655.7, 655.7, 655.7, 657.83, 657.83, 657.83, 657.83, 657.83, 663.8, 663.8, 663.8, 663.8, 663.8, 668.95, 668.95, 668.95, 668.95, 668.95, 674.31, 674.31, 674.31, 674.31, 674.31, 675.06, 675.06, 675.06, 675.06, 675.06, 674.28, 674.28, 674.28, 674.28, 674.28, 674.39, 674.39, 674.39, 674.39, 674.39, 673.8, 673.8, 673.8, 673.8, 673.8, 676.58, 676.58, 676.58, 676.58, 676.58, 683.31, 683.31, 683.31, 683.31, 683.31, 673.42, 673.42, 673.42, 673.42, 673.42, 670.59, 670.59, 670.59, 670.59, 670.59, 669.69, 669.69, 669.69, 669.69, 669.69, 669.4, 669.4, 669.4, 669.4, 669.4, 667.55, 667.55, 667.55, 667.55, 667.55, 668.62, 668.62, 668.62, 668.62, 668.62, 671.56, 671.56, 671.56, 671.56, 671.56, 671.08, 671.08, 671.08, 671.08, 671.08, 664.43, 664.43, 664.43, 664.43, 664.43, 667.33, 667.33, 667.33, 667.33, 667.33, 670.62, 670.62, 670.62, 670.62, 670.62, 669.95, 669.95, 669.95, 669.95, 669.95, 673.49, 673.49, 673.49, 673.49, 673.49, 675.21, 675.21, 675.21, 675.21, 675.21, 677.55, 677.55, 677.55, 677.55, 677.55, 678.37, 678.37, 678.37, 678.37, 678.37, 678.12, 678.12]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 523 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1712200211 --> 1712200833
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 30.81, 30.81, 30.81, 30.81, 30.81, 18.56, 18.56, 18.56, 18.56, 18.56, 18.26, 18.26, 18.26, 18.26, 18.26, 18.59, 18.59, 18.59, 18.59, 18.59, 19.19, 19.19, 19.19, 19.19, 19.19, 20.2, 20.2, 20.2, 20.2, 20.2, 20.47, 20.47, 20.47, 20.47, 20.47, 20.55, 20.55, 20.55, 20.55, 20.55, 20.59, 20.59, 20.59, 20.59, 20.59, 20.43, 20.43, 20.43, 20.43, 20.43, 20.41, 20.41, 20.41, 20.41, 20.41, 20.2, 20.2, 20.2, 20.2, 20.2, 19.81, 19.81, 19.81, 19.81, 19.81, 19.41, 19.41, 19.41, 19.41, 19.41, 18.99, 18.99, 18.99, 18.99, 18.99, 18.86, 18.86, 18.86, 18.86, 18.86, 18.79, 18.79, 18.79, 18.79, 18.79, 18.94, 18.94, 18.94, 18.94, 18.94, 18.82, 18.82, 18.82, 18.82, 18.82, 18.68, 18.68, 18.68, 18.68, 18.68, 18.59, 18.59, 18.59, 18.59, 18.59, 18.46, 18.46, 18.46, 18.46, 18.46, 18.41, 18.41, 18.41, 18.41, 18.41, 18.45, 18.45, 18.45, 18.45, 18.45, 18.43, 18.43, 18.43, 18.43, 18.43, 18.44, 18.44, 18.44, 18.44, 18.44, 18.48, 18.48, 18.48, 18.48, 18.48, 18.58, 18.58, 18.58, 18.58, 18.58, 18.56, 18.56, 18.56, 18.56, 18.56, 18.53, 18.53, 18.53, 18.53, 18.53, 18.59, 18.59, 18.59, 18.59, 18.59, 18.71, 18.71, 18.71, 18.71, 18.71, 18.8, 18.8, 18.8, 18.8, 18.8, 18.94, 18.94, 18.94, 18.94, 18.94, 18.99, 18.99, 18.99, 18.99, 18.99, 18.96, 18.96, 18.96, 18.96, 18.96, 18.96, 18.96, 18.96, 18.96, 18.96, 18.9, 18.9, 18.9, 18.9, 18.9, 18.8, 18.8, 18.8, 18.8, 18.8, 18.66, 18.66, 18.66, 18.66, 18.66, 18.68, 18.68, 18.68, 18.68, 18.68, 18.67, 18.67, 18.67, 18.67, 18.67, 18.71, 18.71, 18.71, 18.71, 18.71, 18.75, 18.75, 18.75, 18.75, 18.75, 18.66, 18.66, 18.66, 18.66, 18.66, 18.58, 18.58, 18.58, 18.58, 18.58, 18.45, 18.45, 18.45, 18.45, 18.45, 18.24, 18.24, 18.24, 18.24, 18.24, 18.05, 18.05, 18.05, 18.05, 18.05, 17.9, 17.9, 17.9, 17.9, 17.9, 17.81, 17.81, 17.81, 17.81, 17.81, 17.84, 17.84, 17.84, 17.84, 17.84, 17.9, 17.9, 17.9, 17.9, 17.9, 17.92, 17.92, 17.92, 17.92, 17.92, 17.92, 17.92, 17.92, 17.92, 17.92, 17.91, 17.91, 17.91, 17.91, 17.91, 17.89, 17.89, 17.89, 17.89, 17.89, 17.86, 17.86, 17.86, 17.86, 17.86, 17.91, 17.91, 17.91, 17.91, 17.91, 17.97, 17.97, 17.97, 17.97, 17.97, 18.07, 18.07]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 523 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1712200211 --> 1712200833
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08, 0.08, 0.08, 0.08, 0.08, 0.24, 0.24, 0.24, 0.24, 0.24, 0.08, 0.08, 0.08, 0.08, 0.08, 0.17, 0.17, 0.17, 0.17, 0.17, 0.09, 0.09, 0.09, 0.09, 0.09, 0.11, 0.11, 0.11, 0.11, 0.11, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.24, 0.24, 0.24, 0.24, 0.24, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.26, 0.26, 0.26, 0.26, 0.26, 0.19, 0.19, 0.19, 0.19, 0.19, 0.18, 0.18, 0.18, 0.18, 0.18, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.27, 0.27, 0.27, 0.27, 0.27, 0.29, 0.29, 0.29, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.12, 0.12, 0.12, 0.12, 0.12, 0.16, 0.16, 0.16, 0.16, 0.16, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.15, 0.15, 0.15, 0.15, 0.15, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19, 0.19, 0.19, 0.19, 0.15, 0.15, 0.15, 0.15, 0.15, 0.22, 0.22, 0.22, 0.22, 0.22, 0.24, 0.24, 0.24, 0.24, 0.24, 0.21, 0.21, 0.21, 0.21, 0.21, 0.15, 0.15, 0.15, 0.15, 0.15, 0.22, 0.22, 0.22, 0.22, 0.22, 0.09, 0.09, 0.09, 0.09, 0.09, 0.11, 0.11, 0.11, 0.11, 0.11, 0.31, 0.31, 0.31, 0.31, 0.31, 0.45, 0.45, 0.45, 0.45, 0.45, 0.38, 0.38, 0.38, 0.38, 0.38, 0.37, 0.37, 0.37, 0.37, 0.37, 0.4, 0.4, 0.4, 0.4, 0.4, 0.29, 0.29, 0.29, 0.29, 0.29, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.11, 0.11, 0.11, 0.11, 0.11, 0.09, 0.09, 0.09, 0.09, 0.09, 0.19, 0.19, 0.19, 0.19, 0.19, 0.3, 0.3, 0.3, 0.3, 0.3, 0.23, 0.23, 0.23, 0.23, 0.23, 0.22, 0.22, 0.22, 0.22, 0.22, 0.11, 0.11, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12, 0.12, 0.12, 0.14, 0.14, 0.14, 0.14, 0.14, 0.19, 0.19]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 523 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1712200211 --> 1712200833
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0]

JohannesGaessler · 2024-03-31T15:23:09Z

The issue is that converting weights from bf16 to fp16 will cause 3 bits of knowledge to be lost. There is currently no way to evaluate models like Mistral at full fidelity, without f32, using llama.cpp.

IEEE 754 half precision floats can store values in the range $5.96 \cdot 10^{-8}$ to $65504$. For all values within this range there is no precision loss whatsoever when converting from BF16. And I would be very surprised if even a single model weight were to be outside this range since these would also be leading to vanishing/exploding gradients.

Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16

I think this is not due to any change in the weights but rather due to a difference in rounding error in the accumulator. I expect this improvement to not be consistent across models/text corpuses and I also expect there to be no statistically significant improvement at all for a large enough sample size.

sorasoras · 2024-03-31T15:43:06Z

There are some different between quant from BF16-FP32 to BF16-FP16.
It's not the same model when compare PPL between FP16 and FP32, and it behave differently.
It would be interest to inference BF16 directly.

jart · 2024-03-31T23:10:09Z

@JohannesGaessler Only 13% of bf16 numbers can be represented accurately by a bf16 -> fp16 conversion. https://justine.lol/tmp/bf16-to-fp16.txt Yes, the vast majority of weights cluster within that 13%. By my calculation, only 0.29101% of Mistral 7b's numbers are broken. I want those numbers. I also don't want to accept limits on what's possible based on what's normal. Someone might find those broken intervals useful. But if that doesn't persuade you, consider this. I recently bought a Threadripper and it offers hardware acceleration for bf16 but not fp16. So this change is not just good for accuracy, it can be good for performance too.

JohannesGaessler · 2024-03-31T23:46:05Z

By my calculation, only 0.29101% of Mistral 7b's numbers are broken.

Broken in what sense? Numbers being flushed to zero is not an issue because the difference between 0 and almost 0 is negligible for matrix multiplication.

I recently bought a Threadripper and it offers hardware acceleration for bf16 but not fp16. So this change is not just good for accuracy, it can be good for performance too.

The performance point is valid.

In terms of numerical precision, this is the bottom line for me: I very much expect the difference between IEEE 754 half precision and bfloat to be completely negligible. I'm not telling you this out of malice but because I want contributors to spend their time in a way that is useful. If it turns out I'm wrong I will happily accept it.

jart · 2024-04-01T00:15:06Z

You might find the differences negligible, but it's important to me. I want llamafile to be able to deliver, to the best of its ability, whatever number of bits are claimed, even if those extra bits are only good for audiophiles. In my day-to-day work as a developer, I feel more comfortable being able to compare my tradeoffs with the master copies. Furthermore, I need this data type in order to be able to exploit the full capabilities of my hardware.

Am I correct in understanding you won't merge this? That surprises me. This project recently accepted nine novel "IQ" quantization formats, which I know very little about. So I was under the impression there was a certain level of inclusiveness. Why would you not support the data type that companies like Mistral and Google widely use?

JohannesGaessler · 2024-04-01T00:21:24Z

Am I correct in understanding you won't merge this? That surprises me. This project recently accepted nine novel "IQ" quantization formats, which I know very little about. So I was under the impression there was a certain level of inclusiveness. Why would you not support the data type that companies like Mistral and Google widely use?

The ultimate decision of what gets merged is not up to me. And I am not at all opposed to adding bfloat support. I only want to stress that I do not expect the gains from this feature to be in any way proportional to the amount of effort it will take. As such I personally will not invest time into bfloat support by e.g. modifying the CUDA code. If other devs want to do it that is their decision.

jart · 2024-04-01T00:27:52Z

I don't hold any demands on your time. In terms of resources, Mozilla is sponsoring me to help llama.cpp so you've got a lot more resources than before. At the moment, I only need this to work on CPU however I'll likely get personal enjoyment at some point in getting this to work on CUDA and Metal too. Particularly Metal, since I've been looking for a good reason to learn it for some time.

sorasoras · 2024-04-01T05:45:59Z

I don't hold any demands on your time. In terms of resources, Mozilla is sponsoring me to help llama.cpp so you've got a lot more resources than before. At the moment, I only need this to work on CPU however I'll likely get personal enjoyment at some point in getting this to work on CUDA and Metal too. Particularly Metal, since I've been looking for a good reason to learn it for some time.

I would imagine older cuda hardware wouldn't support it due to bf16 unsupport on Pascal. What's solution to that？

jart · 2024-04-01T07:08:05Z

Here's the decoding process for bfloat16:

typedef struct {
    uint16_t x;
} ggml_bf16_t;

/**
 * Converts brain16 to float32.
 */
static inline float ggml_bf16_to_fp32(ggml_bf16_t h) {
    union {
        float f;
        uint32_t i;
    } u;
    u.i = (uint32_t)h.x << 16;
    return u.f;
}

So the only thing old CUDA needs to do, is left shift the bf16 number by 16 bits, and then it becomes a float.

Artefact2 · 2024-04-01T08:18:23Z

I think bf16 support is nice to have in GGUF, if only because it makes quantizing a lot of models much less I/O intensive. Consider changing convert.py to make use of it.

JohannesGaessler · 2024-04-01T15:21:49Z

Relevant for discussion: Mozilla-Ocho/llamafile@ef0307e

It seems there seem to be at least some values above the maximum value representable by IEEE 754 half precision floats. @jart do you know in which specific matrices these weights show up? Depending on where they are relative to softmax this could be an issue.

cpumaxx · 2024-04-01T16:43:03Z

Is there anything special needed to see performance gains? I cloned/built/tested this PR branch and am seeing no change in performance on CPU (CUDA support flags disabled at compile time)

sorasoras · 2024-04-01T18:18:13Z

Is there anything special needed to see performance gains? I cloned/built/tested this PR branch and am seeing no change in performance on CPU (CUDA support flags disabled at compile time)

For CPU, I think you need something that support bf16 acceleration like AVX512VNNI?
also, you need conversion script that just copy BF16 weight from py to GGUF to get any benefit.

cpumaxx · 2024-04-01T20:37:02Z

For CPU, I think you need something that support bf16 acceleration like AVX512VNNI?
also, you need conversion script that just copy BF16 weight from py to GGUF to get any benefit.

system_info: n_threads = 55 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Hardware-wise I think I have what's needed.
Is the conversion script already available? I don't see it in any obvious place in this PR

sorasoras · 2024-04-02T12:14:21Z

For CPU, I think you need something that support bf16 acceleration like AVX512VNNI?
also, you need conversion script that just copy BF16 weight from py to GGUF to get any benefit.

system_info: n_threads = 55 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

Hardware-wise I think I have what's needed. Is the conversion script already available? I don't see it in any obvious place in this PR

https://justine.lol/matmul/
I think the full implementation is in llamafile side.
LLM Performance on AMD Ryzen Threadripper PRO 7995WX w/ 96 cores ($10,000)

cpumaxx · 2024-04-02T12:37:39Z

I think the full implementation is in llamafile side.

What should be expected in llama.cpp from this patch specifically? I'm seeing about 6% speed increase on prompt processing and inference and I've pulled and built the master, avx512vnni, sgemm and bf16 branches. Each of them perform almost identically on a Q8 70b.
I'm on EPYC Genoa, so if anything I'd expect better results than that threadripper system.

jart · 2024-04-02T12:40:37Z

@Artefact2 I've updated gguf-py/gguf/constants.py so that BF16 is listed. I have no idea how to make the Python script generate BF16 GGML files. What I've been doing is running convert.py --outtype f32 and then running the ./quantize ggml-model-f32.gguf ggml-model-bf16.gguf bf16 program. Please take a look.

jart · 2024-04-02T12:45:59Z

@cpumaxx This change only adds support for bf16. Once #6414 the next thing I'll do is upstream the llamafile bfloat16 kernels. Here's what one of them looks like:

I'm working on ARM64 bfloat16 kernels tonight.

cpumaxx · 2024-04-02T13:29:56Z

the next thing I'll do is upstream the llamafile bfloat16 kernels

Nice. I'll keep an eye out for them. Is there a relevant branch on your llama.cpp fork I can test prior to a PR, or do you still need to merge changes already in llamafile?

jart · 2024-04-02T14:14:31Z

@cpumaxx Could you download https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/blob/main/mistral-7b-instruct-v0.2.BF16.gguf and then build the code in the branch I just created https://github.com/jart/llama.cpp/tree/unified which unifies #6412 and #6414? Thanks!

jart · 2024-04-02T14:19:45Z

Here's an example of what you should expect to see with that branch.

wget https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.BF16.gguf
wget https://justine.lol/tmp/getty.txt
make -j32 main && ./main -m /disk/mistral/mistral-7b-instruct-v0.2.BF16.gguf -f ~/getty.txt -n 22 --temp 0
[...]
It is for us, the living, rather to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced.
llama_print_timings:        load time =     773.90 ms
llama_print_timings:      sample time =       0.46 ms /    22 runs   (    0.02 ms per token, 48034.93 tokens per second)
llama_print_timings: prompt eval time =     407.51 ms /   215 tokens (    1.90 ms per token,   527.59 tokens per second)
llama_print_timings:        eval time =    1230.05 ms /    21 runs   (   58.57 ms per token,    17.07 tokens per second)
llama_print_timings:       total time =    1643.99 ms /   236 tokens
Log end

EPYC is for servers so I've heard they generally run at much lower clock rates than Threadripper Pro. So if you get a lower number than 530 tok/sec then try comparing it to llama.cpp at HEAD using the Mistral 7b f16 weights.

cpumaxx · 2024-04-02T16:19:21Z

Here's an example of what you should expect to see with that branch.
llama_print_timings:        load time =     773.90 ms
llama_print_timings:      sample time =       0.46 ms /    22 runs   (    0.02 ms per token, 48034.93 tokens per second)
llama_print_timings: prompt eval time =     407.51 ms /   215 tokens (    1.90 ms per token,   527.59 tokens per second)
llama_print_timings:        eval time =    1230.05 ms /    21 runs   (   58.57 ms per token,    17.07 tokens per second)
llama_print_timings:       total time =    1643.99 ms /   236 tokens
Log end
EPYC is for servers so I've heard they generally run at much lower clock rates than Threadripper Pro. So if you get a lower number than 530 tok/sec then try comparing it to llama.cpp at HEAD using the Mistral 7b f16 weights.

My system is a dual 64 core 9334 running with a 3.9ghz boost clock
I've got NPS set at 4 (so 8 numa nodes) for development reasons, which may be effecting results.
I tested your unified branch vs ggerganov master, and I'm seeing a severe speed regression:

/usr/src/llama.cpp.jart# ./main -m /media/models/bf16/mistral-7b-instruct-v0.2.BF16.gguf -n 22 --temp 0

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |

Question: Let i = 11 + -11. Let g = 1.1 + -
llama_print_timings: sample time = 0.53 ms / 22 runs ( 0.02 ms per token, 41666.67 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 4618.55 ms / 22 runs ( 209.93 ms per token, 4.76 tokens per second)
llama_print_timings: total time = 4624.48 ms / 23 tokens

vs

/usr/src/llama.cpp.master.clean# ./main -m /media/models/bf16/ggml-model-f16.gguf -n 22 --temp 0

llama_print_timings: sample time = 0.61 ms / 22 runs ( 0.03 ms per token, 36184.21 tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 1363.28 ms / 22 runs ( 61.97 ms per token, 16.14 tokens per second)
llama_print_timings: total time = 1369.01 ms / 23 tokens

This was with identical build flags and after dropping all caches for a level playing field.

Anything else I should be trying in order to see the speedup?

jart · 2024-04-02T16:27:20Z

Could you pass the flag -f getty.txt please after you've downloaded that file from the link above? Then re-post your results.

unicomp21 · 2024-05-03T11:35:39Z

#6412 (comment)

browser? webgpu? webassembly? mesh networking w/ rtcdatachannel?

Srihari-mcw · 2024-05-03T13:30:14Z

Hi @jart, when the PR was tried building and running in windows, the build gave issues. In PR 1 of your fork, the same was addressed and fixed. Could you please take a look on the same. Thanks

Srihari-mcw · 2024-05-03T13:51:11Z

@jart, Further we had tried to run the prompt speedup code from https://github.com/jart/llama.cpp/tree/unified . With the current code in the fork, the code was going through second input(operand) as GGML_TYPE_F32 for mulmat functions. We tried to modify the code such that the second input is in GGML_TYPE_BF16 for mulmat kernels and removes the GGML_TYPE_F32 case, which enables the input of second operand (Btype) to get quantized to BF16 format and hence uses BF16 intrinsics in turn for dot product operation. Significant speedup was observed while comparing the code with original version in the fork where the second operand of mulmat operation is in FP32 format.

model	size	params	backend	threads	test	t/s	speedup	Commit id
llama 7B BF16 (without prompt speedup)	12.55 GiB	6.74 B	CPU	6	pp 512	39.111 ± 0.03		4e57aa6
llama 7B BF16 ( prompt speedup - both inputs BF16)	12.55 GiB	6.74 B	CPU	6	pp 512	103.343 ± 0.14	164.23%	b25ba28
llama 7B BF16 ( prompt speedup - BF16 x FP32 - second input FP32 format)	12.55 GiB	6.74 B	CPU	6	pp 512	45.126 ± 0.04	15.379%	4e57aa6

The code was tested in AMD Raphael 7600X machine which has AVX512_BF16 support in Linux platform. The original unquantized model is taken from https://huggingface.co/TheBloke/wizardLM-7B-HF . Please find the updated code in PR 2 of your fork of llama.cpp - jart#2. Changes in jart#1 (PR 1) was included while testing the same

Could you please share your thoughts here? Is prompt speedup for BF16 models planned to be included in future commits of prompt speedup changes/ BF16 model PR? Thanks

jart · 2024-05-03T15:59:03Z

@Srihari-mcw this change doesn't modify sgemm.cpp because then it would overlap with my other change:

llamafile : improve moe prompt eval speed on cpu #6840

So BF16 optimizations are blocked on review. As for your pull request, the canonical location of the code you're modifying is here:

I've done a lot of work in the past month identifying other performance opportunities.

Many models on Hugging Face (e.g. Mistral, TinyLLaMA) use bfloat16 as their canonical floating point format. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───┐ 0b0000000000000000 brain16 This encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 means simply shifting 16 bits to the left. ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌──┴───┐┌─┴───────────────────┐ 0b00000000000000000000000000000000 IEEE binary32 The issue is that converting bf16 to fp16 can result in information loss. Only 13% of bf16 numbers can be precisely represented in fp16 which in practice ends up being 99.71% of Mistral 7b v0.2's weights however there is currently no way other than fp32 to get the others ┌sign │ │ ┌exponent │ │ │ │ ┌mantissa │ │ │ │┌─┴─┐┌─┴──────┐ 0b0000000000000000 IEEE binary16 This change fixes that, by adding a bf16 data type to GGML. Support for CPU inference has been implemented along with optimizations for the AVX2, AVX512, and AVX512BF16 ISAs. Perplexity on Mistral 7b 0.2 improves somewhere around -0.0024 to -0.0046 compared to using fp16

ddh0 · 2024-05-08T06:37:52Z

So happy to see this land! Will convert.py and convert-hf-to-gguf.py need to be updated?

arch-btw · 2024-05-08T08:42:59Z

So happy to see this land! Will convert.py and convert-hf-to-gguf.py need to be updated?

I'm wondering the same thing

jart · 2024-05-08T11:39:58Z

The Python scripts do need to be updated. I was only able to add the IDs. I wasn't able to successfully figure out how to get the raw bfloat16 data from Torch because Numpy doesn't support it. Someone who knows more than me will need to figure that out.

So happy to see this merged @ggerganov! Thank you!

jart · 2024-05-08T11:41:22Z

By the way, the workaround I'm currently using is to:

Use Python to create an --outtype f32 gguf file.
Run ./quantize ggml-model-f32.gguf ggml-model-bf16.gguf bf16 to create bfloat16 weights.

teleprint-me · 2024-05-08T15:59:06Z

The Python scripts do need to be updated. I was only able to add the IDs. I wasn't able to successfully figure out how to get the raw bfloat16 data from Torch because Numpy doesn't support it. Someone who knows more than me will need to figure that out.

We'll need to use a custom wrapper to implement. I tried doing this last year with pure python and it was a no go. Probably ctypes interface to add FFI support? Not sure how the community feels about this. I have experience with C and my expertise is in Python, but my C++ is limited and I've been picking it up as I go.

compilade · 2024-05-08T16:21:28Z

Note

my implementation of bfloat16 conversion was too naïve: it didn't round to nearest even and did not handle subnormals. So I've decided to exclude my flawed bfloat16 conversion from #7075

~~I've added support for bfloat16 conversion as part of #7075~~, doing the conversion with Numpy is possible even if it doesn't support the bfloat16 type. More explanations in #7075 (comment), and the relevant changes are in 1eccde6.

Important

EDIT: I've made a proper implementation in #7158 which does properly handle subnormals, and rounding.

Bring `GGMLQuantizationType` up to date; adds `I8`, `I16`, `I32`, `I64`, `F64`, `IQ1_M` and `BF16`. Added in: * ggerganov/llama.cpp#6045 * ggerganov/llama.cpp#6062 * ggerganov/llama.cpp#6302 * ggerganov/llama.cpp#6412

jart force-pushed the bf16 branch 3 times, most recently from 436956a to e52d5e5 Compare March 31, 2024 15:09

This comment was marked as off-topic.

Sign in to view

phymbert mentioned this pull request Apr 1, 2024

server: bench: continuous performance testing #6233

Closed

16 tasks

jart force-pushed the bf16 branch from e52d5e5 to 07cebab Compare April 2, 2024 12:38

jart force-pushed the bf16 branch from ed0f47b to 82aebcf Compare May 1, 2024 16:59

compilade mentioned this pull request May 5, 2024

convert-hf : save memory with lazy evaluation #7075

Merged

7 tasks

jart added 8 commits May 7, 2024 22:26

Remove GGML code that's not needed

823d45a

Minimize the GGML API surface area for BF16

180bfcd

Remove bf16 luts

d6892c4

Make the GGML header look nicer

ce0442d

Fix documentation

bc278c8

Apply ggerganov's fixes for test-backend-ops

2741a99

Add BF16 code for new ggml_validate_row_data() function

632624e

jart force-pushed the bf16 branch from 82aebcf to 632624e Compare May 8, 2024 05:26

ggerganov merged commit 3855416 into ggerganov:master May 8, 2024
64 checks passed

compilade mentioned this pull request May 9, 2024

convert-hf : support bfloat16 conversion #7158

Merged

4 tasks

lemmi mentioned this pull request May 9, 2024

BF16 prompt processing has half the performance compared to F16 and F32 von AMD Ryzen Embedded V3000 (Zen 3) #7182

Closed

mofosyne mentioned this pull request May 15, 2024

gguf.md: Add GGUF Naming Convention Section ggerganov/ggml#822

Merged

mofosyne added Tensor Encoding Scheme https://github.com/ggerganov/llama.cpp/wiki/Tensor-Encoding-Schemes Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 25, 2024

CISC mentioned this pull request Jun 3, 2024

Update GGUF quantization types huggingface/huggingface.js#729

Merged

rick-github mentioned this pull request Jan 5, 2025

Speed ten times slower than llamafile ollama/ollama#8305

Open

Introduce bfloat16 support #6412

Introduce bfloat16 support #6412

Conversation

jart commented Mar 31, 2024

github-actions bot commented Mar 31, 2024 • edited Loading

JohannesGaessler commented Mar 31, 2024

sorasoras commented Mar 31, 2024

jart commented Mar 31, 2024

JohannesGaessler commented Mar 31, 2024

jart commented Apr 1, 2024

JohannesGaessler commented Apr 1, 2024

This comment was marked as off-topic.

jart commented Apr 1, 2024

sorasoras commented Apr 1, 2024

jart commented Apr 1, 2024

Artefact2 commented Apr 1, 2024

JohannesGaessler commented Apr 1, 2024

cpumaxx commented Apr 1, 2024

sorasoras commented Apr 1, 2024

cpumaxx commented Apr 1, 2024

sorasoras commented Apr 2, 2024

cpumaxx commented Apr 2, 2024 • edited Loading

jart commented Apr 2, 2024

jart commented Apr 2, 2024

cpumaxx commented Apr 2, 2024

jart commented Apr 2, 2024

jart commented Apr 2, 2024

cpumaxx commented Apr 2, 2024

jart commented Apr 2, 2024

unicomp21 commented May 3, 2024

Srihari-mcw commented May 3, 2024

Srihari-mcw commented May 3, 2024 • edited Loading

jart commented May 3, 2024

ddh0 commented May 8, 2024 • edited Loading

arch-btw commented May 8, 2024 • edited Loading

jart commented May 8, 2024

jart commented May 8, 2024

teleprint-me commented May 8, 2024 • edited Loading

compilade commented May 8, 2024 • edited Loading

github-actions bot commented Mar 31, 2024 •

edited

Loading

cpumaxx commented Apr 2, 2024 •

edited

Loading

Srihari-mcw commented May 3, 2024 •

edited

Loading

ddh0 commented May 8, 2024 •

edited

Loading

arch-btw commented May 8, 2024 •

edited

Loading

teleprint-me commented May 8, 2024 •

edited

Loading

compilade commented May 8, 2024 •

edited

Loading