Problem description & steps to reproduce
When llama.cpp is compiled with the fp16fml CPU flag (-DGGML_NATIVE=0 -DGGML_CPU_ARM_ARCH=armv8.2-a+fp16fml), the "DeepScaleR" model can start outputting random tokens, for example:
Certainly! Here's an organized and elegant presentation of various delicious ways to enjoy cake, categorized for clarity:
1;@0#H1D*()"H2-+G<#0--/8<=.5(F&:=$
I've found that ggml_vec_dot_f16 can return values as high as 321000 for this model. That's okay without fp16fml, since single-precision accumulators are used.
But with fp16fml, the intermediate values are summed in fp16, whose largest finite value is 65504, so the results are about five times larger than what can be represented. Some of the time it still works, because accumulation is done into the sum vector: while the returned sumf would overflow a half-precision scalar, the total is spread across multiple elements of the vector. But other times a single one of the 32 accumulator elements overflows, and ggml_vec_dot_f16 returns inf.
(The code in sgemm.cpp exhibits similar issues, but for this testing I've completely disabled the FP16 case there.)
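To make the numbers concrete, here is a small standalone C program (not llama.cpp code; the inputs are made up, but the dot product lands in the same few-hundred-thousand range) showing how a half-precision accumulator overflows to inf past 65504 while a single-precision one does not:

```c
// Standalone demonstration (assumptions: aarch64 GCC/Clang with fp16 support,
// e.g. -march=armv8.2-a+fp16). Not llama.cpp code -- it only shows the overflow.
#include <stdio.h>

int main(void) {
    // 4096 products of 10 x 10 = 409600, the same order of magnitude as the
    // ~321000 values seen from ggml_vec_dot_f16, and far above the fp16
    // maximum of 65504.
    const int n = 4096;
    __fp16 acc_f16 = 0;   // half-precision accumulator, as with fp16fml
    float  acc_f32 = 0;   // single-precision accumulator, as without it
    for (int i = 0; i < n; i++) {
        __fp16 x = 10, y = 10;
        acc_f16 += x * y;                // overflows to +inf once the sum passes 65504
        acc_f32 += (float)x * (float)y;  // fine: 409600 fits easily in fp32
    }
    printf("fp16 accumulator: %f\n", (double)acc_f16);  // prints inf
    printf("fp32 accumulator: %f\n", (double)acc_f32);  // prints 409600.000000
    return 0;
}
```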
What are possible solutions?
I guess what would make the most sense is to use a scale factor somewhere, for example dividing by eight or sixteen for this model.
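As a rough sketch of that idea (purely illustrative and scalar, not a patch against ggml_vec_dot_f16's actual vectorized code; the helper name is made up), one of the operands could be scaled down before the half-precision accumulation and the factor applied back once, on the final scalar:

```c
// Illustrative scalar sketch of the scale-factor idea. The real fp16 kernel is
// vectorized; the point here is only where the two scale operations would go.
#include <stddef.h>

static float dot_f16_scaled(size_t n, const __fp16 *x, const __fp16 *y) {
    const float scale = 16.0f;      // per-model choice; dividing by 8 or 16 as suggested above
    __fp16 sum = 0;                 // half-precision accumulation, as with fp16fml
    for (size_t i = 0; i < n; i++) {
        __fp16 ys = (__fp16)((float)y[i] / scale);  // pre-scale one operand
        sum += x[i] * ys;           // stays below 65504 if the scale is large enough
    }
    return (float)sum * scale;      // undo the scaling once, on the result
}
```

Scaling an operand down does cost some precision at the small end of the fp16 range, which is presumably why the factor would have to be chosen per model.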
There are a few other possibilities I can think of:
Set the "Alternate half-precision control" bit (which will effectively saturate instead of returning infinite values)
Armv8.4 FEAT_FHM which accumulates to single precision (but might be twice as slow)
Armv8.6 BFloat16
Use isfinite and handle overflow when it happens somehow
Increase GGML_F16_STEP and make GGML_F16x8_REDUCE do everything in single-precision
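To illustrate that last point, here is one way it could look (an assumption about the shape of such a change, not the existing GGML_F16x8_REDUCE macro): keep the per-lane FMAs in fp16 but widen to single precision for the final horizontal add. Combined with more accumulators via a larger GGML_F16_STEP, each lane's partial sum could stay within fp16 range even when the full dot product does not.

```c
// Sketch of a single-precision reduction of an fp16 accumulator register
// (hypothetical helper, not the real GGML_F16x8_REDUCE). Requires an aarch64
// compiler targeting armv8.2-a+fp16 or later.
#include <arm_neon.h>

static inline float reduce_f16x8_in_f32(float16x8_t acc) {
    // Widen the low and high halves of the fp16 accumulator to fp32 ...
    float32x4_t lo = vcvt_f32_f16(vget_low_f16(acc));
    float32x4_t hi = vcvt_f32_f16(vget_high_f16(acc));
    // ... and do the horizontal add in single precision, so the reduced sum
    // may exceed 65504 even though no individual fp16 lane can.
    return vaddvq_f32(vaddq_f32(lo, hi));
}
```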
Name and Version
version: 4733 (faaa9b93)
built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for aarch64-linux-gnu
Operating systems
Linux
GGML backends
CPU
Hardware
Tested on Snapdragon X Elite and Cortex-A76.
Models
https://huggingface.co/bartowski/agentica-org_DeepScaleR-1.5B-Preview-GGUF/blob/main/agentica-org_DeepScaleR-1.5B-Preview-IQ4_NL.gguf
Other formats are broken as well, for example Q4_0 and F16.
First Bad Commit
No response
Relevant log output