Support for 4 bit Quantization #580
Comments
Does anyone know how they represent tensors with 4-bit data? Would this be some packed structure that stores two i4 values in a single u8, e.g. `struct i4x2(u8)`?
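To make the packed layout in the question concrete, here is a minimal sketch of two signed 4-bit integers sharing one byte. The type name `I4x2`, the low-nibble-first ordering, and the clamping of out-of-range inputs are all assumptions chosen for illustration; this is not taken from dfdx or llama.cpp.

```rust
/// Hypothetical packed pair of signed 4-bit integers stored in one byte.
/// The low nibble holds the first value, the high nibble the second.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct I4x2(u8);

impl I4x2 {
    /// Packs two values. Each is expected in -8..=7; anything outside
    /// that range is clamped (an assumption, not a required behavior).
    pub fn new(lo: i8, hi: i8) -> Self {
        // Keep only the low 4 bits of each two's-complement value.
        let lo = lo.clamp(-8, 7) as u8 & 0x0F;
        let hi = hi.clamp(-8, 7) as u8 & 0x0F;
        I4x2(lo | (hi << 4))
    }

    /// Returns the first value, sign-extended to i8.
    pub fn lo(self) -> i8 {
        // Move the nibble to the top of the byte, then shift back
        // arithmetically so the sign bit is replicated.
        ((self.0 << 4) as i8) >> 4
    }

    /// Returns the second value, sign-extended to i8.
    pub fn hi(self) -> i8 {
        (self.0 as i8) >> 4
    }
}
```

A tensor of N such values would then need N/2 bytes of storage, which is also why element-wise strided indexing becomes awkward: a single element is no longer individually addressable.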
Here is a relevant description of the implementation llama.cpp uses to represent floats in 4 bits. It essentially boils down to storing some number of 4-bit integers along with an f32 scaling factor and an optional f32 offset. From what I have read of the source code, it seems possible to do much of the math very efficiently on the CPU using SIMD-packed 8/16-bit integers, without touching floating point at all. While this representation is excellent for specialized inference libraries, I don't think it is practical for a generalist library like dfdx, because dfdx must deal with strided indexing and must support CUDA. Furthermore, I'm not sure how efficient we could make operations on this representation, considering that SIMD support in stable Rust is very limited.
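The scheme described above (a group of 4-bit integers plus an f32 scale and an f32 offset) can be sketched roughly as follows. The block length of 32, the struct and function names, the nibble order, and the min/max-based choice of scale are assumptions made for illustration; llama.cpp's actual formats may differ in layout and rounding details. Inputs are assumed to be finite.

```rust
/// Number of values per block (an assumed size for this sketch).
pub const BLOCK_LEN: usize = 32;

/// Hypothetical quantized block: each value x is approximated as
/// q * scale + offset, where q is an unsigned 4-bit integer (0..=15).
pub struct BlockQ4 {
    scale: f32,
    offset: f32,
    /// Two 4-bit values per byte, low nibble first.
    packed: [u8; BLOCK_LEN / 2],
}

pub fn quantize(values: &[f32; BLOCK_LEN]) -> BlockQ4 {
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // 16 levels span [min, max]; fall back to 1.0 for a constant block
    // so we never divide by zero.
    let scale = if max > min { (max - min) / 15.0 } else { 1.0 };

    let mut packed = [0u8; BLOCK_LEN / 2];
    for (i, pair) in values.chunks_exact(2).enumerate() {
        // Round to the nearest level; `.min(15)` guards against rounding
        // slightly past the top level.
        let q0 = (((pair[0] - min) / scale).round() as u8).min(15);
        let q1 = (((pair[1] - min) / scale).round() as u8).min(15);
        packed[i] = q0 | (q1 << 4);
    }
    BlockQ4 { scale, offset: min, packed }
}

pub fn dequantize(block: &BlockQ4) -> [f32; BLOCK_LEN] {
    let mut out = [0.0f32; BLOCK_LEN];
    for (i, &byte) in block.packed.iter().enumerate() {
        out[2 * i] = (byte & 0x0F) as f32 * block.scale + block.offset;
        out[2 * i + 1] = (byte >> 4) as f32 * block.scale + block.offset;
    }
    out
}
```

With this layout a block of 32 values takes 16 bytes of packed data plus 8 bytes for the two f32 fields, and the round-trip error per value is bounded by about half of `scale`.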
Yep, I was also looking into this. It would be very nice to have, but I am not sure we can even remotely approach the efficiency of what the C++ people are doing, considering how general-purpose dfdx is. I don't think the SIMD concerns are that big of a deal, since dfdx already relies heavily on nightly for some features, and the SIMD support there is OK from what I have seen so far. But I take your point on how specialized this is, and I am not sure it is worth the effort to offer this representation as an option for dfdx tensors.
Language model progress has been rapid recently, and with the LLaMA weights released, a lot of progress is being made on the C++ side:
https://github.com/ggerganov/llama.cpp
I see that fp16 is on the roadmap soon.
But it might also be a good idea to consider support for 4-bit quantization and related techniques. Is that something that will be considered?