Add option to move param to device before quantization #699

Conversation
🔗 Helpful links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/699

✅ No failures as of commit 711d001 with merge base b523f9f. (Note: links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.)
nf4 is not yet using quantize_ (see line 544 in b523f9f); we could route nf4 through quantize_ as well if there is a weight-only quantization use case for it. cc @drisspg
The change makes sense. One thing I'm wondering: would it help further reduce CUDA memory usage if you also move the module back to CPU (its original device) after quantization and release the used CUDA memory? Not sure if that would be slow though.

For tests, I think you'll probably need to construct a model with linear + non-linear modules and compare the max CUDA memory for "Quantize on CUDA" and "Move to CUDA and quantize each param individually", making sure the second one consumes less memory. You can add it in https://github.com/pytorch/ao/blob/main/test/quantization/test_quant_api.py I think.
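A minimal sketch of such a test, using int8 weight-only quantization as an example config and assuming this PR exposes the new option as a `device` argument on `quantize_` (the exact argument name may differ):

```python
import torch
from torch import nn
from torchao.quantization import quantize_, int8_weight_only

class ToyModel(nn.Module):
    # Linear + non-linear modules, as suggested above
    def __init__(self, dim=4096):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim, bias=False)
        self.act = nn.ReLU()
        self.linear2 = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.linear2(self.act(self.linear1(x)))

def peak_cuda_memory(fn):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    return torch.cuda.max_memory_allocated()

def quantize_on_cuda():
    # Baseline: move the whole fp32 model to CUDA first, then quantize
    model = ToyModel().to("cuda")
    quantize_(model, int8_weight_only())

def quantize_each_param_individually():
    # This PR's option: keep the model on CPU and let quantize_ move each
    # param to CUDA right before quantizing it (the `device` kwarg is an assumption)
    model = ToyModel()
    quantize_(model, int8_weight_only(), device="cuda")

assert peak_cuda_memory(quantize_each_param_individually) < peak_cuda_memory(quantize_on_cuda)
```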
I see. So users of NF4 need to implement this logic themselves, i.e. torchtune will have to implement it instead of torchao.

I can try this. But in the end, I think we should stick to one approach to avoid confusing users.
Yeah, according to Driss, this LoraLinear code is specific to the framework that uses nf4 (e.g. torchtune, huggingface, vllm, etc.), so we don't want to offer a general API for this in torchao.
Agreed this is the right test
This scenario seems to optimize for people who have a GPU but want the model to run on the CPU, which seems rarer than the alternative: someone who wants to fit a model that does not fit on their GPU in full precision.

Otherwise yeah, once CI is green we can merge this.
The difference in peak memory between this PR and quantize on CPU (8.29 GiB vs 6.99 GiB) is probably due to the largest weight (the LM head), which is still moved to CUDA in full precision before it is quantized.
Using Llama2-7B.

There are tradeoffs between approach 1 and approach 2; let me know which one you prefer. I think we should keep only one approach to avoid confusion. Another thing we could consider is to measure the size of the weight to decide what to do with it, e.g. if the weight exceeds some threshold (which hopefully only the LM head does), we always quantize it on CPU to avoid the memory spike. Then again, this also adds seemingly unnecessary complexity.
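For reference, a minimal sketch of that size-based heuristic using quantize_'s existing filter_fn hook; the threshold value, the example int8 weight-only config, and the `device` kwarg are illustrative assumptions, not part of this PR:

```python
from torch import nn
from torchao.quantization import quantize_, int8_weight_only

# Illustrative threshold on the number of weight elements (the exact size was
# left unspecified in the discussion, so this number is an assumption)
SIZE_THRESHOLD = 2**26

def is_large_linear(module, fqn):
    return isinstance(module, nn.Linear) and module.weight.numel() >= SIZE_THRESHOLD

def is_small_linear(module, fqn):
    return isinstance(module, nn.Linear) and module.weight.numel() < SIZE_THRESHOLD

model = nn.Sequential(
    nn.Linear(4096, 4096, bias=False),   # "small" weight
    nn.ReLU(),
    nn.Linear(4096, 32000, bias=False),  # "large" weight, standing in for the LM head
)

# Oversized weights stay on CPU during quantization to avoid the CUDA memory spike
quantize_(model, int8_weight_only(), filter_fn=is_large_linear)
# Remaining weights are moved to CUDA one param at a time and quantized there
# (assuming this PR exposes the option as a `device` kwarg)
quantize_(model, int8_weight_only(), filter_fn=is_small_linear, device="cuda")
```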
I like approach 1 better because it's a better tradeoff between the options we already have |
Thanks for benchmarking the different approaches @gau-nernst. Yeah, approach 1 sounds good to me as well; it is simpler and easier to explain in the API.
Fixes #655
Test
@jerryzh168 Do you have any suggestions on what tests I should add for this?
This PR may help with the slow NF4 quantization issue observed in torchtune too (#642), though I'm not sure what the NF4 quantization API is. Does it use quantize_()?
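For context, a minimal sketch of how NF4 is applied today, assuming the torchtune-style flow of wrapping the weight tensor directly with to_nf4 rather than going through quantize_() (the surrounding module setup here is illustrative):

```python
from torch import nn
from torchao.dtypes.nf4tensor import to_nf4

# NF4 quantization today wraps an existing weight tensor in an NF4Tensor
# subclass directly, instead of using the module-level quantize_() API
linear = nn.Linear(4096, 4096, bias=False)
nf4_weight = to_nf4(linear.weight)  # NF4Tensor holding the quantized data
linear.weight = nn.Parameter(nf4_weight, requires_grad=False)
```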