Proposed changes to reduce VRAM usage. Potentially quantize larger models on consumer hardware. #269
Comments
neat
Hey, thank you for your flan-ul2 quant! I really appreciate it. Very useful model, but unfortunately overlooked due to the flood of llamas… Seeing your work, I tried to quantize a flan-t5-xxl, but the resulting file outputs nonsense. Can you please point me in the right direction, or share your modifications to the t5 branch?
Aside from the change listed in this issue, I've made no other changes. I don't really know why your quantized model isn't working, but according to the config.json files, flan-ul2 is bfloat16 while flan-t5-xxl is float32. I'm not sure whether this makes a difference, but it may. I also don't know which flan-t5-xxl model you downloaded; it looks like the repo has several formats. I'd recommend only getting the .bin files and double-checking their SHA256 against what is listed on HF. Sorry I couldn't be more help.
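If it helps, this is roughly how you could hash the downloaded shards locally (standard library only; the directory name below is a placeholder) and compare the printed digests against the values shown on the model's "Files" page on HF:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large .bin shards never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Placeholder path; point this at wherever the flan-t5-xxl shards were saved.
for shard in sorted(Path("./flan-t5-xxl").glob("*.bin")):
    print(shard.name, sha256_of(shard))
```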
Hello everyone,

Recently I noticed a lack of 4-bit quantized versions of `google/flan-ul2` on HF, and so decided to set out to quantize the model on my 4090. I struggled with this for a bit due to my own oversights, but during my debugging I found a section of code that does NOT take long to run on the CPU, and yet requires a significant allocation of memory if run on the GPU. Thus I propose the following changes inside of `t5_sequential()` in the `t5.py` script file:
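In rough outline, the change looks like the sketch below. Variable names such as `inps` and `dev` follow the usual GPTQ-style sequential code and may not match the actual `t5.py` exactly, and the `torch.cuda.empty_cache()` call is an optional extra rather than part of the trick itself.

```python
import gc

import torch


def encoder_final_norm_on_cpu(model, inps, dev):
    """Run the encoder's final LayerNorm and dropout on the CPU.

    Both modules are cheap on the CPU, but applying them on the GPU to the
    full batch of calibration activations needs a large temporary
    allocation, which is what this change avoids.
    """
    model.encoder.final_layer_norm = model.encoder.final_layer_norm.cpu()
    model.encoder.dropout = model.encoder.dropout.cpu()

    with torch.no_grad():
        inps = inps.cpu()
        inps = model.encoder.final_layer_norm(inps)
        inps = model.encoder.dropout(inps)
        inps = inps.to(dev)

    # Probably not strictly necessary, but it doesn't hurt: drop lingering
    # Python-side references and release cached CUDA blocks before moving
    # on to the decoder.
    gc.collect()
    torch.cuda.empty_cache()
    return inps
```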
At the moment, I'm not actually certain whether or not using `gc.collect()` is entirely necessary, but it also doesn't hurt. The real memory-saving changes come from running `model.encoder.final_layer_norm` and `model.encoder.dropout` on the CPU. This (combined with `--n_samples 256` and `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512`) enabled me to quantize a 20B model on my 4090 where I otherwise would not have been able to, or at least not with that sample count.

Note: use of `gc.collect()` requires adding `import gc` somewhere in the file, preferably near the top.
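For completeness, here is a hypothetical way to launch the quantization with the allocator setting and sample count mentioned above; the positional arguments of `t5.py` aren't spelled out here, so the placeholder is illustrative only.

```python
import os
import subprocess

# The allocator option has to be in the environment before PyTorch creates
# its CUDA caching allocator, so export it before launching the script.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# "<model-and-dataset-args>" stands in for whatever positional arguments
# t5.py expects; only --n_samples 256 comes from the text above.
subprocess.run(
    ["python", "t5.py", "<model-and-dataset-args>", "--n_samples", "256"],
    check=True,
)
```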