[REQUEST] Democratizing Present Day LLM Fine-Tuning To The GPU Poor #5276

Open
NicolasMejiaPetit opened this issue Mar 14, 2024 · 3 comments
Labels
enhancement New feature or request

Comments


NicolasMejiaPetit commented Mar 14, 2024

Dear DeepSpeed developers,

This is a letter to the DeepSpeed developers asking for several improvements that would allow DeepSpeed to fine-tune 175B models on consumer hardware faster and more efficiently:

1. A low-bit CPU optimizer. Currently, offloading the entire optimizer requires 32-bit CPU Adam, which is highly inefficient in terms of memory usage; full fine-tuning a 7B model with 16-bit Adam uses 25 GB for optimizer states alone. With modern advances in quantization, optimizers have gotten significantly better at smaller sizes, such as 4-bit optimizers (code, paper), which achieve almost the same benchmark results as their 16- and 32-bit counterparts with significant gains in memory efficiency. A 4-bit CPU optimizer would let the CPU work faster and use less memory, leaving more CPU RAM for sharding the model. (A short sketch of the optimizer-state savings involved is included below.)

2. Better NVMe offloading. NVMe offloading has proven to be a worthy contribution, allowing fine-tuning of a 175B model on a single 24 GB graphics card (paper; similar code by the same author). The authors highlight issues with the current iteration of DeepSpeed: "1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers." (A minimal offload-config sketch is also included below.)

3. bitsandbytes quantization. There is currently no way to fine-tune a model with 4-bit QLoRA because there is no BNB support. With BNB support we could quantize a model on the fly and then fine-tune it with QLoRA.

4. 8-bit and 4-bit full fine-tuning, so that we can train the entire model at once instead of having to rely on LoRA.

5. int4 and int8 support. I know half-precision support is baked in, and that PyTorch does not yet natively support int4/int8 training (work is underway in torchao), but the current approaches upcast to mimic integer types and do not use the integer engines that would let these models train at over 4x higher throughput (RTX 3080: BF16 59.5 TFLOPS, INT8 238 TOPS, INT4 476 TOPS) according to NVIDIA (blog and results).

If these additions were made to DeepSpeed, users would see a significant increase in training speed and decrease in training cost. The whole system could then work in tandem to provide the fastest training possible, fully democratizing present-day LLM fine-tuning for the GPU poor.
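For point 1, a rough sense of the savings: 32-bit Adam keeps two fp32 states per parameter (8 bytes/parameter, roughly 56 GB for a 7B model), while quantized optimizers shrink those states to 1-2 bytes per parameter. The sketch below is not a DeepSpeed API; it uses bitsandbytes' existing 8-bit AdamW on GPU purely to illustrate how a low-bit optimizer drops in, assuming bitsandbytes is installed.

```python
import torch
import bitsandbytes as bnb

# Stand-in module; in practice this would be the full LLM being fine-tuned.
model = torch.nn.Linear(4096, 4096).cuda()

# 8-bit Adam stores its two moment tensors quantized (~2 bytes/param vs 8 bytes
# for fp32 Adam); a 4-bit variant, as requested above, would roughly halve that again.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(2, 4096, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```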
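For point 2, a minimal sketch of the kind of ZeRO-Infinity (stage 3) configuration that pushes both parameters and optimizer state out to NVMe. The NVMe path, batch sizes, and aio settings are placeholders rather than tuned values, and `model` stands in for whatever module is being fine-tuned.

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Swap parameters and optimizer state to NVMe instead of holding them in CPU RAM.
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
    },
    # Async I/O settings govern how tensors are swapped to and from NVMe.
    "aio": {"block_size": 1048576, "queue_depth": 8, "overlap_events": True},
}

# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config
# )
```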

Thank you,
Nicolas Mejia-Petit

Also side note, full windows support would be pretty cool.

NicolasMejiaPetit added the enhancement label on Mar 14, 2024
GuanhuaWang (Member) commented:

@NickWithBotronics

Really appreciate your suggestions:

  1. We have noted down 4-bit/low-bit Adam; it should be on our roadmap for CPU offload.
  2. For NVMe offloading we have some progress here, and will release it when ready.
  3. bitsandbytes quantization is also noted down.
  4. For 8/4-bit fine-tuning, our ZeRO++ does support some of these features, and ZeRO++ is already used in real industry production workloads for speedup. (A minimal config sketch showing the relevant ZeRO++ flags follows this list.)
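The ZeRO++ flags referenced in point 4 are enabled through the ZeRO stage-3 config. A minimal sketch, assuming a multi-GPU node (the hpZ partition size of 8 is a placeholder for the number of GPUs per node); exact option names and defaults may vary between DeepSpeed releases.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # qwZ: quantized weights for all-gather
        "zero_hpz_partition_size": 8,       # hpZ: secondary weight shard kept within the node
        "zero_quantized_gradients": True,   # qgZ: quantized reduce-scatter for gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```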

NicolasMejiaPetit (Author) commented:

@GuanhuaWang Thank you for the update! I appreciate it!

tjruwase (Contributor) commented Jul 1, 2024

> Also side note, full windows support would be pretty cool.

@NicolasMejiaPetit, we heard you: #5609

tjruwase reopened this Oct 24, 2024