Dear DeepSpeed developers,

This is a letter asking for several improvements that would allow DeepSpeed to fine-tune 175B-parameter models on consumer hardware faster and more efficiently.

First, the CPU optimizer. Currently, to offload the entire optimizer we are required to use 32-bit CPU Adam, which is highly inefficient in terms of memory: full fine-tuning a 7B model with 16-bit Adam uses roughly 25 GB for optimizer state alone. With modern advances in quantization, optimizers have gotten significantly better at smaller sizes, such as 4-bit optimizers (code, paper), which achieve almost the same benchmarks as their 16- and 32-bit counterparts with a significant gain in memory efficiency. A 4-bit CPU optimizer would let the CPU work faster and use far less memory, leaving more CPU memory for sharding the model. A rough estimate of the savings is sketched below.

Secondly, NVMe offloading has proven to be a worthy contribution, allowing fine-tuning of a 175B model on a single 24 GB graphics card (paper; similar code by the same author). The authors highlight issues with the current iteration of DeepSpeed: "1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers."

Thirdly, bitsandbytes quantization: there is currently no support for fine-tuning a model with 4-bit QLoRA, because there is no bitsandbytes (bnb) support. With bnb support we could quantize models on the fly and then fine-tune them with QLoRA; a sketch of that workflow follows below.

Fourth, we need 8-bit and 4-bit full fine-tuning, so that we are not forced to use LoRA and can instead train the entire model at once.

Lastly, we need int4 and int8 support. I know half-precision support is baked in, since PyTorch does not yet support int4 and int8 training (work is under way in torchao). However, those paths upcast to mimic integer types and do not use the integer engines that would let these models train at over 4x higher throughput (RTX 3080 figures from NVIDIA: BF16 59.5 TFLOPS, INT8 238 TOPS, INT4 476 TOPS; source: blog and results).

If these additions were made to DeepSpeed, users would see a significant increase in training speed and a decrease in training cost. The whole system could then work in tandem to provide the fastest training possible, fully democratizing present-day LLM fine-tuning for the GPU-poor.
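For concreteness, here is a back-of-the-envelope sketch of the optimizer-state footprint for a 7B-parameter model. The 25 GB figure above is the observed number; the byte sizes below are the usual sizes of Adam's two moment buffers and are an illustrative assumption, not a DeepSpeed measurement.

```python
# Rough Adam optimizer-state footprint for a 7B-parameter model.
# Adam keeps two moment buffers (m and v) per parameter; bytes-per-value
# figures are assumptions for illustration only.
params = 7e9

for name, bytes_per_value in [("fp32 Adam", 4), ("fp16/bf16 Adam", 2), ("4-bit Adam", 0.5)]:
    state_gb = params * 2 * bytes_per_value / 1e9  # two moment buffers per parameter
    print(f"{name:>14}: ~{state_gb:.0f} GB of optimizer state")

# Approximate output:
#      fp32 Adam: ~56 GB of optimizer state
# fp16/bf16 Adam: ~28 GB of optimizer state  (same ballpark as the 25 GB quoted above)
#     4-bit Adam: ~7 GB of optimizer state
```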
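And to make the QLoRA request concrete, a minimal sketch of the kind of 4-bit fine-tuning setup (transformers + peft + bitsandbytes) that this issue is asking DeepSpeed to compose with. The model id and LoRA hyperparameters are placeholders chosen for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Placeholder model id and hyperparameters -- illustration only.
model_id = "meta-llama/Llama-2-7b-hf"

# Quantize the base weights to 4-bit NF4 on the fly with bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small LoRA adapters on top of the frozen 4-bit base model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The point of the request is that this 4-bit path and ZeRO's parameter/optimizer offload cannot currently be combined, so the quantized weights do not get sharded or paged out the way fp16/bf16 weights can.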
Thank you,
Nicolas Mejia-Petit
Also, side note: full Windows support would be pretty cool.
We noted down 4-bit/low-bit Adam; it should be on our roadmap for CPU offload.
For NVMe offloading, we have some progress here and will release it when ready.
Bitsandbytes (bnb) quantization is also noted down.
For 8/4-bit fine-tuning: our ZeRO++ does support some of these features, and ZeRO++ has already been used in real industry production workloads for speedup.
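For anyone landing on this thread, a minimal sketch of the ZeRO++ quantization switches referred to above, assuming the config keys documented in the DeepSpeed ZeRO++ tutorial; the values are placeholders, so verify against the current docs.

```python
# Illustrative DeepSpeed config enabling the ZeRO++ quantization features
# mentioned above; batch size and partition size are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # qwZ: quantized weight communication
        "zero_hpz_partition_size": 8,       # hpZ: hierarchical partitioning (e.g. GPUs per node)
        "zero_quantized_gradients": True,   # qgZ: quantized gradient communication
    },
}

# This dict would then be passed to deepspeed.initialize(model=..., config=ds_config, ...).
```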