Dear DeepSpeed developers,

This is a letter asking for several improvements that would allow DeepSpeed to fine-tune 175B-parameter models on consumer hardware faster and more efficiently.

First, the CPU optimizer. Currently, to offload the entire optimizer we are required to use 32-bit CPU Adam, which is highly inefficient in terms of memory: full fine-tuning a 7B model with 16-bit Adam uses roughly 25 GB for optimizer state alone. With modern advances in quantization, optimizers have gotten significantly better at smaller sizes, such as 4-bit optimizers (code, paper), which achieve almost the same benchmarks as their 16- and 32-bit counterparts with a significant gain in memory efficiency. A 4-bit CPU optimizer would let the CPU work faster and use far less memory, leaving more CPU memory for sharding the model. A rough estimate of the savings is sketched below.

Secondly, NVMe offloading has proven to be a worthy contribution, allowing fine-tuning of a 175B model on a single 24 GB graphics card (paper; similar code by the same author). The authors highlight issues with the current iteration of DeepSpeed: "1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers."

Thirdly, bitsandbytes quantization: there is currently no support for fine-tuning a model with 4-bit QLoRA, because there is no bitsandbytes (bnb) support. With bnb support we could quantize models on the fly and then fine-tune them with QLoRA; a sketch of that workflow follows below.

Fourth, we need 8-bit and 4-bit full fine-tuning, so that we are not forced to use LoRA and can instead train the entire model at once.

Lastly, we need int4 and int8 support. I know half-precision support is baked in, since PyTorch does not yet support int4 and int8 training (work is under way in torchao). However, those paths upcast to mimic integer types and do not use the integer engines that would let these models train at over 4x higher throughput (RTX 3080 figures from NVIDIA: BF16 59.5 TFLOPS, INT8 238 TOPS, INT4 476 TOPS; source: blog and results).

If these additions were made to DeepSpeed, users would see a significant increase in training speed and a decrease in training cost. The whole system could then work in tandem to provide the fastest training possible, fully democratizing present-day LLM fine-tuning for the GPU-poor.
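For concreteness, here is a back-of-the-envelope sketch of the optimizer-state footprint for a 7B-parameter model. The 25 GB figure above is the observed number; the byte sizes below are the usual sizes of Adam's two moment buffers and are an illustrative assumption, not a DeepSpeed measurement.

```python
# Rough Adam optimizer-state footprint for a 7B-parameter model.
# Adam keeps two moment buffers (m and v) per parameter; bytes-per-value
# figures are assumptions for illustration only.
params = 7e9

for name, bytes_per_value in [("fp32 Adam", 4), ("fp16/bf16 Adam", 2), ("4-bit Adam", 0.5)]:
    state_gb = params * 2 * bytes_per_value / 1e9  # two moment buffers per parameter
    print(f"{name:>14}: ~{state_gb:.0f} GB of optimizer state")

# Approximate output:
#      fp32 Adam: ~56 GB of optimizer state
# fp16/bf16 Adam: ~28 GB of optimizer state  (same ballpark as the 25 GB quoted above)
#     4-bit Adam: ~7 GB of optimizer state
```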
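And to make the QLoRA request concrete, a minimal sketch of the kind of 4-bit fine-tuning setup (transformers + peft + bitsandbytes) that this issue is asking DeepSpeed to compose with. The model id and LoRA hyperparameters are placeholders chosen for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Placeholder model id and hyperparameters -- illustration only.
model_id = "meta-llama/Llama-2-7b-hf"

# Quantize the base weights to 4-bit NF4 on the fly with bitsandbytes.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only small LoRA adapters on top of the frozen 4-bit base model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The point of the request is that this 4-bit path and ZeRO's parameter/optimizer offload cannot currently be combined, so the quantized weights do not get sharded or paged out the way fp16/bf16 weights can.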
Thank you,
Nicolas Mejia-Petit
Also, side note: full Windows support would be pretty cool.
We noted down 4-bit/low-bit Adam; it should be on our roadmap for CPU offload.
For NVMe offloading, we have some progress here and will release it when ready.
Bitsandbytes (bnb) quantization is also noted down.
For 8/4-bit fine-tuning: our ZeRO++ does support some of these features, and ZeRO++ has already been used in real industry production workloads for speedup.
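For anyone landing on this thread, a minimal sketch of the ZeRO++ quantization switches referred to above, assuming the config keys documented in the DeepSpeed ZeRO++ tutorial; the values are placeholders, so verify against the current docs.

```python
# Illustrative DeepSpeed config enabling the ZeRO++ quantization features
# mentioned above; batch size and partition size are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "zero_quantized_weights": True,     # qwZ: quantized weight communication
        "zero_hpz_partition_size": 8,       # hpZ: hierarchical partitioning (e.g. GPUs per node)
        "zero_quantized_gradients": True,   # qgZ: quantized gradient communication
    },
}

# This dict would then be passed to deepspeed.initialize(model=..., config=ds_config, ...).
```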