[enhancement] improve the performance of bloom model conversion, reduce the memory and time cost #568
…rease the time cost and the mem using
Thanks to @Yangruipis for proposing a significant improvement to the bloom converter. It looks very good to me.
sure, I'll check it later
@kjaedeok I couldn't locate the issue or any related comments. Could you please direct me to them?
The issue was reported directly to us, and the above workaround solved it.
glad to hear that
* Update beam_search_topk_kernels.cu fix: fix bug of beam search
* fix: change int of some kernels to int64_t to prevent overflow
* fix: gpt tensor shapes inconsistency (NVIDIA#505) Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Update gpt_guide.md (NVIDIA#529)
* fix: fix bug of gpt buffer and gpt gemm overflow
* Update T5DecodingWeight.cc fix: fix loading bug of t5
* [Enhancement]add pytorch backend support for gptneox (NVIDIA#550)
* add pytorch backend support for gptneox Signed-off-by: AkiyamaYummy <842720660@qq.com>
* fix early stopping invalid
* 1) Some unused parameters and logic have been removed. 2) Revisions that would affect pipeline parallelism have been reverted. 3) The code has been made capable of direct validation on TabbyML/NeoX-1.3B. Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Change the names of classes, removing 'parallel' from their names Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Format the code. Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Only print results when rank is 0. Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Add dist.init_process_group(). Signed-off-by: AkiyamaYummy <842720660@qq.com>
* update docs Signed-off-by: AkiyamaYummy <842720660@qq.com>
---------
Signed-off-by: AkiyamaYummy <842720660@qq.com>
* Update cublasMMWrapper.cc Fix the CUBLAS_VERSION checking of cublasMMWrapper
* Update cublasMMWrapper.cc
* fix overflow in softmax_kernel when process long seqlen and big batch_size (NVIDIA#524)
* Update unfused_attention_kernels.cu fix bug of softmax kernel
* [Enhancement]create huggingface_gptneox_convert.py (NVIDIA#569)
* create huggingface_gptneox_convert.py Signed-off-by: AkiyamaYummy <842720660@qq.com>
* adjust HF's multi bin files Signed-off-by: AkiyamaYummy <842720660@qq.com>
* update gptneox_guide.md Signed-off-by: AkiyamaYummy <842720660@qq.com>
---------
Signed-off-by: AkiyamaYummy <842720660@qq.com>
* perf(bloom): improve performance of huggingface_bloom_convert.py, decrease the time cost and the mem using (NVIDIA#568) Co-authored-by: r.yang <r.yang@tianrang-inc.com>
* Fix/gpt early stop (NVIDIA#584)
* fix: fix bug of early stopping of gpt
* [bugfix] Fix 2-shot All Reduce correctness issue (indexing bug). (NVIDIA#672) FasterTransformer 2-shot all reduce is implemented as a reduce-scatter + all-gather. There is an indexing bug in the all-gather step. Prior to this change, 2-shot all reduce was only producing correct results on device 0. Now, all devices have the correct results.
* fix: swap tensor bug (NVIDIA#683)
* Support size_per_head=112 (NVIDIA#660)
* fix multi-gpu build
* add support for size_per_head=112 for gpt decoder
* remove mpi_cxx from multi-gpu build for now (NVIDIA#705)
---------
Signed-off-by: AkiyamaYummy <842720660@qq.com>
Co-authored-by: byshiue <bhsueh@nvidia.com>
Co-authored-by: _yummy_ <842720660@qq.com>
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com>
Co-authored-by: 杨睿 <595403043@qq.com>
Co-authored-by: r.yang <r.yang@tianrang-inc.com>
Co-authored-by: Rahul Kindi <rkindi@users.noreply.github.com>
Co-authored-by: Perkz Zheng <67892460+PerkzZheng@users.noreply.github.com>
Co-authored-by: Daya Khudia <37562707+dskhudia@users.noreply.github.com>
Co-authored-by: Dean Wyatte <2512762+dwyatte@users.noreply.github.com>
what this PR does
Read the weights from the pytorch bin / safetensors file directory, instead of calling the from_pretrained function of transformers, because the from_pretrained method may take a lot of time for weight initialization and automatically converts bf16 weights to fp32, which doubles the memory.
conversion benchmarks
[1]: from_pretrained: 1910.47, convert: 1479.53
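The memory argument in the description above can be illustrated with a small, standard-library-only sketch. It is not the code from this PR: it only assumes the publicly documented safetensors layout (an 8-byte little-endian header length, a JSON header, then raw tensor bytes) and reads one tensor at a time in its stored dtype. The helper names (write_safetensors, iter_raw_tensors, demo) and the tensor name h.0.mlp.weight are hypothetical, chosen for illustration.

```python
import json
import struct
import tempfile
from pathlib import Path

# Bytes per element for a few safetensors dtype tags.
DTYPE_BYTES = {"BF16": 2, "F16": 2, "F32": 4}


def write_safetensors(path, tensors):
    """Toy writer: store {name: (dtype, shape, raw_bytes)} in safetensors layout."""
    header, offset = {}, 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        offset += len(raw)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))  # 8-byte header length
        f.write(header_bytes)
        for _dtype, _shape, raw in tensors.values():
            f.write(raw)


def iter_raw_tensors(path):
    """Yield (name, dtype, shape, raw_bytes) one tensor at a time, no upcast."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len  # offsets are relative to the data section
        for name, info in header.items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            begin, end = info["data_offsets"]
            f.seek(data_start + begin)
            yield name, info["dtype"], info["shape"], f.read(end - begin)


def demo():
    """Compare the stored size of a bf16 tensor with its size after an fp32 upcast."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.safetensors"
        # Hypothetical 4x8 bf16 weight: 32 elements * 2 bytes = 64 bytes.
        write_safetensors(path, {"h.0.mlp.weight": ("BF16", [4, 8], bytes(64))})
        name, dtype, shape, raw = list(iter_raw_tensors(path))[0]
    numel = 1
    for dim in shape:
        numel *= dim
    return name, dtype, len(raw), numel * DTYPE_BYTES["F32"]


if __name__ == "__main__":
    print(demo())
```

Reading tensors lazily like this keeps only one tensor in memory at a time and leaves bf16 data at 2 bytes per element, whereas an fp32 copy of the same tensor needs 4 bytes per element.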