v3.0.0-beta3
Pre-release
This release improves the core PaddleNLP experience: it adds the Llama-3.2 and DeepSeekV2 models, upgrades TokenizerFast support, and refactors SFTTrainer.
In addition, PaddleNLP now supports offloading and reloading optimizer states and introduces refined recomputation, improving training performance by 7%. For Unified Checkpoint, the asynchronous save logic has been further optimized, and a new checkpoint compression feature can save 78.5% of storage space.
Finally, large-model inference, auto parallelism, multi-hardware support, and the documentation have all been substantially improved.
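As an illustration of the newly added models (Llama-3.2 in #9199, DeepSeekV2 in #9250), below is a minimal sketch of loading one of them through PaddleNLP's Auto classes. The checkpoint identifier is an assumption for illustration only; substitute whatever name is registered in your installation.

```python
# Minimal sketch (not from the release notes): load one of the newly added models
# via PaddleNLP's Auto classes and run a short generation.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")

inputs = tokenizer("PaddleNLP is", return_tensors="pd")
# In dygraph mode, generate() returns a tuple of (generated_ids, scores).
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
```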
Major Updates and Enhancements
- New models
- Infrastructure improvements
- Inference performance improvements
- Expanded hardware compatibility
- Auto-parallel optimizations
- Documentation and test updates
This release marks PaddleNLP's continued progress toward a more comprehensive, efficient, and stable NLP solution. We look forward to bringing users even more innovation and value in future versions.
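The tokenizer upgrades called out above are also reflected in the changelog (padding_side accepted as a call-time keyword in #9258, fast tokenizers reachable through AutoTokenizer in #9466). The snippet below is a minimal sketch under the assumption that the `use_fast` flag and the call-time `padding_side` keyword mirror the Hugging Face-style interface, and that the chosen checkpoint ships a fast tokenizer; check the tokenizer documentation for the exact argument names in this release.

```python
# Minimal sketch exercising the upgraded tokenizer features. The `use_fast` flag,
# the call-time `padding_side` keyword, and the checkpoint name are assumptions.
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", use_fast=True)

batch = tokenizer(
    ["PaddleNLP v3.0.0-beta3", "adds fast tokenizers"],
    padding=True,
    padding_side="left",  # assumed: now accepted per call rather than only as a tokenizer attribute
    return_tensors="pd",
)
print(batch["input_ids"].shape)
```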
What's Changed
- [Unified Checkpoint] update async_save_info in develop by @DesmonDay in #9173
- add flashmask rm by @lugimzzz in #9154
- [LLM_INFER] Support quantized model from bos and fix docs by @yuanlehome in #9197
- fix ci not set no_proxy and modify tests in pir mode by @fightfat in #9205
- [Models] Add Llama-3.2 by @DrownFish19 in #9199
- move some auto_parallel args into class AutoTrainingArguments by @Wennie396 in #9155
- [Performance] Compatible with flashmask API rename upgrade by @GuoxiaWang in #9019
- [AutoParallel] add vpp align and pp amp test by @AndSonder in #9176
- fix auto ci return bug when run in v100 by @fightfat in #9216
- fix auto ci return bug when run in v100 by @AndSonder in #9228
- [LLM] Add tools for parameters by @Hanyonggong in #9137
- [AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by @zhangbo9674 in #9203
- [CI] Fix ci import. by @ZHUI in #9239
- [Version] Update version info by @DrownFish19 in #9241
- [Auto Parallel] Adding align mode support by @zhangyuqin1998 in #9150
- [LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by @gzy19990617 in #9202
- [INFER] update tune_cublaslt_gemm op and fix some bugs by @yuanlehome in #9222
- Reduce the time spent on git downloading third-party libraries by @vivienfanghuagood in #9246
- [PIR] fix pir open bugs by @yuanlehome in #9248
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @sneaxiy in #9245
- [Unified Checkpoint] Support expert parallel by @DesmonDay in #9055
- [PIR] fix pir dt2st for chatglm_v2 by @yuanlehome in #9251
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @LiYuRio in #9253
- [Unified Checkpoint] Fix generation config save by @DrownFish19 in #9223
- [AutoParallel] Fix tests for pass paddle AutoParallel CI by @liym27 in #9267
- change dataset by @lugimzzz in #9266
- [Unified Checkpoint] update async save logic by @DesmonDay in #9274
- add config file for model chatglm2,gemma,yuan by @Mangodadada in #9139
- Fix async hang by @DesmonDay in #9276
- [AutoParallel] Change llama test from sharding stage2 to stage1 by @zhangbo9674 in #9281
- [Tokenizer] Enable padding_side as call time kwargs by @DrownFish19 in #9258
- [Trainer] fix save_model by @DesmonDay in #9286
- [CI] Skip inference test cases by @DrownFish19 in #9270
- [LLM] Add deepseekv2 by @DrownFish19 in #9250
- [Tokenizer] Unify tokenizer _pad by @DrownFish19 in #9280
- [CI] Fix llm/alignment/rm/flashmask path by @DrownFish19 in #9289
- support attention mask using causal=True by @GuoxiaWang in #9268
- [FlashMask] Add FlashMask for Qwen2 by @DrownFish19 in #9264
- bug fix for xpu_parallel_matmul by @FeixLiu in #9297
- fix lora sharding v2 by @lugimzzz in #9300
- [LLM INFER] Append attn by @yuanlehome in #9244
- [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by @zhangyuqin1998 in #9217
- [Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by @dynamicheart in #9304
- clean llama static modeling file by @zhiqiu in #9301
- [Unified Checkpoint] Accelerate loading checkpoint by multi-thread by @Crystal-X-111 in #9034
- fix non-pipelinelayer to distributed by @gongel in #9310
- change the legacy to slm by @wawltor in #9311
- [TRL] Rename sft trainer. by @ZHUI in #9292
- [XPU] support unified ckpt function by @cqulilujia in #9312
- [LLM INFER] Fix some bugs and chatglm_v2 support block_attn by @yuanlehome in #9271
- [Readme] Add flash mask by @lugimzzz in #9219
- update llm infer docs by @yuanlehome in #9314
- [Unified Checkpoint] Add split param and refactor code by @DesmonDay in #9240
- [METAX] Support llama for MX C550 by @idontkonwher in #9186
- update QR code by @DrownFish19 in #9325
- add flash_attention on model chatglm_v2 by @Mangodadada in #9296
- fix readme by @Mangodadada in #9326
- [Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by @DesmonDay in #9321
- [paddle cpu inference]fix cpu doc by @bukejiyu in #9299
- [LLM INFER] add rope_theta for block_multihead_attention by @yuanlehome in #9334
- Fix pr 9334 by @yuanlehome in #9335
- fix parameter calculation in auto_parallel mode by @zhiqiu in #9327
- [Docs] Update flashmask by @DrownFish19 in #9330
- Update load_save_single_card.py by @DesmonDay in #9337
- Update README.md by @DrownFish19 in #9339
- [Tokenizer] Support reading Tiktoken tokenizer.model. by @lvdongyi in #9215
- align default custom black/white list for dygraph and static graph by @zhiqiu in #9340
- [intel_hpu] initial commit for intel_hpu support by @yanfeich in #9273
- Compatible with Tensor.to change to out_of_place. by @DrownFish19 in #9343
- [Tokenizer] Fix Llama3Tokenizer import by @DrownFish19 in #9341
- [Docs] Add precision alignment doc by @DrownFish19 in #9346
- [Tokenizer] Support adding special tokens to Qwen tokenizer by @DrownFish19 in #9344
- Add ordered save to avoid OOM by @ForFishes in #9347
- [AutoParallel]Bugfix Hang for VPP-Sharding by @JZ-LIANG in #9336
- Add CI testing for A100 and V100 device by @waliwali777 in #9324
- [Inference] Append attn FP8 quant by @ckl117 in #9328
- [Tokenizer] Add BertTokenizerFast, support register new tokenizer by @lvdongyi in #9353
- clean print in auto_trainer by @zhiqiu in #9357
- [Unified Checkpoint] Fix fp32 dtype for using newest paddle by @DesmonDay in #9360
- [UIE] Fix tokenizer output with return_token_type_ids by @DrownFish19 in #9363
- Add offload/reload for optimizer by @ForFishes in #9359
- refine dtype use by @wanghuancoder in #9366
- Add check for sharding stage1-v2 using amp master grad by @ForFishes in #9333
- [Trainer] Update assert to warning by @DesmonDay in #9332
- [Auto Parallel] fix adapt_stale_fwd_patch for to_static mode by @zhangyuqin1998 in #9372
- [LLM INFER] Optimize fuse some kernels in postprocess by @gzy19990617 in #9201
- [AutoParallel] Fix `EXCODE` bug of AutoParallel CI by @waliwali777 in #9355
- Support pp + no_recompute_layer. by @tianyuzhou668 in #9373
- [Unified Checkpoint] Support empty state_dict saving by @DesmonDay in #9380
- Add submodule by @risemeup1 in #9385
- [CI] add recursive for submodule by @Liujie0926 in #9389
- [CI]fix scripts by @Liujie0926 in #9394
- [LLM]add ktotrainer by @lugimzzz in #9393
- Refine log freq by @zhangbo9674 in #9397
- [XPU] Llama XPU's swiglu uses phi's swiglu by @dynamicheart in #9414
- fix hip paddlenlp_ops bug by @TBD1 in #9418
- [CI]update target_lists_for_llm by @Liujie0926 in #9417
- [INFER][LLM] Add the AutoModel for inference mode by @zeroRains in #9416
- [Unified Checkpoint] Support sharding_comm_overlap by @DesmonDay in #9392
- [DCU] update dcu paddlenlp_ops by @TBD1 in #9433
- Change core.LoDTensor to core.DenseTensor by @co63oc in #9434
- Change LOD_TENSOR to DENSE_TENSOR by @co63oc in #9419
- [LLM] Fix deepseekv2 import in py38 by @DrownFish19 in #9446
- [Distributed Dataloader] change process new_group creation by @DesmonDay in #9438
- Update dist_dataloader.py by @DesmonDay in #9451
- [llm]fix pp no drop last by @lugimzzz in #9439
- Reduce long duration for the `exit -6` re-run process. by @waliwali777 in #9400
- Fix row parallel lora layers parameters initialization bug by @will-jl944 in #9427
- Refactor tool of creating pretrain dataset by @gongel in #9454
- [Auto-Parallel] update conf for sharding overlap in static by @liym27 in #9456
- [AutoParallel] add release_gradients and comm_buffer_size_MB to strategy by @AndSonder in #9432
- [LLM] Skip zero loss by @DrownFish19 in #9447
- [ChatTemplate] Fix chat template when answer is contained within question. by @DrownFish19 in #9444
- [LLM] Add expert parallel by @DrownFish19 in #9368
- Add handling of abnormal exits to the benchmark multi-node task execution script by @XieYunshen in #9442
- [llm]add set_seed by @lugimzzz in #9429
- [AutoParallel] Reconstruct sharding mesh dimension inference logic - Part2 add sharding_mesh_dimension param by @AndSonder in #9382
- Fix auto parallel CI exit -6 by @waliwali777 in #9460
- [ChatTemplate] Fix chat template for `Gemma` when answer is contained within question. by @lvdongyi in #9462
- Use paddle.cast instead of Tensor.astype by @HydrogenSulfate in #9461
- fixed the init problem in tensor parallel by @wawltor in #9452
- Revised PoSE by @whf313 in #8822
- fix AutoInferenceModel for qwen-vl by @yuanlehome in #9463
- add reft method by @TranscenderNing in #8819
- [AutoParallel]: llama_model_auto support alibi by @blacksheep-Aristotle in #9422
- [AutoParallel]:gpt 13b model support fused_linear sp fused_attention … by @blacksheep-Aristotle in #9477
- add Moslora by @TranscenderNing in #9331
- [Trainer] Fix eval for map dataset by @DesmonDay in #9472
- [Inference]Move quantization code from run_finetune.py to run_quantization.py by @lixcli in #9450
- [AutoParallel] Fix parameter passing for comm_buffer_size_MB and release_gradients by @AndSonder in #9481
- [AutoParallel]:fix run llama_13b_auto error by @blacksheep-Aristotle in #9480
- [Unified Checkpoint] Checkpoint compression by @wtmlon in #9183
- fixbug for chatglm_v2's RetaryEmbedding dtype by @mingMelody in #9476
- [LLM INFER] Support speculative decoding (llama) by @Wanglongzhi2001 in #9180
- [Fix] Remove data args print by @DrownFish19 in #9486
- [AutoParallel] open vpp test cast at v100 machines by @AndSonder in #9468
- [ChatTemplate] Fix chat template for `Yuan` when answer is contained within question. by @lvdongyi in #9485
- [AutoParallel]:fix baichuan d2s fail by @blacksheep-Aristotle in #9478
- [Tokenizer] Support fast tokenizer within AutoTokenizer import by @DrownFish19 in #9466
- [Inference] use fp8 cuda core gemm kernel when M<=4 by @zhink in #9423
- [XPU] set appropriate mask value for xpu by @runzhech in #9495
- [LLM INFER] not use gemm_dequant default and fix bug by @yuanlehome in #9498
- [NEW Feature] Add hook-based refined_recompute support by @JunnYu in #9396
- [Hackathon 7th No.43] Improve TokenizerFast feature support, part 1 by @yinfan98 in #9407
- [BUG] fix pp eval shape bug by @JunnYu in #9505
- Adding LoKrModel Class to paddle.peft library by @WhuanY in #9269
- Remove the CUDA_DEVICE_MAX_CONNECTIONS environment variable and optimize the benchmark execution scripts by @XieYunshen in #9500
- [Refactor] SFTTrainer SFTConfig by @ZHUI in #9318
- fix csrc readme by @yuanlehome in #9515
- Add document for speculative decoding by @Wanglongzhi2001 in #9492
- [News] FlashRAG-Paddle by @DrownFish19 in #9511
- support quant ckpt limit strategy by @wtmlon in #9494
- Fix ckpt convert bug by @zhangbo9674 in #9521
- support pp accuracy calculation by @wtmlon in #9379
- Fix ckpt convert bug1 by @zhangbo9674 in #9522
- [CI] Compatible with paddle.where by @DrownFish19 in #9534
- [Inference] Update DygraphInferencePredictor by @DrownFish19 in #9491
- support offload/reload optimizer's states for custom device by @tianhaodongbd in #9467
- [LLM INFER] fix tune_cublaslt_int8_gemm.py and remove dist_config by @yuanlehome in #9520
- [Hackathon 7th No.43] TokenizerFast for Qwen2 by @yinfan98 in #9532
- [INFER][LLM] Add the AutoPredictor for inference by @zeroRains in #9445
- Support call sft training with clone PaddleNLP by @ZHUI in #9516
New Contributors
- @Crystal-X-111 made their first contribution in #9034
- @idontkonwher made their first contribution in #9186
- @waliwali777 made their first contribution in #9324
- @tianyuzhou668 made their first contribution in #9373
- @risemeup1 made their first contribution in #9385
- @TBD1 made their first contribution in #9418
- @zeroRains made their first contribution in #9416
- @XieYunshen made their first contribution in #9442
- @whf313 made their first contribution in #8822
- @mingMelody made their first contribution in #9476
- @runzhech made their first contribution in #9495
- @WhuanY made their first contribution in #9269
Full Changelog: v3.0.0-beta2...v3.0.0-beta3