v3.0.0-beta3
Pre-release
This release improves the core PaddleNLP experience: it adds the Llama-3.2 and DeepSeekV2 models, upgrades TokenizerFast support, and refactors SFTTrainer.
In addition, PaddleNLP now supports offloading and reloading optimizer states and introduces refined recomputation, improving training performance by 7%. For Unified Checkpoint, the asynchronous save logic has been further optimized, and a new checkpoint compression feature can save 78.5% of storage space.
Finally, large-model inference, auto parallelism, multi-hardware support, and the documentation have all been substantially improved.
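As an illustration of the newly added models (Llama-3.2 in #9199, DeepSeekV2 in #9250), below is a minimal sketch of loading one of them through PaddleNLP's Auto classes. The checkpoint identifier is an assumption for illustration only; substitute whatever name is registered in your installation.

```python
# Minimal sketch (not from the release notes): load one of the newly added models
# via PaddleNLP's Auto classes and run a short generation.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="bfloat16")

inputs = tokenizer("PaddleNLP is", return_tensors="pd")
# In dygraph mode, generate() returns a tuple of (generated_ids, scores).
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs[0], skip_special_tokens=True))
```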
Major Updates and Enhancements
- New models
- Infrastructure improvements
- Inference performance improvements
- Expanded hardware compatibility
- Auto-parallel optimizations
- Documentation and test updates
This release marks PaddleNLP's continued progress toward a more comprehensive, efficient, and stable NLP solution. We look forward to bringing users even more innovation and value in future versions.
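The tokenizer upgrades called out above are also reflected in the changelog (padding_side accepted as a call-time keyword in #9258, fast tokenizers reachable through AutoTokenizer in #9466). The snippet below is a minimal sketch under the assumption that the `use_fast` flag and the call-time `padding_side` keyword mirror the Hugging Face-style interface, and that the chosen checkpoint ships a fast tokenizer; check the tokenizer documentation for the exact argument names in this release.

```python
# Minimal sketch exercising the upgraded tokenizer features. The `use_fast` flag,
# the call-time `padding_side` keyword, and the checkpoint name are assumptions.
from paddlenlp.transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B", use_fast=True)

batch = tokenizer(
    ["PaddleNLP v3.0.0-beta3", "adds fast tokenizers"],
    padding=True,
    padding_side="left",  # assumed: now accepted per call rather than only as a tokenizer attribute
    return_tensors="pd",
)
print(batch["input_ids"].shape)
```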
What's Changed
- [Unified Checkpoint] update async_save_info in develop by @DesmonDay in #9173
- add flashmask rm by @lugimzzz in #9154
- [LLM_INFER] Support quantized model from bos and fix docs by @yuanlehome in #9197
- fix ci not set no_proxy and modify tests in pir mode by @fightfat in #9205
- [Models] Add Llama-3.2 by @DrownFish19 in #9199
- move some auto_parallel args into class AutoTrainingArguments by @Wennie396 in #9155
- [Performance] Compatible with flashmask API rename upgrade by @GuoxiaWang in #9019
- [AutoParallel] add vpp align and pp amp test by @AndSonder in #9176
- fix auto ci return bug when run in v100 by @fightfat in #9216
- fix auto ci return bug when run in v100 by @AndSonder in #9228
- [LLM] Add tools for parameters by @Hanyonggong in #9137
- [AutoParallel] Add test for fuse_ffn and fuse_attention_qkv pass by @zhangbo9674 in #9203
- [CI] Fix ci import. by @ZHUI in #9239
- [Version] Update version info by @DrownFish19 in #9241
- [Auto Parallel] Adding align mode support by @zhangyuqin1998 in #9150
- [LLM INFER] top_p_sampling_reject support top_p=0 and custom seed by @gzy19990617 in #9202
- [INFER] update tune_cublaslt_gemm op and fix some bugs by @yuanlehome in #9222
- Reduce the time spent on git downloading third-party libraries by @vivienfanghuagood in #9246
- [PIR] fix pir open bugs by @yuanlehome in #9248
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @sneaxiy in #9245
- [Unified Checkpoint] Support expert parallel by @DesmonDay in #9055
- [PIR] fix pir dt2st for chatglm_v2 by @yuanlehome in #9251
- Cherry-pick some PRs from incubate/paddlenlp-fleety by @LiYuRio in #9253
- [Unified Checkpoint] Fix generation config save by @DrownFish19 in #9223
- [AutoParallel] Fix tests for pass paddle AutoParallel CI by @liym27 in #9267
- change dataset by @lugimzzz in #9266
- [Unified Checkpoint] update async save logic by @DesmonDay in #9274
- add config file for model chatglm2,gemma,yuan by @Mangodadada in #9139
- Fix async hang by @DesmonDay in #9276
- [AutoParallel] Change llama test from sharding stage2 to stage1 by @zhangbo9674 in #9281
- [Tokenizer] Enable padding_side as call time kwargs by @DrownFish19 in #9258
- [Trainer] fix save_model by @DesmonDay in #9286
- [CI] Skip inference test cases by @DrownFish19 in #9270
- [LLM] Add deepseekv2 by @DrownFish19 in #9250
- [Tokenizer] Unify tokenizer _pad by @DrownFish19 in #9280
- [CI] Fix llm/alignment/rm/flashmask path by @DrownFish19 in #9289
- support attention mask using causal=True by @GuoxiaWang in #9268
- [FlashMask] Add FlashMask for Qwen2 by @DrownFish19 in #9264
- bug fix for xpu_parallel_matmul by @FeixLiu in #9297
- fix lora sharding v2 by @lugimzzz in #9300
- [LLM INFER] Append attn by @yuanlehome in #9244
- [Auto Parallel] fix bugs for split_batches_for_accumulation && fix bu… by @zhangyuqin1998 in #9217
- [Tokenizer] Fix TokenizerFast missing clean_up_tokenization_spaces by @dynamicheart in #9304
- clean llama static modeling file by @zhiqiu in #9301
- [Unified Checkpoint] Accelerate loading checkpoint by multi-thread by @Crystal-X-111 in #9034
- fix non-pipelinelayer to distributed by @gongel in #9310
- change the legacy to slm by @wawltor in #9311
- [TRL] Rename sft trainer. by @ZHUI in #9292
- [XPU] support unified ckpt function by @cqulilujia in #9312
- [LLM INFER] Fix some bugs and chatglm_v2 support block_attn by @yuanlehome in #9271
- [Readme] Add flash mask by @lugimzzz in #9219
- update llm infer docs by @yuanlehome in #9314
- [Unified Checkpoint] Add split param and refactor code by @DesmonDay in #9240
- [METAX] Support llama for MX C550 by @idontkonwher in #9186
- update QR code by @DrownFish19 in #9325
- add flash_attention on model chatglm_v2 by @Mangodadada in #9296
- fix readme by @Mangodadada in #9326
- [Unified Checkpoint] update non-merge checkpoint loading, move async_save_info.json location by @DesmonDay in #9321
- [paddle cpu inference]fix cpu doc by @bukejiyu in #9299
- [LLM INFER] add rope_theta for block_multihead_attention by @yuanlehome in #9334
- Fix pr 9334 by @yuanlehome in #9335
- fix parameter calculation in auto_parallel mode by @zhiqiu in #9327
- [Docs] Update flashmask by @DrownFish19 in #9330
- Update load_save_single_card.py by @DesmonDay in #9337
- Update README.md by @DrownFish19 in #9339
- [Tokenizer] Support reading Tiktoken tokenizer.model. by @lvdongyi in #9215
- align default custom black/white list for dygraph and static graph by @zhiqiu in #9340
- [intel_hpu] initial commit for intel_hpu support by @yanfeich in #9273
- Compatible with Tensor.to change to out_of_place. by @DrownFish19 in #9343
- [Tokenizer] Fix Llama3Tokenizer import by @DrownFish19 in #9341
- [Docs] Add precision alignment doc by @DrownFish19 in #9346
- [Tokenizer] Support adding special tokens to Qwen tokenizer by @DrownFish19 in #9344
- Add ordered save to avoid OOM by @ForFishes in #9347
- [AutoParallel]Bugfix Hang for VPP-Sharding by @JZ-LIANG in #9336
- Add CI testing for A100 and V100 device by @waliwali777 in #9324
- [Inference] Append attn FP8 quant by @ckl117 in #9328
- [Tokenizer] Add BertTokenizerFast, support register new tokenizer by @lvdongyi in #9353
- clean print in auto_trainer by @zhiqiu in #9357
- [Unified Checkpoint] Fix fp32 dtype for using newest paddle by @DesmonDay in #9360
- [UIE] Fix tokenizer output with return_token_type_ids by @DrownFish19 in #9363
- Add offload/reload for optimizer by @ForFishes in #9359
- refine dtype use by @wanghuancoder in #9366
- Add check for sharding stage1-v2 using amp master grad by @ForFishes in #9333
- [Trainer] Update assert to warning by @DesmonDay in #9332
- [Auto Parallel] fix adapt_stale_fwd_patch for to_static mode by @zhangyuqin1998 in #9372
- [LLM INFER] Optimize fuse some kernels in postprocess by @gzy19990617 in #9201
- [AutoParallel] Fix `EXCODE` bug of AutoParallel CI by @waliwali777 in #9355
- Support pp + no_recompute_layer. by @tianyuzhou668 in #9373
- [Unified Checkpoint] Support empty state_dict saving by @DesmonDay in #9380
- Add submodule by @risemeup1 in #9385
- [CI] add recursive for submodule by @Liujie0926 in #9389
- [CI]fix scripts by @Liujie0926 in #9394
- [LLM]add ktotrainer by @lugimzzz in #9393
- Refine log freq by @zhangbo9674 in #9397
- [XPU] Llama XPU's swiglu uses phi's swiglu by @dynamicheart in #9414
- fix hip paddlenlp_ops bug by @TBD1 in #9418
- [CI]update target_lists_for_llm by @Liujie0926 in #9417
- [INFER][LLM] Add the AutoModel for inference mode by @zeroRains in #9416
- [Unified Checkpoint] Support sharding_comm_overlap by @DesmonDay in #9392
- [DCU] update dcu paddlenlp_ops by @TBD1 in #9433
- Change core.LoDTensor to core.DenseTensor by @co63oc in #9434
- Change LOD_TENSOR to DENSE_TENSOR by @co63oc in #9419
- [LLM] Fix deepseekv2 import in py38 by @DrownFish19 in #9446
- [Distributed Dataloader] change process new_group creation by @DesmonDay in #9438
- Update dist_dataloader.py by @DesmonDay in #9451
- [llm]fix pp no drop last by @lugimzzz in #9439
- Reduce long duration for the `exit -6` re-run process. by @waliwali777 in #9400
- Fix row parallel lora layers parameters initialization bug by @will-jl944 in #9427
- Refactor tool of creating pretrain dataset by @gongel in #9454
- [Auto-Parallel] update conf for sharding overlap in static by @liym27 in #9456
- [AutoParallel] add release_gradients and comm_buffer_size_MB to strategy by @AndSonder in #9432
- [LLM] Skip zero loss by @DrownFish19 in #9447
- [ChatTemplate] Fix chat template when answer is contained within question. by @DrownFish19 in #9444
- [LLM] Add expert parallel by @DrownFish19 in #9368
- Add handling of abnormal exits to the benchmark multi-node task execution script by @XieYunshen in #9442
- [llm]add set_seed by @lugimzzz in #9429
- [AutoParallel] Reconstruct sharding mesh dimension inference logic - Part2 add sharding_mesh_dimension param by @AndSonder in #9382
- Fix auto parallel CI exit -6 by @waliwali777 in #9460
- [ChatTemplate] Fix chat template for `Gemma` when answer is contained within question. by @lvdongyi in #9462
- Use paddle.cast instead of Tensor.astype by @HydrogenSulfate in #9461
- fixed the init problem in tensor parallel by @wawltor in #9452
- Revised PoSE by @whf313 in #8822
- fix AutoInferenceModel for qwen-vl by @yuanlehome in #9463
- add reft method by @TranscenderNing in #8819
- [AutoParallel]: llama_model_auto support alibi by @blacksheep-Aristotle in #9422
- [AutoParallel]:gpt 13b model support fused_linear sp fused_attention … by @blacksheep-Aristotle in #9477
- add Moslora by @TranscenderNing in #9331
- [Trainer] Fix eval for map dataset by @DesmonDay in #9472
- [Inference]Move quantization code from run_finetune.py to run_quantization.py by @lixcli in #9450
- [AutoParallel] Fix parameter passing for comm_buffer_size_MB and release_gradients by @AndSonder in #9481
- [AutoParallel]:fix run llama_13b_auto error by @blacksheep-Aristotle in #9480
- [Unified Checkpoint] Checkpoint compression by @wtmlon in #9183
- fixbug for chatglm_v2's RetaryEmbedding dtype by @mingMelody in #9476
- [LLM INFER] Support speculative decoding (llama) by @Wanglongzhi2001 in #9180
- [Fix] Remove data args print by @DrownFish19 in #9486
- [AutoParallel] open vpp test cast at v100 machines by @AndSonder in #9468
- [ChatTemplate] Fix chat template for `Yuan` when answer is contained within question. by @lvdongyi in #9485
- [AutoParallel]:fix baichuan d2s fail by @blacksheep-Aristotle in #9478
- [Tokenizer] Support fast tokenizer within AutoTokenizer import by @DrownFish19 in #9466
- [Inference] use fp8 cuda core gemm kernel when M<=4 by @zhink in #9423
- [XPU] set appropriate mask value for xpu by @runzhech in #9495
- [LLM INFER] not use gemm_dequant default and fix bug by @yuanlehome in #9498
- [NEW Feature] Add hook-based refined_recompute support by @JunnYu in #9396
- [Hackathon 7th No.43] Improve TokenizerFast feature support, part 1 by @yinfan98 in #9407
- [BUG] fix pp eval shape bug by @JunnYu in #9505
- Adding LoKrModel Class to paddle.peft library by @WhuanY in #9269
- Remove the CUDA_DEVICE_MAX_CONNECTIONS environment variable and optimize the benchmark execution scripts by @XieYunshen in #9500
- [Refactor] SFTTrainer SFTConfig by @ZHUI in #9318
- fix csrc readme by @yuanlehome in #9515
- Add document for speculative decoding by @Wanglongzhi2001 in #9492
- [News] FlashRAG-Paddle by @DrownFish19 in #9511
- support quant ckpt limit strategy by @wtmlon in #9494
- Fix ckpt convert bug by @zhangbo9674 in #9521
- support pp accuracy calculation by @wtmlon in #9379
- Fix ckpt convert bug1 by @zhangbo9674 in #9522
- [CI] Compatible with paddle.where by @DrownFish19 in #9534
- [Inference] Update DygraphInferencePredictor by @DrownFish19 in #9491
- support offload/reload optimizer's states for custom device by @tianhaodongbd in #9467
- [LLM INFER] fix tune_cublaslt_int8_gemm.py and remove dist_config by @yuanlehome in #9520
- [Hackathon 7th No.43] TokenizerFast for Qwen2 by @yinfan98 in #9532
- [INFER][LLM] Add the AutoPredictor for inference by @zeroRains in #9445
- Support call sft training with clone PaddleNLP by @ZHUI in #9516
New Contributors
- @Crystal-X-111 made their first contribution in #9034
- @idontkonwher made their first contribution in #9186
- @waliwali777 made their first contribution in #9324
- @tianyuzhou668 made their first contribution in #9373
- @risemeup1 made their first contribution in #9385
- @TBD1 made their first contribution in #9418
- @zeroRains made their first contribution in #9416
- @XieYunshen made their first contribution in #9442
- @whf313 made their first contribution in #8822
- @mingMelody made their first contribution in #9476
- @runzhech made their first contribution in #9495
- @WhuanY made their first contribution in #9269
Full Changelog: v3.0.0-beta2...v3.0.0-beta3