[Tokenizer] Unify tokenizer _pad #9280

DrownFish19 · 2024-10-16T09:50:12Z

PR types

Function optimization

PR changes

APIs

Description

Unify tokenizer _pad function.

Move the padding of the 3D attention_mask (shape [1, seq_len, seq_len]) into the tokenizer_base _pad.
Move the padding of attn_mask_startend_row_indices into the tokenizer_base _pad.
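
For readers unfamiliar with what padding these two inputs involves, here is a minimal NumPy sketch. It is not the PaddleNLP implementation: the helper names `pad_attention_mask_3d` and `pad_startend_row_indices`, the zero pad value, and the treatment of `attn_mask_startend_row_indices` as a 1D sequence are all assumptions made for illustration. It only shows the general idea of growing a [1, seq_len, seq_len] mask to [1, max_length, max_length], and a per-token index vector to max_length, on either padding side.

```python
import numpy as np


def pad_attention_mask_3d(mask, max_length, padding_side="right"):
    """Zero-pad a [1, seq_len, seq_len] mask to [1, max_length, max_length].

    Hypothetical helper for illustration; not part of PaddleNLP.
    """
    mask = np.asarray(mask)
    difference = max_length - mask.shape[-1]
    if difference < 0:
        raise ValueError("max_length is shorter than the sequence")
    if padding_side == "right":
        pad_width = [(0, 0), (0, difference), (0, difference)]
    elif padding_side == "left":
        pad_width = [(0, 0), (difference, 0), (difference, 0)]
    else:
        raise ValueError(f"invalid padding_side: {padding_side}")
    # Padded rows and columns are filled with 0, i.e. "do not attend".
    return np.pad(mask, pad_width, mode="constant", constant_values=0)


def pad_startend_row_indices(indices, max_length, padding_side="right", pad_value=0):
    """Pad a 1D vector of row indices to max_length.

    Hypothetical helper; the pad value of 0 is an assumption.
    """
    indices = np.asarray(indices, dtype=np.int32)
    difference = max_length - indices.shape[-1]
    if difference < 0:
        raise ValueError("max_length is shorter than the sequence")
    if padding_side == "right":
        pad_width = (0, difference)
    elif padding_side == "left":
        pad_width = (difference, 0)
    else:
        raise ValueError(f"invalid padding_side: {padding_side}")
    return np.pad(indices, pad_width, mode="constant", constant_values=pad_value)


# Usage: a causal mask for 3 tokens, padded to length 5 on each side.
causal = np.tril(np.ones((1, 3, 3), dtype=np.int64))
right_padded = pad_attention_mask_3d(causal, 5, padding_side="right")
left_padded = pad_attention_mask_3d(causal, 5, padding_side="left")
padded_indices = pad_startend_row_indices([3, 3, 3], 5)
```

The sketch does not address whether the index values themselves need to be shifted under left padding; the PR description does not say.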

The error-tolerance verification for "[FlashMask] Add FlashMask for Qwen2 #9264" is based on this PR.

paddle-bot · 2024-10-16T09:50:17Z

Thanks for your contribution!


codecov · 2024-10-16T13:44:26Z

Codecov Report

Attention: Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.

Project coverage is 52.96%. Comparing base (1dc6e18) to head (5ad7381).
Report is 2 commits behind head on develop.

Files with missing lines	Patch %	Lines
paddlenlp/transformers/tokenizer_utils_base.py	87.50%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9280      +/-   ##
===========================================
+ Coverage    52.94%   52.96%   +0.02%     
===========================================
  Files          657      657              
  Lines       106533   106384     -149     
===========================================
- Hits         56404    56351      -53     
+ Misses       50129    50033      -96


lugimzzz

LGTM


DrownFish19 added 3 commits October 16, 2024 16:48

update attention_mask padding

a6bfd74

update _pad function

b2f8727

unify attn_mask_startend_row_indices

70d8d72

DrownFish19 added 4 commits October 16, 2024 21:10

update chatglm_v2 tokenizer

68fd3b0

Merge branch 'PaddlePaddle:develop' into dev_20241016_update_tokenizer__pad

c1b13aa

revert chatglmv2

fb93a58

Merge branch 'dev_20241016_update_tokenizer__pad' of github.com:DrownFish19/PaddleNLP into dev_20241016_update_tokenizer__pad

1ef3a35

lugimzzz previously approved these changes Oct 17, 2024


add test cases for 3D attention_mask and attn_mask_startend_row_indices

c82b04c

DrownFish19 dismissed lugimzzz’s stale review via c82b04c October 17, 2024 06:20

DrownFish19 added 3 commits October 17, 2024 17:50

update attn_mask_startend_row_indices return type

3b3029f

fix padding_side=left

3d852b1

update test case

867ad0b

DrownFish19 force-pushed the dev_20241016_update_tokenizer__pad branch from 7f3b3c8 to 867ad0b October 17, 2024 13:12

DrownFish19 added 2 commits October 17, 2024 13:18

Merge remote-tracking branch 'paddlenlp/develop' into dev_20241016_update_tokenizer__pad

a183b40

Merge branch 'PaddlePaddle:develop' into dev_20241016_update_tokenizer__pad

5ad7381

ZHUI approved these changes Oct 18, 2024


ZHUI merged commit 1770b51 into PaddlePaddle:develop Oct 18, 2024
8 of 12 checks passed

DrownFish19 deleted the dev_20241016_update_tokenizer__pad branch October 18, 2024 07:25
