[ML] Fix for Deberta tokenizer when input sequence exceeds 512 tokens #117595

maxhniebergall · 2024-11-26T21:53:50Z

We were missing the "balanced" case for the NLP tokenizer, which caused exceptions with large inputs. In addition to the fix, I've added a test that I confirmed fails without the fix, with the same error message as reported.
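For readers unfamiliar with this class of bug, here is a minimal sketch of how a truncation switch with no "balanced" branch can end up throwing on long inputs. It is not the code changed in this PR. The `TruncationSketch` class, the `Truncate` enum, and `keptLengths` are hypothetical names chosen for illustration, and special tokens are ignored for simplicity.

```java
public class TruncationSketch {
    // Hypothetical modes; the names mirror the cases discussed in this PR,
    // not necessarily the real Elasticsearch enum.
    public enum Truncate { FIRST, SECOND, BALANCED, NONE }

    /** Returns {keptFirst, keptSecond}; their sum never exceeds maxTokens. */
    public static int[] keptLengths(int firstLen, int secondLen, int maxTokens, Truncate mode) {
        if (firstLen + secondLen <= maxTokens) {
            return new int[] { firstLen, secondLen }; // nothing to truncate
        }
        switch (mode) {
            case FIRST:
                // Trim only the first sequence; the second must fit on its own.
                if (secondLen > maxTokens) {
                    throw new IllegalArgumentException("second sequence alone exceeds the limit");
                }
                return new int[] { maxTokens - secondLen, secondLen };
            case SECOND:
                // Trim only the second sequence; the first must fit on its own.
                if (firstLen > maxTokens) {
                    throw new IllegalArgumentException("first sequence alone exceeds the limit");
                }
                return new int[] { firstLen, maxTokens - firstLen };
            case BALANCED: {
                // Give each sequence about half the budget, and hand any
                // budget one side does not need to the other side.
                int half = maxTokens / 2;
                int keepFirst = Math.min(firstLen, Math.max(half, maxTokens - secondLen));
                int keepSecond = Math.min(secondLen, maxTokens - keepFirst);
                return new int[] { keepFirst, keepSecond };
            }
            default:
                // NONE lands here. Without the BALANCED branch above,
                // balanced requests would land here too and throw.
                throw new IllegalArgumentException(
                    "input of " + (firstLen + secondLen) + " tokens exceeds limit of " + maxTokens);
        }
    }

    public static void main(String[] args) {
        int[] kept = keptLengths(600, 600, 512, Truncate.BALANCED);
        System.out.println(kept[0] + " + " + kept[1]); // prints "256 + 256"
    }
}
```

If the `BALANCED` branch were removed, the same call would fall through to `default` and throw, which is the general shape of failure described above.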

elasticsearchmachine · 2024-11-26T21:54:13Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-11-26T21:54:14Z

Hi @maxhniebergall, I've created a changelog YAML for you.

elasticsearchmachine · 2024-11-26T23:02:53Z

💚 Backport successful

Status	Branch	Result
✅	8.17
✅	8.16

Add test and fix

1e914a2

maxhniebergall added the >bug, :ml, auto-backport, v9.0.0, v8.17.0, and v8.16.2 labels Nov 26, 2024

elasticsearchmachine added the Team:ML label Nov 26, 2024

Update docs/changelog/117595.yaml

4b47d89

maxhniebergall added 2 commits November 26, 2024 16:54

Remove test which wasn't working

b21182c

Merge branch 'debertaTokenizerTruncationFix' of github.com:elastic/el…

f5a3454

…asticsearch into debertaTokenizerTruncationFix

prwhelan approved these changes Nov 26, 2024

maxhniebergall enabled auto-merge (squash) November 26, 2024 22:11

maxhniebergall merged commit 433a00c into main Nov 26, 2024
17 checks passed

maxhniebergall deleted the debertaTokenizerTruncationFix branch November 26, 2024 23:00

maxhniebergall mentioned this pull request Nov 26, 2024

[8.17] [ML] Fix for Deberta tokenizer when input sequence exceeds 512 tokens (#117595) #117600

Merged

maxhniebergall mentioned this pull request Nov 26, 2024

[8.16] [ML] Fix for Deberta tokenizer when input sequence exceeds 512 tokens (#117595) #117601

Merged

elasticsearchmachine pushed a commit that referenced this pull request Nov 27, 2024

[ML] Fix for Deberta tokenizer when input sequence exceeds 512 tokens (…

bdebe39

…#117595) (#117601) * Add test and fix * Update docs/changelog/117595.yaml * Remove test which wasn't working

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Nov 27, 2024

[ML] Fix for Deberta tokenizer when input sequence exceeds 512 tokens (…

a8d0f21

…elastic#117595) * Add test and fix * Update docs/changelog/117595.yaml * Remove test which wasn't working
