Sort added tokens by length to avoid early partial matches
xenova committed Jul 5, 2024
1 parent 344af32 commit c305c38
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion src/tokenizers.js
@@ -2559,7 +2559,11 @@ export class PreTrainedTokenizer extends Callable {


         this.added_tokens_regex = this.added_tokens.length > 0 ? new RegExp(
-            this.added_tokens.map(x => `${x.lstrip ? '\\s*' : ''}(${escapeRegExp(x.content)})${x.rstrip ? '\\s*' : ''}`).join('|')
+            this.added_tokens
+                // Sort by length (desc) to avoid early partial matches
+                .toSorted((a, b) => b.content.length - a.content.length)
+                .map(x => `${x.lstrip ? '\\s*' : ''}(${escapeRegExp(x.content)})${x.rstrip ? '\\s*' : ''}`)
+                .join('|')
         ) : null;
 
         // Set mask token if present (otherwise will be undefined, which is fine)
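
The reason the sort matters: JavaScript regex alternation tries alternatives left to right and takes the first one that matches at a given position, not the longest. If one added token is a prefix of another, the shorter one can win when it appears first in the pattern. The sketch below reproduces this outside the library. The token list, the `buildRegex` helper, and the local `escapeRegExp` are hypothetical stand-ins, not code from `src/tokenizers.js`, and `slice().sort()` is used in place of `toSorted` only so the snippet also runs on runtimes older than ES2023.

```javascript
// Hypothetical stand-in for the library's escapeRegExp helper.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Illustrative added tokens (assumed for this example, not from a real config).
// The first token is a prefix of the second.
const addedTokens = [
  { content: '<s>', lstrip: false, rstrip: false },
  { content: '<s>NOTUSED', lstrip: false, rstrip: false },
];

// Builds the alternation the same way the diff does, without sorting.
function buildRegex(tokens) {
  return new RegExp(
    tokens
      .map(x => `${x.lstrip ? '\\s*' : ''}(${escapeRegExp(x.content)})${x.rstrip ? '\\s*' : ''}`)
      .join('|')
  );
}

// Before the change: alternatives keep their original order.
const unsorted = buildRegex(addedTokens);

// After the change: longest content first (copy first so the input is not mutated).
const sorted = buildRegex(
  addedTokens.slice().sort((a, b) => b.content.length - a.content.length)
);

const text = 'hello <s>NOTUSED world';
const unsortedMatch = text.match(unsorted)[0];
const sortedMatch = text.match(sorted)[0];

console.log(unsortedMatch); // "<s>"
console.log(sortedMatch);   // "<s>NOTUSED"
```

With the unsorted pattern, `<s>` matches first and `NOTUSED` is left behind as ordinary text; with the sorted pattern the full token is matched.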
