fix CLIPTokenizer skipping underscores
TyrianOtter authored and catwell committed Oct 3, 2024
1 parent f89c4f7 commit 590648e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/refiners/foundationals/clip/tokenizer.py
@@ -44,7 +44,7 @@ def __init__(
         # to get rid of the dependence on the `regex` module. Unicode support could
         # potentially be added back by leveraging the `\w` character class.
         self.token_pattern = re.compile(
-            pattern=r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]|[^\s\w]+""",
+            pattern=r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]|(?:[^\s\w]|_)+""",
             flags=re.IGNORECASE,
         )
         self.start_of_text_token_id: int = start_of_text_token_id
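
Why the old pattern dropped underscores: `\w` matches `_`, so the final alternative `[^\s\w]+` excludes it, and neither `[a-zA-Z]+` nor `[0-9]` accepts it, which leaves no alternative that can match an underscore. The sketch below compares the two patterns in isolation. It assumes the tokenizer collects matches with a `findall`-style scan, which this diff does not show; the names `OLD_PATTERN`, `NEW_PATTERN` and `sample` are illustrative only.

```python
import re

# The two patterns from the diff, copied verbatim.
OLD_PATTERN = re.compile(
    r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]|[^\s\w]+""",
    flags=re.IGNORECASE,
)
NEW_PATTERN = re.compile(
    r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[a-zA-Z]+|[0-9]|(?:[^\s\w]|_)+""",
    flags=re.IGNORECASE,
)

# Assumption: matches are gathered by scanning the text left to right,
# so any character that no alternative accepts is silently skipped.
sample = "snake_case_1 __init__"

old_tokens = OLD_PATTERN.findall(sample)
new_tokens = NEW_PATTERN.findall(sample)

print(old_tokens)  # ['snake', 'case', '1', 'init']
print(new_tokens)  # ['snake', '_', 'case', '_', '1', '__', 'init', '__']
```

The new group is non-capturing (`(?:...)`), so `findall` still returns whole matches rather than group contents. Consecutive underscores and other punctuation are merged into a single match by the trailing `+`.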