Merges cannot handle tokens containing spaces. #909
Conversation
Hi. I found this PR through #566. How can I use it? I wasn't able to see the Rust files in my tokenizers site-packages, and the README on the repo doesn't give instructions on building from source.
(force-pushed from 74a4c9e to 0f617cc)
Here are the docs for installing from source: https://huggingface.co/docs/tokenizers/installation#installation-from-sources
I rebased again.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for the speedy response, Narsil. To clarify: this PR allows the merges in our tokenizer.json file to contain multiple whitespace characters per merge? I'll play around with it tomorrow and let you know 👍
Yes, exactly. This allows any number of spaces, or any other character, in the merges.
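For illustration only (the values below are made up, not taken from this PR), the difference shows up in the `model.merges` field of `tokenizer.json`. Roughly:

```python
# Legacy serialization: each merge is a single space-joined string, so a
# token that itself contains a space cannot be represented unambiguously.
legacy_merges = ["Ġ t", "Ġt he"]

# Serialization enabled by this PR: each merge is a length-2 list of strings,
# so either side may contain spaces (or any other character).
new_merges = [["Ġ", "t"], ["Ġt", "he"]]
```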
Wow, that installation process was painless. I guess there is a first time for everything! Anyway, I'm still facing the error from the original issue: `data did not match any variant of untagged enum ModelWrapper at line 24420 column 3`. The merges in tokenizer.json are now a length-2 list of strings instead of a single string, so I'm pretty sure your branch was used successfully. The error is getting thrown at this line:
Could you share your serialized file somehow? I could take a look.
Hey man, did you get a chance to take a look?
Thanks for the reminder. I just did; it seems the merges you are referring to in your file are not present in the vocabulary, making the merges unsound. How did you create that file?
Here is how I'm testing this in Rust to get a better error message:

```rust
#[test]
// Ensure `BPE::from_file` works as expected.
fn test_bpe_from_file2() {
    // Make sure we can instantiate a BPE model from the files.
    let data = std::fs::read_to_string("test.json").unwrap();
    let bpe: BPE = serde_json::from_str(&data).unwrap();
}
```

Just add this to the Rust source code and run it. Making better error messages all the way into Python is trickier because of the libs we use and how this library is structured (though it would be really cool to have).
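(For reference, and assuming the test is added inside the `tokenizers` crate, something like `cargo test test_bpe_from_file2` should run just this test, since `cargo test` filters tests by name.)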
I'm trying to get BPE to learn the ideal character sequence to split on, instead of splitting on whitespace first. Something like this:

```python
import os

from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

tokenizer = Tokenizer(BPE())  # type: ignore

# The data has been preprocessed to add Ġ to the beginning of every word in the corpus.
# Lowercase everything, then restore the uppercase Ġ word marker.
tokenizer.normalizer = normalizers.Sequence([Lowercase(), normalizers.Replace("ġ", "Ġ")])

# Customize training.
trainer = BpeTrainer(vocab_size=5000, min_frequency=2,
                     show_progress=True,
                     special_tokens=[
                         "<pad>",
                         "<s>",
                         "</s>",
                         "<unk>",
                         "<mask>",
                     ])

# `file_list_for_language` and `language` are defined elsewhere in my script.
tokenizer.train(files=file_list_for_language, trainer=trainer)

tokenizer_path = f"tokenizers/{language}/"
os.makedirs(tokenizer_path, exist_ok=True)
new_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
new_tokenizer.save_pretrained(tokenizer_path)
```

I'm also doing some postprocessing which modifies the vocabulary directly. This may be where the discrepancy with the merges is coming from:

```python
import json
from typing import Dict

with open(tokenizer_path + "tokenizer.json", "r") as f:
    data = json.load(f)

model_vocab: Dict[str, int] = data["model"]["vocab"]
postprocessed_vocab = {}
for tok, idx in model_vocab.items():
    # Leave the lowest-id tokens (idx < 6) untouched.
    if idx < 6:
        postprocessed_vocab[tok] = idx
        continue
    # Strip whitespace and trailing Ġ markers from regular tokens.
    tok = tok.strip()
    if tok.endswith("Ġ.##"):
        tok = tok[:-4].strip()
    if tok.endswith("Ġ"):
        tok = tok[:-1].strip()
    postprocessed_vocab[tok] = idx
```

I'm not entirely sure how I can "clean" my vocabulary and make the relevant changes in the merges file?
Yes, it's your postprocessing that is most likely causing the issue.
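As a hedged sketch (not from this thread): one way to see why this breaks is that every merge references two vocabulary entries plus their concatenation, so renaming vocab entries without rewriting the merges leaves dangling references. Assuming the list-of-pairs merge format from this PR, a quick consistency check could look like this:

```python
import json

# Hypothetical sanity check: flag merges that reference tokens which are no
# longer present in the (postprocessed) vocabulary.
with open("tokenizer.json", "r") as f:
    data = json.load(f)

vocab = data["model"]["vocab"]
for left, right in data["model"]["merges"]:
    for part in (left, right, left + right):
        if part not in vocab:
            print(f"merge {left!r} + {right!r}: missing token {part!r}")
```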
Hi, can this please be merged as a different (de)serialization format? Several issues over the years have run into this problem and have used this PR to fix it. Making this a separate serialization format would be fully backwards compatible and would explicitly alert people to use the different functions to (de)serialize the tokenizer.
Hey @ashutoshbsathe, if there are more issues that are not linked, or any particular use cases, could you link them here? I'll see what I can do!
This fixes the issue while keeping backward support. We don't want to merge that blindly.
(force-pushed from 11f84e7 to 5026d82)
Would just add a non-legacy test with spaces inside the merges.
Also, this would potentially mean supporting spaces in added tokens 🍭
Done.
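As a rough illustration of what a non-legacy merge containing spaces looks like from the Python side (a sketch with made-up vocab and merges, not the test added in this PR): the `BPE` model already accepts merges as pairs, so a pair whose members contain spaces can round-trip once the serializer writes pairs rather than space-joined strings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab and merges where tokens themselves contain spaces (made up).
vocab = {"a": 0, " ": 1, "a ": 2, "b": 3, "a b": 4}
merges = [("a", " "), ("a ", "b")]

tokenizer = Tokenizer(BPE(vocab, merges))
serialized = tokenizer.to_str()             # merges are written as pairs
roundtrip = Tokenizer.from_str(serialized)  # loads back without ambiguity
```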