1-character wordpieces fail to encode #4
I basically took https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py#L512 and converted it to C++: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L5
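For anyone following along, here's a minimal C++ sketch of that greedy longest-match-first WordPiece loop as it works in the Python original. The names and byte-level string handling are illustrative only, not the package's actual code (which also has to deal with UTF-8 and Rcpp types):

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first WordPiece, following the HuggingFace
// tokenization_bert.py logic: repeatedly take the longest vocab entry
// that matches at `start`, prefixing non-initial pieces with "##".
// NOTE: this sketch operates on bytes for simplicity; the Python
// original works on Unicode characters.
std::vector<std::string> wordpiece_tokenize_word(
    const std::string& token,
    const std::unordered_set<std::string>& vocab,
    const std::string& unk_token = "[UNK]") {
  std::vector<std::string> pieces;
  std::size_t start = 0;
  while (start < token.size()) {
    std::size_t end = token.size();  // longest candidate first
    std::string match;
    bool found = false;
    while (start < end) {
      std::string substr = token.substr(start, end - start);
      if (start > 0) substr = "##" + substr;  // continuation piece
      if (vocab.count(substr)) {
        match = substr;
        found = true;
        break;
      }
      --end;  // shrink the candidate by one character
    }
    if (!found) return {unk_token};  // no piece matched: whole word unknown
    pieces.push_back(match);
    start = end;  // resume right after the matched piece
  }
  return pieces;
}
```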
I would not be surprised if it's an indexing issue!
Beware that compilation of this package takes some time (30 minutes on my old Windows machine from 2013), so while debugging the C++ code it might be a good idea to blow away only the relevant .so files by commenting out this line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L62 and also commenting out line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 while developing.
Thankfully it's far less than 30 minutes on my machine, but thanks, I was about to try to figure out how to do that!
Seems to be just this: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8. Removing the -1 part; otherwise it will never enter the loop on length-1 strings.

Would be great if you could test this out to validate correctness if on line https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8 we remove the -1.
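To make the off-by-one concrete, here is the failure mode in miniature (a hedged sketch, not the package's literal line 8):

```cpp
#include <cstddef>
#include <string>

int main() {
  std::string token = "i";             // a length-1 word
  std::size_t start = 0;
  std::size_t end = token.size() - 1;  // buggy: equals 0; should be token.size()
  while (start < end) {                // 0 < 0 is false, so the matcher never runs
    // ... token.substr(start, end - start) would be tested against the vocab ...
    --end;
  }
  // With end = token.size(), the loop tests "i" itself and can encode it.
  return 0;
}
```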
Went away for lunch for a minute; trying... but I'm pretty sure that line was what I tried first. I'll try to work through what's actually happening and understand it, though! Evidently I don't understand what I need to do to keep recompilation from taking forever. If I only need to recompile rcpp_wordpiece.cpp, what do I do?
Update: Ugh, yes, that's totally it; I evidently hadn't actually tried that one yet! Writing a test or three, then I'll PR.
Closes bnosac#4. Note: There were a bunch of new .o files in src/sentencepiece/src but I didn't include them since I didn't intentionally change anything there.
Nope, not quite. Digging through to see why "icos" doesn't tokenize as expected:
Current implementation (at CRAN) has
The third example should come out as "i" "##cos". I think it's related to https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L19, but I'm still working on wrapping my head around it, and it's a bit tough to iterate still. I don't think I'm grokking how to reduce the amount of recompilation.
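Assuming a toy vocabulary containing "i" and "##cos" (hypothetical entries, chosen to match the expected output), a correct greedy pass over "icos" should run as traced below; the sketch function from the earlier comment can check it:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>
// ... plus wordpiece_tokenize_word() from the sketch above ...

// Trace: start = 0 tries "icos", "ico", "ic" (no match), then "i" (match);
// start = 1 tries "##cos" (match); done.
int main() {
  std::unordered_set<std::string> vocab = {"i", "##cos"};
  std::vector<std::string> pieces = wordpiece_tokenize_word("icos", vocab);
  assert(pieces.size() == 2);
  assert(pieces[0] == "i");
  assert(pieces[1] == "##cos");
  return 0;
}
```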
Updating line 19 the same way. Edit: and also line 15, same edit.
And... not quite; one of those has to be <, I think, because I appear to be in an infinite loop with another test (words not in vocab).
👍 You're definitely in the same mental state I was in when I wrote that function.
Ok, gonna walk through and actually understand every line, or at least try to. What should I put on https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 to make it only update the 1 changed function?
Comment it out by prepending #, or just remove that line.
BOOM!
(end is an unsigned int, so, at the start, it can never become less than start; I tried making it signed, but that didn't work out.) Doing some more checks, then submitting the PR.
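Here's an illustrative sketch (again, not the package's exact lines) of why the inner bound has to stay a strict < when end is unsigned:

```cpp
#include <cstddef>
#include <string>

int main() {
  std::string token = "qq";      // pretend no substring of this is in the vocab
  std::size_t start = 0;
  std::size_t end = token.size();
  while (start < end) {          // strict <: exits before the empty candidate
    // ... token.substr(start, end - start) tested and rejected here ...
    --end;                       // safe: the loop stops once end reaches start
  }
  // With start <= end instead, the empty candidate gets tested at
  // start == end == 0, and the next --end wraps the unsigned value to
  // SIZE_MAX rather than -1, so the loop never exits: exactly the
  // infinite loop seen on out-of-vocab words.
  return 0;
}
```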
Thanks for looking into this. Is it possible to provide some tests showing all the expected behaviour? Thanks.
Will do!
@jonthegeek Regarding R package https://CRAN.R-project.org/package=wordpiece
Please add the following to the package DESCRIPTION
And next, change the license of the wordpiece package to a license which is compatible with the MPL-2 license (Apache is a more liberal license, not compatible with MPL-2) of the code you took and adapted .tokenize_word from. I also indicated this at #7. Next, ask CRAN to remove the current wordpiece package due to this MPL violation, while re-uploading the wordpiece package to CRAN under whichever MPL-2-compatible license you decide to choose. Or just rewrite the tokenizer based on the huggingface Python code and skip all of the above. But please don't just copy-paste code you've found on the internet without taking into account the right copyright/license statements.
I don't believe our .tokenize_word has any relationship to yours; we ended up writing it from scratch. I will look more closely in the morning to make sure it does not overlap. Our intention was definitely not to steal your code.
I really don't mind if someone reuses my code; in fact, just go ahead. But follow the rules indicated by the CRAN policies and the MPL, that's all. That also covers derived work.
Jonathan based this on @jonathanbratt's code from jonathanbratt/RBERT rather than the code from your package, after I tested speed and found no appreciable difference for the tokenization step. We decided to let others (stringr) handle the C/C++ code rather than implementing a version that we'd need to maintain. I do not believe Jonathan's code has any relationship to yours.
Ok, never mind then.
To confirm what @jonthegeek said, this was an independent implementation. We had indeed considered building on sentencepiece, but our speed tests didn't show a significant difference, so I just took what we already had in RBERT and pulled it into a separate package. But this reminds me: sentencepiece does have functionality that wordpiece doesn't, so I definitely want to add a mention to the README in our next update. All the best!
If that was the case, no problem. My apologies for tagging you both then.
There appears to be a bug in the wordpiece implementation around 1-character words:
I did a few other tests, and it appears to be 1-character "words" in general, but it's always possible I'm mis-identifying the issue.
We're strongly considering putting together a separate wordpiece package for BERT-style encoding, so I wanted to see if your implementation would do the trick, and right now it doesn't quite. It's super close, though, so it'd be great if we could track this down (a smaller footprint in a separate wordpiece package would be great... but we'll take your fast implementation over the monstrosity we have in RBERT!).