
1-character wordpieces fail to encode #4

Closed
jonthegeek opened this issue Nov 4, 2020 · 26 comments · Fixed by #5
@jonthegeek
Contributor

There appears to be a bug in the wordpiece implementation around 1-character words:

sentencepiece::wordpiece_encode(
  x = "i like tacos", 
  vocabulary = c(
    "i", "like", "ta", "##cos"
  )
)
#> [[1]]
#> [1] "[UNK]" "like"  "ta"    "##cos"

I did a few other tests, and it appears to be 1-character "words" in general, but it's always possible I'm misidentifying the issue.

We're strongly considering putting together a separate wordpiece package for BERT-style encoding, so I wanted to see if your implementation would do the trick, and right now it doesn't quite. It's super close, though, so it'd be great if we could track this down (a smaller footprint in a separate wordpiece package would be ideal... but we'll take your fast implementation over the monstrosity we have in RBERT!).

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

I basically took https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py#L512 and converted it to C++: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L5
To be honest, I'm not using wordpiece in production anywhere, so it's not tested thoroughly; I'm mostly using the sentencepiece models in production. So I would be grateful if you could indicate where the conversion went wrong (maybe something along the lines of Python/C++/R indexing starting from 0 or 1?).
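As a quick sketch of where such an off-by-one could creep in (illustrative R, not the actual package code): the Hugging Face Python loop is 0-indexed, so `while start < len(chars)` still runs for a 1-character word, whereas a literal translation of `<` into 1-indexed R does not.

# Python (0-indexed): start = 0; 0 < len("i") is TRUE, so the loop runs.
# R (1-indexed): a literal translation of '<' skips 1-character words.
term  <- "i"
start <- 1
start <  nchar(term)  # FALSE: the loop body never runs for a 1-char word
start <= nchar(term)  # TRUE: the 1-indexed translation needs <=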

@jonthegeek
Contributor Author

I would not be surprised if it's an indexing issue!
My C++ is suuuuuuuper weak (I literally haven't used it in almost 30 years), but I've been meaning to brush up, so I'll see if I can find the issue!

@jwijffels
Contributor

Beware that compilation of this package takes some time (30 minutes on my old Windows machine from 2013), so while debugging the C++ code it might be a good idea to blow away only the relevant .so files, by commenting out this line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L62 and also commenting out the line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 while developing.

@jonthegeek
Contributor Author

Thankfully it's far less than 30 minutes on my machine, but thanks, I was about to try to figure out how to do that!

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

Seems to be just this: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8. Removing the -1 part should do it; otherwise it never enters the loop on length-1 strings.
I wonder why I put that there. Probably I prototyped the function in R before porting it to C++. Yes, that was probably it: I still have that R prototype in my dev folder, and there I had the following.

wordpiece_encode <- function(x, vocabulary, unk_token = "[UNK]", max_input_chars_per_word = 200){
  x <- trimws(x)
  x <- strsplit(x, split = " ")
  x <- lapply(x, FUN=function(terms){
    lapply(terms, FUN=function(term){
      if(nchar(term) > max_input_chars_per_word){
        return(unk_token)
      }else{
        output_tokens <- character()
        sub_tokens <- character()
        start <- 1
        is_bad <- FALSE
        while(start < nchar(term)){ # bug: 1-character terms never enter this loop (1 < 1 is FALSE)
          end        <- nchar(term)
          cur_substr <- character()
          while(start < end){ # bug: the single-character candidate at position start is never tried
            subterm <- substr(term, start = start, stop = end)
            subterm <- paste(subterm, collapse = "")
            print(subterm) # debug output
            if(start > 1){
              subterm <- paste("##", subterm, sep = "")
            }
            if(subterm %in% vocabulary){
              cur_substr <- subterm
              sub_tokens <- append(sub_tokens, subterm)
              break
            }
            end <- end - 1
          }
          print(cur_substr) # debug output
          if(length(cur_substr) == 0){
            is_bad <- TRUE
            break
          }
          start <- end # note: with 1-indexed inclusive substr(), this should arguably be end + 1
        }
        if(is_bad){
          output_tokens <- append(output_tokens, unk_token)
        }else{
          output_tokens <- append(output_tokens, sub_tokens)
        }
        return(output_tokens)
      }
    })
  })
  x
}
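Running that prototype on the failing example reproduces the report above (ignoring the debug prints): a 1-character term never enters the outer loop, so it yields no subtokens at all.

out <- wordpiece_encode("i like tacos",
                        vocabulary = c("i", "like", "ta", "##cos"))
out[[1]][[1]]
#> character(0)  # "i" produced nothing: 1 < nchar("i") is FALSE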

@jwijffels
Contributor

It would be great if you could test this out to validate correctness if, on line https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8, we remove the -1 part.

@jonthegeek
Contributor Author

Went away for lunch for a minute; trying now... but I'm pretty sure that line was what I tried first. I'll try to work through what's actually happening and understand it, though!

I don't understand what I need to do to keep recompilation from taking forever, evidently. If I only need to recompile rcpp_wordpiece.cpp, what do I do?

@jonthegeek
Contributor Author

Update: Ugh, yes, that's totally it; I evidently hadn't actually tried that one yet!

Writing a test or three, then I'll PR.
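Something along these lines (a sketch using testthat; the expected values are taken from the examples in this thread):

library(testthat)
library(sentencepiece)

test_that("1-character wordpieces encode", {
  vocab <- c("i", "like", "ta", "##cos")
  # a 1-character word in the vocabulary must encode as itself, not [UNK]
  expect_equal(wordpiece_encode("i like tacos", vocab)[[1]],
               c("i", "like", "ta", "##cos"))
})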

jonthegeek added a commit to jonthegeek/sentencepiece that referenced this issue Nov 4, 2020
Closes bnosac#4. Note: There were a bunch of new .o files in src/sentencepiece/src but I didn't include them since I didn't intentionally change anything there.
@jonthegeek
Contributor Author

Nope, not quite. Digging through to see why "icos" doesn't tokenize as expected:

library(sentencepiece)
#> Warning: package 'sentencepiece' was built under R version 4.0.3
wordpiece_encode(
  x = c("tacos i like", "i like tacos", "icos"), 
  vocabulary = c(
    "i", "like", "ta", "##cos"
  )
)
#> [[1]]
#> [1] "ta"    "##cos" "i" "like" 
#> 
#> [[2]]
#> [1] "i" "like"  "ta"    "##cos"
#> 
#> [[3]]
#> [1] "[UNK]"

@jwijffels
Contributor

The current implementation (at CRAN) gives:

> library(sentencepiece)
Warning message:
package ‘sentencepiece’ was built under R version 4.0.3 
> #> Warning: package 'sentencepiece' was built under R version 4.0.3
> wordpiece_encode(
+     x = c("tacos i like", "i like tacos", "icos"), 
+     vocabulary = c(
+         "i", "like", "ta", "##cos"
+     )
+ )
[[1]]
[1] "ta"    "##cos" "[UNK]" "like" 

[[2]]
[1] "[UNK]" "like"  "ta"    "##cos"

[[3]]
[1] "[UNK]"

@jonthegeek
Contributor Author

The third example should come out as "i" "##cos".

I think it's related to https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L19, but I'm still working on wrapping my head around it, and it's still a bit tough to iterate. I don't think I'm grokking how to reduce the amount of recompilation.

@jonthegeek
Contributor Author

jonthegeek commented Nov 4, 2020

Updating line 19 to while(start <= end){ appears to have fixed it!

Edit: and also line 15, same edit.

@jonthegeek
Contributor Author

And... not quite: one of those has to be <, I think, because I appear to be in an infinite loop with another test (words not in the vocab).

@jwijffels
Contributor

👍 You're definitely in the same mental state I was in when I wrote that function.

@jonthegeek
Contributor Author

Ok, gonna walk through and actually understand every line, or at least try to.

What should I put on https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 to make it recompile only the one changed file?

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

Comment it out by prepending #, or just remove that line.

@jonthegeek
Contributor Author

BOOM!

if (end > 0) {
  end = end - 1;  // try a shorter candidate
} else {
  break;          // end is unsigned; decrementing 0 would wrap around
}

(end is an unsigned int, so when start is 0 it can never become less than start; decrementing it past 0 wraps around rather than going negative. I tried making it signed, but that didn't work out.)
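For reference, this is what the prototype's core loop looks like with the equivalent 1-indexed fixes applied (an illustrative R sketch only, with a hypothetical tokenize_word helper; the real fix is in src/rcpp_wordpiece.cpp):

tokenize_word <- function(term, vocabulary, unk_token = "[UNK]") {
  sub_tokens <- character()
  start <- 1
  while (start <= nchar(term)) {  # <= so 1-character words are processed
    end <- nchar(term)
    cur_substr <- character()
    while (start <= end) {        # <= so single-character pieces can match
      subterm <- substr(term, start, end)
      if (start > 1) subterm <- paste0("##", subterm)
      if (subterm %in% vocabulary) {
        cur_substr <- subterm
        break
      }
      end <- end - 1  # R integers are signed, so no wrap-around here
    }
    if (length(cur_substr) == 0) return(unk_token)  # no piece matched
    sub_tokens <- c(sub_tokens, cur_substr)
    start <- end + 1  # substr() is inclusive: move past the matched piece
  }
  sub_tokens
}

tokenize_word("icos", c("i", "like", "ta", "##cos"))
#> [1] "i"     "##cos"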

Doing some more checks then submitting the PR.

jonthegeek mentioned this issue Nov 4, 2020
@jwijffels
Contributor

Thanks for looking into this. Is it possible to provide some tests showing all the expected behaviour? Thanks.

@jonthegeek
Contributor Author

Will do!

@jwijffels
Contributor

@jonthegeek
@jonathanbratt
I don't like to be the police officer here, but:

Regarding R package https://CRAN.R-project.org/package=wordpiece
Can you please follow some general rules indicated by the CRAN policies (https://cran.r-project.org/web/packages/policies.html)

The ownership of copyright and intellectual property rights of all components of the package must be clear and unambiguous (including from the authors specification in the DESCRIPTION file). Where code is copied (or derived) from the work of others (including from R itself), care must be taken that any copyright/license statements are preserved and authorship is not misrepresented.

Please add the following to the package DESCRIPTION:

person('Jan', 'Wijffels', role = 'ctb', email = 'jwijffels@bnosac.be', comment = "Main functionality in .tokenize_word"), 
person('BNOSAC', role = 'cph', comment = "Main functionality in .tokenize_word") 

And next, change the license of the wordpiece package to a license compatible with the MPL-2 license of the package you took and adapted the .tokenize_word code from (Apache is a more liberal license and is not compatible with MPL-2). I also indicated this at #7. Next, ask CRAN to remove the current wordpiece package due to this MPL violation, and re-upload the wordpiece package to CRAN under whichever MPL-2-compatible license you decide to choose.

Or just rewrite the tokenizer based on the Hugging Face Python code and skip all of the above; but please don't just copy-paste code you've found on the internet without taking into account the proper copyright/license statements.

@jonthegeek
Contributor Author

I don't believe our .tokenize_word has any relationship to yours; we ended up writing it from scratch. I will look more closely in the morning to make sure it does not overlap. Our intention was definitely not to steal your code.

@jwijffels
Contributor

I really don't mind if someone reuses my code; in fact, just go ahead. But do follow the rules indicated by the CRAN policies and the MPL, that's all. That also applies to derived work.

@jonthegeek
Contributor Author

Jonathan based this on @jonathanbratt's code from jonathanbratt/RBERT rather than on the code from your package, after I tested speed and found no appreciable difference for the tokenization step. We decided to let others (stringr) handle the C/C++ code rather than maintaining a version of our own. I do not believe Jonathan's code has any relationship to yours.

@jwijffels
Contributor

Ok never mind then.

@jonathanbratt

To confirm what @jonthegeek said, this was an independent implementation. We had indeed considered building on sentencepiece, but our speed tests didn't show a significant difference, so I just took what we already had in RBERT and pulled it into a separate package.

But this reminds me: sentencepiece does have functionality that wordpiece doesn't, so I definitely want to add a mention to the README in our next update.

All the best!

@jwijffels
Contributor

If that was the case, no problem. My apologies for tagging you both then.
