
1-character wordpieces fail to encode #4

Closed
jonthegeek opened this issue Nov 4, 2020 · 26 comments · Fixed by #5
@jonthegeek
Contributor

There appears to be a bug in the wordpiece implementation around 1-character words:

sentencepiece::wordpiece_encode(
  x = "i like tacos", 
  vocabulary = c(
    "i", "like", "ta", "##cos"
  )
)
#> [[1]]
#> [1] "[UNK]" "like"  "ta"    "##cos"

I did a few other tests, and it appears to be 1-character "words" in general, but it's always possible I'm misidentifying the issue.

We're strongly considering putting together a separate wordpiece package for BERT-style encoding, so I wanted to see if your implementation would do the trick, and right now it doesn't quite. It's super close, though, so it'd be great if we could track this down (a smaller footprint in a separate wordpiece package would be ideal... but we'll take your fast implementation over the monstrosity we have in RBERT!).

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

I basically took https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py#L512 and converted it to C++: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L5
To be honest, I'm not using wordpiece in production anywhere, so it's not tested thoroughly; I'm mostly using the sentencepiece models in production. So I would be grateful if you could indicate where the conversion went wrong (maybe something along the lines of Python/C++/R indexing starting from 0 or 1?).
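As a quick sketch of where such an off-by-one could creep in (illustrative R, not the actual package code): the Hugging Face Python loop is 0-indexed, so `while start < len(chars)` still runs for a 1-character word, whereas a literal translation of `<` into 1-indexed R does not.

# Python (0-indexed): start = 0; 0 < len("i") is TRUE, so the loop runs.
# R (1-indexed): a literal translation of '<' skips 1-character words.
term  <- "i"
start <- 1
start <  nchar(term)  # FALSE: the loop body never runs for a 1-char word
start <= nchar(term)  # TRUE: the 1-indexed translation needs <=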

@jonthegeek
Contributor Author

I would not be surprised if it's an indexing issue!
My C++ is suuuuuuuper weak (I literally haven't used it in almost 30 years), but I've been meaning to brush up, so I'll see if I can find the issue!

@jwijffels
Contributor

Beware that compilation of this package takes some time (30 minutes on my old Windows machine from 2013), so while debugging the C++ code it might be a good idea to blow away only the relevant .so files, by commenting out this line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L62 and also commenting out the line https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 while developing.

@jonthegeek
Contributor Author

Thankfully it's far less than 30 minutes on my machine, but thanks, I was about to try to figure out how to do that!

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

Seems to be just this: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8. Removing the -1 part should do it; otherwise it never enters the loop on length-1 strings.
I wonder why I put that there. Probably I prototyped the function in R before porting it to C++. Yes, that was probably it: I still have that R prototype in my dev folder, and there I had the following.

wordpiece_encode <- function(x, vocabulary, unk_token = "[UNK]", max_input_chars_per_word = 200){
  x <- trimws(x)
  x <- strsplit(x, split = " ")
  x <- lapply(x, FUN=function(terms){
    lapply(terms, FUN=function(term){
      if(nchar(term) > max_input_chars_per_word){
        return(unk_token)
      }else{
        output_tokens <- character()
        sub_tokens <- character()
        start <- 1
        is_bad <- FALSE
        while(start < nchar(term)){ # bug: 1-character terms never enter this loop (1 < 1 is FALSE)
          end        <- nchar(term)
          cur_substr <- character()
          while(start < end){ # bug: the single-character candidate at position start is never tried
            subterm <- substr(term, start = start, stop = end)
            subterm <- paste(subterm, collapse = "")
            print(subterm) # debug output
            if(start > 1){
              subterm <- paste("##", subterm, sep = "")
            }
            if(subterm %in% vocabulary){
              cur_substr <- subterm
              sub_tokens <- append(sub_tokens, subterm)
              break
            }
            end <- end - 1
          }
          print(cur_substr) # debug output
          if(length(cur_substr) == 0){
            is_bad <- TRUE
            break
          }
          start <- end # note: with 1-indexed inclusive substr(), this should arguably be end + 1
        }
        if(is_bad){
          output_tokens <- append(output_tokens, unk_token)
        }else{
          output_tokens <- append(output_tokens, sub_tokens)
        }
        return(output_tokens)
      }
    })
  })
  x
}
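Running that prototype on the failing example reproduces the report above (ignoring the debug prints): a 1-character term never enters the outer loop, so it yields no subtokens at all.

out <- wordpiece_encode("i like tacos",
                        vocabulary = c("i", "like", "ta", "##cos"))
out[[1]][[1]]
#> character(0)  # "i" produced nothing: 1 < nchar("i") is FALSE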

@jwijffels
Contributor

It would be great if you could test this out to validate correctness if, on line https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L8, we remove the -1 part.

@jonthegeek
Contributor Author

Went away for lunch for a minute; trying now... but I'm pretty sure that line was what I tried first. I'll try to work through what's actually happening and understand it, though!

I don't understand what I need to do to keep recompilation from taking forever, evidently. If I only need to recompile rcpp_wordpiece.cpp, what do I do?

@jonthegeek
Contributor Author

Update: Ugh, yes, that's totally it; I evidently hadn't actually tried that one yet!

Writing a test or three, then I'll PR.
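Something along these lines (a sketch using testthat; the expected values are taken from the examples in this thread):

library(testthat)
library(sentencepiece)

test_that("1-character wordpieces encode", {
  vocab <- c("i", "like", "ta", "##cos")
  # a 1-character word in the vocabulary must encode as itself, not [UNK]
  expect_equal(wordpiece_encode("i like tacos", vocab)[[1]],
               c("i", "like", "ta", "##cos"))
})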

jonthegeek added a commit to jonthegeek/sentencepiece that referenced this issue Nov 4, 2020
Closes bnosac#4. Note: There were a bunch of new .o files in src/sentencepiece/src but I didn't include them since I didn't intentionally change anything there.
@jonthegeek
Contributor Author

Nope, not quite. Digging through to see why "icos" doesn't tokenize as expected:

library(sentencepiece)
#> Warning: package 'sentencepiece' was built under R version 4.0.3
wordpiece_encode(
  x = c("tacos i like", "i like tacos", "icos"), 
  vocabulary = c(
    "i", "like", "ta", "##cos"
  )
)
#> [[1]]
#> [1] "ta"    "##cos" "i" "like" 
#> 
#> [[2]]
#> [1] "i" "like"  "ta"    "##cos"
#> 
#> [[3]]
#> [1] "[UNK]"

@jwijffels
Contributor

The current implementation (at CRAN) gives:

> library(sentencepiece)
Warning message:
package ‘sentencepiece’ was built under R version 4.0.3 
> #> Warning: package 'sentencepiece' was built under R version 4.0.3
> wordpiece_encode(
+     x = c("tacos i like", "i like tacos", "icos"), 
+     vocabulary = c(
+         "i", "like", "ta", "##cos"
+     )
+ )
[[1]]
[1] "ta"    "##cos" "[UNK]" "like" 

[[2]]
[1] "[UNK]" "like"  "ta"    "##cos"

[[3]]
[1] "[UNK]"

@jonthegeek
Contributor Author

The third example should come out as "i" "##cos".

I think it's related to https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_wordpiece.cpp#L19, but I'm still working on wrapping my head around it, and it's still a bit tough to iterate. I don't think I'm grokking how to reduce the amount of recompilation.

@jonthegeek
Contributor Author

jonthegeek commented Nov 4, 2020

Updating line 19 to while(start <= end){ appears to have fixed it!

Edit: and also line 15, same edit.

@jonthegeek
Contributor Author

And... not quite: one of those has to be <, I think, because I appear to be in an infinite loop with another test (words not in the vocab).

@jwijffels
Contributor

👍 You're definitely in the same mental state I was in when I wrote that function.

@jonthegeek
Contributor Author

Ok, gonna walk through and actually understand every line, or at least try to.

What should I put on https://github.com/bnosac/sentencepiece/blob/master/src/Makevars#L61 to make it recompile only the one changed file?

@jwijffels
Contributor

jwijffels commented Nov 4, 2020

Comment it out by prepending #, or just remove that line.

@jonthegeek
Contributor Author

BOOM!

if (end > 0) {
  end = end - 1;  // try a shorter candidate
} else {
  break;          // end is unsigned; decrementing 0 would wrap around
}

(end is an unsigned int, so when start is 0 it can never become less than start; decrementing it past 0 wraps around rather than going negative. I tried making it signed, but that didn't work out.)
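For reference, this is what the prototype's core loop looks like with the equivalent 1-indexed fixes applied (an illustrative R sketch only, with a hypothetical tokenize_word helper; the real fix is in src/rcpp_wordpiece.cpp):

tokenize_word <- function(term, vocabulary, unk_token = "[UNK]") {
  sub_tokens <- character()
  start <- 1
  while (start <= nchar(term)) {  # <= so 1-character words are processed
    end <- nchar(term)
    cur_substr <- character()
    while (start <= end) {        # <= so single-character pieces can match
      subterm <- substr(term, start, end)
      if (start > 1) subterm <- paste0("##", subterm)
      if (subterm %in% vocabulary) {
        cur_substr <- subterm
        break
      }
      end <- end - 1  # R integers are signed, so no wrap-around here
    }
    if (length(cur_substr) == 0) return(unk_token)  # no piece matched
    sub_tokens <- c(sub_tokens, cur_substr)
    start <- end + 1  # substr() is inclusive: move past the matched piece
  }
  sub_tokens
}

tokenize_word("icos", c("i", "like", "ta", "##cos"))
#> [1] "i"     "##cos"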

Doing some more checks then submitting the PR.

jonthegeek mentioned this issue Nov 4, 2020
@jwijffels
Contributor

Thanks for looking into this. Is it possible to provide some tests showing all the expected behaviour? Thanks.

@jonthegeek
Contributor Author

Will do!

@jwijffels
Contributor

@jonthegeek
@jonathanbratt
I don't like to be the police officer here, but:

Regarding R package https://CRAN.R-project.org/package=wordpiece
Can you please follow some general rules indicated by the CRAN policies (https://cran.r-project.org/web/packages/policies.html)

The ownership of copyright and intellectual property rights of all components of the package must be clear and unambiguous (including from the authors specification in the DESCRIPTION file). Where code is copied (or derived) from the work of others (including from R itself), care must be taken that any copyright/license statements are preserved and authorship is not misrepresented.

Please add the following to the package DESCRIPTION:

person('Jan', 'Wijffels', role = 'ctb', email = 'jwijffels@bnosac.be', comment = "Main functionality in .tokenize_word"), 
person('BNOSAC', role = 'cph', comment = "Main functionality in .tokenize_word") 

And next, change the license of the wordpiece package to a license compatible with the MPL-2 license of the package you took and adapted the .tokenize_word code from (Apache is a more liberal license and is not compatible with MPL-2). I also indicated this at #7. Next, ask CRAN to remove the current wordpiece package due to this MPL violation, and re-upload the wordpiece package to CRAN under whichever MPL-2-compatible license you decide to choose.

Or just rewrite the tokenizer based on the Hugging Face Python code and skip all of the above; but please don't just copy-paste code you've found on the internet without taking into account the proper copyright/license statements.

@jonthegeek
Contributor Author

I don't believe our .tokenize_word has any relationship to yours; we ended up writing it from scratch. I will look more closely in the morning to make sure it does not overlap. Our intention was definitely not to steal your code.

@jwijffels
Contributor

I really don't mind if someone reuses my code; in fact, just go ahead. But do follow the rules indicated by the CRAN policies and the MPL, that's all. That also applies to derived work.

@jonthegeek
Contributor Author

Jonathan based this on @jonathanbratt's code from jonathanbratt/RBERT rather than on the code from your package, after I tested speed and found no appreciable difference for the tokenization step. We decided to let others (stringr) handle the C/C++ code rather than maintaining a version of our own. I do not believe Jonathan's code has any relationship to yours.

@jwijffels
Contributor

Ok never mind then.

@jonathanbratt

To confirm what @jonthegeek said, this was an independent implementation. We had indeed considered building on sentencepiece, but our speed tests didn't show a significant difference, so I just took what we already had in RBERT and pulled it into a separate package.

But this reminds me: sentencepiece does have functionality that wordpiece doesn't, so I definitely want to add a mention to the README in our next update.

All the best!

@jwijffels
Contributor

If that was the case, no problem. My apologies for tagging you both then.
