Double [CLS] token in the first doc chunk #25

mitchellgordon95 · 2022-05-18T01:38:56Z

I noticed when we tokenize, we set add_special_tokens to True here:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L72

which adds a [CLS] token to the beginning of the doc tokens. But when we embed the chunks with BERT, we also add a [CLS] token to the beginning of each chunk:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L240

So for some chunks (the first chunk in every doc) we will have two [CLS] tokens at the beginning of the chunk. I think the solution here is just to turn off add_special_tokens when going from text -> chunks? Is that correct?
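To make the duplication concrete, here is a minimal sketch in plain Python. It does not call the real tokenizer or the repo's code: `toy_tokenize`, `split_into_chunks`, and `prepare_for_embedding` are hypothetical stand-ins, the word ids are made up, and the ids 101/102 for [CLS]/[SEP] are assumed from the standard BERT vocab.

```python
# Assumed special-token ids (standard BERT vocab); the toy word ids are made up.
CLS_ID, SEP_ID = 101, 102

def toy_tokenize(word_ids, add_special_tokens=True):
    """Hypothetical stand-in for the tokenizer: optionally wraps ids in [CLS] ... [SEP]."""
    ids = list(word_ids)
    if add_special_tokens:
        ids = [CLS_ID] + ids + [SEP_ID]
    return ids

def split_into_chunks(ids, chunk_size):
    """Split a document's token ids into consecutive fixed-size chunks."""
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

def prepare_for_embedding(chunk):
    """Hypothetical stand-in for the embedding step, which prepends [CLS] to every chunk."""
    return [CLS_ID] + chunk

def leading_cls_counts(word_ids, chunk_size, add_special_tokens):
    """Count how many [CLS] ids start each chunk once it is prepared for embedding."""
    doc_ids = toy_tokenize(word_ids, add_special_tokens)
    counts = []
    for chunk in split_into_chunks(doc_ids, chunk_size):
        n = 0
        for token_id in prepare_for_embedding(chunk):
            if token_id != CLS_ID:
                break
            n += 1
        counts.append(n)
    return counts

doc_word_ids = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
with_special = leading_cls_counts(doc_word_ids, 4, add_special_tokens=True)
without_special = leading_cls_counts(doc_word_ids, 4, add_special_tokens=False)
print(with_special)     # first chunk starts with two [CLS] ids
print(without_special)  # every chunk starts with exactly one
```

In this toy setup only the first chunk of the document is affected, which matches the behavior described above.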


lucidrains · 2022-05-18T18:29:07Z

@mitchellgordon95 hey Mitchell, yes indeed, you spotted a problem I knew about but did not address. However, my take is that multiple [CLS] tokens shouldn't harm things too much (I could be totally wrong about that, though).

Yes, you are correct that add_special_tokens controls the addition of [CLS] and [SEP] (the latter of which I use as the end-of-string/document marker).
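Since add_special_tokens governs both tokens, turning it off would also drop the [SEP] that marks the end of the document. One possible workaround, sketched below in plain Python, is to tokenize without special tokens and append [SEP] by hand. This is an illustration only, not what the repo does: `tokenize_doc_keep_sep` and `chunk_and_prepend_cls` are hypothetical helpers, the word ids are made up, and 101/102 are assumed ids for [CLS]/[SEP].

```python
# Assumed special-token ids (standard BERT vocab); the toy word ids are made up.
CLS_ID, SEP_ID = 101, 102

def tokenize_doc_keep_sep(word_ids):
    """Hypothetical: mimic tokenizing with add_special_tokens=False, then append [SEP] manually."""
    return list(word_ids) + [SEP_ID]

def chunk_and_prepend_cls(ids, chunk_size):
    """Split into fixed-size chunks and prepend a single [CLS] to each, as the embedding step would."""
    return [[CLS_ID] + ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

doc_tokens = tokenize_doc_keep_sep([1000, 1001, 1002, 1003, 1004])
prepared_chunks = chunk_and_prepend_cls(doc_tokens, 3)
for chunk in prepared_chunks:
    print(chunk)
```

With this arrangement each chunk carries exactly one [CLS], and the document still ends with a single [SEP].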
