Double [CLS] token in the first doc chunk #25

Open
mitchellgordon95 opened this issue May 18, 2022 · 1 comment

Comments

@mitchellgordon95
Contributor

I noticed when we tokenize, we set add_special_tokens to True here:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L72

which adds a [CLS] token to the beginning of the doc tokens. But when we embed the chunks with BERT, we also add a [CLS] token to the beginning of the chunk:

https://github.com/lucidrains/RETRO-pytorch/blob/main/retro_pytorch/retrieval.py#L240

So for some chunks (the first chunk in every doc) we will have two [CLS] tokens at the beginning of the chunk. I think the solution here is just to turn off add_special_tokens when going from text -> chunks? Is that correct?
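To make the duplication concrete, here is a minimal sketch in plain Python. It does not call the RETRO-pytorch code or a real tokenizer: `toy_tokenize`, `chunk`, and `prepare_for_bert` are hypothetical stand-ins, and the ids 101 and 102 for [CLS] and [SEP] are assumed to match the standard BERT vocabulary.

```python
CLS_ID = 101  # assumed [CLS] id (standard BERT vocab)
SEP_ID = 102  # assumed [SEP] id (standard BERT vocab)


def toy_tokenize(text, add_special_tokens=True):
    """Hypothetical stand-in for a BERT tokenizer: one fake id per word."""
    ids = [1000 + i for i, _ in enumerate(text.split())]
    if add_special_tokens:
        ids = [CLS_ID] + ids + [SEP_ID]
    return ids


def chunk(ids, chunk_size):
    """Split a flat list of token ids into fixed-size chunks."""
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]


def prepare_for_bert(chunk_ids):
    """Mimics the embedding step, which prepends [CLS] to every chunk."""
    return [CLS_ID] + chunk_ids


doc = "a b c d e f"

# Special tokens added at tokenization time: first chunk gets [CLS] twice.
with_special = [prepare_for_bert(c) for c in chunk(toy_tokenize(doc, True), 4)]

# Special tokens turned off: every chunk gets exactly one [CLS].
without_special = [prepare_for_bert(c) for c in chunk(toy_tokenize(doc, False), 4)]

print(with_special[0][:2])     # first two ids of the first chunk
print(without_special[0][:2])
```

Note that in this sketch, turning off the special tokens also drops the trailing [SEP] from the last chunk of the document.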

@lucidrains
Owner

@mitchellgordon95 hey Mitchell, yes indeed, you spotted a problem I knew about but did not address. However, my take is that multiple [CLS] tokens shouldn't harm things too much (I could be totally wrong about that, though).

Yes, you are correct that add_special_tokens controls the addition of [CLS] and [SEP] (which I use as the end-of-string/document marker).
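Since [SEP] doubles as the end-of-document marker, one possible alternative to disabling add_special_tokens entirely is to keep the special tokens and drop only the leading [CLS] before chunking. The following is a rough sketch of that idea, not the repository's implementation; `strip_leading_cls` is a hypothetical helper and the ids 101 and 102 are again assumed to be the standard BERT ids.

```python
CLS_ID = 101  # assumed [CLS] id (standard BERT vocab)
SEP_ID = 102  # assumed [SEP] id (standard BERT vocab)


def strip_leading_cls(ids, cls_id=CLS_ID):
    """Drop a single leading [CLS] if present; leave everything else intact."""
    if ids and ids[0] == cls_id:
        return ids[1:]
    return ids


# Token ids as they might look after tokenizing with special tokens enabled.
doc_ids = [CLS_ID, 2000, 2001, 2002, SEP_ID]

cleaned = strip_leading_cls(doc_ids)
print(cleaned)  # leading [CLS] removed, trailing [SEP] kept
```

With this approach the embedding step can still prepend its own [CLS] to each chunk, and the document keeps its trailing [SEP].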
