
Don't we need to reconcile SpaCy and BERT tokens? #9

Open
hjpark2017 opened this issue Oct 1, 2022 · 2 comments

Comments

@hjpark2017

First of all, thank you for releasing the code from your paper. What I'm curious about is this: spaCy divides sentences into word units, but BERT divides them into WordPiece units, so the two sets of tokens will not map onto each other exactly. Which part of the uploaded program deals with this problem?

@BinLiang-NLP
Owner


Hi,
Thanks for your question.
I agree that spaCy divides sentences into word units while BERT divides them into WordPiece units; that is, the tokens of a small number of samples are incongruent in SenticGCN-BERT. For the datasets used in this work, however, most samples are consistent, so we do not handle this mismatch in our work. You could certainly align the WordPiece units of the BERT model with the word-level tokens for better results.
Please let me know if there is any problem.
Thanks!!!
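For reference, one common way to do the alignment the reply mentions is to map each WordPiece back to the word it came from via BERT's `##` continuation prefix, and then pool (e.g. average) the WordPiece embeddings per word so they line up with the spaCy dependency-graph nodes. A minimal sketch of the index mapping, not part of this repository's code (the function name and example are illustrative):

```python
# Illustrative sketch: map BERT WordPiece tokens back to word-level
# (e.g. spaCy) token indices using the "##" continuation prefix.

def align_wordpieces_to_words(wordpieces):
    """Return, for each WordPiece, the index of the word it belongs to."""
    word_ids = []
    word_idx = -1
    for piece in wordpieces:
        if not piece.startswith("##"):
            word_idx += 1  # a piece without "##" starts a new word
        word_ids.append(word_idx)
    return word_ids

# Example: "The playfulness of cats" under a WordPiece vocabulary
pieces = ["The", "play", "##ful", "##ness", "of", "cats"]
print(align_wordpieces_to_words(pieces))  # → [0, 1, 1, 1, 2, 3]
```

With this mapping, the three pieces of "playfulness" can be averaged into one vector for word index 1, so each spaCy token gets exactly one BERT representation.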

@hjpark2017
Author

I'm sorry for the late reply.
Thank you for your kind explanation!
