Tapas tokenization Different from Tensorflow Code #13244
Comments
Hi, thanks for your interest in TAPAS. However, I do think the [...]. You can also verify this using a simple example:
As you can see, I've replaced two cell values with "n/a" and "?", i.e. there are some empty cells in the table. This returns:
The empty cells are correctly replaced by the [EMPTY] token.
Thank you very much for your reply! It seems that "n/a" and "?" are tokenized into the [EMPTY] token, but if the cell is an empty string, it is ignored by the tokenizer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
That's interesting @Doreenruirui, are you interested in making a PR to fix this?
unstale
Hi @Doreenruirui,
This is very interesting, thanks for letting me know. Are you interested in opening a PR that includes the fix? We could perhaps also add the table retrieval models to the hub. Thanks!
Hi @NielsRogge
I would like to work on this. I can start if nobody else is working on it. Thanks
@NielsRogge @Doreenruirui This issue seems to be fixed. We can close it.
Environment info
transformers version: 4.9.1
Who can help
@LysandreJik @sgugger @NielsRogge
Information
Model I am using (Bert, XLNet ...): Tapas
When I try to replicate the TAPAS table retrieval results using the Hugging Face Tapas implementation, I find that the Tapas tokenization in Hugging Face differs from the original TensorFlow code. The original code first checks whether the table cell is "n/a", "?" or empty. If so, it returns the "[EMPTY]" token. The Hugging Face code implements the same tokenization as the TensorFlow code, but it is not used to tokenize the tables. This could easily be fixed by changing all calls of self.tokenize to self._tokenize in the _tokenize_table function. After fixing this, I could use the released table retrieval model to replicate their results on the NQ dataset with Hugging Face Tapas.
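To illustrate the difference described above, here is a minimal, self-contained sketch. It does not use or reproduce the transformers code: plain_tokenize, cell_tokenize and tokenize_table are hypothetical stand-ins, and the whitespace split is only an assumption standing in for a real wordpiece tokenizer. It shows why routing table cells through a tokenizer that lacks the empty-cell check makes an empty-string cell vanish, while the checked version emits [EMPTY].

```python
EMPTY_TOKEN = "[EMPTY]"
# Cell values treated as empty, as described in the issue text.
EMPTY_MARKERS = {"", "n/a", "?"}


def plain_tokenize(text):
    """Hypothetical generic tokenizer: lowercases and splits on whitespace.

    It has no empty-cell check, so an empty string yields no tokens.
    """
    return text.lower().split()


def cell_tokenize(text):
    """Hypothetical cell tokenizer: checks for an empty cell first."""
    if text.strip().lower() in EMPTY_MARKERS:
        return [EMPTY_TOKEN]
    return plain_tokenize(text)


def tokenize_table(rows, tokenize_fn):
    """Tokenize every cell of a table (a list of rows) with tokenize_fn."""
    return [[tokenize_fn(cell) for cell in row] for row in rows]


table = [
    ["Actor", "Age"],
    ["Brad Pitt", ""],  # empty-string cell
    ["n/a", "?"],       # cells that should also count as empty
]

without_check = tokenize_table(table, plain_tokenize)
with_check = tokenize_table(table, cell_tokenize)

print(without_check[1])  # the empty cell produces an empty token list
print(with_check[1])     # the empty cell produces the [EMPTY] token
```

In this sketch the only change between the two runs is which function tokenize_table calls per cell, which mirrors the one-line nature of the fix proposed in the issue.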