QA Answers at the Beginning of the Document are Labeled as (0, 0) #558

Closed
EhsanM4t1qbit opened this issue Sep 24, 2020 · 3 comments
Labels: bug (Something isn't working)

EhsanM4t1qbit commented Sep 24, 2020

I suspect there is a bug in the function generate_labels: https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/input_features.py#L577. The conditional statement should be changed to passage_len > start_idx >= 0. In its current form, an answer that starts at the very beginning of the passage (i.e. start_idx = 0) is labeled as (0, 0). This might be related to #552.

from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(...)
data_silo = DataSilo(processor=processor, batch_size=16, automatic_loading=False)
# The answer "endesa" starts at character 0 of the context
basic_texts = {
    "context": "endesa, s.a. financial statements for the year ended 31 december 2018 5 endesa, s.a. "
               "and subsidiaries consolidated financial statements for the year ended 31 december 2018 207",
    "qas": [{"question": "What is the company name?", "id": "0",
             "answers": [{"text": "endesa", "answer_start": 0}],
             "is_impossible": False}],
}

data_silo._load_data(train_dicts=[basic_texts])
print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 0,  0],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])

print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])

After changing the conditional statement to passage_len > start_idx >= 0:

print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 8, 10],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])
print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
  • FARM version: 0.4.8
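
To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is not the FARM source: label_span, passage_offset, and inclusive_start are hypothetical names, the end check passage_len > end_idx > 0 is an assumption, and the token indices (start 0, end 2, passage length 20, offset 8) are illustrative values chosen to mirror the tensors above.

```python
def label_span(start_idx, end_idx, passage_len, passage_offset, inclusive_start=True):
    """Hypothetical stand-in for the boundary check discussed in this issue.

    Returns (start, end) shifted by passage_offset when the answer lies inside
    the passage, and (0, 0) otherwise.
    """
    if inclusive_start:
        # Proposed condition: start_idx == 0 counts as inside the passage
        start_ok = passage_len > start_idx >= 0
    else:
        # Strict condition: start_idx == 0 is rejected
        start_ok = passage_len > start_idx > 0
    # Assumed end check; not taken from the FARM source
    end_ok = passage_len > end_idx > 0
    if start_ok and end_ok:
        return (passage_offset + start_idx, passage_offset + end_idx)
    return (0, 0)


# Answer starting at the first passage token
print(label_span(0, 2, 20, 8, inclusive_start=False))  # (0, 0)
print(label_span(0, 2, 20, 8, inclusive_start=True))   # (8, 10)
```

With the strict comparison, an answer at start_idx = 0 falls through to the (0, 0) branch even though it lies inside the passage.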
@brandenchan
Contributor

Hey @EhsanM4t1qbit, thanks for pointing this out! The new PR #564 implements the fix that you suggested. Can you confirm that this fixes your issue?

@EhsanM4t1qbit
Author

Yes it does, thanks!
Do you think it makes sense to allow passage_len to be equal to end_idx too? (i.e. passage_len >= end_idx > 0)

@tholor added this to the #2 milestone on Oct 6, 2020
@tholor modified the milestones: #2, #3 on Oct 21, 2020
@brandenchan
Contributor

@EhsanM4t1qbit Sorry I missed that last message!

If end_idx = passage_len, it points to a token that is outside of the passage, so I think it's best we keep > instead of >=.

But thanks for raising this issue in the first place! It has helped us greatly in debugging various issues that have come up.
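
The point about end_idx = passage_len can be illustrated with a short sketch. It assumes end_idx is a zero-based, inclusive token index; end_in_passage and the token list are hypothetical and only serve as an example.

```python
def end_in_passage(end_idx, passage_len):
    # An inclusive end index has to address an existing token,
    # so the largest valid value is passage_len - 1
    return passage_len > end_idx > 0


passage_tokens = ["endesa", ",", "s", ".", "a", "."]  # made-up tokens
passage_len = len(passage_tokens)

print(end_in_passage(passage_len - 1, passage_len))  # True: last token of the passage
print(end_in_passage(passage_len, passage_len))      # False: one past the last token

try:
    passage_tokens[passage_len]
except IndexError:
    print("index passage_len is past the last token")
```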
