QA Answers at the Beginning of the Document are Labeled as (0, 0) #558

EhsanM4t1qbit · 2020-09-24T21:52:24Z

I suspect there is a bug in the function generate_labels (https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/input_features.py#L577). The conditional statement should be changed to passage_len > start_idx >= 0. In its current form, an answer that starts at the very beginning of the passage (i.e. start_idx = 0) is labeled as (0, 0). This might be related to #552.

from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(...)
data_silo = DataSilo(processor=processor, batch_size=16, automatic_loading=False)
basic_texts = {
    "context": "endesa, s.a. financial statements for the year ended 31 december 2018 5 endesa, s.a. "
               "and subsidiaries consolidated financial statements for the year ended 31 december 2018 207",
    "qas": [{"question": "What is the company name?", "id": "0",
             "answers": [{"text": "endesa", "answer_start": 0}],
             "is_impossible": False}],
}

data_silo._load_data(train_dicts=[basic_texts])

print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 0,  0],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])

print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])

After changing the conditional statement to passage_len > start_idx >= 0:

print(data_silo.data['train'].datasets[0].tensors[6]) # labels
tensor([[[ 8, 10],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1],
         [-1, -1]]])
print(data_silo.data['train'].datasets[0].tensors[-1]) # seq_2_start_t
tensor([8])
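
The difference between the two outputs can be reproduced in isolation with a minimal sketch. It does not call FARM: `to_label` and its parameters are hypothetical, the passage length of 30 tokens is made up, and the strict `> 0` lower bound is only my reading of the current condition. The passage offset 8 is taken from seq_2_start_t above, and the end index 2 is inferred from the (8, 10) label.

```python
def to_label(start_idx, end_idx, passage_len, passage_offset, allow_zero_start):
    """Map passage-relative token indices to sequence-level label indices.

    Hypothetical helper for illustration; this is not FARM's generate_labels.
    """
    if allow_zero_start:
        # Proposed condition: a span may start at the first passage token.
        start_ok = passage_len > start_idx >= 0
    else:
        # Assumed current condition: start_idx == 0 is rejected.
        start_ok = passage_len > start_idx > 0
    end_ok = passage_len > end_idx > 0

    if not (start_ok and end_ok):
        return (0, 0)  # span treated as not being inside this passage
    return (passage_offset + start_idx, passage_offset + end_idx)


current = to_label(0, 2, passage_len=30, passage_offset=8, allow_zero_start=False)
proposed = to_label(0, 2, passage_len=30, passage_offset=8, allow_zero_start=True)
print(current)   # (0, 0)
print(proposed)  # (8, 10)
```

With the strict lower bound the span is dropped even though it lies entirely inside the passage; with `>= 0` the passage offset is added as expected.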

FARM version: 0.4.8


brandenchan · 2020-10-05T15:19:55Z

Hey @EhsanM4t1qbit, thanks for pointing this out! The new PR #564 implements the fix that you suggested. Can you confirm that it fixes your issue?

EhsanM4t1qbit · 2020-10-05T16:00:28Z

Yes it does, thanks!
Do you think it makes sense to allow passage_len to be equal to end_idx too? (i.e. passage_len >= end_idx > 0)

brandenchan · 2020-11-03T10:38:49Z

@EhsanM4t1qbit Sorry I missed that last message!

If end_idx = passage_len, it points to a token outside the passage, so I think it's best we keep > instead of >=.

But thanks for raising this issue in the first place! It has helped us greatly in debugging various issues that have come up.
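
The point about end_idx = passage_len can be illustrated with plain zero-based indexing. This is only a sketch: the token list is a made-up tokenization, `end_inside_passage` is a hypothetical helper, and it assumes end_idx is an inclusive index of the last answer token (which the (8, 10) label in the report suggests).

```python
# Made-up tokenization, for illustration only.
passage_tokens = ["endesa", ",", "s", ".", "a", "."]
passage_len = len(passage_tokens)


def end_inside_passage(end_idx, passage_len):
    """Upper-bound check only: an inclusive end index must address a real token."""
    return end_idx < passage_len


# The last valid token index is passage_len - 1.
last_token = passage_tokens[passage_len - 1]

# Indexing at passage_len goes one position past the end of the passage.
try:
    passage_tokens[passage_len]
    past_end_raises = False
except IndexError:
    past_end_raises = True

print(end_inside_passage(passage_len - 1, passage_len))
print(end_inside_passage(passage_len, passage_len))
print(past_end_raises)
```

Under that assumption, allowing `passage_len >= end_idx` would accept a label that addresses no token in the passage.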

EhsanM4t1qbit added the bug label Sep 24, 2020

Timoeller assigned brandenchan Sep 29, 2020

brandenchan mentioned this issue Oct 5, 2020

Fix QA bug that rejected spans at beginning of passage #564

Merged

tholor added this to the #2 milestone Oct 6, 2020

tholor modified the milestones: #2, #3 Oct 21, 2020

brandenchan mentioned this issue Nov 2, 2020

Cannot extract QA answer at beginning of document #552

Closed

brandenchan closed this as completed Nov 3, 2020
