Added filter_range parameter that allows to filter answers with similar start/end indices #680

Merged: 2 commits, Jan 12, 2021


julian-risch
Member

Design choices
Current question answering predictions contain (near-)duplicates. To ensure that the answer options come from a variety of text positions, this PR introduces a filtering step during the generation of predictions. The filtering is controlled by a new integer class variable, filter_range, on the class QuestionAnsweringHead. It is applied in the method get_top_candidates.
The default behavior is unchanged and corresponds to filter_range set to -1 (or smaller). Setting filter_range to 0 removes exact duplicates (same start or end index). Setting filter_range to any larger value treats answers with similar start or end indices as duplicates, e.g., filter_range=5 treats two answers with start_idx 4 and start_idx 9 as duplicates.
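The filtering rule described above can be sketched in plain Python. This is a minimal illustration, not the code in get_top_candidates: the function name, the (start_idx, end_idx, score) tuple layout, and the greedy best-first strategy are assumptions made for the example. It uses the name filter_range from this description, although the parameter was renamed later in the review.

```python
from typing import List, Tuple

# Hypothetical candidate layout for this sketch: (start_idx, end_idx, score).
Candidate = Tuple[int, int, float]


def filter_similar_candidates(
    candidates: List[Candidate], filter_range: int
) -> List[Candidate]:
    """Greedily keep the best-scoring candidates, dropping any candidate whose
    start OR end index lies within `filter_range` of an already kept one."""
    # Negative values disable filtering, mirroring the described default of -1.
    if filter_range < 0:
        return list(candidates)

    kept: List[Candidate] = []
    # Visit candidates from best to worst so the best of each cluster survives.
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        is_duplicate = any(
            abs(start - kept_start) <= filter_range
            or abs(end - kept_end) <= filter_range
            for kept_start, kept_end, _ in kept
        )
        if not is_duplicate:
            kept.append((start, end, score))
    return kept


candidates = [(4, 10, 0.9), (9, 20, 0.8), (4, 30, 0.7), (40, 45, 0.6)]
print(filter_similar_candidates(candidates, -1))  # no filtering
print(filter_similar_candidates(candidates, 0))   # (4, 30, 0.7) shares start 4
print(filter_similar_candidates(candidates, 5))   # start 9 is within 5 of start 4
```

With filter_range=5, only (4, 10, 0.9) and (40, 45, 0.6) remain in this toy input, which matches the start_idx 4 versus start_idx 9 example in the description.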

Tests added
test_duplicate_answer_filtering() checks that no two generated answers share the same start or end index.
test_no_duplicate_answer_filtering() checks that the default behavior is unchanged, i.e., that the answers still contain duplicates.
test_range_duplicate_answer_filtering() checks that filter_range = 5 leads to answers with a distance of at least 6 between start indices or end indices.
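The property these tests check can be expressed as a small helper. The sketch below is not the code in test/test_question_answering.py; the helper name and the (start_idx, end_idx) tuples are hypothetical, and it assumes that a pair counts as too close when either the start indices or the end indices differ by at most filter_range.

```python
from itertools import combinations
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start_idx, end_idx) pair


def too_close_pairs(spans: List[Span], filter_range: int) -> List[Tuple[Span, Span]]:
    """Return every pair of spans whose start or end indices differ by
    at most `filter_range`, i.e. pairs the filter should have removed."""
    return [
        (a, b)
        for a, b in combinations(spans, 2)
        if abs(a[0] - b[0]) <= filter_range or abs(a[1] - b[1]) <= filter_range
    ]


# With filter_range=5, surviving answers must be at least 6 apart.
filtered = [(4, 10), (12, 20), (40, 45)]
unfiltered = [(4, 10), (9, 20), (40, 45)]

assert too_close_pairs(filtered, filter_range=5) == []
assert too_close_pairs(unfiltered, filter_range=5) == [((4, 10), (9, 20))]
```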

Limitations
If filter_range is too large (e.g., as large as the number of tokens in the given context), only one answer will be generated.
The similarity of answers is defined solely by their start and end indices; answers with similar texts but different indices are not considered duplicates.

Future enhancements
Take into account textual similarity rather than only the indices. For example, compare exact words instead of start and end indices. For more advanced solutions, one could use locality-sensitive hash functions on the text of the generated answers and define a threshold of accepted hash distance.
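As a rough illustration of the word-comparison idea, the sketch below filters answers by the Jaccard overlap of their whitespace-separated tokens. It is a deliberately simpler stand-in for the locality-sensitive hashing mentioned above, and the function names, the (text, score) layout, and the 0.5 threshold are all assumptions for the example.

```python
from typing import List, Tuple

# Hypothetical answer layout for this sketch: (answer_text, score).
Answer = Tuple[str, float]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercased whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_by_text(answers: List[Answer], max_similarity: float = 0.5) -> List[Answer]:
    """Greedily keep the best-scoring answers whose token overlap with every
    already kept answer does not exceed `max_similarity`."""
    kept: List[Answer] = []
    for text, score in sorted(answers, key=lambda a: a[1], reverse=True):
        if all(jaccard(text, kept_text) <= max_similarity for kept_text, _ in kept):
            kept.append((text, score))
    return kept


answers = [("the capital Berlin", 0.9), ("capital Berlin", 0.8), ("Paris", 0.5)]
print(filter_by_text(answers))
```

Here "capital Berlin" shares two of three distinct tokens with "the capital Berlin" and is dropped, regardless of where either answer starts or ends in the context.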

Closes #667

@julian-risch julian-risch requested a review from Timoeller January 7, 2021 16:27
Contributor

@Timoeller Timoeller left a comment


Looking good already, let's discuss the proposed changes.

examples/question_answering_filtering_similar_answers.py (outdated, resolved)
farm/modeling/prediction_head.py (outdated, resolved)
Renaming filter_range parameter
Removing example of duplicate answer filtering
@julian-risch
Member Author

As discussed, I removed the example, used fixtures to speed up the CI, and renamed the parameter.

Contributor

@Timoeller Timoeller left a comment


Thanks for the improvements. I actually looked a bit deeper into the testing and found potential for improvement.

test/test_question_answering.py (resolved)
@julian-risch
Member Author

The tests now check for the exact start and end indices so that they are more explicit.
test_no_duplicate_answer_filtering() passed because, here, answers only need to have the same start index OR the same end index to count as duplicates. The current method in the PredictionHead filters out duplicates only if they have the same start AND the same end index.
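The difference between the two duplicate definitions can be shown on a toy input. The predicates and helper below are hypothetical names for illustration, not functions from the PredictionHead; spans are plain (start_idx, end_idx) tuples.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # hypothetical (start_idx, end_idx) pair


def same_start_and_end(a: Span, b: Span) -> bool:
    """AND definition: only identical spans count as duplicates."""
    return a[0] == b[0] and a[1] == b[1]


def same_start_or_end(a: Span, b: Span) -> bool:
    """OR definition: sharing either boundary is enough."""
    return a[0] == b[0] or a[1] == b[1]


def count_duplicate_pairs(spans: List[Span], is_dup: Callable[[Span, Span], bool]) -> int:
    """Count the unordered pairs of spans that `is_dup` considers duplicates."""
    return sum(1 for a, b in combinations(spans, 2) if is_dup(a, b))


spans = [(4, 10), (4, 12), (7, 12)]
print(count_duplicate_pairs(spans, same_start_and_end))  # 0: no identical spans
print(count_duplicate_pairs(spans, same_start_or_end))   # 2: shared start 4, shared end 12
```

A list that is already free of duplicates under the AND definition can therefore still contain duplicates under the OR definition, which is why a test expecting duplicates in the default output can pass.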

@julian-risch julian-risch force-pushed the filtering_similar_answers branch from 7ef91fd to af04d97 Compare January 12, 2021 15:55
@Timoeller Timoeller self-requested a review January 12, 2021 16:22
Contributor

@Timoeller Timoeller left a comment


LGTM

Successfully merging this pull request may close these issues.

Add filter for similar QA predictions