Added filter_range parameter that allows to filter answers with similar start/end indices #680
Looking good already, let's discuss the proposed changes.
Renaming filter_range parameter
Removing example of duplicate answer filtering
As discussed, I removed the example, used fixtures to speed up the CI, and renamed the parameter.
Thanks for the improvements. I actually looked a bit deeper into the testing and found potential for improvement.
The tests now check for the exact start and end indices so that they are more explicit.
7ef91fd to af04d97
LGTM
Design choices
Current question answering predictions contain (near-)duplicates. To ensure a variety of answer options coming from different text positions, this PR introduces a filtering step during the generation of predictions. To control the filtering, there is now an integer filter_range class variable for the class QuestionAnsweringHead. It is applied in the method get_top_candidates.
The default behavior is unchanged and corresponds to filter_range set to -1 (or smaller). Setting filter_range to 0 removes exact duplicates (answers with the same start or end index). Setting filter_range to any larger value also treats answers with similar start or end indices as duplicates, e.g., filter_range=5 treats two answers with start_idx 4 and start_idx 9 as duplicates because their start indices differ by no more than 5.
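The filtering rule described above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in get_top_candidates: the function name filter_near_duplicates, the (start_idx, end_idx, score) tuple layout, and the choice to keep the higher-scoring candidate are all assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical candidate layout for this sketch: (start_idx, end_idx, score).
Candidate = Tuple[int, int, float]


def filter_near_duplicates(candidates: List[Candidate], filter_range: int = -1) -> List[Candidate]:
    """Drop candidates whose start or end index lies within filter_range of a kept one."""
    if filter_range < 0:
        # Negative values disable filtering, mirroring the default behavior.
        return list(candidates)

    kept: List[Candidate] = []
    # Assumption: visit candidates from best to worst score so the best one survives.
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        is_duplicate = any(
            abs(start - kept_start) <= filter_range or abs(end - kept_end) <= filter_range
            for kept_start, kept_end, _ in kept
        )
        if not is_duplicate:
            kept.append((start, end, score))
    return kept


candidates = [(4, 10, 0.9), (9, 20, 0.8), (4, 30, 0.7), (15, 40, 0.6)]
print(filter_near_duplicates(candidates, filter_range=0))
print(filter_near_duplicates(candidates, filter_range=5))
```

With filter_range=0 only the candidate sharing start index 4 with the top answer is dropped; with filter_range=5 the candidate starting at index 9 is dropped as well.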
Tests added
test_duplicate_answer_filtering() checks that no two generated answers share the same start or end index.
test_no_duplicate_answer_filtering() checks that the default behavior is unchanged, i.e., that the answers still contain duplicates.
test_range_duplicate_answer_filtering() checks that filter_range=5 leads to answers whose start indices or end indices are at least 6 apart.
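The kind of pairwise check these tests perform could look roughly like the sketch below. The helper name has_min_index_distance and the (start_idx, end_idx) pair layout are hypothetical, and the sketch assumes that both the start and the end indices of every pair must be far enough apart; the actual tests in this PR compare against exact expected indices.

```python
from itertools import combinations
from typing import List, Tuple


def has_min_index_distance(answers: List[Tuple[int, int]], min_distance: int) -> bool:
    """Return True if every pair of answers differs by at least min_distance
    in both start index and end index (an assumption made for this sketch)."""
    return all(
        abs(s1 - s2) >= min_distance and abs(e1 - e2) >= min_distance
        for (s1, e1), (s2, e2) in combinations(answers, 2)
    )


# filter_range=5 should leave at least 6 positions between surviving answers.
assert has_min_index_distance([(4, 10), (15, 40)], min_distance=6)
assert not has_min_index_distance([(4, 10), (9, 20)], min_distance=6)
```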
Limitations
If filter_range is too large (e.g., as large as the number of tokens in the given context), only one answer will be generated.
The similarity of answers is solely defined based on their start and end indices and does not consider similar answer texts with different indices as duplicates.
Future enhancements
Take into account textual similarity rather than only the indices. For example, compare exact words instead of start and end indices. For more advanced solutions, one could use locality-sensitive hash functions on the text of the generated answers and define a threshold of accepted hash distance.
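As a rough idea of what text-based filtering might look like, the sketch below compares answers by the Jaccard similarity of their word sets. This is a simpler stand-in for the locality-sensitive hashing mentioned above, not an implementation of it, and the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase, whitespace-split word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty answers are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def filter_similar_texts(answers: List[str], threshold: float = 0.5) -> List[str]:
    """Keep an answer only if it is less similar than threshold to every kept answer.

    Assumes answers arrive ordered from best to worst score.
    """
    kept: List[str] = []
    for text in answers:
        if all(jaccard_similarity(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


print(filter_similar_texts(["the Eiffel Tower", "Eiffel Tower", "in Paris"]))
```

Here "Eiffel Tower" is dropped because it shares two of three distinct words with the first answer, even though its start index would differ.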
Closes #667