Added filter_range parameter that allows to filter answers with similar start/end indices #680

julian-risch · 2021-01-07T16:27:41Z

Design choices
Current question answering predictions contain (near-)duplicates. To ensure a variety of answer options coming from different text positions, this PR introduces a filtering step during the generation of predictions. To control the filtering, there is now an integer filter_range class variable for the class QuestionAnsweringHead. It is applied in the method get_top_candidates.
The default behavior is unchanged and corresponds to filter_range set to -1 (or smaller). Setting filter_range to 0 removes exact duplicates (same start or end index). Setting filter_range to any larger value treats answers with similar start or end indices as duplicates, e.g., filter_range=5 treats two answers with start_idx 4 and start_idx 9 as duplicates.
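
The rule above can be sketched as follows. This is a minimal illustration, not the code added in this PR: the function name filter_similar and the (start_idx, end_idx, score) tuple layout are assumptions, and the real logic lives in get_top_candidates.

```python
from typing import List, Tuple

# Hypothetical candidate layout for this sketch: (start_idx, end_idx, score)
Candidate = Tuple[int, int, float]


def filter_similar(candidates: List[Candidate], filter_range: int) -> List[Candidate]:
    """Greedily keep the best-scoring candidates and drop near-duplicates."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    if filter_range < 0:
        # Negative values mean "no filtering" (the default behavior).
        return ranked
    kept: List[Candidate] = []
    for start, end, score in ranked:
        # A candidate is a duplicate if its start OR its end lies within
        # filter_range of an already kept candidate.
        is_duplicate = any(
            abs(start - k_start) <= filter_range or abs(end - k_end) <= filter_range
            for k_start, k_end, _ in kept
        )
        if not is_duplicate:
            kept.append((start, end, score))
    return kept


candidates = [(4, 10, 0.9), (9, 12, 0.8), (4, 20, 0.7), (30, 35, 0.6)]
unfiltered = filter_similar(candidates, -1)
exact_only = filter_similar(candidates, 0)
within_five = filter_similar(candidates, 5)
```

With these made-up candidates, filter_range=0 drops only the span that shares start index 4 with the best answer, while filter_range=5 also drops the span starting at 9.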

Tests added
test_duplicate_answer_filtering() checks that no two generated answers share the same start or end index.
test_no_duplicate_answer_filtering() tests whether the default behavior is unchanged so that the answers contain duplicates.
test_range_duplicate_answer_filtering() tests whether filter_range = 5 leads to answers whose start indices and end indices are each at least 6 apart.

Limitations
If filter_range is too large (e.g., as large as the number of tokens in the given context), only one answer will be generated.
The similarity of answers is solely defined based on their start and end indices and does not consider similar answer texts with different indices as duplicates.

Future enhancements
Take into account textual similarity rather than only the indices. For example, compare exact words instead of start and end indices. For more advanced solutions, one could use locality-sensitive hash functions on the text of the generated answers and define a threshold of accepted hash distance.
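
As a rough idea of what a text-based filter could look like, here is a sketch that compares word sets with Jaccard similarity. It is a simpler stand-in for the locality-sensitive hashing idea, not an implementation of it, and nothing here is part of this PR: the names word_set, jaccard, filter_by_text, and the max_similarity threshold are all hypothetical.

```python
from typing import FrozenSet, List, Tuple


def word_set(text: str) -> FrozenSet[str]:
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return frozenset(text.lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Share of words the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_by_text(
    answers: List[Tuple[str, float]], max_similarity: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep the best-scoring answers whose text differs enough from kept ones."""
    kept: List[Tuple[str, float]] = []
    for text, score in sorted(answers, key=lambda a: a[1], reverse=True):
        words = word_set(text)
        if all(jaccard(words, word_set(k_text)) <= max_similarity for k_text, _ in kept):
            kept.append((text, score))
    return kept


answers = [("the city of Berlin", 0.9), ("city of Berlin", 0.8), ("Paris", 0.7)]
distinct = filter_by_text(answers)
```

In this made-up example, "city of Berlin" shares three of four words with the best answer and is dropped, regardless of where either span sits in the context.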

Closes #667

Timoeller

Looking good already, let's discuss the proposed changes.

examples/question_answering_filtering_similar_answers.py

farm/modeling/prediction_head.py

test/test_question_answering.py

Renaming filter_range parameter; removing example of duplicate answer filtering

julian-risch · 2021-01-08T14:03:05Z

As discussed, I removed the example, used fixtures to speed up the CI, and renamed the parameter.

Timoeller

Thanks for the improvements. I actually looked a bit deeper into the testing and found potential for improvement.

test/test_question_answering.py

julian-risch · 2021-01-12T10:44:55Z

The tests now check for the exact start and end indices so that they are more explicit.
The test_no_duplicate_answer_filtering() test passed because answers only need to have the same start indices OR the same end indices to count as duplicates here. The current method in the PredictionHead checks for the same start AND the same end indices to filter out duplicates.
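
To make the OR/AND distinction concrete, here is a small sketch with two hypothetical predicates on (start_idx, end_idx) pairs. Neither function name comes from the code base; they only illustrate the two notions of "duplicate".

```python
from typing import Tuple

Span = Tuple[int, int]  # (start_idx, end_idx), assumed layout for this sketch


def same_start_or_end(a: Span, b: Span) -> bool:
    """Duplicate notion used by the new filtering and its tests."""
    return a[0] == b[0] or a[1] == b[1]


def same_start_and_end(a: Span, b: Span) -> bool:
    """Stricter notion: only identical spans count as duplicates."""
    return a[0] == b[0] and a[1] == b[1]


first, second = (4, 10), (4, 20)
or_result = same_start_or_end(first, second)    # True: shared start index
and_result = same_start_and_end(first, second)  # False: end indices differ
```

Two answers like these survive a check for identical spans but still count as duplicates under the OR rule, which is why the default output can contain them.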

Timoeller

LGTM

Added filter_range parameter that allows to filter answers with similar start/end indices

5c300ad

julian-risch requested a review from Timoeller January 7, 2021 16:27

Timoeller suggested changes Jan 8, 2021

Timoeller reviewed Jan 8, 2021

Adding usage of fixtures for test of duplicate_answer_filtering

af04d97

Timoeller reviewed Jan 8, 2021

julian-risch force-pushed the filtering_similar_answers branch from 7ef91fd to af04d97 Compare January 12, 2021 15:55

Timoeller self-requested a review January 12, 2021 16:22

Timoeller approved these changes Jan 12, 2021

Timoeller merged commit 8982bb8 into master Jan 12, 2021

tholor mentioned this pull request Apr 22, 2021

Add option to filter out similar answers from results deepset-ai/haystack#990

Closed
