-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HotPotQA lookup returns the wrong result is asked to look up the same term in 2 different pages sequentially #99
Changes from 5 commits
356db3e
bdc09f5
5b997a4
4fb7bcb
cff7513
43fb054
04a31c0
3274129
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
import re | ||
|
||
import pytest | ||
|
||
from aviary.core import Environment, TaskDataset | ||
|
@@ -27,3 +29,39 @@ def test_dataset_from_name() -> None: | |
|
||
with pytest.raises(ValueError, match="answer"): | ||
TaskDataset.from_name("hotpotqa", split="test") | ||
|
||
|
||
@pytest.mark.asyncio | ||
async def test_tool_results() -> None: | ||
hotpotqa_env: HotPotQAEnv = Environment.from_name( | ||
"hotpotqa", | ||
question=("Which country has a larger population: China or France?"), | ||
correct_answer="China", | ||
) | ||
lookup_pattern = r"^\(Result \d+ / \d+\)\s*(.*)" | ||
|
||
_, _ = await hotpotqa_env.reset() | ||
obs1 = await hotpotqa_env.search("China") | ||
obs2 = hotpotqa_env.construct_lookup_list("population") | ||
|
||
# Check lookup return format | ||
match = re.match(lookup_pattern, obs2) | ||
assert match # Starts with the right pattern | ||
assert ( | ||
match.group(1) + "\n" in hotpotqa_env.state.page | ||
) # Everything after the pattern should be a paragraph in current page | ||
|
||
obs3 = await hotpotqa_env.search("France") | ||
obs4 = hotpotqa_env.construct_lookup_list("population") | ||
|
||
# Check lookup return format | ||
match = re.match(lookup_pattern, obs4) | ||
assert match # Starts with the right pattern | ||
assert ( | ||
match.group(1) + "\n" in hotpotqa_env.state.page | ||
) # Everything after the pattern should be a paragraph in current page | ||
albertbou92 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
obs5 = hotpotqa_env.finish("China") | ||
|
||
# Ensure that the observations are different | ||
assert obs1 != obs2 != obs3 != obs4 != obs5 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the intuition behind this assertion? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just making sure the returned outputs change when they are supposed to. Before this PR, if lookup only finds an occurrence, obs2 and obs4 would be equal, which is not correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw @jamesbraza 's original suggestion, but IMO this is less readable and depends on the order in which
self.tools
is populated. "Finish" is better. If we want to be extra careful, we could make a constantFINISH_TOOL_NAME = "Finish"
and use it in both places.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree with Sid that we shouldn't rely on ordering here @albertbou92 . Can you find a way to:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just
last_action_is_lookup: bool
, since that's the main thing we check withlast_action
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a good solution actually