
[Bug]: Does the Browsing Agent need run_ipython action? #4355

Closed
1 task done
enyst opened this issue Oct 13, 2024 · 5 comments
Labels
agent quality Related to specific agent quality bug Something isn't working Stale Inactive for 30 days

Comments

enyst (Collaborator) commented Oct 13, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

Running the Browsing Agent with Deepseek, I got a syntax error, and it turns out that what the LLM was trying to do is... not necessarily "wrong", but we're not ready for it. Maybe we can address this by teaching our browsing agent the run_ipython action? Or can we just send it like this to browsergym?

LLM response:

To summarize the quality and cost information of various language models from the blog article, I need to extract the relevant details and add them to the file. I'll start by summarizing the key points from the article.

send_msg_to_user('Summarizing the quality and cost information of various language models from the blog article...')

# Summarize the key points
summary = """
Key takeaways from the evaluation of LLMs as coding agents on SWE-Bench:

1. Clauder 3.5 Sonnet is the best by a fair amount, achieving a 27% resolve rate with the default agent in OpenHands.
2. GPT-4o lags behind, and o1-mini performed somewhat worse than GPT-4o. Analysis suggests o1 was sometimes "overthinking" tasks, performing extra environment configuration.
3. The strongest open models were Llama 3.1 405B and deepseek-v2.5, which performed reasonably well, even besting some closed models.
4. The price/accuracy tradeoff is an important consideration. A chart of this tradeoff is provided in the article.
"""

# Add the summary to the file
send_msg_to_user(summary)```

The response did include "```python" too.

Result with our current implementation, which doesn't expect variables in send_msg_to_user:

Traceback (most recent call last):
....
  File "/Users/enyst/repos/odie/openhands/agenthub/browsing_agent/response_parser.py", line 42, in parse_action
    return self.default_parser.parse(action_str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/enyst/repos/odie/openhands/agenthub/browsing_agent/response_parser.py", line 108, in parse
    msg_content = args[0].value
                  ^^^^^^^^^^^^^
AttributeError: 'Name' object has no attribute 'value'
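The AttributeError makes sense once you look at the AST: the parser assumes `args[0]` is an `ast.Constant` (a string literal), but `send_msg_to_user(summary)` passes a bare variable, which parses to an `ast.Name` node, and `Name` has `id` rather than `value`. A minimal sketch of the distinction (the function name here is hypothetical, not the OpenHands parser):

```python
import ast

def extract_msg_arg(action_str: str) -> str:
    """Illustrative sketch: pull the argument out of the last
    send_msg_to_user(...) call, handling both string literals
    (ast.Constant) and bare variable names (ast.Name)."""
    tree = ast.parse(action_str)
    call = tree.body[-1].value  # Call node of the last expression statement
    arg = call.args[0]
    if isinstance(arg, ast.Constant):   # send_msg_to_user('literal')
        return arg.value
    if isinstance(arg, ast.Name):       # send_msg_to_user(summary)
        return f"<unresolved variable: {arg.id}>"
    raise ValueError(f"unsupported argument node: {type(arg).__name__}")

print(extract_msg_arg("send_msg_to_user('hello')"))  # hello
print(extract_msg_arg("summary = 'x'\nsend_msg_to_user(summary)"))
```

Resolving the variable would of course require actually executing the preceding assignment, which is essentially what a run_ipython-style action would give us.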

OpenHands Installation

Development workflow

OpenHands Version

No response

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

@enyst enyst added bug Something isn't working agent quality Related to specific agent quality labels Oct 13, 2024
enyst (Collaborator, Author) commented Oct 13, 2024

Curiously, the LLM actually seems to be right: it has been told in the prompt that its action space includes:

send_msg_to_user(text: str)
    Examples:
        send_msg_to_user('Based on the results of my search, the city was built in 1751.')

and then:

Multiple actions can be provided at once, but will be executed sequentially without any feedback from the page.
More than 2-3 actions usually leads to failure or unexpected behavior. Example:
fill('a12', 'example with "quotes"')
click('a51')
click('48', button='middle', modifiers=['Shift'])

So... it seems fair that the LLM may decide to send two send_msg_to_user calls? 😅
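Given that the prompt explicitly permits multiple sequential actions, the response really is multiple top-level calls, and a parser could enumerate them instead of assuming a single action. A hedged sketch (not the actual OpenHands parser) of splitting an action string into its calls:

```python
import ast

def list_actions(action_str: str) -> list[str]:
    """Illustrative sketch: enumerate the names of all top-level
    calls in an LLM action string, since the prompt allows several
    actions to be provided at once."""
    calls = []
    for node in ast.parse(action_str).body:
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            calls.append(node.value.func.id)
    return calls

print(list_actions("send_msg_to_user('a')\nsend_msg_to_user('b')"))
# ['send_msg_to_user', 'send_msg_to_user']
```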

enyst (Collaborator, Author) commented Oct 13, 2024

Example of an interesting response, just for thought:

To summarize the quality and cost information of various language models, I need to extract the relevant details from the text and the chart. I will start by extracting the text information and then focus on the chart.

# Extract text information
text_info = [
    "Clauder 3.5 Sonnet is the best by a fair amount, achieving a 27% resolve rate with the default agent in OpenHands.",
    "GPT-4o lags behind, and o1-mini actually performed somewhat worse than GPT-4o. We went in and analyzed the results a little, and briefly it seemed like o1 was sometimes 'overthinking' things, performing extra environment configuration tasks when it could just go ahead and finish the task.",
    "Finally, the strongest open models were Llama 3.1 405 B and deepseek-v2.5, and they performed reasonably, even besting some of the closed models."
]

# Extract chart information
# Since the chart is an image, I will need to click on it to get more details
click('112'

Note:
The task was to summarize the quality and cost of LLMs. The agent wants both the text, which it summarized in text_info, and then, at the next step, it wants to see the chart. But the way the browsing agent currently works, it will lose text_info at the next step, because it never carries anything forward from step to step except the commands (e.g. only click('112')). Cc: @ketan1741

Thus in my test, this led to a lot of trying and losing, trying and losing, and the agent ended up stuck in a loop.

ketan1741 (Contributor) commented

But the way the browsing agent currently works, it will lose text_info at the next step, because it never includes anything else from step to step, but the commands (e.g. only click('112')).

Yes, that's exactly how it works right now. We should look into ways to improve it. We could include at least the previous one or two observations, thoughts+action, for the next step.
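The suggestion of carrying the previous one or two observations, thoughts, and actions forward could look roughly like the following sketch (all names here are hypothetical, not the OpenHands API): a bounded history window rendered into the next step's prompt, so intermediate results like text_info survive.

```python
from collections import deque

def render_history(history, window: int = 2) -> str:
    """Illustrative sketch: render the last `window`
    (observation, thought, action) triples into the context
    for the next step, so intermediate state is not lost."""
    lines = []
    for obs, thought, action in list(history)[-window:]:
        lines.append(f"Observation: {obs}")
        lines.append(f"Thought: {thought}")
        lines.append(f"Action: {action}")
    return "\n".join(lines)

history = deque(maxlen=10)  # cap total history to bound prompt size
history.append(("blog article loaded", "extract the text first",
                "text_info = [...]"))
history.append(("the chart is an image", "inspect the chart next",
                "click('112')"))
print(render_history(history))
```

Even a window of one or two steps would have let the agent see its own text_info when it came back from clicking the chart.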

github-actions bot commented
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot commented
This issue was closed because it has been stalled for over 30 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 20, 2024