[Roadmap] Web Browsing #2017

Closed · 9 of 17 tasks
afourney opened this issue Mar 14, 2024 · 16 comments

Labels: in-progress (Roadmap is actively being worked on), roadmap (Issues related to roadmap of AutoGen)

@afourney (Member) commented Mar 14, 2024

Tip

Want to get involved?

We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.

Background

Web browsing is quickly becoming a table-stakes capability for agentic systems. For several months, AutoGen has offered basic web browsing capabilities via the WebSurferAgent and browser_utils module. The browser_utils provides a simple text-based browsing experience similar to LYNX, but converts pages to Markdown rather than plain text. The WebSurferAgent then maps incoming requests to operations in this text-based browser. For example, if one were to ask 'go to AutoGen's GitHub page', the WebSurferAgent would map the request to two function calls: web_search("autogen github"), and visit_page(url_of_first_search_result).
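
For concreteness, below is a minimal sketch of wiring the existing text-based surfer into a two-agent configuration. The import path and constructor arguments (llm_config, summarizer_llm_config, browser_config) are assumptions about the contrib module's API at the time of writing, so treat this as illustrative rather than definitive:

```python
# Illustrative sketch only: the exact module path and constructor arguments are
# assumptions about the contrib API and may differ from the current release.
import autogen
from autogen.agentchat.contrib.web_surfer import WebSurferAgent

llm_config = {"config_list": autogen.config_list_from_json("OAI_CONFIG_LIST")}

# The surfer maps natural language ("Go to AutoGen's GitHub page.") into
# browser operations such as web_search("autogen github") and visit_page(...).
web_surfer = WebSurferAgent(
    name="web_surfer",
    llm_config=llm_config,
    summarizer_llm_config=llm_config,
    browser_config={"viewport_size": 4096, "bing_api_key": "YOUR_BING_API_KEY"},
)

# Planning and interpretation stay with the other agent(s) in the stack.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

user_proxy.initiate_chat(web_surfer, message="Go to AutoGen's GitHub page.")
```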

Markdown is convenient because modern HTML is very bloated, and Markdown strips most of that away, while leaving essential semantic information such as hyperlinks, titles, tables, etc. A simplified or restricted subset of HTML would likely have worked as well, but we take advantage of the fact that OpenAI's models are quite comfortable with Markdown.
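
As a small illustration of the HTML-to-Markdown step, here is a sketch using the third-party markdownify package purely as an example; the actual browser_utils conversion pipeline may differ:

```python
# Illustration only: strip HTML bloat down to compact Markdown while keeping
# semantic structure (headings, links). browser_utils may do this differently.
from markdownify import markdownify as md  # pip install markdownify

html = """
<div class="header-wrapper" style="margin:0 auto;">
  <h1 id="title"><span class="decorated">AutoGen</span></h1>
  <p>Enable <a href="https://microsoft.github.io/autogen/">next-gen LLM apps</a>.</p>
</div>
"""

print(md(html, heading_style="ATX"))
# Roughly:
# # AutoGen
#
# Enable [next-gen LLM apps](https://microsoft.github.io/autogen/).
```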

An important design element of the WebSurferAgent is that it basically just maps natural language to browser commands, then outputs the text content of the virtual viewport as a message to other agents. In this way, it leaves all planning and interpretation to other agents in the AutoGen stack.

This arrangement is surprisingly powerful and led to our top submission on GAIA, but has some obvious limitations:

  • First: Our use of the Python requests library means that we only ever see the raw HTML returned for each page visit. Any components of the page that are loaded dynamically by JavaScript are invisible to us (see the sketch after this list).
  • Second: Since pages are immediately converted to Markdown, we see only a snapshot of the content and cannot interact with it (e.g., to fill in forms). It's as if we hit print, and are now studying the paper copy.
  • Third: At no step of the process do we consider the visual material of the page. As multimodal models become increasingly popular, this kind of document visual question answering (DocVQA) is becoming useful and important.
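
The sketch below makes the first two limitations concrete by contrasting the raw HTML that requests returns with a JavaScript-rendered DOM from Playwright. Playwright is used here only as an example of a possible interactive back end, not as a settled design choice:

```python
# Contrast the static view (requests: HTML as served, before any JavaScript runs)
# with an interactive, rendered view (Playwright: DOM after scripts execute).
# Playwright is illustrative here, not a committed implementation choice.
import requests
from playwright.sync_api import sync_playwright

url = "https://example.com/"  # substitute any page that builds content client-side

raw_html = requests.get(url, timeout=10).text  # what the Markdown browser sees today

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    rendered_html = page.content()  # serialized DOM after JavaScript has run
    # Unlike a static Markdown snapshot, the live page also stays interactive, e.g.:
    # page.fill("input[name='q']", "autogen")
    browser.close()

# Dynamically injected content shows up only in the rendered version.
print(len(raw_html), len(rendered_html))
```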

To grow AutoGen's web browsing capabilities and overcome the above-mentioned limitations, the following roadmap is proposed:

Roadmap

Enhanced Markdown browsing

Given the general simplicity and utility of the existing Markdown-based solution, and in the spirit of starting a to-do list with tasks already complete, PR #1929 proposed enhancing the Markdown browsing in AutoGen in the following ways:

Tasks

Importantly, #1929 combines ideas and code from numerous other PRs, including #1534, #1733, #1832, #1572, and possibly others. The authors @vijaykramesh, @signalprime, @INF800, and @gagb are each credited here.

However, there is more to do on Markdown browsing before we can consider this wrapped up:

Tasks

Vision-based Interactive Browsing

As handy as it is, Markdown-based browsing will only ever get us so far. To address limitations two and three above, we need to take an interactive and multimodal approach similar to WebVoyager. Such systems generally work using Set-of-Mark prompting: they take a screenshot of the web page, add labels and obvious bounding boxes to each interactive component, and then ask GPT-4V to select elements to interact with via their visual labels. This solves the localization and grounding problem, where vision models have trouble outputting real-world coordinates (e.g., where a mouse should be clicked).
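
As a rough sketch of the Set-of-Mark idea (the element-discovery step and the drawing code are heavily simplified; a real pipeline would worry about visibility, scrolling, and overlapping boxes):

```python
# Set-of-Mark sketch: screenshot the page, find candidate interactive elements,
# and draw a numbered label plus bounding box on each so a vision model can pick
# elements by number rather than by pixel coordinates. Simplified illustration.
import io
from PIL import Image, ImageDraw
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")

    screenshot = Image.open(io.BytesIO(page.screenshot()))
    draw = ImageDraw.Draw(screenshot)

    # Naive notion of "interactive"; real implementations would consult the DOM,
    # the accessibility tree, or a segmentation model instead of a selector list.
    elements = page.query_selector_all("a, button, input, select, textarea")
    marks = {}
    for i, el in enumerate(elements):
        box = el.bounding_box()  # None when the element is not rendered
        if box is None:
            continue
        x0, y0 = box["x"], box["y"]
        x1, y1 = x0 + box["width"], y0 + box["height"]
        draw.rectangle([x0, y0, x1, y1], outline="red", width=2)
        draw.text((x0, max(y0 - 12, 0)), str(i), fill="red")
        marks[i] = el  # the model answers with a number; we map it back to the element

    screenshot.save("som_screenshot.png")  # sent to the vision model with the legend
    browser.close()
```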

Here again, I want to acknowledge that @schauppi has already demonstrated an initial replication of the WebVoyager work, which is fantastic. I hope we can work together on this, but ultimately AutoGen likely needs a vision-based web surfing agent as part of its core offering.

Patterned after our existing WebSurferAgent, I propose that any MultimodalWebSurferAgent should adhere to the following design principle:

MultimodalWebSurferAgent should focus only on mapping natural language instructions to low-level browser commands (e.g., scrolling, clicking, visiting a page, etc.) and output both text and a screenshot of the browser viewport. All other planning will be left to other agents in the AutoGen stack.

Importantly, AutoGen is working to support multimodality through the agent stack, and by outputting both screenshots and page text in messages to other agents, we can then use MultimodalWebSurferAgent in many different agent configurations.
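
One way to picture this output is an OpenAI-vision-style message that carries both the viewport text and the screenshot. The field layout below is an assumption for illustration, not the final AutoGen message schema:

```python
# Sketch of a reply from a MultimodalWebSurferAgent: viewport text for text-only
# agents plus the screenshot for vision-capable agents. The content-part layout
# follows the OpenAI vision format and is an assumption, not AutoGen's final schema.
import base64

def make_surfer_message(viewport_text: str, screenshot_png: bytes) -> dict:
    encoded = base64.b64encode(screenshot_png).decode("utf-8")
    return {
        "role": "assistant",
        "content": [
            {"type": "text", "text": viewport_text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encoded}"}},
        ],
    }

# A text-only agent reads content[0]["text"]; a GPT-4V-style agent can also "see"
# the page via content[1] and reply with, e.g., "Sort the table by cost."
```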

This is where the following roadmap and task list lead us:

Tasks

@gagb (Collaborator) commented Mar 14, 2024

Fantastic roadmap and description @afourney!

@afourney (Member, Author) commented:

> Fantastic roadmap and description @afourney!

I'm hoping it might form the basis of a blog post later.

@afourney (Member, Author) commented Mar 14, 2024

@BeibinLi I'd love to hear your thoughts on this part of the design proposal in particular:
[image: excerpt of the design proposal above]

Basically, any agents that MultimodalWebSurfer talks to should also be able to "see" the web page via the screenshots (if vision-capable), and direct MultimodalWebSurfer to take further actions (e.g., "Sort the table by cost.", "Scroll to the reviews section.", etc.)

@gagb (Collaborator) commented Mar 14, 2024

> Fantastic roadmap and description @afourney!
>
> I'm hoping it might form the basis of a blog post later.

Yesssssss

@BeibinLi (Collaborator) commented:

@afourney Yes, ideally it will work.

Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.

Caveat: GPT-4V is not good enough for reading tables and other tasks, and we will need OCR and other rule-based methods for the VisionCapability, which will then be added to the other agents that talk with the MultimodalWebSurfer.

@afourney (Member, Author) commented:

> @afourney Yes, ideally it will work.
>
> Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.
>
> Caveat: GPT-4V is not good enough for reading tables and other tasks, and we will need OCR and other rule-based methods for the VisionCapability, which will then be added to the other agents that talk with the MultimodalWebSurfer.

I was thinking all agents. But the text content of the message would also contain the text of the webpage from the DOM (no need for OCR), so ideally any agent can consume it.
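
A minimal sketch of pulling that text straight from the DOM (Playwright is used here only as an illustration of a rendering back end with DOM access):

```python
# Sketch: read the visible page text from the DOM instead of OCR'ing the screenshot.
# Playwright is illustrative; any back end that exposes the rendered DOM would do.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    dom_text = page.inner_text("body")  # rendered text, after JavaScript has run
    browser.close()

print(dom_text[:500])  # this is the text that would accompany the screenshot
```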

@BeibinLi (Collaborator) commented:

@afourney Got it! Then, yes, this design would work.

@skzhang1 (Collaborator) commented Mar 14, 2024

Great design! One difficulty is labeling each interactive component on the web. WebVoyager seems to use a separate interactive segmentation model. Correct bounding boxes are a prerequisite for mapping natural language instructions to low-level browser commands. I think it is hard for GPT-4V to directly label each element.

@afourney (Member, Author) commented:

> Great design! One difficulty is labeling each interactive component on the web. WebVoyager seems to use a separate interactive segmentation model. Correct bounding boxes are a prerequisite for mapping natural language instructions to low-level browser commands. I think it is hard for GPT-4V to directly label each element.

Yes. Good point. I want to abstract this step so that we can substitute in different implementations. https://github.com/schauppi/MultimodalWebAgent has a good approach to this. I've also had reasonable results using the accessibility tree (AXTree) to enumerate interactive components (focusable elements, etc.). Once we know which elements are interactive, we can decorate them with labels and outlines.
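
For reference, a rough sketch of the AXTree idea using Playwright's accessibility snapshot (the role set and traversal are simplified compared to what a production implementation would need):

```python
# Sketch: enumerate candidate interactive elements from the accessibility (AX) tree.
# The role list and traversal are simplified; Playwright's snapshot API is used
# purely as an illustration of one way to obtain the tree.
from playwright.sync_api import sync_playwright

INTERACTIVE_ROLES = {"link", "button", "textbox", "checkbox", "combobox", "radio", "tab"}

def collect_interactive(node, out):
    """Depth-first walk of the AX tree, keeping nodes with interactive roles."""
    if node is None:
        return
    if node.get("role") in INTERACTIVE_ROLES:
        out.append({"role": node["role"], "name": node.get("name", "")})
    for child in node.get("children", []):
        collect_interactive(child, out)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/")
    tree = page.accessibility.snapshot()  # full AX tree rooted at the document
    browser.close()

interactive = []
collect_interactive(tree, interactive)
for item in interactive:
    print(item)  # these are the elements that would receive Set-of-Mark labels
```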

@Tylersuard (Collaborator) commented:

I was going to say this but for GUIs

@afourney (Member, Author) commented:

> I was going to say this but for GUIs

General apps or GUIs would require a different mechanism to capture the window and generate events, but the principle would be very similar; we just wouldn't have the DOM to give us perfect segmentation and element info.

@skzhang1 (Collaborator) commented:

@afourney got it!

@Tylersuard (Collaborator) commented:

Actually this may not be a good idea. If we can use agents to automate web browsing, how many jobs might be eliminated?

@jackgerrits added the roadmap (Issues related to roadmap of AutoGen) label on Mar 18, 2024
@jackgerrits added the in-progress (Roadmap is actively being worked on) label on Mar 18, 2024
@jackgerrits changed the title from "[Roadmap]: Web Browsing in AutoGen" to "[Roadmap] Web Browsing" on Mar 18, 2024
@gasse commented Apr 8, 2024

@afourney you should check out our recently released browsergym :)
It is meant to be a flexible framework built upon Playwright. It already supports most of the features you describe (AXTree, screenshots, different action spaces).
Disclaimer: I am one of the authors of the library.

@afourney (Member, Author) commented:

A quick update. PR #1929 is out of draft and, once merged, will complete many of the Markdown browsing items.

Work on the MultimodalWebSurfer is active and ongoing in the ct_webarena branch of the repo, under autogen/autogen/contrib/multimodal_web_surfer. A standalone PR will be prepared once we've stabilized some of the larger issues (e.g., synchronizing text and screenshots).

@afourney (Member, Author) commented:

#1929 is merged. We will have news about multimodal support soon.
