[Roadmap] Web Browsing #2017

afourney · 2024-03-14T05:16:25Z

Tip

Want to get involved?

We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.

Background

Web browsing is quickly becoming a table-stakes capability for agentic systems. For several months, AutoGen has offered basic web browsing capabilities via the WebSurferAgent and browser_utils module. The browser_utils provides a simple text-based browsing experience similar to LYNX, but converts pages to Markdown rather than plain text. The WebSurferAgent then maps incoming requests to operations in this text-based browser. For example, if one were to ask 'go to AutoGen's GitHub page', the WebSurferAgent would map the request to two function calls: web_search("autogen github"), and visit_page(url_of_first_search_result).

Markdown is convenient because modern HTML is very bloated, and Markdown strips most of that away, while leaving essential semantic information such as hyperlinks, titles, tables, etc. A simplified or restricted subset of HTML would likely have worked as well, but we take advantage of the fact that OpenAI's models are quite comfortable with Markdown.

An important design element of the WebSurferAgent is that it basically just maps natural language to browser commands, then outputs the text content of the virtual viewport as a message to other agents. In this way, it leaves all planning and interpretation to other agents in the AutoGen stack.

This arrangement is surprisingly powerful and led to our top submission on GAIA, but has some obvious limitations:

First: Our use of the python requests library means that we are only considering the raw HTML returned for each page visit. If any components of the page are loaded dynamically by JavaScript, we cannot see them.
Second: Since the pages are immediately converted to Markdown, we see only a snapshot of the content and cannot interact with them (e.g., to fill in forms). It's as if we hit print, and are now studying the paper copy.
Third: At no step of the process are we considering the visual material of the page. As multi-modal models become increasingly popular, DocVQA is becoming useful and important.

To grow AutoGen's web browsing capabilities and overcome the above-mentioned limitations, the following roadmap is proposed:

Roadmap

Enhanced Markdown browsing

Given the general simplicity and utility of the existing Markdown-based solution, and in the spirit of starting a to-do list with tasks already complete, PR #1929 proposed enhancing the Markdown browsing in AutoGen in the following ways:

Tasks

Give feedback

Supplement the requests library with headless browsers powered by Selenium and/or Playwright. This allows them to appear as regular web browsers (e.g., in their User-Agents) and also to execute JavaScript before converting pages to Markdown.
Abstract web search, providing an easy way to replace Bing, and a means of operating without a Bing API key
Allow the Markdown browsers to access the local file system (providing directory listings, opening documents, etc.)
Greatly expanding file format support (since we're already converting HTML to Markdown, why stop there? We can also convert pptx, docx, xlsx, pdf, etc.)
Options

Importantly, 1929 combines ideas and code from numerous other PRs including #1534, #1733, #1832, #1572, and possibly others. Each author @vijaykramesh, @signalprime, @INF800, @gagb is credited here.

However, there is more to do on Markdown browsing before we can consider this wrapped up:

Tasks

Give feedback

Tests need to be added to WebSurfer Updated (Selenium, Playwright, and support for many filetypes) #1929
We need documentation, and the WebSurferAgent notebook needs to be updated
All above-mentioned co-contributors are invited to co-author a blog post
Allow costs and window size limits to be specified in the config_list #1682 needs to be merged so that read_page_and_answer can optimally do Q&A (this should be improved, or perhaps abstracted anyway)
When Selenium falls back to requests to download files, we should take the User-Agent and Cookies from the browser and pass them to requests. Introduce SeleniumBrowser #1733 already does this very nicely, and I would like to integrate that here.
Options

Vision-based Interactive Browsing

As handy as it is, Markdown-based browsing will only ever get us so far. To address limitations two and three above, we need to take an interactive and multimodal approach similar to WebVoyager. Such systems generally work using Set-of-Mark prompting -- they take a screenshot of the web page, add labels and obvious bounding boxes to each interactive component, and then ask GPT-v to select elements to interact with via their visual labels. This solves the localization and grounding problem, where vision models have trouble outputting real-world coordinates (e.g., where a mouse should be clicked).

Here again, I want to acknowledge that @shauppi has already demonstrated an initial replication of the WebVoyager work, which is fantastic. I hope that we can maybe work together on this, but ultimately AutoGen likely needs a vision-based web surfing agent as part of its core offering.

Patterned after our existing WebSurferAgent, I propose that any MultimodalWebSurferAgent should adhere to the following design principle:

MultimodalWebSurferAgent should focus only on mapping natural language instructions to low-level browser commands (e.g., scrolling, clicking, visiting a page, etc.) and output both text and a screenshot of the browser viewport. All other planning will be left to other agents in the AutoGen stack.

Importantly, AutoGen is working to support multimodality through the agent stack, and by outputting both screenshots and page text in messages to other agents, we can then use MultimodalWebSurferAgent in many different agent configurations.

This is where the following roadmap and task list lead us:

Tasks

Give feedback

Create a MultimodalWebSurferAgent, similar to WebSurferAgent, but that takes a Playwright page or browser context instead of a Markdown browser
Support the 7 core WebVoyager function calls: Click. Input. Scroll. Wait. Back. Web Search. Answer.
Provide an abstraction for localizing elements in MLM responses, and provide Set-of-Mark bounding boxes and prompts as one implementation.
Related to set-of-mark prompting, explore the accessibility tree (AXTree) for focusable elements, and fully-resolved aria labels and roles (see existing proof of concept)
Explore allowing the agent to write and run JavaScript code in the page's context (it would become a new execution environment for agents)
Ensure compatibility with both vision-capable and text-only conversation partner agents (vision-capable agents should also receive screenshots)
Ensure screenshots and page text snapshots are synchronized.
Support downloads (perhaps falling back to the Markdown approach above)
Options

The text was updated successfully, but these errors were encountered:

gagb · 2024-03-14T08:49:51Z

Fantastic roadmap and description @afourney!

afourney · 2024-03-14T14:18:15Z

Fantastic roadmap and description @afourney!

I'm hoping it might form the basis of a blog post later.

afourney · 2024-03-14T14:23:40Z

@BeibinLi I'd love to hear your thoughts on this part of the design proposal in particular:

Basically, any agents that MultimodalWebSurfer talks to should also be able to "see" the web page via the screenshots (if vision-capable), and direct MultimodalWebSurfer to take further actions (e.g., "Sort the table by cost.", "Scroll to the reviews section.", etc.)

gagb · 2024-03-14T18:37:17Z

Fantastic roadmap and description @afourney!

I'm hoping it might form the basis of a blog post later.

Yesssssss

BeibinLi · 2024-03-14T20:08:23Z

@afourney Yes, ideally it will work.

Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.

Caveat: GPT-4V is not good enough for reading tables and other tasks, and we will need OCR and other rule-based methods for the VisionCapability, which will be then added to the other agents who will talk with the MultimodalWebSurfer.

afourney · 2024-03-14T20:12:26Z

@afourney Yes, ideally it will work.

Do you want to use GPT-4V for MultimodalWebSurfer or for all agents? I think using GPT-4V for all agents might produce better results. See #2013.

Caveat: GPT-4V is not good enough for reading tables and other tasks, and we will need OCR and other rule-based methods for the VisionCapability, which will be then added to the other agents who will talk with the MultimodalWebSurfer.

I was thinking all agents. But the text content of the message would also contain the text of the webpage from the dom (no need for ocr). So ideally any agent can consume it

BeibinLi · 2024-03-14T20:13:37Z

@afourney Got it! Then, yes, this design would work.

skzhang1 · 2024-03-14T21:09:44Z

Great design! One difficulty is to label each interactive components in web. Webvoyager seems to use a separate interactive segmentation model. Correct bounding boxes should be the pre-requirements to mapping natural language instructions to low-level browser commands. I think it is hard for GPT-4v to directly label each element.

afourney · 2024-03-14T21:54:20Z

Great design! One difficulty is to label each interactive components in web. Webvoyager seems to use a separate interactive segmentation model. Correct bounding boxes should be the pre-requirements to mapping natural language instructions to low-level browser commands. I think it is hard for GPT-4v to directly label each element.

Yes. Good point. I want to abstract this step so that we can substitute in different implementations. https://github.com/schauppi/MultimodalWebAgent has a good approach to this. I've also had reasonable results using the accessibility AXTree to enumerate interactive components (focusable, etc.). Once we know which elements are interactive, we can decorate them with the labels and outlines.

Tylersuard · 2024-03-14T22:52:12Z

I was going to say this but for GUIs

afourney · 2024-03-14T23:37:08Z

I was going to say this but for GUIs

General apps or GUIs would require a different mechanism to capture the window and generate events, but the principle would be very similar — we just wouldn’t have the DOM for perfect segmentation, and info.

skzhang1 · 2024-03-14T23:42:26Z

@afourney got it!

Tylersuard · 2024-03-16T06:29:18Z

Actually this may not be a good idea. If we can use agents to automate web browsing, how many jobs might be eliminated?

gasse · 2024-04-08T16:03:27Z

@afourney you should check out our recently released browsergym :)
It is meant to be a flexible framework build upon playwright. It already supports most of the features you describe (AXTree, screenshots, different action spaces).
Disclaimer: I am one of the authors of the library.

afourney · 2024-04-26T00:19:39Z

A quick update. PR #1929 is out of draft, and once merged, will complete much of the Markdown browsing items.

Work on the MultimodalWebSurfer is active and ongoing in the ct_webarena branch of the repo under autogen/autogen/contrib/multimodal_web_surfer. A standalone PR will be prepared once we've stabilized some of the larger issues (e.g., synchronizing text and screenshots)

afourney · 2024-09-25T22:41:13Z

1929 is merged. Will have new about multi-modal support soon.

skzhang1 mentioned this issue Mar 14, 2024

[Roadmap] Agents Self-Improvement/Learning/Optimization #521

Open

jackgerrits added the roadmap Issues related to roadmap of AutoGen label Mar 18, 2024

jackgerrits assigned afourney and gagb Mar 18, 2024

jackgerrits added the in-progress Roadmap is actively being worked on label Mar 18, 2024

jackgerrits changed the title ~~[Roadmap]: Web Browsing in AutoGen~~ [Roadmap] Web Browsing Mar 18, 2024

afourney closed this as completed Sep 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] Web Browsing #2017

[Roadmap] Web Browsing #2017

afourney commented Mar 14, 2024 •

edited

Loading

Want to get involved?

Tasks

Tasks

Tasks

gagb commented Mar 14, 2024

afourney commented Mar 14, 2024

afourney commented Mar 14, 2024 •

edited

Loading

gagb commented Mar 14, 2024

BeibinLi commented Mar 14, 2024

afourney commented Mar 14, 2024

BeibinLi commented Mar 14, 2024

skzhang1 commented Mar 14, 2024 •

edited

Loading

afourney commented Mar 14, 2024

Tylersuard commented Mar 14, 2024

afourney commented Mar 14, 2024

skzhang1 commented Mar 14, 2024

Tylersuard commented Mar 16, 2024

gasse commented Apr 8, 2024

afourney commented Apr 26, 2024

afourney commented Sep 25, 2024

[Roadmap] Web Browsing #2017

[Roadmap] Web Browsing #2017

Comments

afourney commented Mar 14, 2024 • edited Loading

Want to get involved?

Background

Roadmap

Enhanced Markdown browsing

Tasks

Importantly, 1929 combines ideas and code from numerous other PRs including #1534, #1733, #1832, #1572, and possibly others. Each author @vijaykramesh, @signalprime, @INF800, @gagb is credited here.

Tasks

Vision-based Interactive Browsing

Tasks

gagb commented Mar 14, 2024

afourney commented Mar 14, 2024

afourney commented Mar 14, 2024 • edited Loading

gagb commented Mar 14, 2024

BeibinLi commented Mar 14, 2024

afourney commented Mar 14, 2024

BeibinLi commented Mar 14, 2024

skzhang1 commented Mar 14, 2024 • edited Loading

afourney commented Mar 14, 2024

Tylersuard commented Mar 14, 2024

afourney commented Mar 14, 2024

skzhang1 commented Mar 14, 2024

Tylersuard commented Mar 16, 2024

gasse commented Apr 8, 2024

afourney commented Apr 26, 2024

afourney commented Sep 25, 2024

afourney commented Mar 14, 2024 •

edited

Loading

afourney commented Mar 14, 2024 •

edited

Loading

skzhang1 commented Mar 14, 2024 •

edited

Loading