WebSurfer Updated (Selenium, Playwright, and support for many filetypes) #1929

afourney · 2024-03-09T07:16:55Z

Why are these changes needed?

This PR adds Selenium and Playwright variants of the Markdown Web Browser used by WebSurfer. It also adds support for many additional content types and for alternate search engines.

All MarkdownBrowser variants work via the following principle:
1. Fetch a page.
2. Convert it to Markdown.
3. Operate on the Markdown.
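
The three steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: `TinyMarkdownBrowser`, `html_to_markdown`, and the tiny converter class are hypothetical names, and the converter only understands headings, paragraphs, links, and list items.

```python
import re
import urllib.request
from html.parser import HTMLParser


class _HtmlToMarkdown(HTMLParser):
    """Hypothetical, deliberately tiny HTML-to-Markdown converter."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None
        self._skip = 0  # depth inside <script>/<style>, whose text is dropped

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif re.fullmatch(r"h[1-6]", tag):
            self.parts.append("\n\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self.parts.append(f"]({self._href or ''})")
            self._href = None

    def handle_data(self, data):
        if not self._skip:
            # Collapse runs of whitespace so HTML indentation does not leak through.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html):
    parser = _HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


class TinyMarkdownBrowser:
    """Hypothetical read-only browser: fetch, convert, then operate on Markdown."""

    def __init__(self):
        self.page = ""

    def visit(self, url):
        # Step 1: fetch the page (urllib also accepts file:// URLs).
        with urllib.request.urlopen(url) as response:
            raw = response.read().decode("utf-8", errors="replace")
        # Step 2: convert it to Markdown.
        self.page = html_to_markdown(raw)
        return self.page

    def find(self, query):
        # Step 3: operate on the Markdown (here, a case-insensitive line search).
        return [ln for ln in self.page.splitlines() if query.lower() in ln.lower()]
```

Because every later operation only sees the Markdown text, the same `find` logic works no matter how the page was fetched, which is the property the variants in this PR share.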

Such browsers are simple and suitable for read-only agentic use; they cannot be used to interact with complex web applications. Nevertheless, they are a great stopgap, and they are especially useful when browsing local files (e.g., file:///user/afourney/repos/autogen) because they can handle many different file formats (Office docs, PDFs, etc.) and provide a common interface for Q&A, summarization, passage extraction, and similar tasks.
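
One way to picture the multi-format support is a table of converters that all return Markdown, so everything downstream sees a single interface. The sketch below is an assumption for illustration only: `CONVERTERS`, `convert_file`, and the two converter functions are hypothetical, they dispatch on file extension, and they cover only plain text and CSV rather than the Office and PDF formats mentioned above.

```python
import csv
import io
from pathlib import Path


def text_to_markdown(data: str) -> str:
    # Plain text is already valid Markdown; just trim surrounding whitespace.
    return data.strip()


def csv_to_markdown(data: str) -> str:
    # Render CSV rows as a Markdown table, treating the first row as the header.
    rows = list(csv.reader(io.StringIO(data)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Hypothetical registry: file extension -> converter function.
CONVERTERS = {
    ".txt": text_to_markdown,
    ".md": text_to_markdown,
    ".csv": csv_to_markdown,
}


def convert_file(path) -> str:
    path = Path(path)
    converter = CONVERTERS.get(path.suffix.lower())
    if converter is None:
        raise ValueError(f"No converter registered for {path.suffix!r}")
    return converter(path.read_text(encoding="utf-8"))
```

Supporting a new format in this sketch means adding one entry to the registry; the Q&A and summarization steps would not need to change.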

Instructions

When installing AutoGen, use the [websurfer] optional dependencies.

If using Selenium, you must also run `pip install selenium`.

If using Playwright, you must run both `pip install playwright` and `playwright install --with-deps chromium`.
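
Since Selenium and Playwright are optional, calling code may want to check which of them is importable before choosing a browser variant. The following is a minimal sketch under that assumption; `is_installed`, `available_backends`, and `BACKEND_PACKAGES` are hypothetical helpers and not part of this PR.

```python
from importlib.util import find_spec

# Hypothetical mapping of backend name -> top-level package it needs.
BACKEND_PACKAGES = {
    "selenium": "selenium",
    "playwright": "playwright",
}


def is_installed(package: str) -> bool:
    # find_spec returns None when a top-level package cannot be found.
    return find_spec(package) is not None


def available_backends() -> dict:
    # Report which optional backends can be imported in this environment.
    return {name: is_installed(pkg) for name, pkg in BACKEND_PACKAGES.items()}


if __name__ == "__main__":
    for name, ok in available_backends().items():
        print(f"{name}: {'available' if ok else 'missing'}")
```

Note that this only checks for the Python packages. For Playwright it does not verify that the Chromium binary from `playwright install --with-deps chromium` is present.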

Related issue number

#1481, #1534, #1733, #1832

* Add headless browser to the WebSurferAgent, closes #1481
* replace soup.get_text() with markdownify.MarkdownConverter().convert_soup(soup)
* import HeadlessChromeBrowser
* implicitly wait for 10s
* increase max. wait time to 99s
* fix: trim trailing whitespace
* test: fix headless tests
* better bing query search
* docs: add example 3 for headless option

Co-authored-by: Vijay Ramesh <vijay@regrello.com>

codecov-commenter · 2024-03-09T07:17:54Z

Codecov Report

Attention: Patch coverage is 60.62133%, with 469 lines in your changes missing coverage. Please review.

Project coverage is 50.75%. Comparing base (c3193f8) to head (6ba05c9).

Files	Patch %	Lines
autogen/browser_utils/mdconvert.py	71.69%	133 Missing and 34 partials ⚠️
autogen/browser_utils/markdown_search.py	22.98%	120 Missing and 4 partials ⚠️
autogen/browser_utils/requests_markdown_browser.py	71.84%	51 Missing and 16 partials ⚠️
...togen/browser_utils/playwright_markdown_browser.py	27.41%	45 Missing ⚠️
autogen/agentchat/contrib/web_surfer.py	46.15%	23 Missing and 5 partials ⚠️
autogen/browser_utils/selenium_markdown_browser.py	35.71%	27 Missing ⚠️
autogen/browser_utils/abstract_markdown_browser.py	71.79%	11 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1929       +/-   ##
===========================================
+ Coverage   37.94%   50.75%   +12.80%     
===========================================
  Files          77       83        +6     
  Lines        7784     8776      +992     
  Branches     1667     2040      +373     
===========================================
+ Hits         2954     4454     +1500     
+ Misses       4580     3946      -634     
- Partials      250      376      +126

Flag	Coverage Δ
unittest	`12.75% <0.08%> (?)`
unittests	`49.80% <60.62%> (+11.86%)`	⬆️

afourney · 2024-03-09T17:05:11Z

@signalprime @vijaykramesh @INF800

With this PR, I tried to combine your Selenium browser PRs together in one place. Even if it doesn't show in the commit history, I used and learned a lot from each of your contributions, and I welcome your further comments and contributions here. Once this is ready, the final PR will credit each of you, and we can perhaps co-author a blog post.

Further, I believe @INF800's and @vijaykramesh's PRs used Selenium to call Bing search, which is clever in that it simplifies the requirements for getting up and running (you don't need to register for an API key). However, I opted to leave this out in favor of the API because it is a better fit for our automated use. Bing actively discourages scraping, and supporting that approach long term would involve actively evading bot detection. I am open to adding further modularity and configurability to support other search engines that don't require an API key, perhaps DuckDuckGo, arXiv, etc.
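
To make the modularity idea concrete, here is one possible shape for a pluggable search backend: each engine only implements a lookup, and a shared method renders the results as Markdown for the browser. This is a sketch under stated assumptions, not the interface in this PR. `SearchBackend`, `SearchResult`, and `StaticSearch` are hypothetical names, and `StaticSearch` is an offline stand-in rather than a real search engine client.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


class SearchBackend(ABC):
    """Hypothetical base class: engines differ only in how they look results up."""

    @abstractmethod
    def lookup(self, query: str) -> List[SearchResult]:
        ...

    def search(self, query: str) -> str:
        # Shared rendering step: every backend produces the same Markdown layout.
        lines = [f"## Search results for '{query}'"]
        for i, r in enumerate(self.lookup(query), 1):
            lines.append(f"{i}. [{r.title}]({r.url})\n   {r.snippet}")
        return "\n\n".join(lines)


class StaticSearch(SearchBackend):
    """Offline stand-in that filters a fixed list of results; useful for tests."""

    def __init__(self, corpus: List[SearchResult]):
        self.corpus = corpus

    def lookup(self, query: str) -> List[SearchResult]:
        q = query.lower()
        return [
            r for r in self.corpus
            if q in r.title.lower() or q in r.snippet.lower()
        ]
```

An API-backed engine would subclass `SearchBackend` in the same way and keep its key handling inside `lookup`, so the browser never needs to know which engine is in use.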

INF800 and others added 12 commits March 1, 2024 22:05

Handle missing Selenium package.

348d676

Added browser_chat.py example to simplify testing.

bb7a249

Based browser on mdconvert. (#1847)

7535226

* Based browser on mdconvert.
* Updated web_surfer.
* Renamed HeadlessChromeBrowser to SeleniumChromeBrowser

Added an initial POC with Playwright.

8dc2220

Merge branch 'main' into headless_web_surfer

4e7e6a5

Separated Bing search into its own utility module.

1d96568

Simple browser now uses Bing tools.

21b1789

Updated Playwright browser to inherit from SimpleTextBrowser

19bb19c

Got Selenium working too.

c6a7ee3

Renamed classes and files for consistency.

d5d6644

Added more instructions.

acb08c3

afourney requested a review from INF800 March 9, 2024 07:16

afourney had a problem deploying to openai1 March 9, 2024 07:17 — with GitHub Actions Failure

sonichi reviewed Mar 9, 2024

.github/workflows/build.yml (outdated, resolved)

Merge branch 'main' into headless_web_surfer

d19c9c7

afourney had a problem deploying to openai1 March 9, 2024 19:02 — with GitHub Actions Failure

afourney had a problem deploying to openai1 September 25, 2024 19:05 — with GitHub Actions Failure

Fixed style errors.

42fe8f5

afourney had a problem deploying to openai1 September 25, 2024 21:53 — with GitHub Actions Failure

gagb requested review from gagb and removed request for INF800 September 25, 2024 22:05

gagb approved these changes Sep 25, 2024

jackgerrits added this pull request to the merge queue Sep 25, 2024

Merged via the queue into main with commit 0d5163b Sep 25, 2024
39 of 52 checks passed

jackgerrits deleted the headless_web_surfer branch September 25, 2024 22:24
