Add html parser for RAG and some improvements #2271

thinkall · 2024-04-04T06:24:28Z

Why are these changes needed?

Added an HTML parser to retrieve_utils (borrowed from browser_utils).
Improved save_path handling in get_file_from_url: it no longer breaks when save_path is a directory.
Made the use of get_file_from_url more robust: a broken URL no longer causes a failure.
Enabled overlap in split_text_to_chunks; this is not yet exposed to the RAG agent.
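
To make the save_path and overlap items above concrete, here is a minimal sketch of both ideas. It is not the code from this PR: `resolve_save_path` and `split_with_overlap` are hypothetical names, the `"download"` fallback file name is an assumption, and the splitter works on characters for simplicity, whereas split_text_to_chunks in retrieve_utils may count tokens instead.

```python
import os
from typing import List
from urllib.parse import urlparse


def resolve_save_path(url: str, save_path: str) -> str:
    """Return a file path; if save_path is an existing directory, append the URL's file name."""
    if os.path.isdir(save_path):
        # Assumption: fall back to "download" when the URL path has no file component.
        name = os.path.basename(urlparse(url).path) or "download"
        return os.path.join(save_path, name)
    return save_path


def split_with_overlap(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into character chunks; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_size=4` and `overlap=1`, each chunk begins with the last character of the previous one, so text that straddles a chunk boundary appears in both neighbours.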

Related issue number

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.

codecov-commenter · 2024-04-04T06:26:03Z

Codecov Report

Attention: Patch coverage is 53.84615%, with 30 lines in your changes missing coverage. Please review.

Project coverage is 50.06%. Comparing base (0d99d45) to head (97c42a6).

Files	Patch %	Lines
autogen/retrieve_utils.py	53.84%	24 Missing and 6 partials ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #2271       +/-   ##
===========================================
+ Coverage   38.39%   50.06%   +11.67%     
===========================================
  Files          78       78               
  Lines        7808     7859       +51     
  Branches     1669     1818      +149     
===========================================
+ Hits         2998     3935      +937     
+ Misses       4560     3593      -967     
- Partials      250      331       +81

Flag	Coverage Δ
unittest	`14.22% <0.00%> (?)`
unittests	`49.03% <53.84%> (+10.65%)`	⬆️

Flags with carried forward coverage won't be shown.

ekzhu

In the future, I think we should consider sharing these utility functions across the core library rather than keeping them exclusive to retrieval. For example, the HTML web page parser could be built into a regular tool or user-defined function that any ConversableAgent can use.

cc @gagb @jackgerrits
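
As a rough illustration of the suggestion above, an HTML-to-text helper can be written as a plain function with no dependency on the retrieval code, which is the shape a tool or user-defined function would need. This sketch uses only the Python standard library rather than the bs4-based parser added in this PR; `html_to_text` is a hypothetical name, and registering it with an agent is deliberately not shown.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the function takes a string and returns a string, it could be reused by retrieval and by other callers without either depending on the other.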

Co-authored-by: Chi Wang <wang.chi@microsoft.com>

Add bs4 and overlap

3222c90

thinkall had a problem deploying to openai1 April 4, 2024 06:24 — with GitHub Actions Failure

thinkall requested review from ekzhu, cipherself, sonichi and davorrunje April 4, 2024 06:25

thinkall mentioned this pull request Apr 4, 2024

[Roadmap] RAG #1657

Open

11 tasks

ekzhu approved these changes Apr 4, 2024

Merge main

fb135a8

thinkall had a problem deploying to openai1 April 5, 2024 03:05 — with GitHub Actions Failure

sonichi added the rag retrieve-augmented generative agents label Apr 5, 2024

Merge branch 'main' into add_html_parser

97c42a6

sonichi had a problem deploying to openai1 April 5, 2024 05:14 — with GitHub Actions Failure

sonichi enabled auto-merge April 5, 2024 05:14

sonichi added this pull request to the merge queue Apr 5, 2024

Merged via the queue into main with commit 6b1376b Apr 5, 2024
63 of 75 checks passed

sonichi deleted the add_html_parser branch April 5, 2024 05:29

whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request Apr 17, 2024

Add bs4 and overlap (microsoft#2271)

89b9eeb

Co-authored-by: Chi Wang <wang.chi@microsoft.com>

thinkall mentioned this pull request Jun 19, 2024

[Bug]: overlap parameter in the split_text_to_chunks not used. #1844

Closed
