Add html parser for RAG and some improvements #2271

Merged: 3 commits, Apr 5, 2024

thinkall (Collaborator) commented Apr 4, 2024

Why are these changes needed?

  • Added an HTML parser to retrieve_utils (borrowed from browser_utils).
  • get_file_from_url handles save_path better: it no longer breaks when save_path is a directory.
  • Calls to get_file_from_url are more robust: a broken URL no longer causes a failure.
  • Enabled overlap in split_text_to_chunks; this is not exposed to the RAG agent yet.
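
For readers who want a concrete picture of two of the changes above, here is a minimal sketch in plain Python. It is not the code from this PR: `resolve_save_path` and `split_with_overlap` are hypothetical helper names, the fallback file name is an assumption, and the splitter works on a list of words rather than on whatever unit split_text_to_chunks actually uses. The first helper shows one way to keep a download from breaking when save_path is a directory; the second shows what overlapping chunks look like.

```python
import os
from typing import List, Optional
from urllib.parse import urlparse


def resolve_save_path(url: str, save_path: Optional[str] = None) -> str:
    """Hypothetical helper: pick a file path for a download.

    If save_path is an existing directory, a file name derived from the
    URL is appended instead of trying to write to the directory itself.
    """
    # Take the last path segment of the URL; the fallback name is an assumption.
    file_name = os.path.basename(urlparse(url).path) or "downloaded_file"
    if save_path is None:
        return file_name
    if os.path.isdir(save_path):
        return os.path.join(save_path, file_name)
    return save_path


def split_with_overlap(words: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Hypothetical helper: split words into chunks that share `overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a chunk has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With `chunk_size=3` and `overlap=1`, `split_with_overlap(list("abcdefg"), 3, 1)` yields three chunks, each starting with the last item of the previous one: `abc`, `cde`, `efg`.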

Related issue number

Checks

codecov-commenter commented Apr 4, 2024

Codecov Report

Attention: Patch coverage is 53.84615%, with 30 lines in your changes missing coverage. Please review.

Project coverage is 50.06%. Comparing base (0d99d45) to head (97c42a6).

Files | Patch % | Lines
autogen/retrieve_utils.py | 53.84% | 24 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2271       +/-   ##
===========================================
+ Coverage   38.39%   50.06%   +11.67%     
===========================================
  Files          78       78               
  Lines        7808     7859       +51     
  Branches     1669     1818      +149     
===========================================
+ Hits         2998     3935      +937     
+ Misses       4560     3593      -967     
- Partials      250      331       +81     
Flag | Coverage Δ
unittest | 14.22% <0.00%> (?)
unittests | 49.03% <53.84%> (+10.65%) ⬆️

Flags with carried forward coverage won't be shown.

thinkall mentioned this pull request on Apr 4, 2024 (11 tasks)

ekzhu (Collaborator) left a comment

In the future, I think we should consider sharing these utility functions across the core library rather than keeping them exclusive to retrieval. For example, the HTML web page parser could be built into a regular tool or user-defined function that any ConversableAgent can use.

cc @gagb @jackgerrits
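
To illustrate the suggestion above, here is a rough sketch of an HTML-to-text parser written as a plain, dependency-free function, which is the shape a shareable tool or user-defined function would need. It is an assumption-laden example, not the parser added in this PR: `html_to_text` and `_TextExtractor` are hypothetical names, and it uses only the standard library's `html.parser` rather than whatever browser_utils relies on. How such a function would be registered with an agent is outside this sketch.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script and style."""

    _SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._parts.append(text)

    def text(self) -> str:
        return "\n".join(self._parts)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the function takes a string and returns a string, with no retrieval-specific state, it could be reused outside the retrieval code path, which is the point of the comment above.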

sonichi enabled auto-merge on April 5, 2024 05:14
sonichi added this pull request to the merge queue on Apr 5, 2024
Merged via the queue into main with commit 6b1376b on Apr 5, 2024
63 of 75 checks passed
sonichi deleted the add_html_parser branch on April 5, 2024 05:29
whiskyboy pushed a commit to whiskyboy/autogen that referenced this pull request on Apr 17, 2024
Co-authored-by: Chi Wang <wang.chi@microsoft.com>
Labels
rag retrieve-augmented generative agents
4 participants