Adding web page digest function to service module #84

Merged
merged 15 commits into from
Mar 27, 2024

@ZiTao-Li ZiTao-Li commented Mar 18, 2024


Adding web page digest function to service module


Description

Following recent internal requests for webpage parsing, this PR introduces a webpage digest service function to the framework.

  • If an LLM is provided as the model parameter, the webpage is first split (by langchain_text_splitters.HTMLHeaderTextSplitter) and the resulting chunks are analyzed by the LLM one by one.

  • If no LLM is provided, langchain_community.document_transformers.BeautifulSoupTransformer is used to clean the webpage.
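
The two branches above can be sketched as follows. This is only an illustration, not the code in this PR: it uses the standard-library html.parser as a stand-in for both LangChain classes so that it runs without extra dependencies, and the names split_by_headers and digest_webpage, as well as treating the model as a plain callable, are assumptions.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional


class _SectionSplitter(HTMLParser):
    """Stdlib stand-in for a header-based splitter (not the LangChain class)."""

    HEADERS = {"h1", "h2", "h3"}
    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[List[str]] = [[]]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADERS and self.chunks[-1]:
            # A header opens a new chunk once the current one has content.
            self.chunks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks[-1].append(text)


def split_by_headers(html: str) -> List[str]:
    """Return the visible text of the page, one string per header section."""
    parser = _SectionSplitter()
    parser.feed(html)
    parser.close()
    return [" ".join(chunk) for chunk in parser.chunks if chunk]


def digest_webpage(
    html: str,
    model: Optional[Callable[[str], str]] = None,  # hypothetical model interface
) -> str:
    chunks = split_by_headers(html)
    if model is None:
        # No LLM: only clean the page and return its text.
        return "\n".join(chunks)
    # LLM given: analyze the chunks one by one and join the answers.
    return "\n".join(model(chunk) for chunk in chunks)
```

Passing any callable that maps a string to a string (for example a thin wrapper around a model call) exercises the first branch; omitting it exercises the second.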

Checklist

Please check the following items before code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review


@DavdGao DavdGao left a comment

  1. Since there's no overlap in the processes for handling web pages via a model and a third-party library, they should be split into two separate methods like bing_search and google_search. The developer should decide which to use when choosing the service function.
  2. Please see inline comments.
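
One possible shape for the split suggested in point 1 is sketched here. The function names webpage_clean and webpage_digest are hypothetical, the model is assumed to be a plain callable, and the regex-based tag stripping is only a placeholder for the real cleaning step.

```python
import re
from typing import Callable

_TAG = re.compile(r"<[^>]+>")


def _strip_tags(html: str) -> str:
    """Crude placeholder cleaner: drop tags and collapse whitespace."""
    return " ".join(_TAG.sub(" ", html).split())


def webpage_clean(html: str) -> str:
    """Hypothetical library-only service function; takes no model."""
    return _strip_tags(html)


def webpage_digest(html: str, model: Callable[[str], str]) -> str:
    """Hypothetical model-based service function; the model is required."""
    return model(_strip_tags(html))
```

With two functions, neither signature needs an optional model argument, and the developer makes the choice when registering the service function rather than at call time.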

Resolved inline comments (outdated):
  • src/agentscope/service/web_search/web_digest.py (4)

@DavdGao DavdGao left a comment

Please see inline comments.

Resolved inline comments (outdated):
  • src/agentscope/service/web_search/web_digest.py (4)
  • src/agentscope/service/__init__.py (1)
  • tests/web_digest_test.py (1)

@DavdGao DavdGao left a comment

Please see inline comments, and resolve the conflicts.

P.S.
We need to decide whether we want developers to call the parse_html function directly. If so:

  1. The keep_raw and html_parse_func arguments of the parse_html function are meaningless to developers who call parse_html directly.
  2. The "return raw" and "parse HTML with a customized function" operations should be handled within the load_web function.
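
A minimal sketch of the layering proposed in the two points above. The names parse_html, load_web, keep_raw and html_parse_func come from this review; the signatures, the injected fetch callable (added so the sketch runs offline), the tag-stripping body, and the precedence of keep_raw over html_parse_func are all assumptions.

```python
import re
from typing import Callable, Optional


def parse_html(html_text: str) -> str:
    """Default parser with no mode flags; placeholder logic strips tags."""
    return " ".join(re.sub(r"<[^>]+>", " ", html_text).split())


def load_web(
    url: str,
    fetch: Callable[[str], str],  # hypothetical downloader, injected for testing
    keep_raw: bool = False,
    html_parse_func: Optional[Callable[[str], str]] = None,
) -> str:
    html_text = fetch(url)
    if keep_raw:
        # "Return raw" is decided here, not inside parse_html.
        return html_text
    if html_parse_func is not None:
        # A customized parser replaces the default one.
        return html_parse_func(html_text)
    return parse_html(html_text)
```

Under this layout parse_html has a single job, so developers who call it directly never see arguments that only make sense inside load_web.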

Resolved inline comments (outdated):
  • src/agentscope/service/web_search/web_digest.py (3)
# Conflicts:
#	src/agentscope/service/text_processing/summarization.py

@DavdGao DavdGao left a comment

lgtm

@DavdGao DavdGao merged commit 69f8798 into modelscope:main Mar 27, 2024
4 checks passed
3 participants