update webscraping tool #843

Merged (1 commit) on Feb 5, 2024
2 changes: 1 addition & 1 deletion examples/crawl_webpage.py
@@ -10,7 +10,7 @@
 
 async def main():
     prompt = """Get data from `paperlist` table in https://papercopilot.com/statistics/iclr-statistics/iclr-2024-statistics/,
-    and save it to a csv file. paper title must include `multiagent` or `large language model`. *notice: print key data*"""
+    and save it to a csv file. paper title must include `multiagent` or `large language model`. *notice: print key variables*"""
     ci = CodeInterpreter(goal=prompt, use_tools=True)
 
     await ci.run(prompt)
9 changes: 4 additions & 5 deletions metagpt/tools/libs/web_scraping.py
@@ -4,19 +4,18 @@
 
 
 @register_tool(tool_type=ToolType.WEBSCRAPING.type_name)
-async def scrape_web_playwright(url, *urls):
+async def scrape_web_playwright(url):
     """
-    Scrape and save the HTML structure and inner text content of a web page using Playwright.
+    Asynchronously Scrape and save the HTML structure and inner text content of a web page using Playwright.
 
     Args:
         url (str): The main URL to fetch inner text from.
-        *urls (str): Additional URLs to fetch inner text from.
 
     Returns:
-        (dict): The inner text content and html structure of the web page, key are : 'inner_text', 'html'.
+        dict: The inner text content and html structure of the web page, keys are 'inner_text', 'html'.
     """
     # Create a PlaywrightWrapper instance for the Chromium browser
-    web = await PlaywrightWrapper().run(url, *urls)
+    web = await PlaywrightWrapper().run(url)
 
     # Return the inner text content of the web page
     return {"inner_text": web.inner_text.strip(), "html": web.html.strip()}
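
Since the function now accepts a single URL, a caller that needs several pages would presumably invoke it once per URL. The sketch below illustrates one way to do that with `asyncio.gather`. It is not part of this change: `FakePage`, `fake_run`, `scrape_one`, and `scrape_many` are hypothetical names, and `fake_run` is a stub standing in for `PlaywrightWrapper().run(url)` so the example runs without a browser.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for the object returned by PlaywrightWrapper().run(url)."""

    inner_text: str
    html: str


async def fake_run(url: str) -> FakePage:
    """Hypothetical stub: pretends to fetch `url` without launching a browser."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return FakePage(inner_text=f"  text of {url}  ", html=f"  <html>{url}</html>  ")


async def scrape_one(url: str) -> dict:
    """Mirrors the single-URL shape of scrape_web_playwright after this change."""
    web = await fake_run(url)
    return {"inner_text": web.inner_text.strip(), "html": web.html.strip()}


async def scrape_many(urls: list[str]) -> list[dict]:
    """Issue one call per URL concurrently; gather keeps results in input order."""
    return await asyncio.gather(*(scrape_one(u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(scrape_many(["https://example.com/a", "https://example.com/b"]))
    for page in pages:
        print(page["inner_text"])
```

Whether concurrent fan-out is appropriate depends on how the real wrapper manages browser instances; a plain sequential loop over `scrape_one` would be the more conservative choice.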