enhance: optimistic search page loading and improved content refinement #415

Merged 1 commit on Feb 7, 2025

@njhale (Member) commented Feb 5, 2025

Improves Google Search page loading, scraping, and LLM-based content refinement.

This PR includes the following enhancements:

  • Wait for result page DOMs to stabilize -- i.e. no changes for 500ms -- before scraping content
  • Do a better job pre-filtering non-content HTML elements before transforming to markdown
  • Produce dense/prettified markdown content on the initial scrape
  • Use tiktoken to get a more accurate token count for content truncation
  • Change content refinement prompt to produce output suitable for truncation (dense markdown instead of JSON)
  • When truncating refined search result content, distribute the Google Search tool's output token budget across results based on their size and the quality of their content (the goal is to keep shorter, higher-quality content as intact as possible)
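
The budget distribution described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Result` type, the `quality` score, and `allocate_budget` are hypothetical names, token counts are assumed to be precomputed (for example with tiktoken), and URLs are assumed to be unique. Each result gets a quality-weighted share of the budget; results that fit inside their share are kept whole, and whatever they leave unused is redistributed among the rest.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    tokens: int     # token count of the refined content (assumed precomputed)
    quality: float  # hypothetical quality score; must be > 0


def allocate_budget(results: list[Result], total_budget: int) -> dict[str, int]:
    """Return the number of tokens each result may keep, keyed by URL."""
    allocation: dict[str, int] = {}
    remaining = list(results)
    budget = total_budget
    while remaining:
        weight_sum = sum(r.quality for r in remaining)
        # Quality-weighted share of the budget that is still unassigned.
        share = {r.url: budget * r.quality / weight_sum for r in remaining}
        fits = [r for r in remaining if r.tokens <= share[r.url]]
        if not fits:
            # Every remaining result exceeds its share: truncate each to it.
            for r in remaining:
                allocation[r.url] = int(share[r.url])
            break
        # Keep results that fit intact and free their unused share.
        for r in fits:
            allocation[r.url] = r.tokens
            budget -= r.tokens
        remaining = [r for r in remaining if r.url not in allocation]
    return allocation
```

Under these assumptions, a short high-quality result is kept intact while longer results split the leftover budget in proportion to their scores.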

Requires the gptscript module in Obot to be bumped to a version that includes: #415 (see obot-platform/obot#1680 for bump PR)

Also addresses: obot-platform/obot#1671 and obot-platform/obot#1423
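
For the first enhancement in the list above (waiting until the DOM has gone 500ms without changes), one possible approach is to poll a snapshot of the page and return once it has stopped changing. The sketch below is an assumption about how such a wait could work, not the implementation in this PR, which may rely on a different mechanism such as an in-page observer. `wait_until_stable` and its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_until_stable(
    snapshot: Callable[[], str],
    quiet_ms: int = 500,
    timeout_ms: int = 10_000,
    poll_ms: int = 50,
) -> bool:
    """Poll snapshot() until it is unchanged for quiet_ms; False on timeout."""
    deadline = time.monotonic() + timeout_ms / 1000
    last = snapshot()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(poll_ms / 1000)
        current = snapshot()
        now = time.monotonic()
        if current != last:
            # Content changed: restart the quiet period.
            last, last_change = current, now
        elif (now - last_change) * 1000 >= quiet_ms:
            return True
    return False
```

With Playwright's synchronous Python API, `page.content` could be passed as the `snapshot` callable. Returning `False` on timeout lets the caller scrape whatever has loaded instead of failing outright.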

@njhale force-pushed the fix/google-search-load-timeouts branch 2 times, most recently from 3c47592 to 4fb8ab2, February 6, 2025 20:42
@njhale requested a review from thedadams, February 6, 2025 21:46
@njhale force-pushed the fix/google-search-load-timeouts branch from 4fb8ab2 to c58e7e3, February 6, 2025 21:48
@njhale marked this pull request as ready for review, February 6, 2025 21:49
- boilerplate and unintelligable text
- unrelated advertisements, links, and web page structure
2. Select excerpts from the refined content that you think would make good notes for conducting detailed research about the topic
3. Compose a concise markdown document containing the excerpts organized in decending order of importance to understanding the topic. Do not paraphrase, summarize, or reword the excerpts. The goal is to preserve as much of the original content as possible.
Collaborator

Suggested change
- 3. Compose a concise markdown document containing the excerpts organized in decending order of importance to understanding the topic. Do not paraphrase, summarize, or reword the excerpts. The goal is to preserve as much of the original content as possible.
+ 3. Compose a concise markdown document containing the excerpts organized in descending order of importance to understanding the topic. Do not paraphrase, summarize, or reword the excerpts. The goal is to preserve as much of the original content as possible.

Member Author

Thanks for catching that! Pushed.

Signed-off-by: Nick Hale <4175918+njhale@users.noreply.github.com>
@njhale force-pushed the fix/google-search-load-timeouts branch from c58e7e3 to f1903b1, February 7, 2025 01:38
@njhale merged commit 04d1535 into obot-platform:main on Feb 7, 2025
1 check passed
njhale added a commit to njhale/tools that referenced this pull request Feb 7, 2025
njhale added a commit that referenced this pull request Feb 7, 2025