More comprehensive readthedocs document loader #12382

adrwz · 2023-10-26T22:35:39Z

Description:

When building our own readthedocs.io scraper, we noticed a couple of interesting things:

Text lines with many nested `<span>` tags produce unclean text full of stray newlines. For example, in [Langchain's documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader), a single line is represented by a complicated nested HTML structure, and the naive `soup.get_text()` call currently being made creates a newline for each nested HTML element. As a result, the document loader returns a messy, newline-separated blob of text. This happens in a lot of cases.

Additionally, content from iframes, code from scripts, CSS from styles, etc. is scraped whenever it is nested under the selected element (which happens more often than you'd think). For example, [this page](https://pydeck.gl/gallery/contour_layer.html#) yields 1.5 million characters of scraped content.

Therefore, I wrote a recursive `_get_clean_text(soup)` class function that (1) skips all irrelevant elements and (2) only adds newlines when necessary.
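The idea can be illustrated with a minimal sketch. This is not the code from this PR: the loader's `_get_clean_text` works on BeautifulSoup objects, whereas this illustration uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet so that it runs without dependencies. The names `get_clean_text`, `normalize`, `SKIP_TAGS`, and `BLOCK_TAGS`, as well as the tag lists themselves, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed list of tags whose contents are never useful as document text.
SKIP_TAGS = {"script", "style", "iframe", "noscript"}
# Assumed list of tags that should start on their own line.
BLOCK_TAGS = {"p", "div", "li", "pre", "h1", "h2", "h3", "br", "tr"}


def get_clean_text(element: ET.Element) -> str:
    """Recursively collect text, skipping SKIP_TAGS and adding
    newlines only around BLOCK_TAGS (inline tags add none)."""
    if element.tag in SKIP_TAGS:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(get_clean_text(child))
        # Text that follows the child but still belongs to this element.
        parts.append(child.tail or "")
    text = "".join(parts)
    if element.tag in BLOCK_TAGS:
        text = "\n" + text.strip() + "\n"
    return text


def normalize(text: str) -> str:
    """Trim each line, collapse inner whitespace, and drop empty lines."""
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


html = (
    "<div>"
    "<p><span>class </span><span><span>ReadTheDocsLoader</span></span>"
    "<span>(path)</span></p>"
    "<script>var x = 1;</script>"
    "<p>Load ReadTheDocs documentation directory.</p>"
    "</div>"
)
clean = normalize(get_clean_text(ET.fromstring(html)))
print(clean)
```

In this sketch the nested `<span>` elements are joined onto one line because only block-level tags introduce a newline, and the `<script>` body is dropped entirely.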

Index pages (like [this one](https://api.python.langchain.com/en/latest/api_reference.html)) would be loaded, chunked, and eventually embedded. This is bad not only because the user ends up embedding irrelevant information, but also because index pages are very likely to show up in retrieved content, making retrieval less effective (in our tests). Therefore, I added a bool parameter `exclude_index_pages`, defaulting to False (which is the current behavior, although I'd petition to default this to True), that skips all pages where links take up 50% or more of the page. In manual testing, this seemed to be the best threshold.
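A rough sketch of such a check is shown here. The description does not spell out how "links take up 50%+ of the page" is measured, so this sketch assumes it means the share of visible text characters that sit inside `<a>` tags. `LinkRatioParser` and `is_index_page` are hypothetical names, and the standard library's `html.parser` stands in for BeautifulSoup.

```python
from html.parser import HTMLParser


class LinkRatioParser(HTMLParser):
    """Counts text characters overall and inside <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.anchor_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so layout does not skew the ratio.
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def is_index_page(html: str, threshold: float = 0.5) -> bool:
    """Return True when link text makes up at least `threshold` of all text."""
    parser = LinkRatioParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return False
    return parser.link_chars / parser.total_chars >= threshold


index_html = (
    "<ul><li><a href='a.html'>Agents</a></li>"
    "<li><a href='b.html'>Chains</a></li></ul><p>API</p>"
)
article_html = (
    "<p>This loader reads HTML files. See <a href='x.html'>docs</a>.</p>"
)
print(is_index_page(index_html), is_index_page(article_html))
```

A loader could call a check like this per page and skip the page when it returns True and `exclude_index_pages` is set.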

Other Information:

Issue: n/a
Dependencies: n/a
Tag maintainer: n/a
Twitter handle: @andrewthezhou


vercel · 2023-10-26T22:35:43Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
langchain	⬜️ Ignored (Inspect)	Visit Preview		Oct 28, 2023 1:21am

baskaryan

looking great! some quick comments

libs/langchain/langchain/document_loaders/readthedocs.py

libs/langchain/pyproject.toml

libs/langchain/langchain/document_loaders/readthedocs.py

libs/langchain/tests/unit_tests/document_loaders/test_readthedoc.py

adrwz · 2023-10-27T20:44:03Z

libs/langchain/langchain/document_loaders/readthedocs.py


 from langchain.docstore.document import Document
 from langchain.document_loaders.base import BaseLoader

+if TYPE_CHECKING:
+    from bs4 import NavigableString
+    from bs4.element import Comment, Tag


@baskaryan added type checking. A lot of my parameters are bs4 objects, but I truthfully couldn't find any examples within the codebase of how parameter type checking for bs4 objects is done. Would love to get your eyes on how I'm handling type checking for `Tag`, `Comment`, and `NavigableString`.
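For reference, one common way to annotate parameters with types from an optional dependency is sketched below. It is an illustration of the general `TYPE_CHECKING` pattern, not necessarily what this PR ended up doing; `describe_node` is a hypothetical helper, and the quoted annotation is an assumption that keeps bs4 from being needed at runtime.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Evaluated only by static type checkers, so bs4 is not imported at runtime.
    from bs4 import NavigableString
    from bs4.element import Comment, Tag


def describe_node(element: "Union[Tag, NavigableString, Comment]") -> str:
    """Hypothetical helper: return the runtime class name of a parsed node.

    The annotation is a string, so it is never evaluated at import time
    and the module loads even when bs4 is not installed.
    """
    return type(element).__name__


print(describe_node("plain string"))
```

Because `Comment` is a subclass of `NavigableString` in bs4, any runtime `isinstance` dispatch would need to test for `Comment` before `NavigableString`.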


## **Description:**

When building our own readthedocs.io scraper, we noticed a couple interesting things:

1. Text lines with a lot of nested `<span>` tags would give unclean text with a bunch of newlines. For example, for [Langchain's documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader), a single line is represented in a complicated nested HTML structure, and the naive `soup.get_text()` call currently being made will create a newline for each nested HTML element. Therefore, the document loader would give a messy, newline-separated blob of text. This would be true in a lot of cases.

   <img width="945" alt="Screenshot 2023-10-26 at 6 15 39 PM" src="https://github.com/langchain-ai/langchain/assets/44193474/eca85d1f-d2bf-4487-a18a-e1e732fadf19">
   <img width="1031" alt="Screenshot 2023-10-26 at 6 16 00 PM" src="https://github.com/langchain-ai/langchain/assets/44193474/035938a0-9892-4f6a-83cd-0d7b409b00a3">

   Additionally, content from iframes, code from scripts, css from styles, etc. will be gotten if it's a subclass of the selector (which happens more often than you'd think). For example, [this page](https://pydeck.gl/gallery/contour_layer.html#) will scrape 1.5 million characters of content that looks like this:

   <img width="1372" alt="Screenshot 2023-10-26 at 6 32 55 PM" src="https://github.com/langchain-ai/langchain/assets/44193474/dbd89e39-9478-4a18-9e84-f0eb91954eac">

   Therefore, I wrote a recursive _get_clean_text(soup) class function that 1. skips all irrelevant elements, and 2. only adds newlines when necessary.

2. Index pages (like [this one](https://api.python.langchain.com/en/latest/api_reference.html)) would be loaded, chunked, and eventually embedded. This is really bad not just because the user will be embedding irrelevant information - but because index pages are very likely to show up in retrieved content, making retrieval less effective (in our tests). Therefore, I added a bool parameter `exclude_index_pages` defaulted to False (which is the current behavior, although I'd petition to default this to True) that will skip all pages where links take up 50%+ of the page. Through manual testing, this seems to be the best threshold.

## Other Information:

- **Issue:** n/a
- **Dependencies:** n/a
- **Tag maintainer:** n/a
- **Twitter handle:** @andrewthezhou

---------

Co-authored-by: Andrew Zhou <andrew@heykona.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>

Andrew Zhou added 2 commits October 26, 2023 18:04

Add exclude index pages option; preserve newline data dnd exclude bin…

a904b84

…ary/code information from readthedocs

Clean code

b4fa3bd

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 26, 2023

Andrew Zhou added 3 commits October 26, 2023 18:59

Write tests

4f2a6ae

Revert poetry.lock changes

51cee6d

Revert pyproject.toml changes

f19dc24

baskaryan reviewed Oct 27, 2023


Typing, change var, etc

2c334d4

adrwz commented Oct 27, 2023


adrwz requested a review from baskaryan October 27, 2023 20:44

adrwz marked this pull request as ready for review October 27, 2023 20:58

baskaryan and others added 3 commits October 27, 2023 15:39

cr

2b2912b

Merge pull request #1 from langchain-ai/bagatur/readthedocs-loader-im…

cb3aa7a

…provements cr

Remove logging

7195d89

baskaryan merged commit 64c4a69 into langchain-ai:master Oct 29, 2023
22 checks passed
