More comprehensive readthedocs document loader #12382

Merged
merged 9 commits into from
Oct 29, 2023
Contributor

@adrwz adrwz commented Oct 26, 2023

Description:

When building our own readthedocs.io scraper, we noticed a couple of interesting things:

  1. Text lines with many nested <span> tags produced unclean text full of stray newlines. For example, in LangChain's documentation (https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader), a single line is represented by a complicated nested HTML structure, and the naive soup.get_text() call currently being made creates a newline for each nested HTML element. As a result, the document loader returns a messy, newline-separated blob of text. This happens in a lot of cases.

Additionally, content from iframes, code from scripts, CSS from styles, etc. is scraped whenever those elements are descendants of the selected element (which happens more often than you'd think). For example, this page (https://pydeck.gl/gallery/contour_layer.html#) yields 1.5 million characters of such content.

Therefore, I wrote a recursive _get_clean_text(soup) class function that (1) skips all irrelevant elements and (2) only adds newlines when necessary.
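To illustrate the two rules above (skip irrelevant elements, break lines only where needed), here is a minimal sketch. It is not the PR's `_get_clean_text`: it swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra dependencies, and it is event-based rather than recursive. The names `CleanTextExtractor` and `get_clean_text` and the two tag sets are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the PR's actual lists are not shown on this page.
SKIPPED_TAGS = {"script", "style", "iframe", "noscript"}
BLOCK_TAGS = {"p", "div", "li", "br", "pre", "tr", "h1", "h2", "h3"}


class CleanTextExtractor(HTMLParser):
    """Collects visible text and breaks lines only at block-level tags."""

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._parts.append("\n")

    def handle_data(self, data):
        # Inline tags such as <span> add no newline, so their text stays on one line.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).split("\n"))
        return "\n".join(line for line in lines if line)


def get_clean_text(html):
    parser = CleanTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<div><p>The <span><span>quick</span> <span>brown</span></span> fox</p>"
    "<script>var x = 1;</script><style>p{}</style><p>jumps</p></div>"
)
print(get_clean_text(sample))
```

In this sketch the nested spans stay on a single line, and the script and style bodies never reach the output.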

  2. Index pages (like this one: https://api.python.langchain.com/en/latest/api_reference.html) would be loaded, chunked, and eventually embedded. This is bad not only because the user ends up embedding irrelevant information, but also because index pages are very likely to show up in retrieved content, making retrieval less effective (in our tests). Therefore, I added a bool parameter exclude_index_pages, defaulting to False (the current behavior, although I'd petition to default this to True), that skips all pages where links take up 50% or more of the page. In manual testing, this seemed to be the best threshold.
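The description does not say how "links take up 50%+ of the page" is measured. One plausible reading, sketched below, compares the number of non-whitespace characters inside `<a>` elements with the number of non-whitespace characters on the whole page. The helper names `link_ratio` and `is_index_page` are hypothetical, and the sketch uses the standard library's `html.parser` rather than BeautifulSoup so that it is self-contained. It also does not filter out script or style text.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts non-whitespace characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        count = len("".join(data.split()))  # ignore whitespace
        self.total_chars += count
        if self._anchor_depth > 0:
            self.link_chars += count


def link_ratio(html):
    """Fraction of the page's text that sits inside links (0.0 for empty pages)."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars


def is_index_page(html, threshold=0.5):
    # Assumed rule: a page counts as an index page at or above the threshold.
    return link_ratio(html) >= threshold


index_html = (
    '<ul><li><a href="a.html">alpha</a></li>'
    '<li><a href="b.html">beta</a></li></ul><p>See</p>'
)
article_html = (
    '<p>This page explains the loader in detail. <a href="x.html">More</a></p>'
)
print(is_index_page(index_html), is_index_page(article_html))
```

A loader could call such a check before chunking and skip pages that exceed the threshold. The right threshold likely depends on the documentation site.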

Other Information:

  • Issue: n/a
  • Dependencies: n/a
  • Tag maintainer: n/a
  • Twitter handle: @andrewthezhou

@dosubot dosubot bot added the labels "Ɑ: doc loader" (Related to document loader module (not documentation)) and "🤖:improvement" (Medium size change to existing code to handle new use-cases) on Oct 26, 2023
Collaborator

@baskaryan baskaryan left a comment

looking great! some quick comments

libs/langchain/langchain/document_loaders/readthedocs.py (outdated, resolved)
libs/langchain/langchain/document_loaders/readthedocs.py (outdated, resolved)
libs/langchain/pyproject.toml (outdated, resolved)
libs/langchain/langchain/document_loaders/readthedocs.py (outdated, resolved)

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

if TYPE_CHECKING:
    from bs4 import NavigableString
    from bs4.element import Comment, Tag
Contributor Author

@baskaryan added type checking. A lot of my parameters are bs4 objects, but truthfully I couldn't find any examples within the codebase of how parameter type checking for bs4 objects is done. Would love to get your eyes on how I'm handling type checking for Tag, Comment, and NavigableString.
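For context, one common way to annotate parameters with bs4 types while keeping bs4 an optional runtime dependency is to import the types under `TYPE_CHECKING` and quote the annotations. The sketch below shows that pattern only; `node_kind` is a hypothetical helper, not a function from the PR.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Seen only by static type checkers, so bs4 need not be installed at runtime.
    from bs4 import NavigableString
    from bs4.element import Comment, Tag


def node_kind(node: "Union[Tag, NavigableString, Comment]") -> str:
    """Hypothetical helper: return the class name of a parsed node."""
    # The quoted annotation is never evaluated at runtime, so the names
    # imported under TYPE_CHECKING are not looked up here.
    return type(node).__name__
```

Because the annotation is a string, importing the module succeeds even when bs4 is missing. Code that needs the real classes at runtime (for example for `isinstance` checks) would still have to import bs4 inside the function body.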

@adrwz adrwz requested a review from baskaryan October 27, 2023 20:44
@adrwz adrwz marked this pull request as ready for review October 27, 2023 20:58
@baskaryan baskaryan merged commit 64c4a69 into langchain-ai:master Oct 29, 2023
22 checks passed
nicolewhite pushed a commit to autoblocksai/autoblocks-examples that referenced this pull request Oct 31, 2023
This PR contains the following updates:

| Package | Change |
|---|---|
| [langchain](https://github.com/langchain-ai/langchain) (PyPI) | `^0.0.323` -> `^0.0.326` |
| [langchain](https://github.com/langchain-ai/langchainjs) (npm) | [`^0.0.173` -> `^0.0.176`](https://renovatebot.com/diffs/npm/langchain/0.0.173/0.0.176) |

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
xieqihui pushed a commit to xieqihui/langchain that referenced this pull request Nov 21, 2023
## **Description:**
When building our own readthedocs.io scraper, we noticed a couple
interesting things:

1. Lines of text built from many nested `<span>` tags come out as
unclean text littered with newlines. For example, in [Langchain's
documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.readthedocs.ReadTheDocsLoader.html#langchain.document_loaders.readthedocs.ReadTheDocsLoader),
a single line is represented by a complicated nested HTML structure, and
the naive `soup.get_text()` call currently being made creates a
newline for each nested HTML element. The document loader therefore
returns a messy, newline-separated blob of text. This happens in a lot
of cases.

<img width="945" alt="Screenshot 2023-10-26 at 6 15 39 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/eca85d1f-d2bf-4487-a18a-e1e732fadf19">
<img width="1031" alt="Screenshot 2023-10-26 at 6 16 00 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/035938a0-9892-4f6a-83cd-0d7b409b00a3">

Additionally, content from iframes, code from scripts, CSS from styles,
etc. gets picked up whenever those elements are descendants of the
element matched by the selector (which happens more often than you'd
think). For example, scraping [this
page](https://pydeck.gl/gallery/contour_layer.html#) yields 1.5
million characters of content that looks like this:

<img width="1372" alt="Screenshot 2023-10-26 at 6 32 55 PM"
src="https://github.com/langchain-ai/langchain/assets/44193474/dbd89e39-9478-4a18-9e84-f0eb91954eac">

Therefore, I wrote a recursive `_get_clean_text(soup)` class function
that (1) skips all irrelevant elements and (2) only adds newlines when
necessary.
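
For illustration, here is a rough standard-library sketch of the same two
rules. It is not the code in this PR: the PR walks a BeautifulSoup tree
recursively, while this version uses the event-based `html.parser`. The
names `CleanTextExtractor`, `get_clean_text`, `SKIP_TAGS`, and
`BLOCK_TAGS`, as well as the tag sets themselves, are assumptions made up
for the example.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets, for illustration only; the PR's actual lists may differ.
SKIP_TAGS = {"script", "style", "iframe", "noscript"}
BLOCK_TAGS = {"p", "div", "li", "br", "pre", "tr", "dt", "dd",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class CleanTextExtractor(HTMLParser):
    """Collects visible text, adding newlines only around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Newlines in the HTML source are not line breaks in the text.
            self.parts.append(re.sub(r"\s+", " ", data))


def get_clean_text(html: str) -> str:
    parser = CleanTextExtractor()
    parser.feed(html)
    parser.close()
    lines = "".join(parser.parts).split("\n")
    # Collapse runs of spaces and drop lines that ended up empty.
    cleaned = (" ".join(line.split()) for line in lines)
    return "\n".join(line for line in cleaned if line)


sample = (
    "<div><p><span>class </span><span><span>ReadTheDocsLoader</span></span>"
    "(<em>path</em>)</p><script>var x = 1;</script>"
    "<style>p{color:red}</style><p>Load docs.</p></div>"
)
print(get_clean_text(sample))
```

Inline tags such as `<span>` and `<em>` never introduce a newline here, so
the nested signature stays on one line, and the script and style bodies are
dropped entirely. This sketch also flattens whitespace inside `<pre>`, which
a real loader would probably want to preserve.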

2. Index pages (like [this
one](https://api.python.langchain.com/en/latest/api_reference.html))
would be loaded, chunked, and eventually embedded. This is really bad,
not just because the user ends up embedding irrelevant information, but
because index pages are very likely to show up in retrieved content,
making retrieval less effective (in our tests). Therefore, I added a
bool parameter `exclude_index_pages`, defaulting to False (which is the
current behavior, although I'd petition to default this to True), that
skips all pages where links take up 50% or more of the page. In manual
testing, this seemed to be the best threshold.
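
As a rough sketch of the index-page check, assuming "links take up 50%+ of
the page" is measured as the share of non-whitespace text characters that
sit inside `<a>` tags (the PR text does not spell out the exact
measurement). `LinkTextCounter` and `is_index_page` are hypothetical names
for this example, which uses only the standard library rather than the
loader's BeautifulSoup code.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible text characters, and how many of them are link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.link_depth = 0
        self.skip_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in ("script", "style"):
            self.skip_depth = max(0, self.skip_depth - 1)

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Ignore whitespace so indentation in the HTML does not skew the ratio.
        n = len("".join(data.split()))
        self.total_chars += n
        if self.link_depth > 0:
            self.link_chars += n


def is_index_page(html: str, threshold: float = 0.5) -> bool:
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return False  # an empty page is not treated as an index page
    return counter.link_chars / counter.total_chars >= threshold


index_html = (
    '<ul><li><a href="a.html">Agents</a></li>'
    '<li><a href="b.html">Chains</a></li></ul><p>API</p>'
)
prose_html = (
    "<p>This loader reads HTML files from a directory. "
    'See <a href="x.html">docs</a>.</p>'
)
print(is_index_page(index_html), is_index_page(prose_html))
```

In the loader itself, this behavior is switched on by passing
`exclude_index_pages=True`.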



## Other Information:
  - **Issue:** n/a
  - **Dependencies:** n/a
  - **Tag maintainer:** n/a
  - **Twitter handle:** @andrewthezhou

---------

Co-authored-by: Andrew Zhou <andrew@heykona.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
hoanq1811 pushed a commit to hoanq1811/langchain that referenced this pull request Feb 2, 2024
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases