Harrison/html splitter #5468

Merged
hwchase17 merged 2 commits into master from harrison/html-splitter
May 31, 2023


hwchase17
Contributor

No description provided.

r3v1 and others added 2 commits May 30, 2023 15:18
# HtmlTextSplitter

I am submitting a new HtmlTextSplitter class, which attempts to split
text along HTML layout elements.

This PR addresses the need for HTML text splitting functionality in the
LangChain library. There are no additional dependencies required for
this change.

## Examples

See the [example
notebook](docs/modules/indexes/text_splitters/examples/html.ipynb)
for usage.
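
For readers without the notebook at hand, here is a rough, self-contained sketch of what "splitting along HTML layout elements" can look like. It is not the code added in this PR: the separator list, the `split_html` function, and the `_split_before` helper are all hypothetical, and the sketch assumes a simple coarse-to-fine recursive strategy with greedy merging of small pieces.

```python
from typing import List, Optional

# Hypothetical separators, ordered from coarse to fine layout elements.
# Matching on tag prefixes is crude: "<p" also matches "<pre", for example.
HTML_SEPARATORS = ["<body", "<div", "<p", "<br", "<li", " "]


def _split_before(text: str, sep: str) -> List[str]:
    """Cut text in front of each occurrence of sep, so a tag stays with its content."""
    pieces = []
    start = 0
    idx = text.find(sep, 1)  # start at 1 so a leading sep does not yield an empty piece
    while idx != -1:
        pieces.append(text[start:idx])
        start = idx
        idx = text.find(sep, idx + 1)
    pieces.append(text[start:])
    return pieces


def split_html(
    text: str, chunk_size: int, separators: Optional[List[str]] = None
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if separators is None:
        separators = HTML_SEPARATORS
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in _split_before(text, sep):
        if len(piece) > chunk_size:
            # Piece is still too large: flush and retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_html(piece, chunk_size, finer))
        elif len(current) + len(piece) <= chunk_size:
            current += piece  # greedily merge small neighbouring pieces
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


html = (
    "<div><p>Hello world</p><p>Second paragraph here</p></div>"
    "<div><p>Third</p></div>"
)
for chunk in split_html(html, chunk_size=40):
    print(repr(chunk))
```

Because every cut is made in front of a separator and nothing is dropped, joining the chunks reproduces the input exactly.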

## Who can review?

Community members can review the PR once tests pass. Tag
maintainers/contributors who might be interested:
 - @vowelparrot

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
@@ -478,6 +478,45 @@ def __init__(self, **kwargs: Any):
        super().__init__(separators=separators, **kwargs)


class HtmlTextSplitter(RecursiveCharacterTextSplitter):
Collaborator

This feels like a splitter that's similar conceptually to the code language splitters below (and to markdown, rst).

And it feels very different from SpacyTextSplitter or NLTKTextSplitter.

I don't have a strong opinion here; I'm mostly wondering whether it makes sense to keep a small number of core classes in the global namespace and do the parameterization in the initializer when that makes sense.
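
To make the suggestion concrete, one possible shape is a single core class that picks its separators from a per-format table instead of one subclass per format. This is only an illustration of the idea: `RecursiveSplitter`, `for_format`, and the separator tables below are hypothetical names and values, not existing LangChain API.

```python
from typing import Any, Dict, List, Optional

# Hypothetical per-format separator tables; the values are illustrative only.
SEPARATORS_BY_FORMAT: Dict[str, List[str]] = {
    "html": ["<body", "<div", "<p", "<br", "<li", " "],
    "markdown": ["\n## ", "\n### ", "\n\n", "\n", " "],
}


class RecursiveSplitter:
    """One core class; the format only changes the separators it is built with."""

    def __init__(
        self, separators: Optional[List[str]] = None, chunk_size: int = 1000
    ) -> None:
        self.separators = (
            separators if separators is not None else ["\n\n", "\n", " "]
        )
        self.chunk_size = chunk_size

    @classmethod
    def for_format(cls, fmt: str, **kwargs: Any) -> "RecursiveSplitter":
        """Build a splitter for a named format instead of a dedicated subclass."""
        if fmt not in SEPARATORS_BY_FORMAT:
            raise ValueError(f"Unknown format: {fmt!r}")
        # Copy the list so callers cannot mutate the shared table.
        return cls(separators=list(SEPARATORS_BY_FORMAT[fmt]), **kwargs)


html_splitter = RecursiveSplitter.for_format("html", chunk_size=200)
print(html_splitter.separators)
```

The trade-off is discoverability: a dedicated `HtmlTextSplitter` name is easier to find, while a factory keeps the global namespace small.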

@hwchase17 hwchase17 merged commit f72bb96 into master May 31, 2023
@hwchase17 hwchase17 deleted the harrison/html-splitter branch May 31, 2023 04:06
vowelparrot pushed a commit that referenced this pull request May 31, 2023
Co-authored-by: David Revillas <26328973+r3v1@users.noreply.github.com>
@danielchalef danielchalef mentioned this pull request Jun 5, 2023
Undertone0809 pushed a commit to Undertone0809/langchain that referenced this pull request Jun 19, 2023
Co-authored-by: David Revillas <26328973+r3v1@users.noreply.github.com>
This was referenced Jun 25, 2023
3 participants