Text splitter for Markdown files by header #5860
not sure we need `sorted(splits, key=lambda x: (-len(x[0]), -x[0].count('#')))`
tbh probably best reviewed by writing a bunch of test cases and ensuring it works
Done. Added to notebook.
Re-organized code a bit; we no longer do this.
This creates a new kind of text splitter for Markdown files. The user can supply a set of headers that they want to split the file on. We define a new text splitter class, `MarkdownHeaderTextSplitter`, that does two things:
(1) For each line, it determines the associated set of user-specified headers.
(2) It groups lines with common headers into splits.
See the notebook for example usage and test cases.
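To make the two steps concrete, here is a minimal standalone sketch of the idea. It is not the code in this PR: the function name `split_by_headers`, the `(marker, name)` tuple format, and the `content`/`metadata` output shape are all assumptions chosen for illustration, and the sketch ignores edge cases such as `#` characters inside fenced code blocks.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict[str, object]]:
    """Hypothetical helper illustrating header-based splitting.

    headers_to_split_on holds (marker, name) pairs such as ("#", "Header 1").
    Markers are assumed to consist only of '#' characters.
    """
    # Try longer markers first so a "##" line is never matched as "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    level_of = {name: len(marker) for marker, name in headers_to_split_on}

    current: Dict[str, str] = {}  # header name -> title currently in effect
    splits: List[Dict[str, object]] = []

    for line in text.splitlines():
        stripped = line.strip()
        is_header = False
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                level = len(marker)
                # A new header ends every header at the same or a deeper level.
                current = {n: t for n, t in current.items() if level_of[n] < level}
                current[name] = stripped[len(marker):].strip()
                is_header = True
                break
        if is_header or not stripped:
            continue  # header lines and blank lines carry no content

        # Step 1: `current` is the set of headers associated with this line.
        # Step 2: consecutive lines with the same headers share one split.
        if splits and splits[-1]["metadata"] == current:
            splits[-1]["content"] += "\n" + stripped
        else:
            splits.append({"content": stripped, "metadata": dict(current)})
    return splits
```

Called with `[("#", "Header 1"), ("##", "Header 2")]`, this sketch keeps a deeper header such as `###` as ordinary content, because only the user-specified markers are treated as split points.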