Is your feature request related to a problem? Please describe.
We've found in practice that cleaning up files before they are used in RAG pipelines increases overall performance. For example, this Haystack user found the same.
We do have a DocumentCleaner to help with this process, but we found there are some options missing for the type of cleaning we would like to accomplish.
Describe the solution you'd like
The options I'd like to add to the DocumentCleaner are:
An option that just runs .strip() on the content of every document. Oftentimes we just want to remove the extra leading and trailing whitespace but leave the whitespace within a chunk alone. For example, in Markdown files the extra newlines can matter for formatting.
An option to provide a regex pattern to match and a string to replace each match with. We currently have a few regex replacements in the DocumentCleaner and have the remove_regex parameter, but we don't have a way to customize what string should replace the match. For example, one thing I'd like to do is replace all double newline characters \n\n with a single newline character \n.
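To make the two requested behaviours concrete, here is a minimal sketch in plain Python. The helper names `strip_content` and `replace_regex` are hypothetical and are not existing DocumentCleaner parameters; they only illustrate what the proposed options would do to a document's content.

```python
import re


def strip_content(text: str) -> str:
    """Remove leading and trailing whitespace only; inner whitespace is untouched."""
    return text.strip()


def replace_regex(text: str, pattern: str, replacement: str = "") -> str:
    """Replace every match of `pattern` with `replacement` (hypothetical option)."""
    return re.sub(pattern, replacement, text)


raw = "\n\n  # Title\n\nFirst paragraph.\n\nSecond paragraph.  \n"

# Option 1: strip only, so the blank lines inside the Markdown chunk survive.
stripped = strip_content(raw)

# Option 2: replace each double newline with a single newline.
collapsed = replace_regex(stripped, r"\n\n", "\n")
```

With an empty replacement string, `replace_regex` behaves like a plain "remove this pattern" option, so a replacement parameter could default to the current removal behaviour.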
Describe alternatives you've considered
We can create a custom component to perform these operations instead.
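A rough sketch of what such a custom component could look like is below. To keep it runnable without Haystack installed, `Doc` is a hypothetical stand-in for Haystack's Document class and `StripAndReplaceCleaner` is an illustrative name; in an actual pipeline the class would presumably be registered as a Haystack component and operate on real Document objects.

```python
import re
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document; only `content` is modelled."""
    content: str


class StripAndReplaceCleaner:
    """Illustrative cleaner: optional regex replacement, then optional strip."""

    def __init__(self, strip: bool = True, pattern: Optional[str] = None,
                 replacement: str = ""):
        self.strip = strip
        self.pattern = re.compile(pattern) if pattern else None
        self.replacement = replacement

    def run(self, documents: List[Doc]) -> Dict[str, List[Doc]]:
        cleaned = []
        for doc in documents:
            text = doc.content
            if self.pattern is not None:
                # Substitute every match with the user-supplied replacement.
                text = self.pattern.sub(self.replacement, text)
            if self.strip:
                # Trim only the outer whitespace of the document content.
                text = text.strip()
            # Return copies so the input documents are not mutated.
            cleaned.append(replace(doc, content=text))
        return {"documents": cleaned}


cleaner = StripAndReplaceCleaner(pattern=r"\n\n", replacement="\n")
result = cleaner.run([Doc(content="  line one\n\nline two\n")])
```

This works, but every user who needs the same cleanup has to maintain their own copy, which is the argument for adding the options to DocumentCleaner itself.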