fix: Sanitize null bytes before ingestion (#2090)
* Sanitize null bytes before ingestion

* Added comments
laoqiu233 authored Sep 25, 2024
1 parent fa3c306 commit 5fbb402
Showing 1 changed file with 7 additions and 1 deletion.
8 changes: 7 additions & 1 deletion private_gpt/components/ingest/ingest_helper.py
@@ -92,7 +92,13 @@ def _load_file_to_documents(file_name: str, file_data: Path) -> list[Document]:
             return string_reader.load_data([file_data.read_text()])
 
         logger.debug("Specific reader found for extension=%s", extension)
-        return reader_cls().load_data(file_data)
+        documents = reader_cls().load_data(file_data)
+
+        # Sanitize NUL bytes in text which can't be stored in Postgres
+        for i in range(len(documents)):
+            documents[i].text = documents[i].text.replace("\u0000", "")
+
+        return documents
 
     @staticmethod
     def _exclude_metadata(documents: list[Document]) -> None:
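
For readers who want to try the sanitization step in isolation, the following is a minimal sketch of the same idea outside the project. The `Document` dataclass here is a hypothetical stand-in with only a `text` attribute, and `strip_nul_bytes` is an illustrative helper name that does not appear in the commit. It only assumes, as the diff's comment states, that NUL characters must be removed from the text before it is stored in Postgres.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for the Document type used in the diff."""

    text: str


def strip_nul_bytes(documents: list[Document]) -> list[Document]:
    """Remove NUL (U+0000) characters from each document's text, in place."""
    for document in documents:
        # str.replace returns a new string with every NUL character dropped
        document.text = document.text.replace("\u0000", "")
    return documents


docs = [Document("hello\u0000world"), Document("clean")]
strip_nul_bytes(docs)
print([d.text for d in docs])  # ['helloworld', 'clean']
```

Iterating over the documents directly is behaviorally equivalent to the index-based loop in the commit. Note that, as the hunk shows, the loop sits after the early `return` for the default string reader, so in this change only documents produced by a specific reader pass through it.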
