Fix office file indexing problem #844

danielaskdd · 2025-02-18T17:44:10Z

Fix docx pptx indexing error
Add xlsx support

- Removed redundant file content reading. - Directly passed file to BytesIO. - Simplified DOCX content extraction. - Streamlined PPTX slide processing. - Reduced memory usage in file handling.

- Install openpyxl if not present - Load .xlsx file using openpyxl - Extract sheet titles and content - Format rows with tab-separated values - Append sheet content to overall text

- Added return status to `apipeline_enqueue_documents` - Enhanced logging for duplicate documents

YanSte · 2025-02-18T18:45:24Z

lightrag/lightrag.py

@@ -653,7 +653,7 @@ async def ainsert_custom_chunks(self, full_text: str, text_chunks: list[str]):
            if update_storage:
                await self._insert_done()

-    async def apipeline_enqueue_documents(self, input: str | list[str]):
+    async def apipeline_enqueue_documents(self, input: str | list[str]) -> bool:


Hello,

Thanks for sharing.

We can't modify apipeline_enqueue_documents to returns a boolean.
Lightserver should not modify the library code.

Could you remove ?

If Lightserver could recognize the status of file additions, it would be able to provide more user-friendly notifications. Currently, when a user uploads a duplicate file, the logs shows "Successfully processed and enqueued file", which may cause confusion for user.

I have reviewed all instances where apipeline_enqueue_documents is used and found that the function returning a Boolean value does not cause any compatibility issues. Therefore, I believe it would be a better solution to have apipeline_enqueue_documents explicitly return the status of file additions.

Thanks for your feedback.

However, this change is not useful in the current context, as the value is only used for logging. Adding a boolean for logging purposes doesn’t add much value.

Additionally, pipe needs to return None, as that aligns with its intended behavior.

WebUI also need to know this status to make a sensible respond .

I see your concern about handling concurrent file additions, but separating checking from enqueuing is actually a cleaner and more scalable approach.

Returning a status from apipeline_enqueue_documents blurs its responsibility. Its job is to enqueue, not to validate whether a document is new. If we mix concerns here, we introduce unnecessary complexity and make the function behave inconsistently across different projects.

If the goal is to reliably check for duplicates in a multi-user scenario, the right place to handle this logic in a new method like.

def check_new_file(..) new_doc = compute_mdhash_id(content, prefix="doc-") all_new_doc_ids = set(new_doc) unique_new_doc_ids = await self.doc_status.filter_keys(all_new_doc_ids) new_docs = {doc_id: new_docs[doc_id] for doc_id in unique_new_doc_ids} return new_docs

Would a dedicated is_new_document method in LightRag serve your needs better? (here it's add value)
This would provide the validation you need without overloading apipeline_enqueue_documents with extra responsibility.

@LarFii What do you think ?

I get your point, you are right.

I am currently contemplating how to program LightRAG into a universal knowledge base backend. Therefore, I am considering many aspects. I hope that LightRAG can conveniently create and delete Namespaces, easily add files in batches to a Namespace, and have each Namespace simulate an Ollama model to provide services externally. I wonder if everyone supports my idea, or have any suggestion? @ParisNeo

Hello,

Thanks for sharing.

We can't modify apipeline_enqueue_documents to returns a boolean. Lightserver should not modify the library code.

Could you remove ?

The codes is removed.

YanSte · 2025-02-19T07:34:36Z

Thanks for sharing 🙏🏻

danielaskdd added 4 commits February 19, 2025 01:34

Optimize file handling for DOCX and PPTX processing.

9c45824

- Removed redundant file content reading. - Directly passed file to BytesIO. - Simplified DOCX content extraction. - Streamlined PPTX slide processing. - Reduced memory usage in file handling.

Added support for reading .xlsx files in LightRAG.

cac9342

- Install openpyxl if not present - Load .xlsx file using openpyxl - Extract sheet titles and content - Format rows with tab-separated values - Append sheet content to overall text

Improved document enqueue logic with existence checks.

d34dbc5

- Added return status to `apipeline_enqueue_documents` - Enhanced logging for duplicate documents

Fix linting

8196df8

YanSte self-requested a review February 18, 2025 17:55

YanSte reviewed Feb 18, 2025

View reviewed changes

YanSte requested a review from LarFii February 18, 2025 21:01

Revert "Improved document enqueue logic with existence checks"

32bfcbb

danielaskdd changed the title ~~Add duplicate check for document enqueue logic and fix office file indexing problem~~ Fix office file indexing problem Feb 19, 2025

danielaskdd requested a review from YanSte February 19, 2025 05:52

YanSte merged commit c79b15c into HKUDS:main Feb 19, 2025
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix office file indexing problem #844

Fix office file indexing problem #844

danielaskdd commented Feb 18, 2025 •

edited

Loading

YanSte Feb 18, 2025

danielaskdd Feb 18, 2025

YanSte Feb 18, 2025

danielaskdd Feb 18, 2025

danielaskdd Feb 18, 2025

YanSte Feb 18, 2025

YanSte Feb 18, 2025

danielaskdd Feb 19, 2025

danielaskdd Feb 19, 2025

danielaskdd Feb 19, 2025

YanSte commented Feb 19, 2025

Fix office file indexing problem #844

Fix office file indexing problem #844

Conversation

danielaskdd commented Feb 18, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

YanSte commented Feb 19, 2025

danielaskdd commented Feb 18, 2025 •

edited

Loading