Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix office file indexing problem #844

Merged
merged 5 commits into from
Feb 19, 2025
Merged

Conversation

danielaskdd
Copy link
Contributor

@danielaskdd danielaskdd commented Feb 18, 2025

  • Fix docx pptx indexing error
  • Add xlsx support

- Removed redundant file content reading.
- Directly passed file to BytesIO.
- Simplified DOCX content extraction.
- Streamlined PPTX slide processing.
- Reduced memory usage in file handling.
- Install openpyxl if not present
- Load .xlsx file using openpyxl
- Extract sheet titles and content
- Format rows with tab-separated values
- Append sheet content to overall text
- Added return status to `apipeline_enqueue_documents`
- Enhanced logging for duplicate documents
@YanSte YanSte self-requested a review February 18, 2025 17:55
@@ -653,7 +653,7 @@ async def ainsert_custom_chunks(self, full_text: str, text_chunks: list[str]):
if update_storage:
await self._insert_done()

async def apipeline_enqueue_documents(self, input: str | list[str]):
async def apipeline_enqueue_documents(self, input: str | list[str]) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

Thanks for sharing.

We can't modify apipeline_enqueue_documents to returns a boolean.
Lightserver should not modify the library code.

Could you remove ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If Lightserver could recognize the status of file additions, it would be able to provide more user-friendly notifications. Currently, when a user uploads a duplicate file, the logs shows "Successfully processed and enqueued file", which may cause confusion for user.

I have reviewed all instances where apipeline_enqueue_documents is used and found that the function returning a Boolean value does not cause any compatibility issues. Therefore, I believe it would be a better solution to have apipeline_enqueue_documents explicitly return the status of file additions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback.

However, this change is not useful in the current context, as the value is only used for logging. Adding a boolean for logging purposes doesn’t add much value.

Additionally, pipe needs to return None, as that aligns with its intended behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iShot_2025-02-18_20 02 54

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WebUI also need to know this status to make a sensible respond .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your concern about handling concurrent file additions, but separating checking from enqueuing is actually a cleaner and more scalable approach.

Returning a status from apipeline_enqueue_documents blurs its responsibility. Its job is to enqueue, not to validate whether a document is new. If we mix concerns here, we introduce unnecessary complexity and make the function behave inconsistently across different projects.

If the goal is to reliably check for duplicates in a multi-user scenario, the right place to handle this logic in a new method like.

def check_new_file(..)
new_doc = compute_mdhash_id(content, prefix="doc-")
all_new_doc_ids = set(new_doc)
unique_new_doc_ids = await self.doc_status.filter_keys(all_new_doc_ids)
new_docs = {doc_id: new_docs[doc_id] for doc_id in unique_new_doc_ids}
return new_docs

Would a dedicated is_new_document method in LightRag serve your needs better? (here it's add value)
This would provide the validation you need without overloading apipeline_enqueue_documents with extra responsibility.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LarFii What do you think ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get your point, you are right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am currently contemplating how to program LightRAG into a universal knowledge base backend. Therefore, I am considering many aspects. I hope that LightRAG can conveniently create and delete Namespaces, easily add files in batches to a Namespace, and have each Namespace simulate an Ollama model to provide services externally. I wonder if everyone supports my idea, or have any suggestion? @ParisNeo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello,

Thanks for sharing.

We can't modify apipeline_enqueue_documents to returns a boolean. Lightserver should not modify the library code.

Could you remove ?

The codes is removed.

@YanSte YanSte requested a review from LarFii February 18, 2025 21:01
@danielaskdd danielaskdd changed the title Add duplicate check for document enqueue logic and fix office file indexing problem Fix office file indexing problem Feb 19, 2025
@danielaskdd danielaskdd requested a review from YanSte February 19, 2025 05:52
@YanSte
Copy link
Contributor

YanSte commented Feb 19, 2025

Thanks for sharing 🙏🏻

@YanSte YanSte merged commit c79b15c into HKUDS:main Feb 19, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants