Fix office file indexing problem #844
danielaskdd commented Feb 18, 2025 (edited)
- Fix docx pptx indexing error
- Add xlsx support
- Removed redundant file content reading.
- Directly passed file to BytesIO.
- Simplified DOCX content extraction.
- Streamlined PPTX slide processing.
- Reduced memory usage in file handling.
- Install openpyxl if not present
- Load .xlsx file using openpyxl
- Extract sheet titles and content
- Format rows with tab-separated values
- Append sheet content to overall text
- Added return status to `apipeline_enqueue_documents` - Enhanced logging for duplicate documents
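The xlsx steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `format_sheets` and `xlsx_to_text` and the `Sheet: ` label are assumptions made for the example. Only `load_workbook`, `worksheets`, `title`, and `iter_rows(values_only=True)` are real openpyxl calls, and openpyxl is imported lazily so the formatting helper can be used without it.

```python
from io import BytesIO
from typing import Iterable, Sequence


def format_sheets(sheets: Iterable[tuple[str, Iterable[Sequence[object]]]]) -> str:
    """Render (title, rows) pairs as text, one tab-separated line per row."""
    parts = []
    for title, rows in sheets:
        # The "Sheet: " label is an assumption for this sketch.
        parts.append(f"Sheet: {title}")
        for row in rows:
            # Empty cells arrive as None; render them as empty strings.
            parts.append("\t".join("" if cell is None else str(cell) for cell in row))
        parts.append("")  # blank line between sheets
    return "\n".join(parts)


def xlsx_to_text(data: bytes) -> str:
    """Extract text from raw .xlsx bytes; requires the openpyxl package."""
    from openpyxl import load_workbook  # third-party, imported lazily

    # Pass the uploaded bytes straight to BytesIO instead of writing a temp file.
    wb = load_workbook(BytesIO(data), read_only=True, data_only=True)
    return format_sheets(
        (ws.title, ws.iter_rows(values_only=True)) for ws in wb.worksheets
    )
```

Splitting the formatting from the workbook loading keeps the text layout testable without a real spreadsheet.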
lightrag/lightrag.py
Outdated
@@ -653,7 +653,7 @@ async def ainsert_custom_chunks(self, full_text: str, text_chunks: list[str]):
         if update_storage:
             await self._insert_done()

-    async def apipeline_enqueue_documents(self, input: str | list[str]):
+    async def apipeline_enqueue_documents(self, input: str | list[str]) -> bool:
Hello,
Thanks for sharing.
We can't modify apipeline_enqueue_documents to return a boolean.
Lightserver should not modify the library code.
Could you remove this change?
If Lightserver could recognize the status of file additions, it could give users more helpful notifications. Currently, when a user uploads a duplicate file, the log shows "Successfully processed and enqueued file", which may confuse users.
I have reviewed every place where apipeline_enqueue_documents is used and found that returning a Boolean value does not cause any compatibility issues. I therefore believe it would be better for apipeline_enqueue_documents to explicitly return the status of the file addition.
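As a rough illustration of the proposal in this comment, the sketch below shows how a caller could choose its log message from a boolean enqueue status. `TinyPipeline`, `enqueue`, `handle_upload`, and the "Skipped duplicate file" message are all hypothetical stand-ins for this example, not LightRAG or Lightserver code, and the MD5-based ID is only an assumption about how duplicates are detected.

```python
import asyncio
import hashlib
import logging

logger = logging.getLogger("upload_sketch")


class TinyPipeline:
    """Hypothetical stand-in for the document pipeline, not LightRAG itself."""

    def __init__(self) -> None:
        self.seen_ids: set[str] = set()

    async def enqueue(self, content: str) -> bool:
        """Return True if the document was queued, False if it is a duplicate."""
        doc_id = "doc-" + hashlib.md5(content.encode("utf-8")).hexdigest()
        if doc_id in self.seen_ids:
            return False
        self.seen_ids.add(doc_id)
        return True


async def handle_upload(pipeline: TinyPipeline, name: str, content: str) -> str:
    """Build the user-facing message from the enqueue status."""
    if await pipeline.enqueue(content):
        message = f"Successfully processed and enqueued file: {name}"
    else:
        message = f"Skipped duplicate file: {name}"
    logger.info(message)
    return message
```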
Thanks for your feedback.
However, this change is not useful in the current context, as the value is only used for logging. Adding a boolean for logging purposes doesn’t add much value.
Additionally, pipe needs to return None, as that aligns with its intended behavior.
The WebUI also needs to know this status in order to respond sensibly.
I see your concern about handling concurrent file additions, but separating checking from enqueuing is actually a cleaner and more scalable approach.
Returning a status from apipeline_enqueue_documents blurs its responsibility. Its job is to enqueue, not to validate whether a document is new. If we mix concerns here, we introduce unnecessary complexity and make the function behave inconsistently across different projects.
If the goal is to reliably check for duplicates in a multi-user scenario, the right place to handle this logic is a new method like:
async def check_new_file(self, content: str) -> dict[str, str]:
    # Hash the content to get a deterministic document ID.
    doc_id = compute_mdhash_id(content, prefix="doc-")
    # filter_keys returns only the IDs not yet present in doc_status.
    unique_new_doc_ids = await self.doc_status.filter_keys({doc_id})
    # Keep the document only if its ID is new.
    new_docs = {doc_id: content} if doc_id in unique_new_doc_ids else {}
    return new_docs
Would a dedicated is_new_document method in LightRAG serve your needs better? (There it would add value.)
This would provide the validation you need without overloading apipeline_enqueue_documents with extra responsibility.
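The separation described above can be sketched as a self-contained example. Everything here is hypothetical: `InMemoryDocStatus`, `TinyRag`, `enqueue`, and `demo` are stand-ins invented for illustration, and the MD5-based ID only approximates what a content hash with a `doc-` prefix might look like. The point is that validation returns a status, while enqueuing returns None.

```python
import asyncio
import hashlib


class InMemoryDocStatus:
    """Hypothetical stand-in for a doc-status store."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    async def filter_keys(self, keys: set[str]) -> set[str]:
        """Return the subset of keys that are not stored yet."""
        return keys - self._known

    async def add(self, keys: set[str]) -> None:
        self._known |= keys


class TinyRag:
    """Hypothetical stand-in that keeps checking and enqueuing separate."""

    def __init__(self) -> None:
        self.doc_status = InMemoryDocStatus()

    @staticmethod
    def doc_id(content: str) -> str:
        return "doc-" + hashlib.md5(content.encode("utf-8")).hexdigest()

    async def is_new_document(self, content: str) -> bool:
        """Validation only: report whether this content is unseen."""
        doc_id = self.doc_id(content)
        unseen = await self.doc_status.filter_keys({doc_id})
        return doc_id in unseen

    async def enqueue(self, content: str) -> None:
        """Enqueue only: returns None and reports no status."""
        await self.doc_status.add({self.doc_id(content)})


async def demo() -> list[bool]:
    rag = TinyRag()
    results = [await rag.is_new_document("hello")]
    await rag.enqueue("hello")
    results.append(await rag.is_new_document("hello"))
    results.append(await rag.is_new_document("other"))
    return results
```

One caveat with this shape: a check followed by a separate enqueue is not atomic, so two concurrent uploads of the same file could both see the document as new unless the caller serializes them.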
@LarFii What do you think?
I get your point, you are right.
I am currently thinking about how to turn LightRAG into a universal knowledge base backend, so I am weighing many aspects. I hope that LightRAG can conveniently create and delete Namespaces, easily add files in batches to a Namespace, and have each Namespace simulate an Ollama model to provide services externally. Does everyone support this idea, or have any suggestions? @ParisNeo
> Hello,
> Thanks for sharing.
> We can't modify apipeline_enqueue_documents to return a boolean. Lightserver should not modify the library code.
> Could you remove this change?

The code has been removed.
Thanks for sharing 🙏🏻