It takes a long time to read a docx that contains images #757

Open
KfirAlfa opened this issue Sep 10, 2024 · 1 comment

Labels
bug Something isn't working

Comments

@KfirAlfa

Thank you for this amazing library!

I've encountered some performance issues when parsing large DOCX files. I used the reader.rs from the examples directory on a MacBook Pro with 36 GB of RAM, and observed the following parsing times:

Processing pic_42727bc0ea22235a316e371df59c49011ef2a328.docx took 145 seconds
Processing pic_b243f6218b4f2ca4de2717cf4a2af223b68210db.docx took 127 seconds

I've uploaded the problematic files:
pic_b243f6218b4f2ca4de2717cf4a2af223b68210db.docx
pic_42727bc0ea22235a316e371df59c49011ef2a328..docx

Any insights or suggestions for improving the parsing speed would be greatly appreciated. Thank you!

KfirAlfa added the bug label on Sep 10, 2024

qarmin commented Sep 25, 2024

I cannot reproduce this. For me it takes only 5 seconds to process the file, and 7 seconds when also saving it as JSON.
Maybe you ran the code in debug mode?

image
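
For anyone comparing numbers like the ones above, here is a minimal sketch of how the debug-mode hypothesis could be checked. It is not taken from the library or its examples: `timed` and `build_profile` are hypothetical helper names, and the closure is a placeholder workload standing in for the real parsing call. It relies only on the standard library: `std::time::Instant` for wall-clock timing and `cfg!(debug_assertions)`, which is enabled by default in Cargo's dev profile and disabled in the release profile.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
/// (Hypothetical helper, not part of any library.)
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Reports which kind of build this binary is, based on `debug_assertions`.
/// With default Cargo profiles this is "debug" for `cargo run`
/// and "release" for `cargo run --release`.
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

fn main() {
    // Placeholder workload; replace the closure body with the real parsing call.
    let (sum, elapsed) = timed(|| (0..10_000_000u64).map(|x| x * 31).sum::<u64>());
    println!(
        "profile: {}, result: {}, elapsed: {:?}",
        build_profile(),
        sum,
        elapsed
    );
}
```

If the profile prints as `debug`, rerunning with `cargo run --release --example reader` (assuming the example target is named `reader`, after `reader.rs`) should show whether the gap between 5 seconds and 127 to 145 seconds comes from the build profile.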
