This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
I am not sure if this is a bug. #3788
Comments
The attached PDF is different from the attached image! This clearly is no error and I also see no basis for whatever "enhancement".
I am talking about text extraction. If you look at RED line 2 of the reference image, you will find that 'A194/C194 Cu Alloy' and 'Sample Name' are not extracted on the same line.
That too is not a bug but a technical peculiarity of MuPDF. You need your own code to recover lines that roughly look like the visible ones. But there is example code that can be used for this:
import pymupdf
# import a helper method from sister package
from pymupdf4llm.helpers.get_text_lines import get_text_lines
doc = pymupdf.open("test.pdf")
page = doc[0]
text = get_text_lines(page)
print(text)
This produces the following output:
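Independently of the helper above, here is a minimal sketch of what "your own code" for recovering visual lines might look like. It assumes the word tuples returned by `page.get_text("words")`, which have the form `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, and merges words whose bottom edges lie within a small vertical tolerance. The names `group_words_into_lines`, `lines_from_pdf` and `y_tolerance` are hypothetical, and this is not necessarily how `get_text_lines` works internally.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into visual lines.

    Words whose bottom coordinate (y1) is within `y_tolerance` points of the
    first word of the current line are treated as belonging to that line.
    The tolerance value is an assumption; tune it for your documents.
    """
    lines = []  # each entry: [reference_bottom_y, [word, ...]]
    # Process words top-to-bottom, then left-to-right.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        bottom = w[3]
        if lines and abs(bottom - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append([bottom, [w]])
    # Within each line, order words by their left edge and join the text.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]


def lines_from_pdf(path, page_number=0, y_tolerance=3.0):
    """Hypothetical convenience wrapper: extract visual lines from one page."""
    import pymupdf  # imported lazily so the grouping logic has no dependency

    doc = pymupdf.open(path)
    page = doc[page_number]
    return group_words_into_lines(page.get_text("words"), y_tolerance)
```

Calling `lines_from_pdf("test.pdf")` would then return one string per recovered line. A fixed tolerance is a deliberate simplification: text with strongly varying font sizes or rotated text would need a more careful criterion.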
I have a sample PDF. I hope that these 5 lines of interest can be extracted and displayed correctly
(please refer to the lines underlined in RED in the attached PNG file).
The sample PDF file can be found here.
https://www.nxp.com/testreports/360000002263_CDA_194_ZHM_A_HLGN.pdf
(update sample PDF)