getTextContent() text items have wrong height #8276
Comments
Looks like this may be a regression from #7879.
Same deal here using pdfjs-dist 1.8.412. Heights are greatly exaggerated:
For a single "Y" at a 6.5pt font, I would expect this to be near 6.5pt in height, not multiplied by 6.5.
If it helps, this worked correctly in 1.6.210, which I have reverted to.
Same issue using pdfjs 1.9.646. Heights are reported as the square of their actual values.
to work around mozilla/pdf.js#8276
@LeonMelis @Saltallica @chadkirby If you don't mind sharing, what are you using getTextContent for? I've recently been thinking about cleaning up a few things, but don't want to remove things people are using.
@brendandahl In a nutshell, I am trying to draw rectangles around certain paragraphs. I search through each page for target strings, then when I find the target, I need to compute the paragraph's bounding box so that I can annotate that portion of the page.
I use it to extract text from a document, with coordinates and dimensions of the textboxes. This allows me to parse documents and extract relevant data. I'm also using it to draw rectangles around certain text elements (highlighting them), like chadkirby does.
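For readers with the same use case, here is a minimal sketch of how a paragraph bounding box might be assembled from text items. It assumes unrotated items with the str, transform and width fields, takes (transform[4], transform[5]) as the baseline origin, and derives each height from the transform instead of trusting item.height. The helper names itemBox and paragraphBox and the sample items are made up for illustration; they are not part of PDF.js.

```javascript
// Hypothetical helper: box of one text item in PDF user-space units.
// Assumes an unrotated item whose transform is [a, b, c, d, e, f], with (e, f)
// the baseline origin. The height is derived from the transform, not item.height.
function itemBox(item) {
  const [, , c, d, e, f] = item.transform;
  const height = Math.hypot(c, d);
  return { left: e, bottom: f, right: e + item.width, top: f + height };
}

// Union of the boxes of all items that belong to one paragraph.
function paragraphBox(items) {
  return items.map(itemBox).reduce((acc, box) => ({
    left: Math.min(acc.left, box.left),
    bottom: Math.min(acc.bottom, box.bottom),
    right: Math.max(acc.right, box.right),
    top: Math.max(acc.top, box.top),
  }));
}

// Made-up items for illustration only.
const sampleItems = [
  { str: "First line", transform: [12, 0, 0, 12, 72, 700], width: 60 },
  { str: "Second line", transform: [12, 0, 0, 12, 72, 686], width: 66 },
];
const sampleBox = paragraphBox(sampleItems);
console.log(sampleBox);
```

Because the box starts at the baseline, descenders fall slightly below it; padding the bottom edge is a judgment call left out of this sketch.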
Viewed on Mac: bla.pdf
This still seems to be a problem. I'm using pdfjs-dist@2.0.205. I use getTextContent to extract the text and map text items to groups roughly representing paragraphs. The dimensions and position are key to this. In my own documents, I have found that
This is still a problem in 1.10.97. Like others, I'm using PDF.js not just as a viewer but as a way to extract data by getting text boxes from certain areas of a document. Without accurate heights it doesn't work.
Also, is there a way we can get someone to look into this particular issue?
If it worked in 1.6.210, you could try using
I have the same problem but I get I use pdfjs to index a PDF, and then open the PDF with pdfjs at the correct page and y coordinate. I'm using Is there some other way I can calculate the top position? I'm currently using
I found the problem in commit 4537590.
pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf
TL;DR: textContent.height is way off compared to rendered PDF, I'm not sure if this is a bug, an invalid PDF file or if this is intended behaviour.
On some PDF files (see the attached file for an example), the textContent items seem to have a wrong value for the 'height' property.
Consider the following example, which is the text 'Uw rekening' just below the top-right logo:
Here the 'height' property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected.
When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.
I traced the code backwards in the PDFJS source; this is what I found:
In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.
If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:
Where ctm is the current transformation matrix, and tlm the text line matrix.
The text line matrix looks just fine, but (in the case of this PDF example) the ctm seems very unlikely:
Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].
But this is where my debugger threw in the towel, as it crashes when trying to navigate through the massive 57k LOC PDFJS library.
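To make the arithmetic concrete, here is a rough sketch of how a cm operator composes onto the ctm and how the reported 0.54 can arise. The multiply function is a plain 2D affine product written for this example, not the library's own code; the starting identity ctm and the text matrix scale of 600 are assumptions, chosen only so that the combined scale comes out at the 18 observed above.

```javascript
// Product of two affine matrices in PDF's [a, b, c, d, e, f] layout.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Assumption: the ctm is the identity before the cm operator is applied.
const identity = [1, 0, 0, 1, 0, 0];
const cmArgs = [0.03, 0, 0, 0.03, 0, 0];
const ctm = multiply(identity, cmArgs);

// Assumption: a text matrix with a scale of 600, picked so that 600 * 0.03 = 18.
const textMatrix = [600, 0, 0, 600, 0, 0];
const combined = multiply(ctm, textMatrix);

// Height as rendered: the ctm scale is already included once.
const renderedHeight = Math.hypot(combined[2], combined[3]);

// Multiplying by the 0.03 textAdvanceScale applies that scale a second time.
const reportedHeight = renderedHeight * 0.03;
console.log(renderedHeight, reportedHeight);
```

Under these assumptions the first value is about 18 and the second about 0.54, which matches the numbers in this report.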
When inspecting the PDF, I find this part:
So yes, the graphics state is modified right before the text portion, but that's about where my knowledge of the PDF format ends. I don't know whether the graphics state is supposed to influence text size.
So, in conclusion: I don't know if this is a bug, an invalid PDF document or intended behaviour. But I do know that a height of 0.54 is not how the document is actually rendered.
To get the actual rendered height of a text item, can I safely assume that the 'real' height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3])?
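A minimal sketch of that computation follows, assuming t is the item's transform array in the usual [a, b, c, d, e, f] layout. The function name heightFromTransform is hypothetical, and the sample item uses zero translation because the original values are not shown here.

```javascript
// Vertical scale of a text item derived from its transform [a, b, c, d, e, f].
// (c, d) is where the unit "up" vector lands, so its length is the scale.
function heightFromTransform(transform) {
  return Math.hypot(transform[2], transform[3]);
}

// Hypothetical item shaped like the one discussed above (translation set to 0).
const item = { str: "Uw rekening", height: 0.54, transform: [18, 0, 0, 18, 0, 0] };
const derivedHeight = heightFromTransform(item.transform);

// The same scale is recovered when the text is rotated by 90 degrees.
const rotatedHeight = heightFromTransform([0, 18, -18, 0, 0, 0]);
console.log(derivedHeight, rotatedHeight);
```

This yields the font-size scale applied to the text, which is probably what most callers mean by height; it is not the ink height of the individual glyphs.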