Incorrectly Determining Height of Characters #48
Comments
you can change code in pdfminer/pdffont.py line 511,
Possibly related to #203
I have put some effort into figuring out what is going wrong. First of all, in the current implementation of pdfminer the

In your PDF there are two different fonts with different font heights:

    txt  font-name            font-height x font-size x scale-x = bbox-size
    B    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    i    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    g    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    T    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    e    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    x    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    t    BAAAAA+Cambria-Bold         1.34 x     21.90 x    1.00 =     29.46
    S    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    m    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    a    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    l    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    l    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
         CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
         CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    T    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    e    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    x    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96
    t    CAAAAA+Cambria              5.58 x     12.00 x    1.00 =     66.96

In this case the font-size looks like the size that you would specify in Microsoft Word or a similar program. The font-size is also smaller for the small text than for the large text. The font-height, however, is what is causing the problem: apparently the font of the small text is much higher than the font of the large text.

If I copy the text from the PDF to Microsoft Word and save it again as a PDF, the problem disappears.

@jeremyschiff Is the difference in fonts essential to your problem? Could you generate a new problematic example PDF that only contains one font? Is this still worth looking into?

(You might want to use the font-size to get your result. However, this might sometimes be misleading, as I've seen PDFs where the font-size is always 1.0 and the x-scaling of the transformation matrix changes the actual font size.)
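To make the arithmetic in the comment above concrete, here is a minimal Python sketch that recomputes the product font-height x font-size x scale-x for the two fonts. The `CharMetrics` class and its field names are hypothetical and exist only for this illustration; they are not part of pdfminer. The numbers are copied from the figures above, which appear to be rounded, so a recomputed product may differ slightly from the listed bbox-size.

```python
from dataclasses import dataclass


@dataclass
class CharMetrics:
    """Per-character values copied from the comment (hypothetical container)."""
    font_name: str
    font_height: float  # height reported for the font itself
    font_size: float    # size as it would be set in a word processor
    scale_x: float      # x-scaling from the transformation matrix

    def bbox_size(self) -> float:
        # Product shown in the comment: font-height x font-size x scale-x.
        return self.font_height * self.font_size * self.scale_x


large = CharMetrics("BAAAAA+Cambria-Bold", 1.34, 21.90, 1.00)
small = CharMetrics("CAAAAA+Cambria", 5.58, 12.00, 1.00)

# The product ranks the small text as taller than the large text ...
print(small.bbox_size() > large.bbox_size())  # True
# ... while the font size alone ranks the two as expected.
print(small.font_size < large.font_size)      # True
```

This only reproduces the reported symptom. As the comment notes, ranking by font-size alone can also mislead when the font-size is always 1.0 and the transformation matrix carries the actual scaling.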
I've checked whether this bug was introduced recently. The answer is no: it has been there at least since 7b620b3.
Other observations:
In short: my conclusion is that this bug is caused by multiplying with the font-height (i.e. the height of the
@jeremyschiff Do you have time for a review? Could you take a look at #348?
WrongFontSizes3.pdf
The following simple Python code illustrates a bug in parsing the attached PDF file. Specifically, it incorrectly determines the height of text: it reports the small text as much taller than the big text.
Output: