Incorrectly Determining Height of Characters #48

Closed
jeremyschiff opened this issue Mar 6, 2017 · 6 comments · Fixed by #348
Labels
component: converter Related to any PDFLayoutAnalyzer type: bug

Comments

@jeremyschiff

WrongFontSizes3.pdf

The following simple Python code illustrates a bug in parsing the attached PDF file: pdfminer incorrectly determines the height of the text, reporting the small text as much larger than the big text.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar

def parse_pages():

    fp = open('WrongFontSizes3.pdf', 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    parser.set_document(doc)

    rsrcmgr = PDFResourceManager()
    laparams = LAParams(char_margin=3.5, all_texts=True)
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
        layout = device.get_result()
        yield layout

if __name__ == '__main__':
    for page in parse_pages():
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    # Python 3 print call; get_text() already returns str
                    print(char.get_text(), char.size)

Output:

B 29.4555
i 29.4555
g 29.4555
T 29.4555
e 29.4555
x 29.4555
t 29.4555
S 66.96
m 66.96
a 66.96
l 66.96
l 66.96
  66.96
  66.96
T 66.96
e 66.96
x 66.96
t 66.96

Process finished with exit code 0
@LuJyKa

LuJyKa commented May 15, 2018

You can change the code in pdfminer/pdffont.py, line 511:

# PDFFont
class PDFFont(object):
    ....
    def get_height(self):
        # h = self.bbox[3]-self.bbox[1]
        # if h == 0:
        #    h = self.ascent - self.descent
        h = self.ascent - self.descent
        return h * self.vscale

@pietermarsman
Member

Possibly related to #203

@pietermarsman
Member

pietermarsman commented Oct 16, 2019

I have put some effort into this to figure out what is going wrong.

First of all, in the current implementation of pdfminer, LTChar.size for horizontal characters equals the height of the character's bounding box. That height is the product of three things: the height of the font, the font size for that particular character, and the x-scaling from the transformation matrix at the moment the character is written (PDF Reference, section 4.2.2).

Your PDF contains two different fonts with different font heights:

txt   font-name            font-height x font-size x scale-x = bbox-size
  B    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  i    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  g    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  T    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  e    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  x    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  t    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  S         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  m         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  a         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  l         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  l         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
            CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
            CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  T         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  e         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  x         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  t         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96

In this case the font-size looks like the size you would specify in Microsoft Word or a similar editor, and it is indeed smaller for the small text than for the large text. The font-height is what causes the problem: apparently the font of the small text is much taller than the font of the large text.
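As a quick check of the product described above, the font-height can be recovered from the numbers already in this thread. This is only a sketch: the sizes come from the issue output, the font sizes come from the table (where they are rounded to two decimals), and the dictionary layout and the helper name `implied_font_height` are made up for illustration.

```python
# LTChar.size values from the issue output and font sizes from the table.
# The table rounds to two decimals, so the results are approximate.
reported = {
    "BAAAAA+Cambria-Bold": {"bbox_size": 29.4555, "font_size": 21.90},
    "CAAAAA+Cambria": {"bbox_size": 66.96, "font_size": 12.00},
}
scale = 1.00  # scale column of the table, identical for every character


def implied_font_height(bbox_size, font_size, scale):
    """Invert bbox_size = font_height * font_size * scale (hypothetical helper)."""
    return bbox_size / (font_size * scale)


heights = {
    name: implied_font_height(v["bbox_size"], v["font_size"], scale)
    for name, v in reported.items()
}

for name, height in heights.items():
    print(f"{name}: font-height ~ {height:.3f}")
```

The smaller font size is paired with the much larger font-height, which is why the small text ends up with the larger reported size.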

If I copy the text from the pdf to Microsoft Word and save it again as a pdf the problem disappears.

@jeremyschiff Is the difference in fonts essential to your problem? Could you generate a new problematic example pdf that only contains one font? Is this still worth looking into?

(You might want to use the font-size to get your result. However, this can be misleading: I've seen PDFs where the font-size is always 1.0 and the x-scaling of the transformation matrix determines the actual font size.)
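The caveat about a font-size of 1.0 can be illustrated with a small sketch. It assumes a PDF-style six-number matrix `(a, b, c, d, e, f)` and measures how far a unit vertical vector is stretched, which equals the x-scaling whenever the scaling is uniform. The function name `effective_font_size` is hypothetical and not part of pdfminer.

```python
import math


def effective_font_size(font_size, matrix):
    """Scale a nominal font size by the vertical stretch of a PDF matrix.

    Hypothetical helper, not a pdfminer API. `matrix` is (a, b, c, d, e, f);
    a unit vertical vector (0, 1) is mapped to (c, d), so its length gives
    the vertical scale factor. For uniform scaling this equals the x-scaling.
    """
    a, b, c, d, e, f = matrix
    return font_size * math.hypot(c, d)


# Two ways a producer might encode the same 12-point text.
size_in_font = effective_font_size(12.0, (1, 0, 0, 1, 0, 0))
size_in_matrix = effective_font_size(1.0, (12, 0, 0, 12, 0, 0))
print(size_in_font, size_in_matrix)
```

Both encodings give the same effective size, so reading the font-size alone would report 1.0 for the second one.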

@pietermarsman
Member

I've checked whether this bug was introduced recently. The answer is no: it has been there at least since 7b620b3.

@pietermarsman
Member

pietermarsman commented Dec 30, 2019

Other observations:

  • If I copy the text from the PDF to Microsoft Word, it has font-size 22 for the big text and font-size 12 for the small text. These match the font sizes computed by pdfminer.six.
  • The font-height is computed from the /FontBBox property of the font. If I drastically increase its upper y-coordinate, the sizes computed by pdfminer.six increase as well. However, Adobe Acrobat Reader still displays the font at the same size. This suggests that /FontBBox should not influence the size of the drawn characters.
  • Section 5.1.1 and Section 5.3.3 of the PDF Reference refer only to the text state (e.g. font size) and the text matrix when computing the size and spacing of glyphs. They make no use of /FontBBox.

In short: my conclusion is that this bug is caused by multiplying with the font-height (i.e. the height of the /FontBBox). This multiplication should be removed.
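The effect of dropping that multiplication can be sketched with the rounded numbers from the table above. This is an illustration only, not the change made in #348; both function names are hypothetical.

```python
def size_with_font_height(font_height, font_size, scale):
    """Size as described in this thread: product of all three factors."""
    return font_height * font_size * scale


def size_without_font_height(font_size, scale):
    """Proposed alternative: leave the /FontBBox height out of the product."""
    return font_size * scale


# (font-height, font-size, scale) taken from the table, rounded values.
big_text = (1.34, 21.90, 1.00)    # BAAAAA+Cambria-Bold
small_text = (5.58, 12.00, 1.00)  # CAAAAA+Cambria

with_big = size_with_font_height(*big_text)
with_small = size_with_font_height(*small_text)
without_big = size_without_font_height(big_text[1], big_text[2])
without_small = size_without_font_height(small_text[1], small_text[2])

print("with font-height:    big < small ->", with_big < with_small)
print("without font-height: big > small ->", without_big > without_small)
```

With the font-height included, the small text comes out larger than the big text. Without it, the ordering matches what is visible on the page.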

@pietermarsman
Member

@jeremyschiff do you have time for a review? Could you take a look at #348?

@pietermarsman pietermarsman added component: converter Related to any PDFLayoutAnalyzer and removed help wanted labels Jan 14, 2020