Incorrectly Determining Height of Characters #48

jeremyschiff · 2017-03-06T21:58:39Z

The following simple Python code illustrates a bug in parsing the attached PDF file. Specifically, the height of the text is determined incorrectly: the small text is reported as much larger than the big text.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox, LTChar

def parse_pages():
    # Keep the file open while pages are yielded lazily; close it afterwards.
    with open('WrongFontSizes3.pdf', 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        parser.set_document(doc)

        rsrcmgr = PDFResourceManager()
        laparams = LAParams(char_margin=3.5, all_texts=True)
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
            layout = device.get_result()
            yield layout

if __name__ == '__main__':
    for page in parse_pages():
        for tbox in page:
            if not isinstance(tbox, LTTextBox):
                continue
            for line in tbox:
                for char in line:
                    if not isinstance(char, LTChar):
                        continue
                    # Print each character with the size reported for it.
                    print(char.get_text(), char.size)

Output:

B 29.4555
i 29.4555
g 29.4555
T 29.4555
e 29.4555
x 29.4555
t 29.4555
S 66.96
m 66.96
a 66.96
l 66.96
l 66.96
  66.96
  66.96
T 66.96
e 66.96
x 66.96
t 66.96

Process finished with exit code 0


LuJyKa · 2018-05-15T08:13:28Z

You can change the code in pdfminer/pdffont.py, line 511:

# PDFFont
class PDFFont(object):
    ....
    def get_height(self):
        # h = self.bbox[3]-self.bbox[1]
        # if h == 0:
        #    h = self.ascent - self.descent
        h = self.ascent - self.descent
        return h * self.vscale

pietermarsman · 2019-10-15T10:06:58Z

Possibly related to #203

pietermarsman · 2019-10-16T14:06:08Z

I have put some effort into figuring out what is going wrong.

First of all, in the current implementation of pdfminer, LTChar.size for horizontal characters is equal to the height of the character's bounding box. That height is the product of three things: the height of the font, the font size for that particular character, and the x-scaling from the transformation matrix at the moment the character is written (PDF Reference, section 4.2.2).

In your PDF there are two different fonts with different font heights:

txt   font-name            font-height x font-size x scale-x = bbox-size
  B    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  i    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  g    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  T    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  e    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  x    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  t    BAAAAA+Cambria-Bold        1.34 x     21.90 x    1.00 = 29.46
  S         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  m         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  a         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  l         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  l         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
            CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
            CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  T         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  e         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  x         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96
  t         CAAAAA+Cambria        5.58 x     12.00 x    1.00 = 66.96

In this case the font-size looks like the size you would specify in Microsoft Word or a similar editor, and it is indeed smaller for the small text than for the large text. The font-height, however, is what causes the problem: apparently the font of the small text is much taller than the font of the large text.
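As a quick check of the arithmetic in the table, here is a minimal sketch that multiplies the three factors for one row of each font. The helper name `bbox_size` is made up for illustration and is not part of pdfminer; the inputs are the rounded values from the table, so the first product comes out slightly below the reported 29.46.

```python
# Values copied from the table (font-height and font-size are rounded there).
ROWS = [
    ("BigText", "BAAAAA+Cambria-Bold", 1.34, 21.90, 1.00),
    ("Small Text", "CAAAAA+Cambria", 5.58, 12.00, 1.00),
]


def bbox_size(font_height, font_size, scale_x):
    """Hypothetical helper: product of the three factors described above."""
    return font_height * font_size * scale_x


sizes = {text: bbox_size(h, s, sc) for text, _font, h, s, sc in ROWS}

for text, size in sizes.items():
    print(text, round(size, 2))

# The smaller font-size still yields the larger reported size.
print(sizes["Small Text"] > sizes["BigText"])
```

The point of the sketch is only that a font-height of 5.58 outweighs the smaller font-size of 12, which is why the small text ends up with the larger LTChar.size.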

If I copy the text from the PDF into Microsoft Word and save it again as a PDF, the problem disappears.

@jeremyschiff Is the difference in fonts essential to your problem? Could you generate a new problematic example PDF that contains only one font? Is this still worth looking into?

(You might want to use the font-size to get your result. However, this can be misleading: I've seen PDFs where the font-size is always 1.0 and the x-scaling of the transformation matrix sets the actual font size.)

pietermarsman · 2019-12-30T14:06:25Z

I've checked whether this bug was introduced recently. The answer is no: it has been there since at least 7b620b3.

pietermarsman · 2019-12-30T16:18:12Z

Other observations:

If I copy the text from the PDF into Microsoft Word, it has font-size 22 for the big text and font-size 12 for the small text. These match the font-sizes computed by pdfminer.six.
The font-height is computed from the /FontBBox property of the font. If I drastically increase the upper y-coordinate, the font sizes computed by pdfminer.six also increase. However, Adobe Acrobat Reader still displays the font at the same size. This suggests that /FontBBox should not influence the size of the drawn characters.
Sections 5.1.1 and 5.3.3 of the PDF Reference refer only to the text state (e.g. font size) and the text matrix when computing the size and spacing of glyphs. They make no use of the /FontBBox.
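The dependence on /FontBBox can be sketched as follows. This mirrors the commented-out lines of get_height quoted earlier in the thread (bbox[3] - bbox[1], times vscale), but it is a standalone illustration: the function name, the bounding-box values and the 0.001 scale factor are assumptions chosen so that the heights come out near the 1.34 and 5.58 seen in this issue. They are not read from the actual fonts.

```python
V_SCALE = 0.001  # assumption: glyph-space units are 1/1000 of a text-space unit


def font_height_from_bbox(font_bbox, vscale=V_SCALE):
    """Height of a /FontBBox given as (x0, y0, x1, y1), scaled to text space."""
    _x0, y0, _x1, y1 = font_bbox
    return (y1 - y0) * vscale


# Made-up bounding boxes: the second only differs in its upper y-coordinate.
normal_bbox = (-100, -250, 1200, 1095)
inflated_bbox = (-100, -250, 1200, 5330)

for bbox in (normal_bbox, inflated_bbox):
    height = font_height_from_bbox(bbox)
    # Reported size for a character drawn at font-size 12 with scale 1.0.
    print(bbox, "->", round(height * 12.0, 2))
```

Raising only the upper y-coordinate raises the computed size, even though the glyphs drawn at font-size 12 would not change.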

In short: my conclusion is that this bug is caused by multiplying by the font-height (i.e. the height of the /FontBBox). This multiplication should be removed.
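To illustrate the effect of that conclusion, here is a rough sketch comparing the size with and without the font-height factor, using the rounded numbers reported earlier in this issue. Both function names are hypothetical, and this is not the code merged in #348; it only shows that dropping the factor restores the expected ordering.

```python
# (text, font-height, font-size, scale) as reported earlier in this issue.
CHARS = [
    ("BigText", 1.34, 21.90, 1.00),
    ("Small Text", 5.58, 12.00, 1.00),
]


def size_with_font_height(font_height, font_size, scale):
    """Current behaviour as described in this thread."""
    return font_height * font_size * scale


def size_without_font_height(font_size, scale):
    """Proposed behaviour: ignore the /FontBBox height."""
    return font_size * scale


before = {t: size_with_font_height(h, s, sc) for t, h, s, sc in CHARS}
after = {t: size_without_font_height(s, sc) for t, _h, s, sc in CHARS}

for text in before:
    print(text, round(before[text], 2), "->", round(after[text], 2))
```

Without the font-height factor the sizes are 21.9 and 12.0, close to the 22 and 12 that Microsoft Word shows for the same text.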

pietermarsman · 2020-01-09T19:56:26Z

@jeremyschiff do you have time for a review? Could you take a look at #348?

goulu added the help wanted label Apr 18, 2017

pietermarsman added the type: bug label Oct 13, 2019

pietermarsman mentioned this issue Dec 30, 2019

Fix bug in computing character bounding box #348

Merged


pietermarsman added component: converter Related to any PDFLayoutAnalyzer and removed help wanted labels Jan 14, 2020

pietermarsman closed this as completed in #348 Jan 16, 2020
