getTextContent() text items have wrong height #8276

LeonMelis · 2017-04-12T15:02:49Z

pdfJS Version: 1.7.290
nodeJS Version: v6.9.3
Test PDF file: test.pdf

TL;DR: textContent.height is way off compared to the rendered PDF. I'm not sure if this is a bug, an invalid PDF file, or intended behaviour.

On some PDF files (see the attached file for an example), the textContent items seem to have a wrong value for the 'height' property.

Consider the following example, which is the text 'Uw rekening' just below the top-right logo:

{
  "str": "Uw rekening",
  "dir": "ltr",
  "width": 98.928,
  "height": 0.54,
  "transform": [
    18,
    0,
    0,
    18,
    441.81,
    708.4499999999999
  ],
  "fontName": "g_d9_f21"
}

Here the 'height' property value is 0.54, whilst Math.sqrt(t[2]*t[2] + t[3]*t[3]) = 18 is expected.
When looking at the rendered PDF, we can also confirm that the text in question is actually rendered as 18px high.

I traced the code backwards through the PDF.js source, and this is what I found:

In flushTextContentItem(), the height is 18, but it is then multiplied by textContentItem.textAdvanceScale, which has a value of 0.03 for the attached PDF.

If we look at ensureTextContentItem(), we see that textAdvanceScale is calculated as follows:

textAdvanceScale = Math.sqrt(ctm[0]*ctm[0] + ctm[1]*ctm[1]) * Math.sqrt(tlm[0]*tlm[0] + tlm[1]*tlm[1])

Where ctm is the current transformation matrix, and tlm the text line matrix.
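To make the arithmetic concrete, here is a minimal sketch of that scale calculation with the numbers from this report. It is not the PDF.js source: `xScale` and `textAdvanceScaleOf` are names made up for illustration, and the text line matrix is an assumption (a pure translation built from the `TD` operands in the content stream quoted in this report).

```javascript
// Length of the x basis vector of a PDF matrix [a, b, c, d, e, f].
function xScale(m) {
  return Math.hypot(m[0], m[1]);
}

// Hypothetical helper that mirrors the formula quoted above.
function textAdvanceScaleOf(ctm, tlm) {
  return xScale(ctm) * xScale(tlm);
}

const ctm = [0.03, 0, 0, 0.03, 0, 0];
// Assumption: the text line matrix only translates, so its scale is 1.
const tlm = [1, 0, 0, 1, 14727, 23615];

const scale = textAdvanceScaleOf(ctm, tlm);
// The height is already 18 before the multiplication described above.
const reportedHeight = 18 * scale;

console.log(scale.toFixed(2), reportedHeight.toFixed(2));
```

Under these assumptions the scale comes out at about 0.03, which would turn a height of 18 into the reported 0.54.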

The text line matrix looks just fine, but (in case of this PDF example), the ctm seems very unlikely:

[
  0.03,
  0,
  0,
  0.03,
  0,
  0
]

Eventually I found that a cm operator is encountered with args [0.03, 0, 0, 0.03, 0, 0], which is then handled in preprocessCommand() and triggers stateManager.transform(args), where the ctm is updated to [0.03, 0, 0, 0.03, 0, 0].

But this is where my debugger threw in the towel as it crashes when trying to navigate through the massive 57k LOC PDFJS library.

When inspecting the PDF, I find this part:

q
0.03 0 0 0.03 0 0 cm
BT
/F9 600.00 Tf
0.89 0.00 0.10 rg
14727 23615 TD
(Uw rekening) Tj
*snip*
ET Q

So yes, the graphics state is modified right before the text portion, but that's about where my knowledge of the PDF format ends. I don't know if the graphics state is supposed to influence text size.
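On whether the graphics state influences text size: in the PDF text model, the rendered glyph size is the product of the font size, the text matrix and the CTM, so a `cm` scale of 0.03 does shrink a 600-unit font to 18. Below is a minimal sketch of that multiplication for the stream above. It ignores horizontal scaling and text rise, `multiply` is a hypothetical helper, and the text matrix is assumed to be the identity translated by the `TD` operands.

```javascript
// Multiply two PDF matrices [a, b, c, d, e, f] in the PDF row-vector
// convention: the result applies `m` first, then `n`.
function multiply(m, n) {
  return [
    m[0] * n[0] + m[1] * n[2],
    m[0] * n[1] + m[1] * n[3],
    m[2] * n[0] + m[3] * n[2],
    m[2] * n[1] + m[3] * n[3],
    m[4] * n[0] + m[5] * n[2] + n[4],
    m[4] * n[1] + m[5] * n[3] + n[5],
  ];
}

const fontSize = 600;                               // from "/F9 600.00 Tf"
const fontScale = [fontSize, 0, 0, fontSize, 0, 0];
const textMatrix = [1, 0, 0, 1, 14727, 23615];      // assumed, from "14727 23615 TD"
const ctm = [0.03, 0, 0, 0.03, 0, 0];               // from "0.03 0 0 0.03 0 0 cm"

const rendering = multiply(multiply(fontScale, textMatrix), ctm);
console.log(rendering.map((v) => v.toFixed(2)));
```

The result lines up with the `transform` of the 'Uw rekening' item, which suggests the transform already includes the CTM scale.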

So, in conclusion: I don't know if this is a bug, an invalid PDF document or an intended behaviour. But I do know that height 0.54 is not how the document is actually rendered.

To get the actual rendered height of a text item, can I safely assume that the 'real' height is equal to Math.sqrt(t[2]*t[2] + t[3]*t[3]) ?
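A minimal sketch of that calculation follows. `heightFromTransform` is a hypothetical helper name, and it assumes `transform` is the six-element array shown in the item above.

```javascript
// Derive a text item's height from the y basis vector of its transform
// ([a, b, c, d, e, f]) instead of trusting item.height.
function heightFromTransform(item) {
  const [, , c, d] = item.transform;
  return Math.hypot(c, d);
}

const item = {
  str: "Uw rekening",
  height: 0.54,
  transform: [18, 0, 0, 18, 441.81, 708.4499999999999],
};

console.log(heightFromTransform(item));
```

This yields the scaled font size rather than the exact glyph bounds, so it may still differ slightly from the visible height of the text.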

brendandahl · 2017-04-12T17:55:27Z

Looks like this may be a regression from #7879

Saltallica · 2017-06-03T13:37:17Z

Same deal here using pdfjs-dist 1.8.412

Heights are greatly exaggerated:

{ str: 'Y',
  dir: 'ltr',
  width: 4.335500000000001,
  height: 42.25,
  transform: [ 0, 6.5, -6.5, 0, 488.4, 611.8 ],
  fontName: 'Helvetica' }

For a single "Y" at a 6.5pt font, I would expect this to be near 6.5pt in height, not multiplied by 6.5.

Saltallica · 2017-06-03T13:49:12Z

If it helps, this worked correctly in 1.6.210, which I have reverted back to.

chadkirby · 2017-11-17T01:00:08Z

Same issue using pdfjs 1.9.646

Heights are reported as the square of their actual values.

{
  "initialized": false,
  "str": [],
  "width": 19.91999320925712,
  "height": 423.18367629061225,
  "vertical": false,
  "lastAdvanceWidth": 0.91875,
  "lastAdvanceHeight": 0,
  "textAdvanceScale": 20.57142864,
  "spaceWidth": 0.6,
  "fakeSpaceMin": 0.18,
  "fakeMultiSpaceMin": 0.8999999999999999,
  "fakeMultiSpaceMax": 2.4,
  "textRunBreakAllowed": false,
  "transform": [
    6.300000021,
    0,
    0,
    20.57142864,
    59.76000019919999,
    751.1999999999999
  ],
  "fontName": "Courier"
}
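The numbers in this dump are consistent with the height having been multiplied by `textAdvanceScale` one time too many. A quick check, using a hypothetical `extraScale` helper that compares the reported height with the height implied by the transform:

```javascript
// Ratio between the reported height and the height derived from the
// transform's y basis vector; a value near 1 would mean they agree.
function extraScale(item) {
  const fromTransform = Math.hypot(item.transform[2], item.transform[3]);
  return item.height / fromTransform;
}

const dumped = {
  height: 423.18367629061225,
  textAdvanceScale: 20.57142864,
  transform: [6.300000021, 0, 0, 20.57142864, 59.76000019919999, 751.1999999999999],
};

console.log(extraScale(dumped), dumped.textAdvanceScale);
```

For this item the ratio comes out very close to `textAdvanceScale` itself.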

brendandahl · 2017-11-17T22:32:48Z

@LeonMelis @Saltallica @chadkirby If you don't mind sharing, what are you using getTextContent for? I've recently been thinking about cleaning up a few things, but don't want to remove things people are using.

chadkirby · 2017-11-17T22:41:08Z

@brendandahl In a nutshell, I am trying to draw rectangles around certain paragraphs. I search through each page for target strings, then when I find the target, I need to compute the paragraph's bounding box so that I can annotate that portion of the page.
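For readers with the same use case, here is a rough sketch of such a bounding-box computation that avoids `item.height` entirely. It makes several assumptions: the text is horizontal and unrotated, coordinates are in PDF user space with the origin at the bottom left, `transform[4]` and `transform[5]` give the start of the baseline, and descenders are ignored. `boundingBox` is a hypothetical helper, and the sample items are made up.

```javascript
// Union of the boxes of several text items, using the transform-derived
// height rather than the reported item.height.
function boundingBox(items) {
  let left = Infinity;
  let bottom = Infinity;
  let right = -Infinity;
  let top = -Infinity;
  for (const item of items) {
    const [, , c, d, x, y] = item.transform;
    const height = Math.hypot(c, d);
    left = Math.min(left, x);
    bottom = Math.min(bottom, y);
    right = Math.max(right, x + item.width);
    top = Math.max(top, y + height);
  }
  return { left, bottom, right, top };
}

// Two made-up lines of 12-unit text, one above the other.
const lines = [
  { width: 200, transform: [12, 0, 0, 12, 72, 700] },
  { width: 150, transform: [12, 0, 0, 12, 72, 686] },
];

console.log(boundingBox(lines));
// { left: 72, bottom: 686, right: 272, top: 712 }
```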

LeonMelis · 2017-11-18T12:23:03Z

I use it to extract text from a document, with coordinates and dimensions of the textboxes. This allows me to parse documents and extract relevant data. I'm also using it to draw rectangles around certain text elements (highlighting them), like chadkirby does.

mv80 · 2017-12-07T07:40:15Z

Viewed on Mac: bla.pdf
I am working with version 1.8.418 and am also getting heights that are greatly exaggerated. In my document I get a height of 144 instead of the roughly 20 I expected. The function getTextContent returns the wrong height value. Is there any solution besides changing the version to 1.6.210? @yurydelendik
When I search for the word "test" in the file, the matched word is highlighted incorrectly. I attached my sample file.

jacksteamdev · 2017-12-15T15:06:00Z

This still seems to be a problem. I'm using pdfjs-dist@2.0.205

I use getTextContent to extract the text and map text items to groups roughly representing paragraphs. The dimensions and position are key to this.

In my own documents, I have found that item.transform[3] consistently provides a value close enough for my purposes.

{
  height: 72.89999999999999, // Way off
  transform: [
    9, // This value is close
    0,
    0,
    8.1, // So is this value
    54,
    756.0884
  ],
  width: 10.008000000000001
}

Saltallica · 2018-05-22T21:28:54Z

This is still a problem in 1.10.97. And like others, I'm using PDF.js not just as a viewer, but as a way to extract data by getting text boxes from certain areas of a document. Without accurate heights it doesn't work.

Saltallica · 2018-05-23T14:26:34Z

Also... is there a way we can get someone to look into this particular issue?

timvandermeij · 2018-05-23T20:35:31Z

If it worked in 1.6.210, you could try using git bisect to find out the commit where it regressed if it's indeed a regression. That would help to speed up the resolution process.

Peter-Optiway · 2018-11-21T15:57:47Z

I have the same problem, but I get height: 195.4404 when it should be somewhere between 12 and 15 (I guess).

I use pdfjs to index a PDF, and then open the PDF with pdfjs at the correct page and y coordinate.

I'm using pdfjs-dist 1.9.638. Upgrading to 2.0.943 did not help, and reverting to 1.6.210 did not solve my issue either. (It might be a faulty PDF, but it renders correctly in all PDF readers I have tried, including pdfjs.)

Is there some other way I can calculate the top position? I'm currently using item.transform[5] + item.height.

2008-005.pdf

aberkovsky · 2018-12-15T06:32:45Z

I found the problem in commit 4537590.
Before this commit, the height was multiplied by textAdvanceScale only for vertical fonts.
After it, the height is multiplied in all cases.
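To illustrate the difference described here, the following is a simplified model of the two behaviours. It is not the actual PDF.js code: `heightBefore` and `heightAfter` are invented names, and the input values are taken from the first report in this thread.

```javascript
// Model of the behaviour before the commit: only vertical fonts have
// their height multiplied by the advance scale.
function heightBefore(height, scale, vertical) {
  return vertical ? height * scale : height;
}

// Model of the behaviour after the commit: the height is always multiplied.
function heightAfter(height, scale) {
  return height * scale;
}

// Horizontal text with a height of 18 and an advance scale of 0.03.
console.log(heightBefore(18, 0.03, false)); // 18
console.log(heightAfter(18, 0.03).toFixed(2)); // "0.54"
```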

timvandermeij added the text-selection label Apr 12, 2017

chadkirby added a commit to chadkirby/pdf.js-extract that referenced this issue Nov 17, 2017

compute height from transform data

accb080

to work around mozilla/pdf.js#8276

chadkirby mentioned this issue Nov 17, 2017

compute height from transform matrix and add buffer-processing method chadkirby/pdf.js-extract#1

Merged

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 9, 2019

recalc textItems width or height; process 'font.vertical' cases

787668a

fix mozilla#8276

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 9, 2019

recalc textItems width or height; process 'font.vertical' cases

1020b70

fix mozilla#8276

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 14, 2019

recalc textItems width or height; process 'font.vertical' cases

b8cd69c

fix mozilla#8276

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 15, 2019

recalc textItems width or height; process 'font.vertical' cases

dc333cc

fix mozilla#8276

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 15, 2019

recalc textItems width or height; process 'font.vertical' cases

1e60853

fix mozilla#8276

axlebedev added a commit to axlebedev/pdf.js that referenced this issue Jan 15, 2019

recalc textItems width or height; process 'font.vertical' cases

0eab43f

fix mozilla#8276

axlebedev mentioned this issue Jan 15, 2019

recalc textItems width or height; process 'font.vertical' cases #10456

Closed

Snuffleupagus mentioned this issue Jan 29, 2019

Do the final text scaling correctly in flushTextContentItem (issue 8276) #10508

Merged

timvandermeij closed this as completed in #10508 Jan 29, 2019

timvandermeij mentioned this issue Feb 10, 2019

PDF coordinates from backend to pdfjs #10535

Closed

atbah added a commit to atbah/pdf-extract that referenced this issue Apr 17, 2024

compute height from transform data

cdc5ccc

to work around mozilla/pdf.js#8276
