I don't understand how to interpret the result from getTextContent #8096

seth4618 · 2017-02-22T23:46:42Z

I am using pdf.js to understand the text layout of PDF documents. However, I am having trouble understanding the information returned by getTextContent. Sometimes it appears that I have to scale the height of each item by the vertical scale in the transform, i.e., transform[3]. Other times I don't. I have no idea how to determine when I have to and when I don't. Below are two examples from different PDF documents. In the first, I should not scale the height. In the second, I should. Does anyone know how I can figure this out?

-- Result from Document 1 (In this case the font is 14pt)

{str:"one two three four",
 dir:"ltr",
 width:151.67308125000002,
 height:20.6625,
 transform:[20.6625,0,0,20.6625,110.854,651.853],
 fontName:"g_d0_f1"}

{fontFamily:"sans-serif",
 ascent:0.694,
 descent:-0.195}

-- Result from Document 2 (In this case the font is 8pt)

{str:"some text goes here to see if 101 and 2011xx: 1–8",
 dir:"ltr",
 width:171.72799999999987,
 height:64,
 transform:[8,0,0,8,30,684.5359],
 fontName:"g_d0_f1"
}

{fontFamily:"serif",
 ascent:0.883,
 descent:-0.217}

(This is also posted on Stack Overflow.)


akshayda04 · 2017-03-02T15:15:18Z

Even I need this info. I need to know the exact location of text on the pdf doc

yurydelendik · 2017-03-02T15:24:59Z

See https://github.com/mozilla/pdf.js/blob/master/examples/text-only/pdf2svg.js example comments.

Jun711 · 2018-09-07T17:20:42Z

Hi @yurydelendik

By pageLoaded, it doesn't mean the whole pdf has to be rendered, right?

My use case is that I want to get text from a range of pages while I am on the first page of the PDF, for example. Will the example below work for me? Thanks.

function pageLoaded() {
  // Loading document and page text content
  pdfjsLib.getDocument({url: PDF_PATH}).then(function (pdfDocument) {
    pdfDocument.getPage(PAGE_NUMBER).then(function (page) {
      var viewport = page.getViewport(PAGE_SCALE);
      page.getTextContent().then(function (textContent) {
        // building SVG and adding that to the DOM
        var svg = buildSVG(viewport, textContent);
        document.getElementById('pageContainer').appendChild(svg);
      });
    });
  });
}
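As far as I understand the API, getTextContent does not depend on the page having been rendered, so text for a range of pages can be collected without drawing anything. The following is a minimal sketch under that assumption; `getTextForPageRange` is a hypothetical helper name, `pdfDocument` is assumed to be an already loaded document object, and page numbers are assumed to be 1-based.

```javascript
// Hypothetical helper: collect the text of pages firstPage..lastPage (inclusive).
// Assumes pdfDocument exposes getPage(n) and each page exposes getTextContent(),
// both returning promises, as in the snippet above.
async function getTextForPageRange(pdfDocument, firstPage, lastPage) {
  const result = [];
  for (let n = firstPage; n <= lastPage; n++) {
    const page = await pdfDocument.getPage(n);
    const textContent = await page.getTextContent();
    // Each item carries its string in `str`; join them for a plain-text view.
    const text = textContent.items.map(function (item) {
      return item.str;
    }).join(' ');
    result.push({ pageNumber: n, text: text });
  }
  return result;
}
```

The pages are fetched one after another to keep the sketch simple; the per-item `transform` values are still available on `textContent.items` if positions are needed.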

cg2p · 2019-01-16T08:43:57Z

Even I need this info. I need to know the exact location of text on the pdf doc

Me too. Have you figured this out?

I assume the elements in this example, transform:[8,0,0,8,30,684.5359], specify the character position on the page.

What does each of the array elements mean?
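For reference, a six-element array like this follows the usual PDF convention for a 2D affine matrix [a, b, c, d, e, f], where a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f). Under that reading, the last two elements are the translation (the position of the text origin in page coordinates, where the origin is typically the bottom-left corner), and the first four carry scale and rotation. The sketch below illustrates this; `describeTransform` and `applyTransform` are hypothetical helper names, not part of pdf.js.

```javascript
// Hypothetical helper: split a [a, b, c, d, e, f] matrix into readable parts.
function describeTransform(transform) {
  const [a, b, c, d, e, f] = transform;
  return {
    x: e,                                        // horizontal translation
    y: f,                                        // vertical translation
    scaleX: Math.hypot(a, b),                    // length of the transformed x axis
    scaleY: Math.hypot(c, d),                    // length of the transformed y axis
    rotationDeg: Math.atan2(b, a) * 180 / Math.PI
  };
}

// Hypothetical helper: map a point through the matrix (PDF convention).
function applyTransform(point, m) {
  const [x, y] = point;
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

const info = describeTransform([8, 0, 0, 8, 30, 684.5359]);
console.log(info.x, info.y, info.scaleX, info.scaleY, info.rotationDeg);
```

For the example matrix, this reads as unrotated text scaled by 8 in both directions, with its origin at (30, 684.5359).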

timvandermeij · 2019-01-29T21:49:15Z

Closing since the height and width calculation was wrong. This has been fixed in #10508.
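The two results quoted at the top of this thread seem consistent with that: in Document 1 the reported height (20.6625) equals transform[3], while in Document 2 the reported height (64) equals 8 × 8, which looks as if the scale had been applied twice. For anyone on a version without the fix, one possible workaround is to derive the font height from the transform instead of trusting `height`. This is only a sketch; `fontHeightFromTransform` is a hypothetical helper, and it assumes the transform is the [a, b, c, d, e, f] matrix described by the PDF convention.

```javascript
// Hypothetical helper: estimate the font height of a text item from its
// transform, using the length of the transformed vertical axis (c, d).
// This also handles rotated text, where transform[3] alone would not be enough.
function fontHeightFromTransform(item) {
  const c = item.transform[2];
  const d = item.transform[3];
  return Math.hypot(c, d);
}

// The two items reported in this thread (other fields omitted).
const doc1Item = { height: 20.6625, transform: [20.6625, 0, 0, 20.6625, 110.854, 651.853] };
const doc2Item = { height: 64, transform: [8, 0, 0, 8, 30, 684.5359] };

console.log(fontHeightFromTransform(doc1Item), fontHeightFromTransform(doc2Item));
```

With this approach both documents are handled the same way, so there is no need to guess when to rescale.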

timvandermeij added the other label Feb 23, 2017

timvandermeij closed this as completed Jan 29, 2019

timvandermeij added text-selection and removed other labels Jan 29, 2019
