I don't understand how to interpret the result from getTextContent #8096

Closed
seth4618 opened this issue Feb 22, 2017 · 5 comments

Comments

@seth4618

I am using pdf.js to understand the text layout of PDF documents. However, I am having trouble understanding the information returned by getTextContent. Sometimes it appears that I have to scale the height of each item by the vertical scale in the transform, i.e., transform[3]. Other times I don't, and I have no idea how to tell the two cases apart. Below are two examples from different PDF documents. In the first, I should not scale the height. In the second, I should. Does anyone know how I can figure this out?

-- Result from Document 1 (In this case the font is 14pt)

{str:"one two three four",
 dir:"ltr",
 width:151.67308125000002,
 height:20.6625,
 transform:[20.6625,0,0,20.6625,110.854,651.853],
 fontName:"g_d0_f1"}

{fontFamily:"sans-serif",
 ascent:0.694,
 descent:-0.195}

-- Result from Document 2 (In this case the font is 8pt)

{str:"some text goes here to see if 101 and 2011xx: 1–8",
 dir:"ltr",
 width:171.72799999999987,
 height:64,
 transform:[8,0,0,8,30,684.5359],
 fontName:"g_d0_f1"
}

{fontFamily:"serif",
 ascent:0.883,
 descent:-0.217}

(This is also posted on Stack Overflow.)
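
One way to sanity-check the two results above is to derive the vertical scale from the transform itself and compare it with the reported height. This is only a sketch: it assumes the text is not skewed, so that the vertical scale is the length of the vector (transform[2], transform[3]). The helper names heightFromTransform and compareHeights are hypothetical and not part of pdf.js.

```javascript
// Hypothetical helper: vertical scale implied by the text matrix.
// For an unrotated, unskewed item this is simply transform[3].
function heightFromTransform(item) {
  const [, , c, d] = item.transform;
  return Math.hypot(c, d);
}

// Compare the height reported by getTextContent with the derived one.
function compareHeights(item) {
  const derived = heightFromTransform(item);
  return {
    reported: item.height,
    derived,
    ratio: item.height / derived,
  };
}

// The two items from the question (only the fields used here).
const doc1Item = { height: 20.6625, transform: [20.6625, 0, 0, 20.6625, 110.854, 651.853] };
const doc2Item = { height: 64, transform: [8, 0, 0, 8, 30, 684.5359] };

console.log(compareHeights(doc1Item));
console.log(compareHeights(doc2Item));
```

For Document 1 the reported and derived heights agree. For Document 2 the reported height is 8 times the derived one (64 = 8 × 8), which looks as if the scale had been applied twice.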

@akshayda04

akshayda04 commented Mar 2, 2017

Even I need this info. I need to know the exact location of text on the pdf doc

@yurydelendik
Contributor

See the comments in the https://github.com/mozilla/pdf.js/blob/master/examples/text-only/pdf2svg.js example.
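
The linked example appears to combine each item's transform with the viewport transform and then flip the text upright again. The sketch below reproduces that composition without pdf.js. The multiply function is hand-written and intended to match what pdfjsLib.Util.transform computes. The viewportTransform helper, the 792pt page height, and the 1.5 scale are assumptions chosen for illustration.

```javascript
// Multiply two PDF-style affine matrices [a, b, c, d, e, f] (m1 applied after m2).
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Assumed viewport matrix for an unrotated page whose origin is at the
// bottom-left: scale uniformly and flip the Y axis so that Y grows downwards.
function viewportTransform(pageHeight, scale) {
  return [scale, 0, 0, -scale, 0, pageHeight * scale];
}

// Item from Document 2 in the question.
const item = { transform: [8, 0, 0, 8, 30, 684.5359] };

const vt = viewportTransform(792, 1.5);
// Map into viewport space, then flip the glyphs back so they are not mirrored.
const tx = multiply(multiply(vt, item.transform), [1, 0, 0, -1, 0, 0]);

// tx[4], tx[5]: baseline start in viewport pixels, measured from the top-left.
// Math.hypot(tx[2], tx[3]): font height in viewport pixels.
console.log(tx, Math.hypot(tx[2], tx[3]));
```

Under these assumptions the 8pt text ends up 12 pixels tall, with its baseline starting 45 pixels from the left and roughly 161.2 pixels from the top of the viewport.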

@Jun711

Jun711 commented Sep 7, 2018

Hi @yurydelendik

pageLoaded doesn't mean the whole PDF has to be rendered, right?

My use case is that I want to get the text from a range of pages while I am still on the first page of the PDF, for example. Will the example below work for me? Thanks.

function pageLoaded() {
  // Loading document and page text content
  pdfjsLib.getDocument({url: PDF_PATH}).then(function (pdfDocument) {
    pdfDocument.getPage(PAGE_NUMBER).then(function (page) {
      var viewport = page.getViewport(PAGE_SCALE);
      page.getTextContent().then(function (textContent) {
        // building SVG and adding that to the DOM
        var svg = buildSVG(viewport, textContent);
        document.getElementById('pageContainer').appendChild(svg);
      });
    });
  });
}
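
For the page-range use case, a minimal sketch could loop over the pages and call getTextContent on each one. Nothing here calls render, so no page is drawn. getTextForPageRange is a hypothetical helper name, and the sketch assumes pdfDocument is the object that getDocument resolves to, with numPages and getPage as in the snippet above.

```javascript
// Collect the text of pages first..last (1-based, inclusive) without rendering.
async function getTextForPageRange(pdfDocument, first, last) {
  const lastPage = Math.min(last, pdfDocument.numPages);
  const pages = [];
  for (let n = first; n <= lastPage; n++) {
    const page = await pdfDocument.getPage(n);
    const textContent = await page.getTextContent();
    pages.push({
      pageNumber: n,
      // Join the text runs; layout information stays in textContent.items.
      text: textContent.items.map((item) => item.str).join(' '),
    });
  }
  return pages;
}

// Possible usage with the snippet above (not executed here):
// pdfjsLib.getDocument({url: PDF_PATH}).then(function (pdfDocument) {
//   return getTextForPageRange(pdfDocument, 2, 5);
// });
```

Pages are fetched one after another to keep the sketch simple. Fetching them in parallel with Promise.all would also be an option.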

@cg2p

cg2p commented Jan 16, 2019

> Even I need this info. I need to know the exact location of text on the pdf doc

Me too. Have you figured this out?

I assume the elements in this example, transform:[8,0,0,8,30,684.5359], specify the character position on the page.

What does each of the array elements mean?
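
The six numbers follow the PDF convention for an affine matrix [a, b, c, d, e, f], where a point (x, y) maps to (a·x + c·y + e, b·x + d·y + f). Read that way, e and f give the position of the text origin in page coordinates, while a, b, c and d carry scale and rotation. The sketch below splits a transform into those parts. describeTransform is a hypothetical helper name, and the sketch assumes the matrix contains no skew.

```javascript
// Split a PDF-style matrix [a, b, c, d, e, f] into readable parts.
function describeTransform([a, b, c, d, e, f]) {
  return {
    x: e, // horizontal position of the text origin
    y: f, // vertical position of the text origin (baseline)
    scaleX: Math.hypot(a, b), // horizontal scale
    scaleY: Math.hypot(c, d), // vertical scale, i.e. the font size in this example
    rotationDegrees: (Math.atan2(b, a) * 180) / Math.PI,
  };
}

const parts = describeTransform([8, 0, 0, 8, 30, 684.5359]);
console.log(parts);
```

For the matrix in question this gives a text origin at (30, 684.5359), a scale of 8 in both directions, and no rotation. The position is in PDF page coordinates, which usually have their origin at the bottom-left corner of the page.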

@timvandermeij
Contributor

Closing since the height and width calculation was wrong. This has been fixed in #10508.
