Unable to get a response while parsing a PDF file that contains image. #101

Closed
Raju03-git opened this issue Dec 22, 2021 · 5 comments


@Raju03-git

Raju03-git commented Dec 22, 2021

My backend Node service fails whenever it processes a PDF file that contains an image. I've tried some workarounds, but the service fails every time. Please help me resolve this issue.

pdfreader version - 1.2.14
Node Version - 16.13.0

Used File - pdfReaderIssue.pdf

Note: I tried with lower versions of Node.js. It works up to version 16.6.2.

@Raju03-git Raju03-git changed the title from "Fails when uploading file that contains image within a PDF" to "Unable to get a response while parsing a PDF file that contains image." Dec 22, 2021
@adrienjoly
Owner

In order for us to reproduce and investigate, can you share your PDF file and JS code, please?

@Raju03-git
Author

Raju03-git commented Dec 23, 2021

Please find the JS code & Sample File.

PDF File -
pdfReaderIssue.pdf

JS Code -

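    const pdfreader = require("pdfreader");

    // Concatenates every text item of sample.pdf into a single string.
    function readPdfText() {
      let buffer = "";
      return new Promise((resolve, reject) => {
        new pdfreader.PdfReader().parseFileItems("sample.pdf", (err, item) => {
          if (err) {
            reject(err);
          } else if (!item) {
            // A missing item signals the end of the file.
            resolve(buffer);
          } else if (item.text) {
            buffer += item.text;
          }
        });
      });
    }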

@adrienjoly
Owner

Thank you!

I'll try to find some time to test this. In the meantime, anybody can feel free to give it a go!

@adrienjoly
Owner

adrienjoly commented Dec 26, 2021

To investigate, I've tried parsing that file with the parse.js script provided in this repo, after enabling the debug mode:

diff --git a/parse.js b/parse.js
index 02100cb..ab0c9e4 100644
--- a/parse.js
+++ b/parse.js
@@ -2,7 +2,7 @@ var LOG = require("./lib/LOG.js").toggle(false);
 var PdfReader = require("./index.js").PdfReader;
 
 function printRawItems(filename, callback) {
-  new PdfReader().parseFileItems(filename, function (err, item) {
+  new PdfReader({ debug: true }).parseFileItems(filename, function (err, item) {
     if (err) callback(err);
     else if (!item) callback();
     else if (item.file) console.log("file =", item.file.path);

=> Here's what I'm getting:

$ node parse.js '/Users/adrienjoly/Downloads/pdfReaderIssue.pdf'

printing raw items from file: /Users/adrienjoly/Downloads/pdfReaderIssue.pdf ...
file = /Users/adrienjoly/Downloads/pdfReaderIssue.pdf
Warning: Setting up fake worker.
Warning: Unhandled rejection: ReferenceError: MozBlobBuilder is not defined
ReferenceError: MozBlobBuilder is not defined
    at Object.createBlob (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:1104:12)
    at Object.createObjectURL (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:1112:24)
    at JpegStream_getIR [as getIR] (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:25575:18)
    at PartialEvaluator_buildPaintImageXObject [as buildPaintImageXObject] (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:7120:64)
    at PartialEvaluator_getOperatorList [as getOperatorList] (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:7464:24)
    at Object.eval [as onResolve] (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4348:26)
    at Object.runHandlers (eval at <anonymous> (/Users/adrienjoly/dev/adrienjoly/npm-pdfreader/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:864:35)
    at listOnTimeout (node:internal/timers:557:17)
    at processTimers (node:internal/timers:500:7)

This error is referenced on pdf2json's repository, and seems to have been fixed in its latest version, cf modesty/pdf2json#247.
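
Since the log reports the error as an unhandled rejection inside pdf2json, it looks like the parsing callback may never be called, which would explain why the service gets no response. Until the dependency is upgraded, one possible guard is to race the parsing promise against a timer. The sketch below is only an illustration: `withTimeout` is a hypothetical helper, not part of pdfreader or pdf2json, and the 10-second limit in the usage comment is an arbitrary assumption.

```javascript
// Hypothetical helper (not part of pdfreader): rejects if `promise` has not
// settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "operation") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever promise settles first wins. The timer is always cleared so
  // that it does not keep the process alive after parsing has finished.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch, where `parsing` stands for the promise returned by the
// snippet posted earlier in this thread:
// withTimeout(parsing, 10000, "PDF parsing")
//   .then((text) => console.log(text))
//   .catch((err) => console.error(err.message));
```

This does not fix the underlying error; it only turns a request that would hang into an explicit failure that the caller can handle.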

@adrienjoly
Owner

adrienjoly commented Dec 26, 2021

After upgrading pdf2json to version 2.0.0 in package.json, here's what I'm getting:

$ node parse.js '/Users/adrienjoly/Downloads/pdfReaderIssue.pdf'

printing raw items from file: /Users/adrienjoly/Downloads/pdfReaderIssue.pdf ...
file = /Users/adrienjoly/Downloads/pdfReaderIssue.pdf
Warning: Setting up fake worker.
Warning: TT: undefined function: 32
(node:47195) ExperimentalWarning: buffer.Blob is an experimental feature. This feature could change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
page = 1
4.252   14.793          left    29      Sample
6.254   14.793          left    23      check
done.

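=> If you want to upgrade that dependency in your own project, add the following lines to your package.json, then run npm install again. Note that the "resolutions" field is the one read by Yarn; with npm 8.3 or later, the equivalent field is "overrides".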

  "resolutions": {
    "pdf2json": "2.0.0"
  }

Cf #97 (comment)

adrienjoly pushed a commit that referenced this issue Apr 23, 2022
By @copmerbenjamin:

* Update pdf2json to version 2
MozBlobBuilder is not defined in newer versions of node, pdf2json 2 resolves this but drops node versions below 14 (#101, #103, #114)

* BREAKING CHANGE: drop support for Node.js < 14

* Increase ecmaVersion, according to node.js version upgrade
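
The constraints stated in this commit message can be turned into a small pre-flight check. The sketch below is an illustration only: `majorOf` and `checkEnvironment` are hypothetical names, and the thresholds (pdf2json 2 or later, Node.js 14 or later) are simply the ones quoted above.

```javascript
// Hypothetical helper: accepts "v16.13.0" or "2.0.0" and returns the major
// version number.
function majorOf(version) {
  const match = /^v?(\d+)\./.exec(String(version));
  if (!match) throw new Error(`Unrecognized version: ${version}`);
  return Number(match[1]);
}

// Hypothetical pre-flight check: returns a list of human-readable problems,
// or an empty list when both versions satisfy the constraints above.
function checkEnvironment(nodeVersion, pdf2jsonVersion) {
  const problems = [];
  if (majorOf(nodeVersion) < 14) {
    problems.push("Node.js 14 or later is required by pdf2json 2");
  }
  if (majorOf(pdf2jsonVersion) < 2) {
    problems.push("pdf2json should be upgraded to version 2 or later");
  }
  return problems;
}

// Example with the Node.js version reported in this issue. The pdf2json
// version "1.2.5" is a made-up placeholder: the thread does not say which
// 1.x release was installed.
console.log(checkEnvironment("v16.13.0", "1.2.5"));
```

In a real service, `process.version` would supply the first argument; how the installed pdf2json version is read is left open here.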