Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update pdf2json to version 2 #116

Merged
merged 6 commits into from
Apr 23, 2022
Merged

Update pdf2json to version 2 #116

merged 6 commits into from
Apr 23, 2022

Conversation

copmerbenjamin
Copy link
Contributor

@copmerbenjamin copmerbenjamin commented Apr 19, 2022

As discussed in #114, #103, #101, MozBlobBuilder is not defined (only newer node versions tested).

The issue is resolved in pdf2json in version 2.

pdf2json v2 changed the password part a bit, this as been adjusted
v2 also removed the property oc for the pdf items, and added the prop clr. This might trigger a new major version of npm-pdfreader.
A new test snapshot have therefore been created to reflect those changes.
"Warning: Setting up fake worker." have also been removed from pdf2json.

Old test before generating new snapshots: "C:\Program Files\nodejs\node.exe "C:\Program Files\nodejs\node_modules\npm\bin\npm-cli.js" test --scripts-prepend-node-path=auto
> pdfreader@0.0.0-development test
> ava


  × parse raw items from pdf file Did not match snapshot
  √ parse structured content from pdf file, using rules
  √ parse Table from PDF file, using TableParser (168ms)
  √ support pdf file with password (168ms)
  × sample scripts should print raw items from pdf file Did not match snapshot
  ─

  parse raw items from pdf file

  test\test.js:20

   19:   });
   20:   t.snapshot(await res);
   21: });

  Did not match snapshot

  Difference:

    [
      {
        file: {
          path: './test/sample.pdf',
        },
      },
      {
        height: 52.618,
        page: 1,
        width: 37.205,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: 0.32553125,
        text: 'Hello "world"',
        w: 6.138,
        x: 4.555,
        y: 5.154,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: 'Value:',
        w: 2.666,
        x: 4.555,
        y: 7.174,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '4',
        w: 0.5,
        x: 4.555,
        y: 8.761,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: 0.32553125,
        text: 'c1',
        w: 0.944,
        x: 5.095,
        y: 10.501,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: 0.32553125,
        text: 'c2',
        w: 0.944,
        x: 7.262,
        y: 10.501,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: 0.32553125,
        text: 'c3',
        w: 0.944,
        x: 10.131,
        y: 10.501,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '1',
        w: 0.5,
        x: 5.288,
        y: 11.447,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '2.3',
        w: 1.25,
        x: 10.477,
        y: 11.447,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: 'hello',
        w: 2,
        x: 6.937,
        y: 12.363,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: 'world',
        w: 2.333,
        x: 9.684,
        y: 12.363,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: 'Values:',
        w: 3.055,
        x: 4.555,
        y: 13.248,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '1',
        w: 0.5,
        x: 4.555,
        y: 14.835,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '2',
        w: 0.5,
        x: 4.555,
        y: 16.423,
      },
      {
        A: 'left',
        R: [
          Object { … },
        ],
  -     oc: undefined,
  +     clr: 0,
        sw: NaN,
        text: '3',
        w: 0.5,
        x: 4.555,
        y: 18.01,
      },
    ]

  › test/test.js:20:5



  sample scripts should print raw items from pdf file

  test\test.js:106

   105:   });
   106:   t.snapshot({ stdout, stderr });
   107: });

  Did not match snapshot

  Difference:

    {
      stderr: `printing raw items from file: test/sample.pdf ...␊
      done.␊
      printing raw items from file: test/sample.pdf ...␊
      done.`,
      stdout: `␊
      > pdfreader@0.0.0-development test:samples␊
      > node parse.js test/sample.pdf && node parseAsBuffer.js test/sample.pdf␊
      ␊
      file = test/sample.pdf␊
      page = 1␊
      4.555     5.154           left    6       Hello "world"␊
      4.555     7.174           left    2       Value:␊
      4.555     8.761           left    0       4␊
      5.095     10.501          left    0       c1␊
      7.262     10.501          left    0       c2␊
      10.131    10.501          left    0       c3␊
      5.288     11.447          left    0       1␊
      10.477    11.447          left    1       2.3␊
      6.937     12.363          left    2       hello␊
      9.684     12.363          left    2       world␊
      4.555     13.248          left    3       Values:␊
      4.555     14.835          left    0       1␊
      4.555     16.423          left    0       2␊
      4.555     18.01           left    0       3␊
      file = undefined␊
  -   Warning: Setting up fake worker.␊
      page = 1␊
      4.555     5.154           left    6       Hello "world"␊
      4.555     7.174           left    2       Value:␊
      4.555     8.761           left    0       4␊
      5.095     10.501          left    0       c1␊
      7.262     10.501          left    0       c2␊
      10.131    10.501          left    0       c3␊
      5.288     11.447          left    0       1␊
      10.477    11.447          left    1       2.3␊
      6.937     12.363          left    2       hello␊
      9.684     12.363          left    2       world␊
      4.555     13.248          left    3       Values:␊
      4.555     14.835          left    0       1␊
      4.555     16.423          left    0       2␊
      4.555     18.01           left    0       3`,
    }

  › test/test.js:106:5

  ─

  2 tests failed

Process finished with exit code 1

@copmerbenjamin
Copy link
Contributor Author

copmerbenjamin commented Apr 20, 2022

Pdfparser v2 uses private class fields, so Node 8 & 10 is expected to fail.

Node 12 surprises me a bit, looks like private class fields is supported but it fails on the line linked above. But still, pdfparser explicitly states it needs node v14.

Is it time to drop node 8, 10, 12?

@adrienjoly
Copy link
Owner

adrienjoly commented Apr 21, 2022

Is it time to drop node 8, 10, 12?

merging this PR implies breaking changes (and therefore a major version increase), so yes, I think it’s a good occasion to do it!

Can you make that clear by updating the following files in this PR:

  • the yml files of GitHub actions (node’s version matrix)
  • .nvmrc (if it links to node < 14)
  • and any mention to old nodejs versions you may find in other files?

MozBlobBuilder is not defined in newer versions of node, pdf2json 2 resolves this but drops node versions below 14 (#101, #103, #114)
@copmerbenjamin
Copy link
Contributor Author

Old node branches has been deleted, 12.x removed from test
.nvmrc was already set to v16
I didn't find any version references, but I added engine requirement in package.json
Latest commit should trigger a major version with a description

by increasing ecmaVersion, according to node.js version upgrade
@adrienjoly adrienjoly self-requested a review April 23, 2022 08:26
Copy link
Owner

@adrienjoly adrienjoly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@adrienjoly adrienjoly merged commit 740cccb into adrienjoly:master Apr 23, 2022
@github-actions
Copy link

🎉 This PR is included in version 2.0.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@adrienjoly
Copy link
Owner

PdfReader 2.0 is out: pdfreader - npm! 🎉

Thank you for your contribution, @copmerbenjamin! 🙌

@adrienjoly adrienjoly mentioned this pull request Apr 23, 2022
@benjaco
Copy link

benjaco commented Apr 23, 2022

Thank you @adrienjoly 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants