Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099) #18390

alexcat3 · 2024-07-04T20:14:39Z

Modifies partialEvaluator.readToUnicode() to handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16, as seen in the PDF in [issue 18099](https://github.com//issues/18099).

In the PDF in question, the toUnicode cMap had a line mapping the glyph char codes 00 through 7F to the corresponding Unicode code points. The syntax for mapping a range of char codes to a range of Unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
Since the Unicode code points are supposed to be given in UTF-16, the PDF's line should probably have read
<00> <7F> <0000>
Instead, it omitted two leading zeros from the UTF-16 value, like this:
<00> <7F> <00>
This confused pdf.js into mapping these char codes to the UTF-16 characters with the corresponding high bytes (01 becomes \u0100, 02 becomes \u0200, and so on), which ended up turning Latin text in the PDF into Chinese characters when it was copied.
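To make the failure mode concrete, here is a minimal standalone sketch of the behaviour described above. It is not the pdf.js implementation: the helper names (`hexToBytes`, `decodeUtf16Naive`, `decodeUtf16Lenient`, `expandRange`) are hypothetical and exist only for this illustration. It assumes that a missing low byte is read as zero, and that a lenient reader restores the omitted leading zero byte when the destination has an odd number of bytes.

```javascript
// Illustration only: hypothetical helpers, not pdf.js APIs.

// Parse a hex string with an even number of digits (e.g. "0041") into bytes.
function hexToBytes(hex) {
  const bytes = [];
  for (let i = 0; i < hex.length; i += 2) {
    bytes.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return bytes;
}

// Naive UTF-16BE decoding: bytes are read in (high, low) pairs, and a
// missing low byte is treated as 0. A lone byte 0x41 becomes U+4100.
function decodeUtf16Naive(bytes) {
  let str = "";
  for (let i = 0; i < bytes.length; i += 2) {
    const high = bytes[i];
    const low = i + 1 < bytes.length ? bytes[i + 1] : 0;
    str += String.fromCharCode((high << 8) | low);
  }
  return str;
}

// Lenient decoding (assumed strategy): if the byte count is odd, put the
// omitted leading zero byte back before decoding.
function decodeUtf16Lenient(bytes) {
  const padded = bytes.length % 2 === 0 ? bytes : [0, ...bytes];
  return decodeUtf16Naive(padded);
}

// Expand one range entry <start> <end> <dst>: each successive char code
// maps to dst with its last byte incremented by the offset into the range.
// The sketch assumes the last byte never overflows past 0xFF.
function expandRange(startHex, endHex, dstHex, decode) {
  const start = parseInt(startHex, 16);
  const end = parseInt(endHex, 16);
  const dstBytes = hexToBytes(dstHex);
  const map = new Map();
  for (let code = start; code <= end; code++) {
    const bytes = dstBytes.slice();
    bytes[bytes.length - 1] += code - start;
    map.set(code, decode(bytes));
  }
  return map;
}

const broken = expandRange("00", "7F", "00", decodeUtf16Naive);
const fixed = expandRange("00", "7F", "00", decodeUtf16Lenient);
console.log(broken.get(0x41) === "\u4100"); // true: "A" turns into a CJK character
console.log(fixed.get(0x41) === "A"); // true: Latin text survives
```

With the well-formed entry `<00> <7F> <0000>`, both decoders in this sketch agree, so the lenient path only changes the result for destinations with an odd byte count.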

I'm not sure if the specification actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, pdf.js should probably support this.

calixteman · 2024-07-04T20:57:54Z

Could you add a unit test? Something similar to:

pdf.js/test/unit/api_spec.js

Lines 3202 to 3216 in 790470c

    it("gets text content, with no extra spaces (issue 13226)", async function () {
      const loadingTask = getDocument(buildGetDocumentParams("issue13226.pdf"));
      const pdfDoc = await loadingTask.promise;
      const pdfPage = await pdfDoc.getPage(1);
      const { items } = await pdfPage.getTextContent({
        disableNormalization: true,
      });
      const text = mergeText(items);

      expect(text).toEqual(
        "Mitarbeiterinnen und Mitarbeiter arbeiten in über 100 Ländern engagiert im Dienste"
      );

      await loadingTask.destroy();
    });

You can just test that the retrieved text starts with Yýsledková listina přijímacích zkoušek.

alexcat3 · 2024-07-04T20:58:33Z

OK, I will.

Snuffleupagus

When addressing the comments, please remember to squash the commits.

test/pdfs/issue18099_reduced.pdf

test/unit/api_spec.js

alexcat3 · 2024-07-05T17:08:15Z

OK, I made the necessary changes and squashed the commits.

timvandermeij · 2024-07-05T20:15:33Z

The CI failed on linting. You can automatically fix that by running npx gulp lint --fix locally. Other than that this patch LGTM, with passing tests which we'll trigger once the linting issue is fixed, but I'll leave the final sign-off here to @Snuffleupagus given the familiarity with the font code. Thanks!

calixteman · 2024-07-06T12:08:24Z

/botio test

moz-tools-bot · 2024-07-06T12:08:26Z

From: Bot.io (Linux m4)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/c8ea0e506839555/output.txt

moz-tools-bot · 2024-07-06T12:08:26Z

From: Bot.io (Windows)

Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/db23d9ada3a63cd/output.txt

moz-tools-bot · 2024-07-06T12:37:25Z

From: Bot.io (Linux m4)

Failed

Full output at http://54.241.84.105:8877/c8ea0e506839555/output.txt

Total script time: 28.99 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 19
  different first/second rendering: 2

Image differences available at: http://54.241.84.105:8877/c8ea0e506839555/reftest-analyzer.html#web=eq.log

Snuffleupagus

Can you please improve the commit message a bit, such that all relevant information is available on the Git command line as well without having to read GitHub?

Generally the first line of the commit message should contain a summary of the changes, with a reference to the bug/issue, and then any other relevant details below. You included a bunch of nice context in #18390 (comment) that belongs in the commit message too. My suggestion would be something like this:

Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099)

In the PDF in question, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding unicode code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF 16, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF 16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding high bytes ( 01 becomes \u0100, 02 becomes \u0200) which ended up turning latin text in the PDF into chinese when it was copied.

I'm not sure if the specification actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this.

moz-tools-bot · 2024-07-06T12:50:54Z

From: Bot.io (Windows)

Failed

Full output at http://54.193.163.58:8877/db23d9ada3a63cd/output.txt

Total script time: 42.46 mins

Unit tests: Passed
Integration Tests: Passed
Regression tests: FAILED

  different ref/snapshot: 3

Image differences available at: http://54.193.163.58:8877/db23d9ada3a63cd/reftest-analyzer.html#web=eq.log

…(issue 18099)

Add unit test to check compatibility with such cmaps

In the PDF in issue 18099, the toUnicode cmap had a line to map the glyph char codes from 00 to 7F to the corresponding code points. The syntax to map a range of char codes to a range of unicode code points is
<start_char_code> <end_char_code> <start_unicode_codepoint>
As the unicode code points are supposed to be given in UTF-16 BE, the PDF's line SHOULD have probably read
<00> <7F> <0000>
Instead it omitted two leading zeros from the UTF-16 like this
<00> <7F> <00>
This confused PDF.js into mapping these character codes to the UTF-16 characters with the corresponding HIGH bytes (01 became \u0100, 02 became \u0200, et cetera), which ended up turning latin text in the PDF into chinese when it was copied.

I'm not sure if the PDF spec actually allows PDFs to do this, but since there's at least one PDF in the wild that does and other PDF readers read it correctly, PDF.js should probably support this.

alexcat3 · 2024-07-06T15:40:41Z

@Snuffleupagus I changed the commit message

Snuffleupagus

r=me, thank you!

alexcat3 · 2024-07-06T17:18:32Z

Thank YOU for all your help!

timvandermeij added the font-conversion label Jul 4, 2024

Snuffleupagus changed the title ~~Fix Issue 18099~~ Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099) Jul 4, 2024

alexcat3 mentioned this pull request Jul 4, 2024

CJK characters in clipboard after copy of latin text #18099

Closed

timvandermeij linked an issue Jul 4, 2024 that may be closed by this pull request

CJK characters in clipboard after copy of latin text #18099

Closed

Snuffleupagus reviewed Jul 4, 2024

alexcat3 force-pushed the fix-issue-18099 branch from fcabc61 to c7a0a6c Compare July 5, 2024 17:07

timvandermeij requested a review from Snuffleupagus July 5, 2024 20:15

alexcat3 force-pushed the fix-issue-18099 branch from c7a0a6c to 1b9671a Compare July 5, 2024 20:28

Snuffleupagus requested changes Jul 6, 2024

alexcat3 force-pushed the fix-issue-18099 branch from 1b9671a to 1c36442 Compare July 6, 2024 15:39

Snuffleupagus approved these changes Jul 6, 2024

Snuffleupagus merged commit 5ee6169 into mozilla:master Jul 6, 2024
9 checks passed

alexcat3 deleted the fix-issue-18099 branch July 6, 2024 17:18

npanchal108 mentioned this pull request Aug 31, 2024

[Snyk] Upgrade pdfjs-dist from 4.0.269 to 4.5.136 npanchal108/ecommercebase#5

Open
