CJK characters in clipboard after copy of latin text #18099

myfonj · 2024-05-15T15:59:29Z

Link to PDF file:

https://web.archive.org/web/20240515102919/https://www.oahovorcovicka.cz/files/soubory/WEB_2023/Vsledky_CR_2024.pdf

Configuration:

Web browser and its version: Firefox Nightly 128 / Developer Edition 127 / Stable 125 (or viewer)
Operating system and its version: Windows 10
PDF.js version: 4.3.8 (probably, according to resource://pdf.js/web/viewer.mjs line 8457 and the console)
Is a browser extension: Yes. (Same in viewer.)

Steps to reproduce the problem:

Open PDF from Web Archive (Warning: nearly 10MB payload.)
Select first line
Copy
Paste

What is the expected behaviour? (add screenshot)

Clipboard should read Yýsledková listina přijímacích zkoušek, as it does when the same steps are performed in SumatraPDF or Acrobat Reader:

(This is an almost correct OCR of the scan, consisting of Latin characters.)

What went wrong? (add screenshot)

Clipboard reads 夀ý猀氀攀搀欀漀瘀á 氀椀猀琀椀渀愀瀀ř椀樀í洀愀挀í挀栀稀欀漀甀š攀欀:

(This is a weird sequence of CJK characters with a few Latin glyphs, all of them with diacritics.)

I see this PDF is really sloppy and there are many OCR errors throughout the document, but I guess that is not relevant.

alexcat3 · 2024-07-01T22:32:52Z

This bug is reproducible on both Mozilla Firefox and Microsoft Edge on Windows 11 with the latest code. All text in the document, including numbers, is copied as Chinese. Meanwhile Microsoft Edge's built-in PDF viewer copies the text correctly.

alexcat3 · 2024-07-01T22:43:50Z

Actually my previous statement is incorrect: letters with accent marks are copied correctly in pdf.js. Other characters are replaced with CJK.

alexcat3 · 2024-07-02T21:29:07Z

Looking at the garbled text in a hex editor, it appears that the problem is that ASCII characters were converted to UTF-16 with the wrong endianness.
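
The byte-swap reading can be checked with a few lines of JavaScript. This is only a sketch: `recoverAscii` is a hypothetical helper, not part of pdf.js, and the escapes below are what the first word of the garbled clipboard text appears to consist of (U+5900, U+7300 and so on, with the accented letters left intact).

```javascript
// Hypothetical helper: undo the suspected byte swap, for ASCII only.
function recoverAscii(garbled) {
  let out = "";
  for (let i = 0; i < garbled.length; i++) {
    const unit = garbled.charCodeAt(i);
    const high = unit >> 8;
    const low = unit & 0xff;
    // A code unit 0xNN00 with NN in the ASCII range looks like an ASCII
    // byte that ended up in the high half of a UTF-16 code unit.
    if (low === 0 && high >= 0x01 && high <= 0x7f) {
      out += String.fromCharCode(high);
    } else {
      out += garbled[i];
    }
  }
  return out;
}

// Assumed code units of the first garbled word; accented letters are kept.
const garbledWord =
  "\u5900\u00fd\u7300\u6c00\u6500\u6400\u6b00\u6f00\u7600\u00e1";
console.log(recoverAscii(garbledWord)); // "Yýsledková"
```

Moving the high byte back into the low position restores the expected text, which is consistent with the endianness explanation.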

alexcat3 · 2024-07-03T04:29:16Z

I have managed to create a minimal (3 KB) example file that exhibits the behavior by adding the CMap from the file provided by the user to the sample "hello world" PDF file.
helloworld.pdf

alexcat3 · 2024-07-03T15:12:23Z

It turns out the trouble boils down to one line in the ToUnicode CMap of the PDF's font, which appears intended to map all characters in the range from 00 to 7F (all the ASCII characters) to the corresponding Unicode characters:
<00> <7F> <00>
If you change this line to
<00> <7F> <0000>
thus specifying the starting Unicode value with 2 bytes instead of one, the problem goes away.
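
To look for the same pattern in other files, something like the following could be used. It is a rough sketch under stated assumptions: `findShortBfRangeDestinations` is a hypothetical helper, the input is assumed to contain only the body of a bfrange section, and only the simple `<low> <high> <dst>` form is handled (not the array form of destinations).

```javascript
// Hypothetical helper: list simple bfrange lines of the form
// <low> <high> <dst> whose <dst> is not a whole number of 16-bit units.
function findShortBfRangeDestinations(cmapText) {
  const simpleRange =
    /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
  const suspicious = [];
  for (const [, low, high, dst] of cmapText.matchAll(simpleRange)) {
    // One UTF-16 code unit is 2 bytes, i.e. 4 hex digits.
    if (dst.length % 4 !== 0) {
      suspicious.push({ low, high, dst });
    }
  }
  return suspicious;
}

// The first line is the one from this issue; the second is a made-up,
// well-formed line for comparison.
const sample = "<00> <7F> <00>\n<80> <FF> <0080>";
console.log(findShortBfRangeDestinations(sample));
```

Only the first line is reported, because its destination has 2 hex digits rather than a multiple of 4.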

alexcat3 · 2024-07-03T16:33:26Z

I'm confused by the code that handles Bf ranges in CMaps. It seems to treat JavaScript strings as arrays of bytes, but I thought that, since JavaScript uses UTF-16, they would be arrays of 16-bit words.

mapBfRange(low, high, dstLow) {
  if (high - low > MAX_MAP_RANGE) {
    throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
  }
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    this._map[low++] = dstLow;
    // Only the last byte has to be incremented (in the normal case).
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}
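
To see what this method stores for the two spellings of the range, it can be run outside the class. The version below is a standalone port for experimenting, not the pdf.js source: the `map` parameter stands in for `this._map`, and the value of `MAX_MAP_RANGE` is an assumption that does not matter for a range this small.

```javascript
// Assumed limit; any value larger than 0x7f works for this experiment.
const MAX_MAP_RANGE = 2 ** 24 - 1;

// Standalone port of the quoted method; `map` replaces `this._map`.
function mapBfRange(map, low, high, dstLow) {
  if (high - low > MAX_MAP_RANGE) {
    throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
  }
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    map[low++] = dstLow;
    // Each JS character holds one byte of the destination value.
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}

const oneByte = [];
mapBfRange(oneByte, 0x00, 0x7f, "\x00"); // <00> <7F> <00>
const twoBytes = [];
mapBfRange(twoBytes, 0x00, 0x7f, "\x00\x00"); // <00> <7F> <0000>

console.log(oneByte[0x59].length); // 1
console.log(twoBytes[0x59].length); // 2
```

With the one-byte start, the entry for 0x59 is the single byte-character 0x59. With the two-byte start it is the pair 0x00, 0x59.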

alexcat3 · 2024-07-03T22:50:21Z

It appears that the above code is actually correct: it uses the 16-bit characters of a JS string to store the 8-bit bytes of the destination char code. The problem seems to be in the readToUnicode function in evaluator.js, which uses cmap.js to parse the ToUnicode CMap as a regular CMap and then converts the result into a ToUnicode map. It assumes that the strings in the parsed CMap become valid UTF-16 when each pair of adjacent characters is merged. However, if the PDF file omits the leading zeros of a UTF-16 value, the CMap ends up with a string with an odd number of characters, whose first character is a UTF-16 low byte with no high byte to pair it with.
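
The pairing step and one possible way to make it tolerant can be sketched as follows. This is a simplified illustration, not the actual readToUnicode code and not necessarily the change that was eventually merged; `pairBytesNaive` and `pairBytesPadded` are hypothetical names.

```javascript
// Merge each pair of byte-characters into one UTF-16 code unit
// (big-endian), assuming the token has an even length.
function pairBytesNaive(token) {
  let out = "";
  for (let k = 0; k < token.length; k += 2) {
    // charCodeAt past the end of the string returns NaN, which the
    // bitwise OR treats as 0, so a lone byte 0x59 becomes 0x5900.
    out += String.fromCharCode(
      (token.charCodeAt(k) << 8) | token.charCodeAt(k + 1)
    );
  }
  return out;
}

// Same pairing, but restore an omitted leading zero byte first.
function pairBytesPadded(token) {
  if (token.length % 2 !== 0) {
    token = "\x00" + token;
  }
  return pairBytesNaive(token);
}

console.log(pairBytesNaive("\x59") === "\u5900"); // true
console.log(pairBytesPadded("\x59")); // "Y"
console.log(pairBytesPadded("\x00\x59")); // "Y"
```

Padding on the left treats the lone byte as the low byte of the code unit, so even-length tokens are unaffected.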

alexcat3 · 2024-07-03T22:58:32Z

It is unclear if omitting leading zeros on hex-encoded UTF-16 in the ToUnicode cmap is allowed by the PDF spec. However, seeing that there is at least one PDF in the wild that does it and other PDF readers can read it, pdf.js should probably fix it. I will try and make a pull request with a fix. This will be my first ever pull request to an open source project.

myfonj · 2024-07-04T00:13:56Z

Nice, good luck!
I have zero experience with PDF internals, but in the unlikely case it hasn't occurred to you: there may well be some hints in the SumatraPDF codebase as to whether they do some "magical fixups" of badly encoded PDFs, and possibly how.

alexcat3 · 2024-07-04T20:22:51Z

Thank you! I submitted my pull request.
#18390

Snuffleupagus added font-conversion text-selection labels May 15, 2024

alexcat3 mentioned this issue Jul 4, 2024

Handle toUnicode cMaps that omit leading zeros in hex encoded UTF-16 (issue 18099) #18390

Merged

timvandermeij linked a pull request Jul 4, 2024 that will close this issue

Snuffleupagus closed this as completed in #18390 Jul 6, 2024
