CJK characters in clipboard after copy of latin text #18099

Closed
myfonj opened this issue May 15, 2024 · 11 comments · Fixed by #18390

Comments

@myfonj

myfonj commented May 15, 2024

Link to PDF file:

https://web.archive.org/web/20240515102919/https://www.oahovorcovicka.cz/files/soubory/WEB_2023/Vsledky_CR_2024.pdf

Configuration:

  • Web browser and its version: Firefox Nightly 128 / Developer Edition 127 / Stable 125 (or viewer)
  • Operating system and its version: Windows 10
  • PDF.js version: 4.3.8 (probably, according to line 8457 of resource://pdf.js/web/viewer.mjs and the console)
  • Is a browser extension: Yes. (Same in viewer.)

Steps to reproduce the problem:

  1. Open PDF from Web Archive (Warning: nearly 10MB payload.)
  2. Select first line
  3. Copy
  4. Paste

What is the expected behaviour? (add screenshot)

Clipboard should read Yýsledková listina přijímacích zkoušek, as it does when the same steps are performed in SumatraPDF or Acrobat Reader:

SumatraPDF with a selection rectangle around the first line, and Notepad++ above it with a single line of text in the Latin alphabet

(This is an almost correct OCR of the scan, consisting of Latin characters.)

What went wrong? (add screenshot)

Clipboard reads 夀ý猀氀攀搀欀漀瘀á 氀椀猀琀椀渀愀 瀀ř椀樀í洀愀挀í挀栀稀欀漀甀š攀欀:

Firefox window showing the web-archived PDF with the first line selected, and a Notepad++ window with the pasted text consisting of CJK characters.

(This is a weird sequence of CJK characters with a few Latin glyphs, all of them with diacritics.)


I see this PDF is really sloppy and there are many OCR errors throughout the document, but I guess that is not relevant.


@alexcat3
Contributor

alexcat3 commented Jul 1, 2024

This bug is reproducible on both Mozilla Firefox and Microsoft Edge on Windows 11 with the latest code. All text in the document, including numbers, is copied as Chinese. Meanwhile Microsoft Edge's built-in PDF viewer copies the text correctly.

@alexcat3
Contributor

alexcat3 commented Jul 1, 2024

Actually my previous statement is incorrect: letters with accent marks are copied correctly in pdf.js. Other characters are replaced with CJK.

@alexcat3
Contributor

alexcat3 commented Jul 2, 2024

Looking at the garbled text in a hex editor, it appears that the ASCII characters were converted to UTF-16 with the wrong endianness.
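
To illustrate the observation above, here is a minimal sketch (not pdf.js code) that reproduces the garbling by moving each ASCII byte into the high byte of a UTF-16 code unit, and reverses it. The helper names `garbleAscii` and `ungarbleAscii` are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper: simulate the bug by placing each ASCII byte
// in the HIGH byte of a 16-bit code unit (0x59 "Y" becomes U+5900).
function garbleAscii(text) {
  return Array.from(text, ch => {
    const code = ch.charCodeAt(0);
    return code < 0x80 ? String.fromCharCode(code << 8) : ch;
  }).join("");
}

// Hypothetical helper: undo the garbling for code units whose low byte
// is zero and whose high byte is in the ASCII range.
function ungarbleAscii(text) {
  return Array.from(text, ch => {
    const code = ch.charCodeAt(0);
    const looksSwapped = (code & 0xff) === 0 && code >= 0x100 && code <= 0x7f00;
    return looksSwapped ? String.fromCharCode(code >> 8) : ch;
  }).join("");
}

const garbled = garbleAscii("Yýsledková");
console.log(garbled);                // CJK characters, with "ý" and "á" untouched
console.log(ungarbleAscii(garbled)); // "Yýsledková"
```

Characters at or above 0x80, such as "ý" and "á", pass through unchanged, which matches the report that only letters with diacritics were copied correctly.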

@alexcat3
Contributor

alexcat3 commented Jul 3, 2024

I have managed to create a minimal (3 KB) example file that exhibits the behavior by adding the CMap from the user's file to the sample "hello world" PDF file.
helloworld.pdf

@alexcat3
Contributor

alexcat3 commented Jul 3, 2024

It turns out the trouble boils down to one line in the ToUnicode CMap of the PDF's font, which appears intended to map all character codes in the range from 00 to 7F (all the ASCII characters) to the corresponding Unicode characters:
<00> <7F> <00>
If you change this line to
<00> <7F> <0000>
thus specifying the starting Unicode value with two bytes instead of one, the problem goes away.
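
The difference between the two lines can be made concrete with a simplified sketch of a bfrange expansion. This is not the pdf.js implementation: `hexToByteString` and `expandBfRange` are hypothetical names, the destination is stored with one string character per byte, and carry into the next-to-last byte is deliberately left out.

```javascript
// Hypothetical helper: turn a hex token such as "0041" into a string
// holding one character per byte ("\x00\x41").
function hexToByteString(hex) {
  let out = "";
  for (let i = 0; i < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.substring(i, i + 2), 16));
  }
  return out;
}

// Simplified range expansion: map each source code in [low, high] to a
// destination byte string, incrementing only the last byte each step.
// Assumption: the last byte never overflows past 0xff in this sketch.
function expandBfRange(lowHex, highHex, dstHex) {
  const low = parseInt(lowHex, 16);
  const high = parseInt(highHex, 16);
  let dst = hexToByteString(dstHex);
  const map = new Map();
  for (let code = low; code <= high; code++) {
    map.set(code, dst);
    const last = dst.length - 1;
    dst = dst.substring(0, last) + String.fromCharCode(dst.charCodeAt(last) + 1);
  }
  return map;
}

const unpadded = expandBfRange("00", "7F", "00");
const padded = expandBfRange("00", "7F", "0000");
console.log(unpadded.get(0x59).length); // 1 (a single byte, 0x59)
console.log(padded.get(0x59).length);   // 2 (bytes 0x00 0x59)
```

With `<00>` as the destination, every entry is a one-byte string, so a consumer that expects two bytes per UTF-16 code unit has nothing to pair that byte with.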

@alexcat3
Contributor

alexcat3 commented Jul 3, 2024

I'm confused by the code that handles bfrange entries in CMaps. It seems to treat JavaScript strings as arrays of bytes, but I thought that, since JavaScript uses UTF-16, they would be arrays of 16-bit words.

mapBfRange(low, high, dstLow) {
  if (high - low > MAX_MAP_RANGE) {
    throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
  }
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    this._map[low++] = dstLow;
    // Only the last byte has to be incremented (in the normal case).
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}

@alexcat3
Contributor

alexcat3 commented Jul 3, 2024

It appears that the above code is actually correct: it uses the 16-bit characters of a JS string to store the 8-bit bytes of the destination char code. The problem seems to be in the readToUnicode function in evaluator.js, which uses cmap.js to parse the ToUnicode CMap as a regular CMap and then converts the result into a ToUnicode map. It assumes that merging each pair of adjacent characters in the parsed strings will produce a valid UTF-16 string. However, if the PDF file omits the leading zeros of a UTF-16 value, the CMap ends up with a string that has an odd number of characters, where the first character is a UTF-16 low byte with no high byte to pair it with.
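
The pairing problem described above can be sketched as follows. This is an illustration under stated assumptions, not the code in evaluator.js and not necessarily the fix made in #18390: `decodeUtf16BE` is a hypothetical function, and left-padding odd-length strings with a zero byte is just one possible way to handle them.

```javascript
// Hypothetical decoder: read a byte string (one character per byte) as
// big-endian UTF-16 by merging each pair of adjacent bytes.
function decodeUtf16BE(bytes, padOddLength) {
  if (padOddLength && bytes.length % 2 !== 0) {
    // Assumed workaround: restore the omitted leading zero byte.
    bytes = "\x00" + bytes;
  }
  const units = [];
  for (let k = 0; k < bytes.length; k += 2) {
    // charCodeAt() past the end of the string returns NaN, and a bitwise
    // OR with NaN treats it as 0, so a lone byte lands in the high byte.
    units.push((bytes.charCodeAt(k) << 8) | bytes.charCodeAt(k + 1));
  }
  return String.fromCharCode(...units);
}

console.log(decodeUtf16BE("\x59", false));     // "夀" (U+5900)
console.log(decodeUtf16BE("\x59", true));      // "Y"
console.log(decodeUtf16BE("\x00\xfd", false)); // "ý"
```

The last call shows why letters with diacritics survived, assuming their mappings were written with two bytes: they already form a complete pair.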

@alexcat3
Contributor

alexcat3 commented Jul 3, 2024

It is unclear whether omitting leading zeros on hex-encoded UTF-16 in the ToUnicode CMap is allowed by the PDF spec. However, seeing that at least one PDF in the wild does it and other PDF readers can read it, pdf.js should probably handle it. I will try to make a pull request with a fix. This will be my first ever pull request to an open source project.

@myfonj
Author

myfonj commented Jul 4, 2024

Nice, good luck!
I have zero experience with PDF internals, but in the unlikely case it hasn't occurred to you: there are most probably some hints in the SumatraPDF codebase as to whether they do some "magical fixups" of badly encoded PDFs, and possibly how.

@alexcat3
Contributor

alexcat3 commented Jul 4, 2024

Thank you! I submitted my pull request.
#18390
