CJK characters in clipboard after copy of latin text #18099
This bug is reproducible in both Mozilla Firefox and Microsoft Edge on Windows 11 with the latest code. All text in the document, including numbers, is copied as Chinese characters. Meanwhile, Microsoft Edge's built-in PDF viewer copies the text correctly.

Actually, my previous statement is incorrect: letters with accent marks are copied correctly in pdf.js; the other characters are replaced with CJK characters.
Looking at the garbled text in a hex editor, it appears that the problem is that ASCII characters were converted to UTF-16 with the wrong endianness.
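To illustrate the observation (a standalone sketch, not pdf.js code): if an ASCII byte ends up in the high byte of a UTF-16 code unit with a zero low byte, Latin text turns into CJK code points, which matches what the clipboard shows.

```javascript
// 'Y' is 0x59. Shifting it into the high byte of a UTF-16 code unit
// yields 0x5900, i.e. U+5900 (夀) — a CJK character, exactly the first
// character of the garbled clipboard text below.
const ch = "Y".charCodeAt(0); // 0x59
const swapped = String.fromCharCode(ch << 8); // "\u5900"
console.log(swapped); // "夀"
```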
I have managed to create a minimal (3 kB) example file that exhibits the behavior by adding the CMap from the file provided by the user to the sample "hello world" PDF file.
It turns out the trouble boils down to one line in the PDF font's ToUnicode CMap, which appears intended to map all characters in the range from 00 to 7F (all the ASCII characters) to the corresponding Unicode characters:
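For illustration only (a hypothetical reconstruction of the shape of such an entry, not the actual bytes of the file): a bfrange that maps the whole ASCII range but writes the destination as a single byte, instead of a zero-padded two-byte UTF-16 value like `<0000>`, would look like this:

```
1 beginbfrange
<00> <7f> <00>
endbfrange
```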
I'm confused by the code that handles bfrange entries in CMaps. It seems to treat JavaScript strings as arrays of bytes, but I thought that, since JavaScript uses UTF-16, they would be arrays of 16-bit code units.

```js
mapBfRange(low, high, dstLow) {
  if (high - low > MAX_MAP_RANGE) {
    throw new Error("mapBfRange - ignoring data above MAX_MAP_RANGE.");
  }
  const lastByte = dstLow.length - 1;
  while (low <= high) {
    this._map[low++] = dstLow;
    // Only the last byte has to be incremented (in the normal case).
    const nextCharCode = dstLow.charCodeAt(lastByte) + 1;
    if (nextCharCode > 0xff) {
      dstLow =
        dstLow.substring(0, lastByte - 1) +
        String.fromCharCode(dstLow.charCodeAt(lastByte - 1) + 1) +
        "\x00";
      continue;
    }
    dstLow =
      dstLow.substring(0, lastByte) + String.fromCharCode(nextCharCode);
  }
}
```
It appears that the above code is actually correct: it uses the 16-bit characters of a JS string to store the 8-bit bytes of the destination char code. The problem seems to be in the readToUnicode function in evaluator.js, which uses cmap.js to parse the ToUnicode CMap as a regular CMap and then converts the result into a ToUnicode CMap. It assumes that the strings in the parsed CMap can be turned into valid UTF-16 by merging each pair of adjacent characters. However, if the PDF file omits the leading zeros on the hex-encoded UTF-16 string, the CMap ends up containing a string with an odd number of characters, where the first character is a UTF-16 low byte with no high byte to pair it with.
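A simplified sketch of the pairing step described above (not the exact pdf.js source): adjacent 8-bit "byte characters" are merged into one UTF-16 code unit, which silently misbehaves on odd-length strings.

```javascript
// Merge each pair of adjacent byte-characters into one UTF-16 code unit,
// as readToUnicode is described to do above.
function pairBytes(token) {
  const out = [];
  for (let k = 0; k < token.length; k += 2) {
    // For an odd-length token, charCodeAt(k + 1) is NaN on the last
    // iteration; NaN coerces to 0 in bitwise OR, so the lone low byte
    // 0x59 ('Y') becomes the high byte of 0x5900 (夀).
    out.push((token.charCodeAt(k) << 8) | token.charCodeAt(k + 1));
  }
  return String.fromCharCode(...out);
}

pairBytes("\x00Y"); // "Y"  — even length, bytes pair up correctly
pairBytes("Y");     // "夀" — odd length, 0x59 lands in the high byte
```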
It is unclear whether omitting leading zeros on hex-encoded UTF-16 in the ToUnicode CMap is allowed by the PDF spec. However, since there is at least one PDF in the wild that does it and other PDF readers can read it, pdf.js should probably handle it. I will try to make a pull request with a fix. This will be my first ever pull request to an open source project.
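One straightforward remedy (a sketch of the idea only, not necessarily what the actual pull request does) is to left-pad odd-length tokens with a zero byte before pairing, so every low byte gets a high byte:

```javascript
// Hypothetical helper: restore the leading zero byte that the PDF omitted,
// so subsequent byte-pairing produces the intended UTF-16 code units.
function padEven(token) {
  return token.length % 2 === 1 ? "\x00" + token : token;
}

padEven("Y");     // "\x00Y", which then pairs to the intended "Y"
padEven("\x00Y"); // unchanged: "\x00Y"
```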
Nice, good luck! |
Thank you! I submitted my pull request. |
Link to PDF file:
https://web.archive.org/web/20240515102919/https://www.oahovorcovicka.cz/files/soubory/WEB_2023/Vsledky_CR_2024.pdf
Configuration:
Steps to reproduce the problem:
What is the expected behaviour? (add screenshot)
Clipboard should read "Yýsledková listina přijímacích zkoušek", as it does in SumatraPDF or Acrobat Reader. (This is almost-correct OCR of the scan, consisting of Latin characters.)
What went wrong? (add screenshot)
Clipboard reads "夀ý猀氀攀搀欀漀瘀á 氀椀猀琀椀渀愀 瀀ř椀樀í洀愀挀í挀栀稀欀漀甀š攀欀". (This is a weird sequence of CJK characters, with a few Latin glyphs, all with diacritics.)
I see this PDF is really sloppy and there are many OCR errors throughout the document, but I guess that is not relevant.