OS X 10.11: system calls return non-normalized unicode strings #2165
Comments
Just for comparison, could someone post the output of the above script on 10.10? Maybe @Fishrock123? |
output on 10.10.4:
|
@silverwind Maybe we can check what is at the code points using |
@silverwind note that your script has a bug: you should use |
Perhaps this can help:

    const mkdirp = require('mkdirp');
    const dir = __dirname + '/weird \uc3a4\uc3ab\uc3af characters \u00e1\u00e2\u00e3';
    mkdirp.sync(dir);
    process.chdir(dir);
    function getChars(str) {
      var chars = [];
      for (var c of str) chars.push(c);
      return chars;
    }
    var dirC = getChars(dir);
    var cwdC = getChars(process.cwd());
    console.log(dirC.length);
    console.log(cwdC.length);
    for (var i = 0; i < dirC.length; i++) {
      if (dirC[i].codePointAt(0) !== cwdC[i].codePointAt(0))
        throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
    }
    console.log('strings are identical');

On 10.10, strings are identical. |
@targos fixed the mkdirp. Here's your script's output on 10.11:
|
Either the bug is in |
@silverwind could you modify it again to make the output run through Edit: Actually, it would be more helpful to get more out than just one of the conflicts... so log instead of throw also. |
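One way to log every mismatch instead of throwing at the first one is to print the code points of both strings side by side. This is only a sketch: `codePoints` and `diffCodePoints` are hypothetical helper names, and the two sample strings stand in for `dir` and `process.cwd()`.

```javascript
// Hypothetical helpers, not part of the script posted above.
function codePoints(str) {
  // Array.from iterates by code point, so astral characters stay whole.
  return Array.from(str, (c) =>
    'U+' + c.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

function diffCodePoints(a, b) {
  const pa = codePoints(a);
  const pb = codePoints(b);
  const diffs = [];
  // Walk the longer of the two lists so extra trailing code points are reported too.
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    if (pa[i] !== pb[i]) diffs.push({ index: i, expected: pa[i], actual: pb[i] });
  }
  return diffs;
}

const expected = '\u00e1'; // stand-in for `dir`: one precomposed code point
const actual = 'a\u0301';  // stand-in for `process.cwd()`: base letter + combining accent

for (const d of diffCodePoints(expected, actual)) {
  console.log(`index ${d.index}: ${d.expected} vs ${d.actual}`);
}
```

Because the comparison is positional, every index after the first length change is reported as different, so the first reported index is the interesting one.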
@Fishrock123 not sure if

    const mkdirp = require('mkdirp');
    const hex = require('hexy').hexy;
    const dir = __dirname + '/\uc3a4';
    mkdirp.sync(dir);
    process.chdir(dir);
    // Keep only the part after __dirname, i.e. the path separator and the directory name.
    const a = dir.substring(__dirname.length);
    const b = process.cwd().substring(__dirname.length);
    // Buffer.from() replaces the deprecated new Buffer() constructor.
    console.log(hex(Buffer.from(a)));
    console.log(hex(Buffer.from(b)));
    console.log(a.length, a);
    console.log(b.length, b);

Output:
It goes from 3 to 9 bytes. What kind of unicode encoding is that? |
I meant in logging the points ala
|
@Fishrock123 just comparing the 6 unicode characters (
|
I don't know how chinese works but it's like if OSX decomposed the character into smaller pieces. What happens with a character like 🚀 |
@targos lol, didn't notice it actually was the three parts combined. 🚀 looks fine:
|
Noticed the same "assembling" happens on upper ascii characters like umlauts |
I think it's Normalization taking place: |
@silverwind I think @alexlamsl is right. Here is a simple example in Python:
Note how u1 and u2 "look" the same, but u2 is the decomposed canonical representation: https://en.wikipedia.org/wiki/Unicode_equivalence FWIW, libuv just returns whatever Not sure if OSX 10.11 is returning the cwd with a different normalization. |
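The same comparison can be sketched in JavaScript. The names `u1` and `u2` mirror the description above, but the values are assumptions chosen for illustration, and the sketch assumes a build where `String.prototype.normalize()` is available.

```javascript
const u1 = '\u00e4';  // "ä" as one precomposed code point (NFC)
const u2 = 'a\u0308'; // "a" followed by a combining diaeresis (NFD)

console.log(u1 === u2);                  // false: different code points
console.log(u1.length, u2.length);       // 1 2
console.log(u1.normalize('NFD') === u2); // true
console.log(u2.normalize('NFC') === u1); // true

// A character with no canonical decomposition is left alone, like the rocket above.
const rocket = '\u{1F680}';
console.log(rocket.normalize('NFD') === rocket); // true
```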
I am trying to see if So unfortunately I can't provide any further assistance, sorry... |
@alexlamsl in theory |
this tool will show the normalization differences. The "expanded" ones (form D or KD) are hangul jamo, separating the Korean character into its Consonant-Vowel-Consonant forms. On
But apparently they are also going to form D (probably NFKD?) in |
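The jump from 3 to 9 bytes seen earlier can be reproduced without touching the filesystem. A minimal sketch, assuming a runtime where `String.prototype.normalize()` is available: the syllable `\uc3a4` decomposes into three jamo, and each jamo takes three bytes in UTF-8.

```javascript
const composed = '\uc3a4';                    // one precomposed Hangul syllable
const decomposed = composed.normalize('NFD'); // leading consonant + vowel + trailing consonant

console.log(composed.length, Buffer.byteLength(composed, 'utf8'));     // 1 3
console.log(decomposed.length, Buffer.byteLength(decomposed, 'utf8')); // 3 9

// List the individual jamo code points.
for (const jamo of decomposed) {
  console.log('U+' + jamo.codePointAt(0).toString(16).toUpperCase());
}

// Recomposing gives back the original syllable.
console.log(decomposed.normalize('NFC') === composed); // true
```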
I've sent some feedback regarding this to Apple. Might as well be an oversight on their side, considering there were unicode changes. The fact that normalization was performed on 10.10 and earlier leads me to believe it's a bug on their side. I hope to hear back from them. |
@silverwind I'm also asking some contacts at Apple. So far I have that "HFS+ has always been quasi NFD". It's possible this is a bugfix. (update) if "NFD" was the right form, then the longer form (as returned by 10.11) is correct. |
So that basically means we have to include Intl in our builds (if we want to normalize it in JS-land)? |
Can we work around it by using the canonical form in the test ? |
@targos yes, in fact using the NFD (or NFKD) form I don't consider this a real solution as it's not consistent across platforms, and I think the right way would be to bundle Intl in releases and use |
Quick note, as of v3.1.0, small-icu is enabled in release binaries. If you build from source, you still need to follow the steps that @jorangreef outlined. |
@jorangreef thanks @bnoordhuis I just downloaded v3.1.0 and tried out normalize() on my testcase with readdir() and it just seems to work without changing anything (assertion passes). Can you clarify what you mean with the steps I still need to follow? I would also like to see the sourcecode of the implementation of String.normalize() that is now being used in io.js, if someone could point me to it. |
If you downloaded the binary from iojs.org, you don't need to do anything.
It's... complicated. Start with i18n.js, runtime/runtime-i18n.cc and i18n.cc in deps/v8, in that order. It all ends in V8 calling icu::Normalizer::normalize() from ICU, though. |
When using String.normalize(), can I always assume that the result I get is readable by io.js when I do a fs operation with it? We already see that on Mac I can pass a NFC or NFD form of a path to the fs APIs and both variants work. But let's assume I always convert to NFC (using String.normalize()) any path I get from the OS. Is there a chance I run into fs issues? I would assume that a string remains untouched when it cannot be normalized to NFC and thus this would not be an issue. |
Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form. That also means, if any String.normalize() is being used, it must only be used on Mac because the other OS allow for both forms. I am still a bit puzzled why Mac behaves this way. If you compare this behavior to how case-sensitive file names are handled it would mean that Mac only allows 1 casing and it would always lowercase your filename. Am I missing something? |
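An experiment along these lines can be sketched as follows. The file name and temporary directory prefix are made up for illustration, and the result depends on the filesystem: going by the observations above, a form-preserving filesystem should list two entries and HFS+ a single NFD one.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Scratch directory so the experiment does not touch real files.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'unicode-forms-'));

const nfc = 'caf\u00e9.txt';      // hypothetical file name, composed form
const nfd = nfc.normalize('NFD'); // the same name, decomposed form

fs.writeFileSync(path.join(dir, nfc), 'written via the NFC name');
fs.writeFileSync(path.join(dir, nfd), 'written via the NFD name');

// Report which form(s) the filesystem hands back.
const entries = fs.readdirSync(dir);
for (const name of entries) {
  const form = name === nfc ? 'NFC' : name === nfd ? 'NFD' : 'other';
  console.log(form, JSON.stringify(name), name.length);
}

fs.rmSync(dir, { recursive: true, force: true });
```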
You've probably figured this out already, but to recap:
OS X is the odd one out, both in that it normalizes file names and in that it uses NFD instead of the more common NFC (although NFD has some properties that make it attractive from a technical perspective, compared to NFC.) |
Mostly historic reasons it seems. I'll link this Q&A again as it still seems the best resource I could find. In summary, it's a big mess and the returned form can vary based on the filesystem and even on code points in the HFS+ case. |
Is there any disadvantage to using |
Yes, Remember, though, that for comparison to work, both strings must be in the same form. If we decide to normalize all file names returned by the system to NFC, there's still the possibility that the user tries to compare it to NFD obtained through other means, which would fail. Maybe the best course of action is to advise to do |
This would not be correct, i.e. the same as making fs.readFile normalize CRLF line endings to LF because people are not aware of the difference. Filenames should be treated as data.
Spot on. Although, when comparing file system strings, one must take into account whether the file system preserves form or not. e.g. When comparing filenames on Windows or Linux (or on any form-preserving filesystem), do not use normalize to compare, use ===. And on HFS+ or any non-form-preserving filesystem use normalize('NFD') to compare. Windows or Linux will preserve and return NFC or NFD as per Ben's comment above so using normalize on those filesystems would be wrong as two similar looking files could exist in different Unicode forms. |
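The rule above could be sketched as a small helper. `sameFileName` and its `preservesForm` flag are hypothetical: nothing in this thread provides a way to detect whether a filesystem preserves form, so the caller has to supply that knowledge.

```javascript
// Hypothetical helper. `preservesForm` is supplied by the caller:
// true for form-preserving filesystems, false for HFS+-like ones.
function sameFileName(a, b, preservesForm) {
  if (preservesForm) {
    // Two names that differ only in Unicode form are two different files here.
    return a === b;
  }
  // The filesystem folds both names to one form, so compare them in NFD.
  return a.normalize('NFD') === b.normalize('NFD');
}

const composed = '\u00e4.txt';
const decomposed = 'a\u0308.txt';

console.log(sameFileName(composed, decomposed, true));  // false
console.log(sameFileName(composed, decomposed, false)); // true
```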
@Fishrock123 this would fix the test where normalize works. But to fix the test to work where there is no ICU, it would be better to just change the test to create a NFD form of the directory to start with. That will work on all OS X versions. |
That's exactly right. Apple may have thought that some users would be confused by different normalization forms, and they also wanted to be form-insensitive, so instead of preserving the form, they just convert One can think of it exactly as dealing with a non-case-preserving file-system. |
I think a better comparison would be when you think of your data as case-insensitive (you wouldn't use uppercase for files in your project, would you?), and the OS starts to return uppercase for certain letters based on the characters used. It's a mess in my eyes. |
WIP fix for this test: #3007 |
Fixed the test in 81e98e9. Docs regarding this will follow - see nodejs/docs#42. |
OS X 10.11 changed the unicode normalization form of certain code points returned by system calls like getcwd() from NFC to NFD, which caused this test to fail. The consensus of #2165 is to delegate the task of unicode normalization to the user, and work will continue to document how to handle unicode in a form-sensitive file system. PR-URL: #3007 Fixes: #2165 Reviewed-By: Johan Bergström <bugs@bergstroem.nu> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
A guide is now up that explains how to work with filesystems that do Unicode normalization: https://nodejs.org/en/docs/guides/working-with-different-filesystems/ |
test/sequential/test-chdir.js persistently fails on OS X 10.11 because the output of process.cwd() doesn't match the path we're doing process.chdir() on before.

Here's a reduced test case (npm i mkdirp hexy):

The strings look identical on the terminal, but the bytes differ. Here's the output:
cc: @evanlucas