OS X 10.11: system calls return non-normalized unicode strings #2165

silverwind · 2015-07-11T23:57:49Z

test/sequential/test-chdir.js persistently fails on OS X 10.11 because the output of process.cwd() doesn't match the path we're doing process.chdir() on before.

Here's a reduced test case (npm i mkdirp hexy):

const mkdirp = require('mkdirp');
const hex = require('hexy').hexy;
const dir = __dirname + '/weird \uc3a4\uc3ab\uc3af characters \u00e1\u00e2\u00e3';

mkdirp.sync(dir);
process.chdir(dir);

console.log(hex(new Buffer(dir)));
console.log(hex(new Buffer(process.cwd())));
console.log(dir.length, dir);
console.log(process.cwd().length, process.cwd());

The strings look identical on the terminal, but the bytes differ. Here's the output:

00000000: 2f55 7365 7273 2f73 696c 7665 7277 696e  /Users/silverwin
00000010: 642f 6769 742f 6368 6469 722f 7765 6972  d/git/chdir/weir
00000020: 6420 ec8e a4ec 8eab ec8e af20 6368 6172  d.l.$l.+l./.char
00000030: 6163 7465 7273 20c3 a1c3 a2c3 a3         acters.C!C"C#

00000000: 2f55 7365 7273 2f73 696c 7665 7277 696e  /Users/silverwin
00000010: 642f 6769 742f 6368 6469 722f 7765 6972  d/git/chdir/weir
00000020: 6420 e184 8ae1 85a7 e186 abe1 848a e185  d.a..a.'a.+a..a.
00000030: a7e1 86b2 e184 8ae1 85a7 e186 b620 6368  'a.2a..a.'a.6.ch
00000040: 6172 6163 7465 7273 2061 cc81 61cc 8261  aracters.aL.aL.a
00000050: cc83                                     L.

52 '/Users/silverwind/git/chdir/weird 쎤쎫쎯 characters áâã'
61 '/Users/silverwind/git/chdir/weird 쎤쎫쎯 characters áâã'

cc: @evanlucas

The text was updated successfully, but these errors were encountered:

silverwind · 2015-07-12T15:08:20Z

Just for comparision, could someone post the output of above script on 10.10? Maybe @Fishrock123?

targos · 2015-07-12T16:44:06Z

output on 10.10.4:

00000000: 2f55 7365 7273 2f6d 7a61 7373 6f2f 7465  /Users/mzasso/te
00000010: 7374 2f77 6569 7264 20ec 8ea4 ec8e abec  st/weird.l.$l.+l
00000020: 8eaf 2063 6861 7261 6374 6572 7320 c3a1  ./.characters.C!
00000030: c3a2 c3a3                                C"C#

00000000: 2f55 7365 7273 2f6d 7a61 7373 6f2f 7465  /Users/mzasso/te
00000010: 7374 2f77 6569 7264 20ec 8ea4 ec8e abec  st/weird.l.$l.+l
00000020: 8eaf 2063 6861 7261 6374 6572 7320 c3a1  ./.characters.C!
00000030: c3a2 c3a3                                C"C#

43 '/Users/mzasso/test/weird 쎤쎫쎯 characters áâã'
43 '/Users/mzasso/test/weird 쎤쎫쎯 characters áâã'

Fishrock123 · 2015-07-12T16:58:37Z

@silverwind Maybe we can check what is at the code points using String.prototype.codePointAt(n) or similar? There are different ways of denoting unicode via code points, and it looks like El Capitan is expanding them. If so, that may actually be an OS bug.

targos · 2015-07-12T17:07:12Z

@silverwind note that your script has a bug: you should use mkdirp.sync or it fails the first time because the directory doesn't exist yet when process.chdir is executed

targos · 2015-07-12T17:42:18Z

Perhaps this can help:

const mkdirp = require('mkdirp');
const dir = __dirname + '/weird \uc3a4\uc3ab\uc3af characters \u00e1\u00e2\u00e3';

mkdirp.sync(dir);
process.chdir(dir);

function getChars(str) {
  var chars = [];
  for (var c of str) chars.push(c);
  return chars;
}

var dirC = getChars(dir);
var cwdC = getChars(process.cwd());

console.log(dirC.length);
console.log(cwdC.length);

for (var i = 0; i < dirC.length; i++) {
  if (dirC[i].codePointAt(0) !== cwdC[i].codePointAt(0))
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
}

console.log('strings are identical');

On 10.10, strings are identical.

silverwind · 2015-07-12T19:50:42Z

@targos fixed the mkdirp. Here's your script's output on 10.11:

52
61

/Users/silverwind/git/chdir/codepoints.js:21
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
                                                               ^
Different code point at 34: 쎤 - ᄊ

silverwind · 2015-07-12T19:58:21Z

Either the bug is in uv_cwd or (more likely) in the OS. cc: @saghul

Fishrock123 · 2015-07-12T20:35:52Z

@silverwind could you modify it again to make the output run through String.fromCodePoint()?

Edit: Actually, it would be more helpful do get more out than just one of the conflicts... so log instead of throw also.

silverwind · 2015-07-12T20:51:21Z

@Fishrock123 not sure if fromCodePoint is of much help. I don't think I can feed bytes into it. For simplicy, here's just a single character compared:

const mkdirp = require('mkdirp');
const hex = require('hexy').hexy;
const dir = __dirname + '/\uc3a4';

mkdirp.sync(dir);
process.chdir(dir);

const a = dir.substring(__dirname.length);
const b = process.cwd().substring(__dirname.length);

console.log(hex(new Buffer(a)));
console.log(hex(new Buffer(b)));
console.log(a.length, a);
console.log(b.length, b);

Output:

00000000: ec8e a4                                  l.$

00000000: e184 8ae1 85a7 e186 ab                   a..a.'a.+

1 '쎤'
3 '쎤'

It goes from 3 to 9 bytes. What kind of unicode encoding is that?

Fishrock123 · 2015-07-12T20:54:58Z

I meant in logging the points ala

for (var i = 0; i < dirC.length; i++) {
  if (dirC[i].codePointAt(0) !== cwdC[i].codePointAt(0))
    throw `Different code point at ${i}: ${dirC[i]} - ${cwdC[i]}`;
}

silverwind · 2015-07-12T21:00:06Z

@Fishrock123 just comparing the 6 unicode characters (\uc3a4\uc3ab\uc3af\u00e1\u00e2\u00e3):

00000000: ec8e a4ec 8eab ec8e afc3 a1c3 a2c3 a3    l.$l.+l./C!C"C#

00000000: e184 8ae1 85a7 e186 abe1 848a e185 a7e1  a..a.'a.+a..a.'a
00000010: 86b2 e184 8ae1 85a7 e186 b661 cc81 61cc  .2a..a.'a.6aL.aL
00000020: 8261 cc83                                .aL.

6 '쎤쎫쎯áâã'
15 '쎤쎫쎯áâã'
Different code point at 0: 쎤 - ᄊ
Different code point at 1: 쎫 - ᅧ
Different code point at 2: 쎯 - ᆫ
Different code point at 3: á - ᄊ
Different code point at 4: â - ᅧ
Different code point at 5: ã - ᆲ

targos · 2015-07-12T21:17:42Z

I don't know how chinese works but it's like if OSX decomposed the character into smaller pieces.
Look how the 3 first on the right (ᄊ, ᅧ, ᆫ) can be used together to create the first one on the left (쎤).

What happens with a character like 🚀 '\uD83D\uDE80' ?

silverwind · 2015-07-12T21:20:16Z

@targos lol, didn't notice it actually was the three parts combined. 🚀 looks fine:

00000000: f09f 9a80                                p...

00000000: f09f 9a80                                p...

2 '🚀'
2 '🚀'

silverwind · 2015-07-12T21:26:28Z

Noticed the same "assembling" happens on upper ascii characters like umlauts öäü, where the dots are a separate character following the ascii letter. I've seen this before on bugged webfonts, not sure what this mechanism is called.

alexlamsl · 2015-07-13T16:18:50Z

I think it's Normalization taking place:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization

saghul · 2015-07-13T16:42:27Z

@silverwind I think @alexlamsl is right. Here is a simple example in Python:

In [25]: u1 = u'mañana'

In [26]: print u1
mañana

In [28]: u1
Out[28]: u'ma\xf1ana'

In [29]: u2 = unicodedata.normalize('NFD', u1)

In [30]: print u2
mañana

In [31]: u1 == u2
Out[31]: False

In [32]: u2
Out[32]: u'man\u0303ana'

Note how u1 and u2 "look" the same, but u2 is the decomposed canonical representation: https://en.wikipedia.org/wiki/Unicode_equivalence

FWIW, libuv just returns whatever getcwd returns: https://github.com/libuv/libuv/blob/c8eebc93a9efcdcc2913723ccd86b35498cc271f/src/unix/core.c#L636-L650

Not sure if OSX 10.11 is returning the cwd with a different normalization.

alexlamsl · 2015-07-13T16:55:48Z

I am trying to see if String.prototype.normalize() would be a good solution to this problem. However 1) I am on Windows and 2) I couldn't get iojs 2.3.4 to give me different results to the strings even in this example:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize

So unfortunately I can't provide any further assistance, sorry...

targos · 2015-07-13T18:06:11Z

@alexlamsl in theory String.prototype.normalize() could be a solution but sadly it only works if iojs is built with Intl support. Otherwise it is just a no-op.

srl295 · 2015-07-13T18:18:11Z

this tool will show the normalization differences. The "expanded" ones (form D or KD) are hangul jamo, separating the Korean character into its Consonant-Vowel-Consonant forms.

On 10.10.4 the OS is already normalizing the strings:

$ mkdir '쎤쎫쎯' '쎤쎫쎯'
mkdir: 쎤쎫쎯: File exists

But apparently they are also going to form D (probably NFKD?) in cwd in the new version

silverwind · 2015-07-13T20:14:35Z

I've sent some feedback regarding this to Apple. Might as well be an oversight on their side, considering there were unicode changes. The fact that normalization was performed on 10.10 and earlier leads me to believe it's a bug on their side. I hope to hear back from them.

srl295 · 2015-07-13T21:40:00Z

@silverwind I'm also asking some contacts at Apple. So far I have that "HFS+ has always been quasi NFD". It's possible this is a bugfix. (update) if "NFD" was the right form, then the longer form (as returned by 10.11) is correct.

silverwind · 2015-07-13T22:37:55Z

So that basically means we have to include Intl in our builds (if we want to normalize it in JS-land)?

targos · 2015-07-14T04:43:22Z

Can we work around it by using the canonical form in the test ?

silverwind · 2015-07-14T17:47:31Z

@targos yes, in fact using the NFD (or NFKD) form '\u110a\u1167\u11ab\u110a\u1167\u11b2\u110a\u1167\u11b6\u0061\u0301\u0061\u0302\u0061\u0303' does work and the strings are equal.

I don't consider this a real solution as it's not consistent across platforms, and I think the right way would be to bundle Intl in releases and use string.prototype.normalize on all values returned by system calls on OS X >= 10.11.

bnoordhuis · 2015-08-20T10:24:32Z

Quick note, as of v3.1.0, small-icu is enabled in release binaries. If you build from source, you still need to follow the steps that @jorangreef outlined.

bpasero · 2015-08-23T06:54:57Z

@jorangreef thanks

@bnoordhuis I just downloaded v3.1.0 and tried out normalize() on my testcase with readdir() and it just seems to work without changing anything (assertion passes). Can you clarify what you mean with the steps I still need to follow?

I would also like to see the sourcecode of the implementation of String.normalize() that is now being used in io.js, if someone could point me to it.

bnoordhuis · 2015-08-23T11:15:45Z

If you downloaded the binary from iojs.org, you don't need to do anything.

I would also like to see the sourcecode of the implementation of String.normalize() that is now being used in io.js, if someone could point me to it.

It's... complicated. Start with i18n.js, runtime/runtime-i18n.cc and i18n.cc in deps/v8, in that order. It all ends in V8 calling icu::Normalizer::normalize() from ICU, though.

bpasero · 2015-08-24T12:01:11Z

When using String.normalize(), can I always assume that the result I get is readable by io.js when I do a fs operation with it? We already see that on Mac I can pass a NFC or NFD form of a path to the fs APIs and both variants work. But lets assume I always convert to NFC (using String.normalize()) any path I get from the OS. Is there a chance I run into fs issues? I would assume that a string remains untouched when it cannot be normalized to NFC and thus this would not be an issue.

bpasero · 2015-08-24T14:31:47Z

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form. That also means, if any String.normalize() is being used, it must only be used on Mac because the other OS allow for both forms.

I am still a bit puzzled why Mac behaves this way. If you compare this behavior to how case-sensitive file names are handled it would mean that Mac only allows 1 casing and it would always lowercase your filename.

Am I missing something?

bnoordhuis · 2015-08-24T14:56:24Z

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form.

You've probably figured this out already, but to recap:

Windows (or rather, NTFS) stores file names as UTF-16 without further canonicalizing them.
Most Unices, including Linux, treat file names as simple byte arrays without any particular encoding.

OS X is the odd one out, both in that it normalizes file names and in that it uses NFD instead of the more common NFC (although NFD has some properties that make it attractive from a technical perspective, compared to NFC.)

silverwind · 2015-08-24T14:59:42Z

I am still a bit puzzled why Mac behaves this way

Mostly historic reasons it seems. I'll link this Q&A again as it still seems the best resource I could find. In summary, it's a big mess and the returned form can vary based on the filesystem and even on code points in the HFS+ case.

Fishrock123 · 2015-08-24T15:01:44Z

Is there any disadvantage to using string.normalize() on mac? To my knowledge it should to what's correct per system, and will just return the string unmodified if node is built without Intl?

silverwind · 2015-08-24T15:20:32Z

Yes, string.normalize should be a noop without Intl.

Remember thought, that for comparison to work, both strings must be in the same form. If we decide to normalize all file names returned by the system to NFC, there's still the possibilty that the user tries to compare it to NFD obtained through other means, which would fail.

Maybe the best course of action is to advice to do str1.normalize() === str2.normalize() when comparing any unicode strings, regardless of origin.

jorangreef · 2015-08-24T16:57:16Z

If we decide to normalize all file names returned by the system to NFC

This would not be correct, i.e. the same as making fs.readFile normalize CRLF line endings to LF because people are not aware of the difference. Filenames should be treated as data.

Maybe the best course of action is to advice to do str1.normalize() === str2.normalize() when comparing any unicode strings, regardless of origin.

Spot on. Although, when comparing file system strings, one must take into account whether the file system preserves form or not. e.g. When comparing filenames on Windows or Linux (or on any form-preserving filesystem), do not use normalize to compare, use ===. And on HFS+ or any non-form-preserving filesystem use normalize('NFD') to compare. Windows or Linux will preserve and return NFC or NFD as per Ben's comment above so using normalize on those filesystems would be wrong as two similar looking files could exist in different Unicode forms.

jorangreef · 2015-08-24T16:58:59Z

Is there any disadvantage to using string.normalize() on mac? To my knowledge it should to what's correct per system, and will just return the string unmodified if node is built without Intl?

@Fishrock123 this would fix the test where normalize works. But to fix the test to work where there is no ICU, it would be better to just change the test to create a NFD form of the directory to start with. That will work on all OS X versions.

jorangreef · 2015-08-25T05:27:20Z

Did some more testing. It seems that Windows and Linux file systems allow files with either NFD or NFC form and they are both preserved. On Mac, I can create either a file in NFD or NFC form but in both cases it ends up to be just one file in NFD form. That also means, if any String.normalize() is being used, it must only be used on Mac because the other OS allow for both forms.

I am still a bit puzzled why Mac behaves this way. If you compare this behavior to how case-sensitive file names are handled it would mean that Mac only allows 1 casing and it would always lowercase your filename.

Am I missing something?

That's exactly right. Apple may have thought that some users would be confused by different normalization forms, and they also wanted to be form-insensitive, so instead of preserving the form, they just convert abc to ABC and lose the form in the process. Meanwhile, the other file-systems preserve form and are sensitive to form.

One can think of it exactly as dealing with a non-case-preserving file-system.

silverwind · 2015-08-25T05:35:10Z

One can think of it exactly as dealing with a non-case-preserving file-system.

I think a better comparision would be when you think of your data as case-insensitve (you wouldn't use uppercase for files in your project, would you?), and the OS starts to return uppercase for certain letters based on the characters used. It's a mess in my eyes.

silverwind · 2015-09-22T17:04:09Z

WIP fix for this test: #3007
Documentation suggestion: nodejs/docs#42

silverwind · 2015-10-06T16:16:55Z

Fixed the test in 81e98e9. Docs regarding this will follow - see nodejs/docs#42.

OS X 10.11 changed the unicode normalization form of certain code points returned by system calls like getcwd() from NFC to NFD which made results in this test failing. The consensus of #2165 is to delegate the task of unicode normalization to the user, and work will continue to document how to handle unicode in a form-sensitive file system. PR-URL: #3007 Fixes: #2165 Reviewed-By: Johan Bergström <bugs@bergstroem.nu> Reviewed-By: Jeremiah Senkpiel <fishrock123@rocketmail.com> Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>

jorangreef · 2015-11-17T07:13:13Z

A guide is now up that explains how to work with filesystems that do Unicode normalization: https://nodejs.org/en/docs/guides/working-with-different-filesystems/

silverwind added the process Issues and PRs related to the process subsystem. label Jul 11, 2015

silverwind added the macos Issues and PRs related to the macOS platform / OSX. label Jul 13, 2015

silverwind changed the title ~~process: unicode issue on process.cwd() on OS X 10.11~~ OS X 10.11: unicode issue on process.cwd() Jul 13, 2015

srl295 mentioned this issue Jul 13, 2015

Feedback on README.md? nodejs/Intl#4

Closed

silverwind added the confirmed-bug Issues with confirmed bugs. label Jul 13, 2015

silverwind changed the title ~~OS X 10.11: unicode issue on process.cwd()~~ OS X 10.11: system calls return non-normalized unicode strings Jul 14, 2015

This was referenced Sep 22, 2015

test: use normalize() for unicode paths #3007

Closed

A section for guides nodejs/docs#42

Closed

silverwind mentioned this issue Sep 29, 2015

Test "test-process-chdir" fails on El Capitan #3108

Closed

silverwind closed this as completed in 81e98e9 Oct 6, 2015

Trott mentioned this issue Oct 7, 2015

4.1.2 make check errors #3242

Closed

silverwind mentioned this issue Oct 17, 2015

fs: decode filenames using UTF-8 in fs.watch #3401

Closed

mderazon mentioned this issue Apr 7, 2016

bad routing where there are non-english characters in page url sintaxi/harp#495

Closed

silverwind mentioned this issue Aug 23, 2016

process.cwd() casing inconsistent across platforms #8237

Closed

Tyriar mentioned this issue Jun 26, 2017

Test: New multi-root aware file watcher microsoft/vscode#29478

Closed

41 tasks

richardschneider mentioned this issue Dec 16, 2017

Non-ASCII key names on MacOS libp2p/js-libp2p-keychain#16

Closed

GomPam mentioned this issue May 3, 2019

[Bug] unicode normalization problem fork-dev/Tracker#444

Closed

OS X 10.11: system calls return non-normalized unicode strings #2165

OS X 10.11: system calls return non-normalized unicode strings #2165

Comments

silverwind commented Jul 11, 2015

silverwind commented Jul 12, 2015

targos commented Jul 12, 2015

Fishrock123 commented Jul 12, 2015

targos commented Jul 12, 2015

targos commented Jul 12, 2015

silverwind commented Jul 12, 2015

silverwind commented Jul 12, 2015

Fishrock123 commented Jul 12, 2015

silverwind commented Jul 12, 2015

Fishrock123 commented Jul 12, 2015

silverwind commented Jul 12, 2015

targos commented Jul 12, 2015

silverwind commented Jul 12, 2015

silverwind commented Jul 12, 2015

alexlamsl commented Jul 13, 2015

saghul commented Jul 13, 2015

alexlamsl commented Jul 13, 2015

targos commented Jul 13, 2015

srl295 commented Jul 13, 2015

silverwind commented Jul 13, 2015

srl295 commented Jul 13, 2015

silverwind commented Jul 13, 2015

targos commented Jul 14, 2015

silverwind commented Jul 14, 2015

bnoordhuis commented Aug 20, 2015

bpasero commented Aug 23, 2015

bnoordhuis commented Aug 23, 2015

bpasero commented Aug 24, 2015

bpasero commented Aug 24, 2015

bnoordhuis commented Aug 24, 2015

silverwind commented Aug 24, 2015

Fishrock123 commented Aug 24, 2015

silverwind commented Aug 24, 2015

jorangreef commented Aug 24, 2015

jorangreef commented Aug 24, 2015

jorangreef commented Aug 25, 2015

silverwind commented Aug 25, 2015

silverwind commented Sep 22, 2015

silverwind commented Oct 6, 2015

jorangreef commented Nov 17, 2015