Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: decode filenames using UTF-8 in fs.watch #3401

Closed
wants to merge 1 commit into from
Closed

fs: decode filenames using UTF-8 in fs.watch #3401

wants to merge 1 commit into from

Conversation

seishun
Copy link
Contributor

@seishun seishun commented Oct 16, 2015

Fixes #2088.

cc @bnoordhuis

@mscdex mscdex added the fs Issues and PRs related to the fs subsystem / file system. label Oct 16, 2015
@bnoordhuis
Copy link
Member

This is the correct fix for Windows but not necessarily for other platforms.

@seishun
Copy link
Contributor Author

seishun commented Oct 17, 2015

@bnoordhuis UTF-8 is assumed for other platforms in e.g. readdir.

@bnoordhuis
Copy link
Member

Yes, that's wrong.

@seishun
Copy link
Contributor Author

seishun commented Oct 17, 2015

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong, or wait until we change literally all of the fs API, which will probably never happen because no one sane would want to use buffers for paths?

@bnoordhuis
Copy link
Member

So what I think we need to do is bite the bullet and move the file name / path name munging into a class that contains platform-specific or file system-specific logic for encoding and decoding them. That's going to be a fair bit of churn but that's better than the halfhearted approach where everything is broken in different ways all the time.

@seishun
Copy link
Contributor Author

seishun commented Oct 17, 2015

a class that contains platform-specific or file system-specific logic for encoding and decoding them

What logic could there be for Unices other than assuming an encoding? And if it assumes UTF-8, then the logic is exactly the same for all platforms.

where everything is broken in different ways all the time

The only thing objectively broken right now is fs.watch. The theoretical breakage of non-UTF8 filenames on Unices is consistent across all of fs APIs as far as I'm aware.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

@bnoordhuis
Copy link
Member

What logic could there be for Unices other than assuming an encoding?

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

fs.watch() doesn't produce correct results now either but the one-byte encoding means you can at least patch it up post facto.

I don't understand why you insist on making global changes before making fs.watch consistent with the rest of the fs API, whatever they might be.

Because it's just swapping one bug for another. Maybe it's going to work better now for a majority of users but that's small consolation for the minority whose workflow just broke. I don't feel comfortable signing off on that.

@seishun
Copy link
Contributor Author

seishun commented Oct 17, 2015

Just an example but if you consider OS X a UNIX, HFS+ uses NFD decomposition with extensions. Assuming UTF-8 produces wrong results.

Shouldn't it be libuv's job to convert that into proper UTF-8?

@bnoordhuis
Copy link
Member

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

@seishun
Copy link
Contributor Author

seishun commented Oct 17, 2015

It could be. I'm not completely opposed to that but it's a discussion that gets philosophical quickly and it would be a semver-major change at the very least. I do think that in general it's best to do conversion at the edges, which in this case would be the JS <-> C++ border.

It does convert on Windows. Why should OS X be treated any differently? I thought libuv was supposed to abstract away platform differences.

Anyway, my point still stands. You can argue that using UTF-8 on OS X is wrong, but that's how all of fs works. I don't see why fs.watch should be an exception (and in an extremely unintuitive way, too). If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8), then it should be fixed globally for all of fs at some point.

but that's small consolation for the minority whose workflow just broke

If their workflow would break from that, then why don't I see their complaints about other fs functions?

@nodejs/collaborators other opinions?

@bnoordhuis
Copy link
Member

It does convert on Windows.

Windows is special. Libuv converts to UTF-16 so it can use the wide-char OS functions. The ANSI functions simply don't cut it in many cases.

@bnoordhuis
Copy link
Member

If there really is an issue with OS X filenames (and so far I haven't found any source confirming that their filenames aren't valid UTF-8),

It's UTF-8 with a twist, that's why iconv for example calls it utf-8-mac. From here:

HFS Plus (Mac OS Extended) uses a variant of Normal Form D in which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed

If you google around for "OS X NFD" or "HFS NFD", you'll find a great many SO questions on the topic.

then it should be fixed globally for all of fs at some point.

That's what I was suggesting!

@silverwind
Copy link
Contributor

We had a lengthy discussion about normalization for OSX in #2165, with the gist being that we don't want to normalize to a specific form because of a possible data loss.

As for this issue, I think a class that abstracts away platform inconsistencies and returns UTF-8 in all cases would be prefered and be best done in libuv.

cc: @jorangreef

@seishun
Copy link
Contributor Author

seishun commented Oct 18, 2015

@bnoordhuis from what I gather, it's still valid UTF-8. In which case, I don't see any problem with decoding it as UTF-8.

@seishun
Copy link
Contributor Author

seishun commented Oct 18, 2015

@silverwind

with the gist being that we don't want to normalize to a specific form because of a possible data loss.

Which means we just want to pass filenames as-is, right? Which is exactly what this PR does, just like everywhere else in fs.

@silverwind
Copy link
Contributor

Yes, the closest representation to the original file system would be prefered. What's the issue exactly on Unices if we do it like in this PR?

@seishun
Copy link
Contributor Author

seishun commented Oct 19, 2015

I assume this question was directed at @bnoordhuis, because I have no idea.

@jorangreef
Copy link
Contributor

@bnoordhuis is right. Although 95% of filesystems use utf-8 as their encoding, one can't assume that every fileysystem is utf-8 or that just because the platform is Windows or Darwin that the encoding is therefore utf-8. Node currently makes this assumption without providing a way to work with filesystems which use other encodings.

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user. If fs.watch converted to utf-8 then this would lose data in some rare edge cases when the encoding is not in fact utf-8. If the utf-8 decoding then replaced some unknown sequences with the replacement character then there would be no way for the user to get back to the actual filename as used by the filesystem.

@bnoordhuis, does the current utf-8 decoding also normalize form or does it just bring in whatever utf-8 form is being used in the data (whether mixed, NFC, NFD, HFS+ NFD etc.)? I thought it would do the latter, and if so then it should be safe to use with HFS+ so long as the decoding process does not also normalize form at the same time (this should never be done).

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

@seishun
Copy link
Contributor Author

seishun commented Oct 19, 2015

What's the likelihood of someone using fs.watch on an exotic filesystem while disregarding the fact that they are completely screwed by the rest of fs? (And discovering the Buffer hack instead of giving up or submitting an issue when fs.watch gave them garbage)

The current binary string is better as Ben mentioned because it leaves the encoding choice up to the user.

No, it's not better. It "supports" the likely irrelevant edge case in the worst way possible (basically by passing a "binary string", which should have been part of history for about 3 years now), while being inconsistent with the rest of fs. What kind of a first impression will a new user of Node.js get when they are told that fs.readdir passes correct filenames while fs.watch requires an undocumented hack with a deprecated encoding?

@trevnorris
Copy link
Contributor

"And discovering the Buffer hack"
"requires an undocumented hack"

I'm curious what you're referring to?

(basically by passing a "binary string", which should have been part of history for about 3 years now)

Do you mean removed from fs or removed completely?

with a deprecated encoding

I'd completely forgot about that line in the docs. It's incorrect. Thanks for the reminder (#3441)

@seishun
Copy link
Contributor Author

seishun commented Oct 20, 2015

I'm curious what you're referring to?

Using new Buffer(str, 'binary') to convert it back to a byte array.

Do you mean removed from fs or removed completely?

Removed from all APIs.

@jorangreef
Copy link
Contributor

@seishun I think you may have skipped over the last part of my comment:

Just my two cents, but I think it would be great to make utf-8 the default encoding for all filenames across all fs methods. And then add a path encoding option to the various fs methods such as fs.watch and fs.readdir in case the user knows that the filesystem is actually using another encoding. It should always be up to the user to figure out what encoding the filesystem is using and Node should never try to take this responsibility on itself (apart from assume utf-8 as the default, but give an option to override encoding). Unicode form however should never be normalized and there should never be any option to do this.

@seishun
Copy link
Contributor Author

seishun commented Oct 20, 2015

@jorangreef Yes, it might be a good idea to have in the future, but fs.watch shouldn't remain in its current state until then.

@seishun
Copy link
Contributor Author

seishun commented Oct 21, 2015

@bnoordhuis What's the status on this?

@srl295
Copy link
Member

srl295 commented Oct 23, 2015

I don't think I have enough context to ±1 this specific change here, but UTF-8-only from libuv or the JS⇄C++ boundary would make sense.

It is fact that not all file systems are in UTF-8.

It's correct for 99% of users. Do we want to fix fs.watch for them now, even if it's technically wrong

And thus a breaking change for 1% (the number may actually be higher)

@seishun
Copy link
Contributor Author

seishun commented Oct 23, 2015

And thus a breaking change for 1% (the number may actually be higher)

The number is irrelevant. Even if such users exist, they won't be able to read or write any files, as all of fs assumes UTF-8. I doubt they'd be able to run Node.js scripts at all. fs.watch is the least of their concern, singling it out doesn't make sense.

@bnoordhuis
Copy link
Member

Taking that step back, I think we all agree that node is buggy, we just disagree on how to fix it. I don't want to derail this PR further so I filed #3519 with a recap of the current situation and a call for proposals.

@seishun
Copy link
Contributor Author

seishun commented Oct 27, 2015

One thing is clear - there will be no breaking changes to fs as the result of #3519. So I don't see any reason why a fix to fs.watch must be postponed until that.

@rvagg rvagg removed the tsc-agenda label Nov 4, 2015
@MylesBorins MylesBorins mentioned this pull request Jan 11, 2016
@seishun seishun mentioned this pull request Mar 9, 2016
4 tasks
jasnell added a commit to jasnell/node that referenced this pull request Mar 25, 2016
This makes several changes:

1. Allow path/filename to be passed in as a Buffer on fs methods
2. Add `options.encoding` to fs.readdir, fs.readdirSync, fs.readlink,
   fs.readlinkSync and fs.watch.
3. Documentation updates

For 1... it's now possible to do:

```js
fs.open(Buffer('/fs/foo/bar'), 'w+', (err, fd) => { });
```

For 2...
```js
fs.readdir('/fs/foo/bar', {encoding:'hex'}, (err,list) => { });

fs.readdir('/fs/foo/bar', {encoding:'buffer'}, (err, list) => { });
```

encoding can also be passed as a string

```js
fs.readdir('/fs/foo/bar', 'hex', (err,list) => { });
```

The default encoding is set to UTF8 so this addresses the
discrepency that existed previously between fs.readdir and
fs.watch handling filenames differently.

Fixes: nodejs#2088
Refs: nodejs#3519
Alternate: nodejs#3401
@jasnell
Copy link
Member

jasnell commented Apr 2, 2016

#5616 landed... which includes decoding filenames in watch as utf8 by default

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fs Issues and PRs related to the fs subsystem / file system. semver-major PRs that contain breaking changes and should be released in the next major version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants