Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

createGetCanonicalFileName yields non-existent paths on Windows if the path contains certain unicode letters #31819

Closed
monoblaine opened this issue Jun 7, 2019 · 8 comments · Fixed by #36106
Assignees
Labels
Bug A bug in TypeScript Fix Available A PR has been opened for this issue

Comments

@monoblaine
Copy link

TypeScript Version: 3.1.2 Grabbed from the file C:\Program Files (x86)\Microsoft SDKs\TypeScript\3.1\tsserver.js:

// WARNING: The script `configureNightly.ts` uses a regexp to parse out these values.
// If changing the text in this section, be sure to test `configureNightly` too.
ts.versionMajorMinor = "3.1";
/** The version of the TypeScript compiler release */
ts.version = ts.versionMajorMinor + ".2";

But the issue is present in the latest version, too.

Search Terms:

  • unicode
  • unicode path
  • turkish

Code

I don't know TypeScript yet. Still, I've tried, but I couldn't figure out how the ts global variable is used.

The function with the problem is here:

export function createGetCanonicalFileName(useCaseSensitiveFileNames: boolean): GetCanonicalFileName {

If we're on Windows, this function returns a function that returns the lowercased version of the parameter given. So C:/Users/Serhan/Desktop/İ returns c:/users/serhan/desktop/i̇.

Expected behavior:

All I can say is that the use of toLowerCase is not the expected behavior here. Because these two paths are not the same on Windows.

Related Issues:

@sandersn sandersn added the Bug A bug in TypeScript label Jun 7, 2019
@RyanCavanaugh RyanCavanaugh added this to the TypeScript 3.6.0 milestone Jun 12, 2019
@monoblaine
Copy link
Author

monoblaine commented Jun 13, 2019

I think I should make some things clear:

The paths C:/Users/Serhan/Desktop/İ and c:/users/serhan/desktop/i̇ are different in Windows file system terms and the reason lies in the fact that Unicode representations of some Turkish letters are really weird:

# Character Char code Desc
1. ı 305 A Turkish specific letter.
2. I 73 Uppercase form of ı
3. i 105 A lowercase letter
4. İ 304 Uppercase form of i
5. 105, 775 'İ'.toLowerCase() (4th item)
6. 73, 775 'i̇'.toUpperCase() (5th item)

The 3rd and the 5th characters look the same, and they are supposed to be the same. Same thing applies for the 4th and the 6th characters, too. (I think these pairs are convertible to one another using unicode normalization.)

What strikes me as odd is the way Windows handles file name cases:

  • Folders with names I and i cannot coexist. (This is understandable)
  • Folders with names I and ı can coexist. (This is weird.)

Maybe unicode needs another Turkish-specific uppercase ı? 😄

@orta
Copy link
Contributor

orta commented Aug 2, 2019

Looks like in windows we always assume case insensitive paths:

function isFileSystemCaseSensitive(): boolean {
// win32\win64 are case insensitive platforms
if (platform === "win32" || platform === "win64") {
return false;
}
// If this file exists under a different case, we must be case-insensitve.

Which means

export function createGetCanonicalFileName(useCaseSensitiveFileNames: boolean): GetCanonicalFileName {
return useCaseSensitiveFileNames ? identity : toLowerCase;
}

Will fall to:

/** Returns lower case string */
export function toLowerCase(x: string) { return x.toLowerCase(); }

Which falls back to node's implementation - the question is, do we want to add a workaround in TypeScript when there's are two open PRs for this in node? nodejs/node#27644 + nodejs/node#27662 - my gut says we probably don't.

@monoblaine
Copy link
Author

Which falls back to node's implementation

Hi @orta how did you conclude that toLowerCase falls back to node's implementation? I do not see any connection between this and node's implementation. In fact, I'm the author of one of those PR's and I'm currently using a custom built node.exe from the changes introduced by that PR. Yet, it's not enough to solve this issue: I also had to change createGetCanonicalFileName so that it always returns identity. (In order to solve the issue about Visual Studio)

@monoblaine
Copy link
Author

I feel like I'm talking to a wall. I don't understand why this project is open-source if the contributions are considered write-only.

@RyanCavanaugh
Copy link
Member

RyanCavanaugh commented Aug 16, 2019

@monoblaine We all have tons of work to be doing here. An average month here sees 2,000 comments and a team of six engineers and a lead is trying to stay on top of all of it, we cannot issue immediate responses to all comments. The bug fixing window for the 3.7 milestone is two months long; this issue will be fixed in that timeframe. Please don't interpret a delay in response as a refusal to engage - your comments are appreciated and will be addressed in due time.

@orta
Copy link
Contributor

orta commented Aug 17, 2019

Sorry, yeah, I was hesitant to reply till I was sure after I had taken time to investigate, which I've not found yet

@orta
Copy link
Contributor

orta commented Oct 10, 2019

Hi @orta how did you conclude that toLowerCase falls back to node's implementation? I do not see any connection between this and node's implementation

I guess more specifically I mean v8's implementation of "toLowerCase", e.g. not one we have written inside TypeScript. Because it just forwards straight to the string's implementation of toLowerCase:

/** Returns lower case string */
export function toLowerCase(x: string) { return x.toLowerCase(); }

I feel like there are three options:

  • Maybe switching it out to use toLocaleLowerCase instead? Given that it is built to handle Turkish diacritics specifically.

  • Turkish users on windows specifically always get the identity function instead, this could make it so that folks using Turkish locales could end up with different build states in comparison to CI etc

  • We add catches for fails on fs access, these can check the path for any of the known examples above and can start looking through the file system with different variants of it to find the path. This looks to be how true-case-path does it

@monoblaine
Copy link
Author

  • Maybe switching it out to use toLocaleLowerCase instead? Given that it is built to handle Turkish diacritics specifically.

I'm afraid this won't work because the result of toLocaleLowerCase depends on the user's locale. ('I'.toLocaleLowerCase() will give i to you and ı to me.)

  • Turkish users on windows specifically always get the identity function instead, this could make it so that folks using Turkish locales could end up with different build states in comparison to CI etc

I'm actually using this solution for months in order to fix the Visual Studio issue that I've stated above. I didn't have any problems so far, but I'm not confident to say that there won't be any side effects. Still, I don't really know why the function createGetCanonicalFileName is required in the first place: Is it to prevent saying that paths C:\Foo and C:\foo are different on Windows?

sheetalkamat added a commit that referenced this issue Jan 31, 2020
…tCanonicalFileName (#36106)

* Add test case to verify directory casing preservation when watching

* Fix unicode file name handling when watching failed lookup locations

* Add special file name lower conversion routine and use that instead of toLowerCase
Fixes #31819 and #35559

* Remove unicode from code

* Replace toLocaleLowerCase on filenames with ts.toFileNameLowerCase

* Make the intent of using toFileNameLowerCase more clear and why we make the restriction on turkish I with dot on top of it

* Update baselines for newly added tests in master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug A bug in TypeScript Fix Available A PR has been opened for this issue
Projects
None yet
6 participants