-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Switch config values to UTF8 #109418
Conversation
* Switch JitConfig strings from UTF16 to UTF8 * Switch relevant JIT-EE and host methods to UTF8
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
src/coreclr/vm/jithost.cpp
Outdated
{ | ||
WRAPPER_NO_CONTRACT; | ||
|
||
SString str(SString::Utf8, name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SString str(SString::Utf8, name); | |
StackSString str; | |
SString(SString::Utf8Literal, name).ConvertToUnicode(str); |
This is more efficient way to do the conversions (avoids heap allocations and unnecessary copies).
Hmm, this seems to break the ability to specify file paths with unicode characters on Windows. Probably can't switch to |
Actually it works fine with UTF8 enabled, this was operator error. |
PTAL @dotnet/jit-contrib @AaronRobinsonMSFT |
Do you mean that you have to have your Windows configured to |
I think so, but I haven't confirmed. I can try to confirm whether this is the case. |
I think it is ok for debug-only configs. If it is used for documented env vars in release builds, it is a compliance issue. All features have to handle globalization properly, even the obscure ones. |
I asked @EgorBo to test this since he has UTF8 disabled, but it appears that this fails much earlier on the VM side when you try to use non-ANSI paths: Assert failure(PID 864836 [0x000d3244], Thread: 864760 [0xd31f8]): SString::_stricmp(valueNoCache, temp.GetUTF8()) == 0
CORECLR! `anonymous namespace'::EnvGetString + 0x3B8 (0x00007ff8`747f40d8)
CORECLR! `anonymous namespace'::GetConfigString + 0xDB (0x00007ff8`747f46eb)
CORECLR! CLRConfig::GetConfigValue + 0xD1 (0x00007ff8`747f4951)
CORECLR! CLRConfig::GetConfigValue + 0xCF (0x00007ff8`747f4eef)
CORECLR! JitHost::getStringConfigValue + 0x8F (0x00007ff8`741ed90f)
CLRJIT! JitConfigValues::initialize + 0x145C (0x00007ff9`206e70fc)
CLRJIT! jitStartup + 0x7E (0x00007ff9`205d9d6e)
CORECLR! LoadAndInitializeJIT + 0xEE7 (0x00007ff8`7496b267)
CORECLR! EEJitManager::LoadJIT + 0x1C0 (0x00007ff8`7496b650)
CORECLR! UnsafeJitFunction + 0x20D (0x00007ff8`741fe3ed)
File: C:\prj\runtime-main\src\coreclr\utilcode\clrconfig.cpp:211
Image: C:\prj\runtime-main\artifacts\bin\coreclr\windows.x64.Checked\CoreRun.exe
ERROR:
Assert failure(PID 864836 [0x000d3244], Thread: 864760 [0xd31f8]): SString::_stricmp(valueNoCache, temp.GetUTF8()) == 0
CORECLR! `anonymous namespace'::EnvGetString + 0x3B8 (0x00007ff8`747f40d8)
CORECLR! `anonymous namespace'::GetConfigString + 0xDB (0x00007ff8`747f46eb)
CORECLR! CLRConfig::GetConfigValue + 0xD1 (0x00007ff8`747f4951)
CORECLR! CLRConfig::GetConfigValue + 0xCF (0x00007ff8`747f4eef)
CORECLR! JitHost::getStringConfigValue + 0x8F (0x00007ff8`741ed90f)
CLRJIT! JitConfigValues::initialize + 0x145C (0x00007ff9`206e70fc)
CLRJIT! jitStartup + 0x7E (0x00007ff9`205d9d6e)
CORECLR! LoadAndInitializeJIT + 0xEE7 (0x00007ff8`7496b267)
CORECLR! EEJitManager::LoadJIT + 0x1C0 (0x00007ff8`7496b650)
CORECLR! UnsafeJitFunction + 0x20D (0x00007ff8`741fe3ed)
File: C:\prj\runtime-main\src\coreclr\utilcode\clrconfig.cpp:211
Image: C:\prj\runtime-main\artifacts\bin\coreclr\windows.x64.Checked\CoreRun.exe (fails on main too) Edit: This appears to just be a buggy assert, in release builds main seems to work ok, and indeed this PR regresses it. |
This came from https://github.com/dotnet/runtime/pull/59513/files#diff-ca28f8b9cc75904774104e92e308462d908149b3e95288e37acd5b719bf6e0e9R211 cc @AaronRobinsonMSFT |
FILE* fopen_utf8(const char* path, const char* mode) | ||
{ | ||
#ifdef HOST_WINDOWS | ||
Utf16String<256> pathWide(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Paths on Windows can be longer than 256 characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a dynamic allocation inside Utf16String
for the longer case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAX_PATH
would be more appropriate here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really agree, this constant is just an optimization to avoid dynamic allocation in the common cases, it is not intended to have any relation to MAX_PATH
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really agree, this constant is just an optimization to avoid dynamic allocation in the common cases, it is not intended to have any relation to
MAX_PATH
.
The fact that 256 was choosen on Windows means it is precisely for a path and the common case. That is why the MAX_PATH
is what is it, 260, because it was the most common path on Windows and indicates as such. If this was 1024 because that was common, then sure, but someone chose 256 and that aligns with MAX_PATH
. The runtime is littered with MAX_PATH
precisely for the reason being made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose 256 because that is a 512 byte stack allocation, which was the largest size stack allocation I wanted to go with for the buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Given this is defined to only be on Windows and there is a macro with a predefined size for the "common" path on the platform, using MAX_PATH
in a variable called pathWide
seems like the most appropriate to me. If you think an explicit 256 is clearer then feel free to go with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the 256 simply because the value I picked here only coincidentally is close to MAX_PATH
and wasn't intended to have any semantic relevance to its value (it seems like an outdated define nowadays given that the max is much higher).
With help from @EgorBo we've verified that unicode paths now work again, even without UTF8 code page. @AaronRobinsonMSFT @jkotas can you please take another look? |
Ping @dotnet/jit-contrib for a review on the JIT changes |
} | ||
printf("Trying to unroll MemoryExtensions.Equals|SequenceEqual|StartsWith(op1, \"%s\")...\n", utf8Str); | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like too much effort to just dump a string in JitDump, I wouldn't bother. We should either remove it or move into some utility, I think it decreases code readability in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - I added a Compiler::convertUtf16ToUtf8ForPrinting
and switched this to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIT changes LGTM too, thanks!
src/coreclr/jit/utils.cpp
Outdated
@@ -1883,29 +1881,15 @@ AssemblyNamesList2::AssemblyNamesList2(const WCHAR* list, HostAllocator alloc) | |||
|
|||
AssemblyName* newName = new (m_alloc) AssemblyName(); | |||
|
|||
// Null out the current character so we can do zero-terminated string work; we'll restore it later. | |||
*listWalk = W('\0'); | |||
size_t nameLen = listWalk - nameStart; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This logic looks safe to me, though I'm a bit surprised that none of the native compilers are complaining about implicit cast from ptrdiff_t
to size_t
, since the former is signed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched it to ptrdiff_t
getJitTimeLogFilename
JIT-EE call, which seemingly existed to avoid having a config parser in the JIT in release builds, but that ship has sailedSubsumes #104805