Terminal should force pseudoconsole host into UTF-8 codepage by default #1802
Comments
I_am_okay_with_this.jpg |
Does cmd support batch scripts in codepage 65001 now? |
+1 for running *nix tools with CJK outputs For example: fc-list in texlive |
A work-around in the meanwhile is ... enable |
I'm on board with this, esp. if we add a |
Maybe we need to somehow enable localized messages in CMD while codepage is 65001. |
This changes the system code page to 65001; any ANSI applications you have will be forced to use UTF-8. |
Is there any plan to add an environment variable for "UTF-8 mode"? For example, Python has PYTHONUTF8. Git for Windows uses LANG. Some tools use the Is the ConsoleCP the best way? |
Note that the console host is still broken for non-ASCII input when the input codepage is set to UTF-8 (65001). Unfortunately, both the registry "CodePage" value and chcp.com set the input and output codepages to a single value, so an admin or user can't easily set just the output codepage to UTF-8. In particular, the calculation of |
@eryksun MS Pinyin IME works fine on CP65001 on ConHost though. |
Now (19H1+) there is a way to force the UTF-8 codepage in the application manifest (so your -A APIs will use UTF-8), although the console codepage is still separate from the application codepage. For the application codepage, use GetACP(). For the console, ConsoleCP should be fine. |
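To make the distinction between the application (ANSI) codepage and the console codepages concrete, here is a minimal sketch that queries all three through ctypes. The helper name `code_pages` is made up for illustration; `GetACP`, `GetConsoleCP` and `GetConsoleOutputCP` are the kernel32 calls. On non-Windows platforms the sketch just returns `None` values.

```python
import ctypes
import sys


def code_pages():
    """Return the ANSI codepage and the console input/output codepages.

    Hypothetical helper for illustration. Outside Windows there is no
    kernel32, so every value is None.
    """
    if sys.platform != "win32":
        return {"ansi": None, "console_input": None, "console_output": None}
    kernel32 = ctypes.WinDLL("kernel32")
    return {
        # Process-wide "ANSI" codepage used by the -A APIs.
        "ansi": kernel32.GetACP(),
        # Console codepages; both return 0 if no console is attached.
        "console_input": kernel32.GetConsoleCP(),
        "console_output": kernel32.GetConsoleOutputCP(),
    }


if __name__ == "__main__":
    print(code_pages())
```

The three values can differ from one another, which is the point made above: a UTF-8 manifest changes the first one only.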
I know it but it doesn't help me because:
What I want is an environment variable which indicates that the user wants to use UTF-8 in this session. Python has PYTHONUTF8, but it changes only Python. |
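For reference, this is roughly what the Python-only switch looks like in practice: a sketch that starts a child interpreter with `PYTHONUTF8` set and reads back `sys.flags.utf8_mode`. The function name `child_utf8_mode` is an illustrative assumption; the environment variable and the flag are the ones Python documents (3.7+).

```python
import os
import subprocess
import sys


def child_utf8_mode(enabled):
    """Run a child Python with PYTHONUTF8 set and report its utf8_mode flag."""
    env = dict(os.environ, PYTHONUTF8="1" if enabled else "0")
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("PYTHONUTF8=1 ->", child_utf8_mode(True))
    print("PYTHONUTF8=0 ->", child_utf8_mode(False))
```

Only the Python child reacts to the variable; any other tool started in the same session ignores it, which is the gap described above.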
This is not stable though. Windows Terminal starts OpenConsole, but wsl.exe starts another terminal. And when a Windows command is executed from WSL, a new conhost is started again. Its code page is the legacy one regardless of the code page first set in OpenConsole. We are mixing many conhost instances, and each has its own code page. When writing to and reading from a pipe, which encoding should be used? In WSL, Linux commands use UTF-8 in most cases because a UTF-8 locale is almost standard. That's why I want a standard environment variable for UTF-8 mode. Let's call it
When all modern developer tools support the |
@driver1998, I'm not familiar enough with the code to know the pathways taken in East-Asian locales. Maybe IME hacks the console service routine for |
This is something to be addressed by the team that's in charge of the NLS API. Locales address language and regional formatting rules. They really shouldn't concern text encodings, at least not in Windows NT, which was always a Unicode system. It shouldn't matter to the locale whether it's English or Hindi text -- except for the intersection with language rules (e.g. collation).
That said, most locales have legacy ANSI and OEM codepages because this was necessary prior to Unicode back in the 80s, and the NLS team needed to support non-Unicode applications, especially on Windows 9x systems that had hardly any Unicode support. But not all locales have legacy ANSI/OEM codepages. Some locales are Unicode only, i.e. their ANSI codepage is
>>> GetLocaleInfoEx('hi-IN', LOCALE_IDEFAULTANSICODEPAGE, cp, 8)
2
>>> cp[:1]
'0'
Nowadays the Universal C Runtime (ucrt) supports using UTF-8 (
>>> setlocale(LC_ALL, 'en_UK.utf8')
'en_UK.utf8'
>>> ucrt.___lc_codepage_func()
65001
>>> setlocale(LC_ALL, 'hi_IN')
'hi-IN'
>>> ucrt.___lc_codepage_func()
65001
The NLS team could provide a setting for the per-user locale that overrides the active codepage ( Applications that query either the active codepage via |
The problem is, there is no way to opt in to UTF-8 mode in the console on an app-by-app basis. Apps can output either UTF-16 (which does not care about ConsoleCP) or "ANSI" (which has to match ConsoleCP regardless of the application codepage, and UTF-8 is one of those codepages). Well, maybe WriteConsoleOutputA can output UTF-8 on a CP936 console, but I don't think printf can. Maybe make WSL start conhost in the UTF-8 codepage when WINUTF8=1? |
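A small sketch of why the mismatch matters: the same UTF-8 bytes decoded under CP936 (as a console set to that codepage would interpret them) versus decoded as UTF-8. This only simulates the interpretation step with Python's codecs and does not touch any console API.

```python
# "中文" written with escapes so the example does not depend on file encoding.
text = "\u4e2d\u6587"

# What a UTF-8 program writes as "ANSI" bytes.
utf8_bytes = text.encode("utf-8")

# What a reader sees if it assumes the bytes are CP936 (GBK).
garbled = utf8_bytes.decode("cp936", errors="replace")

# What a reader sees if it assumes UTF-8.
restored = utf8_bytes.decode("utf-8")

# ascii() keeps the output printable on any console codepage.
print("as cp936:", ascii(garbled))
print("as utf-8:", ascii(restored))
```

The bytes are identical in both cases; only the codepage assumed by the reader changes the result.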
The console's multibyte-string functions such as The default codepage is the active OEM codepage of the conhost.exe process. In principle this should be configurable as the "CodePage" value in "HKCU\Console", but that still doesn't work as intended. This value only works in the per-window subkey settings. (For convenience the codepage value should be made configurable in the property dialog, and also in the shell-link property dialog.) Anyway, if the system locale is set to UTF-8, the active OEM codepage is UTF-8, and the console follows suit. If the NLS team implemented the extension for the user locale that I suggested above, the result would be similar given the user enables UTF-8. Then all that's needed is for the console to finally support reading non-ASCII text from the input buffer as UTF-8 -- at least 24 years after UTF-8 was introduced. Note that since the console is a shared I/O resource, any one process attached to it doesn't get to dictate its global input and output codepages. Any program that attaches to the console at any time can change these values, or even library code in your process might sneak in a change. Thus in principle the values can't be relied on as constants, but many programs check once and assume the values are constant. For example, classic Python prior to 3.6 works like this when setting the encoding of the |
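Since the console codepages can change underneath a running program, one option is to query at the point of use rather than caching the value at startup. A minimal sketch under that assumption; `current_console_encoding` is a made-up helper name, and the fallback to `sys.stdout.encoding` when no console is available is a choice made for this example only.

```python
import ctypes
import sys


def current_console_encoding():
    """Return a codec name for the console output codepage, queried right now.

    Hypothetical helper: it does not cache, because any process attached
    to the console may call SetConsoleOutputCP at any time.
    """
    if sys.platform == "win32":
        cp = ctypes.WinDLL("kernel32").GetConsoleOutputCP()
        if cp:  # 0 means no console is attached
            return "cp%d" % cp
    # Assumption for this sketch: fall back to the stdout encoding.
    return getattr(sys.stdout, "encoding", None) or "utf-8"


if __name__ == "__main__":
    print(current_console_encoding())
```

This narrows the window for a stale value but does not remove the race: another process can still change the codepage between the query and the write.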
@BSG-75 Nope, when there is an update to this, we'll make sure to post in this thread 😉 |
I still want to know how many real-world console apps expect legacy codepages (especially ones that output legacy-codepage strings such as GBK), and whether it is a good idea to break them, because that's why system-wide UTF-8 is still labeled as beta. Given that many modern Windows command-line apps are ported from the *nix world, and the modern principle seems to be that command-line apps should use English, I guess it is acceptable? |
A system-wide setting breaks legacy applications. That's why we need a per-session option instead.
Some modern CLI applications (Python and Go) use WriteConsoleW to write to the console, but Python still uses the legacy encoding for pipes. To use such applications, we want a UTF-8 session in the VSCode terminal and Windows Terminal. |
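The pipe case can be sketched as follows: the parent starts a child Python with `PYTHONUTF8=1`, captures the raw bytes from the pipe, and decodes them as UTF-8. The function name `read_child_output_utf8` is illustrative; removing `PYTHONIOENCODING` from the child environment is a precaution taken in this example so that it cannot override the stdio encoding.

```python
import os
import subprocess
import sys


def read_child_output_utf8():
    """Capture a child's piped stdout as bytes and decode it as UTF-8."""
    env = dict(os.environ, PYTHONUTF8="1")
    env.pop("PYTHONIOENCODING", None)  # avoid overriding the stdio encoding
    proc = subprocess.run(
        # The child prints two CJK characters (U+65E5 U+672C).
        [sys.executable, "-c", "print('\\u65e5\\u672c')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return proc.stdout.decode("utf-8").strip()


if __name__ == "__main__":
    print(ascii(read_child_output_utf8()))
```

Without the environment variable, the child would encode piped output with the legacy codepage on Windows, and the parent would have to guess which one.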
Since std input works well with UTF-8 encoding today (#14745), maybe it's worth adding some syntactic sugar to push/pop the initial state of the system code page for new Windows console applications? Wrap it up in a single API call or put it into a specific header file (e.g. <iostream>) to reduce the following boilerplate code at the beginning:
#define UTF8_EVERYWHERE
#include <iostream>
#include <string>
// Put this block inside <iostream> on windows?
#ifdef _WIN32
#ifdef UTF8_EVERYWHERE
#include <cstdlib>   // ::atexit
#include <windows.h>
namespace winapi_cp_state
{
static UINT ou_state = GetConsoleOutputCP(); // Save the original console output code page.
static UINT in_state = GetConsoleCP();       // Save the original console input code page.
static void set_page(UINT out, UINT in) { SetConsoleOutputCP(out); SetConsoleCP(in); }
static void set_page() { set_page(ou_state, in_state); } // Restore the saved code pages.
static int _state = (set_page(CP_UTF8, CP_UTF8), ::atexit(set_page)); // Switch to UTF-8 and register the restore to run at exit.
}
#endif
#endif
// x-platform code
int main()
{
std::cout << "Test: あああ🙂🙂🙂日本👌中文👍Кириллица\n"; // Make sure you save your project file with 65001(UTF-8) encoding.
std::cout << "Enter text: ";
std::string utf8;
std::cin >> utf8;
std::cout << "UTF-8 text: " << utf8 << std::endl;
return 0;
}
The |
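The same push/pop idea can be sketched in Python as a context manager. This is only an illustration of the save, switch, restore pattern, not a proposed API: `utf8_console` is a hypothetical name, and the code is a no-op outside Windows or when no console is attached.

```python
import contextlib
import ctypes
import sys

CP_UTF8 = 65001


@contextlib.contextmanager
def utf8_console():
    """Switch the console to UTF-8 for the block, then restore it.

    Yields True if the codepages were changed, False otherwise.
    """
    if sys.platform != "win32":
        yield False
        return
    kernel32 = ctypes.WinDLL("kernel32")
    old_in = kernel32.GetConsoleCP()
    old_out = kernel32.GetConsoleOutputCP()
    if old_in == 0 or old_out == 0:  # no console attached
        yield False
        return
    kernel32.SetConsoleCP(CP_UTF8)
    kernel32.SetConsoleOutputCP(CP_UTF8)
    try:
        yield True
    finally:
        # Restore whatever was active before entering the block.
        kernel32.SetConsoleCP(old_in)
        kernel32.SetConsoleOutputCP(old_out)


if __name__ == "__main__":
    with utf8_console() as changed:
        print("console switched to UTF-8:", changed)
```

As with the C++ version, the restore step is best effort: the console is shared, so another attached process may have changed the codepages in the meantime.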
xref some discussion in #15504 pretty sure our plan was to do:
Now it's just a matter of plumbing, and deciding if we really want to do the More notes
|
It's 2019, after all. Maybe we should introduce a flag that starts up the pseudoconsole host in codepage 65001 so that we make good on our promise of "emoji just work and everything else works like it should too," and use WT as a real opportunity to push the boundaries here.
maintainer note, Aug 2023
Also, we want to take into account arbitrary codepages, ala #15678