Enforce localized sorting 2 #40041
Noticed #40012 and tested #39873 for Russian on MacOS, and it doesn't work at all :( I tested #39873 for German on MacOS too, and it (not?) works. As far as I understand, the German alphabet contains the same Latin letters as English, plus a few special ones. I don't know German, but the Russian alphabet does not contain any Latin letters, only Cyrillic. Even if the letters look the same, they have their own code points. I solved a problem with comparing localized strings in #33206. But after that the game crashed a bit on Windows… Hence #33214:
(I tested only on MacOS) Here's how the strings are compared when searching for something (for crafting and elsewhere):
Can this be used for sorting (like |
@Night-Pryanik, you added a +1 emoji; does sorting work for you (on Windows, for Russian)? |
yyyy.... in what universe does zh come after A but before a?! |
Tested on a Mac, and the results are consistent with the other Mac above, which probably means that there is some sorting. How do we apply this to a multi-language app, though? |
Thanks everyone for testing. The Russian MacOS results from @akirashirosawa are very puzzling; I would guess it's not even correctly interpreting the strings as UTF-8. I notice that some of the changes from #33214 are strange; in particular this line
is a no-op, because @ScampsAdams The MSDN docs suggest that UTF-8 support on Windows is tricky at best, so indeed I wouldn't be surprised if it was easiest to convert to I'll add some debug messages so we can hopefully see what the locale is actually set to in each of your various situations, and then we can hopefully debug from there. |
Yes, it seems so. But for me std::locale::global( std::locale( "ru_RU.UTF-8" ) ); works w/o exception, because Here is my #40054 i18n debug log: -----------------------------------------
00:17:42.734 : Starting log.
00:17:42.735 INFO : Cataclysm DDA version 0.E-1409-gd72ae905cc-dirty
00:17:42.735 INFO : [main] C locale set to ru_RU.UTF-8
00:17:42.735 INFO : [main] C++ locale set to
00:17:42.735 INFO : SDL version used during compile is 2.0.10
00:17:42.736 INFO : SDL version used during linking and in runtime is 2.0.10
00:17:42.945 INFO : Number of render drivers on your system: 4
00:17:42.945 INFO : Render driver: 0/metal
00:17:42.945 INFO : Render driver: 1/opengl
00:17:42.945 INFO : Render driver: 2/opengles2
00:17:42.945 INFO : Render driver: 3/software
00:17:42.958 INFO : [options] C locale set to ru_RU.UTF-8
00:17:42.958 INFO : [options] C++ locale set to ru_RU.UTF-8
00:17:43.374 INFO : Active renderer: 1/opengl
00:17:43.563 INFO : USE_COLOR_MODULATED_TEXTURES is set to 0
00:17:43.941 INFO : Language is set to: 'ru'
00:17:45.210 WARNING : opendir [/Users/akira/Library/Application Support/Cataclysm/mods/] failed with "No such file or directory".
00:18:34.876 INFO : [options] C locale set to de_DE.UTF-8 // <- switch langs manually on options
00:18:34.876 INFO : [options] C++ locale set to de_DE.UTF-8
00:18:34.876 INFO : Language is set to: 'de'
00:18:48.492 INFO : [options] C locale set to en_US.UTF-8 // <- switch langs manually on options
00:18:48.492 INFO : [options] C++ locale set to en_US.UTF-8
00:18:48.492 INFO : Language is set to: 'en'
00:18:59.020 INFO : [options] C locale set to en_US.UTF-8 // <- switch langs manually on options
00:18:59.020 INFO : [options] C++ locale set to en_US.UTF-8
00:18:59.020 INFO : Language is set to: 'ru_RU' // <- switch lang to the "System language"
// locale switch didn't handle "System language" correctly, but that is a separate issue
00:20:54.166 WARNING : opendir [/Users/akira/Library/Application Support/Cataclysm/save/World/mods] failed with "No such file or directory".
00:21:04.260 INFO : Loaded tileset: UNDEAD_PEOPLE
00:21:34.239 : Log shutdown.
----------------------------------------- |
@akirashirosawa Just to verify that I understand: that's your debug.log on MacOS, and it's with those settings that you observe the strange Russian sorting pictured in your first comment above? |
Yes. |
Well, it looks like the locale is configured correctly, so I can only assume that the STL is just not sorting correctly on MacOS. I'll try making another PR with the convert-to-wide workaround and we can see if that works any better for you. Besides that, I'm quite confused by the output from It's all quite strange... |
@akirashirosawa one more thing: can you show the output from |
@jbytheway LANG="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_CTYPE="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_ALL=
...
ru_RU.ISO8859-5
...
ru_RU.CP866
...
ru_RU.CP1251
...
ru_RU.UTF-8
...
ru_RU.KOI8-R
...
ru_RU
...
C
POSIX
For some reason the list isn't sorted like the one from the guy with OS X 10.10 (my version is MacOS 10.15), but it is mostly the same.
UPD
#include <stdio.h>
#include <locale.h>
#include <langinfo.h>
int main() {
    setlocale(LC_ALL, "");
    printf("Reported: '%s'\n", nl_langinfo(CODESET));
}
Returns same |
@ScampsAdams Experimental builds 10608 or newer should contain the new log messages I added. Could you please test again, and copy your |
I decided to test how TL;DR Test code:
#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <utility>
#include <algorithm>
#include <random>
#include <clocale> // setlocale
#include <locale> // std::locale
struct localized_comparator {
bool operator()( const std::string &, const std::string & ) const;
};
bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
return std::locale()( l, r );
}
constexpr localized_comparator lc;
std::string testOne (std::vector<std::string> v) {
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(v.begin(), v.end(), g);
std::sort(v.begin(), v.end(), lc);
std::string str;
std::for_each(v.begin(), v.end(), [&str](std::string &s){ str+=s; });
return str;
}
void testRussian () {
std::vector<std::string> ru = {
"а", "б", "в", "г", "д", "е", "ё", "ж",
"з", "и", "й", "к", "л", "м", "н", "о",
"п", "р", "с", "т", "у", "ф", "х", "ц",
"ч", "ш", "щ", "ъ", "ы", "ь", "э", "ю", "я"
};
std::vector<std::string> RU = {
"А", "Б", "В", "Г", "Д", "Е", "Ё", "Ж",
"З", "И", "Й", "К", "Л", "М", "Н", "О",
"П", "Р", "С", "Т", "У", "Ф", "Х", "Ц",
"Ч", "Ш", "Щ", "Ъ", "Ы", "Ь", "Э", "Ю", "Я"
};
std::vector<std::string> rU = {
"а", "б", "в", "г", "д", "е", "ё", "ж",
"З", "И", "Й", "К", "Л", "М", "Н", "О",
"п", "р", "с", "т", "у", "ф", "х", "ц",
"Ч", "Ш", "Щ", "Ъ", "Ы", "Ь", "Э", "Ю", "Я"
};
// expected results:
std::string result_ru = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
std::string result_RU = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
std::string result_rU = "абвгдеёжЗИЙКЛМНОпрстуфхцЧШЩЪЫЬЭЮЯ";
std::string result_ru_wrong_yo = "абвгдежзийклмнопрстуфхцчшщъыьэюяё";
std::string result_RU_wrong_yo = "ЁАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
std::string result_rU_case_sensitive = "ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдеёжпрстуфхц";
std::string result_rU_case_sensitive_wrong_yo = "ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдежпрстуфхцё";
std::string tmp_str;
tmp_str = testOne(ru);
if (tmp_str == result_ru) {
std::cout << "lowercase Russian strings sorting successful" << "\n";
} else if (tmp_str == result_ru_wrong_yo) {
std::cout <<"lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
} else {
std::cout <<"lowercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
tmp_str = testOne(RU);
if (tmp_str == result_RU) {
std::cout << "uppercase Russian strings sorting successful" << "\n";
} else if (tmp_str == result_RU_wrong_yo) {
std::cout <<"uppercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
} else {
std::cout <<"uppercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
tmp_str = testOne(rU);
if (tmp_str == result_rU) {
std::cout << "mixedcase Russian strings sorting successful" << "\n";
} else if (tmp_str == result_rU_case_sensitive) {
std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first"<< "\n";
} else if (tmp_str == result_rU_case_sensitive_wrong_yo) {
std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first, and issue with \"ё\" (yo)"<< "\n";
} else {
std::cout <<"mixedcase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
}
int main () {
std::cout << "---------------------------------------------------"<< "\n";
std::cout << "[default] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[default] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "ru_RU.UTF-8" ) );
std::cout << "[ru_RU.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[ru_RU.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "ru_RU" ) );
std::cout << "[ru_RU] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[ru_RU] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "de_DE.UTF-8" ) );
std::cout << "[de_DE.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[de_DE.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "ru_RU.CP1251" ) );
std::cout << "[Windows-1251] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[Windows-1251] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "ru_RU.CP866" ) );
std::cout << "[DOS-CP866] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[DOS-CP866] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
}
Output:
From this we can conclude that sorting works correctly on MacOS (with a small issue with the letter "ё", sadly, but it is the "special" letter). Correctly sorting with locales: I can only test on MacOS. I posted all the code, so if anyone has the opportunity to test on Windows/Linux or for other languages, let's see if it helps. |
I disagree; I have the opposite conclusion. The fact that all the UTF-8 locales gave the same result as the C locale suggests that it's not working correctly; it's not properly handling UTF-8. I ran the same test code on Linux (I deleted the cases for the locales I don't have installed here), and got the following results:
That's what I'd expect to see if MacOS was implementing the locales correctly. Here's a similar program, but using wide strings; can you try this one?
#include <clocale>
#include <cstdlib>
#include <iostream>
#include <locale>
#include <string>
#include <vector>
#include <iterator>
#include <utility>
#include <algorithm>
#include <random>
struct localized_comparator {
bool operator()( const std::wstring &, const std::wstring & ) const;
};
bool localized_comparator::operator()( const std::wstring &l, const std::wstring &r ) const {
return std::locale()( l, r );
}
constexpr localized_comparator lc;
std::wstring testOne (std::vector<std::wstring> v) {
std::random_device rd;
std::mt19937 g(rd());
std::shuffle(v.begin(), v.end(), g);
std::sort(v.begin(), v.end(), lc);
std::wstring str;
std::for_each(v.begin(), v.end(), [&str](std::wstring &s){ str+=s; });
return str;
}
void testRussian () {
std::vector<std::wstring> ru = {
L"а", L"б", L"в", L"г", L"д", L"е", L"ё", L"ж",
L"з", L"и", L"й", L"к", L"л", L"м", L"н", L"о",
L"п", L"р", L"с", L"т", L"у", L"ф", L"х", L"ц",
L"ч", L"ш", L"щ", L"ъ", L"ы", L"ь", L"э", L"ю", L"я"
};
std::vector<std::wstring> RU = {
L"А", L"Б", L"В", L"Г", L"Д", L"Е", L"Ё", L"Ж",
L"З", L"И", L"Й", L"К", L"Л", L"М", L"Н", L"О",
L"П", L"Р", L"С", L"Т", L"У", L"Ф", L"Х", L"Ц",
L"Ч", L"Ш", L"Щ", L"Ъ", L"Ы", L"Ь", L"Э", L"Ю", L"Я"
};
std::vector<std::wstring> rU = {
L"а", L"б", L"в", L"г", L"д", L"е", L"ё", L"ж",
L"З", L"И", L"Й", L"К", L"Л", L"М", L"Н", L"О",
L"п", L"р", L"с", L"т", L"у", L"ф", L"х", L"ц",
L"Ч", L"Ш", L"Щ", L"Ъ", L"Ы", L"Ь", L"Э", L"Ю", L"Я"
};
// expected results:
std::wstring result_ru = L"абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
std::wstring result_RU = L"АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
std::wstring result_rU = L"абвгдеёжЗИЙКЛМНОпрстуфхцЧШЩЪЫЬЭЮЯ";
std::wstring result_ru_wrong_yo = L"абвгдежзийклмнопрстуфхцчшщъыьэюяё";
std::wstring result_RU_wrong_yo = L"ЁАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
std::wstring result_rU_case_sensitive = L"ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдеёжпрстуфхц";
std::wstring result_rU_case_sensitive_wrong_yo = L"ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдежпрстуфхцё";
std::wstring tmp_str;
tmp_str = testOne(ru);
if (tmp_str == result_ru) {
std::wcout << "lowercase Russian strings sorting successful" << L"\n";
} else if (tmp_str == result_ru_wrong_yo) {
std::wcout << "lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
} else {
std::wcout << "lowercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
tmp_str = testOne(RU);
if (tmp_str == result_RU) {
std::cout << "uppercase Russian strings sorting successful" << "\n";
} else if (tmp_str == result_RU_wrong_yo) {
std::cout <<"uppercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
} else {
std::wcout <<"uppercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
tmp_str = testOne(rU);
if (tmp_str == result_rU) {
std::cout << "mixedcase Russian strings sorting successful" << "\n";
} else if (tmp_str == result_rU_case_sensitive) {
std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first"<< "\n";
} else if (tmp_str == result_rU_case_sensitive_wrong_yo) {
std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first, and issue with \"ё\" (yo)"<< "\n";
} else {
std::wcout <<"mixedcase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
}
}
int main () {
std::cout << "---------------------------------------------------"<< "\n";
std::cout << "[default] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[default] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "ru_RU.UTF-8" ) );
std::cout << "[ru_RU.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[ru_RU.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
std::locale::global( std::locale( "de_DE.UTF-8" ) );
std::cout << "[de_DE.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
std::cout << "[de_DE.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
testRussian();
std::cout << "---------------------------------------------------"<< "\n";
} |
That's +1 for the idea. I didn't test the implementation. |
If this is a MacOS-specific problem, then I tried a MacOS-specific solution, using
--- test.cpp
+++ testcf.cpp
@@ -6,12 +6,22 @@
#include <algorithm>
#include <random>
+#if defined(MACOSX)
+#include <CoreFoundation/CoreFoundation.h>
+#endif
+
struct localized_comparator {
bool operator()( const std::string &, const std::string & ) const;
};
bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
+ #if defined(MACOSX)
+ CFStringRef lr = CFStringCreateWithCString(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8);
+ CFStringRef rr = CFStringCreateWithCString(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8);
+ return CFStringCompare(lr, rr, kCFCompareLocalized) < 0;
+ #else
return std::locale()( l, r );
+ #endif
}
 constexpr localized_comparator lc;
Output:
It works as perfectly as it does on Linux. I do not know how expensive UPD Maybe the best option is to use
--- testcf.cpp
+++ testcf_nocopy.cpp
@@ -16,8 +16,8 @@
bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
#if defined(MACOSX)
- CFStringRef lr = CFStringCreateWithCString(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8);
- CFStringRef rr = CFStringCreateWithCString(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8);
+ CFStringRef lr = CFStringCreateWithCStringNoCopy(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8, kCFAllocatorNull);
+ CFStringRef rr = CFStringCreateWithCStringNoCopy(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8, kCFAllocatorNull);
return CFStringCompare(lr, rr, kCFCompareLocalized) < 0;
#else
 return std::locale()( l, r );
Note: I use UPD2 |
Windows debug.log 20:21:09.384 : Starting log. |
Short string test.
|
Wide string test:
|
But I expected to see
// src/options.cpp:3157
std::locale::global( std::locale( "ru_RU.UTF-8" ) );
does not work for you. And this works instead:
// src/options.cpp:3163
catch( std::runtime_error &e ) {
    std::locale::global( std::locale() );
}
For Windows (
// src/translates.cpp:205
#if defined(_WIN32)
    // Use the ANSI code page 1252 to work around some language output bugs.
    if( setlocale( LC_ALL, ".1252" ) == nullptr ) {
        DebugLog( D_WARNING, D_MAIN ) << "Error while setlocale(LC_ALL, '.1252').";
    }
    DebugLog( D_INFO, DC_ALL ) << "[translations] C locale set to " << setlocale( LC_ALL, nullptr );
    DebugLog( D_INFO, DC_ALL ) << "[translations] C++ locale set to " << std::locale().name();
#endif
Don't know why. Maybe we need to
With |
@akirashirosawa @ScampsAdams Thanks very much for the assistance in debugging. I agree that the |
On Windows the existing solution for localized comparison seems to do the wrong thing, but evidence suggests that comparison of wstring should work. Convert the strings before comparison on Windows. At the same time, do the reverse conversion on MacOS (this won't actually be used anywhere in the current code, but it seemed a good idea to implement it while we had the experimental data to suggest it was necessary). See CleverRaven#40041 for more discussion.
@jbytheway Maybe there is a chance that |
Look for calls which sort strings in a non-localized manner.
These are the ones that were caught by the recently improved check.
It shouldn't work. Our strings are UTF-8, so the comparison cannot possibly work correctly on a |
@ScampsAdams can you tell us how you compiled the two small test programs? i.e. what compiler did you use? |
Summary
SUMMARY: None
Purpose of change
Continuing the changes of #40012, to enforce more localized sorting.
Describe the solution
Expanded the clang-tidy check to catch calls to std::sort where the argument type is a string and the comparison is default. Suggest that they use localized_compare.
Fix the cases that it caught:
Describe alternatives you've considered
None.
Testing
Looked at some of the changed lists, but only in English mode where change is unlikely to occur.
Additional context
The next step is to check for sorting of pairs or tuples where one element is a string, but that can be in a future PR.