Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce localized sorting 2 #40041

Merged

Conversation

jbytheway
Copy link
Contributor

Summary

SUMMARY: None

Purpose of change

Continuing the changes of #40012, to enforce more localized sorting.

Describe the solution

Expanded the clang-tidy check to catch calls to std::sort where the argument type is a string and the comparison is default. Suggest that they use localized_compare.

Fix the cases that caught:

  • NPC skill lists in a couple of places.
  • Character template list in the main menu.
  • Weapon lists in martial arts descriptions.

Describe alternatives you've considered

None.

Testing

Looked at some of the changed lists, but only in English mode where change is unlikely to occur.

Additional context

The next step is to check for sorting pairs or tuple where one element is a string, but that can be in a future PR.

@akirashirosawa
Copy link
Contributor

I wasn't able to find anyone else to test the changes from #39873, but I did test the Windows build under Wine, and it seemed to work as expected there, which is hopeful.

Noticed #40012 and tested #39873 for Russian on MacOS and it doesn't work at all :(
ru
As you can see, words starting with а are interrupted by words starting with other letters.

Tested #39873 for German on MacOS too, it (not?) works.
de

As far as I understand, the German alphabet contains the same Latin letters as English, except for a few special ones. I don't know German, but Kaja sorts before Käfer, so it works only for Latin characters, not for German special ones. And ä displays incorrectly, as you can see.

The Russian alphabet does not contain any Latin letters, only Cyrillic. Even if they look the same, they have their own code. So а != a (first one is Cyrillic, second Latin)
compare

I was solving a problem with comparing local strings here #33206. But, after this game a bit crash a on Windows… Because #33214:

MinGW64 on Windows does not support any locale except for the classic "C" locale

(I tested only on MacOS)

Here's how the strings are compared when searching something (for crafting and else):

auto &f = std::use_facet<std::ctype<wchar_t>>( std::locale() );
std::wstring wneedle = utf8_to_wstr( qry );
std::wstring whaystack = utf8_to_wstr( str );
f.tolower( &whaystack[0], &whaystack[0] + whaystack.size() );
f.tolower( &wneedle[0], &wneedle[0] + wneedle.size() );
whaystack.find( wneedle );

Can this be used for sorting (like utf8_to_wstr -> tolower(?) -> compare)? I don't really write on C ++ by the way. But I can test for Russian and on MacOS any solution.

@akirashirosawa
Copy link
Contributor

akirashirosawa commented May 1, 2020

@Night-Pryanik, you put +1 emoji, does sorting work for you (on Windows for Russian)?

@ScampsAdams
Copy link

Doesn't work on Windows for Russian. Order is different though. Build 10606
изображение

@Zireael07
Copy link
Contributor

yyyy.... in what universe does zh come after A but before a?!

@ScampsAdams
Copy link

Tested on mac and results are consistent with another mac above. Which probably means that there is some sorting.
My guess would be that current locale won't work, I tend to something like this:
#ifdef WIN32
setlocale( LC_ALL, "Russian" );
#else
setlocale( LC_ALL, "ru_RU.UTF-8" );
#endif

How to apply this to multi-language app, though?

@jbytheway
Copy link
Contributor Author

Thanks everyone for testing. The Russian MacOS results from @akirashirosawa are very puzzling; I would guess it's not even correctly interpreting the strings as UTF-8.

I notice that main doesn't even set the locale in the same way on Mac, so that's another potential source of problems.

Some of the changes from #33214 are strange; in particular this line

std::locale::global( std::locale() );

is a no-op, because std::locale() returns the current global locale and then is used to set the global locale...

@ScampsAdams The MSDN docs suggest that Windows setlocale can take similar locale strings to those used on Linux, except maybe with dashes rather than underscores.

UTF-8 support on Windows is tricky at best so indeed I wouldn't be surprised if it was easiest to convert to wstrings for comparison there.

I'll add some debug messages so we can hopefully see what the locale is actually set to in each of your various situations and then we can hopefully debug from there.

@akirashirosawa
Copy link
Contributor

is a no-op

Yes, it seems. But for me works

std::locale::global( std::locale( "ru_RU.UTF-8" ) );

w/o exception, because I18n case insensitive search works.

Here my #40054 i18n debug log:

-----------------------------------------
00:17:42.734 : Starting log.
00:17:42.735 INFO : Cataclysm DDA version 0.E-1409-gd72ae905cc-dirty
00:17:42.735 INFO : [main] C locale set to ru_RU.UTF-8
00:17:42.735 INFO : [main] C++ locale set to 
00:17:42.735 INFO : SDL version used during compile is 2.0.10
00:17:42.736 INFO : SDL version used during linking and in runtime is 2.0.10
00:17:42.945 INFO : Number of render drivers on your system: 4
00:17:42.945 INFO : Render driver: 0/metal
00:17:42.945 INFO : Render driver: 1/opengl
00:17:42.945 INFO : Render driver: 2/opengles2
00:17:42.945 INFO : Render driver: 3/software
00:17:42.958 INFO : [options] C locale set to ru_RU.UTF-8
00:17:42.958 INFO : [options] C++ locale set to ru_RU.UTF-8
00:17:43.374 INFO : Active renderer: 1/opengl
00:17:43.563 INFO : USE_COLOR_MODULATED_TEXTURES is set to 0
00:17:43.941 INFO : Language is set to: 'ru'
00:17:45.210 WARNING : opendir [/Users/akira/Library/Application Support/Cataclysm/mods/] failed with "No such file or directory".
00:18:34.876 INFO : [options] C locale set to de_DE.UTF-8  // <- switch langs manually on options
00:18:34.876 INFO : [options] C++ locale set to de_DE.UTF-8
00:18:34.876 INFO : Language is set to: 'de'
00:18:48.492 INFO : [options] C locale set to en_US.UTF-8 // <- switch langs manually on options
00:18:48.492 INFO : [options] C++ locale set to en_US.UTF-8
00:18:48.492 INFO : Language is set to: 'en'
00:18:59.020 INFO : [options] C locale set to en_US.UTF-8 // <- switch langs manually on options
00:18:59.020 INFO : [options] C++ locale set to en_US.UTF-8
00:18:59.020 INFO : Language is set to: 'ru_RU'  // <- switch lang to the "System language"
// locate switch didn't handle "System language" correctly, but it is a separate issue 
00:20:54.166 WARNING : opendir [/Users/akira/Library/Application Support/Cataclysm/save/World/mods] failed with "No such file or directory".
00:21:04.260 INFO : Loaded tileset: UNDEAD_PEOPLE
00:21:34.239 : Log shutdown.
-----------------------------------------

@jbytheway
Copy link
Contributor Author

@akirashirosawa Just to verify that I understand: that's your debug.log on MacOS, and it's with those settings that you observe the strange Russian sorting pictured in your first comment above?

@akirashirosawa
Copy link
Contributor

@akirashirosawa Just to verify that I understand: that's your debug.log on MacOS, and it's with those settings that you observe the strange Russian sorting pictured in your first comment above?

Yes. debug.log is fully consistent with the my comment above. Checked it now - all the same.
Render driver: 0/metal available only for MacOS :)

@jbytheway
Copy link
Contributor Author

Well, it looks like the locale is configured correctly, so I can only assume that the STL is just not sorting correctly on MacOS. I'll try making another PR with the convert-to-wide workaround and we can see if that works any better for you.

Besides that, I'm quite confused by the output from main in your log. In the code, on OSX, it only sets the C++ locale, not the C locale. And yet the log shows that the C locale was set, but not the C++ locale; exactly the opposite of what I would have expected!

It's all quite strange...

@jbytheway
Copy link
Contributor Author

@akirashirosawa one more thing: can you show the output from locale -a (I think that ought to work on OS X; it did back in #11976). Hopefully that will list your installed locales. I assume it will to include ru_RU.UTF-8 (else a bunch of other stuff would break) but it would be nice to double-check.

@akirashirosawa
Copy link
Contributor

akirashirosawa commented May 1, 2020

@jbytheway
locale

LANG="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_CTYPE="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_ALL=

locale -a

...
ru_RU.ISO8859-5
...
ru_RU.CP866
...
ru_RU.CP1251
...
ru_RU.UTF-8
...
ru_RU.KOI8-R
...
ru_RU
...
C
POSIX

For some reason list doesn't sorted like guy with OS X 10.10 (my version is MacOS 10.15), but it mostly the same.

UPD

#include <stdio.h>
#include <locale.h>
#include <langinfo.h>

int main() {
  setlocale(LC_ALL, "");
  printf("Reported: '%s'\n", nl_langinfo(CODESET));
}

Returns same UTF-8 if it is matter.

@jbytheway jbytheway mentioned this pull request May 2, 2020
@jbytheway
Copy link
Contributor Author

@ScampsAdams Experimental builds 10608 or newer should contain the new log messages I added. Could you please test again, and copy your debug.log here to assist with figuring out what's happening on Windows.

@jbytheway jbytheway force-pushed the enforce_localized_sorting_2 branch from 11de7d8 to 6e740f2 Compare May 2, 2020 13:20
@akirashirosawa
Copy link
Contributor

@jbytheway

so I can only assume that the STL is just not sorting correctly on MacOS.

I decided to test how localized_comparator works w/o Cataclysm.

TL;DR
STL sorting works correctly on MacOS. Probably we have some issue inside Cataclysm.

Test code:

#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <utility>
#include <algorithm>
#include <random>

struct localized_comparator {
    bool operator()( const std::string &, const std::string & ) const;
};

bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
    return std::locale()( l, r );
}
constexpr localized_comparator lc;

std::string testOne (std::vector<std::string> v) {
    std::random_device rd;
    std::mt19937 g(rd());

    std::shuffle(v.begin(), v.end(), g);
    std::sort(v.begin(), v.end(), lc);
 
    std::string str;
    std::for_each(v.begin(), v.end(), [&str](std::string &s){ str+=s; });  
    return str;
}

void testRussian () {
    std::vector<std::string> ru = {
        "а", "б", "в", "г", "д", "е", "ё", "ж",
        "з", "и", "й", "к", "л", "м", "н", "о",
        "п", "р", "с", "т", "у", "ф", "х", "ц",
        "ч", "ш", "щ", "ъ", "ы", "ь", "э", "ю", "я"
    };
    std::vector<std::string> RU = {
        "А", "Б", "В", "Г", "Д", "Е", "Ё", "Ж",
        "З", "И", "Й", "К", "Л", "М", "Н", "О",
        "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц",
        "Ч", "Ш", "Щ", "Ъ", "Ы", "Ь", "Э", "Ю", "Я"
    };
     std::vector<std::string> rU = {
        "а", "б", "в", "г", "д", "е", "ё", "ж",
        "З", "И", "Й", "К", "Л", "М", "Н", "О",
        "п", "р", "с", "т", "у", "ф", "х", "ц",
        "Ч", "Ш", "Щ", "Ъ", "Ы", "Ь", "Э", "Ю", "Я"
    };
    // expected results:
    std::string result_ru = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
    std::string result_RU = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
    std::string result_rU = "абвгдеёжЗИЙКЛМНОпрстуфхцЧШЩЪЫЬЭЮЯ";
    std::string result_ru_wrong_yo = "абвгдежзийклмнопрстуфхцчшщъыьэюяё";
    std::string result_RU_wrong_yo = "ЁАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
    std::string result_rU_case_sensitive = "ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдеёжпрстуфхц";
    std::string result_rU_case_sensitive_wrong_yo = "ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдежпрстуфхцё";

    std::string tmp_str;
    
    tmp_str = testOne(ru);

    if (tmp_str == result_ru) {
        std::cout << "lowercase Russian strings sorting successful" << "\n";
    } else if (tmp_str == result_ru_wrong_yo) {
        std::cout <<"lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
    } else {
        std::cout <<"lowercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }

    tmp_str = testOne(RU);
    
    if (tmp_str == result_RU) {
        std::cout << "uppercase Russian strings sorting successful" << "\n";
    } else if (tmp_str == result_RU_wrong_yo) {
        std::cout <<"uppercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
    } else {
        std::cout <<"uppercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }
    
    tmp_str = testOne(rU);

    if (tmp_str == result_rU) {
        std::cout << "mixedcase Russian strings sorting successful" << "\n";
    } else if (tmp_str == result_rU_case_sensitive) {
        std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first"<< "\n";
    } else if (tmp_str == result_rU_case_sensitive_wrong_yo) {
        std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first, and issue with \"ё\" (yo)"<< "\n";
    } else {
        std::cout <<"mixedcase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }
}

int main () {
    std::cout << "---------------------------------------------------"<< "\n";
    
    std::cout << "[default] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[default] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";
    
    std::locale::global( std::locale( "ru_RU.UTF-8" ) );
    std::cout << "[ru_RU.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[ru_RU.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

    std::locale::global( std::locale( "ru_RU" ) );
    std::cout << "[ru_RU] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[ru_RU] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

    std::locale::global( std::locale( "de_DE.UTF-8" ) );
    std::cout << "[de_DE.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[de_DE.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

    std::locale::global( std::locale( "ru_RU.CP1251" ) );
    std::cout << "[Windows-1251] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[Windows-1251] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

    std::locale::global( std::locale( "ru_RU.CP866" ) );
    std::cout << "[DOS-CP866] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[DOS-CP866] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

}

Output:

clang++ -std=c++11 test.cpp -o test && ./test
---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU] C locale set to ru_RU
[ru_RU] C++ locale set to ru_RU
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[Windows-1251] C locale set to ru_RU.CP1251
[Windows-1251] C++ locale set to ru_RU.CP1251
lowercase Russian strings sorting failed: 
лбжизаейвгдкмнопшётфцчхщырсуъьэюя
uppercase Russian strings sorting failed: 
РГДЫФБВЭЧЩЮШЬЦЖЗЛЙЕИЁАКМНОПСТУХЪЯ
mixedcase Russian strings sorting failed: 
ЫбЭЧжЩЮШЬаеЗЛЙИКМНОЪЯвгдпётфцхрсу
---------------------------------------------------
[DOS-CP866] C locale set to ru_RU.CP866
[DOS-CP866] C++ locale set to ru_RU.CP866
lowercase Russian strings sorting failed: 
рстуфхцчшщъыьэюяёгкпизлонмдежйабв
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting failed: 
рстуфхцёЗИЙКЛМНОЧШЩЪЫЬЭЮЯгпдежабв
---------------------------------------------------

From this we can conclude that sorting works on a MasOS correctly (with a small issue with "ё" letter, sadly, but it is the "special" letter). Correctly sorting with locales: C, ru_RU.UTF-8, ru_RU, de_DE.UTF-8 (and other *_*.UTF-8 locales), incorrectly with Windows and DOS locates ru_RU.CP1251, ru_RU.CP866.

I can only test on MacOS. I posted all code, so if anyone has the opportunity to test on Windows/Linux or for other languages, let's try if it helps.

@jbytheway
Copy link
Contributor Author

From this we can conclude that sorting works on a MasOS correctly (with a small issue with "ё" letter, sadly, but it is the "special" letter). Correctly sorting with locales: C, ru_RU.UTF-8, ru_RU, de_DE.UTF-8 (and other *_*.UTF-8 locales), incorrectly with Windows and DOS locates ru_RU.CP1251, ru_RU.CP866.

I disagree; I have the opposite conclusion. The fact that all the UTF-8 locales gave the same result as the C locale suggests that it's not working correctly; it's not properly handling UTF-8.

I ran the same test code on Linux (I deleted the cases for the locales I don't have installed here), and got the following results:

---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
lowercase Russian strings sorting successful
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
lowercase Russian strings sorting successful
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------

That's what I'd expect to see if MacOS was implementing the locales correctly.

Here's a similar program, but using wide strings; can you try this one?

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
#include <iterator>
#include <utility>
#include <algorithm>
#include <random>

struct localized_comparator {
    bool operator()( const std::wstring &, const std::wstring & ) const;
};

bool localized_comparator::operator()( const std::wstring &l, const std::wstring &r ) const {
    return std::locale()( l, r );
}
constexpr localized_comparator lc;

std::wstring testOne (std::vector<std::wstring> v) {
    std::random_device rd;
    std::mt19937 g(rd());

    std::shuffle(v.begin(), v.end(), g);
    std::sort(v.begin(), v.end(), lc);
 
    std::wstring str;
    std::for_each(v.begin(), v.end(), [&str](std::wstring &s){ str+=s; });  
    return str;
}

void testRussian () {
    std::vector<std::wstring> ru = {
        L"а", L"б", L"в", L"г", L"д", L"е", L"ё", L"ж",
        L"з", L"и", L"й", L"к", L"л", L"м", L"н", L"о",
        L"п", L"р", L"с", L"т", L"у", L"ф", L"х", L"ц",
        L"ч", L"ш", L"щ", L"ъ", L"ы", L"ь", L"э", L"ю", L"я"
    };
    std::vector<std::wstring> RU = {
        L"А", L"Б", L"В", L"Г", L"Д", L"Е", L"Ё", L"Ж",
        L"З", L"И", L"Й", L"К", L"Л", L"М", L"Н", L"О",
        L"П", L"Р", L"С", L"Т", L"У", L"Ф", L"Х", L"Ц",
        L"Ч", L"Ш", L"Щ", L"Ъ", L"Ы", L"Ь", L"Э", L"Ю", L"Я"
    };
     std::vector<std::wstring> rU = {
        L"а", L"б", L"в", L"г", L"д", L"е", L"ё", L"ж",
        L"З", L"И", L"Й", L"К", L"Л", L"М", L"Н", L"О",
        L"п", L"р", L"с", L"т", L"у", L"ф", L"х", L"ц",
        L"Ч", L"Ш", L"Щ", L"Ъ", L"Ы", L"Ь", L"Э", L"Ю", L"Я"
    };
    // expected results:
    std::wstring result_ru = L"абвгдеёжзийклмнопрстуфхцчшщъыьэюя";
    std::wstring result_RU = L"АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
    std::wstring result_rU = L"абвгдеёжЗИЙКЛМНОпрстуфхцЧШЩЪЫЬЭЮЯ";
    std::wstring result_ru_wrong_yo = L"абвгдежзийклмнопрстуфхцчшщъыьэюяё";
    std::wstring result_RU_wrong_yo = L"ЁАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ";
    std::wstring result_rU_case_sensitive = L"ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдеёжпрстуфхц";
    std::wstring result_rU_case_sensitive_wrong_yo = L"ЗИЙКЛМНОЧШЩЪЫЬЭЮЯабвгдежпрстуфхцё";

    std::wstring tmp_str;
    
    tmp_str = testOne(ru);

    if (tmp_str == result_ru) {
        std::wcout << "lowercase Russian strings sorting successful" << L"\n";
    } else if (tmp_str == result_ru_wrong_yo) {
        std::wcout << "lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
    } else {
        std::wcout << "lowercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }

    tmp_str = testOne(RU);
    
    if (tmp_str == result_RU) {
        std::cout << "uppercase Russian strings sorting successful" << "\n";
    } else if (tmp_str == result_RU_wrong_yo) {
        std::cout <<"uppercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
    } else {
        std::wcout <<"uppercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }
    
    tmp_str = testOne(rU);

    if (tmp_str == result_rU) {
        std::cout << "mixedcase Russian strings sorting successful" << "\n";
    } else if (tmp_str == result_rU_case_sensitive) {
        std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first"<< "\n";
    } else if (tmp_str == result_rU_case_sensitive_wrong_yo) {
        std::cout <<"mixedcase Russian strings sorting successful, but uppercase going first, and issue with \"ё\" (yo)"<< "\n";
    } else {
        std::wcout <<"mixedcase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
    }
}

int main () {
    std::cout << "---------------------------------------------------"<< "\n";
    
    std::cout << "[default] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[default] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";
    
    std::locale::global( std::locale( "ru_RU.UTF-8" ) );
    std::cout << "[ru_RU.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[ru_RU.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";

    std::locale::global( std::locale( "de_DE.UTF-8" ) );
    std::cout << "[de_DE.UTF-8] C locale set to " << setlocale( LC_ALL, nullptr )<< "\n";
    std::cout << "[de_DE.UTF-8] C++ locale set to " << std::locale().name()<< "\n";
    testRussian();
    std::cout << "---------------------------------------------------"<< "\n";
}

@Night-Pryanik
Copy link
Contributor

you put +1 emoji, does sorting work for you (on Windows for Russian)?

That's +1 for the idea. I didn't test the implementation.

@akirashirosawa
Copy link
Contributor

@jbytheway

Here's a similar program, but using wide strings; can you try this one?

I fixed a little (because "ё" did not print and ate other lines):

@@ -61,9 +61,9 @@
     tmp_str = testOne(ru);
 
     if (tmp_str == result_ru) {
-        std::wcout << "lowercase Russian strings sorting successful" << L"\n";
+        std::cout << "lowercase Russian strings sorting successful" << "\n";
     } else if (tmp_str == result_ru_wrong_yo) {
-        std::wcout << "lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
+        std::cout << "lowercase Russian strings sorting successful, but issue with \"ё\" (yo)"<< "\n";
     } else {
         std::wcout << "lowercase Russian strings sorting failed: " << "\n" << tmp_str << "\n";
     }

Output:

clang++ -std=c++11 testw.cpp -o testw && ./testw
---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------

I ran the same test code on Linux

It's perfect, no need tolower at all, and not exists yo issue.

I disagree; I have the opposite conclusion.

I agree that sorting does not work completely correctly on MacOS, but it works with some issues. It would be great if sort worked as perfectly as it did on Linux. On the other hand, here it doesn't work at all:

for Russian on MacOS and it doesn't work at all :(
ru

@ScampsAdams

Doesn't work on Windows for Russian. Order is different though. Build 10606
изображение

Can someone run tests for Windows? I can ask someone I knew to run tests on Windows, but I don’t know how to cross-compile code on my MacOS for Win.

@akirashirosawa
Copy link
Contributor

akirashirosawa commented May 3, 2020

@jbytheway

That's what I'd expect to see if MacOS was implementing the locales correctly.

If this is a MacOS specific problem, then I tried the MacOS specific solution, using CoreFoundation. Here's what get:

--- test.cpp
+++ testcf.cpp
@@ -6,12 +6,22 @@
 #include <algorithm>
 #include <random>
 
+#if defined(MACOSX)
+#include <CoreFoundation/CoreFoundation.h>
+#endif
+
 struct localized_comparator {
     bool operator()( const std::string &, const std::string & ) const;
 };
 
 bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
+    #if defined(MACOSX)
+        CFStringRef lr = CFStringCreateWithCString(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8);
+        CFStringRef rr = CFStringCreateWithCString(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8);
+        return CFStringCompare(lr, rr, kCFCompareLocalized) < 0;
+    #else
     return std::locale()( l, r );
+    #endif
 }
 constexpr localized_comparator lc;

Output:

clang++ -std=c++11 -DMACOSX -framework Foundation testcf.cpp -o testcf && ./testcf
---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
lowercase Russian strings sorting successful
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
lowercase Russian strings sorting successful
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
lowercase Russian strings sorting successful
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------

It works as perfectly as it does on Linux. I do not know how expensive CFStringCreateWithCString. In essence, this is copying a C string to a new container like std::string. I think if we can’t solve the issue with std::locale compare for MacOS, we can use CFStringCompare. The promlem with Windows is still open.

UPD

Maybe bast option is use CFStringCreateWithCStringNoCopy insted of CFStringCreateWithCString. It may be more memory efficient.

Unless the situation warrants otherwise, the created object does not copy the external buffer to internal storage but instead uses the buffer as its backing store. docs

--- testcf.cpp
+++ testcf_nocopy.cpp
@@ -16,8 +16,8 @@
 
 bool localized_comparator::operator()( const std::string &l, const std::string &r ) const {
     #if defined(MACOSX)
-        CFStringRef lr = CFStringCreateWithCString(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8);
-        CFStringRef rr = CFStringCreateWithCString(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8);
+        CFStringRef lr = CFStringCreateWithCStringNoCopy(kCFAllocatorDefault, l.c_str(), kCFStringEncodingUTF8, kCFAllocatorNull);
+        CFStringRef rr = CFStringCreateWithCStringNoCopy(kCFAllocatorDefault, r.c_str(), kCFStringEncodingUTF8, kCFAllocatorNull);
         return CFStringCompare(lr, rr, kCFCompareLocalized) < 0;
     #else
     return std::locale()( l, r );

Note: I use kCFAllocatorNull because std::string deallocate its char array (pointer to which returns string::c_str), them self.

UPD2
I created a PR #40100 with changes above for macOS.

@ScampsAdams
Copy link

Windows debug.log
Note locale 1252. Expected to see 1251.

20:21:09.384 : Starting log.
20:21:09.384 INFO : Cataclysm DDA version 0.E-1477-g8cea0fc
20:21:09.394 INFO : [main] C locale set to C
20:21:09.394 INFO : [main] C++ locale set to C
20:21:09.394 INFO : SDL version used during compile is 2.0.5
20:21:09.394 INFO : SDL version used during linking and in runtime is 2.0.5
20:21:09.448 INFO : Number of render drivers on your system: 4
20:21:09.448 INFO : Render driver: 0/direct3d
20:21:09.449 INFO : Render driver: 1/direct3d11
20:21:09.449 INFO : Render driver: 2/opengl
20:21:09.449 INFO : Render driver: 3/software
20:21:09.452 INFO : [options] C locale set to C
20:21:09.452 INFO : [options] C++ locale set to C
20:21:09.460 INFO : Active renderer: 0/direct3d
20:21:09.718 INFO : USE_COLOR_MODULATED_TEXTURES is set to 0
20:21:09.815 INFO : Language is set to: 'ru'
20:21:09.815 INFO : [translations] C locale set to Russian_Russia.1252
20:21:09.815 INFO : [translations] C++ locale set to C
20:21:10.238 WARNING : opendir [./mods/] failed with "No such file or directory".
20:21:20.251 WARNING : opendir [./save/Брундидж/mods] failed with "No such file or directory".
20:21:25.524 INFO : Loaded tileset: retrodays
20:24:19.734 : Log shutdown.

@ScampsAdams
Copy link

ScampsAdams commented May 3, 2020

Short string test.
Crashes at last two encodings without any message. Start of test below:
Totally NOT reproducible, strings are different after each run.

---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
lowercase Russian strings sorting successful, but issue with "ё" (yo)
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
lowercase Russian strings sorting failed:
къеврщчёншжьамюэписцбйтгядоузфхыл
uppercase Russian strings sorting failed:
ОИХАФЛЁНЗВЫШБМПУЖЦЯДТЩСЬЙЪЮГЭЕЧРК
mixedcase Russian strings sorting failed:
цжМапЬЭбНЫЛЗгсхрЯОКеЧфтЙШвИЮЪёудЩ
---------------------------------------------------
[ru_RU] C locale set to ru_RU
[ru_RU] C++ locale set to ru_RU
lowercase Russian strings sorting failed:
влнктяцуэйъщеюшычдмсоафгжьрибёзпх
uppercase Russian strings sorting failed:
ЪПОЗЙМЮЛБРИЁШЫФЧЖНТДГЩЕАХКЯЬСЦУВЭ
mixedcase Russian strings sorting failed:
ЭёКМЩИхцЪЛШНеудпЗфЧбЫЙсЬвжОгтраЮЯ
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
lowercase Russian strings sorting failed:
бтвзагъхйпдфжэцркшёылсмоечнюяуиьщ
uppercase Russian strings sorting failed:
ХШЧМИТЦЗЩЙСРЬОАГВУЪЫПЮКДЛЭНЖФБЁЯЕ
mixedcase Russian strings sorting failed:
ШКЪЬЯцЩвбёЗЛгеНжпурИсдЫЙтЮЧМафЭхО
---------------------------------------------------

@ScampsAdams
Copy link

ScampsAdams commented May 3, 2020

Wide string test:

---------------------------------------------------
[default] C locale set to C
[default] C++ locale set to C
uppercase Russian strings sorting successful, but issue with "ё" (yo)
mixedcase Russian strings sorting successful, but uppercase going first, and issue with "ё" (yo)
---------------------------------------------------
[ru_RU.UTF-8] C locale set to ru_RU.UTF-8
[ru_RU.UTF-8] C++ locale set to ru_RU.UTF-8
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------
[de_DE.UTF-8] C locale set to de_DE.UTF-8
[de_DE.UTF-8] C++ locale set to de_DE.UTF-8
uppercase Russian strings sorting successful
mixedcase Russian strings sorting successful
---------------------------------------------------

@akirashirosawa
Copy link
Contributor

akirashirosawa commented May 3, 2020

@ScampsAdams

Windows debug.log
Note locale 1252. Expected to see 1251.

Windows-1252 is a single-byte character encoding of the Latin alphabet. Windows-1251 for Russian (Cyrillic) characters.

20:21:09.452 INFO : [options] C locale set to C
20:21:09.452 INFO : [options] C++ locale set to C

But I expected to see ru_RU.UTF-8 in [options] message. It means that

// src/options.cpp:3157
std::locale::global( std::locale( "ru_RU.UTF-8" ) );

does not work for you. And works

// src/options.cpp:3163
catch( std::runtime_error &e ) {
    std::locale::global( std::locale() );
}

instead.

For Windows (_WIN32) sets Windows-1252 here:

// src/translates.cpp:205
#if defined(_WIN32)
    // Use the ANSI code page 1252 to work around some language output bugs.
    if( setlocale( LC_ALL, ".1252" ) == nullptr ) {
        DebugLog( D_WARNING, D_MAIN ) << "Error while setlocale(LC_ALL, '.1252').";
    }
    DebugLog( D_INFO, DC_ALL ) << "[translations] C locale set to " << setlocale( LC_ALL, nullptr );
    DebugLog( D_INFO, DC_ALL ) << "[translations] C++ locale set to " << std::locale().name();
#endif

Don't know why. May be need to setlocale( LC_ALL, ".1251" ) for Russian, (.1250 for Poland, .1252 for German, etc) and it is solve sorting problem.

Crashes at last two encodings without any message.

With DOS-866 and Windows-1251? It's funny, but these are just Windows encodings. Maybe they are simply called differently in macOS. Try to rename it in ".1251" and ".866" like on Cataclysm translates.cpp. If it is works - we can just set right Windows locate in src/translates.cpp in #if defined(_WIN32), if not, maybe using std::wstring compare for Windows.

@jbytheway
Copy link
Contributor Author

@akirashirosawa @ScampsAdams Thanks very much for the assistance in debugging. I agree that the CFStringRef solution seems reasonable for MacOS, and it looks like using wstring will work for Windows (honestly, that was what I was expecting to have to do for Windows all along. I'll revamp #40062 to do that for Windows only.

jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 4, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 4, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
@akirashirosawa
Copy link
Contributor

@jbytheway Maybe there is a chance that setlocale( LC_ALL, ".1251" ) will work for Windows?

jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 4, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
@jbytheway jbytheway force-pushed the enforce_localized_sorting_2 branch from 6e740f2 to 8ca7f94 Compare May 4, 2020 13:37
@jbytheway
Copy link
Contributor Author

@jbytheway Maybe there is a chance that setlocale( LC_ALL, ".1251" ) will work for Windows?

It shouldn't work. Our strings are UTF-8, so the comparison cannot possibly work correctly on a std::string unless the locale is a UTF-8 locale, and Windows is unlikely to be using one of those. I think it might be possible for us to set a UTF-8 locale on sufficiently recent versions of Windows and have it work OK, but the way we build Windows releases does not have good support for that kind of thing, based on my searches on the topic, so I don't really want to pursue that approach.

jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 4, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
@kevingranade kevingranade merged commit e635446 into CleverRaven:master May 4, 2020
jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 5, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 5, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
@jbytheway jbytheway deleted the enforce_localized_sorting_2 branch May 5, 2020 15:46
jbytheway added a commit to jbytheway/Cataclysm-DDA that referenced this pull request May 5, 2020
On Windows the existing solution for localized comparison seems to do
the wrong thing, but evidence suggests that comparison of wstring should
work.

Convert the strings before comparison on Windows.

At the same time, do the reverse conversion on MacOS (this won't
actually be used anywhere in the current code, but it seemse a good idea
to implement it while we had the experimental data to suggest it was
necessary.

See CleverRaven#40041 for more discussion.
@jbytheway
Copy link
Contributor Author

@ScampsAdams can you tell us how you compiled the two small test programs? i.e. what compiler did you use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants