Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accented characters are always displayed in lowercase #724

Open
Jaffacakelover opened this issue Jan 11, 2021 · 18 comments
Open

Accented characters are always displayed in lowercase #724

Jaffacakelover opened this issue Jan 11, 2021 · 18 comments

Comments

@Jaffacakelover
Copy link

The content scraper (and any rom lists for systems that have been scraped) displays accented characters in lower case, while everything else is upper case.
Most noticeable with Pokémon games. For example:
IMG_20201031_221610_cropped

The font that RetroPie uses (Open Sans Condensed) does include upper case accented characters: For example, É is U+00C9 in unicode.

@Gemba
Copy link

Gemba commented Feb 7, 2021

As you are right, @Jaffacakelover the font does provide this character (and also the "Cabin Bold" (which is used in the game list for Carbon Theme).

However, root cause is that how the uppercase function is implemented in ES.
The i18n fork of ES uses Boost Lib to tackle this, but this ES does not use Boost any longer. So I propose a "boost-less" change:

#include <iostream>
#include <locale>
#include <codecvt>

// yarked from current master:
// https://github.com/RetroPie/EmulationStation/blob/771f2a608b84b5262666af7888f377a5b3f280ea/es-core/src/utils/StringUtil.cpp#L163-L172
std::string toUpper(const std::string& _string)
{
        std::string string;

        for(size_t i = 0; i < _string.length(); ++i)
                string += (char)toupper(_string[i]);

        return string;

}

// proposed change/fix for #724
std::string toUpperNew(const std::string& _string)
{
        std::wstring w_str = std::wstring_convert<std::codecvt_utf8<wchar_t> >().from_bytes(_string);
        auto& fct = std::use_facet<std::ctype<wchar_t> >(std::locale());
        fct.toupper(&w_str[0], &w_str[0] + w_str.size());
        return std::wstring_convert<std::codecvt_utf8<wchar_t> >().to_bytes(w_str);
}

int main() {
        std::locale::global(std::locale(""));

        std::string str = "Déjà Vu: A Nightmare Come True";

        std::cout << "\nInput string    : " << str << std::endl;
        std::cout << "\nUpper (current) : " << toUpper(str) << std::endl;
        std::cout << "\nUpper (proposed): " << toUpperNew(str) << std::endl;

        // enable for profiling
        /*
        std::string tmp;
        for (int i = 0 ; i < 20 * 1000; i++) {
            //tmp = toUpper(str);
            tmp = toUpperNew(str);
        }
        */

        return 0;
}

(used: g++-8 uppercase_wstring.cpp -o uppercase_wstring)

Will output with locale en_GB.UTF-8 on RetroPie:

./uppercase_wstring

Input string    : Déjà Vu: A Nightmare Come True

Upper (current) : DéJà VU: A NIGHTMARE COME TRUE

Upper (proposed): DÉJÀ VU: A NIGHTMARE COME TRUE

Verified ok on Linux/recent RetroPie 4.7.1 (the impact on runtime can be neglected IMHO).
Could someone (of the core maintainers, e.g. @pjft @tomaz82 ...) pls verify this proposed change on Windows/Mac before proceeding?

@cmitu
Copy link

cmitu commented Feb 7, 2021

Doesn't work on macOS (Apple clang 11):

es-core/src/utils/StringUtil.cpp:165:43: error: no member named 'ctype' in namespace 'std'; did you mean 'wctype'?
                        auto& fct = std::use_facet<std::ctype<wchar_t> >(std::locale());
                                                   ~~~~~^~~~~
                                                        wctype

According to cppreference, std::wstring_convert is deprecated since C++17.

Does work on macos, just needed to include the new headers.
RetroPie's fork of ES doesn't support macOS yet, but I have a compatible version in my fork.

@Gemba
Copy link

Gemba commented Feb 7, 2021

Ah, thanks. If/if not MacOS is supported by this repo escaped my mind.
I have seen the deprecation note and also read it shall not be "cut off" until some more suitable approach can be provided.

However, it boils down to a design decision: leave it as it, use this approach (albeit deprecated), use a library (which, if not boost?) or roll your own.

@Gemba
Copy link

Gemba commented Feb 8, 2021

Could verify successfully on Windows too (W10/VS2019). Works with same output and with std::locale::global(std::locale("")); iff UTF-8 is enabled in Windows: Open Region -> Register Administrative -> Change system locale... -> Enable Checkbox Use Unicode UTF-8 ...

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 8, 2021

Only done some quick test, and as mentioned, this requires locale to be changed from "C" to "", which breaks other things, (most noticeable is that Carbon theme layout broke) but I have not had time to investigate how or why it broke.

So if the choice is between broken themes and lowercase accented e, I know which one I prefer.

@Gemba
Copy link

Gemba commented Feb 9, 2021

No need to make a choice.
Here's what I did: Pulled latest ES (771f2a6) and latest ES-carbon-Theme (b09973e). Applied the proposed changes (including the region setting) from above and let cmake do the tedious work. Voilà!

Works for me:

ES_2 10 0RP-DEV

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 10, 2021

Yeah that's not the latest VIP Carbon, and that's beside the point, if this breaks the current VIP Carbon, it might break many other themes, which is why this needs to be investigated further, I don't remember the details anymore but I know I did a lot of testing back when I removed boost before coming to the conclusion that locale( "C" ) was the one that behaved as it should.

@Gemba
Copy link

Gemba commented Feb 11, 2021

Now we are getting closer to the facts. If you let me know where this "VIP Carbon" can be found and also can provide specific details what broke in this "VIP Carbon" we might be able to tackle this issue. Or maybe you table it on the RetroPie forum and ask for testers (for Windows builds).

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 12, 2021

I just tested it on "normal" Carbon and it flatout crashes with locale( "" ), and now I remember why I chose to use locale( "C" ) when I removed boost.
locale( "" ) uses the default locale of the computer it's running on, so for example, Swedish, English, Russian, Japanese locales doesn't behave the same.

Swedish for example does not use numbers with periods, but with comma, so it doesn't handle xml values like 0.30, since it expects it to be 0,30, so when it sees 0.30 it just becomes 0. In this case it meant that fontSize became 0 which is invalid. I'm sure there might be other ways around this, but this is where we are right now.

This is the "dangers" with locale, what works on your computer, might crash on someone elses computer using a different locale.

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 12, 2021

I also did some benchmarking

Function       NumCalls   Time
toUpper        142095     0.062938
toUpperNew     142095     7.365905

It added 7 seconds to starting up ES and that is not an impact on runtime that can be neglected IMHO since this is even on pc which is way faster than a pi. (and yes, this was on an optimized release build, on a debug build it adds 26 seconds)

I'm tired and might be wrong. but my quick math says it's about 117 times slower, just to get an uppercase "é"

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 13, 2021

This is gonna be a long comment, but I have now done quite a bit of testing. I wanna point out that all my tests have been done on Windows 10 so the results might differ on a pi.

First off, the local that in my tests works the best is:

std::locale::global(std::locale("en_US.UTF-8"));

Simply because this gives us the ability to do upper case é without breaking the parsing of floats from xml files.

And onto the toUpper itself I made 4 versions and profiled them (PR #716 adds a profiling utility)

But first I added 2 new functions to StringUtil which will be used later on:

std::wstring multiByteToWideChar(const std::string& _string)
{
	std::wstring string;

	string.resize(_string.length());
	mbstowcs((wchar_t*)string.c_str(), _string.c_str(), string.length());

	return string;

} // multiByteToWideChar

std::string wideCharToMultiByte(const std::wstring& _string)
{
	std::string string;

	string.resize(_string.length());
	wcstombs((char*)string.c_str(), _string.c_str(), string.length());

	return string;

} // wideCharToMultiByte

Then I added these 4 versions of toUpper:

std::string toUpperStd(const std::string& _string)
{
	ProfileScope("toUpperStd");

	std::wstring w_str = std::wstring_convert<std::codecvt_utf8<wchar_t> >().from_bytes(_string);
	auto& fct = std::use_facet<std::ctype<wchar_t> >(std::locale());
	fct.toupper(&w_str[0], &w_str[0] + w_str.size());
	std::string string = std::wstring_convert<std::codecvt_utf8<wchar_t> >().to_bytes(w_str);

	return string;

} // toUpperStd

std::string toUpperC2U(const std::string& _string)
{
	ProfileScope("toUpperC2U");

	const std::ctype<wchar_t>& facet = std::use_facet<std::ctype<wchar_t>>(std::locale());
	std::wstring               wstring;
	std::string                string;

	size_t i = 0;
	while(i < _string.length())
		wstring += facet.toupper((wchar_t)Utils::String::chars2Unicode(_string, i));

	for(size_t i = 0; i < wstring.length(); ++i)
		string += unicode2Chars(wstring[i]);

	return string;

} // toUpperC2U

std::string toUpperMB2WC(const std::string& _string)
{
	ProfileScope("toUpperMB2WC");

	const std::ctype<wchar_t>& facet   = std::use_facet<std::ctype<wchar_t>>(std::locale());
	std::wstring               wstring = multiByteToWideChar(_string);

	for(size_t i = 0; i < wstring.length(); ++i)
		wstring[i] = facet.toupper(wstring[i]);

	return wideCharToMultiByte(wstring);

} // toUpperMB2WC

std::string toUpperMB2WC2(const std::string& _string)
{
	ProfileScope("toUpperMB2WC2");

	const std::ctype<wchar_t>& facet   = std::use_facet<std::ctype<wchar_t>>(std::locale());
	std::wstring               wstring = multiByteToWideChar(_string);

	facet.toupper(&wstring[0], &wstring[0] + wstring.size());

	return wideCharToMultiByte(wstring);

} // toUpperMB2WC2

A quick explanation of each function:
toUpperStd: The original method by Gemba who initiated this discussion.
toUpperC2U: A different approach using the already existing chars2Unicode and unicode2Chars in for loops.
toUpperMB2WC: Yet a different approach using the 2 new methods multiByteToWideChar and wideCharToMultiByte.
toUpperMB2WC2: A modification of the previous method to avoid the for loop when calling facet.toupper.

I also modfied toUpper to call all 4 of these just so they could be profiled, all 4 variants have been tested and verified to make é to the proper uppercase, everything have also been tested as to not break chinese characters:

std::string toUpper(const std::string& _string)
{
	toUpperC2U(_string);
	toUpperStd(_string);
	toUpperMB2WC(_string);
	toUpperMB2WC2(_string);

	ProfileScope("toUpper");

	std::string string;

	for(size_t i = 0; i < _string.length(); ++i)
		string += (char)toupper(_string[i]);

	return string;

} // toUpper

None of the results from the 4 new methods in the profiling are actually used as to not favor any of them.

The results were as follow on PC:

Debug:

Message           Calls    Total Time    Total time compared to toUpper

toUpper          142307      0.567588      1.0000
toUpperC2U       142307      2.892302      5.0958
toUpperMB2WC     142307      1.002347      1.7660
toUpperMB2WC2    142307      0.812284      1.4311
toUpperStd       142307     25.876925     45.5910

The winner here is toUpperMB2WC2.

Release:

Message           Calls    Total Time    Total time compared to toUpper

toUpper           142095     0.062217      1.0000
toUpperC2U        142095     0.105076      1.6889
toUpperMB2WC      142095     0.131224      2.1091
toUpperMB2WC2     142095     0.102216      1.6429
toUpperStd        142095     7.800818    125.3808

The winner is once again toUpperMB2WC2.

What's everyones thought on this?

@cmitu
Copy link

cmitu commented Feb 13, 2021

@tomaz82 nicely done.
Since multiByteToWideChar and wideCharToMultiByte and Win32 specific, they won't work on Linux.
I'll try the other 2 functions and see how they compare on a Pi.

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 13, 2021

They use mbstowcs and wcstombs which should not be windows specific.

@cmitu
Copy link

cmitu commented Feb 13, 2021

@tomaz82 yes, I forgot to include the implementation and assumed to be the Win32 equivalents. Here's a result from a Pi4 (no Debug/Release configurations):

Message           Calls    Total Time    Total time compared to toUpper

toUpper           13793      0.010680      1.0000
toUpperC2U        13793      0.024807      2.3228
toUpperMB2WC      13793      0.024823      2.3243
toUpperMB2WC2     13793      0.021708      2.0326
toUpperStd        13793      0.032451      3.0385

@Gemba
Copy link

Gemba commented Feb 16, 2021

@tomaz82 thanks for the throughout analysis and data. If there are better solutions to this which work on Linux and Windows with satisfaction I am open to it. Looks toUpperMB2WC2 is the way to go, tested on my Rpi system and does not impact runtime.
So only the theme-engine must be instructed to interpret number strings without i18n to be able to form a PR. Right?

@tomaz82
Copy link
Collaborator

tomaz82 commented Feb 16, 2021

Well no, not really, that thing has been solved by setting

std::setlocale(LC_NUMERIC, "C");

The problem is what locale the global locale should be set to, it can't be "" because that uses whatever is the default on the running system, which might not even be UTF-8. And that would break even more things.

This is exactly why it was left as "C", 99.9% of the things works, and it's guaranteed to work identically on all systems.

@Gemba
Copy link

Gemba commented Feb 16, 2021

Ah, yes. LC_NUMERIC is the straight forward solution (should have had more coffee today morning 😉 ). Plus, if at any time later floats will be implemented in es_settings.cfg the values won't be misinterpreted.

RetroPie ships locale en_GB.UTF-8 by default.
ES alone on Windows can not rely on this, as you noted correctly, and should test if the current locale is "UTF-8able" and only then switch away from "C" locale.

@cmitu
Copy link

cmitu commented Oct 20, 2021

The problem is what locale the global locale should be set to, it can't be "" because that uses whatever is the default on the running system, which might not even be UTF-8. And that would break even more things.

Looking at previous changes for locale settings via std::locale::global:

  • it was originally added in Aloshi@ac37765 ( Use user locale at startup. Should fix Unicode paths on Windows. ) :
    std::locale::global(boost::locale::generator().generate(""));
    boost::filesystem::path::imbue(std::locale());

Looks like the locale was set to the user's runtime locale.

  • @tomaz82 refactored the code in [Suggestion] Replace boost::locale with std::locale #276:

    -   std::locale::global(boost::locale::generator().generate(""));
    +   std::locale::global(std::locale(std::locale(""), "C", std::locale::numeric));

    Looks like the user locale was still used, but the LC_NUMERIC facet was set to C, to ensure the theme units were un-ambiguously intepreted. This however seems to have broken something else, which lead to the next change.

  • the latest change was PR Trying to fix japanese text #294 which replaced the user's runtime locale with C overall:

-   std::locale::global(std::locale(std::locale(""), "C", std::locale::numeric));
+   std::locale::global(std::locale("C"));

I cannot find a reference of the issue that triggered the latest change (on Github), but testing the previous version of the code with Japanese characters does indeed shows garbled text (mojibake) when the game title is in Japanese (UTF8) and the theme has forceUppercase set. Didn't experience a crash though.

Looks like one way or the other there's some breakage in displaying text. Most likely using something like ICU would solve the issue without relying on the locale support for upcase/lowercase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants