Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Result length is wrong when some arguments use negative char values #3058

Closed
omascia opened this issue Aug 25, 2022 · 5 comments
Closed

Result length is wrong when some arguments use negative char values #3058

omascia opened this issue Aug 25, 2022 · 5 comments
Labels

Comments

@omascia
Copy link

omascia commented Aug 25, 2022

Using fmt 9.0, with MSVC 17.x (C++20) :

string accents{ "école" };  // { -23, 99, 111, 108, 101 } (Western ANSI)
string plain{ "ecole" };  // { 101, 99, 111, 108, 101 }

string s1 = fmt::sprintf("%-*.*s", 10, 10, accents);// 12 (wrong)
string f1 = fmt::format("{:<{}}", accents, 10);		// 12 (wrong)

string s2 = fmt::sprintf("%-10.10s", accents);		// 12 (wrong)
string f2 = fmt::format("{:<10}", accents);			// 12 (wrong)

string s3 = fmt::sprintf("%-*.*s", 10, 10, plain);	// 10 (ok)
string f3 = fmt::format("{:<{}}", plain, 10);		// 10 (ok)

string std1 = std::format("{:<{}}", accents, 10);	// 10 (ok)
string std2 = std::format("{:<10}", accents);		// 10 (ok)
string std3 = std::format("{:<{}}", plain, 10);		// 10 (ok)
string std4 = std::format("{:<10}", plain);			// 10 (ok)

With some gcc, the wrong cases give 11 instead of 12, but that is wrong too.

@vitaut
Copy link
Contributor

vitaut commented Aug 25, 2022

Not sure about msvc but the output and width look correct on gcc: https://godbolt.org/z/bGaMEn7xE.

Please note that width is an estimated display width, not code units. So if you are using Unicode, 11 or 12 is correct depending on normalization and 10 is incorrect. 10 is only correct if you are using a legacy encoding.

@vitaut
Copy link
Contributor

vitaut commented Aug 25, 2022

If you want width in code units you can wrap the string in fmt::bytes:

string f1 = fmt::format("{:<{}}", fmt::bytes(accents), 10); // f1.size() == 10

@omascia
Copy link
Author

omascia commented Aug 25, 2022

Agreed on the overall, though this is ANSI Western code source. The accents string effectively contains the 8 bits quantities { -23, 99, 111, 108, 101 }.
And, at least on MSVC (which may mean this is a compiler bug triggering some unwanted char promotion to a larger intermediate type when the signed byte value is negative) the result given by fmt::format is wrong and it is right with std::format. It is not that the output has a 'é' using 2 bytes like if it was coded in UTF-8, no. The output 'é' is still OK ({-23}) but the result string has additional spaces {32} at the end.

The very same line of code, same source code, same compiler, using std::format instead of fmt::format does not exhibit the same length issue.

Tracing the code while running shows that at some point this function is called:
FMT_CONSTEXPR inline auto utf8_decode(const char* s, uint32_t* c, int* e) -> const char*

and that code_point_length() is called, taking the first byte value { -23 } for UTF-8.

template
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
if (const_check(sizeof(Char) != 1)) return 1;
auto lengths =
"\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
int len = lengths[static_cast(*begin) >> 3];

// Compute the pointer to the next character early so that the next
// iteration can start working on the next character. Neither Clang
// nor GCC figure out this reordering on their own.
return len + !len;
}

Is there a way to break that assumption that string arguments are UTF-8 encoded, other than patching 3 bazillions string arguments with fmt::bytes() to overcome this? Locales? Maybe the default locale implies UTF-8?

I fully agree the issue does not seem to occur using some other compilers on Compiler Explorer, though I couldn't understand why, looking at the v9.0 code (which I'm not sure those other compilers were using).

@phprus
Copy link
Contributor

phprus commented Aug 25, 2022

utf8_decode is broken now.
See PR #3056 and comment skeeto/branchless-utf8#8 (comment)

@vitaut
Copy link
Contributor

vitaut commented Aug 27, 2022

I missed that you are using a legacy encoding in your example, sorry. With the fix to handling invalid UTF-8 (#3056) it should give 10 now because an invalid code unit is counted as 1.

Is there a way to break that assumption that string arguments are UTF-8 encoded, other than patching 3 bazillions string arguments with fmt::bytes() to overcome this? Locales? Maybe the default locale implies UTF-8?

{fmt} is mostly encoding-agnostic. In your example encoding is only used for width estimation. So most call sites shouldn't be affected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants