Result length is wrong when some arguments use negative char values #3058

omascia · 2022-08-25T15:50:15Z

Using fmt 9.0, with MSVC 17.x (C++20) :

string accents{ "école" };  // { -23, 99, 111, 108, 101 } (Western ANSI)
string plain{ "ecole" };  // { 101, 99, 111, 108, 101 }

string s1 = fmt::sprintf("%-*.*s", 10, 10, accents);// 12 (wrong)
string f1 = fmt::format("{:<{}}", accents, 10);		// 12 (wrong)

string s2 = fmt::sprintf("%-10.10s", accents);		// 12 (wrong)
string f2 = fmt::format("{:<10}", accents);			// 12 (wrong)

string s3 = fmt::sprintf("%-*.*s", 10, 10, plain);	// 10 (ok)
string f3 = fmt::format("{:<{}}", plain, 10);		// 10 (ok)

string std1 = std::format("{:<{}}", accents, 10);	// 10 (ok)
string std2 = std::format("{:<10}", accents);		// 10 (ok)
string std3 = std::format("{:<{}}", plain, 10);		// 10 (ok)
string std4 = std::format("{:<10}", plain);			// 10 (ok)

With some gcc, the wrong cases give 11 instead of 12, but that is wrong too.

vitaut · 2022-08-25T16:07:43Z

Not sure about msvc but the output and width look correct on gcc: https://godbolt.org/z/bGaMEn7xE.

Please note that width is an estimated display width, not code units. So if you are using Unicode, 11 or 12 is correct depending on normalization and 10 is incorrect. 10 is only correct if you are using a legacy encoding.

vitaut · 2022-08-25T16:18:21Z

If you want width in code units you can wrap the string in fmt::bytes:

string f1 = fmt::format("{:<{}}", fmt::bytes(accents), 10); // f1.size() == 10

omascia · 2022-08-25T17:19:27Z

Agreed on the overall, though this is ANSI Western code source. The accents string effectively contains the 8 bits quantities { -23, 99, 111, 108, 101 }.
And, at least on MSVC (which may mean this is a compiler bug triggering some unwanted char promotion to a larger intermediate type when the signed byte value is negative) the result given by fmt::format is wrong and it is right with std::format. It is not that the output has a 'é' using 2 bytes like if it was coded in UTF-8, no. The output 'é' is still OK ({-23}) but the result string has additional spaces {32} at the end.

The very same line of code, same source code, same compiler, using std::format instead of fmt::format does not exhibit the same length issue.

Tracing the code while running shows that at some point this function is called:
FMT_CONSTEXPR inline auto utf8_decode(const char* s, uint32_t* c, int* e) -> const char*

and that code_point_length() is called, taking the first byte value { -23 } for UTF-8.

template
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
if (const_check(sizeof(Char) != 1)) return 1;
auto lengths =
"\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
int len = lengths[static_cast(*begin) >> 3];

// Compute the pointer to the next character early so that the next
// iteration can start working on the next character. Neither Clang
// nor GCC figure out this reordering on their own.
return len + !len;
}

Is there a way to break that assumption that string arguments are UTF-8 encoded, other than patching 3 bazillions string arguments with fmt::bytes() to overcome this? Locales? Maybe the default locale implies UTF-8?

I fully agree the issue does not seem to occur using some other compilers on Compiler Explorer, though I couldn't understand why, looking at the v9.0 code (which I'm not sure those other compilers were using).

phprus · 2022-08-25T18:56:22Z

utf8_decode is broken now.
See PR #3056 and comment skeeto/branchless-utf8#8 (comment)

vitaut · 2022-08-27T15:39:15Z

I missed that you are using a legacy encoding in your example, sorry. With the fix to handling invalid UTF-8 (#3056) it should give 10 now because an invalid code unit is counted as 1.

Is there a way to break that assumption that string arguments are UTF-8 encoded, other than patching 3 bazillions string arguments with fmt::bytes() to overcome this? Locales? Maybe the default locale implies UTF-8?

{fmt} is mostly encoding-agnostic. In your example encoding is only used for width estimation. So most call sites shouldn't be affected.

vitaut closed this as completed Aug 25, 2022

vitaut added the question label Aug 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Result length is wrong when some arguments use negative char values #3058

Result length is wrong when some arguments use negative char values #3058

omascia commented Aug 25, 2022 •

edited

Loading

vitaut commented Aug 25, 2022

vitaut commented Aug 25, 2022

omascia commented Aug 25, 2022

phprus commented Aug 25, 2022

vitaut commented Aug 27, 2022

Result length is wrong when some arguments use negative char values #3058

Result length is wrong when some arguments use negative char values #3058

Comments

omascia commented Aug 25, 2022 • edited Loading

vitaut commented Aug 25, 2022

vitaut commented Aug 25, 2022

omascia commented Aug 25, 2022

phprus commented Aug 25, 2022

vitaut commented Aug 27, 2022

omascia commented Aug 25, 2022 •

edited

Loading