Handling strings which might not be proper UTF-8 in a better way... #3059

Closed
omascia opened this issue Aug 26, 2022 · 3 comments

omascia commented Aug 26, 2022

How about adding these 3 lines (or a better rewriting of them) to function code_point_length()?

  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

Based on 9.0 source code, this becomes:

template <typename Char>
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
  if (const_check(sizeof(Char) != 1)) return 1;
  auto lengths =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

  // Compute the pointer to the next character early so that the next
  // iteration can start working on the next character. Neither Clang
  // nor GCC figure out this reordering on their own.
  return len + !len;
}

This simply means that a byte which should introduce a 2-, 3-, or 4-byte UTF-8 sequence is only counted as such a sequence if the right number of following bytes are indeed UTF-8 trailing bytes (of the form 10xxxxxx). Otherwise it is counted as a single one-byte character.
If the library is used with char strings in a single-byte encoding such as ISO 8859, it no longer miscounts lengths when padding.
And it still works properly for correct UTF-8 strings:

		string iso{ -23, 99, 111, 108, 101 };  // "école" (ISO 8859-1)
		string utf{ -61, -87, 99, 111, 108, 101 }; // "école" (UTF-8)
		string asc{ 101, 99, 111, 108, 101 };  // "ecole" (ASCII)

		string out_iso = fmt::format("{:<10}", iso);  // size() == 10 (correct)
		string out_utf = fmt::format("{:<10}", utf);  // size() == 11 (correct)
		string out_asc = fmt::format("{:<10}", asc);  // size() == 10 (correct)
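
The check described above can also be tried outside {fmt}. The following is only a minimal standalone sketch: `lenient_code_point_length`, `lenient_count`, and `padded_size` are hypothetical names, not library functions. It works on a `std::string` so that it can also refuse to read past the end of the buffer, which the pointer-only version above cannot do, and it assumes each character occupies one column.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of {fmt}: number of bytes to consume for the
// character starting at s[pos]. Falls back to 1 unless a multi-byte lead byte
// is followed by the right number of trailing bytes (10xxxxxx).
// Precondition: pos < s.size().
std::size_t lenient_code_point_length(const std::string& s, std::size_t pos) {
  // Same table as in the snippet above, indexed by the top five bits of the
  // lead byte; 0 marks trailing bytes and invalid lead bytes.
  static const char lengths[] =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  const auto lead = static_cast<unsigned char>(s[pos]);
  const auto len = static_cast<std::size_t>(lengths[lead >> 3]);
  if (len < 2) return 1;               // ASCII, stray trailing byte, or invalid
  if (len > s.size() - pos) return 1;  // sequence would run past the end
  for (std::size_t i = 1; i < len; ++i) {
    if ((static_cast<unsigned char>(s[pos + i]) >> 6) != 0x2) return 1;
  }
  return len;
}

// Counts characters using the lenient rule above.
std::size_t lenient_count(const std::string& s) {
  std::size_t n = 0;
  for (std::size_t pos = 0; pos < s.size();
       pos += lenient_code_point_length(s, pos)) {
    ++n;
  }
  return n;
}

// Size in bytes after padding to `width` characters (one column per character
// is an assumption of this sketch).
std::size_t padded_size(const std::string& s, std::size_t width) {
  const std::size_t n = lenient_count(s);
  return s.size() + (width > n ? width - n : 0);
}
```

With this sketch, `padded_size("\351cole", 10)` (ISO 8859-1) gives 10 and `padded_size("\303\251cole", 10)` (UTF-8) gives 11, matching the sizes in the comments above.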

Of course, it is possible to construct byte sequences in a single-byte character set that "look like" valid UTF-8, but such combinations are generally unusual in real text.
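
As an illustration of such a collision: in ISO 8859-1 the two characters "Ã©" are the bytes 0xC3 0xA9, which is exactly the UTF-8 encoding of "é", so a check like the one proposed would count them as one character. The sketch below uses a hypothetical predicate (`has_two_byte_utf8_shape` is not a {fmt} function) that only tests the bit pattern of a two-byte sequence.

```cpp
// Hypothetical predicate: do the bytes a, b have the bit shape of a two-byte
// UTF-8 sequence (110xxxxx 10xxxxxx)? This tests shape only, not full validity.
bool has_two_byte_utf8_shape(unsigned char a, unsigned char b) {
  return (a >> 5) == 0x6 && (b >> 6) == 0x2;
}
```

Here `has_two_byte_utf8_shape(0xC3, 0xA9)` is true (ISO 8859-1 "Ã©" collides with UTF-8 "é"), while `has_two_byte_utf8_shape(0xE9, 'c')` is false (ISO 8859-1 "éc").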

vitaut (Contributor) commented Aug 27, 2022

The code you are referring to no longer exists in the current master, but what you are looking for might already have been addressed by #3056.

omascia (Author) commented Aug 27, 2022

I was working with "latest" from https://fmt.dev, which was 9.0.0.
I downloaded 9.1.0, released less than an hour ago (obviously not yet on https://fmt.dev), and I can confirm that with the way the code has been restructured since 9.0.0, I no longer face the issue for which I posted the above fix.

vitaut (Contributor) commented Aug 27, 2022

Great, thanks for checking.
