
Fix UTF-8 truncation #1390

Merged 1 commit into fmtlib:master on Nov 3, 2019

Conversation

tajtiattila
Contributor

This fixes #1389. The new test in format-test.cc fails in the absence of changes to format.h.
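To illustrate the class of failure being fixed (a simplified sketch in plain C++, not the actual test added to format-test.cc): truncating a UTF-8 string to a precision of 4 must be done in code points rather than bytes, or the two-byte 'é' gets split and the result is no longer valid UTF-8.

#include <cstdio>
#include <string>

int main() {
  const std::string s = "caf\xc3\xa9s";   // "cafés": 6 bytes, 5 code points

  std::string byte_cut = s.substr(0, 4);  // "caf\xc3" -- ends mid-sequence, invalid UTF-8
  std::string cp_cut   = s.substr(0, 5);  // "café"    -- first 4 code points, still valid

  std::printf("%zu vs %zu bytes\n", byte_cut.size(), cp_cut.size());
}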

@foonathan
Contributor

This does not work if é is e followed by a combining accent. Code point count is not character count.

@tajtiattila
Contributor Author

@foonathan This addresses basic UTF-8 handling only, so that width and precision are handled consistently. It ensures that valid UTF-8 input results in valid UTF-8 output as far as code points are concerned, which is not true without this change.

Handling different Unicode normalization forms, or even UTF-8 validation for that matter, doesn't belong in fmt in my opinion, because 99.98% of the text on the web is in NFC form anyway. But even if I'm wrong, fmt should at least be able to preserve existing code points.

@foonathan
Contributor

Yes, it's an improvement, but not a complete fix. Even with NFC, characters can be multiple code points.

@tajtiattila
Contributor Author

Code points are the units of text in u8char_t strings in fmt (see fmt::internal::count_code_points), and the PR is a complete fix under this assumption.

Consider the following example, which is not related to trimming:

fmt::print("{:10}", "café");

This will print 10 code points whether the string is in ISO-8859-1 or UTF-8, as expected.

Yet it writes only 9 "characters", if I understand your definition of "character" correctly, when "café" is written in NFD form ('e' followed by COMBINING ACUTE ACCENT).
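To make the byte counts concrete (assuming the usual encodings):

"café" in NFC: {'c', 'a', 'f', '\xc3', '\xa9'}        5 bytes, 4 code points ('é' is U+00E9)
"café" in NFD: {'c', 'a', 'f', 'e', '\xcc', '\x81'}   6 bytes, 5 code points ('e' plus U+0301)

With "{:10}" and code point counting, the NFC string gets 6 padding spaces and the NFD string only 5, so the NFD output occupies 9 terminal columns even though it is 10 code points long.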

Please also note that, with the PR in place, users can provide overloads of the following for their own string types:

fmt::internal::count_code_points
fmt::internal::size_code_points

and use whatever meaning of "character" they need.

@foonathan
Contributor

I don't disagree with you; @vitaut just needs to decide whether the width should be given in code points or in actual columns in the terminal.

@tajtiattila
Contributor Author

Ok, but then I don't see how your concern is valid in relation to this PR. It just fixes code point calculation with precision in mind, ensuring UTF-8 strings stay valid. I didn't want to go into detail about how counting characters should work; I just wanted it to be consistent.

I think it is unfair to discuss changes to the meaning of "character" here because it is a much more complex issue.

@vitaut
Contributor

vitaut commented Nov 2, 2019

Thanks for the PR. This is an improvement even though in the long term we want higher-level units. However, please reuse count_code_points:

inline size_t count_code_points(basic_string_view<char8_t> s) {

@tajtiattila
Contributor Author

tajtiattila commented Nov 2, 2019

However, please reuse count_code_points:

inline size_t count_code_points(basic_string_view<char8_t> s) {

I don't see a way to reuse count_code_points, because the two calculations go in opposite directions:

  • count_code_points calculates the number of code points from the size, and
  • size_code_points calculates the size from the number of code points.

The code point index is always smaller than the byte index after the first non-ASCII code point.

Consider the text in my test case, for example: "cafés" is {'c', 'a', 'f', '\xc3', '\xa9', 's'}. We want size_code_points(s, 4) to return 5 in this case, which is the opposite of what count_code_points does.

Maybe another name would be more appropriate for size_code_points; how about code_point_index?
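To make the relationship concrete with the same string (byte values are the usual UTF-8 encoding):

byte:              'c'  'a'  'f'  '\xc3'  '\xa9'  's'
byte index:         0    1    2    3       4      5
code point index:   0    1    2    3       3      4

count_code_points maps 6 bytes to 5 code points, while size_code_points(s, 4) has to map code point 4 back to the byte index where it starts, 5, so that a precision of 4 keeps the whole 'é'.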

@tajtiattila
Contributor Author

I've renamed size_code_points to code_point_index and changed it to use a single loop keeping the same logic.

It follows the simple logic in count_code_points, but the lack of UTF-8 validation in count_code_points, and therefore in code_point_index, may yield surprising results. With the byte sequence s = "\x80\x80", count_code_points(s) returns 0 and code_point_index(s, 0) returns 2, because there are 2 bytes before the 0th (possibly valid) code point.

There are many ways for a UTF-8 string to be invalid, but these two functions are still the minimum necessary building blocks for UTF-8 string handling in fmt, even if the actual implementation and the definition of a character are subject to change.
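For reference, a standalone sketch of the two building blocks following the continuation-byte logic described above (not necessarily the exact code merged in the PR):

#include <cassert>
#include <cstddef>
#include <string_view>

// Bytes with the bit pattern 10xxxxxx are UTF-8 continuation bytes;
// every other byte starts a new (possibly valid) code point.
std::size_t count_code_points(std::string_view s) {
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xc0) != 0x80) ++n;
  return n;
}

// Byte index where the n-th code point starts, or s.size() if there
// are fewer than n + 1 code points.
std::size_t code_point_index(std::string_view s, std::size_t n) {
  std::size_t num_code_points = 0;
  for (std::size_t i = 0; i != s.size(); ++i) {
    if ((static_cast<unsigned char>(s[i]) & 0xc0) != 0x80 && num_code_points++ == n)
      return i;
  }
  return s.size();
}

int main() {
  assert(count_code_points("caf\xc3\xa9s") == 5);
  assert(code_point_index("caf\xc3\xa9s", 4) == 5);   // the 's'
  assert(count_code_points("\x80\x80") == 0);         // only continuation bytes
  assert(code_point_index("\x80\x80", 0) == 2);       // no code point ever starts
}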

vitaut merged commit 0889856 into fmtlib:master on Nov 3, 2019
@vitaut
Contributor

vitaut commented Nov 3, 2019

Sounds reasonable. I think it should be a caller's responsibility to provide valid UTF-8.

Thanks!
