
Fix UTF-8 truncation #1390

Merged 1 commit into fmtlib:master on Nov 3, 2019

Conversation

tajtiattila
Contributor

This fixes #1389. The new test in format-test.cc fails in the absence of changes to format.h.
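To illustrate the class of failure being fixed (a simplified sketch in plain C++, not the actual test added to format-test.cc): truncating a UTF-8 string to a precision of 4 must be done in code points rather than bytes, or the two-byte 'é' gets split and the result is no longer valid UTF-8.

#include <cstdio>
#include <string>

int main() {
  const std::string s = "caf\xc3\xa9s";   // "cafés": 6 bytes, 5 code points

  std::string byte_cut = s.substr(0, 4);  // "caf\xc3" -- ends mid-sequence, invalid UTF-8
  std::string cp_cut   = s.substr(0, 5);  // "café"    -- first 4 code points, still valid

  std::printf("%zu vs %zu bytes\n", byte_cut.size(), cp_cut.size());
}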

@foonathan
Contributor

This does not work if é is e followed by a combining accent. Code point count is not character count.

@tajtiattila
Contributor Author

@foonathan This addresses basic UTF-8 handling only, so that width and precision are handled consistently. It ensures that valid UTF-8 input results in valid UTF-8 output as far as code points are concerned, which is not true without this change.

Handling different Unicode normalization forms, or even UTF-8 validation for that matter, doesn't belong in fmt in my opinion, because 99.98% of the text on the web is in NFC form anyway. But even if I'm wrong, fmt should at least be able to preserve existing code points.

@foonathan
Contributor

Yes, it's an improvement, but not a complete fix. Even with NFC, characters can be multiple code points.

@tajtiattila
Contributor Author

Code points are the units of text in u8char_t strings in fmt (see fmt::internal::count_code_points), and the PR is a complete fix under this assumption.

Consider the following example, which is not related to trimming:

fmt::print("{:10}", "café");

This will print 10 code points whether the string is in ISO-8859-1 or UTF-8, as expected.

Yet it writes only 9 "characters", if I understand your definition of "character" correctly, when "café" is written in NFD form ('e' followed by COMBINING ACUTE ACCENT).
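To make the byte counts concrete (assuming the usual encodings):

"café" in NFC: {'c', 'a', 'f', '\xc3', '\xa9'}        5 bytes, 4 code points ('é' is U+00E9)
"café" in NFD: {'c', 'a', 'f', 'e', '\xcc', '\x81'}   6 bytes, 5 code points ('e' plus U+0301)

With "{:10}" and code point counting, the NFC string gets 6 padding spaces and the NFD string only 5, so the NFD output occupies 9 terminal columns even though it is 10 code points long.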

Please also note that, with the PR in place, users can provide overloads of the following for their own string types:

fmt::internal::count_code_points
fmt::internal::size_code_points

and use whatever meaning of "character" they need.

@foonathan
Contributor

I don't disagree with you; @vitaut just needs to decide whether the width should be given in code points or in actual columns in the terminal.

@tajtiattila
Contributor Author

Ok, but then I don't see how your concern is valid in relation to this PR. It just fixes code point calculation with precision in mind, ensuring UTF-8 strings stay valid. I didn't want to go into detail about how counting characters should work; I just wanted it to be consistent.

I think it is unfair to discuss changes to the meaning of "character" here because it is a much more complex issue.

@vitaut
Contributor

vitaut commented Nov 2, 2019

Thanks for the PR. This is an improvement even though in the long term we want higher-level units. However, please reuse count_code_points:

inline size_t count_code_points(basic_string_view<char8_t> s) {

@tajtiattila
Contributor Author

tajtiattila commented Nov 2, 2019

However, please reuse count_code_points:

inline size_t count_code_points(basic_string_view<char8_t> s) {

I don't see a way to reuse count_code_points, because the two calculations go in opposite directions:

  • count_code_points calculates the number of code points from the size, and
  • size_code_points calculates the size from the number of code points.

The code point index is always smaller than the byte index after the first non-ASCII code point.

Consider the text in my test case, for example: "cafés" is {'c', 'a', 'f', '\xc3', '\xa9', 's'}. We want size_code_points(s, 4) to return 5 in this case, which is the opposite of what count_code_points does.

Maybe another name would be more appropriate for size_code_points; how about code_point_index?
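To make the relationship concrete with the same string (byte values are the usual UTF-8 encoding):

byte:              'c'  'a'  'f'  '\xc3'  '\xa9'  's'
byte index:         0    1    2    3       4      5
code point index:   0    1    2    3       3      4

count_code_points maps 6 bytes to 5 code points, while size_code_points(s, 4) has to map code point 4 back to the byte index where it starts, 5, so that a precision of 4 keeps the whole 'é'.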

@tajtiattila
Contributor Author

I've renamed size_code_points to code_point_index and changed it to use a single loop keeping the same logic.

It follows the simple logic in count_code_points, but the lack of UTF-8 validation in count_code_points, and therefore in code_point_index, may yield surprising results. With the byte sequence s = "\x80\x80", count_code_points(s) returns 0 and code_point_index(s, 0) returns 2, because there are 2 bytes before the 0th (possibly valid) code point.

There are many ways for a UTF-8 string to be invalid, but these two functions are still the minimum necessary building blocks for UTF-8 string handling in fmt, even if the actual implementation and the definition of a character are subject to change.
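For reference, a standalone sketch of the two building blocks following the continuation-byte logic described above (not necessarily the exact code merged in the PR):

#include <cassert>
#include <cstddef>
#include <string_view>

// Bytes with the bit pattern 10xxxxxx are UTF-8 continuation bytes;
// every other byte starts a new (possibly valid) code point.
std::size_t count_code_points(std::string_view s) {
  std::size_t n = 0;
  for (unsigned char c : s)
    if ((c & 0xc0) != 0x80) ++n;
  return n;
}

// Byte index where the n-th code point starts, or s.size() if there
// are fewer than n + 1 code points.
std::size_t code_point_index(std::string_view s, std::size_t n) {
  std::size_t num_code_points = 0;
  for (std::size_t i = 0; i != s.size(); ++i) {
    if ((static_cast<unsigned char>(s[i]) & 0xc0) != 0x80 && num_code_points++ == n)
      return i;
  }
  return s.size();
}

int main() {
  assert(count_code_points("caf\xc3\xa9s") == 5);
  assert(code_point_index("caf\xc3\xa9s", 4) == 5);   // the 's'
  assert(count_code_points("\x80\x80") == 0);         // only continuation bytes
  assert(code_point_index("\x80\x80", 0) == 2);       // no code point ever starts
}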

vitaut merged commit 0889856 into fmtlib:master on Nov 3, 2019
@vitaut
Contributor

vitaut commented Nov 3, 2019

Sounds reasonable. I think it should be a caller's responsibility to provide valid UTF-8.

Thanks!
