Fix decoder on broken utf8 sequences. #3044

phprus · 2022-08-18T21:19:01Z

utf8_decode did not check whether the first byte is valid.
for_each_codepoint did not check for overflow on a broken code point.

cc @stevenzzzz @dyfrgi

vitaut

Thanks for the PR.

vitaut · 2022-08-20T16:14:38Z

include/fmt/format.h

+  *e |= ((uchar(s[0]) & prefix_masks[len]) !=
+         uchar((prefix_masks[len] << 1) & 0xFF));  // first byte correct?


I don't think this is needed because utf8_decode already gives an error for "\xf0\x28...".

Without this check, the test fails on the sequence \xf4\x8f\xbf\xc0.

/Users/phprus/Devel/fmt/test/ranges-test.cc:393: Failure
Expected equality of these values:
  fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"})
    Which is: "[\"\\xf4\\x0f\xBF\\xc0\"]"
  "[\"\\xf4\\x8f\\xbf\\xc0\"]"

utf8_decode returns zero in e.

Could you add a test case then?

Nevermind, I missed the code snippet which shows it's already there.

This test is already present:

fmt/test/ranges-test.cc

Lines 383 to 384 in 541cd21

EXPECT_EQ(fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"}),
          "[\"\\xf4\\x8f\\xbf\\xc0\"]");

This patch fixes two errors.
The first is fixed by adding the condition error ? 1 : to_unsigned(end - buf_ptr).
The second is fixed by adding the first-byte check.

The first error masked the second on the sequence \xf4\x8f\xbf\xc0.
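A minimal sketch of the first-byte check in isolation. The contents of prefix_masks are not shown in this thread, so the table below is an assumption, chosen so that the expression from the diff accepts exactly the lead-byte patterns 0xxxxxxx, 110xxxxx, 1110xxxx and 11110xxx. The helper name first_byte_mismatch is hypothetical.

```cpp
using uchar = unsigned char;

// ASSUMED table (not shown in the PR thread): for a sequence of `len` bytes,
// prefix_masks[len] selects the bits of the lead byte that encode the length.
// Index 0 is unused in this sketch.
constexpr int prefix_masks[] = {0x00, 0x80, 0xE0, 0xF0, 0xF8};

// Hypothetical helper wrapping the expression from the diff: returns 1 if the
// lead byte does not carry the prefix expected for a sequence of `len` bytes
// (len in 1..4), and 0 otherwise.
constexpr int first_byte_mismatch(uchar first, int len) {
  return (first & prefix_masks[len]) !=
         uchar((prefix_masks[len] << 1) & 0xFF);
}

// A continuation byte treated as a 1-byte sequence is rejected...
static_assert(first_byte_mismatch(0x8f, 1) == 1, "0x8f is not a lead byte");
// ...while genuine lead bytes pass.
static_assert(first_byte_mismatch(0xf4, 4) == 0, "0xf4 starts a 4-byte form");
static_assert(first_byte_mismatch(0x41, 1) == 0, "ASCII is a 1-byte form");
```

Under this assumed table, the check flags \x8f as soon as it is interpreted as the start of a one-byte sequence, which is the case the second fix targets.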

AFAICS the current version of utf8_decode already returns an error for "\xf4\x8f\xbf\xc0" too: https://godbolt.org/z/E194qvshK

Step 1:
"\xf4\x8f\xbf\xc0" is not valid UTF-8: print \xf4 and remove \xf4 from the sequence.
Step 2:
"\x8f\xbf\xc0\0" is not valid UTF-8 either, but utf8_decode returns no error and an invalid code point (15, not 143).

See: https://godbolt.org/z/EbeEex4Gf
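The two steps can be reproduced with a small stand-alone decoder. This is a sketch reconstructed from the general branchless-utf8 scheme, not code copied from fmt. In particular, mapping invalid lead bytes to length 1 is an assumption that happens to reproduce the values reported in the steps (error for step 1, code point 15 with no error for step 2). The names decoded and decode_sketch are hypothetical.

```cpp
#include <cstdint>

struct decoded {
  std::uint32_t cp;  // decoded value (not meaningful when error != 0)
  int error;         // 0 means "no error reported"
  int len;           // length the decoder derived from the lead byte
};

// Sketch of a branchless decoder over a 4-byte window.
// `s` must point at 4 readable bytes.
constexpr decoded decode_sketch(const unsigned char* s) {
  constexpr int masks[] = {0x00, 0x7f, 0x1f, 0x0f, 0x07};
  constexpr std::uint32_t mins[] = {4194304, 0, 128, 2048, 65536};
  constexpr int shiftc[] = {0, 18, 12, 6, 0};
  constexpr int shifte[] = {0, 6, 4, 2, 0};
  // Sequence length indexed by the top five bits of the lead byte.
  constexpr int lengths[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
                             0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 4, 0};
  int len = lengths[s[0] >> 3];
  len += !len;  // ASSUMPTION: an invalid lead byte is treated as length 1

  // Assemble the code point as if all four bytes belonged to it, then shift
  // away the bits contributed by bytes beyond `len`.
  std::uint32_t cp = std::uint32_t(s[0] & masks[len]) << 18;
  cp |= std::uint32_t(s[1] & 0x3f) << 12;
  cp |= std::uint32_t(s[2] & 0x3f) << 6;
  cp |= std::uint32_t(s[3] & 0x3f);
  cp >>= shiftc[len];

  // Accumulate error bits, then discard those for bytes beyond `len`.
  int e = (cp < mins[len]) << 6;   // non-canonical encoding
  e |= ((cp >> 11) == 0x1b) << 7;  // surrogate half
  e |= (cp > 0x10FFFF) << 8;       // out of range
  e |= (s[1] & 0xc0) >> 2;
  e |= (s[2] & 0xc0) >> 4;
  e |= s[3] >> 6;
  e ^= 0x2a;  // do the tail bytes have the form 10xxxxxx?
  e >>= shifte[len];
  return {cp, e, len};
}

// The two inputs from the steps above.
constexpr unsigned char step1[] = {0xf4, 0x8f, 0xbf, 0xc0};
constexpr unsigned char step2[] = {0x8f, 0xbf, 0xc0, 0x00};
static_assert(decode_sketch(step1).error != 0, "step 1 reports an error");
static_assert(decode_sketch(step2).error == 0, "step 2 reports no error");
static_assert(decode_sketch(step2).cp == 15, "step 2 yields 15, not 143");
```

In this sketch the lead byte \x8f is masked with 0x7f, which is where 15 (0x0f) comes from, and every tail-byte error bit is shifted out because the derived length is 1.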

Makes sense, thanks.

stevenzzzz · 2022-08-22T20:17:56Z

include/fmt/format.h

@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) {
    auto error = 0;
    auto end = utf8_decode(buf_ptr, &cp, &error);
    bool result = f(error ? invalid_code_point : cp,
-                    string_view(ptr, to_unsigned(end - buf_ptr)));
+                    string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));


We saw an error here where "end - buf_ptr" is < 0, but I don't follow how that could happen with an invalid code point.

stevenzzzz · 2022-08-22T20:27:06Z

include/fmt/format.h

@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) {
    auto error = 0;
    auto end = utf8_decode(buf_ptr, &cp, &error);
    bool result = f(error ? invalid_code_point : cp,
-                    string_view(ptr, to_unsigned(end - buf_ptr)));
+                    string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
    return result ? end : nullptr;


Could you help me understand: in the example of "\xF0\x28", is this end still buf_ptr+4 even though there is an error?

@stevenzzzz
No. On an invalid UTF-8 sequence, processing proceeds one byte at a time.

Please describe your testing process.

My environment:
macOS 12.5.1
Apple clang version 13.1.6 (clang-1316.0.21.2.5)

Test steps:

CXXFLAGS="-fsanitize=address" cmake -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=20 -DFMT_PEDANTIC=ON -DFMT_WERROR=ON /path/to/patched/fmt
make
./bin/ranges-test

The test shows no errors.

The string "\xF0\x28" is present in the updated test.

Without this PR:

ERROR: AddressSanitizer: negative-size-param: (size=-2)

I might have overlooked something, but for "\xF0\x28\0\0anything" the utf8_decode_new/decode would return 4 (as expected) while the error is set to 42. Is that "4" going to skip the "\x28\0\0"?

same test: https://godbolt.org/z/EbeEex4Gf

@stevenzzzz
If error != 0, then the return value of the utf8_decode function is not used and one byte is skipped (pushed to the output in "\xVV" format).

I added a test for the string "\xf0\x28\0\0anything" and it finished without ASAN error and with expected results: \"\\xf0(\\x00\\x00anything\".
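A minimal sketch of the "one byte at a time on error" rule, applied to the string from this test. It deliberately uses a plain branching validator instead of fmt's branchless decoder, and it does not reject overlong encodings or surrogates. The names valid_len, walk_result and visit_all are hypothetical.

```cpp
#include <cstddef>

// Simplified validator for this sketch only (NOT fmt's decoder). Returns the
// length (1..4) of a structurally well-formed UTF-8 sequence at p, or 0 if
// the bytes at p do not form one within the n bytes available.
constexpr int valid_len(const unsigned char* p, std::size_t n) {
  if (n == 0) return 0;
  int len = p[0] < 0x80             ? 1
            : (p[0] & 0xE0) == 0xC0 ? 2
            : (p[0] & 0xF0) == 0xE0 ? 3
            : (p[0] & 0xF8) == 0xF0 ? 4
                                    : 0;
  if (len == 0 || std::size_t(len) > n) return 0;
  for (int i = 1; i < len; ++i)
    if ((p[i] & 0xC0) != 0x80) return 0;  // tail bytes must look like 10xxxxxx
  return len;
}

struct walk_result {
  std::size_t pieces;   // how many times a callback would be invoked
  std::size_t invalid;  // how many of those pieces are single invalid bytes
};

// Visits every byte exactly once: a valid sequence is consumed whole, an
// invalid one contributes exactly one byte (to be printed as \xVV).
constexpr walk_result visit_all(const unsigned char* p, std::size_t n) {
  walk_result r = {0, 0};
  std::size_t pos = 0;
  while (pos < n) {
    int len = valid_len(p + pos, n - pos);
    if (len == 0) {
      ++r.invalid;
      len = 1;  // on error, skip one byte only
    }
    ++r.pieces;
    pos += std::size_t(len);
  }
  return r;
}

// "\xf0\x28\0\0anything": 12 bytes.
constexpr unsigned char sample[] = {0xf0, 0x28, 0x00, 0x00, 'a', 'n',
                                    'y',  't',  'h',  'i',  'n', 'g'};
static_assert(visit_all(sample, sizeof(sample)).pieces == 12,
              "no byte is skipped");
static_assert(visit_all(sample, sizeof(sample)).invalid == 1,
              "only the leading 0xf0 is invalid");
```

With this rule the "(" and the two NUL bytes after \xf0 are each visited on their own, which is consistent with the expected result quoted for the test.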

Please provide full test source for ASAN errors.

thanks for working on this!

We got the ASAN error by printing a protobuf DebugString(), which contains arbitrary binary bytes; unfortunately I can't show the protobuf to you.

I am a little bit lost on how the printing works. Could you help me understand where the "skip one byte on error" happens, please? Per this piece of code:

for (auto end = p + s.size() - block_size + 1; p < end;) {
  p = decode(p, p);
  if (!p) return;
}

it seems p would advance 4 bytes when decode(p, p) is called on "\xf0\x28".

@stevenzzzz
Please replace

auto decode = [f](const char* buf_ptr, const char* ptr) {
  auto cp = uint32_t();
  auto error = 0;
  auto end = utf8_decode(buf_ptr, &cp, &error);
  bool result = f(error ? invalid_code_point : cp,
                  string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
  return result ? end : nullptr;
};

with:

auto decode = [f](const char* buf_ptr, const char* ptr) {
  auto cp = uint32_t();
  auto error = 0;
  auto end = utf8_decode(buf_ptr, &cp, &error);
  bool result = f(error ? invalid_code_point : cp,
                  string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
  return result ? (error ? buf_ptr + 1 : end) : nullptr;
};

and run test.
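A sketch of the pointer arithmetic behind this replacement, modelling only how far the position advances. It is one plausible reading of how a negative size such as the reported -2 could arise, not a confirmed trace of the ASAN failure. The helpers lead_len, has_error and final_offset are hypothetical, and the error test is a simplified branching one.

```cpp
#include <cstddef>

// Length implied by the lead byte alone; in a 4-byte-window decoder this is
// what fixes `end` before validation. Bytes that cannot start a multi-byte
// sequence map to 1.
constexpr int lead_len(unsigned char b) {
  return (b & 0xE0) == 0xC0   ? 2
         : (b & 0xF0) == 0xE0 ? 3
         : (b & 0xF8) == 0xF0 ? 4
                              : 1;
}

// Simplified error test: bad lead byte, truncated sequence, or bad tail byte.
constexpr bool has_error(const unsigned char* p, std::size_t n) {
  if (p[0] >= 0x80 && lead_len(p[0]) == 1) return true;  // not a lead byte
  int len = lead_len(p[0]);
  if (std::size_t(len) > n) return true;  // sequence runs past the input
  for (int i = 1; i < len; ++i)
    if ((p[i] & 0xC0) != 0x80) return true;
  return false;
}

// Offset reached after walking n bytes. With one_byte_on_error == false the
// position always advances by the lead-byte length (returning `end`); with
// true it advances by one byte on error (returning `buf_ptr + 1`).
constexpr std::ptrdiff_t final_offset(const unsigned char* p, std::size_t n,
                                      bool one_byte_on_error) {
  const std::ptrdiff_t size = std::ptrdiff_t(n);
  std::ptrdiff_t pos = 0;
  while (pos < size) {
    bool error = has_error(p + pos, std::size_t(size - pos));
    pos += (error && one_byte_on_error) ? 1 : lead_len(p[pos]);
  }
  return pos;
}

constexpr unsigned char bad[] = {0xF0, 0x28};  // "\xF0\x28"
static_assert(final_offset(bad, 2, false) == 4, "advancing to end overshoots");
static_assert(2 - final_offset(bad, 2, false) == -2, "remaining size is -2");
static_assert(final_offset(bad, 2, true) == 2, "one byte on error stops at 2");
```

In this model, returning end on error leaves the position two bytes past a two-byte input, so any later "remaining = size - position" computation goes negative. Returning buf_ptr + 1 keeps the position inside the input.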

I was wondering why you didn't replace that in this PR.

@stevenzzzz
Did this replacement fix the bug? If yes, I will update the PR.

I didn't see this bug before.

Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>

vitaut · 2022-08-24T14:16:25Z

Merged, thanks! @phprus, could you report this to https://github.com/skeeto/branchless-utf8 where this implementation originates from?

phprus · 2022-08-24T14:41:59Z

@vitaut : skeeto/branchless-utf8#8

- Update from version 9.0.0 to 9.1.0
- Update of rootfile
- Changelog

9.1.0 - 2022-08-27

* ``fmt::formatted_size`` now works at compile time
  `#3026 <https://github.com/fmtlib/fmt/pull/3026>`_
  For example (`godbolt <https://godbolt.org/z/1MW5rMdf8>`__):

  .. code:: c++

     #include <fmt/compile.h>

     int main() {
       using namespace fmt::literals;
       constexpr size_t n = fmt::formatted_size("{}"_cf, 42);
       fmt::print("{}\n", n); // prints 2
     }

* Fixed handling of invalid UTF-8
  `#3038 <https://github.com/fmtlib/fmt/pull/3038>`_,
  `#3044 <https://github.com/fmtlib/fmt/pull/3044>`_,
  `#3056 <https://github.com/fmtlib/fmt/pull/3056>`_

* Improved Unicode support in ``ostream`` overloads of ``print``
  `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_,
  `#3001 <https://github.com/fmtlib/fmt/pull/3001>`_,
  `#3025 <https://github.com/fmtlib/fmt/pull/3025>`_

* Fixed handling of the sign specifier in localized formatting on systems with
  32-bit ``wchar_t``
  `#3041 <https://github.com/fmtlib/fmt/issues/3041>`_.

* Added support for wide streams to ``fmt::streamed``
  `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_

* Added the ``n`` specifier that disables the output of delimiters when
  formatting ranges
  `#2981 <https://github.com/fmtlib/fmt/pull/2981>`_,
  `#2983 <https://github.com/fmtlib/fmt/pull/2983>`_
  For example (`godbolt <https://godbolt.org/z/roKqGdj8c>`__):

  .. code:: c++

     #include <fmt/ranges.h>
     #include <vector>

     int main() {
       auto v = std::vector{1, 2, 3};
       fmt::print("{:n}\n", v); // prints 1, 2, 3
     }

* Worked around problematic ``std::string_view`` constructors introduced in
  C++23
  `#3030 <https://github.com/fmtlib/fmt/issues/3030>`_,
  `#3050 <https://github.com/fmtlib/fmt/issues/3050>`_

* Improve handling (exclusion) of recursive ranges
  `#2968 <https://github.com/fmtlib/fmt/issues/2968>`_,
  `#2974 <https://github.com/fmtlib/fmt/pull/2974>`_

* Improved error reporting in format string compilation
  `#3055 <https://github.com/fmtlib/fmt/issues/3055>`_

* Improved the implementation of
  `Dragonbox <https://github.com/jk-jeon/dragonbox>`_, the algorithm used for
  the default floating-point formatting
  `#2984 <https://github.com/fmtlib/fmt/pull/2984>`_

* Fixed issues with floating-point formatting on exotic platforms.

* Improved the implementation of chrono formatting
  `#3010 <https://github.com/fmtlib/fmt/pull/3010>`_

* Improved documentation
  `#2966 <https://github.com/fmtlib/fmt/pull/2966>`_,
  `#3009 <https://github.com/fmtlib/fmt/pull/3009>`_,
  `#3020 <https://github.com/fmtlib/fmt/issues/3020>`_,
  `#3037 <https://github.com/fmtlib/fmt/pull/3037>`_

* Improved build configuration
  `#2991 <https://github.com/fmtlib/fmt/pull/2991>`_,
  `#2995 <https://github.com/fmtlib/fmt/pull/2995>`_,
  `#3004 <https://github.com/fmtlib/fmt/issues/3004>`_,
  `#3007 <https://github.com/fmtlib/fmt/pull/3007>`_,
  `#3040 <https://github.com/fmtlib/fmt/pull/3040>`_

* Fixed various warnings and compilation issues
  `#2969 <https://github.com/fmtlib/fmt/issues/2969>`_,
  `#2971 <https://github.com/fmtlib/fmt/pull/2971>`_,
  `#2975 <https://github.com/fmtlib/fmt/issues/2975>`_,
  `#2982 <https://github.com/fmtlib/fmt/pull/2982>`_,
  `#2985 <https://github.com/fmtlib/fmt/pull/2985>`_,
  `#2988 <https://github.com/fmtlib/fmt/issues/2988>`_,
  `#3000 <https://github.com/fmtlib/fmt/issues/3000>`_,
  `#3006 <https://github.com/fmtlib/fmt/issues/3006>`_,
  `#3014 <https://github.com/fmtlib/fmt/issues/3014>`_,
  `#3015 <https://github.com/fmtlib/fmt/issues/3015>`_,
  `#3021 <https://github.com/fmtlib/fmt/pull/3021>`_,
  `#3023 <https://github.com/fmtlib/fmt/issues/3023>`_,
  `#3024 <https://github.com/fmtlib/fmt/pull/3024>`_,
  `#3029 <https://github.com/fmtlib/fmt/pull/3029>`_,
  `#3043 <https://github.com/fmtlib/fmt/pull/3043>`_,
  `#3052 <https://github.com/fmtlib/fmt/issues/3052>`_,
  `#3053 <https://github.com/fmtlib/fmt/pull/3053>`_,
  `#3054 <https://github.com/fmtlib/fmt/pull/3054>`_

Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
Reviewed-by: Michael Tremer <michael.tremer@ipfire.org>

phprus mentioned this pull request Aug 18, 2022

write_range_entry fails writing non-utf8 string #3038

Closed

phprus force-pushed the issue-3038 branch from d1cbd5e to 9bdcd18 Compare August 18, 2022 22:38

vitaut requested changes Aug 20, 2022

stevenzzzz reviewed Aug 22, 2022

phprus force-pushed the issue-3038 branch from 9bdcd18 to 1d560a3 Compare August 23, 2022 15:12

Fix decoder on broken utf8 sequences.

186ac82

Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>

phprus force-pushed the issue-3038 branch from 1d560a3 to 186ac82 Compare August 23, 2022 16:50

vitaut approved these changes Aug 24, 2022

vitaut merged commit 489dabb into fmtlib:master Aug 24, 2022

phprus mentioned this pull request Aug 24, 2022

Invalid first byte bug skeeto/branchless-utf8#8

Closed
