Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix decoder on broken utf8 sequences. #3044

Merged
merged 1 commit into from
Aug 24, 2022
Merged

Conversation

phprus
Copy link
Contributor

@phprus phprus commented Aug 18, 2022

Issue: #3038

utf8_decode did not check if the first byte is correct.
for_each_codepoint did not check for overflow on broken codepoint.

cc @stevenzzzz @dyfrgi

Copy link
Contributor

@vitaut vitaut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

Comment on lines +632 to +633
*e |= ((uchar(s[0]) & prefix_masks[len]) !=
uchar((prefix_masks[len] << 1) & 0xFF)); // first byte correct?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed because utf8_decode already gives an error for "\xf0\x28...".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this check failed at sequence \xf4\x8f\xbf\xc0.

/Users/phprus/Devel/fmt/test/ranges-test.cc:393: Failure
Expected equality of these values:
  fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"})
    Which is: "[\"\\xf4\\x0f\xBF\\xc0\"]"
  "[\"\\xf4\\x8f\\xbf\\xc0\"]"

utf8_decode return zero in e.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test case then?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I missed the code snippet which shows it's already there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test already present:

fmt/test/ranges-test.cc

Lines 383 to 384 in 541cd21

EXPECT_EQ(fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"}),
"[\"\\xf4\\x8f\\xbf\\xc0\"]");

This patch fix two error.
First error fixed by add condition error ? 1 : to_unsigned(end - buf_ptr)
Second - by add first byte check.

The first error masked the second on the sequence \xf4\x8f\xbf\xc0

Copy link
Contributor

@vitaut vitaut Aug 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS the current version of utf8_decode already returns an error for "\xf4\x8f\xbf\xc0" too: https://godbolt.org/z/E194qvshK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Step 1:
"\xf4\x8f\xbf\xc0" - is not valid utf-8, print \xf4, remove \xf4 from sequence.
Step 2:
"\x8f\xbf\xc0\0" - is not valid utf-8, but utf8_decode return no error and invalid codepoint (15, not 143).

See: https://godbolt.org/z/EbeEex4Gf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks.

@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) {
auto error = 0;
auto end = utf8_decode(buf_ptr, &cp, &error);
bool result = f(error ? invalid_code_point : cp,
string_view(ptr, to_unsigned(end - buf_ptr)));
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we saw error here "end - buf_ptr" is < 0, but I dont follow how that could happen with invalid code point.

@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) {
auto error = 0;
auto end = utf8_decode(buf_ptr, &cp, &error);
bool result = f(error ? invalid_code_point : cp,
string_view(ptr, to_unsigned(end - buf_ptr)));
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
return result ? end : nullptr;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you help me to understand in the example of "\xF0\x28", is this end still buf_ptr+4 even tho there is an error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzzzz
No. On invalid utf-8 sequence, processing is one byte at a time.

Please describe your testing process.

My environment:
macOS 12.5.1
Apple clang version 13.1.6 (clang-1316.0.21.2.5)

Test steps:

CXXFLAGS="-fsanitize=address" cmake -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=20 -DFMT_PEDANTIC=ON -DFMT_WERROR=ON /path/to/patched/fmt
make
./bin/ranges-test

The test shows no errors.

String "\xF0\x28" present in updated test.

Without this PR:

ERROR: AddressSanitizer: negative-size-param: (size=-2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have overlooked something, but for "\xF0\x28\0\0anything" the utf8_decode_new/decode would return 4 (as expected), when the error is set to 42. Is that "4" gonna skip the "\x28\0\0"?

same test: https://godbolt.org/z/EbeEex4Gf

Copy link
Contributor Author

@phprus phprus Aug 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzzzz
If error != 0 then return value of the utf8_decode function is not used and one byte is skipped (pushed to output in "\xVV" format).

I added a test for the string "\xf0\x28\0\0anything" and it finished without ASAN error and with expected results: \"\\xf0(\\x00\\x00anything\".

Please provide full test source for ASAN errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this!

We got the ASAN error by printing a protobuf debugstring() which is arbitrary binary bytes, unfortunately I can't show the protobuf to you.

I am a little bit lost on how the printing works, could you help me to understand where is the "skip one byte on error" pls? per this piece of code:

    for (auto end = p + s.size() - block_size + 1; p < end;) {
      p = decode(p, p);
      if (!p) return;
    }

it seems p would advance 4 bytes when encode(p, p) is called on "\xof\x28".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzzzz
Please replace

  auto decode = [f](const char* buf_ptr, const char* ptr) {
    auto cp = uint32_t();
    auto error = 0;
    auto end = utf8_decode(buf_ptr, &cp, &error);
    bool result = f(error ? invalid_code_point : cp,
                    string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
    return result ? end : nullptr;
  };

with:

  auto decode = [f](const char* buf_ptr, const char* ptr) {
    auto cp = uint32_t();
    auto error = 0;
    auto end = utf8_decode(buf_ptr, &cp, &error);
    bool result = f(error ? invalid_code_point : cp,
                    string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
    return result ? (error ? buf_ptr + 1 : end) : nullptr;
  };

and run test.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering why you didn't replace that in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzzzz
Did this replacement fix the bug? If yes, I will update the PR.

I didn't see this bug before.

Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
@vitaut vitaut merged commit 489dabb into fmtlib:master Aug 24, 2022
@vitaut
Copy link
Contributor

vitaut commented Aug 24, 2022

Merged, thanks! @phprus, could you report this to https://github.com/skeeto/branchless-utf8 where this implementation originates from?

@phprus
Copy link
Contributor Author

phprus commented Aug 24, 2022

@vitaut : skeeto/branchless-utf8#8

mtremer pushed a commit to ipfire/ipfire-2.x that referenced this pull request Nov 28, 2022
- Update from version 9.0.0 to 9.1.0
- Update of rootfile
- Changelog
    9.1.0 - 2022-08-27
	* ``fmt::formatted_size`` now works at compile time
		  `#3026 <https://github.com/fmtlib/fmt/pull/3026>`_
			  For example (`godbolt <https://godbolt.org/z/1MW5rMdf8>`__):
			   .. code:: c++
			     #include <fmt/compile.h>
			     int main() {
			       using namespace fmt::literals;
			       constexpr size_t n = fmt::formatted_size("{}"_cf, 42);
			       fmt::print("{}\n", n); // prints 2
			     }
	* Fixed handling of invalid UTF-8
		  `#3038 <https://github.com/fmtlib/fmt/pull/3038>`_,
		  `#3044 <https://github.com/fmtlib/fmt/pull/3044>`_,
		  `#3056 <https://github.com/fmtlib/fmt/pull/3056>`_
	* Improved Unicode support in ``ostream`` overloads of ``print``
		  `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_,
		  `#3001 <https://github.com/fmtlib/fmt/pull/3001>`_,
		  `#3025 <https://github.com/fmtlib/fmt/pull/3025>`_
	* Fixed handling of the sign specifier in localized formatting on systems with
	   32-bit ``wchar_t``
		  `#3041 <https://github.com/fmtlib/fmt/issues/3041>`_).
	* Added support for wide streams to ``fmt::streamed``
		  `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_
	* Added the ``n`` specifier that disables the output of delimiters when
	   formatting ranges
		  `#2981 <https://github.com/fmtlib/fmt/pull/2981>`_,
		  `#2983 <https://github.com/fmtlib/fmt/pull/2983>`_
			  For example (`godbolt <https://godbolt.org/z/roKqGdj8c>`__):
			   .. code:: c++
			     #include <fmt/ranges.h>
			     #include <vector>
			     int main() {
			       auto v = std::vector{1, 2, 3};
			       fmt::print("{:n}\n", v); // prints 1, 2, 3
			     }
	* Worked around problematic ``std::string_view`` constructors introduced in C++23
		  `#3030 <https://github.com/fmtlib/fmt/issues/3030>`_,
		  `#3050 <https://github.com/fmtlib/fmt/issues/3050>`_
	* Improve handling (exclusion) of recursive ranges
		  `#2968 <https://github.com/fmtlib/fmt/issues/2968>`_,
		  `#2974 <https://github.com/fmtlib/fmt/pull/2974>`_
	* Improved error reporting in format string compilation
		  `#3055 <https://github.com/fmtlib/fmt/issues/3055>`_
	* Improved the implementation of
		  `Dragonbox <https://github.com/jk-jeon/dragonbox>`_, the algorithm used for
		   the default floating-point formatting
		  `#2984 <https://github.com/fmtlib/fmt/pull/2984>`_
	* Fixed issues with floating-point formatting on exotic platforms.
	* Improved the implementation of chrono formatting
		  `#3010 <https://github.com/fmtlib/fmt/pull/3010>`_
	* Improved documentation
		  `#2966 <https://github.com/fmtlib/fmt/pull/2966>`_,
		  `#3009 <https://github.com/fmtlib/fmt/pull/3009>`_,
		  `#3020 <https://github.com/fmtlib/fmt/issues/3020>`_,
		  `#3037 <https://github.com/fmtlib/fmt/pull/3037>`_
	* Improved build configuration
		  `#2991 <https://github.com/fmtlib/fmt/pull/2991>`_,
		  `#2995 <https://github.com/fmtlib/fmt/pull/2995>`_,
		  `#3004 <https://github.com/fmtlib/fmt/issues/3004>`_,
		  `#3007 <https://github.com/fmtlib/fmt/pull/3007>`_,
		  `#3040 <https://github.com/fmtlib/fmt/pull/3040>`_
	* Fixed various warnings and compilation issues
		  `#2969 <https://github.com/fmtlib/fmt/issues/2969>`_,
		  `#2971 <https://github.com/fmtlib/fmt/pull/2971>`_,
		  `#2975 <https://github.com/fmtlib/fmt/issues/2975>`_,
		  `#2982 <https://github.com/fmtlib/fmt/pull/2982>`_,
		  `#2985 <https://github.com/fmtlib/fmt/pull/2985>`_,
		  `#2988 <https://github.com/fmtlib/fmt/issues/2988>`_,
		  `#3000 <https://github.com/fmtlib/fmt/issues/3000>`_,
		  `#3006 <https://github.com/fmtlib/fmt/issues/3006>`_,
		  `#3014 <https://github.com/fmtlib/fmt/issues/3014>`_,
		  `#3015 <https://github.com/fmtlib/fmt/issues/3015>`_,
		  `#3021 <https://github.com/fmtlib/fmt/pull/3021>`_,
		  `#3023 <https://github.com/fmtlib/fmt/issues/3023>`_,
		  `#3024 <https://github.com/fmtlib/fmt/pull/3024>`_,
		  `#3029 <https://github.com/fmtlib/fmt/pull/3029>`_,
		  `#3043 <https://github.com/fmtlib/fmt/pull/3043>`_,
		  `#3052 <https://github.com/fmtlib/fmt/issues/3052>`_,
		  `#3053 <https://github.com/fmtlib/fmt/pull/3053>`_,
		  `#3054 <https://github.com/fmtlib/fmt/pull/3054>`_

Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
Reviewed-by: Michael Tremer <michael.tremer@ipfire.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants