Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid first byte bug #8

Closed
phprus opened this issue Aug 24, 2022 · 3 comments
Closed

Invalid first byte bug #8

phprus opened this issue Aug 24, 2022 · 3 comments

Comments

@phprus
Copy link

phprus commented Aug 24, 2022

If first byte of the sequence is invalid then function can return zero as error code (valid utf-8).

Look fmt issue fmtlib/fmt#3038 , PR fmtlib/fmt#3044 and https://godbolt.org/z/EbeEex4Gf for details.

@skeeto
Copy link
Owner

skeeto commented Aug 24, 2022

Thanks for the Godbolt link since that was very helpful in figuring this out. It looks like the two issues observed in fmt are due to invalid use and a defective local modification. I'm unable to reproduce issues on my original:

#include "utf8.h"
#include <stdio.h>

int main(void)
{
    char tests[][4] = {
        "\x8f\xbf\xc0\x00",
        "\xbf\xc0\x00\x00",
        "\xf0\x28\x00\x00",
        "\xf4\x8f\xbf\xc0",
    };
    int ntests = sizeof(tests) / sizeof(*tests);
    for (int i = 0; i < ntests; i++) {
        int err;
        uint32_t cp;
        char *p = utf8_decode(tests[i], &cp, &err);
        printf("U+%06lx len=%td err=%d\n", (long)cp, (p-tests[i]), err);
    }
}

My output:

U+03f000 len=1 err=70
U+000000 len=1 err=90
U+028000 len=4 err=42
U+10ffc0 len=4 err=1

The ASan issue was caused by the invalid use. The input must be no less than 4 bytes, and it's the caller's responsibility to pad the input if necessary. It looks like fmt handles this part correctly. The returned pointer may point into or one past this padding. This is a subtle detail, and it's why they were observing a size of -2. Yes, this is an awkward interface, but it's paying for the performance gains if used appropriately (more on that below).

The zero error problem is caused by merging the len computation with advancing the pointer, via fmt::detail::code_point_length. It's important that the pointer always advances, so it adds len + !len. However, the domain of len must include zero for table lookups into shiftc, mins, and shifte. By forcing it non-zero, the error checks were broken. I strongly suggest the fmt folks run their modified version against my quite-thorough test suite to see if any other error checks are still broken.

Regarding the overall use of utf8_decode in fmt: It makes little sense to use a branchless decoder and then put branches around the call site. The decoder is written for the following use case:

  1. Large buffers (at least thousands of code points)
  2. Probably containing non-ASCII code points (length varies unpredictably)
  3. Probably valid UTF-8 (optimistic)
  4. Invalid input will be completely rejected (no replacement characters)

The caller runs the decoder across the buffer, extracting code points, using the returned pointer for the next iteration — all while making no decisions about errors. Instead they're accumulated and checked later. After all, we're optimistic and expect no errors, so we march along as though nothing has gone wrong. Following the 4-byte rule means the input must simply have 3 bytes of zero padding.

An example that decodes standard input:

#define N (1 << 20)  // 1 MiB

// input buffer
char buf[N+3];
char *end = buf + fread(buf, 1, N, stdin);
end[0] = end[1] = end[2] = 0;

// parsed code points
int len = 0;
uint32_t cp[N];

int errors = 0;
for (char *p = buf; p < end;) {
    int e;
    p = utf8_decode(p, cp+len++, &e);
    errors |= e;
}

The input buffer has 3 extra bytes which do not accept input. The loop stops at the end of the input, not the end of the padding (i.e. no "-2 size"). When the loop is done, errors will indicate if anything went wrong during decoding. It will not say where or what since invalid input will be rejected outright (4), not massaged or recovered, and it didn't waste time in the loop inspecting the error indicator.

Note: Rejection could mean switching to a slower decoder that permits replacement characters — a slow path for the unexpected case.

In retrospect I probably should have required callers to zero e, then only |= in the function instead of the initial assignment. That simplifies the loop nicely:

int errors = 0;
for (char *p = buf; p < end;) {
    p = utf8_decode(p, cp+len++, &errors);
}

If fmt wants to use my decoder, they ought to have a fast path that looks like this when the string is above some length threshold, so that they actually get the performance benefits. Since there's no efficient way to append padding to the input they could stop short and decode the last few bytes slowly.

@phprus
Copy link
Author

phprus commented Aug 24, 2022

@skeeto
Thank you for your comment!
I fixed bugs in fmt's utf8_decode implementation and added tests.

cc @vitaut

@vitaut
Copy link

vitaut commented Aug 26, 2022

Thanks a lot, @skeeto, for the detailed reply. Looks like I introduced an error during some refactor.

If fmt wants to use my decoder, they ought to have a fast path that looks like this when the string is above some length threshold, so that they actually get the performance benefits.

This is the plan but we haven't done it yet. The current usage is indeed suboptimal.

@phprus phprus closed this as completed Sep 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants