-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hamming_distance can hang or produce incorrect results #9349
Conversation
This pull request was exported from Phabricator. Differential Revision: D55670296 |
✅ Deploy Preview for meta-velox canceled.
|
…ator#9349) Summary: The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the invalid code point. The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative value of the length of the invalid code point, meaning it doesn't change it's position in the string. This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make progress. Note that callAscii handles this correctly. Differential Revision: D55670296
03aa062
to
fdab0d6
Compare
This pull request was exported from Phabricator. Differential Revision: D55670296 |
CC: @wills-feng |
…ator#9349) Summary: The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the invalid code point. The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative value of the length of the invalid code point, meaning it doesn't change it's position in the string. This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make progress. Note that callAscii handles this correctly. Reviewed By: kgpai Differential Revision: D55670296
fdab0d6
to
87bc046
Compare
This pull request was exported from Phabricator. Differential Revision: D55670296 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix.
This pull request has been merged in bbbe224. |
Thanks for the fix. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…ator#9349) Summary: Pull Request resolved: facebookincubator#9349 The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the invalid code point. The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative value of the length of the invalid code point, meaning it doesn't change it's position in the string. This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make progress. Note that callAscii handles this correctly. Reviewed By: amitkdutta, kgpai Differential Revision: D55670296 fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e
…ator#9349) Summary: Pull Request resolved: facebookincubator#9349 The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the invalid code point. The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative value of the length of the invalid code point, meaning it doesn't change it's position in the string. This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make progress. Note that callAscii handles this correctly. Reviewed By: amitkdutta, kgpai Differential Revision: D55670296 fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e
Summary:
The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the
invalid code point.
The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.
This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.
Note that callAscii handles this correctly.
Differential Revision: D55670296