hamming_distance can hang or produce incorrect results #9349

kevinwilfong · 2024-04-03T00:21:01Z

Summary:
The call function for hamming_distance iterates through two strings, comparing them one UTF-8 character at a time. It uses
utf8proc_codepoint to read each character; that function returns the code point, or the negative length of the invalid code
point if the bytes are not valid UTF-8. The call function then advances its position in the string by either the number of
bytes in the character or the length of the invalid code point.

The logic currently treats ASCII 0 (the null character) as an invalid code point, which is incorrect. The external library
correctly treats it as a valid UTF-8 character, so it returns 0 for the character. The logic in hamming_distance interprets
that 0 as the negated length of an invalid code point, so it advances by zero bytes and its position in the string never changes.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.
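
To make the failure mode concrete, here is a minimal, self-contained sketch of the advance step described above. It is not the actual Velox code: `decodeSketch` is a hypothetical stand-in for utf8proc_codepoint that follows the return convention described in this summary (code point on success, negative length on invalid input) and only handles 1- and 2-byte sequences, and `walk` is a hypothetical helper with an iteration cap so that the buggy variant can be observed without hanging. The only difference between the two variants is whether a returned 0 counts as valid (`> 0` versus `>= 0`), which is an assumption about the shape of the fix based on the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical stand-in for the decoder (NOT the real utf8proc_codepoint).
// Returns the code point and sets `size` to its byte length, or returns the
// negative length of the invalid sequence. Simplified: only 1- and 2-byte
// sequences are recognized; everything else is reported as one invalid byte.
int32_t decodeSketch(std::string_view s, size_t pos, int& size) {
  const auto b0 = static_cast<unsigned char>(s[pos]);
  if (b0 < 0x80) {
    size = 1;
    return b0; // ASCII, including the null character, which yields 0.
  }
  if ((b0 & 0xE0) == 0xC0 && pos + 1 < s.size()) {
    const auto b1 = static_cast<unsigned char>(s[pos + 1]);
    if ((b1 & 0xC0) == 0x80) {
      size = 2;
      return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }
  }
  size = 1;
  return -1; // Invalid sequence of length 1.
}

// Walks the string one character at a time and returns the number of steps
// taken. `maxSteps` caps the loop so the buggy variant terminates here.
size_t walk(std::string_view s, bool fixed, size_t maxSteps) {
  size_t pos = 0;
  size_t steps = 0;
  while (pos < s.size() && steps < maxSteps) {
    int size = 0;
    const int32_t codePoint = decodeSketch(s, pos, size);
    // Buggy variant: 0 is not "> 0", so it is treated as a negative length
    // and the position advances by -0 == 0 bytes.
    const bool valid = fixed ? codePoint >= 0 : codePoint > 0;
    pos += valid ? size : -codePoint;
    ++steps;
  }
  return steps;
}
```

With a three-byte input such as `std::string_view("a\0b", 3)`, the fixed variant finishes in three steps, while the buggy variant stalls on the null character and only stops when it reaches the cap. Inputs without a null character behave the same in both variants.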

Differential Revision: D55670296

facebook-github-bot · 2024-04-03T00:21:08Z

This pull request was exported from Phabricator. Differential Revision: D55670296

netlify · 2024-04-03T00:21:17Z

✅ Deploy Preview for meta-velox canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | `87bc046` |
| 🔍 Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/660ca6746f6cb500084bb6ae |

facebook-github-bot · 2024-04-03T00:22:05Z

This pull request was exported from Phabricator. Differential Revision: D55670296

mbasmanova · 2024-04-03T00:25:21Z

CC: @wills-feng

facebook-github-bot · 2024-04-03T00:44:41Z

This pull request was exported from Phabricator. Differential Revision: D55670296

mbasmanova

Thank you for the fix.

facebook-github-bot · 2024-04-03T17:45:53Z

This pull request has been merged in bbbe224.

wills-feng · 2024-04-03T18:01:17Z

Thanks for the fix.
Thanks for letting me know @mbasmanova

conbench-facebook · 2024-04-03T18:11:08Z

Conbench analyzed the 1 benchmark run on commit bbbe2243.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

…ator#9349)

Summary:
Pull Request resolved: facebookincubator#9349

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses utf8proc_codepoint to read those characters; it returns the character or the negative length of the invalid code point if it's invalid UTF-8. It then updates its position in the string by either the number of bytes in the character, or the length of the invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative value of the length of the invalid code point, meaning it doesn't change its position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make progress.

Note that callAscii handles this correctly.

Reviewed By: amitkdutta, kgpai

Differential Revision: D55670296

fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e

facebook-github-bot added the CLA Signed label Apr 3, 2024

facebook-github-bot added the fb-exported label Apr 3, 2024

kevinwilfong force-pushed the export-D55670296 branch from 03aa062 to fdab0d6 April 3, 2024 00:21

kevinwilfong force-pushed the export-D55670296 branch from fdab0d6 to 87bc046 April 3, 2024 00:44

mbasmanova approved these changes Apr 3, 2024

facebook-github-bot closed this in bbbe224 Apr 3, 2024

facebook-github-bot added the Merged label Apr 3, 2024
