Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hamming_distance can hang or produce incorrect results #9349

Closed

Conversation

kevinwilfong
Copy link
Contributor

Summary:
The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters. It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8. It then updates it's position in the string to either the number of bytes in the character, or the length of the
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point. Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character. The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character. If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Differential Revision: D55670296

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 3, 2024
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D55670296

Copy link

netlify bot commented Apr 3, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 87bc046
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/660ca6746f6cb500084bb6ae

kevinwilfong pushed a commit to kevinwilfong/velox that referenced this pull request Apr 3, 2024
…ator#9349)

Summary:

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters.  It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8.  It then updates it's position in the string to either the number of bytes in the character, or the length of the 
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point.  Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character.  The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character.  If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Differential Revision: D55670296
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D55670296

@mbasmanova
Copy link
Contributor

CC: @wills-feng

…ator#9349)

Summary:

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters.  It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8.  It then updates it's position in the string to either the number of bytes in the character, or the length of the 
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point.  Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character.  The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character.  If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Reviewed By: kgpai

Differential Revision: D55670296
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D55670296

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in bbbe224.

@wills-feng
Copy link
Contributor

Thanks for the fix.
Thanks for letting me know @mbasmanova

Copy link

Conbench analyzed the 1 benchmark run on commit bbbe2243.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Apr 4, 2024
…ator#9349)

Summary:
Pull Request resolved: facebookincubator#9349

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters.  It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8.  It then updates it's position in the string to either the number of bytes in the character, or the length of the
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point.  Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character.  The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character.  If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Reviewed By: amitkdutta, kgpai

Differential Revision: D55670296

fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e
Joe-Abraham pushed a commit to Joe-Abraham/velox that referenced this pull request Jun 7, 2024
…ator#9349)

Summary:
Pull Request resolved: facebookincubator#9349

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters.  It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8.  It then updates it's position in the string to either the number of bytes in the character, or the length of the
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point.  Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character.  The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character.  If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Reviewed By: amitkdutta, kgpai

Differential Revision: D55670296

fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants