Skip to content

Commit

Permalink
hamming_distance can hang or produce incorrect results (facebookincub…
Browse files Browse the repository at this point in the history
…ator#9349)

Summary:
Pull Request resolved: facebookincubator#9349

The call function for hamming_distance was written to iterate through two strings comparing UTF-8 characters.  It uses
utf8proc_codepoint to read those characters, it returns the character or the negative length of the invalid code point if it's
invalid UTF-8.  It then updates it's position in the string to either the number of bytes in the character, or the length of the
invalid code point.

The logic currently incorrectly treats ASCII 0 (the null character) as an invalid code point.  Since the external library correctly
treats it as a valid UTF-8 character it returns 0 for the character.  The logic in hamming_distance treats 0 as the negative
value of the length of the invalid code point, meaning it doesn't change it's position in the string.

This means we return incorrect results if a null character appears in either string, as we incorrectly compute the length of the
string with the null character.  If both strings contain null characters, we end up in an infinite loop as neither string will make
progress.

Note that callAscii handles this correctly.

Reviewed By: amitkdutta, kgpai

Differential Revision: D55670296

fbshipit-source-id: 73d15b48b67f5342fe1c7904146c32dc5c34bd2e
  • Loading branch information
Kevin Wilfong authored and facebook-github-bot committed Apr 3, 2024
1 parent f0f5986 commit bbbe224
Show file tree
Hide file tree
Showing 2 changed files with 20 additions and 2 deletions.
4 changes: 2 additions & 2 deletions velox/functions/prestosql/StringFunctions.h
Original file line number Diff line number Diff line change
Expand Up @@ -368,8 +368,8 @@ struct HammingDistanceFunction {
// if both code points are invalid, we do not care if they are equal
// the following code treats them as equal if they happen to be of the
// same length
leftPosition += codePointLeft > 0 ? leftSize : -codePointLeft;
rightPosition += codePointRight > 0 ? rightSize : -codePointRight;
leftPosition += codePointLeft >= 0 ? leftSize : -codePointLeft;
rightPosition += codePointRight >= 0 ? rightSize : -codePointRight;

if (codePointLeft != codePointRight) {
distance++;
Expand Down
18 changes: 18 additions & 0 deletions velox/functions/prestosql/tests/StringFunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1905,6 +1905,20 @@ TEST_F(StringFunctionsTest, hammingDistance) {
1);
EXPECT_EQ(hammingDistance("\u0001", "\u0001"), 0);
EXPECT_EQ(hammingDistance("\u0001", "\u0002"), 1);
// Test equal null characters on ASCII path.
EXPECT_EQ(
hammingDistance(std::string("\u0000", 1), std::string("\u0000", 1)), 0);
// Test null and non-null character on ASCII path.
EXPECT_EQ(hammingDistance(std::string("\u0000", 1), "\u0001"), 1);
// Test null and non-null character on non-ASCII path.
EXPECT_EQ(hammingDistance(std::string("\u0000", 1), "\u7231"), 1);
// Test equal null characters on non-ASCII path.
EXPECT_EQ(
hammingDistance(
std::string("\u7231\u0000", 2), std::string("\u7231\u0000", 2)),
0);
// Test invalid UTF-8 characters.
EXPECT_EQ(hammingDistance("\xFF\xFF", "\xF0\x82"), 0);

VELOX_ASSERT_THROW(
hammingDistance("\u0000", "\u0001"),
Expand All @@ -1928,4 +1942,8 @@ TEST_F(StringFunctionsTest, hammingDistance) {
hammingDistance(
"\u4FE1\u5FF5,\u7231,\u5E0C\u671B", "\u4FE1\u5FF5\u5E0C\u671B"),
"The input strings to hamming_distance function must have the same length");
// Test invalid UTF-8 characters.
VELOX_ASSERT_THROW(
hammingDistance("\xFF\x82\xFF", "\xF0\x82"),
"The input strings to hamming_distance function must have the same length");
}

0 comments on commit bbbe224

Please sign in to comment.