Fix character type truncation Hive LazySimpleSerDe #20731

Merged
dain merged 1 commit into trinodb:master from fix-varchar-trunction-in-hive-readers on Feb 16, 2024

dain (Member) commented Feb 16, 2024

Description

Truncation logic for VARCHAR and CHAR can return a value outside the bounds of the input field.
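
To illustrate what staying within the bounds of the input field means, here is a minimal, self-contained sketch of a bounded truncation. It is not the code from this PR and does not use Trino's `ReadWriteUtils`; the class `TruncationSketch` and its methods are hypothetical names chosen for illustration, and dropping an incomplete trailing UTF-8 sequence is an assumption of the sketch.

```java
// Hypothetical illustration only; not the implementation merged in this PR.
final class TruncationSketch
{
    private TruncationSketch() {}

    // Returns how many bytes of the field starting at offset cover at most
    // maxCodePoints UTF-8 code points. The result never exceeds fieldLength.
    static int truncationLength(byte[] buffer, int offset, int fieldLength, int maxCodePoints)
    {
        int end = offset + fieldLength;
        int position = offset;
        int codePoints = 0;
        while (position < end && codePoints < maxCodePoints) {
            int next = position + codePointByteLength(buffer[position]);
            if (next > end) {
                // Assumption: an incomplete trailing sequence is dropped
                // rather than read past the end of the field.
                break;
            }
            position = next;
            codePoints++;
        }
        return position - offset;
    }

    // Length in bytes of the UTF-8 sequence introduced by the given lead byte.
    static int codePointByteLength(byte lead)
    {
        int b = lead & 0xFF;
        if (b < 0x80) {
            return 1;
        }
        if (b >= 0xF0) {
            return 4;
        }
        if (b >= 0xE0) {
            return 3;
        }
        if (b >= 0xC0) {
            return 2;
        }
        // Stray continuation byte: treat it as a single byte.
        return 1;
    }
}
```

For the bytes of "日本,語日本語" with a field length of 6 (the bytes before the comma) and a limit of 5 code points, the sketch returns 6. Passing the whole buffer length instead of the field length returns 13, which runs past the delimiter.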

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix decoding of `VARCHAR` and `CHAR` with length in `TEXTFILE` and `SEQUENCEFILE` formats. ({issue}`issuenumber`)

oneonestar (Member) commented Feb 16, 2024

Thank you for the quick fix!

Let me summarize the issue:

  • Affected table formats: SEQUENCEFILE and TEXTFILE
  • Affected column types: VARCHAR(n) and CHAR(n) with an explicit length (plain VARCHAR and CHAR are not affected)
  • Affected data: fields where code point count < n && byte count > n
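
The "affected data" condition can be checked with a small sketch like the one below. The class `AffectedFieldCheck` and the method `isAffected` are hypothetical names for illustration, not part of Trino; the sketch assumes the field is UTF-8 encoded and that `n` is the declared VARCHAR(n) or CHAR(n) length.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Trino.
final class AffectedFieldCheck
{
    private AffectedFieldCheck() {}

    // True when the field has fewer than n code points but more than n
    // UTF-8 bytes, which is the condition summarized in the list above.
    static boolean isAffected(String field, int n)
    {
        int codePoints = field.codePointCount(0, field.length());
        int byteCount = field.getBytes(StandardCharsets.UTF_8).length;
        return codePoints < n && byteCount > n;
    }
}
```

With n = 5, the field "日本" (2 code points, 6 bytes) satisfies the condition, while "日" (1 code point, 3 bytes) and "日本語日本" (5 code points, 15 bytes) do not.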

oneonestar (Member) commented

    @Test
    public void unicodeTest()
    {
        // "日本語,日本語" = e697a5 e69cac e8aa9e 2c e697a5 e69cac e8aa9e

        int varcharSize = 5;

        // Codepoint = 1, byte count = 3
        testCalculateTruncationLength("日,本語日本語", varcharSize);
        // Bug: Codepoint = 2, byte count = 6
        testCalculateTruncationLength("日本,語日本語", varcharSize);
        // Bug: Codepoint = 3, byte count = 9
        testCalculateTruncationLength("日本語,日本語", varcharSize);
        // Bug: Codepoint = 4, byte count = 12
        testCalculateTruncationLength("日本語日,本語", varcharSize);
        // Codepoint = 5, byte count = 15
        testCalculateTruncationLength("日本語日本,語", varcharSize);

        // Codepoint = 1, byte count = 3
        testCalculateTruncationLength("日,本語日本語", varcharSize);
        // Codepoint = 2, byte count = 4
        testCalculateTruncationLength("-日,本語日本語", varcharSize);
        // Codepoint = 3, byte count = 5
        testCalculateTruncationLength("--日,本語日本語", varcharSize);
        // Bug: Codepoint = 4, byte count = 6
        testCalculateTruncationLength("---日,本語日本語", varcharSize);
        // Codepoint = 5, byte count = 7
        testCalculateTruncationLength("----日,本語日本語", varcharSize);
    }

    public void testCalculateTruncationLength(String s, int varcharSize)
    {
        byte[] delimiter = {0x2c};
        Type type = VarcharType.createVarcharType(varcharSize);
//        Type type = CharType.createCharType(varcharSize);
        byte[] buffer = s.getBytes(StandardCharsets.UTF_8);
        Slice slice = Slices.wrappedBuffer(buffer);
        int length = Bytes.indexOf(buffer, delimiter);
        int cutOff = ReadWriteUtils.calculateTruncationLength(type, slice, 0, length);
        byte[] output = slice.getBytes(0, cutOff);
        System.out.println(new String(output, StandardCharsets.UTF_8));
    }

Before the patch:

2024-02-16T01:00:46.242-0600	INFO	ForkJoinPool-1-worker-1	stdout	日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本,語日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語,日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語日,
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語日本
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	-日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	--日
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	---日,
2024-02-16T01:00:46.516-0600	INFO	ForkJoinPool-1-worker-1	stdout	----日

After the patch:

2024-02-16T01:01:37.322-0600	INFO	ForkJoinPool-1-worker-1	stdout	日
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語日
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	日本語日本
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	日
2024-02-16T01:01:37.594-0600	INFO	ForkJoinPool-1-worker-1	stdout	-日
2024-02-16T01:01:37.595-0600	INFO	ForkJoinPool-1-worker-1	stdout	--日
2024-02-16T01:01:37.595-0600	INFO	ForkJoinPool-1-worker-1	stdout	---日
2024-02-16T01:01:37.595-0600	INFO	ForkJoinPool-1-worker-1	stdout	----日

@dain dain merged commit 4781b0f into trinodb:master Feb 16, 2024
57 checks passed
@dain dain deleted the fix-varchar-trunction-in-hive-readers branch February 16, 2024 17:00
@github-actions github-actions bot added this to the 440 milestone Feb 16, 2024