Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group by multi string column in cop/batchCop mode will truncate the last char of group by columns #6294

Closed
windtalker opened this issue Nov 11, 2022 · 2 comments
Assignees
Labels
affects-6.3 severity/critical type/bug The issue is confirmed as a bug.

Comments

@windtalker
Copy link
Contributor

windtalker commented Nov 11, 2022

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

create a TiDB cluster with v6.3.0

mysql> create table test(a varchar(12) not null, b varchar(17) not null);  
Query OK, 0 rows affected (0.07 sec)

mysql> alter table test set tiflash replica 1;
Query OK, 0 rows affected (0.07 sec)

mysql> insert into test values('203','003004'),('606','003004');
Query OK, 2 rows affected (0.01 sec)
Records: 2  Duplicates: 0  Warnings: 0
mysql> set tidb_allow_mpp=0;
Query OK, 0 rows affected (0.00 sec)
mysql>  select /*+ agg_to_cop() */ a, b from test where b='003004' group by a, b;
+---------+---------+
| a | b |
+---------+---------+
| 20      | 00300   |
| 60      | 00300   |
+---------+---------+
2 rows in set (0.00 sec)

2. What did you expect to see? (Required)

the query result should be

+---------+---------+
| a | b |
+---------+---------+
| 203     | 003004  |
| 606     | 003004  |
+---------+---------+

3. What did you see instead (Required)

the query result is

+---------+---------+
| a | b |
+---------+---------+
| 20      | 00300   |
| 60      | 00300   |
+---------+---------+

4. What is your TiFlash version? (Required)

v6.3.0

5. Root cause

This bug is caused by #5834. In #5834, it introduced a new hash method named HashMethodMultiString, in HashMethodMultiString, for string without collator, it uses

StringRef key(chars[key_index] + last_offset, offsets[key_index][row] - last_offset - 1);

as the key, it is not right, because ColumnString::deserializeAndInsertFromArena actually assumes that the key is including the terminating zero.
This bug seems can't reproduced in master branch, but we still need to double confirm it.

@solotzg
Copy link
Contributor

solotzg commented Nov 11, 2022

Root cause

  • HashMethodMultiString is used when all agg keys are type string.
    /// For the case when there is multi string key.
    template <typename Value, typename Mapped>
    struct HashMethodMultiString
    : public columns_hashing_impl::HashMethodBase<HashMethodMultiString<Value, Mapped>, Value, Mapped, false>
    {
    using Self = HashMethodMultiString<Value, Mapped>;
    using Base = columns_hashing_impl::HashMethodBase<Self, Value, Mapped, false>;
    std::vector<const IColumn::Offset *> offsets;
    std::vector<const UInt8 *> chars;
    TiDB::TiDBCollators collators;
    bool all_collators_padding_bin = false;
    HashMethodMultiString(const ColumnRawPtrs & key_columns, const Sizes &, const TiDB::TiDBCollators & collators_)
    : collators(collators_)
    {
    size_t num = key_columns.size();
    offsets.resize(num);
    chars.resize(num);
    for (size_t i = 0; i < num; ++i)
    {
    const IColumn & column = *key_columns[i];
    const auto & column_string = assert_cast<const ColumnString &>(column);
    offsets[i] = column_string.getOffsets().data();
    chars[i] = column_string.getChars().data();
    }
    if (!collators.empty())
    {
    all_collators_padding_bin = std::all_of(collators.begin(), collators.end(), [](auto & x) {
    return x->isPaddingBinary();
    });
    }
    }
    template <typename F>
    ALWAYS_INLINE inline SerializedKeyHolder genSerializedKeyHolder(ssize_t row, Arena * pool, F && fn_handle_key) const
    {
    auto num = offsets.size();
    static_assert(std::is_same_v<size_t, decltype(reinterpret_cast<const StringRef *>(0)->size)>);
    const char * begin = nullptr;
    size_t sum_size = 0;
    for (size_t key_index = 0; key_index < num; ++key_index)
    {
    auto last_offset = row == 0 ? 0 : offsets[key_index][row - 1];
    StringRef key(chars[key_index] + last_offset, offsets[key_index][row] - last_offset - 1);
    key = fn_handle_key(key_index, key);
    char * pos = pool->allocContinue(key.size + sizeof(key.size), begin);
    {
    memcpy(pos, &key.size, sizeof(key.size));
    inline_memcpy(pos + sizeof(key.size), key.data, key.size);
    }
    sum_size += key.size + sizeof(key.size);
    }
    return SerializedKeyHolder{{begin, sum_size}, *pool};
    }
    ALWAYS_INLINE inline auto getKeyHolder(ssize_t row, Arena * pool, std::vector<String> & sort_key_containers) const
    {
    if (likely(all_collators_padding_bin))
    {
    return genSerializedKeyHolder(row, pool, [](size_t, StringRef key) {
    return DB::BinCollatorSortKey<true>(key.data, key.size);
    });
    }
    if (unlikely(collators.empty()))
    {
    return genSerializedKeyHolder(row, pool, [](size_t, StringRef key) {
    return key;
    });
    }
    else
    {
    return genSerializedKeyHolder(row, pool, [&](size_t key_index, StringRef key) {
    if (collators[key_index])
    return collators[key_index]->sortKey(key.data, key.size, sort_key_containers[key_index]);
    return key;
    });
    }
    }
    protected:
    friend class columns_hashing_impl::HashMethodBase<Self, Value, Mapped, false>;
    };

    If collation is not used, the expected way to serialized string column should keep tail zero byte.
    if (unlikely(collators.empty()))
    {
    return genSerializedKeyHolder(row, pool, [](size_t, StringRef key) {
    return key;
    });
    }

This bug only affect release-6.3. HashMethodMultiString is no longer used since #6135

@windtalker
Copy link
Contributor Author

close this issue since the bug only exists in v6.3.0 and there is no patch version for v6.3.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects-6.3 severity/critical type/bug The issue is confirmed as a bug.
Projects
None yet
Development

No branches or pull requests

4 participants