ddl: fix partition definition for literal with non-utf8 charsets + binary column | tidb-test=pr/2262 (#49229) #51602
This is an automated cherry-pick of #49229
What problem does this PR solve?
Issue Number: close #36433, close #49251
Problem Summary:
The string literal in the partition clause is stored as a formatted string literal. For example, '你好' will be kept as '你好' in the schema, and the collation information is lost. However, when running getRangeLocateExprs, the expression is constructed with the DDL session, which always uses the utf8mb4 collation.

When inserting a row, the string (after being converted to the collation corresponding to the column) is compared with the utf8mb4 literal string, which may cause unexpected behavior. In issue #36433, the inserted value 你好 is converted to its binary GBK encoding \xc4\xe3\xba\xc3 and compared directly with the utf8 string '你好' through the binary collation, so the former is smaller than the latter.

MySQL records the binary representation in the table schema. If we run show create table a in MySQL, it gives a hex literal with the _binary prefix in the partition clause.
What changed and how does it work?
If the target is the binary charset, use a hex literal to represent the string. Actually, this PR makes two changes:
Use _binary + hex literal to represent the definition, as it may be an invalid utf8 string.
As TiDB always uses utf8 to represent a string internally (except binary), it's fine not to consider other charsets like GBK.
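The byte-level mismatch described in the problem summary can be reproduced outside TiDB. The following is a minimal Python sketch, not TiDB code: it compares the GBK bytes of 你好 with the utf8 bytes of the same text, then renders the GBK bytes as a hex literal. The helper name binary_hex_literal and the exact `_binary 0x...` spelling are assumptions for illustration; the form actually stored by TiDB or printed by MySQL may differ.

```python
# Bytes stored for the inserted value from issue #36433 in a GBK-encoded column.
gbk_bytes = "你好".encode("gbk")

# Bytes of the same text when the partition literal is kept as a utf8 string.
utf8_bytes = "你好".encode("utf-8")


def binary_hex_literal(raw: bytes) -> str:
    """Hypothetical helper: render raw bytes as a `_binary` hex literal."""
    return "_binary 0x" + raw.hex().upper()


# Python compares bytes objects byte by byte, much like a binary collation.
print(gbk_bytes)                      # b'\xc4\xe3\xba\xc3'
print(gbk_bytes < utf8_bytes)         # True: the GBK form sorts first
print(binary_hex_literal(gbk_bytes))  # _binary 0xC4E3BAC3
```

Because the hex literal carries the exact bytes, it stays valid even when those bytes are not a well-formed utf8 string, which is the reason given above for using it with binary columns.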
Check List
Tests
Release note