diff --git a/docs/design/2021-08-18-charsets.md b/docs/design/2021-08-18-charsets.md
index 16cad2fd044ed..441f5b0917d6b 100644
--- a/docs/design/2021-08-18-charsets.md
+++ b/docs/design/2021-08-18-charsets.md
@@ -98,8 +98,10 @@ After receiving the non-utf-8 character set request, this solution will convert
 ### Collation
 
 Add gbk_chinese_ci and gbk_bin collations. In addition, considering the performance, we can add the collation of utf8mb4 (gbk_utf8mb4_bin).
+- To support gbk_chinese_ci and gbk_bin collations, it needs to turn on the `new_collations_enabled_on_first_bootstrap` switch.
+  - If `new_collations_enabled_on_first_bootstrap` is off, it only supports gbk_utf8mb4_bin which does not need to be converted to gbk charset before processing.
 - Implement the Collator and WildcardPattern interface functions for each collation.
-  - gbk_chinese_ci and gbk_bin need to convert utf-8 to gbk encoding and then generate a sort key. gbk_utf8mb4_bin does not need to be converted to gbk code for processing.
+  - gbk_chinese_ci and gbk_bin need to convert utf-8 to gbk encoding and then generate a sort key.
 - Implement the corresponding functions in the Coprocessor.
 
 ### DDL
@@ -119,43 +121,18 @@ Other behaviors that need to be dealt with:
 #### Compatibility between TiDB versions
 
 - Upgrade compatibility:
-  - Upgrades from versions below 4.0 do not support gbk or any character sets other than the original five (binary, ascii, latin1, utf8, utf8mb4).
-  - Upgrade from version 4.0 or higher
-    - There may be compatibility issues when performing non-utf-8-related operations during the rolling upgrade.
-    - The new version of the cluster is expected to have no compatibility issues when reading old data.
+  - There may be compatibility issues when performing operations during the rolling upgrade.
+  - The new version of the cluster is expected to have no compatibility issues when reading old data.
 - Downgrade compatibility:
   - Downgrade is not compatible. The index key uses the table of gbk_bin/gbk_chinese_ci. The lower version of TiDB will have problems when decoding, and it needs to be transcoded before downgrading.
 
 #### Compatibility with MySQL
 
-Illegal character related issue:
+- Illegal character related issue:
+  - Due to the internal conversion of non-utf-8-related encoding to utf8 for processing, it is not fully compatible with MySQL in some cases in terms of illegal character processing. TiDB controls its behavior through sql_mode.
 
-```sql
-create table t3(a char(10) charset gbk);
-insert into t3 values ('a');
-
-// 0xcee5 is a valid gbk hex literal but invalid utf8mb4 hex literal.
-select hex(concat(a, 0xcee5)) from t3;
--- mysql 61cee5
-
-// 0xe4b880 is an invalid gbk hex literal but valid utf8mb4 hex literal.
-select hex(concat(a, 0xe4b880)) from t3;
--- mysql 61e4b880 (test on mysql 5.7 and 8.0.22)
--- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.
-
-// 0x80 is a hex literal that invalid for neither gbk nor utf8mb4.
-select hex(concat(a, 0x80)) from t3;
--- mysql 6180 (test on mysql 5.7 and 8.0.22)
--- mysql returns "Cannot convert string '\x80' from binary to gbk" (test on mysql 8.0.25 and 8.0.26). TiDB will be compatible with this behavior.
-
-set @@sql_mode = '';
-insert into t3 values (0x80);
--- mysql gets a warning and insert null values (warning: "Incorrect string value: '\x80' for column 'a' at row 1")
-
-set @@sql_mode = 'STRICT_TRANS_TABLES';
-insert into t3 values (0x80);
--- mysql returns "Incorrect string value: '\x80' for column 'a' at row 1"
-```
+- Collation
+  - Fully support `gbk_bin` and `gbk_chinese_ci` only when the config `new_collations_enabled_on_first_bootstrap` is enabled. Otherwise, it only supports gbk_utf8mb4_bin.
 
 #### Compatibility with other components
 