pingcap · ti-srebot · Nov 6, 2020 · Sep 14, 2020 · Sep 14, 2020 · Sep 14, 2020
diff --git a/docs/design/2020-09-12-utf8mb4_zh_0900_as_cs.md b/docs/design/2020-09-12-utf8mb4_zh_0900_as_cs.md
@@ -0,0 +1,86 @@
+# Proposal: support `pinyin` order for `utf8mb4` charset
+
+- Author(s):     [xiongjiwei](https://github.com/xiongjiwei)
+- Last updated:  2020-09-15
+- Discussion at: https://github.com/pingcap/tidb/issues/19747
+
+## Abstract
+This proposal adds support for ordering Chinese characters by `pinyin` in the `utf8mb4` charset.
+
+## Background
+TiDB currently cannot order a column by the pinyin of its values. For example, the following query is expected to return its rows in pinyin order:
+
+```sql
+create table t(
+	a varchar(100)
+)
+charset = 'utf8mb4' collate = 'utf8mb4_zh_0900_as_cs';
+
+# insert some data:
+insert into t values ("中文"), ("啊中文");
+
+# a query that needs to order by column a in pinyin order:
+select * from t order by a;
++-----------+
+| a         |
++-----------+
+| 啊中文    |
+| 中文      |
++-----------+
+2 rows in set (0.00 sec)
+```
+
+## Proposal
+
+To support `pinyin` order for Chinese characters, this proposal adds a new collation named `utf8mb4_general_zh_cs` for the `utf8mb4` charset. It orders strings exactly as `gbk_bin` orders their `gbk` encoding.
+
+The two statements in each of the following pairs should return the same result.
+```sql
+# order
+select * from t order by a collate utf8mb4_general_zh_cs;
+select * from t order by convert(a using gbk) collate gbk_bin;
+
+# sort key
+select weight_string(a collate utf8mb4_general_zh_cs) from t;
+select weight_string(convert(a using gbk) collate gbk_bin) from t;
+```
+
+## Rationale
+
+### How to implement
+
+Collation `utf8mb4_general_zh_cs` converts each `utf8mb4` character to its `gbk` code point and then behaves the same as collation `gbk_bin`. For compatibility with MySQL, the conversion step should produce exactly the same result as MySQL's `convert(... using gbk)`.
+
+### Parser
+
+Choose collation ID `2048` for `utf8mb4_general_zh_cs` and add it to the parser. This ID lies just outside the range that MySQL reserves for user-defined collations.
+
+> MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations. [see also](https://dev.mysql.com/doc/refman/8.0/en/adding-collation-choosing-id.html)
+
+### Compatibility with current collations
+
+`utf8mb4_general_zh_cs` has the same priority as `utf8mb4_unicode_ci` and `utf8mb4_general_ci`, which means these three collations are incompatible with each other.
+
+### Alternative
+MySQL has many language-specific collations. For `pinyin` order, MySQL uses the collation `utf8mb4_zh_0900_as_cs`.
+
+#### Advantages
+
+It is fully compatible with MySQL.
+
+#### Disadvantages
+
+Implementing `utf8mb4_zh_0900_as_cs` is a lot of work. MySQL's implementation looks complicated, with weight reordering, magic numbers, and a number of tricks.
+
+
+## Compatibility and Migration Plan
+
+### Compatibility issues with MySQL
+
+MySQL has no `utf8mb4_general_zh_cs` collation, so statements that reference it will not run on MySQL.
+
+## Open issues (if applicable)
+
+https://github.com/pingcap/tidb/issues/19747
+
+https://github.com/pingcap/tidb/issues/10192