doc/design: add pinyin order collation for utf8mb4 charset (#19984)
xiongjiwei authored Nov 6, 2020
1 parent 2c8b28c commit 155e908
Showing 1 changed file with 78 additions and 0 deletions.
78 changes: 78 additions & 0 deletions docs/design/2020-09-12-utf8mb4-pinyin-order.md
@@ -0,0 +1,78 @@
# Proposal: support `pinyin` order for `utf8mb4` charset

- Author(s): [xiongjiwei](https://github.com/xiongjiwei)
- Last updated: 2020-11-06
- Discussion at: https://github.com/pingcap/tidb/issues/19747

## Abstract
This document proposes a new collation that sorts Chinese characters in `pinyin` order.

## Background
It is currently not possible to order rows by the pinyin order of a column. For example, a user may want the following query to return its rows in pinyin order:

```sql
create table t(
a varchar(100)
)
charset = 'utf8mb4' collate = 'utf8mb4_zh_0900_as_cs';

# insert some data:
insert into t values ("中文"), ("啊中文");

# a query that needs to order by column a in pinyin order:
select * from t order by a;
+-----------+
| a |
+-----------+
| 啊中文 |
| 中文 |
+-----------+
2 rows in set (0.00 sec)
```
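
To see why a dedicated collation is needed, the following minimal Python sketch sorts the two example strings by Unicode code point, which is also the order a binary comparison of their UTF-8 bytes gives. It is only an illustration, not TiDB code. Because `中` (U+4E2D) has a smaller code point than `啊` (U+554A), code point order puts `中文` first, the opposite of the pinyin order shown above.

```python
# Illustration only: compare the two sample strings by code point and by UTF-8 bytes.
rows = ["中文", "啊中文"]

# Python compares str values by code point; UTF-8 byte order gives the same result.
by_code_point = sorted(rows)
by_utf8_bytes = sorted(rows, key=lambda s: s.encode("utf-8"))

print(by_code_point)  # ['中文', '啊中文']
print(by_utf8_bytes)  # ['中文', '啊中文']
```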

## Proposal

This proposal adds a new collation named `utf8mb4_zh_pinyin_tidb_as_cs`. It supports all Unicode characters and sorts Chinese characters according to the PINYIN collation in the zh.xml file of [CLDR24](http://unicode.org/Public/cldr/24/core.zip). Currently, only Chinese characters that have a `pinyin` entry in zh.xml are supported. Two groups are not supported: CJK characters whose Unicode category is Symbol but that have the same shape as Chinese characters, and the PINYIN characters themselves.

In `utf8mb4_zh_pinyin_tidb_as_cs`, `utf8mb4` is the charset, `zh` means the Chinese language, `pinyin` means it uses pinyin order, `tidb` marks a special (TiDB) version, and `as_cs` means it is accent-sensitive and case-sensitive.

### Advantages

Implementing `utf8mb4_zh_0900_as_cs` would take a lot of work: the MySQL implementation looks complicated, with weight reorders, magic numbers, and some tricks. Implementing `utf8mb4_zh_pinyin_tidb_as_cs` is much easier. It supports all Chinese characters and sorts them in pinyin order, which is good enough.

### Disadvantages

It is not compatible with MySQL. MySQL does not have a collation named `utf8mb4_zh_pinyin_tidb_as_cs`.

## Rationale

### How to implement

#### Compare and Key

- For any Chinese character that has a non-zero sequence number (seq No.) defined in zh.xml according to its gb18030 code, the final weight is `0xFFA00000 + (seq No.)`.
- For any non-Chinese 2-byte gb18030 character C, the final weight is C itself.
- For any non-Chinese 4-byte gb18030 character C, the final weight is `0xFF000000 + diff(C)`, where `diff(C)` is computed by an algorithm.
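
The three rules above can be sketched in Python as follows. This is a hypothetical illustration, not the TiDB implementation: the `PINYIN_SEQ` table and its sequence numbers are made up, `four_byte_diff` assumes that `diff(C)` is the linear index of the 4-byte gb18030 code (the proposal does not define it), and the handling of single-byte codes is also an assumption because the rules do not cover them.

```python
# Hypothetical sketch of the three weight rules; not the actual TiDB implementation.

PINYIN_BASE = 0xFFA00000     # base for Chinese characters with a seq No. in zh.xml
FOUR_BYTE_BASE = 0xFF000000  # base for non-Chinese 4-byte gb18030 characters

# Toy stand-in for the zh.xml pinyin table: character -> non-zero seq No.
# These sequence numbers are made up for illustration.
PINYIN_SEQ = {"啊": 1, "中": 2}


def four_byte_diff(code: bytes) -> int:
    """Assumed diff(C): the linear index of a 4-byte gb18030 code.

    Bytes 1 and 3 range over 0x81..0xFE (126 values) and bytes 2 and 4 over
    0x30..0x39 (10 values). The proposal does not define diff, so this is a guess.
    """
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def weight(ch: str) -> int:
    """Return the sort weight of a single character under the rules above."""
    seq = PINYIN_SEQ.get(ch, 0)
    if seq != 0:
        return PINYIN_BASE + seq
    code = ch.encode("gb18030")
    if len(code) == 4:
        return FOUR_BYTE_BASE + four_byte_diff(code)
    # A 2-byte code is its own weight. 1-byte (ASCII) codes are not covered
    # by the rules; using the byte value is an assumption of this sketch.
    return int.from_bytes(code, "big")


print(hex(weight("啊")))      # 0xffa00001
print(hex(weight("、")))      # 0xa1a2
print(hex(weight("\u0080")))  # 0xff000000
```

Under these assumptions, a sort key for a whole string would be `[weight(ch) for ch in s]`, so `啊中文` compares before `中文` whenever `啊` has the smaller sequence number.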

### Parser

Choose collation ID `2048` for `utf8mb4_zh_pinyin_tidb_as_cs` and add it to the parser.

> MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations. [See also](https://dev.mysql.com/doc/refman/8.0/en/adding-collation-choosing-id.html)

### Compatibility with current collations

`utf8mb4_zh_pinyin_tidb_as_cs` has the same priority as `utf8mb4_unicode_ci` and `utf8mb4_general_ci`, which means these three collations are incompatible with each other.
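
One possible reading of this rule is sketched below in Python. It is a hypothetical illustration, not TiDB's collation-derivation code: the `SAME_PRIORITY` set and the `resolve_collation` helper are names invented here, and the sketch only covers the three collations named above.

```python
# Hypothetical sketch: these collations have equal priority, so two different
# members of the group cannot be mixed in one comparison.
SAME_PRIORITY = {
    "utf8mb4_zh_pinyin_tidb_as_cs",
    "utf8mb4_unicode_ci",
    "utf8mb4_general_ci",
}


def resolve_collation(left: str, right: str) -> str:
    """Pick the collation for comparing two operands, or raise if they conflict."""
    if left == right:
        return left
    if left in SAME_PRIORITY and right in SAME_PRIORITY:
        # Neither side wins, so the mix is rejected.
        raise ValueError(f"illegal mix of collations: {left} and {right}")
    # Other combinations are outside the scope of this sketch.
    raise NotImplementedError(f"no rule in this sketch for {left} and {right}")
```

For example, `resolve_collation("utf8mb4_zh_pinyin_tidb_as_cs", "utf8mb4_general_ci")` raises `ValueError` in this sketch, while passing the same collation on both sides returns it unchanged.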

### Alternative
MySQL has many language-specific collations. For `pinyin` order, MySQL uses the collation `utf8mb4_zh_0900_as_cs`.

## Compatibility and Migration Plan

### Compatibility issues with MySQL

There is no `utf8mb4_zh_pinyin_tidb_as_cs` collation in MySQL. We can comment out `utf8mb4_zh_pinyin_tidb_as_cs` when users need to replicate their data from TiDB to MySQL.

## Open issues (if applicable)

https://github.com/pingcap/tidb/issues/19747

https://github.com/pingcap/tidb/issues/10192
