doc/design: add pinyin order collation for utf8mb4 charset #19984

Merged
merged 18 commits into from
Nov 6, 2020
86 changes: 86 additions & 0 deletions docs/design/2020-09-12-utf8mb4_zh_0900_as_cs.md
@@ -0,0 +1,86 @@
# Proposal: support `pinyin` order for `utf8mb4` charset

- Author(s): [xiongjiwei](https://github.com/xiongjiwei)
- Last updated: 2020-09-15
- Discussion at: https://github.com/pingcap/tidb/issues/19747

## Abstract
This proposal adds support for sorting Chinese characters in `pinyin` order.

## Background
It is currently not possible to order by a column based on its pinyin order. For example:

```sql
create table t(
a varchar(100)
)
charset = 'utf8mb4' collate = 'utf8mb4_zh_0900_as_cs';

# insert some data:
insert into t values ("中文"), ("啊中文");

# a query that needs rows ordered by column a in pinyin order:
select * from t order by a;
+-----------+
| a         |
+-----------+
| 啊中文    |
| 中文      |
+-----------+
2 rows in set (0.00 sec)
```

## Proposal

This proposal adds a new collation named `utf8mb4_general_zh_cs` for the `utf8mb4` charset. It sorts Chinese characters in `pinyin` order and behaves exactly the same as `gbk_bin`.

The following SQL statements should return the same results.
```sql
# order
select * from t order by a collate utf8mb4_general_zh_cs;
select * from t order by convert(a using gbk) collate gbk_bin;

# sort key
select weight_string(a collate utf8mb4_general_zh_cs) from t;
select weight_string(convert(a using gbk) collate gbk_bin) from t;
```

## Rationale

### How to implement

The collation `utf8mb4_general_zh_cs` converts each `utf8mb4` character to its `gbk` code point and then behaves the same as the collation `gbk_bin`. For compatibility with MySQL, the conversion step should match MySQL's `convert(... using gbk)` exactly.
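
As a rough illustration of the convert-then-compare-bytes idea, here is a minimal Python sketch. It uses Python's built-in `gbk` codec as a stand-in for MySQL's `convert(... using gbk)`; the two mappings are not guaranteed to agree on every code point. The names `gbk_sort_key` and `compare` are hypothetical, replacing unmappable characters with `?` is an assumption of this sketch, and the padding rules of `gbk_bin` are ignored.

```python
def gbk_sort_key(text: str) -> bytes:
    """Return the GBK encoding of text, used here as a binary sort key.

    Assumption: characters without a GBK mapping become "?"; whether this
    matches MySQL for every input would need to be verified.
    """
    return text.encode("gbk", errors="replace")


def compare(a: str, b: str) -> int:
    """Compare two strings by their GBK bytes; returns -1, 0, or 1."""
    ka, kb = gbk_sort_key(a), gbk_sort_key(b)
    return (ka > kb) - (ka < kb)


rows = ["中文", "啊中文"]

# Ordering by the GBK bytes puts the rows in pinyin order.
print(sorted(rows, key=gbk_sort_key))     # ['啊中文', '中文']

# The sort key plays the role of weight_string(...) in the SQL above.
print(gbk_sort_key("中文").hex().upper())  # D6D0CEC4
```

With the two rows from the Background section, `啊中文` sorts before `中文` because the GBK lead byte of `啊` (0xB0) is smaller than that of `中` (0xD6).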

### Parser

Choose collation ID `2048` for `utf8mb4_general_zh_cs` and add it to the parser.

> MySQL supports two-byte collation IDs. The range of IDs from 1024 to 2047 is reserved for user-defined collations. [see also](https://dev.mysql.com/doc/refman/8.0/en/adding-collation-choosing-id.html)
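
To make the ID choice concrete, the following Python sketch registers the new collation in a hypothetical ID-to-name table and checks the ID against the ranges mentioned in the note. The registry, the `register_collation` helper, and the constant names are illustrative assumptions and do not reflect the actual parser code.

```python
# IDs 1024..2047 are reserved for user-defined collations (per the note).
USER_DEFINED_IDS = range(1024, 2048)
# Largest value that fits in a two-byte collation ID.
MAX_TWO_BYTE_ID = 0xFFFF

# Hypothetical registry: collation ID -> collation name.
collations = {}


def register_collation(coll_id: int, name: str) -> None:
    """Add a collation to the registry, rejecting invalid or duplicate IDs."""
    if not 0 < coll_id <= MAX_TWO_BYTE_ID:
        raise ValueError(f"collation ID {coll_id} does not fit in two bytes")
    if coll_id in collations:
        raise ValueError(f"collation ID {coll_id} is already registered")
    collations[coll_id] = name


register_collation(2048, "utf8mb4_general_zh_cs")

# 2048 sits just above the user-defined range.
print(2048 in USER_DEFINED_IDS)  # False
```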

### Compatibility with current collations

`utf8mb4_general_zh_cs` has the same priority as `utf8mb4_unicode_ci` and `utf8mb4_general_ci`, which means these three collations are incompatible with each other.
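
One way to picture this rule is a resolver that picks the higher-priority collation and fails when two different collations share a priority. The Python sketch below is only an illustration under stated assumptions: the numeric priorities, the `resolve` function, and the exception name are hypothetical, and the `utf8mb4_bin` entry is an assumption added purely for contrast. Only the fact that the three collations share one level comes from this proposal.

```python
# Hypothetical priority table. Only the shared level of the three
# non-binary collations is taken from the proposal.
PRIORITY = {
    "utf8mb4_bin": 2,  # assumption, included only for contrast
    "utf8mb4_general_ci": 1,
    "utf8mb4_unicode_ci": 1,
    "utf8mb4_general_zh_cs": 1,
}


class IllegalMixOfCollations(Exception):
    """Raised when two different collations have the same priority."""


def resolve(left: str, right: str) -> str:
    """Pick the collation used when comparing two expressions."""
    if left == right:
        return left
    pl, pr = PRIORITY[left], PRIORITY[right]
    if pl == pr:
        # Same priority but different collations: no winner can be chosen.
        raise IllegalMixOfCollations(f"{left} vs {right}")
    return left if pl > pr else right


print(resolve("utf8mb4_general_zh_cs", "utf8mb4_general_zh_cs"))

try:
    resolve("utf8mb4_general_zh_cs", "utf8mb4_general_ci")
except IllegalMixOfCollations as err:
    print("cannot mix:", err)
```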

### Alternative
MySQL has many language-specific collations. For `pinyin` order, MySQL uses the collation `utf8mb4_zh_0900_as_cs`.

#### Advantages

It is fully compatible with MySQL.

#### Disadvantages

Implementing `utf8mb4_zh_0900_as_cs` would be a lot of work. MySQL's implementation looks complicated, involving weight reordering, magic numbers, and other tricks.


## Compatibility and Migration Plan

### Compatibility issues with MySQL

There is no `utf8mb4_general_zh_cs` collation in MySQL.

## Open issues (if applicable)

https://github.com/pingcap/tidb/issues/19747

https://github.com/pingcap/tidb/issues/10192