Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support utf8mb4_zh_0900_as_cs for utf8mb4 character set #19747

Open
zz-jason opened this issue Sep 3, 2020 · 20 comments
Open

support utf8mb4_zh_0900_as_cs for utf8mb4 character set #19747

zz-jason opened this issue Sep 3, 2020 · 20 comments
Labels
feature/accepted This feature request is accepted by product managers sig/sql-infra SIG: SQL Infra type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@zz-jason
Copy link
Member

zz-jason commented Sep 3, 2020

Feature Request

Is your feature request related to a problem? Please describe:

It's unable to order by a column based on it's pinyin order. For example:

create table t(
	a varchar(100)
)
charset = 'utf8mb4'
collate = 'utf8mb4_zh_0900_as_cs';

insert some data:

insert into t values ("中文"), ("啊中文");

a query requires to order by column a in its pinyin order:

select * from t order by a;

Describe the feature you'd like:

Support the utf8mb4_zh_0900_as_cs collate like MySQL, the output should be:

MySQL(root@127.0.0.1:test) > select * from t order by a;
+-----------+
| a         |
+-----------+
| 啊中文    |
| 中文      |
+-----------+
2 rows in set (0.00 sec)

Describe alternatives you've considered:

N/A

Teachability, Documentation, Adoption, Migration Strategy:

N/A

@zz-jason zz-jason added the type/feature-request Categorizes issue or PR as related to a new feature. label Sep 3, 2020
@zz-jason
Copy link
Member Author

zz-jason commented Sep 3, 2020

@bb7133 PTAL

@zz-jason
Copy link
Member Author

zz-jason commented Sep 3, 2020

BTW, MySQL supports lots of collations for utf8mb4 charset, in order to support these collations in TiDB in the future, we need a flexible execution engine that allows us to easily add a new collation or even add a new character set. @SunRunAway PTAL

@wjhuang2016
Copy link
Member

Adding a new collation is easy, but adding a new charset is hard now.

@zz-jason zz-jason added the feature/accepted This feature request is accepted by product managers label Sep 3, 2020
@pingcap pingcap deleted a comment from ti-srebot Sep 4, 2020
@xiongjiwei
Copy link
Contributor

MySQL has a framework to support language specific collation, it based on UCA v900 and accept different tailoring rules.
we can do same with mysql, add a framework for language specific collation, it will be easier to adding other language specific collation in future, or we implement collation separately, this way has better performance, easier to optimize.

@zz-jason
Copy link
Member Author

or we implement collation separately, this way has better performance, easier to optimize.

@xiongjiwei How much performance would we gain compared with the unified collation framework?

@xiongjiwei
Copy link
Contributor

xiongjiwei commented Sep 10, 2020

or we implement collation separately, this way has better performance, easier to optimize.

@xiongjiwei How much performance would we gain compared with the unified collation framework?

@zz-jason I have no idea, I am not deep in this, but I instinctively feel that if we implement framework very carefully, we may only loss performance when startup, it is hard but still has a way.

@zz-jason
Copy link
Member Author

zz-jason commented Sep 10, 2020

we may only loss performance when startup

Do you mean when the tidb-server instance is started up?

@xiongjiwei
Copy link
Contributor

xiongjiwei commented Sep 10, 2020

Do you mean when the tidb-server instance is started up?

right

@zz-jason
Copy link
Member Author

OK. I think it's affordable. Would you like to write a proposal for this framework or other algorithms?

@xiongjiwei
Copy link
Contributor

xiongjiwei commented Sep 10, 2020

OK. I think it's affordable. Would you like to write a proposal for this framework or other algorithms?

sure

@bb7133
Copy link
Member

bb7133 commented Sep 10, 2020

This feature request is related to #10192 because I believe they actually aim to the same goal - the 'pinyin' order of Chinese character.

Basically we've two alternatives:

  1. Implement utf8mb4_zh_0900_as_cs: we need to implement the standard UCA algorithm and the tailoring rules of utf8mb4_zh_0900_as_cs in MySQL 8, which seems non-trivial. The implementation of MySQL looks complicated with weight reorders, magic numbers, and some sort of thicks: https://github.com/mysql/mysql-server/blob/8.0/strings/ctype-uca.cc#L697. A lot of effort should be paid to know the exact meaning of its implementation.
  2. Before MySQL 8 and utf8mb4_zh_0900_as_cs, almost all MySQL users archive 'pinyin' order by order by convert(... using gbk) asc(code point of GBK is ordered by pinyin). It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs. This satisfies the requirements of most users but it is not compatible with MySQL. For those users need the replication from TiDB to MySQL, we need to comment the additional this collation to ensure the compatibility will not be broken.

@ilovesoup
Copy link
Contributor

This feature request is related to #10192 because I believe they actually aim to the same goal - the 'pinyin' order of Chinese character.

Basically we've two alternatives:

  1. Implement utf8mb4_zh_0900_as_cs: we need to implement the standard UCA algorithm and the tailoring rules of utf8mb4_zh_0900_as_cs in MySQL 8, which seems non-trivial. The implementation of MySQL looks complicated with weight reorders, magic numbers, and some sort of thicks: https://github.com/mysql/mysql-server/blob/8.0/strings/ctype-uca.cc#L697. A lot of effort should be paid to know the exact meaning of its implementation.
  2. Before MySQL 8 and utf8mb4_zh_0900_as_cs, almost all MySQL users archive 'pinyin' order by order by convert(... using gbk) asc(code point of GBK is ordered by pinyin). It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs. This satisfies the requirements of most users but it is not compatible with MySQL. For those users need the replication from TiDB to MySQL, we need to comment the additional this collation to ensure the compatibility will not be broken.

I feel like option 2 is better for now considering the cost. It does not prevent us go for optional 1 in the future if needed.

@xiongjiwei
Copy link
Contributor

hi, @zz-jason, consider opinion of bb7133 , it is much easier to

implement an additional collation that sort the Chinese character exactly the same with GBK

if we just need pinyin order.

if we need full compatibility with MySQL, we may have a lot to do. we should discuss which way to achieve pinyin order. I'd like to write a proposal if we choose full compatibility with MySQL

@zz-jason
Copy link
Member Author

zz-jason commented Sep 14, 2020

It is possible(and easy) for us to implement an additional collation that sort the Chinese character exactly the same with GBK, named like tidb_utf8_general_zh_cs.

@bb7133 @ilovesoup It's indeed easy to be implemented, but introducing a new collation tidb_utf8_general_zh_cs brings additional maintenance effects. For example:

  • what's the difference between tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs, tidb_utf8_general_zh_cs and order by convert(... using gbk) asc, what collation should the users choose if TiDB provides both tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs?
  • would the behavior of tidb_utf8_general_zh_cs be changed for newer TiDB versions?
  • how to migrate the applications using tidb_utf8_general_zh_cs to other databases?

@bb7133
Copy link
Member

bb7133 commented Sep 29, 2020

  • what's the difference between tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs, tidb_utf8_general_zh_cs and order by convert(... using gbk) asc, what collation should the users choose if TiDB provides both tidb_utf8_general_zh_cs and utf8mb4_zh_0900_as_cs?

We aim to make the Chinese characters of tidb_utf8_general_zh_ci ordered by pinyin and that's all. It may be different from GBK(actually, gbk_chinese_ci collation)/utf8mb4_zh_0900_as_cs for non-Chinese characters. This is to make the implementation quick and easy.

  • would the behavior of tidb_utf8_general_zh_cs be changed for newer TiDB versions?

It will not change, no collation will be changed since it is released.

  • how to migrate the applications using tidb_utf8_general_zh_ci to other databases?

We may not support it and this is not something new, considering there are some features in TiDB that doesn't exists in MySQL(Sequences/Placement Rules...). However, if the tidb_utf8_general_zh_ci becomes a trouble for migration, we can comment it and fallback to the 'default collation' when replicating the data to other databases.

@kennytm
Copy link
Contributor

kennytm commented Oct 20, 2020

The algorithm of utf8mb4_zh_0900_as_cs is described in https://dev.mysql.com/worklog/task/?id=11825. The UCA is implemented in https://pkg.go.dev/golang.org/x/text@v0.3.3/collate, but the table version is 6.2.0 (CLDR v23), rather than 9.0.0 (CLDR v28). Maybe we can make it work from a static 9.0.0 table or adjust the reorder parameters for the golang 6.2.0 range.

Example program:

package main

import (
	"fmt"
	"golang.org/x/text/collate"
	"golang.org/x/text/language"
)

func main() {
	testCases := []string{"一", "二", "三", "四", "五", "1", "one", "yi", "YI", "yī", "🤔"}
	collator := collate.New(language.Chinese)
	var buffer collate.Buffer
	for _, tc := range testCases {
		fmt.Printf("%s\t%x\n", tc, collator.KeyFromString(&buffer, tc))
		buffer.Reset()
	}
}

Output:

一      9689db00000020000002
二      961bda00000020000002
三      9664f900000020000002
四      966c3a00000020000002
五      967a0a00000020000002
1       14e700000020000002
one     1771174f164c00000020002000200000020202
yi      188016cd00000020002000000202
YI      188016cd00000020002000000808
yī      188016cd00000020001c00000202
🤔      86f91400000020000002

on MySQL 8.0:

select column_0, weight_string(column_0 collate utf8mb4_zh_0900_as_cs) 
from (values 
    row(''), row(''), row(''), row(''), row(''), row('1'), row('one'), row('yi'), row('YI'), row(''), row('🤔')
) a;

output:

+----------+--------------------------------------------------------------------------------------------------------------+
| column_0 | weight_string(column_0 collate utf8mb4_zh_0900_as_cs)                                                        |
+----------+--------------------------------------------------------------------------------------------------------------+
| 一       | 0xA63E0000002000000002                                                                                       |
| 二       | 0x38290000002000000002                                                                                       |
| 三       | 0x815D0000002000000002                                                                                       |
| 四       | 0x88A40000002000000002                                                                                       |
| 五       | 0x966E0000002000000002                                                                                       |
| 1        | 0x1C3E0000002000000002                                                                                       |
| one      | 0xBF5ABF36BE2700000020002000200000000200020002                                                               |
| yi       | 0xC088BEAF000000200020000000020002                                                                           |
| YI       | 0xC088BEAF000000200020000000080008                                                                           |
| yī       | 0xC088BEAF00000020001F0116000000020002                                                                       |
| 🤔         | 0x15AD0000002000000002                                                                                       |
+----------+--------------------------------------------------------------------------------------------------------------+

@xiongjiwei
Copy link
Contributor

xiongjiwei commented Oct 21, 2020

Maybe we can make it work from a static 9.0.0 table or adjust the reorder parameters for the golang 6.2.0 range.

  • The performance is not as good as implement it by hand, only about 30% under the default table.
  • The collation should be also implemented in TiKV, and the results must same, it means, we must implement it by ourselves in TiKV
  • As a quick implementation of pinyin order, convert to gbk and use gbk_chinese_ci is enough, we will support utf8mb4_zh_0900_as_cs in some day

@kennytm

@kennytm
Copy link
Contributor

kennytm commented Oct 21, 2020

  • As a quick implementation of pinyin order, convert to gbk and use gbk_chinese_ci is enough, we will support utf8mb4_zh_0900_as_cs in some day

well this issue is about utf8mb4_zh_0900_as_cs, the two don't have to interfere each other.

  • The collation should be also implemented in TiKV, and the results must same, it means, we must implement it by ourselves in TiKV

that's unfortunate, but there's still no need to reimplement UCA by hand in Rust, reusing ICU's ucol_getSortKey should be fine (and so the Go side should also use an ICU Cgo binding).

@wjhuang2016
Copy link
Member

The Go's sortKey is not the same as ICU's sortKey. The sortKey can be implement-defined. Besides, collate in Go is still unreleased, and it tends to buggy:golang/go#38726.

@dveeden
Copy link
Contributor

dveeden commented Aug 1, 2023

Looks related to #37566 / #45650, maybe this can be done once utf8mb4_0900_ai_ci support is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature/accepted This feature request is accepted by product managers sig/sql-infra SIG: SQL Infra type/feature-request Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

No branches or pull requests

7 participants