[YSQL] Feature Support - Column Collation #1127

nocaway · 2019-04-04T18:41:22Z

Jira Link: DB-1087
Support COLLATE for columns.

Phase 1: Deterministic Collations (PG 11)

Feature	Status	Comments
Functional Spec and Design Doc	✅	https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/ysql-collation-support.md
Part 1: Support compiling code with `libicu`	✅	Only under a special compiler flag
Part 2: Support collation encoding for every text-type	✅	Encoding includes ICU sort key plus original string for deterministic collations (see design doc for details)
Part 3: Optimize collation encoding for value columns	✅	Sort key not needed for non-key columns
Part 4: Optimize collation encoding for hash key columns	✅	Maybe sort key is not needed for hash keys and deterministic collations
Part 5: Import ICU collations and add tests	✅	Turn on collations to be enabled by default (remove compiler flag)
Part 6: Add docs for collation	✅	Tracked in #10455

Phase 2: Non-Deterministic Collations (PG 12/13)

The text was updated successfully, but these errors were encountered:

irizzant · 2020-01-31T16:44:48Z

+1 for this issue this is blocking the restore of my Postgres dump:

LINE 14: ALTER TABLE ONLY crm.customer_relationships ALTER COLUMN ind...
                                                     ^
HINT:  See https://github.com/YugaByte/yugabyte-db/issues/1124. Click '+' on the description to raise its priority
    Command was: CREATE TABLE crm.customer_relationships (
    customer_id character(16) NOT NULL,
    index integer NOT NULL,
    relationship_type character varying(20) NOT NULL,
    assigned_by integer NOT NULL,
    assigned_on timestamp without time zone NOT NULL,
    since date NOT NULL,
    upto date,
    deleted_by integer,
    deleted_on timestamp without time zone,
    record_status character(1) NOT NULL
)
WITH (autovacuum_enabled='true');
ALTER TABLE ONLY crm.customer_relationships ALTER COLUMN index SET STATISTICS 0;

ghost · 2021-03-23T09:00:44Z

test=# CREATE TEMP TABLE t0(c0 CHAR(4) COLLATE "C" , c1 REAL  NOT NULL);
ERROR:  COLLATE not supported yet
LINE 1: CREATE TEMP TABLE t0(c0 CHAR(4) COLLATE "C" , c1 REAL  NOT N...
                                        ^
HINT:  See https://github.com/YugaByte/yugabyte-db/issues/1127. Click '+' on the description to raise its priority

Summary: Currently YSQL has very limited support for collation. Collation is about text data sorting rules and postgres provides collation support via the underlining OS library including libc and libicu. Simply put, given two text values (aka strings) s1 and s2, collation sort comparison must be done via strcoll(s1, s2) rather than strcmp(s1, s2). Postgres supports two types of collations: deterministic collations and non-deterministic collations. Postgres 11.2 only supports deterministic collations. Postgres 12 and beyond support both deterministic collations and non-deterministic collations. Non-deterministic collation comparison of s1 and s2 directly translates to strcoll(s1, s2), and deterministic collation comparison uses strcmp(s1, s2) as a "tie-breaker" when strcoll(s1, s2) == 0. This is the first change to support deterministic collation in YSQL to bring YSQL collation support in sync with PG 11.2. With this change, the normal build will not have any behavior change and YSQL will continue to report collation related errors. We'll produce a special build when ICU is enabled: YB_POSTGRES_WITH_ICU=1 ./yb_build.sh The plan is that once we fully support deterministic collations in YSQL, we'll remove the special build flag YB_POSTGRES_WITH_ICU=1 and make that the default. Before that, we do need to have the env flag in order to trigger building postgres with ICU enabled (postgres configure option --with-icu). Otherwise the macro USE_ICU will not be defined and the the ICU relevant code will not be included. To simplify the build command, one can also do a one time "export YB_POSTGRES_WITH_ICU=1" and then use the default ./yb_build.sh build command instead of adding the "YB_POSTGRES_WITH_ICU=1" prefix every time. Test Plan: (1) regular build: ./yb_build.sh, verified that (1.1) pg_collation has only 4 rows as before: ``` yugabyte=# select * from pg_collation; collname | collnamespace | collowner | collprovider | collencoding | collcollate | collctype | collversion -----------+---------------+-----------+--------------+--------------+-------------+-----------+------------- default | 11 | 10 | d | -1 | | | C | 11 | 10 | c | -1 | C | C | POSIX | 11 | 10 | c | -1 | POSIX | POSIX | ucs_basic | 11 | 10 | c | 6 | C | C | (4 rows) ``` (1.2) We continue to report collation related errors: ``` yugabyte=# create collation nd (provider = 'libc', locale=''); ERROR: CREATE COLLATION not supported yet LINE 1: create collation nd (provider = 'libc', locale=''); ^ HINT: Please report the issue on https://github.com/YugaByte/yugabyte-db/issues yugabyte=# create table foo(id text collate "C"); ERROR: COLLATE not supported yet LINE 1: create table foo(id text collate "C"); ^ HINT: See #1127. Click '+' on the description to raise its priority ``` (2) ICU build: YB_POSTGRES_WITH_ICU=1 ./yb_build.sh, verified that (2.1) pg_collation has many rows because it has imported many ICU collations from OS provided libicu library. ``` yugabyte=# select * from pg_collation limit 10; collname | collnamespace | collowner | collprovider | collencoding | collcollate | collctype | collversion -------------+---------------+-----------+--------------+--------------+-------------+------------+------------- default | 11 | 10 | d | -1 | | | C | 11 | 10 | c | -1 | C | C | POSIX | 11 | 10 | c | -1 | POSIX | POSIX | ucs_basic | 11 | 10 | c | 6 | C | C | en_US.utf8 | 11 | 10 | c | 6 | en_US.utf8 | en_US.utf8 | en_US | 11 | 10 | c | 6 | en_US.utf8 | en_US.utf8 | und-x-icu | 11 | 10 | i | -1 | und | und | 153.14 af-x-icu | 11 | 10 | i | -1 | af | af | 153.14.37 af-NA-x-icu | 11 | 10 | i | -1 | af-NA | af-NA | 153.14.37 af-ZA-x-icu | 11 | 10 | i | -1 | af-ZA | af-ZA | 153.14.37 (10 rows) yugabyte=# select count(*) from pg_collation; count ------- 789 (1 row) ``` (2.2) We can now create collation and also support column collation. ``` yugabyte=# create collation nd (provider = 'libc', locale=''); CREATE COLLATION yugabyte=# create table foo(id text collate "C"); CREATE TABLE ``` Reviewers: mbautin, mihnea Reviewed By: mihnea Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D11968

Summary: Currently YSQL has very limited support for collation. Only C collation is supported. Collation is about text data sorting rules and postgres provides collation support via the underlining OS library including libc and libicu. Simply put, given two text values (aka strings) s1 and s2, collation sort comparison must be done via strcoll(s1, s2) rather than strcmp(s1, s2). Postgres supports two types of collations: deterministic collations and non-deterministic collations. Postgres 11.2 only supports deterministic collations. Postgres 12 and beyond support both deterministic collations and non-deterministic collations. Non-deterministic collation comparison of s1 and s2 directly translates to strcoll(s1, s2), and deterministic collation comparison uses strcmp(s1, s2) as a "tie-breaker" when strcoll(s1, s2) == 0. This is the second change to support deterministic collation in YSQL to bring YSQL collation support in sync with PG 11.2. With this change, --with-icu is now always on when building postgres so that ICU code is always built into postgres. The normal build are not expected to have any behavior change and YSQL will continue to report collation related errors. We'll produce a special build with ICU enabled via a new gflag: FLAGS_TEST_pg_collation_enabled=true. This gflag is used during development to enable collation support. However, once collation encoded data lands in docdb, turning off this gflag is not supported. The plan is that once we fully support deterministic collations in YSQL, we'll turn on FLAGS_TEST_pg_collation_enabled=true by default. The high level changes are: (1) In addition to type and modifier, for character data types that YSQL expects, collation related information is now also passed from postgres layer to PgGate when constructing any PgExpr. Collation info currently includes: ``` typedef struct PgCollationInfo { bool collate_is_valid_non_c; const char *sortkey; } YBCPgCollationInfo; ``` collate_is_valid_non_c indicates whether the collation is a valid non-C collation. A non-C collation requires strcoll semantics. In contrast C collation only requires strcmp semantics which is much more efficient. sortkey is the collation sortkey that will be stored in docdb and used for byte-wise comparison. It is null-terminated. (2) When building PgExpr, collate_is_valid_non_c is recorded as 'collate_is_valid_non_c_' in PgExpr. (3) When building PgConstant, sortkey is recorded as collation_sortkey_ in PgConstant. (4) For C collation PgConstant, collation_sortkey_ should be nullptr and we continue to store the string value that postgres layer has passed down as we currently do. In this way we try to minimize the performance penalty for text data that do not use collation. (5) For non-C collation PgConstant, we store a collation-encoded string value. which is composed of a fixed leading bytes, followed by a null-terminated sort key, followed by the original character string value. (6) A sort key of a string value is computed such that strcmp(sortkey(s1), sortkey(s2)) == strcoll(s1, s2). Two different s1 and s2 may have identical sort keys. For deterministic collations, we need to use strcmp(s1, s2) as a tie-breaker when strcmp(sortkey(s1), sortkey(s2)) == 0. That's why the original string value is appended after the null-terminated sort key to complete a collation-encoded string. (7) The collation-encoded string value replaces the original value and is sent to DocDB for storage. (8) On the way back from DocDB, PgGate will use the null byte that separates the sort key from the original string value to strip off the leading bytes and the sort key, only hands the original string value over to postgres layer above. (9) A new test only C constant kTestOnlyUseOSDefaultCollation is used for testing purpose. It can only be set when FLAGS_pg_collation_enabled is true (otherwise assertion fails). Once set to true, a clean build is needed which includes initdb and CreateInitialSysCatalogSnapshot. All the initial databases will have text columns default to en_US.UTF-8 collation. Therefore in regression tests all text columns in user tables will also have en_US.UTF-8 collation. Because this is a non-C collation, the new code path that stores collation-encoded strings is tested. (10) Planed optimization: store only the original value for a non-key column to save storage space, also for hash key, consider store only the original value as well if feasible. (11) Known issue: a sort key is a null-terminated byte sequence (without any embedded \0 byte). As a result the collation encoded string may contain invalid UTF-8 characters. However it is set as a QLValue string field in place of the original string value which is UTF-8. Protobuf reports invalid UTF-8 as an ERROR even though the collation-encoded string is still sent across the wire without any loss. Test Plan: 1. Do default build and run regression tests. Verify: (1.1) pg_collation has only 4 rows as before. ``` yugabyte=# select * from pg_collation; select * from pg_collation; collname | collnamespace | collowner | collprovider | collencoding | collcollate | collctype | collversion -----------+---------------+-----------+--------------+--------------+------------- +-----------+------------- default | 11 | 10 | d | -1 | | | C | 11 | 10 | c | -1 | C | C | POSIX | 11 | 10 | c | -1 | POSIX | POSIX | ucs_basic | 11 | 10 | c | 6 | C | C | (4 rows) ``` (1.2) We continue to report collation related errors. ``` yugabyte=# create collation nd (provider = 'libc', locale=''); create collation nd (provider = 'libc', locale=''); ERROR: CREATE COLLATION not supported yet LINE 1: create collation nd (provider = 'libc', locale=''); ^ HINT: Please report the issue on https://github.com/YugaByte/yugabyte-db/issues yugabyte=# create table foo(id text collate "C"); create table foo(id text collate "C"); ERROR: COLLATE not supported yet LINE 1: create table foo(id text collate "C"); ^ HINT: See #1127. Click '+' on the description to raise its priority ``` (1.3) All initial databases have collation "C" as expected. ``` yugabyte=# select datname, encoding, datcollate from pg_database; select datname, encoding, datcollate from pg_database; datname | encoding | datcollate -----------------+----------+------------ template1 | 6 | C template0 | 6 | C postgres | 6 | C yugabyte | 6 | C system_platform | 6 | C (5 rows) ``` 2. Do build with FLAGS_TEST_pg_collation_enabled=true, also in YBIsCollationEnabled set default env value of YBCIsEnvVarTrueWithDefault("FLAGS_TEST_pg_collation_enabled" to true, run regression tests. Verify: (2.1) pg_collation has many rows because it has imported many ICU collations from OS provided libicu library. ``` yugabyte=# select * from pg_collation limit 10; select * from pg_collation limit 10; collname | collnamespace | collowner | collprovider | collencoding | collcollat e | collctype | collversion -------------+---------------+-----------+--------------+--------------+----------- --+------------+------------- default | 11 | 10 | d | -1 | | | C | 11 | 10 | c | -1 | C | C | POSIX | 11 | 10 | c | -1 | POSIX | POSIX | ucs_basic | 11 | 10 | c | 6 | C | C | en_US.utf8 | 11 | 10 | c | 6 | en_US.utf8 | en_US.utf8 | en_US | 11 | 10 | c | 6 | en_US.utf8 | en_US.utf8 | und-x-icu | 11 | 10 | i | -1 | und | und | 153.14 af-x-icu | 11 | 10 | i | -1 | af | af | 153.14.37 af-NA-x-icu | 11 | 10 | i | -1 | af-NA | af-NA | 153.14.37 af-ZA-x-icu | 11 | 10 | i | -1 | af-ZA | af-ZA | 153.14.37 (10 rows) yugabyte=# select count(*) from pg_collation; select count(*) from pg_collation; count ------- 789 (1 row) ``` (2.2) All initial databases have collation "C" as expected. ``` yugabyte=# select datname, encoding, datcollate from pg_database; select datname, encoding, datcollate from pg_database; datname | encoding | datcollate -----------------+----------+------------ template1 | 6 | C template0 | 6 | C postgres | 6 | C yugabyte | 6 | C system_platform | 6 | C (5 rows) ``` (2.3) We can now create collation and also support column collation. ``` yugabyte=# create collation nd (provider = 'libc', locale=''); create collation nd (provider = 'libc', locale=''); CREATE COLLATION yugabyte=# create table foo(id text collate "C"); create table foo(id text collate "C"); CREATE TABLE yugabyte=# create table bar(id text collate "en_US.utf8"); create table bar(id text collate "en_US.utf8"); ``` (2.4) We can see collation "C" and "en_US.utf8" sort 'a' and 'A' differently which is consistent with postgres. ``` yugabyte=# insert into foo values ('a'); insert into foo values ('a'); INSERT 0 1 yugabyte=# insert into foo values ('A'); insert into foo values ('A'); INSERT 0 1 yugabyte=# select id from foo order by id; select id from foo order by id; id ---- A a (2 rows) yugabyte=# insert into bar values ('a'); insert into bar values ('a'); INSERT 0 1 yugabyte=# insert into bar values ('A'); insert into bar values ('A'); INSERT 0 1 yugabyte=# select id from bar order by id; select id from bar order by id; id ---- a A (2 rows) ``` 3. Do build with same gflags settings as 2, also set kTestOnlyUseOSDefaultCollation=true and run regression tests. Verify: (3.1) Same as (2.1) (3.2) All initial databases have collation "en_US.UTF-8". ``` yugabyte=# select datname, encoding, datcollate from pg_database; select datname, encoding, datcollate from pg_database; datname | encoding | datcollate -----------------+----------+------------- template1 | 6 | en_US.UTF-8 template0 | 6 | en_US.UTF-8 postgres | 6 | en_US.UTF-8 yugabyte | 6 | en_US.UTF-8 system_platform | 6 | en_US.UTF-8 (5 rows) ``` (3.3) Same as (2.3) (3.4) Because "en_US.UTF-8" is the default collation, a simple text column will use it implicitly and will be sorted according to collation "en_US.UTF-8". ``` yugabyte=# create table text_tab (id text); create table text_tab (id text); CREATE TABLE yugabyte=# insert into text_tab values ('a'); insert into text_tab values ('a'); INSERT 0 1 yugabyte=# insert into text_tab values ('A'); insert into text_tab values ('A'); INSERT 0 1 yugabyte=# select id from text_tab order by id; select id from text_tab order by id; id ---- a A (2 rows) ``` (3.5) Explicitly specified collation "C" will be sorted according to collation "C". ``` yugabyte=# create table text_tab2(id text collate "C"); create table text_tab2(id text collate "C"); CREATE TABLE yugabyte=# insert into text_tab2 values ('a'); insert into text_tab2 values ('a'); INSERT 0 1 yugabyte=# insert into text_tab2 values ('A'); insert into text_tab2 values ('A'); INSERT 0 1 yugabyte=# select id from text_tab2 order by id; select id from text_tab2 order by id; id ---- A a (2 rows) ``` Reviewers: mihnea, mbautin, neil, dmitry Reviewed By: dmitry Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D12330

Summary: So far YSQL has added collation support by always performing collation encoding on any non-C collation string constant value that is sent to docdb. The collation encoded result can be memcmp'ed by docdb to achieve the same comparison semantics such that for two strings s1 and s2, their collation encoded results are e(s1), and e(s2), we have strcoll(s1, s2) == memcmp(e(s1), e(s2) Docdb is implemented as rocksdb which only performs memcmp on keys, not values. If a postgres table column is part of a primary key, it will be stored in the key part of the rocksdbs of the tablets for the table. If a postgres table column appears in an index, it will also be stored in the key part of the rocksdbs of the tablets for the index. In both cases, because they are stored in the key part of a rocksdb, collation-encoding is needed to ensure correct comparison semantics. However, if a postgres table column is neither part of a primary key, not it is used to build any index, then it is a non-key column and will be stored in the value part of rocksdb. Rocksdb does not perform memcmp on its value part, therefore performing collation-encoding is not needed. For space efficiency, we should store the original string value by removing the sortkey from the collation-encoded string. This diff implements this space optimization via the following steps: (1) Added PgDml::GetColumnInfo that for given column so that we can tell whether the column represents a primary key column. Note that for both YB base table and YB index table, a column is either a primary key column (that composes the primary key), or a value column. (2) At each bind point, for value column, change collation id to InvalidOid as encoding collation. This has the effect to disable collation encoding so that the original PG character string value will be passed to docdb. Test Plan: 1. Run regression tests with collation disabled (default build). 2. Run regression tests with collation enabled and default database collation is still "C" (FLAGS_TEST_pg_collation_enabled=true) 3. Run regression tests with collation enabled and set default database collation to "en_US.UTF-8" (FLAGS_TEST_pg_collation_enabled=true and kTestOnlyUseOSDefaultCollation=true). Reviewers: mihnea, dmitry Reviewed By: dmitry Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D12962

Summary: In recently added YSQL collation support, YSQL performs collation encoding if a non-C collation string constant value is bound to a primary key text column. This includes a primary key column in the base table, or a primary key column in a secondary index table. The primary key can have only one column, or it can have multiple columns. YSQL uses docdb to store table data and index data persistently. Docdb is implemented as rocksdb which is a sorted key-value store and performs memcmp on keys to do lookups. At docdb side, the primary key will be encoded and stored in the key part. If a text column is stored in the key-part and has non-C collation then the column value is collation encoded. This is to ensure correct comparison semantics so that docdb can do correct lookups. A YSQL primary key is composed of two components: hash component and range component. Either component may be optional but at least one component must exist. When both components exist, hash component must be followed by range component. A hash component can include multiple columns and they are called hash columns. When YSQL processes a DML statement, it can pushdown expressions to docdb when possible to avoid unnecessary data reading across network. For example, conditional expressions that are used to filter rows based upon range columns can directly appear as 'condition_expr' in the docdb read/write requests. Docdb will use 'condition_expr' to select those matching rows by performing memcmp on rocksdb keys against various values in the 'condition_expr' and select or reject a row based upon the memcmp result. If a primary key has a hash component, then the only possible filter that can appear in a docdb read/write request is 'partition_column_values' and it must be equality '='. Any non-equality (e.g. '<') against a hash column will be left out of 'partition_column_values' by YB pushdown logic and the futher selection of rows that have matched 'partition_column_values' will be performed by postgres code. In other words, YSQL will not pushdown any non-equality filter on a hash column to docdb. Currently YSQL is based upon PG 11.2 and only supports deterministic collation. For deterministic collation, collation encoding is only needed for non-equality comparisons. To do equality '=' check, we do not need collation encoding. Given that, we can do space optimization for a hash column with non-C collation: (1) only store the original text value in such column (2) if a text value is bound to a hash column, only pass the original text value I added YBNeedCollationEncoding that only returns true for range columns to support the above two changes. Also found a bug that appears in case of foreign key reference. In this case a foreign key table references a base table. The value that is inserted into the foreign key table is used to lookup the base table. It needs to be encoded against the base table column collation, not the foreign key table column collation. Test Plan: 1. Run regression tests with collation disabled (default build). 2. Run regression tests with collation enabled and default database collation is still "C" (FLAGS_TEST_pg_collation_enabled=true) 3. Run regression tests with collation enabled and set default database collation to "en_US.UTF-8" (FLAGS_TEST_pg_collation_enabled=true and kTestOnlyUseOSDefaultCollation=true). Reviewers: mihnea, dmitry Reviewed By: dmitry Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D13144

Summary: This diff addresses a number of known issues with recently added YSQL collation support and also enable YSQL collation support by default. 1. Disallow alter column collation unless the old and new collation match exactly. For example: ``` create table foo(id text); CREATE TABLE insert into foo values ('aaaa'); INSERT 0 1 alter table foo alter column id set data type text collate "en_US.utf8"; ysqlsh:altc1.sql:5: ERROR: This ALTER TABLE command is not yet supported. ``` In the example, column 'id' has default collation, but the alter statement tries to change it to "en_US.utf8" which is different. An alter text column command can succeed only when collations do not change. For example: one can change column type from varchar(10) to varchar(20): ``` create table foo(id varchar(10) collate "en_US.utf8"); CREATE TABLE create index foo_id_idx on foo(id collate "C" desc); CREATE INDEX insert into foo values ('aaaa'); INSERT 0 1 alter table foo alter column id set data type varchar(20) collate "en_US.utf8"; ALTER TABLE ``` In YSQL, collation change can imply on-disk change due to collation-encoding: in docdb a character data is stored with a collation sort key. Different collation can have different sort key for the same character data. Currently YSQL only supports very limited alter column command when no on-disk change is possible. Given that, we also simply disallow column collation change. 2. Disallow creating database with any non-C collation. In my previous change, I accidently enabled create database command to create a database with a non-C collation. This isn't intended as we still assume default collation is C. In addition, any non-C collation can imply perf/storage cost. So at this time we should continue to disallow this. We can enhance it in the future if needed. For example, the following command continues to fail: ``` yugabyte=# create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0; create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0; ERROR: Value other than 'C' for lc_collate option is not yet supported LINE 1: create database db LC_COLLATE = "en_US.utf8" TEMPLATE templa... ^ HINT: Please report the issue on https://github.com/YugaByte/yugabyte-db/issues ``` 3. Disallow text_pattern_ops/bpchar_pattern_ops/varchar_pattern_ops in index creation unless the indexed column has "C" collation. For example, ``` create table foo(id char(10) collate "en_US.utf8"); CREATE TABLE create index foo_id_idx on foo(id bpchar_pattern_ops asc); ysqlsh:pat3.sql:14: ERROR: could not use operator class "bpchar_pattern_ops" with column collation "en_US.utf8" HINT: Use the COLLATE clause to set "C" collation explicitly. ``` The semantics of bpchar_pattern_ops is to create an index such that the index keys are no longer sorted according to the base table column collation "en_US.utf8". Instead it sorts the index keys as if the collation is "C". However postgres does not change the index column collation to "C". Instead, it relies upon bpchar_pattern_ops to select a custom comparator bttext_pattern_cmp to do the comparison. That's why currently in YSQL collation support YB will detect that the index column has "en_US.utf8" collation and the index keys will still be sorted according to "en_US.utf8" not "C" collation. Therefore it is disallowed and a hint is given to user to use a work around such as: ``` create index foo_id_idx on foo(id collate "C" asc); ``` Historically, postgres only supported database collation that is decided at initdb time via OS env variable LC_COLLATE. As a result, normal index will be sorted according to collation determined by LC_COLLATE. Such an index is not usable by certain operators such as LIKE. As a work around, postgres provided these *_pattern_ops as work around to build an index that ignores LC_COLLATE (i.e., use C collation for the index). Now that postgres supports column collation which can override the database collation, *_pattern_ops are no longer needed. 4. Changed QLValue string_value field from 'string' to 'bytes' to suppress a error message. A collation sort key is a null-terminated byte sequence (without any embedded \0 byte). As a result the collation encoded string may contain invalid UTF-8 characters. However it is set as a QLValue string_value field in place of the original string value which is UTF-8. Protobuf reports invalid UTF-8 as an ERROR even though the collation-encoded string is still sent across the wire without any loss. 5. Added upgrade support Test Plan: ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressTypesString' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressExtension' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPartitions' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressDml' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPlpgsql' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressFeature' ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Reviewers: mihnea, alex, dmitry Reviewed By: alex, dmitry Subscribers: ksreenivasan, yql Differential Revision: https://phabricator.dev.yugabyte.com/D13363

Summary: Yifan found that TestPgRegressTypesString is failing on Mac platform because of the newly added yb_pg_collate_icu_utf8.sql test. In particular, the test uses "en_US.utf8" collation which has a different name on Mac machine: "en_US.UTF-8". Because the collation name "en_US.utf8" does not exist on Mac machine, the test did not execute as expected. I changed the test to use ICU collation "en-US-x-icu" which exists as an identical name on both linux and Mac platforms. In addition, the test yb_pg_collate_icu_utf8.sql should use ICU collation and "en_US.utf8" is not an ICU collation. Test Plan: Run the following test on both Centos and Mac: ./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressTypesString' Reviewers: yguan Reviewed By: yguan Subscribers: mihnea, yql Differential Revision: https://phabricator.dev.yugabyte.com/D13800

Summary: Currently, the upgrade script `V12__1127__pg_collation.sql` calls `pg_import_system_collations` to import all the new YSQL collations from OS thirdparty libc collations and ICU collations. There are a total of 785 new collations need to be imported. Internally, `pg_import_system_collations` causes 785 individual insertions into the pg_catalog.pg_collation table. In addition, 772 collations have associated comments. That again triggered another 772 individual insertions into the pg_catalog.pg_description table. For example: for ICU collation `en-US-x-icu`, pg_import_system_collations generated the equivalent of the following two inserts: ``` INSERT INTO pg_catalog.pg_collation (oid, collname, collnamespace, collowner, collprovider, collencoding, collcollate, collctype, collversion) VALUES (12699, 'en-US-x-icu', 11, 10, 'i', -1, 'en-US', 'en-US', '153.14'); INSERT INTO pg_catalog.pg_description (objoid, classoid, objsubid, description) VALUES (12699, 3456, 0, 'English (United States)'); ``` I changed `V12__1127__pg_collation.sql` based upon advices and SQL scripts by Alex so that we do batch insertions of 785 pg_collation rows and another batch insertions of 772 pg_description rows in a single transaction to improve the overall collation upgrade execution speed. Test Plan: Jenkins: all (1) ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYsqlUpgrade' Look into the test logs to find out how long it takes to apply V12__1127__pg_collation.sql to 6 databases. Before the change, it was about 151 seconds. After the change, it was about 85 seconds. (2) ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Before the change, it was about 39 seconds. After the change, it was about 19 seconds. Reviewers: alex Reviewed By: alex Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D13888

Summary: Currently, the upgrade script `V12__1127__pg_collation.sql` calls `pg_import_system_collations` to import all the new YSQL collations from OS thirdparty libc collations and ICU collations. There are a total of 785 new collations need to be imported. Internally, `pg_import_system_collations` causes 785 individual insertions into the pg_catalog.pg_collation table. In addition, 772 collations have associated comments. That again triggered another 772 individual insertions into the pg_catalog.pg_description table. For example: for ICU collation `en-US-x-icu`, pg_import_system_collations generated the equivalent of the following two inserts: ``` INSERT INTO pg_catalog.pg_collation (oid, collname, collnamespace, collowner, collprovider, collencoding, collcollate, collctype, collversion) VALUES (12699, 'en-US-x-icu', 11, 10, 'i', -1, 'en-US', 'en-US', '153.14'); INSERT INTO pg_catalog.pg_description (objoid, classoid, objsubid, description) VALUES (12699, 3456, 0, 'English (United States)'); ``` I changed `V12__1127__pg_collation.sql` based upon advices and SQL scripts by Alex so that we do batch insertions of 785 pg_collation rows and another batch insertions of 772 pg_description rows in a single transaction to improve the overall collation upgrade execution speed. Original commit: D13888 / 85ed864 Test Plan: Jenkins: all, rebase: 2.11.0 (1) ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYsqlUpgrade' Look into the test logs to find out how long it takes to apply V12__1127__pg_collation.sql to 6 databases. Before the change, it was about 151 seconds. After the change, it was about 85 seconds. (2) ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Before the change, it was about 39 seconds. After the change, it was about 19 seconds. Reviewers: alex Reviewed By: alex Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D13917

Summary: This bug was found during collation test plan review. Consider this example ``` CREATE TABLE t(id TEXT COLLATE "en_US.utf8", PRIMARY KEY(id ASC)) SPLIT AT VALUES (('abc')); INSERT INTO t VALUES ('ab'), ('abcd'); ``` Table `t` has two tablets with 'abc' as the split point. One would expect 'ab' in one tablet and 'abcd' in another tablet. However, currently we use `value->constcollid` when processing the split point 'abc' which has database default collation. At present YSQL restricts database can only have "C" collation. Therefore 'abc' will not be collation-encoded. On the other hand, when 'ab' and 'abcd' are inserted into table `t`, they will be encoded using column collation "en_US.utf8". As a result both collation encoded values are compared larger than non-collation-encoded 'abc' and therefore are inserted into the same tablet. I made a change to use the column collation `attr->attcollation` when processing the split point 'abc'. In this way when comparing 'ab' or 'abcd' against 'abc', all of them will be collation-encded with "en_US.utf8". We assume collation-encoding will likely to retain the similar key distribution as the original text values. New test added. Test Plan: ./yb_build.sh release --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.CollationRangePresplit ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressTypesString' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressExtension' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPartitions' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressDml' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPlpgsql' ./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressFeature' ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Reviewers: mihnea, dmitry Reviewed By: dmitry Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D13933

Summary: The test TestYsqlUpgrade#migratingIsEquivalentToReinitdb is failing on Mac platform because of recent change I made to replace `pg_import_system_collations` call by a set of SQL insert statements to improve YSQL upgrade performance. This is because the SQL insert statements only run successfully on Linux but failed on Mac OS: The collation "en_US.utf8" on Linux is called "en_US.UTF-8". I changed the SQL script to detect the underlining OS platform and use the appropriate name to allow the test to pass on both Linux and Mac platforms. Test Plan: Jenkins: all Run the following command on both Centos and Mac. ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade' Reviewers: alex Reviewed By: alex Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D14070

Summary: I added a new java unit test for YSQL collation support. The test is adapted from the existing testIndex so that we can test column collation for both table and different types of indexes. Test Plan: ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesEn' ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesSv' ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesTr' Reviewers: oleg Reviewed By: oleg Subscribers: yql Differential Revision: https://phabricator.dev.yugabyte.com/D14373

nocaway added kind/enhancement This is an enhancement of an existing feature area/ysql Yugabyte SQL (YSQL) labels Apr 4, 2019

nocaway assigned mbautin Apr 4, 2019

nocaway self-assigned this Apr 4, 2019

ndeodhar mentioned this issue Jun 4, 2019

Create Index with Column Collate Not supported Yet #1480

Open

WesleyW added area/ysql Yugabyte SQL (YSQL) and removed area/ysql Yugabyte SQL (YSQL) labels Jun 21, 2019

irizzant mentioned this issue Feb 4, 2020

[YSQL] Feature support - ALTER TABLE #1124

Open

nocaway assigned m-iancu and ndeodhar and unassigned mbautin and nocaway Jul 9, 2020

ndeodhar changed the title ~~[YSQL] Feature Support - Column Constraints~~ [YSQL] Feature Support - Column Collation Aug 5, 2020

ddorian mentioned this issue Oct 9, 2020

Django compatibility #4054

Open

adil246 mentioned this issue Feb 4, 2021

[RFE|BUG] Better collation for non-latin alphabets #7102

Closed

CharlotteRose pinned this issue Feb 9, 2021

epratt-yb unpinned this issue Feb 16, 2021

hbhanawat mentioned this issue Mar 23, 2021

YB Backend for Django #7764

Closed

m-iancu mentioned this issue Jul 26, 2021

[YSQL] Collation Support #7853

Closed

m-iancu assigned myang2021 and unassigned m-iancu and ndeodhar Jul 26, 2021

radekg mentioned this issue Aug 10, 2021

[YSQL] Foreign Data Wrapper support #9650

Merged

m-iancu mentioned this issue Nov 1, 2021

[Docs] Add docs for collation support #10455

Closed

m-iancu added the roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. label Nov 4, 2021

m-iancu added this to YQL-beta Nov 4, 2021

yugabyte-ci added the priority/medium Medium priority issue label Jun 8, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[YSQL] Feature Support - Column Collation #1127

[YSQL] Feature Support - Column Collation #1127

nocaway commented Apr 4, 2019 •

edited by yugabyte-ci

Loading

irizzant commented Jan 31, 2020

ghost commented Mar 23, 2021

[YSQL] Feature Support - Column Collation #1127

[YSQL] Feature Support - Column Collation #1127

Comments

nocaway commented Apr 4, 2019 • edited by yugabyte-ci Loading

Phase 1: Deterministic Collations (PG 11)

Phase 2: Non-Deterministic Collations (PG 12/13)

irizzant commented Jan 31, 2020

ghost commented Mar 23, 2021

nocaway commented Apr 4, 2019 •

edited by yugabyte-ci

Loading