Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] Feature Support - Column Collation #1127

Open
nocaway opened this issue Apr 4, 2019 · 2 comments
Open

[YSQL] Feature Support - Column Collation #1127

nocaway opened this issue Apr 4, 2019 · 2 comments
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list.

Comments

@nocaway
Copy link
Contributor

nocaway commented Apr 4, 2019

Jira Link: DB-1087
Support COLLATE for columns.

Phase 1: Deterministic Collations (PG 11)

Feature Status Comments
Functional Spec and Design Doc https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/ysql-collation-support.md
Part 1: Support compiling code with libicu Only under a special compiler flag
Part 2: Support collation encoding for every text-type Encoding includes ICU sort key plus original string for deterministic collations (see design doc for details)
Part 3: Optimize collation encoding for value columns Sort key not needed for non-key columns
Part 4: Optimize collation encoding for hash key columns Maybe sort key is not needed for hash keys and deterministic collations
Part 5: Import ICU collations and add tests Turn on collations to be enabled by default (remove compiler flag)
Part 6: Add docs for collation Tracked in #10455

Phase 2: Non-Deterministic Collations (PG 12/13)

@nocaway nocaway added kind/enhancement This is an enhancement of an existing feature area/ysql Yugabyte SQL (YSQL) labels Apr 4, 2019
@nocaway nocaway self-assigned this Apr 4, 2019
@WesleyW WesleyW added area/ysql Yugabyte SQL (YSQL) and removed area/ysql Yugabyte SQL (YSQL) labels Jun 21, 2019
@irizzant
Copy link

+1 for this issue this is blocking the restore of my Postgres dump:

LINE 14: ALTER TABLE ONLY crm.customer_relationships ALTER COLUMN ind...
                                                     ^
HINT:  See https://github.com/YugaByte/yugabyte-db/issues/1124. Click '+' on the description to raise its priority
    Command was: CREATE TABLE crm.customer_relationships (
    customer_id character(16) NOT NULL,
    index integer NOT NULL,
    relationship_type character varying(20) NOT NULL,
    assigned_by integer NOT NULL,
    assigned_on timestamp without time zone NOT NULL,
    since date NOT NULL,
    upto date,
    deleted_by integer,
    deleted_on timestamp without time zone,
    record_status character(1) NOT NULL
)
WITH (autovacuum_enabled='true');
ALTER TABLE ONLY crm.customer_relationships ALTER COLUMN index SET STATISTICS 0;

@nocaway nocaway assigned m-iancu and ndeodhar and unassigned mbautin and nocaway Jul 9, 2020
@ndeodhar ndeodhar changed the title [YSQL] Feature Support - Column Constraints [YSQL] Feature Support - Column Collation Aug 5, 2020
@CharlotteRose CharlotteRose pinned this issue Feb 9, 2021
@epratt-yb epratt-yb unpinned this issue Feb 16, 2021
@ghost
Copy link

ghost commented Mar 23, 2021

test=# CREATE TEMP TABLE t0(c0 CHAR(4) COLLATE "C" , c1 REAL  NOT NULL);
ERROR:  COLLATE not supported yet
LINE 1: CREATE TEMP TABLE t0(c0 CHAR(4) COLLATE "C" , c1 REAL  NOT N...
                                        ^
HINT:  See https://github.com/YugaByte/yugabyte-db/issues/1127. Click '+' on the description to raise its priority

myang2021 added a commit that referenced this issue Jun 30, 2021
Summary:
Currently YSQL has very limited support for collation. Collation is about text data sorting
rules and postgres provides collation support via the underlining OS library including
libc and libicu. Simply put, given two text values (aka strings) s1 and s2, collation sort
comparison must be done via strcoll(s1, s2) rather than strcmp(s1, s2).

Postgres supports two types of collations: deterministic collations and non-deterministic
collations. Postgres 11.2 only supports deterministic collations. Postgres 12 and beyond
support both deterministic collations and non-deterministic collations. Non-deterministic
collation comparison of s1 and s2 directly translates to strcoll(s1, s2), and deterministic
collation comparison uses strcmp(s1, s2) as a "tie-breaker" when strcoll(s1, s2) == 0.

This is the first change to support deterministic collation in YSQL to bring YSQL collation
support in sync with PG 11.2. With this change, the normal build will not have any behavior
change and YSQL will continue to report collation related errors. We'll produce a special
build when ICU is enabled:

YB_POSTGRES_WITH_ICU=1 ./yb_build.sh

The plan is that once we fully support deterministic collations in YSQL, we'll remove the
special build flag YB_POSTGRES_WITH_ICU=1 and make that the default. Before that,
we do need to have the env flag in order to trigger building postgres with ICU enabled
(postgres configure option --with-icu). Otherwise the macro USE_ICU will not be defined
and the the ICU relevant code will not be included. To simplify the build command, one
can also do a one time "export YB_POSTGRES_WITH_ICU=1" and then use the default
./yb_build.sh build command instead of adding the "YB_POSTGRES_WITH_ICU=1"
prefix every time.

Test Plan:
(1) regular build: ./yb_build.sh, verified that
(1.1) pg_collation has only 4 rows as before:
```
yugabyte=# select * from pg_collation;
 collname  | collnamespace | collowner | collprovider | collencoding | collcollate | collctype | collversion
-----------+---------------+-----------+--------------+--------------+-------------+-----------+-------------
 default   |            11 |        10 | d            |           -1 |             |           |
 C         |            11 |        10 | c            |           -1 | C           | C         |
 POSIX     |            11 |        10 | c            |           -1 | POSIX       | POSIX     |
 ucs_basic |            11 |        10 | c            |            6 | C           | C         |
(4 rows)
```
(1.2) We continue to report collation related errors:

```
yugabyte=# create collation nd (provider = 'libc', locale='');
ERROR:  CREATE COLLATION not supported yet
LINE 1: create collation nd (provider = 'libc', locale='');
        ^
HINT:  Please report the issue on https://github.com/YugaByte/yugabyte-db/issues
yugabyte=# create table foo(id text collate "C");
ERROR:  COLLATE not supported yet
LINE 1: create table foo(id text collate "C");
                                 ^
HINT:  See #1127. Click '+' on the description to raise its priority
```

(2) ICU build: YB_POSTGRES_WITH_ICU=1 ./yb_build.sh, verified that
(2.1) pg_collation has many rows because it has imported many ICU collations from OS provided libicu library.
```
yugabyte=# select * from pg_collation limit 10;
  collname   | collnamespace | collowner | collprovider | collencoding | collcollate | collctype  | collversion
-------------+---------------+-----------+--------------+--------------+-------------+------------+-------------
 default     |            11 |        10 | d            |           -1 |             |            |
 C           |            11 |        10 | c            |           -1 | C           | C          |
 POSIX       |            11 |        10 | c            |           -1 | POSIX       | POSIX      |
 ucs_basic   |            11 |        10 | c            |            6 | C           | C          |
 en_US.utf8  |            11 |        10 | c            |            6 | en_US.utf8  | en_US.utf8 |
 en_US       |            11 |        10 | c            |            6 | en_US.utf8  | en_US.utf8 |
 und-x-icu   |            11 |        10 | i            |           -1 | und         | und        | 153.14
 af-x-icu    |            11 |        10 | i            |           -1 | af          | af         | 153.14.37
 af-NA-x-icu |            11 |        10 | i            |           -1 | af-NA       | af-NA      | 153.14.37
 af-ZA-x-icu |            11 |        10 | i            |           -1 | af-ZA       | af-ZA      | 153.14.37
(10 rows)

yugabyte=# select count(*) from pg_collation;
 count
-------
   789
(1 row)

```
(2.2) We can now create collation and also support column collation.
```
yugabyte=# create collation nd (provider = 'libc', locale='');
CREATE COLLATION
yugabyte=# create table foo(id text collate "C");
CREATE TABLE
```

Reviewers: mbautin, mihnea

Reviewed By: mihnea

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D11968
@m-iancu m-iancu assigned myang2021 and unassigned m-iancu and ndeodhar Jul 26, 2021
myang2021 added a commit that referenced this issue Sep 8, 2021
Summary:
Currently YSQL has very limited support for collation. Only C collation is
supported. Collation is about text data sorting rules and postgres provides
collation support via the underlining OS library including libc and libicu.
Simply put, given two text values (aka strings) s1 and s2, collation sort
comparison must be done via strcoll(s1, s2) rather than strcmp(s1, s2).

Postgres supports two types of collations: deterministic collations and
non-deterministic collations. Postgres 11.2 only supports deterministic
collations. Postgres 12 and beyond support both deterministic collations and
non-deterministic collations. Non-deterministic collation comparison of s1 and
s2 directly translates to strcoll(s1, s2), and deterministic collation
comparison uses strcmp(s1, s2) as a "tie-breaker" when strcoll(s1, s2) == 0.

This is the second change to support deterministic collation in YSQL to bring
YSQL collation support in sync with PG 11.2. With this change, --with-icu is now
always on when building postgres so that ICU code is always built into postgres.
The normal build are not expected to have any behavior change and YSQL will
continue to report collation related errors. We'll produce a special build with
ICU enabled via a new gflag: FLAGS_TEST_pg_collation_enabled=true. This gflag
is used during development to enable collation support. However, once collation
encoded data lands in docdb, turning off this gflag is not supported.

The plan is that once we fully support deterministic collations in YSQL, we'll
turn on FLAGS_TEST_pg_collation_enabled=true by default.

The high level changes are:
(1) In addition to type and modifier, for character data types that YSQL
expects, collation related information is now also passed from postgres
layer to PgGate when constructing any PgExpr. Collation info currently
includes:
```
  typedef struct PgCollationInfo {
    bool collate_is_valid_non_c;
    const char *sortkey;
  } YBCPgCollationInfo;
```
collate_is_valid_non_c indicates whether the collation is a valid non-C
collation. A non-C collation requires strcoll semantics. In contrast C collation
only requires strcmp semantics which is much more efficient.

sortkey is the collation sortkey that will be stored in docdb and used for
byte-wise comparison. It is null-terminated.

(2) When building PgExpr, collate_is_valid_non_c is recorded as
'collate_is_valid_non_c_' in PgExpr.

(3) When building PgConstant, sortkey is recorded as collation_sortkey_ in
PgConstant.

(4) For C collation PgConstant, collation_sortkey_ should be nullptr and we
continue to store the string value that postgres layer has passed down as we
currently do. In this way we try to minimize the performance penalty for text
data that do not use collation.

(5) For non-C collation PgConstant, we store a collation-encoded string value.
which is composed of a fixed leading bytes, followed by a null-terminated sort
key, followed by the original character string value.

(6) A sort key of a string value is computed such that strcmp(sortkey(s1),
sortkey(s2)) == strcoll(s1, s2). Two different s1 and s2 may have identical sort
keys. For deterministic collations, we need to use strcmp(s1, s2) as a
tie-breaker when strcmp(sortkey(s1), sortkey(s2)) == 0. That's why the original
string value is appended after the null-terminated sort key to complete a
collation-encoded string.

(7) The collation-encoded string value replaces the original value and is sent
to DocDB for storage.

(8) On the way back from DocDB, PgGate will use the null byte that separates the
sort key from the original string value to strip off the leading bytes and the
sort key, only hands the original string value over to postgres layer above.

(9) A new test only C constant kTestOnlyUseOSDefaultCollation is used for
testing purpose. It can only be set when FLAGS_pg_collation_enabled is true
(otherwise assertion fails). Once set to true, a clean build is needed which
includes initdb and CreateInitialSysCatalogSnapshot. All the initial databases
will have text columns default to en_US.UTF-8 collation. Therefore in regression
tests all text columns in user tables will also have en_US.UTF-8 collation.
Because this is a non-C collation, the new code path that stores
collation-encoded strings is tested.

(10) Planed optimization: store only the original value for a non-key column to
save storage space, also for hash key, consider store only the original value as
well if feasible.

(11) Known issue: a sort key is a null-terminated byte sequence (without any
embedded \0 byte). As a result the collation encoded string may contain invalid
UTF-8 characters. However it is set as a QLValue string field in place of the
original string value which is UTF-8. Protobuf reports invalid UTF-8 as an ERROR
even though the collation-encoded string is still sent across the wire without
any loss.

Test Plan:
1. Do default build and run regression tests. Verify:
(1.1) pg_collation has only 4 rows as before.

```
yugabyte=# select * from pg_collation;
select * from pg_collation;
 collname  | collnamespace | collowner | collprovider | collencoding |
collcollate
| collctype | collversion
-----------+---------------+-----------+--------------+--------------+-------------
+-----------+-------------
 default   |            11 |        10 | d            |           -1 |
|           |
 C         |            11 |        10 | c            |           -1 | C
| C         |
 POSIX     |            11 |        10 | c            |           -1 | POSIX
| POSIX     |
 ucs_basic |            11 |        10 | c            |            6 | C
| C         |
(4 rows)
```

(1.2) We continue to report collation related errors.
```
yugabyte=# create collation nd (provider = 'libc', locale='');
create collation nd (provider = 'libc', locale='');
ERROR:  CREATE COLLATION not supported yet
LINE 1: create collation nd (provider = 'libc', locale='');
        ^
HINT:  Please report the issue on https://github.com/YugaByte/yugabyte-db/issues
yugabyte=# create table foo(id text collate "C");
create table foo(id text collate "C");
ERROR:  COLLATE not supported yet
LINE 1: create table foo(id text collate "C");
                                 ^
HINT:  See #1127. Click '+' on the
description to raise its priority
```
(1.3) All initial databases have collation "C" as expected.
```
yugabyte=# select datname, encoding, datcollate from pg_database;
select datname, encoding, datcollate from pg_database;
     datname     | encoding | datcollate
-----------------+----------+------------
 template1       |        6 | C
 template0       |        6 | C
 postgres        |        6 | C
 yugabyte        |        6 | C
 system_platform |        6 | C
(5 rows)

```

2. Do build with FLAGS_TEST_pg_collation_enabled=true, also in
YBIsCollationEnabled set default env value of
YBCIsEnvVarTrueWithDefault("FLAGS_TEST_pg_collation_enabled" to true,
run regression tests. Verify:
(2.1) pg_collation has many rows because it has imported many ICU collations
from OS provided libicu library.
```
yugabyte=# select * from pg_collation limit 10;
select * from pg_collation limit 10;
  collname   | collnamespace | collowner | collprovider | collencoding |
collcollat
e | collctype  | collversion
-------------+---------------+-----------+--------------+--------------+-----------
--+------------+-------------
 default     |            11 |        10 | d            |           -1 |
  |            |
 C           |            11 |        10 | c            |           -1 | C
  | C          |
 POSIX       |            11 |        10 | c            |           -1 | POSIX
  | POSIX      |
 ucs_basic   |            11 |        10 | c            |            6 | C
  | C          |
 en_US.utf8  |            11 |        10 | c            |            6 |
en_US.utf8
  | en_US.utf8 |
 en_US       |            11 |        10 | c            |            6 |
en_US.utf8
  | en_US.utf8 |
 und-x-icu   |            11 |        10 | i            |           -1 | und
  | und        | 153.14
 af-x-icu    |            11 |        10 | i            |           -1 | af
  | af         | 153.14.37
 af-NA-x-icu |            11 |        10 | i            |           -1 | af-NA
  | af-NA      | 153.14.37
 af-ZA-x-icu |            11 |        10 | i            |           -1 | af-ZA
  | af-ZA      | 153.14.37
(10 rows)

yugabyte=# select count(*) from pg_collation;
select count(*) from pg_collation;
 count
-------
   789
(1 row)
```
(2.2) All initial databases have collation "C" as expected.
```
yugabyte=# select datname, encoding, datcollate from pg_database;
select datname, encoding, datcollate from pg_database;
     datname     | encoding | datcollate
-----------------+----------+------------
 template1       |        6 | C
 template0       |        6 | C
 postgres        |        6 | C
 yugabyte        |        6 | C
 system_platform |        6 | C
(5 rows)
```
(2.3) We can now create collation and also support column collation.
```
yugabyte=# create collation nd (provider = 'libc', locale='');
create collation nd (provider = 'libc', locale='');
CREATE COLLATION
yugabyte=# create table foo(id text collate "C");
create table foo(id text collate "C");
CREATE TABLE
yugabyte=# create table bar(id text collate "en_US.utf8");
create table bar(id text collate "en_US.utf8");
```
(2.4) We can see collation "C" and "en_US.utf8" sort 'a' and 'A' differently
which is consistent with postgres.
```
yugabyte=# insert into foo values ('a');
insert into foo values ('a');
INSERT 0 1
yugabyte=# insert into foo values ('A');
insert into foo values ('A');
INSERT 0 1
yugabyte=# select id from foo order by id;
select id from foo order by id;
 id
----
 A
 a
(2 rows)
yugabyte=# insert into bar values ('a');
insert into bar values ('a');
INSERT 0 1
yugabyte=# insert into bar values ('A');
insert into bar values ('A');
INSERT 0 1
yugabyte=# select id from bar order by id;
select id from bar order by id;
 id
----
 a
 A
(2 rows)
```

3. Do build with same gflags settings as 2, also set
kTestOnlyUseOSDefaultCollation=true and run regression tests. Verify:
(3.1) Same as (2.1)
(3.2) All initial databases have collation "en_US.UTF-8".
```
yugabyte=# select datname, encoding, datcollate from pg_database;
select datname, encoding, datcollate from pg_database;
     datname     | encoding | datcollate
-----------------+----------+-------------
 template1       |        6 | en_US.UTF-8
 template0       |        6 | en_US.UTF-8
 postgres        |        6 | en_US.UTF-8
 yugabyte        |        6 | en_US.UTF-8
 system_platform |        6 | en_US.UTF-8
(5 rows)
```
(3.3) Same as (2.3)
(3.4) Because "en_US.UTF-8" is the default collation, a simple text column will
use it implicitly and will be sorted according to collation "en_US.UTF-8".
```
yugabyte=# create table text_tab (id text);
create table text_tab (id text);
CREATE TABLE
yugabyte=# insert into text_tab values ('a');
insert into text_tab values ('a');
INSERT 0 1
yugabyte=# insert into text_tab values ('A');
insert into text_tab values ('A');
INSERT 0 1
yugabyte=# select id from text_tab order by id;
select id from text_tab order by id;
 id
----
 a
 A
(2 rows)
```
(3.5) Explicitly specified collation "C" will be sorted according to collation
"C".
```
yugabyte=# create table text_tab2(id text collate "C");
create table text_tab2(id text collate "C");
CREATE TABLE
yugabyte=# insert into text_tab2 values ('a');
insert into text_tab2 values ('a');
INSERT 0 1
yugabyte=# insert into text_tab2 values ('A');
insert into text_tab2 values ('A');
INSERT 0 1
yugabyte=# select id from text_tab2 order by id;
select id from text_tab2 order by id;
 id
----
 A
 a
(2 rows)
```

Reviewers: mihnea, mbautin, neil, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D12330
myang2021 added a commit that referenced this issue Sep 21, 2021
Summary:
So far YSQL has added collation support by always performing collation encoding
on any non-C collation string constant value that is sent to docdb. The
collation encoded result can be memcmp'ed by docdb to achieve the same
comparison semantics such that for two strings s1 and s2, their collation
encoded results are e(s1), and e(s2), we have

  strcoll(s1, s2) == memcmp(e(s1), e(s2)

Docdb is implemented as rocksdb which only performs memcmp on keys, not values.
If a postgres table column is part of a primary key, it will be stored in the
key part of the rocksdbs of the tablets for the table. If a postgres table
column appears in an index, it will also be stored in the key part of the
rocksdbs of the tablets for the index. In both cases, because they are stored in
the key part of a rocksdb, collation-encoding is needed to ensure correct
comparison semantics.

However, if a postgres table column is neither part of a primary key, not it
is used to build any index, then it is a non-key column and will be stored in
the value part of rocksdb. Rocksdb does not perform memcmp on its value part,
therefore performing collation-encoding is not needed. For space efficiency, we
should store the original string value by removing the sortkey from the
collation-encoded string. This diff implements this space optimization via the
following steps:

(1) Added PgDml::GetColumnInfo that for given column so that we can tell whether
the column represents a primary key column. Note that for both YB base table and
YB index table, a column is either a primary key column (that composes the
primary key), or a value column.
(2) At each bind point, for value column, change collation id to InvalidOid as
encoding collation.  This has the effect to disable collation encoding so that
the original PG character string value will be passed to docdb.

Test Plan:
1. Run regression tests with collation disabled (default build).
2. Run regression tests with collation enabled and default database collation is
still "C" (FLAGS_TEST_pg_collation_enabled=true)
3. Run regression tests with collation enabled and set default database
collation to "en_US.UTF-8" (FLAGS_TEST_pg_collation_enabled=true and
kTestOnlyUseOSDefaultCollation=true).

Reviewers: mihnea, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D12962
myang2021 added a commit that referenced this issue Oct 7, 2021
Summary:
In recently added YSQL collation support, YSQL performs collation encoding if a
non-C collation string constant value is bound to a primary key text column.
This includes a primary key column in the base table, or a primary key column in
a secondary index table. The primary key can have only one column, or it can
have multiple columns. YSQL uses docdb to store table data and index data
persistently. Docdb is implemented as rocksdb which is a sorted key-value store
and performs memcmp on keys to do lookups. At docdb side, the primary key will
be encoded and stored in the key part. If a text column is stored in the
key-part and has non-C collation then the column value is collation encoded.
This is to ensure correct comparison semantics so that docdb can do correct
lookups.

A YSQL primary key is composed of two components: hash component and range
component. Either component may be optional but at least one component must
exist. When both components exist, hash component must be followed by range
component. A hash component can include multiple columns and they are called
hash columns.

When YSQL processes a DML statement, it can pushdown expressions to docdb when
possible to avoid unnecessary data reading across network. For example,
conditional expressions that are used to filter rows based upon range columns
can directly appear as 'condition_expr' in the docdb read/write requests. Docdb
will use 'condition_expr' to select those matching rows by performing memcmp on
rocksdb keys against various values in the 'condition_expr' and select or reject
a row based upon the memcmp result.

If a primary key has a hash component, then the only possible filter that can
appear in a docdb read/write request is 'partition_column_values' and it must be
equality '='. Any non-equality (e.g. '<') against a hash column will be left out
of 'partition_column_values' by YB pushdown logic and the futher selection of
rows that have matched 'partition_column_values' will be performed by postgres
code. In other words, YSQL will not pushdown any non-equality filter on a hash
column to docdb.

Currently YSQL is based upon PG 11.2 and only supports deterministic collation.
For deterministic collation, collation encoding is only needed for non-equality
comparisons. To do equality '=' check, we do not need collation encoding.

Given that, we can do space optimization for a hash column with non-C collation:
(1) only store the original text value in such column
(2) if a text value is bound to a hash column, only pass the original text value

I added YBNeedCollationEncoding that only returns true for range columns to
support the above two changes. Also found a bug that appears in case of foreign
key reference. In this case a foreign key table references a base table. The
value that is inserted into the foreign key table is used to lookup the base
table. It needs to be encoded against the base table column collation, not the
foreign key table column collation.

Test Plan:
1. Run regression tests with collation disabled (default build).
2. Run regression tests with collation enabled and default database collation is
still "C" (FLAGS_TEST_pg_collation_enabled=true)
3. Run regression tests with collation enabled and set default database
collation to "en_US.UTF-8" (FLAGS_TEST_pg_collation_enabled=true and
kTestOnlyUseOSDefaultCollation=true).

Reviewers: mihnea, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13144
myang2021 added a commit that referenced this issue Nov 3, 2021
Summary:
This diff addresses a number of known issues with recently added YSQL collation
support and also enable YSQL collation support by default.

1. Disallow alter column collation unless the old and new collation match
exactly. For example:
```
create table foo(id text);
CREATE TABLE
insert into foo values ('aaaa');
INSERT 0 1
alter table foo alter column id set data type text collate "en_US.utf8";
ysqlsh:altc1.sql:5: ERROR:  This ALTER TABLE command is not yet supported.
```
In the example, column 'id' has default collation, but the alter statement tries
to change it to "en_US.utf8" which is different. An alter text column command
can succeed only when collations do not change. For example: one can change
column type from varchar(10) to varchar(20):

```
create table foo(id varchar(10) collate "en_US.utf8");
CREATE TABLE
create index foo_id_idx on foo(id collate "C" desc);
CREATE INDEX
insert into foo values ('aaaa');
INSERT 0 1
alter table foo alter column id set data type varchar(20) collate "en_US.utf8";
ALTER TABLE
```

In YSQL, collation change can imply on-disk change due to collation-encoding: in
docdb a character data is stored with a collation sort key. Different collation
can have different sort key for the same character data. Currently YSQL only
supports very limited alter column command when no on-disk change is possible.
Given that, we also simply disallow column collation change.

2. Disallow creating database with any non-C collation.
In my previous change, I accidently enabled create database command to create a
database with a non-C collation. This isn't intended as we still assume default
collation is C. In addition, any non-C collation can imply perf/storage cost. So
at this time we should continue to disallow this. We can enhance it in the
future if needed. For example, the following command continues to fail:

```
yugabyte=# create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0;
create database db LC_COLLATE = "en_US.utf8" TEMPLATE template0;
ERROR:  Value other than 'C' for lc_collate option is not yet supported
LINE 1: create database db LC_COLLATE = "en_US.utf8" TEMPLATE templa...
                           ^
HINT:  Please report the issue on https://github.com/YugaByte/yugabyte-db/issues
```

3. Disallow text_pattern_ops/bpchar_pattern_ops/varchar_pattern_ops in index
creation unless the indexed column has "C" collation. For example,

```
create table foo(id char(10) collate "en_US.utf8");
CREATE TABLE
create index foo_id_idx on foo(id bpchar_pattern_ops asc);
ysqlsh:pat3.sql:14: ERROR:  could not use operator class "bpchar_pattern_ops" with column collation
"en_US.utf8"
HINT:  Use the COLLATE clause to set "C" collation explicitly.
```
The semantics of bpchar_pattern_ops is to create an index such that the index
keys are no longer sorted according to the base table column collation
"en_US.utf8". Instead it sorts the index keys as if the collation is "C".
However postgres does not change the index column collation to "C". Instead, it
relies upon bpchar_pattern_ops to select a custom comparator bttext_pattern_cmp
to do the comparison. That's why currently in YSQL collation support YB will
detect that the index column has "en_US.utf8" collation and the index keys will
still be sorted according to "en_US.utf8" not "C" collation. Therefore it is
disallowed and a hint is given to user to use a work around such as:
```
create index foo_id_idx on foo(id collate "C" asc);
```

Historically, postgres only supported database collation that is decided at
initdb time via OS env variable LC_COLLATE. As a result, normal index
will be sorted according to collation determined by LC_COLLATE. Such an index is
not usable by certain operators such as LIKE. As a work around, postgres
provided these *_pattern_ops as work around to build an index that ignores
LC_COLLATE (i.e., use C collation for the index). Now that postgres supports
column collation which can override the database collation, *_pattern_ops are
no longer needed.

4. Changed QLValue string_value field from 'string' to 'bytes' to suppress a
error message. A collation sort key is a null-terminated byte sequence (without
any embedded \0 byte). As a result the collation encoded string may contain
invalid UTF-8 characters. However it is set as a QLValue string_value field in
place of the original string value which is UTF-8. Protobuf reports invalid
UTF-8 as an ERROR even though the collation-encoded string is still sent across
the wire without any loss.

5. Added upgrade support

Test Plan:
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressTypesString'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressExtension'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPartitions'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressDml'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPlpgsql'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressFeature'
./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade'

Reviewers: mihnea, alex, dmitry

Reviewed By: alex, dmitry

Subscribers: ksreenivasan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13363
@m-iancu m-iancu added the roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. label Nov 4, 2021
@m-iancu m-iancu added this to YQL-beta Nov 4, 2021
myang2021 added a commit that referenced this issue Nov 5, 2021
Summary:
Yifan found that TestPgRegressTypesString is failing on Mac platform because of the newly added
yb_pg_collate_icu_utf8.sql test. In particular, the test uses "en_US.utf8" collation which has a
different name on Mac machine: "en_US.UTF-8". Because the collation name "en_US.utf8" does not exist
on Mac machine, the test did not execute as expected.  I changed the test to use ICU collation
"en-US-x-icu" which exists as an identical name on both linux and Mac platforms. In addition, the
test yb_pg_collate_icu_utf8.sql should use ICU collation and "en_US.utf8" is not an ICU collation.

Test Plan:
Run the following test on both Centos and Mac:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestPgRegressTypesString'

Reviewers: yguan

Reviewed By: yguan

Subscribers: mihnea, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13800
myang2021 added a commit that referenced this issue Nov 11, 2021
Summary:
Currently, the upgrade script `V12__1127__pg_collation.sql` calls `pg_import_system_collations` to
import all the new YSQL collations from OS thirdparty libc collations and ICU collations. There are
a total of 785 new collations need to be imported. Internally, `pg_import_system_collations` causes
785 individual insertions into the pg_catalog.pg_collation table. In addition, 772 collations have
associated comments. That again triggered another 772 individual insertions into the
pg_catalog.pg_description table. For example: for ICU collation `en-US-x-icu`,
pg_import_system_collations generated the equivalent of the following two inserts:

```
INSERT INTO pg_catalog.pg_collation (oid, collname, collnamespace, collowner, collprovider,
collencoding, collcollate, collctype, collversion) VALUES (12699, 'en-US-x-icu', 11, 10, 'i', -1,
'en-US', 'en-US', '153.14');
INSERT INTO pg_catalog.pg_description (objoid, classoid, objsubid, description) VALUES (12699, 3456,
0, 'English (United States)');
```

I changed `V12__1127__pg_collation.sql` based upon advices and SQL scripts by Alex so that we do batch
insertions of 785 pg_collation rows and another batch insertions of 772 pg_description rows in a
single transaction to improve the overall collation upgrade execution speed.

Test Plan:
Jenkins: all

(1) ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYsqlUpgrade'
Look into the test logs to find out how long it takes to apply V12__1127__pg_collation.sql to 6
databases. Before the change, it was about 151 seconds. After the change, it was about 85 seconds.

(2) ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade'
Before the change, it was about 39 seconds. After the change, it was about 19 seconds.

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13888
myang2021 added a commit that referenced this issue Nov 12, 2021
Summary:
Currently, the upgrade script `V12__1127__pg_collation.sql` calls `pg_import_system_collations` to
import all the new YSQL collations from OS thirdparty libc collations and ICU collations. There are
a total of 785 new collations need to be imported. Internally, `pg_import_system_collations` causes
785 individual insertions into the pg_catalog.pg_collation table. In addition, 772 collations have
associated comments. That again triggered another 772 individual insertions into the
pg_catalog.pg_description table. For example: for ICU collation `en-US-x-icu`,
pg_import_system_collations generated the equivalent of the following two inserts:

```
INSERT INTO pg_catalog.pg_collation (oid, collname, collnamespace, collowner, collprovider,
collencoding, collcollate, collctype, collversion) VALUES (12699, 'en-US-x-icu', 11, 10, 'i', -1,
'en-US', 'en-US', '153.14');
INSERT INTO pg_catalog.pg_description (objoid, classoid, objsubid, description) VALUES (12699, 3456,
0, 'English (United States)');
```

I changed `V12__1127__pg_collation.sql` based upon advices and SQL scripts by Alex so that we do batch
insertions of 785 pg_collation rows and another batch insertions of 772 pg_description rows in a
single transaction to improve the overall collation upgrade execution speed.

Original commit: D13888 / 85ed864

Test Plan:
Jenkins: all, rebase: 2.11.0

(1) ./yb_build.sh debug --java-test 'org.yb.pgsql.TestYsqlUpgrade'
Look into the test logs to find out how long it takes to apply V12__1127__pg_collation.sql to 6
databases. Before the change, it was about 151 seconds. After the change, it was about 85 seconds.

(2) ./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade'
Before the change, it was about 39 seconds. After the change, it was about 19 seconds.

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13917
myang2021 added a commit that referenced this issue Nov 16, 2021
Summary:
This bug was found during collation test plan review. Consider this example

```
   CREATE TABLE t(id TEXT COLLATE "en_US.utf8",
                  PRIMARY KEY(id ASC)) SPLIT AT VALUES (('abc'));
   INSERT INTO t VALUES ('ab'), ('abcd');
```

Table `t` has two tablets with 'abc' as the split point. One would expect 'ab' in one tablet and
'abcd' in another tablet.  However, currently we use `value->constcollid` when processing the split
point 'abc' which has database default collation. At present YSQL restricts database can only have
"C" collation.  Therefore 'abc' will not be collation-encoded. On the other hand, when 'ab' and
'abcd' are inserted into table `t`, they will be encoded using column collation "en_US.utf8". As a
result both collation encoded values are compared larger than non-collation-encoded 'abc' and
therefore are inserted into the same tablet.

I made a change to use the column collation `attr->attcollation` when processing the split point
'abc'. In this way when comparing 'ab' or 'abcd' against 'abc', all of them will be collation-encded
with "en_US.utf8". We assume collation-encoding will likely to retain the similar key distribution
as the original text values.

New test added.

Test Plan:
./yb_build.sh release --cxx-test pgwrapper_pg_libpq-test --gtest_filter PgLibPqTest.CollationRangePresplit
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressTypesString'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressExtension'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPartitions'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressDml'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressPlpgsql'
./yb_build.sh release --java-test 'org.yb.pgsql.TestPgRegressFeature'
./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade'

Reviewers: mihnea, dmitry

Reviewed By: dmitry

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D13933
myang2021 added a commit that referenced this issue Nov 29, 2021
Summary:
The test TestYsqlUpgrade#migratingIsEquivalentToReinitdb is failing on Mac platform because of
recent change I made to replace `pg_import_system_collations` call by a set of SQL insert statements
to improve YSQL upgrade performance. This is because the SQL insert statements only run successfully
on Linux but failed on Mac OS: The collation "en_US.utf8" on Linux is called "en_US.UTF-8".

I changed the SQL script to detect the underlining OS platform and use the appropriate name
to allow the test to pass on both Linux and Mac platforms.

Test Plan:
Jenkins: all
Run the following command on both Centos and Mac.

./yb_build.sh release --java-test 'org.yb.pgsql.TestYsqlUpgrade'

Reviewers: alex

Reviewed By: alex

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D14070
myang2021 added a commit that referenced this issue Dec 17, 2021
Summary:
I added a new java unit test for YSQL collation support. The test is adapted from the
existing testIndex so that we can test column collation for both table and different
types of indexes.

Test Plan:
./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesEn'
./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesSv'
./yb_build.sh debug --java-test 'org.yb.pgsql.TestYbBackup#testCollationIndexTypesTr'

Reviewers: oleg

Reviewed By: oleg

Subscribers: yql

Differential Revision: https://phabricator.dev.yugabyte.com/D14373
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list.
Projects
Status: No status
Development

No branches or pull requests

8 participants