sql: support collations #2473

petermattis · 2015-09-11T13:27:05Z

It looks like Go has fairly good support for collations: https://godoc.org/golang.org/x/text/collate. The challenge is to plumb through the use of the collation everywhere we're performing string comparisons.

derkan · 2016-05-11T15:52:22Z

I think this issue should be labeled as a bug, not an enhancement. ORDER BY on UTF-8 string data sorts in a completely wrong order. PostgreSQL's COLLATE implementation may help.

Example:

create table accounts(id int primary key, balance decimal, name string(64));
insert into accounts values(1,decimal '10.01', 'Slm1');
insert into accounts values(2,decimal '20.02', 'Şlm2');
insert into accounts values(3,decimal '30.03', 'Ümran');
select * from accounts order by name;
+----+---------+--------------+
| id | balance |     name     |
+----+---------+--------------+
|  1 |   10.01 | Slm1         |
|  3 |   30.03 | "\u00dcmran" |
|  2 |   20.02 | "\u015elm2"  |
+----+---------+--------------+

It doesn't understand UTF-8 encoded characters; it treats the UTF-8 character codes as if they were an ordinary string.

petermattis · 2016-09-08T18:05:48Z

Cc @eisenstatdavid

eisenstatdavid · 2016-09-08T18:45:59Z

The byte-by-byte string comparison lives in cockroach/sql/parser/datum.go (line 505 in be463fc):

func (d *DString) Compare(other Datum) int {

Getting a default Unicode collation order might be as simple as using the collation here, but I'm new to this part of the code.

eisenstatdavid · 2016-09-14T17:58:45Z

Here's a proposal for adding collation to CockroachDB, based on PostgreSQL's documentation. (I misunderstood the feature request before.)

Add a language tag to sqlbase.DatabaseDescriptor. The default value is "C", which means collation by bytes. "POSIX" also means collation by bytes. Other values are parsed using the Go collation library.

Add a language tag to sqlbase.ColumnType, arising from a STRING column with a COLLATE annotation. The default value is "default", which is effectively the database language tag.

Add a language tag to DString (sql/parser/datum.go) and an associated enumeration having four levels: ImplicitFromDatabase, for string literals and placeholders; ImplicitFromColumn, for string column values, even if the column inherited its language tag from the database; Explicit, for COLLATE expressions; and Indeterminate (the language tag is invalid).

String operations take the language tag into account as follows.

- COLLATE expressions: these confer the specified language tag at the Explicit level.
- Concatenation: highest tag level wins (ImplicitFromDatabase < ImplicitFromColumn < Explicit < Indeterminate). The concatenation of two strings whose tags are at the same level has the same tag if the languages are the same or else an Indeterminate tag.
- Ordering comparisons: the operands must have compatible tags. Two tags are compatible if neither is indeterminate and both specify the same language.
- Insertions: the inserted value must have the same language tag as the column. PostgreSQL doesn't enforce this, though Bram thinks that T-SQL does. We can always stop enforcement if it's too onerous in practice.

The type checker catches all of the new errors statically. Computing tags is a straightforward bottom-up traversal.

Known difficulties

Sorting strategies need to consider language tags, since the KV store orders primary keys by bytes, not collation.

Collations where two strings with different bytes are equal seem like a headache. I'm particularly worried about primary keys. We should probably not support these collations at first.

eisenstatdavid · 2016-09-14T19:22:23Z

In offline discussion, Vivek raised the question of whether we want to change the key encoding for specially collated columns so that it's possible to extract ranges efficiently. The obvious encoding technique is to emit the collation key followed by the string itself, but this can double the storage needed and doesn't handle collations where byte-unequal strings collate equal. Since we're choosing a data format, we should get this right.

I tried to figure out what PostgreSQL does, but the only relevant piece of documentation that I could find is cryptic:

> The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

LIKE presumably can't use an index because of combining characters.

bdarnell · 2016-09-15T13:05:40Z

I think it's important that queries using a non-C collation can be fast (i.e. use an index), so indexes should store the collation key.

I think using the collation key solves the problem of languages where byte-unequal strings compare equal. Such strings would have byte-equal collation keys.

bdarnell · 2016-09-15T13:06:43Z

We could avoid the double-storage by only storing the collation key in the index, and going back to the primary data row for the full string (a sort of anti-covering index).

maddyblue · 2016-09-15T13:18:43Z

I think it's a bad idea to avoid the double storage if it makes these keys non-covering. The whole point of an index is to trade size for speed.

I think the big problem with the double storage is that it'll break all the existing on-disk strings. There are a number of other on-disk formats we have discussed or would like to change but haven't, since there's no good way to do that except for a full SQL dump and import. So if we decide to change on-disk stuff for this, maybe we should figure out a more general way to do these migrations.

petermattis · 2016-09-15T13:29:39Z

@mjibson I don't think we have to break existing on-disk strings. The collation key would be stored in the index key while the raw key would be stored in the value. The STORING specifier for indexes does exactly this already; we'd just need additional logic. We could even make the decision about whether to increase the storage user-configurable. That is, we could rely on the user adding a STORING specifier to control when a collated string is stored in the value.

eisenstatdavid · 2016-09-20T19:38:06Z

I think it makes sense to implement this in three PRs.

1. COLLATE support for expressions.
2. COLLATE support for columns. Initially, these columns cannot belong to an index.
3. Support for indexes with custom collations.

I'm working on 1 currently.

petermattis · 2016-09-20T20:15:36Z

Sounds good to me.

RaduBerinde · 2017-02-02T23:43:18Z

I think decimals may benefit from a similar split across the key and value: 1 and 1.00 are equal and thus should have the same key in a unique index, but we want to be able to extract the original decimal unmodified (as opposed to reading 1 after inserting 1.00); this scale information would live in the value.

vivekmenezes · 2017-03-15T17:33:36Z

Yay! congrats on fixing this issue!

bdarnell · 2017-06-01T02:07:29Z

There are still two places (1, 2) in sql.y where we reference this issue in unimplementedWithIssue. We should probably create new issues for the remaining functionality (if they don't already exist) and point the error messages to these new issues. Until then, I'm reopening this one to track.

dianasaur323 · 2017-06-16T15:02:03Z

@eisenstatdavid do you mind taking a quick look at Ben's comment to see if we should break this into some new issues? Thanks!!

eisenstatdavid · 2017-06-20T14:36:13Z

Closing this issue in favor of the more specific issues #16618 and #16619. The other place that #2473 appears in the grammar is for altering the type of a column, which is not a limitation specific to collation.

petermattis added the SQL label Sep 11, 2015

petermattis added this to the 1.0 milestone Sep 11, 2015

petermattis added the C-enhancement label and removed the SQL label Feb 13, 2016

knz added the A-sql-semantics label Feb 21, 2016

a-robinson mentioned this issue Sep 8, 2016

ORDER BY on columns containing unicode string #9215

Closed

eisenstatdavid self-assigned this Sep 20, 2016

eisenstatdavid mentioned this issue Nov 10, 2016

sql: collation support, phase one #10605

Merged

dianasaur323 mentioned this issue Nov 12, 2016

Product Roadmap Q4 2016 #10528

Closed

37 tasks

eisenstatdavid mentioned this issue Dec 12, 2016

sql: collated string column values (phase two of collation support) #12294

Merged

RaduBerinde mentioned this issue Feb 2, 2017

sql: inconsistent printing of decimals #13384

Closed

dianasaur323 mentioned this issue Feb 9, 2017

Product Roadmap #13517

Closed

eisenstatdavid closed this as completed in 47e2d6c Mar 15, 2017

bdarnell reopened this Jun 1, 2017

petermattis modified the milestones: 1.1, 1.0 Jun 1, 2017

This was referenced Jun 20, 2017

sql: support database-level collations #16618

Open

sql: support indexing with a different collation #16619

Closed

eisenstatdavid closed this as completed Jun 20, 2017

eisenstatdavid mentioned this issue Jun 20, 2017

sql/parser: update issue numbers related to collated strings #16621

Merged
