Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: support collations #2473

Closed
petermattis opened this issue Sep 11, 2015 · 16 comments
Closed

sql: support collations #2473

petermattis opened this issue Sep 11, 2015 · 16 comments
Assignees
Labels
A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Milestone

Comments

@petermattis
Copy link
Collaborator

It looks like Go has fairly good support for collations: https://godoc.org/golang.org/x/text/collate. The challenge is to plumb through the use of the collation everywhere we're performing string comparisons.

@petermattis petermattis added this to the 1.0 milestone Sep 11, 2015
@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) and removed SQL labels Feb 13, 2016
@derkan
Copy link

derkan commented May 11, 2016

I think this issue should be labeled as a bug, not enhancement. ORDER BY on a utf8 string data shows completely wrong sorting. PostgreSQL's Collate implementation may help.

Example:

create table accounts(id int primary key, balance decimal, name string(64));
insert into accounts values(1,decimal '10.01', 'Slm1');
insert into accounts values(1,decimal '20.02', 'Şlm2');
insert into accounts values(3,decimal '30.03', 'Ümran');
select * from accounts order by name;
+----+---------+--------------+
| id | balance |     name     |
+----+---------+--------------+
|  1 |   10.01 | Slm1         |
|  3 |   30.03 | "\u00dcmran" |
|  2 |   20.01 | "\u015elm2"  |
+----+---------+--------------+

It doesn't understand UTF8 encoded chars, it accounts utf8 char codes as if they ordinary string.

@petermattis
Copy link
Collaborator Author

Cc @eisenstatdavid

@eisenstatdavid
Copy link

The byte-by-byte string comparison lives in

func (d *DString) Compare(other Datum) int {
. Getting a default Unicode collation order might be as simple as using the collation here, but I'm new to this part of the code.

@eisenstatdavid
Copy link

Here's a proposal for adding collation to CockroachDB, based on PostgreSQL's documentation. (I misunderstood the feature request before.)

Add a language tag to sqlbase.DatabaseDescriptor. The default value is "C", which means collation by bytes. "POSIX" also means collation by bytes. Other values are parsed using the Go collation library.

Add a language tag to sqlbase.ColumnType, arising from a STRING column with a COLLATE annotation. The default value is "default", which is effectively the database language tag.

Add a language tag to DString (sql/parser/datum.go) and an associated enumeration having four levels: ImplicitFromDatabase, for string literals and placeholders; ImplicitFromColumn, for string column values, even if the column inherited its language tag from the database; Explicit, for COLLATE expressions; and Indeterminate (the language tag is invalid).

String operations take the language tag into account as follows.

  • COLLATE expressions: these confer the specified language tag at the Explicit level.
  • Concatenation: highest tag level wins (ImplicitFromDatabase < ImplicitFromColumn < Explicit < Indeterminate). The concatenation of two strings whose tags are at the same level has the same tag if the languages are the same or else an Indeterminate tag.
  • Ordering comparisons: the operands must have compatible tags. Two tags are compatible if neither is indeterminate and both specify the same language.
  • Insertions: the inserted value must have the same language tag as the column. PostgreSQL doesn't enforce this, though Bram thinks that T-SQL does. We can always stop enforcement if it's too onerous in practice.

The type checker catches all of the new errors statically. Computing tags is a straightforward bottom-up traversal.

Known difficulties

Sorting strategies need to consider language tags, since the KV store orders primary keys by bytes, not collation.

Collations where two strings with different bytes are equal seem like a headache. I'm particularly worried about primary keys. We should probably not support these collations at first.

@eisenstatdavid
Copy link

In offline discussion, Vivek raised the question of whether we want to change the key encoding for specially collated columns so that it's possible to extract ranges efficiently. The obvious encoding technique is to emit the collation key followed by the string itself, but this can double the storage needed and doesn't handle collations where byte-unequal strings collate equal. Since we're choosing a data format, we should get this right.

I tried to figure out what PostgreSQL does, but the only relevant piece of documentation that I could find is cryptic.

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

LIKE presumably can't use an index because of combining characters.

@bdarnell
Copy link
Contributor

I think it's important that queries using a non-C collation can be fast (i.e. use an index), so indexes should store the collation key.

I think using the collation key solves the problem of languages where byte-unequal strings compare equal. Such strings would have byte-equal collation keys.

@bdarnell
Copy link
Contributor

We could avoid the double-storage by only storing the collation key in the index, and going back to the primary data row for the full string (a sort of anti-covering index).

@maddyblue
Copy link
Contributor

maddyblue commented Sep 15, 2016

I think it's a bad idea to avoid the double storage if it makes these keys non-covering. The whole point of an index is to trade size for speed.

I think the big problem with the double storage is that it'll break all the existing on-disk strings. There's a number of other on-disk formats we have discussed or would like to change but haven't since there's no good way to do that except for a full SQL dump and import. So if we decide to change on-disk stuff for this, maybe we should figure out a more general way to do these migrations.

@petermattis
Copy link
Collaborator Author

@mjibson I don't think we have to break existing on-disk strings. The collation key would be stored in the index key while the raw key would be stored in the value. The STORING specifier for indexes does exactly this already, we'd just need additional logic. We could even make the decision about whether to increase the storage user configurable. That is, we could rely on the user adding a STORING specifier to control when a collated string is stored in the value.

@eisenstatdavid eisenstatdavid self-assigned this Sep 20, 2016
@eisenstatdavid
Copy link

I think it makes sense to implement this in three PRs.

  1. COLLATE support for expressions.
  2. COLLATE support for columns. Initially, these columns cannot belong to an index.
  3. Support for indexes with custom collations.

I'm working on 1 currently.

@petermattis
Copy link
Collaborator Author

Sounds good to me.

@RaduBerinde
Copy link
Member

I think decimals may benefit from a similar split across the key and value: 1 and 1.00 are equal and thus should have the same key in a unique index, but we want to be able to extract the original decimal unmodified (as opposed of reading 1 after inserting 1.00); this scale information would live in the value.

@vivekmenezes
Copy link
Contributor

Yay! congrats on fixing this issue!

@bdarnell
Copy link
Contributor

bdarnell commented Jun 1, 2017

There are still two places (1, 2) in sql.y where we reference this issue in unimplementedWithIssue. We should probably create new issues for the remaining functionality (if they don't already exist) and point the error messages to these new issues. Until then, I'm reopening this one to track.

@bdarnell bdarnell reopened this Jun 1, 2017
@petermattis petermattis modified the milestones: 1.1, 1.0 Jun 1, 2017
@dianasaur323
Copy link
Contributor

@eisenstatdavid do you mind taking a quick look at Ben's comment to see if we should break this into some new issues? Thanks!!

@eisenstatdavid
Copy link

Closing this issue in favor of the more specific issues #16618 and #16619. The other place that #2473 appears in the grammar is for altering the type of a column, which is not a limitation specific to collation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-semantics C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

9 participants