-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protocol RFC for collations #3068
Conversation
|
||
Part | Description | ||
-|- | ||
Provider | Name of the provider. Must not contain dots |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since provider must and name can't contain dots should the provider perhaps be optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does it work with an optional provider when we allow dots in versions. So for example if we have
prover.name.version
and name2.version.1
how do we create a parsing rule for this?
Hi @olaky. I have some questions please:
Thanks. |
Hi @felipepessoto, thanks for taking an interest in this project. 1: The idea is to add a writer feature indeed. Because we want to keep collation information in field metadata of the table schema, this might actually not be a requirement though. The protocol is designed in a way that writers can provide statistics for UTF8_BINARY (the collation currently used), and that clients that do bot understand collations do not break. I will need some time to explore if we can get away without a writer feature, because this would be nice actually. One more thing to point out here is that Hive tables, and HMS by extension, do not support collations. So if we want to use a schema that forces clients to know about collations, it would also mean that tables are not a hive compliant table any more. Same goes for Delta UniForm, because Iceberg has no way to specify collations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Left some nits.
…3117) ## Description This refactoring adds support for nested statistics columns. So far, all statistics are keys in the stats struct in AddFiles. This PR adds support for statistics that are part of nested structs. This is a prerequisite for file skipping on collated string columns ([Protocol RFC](#3068)). Statistics for collated string columns will be wrapped in a struct keyed by the versioned collation that was used to generate them. For example: ``` "stats": { "statsWithCollation": { "icu.en_US.72": { "minValues": { ...} } } } ``` This PR replaces statType in StatsColumn with pathToStatType, which can be used to represent a path. This way we can re-use all of the existing data skipping code without changes. ## How was this patch tested? It is not possible to test this change without altering [statsSchema](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala#L285). I would still like to ship this PR separately because the change is big enough in itself. There is existing test coverage for stats parsing and file skipping, but none of them uses nested statistics yet. ## Does this PR introduce _any_ user-facing changes? No
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few naive questions (apologies in advance).
…elta-io#3117) ## Description This refactoring adds support for nested statistics columns. So far, all statistics are keys in the stats struct in AddFiles. This PR adds support for statistics that are part of nested structs. This is a prerequisite for file skipping on collated string columns ([Protocol RFC](delta-io#3068)). Statistics for collated string columns will be wrapped in a struct keyed by the versioned collation that was used to generate them. For example: ``` "stats": { "statsWithCollation": { "icu.en_US.72": { "minValues": { ...} } } } ``` This PR replaces statType in StatsColumn with pathToStatType, which can be used to represent a path. This way we can re-use all of the existing data skipping code without changes. ## How was this patch tested? It is not possible to test this change without altering [statsSchema](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala#L285). I would still like to ship this PR separately because the change is big enough in itself. There is existing test coverage for stats parsing and file skipping, but none of them uses nested statistics yet. ## Does this PR introduce _any_ user-facing changes? No
## Description Protocol RFC for adding collation support to Delta [Design doc](https://docs.google.com/document/d/1cwztlKt7b2hWF6Uu1S895ko6jPfRlP9x-V5POUcXtXk/edit?usp=sharing)
Which Delta project/connector is this regarding?
Description
Protocol RFC for adding collation support to Delta
Design doc