
Protocol RFC for collations #3068

Merged: 6 commits into delta-io:master, May 27, 2024
Conversation

@olaky (Contributor) commented on May 8, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Protocol RFC for adding collation support to Delta

Design doc


Excerpt from the RFC under review:

Part | Description
-|-
Provider | Name of the provider. Must not contain dots.

A reviewer commented:

Since the provider must not contain dots and the name can't contain dots either, should the provider perhaps be optional?

@olaky (Contributor, Author) replied:

How would it work with an optional provider when we allow dots in versions? For example, if we have

`provider.name.version` and `name2.version.1`

how do we create an unambiguous parsing rule for this?
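
To make the ambiguity concrete, here is a minimal sketch (my own illustration, not part of the RFC) of a parsing rule that works when the provider is mandatory and both provider and name are dot-free, and of why making the provider optional breaks it:

```scala
// Sketch only: assumes a mandatory, dot-free provider and a dot-free name,
// so the first two dots delimit the fields and the version keeps any
// remaining dots (e.g. "72" or "1.2.3").
object CollationIdParse {
  def parse(id: String): (String, String, String) = {
    val Array(provider, rest) = id.split("\\.", 2)
    val Array(name, version)  = rest.split("\\.", 2)
    (provider, name, version)
  }

  def main(args: Array[String]): Unit = {
    println(parse("icu.en_US.72")) // (icu,en_US,72)
    // With an *optional* provider, "name2.version.1" has two valid readings:
    //   (provider = name2, name = version, version = 1)   or
    //   (name = name2, version = "version.1"),
    // so no unique parse exists.
  }
}
```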

@felipepessoto (Contributor) commented on May 8, 2024

Hi @olaky. I have some questions please:

  1. Are we introducing a new writerFeature? I don't see any mention of that.
  2. The design doc says no readerFeature will be added. In that case a reader that doesn't understand collations could return incorrect results, for example `SELECT COUNT(*) FROM table GROUP BY CaseInsensitiveColumn`, or when filtering. Is this expected?
  3. Do you know how Spark handles the backward-compatibility issue described above for Parquet tables?

Thanks.

@olaky (Contributor, Author) commented on May 8, 2024

> Hi @olaky. I have some questions please:
>
>   1. Are we introducing a new writerFeature? I don't see any mention of that.
>   2. The design doc says no readerFeature will be added. In that case a reader that doesn't understand collations could return incorrect results, for example `SELECT COUNT(*) FROM table GROUP BY CaseInsensitiveColumn`, or when filtering. Is this expected?
>   3. Do you know how Spark handles the backward-compatibility issue described above for Parquet tables?
>
> Thanks.

Hi @felipepessoto,

thanks for taking an interest in this project.

1: The idea is indeed to add a writer feature. But because we want to keep the collation information in the field metadata of the table schema, it might not actually be a requirement. The protocol is designed so that writers can still provide statistics for UTF8_BINARY (the collation currently in use), and so that clients that do not understand collations do not break. I will need some time to explore whether we can get away without a writer feature, because that would actually be nice.
2: This is a difficult decision indeed, and yes, results can differ between clients that respect collations and clients that ignore them. However, many of the engines that can read and write Delta do not support collations and are unlikely to do so in the mid term. This is why I prefer not to have a reader feature: it would make tables with collations unreadable by many clients for a very long time, which is not desirable.
3: Spark will also not require clients to know about collations for the Parquet tables it writes (this is work in progress).

One more thing to point out here: Hive tables, and HMS by extension, do not support collations. So if we used a schema that forces clients to know about collations, the table would no longer be Hive-compliant. The same goes for Delta UniForm, because Iceberg has no way to specify collations.
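
The reason collation-unaware clients keep working is that the column type itself stays a plain string; the collation travels in the schema's field metadata, which older readers simply ignore. A hypothetical shape (my own sketch; `collation` is a placeholder key here, not the key the RFC actually defines) could look like:

```
{
  "name": "caseInsensitiveColumn",
  "type": "string",
  "nullable": true,
  "metadata": { "collation": "icu.en_US.72" }
}
```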

@cstavr (Contributor) left a comment:

Looks good to me. Left some nits.

5 review comments on protocol_rfcs/collated-string-type.md (outdated, resolved)
vkorukanti pushed a commit that referenced this pull request on May 21, 2024:

…3117)

## Description

This refactoring adds support for nested statistics columns. So far, all
statistics are keys in the stats struct in AddFiles. This PR adds
support for statistics that are part of nested structs. This is a
prerequisite for file skipping on collated string columns ([Protocol
RFC](#3068)). Statistics for
collated string columns will be wrapped in a struct keyed by the
versioned collation that was used to generate them. For example:

```
"stats": { "statsWithCollation": { "icu.en_US.72": { "minValues": { ...} } } }
```

This PR replaces statType in StatsColumn with pathToStatType, which can
be used to represent a path. This way we can re-use all of the existing
data skipping code without changes.
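
As a rough sketch of the idea (hypothetical code, not the actual Delta implementation; names follow the PR description): a single stat type name becomes a path of keys into the stats struct, so collated statistics just use a longer path:

```scala
// Hypothetical illustration of replacing a single stat type with a path.
object PathToStatTypeSketch {
  // Before: one key, e.g. "minValues".
  // After: a path of keys into the (possibly nested) stats struct.
  final case class StatsColumn(pathToStatType: Seq[String])

  def main(args: Array[String]): Unit = {
    // Uncollated stats live at the top level, as before:
    val plain = StatsColumn(Seq("minValues"))
    // Collated stats are nested under the versioned collation:
    val collated = StatsColumn(Seq("statsWithCollation", "icu.en_US.72", "minValues"))
    println(plain.pathToStatType.mkString("."))    // minValues
    println(collated.pathToStatType.mkString(".")) // statsWithCollation.icu.en_US.72.minValues
  }
}
```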

## How was this patch tested?
It is not possible to test this change without altering
[statsSchema](https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala#L285).
I would still like to ship this PR separately because the change is big
enough in itself. There is existing test coverage for stats parsing and
file skipping, but none of those tests uses nested statistics yet.

## Does this PR introduce _any_ user-facing changes?
No
@vkorukanti (Collaborator) left a comment:

A few naive questions (apologies in advance).

4 review comments on protocol_rfcs/collated-string-type.md (resolved; 3 outdated)
@vkorukanti merged commit 6c0137b into delta-io:master on May 27, 2024
10 checks passed
longvu-db pushed a commit to longvu-db/delta that referenced this pull request on May 28, 2024.