Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adopt kernel schemas and improve protocol support #1756

Merged
merged 17 commits into from
Nov 6, 2023

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Oct 22, 2023

Description

this PR depends on #1741.

Migrating the implementation of actions and schema over from kernel. The schema is much more complete in terms of the more recent delta features and more rigorously leverages the rust type system.

Related Issue(s)

Documentation

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate rust labels Oct 22, 2023
}
PrimitiveType::Timestamp => {
// Issue: https://github.com/delta-io/delta/issues/643
Ok(ArrowDataType::Timestamp(TimeUnit::Microsecond, None))
Copy link
Collaborator

@ion-elgreco ion-elgreco Oct 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about timestamps that were written with the UTC timezone? Are we doing anything with the IsAdjustedToUtc flag in the parquet timestamp column to return the arrow datatype as Timestamp(tz='UTC')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now this is not handled. Within delta this is also gated behind the timestampNtz feature. As this PR is quite hard to review due to its size, I focussed on not adjusting any business logic but rather only update the schema / action definitions and getting things to build. So support for that feature should be in a follow-up PR.

@github-actions github-actions bot removed the binding/python Issues for the Python package label Oct 24, 2023
@github-actions github-actions bot added the binding/python Issues for the Python package label Oct 24, 2023
@roeap roeap marked this pull request as ready for review October 26, 2023 05:56
@roeap
Copy link
Collaborator Author

roeap commented Oct 26, 2023

@wjones127 @rtyler - There are some failing python tests related to how we handle schema metadata, where we need to make a decision on the kernel side. Other then that, I think it's ready for review.

Since this PR is again quite sizable, I focussed on not making any changes to the logic and limit impact on the APIs. As such no new feature support in this PR :).

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving this forward. This will be good for unblocking us on making progress on newer protocol versions :)

rust/src/operations/create.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should fix the bigint bug.

The other Python failure about metadata is a result of the metadata keys being strings. That seems like a case where the test might have been wrong, right?

#[derive(Debug, Serialize, Deserialize, PartialEq, Clone)]
#[serde(rename_all = "camelCase")]
/// Primitive types supported by Delta
pub enum PrimitiveType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this. ❤️

@github-actions github-actions bot removed rust binding/rust Issues for the Rust crate labels Nov 5, 2023
@roeap
Copy link
Collaborator Author

roeap commented Nov 5, 2023

@wjones127 - made display for primitive types consistent with protocol...

The other Python failure about metadata is a result of the metadata keys being strings. That seems like a case where the test might have been wrong, right?

This is an open question to me and mainly motivated by not wanting serde_json in kernel. in practice it seems spark/delta encodes nested structures as a json string. We have been allowing arbitrary json data in our metadata fields.

Raised this in the kernel sync - i.e. does delta metadata support nested structures? - but jury is still out in that :). As for going forward here I would appreciate your opinion. Continuing to use Value for now or refactoring the test are both fairly easily done.

@wjones127
Copy link
Collaborator

Raised this in the kernel sync - i.e. does delta metadata support nested structures? - but jury is still out in that :). As for going forward here I would appreciate your opinion. Continuing to use Value for now or refactoring the test are both fairly easily done.

Absent any evidence, I don't know either. My slight inclination is to move toward strings.

@roeap
Copy link
Collaborator Author

roeap commented Nov 6, 2023

My slight inclination is to move toward strings.

I share that inclination, so just went ahead and made the change :).

@roeap roeap merged commit dcc3a7b into delta-io:main Nov 6, 2023
24 checks passed
@roeap roeap deleted the schema branch November 6, 2023 07:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants