Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python, rust): timestampNtz support #2236

Merged
merged 22 commits into from
Mar 5, 2024

Conversation

ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Mar 3, 2024

Description

  • This addresses all our timestamp inconsistencies, where we were reading Primitive:timestamp as a datatetime without UTC, and now we can properly write datetimes with no timezone as columns to Primitive::timestampNtz.
  • addressing small bug where checkConstraints feature was not set in writerFeatures when you are on table writer version 7.
  • bumping default protocol to 3,7
  • Made the pyarrow writer and reader more flexible so we can write/read a 3,7 table as long as it has the supported features there.
  • Properly parses timestamps with UTC into pyarrow timestamps with UTC
  • Added configkey translation to tablefeature inside the Create Operation

Related Issue(s)

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Mar 3, 2024
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 3, 2024
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Mar 4, 2024
@hntd187
Copy link
Collaborator

hntd187 commented Mar 4, 2024

Also delta-scala doesn't universally cast things to UTC, it respects the isAdjustedToUTC param for timestamps, where timestampntz has this set to false, while normal timestamps are set to true

Comment on lines +286 to +289
"appendOnly" | "delta.appendOnly" => WriterFeatures::AppendOnly,
"invariants" | "delta.invariants" => WriterFeatures::Invariants,
"checkConstraints" | "delta.checkConstraints" => WriterFeatures::CheckConstraints,
"changeDataFeed" | "delta.enableChangeDataFeed" => WriterFeatures::ChangeDataFeed,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ion-elgreco why are these changes required? Are we not picking up table properties correctly right now? If so, is there a bug this is related to?

Copy link
Collaborator Author

@ion-elgreco ion-elgreco Mar 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rtyler I am doing that so I can go from tableConfigKeys into tableFeatures. This is what spark-delta also does if your table protocol versions are 3,7 and you do set tblproperties "delta.enableChangeDataFeed" = true, it will additionally set the feature ChangeDataFeed into the protocol commit.

Maybe there is another way of doing this but I thought it was clever to put it on the from_str :P

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, just passing by, but this seems strange. these are the values written in the feature vector list, so this seems like a strange place to do this conversion as well, isnce it describes completely differnet fields, now? one is values in the feature vecotor field, andother one is keys in the table properties, which should be handled in the table config?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the table config keys here as well for quick parsing to table features, see my comment above.

In spark when your protocol is at 3,7 it will auto convert the configs into table features so that's basically what I am doing here as well.

.as_any()
.downcast_ref::<TimestampMicrosecondArray>()
.map(|v| Self::Timestamp(v.value(index))),
Timestamp(TimeUnit::Microsecond, tz) => match tz {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the CL addresses the TODO no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, forgot to remove that : )

@ion-elgreco ion-elgreco enabled auto-merge (squash) March 5, 2024 17:24
@ion-elgreco ion-elgreco merged commit cbcf6df into delta-io:main Mar 5, 2024
23 checks passed
@@ -468,7 +468,7 @@ fn default_true() -> bool {
}

#[derive(Debug, Serialize, Deserialize, PartialEq, Clone, Eq, Hash)]
#[serde(rename_all = "camelCase")]
#[serde(rename_all = "snake_case")]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure this is right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this it would fail for the timestampNtz, since spark wrote it as timestamp_ntz in the log. I couldnt find his anywhere in the docs, but it only impacted timestampNtz since the rest of them are one word long

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
5 participants