Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timezone aware timestamp parsing (#3794) #3795

Merged
merged 3 commits into from
Mar 4, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
133 changes: 88 additions & 45 deletions arrow-cast/src/parse.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,11 +20,7 @@ use arrow_array::ArrowPrimitiveType;
use arrow_schema::ArrowError;
use chrono::prelude::*;

/// Accepts a string in RFC3339 / ISO8601 standard format and some
/// variants and converts it to a nanosecond precision timestamp.
///
/// Implements the `to_timestamp` function to convert a string to a
/// timestamp, following the model of spark SQL’s to_`timestamp`.
/// Accepts a string and parses it relative to the provided `timezone`
///
/// In addition to RFC3339 / ISO8601 standard timestamps, it also
/// accepts strings that use a space ` ` to separate the date and time
Expand All @@ -38,36 +34,6 @@ use chrono::prelude::*;
/// * `1997-01-31 09:26:56.123` # close to RCF3339 but uses a space and no timezone offset
/// * `1997-01-31 09:26:56` # close to RCF3339, no fractional seconds
/// * `1997-01-31` # close to RCF3339, only date no time
//
/// Internally, this function uses the `chrono` library for the
/// datetime parsing
///
/// We hope to extend this function in the future with a second
/// parameter to specifying the format string.
///
/// ## Timestamp Precision
///
/// Function uses the maximum precision timestamps supported by
/// Arrow (nanoseconds stored as a 64-bit integer) timestamps. This
/// means the range of dates that timestamps can represent is ~1677 AD
/// to 2262 AM
///
///
/// ## Timezone / Offset Handling
///
/// Numerical values of timestamps are stored compared to offset UTC.
///
/// This function interprets strings without an explicit time zone as
/// timestamps with offsets of the local time on the machine
///
/// For example, `1997-01-31 09:26:56.123Z` is interpreted as UTC, as
/// it has an explicit timezone specifier (“Z” for Zulu/UTC)
///
/// `1997-01-31T09:26:56.123` is interpreted as a local timestamp in
/// the timezone of the machine. For example, if
/// the system timezone is set to Americas/New_York (UTC-5) the
/// timestamp will be interpreted as though it were
/// `1997-01-31T09:26:56.123-05:00`
///
/// Some formats that supported by PostgresSql <https://www.postgresql.org/docs/current/datatype-datetime.html#DATATYPE-DATETIME-TIME-TABLE>
/// still not supported by chrono, like
Expand All @@ -76,12 +42,14 @@ use chrono::prelude::*;
/// "2023-01-01 040506 +07:30:00",
/// "2023-01-01 04:05:06.789 PST",
/// "2023-01-01 04:05:06.789 -08",
#[inline]
pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {
pub fn string_to_datetime<T: TimeZone>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a question here:
if I want to parse a string with time zone, what should I do with string_to_datetime? Do I have to use string_to_timestamp_nanos ?
The precondition is I do not know the format of timestamp-like string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this question

I would think you could call string_to_datetime<Utc>(...) which would give you a chrono DateTime<Utc> and you can then then do whatever you want with the result (e.g. convert it into a nanosecond timestamp, etc)

timezone: &T,
s: &str,
) -> Result<DateTime<T>, ArrowError> {
// Fast path: RFC3339 timestamp (with a T)
// Example: 2020-09-08T13:42:29.190855Z
if let Ok(ts) = DateTime::parse_from_rfc3339(s) {
return Ok(ts.timestamp_nanos());
return Ok(ts.with_timezone(timezone));
}

// Implement quasi-RFC3339 support by trying to parse the
Expand All @@ -96,14 +64,14 @@ pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {

for f in supported_formats.iter() {
if let Ok(ts) = DateTime::parse_from_str(s, f) {
return to_timestamp_nanos(ts.naive_utc());
return Ok(ts.with_timezone(timezone));
}
}

// with an explicit Z, using ' ' as a separator
// Example: 2020-09-08 13:42:29Z
if let Ok(ts) = Utc.datetime_from_str(s, "%Y-%m-%d %H:%M:%S%.fZ") {
return to_timestamp_nanos(ts.naive_utc());
return Ok(ts.with_timezone(timezone));
}

// Support timestamps without an explicit timezone offset, again
Expand All @@ -112,34 +80,44 @@ pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {
// without a timezone specifier as a local time, using T as a separator
// Example: 2020-09-08T13:42:29.190855
if let Ok(ts) = NaiveDateTime::parse_from_str(s, "%Y-%m-%dT%H:%M:%S%.f") {
return to_timestamp_nanos(ts);
if let Some(offset) = timezone.offset_from_local_datetime(&ts).single() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return Ok(DateTime::from_local(ts, offset));
}
}

// without a timezone specifier as a local time, using T as a
// separator, no fractional seconds
// Example: 2020-09-08T13:42:29
if let Ok(ts) = NaiveDateTime::parse_from_str(s, "%Y-%m-%dT%H:%M:%S") {
return Ok(ts.timestamp_nanos());
if let Some(offset) = timezone.offset_from_local_datetime(&ts).single() {
return Ok(DateTime::from_local(ts, offset));
}
}

// without a timezone specifier as a local time, using ' ' as a separator
// Example: 2020-09-08 13:42:29.190855
if let Ok(ts) = NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S%.f") {
return to_timestamp_nanos(ts);
if let Some(offset) = timezone.offset_from_local_datetime(&ts).single() {
return Ok(DateTime::from_local(ts, offset));
}
}

// without a timezone specifier as a local time, using ' ' as a
// separator, no fractional seconds
// Example: 2020-09-08 13:42:29
if let Ok(ts) = NaiveDateTime::parse_from_str(s, "%Y-%m-%d %H:%M:%S") {
return Ok(ts.timestamp_nanos());
if let Some(offset) = timezone.offset_from_local_datetime(&ts).single() {
return Ok(DateTime::from_local(ts, offset));
}
}

// without a timezone specifier as a local time, only date
// Example: 2020-09-08
if let Ok(dt) = NaiveDate::parse_from_str(s, "%Y-%m-%d") {
if let Some(ts) = dt.and_hms_opt(0, 0, 0) {
return Ok(ts.timestamp_nanos());
if let Some(offset) = timezone.offset_from_local_datetime(&ts).single() {
return Ok(DateTime::from_local(ts, offset));
}
}
}

Expand All @@ -153,6 +131,42 @@ pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {
)))
}

/// Accepts a string in RFC3339 / ISO8601 standard format and some
/// variants and converts it to a nanosecond precision timestamp.
///
/// See [`string_to_datetime`] for the full set of supported formats
///
/// Implements the `to_timestamp` function to convert a string to a
/// timestamp, following the model of spark SQL’s to_`timestamp`.
///
/// Internally, this function uses the `chrono` library for the
/// datetime parsing
///
/// We hope to extend this function in the future with a second
/// parameter to specifying the format string.
///
/// ## Timestamp Precision
///
/// Function uses the maximum precision timestamps supported by
/// Arrow (nanoseconds stored as a 64-bit integer) timestamps. This
/// means the range of dates that timestamps can represent is ~1677 AD
/// to 2262 AM
///
/// ## Timezone / Offset Handling
///
/// Numerical values of timestamps are stored compared to offset UTC.
///
/// This function interprets string without an explicit time zone timestamps
tustvold marked this conversation as resolved.
Show resolved Hide resolved
/// relative to UTC, see [`string_to_datetime`] for alternative semantics
///
/// For example, both `1997-01-31 09:26:56.123Z`, `1997-01-31T09:26:56.123`,
/// and `1997-01-31T14:26:56.123-05:00` will be parsed as the same value
///
Comment on lines +162 to +164
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some context got lost -- I only think this statement is true if the system timezone is set to UTC - 5:

Suggested change
/// For example, both `1997-01-31 09:26:56.123Z`, `1997-01-31T09:26:56.123`,
/// and `1997-01-31T14:26:56.123-05:00` will be parsed as the same value
///
/// For example, `1997-01-31 09:26:56.123Z`, `1997-01-31T09:26:56.123`,
/// and `1997-01-31T14:26:56.123-05:00` will be parsed as the same value
/// if the system timezone is set to Americas/New_York (UTC-5).

Copy link
Contributor Author

@tustvold tustvold Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, that context was actually incorrect. They will always parse as the same regardless of the local timezone. If you want local timezone specific behaviour you should pass Local to the function above

#[inline]
pub fn string_to_timestamp_nanos(s: &str) -> Result<i64, ArrowError> {
to_timestamp_nanos(string_to_datetime(&Utc, s)?.naive_utc())
}

/// Defensive check to prevent chrono-rs panics when nanosecond conversion happens on non-supported dates
#[inline]
fn to_timestamp_nanos(dt: NaiveDateTime) -> Result<i64, ArrowError> {
Expand Down Expand Up @@ -448,6 +462,7 @@ impl Parser for Date64Type {
#[cfg(test)]
mod tests {
use super::*;
use arrow_array::timezone::Tz;

#[test]
fn string_to_timestamp_timezone() {
Expand Down Expand Up @@ -614,6 +629,34 @@ mod tests {
naive_datetime.timestamp_nanos(),
parse_timestamp("2020-09-08 13:42:29").unwrap()
);

let tz: Tz = "+02:00".parse().unwrap();
let date = string_to_datetime(&tz, "2020-09-08 13:42:29").unwrap();
let utc = date.naive_utc().to_string();
assert_eq!(utc, "2020-09-08 11:42:29");
let local = date.naive_local().to_string();
assert_eq!(local, "2020-09-08 13:42:29");

let date = string_to_datetime(&tz, "2020-09-08 13:42:29Z").unwrap();
let utc = date.naive_utc().to_string();
assert_eq!(utc, "2020-09-08 13:42:29");
let local = date.naive_local().to_string();
assert_eq!(local, "2020-09-08 15:42:29");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add some tests to check timestamp is correct?

Suggested change
assert_eq!(local, "2020-09-08 15:42:29");
assert_eq!(local, "2020-09-08 15:42:29");
let &tz = Local::now().offset();
let date = string_to_datetime(&tz, "2020-09-08T13:42:29Z").unwrap();
let dt = NaiveDateTime::parse_from_str("2020-09-08T13:42:29Z","%Y-%m-%dT%H:%M:%SZ").unwrap();
let ts = date.timestamp();
assert_eq!(dt.timestamp(), ts);
let date = string_to_datetime(&tz, "2020-09-08 13:42:29").unwrap();
assert_eq!(dt.timestamp() - (tz.local_minus_utc() as i64), date.timestamp());

Copy link
Contributor Author

@tustvold tustvold Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid using Local::now in the tests as this makes the tests not reproducible across machines. The tests above should cover this, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the tests above can run ok across machines, you can try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do, but they don't consistently test the behaviour. For example my machine (and CI) is set to UTC, which means it won't test the behaviour at all...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use two different time zone to test the behavior. For example:

 let tz1: Tz = "+02:00".parse().unwrap();
let tz2: Tz = "+08:00".parse().unwrap();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a further test, let me know if that isn't what you meant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is OK, thx.


let dt =
NaiveDateTime::parse_from_str("2020-09-08T13:42:29Z", "%Y-%m-%dT%H:%M:%SZ")
.unwrap();
let local: Tz = "+08:00".parse().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


// Parsed as offset from UTC
let date = string_to_datetime(&local, "2020-09-08T13:42:29Z").unwrap();
assert_eq!(dt, date.naive_utc());
assert_ne!(dt, date.naive_local());

// Parsed as offset from local
let date = string_to_datetime(&local, "2020-09-08 13:42:29").unwrap();
assert_eq!(dt, date.naive_local());
assert_ne!(dt, date.naive_utc());
}

#[test]
Expand Down