Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36954: [Python] Add more FlightInfo / FlightEndpoint attributes #43537

Merged
merged 32 commits into from
Nov 7, 2024

Conversation

EnricoMi
Copy link
Contributor

@EnricoMi EnricoMi commented Aug 2, 2024

Rationale for this change

The C++ classes FlightInfo and FlightEndpoint have attributes that are not available via the Python API.

What changes are included in this PR?

Make the following attributes available in Python:

  • FlightInfo.ordered
  • FlightInfo.app_metadata
  • FlightEndpoint.expiration_time
  • FlightEndpoint.app_metadata

Also makes existing attributes optional in constructor:

  • FlightInfo.total_records
  • FlightInfo.total_bytes

Are these changes tested?

Existing tests that test existing attributes are extended.

Are there any user-facing changes?

Yes, changes are backward compatible.

@EnricoMi EnricoMi requested a review from lidavidm as a code owner August 2, 2024 13:42
Copy link

github-actions bot commented Aug 2, 2024

⚠️ GitHub issue #36954 has been automatically assigned in GitHub to PR creator.

f"locations={self.locations!r}>")
f"locations={self.locations!r} "
f"expiration_time={self.expiration_time} "
f"app_metadata='{self.app_metadata.hex()}'>")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the C++ implementation of ToString. In Python, we could easily use self.app_metadata, which represents the metadata bytes in ASCII with automatic escaping of non-printable characters, e.g. b'abc\xe2\x80\xa6', which is more readable in the absence of non-printable characters than .hex().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we don't need to use the same style as C++ in Python.
So we can remove .hex().

@jorisvandenbossche What do you think about this?

Comment on lines 891 to 894
total_records : int optional, default None
the total records in this flight, or -1 if unknown.
total_bytes : int optional, default None
the total bytes in this flight, or -1 if unknown.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While adding new attributes ordered and app_metadata as optional arguments, which simplifies user and test code, making existing total_records and total_bytes optional sounds like a good idea

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using -1 not None as the default value because the document says -1 if unknown not None if unknown?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use -1 though it is less pythonic, as it is more consistent with C++.

@@ -950,7 +1005,9 @@ cdef class FlightInfo(_Weakrefable):
f"descriptor={self.descriptor} "
f"endpoints={self.endpoints} "
f"total_records={self.total_records} "
f"total_bytes={self.total_bytes}>")
f"total_bytes={self.total_bytes} "
f"ordered={'true' if self.ordered else 'false'} "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This matches the C++ implementation of ToString. In Python, you would expect True or False though. Happy to change this to

Suggested change
f"ordered={'true' if self.ordered else 'false'} "
f"ordered={self.ordered} "

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can use True/False in Python.

"locations=[]>")
"locations=[] "
"expiration_time=2023-04-05 12:34:56+00:00 "
"app_metadata='656e64706f696e7420617070206d65746164617461'>")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could read (which is more readable)

app_metadata=b'endpoint app metadata'

if we wouldn't stick to the C++ implementation of ToString here.

"total_bytes=-1>")
"total_records=1 "
"total_bytes=42 "
"ordered=true "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, you would expect

ordered=True

here.

f"locations={self.locations!r}>")
f"locations={self.locations!r} "
f"expiration_time={self.expiration_time} "
f"app_metadata='{self.app_metadata.hex()}'>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we don't need to use the same style as C++ in Python.
So we can remove .hex().

@jorisvandenbossche What do you think about this?

Comment on lines 891 to 894
total_records : int optional, default None
the total records in this flight, or -1 if unknown.
total_bytes : int optional, default None
the total bytes in this flight, or -1 if unknown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using -1 not None as the default value because the document says -1 if unknown not None if unknown?

@@ -950,7 +1005,9 @@ cdef class FlightInfo(_Weakrefable):
f"descriptor={self.descriptor} "
f"endpoints={self.endpoints} "
f"total_records={self.total_records} "
f"total_bytes={self.total_bytes}>")
f"total_bytes={self.total_bytes} "
f"ordered={'true' if self.ordered else 'false'} "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we can use True/False in Python.

@@ -736,6 +743,12 @@ cdef class FlightEndpoint(_Weakrefable):
CLocation.Parse(tobytes(location)).Value(&c_location))
self.endpoint.locations.push_back(c_location)

if expiration_time is not None:
self.endpoint.expiration_time = time_point(
microseconds(expiration_time.cast(timestamp("us")).value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is microseconds correct?
I think that we should use nanoseconds:

/// \brief A timestamp compatible with Protocol Buffer's
/// google.protobuf.Timestamp:
///
/// https://protobuf.dev/reference/protobuf/google.protobuf/#timestamp
///
/// > A Timestamp represents a point in time independent of any time
/// > zone or calendar, represented as seconds and fractions of
/// > seconds at nanosecond resolution in UTC Epoch time. It is
/// > encoded using the Proleptic Gregorian Calendar which extends the
/// > Gregorian calendar backwards to year one. It is encoded assuming
/// > all minutes are 60 seconds long, i.e. leap seconds are "smeared"
/// > so that no leap second table is needed for interpretation. Range
/// > is from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59.999999999Z.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer nanoseconds as well, but that did not work on macOS:

 /Users/runner/work/arrow/arrow/build/python/build/temp.macosx-10.9-universal2-cpython-311/_flight.cpp:31893:102: error: no matching conversion for functional-style cast from 'std::chrono::nanoseconds' (aka 'duration<long long, ratio<1LL, 1000000000LL>>') to 'std::chrono::system_clock::time_point' (aka 'time_point<std::chrono::system_clock>')
      __pyx_v_self->endpoint.expiration_time = ((std::optional<std::chrono::system_clock::time_point> )std::chrono::system_clock::time_point(std::chrono::nanoseconds(__pyx_t_12)));
                                                                                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

https://github.com/EnricoMi/arrow/actions/runs/10083176212/job/27879168275#step:8:3720

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use std::chrono::duration_cast<Timestamp::duration>() like

Timestamp expiration_time(
std::chrono::duration_cast<Timestamp::duration>(expiration_time_duration));
with Cython?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That did the trick, thanks for the pointer!

if expiration_time is not None:
self.endpoint.expiration_time = time_point(duration_cast[time_point.duration](
nanoseconds(expiration_time.cast(timestamp("ns")).value)))

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Aug 2, 2024
@rustyconover
Copy link

+1 Looking forward for this being merged.

@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Aug 3, 2024
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No additional comments from me.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 5, 2024
@EnricoMi EnricoMi force-pushed the more-flightendpoint-attributes branch 3 times, most recently from 0506ead to e5ece4b Compare August 6, 2024 11:12
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some minor comments

python/pyarrow/_flight.pyx Show resolved Hide resolved
Comment on lines 891 to 894
total_records : int optional, default -1
the total records in this flight, or -1 if unknown.
total_bytes : int optional, default -1
the total bytes in this flight, or -1 if unknown.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None would be a preferrable default for a Python API, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me too, but the attributes will have to return -1 (because of the C++ backend), so that would be a bit inconsistent. @kou your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Python attributes can (and probably should?) return None.

Copy link
Contributor Author

@EnricoMi EnricoMi Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that would be pythonic and consistent (with None default values). But that might be a breaking change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None isn't a real value for these. (It's also inconsistent with the underlying protobuf, which doesn't really have optional values, and would default to 0 for a truly unspecified value.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what a "real value" is? None is the standard placeholder in Python for missing or unknown data. I agree that it would be a slight compatibility break to start returning None instead of -1, but that's for the better IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying structure does not admit a null value. It is Protobuf, there are no true null values (*this has changed slightly in recent versions but we do not use that codegen option for this structure).

Let's be clear here, though. Is this for input, or output? I would be -1 to ever returning None for this attribute. If it's merely to accept it as an input on construction that might be OK, but it would be misleading to people when it doesn't round trip.

Copy link
Contributor Author

@EnricoMi EnricoMi Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not affect the underlying structure. The Python FlightEnpoint can take None values and translate to -1 for the underlying C++ class. The getter methods can translate those -1 back to None, so this Python class feels pythonic.

Using -1 in constructor would continue to work while getters would return None.

User code like if endpoint.total_records == -1 would break. This would have to be migrated to if endpoint.total_records is None, which is for the better, as @pitrou phrased it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The getter methods can translate those -1 back to None, so this Python class feels pythonic.

This is what I would object to.

The Python FlightEnpoint can take None values and translate to -1 for the underlying C++ class.

I don't mind this though it should be clearly documented (as the underlying Protobuf does not take None).

Copy link
Contributor Author

@EnricoMi EnricoMi Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize, the constructor allows for -1 and None, the properties in both cases return -1. Added test to highlight this.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 7, 2024
Comment on lines 769 to 771
if self.endpoint.expiration_time.has_value():
time_since_epoch = duration_cast[nanoseconds](
self.endpoint.expiration_time.value().time_since_epoch()).count()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some existing helpers defined for time point to nanoseconds conversion:

inline int64_t TimePoint_to_ns(TimePoint val) { return val.time_since_epoch().count(); }

(and similar for the other way around, TimePoint_from_ns)

That could maybe be used here? (didn't check in detail if it applies here, but they are already declared in libarrow_python.pxd, and then the new chrono.pxd might not be needed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking, thanks for the pointer!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplified constructor and def expiration_time, could removed chrono.pxd with that

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 30, 2024
@EnricoMi EnricoMi force-pushed the more-flightendpoint-attributes branch from 0b49a09 to 94a40b2 Compare November 7, 2024 11:01
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Nov 7, 2024
@lidavidm lidavidm merged commit b193c4f into apache:main Nov 7, 2024
13 checks passed
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Nov 7, 2024
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Nov 7, 2024

@lidavidm thanks!

EnricoMi added a commit to EnricoMi/arrow that referenced this pull request Nov 7, 2024
Copy link

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit b193c4f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

EnricoMi added a commit to EnricoMi/arrow that referenced this pull request Nov 7, 2024
EnricoMi added a commit to EnricoMi/arrow that referenced this pull request Nov 7, 2024
@EnricoMi EnricoMi deleted the more-flightendpoint-attributes branch November 8, 2024 08:10
EnricoMi added a commit to EnricoMi/arrow that referenced this pull request Nov 8, 2024
EnricoMi added a commit to EnricoMi/arrow that referenced this pull request Nov 8, 2024
@EnricoMi
Copy link
Contributor Author

EnricoMi commented Nov 8, 2024

Fixing the Timestamp type in C++ and reverting the workaround of this PR is done in #44681.

lidavidm pushed a commit that referenced this pull request Nov 14, 2024
…ound from #43537 (#44681)

### What changes are included in this PR?
Make `arrow.flight.Timestamp` nanoseconds precision and OS-independent.

Use the `CTimePoint` for `FlightEndpoint.expiration_time` to have OS-independent nanoseconds precision.

https://github.com/apache/arrow/blob/093655c60783321c786a8e69632f185d37520f4d/python/pyarrow/includes/libarrow_flight.pxd#L133-L139

This reverts workaround for `expiration_time` from  #43537.

### Are these changes tested?
Existing unit tests are adjusted to the new precision.

### Are there any user-facing changes?

**This PR includes breaking changes to public APIs.**

The values of `FlightEndpoint.expiration_time` now have nanoseconds precision on any operating system.

* GitHub Issue: #44679

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants