Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-206: Expose a C++ api to compare ranges of slots between two arrays #80

Closed
wants to merge 6 commits into from

Conversation

emkornfield
Copy link
Contributor

@emkornfield emkornfield commented May 20, 2016

@wesm the need for this grew out of @fengguangyuan PR to add struct type (#66) and struct builder. I considered a different APIs before settling on this:

  1. Add an API that took the parent bitmask (this potentially has possibility of being the most performant, but would have a more awkward contract then provided here)
  2. Add an equality comparison for a single slot (leaves the least amount of room for optimization but it would be the simplest to implement).
  3. This API which potentially leaves some room for optimization but I think places the least requirements on the caller.

Let me know if you would prefer a different API.

WIP because I need to add more unit tests (I also need to think about if it is worth mirroring the EqualsExact in addition to the Equals method). Which I should get to by the end of the weekend.

@fengguangyuan let me know if this makes sense to you as a way forward on your PR

@fengguangyuan
Copy link

fengguangyuan commented May 20, 2016

@emkornfield
Thanks for your works.
Certainly, these new methods are helpful in some cases, including PR #66, but I have to say that this PR may be a little far away from the topic we discussed in it.

Honestly, but uncertainly however, I've to say, for Struct case, the key point is how to determine the equality of Struct bitmap to the child fields' slots, i.e. whether these bitmaps are comparable. But I don't think any clues are exposed from your current API.

Again, in terms of PR #66, if it's right to compare the slots in Struct bitmap with the corresponding slots in the child fields' bitmap. Or if it's not OK, then, what's the plan to implement the Validate method.
Thank you! :)

int32_t start_idx, int32_t end_idx, const std::shared_ptr<Array>& arr) const {
if (this == arr.get()) { return true; }
if (this->type_enum() != arr->type_enum()) { return false; }
auto other = static_cast<ListArray*>(arr.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const

@wesm
Copy link
Member

wesm commented May 20, 2016

This seems like a reasonable API to me. I don't think it's worth worrying about performance too much on these methods for the moment until we see applications where it matters.

@fengguangyuan if the struct-level bitmap indicates that the slot is null, there is no need to inspect the child arrays. if the value is not null, then the child arrays' bitmaps and values can be inspected as needed.

@emkornfield
Copy link
Contributor Author

@fengguangyuan The plan for validate on structs should be that all children arrays are of the same length as the struct array, and that all children arrays are valid. No comparison of null bitmaps.

For equality, (to reiterate Wes's pont), you only check equality for slots on the children that are valid in the parent struct.

@@ -48,6 +48,14 @@ bool Array::EqualsExact(const Array& other) const {
return true;
}

bool Array::RangeEqualsExact(int32_t start_idx, int32_t end_idx, const Array& arr) const {
if (this == &arr) { return true; }
for (int i = start_idx; i < end_idx; ++i) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you thinking of the case of start_idx == end_idx or something else?

All else being equal it seems that an empty set should equal an empty. Do you disagree?

@emkornfield emkornfield changed the title [WIP] ARROW-206: Expose a C++ api to compare ranges of slots between two arrays ARROW-206: Expose a C++ api to compare ranges of slots between two arrays May 23, 2016
@emkornfield
Copy link
Contributor Author

I had to modify the API to take the start index of the other array, to handle edge cases for list arrays. I added some basic sanity unit tests, but I could add more. I feel not-horrible about the coverage (I think this just end of weekend laziness though), so if you would like me to add more please let me know.

@fengguangyuan
Copy link

Thanks for your answering. But currently, the check for equality is ignored, considering its potentially expensive costs.

Personally, I trust your changes will make the API more flexible and stronger. Good work. :)

@wesm
Copy link
Member

wesm commented May 23, 2016

Looks like a good checkpoint to me. One thought was that the RangeEquals might take the number of elements compared rather than the exclusive end index, but this is fine as is.

+1, thank you

@asfgit asfgit closed this in cd1d770 May 23, 2016
@xhochy
Copy link
Member

xhochy commented May 29, 2016

Just rebased my current changes on this and realised a disadvantage of it: PrimtiveArray is now an abstract class, not sure if we want to keep this.

@emkornfield
Copy link
Contributor Author

@xhochy can you give more details on where this caused you problems?

@xhochy
Copy link
Member

xhochy commented May 29, 2016

I used to instantiate a PrimitiveArray directly here: https://github.com/apache/arrow/pull/83/files#diff-89c0372c3577568ee3310523d6f05defR178 Using MakePrimitiveArray is definitely the better way after some thought.

@emkornfield
Copy link
Contributor Author

How does this sound as a resolution (I can open up the JIRA once we agree)?

  1. Make PrimitiveArray's constructor protected
  2. Update the documentation on the class to point users to MakePrimitiveArray.

@xhochy
Copy link
Member

xhochy commented May 29, 2016

Sounds good!

wesm pushed a commit to wesm/arrow that referenced this pull request Jun 8, 2016
Follow-up based on apache#80

Author: Micah Kornfield <emkornfield@gmail.com>

Closes apache#87 from emkornfield/emk_clarify_primitive and squashes the following commits:

14bd5b2 [Micah Kornfield] ARROW-212: Make the fact that PrimitiveArray is a abstract class more apparent fromt the contract
@emkornfield emkornfield deleted the emk_add_equality branch February 26, 2021 05:14
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
…tribution globally (apache#80)

* Add an offset for seed to achieve genuine random distribution globally

* Filter out rand in projection cache

* Evaluate expr with literal input for getting seed value
FelixYBW pushed a commit to FelixYBW/arrow that referenced this pull request Feb 19, 2022
…tribution globally (apache#80)

* Add an offset for seed to achieve genuine random distribution globally

* Filter out rand in projection cache

* Evaluate expr with literal input for getting seed value
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
…tribution globally (apache#80)

* Add an offset for seed to achieve genuine random distribution globally

* Filter out rand in projection cache

* Evaluate expr with literal input for getting seed value
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
…tribution globally (apache#80)

* Add an offset for seed to achieve genuine random distribution globally

* Filter out rand in projection cache

* Evaluate expr with literal input for getting seed value
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants