Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't use np.asarray on Index or Content objects #2740

Merged
merged 5 commits into from
Oct 5, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 4, 2023

nplike.asarray ultimately delegates to the underlying NumPy-like library's asarray. For NumPy, this invokes obj.__array__(). If our object is e.g. an Index that wraps a placeholder, this conversion fails because NumPy et al. are not aware of our placeholder type, and eagerly tries to convert it to a buffer.

This would also fail during typetracing time, but it may well be that there are branches where this gets skipped.

We could add our own protocol to the Content and Index types that permits Content and Index objects to pass through asarray, e.g.

class Numpy:
    ...
    def asarray(self, obj, *, ...):
        if hasattr(obj, "_awkward_asarray_"):
            return obj._awkward_asarray_(self, ...)
        else:
            return numpy.asarray(obj, ...)

class Index:
    ...
    def _awkward_asarray_(self, nplike, ...):
        return nplike.asarray(self.data, ...)

but on balance I think it might be easier just to pull out .data by hand, because most of the time we should know that we have a Content / Index object.

At the same time, we've long prohibited np.asarray on e.g. RegularArray, and I think we should just do this for all Content nodes to be more thorough. Thus, this PR removes NumpyArray.__array__, and the same for EmptyArray et al.

I didn't change Index, because I suspect that's used everywhere in the wild (in legitimate NumPy-only cases), and the argument for ak.Array being the high-level np.asarray interface does not apply. Thus, library authors must take care not to use asarray even in NumPy-only contexts iff. there's a chance that the incoming array is a placeholder.

So, this PR changes two things:

  1. Don't pass in Content or Index arguments to nplike.asarray, as they nest the true buffers within themselves.
  2. Don't support Content.__array__ for any content type, for the reason discussed above.

@agoose77 agoose77 marked this pull request as ready for review October 4, 2023 22:27
@agoose77 agoose77 temporarily deployed to docs-preview October 4, 2023 22:45 — with GitHub Actions Inactive
@codecov
Copy link

codecov bot commented Oct 4, 2023

Codecov Report

Merging #2740 (8473c7f) into main (c377185) will decrease coverage by 0.02%.
The diff coverage is 90.90%.

Additional details and impacted files
Files Coverage Δ
src/awkward/_connect/numexpr.py 90.66% <100.00%> (+0.25%) ⬆️
src/awkward/_connect/numpy.py 91.77% <100.00%> (+0.14%) ⬆️
src/awkward/contents/emptyarray.py 75.62% <ø> (+0.25%) ⬆️
src/awkward/contents/indexedoptionarray.py 88.48% <100.00%> (ø)
src/awkward/contents/numpyarray.py 91.53% <ø> (-0.04%) ⬇️
src/awkward/operations/ak_concatenate.py 96.29% <100.00%> (-0.03%) ⬇️
src/awkward/operations/ak_from_json.py 93.69% <ø> (ø)
src/awkward/operations/ak_mask.py 96.29% <100.00%> (ø)
src/awkward/operations/ak_where.py 92.59% <100.00%> (ø)
src/awkward/operations/ak_run_lengths.py 90.12% <71.42%> (-2.09%) ⬇️

... and 1 file with indirect coverage changes

@agoose77 agoose77 temporarily deployed to docs-preview October 4, 2023 23:37 — with GitHub Actions Inactive
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, that makes sense. How did you find all of the instances where __array__ was implicitly being called?

My guess (how I'd go about it): by adding raise Exception("HERE!") in __array__ and seeing where the tests fail.

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 5, 2023

Nearly; I removed the support for NumpyArray and EmptyArray for __array__, which leaves the base Content.__array__ that already throws an exception. This was prohibited for the base class some time ago; we expect users who do not wish to be explicit about e.g. ak.to_numpy to have a high-level array for such things.

For index nodes, a manual regex was used. We could also remove __array__ et al. from Index, I'm not yet sure how we feel; Index.__array__ et al. are useful for third-party package authors, who probably aren't going to have much support for placeholders. By the same token, however, we probably need such package authors to support typetracer and placeholders in order to build a better dask ecosystem so they wouldn't need __array__ either.

@agoose77 agoose77 merged commit 6dee1ea into main Oct 5, 2023
37 checks passed
@agoose77 agoose77 deleted the agoose77/fix-dont-use-asarray-on-content-or-index branch October 5, 2023 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants