Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat!: rework ufunc type promotion handling #2767

Merged
merged 1 commit into from
Oct 24, 2023

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 23, 2023

In pursuing recent PRs, the fragility of our ufunc handling became clearer; we've spoken for a while about NEP-50 and value-based promotion, and the complexities of NumPy's type conversion rules. This PR I believe makes a pragmatic step in this direction, using https://numpy.org/doc/stable/reference/generated/numpy.ufunc.resolve_dtypes.html to predict the input and output type combination that a ufunc would use. As such, it applies NEP-50 rules which are value-insensitive.

How does this PR change Awkward?

In main, the following behavior is exhibited:

>>> import awkward as ak
>>> import numpy as np
>>> x = ak.from_numpy(np.arange(10, dtype=np.uint8)) + np.int64(1)
>>> x.type.show()
10 * uint8
>>> y = ak.from_numpy(np.arange(10, dtype=np.uint8)) + np.int64(2**32 - 1)
>>> y.type.show()
10 * uint32
>>> z = ak.from_numpy(np.arange(10, dtype=np.uint8)) + (2**32 - 1)
>>> z.type.show()
10 * uint32

After this PR:

>>> import awkward as ak
>>> import numpy as np
>>> x = ak.from_numpy(np.arange(10, dtype=np.uint8)) + np.int64(1)
>>> x.type.show()
10 * int64
>>> y = ak.from_numpy(np.arange(10, dtype=np.uint8)) + np.int64(2**32 - 1)
>>> y.type.show()
10 * int64
>>> z = ak.from_numpy(np.arange(10, dtype=np.uint8)) + (2**32 - 1)
DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of 4294967296 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
  return self._module.asarray(obj, dtype=dtype)
>>> z.type.show()
10 * uint8

Crucially this means that old code will start throwing warnings if bare Python scalars exceed the size of the array dtype. This is probably not all that likely; most of the time we pull values from other array operations, e.g.

>>> array_of_large_values = ak.from_numpy(np.array([2**32-20], dtype=np.uint32))
>>> large_value = np.max(array_of_large_values)
>>> w = ak.from_numpy(np.arange(10, dtype=np.uint8)) + large_value
>>> w
<Array [4294967276, 4294967277, ..., 4294967284, 4294967285] type='10 * uint32'>

NumPy is going in this direction with NEP-50 (under consideration only so far). The only other way we could do this in a "safer" type agnostic fashion is to always take the default for the Python value kind (e.g. int64 for integrals, etc.), or the largest type for value kinds. I prefer to align with NumPy, and let warnings inform users.

@codecov
Copy link

codecov bot commented Oct 23, 2023

Codecov Report

Merging #2767 (f848940) into main (589b351) will increase coverage by 0.00%.
The diff coverage is 86.79%.

Additional details and impacted files
Files Coverage Δ
src/awkward/_backends/backend.py 87.50% <100.00%> (+0.83%) ⬆️
src/awkward/_backends/jax.py 100.00% <100.00%> (ø)
src/awkward/_connect/numpy.py 91.73% <100.00%> (-0.04%) ⬇️
src/awkward/_nplikes/array_module.py 87.50% <100.00%> (+0.54%) ⬆️
src/awkward/_nplikes/jax.py 80.85% <100.00%> (+1.30%) ⬆️
src/awkward/_slicing.py 86.59% <100.00%> (ø)
src/awkward/_nplikes/numpylike.py 73.72% <72.72%> (-0.04%) ⬇️
src/awkward/_nplikes/typetracer.py 76.46% <81.81%> (-0.02%) ⬇️

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a lot of changes here. Expanding

_apply_ufunc(ufunc, *args)

to

apply_ufunc(ufunc, method, input_args, kwargs)

is a good idea.

I have scrolled through it carefully, and I don't see any red flags. Instances of nplike.asarray would look dangerous if I didn't know that this function is already carefully controlled.

So, it looks good, and feel free to merge it if you're sure you're done.

@agoose77
Copy link
Collaborator Author

@jpivarski I updated the PR description to make it explicit that this will have a high-level effect upon the visible type semantics of ufunc operations. When you next have a moment, could you confirm that you're happy to move in this direction? I think it's where NumPy is moving, but moreover we need our operations to be value agnostic, so I think we have to

@jpivarski
Copy link
Member

Previously, weren't we gluing our type promotion behavior to whatever NumPy does (by actually calling NumPy on empty arrays)? Then we'd inherit whatever NumPy decides to do in the future. Is that not what would automatically happen, without this PR?

As an aside, while it's bad for the output type of an operation to depend on the values in arrays, I'm less compelled to worry about it depending on the values of Python objects (scalar int or float), unless we can somehow have these Python objects be unknown in a Dask DAG. The type-tracer mechanism allows all of the values in our array buffers (and their sizes) to be unknown with known types, but Python scalar int and float are outside of this system. If it was a np.int32 or a np.float64, that would be inside the system, as those are the kinds of values that can come out of arrays.

Also, it would be fine to assume that any Python int is always np.int64 (we do that in several places, NumPy does it on all platforms but Windows, and Numba does it universally), and that any Python float is always np.float64 (which it literally is, unlike the arbitrary-precision int).

I'm not against this PR; if it's necessary to fully conform to NEP-50 and there are good reasons to believe that NumPy will actually adopt NEP-50.

@agoose77
Copy link
Collaborator Author

Previously, weren't we gluing our type promotion behavior to whatever NumPy does (by actually calling NumPy on empty arrays)? Then we'd inherit whatever NumPy decides to do in the future. Is that not what would automatically happen, without this PR?

Somewhat; the main places that we need to define what happens w.r.t types is in typetracer. We want the same rules to apply to typetracer and non-typetracer, so our ufunc handling is changed in this PR to ensure that we can predict types. So we have to implement scaffolding for typetracer, and right now we're choosing to be value (which is always 0 for typetracer) dependent.

The real problem with NumPy's default handling is that scalars are always typed by value; e.g. np.int64(0) resolves to np.int8/np.int8 if added to such an array. This is worse than just being value-sensitive for Python scalars which may well be effectively static constants in the program.

@jpivarski
Copy link
Member

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.int64(10)
array([11, 12, 13, 14, 15], dtype=int8)

Yikes! Okay, that's terrible. Yes, you're right to introduce scaffolding to predict types manually.

Although this does it right:

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.array([np.int64(10)])
array([11, 12, 13, 14, 15])

It seems that it just needs to not be a scalar (ndims == 0 array). This is perhaps a better way:

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.int64(10)[np.newaxis]
array([11, 12, 13, 14, 15])

@agoose77
Copy link
Collaborator Author

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.int64(10)
array([11, 12, 13, 14, 15], dtype=int8)

Yikes! Okay, that's terrible. Yes, you're right to introduce scaffolding to predict types manually.

Although this does it right:

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.array([np.int64(10)])
array([11, 12, 13, 14, 15])

It seems that it just needs to not be a scalar (ndims == 0 array). This is perhaps a better way:

>>> np.array([1, 2, 3, 4, 5], np.int8) + np.int64(10)[np.newaxis]
array([11, 12, 13, 14, 15])

Yes, I'm genuinely surprised that this is the default behavior, but I'm sure that there's a long and complex reason behind its history.

The behaviour you describe above is now what this PR delivers:

>>> ak.from_numpy(np.arange(10, dtype=np.uint8)) + np.int64(2**32 - 1)
<Array [4294967295, 4294967296, ..., 4294967303, 4294967304] type='10 * int64'>

In the examples in the PR description, I demonstrate what happens to untyped scalars, which inherit the integer flavour of the array;

>>> ak.from_numpy(np.arange(10, dtype=np.uint8)) + (2**32 - 1)
/home/angus/Git/awkward/src/awkward/_nplikes/array_module.py:44: DeprecationWarning: NumPy will stop allowing conversion of out-of-bound Python integers to integer arrays.  The conversion of 4294967295 to uint8 will fail in the future.
For the old behavior, usually:
    np.array(value).astype(dtype)
will give the desired result (the cast overflows).
  return self._module.asarray(obj, dtype=dtype)
<Array [255, 0, 1, 2, 3, 4, 5, 6, 7, 8] type='10 * uint8'>

@jpivarski
Copy link
Member

I'm in favor of this.

Also, I read over this PR carefully yesterday; it can be merged.

@agoose77
Copy link
Collaborator Author

Also, I read over this PR carefully yesterday; it can be merged.

Thanks — changing something as fundamental as this feels a lot safer with a detailed review :)

@agoose77 agoose77 changed the title refactor: rework ufunc type promotion handling feat!: rework ufunc type promotion handling Oct 24, 2023
@agoose77 agoose77 merged commit 97cf35f into main Oct 24, 2023
39 checks passed
@agoose77 agoose77 deleted the agoose77/refactor-nplike-ufunc-type-promotion branch October 24, 2023 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants