-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tighten purpose of array_masked_to_nans
#2424
Conversation
lib/iris/_lazy_data.py
Outdated
result = masked_array | ||
else: | ||
if masked_array.dtype.kind == 'i': | ||
masked_array = masked_array.astype(np.dtype('f8')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do this for all integer array inputs, not just masked ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a word, "no" (!)
I think it is much safer to continue to have any un-masked integers with the "correct" type, as otherwise calculations could produce unexpected results (like non-integer division).
It's true that we already have such problems with masked integer data, but it seems that's something we just have to live with at present.
lib/iris/fileformats/pp.py
Outdated
@@ -1040,7 +1040,8 @@ def _data_bytes_to_shaped_array(data_bytes, lbpack, boundary_packing, | |||
|
|||
# Mask the array? | |||
if mdi in data: | |||
data = array_masked_to_nans(data, data == mdi) | |||
data = np.ma.masked_array(data, data == mdi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This allows the mask
kwarg to be dropped from array_masked_to_nans
. (This is also the only place the kwarg was used!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current implementation is bugged, and should not use masks at all.
See extended comment.
So here, I think we should have just
data[data == mdi] = np.nan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on reflection, both of these calls (previous and proposed) are creating a 'mask array', so perhaps the concerns over creating more arrays which I raised are less important than I thought
It seems odd to create a masked array and immediately turn that into another numpy array but perhaps it is a reasonable approach in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So all we're trying to do here is replace instances of mdi
values with nan
values (and cast int --> float). Both ways that have been tried so far have ended up producing some sort of mask that has ultimately been thrown away; I wonder if we need another way of making this conversion?
As a follow-on question; do we need array_masked_to_nans
at all - will there ever be masked arrays at any of the places this function is used that will need to be nan-filled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think in my previous implementation this was all done in line and the reassignment happened conditionally
in factoring this out to share the type change code, i think this performance benefit was lost; I think that your refactoring just highlighted its loss
I am tempted to suggest that sharing code is one module is not always the best thing to do and this may be a context where it is easier to just process the data in line and accept and recognise a bit of shared logic, it is not like it is a lot of code
something like
https://github.com/SciTools/iris/blob/dask/lib/iris/fileformats/pp.py#L1040
if mdi in data:
if masked_array.dtype.kind == 'i':
masked_array = masked_array.astype(np.dtype('f8'))
data[data == mdi] = np.nan
perhaps with a code comment that this logic is reproduced here to avoid creating transient arrays or somesuch
please note, this behaviour was explicitly added in #2392 enabling the logic for type changing from int to float to be shared between instances where a masked array exists and needs to be converted to a nan array and instances where a numpy array exists with a known condition as to where to put the nan values these two cases exist in our different file format loaders and I expect both behaviours to be used more Additionally, the detail change to in its current form this proposed change appears to be a regressive step to me |
I do generally support this, as I think we really don't need masked arrays which happen to also have NaNs in them. |
@marqh #2392 is so large as to be unusable, can you point to a particular commit?
Then they should probably be handled separately. As it stood,
I know, that's why I've converted it to a masked array first, using the same logic as was before used to supply a boolean "mask" to the previous In summary, |
There is one commit on this PR, so I have explicitly pointed to the change in the file:
It was that creation of a mask to immediately remove it again that was identified as an unrequired performance overhead and avoided |
As far as I can see, this single place where the 'mask' keyword is used is also redundant, as this code should not need to use masked arrays at all any more : All the sources of the 'data' variable in '_data_bytes_to_shaped_array' now produce nan-arrays instead. So, I think this code (as we have it) is now mixing masks and NaNs, and I think getting it wrong... Worse though, the dual use of '_data_bytes_to_shaped_array' now looks wrong to me
I'm investigating that... |
Then I guess the PP loader does not need to call |
it shares the logic about type coercion where nans are required, i.e. the change from int to float. That is the logic which was factored out into the in order to avoid the call at all, this logic would have to be copied back into the |
Now with extra added rebase after #2421 got merged! |
if mdi in data: | ||
data = array_masked_to_nans(data, data == mdi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we pay attention to the name of array_masked_to_nans
, it tells us that it will convert a masked array to a nan-filled array. If we don't have masked data (such as here), let's just not use it and instead do the processing we actually need to. That is, we cast int -> float if necessary and replace mdi
values with NaN
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to get so picky !
Actually this is all good stuff -- adding a decent docstring and some tests = highly desirable !
lib/iris/_lazy_data.py
Outdated
`np.nan` values. | ||
|
||
Returns: | ||
The input array if unmasked, or a NaN-filled masked array of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"NaN-filled masked array" -- no, the result is a "normal" ndarray, it is not a masked array.
It might be helpful to say that the result is always a "normal" (aka un-masked) array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, a clunky description of what is happening.
"Masked points are set to nan
" might be better? I'll also add the bit about the "normal" array too.
lib/iris/_lazy_data.py
Outdated
|
||
* masked_array: | ||
A NumPy array. If masked, the masked points will be filled with | ||
`np.nan` values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If masked, the masked points will be filled with
np.nan
values."
That bit does not describe the input argument, best left out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
lib/iris/_lazy_data.py
Outdated
The fill value and mask of the input masked array will be lost. | ||
|
||
.. note:: | ||
Integer masked arrays are cast to floats because NaN is a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cast to floats
We could be more explicit here -- it is in fact always float64 (8 bytes), which I think may be worth saying.
At present it just uses f8, so that it can fit in 4-byte integers without precision problems.
It just doesn't bother to ever use anything smaller -- I suppose it could, but at present it does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
lib/iris/_lazy_data.py
Outdated
array = array.astype(np.dtype('f8')) | ||
array[mask] = np.nan | ||
return array | ||
if not hasattr(masked_array, 'mask'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test should use "standard facilities".
In this case, I think for once we really do want a class test and not duck typing, so I'd say np.ma.isMaskedArray
is the right thing here.
My reasoning is that the subsequent code is really only meant to work with 'real' masked arrays and not with something that might just mimic them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasattr
is a standard facility 😛
But fair point about the mimicking, I'll make the change.
(On a complete side note, now that I know about multiple dispatch I miss it more and more...)
lib/iris/_lazy_data.py
Outdated
data = da.from_array(data, chunks=chunks) | ||
return data | ||
|
||
|
||
def array_masked_to_nans(array, mask=None): | ||
def array_masked_to_nans(masked_array): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we are now explicitly support passing this a non-masked input, the argument name should be changed.
I guess just 'array' will do it. In fact as written it really needs to be a real numpy array, but I'd say that is probably not a distinction worth making. It definitely isn't a dask array though.
Possibly we should change the function name too.
What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the argument name should be changed
Good point! Shows how much change there's been to this function recently...
Possibly we should change the function name too
To what though? Primarily what it does is put nan
s where before there were masked values, which is what the function says it does. If we wanted to describe the non-masked array pass through too then we might end up with:
array_masked_to_nans_unless_the_array_isnt_masked_at_which_point_just_return_it
... which is a bit long (and very Pratchett-ian! 🎩)
Do you have a better name suggestion? Otherwise I propose we don't touch it.
|
||
result = array_masked_to_nans(masked_array).data | ||
|
||
def _common_checks(self, result): | ||
self.assertIsInstance(result, np.ndarray) | ||
self.assertFalse(isinstance(result, ma.MaskedArray)) | ||
self.assertFalse(ma.is_masked(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this redundant, given the previous line ?
(is_masked
meaning that : "it is a masked array, and some points are actually masked")
Based on my previous comment about "standard facilities", perhaps we should just be using self.assertFalse(ma.isMaskedArray(result))
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
FWIW, I inherited these checks. I kept these last two because I wanted to be absolutely sure that my non-masked array really hadn't retained its mask somehow!
result = array_masked_to_nans(unmasked_array) | ||
# Non-masked array is returned as-is, without copying. | ||
self.assertIs(result, unmasked_array) | ||
|
||
def test_empty_mask(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masked arrays being the horrible things they are, it might be worth checking also the case where the 'mask' keyword is not set (the actual default is "mask=np.ma.nomask").
For a ghastly distinction, this is not the same thing as mask=zeros(...)
or mask=None
or mask=False
, all of which allocate an actual (empty) mask array:
>>> print np.ma.masked_array([1, 2], mask=False).mask
[False False]
>>> print np.ma.masked_array([1, 2], mask=None).mask
[False False]
>>> print np.ma.masked_array([1, 2]).mask
False
>>> print np.ma.masked_array([1, 2], mask=np.ma.nomask).mask
False
>>>
Urrrgh! Don't get me started...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is getting a little close to testing the actual ma
implementation! Which, perhaps, it could do with a little more of...
result = array_masked_to_nans(unmasked_array, mask=False) | ||
# Non-masked array is returned as-is, without copying. | ||
self.assertIs(result, unmasked_array) | ||
def test_integer_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tests "masked integer" input.
We should have another one that tests "un-masked integers", where the input is a non-masked integer array, as in that case the dtype is preserved.
Also I suppose yet another, to test an integer masked array with no masked points (like your existing test_empty_mask
above).
Review actions to address comments from @pp-mo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny extra tweaks suggested
|
||
self._common_checks(result) | ||
self.assertArrayAllClose(result, masked_array.data) | ||
|
||
def test_integer_array(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is worth renaming this test_masked_ints
to make clear the relationship to test_unmasked_ints
.
Also, as this is the one that changes the type, it is worth checking the actual values, as in test_masked_input
.
@@ -33,7 +33,7 @@ | |||
class Test(tests.IrisTest): | |||
def _common_checks(self, result): | |||
self.assertIsInstance(result, np.ndarray) | |||
self.assertFalse(isinstance(result, ma.MaskedArray)) | |||
self.assertFalse(ma.isMaskedArray(result)) | |||
self.assertFalse(ma.is_masked(result)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is still redundant, as we already checked it's not a masked array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really disagree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?? what does self.assertFalse(ma.isMaskedArray(result))
mean to you.
isMaskedArray(x)
Test whether input is an instance of MaskedArray.
This function returns True if `x` is an instance of MaskedArray
and returns False otherwise. Any object is accepted as input.
If the type is not a masked array, it cannot be masked, so why do you want to check this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I give in...
Suggested change : DPeterK#8 Other than that, I'm good with this + will merge when others have had their chance ... |
OK, I've changed |
lib/iris/_lazy_data.py
Outdated
|
||
Args: | ||
|
||
* masked_array: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dkillick This should be array
surely, and not masked_array
in order to align with the formal parameter list of the function ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bah! So it should.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good!
mask = array.mask | ||
array[mask] = np.nan | ||
result = array.data | ||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice 👍
|
||
self._common_checks(result) | ||
self.assertArrayAllClose(np.isnan(result), | ||
[[False, True], [False, False]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could have always reused the mask defined on line 40.
cube data dtype differs: int64 != int8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is fun: it looks like the changes in this PR have fixed this test result -- note how the dtype specified in the error text here now matches to the dtype set in the test code...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this was what it originally was, but as part of the previous dask work it was changed to float64
because the cube data array was a masked array. However the masked array doesn't contain any masked values so the dask array wouldn't contain any nans and therefore could be int
.
The correct handling of checking for ma.isMaskedArray(array)
and ma.is_masked(array)
in this PR corrects this behaviour and this test.
Tighten purpose of `array_masked_to_nans`
I think that
array_masked_to_nans
should always return anumpy.ndarray
and not ama.core.masked_array
as used to happen. This means we don't have to calldata = data.data
after every time we usearray_masked_to_nans
in Iris code.I've also removed the optional
mask
argument fromarray_masked_to_nans
as it was only used once and ran the risk of confusing the usage of the function. At its worst, it would have been possible to pass a masked array and an unmatched mask, at which point the function would have returned an array that was not completely filled withnan
s.