Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighten purpose of array_masked_to_nans #2424

Merged
merged 6 commits into from
Mar 14, 2017
Merged

Conversation

DPeterK
Copy link
Member

@DPeterK DPeterK commented Mar 9, 2017

I think that array_masked_to_nans should always return a numpy.ndarray and not a ma.core.masked_array as used to happen. This means we don't have to call data = data.data after every time we use array_masked_to_nans in Iris code.

I've also removed the optional mask argument from array_masked_to_nans as it was only used once and ran the risk of confusing the usage of the function. At its worst, it would have been possible to pass a masked array and an unmatched mask, at which point the function would have returned an array that was not completely filled with nans.

result = masked_array
else:
if masked_array.dtype.kind == 'i':
masked_array = masked_array.astype(np.dtype('f8'))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do this for all integer array inputs, not just masked ones?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a word, "no" (!)
I think it is much safer to continue to have any un-masked integers with the "correct" type, as otherwise calculations could produce unexpected results (like non-integer division).
It's true that we already have such problems with masked integer data, but it seems that's something we just have to live with at present.

@@ -1040,7 +1040,8 @@ def _data_bytes_to_shaped_array(data_bytes, lbpack, boundary_packing,

# Mask the array?
if mdi in data:
data = array_masked_to_nans(data, data == mdi)
data = np.ma.masked_array(data, data == mdi)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows the mask kwarg to be dropped from array_masked_to_nans. (This is also the only place the kwarg was used!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current implementation is bugged, and should not use masks at all.
See extended comment.
So here, I think we should have just
data[data == mdi] = np.nan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on reflection, both of these calls (previous and proposed) are creating a 'mask array', so perhaps the concerns over creating more arrays which I raised are less important than I thought

It seems odd to create a masked array and immediately turn that into another numpy array but perhaps it is a reasonable approach in this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marqh we're seeing that there's no obvious answer to this. When I discussed this offline with @pp-mo he raised similar concerns about what was happening here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So all we're trying to do here is replace instances of mdi values with nan values (and cast int --> float). Both ways that have been tried so far have ended up producing some sort of mask that has ultimately been thrown away; I wonder if we need another way of making this conversion?

As a follow-on question; do we need array_masked_to_nans at all - will there ever be masked arrays at any of the places this function is used that will need to be nan-filled?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in my previous implementation this was all done in line and the reassignment happened conditionally

in factoring this out to share the type change code, i think this performance benefit was lost; I think that your refactoring just highlighted its loss

I am tempted to suggest that sharing code is one module is not always the best thing to do and this may be a context where it is easier to just process the data in line and accept and recognise a bit of shared logic, it is not like it is a lot of code

something like
https://github.com/SciTools/iris/blob/dask/lib/iris/fileformats/pp.py#L1040

    if mdi in data:
        if masked_array.dtype.kind == 'i':
            masked_array = masked_array.astype(np.dtype('f8'))
        data[data == mdi] = np.nan

perhaps with a code comment that this logic is reproduced here to avoid creating transient arrays or somesuch

@marqh
Copy link
Member

marqh commented Mar 9, 2017

please note, this behaviour was explicitly added in #2392
d2116d9#diff-6f32db0fa1a9558ed4ded7a2425d9fcdL111

enabling the logic for type changing from int to float to be shared between instances where a masked array exists and needs to be converted to a nan array and instances where a numpy array exists with a known condition as to where to put the nan values

these two cases exist in our different file format loaders and I expect both behaviours to be used more

Additionally, the detail change to iris.fileformats.pp doesn't make sense to me,
data
at https://github.com/SciTools/iris/pull/2424/files#diff-066f828420b27bfd2a03868477b98415L1043
will never be a masked array, always an array, so it will never get masked
perhaps our testing is lacking here, if this isn't resulting in a test failure

in its current form this proposed change appears to be a regressive step to me

@pp-mo
Copy link
Member

pp-mo commented Mar 9, 2017

I do generally support this, as I think we really don't need masked arrays which happen to also have NaNs in them.

@DPeterK
Copy link
Member Author

DPeterK commented Mar 9, 2017

@marqh #2392 is so large as to be unusable, can you point to a particular commit?

these two cases exist

Then they should probably be handled separately. As it stood, array_masked_to_nans could very easily be misused by passing both a masked array and a mask, at which point the masked array's mask would be ignored and the supplied mask used to set nan point. In this case the returned value would be partly nan-filled with some masked points not filled with anything.

will never be a masked array

I know, that's why I've converted it to a masked array first, using the same logic as was before used to supply a boolean "mask" to the previous array_masked_to_nans.

In summary, array_masked_to_nans seemed to have become confused in its purpose so I've made a positive assertion so that it does one job well rather than trying to do two overlapping things.

@marqh
Copy link
Member

marqh commented Mar 9, 2017

@marqh #2392 is so large as to be unusable, can you point to a particular commit?

There is one commit on this PR, so I have explicitly pointed to the change in the file:

please note, this behaviour was explicitly added in #2392
d2116d9#diff-6f32db0fa1a9558ed4ded7a2425d9fcdL111

I know, that's why I've converted it to a masked array first, using the same logic as was before used to supply a boolean "mask" to the previous array_masked_to_nans.

It was that creation of a mask to immediately remove it again that was identified as an unrequired performance overhead and avoided

@pp-mo
Copy link
Member

pp-mo commented Mar 9, 2017

As far as I can see, this single place where the 'mask' keyword is used is also redundant, as this code should not need to use masked arrays at all any more : All the sources of the 'data' variable in '_data_bytes_to_shaped_array' now produce nan-arrays instead.
Thus, no need for a mask keyword!
(cf. comment on the code).

So, I think this code (as we have it) is now mixing masks and NaNs, and I think getting it wrong...
For instance, it now uses NaNs for the missing parts of LBCs (on the dask branch), but then uses mask= to convert that to a masked array with land or sea points masked out. The two types of "missing" don't get combined any more, which I think is a bug.

Worse though, the dual use of '_data_bytes_to_shaped_array' now looks wrong to me
-- though again, not because of any change here ...

  • as '_data_bytes_to_shaped_array' now returns a nanarray, it is ok to wrap a dask wrapper around it + use it as your cube lazy data. This means the code at line 883 in is good
  • but the other call in _create_field_data when field.data is a LoadedArrayBytes seems to me now to be wrong : When this becomes cube.data, it will be real data, not lazy, so it will never be "converted back" from nans to a masked array -- which for Iris use it should do.

I'm investigating that...

@DPeterK
Copy link
Member Author

DPeterK commented Mar 9, 2017

It was that creation of a mask to immediately remove it again that was identified as an unrequired performance overhead and avoided

Then I guess the PP loader does not need to call array_masked_to_nans at all! And this very much reinforces the fact that array_masked_to_nans does not require a mask kwarg.

@marqh
Copy link
Member

marqh commented Mar 9, 2017

Then I guess the PP loader does not need to call array_masked_to_nans at all

it shares the logic about type coercion where nans are required, i.e. the change from int to float. That is the logic which was factored out into the _lazy_data module on request that it be shared

in order to avoid the call at all, this logic would have to be copied back into the fileformats.pp module

@DPeterK
Copy link
Member Author

DPeterK commented Mar 9, 2017

Now with extra added rebase after #2421 got merged!

@QuLogic QuLogic added this to the dask milestone Mar 9, 2017
if mdi in data:
data = array_masked_to_nans(data, data == mdi)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pay attention to the name of array_masked_to_nans, it tells us that it will convert a masked array to a nan-filled array. If we don't have masked data (such as here), let's just not use it and instead do the processing we actually need to. That is, we cast int -> float if necessary and replace mdi values with NaN.

Copy link
Member

@pp-mo pp-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to get so picky !
Actually this is all good stuff -- adding a decent docstring and some tests = highly desirable !

`np.nan` values.

Returns:
The input array if unmasked, or a NaN-filled masked array of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"NaN-filled masked array" -- no, the result is a "normal" ndarray, it is not a masked array.
It might be helpful to say that the result is always a "normal" (aka un-masked) array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, a clunky description of what is happening.

"Masked points are set to nan" might be better? I'll also add the bit about the "normal" array too.


* masked_array:
A NumPy array. If masked, the masked points will be filled with
`np.nan` values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If masked, the masked points will be filled with np.nan values."

That bit does not describe the input argument, best left out.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

The fill value and mask of the input masked array will be lost.

.. note::
Integer masked arrays are cast to floats because NaN is a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cast to floats

We could be more explicit here -- it is in fact always float64 (8 bytes), which I think may be worth saying.

At present it just uses f8, so that it can fit in 4-byte integers without precision problems.
It just doesn't bother to ever use anything smaller -- I suppose it could, but at present it does not.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

array = array.astype(np.dtype('f8'))
array[mask] = np.nan
return array
if not hasattr(masked_array, 'mask'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should use "standard facilities".
In this case, I think for once we really do want a class test and not duck typing, so I'd say np.ma.isMaskedArray is the right thing here.
My reasoning is that the subsequent code is really only meant to work with 'real' masked arrays and not with something that might just mimic them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasattr is a standard facility 😛

But fair point about the mimicking, I'll make the change.

(On a complete side note, now that I know about multiple dispatch I miss it more and more...)

data = da.from_array(data, chunks=chunks)
return data


def array_masked_to_nans(array, mask=None):
def array_masked_to_nans(masked_array):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we are now explicitly support passing this a non-masked input, the argument name should be changed.
I guess just 'array' will do it. In fact as written it really needs to be a real numpy array, but I'd say that is probably not a distinction worth making. It definitely isn't a dask array though.

Possibly we should change the function name too.
What do you think ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the argument name should be changed

Good point! Shows how much change there's been to this function recently...

Possibly we should change the function name too

To what though? Primarily what it does is put nans where before there were masked values, which is what the function says it does. If we wanted to describe the non-masked array pass through too then we might end up with:
array_masked_to_nans_unless_the_array_isnt_masked_at_which_point_just_return_it
... which is a bit long (and very Pratchett-ian! 🎩)

Do you have a better name suggestion? Otherwise I propose we don't touch it.


result = array_masked_to_nans(masked_array).data

def _common_checks(self, result):
self.assertIsInstance(result, np.ndarray)
self.assertFalse(isinstance(result, ma.MaskedArray))
self.assertFalse(ma.is_masked(result))
Copy link
Member

@pp-mo pp-mo Mar 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant, given the previous line ?
(is_masked meaning that : "it is a masked array, and some points are actually masked")

Based on my previous comment about "standard facilities", perhaps we should just be using self.assertFalse(ma.isMaskedArray(result)) here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

FWIW, I inherited these checks. I kept these last two because I wanted to be absolutely sure that my non-masked array really hadn't retained its mask somehow!

result = array_masked_to_nans(unmasked_array)
# Non-masked array is returned as-is, without copying.
self.assertIs(result, unmasked_array)

def test_empty_mask(self):
Copy link
Member

@pp-mo pp-mo Mar 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masked arrays being the horrible things they are, it might be worth checking also the case where the 'mask' keyword is not set (the actual default is "mask=np.ma.nomask").
For a ghastly distinction, this is not the same thing as mask=zeros(...) or mask=None or mask=False, all of which allocate an actual (empty) mask array:

>>> print np.ma.masked_array([1, 2], mask=False).mask
[False False]
>>> print np.ma.masked_array([1, 2], mask=None).mask
[False False]
>>> print np.ma.masked_array([1, 2]).mask
False
>>> print np.ma.masked_array([1, 2], mask=np.ma.nomask).mask
False
>>> 

Urrrgh! Don't get me started...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is getting a little close to testing the actual ma implementation! Which, perhaps, it could do with a little more of...

result = array_masked_to_nans(unmasked_array, mask=False)
# Non-masked array is returned as-is, without copying.
self.assertIs(result, unmasked_array)
def test_integer_array(self):
Copy link
Member

@pp-mo pp-mo Mar 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests "masked integer" input.
We should have another one that tests "un-masked integers", where the input is a non-masked integer array, as in that case the dtype is preserved.
Also I suppose yet another, to test an integer masked array with no masked points (like your existing test_empty_mask above).

@DPeterK
Copy link
Member Author

DPeterK commented Mar 10, 2017

Review actions to address comments from @pp-mo.

Copy link
Member

@pp-mo pp-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny extra tweaks suggested


self._common_checks(result)
self.assertArrayAllClose(result, masked_array.data)

def test_integer_array(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth renaming this test_masked_ints to make clear the relationship to test_unmasked_ints.
Also, as this is the one that changes the type, it is worth checking the actual values, as in test_masked_input.

@@ -33,7 +33,7 @@
class Test(tests.IrisTest):
def _common_checks(self, result):
self.assertIsInstance(result, np.ndarray)
self.assertFalse(isinstance(result, ma.MaskedArray))
self.assertFalse(ma.isMaskedArray(result))
self.assertFalse(ma.is_masked(result))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is still redundant, as we already checked it's not a masked array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really disagree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?? what does self.assertFalse(ma.isMaskedArray(result)) mean to you.

isMaskedArray(x)
    Test whether input is an instance of MaskedArray.
    
    This function returns True if `x` is an instance of MaskedArray
    and returns False otherwise.  Any object is accepted as input.

If the type is not a masked array, it cannot be masked, so why do you want to check this ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I give in...

@DPeterK
Copy link
Member Author

DPeterK commented Mar 13, 2017

Review actions: tests renamed and excess test removed.

@bjlittle @lbdreyer do you have anything to add here? It would be nice to get this in.

@pp-mo
Copy link
Member

pp-mo commented Mar 13, 2017

Suggested change : DPeterK#8

Other than that, I'm good with this + will merge when others have had their chance ...

@DPeterK
Copy link
Member Author

DPeterK commented Mar 13, 2017

OK, I've changed array_masked_to_nans so that it won't recast int masked array objects with mask=ma.nomask.

@bjlittle bjlittle self-assigned this Mar 13, 2017

Args:

* masked_array:
Copy link
Member

@bjlittle bjlittle Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dkillick This should be array surely, and not masked_array in order to align with the formal parameter list of the function ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bah! So it should.

Copy link
Member

@bjlittle bjlittle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good!

mask = array.mask
array[mask] = np.nan
result = array.data
return result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice 👍


self._common_checks(result)
self.assertArrayAllClose(np.isnan(result),
[[False, True], [False, False]])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have always reused the mask defined on line 40.

@DPeterK
Copy link
Member Author

DPeterK commented Mar 14, 2017

Oh good, the tests have passed!

@pp-mo @lbdreyer @bjlittle nudge nudge nudge

:shipit:

cube data dtype differs: int64 != int8
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is fun: it looks like the changes in this PR have fixed this test result -- note how the dtype specified in the error text here now matches to the dtype set in the test code...

Copy link
Member

@lbdreyer lbdreyer Mar 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this was what it originally was, but as part of the previous dask work it was changed to float64 because the cube data array was a masked array. However the masked array doesn't contain any masked values so the dask array wouldn't contain any nans and therefore could be int.

The correct handling of checking for ma.isMaskedArray(array) and ma.is_masked(array) in this PR corrects this behaviour and this test.

@lbdreyer lbdreyer self-requested a review March 14, 2017 10:38
@pp-mo pp-mo merged commit 0332668 into SciTools:dask Mar 14, 2017
@DPeterK DPeterK deleted the contract_am2n branch March 14, 2017 10:52
bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017
Tighten purpose of `array_masked_to_nans`
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants