Tighten purpose of `array_masked_to_nans` #2424

DPeterK · 2017-03-09T14:42:16Z

I think that array_masked_to_nans should always return a numpy.ndarray and not a ma.core.masked_array as used to happen. This means we don't have to call data = data.data after every time we use array_masked_to_nans in Iris code.

I've also removed the optional mask argument from array_masked_to_nans as it was only used once and ran the risk of confusing the usage of the function. At its worst, it would have been possible to pass a masked array and an unmatched mask, at which point the function would have returned an array that was not completely filled with nans.

DPeterK · 2017-03-09T14:42:53Z

lib/iris/_lazy_data.py

+        result = masked_array
+    else:
+        if masked_array.dtype.kind == 'i':
+            masked_array = masked_array.astype(np.dtype('f8'))


Should we do this for all integer array inputs, not just masked ones?

In a word, "no" (!)
I think it is much safer to continue to have any un-masked integers with the "correct" type, as otherwise calculations could produce unexpected results (like non-integer division).
It's true that we already have such problems with masked integer data, but it seems that's something we just have to live with at present.

DPeterK · 2017-03-09T14:43:52Z

lib/iris/fileformats/pp.py

@@ -1040,7 +1040,8 @@ def _data_bytes_to_shaped_array(data_bytes, lbpack, boundary_packing,

    # Mask the array?
    if mdi in data:
-        data = array_masked_to_nans(data, data == mdi)
+        data = np.ma.masked_array(data, data == mdi)


This allows the mask kwarg to be dropped from array_masked_to_nans. (This is also the only place the kwarg was used!)

I think the current implementation is bugged, and should not use masks at all.
See extended comment.
So here, I think we should have just
data[data == mdi] = np.nan

on reflection, both of these calls (previous and proposed) are creating a 'mask array', so perhaps the concerns over creating more arrays which I raised are less important than I thought

It seems odd to create a masked array and immediately turn that into another numpy array but perhaps it is a reasonable approach in this case

@marqh we're seeing that there's no obvious answer to this. When I discussed this offline with @pp-mo he raised similar concerns about what was happening here...

So all we're trying to do here is replace instances of mdi values with nan values (and cast int --> float). Both ways that have been tried so far have ended up producing some sort of mask that has ultimately been thrown away; I wonder if we need another way of making this conversion?

As a follow-on question; do we need array_masked_to_nans at all - will there ever be masked arrays at any of the places this function is used that will need to be nan-filled?

i think in my previous implementation this was all done in line and the reassignment happened conditionally

in factoring this out to share the type change code, i think this performance benefit was lost; I think that your refactoring just highlighted its loss

I am tempted to suggest that sharing code is one module is not always the best thing to do and this may be a context where it is easier to just process the data in line and accept and recognise a bit of shared logic, it is not like it is a lot of code

something like
https://github.com/SciTools/iris/blob/dask/lib/iris/fileformats/pp.py#L1040

if mdi in data: if masked_array.dtype.kind == 'i': masked_array = masked_array.astype(np.dtype('f8')) data[data == mdi] = np.nan

perhaps with a code comment that this logic is reproduced here to avoid creating transient arrays or somesuch

marqh · 2017-03-09T15:22:23Z

please note, this behaviour was explicitly added in #2392
d2116d9#diff-6f32db0fa1a9558ed4ded7a2425d9fcdL111

enabling the logic for type changing from int to float to be shared between instances where a masked array exists and needs to be converted to a nan array and instances where a numpy array exists with a known condition as to where to put the nan values

these two cases exist in our different file format loaders and I expect both behaviours to be used more

Additionally, the detail change to iris.fileformats.pp doesn't make sense to me,
data
at https://github.com/SciTools/iris/pull/2424/files#diff-066f828420b27bfd2a03868477b98415L1043
will never be a masked array, always an array, so it will never get masked
perhaps our testing is lacking here, if this isn't resulting in a test failure

in its current form this proposed change appears to be a regressive step to me

pp-mo · 2017-03-09T15:24:03Z

I do generally support this, as I think we really don't need masked arrays which happen to also have NaNs in them.

DPeterK · 2017-03-09T15:32:36Z

@marqh #2392 is so large as to be unusable, can you point to a particular commit?

these two cases exist

Then they should probably be handled separately. As it stood, array_masked_to_nans could very easily be misused by passing both a masked array and a mask, at which point the masked array's mask would be ignored and the supplied mask used to set nan point. In this case the returned value would be partly nan-filled with some masked points not filled with anything.

will never be a masked array

I know, that's why I've converted it to a masked array first, using the same logic as was before used to supply a boolean "mask" to the previous array_masked_to_nans.

In summary, array_masked_to_nans seemed to have become confused in its purpose so I've made a positive assertion so that it does one job well rather than trying to do two overlapping things.

marqh · 2017-03-09T15:39:30Z

@marqh #2392 is so large as to be unusable, can you point to a particular commit?

There is one commit on this PR, so I have explicitly pointed to the change in the file:

please note, this behaviour was explicitly added in #2392
d2116d9#diff-6f32db0fa1a9558ed4ded7a2425d9fcdL111

I know, that's why I've converted it to a masked array first, using the same logic as was before used to supply a boolean "mask" to the previous array_masked_to_nans.

It was that creation of a mask to immediately remove it again that was identified as an unrequired performance overhead and avoided

pp-mo · 2017-03-09T15:41:05Z

As far as I can see, this single place where the 'mask' keyword is used is also redundant, as this code should not need to use masked arrays at all any more : All the sources of the 'data' variable in '_data_bytes_to_shaped_array' now produce nan-arrays instead.
Thus, no need for a mask keyword!
(cf. comment on the code).

So, I think this code (as we have it) is now mixing masks and NaNs, and I think getting it wrong...
For instance, it now uses NaNs for the missing parts of LBCs (on the dask branch), but then uses mask= to convert that to a masked array with land or sea points masked out. The two types of "missing" don't get combined any more, which I think is a bug.

Worse though, the dual use of '_data_bytes_to_shaped_array' now looks wrong to me
-- though again, not because of any change here ...

as '_data_bytes_to_shaped_array' now returns a nanarray, it is ok to wrap a dask wrapper around it + use it as your cube lazy data. This means the code at line 883 in is good
but the other call in _create_field_data when field.data is a LoadedArrayBytes seems to me now to be wrong : When this becomes cube.data, it will be real data, not lazy, so it will never be "converted back" from nans to a masked array -- which for Iris use it should do.

I'm investigating that...

DPeterK · 2017-03-09T15:47:33Z

It was that creation of a mask to immediately remove it again that was identified as an unrequired performance overhead and avoided

Then I guess the PP loader does not need to call array_masked_to_nans at all! And this very much reinforces the fact that array_masked_to_nans does not require a mask kwarg.

marqh · 2017-03-09T15:52:25Z

Then I guess the PP loader does not need to call array_masked_to_nans at all

it shares the logic about type coercion where nans are required, i.e. the change from int to float. That is the logic which was factored out into the _lazy_data module on request that it be shared

in order to avoid the call at all, this logic would have to be copied back into the fileformats.pp module

DPeterK · 2017-03-09T16:16:38Z

Now with extra added rebase after #2421 got merged!

DPeterK · 2017-03-10T09:50:27Z

lib/iris/fileformats/pp.py

    if mdi in data:
-        data = array_masked_to_nans(data, data == mdi)


If we pay attention to the name of array_masked_to_nans, it tells us that it will convert a masked array to a nan-filled array. If we don't have masked data (such as here), let's just not use it and instead do the processing we actually need to. That is, we cast int -> float if necessary and replace mdi values with NaN.

pp-mo

Sorry to get so picky !
Actually this is all good stuff -- adding a decent docstring and some tests = highly desirable !

pp-mo · 2017-03-10T11:37:08Z

lib/iris/_lazy_data.py

+        `np.nan` values.
+
+    Returns:
+        The input array if unmasked, or a NaN-filled masked array of


"NaN-filled masked array" -- no, the result is a "normal" ndarray, it is not a masked array.
It might be helpful to say that the result is always a "normal" (aka un-masked) array.

Ah yes, a clunky description of what is happening.

"Masked points are set to nan" might be better? I'll also add the bit about the "normal" array too.

pp-mo · 2017-03-10T11:38:13Z

lib/iris/_lazy_data.py

+
+    * masked_array:
+        A NumPy array. If masked, the masked points will be filled with
+        `np.nan` values.


"If masked, the masked points will be filled with np.nan values."

That bit does not describe the input argument, best left out.

pp-mo · 2017-03-10T11:39:13Z

lib/iris/_lazy_data.py

+        The fill value and mask of the input masked array will be lost.
+
+    .. note::
+        Integer masked arrays are cast to floats because NaN is a


cast to floats

We could be more explicit here -- it is in fact always float64 (8 bytes), which I think may be worth saying.

At present it just uses f8, so that it can fit in 4-byte integers without precision problems.
It just doesn't bother to ever use anything smaller -- I suppose it could, but at present it does not.

pp-mo · 2017-03-10T11:42:37Z

lib/iris/_lazy_data.py

-        array = array.astype(np.dtype('f8'))
-    array[mask] = np.nan
-    return array
+    if not hasattr(masked_array, 'mask'):


I think this test should use "standard facilities".
In this case, I think for once we really do want a class test and not duck typing, so I'd say np.ma.isMaskedArray is the right thing here.
My reasoning is that the subsequent code is really only meant to work with 'real' masked arrays and not with something that might just mimic them.

hasattr is a standard facility 😛

But fair point about the mimicking, I'll make the change.

(On a complete side note, now that I know about multiple dispatch I miss it more and more...)

pp-mo · 2017-03-10T11:51:12Z

lib/iris/_lazy_data.py

        data = da.from_array(data, chunks=chunks)
    return data


-def array_masked_to_nans(array, mask=None):
+def array_masked_to_nans(masked_array):


As we are now explicitly support passing this a non-masked input, the argument name should be changed.
I guess just 'array' will do it. In fact as written it really needs to be a real numpy array, but I'd say that is probably not a distinction worth making. It definitely isn't a dask array though.

Possibly we should change the function name too.
What do you think ?

the argument name should be changed

Good point! Shows how much change there's been to this function recently...

Possibly we should change the function name too

To what though? Primarily what it does is put nans where before there were masked values, which is what the function says it does. If we wanted to describe the non-masked array pass through too then we might end up with:
array_masked_to_nans_unless_the_array_isnt_masked_at_which_point_just_return_it
... which is a bit long (and very Pratchett-ian! 🎩)

Do you have a better name suggestion? Otherwise I propose we don't touch it.

pp-mo · 2017-03-10T12:00:39Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py

-
-        result = array_masked_to_nans(masked_array).data
-
+    def _common_checks(self, result):
        self.assertIsInstance(result, np.ndarray)
        self.assertFalse(isinstance(result, ma.MaskedArray))
        self.assertFalse(ma.is_masked(result))


Isn't this redundant, given the previous line ?
(is_masked meaning that : "it is a masked array, and some points are actually masked")

Based on my previous comment about "standard facilities", perhaps we should just be using self.assertFalse(ma.isMaskedArray(result)) here.

Sure.

FWIW, I inherited these checks. I kept these last two because I wanted to be absolutely sure that my non-masked array really hadn't retained its mask somehow!

pp-mo · 2017-03-10T12:06:10Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py

+        result = array_masked_to_nans(unmasked_array)
+        # Non-masked array is returned as-is, without copying.
+        self.assertIs(result, unmasked_array)
+
    def test_empty_mask(self):


Masked arrays being the horrible things they are, it might be worth checking also the case where the 'mask' keyword is not set (the actual default is "mask=np.ma.nomask").
For a ghastly distinction, this is not the same thing as mask=zeros(...) or mask=None or mask=False, all of which allocate an actual (empty) mask array:

>>> print np.ma.masked_array([1, 2], mask=False).mask [False False] >>> print np.ma.masked_array([1, 2], mask=None).mask [False False] >>> print np.ma.masked_array([1, 2]).mask False >>> print np.ma.masked_array([1, 2], mask=np.ma.nomask).mask False >>>

Urrrgh! Don't get me started...

I wonder if this is getting a little close to testing the actual ma implementation! Which, perhaps, it could do with a little more of...

pp-mo · 2017-03-10T12:17:21Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py

-        result = array_masked_to_nans(unmasked_array, mask=False)
-        # Non-masked array is returned as-is, without copying.
-        self.assertIs(result, unmasked_array)
+    def test_integer_array(self):


This tests "masked integer" input.
We should have another one that tests "un-masked integers", where the input is a non-masked integer array, as in that case the dtype is preserved.
Also I suppose yet another, to test an integer masked array with no masked points (like your existing test_empty_mask above).

DPeterK · 2017-03-10T16:08:53Z

Review actions to address comments from @pp-mo.

pp-mo

Tiny extra tweaks suggested

pp-mo · 2017-03-10T16:37:45Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py

+
+        self._common_checks(result)
+        self.assertArrayAllClose(result, masked_array.data)
+
    def test_integer_array(self):


I think it is worth renaming this test_masked_ints to make clear the relationship to test_unmasked_ints.
Also, as this is the one that changes the type, it is worth checking the actual values, as in test_masked_input.

pp-mo · 2017-03-10T16:38:20Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py

@@ -33,7 +33,7 @@
 class Test(tests.IrisTest):
    def _common_checks(self, result):
        self.assertIsInstance(result, np.ndarray)
-        self.assertFalse(isinstance(result, ma.MaskedArray))
+        self.assertFalse(ma.isMaskedArray(result))
        self.assertFalse(ma.is_masked(result))


This line is still redundant, as we already checked it's not a masked array

I really disagree.

?? what does self.assertFalse(ma.isMaskedArray(result)) mean to you.

isMaskedArray(x) Test whether input is an instance of MaskedArray. This function returns True if `x` is an instance of MaskedArray and returns False otherwise. Any object is accepted as input.

If the type is not a masked array, it cannot be masked, so why do you want to check this ?

Alright, I give in...

DPeterK · 2017-03-13T12:31:17Z

Review actions: tests renamed and excess test removed.

@bjlittle @lbdreyer do you have anything to add here? It would be nice to get this in.

pp-mo · 2017-03-13T14:02:26Z

Suggested change : DPeterK#8

Other than that, I'm good with this + will merge when others have had their chance ...

DPeterK · 2017-03-13T15:43:25Z

OK, I've changed array_masked_to_nans so that it won't recast int masked array objects with mask=ma.nomask.

bjlittle · 2017-03-13T15:53:53Z

lib/iris/_lazy_data.py

+
+    Args:
+
+    * masked_array:


@dkillick This should be array surely, and not masked_array in order to align with the formal parameter list of the function ...

Bah! So it should.

bjlittle

All good!

bjlittle · 2017-03-13T15:59:23Z

lib/iris/_lazy_data.py

+            mask = array.mask
+            array[mask] = np.nan
+        result = array.data
+    return result


This is nice 👍

lbdreyer · 2017-03-13T16:12:44Z

lib/iris/tests/unit/lazy_data/test_array_masked_to_nans.py


+        self._common_checks(result)
        self.assertArrayAllClose(np.isnan(result),
                                 [[False, True], [False, False]])


You could have always reused the mask defined on line 40.

DPeterK · 2017-03-14T10:19:44Z

Oh good, the tests have passed!

@pp-mo @lbdreyer @bjlittle nudge nudge nudge

DPeterK · 2017-03-14T10:23:05Z

lib/iris/tests/results/unit/merge/ProtoCube/register__CubeSig/noise.txt

So this is fun: it looks like the changes in this PR have fixed this test result -- note how the dtype specified in the error text here now matches to the dtype set in the test code...

So this was what it originally was, but as part of the previous dask work it was changed to float64 because the cube data array was a masked array. However the masked array doesn't contain any masked values so the dask array wouldn't contain any nans and therefore could be int.

The correct handling of checking for ma.isMaskedArray(array) and ma.is_masked(array) in this PR corrects this behaviour and this test.

Tighten purpose of `array_masked_to_nans`

DPeterK added the Status: Work in Progress label Mar 9, 2017

DPeterK commented Mar 9, 2017

View reviewed changes

DPeterK added the dask label Mar 9, 2017

DPeterK requested review from bjlittle, pp-mo and lbdreyer March 9, 2017 14:44

DPeterK force-pushed the contract_am2n branch from 378e577 to 1d9ba99 Compare March 9, 2017 16:16

QuLogic added this to the dask milestone Mar 9, 2017

DPeterK force-pushed the contract_am2n branch from 1d9ba99 to 41090d9 Compare March 10, 2017 09:37

DPeterK commented Mar 10, 2017

View reviewed changes

pp-mo requested changes Mar 10, 2017

View reviewed changes

bjlittle self-assigned this Mar 13, 2017

bjlittle reviewed Mar 13, 2017

View reviewed changes

DPeterK force-pushed the contract_am2n branch from ba8eece to 12de2dd Compare March 13, 2017 15:59

bjlittle approved these changes Mar 13, 2017

View reviewed changes

lib/iris/_lazy_data.py

mask = array.mask

array[mask] = np.nan

result = array.data

return result

Copy link

Member

bjlittle Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice 👍

lbdreyer reviewed Mar 13, 2017

View reviewed changes

DPeterK added 6 commits March 14, 2017 09:41

Tighten purpose of array_masked_to_nans

8b902e7

Don't use func to process pp data

0da8438

Review actions

12f7efc

Rename tests

7b158c2

Don't recast nomask int arrays

f6f14fd

Correct test file, reuse mask

9e94584

DPeterK force-pushed the contract_am2n branch from 12de2dd to 9e94584 Compare March 14, 2017 10:04

DPeterK commented Mar 14, 2017

View reviewed changes

lbdreyer self-requested a review March 14, 2017 10:38

lbdreyer approved these changes Mar 14, 2017

View reviewed changes

pp-mo approved these changes Mar 14, 2017

View reviewed changes

pp-mo merged commit 0332668 into SciTools:dask Mar 14, 2017

pp-mo removed the Status: Work in Progress label Mar 14, 2017

DPeterK deleted the contract_am2n branch March 14, 2017 10:52

bjlittle pushed a commit to bjlittle/iris that referenced this pull request May 31, 2017

Tighten purpose of array_masked_to_nans (SciTools#2424)

8b95033

Tighten purpose of `array_masked_to_nans`

QuLogic modified the milestones: dask, v2.0 Aug 2, 2017

		if mdi in data:
		data = array_masked_to_nans(data, data == mdi)

Tighten purpose of array_masked_to_nans #2424

Tighten purpose of array_masked_to_nans #2424

Conversation

DPeterK commented Mar 9, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marqh commented Mar 9, 2017

pp-mo commented Mar 9, 2017

DPeterK commented Mar 9, 2017

marqh commented Mar 9, 2017

pp-mo commented Mar 9, 2017 • edited Loading

DPeterK commented Mar 9, 2017

marqh commented Mar 9, 2017

DPeterK commented Mar 9, 2017

Choose a reason for hiding this comment

pp-mo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pp-mo Mar 10, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pp-mo Mar 10, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pp-mo Mar 10, 2017 • edited Loading

Choose a reason for hiding this comment

DPeterK commented Mar 10, 2017

pp-mo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DPeterK commented Mar 13, 2017

pp-mo commented Mar 13, 2017 • edited Loading

DPeterK commented Mar 13, 2017

bjlittle Mar 13, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bjlittle left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DPeterK commented Mar 14, 2017 • edited Loading

Choose a reason for hiding this comment

lbdreyer Mar 14, 2017 • edited Loading

Choose a reason for hiding this comment

Tighten purpose of `array_masked_to_nans` #2424

Tighten purpose of `array_masked_to_nans` #2424

pp-mo commented Mar 9, 2017 •

edited

Loading

pp-mo Mar 10, 2017 •

edited

Loading

pp-mo Mar 10, 2017 •

edited

Loading

pp-mo Mar 10, 2017 •

edited

Loading

pp-mo commented Mar 13, 2017 •

edited

Loading

bjlittle Mar 13, 2017 •

edited

Loading

DPeterK commented Mar 14, 2017 •

edited

Loading

lbdreyer Mar 14, 2017 •

edited

Loading