Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask merge concat fill value #2520

Merged
merged 7 commits into from
May 4, 2017

Conversation

bjlittle
Copy link
Member

@bjlittle bjlittle commented May 3, 2017

This PR refines the approach taken with handling the cube.fill_value by merge and concatenate.

This PR extends the expected behaviour of numpy with regards to how it manages the fill_value when combining masked arrays and/or ndarrays to cubes (it is valid for a cube to have a fill_value even though it has a ndarray payload).

In summary, when combining cubes via merge or concatenate, the resultant cube will have:

  • a fill_value that is non-None iff all cubes have the same non-None fill_value
  • a fill_value that is None, if:
    • at least one cube has a fill_value of None, or
    • the candidate cube and the proto-cube have a different fill_value

This simply means that,

  • if the cube.fill_value is set to a non-None value, and the cube has a masked payload, then the masked array returned by cube.data will have a fill_value that matches the cube.fill_value.
  • if the cube.fill_value is set to None, and the cube has a masked payload, then the masked array returned by cube.data will have the default fill_value generated by numpy for that specific dtype.

@bjlittle
Copy link
Member Author

bjlittle commented May 3, 2017

Ping @djkirkham and @lbdreyer ... 😄

# Allow fill-value promotion if either fill-value is None.
if self.fill_value is not None and other.fill_value is not None:
msg = 'cube fill value differs: {} != {}'
msgs.append(msg.format(self.fill_value, other.fill_value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to allow for merging and concatenating cubes with different fill values?
I realise this is not a simple question so I can call you to discuss this if that's easier?

Copy link
Member Author

@bjlittle bjlittle May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbdreyer This is more in alignment with the behaviour of upstream/master, which doesn't discriminate based on fill_value. So no breaking or surprising behaviour for the end user.

It also aligns with the behaviour of numpy, which will merge/concatenate arrays with different fill_values, but the resultant array has a fill_value that is replaced with the numpy nominated default for that dtype.

With regards to cubes, we're doing the same now in this PR - if we have two cubes that only differ on fill_value, say one is 1234 and the other is 4321, then they will merge/concatenate into one cube, but the resultant cube will have a cube.fill_value of None. When the user does a cube.data to get the data, the resulting masked array (assuming the payload realizes as masked) will have a fill_value that is the numpy nominated default - note that, the cube.fill_value remains equal to None in this case ... until the user changes it via cube.fill_value = fill_value, or cube.replace(..., fill_value=fill_value)

np_fill_value = ma.masked_array(0, dtype=result.dtype).fill_value
self.assertEqual(result.data.fill_value, np_fill_value)

def test_fill_value_invariant_to_order__different_non_None(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than testing the different order, is this much different to test_concat_masked_2y2d__default_fill_value_from_diff?

I don't understand what you gain from doing all the permutations?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I understand that this concatenates cubes with x's like

concat [0, 1] [2, 3] [4, 5] => [0, 1, 2, 3, 4, 5]
concat [0, 1] [4, 5] [2, 3] => [0, 1, 2, 3, 4, 5]
concat [4, 5] [0, 1] 2, 3] => [0, 1, 2, 3, 4, 5]

and checking the fill value is correct after that

@@ -748,6 +730,49 @@ def test_concat_2x2d_aux_xy_bounds(self):
self.assertEqual(len(result), 1)
self.assertEqual(result[0].shape, (2, 4))

def test_fill_value_invariant_to_order__same_non_None(self):
Copy link
Member

@lbdreyer lbdreyer May 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test naming was a bit confusing to me. I only understood what __same_non_None meant when I started writing a comment asking you what it is.
Naming tests is verrry difficult so I am happy with it as is, but I am going to try to think of a different name

Copy link
Member Author

@bjlittle bjlittle May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, test naming ... ugh ... you are right, I'll don my thinking cap 🤠

emsg = 'Fill values differ'
with self.assertRaisesRegexp(iris.exceptions.ConcatenateError, emsg):
cubes.concatenate_cube()


class Test2D(tests.IrisTest):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You never test the fill value when all inputs that get concatenated are unmasked. You could always add a quick check to the test on L535

return result

def _check_fill_value(self, result, fill0, fill1):
fill_value = self._expected_fill(fill0, fill1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename to be expected_fill_value or expected_cube_fill_value?
May make it easier to follow

[False, True])
fill_combos = itertools.product([None, fill_value],
[fill_value, None])
self.combos = itertools.product(lazy_combos, fill_combos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it too complicated to have another itertools.product for ndarray and masked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm a wee bit wary of folding that in, as it seems like we'll end up with a single test that does absolutely everything and will be a tad impenetrable ... is that okay?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! makes sense

@lbdreyer
Copy link
Member

lbdreyer commented May 3, 2017

This looks very close!

The testing of concatenate isn't less complete than the testing of merge, but there are endless combinations (masked/unmasked, fill_value_same/fill_value_diff/fill_value_None, lazy/real etc) that we can test for... I want to go through some possible important combinations that we are missing.
but otherwise I'm happy with the changes.

@lbdreyer
Copy link
Member

lbdreyer commented May 4, 2017

I've looked into testing of concatenate a little more and ignoring the endless combinations we could test I think there are two missing combinations that we should at least consider.

  1. There doesn't appear to be a test for concatenating cubes that are masked and unmasked where both have a fill value of None.

  2. Also there isn't testing of fil value when both cubes are unmasked.

@bjlittle
Copy link
Member Author

bjlittle commented May 4, 2017

@lbdreyer Okay, so I've addressed your review actions in the simplest way possible. Adding a concatenate equivalent to the merge TestDataMergeCombos isn't really practical and would involve quite a bit of effort, so I've added the missing test coverage through new tests and adding to existing tests.

if cube_signature.fill_value is None or \
cube_signature.fill_value != fill_value:
# Demote the fill value to the default.
self._cube_signature.fill_value = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified to

if cube_signature.fill_value != self._cube_signature.fill_value:
    self._cube_signature.fill_value = None

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, also applies to _merge.py 👍

if other.fill_value is None or \
cube_signature.fill_value != other.fill_value:
# Demote the fill value to the default.
signature = self._build_signature(self._source,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a call to _build_signature here, but in the concatenate case you can just set fill_value to None?

Copy link
Member Author

@bjlittle bjlittle May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One is a class instance (writable) the other is a specialization of a namedtuple (read-only)

self.assertEqual(result[0].data.fill_value, fill_value)
self.assertIsNone(result[0].fill_value)
np_fill_value = ma.masked_array(0, dtype=result[0].dtype).fill_value
self.assertEqual(result[0].data.fill_value, np_fill_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we need this check. I think perhaps it should be part of the contract that if the input fill values don't match then the fill value you get out on the array shouldn't be relied upon to have any particular value. That said, we are making an effort to make sure the correct fill value is placed on the data array when it is valid, so I'm not sure.

Copy link
Member Author

@bjlittle bjlittle May 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me it's just a fill_value, something, anything ... but a consistent, reliable fill_value, and what numpy offers seems reasonable (and predictable) given the case where cube.fill_value is None

The follow on point for me is whether a warning should be issues to notify the user that the fill_value in the masked array returned by the cube.data getter is a default numpy fill_value ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be happy either way on both points to be honest

self.assertIsNone(result.fill_value)
np_fill_value = ma.masked_array(0, dtype=result.dtype).fill_value
self.assertEqual(result.data.fill_value, np_fill_value)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test coverage!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try 😉

At least we're trying to join the dots on what's happening here ... so if it changes, a whole bunch of tests are now gonna complain .... which is a good thing!


def _check_fill_value(self, result, fill0, fill1):
fill_value = self._expected_fill_value(fill0, fill1)
if fill_value is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I said I would prefer a rename I meant the variable, not the method,
i.e. expected_fill_value = self._expected_fill(fill0, fill1)
and then the self.assertEqual checks would be:

self.assertEqual(data.fill_value, expected_fill_value)

@lbdreyer
Copy link
Member

lbdreyer commented May 4, 2017

@bjlittle if you address my last comment I will happily merge this in

@lbdreyer lbdreyer merged commit 25a7615 into SciTools:dask May 4, 2017
@bjlittle
Copy link
Member Author

bjlittle commented May 4, 2017

@lbdreyer Thanks 👍

@bjlittle bjlittle deleted the dask-merge-concat-fill-value branch May 4, 2017 14:05
bjlittle added a commit to bjlittle/iris that referenced this pull request May 31, 2017
* Fix concatenate fill value.

* Fix merge fill value.

* Added merge invariant tests.

* Added concatenate invariant tests.

* Review actions.

* Simplify merge/concatenate match condition.

* expected_fill_value
@QuLogic QuLogic modified the milestones: dask, v2.0 Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants