Sharedata #2691

Closed
wants to merge 4 commits
Changes from 3 commits

18 changes: 14 additions & 4 deletions lib/iris/coords.py
@@ -521,14 +521,24 @@ def copy(self, points=None, bounds=None):
raise ValueError('If bounds are specified, points must also be '
'specified')

new_coord = copy.deepcopy(self)
if points is not None:
# We do not perform a deepcopy when we supply new points so as
# not to unnecessarily copy the old points.

# Create a temp-coord to manage deep copying of a small array.
temp_coord = copy.copy(self)
Member:
Just do self.copy()?

temp_coord.bounds = None
# note: DataManager cannot be None or DataManager(None) for
# a deepcopy operation.
temp_coord._points_dm = DataManager(np.array((1,)))
new_coord = copy.deepcopy(temp_coord)
del(temp_coord)
Member:
This should not be happening: using del in module code is a good indication that something is badly wrong in the code's logic.

new_coord._points_dm = None
new_coord.points = points
# Regardless of whether bounds are provided as an argument, new
# points will result in new bounds, discarding those copied from
# self.
# new points will result in new bounds.
new_coord.bounds = bounds
else:
new_coord = copy.deepcopy(self)
Member:
These changes to coords.py are not tested.


return new_coord

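On the "not tested" point above: a hedged sketch of the kind of unit test that could exercise the new copy branch (class name, coordinate values, and placement are assumptions for illustration, not part of the PR):

```python
import numpy as np

import iris.tests as tests  # provides the IrisTest base class
from iris.coords import AuxCoord


class Test_copy_with_new_points(tests.IrisTest):
    def test_new_points_discard_old_bounds(self):
        # A bounded AuxCoord, so we can check the bounds are not carried over.
        coord = AuxCoord(np.arange(4), long_name='x',
                         bounds=np.column_stack([np.arange(4),
                                                 np.arange(4) + 1]))
        new = coord.copy(points=np.arange(4) * 10)
        self.assertArrayEqual(new.points, np.arange(4) * 10)
        # New points imply new (here: absent) bounds.
        self.assertIsNone(new.bounds)
        # The original coordinate is left untouched.
        self.assertArrayEqual(coord.points, np.arange(4))
```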
27 changes: 25 additions & 2 deletions lib/iris/cube.py
@@ -781,6 +781,27 @@ def __init__(self, data, standard_name=None, long_name=None,
for cell_measure, dims in cell_measures_and_dims:
self.add_cell_measure(cell_measure, dims)

# When True, indexing may result in a view onto the original data array,
# to avoid unnecessary copying.
self._share_data = False
Member:
Why has this been placed at the end of the constructor method? There are specific locations in the flow of the constructor code for "properties" such as this one, so why not follow the pattern?


@property
def share_data(self):
Member:
What? Why is this even a cube method? We need to be reducing the size of the cube API, not adding random stuff to it.

This should in no way be a cube method. If we need to change the behaviour of Iris at runtime, we have a module for that: iris.config. In there we can add option classes for controlling precisely this kind of behaviour. See #2467 for a good example of doing just that.
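As a rough sketch of the option-class idea (the class and attribute names below are assumptions for illustration, not the existing iris.config API):

```python
import contextlib


class ShareDataOptions(object):
    """Hypothetical options object controlling data sharing on indexing."""

    def __init__(self, enabled=False):
        self.enabled = enabled

    @contextlib.contextmanager
    def context(self, enabled):
        # Temporarily override the setting, restoring it on exit.
        previous = self.enabled
        self.enabled = enabled
        try:
            yield
        finally:
            self.enabled = previous


# A module-level instance, by analogy with other iris.config options.
share_data = ShareDataOptions()

# Cube.__getitem__ would then consult the module-level option rather than a
# per-cube attribute, along the lines of:
#     if not iris.config.share_data.enabled:
#         data = deepcopy(data)
```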

Member:
If we were to even consider sharing data, I would have thought that using iris.FUTURE might also have been an appropriate way to go. It's less intrusive.

Is there a direct need to have this as a property to fit the use case?

"""
Share cube data when slicing/indexing the cube if True.
Setting this flag to True will realise the data payload,
if it is lazy, as lazy data cannot currently be shared across cubes.
"""
return self._share_data

@share_data.setter
def share_data(self, value):
# If value is True: realise the data (if it hasn't already been), as
# sharing lazy data is not possible.
Member:
I do not believe this comment.

Member Author:
OK; most things are possible, at some level. This is not possible within the current implementation.

Would you just prefer "not implemented" for now?

Member:
I'd prefer lazy data to be shared, thanks.

if value and self.has_lazy_data():
_ = self.data
Member:
This line makes no sense:

  • why does it realise the data? No cube behaviour other than cube.data should realise the data. It most certainly should not do so silently with no warning to the user that this functionality is going to exhibit such unprecedented behaviour.
  • why does it assign the unnecessarily loaded data into _? That's a nop, so it doesn't need to be done.

Member Author:
It states in the docstring that this will happen.

I'm not minded to add a runtime warning saying "I've just loaded your data",
but I'd like to know if that is what you are asking for here.

Member Author:
> why does it realise the data?

This was covered in detail in the long efforts to get this functionality merged onto master in 1.12 (failed) and then 1.13 (succeeded).

Within current implementations, sharing of lazy data is not helpful or useful; it is only a functional thing on realised data. This could change in the future, but not in this minimal implementation.

Thus: if one wants to share data across cubes, sub-cube slices or similar, the data must be realised first.

> why does it assign the unnecessarily loaded data into _? That's a nop, so it doesn't need to be done.

This is a pattern that has been used elsewhere to avoid any risk that the data will be streamed to standard out; I use it quite often myself. The code in this form is a way of saying: I am realising the data, but I'm not using it here.

Member:
> it states in the doc string that this will happen

That doesn't defend the unprecedented behaviour.

This method should not load data; only cube.data may do that. Maybe this method should simply fail if the user tries to use it on a cube with lazy data. That way it is very clear to Iris users that this method is doing something unprecedented and avoids the risk of this method unexpectedly loading data without needing a warning.
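A minimal sketch of that alternative (assuming the property were retained at all; the error type and message are illustrative):

```python
@share_data.setter
def share_data(self, value):
    value = bool(value)
    if value and self.has_lazy_data():
        # Refuse, rather than silently realising the data payload.
        raise TypeError('Cannot share data on a cube with lazy data; '
                        'access cube.data first to realise it.')
    self._share_data = value
```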

Member:
> sharing of lazy data is not helpful or useful

I think it could be helpful and useful; I'm sure there are use-cases beyond this very tight use-case that could make use of such behaviour.

Member:
> why does it assign the unnecessarily loaded data into _?

> this is a pattern that has been used elsewhere to avoid any risk that the data will be streamed to standard out

Can I have an example of that from existing Iris code, please? Of course, it would be much the better if this method were not loading data at all.
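For context on the `_ =` pattern itself: a bare expression is only echoed in an interactive session; inside a method, a bare self.data statement would realise the data without printing anything. A generic illustration, not taken from Iris code:

```python
>>> import numpy as np
>>> arr = np.arange(4)
>>> arr          # the REPL echoes the value of a bare expression
array([0, 1, 2, 3])
>>> _ = arr      # assigning to _ suppresses that echo
```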

self._share_data = bool(value)

@property
def metadata(self):
"""
@@ -2173,8 +2194,10 @@ def new_cell_measure_dims(cm_):
dimension_mapping, data = iris.util._slice_data_with_keys(
cube_data, keys)

# We don't want a view of the data, so take a copy of it.
data = deepcopy(data)
# We don't want a view of the data, so take a copy of it, unless
# self.share_data is True.
if not self.share_data:
data = deepcopy(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned that for such a large change to cube behaviour the testing is very light. We have no idea at all what will happen in general use cases – at the very least this change requires some integration testing, but some more robust unit testing as well would also be good.


# We can turn a masked array into a normal array if it's full.
if ma.isMaskedArray(data):
49 changes: 49 additions & 0 deletions lib/iris/tests/unit/cube/test_Cube.py
@@ -25,6 +25,7 @@

from itertools import permutations

import dask.array as da
import numpy as np
import numpy.ma as ma

@@ -66,6 +67,10 @@ def test_matrix(self):
self.assertEqual(type(cube.data), np.ndarray)
self.assertArrayEqual(cube.data, data)

def test_default_share_data(self):
cube = Cube(np.arange(1))
self.assertFalse(cube.share_data)
Member:
share_data should not be a property of a cube.

Member Author:
cube.share_data is part of the Iris public API:
https://github.com/SciTools/iris/blob/v1.13.0/lib/iris/cube.py#L780

It was proposed by a contributor and reviewed and merged by me; it was included in the 1.13 release.
This followed a long and detailed set of feature requirements, which was drastically pared down to try and minimise the impact for version 2.

I do not think it is appropriate to remove this part of the API in the next version of Iris. There has been no deprecation warning, nor any thought on alternative API options to retain the required functionality.

Member:
> thought on alternative API options

There's one just above in my review comments that can be implemented.

Member:
> it was proposed by a contributor and reviewed and merged by me

@marqh then it seems that this functionality was not properly thought through before being added to v1.13, despite numerous requests that it was. This puts you in a difficult situation because this PR is not getting merged in its current form. This behaviour will not be retained in its current form, so your proposed API will have to change, as per the recommendations elsewhere in the review of this PR.



class Test_extract(tests.IrisTest):
def test_scalar_cube_exists(self):
@@ -1877,6 +1882,50 @@ def test_remove_cell_measure(self):
[[self.b_cell_measure, (0, 1)]])


class Test_share_data(tests.IrisTest):
def setter_lazy_data(self):
Member:
I don't believe this will be run: test methods need to be prefixed with test_ for unittest to pick them up.

Member Author:
This is a mistake; these should be def test...

cube = Cube(biggus.NumpyArrayAdapter(np.arange(6).reshape(2, 3)))
Member:
Nope. Biggus is no longer a dependency of Iris.

Member Author:
This is also a mistake, which I missed due to the earlier mistake of the test not running.

cube.share_data = True
self.assertFalse(cube.has_lazy_data())
self.assertTrue(cube._share_data)

def setter_realised_data(self):
Member:
I don't believe this will be run: test methods need to be prefixed with test_ for unittest to pick them up.

cube = Cube(np.arange(6).reshape(2, 3))
cube.share_data = True
self.assertFalse(cube.has_lazy_data())
self.assertTrue(cube._share_data)
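Following up on the rename and the biggus slip acknowledged above, a sketch of how this class might read once fixed (relying on the module's existing np, da, Cube and tests imports; the lazy case uses dask, as in the __getitem__ tests below):

```python
class Test_share_data(tests.IrisTest):
    def test_setter_lazy_data(self):
        # Lazy input via dask rather than biggus.
        cube = Cube(da.from_array(np.arange(6).reshape(2, 3), chunks=6))
        cube.share_data = True
        self.assertFalse(cube.has_lazy_data())
        self.assertTrue(cube._share_data)

    def test_setter_realised_data(self):
        cube = Cube(np.arange(6).reshape(2, 3))
        cube.share_data = True
        self.assertFalse(cube.has_lazy_data())
        self.assertTrue(cube._share_data)
```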


class Test___getitem__no_share_data(tests.IrisTest):
def test_lazy_array(self):
cube = Cube(da.from_array(np.arange(6).reshape(2, 3), chunks=6))
cube2 = cube[1:]
self.assertTrue(cube2.has_lazy_data())
cube.data
self.assertTrue(cube2.has_lazy_data())

def test_ndarray(self):
cube = Cube(np.arange(6).reshape(2, 3))
cube2 = cube[1:]
self.assertIsNot(cube.data.base, cube2.data.base)


class Test___getitem__share_data(tests.IrisTest):
def test_lazy_array(self):
cube = Cube(da.from_array(np.arange(6).reshape(2, 3), chunks=6))
cube.share_data = True
cube2 = cube[1:]
self.assertFalse(cube.has_lazy_data())
self.assertFalse(cube2.has_lazy_data())
self.assertIs(cube.data.base, cube2.data.base)

def test_ndarray(self):
cube = Cube(np.arange(6).reshape(2, 3))
cube.share_data = True
cube2 = cube[1:]
self.assertIs(cube.data.base, cube2.data.base)


class Test__getitem_CellMeasure(tests.IrisTest):
def setUp(self):
cube = Cube(np.arange(6).reshape(2, 3))