Remove deepcopies when slicing cubes and copying coords #1992

Closed
wants to merge 7 commits into from

rhattersley
Member

An updated version of #939.

Not ready for merging...

...the remaining open question is whether being able to switch this change on/off with a Future toggle out-weighs the implementation leakage.

This is a first attempt at removing unnecessary (and very slow) deepcopy operations
when slicing or otherwise manipulating cubes and coordinates. See SciTools#914.

Note: A few of the unit tests are failing, because they insist on checking the
order (Fortran or C) of numpy arrays. I think these checks should be removed,
because it is a waste of computational effort to always ensure arrays are
contiguous. If some code needs to interface with external code that
requires contiguous arrays, it should use np.ascontiguousarray or
np.asfortranarray at the immediate level of the wrapper.
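
As a rough illustration of that suggestion, a wrapper can ask for contiguity only at the boundary with the external code. This is a minimal sketch: `external_sum` and `call_external` are hypothetical names invented for the example and are not part of Iris.

```python
import numpy as np


def external_sum(buffer):
    # Hypothetical stand-in for external code that requires C-contiguous input.
    if not buffer.flags["C_CONTIGUOUS"]:
        raise ValueError("expected a C-contiguous array")
    return float(buffer.sum())


def call_external(array):
    # Contiguity is enforced here, at the wrapper, and nowhere else.
    # A copy is made only if the input is not already C-contiguous.
    return external_sum(np.ascontiguousarray(array))


data = np.arange(12.0).reshape(3, 4)
column_view = data[:, 1]  # a strided, non-contiguous view; no copy is made
total = call_external(column_view)
```

With this arrangement the cost of making data contiguous is paid only by callers that actually hand over a non-contiguous array.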
@rhattersley
Member Author

Ping @cpelley re. #1983.

@cpelley

cpelley commented Apr 29, 2016

Great stuff, thanks for doing this @rhattersley, very much appreciated.

My only concern is that when the cube is still lazy and sliced, the sliced cube data when realised will not realise the data in the original cube. Could this not pose a possible danger for some users, returning views of their arrays (when they thought it was still lazy), or vice versa if the data has been actualised?

Regarding the future toggle, it does sound like a good idea. I'm sure there must be people who have hardcoded slicing of the cube (not expecting views of their data).

Check this out @jkettleb :)

Thanks @rhattersley

@rhattersley
Member Author

... when the cube is still lazy and sliced ... the sliced cube data when realised will not realise the data in the original cube.

True.

>>> air_temp.has_lazy_data()
True
>>> t0 = air_temp[0]
>>> t0.has_lazy_data()
True
>>> real_numbers = t0.data
>>> t0.has_lazy_data()
False
>>> air_temp.has_lazy_data()
True

Could this not pose a possible danger for some users, returning views of their arrays (when they thought it was still lazy), or vice versa if the data has been actualised?

It's certainly possible that user code might rely on the old data copying behaviour to ensure that modifying the data in a sliced cube doesn't modify the data in the original cube. My guess is that such reliance is rare, but if there is any affected code the impact from this change might be hard to track down. (The same sort of thing applies to coordinate copies.)

Other than that I think the only implication is a change to the memory/performance characteristics.
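
The kind of user code that could be affected can be illustrated with plain numpy arrays. This sketch shows numpy's own semantics only; it does not use Iris and makes no claim about how the PR is implemented.

```python
import numpy as np

original = np.arange(5)

# Basic slicing returns a view: writes go through to the original array.
shared = original[1:4]
shared[0] = -1

# An explicit copy is independent of the original.
independent = original[1:4].copy()
independent[1] = 99
```

Code that modifies a slice and expects the source to stay untouched would need the explicit `.copy()` once slicing shares data.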

@cpelley

cpelley commented May 3, 2016

Other than that I think the only implication is a change to the memory/performance characteristics.

Do you think a degradation in performance is significant enough to worry about?
I'm guessing this degradation would come from reading data from disk which may or may not be contiguous after slicing?

I think this would be a really nice addition :)

@rhattersley
Member Author

Do you think a degradation in performance is significant enough to worry about?

I wasn't implying performance will degrade, just that the performance will change. Hopefully mostly for the better, but perhaps sometimes for the worse. It will depend on the use case.

@rhattersley
Member Author

Options:

  1. Switch to the new behaviour in 1.10
  2. Leave 1.10 untouched. Switch to the new behaviour in 2.0
  3. Add an iris.FUTURE toggle in 1.10, leaving the 1.9 behaviour as the default. Update the default behaviour in 2.0.

NB. Even if we use an iris.FUTURE toggle I can't think of a sensible way to issue a deprecation warning for the deep-copying behaviour. Lots of user code is going to be doing cube indexing without caring about the data-copying behaviour, and there's no real way for Iris to tell. It doesn't make much sense to have almost everyone have to insert iris.FUTURE.share_data = True in their scripts even though it only makes a difference for a tiny fraction of people. Plus, there's the knock-on impact on other parts of the Iris API that make use of cube indexing.

My current suggestion: go with option (3) but without any kind of deprecation warning. (When we make the switch in 2.0 we would still need to deprecate the iris.FUTURE.share_data toggle.)
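
To make option (3) concrete, here is a minimal sketch of how an opt-in toggle could gate the copying behaviour. The `Future` class, the `slice_data` helper and the `context` manager are hypothetical stand-ins written for this example; only the `share_data` name comes from the discussion above, and the real `iris.FUTURE` object may work differently.

```python
import contextlib

import numpy as np


class Future:
    """Hypothetical options object; not the real iris.FUTURE."""

    def __init__(self, share_data=False):
        self.share_data = share_data

    @contextlib.contextmanager
    def context(self, **kwargs):
        # Temporarily override options, restoring the old values on exit.
        saved = dict(vars(self))
        vars(self).update(kwargs)
        try:
            yield
        finally:
            vars(self).clear()
            vars(self).update(saved)


FUTURE = Future()


def slice_data(data, keys):
    # Old behaviour (the default): always copy.
    # New behaviour (opt-in): return whatever numpy indexing gives,
    # which is a view for basic slices.
    result = data[keys]
    return result if FUTURE.share_data else result.copy()


data = np.arange(6)
copied = slice_data(data, slice(0, 3))
with FUTURE.context(share_data=True):
    viewed = slice_data(data, slice(0, 3))
```

The default stays on the old behaviour, so existing scripts are unaffected unless they opt in.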

@rhattersley
Member Author

I've pushed a commit which implements option (3).

@rhattersley rhattersley added this to the v1.10 milestone May 4, 2016
@cpelley

cpelley commented May 5, 2016

I'm happy with option (3) (no risk then, and least controversy).
See rhattersley#8 if you're interested; otherwise the PR looks ready to go :)

Thanks @rhattersley

@cpelley

cpelley commented May 6, 2016

Ooh, just remembered: don't we need to update the "what's new" for this?

@pp-mo
Member

pp-mo commented May 6, 2016

I was initially a bit horrified at breaking so much deeply-buried subtle behaviour!
But on re-reading @shoyer's original #914, I must say he has a powerful point:
I think I now agree that it would make more sense to behave like numpy, and force explicit copies when wanted. If we are going to do that, it needs to be soon.

@pp-mo
Member

pp-mo commented May 6, 2016

breaking so much ... behaviour

After a closer look at this, I really don't see _why_ we can't issue deprecation warnings for these changes.
Very little code is touched, we just need to warn in those places.
True, it will affect nearly everyone, but only until 2.0, and it is something everyone needs to know.

  • In fact, we should really add something to the user guide about this.

See my proposals at #1999 for making attempts to commit to deeper promises regarding deprecation warnings.
In particular : summary-of-provisions comment
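
A minimal sketch of what "warn in those places" might look like, assuming a module-level flag as a stand-in for the toggle. `SHARE_DATA` and `slice_with_warning` are hypothetical names for illustration, and the warning text is invented for this example.

```python
import warnings

import numpy as np

SHARE_DATA = False  # hypothetical stand-in for iris.FUTURE.share_data


def slice_with_warning(data, keys):
    result = data[keys]
    if SHARE_DATA:
        # New behaviour: share the data, no warning needed.
        return result
    # Old behaviour: copy, and tell the caller that this will change.
    warnings.warn(
        "Slicing currently copies the data; a future release will return "
        "a view instead. Opt in to the new behaviour to silence this.",
        DeprecationWarning,
        stacklevel=2,
    )
    return result.copy()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    part = slice_with_warning(np.arange(4), slice(1, 3))
```

Because the check sits at the point of slicing, every caller on the old behaviour sees the warning until they opt in, which is the trade-off debated above.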

@cpelley

cpelley commented May 9, 2016

...I really don't see why we can't issue deprecation warnings for these changes.
Very little code is touched, we just need to warn in those places.

Happy if a deprecation warning were to be issued. Assuming you're happy with keeping the FUTURE toggle though?

In fact, we should really add something to the user guide about this.

+1 Perhaps an explicit 'Copies and views' section. However, unless I'm mistaken, there is a bigger hole here to describe in the user guide that extends beyond this behavioural change (an explicit section that explains when/what provides views/copies of what). There is a great deal of misunderstanding amongst the iris community. Perhaps this information is in there somewhere? For this reason, I would propose splitting the user guide work out into another issue with the v1.10 milestone. What do you think?

@pp-mo
Member

pp-mo commented May 9, 2016

Assuming you're happy with keeping the FUTURE toggle though?

Yes, we need a control and I think it's just the kind of thing FUTURE should be used for.

My "new proposals" include some extra rules + enhanced importance for deprecations : It's of key importance that you can avoid deprecated features.

@pp-mo
Member

pp-mo commented May 9, 2016

However, unless I'm mistaken, there is a bigger hole here ...
an explicit section that explains when/what provides views/copies of what).
There is a great deal of misunderstanding amongst the iris community.
Perhaps this information is in there somewhere?

I think any existing issues with copies + views are all rooted in _numpy_, as Iris itself (as it currently stands) makes strenuous efforts to avoid producing views anywhere within the cube operations.
No ?

So, it should really refer to the numpy docs to explain the concept.
The problem is, numpy documentation is rather weak on fundamental concepts.
Having just reviewed it, I think the best you get on "views" is a short entry in the glossary and a couple of mentions in the "Indexing" section.
Actually, even the detailed reference docs don't routinely make clear when views as opposed to new data may be returned,
much as the stats routines mostly don't bother to explain exactly what they do with missing data elements.

For this reason, I would propose splitting the user guide work out into another issue with the v1.10 milestone. What do you think?

I think we should definitely not merge this without the accompanying documentation, so I'm not sure of the benefit of treating it as separate.
As I'm in writing mode, I might try to produce something ...

@cpelley

cpelley commented May 9, 2016

I think any existing issues with copies + views are all rooted in numpy, as Iris itself (as it currently stands) makes strenuous efforts to avoid producing views anywhere within the cube operations.
No ?

I meant a bit wider than this. Other examples might include dropping metadata when performing most operations. I understand why but an explicit top-level section concerning the subtle behaviours of working with cubes would help iris users get to grips with how to better think about the concept of cubes.

I think we should definitely not merge this without the accompanying documentation, so I'm not sure of the benefit of treating it as separate.
As I'm in writing mode, I might try to produce something ...

Thanks @pp-mo sure.

@rhattersley rhattersley removed this from the v1.10 milestone May 11, 2016
@rhattersley
Member Author

I'm not sure we're ready to force this into v1.10. It hasn't had very wide-spread discussion.

@pp-mo
Member

pp-mo commented May 11, 2016

I'm not sure we're ready to force this into v1.10

I took a look into how to document/explain this, and was rather taken aback by some of the existing behaviour. I had thought we consistently avoided view-like copies in Iris up to now, but as it turns out that is not the case for coords.
Would you believe...

>>> from iris.coords import AuxCoord
>>> co1 = AuxCoord([1,2,3,4,5])
>>> co2 = co1.copy()
>>> co2.points[2:4] = 77
>>> co2
AuxCoord(array([ 1,  2, 77, 77,  5]), standard_name=None, units=Unit('1'))
>>> co1
AuxCoord(array([1, 2, 3, 4, 5]), standard_name=None, units=Unit('1'))
>>> 
>>> co3 = co1[:]
>>> co3.points[1:3] = -99
>>> co3
AuxCoord(array([  1, -99, -99,   4,   5]), standard_name=None, units=Unit('1'))
>>> co1
AuxCoord(array([  1, -99, -99,   4,   5]), standard_name=None, units=Unit('1'))

So [:] delivers a new coord with a view on the old one after all.

This could make it harder to explain what we are changing.

@rhattersley
Member Author

This could make it harder to explain what we are changing.

Thanks for clarifying @pp-mo. Coming from a numpy perspective, having coord[:] return a view but coord.copy() return a copy makes a lot of sense and provides choice & flexibility to the user. I don't see how changing the behaviour of copy (as in this PR) helps matters.

Rather than make coord.copy() return a view we should be looking at the code that's using copy and decide whether it's appropriate to use indexing instead.
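
The distinction between indexing and `copy()` can be sketched with a small stand-in class. `SimpleCoord` is a hypothetical, much-simplified illustration built on numpy; it is not Iris's `AuxCoord`, and it mirrors only the behaviour shown in the session above.

```python
import numpy as np


class SimpleCoord:
    """Hypothetical, much-simplified stand-in for a coordinate."""

    def __init__(self, points):
        self.points = np.asarray(points)

    def __getitem__(self, keys):
        # Indexing wraps whatever numpy returns, which is a view
        # of the existing points for basic slices such as [:].
        return SimpleCoord(self.points[keys])

    def copy(self):
        # copy() always hands back independent data.
        return SimpleCoord(self.points.copy())


co1 = SimpleCoord([1, 2, 3, 4, 5])

co2 = co1.copy()
co2.points[2:4] = 77   # co1 is unaffected

co3 = co1[:]
co3.points[1:3] = -99  # co1 sees this change
```

Under this split, code that wants shared data uses indexing and code that wants isolation calls `copy()`, so neither method has to change meaning.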

@cpelley

cpelley commented Jul 26, 2016

@pp-mo are we able to reach an agreed way forward for this?
We would love to see this get into iris.

Cheers

@cpelley

cpelley commented Aug 16, 2016

This PR looks good to go to me and provides significant benefit to us.
Can someone with 'merge privileges' take a look please?

Cheers

@cpelley

cpelley commented Aug 23, 2016

ping

@cpelley

cpelley commented Nov 29, 2016

Please let me know if there's anything I can do to help to get this in.
This would provide very useful capability for CMIP6 due to the sheer size of data involved.

Cheers

@marqh
Member

marqh commented Dec 7, 2016

Hello @cpelley

please accept my apologies for this getting left behind and you having to chase so.

I feel that the change makes sense and I am content to support it.

...the remaining open question is whether being able to switch this change on/off with a Future toggle out-weighs the implementation leakage.
#939 (comment)

Is this still an open question that needs addressing?

This is part of the more general hazard of global flags that change functionality.
I agree with this in principle, but the question is out of scope for this PR.

Practically, it is a goodly while since this code was written (and tested) and there are references to "deprecated in 1.10";
we have since cut 1.10 and 1.11.
@cpelley would you be prepared to create your own PR from the commits in this one and adapt the deprecation messages to state 1.12?

I am minded to merge such a PR once I have seen it

thank you
mark

@cpelley

cpelley commented Dec 8, 2016

Thanks for taking this up @marqh

Is this still an open question that needs addressing?

A subtlety I missed the first time around. I think we are opening ourselves to some potential problems by having the option to switch between both behaviours but whether this risk is greater than the backlash (impact) of changing the default behaviour right away... I don't know.

I never feel too uncomfortable going with @rhattersley's preferred option :).

@cpelley would you be prepared to create your own PR...

Happy to make a new PR

Thanks @marqh

@pp-mo
Member

pp-mo commented Jan 4, 2017

In #2261 ...

Replaces #1992

5 participants