
Dask merge back #2597

Merged: 122 commits merged into master, Jun 14, 2017
Conversation

@bjlittle (Member) commented Jun 9, 2017

This PR is the dask feature branch merge back to master.

Unfortunately, the rebase of master with this dask branch involved many conflicting changes, and as a result this PR branch will not merge back into dask, i.e. it's not possible to create a hybrid branch that merges into both master and dask.

In the process of resolving assorted issues to allow this PR branch to merge back into master with all tests passing, the last 10 commits owned by me require review. Prior to those, all commits are a result of the rebase of master and this dask PR branch, so those have been reviewed as part of normal dask development work (with rebase conflict resolution where appropriate).

The commits that require review are:

Note that the last two commits above are the missing commits from the dask branch (i.e. those commits that were merged to dask after this PR branch was cut), which could not be merged, rebased or cherry-picked onto this branch, so I added them manually.

Also, this PR does not contain #2549, which was merged at the last moment into v1.13. It was simply not possible to support its inclusion without significant new development work. @cpelley has been notified, and we need to discuss that matter further.

"The biggus is dead. Long live the dask".

pp-mo and others added 30 commits May 26, 2017 15:36
* Swap out Biggus concatenate for Dask concatenate

* Re-enable passing concatenate tests
* fixed typo lazy_array to lazy_data; give fill_value attribute to all lazy data
* Changed most occurrences of biggus.Array instance checks to either dask.array.Array instance checks or lazy data checks

* updated headers
@ajdawson (Member) commented:

There is no way this can be properly reviewed by anyone who hasn't been following along extremely closely, so it's probably not wise to require approval from all devs. I don't have time to do anything other than remain neutral on this.

@DPeterK (Member) commented Jun 13, 2017

> In fact, I propose that we do not merge this until we get approvals from every @SciTools/iris-devs member for this one.

> Frankly, this PR is unreviewable

@pelson you are quite right - I should have said approvals or "I'm fine with this as it is" from each of the @SciTools/iris-devs. My main aim with this rather bold move was to ensure that everyone who is interested gets the chance to get eyes on this, so that, given the size of the PR, it doesn't get merged prematurely and regretted by us all.

@SciTools/iris-devs if you're fine with this as is then by all means just remove yourself as a reviewer.

@DPeterK removed the request for review from pelson, Jun 13, 2017 08:28
* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:
@cpelley commented Jun 13, 2017

Have we lost the means to differentiate between 'nan' values and 'masked' values in that case?
Looks like this is the case:

iris master:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- nan],
             mask = [False  True False],
       fill_value = 1e+20)

This branch:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- --],
             mask = [False  True  True],
       fill_value = 1e+20)

Differentiating between masked values and nan values can be important. An example: regridding a field with masked data to a target with a different coordinate system, where extrapolation is set to 'nan' and takes place due to a mismatch between the source and target domains (i.e. not 100% overlap).
While I suspect this behaviour has not changed for regridding itself, at the point of saving this data to disk and loading it back in again we have lost the information that tells us which values were actually masked and which were 'nan'. For our project we cache data to disk, and depend on knowing the difference between a masked value and a 'nan' for this very reason.
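For reference, here is a minimal sketch of how a file like 'nan_mask_tmp.nc' could be constructed to reproduce this, holding one ordinary, one masked and one nan value (the construction is my illustration; the original file is not shown in this PR):

import numpy as np
import iris
from iris.cube import Cube

# One ordinary value, one masked value, one nan value.
data = np.ma.masked_array([1.0, 2.0, np.nan], mask=[False, True, False])
cube = Cube(data, var_name='tmp')
iris.save(cube, 'nan_mask_tmp.nc')
# On iris master the reloaded cube keeps the nan distinct from the mask;
# on this branch both come back as masked.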

@bjlittle (Member, Author) replied:

Thanks @cpelley. This is a known bug that we need to address; see #2578.

@cpelley commented Jun 14, 2017

To clarify, I mentioned regridding only as a use case for why one might have both masked and nan values present (I didn't realise there was a problem there).

The thing I'm demonstrating above as no longer working is loading data which has nan values within it (they become indistinguishable from masked values). I hope this is not intended behaviour, but either way it is not captured by #2578 :)

I think this would be a blocker for us using dask right now at least.

Reply:

Captured in #2609

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
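As a hedged illustration of the scheme the bullets above describe (my sketch, not code from the branch), a mask can be derived from the nan locations of a dask array like so:

import dask.array as da
import numpy as np

# A lazy array in which missing data are represented by nan values.
lazy = da.from_array(np.array([1.0, np.nan, 3.0]), chunks=2)

# Realise the data and derive the mask from the nan locations.
masked = np.ma.masked_invalid(lazy.compute())
print(masked)  # [1.0 -- 3.0]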
@cpelley commented Jun 13, 2017

What kind of float container? The smallest one possible for the range of values and dtype defined?
My first thought is of memory consumption and performance (speed). As I say, I have not looked at the implementation, nor do I know of any benchmarking performed, but it would give me greater confidence if I knew what this might mean for performance.

@bjlittle (Member, Author) replied:

@cpelley I've created issue #2602 to address this concern.

@cpelley replied:

Thanks @bjlittle

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
@cpelley commented Jun 13, 2017

This doesn't seem robust to me:

Perhaps I'm missing something about the implementation but I don't think you can represent the full int64 range of values as float64 ones in one container:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64').astype('int64')
array([-9223372036854775808, -9223372036854775808])

I have not looked at the implementation. Perhaps this is done element-wise so isn't a problem?
Either way, I think further explanation in the docs here would be useful.

@bjlittle (Member, Author) replied:

@cpelley Interesting observation. Do you have an actual data use case for this?

@cpelley replied:

My query is not driven by a use case. I'm not sure I have ever seen a 64-bit integer field which spans a large enough range that it cannot be represented by a 64-bit float field. However, that is my point: I don't know :)
Currently it won't fall over; it will silently overflow.
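One way such a silent overflow could be made loud (a sketch of my own, not something this branch does) is a round-trip check before casting:

import numpy as np

def survives_float64_roundtrip(arr):
    # True only if every value survives int -> float64 -> int unchanged.
    return np.array_equal(arr.astype('float64').astype(arr.dtype), arr)

big = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
print(survives_float64_roundtrip(big))            # False: values not preserved
print(survives_float64_roundtrip(np.arange(10)))  # True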

@bjlittle (Member, Author) replied:

@cpelley I've raised issue #2603 to investigate this further.

@cpelley replied:

Thanks @bjlittle

A member replied:

You can cast to float64 without overflow, but not back due to rounding:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64')
array([ -9.22337204e+18,   9.22337204e+18])

Casting to float is always a compromise though; you can't have a one-to-one mapping of all integers to floats at the same bit size.

Reply:

To extend the illustration:

>>> np.set_printoptions(precision=18)
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='int64')
array([-9223372036854775808, 9223372036854775807])
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='float64')
array([ -9.223372036854775808e+18,   9.223372036854775808e+18])

Note that this problem is not restricted to the very extremes of the limits.

@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
 >>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
 >>> new_column = column.interpolate(sample_points, scheme)
 >>> print(new_column.coord('altitude').points)
-[ nan 494.44451904 588.88891602 683.33325195 777.77783203
- 872.222229 966.66674805 1061.11108398 1155.55541992 nan]
+[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
@cpelley replied:

I think this shows the point I was making above.

A member replied:

does it?

@cpelley replied:

About differentiating between nan and masked values, I think so.

@corinnebosley (Member) replied:

You are right about the point you are making, but I don't believe that this is the right time to be making it.

This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already. This method may not be ideal, but with dask having no support for masked values it is the best option we have.

We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions; this PR is not that space. This PR is just to review the last 10 commits, as @bjlittle pointed out in his first comment, so even though you are right, there is really nothing we can do about it now.

will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.
@cpelley commented Jun 13, 2017

While we're in such a related space, is it worth mentioning Cube.lazy_data() here, which will return a dask array regardless? (Unless this has changed or been removed; I haven't looked at anything outside this PR.) Is the reason for providing this property that converting a NumPy array into a dask array is much more expensive than it was with biggus arrays?

A member replied:

Yes. It may be worth including a mention of coord.lazy_data somewhere, which I will discuss with the dev team here, but this section is specifically about coord.core_data, which refers to the data's current state. This is therefore not the space to add an example of coord.lazy_data, which (as you say) will load a dask array regardless of the data's current state.

@cpelley commented Jun 13, 2017

Sorry, by 'related area' I meant under the parent section 'When does my data become real?'.
Reading the documentation I was expecting to see it discussed there, or at least referenced to another area of the docs.
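To illustrate the distinction being discussed, here is a sketch assuming the cube API described in this PR (the sample file is just an example):

import iris

cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))

print(cube.has_lazy_data())     # True: the data is an unrealised dask array
current = cube.core_data()      # the dask array; does not realise the data
always_lazy = cube.lazy_data()  # a dask array regardless of current state

cube.data                       # touching .data realises it as a NumPy array
current = cube.core_data()      # now returns the real NumPy array
print(cube.has_lazy_data())     # False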

True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())
@cpelley commented Jun 13, 2017

No Coord.lazy_data()? What about Coord.core_data()?

A member replied:

What about it? What are you expecting to see here?

@cpelley commented Jun 13, 2017

"Iris cubes and coordinates have very similar interfaces, which extends to accessing
coordinates' lazy points and bounds"

I expect to see Coord.lazy_data() and Coord.coord_data() illustrated here if they do indeed do apply to Coordinates like they do with Cubes (and if not, to say so too).

@corinnebosley (Member) replied:

Yes. You are right, we should discuss those points in this section somewhere. Not in this PR though, as this is the mergeback of the feature branch for a pre-release candidate. But what I will do is add a link to this comment and the one above in the project ticket about final documentation so that we can include your suggestions in later revisions of the docs.

@bjlittle (Member, Author) replied:

@cpelley it does say very similar and not identical. As @corinnebosley states, we're going to iterate over the documentation (we know it's not complete or perfect), so this feedback is welcomed; that's why we're keen to make a pre-release candidate available asap.

@cpelley replied:

Thanks both, happy with that :)
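For completeness, a sketch of the coordinate side of that interface, assuming the bounds methods shown in the diff above plus their points counterparts (treat the exact names here as my assumption):

import dask.array as da
from iris.coords import AuxCoord

# An illustrative auxiliary coordinate built from lazy points and bounds.
points = da.arange(4, chunks=2)
bounds = da.stack([points - 0.5, points + 0.5], axis=-1)
aux_coord = AuxCoord(points, long_name='level', bounds=bounds)

print(aux_coord.has_lazy_points())  # True
print(aux_coord.has_lazy_bounds())  # True
pts = aux_coord.core_points()       # lazy or real, whichever is current
lazy_pts = aux_coord.lazy_points()  # a lazy array regardless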

@bjlittle mentioned this pull request, Jun 13, 2017
@bjlittle (Member, Author) commented Jun 13, 2017

dask.async.get_sync has moved to dask.local.get_sync as of dask 0.15 (released two days ago), so I need to address that in this PR to get the failing unit test to work. 0.15 issues a warning that causes one of our unit tests to fail: a win for unit testing 😉
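One possible way to cope with that move (a sketch; the actual fix in this PR may differ):

try:
    from dask.local import get_sync  # dask >= 0.15
except ImportError:
    # 'async' is a reserved word on modern Python, so import it dynamically.
    import importlib
    get_sync = importlib.import_module('dask.async').get_sync  # dask < 0.15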

@pp-mo (Member) left a review:

I see there is a lot still to discuss!
It sounds like some of the outstanding issues will be blockers for an actual release.
However, I'm happy to merge this as-is.

@cpelley commented Jun 14, 2017

@corinnebosley:

> You are right about the point you are making, but I don't believe that this is the right time to be making it.... This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already...
>
> We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions, which is not in this PR. This is just to review the last 10 commits,...

I have taken the significance of this PR to be that it proposes to merge the dask branch into master, as well as these 10 commits. You are correct that I was unable to commit time to tracking the dask development over these past months, no...

@bjlittle
So here is a summary of what concerns me (highlighting my comments above):

  1. Iris no longer distinguishes between nan and masked values when loading data from disk with 'nan' values. I hope this one is simply a bug (comment).
  2. I'm not confident that casting arrays to float is a robust approach for supporting masks with dask (it could restrict the range of values representable in the original container, and at worst it might silently overflow if unchecked). This is not to say that I have a firm view either way, only that I'm not 'confident' based on what I know; and if there is a silent overflow, then this PR adds a potential bug (comment).
  3. The performance (memory and speed) characteristics of the dask mask-support implementation could have a significant impact on end users. Again, this ties into point 2: confidence is achieved through understanding the potential impact of the approach chosen (comment). This one has a raised issue, 'int to float promotion' #2602; thanks @bjlittle.

Cheers

@corinnebosley (Member) commented:

I am happy that all assigned reviewers have approved this PR, and all issues raised regarding implementation have been recorded on a ticket to be addressed in the next week or so. I would like to merge this now; if anyone has any objections to that please let me know ASAP, otherwise I'm going to push the button very soon.

@corinnebosley merged commit 6bd26b7 into SciTools:master, Jun 14, 2017
@rhattersley (Member) commented:

🎉 😀
