Dask merge back #2597
Conversation
* Swap out Biggus concatenate for Dask concatenate
* Re-enable passing concatenate tests
* Fixed typo lazy_array to lazy_data; give fill_value attribute to all lazy data
* Changed most occurrences of biggus.Array instance checks to either dask.array.Array instance checks or lazy data checks
* Updated headers
There is no way this can be properly reviewed by anyone who hasn't been following along extremely closely, so probably not wise to require approval from all devs. I don't have time to do anything other than remain neutral on this.
@pelson you are quite right - I should have said approvals or "I'm fine with this as it is" from each of the @SciTools/iris-devs. My main aim with this rather bold move was to ensure everyone who is interested gets the chance to get eyes on this so that, given the size of the PR, it doesn't get merged prematurely and we all get to regret doing so. @SciTools/iris-devs if you're fine with this as is then by all means just remove yourself as a reviewer.
* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:
Have we lost the means to differentiate between 'nan' values and 'masked' values in that case?
Looks like this is the case:
iris master:
>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- nan],
mask = [False True False],
fill_value = 1e+20)
This branch:
>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- --],
mask = [False True True],
fill_value = 1e+20)
Differentiating between masked values and nan values can be important. An example: regridding a field with masked data to a target with a different coordinate system, where extrapolation is set to 'nan' and takes place due to a mismatch between the source and target domains (i.e. not 100% overlap).
Though I suspect this behaviour has not changed for regridding, at the point of saving this data to disk and loading it back in again we have lost the information telling us which values were actually masked and which were 'nan' values. For our project, we cache data to disk in a way that depends on knowing the difference between a masked value and a 'nan' value, for this very reason.
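For anyone wanting to reproduce the round trip above: the contents of `nan_mask_tmp.nc` aren't shown in this thread, so the following is only an assumed reconstruction (a minimal sketch, not the actual test file) of an array holding both a genuinely masked point and a nan point:

```python
import numpy as np
import iris
from iris.cube import Cube
from iris.coords import DimCoord

# One genuinely masked point and one nan point in the same array.
data = np.ma.masked_array([1.0, 2.0, np.nan], mask=[False, True, False])
cube = Cube(data, var_name='tmp',
            dim_coords_and_dims=[(DimCoord(np.arange(3.0), var_name='x'), 0)])
iris.save(cube, 'nan_mask_tmp.nc')

# On iris master the nan survives the round trip; on this branch it comes back masked.
print(iris.load_cube('nan_mask_tmp.nc').data)
```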
To clarify, I mentioned regridding only as a use case for why one might have both masked and nan values present (I didn't realise there was a problem there).
The thing I'm demonstrating above as no longer working is loading data which has nan values within it (they are now indistinguishable from masked values). I hope this is not intended behaviour, but either way it is not captured by #2578 :)
I think this would be a blocker for us using dask right now at least.
Captured in #2609
* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
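To illustrate the nan-as-mask approach the quoted docs describe, here is a minimal sketch of the general dask/numpy pattern involved (not Iris's actual implementation; the mask condition is hypothetical):

```python
import dask.array as da
import numpy as np

# An integer dask array that needs some of its points masked.
lazy = da.from_array(np.arange(6, dtype='int64'), chunks=3)

# nan cannot be represented as an integer, so the array is cast to float first.
lazy = lazy.astype('float64')
lazy = da.where(lazy > 3, np.nan, lazy)   # hypothetical mask condition

# When realised, the mask is rebuilt from the nan locations.
realised = np.ma.masked_invalid(lazy.compute())
print(realised)
# [0.0 1.0 2.0 3.0 -- --]
```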
What kind of float container? The smallest one possible for the range of values and dtype defined?
My first thought is of memory consumption and performance (speed). As I say, I have not looked at the implementation, nor do I know of any benchmarking performed, but it would give me greater confidence if I knew what this might mean for performance.
Thanks @bjlittle
* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
This doesn't seem robust to me.
Perhaps I'm missing something about the implementation, but I don't think you can represent the full int64 range of values as float64 ones in one container:
>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64').astype('int64')
array([-9223372036854775808, -9223372036854775808])
I have not looked at the implementation. Perhaps this is done element-wise so isn't a problem?
Either way, I think further explanation in the docs here would be useful.
@cpelley Interesting observation. Do you have an actual data use case for this?
My query is not driven by a use case. I'm not sure I have seen a 64-bit integer field which spans a large enough range that it cannot be represented by a 64-bit float field. However, that is my point: I don't know :)
Currently it won't fall over; it will silently overflow.
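As a side note, if one wanted to detect that silent loss rather than accept it, a cheap round-trip check is possible. This is just a sketch of the idea, not something in the PR:

```python
import numpy as np

i64_max = np.iinfo('int64').max      # 9223372036854775807
as_float = np.float64(i64_max)       # rounds to 2**63 = 9223372036854775808.0

# Python's int() conversion of a float64 is exact, so comparing the round trip
# exposes the silent change in value.
print(int(as_float) == i64_max)      # False
```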
Thanks @bjlittle
You can cast to float64 without overflow, but not back, due to rounding:
>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64')
array([ -9.22337204e+18,   9.22337204e+18])
Casting to float is always a compromise though; you can't have a 1-to-1 mapping of all integers to floats of the same bit size.
To extend the illustration:
>>> np.set_printoptions(precision=18)
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='int64')
array([-9223372036854775808, 9223372036854775807])
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='float64')
array([ -9.223372036854775808e+18, 9.223372036854775808e+18])
Note that this problem is not restricted to the very extremes of the limits.
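For example (a quick check, not from the PR itself), exact representation already breaks down at 2**53, well inside the int64 range:

```python
import numpy as np

value = 2**53 + 1                        # nowhere near the int64 limits
print(int(np.float64(value)) == value)   # False: float64 has a 53-bit mantissa
```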
@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
>>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
>>> new_column = column.interpolate(sample_points, scheme)
>>> print(new_column.coord('altitude').points)
[ nan 494.44451904 588.88891602 683.33325195 777.77783203
  872.222229 966.66674805 1061.11108398 1155.55541992 nan]
[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
I think this shows the point I was making above.
does it?
About differentiating between nan and masked values, I think so.
You are right about the point you are making, but I don't believe that this is the right time to be making it.
This PR is to merge a feature branch into Iris which has been under construction for four months, and every decision has already been discussed in great detail. This method may not be ideal, but with dask having no support for masked values it is the best option we have.
We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of the major implementation decisions, which is not the purpose of this PR. This is just to review the last 10 commits, as @bjlittle pointed out in his first comment, so even though you are right, there is really nothing we can do about it now.
will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.
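As a small illustration of the behaviour the quoted docs describe, here is a hedged sketch (the file name is hypothetical; the methods are the ones named above):

```python
import iris

cube = iris.load_cube('air_temperature.nc')   # hypothetical file name

print(cube.has_lazy_data())     # True: the data is still a lazy dask array
lazy = cube.core_data()         # returns that dask array without realising it
print(cube.has_lazy_data())     # still True

real = cube.data                # accessing .data realises the array
print(cube.has_lazy_data())     # False
print(type(cube.core_data()))   # now a plain numpy.ndarray
```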
While in such a related space, is it worth mentioning `Cube.lazy_data()` here, which will return a dask array regardless, no? (Unless this has been changed/removed; I haven't looked at anything which lies outside this PR.) Is the reason for providing this property that converting a NumPy array into a dask array is much more expensive than it was with biggus arrays?
Yes. It may be worth including a mention of coord.lazy_data somewhere, which I will discuss with the dev team here, but this section is specifically about coord.core_data, which refers to the data's current state. This is therefore not the space to add an example of coord.lazy_data, which (as you say) will load a dask array regardless of the data's current state.
Sorry, by "related area" I mean being under the parent level 'When does my data become real?'.
Reading the documentation I was expecting to see it discussed there, or at least referenced to another area of the docs.
True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())
No `Coord.lazy_data()`? What about `Coord.core_data()`?
What about it? What are you expecting to see here?
"Iris cubes and coordinates have very similar interfaces, which extends to accessing
coordinates' lazy points and bounds"
I expect to see Coord.lazy_data()
and Coord.coord_data()
illustrated here if they do indeed do apply to Coordinates like they do with Cubes (and if not, to say so too).
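For what it's worth, a sketch of the sort of illustration being asked for, assuming the coordinate interface mirrors the cube one (i.e. that `Coord.core_points()`/`Coord.has_lazy_points()` and the bounds equivalents exist; that is an assumption, not something verified in this PR):

```python
import dask.array as da
from iris.coords import AuxCoord

# Hypothetical lazy points and bounds for a coordinate.
points = da.arange(4, chunks=2).astype('float64')
bounds = da.stack([points - 0.5, points + 0.5], axis=-1)
aux_coord = AuxCoord(points, bounds=bounds, long_name='level')

print(aux_coord.has_lazy_points())    # True
lazy_pts = aux_coord.core_points()    # dask array, no realisation
print(aux_coord.points)               # realises the points
print(aux_coord.has_lazy_points())    # False
print(aux_coord.has_lazy_bounds())    # still True
```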
Yes. You are right, we should discuss those points in this section somewhere. Not in this PR though, as this is the mergeback of the feature branch for a pre-release candidate. But what I will do is add a link to this comment and the one above in the project ticket about final documentation so that we can include your suggestions in later revisions of the docs.
@cpelley it does say "very similar" and not "identical". As @corinnebosley states, we're going to iterate over the documentation (we know it's not complete or perfect), so this feedback is welcomed; that's why we're keen to make a pre-release candidate available asap.
Thanks both, happy with that :)
I see there is a lot still to discuss!
It sounds like some of the outstanding issues will be blockers for an actual release.
However, I'm happy to merge this as-is.
I have taken the significance of this PR to be that it proposes to merge the dask branch into master, as well as these 10 commits. You are correct that I was unable to commit time to tracking the dask development over these past months, no... @bjlittle
Cheers
I am happy that all assigned reviewers have approved this PR, and all issues raised regarding implementation have been recorded on a ticket to be addressed in the next week or so. I would like to merge this now; if anyone has any objections to that please let me know ASAP, otherwise I'm going to push the button very soon.
🎉 😀
This PR is the `dask` feature branch merge back to master. Unfortunately, the rebase of `master` with this `dask` branch involved many conflicting changes, and as a result this PR branch will not merge back into `dask`, i.e. it's not possible to create a hybrid branch that merges into both `master` and `dask`.

In the process of resolving assorted issues to allow this PR branch to merge back into `master` with all tests passing, the last 10 commits owned by me require review. Prior to those, all commits are a result of the rebase of `master` and this `dask` PR branch, so those have already been reviewed as part of normal `dask` development work (with rebase conflict resolution where appropriate).

The commits that require review are:

Note that the last two of the above commits are the missing commits from the `dask` branch (i.e. those commits that were merged to `dask` after cutting this PR branch), which could not be merged, rebased or cherry-picked onto this branch, and so I added them manually.

Also, this PR does not contain #2549, which was merged at the last moment into `v1.13`. It was simply not possible to easily support its inclusion without significant new development work. @cpelley has been notified - and so we need to discuss that matter further.

"The `biggus` is dead. Long live the `dask`."