
Dask merge back #2597

Merged: 122 commits merged into master, Jun 14, 2017
Conversation

@bjlittle (Member) commented Jun 9, 2017

This PR is the dask feature branch merge back to master.

Unfortunately, the rebase of master with this dask branch involved many conflicting changes, and as a result this PR branch will not merge back into dask, i.e. it's not possible to create a hybrid branch that merges into both master and dask.

In the process of resolving assorted issues to allow this PR branch to merge back into master with all tests passing, the last 10 commits owned by me require review. Prior to those, all commits are a result of the rebase of master and this dask PR branch, so those have been reviewed as part of normal dask development work (with rebase conflict resolution where appropriate).

The commits that require review are:

Note that the last two commits above are the missing commits from the dask branch (i.e. those commits that were merged to dask after this PR branch was cut), which could not be merged, rebased or cherry-picked onto this branch, so I added them manually.

Also, this PR does not contain #2549, which was merged at the last moment into v1.13. It was simply not possible to support its inclusion without significant new development work. @cpelley has been notified, and we need to discuss that matter further.

"The biggus is dead. Long live the dask".

pp-mo and others added 30 commits May 26, 2017 15:36
* Swap out Biggus concatenate for Dask concatenate

* Re-enable passing concatenate tests
* fixed typo lazy_array to lazy_data; give fill_value attribute to all lazy data
* Changed most occurrences of biggus.Array instance checks to either dask.array.Array instance checks or lazy data checks

* updated headers
@ajdawson (Member) commented:

There is no way this can be properly reviewed by anyone who hasn't been following along extremely closely, so it's probably not wise to require approval from all devs. I don't have time to do anything other than remain neutral on this.

@DPeterK (Member) commented Jun 13, 2017

> In fact, I propose that we do not merge this until we get approvals from every @SciTools/iris-devs member for this one.

> Frankly, this PR is unreviewable

@pelson you are quite right - I should have said approvals or "I'm fine with this as it is" from each of the @SciTools/iris-devs. My main aim with this rather bold move was to ensure that everyone who is interested gets the chance to get eyes on this, so that, given the size of the PR, it doesn't get merged prematurely and regretted by us all.

@SciTools/iris-devs if you're fine with this as is then by all means just remove yourself as a reviewer.

@DPeterK removed the request for review from pelson, Jun 13, 2017 08:28
* this may wrap a proxy to a file collection, or
* this may wrap the NumPy array in ``cube._numpy_array``.

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:
@cpelley commented Jun 13, 2017

Have we lost the means to differentiate between 'nan' values and 'masked' values in that case?
Looks like this is the case:

iris master:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- nan],
             mask = [False  True False],
       fill_value = 1e+20)

This branch:

>>> iris.load_cube('nan_mask_tmp.nc').data
masked_array(data = [1.0 -- --],
             mask = [False  True  True],
       fill_value = 1e+20)

Differentiating between masked values and nan values can be important. An example: regridding a field with masked data to a target with a different coordinate system, where extrapolation is set to 'nan' and takes place due to a mismatch between the source and target domains (i.e. not 100% overlap).
While I suspect this behaviour has not changed for regridding itself, at the point of saving this data to disk and loading it back in again we have lost the information that tells us which values were actually masked and which were 'nan'. For our project we cache data to disk, and depend on knowing the difference between a masked value and a 'nan' for this very reason.
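For reference, here is a minimal sketch of how a file like 'nan_mask_tmp.nc' could be constructed to reproduce this, holding one ordinary, one masked and one nan value (the construction is my illustration; the original file is not shown in this PR):

import numpy as np
import iris
from iris.cube import Cube

# One ordinary value, one masked value, one nan value.
data = np.ma.masked_array([1.0, 2.0, np.nan], mask=[False, True, False])
cube = Cube(data, var_name='tmp')
iris.save(cube, 'nan_mask_tmp.nc')
# On iris master the reloaded cube keeps the nan distinct from the mask;
# on this branch both come back as masked.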

@bjlittle (Member, Author) replied:

Thanks @cpelley. This is a known bug that we need to address; see #2578.

@cpelley commented Jun 14, 2017

To clarify, I mentioned regridding only as a use case for why one might have both masked and nan values present (I didn't realise there was a problem there).

The thing I'm demonstrating above as no longer working is loading data which has nan values within it (they become indistinguishable from masked values). I hope this is not intended behaviour, but either way it is not captured by #2578 :)

I think this would be a blocker for us using dask right now at least.

Reply:

Captured in #2609

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
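As a hedged illustration of the scheme the bullets above describe (my sketch, not code from the branch), a mask can be derived from the nan locations of a dask array like so:

import dask.array as da
import numpy as np

# A lazy array in which missing data are represented by nan values.
lazy = da.from_array(np.array([1.0, np.nan, 3.0]), chunks=2)

# Realise the data and derive the mask from the nan locations.
masked = np.ma.masked_invalid(lazy.compute())
print(masked)  # [1.0 -- 3.0]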
@cpelley commented Jun 13, 2017

What kind of float container? The smallest one possible for the range of values and dtype defined?
My first thought is of memory consumption and performance (speed). As I say, I have not looked at the implementation, nor do I know of any benchmarking performed, but it would give me greater confidence if I knew what this might mean for performance.

@bjlittle (Member, Author) replied:

@cpelley I've created issue #2602 to address this concern.

@cpelley replied:

Thanks @bjlittle

* All dask arrays wrap array-like objects where missing data are represented by ``nan`` values:

* Masked arrays derived from these dask arrays create their mask using the locations of ``nan`` values.
* Where dask-wrapped arrays of ``int`` require masks, these arrays will first be cast to ``float``.
@cpelley commented Jun 13, 2017

This doesn't seem robust to me:

Perhaps I'm missing something about the implementation but I don't think you can represent the full int64 range of values as float64 ones in one container:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64').astype('int64')
array([-9223372036854775808, -9223372036854775808])

I have not looked at the implementation. Perhaps this is done element-wise so isn't a problem?
Either way, I think further explanation in the docs here would be useful.

@bjlittle (Member, Author) replied:

@cpelley Interesting observation. Do you have an actual data use case for this?

@cpelley replied:

My query is not driven by a use case. I'm not sure I have ever seen a 64-bit integer field which spans a large enough range that it cannot be represented by a 64-bit float field. However, that is my point: I don't know :)
Currently it won't fall over; it will silently overflow.
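One way such a silent overflow could be made loud (a sketch of my own, not something this branch does) is a round-trip check before casting:

import numpy as np

def survives_float64_roundtrip(arr):
    # True only if every value survives int -> float64 -> int unchanged.
    return np.array_equal(arr.astype('float64').astype(arr.dtype), arr)

big = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
print(survives_float64_roundtrip(big))            # False: values not preserved
print(survives_float64_roundtrip(np.arange(10)))  # True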

@bjlittle (Member, Author) replied:

@cpelley I've raised issue #2603 to investigate this further.

@cpelley replied:

Thanks @bjlittle

A member replied:

You can cast to float64 without overflow, but not back due to rounding:

>>> arr = np.array([np.iinfo('int64').min, np.iinfo('int64').max])
>>> arr
array([-9223372036854775808,  9223372036854775807])
>>> arr.astype('float64')
array([ -9.22337204e+18,   9.22337204e+18])

Casting to float is always a compromise though; you can't have a one-to-one mapping of all integers to floats at the same bit size.

Reply:

To extend the illustration:

>>> np.set_printoptions(precision=18)
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='int64')
array([-9223372036854775808, 9223372036854775807])
>>> np.array([np.iinfo('int64').min, np.iinfo('int64').max], dtype='float64')
array([ -9.223372036854775808e+18,   9.223372036854775808e+18])

Note that this problem is not restricted to the very extremes of the limits.

@@ -176,8 +176,8 @@ For example, to mask values that lie beyond the range of the original data:
 >>> scheme = iris.analysis.Linear(extrapolation_mode='mask')
 >>> new_column = column.interpolate(sample_points, scheme)
 >>> print(new_column.coord('altitude').points)
-[ nan 494.44451904 588.88891602 683.33325195 777.77783203
- 872.222229 966.66674805 1061.11108398 1155.55541992 nan]
+[-- 494.44451904296875 588.888916015625 683.333251953125 777.77783203125
@cpelley replied:

I think this shows the point I was making above.

A member replied:

does it?

@cpelley replied:

About differentiating between nan and masked values, I think so.

@corinnebosley (Member) replied:

You are right about the point you are making, but I don't believe that this is the right time to be making it.

This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already. This method may not be ideal, but with dask having no support for masked values it is the best option we have.

We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions; this PR is not that space. This PR is just to review the last 10 commits, as @bjlittle pointed out in his first comment, so even though you are right, there is really nothing we can do about it now.

will return the cube's lazy dask array. Calling the cube's
:meth:`~iris.cube.Cube.core_data` method **will never realise** the cube's data.
* If a cube has real data, calling the cube's :meth:`~iris.cube.Cube.core_data` method
will return the cube's real NumPy array.
@cpelley commented Jun 13, 2017

While we're in such a related space, is it worth mentioning Cube.lazy_data() here, which will return a dask array regardless? (Unless this has changed or been removed; I haven't looked at anything outside this PR.) Is the reason for providing this property that converting a NumPy array into a dask array is much more expensive than it was with biggus arrays?

A member replied:

Yes. It may be worth including a mention of coord.lazy_data somewhere, which I will discuss with the dev team here, but this section is specifically about coord.core_data, which refers to the data's current state. This is therefore not the space to add an example of coord.lazy_data, which (as you say) will load a dask array regardless of the data's current state.

@cpelley commented Jun 13, 2017

Sorry, by 'related area' I meant under the parent section 'When does my data become real?'.
Reading the documentation I was expecting to see it discussed there, or at least referenced to another area of the docs.
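To illustrate the distinction being discussed, here is a sketch assuming the cube API described in this PR (the sample file is just an example):

import iris

cube = iris.load_cube(iris.sample_data_path('air_temp.pp'))

print(cube.has_lazy_data())     # True: the data is an unrealised dask array
current = cube.core_data()      # the dask array; does not realise the data
always_lazy = cube.lazy_data()  # a dask array regardless of current state

cube.data                       # touching .data realises it as a NumPy array
current = cube.core_data()      # now returns the real NumPy array
print(cube.has_lazy_data())     # False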

True
>>> print(aux_coord.has_bounds())
True
>>> print(aux_coord.has_lazy_bounds())
@cpelley commented Jun 13, 2017

No Coord.lazy_data()? What about Coord.core_data()?

A member replied:

What about it? What are you expecting to see here?

@cpelley commented Jun 13, 2017

"Iris cubes and coordinates have very similar interfaces, which extends to accessing
coordinates' lazy points and bounds"

I expect to see Coord.lazy_data() and Coord.coord_data() illustrated here if they do indeed do apply to Coordinates like they do with Cubes (and if not, to say so too).

@corinnebosley (Member) replied:

Yes. You are right, we should discuss those points in this section somewhere. Not in this PR though, as this is the mergeback of the feature branch for a pre-release candidate. But what I will do is add a link to this comment and the one above in the project ticket about final documentation so that we can include your suggestions in later revisions of the docs.

@bjlittle (Member, Author) replied:

@cpelley it does say very similar and not identical. As @corinnebosley states, we're going to iterate over the documentation (we know it's not complete or perfect), so this feedback is welcomed; that's why we're keen to make a pre-release candidate available asap.

@cpelley replied:

Thanks both, happy with that :)
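For completeness, a sketch of the coordinate side of that interface, assuming the bounds methods shown in the diff above plus their points counterparts (treat the exact names here as my assumption):

import dask.array as da
from iris.coords import AuxCoord

# An illustrative auxiliary coordinate built from lazy points and bounds.
points = da.arange(4, chunks=2)
bounds = da.stack([points - 0.5, points + 0.5], axis=-1)
aux_coord = AuxCoord(points, long_name='level', bounds=bounds)

print(aux_coord.has_lazy_points())  # True
print(aux_coord.has_lazy_bounds())  # True
pts = aux_coord.core_points()       # lazy or real, whichever is current
lazy_pts = aux_coord.lazy_points()  # a lazy array regardless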

@bjlittle mentioned this pull request, Jun 13, 2017
@bjlittle (Member, Author) commented Jun 13, 2017

dask.async.get_sync has moved to dask.local.get_sync as of dask 0.15 (released two days ago), so I need to address that in this PR to get the failing unit test to work. 0.15 issues a warning that causes one of our unit tests to fail: a win for unit testing 😉
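One possible way to cope with that move (a sketch; the actual fix in this PR may differ):

try:
    from dask.local import get_sync  # dask >= 0.15
except ImportError:
    # 'async' is a reserved word on modern Python, so import it dynamically.
    import importlib
    get_sync = importlib.import_module('dask.async').get_sync  # dask < 0.15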

@pp-mo (Member) left a review:

I see there is a lot still to discuss!
It sounds like some of the outstanding issues will be blockers for an actual release.
However, I'm happy to merge this as-is.

@cpelley commented Jun 14, 2017

@corinnebosley:

> You are right about the point you are making, but I don't believe that this is the right time to be making it.... This PR is to merge a feature branch into Iris which has been under construction for 4 months, and every decision has been discussed in great detail already...
>
> We have by no means kept development of the feature branch a secret, and there has been plenty of time and space for discussion of major implementation decisions, which is not in this PR. This is just to review the last 10 commits,...

I have taken the significance of this PR to be that it proposes to merge the dask branch into master, as well as these 10 commits. You are correct that I was unable to commit time to tracking the dask development over these past months, no...

@bjlittle
So here is a summary of what concerns me (highlighting my comments above):

  1. Iris no longer distinguishes between nan and masked values when loading data from disk with 'nan' values. I hope this one is simply a bug (comment).
  2. I'm not confident that casting arrays to float is a robust approach for supporting masks with dask (it could restrict the range of values representable in the original container, and at worst it might silently overflow if unchecked). This is not to say that I have a firm view either way, only that I'm not 'confident' based on what I know; and if there is a silent overflow, then this PR adds a potential bug (comment).
  3. The performance (memory and speed) characteristics of the dask mask-support implementation could have a significant impact on end users. Again, this ties into point 2: confidence is achieved through understanding the potential impact of the approach chosen (comment). This one has a raised issue, 'int to float promotion' #2602; thanks @bjlittle.

Cheers

@corinnebosley (Member) commented:

I am happy that all assigned reviewers have approved this PR, and all issues raised regarding implementation have been recorded on a ticket to be addressed in the next week or so. I would like to merge this now; if anyone has any objections to that please let me know ASAP, otherwise I'm going to push the button very soon.

@corinnebosley merged commit 6bd26b7 into SciTools:master, Jun 14, 2017
@rhattersley (Member) commented:

🎉 😀
