Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use dask.delayed instead of the distributed client #222

Merged
merged 21 commits into from
Jan 21, 2020
Merged

Use dask.delayed instead of the distributed client #222

merged 21 commits into from
Jan 21, 2020

Conversation

leouieda
Copy link
Member

Started work on using dask.delayed instead of the futures interface
(client.submit). It's easier and allows building the entire graph lazily
before executing. Still need tests and to port the SplineCV code to
this. Will deprecate the client argument and remove in 2.0.0.

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst and verde/__init__.py.
  • Write detailed docstrings for all functions/classes/methods.
  • If adding new functionality, add an example to the docstring, gallery, and/or tutorials.

Started work on using dask.delayed instead of the futures interface
(client.submit). It's easier and allows building the entire graph lazily
before executing. Still need tests and to port the SplineCV code to
this. Will deprecate the client argument and remove in 2.0.0.
@leouieda
Copy link
Member Author

Having Dask be optional is causing headaches:

  1. Can't use it in the documentation AND have builds that don't install Dask to test that it's really optinal. Since we build the docs always, there is no easy way to have both.
  2. Need to wrap all calls to dask in functions that raise exceptions when it's not installed. This increases the amount of tests for each of these functions and is a bit tedious to write.

Dask is a really lightweight dependency (pure Python with minimal requirements) and I would like to add more support for it in the future (to enable parallel fitting and larger than RAM Jacobians).

I'm thinking we should just adopt Dask as a dependency. Scikit-learn is a much heavier dependency and we already have it. This is nothing compared to GDAL and the geospatial stack.

@santisoler @jessepisel any thoughts/objections on this?

@santisoler
Copy link
Member

I'm seeing Dask in the future of Verde, specially for solving problems that include larger-than-memory arrays. So I think we could add it as a dependency right away and save the headaches of writing annoying lines on test functions and struggling with building the docs.
It's lightweight, so I don't see why we shouldn't add it as a dependency.

@leouieda
Copy link
Member Author

👍 alright, I'll try to get this one finished soon. Thanks, @santisoler

@jessepisel
Copy link
Contributor

I agree with @santisoler that Dask should be a dependency. It will be easier to deprecate the futures interface and not have to write new tests for both options. This will be nice to have for big grids that don't fit in memory @leouieda !

@leouieda
Copy link
Member Author

Thanks for the input @jessepisel and @santisoler 👍

This will be nice to have for big grids that don't fit in memory @leouieda !

I'm still struggling with ways of doing this. The dask-ml package has some optimization methods for linear problems that we could use to run the least-squares fit. But it's not stable enough to be a dependency. Right now, solving the system with the linear algebra is dask is bad because there are a lot of limitations (need square chunks, for one). That's where we should start looking at pylops. Matteo is working on a distributed version of that using dask. I'm eager to play around with it and see if we can make it work.

@leouieda
Copy link
Member Author

Another option is to break up the fit into windows and run each window separately (using dask).

@leouieda leouieda changed the title WIP Setup dask.delayed instead of futures Use dask.delayed instead of the distributed client Jan 20, 2020
@leouieda
Copy link
Member Author

OK, I added the delayed argument to the tutorials as well and made Dask a dependency. @santisoler and/or @jessepisel could you please review this PR?

@leouieda leouieda added this to the v1.3.0 milestone Jan 20, 2020
Copy link
Contributor

@jessepisel jessepisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the dask conversion @leouieda. I went through and reviewed all the changes. The conf.py, install.rst, environment.yml, requirements-dev.txt, requirements.txt, and setup.py look good to go.

The tutorials look really good. They have a nice logical flow and explain the difference between delayed dask computation and serial computation. I added some comments here and there for clarity and grammar, but overall I think they work really well.

As for the bulk of the changes in the rest of the package, I looked through them and I think you got all the deprecation warnings covered.

The new tests for test_model_selection.py should work nicely for the model selection. In the test for test_spline.py the mindists changed from 1e-5 to 1e6, is there any reason why you changed the minimum distances to fit by such a large value? Is it ensuring that the 1e-7 is returned for the test? That's all I noticed for the tests.

I think this looks great. The builds are passing and I think there is enough documentation and warnings for users to make the transition to the dask implementation in verde.

tutorials/model_evaluation.py Show resolved Hide resolved
# :class:`~verde.Spline` aren't optimal for this dataset. We could try
# different combinations manually until we get a good score. A better way is to
# do this automatically. In :ref:`model_selection` we'll go over how to do just
# that.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the note is appropriate for reminding users that it will be memory intensive. The last bit on improving the score is a great way to transition to model_selection.

tutorials/model_selection.py Outdated Show resolved Hide resolved
tutorials/model_selection.py Outdated Show resolved Hide resolved
verde/spline.py Outdated Show resolved Hide resolved
verde/utils.py Outdated Show resolved Hide resolved
Co-Authored-By: Jesse Pisel <jessepisel@users.noreply.github.com>
@leouieda
Copy link
Member Author

Thanks for the review @jessepisel!

is there any reason why you changed the minimum distances to fit by such a large value? Is it ensuring that the 1e-7 is returned for the test? That's all I noticed for the tests.

That is exactly it. I just wanted to make it easier for the CV by giving it obviously bad values and only 1 good combination. This will avoid future headaches due to floating point round-off.

@leouieda leouieda merged commit f680661 into master Jan 21, 2020
@leouieda leouieda deleted the delayed branch January 21, 2020 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants