Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to Dask page in Xarray docs #9495

Merged
merged 11 commits into from
Nov 9, 2024
Merged

Conversation

scharlottej13
Copy link
Contributor

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Hi! It's been a while since the Dask docs page had a comprehensive update, so thought I'd open a PR with some suggestions.

I chatted a little bit about updating this page with @jhamman and @dcherian, some of the things I was trying to keep in mind:

  • Include more examples
  • Organize information around what the user is trying to do
  • Update outdated info

There are certainly a lot of things I wasn't sure about (I'm certainly not an xarray expert). I'll comment in-line with questions.

cc @phofl @jrbourbeau

Copy link

welcome bot commented Sep 14, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.


ds.to_netcdf("manipulated-example-data.nc")
When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see anything in the release notes about a fix/update for this, so I'm assuming it's still true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to the GitHub issue to track this one?

Copy link
Contributor Author

@scharlottej13 scharlottej13 Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a corresponding open issue, but looks like Joe added it in #1793.

I noticed https://github.com/pydata/xarray/issues/1836 is referenced, but I think this is a slightly different issue?

Copy link
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scharlottej13 for working on this! Some parts of this sorely needed an update. I've left a few comments.

Also FYI there is also https://tutorial.xarray.dev/intermediate/xarray_and_dask.html to be aware of.

doc/user-guide/dask.rst Show resolved Hide resolved
doc/user-guide/dask.rst Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
Comment on lines 93 to 105
.. tab:: HDF5

from dask.diagnostics import ProgressBar
Open HDF5 files with :py:func:`~xarray.open_dataset`::

# or distributed.progress when using the distributed scheduler
delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False)
with ProgressBar():
results = delayed_obj.compute()
xr.open_dataset("/path/to/my/file.h5", chunks='auto')

.. ipython:: python
:suppress:
See :ref:`io.hdf5` for more details.

os.remove("manipulated-example-data.nc") # Was not opened.
.. tab:: GeoTIFF

.. note::
Open large geoTIFF files with rioxarray::

When using Dask's distributed scheduler to write NETCDF4 files,
it may be necessary to set the environment variable `HDF5_USE_FILE_LOCKING=FALSE`
to avoid competing locks within the HDF5 SWMR file locking scheme. Note that
writing netCDF files with Dask's distributed scheduler is only supported for
the `netcdf4` backend.
xds = rioxarray.open_rasterio("my-satellite-image.tif", chunks='auto')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xarray has dedicated documentation on IO for different formats. Unless these functions behaves differently when used with dask than one would expect from the above explanation, I suggest either moving these examples to the dedicated IO docs page or leaving them out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean this page https://docs.xarray.dev/en/stable/user-guide/io.html, right?

When I went through those sections they seemed more focused on the file format than Dask, so it seemed like adding in Dask-specific snippets or explanations would be overly complicated for people who don't need to use Dask.

My thought with adding tabs to the Dask page is they quickly show which file formats you can use with Dask, a simple example, and then link out to different sections of https://docs.xarray.dev/en/stable/user-guide/io.html.

Definitely defer to what you think is a better fit for the Xarray docs, though!

doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Show resolved Hide resolved
Comment on lines 457 to 458
from flox.xarray import xarray_reduce
import xarray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example seem unconnected to the surrounding text?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was wondering if this made sense.

I was trying to have the code snippet be a demonstrative, simple example of putting some of the optimization tips together. I swapped the order and added a sentence of explanation. Maybe that's better? Open to other ideas!

Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @scharlottej13 ! this definitely needed an update :)

doc/user-guide/dask.rst Outdated Show resolved Hide resolved

ds.to_netcdf("manipulated-example-data.nc")
When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a link to the GitHub issue to track this one?

Comment on lines 186 to 147
*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
To do this, you can use :py:meth:`~xarray.Dataset.load`, which is similar to :py:meth:`~xarray.Dataset.compute`, but instead changes results in-place:

.. ipython:: python

ds.load()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's emphasize .compute() (and .persist() instead of .load() (which is a strange method that modifies an xarray object in place)

Copy link
Contributor Author

@scharlottej13 scharlottej13 Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to mention load() at all? I added a note that it's recommended people use compute() instead

doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
doc/user-guide/dask.rst Outdated Show resolved Hide resolved
scharlottej13 and others added 4 commits September 16, 2024 13:51
Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Stephan Hoyer <shoyer@google.com>
@scharlottej13
Copy link
Contributor Author

@shoyer @TomNicholas thank you both for the review!

I believe I've addressed all of your comments, but please let me know if I missed anything.

@scharlottej13
Copy link
Contributor Author

@shoyer @TomNicholas gentle ping here, thanks again for the thorough review!

@scharlottej13
Copy link
Contributor Author

@shoyer @TomNicholas another gentle ping here, thanks again!

* main: (125 commits)
  http:// → https:// (pydata#9748)
  Discard useless `!s` conversion in f-string (pydata#9752)
  Apply ruff/flake8-simplify rule SIM401 (pydata#9749)
  Use micromamba 1.5.10 where conda is needed (pydata#9737)
  pin array-api-strict<=2.1 (pydata#9751)
  Reorganise ruff rules (pydata#9738)
  use new conda-forge package pydap-server (pydata#9741)
  Enforce ruff/flake8-pie rules (PIE) (pydata#9740)
  Enforce ruff/flake8-comprehensions rules (C4) (pydata#9724)
  Enforce ruff/Perflint rules (PERF)  (pydata#9730)
  Apply ruff rule RUF007 (pydata#9739)
  chmod -x (pydata#9725)
  Aplpy ruff rules (RUF) (pydata#9731)
  Fix typos found by codespell (pydata#9721)
  support for additional scipy nd interpolants  (pydata#9599)
  Apply ruff/flake8-simplify rules (SIM) (pydata#9727)
  Apply ruff/flake8-implicit-str-concat rules (ISC) (pydata#9722)
  Apply ruff/flake8-pie rules (PIE) (pydata#9726)
  Enforce ruff/pygrep-hooks rules (PGH) (pydata#9729)
  Move to micromamba 2 (pydata#9732)
  ...
@dcherian
Copy link
Contributor

dcherian commented Nov 9, 2024

Sorry for the really long delay here @scharlottej13

@dcherian dcherian merged commit 2619c0b into pydata:main Nov 9, 2024
27 of 29 checks passed
Copy link

welcome bot commented Nov 9, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@scharlottej13
Copy link
Contributor Author

Thank you @dcherian!!

dcherian added a commit to dcherian/xarray that referenced this pull request Nov 19, 2024
* main:
  fix html repr indexes section (pydata#9768)
  Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (pydata#9763)
  unpin array-api-strict, as issues are resolved upstream (pydata#9762)
  rewrite the `min_deps_check` script (pydata#9754)
  CI runs ruff instead of pep8speaks (pydata#9759)
  Specify copyright holders in main license file (pydata#9756)
  Compress PNG files (pydata#9747)
  Dispatch to Dask if nanquantile is available (pydata#9719)
  Updates to Dask page in Xarray docs (pydata#9495)
  http:// → https:// (pydata#9748)
  Discard useless `!s` conversion in f-string (pydata#9752)
  Apply ruff/flake8-simplify rule SIM401 (pydata#9749)
  Use micromamba 1.5.10 where conda is needed (pydata#9737)
  pin array-api-strict<=2.1 (pydata#9751)
  Reorganise ruff rules (pydata#9738)
  use new conda-forge package pydap-server (pydata#9741)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants