Updates to Dask page in Xarray docs #9495

scharlottej13 · 2024-09-14T01:17:26Z

Closes #xxxx
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

Hi! It's been a while since the Dask docs page had a comprehensive update, so I thought I'd open a PR with some suggestions.

I chatted a little about updating this page with @jhamman and @dcherian. Some of the things I tried to keep in mind:

- Include more examples
- Organize information around what the user is trying to do
- Update outdated info

There are a lot of things I wasn't sure about (I'm certainly not an xarray expert), so I'll comment inline with questions.

cc @phofl @jrbourbeau

welcome · 2024-09-14T01:17:29Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

scharlottej13 · 2024-09-14T01:19:21Z

doc/user-guide/dask.rst


-    ds.to_netcdf("manipulated-example-data.nc")
+        When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.


I didn't see anything in the release notes about a fix/update for this, so I'm assuming it's still true.

Can you add a link to the GitHub issue to track this one?

I couldn't find a corresponding open issue, but it looks like Joe added it in #1793.

I noticed https://github.com/pydata/xarray/issues/1836 is referenced, but I think this is a slightly different issue?
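
For context, the note in the diff above boils down to setting one environment variable. A minimal sketch follows, assuming the variable has to be in place before the HDF5 library is first loaded (a common recommendation, but not something confirmed in this thread). The helper `hdf5_locking_disabled` is hypothetical and only exists to make the check explicit.

```python
import os

# Assumption: set this before importing netCDF4/h5py (or opening any file),
# so the HDF5 library sees it when it initializes.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"


def hdf5_locking_disabled() -> bool:
    """Hypothetical helper: report whether file locking was switched off."""
    return os.environ.get("HDF5_USE_FILE_LOCKING", "").upper() == "FALSE"


# Only after this point would one import xarray and call ds.to_netcdf(...).
print(hdf5_locking_disabled())
```

With a distributed cluster, the variable would presumably also need to be set in each worker's environment. How to do that depends on the deployment and is not covered here.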

TomNicholas

Thank you @scharlottej13 for working on this! Some parts of this sorely needed an update. I've left a few comments.

Also FYI there is also https://tutorial.xarray.dev/intermediate/xarray_and_dask.html to be aware of.

TomNicholas · 2024-09-14T02:17:14Z

doc/user-guide/dask.rst

+.. tab:: HDF5

-    from dask.diagnostics import ProgressBar
+    Open HDF5 files with :py:func:`~xarray.open_dataset`::

-    # or distributed.progress when using the distributed scheduler
-    delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False)
-    with ProgressBar():
-        results = delayed_obj.compute()
+        xr.open_dataset("/path/to/my/file.h5", chunks='auto')

-.. ipython:: python
-    :suppress:
+    See :ref:`io.hdf5` for more details.

-    os.remove("manipulated-example-data.nc")  # Was not opened.
+.. tab:: GeoTIFF

-.. note::
+    Open large geoTIFF files with rioxarray::

-    When using Dask's distributed scheduler to write NETCDF4 files,
-    it may be necessary to set the environment variable `HDF5_USE_FILE_LOCKING=FALSE`
-    to avoid competing locks within the HDF5 SWMR file locking scheme. Note that
-    writing netCDF files with Dask's distributed scheduler is only supported for
-    the `netcdf4` backend.
+        xds = rioxarray.open_rasterio("my-satellite-image.tif", chunks='auto')


Xarray has dedicated documentation on IO for different formats. Unless these functions behave differently when used with dask than one would expect from the above explanation, I suggest either moving these examples to the dedicated IO docs page or leaving them out.

You mean this page https://docs.xarray.dev/en/stable/user-guide/io.html, right?

When I went through those sections they seemed more focused on the file format than Dask, so it seemed like adding in Dask-specific snippets or explanations would be overly complicated for people who don't need to use Dask.

My thought with adding tabs to the Dask page is that they quickly show which file formats you can use with Dask, give a simple example, and then link out to the relevant sections of https://docs.xarray.dev/en/stable/user-guide/io.html.

Definitely defer to what you think is a better fit for the Xarray docs, though!

TomNicholas · 2024-09-14T02:26:43Z

doc/user-guide/dask.rst

+    from flox.xarray import xarray_reduce
+    import xarray


This example seems unconnected to the surrounding text?

Yeah, I was wondering if this made sense.

I was trying to have the code snippet be a demonstrative, simple example of putting some of the optimization tips together. I swapped the order and added a sentence of explanation. Maybe that's better? Open to other ideas!

shoyer

thanks @scharlottej13 ! this definitely needed an update :)

shoyer · 2024-09-16T16:10:37Z

doc/user-guide/dask.rst


-    ds.to_netcdf("manipulated-example-data.nc")
+        When using Dask’s distributed scheduler to write NETCDF4 files, it may be necessary to set the environment variable ``HDF5_USE_FILE_LOCKING=FALSE`` to avoid competing locks within the HDF5 SWMR file locking scheme. Note that writing netCDF files with Dask’s distributed scheduler is only supported for the netcdf4 backend.


Can you add a link to the GitHub issue to track this one?

shoyer · 2024-09-16T16:12:33Z

doc/user-guide/dask.rst

-*eager*, in-memory NumPy arrays is to use the :py:meth:`~xarray.Dataset.load` method:
+To do this, you can use :py:meth:`~xarray.Dataset.load`, which is similar to :py:meth:`~xarray.Dataset.compute`, but instead changes results in-place:

 .. ipython:: python

    ds.load()


Let's emphasize `.compute()` (and `.persist()`) instead of `.load()` (which is a strange method that modifies an xarray object in place).

Is there a reason to mention `load()` at all? I added a note that it's recommended people use `compute()` instead.
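
To make the distinction in this thread concrete, here is a small sketch of how `compute()` and `load()` differ. It assumes `numpy`, `xarray`, and `dask` are installed. The dataset and variable names (`ds`, `t`) are made up for illustration and do not come from the docs page.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset; .chunk() turns the variable into a lazy Dask array.
ds = xr.Dataset({"t": ("x", np.arange(6.0))}).chunk({"x": 3})

# compute() returns a new in-memory Dataset and leaves `ds` lazy.
computed = ds.compute()
ds_is_lazy_after_compute = ds["t"].chunks is not None
computed_is_lazy = computed["t"].chunks is not None

# load() evaluates the data and replaces it inside `ds` itself;
# the returned object is the same Dataset, not a copy.
loaded = ds.load()
ds_is_lazy_after_load = ds["t"].chunks is not None
```

Here `.chunks` is used as a simple indicator: it is `None` once a variable is backed by a NumPy array rather than a Dask array.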

scharlottej13 · 2024-09-16T23:26:07Z

@shoyer @TomNicholas thank you both for the review!

I believe I've addressed all of your comments, but please let me know if I missed anything.

scharlottej13 · 2024-09-26T00:27:49Z

@shoyer @TomNicholas gentle ping here, thanks again for the thorough review!

scharlottej13 · 2024-10-28T23:42:54Z

@shoyer @TomNicholas another gentle ping here, thanks again!

dcherian · 2024-11-09T01:38:50Z

Sorry for the really long delay here @scharlottej13

welcome · 2024-11-09T01:38:58Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

scharlottej13 · 2024-11-09T01:44:36Z

Thank you @dcherian!!

scharlottej13 added 2 commits September 13, 2024 18:02

First pass at updates to Dask page in Xarray docs

b67cbe1

cleanup

fb51f80

[pre-commit.ci] auto fixes from pre-commit.com hooks

4727192

for more information, see https://pre-commit.ci

scharlottej13 added 2 commits September 13, 2024 18:37

Fix internal references

0305f22

Merge branch 'dask-docs' of github.com:scharlottej13/xarray into dask…

4be12f4

…-docs

TomNicholas added topic-documentation topic-dask labels Sep 14, 2024

TomNicholas reviewed Sep 14, 2024

shoyer reviewed Sep 16, 2024

scharlottej13 and others added 4 commits September 16, 2024 13:51

Apply suggestions from code review

f3f85e8

Co-authored-by: Tom Nicholas <tom@cworthy.org> Co-authored-by: Stephan Hoyer <shoyer@google.com>

Update Dask Array image

289ee75

Updates after review

ce09268

[pre-commit.ci] auto fixes from pre-commit.com hooks

b690248

dcherian added 2 commits November 8, 2024 17:44

Some edits

a7c03e5

dcherian merged commit 2619c0b into pydata:main Nov 9, 2024
27 of 29 checks passed
