Add support for remote string paths to h5netcdf engine #8424
Also cc @kmuehlbauer
xarray/backends/h5netcdf_.py
Outdated
if isinstance(filename, str) and is_remote_uri(filename):
    import fsspec

    mode_ = "rb" if mode == "r" else mode
    fs, _, _ = fsspec.get_fs_token_paths(
        filename, mode=mode_, storage_options=storage_options
    )
    filename = fs.open(filename, mode=mode_)
Not an expert in this bit but there's a _find_absolute_paths in backends/common.py that shares a lot of code with the first three lines here. It includes a nice error message if fsspec is not installed.
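A guarded import of the kind described here, which fails with an informative message when an optional dependency is missing, might look like the minimal sketch below. The helper name `import_optional` and the message wording are hypothetical and are not taken from xarray.

```python
import importlib


def import_optional(module_name: str, purpose: str):
    """Import an optional dependency or fail with an actionable message.

    Hypothetical helper for illustration only; not part of xarray.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        # Re-raise with context so the user knows which package to install.
        raise ImportError(f"{purpose} requires the package {module_name}") from e


# Usage: the stdlib module "json" always imports successfully.
json_module = import_optional("json", "Parsing JSON")
```

In the PR's setting the same pattern would wrap the fsspec import, so that a missing install surfaces as a clear message rather than a bare ModuleNotFoundError.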
Do we really want to unconditionally open remote urls with fsspec? This contradicts the usage of native implementations (via "driver"-kwarg, see #8360) in h5py/hdf5.
Do we really want to unconditionally open remote urls with fsspec?
My guess is yes by default. That's what other pydata libraries like pandas, Dask, Zarr, etc. have converged on for file handling, so it would be familiar to many users. That said, if driver= offers some benefits over fsspec (again, I'm not familiar with the new driver functionality) it'd be easy for these two approaches to live alongside each other:
- Use fsspec by default if no driver is specified
- If driver is specified use that instead
- If there are conflicting options provided for some reason (e.g. driver= and storage_options=) then raise an informative error.
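The three rules above can be sketched as a small dispatch function. This is only an illustration of the proposal under discussion: `choose_remote_opener` is a hypothetical name, not xarray or h5netcdf API, and the error wording is an assumption.

```python
from typing import Optional


def choose_remote_opener(
    driver: Optional[str] = None,
    storage_options: Optional[dict] = None,
) -> str:
    """Return which mechanism should open a remote path.

    Hypothetical helper illustrating the proposed rules; not xarray API.
    """
    if driver is not None and storage_options is not None:
        # Conflicting options: fsspec-style and driver-style settings mixed.
        raise ValueError(
            "Cannot combine driver= with storage_options=; "
            "pass storage_options= for fsspec or driver= for native access."
        )
    if driver is not None:
        # An explicit driver (for example "ros3") takes precedence.
        return driver
    # No driver given: fall back to fsspec.
    return "fsspec"
```

With this shape, `choose_remote_opener()` and `choose_remote_opener(storage_options={})` both select fsspec, `choose_remote_opener(driver="ros3")` selects the native driver, and mixing the two raises.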
The algorithm shown here looks good to me. I also think fsspec is more used by users although keeping the ros3 alternative is desirable too (see this).
These changes indeed look very similar to _find_absolute_paths. Can we instead get that one working for this case as well?
xarray/xarray/backends/common.py
Lines 60 to 115 in 49bd63a
def _find_absolute_paths(
    paths: str | os.PathLike | NestedSequence[str | os.PathLike], **kwargs
) -> list[str]:
    """
    Find absolute paths from the pattern.

    Parameters
    ----------
    paths :
        Path(s) to file(s). Can include wildcards like * .
    **kwargs :
        Extra kwargs. Mainly for fsspec.

    Examples
    --------
    >>> from pathlib import Path
    >>> directory = Path(xr.backends.common.__file__).parent
    >>> paths = str(Path(directory).joinpath("comm*n.py"))  # Find common with wildcard
    >>> paths = xr.backends.common._find_absolute_paths(paths)
    >>> [Path(p).name for p in paths]
    ['common.py']
    """
    if isinstance(paths, str):
        if is_remote_uri(paths) and kwargs.get("engine", None) == "zarr":
            try:
                from fsspec.core import get_fs_token_paths
            except ImportError as e:
                raise ImportError(
                    "The use of remote URLs for opening zarr requires the package fsspec"
                ) from e

            fs, _, _ = get_fs_token_paths(
                paths,
                mode="rb",
                storage_options=kwargs.get("backend_kwargs", {}).get(
                    "storage_options", {}
                ),
                expand=False,
            )
            tmp_paths = fs.glob(fs._strip_protocol(paths))  # finds directories
            paths = [fs.get_mapper(path) for path in tmp_paths]
        elif is_remote_uri(paths):
            raise ValueError(
                "cannot do wild-card matching for paths that are remote URLs "
                f"unless engine='zarr' is specified. Got paths: {paths}. "
                "Instead, supply paths as an explicit list of strings."
            )
        else:
            paths = sorted(glob(_normalize_path(paths)))
    elif isinstance(paths, os.PathLike):
        paths = [os.fspath(paths)]
    else:
        paths = [os.fspath(p) if isinstance(p, os.PathLike) else p for p in paths]
    return paths
Thanks @jrbourbeau, but this will need some discussion.
The current implementation in this PR uses fsspec to open remote urls by default. See my other comment inline. This prevents other means of usage, like h5py/hdf5 native cloud access.
One solution would be to add some trigger-kwarg (use_fsspec, you name it) in the h5netcdf-backend to enable fsspec usage.
I'd also like to hear input from others in @pydata/xarray.
This should then also be coordinated with #8360, cc @zequihg50.
+1 -- it'd be good to get thoughts from others. Yeah, let's definitely coordinate this PR and #8360 👍
@jrbourbeau Thanks for your inline comment. We've discussed this in the dev meeting today. One question came up: Should To make both groups of users happy (h5py-native: There might be other possibilities of handling, please suggest here, if someone has some idea.
@jrbourbeau It might be good to get #8360 in first and add your changes on top. Might take a little time.
Is
@dcherian That would work easiest for xarray. We would need to document properly, that
How would we handle
Additional kwarg?
@jrbourbeau We're getting close with #8360. To not clutter the signature, would the following work for you?
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec", driver_kwds=storage_options)
That way we can minimize the impact here in xarray while having reasonable naming. WDYT?
Thanks @kmuehlbauer. I'm a little confused about the I was thinking about an API along these lines:
# Use fsspec by default for cloud paths (this way we get coverage for s3, gcsfs, azure, etc.)
ds = xr.open_dataset(cloud_path, engine="h5netcdf")
# Specify `storage_options=` if needed
ds = xr.open_dataset(cloud_path, engine="h5netcdf", storage_options={...})
# To use h5netcdf's native remote file IO, specify `driver=`
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3")
# Or possibly `driver=` + `driver_kwds=` if needed
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", driver_kwds={...})
# Any mixture of the `fsspec` and `driver=` options raises an error
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", storage_options={...})
ValueError("...")
Update: Please disregard the below and follow-up reading with #8424 (comment).
@jrbourbeau I was taking @dcherian's suggestion of explicit opt-in into account (#8424 (comment)):
# Specify driver="fsspec" for cloud paths (this way we get coverage for s3, gcsfs, azure, etc.)
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec")
# Specify `storage_options=` if needed
storage_options = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec", driver_kwds=storage_options)
# use "ros3" native driver
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3")
# use "ros3" native driver with driver_kwds
driver_kwds = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", driver_kwds=driver_kwds)
That would have minimal impact in the xarray code-base. It should be possible to make
# Default driver="fsspec"
ds = xr.open_dataset(cloud_path, engine="h5netcdf")
# Default driver="fsspec" with storage_options
storage_options = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver_kwds=storage_options)
@jrbourbeau I've taken a step back and had a look at other built-in backends. Here is what I found:
In light of that, it would definitely make sense to align with the current naming conventions and stick with
xarray/backends/h5netcdf_.py
Outdated
@@ -164,6 +181,7 @@ def open(
            "invalid_netcdf": invalid_netcdf,
            "decode_vlen_strings": decode_vlen_strings,
            "driver": driver,
            "storage_options": storage_options,
If we can obtain the fsspec handle first, we could simplify this significantly along those lines:
# get open fsspec-handle first
if storage_options is not None:
    filename = _find_absolute_paths(filename, engine="h5netcdf", backend_kwargs=dict(storage_options=storage_options))

# other code

manager = CachingFileManager(
    h5netcdf.File, filename, mode=mode, kwargs=kwargs
)
_find_absolute_paths would need changes to cover for this, though.
@jrbourbeau Sorry for letting this get out of focus. I've pushed a change with what I've had in mind with my comment inline above. Let's see how this works out.
My bad -- I've been meaning to circle back to this PR after some time OOO for a while now
Looking at your changes now. IIRC I was intentionally avoiding this for performance reasons. Opening up thousands of files and serializing/deserializing them was much slower than just sending string filepaths and opening across many workers on a cluster. It's been a while since I've run this code though, I'll revisit it now.
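The pattern described here, shipping plain string paths to workers and opening each file on the worker side rather than sending already-opened file objects, can be sketched with the standard library. This is a hedged illustration only: local temporary files and a thread pool stand in for remote URLs and a cluster, and `read_first_bytes` is a hypothetical name.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def read_first_bytes(path: str, n: int = 4) -> bytes:
    """Open the file on the worker side; only the string path is shipped."""
    with open(path, "rb") as f:
        return f.read(n)


# Create a few small files to stand in for remote datasets.
tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmpdir, f"file{i}.bin")
    with open(path, "wb") as f:
        f.write(b"data" + bytes([i]))
    paths.append(path)

# String paths survive a serialization round trip unchanged and cheaply.
roundtripped = pickle.loads(pickle.dumps(paths))

# Each worker receives a path and opens the file itself.
with ThreadPoolExecutor(max_workers=2) as pool:
    headers = list(pool.map(read_first_bytes, roundtripped))
```

The sketch does not measure anything; it only shows the shape of the approach in which the open call happens where the data is consumed.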
@jrbourbeau Thanks, I thought that was low-hanging fruit. It's not, my bad. Looking at that now, I'd move the fs.open back into the h5netcdf backend. Can't say if that helps much, performance-wise. I'll be traveling for the next few days, so I might not be that responsive.
h5netcdf engine #8423
whats-new.rst
api.rst
cc @dcherian as you might find this interesting