
Add support for remote string paths to h5netcdf engine #8424

Open · wants to merge 9 commits into main

Conversation


@jrbourbeau jrbourbeau commented Nov 7, 2023

cc @dcherian as you might find this interesting

@jrbourbeau

Also cc @kmuehlbauer

Comment on lines 94 to 101
if isinstance(filename, str) and is_remote_uri(filename):
    import fsspec

    mode_ = "rb" if mode == "r" else mode
    fs, _, _ = fsspec.get_fs_token_paths(
        filename, mode=mode_, storage_options=storage_options
    )
    filename = fs.open(filename, mode=mode_)

@dcherian dcherian Nov 8, 2023


Not an expert in this bit, but there's a _find_absolute_paths in backends/common.py that shares a lot of code with the first three lines here. It includes a nice error message if fsspec is not installed.

@kmuehlbauer

Do we really want to unconditionally open remote urls with fsspec? This contradicts the usage of native implementations (via "driver"-kwarg, see #8360) in h5py/hdf5.

@jrbourbeau

Do we really want to unconditionally open remote urls with fsspec?

My guess is yes by default. That's what other pydata libraries like pandas, Dask, Zarr, etc. have converged on for file handling, so it would be familiar to many users. That said, if driver= offers some benefits over fsspec (again, I'm not familiar with the new driver functionality) it'd be easy for these two approaches to live alongside each other:

  • Use fsspec by default if no driver is specified
  • If driver is specified, use that instead
  • If conflicting options are provided for some reason (e.g. driver= and storage_options=), raise an informative error
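That precedence could look roughly like this in the backend. This is only an illustrative sketch, not code from this PR; the helper name resolve_remote and its return shape are made up:

```python
# Illustrative sketch of the proposed precedence, not actual xarray code.
# `resolve_remote` is a hypothetical helper name.
def resolve_remote(filename, driver=None, storage_options=None):
    if driver is not None and storage_options is not None:
        # Conflicting options: driver= selects h5py's native remote IO,
        # while storage_options= belongs to the fsspec path.
        raise ValueError(
            "cannot combine driver= with storage_options=; "
            "they select different remote IO mechanisms"
        )
    if driver is not None:
        # Pass the URL straight through; h5py handles it (e.g. driver="ros3").
        return filename, {"driver": driver}
    # Default: open the remote URL with fsspec.
    import fsspec

    fs, _, _ = fsspec.get_fs_token_paths(
        filename, mode="rb", storage_options=storage_options or {}
    )
    return fs.open(filename, mode="rb"), {}
```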

@zequihg50

The algorithm shown here looks good to me. I also think fsspec is more widely used, although keeping the ros3 alternative is desirable too (see this).

@Illviljan

These changes indeed look very similar to _find_absolute_paths; can we instead get that one working for this case as well?

def _find_absolute_paths(
    paths: str | os.PathLike | NestedSequence[str | os.PathLike], **kwargs
) -> list[str]:
    """
    Find absolute paths from the pattern.

    Parameters
    ----------
    paths :
        Path(s) to file(s). Can include wildcards like * .
    **kwargs :
        Extra kwargs. Mainly for fsspec.

    Examples
    --------
    >>> from pathlib import Path
    >>> directory = Path(xr.backends.common.__file__).parent
    >>> paths = str(Path(directory).joinpath("comm*n.py"))  # Find common with wildcard
    >>> paths = xr.backends.common._find_absolute_paths(paths)
    >>> [Path(p).name for p in paths]
    ['common.py']
    """
    if isinstance(paths, str):
        if is_remote_uri(paths) and kwargs.get("engine", None) == "zarr":
            try:
                from fsspec.core import get_fs_token_paths
            except ImportError as e:
                raise ImportError(
                    "The use of remote URLs for opening zarr requires the package fsspec"
                ) from e

            fs, _, _ = get_fs_token_paths(
                paths,
                mode="rb",
                storage_options=kwargs.get("backend_kwargs", {}).get(
                    "storage_options", {}
                ),
                expand=False,
            )
            tmp_paths = fs.glob(fs._strip_protocol(paths))  # finds directories
            paths = [fs.get_mapper(path) for path in tmp_paths]
        elif is_remote_uri(paths):
            raise ValueError(
                "cannot do wild-card matching for paths that are remote URLs "
                f"unless engine='zarr' is specified. Got paths: {paths}. "
                "Instead, supply paths as an explicit list of strings."
            )
        else:
            paths = sorted(glob(_normalize_path(paths)))
    elif isinstance(paths, os.PathLike):
        paths = [os.fspath(paths)]
    else:
        paths = [os.fspath(p) if isinstance(p, os.PathLike) else p for p in paths]

    return paths


@kmuehlbauer kmuehlbauer left a comment


Thanks @jrbourbeau, but this will need some discussion.

The current implementation in this PR uses fsspec to open remote urls by default. See my other comment inline. This prevents other means of usage, like h5py/hdf5 native cloud access.

One solution would be to add some trigger-kwarg (use_fsspec, you name it) in the h5netcdf-backend to enable fsspec usage.

Also like to hear other @pydata/xarray input.

This should then also be coordinated with #8360, cc @zequihg50.

@jrbourbeau

Also like to hear other @pydata/xarray input.
This should then also be coordinated with #8360, cc @zequihg50.

+1 -- it'd be good to get thoughts from others. Yeah, let's definitely coordinate this PR and #8360 👍

@kmuehlbauer

@jrbourbeau Thanks for your inline comment. We've discussed this in the dev meeting today.

One question that came up: should http mean a remote access protocol or remote disk access? And: remote file access should require an explicit argument.

To make both groups of users happy (h5py-native via driver="ros3" and via fsspec), both should be equally possible. On one side we have the h5py driver="ros3" kwarg, which should override any default fsspec usage for remote urls. On the other side we might just use storage_options as a marker for intended fsspec usage, as you suggested in the inline comment.

If someone has other ideas for handling this, please suggest them here.

@kmuehlbauer

@jrbourbeau It might be good to get #8360 in first and add your changes on top. Might take a little time.

@dcherian

dcherian commented Nov 9, 2023

Is driver="fsspec" an option for explicit opt-in?

@kmuehlbauer

@dcherian That would work easiest for xarray. We would need to document properly that driver="fsspec" is an xarray implementation detail and is different from the h5py driver meanings.

@dcherian

dcherian commented Nov 9, 2023

How would we handle storage_options?

@kmuehlbauer

Additional kwarg?

@kmuehlbauer

@jrbourbeau We're getting close with #8360.

To not clutter the signature, would the following work for you?

ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec", driver_kwds=storage_options)

That way we can minimize the impact here in xarray while having reasonable naming. WDYT?

@jrbourbeau

Thanks @kmuehlbauer. I'm a little confused about the driver="fsspec" option. I thought we were going down the "use fsspec by default for remote paths, but also allow for driver= to use h5netcdf's remote file option" route outlined in #8424 (comment). @zequihg50 seemed to think this was a good approach too (#8424 (comment)).

I was thinking about an API along these lines:

# Use fsspec by default for cloud paths (this way we get coverage for s3, gcsfs, azure, etc.)
ds = xr.open_dataset(cloud_path, engine="h5netcdf")

# Specify `storage_options=` if needed
ds = xr.open_dataset(cloud_path, engine="h5netcdf", storage_options={...})

# To use h5netcdf's native remote file IO, specify `driver=`
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3")

# Or possibly `driver=` + `driver_kwds=` if needed
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", driver_kwds={...})

# Any mixture of the `fsspec` and `driver=` options raises an error
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", storage_options={...})
# -> ValueError("...")

@kmuehlbauer

kmuehlbauer commented Nov 14, 2023

Update: Please disregard the below and follow up by reading #8424 (comment).

@jrbourbeau I was taking @dcherian's suggestion of explicit opt-in into account (#8424 (comment)):

Is driver="fsspec" an option for explicit opt-in?

# Specify driver="fsspec" for cloud paths (this way we get coverage for s3, gcsfs, azure, etc.)
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec")
# Specify `storage_options=` if needed
storage_options = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="fsspec", driver_kwds=storage_options)
# use "ros3" native driver
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3")
# use "ros3" native driver with driver_kwds
driver_kwds = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver="ros3", driver_kwds=driver_kwds)

That would have minimal impact in the xarray code-base. It should be possible to make driver="fsspec" the default in the backend:

# Default driver="fsspec"
ds = xr.open_dataset(cloud_path, engine="h5netcdf")
# Default driver="fsspec" with storage_options
storage_options = {...}
ds = xr.open_dataset(cloud_path, engine="h5netcdf", driver_kwds=storage_options)

@kmuehlbauer

@jrbourbeau I've taken a step back and had a look at other built-in backends. Here is what I found:

  • storage_options is only used for the zarr-backend so far
  • we could make this work for the h5netcdf-backend too (see @Illviljan's comment inline, reusing some already-written code)

In light of that, it would definitely make sense to align with the current naming conventions and stick with storage_options for the fsspec path. So please disregard my suggestion and follow your outlined plan (#8424 (comment)).

@@ -164,6 +181,7 @@ def open(
     "invalid_netcdf": invalid_netcdf,
     "decode_vlen_strings": decode_vlen_strings,
     "driver": driver,
+    "storage_options": storage_options,

If we can obtain the fsspec first, we could simplify this significantly along those lines:

# get open fsspec-handle first
if storage_options is not None:
    filename = _find_absolute_paths(
        filename,
        engine="h5netcdf",
        backend_kwargs=dict(storage_options=storage_options),
    )

# other code

manager = CachingFileManager(h5netcdf.File, filename, mode=mode, kwargs=kwargs)

_find_absolute_paths would need changes to cover for this, though.

@kmuehlbauer

@jrbourbeau Sorry for letting this get out of focus. I've pushed a change with what I've had in mind with my comment inline above. Let's see how this works out.

@jrbourbeau

Sorry for letting this get out of focus

My bad -- I've been meaning to circle back to this PR after some time OOO for a while now

If we can obtain the fsspec first, we could simplify this significantly

Looking at your changes now. IIRC I was intentionally avoiding this for performance reasons: opening up thousands of files and serializing/deserializing them was much slower than just sending string filepaths and opening them across many workers on a cluster. It's been a while since I've run this code though, so I'll revisit it now.

@kmuehlbauer

@jrbourbeau Thanks, I thought that was low-hanging fruit. It's not; my bad. Looking at it now, I'd move the fs.open back into the h5netcdf backend. I can't say if that helps much performance-wise.

I'll be traveling the next days, so might not be that responsive.

5 participants