Check for path-like objects rather than Path type, use os.fspath #5879
Conversation
Thanks a lot @mwtoews! Would you like to add a note to whatsnew? Re the path-like vs file-like distinction, it'd be great to clarify that in the docs; it's fine to do so in another PR if you prefer.
@max-sixty whats-new entry added; check that the paragraph reads OK. I'll hold off clarifying file-like vs path-like in the docs for now, but will consider a docs intersphinx link at some point.
Thanks @mwtoews!
* main:
  * Add typing_extensions as a required dependency (pydata#5911)
  * pydata#5740 follow up: supress xr.ufunc warnings in tests (pydata#5914)
  * Avoid accessing slow .data in unstack (pydata#5906)
  * Add wradlib to ecosystem in docs (pydata#5915)
  * Use .to_numpy() for quantified facetgrids (pydata#5886)
  * [test-upstream] fix pd skipna=None (pydata#5899)
  * Add var and std to weighted computations (pydata#5870)
  * Check for path-like objects rather than Path type, use os.fspath (pydata#5879)
  * Handle single `PathLike` objects in `open_mfdataset()` (pydata#5884)
Check for path-like objects rather than Path type, use os.fspath (pydata#5879)
* Check for path-like objects rather than Path type, use os.fspath
* Add whats-new entry

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
Note that `os.PathLike`'s subclass hook accepts any class that defines `__fspath__`:

```python
@classmethod
def __subclasshook__(cls, subclass):
    if cls is PathLike:
        return _check_methods(subclass, '__fspath__')
    return NotImplemented
```
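A quick illustration of the consequence (the `FakePath` class here is hypothetical, purely for demonstration): any object that defines `__fspath__` passes an `isinstance(..., os.PathLike)` check, with no subclassing or registration required.

```python
import os

class FakePath:
    """Hypothetical class: not a PathLike subclass, but defines __fspath__."""
    def __fspath__(self):
        return "/tmp/example.nc"

p = FakePath()
print(isinstance(p, os.PathLike))  # True, via the __subclasshook__ above
print(os.fspath(p))                # '/tmp/example.nc'
```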
@gjoseph92 I'm less informed than most on this; do you have an example of a case that is now confusing? Thank you!
@martindurant exactly. This generally means you can't pass s3fs/gcsfs files into `open_dataset` anymore:

```python
In [32]: xr.open_dataset("s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr", engine="zarr")
Out[32]:
<xarray.Dataset>
Dimensions: (feature_id: 5783, time: 367439)
Coordinates:
* feature_id (feature_id) int32 491 531 747 ... 947070204 1021092845
latitude (feature_id) float32 ...
longitude (feature_id) float32 ...
* time (time) datetime64[ns] 1979-02-01T01:00:00 ... 2020-12-31T...
Data variables:
crs |S1 ...
inflow (time, feature_id) float64 ...
outflow (time, feature_id) float64 ...
water_sfc_elev (time, feature_id) float32 ...
Attributes:
Conventions: CF-1.6
TITLE: OUTPUT FROM WRF-Hydro v5.2.0-beta2
code_version: v5.2.0-beta2
featureType: timeSeries
model_configuration: retrospective
model_output_type: reservoir
proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000....
reservoir_assimilated_value: Assimilation not performed
reservoir_type: 1 = level pool everywhere
station_dimension: lake_id
In [33]: xr.open_dataset(fsspec.open("s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr"), engine="zarr")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-33-76e10d75e2c2> in <module>
----> 1 xr.open_dataset(fsspec.open("s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr"), engine="zarr")
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
493
494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
496 filename_or_obj,
497 drop_variables=drop_variables,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/zarr.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, group, mode, synchronizer, consolidated, chunk_store, storage_options, stacklevel)
797 ):
798
--> 799 filename_or_obj = _normalize_path(filename_or_obj)
800 store = ZarrStore.open_group(
801 filename_or_obj,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/common.py in _normalize_path(path)
21 def _normalize_path(path):
22 if isinstance(path, os.PathLike):
---> 23 path = os.fspath(path)
24
25 if isinstance(path, str) and not is_remote_uri(path):
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/core.py in __fspath__(self)
96 def __fspath__(self):
97 # may raise if cannot be resolved to local file
---> 98 return self.open().__fspath__()
99
100 def __enter__(self):
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/core.py in open(self)
138 been deleted; but a with-context is better style.
139 """
--> 140 out = self.__enter__()
141 closer = out.close
142 fobjects = self.fobjects.copy()[:-1]
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/core.py in __enter__(self)
101 mode = self.mode.replace("t", "").replace("b", "") + "b"
102
--> 103 f = self.fs.open(self.path, mode=mode)
104
105 self.fobjects = [f]
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/spec.py in open(self, path, mode, block_size, cache_options, compression, **kwargs)
1007 else:
1008 ac = kwargs.pop("autocommit", not self._intrans)
-> 1009 f = self._open(
1010 path,
1011 mode=mode,
~/dev/dask-playground/env/lib/python3.9/site-packages/s3fs/core.py in _open(self, path, mode, block_size, acl, version_id, fill_cache, cache_type, autocommit, requester_pays, **kwargs)
532 cache_type = self.default_cache_type
533
--> 534 return S3File(
535 self,
536 path,
~/dev/dask-playground/env/lib/python3.9/site-packages/s3fs/core.py in __init__(self, s3, path, mode, block_size, acl, version_id, fill_cache, s3_additional_kwargs, autocommit, cache_type, requester_pays)
1824
1825 if "r" in mode:
-> 1826 self.req_kw["IfMatch"] = self.details["ETag"]
1827
1828 def _call_s3(self, method, *kwarglist, **kwargs):
KeyError: 'ETag'
```
Thanks @gjoseph92, that makes sense. Do you know whether there's a standard approach that works for these? I would expect xarray's needs to be fairly standard for a "take something path-like" function.
"s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr" is a directory, right? You cannot open that as a file, or maybe there is no equivalent key at all (because s3 is magic like that). To make a bare mapper (i.e., dict-like):
or you could use zarr's FSMapper meant specifically for this job. |
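A sketch of that mapper approach, reusing the bucket from the example above. `fsspec.get_mapper` returns a dict-like `FSMap`; whether `anon=True` is needed depends on the bucket's access policy (an assumption here).

```python
import fsspec
import xarray as xr

# Dict-like view of the zarr directory; keys are object names, values are bytes.
mapper = fsspec.get_mapper(
    "s3://noaa-nwm-retrospective-2-1-zarr-pds/lakeout.zarr", anon=True
)
ds = xr.open_dataset(mapper, engine="zarr")
```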
Yeah, correct. I oversimplified this from the problem I actually cared about, since of course zarr is not a single file that can be opened. Here's a more illustrative example:

```python
In [1]: import xarray as xr
In [2]: import fsspec
In [3]: import os
In [4]: url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp" # a netCDF file in s3
In [5]: f = fsspec.open(url)
In [6]: f
Out[6]: <OpenFile 'noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp'>
In [7]: isinstance(f, os.PathLike)
Out[7]: True
In [8]: s3f = f.open()
In [9]: s3f
Out[9]: <File-like object S3FileSystem, noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp>
In [10]: isinstance(s3f, os.PathLike)
Out[10]: False
In [11]: ds = xr.open_dataset(s3f, engine='h5netcdf')
In [12]: ds
Out[12]:
<xarray.Dataset>
Dimensions: (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
* time (time) datetime64[ns] 1979-02-01T01:00:00
* reference_time (reference_time) datetime64[ns] 1979-02-01
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
latitude (feature_id) float32 ...
longitude (feature_id) float32 ...
Data variables:
crs |S1 ...
order (feature_id) int32 ...
elevation (feature_id) float32 ...
streamflow (feature_id) float64 ...
q_lateral (feature_id) float64 ...
velocity (feature_id) float64 ...
qSfcLatRunoff (feature_id) float64 ...
qBucket (feature_id) float64 ...
qBtmVertRunoff (feature_id) float64 ...
Attributes: (12/18)
TITLE: OUTPUT FROM WRF-Hydro v5.2.0-beta2
featureType: timeSeries
proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
model_initialization_time: 1979-02-01_00:00:00
station_dimension: feature_id
model_output_valid_time: 1979-02-01_01:00:00
... ...
model_configuration: retrospective
dev_OVRTSWCRT: 1
dev_NOAH_TIMESTEP: 3600
dev_channel_only: 0
dev_channelBucket_only: 0
dev: dev_ prefix indicates development/internal me...
In [13]: ds = xr.open_dataset(f, engine='h5netcdf')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-de834ca911b4> in <module>
----> 1 ds = xr.open_dataset(f, engine='h5netcdf')
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
493
494 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 495 backend_ds = backend.open_dataset(
496 filename_or_obj,
497 drop_variables=drop_variables,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
384 ):
385
--> 386 filename_or_obj = _normalize_path(filename_or_obj)
387 store = H5NetCDFStore.open(
388 filename_or_obj,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/common.py in _normalize_path(path)
21 def _normalize_path(path):
22 if isinstance(path, os.PathLike):
---> 23 path = os.fspath(path)
24
25 if isinstance(path, str) and not is_remote_uri(path):
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/core.py in __fspath__(self)
96 def __fspath__(self):
97 # may raise if cannot be resolved to local file
---> 98 return self.open().__fspath__()
99
100 def __enter__(self):
AttributeError: 'S3File' object has no attribute '__fspath__'
```

Because the plain `OpenFile` defines `__fspath__`, it counts as `os.PathLike`, so xarray calls `os.fspath` on it, which fails for a remote file. Because the `S3File` returned by `f.open()` is file-like but not path-like, it passes through untouched and works.

Note though that if I downgrade xarray to 0.19.0 (the last version before this PR was merged), I still can't use the plain `fsspec.OpenFile` object successfully. It's not xarray's fault anymore (it gets passed all the way into h5netcdf), but h5netcdf also tries to call `fspath` on the `OpenFile`, which fails in the same way.

```python
In [1]: import xarray as xr
In [2]: import fsspec
In [3]: xr.__version__
Out[3]: '0.19.0'
In [4]: url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp" # a netCDF file in s3
In [5]: f = fsspec.open(url)
In [6]: xr.open_dataset(f.open(), engine="h5netcdf")
Out[6]:
<xarray.Dataset>
Dimensions: (time: 1, reference_time: 1, feature_id: 2776738)
Coordinates:
* time (time) datetime64[ns] 1979-02-01T01:00:00
* reference_time (reference_time) datetime64[ns] 1979-02-01
* feature_id (feature_id) int32 101 179 181 ... 1180001803 1180001804
latitude (feature_id) float32 ...
longitude (feature_id) float32 ...
Data variables:
crs |S1 ...
order (feature_id) int32 ...
elevation (feature_id) float32 ...
streamflow (feature_id) float64 ...
q_lateral (feature_id) float64 ...
velocity (feature_id) float64 ...
qSfcLatRunoff (feature_id) float64 ...
qBucket (feature_id) float64 ...
qBtmVertRunoff (feature_id) float64 ...
Attributes: (12/18)
TITLE: OUTPUT FROM WRF-Hydro v5.2.0-beta2
featureType: timeSeries
proj4: +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...
model_initialization_time: 1979-02-01_00:00:00
station_dimension: feature_id
model_output_valid_time: 1979-02-01_01:00:00
... ...
model_configuration: retrospective
dev_OVRTSWCRT: 1
dev_NOAH_TIMESTEP: 3600
dev_channel_only: 0
dev_channelBucket_only: 0
dev: dev_ prefix indicates development/internal me...
In [7]: xr.open_dataset(f, engine="h5netcdf")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
198 try:
--> 199 file = self._cache[self._key]
200 except KeyError:
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/lru_cache.py in __getitem__(self, key)
52 with self._lock:
---> 53 value = self._cache[key]
54 self._cache.move_to_end(key)
KeyError: [<class 'h5netcdf.core.File'>, (<OpenFile 'noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp'>,), 'r', (('decode_vlen_strings', True), ('invalid_netcdf', None))]
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-7-e6098b8ab402> in <module>
----> 1 xr.open_dataset(f, engine="h5netcdf")
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
495
496 overwrite_encoded_chunks = kwargs.pop("overwrite_encoded_chunks", None)
--> 497 backend_ds = backend.open_dataset(
498 filename_or_obj,
499 drop_variables=drop_variables,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open_dataset(self, filename_or_obj, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, use_cftime, decode_timedelta, format, group, lock, invalid_netcdf, phony_dims, decode_vlen_strings)
372
373 filename_or_obj = _normalize_path(filename_or_obj)
--> 374 store = H5NetCDFStore.open(
375 filename_or_obj,
376 format=format,
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in open(cls, filename, mode, format, group, lock, autoclose, invalid_netcdf, phony_dims, decode_vlen_strings)
176
177 manager = CachingFileManager(h5netcdf.File, filename, mode=mode, kwargs=kwargs)
--> 178 return cls(manager, group=group, mode=mode, lock=lock, autoclose=autoclose)
179
180 def _acquire(self, needs_lock=True):
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in __init__(self, manager, group, mode, lock, autoclose)
121 # todo: utilizing find_root_and_group seems a bit clunky
122 # making filename available on h5netcdf.Group seems better
--> 123 self._filename = find_root_and_group(self.ds)[0].filename
124 self.is_remote = is_remote_uri(self._filename)
125 self.lock = ensure_lock(lock)
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in ds(self)
187 @property
188 def ds(self):
--> 189 return self._acquire()
190
191 def open_store_variable(self, name, var):
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/h5netcdf_.py in _acquire(self, needs_lock)
179
180 def _acquire(self, needs_lock=True):
--> 181 with self._manager.acquire_context(needs_lock) as root:
182 ds = _nc4_require_group(
183 root, self._group, self._mode, create_group=_h5netcdf_create_group
~/.pyenv/versions/3.9.1/lib/python3.9/contextlib.py in __enter__(self)
115 del self.args, self.kwds, self.func
116 try:
--> 117 return next(self.gen)
118 except StopIteration:
119 raise RuntimeError("generator didn't yield") from None
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/file_manager.py in acquire_context(self, needs_lock)
185 def acquire_context(self, needs_lock=True):
186 """Context manager for acquiring a file."""
--> 187 file, cached = self._acquire_with_cache_info(needs_lock)
188 try:
189 yield file
~/dev/dask-playground/env/lib/python3.9/site-packages/xarray/backends/file_manager.py in _acquire_with_cache_info(self, needs_lock)
203 kwargs = kwargs.copy()
204 kwargs["mode"] = self._mode
--> 205 file = self._opener(*self._args, **kwargs)
206 if self._mode == "w":
207 # ensure file doesn't get overriden when opened again
~/dev/dask-playground/env/lib/python3.9/site-packages/h5netcdf/core.py in __init__(self, path, mode, invalid_netcdf, phony_dims, **kwargs)
978 self._preexisting_file = mode in {"r", "r+", "a"}
979 self._h5py = h5py
--> 980 self._h5file = self._h5py.File(
981 path, mode, track_order=track_order, **kwargs
982 )
~/dev/dask-playground/env/lib/python3.9/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, **kwds)
484 name = repr(name).encode('ASCII', 'replace')
485 else:
--> 486 name = filename_encode(name)
487
488 if track_order is None:
~/dev/dask-playground/env/lib/python3.9/site-packages/h5py/_hl/compat.py in filename_encode(filename)
17 filenames in h5py for more information.
18 """
---> 19 filename = fspath(filename)
20 if sys.platform == "win32":
21 if isinstance(filename, str):
~/dev/dask-playground/env/lib/python3.9/site-packages/fsspec/core.py in __fspath__(self)
96 def __fspath__(self):
97 # may raise if cannot be resolved to local file
---> 98 return self.open().__fspath__()
99
100 def __enter__(self):
AttributeError: 'S3File' object has no attribute '__fspath__'
```

The problem is that `OpenFile` implements `__fspath__`, so it looks path-like, yet resolving it only works when the target is a local file. So I may just be misunderstanding what an `OpenFile` is meant to be used for.
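One conceivable mitigation on xarray's side (just a sketch under the assumptions of this thread, not what the library actually does): treat `os.fspath` as fallible, since an object can satisfy the `PathLike` protocol yet be unable to resolve to a local path.

```python
import os

def _normalize_path_defensively(path):
    # Sketch: __fspath__ may raise for objects (like fsspec's OpenFile on
    # remote URLs) that look path-like but have no local filesystem path.
    if isinstance(path, os.PathLike):
        try:
            return os.fspath(path)
        except Exception:
            # Could not be resolved to a local path; pass the object
            # through and let the backend treat it as file-like.
            return path
    return path
```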
OK, I get you - so the real problem is that `OpenFile` can look path-like, but isn't really. `OpenFile` is really a file-like factory, a proxy for open file-likes that you can make (and serialise for Dask). Its main purpose is to be used in a context:

```python
with fsspec.open(url) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
```

except that the problem with xarray is that it will want to keep this thing open for subsequent operations, so you either need to put all of that in the context, or use `.open()` and manage the file's lifetime yourself.
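To make those two options concrete, a sketch reusing the `url` from the example above (the reduction is hypothetical):

```python
import fsspec
import xarray as xr

url = "s3://noaa-nwm-retrospective-2-1-pds/model_output/1979/197902010100.CHRTOUT_DOMAIN1.comp"

# Option 1: keep all the work inside the context, while the remote file is open.
with fsspec.open(url) as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    mean_flow = ds["streamflow"].mean().load()  # hypothetical reduction

# Option 2: open the underlying file-like object yourself and manage its lifetime.
f = fsspec.open(url).open()
ds = xr.open_dataset(f, engine="h5netcdf")
# ... lazy operations on ds ...
f.close()
```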
Yeah, I guess I expected an `OpenFile` to be usable anywhere a file is. I'll open a separate issue for improving the UX of this in xarray, though; I think this would be rather confusing for new users.
This PR generally changes (e.g.) `isinstance(filename, pathlib.Path)` to `isinstance(filename, os.PathLike)`, and uses `os.fspath` to convert it to (usually) `str` type. (If it is vital these are always `str`, then should `os.fsdecode` be considered? `bytes` paths are not common, and are only possible on some platforms.)

Previously, if other path-like objects were used, e.g. the `py.path` objects produced by pytest's `tmpdir` fixture, an error message was shown. This PR allows such path-like objects to be used.

A few typing objects are also adjusted.

Be aware that "file-like" and "path-like" are distinct terms in the core Python glossary. In light of this, some "file-like" wordings may need to be adjusted, such as the error message described above. This can be done in this PR if anyone agrees.
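For reference, a sketch of the `os.fspath` vs `os.fsdecode` distinction raised above: `fspath` passes `bytes` paths through unchanged, while `fsdecode` always returns `str`.

```python
import os
import pathlib

print(os.fspath(pathlib.Path("data.nc")))  # 'data.nc' (str)
print(os.fspath(b"data.nc"))               # b'data.nc' (bytes pass through)
print(os.fsdecode(b"data.nc"))             # 'data.nc' (decoded to str)
```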