Read from http - httppathlib? #455

TomNicholas · 2024-07-27T17:55:12Z

What would be the best approach to supporting reading data over HTTP via a `pathlib`-like class, i.e. `httppathlib`?

In the pangeo / xarray community we do a lot of reading of remote scientific data (particularly netCDF and Zarr). We generally want to treat three cases the same way: local filesystems, cloud storage, and HTTP URLs. The last is important partly because a lot of archival scientific data is still only available from servers over HTTP (e.g. via OPeNDAP URLs), and we often want to pull it out and deposit it onto cloud storage (e.g. using pangeo-forge).

We currently use fsspec to abstract over these different filesystems, but despite much engagement upstream we have unfortunately experienced chronic reliability issues stemming from ill-defined interfaces.

CloudPathlib looks really nice, especially the strict typing and clear interface. (I'm in awe of the AnyPath virtual superclass trick too - and with #347 would be even cooler!) The Path abstraction also just seems more like the minimally-useful one, rather than trying to emulate a whole filesystem.

Rather than trying to support every filesystem under the sun as fsspec does, I'm wondering if we could just use pathlib, cloudpathlib, and some new httppathlib?

Do you have any thoughts on:

Whether you think this is a good idea?
The experiments already performed with xarray in Hello from fsspec! #96 (comment)?
How hard it might be to get a httppathlib to conform to the pathlib interface?
Where such a project might live: in cloudpathlib or in a separate repository?
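To make the interface question concrete, here is a minimal sketch of what the pure-path half of such a class might look like. `HttpPath` and all of its methods are hypothetical names invented for illustration; this is not cloudpathlib's API, and it assumes the URL path can be treated as a POSIX path (so trailing slashes are normalized away, which may matter on some servers).

```python
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


class HttpPath:
    """Hypothetical pathlib-like wrapper around an http(s) URL (illustration only)."""

    def __init__(self, url: str) -> None:
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url!r}")
        self._parts = parts
        # Reuse pathlib's pure-path logic for the path component of the URL.
        self._path = PurePosixPath(parts.path or "/")

    def _with_path(self, path: PurePosixPath) -> "HttpPath":
        # Query string and fragment are dropped when deriving a new path.
        new = self._parts._replace(path=str(path), query="", fragment="")
        return HttpPath(urlunsplit(new))

    def __str__(self) -> str:
        return urlunsplit(self._parts._replace(path=str(self._path)))

    def __truediv__(self, other: str) -> "HttpPath":
        return self._with_path(self._path / other)

    @property
    def name(self) -> str:
        return self._path.name

    @property
    def suffix(self) -> str:
        return self._path.suffix

    @property
    def parent(self) -> "HttpPath":
        return self._with_path(self._path.parent)

    def read_bytes(self) -> bytes:
        # Plain GET of the whole resource.
        with urllib.request.urlopen(str(self)) as response:
            return response.read()

    def iterdir(self):
        # HTTP itself defines no directory listing.
        raise NotImplementedError("HTTP has no standard way to list a directory")
```

With this sketch, `HttpPath("https://example.com/data/file.nc").parent / "other.zarr"` gives a path whose string form is `https://example.com/data/other.zarr`. The pure-path operations are easy; the methods that need a listing are where conformance to `pathlib` gets hard.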


pjbull · 2024-07-27T19:54:42Z

I do like this idea, and it is not the first time we have heard this. We've been sort of on the fence about HTTP since at a protocol level, it doesn't map to most path operations except for working with individual files. Some thoughts on a potential mapping to the abstract methods used by Client:

_download_file - GET
_exists - HEAD
_list_dir - no consistent approach; this is the biggest limitation, and it means a fair number of CloudPath methods would have to raise NotImplementedError.
_move_file - PUT + DELETE (though a lot of servers won't support these)
_remove - DELETE
_upload_file - PUT

I wouldn't be surprised if even beyond that most servers/scenarios people work with are limited to GET and maybe HEAD, but that may not be a total deal breaker.

Given our philosophy of trying to keep official cloud SDKs as the only dependencies of cloudpathlib, I think that we should look at implementing this with urllib.
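As a rough sketch of the read-only part of that mapping using only `urllib` from the standard library: the function names below (`http_exists`, `http_download_file`, `http_list_dir`) are hypothetical stand-ins chosen for illustration and do not follow the actual signatures of the abstract methods on Client. The sketch also assumes that a 404 is the only status that clearly means "does not exist".

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def http_exists(url: str, timeout: float = 10.0) -> bool:
    """HEAD request: True on a 2xx response, False on 404."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # 403, 405, 5xx etc. do not tell us whether the resource exists.
        raise


def http_download_file(url: str, local_path: Path, timeout: float = 10.0) -> Path:
    """GET request: stream the response body to local_path."""
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with local_path.open("wb") as out:
            shutil.copyfileobj(response, out)
    return local_path


def http_list_dir(url: str):
    """No HTTP verb lists children, so this cannot be implemented generically."""
    raise NotImplementedError("HTTP has no standard directory listing")
```

The write-side operations (PUT, DELETE) would follow the same pattern with `method="PUT"` or `method="DELETE"` on the `Request`, subject to the server actually allowing them.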

Happy to consider a PR for this, or if someone wanted to make a cloudpathlib-http repo in the near term, I think that we'd be pretty open to upstreaming it. The best guide to implementing is in the contributing docs.

moradology · 2024-07-29T17:52:55Z

Glad to see that there's some willingness to explore options outside the current scope of cloud providers. HTTP presents some unique issues vs the more-like-a-real-filesystem cloud provider options already supported - hopefully those don't prove to be more than an annoyance here.

One thing I'm wondering about is range reads. In boto3, ranges can be read like this:

import boto3

s3 = boto3.client('s3')

bucket_name = 'your-bucket-name'
object_key = 'path/to/your/object'
start_byte = 0
end_byte = 1023  # First 1KB

response = s3.get_object(
    Bucket=bucket_name,
    Key=object_key,
    Range=f'bytes={start_byte}-{end_byte}'
)

data = response['Body'].read() # Just the bytes we want

I'm new to the library and certainly haven't gone through the source in detail, but I wonder how well the Path abstraction fits with this. Here's what a `pathlib` (stdlib) read of only a selected range looks like:

from pathlib import Path

file_path = Path('path/to/your/file')
start_byte = 0
end_byte = 1023  # First 1KB

with file_path.open('rb') as file:
    file.seek(start_byte)
    data = file.read(end_byte - start_byte + 1)

It occurs to me that the expected behavior in this instance is a bit ambiguous, right? If the file is remote, would cloudpathlib download the whole thing locally and then seek through the bytes, or would it read only the bytes that are actually needed?
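For comparison, the plain-HTTP analogue of the boto3 call would presumably be a `Range` request header. Below is a minimal sketch with `urllib`; `read_range` is a hypothetical helper, not something cloudpathlib provides, and it assumes the server may or may not honor range requests, falling back to slicing the full body when it answers 200 instead of 206.

```python
import urllib.request


def read_range(url: str, start: int, end: int, timeout: float = 10.0) -> bytes:
    """Return bytes start..end (inclusive) of the resource at url."""
    request = urllib.request.Request(
        url, headers={"Range": f"bytes={start}-{end}"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        status = response.status
        body = response.read()
    if status == 206:
        # Partial Content: the server sent only the requested bytes.
        return body
    # Status 200: the server ignored the Range header and sent everything,
    # so the whole resource was transferred and we slice locally.
    return body[start : end + 1]
```

The fallback branch is exactly the "download the whole thing" case, so whether a read is actually partial depends on the server as much as on the client.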

pjbull · 2024-07-29T18:55:48Z

Partial reads/writes and streaming are a separate issue from supporting HTTP URLs as paths (e.g., see #9 and #264). Moving discussion on that point to #9.

TomNicholas mentioned this issue Jul 27, 2024

Use cloudpathlib instead of fsspec? zarr-developers/VirtualiZarr#172

Open

pjbull mentioned this issue Jul 29, 2024

Implement partial or streaming reads/writes (CloudFile abstraction) #9

Open

TomNicholas mentioned this issue Aug 19, 2024

Support local paths for InputDataset.source CWorthy-ocean/C-Star#30

Merged

pjbull linked a pull request Sep 1, 2024 that will close this issue

WIP: Implement HTTP #468

Draft

9 tasks
