
Providing filesystem credentials through storage_options #436

Merged (8 commits into cubed-dev:main on Mar 28, 2024)

Conversation

balanz24 (Contributor)

Fixes #432. This allows specifying the credentials of a storage filesystem on the Spec class. Needed for S3-compatible storage systems such as MinIO.
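To make the intent concrete, here is a hedged sketch of what such credentials look like: an fsspec-style storage_options dict for a MinIO endpoint. The endpoint URL and keys are placeholders, and the commented Spec call is only an illustration of how the dict would be attached.

```python
# Hypothetical MinIO credentials, expressed as an fsspec-style
# storage_options dict (endpoint and keys are placeholders).
storage_options = {
    "key": "minioadmin",
    "secret": "minioadmin",
    "client_kwargs": {"endpoint_url": "http://localhost:9000"},
}

# With this change, the dict would be attached to the Spec so that the
# Zarr stores cubed opens inherit the credentials, roughly:
#   spec = cubed.Spec(work_dir="s3://bucket/tmp", storage_options=storage_options)
```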

@tomwhite (Member) left a comment

This looks great - thanks for submitting it @balanz24.

I've made a couple of review comments, and some formatting fixes are needed too. You can install pre-commit with `pip install pre-commit`, then run `pre-commit run --all-files` to fix them.

@@ -308,6 +308,7 @@ def blockwise(
extra_func_kwargs=extra_func_kwargs,
fusable=fusable,
num_input_blocks=num_input_blocks,
storage_options=spec.storage_options,
@tomwhite (Member)

I know they are all kwargs, but the storage_options parameter would be better immediately after target_path, since it belongs with target_store and target_path. This applies to all the changes in ops.py and blockwise.py, but not rechunk.py.

**kwargs,
):
"""Create a Zarr array lazily in memory."""
# use an empty in-memory Zarr array as a template since it normalizes its properties
template = zarr.empty(
shape, dtype=dtype, chunks=chunks, store=zarr.storage.MemoryStore()
)
if storage_options:
s3 = s3fs.S3FileSystem(**storage_options)
store = s3fs.S3Map(root=store, s3=s3, check=False)
@tomwhite (Member)

This code is S3-specific, which we don't want to hardcode. In fact, I think you don't need any changes in zarr.py at all - since the storage_options from the spec will be passed straight through to zarr.open_array. Does that work in your case?
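The pass-through being described can be sketched generically. The stub below stands in for zarr.open_array (it only records what it received); the point is that a wrapper like LazyZarrArray.open that forwards **kwargs delivers storage_options to the underlying open call without any filesystem-specific code.

```python
# Stand-in for zarr.open_array: records the arguments it received.
def open_array_stub(store, mode="r", storage_options=None, **kwargs):
    return {"store": store, "mode": mode, "storage_options": storage_options}

# LazyZarrArray.open()-style wrapper: forwarding **kwargs means
# storage_options reaches the underlying open call unchanged.
def open_lazy(store, **kwargs):
    return open_array_stub(store, mode="r", **kwargs)

opened = open_lazy(
    "s3://bucket/array.zarr",
    storage_options={"client_kwargs": {"endpoint_url": "http://localhost:9000"}},
)
```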

@balanz24 (Contributor, Author)

You are right: they can be passed through **kwargs. However, kwargs were not passed to zarr.open_array inside LazyZarrArray.open(), so I made that modification.


balanz24 commented Mar 23, 2024

Thanks for the comments @tomwhite.

I was looking at the cubed.from_zarr function, which also calls the Zarr API. Maybe it should also receive **kwargs so that storage_options can be passed? Or extract them from the spec, which is already passed?

@tomwhite (Member)

> Or extract them from spec which is already passed?

This is the most consistent way of doing it.

@tomwhite (Member)

It would be good to add a unit test. Perhaps use a fsspec local filesystem and set auto_mkdir to a non-default value (i.e. to True) to check that it has been passed through. Could this work?

@@ -109,7 +109,7 @@ def from_zarr(store, path=None, spec=None) -> "Array":
The array loaded from Zarr storage.
"""
name = gensym()
target = zarr.open(store, path=path, mode="r")
target = zarr.open(store, path=path, mode="r", storage_options=spec.storage_options)
@tomwhite (Member)

This will fail if spec is None, which it can be when the default config is being used. You can fix this by adding a line like this:

self.spec = spec or spec_from_config(config)
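The suggested guard can be sketched in isolation. Spec and spec_from_config below are simplified stand-ins for cubed's real classes, used only to show that falling back to a default spec makes spec.storage_options safe to read even when no spec was supplied.

```python
# Hypothetical stand-in for cubed's Spec.
class Spec:
    def __init__(self, storage_options=None):
        self.storage_options = storage_options

# Placeholder for cubed's config-driven default spec.
def spec_from_config(config=None):
    return Spec()

def from_zarr(store, spec=None):
    # The None-guard: with the default config, spec is None, so fall
    # back to a default spec before touching spec.storage_options.
    spec = spec or spec_from_config()
    return {"store": store, "storage_options": spec.storage_options}
```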

@balanz24 (Contributor, Author)

Done. Sorry for the confusion.


balanz24 commented Mar 28, 2024

Regarding the unit test, I'm trying something like this:

def test_storage_options(tmp_path):
    zarr_path = f"file://{tmp_path}/dir/array.zarr"
    arr = lazy_zarr_array(zarr_path, shape=(3, 3), dtype=int, chunks=(2, 2), storage_options={"auto_mkdir": False})

    with pytest.raises(ValueError):
        arr.open()

    arr.create()

The problem is that, when running it locally, the non-existent intermediate directory gets created even with auto_mkdir set to False.
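The failure mode can be reproduced with the standard library alone. The sketch below mimics what a store that creates its own parent directories effectively does: it calls os.makedirs itself, so a filesystem-level auto_mkdir=False setting never gets a chance to block the intermediate directory.

```python
import os
import tempfile

# Sketch of why auto_mkdir=False is bypassed: if the store creates
# missing parent directories itself (via os.makedirs), the
# filesystem-layer setting never applies.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "dir", "array.zarr")
    os.makedirs(target)  # what the store does on create
    created = os.path.isdir(target)
```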

@tomwhite (Member)

> The problem is that the non-existing intermediate directory gets created with auto_mkdir set to False.

Ah, it looks like Zarr is doing this:

https://github.com/zarr-developers/zarr-python/blob/37e0a1a0c22f552daf6fd94fec5474a9b92db33d/zarr/storage.py#L1376

So that's not going to work as a test. I can't think of another simple way of testing this - have you got any ideas? Also, how has the manual testing been going?

@balanz24 (Contributor, Author)

With the current version the credentials are passed through fine and I'm able to use a MinIO server as the storage backend.

I don't know how we can test it without access to an external filesystem. It seems that on Linux you can't create a directory with credentials through POSIX, nor simulate an S3 API over local directories.

@tomwhite (Member)

I'm going to merge this now. I've opened #439 for adding a test.

Thank you for your contribution @balanz24!

@tomwhite tomwhite merged commit 7b4f7e2 into cubed-dev:main Mar 28, 2024
7 checks passed
Linked issue: Providing filesystem credentials to cubed.Spec (#432)