
Ability to use fsspec.generic.rsync across different filesystems #1398

Open
f4hy opened this issue Oct 24, 2023 · 10 comments

Comments

@f4hy commented Oct 24, 2023

The rsync method looks like exactly what I have been looking for, but I am not sure how one would use it to, say, sync data between two different s3 buckets that require different credentials.

What I would want is something like:

fs1_args = {'client_kwargs': {'aws_access_key_id': 'foo', 'aws_secret_access_key': 'bar'}}
fs2_args = {'client_kwargs': {'aws_access_key_id': 'baz', 'aws_secret_access_key': 'qux'}}
fs1 = fsspec.filesystem("s3", **fs1_args)
fs2 = fsspec.filesystem("s3", **fs2_args)
fsspec.generic.rsync("s3://bucket1/somepath", "s3://bucket2/somepath", from_fs=fs1, to_fs=fs2)

but the rsync method only takes an fs=, not a to_fs and from_fs. So how is one supposed to pass in both values? Why does the rsync method take only one filesystem if it is meant to be able to copy across systems?

@martindurant (Member)

This could be documented better... You may even be right that rsync is the most important thing in the whole module, and so other APIs should be tailored to make it as simple as possible.

The fs in question would be an instance of GenericFileSystem from the same module, which handles all the operations rsync needs by dispatching to other backends. It provides various ways to map from URL to filesystem instance, keyed by the protocol of each URL. So, if you wanted to copy s3->s3 with different instances, you would need to define a lookup for two different protocol strings.

You could for instance do

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
fsspec.generic._generic_fs["s3_target"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")

fsspec.generic.rsync(source, target, fs=generic)

where the URLs in source start with "s3_source://" and the ones in target with "s3_target://". Obviously this is more complicated than it might be for the case of two instances of the same backend!

@f4hy (Author) commented Oct 24, 2023

OK, wow. Yes, that's exactly what I wanted to do, but it's really not clear from the docs how to do it. So the idea is to define a new "backend" and then mangle my URIs to use the new s3_source or s3_target prefixes.

I guess the downside of this is that I can no longer use the same URIs I use in other places. Ideally both inputs to rsync here would be s3://, since that is what the URIs actually are (remember the I in URI stands for identifier).

It seems far cleaner to have the API I proposed above, with a source_fs and target_fs, instead of having to mangle the URIs. Any reason not to also support that?

Glad there is a workaround though. Will try this out.

@martindurant (Member)

> Any reason not to also support that?

No reason, except that the generic filesystem came first, and so reused code rather than tailoring something that was easier to use. Would you like to work on this?

@f4hy (Author) commented Oct 24, 2023

Yes. If this would be of value, I would love to contribute. I'll draft up a PR.

@f4hy (Author) commented Oct 24, 2023

I am still missing something about how the solution you mentioned can work. Trying to use a generic filesystem to load a different s3 config doesn't seem to work as expected.

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")
generic.find('s3_source://mybucket/path/')

throws an error:
Invalid bucket name "s3_source:": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"

since the whole path now gets passed down to s3fs, which tries to validate it, and s3_source:// is not a valid bucket name. So my guess is that the prefix needs to be stripped from what gets passed down by the generic filesystem?

@martindurant (Member)

Ah, that is annoying! I guess "s3" and "s3a" would work, since those are different, but still prefixes that we know and expect. You are right that a better solution is certainly warranted!

@f4hy (Author) commented Oct 24, 2023

I think the issue is that _strip_protocol() is broken for generic in this case.

fsspec.generic._generic_fs["s3_source"] = fsspec.filesystem("s3", ....)
generic = fsspec.filesystem("generic", default_method="generic")
generic._strip_protocol('s3_source://mybucket/')

doesn't give what you'd expect: it returns 's3://s3_source://mybucket', which is not right.
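For reference, here is a minimal, self-contained sketch of what the stripping would need to do for a custom protocol key. The function name is hypothetical, and fsspec's real _strip_protocol also handles extra normalization (trailing slashes, protocol aliases) that is omitted here:

```python
def strip_custom_protocol(url, protocol):
    """Drop only the given custom protocol prefix (e.g. "s3_source")
    from a URL, leaving the bare bucket path for the backend to use.
    Hypothetical helper illustrating the expected behaviour."""
    prefix = protocol + "://"
    return url[len(prefix):] if url.startswith(prefix) else url
```

With this behaviour, 's3_source://mybucket/path' would become 'mybucket/path' before being handed to the underlying s3fs instance, avoiding the bucket-name validation error above.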

@f4hy (Author) commented Oct 24, 2023

Yes, https://github.com/fsspec/filesystem_spec/blob/master/fsspec/generic.py#L177 is wrong. It tries to strip the protocol using s3fs, which would remove an s3:// prefix, but it needs to remove s3_source://. Let me see if there is an easy fix.

@martindurant (Member)

You can set the instance's protocol value, which might be enough; but using "s3" and "s3a" would be a workaround for the moment.

@f4hy (Author) commented Oct 24, 2023

OK, but s3 and s3a are different protocols. Also, I'm not sure how this would work with something like HDFS, if you wanted to copy from one HDFS cluster to another.

But OK, it's interesting that there is a workaround: just use s3 and s3a for the two instances. I will still propose a PR to implement an rsync-like method that works without the generic filesystem and takes two explicit filesystem instances.
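For concreteness, here is a minimal sketch of the two-filesystem API proposed in this thread. All names are hypothetical, DictFS is a toy in-memory stand-in for real fsspec filesystem instances (so the example runs without S3 credentials), and the real fsspec rsync additionally compares file sizes/checksums before copying, which is omitted here:

```python
import io
import posixpath


class DictFS:
    """Toy in-memory filesystem standing in for real fsspec instances."""

    def __init__(self, files=None):
        self.files = dict(files or {})  # path -> bytes

    def find(self, path):
        # List all file paths under the given directory prefix.
        prefix = path.rstrip("/") + "/"
        return sorted(p for p in self.files if p.startswith(prefix))

    def open(self, path, mode="rb"):
        if "w" in mode:
            return _Writer(self, path)
        return io.BytesIO(self.files[path])


class _Writer(io.BytesIO):
    """Buffer that commits its contents to the owning DictFS on close."""

    def __init__(self, fs, path):
        super().__init__()
        self.fs, self.path = fs, path

    def close(self):
        if not self.closed:
            self.fs.files[self.path] = self.getvalue()
        super().close()


def rsync_between(source, target, source_fs, target_fs):
    """Copy every file under `source` on source_fs to the matching
    path under `target` on target_fs, using two explicit filesystem
    instances instead of a GenericFileSystem lookup."""
    for src_path in source_fs.find(source):
        rel = posixpath.relpath(src_path, source)
        dst_path = posixpath.join(target, rel)
        with source_fs.open(src_path, "rb") as fsrc, \
                target_fs.open(dst_path, "wb") as fdst:
            fdst.write(fsrc.read())
```

With real backends one would instead pass two fsspec.filesystem("s3", ...) instances built with different credentials, which is exactly what the proposed from_fs/to_fs signature would allow without any URI mangling.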
