
copy from one filesystem to the other #909

Closed
rom1504 opened this issue Feb 15, 2022 · 14 comments

Comments

@rom1504

rom1504 commented Feb 15, 2022

Hi,
Thanks for creating this lib, it's really convenient and makes code that uses multiple filesystems clean!

Is there a way to copy a file (or even a folder?) from one filesystem to the other using fsspec natively?
It's of course possible to implement it by using ls/walk and copying to local then from local to the other fs, but I'm wondering if there's a native way to do it.

@rom1504
Author

rom1504 commented Feb 15, 2022

I just found this method:

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")
for k in tqdm(a):
    b[k] = a[k]

working well!

same with threads:

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = list(a.keys())
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

can be faster depending on the filesystems

@rom1504
Author

rom1504 commented Feb 15, 2022

I'd still be curious about any other ways to interact between two filesystems with fsspec.

@martindurant
Member

That's an elegant way to do it that would not have occurred to me, although I suppose b.update(a) probably works. #828 is supposed to deal with this, but it has stalled due to my lack of time.
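The b.update(a) idea can be sketched end to end; in this sketch, memory:// mappers (and the invented src/dst paths and filenames) stand in for the remote filesystems used earlier in the thread:

```python
import fsspec

# memory:// mappers stand in for the real remote stores
a = fsspec.get_mapper("memory://src")
b = fsspec.get_mapper("memory://dst")

a["one.txt"] = b"hello"
a["sub/two.txt"] = b"world"

# FSMap implements MutableMapping, so update() copies key by key,
# just like the explicit for-loop above
b.update(a)

assert sorted(b) == ["one.txt", "sub/two.txt"]
assert b["one.txt"] == b"hello"
```

Like the loop, this moves each value through memory one whole file at a time.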

@martindurant
Member

I should add that working with the mappers assumes that every file fits in memory, and will iterate through the files serially.

Closing this as a duplicate.

@hangweiqiang-uestc

I should add that working with the mappers assumes that every file fits in memory, and will iterate through the files serially.

Closing this as a duplicate.

Does that mean the 'copy' operation will first load the source file into memory and then write the data to the destination?

@martindurant
Member

If you use the mappers, then indeed whole files are passed in memory. The filesystems' copy() method reads and writes a chunk at a time.
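A minimal runnable sketch of the path-to-path copy(), with the in-memory filesystem (and invented /data paths) standing in for a real one; whether the implementation actually streams in chunks depends on the particular filesystem class:

```python
import fsspec

fs = fsspec.filesystem("memory")

# Write a source file, then copy it path-to-path; copy() moves the
# data through the filesystem instead of handing the bytes back to you
with fs.open("/data/source.bin", "wb") as f:
    f.write(b"x" * 1024)

fs.copy("/data/source.bin", "/data/dest.bin")

assert fs.cat("/data/dest.bin") == b"x" * 1024
```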

@hangweiqiang-uestc

hangweiqiang-uestc commented Aug 3, 2022

If you use the mappers, then indeed whole files are passed in memory. The filesystems' copy() method reads and writes a chunk at a time.

Yes, it looks like the 'PyArrowHDFS' filesystem uses shutil.copyfileobj to read chunks from the source file into a buffer and write them to the destination file.

shutil.copyfileobj(lstream, rstream)

Is it possible to use the copy operation provided by the hdfs client? It may reduce the cost of copying.

BTW, is there any interface other than 'mapper' which can be used like a file system?

@martindurant
Member

See also the copy function in the generics filesystem, specially designed for inter-filesystem copy: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.generic.GenericFileSystem

use shutil.copyfileobj to load chunk

This is what you will end up doing. You can use shutil directly if you like, or do it manually like:

import fsspec

chunksize = 2**20  # e.g. read 1 MiB at a time

with fsspec.open(firstURL, "rb") as f1:
    with fsspec.open(secondURL, "wb") as f2:
        while True:
            data = f1.read(chunksize)
            if not data:
                break
            f2.write(data)

(not data doesn't necessarily mean EOF on all filesystems; temporary conditions might mean nothing is read)

is there any interface other than 'mapper',

You mean the filesystem instance itself, perhaps? Also, there is universal_pathlib and other packages built on top of fsspec, if you want them.

@hangweiqiang-uestc

Thanks for your reply.

See also the copy function in the generics filesystem, specially designed for inter-filesystem copy

Is there any way to do an intra-filesystem copy without loading data into memory using fsspec? For example, copying a source file on HDFS to a destination on HDFS.

@martindurant
Member

Most filesystems implement a copy which doesn't need reading into memory. The title of this issue is about copies between filesystems.

@martindurant
Member

@hangweiqiang-uestc , you probably wanted the HadoopFileSystem rather than PyArrowHDFS. We are due to switch from the old to the new when we get to it.

@martindurant
Member

^ protocol would be "arrow_hdfs"

@ekicenz

ekicenz commented Sep 14, 2022

I just found this method:

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")
for k in tqdm(a):
    b[k] = a[k]

working well!

same with threads:

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = list(a.keys())
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

can be faster depending on the filesystems

Thanks for your code sample. This works. I have tried using fsspec.generic.GenericFileSystem, but I just can't make it work.

@martindurant
Member

I have tried using fsspec.generic.GenericFileSystem, but I just can't make it work

Perhaps you'd like to raise a new issue showing what you tried and how it failed? GenericFileSystem is still new and experimental, I'm sure we can fix it.
