
copy from one filesystem to the other #909

Closed
rom1504 opened this issue Feb 15, 2022 · 14 comments

Comments

@rom1504

rom1504 commented Feb 15, 2022

Hi,
Thanks for creating this lib, it's really convenient and makes code that uses multiple filesystems clean!

Is there a way to copy a file (or even a folder?) from one filesystem to the other using fsspec natively?
It's of course possible to implement it by using ls/walk and copying to local then from local to the other fs, but I'm wondering if there's a native way to do it.

@rom1504
Author

rom1504 commented Feb 15, 2022

I just found this method:

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")
for k in tqdm(a):
    b[k] = a[k]

working well!

same with threads:

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = list(a.keys())
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

can be faster depending on the filesystems

@rom1504
Author

rom1504 commented Feb 15, 2022

I'd still be curious about any other ways to interact between two filesystems with fsspec.

@martindurant
Member

That's an elegant way to do it that would not have occurred to me, although I suppose b.update(a) probably works. #828 is supposed to deal with this, but it has stalled due to my lack of time.
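The b.update(a) idea can be sketched end to end; in this sketch, memory:// mappers (and the invented src/dst paths and filenames) stand in for the remote filesystems used earlier in the thread:

```python
import fsspec

# memory:// mappers stand in for the real remote stores
a = fsspec.get_mapper("memory://src")
b = fsspec.get_mapper("memory://dst")

a["one.txt"] = b"hello"
a["sub/two.txt"] = b"world"

# FSMap implements MutableMapping, so update() copies key by key,
# just like the explicit for-loop above
b.update(a)

assert sorted(b) == ["one.txt", "sub/two.txt"]
assert b["one.txt"] == b"hello"
```

Like the loop, this moves each value through memory one whole file at a time.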

@martindurant
Member

I should add that working with the mappers assumes that every file fits in memory, and will iterate through the files serially.

Closing this as a duplicate.

@hangweiqiang-uestc

I should add that working with the mappers assumes that every file fits in memory, and will iterate through the files serially.

Closing this as a duplicate.

Does that mean the 'copy' operation will first load the source file into memory and then write the data to the destination?

@martindurant
Member

If you use the mappers, then indeed whole files are passed in memory. The filesystems' copy() method reads and writes a chunk at a time.
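A minimal runnable sketch of the path-to-path copy(), with the in-memory filesystem (and invented /data paths) standing in for a real one; whether the implementation actually streams in chunks depends on the particular filesystem class:

```python
import fsspec

fs = fsspec.filesystem("memory")

# Write a source file, then copy it path-to-path; copy() moves the
# data through the filesystem instead of handing the bytes back to you
with fs.open("/data/source.bin", "wb") as f:
    f.write(b"x" * 1024)

fs.copy("/data/source.bin", "/data/dest.bin")

assert fs.cat("/data/dest.bin") == b"x" * 1024
```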

@hangweiqiang-uestc

hangweiqiang-uestc commented Aug 3, 2022

If you use the mappers, then indeed whole files are passed in memory. The filesystems' copy() method reads and writes a chunk at a time.

Yes, it looks like the 'PyArrowHDFS' filesystem uses shutil.copyfileobj to read chunks from the source file into a buffer and write them to the destination file.

shutil.copyfileobj(lstream, rstream)

Is it possible to use the copy operation provided by the hdfs client? It may reduce the cost of copying.

BTW, is there any interface other than 'mapper' which can be used like a file system?

@martindurant
Member

See also the copy function in the generics filesystem, specially designed for inter-filesystem copy: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.generic.GenericFileSystem

use shutil.copyfileobj to load chunk

This is what you will end up doing. You can use shutil directly if you like, or do it manually like:

import fsspec

chunksize = 2**20  # e.g. read 1 MiB at a time

with fsspec.open(firstURL, "rb") as f1:
    with fsspec.open(secondURL, "wb") as f2:
        while True:
            data = f1.read(chunksize)
            if not data:
                break
            f2.write(data)

(not data doesn't necessarily mean EOF on all filesystems; temporary conditions might mean nothing is read)

is there any interface other than 'mapper',

You mean the filesystem instance itself, perhaps? Also, there is universal_pathlib and other packages built on top of fsspec, if you want them.

@hangweiqiang-uestc

Thanks for your reply.

See also the copy function in the generics filesystem, specially designed for inter-filesystem copy

Is there any way to do an intra-filesystem copy without loading data into memory using fsspec? For example, copying a source file on HDFS to a destination on HDFS.

@martindurant
Member

Most filesystems implement a copy which doesn't need reading into memory. The title of this issue is about copies between filesystems.

@martindurant
Member

@hangweiqiang-uestc , you probably wanted the HadoopFileSystem rather than PyArrowHDFS. We are due to switch from the old to the new when we get to it.

@martindurant
Member

^ protocol would be "arrow_hdfs"

@ekicenz

ekicenz commented Sep 14, 2022

I just found this method:

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")
for k in tqdm(a):
    b[k] = a[k]

working well!

same with threads:

from multiprocessing.pool import ThreadPool

import fsspec
from tqdm import tqdm

a = fsspec.get_mapper("https://some/folder")
b = fsspec.get_mapper("hdfs://root/some/other/folder")

def f(k):
    b[k] = a[k]

with ThreadPool(32) as p:
    keys = list(a.keys())
    for _ in tqdm(p.imap_unordered(f, keys), total=len(keys)):
        pass

can be faster depending on the filesystems

Thanks for your code sample. This works. I have tried using fsspec.generic.GenericFileSystem, but I just can't make it work.

@martindurant
Member

I have tried using fsspec.generic.GenericFileSystem, but I just can't make it work

Perhaps you'd like to raise a new issue showing what you tried and how it failed? GenericFileSystem is still new and experimental, I'm sure we can fix it.
