
bpo-33671: efficient zero-copy for shutil.copy* functions (Linux, OSX and Win) #7160

Merged
114 commits merged from giampaolo:shutil-zero-copy into python:master on Jun 12, 2018

Conversation

Contributor

@giampaolo giampaolo commented May 28, 2018

@giampaolo
Contributor Author

Code updated. Hopefully this is now ready for final review.

...as an extra safety measure: in case src/dst are "exotic" files (non
regular or living on a network fs etc.) we better fail on open() instead
of copyfile(3) as we're not quite sure what's gonna happen in that
case.
@giampaolo
Contributor Author

I'm not sure how to interpret these CI failures, but they appear unrelated to the changes at hand. If nobody has complaints I'm gonna merge this PR soon.

@ned-deily
Member

I noticed a lot of VSTS CI failures over the past 24 hours or so, unrelated to our tests. So I would recommend not giving those failures much weight and, as you can see, those failures do not prevent doing a merge.

@giampaolo giampaolo merged commit 4a172cc into python:master Jun 12, 2018
@bedevere-bot

@giampaolo: Please replace # with GH- in the commit message next time. Thanks!

@giampaolo giampaolo deleted the shutil-zero-copy branch June 12, 2018 21:04
@ned-deily
Member

Hi, @giampaolo. for future reference, please also manually "squash" edit the commit message in the web interface down to its essence. It's a bit jarring to have all of the details of the now squashed intermediate commits in the final cpython commit. Thanks!

@giampaolo
Contributor Author

Will do. Sorry about that.

@giampaolo
Contributor Author

@eryksun thanks for all your support. I really appreciate it.

@ned-deily
Member

This looks like a great feature. Thanks for doing it!

if not n:
    break
elif n < length:
    fdst_write(mv[:n])
Contributor

In my test case I used another with block to release this mv[:n] view, rather than depending on implicit deallocation to release it. For example:

>>> b = bytearray(100)
>>> mv1 = memoryview(b)
>>> mv2 = mv1[:10]
>>> mv1.release()
>>> b.append(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BufferError: Existing exports of data: object cannot be re-sized
>>> mv2.release()
>>> b.append(0)
>>> len(b)
101

This bytearray is internal, but is there any issue with memory usage in garbage-collected versions of Python (e.g. Jython, IronPython) if the views on the buffer (1 MiB in Windows) aren't released explicitly? If not you can remove the first with block as well.

Contributor Author

Uhm... yes, given the big bufsize I think it makes sense to also immediately release the sliced memoryview.

Contributor

@eryksun If I recall correctly, memory views inadvertently keeping large memory buffers alive on GC based implementations was a key driver in adding context management support to memoryview in the first place, so that's definitely a concern worth keeping in mind for this kind of code.


def _is_binary_files_pair(fsrc, fdst):
    return hasattr(fsrc, 'readinto') and \
        isinstance(fsrc, io.BytesIO) or 'b' in getattr(fsrc, 'mode', '') and \
Contributor

@eryksun eryksun Jun 12, 2018

Which objects provide readinto in text mode? Is it worth the function call and extra tests rather than handling AttributeError (no readinto) and TypeError (can't write bytes) with an inline try-except in copyfileobj?

Contributor Author

I think catching TypeError on fdst.write() is too risky, since we may be dealing with any kind of custom file-like object being passed here. It must be noted that the extra cost of this function is paid by users of copyfileobj() only (e.g. the tarfile and zipfile modules). copyfile() (and others) will skip this check and call _copybinfileobj() directly.

Contributor Author

I benchmarked _copybinfileobj() (I hadn't yet) and it turns out it's only slightly faster for 512MB files but considerably slower for 8MB and 128KB files so I am gonna remove it.

Contributor

I don't know how you're testing, but the performance difference with readinto depends on whether the source file is already in the system cache. Otherwise, of course memory operations will be dwarfed by considerably slower disk I/O.

Contributor Author

This is how I'm testing it:

$ python -c "import os; f = open('f1', 'wb'); f.write(os.urandom(8 * 1024 * 1024))"
$ time ./python -m timeit -s 'import shutil; p1 = "f1"; p2 = "f2"' 'shutil.copyfile(p1, p2)'

Contributor Author

@giampaolo giampaolo Jun 13, 2018

I wrote a batch script to figure out timings on Windows more easily and this is the result (first value is original copyfileobj() implementation, second value is the memoryview() variant):

8MB file
1000 loops, best of 5: 343 usec per loop
500 loops, best of 5: 478 usec per loop

64MB file
500 loops, best of 5: 474 usec per loop
500 loops, best of 5: 554 usec per loop

128MB file
200 loops, best of 5: 1.06 msec per loop
500 loops, best of 5: 640 usec per loop

256MB file
1 loop, best of 5: 286 msec per loop
5 loops, best of 5: 36.1 msec per loop

512MB file
1 loop, best of 5: 293 msec per loop
5 loops, best of 5: 36.7 msec per loop

I think the memoryview() variant is so much faster above 128MB that it is worth having the dual implementation and using it from the copyfile() function only on Windows when the file size is > 128MB. I will do it in my other PR/branch.

On the other hand, the same test on Linux shows there is no relevant difference for 512 MB files and a performance degradation for smaller ones.


Contributor

This command is wrong:

./python  -m timeit -s "import shutil; f1 = open('f1', 'rb'); f2 = open('f2', 'wb')" "shutil.copyfileobj(f1, f2)"

The files need to be opened for each pass of the loop, not in the setup. That explains the unexpected results. I corrected it to open the files in the loop statement instead of the setup and tested a broad range of file sizes. In the table below all times are best of five for the given number of loops, normalized overall so that the 64 KiB result in the RI_S column equals 100 time units. I discuss the RI_S case in more detail below.

   SIZE | LOOPS |   R_16K ||    R_1M   %CHNG |  RI_1M   %CHNG |  RI_S   %CHNG
--------+-------+---------||-----------------+----------------+--------------
  1 GiB |     5 | 1060870 || 1003478    -5.4 | 977391    -7.9 | 963478   -9.2
512 MiB |    10 |  323478 ||  283478   -12.4 | 217391   -32.8 | 213913  -33.9
256 MiB |    20 |  163304 ||  146435   -10.3 | 112870   -30.9 | 110957  -32.1
128 MiB |    40 |   80174 ||   74609    -6.9 |  55478   -30.8 |  55478  -30.8
 64 MiB |    80 |   39652 ||   34783   -12.3 |  27304   -31.1 |  27130  -31.6
 32 MiB |   160 |   19478 ||   17913    -8.0 |  13809   -29.1 |  13478  -30.8
 16 MiB |   320 |    9739 ||    8887    -8.7 |   6887   -29.3 |   6835  -29.8
  8 MiB |   640 |    4904 ||    4713    -3.9 |   3652   -25.5 |   3548  -27.7
  4 MiB |  1280 |    2504 ||    2348    -6.2 |   1948   -22.2 |   1913  -23.6
  2 MiB |  2560 |    1150 ||    1193     3.7 |   1002   -12.9 |    991  -13.8
  1 MiB |  5120 |     649 ||     697     7.4 |    659     1.5 |    663    2.2
512 KiB | 10240 |     388 ||     553    42.5 |    499    28.6 |    322  -17.0
256 KiB | 20480 |     245 ||     459    87.3 |    410    67.3 |    205  -16.3
128 KiB | 30720 |     170 ||     388   128.2 |    362   112.9 |    139  -18.2
 64 KiB | 40960 |     123 ||     353   187.0 |    343   178.9 |    100  -18.7

    R_16K -- read 16 KiB
    R_1M  -- read 1 MiB
    RI_1M -- readinto 1 MiB
    RI_S  -- readinto source size up to 1 MiB

Originally I had tested at 128 MiB with a custom test script to focus on the effects of cached vs non-cached I/O. I assumed the results would be similar for other cases. As shown in the RI_1M column, that's basically true for files larger than 1 megabyte. But there's a significant performance degradation for smaller files. In the RI_S case, I address this by calling os.fstat on the source file to cap the length of the bytearray at its size. This avoids wastefully over-allocating a zeroed bytearray.

RI_S also experiments with calling SetFileInformationByHandle with FileEndOfFileInfo (prototyped using ctypes) to avoid having to repeatedly extend the file when length is less than the size of the source file. (Note that Python's truncate method is of no use here since it zeros the file.) CopyFileEx does this, so I figured it was worth a try. It does provide a modest performance increase. (Note that I mistakenly included the length == size boundary, which is apparent in the 1 MiB trial.) I don't know if it's significant enough to justify implementing _winapi.SetFileInformationByHandle.

If I have time, I may run another experiment using mmap to read into a sliding window of the destination file. It already implements setting the end of the file, albeit with the less efficient combination of SetFilePointer and SetEndOfFile. (This way requires 4 system calls instead of 1 to set the end of the file.)

Below is the code for RI_S:

import os

_WINDOWS = (os.name == 'nt')
COPY_BUFSIZE = 1024 * 1024  # defined unconditionally: it's the default below

if _WINDOWS:
    import msvcrt
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    FileEndOfFileInfo = 6

    FILE_INFO_BY_HANDLE_CLASS = wintypes.DWORD
    kernel32.SetFileInformationByHandle.argtypes = (
        wintypes.HANDLE,           # _In_ hFile
        FILE_INFO_BY_HANDLE_CLASS, # _In_ FileInformationClass
        wintypes.LPVOID,           # _In_ lpFileInformation
        wintypes.DWORD)            # _In_ dwBufferSize

def copyfileobj(fsrc, fdst, length=COPY_BUFSIZE):
    size = os.fstat(fsrc.fileno()).st_size
    if length > size:
        length = size
    elif _WINDOWS:
        info = wintypes.LARGE_INTEGER(size)
        if not kernel32.SetFileInformationByHandle(
                    msvcrt.get_osfhandle(fdst.fileno()), FileEndOfFileInfo, 
                    ctypes.byref(info), ctypes.sizeof(info)):
            raise ctypes.WinError(ctypes.get_last_error())
    fsrc_readinto = fsrc.readinto
    fdst_write = fdst.write
    with memoryview(bytearray(length)) as mv:
        while True:
            n = fsrc_readinto(mv)
            if not n:
                break
            elif n < length:
                with mv[:n] as smv:
                    fdst_write(smv)
            else:
                fdst_write(mv)

Contributor Author

Interesting. Thanks for the very detailed benchmark. I updated the other branch which now dynamically sets memoryview() size based on file size (93ebc1f) and I confirm using the readinto() variant is faster also for smaller files.

Contributor Author

#7681 was merged

@vstinner
Copy link
Member

@ned-deily: " +On OSX fcopyfile_ is used to copy the file content (not metadata)."

Apple changed the OS name from OSX to macOS, no?

@giampaolo
Copy link
Contributor Author

True. Will make another PR soon with this and another couple of small changes.

@vstinner
Copy link
Member

True. Will make another PR soon with this and another couple of small changes.

If you change it, you might also rename _fastcopy_osx() to _fastcopy_fcopyfile().

@vstinner
Copy link
Member

"default buffer size on Windows was increased from 16KB to 1MB"

Nitpicking: 16 KiB to 1 MiB. I now prefer "iB" units to avoid confusion between base 2 and base 10.

@giampaolo
Copy link
Contributor Author

If nobody has complaints I'm gonna merge #7681 soon.

giampaolo added a commit that referenced this pull request Jun 19, 2018
…ndows (#7681)

bpo-33671
* use memoryview() with size == file size on Windows, see #7160 (comment)
* release intermediate (sliced) memoryview immediately
* replace "OSX" occurrences with "macOS"
* add some unittests for copyfileobj()