gh-121313: Limit the reading size from pipes to their default buffer size on Unix systems #121315

aplaikner · 2024-07-03T09:22:39Z

Issue: Limit the reading size from pipes to their default buffer size on Unix systems #121313

cpython-cla-bot · 2024-07-03T09:22:42Z

All commit authors signed the Contributor License Agreement.

bedevere-app · 2024-07-03T09:22:45Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

cmaloney · 2024-07-04T22:16:54Z

os.read() / _os_read_impl is used for reading from most kinds of files in Python. The limited size definitely makes sense for pipes, but disk I/O generally wants "as big a read as possible". For instance, when reading regular files such as Python source code, one read call with a buffer that can fit the whole file is fastest in my experiments. For both that case and the pipe case, it would be more efficient to figure out the maximum read size once (with whatever system calls that entails) and reuse it for every subsequent read call.

Following your chain of pieces, could this be made more targeted to the specific case? Two thoughts:

1. This is specifically caused by Lib/multiprocessing/connection.py; can that code explicitly specify the read size it wants?
2. Rather than checking / adjusting the size for every read, could that be done just once, when the pipe is opened/created? So on open, check the type and stash the "max read size", then compare against that. (The code currently checks against the _PY_READ_MAX constant; this would just be saying that the max read size is file-type dependent, which is true on both Windows and Linux.)

See also: gh-117151 which is aiming to increase the default size (albeit focused around write performance)
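
The "check once at open, stash the max read size" idea from the second point could look roughly like the sketch below. It assumes a Unix system; `SizedReader`, `max_read`, and the 64 KiB `pipe_limit` default are hypothetical names and values chosen for illustration, not anything that exists in CPython.

```python
import os
import stat


class SizedReader:
    """Hypothetical wrapper that decides the per-read cap once, at creation."""

    def __init__(self, fd, pipe_limit=65536):
        self.fd = fd
        # One fstat() here instead of one per read call.
        is_fifo = stat.S_ISFIFO(os.fstat(fd).st_mode)
        # Regular files keep "as big a read as possible" (no cap).
        self.max_read = pipe_limit if is_fifo else None

    def read(self, size):
        if self.max_read is not None:
            size = min(size, self.max_read)
        return os.read(self.fd, size)
```

With this shape the file-type check is paid once per descriptor, and every later `read` only performs a `min`.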

aplaikner · 2024-07-05T07:27:56Z

I've tried shifting the check to Lib/multiprocessing/connection.py and it seems promising, yielding the same performance improvements as having the checks in the C code. The change to os_read_impl would be reverted and the following patch applied to Lib/multiprocessing/connection.py:

diff --git a/Lib/multiprocessing/connection.py b/Lib/multiprocessing/connection.py
index b7e1e13217..4797ca4df8 100644
--- a/Lib/multiprocessing/connection.py
+++ b/Lib/multiprocessing/connection.py
@@ -18,6 +18,7 @@
 import time
 import tempfile
 import itertools
+import stat
 
 
 from . import util
@@ -391,8 +392,17 @@ def _recv(self, size, read=_read):
         buf = io.BytesIO()
         handle = self._handle
         remaining = size
+        is_pipe = False
+        page_size = 0
+        if not _winapi:
+            page_size = os.sysconf(os.sysconf_names['SC_PAGESIZE'])
+            if size > 16 * page_size:
+                mode = os.fstat(handle).st_mode
+                is_pipe = stat.S_ISFIFO(mode)
+        limit = 16 * page_size if is_pipe else remaining
         while remaining > 0:
-            chunk = read(handle, remaining)
+            to_read = min(limit, remaining)
+            chunk = read(handle, to_read)
             n = len(chunk)
             if n == 0:
                 if remaining == size:

cmaloney

Looking reasonable to me overall: Unlikely to break compatibility or reduce performance, improves default behavior. A couple smaller change requests from me.

It would be nice to add a test that fails if something reintroduces the "read too large on pipes resulting in bad behavior" problem, although I don't see a straightforward way to do that. (Maybe mocking Connection._read in a new test in _test_multiprocessing and checking the size of the read when we know it is a pipe?)
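
One possible shape for such a test is sketched below. It does not touch the real `Connection` class: `capped_recv` is a hypothetical stand-in for the read loop, and the recording fake plays the role of the mocked read function. The 64 KiB limit is an assumed value for illustration.

```python
import os
import unittest


def capped_recv(fd, size, limit, read=os.read):
    # Hypothetical stand-in for the _recv loop: never request more than `limit`.
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = read(fd, min(limit, remaining))
        if not chunk:
            raise EOFError
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class PipeReadSizeTest(unittest.TestCase):
    def test_reads_never_exceed_limit(self):
        requested = []

        def recording_read(fd, n):
            # Record the requested size and pretend the read was fully satisfied.
            requested.append(n)
            return b"\0" * n

        limit = 65536  # assumed pipe-sized cap
        data = capped_recv(-1, 3 * limit + 1, limit, read=recording_read)
        self.assertEqual(len(data), 3 * limit + 1)
        self.assertEqual(requested, [limit, limit, limit, 1])
```

Because the fake never performs real I/O, the test is deterministic and only checks the sizes that the loop asks for.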

cmaloney · 2024-07-05T19:27:55Z

Lib/multiprocessing/connection.py

@@ -18,6 +18,7 @@
 import time
 import tempfile
 import itertools
+import stat


Personal nitpick: PEP 8 doesn't seem to specify an order (https://peps.python.org/pep-0008/#imports), but I like imports to be alphabetical. itertools, time, and tempfile, which were already in the code just above this, are also out of order (although time and tempfile only slightly). The rest are in order. Not sure if it matters for Python core developer acceptance.

cmaloney · 2024-07-05T20:53:28Z

Lib/multiprocessing/connection.py

+        is_pipe = False
+        page_size = 0
+        if not _winapi:
+            page_size = os.sysconf(os.sysconf_names['SC_PAGESIZE'])


Rather than doing the if not _winapi check here, which has to run on every _recv call, can you put the "calculate max size for a FIFO" step in the block that chooses/defines the standard read function (https://github.com/python/cpython/blob/main/Lib/multiprocessing/connection.py#L370-L379)? The code here will still need to do the min logic and the "is this a FIFO" check, but this at least reduces the overhead a little further.

I've shifted fetching the base page size and calculating the default pipe size to the existing if _winapi block above. Is this what you meant?

Yep, looking good
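
As a rough illustration of hoisting the calculation out of the per-call path, the sketch below computes the cap once at import time. The names `PAGES_PER_PIPE`, `PIPE_READ_LIMIT`, and `clamp_read_size` are hypothetical, the `sys.platform` test stands in for the module's own `if _winapi` check, and the 4096-byte fallback is an assumption for systems where `sysconf` is unavailable.

```python
import os
import sys

PAGES_PER_PIPE = 16  # hypothetical name for the multiplier used in the patch

if sys.platform == "win32":
    # Stand-in for the `if _winapi` branch: no pipe cap is applied there.
    PIPE_READ_LIMIT = None
else:
    try:
        PIPE_READ_LIMIT = PAGES_PER_PIPE * os.sysconf("SC_PAGESIZE")
    except (AttributeError, ValueError, OSError):
        # Assumed fallback when the page size cannot be queried.
        PIPE_READ_LIMIT = PAGES_PER_PIPE * 4096


def clamp_read_size(size, is_pipe):
    """Per-call work is reduced to a comparison against the precomputed cap."""
    if is_pipe and PIPE_READ_LIMIT is not None:
        return min(size, PIPE_READ_LIMIT)
    return size
```

The "is this a FIFO" decision still has to happen per descriptor, which is why `is_pipe` remains a parameter here.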

cmaloney · 2024-07-05T20:59:30Z

Misc/NEWS.d/next/C API/2024-07-03-10-11-53.gh-issue-121313.D7gARW.rst

@@ -0,0 +1 @@
+Limit reading size in os.read for pipes to default pipe size in order to avoid memory overallocation


This should be updated from os.read -> multiprocessing to follow the logic location change.

cmaloney · 2024-07-07T18:38:50Z

I think this is as far as I can review; it needs a Python core dev / someone with more project familiarity to look at the high-level things.

Some lingering thoughts I have:

1. Would it make more sense to use fcntl F_GETPIPE_SZ rather than calculate the size? I hadn't known about that until reading through the linked pipe man page.
2. How does this work on non-Linux systems? Particularly FreeBSD and Apple systems that Python supports (https://peps.python.org/pep-0011/#support-tiers). I'm not familiar with pipes on those platforms at all currently.

aplaikner · 2024-07-07T20:07:48Z

1. When using fcntl, an additional system call per _recv would be necessary. The main issue is that the code must be executed inside the _recv function, because fcntl requires the pipe's file descriptor. To avoid errors, a check for Windows would be needed before calling fcntl; this could be a boolean set inside the if _winapi check. Additionally, there should be a check that the file descriptor belongs to a pipe before attempting to fetch the pipe size. This results in two checks before obtaining the pipe size.
To optimize performance, these checks could be wrapped in another condition that skips them when the read size is smaller than the default pipe size. Otherwise at least the fstat system call would be executed. However, this would again lead to a hardcoded value.
Using fcntl would provide a more dynamic approach, but at the cost of reduced performance due to the additional system calls and checks.
I think the current solution covers most cases, where the default pipe size is used. If someone changes that value, they would also need to change the new constant to see some performance benefits.
2. I'm also not familiar with pipes on those systems, but it seems that FreeBSD and MacOS have both a default pipe buffer size of 64KiB: https://www.netmeister.org/blog/ipcbufs.html

aplaikner · 2024-07-29T12:43:09Z

Hi @cmaloney, I wanted to check in and see if there are any additional steps I need to take for this pull request before it can be reviewed by a core developer.

Thank you!

cmaloney · 2024-07-29T19:24:13Z

Re: core review, as far as I know no other steps are needed. Per https://devguide.python.org/getting-started/pull-request-lifecycle/#reviewing, it's mainly just patience; that document suggests waiting a month before pinging other locations.

bedevere-bot · 2024-08-31T00:34:20Z

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 52606d1 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

gpshead · 2024-08-31T05:55:32Z

There's one potential further optimization, at least on Linux. fcntl F_GETPIPE_SZ on the fd if it is a pipe should return the actual size. A pipe might have been configured differently than the platform default. Regardless I don't expect that will have been the case within this multiprocessing code. Using that (and F_SETPIPE_SZ) could be a future enhancement (assuming it proves useful).
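
A minimal sketch of that future enhancement, under stated assumptions: `pipe_capacity` is a hypothetical helper, `F_GETPIPE_SZ` is Linux-specific and only present in the `fcntl` module on platforms and Python versions that expose it, and any situation where it is missing is treated as "unknown" rather than an error.

```python
import os
import stat

try:
    import fcntl
except ImportError:  # the fcntl module is not available on Windows
    fcntl = None


def pipe_capacity(fd):
    """Return the kernel-reported capacity of pipe `fd`, or None if unknown."""
    getpipe_sz = getattr(fcntl, "F_GETPIPE_SZ", None)
    if getpipe_sz is None:
        return None  # platform or Python build does not expose the command
    if not stat.S_ISFIFO(os.fstat(fd).st_mode):
        return None  # only meaningful for pipes
    return fcntl.fcntl(fd, getpipe_sz)
```

A caller could fall back to the page-size-based default whenever this returns `None`. The cost is the extra `fstat` and `fcntl` system calls discussed earlier in the thread.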

gpshead · 2024-08-31T05:57:31Z

Thanks for taking this on!

methane · 2024-08-31T07:15:26Z

2. I'm also not familiar with pipes on those systems, but it seems that FreeBSD and MacOS have both a default pipe buffer size of 64KiB: https://www.netmeister.org/blog/ipcbufs.html

This PR uses 256KiB, not 64KiB on M1 mac (16K page).
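
The arithmetic behind that observation, as a small sketch: with the 16-page multiplier from the patch, the cap scales with the system page size. `PAGES_PER_PIPE` and `read_cap` are illustrative names only.

```python
PAGES_PER_PIPE = 16  # multiplier taken from the patch in this thread


def read_cap(page_size):
    """Read cap in bytes for a given page size."""
    return PAGES_PER_PIPE * page_size


for page_size in (4096, 16384):
    print(f"{page_size // 1024} KiB pages -> {read_cap(page_size) // 1024} KiB cap")
# prints:
# 4 KiB pages -> 64 KiB cap
# 16 KiB pages -> 256 KiB cap
```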

vstinner · 2024-09-02T08:47:39Z

The Changelog entry was added to C API category, instead of the Library category.

methane · 2024-09-02T10:28:51Z

Nice catch. I will change the category in #123559.

Add clean code

108e65b

bedevere-app bot mentioned this pull request Jul 3, 2024

Limit the reading size from pipes to their default buffer size on Unix systems #121313

Closed

bedevere-app bot added the awaiting review label Jul 3, 2024

Fix linting error & sysconf unsupported error

37ca606

Merge branch 'main' into feature-smaller-pipe-buffer-pull-request

df4f307

blurb-it bot and others added 6 commits July 3, 2024 10:11

📜🤖 Added by blurb_it.

89936c2

Fix news

31c65b4

Make page size non static

0a4e4a3

Merge branch 'main' into feature-smaller-pipe-buffer-pull-request

7afde51

Merge branch 'main' into feature-smaller-pipe-buffer-pull-request

8d9b16e

Remove redundant call to pymin

e6c64fe

Shift pipe check to connection.py _recv

936b601

aplaikner requested a review from gpshead as a code owner July 5, 2024 07:43

aplaikner added 2 commits July 5, 2024 13:04

Make pipe size dependant on systems page size

49d8adb

Only execute fstat in case reading size is bigger than default pipe size

43e19dd

cmaloney reviewed Jul 5, 2024

cmaloney mentioned this pull request Jul 6, 2024

GH-120754: Add a strace helper and test set of syscalls for open().read() #121143

Merged

5 tasks

aplaikner and others added 5 commits July 7, 2024 07:17

Update news message

b0b86e5

Make imports order alphabetical

2b6ff24

Shift calculation for pipe size to existing if _winapi check

e726f51

Fix linting error

59cff4d

Merge branch 'main' into feature-smaller-pipe-buffer-pull-request

da10f8e

TalAmuyal reviewed Jul 7, 2024

Create constant for default number of pages per pipe

94d4c4a

Merge branch 'main' into feature-smaller-pipe-buffer-pull-request

52606d1

cmaloney approved these changes Aug 2, 2024

bedevere-app bot added awaiting core review and removed awaiting review labels Aug 2, 2024

gpshead self-assigned this Aug 31, 2024

gpshead added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 31, 2024

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Aug 31, 2024

gpshead merged commit 74bfb53 into python:main Aug 31, 2024
94 of 102 checks passed

bedevere-app bot removed the awaiting core review label Aug 31, 2024

methane mentioned this pull request Sep 1, 2024

gh-121313: multiprocessing: change connection buffer size to 64KiB #123559

Merged
