-
-
Notifications
You must be signed in to change notification settings - Fork 30.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inline dict/list/set comprehensions in the compiler for better performance #97933
Comments
In Cinder 3.8 the win here was both from not doing a (slow) allocation of a new function object and not doing a (slow) function call. In 3.11+ function calls are now a lot faster, which may take away some of the value here. Also if we improve compatibility via |
The language definition is rather vague about the semantics. It mentions scopes and comprehensions, but doesn't seem to define either. So let's define what would be needed for full compatibility:
With that in mind, I think pushing a new frame is required. |
@markshannon A couple further clarifications about compatibility requirements here.
I'm assuming both of these are acceptable, since they don't violate the constraints you listed above, just making sure we're on the same page here. There would be a possible implementation that avoids even these changes, where we keep the comprehension's bytecode in its own code object and effectively do things just as they are done today, except we replace |
|
OOI, how do you plan to fully inline the comprehension code and have the comprehension appear in tracebacks? |
Personally, I think that the comprehension not appearing in the frame stack is fine, and may actually be an improvement, but will need wider discussion. |
One other question. Do you have a link to the Cinder code that does this? |
I had been thinking I would still push a new frame even with inlined bytecode, by means of wrapping the inlined bytecode in new If we could get wider agreement that it's ok to change the behavior of Most recent version of Cinder implementation is in facebookincubator/cinder@ba2d0ac |
Ok, I've worked on a couple possible implementations of this, and pushed one of them in #101310. The other approach (full inlining) is still WIP; ideally I'd like to get that in a working state and compare the performance (particularly in cases where the comprehension has a closure and moves some outer locals into cells) and complexity in order to make a call between them. |
I'd push for the full inlining approach. The resulting code is so much shorter, and presumably faster. >>> def f(v, s):
... return [v + i for i in s]
...
>>> dis.dis(f)
MAKE_CELL 0 (v)
RESUME 0
LOAD_CLOSURE 0 (v)
BUILD_TUPLE 1
LOAD_CONST 1 (<code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>)
MAKE_FUNCTION 8 (closure)
LOAD_FAST 1 (s)
GET_ITER
PRECALL 0
CALL 0
RETURN_VALUE
Disassembly of <code object <listcomp> at 0x7fad2a129450, file "<stdin>", line 2>:
COPY_FREE_VARS 1
RESUME 0
BUILD_LIST 0
LOAD_FAST 0 (.0)
FOR_ITER 7 (to 24)
STORE_FAST 1 (i)
LOAD_DEREF 2 (v)
LOAD_FAST 1 (i)
BINARY_OP 0 (+)
LIST_APPEND 2
JUMP_BACKWARD 8 (to 8)
END_FOR
RETURN_VALUE With full inlining we would get: >>> def f(v, s):
... return [v + i for i in s]
...
>>> dis.dis(f)
RESUME 0
BUILD_LIST 0
LOAD_FAST 1 (s)
FOR_ITER 7 (to 24)
STORE_FAST 2 (i)
LOAD_FAST 0 (v)
LOAD_FAST 2 (i)
BINARY_OP 0 (+)
LIST_APPEND 2
JUMP_BACKWARD 8 (to 8)
END_FOR
RETURN_VALUE Full inlining also enables handling "ad-hoc comprehensions" where a generator expression is wrapped in a class call or filter. |
@markshannon it seems that we can't inline cases like this, where the comprehension itself has cells:
Because unless the comprehension has a real frame of its own, Such code is probably a bug to begin with, since the last value in the comprehension will be used for all the generated closures, but I'm still not sure we can get away with changing this? And I don't see a good way to inline and preserve the original behavior. Do you see anything better here than just refusing to inline in this case? Hmm, I could try "push previous cell to the stack, make new cell, run comprehension, pop previous cell from stack" for this case... |
Exactly this 🙂 Whether the inner def f_escape():
x0 = ...
items = [lambda: x1 for x1 in range(3)]
x0 = -1
return [item() for item in items] (Both x0 and x1 are called "x" and occupy the same local slot) def f_normal():
x0 = ...
items = [x1 for x1 in range(3)]
x0 = -1
return items We compile def f_normal():
x0 = ...
PUSH(x0) # Save x0
del x1 # Initialize x1
items = [x1 for x1 in range(3)]
x0 = POP()
... For an escaping def f_escapes():
x0 = ...
PUSH(x0) # Save x0
del x1 # Initialize x1
x1 = MAKE_CELL(x1) # Initialize x1
items = [lambda: x1 for x1 in range(3)]
x0 = POP()
... |
#101441 implements full comprehension inlining as discussed above. |
Co-authored-by: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Co-authored-by: Erlend E. Aasland <erlend.aasland@protonmail.com>
* main: pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304) pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441) pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096) pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217) pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312) pythongh-104240: return code unit metadata from codegen (python#104300)
* main: (156 commits) pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304) pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441) pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096) pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217) pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312) pythongh-104240: return code unit metadata from codegen (python#104300) pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277) pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298) pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320) require-pr-label.yml: Add missing "permissions:" (python#104309) pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939) pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181) pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188) pythonGH-104308: socket.getnameinfo should release the GIL (python#104307) pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311) pythongh-99113: A Per-Interpreter GIL! (pythongh-104210) pythonGH-104284: Fix documentation gettext build (python#104296) pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251) pythongh-104223: Fix issues with inheriting from buffer classes (python#104227) pythongh-99108: fix typo in Modules/Setup (python#104293) ...
* main: (35 commits) pythongh-97696 Add documentation for get_coro() behavior with eager tasks (python#104304) pythongh-97933: (PEP 709) inline list/dict/set comprehensions (python#101441) pythongh-99889: Fix directory traversal security flaw in uu.decode() (python#104096) pythongh-104184: fix building --with-pydebug --enable-pystats (python#104217) pythongh-104139: Add itms-services to uses_netloc urllib.parse. (python#104312) pythongh-104240: return code unit metadata from codegen (python#104300) pythongh-104276: Make `_struct.unpack_iterator` type use type flag instead of custom constructor (python#104277) pythongh-97696: Move around and update the whatsnew entry for asyncio eager task factory (python#104298) pythongh-103193: Fix refleaks in `test_inspect` and `test_typing` (python#104320) require-pr-label.yml: Add missing "permissions:" (python#104309) pythongh-90656: Add platform triplets for 64-bit LoongArch (LA64) (python#30939) pythongh-104180: Read SOCKS proxies from macOS System Configuration (python#104181) pythongh-97696 Remove unnecessary check for eager_start kwarg (python#104188) pythonGH-104308: socket.getnameinfo should release the GIL (python#104307) pythongh-104310: Add importlib.util.allowing_all_extensions() (pythongh-104311) pythongh-99113: A Per-Interpreter GIL! (pythongh-104210) pythonGH-104284: Fix documentation gettext build (python#104296) pythongh-89550: Buffer GzipFile.write to reduce execution time by ~15% (python#101251) pythongh-104223: Fix issues with inheriting from buffer classes (python#104227) pythongh-99108: fix typo in Modules/Setup (python#104293) ...
…ection (pythonGH-105126) (cherry picked from commit 5701799) Co-authored-by: Carl Meyer <carl@oddbird.net>
Feature or enhancement
In Cinder we inline some list/dict/set comprehensions in the compiler for better performance. That is, instead of creating a function object every time and calling it, for some comprehensions we just emit bytecode directly in the outer function to implement the comprehension loop.
Pitch
This change was a significant CPU efficiency improvement on the Instagram production web tier. Allocating a new Python function object is a significant and not strictly necessary cost.
There are some backward-compatibility considerations. In the Cinder implementation, we refuse to inline comprehensions if there is a name collision between a name assigned within the comprehension and a name in the containing scope, and we delete any names defined in the comprehension immediately after the inlined bytecode, to preserve the same visibility of names defined in the comprehension. There can still be some visible changes, e.g. if
locals()
is called within the comprehension it will show names from the containing function too. If a traceback occurs within the comprehension (orsys._getframe
is called), it will not show an additional last frame for the comprehension itself. In practice we've not observed either of these to be a problem.I discussed this idea with @markshannon, and he suggested a couple possible improvements. Instead of refusing to inline comprehensions with name clashes, we could push the prior value of the colliding name(s) onto the stack before running the comprehension, and then pop it back into the name afterward. This would allow inlining ~all comprehensions. He also suggested that we could add
PUSH_FRAME
andPOP_FRAME
opcodes and wrap the inlined bytecode in these. Then we would still elide the cost of function object creation, but the comprehension bytecode (although part of the parent bytecode) would still run inside its own frame, eliminating the above-mentioned incompatibilities.Linked PRs
The text was updated successfully, but these errors were encountered: