gh-118926: Deferred reference counting GC changes for free threading #121318
Conversation
This reverts commit 6b63066.
!buildbot nogil
🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 8cb139f 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
!buildbot nogil
🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit 700c2fd 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
!buildbot nogil
🤖 New build scheduled with the buildbot fleet by @Fidget-Spinner for commit cde931d 🤖 The command will test the builders whose names match following regular expression: The builders matched are:
One possibility to keep things moving for the free-threading build is to generate a different
OK, I can port your old PR over to this one. I presume we're flushing the values to the stack before every escaping call?
@markshannon, writing all values to the stack is going to introduce an unnecessary performance regression. This is holding up the free-threaded work for what seems like small aesthetic complaints. If the concern is about maintainer confusion, we can add comments or otherwise document the dozen or so places where this is used.
This is not just aesthetics. You say that "writing all values to the stack will be slow". Only for those parts of the stack that are in memory.
There are a hundred or more places where execution can escape from the interpreter. Even if it is correct, it still lays traps for the unwary.
If this were changed to
then it would be unsafe, yet there would be no warning or error from any tools.
The rewritten
We should not be thinking about this in terms of where calls escape from the interpreter. We should be thinking about this in terms of where
If you want to refactor
Sorry, I forgot that this may introduce another perf regression on the free-threaded build, even if I limit it just to that. So I'm putting a pause on this. I think we should discuss this on Wednesday.
In general, stackrefs must be spilled to the in-memory stack around any escaping call.
Identifying all escaping calls. We have a whitelist; all other calls are escaping. The list needs updating, but it's mostly right.
Track assignments to output values. Assignments are easy to spot. It is easy enough to change the code to move assignments out of branches, and we don't want the code generator to have to do flow analysis.
Generating spills around calls. Once we have identified the escaping call, we will need to walk backwards and forwards around the call to identify the whole statement.
In case that sounds too expensive, don't worry. It shouldn't be. Spilling the stack pointer is cheap, as registers often need to be spilled across calls anyway. Although escaping calls are common in
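For illustration, a minimal sketch of what a generated spill around an escaping call could look like, assuming the existing frame helpers; SomeEscapingCall, value, arg_o and res_o are made-up names, not actual cases_generator output:

stack_pointer[-1] = value;                       /* spill the cached stackref to its slot */
_PyFrame_SetStackPointer(frame, stack_pointer);  /* make the in-memory stack visible */
PyObject *res_o = SomeEscapingCall(arg_o);       /* escaping: may run the GC or re-enter the VM */
stack_pointer = _PyFrame_GetStackPointer(frame); /* reload the stack pointer after the call */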
@markshannon, that is not what we discussed and agreed to.
Tracking just
I think what I proposed handles all the use cases, without a significant performance impact.
Why would anything get written multiple times?
I think the original goal was to not block TOS caching and full deferred refcounting. With the latest changes, this is now true. The added goal of laying the foundations for the two should be left to another PR. The responsibility of this PR, IMO, is to not burden stack caching and full deferred refcounting (which it should have achieved with the latest commit). Supporting more features is out of scope.
@markshannon - I'm most concerned because I thought we were on the same page on this approach after our last meeting. I don't think starting with this approach precludes future changes, like what you've outlined above. This PR is a bottleneck for a lot of the free-threading deferred reference counting work, so I'd appreciate it if we could figure out how to unblock it.
Having the spilling controlled by the code generator is a major improvement.
Thanks for doing that.
I have a few comments.
@@ -1428,6 +1436,8 @@
PyObject **items = _PyTuple_ITEMS(seq_o);
for (int i = oparg; --i >= 0; ) {
    *values++ = PyStackRef_FromPyObjectNew(items[i]);
#ifdef Py_GIL_DISABLED /* flush specials */
Don't emit anything if there are no values to flush.
emit_to(out, tkn_iter, "SEMI")
out.emit(";\n")
target = uop.body.index(tkn)
assert uop.body[target-1].kind == "EQUALS", (f"{uop.name} Result of a specials that is flushed" |
This is going to crash as soon as anyone writes anything but an assignment in front of PyStackRef_FromPyObjectNew.
We might want to insist that the result of PyStackRef_FromPyObjectNew is assigned to a variable, maybe only an output variable, but that should be an error, not a crash.
I've updated the error handling. For the most part, PyStackRef_FromPyObjectNew is now required to be assigned to an output variable, but we also allow assignment to a dereferenced pointer due to usages like:
frame->localsplus[offset + i] = PyStackRef_FromPyObjectNew(o); (in COPY_FREE_VARS)
*values++ = PyStackRef_FromPyObjectNew(items[i]); (in UNPACK_SEQUENCE_LIST)
#ifdef Py_GIL_DISABLED /* flush specials */
stack_pointer[-2 - oparg] = method;
#endif /* flush specials */
stack_pointer[-2 - oparg] = method;
Why is the same value being saved twice?
I updated the generator to avoid saving the same value twice. I got rid of the #ifdef Py_GIL_DISABLED guards because I think they would add too much complexity for no real gain.
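As an illustration (not the literal generated output), the deduplicated code stores the flushed value just once, shared by both builds:

stack_pointer[-2 - oparg] = method;  /* single store; no Py_GIL_DISABLED-only duplicate */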
@@ -177,6 +177,35 @@ def replace_check_eval_breaker(
    if not uop.properties.ends_with_eval_breaker:
        out.emit_at("CHECK_EVAL_BREAKER();", tkn)


def replace_pystackref_frompyobjectnew(
I don't think that locally replacing PyStackRef_FromPyObjectNew is sufficient in general. It ensures that the references are in memory, but it doesn't ensure that the stack_pointer is saved, so the references might not be visible.
IIRC this is safe for the current FT implementation because you NULL out the unused part of the stack.
Could you add a comment explaining why this is safe for FT even though it doesn't appear to be.
I added a comment.
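For illustration, a minimal sketch of why NULLing the unused part of the stack keeps the scan safe; visit_object() and stack_limit are hypothetical stand-ins for the free-threaded GC internals:

/* The collector walks every slot in the frame's stack area, so slots above
   the live values must hold NULL stackrefs rather than stale pointers. */
for (_PyStackRef *slot = frame->localsplus; slot < stack_limit; slot++) {
    if (!PyStackRef_IsNull(*slot)) {
        visit_object(PyStackRef_AsPyObjectBorrow(*slot));   /* hypothetical visitor */
    }
}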
target = uop.body.index(tkn)
assert uop.body[target-1].kind == "EQUALS", (f"{uop.name} Result of a specials that is flushed"
                                             " must be written to a stack variable directly")
# Scan to look for the first thing it's assigned to |
FYI, this sort of analysis should ideally be performed in the analysis phase, not in the code generation phase.
Don't worry about it for now.
I've moved this to analyzer.py since it required a number of changes anyway.
found_valid_assignment = assgn_target.text
break

if found_valid_assignment: |
What happens if we haven't found a valid assignment?
That sounds like an error.
This now raises an analysis_error.
@@ -213,6 +213,24 @@ def flush(self, out: CWriter, cast_type: str = "uintptr_t", extract_bits: bool =
        self.peek_offset.clear()
        out.start_line()

    def write_variable_to_stack(self, out: CWriter, var_name: str, cast_type: str = "uintptr_t", extract_bits: bool = False) -> None:
It seems odd to be writing individual values to the stack. If any value on the stack needs to be in memory, don't they all?
Would it make more sense to use the flush method?
I'm looking at this since Ken is on vacation.
I think that the logic in this PR (after resolving merge conflicts with main) no longer works after #122286.
Before #122286, the tier1 generator had the outputs on the logical stack when emit_tokens() is called. Now, the inputs are still on the logical stack when emit_tokens() is called, and the outputs are only pushed after emit_tokens().
I think flush() has the same issue: the logical stack still contains the inputs, not the outputs.
@markshannon, I've split off the changes to the interpreter generator into a separate PR: I think that will make it easier to see the changes to the generated code. (And the merge conflicts were getting unwieldy.)
This PR mainly introduces GC changes to the free-threading GC to support deferred reference counting in the future.
To get this to work, new stack references must immediately live on the stack, with no interfering Py_DECREF or other escaping code between creating them and putting them on the stack. This ensures they are visible to the GC.
This also NULLs out the rest of the stack, because the GC scans the entire stack. The stack pointer may be inconsistent between escaping calls (including those to Py_DECREF), so we need to scan the whole stack.
This PR removes the temporary immortalization introduced previously in #117783. I wanted to do this in a separate PR, but the only way to test this properly is to remove that hack, so it has to be bundled into this PR.
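As a sketch of the invariant described above (illustrative, not the literal generated code; obj and old_o are made-up names):

_PyStackRef ref = PyStackRef_FromPyObjectNew(obj);  /* create the new reference */
stack_pointer[-1] = ref;                            /* immediately visible to the GC */
Py_DECREF(old_o);                                   /* any escaping call comes only after the store */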
Finally, this PR fixes a few bugs in steals and borrows. These were only caught by the GC changes, not by the debugger that I was working on. Since they are untestable without the GC changes, I bundled them in.
Temporary perf regressions (6 threads):