Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-84436: Implement Immortal Objects #19474

Merged
merged 201 commits into from
Apr 22, 2023

Conversation

eduardo-elizondo
Copy link
Contributor

@eduardo-elizondo eduardo-elizondo commented Apr 11, 2020

This is the implementation of PEP683

Motivation

The PR introduces the ability to immortalize instances in CPython which bypasses reference counting. Tagging objects as immortal allows up to skip certain operations when we know that the object will be around for the entire execution of the runtime.

Note that this by itself will bring a performance regression to the runtime due to the extra reference count checks. However, this brings the ability of having truly immutable objects that are useful in other contexts such as immutable data sharing between sub-interpreters.

https://bugs.python.org/issue40255

@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Apr 11, 2020

This is ready to review, the CI is finally green. Really no idea why the newly added GC tests are failing on Windows and unfortunately I don't have a Windows machine to debug this.

@eduardo-elizondo eduardo-elizondo changed the title [WIP] bpo-40255: Implement Immortal Instances bpo-40255: Implement Immortal Instances Apr 11, 2020
@eduardo-elizondo
Copy link
Contributor Author

Looping in @carljm and @DinoV who have pointed out some of the issues with immortal instances in the permanent generation participating in a GC collection (i.e dicts). Let me know if you have some other thoughts or ideas on this!

@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Apr 11, 2020

Also looping in @vstinner. Finally got around upstreaming this patch since you recently wrote about this on your C-API Improvement Docs

Include/object.h Outdated Show resolved Hide resolved
Include/object.h Outdated Show resolved Hide resolved
Modules/gcmodule.c Outdated Show resolved Hide resolved
Modules/gcmodule.c Outdated Show resolved Hide resolved
Include/object.h Outdated Show resolved Hide resolved
Lib/test/test_gc.py Outdated Show resolved Hide resolved
@nascheme
Copy link
Member

My first reaction is that this shouldn't become part of the default build because most Python users will not make use of it and then it becomes pure extra overhead. However, I know for some people that it is a useful feature (e.g. pre-fork server architecture that exploits copy-on-write OS memory management). I would use it myself since I write web applications with that style.

Would it be okay to make this a compile time option, disabled by default? I think in general it is a bad idea to have too many of those types of build options. It makes code maintenance and testing more difficult. Some example build variations from the past that caused issues: thread/no-threads, Unicode width, various debug options (@vstinner removed some of those). So, I'm not super excited about introducing a new build option.

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow? A couple years ago I did a "tagged pointer" experiment that used a similar bit. In that case, small integers became one machine word in size and also become immortal.

Another thought: when you did your testing, were any objects made immortal? I would imagine that, by default, you could make everything immortal after initial interpreter startup. You are paying for an extra test+branch in INCREF and DECREF but for many objects (e.g. None, True, False, types) you avoid dirtying the memory/cache with writes to the reference count.

@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Apr 14, 2020

@nascheme you should definitely join the conversation happening in the bug report of this PR https://bugs.python.org/issue40255

However, I know for some people that it is a useful feature

Exactly, this change might be a feature for CPython power users

Would it be okay to make this a compile time option, disabled by default?

Yeah, that's probably the best option. That's also the consensus in the bug report thread (if the change is approved)

I think in general it is a bad idea to have too many of those types of build options.

Yeah that's one of the drawbacks. That being said, I can help with setting up the travis build to integrate this change if needed (cc @vstinner).

Is it possible we can leverage this extra status bit on objects to recover the lost performance somehow?

We can indeed, I think somebody also mentioned that in the bug report. A potentially good place could be main.c:pymain_main right after pymain_main. Let me explore that and push that change if it looks like performance a improvement!

In theory we could optimize even further to reduce the perf cost. By leveraging saturated adds and conditional moves we could remove the branching instruction. I haven't explored this further since the current PR was good enough. Personally, I favor the current PR, but this could be changed to:

/* Branch-less incref saturated at PY_SSIZE_T_MAX */
#define _Py_INC_REF(op) ({
    __asm__ (
        "addq $0x1, %[refcnt]"
        "cmovoq  %[refcnt_max], %[refcnt]"
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt)
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)
    );})

/* Branch-less decref saturated at PY_SSIZE_T_MAX */
#define _Py_DEC_REF(op) ({
    Py_ssize_t tmp = ((PyObject *)op)->ob_refcnt;
    __asm__ (
        "subq $0x1, %[refcnt]"
        "addq $0x1, %[tmp]"
        "cmovoq  %[refcnt_max], %[refcnt]"
        : [refcnt] "+r" (((PyObject *)op)->ob_refcnt), [tmp] "+r" (tmp)
        : [refcnt_max] "r" (PY_SSIZE_T_MAX)
    );})

@pablogsal
Copy link
Member

pablogsal commented Apr 14, 2020

Yeah that's one of the drawbacks. That being said, I can help with setting up the travis build to integrate this change if needed (cc @vstinner).

Not only that, we would need specialized buildbots to test the code base with this option activated in a bunch of supported platforms and that raises the maintainance costs.

Copy link
Member

@vstinner vstinner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature sounds controversial, so I block it until a consensus can be reached.

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@eduardo-elizondo eduardo-elizondo requested a review from 1st1 as a code owner April 15, 2020 19:19
@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Apr 25, 2023

Three years in the making and it's finally there! Huge shoutouts to @ericsnowcurrently for working together with me throughout the years to get this all the way through, you rock!!

Also, thanks @markshannon @gvanrossum and @pablogsal for being a sounding board for ideas, reviews, and coaching on the messaging of the PR / PEP!

@ydshieh
Copy link

ydshieh commented Apr 26, 2023

Congrats @eduardo-elizondo! 🔥

I followed this PR from the beginning, even needed to port the changes (in 2020) into python 3.9 due to the memory issue with multiprocessing for my previous company.

At some point, I felt this was not going to be in Python due to the inactivity here (especially the review).
It's incredible and amazing that you continued to finalize this work, and finally got it merged 🎉 !

@eendebakpt
Copy link
Contributor

@ericsnowcurrently @eduardo-elizondo One of the acceptance conditions for pep 683 was to update the pep with final benchmark results. It appears that has not been done. Are the numbers available somewhere? I want to determine whether the performance regressions reported in #109049 are due to this PR (or perhaps other PRs as well).

In discourse: pep-683-immortal-objects-using-a-fixed-refcount-round-4-last-call there is some discussion on whether to add the performance numbers or not. After that message the pep has been updated, but not the performance numbers.

@ericsnowcurrently
Copy link
Member

@eduardo-elizondo, do you have time to update the PEP with final benchmark results for ea2c001?

@arigo
Copy link
Contributor

arigo commented Nov 1, 2023

Note: https://speed.python.org/ shows an important performance regression in maybe 1/3rd of all benchmarks, dated Apr 22, which is the date this PR was merged. The most significant I've seen so far (which, if reproduced locally, could help figure out the root cause) is unpack_sequence.

@gvanrossum
Copy link
Member

@ericsnowcurrently @mdboom ^^

@eduardo-elizondo
Copy link
Contributor Author

eduardo-elizondo commented Nov 3, 2023

@ericsnowcurrently we actually already have the benchmark numbers here: #19474 (comment) which I ran right before merging and there's only test/lint fixes on top of that. This shows roughly a ~1.02x geometric mean regression (~1.03x on MSVC). Let me know if this is what we are looking for!

To be clear - there are will be both slower and faster benchmarks. However, we should focus on the geometric mean (rather than a single benchmark) which is our best proxy when benchmarks move in both ways. Separately, performance measurements can come up with wildly different results on different environments. For these experiments I used gcc-11.1 and MSVC v14.33 on 'lab-like' bare-metal machines which resulted in consistently reproducible results.

@ericsnowcurrently
Copy link
Member

python/peps#3519

@arigo
Copy link
Contributor

arigo commented Nov 3, 2023

There is a discrepancy between the text in PEP 683 and the actual implementation, sections Accidental Immortality and Accidental De-Immortalizing, on 64-bit machines. If it is already mentioned somewhere, sorry about that; but it might be useful to mention that inside the PEP itself. The problem is that, unless I'm missing something, the PEP says that 64-bit machines don't have any accidental immortality problems in practice because a 64-bit or close-to-64-bit refcount never overflows; but the implementation instead starts to consider objects immortal as soon as their refcount reaches 31 bits, a much more reachable value on 64-bit machines. For example, a 2**31-entries numpy object array can be initialized with copies of the same object---this only takes 16 GB of RAM.

The issues listed in Accidental De-Immortalizing are particularly problematic in this situation. I have not tested it, but looking at the source code it seems that this kind of code would crash CPython on 64-bit machines:

  • make a non-immortal object 'a'.
  • store at least 2**31 copies of 'a' inside a data structure maintained by an older stable-ABI extension module.
  • store 2**31 more copies of 'a' inside a list (e.g. by calling the list() function over that data structure; at that point the total memory usage is 32-33 GB of RAM, a very reachable amount). The refcount reaches 2**32-1 and there is at least one incref beyond that that is dropped.
  • deallocate the data structure; as this is done in the older extension module, this decrefs 'a' to 2**31-1 and the object looses its immortal status.
  • deallocate the list. This will cause the refcount to be decremented 2**31 times from its previous value of 2**31-1, and crash.

Again, this is not a real bug report, because it seems that this kind of issue was considered and is supposedly passed as an acceptable trade-off for using stable-ABI modules compiled with older versions of CPython. This is more a missing documentation issue.

@arigo
Copy link
Contributor

arigo commented Nov 6, 2023

Note: a potential fix for this issue would be to change Py_INCREF on 64-bit platforms: instead of checking if (uint)refcount == 2**32-1, it could check if (Py_ssize_t)refcount < 0. This would remove any risk of the crash described above because no INCREF would ever skip the refcount increment before bit 63 is set. Only the more acceptable very rare leak might occur, once bit 31 is set.

The value to initialize immortal objects with would be something like 0xC0000000C0000000, which is in the middle of both 32-bit and 64-bit ranges. It takes 0x40000000 increfs or decrefs by an old stable-ABI module to accidentally make the CPython core change the refcount again (but with no risk of accidentally freeing the object because the refcount is huge).

@ericsnowcurrently
Copy link
Member

CC @eduardo-elizondo

@arigo
Copy link
Contributor

arigo commented Nov 8, 2023

...or, change _Py_IsImmortal to (Py_ssize_t)refcount<0 on 64-bit, and then just use _Py_IsImmortal in both INCREF and DECREF on both 32- and 64-bit platforms, and be done with it? This should Always Just Work(tm) if we reasonably assume that it's completely impossible to repeat a loop 2**62 times, and initialize the immortal refcount to 2**63+2**62.

(Here I'm working with the implicit never-documented assumption that people first tried to use the 32-bit code directly on 64-bit, with the immortal value 0x3fffffffffffffff, but found that it has some performance impact on Intel. That would be because a constant value that doesn't fit 32 bits does indeed have a cost. That's why I'm suggesting here to use (Py_ssize_t)refcount < 0, which is simpler and might be even cheaper than the current 32- and 64-bit mix of arithmetic on the same refcount.)

@ericsnowcurrently
Copy link
Member

@arigo, thanks for the feedback, both about speed.python.org and about accidental de-immortalization. @eduardo-elizondo has the insight we need (which I don't) in both cases, so I'll defer to what he has to say. @markshannon may be have some thoughts as well, at least about the refcount corner case.

@eduardo-elizondo
Copy link
Contributor Author

@ariago Thanks for the reply, just went through it in detail. First of all, pretty much all of what you are saying in the first message is correct, though there are some additional details that would help complement what you wrote above. Let me try to reply to your message by breaking it down into what I believe are the main questions:

but the implementation instead starts to consider objects immortal as soon as their refcount reaches 31 bits, a much more reachable value on 64-bit machines

Overall, yes, the PEP talks about a very high value which is what we wanted to reflect there and left the value to be an implementation detail. Over here, we ended up using a saturated 32-bits because it provides a wide number of benefits:

  • Performance: By using the lower 32-bits we help the compiler generate better code by directly manipulating the 32-bit registers (from the 64-bit refcount register). This gave a considerable perf improvement (around 1-2% geometric mean).
  • Pointer Tagging: By leaving the upper 32 bits free, we could potentially se them for tagging the pointers to improve performance in future iterations.
  • NoGIL refcount: The reference implementation of PEP703 also exploits this by using the unused bits to keep track of the biased reference counts.
    Thus, this is why we decided to stick with 32-bit saturation for now which we could always revise in later versions if needed!

I have not tested it, but looking at the source code it seems that this kind of code would crash CPython on 64-bit machines

No need to test it, it will play out exactly as you mentioned! For this one, Eric and I talked about this exact scenario and even surfaced this during a language summit to the core-devs. However, we believe that it’s a very contrived example as it would require a large amount of asymmetric decrefs. In reality, what happens with very large objects such as these is that they either live throughout the entire execution of the application or there’s a combination of symmetric increfs and decrefs that prevent this from happening.

I did end up testing this in a very large machine with a large application where 16GB is relatively small with hundreds of older stable-ABI modules and didn’t see this issue materialize. Of course, this is just a single application but I was indeed on the lookout to make sure that this precise scenario was not prevalent. The good thing for this one though is that as we go into newer python versions and we keep updating our C-Extensions the risk here will become less and less prevalent.

a potential fix for this issue would be to change Py_INCREF on 64-bit platforms: instead of checking if (uint)refcount == 2**32-1, it could check if (Py_ssize_t)refcount < 0

At some point I did indeed try a solution similar to this, however, this ended up being in conflict with some of the refcount manipulations that the GC does on the two most significant bits and would ignore the fact that an object is immortal causing incorrect behavior. Not only that but also, for the reasons mentioned above, we wanted to keep the entire refcount arithmetic with just 32-bits.

A considered alternative (but never implemented) solution would be to make the GC immortal object aware and then go for the 64-bit solution, but this would required a bit more work on the gcmodule not to mention the added complexity in the module. There might be a simple solution there that I never figured out! However, this would still imply that we would all the bits that we now freed for future use cases. Given the acceptance of 703 we might as well keep refcounts as 32-bits.

...or, change _Py_IsImmortal to (Py_ssize_t)refcount<0 on 64-bit, and then just use _Py_IsImmortal in both INCREF and DECREF on both 32- and 64-bit platforms, and be done with it?

Using all 64-bits causes the issue that I pointed above with the GC. But we could indeed use this check for the lower 32 bits (which we already do in decref). This was actually the original implementation and as you mentioned, this is what we tried first. The reason we did a more specialized check in incref is just due to improved perf.

@eduardo-elizondo
Copy link
Contributor Author

@ariago Overall though these are great questions and I'll be happy to update and revise the wording of the PEP if you think it'll be useful! Let me know what you think! 🙂

@cfbolz
Copy link
Contributor

cfbolz commented Jan 1, 2024

pinging @arigo because there was a typo in the account name in the most recent messages

@arigo
Copy link
Contributor

arigo commented Jan 1, 2024

Adding a known, rare, hard-to-debug crash in CPython would be worth 1-2% of performance? I personally disagree with that but I haven't been a CPython contributor for many years now, so I have nothing to add.

#define PyObject_HEAD_INIT(type) \
{ \
_PyObject_EXTRA_INIT \
{ 1 }, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect. A statically allocated object is immortal, regardless of whether it is in the core or not.

Copy link
Contributor Author

@eduardo-elizondo eduardo-elizondo Apr 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right! I even thought about making this the default behavior.

However, there could be cases in extension code with asserts/tests on exact refcount values. This would break those builds/tests. You can even see it in this PR where I had to change _testembed.c to modify the Py_REFCNT(str1) == 1 check to _Py_IsImmortal(str1)

There's an argument on whether or not having any exact refcounts checks are correct, but I didn't want to challenge it at the time. Thus, to avoid any problems with extension code and reduce the surface area of impact, I localized it to just affect the CPython build.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.