Performance: calling overhead #3827
Comments
Thanks! I'll try to break this down further in the morning 👍
Thanks! My actual use case was struct/class methods, where I think there's the same overhead, but I don't know how to implement them without …
We still construct a GIL Pool and poke and prod at it, right? It's just that it's always empty, in principle?
Note that we should be able to get there in 0.22, i.e. the … (You should be able to see the effects already by rebasing that patch and adjusting the feature name. It is just nothing we can ship yet, because it is still possible to reach the … .) Also note that we do still have the global reference count pool, i.e. handling calls to … .
I just took a quick look at this. The timings on my machine come out very similarly to @samuelcolvin's with the original code, so I'll assume my new measurement is comparable. I applied the same patch as in the bottom section of #3787 (comment) and reran the analysis. I also made sure LTO was enabled so that inlining is as aggressive as possible, making the generated code more similar to the "baremetal" snippet.

With that done, I measure ~17.5ns for the "slow" code above, with no changes on the user's end. So we can make a significant dent in the difference with the GIL Pool removed and the global Py reference counting replaced under nogil.

Regardless, ~17.5ns does imply a little extra framework overhead over the "baremetal" version, but that still cuts the overall function execution by about 50% from 33.5ns, so we've already made a huge impact. It's worth remembering that there are a couple of extra pieces of work which PyO3 does that are somewhat fundamental: …
Once the main bulk of the overheads are gone and we're into this 17.5ns regime, it would be interesting to see whether we can optimize further, but I'd be surprised if there was much more to be won. I also wonder whether we could already move the global Py reference counting to a dedicated thread which wakes and applies the pending count updates at intervals, rather than doing this on every function call. In a Python with the GIL, that will affect single-threaded throughput a little, but once nogil is present the thread shouldn't affect throughput (and, as @adamreichold says, it might not be necessary at all).
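As a rough illustration of the dedicated-thread idea, here is a Python sketch (this is not PyO3's actual implementation, and `DeferredRefcountPool` is a hypothetical name): reference-count deltas are queued on the hot path and a worker thread applies them in batches at intervals.

```python
import threading
from collections import Counter

class DeferredRefcountPool:
    """Illustrative sketch only: refcount deltas are queued instead of
    applied immediately, and a dedicated thread wakes at intervals to
    apply them in batches, so the hot call path only increments a
    Counter under a lock."""

    def __init__(self, interval=0.005):
        self._pending = Counter()   # object id -> queued delta
        self.applied = Counter()    # object id -> applied count
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval = interval
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def incref(self, obj_id):
        with self._lock:
            self._pending[obj_id] += 1

    def decref(self, obj_id):
        with self._lock:
            self._pending[obj_id] -= 1

    def _flush(self):
        with self._lock:
            deltas, self._pending = self._pending, Counter()
        for obj_id, delta in deltas.items():
            self.applied[obj_id] += delta

    def _run(self):
        # wake at intervals; a missed wakeup delays updates, never drops them
        while not self._stop.wait(self._interval):
            self._flush()

    def close(self):
        self._stop.set()
        self._worker.join()
        self._flush()  # final flush so shutdown loses nothing

pool = DeferredRefcountPool()
pool.incref("obj"); pool.incref("obj"); pool.decref("obj")
pool.close()
print(pool.applied["obj"])  # 1
```

The catch, as the next comment notes, is that deferring the updates is only sound if nothing observes the counts before a flush happens.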
IIRC, we had soundness issues when those updates were missed before resuming GIL-dependent Rust code, e.g. 83f5fa2
Right, yes. That would make it very difficult to do anything other than what we already do, at least until nogil comes along 👍
With #3837 resolved, and given that the remaining overheads (panic handling, the internal GIL count, keyword arguments) are somewhat built into the framework, I'm going to close this as resolved for now too. I think one day we may find further optimizations (and I want to see #3843 landed for …).
Continued from conversation on long-dead #1607.
I'm still seeing a significant overhead when calling … vs. a more "baremetal" implementation.
Timings:
I see this 20-40ns overhead in calling PyO3 functions in many scenarios.
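The benchmark code from the issue isn't reproduced above, but as a minimal stdlib-only sketch, this is roughly how a per-call overhead number of this kind can be obtained, subtracting an empty-loop baseline; len(x) simply stands in for a C-implemented callable:

```python
# Sketch of a per-call overhead measurement; numbers will vary by machine.
import timeit

def per_call_ns(stmt, setup="pass", number=100_000):
    # best-of-5 to damp scheduler noise; result in nanoseconds per call
    return min(timeit.repeat(stmt, setup, number=number, repeat=5)) / number * 1e9

setup = "x = list(range(100))"
loop_only = per_call_ns("pass", setup)    # empty-loop baseline
builtin = per_call_ns("len(x)", setup)    # C-level builtin call
print(f"approx call overhead: {builtin - loop_only:.1f} ns")
```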
Also, my "baremetal" code is still using _PyCFunctionFastWithKeywords, not _PyCFunctionFast, as there don't seem to be methods available to call that. I assume there might be further performance improvements available over my fast_len method if _PyCFunctionFast could be used?

Code
I'm installing pyo3 with … so I can use the new Bound API.
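The exact dependency line wasn't preserved above; purely as a hedged sketch, pulling a pre-release PyO3 from git in Cargo.toml could look something like the following (the branch/revision is a placeholder, not taken from the issue):

```toml
[dependencies]
# Placeholder: pin whichever revision or branch carries the Bound API
pyo3 = { git = "https://github.com/PyO3/pyo3", branch = "main" }
```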