
Bizarre runtime regression of PyString when iterating over dict of size >= 2^16 #3299

Closed
Hooksie opened this issue Jul 5, 2023 · 7 comments

@Hooksie
Contributor

Hooksie commented Jul 5, 2023

Bug Description

Context: I am converting a function to PyO3 that reads through various types to create cache keys. Since the cache keys are performance critical, I was profiling different input values and noticed that, after certain inputs, the performance for other inputs would change at runtime, and the change persists for the lifetime of the program.

Specifically, the function returns in ~100ns when the input is None, and continues to behave this way when called again. If, however, a dict of size 2^16 is passed in and iterated over, the performance for the original None input suddenly drops by about 1.5 orders of magnitude to ~4.5us. The performance for this input never improves for the duration of the program (and is now slower than a pure Python version for this value).

I was able to narrow this down to a minimal reproducible case and found that the critical detail was whether PyString::new was called. Removing PyString::new made the performance behave consistently as expected. Alternatively, if PyString::new is left in, removing the dict iteration also made the performance behave consistently. Notably, for dicts of size 2^16 - 1 and below, this issue never happens (even when iterating and leaving the string in).

I am not clear how one function call can affect another in PyO3 like this, so it's difficult to debug further.

I see this behavior with both opt-level=0 and opt-level=3.

Steps to Reproduce

Here is the minimal reproduction case:

use pyo3::prelude::*;
use pyo3::types::*;

#[pyfunction]
pub fn perf_test(object: &PyAny) -> () {
    if let Ok(value) = object.downcast::<PyDict>() {
        for _ in value { }
    }

    // Any string seems to be sufficient here, including empty string.
    PyString::new(object.py(), "hello");
}

#[pymodule]
fn rs_perf_test(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(perf_test, m)?)?;
    Ok(())
}

Here is the test case:

  1. Call and measure the function with arg None
  2. Call the function with a dict arg of size 2^16 (or greater)
  3. Call and measure the function with arg None again, noting the degraded performance
  4. Remove the Rust iteration for _ in value { } from the snippet above. Rebuild the lib and repeat steps 1-3, noting that the performance does not degrade
  5. (Alternatively) Remove the PyString::new call instead. Rebuild the lib and repeat steps 1-3, noting that the performance does not degrade

Script below. Note that for dicts of size less than 2^16, the behavior is not triggered.

import timeit
from rs_perf_test import perf_test
from time import perf_counter_ns

n = 10_000

# This returns with fast "baseline" performance, as expected
t0 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"First just None:  {t0/1000:.6f}")

# dict size (2^16 - 1)
x = {str(i): None for i in range(65535)}
perf_test(x)

# This still returns as fast as the baseline performance, as expected
t1 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"Second just None:  {t1/1000:.6f}")

# dict size 2^16. 
y = {str(i): None for i in range(65536)}
perf_test(y)

# The performance has now degraded for the same input value above, and does not recover
t2 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"Third just None: {t2/1000:.6f}")

Backtrace

No response

Your operating system and version

macOS 13.4.1 (Apple M1)

Your Python version (python --version)

3.11.4

Your Rust version (rustc --version)

rustc 1.70.0 (90c541806 2023-05-31)

Your PyO3 version

0.19.1

How did you install python? Did you use a virtualenv?

virtualenv + pyenv
maturin init

Additional Info

No response

Hooksie added the bug label Jul 5, 2023
@adamreichold
Member

I suspect that this is due to the dict items being registered in the thread-local owned object pool here so that our iterator can return bare references.

You should be able to check this by iterating the dictionary with Python code, e.g.

py_run!(py, value, "collections.deque(value, maxlen=0)");

instead of using our iterator.
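
For illustration, a rough, untested sketch of how that suggestion might be wired into the reproduction above (the import inside the snippet is added here because py_run! executes the code with only the passed variables in scope):

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};

#[pyfunction]
pub fn perf_test(object: &PyAny) {
    if let Ok(value) = object.downcast::<PyDict>() {
        // Exhaust the dict on the Python side instead of with PyO3's Rust
        // iterator, so no per-item owned references are registered in the pool.
        pyo3::py_run!(
            object.py(),
            value,
            "import collections; collections.deque(value, maxlen=0)"
        );
    }

    // Any string seems to be sufficient here, including an empty string.
    PyString::new(object.py(), "hello");
}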

@adamreichold
Member

Note that we hope to remove the performance bottleneck and pitfall that this shared resource implies by retiring the pool in the upcoming PyO3 version 0.20.0, cf. #3205 and #3253.

@Hooksie
Contributor Author

Hooksie commented Jul 6, 2023

@adamreichold Thanks for the extra insight. I dug around (very briefly), and I'm still not clear on why the performance falls off a cliff rather than degrading gradually with the pool size.

My bigger issue is why the performance remains degraded, which doesn't seem to make sense even when the pool is reused across calls. Perhaps this would be clearer if I understood what was actually slowing down.

One aside: I tweaked my example a bit and noticed that if I iterate over a dict of size 2^16 - 1 (the previous success case) but then allocate a few more strings with PyString::new, I get the same performance degradation. It seems that what matters is just the size of the GIL pool, not how it's filled up?

@adamreichold
Member

It seems that what matters is just the size of the GIL pool, not how it's filled up?

The pool itself is oblivious to the type of the objects; it is used only to decrement their reference counts, after all. So indeed, PyString::new will register another object in that pool in the same way that iterating a PyDict will.

(The performance cliff you are seeing might be related to the cache hierarchy of the CPU you are running the benchmark on. Maybe it is the OWNED_OBJECTS pool, which stores pointers, reaching 2^16 * 8 B = 512 kB, which might be the L2 cache size of a contemporary CPU?)

@Hooksie
Contributor Author

Hooksie commented Jul 12, 2023

@adamreichold Hrm, yeah, this is interesting. I think you're right about the architecture here -- I have trouble getting such drastic performance results when running this on an x86 (cloud) instance -- although the effect is still there by a small margin. It seems the M1 architecture here is the main factor.

If it's of any interest, I did try this again with a local patch where I run shrink_to_fit in GILPool::drop, which seems to resolve the issue at the cost of some small constant overhead. Not sure if there is an appetite for such a change. It would probably need slightly more finesse than just shrink_to_fit.
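
For context, a minimal standalone illustration (not PyO3 code) of the allocation behavior that the shrink_to_fit tweak targets: once a Vec has grown to hold 2^16 entries, clearing it keeps the backing allocation alive until shrink_to_fit releases it.

fn main() {
    let mut pool: Vec<usize> = Vec::new();
    pool.extend(0..65_536);   // simulate registering 2^16 owned objects
    pool.clear();             // entries dropped, but capacity is retained
    println!("capacity after clear: {}", pool.capacity());
    pool.shrink_to_fit();     // the proposed tweak: give the memory back
    println!("capacity after shrink_to_fit: {}", pool.capacity());
}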

@adamreichold
Member

adamreichold commented Jul 13, 2023

If it's of any interest, I did try this again with a local patch where I run shrink_to_fit in GILPool::drop, which seems to resolve the issue at the cost of some small constant overhead. Not sure if there is an appetite for such a change. It would probably need slightly more finesse than just shrink_to_fit.

I think this only redistributes the cost of dropping the allocations backing the pool. I think it would be best if we invest our efforts in removing this choke point entirely.

@Hooksie
Contributor Author

Hooksie commented Jul 13, 2023

@adamreichold Sounds reasonable, I appreciate the feedback.
