
Bizarre runtime regression of PyString when iterating over dict of size >= 2^16 #3299

Closed
Hooksie opened this issue Jul 5, 2023 · 7 comments

@Hooksie
Contributor

Hooksie commented Jul 5, 2023

Bug Description

Context: I am converting a function to PyO3 that reads through various types to create cache keys. Since the cache keys are performance critical, I was profiling different input values and noticed that, after certain inputs, the performance for other inputs would change at runtime, and the change persists for the lifetime of the program.

Specifically, the function returns in ~100ns when the input is None, and continues to behave this way when called again. If, however, a dict of size 2^16 is passed in and iterated over, the performance for the original None input suddenly drops by about 1.5 orders of magnitude to ~4.5us. The performance for this input never improves for the duration of the program (and is now slower than a pure Python version for this value).

I was able to narrow this down to a minimal reproducible case and found that the critical detail was whether PyString::new was called. Removing PyString::new made the performance behave consistently as expected. Alternatively, if PyString::new is left in, removing the dict iteration also made the performance behave consistently. Notably, for dicts of size 2^16 - 1 and below, this issue never happens (even when iterating and leaving the string in).

I am not clear how one function call can affect another in PyO3 like this, so it's difficult to debug further.

I see this behavior with both opt-level=0 and opt-level=3.

Steps to Reproduce

Here is the minimal reproduction case:

use pyo3::prelude::*;
use pyo3::types::*;

#[pyfunction]
pub fn perf_test(object: &PyAny) -> () {
    if let Ok(value) = object.downcast::<PyDict>() {
        for _ in value { }
    }

    // Any string seems to be sufficient here, including empty string.
    PyString::new(object.py(), "hello");
}

#[pymodule]
fn rs_perf_test(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(perf_test, m)?)?;
    Ok(())
}

Here is the test case:

  1. Call and measure the function with arg None
  2. Call the function with a dict arg of size 2^16 (or greater)
  3. Call and measure the function with arg None again, noting the degraded performance
  4. Remove the Rust iteration for _ in value { } from the snippet above. Rebuild the lib and repeat steps 1-3, noting that the performance does not degrade
  5. (Alternatively) Remove the PyString::new call instead. Rebuild the lib and repeat steps 1-3, noting that the performance does not degrade

Script below. Note that for dicts of size less than 2^16, the behavior is not triggered.

import timeit
from rs_perf_test import perf_test
from time import perf_counter_ns

n = 10_000

# This returns with fast "baseline" performance, as expected
t0 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"First just None:  {t0/1000:.6f}")

# dict size (2^16 - 1)
x = {str(i): None for i in range(65535)}
perf_test(x)

# This still returns as fast as the baseline performance, as expected
t1 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"Second just None:  {t1/1000:.6f}")

# dict size 2^16. 
y = {str(i): None for i in range(65536)}
perf_test(y)

# The performance has now degraded for the same input value above, and does not recover
t2 = timeit.timeit("perf_test(None)", globals=globals(), number=n, timer=perf_counter_ns)/n
print(f"Third just None: {t2/1000:.6f}")

Backtrace

No response

Your operating system and version

macOS 13.4.1 (Apple M1)

Your Python version (python --version)

3.11.4

Your Rust version (rustc --version)

rustc 1.70.0 (90c541806 2023-05-31)

Your PyO3 version

0.19.1

How did you install python? Did you use a virtualenv?

virtualenv + pyenv
maturin init

Additional Info

No response

Hooksie added the bug label Jul 5, 2023
@adamreichold
Member

I suspect that this is due to the dict items being registered in the thread-local owned object pool here so that our iterator can return bare references.

You should be able to check this by iterating the dictionary with Python code, e.g.

py_run!(py, value, "collections.deque(value, maxlen=0)");

instead of using our iterator.
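
For illustration, a rough, untested sketch of how that suggestion might be wired into the reproduction above (the import inside the snippet is added here because py_run! executes the code with only the passed variables in scope):

use pyo3::prelude::*;
use pyo3::types::{PyDict, PyString};

#[pyfunction]
pub fn perf_test(object: &PyAny) {
    if let Ok(value) = object.downcast::<PyDict>() {
        // Exhaust the dict on the Python side instead of with PyO3's Rust
        // iterator, so no per-item owned references are registered in the pool.
        pyo3::py_run!(
            object.py(),
            value,
            "import collections; collections.deque(value, maxlen=0)"
        );
    }

    // Any string seems to be sufficient here, including an empty string.
    PyString::new(object.py(), "hello");
}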

@adamreichold
Member

Note that we hope to remove the performance bottleneck and pitfall that this shared resource implies by retiring the pool in the upcoming PyO3 version 0.20.0, cf. #3205 and #3253.

@Hooksie
Contributor Author

Hooksie commented Jul 6, 2023

@adamreichold Thanks for the extra insight. I dug around (very briefly), and I'm still not clear on why the performance falls off a cliff rather than degrading gradually with the pool size.

My bigger issue is why the performance remains degraded, which doesn't seem to make sense even when the pool is reused across calls. Perhaps this would be clearer if I understood what was actually slowing down.

One aside: I tweaked my example a bit and noticed that if I iterate over a dict of size 2^16 - 1 (the previous success case) but then allocate a few more strings with PyString::new, I get the same performance degradation. It seems that what matters is just the size of the GIL pool, not how it's filled up?

@adamreichold
Member

It seems that what matters is just the size of the GIL pool, not how it's filled up?

The pool itself is oblivious to the type of the objects; it is used only to decrement their reference counts, after all. So indeed, PyString::new will register another object in that pool in the same way that iterating a PyDict will.

(The performance cliff you are seeing might be related to the cache hierarchy of the CPU you are running the benchmark on. Maybe it is the OWNED_OBJECTS pool, which stores pointers, reaching 2^16 * 8 B = 512 kB, which might be the L2 cache size of a contemporary CPU?)

@Hooksie
Contributor Author

Hooksie commented Jul 12, 2023

@adamreichold Hrm, yeah, this is interesting. I think you're right about the architecture here -- I have trouble getting such drastic performance results when running this on an x86 (cloud) instance -- although the effect is still there by a small margin. It seems the M1 architecture here is the main factor.

If it's of any interest, I did try this again with a local patch where I run shrink_to_fit in GILPool::drop, which seems to resolve the issue at the cost of some small constant overhead. Not sure if there is an appetite for such a change. It would probably need slightly more finesse than just shrink_to_fit.
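
For context, a minimal standalone illustration (not PyO3 code) of the allocation behavior that the shrink_to_fit tweak targets: once a Vec has grown to hold 2^16 entries, clearing it keeps the backing allocation alive until shrink_to_fit releases it.

fn main() {
    let mut pool: Vec<usize> = Vec::new();
    pool.extend(0..65_536);   // simulate registering 2^16 owned objects
    pool.clear();             // entries dropped, but capacity is retained
    println!("capacity after clear: {}", pool.capacity());
    pool.shrink_to_fit();     // the proposed tweak: give the memory back
    println!("capacity after shrink_to_fit: {}", pool.capacity());
}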

@adamreichold
Member

adamreichold commented Jul 13, 2023

If it's of any interest, I did try this again with a local patch where I run shrink_to_fit in GILPool::drop, which seems to resolve the issue at the cost of some small constant overhead. Not sure if there is an appetite for such a change. It would probably need slightly more finesse than just shrink_to_fit.

I think this only redistributes the cost of dropping the allocations backing the pool. I think it would be best if we invest our efforts in removing this choke point entirely.

@Hooksie
Contributor Author

Hooksie commented Jul 13, 2023

@adamreichold Sounds reasonable, I appreciate the feedback.
