Performance issue with `custom_vec_iter_impl!` `NodeIndices` #1090

enavarro51 · 2024-02-16T15:27:39Z

While working on DAGDependencyV2 in Qiskit, I was using digraph.predecessor_indices and digraph.predecessors and noticed predecessors was significantly faster than predecessor_indices. Looking at the digraph code, it appeared predecessor_indices was simpler and therefore should have been faster.

I noticed that predecessors returned a Vec and predecessor_indices returned NodeIndices. I modified the latter function to return Vec<usize> and the tests I was running for DAGDependencyV2 went from 74 sec for NodeIndices to 12 sec for Vec<usize> constructing an equivalent DAGDependencyV2.

Would appreciate any comments on why this is happening and whether this might be a wider problem.

Main code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> NodeIndices {
        NodeIndices {
            nodes: self
                .graph
                .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
                .map(|node| node.index())
                .collect(),
        }
    }

Modified code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> Vec<usize> {
        self.graph
            .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
            .map(|node| node.index())
            .collect()
    }

mtreinish · 2024-02-16T16:44:13Z

In general there is a tradeoff we make with NodeIndices and the other custom return types. We use the custom return types in rustworkx to make the function return faster, which optimizes for workloads that iterate over the result once rather than ones that access its elements multiple times.

Basically, when you return Vec<usize>, PyO3 builds a new Python list and copies and converts each element from usize to a Python int, which can be a lot of overhead depending on the size of the list. NodeIndices avoids that by wrapping the output Vec in a custom pyclass, so there is no copy/conversion overhead on return. The tradeoff is that every access to an element of NodeIndices has to do the usize -> Python int conversion. So if you loop over the values multiple times, that ends up being slower (and in those cases I'd just do list(indices) to convert it to a Python list in a single iteration).
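As a rough illustration of that tradeoff without needing rustworkx installed, the sketch below uses the standard library's array.array as a stand-in for the wrapped Rust Vec<usize>. This is an assumption made for illustration only: like NodeIndices, an array stores raw machine integers and creates a Python int on each access, but it is not how rustworkx is implemented.

```python
from array import array

# Stand-in for the Rust-side Vec<usize>: raw unsigned 64-bit values,
# with no Python int objects stored (an analogy, not rustworkx itself).
raw = array("Q", range(5))

# Lazy pattern: a Python int is created each time an element is accessed.
first = raw[0]

# Eager pattern, the analogue of list(indices): pay the conversion once,
# then loop over the resulting list as many times as needed.
materialized = list(raw)
total = sum(materialized) + sum(materialized)
```

If the result is only walked once, the lazy form skips building the list; if it is walked repeatedly, converting once up front avoids paying the conversion on every pass.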

Now, all that being said, we added those custom return types years ago, when PyO3 was much less mature. It is entirely possible that the overhead of creating the Python List[int] isn't as bad anymore and the custom return type's overhead on access isn't worth it with newer versions of PyO3.

enavarro51 · 2024-02-16T18:17:37Z

The DAGDependency code calls predecessor_indices about once for each node and then iterates over those indices once. I assume the overhead is in the __getitem__ for NodeIndices.

I have a pretty good test environment set up, so I can do some checks on just a single iteration of the indices for a large number of items.

IvanIsCoding · 2024-02-17T00:36:29Z

I ran a non-scientific experiment on my machine and I believe the optimizations Matthew introduced still hold true.

Here is the code for getting the times:

import rustworkx as rx
import timeit

N = 1_000_000
g = rx.PyGraph()
g.add_nodes_from(list(range(N)))
nodes = g.node_indexes()

print(
	"Getting the node indices or list from Rust a thousand times: ", timeit.timeit("g.node_indexes()", globals=globals(), number=1_000)
)

print(
	"Accessing a negative index a billion times: ", timeit.timeit("nodes[-10]", globals=globals(), number=1_000_000_000)
)

Attempt 1

I left node_indexes as the original:

#[pyo3(text_signature = "(self)")]
pub fn node_indexes(&self) -> NodeIndices {
    self.node_indices()
}

The results were:

Getting the node indices or list from Rust a thousand times:  1.6359028699998817
Accessing a negative index a billion times:  674.1754225720001

Attempt 2

I modified the node_indexes code to be:

#[pyo3(text_signature = "(self)")]
pub fn node_indexes(&self) -> Vec<usize> {
    self.graph.node_indices().map(|node| node.index()).collect()
}

The results were:

Getting the node indices or list from Rust a thousand times:  32.542972673999884
Accessing a negative index a billion times:  28.010249427000417

Corollary

Returning a Python list from Rust is slow, so using a custom return type to make the function return faster makes sense.

However, accessing a specific element of the Rust object is very slow compared to accessing a specific element of a Python list.
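To put the two sets of timings side by side, here is a back-of-the-envelope calculation using only the numbers posted above. The variable names are made up for this sketch, the figures include timeit's own loop overhead, and they measure `nodes[-10]` indexing rather than iteration, so treat the result as indicative at best.

```python
# Figures copied from the two attempts above (seconds).
N = 1_000_000             # elements per returned container
CALLS = 1_000             # number of g.node_indexes() calls timed
ACCESSES = 1_000_000_000  # number of nodes[-10] accesses timed

return_lazy_s = 1.6359028699998817   # Attempt 1: NodeIndices
return_list_s = 32.542972673999884   # Attempt 2: Vec<usize> -> list
access_lazy_s = 674.1754225720001    # Attempt 1: NodeIndices
access_list_s = 28.010249427000417   # Attempt 2: Python list

# Extra cost of building the list, spread over each element (ns).
extra_build_ns_per_element = (return_list_s - return_lazy_s) / CALLS / N * 1e9

# Extra cost of one access on the custom type (ns).
extra_access_ns = (access_lazy_s - access_list_s) / ACCESSES * 1e9

# Fraction of elements that can be accessed before the list comes out ahead.
break_even_fraction = extra_build_ns_per_element / extra_access_ns
```

With these numbers the list costs about 31 ns more per element to build, while each access on the custom type costs about 646 ns more, so the custom type only comes out ahead when under roughly 5% of the elements are ever indexed.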

IvanIsCoding · 2024-02-17T00:38:38Z

I think we should definitely profile the __getitem__ method though:

rustworkx/src/iterators.rs

Line 540 in 52a4d05

fn __getitem__(&self, py: Python, idx: SliceOrInt) -> PyResult<PyObject> {

We need to see where it spends time and maybe work with the maintainers of PyO3 to see if we can make it faster. The current overhead is very high.

jakelishman · 2024-02-20T18:37:51Z

It turns out that having

#[derive(FromPyObject)]
enum SliceOrInt {
    Slice(PySlice),
    Int(...),
}

means that PyO3 attempts to downcast to Slice first, but like 99.9% of all __getitem__ calls are going to be with an int (especially since the Python implicit-iterator behaviour is being used here).

Flipping the order in the enum for me switched nodes[-10] (from Ivan's code block above) from taking 205ns to taking 61ns with no other changes (for comparison, indexing into the equivalent Python list took 25ns, but PyO3 will have some springboard cost associated with it too, some of which will be improved by PyO3 0.21).
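The effect of variant order can be pictured with a small pure-Python model. This is only an analogy: the `extract` helper and the two tuples are hypothetical names, and an isinstance check is much cheaper than a failed extraction attempt in PyO3 is likely to be, so it shows the number of attempts, not the real cost.

```python
def extract(idx, variants):
    """Try each (name, type) variant in order, like a derived enum extraction.

    Returns the matching variant name and how many attempts were needed.
    """
    for attempts, (name, typ) in enumerate(variants, start=1):
        if isinstance(idx, typ):
            return name, attempts
    raise TypeError(f"unsupported index type: {type(idx).__name__}")


# Hypothetical stand-ins for the two possible orderings of the enum.
SLICE_FIRST = (("Slice", slice), ("Int", int))
INT_FIRST = (("Int", int), ("Slice", slice))

slice_first_result = extract(-10, SLICE_FIRST)  # int matches on the 2nd try
int_first_result = extract(-10, INT_FIRST)      # int matches on the 1st try
```

Since almost every call passes an int, putting the Int variant first means the common case never pays for a failed Slice attempt.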

I think another change to potentially consider is adding custom direct Iterator structs into the custom_vec_iter_impl that look like

#[pyclass]
struct NodeIndicesIter {
    base: Py<NodeIndices>,
    index: usize,
}

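#[pymethods]
impl NodeIndicesIter {
    fn __next__(&mut self, py: Python) -> PyResult<Py<PyAny>> {
        // ...
    }
    // Hand back the same Python object; `&self` cannot be returned as `Self`.
    fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
        slf
    }
}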
which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.
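For readers less familiar with the implicit-iterator behaviour mentioned earlier, the following pure-Python sketch mirrors the idea. NodeIndicesIter follows the struct above, while GetItemOnly and WithIter are hypothetical names for this illustration; none of this is the rustworkx implementation.

```python
class GetItemOnly:
    """Sequence with no __iter__: Python falls back to calling __getitem__."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.getitem_calls = 0

    def __len__(self):
        return len(self.nodes)

    def __getitem__(self, idx):
        self.getitem_calls += 1
        return self.nodes[idx]  # IndexError past the end stops iteration


class NodeIndicesIter:
    """Dedicated iterator: keeps a reference to its base and a position."""

    def __init__(self, base):
        self.base = base
        self.index = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.index >= len(self.base.nodes):
            raise StopIteration
        value = self.base.nodes[self.index]
        self.index += 1
        return value


class WithIter(GetItemOnly):
    """Same container, but iteration goes through the dedicated iterator."""

    def __iter__(self):
        return NodeIndicesIter(self)


legacy = GetItemOnly([3, 1, 4])
legacy_items = list(legacy)   # one __getitem__ call per element, plus one that fails
direct = WithIter([3, 1, 4])
direct_items = list(direct)   # __getitem__ is never called
```

In the Rust version, the point is that `__next__` takes no index argument at all, so nothing has to be extracted from a Python object on each step.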

mtreinish · 2024-02-20T19:17:18Z

> I think another change to potentially consider is adding custom direct Iterator structs into the custom_vec_iter_impl that look like [...] which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.

We have support like that with the mapping type macro right now (to get keys, values, and items iterators), so I think it'd be a good idea to add it to the vec ones too. I expect that will be a bit of a speed boost for a lot of cases, since we avoid more Python/PyO3 overhead by not going through __getitem__.

My unrelated idea to also speed things up here was maybe to cache the output PyObjects on __getitem__ so that subsequent accesses don't eat the conversion cost. The tradeoff there is we're essentially doubling the memory overhead for these classes because we'd also need to store a Vec<Option<PyObject>> of the same length in each object. So I'm not sure if it's worth it or not.
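A minimal sketch of that caching idea in pure Python, assuming array.array as a stand-in for the Rust Vec<usize> and a list of None placeholders in place of Vec<Option<PyObject>>. The CachedIndices name and the conversions counter are hypothetical and exist only for this illustration.

```python
from array import array


class CachedIndices:
    """Convert each element on first access, then reuse the stored object."""

    def __init__(self, values):
        self.raw = array("Q", values)         # raw values, no Python ints
        self.cache = [None] * len(self.raw)   # second container, same length
        self.conversions = 0                  # counts real conversions

    def __len__(self):
        return len(self.raw)

    def __getitem__(self, idx):
        cached = self.cache[idx]              # raises IndexError if out of range
        if cached is None:
            cached = self.raw[idx]            # creates the Python int
            self.cache[idx] = cached
            self.conversions += 1
        return cached


indices = CachedIndices(range(1000, 1005))
first_access = indices[2]
second_access = indices[2]  # served from the cache, no new conversion
```

The second container is where the doubled memory overhead comes from: it has one slot per element whether or not that element is ever accessed.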

enavarro51 self-assigned this Feb 16, 2024

jakelishman mentioned this issue Feb 20, 2024

Improve performance of custom-iterator __getitem__ #1096

Merged

jakelishman mentioned this issue Feb 21, 2024

Add manual iterator implementations for custom vec iterables #1107

Merged

mtreinish closed this as completed in #1107 Feb 21, 2024

IvanIsCoding mentioned this issue Oct 9, 2024

Allow some graph and digraph methods to take iterables/generators #1292

Merged
