Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue with custom_vec_iter_impl! NodeIndices #1090

Closed
enavarro51 opened this issue Feb 16, 2024 · 6 comments · Fixed by #1107
Closed

Performance issue with custom_vec_iter_impl! NodeIndices #1090

enavarro51 opened this issue Feb 16, 2024 · 6 comments · Fixed by #1107
Assignees

Comments

@enavarro51
Copy link
Contributor

While working on DAGDependencyV2 in Qiskit, I was using digraph.predecessor_indices and digraph.predecessors and noticed predecessors was significantly faster than predecessor_indices. Looking at the digraph code, it appeared predecessor_indices was simpler and therefore should have been faster.

I noticed that predecessors returned a Vec and predecessor_indices returned NodeIndices. I modified the latter function to return Vec<usize> and the tests I was running for DAGDependencyV2 went from 74 sec for NodeIndices to 12 sec for Vec<usize> constructing an equivalent DAGDependencyV2.

Would appreciate any comments on why this is happening and whether this might be a wider problem.

Main code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> NodeIndices {
        NodeIndices {
            nodes: self
                .graph
                .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
                .map(|node| node.index())
                .collect(),
        }
    }

Modified code

    #[pyo3(text_signature = "(self, node, /)")]
    pub fn predecessor_indices(&self, node: usize) -> Vec<usize> {
        self.graph
            .neighbors_directed(NodeIndex::new(node), petgraph::Direction::Incoming)
            .map(|node| node.index())
            .collect()
    }
@enavarro51 enavarro51 self-assigned this Feb 16, 2024
@mtreinish
Copy link
Member

In general there is a tradeoff we make with NodeIndices and the other custom return types. When we have the custom return types in rustworkx this is in an effort to return faster and optimize for workloads where we iterate over the object once vs places we access the elements in the results multiple times.

Basically when you return Vec<usize> what that ends up doing is building a new Python list and then copying and converting each element in from usize -> Python int. Which depending on the size of the list can be a lot of overhead. While NodeIndices avoids that by wrapping the output vec in a custom pyclass so there is no copy/conversion overhead on return. The tradeoff there is that when you access any element of NodeIndices it has to do the usize -> python int conversion. So if you loop over the values multiple times that ends up being slower (and in those cases I'd just do a list(indices) to convert it to a python list in a single iteration.

Now that all being said we added those custom return types are something we added years ago, and PyO3 was much less mature back then. It is entirely possible that the overhead of creating the Python List[int] isn't as bad anymore and the custom return type's overhead on access isn't worth it now with newer versions of pyo3.

@enavarro51
Copy link
Contributor Author

The DAGDependency code calls predecessor_indices about once for each node and then iterates over those indices once. I assume the overhead is in the __getitem__ for NodeIndices.

I have a pretty good test environment set up, so I can do some checks on just a single iteration of the indices for a large number of items.

@IvanIsCoding
Copy link
Collaborator

IvanIsCoding commented Feb 17, 2024

I ran a non-scientific experiment on my machine and I believe the optimizations Matthew introduced still hold true.

Here is the code for getting the times:

import rustworkx as rx
import timeit

N = 1_000_000
g = rx.PyGraph()
g.add_nodes_from(list(range(N)))
nodes = g.node_indexes()

print(
	"Getting the node indices or list from Rust a thousand times: ", timeit.timeit("g.node_indexes()", globals=globals(), number=1_000)
)

print(
	"Accessing a negative index a billion times: ", timeit.timeit("nodes[-10]", globals=globals(), number=1_000_000_000)
)

Attempt 1

I left node_indexes as the original:

#[pyo3(text_signature = "(self)")]
    pub fn node_indexes(&self) -> NodeIndices {
        self.node_indices()
    }

The results were:

Getting the node indices or list from Rust a thousand times:  1.6359028699998817
Accessing a negative index a billion times:  674.1754225720001

Atempt 2

I modified the node_indexes code to be:

#[pyo3(text_signature = "(self)")]
    pub fn node_indexes(&self) -> Vec<usize> {
        self.graph.node_indices().map(|node| node.index()).collect()
    }

The results were:

Getting the node indices or list from Rust a thousand times:  32.542972673999884
Accessing a negative index a billion times:  28.010249427000417

Corollary

Returning a Python list from Rust is slow. So to make a function return faster it makes sense.

However, accessing a specific element of the Rust object is very slow comparing to accessing a specific element of a Python list.

@IvanIsCoding
Copy link
Collaborator

IvanIsCoding commented Feb 17, 2024

I think we should definitely profile the __getitem__ method though:

fn __getitem__(&self, py: Python, idx: SliceOrInt) -> PyResult<PyObject> {

We need to see where it spends time and maybe work with the maintainers of PyO3 to see if we can make it faster. The current overhead is very high

@jakelishman
Copy link
Member

jakelishman commented Feb 20, 2024

It turns out that having

[derive(FromPyObject)]
enum SliceOrInt {
    Slice(PySlice),
    Int(...),
}

means that PyO3 attempts to downcast to Slice first, but like 99.9% of all __getitem__ calls are going to be with an int (especially since the Python implicit-iterator behaviour is being used here).

Flipping the order in the enum for me switched nodes[-10] (from Ivan's code block above) from taking 205ns to taking 61ns with no other changes (for comparison, indexing into the equivalent Python list took 25ns, but PyO3 will have some springboard cost associated with it too, some of which will be improved by PyO3 0.21).

I think another change to potentially consider is adding custom direct Iterator structs into the custom_vec_iter_impl that look like

#[pyclass]
struct NodeIndicesIter {
    base: Py<NodeIndices>,
    index: usize,
}

#[pymethods]
impl NodeIndicesIter {
    fn __next__(&mut self, py: Python) -> PyResult<Py<PyAny>> {
        // ...
    }
    fn __iter__(&self) -> Self { self }

which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.

@mtreinish
Copy link
Member

I think another change to potentially consider is adding custom direct Iterator structs into the custom_vec_iter_impl that look like which avoids any need to convert Python object inputs during the function inputs. I haven't timed that, but I could do if it's something you'd want to consider.

We have support like that with the mapping type macro right now (to get keys, values, and items iterators), so I think it'd be a good idea to add it to the vec ones too. I expect that will be a bit of a speed boost for a lot of cases since we avoid more python/pyo3 overhead by going through __getitem__

My unrelated idea to also speed things up here was maybe to cache the output PyObjects on __getitem__ so that subsequent accesses don't eat the conversion cost. The tradeoff there is we're essentially doubling the memory overhead for these classes because we'd also need to store a Vec<Option<PyObject>> of the same length in each object. So I'm not sure if it's worth it or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants