Use physical qubits internally within Sabre #10782
Conversation
Force-pushed edc765b to d023759.
Force-pushed d023759 to eaf89a5.
Now rebased over #10783. The test outputs change slightly because the order we filter out duplicate swaps in `obtain_swaps` is not identical; the fuller explanation is in the comment below.
Force-pushed eaf89a5 to b846365.
Now rebased over #10756.
This swaps the whole Sabre algorithm over to using physical qubits rather than virtual qubits. This makes all operations based on finding the swaps and scoring them far more natural, at the cost of the layer structures needing to do a little more book-keeping to rewrite themselves in terms of the new physical qubits after a swap. This also means that the swaps that come out of the Sabre algorithm automatically become physical, which requires less tracking to output them into the final DAG circuit. The test outputs change slightly because the order we filter out duplicate swaps in `obtain_swaps` is not identical. We filter out cases where the left index is greater than the right, and with the assignments of virtual qubits to physical qubits varying, doing the filter with physical indices is not guaranteed to filter in the same order as doing it with virtual indices (though the trialled swaps will be the same).
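As a rough illustration of why the ordering can differ (this is a minimal sketch of the predicate described above, not the actual `obtain_swaps` code):

```rust
// Each candidate swap is kept in only one orientation; which element counts as
// "left" depends on whether we compare virtual or physical indices, so the
// surviving orientation (and hence the iteration order) can differ, even though
// the set of trialled swaps is identical.
fn keep(candidate: [usize; 2]) -> bool {
    candidate[0] <= candidate[1]
}

fn main() {
    assert!(keep([1, 3])); // kept in this orientation...
    assert!(!keep([3, 1])); // ...filtered out in the other
}
```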
Force-pushed b846365 to 4e44644.
Ok, now that #10753 is merged, I think there are no other open PRs on Terra close to merge that conflict with this one.
Overall this LGTM. It makes a lot of sense even if it increases the complexity a bit; the amount of virtual->physical mapping we need to do is significantly decreased with this change, so it's worth it. I left a few small inline comments, but only two of them are real, the others are more idle musings.
```rust
.map(|b| {
    let b_index = b.index();
    if a_index <= b_index {
        dist[[a_index, b_index]]
    } else {
        0.0
    }
})
```
Is there an advantage to doing this vs a `filter_map`?
I didn't think about it - the original code used `map` because it had an iterator that directly ran over nodes, so it didn't need to worry about the double counting. I'll swap it to `filter_map` if you prefer; another possibility is making the outer `map` a `flat_map` and removing the internal `sum`?
I guess it was more an efficiency question: I didn't know if adding a bunch of 0.0s was going to be faster than using `filter_map` or something. But yeah, using a `filter_map`, a `flat_map`, and a single `sum` seems the more natural way to do this with iterators.
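For concreteness, a minimal self-contained sketch of the `flat_map` + `filter_map` shape being discussed; the names `pairs` and `dist` are illustrative stand-ins, not the actual Sabre data structures:

```rust
use ndarray::Array2;

// Sum dist[a, b] over each unordered pair exactly once: the outer `flat_map`
// replaces the nested `map` + inner `sum`, and the `filter_map` keeps only the
// a <= b orientation instead of scoring the other orientation as 0.0.
fn score(pairs: &[(usize, Vec<usize>)], dist: &Array2<f64>) -> f64 {
    pairs
        .iter()
        .flat_map(|(a, others)| {
            others
                .iter()
                .filter_map(move |&b| (*a <= b).then(|| dist[[*a, b]]))
        })
        .sum()
}
```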
Naively, I don't think you can represent `Option<f64>` without a discriminant (assuming you care about propagating the bit-pattern payload of qNaNs?), but given LLVM will have the full context of the ensuing operations, it can probably unpick everything into sensible code. I had a go in https://godbolt.org/z/Y531qhn9r, but I think at that point it'll be branch prediction and memory-access patterns that dictate the speed more than anything else. I'll try locally and just make it a `flat_map` + `filter_map` assuming the difference isn't visible.
If this code turns out to be something we need to micro-optimise, the next thing I'd try would be enforcing the `Vec<Vec<others>>` stuff to always allocate the other qubits in groups of 4 or 8, and then get the compiler to emit always-vectorised code output (using the self-index in the outer vec to represent "missing" slots in the inner vec, to abuse `dist[[a, a]] == 0.0`), and then multiply the final result by 0.5 to remove the double-count effect. But that would 100% be a different PR lol.

edit: that vectorisation wouldn't survive a conversion of the extended set into a sequence of layers, so probably not worth trying at all tbh.
The most important thing for performance here was just to remove the branching; the floating-point additions are probably mostly pipelined at more than a 2x rate compared to the cycle count of the operation (especially if the compiler's able to find some SIMD vectorisation). I also made this a `flat_map` - I didn't see much of a difference from that, but it looks a shade more natural now anyway.
Depending on where we go with the extended set in the future, I might look into rearranging the data structure to allow explicit SIMD vectorisation in all the calculations, but I'll leave that til after we know what we're going to do with a potential layer structure on the extended set.
Done in 0475b3b.
It's possible that a further re-organisation of the data structure to store more, in order to have a direct iterator over each gate exactly once without branching, might help as well, but I want to wait on that til we've looked at what we're going to do about building up the extended set as a delta, because this PR still offers a speed-up over the status quo, and I don't want to get into premature optimisation that might become obsolete as soon as we add the extra requirement that it must be possible to remove single gates from the `ExtendedSet`.
```rust
    qubits: &[VirtualQubit; 2],
    layout: &NLayout,
    swaps: &mut Vec<[PhysicalQubit; 2]>,
    qubits: &[PhysicalQubit; 2],
    coupling_graph: &DiGraph<(), ()>,
) {
    let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
```
Completely unrelated to anything in this PR, and not something we should consider for this PR but instead as a follow-up: reading through the code now, I'm wondering if we should do something like:
```rust
// existing:
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
// suggested:
let source_index = NodeIndex::new(qubits[0].index());
let target_index = NodeIndex::new(qubits[1].index());
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> =
    [(source_index, Vec::with_capacity(distance[[source_index.index(), target_index.index()]] as usize))].into_iter().collect();
```
which definitely won't compile because I'm sure I made a typo or missed some typing, but it's the basic idea.
Actually, I just looked up the dijkstra code in rustworkx. This won't work because of https://github.com/Qiskit/rustworkx/blob/2a2a18383ee877c7bc475ef15030beaea6d03520/rustworkx-core/src/shortest_path/dijkstra.rs#L120-L122; the rest of the code will handle this correctly. I can push a PR to rustworkx to enable this optimization so that it detects if the path list is already allocated, and just clears it and pushes the start node at the beginning. But until that is in rustworkx we can't do this. Not that reducing the allocations here will have a noticeable impact on runtime performance, as this is only triggered in an edge case.
```rust
for i in 0..split {
    swaps.push([shortest_path[i], shortest_path[i + 1]]);
}
for swap in backwards.iter().rev() {
    swaps.push([qubits[1], swap.to_virt(layout)]);
for i in 0..split - 1 {
    let end = shortest_path.len() - 1 - i;
    swaps.push([shortest_path[end], shortest_path[end - 1]]);
```
Just for the record, in case we're looking at this again in the future: this took me a while to trace through in my head before I realized these were equivalent. The tricky bit I missed at first when comparing the old to the new is the difference between the virtual and physical qubits: the old path worked with virtual qubits and the new one uses physical qubits. This is somewhere that having #10761 is important for validating that we're working in the correct domain, because if these were just integers this would have been easy to mess up.
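To make the equivalence easier to see on a future read, here is a tiny self-contained trace of the new loops with made-up values; the choice of `split` as half the path length is an assumption for the illustration, not necessarily the exact expression in the PR:

```rust
fn main() {
    // Physical qubit indices along an assumed shortest path between the gate's qubits.
    let shortest_path = [10usize, 11, 12, 13, 14];
    let split = shortest_path.len() / 2; // = 2
    let mut swaps: Vec<[usize; 2]> = Vec::new();
    // Walk the left endpoint forwards along the path...
    for i in 0..split {
        swaps.push([shortest_path[i], shortest_path[i + 1]]);
    }
    // ...and the right endpoint backwards, until the two states are adjacent.
    for i in 0..split - 1 {
        let end = shortest_path.len() - 1 - i;
        swaps.push([shortest_path[end], shortest_path[end - 1]]);
    }
    // Prints [[10, 11], [11, 12], [14, 13]]: the state that started on physical
    // qubit 10 ends on 12, the one from 14 ends on 13, and 12-13 are adjacent.
    println!("{:?}", swaps);
}
```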
On modern hardware, branching is typically more expensive than a simple floating-point addition that can be pipelined. This removes the branch, and instead removes the duplication from the score by dividing by two at the end.
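A minimal sketch of that shape, under the assumption that each two-qubit gate is recorded once under each of its qubits (so every distance is visited in both orientations) and that the distance matrix is symmetric; the names are illustrative rather than the actual Sabre code:

```rust
use ndarray::Array2;

// Sum every recorded (a, b) distance with no branch in the hot loop, then halve
// the total once at the end to cancel the double-counting.
fn score_branchless(pairs: &[(usize, Vec<usize>)], dist: &Array2<f64>) -> f64 {
    let total: f64 = pairs
        .iter()
        .flat_map(|(a, others)| others.iter().map(move |&b| dist[[*a, b]]))
        .sum();
    0.5 * total
}
```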
This LGTM now, thanks for digging into the performance of that function. Not branching for maximum performance makes sense to me.
Summary
This swaps the whole Sabre algorithm over to using physical qubits rather than virtual qubits. This makes all operations based on finding the swaps and scoring them far more natural, at the cost of the layer structures needing to do a little more book-keeping to rewrite themselves in terms of the new physical qubits after a swap. This also means that the swaps that come out of the Sabre algorithm automatically become physical, which requires less tracking to output them into the final DAG circuit.
Details and comments
I did this for two major reasons:
As in #10761, I've deliberately left the behaviour described by #10756 in place so that that change to the algorithm can come in a separate PR, which in this PR causes a weirdness where I have to continue to pass the layout into a function (`choose_best_swap`) that should no longer need it (edit: since the merge of #10756, this is no longer relevant).

I'm most interested in how this affects the Rust-space runtime. It should (hopefully) be relatively clear from the Python-space diff that this has neutral-to-positive improvements on rebuild performance simply because we need to do less remapping in the output swaps.
I used a setup where a `HeavyHex(11)` has 291 qubits. This is purely testing the Rust-space runtime, not the subsequent Python-space DAG rebuild, which still dominates the actual timing (but I hope to improve that in the near future).

Now timing `fn(heavy_hex_neighbors, heavy_hex_distance)` went from 1.76(3)s on `main` for me to 1.52(2)s with this PR, which is a ~14% improvement.