Use physical qubits internally within Sabre #10782
Conversation
Force-pushed edc765b to d023759.
Force-pushed d023759 to eaf89a5.
Now rebased over #10783. The test outputs change slightly because the order we filter out duplicate swaps in `obtain_swaps` is not identical; the fuller explanation is in the comment below.
Force-pushed eaf89a5 to b846365.
Now rebased over #10756.
This swaps the whole Sabre algorithm over to using physical qubits rather than virtual qubits. This makes all operations based on finding the swaps and scoring them far more natural, at the cost of the layer structures needing to do a little more book-keeping to rewrite themselves in terms of the new physical qubits after a swap. This also means that the swaps that come out of the Sabre algorithm automatically become physical, which requires less tracking to output them into the final DAG circuit. The test outputs change slightly because the order we filter out duplicate swaps in `obtain_swaps` is not identical. We filter out cases where the left index is greater than the right, and with the assignments of virtual qubits to physical qubits varying, doing the filter with physical indices is not guaranteed to filter in the same order as doing it with virtual indices (though the trialled swaps will be the same).
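As a rough illustration of why the ordering can differ (this is a minimal sketch of the predicate described above, not the actual `obtain_swaps` code):

```rust
// Each candidate swap is kept in only one orientation; which element counts as
// "left" depends on whether we compare virtual or physical indices, so the
// surviving orientation (and hence the iteration order) can differ, even though
// the set of trialled swaps is identical.
fn keep(candidate: [usize; 2]) -> bool {
    candidate[0] <= candidate[1]
}

fn main() {
    assert!(keep([1, 3])); // kept in this orientation...
    assert!(!keep([3, 1])); // ...filtered out in the other
}
```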
Force-pushed b846365 to 4e44644.
Ok, now that #10753 is merged, I think there are no other open PRs on Terra close to merge that conflict with this one.
Overall this LGTM. It makes a lot of sense even if it increases the complexity a bit; the amount of virtual->physical mapping we need to do is significantly decreased with this change, so it's worth it. I left a few small inline comments, but only two of them are real, the others are more idle musings.
```rust
.map(|b| {
    let b_index = b.index();
    if a_index <= b_index {
        dist[[a_index, b_index]]
    } else {
        0.0
    }
})
```
Is there an advantage to doing this vs a `filter_map`?
I didn't think about it - the original code used `map` because it had an iterator that directly ran over nodes, so it didn't need to worry about the double counting. I'll swap it to `filter_map` if you prefer; another possibility is making the outer `map` a `flat_map` and removing the internal `sum`?
I guess it was more an efficiency question: I didn't know if adding a bunch of 0.0s was going to be faster than using `filter_map` or something. But yeah, using a `filter_map`, a `flat_map`, and a single `sum` seems the more natural way to do this with iterators.
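For concreteness, a minimal self-contained sketch of the `flat_map` + `filter_map` shape being discussed; the names `pairs` and `dist` are illustrative stand-ins, not the actual Sabre data structures:

```rust
use ndarray::Array2;

// Sum dist[a, b] over each unordered pair exactly once: the outer `flat_map`
// replaces the nested `map` + inner `sum`, and the `filter_map` keeps only the
// a <= b orientation instead of scoring the other orientation as 0.0.
fn score(pairs: &[(usize, Vec<usize>)], dist: &Array2<f64>) -> f64 {
    pairs
        .iter()
        .flat_map(|(a, others)| {
            others
                .iter()
                .filter_map(move |&b| (*a <= b).then(|| dist[[*a, b]]))
        })
        .sum()
}
```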
Naively, I don't think you can represent `Option<f64>` without a discriminant (assuming you care about propagating the bit-pattern payload of qNaNs?), but given LLVM will have the full context of the ensuing operations, it can probably unpick everything into sensible code. I had a go in https://godbolt.org/z/Y531qhn9r, but I think at that point it'll be branch prediction and memory-access patterns that dictate the speed more than anything else. I'll try locally and just make it a `flat_map` + `filter_map` assuming the difference isn't visible.
If this code turns out to be something we need to micro-optimise, the next thing I'd try would be enforcing the `Vec<Vec<others>>` stuff to always allocate the other qubits in groups of 4 or 8, and then get the compiler to emit always-vectorised code output (using the self-index in the outer vec to represent "missing" slots in the inner vec, to abuse `dist[[a, a]] == 0.0`), and then multiply the final result by 0.5 to remove the double-count effect. But that would 100% be a different PR lol.

edit: that vectorisation wouldn't survive a conversion of the extended set into a sequence of layers, so probably not worth trying at all tbh.
The most important thing for performance here was just to remove the branching; the floating-point additions are probably mostly pipelined at more than a 2x rate compared to the cycle count of the operation (especially if the compiler's able to find some SIMD vectorisation). I also made this a `flat_map` - I didn't see much of a difference from that, but it looks a shade more natural now anyway.
Depending on where we go with the extended set in the future, I might look into rearranging the data structure to allow explicit SIMD vectorisation in all the calculations, but I'll leave that til after we know what we're going to do with a potential layer structure on the extended set.
Done in 0475b3b.
It's possible that a further re-organisation of the data structure to store more, in order to have a direct iterator over each gate exactly once without branching, might help as well, but I want to wait on that til we've looked at what we're going to do about building up the extended set as a delta, because this PR still offers a speed-up over the status quo, and I don't want to get into premature optimisation that might become obsolete as soon as we add the extra requirement that it must be possible to remove single gates from the `ExtendedSet`.
```rust
    qubits: &[VirtualQubit; 2],
    layout: &NLayout,
    swaps: &mut Vec<[PhysicalQubit; 2]>,
    qubits: &[PhysicalQubit; 2],
    coupling_graph: &DiGraph<(), ()>,
) {
    let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
```
Completely unrelated to anything in this PR, and not something we should consider for this PR but instead as a follow-up: reading through the code now, I'm wondering if we should do something like:
```rust
// existing:
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
// suggested:
let source_index = NodeIndex::new(qubits[0].index());
let target_index = NodeIndex::new(qubits[1].index());
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> =
    [(source_index, Vec::with_capacity(distance[[source_index.index(), target_index.index()]] as usize))].into_iter().collect();
```
which definitely won't compile because I'm sure I made a typo or missed some typing, but it's the basic idea.
Actually, I just looked up the dijkstra code in rustworkx. This won't work because of https://github.com/Qiskit/rustworkx/blob/2a2a18383ee877c7bc475ef15030beaea6d03520/rustworkx-core/src/shortest_path/dijkstra.rs#L120-L122; the rest of the code will handle this correctly. I can push a PR to rustworkx to enable this optimization so that it detects if the path list is already allocated, and just clears it and pushes the start node at the beginning. But until that is in rustworkx we can't do this. Not that reducing the allocations here will have a noticeable impact on runtime performance, as this is only triggered in an edge case.
```rust
for i in 0..split {
    swaps.push([shortest_path[i], shortest_path[i + 1]]);
}
for swap in backwards.iter().rev() {
    swaps.push([qubits[1], swap.to_virt(layout)]);
for i in 0..split - 1 {
    let end = shortest_path.len() - 1 - i;
    swaps.push([shortest_path[end], shortest_path[end - 1]]);
```
Just for the record, in case we're looking at this again in the future: this took me a while to trace through in my head before I realized these were equivalent. The tricky bit I missed at first when comparing the old to the new is the difference between the virtual and physical qubits: the old path worked with virtual qubits and the new one uses physical qubits. This is somewhere that having #10761 is important for validating that we're working in the correct domain, because if these were just integers this would have been easy to mess up.
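To make the equivalence easier to see on a future read, here is a tiny self-contained trace of the new loops with made-up values; the choice of `split` as half the path length is an assumption for the illustration, not necessarily the exact expression in the PR:

```rust
fn main() {
    // Physical qubit indices along an assumed shortest path between the gate's qubits.
    let shortest_path = [10usize, 11, 12, 13, 14];
    let split = shortest_path.len() / 2; // = 2
    let mut swaps: Vec<[usize; 2]> = Vec::new();
    // Walk the left endpoint forwards along the path...
    for i in 0..split {
        swaps.push([shortest_path[i], shortest_path[i + 1]]);
    }
    // ...and the right endpoint backwards, until the two states are adjacent.
    for i in 0..split - 1 {
        let end = shortest_path.len() - 1 - i;
        swaps.push([shortest_path[end], shortest_path[end - 1]]);
    }
    // Prints [[10, 11], [11, 12], [14, 13]]: the state that started on physical
    // qubit 10 ends on 12, the one from 14 ends on 13, and 12-13 are adjacent.
    println!("{:?}", swaps);
}
```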
On modern hardware, branching is typically more expensive than a simple floating-point addition that can be pipelined. This removes the branch, and instead removes the duplication from the score by dividing by two at the end.
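A minimal sketch of that shape, under the assumption that each two-qubit gate is recorded once under each of its qubits (so every distance is visited in both orientations) and that the distance matrix is symmetric; the names are illustrative rather than the actual Sabre code:

```rust
use ndarray::Array2;

// Sum every recorded (a, b) distance with no branch in the hot loop, then halve
// the total once at the end to cancel the double-counting.
fn score_branchless(pairs: &[(usize, Vec<usize>)], dist: &Array2<f64>) -> f64 {
    let total: f64 = pairs
        .iter()
        .flat_map(|(a, others)| others.iter().map(move |&b| dist[[*a, b]]))
        .sum();
    0.5 * total
}
```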
This LGTM now, thanks for digging into the performance of that function. Not branching for maximum performance makes sense to me.
Summary
This swaps the whole Sabre algorithm over to using physical qubits rather than virtual qubits. This makes all operations based on finding the swaps and scoring them far more natural, at the cost of the layer structures needing to do a little more book-keeping to rewrite themselves in terms of the new physical qubits after a swap. This also means that the swaps that come out of the Sabre algorithm automatically become physical, which requires less tracking to output them into the final DAG circuit.
Details and comments
I did this for two major reasons:
As in #10761, I've deliberately left the behaviour described by #10756 in place so that that change to the algorithm can come in a separate PR, which in this PR causes a weirdness where I have to continue to pass the layout into a function (`choose_best_swap`) that should no longer need it (edit: since the merge of #10756, this is no longer relevant).

I'm most interested in how this affects the Rust-space runtime. It should (hopefully) be relatively clear from the Python-space diff that this has neutral-to-positive improvements on rebuild performance simply because we need to do less remapping in the output swaps.
I used a setup where a `HeavyHex(11)` has 291 qubits. This is purely testing the Rust-space runtime, not the subsequent Python-space DAG rebuild, which still dominates the actual timing (but I hope to improve that in the near future).

Now timing `fn(heavy_hex_neighbors, heavy_hex_distance)` went from 1.76(3)s on `main` for me to 1.52(2)s with this PR, which is a ~14% improvement.