Use physical qubits internally within Sabre #10782

Merged: 4 commits into Qiskit:main, Sep 20, 2023

Conversation

@jakelishman jakelishman commented Sep 6, 2023

Summary

This swaps the whole Sabre algorithm over to using physical qubits rather than virtual qubits. This makes all operations based on finding the swaps and scoring them far more natural, at the cost of the layer structures needing to do a little more book-keeping to rewrite themselves in terms of the new physical qubits after a swap. This also means that the swaps that come out of the Sabre algorithm automatically become physical, which requires less tracking to output them into the final DAG circuit.
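As a toy illustration of the book-keeping involved (simplified stand-ins for the crate's NLayout / VirtualQubit / PhysicalQubit types, not the actual implementation): the layout tracks both directions of the virtual<->physical mapping, and a swap chosen directly on physical qubits only needs those two table entries rewritten.

    // Toy sketch only: simplified stand-ins for NLayout / VirtualQubit / PhysicalQubit.
    struct Layout {
        virt_to_phys: Vec<usize>, // virt_to_phys[v] = physical qubit currently holding virtual qubit v
        phys_to_virt: Vec<usize>, // phys_to_virt[p] = virtual qubit currently on physical qubit p
    }

    impl Layout {
        fn trivial(num_qubits: usize) -> Self {
            Layout {
                virt_to_phys: (0..num_qubits).collect(),
                phys_to_virt: (0..num_qubits).collect(),
            }
        }

        // A swap chosen in physical space needs no translation against the coupling
        // map; only the two book-keeping entries have to be rewritten.
        fn swap_physical(&mut self, p0: usize, p1: usize) {
            self.phys_to_virt.swap(p0, p1);
            self.virt_to_phys[self.phys_to_virt[p0]] = p0;
            self.virt_to_phys[self.phys_to_virt[p1]] = p1;
        }
    }

    fn main() {
        let mut layout = Layout::trivial(4);
        layout.swap_physical(1, 2);
        // Virtual qubit 1 now sits on physical qubit 2, and vice versa.
        assert_eq!(layout.virt_to_phys, vec![0, 2, 1, 3]);
        assert_eq!(layout.phys_to_virt, vec![0, 2, 1, 3]);
    }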

Details and comments

I did this for two major reasons:

  • performance
  • making it easier to rebuild the Python-space DAG from Rust space (to come later)

As in #10761, I've deliberately left the behaviour described by #10756 in place so that that change to the algorithm can come in a separate PR, which in this PR causes a weirdness where I have to continue to pass the layout into a function (choose_best_swap) that should no longer need it. edit: since the merge of #10756, this is no longer relevant.

I'm most interested in how this affects the Rust-space runtime. It should (hopefully) be relatively clear from the Python-space diff that this has neutral-to-positive improvements on rebuild performance simply because we need to do less remapping in the output swaps.

I used a setup of:

from qiskit.circuit.library import QuantumVolume
from qiskit.converters import circuit_to_dag
from qiskit.transpiler import CouplingMap
from qiskit.transpiler.passes.routing import sabre_swap
import rustworkx

# 291-qubit heavy-hex coupling map, plus the Rust-side neighbour table and distance matrix.
heavy_hex = CouplingMap.from_heavy_hex(11)
heavy_hex_neighbors = sabre_swap.NeighborTable(rustworkx.adjacency_matrix(heavy_hex.graph))
heavy_hex_distance = heavy_hex.distance_matrix

# Quantum-volume circuit over every device qubit, converted to the Sabre DAG form.
dag = circuit_to_dag(QuantumVolume(heavy_hex.size(), seed=0).decompose(), copy_operations=False)
sabre_dag, _ = sabre_swap._build_sabre_dag(dag, heavy_hex.size(), {bit: i for i, bit in enumerate(dag.qubits)})
initial_layout = sabre_swap.NLayout.generate_trivial_layout(heavy_hex.size())

# Calls straight into the Rust routing core; the Python-space DAG rebuild is not timed here.
def fn(neighbors, dist):
    return sabre_swap.build_swap_map(
        len(dag.qubits),
        sabre_dag,
        neighbors,
        dist,
        sabre_swap.Heuristic.Decay,
        initial_layout,
        1,
        0,
    )

where a HeavyHex(11) has 291 qubits. This is purely testing the Rust-space runtime, not the subsequent Python-space DAG rebuild, which still dominates the actual timing (but I hope to improve that in the near future).

Now timing fn(heavy_hex_neighbors, heavy_hex_distance) went from 1.76(3)s on main for me to 1.52(2)s with this PR, which is a ~14% improvement.

@jakelishman jakelishman added the performance, Changelog: None, Rust, and mod: transpiler labels on Sep 6, 2023
@jakelishman jakelishman requested a review from a team as a code owner September 6, 2023 14:41
@qiskit-bot (Collaborator) commented

One or more of the following people are requested to review this:

@jakelishman (Member Author) commented

Now rebased over #10783.

The test outputs change slightly because the order we filter out duplicate swaps in obtain_swaps is not identical. We filter out cases where the left index is greater than the right, and with the assignments of virtual qubits to physical qubits varying, doing the filter with physical indices is not guaranteed to filter in the same order as doing it with virtual indices (though the trialled swaps will be the same).
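As a toy sketch of that filter (illustrative only, not the real obtain_swaps; plain usize indices stand in for the qubit types): each active qubit proposes a swap with each of its neighbours, and the pair is kept only when the left index is not greater than the right, so each undirected swap is trialled exactly once.

    fn candidate_swaps(active: &[usize], neighbors: &[Vec<usize>]) -> Vec<[usize; 2]> {
        let mut swaps = Vec::new();
        for &p in active {
            for &q in &neighbors[p] {
                // Drop pairs whose left index is greater than the right, so a swap
                // and its mirror are not both trialled.
                if p <= q {
                    swaps.push([p, q]);
                }
            }
        }
        swaps
    }

    fn main() {
        // A line of three qubits: 0 - 1 - 2.
        let neighbors = vec![vec![1], vec![0, 2], vec![1]];
        assert_eq!(candidate_swaps(&[0, 1, 2], &neighbors), vec![[0, 1], [1, 2]]);
    }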

@jakelishman (Member Author) commented

Now rebased over #10756.

@jakelishman (Member Author) commented

OK, now that #10753 is merged, I think there are no other open PRs on Terra close to merge that conflict with this one.

@mtreinish mtreinish self-assigned this Sep 15, 2023
@mtreinish mtreinish added this to the 0.45.0 milestone Sep 15, 2023
@mtreinish (Member) left a comment

Overall this LGTM. It makes a lot of sense even if it increases the complexity a bit; the amount of virtual->physical mapping we need to do is significantly decreased with this change, so it's worth it. I left a few small inline comments, but only 2 of them are real, the others are more idle musings.

crates/accelerate/src/sabre_swap/layer.rs (inline review thread, outdated and resolved)
Comment on lines 232 to 238
.map(|b| {
    let b_index = b.index();
    if a_index <= b_index {
        dist[[a_index, b_index]]
    } else {
        0.0
    }
@mtreinish (Member):

Is there an advantage to doing this vs a filter_map?

@jakelishman (Member Author):

I didn't think about it - the original code used map because it had an iterator that directly ran over nodes, so didn't need to worry about the double counting. I'll swap it to filter_map if you prefer, and another possibility is making the outer map a flat_map and removing the internal sum?

@mtreinish (Member):

I guess it was more an efficiency question; I didn't know if adding a bunch of 0.0s was going to be faster than using filter_map or something. But yeah, using a filter_map, flat_map, and a single sum seems the more natural way to do this with iterators.
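Something like this toy version of the scoring is what I mean (illustrative shape only, not the actual layer.rs code): others[a] lists the qubits that share an extended-set gate with qubit a, so a naive nested iteration would count every pair twice; the filter_map keeps the a <= b representative and flat_map lets one sum run over the whole structure.

    fn score(others: &[Vec<usize>], dist: &[Vec<f64>]) -> f64 {
        others
            .iter()
            .enumerate()
            .flat_map(|(a, partners)| {
                partners
                    .iter()
                    .filter_map(move |&b| if a <= b { Some(dist[a][b]) } else { None })
            })
            .sum()
    }

    fn main() {
        // Gates on the pairs (0, 1) and (1, 2), recorded from both endpoints.
        let others = vec![vec![1], vec![0, 2], vec![1]];
        let dist = vec![
            vec![0.0, 1.0, 2.0],
            vec![1.0, 0.0, 1.0],
            vec![2.0, 1.0, 0.0],
        ];
        assert_eq!(score(&others, &dist), 2.0);
    }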

@jakelishman (Member Author) commented Sep 18, 2023

Naively, I don't think you can represent Option<f64> without a discriminant (assuming you care about propagating the bit-pattern payload of qNaNs), but given LLVM will have the full context of the ensuing operations, it can probably unpick everything into sensible code. I had a go in https://godbolt.org/z/Y531qhn9r, but I think at that point it'll be branch prediction and memory-access patterns that dictate the speed more than anything else. I'll try locally and just make it a flat_map+filter_map assuming the difference isn't visible.

If this code turns out to be something we need to micro-optimise, the next thing I'd try would be enforcing the Vec<Vec<others>> stuff to always allocate the other qubits in groups of 4 or 8, and then get the compiler to emit always-vectorised code output (using the self-index in the outer vec to represent "missing" slots in the inner vec, to abuse dist[[a, a]] == 0.0), and then multiply the final result by 0.5 to remove the double-count effect. But that would 100% be a different PR lol.

edit: that vectorisation wouldn't survive a conversion of the extended set into a sequence of layers, so probably not worth trying at all tbh.
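A quick sanity check on the Option<f64> point (standard Rust, nothing Sabre-specific): every f64 bit pattern, including all the NaN payloads, is a valid value, so there is no niche for the compiler to use and the Option really does carry a separate discriminant.

    use std::mem::size_of;

    fn main() {
        // No niche in f64, so the Option grows beyond 8 bytes (16 in practice)...
        assert!(size_of::<Option<f64>>() > size_of::<f64>());
        // ...whereas a type with a niche, like a reference, pays nothing extra.
        assert_eq!(size_of::<Option<&f64>>(), size_of::<&f64>());
    }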

@jakelishman (Member Author):

The most important thing for performance here was just to remove the branching; the floating-point additions are probably mostly pipelined at more than a 2x rate compared to the cycle count of the operation (especially if the compiler's able to find some SIMD vectorisation). I also made this a flat-map - I didn't see much of a difference from that, but it looks a shade more natural now anyway.

Depending on where we go with the extended set in the future, I might look into rearranging the data structure to allow explicit SIMD vectorisation in all the calculations, but I'll leave that til after we know what we're going to do with a potential layer structure on the extended set.

Done in 0475b3b.
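In the same toy terms as the sketch above (again illustrative, not the actual layer.rs code), the branchless form adds every directed (a, b) and (b, a) entry unconditionally and halves the total at the end to cancel the double counting, so the hot loop is just pipelined floating-point additions.

    fn score_branchless(others: &[Vec<usize>], dist: &[Vec<f64>]) -> f64 {
        let total: f64 = others
            .iter()
            .enumerate()
            .flat_map(|(a, partners)| partners.iter().map(move |&b| dist[a][b]))
            .sum();
        // Every pair was counted from both endpoints, so halve the result.
        0.5 * total
    }

    fn main() {
        let others = vec![vec![1], vec![0, 2], vec![1]];
        let dist = vec![
            vec![0.0, 1.0, 2.0],
            vec![1.0, 0.0, 1.0],
            vec![2.0, 1.0, 0.0],
        ];
        assert_eq!(score_branchless(&others, &dist), 2.0);
    }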

@jakelishman (Member Author):

It's possible that a further re-organisation of the data structure, storing a bit more so that there's a direct iterator over each gate exactly once without branching, might help as well, but I want to wait on that until we've looked at what we're going to do about building up the extended set as a delta. This PR still offers a speed-up over the status quo, and I don't want to get into premature optimisation that might become obsolete as soon as we add the extra requirement that it must be possible to remove single gates from the ExtendedSet.

qubits: &[VirtualQubit; 2],
layout: &NLayout,
swaps: &mut Vec<[PhysicalQubit; 2]>,
qubits: &[PhysicalQubit; 2],
coupling_graph: &DiGraph<(), ()>,
) {
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
@mtreinish (Member):

Completely unrelated to anything in this PR, and not something we should consider for this but rather as a follow-up. But reading through the code now I'm wondering if we should do something like:

Suggested change
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = DictMap::new();
let source_index = NodeIndex::new(qubits[0].index());
let target_index = NodeIndex::new(qubits[1].index()));
let mut shortest_paths: DictMap<NodeIndex, Vec<NodeIndex>> = [(node_index, Vec::with_capacity(distance[[source_index, target_index]])].iter().collect();

which definitely won't compile because I'm sure I made a typo or missed some typing, but it's the basic idea.

@mtreinish (Member):

Actually, I just looked up the dijkstra code in rustworkx. This won't work because of https://github.com/Qiskit/rustworkx/blob/2a2a18383ee877c7bc475ef15030beaea6d03520/rustworkx-core/src/shortest_path/dijkstra.rs#L120-L122; the rest of the code will handle this correctly. I can push a PR to rustworkx to enable this optimization so that it detects if the path list is already allocated and just clears it and pushes the start node at the beginning, but until that is in rustworkx we can't do this. Not that reducing the allocations here will have a noticeable impact on runtime performance, as this is only triggered in an edge case.
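For the record, a compiling version of roughly that suggestion might look like the sketch below (std HashMap standing in for DictMap, plain usizes for the node types, and the distance-matrix entry cast to a capacity; illustrative only, and as noted above it wouldn't survive rustworkx's dijkstra today).

    use std::collections::HashMap;

    // Seed the shortest-path map with a Vec already sized from the known
    // source -> target distance, so the Dijkstra fill never has to reallocate.
    fn seeded_paths(source: usize, target: usize, distance: &[Vec<f64>]) -> HashMap<usize, Vec<usize>> {
        let capacity = distance[source][target] as usize + 1;
        let mut shortest_paths = HashMap::new();
        shortest_paths.insert(source, Vec::with_capacity(capacity));
        shortest_paths
    }

    fn main() {
        let distance = vec![vec![0.0, 2.0], vec![2.0, 0.0]];
        let paths = seeded_paths(0, 1, &distance);
        assert!(paths[&0].capacity() >= 3);
    }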

Comment on lines +691 to +696
for i in 0..split {
swaps.push([shortest_path[i], shortest_path[i + 1]]);
}
for swap in backwards.iter().rev() {
swaps.push([qubits[1], swap.to_virt(layout)]);
for i in 0..split - 1 {
let end = shortest_path.len() - 1 - i;
swaps.push([shortest_path[end], shortest_path[end - 1]]);
@mtreinish (Member):

Just for the record in case we're looking at this again in the future: this took me a while to trace through in my head before I realized these were equivalent. The tricky bit I missed at first when comparing the old code to the new is the difference between the virtual and physical qubits; the old path worked with virtual qubits and the new one uses physical ones. This is somewhere that having #10761 is important for validating that we're working in the correct domain, because if these were just integers this would have been easy to mess up.
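A toy version of the new pattern, just to record how the path gets split into forward and backward swaps (illustrative, not the real routing code): walk a shortest path of physical qubits from both ends towards a split point, emitting the swaps that bring the two gate qubits next to each other.

    fn path_to_swaps(shortest_path: &[usize], split: usize) -> Vec<[usize; 2]> {
        let mut swaps = Vec::new();
        // Forward swaps carry the first qubit up to the split point...
        for i in 0..split {
            swaps.push([shortest_path[i], shortest_path[i + 1]]);
        }
        // ...and backward swaps carry the second qubit down to meet it.
        for i in 0..split.saturating_sub(1) {
            let end = shortest_path.len() - 1 - i;
            swaps.push([shortest_path[end], shortest_path[end - 1]]);
        }
        swaps
    }

    fn main() {
        // Physical qubits 3 and 9 joined by the path 3-4-5-6-9; split in the middle.
        let path: [usize; 5] = [3, 4, 5, 6, 9];
        let swaps = path_to_swaps(&path, path.len() / 2);
        // After these swaps the two endpoints sit on the adjacent qubits 5 and 6.
        assert_eq!(swaps, vec![[3, 4], [4, 5], [9, 6]]);
    }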

On modern hardware, branching is typically more expensive than a simple
floating-point addition that can be pipelined in.  This removes the
branch in favour of removing the duplication from the scoring at the end
by dividing by two.
@mtreinish (Member) left a comment

This LGTM now, thanks for digging into the performance of that function. Not branching for maximum performance makes sense to me.

@coveralls

Pull Request Test Coverage Report for Build 6248390897

  • 153 of 157 (97.45%) changed or added relevant lines in 4 files are covered.
  • 23 unchanged lines in 4 files lost coverage.
  • Overall coverage decreased (-0.02%) to 87.268%

Changes Missing Coverage                    Covered Lines   Changed/Added Lines   %
crates/accelerate/src/sabre_swap/layer.rs   68              72                    94.44%

Files with Coverage Reduction               New Missed Lines   %
crates/accelerate/src/sabre_swap/layer.rs   1                  95.21%
crates/accelerate/src/nlayout.rs            5                  79.44%
crates/qasm2/src/lex.rs                     5                  91.16%
crates/qasm2/src/parse.rs                   12                 97.13%

Totals Coverage Status
Change from base Build 6246166195: -0.02%
Covered Lines: 74327
Relevant Lines: 85171

💛 - Coveralls

@mtreinish mtreinish added this pull request to the merge queue Sep 20, 2023
Merged via the queue into Qiskit:main with commit 22b94a1 Sep 20, 2023
13 checks passed
@jakelishman jakelishman deleted the sabre/physical branch September 20, 2023 15:06