Calculate relative, not absolute, scores in SabreSwap #9012
Conversation
Summary

This is a significant performance improvement for very wide (100q+) circuits - technically it's an asymptotic improvement in the scoring calculation for volumetric circuits, though in practice other constant factors, especially in Python, seem to mask that somewhat; or perhaps there's another quadratic component I hadn't thought about - the output number of swaps seems like a potential place an implicit O(n^2) dependency could appear.

For a depth-5 QV 1081 circuit on a heavy-hex lattice, this version of Sabre routed the problem in 25s on my machine, compared to 100s on `main`. In relatively informal testing, I didn't see any notable performance increases or decreases in any of our current Sabre benchmark suite; I think other overhead masks any effects, and the circuits aren't large enough for it to matter anyway.

This is the first of a series of PRs I hope to make about Sabre in Terra. The purpose of this commit is to demonstrate that this improved mechanism of scoring does not need to affect the quality of output at all. This commit maintains RNG compatibility with the existing Sabre implementation despite completely overhauling the internal logic and implementing new scoring.

In further PRs, I intend to relax the RNG-compatibility requirement, because there are a lot of speed-ups currently left on the table. I haven't made too much effort to optimise things within the RNG compatibility, because it'll be easier once that restriction is relaxed.

Details and comments

The previous `SabreSwap` algorithm would score each candidate swap by applying the swap to the current layout, iterating through every element in the front layer and extended sets, summing the total distances of the 2q gates, and then undoing the swap. However, in the front layer, a given swap can affect at most two 2q operations. This new form instead scores each swap by calculating its total distance relative to what it would have been had the swap not been made. This means that the basic score is formed from only one or two gates, no matter how many are in the front layer.

This is an algorithmic complexity improvement in the scoring for volumetric circuits with respect to the number of qubits. Such a circuit with `n` qubits has `n / 2` gates in its front layer at all times, and so (assuming a coupling map that expands by a constant connectivity factor per qubit, like heavy hex) `k * n` swaps to score. This means that choosing the best swap has quadratic complexity with the original Sabre scoring algorithm. With this new algorithm, the score for a given swap is calculated in constant time, so choosing the best is instead linear. In practice, I did not see all these improvements at the scales I tested at, but I did see significant improvements - a 1081q heavy-hex quantum-volume circuit at depth 5 was swap-mapped on my machine in 25s with this commit compared to 100s before it.

The principal change is the structs `FrontLayer` and `ExtendedSet`, which combine constant-time hash-set insertions, lookups and removals with vectors to enable constant-time lookup of the affected qubits. `FrontLayer` now only ever holds currently unroutable 2q operations; routable operations are immediately placed into the output structures as soon as a new 2q gate is routed. This routing is also done by exploiting that a swap can only affect two previously unroutable gates in the front layer; we just walk forwards along the outbound edges from those two nodes, adding any operations that become routable. This avoids another linear scan through the whole front layer after each swap, although in practice this has less of a speed-up effect, because it already wasn't quadratic.

In theory, this change does not affect how any swap is scored relative to any other with the same front layer and extended set (though the scores used in the comparison do change). In order to precisely match the current implementation (and ensure reproducibility from a given seed), this tracks the insertion order of nodes into the front layer, including after removals.

This commit completely modifies the internals of the Rust components of Sabre, although the actual algorithm is largely unchanged, aside from the scoring difference. Various parts of the implementation do change for efficiency, though.

This commit maintains RNG compatibility with the previous Rust implementation in most cases. It is possible in some circuits for floating-point differences to cause different output, when several swaps are at the minimum score but plus/minus 1ULP. This happens in both the old and new forms of the implementation, but _which_ of the minimal swaps gets the minus-1ULP score varies between them, and consequently affects the swap choice. In fairly extensive testing, this appears to be the only mechanism for differences; I've verified that the release-valve mechanism and predecessor-requirement tracking function identically to before. The resultant scores - relative for "basic" and "lookahead", absolute for "decay" - are in practice within 2ULP of the old algorithm's.

In maintaining RNG compatibility, this commit leaves several further speed-ups on the table. There is additional memory usage and tracking to maintain some required iteration orders, and some reordering checks that are not strictly necessary any more. Further, the sorting stages at the ends of the swap-choosing functions (to maintain repeatability) can be made redundant now, since some hash-set iteration (which is effectively an uncontrolled randomisation per run) is no longer required. These will be addressed in follow-ups.

I haven't written a changelog entry for this right now - I'll do it as/when I make another PR that breaks RNG compat.
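To make the relative-scoring idea concrete, here is a minimal sketch of scoring one candidate swap. The names (`score_swap_relative`, `partner_of`, `dist`) are hypothetical, not the PR's actual code, but the qubit-to-gate lookup it assumes is the kind of thing `FrontLayer` provides:

```rust
/// Sketch of relative scoring: only the (at most two) front-layer gates
/// touching the swapped qubits contribute, so the cost per candidate swap
/// is O(1) rather than a scan over the whole front layer.
fn score_swap_relative(
    swap: [usize; 2],
    // For each physical qubit: the partner qubit of the front-layer 2q gate
    // acting on it, if any.
    partner_of: &[Option<usize>],
    // Coupling-map distance between physical qubits.
    dist: &[Vec<f64>],
) -> f64 {
    let mut delta = 0.0;
    for (this, other) in [(swap[0], swap[1]), (swap[1], swap[0])] {
        if let Some(partner) = partner_of[this] {
            // A gate joining the two swapped qubits has its distance
            // unchanged by the swap.
            if partner == other {
                continue;
            }
            // The swap relocates `this`'s state onto `other`.
            delta += dist[other][partner] - dist[this][partner];
        }
    }
    delta // negative means the swap brings front-layer gates closer
}
```

Because only the swapped qubits' gates are consulted, the score is independent of how many gates sit in the front layer, which is where the quadratic-to-linear improvement described above comes from.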
Thank you for opening a new pull request. Before your PR can be merged it will first need to pass continuous integration tests and be reviewed. Sometimes the review process can be slow, so please be patient. While you're waiting, please feel free to review other open PRs. While only a subset of people are authorized to approve pull requests for merging, everyone is encouraged to review open pull requests. Doing reviews helps reduce the burden on the core team and helps make the project's code better for everyone. One or more of the following people are requested to review this:
Pull Request Test Coverage Report for Build 3431683015
💛 - Coveralls
@jakelishman, the performance speedup on large benchmarks is very impressive. However, I believe that it would no longer be possible to run Sabre on
You can get essentially the same speedup on

That can all be done in a unified manner in the algorithm here. We'd perhaps want to use
LGTM. I'm a big fan of the front layer and extended set classes and associated methods. It makes the core of the algorithm a lot easier to read.
I'm not sure I like the use of `required_predecessors` as scratch space in `populate_extended_set`, since it requires mutability that otherwise looks not to be needed. Purely a preference for contract readability, but if it helps performance, I'm not opposed.
I'll hold off on auto-merge in case anyone else wants to review.
(I spent a few hours reviewing this and couldn't find any bugs. I thought I might find an issue around gates with unsatisfied classical dependencies making it into the front layer, but that looks to be handled nicely by the tracking of `required_predecessors` 🙂).
Nice work!
    / self.nodes.len() as f64
}
/// Populate a of nodes that would be routable if the given swap was applied to a layout. This |
- /// Populate a of nodes that would be routable if the given swap was applied to a layout. This
+ /// Populate a vector of nodes that would be routable if the given swap was applied to a layout. This
If I remember correctly, the previous algorithm also used
Overall this looks great! Nothing really looked wrong with the logic to me, and the code is a lot more structured now too, which is a big win. There were a few places where you used loops that I would have gone with an iterator instead, but that typically doesn't make much of a difference in practice so I didn't mention it (except in one place). I left a few inline comments but they're mostly my musings as I was reading through the code more than anything else. I don't think any of them are particularly actionable, so I'm going to tag automerge. Thanks for all the work on this.
while self.iteration_start < self.iteration_end
    && self.iteration_order[self.iteration_start].is_none()
{
    self.iteration_start += 1;
}
A really tiny nit/micro-optimization/benchmarking question here, which we should really just disregard as it's more idle musing, especially as I know this block will be removed in the subsequent PR. This just looked a bit odd to me as a way to find the index of the first non-`None` element in a slice. It works fine, but I wonder how it would compare performance-wise to something like (untested, so there are probably typos or mistakes):
self.iteration_start = match self.iteration_order[self.iteration_start..].iter().position(|x| x.is_some()) {
Some(relative_index) => relative_index + self.iteration_start,
None => self.iteration_end,
};
I expect this will be slower in the non-match case, as it will traverse the full vec instead of terminating at `iteration_end` (although you could probably work around that with `enumerate()` and some other logic). It really is a question of how much the compiler can optimize with an iterator instead of a while loop, I guess.
The answer to most questions about why I used loops and not iterators is going to be "because I didn't know there was an iterator method that does it", haha. I think in your example, we could just limit the slice to `self.iteration_start..self.iteration_end` to get the early-exit behaviour too, though?
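For concreteness, a minimal sketch of that slice-limited variant as a free function (untested; `order`, `start` and `end` stand in for `self.iteration_order`, `self.iteration_start` and `self.iteration_end`):

```rust
// Find the index of the first occupied slot in order[start..end],
// or return `end` if every slot in the window is `None`.
fn first_occupied(order: &[Option<usize>], start: usize, end: usize) -> usize {
    order[start..end]
        .iter()
        .position(|slot| slot.is_some())
        .map_or(end, |relative| start + relative)
}
```

Limiting the slice to the `start..end` window gives the same early exit as the while loop, because `position` stops at the first `Some` and never looks past `end`.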
let mut decremented: HashMap<usize, u32> = HashMap::new();
let mut i = 0;
while i < to_visit.len() && extended_set.len() < EXTENDED_SET_SIZE {
    for edge in dag.dag.edges_directed(to_visit[i], Direction::Outgoing) {
fwiw, this would have been fine as `edges()` too, because for a directed graph this is the default direction in petgraph (which is what the before version did). But being explicit is probably more clear.
Ah ok that's good to know thanks.
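The equivalence is easy to check on a toy graph (illustrative only; assumes the `petgraph` crate):

```rust
use petgraph::graph::DiGraph;
use petgraph::visit::EdgeRef;
use petgraph::Direction;

fn main() {
    let mut g = DiGraph::<(), ()>::new();
    let (a, b, c) = (g.add_node(()), g.add_node(()), g.add_node(()));
    g.add_edge(a, b, ());
    g.add_edge(c, a, ()); // incoming to `a`: yielded by neither iterator below
    // For a directed graph, `edges` defaults to the outgoing direction, so
    // these two produce the same edge set.
    let implicit: Vec<_> = g.edges(a).map(|e| e.id()).collect();
    let explicit: Vec<_> = g
        .edges_directed(a, Direction::Outgoing)
        .map(|e| e.id())
        .collect();
    assert_eq!(implicit, explicit);
}
```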
@@ -162,7 +154,7 @@ pub fn build_swap_map(
     layout: &mut NLayout,
     num_trials: usize,
 ) -> (SwapMap, PyObject) {
-    let run_in_parallel = getenv_use_multiple_threads();
+    let run_in_parallel = getenv_use_multiple_threads() && num_trials > 1;
Ah that's a good catch, I guess we were potentially creating an unnecessary thread pool before for the single trial case.
Oh, I didn't actually really mean to commit this as part of the same patch - I was just using it while doing a little bit of profiling so I could avoid a bunch of threading calls in the output (then I realised I could just force it with the environment...). Since it's in, probably best to just leave it now.
    qubits_decay[best_swap[1]] += DECAY_RATE;
}
ops_since_progress.push(best_swap);
qubits_decay.fill(1.);
Fwiw, I think there was a reason I used `fill_with()` instead of `fill()` for this (probably some random micro-optimization to save a few nanoseconds that I read about somewhere a while ago), but I can't remember exactly why, and this will work fine (and if there was a performance difference it will be imperceptible).
tbh, the code went through a few iterations, and I'd probably lost that you'd originally used `fill_with` when I wrote the `fill` version. As long as there's no meaningful difference, I guess it doesn't matter (and for types that implement `Copy`, I'd be surprised if `fill_with` could be more efficient?).
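For reference, a tiny self-contained comparison of the two calls (only the `qubits_decay` name is taken from the diff); for a `Copy` element type like `f64` the observable result is identical:

```rust
fn main() {
    let mut qubits_decay = vec![0.5_f64; 8];
    // `fill` copies one value into every slot; `fill_with` invokes a
    // closure once per element.
    qubits_decay.fill(1.);
    qubits_decay.fill_with(|| 1.);
    assert!(qubits_decay.iter().all(|&x| x == 1.));
}
```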
    front_layer: &mut FrontLayer,
    required_predecessors: &mut [u32],
) {
    let mut to_visit = to_visit.to_vec();
I'm wondering if we can get away without copying here somehow. I haven't thought it through in depth, but my gut instinct was that we could potentially write this function without having to append to the input node list like this. But I'm probably just overlooking something.
Not that this really matters, and I doubt this has any real impact on performance, especially since the routable nodes list is never going to be that long.
even in the worst case, we could just write a small class that borrows the input `Vec`, only allocates new heap space on `push`, and redirects the indexing to the right places in it. I didn't want to get too deep into these optimisations just yet.
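A rough sketch of that idea (`ChainVec` and everything in it is hypothetical, not code from the PR):

```rust
// Borrows an existing slice and only touches the heap if something is
// pushed; indexing covers the borrowed prefix first, then the owned tail.
struct ChainVec<'a, T> {
    borrowed: &'a [T],
    owned: Vec<T>,
}

impl<'a, T> ChainVec<'a, T> {
    fn new(borrowed: &'a [T]) -> Self {
        Self { borrowed, owned: Vec::new() }
    }

    fn len(&self) -> usize {
        self.borrowed.len() + self.owned.len()
    }

    fn push(&mut self, value: T) {
        self.owned.push(value); // the only place heap allocation can happen
    }

    fn get(&self, index: usize) -> Option<&T> {
        if index < self.borrowed.len() {
            self.borrowed.get(index)
        } else {
            self.owned.get(index - self.borrowed.len())
        }
    }
}
```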
if is_front {
    first_layer.push(gate_index);
I'm glad we're able to do this inline now. In #8388 I originally had something similar to generate the first layer as we iterated, and it ended up producing a slightly different but still valid result in some cases, so I reverted to just using the `DAGCircuit` method for that. This will make further usage of this data structure from inside Rust (for other callers of sabre swap) easier.
Looking back, I'm not 100% certain why I swapped everything to `Option` here, but this was one place I checked to ensure we built up everything in the same order. I think I originally started massively rewriting the `SabreDAG` struct as well, and then realised that it wasn't necessary, so I just reduced it to (mostly) just the modifications I needed.
I wasn't sure what you wanted to do about the changelog tag; I was inclined to put it as a new-feature changelog. Even if you add the release note on the follow-on commit, I think it would be good to have a line that says
I've tagged it "new feature" at least - I suppose the labels are decoupled from the public-facing documentation, since the labels are for generating the PR-based changelog rather than the public one. I didn't really think about that before.
* Calculate relative, not absolute, scores in SabreSwap

* Fix lint

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Remove Sabre's manual insertion-order iteration and unnecessary sorts (#9560)

* Remove Sabre's manual insertion-order iteration and unnecessary sorts

The manual insertion-order tracking was initially added in 02a1939 (gh-9012) during a complete rewrite of the scoring algorithms for Sabre, to demonstrate that the new algorithm could be made fully RNG compatible. It maintained an iteration order identical to the previous implementation once the front-layer data structure was swapped from being a raw list to hash-based. The new data structure doesn't require the manual tracking to be reproducible, as long as the iteration order is independent of the hash seed. This swaps the relevant places over to `IndexMap`, to be able to remove a lot of the manual tracking. In casual testing, this didn't appear to have much effect on performance, but the resulting code is much simpler.

The sorts (and deliberate canonicalisation of the swaps) were necessary to match RNG compatibility in the pre-relative-score Sabre, but since the swaps are now iterated through in a deterministic order and guaranteed to be generated only once each (the previous version used a hashset to remove duplicates), neither step is necessary. For a QV circuit of depth 5 at 1081 qubits on heavy hex, this is worth almost a 2x speedup in routing performance.

* Fix RNG-related tests

* Add RNG-change note
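The property being relied on - that `IndexMap`, from the `indexmap` crate, iterates in insertion order regardless of the hash seed - can be seen in a minimal sketch:

```rust
use indexmap::IndexMap;

fn main() {
    // Unlike HashMap, IndexMap yields entries in insertion order,
    // independent of the per-process hash seed, so no manual order
    // tracking is needed for reproducibility.
    let mut front: IndexMap<usize, (usize, usize)> = IndexMap::new();
    front.insert(7, (0, 1)); // node 7 acts on physical qubits 0 and 1
    front.insert(3, (2, 5));
    front.insert(9, (4, 6));
    let order: Vec<usize> = front.keys().copied().collect();
    assert_eq!(order, vec![7, 3, 9]); // stable across runs
}
```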