Optimise CommutationAnalysis transpiler pass (#6982)
The `CommutationAnalysis` transpiler pass is typically one of the longest-running passes for large circuits with many gates, along with the mapping passes. It spends most of its time deciding whether any two given operators commute, which it does by matrix multiplication and maintaining a cache. This commit keeps the same general method (as opposed to, say, maintaining a known-good lookup table), but performs the following optimisations to improve it, in approximate order from most to least impactful:

- We store the _result_ of "do these two matrices commute?" in the cache, instead of the previous method of storing the two matrices individually. With the way the caching is currently written, the keys depend on both matrices, so it is never the case that one key can be a cache hit and the other a cache miss. Storing only the result therefore causes no more cache misses than before, and saves the subsequent matrix operations (multiplication and comparison). In real-world usage, this is the most significant change.
- The matrix multiplication is slightly reorganised to halve the number of calls to `Operator.compose`. Instead of first constructing an identity operator over _all_ qubits and composing both operators onto it, we reorganise the indices of the qubit arguments so that all of the "first" operator's qubits come first, and that operator is tensored with a small identity to bring it up to size. Then the other operator is composed with it. This is generally faster, since it replaces a call to `compose` (which needs to do a matmul-einsum step) with a simple `kron`, which need not concern itself with qubit order. It also results in fewer operations, since the input matrices are smaller.
- The cache-key algorithm is changed to avoid string-ification as much as possible.
This generally has a very small impact for most real-world applications, but has a _massive_ impact on circuits with large numbers of unsynthesised `UnitaryGate` elements (like quantum volume circuits being transpiled without `basis_gates`). This is because the gate parameters were previously being converted to string, which for `UnitaryGate` meant string-ifying the whole matrix. This was much slower than just doing the dot product, so was defeating the purpose of the cache.

On my laptop (i7 Macbook Pro 2019), this gives a 15-35% speed increase on the `time_quantum_volume_transpile` benchmark at transpiler optimisation levels 2 and 3 (the only levels at which the `CommutationAnalysis` pass runs by default), over the whole transpiler pass. The improvement in runtime of the pass itself depends strongly on the type of circuit, but in the worst (highly non-realistic) cases can be nearly an order of magnitude (for example, just calling `transpile(optimization_level=3)` on a `QuantumVolume(128)` circuit drops from 27s to 5s).

`asv` `time_quantum_volume_transpile` benchmarks:

- This commit:

  =============================== ============
   transpiler optimization level
  ------------------------------- ------------
   0                               3.57±0s
   1                               7.16±0.03s
   2                               10.8±0.02s
   3                               33.3±0.08s
  =============================== ============

- Previous commit:

  =============================== ============
   transpiler optimization level
  ------------------------------- ------------
   0                               3.56±0.02s
   1                               7.24±0.05s
   2                               16.8±0.04s
   3                               38.9±0.1s
  =============================== ============

Co-authored-by: Kevin Krsulich <kevin.krsulich@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>