Improve performance of CommutationAnalysis transpiler pass #6982

Merged: 3 commits merged into Qiskit:main on Sep 22, 2021

Conversation

jakelishman (Member)

The CommutationAnalysis transpiler pass is typically one of the longer
passes for large circuits with many gates, along with the mapping
passes. It spends most of its time deciding whether any two given
operators commute, which it does by matrix multiplication and
maintaining a cache.

This commit maintains the same general method (as opposed to, say,
maintaining a known-good lookup table), but performs the following
optimisations to improve it, in approximate order from most to least
impactful (a short sketch of the first two follows the list):

  • we store the result of "do these two matrices commute?" in the
    cache, instead of storing the two matrices individually as before.
    With the way the caching is written, the keys depend on both
    matrices, so it is never the case that one key can be a cache hit
    and the other a cache miss. This means that storing only the result
    does not cause any more cache misses than before, and saves the
    subsequent matrix operations (multiplication and comparison). In
    real-world usage, this is the most significant change.

  • the matrix multiplication is slightly reorganised to halve the
    number of calls to Operator.compose. Instead of first constructing
    an identity operator over all qubits and composing both operators
    onto it, we reorganise the indices of the qubit arguments so that
    all of the "first" operator's qubits come first, and tensor it with
    a small identity to bring it up to size. Then the other operator is
    composed with it. This is generally faster, since it replaces a
    call to compose, which needs to do a matmul-einsum step, with a
    simple kron, which need not concern itself with ordering. It also
    results in fewer operations, since the input matrices are smaller.

  • the cache-key algorithm is changed to avoid string-ification as
    much as possible. This generally has a very small impact for most
    real-world applications, but has a massive impact on circuits with
    large numbers of unsynthesised UnitaryGate elements (like quantum
    volume circuits being transpiled without basis_gates). This is
    because the gate parameters were previously being converted to
    strings, which for UnitaryGate meant string-ifying the whole
    matrix. This was much slower than just doing the dot product, so it
    was defeating the purpose of the cache.
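
To make the first two points concrete, here is a minimal sketch of the
check under those changes. All names here are hypothetical (the real
pass works with Operator objects rather than bare arrays), and mat2 is
assumed to already span the union of both operators' qubits, with the
"first" operator's qubits ordered first:

```python
import numpy as np

def commute_cached(key, mat1, mat2, num_extra_qubits, cache):
    """Return whether two operators commute, caching only the boolean."""
    if key in cache:
        # The key depends on *both* operands, so a hit means this exact
        # pair has been checked before: return the stored answer rather
        # than redoing any matrix work.
        return cache[key]
    # Pad the first operator up to full size with a plain Kronecker
    # product against the identity.  Because its qubits come first, this
    # replaces an Operator.compose call (a matmul-einsum step) with a
    # simple kron that need not worry about qubit ordering.
    big1 = np.kron(np.eye(2 ** num_extra_qubits), mat1)
    # One product each way, then store only the yes/no answer.
    cache[key] = np.allclose(big1 @ mat2, mat2 @ big1)
    return cache[key]
```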

On my laptop (i7 MacBook Pro 2019), this gives a 15-35% speed increase
on the time_quantum_volume_transpile benchmark at transpiler
optimisation levels 2 and 3 (the only levels at which the
CommutationAnalysis pass runs by default), measured over the whole
transpile call. The improvement in the runtime of the pass itself
depends strongly on the type of circuit, but in the worst (highly
unrealistic) cases it can be nearly an order of magnitude (for example,
just calling transpile(optimization_level=3) on a QuantumVolume(128)
circuit drops from 27s to 5s).

asv time_quantum_volume_transpile benchmarks:

  • This commit:

    =============================== ============
     transpiler optimization level      time
    ------------------------------- ------------
                   0                  3.57±0s
                   1                 7.16±0.03s
                   2                 10.8±0.02s
                   3                 33.3±0.08s
    =============================== ============
    
  • Previous commit:

    =============================== ============
     transpiler optimization level      time
    ------------------------------- ------------
                   0                 3.56±0.02s
                   1                 7.24±0.05s
                   2                 16.8±0.04s
                   3                 38.9±0.1s
    =============================== ============
    

Details and comments

There is more work that could be done here as well, which I think would give us more performance in more natural workflows that don't just involve dumping in a tonne of unitary matrices. The next obvious step would be to build a lookup table of gate types that are known to commute with each other, in order to skip the entire process of comparing the two matrix multiplications for common cases; we can always fall back to this method. I haven't done this right now, because it's a bigger task, and it doesn't need doing in one go.
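
To illustrate the idea, a fast path could consult a small table of gate
pairs known a priori to commute before doing any matrix work. The sketch
below is purely hypothetical (nothing like it exists in this PR); it
covers only single-qubit gates sharing a qubit, and multi-qubit gates
would additionally need to account for which qubits overlap:

```python
# Hypothetical sketch of the future-work idea; not part of this PR.
_KNOWN_COMMUTING_1Q = {
    frozenset({"z", "rz"}),  # both diagonal in the Z basis, so commute
    frozenset({"z", "s"}),
    frozenset({"z", "t"}),
    frozenset({"x", "rx"}),  # both diagonal in the X basis, so commute
}

def known_to_commute(name1, name2):
    """Fast path: True only if this gate pair is in the known-good table."""
    return frozenset({name1, name2}) in _KNOWN_COMMUTING_1Q
```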

With caches, it's important that the time taken to construct the key isn't excessive; in this case, when the parameters were arrays (even if not from a UnitaryGate), that requirement wasn't necessarily honoured.
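
For illustration, a cheap key construction might look something like
the sketch below; the helper name and the exact scheme are assumptions
for the example, not the pass's actual code:

```python
import numpy as np

def _hashable_params(params):
    """Build a cheap, hashable cache-key component from gate parameters.

    Floats and other plain values key on their value; Numpy arrays key
    on object identity, which avoids string-ifying an entire unitary
    matrix just to perform a cache lookup.
    """
    key = []
    for param in params:
        if isinstance(param, np.ndarray):
            key.append(("ndarray", id(param)))  # identity, not contents
        else:
            key.append(param)
    return tuple(key)
```

Keying arrays on object identity rather than contents can, in
principle, miss equal-but-distinct arrays; that trade-off comes up in
the review discussion below.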

@jakelishman jakelishman requested a review from a team as a code owner September 3, 2021 23:50
@ewinston ewinston (Contributor) left a comment

How does the cache handle floating point round-off errors? Would the keys be unique enough by manually rounding them?

@@ -87,42 +87,87 @@ def run(self, dag):
self.property_set["commutation_set"][(current_gate, wire)] = temp_len - 1


def _commute(node1, node2, cache):
_COMMUTE_ID_OP = {}
Contributor:

Does this need to be module level or can it be put inside _commute?

I couldn't find where it gets populated?

jakelishman (Member Author) replied:

It can be either - realistically I can't imagine it'll ever contain more than ~3 elements (you'd need to be dealing with many-qubit gates to have more than even a single element in it), so there shouldn't be any memory problems. Having it module-level means that the cache persists between calls; it's only ever populated with various sizes of the identity matrix, and those are complete constants.

It's populated here:
https://github.com/Qiskit/qiskit-terra/blob/07299e170532a24fbaefa6a87aa3c427b79f765e/qiskit/transpiler/passes/optimization/commutation_analysis.py#L161-L168
on line 164 (note it's a double assignment).
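
Paraphrasing those lines (the real code builds qiskit Operator
identities rather than bare Numpy arrays), the lazy population looks
something like:

```python
import numpy as np

_COMMUTE_ID_OP = {}  # cached identity matrices, keyed by qubit count

def _identity_op(num_qubits):
    """Fetch, or lazily build and cache, the identity on num_qubits qubits."""
    try:
        return _COMMUTE_ID_OP[num_qubits]
    except KeyError:
        # The "double assignment": a single statement binds the new
        # identity both into the module-level cache and to a local name.
        identity = _COMMUTE_ID_OP[num_qubits] = np.eye(2 ** num_qubits)
        return identity
```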

@jakelishman (Member Author):

The original code actually had far worse behaviour with regard to string-ification, since before this commit all Numpy arrays were turned into strings, which has all the floating-point problems you mentioned. In the new form, I'm not aware of any cases where the fallback will actually be used, but I kept it in place so it won't break any extension gates doing super weird things.

This commit in theory can cause more cache misses (though it certainly doesn't in the benchmark suite), if several instructions are separately decomposed into the same type of gate with exactly equivalent, but referentially unequal, Numpy arrays. Even if this were to happen, the slowness of string-ifying Numpy arrays (the previous method) almost invariably outweighs the cost of just doing the commutation analysis again, especially now that I've tweaked the comparison to avoid unnecessary einsum calls and Operator creations.

(Sorry for the slow reply - I was on holiday!)

@ewinston (Contributor):

I checked scaling of timing and memory usage of QV circuits and didn't see any major issues. LGTM.

[figure: peak memory usage vs. square QV circuit]

[figure: total time spent in function]

@kdk kdk added the automerge label Sep 21, 2021
@kdk kdk added this to the 0.19 milestone Sep 21, 2021
@mergify mergify bot merged commit 890d9ca into Qiskit:main Sep 22, 2021
@jakelishman jakelishman deleted the commutation-analysis-performance branch September 22, 2021 08:44
@kdk kdk added the Changelog: None Do not include in changelog label Dec 6, 2021