Optimise CommutationAnalysis transpiler pass (#6982)
The `CommutationAnalysis` transpiler pass is typically one of the longest-running passes for large circuits with many gates, along with the mapping passes. It spends most of its time deciding whether any two given operators commute, which it does by matrix multiplication and maintaining a cache. This commit keeps the same general method (as opposed to, say, maintaining a known-good lookup table), but performs the following optimisations to improve it, in approximate order from most to least impactful:

- We store the _result_ of "do these two matrices commute?" in the cache, instead of the previous method of storing the two matrices individually. With the way the caching is currently written, the keys depend on both matrices, so it is never the case that one key can be a cache hit and the other a cache miss. Storing only the result therefore causes no more cache misses than before, and saves the subsequent matrix operations (multiplication and comparison). In real-world usage, this is the most significant change.
- The matrix multiplication is slightly reorganised to halve the number of calls to `Operator.compose`. Instead of first constructing an identity operator over _all_ qubits and composing both operators onto it, we reorganise the indices of the qubit arguments so that all of the "first" operator's qubits come first, and that operator is tensored with a small identity to bring it up to size. Then the other operator is composed with it. This is generally faster, since it replaces a call to `compose` (which needs to do a matmul-einsum step) with a simple `kron`, which need not concern itself with qubit order. It also results in fewer operations, since the input matrices are smaller.
- The cache-key algorithm is changed to avoid string-ification as much as possible.
This generally has a very small impact for most real-world applications, but has a _massive_ impact on circuits with large numbers of unsynthesised `UnitaryGate` elements (like quantum volume circuits being transpiled without `basis_gates`). This is because the gate parameters were previously being converted to string, which for `UnitaryGate` meant string-ifying the whole matrix. This was much slower than just doing the dot product, so was defeating the purpose of the cache.

On my laptop (i7 Macbook Pro 2019), this gives a 15-35% speed increase on the `time_quantum_volume_transpile` benchmark at transpiler optimisation levels 2 and 3 (the only levels at which the `CommutationAnalysis` pass runs by default), over the whole transpiler pass. The improvement in runtime of the pass itself depends strongly on the type of circuit, but in the worst (highly non-realistic) cases can be nearly an order of magnitude (for example, just calling `transpile(optimization_level=3)` on a `QuantumVolume(128)` circuit drops from 27s to 5s).

`asv` `time_quantum_volume_transpile` benchmarks:

- This commit:

  =============================== ============
   transpiler optimization level
  ------------------------------- ------------
   0                               3.57±0s
   1                               7.16±0.03s
   2                               10.8±0.02s
   3                               33.3±0.08s
  =============================== ============

- Previous commit:

  =============================== ============
   transpiler optimization level
  ------------------------------- ------------
   0                               3.56±0.02s
   1                               7.24±0.05s
   2                               16.8±0.04s
   3                               38.9±0.1s
  =============================== ============

Co-authored-by: Kevin Krsulich <kevin.krsulich@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>