
Optimize BarrierBeforeFinalMeasurement pass #11739

Conversation

mtreinish
Member

Summary

The BarrierBeforeFinalMeasurements pass previously worked by iterating over the DAGCircuit to find all the barriers and measurements and then checking whether those operations were at the end of the circuit, or separated from the end of the circuit only by barriers. However, this was fairly inefficient because the runtime of the pass scales with the total number of gates in the circuit. This commit optimizes the pass by looking at the predecessors of each qubit's output node to find the final measurements instead of iterating over the entire circuit.
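As a rough illustration of that approach, a simplified sketch (my own, not the PR's actual code): it only inspects the operation immediately before each qubit's output node, whereas the pass also needs to walk back through chains of trailing barriers and measures.

```
from qiskit.dagcircuit import DAGOpNode

def find_final_ops(dag):
    """Collect barrier/measure nodes that sit at the very end of the circuit."""
    final_op_types = {"measure", "barrier"}
    # Candidates are the ops feeding directly into each qubit's output node.
    candidates = set()
    for qubit in dag.qubits:
        for pred in dag.quantum_predecessors(dag.output_map[qubit]):
            if isinstance(pred, DAGOpNode) and pred.op.name in final_op_types:
                candidates.add(pred)
    # A candidate only counts as final if everything after it is also a
    # barrier or measure (a barrier followed by more gates is not final).
    final_ops = []
    for node in candidates:
        if all(
            not isinstance(succ, DAGOpNode) or succ.op.name in final_op_types
            for _, successors in dag.bfs_successors(node)
            for succ in successors
        ):
            final_ops.append(node)
    return final_ops
```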

@mtreinish mtreinish added performance Changelog: None Do not include in changelog labels Feb 7, 2024
@mtreinish mtreinish requested a review from a team as a code owner February 7, 2024 18:14
@qiskit-bot
Collaborator

One or more of the following people are requested to review this:

  • @Qiskit/terra-core

@coveralls

Pull Request Test Coverage Report for Build 7819304021

  • 0 of 19 (100.0%) changed or added relevant lines in 1 file are covered.
  • 9 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.009%) to 89.196%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| qiskit/dagcircuit/dagcircuit.py | 1 | 90.77% |
| crates/qasm2/src/lex.rs | 2 | 91.44% |
| crates/qasm2/src/parse.rs | 6 | 97.62% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 7818682912 | 0.009% |
| Covered Lines | 58820 |
| Relevant Lines | 65945 |

💛 - Coveralls

@mtreinish mtreinish added the on hold Can not fix yet label Feb 7, 2024
@mtreinish
Member Author

Running this under asv and doing some manual testing show that this is measurably slower than the existing pass. So some more thought will be needed on the approach here. I'm marking this as on hold until I come up with a better solution.

@alexanderivrii
Contributor

alexanderivrii commented Feb 8, 2024

Hi @mtreinish, what is a good set of examples to experiment with?

I was experimenting earlier today with the following circuit (Bernstein-Vazirani):

```
from qiskit import QuantumRegister, ClassicalRegister, QuantumCircuit
from qiskit.transpiler.passes import BarrierBeforeFinalMeasurements

nq = 5000
qr = QuantumRegister(nq, 'q')
anc = QuantumRegister(1, 'ancilla')
cr = ClassicalRegister(nq, 'c')
qc = QuantumCircuit(qr, anc, cr)
qc.x(anc[0])
qc.h(anc[0])
qc.h(qr[0:nq])
qc.cx(qr[0:nq], anc[0])
qc.h(qr[0:nq])
qc.barrier()
qc.measure(qr, cr)

qct = BarrierBeforeFinalMeasurements()(qc)
```

Some quick observations (on my laptop):

1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).

2. The proposed change from

   ```
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]
   ```

   to

   ```
   final_ops_set = set(final_ops)
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]
   ```

   is very helpful (see the small illustration after this list).

3. The collection part below

   ```
   for candidate_node in dag.named_nodes(*final_op_types):
       is_final_op = True

       for _, child_successors in dag.bfs_successors(candidate_node):
           if any(
               isinstance(suc, DAGOpNode) and suc.name not in final_op_types
               for suc in child_successors
           ):
               is_final_op = False
               break

       if is_final_op:
           final_ops.append(candidate_node)
   ```

   was actually very fast (compared to the above), with the following being just a tiny bit better:

   ```
   bc = BlockCollector(dag)
   bc._collect_from_back = True
   bc._setup_in_degrees()
   final_ops = bc.collect_matching_block(filter_fn=lambda n: n.op.name in final_op_types)  # UPDATE: more concise
   final_ops = final_ops[::-1]  # UPDATE: reverse the ops to match the original order
   ```

   Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.
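Regarding point 2, a rough standalone illustration (plain integers standing in for DAG nodes; the sizes are arbitrary) of why hoisting the set matters: the condition of a comprehension is re-evaluated on every iteration, so the original form rebuilds `set(final_ops)` once per topological node.

```
import timeit

final_ops = list(range(2_000))
nodes = list(range(50_000))

# set(final_ops) is rebuilt for every node the comprehension tests
rebuilt = timeit.timeit(lambda: [n for n in nodes if n in set(final_ops)], number=1)

# Building the set once keeps each membership test O(1)
final_ops_set = set(final_ops)
hoisted = timeit.timeit(lambda: [n for n in nodes if n in final_ops_set], number=1)

print(f"set rebuilt per node: {rebuilt:.2f}s, hoisted set: {hoisted:.4f}s")
```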

@jakelishman
Member

Sasha: if you want to try out what it would look like with a fixed remove_op_node, try installing Rustworkx from this PR: Qiskit/rustworkx#1083 and then modifying DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node) (with no key function) and it should just work.

@mtreinish
Member Author

> Hi @mtreinish, what is a good set of examples to experiment with?

I was testing super deep circuits, like a depth of ~60k with only final measurements and only 50 qubits or so. I had seen a specific case where this pass was scaling very poorly as a function of the number of gates and was trying to fix that. TBH, I didn't look at a profile first (and I should have); I just intuitively (or for lack of intuition) assumed that iterating over dag.nodes() was the root cause of the slowdown and jumped to switching to just the output nodes.
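For concreteness, a sketch of that kind of stress case (the parameters are my own guesses based on the description above, and building a circuit this large in Python is itself slow):

```
from qiskit import QuantumCircuit
from qiskit.transpiler.passes import BarrierBeforeFinalMeasurements

num_qubits = 50   # "only 50 qubits or so"
depth = 60_000    # "a depth of ~60k"

qc = QuantumCircuit(num_qubits, num_qubits)
for _ in range(depth):
    # One layer of parallel CX gates per iteration, so total depth ~= depth
    for q in range(0, num_qubits - 1, 2):
        qc.cx(q, q + 1)
# Only final measurements, no mid-circuit measurement
qc.measure(range(num_qubits), range(num_qubits))

qct = BarrierBeforeFinalMeasurements()(qc)
```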

> Some quick observations (on my laptop):
>
> 1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).

Oh interesting. I guess because we have to remove all the nodes in final_ops_set we end up deleting a lot of individual nodes from the dag. The specific thing I was looking at in #11677 was how the scaling of remove_op_node is poor because it's quadratic as a function of the number of arguments on the instruction. But I guess even with a lot of 2-argument ops in the dag that cost adds up there too.
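One way to picture that quadratic cost (my own toy model, not profiled against the real implementation): a generic retain-edges removal that matches each incoming edge to each outgoing edge by wire examines n × n edge pairs for an n-qubit instruction.

```
# Toy model only: removing an n-qubit barrier while reconnecting its wires
# by pairwise-matching in-edges to out-edges costs n * n comparisons.
n = 5000  # e.g. the 5000-qubit barrier in the example circuit above
in_edges = [f"wire_{i}" for i in range(n)]
out_edges = [f"wire_{i}" for i in range(n)]

checks = sum(1 for a in in_edges for b in out_edges)
print(checks)  # 25_000_000 comparisons for a single node removal
```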

> 2. The proposed change from
>
>    ```
>    ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]
>    ```
>
>    to
>
>    ```
>    final_ops_set = set(final_ops)
>    ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]
>    ```
>
>    is very helpful.

Nice, yeah I only fixed that out of habit when I saw it.

> 3. The collection part below
>
>    ```
>    for candidate_node in dag.named_nodes(*final_op_types):
>        is_final_op = True
>
>        for _, child_successors in dag.bfs_successors(candidate_node):
>            if any(
>                isinstance(suc, DAGOpNode) and suc.name not in final_op_types
>                for suc in child_successors
>            ):
>                is_final_op = False
>                break
>
>        if is_final_op:
>            final_ops.append(candidate_node)
>    ```
>
>    was actually very fast (compared to the above),

Yeah this was really surprising to me. I saw that loop over named_nodes() and thought it would scale very poorly, because that just does:

```
for node in dag._multi_graph.nodes():
    ...
```

and checks for either barrier or measure in insertion order, and then does the bfs_successors() call from each match. So I just assumed that going about it from the output nodes would be more efficient. But the way I wrote this ends up doing more bfs_predecessors() calls than the old search code, which ends up being much slower. I was trying to make it more efficient, but I'll just defer to your suggestion below.

> with the following being just a tiny bit better
>
> ```
> bc = BlockCollector(dag)
> bc._collect_from_back = True
> bc._setup_in_degrees()
>
> def filter_fn(node):
>     return node.op.name in final_op_types
>
> final_ops = bc.collect_matching_block(filter_fn)
> ```
>
> Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.

Cool, I can update the PR to do this instead for the search.

@alexanderivrii
Contributor

alexanderivrii commented Feb 9, 2024

The circuit that I've used does have a 5000-qubit barrier. Without Jake's PR the time to remove it is about 20 seconds (on my laptop); with Jake's PR it's 0.1 seconds (old code) or 0.5 seconds (with the block-collection code).

Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

The discrepancy of 0.1 vs. 0.5 seconds seems to be related to the order in which nodes are removed (I will try to look a bit deeper into this).

UPDATE: oh, of course, the block collector collects nodes backwards from the end of the block, so the nodes appear in the opposite order; after this is fixed, node removal is also 0.1 seconds.

@jakelishman
Member

> Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

Yeah that's right, it does have to be the index.
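For reference, a minimal monkeypatch sketch of the experiment being discussed, assuming a rustworkx build that includes Qiskit/rustworkx#1083; the real DAGCircuit.remove_op_node also updates internal bookkeeping (such as op counts) that is omitted here.

```
from qiskit.dagcircuit import DAGCircuit

def _patched_remove_op_node(self, node):
    # Pass the integer node index, not the DAGOpNode object itself,
    # and rely on rustworkx to reconnect each wire across the removed node.
    self._multi_graph.remove_node_retain_edges_by_key(node._node_id)

DAGCircuit.remove_op_node = _patched_remove_op_node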

@mtreinish mtreinish closed this Apr 7, 2024