
Optimize BarrierBeforeFinalMeasurement pass #11739

Conversation

mtreinish
Member

Summary

The BarrierBeforeFinalMeasurements pass previously worked by iterating over the DAGCircuit to find all the barriers and measurements and then checking whether those operations were at the end of the circuit, or separated from the end of the circuit only by barriers. However, this was fairly inefficient because the runtime of the pass scales with the total number of gates in the circuit. This commit optimizes the pass by looking at the predecessors of each qubit's output node to find the final measurements instead of iterating over the entire circuit.
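As a rough illustration of that approach, a simplified sketch (my own, not the PR's actual code): it only inspects the operation immediately before each qubit's output node, whereas the pass also needs to walk back through chains of trailing barriers and measures.

```
from qiskit.dagcircuit import DAGOpNode

def find_final_ops(dag):
    """Collect barrier/measure nodes that sit at the very end of the circuit."""
    final_op_types = {"measure", "barrier"}
    # Candidates are the ops feeding directly into each qubit's output node.
    candidates = set()
    for qubit in dag.qubits:
        for pred in dag.quantum_predecessors(dag.output_map[qubit]):
            if isinstance(pred, DAGOpNode) and pred.op.name in final_op_types:
                candidates.add(pred)
    # A candidate only counts as final if everything after it is also a
    # barrier or measure (a barrier followed by more gates is not final).
    final_ops = []
    for node in candidates:
        if all(
            not isinstance(succ, DAGOpNode) or succ.op.name in final_op_types
            for _, successors in dag.bfs_successors(node)
            for succ in successors
        ):
            final_ops.append(node)
    return final_ops
```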

@mtreinish mtreinish added performance Changelog: None Do not include in changelog labels Feb 7, 2024
@mtreinish mtreinish requested a review from a team as a code owner February 7, 2024 18:14
@qiskit-bot
Collaborator

One or more of the following people are requested to review this:

  • @Qiskit/terra-core

@coveralls

Pull Request Test Coverage Report for Build 7819304021

  • 0 of 19 (100.0%) changed or added relevant lines in 1 file are covered.
  • 9 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.009%) to 89.196%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| qiskit/dagcircuit/dagcircuit.py | 1 | 90.77% |
| crates/qasm2/src/lex.rs | 2 | 91.44% |
| crates/qasm2/src/parse.rs | 6 | 97.62% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 7818682912 | 0.009% |
| Covered Lines | 58820 |
| Relevant Lines | 65945 |

💛 - Coveralls

@mtreinish mtreinish added the on hold Can not fix yet label Feb 7, 2024
@mtreinish
Member Author

Running this under asv and doing some manual testing show that this is measurably slower than the existing pass. So some more thought will be needed on the approach here. I'm marking this as on hold until I come up with a better solution.

@alexanderivrii
Contributor

alexanderivrii commented Feb 8, 2024

Hi @mtreinish, what is a good set of examples to experiment with?

I was experimenting earlier today with the following circuit (Bernstein-Vazirani):

```
from qiskit import QuantumRegister, ClassicalRegister, QuantumCircuit
from qiskit.transpiler.passes import BarrierBeforeFinalMeasurements

nq = 5000
qr = QuantumRegister(nq, 'q')
anc = QuantumRegister(1, 'ancilla')
cr = ClassicalRegister(nq, 'c')
qc = QuantumCircuit(qr, anc, cr)
qc.x(anc[0])
qc.h(anc[0])
qc.h(qr[0:nq])
qc.cx(qr[0:nq], anc[0])
qc.h(qr[0:nq])
qc.barrier()
qc.measure(qr, cr)

qct = BarrierBeforeFinalMeasurements()(qc)
```

Some quick observations (on my laptop):

1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).

2. The proposed change from

   ```
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]
   ```

   to

   ```
   final_ops_set = set(final_ops)
   ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]
   ```

   is very helpful (see the small illustration after this list).

3. The collection part below

   ```
   for candidate_node in dag.named_nodes(*final_op_types):
       is_final_op = True

       for _, child_successors in dag.bfs_successors(candidate_node):
           if any(
               isinstance(suc, DAGOpNode) and suc.name not in final_op_types
               for suc in child_successors
           ):
               is_final_op = False
               break

       if is_final_op:
           final_ops.append(candidate_node)
   ```

   was actually very fast (compared to the above), with the following being just a tiny bit better:

   ```
   bc = BlockCollector(dag)
   bc._collect_from_back = True
   bc._setup_in_degrees()
   final_ops = bc.collect_matching_block(filter_fn=lambda n: n.op.name in final_op_types)  # UPDATE: more concise
   final_ops = final_ops[::-1]  # UPDATE: reverse the ops to match the original order
   ```

   Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.
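Regarding point 2, a rough standalone illustration (plain integers standing in for DAG nodes; the sizes are arbitrary) of why hoisting the set matters: the condition of a comprehension is re-evaluated on every iteration, so the original form rebuilds `set(final_ops)` once per topological node.

```
import timeit

final_ops = list(range(2_000))
nodes = list(range(50_000))

# set(final_ops) is rebuilt for every node the comprehension tests
rebuilt = timeit.timeit(lambda: [n for n in nodes if n in set(final_ops)], number=1)

# Building the set once keeps each membership test O(1)
final_ops_set = set(final_ops)
hoisted = timeit.timeit(lambda: [n for n in nodes if n in final_ops_set], number=1)

print(f"set rebuilt per node: {rebuilt:.2f}s, hoisted set: {hoisted:.4f}s")
```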

@jakelishman
Member

Sasha: if you want to try out what it would look like with a fixed remove_op_node, try installing Rustworkx from this PR: Qiskit/rustworkx#1083 and then modifying DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node) (with no key function) and it should just work.

@mtreinish
Member Author

> Hi @mtreinish, what is a good set of examples to experiment with?

I was testing super deep circuits, like a depth of ~60k with only final measurements and only 50 qubits or so. I had seen a specific case where this pass was scaling very poorly as a function of the number of gates and was trying to fix that. TBH, I didn't look at a profile first (and I should have); I just intuitively (or for lack of intuition) assumed that iterating over dag.nodes() was the root cause of the slowdown and jumped to switching to just the output nodes.
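For concreteness, a sketch of that kind of stress case (the parameters are my own guesses based on the description above, and building a circuit this large in Python is itself slow):

```
from qiskit import QuantumCircuit
from qiskit.transpiler.passes import BarrierBeforeFinalMeasurements

num_qubits = 50   # "only 50 qubits or so"
depth = 60_000    # "a depth of ~60k"

qc = QuantumCircuit(num_qubits, num_qubits)
for _ in range(depth):
    # One layer of parallel CX gates per iteration, so total depth ~= depth
    for q in range(0, num_qubits - 1, 2):
        qc.cx(q, q + 1)
# Only final measurements, no mid-circuit measurement
qc.measure(range(num_qubits), range(num_qubits))

qct = BarrierBeforeFinalMeasurements()(qc)
```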

> Some quick observations (on my laptop):
>
> 1. By far the biggest chunk of the time is spent on `remove_op_node`, which I believe would be fixed by [Fix overhead of `DAGCircuit.remove_op_node` #11677](https://github.com/Qiskit/qiskit/issues/11677).

Oh interesting. I guess because we have to remove all the nodes in final_ops_set we end up deleting a lot of individual nodes from the dag. The specific thing I was looking at in #11677 was how the scaling of remove_op_node is poor because it's quadratic as a function of the number of arguments on the instruction. But I guess even with a lot of 2-argument ops in the dag that cost adds up there too.
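One way to picture that quadratic cost (my own toy model, not profiled against the real implementation): a generic retain-edges removal that matches each incoming edge to each outgoing edge by wire examines n × n edge pairs for an n-qubit instruction.

```
# Toy model only: removing an n-qubit barrier while reconnecting its wires
# by pairwise-matching in-edges to out-edges costs n * n comparisons.
n = 5000  # e.g. the 5000-qubit barrier in the example circuit above
in_edges = [f"wire_{i}" for i in range(n)]
out_edges = [f"wire_{i}" for i in range(n)]

checks = sum(1 for a in in_edges for b in out_edges)
print(checks)  # 25_000_000 comparisons for a single node removal
```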

> 2. The proposed change from
>
>    ```
>    ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in set(final_ops)]
>    ```
>
>    to
>
>    ```
>    final_ops_set = set(final_ops)
>    ordered_final_nodes = [node for node in dag.topological_op_nodes() if node in final_ops_set]
>    ```
>
>    is very helpful.

Nice, yeah I only fixed that out of habit when I saw it.

> 3. The collection part below
>
>    ```
>    for candidate_node in dag.named_nodes(*final_op_types):
>        is_final_op = True
>
>        for _, child_successors in dag.bfs_successors(candidate_node):
>            if any(
>                isinstance(suc, DAGOpNode) and suc.name not in final_op_types
>                for suc in child_successors
>            ):
>                is_final_op = False
>                break
>
>        if is_final_op:
>            final_ops.append(candidate_node)
>    ```
>
>    was actually very fast (compared to the above),

Yeah this was really surprising to me. I saw that loop over named_nodes() and thought it would scale very poorly, because that just does:

```
for node in dag._multi_graph.nodes():
    ...
```

and checks for either barrier or measure in insertion order, and then does the bfs_successors() call from each match. So I just assumed that going about it from the output nodes would be more efficient. But the way I wrote this ends up doing more bfs_predecessors() calls than the old search code, which ends up being much slower. I was trying to make it more efficient, but I'll just defer to your suggestion below.

> with the following being just a tiny bit better
>
> ```
> bc = BlockCollector(dag)
> bc._collect_from_back = True
> bc._setup_in_degrees()
>
> def filter_fn(node):
>     return node.op.name in final_op_types
>
> final_ops = bc.collect_matching_block(filter_fn)
> ```
>
> Unfortunately, the interface for collecting a single block is not really there (my fault), this is why I had to manually set the options and call the setup function.

Cool, I can update the PR to do this instead for the search.

@alexanderivrii
Contributor

alexanderivrii commented Feb 9, 2024

The circuit that I've used does have a 5000-qubit barrier. Without Jake's PR the time to remove it is about 20 seconds (on my laptop); with Jake's PR it's 0.1 seconds (old code) or 0.5 seconds (with the block-collection code).

Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

The discrepancy of 0.1 vs. 0.5 seconds seems to be related to the order in which nodes are removed (I will try to look a bit deeper into this).

UPDATE: oh, of course, the block collector collects nodes backwards from the end of the block, so the nodes appear in the opposite order; after this is fixed, node removal is also 0.1 seconds.

@jakelishman
Member

> Jake, I have modified DAGCircuit.remove_op_node to call self._multi_graph.remove_node_retain_edges_by_key(node._node_id), please confirm that this is correct. (Otherwise it complains that DAGOpNode cannot be interpreted as an integer).

Yeah that's right, it does have to be the index.
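For reference, a minimal monkeypatch sketch of the experiment being discussed, assuming a rustworkx build that includes Qiskit/rustworkx#1083; the real DAGCircuit.remove_op_node also updates internal bookkeeping (such as op counts) that is omitted here.

```
from qiskit.dagcircuit import DAGCircuit

def _patched_remove_op_node(self, node):
    # Pass the integer node index, not the DAGOpNode object itself,
    # and rely on rustworkx to reconnect each wire across the removed node.
    self._multi_graph.remove_node_retain_edges_by_key(node._node_id)

DAGCircuit.remove_op_node = _patched_remove_op_node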

@mtreinish mtreinish closed this Apr 7, 2024