-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Oxidize the ConsolidateBlocks pass #13368
Conversation
This commit ports the consolidate blocks pass to rust. The logic remains the same and this is just a straight porting. One optimization is that to remove the amount of python processing the Collect2qBlocks pass is no longer run as part of the preset pass managers and this is called directly in rust. This speeds up the pass because it avoids 3 crossing of the language boundary and also the intermediate creation of DAGNode python objects. The pass still supports running with Collect2qBlocks for backwards compatibility and will skip running the pass equivalent internally the field is present in the property set. There are potential improvements that can be investigated here such as avoiding in place dag contraction and moving to rebuilding the dag iteratively. Also changing the logic around estimated error (see Qiskit#11659) to be more robust. But these can be left for follow up PRs as they change the logic. Realistically we should look at combining ConsolidateBlocks for it's current two usages with Split2qUnitaries and UnitarySynthesis into those passes for more efficiency. We can improve the performance and logic as part of that refactor. See Qiskit#12007 for more details on this for UnitarySynthesis. Closes Qiskit#12250
One or more of the following people are relevant to this code:
|
While there are still 4 tests to fix here, I did a quick asv run to get a feel for the speedup so far and it yielded:
In general there is only so much we'll be able to do on the performance here because we'll be bottlenecked on the dag manipulation and |
This should be good to review now. There might be some benchmarking/profiling and tuning we want to do, but it's not a blocker. |
The test failure fixed by a test change was incorrect and masked a logic bug that was fixed in a subsequent commit. This commit reverts that change to the test and removes the release note attempting to document a fix for a bug that only existed during development of this PR.
Pull Request Test Coverage Report for Build 11693897124Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
After the most recent round of changes the overall benchmarking results look like:
It's marginally faster than the previous run now. |
This commit reworks the logic to reduce the number of Kronecker products and 2q matrix multiplications we do as part of computing the unitary of the block. It now computes the 1q components individually with 1q matrix multiplications and only calls kron() and a 2q matmul when a 2q gate is encountered. This reduces the number of more expensive operations we need to perform and replaces them with a much faster 1q matmul.
I ran the pgo scripts under a profiler to see where the pass is spending most of it's time after 62df015 and the top 4 components taking runtime are: ~42% of the time is in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. Just a few comments / questions.
I'll avoid signing off since I believe @henryzou50 is also planning to look.
@@ -195,38 +130,15 @@ def _handle_control_flow_ops(self, dag): | |||
pass_manager = PassManager() | |||
if "run_list" in self.property_set: | |||
pass_manager.append(Collect1qRuns()) | |||
if "block_list" in self.property_set: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can block_list
be specified when run_list
is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be, that's arguably the normal invocation of the pass. My thinking here was that we'll implicitly run the equivalent of Collect2qBlocks
on the control flow block regardless of the property set if we don't specify it. That's the new feature this PR adds to the pass. So we don't need to manually populate the property set with Collect2qBlocks
unless the run_list
is set because populating that will preclude the 2q blocks.
node.op.replace_blocks(pass_manager.run(block) for block in node.op.blocks), | ||
propagate_condition=False, | ||
) | ||
control_flow_nodes = dag.control_flow_op_nodes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must say, I wish that control_flow_op_nodes
just returned an empty list rather than None
, but that is beyond the scope of this PR.
.collect(); | ||
let circuit_data = CircuitData::from_packed_operations( | ||
py, | ||
block_qargs.len() as u32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is perhaps more evidence that we ought to just use usize
in any public interface of the circuit crate rather than the internal representation.
(no action requested)
Co-authored-by: Kevin Hartman <kevin@hart.mn>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks great to me! Thanks to Kevin for also reviewing. I’ve left a few suggestions for improving efficiency and memory usage, along with some additional comments and questions. Excellent work, and thanks for all the work put into this!
let blocks = match blocks { | ||
Some(runs) => runs | ||
.into_iter() | ||
.map(|run| { | ||
run.into_iter() | ||
.map(NodeIndex::new) | ||
.collect::<Vec<NodeIndex>>() | ||
}) | ||
.collect(), | ||
// If runs are specified but blocks are none we're in a legacy configuration where external | ||
// collection passes are being used. In this case don't collect blocks because it's | ||
// unexpected. | ||
None => match runs { | ||
Some(_) => vec![], | ||
None => dag.collect_2q_runs().unwrap(), | ||
}, | ||
}; | ||
|
||
let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| { | ||
runs.into_iter() | ||
.map(|run| { | ||
run.into_iter() | ||
.map(NodeIndex::new) | ||
.collect::<Vec<NodeIndex>>() | ||
}) | ||
.collect() | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To improve the memory efficiency here, we could simplify the handling of the blocks
and runs
by unifying them with or_else
and unwarp_or_else
. This approach would reduces the unnecessary temporary Vec
allocations and leveragers iterators for more efficient memory usage.
For instance, something like:
let blocks = match blocks { | |
Some(runs) => runs | |
.into_iter() | |
.map(|run| { | |
run.into_iter() | |
.map(NodeIndex::new) | |
.collect::<Vec<NodeIndex>>() | |
}) | |
.collect(), | |
// If runs are specified but blocks are none we're in a legacy configuration where external | |
// collection passes are being used. In this case don't collect blocks because it's | |
// unexpected. | |
None => match runs { | |
Some(_) => vec![], | |
None => dag.collect_2q_runs().unwrap(), | |
}, | |
}; | |
let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| { | |
runs.into_iter() | |
.map(|run| { | |
run.into_iter() | |
.map(NodeIndex::new) | |
.collect::<Vec<NodeIndex>>() | |
}) | |
.collect() | |
}); | |
// If `blocks` is `None` but `runs` is specified, `blocks` is set to `runs.clone()`. | |
// If both are `None`, it defaults to collecting 2q runs from the DAG circuit. | |
let blocks: Vec<Vec<NodeIndex>> = blocks | |
.or_else(|| runs.clone()) | |
.unwrap_or_else(|| { | |
dag.collect_2q_runs() | |
.unwrap() | |
.into_iter() | |
.map(|run| { | |
run.into_iter() | |
.map(|node_index| node_index.index()) | |
.collect() | |
}) | |
.collect() | |
}) | |
.into_iter() | |
.map(|run| run.into_iter().map(NodeIndex::new).collect()) | |
.collect(); | |
let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| { | |
runs.into_iter() | |
.map(|run| run.into_iter().map(NodeIndex::new).collect()) | |
.collect() | |
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see if I can come up with something for this. The behavior of cloning runs like this doesn't feel correct and I think probably wouldn't work correctly because it'll try to call blocks_to_matrix
with a single qubit matrix. But I'll see if I can come up with something to avoid temporary vecs. I think I did this to make the borrow checker happy, but it was long enough ago I'm not 100% sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I see a path here, the conflict is on the type mismatch between NodeIndex
and the usize
input. From a typing perspective rust won't be happy because collect_2q_runs()
returns a Vec
of NodeIndex
and we can't take that as an input. The best idea I had to avoid an extra allocation was to take NodeIndex
on the input Vec but we can't define the FromPyObject
trait for a type not defined in qiskit.
Your suggestion here doesn't actually avoid any allocations except if blocks is Some(_)
, but at the cost of a second allocation for blocks
is None
path. The blocks
is None
path is the more common path though because that's what we run in the preset pass managers, so I'd rather stick with the bare collect_2q_runs
call and have the conversion cost from the blocks
list if it's specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I understand now, and that makes sense. You're correct about the typing conflict and the allocation cost trade-offs, especially since the blocks is None
path is more common in the preset pass managers. Let’s stick with what you initially suggested with the bare collect_2q_runs
call, and handle the conversion cost from the blocks list if it's specified. Thanks for clarifying!
Co-authored-by: Henry Zou <87874865+henryzou50@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, Matt! Thanks for the latest changes!
let blocks = match blocks { | ||
Some(runs) => runs | ||
.into_iter() | ||
.map(|run| { | ||
run.into_iter() | ||
.map(NodeIndex::new) | ||
.collect::<Vec<NodeIndex>>() | ||
}) | ||
.collect(), | ||
// If runs are specified but blocks are none we're in a legacy configuration where external | ||
// collection passes are being used. In this case don't collect blocks because it's | ||
// unexpected. | ||
None => match runs { | ||
Some(_) => vec![], | ||
None => dag.collect_2q_runs().unwrap(), | ||
}, | ||
}; | ||
|
||
let runs: Option<Vec<Vec<NodeIndex>>> = runs.map(|runs| { | ||
runs.into_iter() | ||
.map(|run| { | ||
run.into_iter() | ||
.map(NodeIndex::new) | ||
.collect::<Vec<NodeIndex>>() | ||
}) | ||
.collect() | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I understand now, and that makes sense. You're correct about the typing conflict and the allocation cost trade-offs, especially since the blocks is None
path is more common in the preset pass managers. Let’s stick with what you initially suggested with the bare collect_2q_runs
call, and handle the conversion cost from the blocks list if it's specified. Thanks for clarifying!
The ConsolidateBlocks pass was ported to rust in Qiskit#13368 and as part of that implementation a small behavior difference between the rust and python interfaces was causing the pass to not work correctly with non-CX gates. The internal 2q decomposer interface stores a sentinel string for the kak gate which is used to tell the python space constructor use the python defined gate object. However in the pass code we weren't factoring this difference in, and for non-CX gates we were evaluating the basis count as the number of gates with that sentinel value name (which is almost always zero) and this was preventing the pass from consolidating many blocks that should have been. This commit fixes this issue by taking the name from python space and passing it through to the rust portion of the code and using that for the comparison. Fixes Qiskit#13459
The ConsolidateBlocks pass was ported to rust in #13368 and as part of that implementation a small behavior difference between the rust and python interfaces was causing the pass to not work correctly with non-CX gates. The internal 2q decomposer interface stores a sentinel string for the kak gate which is used to tell the python space constructor use the python defined gate object. However in the pass code we weren't factoring this difference in, and for non-CX gates we were evaluating the basis count as the number of gates with that sentinel value name (which is almost always zero) and this was preventing the pass from consolidating many blocks that should have been. This commit fixes this issue by taking the name from python space and passing it through to the rust portion of the code and using that for the comparison. Fixes #13459
The ConsolidateBlocks pass was ported to rust in #13368 and as part of that implementation a small behavior difference between the rust and python interfaces was causing the pass to not work correctly with non-CX gates. The internal 2q decomposer interface stores a sentinel string for the kak gate which is used to tell the python space constructor use the python defined gate object. However in the pass code we weren't factoring this difference in, and for non-CX gates we were evaluating the basis count as the number of gates with that sentinel value name (which is almost always zero) and this was preventing the pass from consolidating many blocks that should have been. This commit fixes this issue by taking the name from python space and passing it through to the rust portion of the code and using that for the comparison. Fixes #13459 (cherry picked from commit c792426)
The ConsolidateBlocks pass was ported to rust in #13368 and as part of that implementation a small behavior difference between the rust and python interfaces was causing the pass to not work correctly with non-CX gates. The internal 2q decomposer interface stores a sentinel string for the kak gate which is used to tell the python space constructor use the python defined gate object. However in the pass code we weren't factoring this difference in, and for non-CX gates we were evaluating the basis count as the number of gates with that sentinel value name (which is almost always zero) and this was preventing the pass from consolidating many blocks that should have been. This commit fixes this issue by taking the name from python space and passing it through to the rust portion of the code and using that for the comparison. Fixes #13459 (cherry picked from commit c792426) Co-authored-by: Matthew Treinish <mtreinish@kortar.org>
Summary
This commit ports the consolidate blocks pass to rust. The logic remains the same and this is just a straight porting. One optimization is that to remove the amount of python processing the Collect2qBlocks pass is no longer run as part of the preset pass managers and this is called directly in rust. This speeds up the pass because it avoids 3 crossing of the language boundary and also the intermediate creation of DAGNode python objects. The pass still supports running with Collect2qBlocks for backwards compatibility and will skip running the pass equivalent internally the field is present in the property set.
There are potential improvements that can be investigated here such as avoiding in place dag contraction and moving to rebuilding the dag iteratively. Also changing the logic around estimated error (see #11659) to be more robust. But these can be left for follow up PRs as they change the logic.
Realistically we should look at combining ConsolidateBlocks for it's current two usages with Split2qUnitaries and UnitarySynthesis into those passes for more efficiency. We can improve the performance and logic as part of that refactor. See #12007 for more details on this for UnitarySynthesis.
Details and comments
Closes #12250
TODO: