Add experimental support of cuQuantum #1400

Merged: 33 commits, Mar 1, 2022
Commits
309c73d
add cuStateVec support
doichanj Dec 13, 2021
54dc128
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Dec 13, 2021
a5bc75e
delete space
doichanj Dec 13, 2021
b1bd96e
Merge branch 'main' into cuStatevec
chriseclectic Dec 14, 2021
a40898c
disable batched shots optimization for cuStateVec
doichanj Dec 15, 2021
adfc125
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Dec 15, 2021
26c4538
Fix cuStateVec test fails
doichanj Dec 15, 2021
87afff5
Fix qasm_simulator.py
doichanj Dec 16, 2021
f16a35c
update for the latest cuQuantum / added diagonal matrix
doichanj Jan 4, 2022
5533b76
resolved conflict
doichanj Jan 4, 2022
0c10325
add more cuStateVec support / refactor qubitvector_thrust and chunk_c…
doichanj Jan 18, 2022
181eb2c
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Jan 18, 2022
54d1a68
Merge branch 'main' into cuStatevec
doichanj Jan 18, 2022
4d502ed
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Jan 18, 2022
eba2594
Fix norm() for Thrust CPU
doichanj Jan 18, 2022
5a93807
change cuStateVec from device to option
doichanj Jan 26, 2022
983773b
Fix unchanged device=cuStateVec
doichanj Jan 26, 2022
5bea04d
Add build option to link cuStateVec statically
doichanj Jan 27, 2022
1fb5031
removed whitespace
doichanj Jan 27, 2022
1d01542
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Jan 27, 2022
da0f42d
Merge branch 'main' into cuStatevec
doichanj Jan 31, 2022
c781208
reflecting review comments
doichanj Feb 1, 2022
0f4a93e
added release note
doichanj Feb 1, 2022
c509131
set cuStateVec_enable to False as default, added test cases for cuSta…
doichanj Feb 3, 2022
5458b7c
Merge remote-tracking branch 'upstream/main' into cuStatevec
doichanj Feb 3, 2022
61083cb
Merge branch 'main' into cuStatevec
doichanj Feb 3, 2022
046036d
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Feb 3, 2022
3a31cef
Fix omp setting for non-GPU / Fix omp nested loops
doichanj Feb 4, 2022
de4c978
Merge branch 'main' into cuStatevec
doichanj Feb 7, 2022
88d7d95
Implemented optimized rotation gates
doichanj Feb 14, 2022
3ffabcf
Merge branch 'cuStatevec' of github.com:doichanj/qiskit-aer into cuSt…
doichanj Feb 14, 2022
7cf50ee
Merge branch 'main' into cuStatevec
doichanj Feb 14, 2022
879a4ac
Merge branch 'main' into cuStatevec
hhorii Feb 24, 2022
9 changes: 9 additions & 0 deletions CMakeLists.txt
@@ -257,6 +257,15 @@ if(AER_THRUST_SUPPORTED)

set(AER_COMPILER_DEFINITIONS ${AER_COMPILER_DEFINITIONS} THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA)
set(THRUST_DEPENDENT_LIBS "")
if(CUSTATEVEC_ROOT)
set(AER_COMPILER_DEFINITIONS ${AER_COMPILER_DEFINITIONS} AER_CUSTATEVEC)
set(AER_COMPILER_FLAGS "${AER_COMPILER_FLAGS} -I${CUSTATEVEC_ROOT}/include")
if(CUSTATEVEC_STATIC)
set(THRUST_DEPENDANT_LIBS "-L${CUSTATEVEC_ROOT}/lib -L${CUSTATEVEC_ROOT}/lib64 -lcustatevec_static -L${CUDA_TOOLKIT_ROOT_DIR}/lib64 -lcublas")
else()
set(THRUST_DEPENDANT_LIBS "-L${CUSTATEVEC_ROOT}/lib -L${CUSTATEVEC_ROOT}/lib64 -lcustatevec")
endif()
endif()
elseif(AER_THRUST_BACKEND STREQUAL "TBB")
message(STATUS "TBB Support found!")
set(THRUST_DEPENDENT_LIBS AER_DEPENDENCY_PKG::tbb)
28 changes: 28 additions & 0 deletions CONTRIBUTING.md
@@ -643,6 +643,34 @@ Few notes on GPU builds:
3. We don't need NVIDIA® drivers for building, but we need them for running simulations
4. Only Linux platforms are supported

Qiskit Aer now supports cuQuantum, NVIDIA®'s optimized quantum computing APIs.
The cuStateVec APIs can be used to accelerate the statevector, density_matrix and unitary methods.
Because cuQuantum is currently a beta release, some operations are not yet accelerated by cuStateVec.

To build Qiskit Aer with cuStateVec support, set `CUSTATEVEC_ROOT` to the path of the cuQuantum root directory.

For example,

qiskit-aer$ python ./setup.py bdist_wheel -- -DAER_THRUST_BACKEND=CUDA -DCUSTATEVEC_ROOT=path_to_cuQuantum

To link the cuQuantum library statically, also pass `CUSTATEVEC_STATIC` to setup.py.
Otherwise, you also have to set the `LD_LIBRARY_PATH` environment variable to the path of the cuQuantum libraries.
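
As a rough illustration of the two build variants described above, the following Python sketch assembles the `setup.py` argument list. The helper name `build_command`, the example path, and the `=True` spelling of the `CUSTATEVEC_STATIC` define are assumptions made here for illustration; the text only says that `CUSTATEVEC_STATIC` must be set.

```python
import shlex


def build_command(custatevec_root, static=False):
    """Return the setup.py invocation (as an argument list) for a cuStateVec build."""
    args = [
        "python", "./setup.py", "bdist_wheel", "--",
        "-DAER_THRUST_BACKEND=CUDA",
        f"-DCUSTATEVEC_ROOT={custatevec_root}",
    ]
    if static:
        # Assumed spelling: any CMake-truthy value satisfies if(CUSTATEVEC_STATIC).
        args.append("-DCUSTATEVEC_STATIC=True")
    return args


# Hypothetical install location, used only to show the resulting command line.
print(shlex.join(build_command("/opt/nvidia/cuquantum", static=True)))
```

With `static=False` the last define is omitted, and `LD_LIBRARY_PATH` then has to point at the cuQuantum libraries at run time.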

To run with cuStateVec, set `device='GPU'` as an AerSimulator option and pass `cuStateVec_enable=True` as an option to the execute method.

```
from qiskit import execute
from qiskit.providers.aer import AerSimulator
sim = AerSimulator(method='statevector', device='GPU')
results = execute(circuit, sim, cuStateVec_enable=True).result()
```

Density matrix and unitary matrix simulations can be accelerated in the same way.
```
sim = AerSimulator(method='density_matrix', device='GPU')
results = execute(circuit, sim, cuStateVec_enable=True).result()
```



### Building with MPI support

Qiskit Aer can parallelize its simulation on the cluster systems by using MPI.
11 changes: 11 additions & 0 deletions qiskit/providers/aer/backends/aer_simulator.py
@@ -148,6 +148,10 @@ class AerSimulator(AerBackend):
initialization or with :meth:`set_options`. The list of supported devices
for the current system can be returned using :meth:`available_devices`.

If AerSimulator is built with cuStateVec support, cuStateVec APIs are enabled
by setting ``cuStateVec_enable=True``. This is an experimental implementation
based on cuQuantum Beta 2.

**Additional Backend Options**

The following simulator specific backend options are supported
@@ -216,6 +220,11 @@ class AerSimulator(AerBackend):
values (16 Bytes). If set to 0, the maximum will be automatically
set to the system memory size (Default: 0).

* ``cuStateVec_enable`` (bool): This option enables acceleration by the
cuStateVec library of NVIDIA's cuQuantum, which has highly optimized
kernels for GPUs (Default: False). This option will be ignored
if AerSimulator is not built with cuStateVec support.

* ``blocking_enable`` (bool): This option enables parallelization with
multiple GPUs or multiple processes with MPI (CPU/GPU). This option
is only available for ``"statevector"``, ``"density_matrix"`` and
@@ -514,6 +523,8 @@ def _default_options(cls):
memory=None,
noise_model=None,
seed_simulator=None,
# cuStateVec (cuQuantum) option
cuStateVec_enable=False,
# cache blocking for multi-GPUs/MPI options
blocking_qubits=None,
blocking_enable=False,
19 changes: 12 additions & 7 deletions qiskit/providers/aer/backends/qasm_simulator.py
@@ -339,9 +339,9 @@ class QasmSimulator(AerBackend):
}

_SIMULATION_METHODS = [
'automatic', 'statevector', 'statevector_gpu',
'automatic', 'statevector', 'statevector_gpu', 'statevector_custatevec',
'statevector_thrust', 'density_matrix',
'density_matrix_gpu', 'density_matrix_thrust',
'density_matrix_gpu', 'density_matrix_custatevec', 'density_matrix_thrust',
'stabilizer', 'matrix_product_state', 'extended_stabilizer'
]

@@ -595,7 +595,8 @@ def _basis_gates(self):
def _method_basis_gates(self):
"""Return method basis gates and custom instructions"""
method = self._options.get('method', None)
if method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
if method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
return sorted([
'u1', 'u2', 'u3', 'u', 'p', 'r', 'rx', 'ry', 'rz', 'id', 'x',
'y', 'z', 'h', 's', 'sdg', 'sx', 'sxdg', 't', 'tdg', 'swap', 'cx',
@@ -628,15 +629,17 @@ def _custom_instructions(self):
return self._options_configuration['custom_instructions']

method = self._options.get('method', None)
if method in ['statevector', 'statevector_gpu', 'statevector_thrust']:
if method in ['statevector', 'statevector_gpu',
'statevector_custatevec', 'statevector_thrust']:
return sorted([
'quantum_channel', 'qerror_loc', 'roerror', 'kraus', 'snapshot', 'save_expval',
'save_expval_var', 'save_probabilities', 'save_probabilities_dict',
'save_amplitudes', 'save_amplitudes_sq', 'save_state',
'save_density_matrix', 'save_statevector', 'save_statevector_dict',
'set_statevector'
])
if method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
if method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
return sorted([
'quantum_channel', 'qerror_loc', 'roerror', 'kraus', 'superop', 'snapshot',
'save_expval', 'save_expval_var', 'save_probabilities', 'save_probabilities_dict',
@@ -666,10 +669,12 @@ def _custom_instructions(self):
def _set_method_config(self, method=None):
"""Set non-basis gate options when setting method"""
# Update configuration description and number of qubits
if method in ['statevector', 'statevector_gpu', 'statevector_thrust']:
if method in ['statevector', 'statevector_gpu',
'statevector_custatevec', 'statevector_thrust']:
description = 'A C++ statevector simulator with noise'
n_qubits = MAX_QUBITS_STATEVECTOR
elif method in ['density_matrix', 'density_matrix_gpu', 'density_matrix_thrust']:
elif method in ['density_matrix', 'density_matrix_gpu',
'density_matrix_custatevec', 'density_matrix_thrust']:
description = 'A C++ density matrix simulator with noise'
n_qubits = MAX_QUBITS_STATEVECTOR // 2
elif method == 'matrix_product_state':
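
The same method lists recur in `_method_basis_gates`, `_custom_instructions` and `_set_method_config`. A minimal sketch of what `_set_method_config` decides for the two families touched here follows. `MAX_QUBITS_STATEVECTOR = 32` and the helper name `method_config` are placeholders chosen for illustration, not values taken from Aer. The halving reflects that an n-qubit density matrix has 4^n entries, the same count as a 2n-qubit statevector.

```python
# Placeholder cap for illustration only; the real constant is defined in qasm_simulator.py.
MAX_QUBITS_STATEVECTOR = 32

STATEVECTOR_METHODS = frozenset([
    "statevector", "statevector_gpu",
    "statevector_custatevec", "statevector_thrust",
])
DENSITY_MATRIX_METHODS = frozenset([
    "density_matrix", "density_matrix_gpu",
    "density_matrix_custatevec", "density_matrix_thrust",
])


def method_config(method):
    """Return (description, max qubits) for the statevector and density matrix families."""
    if method in STATEVECTOR_METHODS:
        return "A C++ statevector simulator with noise", MAX_QUBITS_STATEVECTOR
    if method in DENSITY_MATRIX_METHODS:
        # A density matrix needs twice the qubits' worth of storage.
        return "A C++ density matrix simulator with noise", MAX_QUBITS_STATEVECTOR // 2
    raise ValueError(f"method not covered by this sketch: {method!r}")


print(method_config("density_matrix_custatevec"))
```

Keeping each family in one named set is only one possible design; the diff itself extends each inline list separately.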
13 changes: 13 additions & 0 deletions releasenotes/notes/cuQuantum-support-d33abe5b1cb778a8.yaml
@@ -0,0 +1,13 @@
---
features:
- |
Added support for cuQuantum, NVIDIA's APIs for quantum computing,
to accelerate statevector, density matrix and unitary simulators
by using GPUs.
This is an experimental implementation for cuQuantum Beta 2 (0.1.0).
The cuStateVec APIs are used in place of Aer's own implementations
when Aer is built with ``CUSTATEVEC_ROOT`` set to the path of cuQuantum.
(A binary distribution is not currently available.)
cuStateVec is enabled by setting ``device='GPU'`` and the
``cuStateVec_enable=True`` option.
72 changes: 57 additions & 15 deletions src/controllers/aer_controller.hpp
@@ -377,6 +377,8 @@ class Controller {
int_t batched_shots_gpu_max_qubits_ = 16; //multi-shot parallelization is applied if qubits is less than max qubits
bool enable_batch_multi_shots_ = false; //multi-shot parallelization can be applied

//settings for cuStateVec
bool cuStateVec_enable_ = false;
};

//=========================================================================
@@ -466,6 +468,12 @@ void Controller::set_config(const json_t &config) {
JSON::get_value(batched_shots_gpu_max_qubits_, "batched_shots_gpu_max_qubits", config);
}

//cuStateVec configs
cuStateVec_enable_ = false;
if(JSON::check_key("cuStateVec_enable", config)) {
JSON::get_value(cuStateVec_enable_, "cuStateVec_enable", config);
}

// Override automatic simulation method with a fixed method
std::string method;
if (JSON::get_value(method, "method", config)) {
@@ -489,6 +497,9 @@ void Controller::set_config(const json_t &config) {
}
}

if(method_ == Method::density_matrix || method_ == Method::unitary)
batched_shots_gpu_max_qubits_ /= 2;
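
Read together, the two configuration hunks above amount to the following, modeled here in Python. This is a simplified sketch rather than the controller code: `read_gpu_config` is a hypothetical name, `config` is a plain dict standing in for the JSON object, and the default of 16 is the initializer shown for `batched_shots_gpu_max_qubits_`.

```python
def read_gpu_config(config, method):
    """Model of the cuStateVec flag and the batched-shots qubit limit."""
    # The flag is reset to False, then overwritten only if the key is present.
    custatevec_enable = bool(config.get("cuStateVec_enable", False))

    max_qubits = int(config.get("batched_shots_gpu_max_qubits", 16))
    if method in ("density_matrix", "unitary"):
        # These methods store a matrix, so the qubit limit is halved.
        max_qubits //= 2
    return custatevec_enable, max_qubits


print(read_gpu_config({"cuStateVec_enable": True}, "density_matrix"))
```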

// Override automatic simulation method with a fixed method
if (JSON::get_value(sim_device_name_, "device", config)) {
if (sim_device_name_ == "CPU") {
@@ -502,18 +513,37 @@ void Controller::set_config(const json_t &config) {
#endif
} else if (sim_device_name_ == "GPU") {
#ifndef AER_THRUST_CUDA
throw std::runtime_error(
"Simulation device \"GPU\" is not supported on this system");
throw std::runtime_error(
"Simulation device \"GPU\" is not supported on this system");
#else
int nDev;
if (cudaGetDeviceCount(&nDev) != cudaSuccess) {
cudaGetLastError();
throw std::runtime_error("No CUDA device available!");
}

sim_device_ = Device::GPU;
#ifndef AER_CUSTATEVEC
if(cuStateVec_enable_){
//Aer is not built for cuStateVec
throw std::runtime_error(
"Simulation device \"GPU\" does not support cuStateVec on this system");
}
#endif
int nDev;
if (cudaGetDeviceCount(&nDev) != cudaSuccess) {
cudaGetLastError();
throw std::runtime_error("No CUDA device available!");
}
sim_device_ = Device::GPU;

#ifdef AER_CUSTATEVEC
if(cuStateVec_enable_){
//initialize cuStateVec handle once before actual calculation (takes long time at first call)
custatevecStatus_t err;
custatevecHandle_t stHandle;
err = custatevecCreate(&stHandle);
if(err == CUSTATEVEC_STATUS_SUCCESS){
custatevecDestroy(stHandle);
}
}
#endif
#endif
}
else {
throw std::runtime_error(std::string("Invalid simulation device (\"") +
sim_device_name_ + std::string("\")."));
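
The order of checks in the GPU branch can be modeled as below. This Python sketch only mirrors the control flow visible in the hunk; the function name and its boolean parameters (standing in for the `AER_THRUST_CUDA` and `AER_CUSTATEVEC` compile-time macros and for the outcome of `cudaGetDeviceCount`) are assumptions for illustration.

```python
def select_gpu(cuda_built, custatevec_built, custatevec_enable, device_query_ok):
    """Return (device, warm_up_handle) or raise, following the checks in the hunk."""
    if not cuda_built:
        raise RuntimeError('Simulation device "GPU" is not supported on this system')
    if custatevec_enable and not custatevec_built:
        raise RuntimeError(
            'Simulation device "GPU" does not support cuStateVec on this system')
    if not device_query_ok:
        raise RuntimeError("No CUDA device available!")
    # A handle is created and destroyed once up front because the first
    # creation is slow; this flag says whether that warm-up would run.
    return "GPU", custatevec_enable and custatevec_built


print(select_gpu(True, True, True, True))
```

Note that in this flow a request for cuStateVec on a GPU build without cuStateVec support raises before the device count is queried.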
@@ -636,9 +666,16 @@ void Controller::set_parallelization_circuit(const Circuit &circ,
const Method method)
{
enable_batch_multi_shots_ = false;
if(batched_shots_gpu_ && sim_device_ == Device::GPU && circ.shots > 1 && max_batched_states_ >= num_gpus_ &&
batched_shots_gpu_max_qubits_ >= circ.num_qubits ){
enable_batch_multi_shots_ = true;
if(batched_shots_gpu_ && sim_device_ == Device::GPU &&
circ.shots > 1 && max_batched_states_ >= num_gpus_ &&
batched_shots_gpu_max_qubits_ >= circ.num_qubits ){
enable_batch_multi_shots_ = true;
}

if(sim_device_ == Device::GPU && cuStateVec_enable_){
enable_batch_multi_shots_ = false; //cuStateVec does not support batch execution of multi-shots
parallel_shots_ = 1; //cuStateVec is currently not thread safe
return;
Review comment (Collaborator):

If cuStateVec_enable=True is configured in AerSimulator.run(), parallel_state_update_ is not set. This will cause a performance regression if an application accidentally sets cuStateVec_enable with device='CPU'.

Review comment:

Question: when enable_batch_multi_shots_=true, would you create nShots copies of the statevector for parallelization? If so, and if I understand correctly, I think a proper "workaround" is to create multiple cuStateVec handles (or just retain and reuse a pool of handles created at init time to reduce overhead) and use them in parallel.

IMHO, though, it's beyond a "workaround": even after we fix the thread safety issue, it is generally still challenging for library handles to be shared by multiple host threads. For example, although cuBLAS supports this usage pattern, they explicitly recommend against it. Thus the handle pool approach is commonly seen in ML/DL frameworks.
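
The handle pool idea raised in this comment can be sketched generically in Python. Nothing here is a cuStateVec API: `FakeHandle` is a stand-in object, and `HandlePool`, `borrow` and `run_shot` are hypothetical names. The point shown is only that each worker thread borrows a handle exclusively, so no handle is ever shared between threads at the same time.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeHandle:
    """Stand-in for a library handle (hypothetical; not a cuStateVec object)."""

    def __init__(self, ident):
        self.ident = ident
        self.uses = 0


class HandlePool:
    """Create a fixed number of handles once and lend them out one at a time."""

    def __init__(self, size):
        self._free = queue.Queue()
        for i in range(size):
            self._free.put(FakeHandle(i))

    @contextmanager
    def borrow(self):
        handle = self._free.get()  # blocks until some handle is free
        try:
            yield handle
        finally:
            self._free.put(handle)  # always return the handle to the pool


def run_shot(pool, shot):
    with pool.borrow() as handle:
        handle.uses += 1  # only the borrowing thread touches this handle
        return shot, handle.ident


pool = HandlePool(size=2)
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(lambda s: run_shot(pool, s), range(8)))
print(len(results), "shots processed")
```

Four worker threads compete for two handles here, so some shots wait; sizing the pool to the number of worker threads would avoid that wait at the cost of more handles.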

Review comment (Collaborator, Author):

enable_batch_multi_shots_=true is not currently applicable to cuStateVec: in that mode multiple state vectors are calculated in a single CUDA kernel and each state vector refers to classical registers to handle branch operations, which is not implemented in cuStateVec.
Multiple cuStateVec handles are required when enable_batch_multi_shots_=false and shot-level parallelization is required. In this case, state vectors are calculated independently using OpenMP threads. (Currently cuStateVec is not thread safe, so we disable OpenMP parallelization.)

Review comment:

Thanks for the explanation, @doichanj. I understand better now. So once we fix thread safety, we can unblock you for shot-level parallelization.

}
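
To make the early return above concrete, here is a small Python model of the decision. It is a sketch under assumptions: `plan_shots` is a hypothetical name, devices are plain strings, and `None` for the second element means "left to the later parallelization logic", which this hunk does not show.

```python
def plan_shots(*, batched_shots_gpu, device, shots, max_batched_states,
               num_gpus, batched_max_qubits, num_qubits, custatevec_enable):
    """Return (enable_batch_multi_shots, parallel_shots) as decided in this hunk."""
    batch = (batched_shots_gpu and device == "GPU" and shots > 1
             and max_batched_states >= num_gpus
             and batched_max_qubits >= num_qubits)
    if device == "GPU" and custatevec_enable:
        # cuStateVec: no batched multi-shot execution, and shots run one at a
        # time because the library is described as not thread safe.
        return False, 1
    return batch, None


settings = dict(batched_shots_gpu=True, device="GPU", shots=100,
                max_batched_states=4, num_gpus=1,
                batched_max_qubits=16, num_qubits=10)
print(plan_shots(custatevec_enable=True, **settings))
```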

if(explicit_parallelization_)
@@ -785,6 +822,7 @@ size_t Controller::get_gpu_memory_mb() {
}
num_gpus_ = nDev;
#endif

#ifdef AER_MPI
// get minimum memory size per process
uint64_t locMem, minMem;
@@ -866,7 +904,6 @@ Result Controller::execute(const inputdata_t &input_qobj) {
auto time_taken =
std::chrono::duration<double>(myclock_t::now() - timer_start).count();
result.metadata.add(time_taken, "time_taken");

return result;
} catch (std::exception &e) {
// qobj was invalid, return valid output containing error message
@@ -959,7 +996,7 @@ Result Controller::execute(std::vector<Circuit> &circuits,
const int NUM_RESULTS = result.results.size();
//following looks very similar but we have to separate them to avoid omp nested loops that causes performance degradation
//(DO NOT use if statement in #pragma omp)
if (parallel_experiments_ == 1) {
if (parallel_experiments_ == 1 || sim_device_ == Device::ThrustCPU) {
for (int j = 0; j < NUM_RESULTS; ++j) {
set_parallelization_circuit(circuits[j], noise_model, methods[j]);
run_circuit(circuits[j], noise_model,methods[j],
@@ -1439,7 +1476,7 @@ void Controller::run_circuit_without_sampled_noise(Circuit &circ,
// Check if measure sampler and optimization are valid
if (can_sample) {
// Implement measure sampler
if (parallel_shots_ <= 1) {
if (parallel_shots_ <= 1 || sim_device_ == Device::GPU || sim_device_ == Device::ThrustCPU) {
state.set_max_matrix_qubits(max_bits);
RngEngine rng;
rng.set_seed(circ.seed);
@@ -1460,7 +1497,7 @@ void Controller::run_circuit_without_sampled_noise(Circuit &circ,
shot_state.set_parallelization(parallel_state_update_);
shot_state.set_global_phase(circ.global_phase_angle);

state.set_max_matrix_qubits(max_bits);
shot_state.set_max_matrix_qubits(max_bits);

RngEngine rng;
rng.set_seed(circ.seed + i);
@@ -1736,7 +1773,12 @@ void Controller::measure_sampler(
shots_or_index = shots;
else
shots_or_index = shot_index;

auto timer_start = myclock_t::now();
auto all_samples = state.sample_measure(meas_qubits, shots_or_index, rng);
auto time_taken =
std::chrono::duration<double>(myclock_t::now() - timer_start).count();
result.metadata.add(time_taken, "sample_measure_time");
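
The timing pattern added here (take a timestamp, run the call, store the elapsed seconds under a metadata key) can be sketched in Python as follows. The helper name `timed` and the use of a plain dict for metadata are assumptions for illustration; only the key name `sample_measure_time` comes from the diff.

```python
import time


def timed(metadata, key, fn, *args):
    """Call fn(*args), record the elapsed wall-clock seconds under metadata[key]."""
    start = time.perf_counter()
    value = fn(*args)
    metadata[key] = time.perf_counter() - start
    return value


metadata = {}
total = timed(metadata, "sample_measure_time", sum, [1, 2, 3])
print(total, metadata["sample_measure_time"] >= 0.0)
```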

// Make qubit map of position in vector of measured qubits
std::unordered_map<uint_t, uint_t> qubit_map;