
Classical Optimisation


1. The role of the classical optimiser

The aim of VQE is to find the electron configuration (henceforth referred to as the state) of a particular molecule which has the lowest energy. This will be the ground state of the molecule. The state is described using certain parameters, known as the ansatz parameters. VQE is a hybrid quantum-classical algorithm – a quantum computer is used for the parts of the problem which are difficult to simulate classically, and a classical computer is used for the rest. In our case, we use a quantum computer to prepare the state for particular values of the ansatz parameters and to measure the energy of the state. Given the results, the classical computer then chooses the next values of the parameters of the state for the quantum computer to prepare. We therefore require a classical optimiser, which guides the parameter values with the aim of finding the optimum value of some function (in our case, the minimum value of the energy). Use of the quantum computer is expensive, and so we would like to achieve this with as few evaluations of the energy as possible.
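To make this loop concrete, here is a minimal sketch in Python using SciPy. The function evaluate_energy_on_qpu is a hypothetical placeholder standing in for the quantum computer, and the toy expression inside it is purely illustrative; in a real VQE run it would prepare the ansatz state for the given parameter values on the device and return the measured energy.

```python
import numpy as np
from scipy.optimize import minimize

def evaluate_energy_on_qpu(params):
    # Hypothetical placeholder for the quantum computer: a real VQE run
    # would prepare the ansatz state for these parameter values and
    # return the measured energy. Here we use a toy expression instead.
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

# The classical optimiser repeatedly chooses new parameter values and
# asks the "quantum computer" for the corresponding energy.
initial_params = np.array([0.5, 0.5])
result = minimize(evaluate_energy_on_qpu, initial_params, method="Nelder-Mead")
print("Estimated minimum energy:", result.fun)
print("Ansatz parameters found:", result.x)
```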

The main points to consider when choosing a classical optimiser are:

  • How likely is the method to find the global minimum rather than a local minimum?
  • How many function evaluations are required?
  • How well does the method cope with errors in function evaluations?

2. Methods of optimisation

2.1 Grid search

Let us first consider a simple one-dimensional example, shown in figure 1. There is a single parameter, given by the x-axis, which ranges from 0 to 1. The energy as a function of this parameter is plotted. A simple way to find the minimum value of the energy is to calculate its value for a range of values of the parameter, as shown in figure 2. This is known as a grid search. We can then see that the black circle is the location of the minimum. However, we recall that we would like to make as few evaluations of the energy as possible, and so we might wonder if we could have found the minimum without sampling the parameter so finely everywhere, perhaps by using previous energy evaluations as a guide. On the other hand, taking too few samples can lead to convergence on the wrong location. For example, if we had only sampled parameter values less than approximately 0.4, we might have thought that the blue circle was the minimum. Such a point is known as a local minimum whilst the overall minimum value is known as the global minimum.

Figure 1. A simple one-dimensional function of which we wish to find the minimum.
Figure 2. Performing an optimisation of the function shown in figure 1 using a grid search. The global minimum is circled in black whilst a local minimum is circled in blue.
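A grid search is straightforward to implement. The sketch below uses a toy one-dimensional energy function of our own (an assumption, not the function plotted in the figures) with a local minimum near x ≈ 0.21 and a global minimum near x ≈ 0.63:

```python
import numpy as np

def energy(x):
    # Toy function with a local minimum near x ~ 0.21 and a global
    # minimum near x ~ 0.63.
    return np.cos(15 * x) * (0.5 + x)

grid = np.linspace(0.0, 1.0, 51)     # 51 evenly spaced sample points
energies = energy(grid)              # one evaluation per grid point
best = np.argmin(energies)
print(f"Lowest energy {energies[best]:.3f} found at x = {grid[best]:.2f}")
```

Note that all 51 evaluations are spent regardless of what the earlier evaluations revealed; none of the information gathered along the way is used to guide the search.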

We will now consider a two-dimensional example, illustrated in figure 3. Both parameters range from 0 to 1. The energy as a function of the two parameters is shown by the colour, with the dark purple colour indicating the lowest energy values. The white lines are contours along which the energy has a constant value. There is a single global minimum together with two local minima. The function is also shown in figure 4, where the value of the energy is given by the height of the surface.

Figure 3. A two-dimensional function of which we wish to find the minimum. The value of the energy is shown by the colour, with the darker colours being lower energy values. The white lines are contours along which the energy has a constant value. There is a single global minimum together with two local minima.
Figure 4. The two-dimensional function also illustrated in figure 3 but with the energy value given by the surface height. The surface is viewed from the (1, 1) direction. The global minimum can be seen at the front, where the surface is lowest.

Performing a grid search over this two-dimensional parameter space will clearly require more energy evaluations than for the one-dimensional example. Indeed, the number of evaluations is exponential in the number of parameters. This number can become prohibitively large as the number of parameters increases. Other optimisation methods are therefore required and we outline some of them here, using the two-dimensional example as an illustration.
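A quick calculation illustrates this scaling: with m sample points per parameter, a grid over n parameters requires m^n energy evaluations.

```python
# Grid-search cost grows exponentially with the number of parameters.
points_per_parameter = 51
for n_parameters in (1, 2, 4, 8):
    evaluations = points_per_parameter ** n_parameters
    print(f"{n_parameters} parameter(s): {evaluations:,} energy evaluations")
```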

2.2 Direct search methods

Instead of moving across the energy surface in a pre-determined manner, as for a grid search, the values of the energy already evaluated could be used to determine at what parameter values to evaluate the energy next. Methods which act in this way are known as direct search methods. An example of such a method is the Nelder-Mead method (see, for example, https://en.wikipedia.org/wiki/Nelder-Mead_method). At each stage, the algorithm maintains a simplex of points (an n-dimensional shape with n+1 vertices, where n is the number of parameters) and adjusts their locations depending on the energy values at each vertex. Gradually, the simplex moves across the surface and eventually shrinks so that the vertices converge to the same point. An example of the progression of the Nelder-Mead algorithm is shown in figure 5. However, the method does not always find the global minimum; figure 6 shows an example of the algorithm converging on one of the local minima.


Figure 5. An example progression of the Nelder-Mead method where the simplex points converge upon the global minimum.


Figure 6. An example progression of the Nelder-Mead method where the simplex points converge upon a local minimum and not the global minimum.
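In practice, the Nelder-Mead method is available off the shelf, for example through SciPy. The sketch below runs it from two different starting points on a toy two-parameter energy function of our own (an assumption; not the surface in the figures); different starts may converge to different minima, mirroring figures 5 and 6:

```python
import numpy as np
from scipy.optimize import minimize

def energy(params):
    # Toy two-parameter surface with several minima in [0, 1] x [0, 1].
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

for start in ([0.8, 0.4], [0.2, 0.8]):
    result = minimize(energy, start, method="Nelder-Mead")
    print(f"start {start} -> energy {result.fun:.3f} at {np.round(result.x, 3)}")
```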

2.3 Gradient-based methods

Gradient-based methods make use of the derivative (or gradient) of the function with respect to the parameters. A simple way to calculate the gradient is using finite differences – each parameter is changed by a small amount in turn and the resulting change in the function is used to calculate the derivative with respect to that parameter. Having calculated the gradient, a line search can then be performed along the direction in which the function decreases most rapidly (the direction of steepest descent) to find the point of minimum energy in that direction. The gradient at the new location is then calculated, and the process repeated until the gradient is sufficiently small.
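A minimal sketch of this idea follows, assuming the same toy energy function as above; for simplicity it takes a fixed step along the negative gradient rather than performing a full line search:

```python
import numpy as np

def energy(params):
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

def finite_difference_gradient(f, params, h=1e-6):
    # Shift each parameter in turn and use the change in f to estimate
    # the partial derivative (forward differences).
    grad = np.zeros_like(params)
    f0 = f(params)
    for i in range(len(params)):
        shifted = params.copy()
        shifted[i] += h
        grad[i] = (f(shifted) - f0) / h
    return grad

params = np.array([0.9, 0.2])
for iteration in range(200):
    grad = finite_difference_gradient(energy, params)
    if np.linalg.norm(grad) < 1e-5:    # stop once the gradient is small
        break
    params = params - 0.05 * grad      # fixed step instead of a line search
print("Converged to", np.round(params, 3), "with energy", round(energy(params), 3))
```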

This method is illustrated in figure 7. The starting point is the white cross on the left hand side, labelled “0”. The arrow shows the direction of steepest descent. In figure 8, the energy as a function of distance along this direction is shown, with the cross showing the point that was chosen to be the new location. Clearly the true minimum along this direction was not found; however, only a handful of function evaluations were required to find the point and it is considered sufficiently good. The new point is labelled “1” in figure 7. The full optimisation from this starting point is shown in figure 9, and we can see that it reaches the global minimum after five iterations. However, in figure 10, we show the same process from a different starting point and can see that this time the method converges to one of the local minima.

Figure 7. An illustration of the first stage of a gradient descent optimisation. The cross labelled “0” is the starting location of the method, with the arrow indicating the direction of steepest descent. The point labelled “1” is the position along this direction that is chosen to be the starting point of the next algorithm iteration.
Figure 8. The energy as a function of distance along the direction of steepest descent indicated in figure 7. The cross indicates the distance along this direction of the position of the next point. This is the point labelled “1” in figure 7.
Figure 9. A full optimisation using gradient descent which reaches the global minimum. The points are numbered sequentially with the dotted lines indicating the directions of steepest descent from each point.
Figure 10. A full optimisation using gradient descent which reaches one of the local minima. The points are numbered sequentially with the dotted lines indicating the directions of steepest descent from each point.

Moving directly along the direction of steepest descent is often not the most efficient approach, so methods typically adapt the gradient in some way (a short SciPy sketch follows this list). Examples include:

  • the conjugate gradient method;
  • Newton methods, which make use of the Hessian (the matrix of second derivatives);
  • quasi-Newton methods, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm, which calculate an approximation to the Hessian.
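These methods are also available through SciPy. The sketch below runs BFGS on the toy energy function used earlier (an assumption); if no analytic gradient is supplied, SciPy approximates it internally using finite differences:

```python
import numpy as np
from scipy.optimize import minimize

def energy(params):
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

result = minimize(energy, x0=np.array([0.9, 0.2]), method="BFGS")
print("Minimum", round(result.fun, 3), "at", np.round(result.x, 3))
print("Energy evaluations used:", result.nfev)
```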

2.4 Metaphor-based metaheuristics

Particle Swarm Optimisation is an example of a metaphor-based metaheuristic. It was originally intended for simulating social behaviour, representing the movement of flocks of birds or schools of fish, but was found to explore a parameter space in a useful manner. The particles search the space partly at random and partly guided by the best locations found by each individual particle and by the swarm as a whole. The method can search large parameter spaces and is good at finding global minima, but there are no guarantees on its performance due to its heuristic nature. An example of how the particles move over our two-dimensional function is shown in figure 11.

Figure 11. An example progression of Particle Swarm Optimisation.
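A minimal sketch of Particle Swarm Optimisation on the same toy energy function follows; the inertia and attraction coefficients below are commonly used textbook values, and real implementations add many refinements:

```python
import numpy as np

def energy(params):
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

rng = np.random.default_rng(0)
n_particles, n_steps = 20, 50
pos = rng.uniform(0.0, 1.0, size=(n_particles, 2))    # particle positions
vel = np.zeros_like(pos)                              # particle velocities
best_pos = pos.copy()                                 # per-particle best positions
best_val = np.array([energy(p) for p in pos])
swarm_best = best_pos[np.argmin(best_val)].copy()     # best found by the swarm

for _ in range(n_steps):
    r1, r2 = rng.random((2, n_particles, 1))
    # Velocity mixes inertia, a pull towards each particle's own best
    # location and a pull towards the swarm's best location.
    vel = 0.7 * vel + 1.5 * r1 * (best_pos - pos) + 1.5 * r2 * (swarm_best - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    vals = np.array([energy(p) for p in pos])
    improved = vals < best_val
    best_pos[improved], best_val[improved] = pos[improved], vals[improved]
    swarm_best = best_pos[np.argmin(best_val)].copy()

print("Best energy", round(best_val.min(), 3), "at", np.round(swarm_best, 3))
```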

Particle Swarm Optimisation is just one example of a metaphor-based metaheuristic – see https://en.wikipedia.org/wiki/List_of_metaphor-based_metaheuristics for many more methods.

2.5 Other methods

The discussion above is meant to provide a brief overview of some methods of optimisation, and is by no means an exhaustive list.

3. Noisy measurements

Each time we prepare the state and measure its energy, the value returned is not the true energy, and so we must make multiple measurements and take the mean. As the number of measurements increases, this mean more closely approximates the energy. Therefore, the number of measurements we make of a particular state controls the error in the expectation value of the energy. When designing the process of optimisation, we must bear the presence of errors in mind, as some methods can deal with errors better than others. It is also possible to adjust methods for the presence of errors. For example, Nelder-Mead can be adapted so that the number of samples increases (and hence error decreases) as the optimisation process progresses.
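The sketch below illustrates the effect of averaging, using a simple Gaussian noise model (an assumption; real device noise is more complicated). The error on the mean shrinks as the number of shots grows:

```python
import numpy as np

rng = np.random.default_rng(1)
true_energy = -1.137   # illustrative value for the true energy

def measure_energy(n_shots):
    # Each shot returns the true energy plus noise; the standard error
    # of the mean falls as 1/sqrt(n_shots).
    samples = true_energy + rng.normal(0.0, 0.1, size=n_shots)
    return samples.mean()

for shots in (10, 100, 1000, 10000):
    estimate = measure_energy(shots)
    print(f"{shots:>6} shots: estimate {estimate:.4f}, "
          f"error {abs(estimate - true_energy):.4f}")
```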

4. The optimiser within QCK

The default optimiser is SciPy's implementation of the Nelder-Mead simplex method. It takes several parameters which can be varied to affect the performance of the optimisation (see https://docs.scipy.org/doc/scipy/reference/optimize.minimize-neldermead.html#optimize-minimize-neldermead).
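For example, convergence tolerances and the iteration limit can be passed through the options dictionary (the values below are illustrative choices, not QCK defaults, and the toy objective stands in for a real energy evaluation):

```python
import numpy as np
from scipy.optimize import minimize

def energy(params):
    x, y = params
    return np.sin(5 * x) * np.sin(5 * y) + 0.5 * ((x - 0.7) ** 2 + (y - 0.7) ** 2)

result = minimize(
    energy,
    x0=np.array([0.5, 0.5]),
    method="Nelder-Mead",
    options={
        "xatol": 1e-4,    # convergence tolerance on the parameters
        "fatol": 1e-4,    # convergence tolerance on the energy value
        "maxiter": 500,   # cap on the number of iterations
    },
)
print(result.x, result.fun)
```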

The scipy.optimize.minimize routine also provides implementations of other optimisers (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html), including gradient-based ones, all of which take parameters which can be tuned to vary the performance of the methods. Increased flexibility can be obtained through writing separate implementations of the methods.

Many Python packages exist with implementations of other optimisation methods. As above, however, writing a separate implementation may be preferable when greater flexibility or control is required.