From 643b684e335c47689e7a3ad8daadaf6bf2ba347c Mon Sep 17 00:00:00 2001
From: "Documenter.jl"
Date: Sat, 23 Mar 2024 00:50:47 +0000
Subject: [PATCH] build based on 7cc8328

---
 dev/.documenter-siteinfo.json | 2 +-
 dev/LICENSE/index.html | 2 +-
 dev/algo/adam_adamax/index.html | 2 +-
 dev/algo/brent/index.html | 2 +-
 dev/algo/cg/index.html | 2 +-
 dev/algo/complex/index.html | 2 +-
 dev/algo/goldensection/index.html | 2 +-
 dev/algo/gradientdescent/index.html | 2 +-
 dev/algo/index.html | 2 +-
 dev/algo/ipnewton/index.html | 2 +-
 dev/algo/lbfgs/index.html | 2 +-
 dev/algo/linesearch/index.html | 2 +-
 dev/algo/manifolds/index.html | 2 +-
 dev/algo/nelder_mead/index.html | 2 +-
 dev/algo/newton/index.html | 2 +-
 dev/algo/newton_trust_region/index.html | 2 +-
 dev/algo/ngmres/index.html | 2 +-
 dev/algo/particle_swarm/index.html | 2 +-
 dev/algo/precondition/index.html | 2 +-
 dev/algo/samin/index.html | 2 +-
 dev/algo/simulated_annealing/index.html | 2 +-
 dev/dev/contributing/index.html | 2 +-
 dev/dev/index.html | 2 +-
 dev/examples/generated/ipnewton_basics/index.html | 2 +-
 dev/examples/generated/maxlikenlm/index.html | 2 +-
 dev/examples/generated/rasch/index.html | 2 +-
 dev/index.html | 2 +-
 dev/user/algochoice/index.html | 2 +-
 dev/user/config/index.html | 2 +-
 dev/user/gradientsandhessians/index.html | 2 +-
 dev/user/minimization/index.html | 2 +-
 dev/user/tipsandtricks/index.html | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/dev/.documenter-siteinfo.json b/dev/.documenter-siteinfo.json
index 1510fa8f..4a2ed58a 100644
--- a/dev/.documenter-siteinfo.json
+++ b/dev/.documenter-siteinfo.json
@@ -1 +1 @@
-{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-21T13:54:23","documenter_version":"1.3.0"}}
\ No newline at end of file
+{"documenter":{"julia_version":"1.10.2","generation_timestamp":"2024-03-23T00:50:42","documenter_version":"1.3.0"}}
\ No newline at end of file
diff --git a/dev/LICENSE/index.html b/dev/LICENSE/index.html
index caa0c457..dc86c73a 100644
--- a/dev/LICENSE/index.html
+++ b/dev/LICENSE/index.html
@@ -1,2 +1,2 @@
-License · Optim

Optim.jl is licensed under the MIT License:

Copyright (c) 2012: John Myles White, Tim Holy, and other contributors. Copyright (c) 2016: Patrick Kofod Mogensen, John Myles White, Tim Holy, and other contributors. Copyright (c) 2017: Patrick Kofod Mogensen, Asbjørn Nilsen Riseth, John Myles White, Tim Holy, and other contributors.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

+License · Optim

Optim.jl is licensed under the MIT License:

Copyright (c) 2012: John Myles White, Tim Holy, and other contributors. Copyright (c) 2016: Patrick Kofod Mogensen, John Myles White, Tim Holy, and other contributors. Copyright (c) 2017: Patrick Kofod Mogensen, Asbjørn Nilsen Riseth, John Myles White, Tim Holy, and other contributors.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

diff --git a/dev/algo/adam_adamax/index.html b/dev/algo/adam_adamax/index.html index c78f7028..3816a41b 100644 --- a/dev/algo/adam_adamax/index.html +++ b/dev/algo/adam_adamax/index.html @@ -5,4 +5,4 @@ epsilon=1e-8)

where alpha is the step length or learning parameter. beta_mean and beta_var are exponential decay parameters for the first and second moment estimates. Setting these closer to 0 will cause past iterates to matter less for the current steps and setting them closer to 1 means emphasizing past iterates more. epsilon should rarely be changed, and just exists to avoid a division by 0.

AdaMax(; alpha=0.002,
          beta_mean=0.9,
          beta_var=0.999,
-         epsilon=1e-8)

where alpha is the step length or learning parameter. beta_mean and beta_var are exponential decay parameters for the first and second moment estimates. Setting these closer to 0 will cause past iterates to matter less for the current steps and setting them closer to 1 means emphasizing past iterates more.

References

Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

+ epsilon=1e-8)

where alpha is the step length or learning parameter. beta_mean and beta_var are exponential decay parameters for the first and second moment estimates. Setting these closer to 0 will cause past iterates to matter less for the current steps and setting them closer to 1 means emphasizing past iterates more.
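To make the role of these keywords concrete, here is a minimal sketch of the AdaMax update as described by Kingma and Ba (2014). It is not Optim.jl's implementation: the function name `adamax_sketch`, the `iterations` keyword, and the way `epsilon` is added to the denominator are assumptions made for illustration.

```julia
# Minimal AdaMax sketch after Kingma & Ba (2014); NOT Optim.jl's implementation.
# `adamax_sketch` and `iterations` are hypothetical names used for illustration.
function adamax_sketch(grad, x0; alpha=0.002, beta_mean=0.9, beta_var=0.999,
                       epsilon=1e-8, iterations=1000)
    x = float.(copy(x0))
    m = zero(x)   # exponentially decayed first-moment estimate
    u = zero(x)   # exponentially weighted infinity norm of past gradients
    for t in 1:iterations
        g = grad(x)
        m .= beta_mean .* m .+ (1 - beta_mean) .* g
        u .= max.(beta_var .* u, abs.(g))
        # Bias-corrected step; adding epsilon to guard against u == 0 is an assumption here.
        x .-= (alpha / (1 - beta_mean^t)) .* m ./ (u .+ epsilon)
    end
    return x
end

# f(x) = sum(abs2, x) has gradient 2x and its minimizer at the origin.
x = adamax_sketch(x -> 2 .* x, [1.0, -2.0]; iterations=5000)
```

With beta_mean closer to 0, `m` tracks the latest gradient almost exactly; with beta_var closer to 1, `u` forgets large past gradients more slowly, which keeps the steps small for longer.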

References

Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

diff --git a/dev/algo/brent/index.html b/dev/algo/brent/index.html index 64508ba2..858bb8d2 100644 --- a/dev/algo/brent/index.html +++ b/dev/algo/brent/index.html @@ -1,2 +1,2 @@ -Brent's Method · Optim
+Brent's Method · Optim
diff --git a/dev/algo/cg/index.html b/dev/algo/cg/index.html index 1b11fd73..a6f3651c 100644 --- a/dev/algo/cg/index.html +++ b/dev/algo/cg/index.html @@ -42,4 +42,4 @@ * stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 53 - * Gradient Calls: 53

We see that for this objective and starting point, ConjugateGradient() requires fewer gradient evaluations to reach convergence.

References

+ * Gradient Calls: 53

We see that for this objective and starting point, ConjugateGradient() requires fewer gradient evaluations to reach convergence.

References

diff --git a/dev/algo/complex/index.html b/dev/algo/complex/index.html index 5e2818e7..68bf4d2f 100644 --- a/dev/algo/complex/index.html +++ b/dev/algo/complex/index.html @@ -62,4 +62,4 @@ * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 48 - * Gradient Calls: 48

Automatic differentiation support for complex inputs may come when Cassette.jl is ready.

References

+ * Gradient Calls: 48

Automatic differentiation support for complex inputs may come when Cassette.jl is ready.

References

diff --git a/dev/algo/goldensection/index.html b/dev/algo/goldensection/index.html index 5723f7a5..654e005e 100644 --- a/dev/algo/goldensection/index.html +++ b/dev/algo/goldensection/index.html @@ -1,2 +1,2 @@ -Golden Section · Optim
+Golden Section · Optim
diff --git a/dev/algo/gradientdescent/index.html b/dev/algo/gradientdescent/index.html index df233073..d58ecb4c 100644 --- a/dev/algo/gradientdescent/index.html +++ b/dev/algo/gradientdescent/index.html @@ -2,4 +2,4 @@ Gradient Descent · Optim

Gradient Descent

Constructor

GradientDescent(; alphaguess = LineSearches.InitialPrevious(),
                   linesearch = LineSearches.HagerZhang(),
                   P = nothing,
-                  precondprep = (P, x) -> nothing)

Description

Gradient Descent is a common name for a quasi-Newton solver. This means that it takes steps according to

\[x_{n+1} = x_n - P^{-1}\nabla f(x_n)\]

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced as follows

\[x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)\]

and is chosen by a linesearch algorithm such that each step gives sufficient descent.

Example

References

+ precondprep = (P, x) -> nothing)

Description

Gradient Descent is a common name for a quasi-Newton solver. This means that it takes steps according to

\[x_{n+1} = x_n - P^{-1}\nabla f(x_n)\]

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In Gradient Descent, $P$ is simply an appropriately dimensioned identity matrix, such that we go in the exact opposite direction of the gradient. This means that we do not use the curvature information from the Hessian, or an approximation of it. While it does seem quite logical to go in the opposite direction of the fastest increase in objective value, the procedure can be very slow if the problem is ill-conditioned. See the section on preconditioners for ways to remedy this when using Gradient Descent.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced as follows

\[x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)\]

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
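The update above can be sketched in a few lines. The sketch below is only an illustration: it uses a plain backtracking (Armijo) line search rather than the HagerZhang default from the constructor, and the names `gradient_descent_sketch`, `iterations`, and `g_tol` are hypothetical.

```julia
using LinearAlgebra

# Preconditioned gradient descent with a simple backtracking (Armijo) line search.
# This is an illustrative stand-in, not the HagerZhang line search used by default.
function gradient_descent_sketch(f, grad, x0; P=Diagonal(ones(length(x0))),
                                 iterations=10_000, g_tol=1e-8)
    x = float.(copy(x0))
    for _ in 1:iterations
        g = grad(x)
        norm(g, Inf) <= g_tol && break
        d = -(P \ g)                 # direction -P^{-1} ∇f(x_n)
        alpha = 1.0
        # Halve alpha until the sufficient-decrease condition holds.
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * dot(g, d)
            alpha /= 2
        end
        x += alpha * d
    end
    return x
end

# Ill-conditioned quadratic: f(x) = x1^2 + 100 x2^2, minimizer at the origin.
f(x) = x[1]^2 + 100 * x[2]^2
grad(x) = [2 * x[1], 200 * x[2]]
x_plain = gradient_descent_sketch(f, grad, [1.0, 1.0])
x_precond = gradient_descent_sketch(f, grad, [1.0, 1.0]; P=Diagonal([2.0, 200.0]))
```

With the identity as $P$ the iteration needs many small steps on this problem, whereas choosing $P$ equal to the (diagonal) Hessian removes the ill-conditioning and the first full step already lands on the minimizer.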

Example

References

diff --git a/dev/algo/index.html b/dev/algo/index.html index 3ee39758..5b70561c 100644 --- a/dev/algo/index.html +++ b/dev/algo/index.html @@ -1,2 +1,2 @@ -Solvers · Optim
+Solvers · Optim
diff --git a/dev/algo/ipnewton/index.html b/dev/algo/ipnewton/index.html index a5105670..0c55fbf3 100644 --- a/dev/algo/ipnewton/index.html +++ b/dev/algo/ipnewton/index.html @@ -1,4 +1,4 @@ Interior point Newton · Optim

Interior point Newton method

Optim.IPNewtonType

Interior-point Newton

Constructor

IPNewton(; linesearch::Function = Optim.backtrack_constrained_grad,
          μ0::Union{Symbol,Number} = :auto,
-         show_linesearch::Bool = false)

The initial barrier penalty coefficient μ0 can be chosen as a number, or set to :auto to let the algorithm decide its value, see initialize_μ_λ!.

Note: For constrained optimization problems, we recommend always enabling allow_f_increases and successive_f_tol in the options passed to optimize. The default is set to Optim.Options(allow_f_increases = true, successive_f_tol = 2).

As of February 2018, the line search algorithm is specialised for constrained interior-point methods. In future we hope to support more algorithms from LineSearches.jl.

Description

The IPNewton method implements an interior-point primal-dual Newton algorithm for solving nonlinear, constrained optimization problems. See Nocedal and Wright (Ch. 19, 2006) for a discussion of interior-point methods for constrained optimization.

References

The algorithm was originally written by Tim Holy (@timholy, tim.holy@gmail.com).

  • J Nocedal, SJ Wright (2006), Numerical optimization, second edition. Springer.
  • A Wächter, LT Biegler (2006), On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming 106 (1), 25-57.
source

Examples

+ show_linesearch::Bool = false)

The initial barrier penalty coefficient μ0 can be chosen as a number, or set to :auto to let the algorithm decide its value, see initialize_μ_λ!.

Note: For constrained optimization problems, we recommend always enabling allow_f_increases and successive_f_tol in the options passed to optimize. The default is set to Optim.Options(allow_f_increases = true, successive_f_tol = 2).

As of February 2018, the line search algorithm is specialised for constrained interior-point methods. In future we hope to support more algorithms from LineSearches.jl.

Description

The IPNewton method implements an interior-point primal-dual Newton algorithm for solving nonlinear, constrained optimization problems. See Nocedal and Wright (Ch. 19, 2006) for a discussion of interior-point methods for constrained optimization.

References

The algorithm was originally written by Tim Holy (@timholy, tim.holy@gmail.com).

source

Examples

diff --git a/dev/algo/lbfgs/index.html b/dev/algo/lbfgs/index.html index cf75ae31..6d588567 100644 --- a/dev/algo/lbfgs/index.html +++ b/dev/algo/lbfgs/index.html @@ -9,4 +9,4 @@ P = nothing, precondprep = (P, x) -> nothing, manifold = Flat(), - scaleinvH0::Bool = true && (typeof(P) <: Nothing))

Description

This means that it takes steps according to

\[x_{n+1} = x_n - P^{-1}\nabla f(x_n)\]

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. As long as the initial matrix is positive definite it is possible to show that all the following matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian.

There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter differs from the former in that it does not use the complete history of the iterative procedure to construct $P$, but only the latest $m$ steps. It also does not build the Hessian approximation matrix explicitly, but computes the direction directly. This makes it more suitable for large-scale problems, where the memory required to store a full Hessian approximation grows quickly with the problem size.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced as follows

\[x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)\]

and is chosen by a linesearch algorithm such that each step gives sufficient descent.

Example

References

Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer

+ scaleinvH0::Bool = true && (typeof(P) <: Nothing))

Description

This means that it takes steps according to

\[x_{n+1} = x_n - P^{-1}\nabla f(x_n)\]

where $P$ is a positive definite matrix. If $P$ is the Hessian, we get Newton's method. In (L-)BFGS, the matrix is an approximation to the Hessian built using differences in the gradient across iterations. As long as the initial matrix is positive definite it is possible to show that all the following matrices will be as well. The starting matrix could simply be the identity matrix, such that the first step is identical to the Gradient Descent algorithm, or even the actual Hessian.

There are two versions of BFGS in the package: BFGS, and L-BFGS. The latter differs from the former in that it does not use the complete history of the iterative procedure to construct $P$, but only the latest $m$ steps. It also does not build the Hessian approximation matrix explicitly, but computes the direction directly. This makes it more suitable for large-scale problems, where the memory required to store a full Hessian approximation grows quickly with the problem size.

As with the other quasi-Newton solvers in this package, a scalar $\alpha$ is introduced as follows

\[x_{n+1} = x_n - \alpha P^{-1}\nabla f(x_n)\]

and is chosen by a linesearch algorithm such that each step gives sufficient descent.
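Computing the direction directly from the latest $m$ steps is usually done with the two-loop recursion described in Nocedal and Wright (2006). The sketch below is a textbook version under stated assumptions, not Optim.jl's code: the name `two_loop_sketch`, the argument layout, and the choice of initial scaling are all illustrative.

```julia
using LinearAlgebra

# Textbook two-loop recursion; `two_loop_sketch` is a hypothetical name.
# S[i] = x_{i+1} - x_i and Y[i] = ∇f(x_{i+1}) - ∇f(x_i), stored oldest pair first.
# Returns an approximation of P^{-1} g without ever forming a matrix.
function two_loop_sketch(g, S, Y)
    q = copy(g)
    k = length(S)
    rho = [1 / dot(Y[i], S[i]) for i in 1:k]
    a = zeros(k)
    for i in k:-1:1                      # newest pair to oldest
        a[i] = rho[i] * dot(S[i], q)
        q .-= a[i] .* Y[i]
    end
    # Scale by sᵀy / yᵀy of the newest pair, a common initial inverse-Hessian guess.
    k > 0 && (q .*= dot(S[k], Y[k]) / dot(Y[k], Y[k]))
    for i in 1:k                         # oldest pair to newest
        b = rho[i] * dot(Y[i], q)
        q .+= (a[i] - b) .* S[i]
    end
    return q                             # the search direction is -q
end

# Pairs generated from a quadratic with Hessian A, so that y = A s.
A = [2.0 0.0; 0.0 10.0]
S = [[1.0, 0.0], [0.5, 1.0]]
Y = [A * s for s in S]
g = [1.0, 1.0]
d = -two_loop_sketch(g, S, Y)
```

Only the $2m$ vectors in `S` and `Y` are stored, instead of an $n \times n$ matrix. A quick sanity check is the secant condition: applying the recursion to the newest `Y[end]` returns the newest `S[end]`.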

Example

References

Wright, Stephen, and Jorge Nocedal (2006) "Numerical optimization." Springer

diff --git a/dev/algo/linesearch/index.html b/dev/algo/linesearch/index.html index 9c5f1941..7bef31e9 100644 --- a/dev/algo/linesearch/index.html +++ b/dev/algo/linesearch/index.html @@ -37,4 +37,4 @@ * Reached Maximum Number of Iterations: false * Objective Calls: 17 * Gradient Calls: 17 - * Hessian Calls: 14

References

+ * Hessian Calls: 14

References

diff --git a/dev/algo/manifolds/index.html b/dev/algo/manifolds/index.html index fc2f6f77..8d13f933 100644 --- a/dev/algo/manifolds/index.html +++ b/dev/algo/manifolds/index.html @@ -7,4 +7,4 @@ x0 = randn(n) manif = Optim.Sphere() -Optim.optimize(f, g!, x0, Optim.ConjugateGradient(manifold=manif))

Supported solvers and manifolds

All first-order optimization methods are supported.

The following manifolds are currently supported:

The following meta-manifolds construct manifolds out of pre-existing ones:

See test/multivariate/manifolds.jl for usage examples.

Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,g,x) and retract!(M::YourManifold,x). If you implement another manifold or optimization method, please contribute a PR!

References

The Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tomás A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. & Appl., 20(2), 303–353

Optimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton University Press, 2008

+Optim.optimize(f, g!, x0, Optim.ConjugateGradient(manifold=manif))

Supported solvers and manifolds

All first-order optimization methods are supported.

The following manifolds are currently supported:

The following meta-manifolds construct manifolds out of pre-existing ones:

See test/multivariate/manifolds.jl for usage examples.

Implementing new manifolds is as simple as adding methods project_tangent!(M::YourManifold,g,x) and retract!(M::YourManifold,x). If you implement another manifold or optimization method, please contribute a PR!
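As a rough illustration of what these two methods do, here is a self-contained sketch for the unit sphere. The type `UnitSphereSketch` is hypothetical, and the functions are defined standalone here rather than as extensions of Optim's own `project_tangent!` and `retract!`; a real implementation would add methods to the package's functions for a type that fits its manifold hierarchy.

```julia
using LinearAlgebra

struct UnitSphereSketch end   # hypothetical manifold type, for illustration only

# Map x back onto the manifold: for the unit sphere, normalize in place.
retract!(::UnitSphereSketch, x) = (x ./= norm(x))

# Remove from g its component along x, leaving a vector tangent to the sphere at x.
project_tangent!(::UnitSphereSketch, g, x) = (g .-= real(dot(x, g)) .* x)

M = UnitSphereSketch()
x = retract!(M, [3.0, 4.0])             # approximately [0.6, 0.8]
g = project_tangent!(M, [1.0, 0.0], x)  # now orthogonal to x
```

Both functions mutate their argument in place, matching the `!` convention in the method names above.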

References

The Geometry of Algorithms with Orthogonality Constraints, Alan Edelman, Tomás A. Arias, Steven T. Smith, SIAM. J. Matrix Anal. & Appl., 20(2), 303–353

Optimization Algorithms on Matrix Manifolds, P.-A. Absil, R. Mahony, R. Sepulchre, Princeton University Press, 2008

diff --git a/dev/algo/nelder_mead/index.html b/dev/algo/nelder_mead/index.html index d298ecb3..98eb2a11 100644 --- a/dev/algo/nelder_mead/index.html +++ b/dev/algo/nelder_mead/index.html @@ -17,4 +17,4 @@ initial_simplex[j+1][j] += initial_simplex[j+1][j] != zero(T) ? S.b * initial_simplex[j+1][j] : S.a end initial_simplex -end

The parameters of Nelder-Mead

The different types of steps in the algorithm are governed by four parameters: $\alpha$ for the reflection, $\beta$ for the expansion, $\gamma$ for the contraction, and $\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by

\[\alpha = 1, \quad \beta = 1+2/n,\quad \gamma = 0.75 - 1/(2n),\quad \delta = 1-1/n\]

It is also possible to specify the original parameters from Nelder and Mead (1965)

\[\alpha = 1,\quad \beta = 2, \quad\gamma = 1/2, \quad\delta = 1/2\]

by specifying parameters = Optim.FixedParameters(). For specifying custom values, parameters = Optim.FixedParameters(α = a, β = b, γ = g, δ = d) is used, where a, b, g, d are the chosen values. If another parameter specification is wanted, it is possible to create a custom subtype of Optim.NMParameters, and add a method to the parameters function. It should take the new type as the first positional argument, and the dimensionality of x as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to FixedParameters.

References

Nelder, John A. and R. Mead (1965). "A simplex method for function minimization". Computer Journal 7: 308–313. doi:10.1093/comjnl/7.4.308.

Lagarias, Jeffrey C., et al. "Convergence properties of the Nelder–Mead simplex method in low dimensions." SIAM Journal on optimization 9.1 (1998): 112-147.

Gao, Fuchang and Lixing Han (2010). "Implementing the Nelder-Mead simplex algorithm with adaptive parameters". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]

+end

The parameters of Nelder-Mead

The different types of steps in the algorithm are governed by four parameters: $\alpha$ for the reflection, $\beta$ for the expansion, $\gamma$ for the contraction, and $\delta$ for the shrink step. We default to the adaptive parameters scheme in Gao and Han (2010). These are based on the dimensionality of the problem, and are given by

\[\alpha = 1, \quad \beta = 1+2/n,\quad \gamma = 0.75 - 1/(2n),\quad \delta = 1-1/n\]

It is also possible to specify the original parameters from Nelder and Mead (1965)

\[\alpha = 1,\quad \beta = 2, \quad\gamma = 1/2, \quad\delta = 1/2\]

by specifying parameters = Optim.FixedParameters(). For specifying custom values, parameters = Optim.FixedParameters(α = a, β = b, γ = g, δ = d) is used, where a, b, g, d are the chosen values. If another parameter specification is wanted, it is possible to create a custom subtype of Optim.NMParameters, and add a method to the parameters function. It should take the new type as the first positional argument, and the dimensionality of x as the second positional argument, and return a 4-tuple of parameters. However, it will often be easier to simply supply the wanted parameters to FixedParameters.
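The two parameter schemes can be written out as plain functions that return the 4-tuple (α, β, γ, δ). The function names below are hypothetical and independent of Optim's `parameters` function; they only evaluate the formulas given above.

```julia
# Adaptive parameters of Gao and Han (2010) for an n-dimensional problem.
# Hypothetical helper, not part of Optim.jl.
adaptive_parameters_sketch(n) = (1.0, 1 + 2 / n, 0.75 - 1 / (2n), 1 - 1 / n)

# Original fixed parameters of Nelder and Mead (1965).
fixed_parameters_sketch() = (1.0, 2.0, 0.5, 0.5)

α, β, γ, δ = adaptive_parameters_sketch(10)
```

For $n = 2$ the adaptive formulas reduce to the original values, so the two schemes only differ in higher dimensions, where the adaptive expansion becomes more cautious and the contraction and shrink steps become milder.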

References

Nelder, John A. and R. Mead (1965). "A simplex method for function minimization". Computer Journal 7: 308–313. doi:10.1093/comjnl/7.4.308.

Lagarias, Jeffrey C., et al. "Convergence properties of the Nelder–Mead simplex method in low dimensions." SIAM Journal on optimization 9.1 (1998): 112-147.

Gao, Fuchang and Lixing Han (2010). "Implementing the Nelder-Mead simplex algorithm with adaptive parameters". Computational Optimization and Applications [DOI 10.1007/s10589-010-9329-3]

diff --git a/dev/algo/newton/index.html b/dev/algo/newton/index.html index 247d5e6f..08ef01d9 100644 --- a/dev/algo/newton/index.html +++ b/dev/algo/newton/index.html @@ -1,3 +1,3 @@ Newton · Optim

Newton's Method

Constructor

Newton(; alphaguess = LineSearches.InitialStatic(),
-         linesearch = LineSearches.HagerZhang())

The constructor takes two keywords:

  • linesearch = a(d, x, p, x_new, g_new, phi0, dphi0, c), a function performing line search, see the line search section.
  • alphaguess = a(state, dphi0, d), a function for setting the initial guess for the line search algorithm, see the line search section.

Description

Newton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it.

Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector.

\[\nabla f(x) = 0\]

A second order Taylor expansion of the left-hand side leads to the iterative scheme

\[x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n)\]

where the inverse is not calculated directly, but the step $\textbf{s}$ is instead calculated by solving

\[H(x_n) \textbf{s} = \nabla f(x_n).\]

This is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$

\[m_k(s) = f(x_n) + \nabla f(x_n)^\top \textbf{s} + \frac{1}{2} \textbf{s}^\top H(x_n) \textbf{s}\]

For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent.

In a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\textbf{s}$. This amounts to replacing the step formula above with

\[x_{n+1} = x_n - \alpha \textbf{s}\]

and finding a scalar $\alpha$ such that we get sufficient descent; see the line search section for more information.

Additionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in PositiveFactorizations.jl.

Example

show the example from the issue

References

+ linesearch = LineSearches.HagerZhang())

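The constructor takes two keywords:

  • linesearch = a(d, x, p, x_new, g_new, phi0, dphi0, c), a function performing line search, see the line search section.
  • alphaguess = a(state, dphi0, d), a function for setting the initial guess for the line search algorithm, see the line search section.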

Description

Newton's method for optimization has a long history, and is in some sense the gold standard in unconstrained optimization of smooth functions, at least from a theoretical viewpoint. The main benefit is that it has a quadratic rate of convergence near a local optimum. The main disadvantage is that the user has to provide a Hessian. This can be difficult, complicated, or simply annoying. It can also be computationally expensive to calculate it.

Newton's method for optimization consists of applying Newton's method for solving systems of equations, where the equations are the first order conditions, saying that the gradient should equal the zero vector.

\[\nabla f(x) = 0\]

A second order Taylor expansion of the left-hand side leads to the iterative scheme

\[x_{n+1} = x_n - H(x_n)^{-1}\nabla f(x_n)\]

where the inverse is not calculated directly, but the step $\textbf{s}$ is instead calculated by solving

\[H(x_n) \textbf{s} = \nabla f(x_n).\]

This is equivalent to minimizing a quadratic model, $m_k$ around the current $x_n$

\[m_k(s) = f(x_n) + \nabla f(x_n)^\top \textbf{s} + \frac{1}{2} \textbf{s}^\top H(x_n) \textbf{s}\]

For functions where $H(x_n)$ is difficult, or computationally expensive to obtain, we might replace the Hessian with another positive definite matrix that approximates it. Such methods are called Quasi-Newton methods; see (L-)BFGS and Gradient Descent.

In a sufficiently small neighborhood around the minimizer, Newton's method has quadratic convergence, but globally it might have slower convergence, or it might even diverge. To ensure convergence, a line search is performed for each $\textbf{s}$. This amounts to replacing the step formula above with

\[x_{n+1} = x_n - \alpha \textbf{s}\]

and finding a scalar $\alpha$ such that we get sufficient descent; see the line search section for more information.

Additionally, if the function is locally concave, the step taken in the formulas above will go in a direction of ascent, as the Hessian will not be positive (semi)definite. To avoid this, we use a specialized method to calculate the step direction. If the Hessian is positive semidefinite then the method used is standard, but if it is not, a correction is made using the functionality in PositiveFactorizations.jl.
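A single step of the scheme above might look as follows. This is a simplified sketch: it assumes the Hessian is positive definite and uses a plain Cholesky factorization, so it does not include the PositiveFactorizations.jl correction mentioned above, and `alpha` is passed in rather than chosen by a line search. The name `newton_step_sketch` is hypothetical.

```julia
using LinearAlgebra

# One damped Newton step: solve H(x_n) s = ∇f(x_n) by factorization instead of forming H^{-1}.
# Assumes a positive definite Hessian; cholesky throws an error otherwise.
function newton_step_sketch(grad, hess, x; alpha=1.0)
    s = cholesky(Symmetric(hess(x))) \ grad(x)
    return x - alpha * s
end

# f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2 is quadratic, so one full step reaches the minimizer.
grad(x) = [2 * (x[1] - 1), 20 * (x[2] + 2)]
hess(x) = [2.0 0.0; 0.0 20.0]
x_next = newton_step_sketch(grad, hess, [5.0, 5.0])
```

On a quadratic objective the quadratic model is exact, which is why the full step with `alpha = 1` lands on the minimizer $(1, -2)$ up to rounding.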

Example

show the example from the issue

References

diff --git a/dev/algo/newton_trust_region/index.html b/dev/algo/newton_trust_region/index.html index 0bce453c..e7810a61 100644 --- a/dev/algo/newton_trust_region/index.html +++ b/dev/algo/newton_trust_region/index.html @@ -5,4 +5,4 @@ rho_lower = 0.25, rho_upper = 0.75)

The constructor takes keywords that determine the initial and maximal size of the trust region, when to grow and shrink the region, and how close the function should be to the quadratic approximation. The notation follows chapter four of Numerical Optimization. Below, rho $=\rho$ refers to the ratio of the actual function change to the change in the quadratic approximation for a given step.

Description

Newton's method with a trust region is designed to take advantage of the second-order information in a function's Hessian, but with more stability than Newton's method when functions are not globally well-approximated by a quadratic. This is achieved by repeatedly minimizing quadratic approximations within a dynamically-sized "trust region" in which the function is assumed to be locally quadratic [1].

Newton's method optimizes a quadratic approximation to a function. When a function is well approximated by a quadratic (for example, near an optimum), Newton's method converges very quickly by exploiting the second-order information in the Hessian matrix. However, when the function is not well-approximated by a quadratic, either because the starting point is far from the optimum or the function has a more irregular shape, Newton steps can be erratically large, leading to distant, irrelevant areas of the space.

Trust region methods use second-order information but restrict the steps to be within a "trust region" where the function is believed to be approximately quadratic. At iteration $k$, a trust region method chooses a step $p$ to minimize a quadratic approximation to the objective such that the step size is no larger than a given trust region size, $\Delta_k$.

\[\underset{p\in\mathbb{R}^n}\min m_k(p) = f_k + g_k^T p + \frac{1}{2}p^T B_k p \quad\textrm{such that } ||p||\le \Delta_k\]

Here, $p$ is the step to take at iteration $k$, so that $x_{k+1} = x_k + p$. In the definition of $m_k(p)$, $f_k = f(x_k)$ is the value at the previous location, $g_k=\nabla f(x_k)$ is the gradient at the previous location, $B_k = \nabla^2 f(x_k)$ is the Hessian matrix at the previous iterate, and $||\cdot||$ is the Euclidean norm.

If the trust region size, $\Delta_k$, is large enough that the minimizer of the quadratic approximation $m_k(p)$ has $||p|| \le \Delta_k$, then the step is the same as an ordinary Newton step. However, if the unconstrained quadratic minimizer lies outside the trust region, then the minimizer to the constrained problem will occur on the boundary, i.e. we will have $||p|| = \Delta_k$. It turns out that when the Cholesky decomposition of $B_k$ can be computed, the optimal $p$ can be found numerically with relative ease ([1], section 4.3). This is the method currently used in Optim.

It makes sense to adapt the trust region size, $\Delta_k$, as one moves through the space and assesses the quality of the quadratic fit. This adaptation is controlled by the parameters $\eta$, $\rho_{lower}$, and $\rho_{upper}$, which are parameters to the NewtonTrustRegion optimization method. For each step, we calculate

\[\rho_k := \frac{f(x_{k+1}) - f(x_k)}{m_k(p) - m_k(0)}\]

Intuitively, $\rho_k$ measures the quality of the quadratic approximation: if $\rho_k \approx 1$, then our quadratic approximation is reasonable. If $p$ was on the boundary and $\rho_k > \rho_{upper}$, then perhaps we can benefit from larger steps. In this case, for the next iteration we grow the trust region geometrically up to a maximum of $\hat\Delta$:

\[\rho_k > \rho_{upper} \Rightarrow \Delta_{k+1} = \min(2 \Delta_k, \hat\Delta).\]

Conversely, if $\rho_k < \rho_{lower}$, then we shrink the trust region geometrically:

\[\rho_k < \rho_{lower} \Rightarrow \Delta_{k+1} = 0.25 \Delta_k.\]

Finally, we only accept a point if its decrease is appreciable compared to the quadratic approximation. Specifically, a step is only accepted if $\rho_k > \eta$. As long as we choose $\eta$ to be less than $\rho_{lower}$, we will shrink the trust region whenever we reject a step. Eventually, if the objective function is locally quadratic, $\Delta_k$ will become small enough that a quadratic approximation will be accurate enough to make progress again.
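The accept/reject and resizing rules can be collected into one small function. This is a sketch of the rules as stated in this section, not Optim's internal code: the function name, the `on_boundary` flag, and the default values for `eta` and `delta_hat` are assumptions, while `rho_lower = 0.25` and `rho_upper = 0.75` follow the constructor shown above.

```julia
# Decide whether to accept a step and compute the next trust region radius.
# `update_trust_region_sketch`, `eta=0.1` and `delta_hat=100.0` are illustrative assumptions.
function update_trust_region_sketch(rho, delta, on_boundary;
                                    eta=0.1, rho_lower=0.25, rho_upper=0.75,
                                    delta_hat=100.0)
    if rho < rho_lower
        delta_next = 0.25 * delta                  # poor model fit: shrink
    elseif rho > rho_upper && on_boundary
        delta_next = min(2 * delta, delta_hat)     # good fit on the boundary: grow
    else
        delta_next = delta                         # otherwise keep the radius
    end
    return rho > eta, delta_next                   # (accept the step?, next radius)
end

accepted, delta_next = update_trust_region_sketch(0.9, 1.0, true)
```

Because `eta` is smaller than `rho_lower` here, every rejected step also falls in the first branch and shrinks the radius, which is the property noted in the paragraph above.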

Example

using Optim, OptimTestProblems
 prob = UnconstrainedProblems.examples["Rosenbrock"];
-res = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, NewtonTrustRegion())

References

[1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

+res = Optim.optimize(prob.f, prob.g!, prob.h!, prob.initial_x, NewtonTrustRegion())

References

[1] Nocedal, Jorge, and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.

diff --git a/dev/algo/ngmres/index.html b/dev/algo/ngmres/index.html index b66ae481..bd4f29d3 100644 --- a/dev/algo/ngmres/index.html +++ b/dev/algo/ngmres/index.html @@ -93,4 +93,4 @@ * Stopped by an increasing objective: false * Reached Maximum Number of Iterations: false * Objective Calls: 222 - * Gradient Calls: 222

References

[1] De Sterck. Steepest descent preconditioning for nonlinear GMRES optimization. NLAA, 2013. [2] Washio and Oosterlee. Krylov subspace acceleration for nonlinear multigrid schemes. ETNA, 1997. [3] Riseth. Objective acceleration for unconstrained optimization. 2018.

+ * Gradient Calls: 222

References

[1] De Sterck. Steepest descent preconditioning for nonlinear GMRES optimization. NLAA, 2013. [2] Washio and Oosterlee. Krylov subspace acceleration for nonlinear multigrid schemes. ETNA, 1997. [3] Riseth. Objective acceleration for unconstrained optimization. 2018.

diff --git a/dev/algo/particle_swarm/index.html b/dev/algo/particle_swarm/index.html index 491147b3..a23aa3eb 100644 --- a/dev/algo/particle_swarm/index.html +++ b/dev/algo/particle_swarm/index.html @@ -1,4 +1,4 @@ Particle Swarm · Optim

Particle Swarm

Constructor

ParticleSwarm(; lower = [],
                 upper = [],
-                n_particles = 0)

The constructor takes three keywords:

  • lower = [], a vector of lower bounds, unbounded below if empty or Inf's
  • upper = [], a vector of upper bounds, unbounded above if empty or Inf's
  • n_particles = 0, number of particles in the swarm, defaults to at least three

Description

The Particle Swarm implementation in Optim.jl is the so-called Adaptive Particle Swarm algorithm in [1]. It attempts to improve global coverage and convergence by switching between four evolutionary states: exploration, exploitation, convergence, and jumping out. In the jumping out state it intentionally tries to take the best particle and move it away from its (potentially and probably) local optimum, to improve the ability to find a global optimum. Of course, this comes at the cost of slower convergence, but the hope is that the swarm converges to the global optimum as a result.

References

[1] Zhan, Zhang, and Chung. Adaptive particle swarm optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Volume 39, Issue 6, 2009, Pages 1362-1381.

+ n_particles = 0)

The constructor takes three keywords:

Description

The Particle Swarm implementation in Optim.jl is the so-called Adaptive Particle Swarm algorithm in [1]. It attempts to improve global coverage and convergence by switching between four evolutionary states: exploration, exploitation, convergence, and jumping out. In the jumping out state it intentionally tries to take the best particle and move it away from its (potentially and probably) local optimum, to improve the ability to find a global optimum. Of course, this comes at the cost of slower convergence, but the hope is that the swarm converges to the global optimum as a result.

References

[1] Zhan, Zhang, and Chung. Adaptive particle swarm optimization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Volume 39, Issue 6, 2009, Pages 1362-1381.

diff --git a/dev/algo/precondition/index.html b/dev/algo/precondition/index.html index 7191d52a..b703ecc7 100644 --- a/dev/algo/precondition/index.html +++ b/dev/algo/precondition/index.html @@ -7,4 +7,4 @@ f(x) = plap([0; x; 0]) g!(G, x) = copyto!(G, (plap1([0; x; 0]))[2:end-1]) result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = nothing)) -result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = precond(100)))

The former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem.

plap

The contours are shaped like ellipsoids, but we would rather want them to be circles. Using the preconditioner effectively changes the coordinates such that the contours become less ellipsoid-like. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100-dimensional case.

References

+result = Optim.optimize(f, g!, initial_x, method = ConjugateGradient(P = precond(100)))

The former optimize call converges at a slower rate than the latter. Looking at a plot of the 2D version of the function shows the problem.

plap

The contours are shaped like ellipsoids, but we would rather want them to be circles. Using the preconditioner effectively changes the coordinates such that the contours become less ellipsoid-like. Benchmarking shows that using preconditioning provides an approximate speed-up factor of 15 in this 100-dimensional case.
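To see why a change of coordinates helps, a small stand-alone sketch can compare condition numbers before and after scaling. It uses a 2x2 quadratic and a simple diagonal (Jacobi) scaling, both chosen for illustration only; this is not the precond function used on this page, and the matrix A is an assumed toy example.

```julia
using LinearAlgebra

# Hessian of a toy quadratic with strongly elongated contours (illustrative assumption).
A = [100.0 1.0; 1.0 1.0]

# Diagonal (Jacobi) scaling: rescale each coordinate by 1 / sqrt(A[i, i]).
S = Diagonal(1 ./ sqrt.(diag(A)))

# Hessian of the same quadratic expressed in the rescaled coordinates.
B = S * A * S

# The condition number measures how elongated the contours are;
# a value of 1 corresponds to perfectly circular contours.
cond_original = cond(A)
cond_scaled = cond(B)
```

The scaled matrix has a condition number close to 1, so its contours are nearly circular, which is the situation in which gradient-based methods tend to make fast progress.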

References

diff --git a/dev/algo/samin/index.html b/dev/algo/samin/index.html index fef17d46..5ffea489 100644 --- a/dev/algo/samin/index.html +++ b/dev/algo/samin/index.html @@ -72,4 +72,4 @@ * Reached Maximum Number of Iterations: false * Objective Calls: 12051 * Gradient Calls: 0 -

References

+

References

diff --git a/dev/algo/simulated_annealing/index.html b/dev/algo/simulated_annealing/index.html index 8ce08f4b..7876a108 100644 --- a/dev/algo/simulated_annealing/index.html +++ b/dev/algo/simulated_annealing/index.html @@ -5,4 +5,4 @@ for i in eachindex(x) x_proposal[i] = x[i]+randn() end -end

As we see, it is not really possible to disentangle the role of the different components of the algorithm. For example, the functional form of the acceptance function, the temperature, and (indirectly) the neighbor function together determine whether the next draw of x is accepted or not.

The current implementation of Simulated Annealing is very rough. It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see this issue.

Example

References

+end

As we see, it is not really possible to disentangle the role of the different components of the algorithm. For example, the functional form of the acceptance function, the temperature, and (indirectly) the neighbor function together determine whether the next draw of x is accepted or not.
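The interplay between the acceptance function and the temperature can be illustrated with a Metropolis-style rule. This is a generic sketch under stated assumptions, not necessarily the rule Optim uses: the function name accept_proposal is hypothetical, and the uniform draw u is passed in explicitly so that the outcome is reproducible.

```julia
# Metropolis-style acceptance rule (a common choice, assumed here for illustration).
# `u` is a number in [0, 1) that would normally come from rand().
function accept_proposal(f_current, f_proposal, temperature, u)
    # Improvements are always accepted.
    f_proposal <= f_current && return true
    # Worse points are accepted with a probability that decays with the
    # size of the increase and grows with the temperature.
    return u < exp(-(f_proposal - f_current) / temperature)
end

# The same uphill move is accepted at a high temperature...
accept_proposal(1.0, 2.0, 1.0, 0.3)
# ...and rejected at a low temperature.
accept_proposal(1.0, 2.0, 0.1, 0.3)
```

Since the neighbor function controls how large the proposed change in the objective typically is, it affects the acceptance probability indirectly through the difference f_proposal - f_current.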

The current implementation of Simulated Annealing is very rough. It lacks quite a few features which are normally part of a proper SA implementation. A better implementation is under way, see this issue.

Example

References

diff --git a/dev/dev/contributing/index.html b/dev/dev/contributing/index.html index a54230a2..7ae3162a 100644 --- a/dev/dev/contributing/index.html +++ b/dev/dev/contributing/index.html @@ -24,4 +24,4 @@ function update!{T}(d, state::MinimState{T}, method::Minim) # code for Minim here false # should the procedure force quit? -end +end diff --git a/dev/dev/index.html b/dev/dev/index.html index ff014914..b7f2a493 100644 --- a/dev/dev/index.html +++ b/dev/dev/index.html @@ -1,2 +1,2 @@ -- · Optim
+- · Optim
diff --git a/dev/examples/generated/ipnewton_basics/index.html b/dev/examples/generated/ipnewton_basics/index.html index 036f2c79..0be5a360 100644 --- a/dev/examples/generated/ipnewton_basics/index.html +++ b/dev/examples/generated/ipnewton_basics/index.html @@ -294,4 +294,4 @@ lx, ux, lc, uc) res = optimize(df, dfc, x0, IPNewton()) -# This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl

This page was generated using Literate.jl.

+# This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl

This page was generated using Literate.jl.

diff --git a/dev/examples/generated/maxlikenlm/index.html b/dev/examples/generated/maxlikenlm/index.html index de972637..421776c3 100644 --- a/dev/examples/generated/maxlikenlm/index.html +++ b/dev/examples/generated/maxlikenlm/index.html @@ -254,4 +254,4 @@ println("parameter estimates:", parameters) println("t-statsitics: ", t_stats) -# This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl

This page was generated using Literate.jl.

+# This file was generated using Literate.jl, https://github.com/fredrikekre/Literate.jl

This page was generated using Literate.jl.

diff --git a/dev/examples/generated/rasch/index.html b/dev/examples/generated/rasch/index.html index 702c0a66..c8a56a54 100644 --- a/dev/examples/generated/rasch/index.html +++ b/dev/examples/generated/rasch/index.html @@ -132,4 +132,4 @@ -0.242781 0.242732 1.22279 1.66615 -3.05756 -2.62454 - 0.667647 1.10274

This page was generated using Literate.jl.

+ 0.667647 1.10274

This page was generated using Literate.jl.

diff --git a/dev/index.html b/dev/index.html index 9d9cec7a..d070d80b 100644 --- a/dev/index.html +++ b/dev/index.html @@ -1,2 +1,2 @@ -Home · Optim

Optim.jl

Univariate and multivariate optimization in Julia.

Optim.jl is part of the JuliaNLSolvers family.

What

Optim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired we supply some methods such as (bounded) simulated annealing and particle swarm. For a dedicated package for global optimization techniques, see e.g. BlackBoxOptim.

Why

There are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages.

When writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the NLopt suite. This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple Pkg.add, so it really doesn't get much freer, easier, or more lightweight than that.

It is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim.

Being a Julia package also means that Optim has access to the automatic differentiation features through the packages in JuliaDiff.

How

The package is a registered package, and can be installed with Pkg.add.

julia> using Pkg; Pkg.add("Optim")

or through the pkg REPL mode by typing

] add Optim
+Home · Optim

Optim.jl

Univariate and multivariate optimization in Julia.

Optim.jl is part of the JuliaNLSolvers family.

What

Optim is a Julia package for optimizing functions of various kinds. While there is some support for box constrained and Riemannian optimization, most of the solvers try to find an $x$ that minimizes a function $f(x)$ without any constraints. Thus, the main focus is on unconstrained optimization. The provided solvers, under certain conditions, will converge to a local minimum. In the case where a global minimum is desired we supply some methods such as (bounded) simulated annealing and particle swarm. For a dedicated package for global optimization techniques, see e.g. BlackBoxOptim.

Why

There are many solvers available from both free and commercial sources, and many of them are accessible from Julia. Few of them are written in Julia. Performance-wise this is rarely a problem, as they are often written in either Fortran or C. However, solvers written directly in Julia do come with some advantages.

When writing Julia software (packages) that requires something to be optimized, the programmer can either choose to write their own optimization routine, or use one of the many available solvers. For example, this could be something from the NLopt suite. This means adding a dependency which is not written in Julia, and more assumptions have to be made as to the environment the user is in. Does the user have the proper compilers? Is it possible to use GPL'ed code in the project? Optim is released under the MIT license, and installation is a simple Pkg.add, so it really doesn't get much freer, easier, or more lightweight than that.

It is also true that using a solver written in C or Fortran makes it impossible to leverage one of the main benefits of Julia: multiple dispatch. Since Optim is entirely written in Julia, we can currently use the dispatch system to ease the use of custom preconditioners. A planned feature along these lines is to allow for user-controlled choice of solvers for various steps in the algorithm, entirely based on dispatch, and not predefined possibilities chosen by the developers of Optim.

Being a Julia package also means that Optim has access to the automatic differentiation features through the packages in JuliaDiff.

How

The package is a registered package, and can be installed with Pkg.add.

julia> using Pkg; Pkg.add("Optim")

or through the pkg REPL mode by typing

] add Optim
diff --git a/dev/user/algochoice/index.html b/dev/user/algochoice/index.html index 66e6485a..81e20fd6 100644 --- a/dev/user/algochoice/index.html +++ b/dev/user/algochoice/index.html @@ -1,2 +1,2 @@ -Algorithm choice · Optim

Algorithm choice

There are two main settings you must choose in Optim: the algorithm and the linesearch.

Algorithms

The first choice to be made is that of the order of the method. Zeroth-order methods do not have gradient information, and are very slow to converge, especially in high dimension. First-order methods do not have access to curvature information and can take a large number of iterations to converge for badly conditioned problems. Second-order methods can converge very quickly once in the vicinity of a minimizer. Of course, this enhanced performance comes at a cost: the objective function has to be differentiable, you have to supply gradients and Hessians, and, for second order methods, a linear system has to be solved at each step.

If you can provide analytic gradients and Hessians, and the dimension of the problem is not too large, then second order methods are very efficient. The Newton method with trust region is the method of choice.

When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred. BFGS is a very efficient method, but also requires a linear system solve. LBFGS usually has a performance very close to that of BFGS, and avoids linear system solves (the parameter m can be tweaked: increasing it can improve the convergence, at the expense of memory and time spent in linear algebra operations). The conjugate gradient method usually converges less quickly than LBFGS, but requires less memory. Gradient descent should only be used for testing. Acceleration methods are experimental.

When the objective function is non-differentiable or you do not want to use gradients, use zeroth-order methods. Nelder-Mead is currently the most robust.

Linesearches

Linesearches are used in every first- and second-order method except for the trust-region Newton method. Linesearch routines attempt to quickly locate an approximate minimizer of the univariate function $\alpha \to f(x+ \alpha d)$, where $d$ is the descent direction computed by the algorithm. They vary in how accurate this minimization is. Two good linesearches are BackTracking and HagerZhang, the former being less stringent than the latter. For well-conditioned objective functions and methods where the step is usually well-scaled (such as LBFGS or Newton), a rough linesearch such as BackTracking is usually the most performant. For badly behaved problems or when extreme accuracy is needed (gradients below the square root of the machine epsilon, about $10^{-8}$ with Float64), the HagerZhang method proves more robust. An exception is the conjugate gradient method, which requires an accurate linesearch to be efficient, and should be used with the HagerZhang linesearch.

Summary

As a very crude heuristic:

For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.

+Algorithm choice · Optim

Algorithm choice

There are two main settings you must choose in Optim: the algorithm and the linesearch.

Algorithms

The first choice to be made is that of the order of the method. Zeroth-order methods do not have gradient information, and are very slow to converge, especially in high dimension. First-order methods do not have access to curvature information and can take a large number of iterations to converge for badly conditioned problems. Second-order methods can converge very quickly once in the vicinity of a minimizer. Of course, this enhanced performance comes at a cost: the objective function has to be differentiable, you have to supply gradients and Hessians, and, for second order methods, a linear system has to be solved at each step.
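The difference between first- and second-order behaviour on a badly conditioned problem can be made concrete with a small stand-alone experiment. The sketch below does not use Optim; the quadratic, the fixed step size and the helper names are all assumptions chosen for illustration.

```julia
using LinearAlgebra

# Quadratic f(x) = 0.5 * x' * A * x with condition number 100 (illustrative choice).
A = Diagonal([100.0, 1.0])
grad(x) = A * x

# Count fixed-step gradient descent iterations until the gradient norm is below tol.
function gradient_descent_iterations(x; step = 1 / 100, tol = 1e-8, maxiter = 100_000)
    for k in 0:maxiter
        norm(grad(x)) < tol && return k
        x = x - step * grad(x)
    end
    return maxiter
end

# Count Newton iterations; each one solves a linear system with the Hessian A.
function newton_iterations(x; tol = 1e-8, maxiter = 100)
    for k in 0:maxiter
        norm(grad(x)) < tol && return k
        x = x - A \ grad(x)
    end
    return maxiter
end

x0 = [1.0, 1.0]
n_gd = gradient_descent_iterations(x0)
n_newton = newton_iterations(x0)
```

On this quadratic the Newton iteration reaches the minimizer in a single step, at the price of one linear solve, while fixed-step gradient descent needs well over a thousand iterations because progress along the flat direction is limited by the steep one.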

If you can provide analytic gradients and Hessians, and the dimension of the problem is not too large, then second order methods are very efficient. The Newton method with trust region is the method of choice.

When you do not have an explicit Hessian or when the dimension becomes large enough that the linear solve in the Newton method becomes the bottleneck, first order methods should be preferred. BFGS is a very efficient method, but also requires a linear system solve. LBFGS usually has a performance very close to that of BFGS, and avoids linear system solves (the parameter m can be tweaked: increasing it can improve the convergence, at the expense of memory and time spent in linear algebra operations). The conjugate gradient method usually converges less quickly than LBFGS, but requires less memory. Gradient descent should only be used for testing. Acceleration methods are experimental.

When the objective function is non-differentiable or you do not want to use gradients, use zeroth-order methods. Nelder-Mead is currently the most robust.

Linesearches

Linesearches are used in every first- and second-order method except for the trust-region Newton method. Linesearch routines attempt to quickly locate an approximate minimizer of the univariate function $\alpha \to f(x+ \alpha d)$, where $d$ is the descent direction computed by the algorithm. They vary in how accurate this minimization is. Two good linesearches are BackTracking and HagerZhang, the former being less stringent than the latter. For well-conditioned objective functions and methods where the step is usually well-scaled (such as LBFGS or Newton), a rough linesearch such as BackTracking is usually the most performant. For badly behaved problems or when extreme accuracy is needed (gradients below the square root of the machine epsilon, about $10^{-8}$ with Float64), the HagerZhang method proves more robust. An exception is the conjugate gradient method, which requires an accurate linesearch to be efficient, and should be used with the HagerZhang linesearch.
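As a rough illustration of what a "less stringent" linesearch does, here is a minimal backtracking sketch based on the sufficient-decrease (Armijo) condition. It is a generic textbook version written for this example, not the BackTracking implementation from LineSearches.jl; the function name and the constants c and shrink are assumptions.

```julia
using LinearAlgebra

# Generic backtracking linesearch: shrink alpha until the sufficient-decrease
# (Armijo) condition holds. `g` is the gradient at `x`, `d` a descent direction.
function backtracking(f, x, d, g; alpha = 1.0, c = 1e-4, shrink = 0.5, maxiter = 50)
    fx = f(x)
    slope = dot(g, d)   # directional derivative; negative for a descent direction
    for _ in 1:maxiter
        f(x + alpha * d) <= fx + c * alpha * slope && return alpha
        alpha *= shrink
    end
    return alpha
end

# Toy objective and a steepest-descent direction from the point (1, 1).
f(x) = sum(abs2, x)
x = [1.0, 1.0]
g = 2 .* x
alpha = backtracking(f, x, -g, g)
```

The routine stops at the first step length that gives enough decrease rather than searching for an accurate minimizer along the direction, which is why such a linesearch is cheap when steps are already well scaled.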

Summary

As a very crude heuristic:

For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.

diff --git a/dev/user/config/index.html b/dev/user/config/index.html index 7ba22010..569865d2 100644 --- a/dev/user/config/index.html +++ b/dev/user/config/index.html @@ -13,4 +13,4 @@ iterations = 10, store_trace = true, show_trace = false, - show_warnings = true)

Notice the need to specify the method using a keyword if this syntax is used. This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the Optim.Options approach.

+ show_warnings = true)

Notice the need to specify the method using a keyword if this syntax is used. This approach might be deprecated in the future, and as a result we recommend writing code that has to be maintained using the Optim.Options approach.

diff --git a/dev/user/gradientsandhessians/index.html b/dev/user/gradientsandhessians/index.html index 81b9c4ff..ab0a2156 100644 --- a/dev/user/gradientsandhessians/index.html +++ b/dev/user/gradientsandhessians/index.html @@ -35,4 +35,4 @@ julia> Optim.minimizer(optimize(f, initial_x, Newton(); autodiff = :forward)) 2-element Array{Float64,1}: 1.0 - 1.0

Indeed, the minimizer was found, without providing any gradients or Hessians.

+ 1.0

Indeed, the minimizer was found, without providing any gradients or Hessians.

diff --git a/dev/user/minimization/index.html b/dev/user/minimization/index.html index 8b774360..0f5bd343 100644 --- a/dev/user/minimization/index.html +++ b/dev/user/minimization/index.html @@ -39,4 +39,4 @@ -1.49994 julia> Optim.minimum(res) - -2.8333333205768865

Complete list of functions

A complete list of functions can be found below.

Defined for all methods:

Defined for univariate optimization:

Defined for multivariate optimization:

Input types

Most users will input Vector's as their initial_x's, and get an Optim.minimizer(res) out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. The only restriction imposed by leaving the Vector case is that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type.

Notes on convergence flags and checks

Currently, it is possible to access a minimizer using Optim.minimizer(result) even if all convergence flags are false. This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations.

A related note is that first and second order methods make a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if initial_x is a stationary point. Note that this is only a first order check. If initial_x is any type of stationary point, g_converged will be true. This includes local minima, saddle points, and local maxima. If iterations is 0 and g_converged is true, the user needs to keep this point in mind.

+ -2.8333333205768865

Complete list of functions

A complete list of functions can be found below.

Defined for all methods:

Defined for univariate optimization:

Defined for multivariate optimization:

Input types

Most users will input Vector's as their initial_x's, and get an Optim.minimizer(res) out that is also a vector. For zeroth and first order methods, it is also possible to pass in matrices, or even higher dimensional arrays. The only restriction imposed by leaving the Vector case is that it is no longer possible to use finite difference approximations or automatic differentiation. Second order methods (variants of Newton's method) do not support this more general input type.

Notes on convergence flags and checks

Currently, it is possible to access a minimizer using Optim.minimizer(result) even if all convergence flags are false. This means that the user has to be a bit careful when using the output from the solvers. It is advised to include checks for convergence if the minimizer or minimum is used to carry out further calculations.

A related note is that first and second order methods make a convergence check on the gradient before entering the optimization loop. This is done to prevent line search errors if initial_x is a stationary point. Note that this is only a first order check. If initial_x is any type of stationary point, g_converged will be true. This includes local minima, saddle points, and local maxima. If iterations is 0 and g_converged is true, the user needs to keep this point in mind.
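A small stand-alone example shows why a first order check alone cannot distinguish a minimum from a saddle point. The function, its hand-written gradient and Hessian, and the variable names are illustrative assumptions; Optim is not called here.

```julia
using LinearAlgebra

# f has a saddle point at the origin: it curves up along x[1] and down along x[2].
f(x) = x[1]^2 - x[2]^2
gradient(x) = [2 * x[1], -2 * x[2]]
hessian(x) = [2.0 0.0; 0.0 -2.0]

x0 = [0.0, 0.0]

# First order check: the gradient vanishes, so a gradient-based test passes.
first_order_ok = norm(gradient(x0)) == 0.0

# Second order check: a local minimum would need all Hessian eigenvalues positive.
is_minimum = all(eigvals(Symmetric(hessian(x0))) .> 0)
```

Here the first order test succeeds while the Hessian has a negative eigenvalue, so starting a solver at such a point would report gradient convergence after zero iterations without having found a minimizer.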

diff --git a/dev/user/tipsandtricks/index.html b/dev/user/tipsandtricks/index.html index 9b488143..99ae379c 100644 --- a/dev/user/tipsandtricks/index.html +++ b/dev/user/tipsandtricks/index.html @@ -192,4 +192,4 @@ * Convergence: false * √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: false * Reached Maximum Number of Iterations: false - * Objective Function Calls: 24 + * Objective Function Calls: 24