Optimization Algorithms
There are three groups of optimization techniques available
in PROC NLP. A particular optimizer can be selected with the
TECH=name option in the PROC NLP statement.
Algorithm                                     | TECH=
Linear Complementarity Problem                | LICOMP
Quadratic Active Set Technique                | QUADAS
Trust-Region Method                           | TRUREG
Newton-Raphson Method With Line-Search        | NEWRAP
Newton-Raphson Method With Ridging            | NRRIDG
Quasi-Newton Methods (DBFGS, DDFP, BFGS, DFP) | QUANEW
Double-Dogleg Method (DBFGS, DDFP)            | DBLDOG
Conjugate Gradient Methods (PB, FR, PR, CD)   | CONGRA
Nelder-Mead Simplex Method                    | NMSIMP
Levenberg-Marquardt Method                    | LEVMAR
Hybrid Quasi-Newton Methods (DBFGS, DDFP)     | HYQUAN
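For example, the following minimal sketch selects the trust-region optimizer for a small unconstrained problem; the objective (a Rosenbrock-type test function), the variable names, and the starting values are illustrative assumptions rather than part of this documentation. Substituting any other name from the table in the TECH= option switches the technique.

   proc nlp tech=trureg;
      min f;                                   /* objective function to be minimized      */
      decvar x1 = -1.2, x2 = 1;                /* decision variables with starting values */
      f = 100*(x2 - x1*x1)**2 + (1 - x1)**2;   /* Rosenbrock-type test function           */
   run;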
Since no single optimization technique is
invariably superior to others, PROC NLP provides a variety
of optimization techniques that work well in various
circumstances. However, it is possible to devise problems for which
none of the techniques in PROC NLP can find the correct solution.
Moreover, nonlinear optimization can be computationally expensive
in terms of time and memory, so care must be taken when matching
an algorithm to a problem.
All optimization techniques in PROC NLP use O(n²) memory
except the conjugate gradient methods, which use only O(n)
memory and are designed to optimize problems with many
variables.
Since the techniques are iterative, they
require the repeated computation of
- the function value (optimization criterion)
- the gradient vector (first-order partial derivatives)
- for some techniques, the (approximate) Hessian matrix
(second-order partial derivatives)
- values of linear and nonlinear constraints
- the first-order partial derivatives (Jacobian) of
nonlinear constraints
However, since each of the optimizers requires different
derivatives and supports different types of constraints,
some computational efficiencies can be gained.
The following table shows, for each optimization technique,
which derivatives are needed (FOD: first-order derivatives;
SOD: second-order derivatives) and what kind of constraints
(BC: boundary constraints; LIC: linear constraints; NLC:
nonlinear constraints) are supported.
Algorithm | FOD | SOD | BC | LIC | NLC
LICOMP    |  -  |  -  |  x |  x  |  -
QUADAS    |  -  |  -  |  x |  x  |  -
TRUREG    |  x  |  x  |  x |  x  |  -
NEWRAP    |  x  |  x  |  x |  x  |  -
NRRIDG    |  x  |  x  |  x |  x  |  -
QUANEW    |  x  |  -  |  x |  x  |  x
DBLDOG    |  x  |  -  |  x |  x  |  -
CONGRA    |  x  |  -  |  x |  x  |  -
NMSIMP    |  -  |  -  |  x |  x  |  x
LEVMAR    |  x  |  -  |  x |  x  |  -
HYQUAN    |  x  |  -  |  x |  x  |  -
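For instance, a problem with only boundary and linear constraints can be handed to NRRIDG, which supports BC and LIC but not NLC according to the table. The following hedged sketch uses an illustrative quadratic-like objective and constraints that are not taken from this documentation:

   proc nlp tech=nrridg;
      min f;
      decvar x1 = 1, x2 = 1;
      bounds x1 >= 0, x2 >= 0;            /* boundary constraints (BC)       */
      lincon x1 + 2*x2 <= 4;              /* general linear constraint (LIC) */
      f = (x1 - 2)**2 + (x2 - 3)**2;      /* smooth illustrative objective   */
   run;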
Preparation for Using Optimization Algorithms
It is rare that a problem is submitted to an optimization algorithm
"as is." By making a few changes in your problem, you can reduce
its complexity, which increases the chance of convergence
and saves execution time.
Choosing an Optimization Algorithm
The factors that go into choosing a particular optimizer for
a particular problem are complex and may involve trial and
error.
The following should be taken into account:
First, the structure of the problem has to be considered:
Is it quadratic? least-squares? Does it have
linear or nonlinear constraints?
Next, it is important to consider which derivatives of
the objective function and the constraints are needed and
whether these are analytically tractable.
This section provides some guidelines for making
the right choices.
For many optimization problems, computing the gradient takes
more computer time than computing the function value, and
computing the Hessian sometimes takes much more computer
time and memory than computing the gradient, especially when
there are many decision variables. Optimization
techniques that do not use the Hessian usually require more
iterations than techniques that use the Hessian or an
approximation of it (obtained, for example, by finite differences
or a BFGS update) and so are often slower. Techniques that do not
use Hessian information at all tend to be slower and less reliable.
The derivative compiler is not efficient in the computation
of second-order derivatives. For large problems, memory and
computer time can be saved by programming your own derivatives
using the GRADIENT, JACOBIAN, CRPJAC, HESSIAN, and JACNLC
statements. If you are not able or willing
to specify first- and second-order
derivatives of the objective function, you
can rely on finite difference gradients and Hessian update formulas.
This combination is frequently used and works very well for small
and medium size problems.
For large problems, you are advised not
to use an optimization technique that requires the computation
of second derivatives.
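As a sketch of programming your own first-order derivatives, the following example supplies analytic gradient components through the GRADIENT statement so that a first-order technique such as CONGRA does not have to rely on the derivative compiler or on finite differences. The objective, the start values, and the names g1 and g2 are illustrative assumptions.

   proc nlp tech=congra;
      min f;
      decvar x1 = -1.2, x2 = 1;
      gradient g1 g2;                           /* g1 = df/dx1, g2 = df/dx2              */
      f  = 100*(x2 - x1*x1)**2 + (1 - x1)**2;
      g1 = -400*x1*(x2 - x1*x1) - 2*(1 - x1);   /* analytic partial derivative w.r.t. x1 */
      g2 =  200*(x2 - x1*x1);                   /* analytic partial derivative w.r.t. x2 */
   run;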
The following provides some guidance for matching an algorithm
to a particular problem.
- Quadratic Programming: QUADAS, LICOMP
- General Nonlinear Optimization
  - Nonlinear Constraints
    - Small Problems: NMSIMP
      Not suitable for highly nonlinear
      problems or for problems with n > 20.
    - Medium Problems: QUANEW
  - Only Linear Constraints
    - Small Problems: TRUREG (NEWRAP, NRRIDG)
      where the Hessian matrix is not expensive to compute.
      Sometimes NRRIDG can be faster than TRUREG, but TRUREG
      can be more stable. NRRIDG needs only
      one matrix with n(n+1)/2 double words; TRUREG
      and NEWRAP need two such matrices.
    - Medium Problems: QUANEW (DBLDOG)
      where the objective function and the gradient
      are much faster to evaluate than the Hessian.
      QUANEW and DBLDOG in general need more iterations
      than TRUREG, NRRIDG, and NEWRAP, but each iteration
      can be much faster. QUANEW and DBLDOG need only the
      gradient to update an approximate Hessian. QUANEW and
      DBLDOG need slightly less memory than TRUREG or NEWRAP
      (essentially one matrix with n(n+1)/2 double words).
    - Large Problems: CONGRA
      (n > 200) where the objective
      function and the gradient can be computed much faster
      than the Hessian and where too much memory is needed
      to store the (approximate) Hessian. CONGRA in general
      needs more iterations than QUANEW or DBLDOG, but each
      iteration can be much faster. Since CONGRA needs only a
      factor of n double-word memory, many large applications
      of PROC NLP can be solved only by CONGRA.
    - No Derivatives: NMSIMP
      where derivatives are not continuous or are very
      difficult to compute (see the sketch after this list).
- Least-Squares Minimization
  - Small Problems: LEVMAR (HYQUAN)
    where the crossproduct
    Jacobian matrix is easy and inexpensive to compute.
    In general, LEVMAR is more reliable, but there are
    problems with high residuals where HYQUAN can be faster
    than LEVMAR.
  - Medium Problems: QUANEW (DBLDOG)
    where the objective function and the gradient are much
    faster to evaluate than the crossproduct Jacobian.
    QUANEW and DBLDOG in general need more iterations
    than LEVMAR or HYQUAN, but each iteration
    can be much faster.
  - Large Problems: CONGRA
  - No Derivatives: NMSIMP
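As an illustration of the "No Derivatives: NMSIMP" entries above, the following sketch minimizes a continuous but nondifferentiable objective; the function and the start values are illustrative assumptions.

   proc nlp tech=nmsimp;
      min f;
      decvar x1 = 1, x2 = 1;
      f = abs(x1 - 3) + 2*abs(x2 + 1);   /* continuous, but not differentiable at the solution */
   run;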
The QUADAS and LICOMP algorithms can be used to minimize or
maximize a quadratic objective function

   f(x) = (1/2) x^T G x + g^T x + c

subject to boundary constraints and general linear constraints
A x ≤ b, where x = (x1, ... ,xn)^T, g = (g1, ... ,gn)^T,
G is an n × n symmetric matrix, A is an m × n
matrix of general linear constraints, and b = (b1, ... ,bm)^T.
The value of c modifies only the value of the objective function,
not its derivatives, and the location of the optimizer x* does
not depend on the value of the constant term c. For QUADAS or
LICOMP, the objective function must be specified
using the MINQUAD or MAXQUAD statement or using an INQUAD= data
set.
In this case, derivatives do not need to be specified,
because the gradient vector Gx + g
and the n × n Hessian matrix G
are easily obtained from the data input.
Simple boundary and general linear constraints can be specified
using the BOUNDS or LINCON statement or an INQUAD= or
INEST= data set.
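The following hedged sketch shows one way such a quadratic program might be specified with the MINQUAD statement together with BOUNDS and LINCON; the ARRAY-based form of passing G and g, as well as the data values, are assumptions made for illustration, so check the ARRAY and MINQUAD statement descriptions for the exact argument syntax.

   proc nlp tech=quadas;
      array h[2,2]  2 0
                    0 2;                /* G: symmetric matrix of the quadratic term (assumed layout) */
      array g[2]   -2 -4;               /* g: vector of the linear term (assumed layout)              */
      minquad h, g;                     /* assumed argument form; see the MINQUAD statement           */
      decvar x1 = 0, x2 = 0;
      bounds x1 >= 0, x2 >= 0;
      lincon x1 + x2 <= 1;
   run;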
General Quadratic Programming (QUADAS)
The QUADAS algorithm is an active set method that iteratively
updates the QT decomposition of the matrix A_k of active
linear constraints and the Cholesky factor of the projected
Hessian Z^T G Z simultaneously. The update of active
boundary and linear constraints is done separately;
refer to Gill et al. (1984). Here
Q is an nfree × nfree orthogonal matrix
composed of vectors spanning the null space Z of A_k
in its first nfree - nalc
columns and the range space Y in its last nalc columns;
T is an nalc × nalc triangular matrix of
special form, t_ij = 0 for i < n - j, where nfree is
the number of free parameters (n minus the number of active
boundary constraints) and nalc is the number of active
linear constraints. The Cholesky factor of the projected Hessian
matrix Z_k^T G Z_k and the QT decomposition are updated
simultaneously when the active set changes.
The LICOMP technique solves a quadratic problem as
a linear complementarity problem.
It can be used only if G is positive (negative)
semi-definite for minimization (maximization) and
if the parameters are restricted to be positive.
This technique finds a point that meets the
Karush-Kuhn-Tucker conditions by
solving the linear complementarity problem

   w = M z + q

with the constraints w ≥ 0, z ≥ 0, and z^T w = 0,
where M and q are constructed from G, g, A, and b.
Only the LCEPSILON= option can be used to specify a
tolerance used in the computations.
General Nonlinear Optimization
Trust-Region Optimization (TRUREG)
The trust-region method uses the gradient g(x(k))
and Hessian matrix G(x(k)) and thus
requires that the objective function f(x) have
continuous first- and second-order derivatives inside
the feasible region.
The trust-region method iteratively optimizes a quadratic
approximation to the nonlinear objective function within a
hyperelliptic trust region with radius Δ that
constrains the step size corresponding to the quality of
the quadratic approximation. The trust-region method is
implemented using Dennis, Gay, & Welsch (1981), Gay (1983),
and Moré & Sorensen (1983).
The trust region method performs well for small to medium-sized
problems and does not require many function, gradient, and Hessian
calls. If the computation of the Hessian matrix is
computationally expensive, use the UPDATE= option for update formulas
(that gradually build the second-order information in the Hessian).
For larger problems, the conjugate gradient algorithm may be more appropriate.
Newton-Raphson Optimization With Line-Search (NEWRAP)
The NEWRAP technique uses the gradient g(x(k))
and Hessian matrix G(x(k)) and thus
requires that the objective function have continuous first-
and second-order derivatives inside the feasible region.
If second-order derivatives are computed efficiently
and precisely, the NEWRAP method may perform well for
medium-sized to large problems, and it does not need many
function, gradient, and Hessian calls.
This algorithm uses a pure Newton step when the Hessian
is positive definite and when the Newton step reduces the
value of the objective function successfully. Otherwise,
a combination of ridging and line-search is done to
compute successful steps. If the Hessian is not positive
definite, a multiple of the identity matrix is added to
the Hessian matrix to make it positive definite (Eskow
and Schnabel 1991).
In each iteration, a line-search is done along the search
direction to find an approximate optimum of the objective
function. The default line-search method uses quadratic
interpolation and cubic extrapolation (LIS=2).
Newton-Raphson Ridge Optimization (NRRIDG)
The NRRIDG technique uses the gradient g(x(k))
and Hessian matrix G(x(k)) and thus
requires that the objective function have continuous first-
and second-order derivatives inside the feasible region.
This algorithm uses a pure Newton step when the Hessian
is positive definite and when the Newton step reduces the
value of the objective function successfully.
If at least one of these two conditions is not satisfied,
a multiple of the identity matrix is added to the Hessian
matrix.
If this algorithm is used for least-squares problems, it performs
a ridged Gauss-Newton minimization.
The NRRIDG method performs well for small to medium-sized
problems and does not need many function, gradient, and
Hessian calls. However, if the computation of the Hessian
matrix is computationally expensive, one of the (dual)
quasi-Newton or conjugate gradient algorithms may be more
efficient.
Since NRRIDG uses an
orthogonal decomposition of the approximate Hessian, each iteration
of NRRIDG can be slower than that of NEWRAP, which
works with a Cholesky decomposition. However, NRRIDG usually
needs fewer iterations than NEWRAP.
Quasi-Newton Optimization (QUANEW)
The (dual) quasi-Newton method uses the gradient g(x(k))
and does not need to compute second-order derivatives
since they are approximated.
It works well for medium to moderately large optimization
problems where the objective function and the gradient
are much faster to compute than the Hessian,
but in general it requires more iterations than the techniques
TRUREG, NEWRAP, and NRRIDG, which compute second-order
derivatives.
The QUANEW algorithm depends on whether or not there are nonlinear
constraints.
Unconstrained or Linearly Constrained Problems
If there are no nonlinear constraints, QUANEW is either
- the original quasi-Newton algorithm that updates
an approximation of the inverse Hessian
- the dual quasi-Newton algorithm that updates the
Cholesky factor of an approximate Hessian (default)
depending on the value of the UPDATE= option.
For problems with general linear inequality constraints,
the dual quasi-Newton methods can be more efficient than
the original ones.
Four update formulas can be specified with the UPDATE= option:
- DBFGS
- performs the dual BFGS (Broyden, Fletcher,
Goldfarb, & Shanno) update of the Cholesky
factor of the Hessian matrix.
This is the default.
- DDFP
- performs the dual DFP (Davidon, Fletcher,
& Powell) update of the Cholesky factor of
the Hessian matrix.
- BFGS
- performs the original BFGS (Broyden, Fletcher,
Goldfarb, & Shanno) update of the inverse
Hessian matrix.
- DFP
- performs the original DFP (Davidon, Fletcher,
& Powell) update of the inverse Hessian matrix.
In each iteration, a line-search is done along the search
direction to find an approximate optimum.
The default line-search method uses quadratic
interpolation and cubic extrapolation to obtain a step
size satisfying the Goldstein conditions. One
of the Goldstein conditions can be violated if the feasible
region defines an upper limit of the step size. Violating
the left side Goldstein condition can affect the positive
definiteness of the quasi-Newton update. In those cases,
either the update is skipped or the iterations are restarted
with an identity matrix resulting in the steepest descent
or ascent search direction. Line-search algorithms other
than the default one can be specified with the LIS= option.
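For example, the following sketch requests QUANEW with the dual BFGS update and the default line-search explicitly; both option values repeat the documented defaults, and the three-parameter test objective and start values are illustrative assumptions.

   proc nlp tech=quanew update=dbfgs lis=2;
      min f;
      decvar x1 = -1, x2 = 1, x3 = -1;
      f = (x1 + 10*x2)**2 + 5*(x3 - 1)**2 + (x2 - 2*x3)**4;   /* smooth illustrative objective */
   run;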
Nonlinearly Constrained Problems
The algorithm used for nonlinearly constrained quasi-Newton
optimization is an efficient modification of Powell's
(1978, 1982) Variable Metric Constrained WatchDog
(VMCWD)
algorithm. A similar but older
algorithm (VF02AD)
is part of the Harwell library. Both VMCWD and VF02AD
use Fletcher's VE02AD algorithm (part of the Harwell
library) for positive definite quadratic programming.
The PROC NLP QUANEW implementation uses a quadratic
programming subroutine that updates and downdates the
approximation of the Cholesky factor when the active
set changes. The nonlinear QUANEW algorithm is not a
feasible point algorithm, and the value of the objective
function need not decrease (minimization) or increase
(maximization) monotonically. Instead, the algorithm
tries to reduce a linear combination of the objective
function and constraint violations, called the merit
function.
The following are similarities and differences between this
algorithm and the VMCWD algorithm:
- A modification of this algorithm can be performed by
specifying VERSION=1, which replaces the update of the
Lagrange vector with the original update of Powell (1978)
that is used in VF02AD. This can be helpful for some
applications with linearly dependent active constraints.
- If the VERSION option is not specified or if VERSION=2 is specified,
the evaluation of the Lagrange vector is
performed in the same way as Powell (1982) describes.
- Instead of updating an approximate Hessian matrix,
this algorithm uses the dual BFGS (or DFP) update that
updates the Cholesky factor of an approximate Hessian.
If the condition of the updated matrix gets too bad,
a restart is done with a positive diagonal matrix.
At the end of the first iteration after each restart,
the Cholesky factor is scaled.
- The Cholesky factor is loaded into the quadratic
programming subroutine, automatically ensuring
positive definiteness of the problem. During the
quadratic programming step, the Cholesky factor of
the projected Hessian matrix Z_k^T G Z_k and the QT
decomposition are updated simultaneously when the
active set changes. Refer to Gill et al. (1984)
for more information.
- The line-search strategy is very similar to that of
Powell (1982). However, this algorithm does not call for
derivatives during the line-search, so the
algorithm generally needs fewer derivative
calls than function calls. VMCWD always requires the
same number of derivative and function calls.
Sometimes Powell's line-search method uses
steps that are too long.
In these cases, use the INSTEP= option
to restrict the step length.
- The watchdog strategy is similar to that of
Powell (1982); however, it doesn't return
automatically after a fixed number of iterations to
a former better point. A return here is further
delayed if the observed function reduction is close
to the expected function reduction of the quadratic
model.
- The Powell termination criterion is still used
(as FTOL2), but the QUANEW implementation uses two additional
termination criteria (GTOL and ABSGTOL).
The nonlinear QUANEW algorithm needs
the Jacobian matrix CJ(x) of the first-order derivatives
(constraint normals) of the nonlinear constraints.
You can specify two update formulas with the UPDATE= option:
- UPDATE=DBFGS performs the dual BFGS update of
the Cholesky factor of the Hessian matrix.
This is the default.
- UPDATE=DDFP performs the dual DFP update of
the Cholesky factor of the Hessian matrix.
This algorithm uses its own line-search technique. All
options and parameters (except the INSTEP= option) controlling
the line-search in the other algorithms do not apply here. In
several applications, large steps in the first iterations were
troublesome. You can use the INSTEP= option to impose an upper
bound for the step size during the first five
iterations. You may also use the INHESSIAN[=r] option to
specify a different starting approximation for the Hessian.
Choosing simply the INHESSIAN option will use the Cholesky
factor of a (possibly ridged) finite difference approximation
of the Hessian to initialize the quasi-Newton update process.
The values of the LCSINGULAR=, LCEPSILON=, and LCDEACT= options,
which control the processing of linear and boundary constraints,
are valid only for the quadratic programming subroutine used in
each iteration of the nonlinearly constrained QUANEW algorithm.
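A hedged sketch of a nonlinearly constrained problem with the options discussed above set explicitly; the objective, the constraint, and the particular option values (other than the documented defaults VERSION=2 and UPDATE=DBFGS) are illustrative assumptions.

   proc nlp tech=quanew update=dbfgs version=2 instep=0.5 inhessian;
      min f;
      decvar x1 = .5, x2 = .5;
      nlincon c1 >= 0;                   /* nonlinear constraint: x1*x2 >= 0.25 */
      f  = (x1 - 2)**2 + (x2 - 1)**2;
      c1 = x1*x2 - 0.25;
   run;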
Double Dogleg Optimization (DBLDOG)
The double dogleg optimization method combines the
ideas of quasi-Newton and trust region methods.
The double dogleg algorithm computes in each iteration
the step s(k) as a linear combination of the
steepest descent or ascent search direction s1(k)
and a quasi-Newton search direction s2(k):

   s(k) = α1 s1(k) + α2 s2(k)

The step is required to remain within a prespecified
trust-region radius; refer to
Fletcher (1987, p. 107). Thus, the
DBLDOG subroutine uses the dual quasi-Newton update but
does not perform a line-search. Two update formulas can
be specified with the UPDATE= option:
- DBFGS
- performs the dual BFGS (Broyden, Fletcher,
Goldfarb, & Shanno) update of the Cholesky
factor of the Hessian matrix.
This is the default.
- DDFP
- performs the dual DFP (Davidon, Fletcher,
& Powell) update of the Cholesky factor of
the Hessian matrix.
The double dogleg optimization technique works well for
medium to moderately large optimization problems where the
objective function and the gradient are much faster to
compute than the Hessian. The implementation is based on
Dennis & Mei (1979) and Gay (1983)
but is extended for dealing with boundary and
linear constraints. DBLDOG generally needs more iterations
than the techniques TRUREG, NEWRAP, or NRRIDG that need
second-order derivatives, but each of the DBLDOG iterations
is computationally cheap. Furthermore, DBLDOG needs
only gradient calls for the update of the Cholesky
factor of an approximate Hessian.
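A sketch selecting DBLDOG with the dual DFP update instead of the default dual BFGS; the four-parameter test function (Powell's singular function) and its start values are illustrative, not taken from this documentation.

   proc nlp tech=dbldog update=ddfp;
      min f;
      decvar x1 = 3, x2 = -1, x3 = 0, x4 = 1;
      f = (x1 + 10*x2)**2 + 5*(x3 - x4)**2
        + (x2 - 2*x3)**4 + 10*(x1 - x4)**4;   /* Powell's singular test function */
   run;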
Conjugate Gradient Optimization (CONGRA)
Second-order derivatives are not used by CONGRA.
The CONGRA algorithm can be expensive in function and gradient
calls but needs only O(n) memory for unconstrained optimization.
In general, many iterations are needed to obtain a precise
solution, but each of the CONGRA iterations is computationally
cheap. Four different update formulas for generating the conjugate
directions can be specified using the UPDATE= option:
- PB
- performs the automatic restart update method of
Powell (1977) and Beale (1972). This is the default.
- FR
- performs the Fletcher-Reeves update (Fletcher 1987).
- PR
- performs the Polak-Ribiere update (Fletcher 1987).
- CD
- performs a conjugate-descent update of Fletcher (1987).
The default value is UPDATE=PB, since it behaved best in most test
examples. You are advised to avoid the option UPDATE=CD,
which behaved worst in most test examples.
The CONGRA subroutine should be used for optimization
problems with large n. For the unconstrained or boundary
constrained case, CONGRA needs only O(n) bytes of
working memory, whereas all other optimization methods
require O(n²) bytes of working memory. During n
successive iterations, uninterrupted by restarts or changes
in the working set, the conjugate gradient algorithm
computes a cycle of n conjugate search directions.
In each iteration, a line-search is done along the search
direction to find an approximate optimum of the objective
function. The default line-search method uses quadratic
interpolation and cubic extrapolation to obtain a step
size satisfying the Goldstein conditions. One
of the Goldstein conditions can be violated if the
feasible region defines an upper limit for the step
size. Other line-search algorithms can be specified
with the LIS= option.
Nelder-Mead Simplex Optimization (NMSIMP)
The Nelder-Mead simplex method does not use any derivatives
and does not assume that the objective function
has continuous derivatives. The objective function itself
needs to be continuous. This technique requires a large number
of function evaluations and is unlikely to give accurate
results when the number of parameters n is large.
Depending on the kind of constraints, one of the following
Nelder-Mead simplex algorithms is used:
- unconstrained or only boundary constrained
problems
The original Nelder-Mead simplex algorithm is
implemented and extended to boundary constraints.
This algorithm does not compute the objective for
infeasible points. This algorithm is automatically
invoked if the LINCON or NLINCON statement is not
specified.
- general linearly constrained or nonlinearly
constrained problems
A slightly modified version of Powell's (1992)
COBYLA (Constrained Optimization BY Linear
Approximations) implementation is used. This
algorithm is automatically invoked if either the
LINCON or the NLINCON statement is specified.
The original Nelder-Mead algorithm cannot be used for
general linear or nonlinear constraints but can be faster
for the unconstrained or boundary constrained case. The
original Nelder-Mead algorithm changes the shape of the
simplex, adapting to the nonlinearities of the objective
function, which contributes to an increased speed of
convergence. The two NMSIMP subroutines use special
sets of termination criteria. For more details, refer to the section "Termination Criteria".
Powell's COBYLA Algorithm (COBYLA)
Powell's COBYLA algorithm is a sequential trust-region
algorithm (originally with a monotonically decreasing
radius of a spheric trust region) that tries to
maintain a regular-shaped simplex over the iterations.
A small modification was made to the original algorithm
that permits an increase of the trust-region radius
in special situations. A sequence of iterations
is performed with a constant trust-region radius until the computed objective function reduction is much
less than the predicted reduction.
Then, the trust-region radius is reduced.
The trust-region radius is increased only if the computed
function reduction is relatively close to the predicted
reduction and the simplex is well-shaped. The start
radius and the final radius can be specified with the
INSTEP= and ABSXTOL= options, respectively. Convergence to a small
final radius (high precision) may require many
calls of the function and constraint modules and may
result in numerical problems. There are two main reasons
for the slow convergence of the COBYLA algorithm:
- Only linear approximations of the objective and
constraint functions are used locally.
- Maintaining the regular-shaped simplex and not
adapting its shape to nonlinearities yields very
small simplexes for highly nonlinear functions
(for example, fourth-order polynomials).
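For example, specifying an NLINCON statement together with TECH=NMSIMP causes the COBYLA variant to be used; the nondifferentiable objective and the nonlinear constraint below are illustrative assumptions.

   proc nlp tech=nmsimp;
      min f;
      decvar x1 = 0, x2 = 0;
      nlincon c1 >= 0;                                 /* presence of NLINCON selects COBYLA        */
      f  = (x1 - 2)**2 + (x2 - 2)**2 + abs(x1 - x2);   /* continuous, not everywhere differentiable */
      c1 = 1 - x1*x1 - x2*x2;                          /* nonlinear constraint: x1^2 + x2^2 <= 1    */
   run;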
Nonlinear Least-Squares Optimization
Levenberg-Marquardt Least-Squares Method (LEVMAR)
The Levenberg-Marquardt method is a modification of the
trust-region method for nonlinear least-squares problems
and is implemented as in Moré (1978).
This is the
recommended algorithm for small- to medium-sized
least-squares problems. Large least-squares problems
can be transformed into minimization problems, which can be
processed with conjugate gradient or (dual) quasi-Newton
techniques. In each iteration, LEVMAR solves a quadratically
constrained quadratic minimization problem that restricts
the step to stay at the surface of or inside an n
dimensional elliptical (or spherical) trust region.
In each iteration, LEVMAR uses the crossproduct
Jacobian matrix J^T J as an approximate Hessian matrix.
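A minimal least-squares sketch: the LSQ statement lists the residual functions whose sum of squares is minimized, and LEVMAR forms the crossproduct Jacobian J^T J from their first derivatives. The residuals shown (a Rosenbrock-type problem in least-squares form) and the start values are illustrative assumptions.

   proc nlp tech=levmar;
      lsq r1 r2;                      /* residuals; their sum of squares is minimized */
      decvar x1 = -1.2, x2 = 1;
      r1 = 10*(x2 - x1*x1);
      r2 = 1 - x1;
   run;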
Hybrid Quasi-Newton Least-Squares Methods (HYQUAN)
In each iteration of one of the Fletcher and Xu (1987)
(refer also to Al-Baali and Fletcher 1985, 1986)
hybrid quasi-Newton
methods, a criterion is used to decide whether a Gauss-Newton
or a dual quasi-Newton search direction is appropriate. The
VERSION= option can be used to choose one of three criteria
(HY1, HY2, HY3) proposed by Fletcher and Xu (1987). The default
is VERSION=2; that is, HY2. In each iteration, HYQUAN computes
the crossproduct Jacobian (used for the Gauss-Newton step),
updates the Cholesky factor of an approximate Hessian (used
for the quasi-Newton step), and does a line-search to compute
an approximate minimum along the search direction. The default
line-search technique used by HYQUAN is especially designed
for least-squares problems (refer to Lindström and Wedin
1984 and Al-Baali and Fletcher 1986). Using the LIS= option
you can choose a different line-search algorithm than the
default one.
Two update formulas can be specified with the UPDATE= option:
- DBFGS
- performs the dual BFGS (Broyden, Fletcher,
Goldfarb, and Shanno) update of the Cholesky
factor of the Hessian matrix.
This is the default.
- DDFP
- performs the dual DFP (Davidon, Fletcher,
and Powell) update of the Cholesky factor of
the Hessian matrix.
The HYQUAN subroutine needs about the same amount of working
memory as the LEVMAR algorithm. In most applications, LEVMAR
seems to be superior to HYQUAN, and using HYQUAN is recommended
only when problems are experienced with the performance
of LEVMAR.
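For completeness, a sketch selecting the hybrid method with the documented default criterion HY2 (VERSION=2) and the default dual BFGS update for a small least-squares specification; the residual functions and start values are illustrative assumptions.

   proc nlp tech=hyquan version=2 update=dbfgs;
      lsq r1 r2 r3;                   /* three residual functions of two parameters */
      decvar x1 = 1, x2 = 1;
      r1 = x1 - 0.5;
      r2 = x2 - 2*x1;
      r3 = x1*x2 - 1;
   run;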