
‖∇f‖ shows superlinear convergence, as defined in (1.8).

Example 3.5. We have used Algorithm 3.15 on Rosenbrock's function from Example 2.8. We use the same starting point, x0 = [−1.2, 1]ᵀ, and with τ = 10⁻², ε₁ = 10⁻¹⁰, ε₂ = 10⁻¹² we found the solution after 29 iteration steps. The performance is illustrated below.

Figure 3.4. Damped Newton method applied to Rosenbrock's function.
Top: iteration path. Bottom: f(xk), ‖∇f(xk)‖ and µ.

The six stars in the iteration path indicate points where the attempted step was uphill, ie the current x is not changed, but µ is updated. After passing the bottom of the parabola, the damping parameter µ is decreased in most steps. As in the previous example we achieve superlinear final convergence.

3.3. Quasi–Newton methods

The modifications discussed in the previous section make it possible to overcome the first three of the main disadvantages of Newton's method shown in Summary 3.6. The damped Newton method is globally convergent, ill-conditioning may be avoided, and minima are rapidly located.

However, the fourth disadvantage remains: the user must supply second derivatives, either by implementing analytically derived expressions or by means of automatic differentiation. For large or complicated problems this may be a costly affair, and it often happens that we only benefit from the quadratic convergence in the last few iterations.

In quasi–Newton methods the 2nd derivatives are not required. The Newton equation (3.2),

∇²f(x) hn = −∇f(x) ,

is replaced by

B hqn = −∇f(x) ,   (3.16)

where the matrix B is an approximation to the Hessian ∇²f(x). This matrix is modified during the iteration process, and many different schemes have been proposed.

Possibly the most straightforward quasi–Newton method is obtained if the elements of the Hessian are approximated by finite differences, see Appendix A.3. This can give a good approximation to the Hessian, but it involves n extra evaluations of the gradient in each iteration. Further, there is no guarantee that B is positive definite, and in order to get a robust algorithm, it might for instance be combined with the damped Newton method outlined in Algorithm 3.8.
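As an illustration, here is a minimal sketch of such a finite-difference Hessian approximation built from forward differences of the gradient; the function name, step size and the final symmetrization are our own choices, not prescribed by the text:

```python
import numpy as np

def fd_hessian(grad, x, eps=1e-6):
    """Approximate the Hessian by forward differences of the gradient,
    cf Appendix A.3. Costs n extra gradient evaluations per call."""
    n = x.size
    g0 = grad(x)
    B = np.empty((n, n))
    for j in range(n):
        xj = x.copy()
        xj[j] += eps                       # perturb one coordinate at a time
        B[:, j] = (grad(xj) - g0) / eps    # j-th column of the Hessian
    return 0.5 * (B + B.T)  # symmetrize; positive definiteness is NOT guaranteed
```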

The quasi–Newton methods that we shall discuss in the remainder of this chapter avoid these unfruitful extra evaluations of the gradient and the need for damping. First, however, we must mention that instead of generating a sequence of B-matrices for use in (3.16), some quasi–Newton methods work with a sequence of D-matrices such that the search direction hqn is given by

hqn = −D ∇f(x) .   (3.17)

Thus, the current D is an approximation to the inverse of ∇²f(x). We shall discuss these two approaches in parallel. The modifications from the current B (or D) are made by relatively simple updates, and for each updating formula for B there is a similar updating formula for D with the same computational cost. This means that the difference in cost is that with formulation (3.16) we have to solve a linear system in the n unknown components of hqn; this is an O(n³) process, whereas the multiplication with D in (3.17) is an O(n²) process, ie it is cheaper for large n.
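To make the two formulations concrete, a small sketch (all data and names are ours; here D is fabricated as the exact inverse just for the comparison, whereas a quasi–Newton method would build it by updates):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)      # a positive definite stand-in for B
D = np.linalg.inv(B)             # stand-in for the corresponding D
g = rng.standard_normal(n)       # gradient at the current x

h_B = np.linalg.solve(B, -g)     # formulation (3.16): O(n^3) linear solve
h_D = -D @ g                     # formulation (3.17): O(n^2) matrix-vector product
assert np.allclose(h_B, h_D)     # same step, different cost
```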

The need for damping is avoided because the sequence of B (or D) matrices is generated such that every matrix is positive definite, and Theorem 2.15 tells us that then every hqn is a descent direction. In order to find a good step length we can use line search or a trust region approach. We shall only discuss the former, and conclude this section by giving the following framework.

Framework 3.18. Quasi–Newton method with updating and line search

begin
  x := x0;  k := 0;  found := · · ·                        {1}
  B := B0  (or  D := D0)                                   {2}
  while (not found) and (k < kmax)
    Solve B hqn = −∇f(x)  (or compute hqn := −D ∇f(x))
    xnew := line search(x, hqn)                            {3}
    Update found                                           {1}
    Update B to Bnew  (or D to Dnew)                       {4}
    x := xnew;  k := k+1
end

We have the following remarks:

1 Initialization. We recommend using the stopping criteria (3.14).

2 The choice B0 = I (or D0 = I) will make the iteration process start like the steepest descent method, which has good global convergence properties. With a good updating strategy the matrices will converge to good approximations to ∇²f(x̂) (or the inverse of this), and the process ends up somewhat like Newton's method.

3 Some methods demand exact line search, others work fine with very loose line search.

4 In the following we present the requirements on the updating and the techniques needed in order for the sequence of B (or D) matrices to converge towards ∇²f(x̂) (or ∇²f(x̂)⁻¹, respectively), x̂ being the minimizer.
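In code, the framework might look as follows. This is only a sketch under our own naming: the update rule and line search are supplied by the caller, and a simple gradient-norm test stands in for the stopping criteria of remark 1.

```python
import numpy as np

def quasi_newton(grad, x0, update, line_search, tol=1e-8, kmax=200):
    """Skeleton of Framework 3.18 in the D-formulation (3.17).

    `update(D, h, y)` must return the next inverse-Hessian approximation."""
    x, D = x0.copy(), np.eye(x0.size)       # remark 2: D0 = I starts like steepest descent
    g = grad(x)
    for k in range(kmax):
        if np.linalg.norm(g) <= tol:        # stand-in for the criteria (3.14), remark 1
            break
        h = -D @ g                          # hqn := -D grad f(x), cf (3.17)
        x_new = line_search(x, h)           # remark 3
        g_new = grad(x_new)
        D = update(D, x_new - x, g_new - g) # remark 4: D -> Dnew
        x, g = x_new, g_new
    return x
```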

3.3.1. Updating formulas

In this subsection we start by looking at the requirements that should be satisfied by an updating scheme for approximations to the Hessian or its inverse. Later we discuss how to build in other desirable features of the matrix sequence.

Let x and B be the current iterate and approximation to ∇²f(x). Given these, steps 1 and 2 of Framework 3.18 can be performed, yielding xnew. The objective is to calculate Bnew by a correction of B.

The correction must contain some information about the second derivatives. Clearly, this information is only approximate; it is based on the gradients of f at the two points. Now, consider the first order Taylor approximation of the gradient around xnew,

∇f(x) ≃ ∇f(xnew) + ∇²f(xnew) (x − xnew) .

By rearranging we get an expression similar to (2.35),

∇f(xnew) − ∇f(x) ≃ ∇²f(xnew) (xnew − x) .

We require that the approximation Bnew satisfies the corresponding equality

Bnew h = y  with  h = xnew − x ,  y = ∇f(xnew) − ∇f(x) .   (3.19)

This is the so-called quasi–Newton condition. The same arguments lead to the alternative formulation of the quasi–Newton condition,

Dnew y = h .   (3.20)

If n = 1, then we see that Bnew is an approximation to f″ at a point between the current and the new iterate, and (3.19) tells us that this approximation is the slope of the line between the points (x, f′(x)) and (xnew, f′(xnew)) in the graph of f′. Therefore condition (3.19) is sometimes referred to as the secant condition.
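For a quadratic f the Taylor relation behind (3.19) is exact, so any Bnew satisfying the condition agrees with the true Hessian along h. A quick numerical check, with our own test data:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 4 * np.eye(4)         # positive definite "Hessian"
b = rng.standard_normal(4)
grad = lambda x: A @ x + b          # gradient of f(x) = 0.5 x^T A x + b^T x

x, x_new = rng.standard_normal(4), rng.standard_normal(4)
h, y = x_new - x, grad(x_new) - grad(x)
assert np.allclose(A @ h, y)        # so Bnew h = y matches the true Hessian on h
```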

For n > 1 the quasi–Newton condition is an underdetermined linear system with n equations and the n² elements in Bnew (or Dnew) as unknowns. Therefore additional conditions are needed to get a well defined method. Different quasi–Newton methods are distinguished by the choice of these extra conditions. A common feature of the methods that we shall describe is that they have the form

Bnew = B + WB  (or  Dnew = D + WD) .

We say that Bnew is obtained by updating B. The correction matrix W is constructed so that the quasi–Newton condition is satisfied, and maybe ensures desirable features, such as preservation of symmetry and positive definiteness. Also, it should be simple to compute the correction, and in most methods used in practice, W is either a rank-one matrix,

Bnew = B + a bᵀ ,

or a rank-two matrix,

Bnew = B + a bᵀ + u vᵀ ,

where a, b, u, v ∈ Rⁿ. Hence W is an outer product of two vectors or the sum of two such products. Often a = b and u = v; this is a simple way of ensuring that W is symmetric.

Example 3.6. One of the first updating formulas to be thoroughly discussed was Broyden's rank-one formula, [6],

Bnew = B + a bᵀ .

The vectors a, b ∈ Rⁿ are chosen so that they satisfy the quasi–Newton condition (3.19),

(B + a bᵀ) h = y ,

supplied with the condition that

(B + a bᵀ) v = B v  for all v with vᵀh = 0 .

The rationale behind this is that we want to keep information already in B, and after all we only have new information about 2nd derivatives in the direction given by h. The reader can easily verify that the result is Broyden's rank-one formula for updating the approximation to the Hessian:

Bnew = B + (1/(hᵀh)) (y − B h) hᵀ .   (3.21)

A formula for updating an approximation to the inverse Hessian may be derived in the same way and we obtain

Dnew = D + (1/(yᵀy)) (h − D y) yᵀ .   (3.22)

The observant reader will notice the symmetry between these two updating formulas. This is further discussed in Section 3.5.

Now, given some initial approximation B0 (or D0) (the choice of which shall be discussed later), we can use (3.21) or (3.22) to generate the sequence needed in Framework 3.18. However, two important features of the Hessian (or its inverse) would then be disregarded: we wish both matrices B and D to be symmetric and positive definite. This is not the case for (3.21) and (3.22), and thus the use of Broyden's formula may lead to steps which are not even downhill, and convergence towards saddle points or maxima will often occur. Therefore, these formulas are never used for unconstrained optimization.

The formulas were developed for solving systems of nonlinear equations, and they have several other applications, for instance in methods for least squares problems and minimax optimization. We return to (3.21) in Section 6.4.
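Both updates amount to a single outer product. A minimal sketch (function names are ours) that also verifies the quasi–Newton conditions (3.19) and (3.20):

```python
import numpy as np

def broyden_B(B, h, y):
    """Broyden's rank-one update (3.21) of the Hessian approximation B."""
    return B + np.outer(y - B @ h, h) / (h @ h)

def broyden_D(D, h, y):
    """The analogous rank-one update (3.22) of the inverse approximation D."""
    return D + np.outer(h - D @ y, y) / (y @ y)

# Quick check of the quasi-Newton conditions (3.19) and (3.20).
# Note: the updated matrices are in general neither symmetric nor
# positive definite -- the weakness discussed above.
rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))
h, y = rng.standard_normal(3), rng.standard_normal(3)
assert np.allclose(broyden_B(B, h, y) @ h, y)   # Bnew h = y
assert np.allclose(broyden_D(D, h, y) @ y, h)   # Dnew y = h
```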

3.3.2. Symmetric updating

The Hessian and its inverse are symmetric, and therefore it is natural to require that also B and D be symmetric. We can obtain this for the D-sequence if we start with a symmetric D0 and use rank-one updates of the form

Dnew = D + u uᵀ .

The quasi–Newton condition (3.20) determines u uniquely:

h = D y + u uᵀy  ⇔  (uᵀy) u = h − D y ,

leading to

Dnew = D + (1/(uᵀy)) u uᵀ  with  u = h − D y .

A similar derivation gives the symmetric rank-one updating formula for approximations to ∇²f(x),

Bnew = B + (1/(hᵀv)) v vᵀ  with  v = y − B h .

These formulas are known as the SR1 formulas. They have been used very little in practice. Notice that the updating breaks down if uᵀy = 0. This occurs if u = 0, in which case we just take Dnew = D. It also occurs if a nonzero u is orthogonal to y, and in this case there is no solution. In implementations this is handled by setting Dnew = D whenever |uᵀy| is too small.
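A sketch of the two SR1 updates with the safeguard just described; the relative threshold is our own choice of what "too small" means:

```python
import numpy as np

def sr1_D(D, h, y, tol=1e-8):
    """SR1 update of the inverse-Hessian approximation, cf (3.20)."""
    u = h - D @ y
    uy = u @ y
    if abs(uy) <= tol * np.linalg.norm(u) * np.linalg.norm(y):
        return D                    # skip: u = 0, or u (nearly) orthogonal to y
    return D + np.outer(u, u) / uy

def sr1_B(B, h, y, tol=1e-8):
    """SR1 update of the Hessian approximation, cf (3.19)."""
    v = y - B @ h
    hv = h @ v
    if abs(hv) <= tol * np.linalg.norm(v) * np.linalg.norm(h):
        return B
    return B + np.outer(v, v) / hv
```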

Example 3.7. The SR1 formulas have some interesting properties. The most important is that a quasi–Newton method without line search based on SR1 updating will minimize a quadratic with positive definite Hessian in at most n+1 iterations, provided that the search directions are linearly independent and yᵀu remains positive. Further, in this case Dnew equals ∇²f(x)⁻¹ after n+1 steps. This important property is called quadratic termination, cf Section 2.6.
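This is easy to observe numerically. A small self-contained demo on a random positive definite quadratic, using unit steps (no line search) and SR1 updating of D; all data and names are our own, and the outcome relies on the generated steps being linearly independent, which holds generically for random data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # positive definite Hessian
b = rng.standard_normal(n)
grad = lambda x: A @ x + b                   # gradient of the quadratic

x, D = rng.standard_normal(n), np.eye(n)
for k in range(n + 1):                       # at most n+1 iterations
    x_new = x - D @ grad(x)                  # unit step, no line search
    h, y = x_new - x, grad(x_new) - grad(x)
    u = h - D @ y
    if abs(u @ y) > 1e-12:                   # SR1 update, skipped if it breaks down
        D += np.outer(u, u) / (u @ y)
    x = x_new

print(np.linalg.norm(grad(x)))               # ~ 0: the minimizer has been found
print(np.allclose(D, np.linalg.inv(A)))      # True: D equals the inverse Hessian
```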

The observant reader will have noticed a disappointing fact: a rank-one updating does not have enough freedom to satisfy the quasi–Newton condition, preserve symmetry and the desirable property that we mentioned
