
… gradient of the cost function. If the cost function is quadratic, then its gradient is linear in x, and so is its approximation. When the Quasi–Newton condition has been enforced in n steps, the two linear functions agree in n+1 positions in IR^n, and consequently the two functions are identical. Iterate no. n+1, x_new, makes the gradient of the approximation equal to zero, and so it also makes the gradient of the cost function equal to zero; it solves the problem. The proviso that the quadratic and D_0 must be positive definite ensures that x_new is not only a stationary point, but also a minimizer.

5.12. Implementation of a Quasi–Newton Method

In this section we shall discuss some details of the implementation and end by giving the Quasi–Newton algorithm with the different parts assembled.

Based on the discussion in the previous sections we have chosen a BFGS updating formula, and for the reasons given on page 53, an update of the inverse Hessian has been chosen. For student exercises and preliminary research this update is adequate, but even though D in theory stays positive definite, rounding errors may cause ill conditioning and even indefiniteness. For professional codes, updating a factorization of the Hessian is recommended, so that the effect of rounding errors can be treated properly. In the present context a less advanced remedy is described, which is to omit the updating if the curvature condition (5.28) does not hold, since in this case D_new would not be positive definite. Actually, Dennis and Schnabel (1983) recommend that the updating is skipped if

h^T y ≤ ε_M^{1/2} ‖h‖₂ ‖y‖₂ ,   (5.33)

where ε_M is the machine precision.
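In code, this safeguard amounts to a one-line test. A minimal Python sketch (the function name skip_update and the use of numpy are our choices, not from the text):

import numpy as np

def skip_update(h, y):
    # Safeguard (5.33): return True when the update should be omitted,
    # ie when h^T y <= sqrt(eps_M) * ||h||_2 * ||y||_2.
    eps_M = np.finfo(float).eps          # machine precision
    return h @ y <= np.sqrt(eps_M) * np.linalg.norm(h) * np.linalg.norm(y)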

We shall assume the availability of a soft line search such as Algorithm 2.27.

It is important to notice that all the function evaluations take place during the line search. Hence, the values of f and f′ at the new point are received from the line search subprogram. In the next iteration step these values are returned to the subprogram, so that f and f′ for α = 0 are ready for the next search. Sometimes the gradient need not be calculated as often as f.


In a production code the line search should only calculate f and f′, respectively, when they are needed.

As regards the initial approximation to the inverse Hessian, D_0, it is traditionally recommended to use D_0 = I, the identity matrix. This D_0 is, of course, positive definite, and the first step will be in the steepest descent direction.

Finally, we outline an algorithm for a Quasi–Newton method. Actually, the curvature condition (5.28) need not be tested, because it is incorporated in the soft line search as stopping criterion (2.26b).

Algorithm 5.34. Quasi–Newton method with BFGS–updating

begin
  x := x0;  D := D0;  k := 0;  nv := 0              {Initialisation}
  while ‖f′(x)‖ > ε and k < kmax and nv < nvmax
    hqn := D (−f′(x))                               {Quasi–Newton equation}
    [α, dv] := soft line search(x, hqn)             {Algorithm 2.27}
    nv := nv + dv                                   {No. of function evaluations}
    xnew := x + α hqn;  k := k+1
    if hqn^T f′(xnew) > hqn^T f′(x)                 {Condition (5.28)}
      Update D                                      {using (5.30)}
    x := xnew
end
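For readers who want to experiment, the following is a minimal Python sketch of such a method. It is an illustration under stated assumptions, not the algorithm of the text: the inverse Hessian is updated with the standard rank-two BFGS formula (assumed equivalent to (5.30)), and the soft line search of Algorithm 2.27 is replaced by simple Armijo backtracking, so iteration and evaluation counts will differ from those reported below.

import numpy as np

def quasi_newton_bfgs(f, grad, x0, eps=1e-10, kmax=200, nvmax=2000):
    # Sketch of a quasi-Newton method with BFGS updating of the inverse
    # Hessian D.  Armijo backtracking replaces the soft line search of
    # Algorithm 2.27, and the update is skipped when the curvature
    # condition (h^T y > 0) fails, so D stays positive definite.
    x = np.asarray(x0, dtype=float)
    n = x.size
    D = np.eye(n)                    # D0 = I: first step is steepest descent
    nv = 0                           # number of f-evaluations

    def feval(z):                    # wrapper that counts f-evaluations
        nonlocal nv
        nv += 1
        return f(z)

    g = grad(x)
    k = 0
    while np.linalg.norm(g) > eps and k < kmax and nv < nvmax:
        h_qn = -D @ g                # quasi-Newton step, D ~ inverse Hessian
        alpha, fx = 1.0, feval(x)
        while alpha > 1e-12 and feval(x + alpha * h_qn) > fx + 1e-4 * alpha * (g @ h_qn):
            alpha *= 0.5             # backtrack until the Armijo condition holds
        x_new = x + alpha * h_qn
        g_new = grad(x_new)
        k += 1
        h, y = x_new - x, g_new - g
        hy = h @ y
        if hy > 0:                   # curvature condition (5.28) holds
            rho = 1.0 / hy
            V = np.eye(n) - rho * np.outer(h, y)
            D = V @ D @ V.T + rho * np.outer(h, h)   # BFGS update of D
        x, g = x_new, g_new
    return x, k, nv

Applied to Rosenbrock's function with x0 = [−1.2, 1]^T, a sketch like this should converge to the minimizer [1, 1]^T, although the counts in the table below will not be reproduced exactly, because the line search differs.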

Example 5.7. We consider Rosenbrock’s function from Examples 4.3 and 5.5. We have tried different updating formulas and line search methods. The line search parameters were chosen as in Example 4.3.

With the starting point x_0 = [−1.2, 1]^T, the following numbers of iteration steps and evaluations of f(x) and f′(x) are needed to satisfy the stopping criterion ‖f′(x)‖ ≤ 10^{−10}.

The results are as expected: BFGS combined with soft line search needs the smallest number of function evaluations to find the solution.


Update by   Line search   # it. steps   # fct. evals
DFP         exact             23            295
DFP         soft              31             93
BFGS        exact             23            276
BFGS        soft              29             68

Below we give the iterates (cf Figures 4.2, 4.3 and 5.4) and the values of f(x_k) and ‖f′(x_k)‖. As with the Damped Newton Method we have superlinear final convergence.

Figure 5.5: BFGS with soft line search, applied to Rosenbrock's function.
Top: iterates x_k in the (x1, x2)-plane. Bottom: f(x_k) and ‖f′(x_k)‖ versus iteration number.

The number of iteration steps is about the same as in Example 5.5, while the number of function evaluations is almost four times as big. Note, however, that with Algorithm 5.34 each evaluation involves f(x) and f′(x), while each evaluation in the Damped Newton Method also involves the Hessian f″(x). For many problems this is not available. If it is, it may be costly: we need to compute the ½ n(n+1) elements in the symmetric matrix f″(x), while f′(x) has n elements only.

APPENDIX

A. Symmetric, Positive Definite Matrices

A matrix A ∈ IR^{n×n} is symmetric if A = A^T, ie if a_ij = a_ji for all i, j.

Definition A.1. The symmetric matrix A ∈ IR^{n×n} is

  positive definite if       x^T A x > 0   for all x ∈ IR^n, x ≠ 0,
  positive semidefinite if   x^T A x ≥ 0   for all x ∈ IR^n, x ≠ 0.

Such matrices play an important role in optimization, and some useful properties are listed in

Theorem A.2. Let A ∈ IR^{n×n} be symmetric and let A = LU, where L is a unit lower triangular matrix and U is an upper triangular matrix. Then

  1°  (All u_ii > 0, i = 1, ..., n)  ⟺  (A is positive definite).

If A is positive definite, then

  2°  The LU-factorization is numerically stable.

  3°  U = D L^T with D = diag(u_ii).

  4°  A = C C^T, the Cholesky factorization. C ∈ IR^{n×n} is a lower triangular matrix.

Proof. See eg Golub and Van Loan (1996), Nielsen (1996) or Eldén et al (2004).

A unit lower triangular matrix L is characterized by ℓ_ii = 1 and ℓ_ij = 0 for j > i.

Note that the LU-factorization A = LU is made without pivoting (which, by the way, could destroy the symmetry). Also note that points 3°–4° give the following relation between the LU- and the Cholesky factorization

  A = L U = L D L^T = C C^T    (A.3a)


with

  C = L D^{1/2},   D^{1/2} = diag(√u_ii).   (A.3b)
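These relations are easy to check numerically. In the following sketch (numpy, and a small positive definite test matrix of our choosing, not from the text) the factors L, D and U are recovered from the Cholesky factor C:

import numpy as np

# A small symmetric positive definite test matrix (our example).
A = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 1.0],
              [0.0, 1.0, 3.0]])

C = np.linalg.cholesky(A)              # A = C C^T, C lower triangular
d = np.diag(C)                         # c_ii = sqrt(u_ii)
L = C / d                              # unit lower triangular factor
D = np.diag(d**2)                      # D = diag(u_ii)
U = D @ L.T                            # upper triangular factor of A = L U

assert np.allclose(A, C @ C.T)         # point 4: Cholesky factorization
assert np.allclose(A, L @ D @ L.T)     # point 3: A = L D L^T
assert np.allclose(A, L @ U)           # LU-factorization without pivoting
assert np.all(np.diag(U) > 0)          # point 1: all u_ii > 0
assert np.allclose(C, L @ np.diag(d))  # relation (A.3b): C = L D^{1/2}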

The Cholesky factorization with test for positive definiteness can be implemented as follows. (This algorithm does not rely on (A.3), but is derived directly from 4° in Theorem A.2.)

Algorithm A.4. Cholesky factorization.

begin
  k := 0;  posdef := true                       {Initialisation}
  while posdef and k < n
    k := k+1
    d := a_kk − Σ_{j=1}^{k−1} (c_kj)²
    if d > 0                                    {test for pos. def.}
      c_kk := √d                                {diagonal element}
      for i := k+1, ..., n                      {subdiagonal elements}
        c_ik := ( a_ik − Σ_{j=1}^{k−1} c_ij c_kj ) / c_kk
    else
      posdef := false
end

The “cost” of this algorithm is O(n³) operations.

This algorithm can eg be used in Algorithm 5.15. Actually it is the cheapest way to check for positive-definiteness.
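A direct Python transcription of Algorithm A.4 could look as follows (a sketch; the function name cholesky_posdef and the return convention are ours):

import numpy as np

def cholesky_posdef(A):
    # Attempt the Cholesky factorization A = C C^T of a symmetric matrix.
    # Returns (C, True) if A is positive definite, and (None, False) as
    # soon as a non-positive pivot d is met -- the test of Algorithm A.4.
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    C = np.zeros((n, n))
    for k in range(n):
        d = A[k, k] - np.dot(C[k, :k], C[k, :k])
        if d <= 0:                       # not positive definite
            return None, False
        C[k, k] = np.sqrt(d)             # diagonal element
        for i in range(k + 1, n):        # subdiagonal elements
            C[i, k] = (A[i, k] - np.dot(C[i, :k], C[k, :k])) / C[k, k]
    return C, True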

The solution to the system

  A x = b

can be computed via the Cholesky factorization: Inserting A = C C^T we see that the system splits into

  C z = b   and   C^T x = z .

The two triangular systems are solved by forward- and back-substitution, respectively.


The “cost” of this algorithm is O(n²) operations.
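In code, the two triangular solves can be written directly; the following sketch (assuming the factor C from Algorithm A.4, with numpy) performs the forward and back substitutions:

import numpy as np

def solve_spd(C, b):
    # Solve A x = b given the Cholesky factor C (A = C C^T):
    # C z = b by forward substitution, then C^T x = z by back substitution.
    n = C.shape[0]
    z = np.zeros(n)
    for i in range(n):                   # forward substitution
        z[i] = (b[i] - np.dot(C[i, :i], z[:i])) / C[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):       # back substitution
        x[i] = (z[i] - np.dot(C[i + 1:, i], x[i + 1:])) / C[i, i]
    return x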

B. Proof of Theorem 4.12

… and Algorithm 4.6 and (4.11) combine to

  h_{r+1} = α_{r+1} ( −g_r + γ_r α_r^{−1} h_r )   with   γ_r = (g_r^T g_r) / (g_{r−1}^T g_{r−1}) ,   (B.3)

and α_{r+1} found by exact line search. Finally, we remind the reader of (4.10) and (4.9),

  h_r^T g_r = 0   and   α_{r+1}^{−1} h_{r+1}^T g_r = −g_r^T g_r .   (B.4)

Now, we are ready for the induction:

For j = 1, (B.1) is trivially satisfied; there is no h_i vector with i < 1.

Next, assume that (B.1) holds for all j = 1, ..., k. Then it follows from the proof of Theorem 4.8 that

  g_k^T h_i = 0   for   i = 1, ..., k .   (B.5)

If we insert (B.3), we see that this implies

  0 = g_k^T ( −g_{i−1} + γ_{i−1} α_{i−1}^{−1} h_{i−1} ) = −g_k^T g_{i−1} ,

since the term g_k^T h_{i−1} vanishes by (B.5).


Thus, the gradients at the iterates are orthogonal,

  g_k^T g_i = 0   for   i = 1, ..., k−1 .   (B.6)

In the first reformulation we use both relations in (B.4), and next we use the definition of γ_k in (B.3).

Thus, we have shown that (B.1) also holds for j = k+1 and thereby finished the proof.

C. Proof of (5.9)

The symmetric matrix H = f″(x) ∈ IR^{n×n} has real eigenvalues {λ_j}_{j=1}^n, and we can choose an orthogonal set of corresponding eigenvectors {v_j},

  H v_j = λ_j v_j ,   j = 1, ..., n ,   with   v_i^T v_j = 0 for i ≠ j .

It is easy to show that H is positive definite if and only if all the eigenvalues are strictly positive.

From the above equation we get

  (H + μI) v_j = (λ_j + μ) v_j ,   j = 1, ..., n ,

and we see that H + μI is positive definite for all μ > −min{ min_j λ_j , 0 }.
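A quick numerical illustration of this eigenvalue shift (the 2×2 matrix is an arbitrary indefinite example of our choosing, not from the text):

import numpy as np

# Symmetric matrix with a negative eigenvalue, so H itself is indefinite.
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])             # eigenvalues 3 and -1

lam = np.linalg.eigvalsh(H)
mu = max(-lam.min(), 0.0) + 0.1        # any mu > -min{min_j lambda_j, 0} will do

shifted = np.linalg.eigvalsh(H + mu * np.eye(2))
assert np.allclose(shifted, lam + mu)  # eigenvalues are simply shifted by mu
assert np.all(shifted > 0)             # hence H + mu*I is positive definite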


D. Proof of Theorem 5.11

In (5.8) we introduced the function

  q_μ(h) = q(h) + ½ μ h^T h   with   μ ≥ 0 ,

where q is given by (5.1b). The gradient of q_μ is

  q_μ′ = q′ + μ h = g + (H + μI) h ,

where g = f′(x), H = f″(x). According to the assumption, the matrix H + μI is positive definite, and therefore the linear system of equations q_μ′ = 0 has a unique solution, which is the minimizer of q_μ. This solution is recognized as h_dn.

Now, let

  h_M = argmin_{‖h‖ ≤ ‖h_dn‖} { q(h) } .

Then q(h_M) ≤ q(h_dn) and h_M^T h_M ≤ h_dn^T h_dn, so that

  q_μ(h_M) = q(h_M) + ½ μ h_M^T h_M ≤ q(h_dn) + ½ μ h_dn^T h_dn = q_μ(h_dn) .

However, h_dn is the unique minimizer of q_μ, so h_M = h_dn.


REFERENCES

1. M. Al-Baali (1985): Descent Property and Global Convergence of the Fletcher–Reeves method with inexact linesearch, IMA Journal on Num. Anal. 5, pp 121–124.

2. C. G. Broyden (1965): A class of methods for solving nonlinear simultaneous equations, Maths. Comp. 19, pp 577–593.

3. H. P. Crowder and P. Wolfe (1972): Linear Convergence of the Conjugate Gradient Method, IBM J. Res. Dev. 16, pp 431–433.

4. J. E. Dennis, Jr. and R.B. Schnabel (1983): Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice Hall.

5. L. Eldén, L. Wittmeyer-Koch and H.B. Nielsen (2004): Introduction to Numerical Computation – analysis and MATLAB illustrations, to appear.

6. R. Fletcher (1980): Practical Methods of Optimization, vol 1, John Wiley.

7. R. Fletcher (1987): Practical Methods of Optimization, vol 2, John Wiley.

8. A. A. Goldstein and J.F. Price (1967): An Effective algorithm for Minimization, Num. Math. 10, pp 184–189.

9. G. Golub and C.F. Van Loan (1996): Matrix Computations, 3rd edition, Johns Hopkins Press.

10. D. Luenberger (1984): Linear and Nonlinear Programming, Addison-Wesley.

11. K. Madsen (1984): Optimering uden bibetingelser (in Danish), Hæfte 46, Numerisk Institut, DTH.


12. K. Madsen, H.B. Nielsen and O. Tingleff (2004): Methods for Non-Linear Least Squares Problems, IMM, DTU. Available at http://www.imm.dtu.dk/courses/02611/NLS.pdf

13. D. Marquardt (1963): An Algorithm for Least Squares Estimation of Nonlinear Parameters, SIAM J. Appl. Math. 11, pp 431–441.

14. H.B. Nielsen (1996): Numerisk Lineær Algebra (in Danish), IMM, DTU.

15. H.B. Nielsen (1999): Damping Parameter in Marquardt's Method. Report IMM-REP-1999-05, IMM, DTU. Available at http://www.imm.dtu.dk/hbn/publ/TR9905.ps.Z

16. J. Nocedal (1992): Theory of Algorithms for Unconstrained Optimization, in Acta Numerica 1992, Cambridge Univ. Press.

17. J. Nocedal and S.J. Wright (1999): Numerical Optimization, Springer.

18. H.A. van der Vorst (2003): Iterative Krylov Methods for Large Linear Systems, Cambridge University Press.

INDEX

Cholesky factorization, 44, 53, 60, 67
conjugate directions, 28, 33, 39
– gradients, 29, 39
– gradient method, 28, 31, 38
contours, 3, 7, 36, 42
exact line search, 17, 24, 61
factorization, 64, 67
finite difference, 52
Fletcher, 16, 19, 33, 36, 38f, 42, 47, 53, 58, 61, 63
flutter, 49
gain factor, 18, 48
GAMS, ii
global convergence, 8, 34, 45, 58f
– minimizer, 2
implementation, 37, 44, 64
inverse Hessian, 53, 62, 64
level curves, 3, 7, 36, 42
Levenberg, 48, 50
limited memory method, 39
linear approximation, 18
– convergence, 9, 38
– system, 39
local minimizer, 2, 4, 5ff, 12
machine precision, 64
Madsen, 20, 49
Marquardt, 47, 48, 50
maximizer, 1, 6, 44, 45
method of alternating variables, 3
minimizer, 1, 6
model problem, 18
necessary condition, 5, 7
Newton's method, 41, 42
Nielsen, 49, 67
Nocedal, 34, 39, 47, 62
objective function, 1
outer product, 55
Polak, 34, 36, 37, 39
positive definite, 5, 14, 28, 30, 42, 45, 50, 58, 67
Powell, 34, 58, 62
pseudo angle, 14
quadratic approximation, 18
– convergence, 9, 42, 45, 58
– model, 29, 41
– termination, 30, 57, 63
quasi, 52
Quasi–Newton, 54, 65
– condition, 55, 61, 63
Reeves, 33, 36, 38
resetting, 34
Ribière, 34, 36f, 39
Rosenbrock, 35, 51, 65
saddle point, 6, 44
Schnabel, 53, 64
secant condition, 54
semidefinite, 67
Shanno, 61
soft line search, 17, 34, 37, 61, 64f
SR1 formula, 57
stationary point, 5
steepest descent, 25, 29, 34, 45, 47, 58, 65
Stiefel's cage, 3, 27
stopping criterion, 10, 31, 37
sufficient condition, 5
superlinear convergence, 9, 50, 59
Taylor expansion, 4, 41
tourist, 10, 12, 17
trial point, 48
tricky function, 42, 50
truncated Newton method, 39
trust region, 18, 47
updating, 49, 54, 57f, 62, 64f
Van Loan, 39, 67
van der Vorst, 39
Wolfe, 35
Wright, 47
