
used the fitting model $M(x, t) = x_1 e^{-x_3 t} + x_2 e^{-x_4 t}$. The $i$th row of the Jacobian is

\[
J(x)_{i,:} = \begin{bmatrix} -e^{-x_3 t_i} & -e^{-x_4 t_i} & x_1 t_i e^{-x_3 t_i} & x_2 t_i e^{-x_4 t_i} \end{bmatrix}.
\]

If the problem is consistent (ie $r(\hat{x}) = 0$), then the Gauss–Newton method with line search will have quadratic final convergence, provided that $\hat{x}_3$ is significantly different from $\hat{x}_4$. If $\hat{x}_3 = \hat{x}_4$, then $\operatorname{rank}(J(\hat{x})) \le 2$, and the Gauss–Newton method fails.

If one or more measurement errors are exceptionally large, then $r(\hat{x})$ has some large components, and this may slow down the convergence. If, on the other hand, the errors behave like white noise, we can expect partial cancelling of the terms $r_i(\hat{x})\,\nabla^2 r_i(\hat{x})$ in the difference between $\nabla^2 f(x)$ and $\nabla^2 L(0)$, and get superlinear final convergence.

In Matlab we can give a very compact function for computing $r$ and $J$:

Suppose that x holds the current iterate and that the $m \times 2$ array ty holds the coordinates of the data points. The following function returns r and J containing $r(x)$ and $J(x)$, respectively.

function [r, J] = fitexp(x, ty)
% FITEXP  Residual r(x) and Jacobian J(x) for M(x,t) = x1*exp(-x3*t) + x2*exp(-x4*t)
t = ty(:,1);  x = x(:);
E = exp(t * (-x(3:4))');         % E(i,:) = [exp(-x3*t_i), exp(-x4*t_i)]
r = ty(:,2) - E*x(1:2);          % residuals  y_i - M(x,t_i)
J = [-E  (t*x(1:2)').*E];        % ith row of the Jacobian of r, as given above
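As a quick illustration, fitexp could be exercised on synthetic data as sketched below. This is not part of the text; the sampling points, the "true" parameters xe, the noise level and the trial point x0 are arbitrary choices for the sketch.

% Hypothetical usage sketch: generate noisy data from the model and
% evaluate the residual and Jacobian at a trial point.
t  = linspace(0, 10, 45)';                 % sampling points (arbitrary)
xe = [3; -3; 1; 2];                        % "true" parameters (assumed)
y  = xe(1)*exp(-xe(3)*t) + xe(2)*exp(-xe(4)*t) + 0.01*randn(size(t));
ty = [t y];                                % m-by-2 array of data points
x0 = [1; -1; 1; 2];                        % trial point (arbitrary)
[r, J] = fitexp(x0, ty);                   % residual and Jacobian at x0
disp(norm(r))                              % size of the initial residual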

Example 6.7. If we use Newton–Raphson's method to solve the problem from Example 6.2, $r(\hat{x}) = 0$ with $r : \mathbb{R}^n \mapsto \mathbb{R}^n$, the typical iteration step is

\[
\text{Solve } J(x)\, h_{nr} = -r(x); \qquad x := x + h_{nr}.
\]

The Gauss–Newton method applied to the minimization of $f(x) = \tfrac{1}{2} r(x)^T r(x)$ has the typical step

\[
\text{Solve } J(x)^T J(x)\, h_{gn} = -J(x)^T r(x); \qquad x := x + h_{gn}.
\]

In this case $J(x)$ is a square matrix, and if it is nonsingular, then $J(x)^{-T}$ exists, and it follows that $h_{gn} = h_{nr}$. Therefore, when applied to Powell's problem from Example 6.2, the Gauss–Newton method will have the same troubles as discussed for Newton–Raphson's method in that example.
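The equivalence $h_{gn} = h_{nr}$ for a square, nonsingular Jacobian is easy to check numerically. The fragment below is a small sketch, not from the text; the matrix and residual values are arbitrary.

% Hypothetical check: for square nonsingular J, the Gauss-Newton step
% equals the Newton-Raphson step.
J = [3 1; 2 5];           % arbitrary nonsingular Jacobian
r = [0.7; -1.2];          % arbitrary residual vector
h_nr = -J \ r;            % Newton-Raphson step:  J*h = -r
h_gn = -(J'*J) \ (J'*r);  % Gauss-Newton step:  (J'*J)*h = -J'*r
disp(norm(h_nr - h_gn))   % agree up to rounding errors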

These examples show that the Gauss–Newton method may fail, both with and without a line search. Still, in many applications it gives quite good performance, though it normally only has linear convergence as opposed to the quadratic convergence from Newton’s method with implemented second derivatives. In Sections 6.2 and 6.3 we give two methods with superior global performance.


6.2. The Levenberg–Marquardt method

Levenberg and later Marquardt [31, 36] suggested using a damped Gauss–Newton method, cf Section 3.2. The step $h_{lm}$ is defined by the following modification to (6.11):

\[
(J^T J + \mu I)\, h_{lm} = -J^T r, \qquad \text{with } J = J(x),\ r = r(x),\ \mu \ge 0. \qquad (6.13)
\]

The damping parameter $\mu$ has several effects:

1. For all $\mu > 0$ the coefficient matrix is positive definite, cf Appendix A.2, and this ensures that $h_{lm}$ is a descent direction, cf Theorem 2.15.

2. For large values of $\mu$ we get
\[
h_{lm} \simeq -\frac{1}{\mu} J^T r = -\frac{1}{\mu} \nabla f(x),
\]
ie a short step in the steepest descent direction. This is good if the current iterate is far from the solution.

3. If $\mu$ is very small, then $h_{lm} \simeq h_{gn}$, which is a good step in the final stages of the iteration, when $x$ is close to $\hat{x}$. If $f(\hat{x}) = 0$ (or very small), then we can get (almost) quadratic final convergence. A small numerical sketch of these two limiting cases is given below.
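The limiting behaviours in 2. and 3. can be observed directly in a small experiment. The sketch below is illustrative only; the matrices J, r and the two $\mu$-values are arbitrary choices, not from the text.

% Hypothetical illustration of the effect of mu on the L-M step (6.13):
% for large mu, h_lm is close to the scaled steepest descent step;
% for small mu, h_lm is close to the Gauss-Newton step.
J = [1 2; 3 4; 5 6];  r = [1; -1; 2];      % arbitrary J(x) and r(x)
g = J'*r;                                   % gradient of f at x
for mu = [1e6 1e-6]                         % very large / very small damping
    h_lm = -(J'*J + mu*eye(2)) \ g;         % L-M step from (6.13)
    h_sd = -g/mu;                           % scaled steepest descent step
    h_gn = -(J'*J) \ g;                     % Gauss-Newton step
    fprintf('mu = %g: |h_lm - h_sd| = %g, |h_lm - h_gn| = %g\n', ...
            mu, norm(h_lm - h_sd), norm(h_lm - h_gn));
end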

Thus, the damping parameter influences both the direction and the size of the step, and this leads us to make a method without a specific line search. The choice of initial $\mu$-value should be related to the size of the elements in $A^{[0]} = J(x_0)^T J(x_0)$, for instance by letting

\[
\mu_0 = \tau \cdot \max_i \{ a_{ii}^{[0]} \}, \qquad (6.14)
\]

where $\tau$ is chosen by the user. The algorithm is not very sensitive to the choice of $\tau$, but as a rule of thumb, one should use a small value, for instance $\tau = 10^{-6}$, if $x_0$ is believed to be a good approximation to $\hat{x}$. Otherwise, use $\tau = 10^{-3}$ or even $\tau = 1$. During iteration the size of $\mu$ can be updated as described in Section 3.2. The updating is controlled by the gain ratio

\[
\varrho = \frac{f(x) - f(x_{\mathrm{new}})}{L(0) - L(h_{lm})},
\]

where $x_{\mathrm{new}} = x + h_{lm}$. In order to reduce cancellation errors the numerator

\[
\delta f = f(x) - f(x_{\mathrm{new}}) = \tfrac{1}{2} r(x)^T r(x) - \tfrac{1}{2} r(x_{\mathrm{new}})^T r(x_{\mathrm{new}})
\]

should be computed by the mathematically equivalent expression

\[
\delta f = \tfrac{1}{2}\bigl(r(x) - r(x_{\mathrm{new}})\bigr)^T \bigl(r(x) + r(x_{\mathrm{new}})\bigr). \qquad (6.15)
\]

The denominator is the gain predicted by the affine model (6.9),

\[
\begin{aligned}
\delta L = L(0) - L(h_{lm})
  &= -h_{lm}^T J^T r - \tfrac{1}{2}\, h_{lm}^T J^T J\, h_{lm} \\
  &= -\tfrac{1}{2}\, h_{lm}^T \bigl( 2 J^T r + (J^T J + \mu I - \mu I)\, h_{lm} \bigr) \\
  &= \tfrac{1}{2}\, h_{lm}^T \bigl( \mu h_{lm} - \nabla f(x) \bigr). \qquad (6.16)
\end{aligned}
\]

Note that both $h_{lm}^T h_{lm}$ and $-h_{lm}^T \nabla f(x)$ are positive, so $L(0) - L(h_{lm})$ is guaranteed to be positive.
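In code, the gain ratio can be evaluated directly from (6.15) and (6.16). The helper below is a minimal sketch; the function name gain_ratio and its argument list are illustrative choices, not part of the text.

function rho = gain_ratio(r, r_new, h_lm, g, mu)
% GAIN_RATIO  Sketch of the gain ratio computation via (6.15)-(6.16).
%   r, r_new : residual vectors at x and at x_new = x + h_lm
%   g        : gradient J(x)'*r(x);   mu : current damping parameter
df  = ((r - r_new)' * (r + r_new)) / 2;   % actual gain, eq. (6.15)
dL  = (h_lm' * (mu*h_lm - g)) / 2;        % predicted gain, eq. (6.16)
rho = df / dL;
end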

A large value of $\varrho$ indicates that $L(h_{lm})$ is a good approximation to $f(x + h_{lm})$, and we can decrease $\mu$ so that the next Levenberg–Marquardt step is closer to the Gauss–Newton step. If $\varrho$ is small (maybe even negative), then $L(h_{lm})$ is a poor approximation, and we should increase $\mu$ with the twofold aim of getting closer to the steepest descent direction and reducing the step length. These goals can be met in different ways, see Algorithm 6.18 and Example 6.7 below.

As discussed in connection with (2.10) – (2.11) we suggest to use the following stopping criteria for the algorithm,

\[
\begin{aligned}
\|\nabla f(x)\| &\le \varepsilon_1, \\
\|x_{\mathrm{new}} - x\| &\le \varepsilon_2 \,(\|x\| + \varepsilon_2), \\
k &\ge k_{\max},
\end{aligned} \qquad (6.17)
\]

where $\varepsilon_1$ and $\varepsilon_2$ are small, positive numbers and $k_{\max}$ is an integer; all of these are chosen by the user.

The last two criteria come into effect, for instance, if $\varepsilon_1$ is chosen so small that effects of rounding errors have large influence. This will typically reveal itself in a poor accordance between the actual gain in $f$ and the gain predicted by the affine model $L$, and will result in $\mu$ being augmented in every step. The updating strategy for $\mu$ is based on (3.12), but modified so that consecutive failures to obtain a smaller $f$-value give fast growth of $\mu$, resulting in a small $\|h_{lm}\|$, and the process will be stopped by the second criterion in (6.17).

The algorithm is summarized below. We shall refer to it as the L–M method.


Algorithm 6.18. Levenberg–Marquardt method

Given x0, τ, ε1, ε2, kmax
begin
  k := 0;  ν := 2;  x := x0
  A := J(x)^T J(x);  g := J(x)^T r(x)
  found := (‖g‖ ≤ ε1);  µ := τ · max_i{a_ii}
  while (not found) and (k < kmax)
    k := k + 1;  Solve (A + µI) h_lm = −g
    if ‖h_lm‖ ≤ ε2(‖x‖ + ε2)
      found := true
    else
      x_new := x + h_lm
      ϱ := δf / δL              {computed by (6.15) and (6.16)}
      if ϱ > 0                  {step acceptable}
        x := x_new
        A := J(x)^T J(x);  g := J(x)^T r(x)
        found := (‖g‖ ≤ ε1)
        µ := µ · max{1/3, 1 − (2ϱ − 1)^3};  ν := 2
      else
        µ := µ · ν;  ν := 2ν
      end
    end
  end
end
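For concreteness, here is a minimal Matlab sketch of Algorithm 6.18. It is not the book's reference implementation; the function name lm_sketch, the calling convention (a handle fun returning [r, J], as fitexp above) and the use of the infinity norm for the gradient test are assumptions made for this illustration.

function [x, info] = lm_sketch(fun, x0, tau, eps1, eps2, kmax)
% LM_SKETCH  Minimal sketch of the Levenberg-Marquardt method (Alg. 6.18).
%   fun : handle returning [r, J] = fun(x), e.g. @(x) fitexp(x, ty)
x = x0(:);
[r, J] = fun(x);
A = J'*J;  g = J'*r;
k = 0;  nu = 2;
mu = tau * max(diag(A));                       % initial damping, (6.14)
found = (norm(g, inf) <= eps1);                % infinity norm chosen for the sketch
while ~found && k < kmax
    k = k + 1;
    h = -(A + mu*eye(length(x))) \ g;          % solve (A + mu*I) h = -g
    if norm(h) <= eps2 * (norm(x) + eps2)
        found = true;                          % step too small: stop
    else
        xnew = x + h;
        [rnew, Jnew] = fun(xnew);
        df = ((r - rnew)' * (r + rnew)) / 2;   % actual gain, (6.15)
        dL = (h' * (mu*h - g)) / 2;            % predicted gain, (6.16)
        rho = df / dL;                         % gain ratio
        if rho > 0                             % step acceptable
            x = xnew;  r = rnew;  J = Jnew;
            A = J'*J;  g = J'*r;
            found = (norm(g, inf) <= eps1);
            mu = mu * max(1/3, 1 - (2*rho - 1)^3);  nu = 2;
        else                                   % reject step, increase damping
            mu = mu * nu;  nu = 2*nu;
        end
    end
end
info = struct('iterations', k, 'gradnorm', norm(g, inf));
end

With the synthetic data from the earlier fitexp sketch, a call could look like [xhat, info] = lm_sketch(@(x) fitexp(x, ty), x0, 1e-3, 1e-8, 1e-14, 100), mirroring the parameter values quoted in Example 6.10.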

Example 6.8. We already saw in connection with (6.11) that the Gauss–Newton step $h_{gn}$ is the least squares solution to the linear problem

\[
r(x) + J(x)\, h \simeq 0.
\]

Similarly, the L–M equations in (6.13) are the normal equations for the augmented linear problem

\[
\begin{bmatrix} r(x) \\ 0 \end{bmatrix} + \begin{bmatrix} J(x) \\ \sqrt{\mu}\, I \end{bmatrix} h \simeq 0.
\]

As discussed in Section 5.2, the most accurate solution is found via orthogonal transformation. However, the solution $h_{lm}$ is just a step in an iterative process and need not be computed very accurately, and since the solution via the normal equations is "cheaper" to compute, this method is often used.
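To make the two solution routes concrete, the fragment below sketches both: the augmented least squares problem solved by orthogonal transformation (Matlab's backslash uses a QR factorization for a rectangular system) and the normal equations solved directly. The values of J, r and mu are placeholders.

% Hypothetical comparison of the two ways of computing h_lm.
J  = [1 2; 3 4; 5 6];  r = [1; -1; 2];  mu = 0.1;   % placeholder data
n  = size(J, 2);
% (a) augmented least squares problem, solved via QR (backslash):
h_qr = [J; sqrt(mu)*eye(n)] \ [-r; zeros(n,1)];
% (b) normal equations (cheaper, slightly less accurate):
h_ne = -(J'*J + mu*eye(n)) \ (J'*r);
disp(norm(h_qr - h_ne))     % agree up to rounding errors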

Example 6.9. The Rosenbrock problem from Examples 2.8, 3.5 and 3.9 can be formulated as a system of nonlinear equations, $r(x) = 0$. In Figure 6.1 we illustrate the performance of Algorithm 6.18 applied to this problem with the input $x_0 = \begin{bmatrix} -1.2 & 1 \end{bmatrix}^T$.

Figure 6.1. Rosenbrock function. Iteration path and performance parameters.

The step from $x_0$ is good, but from $x_1$ the attempted step was uphill, so a new step had to be computed with an increased value for $\mu$. This also happens at $x_5$. Apart from that the iteration progresses nicely, and stops after 15 computations of $h_{lm}$ because it has reached a point $x$ that differs from $\hat{x}$ by $10^{-9} \cdot \begin{bmatrix} 4.1 & 8.2 \end{bmatrix}^T$, with $\|\nabla f(x)\| \simeq 1.7 \cdot 10^{-9} \le \varepsilon_1$.

Rosenbrock's problem is consistent, $r(\hat{x}) = 0$, and therefore we can expect quadratic final convergence. This is seen from the behaviour of $f(x)$ and $\|\nabla f(x)\|$ in the last few iterations.
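As an illustration, one common least squares formulation of Rosenbrock's problem is sketched below, together with a call to the lm_sketch function given after Algorithm 6.18. This choice of $r$ and the tolerance values are assumptions for the sketch; the exact scaling used in Example 6.9 may differ.

function [r, J] = rosenbrock_r(x)
% ROSENBROCK_R  A common least squares formulation of Rosenbrock's problem
% (assumed here; the book's exact scaling of r may differ).
r = [10*(x(2) - x(1)^2);  1 - x(1)];
J = [-20*x(1)  10;          % Jacobian of r
     -1         0];
end

% Hypothetical call, with arbitrarily chosen tolerances:
% [xhat, info] = lm_sketch(@rosenbrock_r, [-1.2; 1], 1e-3, 1e-9, 1e-12, 100)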

Example 6.10. We have used Algorithm 6.18 on the data fitting problem from Examples 5.2 and 6.6 with the fitting model

\[
M(x, t) = x_1 e^{-x_3 t} + x_2 e^{-x_4 t}.
\]

Figure 5.2 indicates that both $x_3$ and $x_4$ are positive and that $M(\hat{x}, 0) \simeq 0$. These conditions are satisfied by $x_0 = \begin{bmatrix} 1 & -1 & 1 & 2 \end{bmatrix}^T$. Further, we used $\tau = 10^{-3}$, $\varepsilon_1 = 10^{-8}$, $\varepsilon_2 = 10^{-14}$, $k_{\max} = 100$. The algorithm stopped after 62 iteration steps with $x \simeq \begin{bmatrix} 4 & -4 & 4 & 5 \end{bmatrix}^T$, corresponding to the fit shown by the full line in Figure 5.2. The performance is illustrated below.

Figure 6.2. L–M method applied to the problem from Example 5.2: $f(x)$, $\|\nabla f\|$ and $\mu$ versus iteration number.

This problem is not consistent, so we could expect linear final convergence. The final iterations indicate a much better (superlinear) convergence. The explanation is that the $\nabla^2 r_i(x)$ are slowly varying functions of $t_i$, and the $r_i(\hat{x})$ have "random" sign, so that the contributions to the "forgotten term" in (6.12) almost cancel each other. Such a situation occurs frequently in data fitting applications.

For comparison, Figure 6.3 shows the performance with the classical updating strategy for $\mu$, (3.11). From step 5 to step 68 we see that each decrease in $\mu$ is immediately followed by an increase, and the norm of the gradient has a rugged behaviour. This slows down the convergence, but the final stage is as in Figure 6.2.

Figure 6.3. L–M method with the classical $\mu$-updating strategy: $f(x)$, $\|\nabla f\|$ and $\mu$ versus iteration number.

Example 6.11. Figure 6.4 illustrates the performance of Algorithm 6.18 applied to Powell's problem from Examples 6.2 and 6.7. The starting point is $x_0 = \begin{bmatrix} 3 & 1 \end{bmatrix}^T$, and we use $\tau = 1$, $\varepsilon_1 = \varepsilon_2 = 10^{-15}$, $k_{\max} = 100$.