5.2. Linear least squares fit

For a linear model (5.4),

M(x, t) = x_1 φ_1(t) + ··· + x_n φ_n(t) ,

the residuals (5.5) have the form

r_i(x) = y_i − ( x_1 φ_1(t_i) + ··· + x_n φ_n(t_i) ) ,

so that the vector of residuals can be written

r(x) = y − F x , (F)_{ij} = φ_j(t_i) , F ∈ R^{m×n} . (5.8)

The polynomial model (5.3) is linear; n = d+1, all (F)_{i,n} = 1, and

F_{:,j} = t .* F_{:,j+1} , j = d, d−1, . . . , 1 ,

where t is the vector t = (t_1, . . . , t_m)^T, F_{:,j} is the jth column in F, and .* denotes the element-wise (or Hadamard) product.
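As a small illustration, here is a minimal Matlab sketch of this column recursion; the vector t of abscissas and the degree d are assumed to be given.

    % Minimal sketch: build F for the polynomial model, assuming t (m-vector) and d are given
    m = length(t);  n = d + 1;
    F = zeros(m, n);
    F(:, n) = ones(m, 1);           % last column: (F)_{i,n} = 1
    for j = n-1 : -1 : 1
        F(:, j) = t .* F(:, j+1);   % column j holds t.^(n-j)
    end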

5.2.1. Normal equations

The linear data fitting problem can be reformulated as follows: Given the overdetermined linear system

F x ≃ y , F ∈ R^{m×n}, y ∈ R^m, m > n , (5.9)

we seek a vector x ∈ R^n such that F x is as close as possible to y in the least squares sense. This is obtained when x = x̂, a minimizer of the objective function

f(x) = (1/2) ‖r(x)‖_2^2 = (1/2) r(x)^T r(x)
     = (1/2) x^T F^T F x − x^T F^T y + (1/2) y^T y . (5.10)

This is a function of the form discussed in Theorem 1.7. The matrix A = F^T F has the elements

a_{jk} = Σ_{i=1}^m φ_j(t_i) φ_k(t_i) , j, k = 1, ..., n , (5.11)

and the requirement that x̂ be a stationary point takes the form

∇f(x̂) = F^T F x̂ − F^T y = 0 . (5.12)

Now, for any u ∈ R^n let v = F u; then

u^T A u = u^T F^T F u = v^T v ≥ 0 . (5.13)

This shows that A is positive semidefinite, cf Appendix A.2. Further, if F ∈ R^{m×n} (with n < m) has linearly independent columns (rank(F) = n), then u ≠ 0 ⇒ v = F u ≠ 0, and (5.13) shows that u^T A u > 0, ie, A is positive definite in this case, implying a unique solution to (5.12).

Thus we have proved the following theorem.

Theorem 5.14. A least squares solution to the system (5.9) can be found as a solution x̂ to the so-called normal equations,

A x = b with A = F^T F , b = F^T y . (5.15)

If F has linearly independent columns, then x̂ is unique.

Thus, when the model is linear, the parameters of the least squares fit are determined by the linear, square system of equations (5.15). The determination of x(p) for any other choice of p is a nonlinear problem that must be solved by iteration. This is part of the reason that least squares fitting is so popular; in the remainder of this chapter we shall stick to that choice, and refer to Chapter 7 for other choices of norm.

Another reason for the popularity is that the statistical aspects are much better understood in this case, see Section 5.3.

Example 5.7. With the data from Example 5.1 and the model M(x, t) = x_1 t + x_2 we get


Thus, a necessary condition for rank(A) = n is that the functions {φ_j} are linearly independent. This is not sufficient, however. As a counterexample let φ_j(t) = sin(jπt), which are linearly independent; but if t_i = i, i = 1, ..., m, then F = 0 with rank zero.
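A quick Matlab check of this counterexample (with hypothetical values m = 5 and n = 3):

    % Sketch of the counterexample: integer abscissas and phi_j(t) = sin(j*pi*t)
    m = 5;  n = 3;
    t = (1:m)';                  % integer abscissas t_i = i
    F = sin(pi * t * (1:n));     % (F)_{ij} = sin(j*pi*t_i)
    max(abs(F(:)))               % of the order of rounding errors: F is zero up to rounding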

At the other extreme, suppose that the basis functions are orthogonal over the data abscissas, ie,

Σ_{i=1}^m φ_j(t_i) φ_k(t_i) = 0 for j ≠ k .

According to (5.11) the matrix A is diagonal in this case, and the least squares fit is given by

x_j = b_j / a_{jj} = ( Σ_{i=1}^m φ_j(t_i) y_i ) / ( Σ_{i=1}^m (φ_j(t_i))^2 ) , j = 1, ..., n .

Note that each component of x = x̂ is independent of the other components. This is not true in the general case.
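In Matlab terms this reads as follows (a minimal sketch, assuming F has orthogonal columns over the data abscissas and y holds the data ordinates):

    % Each parameter is found independently when A = F'*F is diagonal
    x = (F' * y) ./ sum(F.^2, 1)';   % x_j = (sum_i phi_j(t_i)*y_i) / sum_i (phi_j(t_i))^2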

Provided that A is positive definite, the normal equations can be solved efficiently and accurately via a Cholesky or an LDL^T factorization, cf Appendix A.2.
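As a minimal Matlab sketch (assuming F has full rank and that F, y hold the matrix and right-hand side of (5.9)):

    A = F' * F;            % A = F^T F, symmetric positive definite
    b = F' * y;            % b = F^T y
    C = chol(A);           % Cholesky factorization A = C^T C, C upper triangular
    x = C \ (C' \ b);      % forward and back substitution give the least squares solution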

It is instructive to give an alternative derivation of the normal equations: The vector z = F x lies in R(F), the range of F. This is the subspace of R^m which is spanned by the columns of F. The situation is illustrated below in the case m = 3, n = 2.

Figure 5.7. Range and residual vector.

We wish to minimize the distance between the vectors F x and y as measured by ‖r‖_2 = ‖y − F x‖_2. Now, ‖u‖_2 is equal to the Euclidean length of the geometric vector whose coordinates are the elements of the vector u. The minimum length of r is obtained if this vector is orthogonal to R(F), ie z^T r = 0 for every z ∈ R(F). This is equivalent to the condition

F^T r(x̂) = 0 ,

and by inserting (5.8) we recognize this condition as the normal equations.

5.2.2. Sensitivity

Suppose that (5.9) is changed from F x ≃ y to

(F + ∆) x ≃ y + δ , F, ∆ ∈ R^{m×n}, y, δ ∈ R^m ,

with least squares solution x̂ + γ. What can we say about γ if δ and ∆ are not known explicitly, but we are given bounds on the sizes of their elements?

To answer that question we first note that it follows from the above that the least squares solution to an overdetermined system of equations is a linear function of the right-hand side. We shall express this as

x̂ = F† y , (5.16)

where F† ∈ R^{n×m} is the so-called pseudo-inverse of F; when F has linearly independent columns it is given by F† = (F^T F)^{−1} F^T.
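In Matlab the pseudo-inverse is available as pinv, so this relation can be sketched as follows (assuming F has full rank):

    Fdagger = pinv(F);     % pseudo-inverse; for full-rank F equal to inv(F'*F)*F'
    x = Fdagger * y;       % least squares solution, cf (5.16)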

Next, for a general matrix B ∈ R^{m×n} we introduce the condition number

κ(B) = ‖B‖_2 ‖B†‖_2 , (5.17)

where ‖B‖_2 is the matrix norm

‖B‖_2 = max_{x≠0} { ‖B x‖_2 / ‖x‖_2 } . (5.18)

It can be shown that

‖B†‖_2 = max_{x≠0} { ‖x‖_2 / ‖B x‖_2 } .

The condition number is a measure of the linear independence of the columns of B. If they are (almost) linearly dependent, then ‖B x‖_2 can be much smaller than ‖x‖_2, so ‖B†‖_2 and therefore κ(B) is (almost) ∞. Now we are ready to answer the question at the start of this subsection:

Theorem 5.19. Let F x ≃ y and (F + ∆) x ≃ y + δ have the least squares solutions x̂ and x̂ + γ, respectively, and let

η ≡ ‖F†‖_2 ‖∆‖_2 = κ(F) ‖∆‖_2 / ‖F‖_2 .

Proof. See [3, Section 1.4] or [30, Chapter 9].

The last term in the estimate of the relative change disappears if ∆ = 0 or if the problem is consistent, ie, if r̂ = 0 ⇔ y ∈ R(F), cf Figure 5.7. Otherwise, it should be noted that the relative change in the parameters may grow as (κ(F))^2.

5.2.3. Solution via orthogonal transformation

In this section we shall discuss an alternative method for computing the least squares solution to the overdetermined system

F x ≃ y , F ∈ R^{m×n}, y ∈ R^m, m > n .

Given a vector u ∈ R^m and an orthogonal matrix Q ∈ R^{m×m}, the vector ũ = Q^T u is an orthogonal transformation of u. In Appendix A.4 we show that an orthogonal transformation preserves the 2-norm, ‖Q^T u‖_2 = ‖u‖_2, and by proper choice of Q this can be used to simplify the least squares problem.

This choice of Q is given by the QR factorization of the matrix F, cf (A.21),

F = Q [ R ; 0 ] = [ Q̂  Q̆ ] [ R ; 0 ] = Q̂ R , (5.20)

where Q ∈ R^{m×m} is orthogonal, R ∈ R^{n×n} is upper triangular, and [ R ; 0 ] denotes R stacked on top of an (m−n)×n zero block. The two submatrices of Q consist respectively of the first n and the last m−n columns, Q̂ = Q_{:,1:n} and Q̆ = Q_{:,n+1:m}. The last expression in (5.20) is the so-called economy sized (or thin) QR factorization.

The matrices Q̂ ∈ R^{m×n} and Q̆ ∈ R^{m×(m−n)} have orthonormal columns. They satisfy

Q̂^T Q̂ = I(n) , Q̆^T Q̆ = I(m−n) , Q̂^T Q̆ = 0 , (5.21)

where the index on I gives the size of the identity matrix and 0 is the n×(m−n) matrix of all zeros.1)

1) The two matrices are not orthogonal: both Q̂ Q̂^T ∈ R^{m×m} and Q̆ Q̆^T ∈ R^{m×m} are different from I(m).

Now we can formulate the alternative method for finding the least squares solution:

Theorem 5.22. Let the matrix F ∈ R^{m×n} have the QR factorization (5.20). Then the least squares solution to the overdetermined system F x ≃ y is found by back substitution in the upper triangular system

R x̂ = Q̂^T y , (5.23)

and the corresponding residual satisfies

‖r(x̂)‖_2 = ‖Q̆^T y‖_2 .

Proof. Using the norm-preserving property of an orthogonal transformation and the splitting of Q we see that

‖r(x)‖_2 = ‖Q^T (y − F x)‖_2 = ‖ [ Q̂^T y − R x ; Q̆^T y ] ‖_2 .

From this expression and the definition (A.1) of the 2-norm it follows that

‖r(x)‖_2^2 = ‖Q̂^T y − R x‖_2^2 + ‖Q̆^T y‖_2^2 , (5.24)

and the minimum norm is obtained when Q̂^T y − R x = 0, which is equivalent to (5.23). The expression for the minimum norm of the residual also follows from (5.24).
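A minimal Matlab sketch of Theorem 5.22 (assuming F has full rank and that F, y hold the system (5.9)):

    [Qhat, R] = qr(F, 0);      % economy sized QR factorization, F = Qhat*R
    x = R \ (Qhat' * y);       % back substitution in (5.23)
    resnorm = norm(y - F*x);   % equals norm(Qbreve'*y) by Theorem 5.22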

Example 5.9. The least squares solution defined by Theorem 5.22 is identical to the x̂ defined by the normal equations in Theorem 5.14,

A x = b with A = F^T F , b = F^T y .

We express F by means of (5.20), and since Q̂^T Q̂ = I, we get A = R^T Q̂^T Q̂ R = R^T R, so the normal equations are equivalent to

R^T R x̂ = R^T Q̂^T y .

We assume that F has full rank. Then (R^T)^{−1} exists, and by multiplying with it on both sides of the above equation we get (5.23). Thus, we have shown that the two ways to compute x̂ give the same result – mathematically. The matrix A is symmetric and positive definite, and we can use a Cholesky factorization A = C^T C in the solution of the normal equations. C is an upper triangular matrix, and it can be shown that except for “row-wise sign” C is equal to R.

When the computation is done in finite precision, the solution found via orthogonal transformation is more accurate. In Appendix B.3 we show that

κ(A) = (κ(F))^2 . (5.25)

If the problem is ill-conditioned (ie κ(F) is large), this may lead to disastrous effects of rounding errors during the computation of x̂. Let ε_M denote the machine precision. With respect to this the problem must be considered as rank deficient if we use the normal equations with κ(F) > 1/√(n ε_M). If the computation is done via orthogonal transformation, the results are reliable if κ(F) < 1/(√(mn) ε_M). As an example, with ε_M = 2^{−53} ≈ 1.11·10^{−16}, m = 100 and n = 10, the two conditions are κ(F) < 3·10^7 and κ(F) < 3·10^{14}, respectively.
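The squaring of the condition number is easy to observe numerically. A minimal Matlab sketch on a hypothetical polynomial fitting matrix (the values m = 100, d = 5 and the abscissas below are made up for illustration):

    m = 100;  d = 5;  n = d + 1;
    t = linspace(1, 2, m)';     % made-up abscissas in [1, 2]
    F = t .^ (n-1:-1:0);        % columns t.^d, ..., t, 1 (implicit expansion, R2016b or later)
    cond(F)                     % kappa(F)
    cond(F' * F)                % approximately kappa(F)^2, cf (5.25)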

Example 5.10. In Matlab, let F and y hold the matrix and right-hand side in the overdetermined system (5.9). Then the command

x = F \ y

returns the least squares solution, computed via orthogonal transformation.

5.2.4. Fundamental subspaces

In this section the QR factorization is related to the geometric approach illustrated in Figure 5.7. The matrix F ∈ R^{m×n} can be thought of as the mapping matrix for a linear mapping R^n → R^m, and we introduce the following:2)

Definition 5.26. The four fundamental subspaces for the matrix F ∈ R^{m×n} are

Range of F :       R(F)   = { y | y = F x, x ∈ R^n } ,
Range of F^T :     R(F^T) = { x | x = F^T y, y ∈ R^m } ,
Nullspace of F :   N(F)   = { x ∈ R^n | F x = 0 } ,
Nullspace of F^T : N(F^T) = { y ∈ R^m | F^T y = 0 } .

These spaces are subspaces of R^m or R^n, and if F has rank p, p ≤ min{m, n}, then the spaces have the following dimensions:

R(F) ⊆ R^m , dim(R(F)) = p ,      R(F^T) ⊆ R^n , dim(R(F^T)) = p ,
N(F) ⊆ R^n , dim(N(F)) = n − p ,   N(F^T) ⊆ R^m , dim(N(F^T)) = m − p .    (5.27)

2) In Linear Algebra textbooks the nullspace is often denoted the kernel.

We shall discuss the two subspaces of R^m in the case where m ≥ n and F has full rank, p = n.

Theorem 5.28. Let the matrix F ∈ R^{m×n} have the QR factorization

F = Q [ R ; 0 ] = [ Q̂  Q̆ ] [ R ; 0 ] = Q̂ R ,

and assume that rank(F) = n ≤ m. Then R(F) and N(F^T) are spanned by the columns of Q̂ and Q̆, respectively:

R(F) = span(Q_{:,1:n}) , N(F^T) = span(Q_{:,n+1:m}) .

Further, any vector y ∈ R^m can be written as the sum of two orthogonal vectors, ŷ ∈ R(F) and y̆ ∈ N(F^T),

y = ŷ + y̆ = Q̂ û + Q̆ ŭ , (5.29)

where the coordinates of y with respect to the basis given by the columns in Q can be expressed as follows:

u = Q^T y , û = Q̂^T y , ŭ = Q̆^T y .

Proof. See Appendix B.4.

Example 5.11. Now let y be the right-hand side in the overdetermined system (5.9). The splitting (5.29) is illustrated in Figure 5.7 (for m = 3, n = 2), where z is the geometric vector corresponding to ŷ and r(x̂) = y̆. The coordinates for ŷ are

û = R x̂ = Q̂^T y .

This is seen to agree with (5.23). We return to the splitting in Section 5.3.
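A minimal Matlab sketch of this splitting (assuming F with full rank and y are given):

    [Q, R0] = qr(F);                 % full QR factorization, R0 = [R; 0]
    n = size(F, 2);
    Qhat   = Q(:, 1:n);
    Qbreve = Q(:, n+1:end);
    yhat   = Qhat * (Qhat' * y);     % component in R(F)
    ybreve = Qbreve * (Qbreve' * y); % component in N(F'), equals the residual r(xhat)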