Large-Scale Problems

Small-scale problems:

“anything goes,”

no problem to use SVD or other factorizations/decompositions.

Large-scale problems:

factorizations are not possible in general,

if possible, use matrix structure (Toeplitz, Kronecker, ...),

storage and computing time set the limitations,

solving, say, the Tikhonov problem for a range of regularization parameters can be a formidable task.

But Wait – There’s More

Let us consider the optimization framework for the least-squares problem:

$$\min_x F(x), \qquad F(x) = \tfrac12 \|A x - b\|_2^2, \qquad \nabla F(x) = A^T (A x - b).$$

Steepest descent algorithm:

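$$x^{[k+1]} = x^{[k]} - \omega_k \nabla F(x^{[k]}) = x^{[k]} + \omega_k A^T (b - A x^{[k]}).$$

CGLS: the conjugate gradient algorithm applied to $A^T A x = A^T b$:

$$x^{[k+1]} = x^{[k]} - \alpha_k d^{[k]}, \qquad d^{[k]} = \text{search direction},$$

$$(d^{[k]})^T A^T A \, d^{[j]} = 0, \qquad j = 1, 2, \ldots, k-1.$$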

Advantages of Iterative Methods

We typically think of iterative methods as necessary for solving nonlinear problems. But we can also use them for large, linear problems.

Iterative methods produce a sequence $x^{[0]} \to x^{[1]} \to x^{[2]} \to \cdots$ of iterates that (hopefully) converge to the desired solution, solely through the use of matrix-vector multiplications.

The matrix $A$ is never altered, only “touched” via the matrix-vector multiplications $A x$ and $A^T y$.

The matrix $A$ is not explicitly required; we only need a “black box” that computes the action of $A$ or the underlying operator.

The atomic operations of iterative methods (mat-vec product, saxpy, norm) are well suited for high-performance computing.

They often produce a natural sequence of regularized solutions; stop when the solution is “satisfactory” (parameter choice).
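The “black box” idea can be sketched in a few lines. The operator below is a hypothetical example chosen only for illustration (running sums, i.e., a lower-triangular matrix of ones that is never formed); the names `forward` and `adjoint` are assumptions, not part of the slides. The final lines check that the two routines really are transposes of each other.

```python
# Hypothetical example operator: (A x)_i = x_0 + ... + x_i.
# A is the lower-triangular matrix of ones, but it is never stored.

def forward(x):
    """Return A x without forming A."""
    out, total = [], 0.0
    for value in x:
        total += value
        out.append(total)
    return out


def adjoint(y):
    """Return A^T y, where (A^T y)_j = y_j + ... + y_{n-1}."""
    out, total = [0.0] * len(y), 0.0
    for j in range(len(y) - 1, -1, -1):
        total += y[j]
        out[j] = total
    return out


def dot(u, v):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


# Adjoint check: <A x, y> must equal <x, A^T y> for any x and y.
x = [1.0, -2.0, 3.0, 0.5]
y = [0.25, 4.0, -1.0, 2.0]
lhs = dot(forward(x), y)
rhs = dot(x, adjoint(y))
print(lhs, rhs)
```

An iterative method only ever calls such a pair of functions, so the cost per iteration is the cost of the two routines, not of storing or factorizing a matrix.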

Two Types of Iterative Methods

1. Iterative solution of a regularized problem, such as Tikhonov:

$$(A^T A + \lambda^2 I)\, x = A^T b \quad\Leftrightarrow\quad \min_x \tfrac12 \left\| \begin{pmatrix} A \\ \lambda I \end{pmatrix} x - \begin{pmatrix} b \\ 0 \end{pmatrix} \right\|_2^2 .$$

Challenges: we must solve for many values of $\lambda$, and we need a good preconditioner!

2. Iterate on the un-regularized system, e.g., on $A x = b$ or $A^T A x = A^T b$, and use the iteration number as the regularization parameter.

The latter approach relies on semi-convergence: initial convergence towards the desired $x^{\text{exact}}$, followed by (slow) convergence to the unwanted $A^{-1} b$.

Must stop at the end of the first stage!

Illustration of Semi-Convergence
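In place of the figure, semi-convergence can be reproduced with a small sketch. Everything concrete here is an assumption made for illustration: a diagonal $A$ with three decaying singular values, a fixed perturbation of $b$, and the gradient step $x \leftarrow x + \omega A^T(b - A x)$ with a constant step length $\omega$.

```python
import math

# Assumed toy problem: A = diag(sigma), so the gradient step
# x <- x + omega * A^T (b - A x) decouples component by component.
sigma = [1.0, 0.3, 0.03]          # singular values of A
x_exact = [1.0, 1.0, 1.0]
noise = [0.03, -0.03, 0.03]       # fixed perturbation e (reproducible)
b = [s * x + e for s, x, e in zip(sigma, x_exact, noise)]

omega = 1.0                       # satisfies 0 < omega < 2 / sigma_1^2
num_iters = 5000

x = [0.0, 0.0, 0.0]
errors = [math.dist(x, x_exact)]  # errors[k] = ||x^[k] - x_exact||_2
for _ in range(num_iters):
    x = [xi + omega * s * (bi - s * xi) for xi, s, bi in zip(x, sigma, b)]
    errors.append(math.dist(x, x_exact))

# Iteration number with the smallest error: the end of the "first stage".
best_k = min(range(len(errors)), key=errors.__getitem__)
print(f"error at k = 0:          {errors[0]:.3f}")
print(f"smallest error:          {errors[best_k]:.3f} (k = {best_k})")
print(f"error at k = {num_iters}:       {errors[-1]:.3f}")
```

The error first decreases and then grows again as the iterates approach $A^{-1} b$, whose last component is dominated by the amplified noise $e_3 / \sigma_3$. In practice $x^{\text{exact}}$ is unknown, so the stopping index must come from a parameter-choice rule rather than from the error itself.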

Landweber Iteration

A classical stationary iterative method:

$$x^{[k]} = x^{[k-1]} + \omega A^T (b - A x^{[k-1]}), \qquad k = 1, 2, \ldots$$

where $0 < \omega < 2 \|A^T A\|_2^{-1} = 2 \sigma_1^{-2}$.

Where does this come from? Consider the function $\phi(x) = \tfrac12 \|b - A x\|_2^2$ associated with the least squares problem $\min_x \phi(x)$. It is straightforward (but perhaps a bit tedious) to show that the gradient of $\phi$ is

$$\nabla \phi(x) = -A^T (b - A x).$$
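For readers who want the skipped step, one way to carry it out is to expand the squared norm and differentiate term by term, using only the definition of $\phi$ above:

```latex
\begin{aligned}
\phi(x) &= \tfrac12 (b - A x)^T (b - A x)
         = \tfrac12\, b^T b - x^T A^T b + \tfrac12\, x^T A^T A\, x, \\
\nabla \phi(x) &= -A^T b + A^T A\, x = -A^T (b - A x).
\end{aligned}
```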

Thus, each step in Landweber’s method is a step in the direction of steepest descent. See next slide for an example of iterations.
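A minimal sketch of the iteration, assuming a small dense matrix stored as a list of rows. The helper names (`matvec`, `rmatvec`, `landweber`) and the test matrix are illustrative choices, not from the slides.

```python
def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def landweber(A, b, omega, num_iters, x0=None):
    """Run num_iters steps of x <- x + omega * A^T (b - A x)."""
    x = list(x0) if x0 is not None else [0.0] * len(A[0])
    for _ in range(num_iters):
        residual = [bi - ci for bi, ci in zip(b, matvec(A, x))]
        step = rmatvec(A, residual)
        x = [xi + omega * si for xi, si in zip(x, step)]
    return x


# Assumed example: A^T A = [[2, 1], [1, 2]] has eigenvalues 3 and 1,
# so sigma_1^2 = 3 and any 0 < omega < 2/3 is admissible.
A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
b = [3.0, 2.0, 1.0]               # consistent with the solution [1, 2]
x = landweber(A, b, omega=0.5, num_iters=100)
print(x)
```

Only the products $A x$ and $A^T y$ are needed, so `matvec` and `rmatvec` could be swapped for any matrix-free pair of routines.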

The Geometry of Landweber Iterations

Towards Convergence Analysis

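With an arbitrary starting vector $x^{[0]}$, the $k$th Landweber iterate is:

$$\begin{aligned}
x^{[k]} &= x^{[k-1]} + \omega A^T (b - A x^{[k-1]}) \\
&= (I - \omega A^T A)\, x^{[k-1]} + \omega A^T b \\
&= (I - \omega A^T A) \left[ (I - \omega A^T A)\, x^{[k-2]} + \omega A^T b \right] + \omega A^T b \\
&= (I - \omega A^T A)^2 x^{[k-2]} + \left( (I - \omega A^T A) + I \right) \omega A^T b \\
&= (I - \omega A^T A)^3 x^{[k-3]} + \left( (I - \omega A^T A)^2 + (I - \omega A^T A) + I \right) \omega A^T b \\
&= \cdots \\
&= (I - \omega A^T A)^k x^{[0]} + \left[ (I - \omega A^T A)^{k-1} + (I - \omega A^T A)^{k-2} + \cdots + I \right] \omega A^T b \\
&= (I - \omega A^T A)^k x^{[0]} + \sum_{j=0}^{k-1} (I - \omega A^T A)^j \, \omega A^T b.
\end{aligned}$$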

SVD Analysis

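For simplicity we now assume that $x^{[0]} = 0$. We insert the SVD of the matrix $A = U \Sigma V^T$ and use $I = V V^T$:

$$x^{[k]} = V \sum_{j=0}^{k-1} (I - \omega \Sigma^2)^j \, \omega \Sigma \, U^T b = V \Phi^{(k)} \Sigma^{-1} U^T b,$$

where we introduced the $n \times n$ diagonal matrix

$$\Phi^{(k)} = \sum_{j=0}^{k-1} (I - \omega \Sigma^2)^j \, \omega \Sigma^2 = \omega \Sigma^2 \sum_{j=0}^{k-1} (I - \omega \Sigma^2)^j = \begin{pmatrix} \phi_1^{(k)} & & \\ & \phi_2^{(k)} & \\ & & \ddots \end{pmatrix}$$

with diagonal elements

$$\phi_i^{(k)} = \omega \sigma_i^2 \sum_{j=0}^{k-1} (1 - \omega \sigma_i^2)^j, \qquad i = 1, 2, \ldots, n.$$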

The Filter Factors

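The sum $\sum_{j=0}^{k-1} (1 - \omega \sigma_i^2)^j$ is a geometric series,

$$\sum_{j=0}^{k-1} z^j = \frac{1 - z^k}{1 - z},$$

and thus for $i = 1, 2, \ldots, n$ we have

$$\phi_i^{(k)} = \omega \sigma_i^2 \sum_{j=0}^{k-1} (1 - \omega \sigma_i^2)^j = \omega \sigma_i^2 \, \frac{1 - (1 - \omega \sigma_i^2)^k}{1 - (1 - \omega \sigma_i^2)} = 1 - (1 - \omega \sigma_i^2)^k.$$

Let $\sigma_{\text{break}}^{(k)}$ denote the value of $\sigma_i$ for which $\phi_i^{(k)} = 0.5$. Then

$$\frac{\sigma_{\text{break}}^{(k)}}{\sigma_{\text{break}}^{(2k)}} = \sqrt{1 + \left(\tfrac12\right)^{\frac{1}{2k}}} \to \sqrt{2} \quad \text{for } k \to \infty.$$

Hence, as $k$ increases, the breakpoint tends to be reduced by a factor $\sqrt{2} \approx 1.4$ each time the number of iterations $k$ is doubled.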
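The closed form and the breakpoint ratio are easy to check numerically. The sketch below assumes $\omega = 1$ and a few sample values of $k$; the function names are illustrative. The breakpoint follows from solving $(1 - \omega\sigma^2)^k = \tfrac12$ for $\sigma$.

```python
import math


def landweber_filter(sigma, omega, k):
    """Closed form: phi^(k) = 1 - (1 - omega * sigma^2)^k."""
    return 1.0 - (1.0 - omega * sigma * sigma) ** k


def landweber_filter_sum(sigma, omega, k):
    """Same filter factor, evaluated as the geometric sum."""
    z = 1.0 - omega * sigma * sigma
    return omega * sigma * sigma * sum(z ** j for j in range(k))


def breakpoint(omega, k):
    """sigma with phi^(k) = 0.5, i.e. (1 - omega * sigma^2)^k = 0.5."""
    return math.sqrt((1.0 - 0.5 ** (1.0 / k)) / omega)


omega = 1.0
ratios = {}
for k in (10, 100, 1000):
    ratios[k] = breakpoint(omega, k) / breakpoint(omega, 2 * k)
    print(k, ratios[k], math.sqrt(1.0 + 0.5 ** (1.0 / (2 * k))))
```

The two printed columns agree, and both approach $\sqrt{2}$ as $k$ grows.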

Landweber Filter Factors

Cimmino Iteration

Cimmino's method is a variant of Landweber's method, with a diagonal scaling:

$$x^{[k]} = x^{[k-1]} + \omega A^T D (b - A x^{[k-1]}), \qquad k = 1, 2, \ldots$$

in which $D = \mathrm{diag}(d_i)$ is a diagonal matrix whose elements are defined in terms of the rows $a_i^T = A(i,:)$ of $A$ as

$$d_i = \begin{cases} \dfrac{1}{m} \dfrac{1}{\|a_i\|_2^2}, & a_i \neq 0, \\[1ex] 0, & a_i = 0. \end{cases}$$

Cimmino's method often converges faster than Landweber's method.
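A minimal sketch of Cimmino's iteration with the weights $d_i$ defined above, again for a small dense matrix stored as a list of rows. The helper names, the test matrix (which includes a zero row to exercise the second case of $d_i$), and the choice $\omega = 1$ are assumptions for illustration.

```python
def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def cimmino_weights(A):
    """d_i = 1 / (m * ||a_i||_2^2) for nonzero rows, 0 for zero rows."""
    m = len(A)
    weights = []
    for row in A:
        norm_sq = sum(a * a for a in row)
        weights.append(1.0 / (m * norm_sq) if norm_sq > 0.0 else 0.0)
    return weights


def cimmino(A, b, omega, num_iters):
    """Run num_iters steps of x <- x + omega * A^T D (b - A x), x^[0] = 0."""
    d = cimmino_weights(A)
    x = [0.0] * len(A[0])
    for _ in range(num_iters):
        scaled = [di * (bi - ci) for di, bi, ci in zip(d, b, matvec(A, x))]
        x = [xi + omega * si for xi, si in zip(x, rmatvec(A, scaled))]
    return x


# Assumed example: consistent system with solution [1, 2] and one zero row.
A = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
b = [3.0, 2.0, 1.0, 0.0]
weights = cimmino_weights(A)
x = cimmino(A, b, omega=1.0, num_iters=300)
print(weights, x)
```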

. . . and the prize for best acronym goes to “ART”

Kaczmarz’s method = algebraic reconstruction technique (ART).

Let $a_i^T = A(i,:)$ = $i$th row of $A$, and $b_i$ = $i$th component of $b$.

Each iteration of ART involves the following “sweep” over all rows:

    $z^{(0)} = x^{[k-1]}$
    for $i = 1, \ldots, m$
        $z^{(i)} = z^{(i-1)} + \dfrac{b_i - a_i^T z^{(i-1)}}{\|a_i\|_2^2} \, a_i$
    end
    $x^{[k]} = z^{(m)}$

This method is not “simultaneous” because each row must be processed sequentially.

In general: fast initial convergence, then slow. See next slides.
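The sweep above can be sketched as follows. The function name and the $2 \times 2$ test system are illustrative assumptions; skipping zero rows is also an assumption added here to avoid a division by zero, since the slide does not cover that case.

```python
def kaczmarz_sweep(A, b, x):
    """One ART sweep: project onto the hyperplane of each row in turn."""
    z = list(x)                                    # z^(0) = x^[k-1]
    for row, bi in zip(A, b):
        norm_sq = sum(a * a for a in row)
        if norm_sq == 0.0:
            continue                               # assumption: skip zero rows
        scale = (bi - sum(a * zj for a, zj in zip(row, z))) / norm_sq
        z = [zj + scale * a for zj, a in zip(z, row)]
    return z                                       # x^[k] = z^(m)


# Assumed example: consistent system with solution [1, 2].
A = [[2.0, 1.0], [1.0, 3.0]]
b = [4.0, 7.0]

first = kaczmarz_sweep(A, b, [0.0, 0.0])           # iterate after one sweep
x = [0.0, 0.0]
for _ in range(60):
    x = kaczmarz_sweep(A, b, x)
print(first, x)
```

After every sweep the last equation is satisfied exactly (up to rounding), because the final projection lands on its hyperplane; the earlier equations are only satisfied in the limit.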

The Geometry of ART Iterations

Slow Convergence of SIRT and ART Methods

The test problem is shaw.

Projection Methods

As an important step towards the faster Krylov subspace methods, we consider projection methods.

Assume the columns of $W_k = (w_1, \ldots, w_k) \in \mathbb{R}^{n \times k}$ form a “good basis” for an approximate regularized solution, obtained by solving

$$\min_x \|A x - b\|_2 \quad \text{s.t.} \quad x \in \mathcal{W}_k = \mathrm{span}\{w_1, \ldots, w_k\}.$$

This solution takes the form

$$x^{(k)} = W_k y^{(k)}, \qquad y^{(k)} = \mathrm{argmin}_y \|(A W_k)\, y - b\|_2,$$

and we refer to the least squares problem $\min_y \|(A W_k)\, y - b\|_2$ as the projected problem, because it is obtained by projecting the original problem onto the $k$-dimensional subspace $\mathrm{span}\{w_1, \ldots, w_k\}$.

If $W_k = V_k$ (the first $k$ right singular vectors), then we obtain the TSVD method, and $x^{(k)} = x_k$ (the TSVD solution). But we want to work with computationally simpler basis vectors.

Computations with DCT Basis

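Note that

$$\hat{A}_k = A W_k = (W_k^T A^T)^T = \left[ (W^T A^T)^T \right]_{:,1:k}.$$

In the case of the discrete cosine basis, multiplication with $W^T$ is equivalent to a DCT. The algorithm takes the form:

    Akhat = dct(A')';               % (W^T A^T)^T = A W: DCT applied to each row of A
    Akhat = Akhat(:,1:k);           % keep the first k columns, i.e., A W_k
    y = Akhat\b;                    % solve the projected least squares problem
    xk = idct([y;zeros(n-k,1)]);    % x^(k) = W_k y via the inverse DCT

Bottom: cosine basis $w_i$, $i = 1, \ldots, 10$.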

Example Using Discrete Cosine Basis (shaw)

[Figure: top, projected solutions for k = 1, ..., 10; bottom, the cosine basis vectors w_1, ..., w_10.]

The Krylov Subspace

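The Krylov subspace, defined as

$$\mathcal{K}_k \equiv \mathrm{span}\{A^T b, \; A^T A \, A^T b, \; (A^T A)^2 A^T b, \; \ldots, \; (A^T A)^{k-1} A^T b\},$$

always adapts itself to the problem at hand! But the “naive” basis $q_i = (A^T A)^{i-1} A^T b$ is NOT useful due to scaling issues.

The normalized, “naive” basis

$$p_i = (A^T A)^{i-1} A^T b \, / \, \|(A^T A)^{i-1} A^T b\|_2, \qquad i = 1, 2, \ldots$$

is NOT useful either: $p_i \to v_1$ as $i \to \infty$. See the next slide.

Moreover, the condition numbers of the matrices $[q_1, \ldots, q_k]$ and $[p_1, \ldots, p_k]$ increase dramatically with $k$.

Use modified Gram-Schmidt, for which $\mathrm{cond}([w_1, \ldots, w_k]) = 1$:

    $w_1 \leftarrow A^T b$;  $w_1 \leftarrow w_1 / \|w_1\|_2$
    $w_2 \leftarrow A^T A \, w_1$;  $w_2 \leftarrow w_2 - (w_1^T w_2)\, w_1$;  $w_2 \leftarrow w_2 / \|w_2\|_2$
    $w_3 \leftarrow A^T A \, w_2$;  $w_3 \leftarrow w_3 - (w_1^T w_3)\, w_1$;  $w_3 \leftarrow w_3 - (w_2^T w_3)\, w_2$;  $w_3 \leftarrow w_3 / \|w_3\|_2$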
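The Gram-Schmidt steps above generalize to any $k$. Below is a minimal sketch for a small dense matrix; the function name `krylov_basis` and the test data are assumptions for illustration, and the code does not guard against breakdown (a zero vector after orthogonalization), which happens once the Krylov subspace stops growing.

```python
import math


def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def krylov_basis(A, b, k):
    """Orthonormal vectors w_1, ..., w_k by modified Gram-Schmidt."""
    w = rmatvec(A, b)                              # w_1 <- A^T b
    norm = math.sqrt(dot(w, w))
    basis = [[wi / norm for wi in w]]
    for _ in range(1, k):
        w = rmatvec(A, matvec(A, basis[-1]))       # w_i <- A^T A w_{i-1}
        for q in basis:                            # subtract one projection at a time
            coeff = dot(q, w)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))                # zero here would mean breakdown
        basis.append([wi / norm for wi in w])
    return basis


# Assumed example for which the Krylov subspace has full dimension 3.
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 4.0]
W = krylov_basis(A, b, 3)
gram = [[dot(u, v) for v in W] for u in W]         # should be close to I
print(gram)
```

The printed Gram matrix $W_k^T W_k$ is the identity up to rounding, which is what $\mathrm{cond}([w_1, \ldots, w_k]) = 1$ means.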

Comparison of basis vectors p_i (blue) and w_i (red)

Conditioning of the bases

This figure shows the condition numbers of the three matrices of basis vectors $[q_1, \ldots, q_k]$, $[p_1, \ldots, p_k]$, and $[w_1, \ldots, w_k]$ for increasing $k$.

Can We Compute x^(k) Without Storing W_k?

Yes: the CGLS algorithm (see next slide) computes iterates given by

$$x^{(k)} = \mathrm{argmin}_x \|A x - b\|_2 \quad \text{s.t.} \quad x \in \mathcal{K}_k.$$

The algorithm eventually converges to the least squares solution.

But since $\mathcal{K}_k$ is a good subspace for approximate regularized solutions, CGLS exhibits semi-convergence.

CGLS = Conjugate Gradients for Least Squares

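The CGLS algorithm for solving $\min_x \|A x - b\|_2$ takes the following form:

    $x^{(0)}$ = starting vector (e.g., zero)
    $r^{(0)} = b - A x^{(0)}$
    $d^{(0)} = A^T r^{(0)}$
    for $k = 1, 2, \ldots$
        $\bar\alpha_k = \|A^T r^{(k-1)}\|_2^2 \, / \, \|A d^{(k-1)}\|_2^2$
        $x^{(k)} = x^{(k-1)} + \bar\alpha_k d^{(k-1)}$
        $r^{(k)} = r^{(k-1)} - \bar\alpha_k A d^{(k-1)}$
        $\bar\beta_k = \|A^T r^{(k)}\|_2^2 \, / \, \|A^T r^{(k-1)}\|_2^2$
        $d^{(k)} = A^T r^{(k)} + \bar\beta_k d^{(k-1)}$
    end

For Tikhonov, just replace $A$ and $b$ with $\begin{pmatrix} A \\ \lambda I \end{pmatrix}$ and $\begin{pmatrix} b \\ 0 \end{pmatrix}$.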
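A direct transcription of these steps into a runnable sketch, for a small dense matrix stored as a list of rows. The helper names, the test system, and the early-exit test on $\|A^T r\|_2$ (added to avoid dividing by a vanishing quantity once the iteration has converged) are assumptions, not part of the slide.

```python
def matvec(A, x):
    """Return A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Return A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A)))
            for j in range(len(A[0]))]


def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


def cgls(A, b, num_iters, rel_tol=1e-12):
    """CGLS from x^(0) = 0; stops early once ||A^T r|| is negligible."""
    x = [0.0] * len(A[0])
    r = list(b)                                    # r^(0) = b - A x^(0) = b
    s = rmatvec(A, r)                              # s = A^T r
    d = list(s)                                    # d^(0) = A^T r^(0)
    gamma = dot(s, s)                              # ||A^T r||_2^2
    threshold = (rel_tol ** 2) * gamma
    for _ in range(num_iters):
        if gamma <= threshold:                     # assumed stopping rule
            break
        q = matvec(A, d)
        alpha = gamma / dot(q, q)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(A, r)
        gamma_new = dot(s, s)
        beta = gamma_new / gamma
        d = [si + beta * di for si, di in zip(s, d)]
        gamma = gamma_new
    return x


# Assumed example: consistent 4 x 3 system with solution [1, 2, 3].
A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0], [1.0, 0.0, 1.0]]
b = [4.0, 8.0, 8.0, 4.0]
x = cgls(A, b, num_iters=10)

# Tikhonov: the same routine applied to the stacked matrix and right-hand side.
lam, n = 1.0, len(A[0])
A_aug = A + [[lam if i == j else 0.0 for j in range(n)] for i in range(n)]
b_aug = b + [0.0] * n
x_tik = cgls(A_aug, b_aug, num_iters=10)
print(x, x_tik)
```

For an ill-posed problem with noisy data one would not iterate to convergence as done here, but stop early, since the iteration number plays the role of the regularization parameter.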

Comparison of CGLS With the Previous Methods

SVD Analysis – Outside the Scope of This Course

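It is pretty hairy, but we can perform an SVD analysis along these lines:

$$\phi_i^{(k)} = 1 - \prod_{j=1}^{k} \frac{\theta_j^{(k)} - \sigma_i^2}{\theta_j^{(k)}} = \text{filter factors}$$

$$\theta_j^{(k)} = \text{eigenvalues of } A^T A \text{ projected on } \mathcal{K}_k, \qquad j = 1, \ldots, k$$

$$\mathcal{K}_k = \mathrm{span}\{A^T b, \; (A^T A) A^T b, \; \ldots, \; (A^T A)^{k-1} A^T b\} = \text{Krylov subspace}$$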

Other Iterations – GMRES and RRGMRES

It is sometimes difficult or inconvenient to write a matrix-free black-box function for multiplication with $A^T$. Can we avoid this?

The GMRES method for square nonsymmetric matrices is based on the Krylov subspace

$$\mathcal{K}_k = \mathrm{span}\{b, \; A b, \; A^2 b, \; \ldots, \; A^{k-1} b\}.$$

The presence of the noisy data $b = b^{\text{exact}} + e$ in this subspace is unfortunate: the solutions include the noise component $e$! A better subspace, underlying the RRGMRES method:

$$\tilde{\mathcal{K}}_k = \mathrm{span}\{A b, \; A^2 b, \; \ldots, \; A^k b\}.$$

Now the noise vector is multiplied with $A$ (smoothing) at least once.

Symmetric matrices: use MR-II (a simplified variant).