
3.3 Support Vector Machines

3.3.1 Computation

In line with the theme of this thesis, we will now review an algorithm that computes the entire regularization path of the SVM, using the fact that the Lagrange multipliers $\alpha_i$ are piecewise linear functions of the regularization parameter $\lambda$. The proof and the algorithm take an alternative but equivalent route compared to the original account by Hastie et al. [60]. The derivation presented here is arguably more direct and better suited for implementation.

We will begin by stating some useful definitions. We define the set $\mathcal{A}$ as the set containing the indices of all training data points, i.e. $\mathcal{A} = \mathcal{I} \cup \mathcal{O} \cup \mathcal{M}$. Using indices and sets of indices, we define submatrices as $A_{\{\text{rows}\},\{\text{columns}\}}$. For instance, the submatrix $K_{i,\mathcal{A}}$ corresponds to the $i$th row of the kernel matrix $K$.

The breakpoints at which the piecewise linear functions $\alpha_i(\lambda)$ change are called events. There are four types of events.

1. A point with index $i$ joins the margin from the inside, $\mathcal{I} \to \mathcal{M}$.

2. A point with index $i$ joins the margin from the outside, $\mathcal{O} \to \mathcal{M}$.

3. A point with index $i$ leaves the margin for the inside, $\mathcal{M} \to \mathcal{I}$.

4. A point with index $i$ leaves the margin for the outside, $\mathcal{M} \to \mathcal{O}$.

Between events, the sets do not change.
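For concreteness, this set bookkeeping and the submatrix notation above map directly onto NumPy index arrays. The following minimal sketch uses names of our own choosing (they are not from the thesis) to fix the convention used in the remaining sketches of this section:

```python
import numpy as np

# Illustrative index-set bookkeeping; the names are ours, not the thesis's.
# A holds all training indices, partitioned into I (inside the margin),
# O (outside the margin) and M (on the margin).
n = 8
A = np.arange(n)
I = np.array([0, 1, 2])
O = np.array([3, 4])
M = np.setdiff1d(A, np.concatenate([I, O]))   # the remaining indices

rng = np.random.default_rng(0)
X = rng.standard_normal((n, 2))
K = X @ X.T                                   # a toy linear kernel matrix

# The submatrix notation A_{rows,cols} maps onto np.ix_:
K_MA = K[np.ix_(M, A)]                        # K_{M,A}
K_iA = K[M[0], :]                             # K_{i,A}: the i-th row of K
```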

The specification of the SVM path algorithm consists of three parts. First, we derive the functional form of the multipliers $\alpha_i(\lambda)$ between events. We then discuss at what value of $\lambda$ the next event occurs. This makes it possible to trace the multipliers along the path. In the third part, we take a look at how to find a suitable starting point on the path.

To derive an expression for the multipliers, we begin by specifying the distance function (plane equation) $f(x)$ evaluated for points in the training data set in matrix form,

$$f(x_j) = \frac{1}{\lambda} K_{j,\mathcal{A}} Y \alpha + b_0 \tag{3.38}$$

$$f(x_j) = \frac{1}{\lambda}\left(K_{j,\mathcal{A}} - K_{i,\mathcal{A}}\right) Y \alpha + y_i, \quad i \in \mathcal{M} \tag{3.39}$$

where the second form eliminates $b_0$ using the margin condition $y_i f(x_i) = 1$, equivalently $f(x_i) = y_i$, for a margin point $i \in \mathcal{M}$. Next, we note that $y_j f(x_j) = 1 \;\forall j \in \mathcal{M}$. This results in $|\mathcal{M}|$ equations that again can be summarized in matrix form,

$$\frac{1}{\lambda} Y_{\mathcal{M},\mathcal{M}} \left[ K_{\mathcal{M},\mathcal{A}} - \mathbf{1}_{|\mathcal{M}|} K_{i,\mathcal{A}} \right] Y \alpha + y_{\mathcal{M}} y_i = \mathbf{1}_{|\mathcal{M}|}, \quad i \in \mathcal{M} \tag{3.40}$$

Using that $\alpha_j = 0 \;\forall j \in \mathcal{O}$ and $\alpha_j = 1 \;\forall j \in \mathcal{I}$, this expression can be expanded and rearranged into

$$Y_{\mathcal{M},\mathcal{M}} \left[ K_{\mathcal{M},\mathcal{M}} - \mathbf{1}_{|\mathcal{M}|} K_{i,\mathcal{M}} \right] Y_{\mathcal{M},\mathcal{M}}\, \alpha_{\mathcal{M}} = \lambda\left(\mathbf{1}_{|\mathcal{M}|} - y_{\mathcal{M}} y_i\right) - Y_{\mathcal{M},\mathcal{M}} \left[ K_{\mathcal{M},\mathcal{I}} - \mathbf{1}_{|\mathcal{M}|} K_{i,\mathcal{I}} \right] Y_{\mathcal{I},\mathcal{I}}\, \mathbf{1}_{|\mathcal{I}|}, \quad i \in \mathcal{M} \tag{3.41}$$

By defining a matrix $H = Y\left[K - \mathbf{1}_n K_{i,\mathcal{A}}\right] Y$ we can reduce this rather cumbersome expression into

$$H_{\mathcal{M},\mathcal{M}}\, \alpha_{\mathcal{M}} = \lambda\left(\mathbf{1}_{|\mathcal{M}|} - y_{\mathcal{M}} y_i\right) - H_{\mathcal{M},\mathcal{I}}\, \mathbf{1}_{|\mathcal{I}|}, \quad i \in \mathcal{M} \tag{3.42}$$


For $j = i$, the $i$th row of $H$ will be zero, making the system of equations rank deficient. However, we can replace this degenerate equation by the relation in Equation 3.30. Writing this equation

$$y^T \alpha = y_{\mathcal{M}}^T \alpha_{\mathcal{M}} + y_{\mathcal{I}}^T \mathbf{1}_{|\mathcal{I}|} \tag{3.43}$$

we see that it is sufficient to replace the $i$th row of $H$ by $y^T$. We call this augmented matrix $\tilde{H}$. This matrix and its relevant submatrices are now full rank, and we can obtain the following expression for the Lagrange multipliers as functions of $\lambda$,

$$\alpha_{\mathcal{M}} = \lambda \tilde{H}^{-1}_{\mathcal{M},\mathcal{M}}\left(\mathbf{1}_{|\mathcal{M}|} - y_{\mathcal{M}} y_i\right) - \tilde{H}^{-1}_{\mathcal{M},\mathcal{M}} \tilde{H}_{\mathcal{M},\mathcal{I}}\, \mathbf{1}_{|\mathcal{I}|} = \lambda p_{\mathcal{M}} + q_{\mathcal{M}} \tag{3.44}$$

This shows that the multipliers are linear functions of $\lambda$. The equation for the complete set of multipliers is $\alpha = \lambda p + q$, where the elements of $p$ and $q$ for the sets $\mathcal{I}$ and $\mathcal{O}$ are zero, except for $q_{\mathcal{I}} = \mathbf{1}$, which ensures $\alpha_{\mathcal{I}} = \mathbf{1}_{|\mathcal{I}|}$.
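Equation 3.44 translates directly into a few lines of linear algebra. The sketch below is our own illustration, not the thesis author's implementation; it assumes a nonempty margin set and uses $i = \texttt{M[0]}$ as the reference margin index. It also returns the augmented matrix, since the event search sketched later reuses its rows:

```python
import numpy as np

def path_direction(K, y, I, M):
    """Sketch of Equation 3.44: between events, alpha = lambda * p + q.

    K: kernel matrix; y: labels in {-1, +1}; I, M: index arrays for the
    inside and margin sets. i = M[0] serves as the reference margin index.
    """
    n = len(y)
    i = M[0]
    Y = np.diag(y)
    # H = Y [K - 1_n K_{i,A}] Y, the definition preceding Equation 3.42
    H = Y @ (K - np.outer(np.ones(n), K[i, :])) @ Y
    H[i, :] = y                  # the augmented matrix, cf. Equation 3.43
    H_MM = H[np.ix_(M, M)]
    p, q = np.zeros(n), np.zeros(n)
    p[M] = np.linalg.solve(H_MM, np.ones(len(M)) - y[M] * y[i])
    q[M] = -np.linalg.solve(H_MM, H[np.ix_(M, I)] @ np.ones(len(I)))
    q[I] = 1.0                   # ensures alpha_I = 1 between events
    return p, q, H               # H is reused by the event search below
```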

Now that we have derived the behavior of the multipliers between events, we seek the value $\lambda_{l+1}$ of $\lambda$ where the next event occurs. For reasons that will be explained below, we trace the path backwards, starting at a large value of $\lambda$ and moving towards smaller values. The value of $\lambda$ where the last event occurred is denoted $\lambda_l$. We are therefore seeking the value $\lambda_{l+1} < \lambda_l$ of the next event. We treat each of the four events defined above separately.

1. When the first event ($\mathcal{I} \to \mathcal{M}$) occurs for a point $x_j$, the distance from $x_j$ to the decision boundary will be exactly 1. For each point in $\mathcal{I}$ we can derive the value of $\lambda$ at which this event occurs by solving the equation $y_j f(x_j) = 1,\; j \in \mathcal{I}$ for $\lambda$,

$$\frac{1}{\lambda} H_{j,\mathcal{A}}(\lambda p + q) + y_j y_i = 1, \quad i \in \mathcal{M},\; j \in \mathcal{I} \;\Leftrightarrow\; \lambda = \frac{H_{\mathcal{I},\mathcal{A}}\, q}{\mathbf{1} - y_{\mathcal{I}} y_i - H_{\mathcal{I},\mathcal{A}}\, p}, \quad i \in \mathcal{M}, \tag{3.45}$$

where the division is elementwise. The result is $|\mathcal{I}|$ candidate values of $\lambda$.

2. Equivalently, the second event ($\mathcal{O} \to \mathcal{M}$) occurs at the following values of $\lambda$,

$$\lambda = \frac{H_{\mathcal{O},\mathcal{A}}\, q}{\mathbf{1} - y_{\mathcal{O}} y_i - H_{\mathcal{O},\mathcal{A}}\, p} \tag{3.46}$$

3. When the third event ($\mathcal{M} \to \mathcal{I}$) occurs for a point $x_j$, its corresponding multiplier will attain $\alpha_j = 1$. For all points in $\mathcal{M}$, we can find the values of $\lambda$ where this event occurs by setting Equation 3.44 equal to 1 and solving for $\lambda$,

$$\lambda = \frac{\mathbf{1} - q_{\mathcal{M}}}{p_{\mathcal{M}}} \tag{3.47}$$

4. Similarly, the final event ($\mathcal{M} \to \mathcal{O}$) occurs as $\alpha_j = 0$. The corresponding values of $\lambda$ are

$$\lambda = \frac{-q_{\mathcal{M}}}{p_{\mathcal{M}}} \tag{3.48}$$
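Equations 3.45 through 3.48 can be evaluated in a few vectorized lines. The sketch below is again our own illustration, reusing `path_direction` from above; guards against division by zero and numerical ties are omitted for brevity. As described in the next paragraph, only candidates strictly below $\lambda_l$ are admissible:

```python
import numpy as np

def next_event(H, y, p, q, I, O, M, lam_prev):
    """Sketch of Equations 3.45-3.48: the largest candidate below lam_prev.

    Only the rows of H indexed by I and O are used; these rows are
    unaffected by the row replacement that produced the augmented matrix.
    """
    i = M[0]
    cands = []
    # Events 1 and 2: points in I or O reach the margin (Eqs. 3.45, 3.46)
    for S in (I, O):
        if len(S):
            cands.append((H[S, :] @ q) / (1.0 - y[S] * y[i] - H[S, :] @ p))
    # Events 3 and 4: margin multipliers reach 1 or 0 (Eqs. 3.47, 3.48)
    cands.append((1.0 - q[M]) / p[M])
    cands.append(-q[M] / p[M])
    lams = np.concatenate(cands)
    lams = lams[lams < lam_prev]
    return lams.max() if lams.size else None
```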

A candidate value of $\lambda$ is calculated for each point and each relevant event, resulting in one candidate value for each point in $\mathcal{I}$ and $\mathcal{O}$, and two candidate values for each point in $\mathcal{M}$. Out of these candidates, the value of $\lambda$ where the next event occurs must be the largest candidate value $\lambda_{l+1}$ such that $\lambda_{l+1} < \lambda_l$.

Finally, we define a suitable starting point for the path. Using the standard quadratic programming technique to obtain values of the multipliers $\alpha_i$ for a particular value of $\lambda$ allows us to start the algorithm at any point along the path. However, we would like to start the algorithm at one of the endpoints of the path and work our way towards the other. Computationally beneficial solutions occur for very large values of $\lambda$, corresponding to a very large margin.

There are two types of behavior of the SVM for large values of $\lambda$, depending on whether there are equally many observations in each class or not. As in [60], we let $n_-$ and $n_+$ denote the number of observations in each class, and $\mathcal{I}_-$ and $\mathcal{I}_+$ denote the corresponding indices for points inside the margin.

If $n_- = n_+$, there is a large value of $\lambda$ at which one observation from each class enters $\mathcal{M}$ from $\mathcal{I}$, while the rest of the observations remain in $\mathcal{I}$. The two must enter concurrently; otherwise we have a violation of Equation 3.30. If the plane equation (3.38) for the decision boundary were known, we could identify these observations by finding the most remote observation in each class. However, $\lambda$ and $b_0$, which are the unknown elements of this expression, are equal for all observations. Further, we have $\alpha = \mathbf{1}$. Hence, we have

$$f(x_j) \propto K_{j,\mathcal{A}} Y \alpha = K_{j,\mathcal{A}}\, y. \tag{3.49}$$

Therefore, the first observation from each class to enter $\mathcal{M}$ is the one with the maximal distance according to this equation, with $j \in \mathcal{I}_-$ and $j \in \mathcal{I}_+$ respectively. Now that the indices of the elements in $\mathcal{M}$ have been disclosed, we can find the corresponding values of $\lambda$ and $b_0$, again using Equation 3.38, yielding two equations with two unknowns. The system of equations is

$$Y_{\mathcal{M},\mathcal{M}} \begin{bmatrix} K_{\mathcal{M},\mathcal{A}}\, y & \mathbf{1} \end{bmatrix} \begin{bmatrix} 1/\lambda \\ b_0 \end{bmatrix} = \mathbf{1}, \tag{3.50}$$


which can be solved to obtain $\lambda$ and $b_0$.
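A minimal sketch of this balanced initialization follows; it is our own illustration, assuming $\alpha = \mathbf{1}$, labels in $\{-1,+1\}$, and reading "most remote" as the point maximizing $y_j f(x_j)$ within its class, per Equation 3.49:

```python
import numpy as np

def balanced_start(K, y):
    """Sketch of the balanced initialization (Equations 3.49 and 3.50)."""
    score = y * (K @ y)                       # y_j f(x_j) up to scale, Eq. 3.49
    plus, minus = np.where(y > 0)[0], np.where(y < 0)[0]
    j_plus = plus[np.argmax(score[plus])]     # most remote point in I+
    j_minus = minus[np.argmax(score[minus])]  # most remote point in I-
    M = np.array([j_plus, j_minus])
    # Equation 3.50: a 2x2 linear system in (1/lambda, b0)
    B = np.diag(y[M]) @ np.column_stack([K[M, :] @ y, np.ones(2)])
    inv_lam, b0 = np.linalg.solve(B, np.ones(2))
    return 1.0 / inv_lam, b0, M
```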

The unbalanced case where e.g. $n_+ > n_-$ is both computationally and theoretically more difficult. For large values of $\lambda$ and $n_+ > n_-$, the plus-side margin will stay positioned among the most remote observations in $\mathcal{I}_+$, while the decision boundary and the opposite margin move closer as $\lambda$ shrinks. The $\alpha_i$ remain constant until the minus-side margin reaches one of the observations in $\mathcal{I}_+$. We wish to find the value of $\lambda$ at which this event occurs. Up to this point, $\alpha_i = 1,\; i \in \mathcal{I}_-$, meaning that $y_{\mathcal{I}_-}^T \alpha_{\mathcal{I}_-} = -n_-$. As always, we require Equation 3.30 to hold. Therefore, $\sum_{i=1}^n \alpha_i = 2n_-$. This sum will remain constant until the first event occurs. Maximizing the dual in Equation 3.28 is therefore equivalent to the following minimization problem in this case,

$$\underset{\alpha}{\arg\min}\; \alpha^T Y K Y \alpha \quad \text{subject to} \quad \alpha_i = 1 \;\forall i \in \mathcal{I}_-, \quad 0 \leq \alpha_i \leq 1 \;\forall i \in \mathcal{I}_+, \quad \sum_{i=1}^n \alpha_i = 2n_- \tag{3.51}$$

Using that $\alpha_i = 1 \;\forall i \in \mathcal{I}_-$, the setup can be further simplified. To see this, we expand the criterion function into parts belonging to each class.

$$\alpha^T Y K Y \alpha = \begin{bmatrix} \alpha_+^T & \alpha_-^T \end{bmatrix} \begin{bmatrix} Y_+ & 0 \\ 0 & Y_- \end{bmatrix} \begin{bmatrix} K_+ & K_{+-} \\ K_{-+} & K_- \end{bmatrix} \begin{bmatrix} Y_+ & 0 \\ 0 & Y_- \end{bmatrix} \begin{bmatrix} \alpha_+ \\ \alpha_- \end{bmatrix} = \alpha_+^T Y_+ K_+ Y_+ \alpha_+ + 2 y_-^T K_{-+} Y_+ \alpha_+ + y_-^T K_-\, y_- \tag{3.52}$$

This results in the reduced minimization problem

$$\underset{\alpha_+}{\arg\min}\; y_-^T K_{-+} Y_+ \alpha_+ + \frac{1}{2} \alpha_+^T Y_+ K_+ Y_+ \alpha_+ \quad \text{subject to} \quad 0 \leq \alpha_i \leq 1 \;\forall i \in \mathcal{I}_+, \quad \sum_{i \in \mathcal{I}_+} \alpha_i = n_- \tag{3.53}$$

The corresponding setup for the case where $n_- > n_+$ is equivalent, but with a change of signs.
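Equation 3.53 is a small box-constrained quadratic program, so any off-the-shelf QP solver will do. Below is a minimal sketch using cvxpy; the solver choice is ours (the text does not prescribe one), and the kernel matrix is assumed positive semidefinite:

```python
import numpy as np
import cvxpy as cp

def unbalanced_start(K, y):
    """Sketch of the reduced QP in Equation 3.53, assuming n+ > n-."""
    plus, minus = np.where(y > 0)[0], np.where(y < 0)[0]
    Yp = np.diag(y[plus])                       # Y_+ (the identity here)
    Kp = K[np.ix_(plus, plus)]                  # K_+
    Kmp = K[np.ix_(minus, plus)]                # K_{-+}
    a = cp.Variable(len(plus))                  # alpha_+
    lin = (y[minus] @ Kmp @ Yp) @ a             # y_-^T K_{-+} Y_+ alpha_+
    quad = 0.5 * cp.quad_form(a, Yp @ Kp @ Yp)  # needs K PSD
    constraints = [a >= 0, a <= 1, cp.sum(a) == len(minus)]
    cp.Problem(cp.Minimize(lin + quad), constraints).solve()
    return a.value
```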

Studying the values of $\alpha$ that we obtain from this procedure, we can identify points in $\mathcal{M}$ as those with $0 < \alpha_i < 1$. Points in $\mathcal{I}$ are harder to identify, as we cannot distinguish these from points that have just entered $\mathcal{M}$ from $\mathcal{I}$. The same goes for points in $\mathcal{O}$, which cannot be separated from points on the interface between $\mathcal{M}$ and $\mathcal{O}$. However, there are other ways to classify the remaining points.

There are two possible cases. The first is when $\mathcal{M}$ is empty and the surplus of multipliers from $\mathcal{I}_+$ is in $\mathcal{O}_+$. In this case, we end up with the balanced situation where $|\mathcal{I}_-| = |\mathcal{I}_+|$. At this instant, one element from each class has just entered $\mathcal{M}$ from $\mathcal{I}$; otherwise Equation 3.30 would be violated. We identify these variables in the same way as for the balanced initialization procedure described above. Note that Equation 3.49 is used with the current vector of multipliers instead of $\alpha = \mathbf{1}$. The other possibility is that $\mathcal{M}$ is nonempty and contains positive elements with $0 < \alpha_i < 1$. In this case, a single element from $\mathcal{I}_-$ has just entered $\mathcal{M}$ and has $\alpha_i = 1$. To identify which element this is, Equation 3.49 is used to calculate the distances for all elements in $\mathcal{I}_-$, and the most remote observation is singled out. The calculation of $\lambda$ and $b_0$ then proceeds as above, using Equation 3.50.

At any point along the path, the margin set $\mathcal{M}$ may become empty. Two new points will then join $\mathcal{M}$ in the next event, one from each class. This is the same situation as for the balanced initialization procedure described above, but where the current states of $\mathcal{I}$ and $\alpha$ are used.

After this ordeal, we arrive at a complete algorithm for computing the SVM path. Algorithm 3.1 states the procedure.

Figure 3.7 depicts the paths corresponding to the two rows of images in Figure 3.6. In the top path, a linear kernel has been used, while a Gaussian kernel with $\sigma = 10$ was employed in the bottom path. Typically, less constrained solutions, such as those obtained using a Gaussian kernel, lead to more complicated paths.


Figure 3.7: Example paths of the support vector machine, corresponding to the rows of images in Figure 3.6. The top path corresponds to a linear kernel, while a Gaussian kernel with $\sigma = 10$ has been used in the bottom path.


Algorithm 3.1 Path Algorithm for the Support Vector Machine

1: Init $\mathcal{I} = \mathcal{A}$, $\mathcal{M} = \emptyset$, $\mathcal{O} = \emptyset$, $\alpha = \mathbf{1}$.
2: if classes are balanced such that $n_+ = n_-$ then
3: For all elements in $\mathcal{I}$, calculate distances from the margin according to Equation 3.49.
4: Find the most remote element in $\mathcal{I}_+$ and $\mathcal{I}_-$.
5: Move the indices corresponding to these elements from $\mathcal{I}$ to $\mathcal{M}$.
6: Calculate $\lambda$ and $b_0$ corresponding to this event using Equation 3.50.
7: else
8: if $n_- > n_+$ then
9: Switch the sets $\mathcal{I}_+$ and $\mathcal{I}_-$ by setting $y = -y$ and switching $n_+$ and $n_-$.
10: end if
11: Calculate $\alpha$ using the quadratic programming setup of Equation 3.53.
12: Update the sets according to the values of $\alpha$.
13: if $\mathcal{M} = \emptyset$ then
14: Add the most remote point in $\mathcal{I}_+$ to $\mathcal{M}$ as done in the balanced case above.
15: end if
16: Add the most remote point in $\mathcal{I}_-$ to $\mathcal{M}$ as done in the balanced case above.
17: Calculate $\lambda$ and $b_0$ according to Equation 3.50, negating the answer if the sets $\mathcal{I}_+$ and $\mathcal{I}_-$ were switched above.
18: end if
19: while $\lambda > \lambda_{\min}$ do
20: if $\mathcal{M} = \emptyset$ then
21: Move elements to $\mathcal{M}$ and find new values of $\lambda$ and $b_0$ according to the balanced case above.
22: else
23: Compute $p$ and $q$ according to Equation 3.44.
24: Calculate $\lambda$ candidates according to event 1 using (3.45).
25: Calculate $\lambda$ candidates according to event 2 using (3.46).
26: Calculate $\lambda$ candidates according to event 3 using (3.47).
27: Calculate $\lambda$ candidates according to event 4 using (3.48).
28: Choose the candidate $\lambda_{l+1}$ with the largest value smaller than $\lambda_l$.
29: Calculate new coefficients, $\alpha = \lambda_{l+1} p + q$ and $b_0 = y_i - \frac{1}{\lambda_{l+1}} K_{i,\mathcal{A}} Y \alpha$ for some $i \in \mathcal{M}$.
30: Update the sets accordingly.
31: end if
32: end while
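Putting the pieces together, the main loop of Algorithm 3.1 might look as follows. This is a bare-bones skeleton built from our illustrative helpers `balanced_start`, `path_direction` and `next_event` above; it handles only the balanced case and elides the set updates that move indices between the sets at each event:

```python
import numpy as np

def svm_path(K, y, lam_min):
    """Skeleton of Algorithm 3.1, balanced case; set updates elided."""
    lam, b0, M = balanced_start(K, y)
    n = len(y)
    I = np.setdiff1d(np.arange(n), M)
    O = np.array([], dtype=int)
    lams, alphas = [lam], [np.ones(n)]
    while lam > lam_min:
        p, q, H = path_direction(K, y, I, M)
        lam = next_event(H, y, p, q, I, O, M, lam)
        if lam is None:
            break
        alpha = lam * p + q                        # Equation 3.44
        i = M[0]
        b0 = y[i] - (K[i, :] @ (y * alpha)) / lam  # from Equation 3.38
        lams.append(lam)
        alphas.append(alpha)
        # ... move the index that triggered the event between the sets,
        # and handle an emptied margin set as in the text (omitted) ...
    return lams, alphas
```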