
step towards the OLS solution. This represents a violation of the LASSO rule, as the coefficient and the corresponding correlation will take on different signs once $b$ crosses zero. In the LASSO-modified LAR algorithm, we check whether any coefficients cross zero within each step. If so, the step length is chosen such that the coefficient in question just reaches zero, and it is subsequently excluded from the active set of variables. The step length at which each variable hits zero is easily found by setting the update expression for the coefficients, $b^{k+1} = \tilde{\gamma}(b_{\mathcal{A}}^{\text{OLS}} - b^k) + b^k$, equal to zero and solving for $\tilde{\gamma}$,

$\tilde{\gamma}_i = \frac{b_i}{b_i - b_{\mathcal{A},i}^{\text{OLS}}}, \quad \forall i \in \mathcal{A}. \qquad (2.63)$

We denote these step lengths $\tilde{\gamma}$ to separate them from the LAR step lengths $\gamma$. If any $\tilde{\gamma}_i > 0$ is smaller than the next value of $\gamma$, a LASSO modification will occur in the next step. The step length is then adjusted and the relevant variable is excluded from $\mathcal{A}$. Algorithm 2.4 presents the resulting algorithm including the LASSO modification.
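As a small illustration of Equation 2.63, the candidate step lengths can be computed componentwise over the active set, keeping only the zero crossings that fall inside the upcoming step. This is a sketch with hypothetical array names, not notation used elsewhere in the thesis.

```python
import numpy as np

def zero_crossing_steps(b_active, b_ols_active):
    """Candidate step lengths of Eq. (2.63) at which active coefficients hit zero.

    b_active     -- current coefficients b_i, i in the active set A
    b_ols_active -- corresponding entries of the partial OLS solution b_A^OLS
    """
    with np.errstate(divide='ignore', invalid='ignore'):
        gamma_tilde = b_active / (b_active - b_ols_active)
    # only zero crossings strictly inside the next unit-length step are relevant
    return gamma_tilde[(gamma_tilde > 0) & (gamma_tilde < 1)]
```

If the smallest of these values is below the upcoming LAR step length $\gamma$, the corresponding variable is the one that is dropped.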

The proof that the signs of the coefficients agree with those of the current correlations, and that the modified LAR algorithm yields all possible LASSO solutions, is given by Efron et al. [40]. The proof is rather extensive, with several lemmas leading up to the final theorem. Given the technical level of this thesis, we choose not to discuss the validity of the modification here, and simply present the resulting algorithm.

Figure 2.15(a) shows the resulting LASSO regularization path. As mentioned above, the LASSO modification occurs only once, during the last LAR step. This leads to the exclusion of a variable, which is included again at the end of the following step. In the iteration after that, the OLS solution is reached. Figure 2.15(b) shows the $C_p$ measure of prediction error along the path. Using this heuristic, a model with 7 variables is selected.
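For concreteness, a minimal sketch of how such a $C_p$-based selection could be computed from a coefficient path is given below. It assumes Mallows' $C_p$ in the form $C_p = \mathrm{RSS}/\hat{\sigma}^2 - n + 2d$, with $\hat{\sigma}^2$ estimated from the full OLS fit and the number of non-zero coefficients used as the degrees of freedom $d$; the thesis may define $C_p$ slightly differently in an earlier section, and the function name is ours.

```python
import numpy as np

def cp_along_path(X, y, path):
    """Mallows' C_p for each coefficient vector in a regularization path.

    Uses C_p = RSS / sigma2_hat - n + 2 * d, where sigma2_hat is estimated
    from the full OLS fit (requires n > p) and d is taken as the number of
    non-zero coefficients, a common approximation of the degrees of freedom.
    """
    n, p = X.shape
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2 = np.sum((y - X @ b_ols) ** 2) / (n - p)
    cp = []
    for b in path:
        rss = np.sum((y - X @ b) ** 2)
        d = np.count_nonzero(b)
        cp.append(rss / sigma2 - n + 2 * d)
    return np.array(cp)

# the model minimizing C_p would be selected, e.g. best = np.argmin(cp_along_path(X, y, path))
```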


Algorithm 2.4 The modified LAR algorithm for computation of the LASSO

1: Initialize the coefficient vector $b^0 = 0_p$, the fitted vector $\mu = 0_n$, and an iteration counter $k = 0$.
2: Initialize the active set $\mathcal{A} = \emptyset$ and the inactive set $\mathcal{I} = \{1 \ldots p\}$.
3: while $|\mathcal{I}| > 0$ do
4:     Update the residual $r = y - \mu$.
5:     Find the maximal correlation $c = \max_{i \in \mathcal{I}} |x_i^T r|$.
6:     if drop condition then
7:         Set drop condition to FALSE.
8:     else
9:         Move the variable corresponding to $c$ from $\mathcal{I}$ to $\mathcal{A}$.
10:    end if
11:    Calculate the partial OLS solution $b_{\mathcal{A}}^{\text{OLS}} = (X_{\mathcal{A}}^T X_{\mathcal{A}})^{-1} X_{\mathcal{A}}^T y$.
12:    Calculate the current direction $d = X_{\mathcal{A}} b_{\mathcal{A}}^{\text{OLS}} - \mu$.
13:    Calculate the drop-condition step length $\tilde{\gamma} = \min_{i \in \mathcal{A}} b_i^k / (b_i^k - b_{\mathcal{A},i}^{\text{OLS}})$, subject to $0 < \tilde{\gamma} < 1$.
14:    if $|\mathcal{I}| = 0$ then
15:        Let the LAR step length $\gamma = 1$ (go all the way to the full OLS solution).
16:    else
17:        Calculate the LAR step length $\gamma = \min_{i \in \mathcal{I}} \left\{ \frac{x_i^T r - c}{x_i^T d - c}, \frac{x_i^T r + c}{x_i^T d + c} \right\}$, subject to $0 < \gamma \leq 1$.
18:    end if
19:    if $\tilde{\gamma} < \gamma$ then
20:        $\gamma = \tilde{\gamma}$
21:        Set drop condition to TRUE.
22:    end if
23:    Update the regression coefficients $b^{k+1} = \gamma(b_{\mathcal{A}}^{\text{OLS}} - b^k) + b^k$.
24:    Update the fitted vector $\mu = \mu + \gamma d$.
25:    if drop condition then
26:        Move the variable corresponding to $\tilde{\gamma}$ from $\mathcal{A}$ to $\mathcal{I}$.
27:    end if
28:    $k = k + 1$
29: end while
30: Output the series of coefficients $\mathcal{B} = \{b^0 \ldots b^k\}$.
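To make the steps concrete, the following is a minimal Python/NumPy sketch of Algorithm 2.4. It assumes standardized predictors and a centred response (as in the LAR setting), that $X_{\mathcal{A}}^T X_{\mathcal{A}}$ remains invertible along the path, and it omits the numerical safeguards a production implementation would need; the function name lasso_lar and the tolerance parameter are ours, not from the thesis.

```python
import numpy as np

def lasso_lar(X, y, tol=1e-12):
    """Sketch of Algorithm 2.4: least angle regression with the LASSO modification.

    Assumes standardized predictors, a centred response, and that X_A^T X_A
    stays invertible along the path.  Returns the series of coefficient
    vectors B = {b^0, ..., b^k}, one row per breakpoint.
    """
    n, p = X.shape
    b = np.zeros(p)                        # b^0 = 0_p
    mu = np.zeros(n)                       # fitted vector
    active, inactive = [], list(range(p))  # A and I
    drop = False
    path = [b.copy()]

    while inactive:
        r = y - mu                                         # step 4: residual
        cors = X[:, inactive].T @ r
        c = np.max(np.abs(cors))                           # step 5: maximal correlation
        if drop:
            drop = False                                   # step 7: skip adding after a drop
        else:
            j = inactive[int(np.argmax(np.abs(cors)))]
            inactive.remove(j)
            active.append(j)                               # step 9: add most correlated variable

        XA = X[:, active]
        b_ols = np.linalg.solve(XA.T @ XA, XA.T @ y)       # step 11: partial OLS solution
        d = XA @ b_ols - mu                                # step 12: current direction

        # step 13: drop-condition step length, Eq. (2.63)
        gam_tilde, drop_var = np.inf, None
        for pos, i in enumerate(active):
            denom = b[i] - b_ols[pos]
            if abs(denom) > tol:
                g = b[i] / denom
                if tol < g < 1 and g < gam_tilde:
                    gam_tilde, drop_var = g, i

        if not inactive:
            gamma = 1.0                                    # step 15: go to the full OLS solution
        else:
            XI = X[:, inactive]
            a, ci = XI.T @ d, XI.T @ r
            with np.errstate(divide='ignore', invalid='ignore'):
                cand = np.concatenate([(ci - c) / (a - c), (ci + c) / (a + c)])
            cand = cand[(cand > tol) & (cand <= 1)]
            gamma = cand.min() if cand.size else 1.0       # step 17: LAR step length

        if gam_tilde < gamma:                              # steps 19-21: LASSO modification
            gamma, drop = gam_tilde, True

        b[active] = gamma * (b_ols - b[active]) + b[active]  # step 23: coefficient update
        mu = mu + gamma * d                                  # step 24: fitted vector update

        if drop:
            b[drop_var] = 0.0                              # dropped coefficient hits zero
            active.remove(drop_var)
            inactive.append(drop_var)                      # step 26: move variable back to I

        path.append(b.copy())

    return np.array(path)
```

Calling lasso_lar(X, y) then returns the coefficient vectors at the breakpoints of the path, corresponding to the series $\mathcal{B}$ output in step 30.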

The ridge and LASSO penalties are simply combined, yielding the following cost function,

$b^{\text{naive}} = \arg\min_b \|y - Xb\|^2 + \lambda\|b\|^2 + \delta\|b\|_1, \qquad (2.64)$

where the optimal coefficients $b^{\text{naive}}$ represent the solution for a particular choice of $\lambda$ and $\delta$. The name naive will be explained below.
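As a direct, if trivial, illustration, the cost in Equation 2.64 can be written down as a function; the helper below is purely hypothetical and not part of the thesis.

```python
import numpy as np

def naive_enet_cost(b, X, y, lam, delta):
    """Naive Elastic Net cost of Eq. (2.64): squared loss plus ridge and LASSO penalties."""
    resid = y - X @ b
    return resid @ resid + lam * (b @ b) + delta * np.sum(np.abs(b))
```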

Figure 2.15: LASSO results for the diabetes data set, using all 442 observations as training data. (a) The LASSO regularization path, with the coefficients plotted against $\sum_j |b_{ji}|$. (b) Prediction error along the path, measured by the $C_p$ estimate, which estimates the prediction error as if tested on an independent validation set.

The geometric interpretation of the (combined) penalty term is shown in Figure 2.16. The constraint region defined by the Elastic Net consists of a combination of the disc of ridge regression and the diamond of the LASSO. The relation between $\lambda$ and $\delta$ determines the geometry of the region. Letting $\delta = 0$ results in the ridge solution, while $\lambda = 0$ leads to a pure LASSO procedure. The Elastic Net region shown in red in Figure 2.16 has roughly $\delta = \lambda$. For $\delta > 0$, the region is singular at each corner along the axes of $b$, thus producing sparse solutions similarly to the LASSO.

Figure 2.16: The regions defined by the constraints in ridge regression (yellow), the LASSO (blue), and the Elastic Net (red), drawn in the $(b_1, b_2)$-plane. The Elastic Net region is made up of some combination of the other two.

Computation of the Elastic Net estimates is simple, given the path algorithm for the LASSO (cf. Algorithm 2.4). First, we review an alternative formulation of ridge regression. Instead of the loss + penalty formulation of Equation 2.32, ridge regression can be solved using OLS regression on an augmented data matrix and response vector,

$b^{\text{ridge}} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T \tilde{y}, \quad \text{where} \quad \tilde{X} = \begin{bmatrix} X \\ \sqrt{\lambda}\, I_p \end{bmatrix}, \quad \tilde{y} = \begin{bmatrix} y \\ 0_p \end{bmatrix}. \qquad (2.65)$

This provides another explanation of why ridge regression is able to provide an answer also in cases where $p > n$. The original $(n \times p)$ data matrix is expanded through the inclusion of $p$ synthetic observations, yielding the $((n + p) \times p)$ augmented matrix $\tilde{X}$. Obviously, this matrix has more rows ("observations") than columns (variables), and the corresponding Gram matrix can thus be inverted.
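A minimal sketch of this augmentation trick, with a check against the closed-form ridge estimate $(X^T X + \lambda I_p)^{-1} X^T y$; the helper name and the random test data are ours.

```python
import numpy as np

def ridge_via_augmentation(X, y, lam):
    """Ridge regression solved as OLS on the augmented system of Eq. (2.65)."""
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # ((n + p) x p) matrix X~
    y_aug = np.concatenate([y, np.zeros(p)])           # augmented response y~
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# agrees with the closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y
rng = np.random.default_rng(0)
X, y, lam = rng.standard_normal((50, 10)), rng.standard_normal(50), 2.5
assert np.allclose(ridge_via_augmentation(X, y, lam),
                   np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y))
```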

Using this technique, the Elastic Net can be formulated as a LASSO problem on augmented matrices,

$\arg\min_b \|\tilde{y} - \tilde{X} b\|^2 + \delta\|b\|_1. \qquad (2.66)$

For a fixed value of $\lambda$, the path corresponding to all relevant choices of $\delta$ can now be obtained using the LASSO algorithm. This does, however, mean that a suitable value of $\lambda$ must be determined beforehand. Figure 2.17 shows results on the diabetes data set for three such values of $\lambda$: $\lambda = 0.001$, $\lambda = 0.1$ and $\lambda = 1000$.

The first choice represents a lightly regularized version of the LASSO, handling cases where $p > n$. As $n > p$ in this case, the result is close to that of the pure LASSO. The second choice of $\lambda$ shows a more strongly regularized model, where coefficients tend to follow simpler paths from their point of entry. This effect is taken to an extreme for $\lambda = 1000$. Each plot has an associated prediction error plot, estimated using 15-fold cross-validation. While the first two choices of $\lambda$ produce seemingly useful models, the model corresponding to $\lambda = 1000$ performs only slightly better than the null model at its optimal value of $\delta$ (or its corresponding value of $t = \sum_j |b_{ji}|$); clearly this model is too rigid to be practical.

Figure 2.17: Results from using the Elastic Net on the diabetes data set for different values of the ridge-type regularization parameter $\lambda$. The top row shows the resulting coefficient paths, plotted against $\sum_j |b_{ji}|$, while the bottom row contains the corresponding 15-fold cross-validation error curves.

The inventors of the Elastic Net call the solution to Equation 2.64 the naive Elastic Net. In this form, the model has poor performance with regard to prediction, and is mainly useful for selecting a proper set of interesting variables. In [175], it is argued that the naive Elastic Net incurs a double amount of shrinkage. This statement is intuitively plausible; for a certain value of $\lambda$, we have the ridge regression solution at $\delta = 0$. For increasing values of $\delta$, this already shrunk estimate is shrunk further, while uninteresting variables are excluded one by one. We wish to keep the results of the variable selection, but adjust the coefficients for the shrinkage incurred by either the ridge penalty or the LASSO penalty. By multiplying the naive estimates by $1 + \lambda$, the ridge shrinkage is counteracted. More specifically,

$b^{\text{elastic}} = (1 + \lambda)\, b^{\text{naive}}. \qquad (2.67)$

All coefficients in the paths of Figure 2.17 have been adjusted accordingly. In [175], more details are given regarding the appropriateness of this transformation, and several examples on different data sets show that the resulting prediction performance is comparable to, if not better than, that of ridge regression and the LASSO.
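Putting the pieces together, a sketch of the full Elastic Net computation for a fixed $\lambda$ could look as follows, reusing the hypothetical lasso_lar sketch from Algorithm 2.4 above: build the augmented data of Equation 2.65, compute the LASSO path of Equation 2.66, and rescale by $1 + \lambda$ as in Equation 2.67.

```python
import numpy as np

def elastic_net_path(X, y, lam):
    """Sketch of the Elastic Net path for a fixed ridge parameter lam.

    Follows Eqs. (2.65)-(2.67): build the augmented data, run the LASSO path
    algorithm on it (covering all relevant values of delta), and counteract
    the ridge shrinkage by rescaling with (1 + lam).  Assumes the lasso_lar
    sketch given after Algorithm 2.4 is available.
    """
    n, p = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])   # ((n + p) x p) augmented X~
    y_aug = np.concatenate([y, np.zeros(p)])           # augmented response y~
    naive_path = lasso_lar(X_aug, y_aug)               # naive estimates, Eq. (2.66)
    return (1.0 + lam) * naive_path                    # corrected estimates, Eq. (2.67)
```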