
elements of b, known as the cardinality of b. This direct formulation results in a non-convex optimization problem that is difficult to solve. The problem is therefore relaxed by replacing the cardinality constraint with a convex one, making the computation feasible. The resulting PCs are reported to explain a larger proportion of variance than competing algorithms, but the complexity of the formulation grows quickly with the number of variables.

This article focuses on a method for computing sparse loading vectors using concepts from variable selection in regression. A method coined SCoTLASS [74] (Simplified Component Technique-LASSO) predates this method and is based on similar ideas. Maximizing the expression

$$b_i^T R b_i,$$

where $R$ denotes the covariance matrix of $X$, subject to $b_i^T b_i = 1$ and $b_j^T b_i = 0$, $j \neq i$,

renders the solution of a regular PCA. The authors propose to add the constraint

$$\|b_i\|_1 = \sum_{j=1}^{p} |b_{ij}| \leq t, \quad t \in \mathbb{R}^+, \quad \forall i.$$

The parameter $t$ controls the sparsity of the loading vectors $b_i$. The addition of this constraint was inspired by the LASSO [157] regression method described below. However, it necessitates the use of a numerical optimization method.

The problem formulation contains $p$ parameters, which is a potentially large number, and the cost function has several local minima. The authors use a simulated annealing approach for optimization, which itself adds a number of tuning parameters.

The following section presents the theory of the present method of sparse PCA, hereafter simply denoted SPCA. Section 6.3 shows results on shape data from three different data sets along with results on the general properties of SPCA.

Section 6.4 discusses the obtained results, debates the advantages and drawbacks of SPCA and proposes a range of different possibilities for ordering of modes.

Section 6.5 concludes the paper.

6.2 Methods

6.2.1 Regression Techniques

The regression methods presented here all originate from ordinary least squares (OLS) approximations. The response variable $y$ is approximated by the predictors in $X$. The coefficients for each variable (column) of $X$ are contained in $b$,

$$b_{OLS} = \arg\min_b \|y - Xb\|^2, \qquad (6.2)$$

where $\|\cdot\|$ represents the L2-norm. This is the best linear unbiased estimator given a number of assumptions, such as independent and identically distributed (i.i.d.) residuals. However, if some bias is allowed, estimators can be found with lower mean square error than OLS when tested on an unseen set of observations.

A common way of implementing this is by introducing some constraint on the coefficients in $b$. The methods described here use constraints on either the L1-norm or the L2-norm of $b$, or both. Adding the L2 constraint gives

$$b_{ridge} = \arg\min_b \|y - Xb\|^2 + \lambda\|b\|^2. \qquad (6.3)$$

This is known as ridge regression. Any positive $\lambda$ will shrink the coefficients of $b$; if $\lambda$ is chosen carefully, this may lead to improved prediction accuracy and better numerical properties. Replacing the L2-norm in the constraint with the L1-norm gives

$$b_{LASSO} = \arg\min_b \|y - Xb\|^2 + \delta\|b\|_1, \qquad (6.4)$$

where $\|b\|_1 = \sum_{i=1}^{p} |b_i|$. This method is coined LASSO [157], the least absolute shrinkage and selection operator. As the name implies, using the L1-norm not only shrinks the coefficients, but drives them one by one to exactly zero as $\delta$ grows. This implements a form of variable selection, as minor coefficients will be set to zero in a controllable fashion, while the remaining coefficients will be altered to mimic the response in the best possible way. The relation to the problem of setting small PCA loadings to zero is already evident, but some more theory is needed before this can be properly handled.
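
To illustrate this shrinkage-and-selection behaviour numerically, the sketch below fits LASSO models over a range of penalty values using scikit-learn. Note that scikit-learn's `Lasso` minimizes a rescaled objective, $\frac{1}{2n}\|y - Xb\|^2 + \alpha\|b\|_1$, so its `alpha` corresponds to $\delta$ only up to a constant factor; the data and parameter values are arbitrary illustrations, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 8
X = rng.standard_normal((n, p))
# The response depends on only three of the eight predictors, plus noise.
b_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ b_true + 0.5 * rng.standard_normal(n)

# Increasing the penalty drives more coefficients exactly to zero.
for alpha in [0.01, 0.1, 0.5, 1.0]:
    coef = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    print(f"alpha={alpha:4}: non-zero coefficients = {np.sum(coef != 0)}")
```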

LASSO has proven to be a very powerful regression and variable selection technique, but it has a few limitations. If $p > n$, i.e. there are more variables than observations, LASSO chooses a maximum of $n$ variables. If there is a group of strongly correlated predictors, LASSO tends to choose a single predictor from that group only. The elastic net regression method [175] was developed to address these shortcomings. It uses a combination of the constraints from ridge regression and LASSO,

$$b_{nEN} = \arg\min_b \|y - Xb\|^2 + \lambda\|b\|^2 + \delta\|b\|_1, \qquad (6.5)$$

where nEN is short for naive elastic net for reasons described below. The elastic net can be formulated as a LASSO problem on augmented variables,

$$b^*_{nEN} = \arg\min_{b^*} \|y^* - X^* b^*\|^2 + \frac{\delta}{\sqrt{1+\lambda}}\,\|b^*\|_1, \qquad (6.6)$$

where

$$X^*_{(n+p)\times p} = \frac{1}{\sqrt{1+\lambda}} \begin{bmatrix} X \\ \sqrt{\lambda}\, I_p \end{bmatrix}, \qquad y^*_{n+p} = \begin{bmatrix} y \\ 0_p \end{bmatrix}, \qquad b^* = \sqrt{1+\lambda}\, b.$$

The authors argue that the formulation in Equation 6.6 incurs a double amount of coefficient shrinkage, which is why the solution of Equation 6.5 is referred to as naive. The excessive shrinkage is compensated for in the final solution $b_{EN}$, which is

$$b_{EN} = \sqrt{1+\lambda}\; b^*_{nEN} = (1+\lambda)\, b_{nEN}. \qquad (6.7)$$

The resulting LASSO problem has more observations ($p+n$) than variables ($p$), which is why cases where $p > n$ are handled gracefully. If $\lambda > 0$, the elastic net constraint function $\lambda\|b\|^2 + \delta\|b\|_1$ is strictly convex. It can be shown [175] that the difference between coefficients of highly correlated variables in such a system is very small. The elastic net therefore has a tendency of grouping variables, contrary to the LASSO. These are two properties that are desirable in a PCA framework. Problems where there are more variables than observations are common, and principal components built from highly correlated and significant variables are easier to interpret.
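
As a quick numerical check of this augmentation identity (not part of the original text), the sketch below constructs the augmented $X^*$, $y^*$ and $b^*$ for arbitrary random data and verifies that the augmented residual sum of squares equals the ridge-penalized one, and that the rescaled L1 penalty is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, delta = 30, 10, 2.0, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
b = rng.standard_normal(p)          # an arbitrary coefficient vector

# Augmented data of Equation 6.6: X* is (n+p) x p, y* has p trailing zeros.
X_star = np.vstack([X, np.sqrt(lam) * np.eye(p)]) / np.sqrt(1 + lam)
y_star = np.concatenate([y, np.zeros(p)])
b_star = np.sqrt(1 + lam) * b

# ||y* - X* b*||^2 should equal ||y - X b||^2 + lambda ||b||^2 ...
lhs = np.sum((y_star - X_star @ b_star) ** 2)
rhs = np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)
assert np.isclose(lhs, rhs)

# ... and delta / sqrt(1 + lambda) * ||b*||_1 should equal delta * ||b||_1.
assert np.isclose(delta / np.sqrt(1 + lam) * np.abs(b_star).sum(),
                  delta * np.abs(b).sum())
print("Augmented-variable identities hold.")
```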

Ordinary least squares and ridge regression have closed-form solutions; that is, $b_{OLS}$ and $b_{ridge}$ can be expressed as simple functions of $X$, $y$ and $\lambda$. This is not true for the LASSO and elastic net methods. For many years, LASSO solutions were found using standard optimization techniques, which made for long computation times. In 2002, Efron et al. [40] published a report on a new regression method called least angle regression (LARS). The terminal S in LARS refers to its close relation to stagewise regression and LASSO. Although conceptually different, the method is shown to be very similar to LASSO, and through a small modification, the exact LASSO solution can be computed. The method is built on a powerful geometric framework, through which a computationally thrifty algorithm is conceived. The algorithm starts with all coefficients at zero and successively adds predictors until all variables are active and the ordinary least squares solution is reached. In other words, LARS returns the solutions for all possible values of $\delta$. What remains is to pick a suitable solution, i.e. a proper value of $\delta$. This can for instance be done using cross-validation or prior knowledge of the desired number of non-zero coefficients. In the elastic net setting, LARS returns the solutions corresponding to all possible values of $\delta$ given a value of $\lambda$. Zou and Hastie [175] describe a further development of the LARS algorithm tailor-made to suit the elastic net framework. This extension is called LARS-EN.
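
In practice, the full LASSO path can be computed with scikit-learn's LARS implementation; the sketch below is a minimal illustration on random data, and the penalty values it returns are scaled differently from $\delta$ above.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(2)
n, p = 40, 6
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.5]) + 0.3 * rng.standard_normal(n)

# method="lasso" applies the small modification of LARS that yields the exact LASSO path.
alphas, active, coefs = lars_path(X, y, method="lasso")

# coefs has one column per breakpoint of the piecewise-linear path;
# variables enter (become non-zero) one at a time as the penalty decreases.
for a, c in zip(alphas, coefs.T):
    print(f"penalty={a:8.4f}  non-zero={int(np.sum(c != 0))}")
```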

In summary, a regression approach has been presented through which a relevant subset of variables can be selected, which handles the case of more variables than observations gracefully, and which can be computed efficiently. We now turn to the problem of calculating sparse PCs. Note that "sparse PCs" refers to principal components formed by linear combinations of sparse sets of variables.

6.2.2 Sparse Principal Component Analysis (SPCA)

The simplest approach to SPCA using regression is to treat each principal component as a response vector and regress it on the $p$ variables. Denoting the $i$th PC and loading vector by $z_i$ and $b_i$ respectively, and inserting this into the elastic net framework gives

$$\hat{b}_i = \arg\min_{b_i} \|z_i - X b_i\|^2 + \lambda\|b_i\|^2 + \delta\|b_i\|_1. \qquad (6.8)$$

The principal component $z_i$ is calculated using regular PCA. The regression procedure will calculate a loading vector $b_i$ such that the resulting PC is close to $z_i$ while being sparse. The weakness of this approach is that all solutions are constrained to the immediate vicinity of a regular PCA. A better approach would be to approximate the properties of PCA, rather than its exact results.
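
A minimal sketch of this simple regression approach, using scikit-learn's `ElasticNet` as a stand-in for the naive elastic net (its `alpha`/`l1_ratio` parameterization corresponds to $\lambda$ and $\delta$ only up to rescaling; data and settings are arbitrary):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 12))
X -= X.mean(axis=0)                      # mean-centre the variables

# Ordinary PCA via the SVD: rows of Vt are loading vectors, Z = X V are scores.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
z1 = X @ Vt[0]                           # first principal component (response)

# Regress the first PC on X with an elastic net penalty to obtain a sparse loading.
enet = ElasticNet(alpha=0.1, l1_ratio=0.9, fit_intercept=False).fit(X, z1)
b1 = enet.coef_ / np.linalg.norm(enet.coef_)   # renormalize the sparse loading

print("non-zero loadings:", np.sum(b1 != 0), "of", X.shape[1])
print("correlation with the ordinary PC:", np.corrcoef(X @ b1, z1)[0, 1])
```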

Specifically, the loading matrix $B$ should be near orthogonal, and the correlations between the PCs of the scores matrix $Z$ should be kept low. Zou and Hastie propose a problem formulation called the SPCA criterion [177] to address this:

$$(\hat{A}, \hat{B}) = \arg\min_{A,B} \; \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|b_j\|^2 + \sum_{j=1}^{k} \delta_j \|b_j\|_1$$

$$\text{subject to} \quad A^T A = I_k. \qquad (6.9)$$

To clarify this expression, it will be broken down into components. First, $B^T x_i$ takes the variables of observation $i$ and projects them onto the principal axes (loading vectors) of $B$. Note that $x_i$ denotes the $i$th column of $X^T$. Only $k$ PCs are retained, meaning that some information is lost in this transformation. Further, $A B^T x_i$ takes the scores $B^T x_i$ and transforms them back into the original space. The orthogonality constraint on $A$ makes sure $B$ is near orthogonal. The whole term $\sum_{i=1}^{n} \|x_i - A B^T x_i\|^2$ measures the reconstruction error.

The remaining constraints are the same as for elastic net regression, driving the columns of $B$ towards sparsity. The constraint weight $\lambda$ must be chosen beforehand and has the same value for all PCs, while $\delta_j$ may be set to a different value for each PC, offering good flexibility. It can be shown [177] that for $\delta_j = 0 \;\forall j$, the SPCA criterion is minimized by setting $A$ and $B$ equal to the loading matrix of ordinary PCA. Hence, the solutions of the present formulation of SPCA conveniently range from ordinary PCA on one end to the (maximally sparse) zero matrix on the other.

Equation 6.9 resembles the elastic net formulation, but there is a significant difference. Instead of estimating a single coefficient vector, this problem has two matrices of coefficients, $A$ and $B$. A reasonably efficient optimization method for minimizing the SPCA criterion is presented in [177]. First, assume that $A$ is known. By expanding and rearranging Equation 6.9, it is shown that $B$ can be estimated by solving $k$ independent naive elastic net problems, one for each column of $B$. Referring to Equation 6.5, the data matrix is $X$ as usual, while $y = X a_i$ for the $i$th loading vector. On the other hand, if $B$ is known, $A$ can be calculated using a singular value decomposition; if $X^T X B = U D V^T$, then $A = U V^T$. Since both matrices are unknown, an initial guess is made and $A$ and $B$ are estimated alternately until convergence. Zou et al. [177] suggest initializing $A$ to the loadings of the $k$ first ordinary principal components.
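
The alternating scheme can be sketched as follows. This is not the authors' LARS-EN-based implementation, but a simplified stand-in that solves the $k$ naive elastic net subproblems with scikit-learn's `ElasticNet` (again, its penalty parameterization differs from $\lambda$ and $\delta_j$ by constant factors) and omits the convergence checks of [177]; function name and settings are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def spca_alternating(X, k, alpha=0.1, l1_ratio=0.9, n_iter=50):
    """Alternate between the elastic net step for B and the SVD step for A."""
    X = X - X.mean(axis=0)
    # Initialize A with the loadings of the k first ordinary PCs.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                   # p x k, orthonormal columns
    B = np.zeros_like(A)
    for _ in range(n_iter):
        # Given A: one naive elastic net problem per column of B, response X a_j.
        for j in range(k):
            enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                              fit_intercept=False, max_iter=5000)
            B[:, j] = enet.fit(X, X @ A[:, j]).coef_
        # Given B: A from the SVD of X^T X B, i.e. A = U V^T.
        U, _, Vt2 = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U @ Vt2
    # Normalize the non-zero loading vectors to unit length.
    norms = np.linalg.norm(B, axis=0)
    B[:, norms > 0] /= norms[norms > 0]
    return B

# Example: sparse loadings for 3 components of a random 50 x 10 data matrix.
B = spca_alternating(np.random.default_rng(4).standard_normal((50, 10)), k=3)
print(np.round(B, 2))
```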

6.2.3 Ordering of principal components

One goal of PCA is to recover latent variables that are as descriptive as possible. This is done by maximizing the variance of each PC subject to being orthogonal to higher order PCs. The performance of PCA methods is commonly measured by the amount of variance explained by each PC, and the total amount of variance for $k$ modes. Regular PCA is the only linear transformation that produces both orthogonal loadings and uncorrelated scores [72]. For methods that produce correlated scores, variances cannot be calculated directly, as some of the variance explained by one PC will be present in others. This calls for a fair evaluation method. Several such methods are presented in [50], where it is concluded that the most powerful method is to measure adjusted variance, a term used by Zou and Hastie who suggest the same method in [177]. The idea is that the variance of each PC should be adjusted for the variance already explained by higher order components. For mean centered variables, such as those derived by PCA, correlation is equivalent to the cosine of the angle between vectors. Zero correlation corresponds to a 90° angle between vectors while fully correlated variables are parallel. Adjustment of a PC therefore amounts to a transformation such that the resulting vector is at right angles with all higher order PCs. This is also known as Gram-Schmidt orthogonalization.
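
A short numerical illustration of this correlation-cosine equivalence for mean-centred vectors (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
a = rng.standard_normal(200)
b = rng.standard_normal(200)
a -= a.mean()                      # mean-centre both vectors
b -= b.mean()

corr = np.corrcoef(a, b)[0, 1]
cos_angle = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(corr, cos_angle))   # True: correlation equals the cosine of the angle
```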

The variance of the $j$th PC is proportional to its squared length, $\mathrm{var}(z_j) = z_j^T z_j / n \propto z_j^T z_j$. Any ordering that maximizes the total variance therefore also maximizes the sum of squared lengths. For ease of notation, squared lengths are considered in the following equations.

A vector $z_j$ may be orthogonalized, or adjusted, with respect to another vector $z$ using orthogonal projection,

$$\hat{z}_j = z_j - z(z^T z)^{-1} z^T z_j.$$

The $j$th PC should be adjusted for all higher order PCs. Assume that the variables, the columns of $Z$, have been sorted in order of decreasing variance. The adjustment can then be carried out for all higher order PCs simultaneously by

$$\hat{z}_j = z_j - Z_{(j-1)}\left(Z_{(j-1)}^T Z_{(j-1)}\right)^{-1} Z_{(j-1)}^T z_j,$$

where $Z_{(j)} = [z_1 \ldots z_j]$ [50].
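
Adjusted variances can be computed directly from these projections; equivalently, if $Z = QR$ is a QR decomposition of the (mean-centred) scores, the squared diagonal of $R$ gives the squared lengths of the adjusted columns. The sketch below, on an arbitrary correlated scores matrix, computes both and checks that they agree.

```python
import numpy as np

def adjusted_variances(Z):
    """Variance of each score column after adjusting for all preceding columns."""
    n = Z.shape[0]
    Z = Z - Z.mean(axis=0)
    adj = np.empty(Z.shape[1])
    for j in range(Z.shape[1]):
        zj, Zprev = Z[:, j], Z[:, :j]
        if j > 0:
            # Remove the part of z_j explained by the higher order PCs (projection).
            zj = zj - Zprev @ np.linalg.lstsq(Zprev, zj, rcond=None)[0]
        adj[j] = zj @ zj / n
    return adj

rng = np.random.default_rng(6)
Z = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))  # correlated scores

adj = adjusted_variances(Z)
# Shortcut: the squared diagonal of R in Z = QR gives n times the adjusted variances.
R = np.linalg.qr(Z - Z.mean(axis=0))[1]
print(np.allclose(adj, np.diag(R) ** 2 / Z.shape[0]))   # True
```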

The SPCA criterion (6.9) keeps the loading matrix near orthogonal by forcing $A$ to be orthogonal, but does nothing to encourage uncorrelated scores. This makes an orthogonalization process central to SPCA. Zou and Hastie argue that the order of the components is not an issue; the order is left unaltered, making it possible for lower order modes to explain more variance than higher order modes. Furthermore, the amount of total adjusted variance is dependent on the ordering of the PCs, and may not be maximal in this case.

Formally, the variable ordering that maximizes the total variance can be established by maximizing $\sum_j \hat{z}_j^T \hat{z}_j$ while allowing for permutations,

$$\arg\max_{P \in \mathcal{P}_k} \; \tilde{z}_1^T \tilde{z}_1 + \sum_{j=2}^{k} \left[ \tilde{z}_j^T \tilde{z}_j - \tilde{z}_j^T \tilde{Z}_{(j-1)} \left( \tilde{Z}_{(j-1)}^T \tilde{Z}_{(j-1)} \right)^{-1} \tilde{Z}_{(j-1)}^T \tilde{z}_j \right], \qquad (6.10)$$

where $\tilde{Z} = ZP$ is the permuted scores matrix and $\mathcal{P}_k$ is the set of permutation matrices of size $k$. Note that the supremum of Equation 6.10 is the sum of unadjusted variances, which is attained iff $Z$ is orthogonal. The simplest way of finding the optimal permutation is to try all $k!$ possible permutations, which is feasible for a low number of PCs. This paper proposes a forward selection-type rule for picking an ordering with two properties: the variance of a PC is less than or equal to the variances of higher order PCs, and the expression in Equation 6.10 is maximized in most cases. The rule is simple.

Treat one PC at a time. At each step, choose the PC with the largest (adjusted) variance, and adjust the scores matrix for this PC. This means that in the first step, we calculate the variances of all (unadjusted) PCs and choose the one with the greatest variance. All PCs ($Z$) are then adjusted with respect to the chosen PC ($z_j$) using

$$\hat{Z} = Z - z_j(z_j^T z_j)^{-1} z_j^T Z. \qquad (6.11)$$

In the second step, the adjusted variances from the first step are considered. Again, the PC with the greatest variance is chosen, and all PCs are updated using Equation 6.11. This process is repeated for all PCs. This eventually reduces $Z$ to a zero matrix, but by then a sensible ordering has been established. This ordering is finally applied to the loading vectors and the original scores matrix.
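
A minimal sketch of this ordering rule in plain NumPy (assuming mean-centred scores; the function name and data are illustrative only):

```python
import numpy as np

def order_components(Z):
    """Greedy ordering rule: repeatedly pick the column with the largest
    (adjusted) variance, then adjust all columns for the chosen one (Eq. 6.11)."""
    Z = Z.astype(float)
    n, k = Z.shape
    remaining = list(range(k))
    order, adj_var = [], []
    while remaining:
        j = max(remaining, key=lambda c: Z[:, c] @ Z[:, c])
        order.append(j)
        adj_var.append(Z[:, j] @ Z[:, j] / n)
        remaining.remove(j)
        zj = Z[:, j].copy()
        if zj @ zj > 0:
            Z = Z - np.outer(zj, zj @ Z) / (zj @ zj)   # Equation 6.11
    return order, adj_var

rng = np.random.default_rng(7)
scores = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))  # correlated
order, adj_var = order_components(scores - scores.mean(axis=0))
print("ordering:", order)
print("adjusted variances (non-increasing):", np.round(adj_var, 2))
```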

The first property of this rule, which states that variances are decreasing, is easily realized since the longest vector is chosen in each step and since the squared length cannot grow as the vector is adjusted for some other vector.

The second property of maximal total variance is empirically shown below to be fulfilled to a large extent, but as shall be seen, there are counter examples, e.g. $Z = [\,[0\ \ 1.5]^T\ \ [1\ \ 1]^T\ \ [1\ -1]^T\,]$.