
(1)

Coefficient Path Algorithms

Karl Sjöstrand

Informatics and Mathematical Modelling, DTU

(2)

What’s This Lecture About?

• The focus is on computation rather than methods:
  Efficiency
  Algorithms provide insight

(3)

Loss Functions

• We wish to model a random variable Y by a function f(X) of a set of other random variables X.

• To measure how far our model is from Y, we define a loss function L(Y, f(X)).

(4)

Loss Function Example

Let Y be a vector y of n outcome observations.

Let X be an (n × p) matrix X whose p columns are predictor variables.

Use squared error loss L(y, f(X)) = ||y − f(X)||²₂.

Let f(X) be a linear model with coefficients β: f(X) = Xβ.

The loss function is then

$$L(y, f(X)) = (y - X\beta)^T (y - X\beta) = \|y - X\beta\|_2^2$$

The minimizer is the familiar OLS solution

$$\hat{\beta} = \arg\min_\beta \|y - X\beta\|_2^2 = (X^T X)^{-1} X^T y$$
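A minimal numpy sketch of this solution on simulated data (the design matrix, response, and true coefficients below are made up for illustration); `np.linalg.solve` is used instead of forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # n = 100 observations, p = 5 predictors
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 3.0]) + rng.normal(size=100)

# OLS solution (X^T X)^{-1} X^T y, computed via a linear solve
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```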

(5)

Adding a Penalty Function

• We get different results if we consider a penalty function J(β) along with the loss function

• The parameter λ defines the amount of penalty

$$\hat{\beta}(\lambda) = \arg\min_\beta \, L(y, f(X)) + \lambda J(\beta)$$

(6)

Virtues of the Penalty Function

• Imposes structure on the model, helping to avoid:
  Computational difficulties
  Unstable estimates
  Non-invertible matrices

• To reflect prior knowledge

• To perform variable selection
  Sparse solutions are easier to interpret

(7)

Selecting a Suitable Model

• We must evaluate models for many different values of λ

For instance when doing cross-validation:

For each training and test set, evaluate β̂(λ) for a suitable set of values of λ.

Each evaluation of β̂(λ) may be expensive.

(8)

Topic of this Lecture

• Algorithms for estimating

$$\hat{\beta}(\lambda) = \arg\min_\beta \, L(y, f(X)) + \lambda J(\beta)$$

for all values of the parameter λ.

• Plotting the vector β̂(λ) with respect to λ yields a coefficient path.

(9)

Example Path – Ridge Regression

• Regression – quadratic loss, quadratic penalty

$$\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$$
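Since the ridge path is smooth rather than piecewise linear (see slide 21), it is typically evaluated on a grid of λ values. A minimal numpy sketch using the closed-form solution, with X, y, and the grid assumed as inputs:

```python
import numpy as np

def ridge_path(X, y, lambdas):
    # Closed-form ridge solution (X^T X + lambda I)^{-1} X^T y on a grid;
    # returns a (p, len(lambdas)) array of coefficient vectors.
    p = X.shape[1]
    G, b = X.T @ X, X.T @ y
    return np.column_stack(
        [np.linalg.solve(G + lam * np.eye(p), b) for lam in lambdas]
    )
```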

(10)

Example Path – LASSO

• Regression – quadratic loss, piecewise linear penalty

$$\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$
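For reference, scikit-learn's `lars_path` traces this piecewise linear path exactly; note that scikit-learn scales the penalty by the number of samples, so its `alphas` differ from the λ above by a constant factor. A sketch, with X and y as before:

```python
from sklearn.linear_model import lars_path

# method="lasso" returns the exact piecewise linear LASSO path:
# the knot locations in `alphas` and the coefficients at each knot in `coefs`.
alphas, active, coefs = lars_path(X, y, method="lasso")
```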

(11)

Example Path – Support Vector Machine

• Classification – details on loss and penalty later

(12)

Example Path – Penalized Logistic Regression

• Classification – non-linear loss, piecewise linear penalty

$$\hat{\beta}(\lambda) = \arg\min_\beta \sum_{i=1}^{n} \log\left(1 + \exp\{-y_i x_i^T \beta\}\right) + \lambda\|\beta\|_1$$

Image from Rosset, NIPS 2004
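Because the loss is non-quadratic, this path is curved rather than piecewise linear (see Rosset 2004 and Park and Hastie 2006 in the references). A common workaround is to solve on a grid of penalty values; a sketch using scikit-learn, where X is an (n, p) feature matrix and y holds binary labels (both assumed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

Cs = np.logspace(-2, 2, 50)                 # C is the *inverse* penalty strength
coefs = []
for C in Cs:
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    coefs.append(clf.fit(X, y).coef_.ravel())
# Plotting `coefs` against Cs gives an approximate coefficient path.
```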

(13)

Path Properties

(14)

Piecewise Linear Paths

• What is required from the loss and penalty functions for piecewise linearity?

• One condition is that ∂β̂(λ)/∂λ is a piecewise constant vector in λ.

(15)

Condition for Piecewise Linearity

[Figure: a coefficient path β̂(λ) (top) and its derivative ∂β̂(λ)/∂λ (bottom), both plotted against ||β̂(λ)||₁; the derivative is piecewise constant.]

(16)

Tracing the Entire Path

• From a starting point along the path (e.g. λ = ∞), we can easily create the entire path if:
  the direction ∂β̂(λ)/∂λ is known
  the knots where ∂β̂(λ)/∂λ changes can be worked out

(17)

The Piecewise Linear Condition

$$\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda} = -\left[\nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda))\right]^{-1} \nabla J(\hat{\beta}(\lambda))$$
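As a sanity check, for ridge regression (where ∇²L = 2XᵀX, ∇²J = 2I, and ∇J = 2β) this expression can be compared against a finite-difference derivative of the closed-form solution; a small sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
G, b = X.T @ X, X.T @ y
beta = lambda lam: np.linalg.solve(G + lam * np.eye(5), b)   # ridge solution

lam, eps = 1.0, 1e-6
# Path derivative formula with grad^2 L = 2 X^T X, grad^2 J = 2I, grad J = 2 beta:
deriv = -np.linalg.solve(2 * G + 2 * lam * np.eye(5), 2 * beta(lam))
fd = (beta(lam + eps) - beta(lam - eps)) / (2 * eps)          # finite differences
print(np.allclose(deriv, fd, atol=1e-4))                      # expected: True
```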

(18)

Sufficient and Necessary Condition

• A sufficient and necessary condition for linearity of β̂(λ) at λ₀: the expression

$$\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda} = -\left[\nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda))\right]^{-1} \nabla J(\hat{\beta}(\lambda))$$

is a constant vector with respect to λ in a neighborhood of λ₀.

(19)

A Stronger Sufficient Condition

• ...but not a necessary condition:

• The loss is a piecewise quadratic function of β, so ∇²L(β̂(λ)) is piecewise constant

• The penalty is a piecewise linear function of β, so ∇²J(β̂(λ)) disappears and ∇J(β̂(λ)) is piecewise constant

$$\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda} = -\Big[\underbrace{\nabla^2 L(\hat{\beta}(\lambda))}_{\text{constant}} + \underbrace{\lambda \nabla^2 J(\hat{\beta}(\lambda))}_{\text{disappears}}\Big]^{-1} \underbrace{\nabla J(\hat{\beta}(\lambda))}_{\text{constant}}$$

(20)

Implications of this Condition

• Loss functions may be:
  Quadratic (standard squared error loss)
  Piecewise quadratic
  Piecewise linear (a variant of piecewise quadratic)

• Penalty functions may be:
  Linear (SVM "penalty")
  Piecewise linear (L₁ and L∞)

(21)

Condition Applied – Examples

• Ridge regression
  Quadratic loss – OK
  Quadratic penalty – not OK

• LASSO
  Quadratic loss – OK
  Piecewise linear penalty – OK

(22)

When do Directions Change?

Directions are only valid where L and J are differentiable.

LASSO: L is differentiable everywhere; J is not at βⱼ = 0.

Directions change when some βⱼ touches 0: variables either become 0 or leave 0.

Denote the set of non-zero variables A and the set of zero variables I.

(23)

An algorithm for the LASSO

• Quadratic loss, piecewise linear penalty

• We now know it has a piecewise linear path!

• Let’s see if we can work out the directions and knots

$$\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

(24)

Reformulating the LASSO

Split each coefficient into its positive and negative parts, βⱼ = βⱼ⁺ − βⱼ⁻ with βⱼ⁺, βⱼ⁻ ≥ 0, so that |βⱼ| = βⱼ⁺ + βⱼ⁻. Then

$$\hat{\beta}(\lambda) = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda\|\beta\|_1$$

becomes

$$(\hat{\beta}^+, \hat{\beta}^-) = \arg\min_{\beta^+, \beta^-} \|y - X(\beta^+ - \beta^-)\|_2^2 + \lambda \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) \quad \text{subject to } \beta_j^+ \geq 0,\ \beta_j^- \geq 0 \ \forall j$$
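A quick way to check this reformulation is to hand it to a generic bound-constrained optimizer and recover β = β⁺ − β⁻; a sketch on made-up data (λ and the problem size are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.normal(size=60)
lam = 5.0

def objective(z):
    # z stacks beta+ and beta-; both halves are constrained non-negative.
    bp, bm = z[:4], z[4:]
    r = y - X @ (bp - bm)
    return r @ r + lam * (bp.sum() + bm.sum())

res = minimize(objective, np.zeros(8), bounds=[(0, None)] * 8)
beta = res.x[:4] - res.x[4:]            # recover beta = beta+ - beta-
```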

(25)

Useful Conditions

• Lagrange primal function

$$L_P = \|y - X(\beta^+ - \beta^-)\|_2^2 + \lambda \sum_{j=1}^{p} (\beta_j^+ + \beta_j^-) - \sum_{j=1}^{p} \lambda_j^+ \beta_j^+ - \sum_{j=1}^{p} \lambda_j^- \beta_j^-$$

with multipliers constrained by λⱼ⁺ ≥ 0, λⱼ⁻ ≥ 0.

• KKT conditions

$$\nabla L(\hat{\beta})_j + \lambda - \lambda_j^+ = 0, \qquad -\nabla L(\hat{\beta})_j + \lambda - \lambda_j^- = 0$$

$$\lambda_j^+ \hat{\beta}_j^+ = 0, \qquad \lambda_j^- \hat{\beta}_j^- = 0$$

(26)

LASSO Algorithm Properties

• Coefficients are nonzero only if |∇L(β̂(λ))ⱼ| = λ, i.e. for j ∈ A

• For zero variables, |∇L(β̂(λ))ⱼ| ≤ λ, i.e. for j ∈ I
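These conditions are easy to verify numerically at a fitted LASSO solution. A sketch using scikit-learn's Lasso on made-up data; note that scikit-learn minimizes ||y − Xβ||²/(2n) + α||β||₁, so the boundary value in its scaling is α = λ/(2n):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)
n, alpha = len(y), 0.1

beta = Lasso(alpha=alpha, fit_intercept=False, tol=1e-12).fit(X, y).coef_
g = np.abs(X.T @ (y - X @ beta)) / n          # |gradient| in sklearn's scaling
print(np.allclose(g[beta != 0], alpha))       # active variables: equality
print(np.all(g[beta == 0] <= alpha + 1e-8))   # zero variables: inequality
```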

(27)

Working out the Knots (1)

• First case: a variable becomes zero (moves from A to I)

• Assume we know the current β̂ and the direction ∂β̂/∂λ

• The coefficient of an active variable j reaches zero after a distance d where β̂ⱼ + d · ∂β̂ⱼ/∂λ = 0, so the first such event occurs at

$$d = \min_{j \in A} \left( -\hat{\beta}_j \Big/ \frac{\partial \hat{\beta}_j}{\partial \lambda} \right), \quad \text{taken over positive values}$$

(28)

Working out the Knots (2)

• Second case: a variable becomes non-zero (moves from I to A)

• For inactive variables, |∇L(β̂(λ))ⱼ| changes with λ.

[Figure: |∇L(β̂(λ))ⱼ| versus λ for each variable along the algorithm direction; the curve of the second added variable meets the boundary where it becomes active.]

(29)

Working out the Knots (3)

• For some scalar d, |∇L(β̂(λ))ⱼ| will reach the boundary λ. This is where variable j becomes active!

• Solve for d, with i ∈ A any active variable and ∂β̂/∂λ the current direction:

$$d = \min_{j \in I} \min\left( \frac{(x_i - x_j)^T (y - X\hat{\beta})}{(x_i - x_j)^T X \, \partial\hat{\beta}/\partial\lambda},\ \frac{(x_i + x_j)^T (y - X\hat{\beta})}{(x_i + x_j)^T X \, \partial\hat{\beta}/\partial\lambda} \right)$$

where the minimum is taken over positive values.

(30)

Path Directions

• Directions for non-zero variables

$$\frac{\partial \hat{\beta}_A(\lambda)}{\partial \lambda} = -\left[\nabla^2 L(\hat{\beta}_A(\lambda))\right]^{-1} \nabla J(\hat{\beta}_A(\lambda)) = -(2 X_A^T X_A)^{-1} \operatorname{sgn}(\hat{\beta}_A(\lambda))$$

(31)

The Algorithm

while I is not empty
    Work out the minimal distance d at which a variable is either added or dropped
    Update the sets A and I
    Update β̂ = β̂ + d · ∂β̂/∂λ
    Calculate new directions
end
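Putting slides 27–30 together, below is a compact numpy sketch of this algorithm for the objective ||y − Xβ||²₂ + λ||β||₁, traced for decreasing λ. It is illustrative only: it assumes the active submatrices stay full rank and omits safeguards (ties, an emptied active set):

```python
import numpy as np

def lasso_path(X, y, lam_min=1e-8, max_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    c = 2 * X.T @ y                      # c_j = 2 x_j^T (y - X beta), beta = 0
    lam = np.abs(c).max()                # above this lambda all coefficients are 0
    A = [int(np.abs(c).argmax())]        # active set; I is its complement
    knots = [(lam, beta.copy())]
    for _ in range(max_iter):
        XA = X[:, A]
        # Direction (slide 30); u is the change in beta_A per unit decrease in lambda
        u = np.linalg.solve(2 * XA.T @ XA, np.sign(c[A]))
        a = 2 * X.T @ (XA @ u)           # change in each c_j per unit decrease
        # Case 1 (slide 27): an active coefficient hits zero at distance d1
        d1 = np.array([-beta[j] / u[k] if abs(u[k]) > 1e-12 else np.inf
                       for k, j in enumerate(A)])
        d1[d1 <= 1e-12] = np.inf
        # Case 2 (slides 28-29): an inactive |c_j| catches the boundary lam - d
        d2 = np.full(p, np.inf)
        for j in set(range(p)) - set(A):
            for cand in ((lam - c[j]) / (1 - a[j]), (lam + c[j]) / (1 + a[j])):
                if 1e-12 < cand < d2[j]:
                    d2[j] = cand
        d = min(d1.min(), d2.min(), lam - lam_min)
        beta[A] += d * u                 # move along the piecewise linear segment
        lam -= d
        c = 2 * X.T @ (y - X @ beta)
        knots.append((lam, beta.copy()))
        if lam <= lam_min + 1e-12:
            return knots
        if d1.min() < d2.min():          # a variable reached zero: drop it
            j = A[int(d1.argmin())]
            beta[j] = 0.0
            A.remove(j)
        else:                            # a variable reached the boundary: add it
            A.append(int(d2.argmin()))
    return knots
```

Each pair in `knots` is (λ, β̂(λ)); the path between consecutive knots is linear, so plotting the stored vectors against λ reproduces figures like those on slides 9–10.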

(32)

Variants – Huberized LASSO

• Use a piecewise quadratic loss which is less sensitive to outliers

(33)

Huberized LASSO

• The same path algorithm applies

With a minor change due to the piecewise quadratic loss
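For concreteness, a common parameterization of the Huber loss (the threshold δ and exact form below are one standard choice, not necessarily the lecture's): quadratic for small residuals, linear beyond δ, hence piecewise quadratic overall:

```python
import numpy as np

def huber_loss(r, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond; values and first derivatives
    # match at |r| = delta, so the loss stays piecewise quadratic in beta.
    r = np.abs(r)
    return np.where(r <= delta, r**2, delta * (2 * r - delta))
```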

(34)

Variants – SVM

• Dual SVM formulation
  Quadratic "loss"
  Linear "penalty"

$$L_D:\quad \hat{\alpha} = \arg\max_\alpha \mathbf{1}^T \alpha - \tfrac{1}{2}\,\alpha^T Y X X^T Y \alpha \quad \text{subject to } 0 \leq \alpha_i \leq 1$$

(35)

A few Methods with Piecewise Linear Paths

• Least Angle Regression

• LASSO (+variants)

• Forward Stagewise Regression

• Elastic Net

• The Non-Negative Garrote

• Support Vector Machines (L1 and L2)

• Support Vector Domain Description

• Locally Adaptive Regression Splines

(36)

References

Rosset and Zhu (2004), Piecewise Linear Regularized Solution Paths

Efron et al. (2003), Least Angle Regression

Hastie et al. (2004), The Entire Regularization Path for the SVM

Zhu, Rosset et al. (2003), 1-norm Support Vector Machines

Rosset (2004), Tracking Curved Regularized Solution Paths

Park and Hastie (2006), An L1-regularization Path Algorithm for Generalized Linear Models

Friedman et al. (2008), Regularized Paths for Generalized Linear Models via Coordinate Descent

(37)

Conclusion

• We have defined conditions that help identify problems with piecewise linear paths

...and shown that efficient algorithms exist

• Having access to solutions for all values of the regularization parameter is important when selecting a suitable model

(38)

• Questions?

• Later questions:

Karl.Sjostrand@gmail.com or Karl.Sjostrand@EXINI.com
