Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time

Martin Møller

Computer Science Department, Aarhus University
DK-8000 Århus, Denmark

Abstract

Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, it is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvements of existing learning techniques.

1 Introduction

The second derivative information of the error function associated with feed-forward neural networks forms an N × N matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient algorithms [Møller 93a], and in recent network pruning techniques [MacKay 91], [Hassibi 92].

(This work was done while the author was visiting the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.)

Several researchers have recently derived formulae for exact calculation of the elements in the Hessian matrix [Buntine 91], [Bishop 92]. In the general case, exact calculation of the Hessian matrix has O(N^2) time and memory requirements. For that reason it is often not worthwhile to calculate the Hessian matrix explicitly, and approximations are often made, as described in [Buntine 91]. The second order information is not always needed in the form of the Hessian matrix itself, which makes it possible to reduce the time and memory requirements needed to obtain this information. The scaled conjugate gradient algorithm [Møller 93a] and a training algorithm recently proposed by Le Cun involving estimation of eigenvalues of the Hessian matrix [Le Cun 93] are good examples of this. The second order information needed here is always in the form of the Hessian matrix times a vector. In both methods the product of the Hessian and the vector is usually approximated by a one-sided difference equation. This is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used.
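The one-sided difference approximation referred to above is simple to state in code. The following is a minimal sketch in Python, assuming a user-supplied routine grad(w) that returns the gradient of the error function; the function names are illustrative and not part of the original text.

    import numpy as np

    def hessian_vector_fd(grad, w, d, sigma=1e-6):
        # One-sided difference approximation of H(w) d.
        # Its relative error grows as sigma is lowered, which is the
        # numerical instability that the exact algorithm avoids.
        return (grad(w + sigma * d) - grad(w)) / sigma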

It is possible to calculate the Hessian matrix times a vector exactly without explicitly having to calculate and store the Hessian matrix itself.

Through straightforward analytic evaluations we give explicit formulae for the Hessian matrix times a vector. We prove these formulae and give an algorithm that calculates the product. This algorithm has O(N) time and memory requirements, which is of the same order as the calculation of the gradient of the error function. The algorithm is a generalized version of an algorithm outlined by Yoshida, which was derived by applying an automatic differentiation technique [Yoshida 91]. The automatic differentiation technique is an indirect method of obtaining derivative information and provides no analytic expressions for the derivatives [Dixon 89]. Yoshida's algorithm is only valid for feed-forward networks with connections between adjacent layers; our algorithm works for feed-forward networks with arbitrary connectivity.

The usefulness of the algorithm is demonstrated by discussing possible improvements of existing learning techniques. We focus here on improvements of the scaled conjugate gradient algorithm and on estimation of eigenvalues of the Hessian matrix.


2 Notation

The networks we consider are multilayered feed-forward neural networks with arbitrary connectivity. The network \mathcal{N} consists of nodes n_{lm} arranged in layers l = 0, ..., L. The number of nodes in layer l is denoted N_l. In order to handle the arbitrary connectivity we define for each node n_{lm} a set of source nodes and a set of target nodes:

S_{lm} = { n_{rs} \in \mathcal{N} | there is a connection from n_{rs} to n_{lm}, r < l, 1 \le s \le N_r },   (1)
T_{lm} = { n_{rs} \in \mathcal{N} | there is a connection from n_{lm} to n_{rs}, r > l, 1 \le s \le N_r }.

The training set associated with network \mathcal{N} is

{ (u^0_{ps}, s = 1, ..., N_0; t_{pj}, j = 1, ..., N_L), p = 1, ..., P }.   (2)

The output from a node n_{lm} when a pattern p is propagated through the network is

u^l_{pm} = f(v^l_{pm}),  where  v^l_{pm} = \sum_{n_{rs} \in S_{lm}} w^{lr}_{ms} u^r_{ps} + w^l_m,   (3)

and w^{lr}_{ms} is the weight from node n_{rs} to node n_{lm}, and w^l_m is the usual bias of node n_{lm}. f(v^l_{pm}) is an appropriate activation function, e.g., the sigmoid. The net input v^l_{pm} is chosen to be the usual weighted linear summation of inputs; the calculations to be made could, however, easily be extended to other definitions of v^l_{pm}. Let an error function E(w) be

E(w) = \sum_{p=1}^{P} E_p(u^L_{p1}, ..., u^L_{pN_L}; t_{p1}, ..., t_{pN_L}),   (4)

where w is a vector containing all weights and biases in the network, and E_p is some appropriate error measure associated with pattern p from the training set.

Based on the chain rule we define some basic recursive formulae for calculating first derivative information. These formulae are used frequently in the next section. The formulae, based on backward propagation, are

\frac{\partial v^h_{pi}}{\partial v^l_{pm}} = \sum_{n_{rs} \in T_{lm}} \frac{\partial v^h_{pi}}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial v^h_{pi}}{\partial v^r_{ps}},   (5)

\frac{\partial E_p}{\partial v^l_{pm}} = \sum_{n_{rs} \in T_{lm}} \frac{\partial E_p}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}}.   (6)
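To make the notation concrete, the following is a minimal sketch of a feed-forward network with arbitrary connectivity represented by its source sets, together with the forward pass of equation (3) and the backward recursion of equation (6). The dictionary layout, the node names, the sigmoid activation and the sum-of-squares error are our own illustrative choices, not part of the original text.

    import math

    def f(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Source sets S_lm: for each non-input node, its incoming connections
    # (source node, weight) and its bias.  "y1" receives a skip connection
    # directly from the input "x1".
    sources = {"h1": ([("x1", 0.3), ("x2", -0.2)], 0.1),
               "y1": ([("h1", 0.7), ("x1", 0.5)], -0.4)}
    order = ["h1", "y1"]          # topological order of the non-input nodes
    outputs = {"y1"}

    def forward(x):
        # Equation (3): v = sum over source nodes of w*u, plus bias; u = f(v).
        u, v = dict(x), {}
        for n in order:
            conns, bias = sources[n]
            v[n] = sum(w * u[s] for s, w in conns) + bias
            u[n] = f(v[n])
        return u, v

    def backward(u, targets):
        # Equation (6): dE/dv by backward propagation over the target sets,
        # assuming E_p = 1/2 * sum_j (u_j - t_j)^2.
        dE_dv = {}
        for n in reversed(order):
            fprime = u[n] * (1.0 - u[n])
            if n in outputs:
                dE_dv[n] = fprime * (u[n] - targets[n])
            else:
                downstream = sum(w * dE_dv[t]
                                 for t, (conns, _) in sources.items()
                                 for s, w in conns if s == n)   # scans T_lm
                dE_dv[n] = fprime * downstream
        return dE_dv

    u, v = forward({"x1": 1.0, "x2": 0.5})
    print(backward(u, {"y1": 1.0}))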


3 Calculation of the Hessian matrix times a vector

This section presents an exact algorithm to calculate the vector H_p(w) d, where H_p(w) is the Hessian matrix of the error measure E_p and d is a vector. The coordinates in d are arranged in the same manner as the coordinates in the weight-vector w.

H_p(w) d = \frac{d}{dw} \Big( d^T \frac{dE_p}{dw} \Big) = \frac{d}{dw} \Big( d^T \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{dv^L_{pj}}{dw} \Big)

         = \sum_{j=1}^{N_L} \Big[ \frac{\partial^2 E_p}{(\partial v^L_{pj})^2} \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw} + \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) \Big]   (7)

         = \sum_{j=1}^{N_L} \Big[ \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw} + \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) \Big].

The first and second terms of equation (7) will from now on be referred to as the A-vector and the B-vector respectively. So we have

A = \sum_{j=1}^{N_L} \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw}   and   B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big).   (8)

We first concentrate on calculating the A-vector.

Lemma 1

Let \varphi^l_{pm} be defined as \varphi^l_{pm} = d^T \frac{dv^l_{pm}}{dw}. Then \varphi^l_{pm} can be calculated by forward propagation using the recursive formula

\varphi^l_{pm} = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m,  l > 0;   \varphi^0_{pi} = 0,  1 \le i \le N_0,

where d^{lr}_{ms} and d^l_m denote the coordinates of d corresponding to the weight w^{lr}_{ms} and the bias w^l_m respectively.

Proof. For input nodes we have \varphi^0_{pi} = 0 as desired. Assume the lemma is true for all nodes in layers k < l. Then

\varphi^l_{pm} = d^T \frac{dv^l_{pm}}{dw} = d^T \Big( \sum_{n_{rs} \in S_{lm}} \frac{d}{dw} \big( w^{lr}_{ms} u^r_{ps} \big) + \frac{dw^l_m}{dw} \Big)
             = \sum_{n_{rs} \in S_{lm}} \Big( w^{lr}_{ms} f'(v^r_{ps}) \, d^T \frac{dv^r_{ps}}{dw} + d^{lr}_{ms} u^r_{ps} \Big) + d^l_m
             = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m.   \Box
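Lemma 1 amounts to a second forward pass that propagates the \varphi quantities alongside u and v. A minimal sketch, reusing the source-set representation from the sketch in Section 2; all names and the layout of the vector d as two dictionaries are our own illustrative choices.

    import math

    def f(x):
        return 1.0 / (1.0 + math.exp(-x))

    def fprime(x):
        s = f(x)
        return s * (1.0 - s)

    def forward_with_phi(sources, order, x, d_w, d_b):
        # Lemma 1: phi_n = sum_{s in S_n} (d_ns*u_s + w_ns*f'(v_s)*phi_s) + d_n,
        # with phi = 0 at the input nodes.  d_w[(n, s)] and d_b[n] hold the
        # coordinates of d for the weight from s to n and for the bias of n.
        u = dict(x)
        v = {}
        phi = {s: 0.0 for s in x}
        for n in order:
            conns, bias = sources[n]
            v[n] = sum(w * u[s] for s, w in conns) + bias
            u[n] = f(v[n])
            phi[n] = d_b[n] + sum(
                d_w[(n, s)] * u[s]
                + (w * fprime(v[s]) * phi[s] if s in v else 0.0)  # phi = 0 at inputs
                for s, w in conns)
        return u, v, phi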


Lemma 2

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The A-vector can be calculated by backward propagation using the recursive formula

A^{lh}_{mi} = \kappa^l_{pm} u^h_{pi},   A^l_m = \kappa^l_{pm},

where \kappa^l_{pm} is

\kappa^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps},  l < L;
\kappa^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj},  1 \le j \le N_L.

Proof.

A^{lh}_{mi} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial w^{lh}_{mi}} = \Big( \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}} \Big) u^h_{pi}   \Rightarrow   \kappa^l_{pm} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}}.

For the output layer we have A^{Lh}_{ji} = \kappa^L_{pj} u^h_{pi} as desired. Assume that the lemma is true for all nodes in layers k > l. Then

\kappa^l_{pm} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}} = \sum_{j=1}^{N_L} \kappa^L_{pj} f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial v^L_{pj}}{\partial v^r_{ps}}
             = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \Big( \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^r_{ps}} \Big) = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps}.   \Box

The calculation of the B-vector is a bit more involved but is basically constructed in the same manner.

Lemma 3

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The B-vector can be calculated by backward propagation using the recursive formula

B^{lh}_{mi} = \psi^l_{pm} f'(v^h_{pi}) \varphi^h_{pi} + \theta^l_{pm} u^h_{pi},   B^l_m = \theta^l_{pm},

where \psi^l_{pm} and \theta^l_{pm} are

\psi^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps},  l < L;   \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}},  1 \le j \le N_L;

\theta^l_{pm} = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big),  l < L;   \theta^L_{pj} = 0,  1 \le j \le N_L.

Proof. Observe that the B-vector can be written in the form

B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{d\varphi^L_{pj}}{dw}.

Using the chain rule we can derive analytic expressions for \psi^l_{pm} and \theta^l_{pm}:

B^{lh}_{mi} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial w^{lh}_{mi}}
           = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}} \frac{\partial \varphi^l_{pm}}{\partial w^{lh}_{mi}} + \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}} \frac{\partial v^l_{pm}}{\partial w^{lh}_{mi}} \Big)
           = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}} f'(v^h_{pi}) \varphi^h_{pi} + \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}} u^h_{pi} \Big).

So if the lemma is true, \psi^l_{pm} and \theta^l_{pm} are given by

\psi^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}},   \theta^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}}.

The rest of the proof is done in two steps; we look at the parts concerned with the \psi^l_{pm} and \theta^l_{pm} factors separately. For all output nodes we have \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}} as desired. For non-output nodes we have

\psi^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \frac{\partial \varphi^r_{ps}}{\partial \varphi^l_{pm}} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}}
            = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps}.

Similarly, \theta^L_{pj} = 0 for all output nodes, as desired. For non-output nodes we have

\theta^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \Big( \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} + \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \frac{\partial \varphi^r_{ps}}{\partial v^l_{pm}} \Big)
             = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \Big)
             = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \Big)
             = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big).

The proof of the formula for B^l_m follows easily from the above derivations and is left to the reader.   \Box

We are now ready to give an explicit formula for the calculation of the Hessian matrix times a vector. Let Hd denote the vector H_p(w) d.

Corollary 1

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The vector Hd can be calculated by backward propagation using the following recursive formula

Hd^{lh}_{mi} = \psi^l_{pm} f'(v^h_{pi}) \varphi^h_{pi} + (\kappa^l_{pm} + \theta^l_{pm}) u^h_{pi},   Hd^l_m = \kappa^l_{pm} + \theta^l_{pm},

where \kappa^l_{pm}, \psi^l_{pm} and \theta^l_{pm} are given as shown in Lemma 2 and Lemma 3.

Proof. By combination of Lemma 2 and Lemma 3.   \Box

If we view first derivatives like \frac{\partial E_p}{\partial u^l_{pm}} and \frac{\partial E_p}{\partial v^l_{pm}} as already available information, then the formula for Hd can be reformulated into a formula based on only one recursive parameter. First we observe that \psi^l_{pm} and the combined parameter \rho^l_{pm} = \kappa^l_{pm} + \theta^l_{pm} can be written in the form

\psi^l_{pm} = \frac{\partial E_p}{\partial v^l_{pm}},
\rho^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} \Big( w^{rl}_{sm} \rho^r_{ps} + d^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}} \Big) + f''(v^l_{pm}) \varphi^l_{pm} \frac{\partial E_p}{\partial u^l_{pm}}.   (9)

Corollary 2

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The vector Hd can be calculated by backward propagation using the following recursive formula

Hd^{lh}_{mi} = \frac{\partial E_p}{\partial v^l_{pm}} f'(v^h_{pi}) \varphi^h_{pi} + \rho^l_{pm} u^h_{pi},   Hd^l_m = \rho^l_{pm},

where \rho^l_{pm} is

\rho^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} \Big( w^{rl}_{sm} \rho^r_{ps} + d^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}} \Big) + f''(v^l_{pm}) \varphi^l_{pm} \frac{\partial E_p}{\partial u^l_{pm}},  l < L;

\rho^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj},  1 \le j \le N_L.


Proof. By Corollary 1 and equation (9).   \Box

The formula in Corollary 2 is a generalized version of the one that Yoshida derived for feed-forward networks with only connections between adjacent layers. An algorithm that calculates \sum_{p=1}^{P} H_p(w) d based on Corollary 1 is given below. The algorithm also calculates the gradient vector G = \sum_{p=1}^{P} \frac{dE_p}{dw}.

1. Initialize: Hd = 0; G = 0. Repeat steps 2-4 for p = 1, ..., P.

2. Forward propagation.
   For nodes i = 1 to N_0 do: \varphi^0_{pi} = 0.
   For layers l = 1 to L and nodes m = 1 to N_l do:
     v^l_{pm} = \sum_{n_{rs} \in S_{lm}} w^{lr}_{ms} u^r_{ps} + w^l_m;   u^l_{pm} = f(v^l_{pm});
     \varphi^l_{pm} = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m.

3. Output layer.
   For nodes j = 1 to N_L do:
     \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}};   \theta^L_{pj} = 0;
     \kappa^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj}.
     For all nodes n_{rs} \in S_{Lj} do:
       Hd^{Lr}_{js} = Hd^{Lr}_{js} + \psi^L_{pj} f'(v^r_{ps}) \varphi^r_{ps} + \kappa^L_{pj} u^r_{ps};   Hd^L_j = Hd^L_j + \kappa^L_{pj};
       G^{Lr}_{js} = G^{Lr}_{js} + \psi^L_{pj} u^r_{ps};   G^L_j = G^L_j + \psi^L_{pj}.

4. Backward propagation.
   For layers l = L-1 downto 1 and nodes m = 1 to N_l do:
     \kappa^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps};   \psi^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps};
     \theta^l_{pm} = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big).
     For all nodes n_{rs} \in S_{lm} do:
       Hd^{lr}_{ms} = Hd^{lr}_{ms} + \psi^l_{pm} f'(v^r_{ps}) \varphi^r_{ps} + (\kappa^l_{pm} + \theta^l_{pm}) u^r_{ps};
       Hd^l_m = Hd^l_m + \kappa^l_{pm} + \theta^l_{pm};
       G^{lr}_{ms} = G^{lr}_{ms} + \psi^l_{pm} u^r_{ps};   G^l_m = G^l_m + \psi^l_{pm}.

Clearly this algorithm has O(N) time and memory requirements. More precisely, the time complexity is about 2.5 times the time complexity of a gradient calculation alone.
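As a concrete illustration of steps 1-4, the following is a minimal sketch of the algorithm specialized to a fully connected network with one hidden layer, a sigmoid activation and a sum-of-squares error; this specialization, the array layout and all names are our own choices, not the paper's. The forward sweep computes u, v and \varphi (Lemma 1), and the backward sweep accumulates both the gradient G and the product Hd, using the one-parameter recursion of Corollary 2.

    import numpy as np

    def sig(x):   return 1.0 / (1.0 + np.exp(-x))
    def dsig(x):  s = sig(x); return s * (1.0 - s)
    def d2sig(x): s = sig(x); return s * (1.0 - s) * (1.0 - 2.0 * s)

    def gradient_and_hessian_vector(params, direction, X, T):
        # Exact G = sum_p dE_p/dw and Hd = sum_p H_p(w) d in one forward
        # and one backward sweep per pattern (O(N) per pattern).
        W1, b1, W2, b2 = params
        D1, c1, D2, c2 = direction
        G  = [np.zeros_like(a) for a in params]
        Hd = [np.zeros_like(a) for a in params]
        for x, t in zip(X, T):
            # forward propagation of u, v and phi (step 2)
            v1 = W1 @ x + b1;  u1 = sig(v1)
            phi1 = D1 @ x + c1                    # phi at the hidden nodes
            Ru1  = dsig(v1) * phi1
            v2 = W2 @ u1 + b2; u2 = sig(v2)
            phi2 = D2 @ u1 + W2 @ Ru1 + c2        # phi at the output nodes
            # output layer (step 3)
            psi2 = dsig(v2) * (u2 - t)                          # dE/dv at outputs
            rho2 = (dsig(v2) ** 2 + d2sig(v2) * (u2 - t)) * phi2
            # backward propagation (step 4), Corollary 2 recursion
            dE_du1 = W2.T @ psi2
            psi1 = dsig(v1) * dE_du1                            # dE/dv at hidden nodes
            rho1 = dsig(v1) * (W2.T @ rho2 + D2.T @ psi2) + d2sig(v1) * phi1 * dE_du1
            # accumulate gradient and Hessian-times-vector coordinates
            G[0]  += np.outer(psi1, x);   G[1]  += psi1
            G[2]  += np.outer(psi2, u1);  G[3]  += psi2
            Hd[0] += np.outer(rho1, x);   Hd[1] += rho1
            Hd[2] += np.outer(rho2, u1) + np.outer(psi2, Ru1)
            Hd[3] += rho2
        return G, Hd

For a small network the result can be checked against a finite-difference approximation of H d or against an explicitly computed Hessian.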

4 Improvement of existing learning techniques

In this section we justify the importance of the exact calculation of the Hessian times a vector by showing some possible improvements of two different learning algorithms.

4.1 The scaled conjugate gradient algorithm

The scaled conjugate gradient algorithm is a variation of a standard conjugate gradient algorithm. The conjugate gradient algorithms produce non-interfering directions of search if the error function is assumed to be quadratic. Minimization in one direction d_t followed by minimization in another direction d_{t+1} implies that the quadratic approximation to the error has been minimized over the whole subspace spanned by d_t and d_{t+1}. The search directions are given by

d_{t+1} = -E'(w_{t+1}) + \beta_t d_t,   (10)

where w_t is a vector containing all weight values at time step t and \beta_t is

\beta_t = \frac{|E'(w_{t+1})|^2 - E'(w_{t+1})^T E'(w_t)}{|E'(w_t)|^2}.   (11)

In the standard conjugate gradient algorithms the step size \alpha_t is found by a line search, which can be very time consuming because it involves several calculations of the error and/or the first derivative. In the scaled conjugate gradient algorithm the step size is estimated by a scaling mechanism, thus avoiding the time-consuming line search. The step size is given by

\alpha_t = \frac{-d_t^T E'(w_t)}{d_t^T s_t + \lambda_t |d_t|^2},   (12)

where s_t is

s_t = E''(w_t) d_t.   (13)

\alpha_t is the step size that minimizes the second order approximation to the error function. \lambda_t is a scaling parameter whose function is similar to the scaling parameter found in Levenberg-Marquardt methods [Fletcher 75]. \lambda_t is in each iteration raised or lowered according to how good the second order approximation is to the real error. The weight update formula is given by

\Delta w_t = \alpha_t d_t.   (14)

Up to now, s_t has been approximated by a one-sided difference equation of the form

s_t = \frac{E'(w_t + \sigma_t d_t) - E'(w_t)}{\sigma_t},   0 < \sigma_t \le 1.   (15)

s_t can now be calculated exactly by applying the algorithm from the last section. We tested the SCG algorithm on several test problems using both exact and approximated calculations of d_t^T s_t. The experiments indicated a minor speedup in favor of the exact calculation. Equation (15) is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used. If the relative error of E'(w_t) is \varepsilon, then the relative error of equation (15) can be as high as 2\varepsilon/\sigma_t [Ralston 78]. So the relative error grows when \sigma_t is lowered. We refer to [Møller 93a] for a detailed description of SCG. For a stochastic version of SCG especially designed for training on large, redundant training sets, see also [Møller 93b].
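With an exact Hessian-vector product available, the SCG step size of equation (12) can use s_t = E''(w_t) d_t directly instead of the difference approximation (15). A minimal sketch, assuming user-supplied callables grad(w) and hessian_vector(w, d) returning the gradient and the exact Hessian-vector product as numpy arrays; the names are illustrative.

    def scg_step_size(grad, hessian_vector, w, d, lam):
        # Equation (12) with the exact s_t of equation (13) in place of
        # the one-sided difference of equation (15).
        s = hessian_vector(w, d)              # exact E''(w_t) d_t
        denom = d @ s + lam * (d @ d)         # d_t^T s_t + lambda_t |d_t|^2
        return -(d @ grad(w)) / denom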

4.2 Eigenvalue estimation

A recent gradient descent learning algorithm proposed by Le Cun, Simard and Pearlmutter involves the estimation of the eigenvalues of the Hessian matrix. We will give a brief description of the ideas in this algorithm mainly in order to explain the use of the eigenvalues and the technique to estimate them. We refer to [Le Cun 93] for a detailed description of this algorithm.

Assume that the Hessian H(w_t) is invertible. By the spectral theorem from linear algebra, H(w_t) then has N eigenvectors that form an orthogonal basis in R^N [Horn 85]. This implies that the inverse of the Hessian matrix H(w_t)^{-1} can be written in the form

H(w_t)^{-1} = \sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i},   (16)

where \lambda_i is the i'th eigenvalue of H(w_t) and e_i is the corresponding eigenvector. Equation (16) implies that the search directions d_t of the Newton algorithm [Fletcher 75] can be written as

d_t = -H(w_t)^{-1} G(w_t) = -\sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i} G(w_t) = -\sum_{i=1}^{N} \frac{e_i^T G(w_t)}{|e_i|^2 \lambda_i} e_i,   (17)

where G(w_t) is the gradient vector. So the Newton search direction can be interpreted as a sum of projections of the gradient vector onto the eigenvectors, weighted with the inverse of the eigenvalues. To calculate all eigenvalues and corresponding eigenvectors costs O(N^3) time, which is infeasible for large N. Le Cun et al. argue that only a few of the largest eigenvalues and the corresponding eigenvectors are needed to achieve a considerable speedup in learning. The idea is to reduce the weight change in directions with large curvature, while keeping it large in all other directions. They choose the search direction to be

d_t = -\Big( G(w_t) - \big( 1 - \tfrac{\lambda_{k+1}}{\lambda_1} \big) \sum_{i=1}^{k} \frac{e_i^T G(w_t)}{|e_i|^2} e_i \Big),   (18)

where i now runs over the k largest eigenvalues, from the largest eigenvalue \lambda_1 down to the k'th largest eigenvalue \lambda_k. The eigenvalues of the Hessian matrix are the curvatures in the directions of the corresponding eigenvectors, so equation (18) reduces the components of the gradient along the directions with large curvature; see also [Le Cun 91] for a discussion of this. The learning rate can now be increased by a factor of \lambda_1/\lambda_{k+1}, since the components in directions with large curvature have been reduced by the inverse of this factor.
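A direction of this kind is easy to form once a few leading eigenpairs are available. The sketch below scales the gradient components along the k estimated leading eigendirections so that the remaining effective curvature is of the order of \lambda_{k+1}; the inputs are numpy arrays, the damping factor \lambda_{k+1}/\lambda_1 is one natural choice, and the function is an illustration rather than the exact recipe of [Le Cun 93].

    def reduced_curvature_direction(g, eigvals, eigvecs):
        # g: gradient; eigvals: the k+1 largest eigenvalues in descending
        # order; eigvecs: the k corresponding eigenvectors as columns.
        d = g.copy()
        damp = 1.0 - eigvals[-1] / eigvals[0]     # 1 - lambda_{k+1}/lambda_1
        for i in range(eigvecs.shape[1]):
            e = eigvecs[:, i]
            d -= damp * (e @ g) / (e @ e) * e     # shrink large-curvature component
        return -d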

The largest eigenvalue and the corresponding eigenvector can be estimated by an iterative process known as the Power method [Ralston 78]. The Power method can be used successively to estimate the k largest eigenvalues if the components in the directions of the already estimated eigenvectors are subtracted in the process. Below we show an algorithm for estimation of the i'th eigenvalue and eigenvector. The Power method is here combined with the Rayleigh quotient technique [Ralston 78], which can accelerate the process considerably.

Choose an initial random vector e_i^0 and repeat the following steps for m = 1, ..., M, where M is a small constant:

e_i^m = H(w_t) e_i^{m-1};   e_i^m = e_i^m - \sum_{j=1}^{i-1} \frac{e_j^T e_i^m}{|e_j|^2} e_j;
\lambda_i^m = \frac{(e_i^{m-1})^T e_i^m}{|e_i^{m-1}|^2};   e_i^m = \frac{1}{\lambda_i^m} e_i^m.

\lambda_i^M and e_i^M are respectively the estimated eigenvalue and eigenvector. Theoretically it would be enough to subtract the components in the directions of already estimated eigenvectors once, but in practice round-off errors will generally reintroduce these components.

Le Cun et al. approximate the term H(w_t) e_i^m with a one-sided difference as shown in equation (15). This term can now be calculated exactly by use of the algorithm described in the preceding sections.

5 Conclusion

This paper has presented an algorithm for the exact calculation of the product of the Hessian matrix of error functions and a vector. The product is calculated without ever explicitly calculating the Hessian matrix itself.

The algorithm has O(N) time and memory requirements, where N is the number of variables in the network.

The relevance of this algorithm has been demonstrated by showing possible improvements in two different learning techniques: the scaled conjugate gradient learning algorithm and an algorithm recently proposed by Le Cun, Simard and Pearlmutter.

Acknowledgements

It has recently come to the author's knowledge that the same algorithm has been derived independently, and at approximately the same time, by Barak Pearlmutter, Department of Computer Science and Engineering, Oregon Graduate Institute [Pearlmutter 93]. Thanks to Barak for his kind and immediate recognition of the independence of our work.

I would also like to thank Wray Buntine, Scott Fahlman, Brian Mayoh and Ole Østerby for helpful advice. This research was supported by a grant from the Royal Danish Research Council. Facilities for this research were provided by the National Science Foundation (U.S.) under grant IRI-9214873. All opinions, findings, conclusions and recommendations in this paper are those of the author and do not necessarily reflect the views of the Royal Danish Research Council or the National Science Foundation.

References

[Bishop 92] C. Bishop (1992), Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation, Vol. 2, pp. 494-501.

[Buntine 91] W. Buntine and A. Weigend (1991), Calculating Second Derivatives on Feed-Forward Networks, submitted to IEEE Transactions on Neural Networks.

[Le Cun 91] Y. Le Cun, I. Kanter and S. Solla (1991), Eigenvalues of Covariance Matrices: Application to Neural Network Learning, Physical Review Letters, Vol. 66, pp. 2396-2399.

[Le Cun 93] Y. Le Cun, P.Y. Simard and B. Pearlmutter (1993), Local Computation of the Second Derivative Information in a Multilayer Network, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann, in print.

[Dixon 89] L.C.W. Dixon and R.C. Price (1989), Truncated Newton Method for Sparse Unconstrained Optimization Using Automatic Differentiation, Journal of Optimization Theory and Applications, Vol. 60, No. 2, pp. 261-275.

[Fletcher 75] R. Fletcher (1975), Practical Methods of Optimization, Vol. 1, John Wiley & Sons.

[Hassibi 92] B. Hassibi and D.G. Stork (1992), Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann.


[Horn 85] R.A. Horn and C.R. Johnson (1985), Matrix Analysis, Cambridge University Press, Cambridge.

[MacKay 91] D.J.C. MacKay (1991), A Practical Bayesian Framework for Back-Prop Networks, Neural Computation, Vol. 4, No. 3, pp. 448-472.

[Møller 93a] M. Møller (1993), A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, in press.

[Møller 93b] M. Møller (1993), Supervised Learning on Large Redundant Training Sets, International Journal of Neural Systems, in press.

[Pearlmutter 93] B.A. Pearlmutter (1993), Fast Exact Multiplication by the Hessian, preprint, submitted.

[Ralston 78] A. Ralston and P. Rabinowitz (1978), A First Course in Numerical Analysis, McGraw-Hill Book Company, Inc.

[Yoshida 91] T. Yoshida (1991), A Learning Algorithm for Multilayered Neural Networks: A Newton Method Using Automatic Differentiation, in Proceedings of the International Joint Conference on Neural Networks, Seattle, poster.
