Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time

Martin Møller

Computer Science Department, Aarhus University
DK-8000 Århus, Denmark

Abstract

Several methods for training feed-forward neural networks require second order information from the Hessian matrix of the error function. Although it is possible to calculate the Hessian matrix exactly, it is often not desirable because of the computation and memory requirements involved. Some learning techniques do, however, only need the Hessian matrix times a vector. This paper presents a method to calculate the Hessian matrix times a vector in O(N) time, where N is the number of variables in the network. This is of the same order as the calculation of the gradient of the error function. The usefulness of this algorithm is demonstrated by improvements of existing learning techniques.

1 Introduction

The second derivative information of the error function associated with feed-forward neural networks forms an N × N matrix, which is usually referred to as the Hessian matrix. Second derivative information is needed in several learning algorithms, e.g., in some conjugate gradient algorithms [Møller 93a], and in recent network pruning techniques [MacKay 91], [Hassibi 92].

(This work was done while the author was visiting the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.)

Several researchers have recently derived formulae for exact calculation of the elements in the Hessian matrix [Buntine 91], [Bishop 92]. In the general case, exact calculation of the Hessian matrix has O(N^2) time and memory requirements. For that reason it is often not worthwhile to calculate the Hessian matrix explicitly, and approximations are often made, as described in [Buntine 91]. The second order information is not always needed in the form of the Hessian matrix itself, which makes it possible to reduce the time and memory requirements needed to obtain this information. The scaled conjugate gradient algorithm [Møller 93a] and a training algorithm recently proposed by Le Cun involving estimation of eigenvalues of the Hessian matrix [Le Cun 93] are good examples of this. The second order information needed here is always in the form of the Hessian matrix times a vector. In both methods the product of the Hessian and the vector is usually approximated by a one-sided difference equation. This is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used.
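The one-sided difference approximation referred to above is simple to state in code. The following is a minimal sketch in Python, assuming a user-supplied routine grad(w) that returns the gradient of the error function; the function names are illustrative and not part of the original text.

    import numpy as np

    def hessian_vector_fd(grad, w, d, sigma=1e-6):
        # One-sided difference approximation of H(w) d.
        # Its relative error grows as sigma is lowered, which is the
        # numerical instability that the exact algorithm avoids.
        return (grad(w + sigma * d) - grad(w)) / sigma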

It is possible to calculate the Hessian matrix times a vector exactly without explicitly having to calculate and store the Hessian matrix itself.

Through straightforward analytic evaluations we give explicit formulae for the Hessian matrix times a vector. We prove these formulae and give an algorithm that calculates the product. This algorithm has O(N) time and memory requirements, which is of the same order as the calculation of the gradient of the error function. The algorithm is a generalized version of an algorithm outlined by Yoshida, which was derived by applying an automatic differentiation technique [Yoshida 91]. The automatic differentiation technique is an indirect method of obtaining derivative information and provides no analytic expressions for the derivatives [Dixon 89]. Yoshida's algorithm is only valid for feed-forward networks with connections between adjacent layers; our algorithm works for feed-forward networks with arbitrary connectivity.

The usefulness of the algorithm is demonstrated by discussing possible improvements of existing learning techniques. We focus here on improvements of the scaled conjugate gradient algorithm and on estimation of eigenvalues of the Hessian matrix.


2 Notation

The networks we consider are multilayered feed-forward neural networks with arbitrary connectivity. The network \mathcal{N} consists of nodes n_{lm} arranged in layers l = 0, ..., L. The number of nodes in layer l is denoted N_l. In order to handle the arbitrary connectivity we define for each node n_{lm} a set of source nodes and a set of target nodes:

S_{lm} = { n_{rs} \in \mathcal{N} | there is a connection from n_{rs} to n_{lm}, r < l, 1 \le s \le N_r },   (1)
T_{lm} = { n_{rs} \in \mathcal{N} | there is a connection from n_{lm} to n_{rs}, r > l, 1 \le s \le N_r }.

The training set associated with network \mathcal{N} is

{ (u^0_{ps}, s = 1, ..., N_0; t_{pj}, j = 1, ..., N_L), p = 1, ..., P }.   (2)

The output from a node n_{lm} when a pattern p is propagated through the network is

u^l_{pm} = f(v^l_{pm}),  where  v^l_{pm} = \sum_{n_{rs} \in S_{lm}} w^{lr}_{ms} u^r_{ps} + w^l_m,   (3)

and w^{lr}_{ms} is the weight from node n_{rs} to node n_{lm}, and w^l_m is the usual bias of node n_{lm}. f(v^l_{pm}) is an appropriate activation function, e.g., the sigmoid. The net input v^l_{pm} is chosen to be the usual weighted linear summation of inputs; the calculations to be made could, however, easily be extended to other definitions of v^l_{pm}. Let an error function E(w) be

E(w) = \sum_{p=1}^{P} E_p(u^L_{p1}, ..., u^L_{pN_L}; t_{p1}, ..., t_{pN_L}),   (4)

where w is a vector containing all weights and biases in the network, and E_p is some appropriate error measure associated with pattern p from the training set.

Based on the chain rule we define some basic recursive formulae for calculating first derivative information. These formulae are used frequently in the next section. The formulae, based on backward propagation, are

\frac{\partial v^h_{pi}}{\partial v^l_{pm}} = \sum_{n_{rs} \in T_{lm}} \frac{\partial v^h_{pi}}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial v^h_{pi}}{\partial v^r_{ps}},   (5)

\frac{\partial E_p}{\partial v^l_{pm}} = \sum_{n_{rs} \in T_{lm}} \frac{\partial E_p}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}}.   (6)
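To make the notation concrete, the following is a minimal sketch of a feed-forward network with arbitrary connectivity represented by its source sets, together with the forward pass of equation (3) and the backward recursion of equation (6). The dictionary layout, the node names, the sigmoid activation and the sum-of-squares error are our own illustrative choices, not part of the original text.

    import math

    def f(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Source sets S_lm: for each non-input node, its incoming connections
    # (source node, weight) and its bias.  "y1" receives a skip connection
    # directly from the input "x1".
    sources = {"h1": ([("x1", 0.3), ("x2", -0.2)], 0.1),
               "y1": ([("h1", 0.7), ("x1", 0.5)], -0.4)}
    order = ["h1", "y1"]          # topological order of the non-input nodes
    outputs = {"y1"}

    def forward(x):
        # Equation (3): v = sum over source nodes of w*u, plus bias; u = f(v).
        u, v = dict(x), {}
        for n in order:
            conns, bias = sources[n]
            v[n] = sum(w * u[s] for s, w in conns) + bias
            u[n] = f(v[n])
        return u, v

    def backward(u, targets):
        # Equation (6): dE/dv by backward propagation over the target sets,
        # assuming E_p = 1/2 * sum_j (u_j - t_j)^2.
        dE_dv = {}
        for n in reversed(order):
            fprime = u[n] * (1.0 - u[n])
            if n in outputs:
                dE_dv[n] = fprime * (u[n] - targets[n])
            else:
                downstream = sum(w * dE_dv[t]
                                 for t, (conns, _) in sources.items()
                                 for s, w in conns if s == n)   # scans T_lm
                dE_dv[n] = fprime * downstream
        return dE_dv

    u, v = forward({"x1": 1.0, "x2": 0.5})
    print(backward(u, {"y1": 1.0}))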


3 Calculation of the Hessian matrix times a vector

This section presents an exact algorithm to calculate the vector H_p(w) d, where H_p(w) is the Hessian matrix of the error measure E_p and d is a vector. The coordinates in d are arranged in the same manner as the coordinates in the weight-vector w.

H_p(w) d = \frac{d}{dw} \Big( d^T \frac{dE_p}{dw} \Big) = \frac{d}{dw} \Big( d^T \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{dv^L_{pj}}{dw} \Big)

         = \sum_{j=1}^{N_L} \Big[ \frac{\partial^2 E_p}{(\partial v^L_{pj})^2} \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw} + \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) \Big]   (7)

         = \sum_{j=1}^{N_L} \Big[ \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw} + \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) \Big].

The first and second terms of equation (7) will from now on be referred to as the A-vector and the B-vector respectively. So we have

A = \sum_{j=1}^{N_L} \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \Big( d^T \frac{dv^L_{pj}}{dw} \Big) \frac{dv^L_{pj}}{dw}   and   B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big).   (8)

We first concentrate on calculating the A-vector.

Lemma 1

Let \varphi^l_{pm} be defined as \varphi^l_{pm} = d^T \frac{dv^l_{pm}}{dw}. Then \varphi^l_{pm} can be calculated by forward propagation using the recursive formula

\varphi^l_{pm} = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m,  l > 0;   \varphi^0_{pi} = 0,  1 \le i \le N_0,

where d^{lr}_{ms} and d^l_m denote the coordinates of d corresponding to the weight w^{lr}_{ms} and the bias w^l_m respectively.

Proof. For input nodes we have \varphi^0_{pi} = 0 as desired. Assume the lemma is true for all nodes in layers k < l. Then

\varphi^l_{pm} = d^T \frac{dv^l_{pm}}{dw} = d^T \Big( \sum_{n_{rs} \in S_{lm}} \frac{d}{dw} \big( w^{lr}_{ms} u^r_{ps} \big) + \frac{dw^l_m}{dw} \Big)
             = \sum_{n_{rs} \in S_{lm}} \Big( w^{lr}_{ms} f'(v^r_{ps}) \, d^T \frac{dv^r_{ps}}{dw} + d^{lr}_{ms} u^r_{ps} \Big) + d^l_m
             = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m.   \Box
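Lemma 1 amounts to a second forward pass that propagates the \varphi quantities alongside u and v. A minimal sketch, reusing the source-set representation from the sketch in Section 2; all names and the layout of the vector d as two dictionaries are our own illustrative choices.

    import math

    def f(x):
        return 1.0 / (1.0 + math.exp(-x))

    def fprime(x):
        s = f(x)
        return s * (1.0 - s)

    def forward_with_phi(sources, order, x, d_w, d_b):
        # Lemma 1: phi_n = sum_{s in S_n} (d_ns*u_s + w_ns*f'(v_s)*phi_s) + d_n,
        # with phi = 0 at the input nodes.  d_w[(n, s)] and d_b[n] hold the
        # coordinates of d for the weight from s to n and for the bias of n.
        u = dict(x)
        v = {}
        phi = {s: 0.0 for s in x}
        for n in order:
            conns, bias = sources[n]
            v[n] = sum(w * u[s] for s, w in conns) + bias
            u[n] = f(v[n])
            phi[n] = d_b[n] + sum(
                d_w[(n, s)] * u[s]
                + (w * fprime(v[s]) * phi[s] if s in v else 0.0)  # phi = 0 at inputs
                for s, w in conns)
        return u, v, phi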


Lemma 2

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The A-vector can be calculated by backward propagation using the recursive formula

A^{lh}_{mi} = \kappa^l_{pm} u^h_{pi},   A^l_m = \kappa^l_{pm},

where \kappa^l_{pm} is

\kappa^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps},  l < L;
\kappa^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj},  1 \le j \le N_L.

Proof.

A^{lh}_{mi} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial w^{lh}_{mi}} = \Big( \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}} \Big) u^h_{pi}   \Rightarrow   \kappa^l_{pm} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}}.

For the output layer we have A^{Lh}_{ji} = \kappa^L_{pj} u^h_{pi} as desired. Assume that the lemma is true for all nodes in layers k > l. Then

\kappa^l_{pm} = \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^l_{pm}} = \sum_{j=1}^{N_L} \kappa^L_{pj} f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial v^L_{pj}}{\partial v^r_{ps}}
             = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \Big( \sum_{j=1}^{N_L} \kappa^L_{pj} \frac{\partial v^L_{pj}}{\partial v^r_{ps}} \Big) = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps}.   \Box

The calculation of the B-vector is a bit more involved but is basically constructed in the same manner.

Lemma 3

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The B-vector can be calculated by backward propagation using the recursive formula

B^{lh}_{mi} = \psi^l_{pm} f'(v^h_{pi}) \varphi^h_{pi} + \theta^l_{pm} u^h_{pi},   B^l_m = \theta^l_{pm},

where \psi^l_{pm} and \theta^l_{pm} are

\psi^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps},  l < L;   \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}},  1 \le j \le N_L;

\theta^l_{pm} = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big),  l < L;   \theta^L_{pj} = 0,  1 \le j \le N_L.

Proof. Observe that the B-vector can be written in the form

B = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{d^2 v^L_{pj}}{dw^2} d \Big) = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{d\varphi^L_{pj}}{dw}.

Using the chain rule we can derive analytic expressions for \psi^l_{pm} and \theta^l_{pm}:

B^{lh}_{mi} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial w^{lh}_{mi}}
           = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}} \frac{\partial \varphi^l_{pm}}{\partial w^{lh}_{mi}} + \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}} \frac{\partial v^l_{pm}}{\partial w^{lh}_{mi}} \Big)
           = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \Big( \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}} f'(v^h_{pi}) \varphi^h_{pi} + \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}} u^h_{pi} \Big).

So if the lemma is true, \psi^l_{pm} and \theta^l_{pm} are given by

\psi^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^l_{pm}},   \theta^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial v^l_{pm}}.

The rest of the proof is done in two steps; we look at the parts concerned with the \psi^l_{pm} and \theta^l_{pm} factors separately. For all output nodes we have \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}} as desired. For non-output nodes we have

\psi^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \frac{\partial \varphi^r_{ps}}{\partial \varphi^l_{pm}} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}}
            = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps}.

Similarly, \theta^L_{pj} = 0 for all output nodes, as desired. For non-output nodes we have

\theta^l_{pm} = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \Big( \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} \frac{\partial v^r_{ps}}{\partial v^l_{pm}} + \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \frac{\partial \varphi^r_{ps}}{\partial v^l_{pm}} \Big)
             = \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \Big)
             = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial v^r_{ps}} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \sum_{j=1}^{N_L} \frac{\partial E_p}{\partial v^L_{pj}} \frac{\partial \varphi^L_{pj}}{\partial \varphi^r_{ps}} \Big)
             = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big).

The proof of the formula for B^l_m follows easily from the above derivations and is left to the reader.   \Box

We are now ready to give an explicit formula for the calculation of the Hessian matrix times a vector. Let Hd denote the vector H_p(w) d.

Corollary 1

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The vector Hd can be calculated by backward propagation using the following recursive formula

Hd^{lh}_{mi} = \psi^l_{pm} f'(v^h_{pi}) \varphi^h_{pi} + (\kappa^l_{pm} + \theta^l_{pm}) u^h_{pi},   Hd^l_m = \kappa^l_{pm} + \theta^l_{pm},

where \kappa^l_{pm}, \psi^l_{pm} and \theta^l_{pm} are given as shown in Lemma 2 and Lemma 3.

Proof. By combination of Lemma 2 and Lemma 3.   \Box

If we view first derivatives like \frac{\partial E_p}{\partial u^l_{pm}} and \frac{\partial E_p}{\partial v^l_{pm}} as already available information, then the formula for Hd can be reformulated into a formula based on only one recursive parameter. First we observe that \psi^l_{pm} and the combined parameter \rho^l_{pm} = \kappa^l_{pm} + \theta^l_{pm} can be written in the form

\psi^l_{pm} = \frac{\partial E_p}{\partial v^l_{pm}},
\rho^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} \Big( w^{rl}_{sm} \rho^r_{ps} + d^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}} \Big) + f''(v^l_{pm}) \varphi^l_{pm} \frac{\partial E_p}{\partial u^l_{pm}}.   (9)

Corollary 2

Assume that the \varphi^l_{pm} factors have been calculated for all nodes in the network. The vector Hd can be calculated by backward propagation using the following recursive formula

Hd^{lh}_{mi} = \frac{\partial E_p}{\partial v^l_{pm}} f'(v^h_{pi}) \varphi^h_{pi} + \rho^l_{pm} u^h_{pi},   Hd^l_m = \rho^l_{pm},

where \rho^l_{pm} is

\rho^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} \Big( w^{rl}_{sm} \rho^r_{ps} + d^{rl}_{sm} \frac{\partial E_p}{\partial v^r_{ps}} \Big) + f''(v^l_{pm}) \varphi^l_{pm} \frac{\partial E_p}{\partial u^l_{pm}},  l < L;

\rho^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj},  1 \le j \le N_L.


Proof. By Corollary 1 and equation (9).   \Box

The formula in Corollary 2 is a generalized version of the one that Yoshida derived for feed-forward networks with only connections between adjacent layers. An algorithm that calculates \sum_{p=1}^{P} H_p(w) d based on Corollary 1 is given below. The algorithm also calculates the gradient vector G = \sum_{p=1}^{P} \frac{dE_p}{dw}.

1. Initialize: Hd = 0; G = 0. Repeat steps 2-4 for p = 1, ..., P.

2. Forward propagation.
   For nodes i = 1 to N_0 do: \varphi^0_{pi} = 0.
   For layers l = 1 to L and nodes m = 1 to N_l do:
     v^l_{pm} = \sum_{n_{rs} \in S_{lm}} w^{lr}_{ms} u^r_{ps} + w^l_m;   u^l_{pm} = f(v^l_{pm});
     \varphi^l_{pm} = \sum_{n_{rs} \in S_{lm}} \big( d^{lr}_{ms} u^r_{ps} + w^{lr}_{ms} f'(v^r_{ps}) \varphi^r_{ps} \big) + d^l_m.

3. Output layer.
   For nodes j = 1 to N_L do:
     \psi^L_{pj} = \frac{\partial E_p}{\partial v^L_{pj}};   \theta^L_{pj} = 0;
     \kappa^L_{pj} = \Big( f'(v^L_{pj})^2 \frac{\partial^2 E_p}{(\partial u^L_{pj})^2} + f''(v^L_{pj}) \frac{\partial E_p}{\partial u^L_{pj}} \Big) \varphi^L_{pj}.
     For all nodes n_{rs} \in S_{Lj} do:
       Hd^{Lr}_{js} = Hd^{Lr}_{js} + \psi^L_{pj} f'(v^r_{ps}) \varphi^r_{ps} + \kappa^L_{pj} u^r_{ps};   Hd^L_j = Hd^L_j + \kappa^L_{pj};
       G^{Lr}_{js} = G^{Lr}_{js} + \psi^L_{pj} u^r_{ps};   G^L_j = G^L_j + \psi^L_{pj}.

4. Backward propagation.
   For layers l = L-1 downto 1 and nodes m = 1 to N_l do:
     \kappa^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \kappa^r_{ps};   \psi^l_{pm} = f'(v^l_{pm}) \sum_{n_{rs} \in T_{lm}} w^{rl}_{sm} \psi^r_{ps};
     \theta^l_{pm} = \sum_{n_{rs} \in T_{lm}} \Big( f'(v^l_{pm}) w^{rl}_{sm} \theta^r_{ps} + \big( d^{rl}_{sm} f'(v^l_{pm}) + w^{rl}_{sm} f''(v^l_{pm}) \varphi^l_{pm} \big) \psi^r_{ps} \Big).
     For all nodes n_{rs} \in S_{lm} do:
       Hd^{lr}_{ms} = Hd^{lr}_{ms} + \psi^l_{pm} f'(v^r_{ps}) \varphi^r_{ps} + (\kappa^l_{pm} + \theta^l_{pm}) u^r_{ps};
       Hd^l_m = Hd^l_m + \kappa^l_{pm} + \theta^l_{pm};
       G^{lr}_{ms} = G^{lr}_{ms} + \psi^l_{pm} u^r_{ps};   G^l_m = G^l_m + \psi^l_{pm}.

Clearly this algorithm has O(N) time and memory requirements. More precisely, the time complexity is about 2.5 times the time complexity of a gradient calculation alone.
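As a concrete illustration of steps 1-4, the following is a minimal sketch of the algorithm specialized to a fully connected network with one hidden layer, a sigmoid activation and a sum-of-squares error; this specialization, the array layout and all names are our own choices, not the paper's. The forward sweep computes u, v and \varphi (Lemma 1), and the backward sweep accumulates both the gradient G and the product Hd, using the one-parameter recursion of Corollary 2.

    import numpy as np

    def sig(x):   return 1.0 / (1.0 + np.exp(-x))
    def dsig(x):  s = sig(x); return s * (1.0 - s)
    def d2sig(x): s = sig(x); return s * (1.0 - s) * (1.0 - 2.0 * s)

    def gradient_and_hessian_vector(params, direction, X, T):
        # Exact G = sum_p dE_p/dw and Hd = sum_p H_p(w) d in one forward
        # and one backward sweep per pattern (O(N) per pattern).
        W1, b1, W2, b2 = params
        D1, c1, D2, c2 = direction
        G  = [np.zeros_like(a) for a in params]
        Hd = [np.zeros_like(a) for a in params]
        for x, t in zip(X, T):
            # forward propagation of u, v and phi (step 2)
            v1 = W1 @ x + b1;  u1 = sig(v1)
            phi1 = D1 @ x + c1                    # phi at the hidden nodes
            Ru1  = dsig(v1) * phi1
            v2 = W2 @ u1 + b2; u2 = sig(v2)
            phi2 = D2 @ u1 + W2 @ Ru1 + c2        # phi at the output nodes
            # output layer (step 3)
            psi2 = dsig(v2) * (u2 - t)                          # dE/dv at outputs
            rho2 = (dsig(v2) ** 2 + d2sig(v2) * (u2 - t)) * phi2
            # backward propagation (step 4), Corollary 2 recursion
            dE_du1 = W2.T @ psi2
            psi1 = dsig(v1) * dE_du1                            # dE/dv at hidden nodes
            rho1 = dsig(v1) * (W2.T @ rho2 + D2.T @ psi2) + d2sig(v1) * phi1 * dE_du1
            # accumulate gradient and Hessian-times-vector coordinates
            G[0]  += np.outer(psi1, x);   G[1]  += psi1
            G[2]  += np.outer(psi2, u1);  G[3]  += psi2
            Hd[0] += np.outer(rho1, x);   Hd[1] += rho1
            Hd[2] += np.outer(rho2, u1) + np.outer(psi2, Ru1)
            Hd[3] += rho2
        return G, Hd

For a small network the result can be checked against a finite-difference approximation of H d or against an explicitly computed Hessian.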

4 Improvement of existing learning techniques

In this section we justify the importance of the exact calculation of the Hessian times a vector by showing some possible improvements of two different learning algorithms.

4.1 The scaled conjugate gradient algorithm

The scaled conjugate gradient algorithm is a variation of a standard conjugate gradient algorithm. The conjugate gradient algorithms produce non-interfering directions of search if the error function is assumed to be quadratic. Minimization in one direction d_t followed by minimization in another direction d_{t+1} implies that the quadratic approximation to the error has been minimized over the whole subspace spanned by d_t and d_{t+1}. The search directions are given by

d_{t+1} = -E'(w_{t+1}) + \beta_t d_t,   (10)

where w_t is a vector containing all weight values at time step t and \beta_t is

\beta_t = \frac{|E'(w_{t+1})|^2 - E'(w_{t+1})^T E'(w_t)}{|E'(w_t)|^2}.   (11)

In the standard conjugate gradient algorithms the step size \alpha_t is found by a line search, which can be very time consuming because it involves several calculations of the error and/or the first derivative. In the scaled conjugate gradient algorithm the step size is estimated by a scaling mechanism, thus avoiding the time-consuming line search. The step size is given by

\alpha_t = \frac{-d_t^T E'(w_t)}{d_t^T s_t + \lambda_t |d_t|^2},   (12)

where s_t is

s_t = E''(w_t) d_t.   (13)

\alpha_t is the step size that minimizes the second order approximation to the error function. \lambda_t is a scaling parameter whose function is similar to the scaling parameter found in Levenberg-Marquardt methods [Fletcher 75]. \lambda_t is in each iteration raised or lowered according to how good the second order approximation is to the real error. The weight update formula is given by

\Delta w_t = \alpha_t d_t.   (14)

Up to now, s_t has been approximated by a one-sided difference equation of the form

s_t = \frac{E'(w_t + \sigma_t d_t) - E'(w_t)}{\sigma_t},   0 < \sigma_t \le 1.   (15)

s_t can now be calculated exactly by applying the algorithm from the last section. We tested the SCG algorithm on several test problems using both exact and approximated calculations of d_t^T s_t. The experiments indicated a minor speedup in favor of the exact calculation. Equation (15) is in many cases a good approximation but can, however, be numerically unstable even when high precision arithmetic is used. If the relative error of E'(w_t) is \varepsilon, then the relative error of equation (15) can be as high as 2\varepsilon/\sigma_t [Ralston 78]. So the relative error grows when \sigma_t is lowered. We refer to [Møller 93a] for a detailed description of SCG. For a stochastic version of SCG especially designed for training on large, redundant training sets, see also [Møller 93b].
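With an exact Hessian-vector product available, the SCG step size of equation (12) can use s_t = E''(w_t) d_t directly instead of the difference approximation (15). A minimal sketch, assuming user-supplied callables grad(w) and hessian_vector(w, d) returning the gradient and the exact Hessian-vector product as numpy arrays; the names are illustrative.

    def scg_step_size(grad, hessian_vector, w, d, lam):
        # Equation (12) with the exact s_t of equation (13) in place of
        # the one-sided difference of equation (15).
        s = hessian_vector(w, d)              # exact E''(w_t) d_t
        denom = d @ s + lam * (d @ d)         # d_t^T s_t + lambda_t |d_t|^2
        return -(d @ grad(w)) / denom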

4.2 Eigenvalue estimation

A recent gradient descent learning algorithm proposed by Le Cun, Simard and Pearlmutter involves the estimation of the eigenvalues of the Hessian matrix. We will give a brief description of the ideas in this algorithm mainly in order to explain the use of the eigenvalues and the technique to estimate them. We refer to [Le Cun 93] for a detailed description of this algorithm.

Assume that the Hessian H(w_t) is invertible. By the spectral theorem from linear algebra, H(w_t) then has N eigenvectors that form an orthogonal basis in R^N [Horn 85]. This implies that the inverse of the Hessian matrix H(w_t)^{-1} can be written in the form

H(w_t)^{-1} = \sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i},   (16)

where \lambda_i is the i'th eigenvalue of H(w_t) and e_i is the corresponding eigenvector. Equation (16) implies that the search directions d_t of the Newton algorithm [Fletcher 75] can be written as

d_t = -H(w_t)^{-1} G(w_t) = -\sum_{i=1}^{N} \frac{e_i e_i^T}{|e_i|^2 \lambda_i} G(w_t) = -\sum_{i=1}^{N} \frac{e_i^T G(w_t)}{|e_i|^2 \lambda_i} e_i,   (17)

where G(w_t) is the gradient vector. So the Newton search direction can be interpreted as a sum of projections of the gradient vector onto the eigenvectors, weighted with the inverse of the eigenvalues. To calculate all eigenvalues and corresponding eigenvectors costs O(N^3) time, which is infeasible for large N. Le Cun et al. argue that only a few of the largest eigenvalues and the corresponding eigenvectors are needed to achieve a considerable speedup in learning. The idea is to reduce the weight change in directions with large curvature, while keeping it large in all other directions. They choose the search direction to be

d_t = -\Big( G(w_t) - \big( 1 - \tfrac{\lambda_{k+1}}{\lambda_1} \big) \sum_{i=1}^{k} \frac{e_i^T G(w_t)}{|e_i|^2} e_i \Big),   (18)

where i now runs over the k largest eigenvalues, from the largest eigenvalue \lambda_1 down to the k'th largest eigenvalue \lambda_k. The eigenvalues of the Hessian matrix are the curvatures in the directions of the corresponding eigenvectors, so equation (18) reduces the components of the gradient along the directions with large curvature; see also [Le Cun 91] for a discussion of this. The learning rate can now be increased by a factor of \lambda_1/\lambda_{k+1}, since the components in directions with large curvature have been reduced by the inverse of this factor.
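A direction of this kind is easy to form once a few leading eigenpairs are available. The sketch below scales the gradient components along the k estimated leading eigendirections so that the remaining effective curvature is of the order of \lambda_{k+1}; the inputs are numpy arrays, the damping factor \lambda_{k+1}/\lambda_1 is one natural choice, and the function is an illustration rather than the exact recipe of [Le Cun 93].

    def reduced_curvature_direction(g, eigvals, eigvecs):
        # g: gradient; eigvals: the k+1 largest eigenvalues in descending
        # order; eigvecs: the k corresponding eigenvectors as columns.
        d = g.copy()
        damp = 1.0 - eigvals[-1] / eigvals[0]     # 1 - lambda_{k+1}/lambda_1
        for i in range(eigvecs.shape[1]):
            e = eigvecs[:, i]
            d -= damp * (e @ g) / (e @ e) * e     # shrink large-curvature component
        return -d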

The largest eigenvalue and the corresponding eigenvector can be estimated by an iterative process known as the Power method [Ralston 78]. The Power method can be used successively to estimate the k largest eigenvalues if the components in the directions of the already estimated eigenvectors are subtracted in the process. Below we show an algorithm for estimation of the i'th eigenvalue and eigenvector. The Power method is here combined with the Rayleigh quotient technique [Ralston 78], which can accelerate the process considerably.

Choose an initial random vector e_i^0 and repeat the following steps for m = 1, ..., M, where M is a small constant:

e_i^m = H(w_t) e_i^{m-1};   e_i^m = e_i^m - \sum_{j=1}^{i-1} \frac{e_j^T e_i^m}{|e_j|^2} e_j;
\lambda_i^m = \frac{(e_i^{m-1})^T e_i^m}{|e_i^{m-1}|^2};   e_i^m = \frac{1}{\lambda_i^m} e_i^m.

\lambda_i^M and e_i^M are respectively the estimated eigenvalue and eigenvector. Theoretically it would be enough to subtract the components in the directions of already estimated eigenvectors once, but in practice round-off errors will generally reintroduce these components.

Le Cun et al. approximate the term H(w_t) e_i^m with a one-sided difference as shown in equation (15). This term can now be calculated exactly by use of the algorithm described in the preceding sections.

5 Conclusion

This paper has presented an algorithm for the exact calculation of the product of the Hessian matrix of error functions and a vector. The product is calculated without ever explicitly calculating the Hessian matrix itself.

The algorithm has O(N) time and memory requirements, where N is the number of variables in the network.

The relevance of this algorithm has been demonstrated by showing possible improvements in two different learning techniques: the scaled conjugate gradient learning algorithm and an algorithm recently proposed by Le Cun, Simard and Pearlmutter.

Acknowledgements

It has recently come to the author's knowledge that the same algorithm has been derived independently, and at approximately the same time, by Barak Pearlmutter, Department of Computer Science and Engineering, Oregon Graduate Institute [Pearlmutter 93]. Thanks to Barak for his kind and immediate recognition of the independence of our work.

I would also like to thank Wray Buntine, Scott Fahlman, Brian Mayoh and Ole Østerby for helpful advice. This research was supported by a grant from the Royal Danish Research Council. Facilities for this research were provided by the National Science Foundation (U.S.) under grant IRI-9214873. All opinions, findings, conclusions and recommendations in this paper are those of the author and do not necessarily reflect the views of the Royal Danish Research Council or the National Science Foundation.

References

[Bishop 92] C. Bishop (1992), Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation, Vol. 2, pp. 494-501.

[Buntine 91] W. Buntine and A. Weigend (1991), Calculating Second Derivatives on Feed-Forward Networks, submitted to IEEE Transactions on Neural Networks.

[Le Cun 91] Y. Le Cun, I. Kanter and S. Solla (1991), Eigenvalues of Covariance Matrices: Application to Neural Network Learning, Physical Review Letters, Vol. 66, pp. 2396-2399.

[Le Cun 93] Y. Le Cun, P.Y. Simard and B. Pearlmutter (1993), Local Computation of the Second Derivative Information in a Multilayer Network, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann, in print.

[Dixon 89] L.C.W. Dixon and R.C. Price (1989), Truncated Newton Method for Sparse Unconstrained Optimization Using Automatic Differentiation, Journal of Optimization Theory and Applications, Vol. 60, No. 2, pp. 261-275.

[Fletcher 75] R. Fletcher (1975), Practical Methods of Optimization, Vol. 1, John Wiley & Sons.

[Hassibi 92] B. Hassibi and D.G. Stork (1992), Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, in Proceedings of Neural Information Processing Systems, Morgan Kaufmann.


[Horn 85] R.A. Horn and C.R. Johnson (1985), Matrix Analysis, Cambridge University Press, Cambridge.

[MacKay 91] D.J.C. MacKay (1991), A Practical Bayesian Framework for Back-Prop Networks, Neural Computation, Vol. 4, No. 3, pp. 448-472.

[Møller 93a] M. Møller (1993), A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, Neural Networks, in press.

[Møller 93b] M. Møller (1993), Supervised Learning on Large Redundant Training Sets, International Journal of Neural Systems, in press.

[Pearlmutter 93] B.A. Pearlmutter (1993), Fast Exact Multiplication by the Hessian, preprint, submitted.

[Ralston 78] A. Ralston and P. Rabinowitz (1978), A First Course in Numerical Analysis, McGraw-Hill Book Company, Inc.

[Yoshida 91] T. Yoshida (1991), A Learning Algorithm for Multilayered Neural Networks: A Newton Method Using Automatic Differentiation, in Proceedings of the International Joint Conference on Neural Networks, Seattle, poster.
