
Figure 9.5. A section of true (blue) and estimated (red, dashed) probabilities of class membership, P(C|x), after convergence, using artificial data. With one exception, the estimates are quite accurate.

9.4 Pruning

As mentioned in chapter 4, generalization performance can be improved by pruning. An additional, practical, reason for pruning is that a smaller network is also a faster one. Compared with another classical technique with the same purpose, known as weight decay, where all the weights are kept, this is an advantage for a system in which computational speed is an important parameter, as is the case with the hearing-aid target.

Further, the interpretation of the relative importance of the different cross-correlations between frequencies is of interest in itself.

It is quite conceivable that with an input vector size of only 36, no pruning can be done without losing generalization performance. The network may already be so small that any further reduction can only lead to reduced performance.

Pruning has been used in other cases to reduce the number of parameters in systems with an initial size of several thousand parameters, as reported in [3].

The present case must therefore be considered pruning on a very small scale.

Still, even at this 'small' scale, exhaustive search of all possible network structures is not feasible. The number of possible architectures (non-empty subsets of the inputs) can be expressed as

\[
\sum_{k=1}^{F} \binom{F}{k} = 2^F - 1
\]

where F is the number of features.

With 9 features this gives 511 possible systems, and with 36 features it is more than $6 \cdot 10^{10}$.
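As a quick check of these counts (plain Matlab, illustrative only and not one of the project scripts):

    % Number of possible architectures (non-empty subsets of F input features)
    F = [9 36];
    n_arch = 2.^F - 1     % yields 511 and roughly 6.87e10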

Pruning is one solution to the problem of searching this architecture space.

The idea is to start with the most complex network, using all features as inputs, and train this network until convergence. Then one weight is cut at a time, according to its relative importance, and retraining is done after each cut.

This is repeated until, say, all the weights have been cut (possibly except for the bias weight). In this way, only F networks will be created. It is obvious that the method by which the weights to be cut are chosen is critical for such an enormous reduction of the search to make sense and still find 'good' networks (i.e. networks close in performance to the best network that could be found by exhaustive search).
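To make the loop structure concrete, here is a minimal sketch in Matlab. It is illustrative only, not the project's implementation; train_to_convergence and weight_saliency are hypothetical placeholders, the latter standing for the saliency measure derived below.

    % Sketch of saliency-based pruning: cut one weight per iteration, retrain after each cut.
    % X, t: training inputs and targets; F: number of weights/features.
    % 'train_to_convergence' and 'weight_saliency' are hypothetical placeholders.
    w      = train_to_convergence(X, t, true(1, F));  % full network, all F inputs active
    active = true(1, F);                              % mask of weights not yet cut
    nets   = {};                                      % the sequence of candidate networks
    while any(active)
        s          = weight_saliency(w, X, t);        % importance of each remaining weight
        s(~active) = Inf;                             % never reconsider weights already cut
        [~, i]     = min(s);                          % cut the least important weight
        active(i)  = false;
        w(i)       = 0;
        w          = train_to_convergence(X, t, active, w);  % retrain the reduced network
        nets{end+1} = struct('mask', active, 'weights', w);
    end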

It should be mentioned that pruning has one advantage over another typical method of regularization², namely weight decay³: the network size is reduced, leading to a smaller and faster system.

How much re-training to apply after each cut is one parameter of this process; it is typically determined heuristically, and reasonable settings can be found with some experimentation. The most important choice, however, is the way in which the importance of the weights is estimated, as this primarily determines which architectures will be created and thus how the 'architecture space' is searched.

The measure of importance for a weight is termed 'saliency'. A natural measure for this is the change in the error function resulting from a small change in the weight. Changing the weights $w_i$ by $\delta w_i$, the change in error can be written as a Taylor expansion:

\[
\delta E = \sum_i \frac{\partial E}{\partial w_i}\,\delta w_i + \frac{1}{2}\sum_{i,j} H_{ij}\,\delta w_i\,\delta w_j + \mathcal{O}(\delta w^3) \qquad (9.15)
\]

where $H$ is the Hessian matrix with elements

\[
H_{ij} = \frac{\partial^2 E}{\partial w_i \,\partial w_j}
\]

([3], (9.66)).

Assuming that training has converged, the first term in 9.15 vanishes. Further simplification can be had by assuming that $H$ is diagonal:

\[
\delta E = \frac{1}{2}\sum_i H_{ii}\,\delta w_i^2 \qquad (9.16)
\]

With this approximation, the saliency of each weight $w_i$ is $s_i = H_{ii}\,w_i^2/2$.

² A term used to denote methods that limit the complexity of networks.

³ Weight decay adds a penalty to the error function for large weights, thereby also restraining complexity.


Now setting the i’th weight to zero will approximately change the error by the saliency.
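As an illustration of this selection rule (a minimal sketch, not the project's implementation; w and Hdiag are assumed to hold the current weights and the diagonal Hessian elements $H_{ii}$, however they are obtained):

    % Saliencies under the diagonal-Hessian approximation: s_i = H_ii * w_i^2 / 2
    s = 0.5 * Hdiag .* w.^2;
    [s_min, i] = min(s);   % weight whose removal is expected to increase the error least
    fprintf('cut weight %d, expected error increase about %g\n', i, s_min);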

This leads to a pruning algorithm called 'Optimal Brain Damage' (OBD); see [3], page 361. The version used in this project cuts one weight at a time. It is implemented by the Matlab scripts prune.m (saliency) and melfb36_prune.m (keeping track of what is cut and what is not).

If the full Hessian is used instead of the diagonal approximation, the resulting algorithm is called 'Optimal Brain Surgeon'. On paper it should perform better, but it suffers from matrix-inversion and other numerical problems, so that OBD often performs better in practice.

It is important to train to completion (convergence) initially, and re-training should be of a similar quality, even though typically fewer re-training steps are needed.

9.4.1 The Hessian matrix of a linear network

The Hessian is defined as

\[
H_{ij} = \frac{\partial^2 E}{\partial w_i \,\partial w_j} \qquad (9.17)
\]

i.e. it contains the second-order derivatives of the error function with respect to the weights, which are the learning parameters of the network.

The Hessian matrix is needed for the Optimal Brain Damage algorithm, and since no derivation of the Hessian matrix for a single-layer (linear) network is given in [3], it is briefly given here.

First off, it may be noted that even though the network is linear, the Hessian matrix is not diagonal. This is due to the non-linear (here: logistic) function that is used. However, the Optimal Brain Damage algorithm is based on the assumption that the Hessian is diagonal, so this assumption is carried through here.

The derivative of the error with respect to the weights has already been found in equation 9.8. Differentiating again yields

\[
H_{ij} = \frac{\partial\,(y - t)\,x_i}{\partial w_j} .
\]
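Assuming a logistic output $y = g(\mathbf{w}^{T}\mathbf{x})$ and the cross-entropy error, for which equation 9.8 takes the per-pattern form $\partial E/\partial w_i = (y - t)\,x_i$ (stated here as an assumption, since equation 9.8 is not reproduced above), the remaining differentiation can be sketched as

\[
\frac{\partial\,(y - t)\,x_i}{\partial w_j}
  = x_i\,\frac{\partial y}{\partial w_j}
  = y\,(1 - y)\,x_i\,x_j ,
\]

using $g'(a) = g(a)\,(1 - g(a))$ for the logistic function. Summing over the training patterns and keeping only the diagonal elements, $H_{ii} = \sum_n y_n (1 - y_n)\, x_{n,i}^2$, gives the quantities needed for the saliencies in equation 9.16.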
