

3.2 Event Detection

3.2.1 Novelty Detection

As described in the previous chapter, novelty detection has been divided into two different categories: statistical and neural network approaches. Some statistical approaches are considered in this project, as well as one neural network approach.

3.2.1.1 Statistical

One of the most commonly used models to describe data statistically is the Normal or Gaussian distribution [Das10]. The univariate version of the Normal density function is shown in equation 3.19, where µ is the mean and σ is the standard deviation.

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \qquad (3.19)

To fit this model to data, only the data's µ and σ are needed. Moreover, the Kolmogorov-Smirnov test of goodness of fit [MJ51] can be used to verify whether the data truly follows a normal distribution. The Kolmogorov-Smirnov test is based on a distance between a hypothetical cumulative density function and the cumulative density function of the data. If this distance exceeds a certain level of significance, then there is evidence that the data does not belong to the hypothesized distribution.
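As an illustration, a minimal sketch of this fitting and testing procedure is given below; it assumes SciPy is available and uses synthetic data as a stand-in for the training set.

```python
# Illustrative sketch: fit a univariate Normal model and check goodness of fit
# with the Kolmogorov-Smirnov test (SciPy assumed available).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)    # stand-in for training data

mu, sigma = data.mean(), data.std(ddof=1)          # the only parameters needed to fit the model
statistic, p_value = stats.kstest(data, "norm", args=(mu, sigma))

# A small p-value (e.g. below 0.05) is evidence against the Normal hypothesis.
print(f"mu={mu:.3f}, sigma={sigma:.3f}, KS statistic={statistic:.3f}, p={p_value:.3f}")
```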

However, a multivariate normal distribution is more often needed to model the joint distribution of more than one variable. The n-variate normal distribution density function [HS11] is shown in equation 3.20, where x is an n-dimensional


variable and µ is an n-dimensional vector with the mean values for each dimension of x. Σ is an n by n covariance matrix of the form

\Sigma =
\begin{pmatrix}
\sigma_1^2 & \rho_{1,2}\,\sigma_1\sigma_2 & \cdots & \rho_{1,n}\,\sigma_1\sigma_n \\
\rho_{1,2}\,\sigma_1\sigma_2 & \sigma_2^2 & \cdots & \rho_{2,n}\,\sigma_2\sigma_n \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{1,n}\,\sigma_1\sigma_n & \rho_{2,n}\,\sigma_2\sigma_n & \cdots & \sigma_n^2
\end{pmatrix}

where σ_n is the standard deviation of the n-th dimension of variable x and ρ_{n−1,n} is the correlation between the data in the (n−1)-th dimension and the n-th dimension of variable x.

f(x) = (2\pi)^{-\frac{n}{2}}\, |\Sigma|^{-\frac{1}{2}}\, e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)} \qquad (3.20)

However, some distributions of data need a more flexible model when data densities are not necessarily concentrated together. Thus, the usage of GMMs is a common practice. The density function of a GMM is shown in equation 3.21 [WZ08], where P(j) are the mixing coefficients and p(x|j) are the Gaussian density functions. Note that, as \int p(x|j)\,dx = 1, \sum_j P(j) = 1. Figure 3.8 shows an example of a mixture of 3 Gaussians.

f(x) = \sum_{j=1}^{M} P(j)\, p(x|j) \qquad (3.21)
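To make equation 3.21 concrete, the following illustrative sketch (with hypothetical parameters, SciPy assumed available) evaluates a two-component GMM density as the weighted sum of multivariate Gaussian components.

```python
# Illustrative sketch: evaluate the GMM density of equation 3.21 as a weighted
# sum of Gaussian components (the parameters below are hypothetical).
import numpy as np
from scipy.stats import multivariate_normal

weights = [0.3, 0.7]                                    # mixing coefficients P(j), summing to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # component means
covs = [np.eye(2), np.diag([0.5, 1.5])]                 # component covariance matrices

def gmm_density(x, weights, means, covs):
    """f(x) = sum_j P(j) p(x|j), with Gaussian component densities p(x|j)."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

print(gmm_density(np.array([1.0, 1.0]), weights, means, covs))
```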

The Expectation Maximization algorithm is commonly used to fit any of these multivariate distributions [WZ08]. Note that the number of parameters to be estimated by the Expectation Maximization algorithm can be reduced if the covariance matrices Σ are restricted to a diagonal form. Moreover, Akaike's Information Criterion (AIC) [BA02] provides a means to evaluate how accurately the data is represented by the model. It is defined as shown in equation 3.22, where θ̂ are the parameters of the density function, y is the training data, L is the likelihood function and K is the number of parameters in the density function. As AIC gives a measure of the relative distance between a fitted model and the unknown true mechanism that generated the training data [BA02], the model with the minimum AIC, among a set of models, is said to better describe the data.

Figure 3.8: Three Gaussians (red). GMM of the same Gaussians with mixing coefficients equal to 0.1, 0.4 and 0.5, respectively (blue).

AIC = -2\log\!\left(L(\hat{\theta}\,|\,y)\right) + 2K \qquad (3.22)
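A hedged sketch of this model-selection step is shown below; it assumes scikit-learn's GaussianMixture (which fits by Expectation Maximization and exposes an AIC method) and uses synthetic one-dimensional data.

```python
# Illustrative sketch: fit GMMs with EM for several component counts and keep
# the model with the minimum AIC (scikit-learn assumed available).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
train = np.vstack([rng.normal(2.0, 0.5, (200, 1)),      # synthetic stand-in for
                   rng.normal(4.0, 0.4, (300, 1))])     # the training data

# Diagonal covariances reduce the number of parameters to estimate, as noted above.
candidates = {m: GaussianMixture(n_components=m, covariance_type="diag").fit(train)
              for m in (1, 2, 3, 4)}
best_m = min(candidates, key=lambda m: candidates[m].aic(train))
print("number of components with minimum AIC:", best_m)
```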

A different approach can be used to generate a model with very few parameters, namely Parzen-window density estimation or kernel density estimation (KDE) [Sil86]. If f(x) is the density function to estimate, and x_1, …, x_n is a set of n independent and identically distributed random variables, the Parzen-window density estimate is given by

\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) \qquad (3.23)

where K is a kernel function that satisfies \int_{-\infty}^{\infty} K(x)\,dx = 1 and h is the width of the kernel. The kernel function is typically Gaussian [YC02][Sil86] as it provides a means to find the optimal width of the kernel [Sil86]. For univariate density estimation using Gaussian kernel functions the optimal width h_{opt} is

h_{opt} = \left(\frac{4}{3n}\right)^{\frac{1}{5}} \sigma \qquad (3.24)

where n and σ are the number of samples and the standard deviation of the data, respectively. A way of finding the optimal width h in multivariate densities is covered in [Sil86].
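The following sketch illustrates equations 3.23 and 3.24 for the univariate case, using a Gaussian kernel and the optimal width h_opt; the data are synthetic and only NumPy is assumed.

```python
# Illustrative sketch: univariate Parzen-window (KDE) estimate with a Gaussian
# kernel and the optimal width of equation 3.24.
import numpy as np

def kde_gaussian(x, samples):
    """Parzen-window estimate f_hat(x) at the points in x, given training samples."""
    n = samples.size
    sigma = samples.std(ddof=1)
    h = (4.0 / (3.0 * n)) ** (1.0 / 5.0) * sigma             # h_opt from equation 3.24
    u = (x - samples[:, None]) / h                           # shape (n, len(x))
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel K
    return kernels.mean(axis=0) / h                          # (1/(n h)) * sum_i K((x - x_i)/h)

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, 400)
print(kde_gaussian(np.array([-1.0, 0.0, 1.0]), samples))
```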

Once the model of normal data is obtained, event detection can be performed by choosing an adequate threshold k. Every incoming test data point x_t is evaluated in the density function of the model. If f(x_t) ≥ k then the data point is classified as normal. If f(x_t) < k then the data point is classified as 'event'.
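The decision rule itself reduces to a single comparison; a minimal sketch, with a hypothetical fitted density f and threshold k, is:

```python
# Illustrative sketch of the detection rule: f is any of the fitted density
# functions above, x_t an incoming test point and k the chosen threshold.
def classify_point(f, x_t, k):
    """Return 'normal' if f(x_t) >= k, otherwise 'event'."""
    return "normal" if f(x_t) >= k else "event"
```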


Figure 3.9: GMM of two Gaussians in ℝ¹ with µ = 2, 4 and σ = 0.5, 0.4, respectively. Mixing coefficients equal to 0.4 and 0.6, respectively. k = 0.002.

For the example in figure 3.9, all x_t with f(x_t) ≥ 0.002 are classified as normal and those with f(x_t) < 0.002 are detected as events.

3.2.1.2 Neural Networks

The only neural network studied in this project is the so-called auto-encoder [JMG95]. This neural network tries to replicate the input signal in its output, but this can only be achieved if the input is similar to the data used to train the network. Hence, the network has to be trained with data related to the normal behaviour of the system. When the network is tested with a transient event, it should not be able to replicate this input, as it has not been trained with it, indicating the presence of a transient event. The auto-encoder neural network is composed of a certain number of input neurons; this quantity depends on the dimensions of the input variable. For example, if the input variable was taken from a 5-band filtered signal, 5 input neurons could be used. The number of output neurons is the same as the number of input neurons, and there is a hidden layer with fewer neurons than the input or output layers. According to [JMG95], the reduction of neurons in the hidden layer forces the network to compress any redundancies in the data, retaining the non-redundant information. Thus, the auto-encoder performs dimensionality reduction [Bel06]. The topology of an auto-encoder is shown in figure 3.10.
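As an illustrative sketch (not the implementation used in this project), a 5-3-5 auto-encoder can be trained to reproduce its input with scikit-learn's MLPRegressor, assumed available here as a stand-in network:

```python
# Illustrative sketch: a 5-3-5 auto-encoder trained on normal data only, using
# scikit-learn's MLPRegressor as a stand-in network (target equals input).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X_normal = rng.normal(0.0, 1.0, (1000, 5))            # stand-in for 5-band normal data

autoencoder = MLPRegressor(hidden_layer_sizes=(3,),   # bottleneck forces compression
                           activation="logistic",
                           max_iter=2000)
autoencoder.fit(X_normal, X_normal)                   # replicate the input at the output
```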

Figure 3.10: Auto-encoder with 5 input neurons, 5 output neurons and 3 neurons in the hidden layer.

As already mentioned, the main purpose of the auto-encoder is to replicate the input in its output. Thus, the quality of the reconstruction is evaluated as shown by equation 3.25, where I_i is the input of the i-th input neuron and O_i is the output of the i-th output neuron.

\mathrm{Error} = \sum_{i} |I_i - O_i| \qquad (3.25)

After training using the back-propagation algorithm [Bel06], the detection criterion is based on this error. Finally, if Error ≤ k, no event is detected; if Error > k, an event is detected.
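A minimal sketch of this criterion, applied to a hypothetical input vector and its reconstruction, could look as follows:

```python
# Illustrative sketch: L1 reconstruction error of equation 3.25 and the
# threshold-based detection rule (input/output values are hypothetical).
import numpy as np

def reconstruction_error(inputs, outputs):
    """Error = sum_i |I_i - O_i| (equation 3.25)."""
    return np.abs(np.asarray(inputs) - np.asarray(outputs)).sum()

def detect(inputs, outputs, k):
    """No event if Error <= k, event otherwise."""
    return "event" if reconstruction_error(inputs, outputs) > k else "normal"

print(detect([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 4.0, 4.7], k=1.0))   # -> 'normal'
```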