
2.2 Artificial Neural Networks

2.2.1 Feed-Forward Neural Network

The Feed-Forward Neural Network (FFNN), also referred to as a Multilayer Perceptron, is a directed acyclic graph (DAG), where nodes are the units and edges are the connections between units. Fig. 2.10 shows an example of a FFNN consisting of three layers. A FFNN has an input layer (bottom), an output layer (top) and a variable number of hidden layers (middle). All units in a lower layer are connected to all units in the next layer. There are no connections between units in the same layer. The data is processed through the input layer and then the overlying hidden layers, before finally being emitted by the output layer. This procedure is referred to as a forward-pass.

Figure 2.10: Example of a 3-layered FFNN. It forms a directed acyclic graph, where data is emitted from the bottom units towards the top units. $W_1$ and $W_2$ are the weights corresponding to each weight layer. $\hat{w}^{(1)}$ and $\hat{w}^{(2)}$ are the biases for the input and hidden layer.

Each layer consists of a variable number of units. The size of the input unit vector $\hat{x} = [x_1, \ldots, x_D]$ is defined by the dimensionality $D$ of the input data vectors. So if a $28 \times 28$ image is the input data, the number of input units is $D = 784$.

The number of hidden layer units $\hat{z} = [z_1, \ldots, z_M]$ and the number of output units $\hat{y} = [y_1, \ldots, y_K]$ can be defined as training parameters.

The FFNN is able to solve much more complicated tasks than the Perceptron model. With its multilayered architecture and more complex transfer-functions it can capture complex non-linear patterns in the dataset. The ability to hold multiple units in the final layer also allows it to act as a multi-class classifier.

Fig. 2.11 shows an example of a linearly (left) and a non-linearly (right) separable classification problem.

Figure 2.11: Left: Classification problem that can be solved by the Perceptron. Right: Classification problem that can be solved by a FFNN.

The processing of data that is conducted by the units of a layer can be described through mathematical functions. The units of a layer compute $M$ linear combinations, referred to as activities $\text{act}$. The activities are calculated for each unit $j \in \{1, \ldots, M\}$ in a hidden layer for the input vector $\hat{x} = [x_1, \ldots, x_D]$

$\text{act}_j = \hat{w}_j + \sum_{i=1}^{D} W_{ij}\, x_i$

$W_{ij}$ is the weight between visible unit $i$ and hidden unit $j$, and $\hat{w}_j$ is the bias attached to hidden unit $j$. After the activities have been computed, they are passed through a non-linear differentiable function $h$, the transfer-function

$z_j = h(\text{act}_j).$   (2.6)
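To make Eq. (2.6) concrete, the following is a minimal NumPy sketch of a single hidden layer; it is an illustration, not code from the thesis. The names hidden_layer_output, W and w_hat, the layer sizes and the choice of np.tanh as transfer function are assumptions, with W[i, j] being the weight between input unit i and hidden unit j.

import numpy as np

def hidden_layer_output(x, W, w_hat, h):
    # act_j = w_hat_j + sum_i W[i, j] * x_i  -- the activities of the hidden layer
    act = w_hat + x @ W
    # z_j = h(act_j): apply the transfer function element-wise
    return h(act)

# Example usage (all sizes and values are illustrative):
D, M = 784, 8                              # e.g. a flattened 28x28 image, 8 hidden units
rng = np.random.default_rng(0)
x = rng.random(D)                          # one input vector
W = rng.normal(scale=0.01, size=(D, M))    # weights between input and hidden layer
w_hat = np.zeros(M)                        # hidden biases
z = hidden_layer_output(x, W, w_hat, np.tanh)
print(z)                                   # M hidden unit outputs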

This way the output of a unit varies continuously but not linearly. There exist many different types of units. What defines the non-linearity of the unit is the transfer-function. The most commonly known functions are the step function (cf. Eq. (A.13)), the logistic sigmoid function and the tangent hyperbolic function

$\sigma(\text{act}) = \frac{1}{1 + e^{-\text{act}}} = (1 + e^{-\text{act}})^{-1}$   (2.7)

$\tanh(\text{act}) = \frac{\exp(\text{act}) - \exp(-\text{act})}{\exp(\text{act}) + \exp(-\text{act})}$   (2.8)
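For reference, the three transfer functions can be written directly in NumPy. This is an illustrative sketch only; in particular, thresholding the step function at zero is an assumption, since Eq. (A.13) is not reproduced here.

import numpy as np

def step(act):
    # Binary threshold (assumed threshold at zero): 1 if the unit is on, 0 if off
    return (act >= 0).astype(float)

def sigmoid(act):
    # Logistic sigmoid, Eq. (2.7): output in [0, 1]
    return 1.0 / (1.0 + np.exp(-act))

def tanh(act):
    # Tangent hyperbolic, Eq. (2.8): output in [-1, 1]
    return (np.exp(act) - np.exp(-act)) / (np.exp(act) + np.exp(-act))

act = np.linspace(-5.0, 5.0, 11)
print(step(act))
print(sigmoid(act))
print(tanh(act))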

For the binary threshold neuron, the step function is used, as it expresses whether the unit is on or off. For training frameworks where an optimization algorithm is applied (cf. Sec. 2.2.3), it is necessary to use a differentiable transfer-function.

Both the logistic sigmoid function and the tangent hyperbolic function are continuously differentiable. The main difference between the two functions is that the logistic sigmoid function outputs in the range [0, 1] and the tangent hyperbolic function outputs in the range [−1, 1]. Fig. 2.12 shows the plots of the three functions.

Figure 2.12: Left: Logistic Sigmoid Function. Middle: Tangent Hyperbolic function. Right: Step function.

In this thesis we will only focus on the logistic sigmoid function. We refer to units using the logistic sigmoid function as sigmoid units. The 1st order derivative of the logistic sigmoid function is

$\frac{\partial \sigma(\text{act})}{\partial \text{act}} = \sigma(\text{act})\,(1 - \sigma(\text{act}))$

We will use this equation for training purposes (cf. Sec. 2.2.3).
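The identity above is easy to sanity-check numerically. The following sketch (illustrative, not part of the thesis code) compares the closed-form derivative with a central finite-difference approximation.

import numpy as np

def sigmoid(act):
    return 1.0 / (1.0 + np.exp(-act))

def sigmoid_derivative(act):
    # Closed-form 1st order derivative: sigma(act) * (1 - sigma(act))
    s = sigmoid(act)
    return s * (1.0 - s)

act = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
numeric = (sigmoid(act + eps) - sigmoid(act - eps)) / (2 * eps)   # central differences
print(np.max(np.abs(numeric - sigmoid_derivative(act))))          # close to 0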

The stochastic binary unit has the same transfer function as the sigmoid unit, but it will compute a binary output. The binary output is decided by comparing the output of the logistic sigmoid function, which is always in the interval [0, 1], with a random number in the same interval. If the output of the logistic sigmoid function is higher than the randomly generated number, the binary output of the unit will evaluate to 1 and vice versa (cf. Algo. 1). The stochastic process adds randomness to the network when deciding the values that should be emitted from a unit.


Data: The input data from units $x_1, \ldots, x_D$.
Result: An output variable $out$ of value 0 or 1.

1: $\text{act}_j = \hat{w}_j + \sum_{i=1}^{D} W_{ij}\, x_i$
2: $z_j = \sigma(\text{act}_j)$
3: draw a random number $r$ from the interval [0, 1]
4: if $z_j > r$ then $out = 1$ else $out = 0$

Algorithm 1: The pseudo-code for the stochastic binary unit.
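A minimal NumPy sketch of Algorithm 1 could look as follows; the function name, the weight vector W_j, its size and the random number generator are illustrative assumptions.

import numpy as np

def stochastic_binary_unit(x, W_j, w_hat_j, rng):
    # Algorithm 1: emit 1 with probability sigma(act_j), otherwise 0
    act_j = w_hat_j + x @ W_j               # activity of hidden unit j
    z_j = 1.0 / (1.0 + np.exp(-act_j))      # logistic sigmoid output in [0, 1]
    r = rng.random()                        # random number in [0, 1]
    return 1 if z_j > r else 0

rng = np.random.default_rng(0)
x = rng.random(784)                         # one input vector (illustrative)
W_j = rng.normal(scale=0.01, size=784)      # weights into hidden unit j (illustrative)
print(stochastic_binary_unit(x, W_j, 0.0, rng))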

In order to process data where more classes are represented, the softmax unit with its softmax activation function is applied

$y_k = \frac{e^{\text{act}_k}}{\sum_{k'=1}^{K} e^{\text{act}_{k'}}}$

The softmax unit is typically used as output units of FFNNs. An output unit is reliant on the remaining units in the layer, which means that the outputs of all units are forced to sum to 1. This way the output units in the softmax layer represent a multinomial distribution across discrete mutually exclusive alternatives. When the softmax output layer concerns a classification problem, the alternatives refer to classes.
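A small NumPy sketch makes the sum-to-one property explicit; the activity vector is an assumed example, and subtracting the maximum activity before exponentiation is a common numerical-stability trick rather than part of the definition.

import numpy as np

def softmax(act):
    # Map output activities to a multinomial distribution over K alternatives
    e = np.exp(act - np.max(act))   # subtract the max for numerical stability
    return e / np.sum(e)

act = np.array([1.0, 2.0, 0.5])     # assumed activities for K = 3 output units
y = softmax(act)
print(y, y.sum())                   # the K outputs sum to 1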

The softmax transfer function is closely related to the logistic sigmoid function.

If we consider one output unit and set the activity of a second output unit of the network to 0, we can use the exponential rules to derive the logistic sigmoid transfer function

$y_1 = \frac{e^{\text{act}_1}}{e^{\text{act}_1} + e^{0}} = \frac{1}{1 + e^{-\text{act}_1}} = \sigma(\text{act}_1)$

Like the logistic sigmoid function, the softmax function is continuously differentiable [1]

$\frac{\partial y_k}{\partial \text{act}_k} = y_k (1 - y_k)$   (2.17)
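The relation between a two-unit softmax and the logistic sigmoid can also be checked numerically; the sketch below is purely illustrative and the chosen activity value is arbitrary.

import numpy as np

def softmax(act):
    e = np.exp(act - np.max(act))
    return e / np.sum(e)

def sigmoid(act):
    return 1.0 / (1.0 + np.exp(-act))

act_1 = 1.7                                   # assumed activity of the first output unit
y = softmax(np.array([act_1, 0.0]))           # activity of the second output unit fixed to 0
print(y[0], sigmoid(act_1))                   # the two values coincide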

As mentioned earlier, a forward-pass describes the processing of an input vector through all transfer functions to the output values $y_1, \ldots, y_K$. A forward-pass can be described as a single non-linear function for any given size of neural network.

If we consider a FFNN with one hidden layer and sigmoid transfer functions, the output can be described as

$y_k(\hat{x}, w) = \sigma\!\left( \hat{w}_k^{(2)} + \sum_{j=1}^{M} w_{kj}^{(2)}\, \sigma\!\left( \hat{w}_j^{(1)} + \sum_{i=1}^{D} w_{ji}^{(1)} x_i \right) \right)$

where (1) and (2) refer to the layers in the network. The inner equation results in the output vector $\hat{z} = [z_1, \ldots, z_M]$ of the hidden layer, which is given as input to the second layer. The outer equation computes the output vector $\hat{y} = [y_1, \ldots, y_K]$ of the next layer, which is the output of the network. $w$ is a matrix containing all weights and biases for simplicity [1]. The equation can be simplified by augmenting the bias vector into the weight matrix. By adding an additional input variable $x_0$ with value $x_0 = 1$, weights and bias can now be considered as one parameter

$y_k(\hat{x}, w) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, \sigma\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)$

Now we have defined the FFNN as a class of parametric non-linear functions from an input vector $\hat{x}$ to an output vector $\hat{y}$.
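A compact NumPy sketch of such a forward-pass, using the augmented convention where $x_0 = 1$ so that the first row of each weight matrix holds the biases, could look as follows. The layer sizes and the random initialisation are illustrative assumptions, not the thesis implementation.

import numpy as np

def sigmoid(act):
    return 1.0 / (1.0 + np.exp(-act))

def forward_pass(x, W1, W2):
    # One-hidden-layer FFNN with sigmoid transfer functions and augmented weights:
    # the first row of W1 and W2 holds the biases, absorbed by prepending a constant 1.
    x_aug = np.concatenate(([1.0], x))   # add x_0 = 1
    z = sigmoid(x_aug @ W1)              # hidden layer output z_1, ..., z_M
    z_aug = np.concatenate(([1.0], z))   # add z_0 = 1 for the second weight layer
    return sigmoid(z_aug @ W2)           # network output y_1, ..., y_K

D, M, K = 784, 8, 10                          # illustrative layer sizes
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(D + 1, M))  # augmented weight matrix of layer 1
W2 = rng.normal(scale=0.01, size=(M + 1, K))  # augmented weight matrix of layer 2
y = forward_pass(rng.random(D), W1, W2)
print(y)                                      # K output values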
