Meta-parameter selection for Support Vector Machines in wind energy forecasting models

(1)

Susana Rojas

Meta-parameter selection for

Support Vector Machines in wind energy forecasting models

Master’s Thesis

(2)

(3)

Susana Rojas

Meta-parameter selection for

Support Vector Machines in wind energy forecasting models

Master’s Thesis

(4)

This report was prepared by Susana Rojas

Supervisors Jose Dorronsoro Jan Larsen

Release date: 31st may of 2012 Category: 1 (public) Edition: First

Comments: This report is part of the requirements to achieve the Master of Science in Engineering (M.Sc.Eng.) at the Technical University of Denmark. This report represents 30 ECTS points.

Rights: cSusana Rojas, 2012 Department of Informatics

Mathematical Modelling and Computation Technical University of Denmark

Building 321

DK-2800 Kgs. Lyngby Denmark

Computer Science Department Universidad Autonoma de Madrid Cantoblanco, Madrid

Spain

(5)

Preface

This thesis was prepared at Instituto Ingenier´ıa del Conocimiento (IIC) at the Universidad Aut´onoma de Madrid, in collaboration with the department of In- formatics and Mathematical Modelling at the Technical University of Denmark in fulfilment of the requirements for acquiring a M.Sc. in Informatics.

The thesis deals with the obtaining of different meta-parameters model selection to achieve an optimal performance for SVMs, with a direct application in wind energy forecasting models.

Madrid, 31st May 2012

Susana Rojas

(6)

(7)

Acknowledgement

This thesis is the result of five months working, learning and studying about the topic addressed.

I would like to acknowledge every support received during this period.

First of all, I would like to thank my home university, UAM, to give me the chance to learn, study and work at IIC (Instituto Ingenieria del Conocimiento), learning from all the people who I have been working with. Moreover, thanks to DTU, to let me do my master thesis abroad.

Special mention must be made to my supervisor, Jos´e Ram´on Dorronsoro, whose advises have been incredibly worthy. Also thanks to Jan Larsen to his worried about me and concerned about my work.

Finally I am really thankful to all my friends and family for supporting and en- couraging me, especially to my sisters Cristina and Elena for their unconditional and valuable help during the writing of this thesis.

(8)

(9)

Summary

Support Vector Machines (SVMs) is a machine learning technique that allows the combination of the simplicity and uniqueness of linear models with the possibility of a highly nonlinear, kernel based, preprocessing into a possibly infinite dimensional extended feature space. This results in powerful models that can be applied to classification and regression problems. While quite simple, SVMs require the selection of two structural parameters, the penalty term that is applied to margin slack values and, in the case of Support Vector Regression (SVR), the tolerance threshold under which errors are not penalized. If, as usually done, Gaussian kernels are used, a third parameter, the kernel width has also to be selected.

The choices made may greatly affect the performance of SVMs and their correct selection is an important task when applying them to concrete problems in machine learning. In this Master Thesis methods for this choice will be considered, coming either from general, model independent, approaches to parameter selection or from concrete procedures that rely on the SVM structure. After reviewing the basic facts on SVMs and the state of the art literature on SVM parameter selection, three of such methods will be selected for implementation and application, first to data sets available at machine learning repositories and then to the concrete problem of building SVR models to predict wind energy production at individual farms.

It is well known that parameter selection is a heavy computational task but suitable to execution time savings when properly parallelized. Because of this, and as a technological complement, the Open MP standard for parallel processing will also be considered and applied in the implementation and application of the selected parameter selection techniques.

(10)

(11)

List of Figures

1.1 Graphical differences between supervised (left) and unsupervised

(right) learning. . . 3

1.2 Principal steps of a machine learning solution. . . 4

1.3 Example of overfitting. A very specific model (left) and a more general model (right). . . 6

1.4 Perceptron algorithm working for two classes. Vectorwis changing until it manages a correct decision boundary. . . 7

1.5 Example of different possibilities of hyperplane to separate the data points. . . 8

1.6 Example of bounds in two different hyperplanes. . . 8

1.7 Scheme of the wind energy forecasting model. . . 10

2.1 Example of dual gap. . . 15

3.1 Optimal Separate Hyperplane, with its margin. . . 20

3.2 Example of support vectors, marked in circles. . . 23

3.3 Example of dataset of two classes with a noisy point(top). Hard margin classification (left) and soft margin classification (right). . 25

3.4 Soft margin classification of a noisy data, with the ξ errors of each point. . . 26

3.5 Non linear regression function with an insensitive band. The ξ are non zero for examples outside the region bounded between the dashed lines (the insensitive band). . . 28

3.6 A linearǫ-insensitive loss function example. . . 28

(16)

3.7 Simple example of mapping a input data inR²into a larger space R³ with an effective function Φ and go from non linear problem

to a linear one. . . 32

4.1 Graphical example for 4-fold cross validation. . . 41

4.2 Geometrical disposition for a grid search of two parameters. . . . 42

4.3 Asymptotic behavior ofσand C. . . 48

5.1 Example of DFGS grid search. . . 63

5.2 Example of exhaustive grid search . . . 63

5.3 Banana dataset. . . 66

5.4 Banana problem boundaries obtained by using different meta- parameters: 1.Cherkassky’s method, 2.Grid method 3.DFGS method and 4.DFGS + Cherkassky method . . . 67

5.5 Linear regression problem trained model with different parameter values: 1.Cherkassky’s method, 2.Grid method 3.DFGS method and 4.DFGS + Cherkassky method. . . 68

6.1 Use of the NMAE and the NRMSE for assessing the performance of models. Comparison with the persistence model. Taken from [HM05]. . . 82

6.2 Graph from ANEMOS [IDA07]. NMAE for assessing the performance of different models from Sotavento wind park. . . 82

6.3 Wind energy forecasting results, using different model selection methods. Model 1. . . 85

6.7 Performance in terms of MAE and RMSE for DFGS and DFGS + Cherkassky at different times horizons in Models 1, 2, 3 and 4 respectively. . . 91

6.8 Pie diagram with the percentage of different ranges of production (top). Histogram with MAE for DFGS (blue) and DFGS + Cherkassky (red) at each production range in Models 1 and 2 respectively (bottom). . . 92

6.9 Pie diagram with the percentage of different ranges of production (top). Histogram with MAE for DFGS (blue) and DFGS + Cherkassky (red) at each production range in Models 3 and 4 respectively (bottom). . . 93

(17)

LIST OF FIGURES xiii

6.10 Wind energy forecasting results. Model 2 using DFGS + Cherkassky and Model obtained by taken of Model 2 + DFGS if the prediction value is between 10-40%, and of Model 2 + DFGS+Cherkassky otherwise. . . 95 6.11 Wind energy forecasting results, for Models 1, 2, 3 and 4. . . 97 6.12 Absolute error and root square error at each horizon are shown

for Model 1 and 2 (top), and Model 3 and 4 (bottom). . . 98 6.13 Pie diagram with the percentage of different ranges of produc-

tion (left). Histogram with MAE for DFGS (blue) and DFGS + Cherkassky (red) at each range of production. For Models 1 and 2, and Model 3 and 4 respectively (right). . . 99 6.14 Pie diagram indicating which model performs better at each out-

put target. Comparison between Models 1 and 2 (left), and 3 and 4 i(right). . . 100 6.15 Absolute error and root square error are shown at each horizon

for Model 1+3 and 2+4. . . 100 6.16 Wind energy forecasting results, for Model 1 and 3 (top) , and

Model 2 and 4 (bottom). . . 102

(18)

(19)

List of Tables

3.1 Table with the most representative kernel functions . . . 34 5.1 Collection of different dataset used for comparing the perfor-

mance of meta-parameter selection methods. . . 61 5.2 Results of all methods for banana data. Parameter values, cross

validation error for selection models and the accuracy of testing the data are shown. It also includes the number of Support Vectors needed for the model. . . 66 5.3 Results of all methods for linear regression data. Parameter val-

ues, cross validation error for selection models and the Mean Square Error of testing the data are shown. It also includes the number of Support Vectors needed for the model. . . 68 5.4 Results of all methods for flare solar data. Parameter values,

cross validation error for selection models and the accuracy of test the data are shown. It also includes the number of Support Vectors needed for the model. . . 69 5.5 Results of all methods for ringnorm data. Parameter values, cross

validation error (100-5CV accuracy) for selection models and the accuracy of test the data are shown. It also includes the number of Support Vectors needed for the model . . . 69 5.6 Results of all methods for triazines data. Parameter values, cross

validation error for selection models and mean square error and mean absolute error in testing are shown. It also includes the number of Support Vectors needed for the model. . . 70 5.7 Results of all methods for housing data. Parameter values, cross

validation error for selection models, Mean Square Error and Mean Absolute Error in testing are shown. It also includes the number of Support Vectors needed for the model. . . 70

(20)

5.8 Results of all methods for abalone data. Parameter values, cross validation error for selection models and Mean Square Error and Mean Absolute Error in testing are shown. It also includes the number of Support Vectors needed for the model . . . 71 5.9 Results of all methods for body fat data. Parameter values, cross

validation error for selection models and Mean Square Error and Mean Absolute Error in testing are shown. It also includes the number of Support Vectors needed for the model . . . 72 5.10 Number of iterations each grid search method needs for obtain-

ing the optimal value and their corresponding validation error.

Default and Cherkassky center for DFGS is shown respectively. . 72 5.11 MAE and MSE of the final performance for models using meta-

parameters obtained by Cherkassky’s method, using the default values of LIBSVM, using DFGS or DFGS + Cherkassky. . . 73 5.12 Runtime for DFGS using 1, 2, 4, and 8 threads . . . 74 5.13 Comparison of accuracies from different experiments and different

dataset examples in classification task. . . 74 5.14 MSE of the final performance for regression problems from exper-

iments results of section5.4and MSE from publication of [BD11]. 75 6.1 Information about data for wind energy prediction models. . . . 84 6.2 Results of Model 1 for both model selection methods. Parameter

values, cross validation error for selection models and the MAE, MSE and RMSE are shown. . . 86 6.3 Results of Model 2 for both model selection methods. Parameter

values, cross validation error for selection models and the MAE, MSE and RMSE are shown. . . 90 6.6 Results of all models in terms of MAE, MSE and RMSE . . . 95 6.7 Improvement values (respect to MAE) of rows models with re-

spect to columns models. . . 96 6.8 MAE values at each time horizon of Model 1 and Model 2. . . . 97 6.9 MAE values at each time horizon of Model 3 and Model 4. . . . 98

(21)

Chapter 1

Introduction

The amount of different data available and the improvements in their storage have made easier to work with them. Therefore, this has improve the creation of effective machine learning methods trying to learn from them.

Hand-written character recognition and medical diagnosis are common examples where machine learning can be really helpful. These classification problems are not easy to solve because there are a lot of aspects to take into account , as the origin of patterns, how to represent them to apply the defined mathematical formulation easily, the processing time and the economical aspects.

One of the main goals of this chapter is to introduce some of the most important concepts of machine learning. To start introducing the topic of the thesis, this chapter is focused on supervised learning and how to build a supervised model.

1.1 General Concepts

First of all it will be very helpful to remember some basic ideas and concepts of this field.

1.1.1 What is Machine Learning?

Tom Mitchell in [Mit97] defines machine learning as follows:

A computer program is said to learn from experience E with respect to some class of task T and performance measure P, if its performance on T, as measured by P, improves with experience E.

One example to understand this definition is the following:

(22)

Suppose someone wants to improve the filter of spam in the email account. The email program watches which mail has been considered spam and which has not. Therefore, the filter can recognize, based on past experience, the type of mail to be filtered as spam.

In this situation:

• Twould be the classification of emails as spam or not

• E would check the emails label as spam or not.

• Pwould be the number of emails correctly classified as spam or not.

Machine learningis part of artificial intelligence. Its main goal is to develop different techniques that allow computers to learn from past events and change their behavior according to these.

What machine learning does is trying to build a model which approximates the behavior of a system from the observed data. The computer learns how to solve a particular problem by observation. First, the model has to be trained, in what is called thetraining process, which can be expressed in a mathematical language.

Regarding the types of problems, the learning can be split in different types:

supervised, unsupervised, semi-supervised, and reinforcement learning. The most important ones are supervised and unsupervised learning.

• Supervised Learning: Supervised learning problems are those in which the data have examples with some attributes (or features) and their corresponding target vectors (for each pattern).

The final output of the model sometimes is not the same as the real one, producing, therefore, an error. This error will be used to adapt the parameters of the system.

This system analyzes the data and produces a decision function, a classifier or a regression function. This function should predict the correct output value for any new valid input data.

Cases such as digit recognition in which the aim is to assign each input vector to one category are called classification problems, and when the output consists of one continuous variable, as in the case of prediction of wind energy production, they are called regression problems.

Some examples of supervised algorithms are: Artificial Neural Networks, Bayesian Statistics, Decision Trees, Learning Automata, Nearest Neighbor Algorithm, Regression Analysis or Support Vector Machines (SVMs).

In this thesis SVMs are used for classification and regression tasks.

• Unsupervised Learning: On the other hand, this type of learning uses some attributes of data without any target values. The goal in this case could be classifying the data, creating patterns, extracting some features, discovering groups of similarities, or determining the distribution of data.

(23)

1.1 General Concepts 3

Some examples of unsupervised algorithms are: Data clustering, Expectation- maximization, or Radial Basis Function Networks.

Figure 1.1: Graphical differences between supervised (left) and unsupervised (right) learning.

Figure1.1shows an example of these two types of learning. It is easy to appreci- ate the difference between them. The first graph shows two separated groups of data, with a known label, one class with circles and the second one with crosses.

Both classes are well separated. The goal in supervised learning is, through this dataset, to create a model capable to separate both classes. The picture on the right side is an unsupervised learning example. As it can be seen, the data do not have targets, and the purpose consists of classifying them according to their distribution in the space and then to predict in which group the new input data belong to, according to the model.

To develop the model and the learning algorithm it is necessary to split the set of data in two different groups.

• Training data: The number of data in the training set is usually bigger than in the test set. It is used to train the model.

• Test data: This set contains the rest of the data and its goal is to evaluate the model’s generalization ability.

In order to separate the training process and the evaluating process, the data used in both phases must be different. Moreover there are some methods that need avalidation setdata, which is a part of the training set.

1.1.2 General perspective of machine learning solutions

This section explains the design cycle of machine learning algorithms taken from [RDS00]. The aim is to build a machine that will take a vectorxas input and that will produce the desired outputy. In this process it is necessary to apply some other steps.

(24)

Figure 1.2: Principal steps of a machine learning solution.

• Data Collection: For training and testing the system uses a set of examples. Machine learning is data-driven, so the first step is to collect data from the system: a set of input patterns xi, i ∈ {1, . . . , N} with their associated output targetsyi ,i∈ {1, . . . , N}

• Feature Collection: For most practical applications, the original input data need a preprocessing phase to transform them in a new space of variables where the problem will be easier to solve. This phase is called feature extraction. It is important to select the relevant features from the data, taking some xi ∈ R^D to explain the behavior of the system.

The ideal selection would be those features insensitive to noise, simple to extract and invariant under irrelevant transformation. This phase must be done thoroughly. The accuracy of the model could be deteriorated if some important features for the solution are deleted.

• Model Choice: It is very important to select an appropriate objective function. The simplest one is the linear model ˆy =f(x) =wx+bbeing w the weight vector and b the bias term. Both are the parameters of this model. Other typical approaches to capture the behavior of the data are polynomial functions. During this step there is another phase for some models calledmodel selectionwhich consists of selecting meta-parameters.

They are parameters of the structure of the models.

• Training: The process of running the algorithm is expressed as a function y(x) encoded in the same way as the target vectors. The precise form of the functiony(x) is determined in this phase which is also known as the learning phase.

The problem of learning the best model parameters,θ, can be written as the optimization problem:

minθL(θ)

where Lis the loss function; the cost of the difference betweenf(xi) and yi. There are different representations of the loss function, depending on the problem under study.

The purpose of this function is to measure the accuracy of a particular choice of parameters.

This process uses only the training data.

(25)

1.1 General Concepts 5

• Evaluation: Once the model is trained is time to the evaluation process, which consists of measuring an error rate, in order to analyze the performance of the model using the test data. It is the final performance measure.

1.1.3 Supervised Learning

The most common task is pattern recognition based on the assignment of a label to a given input value. An example of pattern recognition is classification, which assigns to each input one discrete output value. In the example of emails, the classification labels each email as spam or as not spam. Another example of pattern recognition is regression, which assigns a real-valued output to each input.

Classification predicts discrete values, while regression predicts the continuous ones.

There are some factors that determine the final performance of the model, and it is important to understand them for being able to choose the best model.

• Noise. Sometimes the data are affected by some noise, causing some inac- curacies in the collection of the data, producing therefore, some difficulties in the model choice. This factor can lead to overfit. It is important to consider its presence to avoid generating a model with errors due to bad inputs.

• Overfitting. Once the parameters have been obtained in the training step, if they fit too much the training set they lose their generalization ability, because they create a model very specific for the concrete dataset.

When new input is included, this model would not classify it properly.

Figure 1.3 shows this situation. On the left side, the picture shows that the model fits too much the data, the function takes the exact values in each point. However, on the right graph the model captures better the general behavior of the data with a simple linear function.

The opposite situation, when the model does not fit at all and all the points are essentially considered as the same is calledunderfitting.

• Regularization. It is a strategy to avoid overfitting which tries to control the complexity of the model. It can be formally defined as a new term in the optimization problem,r, called as regularizer:

minθL(θ) +λr(θ),

λis the regularization parameter. The regularizer penalizes the complexity and the regularization parameter indicates if the regularizer is the most important part, on the other hand, the loss function is so. It is also useful for redundant data. This new parameter is included in the meta-parameter set.

(26)

Figure 1.3: Example of overfitting. A very specific model (left) and a more general model (right).

• Dimensionality of the space. If the input data have a high dimension, often there are some features which are neither useful nor decisive for the learning problem and because of them the learning problem can be difficult to solve.

In practical situations, some irrelevant features are deleted from the input data, and the accuracy of the learning function is improved.

In order to reduce dimensionality properly, there are some algorithms to select and identify the most relevant features like cross-validation, PCA, or decision trees.

• Estimation of error. There several ways to measure the performance of a model and depending on the type of error chosen, the performance will be different. The goal is to minimize the generalization error.

For example, in the case of regression the most typical loss function error is Mean Square Error and to assess the effectiveness of the model MSE is decomposed in bias and variance.

M SE =V ar(θ) +Bias(θ)²

In order to minimize MSE it is needed to minimize both but it is not easy because they are hard to minimize simultaneously. In general terms, to choose the best model it is useful to do a trade-off between bias and variance:

– Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

– Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).

– Identifying the best model requires identifying the proper model complexity (number of parameters).

(27)

1.2 Support Vector Machines at a glance 7

1.2 Support Vector Machines at a glance

Support Vector Machines (SVMs) are the main topic in this thesis. The principal reasons are their significant theoretical advantages and very good performance in many real-world problems.

Since they are considered as the state-of-the-art in machine learning, they are one of the most attractive elections. Support Vector Machines are learning systems that use a hypothesis space of linear functions in a high dimensional feature space.

In 1963 Vapnik and Lerneren developed SVMs, but they were not widely known until 1992-95 when Boser, and Cortes and Vapnik wrote their first published paper, [NC00].

Support Vector Machines are a very special case of the perceptron of Rosenblatt.

It is one of the first algorithms in Machine Learning Theory consisting of a linear model, of two classes where the inputx is used to construct basically a linear model:

y(x) =f(w·x),

withf a nonlinear activation functionf(a) = 1 ifa≥0 and f(a) =−1 otherwise.

Figure1.4explains the general idea of the perceptron algorithm, which tries to obtain a linear function to split the data.[Bis06].

Figure 1.4: Perceptron algorithm working for two classes. Vectorwis changing until it manages a correct decision boundary.

SVMs work in a way similar to perceptrons, because the aim of an SVM is to find a hyperplane that linearly separates data points from different classes.

(28)

Figure 1.5: Example of different possibilities of hyperplane to separate the data points.

Figure 1.6: Example of bounds in two different hyperplanes.

There may be several hyperplanes that satisfy this condition as it is shown in figure 1.5. There is a difference between perceptrons and SVMs which is their hyperplane election. SVM wants to find the hyperplane that is least likely to overfit the training data, trying to find the separating hyperplane with the margin as large as possible. Figure1.6shows the margin of the hyperplanesB1

andB2of the previous picture. In this case, SVM choosesB1as the separating hyperplane.

Neural Networks in general, use backpropagation for training, where gradient descent is used to obtain the solution and there is a possibility to lie in local minima. SVM, as it solves a QP problem, has a unique solution (in the primal formulation) and it is much easier to solve this kind of problems.

The goal through this training process is to obtain a machine that generalizes as well as possible.

To do this, it is necessary to search a model that minimizes the empirical cost on the training set, while on the other hand, generalizes well.

(29)

1.3 Applications 9

SVMs are also used for regression tasks with the same construction.

1.3 Applications

Machine Learning is used in many applications such as:

• Medical diagnosis.

• Detection of fraud in credit card operation.

• Speech recognition.

• Character recognition.

• Robotics.

• Game playing.

• DNA sequence classification.

• Wind energy forecasting.

Focusing on SVM, one of the most useful algorithms, the most important applications, according to [Bur98] and [NC00] are:

• Face detection in images.

• 3D object recognition.

• Speaker identification.

• Hand-written digit recognition.

• Industrial process classification.

• Disease diagnosis.

1.4 Wind Energy Forecasting

In Europe, some countries like Germany, Spain, Portugal and Denmark have as an important source of electricity the wind power (in the range of 7 to 20% of electricity consumption). The trend is to depend more on renewable energies like this one. Therefore, the increasing use of wind energy as a source of electricity energy creates the necessity to develop a reliable and accurate prediction system of wind energy power in order to use it effectively.

Wind energy production is a direct function of wind speed and other variables and, in contrast to other generation systems, is not easy to control due to fluctuations in the wind. Managing the variability of wind generation is one of the most important aspects for improving the renewable energy operation.

(30)

There are a lot of different approaches and methods to predict wind generation, and this thesis will focus on one concrete approach, Support Vector Machines, which correspond to a specific method in the statistical approach to wind power forecasting. SVMs are one of the state-of-the-art techniques in regression and because of that, Suppor Vector Regression (SVR) is one of the most common method applied for wind energy forecasting, [KG11].

Figure 1.7: Scheme of the wind energy forecasting model.

Chapter 6 explains different ways to predict wind power, and how different meteorology variables affect to production. Also, it explains the concrete model to apply, which will need past meteorology variables and past production for one wind farm to train the model.

Using meteorology prediction, the model will obtain the prediction of production. Figure 1.7 shows the scheme of the model to use. Wind energy power forecasting will be the main application to test the parameter selection models implemented in this thesis.

1.5 Thesis Contributions

This thesis is organized as follows: Chapter2introduces the Optimization The- ory, basic to develop mathematical formulations for Support Vector Machines, which are explained in Chapter3.

Chapter 4 contains the essence of this thesis as it explains different parameter selection methods and two of them will be implemented and tested in Chapter 5, with some examples for classification and regression tasks.

Finally Chapter 6 will focus on wind energy forecasting models. It will use SVMs, explained in Chapter 3, with the best parameter selection values explained in Chapter 4 to create a model capable to predict the production of wind energy using some meteorology variables obtained from AEMET (Agencia Estatal de Meteorologia of Spain) for a concrete wind farm calledSotavento.

(31)

Chapter 2

Optimization Theory

Optimization theory is the branch of mathematics concerned with characterizing the solution of problems like minimizing / maximizing a certain function and developing effective algorithms for finding the optima.

Depending on the specific cost function and nature of the constraints there is a large number of classes of optimization problems. This chapter is focused on solving problems with a convex quadratic cost function and linear constraints.

This class of optimization problems is calledconvex quadratic programmes.

The main reason is because SVMs are from this type.

This chapter does not only explain the techniques to solve them but also gives some necessary and sufficient conditions for a given function to be the correct solution. It discusses the KKT conditions as well as the duality theory to express the problem in a dual representation for solving it in an easier way. It includes some Lagrange theory to understand each step of the process. This chapter provides the necessary mathematical background to understand the subsequent chapters of the thesis.

2.1 Problem Formulation

The most general problem formulation used for finding the minimum of a function with some constraints is theprimal optimization problem, defined in [NC00]

Chapter 5 as:

Definition 2.1: Given functions f,gi,i= 1, . . . , k, andhi= 1, . . . , m, defined

(32)

on a domain Ω⊆Rⁿ ,

minimize f(w), w∈Ω,

subject to gi(w)≤0, i= 1, . . . , k, hi(w) = 0, i= 1, . . . , m,

(2.1)

wheref(w) is calledobjective functionand its optimal value is called thevalue

of the optimization problem. ⋄

This is the most general expression withg andhas constraints of the problem.

If the objective function f is a linear function as well as the inequality and equality constraints, then the optimization problem is called alinear programme while if the objective function is quadratic, is called aquadratic programme. An inequality constraint gi(w)≤0 is said to be active if the solution w^∗ satisfies gi(w^∗) = 0, otherwise it is calledinactive.

As it will see, SVM is a convex quadratic optimization problem, hence this chapter is going to focus on this kind of problems wheref(w) is convex.

Definition 2.2: A real-valued function f(w) isconvex according to [NC00], if

∀w,u∈Rⁿ and for anyθ∈(0,1):

f(θw+ (1−θ)u)≤θf(w) + (1−θ)f(u) The solution of the optimization problem2.1isw^∗∈Rwhere

R={w∈Ω :g(w)≤0, h(w) = 0}

is called thefeasible region (where all the possible solution are) and if there is no other pointw∈Rfor which f(w)< f(w^∗), it is calledglobal minimum. A pointw^∗ is a local minimum if∃ǫ >0 such thatf(w)≥f(w^∗),∀w∈Ω.

In the case of a convex function any local minimum of the unconstrained optimization problem is a global minimum.

This situation gives to SVM one of the most important advantages to respect other possible algorithms because its solution is unique (due to the convexity property) and of course it exists always (due to the quadratic property).

2.2 Lagrangian function

Lagrange multipliers are very useful for solving problems of finding a maximum or minimum of a function with some constraints because there is no need to solve explicitly the constraints.

Lagrange developed this method in 1797 for using it in mechanical problems but it was a generalization of a result of Fermat from 1629. In 1951 Kuhn and Tucker extended this method allowing inequality constraints too.

These results will be essential to develop the optimization theory. All results, taken from [NC00] Chapter 5, are restricted to the case of convex quadratic programmes.

(33)

2.2 Lagrangian function 13

2.2.1 Fermat’s condition

Theorem 2.1: A necessary condition for w^∗ to be a minimum of f(w) with f ∈C¹ is

∂f(w^∗)

∂w = 0

This condition together with convexity off is also a sufficient condition. ♦ This theorem gives the basic idea for this theory but it is necessary to include some new results when constraint functions are incorporated.

2.2.2 Lagrange’s condition

Definition 2.3: Given an optimization problem with objective function f(w) and equality constraintshi(w) = 0 , i = 1, . . . , m, the Lagragian Function is defined as:

L(w, β) =f(w) +

m

X

i=1

βihi(w),

where the coefficientsβi are called theLagrange multipliers. ⋄ Theorem 2.2: A necessary condition for a pointw^∗ to be a minimum off(w) subject tohi(w) = 0,i= 1, . . . , m, withf,hi∈C¹ is

∂L(w^∗, β^∗)

∂w = 0, (2.2)

∂L(w^∗, β^∗)

∂β = 0, (2.3)

for some values β^∗. The above conditions are also sufficient provided that

L(w, β^∗)is a convex function of w. ♦

Conditions2.2and2.3give a new system of equations. If the problem contains some constraints, it is possible that a pointw^∗ could be a minimum but

∂f(w^∗)

∂w 6= 0.

The theorem2.2solves this situation.

The geometric interpretation of the conditions of theorem 2.2 is that in the local minimum w^∗, the gradient of the objective function must belong to the subspaces generated by the gradient of the constraints. From condition2.2:

∂L(w^∗, β^∗)

∂w =∂f(w) w +X

βi

∂hi(w) w = 0,

∇f(w^∗)∈ {∇h(w^∗) : i= 1, . . . , m}.

(2.4)

(34)

2.2.3 Kuhn-Tucker’s condition

This condition deals with the most general situation and the desirable one: an optimization problem with equality and inequality constraints.

Definition 2.4: Given an optimization problem like 2.1(general optimization problem with inequalities and equalities constraints), thegeneralized Lagrangian function is defined as:

L(w, α, β) =f(w) +

k

X

i=1

αigi(w) +

m

X

i=1

βihi(w), αi ≥0.

The Lagrangian includesw and the Lagrange multipliers (α, β).

The primal problem can be transformed into a dual problem which provides an upper bound to the optimal values of the primal problem.

Definition 2.5: TheLagrangian dual problem of the primal problem2.1is:

maximize θ(α, β)

subject to α≥0. (2.5)

whereθ(α, β) =infw∈ΩL(w, α, β). ⋄

Since the problem is maximized byαandβ and it is minimized byw then the final solution is a saddle point of the Lagrange function.

The dual problem verifies some special qualities that are useful in the task of solving the primal problem.

The first one isThe Weak Duality Theorem that gives the fundamental relation- ship between the primal problem and the dual one. It states that the objective function value of the dual at any feasible solution is always smaller than or equal to the objective function of the primal problem at any feasible solution.

Theorem 2.3: The Weak Duality Theorem

Let w ∈ Ω be a feasible point of the primal problem and (α, β) are feasible solution of the dual one. Then

f(w)≥θ(α, β).

This is directly proved by definition of θ, the lagrangian function and the definition of the constraints:

θ(α, β) =infw∈ΩL(w, α, β)≤L(w, α, β) =f(w) +αg(w) +βh(w)≤f(w).

This can be also expressed as:

sup{θ(α, β) : α≥0} ≤inf{f(w) : g(w)≤0, h(w) = 0}.

(35)

2.2 Lagrangian function 15

This result indicates that maximizing the dual problem, it is impossible to exceed the value obtained by minimizing the primal one.

For any feasible solution, the values of the objective functions are related by the weak dual theorem. It is possible that these objective functions were different at their respective optima. The difference between those values at the optimum is illustrated in figure2.1and it is calleddual gap.

Figure 2.1: Example of dual gap.

However, it is necessary a theorem with more restrictions for obtaining a zero dual gap.

Theorem 2.4: The Strong Duality Theorem.

Given a primal optimization problem in the form 2.1 and a dual defined as 2.5, if the constraints of the primal are affine functions (functions that satisfy f(x) =Ax+b) as in the case of SVMs (due to its constraints are linear), then the dual gap is zero. That is,

f(w^∗) =θ(α^∗, β^∗),

wherew^∗ is the optimal solution of the primal problem and (α^∗, β^∗)is the opti-

mal solution of the dual problem. ♦

Therefore, both problems give at the same optimal value.

To conclude with this section, the following theorem establishes the conditions the solution must satisfy.

Theorem 2.5: Kuhn-Tucker.

Given an optimization problem with a convex domainΩ⊆Rⁿ as the one in2.1, the necessary and sufficient conditions for a pointw^∗ to be an optimum are the

(36)

existence of α^∗ andβ^∗ such that

∂L(w^∗, α^∗, β^∗)

∂w = 0.

∂L(w^∗, α^∗, β^∗)

∂β = 0.

α^∗_igi(w^∗) = 0, i= 1, . . . , k.

gi(w^∗)≤0, i= 1, . . . , k.

α^∗_i ≥0, i= 1. . . , k.

hi(w^∗) = 0.

(2.6)

The relationα^∗_igi(w^∗) = 0 is known as KKT complementarity condition.

This theorem can be interpreted as follows:

The solution can be:

• In the boundary of the feasible region, with the inequality constraint active (gi(w^∗) = 0) =⇒ It uses Fermat’s conditions for optimality and αi = 0.

(Third condition in theorem2.5).

• In the interior of it, with the constraint inactive (gi(w^∗)<0) =⇒It uses Lagrange’s conditions andαi 6= 0. (Two last conditions in theorem2.5).

2.3 Duality formulation

In general, the dual problem is easier to solve than the primal problem. It is just the result of introducing the variables obtained by setting to zero the derivatives of the Lagrangian expressed in theorem2.5in the primal Lagrangian problem.

This results a dual problem with no primal variables (w and b). The final function contains only the dual variables: θ(α, β) =infw∈ΩL(w, α, β).

Dual problem is explained to use it in the SVM problem. Chapter3 uses this formulation in the concrete optimization problem of SVMs. The main reason for introducing Lagrangian multipliers is because, first of all, the dual SVM is much easier to solve and because the training data appear only in inner product form. This is an important property to later generalize this algorithm.

Summing up, it is possible to forget the primal problem and solve the dual one, knowing that the same value at the optimum is obtained.

2.4 Chapter Summary

All these definitions and theorems are the basis of the optimization theory for convex quadratic problems. These theorems give a general and conceptual knowledge of the theory and will be applied in chapter3for the concrete SVM problem.

This chapter is focused on convex problems and the main ideas are:

(37)

2.4 Chapter Summary 17

• A general convex optimization problem with an objective function f(w) and inequality constraints g(w) and equality constraints h(w) is formu- lated as:

minimize f(w), w∈Ω,

subject to gi(w)≤0, i= 1, . . . , k, hi(w) = 0, i= 1, . . . , m.

(2.7)

• For this optimization problem a Lagrangian function is defined as:

L(w, α, β) =f(w) +

k

X

i=1

αigi(w) +

m

X

i=1

βihi(w), αi≥0.

• The final way to solve the first optimization problem is solving the La- grangian dual problem:

maximize θ(α, β) where θ(α, β) =infw∈ΩL(w, α, β),

subject to α≥0. (2.8)

• For this concrete problem, the necessary and sufficient conditions for a pointw^∗ to be the optimum are the existence ofα^∗ andβ^∗ such that:

∂L(w^∗, α^∗, β^∗)

∂w = 0.

∂L(w^∗, α^∗, β^∗)

∂β = 0.

α^∗_igi(w^∗) = 0, i= 1, . . . , k.

gi(w^∗)≤0, i= 1, . . . , k.

α^∗_i ≥0, i= 1. . . , k.

(2.9)

(38)

(39)

Chapter 3

Support Vector Machines

3.1 Motivation

Support Vector Machines (SVMs) are one of the most powerful techniques in Machine Learning. One of the most important properties of the SVM is that it performs well in real-world situations, giving excellent results in a wide variety of problems. Other important characteristic is the simplicity of the mathematical expression. Moreover is one of the most robust methods.

This chapter starts explaining the classification problem which can be divided in two cases: a linear separable case and a linear non-separable case. Furthermore, it explains the regression problem which is an extension of the classification one. Finally, the chapter deals with the problem of non-linearity, introducing the concept of kernelization.

3.2 Support Vector Classification

The aim of Support Vector Classification is to obtain agood separating hyperplane. A good hyperplane is the one which minimizes the norm of the weight vector.

3.2.1 Maximal Margin Classification. Linear separable case.

Maximal Margin classification works only for data that are linearly separable.

For this reason is not suitable for real problems, although, its explanation is interesting for introducing the rest of the models due to its simplicity. It exhibits

(40)

the key characteristics to this kind of learning machine.

Consider the situation of separable data with a training set{(xi, yi)}ⁿi=1, SVM obtains the optimal hyperplane which separates each class of the data. The maximal margin classifier optimizes the bounds by separating the data with the maximal margin hyperplane. This is managed by reducing the problem to a convex optimization problem, minimizing a quadratic function under linear inequality constraints.

Figure3.1 shows two different labeled classes. The maximal separating hyperplane appears in green. The margin γ is the distance between the separating hyperplane and the closest point of each class.

Figure 3.1: Optimal Separate Hyperplane, with its margin.

This linear classifier can perfectly separate the data, with a hyperplane defined as

hw,xi+b= 0, (3.1)

where the pair (w, b) satisfies:

hw,x⁺i+b≥1 ifyi = 1,

hw,x⁻i+b≤ −1 ifyi=−1.

The marginγ is half of the distance between the hyperplanes:

hw,x⁺i+b= 1

hw,x⁻i+b=−1, (3.2)

(41)

3.2 Support Vector Classification 21

which is computed as:

γ=1 2

h w

||w||2

,x⁺i − h w

||w||2

,x⁻i

= 1

2||w||2

(hw,x⁺i − hw,x⁻i)

= 1

||w||2

.

(3.3)

Once the margin is obtained from hyperplanes the goal of SVM is to maximizeγ, because the optimal separating hyperplane is given by3.1such asγis maximum.

The formulation of the concrete problem for a separable training data{(xi, yi)}ⁿi=1

is:

minimizew,b hw,wi

subject to yi(hw,xii+b)≥1, i= 1, . . . , l.

(3.4)

As it was explained before in chapter2, minimizing this equation, can be done through the Lagrange function defined in2.4:

L(w, b, α) = 1

2hw,wi −

n

X

i=1

αi[yi(hw,xii+b)−1]. (3.5)

To find the solution (w, b), the L function has to be minimized overw and b and then maximized it over the Lagrange multipliersαi ≥0.

It must satisfy the conditions given in the theorem2.2where the partial derivatives should be zero to be at a minimum:

∂L(w, b, α)

∂b =

n

X

i=1

yiαi= 0,

∂L(w, b, α)

∂w =w−

n

X

i=1

yiαixi= 0.

(3.6)

Substituting these equations into3.5, the final problem is summed up in maxi-

(42)

mizingW(α), defined as:

L(w, b, α) =1 2

l

X

i,j=1

yiyjαiαjhxi,xji −

l

X

i,j=1

yiyjαiαjhxi,xji −

l

X

i=1

αiyib+

l

X

i=1

αi.

⇓

maxα W(α) =

l

X

i=1

αi−1 2

l

X

i,j=1

αiαjyiyjhxixji, s.t αi≥0,

l

X

i=1

αiyi= 0.

(3.7) The KKT complementary condition is:

α^∗_i(yi(hw^∗,xii+b^∗)−1) = 0

The optimal values of the parameters of the primal problem are obtained by the KKT condition and by the partial derivatives.

The weight vector is:

w^∗=

l

X

i=1

yiα^∗_ixi. (3.8)

The parameterb^∗is:

b^∗=yi−w^∗xi=yi−

l

X

i=1

α^∗_iyihxi, xji ifα^∗_i >0. (3.9)

In theorem 2.5 the KKT complementary condition moreover provides useful information about those points that are close to the hyperplane, since

α^∗_i(yi(hw^∗,xii+b^∗)−1) = 0 Then:

Ifα^∗_i >0 ⇒yi(hw^∗,xii+b^∗) = 1.

These pointsxi are in one of the hyperplanes3.2.

Regarding the geometric point of view, this means that these pointsxi are the closest ones to the optimal separating hyperplane, and they have one of the most important roles in the formulation problem because they are the only ones needed in the expression of the hyperplane. These points are called support vectors. Therefore, SVM is not influenced by the number of points but by the support vectors. Picture3.2shows an example of support vectors.

(43)

Figure 3.2: Example of support vectors, marked in circles.

Furthermore, the optimal hyperplane can also be expressed in the dual representation usingα^∗ andb^∗ :

f(x,w^∗, b^∗) =

l

X

i=1

yiα^∗_ihxi,xi+b^∗

=X

i∈sv

yiα^∗_ihxi,xi+b^∗.

(3.10)

To conclude with this section, the following theorem of [NC00] gives the explicit definition of the marginγ:

Theorem 3.1: Consider a linearly separable training sample S= ((x₁, y1), . . . ,(x_l, yl))

and suppose the parameter α^∗ solves the dual optimization problem 3.7. Then the weight vectorw=Pl

i=1yiα^∗_ix_irealizes the maximal margin hyperplane with margin

γ= 1

||w||2

= (X

i∈sv

α^∗_i)^−1/2.

Using the representation of the hyperplane3.10we have:

yjf(xj, α^∗, b^∗) =yj(X

i∈sv

yiα^∗_ihxi,xji+b^∗) = 1.

Moreover, if last expression is multiplied byyj, it gives:

X

i∈sv

yiα^∗_ihxi,xji=yj−b^∗. (3.11)

(44)

Therefore,

hw^∗,w^∗i=

l

X

i,j=1

yjyiα^∗_iα^∗_jhxi,xji

= X

j∈sv

α^∗_jyj

X

i∈sv

yiα^∗_ihxi,xji

= X

j∈sv

α^∗_j(1−yjb^∗)

= X

j∈sv

α^∗_j.

(3.12)

proving the above result.

3.2.2 Soft Margin Classification. Linear non-separable case.

In a real situation, it is unlikely to find a hyperplane which separates perfectly the data. Even if it is found, it might not be desirable because it could overfit the model due to the noise or outliers in the data. Often, it is preferred a smooth decision boundary that ignores some data points which do not represent the behavior of the general dataset. This issue is handled introducing slack variables.

In order to understand better this situation, picture 3.3 presents data of two different classes. One point of the blue class is closer to every point of the red class than to any of the blue one. The most probably cause for this situation is the noise of the data. Applying maximal margin classification and kernelization, the classifier hyperplane would be the second plot, obtaining a complex curve that does not represent the real behavior of the data, while introducing slack variables, the classification of the data would be like the one of the last picture.

In order to optimize the margin, slack variables are introduced in the optimization problem with a square penalty:

minimizew,b hw,wi+C

l

X

i=1

ξ_i^k

subject to yi(hw,xii+b)≥1−ξi i= 1, . . . , l, ξi≥0, i= 1, . . . , l.

(3.13)

The correspondingξiis the error occurred, soPl

i=1ξ_i^k is a kind of upper bound on the number of training errors as [Bur98] explains.

Notice that in equation3.13kcould be any positive number, but we are going to consider just value 1.

The main idea is to keep the aim of the SVM of maximizing the margin, but at the same time, allowing some data points to be misclassified. A non-zero value

(45)

Figure 3.3: Example of dataset of two classes with a noisy point(top). Hard margin classification (left) and soft margin classification (right).

forξi allows not to meet the margin requirement at a cost proportional to the value ofξi.

Picture 3.4 shows the geometrical meaning of these penalizing errors. In the following subsection, we will see why the points that are outside the tube are support vectors and how it is possible to obtain those support vectors.

Moreover,Cis a regularizer parameter. The larger value ofC, the smaller errors are allowed.

The optimization problem is now a trade-off between how big we can make the margin versus how many points have to be moved around to allow this margin.

3.2.2.1 Soft Margin Classifier in 1-norm

The primal Lagrangian function of the soft-margin problem3.13for the 1-norm defined in [NC00] is:

L(w, b, ξ, α, r) = 1

2hw,wi+C

l

X

i=1

ξi−

l

X

i=1

αi[yi(hw,xii+b)−1 +ξi]−

l

X

i=1

riξi. (3.14) Theri≥0 is another Lagrange multipliers included to enforce the positivity of ξi as [Bur98] establishes.

Similarly to the hard margin classifier, the following steps are necessary in order to obtain the optimal solution for this problem.

First of all, the partial derivatives of the primal problem are calculated in order

(46)

Figure 3.4: Soft margin classification of a noisy data, with theξerrors of each point.

to obtain the dual formulation.

∂L(w, b, ξ, α, r)

∂w =w−

l

X

i=1

yiαixi= 0⇒w=

l

X

i=1

yiαixi,

∂L(w, b, ξ, α, r)

∂ξi

=C−αi−ri= 0⇒0≤αi≤C,

∂L(w, b, ξ, α, r)

∂b =

l

X

i=1

yiαi= 0,

(3.15)

the dual problem becomes : maxα LD(w, b, ξ, α, r) =

l

X

i=1

αi−1 2

l

X

i,j=1

yiyjαiαjhxi,xji, s.t 0≤αi≤C,

l

X

i=1

αiyi= 0.

(3.16)

Moreover the KKT conditions that the solution must satisfy are:

α^∗_i(yi(hxi,w^∗i+b^∗)−1 +ξ_i^∗) = 0, α^∗_i ≥0, ri^∗≥0, r_i^∗ξ_i^∗= 0⇒(C−α^∗_i)ξ_i^∗= 0.

(3.17)

Therefore,

• α^∗_i = 0⇒ξ_i^∗= 0⇒yi(hw^∗,xii+b^∗)≥1. (Obtained from the constraint of the primal problem).

(47)

3.3 Support Vector Regression 27

• 0< α^∗_i < C ⇒ξ^∗_i = 0⇒yi(hw^∗,xii+b^∗) = 1. (Obtained from the first KKT condition in3.17).

• α^∗_i =C⇒yi(hw^∗,xii+b^∗) = 1−ξ_i^∗ ≤1. (Obtained from the first KKT condition in3.17).

The support vectors are those points with α^∗_i 6= 0. With these conclusions, the picture 3.2 can be explained. The square points are the support vectors, four of them are in the margin hyperplanes and they satisfy the second conclusion, (when 0 < α^∗_i < C), the last support vector, on the left at the bottom, corresponds to the third conclusion, whenα^∗_i =C.

The value of b^∗ is obtained by the KKT complementary condition, when 0 <

α^∗_i < C:

b^∗=yi− hw^∗,xii. Moreover, remember that

w=

l

X

i=1

yiαixi.

Finally, the value ofγis now:

γ= 1

||w||² = X

i,j∈sv

yiyjα^∗_iα^∗_jhxi,xji−1/2

. (3.18)

3.3 Support Vector Regression

SVM can be extended also to regression problems, and it maintains all the main features that characterize the maximal margin or soft margin algorithms. The goal of SVR is to find a functionf(x) with anǫdeviation from the targetyifor all data points in the training set. Moreover it is desirable that the band of size ǫbe as small as possible and at the same time to avoid as much as possible the increment in the complexity of the model. This situation can be solved forgiving certain errors from points close to the real value.

The way to do this is by introducing a loss function that ignores the errors within a certain distance to the target value. It is called theǫ - insensitive loss function.

In figure 3.5 a non linear regression function is shown with a band of size ǫ.

Those points inside the band are considered as valid. The rest have a penalty ofξ.

Definition 3.1: The linearǫ-insensitive loss,L^ǫ(x, y, f) is defined by L^ǫ(x, y, f) =|y−f(x)|ǫ=max(0,|y−f(x)| −ǫ), wheref is a real valued function on a domainX,x∈Xand y∈R.

Similarly, the quadraticǫ-insensitive loss is given by L^ǫ₂(x, y, f) =|y−f(x)|²ǫ.

(48)

Figure 3.5: Non linear regression function with an insensitive band. Theξ are non zero for examples outside the region bounded between the dashed lines (the insensitive band).

Figure 3.6 shows a linear regression problem with the penalty given by ξ in those points outside theǫ-tube. On the right side, the linearǫ -insensitive loss function is shown. The x-axes represents|y−f(x)|and the y-axes the value of L. The penalty increases linearly with the distance between the target and the function.

Figure 3.6: A linearǫ-insensitive loss function example.

As in the soft-margin classification case, there are several options to define the problem depending on the loss function selected. The most common are the linear or the quadratic loss functions.

The steps to achieve the final problem formulation are always the same, the only difference is how the optimization problem is defined.

Meta-parameter selection for Support Vector Machines in wind energy forecasting models

Susana Rojas

Meta-parameter selection for

Support Vector Machines in wind energy forecasting models

Master’s Thesis

Susana Rojas

Meta-parameter selection for

Support Vector Machines in wind energy forecasting models

Master’s Thesis

Preface

Acknowledgement

Summary

Contents

List of Figures

List of Tables

Chapter 1

Introduction

1.1 General Concepts

1.1.1 What is Machine Learning?

1.1.2 General perspective of machine learning solutions

1.1.3 Supervised Learning

1.2 Support Vector Machines at a glance

1.3 Applications

1.4 Wind Energy Forecasting

1.5 Thesis Contributions

Chapter 2

Optimization Theory

2.1 Problem Formulation

2.2 Lagrangian function

2.2.1 Fermat’s condition

2.2.2 Lagrange’s condition

2.2.3 Kuhn-Tucker’s condition

2.3 Duality formulation

2.4 Chapter Summary

Chapter 3

Support Vector Machines

3.1 Motivation

3.2 Support Vector Classification

3.2.1 Maximal Margin Classification. Linear separable case.

3.2.2 Soft Margin Classification. Linear non-separable case.

3.3 Support Vector Regression