
4.2 Evaluation Metrics


similar. Hopefully, these correspond to the same person, and the images can be grouped together for you.

A major challenge in unsupervised learning is evaluating whether the algorithm learned something useful. Unsupervised learning algorithms are usually applied to data that does not contain any label information, so we do not know what the right output should be. Therefore, it is very hard to say whether a model “did well”. For example, our hypothetical clustering algorithm could have grouped together all the pictures that show faces in profile and all the full-face pictures. This would certainly be a possible way to divide a collection of pictures of people’s faces, but it is not the one we were looking for. However, there is no way for us to “tell” the algorithm what we are looking for, and often the only way to evaluate the result of an unsupervised algorithm is to inspect it manually.

As a consequence, unsupervised algorithms are often used in an exploratory setting, when a data scientist wants to understand the data better, rather than as part of a larger automatic system. Another common application for unsupervised algorithms is as a preprocessing step for supervised algorithms.

Learning a new representation of the data can sometimes improve the accuracy of supervised algorithms, or can lead to reduced memory and time consumption.

In this thesis, we will focus exclusively on supervised learning, as it is the method our machine learning models rely on.

As seen in the table, the confusion matrix contains four entries, which we describe in terms of real-life scenarios:

1. True Positive (TP): the model predicted a stock to rise in value, and the stock appreciated.

2. False Positive (FP): the model predicted a stock to rise in value, and the stock depreciated.

3. False Negative (FN): the model predicted a stock would decrease in value, and the stock appreciated.

4. True Negative (TN): the model predicted a stock would decrease in value, and the stock depreciated.

Depending on what the model is intended to predict, the entries of the matrix can be used in various ways. In the stock selection setting, false positives are directly costly for the portfolio, as the model tells us to buy before the market falls. False negatives, on the other hand, result in a kind of opportunity cost, since, following the model's strategy, we deliberately choose not to buy before the market rises in value.
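To make the four counts concrete, the sketch below tallies them for a short, purely hypothetical series of daily direction forecasts; the vectors y_true and y_pred and their values are illustrative, not taken from our data.

```python
import numpy as np

# Hypothetical daily outcomes: 1 = the stock appreciated, 0 = it depreciated.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # realised direction
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 1])   # model's predicted direction

tp = np.sum((y_pred == 1) & (y_true == 1))    # predicted rise, stock rose
fp = np.sum((y_pred == 1) & (y_true == 0))    # predicted rise, stock fell (costly trade)
fn = np.sum((y_pred == 0) & (y_true == 1))    # predicted fall, stock rose (missed gain)
tn = np.sum((y_pred == 0) & (y_true == 0))    # predicted fall, stock fell

# Rows: predicted rise / predicted fall; columns: actual rise / actual fall.
print(np.array([[tp, fp],
                [fn, tn]]))                   # [[3 2]
                                              #  [1 2]]
```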

Additional performance measures can be deduced from the matrix:

1. Recall = TP / (TP + FN)

Recall describes how large a share of the actual positive outcomes the model identifies.

2. Precision = TP / (TP + FP)

Precision describes how often the model is right when it predicts a positive outcome.

3. Accuracy = (TP + TN) / TOTAL

Accuracy describes the share of all classifications that are correct.

In a stock setting, recall is the share of correctly predicted positive returns out of all positive returns. For a no-short strategy, this measure is of great importance, as the only non-coincidental way to make money is to correctly classify true positives.

Precision is the share of correctly predicted positive returns out of all predicted positive returns.

Accuracy is a widely used evaluation metric that measures the proportion of correct predictions out of all predictions. It is an intuitively appealing measure which provides insight into how well the model classifies observations overall.
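A small sketch of the three measures computed directly from the four counts; the example numbers are those produced by the illustrative vectors above.

```python
def recall(tp, fn):
    # Share of the actual positive returns the model caught.
    return tp / (tp + fn)

def precision(tp, fp):
    # Share of the predicted positive returns that were actually positive.
    return tp / (tp + fp)

def accuracy(tp, tn, fp, fn):
    # Share of all predictions that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

print(recall(tp=3, fn=1))                     # 0.75
print(precision(tp=3, fp=2))                  # 0.6
print(accuracy(tp=3, tn=2, fp=2, fn=1))       # 0.625
```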

4.2.2 ROC

A natural extension of the confusion matrix analysis is the Receiver Operating Characteristic (ROC) curve, a graphical representation of the true positive rate versus the false positive rate.

True positive rate (TPR) = True positives / (True positives + False negatives)

False positive rate (FPR) = False positives / (False positives + True negatives)
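As a minimal sketch, assuming the model outputs a probability of a positive (rising) return that is turned into a class label by a threshold, the two rates can be computed as follows; the function and variable names are our own.

```python
import numpy as np

def tpr_fpr(y_true, p_pred, threshold):
    """Return (TPR, FPR) when p_pred is classified at the given threshold."""
    y_hat = (p_pred >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Illustrative labels and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3])
print(tpr_fpr(y_true, p_pred, threshold=0.5))   # (0.75, 0.25)
```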

To better illustrate how these measures are used, we present the following visualization:

Figure 13: ROC

The x-axis represents the false positive rate, i.e. the rate at which observations are classified as positive when they are in fact negative. The y-axis represents the true positive rate, i.e. the rate at which observations are classified as positive when they are in fact positive.

In general, one feeds a range of parameter values into the models and evaluates both the individual and the relative performance of the models based on their true and false positive rates.

Points along the blue line mean that the proportion of correctly classified positive samples equals the proportion of incorrectly classified negative samples, i.e. the model performs no better than random guessing. As with the confusion matrix, the curves provide insight into what sort of errors the model produces.

The ideal solution for one's model is a point as close to (x, y) = (0, 1) as possible. A result such as this means that the model classifies all positive samples as positive and all negative samples as negative. In practice, a lower limit can be set for the true positive rate, and the solution furthest to the left (lowest false positive rate) that satisfies it is then chosen, as sketched below.
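A brief sketch of that selection rule, assuming the TPR/FPR pairs have already been computed for a grid of thresholds; the 0.80 floor is purely illustrative.

```python
import numpy as np

def pick_threshold(thresholds, tprs, fprs, tpr_floor=0.80):
    # Among all thresholds whose TPR is at least the floor, pick the one with
    # the lowest FPR, i.e. the admissible point furthest to the left on the ROC.
    tprs, fprs = np.asarray(tprs), np.asarray(fprs)
    ok = tprs >= tpr_floor
    if not ok.any():
        return None                                   # no threshold meets the floor
    best = np.flatnonzero(ok)[np.argmin(fprs[ok])]
    return thresholds[best]

# Illustrative ROC points for four candidate thresholds.
print(pick_threshold(thresholds=[0.3, 0.4, 0.5, 0.6],
                     tprs=[0.95, 0.90, 0.82, 0.70],
                     fprs=[0.40, 0.25, 0.15, 0.10]))  # 0.5
```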

In practice, the ROC methodology can be applied when optimizing hyperparameters or when comparing different models; in the graph above, a Logistic Regression model is compared with a Random Forest. We observe that the random forest has greater values than the logistic regression at all points, indicating that the random forest is the superior model. However, one can easily imagine curves that cross at several points, which complicates the comparison. In the following subsection, we introduce a more advanced view of the ROC curve and how it contributes to evaluating the models.
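The kind of comparison described above could be reproduced roughly as follows with scikit-learn; the synthetic dataset and the default model settings are illustrative assumptions, not the data or configurations used in this thesis.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the stock features/labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(random_state=0))]:
    p = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]    # probability of class 1
    fpr, tpr, _ = roc_curve(y_te, p)
    plt.plot(fpr, tpr, label=name)

plt.plot([0, 1], [0, 1], "b--", label="Random guess")      # the diagonal reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```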

4.2.3 AUC

A more sophisticated approach to the ROC is the area under the curve (AUC) evaluation metric. It builds on the idea that the TPR and the FPR have corresponding density functions. Utilizing the density functions, this approach produces a quantitative measure of how far up and to the left a model generally scores on the ROC graph. The AUC thus provides a framework in which the performance of multiple models can easily be compared through a single number. In general, when comparing models, the one with the greatest AUC is chosen.
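In practice the AUC is usually computed numerically from the labels and the predicted probabilities rather than from explicit density functions; a minimal sketch with illustrative numbers, using scikit-learn's roc_auc_score, is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative labels and two models' predicted probabilities of a positive return.
y_true    = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_model_a = np.array([0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.7, 0.3])
p_model_b = np.array([0.6, 0.5, 0.4, 0.7, 0.6, 0.3, 0.5, 0.4])

print("Model A AUC:", roc_auc_score(y_true, p_model_a))   # 1.0   -> perfect ranking
print("Model B AUC:", roc_auc_score(y_true, p_model_b))   # lower -> Model A preferred
```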
