Bootstrap Aggregation (Bagging)

After each individual algorithm is tested and evaluated, the bootstrap aggregating algorithm is applied in order to assemble the prediction results of the different algorithms with the goal of improving stability and accuracy. Bootstrap aggregation is a simple and widely used meta-algorithm for aggregating predictive models. It is also the algorithm used in Random Forest for aggregating results from individual decision trees into a final output. Bootstrap aggregation can be used for both regression and classification tasks. For regression tasks, the average of the outputs from all models is taken as the aggregated output. For classification tasks, the class with the majority vote is taken.

In this thesis, bootstrap aggregation is used on the rankings of stocks produced by each model. A stock is selected for inclusion in the portfolio if the majority of the algorithms include the stock in their defined top rankings. This is also called majority voting.
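To make the voting scheme concrete, the following sketch (not taken from the thesis) selects the stocks that a majority of models place in their top-k rankings; the function and variable names, the top-k cutoff and the strict-majority threshold are illustrative assumptions.

```python
from collections import Counter

def majority_vote_selection(model_rankings, top_k, min_votes=None):
    """Select stocks that a majority of models place in their top-k.

    model_rankings: list of lists, each an ordered ranking of tickers
                    produced by one model (best first).
    top_k:          how many top-ranked stocks each model nominates.
    min_votes:      votes required for inclusion; defaults to a strict
                    majority of the models.
    """
    if min_votes is None:
        min_votes = len(model_rankings) // 2 + 1

    votes = Counter()
    for ranking in model_rankings:
        for ticker in ranking[:top_k]:
            votes[ticker] += 1

    return [ticker for ticker, n in votes.items() if n >= min_votes]

# Example with three hypothetical model rankings and a top-2 cutoff
rankings = [["AAA", "BBB", "CCC"], ["BBB", "AAA", "DDD"], ["BBB", "CCC", "AAA"]]
print(majority_vote_selection(rankings, top_k=2))  # ['AAA', 'BBB']
```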

5 XGBoost

Extreme Gradient Boosting, or XGBoost, is a decision/regression-tree based ensemble machine learning method which uses the gradient boosting optimization framework. It has become very popular in the ML community due to its predictive performance and computational speed. Furthermore, it is a flexible and robust tool which handles both regression and classification problems well, whilst allowing for user-defined objective functions and much more.
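As a minimal illustration of the interface, and not the configuration used in this thesis, the sketch below fits an XGBClassifier from the xgboost Python package on synthetic data; all hyperparameter values and data shapes are placeholders.

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic feature matrix and binary target, stand-ins for the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Placeholder hyperparameters for a binary classification task
model = XGBClassifier(
    n_estimators=200,
    max_depth=3,
    learning_rate=0.1,
    objective="binary:logistic",
)
model.fit(X[:400], y[:400])

# Predicted class probabilities on the held-out slice
proba = model.predict_proba(X[400:])[:, 1]
```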

Due to the sheer size and complexity of the XGBoost machinery, we have dedicated the following subsections to introducing some of the fundamental drivers of the algorithm. More specifically, we will lightly cover the basic concepts of decision trees and gradient boosting to gain a better understanding of how these are applied in the XGBoost algorithm.

5.1 Decision Trees

The decision tree algorithm is a supervised machine learning algorithm which is used for both classification and regression tasks. It has a flowchart-like structure which resembles that of a tree, and some of the main elements of the structure are appropriately named roots, branches and leaves. For the sake of introduction, we will focus on decision trees and how they are constructed.

Figure 17: Tree structure

Common terms typically used with Decision Trees:

• The root node: The initial node at the top of the structure

• Internal nodes: The blue nodes which split into other nodes or leaves

• Splitting: The process of dividing a node into more nodes or leaves

• Branches: The decision rules from each node. E.g. if x then yes, else no

• Leaf: A node which does not split. Contains outputs, both categorical and numerical

• Stump: A structure with a root node and two leaves.

The tree is built by constructing classification rules which separate the observations in an optimal manner, using an "impurity" measure. Many decision-tree-based models measure the similarity between observations that are grouped together in the same leaves, and use this information to update the classification rules.

Standard decision trees use Entropy and the Gini Index to measure similarity; a short numerical illustration of both measures follows the list below.

• Entropy measures the amount of information needed to accurately describe a sample. If the sample is homogeneous, the entropy is 0, and if the sample is equally divided, the entropy is 1.

$$\text{Entropy} = -\sum_{i=1}^{n} p_i \log(p_i) \tag{20}$$

with $p_i$ being the probability of each class.

• Gini Index measures inequality in a sample:

$$\text{Gini index} = 1 - \sum_{i=1}^{n} p_i^2 \tag{21}$$

with $p_i$ being the probability of each class, where lower values suggest a homogeneous sample and vice versa.
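To make the two measures concrete, the sketch below (not part of the thesis) computes entropy and the Gini index for class-count vectors, using a base-2 logarithm so that an evenly split binary sample has entropy 1, as described above; the counts are arbitrary illustrations.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class-count vector."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

def gini(counts):
    """Gini index of a class-count vector."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# A perfectly mixed binary sample: entropy = 1.0, Gini = 0.5
print(entropy([50, 50]), gini([50, 50]))

# A homogeneous sample: entropy = 0.0, Gini = 0.0
print(entropy([100, 0]), gini([100, 0]))
```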

In the decision tree algorithm, nodes represent features, branches represent decision rules and leaf nodes represent the outcome. To illustrate how the method is implemented with the impurity measure, we have conducted an experiment with the following data and outcome.

Figure 18: Observations decision tree example

Figure 19: Gini impurities decision tree example

In this example, we have constructed a data gathering and target creation scenario. The three columns contain price-related and fundamental data on a single stock, and the target variable is whether the price of the stock increases over the next 20 business days. The rows represent observations on different business days, and the corresponding "yes" and "no" indicate whether levels are high or low and whether the price increases or decreases, respectively.

Typically in finance, the associated features are numerical in nature, which in practice results in the models calculating barriers which best separate the data. In this example, we implement a pseudo-barrier for each feature which indicates whether the level of the feature is high or low.

If we look at the Size level root node, we see that the total number of observations for this variable is:

Total observations, Size level = 115 + 33 + 34 + 125 = 307

It is worth noting that the total number of observations for each feature varies, which is a result of missing observations in the dataset. This portrays one of the strengths of tree-based models: they function well with data of poor quality. In practice, the algorithm calculates the pseudo-barriers on the data which is available, and constructs corresponding rules. Obviously, if data quality is too poor, one would have to supplement the model with data-enriching techniques, but this is beyond the scope of this thesis.

Focusing on the Size level stump, we see that the left leaf represents the samples which have corresponding high Size levels, in which 115 cases resulted in an increase in the price over the following 20 business days, and 33 resulted in lower prices.

Utilizing the Gini impurity formula (21), we can calculate the Gini impurities and determine that the Momentum level grants the lowest score. Remembering that lower scores indicate greater performance, we would then place the Momentum feature in the root node, and iterate through the tree testing the remaining features.
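As a check on the reasoning above, the following sketch (not from the thesis) reproduces the weighted Gini impurity of the Size level stump from the counts given earlier; the counts for the remaining features appear in Figure 19 and are not repeated here.

```python
def gini(counts):
    """Gini impurity of a single leaf given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def stump_gini(left_counts, right_counts):
    """Weighted Gini impurity of a stump (root node with two leaves)."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    n_total = n_left + n_right
    return (n_left / n_total) * gini(left_counts) + (n_right / n_total) * gini(right_counts)

# Size level stump: left leaf (high Size) = 115 "yes" / 33 "no",
# right leaf (low Size) = 34 "yes" / 125 "no"
print(round(stump_gini([115, 33], [34, 125]), 3))  # 0.341
```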

From this example, we see that decision trees possess a dynamic nature which works well with both numeric and categorical data, and handles missing observations well. Furthermore, the method is simple, intuitively appealing and produces transparent "white-box" models, something which can often be a challenge with Machine Learning models.

Depending on what sort of data the model aims to describe, it can implement either decision trees or regression trees. There are two fundamental differences between decision trees and regression trees: the former constructs decision rules which map into classes, e.g. binary "yes": 1 or "no": 0, whereas the latter is used when the response variable is numeric. In practice, classification problems are solved with a majority vote from decision trees, whereas regression tasks average regression tree outputs.
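As a brief aside, not part of the thesis itself, scikit-learn exposes both variants through the same interface, with classification leaves predicting a majority class and regression leaves predicting a mean; the data below is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Classification tree: binary target, leaves predict the majority class
y_class = (X[:, 0] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)

# Regression tree: numeric target, leaves predict the mean of their samples
y_reg = X[:, 0] + 0.1 * rng.normal(size=200)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)

print(clf.predict(X[:5]), reg.predict(X[:5]))
```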

We have chosen to utilize the classification approach, as we intend to build a model which can predict the sign of the cumulative returns over a 20-day period, and evaluate the confidence of the prediction.
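To make the target definition concrete, a minimal sketch of how such a binary label could be constructed from a daily closing-price series is shown below; the function name and the toy price series are hypothetical and do not reflect the thesis' actual preprocessing.

```python
import pandas as pd

def make_sign_target(prices: pd.Series, horizon: int = 20) -> pd.Series:
    """Binary label: 1 if the cumulative return over the next `horizon`
    business days is positive, 0 otherwise. The last `horizon` rows have
    no complete forward window and are dropped."""
    forward_return = (prices.shift(-horizon) / prices - 1).dropna()
    return (forward_return > 0).astype(int)

# Example with a hypothetical daily closing-price series and a 2-day horizon
prices = pd.Series([100, 101, 99, 103, 105], name="close")
print(make_sign_target(prices, horizon=2))  # labels 0, 1, 1
```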

However, both decision and regression trees are infamous for overfitting to their training data, and are rarely used stand-alone, as they produce models which perform poorly on "out of bag" data.

Intuitively, as the trees grow larger, they begin to describe very specific behaviour which results in predictions with high variance. However, this does not mean that decision trees are irrelevant and we will introduce a very popular decision/regression tree-based concept in the following section.
