
Data Preprocessing


Table 1: Data characteristics

Number of stocks with more than 252 observations: 4065
First observation date: 31/12/1997
Last observation date: 03/02/2020
Average years of observations per share: 12.85
Ratio of missing fundamental observations: 51%
Number of features: 94
Return frequency: Daily
Fundamental frequency: Monthly
Stocks with GICS classification: 3179

In summary, when using stocks with at least one year of data, our database contains price and fundamental data on 4065 stocks over the period 31/12/1997 to 03/02/2020. On average, we have almost 13 years of data per asset, with daily return observations and monthly fundamentals. Due to sub-optimal data quality, roughly every second fundamental observation is missing, which in effect means we have a fundamental observation about every two months. Among the 4065 stocks, 3179 have a corresponding GICS classification, which is an important observation to make, as the machine learning models will implement sector separation to optimize the training algorithm.

Finally, the data has observations for 94 different features, whose data types vary across numerical, ratio and categorical data.


Our models train primarily on two types of data: fundamental and price-related data. Price-related data is easy to come by and usually has daily observations, whereas fundamentals primarily have monthly observations. We have chosen to solve this mismatch by implementing a "forward fill" approach.

The forward fill method constructs a dataset with daily observations. It does so by filling each day's observation with the most recent available observation until newer information is acquired. For example, if a feature has two observations, on 01/01/2010 and 01/02/2010, every day of January would carry the observation from 01/01/2010.
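
As a minimal sketch of this step, assuming the monthly fundamentals for one stock are held in a pandas DataFrame indexed by date (the feature name and values are hypothetical):

```python
import pandas as pd

# Hypothetical monthly fundamentals for one stock: observations on 01/01/2010 and 01/02/2010
fundamentals = pd.DataFrame(
    {"book_to_market": [0.45, 0.47]},                 # illustrative feature and values
    index=pd.to_datetime(["2010-01-01", "2010-02-01"]),
)

# Reindex to business days and carry the latest known observation forward
daily_index = pd.bdate_range("2010-01-01", "2010-02-26")
daily_fundamentals = fundamentals.reindex(daily_index).ffill()
```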

In practice, this means that during the period where new fundamental data has not yet been acquired, the only things that change in the data are the price-related variables. This gives rise to some concern, as we intend to utilize whatever information the fundamental data may hold, and not build a model based on price movements alone.

However, we argue that the level of the fundamental data (which is held constant throughout the month) and its interaction with the price might describe a different dynamic than price alone, i.e. certain features might play a bigger role when prices are relatively high or relatively low. Furthermore, the majority of our models benefit greatly from large amounts of data, so even though the forward fill might not contribute significantly with respect to information, it can help increase the accuracy of the models.

Another intuitively appealing approach would be to fit a linear regression to each fundamental feature and use the slope to predict the daily changes between observations. This approach, however, is somewhat biased, as it assumes a linear development in the fundamental data over time. In the end we stuck with the forward fill method, as it seems the least biased and the most realistic use case.
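
For comparison, the rejected alternative roughly corresponds to interpolating linearly between the monthly observations; a hedged one-liner reusing the hypothetical `fundamentals` frame and `daily_index` from the sketch above:

```python
# Linear interpolation between monthly observations -- assumes fundamentals evolve linearly
# between reporting dates, which is the bias discussed above; we use ffill() instead.
interpolated = fundamentals.reindex(daily_index).interpolate(method="linear")
```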

Technical Indicators

In addition to the 94 company fundamentals, we have constructed a few popular price-driven technical indicators to help predict price movements and returns:

1) Relative Strength Index (RSI)

The first indicator is the RSI, a momentum indicator that measures the magnitude of recent price changes. It is used to evaluate overbought or oversold conditions in e.g. a stock, and it is displayed as an oscillator, a line graph that moves between two extremes. The RSI takes values between 0 and 100; normally, values above 70 indicate that the stock is overbought or overvalued, while values below 30 indicate an oversold or undervalued stock. Values above 70 could mean that the stock is moving towards a trend reversal or a pullback in price.

RSI is calculated with:

\[
\text{RSI} = 100 - \left[ \frac{100}{1 + \frac{\text{Average of upward price change}}{\text{Average of downward price change}}} \right]
\]

The RSI line is then normally plotted below the asset’s price chart.

2) Moving Average (MA)

This indicator is the Simple Moving Average (SMA) that was described in section 3.

3) Exponential Moving Average (EMA)

The exponential moving average is an extension of the traditional moving average in which a weighting factor is applied that decreases exponentially with the age of the observation. As such, the EMA for a series X may be calculated recursively:

\[
S_t =
\begin{cases}
X_1, & t = 1 \\
\alpha \cdot X_t + (1-\alpha) \cdot S_{t-1}, & t > 1
\end{cases}
\tag{56}
\]

Where:

• $\alpha$ is a value between 0 and 1 which controls how quickly older observations are discounted.

• $X_t$ is the value of the series at time $t$.

• $S_t$ is the EMA value at time $t$.

In a traditional EMA, $\alpha$ is set to $\frac{2}{N+1}$, where $N$ is the "window" of the moving average.

4) Smoothed Moving Average (SMMA)

Smoothed Moving Average (or Wilder's Moving Average) is another extension, this time of the EMA, the difference being that $\alpha = \frac{1}{N}$, where $N$ is the "window" of the moving average.

We use the different methods to construct several price-driven technical indicators to aid our models in learning.
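
As a minimal sketch of how these indicators could be computed, assuming daily closing prices are held in a pandas Series; the 14-day RSI window is the conventional default, not something specified in the text above:

```python
import pandas as pd

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index from average upward and downward price changes."""
    change = close.diff()
    avg_gain = change.clip(lower=0.0).rolling(window).mean()     # average upward change
    avg_loss = (-change.clip(upper=0.0)).rolling(window).mean()  # average downward change
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)

def sma(close: pd.Series, window: int) -> pd.Series:
    """Simple moving average over a fixed window."""
    return close.rolling(window).mean()

def ema(close: pd.Series, window: int) -> pd.Series:
    """Exponential moving average with alpha = 2 / (window + 1), as in equation (56)."""
    return close.ewm(alpha=2.0 / (window + 1), adjust=False).mean()

def smma(close: pd.Series, window: int) -> pd.Series:
    """Smoothed (Wilder's) moving average with alpha = 1 / window."""
    return close.ewm(alpha=1.0 / window, adjust=False).mean()
```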

7.2.1 GICS Sector Classifications

The Global Industry Classification Standard (GICS) is an industry taxonomy widely used in the financial world. The GICS structure consists of 11 sectors, 24 industry groups, 69 industries and 158 sub-industries, in which lower levels of classification are nested within each sector, industry group and so on.

A common problem when working with large datasets across sectors and industries is that key figures can be difficult to compare, e.g. you would not have the same expectations of a healthcare company as you would of an IT company. Thus, when utilizing predefined comparison parameters such as book-to-market value or size, certain companies will always look preferable simply due to the nature of the company. In an attempt to construct somewhat homogeneous groups, the assets are typically divided into sectors, which allows the analyst to identify and utilize whatever preferable characteristics the groups may display. GICS provides a framework which divides assets into homogeneous groups, within which their performance characteristics can be more fairly compared.

In our case, one can easily imagine that the XGBoost model could assign higher weight to certain fundamental data types and, along with it, select assets which score highly on these types. If the model trains itself to prefer certain parameters, it might end up constructing a portfolio of rather homogeneous assets, which eventually results in what we refer to as "sector bets". To prevent this sort of behaviour, we have divided the training dataset based on the GICS sector classifications. This results in 11 datasets, in which all assets within a given set have identical sector classifications. In addition to this separation of the data, we have trained 11 models, each one fitted to a given sector.

We believe that this provides a setting in which the model can more easily compare numerical values across assets, which in the end should lead to greater predictive capabilities, both with respect to accuracy and robustness.
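
A hedged sketch of the sector separation, assuming the training data is a pandas DataFrame with a `gics_sector` column, a binary `target` column and a list of numerical feature columns; the column names and hyperparameters are illustrative, not the thesis' tuned configuration:

```python
import xgboost as xgb

def train_sector_models(train_df, feature_cols, target_col="target", sector_col="gics_sector"):
    """Fit one classifier per GICS sector to avoid implicit sector bets."""
    models = {}
    for sector, sector_df in train_df.groupby(sector_col):
        model = xgb.XGBClassifier(n_estimators=200, max_depth=4)  # illustrative hyperparameters
        model.fit(sector_df[feature_cols], sector_df[target_col])
        models[sector] = model
    return models
```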

Finally, it is worth noting that the GICS framework is not perfect, and companies which are classified identically can have very different characteristics. For example, General Electric has always been classified within industrials and manufacturing; however, a large share of GE's revenue comes from its financial businesses, so how should it be classified?

Despite its flaws, we have chosen to use the GICS classification framework due to its popularity and reputation within the financial industry. Furthermore, we are aware that there might exist structures which outperform GICS, but identifying them is not within the scope of this thesis.

7.2.2 Data Preprocessing for XGBoost

The XGBoost model is trained to predict the sign of the cumulative returns over the following 20 business days using a set of price- and fundamental data on a given day.

Due to the nature of the gradient boosting approach, which makes it difficult to process data across time simultaneously, we have enriched the dataset with additional features.

First and foremost, we have added the technical indicators presented in the previous section, in the hope that they can provide the model with sufficient historical price-related information to produce adequate predictions. In addition to the technical indicators, we have added lagged versions of each individual feature. The lagged features are variables which contain data on a given feature from a previous time step in the time series. More specifically, we have added 3 lagged versions of each variable, with lags of 5, 10 and 20 business days, which contain one week, two weeks and one month old data, respectively.
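
As a minimal sketch of the lag construction, assuming each stock's observations sit in a date-sorted pandas DataFrame (the column list is hypothetical):

```python
import pandas as pd

def add_lagged_features(df: pd.DataFrame, feature_cols, lags=(5, 10, 20)) -> pd.DataFrame:
    """Append lagged copies of each feature, here 5, 10 and 20 business days back."""
    out = df.copy()
    for lag in lags:
        for col in feature_cols:
            out[f"{col}_lag_{lag}"] = df[col].shift(lag)
    return out
```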

We utilize this approach, as we believe the model will produce better results by having more data available for each prediction. Furthermore, the intent is also that the model will be able to pick up time-series interdependencies, in which lagged features have some sort of correlation with present values.

7.2.3 Data Preprocessing for LSTM

Contrary to the requirements of XGBoost, a recurrent neural network is structured specifically to accommodate sequences of data. As such, lagged variables are not necessary in the same way. Instead, the network is fed a 3-dimensional dataset with dimensions Batch Size × Time Steps × Features, as exemplified in Figure 31:

Figure 31: Structure of time-series sequences

Like with XGBoost, we add the price-driven technical indicators described in a previous section. These serve as a stepping stone for the model and allow a lower degree of complexity, instead of relying solely on the model to construct them and find these interdependencies in the evolution of the price. The sequences are constructed at each time step t, such that at time t the model is fed data from t, t−1, . . . , t−[number of time steps].
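
To illustrate the Batch Size × Time Steps × Features layout from Figure 31, a small numpy sketch that stacks one 120-step window per prediction date (the array sizes are purely illustrative):

```python
import numpy as np

n_dates, n_features, time_steps = 500, 94, 120
data = np.random.rand(n_dates, n_features)          # placeholder (dates x features) matrix

# One sequence per prediction date: rows t-119 ... t stacked along the second axis
sequences = np.stack([data[t - time_steps + 1 : t + 1]
                      for t in range(time_steps - 1, n_dates)])
print(sequences.shape)   # (381, 120, 94) -> Batch Size x Time Steps x Features
```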

For our RNN LSTM network, we chose a size of 120 time steps (≈ 6 months of data). The downside of this from a computer-science perspective is memory: you effectively increase the size of your data by a factor of the number of time steps you choose to feed the network. Traditionally, when training a neural network (or any machine learning model for that matter), your computer will load the training and validation data into memory (RAM). However, consider the size of our dataset of roughly 2 GB and then multiply that by 120. Unfortunately, none of our hardware sports 240 GB of RAM, so to overcome this we employ a number of different techniques, namely splitting the data into GICS classifications and using rolling-forecast-origin re-calibration with fixed training set sizes. However, this is still not enough to reduce the size to a manageable point. As such, we resort to generating the sequences in parallel with the model training, utilizing the GPU for training and validation and the CPU for constructing sequences to feed the model without overloading our memory capacity.
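
A hedged sketch of how such on-the-fly sequence generation could look, using a `tf.keras.utils.Sequence` so that batches are assembled lazily on the CPU while the GPU trains; the class name, batch size and variable names are our own illustration, not the exact implementation:

```python
import numpy as np
import tensorflow as tf

class SequenceGenerator(tf.keras.utils.Sequence):
    """Builds (batch, time_steps, features) batches lazily from 2-D numpy arrays."""

    def __init__(self, features, targets, time_steps=120, batch_size=64):
        self.features, self.targets = features, targets      # numpy arrays aligned by date
        self.time_steps, self.batch_size = time_steps, batch_size
        # Index of the last row of each valid sequence
        self.end_indices = np.arange(time_steps - 1, len(features))

    def __len__(self):
        return int(np.ceil(len(self.end_indices) / self.batch_size))

    def __getitem__(self, idx):
        ends = self.end_indices[idx * self.batch_size : (idx + 1) * self.batch_size]
        x = np.stack([self.features[e - self.time_steps + 1 : e + 1] for e in ends])
        y = self.targets[ends]
        return x, y
```

Such a generator can then be passed directly to the model's `fit` method, so only one batch of sequences exists in memory at any time.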

7.2.4 Target Creation

Our thesis implements LSTM and XGBoost, which are both supervised learning algorithms. Supervised models train by utilizing input data $x_i$ to make corresponding predictions $y_i$. The existence of the "true" $y$ values is what separates supervised from unsupervised learning, and their values are essential to the functionality of the model.

For the XGBoost algorithm, the target variable for a given set of observations $x$ is the sign of the cumulative return over the following 20 business days, where negative and positive returns are mapped according to $\{-, +\} \mapsto \{0, 1\}$.

For the LSTM model, the target variable for a given sequence of observations $x$ is the value of the cumulative return over the following 20 business days. As such, the LSTM model uses a regression approach, compared to the classification approach of XGBoost.
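
A minimal sketch of both targets, assuming `close` is a pandas Series of daily closing prices for one stock:

```python
import pandas as pd

def make_targets(close: pd.Series, horizon: int = 20):
    """Forward cumulative return over `horizon` business days and its {0, 1} sign."""
    forward_return = close.shift(-horizon) / close - 1.0   # LSTM regression target
    sign_label = (forward_return > 0).astype(int)          # XGBoost classification target
    # The last `horizon` observations have no defined target and should be dropped before training.
    return forward_return, sign_label
```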

7.2.5 Balancing Dataset

XGBoost performs binary classification, and the target classes are thus 0 and 1. To eliminate bias in the model, we have modified the training set such that there is an equal number of observations for each class. This is done by randomly removing observations from the dominating class until a balanced dataset is reached.
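
A hedged sketch of the undersampling step, assuming the training observations sit in a pandas DataFrame with a binary `target` column (the column name and random seed are illustrative):

```python
def balance_classes(df, target_col="target", seed=42):
    """Randomly drop observations from the larger class until both classes are equal in size."""
    n_minority = df[target_col].value_counts().min()
    balanced = (
        df.groupby(target_col, group_keys=False)
          .apply(lambda g: g.sample(n=n_minority, random_state=seed))
    )
    return balanced.sample(frac=1, random_state=seed)   # shuffle the balanced set
```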

With the purpose of effectively visualizing and evaluating our models, we decided to perform the same procedure on the validation set. This is done to prevent misleading, coincidental performance indicators. Imagine the model is biased towards "buy" signals and the validation set consists primarily of "buy" signals. The result of this behaviour would be a model which performs extremely well on the validation set, leading us to believe that the model describes general behaviour extremely well, when in reality the performance merely reflects the class imbalance in the validation data.
