
Data analysed in this thesis consists of daily observations of the net asset value, (2.1), from 20 different funds and their underlying indices. Data is extracted from four different sources, namely the iShares webpage, the SPDR webpage, Bloomberg and http://www.bullionvault.com/gold-price-chart.do. These institutions operate in different parts of the world, meaning they operate with different calendars and banking holidays. As data consists of price registrations, there are no observations on banking holidays, where the exchanges are closed. In order to make the funds comparable, similar date vectors must be defined, so as not to compare April 2nd in one fund to April 4th in a different fund, or risk that a vector of 1000 observations in one fund covers a longer period than a vector of 1000 observations in a different fund. This introduces some instances of missing values in the sets where some countries/institutions have celebrated a banking holiday while other countries/institutions have been working and registering prices. Apart from the banking holidays, a few instances of unexplained missing observations occur in all of the datasets. Lastly, a few instances of very extreme observations occur. A threshold is set for determining how extreme an observation can be, relative to its adjacent observations, before it is categorised as an erroneous registration and treated as missing. This will be elaborated on shortly.

A number of operations are implemented to prepare data for analysis.

These are described in the following.

Merge data columns to contain all unique dates This step builds one date vector for each set, containing all business days between the earliest and the latest date. These two endpoints are extracted and the function interpolates with daily intervals, omitting weekends. Afterwards the NAV observations are added, implying NA values where there is no price. The NAs are overcome by interpolation, as described in the next step. A sketch of this merging step is given below.
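The exact implementation is not reproduced in the thesis; the following is a minimal sketch of the idea, assuming nav is a data frame with columns date and nav for a single fund (the object names and the merge call are illustrative, not the thesis code):

all.days      <- seq(min(nav$date), max(nav$date), by = "day")
business.days <- all.days[!format(all.days, "%u") %in% c("6", "7")]   # drop Saturdays and Sundays

# merge the NAV observations onto the full business-day vector;
# days without a price registration become NA
merged <- merge(data.frame(date = business.days), nav, by = "date", all.x = TRUE)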

Remove missing data The data gaps are interpolated. When a missing value is detected, the cell is interpolated using the previous value and the first real value going forward, taking into account the number of missing values in between:

a = min(which(!is.na(x[i:nrow(x), j])))   # steps ahead (within column j) to the first observed value
if(i > 1){
  if(i < 5){   # near the start of the column only forward observations are available
    sd = sd(x[(i+2):(i+a+10), j], na.rm = T)
    e  = rnorm(1, mean = 0, sd = sd)
    x[i, j] = x[i-1, j] + (x[i+a-1, j] - x[i-1, j])*1/a + e
  } else {     # otherwise use observations on both sides of the gap
    sd = sd(c(x[(i-5):(i-1), j], x[(i+a):(i+a+5), j]), na.rm = T)
    e  = rnorm(1, mean = 0, sd = sd)
    x[i, j] = x[i-1, j] + (x[i+a-1, j] - x[i-1, j])*1/a + e
  }
}

A small amount of noise, e, is added to the interpolation. e is normally distributed with zero mean, and its standard deviation is determined from the standard deviation of observations in a small interval on either side of the relevant point. Due to the possibility of small intervals of missing observations, the real number of observations used to compute the standard deviation of the noise may be fewer than intended going ahead of the relevant point; in that case fewer points are used to determine the standard deviation. It should also be noted that points at the end of such an interval will be added noise with a standard deviation computed using up to four artificial observations.
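As a small worked illustration of the rule (with made-up numbers), consider a column with two missing values between an observed 100 and an observed 103:

x <- c(100, NA, NA, 103, 104)            # toy column with a gap of two days
i <- 2
a <- min(which(!is.na(x[i:length(x)])))  # a = 3: the next observed value is 3 steps ahead
# linear step: (103 - 100)/3 = 1, so x[2] is filled with 101 plus the noise term e (omitted here)
x[i] <- x[i-1] + (x[i+a-1] - x[i-1])*1/a

Filling the second gap day in the same way then gives 102 plus noise, so the gap is bridged linearly.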

For a few missing values a linear interpolation is our best guess and a reasonable estimate. Yet for larger gaps of missing data it is no longer reasonable to assume that a linear interpolation is suitable. In the examined data the gaps span from 1 to at most 5 missing observations. Within this range the method is considered applicable without loss of quality in data. Table 3.1 shows how the amount of missing data affects each pair.

The length of the interval is determined by the time period used for the analysis of differences in returns, that is, the time since the fund inception date.

The fractions stated here are representative of the fractions of missing data over the lifetime of the index as well, with one exception.

The TOPIX fund is the dataset most widely affected by missing data, with a fraction of 5.55 percent of the final dataset being computed.

When expanding the focus area to the lifetime of the index, RWX carries only monthly observations in the period January 1st, 1993 to December 31st, 1998. In this period the method of expanding data to contain all banking days leaves 21-23 missing values for every observed value.

In this case the method described above for determining the standard deviation of the added noise is not applicable. Instead it is assumed that the standard deviation over the period is fixed, which is supported by Figure 3.1. The standard deviation of the added noise is computed as the standard deviation of the gaps between the observations, adjusted for the number of values to be interpolated (on average 22 days). The applied standard deviation is 1.99. This is marked by the red line in the bottom panel of Figure 3.1. It is seen that 1.99 roughly corresponds to the standard deviation in observations 4500-5000 of the data. This period of the observed data is the one that most closely resembles the interpolated period, thereby supporting the computed standard deviation of 1.99.
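The exact form of the adjustment is not spelled out in the text; one natural reading, assuming a random-walk-type scaling in which the variance of a 22-day change is 22 times the daily variance, would be as follows (the object monthly.nav and the square-root scaling are assumptions made for illustration):

gaps   <- diff(monthly.nav)     # month-to-month changes in the observed NAV
sd.day <- sd(gaps) / sqrt(22)   # scale down to a daily standard deviation,
                                # 22 being the average number of interpolated days per gap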

Ticker   Date Length   Index Missing   Index Fraction (%)   Fund Missing   Fund Fraction (%)
DGT      2978          112             3.76                 3              0.10
ELR      1641          58              3.53                 28             1.71
EMBI     1139          50              4.39                 50             4.39
FEZ      2442          111             4.55                 65             2.66
FXC      1940          24              1.24                 24             1.24
GLD      1920          74              3.85                 59             3.07
IBCI     1659          23              1.39                 23             1.39
IBGL     1384          22              1.59                 22             1.59
IBGS     1518          22              1.45                 22             1.45
IEEM     1659          18              1.08                 18             1.08
IHYG     1194          60              5.03                 0              0
IJPN     1954          28              1.43                 28             1.43
IMEU     1234          16              1.30                 16             1.30
INAA     1519          18              1.18                 18             1.18
LQDE     2314          55              2.38                 55             2.38
RWX      1357          48              3.54                 0              0
STN      2735          49              1.79                 2              0.07
STZ      2735          49              1.79                 2              0.07
TOPIX    1621          42              2.59                 90             5.55
XOP      1486          51              3.43                 30             2.02

Table 3.1: A specification of the amount of missing data in each pair. Missing is the number of interpolated values over the Date Length banking days since fund inception, and Fraction is the corresponding percentage, stated separately for the index and the fund. The TOPIX fund is the dataset with the highest fraction of missing data, with 5.55 percent of the points in the final dataset being computed.

[Figure 3.1: two panels plotted against the observation index 0-5000, iNAV in the top panel and the one-point standard deviation (st.dev.) in the bottom panel.]

Figure 3.1: The top panel shows the NAV process for the index RWX. The blue line is the interpolated part, where 22 out of every 23 values have been generated by interpolation. The bottom panel shows the one-point standard deviation for the observed data. The green line is a 200-point moving average and the red line is the applied standard deviation of 1.99.

Level outliers This step identifies the extreme outliers in data and levels them by interpolation. It is assumed that extreme outliers are the result of erroneous registrations. Thus there is an issue in defining the limit between what is accepted as likely effects of the market and the threshold beyond which the deviation can no longer be ascribed to valid market dynamics.

By default the threshold is set to a factor of 2.5 relative to the previous value.

A noteworthy part of the outliers comes in pairs of two. Thus, interpolating with the immediate neighbour would result in half a mis-registration. The standard deviation is computed using the adjacent 10 points on either side of the relevant observation. Going forward, the first adjacent point is skipped, due to the often seen pairing of outliers. This is also the reason for expanding the interval of points used to compute the standard deviation: by expanding the interval, any extreme events in the interval become linearly less prominent in computing the standard deviation. The script is shown below, where y defaults to 2.5 and x is the relevant data:

a = as.numeric(x[1:(length(x)-2)])    # series shifted so that a[k] = x[k]
b = as.numeric(x[3:length(x)])        # series shifted so that b[k] = x[k+2]
idx = which(!is.na(a) & !is.na(b))    # positions where both neighbours are observed
for(i in (min(idx)+1):(max(idx)+1)){  # i runs from the first to the last such middle point
  if(i > 5 && i < (length(x)-4)){     # keep clear of the ends of the vector
    if(abs(x[i-1]/x[i]) > y || abs(x[i-1]/x[i]) < 1/y){
      sd = sd(c(x[(i-5):(i-1)], x[(i+1):(i+5)]))
      e  = rnorm(1, mean = 0, sd = sd)
      x[i] = (x[i-1] + x[i+1])/2 + e
    }
  }
}

This method presents some problems at either end of the vector, where it is not possible to compare the observations to observations both before and after. In these cases outliers are detected by deviating by more than a factor y from the mean of the following, respectively preceding, 10 observations; a sketch of this boundary check is given below.
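The boundary handling is not reproduced explicitly in the thesis; a minimal sketch of how the check could look at the start of the vector, under the assumption that a flagged point is simply treated as missing and interpolated as above, is:

# start of the vector: compare against the mean of the 10 following observations
m = mean(x[(i+1):(i+10)], na.rm = TRUE)
if(abs(x[i]/m) > y || abs(x[i]/m) < 1/y){
  x[i] = NA   # flag as an erroneous registration; treated as missing
}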

3.1 Data transformation

There is a tradition in econometrics of considering log returns instead of simple returns. This transformation is implemented to improve the behaviour of data for modelling purposes. The log transformation decreases the magnitude of extreme events, making it easier to establish a model. Further, the log transformation might have a beneficial effect on properties such as stationarity in mean and volatility, and normality.

However, as the data takes on numerically small and very close values, the transformation is largely irrelevant. In this dataset the transformation did not impart a substantial improvement in data quality. Tests for normality as well as stationarity in mean and volatility were performed and produce equal results for either data series.
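This is the usual first-order argument: writing $P_t$ for the price and $r_t = P_t/P_{t-1} - 1$ for the simple return, the log return nearly coincides with the simple return when returns are close to zero, since
\[
\log\!\left(\frac{P_t}{P_{t-1}}\right) \;=\; \log(1 + r_t) \;\approx\; r_t \quad \text{for small } r_t .
\]
For a daily return of, say, 0.5 percent the two differ only by about $1.2\cdot 10^{-5}$.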

For this reason the simple return, as depicted in (5.1), is considered, in order not to impose any more complexity on the models than necessary.

The KPSS test for the null hypothesis that data is level or trend stationary [6] was performed and confirms stationarity in all processes, with p-values exceeding 0.1. The Shapiro-Wilk test for normality [7] uniformly rejects normality, with p-values smaller than 5.7e-04.
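These tests are available in standard R packages; a minimal sketch of how such a check could be run on a single return series (the object returns and the package choice are assumptions, not the thesis code) is:

library(tseries)                    # provides kpss.test()

kpss.test(returns, null = "Level")  # H0: level stationarity; a large p-value supports H0
kpss.test(returns, null = "Trend")  # H0: trend stationarity
shapiro.test(returns)               # H0: normality; a small p-value rejects normality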

Chapter 4

The funds and indices