
Econometrics


Ordinary Least Squares Regression (OLS)

A regression analysis aims to explain our variable of interest, known as the dependent variable, as a function of other variables, known as the independent or explanatory variables, by quantifying a single equation. The dependent variable is denoted as 𝑌, and the independent variables are denoted as 𝑋1, 𝑋2, … , 𝑋𝑘. We want to explain the movement in 𝑌 as a function of the movement in the 𝑋 variables, where this movement is also referred to as the variation (Stock & Watson, 2015).

The regression model can be written as:

Equation 16

$$E(Y \mid X_1, X_2, X_3) = f(X_1, X_2, X_3)$$

This can be explored further in, for example, a simple linear regression model or a multiple regression model; in this paper we have chosen to focus only on the multiple regression model. The multiple regression model can be expressed as:

Equation 17

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n$$

Where 𝑌𝑖 is the 𝑖th observation on the dependent variable and 𝑋1𝑖, 𝑋2𝑖, … , 𝑋𝑘𝑖 are the observations on each of the 𝑘 regressors, 𝑢𝑖 is the error term, and each coefficient 𝛽𝑘 is the expected slope with respect to its independent variable, showing the expected change in 𝑌𝑖 resulting from changing 𝑋𝑘𝑖 by one unit. Finally, there is the intercept 𝛽0, which shows the value of 𝑌 when all the independent variables are equal to zero. The estimators of the coefficients that minimize the sum of squared prediction mistakes are known as the ordinary least squares (OLS) estimators of the coefficients. They are denoted 𝛽̂0, 𝛽̂1, … , 𝛽̂𝑘 and are computed from a sample of 𝑛 observations of (𝑋1𝑖, … , 𝑋𝑘𝑖, 𝑌𝑖), 𝑖 = 1, … , 𝑛. These are estimators of the unknown true population coefficients and error term (Stock & Watson, 2015).

The OLS predicted values 𝑌̂𝑖 and residuals 𝑢̂𝑖 are expressed in the OLS multiple linear regression function as:

Equation 18

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \dots + \hat{\beta}_k X_{ki}, \quad i = 1, \dots, n$$

$$\hat{u}_i = Y_i - \hat{Y}_i, \quad i = 1, \dots, n$$
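To make this concrete, the following is a minimal sketch in Python of how a model like Equation 17 can be estimated by OLS and how the predicted values and residuals of Equation 18 are recovered. It uses the numpy and statsmodels libraries; the simulated data, the chosen coefficient values, and the variable names are illustrative assumptions rather than anything used in this paper.

```python
# Minimal sketch: estimate Equation 17 by OLS on simulated data and
# recover the fitted values and residuals of Equation 18.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 200, 3                            # n observations, k regressors
X = rng.normal(size=(n, k))              # X_1i, ..., X_ki
u = rng.normal(size=n)                   # error term u_i
beta = np.array([1.0, 0.5, -0.3, 2.0])   # illustrative beta_0, beta_1, ..., beta_k
y = beta[0] + X @ beta[1:] + u           # Y_i generated from the population model

X_c = sm.add_constant(X)                 # adds the column of ones for beta_0
res = sm.OLS(y, X_c).fit()

beta_hat = res.params                    # OLS estimators beta_hat_0, ..., beta_hat_k
y_hat = res.fittedvalues                 # predicted values Y_hat_i
u_hat = res.resid                        # residuals u_hat_i = Y_i - Y_hat_i
```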

The OLS assumptions:

1. Linearity, correct specification, and additive error
2. The error term has zero population mean
3. All independent variables are uncorrelated with the error term
4. Errors are uncorrelated across observations
5. The error term has constant variance
6. No perfect collinearity
7. The error term has a normal distribution with mean zero and constant variance

Suppose the first three least-squares assumptions hold and the errors are homoskedastic. In that case, the OLS estimator has the smallest variance of all conditionally unbiased linear estimators and is said to be BLUE, the Best Linear conditionally Unbiased Estimator. Similarly, the sample average is the most efficient estimator of the population mean among all linear unbiased estimators. The OLS estimators minimize the sum of squared prediction mistakes (Stock & Watson, 2015).

To estimate the standard deviation of the error term, we use the standard error of the regression (SER). This is a measure of the spread of the distribution of 𝑌 around the regression line and is calculated with the following formula (Stock & Watson, 2015).

Equation 19

$$SER = s_{\hat{u}}, \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n - k - 1}$$
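As a small illustration, the SER in Equation 19 can be computed directly from a vector of OLS residuals, for example the residuals produced in the sketch above. The helper function below is a hypothetical convenience, not something taken from Stock & Watson (2015).

```python
# Sketch: the standard error of the regression (Equation 19) from residuals.
import numpy as np

def standard_error_of_regression(u_hat: np.ndarray, k: int) -> float:
    """SER = sqrt(SSR / (n - k - 1)) for n residuals and k regressors."""
    n = u_hat.shape[0]
    ssr = np.sum(u_hat ** 2)             # sum of squared residuals (SSR)
    s_u_squared = ssr / (n - k - 1)      # degrees-of-freedom adjusted variance
    return float(np.sqrt(s_u_squared))
```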

SSR is the sum of squared residuals, and dividing by 𝑛 − 𝑘 − 1, the number of degrees of freedom, adjusts for the downward bias introduced by estimating 𝑘 + 1 coefficients. In multiple regression, the goodness of fit will increase whenever a regressor is added. This goodness of fit is known as the 𝑅2 and is the fraction of the sample variance of 𝑌𝑖 explained by the regressors. It is calculated with the following formula (Stock & Watson, 2015).

Equation 20

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

Where ESS is the explained sum of squares and TSS is the total sum of squares. This gives a non-negative output for 𝑅2, which lies between zero and one; the higher the 𝑅2, the better the independent variables are at explaining the variation of the dependent variable. However, although adding another variable will increase the 𝑅2, this can give a false sense of a better fit. One way to correct for an inflated 𝑅2 is to use the adjusted 𝑅2, which does not necessarily increase when another regressor is added. It is calculated with the formula below (Stock & Watson, 2015).

Equation 21

$$\bar{R}^2 = 1 - \frac{n - 1}{n - k - 1} \cdot \frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2}$$

When a regressor is added, the adjusted 𝑅2 will increase or decrease depending on which factor has the stronger effect: the SSR will decrease, which has a positive effect on the adjusted 𝑅2, but the factor (𝑛 − 1)/(𝑛 − 𝑘 − 1) will increase, which has a negative effect on the adjusted 𝑅2. Unlike the 𝑅2, the adjusted 𝑅2 can be negative, which happens when the regressors reduce the SSR by such a small amount that the reduction fails to offset the factor (𝑛 − 1)/(𝑛 − 𝑘 − 1) (Stock & Watson, 2015).
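As a brief sketch, Equations 20 and 21 translate directly into code; the helper functions below are illustrative, and a fitted statsmodels model reports the same quantities through its rsquared and rsquared_adj attributes.

```python
# Sketch: R-squared (Equation 20) and adjusted R-squared (Equation 21).
import numpy as np

def r_squared(y: np.ndarray, u_hat: np.ndarray) -> float:
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    ssr = np.sum(u_hat ** 2)             # sum of squared residuals
    return float(1.0 - ssr / tss)

def adjusted_r_squared(y: np.ndarray, u_hat: np.ndarray, k: int) -> float:
    n = y.shape[0]
    tss = np.sum((y - y.mean()) ** 2)
    ssr = np.sum(u_hat ** 2)
    return float(1.0 - (n - 1) / (n - k - 1) * ssr / tss)
```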

Internal Validity of Multiple Regression Analysis

Omitted Variable Bias

When omitted variable bias occurs, the first least-squares assumption is violated. It arises in the OLS estimator when a regressor is correlated with an omitted variable. For this to occur, two conditions need to be met: the first is that the regressor is correlated with the omitted variable, and the second is that the omitted variable is a determinant of the dependent variable 𝑌. The omitted variable bias can also be put in a mathematical formula, where the correlation between 𝑋𝑖 and 𝑢𝑖 is 𝑐𝑜𝑟𝑟(𝑋𝑖, 𝑢𝑖) = 𝜌𝑋𝑢, and we suppose that the second and third assumptions hold (Stock & Watson, 2015).

Equation 22

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}$$

This means that as the sample size increases, 𝛽̂1 will be close to 𝛽1 + 𝜌𝑋𝑢 ∙ (𝜎𝑢/𝜎𝑋) with increasingly high probability. The formula above summarizes the main concerns about omitted variable bias. It is a problem whether the sample size is large or small; the size of the bias depends on the correlation between the regressor and the error term; and the direction of the bias depends on whether they are positively or negatively correlated (Stock & Watson, 2015).
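A small simulation can illustrate Equation 22. In the hypothetical data-generating process below, an omitted variable W both determines Y and is correlated with the regressor X, so the short regression of Y on X alone is biased, while the regression that includes W recovers the true coefficient.

```python
# Sketch: omitted variable bias in a simulated example (true beta_1 = 2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000
W = rng.normal(size=n)                   # omitted determinant of Y
X = 0.8 * W + rng.normal(size=n)         # regressor correlated with W
Y = 1.0 + 2.0 * X + 3.0 * W + rng.normal(size=n)

short = sm.OLS(Y, sm.add_constant(X)).fit()                       # omits W
long = sm.OLS(Y, sm.add_constant(np.column_stack([X, W]))).fit()  # includes W

print(short.params[1])   # noticeably above 2: upward bias, since corr(X, u) > 0
print(long.params[1])    # close to the true value of 2
```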

Missing Data and Sample Selection Bias

Missing data can be a threat to internal validity when the reason for the missing data is the sample selection process. This is known as sample selection bias, which arises when the selection process influences data availability and is related to the dependent variable. This induces correlation between one or more independent variables and the error term, which causes bias and inconsistency of the OLS estimator (Stock & Watson, 2015).

Simultaneous Causality Bias

Simultaneous causality occurs when causality runs not only from the independent variables to the dependent variable but also the other way around, so that causality flows in both directions. This reverse causality makes the independent variables correlated with the error term in the population regression of interest (Stock & Watson, 2015).

Inconsistency of OLS Standard Errors

There are two main reasons for inconsistent standard errors: heteroskedasticity and correlation of the error term across observations. If the regression error is heteroskedastic, then the homoskedasticity-only standard errors are not reliable for hypothesis tests and confidence intervals. If the OLS assumptions hold and the errors are homoskedastic, then the OLS estimators 𝛽̂0 and 𝛽̂1 are efficient among linear and unbiased estimators (Stock & Watson, 2015).

In econometrics, heteroskedasticity commonly arises, as economic theory rarely gives any reason for the errors to be homoskedastic. Thus, it is prudent to assume that the errors are heteroskedastic unless there are compelling reasons to believe otherwise (Stock & Watson, 2015).
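In practice this means reporting heteroskedasticity-robust standard errors unless homoskedasticity can be defended. The sketch below, on illustrative simulated data whose error variance depends on the regressor, shows one way to obtain robust (HC1) standard errors with statsmodels.

```python
# Sketch: homoskedasticity-only versus heteroskedasticity-robust standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1_000
x = rng.uniform(1, 5, size=n)
u = rng.normal(scale=x, size=n)          # heteroskedastic: Var(u|x) grows with x
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
homosk = sm.OLS(y, X).fit()                  # default (homoskedasticity-only) SEs
robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust SEs

print(homosk.bse)   # standard errors that rely on homoskedasticity
print(robust.bse)   # robust standard errors, the safer choice here
```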

Multicollinearity

When one regressor is a perfect linear combination of the other regressors, there is perfect multicollinearity. This problem typically arises in the regression specification, where it can easily be located, but if that is not the case, statistical programs have solutions and tests to check for perfect multicollinearity. Perfect multicollinearity means that one column of 𝑋 is a perfect linear combination of other columns of 𝑋 (Stock & Watson, 2015).
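One simple check for perfect multicollinearity is the rank of the regressor matrix (including the constant): a rank below the number of columns means some column is an exact linear combination of the others. The example below is purely illustrative.

```python
# Sketch: detecting perfect multicollinearity via the rank of the X matrix.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2                       # exact linear combination of x1 and x2

X = sm.add_constant(np.column_stack([x1, x2, x3]))
print(np.linalg.matrix_rank(X), X.shape[1])   # rank 3 < 4 columns: perfect multicollinearity
```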

Two-sided T-test

If the sample size is large and OLS assumptions one to four are fully satisfied, then the OLS estimators have an asymptotic joint normal distribution, the heteroskedasticity-robust estimator of the covariance matrix is consistent, and the heteroskedasticity-robust OLS 𝑡-statistic has an asymptotic standard normal distribution (Stock & Watson, 2015).

Consider the 95% two-sided confidence interval for the coefficient 𝛽𝑗: this is the interval that contains the true value of 𝛽𝑗 with 95% probability. It is also the set of values of 𝛽𝑗 that cannot be rejected by a 5% two-sided hypothesis test. When we have a large sample size, the 95% confidence interval is (Stock & Watson, 2015):

Equation 23

$$95\%\ \text{confidence interval for}\ \beta_j = \left[ \hat{\beta}_j - 1.96\,SE(\hat{\beta}_j),\ \hat{\beta}_j + 1.96\,SE(\hat{\beta}_j) \right]$$

The critical value for a 99% confidence interval would then be 2.58. However, the exact critical value is determined by the number of degrees of freedom in the specific sample (Stock & Watson, 2015).
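As a sketch, the interval in Equation 23 can be computed directly from a coefficient estimate and its standard error. The helper below is illustrative; a fitted statsmodels model gives the same interval via conf_int, which uses critical values based on the degrees of freedom, consistent with the remark above.

```python
# Sketch: a two-sided confidence interval for a single coefficient (Equation 23).
def confidence_interval(beta_hat_j: float, se_j: float, critical_value: float = 1.96):
    """Returns (lower, upper); use 2.58 as the critical value for a 99% interval."""
    return (beta_hat_j - critical_value * se_j,
            beta_hat_j + critical_value * se_j)

# A fitted statsmodels result gives the same directly, e.g. res.conf_int(alpha=0.05).
```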

The null hypothesis is that the coefficient takes on a specific value, which in this paper is zero, meaning that the population coefficient on the regressor is equal to 0. The alternative hypothesis specifies what is true if the null hypothesis is not. In this case, the alternative hypothesis says that the coefficient is unequal to the specific value zero. This can be written as (Stock & Watson, 2015):

$$H_0: \beta_j = 0 \quad \text{against} \quad H_1: \beta_j \neq 0$$

To test the null hypothesis, we start by calculating the standard error of the given coefficient as follows:

Equation 24

$$SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_j}}$$

Where

Equation 25

$$\hat{\sigma}^2_{\hat{\beta}_j} = \frac{1}{n} \cdot \frac{\dfrac{1}{n-2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \, \hat{u}_i^2}{\left[ \dfrac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right]^2}$$
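A sketch of how Equation 25 (stated here for a single-regressor model, hence the 𝑛 − 2 adjustment) and Equation 24 can be computed with numpy is shown below; the function name and its inputs are illustrative assumptions.

```python
# Sketch: heteroskedasticity-robust variance (Equation 25) and standard error
# (Equation 24) of the slope in a single-regressor model.
import numpy as np

def robust_se_slope(x: np.ndarray, u_hat: np.ndarray) -> float:
    n = x.shape[0]
    x_dev = x - x.mean()
    numerator = np.sum(x_dev ** 2 * u_hat ** 2) / (n - 2)
    denominator = (np.sum(x_dev ** 2) / n) ** 2
    var_beta_hat = numerator / n / denominator   # sigma_hat^2 of beta_hat_j
    return float(np.sqrt(var_beta_hat))          # SE(beta_hat_j)
```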

It is then possible to compute the 𝑡-statistic as:

Equation 26

$$t = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)}$$
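A minimal sketch of Equation 26 is given below; the helper function is illustrative, and a fitted statsmodels model reports the same values through its tvalues attribute.

```python
# Sketch: the t-statistic for testing H0: beta_j = beta_j,0 (Equation 26).
def t_statistic(beta_hat_j: float, se_j: float, beta_j_null: float = 0.0) -> float:
    return (beta_hat_j - beta_j_null) / se_j

# In large samples, |t| > 1.96 rejects H0: beta_j = 0 at the 5% significance level.
```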
