
Econometrics


Ordinary Least Squares Regression (OLS)

A regression analysis aims to explain our variable of interest, known as the dependent variable, as a function of other variables, known as the independent or explanatory variables, by quantifying a single equation. The dependent variable is denoted as 𝑌, and the independent variables are denoted as 𝑋1, 𝑋2, … , 𝑋𝑘. We want to explain the movement in 𝑌 as a function of the movement in the 𝑋 variables, where this movement is also referred to as the variation (Stock & Watson, 2015).

The regression model can be written as:

Equation 16

$$E(Y \mid X_1, X_2, X_3) = f(X_1, X_2, X_3)$$

This can be explored further in, for example, a simple linear regression model or a multiple regression model; in this paper we have chosen to focus only on the multiple regression model. The multiple regression model can be expressed as:

Equation 17

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i, \quad i = 1, \dots, n$$

Where 𝑌𝑖 is the 𝑖th observation on the dependent variable and 𝑋1𝑖, 𝑋2𝑖, … , 𝑋𝑘𝑖 are the observations on each of the 𝑘 regressors, 𝑢𝑖 is the error term, and each coefficient 𝛽𝑘 is the expected slope with respect to its independent variable, showing the expected change in 𝑌𝑖 resulting from changing 𝑋𝑘𝑖 by one unit. Finally, there is the intercept 𝛽0, which shows the value of 𝑌 when all the independent variables are equal to zero. The estimators of the coefficients that minimize the sum of squared prediction mistakes are known as the ordinary least squares (OLS) estimators of the coefficients. They are denoted 𝛽̂0, 𝛽̂1, … , 𝛽̂𝑘 and are computed from a sample of 𝑛 observations of (𝑋1𝑖, … , 𝑋𝑘𝑖, 𝑌𝑖), 𝑖 = 1, … , 𝑛. These are estimators of the unknown true population coefficients and error term (Stock & Watson, 2015).

The OLS predicted values 𝑌̂𝑖 and residuals 𝑢̂𝑖 are expressed in the OLS multiple linear regression function as:

Equation 18

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \dots + \hat{\beta}_k X_{ki}, \quad i = 1, \dots, n$$

$$\hat{u}_i = Y_i - \hat{Y}_i, \quad i = 1, \dots, n$$
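To make this concrete, the following is a minimal sketch in Python of how a model like Equation 17 can be estimated by OLS and how the predicted values and residuals of Equation 18 are recovered. It uses the numpy and statsmodels libraries; the simulated data, the chosen coefficient values, and the variable names are illustrative assumptions rather than anything used in this paper.

```python
# Minimal sketch: estimate Equation 17 by OLS on simulated data and
# recover the fitted values and residuals of Equation 18.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k = 200, 3                            # n observations, k regressors
X = rng.normal(size=(n, k))              # X_1i, ..., X_ki
u = rng.normal(size=n)                   # error term u_i
beta = np.array([1.0, 0.5, -0.3, 2.0])   # illustrative beta_0, beta_1, ..., beta_k
y = beta[0] + X @ beta[1:] + u           # Y_i generated from the population model

X_c = sm.add_constant(X)                 # adds the column of ones for beta_0
res = sm.OLS(y, X_c).fit()

beta_hat = res.params                    # OLS estimators beta_hat_0, ..., beta_hat_k
y_hat = res.fittedvalues                 # predicted values Y_hat_i
u_hat = res.resid                        # residuals u_hat_i = Y_i - Y_hat_i
```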

The OLS assumptions:

1. Linearity, correct specification, and additive error
2. The error term has zero population mean
3. All independent variables are uncorrelated with the error term
4. Errors are uncorrelated across observations
5. The error term has constant variance
6. No perfect collinearity
7. The error term has a normal distribution with mean zero and constant variance

Suppose the first three least-squares assumptions hold and the errors are homoskedastic. In that case, the OLS estimator has the smallest variance of all conditionally unbiased linear estimators and is said to be BLUE, the Best Linear conditionally Unbiased Estimator. Similarly, the sample average is the most efficient estimator of the population mean among all linear unbiased estimators. The OLS estimators minimize the sum of squared prediction mistakes (Stock & Watson, 2015).

To estimate the standard deviation of the error term, we use the standard error of the regression (SER). This is a measure of the spread of the distribution of 𝑌 around the regression line and is calculated with the following formula (Stock & Watson, 2015).

Equation 19

$$SER = s_{\hat{u}}, \quad \text{where} \quad s_{\hat{u}}^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} \hat{u}_i^2 = \frac{SSR}{n - k - 1}$$
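As a small illustration, the SER in Equation 19 can be computed directly from a vector of OLS residuals, for example the residuals produced in the sketch above. The helper function below is a hypothetical convenience, not something taken from Stock & Watson (2015).

```python
# Sketch: the standard error of the regression (Equation 19) from residuals.
import numpy as np

def standard_error_of_regression(u_hat: np.ndarray, k: int) -> float:
    """SER = sqrt(SSR / (n - k - 1)) for n residuals and k regressors."""
    n = u_hat.shape[0]
    ssr = np.sum(u_hat ** 2)             # sum of squared residuals (SSR)
    s_u_squared = ssr / (n - k - 1)      # degrees-of-freedom adjusted variance
    return float(np.sqrt(s_u_squared))
```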

SSR is the sum of squared residuals, and dividing by 𝑛 − 𝑘 − 1, the number of degrees of freedom, adjusts for the downward bias introduced by estimating 𝑘 + 1 coefficients. In multiple regression, the goodness of fit will increase whenever a regressor is added. This goodness of fit is known as the 𝑅2 and is the fraction of the sample variance of 𝑌𝑖 explained by the regressors. It is calculated with the following formula (Stock & Watson, 2015).

Equation 20

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}$$

Where ESS is the explained sum of squares and TSS is the total sum of squares. This gives a non-negative output for 𝑅2, which lies between zero and one; the higher the 𝑅2, the better the independent variables are at explaining the variation of the dependent variable. However, although adding another variable will increase the 𝑅2, this can give a false sense of a better fit. One way to correct for an inflated 𝑅2 is to use the adjusted 𝑅2, which does not necessarily increase when another regressor is added. It is calculated with the formula below (Stock & Watson, 2015).

Equation 21

$$\bar{R}^2 = 1 - \frac{n - 1}{n - k - 1} \cdot \frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2}$$

When a regressor is added, the adjusted 𝑅2 will increase or decrease depending on which factor has the stronger effect: the SSR will decrease, which has a positive effect on the adjusted 𝑅2, but the factor (𝑛 − 1)/(𝑛 − 𝑘 − 1) will increase, which has a negative effect on the adjusted 𝑅2. Unlike the 𝑅2, the adjusted 𝑅2 can be negative, which happens when the regressors reduce the SSR by such a small amount that the reduction fails to offset the factor (𝑛 − 1)/(𝑛 − 𝑘 − 1) (Stock & Watson, 2015).
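As a brief sketch, Equations 20 and 21 translate directly into code; the helper functions below are illustrative, and a fitted statsmodels model reports the same quantities through its rsquared and rsquared_adj attributes.

```python
# Sketch: R-squared (Equation 20) and adjusted R-squared (Equation 21).
import numpy as np

def r_squared(y: np.ndarray, u_hat: np.ndarray) -> float:
    tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
    ssr = np.sum(u_hat ** 2)             # sum of squared residuals
    return float(1.0 - ssr / tss)

def adjusted_r_squared(y: np.ndarray, u_hat: np.ndarray, k: int) -> float:
    n = y.shape[0]
    tss = np.sum((y - y.mean()) ** 2)
    ssr = np.sum(u_hat ** 2)
    return float(1.0 - (n - 1) / (n - k - 1) * ssr / tss)
```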

Internal Validity of Multiple Regression Analysis

Omitted Variable Bias

When omitted variable bias occurs, the first least-squares assumption is violated. It arises in the OLS estimator when a regressor is correlated with an omitted variable. For this to occur, two conditions need to be met: the first is that the regressor is correlated with the omitted variable, and the second is that the omitted variable is a determinant of the dependent variable 𝑌. The omitted variable bias can also be put in a mathematical formula, where the correlation between 𝑋𝑖 and 𝑢𝑖 is 𝑐𝑜𝑟𝑟(𝑋𝑖, 𝑢𝑖) = 𝜌𝑋𝑢, and we suppose that the second and third assumptions hold (Stock & Watson, 2015).

Equation 22

$$\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu} \cdot \frac{\sigma_u}{\sigma_X}$$

This means that as the sample size increases, 𝛽̂1 will be close to 𝛽1 + 𝜌𝑋𝑢 ∙ (𝜎𝑢/𝜎𝑋) with increasingly high probability. The formula above summarizes the main concerns about omitted variable bias. It is a problem whether the sample size is large or small; the size of the bias depends on the correlation between the regressor and the error term; and the direction of the bias depends on whether they are positively or negatively correlated (Stock & Watson, 2015).
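A small simulation can illustrate Equation 22. In the hypothetical data-generating process below, an omitted variable W both determines Y and is correlated with the regressor X, so the short regression of Y on X alone is biased, while the regression that includes W recovers the true coefficient.

```python
# Sketch: omitted variable bias in a simulated example (true beta_1 = 2).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000
W = rng.normal(size=n)                   # omitted determinant of Y
X = 0.8 * W + rng.normal(size=n)         # regressor correlated with W
Y = 1.0 + 2.0 * X + 3.0 * W + rng.normal(size=n)

short = sm.OLS(Y, sm.add_constant(X)).fit()                       # omits W
long = sm.OLS(Y, sm.add_constant(np.column_stack([X, W]))).fit()  # includes W

print(short.params[1])   # noticeably above 2: upward bias, since corr(X, u) > 0
print(long.params[1])    # close to the true value of 2
```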

Missing Data and Sample Selection Bias

Missing data can be a threat to internal validity when the reason for the missing data is the sample selection process. This is known as sample selection bias, which arises when the selection process influences data availability and is related to the dependent variable. This induces correlation between one or more independent variables and the error term, which causes bias and inconsistency of the OLS estimator (Stock & Watson, 2015).

Simultaneous Causality Bias

Simultaneous causality occurs when causality runs not only from the independent variables to the dependent variable but also the other way around, so that causality flows in both directions. This reverse causality makes the independent variables correlated with the error term in the population regression of interest (Stock & Watson, 2015).

Inconsistency of OLS Standard Errors

There are two main reasons for inconsistent standard errors: heteroskedasticity and correlation of the error term across observations. If the regression error is heteroskedastic, then the homoskedasticity-only standard errors are not reliable for hypothesis tests and confidence intervals. If the OLS assumptions hold and the errors are homoskedastic, then the OLS estimators 𝛽̂0 and 𝛽̂1 are efficient among linear and unbiased estimators (Stock & Watson, 2015).

In econometrics, heteroskedasticity commonly arises, as economic theory rarely gives any reason for the errors to be homoskedastic. Thus, it is prudent to assume that the errors are heteroskedastic unless there are compelling reasons to believe otherwise (Stock & Watson, 2015).
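In practice this means reporting heteroskedasticity-robust standard errors unless homoskedasticity can be defended. The sketch below, on illustrative simulated data whose error variance depends on the regressor, shows one way to obtain robust (HC1) standard errors with statsmodels.

```python
# Sketch: homoskedasticity-only versus heteroskedasticity-robust standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1_000
x = rng.uniform(1, 5, size=n)
u = rng.normal(scale=x, size=n)          # heteroskedastic: Var(u|x) grows with x
y = 1.0 + 0.5 * x + u

X = sm.add_constant(x)
homosk = sm.OLS(y, X).fit()                  # default (homoskedasticity-only) SEs
robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust SEs

print(homosk.bse)   # standard errors that rely on homoskedasticity
print(robust.bse)   # robust standard errors, the safer choice here
```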

Multicollinearity

When one regressor is a perfect linear combination of the other regressors, there is perfect multicollinearity. This problem typically arises in the regression specification, where it can easily be located, but if that is not the case, statistical programs have solutions and tests to check for perfect multicollinearity. Perfect multicollinearity means that one column of 𝑋 is a perfect linear combination of other columns of 𝑋 (Stock & Watson, 2015).
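One simple check for perfect multicollinearity is the rank of the regressor matrix (including the constant): a rank below the number of columns means some column is an exact linear combination of the others. The example below is purely illustrative.

```python
# Sketch: detecting perfect multicollinearity via the rank of the X matrix.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2.0 * x1 - x2                       # exact linear combination of x1 and x2

X = sm.add_constant(np.column_stack([x1, x2, x3]))
print(np.linalg.matrix_rank(X), X.shape[1])   # rank 3 < 4 columns: perfect multicollinearity
```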

Two-sided T-test

If the sample size is large and OLS assumptions one to four are fully satisfied, then the OLS estimators have an asymptotic joint normal distribution, the heteroskedasticity-robust estimator of the covariance matrix is consistent, and the heteroskedasticity-robust OLS 𝑡-statistic has an asymptotic standard normal distribution (Stock & Watson, 2015).

Consider the 95% two-sided confidence interval for the coefficient 𝛽𝑗: this is the interval that contains the true value of 𝛽𝑗 with 95% probability. It is also the set of values of 𝛽𝑗 that cannot be rejected by a 5% two-sided hypothesis test. When we have a large sample size, the 95% confidence interval is (Stock & Watson, 2015):

Equation 23

$$95\%\ \text{confidence interval for}\ \beta_j = \left[ \hat{\beta}_j - 1.96\,SE(\hat{\beta}_j),\ \hat{\beta}_j + 1.96\,SE(\hat{\beta}_j) \right]$$

The critical value for a 99% confidence interval would then be 2.58. However, the exact critical value is determined by the number of degrees of freedom in the specific sample (Stock & Watson, 2015).
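As a sketch, the interval in Equation 23 can be computed directly from a coefficient estimate and its standard error. The helper below is illustrative; a fitted statsmodels model gives the same interval via conf_int, which uses critical values based on the degrees of freedom, consistent with the remark above.

```python
# Sketch: a two-sided confidence interval for a single coefficient (Equation 23).
def confidence_interval(beta_hat_j: float, se_j: float, critical_value: float = 1.96):
    """Returns (lower, upper); use 2.58 as the critical value for a 99% interval."""
    return (beta_hat_j - critical_value * se_j,
            beta_hat_j + critical_value * se_j)

# A fitted statsmodels result gives the same directly, e.g. res.conf_int(alpha=0.05).
```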

The null hypothesis is that the coefficient takes on a specific value, which in this paper is zero, meaning that the population coefficient on the regressor is equal to 0. The alternative hypothesis specifies what is true if the null hypothesis is not. In this case, the alternative hypothesis says that the coefficient is unequal to the specific value zero. This can be written as (Stock & Watson, 2015):

$$H_0: \beta_j = 0 \quad \text{against} \quad H_1: \beta_j \neq 0$$

To test the null hypothesis, we start by calculating the standard error of the given coefficient as follows:

Equation 24

$$SE(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2_{\hat{\beta}_j}}$$

Where

Equation 25

$$\hat{\sigma}^2_{\hat{\beta}_j} = \frac{1}{n} \cdot \frac{\dfrac{1}{n-2} \sum_{i=1}^{n} (X_i - \bar{X})^2 \, \hat{u}_i^2}{\left[ \dfrac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right]^2}$$
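A sketch of how Equation 25 (stated here for a single-regressor model, hence the 𝑛 − 2 adjustment) and Equation 24 can be computed with numpy is shown below; the function name and its inputs are illustrative assumptions.

```python
# Sketch: heteroskedasticity-robust variance (Equation 25) and standard error
# (Equation 24) of the slope in a single-regressor model.
import numpy as np

def robust_se_slope(x: np.ndarray, u_hat: np.ndarray) -> float:
    n = x.shape[0]
    x_dev = x - x.mean()
    numerator = np.sum(x_dev ** 2 * u_hat ** 2) / (n - 2)
    denominator = (np.sum(x_dev ** 2) / n) ** 2
    var_beta_hat = numerator / n / denominator   # sigma_hat^2 of beta_hat_j
    return float(np.sqrt(var_beta_hat))          # SE(beta_hat_j)
```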

It is then possible to compute the 𝑡-statistic as:

Equation 26

$$t = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)}$$
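A minimal sketch of Equation 26 is given below; the helper function is illustrative, and a fitted statsmodels model reports the same values through its tvalues attribute.

```python
# Sketch: the t-statistic for testing H0: beta_j = beta_j,0 (Equation 26).
def t_statistic(beta_hat_j: float, se_j: float, beta_j_null: float = 0.0) -> float:
    return (beta_hat_j - beta_j_null) / se_j

# In large samples, |t| > 1.96 rejects H0: beta_j = 0 at the 5% significance level.
```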
