I suggest that we approach each violation of the classical regression model in a systematic way. In general, you'll need answers to several questions:
By the end of today's class, you should be able to answer these questions for the problem of heteroskedasticity.
The first burning issue today is a matter of spelling: Heteroskedasticity is the preferred spelling of those conscious of the Greek origins of the word -- "differing scatter" in contrast to "same scatter" (homoskedasticity) -- while heteroscedasticity is the spelling that appears most commonly in print.
The classical regression model assumes homoskedastic errors. Heteroskedasticity implies:
$$\operatorname{var}[e_i] = E[e_i^2] = \sigma_i^2 \neq \sigma_j^2, \quad i \neq j.$$
One can imagine many circumstances in which heteroskedasticity would arise.
1) Consider the relationship between housing expenditures and income or wealth. The lowest quality available housing may still eat up over 40% of a poor family's income, while Bill Gates has the option of living in a billion-dollar mansion -- or that same city hovel. Here variance rises with income: $\sigma_i^2 = f(I_i)$.
2) The importance of the billions of factors excluded from the model (because no one of them has a substantial impact on Y) may vary for subsets of the data. For example, in cross-sectional studies, these factors may be quite different across states or countries or firms.
What's wrong with heteroskedasticity?
What about the unbiasedness of the coefficients? Recall our proof that b is unbiased. Since the variance does not affect that derivation, we can conclude that the OLS regression coefficients remain unbiased when heteroskedasticity is present.
What about the variance? With heteroskedasticity
$$\operatorname{var}[b] = \sum c_i^2 E[e_i^2] = \sum c_i^2 \sigma_i^2 \neq \sigma^2 / \sum x_i^2$$
The variance of the OLS coefficient is much more difficult to calculate and (it is possible to show) the OLS estimator is no longer BLUE, i.e., it no longer has minimum variance among all linear unbiased estimators. This means that we estimate the regression coefficients with less precision than we would like.
Further, the standard errors of the OLS coefficients turn out to be biased, with the direction of the bias depending in a complicated fashion on the correlation between $\sigma_i^2$ and the RHS variables.
Bottom line: When heteroskedasticity results in the variance of the error term being correlated with the RHS variables, hypothesis testing is made invalid. At best, when no correlation is present, estimates are less precise, implying high standard errors, hence, low t-statistics.
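A small simulation can make both claims concrete. The sketch below is only an illustration under assumptions that are not in these notes: the true model is Y = 1 + 2X + e, the error standard deviation is taken to be 0.05 times X squared, and the helper name `ols_simple` is hypothetical. It draws many samples, runs OLS on each, and compares the average reported standard error with the actual spread of the slope estimates.

```python
import random
import statistics

def ols_simple(x, y):
    """Return (intercept, slope, conventional slope SE) for Y = a + bX + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    # Textbook formula s^2 / sum(x_i^2); it is only valid under homoskedasticity.
    se_b = (ssr / (n - 2) / sxx) ** 0.5
    return a, b, se_b

rng = random.Random(0)                     # fixed seed so the run is repeatable
x = [1 + 9 * i / 49 for i in range(50)]    # 50 evenly spaced X values on [1, 10]

slopes, reported_ses = [], []
for _ in range(2000):
    # Assumed error process: sd of e_i rises with X_i (sigma_i = 0.05 * X_i^2).
    y = [1.0 + 2.0 * xi + rng.gauss(0, 0.05 * xi ** 2) for xi in x]
    _, b, se_b = ols_simple(x, y)
    slopes.append(b)
    reported_ses.append(se_b)

mean_slope = statistics.mean(slopes)               # compare with the true slope, 2
sd_slope = statistics.stdev(slopes)                # actual sampling spread of b
mean_reported_se = statistics.mean(reported_ses)   # what the OLS output claims

print(f"mean slope estimate:       {mean_slope:.3f}")
print(f"actual sd of slope:        {sd_slope:.3f}")
print(f"average conventional SE:   {mean_reported_se:.3f}")
```

In this particular setup the variance rises with X, so the conventional standard error tends to understate the true sampling variability; with a different relationship between the variance and X the bias could go the other way.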
The perfect solution is to use weighted least squares (WLS) to estimate the regression coefficients. WLS addresses the problem of heteroskedasticity head-on by creating homoskedastic errors. Take the original model:
$$Y_i = a + bX_i + e_i$$
and create the transformed model
$$(Y_i/\sigma_i) = a(1/\sigma_i) + b(X_i/\sigma_i) + e_i/\sigma_i$$
Since the rest of the classical regression model assumptions hold, it follows that
$$E[e_i/\sigma_i] = 0,$$ and
$$\operatorname{var}[e_i/\sigma_i] = E[(e_i/\sigma_i)^2] = E[e_i^2]/\sigma_i^2 = 1$$
Conceptually, estimating this model involves running OLS with two RHS variables -- $(1/\sigma_i)$ and $(X_i/\sigma_i)$ -- and no constant term. The two slope coefficients yield BLUE estimates of a and b.
The name WLS estimation reflects the fact that we effectively are weighting each observation by $1/\sigma_i$.
The perfect WLS solution has one rather basic drawback -- we don't know the weights, i.e., we don't know $\sigma_i$ for each observation.
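The transformed regression can be sketched in a few lines. This is a minimal illustration, assuming the error standard deviations are known exactly (here set to 0.05 times X squared purely for the example); the function names `fit_two` and `wls_known_sigma` are hypothetical. It divides every variable by the standard deviation and runs OLS on the two transformed regressors with no constant term.

```python
import random

def fit_two(z1, z2, y):
    """OLS of y on two regressors z1 and z2 with no constant; returns both coefficients."""
    s11 = sum(u * u for u in z1)
    s12 = sum(u * v for u, v in zip(z1, z2))
    s22 = sum(v * v for v in z2)
    t1 = sum(u * w for u, w in zip(z1, y))
    t2 = sum(v * w for v, w in zip(z2, y))
    det = s11 * s22 - s12 * s12            # determinant of the 2x2 normal equations
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

def wls_known_sigma(x, y, sigma):
    """Estimate a and b from the transformed model, weighting each row by 1/sigma_i."""
    z1 = [1 / s for s in sigma]                  # regressor that multiplies a
    z2 = [xi / s for xi, s in zip(x, sigma)]     # regressor that multiplies b
    ys = [yi / s for yi, s in zip(y, sigma)]     # transformed dependent variable
    return fit_two(z1, z2, ys)

rng = random.Random(1)
x = [1 + 9 * i / 99 for i in range(100)]
sigma = [0.05 * xi ** 2 for xi in x]             # assumed known for this sketch
y = [1.0 + 2.0 * xi + rng.gauss(0, s) for xi, s in zip(x, sigma)]

a_wls, b_wls = wls_known_sigma(x, y, sigma)
a_ols, b_ols = fit_two([1.0] * len(x), x, y)     # ordinary OLS for comparison

print(f"WLS: a = {a_wls:.3f}, b = {b_wls:.3f}")
print(f"OLS: a = {a_ols:.3f}, b = {b_ols:.3f}")
```

Passing a column of ones as the first regressor turns the same helper into ordinary OLS with a constant, which is why no separate OLS routine is needed here.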
The traditional solution to this has been to assume that $\sigma_i = f(X_i)$ -- this is the case that invalidates hypothesis testing -- and to approximate that relationship in one of two ways: either replace $\sigma_i$ with some assumed functional form for $f(X_i)$, e.g. the square root of $X_i$, or regress the absolute value of the estimated residual from the original OLS model on the RHS variables. For example:
$$|u_i| = c + d_1 X_i^{1/2} + d_2 X_i + d_3 X_i^2$$
plus other terms if there are other RHS variables. You then substitute the fitted value of the dependent variable from this regression for $\sigma_i$ in the WLS estimation.
If you have the correct functional form for f(Xi), this solution is great. If you don't, then this cure for heteroskedasticity is worse than the disease, because it will yield biased estimates of the regression coefficients and the standard errors.
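The two-step procedure described above can be sketched as follows. Several things here are assumptions made for illustration rather than part of these notes: the simulated data (error standard deviation equal to 0.5 times X), the simpler auxiliary regression of the absolute residual on a constant and X alone, the helper name `fit_two`, and in particular the floor placed under the fitted values so that no estimated standard deviation is zero or negative.

```python
import random

def fit_two(z1, z2, y):
    """OLS of y on two regressors z1 and z2 with no constant; returns both coefficients."""
    s11 = sum(u * u for u in z1)
    s12 = sum(u * v for u, v in zip(z1, z2))
    s22 = sum(v * v for v in z2)
    t1 = sum(u * w for u, w in zip(z1, y))
    t2 = sum(v * w for v, w in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det

rng = random.Random(2)
n = 200
x = [1 + 9 * i / (n - 1) for i in range(n)]
# Assumed data-generating process for the sketch: sigma_i = 0.5 * X_i.
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5 * xi) for xi in x]
ones = [1.0] * n

# Step 1: OLS on the original model; keep the absolute residuals.
a_ols, b_ols = fit_two(ones, x, y)
abs_u = [abs(yi - a_ols - b_ols * xi) for xi, yi in zip(x, y)]

# Step 2: auxiliary regression |u_i| = c + d * X_i (a simplified functional form).
c_hat, d_hat = fit_two(ones, x, abs_u)

# Step 3: fitted values stand in for sigma_i. The floor is an ad hoc safeguard
# (an assumption of this sketch) against non-positive fitted values.
floor = 0.1 * sum(abs_u) / n
sigma_hat = [max(c_hat + d_hat * xi, floor) for xi in x]

# Step 4: WLS, i.e. OLS on the model divided through by sigma_hat, no constant.
a_fwls, b_fwls = fit_two(
    [1 / s for s in sigma_hat],
    [xi / s for xi, s in zip(x, sigma_hat)],
    [yi / s for yi, s in zip(y, sigma_hat)],
)

print(f"OLS:          a = {a_ols:.3f}, b = {b_ols:.3f}")
print(f"feasible WLS: a = {a_fwls:.3f}, b = {b_fwls:.3f}")
```

The need for a floor is a reminder of how much this approach leans on the assumed form of f(X): a poorly chosen form can produce fitted values that make no sense as standard deviations.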
For that reason, I recommend a solution devised by Hal White. His approach is to obtain the coefficients from the OLS regression, since we know that they are unbiased. He then calculates the estimated variances and covariances of the regression coefficients by weighting each observation by $|u_i|$. This yields estimates of the standard errors that, while still biased, are consistent.
Remember, an estimator is consistent if the probability that my estimate is extremely close to the true value approaches one as the number of observations grows infinitely. In large samples, we treat a consistent estimator as almost as good as an unbiased estimator.
The price we pay for using White's correction is the fact that our estimated coefficients are not efficient. In small samples, we run the risk that we will observe low t-statistics even though the RHS variable does exert a real influence on the dependent variable. But, at least we know that the insignificant t-statistic (or, worse, a significant t-statistic) is not an artifact of bias in the standard errors.
I'll do an example before the end of class today, but basically the way to invoke White's correction in Stata is to include the option ROBUST following a comma at the end of the REGRESS command. Stata estimates the OLS coefficients as usual, but then bases the standard errors on White's correction.
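For readers who want to see the arithmetic behind the correction, here is a minimal sketch for the one-regressor model. It implements the basic form of White's estimator (often labeled HC0), in which each observation's deviation of X from its mean is weighted by its residual; packaged implementations may apply small-sample adjustments, so these numbers need not match Stata's output exactly. The function names and the simulated data (error standard deviation 0.05 times X squared) are assumptions for illustration.

```python
import random

def white_slope_se(x, u):
    """White (HC0) standard error of the slope, given X values and OLS residuals u."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # Each deviation of X is weighted by its residual before squaring and summing.
    meat = sum(((xi - xbar) * ui) ** 2 for xi, ui in zip(x, u))
    return meat ** 0.5 / sxx

def ols_with_ses(x, y):
    """Return (slope, conventional SE, White SE) for Y = a + bX + e."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    u = [yi - a - b * xi for xi, yi in zip(x, y)]
    conventional = (sum(ui ** 2 for ui in u) / (n - 2) / sxx) ** 0.5
    return b, conventional, white_slope_se(x, u)

rng = random.Random(3)
n = 5000                                   # large sample, since the estimator is consistent
x = [1 + 9 * i / (n - 1) for i in range(n)]
# Assumed heteroskedastic errors: sigma_i = 0.05 * X_i^2.
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.05 * xi ** 2) for xi in x]

b_hat, conv_se, white_se = ols_with_ses(x, y)
print(f"slope:           {b_hat:.4f}")
print(f"conventional SE: {conv_se:.4f}")
print(f"White SE:        {white_se:.4f}")
```

The slope is identical under both approaches; only the standard error, and therefore the t-statistic, changes.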
How do you know if heteroskedasticity afflicts your model?
There have been several tests designed to look at the general question -- the classic one being the Goldfeld-Quandt test, which looks for differences in variances between groups of observations (e.g., by location).
Most tests focus on whether $\sigma_i^2$ is systematically related to the RHS variables, since this is the case that biases the standard errors.
The test I recommend is a version of the Breusch-Pagan test. The probability theory is a bit beyond what I want to do in this class, but here's the basic idea:
Estimate the original model using OLS. Save and square the estimated residuals.
Regress them on the RHS variables and all quadratic terms involving the RHS variables.
If $\sigma_i^2$ is unrelated to a function of the RHS variables, then we would expect the $R^2$ for this residual regression to be zero. We reject the null hypothesis of homoskedasticity if $R^2$ is large enough.
Determining whether $R^2$ is large enough is a matter of comparing the appropriate test statistic with the relevant critical value. If the null hypothesis is true, that is, the errors are homoskedastic, then the test statistic is distributed $\chi^2$ with $k^*-1$ degrees of freedom, where $k^*-1$ is the number of RHS variables in this residual regression.
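The steps above can be sketched for a model with a single RHS variable. The notes do not spell out the test statistic; the sketch assumes the commonly used form, the number of observations times the R-squared of the residual regression. With one X the residual regression has two RHS variables (X and X squared), so the 5% critical value for 2 degrees of freedom, 5.991, applies. The helper names and simulated data are hypothetical.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    sol = [0.0] * n
    for i in reversed(range(n)):
        sol[i] = (M[i][n] - sum(M[i][j] * sol[j] for j in range(i + 1, n))) / M[i][i]
    return sol

def ols(rows, y):
    """OLS coefficients from the normal equations; each row lists one observation's regressors."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def residual_regression_test(x, y):
    """Return (n * R^2, R^2) from regressing squared OLS residuals on X and X^2."""
    n = len(x)
    a, b = ols([[1.0, xi] for xi in x], y)               # original model
    u2 = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]  # squared residuals
    Z = [[1.0, xi, xi * xi] for xi in x]                 # constant, X, quadratic term
    g = ols(Z, u2)
    fitted = [sum(gj * zj for gj, zj in zip(g, z)) for z in Z]
    mean_u2 = sum(u2) / n
    sst = sum((v - mean_u2) ** 2 for v in u2)
    ssr = sum((v - f) ** 2 for v, f in zip(u2, fitted))
    r2 = 1 - ssr / sst
    return n * r2, r2

CRIT_5PCT_2DF = 5.991    # chi-squared critical value, 2 degrees of freedom, 5% level

rng = random.Random(4)
n = 500
x = [1 + 9 * i / (n - 1) for i in range(n)]
y_het = [1.0 + 2.0 * xi + rng.gauss(0, 0.05 * xi ** 2) for xi in x]  # variance rises with X
y_hom = [1.0 + 2.0 * xi + rng.gauss(0, 1.0) for xi in x]             # constant variance

stat_het, r2_het = residual_regression_test(x, y_het)
stat_hom, r2_hom = residual_regression_test(x, y_hom)
print(f"heteroskedastic data: n*R2 = {stat_het:.2f}, reject = {stat_het > CRIT_5PCT_2DF}")
print(f"homoskedastic data:   n*R2 = {stat_hom:.2f}, reject = {stat_hom > CRIT_5PCT_2DF}")
```

With more RHS variables the residual regression would also include their squares and cross-products, and the critical value would have to be looked up for the correspondingly larger degrees of freedom.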
Note that this is a weak test. Failing to reject the null hypothesis doesn't prove that there is no heteroskedasticity, merely that any relationship between the variance and the RHS variables is too small to detect with our data.
There is another drawback to this test: If there are many RHS variables, then the quadratic terms may exhaust the degrees of freedom available in the data. An alternative version of the Breusch-Pagan test is to use $\hat{Y}^2$ as the RHS variable to generate the test statistic. This version of the test is less likely to uncover heteroskedasticity when it is present, but is a useful second-best option.
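A sketch of this alternative version follows. As before, the form of the statistic (observations times the R-squared of the residual regression) is an assumption not stated in these notes; with a single RHS variable in the residual regression, the 5% critical value for 1 degree of freedom, 3.841, applies. Function names and simulated data are hypothetical.

```python
import random

def simple_fit(x, y):
    """Return (intercept, slope) from OLS of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope

def yhat_squared_test(x, y):
    """Return (n * R^2, R^2) from regressing squared residuals on squared fitted values."""
    n = len(x)
    a, b = simple_fit(x, y)
    yhat = [a + b * xi for xi in x]
    u2 = [(yi - fi) ** 2 for yi, fi in zip(y, yhat)]   # squared OLS residuals
    yhat2 = [fi ** 2 for fi in yhat]                   # the single RHS variable
    c, d = simple_fit(yhat2, u2)
    mean_u2 = sum(u2) / n
    sst = sum((v - mean_u2) ** 2 for v in u2)
    ssr = sum((v - c - d * z) ** 2 for v, z in zip(u2, yhat2))
    r2 = 1 - ssr / sst
    return n * r2, r2

CRIT_5PCT_1DF = 3.841    # chi-squared critical value, 1 degree of freedom, 5% level

rng = random.Random(5)
n = 500
x = [1 + 9 * i / (n - 1) for i in range(n)]
# Assumed heteroskedastic errors: sigma_i = 0.05 * X_i^2.
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.05 * xi ** 2) for xi in x]

stat, r2 = yhat_squared_test(x, y)
print(f"n*R2 = {stat:.2f}, reject homoskedasticity = {stat > CRIT_5PCT_1DF}")
```

Because the residual regression has only one RHS variable no matter how many regressors the original model contains, this version never runs short of degrees of freedom.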
I've created a Stata example -- program and output files -- of a test and correction for heteroskedasticity using the sample of 100 households we used before.