library(ggplot2)    # for simple plots
library(plotly)     # for the plots
library(dplyr)      # for pipes and data management
library(kableExtra) # for pretty tables
library(gridExtra)  # for arranging plots nicely

pop <- read.csv('nc.csv') %>% # read in the dataset for this analysis
  na.omit()                   # and drop any observations with missing data

N <- 100
set.seed(5092014)
nc <- pop[sample(nrow(pop), N), ] # sample down to 100 observations
These course notes should be used to supplement what you learn in class. They are not intended to replace course lectures and discussions. You are only responsible for material covered in class; you are not responsible for any material covered in the course notes but not in class.
Overview
The setting
Sampling variability
Standard errors
The distribution of \hat{\beta}_1
Hypothesis tests
Confidence intervals
Summary
Appendix
The setting
We’ve estimated a regression line, \hat{Y}_i=\hat{\beta}_0+\hat{\beta}_1X_i, but we know that this is only the line of best fit for this sample. That’s why we have the hats on the coefficients, \hat{\beta}_0 and \hat{\beta}_1; the true parameters are \beta_0 and \beta_1, but we can’t observe those directly. So, just as in previous units, we need a way to
test null-hypotheses about the true values of \beta_1 and \beta_0, and
obtain confidence intervals for the parameters to understand which values of the population parameters are consistent with our sample.
We’ll be working with the same data as in Unit 4. However, now we’ll be working directly with the population, 1,762 students from the same schools in North Carolina (once again, we’re only using students for whom we have complete data for both INTEREST and ENGAGEMENT).
We should start by examining a scatterplot of ENGAGEMENT on INTEREST, this time in the population. We display this in Figure 1. The left-hand panel shows the scatterplot in the population; the right-hand plot overlays the sample values and the sample line of best fit, using the sample we drew for Unit 4.
Code
pop_plot <- pop %>%
  ggplot(mapping = aes(x = interest, y = engage)) +
  geom_point(alpha = .1) +
  geom_smooth(se = FALSE, method = 'lm', col = 'olivedrab3') +
  labs(x = 'Teacher\'s personal interest', y = 'Engagement in class')
samp_plot <- pop_plot +
  geom_point(data = nc, col = 'salmon3') +
  geom_smooth(data = nc, se = FALSE, method = 'lm', col = 'salmon3')
grid.arrange(pop_plot, samp_plot, ncol = 2)
Figure 1: The left-hand figure shows a scatterplot of student engagement against student perceptions of teacher personal interest in the population with the line of best fit overlaid in `olivedrab3`. The right-hand figure shows the sample scatterplot and line of best fit overlaid in `salmon3`; points that were sampled are also displayed in `salmon3`. Wow, it’s pretty close to the true line! Notice that some of the dots are darker than others; we achieved this by making the dots partially transparent; the more dots that are layered on top of each other, the darker the point appears.
One thing you’ll be happy to note is that the sample line of best fit comes quite close to the population line of best fit. The sample line of best fit won’t always come so close to the population line, but it’s usually in the same ballpark.1 In the next few sections, we’ll look at how we can test null-hypotheses and quantify our uncertainty about the population line of best fit.
The equation for the population line of best fit is
Y_i = 1.54 + 0.52X_i + \varepsilon_i
while the equation for the sample line of best fit is
\hat{Y}_i = 1.72 + 0.45X_i.
Let’s pause for a second to remember what these numbers mean. We’ll interpret the population line of best fit since we’ve already done the sample. In the population, students who differ in terms of perceived personal interest by one scale point differ, on average, by 0.52 scale points in terms of engagement with their class. Individual pairs of students may not behave exactly as predicted (it’s entirely possible that a student with a lower sense of personal interest might report more engagement than a peer with a higher sense of personal interest), but this is the average behavior. No students have a personal interest score of 0 (the lowest possible value is 1). However, if this were a valid score for personal interest, the average engagement score for these students would be 1.54. Recall that although the intercept is not directly interpretable in this case, we still need it to adjust the height of the line of best fit. Omitting the intercept generally gives a very badly fitting line.
Once again, notice that the sample estimates of the regression coefficients are quite close to the population values, particularly \beta_1=0.52 and \hat{\beta}_1=0.45. In this case, if we were to use the sample estimate of the slope to summarize the association between ENGAGEMENT and INTEREST in the population, we wouldn’t be very wrong at all.
We should also stress that, in applied research, we will (almost?) never know the actual value of \beta_0 or \beta_1. In this unit we do, which is important because we want to see how our samples compare to the actual population. In reality we will typically write the population equation as
Y_i = \beta_0 + \beta_1X_i + \varepsilon_i.
In this case, the population equation only says that the association between X and Y is a straight line, and that individual observations are scattered around that line. It does not tell us which line (i.e., it does not include actual numbers for the coefficients, just Greek letters).
Sampling variability
Not every sample we could possibly draw from this population will give us the same values of \hat{\beta}_0 and \hat{\beta}_1. To demonstrate this, we display regression lines fit to 9 random samples from the population in Figure 2. Notice that in some samples, such as the sample in row 3, column 1, the lines of best fit are much steeper than in others, such as row 1, column 3. If we were to take more samples, we would see even more extreme lines. If we took enough samples, we would even find some with lines of best fit with negative slopes.
Figure 2: Scatterplots from 9 different samples of size 100 with overlaid lines of best fit.
We can also look at sampling variability by drawing lots of samples, calculating the line of best fit in each one, and overlaying them on a single plot. That’s what we display in Figure 3. We drew 500 samples of size 100 from the full population of 1,762 students (of course, students can appear in more than one sample, but each sample is almost certainly unique). The red dot is plotted at the population mean of INTEREST and ENGAGEMENT; the thick black line is the line of best fit for the population regression.
Here are some things to notice: most lines come pretty close to the true line of best fit, although a few miss by a substantial amount. At the population mean value of INTEREST (around 3.5), most lines come very close to the population mean value of ENGAGEMENT (around 3.4). Further from the population mean of INTEREST, the lines tend to miss by more. When we look at the most extreme values of INTEREST, close to 1, many of the sample regression lines are quite far from the true line of best fit. Finally, notice that the lines calculated from the samples are symmetric around the true line. For every line that’s too steep, there’s another that’s too shallow. For every line that’s shifted too high, there’s another line which is shifted too low (okay, this doesn’t work out perfectly here because we only took 500 samples; with more samples the distribution would be even more symmetric).
Code
samps <- sample(dim(pop)[1], N * 500, replace = TRUE) # row indices for 500 samples, each of size N = 100
p <- pop_plot
for(i in 1:500){
  samp.this <- pop[samps[((i - 1) * 100 + 1) : (i * 100)], ]
  coef.this <- coef(lm(engage ~ interest, data = samp.this))
  p <- p + geom_abline(intercept = coef.this[1], slope = coef.this[2],
                       col = sample(rainbow(500), 1))
}
p +
  geom_smooth(se = FALSE, method = 'lm', col = 'black', lwd = 2) +
  geom_point(mapping = aes(x = mean(interest), y = mean(engage)), col = 'red', cex = 2)
Figure 3: Scatterplot of the population with overlaid lines of best fit from 500 different samples each of size 100. The red dot is at the population mean of INTEREST and ENGAGEMENT, and is referred to as the ‘centroid’ of the joint distribution of personal interest and engagement; it’s like the center of gravity of the distribution. All the lines come fairly close to the population centroid.
Standard errors
As before, we’ll need to work with the concept of a standard error. Remember from units 2 and 3 that the standard error of a statistic is the standard deviation of its sampling distribution. Suppose we were able to take multiple independent samples from a population, each of the same size (in this case 100), and in each one we calculated the sample line of best fit. In this case, the standard errors of \hat{\beta}_0 and \hat{\beta}_1 would be the standard deviations of those statistics in lines estimated across the repeated samples. This is an important point: standard errors are not about the spread of the individual observations, they’re about the spread of statistics across repeated random samples.
If we could observe the population directly, we could calculate the standard errors.2 We’re going to focus on the standard error for the slope, se(\hat{\beta}_1) because we’re usually interested in estimating \beta_1. There’s a formula for se(\hat{\beta}_0) as well, but we’re not going to bother with it (but similar statements would be true there).
We’ll need a few quantities to write the equation for se(\hat{\beta}_1). First, \sigma_\varepsilon, which is the standard deviation of the residuals from the regression; if we took all of the residuals from the population regression and calculated their standard deviation, this is what we’d get. It’s also referred to as the Root Mean Square Error, or RMSE, because we find it by taking the square root of the mean value of the squared errors. Next we need s_{X}, which is the sample standard deviation of the predictor, INTEREST. Finally, n is the sample size.
In this case, the standard error of \hat{\beta}_1 can be written as

se(\hat{\beta}_1) = \frac{\sigma_\varepsilon}{s_X\sqrt{n-1}}.
As the residuals of the regression get smaller (i.e., as the predictor and the outcome become more strongly associated), the standard error will decrease (when our predictions are more precise, our estimates are also more precise). Similarly, as the sample size increases, the standard error will decrease (when we take a larger sample, our estimates are more precise). Finally, and somewhat surprisingly, as the standard deviation of the predictor increases, the standard error will decrease (when our predictor is spread out, our estimates are more precise). Smaller standard errors means less variability in our estimates of \beta_1, which is something we want; if the standard error is small our estimates will tend to be close to the population parameter.
Although we can directly observe n and s_X, we can’t observe \sigma_\varepsilon. As we noted in the previous unit, we can only estimate \sigma_\varepsilon. We can use this estimated quantity, \hat{\sigma}_\varepsilon, to estimate the standard error, giving us

\hat{se}(\hat{\beta}_1) = \frac{\hat{\sigma}_\varepsilon}{s_X\sqrt{n-1}}.
This should be a surprising result! Even though we typically only ever see a single value of \hat{\beta}_1, we can obtain an estimate of how spread out it is over repeated random samples. Wow!3 This is one of the most important things statistics does: it allows us to estimate how an estimator (e.g., \hat{\beta}_1) will perform over repeated samples, even though we only have a single sample to work with. This, in turn, lets us quantify our uncertainty about the population from a single sample.
In our current example, the true standard error of \hat{\beta}_1, calculated from the population, is 0.077. In the sample we actually took, we estimate the standard error to be 0.082. This is pretty close, although not quite right. But as we’ll see later, we can still conduct valid tests and construct valid confidence intervals, even though our estimates of the standard errors are not identical to the true standard error.
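Here’s a minimal sketch (my illustration, not part of the original analysis) of how you might check that calculation in R: we compute \hat{se}(\hat{\beta}_1) by hand from the estimated RMSE, s_X, and n, using the formula reconstructed above, and compare it to the standard error that lm() reports for the sample nc loaded in the setup chunk.

Code

mod <- lm(engage ~ interest, data = nc)       # the sample regression from Unit 4

sigma_hat <- sigma(mod)                       # estimated RMSE, i.e. sigma_epsilon-hat
s_x       <- sd(nc$interest)                  # sample standard deviation of the predictor
n         <- nrow(nc)                         # sample size

se_by_hand <- sigma_hat / (s_x * sqrt(n - 1)) # the formula from above
se_from_r  <- coef(summary(mod))['interest', 'Std. Error']

c(by_hand = se_by_hand, from_r = se_from_r)   # the two should agree (about 0.082)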
The distribution of \hat{\beta}_1
So now we have a way to estimate se(\hat{\beta}_1), the standard deviation of \hat{\beta}_1 over multiple samples. However, we don’t yet know what the distribution actually looks like. It turns out that the estimated regression coefficients will follow Normal distributions, with means equal to the population values and standard deviations equal to the standard errors (we talked about the Normal distribution in an earlier unit). Things work out so nicely because math. Thanks math! Actually, there are proofs that we can use to demonstrate this, but you may prefer to just take it on faith.
I decided to illustrate the distribution through a simulation. Simulation is a powerful technique which allows us to see what happens over repeated random samples. Essentially, we define the population on our computer and draw samples from it by using the computer’s pseudo-random number generator.4 For anyone who’s interested, a pseudo-random number generator is a deterministic (so non-random) algorithm for generating a sequence of numbers that appear to be random, based on certain tests. The sequence of random numbers depends on some starting value, or seed, which means that we can draw the same sequence of values by starting with the same seed. Although these values are not truly random, they behave enough like random numbers for almost any purpose.
In this case, we drew 50,000 samples of size 100 from the population (remember, here the population is the complete dataset with all 1,762 students). In each sample we estimated the regression line (the whole procedure took a little over one minute in R). In Figure 4, we display a histogram of those estimated slopes. We overlaid a normal distribution with the same mean and standard deviation in red. Notice how closely the normal distribution approximates the actual distribution. Nice!
Figure 4: Sampling distribution of \hat{\beta}_1 in samples of size 100 taken from the population with overlaid normal density curve.
Again, the fact that the sample estimate of \beta_1 follows a Normal distribution over repeated random samples can be demonstrated mathematically, but simulation is a neat tool for illustrating this. If you’re wondering how something will behave over repeated random samples, it’s usually possible to simply simulate drawing repeated samples and look at the distribution directly. There’s an intuitive appeal to finding out how something behaves over repeated random samples by just going out and taking the samples. R makes simulation fairly straightforward, and I include some scripts on Canvas to get you started if this is something you’re interested in. Let me know if there’s something particular you’d like to see!
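If you’d like to try it yourself, here’s a stripped-down sketch of the kind of simulation described above (my illustration, not the exact script used for Figure 4; the seed is arbitrary and it uses fewer samples to keep the run time short).

Code

set.seed(20240501)                        # illustrative seed, for reproducibility
slopes <- replicate(5000, {               # 5,000 samples instead of 50,000, for speed
  samp <- pop[sample(nrow(pop), 100), ]   # draw a sample of 100 students
  coef(lm(engage ~ interest, data = samp))['interest']
})
sd(slopes)                                # close to the true standard error, about 0.077
hist(slopes, breaks = 50,
     main = 'Sampling distribution of the estimated slope')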
This means that if we take the observed \hat{\beta}_1 and subtract the population parameter, \beta_1, then divide by the estimated standard error, \hat{se}(\hat{\beta}_1), the result will follow a t_{n-2} distribution. We can write this as

\frac{\hat{\beta}_1-\beta_1}{\hat{se}(\hat{\beta}_1)} \sim t_{n-2}.
This means that we can use all of the tools we developed in Unit 2 for testing null-hypotheses (typically H_0:\beta_1 = 0) and constructing confidence intervals.
Hypothesis tests
Of course, just as with means, we don’t know the value of \beta_1, so we can’t actually calculate \frac{\hat{\beta}_1-\beta_1}{\hat{se}(\hat{\beta}_1)} (if we knew \beta_1 we wouldn’t have to go to all of this trouble in the first place). However, we can use the techniques we developed in Units 2 and 3 to use the t-distribution to help us test null-hypotheses.
The most common null-hypothesis tested in regression analyses is H_0:\beta_1=0. This is the default hypothesis tested by statistical software. If \beta_1 is equal to 0, then there is no linear association between the predictor and the outcome in the population, which makes this an easily interpretable null-hypothesis. H_0 is simply claiming that the variables are not (linearly) associated with each other. This will always be the hypothesis that we use in this class, although it’s easy to generalize our approaches to other hypotheses.
If the true value of \beta_1 is 0, then we can calculate the t-statistic as

t = \frac{\hat{\beta}_1-0}{\hat{se}(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{se}(\hat{\beta}_1)}.
We can do this because we’ve assumed that \beta_1=0. If we wanted to test, e.g., H_0:\beta_1=1, we would just replace 0 with 1. We can calculate both \hat{\beta}_1 (we did this in unit 3) and \hat{se}(\hat{\beta}_1) (we did this earlier in this unit), so we can also calculate the t-statistic. As in Unit 2, we can compare this statistic to a t-distribution with the appropriate degrees of freedom (in this case n-2; essentially we have n degrees of freedom, but we lose one for each parameter that we have to estimate, in this case a slope and an intercept. In general, just rely on your statistical software to calculate the degrees of freedom). Doing so will give us the p-value, or the probability of drawing a sample with a t-statistic as extreme as or more extreme than the one observed if the null-hypothesis is true. Sometimes we write this as the probability of drawing a sample with a t-statistic as extreme as or more extreme than the one observed under the null-hypothesis. Think of a null-hypothesis test as asking whether the hypothesized population value is consistent with the observed data.
In our running example, the t-statistic is equal to t = \frac{0.450}{0.082} = 5.48. The p-value is .00000032, meaning that the observed t-statistic would be incredibly improbable if the true slope were 0. We’d only expect it to happen 32 times every 100,000,000 samples.
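As a sketch (again my own illustration rather than part of the notes), here’s how we could reproduce that t-statistic and p-value by hand and check them against the row for interest in summary(mod).

Code

mod    <- lm(engage ~ interest, data = nc)
b1_hat <- coef(mod)['interest']
se_b1  <- coef(summary(mod))['interest', 'Std. Error']

t_stat <- (b1_hat - 0) / se_b1                                        # about 5.48
p_val  <- 2 * pt(abs(t_stat), df = nrow(nc) - 2, lower.tail = FALSE)  # two-sided p-value
c(t = t_stat, p = p_val)                                              # compare to summary(mod)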
Confidence intervals
Recall from units 2 and 3 that a key drawback to hypothesis tests is that all they tell us is whether the data are consistent with a specific population value. We can say with a great deal of certainty that the true value of \beta_1 is not equal to 0, and even that it’s greater than 0 because the sample slope is greater than 0, but we can’t say much about what it actually is. That’s where confidence intervals come in.
We form a confidence interval5 by identifying all of the values of \beta_1^{hyp} (i.e., hypothesized values of \beta_1) for which we would fail to reject the null-hypothesis, H_0:\beta_1=\beta_1^{hyp}, typically with \alpha=.05. Remember that we define t_{crit} as the value of the t-statistic associated with a p-value of .05; if the t-statistic is larger (either positive or negative) than t_{crit}, then we will reject the null-hypothesis. t_{crit} is almost always very close to 2. Then we can form a confidence interval by taking all values in the interval

\left(\hat{\beta}_1 - t_{crit}\times\hat{se}(\hat{\beta}_1),\ \hat{\beta}_1 + t_{crit}\times\hat{se}(\hat{\beta}_1)\right),
i.e. all values within t_{crit} estimated standard errors of \hat{\beta}_1. For any value, \beta_1^{hyp}, within the confidence interval, we would fail to reject the null-hypothesis H_0:\beta_1=\beta_1^{hyp}. For any value, \beta_1^{hyp}, outside the confidence interval, we would reject the null-hypothesis H_0:\beta_1=\beta_1^{hyp}.
As we discussed before, 95%-confidence intervals have the property that, over repeated independent samples, 95% of the intervals we construct will contain the true \beta_1. Any particular confidence interval can be thought of as the set of all values of \beta_1^{hyp} which are consistent with the observed data.
In our example, the confidence interval is equal to

(0.450 - 1.98\times0.082,\ 0.450 + 1.98\times0.082) = (0.29, 0.61).
Notice that small standard errors lead to narrower confidence intervals; the less variable the statistic, the more precisely we can estimate it. Notice also that we reject H_0:\beta_1=\beta_1^{hyp} if and only if the confidence interval does not contain \beta_1^{hyp}.
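Here’s a quick sketch of how this interval could be checked in R (my illustration; confint() is the standard shortcut).

Code

mod    <- lm(engage ~ interest, data = nc)
b1_hat <- coef(mod)['interest']
se_b1  <- coef(summary(mod))['interest', 'Std. Error']
t_crit <- qt(.975, df = nrow(nc) - 2)                 # roughly 1.98

c(b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)   # about (0.29, 0.61)
confint(mod, 'interest', level = .95)                 # the same interval, from R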
Summary
Once we’ve estimated the line of best fit, it’s important to have a sense of how certain or uncertain we are about our estimate. Statistical inference can help us do this with confidence intervals and, to a lesser extent, null-hypothesis tests. The details should be familiar to you from Unit 2. One new twist is that the standard deviation of the residuals, \sigma_\varepsilon, shows up in all of the standard errors. The less variable the residuals (i.e., the closer our predictions come to the true values), the more precise all of our estimates are.
Here’s one way we might summarize our findings, adding some details to our summary from Unit 4. We were interested in determining the association between students’ reports of the personal interest their teachers took in them and their reported engagement in those teachers’ classes. We fit a regression of engagement on personal interest. The fitted regression equation was ENGAGEMENT_i = 1.72 + 0.45INTEREST_i. The slope was statistically significantly different from 0 (\hat{\beta}_1=0.45, t=5.48, p<.001, 95\%-CI=(0.29,0.61)), giving strong evidence that the association between the variables is positive in the population. All of the values in the confidence interval are moderately large, which suggests that the true association is moderately strong.
Appendix
Confidence intervals for conditional means
As we’ve said before, our goal in doing regression is usually to estimate the linear association between a pair of variables. As a result, we’re usually interested in doing inference on \beta_1. Sometimes, however, we’re interested in estimating the population mean of the outcome for some value of the predictor, or the population conditional mean value. To do so, we need to introduce the concept of a confidence interval for the mean.6
For a quick review, suppose that we’re interested in estimating the conditional mean of the outcome where x=x_0. We can write this conditional mean as \mu_{y|x=x_0}, and can express it using the regression line as

\mu_{y|x=x_0} = \beta_0 + \beta_1x_0.

But since we’re only estimating \beta_0 and \beta_1, we can’t know the value of \mu_{y|x=x_0}. Instead, we can only estimate it as

\hat{\mu}_{y|x=x_0} = \hat{\beta}_0 + \hat{\beta}_1x_0.
As you might expect, this estimated mean also has a standard error (i.e. the standard deviation of \hat{\mu}_{y|x=x_0} over repeated samples, or se(\hat{\mu}_{y|x=x_0})). We won’t go into the details of how to calculate it, but note that the standard error is smaller for values of x_0 close to the sample mean, \bar{x}, and larger for values far from \bar{x}. This is true whether or not there are many points close to the sample mean, and is a feature of linear regression. It also depends on the standard deviation of the residuals, \sigma_\varepsilon, as do all regression-related confidence intervals. More variable residuals make our estimates of conditional means less precise.
We can use the estimated standard error to test null-hypotheses as we described earlier in this unit. In fact, the null hypothesis test for \beta_0 reported by your statistical software uses the standard error \hat{se}(\hat{\mu}_{y|x=0}), since the intercept is just the conditional mean when the predictor is 0. We can also construct confidence intervals using

\hat{\mu}_{y|x=x_0} \pm t_{crit}\times\hat{se}(\hat{\mu}_{y|x=x_0}).
These confidence intervals, which are commonly 95%-confidence intervals, have the same interpretation as other confidence intervals. As the sample grows, our estimates of the regression line will become increasingly precise. Figure 5 shows the line of best fit calculated for our regression of 100 points, as well as confidence intervals calculated at all values of INTEREST, from 1 to 5.
Code
nc %>%
  ggplot(aes(x = interest, y = engage)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  labs(x = 'Teacher\'s personal interest', y = 'Engagement in class')
Figure 5: Sample line of best fit with the shaded region indicating 95% confidence intervals for the conditional mean of y given x.
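If you want the interval at a single value of the predictor rather than the shaded band, here’s a small sketch (my illustration, using an arbitrary value of INTEREST, x_0 = 3) built from the standard error that predict() returns.

Code

mod <- lm(engage ~ interest, data = nc)
x0  <- data.frame(interest = 3)                      # illustrative value of the predictor
est <- predict(mod, newdata = x0, se.fit = TRUE)
t_crit <- qt(.975, df = nrow(nc) - 2)

c(lower = est$fit - t_crit * est$se.fit,             # CI for the conditional mean at x0
  upper = est$fit + t_crit * est$se.fit)
predict(mod, newdata = x0, interval = 'confidence')  # the same interval in one call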
Prediction intervals
We also might be interested in providing a range of plausible values for a new observation from the same population at a given value of the predictor. We refer to this as a prediction interval. Prediction intervals are similar to confidence intervals, except that they include added uncertainty: not only are we unsure of the true regression line, but even if we knew the true regression line, any new observation would have its own unpredictable residual. As a result, prediction intervals are always wider than confidence intervals.
Prediction intervals depend heavily on the outcome having a conditional normal distribution, an assumption which is often unrealistic. Unlike confidence intervals, which become narrower and narrower as the sample size increases, prediction intervals can never be less than 2\times t_{crit}\times\hat{\sigma}_\varepsilon units wide, because there’s always at least that much uncertainty about what a particular residual will be.
The interpretation of a prediction interval is somewhat difficult. If we were to take repeated independent samples, and after each sample we drew a single new observation from the same population, then 95% of the time the new point would fall within our prediction intervals.7
Figure 6 shows the line of best fit with overlaid confidence intervals for the conditional mean and prediction intervals. Notice that the prediction intervals contain impossibly low values of ENGAGEMENT for low values of INTEREST, and impossibly high values of ENGAGEMENT for high values of INTEREST.
Code
mod <- nc %>% lm(engage ~ interest, .)
xs <- data.frame(interest = seq(from = min(nc$interest), to = max(nc$interest), length.out = 400))
xs$engage <- seq(from = min(nc$engage), to = max(nc$engage), length.out = nrow(xs))
ci <- predict(mod, xs, interval = 'confidence')
pi <- predict(mod, newdata = xs, interval = 'prediction')
xs$lwr <- pi[, 'lwr']
xs$upr <- pi[, 'upr']
nc %>%
  ggplot(aes(x = interest, y = engage)) +
  geom_point() +
  geom_ribbon(data = xs, aes(x = interest, ymin = lwr, ymax = upr, alpha = .01),
              col = 'pink3', fill = 'pink3') +
  geom_smooth(method = 'lm') +
  scale_alpha(guide = "none") +
  labs(x = 'Teacher\'s personal interest', y = 'Engagement in class')
Figure 6: Sample line of best fit with 95% prediction intervals for the new values of y given x shaded in ‘pink3’, and 95% confidence intervals shaded in grayish pink.
Degrees of freedom
A constant source of frustration for students is trying to understand degrees of freedom. What are they and why do we need to bother with them? One simple solution is to ignore them; your software will handle them for you, and the basic insight that larger samples give more power and precision is fairly intuitive. Honestly, that’s not a bad idea.
However, we think that we can give you a sense of how degrees of freedom work, at least in t-tests. Recall that \frac{\hat{\beta}_1-\beta_1}{se(\hat{\beta}_1)} (the estimate, minus the parameter, divided by the true standard error) follows a normal distribution. But since we don’t know se(\hat{\beta}_1), we have to use our estimate of the standard error, \hat{se}(\hat{\beta}_1). This introduces additional uncertainty, which the t-distribution accounts for with wider tails, making really extreme observations more common. The fewer observations we have (i.e. the lower the degrees of freedom), the less certain we can be about the estimated standard error, \hat{se}(\hat{\beta}_1); if we only have a handful of observations, then \hat{se}(\hat{\beta}_1) will frequently be quite far from se(\hat{\beta}_1). Hence, we’ll need an extremely thick-tailed distribution (i.e. a t-distribution with few degrees of freedom) to capture this added uncertainty and allow for the possibility that we’ve estimated the true standard error, se(\hat{\beta}_1), extremely badly. Degrees of freedom capture the added uncertainty which arises from having an imprecise estimate of the standard error.
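A quick illustration (mine, not from the notes) of how those thick tails thin out as the degrees of freedom grow: the t critical value for a two-sided test at \alpha=.05 shrinks toward the normal critical value of about 1.96.

Code

dfs <- c(3, 10, 30, 98, Inf)                       # Inf corresponds to the normal distribution
data.frame(df = dfs, t_crit = qt(.975, df = dfs))  # critical values shrink toward 1.96
# with only a few degrees of freedom the critical value is much larger than 1.96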
A related issue is why the degrees of freedom are equal to n-2 in simple linear regression. We see this in the formula for estimating the variance of the residuals,

\hat{\sigma}^2_\varepsilon = \frac{1}{n-2}\sum_{i=1}^n\hat{\varepsilon}_i^2 = \frac{1}{n-2}\sum_{i=1}^n\left(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i\right)^2.
In the last part of the equation, we simply rewrote the residual, \varepsilon_i, as the difference between the observed value and the predicted value. We’re trying to estimate the average squared residual, which is the definition of the variance of the residuals, so at first it probably seems like we should divide by n rather than n-2. This is actually exactly what we would do if we knew the parameters for the regression line, \beta_0 and \beta_1. In that case, we would have \varepsilon_i^2=(y_i-\beta_0-\beta_1x_i)^2, and the \varepsilon_i^2’s would be the squared population residuals (because we know what the population line of best fit is). However, we have to work with the estimated parameters, \hat{\beta}_0 and \hat{\beta}_1. This means that the squared residuals we have to work with are not the squared population residuals. In fact, they’re smaller on average than the squared population residuals. The line of best fit in the sample is the single line which makes the squared sample residuals as small as possible; on average, the true population line has to give larger squared residuals than the estimated line, unless our estimated line is exactly equal to the population line. We correct for the fact that our squared residuals are systematically too small by making the denominator smaller; we divide by n-2 rather than n. As we’ll see in later units, the more parameters we add to the model, the more the squared sample residuals will be shrunk from their population values, and the more we’ll have to shrink the denominator. Somewhat surprisingly, this simple correction works perfectly to give us unbiased estimates of the population variance.
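Here’s a two-line sketch (my addition) showing the n-2 denominator in action: dividing the sum of squared sample residuals by n-2 reproduces the squared RMSE that R reports for the fit.

Code

mod <- lm(engage ~ interest, data = nc)
sum(residuals(mod)^2) / (nrow(nc) - 2)  # by-hand estimate of the residual variance
sigma(mod)^2                            # the same value, from R's summary of the fit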
Robust standard errors
As we’ll discuss in unit 6, deriving the standard errors using the formula we gave above requires us to make a lot of assumptions. Specifically, we need to assume that at each value of the predictor, the residuals have the same standard deviation, have mean 0, and are normally distributed. Frequently these assumptions are unreasonable, which means that the standard errors we calculate might not be correct. Of course, the standard errors we estimate are never quite right since they’re only estimates, but if our assumptions are wrong then our procedures for conducting hypothesis tests and constructing confidence intervals won’t work either because our standard errors will be systematically wrong. The statistics we estimate won’t follow a known distribution.
However, methods have been developed for estimating standard errors which will give us correct hypothesis tests and confidence intervals even without the standard assumptions. These approaches, which provide robust standard errors because they’re robust to violations of the model assumptions, are computationally intensive. However, modern computers can handle them easily. A drawback to these approaches is that they tend to be only asymptotically correct; in small samples they won’t work well at all, while in large samples they’ll work much better. There’s no strict definition of ‘large’, but 200 observations tends to work well, especially for simple models.
The approach we gave in the body of this unit will work perfectly even in small samples … but only if the assumptions are met. Most statistical software makes it easy to estimate robust standard errors. It’s probably a good idea to supplement the hypothesis tests and confidence intervals you obtain using the standard approach with those you obtain using robust approaches; if they give similar results, you can be more confident in your inferences.
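The notes don’t prescribe a particular implementation, but one common route in R is the sandwich and lmtest packages (my suggestion; you’ll need to install them first).

Code

library(sandwich) # heteroskedasticity-consistent covariance estimators
library(lmtest)   # coeftest() and coefci() for tests/CIs with a custom vcov

mod <- lm(engage ~ interest, data = nc)
coeftest(mod, vcov. = vcovHC(mod, type = 'HC3'))  # robust t-tests
coefci(mod, vcov. = vcovHC(mod, type = 'HC3'))    # robust confidence intervals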
Bootstrapped standard errors
Another approach to deriving standard errors is bootstrapping. As you might remember from Unit 2, bootstrapping involves resampling with replacement multiple times from our original data and recalculating the statistic of interest each time. Then we can use the standard deviation of the statistic in the bootstrap distribution (the distribution of the statistic across multiple bootstrap samples) as an estimate of the standard deviation of the statistic in the sampling distribution (the distribution of the statistic across multiple actual samples).8
So far we’ve discussed resampling observations from the sample. However, regression gives us the opportunity to resample other quantities as well. Another approach is to resample residuals; yet another is to randomly replace each residual with either \varepsilon_i or -\varepsilon_i. Yet another approach has us estimate not the bootstrap distribution of \hat{\beta}_1, but the bootstrap distribution of \frac{\hat{\beta}_1}{\hat{se}(\hat{\beta_1})} (the bootstrap distribution of the t-statistic).
As with robust standard errors, bootstrapping approaches allow us to relax assumptions required to derive standard errors using the typical approach described in the main body of this unit. Also as with robust standard errors, bootstrap standard errors tend to be only asymptotically correct. However, some of the more sophisticated bootstrap approaches work better than robust standard errors in small samples. Once again, it makes sense to supplement your hypothesis tests and confidence intervals using bootstrap approaches. On the other hand, bootstraps are sometimes extremely time-consuming, and sometimes require you to program your own analyses.
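As a starting point, here’s a minimal sketch (my illustration) of the simplest version, resampling observations with replacement and recording the slope each time.

Code

set.seed(1)                                       # illustrative seed
mod <- lm(engage ~ interest, data = nc)
boot_slopes <- replicate(2000, {
  idx <- sample(nrow(nc), replace = TRUE)         # resample rows with replacement
  coef(lm(engage ~ interest, data = nc[idx, ]))['interest']
})
sd(boot_slopes)                                   # bootstrap estimate of se(beta_1-hat)
quantile(boot_slopes, c(.025, .975))              # a simple percentile confidence interval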
Footnotes
I apologize for my habit of using American idioms in a class which includes a number of non-American students. “In the right ballpark” means “reasonably close to”. A ballpark is a stadium where baseball is played. Baseball is America’s national pastime, and one of the most boring sports ever invented.↩︎
The approach we’re going to use to do this only really works if the population has certain features; we’ll cover these in the next unit.↩︎
Come on, you’ve got be at least a little impressed, right?↩︎
Historically, simulations were more physical; scientists might flip a coin, use the last digit of the Dow Jones Industrial Average, or pick a random page from a book of random numbers. It turns out that it’s even possible to estimate \pi through simulation by tossing a needle over a set of equally spaced parallel lines and calculating the proportion of tosses on which the needle crosses a line.↩︎
All confidence intervals that we form are 95%-confidence intervals, but this is simply by convention.↩︎
The intercept is actually simply the mean value of the outcome when the predictor is 0, so the confidence interval for the intercept is the confidence interval for ENGAGEMENT conditional on an INTEREST of 0. But it’s not possible to have an INTEREST score below 1, so it’s not clear what to make of this interval.↩︎
That would be a fantastically weird thing to do.↩︎
We can do much fancier things with our bootstraps, but this is a start.↩︎