Unit 9 - Statistical Control
These course notes should be used to supplement what you learn in class. They are not intended to replace course lectures and discussions. You are only responsible for material covered in class; you are not responsible for any material covered in the course notes but not in class.
Overview
- The setting
- Point 1: The “correct model” only exists in reference to a research question
- Point 2: Regression parameters are meaningful only in the context of a model
- Point 3: It’s difficult if not impossible to control for highly related predictors
- Point 4: You need to be concerned about missing data
- Point 5: Measurement error can be a big problem
- Summary
The setting
In this unit, we’ll dive a little deeper into multiple linear regression, focusing less on the technical details and more on deeper conceptual issues. In our experience doing and reading analyses, this is the place where most mistakes are made. Fitting models and providing mechanical explanations is relatively easy; you’ll figure out the code and the standard language easily enough. However, making meaning of what you’ve found is much more challenging, as is figuring out the right model to use to answer a question that interests you. We’re not really using data in this unit, just a series of models which illustrate some ideas we consider important. The unit is organized around a series of points regarding what regression can and, more importantly, cannot do.
Point 1: The “correct model” only exists in reference to a research question
A common misconception held both by beginning students of statistics and experienced researchers is that there is a “true” model out there which associates, e.g., INTEREST and ENGAGEMENT, and our goal is to find the correct predictors which will give us this model. This view is incorrect. In fact, any model is correct as long as it answers the research question of interest (or, at least, any set of predictors is correct; we may need to transform certain predictors or allow for interactions, which we’ll discuss in Unit 12). The key thing to ask yourself is, what question are you interested in answering with your model?
In the previous units, the first regression model we fit was
ENGAGEMENT_i=\beta_0 + \beta_1INTEREST_i+\varepsilon_i.
If we want to estimate the association between students’ engagement in their classes and the extent to which they think their teachers are interested in them as people, this is the model for us! \beta_1 represents exactly that association. Nothing could be simpler. On the other hand, adding VALUE to the model to yield
ENGAGEMENT_i=\beta_0 + \beta_1INTEREST_i+\beta_2VALUE_i+\varepsilon_i
yields a model with substantially more explanatory power, as measured by R^2. Does this mean that the latter model is the “better” one? After all, it gives us much better predictions of the outcome, which is important in regression analyses. The answer (not surprisingly, if you’ve been reading so far) is no. The reason is that the second model doesn’t estimate what we’re interested in. The coefficient for INTEREST in the second model represents the association between INTEREST and ENGAGEMENT between students with the same level of VALUE, but this is a very different thing from the association between INTEREST and ENGAGEMENT, especially since INTEREST and VALUE are so closely related. We get better predictions, which is good, but we don’t get an answer to our question!1
It’s very common for researchers to add predictors to their models just because they increase the value of R^2. This is a mistake, because it changes what the model is estimating! Similarly, although adding predictors such as race and gender (we’ll see how to do this in later units) is frequently a good idea, you should be aware that doing so changes what the models are estimating. The coefficients in a regression which controls for gender mean different things, and sometimes very different things, than the coefficients in a regression which does not. Specifically, after adding gender and race to a model, all other coefficients have to be interpreted as associations with the outcome, controlling for gender and race (i.e., among respondents of the same gender and race). Frequently this is what we want, but not always!
We can’t stress this enough: the correct model is the model which allows you to answer the question that you want to answer. Controls may help you to do so, but only because your research question might be about controlled associations. If your research question doesn’t include controlling for certain variables, then you probably shouldn’t be using them!
There are a few situations in which it makes sense to think of a “correct” model. When our predictors are perfectly uncorrelated, adding them doesn’t change the interpretation of the other coefficients; it only helps by adding precision. If we want to know the association between race and achievement (we’ll explain how to do that in a later unit), it makes sense to control for gender, because gender is uncorrelated with race (knowing a student’s race doesn’t give us any information about their gender): doing so won’t change the parameter estimates, but it will improve our precision. However, it might be a big mistake to include SES as a predictor, because students of different races differ on SES, so controlling for SES would result in a model that didn’t answer our research question. It would answer a different research question (what is the association between race and achievement, controlling for SES?), which might also be interesting, but that’s a substantive issue. If we’re interested in racial differences without controlling for SES, we shouldn’t be including SES.
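To see how much the controlled and uncontrolled coefficients can differ, here’s a minimal simulation. Every number in it (the coefficients, the correlation between predictors, the sample size) is made up for illustration; this is not the course dataset.

```python
import random

random.seed(1)
n = 10_000

# Made-up data-generating process (NOT the course data): VALUE is
# correlated with INTEREST, and both contribute to ENGAGEMENT.
interest = [random.gauss(0, 1) for _ in range(n)]
value = [0.8 * x + random.gauss(0, 0.6) for x in interest]
engagement = [0.1 * x + 0.7 * v + random.gauss(0, 0.5)
              for x, v in zip(interest, value)]

def slope(x, y):
    """OLS slope from a simple (one-predictor) regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b_uncontrolled = slope(interest, engagement)
gamma = slope(interest, value)  # how strongly VALUE tracks INTEREST

# Omitted-variable algebra: the uncontrolled slope estimates
# b_interest + b_value * gamma = 0.1 + 0.7 * 0.8 = 0.66, not the
# controlled coefficient of 0.1.
print(round(b_uncontrolled, 2), round(gamma, 2))
```

Neither 0.66 nor 0.1 is the “right” slope in any absolute sense; the first answers the uncontrolled question and the second the controlled one.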
Additionally, regression is sometimes used for prediction rather than for estimating parameters. In that case, the correct model is the model that best predicts new cases (cases from outside the sample) given the available predictors. It might be hard to find that model, but it does exist, and it’s sensible to try to find it.
Point 2: Regression parameters are meaningful only in the context of a model
This is a closely related point. Many analysts think that there is such a thing as a true regression parameter, \beta_1, which summarizes the real association between INTEREST and ENGAGEMENT. Under certain causal frameworks this may be true, and if we mean the uncontrolled association, then this is also right, but otherwise it’s not. There is no set of controls which will reveal the “real” association between a pair of variables, because that “real” association does not exist. That’s just not a thing.
In the two models we fit, we obtained the equations
\hat{ENGAGEMENT}_i = 1.72 + 0.45INTEREST_i
and
\hat{ENGAGEMENT}_i = 0.17 + 0.12INTEREST_i + 0.77VALUE_i.
Beginning (and advanced) researchers will frequently wonder which of these associations is correct, and will typically default to the second, because they assume that controlling for VALUE has eliminated some pseudo-association induced by covariance between INTEREST and VALUE (students who value the subject more are more engaged and think their teachers are more interested in them). This is a sneaky way of slipping causal ideas into our analyses. It’s true that in the first model, part of the association between INTEREST and ENGAGEMENT can be explained by VALUE, and when we control for VALUE, the regression coefficient gets smaller (and is no longer statistically significant). But this doesn’t mean that the new regression coefficient is better in any absolute sense. Instead, it’s estimating something different than the first one.
When we add VALUE to the model, we change the meanings of the regression parameters. The coefficient \beta_{INTEREST} now represents the association between INTEREST and ENGAGEMENT between students who report the same VALUE, which is a different thing than the association between INTEREST and ENGAGEMENT overall. If this helps us answer questions which are more interesting to us, then this is a good thing; if it doesn’t, we’re better off with the uncontrolled association.
Point 4: You need to be concerned about missing data
Missing data arises in a lot of ways. Respondents might not answer every item on a survey. People sometimes give unacceptable responses to items. Data might not be recorded correctly.
Analysts frequently ignore missing data; if you run a regression in R or Stata, your software will simply drop all observations which are missing any of the variables used in the analysis (this is called listwise or casewise deletion). This can lead to big problems.
Missing data is a big enough issue that it could be a course unto itself. In fact, it is a course (actually several courses) unto itself in the Harvard statistics department, where it’s taught by Don Rubin, one of the first people to study the topic in depth. Rubin identifies several ways in which data can be missing, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR; try saying that five times fast).
The most important distinction is between missingness that is ignorable and missingness which is not ignorable. Ignorable missingness is irritating; it reduces our sample size, because we can’t use those respondents who have missing data. Standard errors will get bigger and inferences will get less precise, especially if we have a lot of cases with missing values.
Non-ignorable missingness is much worse. Missingness is non-ignorable if failing to account for the missingness leads to biased parameter estimates (i.e., parameter estimates that are consistently wrong, even as the sample size gets bigger). Consider the following situation: researchers want to determine whether girls or boys have more positive relationships with their teachers. But suppose that girls who have bad relationships with their teachers are less comfortable expressing that than boys, so they tend not to answer the questions at all. Then the mean relationship quality for girls will be estimated as higher than it really is, because we’re missing responses for those girls who don’t have good relationships. This will be true no matter how large our sample grows; bias doesn’t disappear with large samples the way imprecision does.
Another way to think of this is as a biased sample. We wanted to sample from the population of all students, but instead we effectively sampled from the population of all boys, and those girls who have reasonably positive relationships with their teachers; the problem is that we don’t know how our sampling is biased, because we don’t know why those girls aren’t responding to the surveys.
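Here’s a small sketch contrasting ignorable and non-ignorable missingness. The response rates and score distribution are invented for illustration, but the pattern mirrors the example above: everyone has the same true mean, yet under non-ignorable missingness the observed mean is biased, and more data won’t fix it.

```python
import random

random.seed(2)
n = 20_000

# Made-up population: true mean relationship quality is 0.
true_scores = [random.gauss(0, 1) for _ in range(n)]

# MCAR (ignorable): everyone skips the item with the same 35% chance.
mcar = [s for s in true_scores if random.random() < 0.65]

# MNAR (non-ignorable): respondents with below-average relationship
# quality answer only 30% of the time; everyone else always answers.
mnar = [s for s in true_scores if s >= 0 or random.random() < 0.3]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(mcar), 2))  # still near 0: fewer cases, no bias
print(round(mean(mnar), 2))  # biased upward, no matter how big n gets
```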
This course doesn’t cover what you can do about missing data; approaches to handle missingness tend to be difficult to implement. But when doing an analysis, you should always look at data to find patterns of missing data and think about what they could mean, just as you should always think through your sampling approach and consider whom you might be missing.
Point 5: Measurement error can be a big problem
We touched on this when discussing collinearity, but measurement error deserves its own section. Perhaps obviously, regression works with the values we give it. However, we’re frequently more interested in some hidden or latent value which we can’t directly observe, but which our variables are supposed to represent. For example, when we do regression using MCAS ELA scores as the outcome, we’re typically thinking of them as measures of a latent construct, which we might call English language ability. But as you probably realize, test scores are not perfect measures of a student’s ability. In statistics, we say that tests and psychological scales contain measurement error, i.e. random deviations from a person’s true score, due to the fact that our measures are imperfect.
Measurement error in an outcome
When the outcome is measured with error, this isn’t usually a huge problem for us. The residuals become more variable, the strengths of association (i.e., the correlations) between the outcome and the predictors decrease as the measurement error increases, and the estimates become less precise. This is irritating: it reduces our ability to reject null hypotheses, it makes our confidence intervals wider, and it makes the observed variables appear less strongly correlated than the latent variables really are (on the other hand, the unstandardized coefficients are unchanged on average, just less precise).
These results hold only if the measurement error is uncorrelated with the predictor, which is not always the case. Some research indicates that students have better reading comprehension on passages about things in which they are interested. If a test includes passages which better represent the interests of girls than of boys (on average; obviously individual girls and boys have their own interests), we might find evidence that girls read better than boys even if they have the same latent reading ability. Technically this is not measurement error but differential item functioning. Either way, it highlights the importance of really understanding your measures before trying to do inference.
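A quick sketch of the outcome-error case, with made-up variances and error that is uncorrelated with the predictor (per the caveat above): the slope is essentially unchanged, while the correlation is attenuated.

```python
import random, math

random.seed(3)
n = 20_000

x = [random.gauss(0, 1) for _ in range(n)]
latent_y = [0.5 * xi + random.gauss(0, 0.5) for xi in x]  # "true" outcome
noisy_y = [yi + random.gauss(0, 1.0) for yi in latent_y]  # + measurement error

def slope_corr(x, y):
    """Simple-regression slope and correlation of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

b_true, r_true = slope_corr(x, latent_y)
b_noisy, r_noisy = slope_corr(x, noisy_y)
# Both slopes sit near the true 0.5; the observed correlation is
# noticeably smaller than the latent one (about 0.41 vs 0.71 here).
print(round(b_noisy, 2), round(r_true, 2), round(r_noisy, 2))
```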
Measurement error in a predictor
When a predictor contains measurement error, the situation changes. Measurement error in predictors can have unpredictable consequences. If there’s a single predictor in the regression, measurement error will tend to make the correlations and also the unstandardized regression coefficients smaller than they would be if there were no measurement error. If there are multiple predictors and one is measured with error, things can get even messier.
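The single-predictor case can be sketched as follows. The variances are made up; the point is the classical attenuation result, in which the slope shrinks by the reliability ratio \lambda = var(true) / (var(true) + var(error)).

```python
import random

random.seed(4)
n = 20_000

true_x = [random.gauss(0, 1) for _ in range(n)]       # latent predictor
y = [0.6 * t + random.gauss(0, 0.5) for t in true_x]  # outcome
obs_x = [t + random.gauss(0, 1.0) for t in true_x]    # reliability 0.5

def slope(x, y):
    """OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Classical attenuation: the observed slope shrinks by
# lambda = 1 / (1 + 1) = 0.5, so we expect roughly
# 0.6 * 0.5 = 0.3 instead of the latent slope of 0.6.
print(round(slope(true_x, y), 2), round(slope(obs_x, y), 2))
```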
Here’s a thought experiment. Suppose that 50 native English speakers and 50 non-native English speakers are given two English reading comprehension tests, one at the beginning of the month and one at the end. The tests, of course, contain measurement error. Suppose that both groups gain the same amount of actual English reading comprehension, although native English speakers start higher. Now suppose that we regress post-test scores on pre-test scores and whether a person is a native English speaker (in later units we’ll cover how to regress on categorical predictors). The regression will indicate that native English speakers tend to have higher post-test scores than non-native English speakers with the same pre-test score, even though both groups experienced the same mean growth. The reason is measurement error in the pre-test predictor. Given that native English speakers have higher mean scores on the test, a native English speaker with the same observed score as a non-native English speaker is likely to have an artificially low score due to measurement error; similarly, the non-native English speaker is likely to have an artificially high score. The native English speaker will probably get a higher post-test score because they probably have a higher actual latent reading ability.
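Here’s a made-up numerical version of that thought experiment, using a small hand-rolled two-predictor OLS. The group means, error variances, and sample sizes are all invented for illustration; both groups truly gain nothing on average, yet the group coefficient comes out clearly positive.

```python
import random

random.seed(5)
n = 10_000  # per group (inflated so the artifact is easy to see)

def simulate(group_mean):
    """Pre/post observed scores around a latent ability; true growth is zero."""
    true_pre = [random.gauss(group_mean, 1) for _ in range(n)]
    pre = [t + random.gauss(0, 1) for t in true_pre]   # observed pre-test
    post = [t + random.gauss(0, 1) for t in true_pre]  # same latent score
    return pre, post

pre_nat, post_nat = simulate(1.0)    # native speakers start higher
pre_non, post_non = simulate(-1.0)

pre = pre_nat + pre_non
post = post_nat + post_non
native = [1.0] * n + [0.0] * n

def ols2(x1, x2, y):
    """Coefficients on two predictors via the centered normal equations."""
    def mean(v):
        return sum(v) / len(v)
    c1, c2, cy = mean(x1), mean(x2), mean(y)
    s11 = sum((a - c1) ** 2 for a in x1)
    s22 = sum((b - c2) ** 2 for b in x2)
    s12 = sum((a - c1) * (b - c2) for a, b in zip(x1, x2))
    s1y = sum((a - c1) * (v - cy) for a, v in zip(x1, y))
    s2y = sum((b - c2) * (v - cy) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

b_pre, b_native = ols2(pre, native, post)
# Despite identical (zero) true growth in both groups, the native-speaker
# coefficient is clearly positive: an artifact of measurement error in
# the pre-test (about mean-gap * (1 - reliability) = 2 * 0.5 = 1.0 here).
print(round(b_pre, 2), round(b_native, 2))
```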
Measurement error is a bigger problem when
- it occurs in a predictor,
- the variance of the measurement error is large relative to the variance of the latent variable, and
- the predictor and outcome are highly correlated.
There are lots of ways to deal with measurement error, none of which are perfectly adequate. One of the simplest is to design measures which are less error-prone. Other approaches include structural equation models, or re-specifying models so that only the outcome contains measurement error. We don’t cover any of those approaches in this class, but we recommend that you think carefully about the differences between what regression can tell us (about the associations between observed variables) and what we wish it could tell us (about the associations between latent constructs).3
Footnotes
There are actually ways to use the second model to answer our original question, possibly with some additional precision. But these are a lot harder to use than just fitting a model where one of the parameters directly addresses our research question!↩︎
We refer to predictors which are problematically highly correlated as collinear. If a set of predictors are jointly too highly correlated, they are referred to as multi-collinear. If the correlation is perfect we say that the model is not identified, as we are mathematically unable to fit it.↩︎
As they say, if wishes were horses, you’d have to be very careful not to wish for things or a horse might explode out of your head. Or something like that. Yuck.↩︎