Unit 9 - Statistical Control
These course notes should be used to supplement what you learn in class. They are not intended to replace course lectures and discussions. You are only responsible for material covered in class; you are not responsible for any material covered in the course notes but not in class.
Overview
- The setting
- Point 1: The “correct model” only exists in reference to a research question
- Point 2: Regression parameters are meaningful only in the context of a model
- Point 3: It’s difficult if not impossible to control for highly related predictors
- Point 4: You need to be concerned about missing data
- Point 5: Measurement error can be a big problem
- Summary
The setting
In this unit, we’ll dive a little deeper into multiple linear regression, focusing less on the technical details and more on deeper conceptual issues. In our experience doing and reading analyses, this is the place where most mistakes are made. Fitting models and providing mechanical explanations is relatively easy; you’ll figure out the code and the standard language easily enough. However, making meaning of what you’ve found is much more challenging, as is figuring out the right model to use to answer a question that interests you. We’re not really using data in this unit, just a series of models which illustrate some ideas we consider important. The unit is organized around a series of points regarding what regression can and, more importantly, cannot do.
Point 1: The “correct model” only exists in reference to a research question
A common misconception held both by beginning students of statistics and experienced researchers is that there is a “true” model out there which associates, e.g., INTEREST and ENGAGEMENT, and our goal is to find the correct predictors which will give us this model. This view is incorrect. In fact, any model is correct as long as it answers the research question of interest (or, at least, any set of predictors is correct; we may need to transform certain predictors or allow for interactions, which we’ll discuss in Unit 12). The key thing to ask yourself is, what question are you interested in answering with your model?
In the previous units, the first regression model we fit was
ENGAGEMENT_i=\beta_0 + \beta_1INTEREST_i+\varepsilon_i.
If we want to estimate the association between students’ engagement in their classes and the extent to which they think their teachers are interested in them as people, this is the model for us! \beta_1 represents exactly that association. Nothing could be simpler. On the other hand, adding VALUE to the model to yield
ENGAGEMENT_i=\beta_0 + \beta_1INTEREST_i+\beta_2VALUE_i+\varepsilon_i
yields a model with substantially more explanatory power, as measured by R^2. Does this mean that the latter model is the “better” one? After all, it gives us much better predictions of the outcome, which is important in regression analyses. The answer (not surprisingly, if you’ve been reading so far) is no. The reason is that the second model doesn’t estimate what we’re interested in. The coefficient for INTEREST in the second model represents the association between INTEREST and ENGAGEMENT between students with the same level of VALUE, but this is a very different thing from the association between INTEREST and ENGAGEMENT, especially since INTEREST and VALUE are so closely related. We get better predictions, which is good, but we don’t get an answer to our question!1
It’s very common for researchers to add predictors to their models just because they increase the value of R^2. This is a mistake, because it changes what the model is estimating! Similarly, although adding predictors such as race and gender (we’ll see how to do this in later units) is frequently a good idea, you should be aware that doing so changes what the models are estimating. The coefficients in a regression which controls for gender mean different things, and sometimes very different things, than the coefficients in a regression which does not. Specifically, after adding gender and race to a model, all other coefficients have to be interpreted as associations with the outcome, controlling for gender and race (i.e., among respondents of the same gender and race). Frequently this is what we want, but not always!
We can’t stress this enough: the correct model is the model which allows you to answer the question that you want to answer. Controls may help you to do so, but only because your research question might be about controlled associations. If your research question doesn’t include controlling for certain variables, then you probably shouldn’t be using them!
There are a few situations in which it makes sense to think of a “correct” model. When our predictors are perfectly uncorrelated, adding them doesn’t change the interpretation of the other coefficients; it only helps by adding precision. If we want to know the association between race and achievement (we’ll explain how to do that in a later unit), it makes sense to control for gender, because gender is uncorrelated with race (knowing a student’s race doesn’t give us any information about their gender): doing so won’t change the parameter estimates, but it will improve our precision. However, it might be a big mistake to include SES as a predictor, because students of different races differ on SES, so controlling for SES would result in a model that didn’t answer our research question. It would answer a different research question (what is the association between race and achievement, controlling for SES?), which might also be interesting, but that’s a substantive issue. If we’re interested in racial differences without controlling for SES, we shouldn’t be including SES.
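To see how much the controlled and uncontrolled coefficients can differ, here’s a minimal simulation. Every number in it (the coefficients, the correlation between predictors, the sample size) is made up for illustration; this is not the course dataset.

```python
import random

random.seed(1)
n = 10_000

# Made-up data-generating process (NOT the course data): VALUE is
# correlated with INTEREST, and both contribute to ENGAGEMENT.
interest = [random.gauss(0, 1) for _ in range(n)]
value = [0.8 * x + random.gauss(0, 0.6) for x in interest]
engagement = [0.1 * x + 0.7 * v + random.gauss(0, 0.5)
              for x, v in zip(interest, value)]

def slope(x, y):
    """OLS slope from a simple (one-predictor) regression of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b_uncontrolled = slope(interest, engagement)
gamma = slope(interest, value)  # how strongly VALUE tracks INTEREST

# Omitted-variable algebra: the uncontrolled slope estimates
# b_interest + b_value * gamma = 0.1 + 0.7 * 0.8 = 0.66, not the
# controlled coefficient of 0.1.
print(round(b_uncontrolled, 2), round(gamma, 2))
```

Neither 0.66 nor 0.1 is the “right” slope in any absolute sense; the first answers the uncontrolled question and the second the controlled one.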
Additionally, regression is sometimes used for prediction rather than for estimating parameters. In that case, the correct model is the model that best predicts new cases (cases from outside the sample) given the available predictors. It might be hard to find that model, but it does exist, and it’s sensible to try to find it.
Point 2: Regression parameters are meaningful only in the context of a model
This is a closely related point. Many analysts think that there is such a thing as a true regression parameter, \beta_1, which summarizes the real association between INTEREST and ENGAGEMENT. Under certain causal frameworks this may be true, and if we mean the uncontrolled association, then this is also right, but otherwise it’s not. There is no set of controls which will reveal the “real” association between a pair of variables, because that “real” association does not exist. That’s just not a thing.
In the two models we fit, we obtained the equations
\hat{ENGAGEMENT}_i = 1.72 + 0.45INTEREST_i
and
\hat{ENGAGEMENT}_i = 0.17 + 0.12INTEREST_i + 0.77VALUE_i.
Beginning (and advanced) researchers will frequently wonder which of these associations is correct, and will typically default to the second, because they assume that controlling for VALUE has eliminated some pseudo-association induced by covariance between INTEREST and VALUE (students who value the subject more are more engaged and think their teachers are more interested in them). This is a sneaky way of slipping causal ideas into our analyses. It’s true that in the first model, part of the association between INTEREST and ENGAGEMENT can be explained by VALUE, and when we control for VALUE, the regression coefficient gets smaller (and is no longer statistically significant). But this doesn’t mean that the new regression coefficient is better in any absolute sense. Instead, it’s estimating something different than the first one.
When we add VALUE to the model, we change the meanings of the regression parameters. The coefficient \beta_{INTEREST} now represents the association between INTEREST and ENGAGEMENT between students who report the same VALUE, which is a different thing than the association between INTEREST and ENGAGEMENT overall. If this helps us answer questions which are more interesting to us, then this is a good thing; if it doesn’t, we’re better off with the uncontrolled association.
Point 4: You need to be concerned about missing data
Missing data arises in a lot of ways. Respondents might not answer every item on a survey. People sometimes give unacceptable responses to items. Data might not be recorded correctly.
Analysts frequently ignore missing data; if you run a regression in R or Stata, your software will simply drop all observations which are missing any of the variables used in the analysis (this is called listwise or casewise deletion). This can lead to big problems.
Missing data is a big enough issue that it could be a course unto itself. In fact, it is a course (actually several courses) unto itself in the Harvard statistics department, where it’s taught by Don Rubin, one of the first people to study the topic in depth. Rubin identifies several ways in which data can be missing, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR; try saying that five times fast).
The most important distinction is between missingness that is ignorable and missingness which is not ignorable. Ignorable missingness is irritating; it reduces our sample size, because we can’t use those respondents who have missing data. Standard errors will get bigger and inferences will get less precise, especially if we have a lot of cases with missing values.
Non-ignorable missingness is much worse. Missingness is non-ignorable if failing to account for the missingness leads to biased parameter estimates (i.e., parameter estimates that are consistently wrong, even as the sample size gets bigger). Consider the following situation: researchers want to determine whether girls or boys have more positive relationships with their teachers. But suppose that girls who have bad relationships with their teachers are less comfortable expressing that than boys, so they tend not to answer the questions at all. Then the mean relationship quality for girls will be estimated as higher than it really is, because we’re missing responses for those girls who don’t have good relationships. This will be true no matter how large our sample grows; bias doesn’t disappear with large samples the way imprecision does.
Another way to think of this is as a biased sample. We wanted to sample from the population of all students, but instead we effectively sampled from the population of all boys, and those girls who have reasonably positive relationships with their teachers; the problem is that we don’t know how our sampling is biased, because we don’t know why those girls aren’t responding to the surveys.
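Here’s a small sketch contrasting ignorable and non-ignorable missingness. The response rates and score distribution are invented for illustration, but the pattern mirrors the example above: everyone has the same true mean, yet under non-ignorable missingness the observed mean is biased, and more data won’t fix it.

```python
import random

random.seed(2)
n = 20_000

# Made-up population: true mean relationship quality is 0.
true_scores = [random.gauss(0, 1) for _ in range(n)]

# MCAR (ignorable): everyone skips the item with the same 35% chance.
mcar = [s for s in true_scores if random.random() < 0.65]

# MNAR (non-ignorable): respondents with below-average relationship
# quality answer only 30% of the time; everyone else always answers.
mnar = [s for s in true_scores if s >= 0 or random.random() < 0.3]

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(mcar), 2))  # still near 0: fewer cases, no bias
print(round(mean(mnar), 2))  # biased upward, no matter how big n gets
```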
This course doesn’t cover what you can do about missing data; approaches to handle missingness tend to be difficult to implement. But when doing an analysis, you should always look at data to find patterns of missing data and think about what they could mean, just as you should always think through your sampling approach and consider whom you might be missing.
Point 5: Measurement error can be a big problem
We touched on this when discussing collinearity, but measurement error deserves its own section. Perhaps obviously, regression works with the values we give it. However, we’re frequently more interested in some hidden or latent value which we can’t directly observe, but which our variables are supposed to represent. For example, when we do regression using MCAS ELA scores as the outcome, we’re typically thinking of them as measures of a latent construct, which we might call English language ability. But as you probably realize, test scores are not perfect measures of a student’s ability. In statistics, we say that tests and psychological scales contain measurement error, i.e. random deviations from a person’s true score, due to the fact that our measures are imperfect.
Measurement error in an outcome
When the outcome is measured with error, this isn’t usually a huge problem for us. The residuals become more variable, the strengths of association (i.e., the correlations) between the outcome and the predictors decrease as the measurement error increases, and the estimates become less precise. This is irritating: it reduces our ability to reject null hypotheses, it makes our confidence intervals wider, and it makes the observed variables appear less strongly correlated than the latent variables really are (on the other hand, the unstandardized coefficients are unchanged on average, just less precise).
These results hold only if the measurement error is uncorrelated with the predictor, which is not always the case. Some research indicates that students have better reading comprehension on passages about things in which they are interested. If a test includes passages which better represent the interests of girls than of boys (on average; obviously individual girls and boys have their own interests), we might find evidence that girls read better than boys even if they have the same latent reading ability. Technically this is not measurement error but differential item functioning. Either way, it highlights the importance of really understanding your measures before trying to do inference.
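A quick sketch of the outcome-error case, with made-up variances and error that is uncorrelated with the predictor (per the caveat above): the slope is essentially unchanged, while the correlation is attenuated.

```python
import random, math

random.seed(3)
n = 20_000

x = [random.gauss(0, 1) for _ in range(n)]
latent_y = [0.5 * xi + random.gauss(0, 0.5) for xi in x]  # "true" outcome
noisy_y = [yi + random.gauss(0, 1.0) for yi in latent_y]  # + measurement error

def slope_corr(x, y):
    """Simple-regression slope and correlation of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

b_true, r_true = slope_corr(x, latent_y)
b_noisy, r_noisy = slope_corr(x, noisy_y)
# Both slopes sit near the true 0.5; the observed correlation is
# noticeably smaller than the latent one (about 0.41 vs 0.71 here).
print(round(b_noisy, 2), round(r_true, 2), round(r_noisy, 2))
```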
Measurement error in a predictor
When a predictor contains measurement error, the situation changes. Measurement error in predictors can have unpredictable consequences. If there’s a single predictor in the regression, measurement error will tend to make the correlations and also the unstandardized regression coefficients smaller than they would be if there were no measurement error. If there are multiple predictors and one is measured with error, things can get even messier.
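The single-predictor case can be sketched as follows. The variances are made up; the point is the classical attenuation result, in which the slope shrinks by the reliability ratio \lambda = var(true) / (var(true) + var(error)).

```python
import random

random.seed(4)
n = 20_000

true_x = [random.gauss(0, 1) for _ in range(n)]       # latent predictor
y = [0.6 * t + random.gauss(0, 0.5) for t in true_x]  # outcome
obs_x = [t + random.gauss(0, 1.0) for t in true_x]    # reliability 0.5

def slope(x, y):
    """OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Classical attenuation: the observed slope shrinks by
# lambda = 1 / (1 + 1) = 0.5, so we expect roughly
# 0.6 * 0.5 = 0.3 instead of the latent slope of 0.6.
print(round(slope(true_x, y), 2), round(slope(obs_x, y), 2))
```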
Here’s a thought experiment. Suppose that 50 native English speakers and 50 non-native English speakers are given two English reading comprehension tests, one at the beginning of the month and one at the end. The tests, of course, contain measurement error. Suppose that both groups gain the same amount of actual English reading comprehension, although native English speakers start higher. Now suppose that we regress post-test scores on pre-test scores and whether a person is a native English speaker (in later units we’ll cover how to regress on categorical predictors). The regression will indicate that native English speakers tend to have higher post-test scores than non-native English speakers with the same pre-test score, even though both groups experienced the same mean growth. The reason is measurement error in the pre-test predictor. Given that native English speakers have higher mean scores on the test, a native English speaker with the same observed score as a non-native English speaker is likely to have an artificially low score due to measurement error; similarly, the non-native English speaker is likely to have an artificially high score. The native English speaker will probably get a higher post-test score because they probably have a higher actual latent reading ability.
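Here’s a made-up numerical version of that thought experiment, using a small hand-rolled two-predictor OLS. The group means, error variances, and sample sizes are all invented for illustration; both groups truly gain nothing on average, yet the group coefficient comes out clearly positive.

```python
import random

random.seed(5)
n = 10_000  # per group (inflated so the artifact is easy to see)

def simulate(group_mean):
    """Pre/post observed scores around a latent ability; true growth is zero."""
    true_pre = [random.gauss(group_mean, 1) for _ in range(n)]
    pre = [t + random.gauss(0, 1) for t in true_pre]   # observed pre-test
    post = [t + random.gauss(0, 1) for t in true_pre]  # same latent score
    return pre, post

pre_nat, post_nat = simulate(1.0)    # native speakers start higher
pre_non, post_non = simulate(-1.0)

pre = pre_nat + pre_non
post = post_nat + post_non
native = [1.0] * n + [0.0] * n

def ols2(x1, x2, y):
    """Coefficients on two predictors via the centered normal equations."""
    def mean(v):
        return sum(v) / len(v)
    c1, c2, cy = mean(x1), mean(x2), mean(y)
    s11 = sum((a - c1) ** 2 for a in x1)
    s22 = sum((b - c2) ** 2 for b in x2)
    s12 = sum((a - c1) * (b - c2) for a, b in zip(x1, x2))
    s1y = sum((a - c1) * (v - cy) for a, v in zip(x1, y))
    s2y = sum((b - c2) * (v - cy) for b, v in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

b_pre, b_native = ols2(pre, native, post)
# Despite identical (zero) true growth in both groups, the native-speaker
# coefficient is clearly positive: an artifact of measurement error in
# the pre-test (about mean-gap * (1 - reliability) = 2 * 0.5 = 1.0 here).
print(round(b_pre, 2), round(b_native, 2))
```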
Measurement error is a bigger problem when
- it occurs in a predictor,
- the variance of the measurement error is large relative to the variance of the latent variable, and
- the predictor and outcome are highly correlated.
There are lots of ways to deal with measurement error, none of which are perfectly adequate. One of the simplest is to design measures which are less error-prone. Other approaches include structural equation models, or re-specifying models so that only the outcome contains measurement error. We don’t cover any of those approaches in this class, but we recommend that you think carefully about the differences between what regression can tell us (about the associations between observed variables) and what we wish it could tell us (about the associations between latent constructs).3
Footnotes
There are actually ways to use the second model to answer our original question, possibly with some additional precision. But these are a lot harder to use than just fitting a model where one of the parameters directly addresses our research question!↩︎
We refer to predictors which are problematically highly correlated as collinear. If a set of predictors are jointly too highly correlated, they are referred to as multi-collinear. If the correlation is perfect we say that the model is not identified, as we are mathematically unable to fit it.↩︎
As they say, if wishes were horses, you’d have to be very careful not to wish for things or a horse might explode out of your head. Or something like that. Yuck.↩︎