Unit 8 - Worked example

This is an addition to the course text where I’ll take you through a brief analysis showing how we could use the tools discussed in class to address an actual research question. I’ll show the necessary code in R and possibly in Stata as well. If there are other things you want to see, please let me know! I will assume that you’ve already read the unit chapter, or at least that you understand the concepts, and will not spend a lot of time re-explaining things.

I’ll try to use data which are publicly available so that you can reproduce the results I share here.

Dataset

For this analysis, I’m going to be using data from the High School Longitudinal Study of 2009, or HSLS. The HSLS is a product of the National Center for Education Statistics (NCES), which is responsible for creating and maintaining a number of different education-related datasets for the United States. These datasets are used by researchers trying to understand how American education works and what its impacts are. This particular dataset follows a cohort of students who were in 9th grade in 2009 and are now adults, with information on how students did in high school and afterwards. We’ll be working with two waves of data: wave 1, which was collected when the students were in 9th grade, and wave 2, which was collected two years later, when most of the students were in 11th grade.

To start with, I need to read in the data. Below you can find the code to do this in R (Stata users, if you want this translated into Stata, please let me know). If you want access to the dataset, let me know and I’ll tell you how.

Code
library(readstata13) # this is a package for reading in modern Stata datasets
hsls <- read.dta13('hsls.dta') # you need to make sure that you've set the working directory to be the location where the dataset is saved
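Once the data are loaded, it’s worth a quick check that the file read in as expected. This is just base R:

Code
dim(hsls)          # number of rows (students) and columns (variables)
head(names(hsls))  # the first few variable names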

Research Questions

I want to know how student math ability is associated with their math identity. Math ability is a student’s ability to solve math problems (measured here by a math test), and math identity is a student’s sense of themselves as a “math person” (measured by asking students a series of questions about how they see themselves and then aggregating their responses). Specifically, I want to know if students who felt like “math people” in 9th grade tended to have higher math ability in 11th grade.

Of course, one problem is that people who were better at math in 9th grade might have thought of themselves as math people, and that could be the reason for any association. So I also want to control for math ability in 9th grade.

Then, in a final model, I want to control for a number of other variables in the dataset, including socioeconomic status in 9th grade, the experience and effectiveness of students’ 9th grade math teachers, and a number of other variables measuring student attitudes towards math. I’m adding these because it could be that having an excellent 9th grade teacher leads to higher math identification in 9th grade AND to higher ability in 11th grade, and I want to see how math identification is associated with math ability above and beyond these factors. I’m controlling for SES for the same reason, and I’m controlling for the other attitudinal variables because I want to isolate math identity from math utility (a sense of math as useful), math self-efficacy (a sense of oneself as good at math), and math interest (a sense of math as engaging). I want to see how math identity by itself is associated with the outcome. These models are aimed at helping me better understand how math ability is related to math identity.

Analysis

I’m going to take a few steps as a part of this analysis.

Filter the dataset

I want to start by dropping anyone with missing data on the variables I’ll be using (and I’ll rename the variables at the same time). Most of the missingness is due to the fact that some students moved into or out of schools participating in the study between the two waves.

Code
library(dplyr)
# select() keeps only the variables we need; name = OLD_NAME renames on the fly
hsls <- hsls %>%
  select(math_1, math_2, math_id_1, math_int_1, math_util_1, math_eff_1,
         ses_1 = X1SES_U, teach_eff_1 = X1TMEFF, teach_exp_1 = X1TMEXP) %>%
  na.omit() # drop anyone with missing data on any of the selected variables
# Note: this assumes the six student math variables already exist under these
# names; if you're working from the raw NCES file, you'll need to rename them
# from their HSLS names (name = RAW_NAME), just as with ses_1 above

We lost a LOT of respondents, so we should be a little cautious about using these data to make inferences to a wider population.
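How much is “a LOT”? A quick check makes the loss concrete. This sketch assumes you saved a copy of the unfiltered data under a name like hsls_raw (a name I’m introducing for illustration; e.g., run hsls_raw <- hsls before the select() step above):

Code
nrow(hsls_raw)                   # respondents before filtering
nrow(hsls)                       # respondents with complete data
1 - nrow(hsls) / nrow(hsls_raw)  # proportion of the sample lost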

Models

I’m going to skip a lot of the descriptive work to streamline the process. Just keep in mind that in a real analysis I would want to examine the distributions of the individual variables and to see how they were associated (I’ll sketch what that might look like below). For now, let’s go directly to the models. I’m going to be comparing three models. First, I’m going to regress 11th grade math ability on 9th grade math identity, with

math\_2_i = \beta_0 + \beta_1 math\_id\_1_i + \varepsilon_i.

Then I’m going to add math ability in 9th grade as a key control, giving the model

math\_2_i = \beta_0 + \beta_1 math\_id\_1_i + \beta_2 math\_1_i + \varepsilon_i.

Finally, I’m going to add a number of other variables, giving me the model

math\_2_i = \beta_0 + \beta_1 math\_id\_1_i + \beta_2 math\_1_i + \mathbf{X}_i\mathbf{B} + \varepsilon_i,

where \mathbf{X}_i is a vector (here, basically just a group) of control variables including 9th grade math teacher experience and effectiveness, student attitudes towards math (other than math identity), and student SES, and \mathbf{B} is the corresponding vector of coefficients.

Notice that I decide what models I want to fit based on what models will answer my questions. I don’t need to examine the data in advance, I just need to know what I want to hold constant for theoretical reasons.
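For the curious, here is roughly what the skipped descriptive work might look like. This is a minimal sketch, not the full set of checks I’d run in a real analysis:

Code
summary(hsls)  # distribution of each variable (min, quartiles, mean, max)
cor(hsls)      # pairwise correlations (assuming all selected variables are numeric)
pairs(hsls)    # scatterplot matrix for a visual check (can be slow with ~10,000 rows)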

Fitted Models

The fitted models are displayed below.

Code
library(texreg) # package for displaying several regression models side by side
mod1 <- lm(math_2 ~ math_id_1, data = hsls)
mod2 <- lm(math_2 ~ math_id_1 + math_1, data = hsls)
mod3 <- lm(math_2 ~ math_id_1 + math_1 + math_eff_1 + math_util_1 + math_int_1 + teach_eff_1 + teach_exp_1 + ses_1, data = hsls)
htmlreg(list(mod1, mod2, mod3)) # HTML comparison table; use screenreg() for plain text in the console
Statistical models
------------------------------------------------------
               Model 1      Model 2      Model 3
------------------------------------------------------
(Intercept)     0.78***      0.67***      0.65***
               (0.01)       (0.01)       (0.01)
math_id_1       0.45***      0.14***      0.14***
               (0.01)       (0.01)       (0.01)
math_1                       0.84***      0.75***
                            (0.01)       (0.01)
math_eff_1                                0.05***
                                         (0.01)
math_util_1                              -0.04***
                                         (0.01)
math_int_1                               -0.01
                                         (0.01)
teach_eff_1                               0.02**
                                         (0.01)
teach_exp_1                               0.01
                                         (0.01)
ses_1                                     0.19***
                                         (0.01)
------------------------------------------------------
R2              0.16         0.56         0.58
Adj. R2         0.16         0.56         0.58
Num. obs.   10328        10328        10328
------------------------------------------------------
*** p < 0.001; ** p < 0.01; * p < 0.05

I’m going to focus my interpretation on the coefficient on math identity and how it changes over the three models, although I’ll also look at some of the other coefficients in the third model. In the first model, which explains only about 16% of the variability in math scores, each one-point difference in math identity (the scale’s units) is associated with a 0.45-point difference in math ability. It’s a little hard to make sense of this difference, since neither scale is widely familiar. In retrospect, it might have made sense to standardize the variables first. However, I can at least say that students with higher math identity in 9th grade tend to have higher math ability in 11th grade.
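If I wanted to standardize, a sketch of the refit might look like the code below. scale() centers each variable and rescales it to have a standard deviation of 1; hsls_std and mod1_std are names I’m introducing here:

Code
# Standardize every variable, then refit the first model on the standardized data
hsls_std <- hsls %>%
  mutate(across(everything(), ~ as.numeric(scale(.x))))
mod1_std <- lm(math_2 ~ math_id_1, data = hsls_std)
# The math_id_1 coefficient is now in standard deviation units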

Things get more interesting when I control for math ability in 9th grade. This results in a dramatic reduction in the magnitude of the association. When we hold constant 9th grade math ability, each scale point difference in math identity is associated with only a 0.14 point difference in 11th grade math ability. However, although 9th grade math ability seems to explain most of the association we saw before, even when we hold this constant the association between 9th grade math identity and 11th grade math ability is statistically significant (t(df = 10,325) = 17, p < .001). This model explains 56% of the variability in 11th grade math ability.
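In case you’re wondering where that t-statistic comes from, it’s part of the standard regression output; no extra packages are needed:

Code
summary(mod2)                       # full output, with t-values and p-values
coef(summary(mod2))["math_id_1", ]  # just the math identity row: estimate, SE, t, p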

When we add the additional controls, the association between math ability and math identity remains basically unchanged. It’s still the case that, controlling for the other variables, each scale point difference in 9th grade math identity predicts a 0.14 point difference in 11th grade math ability. The association is still statistically significant (t(df = 10,319) = 14, p < .001). So the interesting thing here is that when we look at students who were the same in 9th grade in terms of math ability, math self-efficacy, math utility, math interest, math teacher effectiveness and experience, and socioeconomic status, students with a stronger sense of themselves as math people tend to have higher math ability in 11th grade. This final model explains about 58% of the variability in the outcome, which is only slightly more than the model with only 9th grade math identity and math ability.

As a quick follow-up, I want to see if teacher characteristics are associated with math ability in 11th grade when we control for all of the other variables in the model. I’ll test that by fitting a model that omits the teacher variables and comparing it to the model with all of the variables using an F-test.

Code
mod_red <- lm(math_2 ~ math_id_1 + math_1 + math_eff_1 + math_util_1 + math_int_1 + ses_1, data = hsls) # mod3 without the teacher variables
anova(mod_red, mod3)
Analysis of Variance Table

Model 1: math_2 ~ math_id_1 + math_1 + math_eff_1 + math_util_1 + math_int_1 + 
    ses_1
Model 2: math_2 ~ math_id_1 + math_1 + math_eff_1 + math_util_1 + math_int_1 + 
    teach_eff_1 + teach_exp_1 + ses_1
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1  10321 5443.7                                  
2  10319 5435.6  2    8.0951 7.6839 0.0004628 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This test gives us compelling evidence that at least one of the teacher variables is associated with math ability in 11th grade, controlling for the other variables (F(df1 = 2, df2 = 10,319) = 7.68, p < .001).
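As a quick sanity check, we can recompute the F-statistic by hand from the ANOVA table above; it matches up to the rounding of the printed RSS values:

Code
# F = (drop in RSS / number of variables dropped) / (full model RSS / its residual df)
((5443.7 - 5435.6) / 2) / (5435.6 / 10319)  # about 7.68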

Limitations

In addition to reporting what we found, we want to be clear on what we haven’t found. This is useful both for our audience and for our own understanding.

There are two key issues, I think. First, I dropped over half the sample when I restricted my dataset to people with complete data. This could have changed the sample in non-random ways, and it might no longer be representative of the whole population. I would want to understand why some students were missing data, and how those students might be different from the rest of the sample. And second, although I controlled for a number of possible confounding variables in the final model, I still can’t claim to have uncovered a causal association. I can say that the association I found is not due to 9th grade math ability or SES, because I controlled for those variables. However, there could be other variables that I failed to control for that are driving this association. Control can sharpen our understanding of an association, but it doesn’t guarantee that something is causal.
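For the first of these issues, a natural next step would be to compare the students I dropped to the students I kept on variables that are observed for (nearly) everyone. Here is a rough sketch, again assuming an unfiltered copy of the data called hsls_raw with the same renamed variables:

Code
# Flag complete cases in the unfiltered data, then compare the two groups
hsls_raw$complete <- complete.cases(hsls_raw)
hsls_raw %>%
  group_by(complete) %>%
  summarize(n = n(),
            mean_math_1 = mean(math_1, na.rm = TRUE)) # 9th grade math ability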

What’s next?

What’s missing from this analysis? What else would you like to learn about? Send an e-mail to joseph_mcintyre@gse.harvard.edu if you have questions or suggestions!