Unit 2 - Continuous Variables

Alpha:

Alpha, or \alpha, is the level of evidence that we require before rejecting a null-hypothesis. It should be established before starting an analysis, and in the social sciences, \alpha is conventionally set at .05. This is the value below which a p-value must fall before we reject a null-hypothesis. Essentially, this is the rate of Type-I errors that we are willing to accept if a null-hypothesis is true. That is, when we’re testing a null-hypothesis which is true, our probability of incorrectly rejecting that null-hypothesis is \alpha. Using the convention of \alpha = .05, when we test a null-hypothesis which is true, we accept a 5% chance of incorrectly rejecting it. The lower we set \alpha, the more conservative our findings are; setting \alpha low makes it harder to reject a null-hypothesis.

Assumptions:

Regression assumptions are assumptions that we make in order to make it possible to fit regression models and estimate standard errors. If all of the regression assumptions are perfectly met, then our methods will work correctly. This doesn’t mean that every sample will perfectly represent the population, but it does mean that our estimates and p-values will work as intended across repeated random samples. If the assumptions are not perfectly met, but are approximately correct, we can generally trust our methods. Assumptions are always about the population (or how the sampling is done), but we use the sample to see if the assumptions are plausible.

Average Causal Effect:

The average causal effect of a particular treatment on a particular outcome is defined as the difference between the sample mean value of the potential outcome under treatment and the sample mean under control. That is, it’s the difference between what would happen, on average, if units were assigned to treatment and what would happen, on average, if units were assigned to control. It’s also the average of all of the individual causal effects. It’s very rare that we can directly compute the average causal effect, and so we typically estimate it, often by comparing the mean of the outcome in a treated group to the mean in a control group.

Bar plot:

A barplot is a figure which uses the heights of bars to display quantities. For example, suppose that 70% of college graduates are employed, compared with 60% of non-grads. Then we could create two bars, one with a height of 70 and one with a height of 60. This would allow us to visually compare these quantities. In this class we’ll often use barplots to represent proportions and percentages, but there are other possible uses. For example, we could use a barplot to show differences in average math scores between girls and boys. Generally the bars represent discrete groups, like girls and boys or college grads. I’ve been describing vertical barplots, but it’s also fine to orient the barplot horizontally.

Bias:

A statistic is said to be biased if, on average, the sample estimate is different from the population parameter that it’s estimating. Sample means are unbiased because, although the sample mean will be different from the population mean in any given sample, on average across repeated random samples they will be correct. In contrast, sample standard deviations are biased because on average the sample standard deviation is lower than the population standard deviation.

Bivariate:

A bivariate analysis uses two variables. Things like contingency tables, correlations, t-tests, and scatterplots are bivariate analyses. In a bivariate analysis, we’re typically interested in how two variables are associated with each other, although the exact question will depend on the variables we’re using. For example, with a t-test we might ask if a mean math score is the same for English Language Learners and non-English Language Learners, while with a correlation we might ask what the linear association is between time spent studying for a test and test score.

Blocking:

Blocking is a technique used in randomized controlled trials to reduce variability in the outcome and to try to make it easier to estimate treatment effects. In a block randomized controlled trial, we first sort participants into groups, or “blocks”, based on variables that are expected to be related to the outcome. For example, we might form blocks based on student age and gender. Then we randomly assign people to treatment within those blocks. This ensures that the control and treatment groups will be balanced on age and gender. Even without blocking this will happen in expectation (on average) just due to the random assignment, but blocking ensures that every randomization we do will be balanced. Blocking is a valuable technique, but should only be used with variables that we expect will be related to the outcome.

Bootstrapping:

Bootstrapping is a very flexible technique for doing statistical inference, especially when the math is hard to work out (bootstrapping can also be used to correct for bias in estimators). Although there are multiple bootstraps available to us, they generally involve some sort of resampling, where we repeatedly resample observations from our existing sample to simulate repeated random sampling from the population. We use this to estimate the sampling distribution of the statistic, which we can use for hypothesis testing. Bootstrapping is popular when we don’t know what a sampling distribution looks like so we aren’t able to use standard formulas.
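
As a concrete illustration, here is a minimal sketch in Python of a percentile bootstrap for a sample mean (the data are simulated, and numpy is assumed to be available):

    import numpy as np

    rng = np.random.default_rng(42)
    sample = rng.normal(loc=50, scale=10, size=200)   # stand-in for an observed sample

    n_boot = 5000
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        # draw n observations from the sample, with replacement
        resample = rng.choice(sample, size=sample.size, replace=True)
        boot_means[b] = resample.mean()

    # the spread of boot_means approximates the sampling distribution of the mean;
    # a simple 95% percentile interval:
    lower, upper = np.percentile(boot_means, [2.5, 97.5])
    print(round(lower, 2), round(upper, 2))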

Bonferroni Correction:

A Bonferroni correction is a tool for controlling the Type-I error rate for a whole group of tests. When we conduct multiple tests (of true null-hypotheses), each one has a 5% chance of incorrectly rejecting the null-hypothesis being tested. If we want to guarantee that the probability of incorrectly rejecting any of these null-hypotheses is no more than 5%, the Bonferroni correction has us compare each p-value to \alpha/k, where k is the number of tests being conducted. Essentially we split our 5% probability across all k tests.
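
For example, a small sketch in Python with made-up p-values from k = 4 tests:

    alpha = 0.05
    p_values = [0.012, 0.030, 0.004, 0.210]     # hypothetical p-values from k = 4 tests
    k = len(p_values)
    threshold = alpha / k                       # Bonferroni-adjusted threshold: .0125

    for p in p_values:
        print(p, "reject" if p < threshold else "fail to reject")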

Box-and-Whisker Plot:

A boxplot, or box-and-whisker plot, is a figure which represents the distribution of a numeric variable by visually representing important quantiles. In a boxplot, there is a central box which extends from the 25th percentile (the first quartile) to the 75th percentile (the third quartile). Inside the box there’s a line which indicates the median value. Lines called “whiskers” extend from the ends of the box to either the most extreme values in the dataset or to 1.5 times the interquartile range beyond the ends of the box, whichever is closer. If there are any points which lie beyond the whiskers, they’re represented individually with a plus sign (+) or some other character.

Categorical:

Categorical variables are variables which are measured as categories. For example, we might measure a person’s highest level of education as a categorical variable with levels like “Less than high school”, “High school or GED”, “College degree”, and “Higher than college”. We contrast categorical variables with numeric variables, which are measured as numbers. For example, we might measure a person’s education as the number of years of education that they completed. Categorical variables can be either nominal, when the levels are simply names and have no inherent ordering, or ordinal, when there’s a natural ordering to the levels. Here, highest level of education would be ordinal, since we can order the levels from lowest to highest. In contrast, something like “favorite subject” would be nominal, since there’s no natural ordering (well, math is best). The various levels of a categorical variable should be mutually exclusive and exhaustive; every observation in the dataset should fall into exactly one category. This is a problem for variables like race, where some respondents will identify with more than one race, or will not find any of the categories accurate (e.g., people from the Middle East sometimes had a hard time finding a racial group that matched their own self-conceptions on the US Census).

Causal Effect:

A causal effect of a particular treatment on a particular outcome is defined as the difference between what that outcome would have been if units had experienced the treatment and what it would have been if they had experienced the control. See average causal effect and individual causal effect for more information.

Central Tendency:

The central tendency of a variable is a measure of what is typical of that variable, i.e., what a normal value would be. The three most common measures of central tendency are the mean, used for numeric data; the median, used for numeric and ordered categorical data; and the mode, used for any type of data.

Chi Squared:

\chi^2 (chi-squared, pronounced “kai-squared”) refers both to a test-statistic used in contingency table analyses and other situations, and to the \chi^2 distribution, which is a heavily right-skewed distribution with a minimum value of zero, no maximum value, and a mode which depends on the degrees of freedom of the distribution. In S-040, we will almost always use \chi^2 as a test-statistic for a contingency table. The degrees of freedom in a contingency table analysis are equal to the number of rows minus one multiplied by the number of columns minus one. For example, in working with a two-by-two contingency table (e.g., a table of treatment status and whether a student passed a test), the degrees of freedom would be (2 - 1)\times(2 - 1)=1. For interested students, a \chi^2 distribution with k degrees of freedom is the distribution we would obtain by taking k independent standard normal distributions, squaring each one, and adding them together. If you take a more advanced statistics class, you’ll encounter other uses for this distribution.
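
A minimal sketch in Python of the contingency-table version (the counts are invented; scipy’s chi2_contingency carries out the test):

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: treatment vs. control; columns: passed vs. did not pass (invented counts)
    table = np.array([[45, 55],
                      [30, 70]])

    # correction=False turns off the continuity correction so the result
    # matches the basic chi-squared formula
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(chi2, p, dof)                         # dof = (2 - 1) * (2 - 1) = 1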

Collinearity:

Collinearity occurs when a pair of predictors, or possibly a group of predictors, are so highly correlated that they cannot all be meaningfully included in the model. This is not a very precise definition, because it’s not a very precise idea. Collinearity can increase standard errors for our regression coefficients, and can make them hard to interpret. When the collinearity occurs between a group of predictors, we refer to it as multicollinearity.

Confidence Interval:

A confidence interval (we’ll speak specifically about 95% confidence intervals, though other intervals are possible) is an interval for a parameter which has been constructed in such a way that in 95% of the samples in which it is constructed it will contain the parameter of interest. 95% confidence intervals can be thought of as a range of plausible values for the parameter, or as the set of all possible parameter values which we would not reject a null-hypothesis about. We typically form a 95% confidence interval by taking the point estimate for a parameter and adding and subtracting roughly 2 standard errors (2 being roughly the critical value for most t-distributions). Other methods of constructing 95% confidence intervals are also possible, but we don’t cover them in this class.
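
A small sketch of the usual construction in Python (the estimate, standard error, and degrees of freedom are made up for illustration):

    from scipy import stats

    estimate = 3.2     # hypothetical point estimate (e.g., a regression coefficient)
    se = 0.9           # its estimated standard error
    df = 58            # degrees of freedom for the t-distribution

    t_crit = stats.t.ppf(0.975, df)             # roughly 2 for moderate-to-large df
    ci = (estimate - t_crit * se, estimate + t_crit * se)
    print(t_crit, ci)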

Contingency Table:

A contingency table is a cross-tabulation of one categorical variable against another. It shows the joint distribution of a pair of categorical variables. Contingency tables can be used in conducting \chi^2 tests of the independence of two categorical variables. Typically the variable we think of as a predictor is associated with the rows, and the variable we think of as the outcome is associated with the columns, though the choice of which variable is the row variable and which is the column has no statistical consequences. Suppose we had a dataset with variables sex, measuring the sex of the respondent, and college, measuring whether the respondent had graduated from college. Then we could make a contingency table of one variable against another, which would allow us to say how many respondents were both female and had graduated from college. We might use this to see if the proportion of females who graduated from college was different from the proportion of males.

Correlation:

A standardized numerical summary of the strength of the linear association between a pair of numeric variables. Pearson’s product-moment correlation, the most commonly used definition of correlation (in probability and statistics, this is the meaning of the term correlation; some applied fields use other correlations), is defined as

\rho = \frac{\Sigma(x_i - \bar{x})(y_i - \bar{y})}{n\sigma_x\sigma_y}.

It can also be written as

\rho = \frac{Cov(x, y)}{\sigma_x\sigma_y}, where Cov(x, y) is the covariance of x and y.

The correlation always lies between -1 and 1. A correlation of -1 or 1 indicates a perfect linear association between the variables, while a correlation of 0 indicates no association. Correlation has no scale and changing the units of measurement of the predictor will not change the correlation, which makes it possible to compare correlations across different contexts. That is, the correlation between people’s heights and weights measured in centimeters and kilograms is identical to the correlation between heights and weights measured in inches and pounds. This property is referred to as scale invariance.
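
A quick sketch in Python (with simulated data) showing that the formula above agrees with numpy’s built-in correlation:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    y = 0.6 * x + rng.normal(size=500)

    # covariance divided by the product of the standard deviations
    # (using the divide-by-n versions throughout, so the n's cancel)
    rho = np.mean((x - x.mean()) * (y - y.mean())) / (np.std(x) * np.std(y))

    print(rho, np.corrcoef(x, y)[0, 1])         # the two values agree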

Counterfactual:

The counterfactual is what would have happened differently if a unit had been assigned to a different treatment condition. For example, if students are assigned to a comprehensive sex education class and afterwards are measured on their understanding of consent, then the counterfactual would be what would have happened if they had not been assigned to the comprehensive sex education class.

Critical Value:

A critical value is defined as the value of a test-statistic required to reject a null-hypothesis. Typically it’s the value of the test-statistic which corresponds to a p-value of .05, since that’s our most common value of \alpha. The most common critical value we’ll encounter is equal to roughly 2. With most t-distributions, a t-statistic of 2 or -2 will result in a p-value of .05, which will lead us to reject a null-hypothesis.
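
A small sketch in Python showing where the “roughly 2” comes from; the two-sided critical value for \alpha = .05 is the 97.5th percentile of the relevant t-distribution:

    from scipy import stats

    # two-sided critical values at alpha = .05 for several degrees of freedom
    for df in [10, 30, 100, 1000]:
        print(df, round(stats.t.ppf(0.975, df), 3))
    # the values shrink toward 1.96, the Normal critical value, as df grows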

Degrees of Freedom:

Degrees of freedom represent how many values are free to vary. They show up in a couple of contexts, which is why this description is somewhat vague. They’re generally used in hypothesis testing and determine what distribution a test-statistic will follow. In a contingency table analysis, the degrees of freedom are based on the number of rows and columns in the table, and are frequently equal to one. In a t-test, they’re based on the number of observations and are generally quite large. Your software will typically compute this for you, and there’s nothing you’ll need to do.

Density:

The density function (technically the probability density function, or pdf) is essentially a function defining a probability distribution. The pdf has the property that the probability of taking a random draw from the distribution and observing a value between any two points, say a and b, is equal to the area under the pdf between a and b. Density functions are always non-negative. You can think of a pdf as something like a histogram, where higher regions indicate more likely values. We sometimes construct a density curve using values from a sample; this is essentially a smoothed histogram where the total area is equal to 1.

Dichotomization:

Dichotomization is a transformation which maps every value of X to one of two values. For example, we might take student test scores and dichotomize them by identifying every student as high scoring or low scoring. Similarly, given a variable measuring a person’s age, we might dichotomize it by identifying every person as working age (age between 18 and 65) or not working age (any other age).
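
A minimal sketch in Python (the cutoffs and values are invented for the example):

    import numpy as np

    scores = np.array([12, 55, 78, 90, 43, 66])
    high_scoring = np.where(scores >= 70, 1, 0)              # 1 = high scoring, 0 = low scoring

    ages = np.array([15, 22, 40, 67, 80, 33])
    working_age = ((ages >= 18) & (ages <= 65)).astype(int)  # 1 = working age, 0 = not
    print(high_scoring, working_age)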

Dot-plot:

A dotplot is a figure for representing a numeric variable. Dots representing each value in the dataset are placed along the x-axis. Dotplots tend to work best with small samples, since the dots quickly become overwhelming in large samples.

Effect Size:

An effect size can be defined in several ways, but the most common is the mean difference in an outcome between two groups in standard deviation units (using either the standard deviation of the outcome or the RMSE). The term effect size is a little misleading since effect sizes do not need to measure causal effects, although they often do. Effect sizes are typically used when we need to compare the sizes of various treatment effects which were measured using different outcomes. For example, if one study used SAT scores as an outcome and another used ACT scores, the differences would not be directly comparable, but the effect sizes would be much more so.

Estimate:

An estimate is a value we calculate in a sample which is intended to represent, or estimate, a population parameter. For example, when we compute a mean in a sample and think of it as representing the population mean, we refer to it as an estimate. We typically represent an estimate by putting a hat on it, so if the population mean of Y is \mu_Y, then the sample mean is \hat{\mu}_Y.

F-distribution:

An F-distribution is a distribution used for conducting F-tests. An F-distribution has two degrees of freedom: df1, which is based on the number of model parameters being tested, and df2, which is based on the sample size. As both degrees of freedom grow large, the F-distribution becomes concentrated around 1 and starts to look roughly like a Normal distribution.

F-test:

An F-test is used to test a null-hypothesis about multiple regression coefficients at a time (although in theory we could also use it to test a null-hypothesis about one coefficient). For example, we might look for evidence of race-based differences in salaries after graduating from college by regressing salary after college on a series of indicator variables for different racial groups, and then using an F-test to test a hypothesis that all of those indicators were equal to 0 (i.e., no mean differences).

Frequentist:

The frequentist approach to statistics, which we use in this class, treats the population parameters of interest as fixed, but usually unknown, quantities. For example, if we conduct a poll to determine what proportion of Americans support increasing federal funding for education, we assume that there is some “true” value out there, even though we can’t know what it is since we can’t directly observe the entire population. Our uncertainty comes from the fact that our sample is taken at random from the population, and so the values we compute in the sample will vary randomly around the true value (technically, regression analyses only assume that the residuals are random, and treat the predictor values as fixed, but don’t worry too much about this). The goal of frequentist statistics is to conduct tests which give correct results with a known probability. When a null-hypothesis is true, a frequentist wants to be able to guarantee that we will reject it with probability equal to \alpha (usually 5%). Historically, most statistical analyses have been done using frequentist approaches.

Fundamental Problem of Causal Inference:

The fundamental problem of causal inference is that, in general, we cannot observe what would have happened to a unit under more than one treatment condition. If a unit is assigned to treatment, we can’t observe what would have happened if they had been assigned to control. At the same time, if they’re assigned to control, we can’t observe what would have happened if they had been assigned to treatment. This insight turns causal inference into a missing data problem, and we generally try to address it by estimating average treatment effects.

Histogram:

A histogram is a plot which shows how one variable is distributed. It consists of a series of bars, or bins, arrayed along the x-axis. The height of each bar is proportional to the number of observations that fall into the part of the x-axis covered by the bin. The parts of the plot where the bins tend to be high represent values which are common for the variable, while parts where the bins tend to be low represent values which are uncommon.

Homoscedasticity:

Homoscedasticity means having the same (homo) variance (scedasticity). We say that residuals are homoscedastic if they have the same variance (i.e., vertical spread) at every value of the predictor. When residuals are not homoscedastic, we refer to them as heteroscedastic.

Hypothesis Test:

A hypothesis test is, fairly obviously, a test of a hypothesis. In general, we assume that the hypothesis is true, we see how inconsistent that is with the observed data usually by calculating a p-value, and then decide whether to reject the hypothesis or not. By far the most common form of a hypothesis test is a null-hypothesis test.

Identifiability:

When there is a unique set of coefficient estimates which minimizes the variance of the residuals for a regression model, the model is said to be identified. When a pair, or group, of predictors are perfectly correlated (an extreme case of multicollinearity), the model is not identified, which means that there is more than one set of coefficients which minimize the variance of the residuals.

Independence:

Two variables are independent if knowing the value of one variable tells us nothing about the value of the other. For example, if two coins are tossed, the side which comes up for one coin is independent of the side which comes up for the other. Similarly, if two students’ test scores are independent, then knowing one student’s score tells us nothing about the other student’s. This would not be the case if both students were from the same school; if we observed one student from a school who had an extremely high score, that would suggest that other students in the school also had high scores. When we conduct a contingency table analysis, say of treatment status and college graduation, our null-hypothesis that the variables are not associated can also be expressed as a null-hypothesis that the variables are independent of each other. I.e., our null-hypothesis might be that knowing a student’s treatment status tells us nothing about whether they will graduate or not; treated students are just as likely to graduate as untreated students.

Indicator variable:

An indicator variable is a variable which indicates membership in a category by giving anyone in the category a 1 for the variable and anyone not in the category a 0. Typically the name of the variable is the category being indicated. For example, if we have a variable called female, then a 1 on this variable would indicate a person who is female and a 0 would indicate someone who is not.

Individual Causal Effect:

An individual causal effect of a particular treatment on a particular outcome for a particular individual is defined as the difference between the value the individual would have had on the outcome if assigned to treatment and the value they would have had if assigned to control. These are rarely possible to calculate for any individual, but they’re important conceptually as they allow us to define the idea of a causal effect.

Interaction:

When two variables interact in a model, the association between each variable and the outcome differs based on the other variable. For example, a model might allow the association between hours worked and salary to differ based on whether a respondent has a college degree, or to allow the association between hours of instruction and achievement to differ based on the type of instruction delivered. We fit a model with an interaction by taking the two variables we wish to interact and multiplying them together, then entering them into the model. The coefficient of the interaction term has its own interpretation, but it also changes the interpretation of the coefficients of the variables involved in the interaction.
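
A sketch in Python using simulated data and the statsmodels formula interface (one of several ways to do this); the hours:college term that the formula expands to is literally the product of the two variables:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(7)
    n = 300
    df = pd.DataFrame({
        "hours": rng.uniform(20, 60, n),      # hypothetical hours worked
        "college": rng.integers(0, 2, n),     # 1 = college degree, 0 = none
    })
    # simulate salaries where the hours slope differs by college status
    df["salary"] = (20 + 0.8 * df["hours"] + 5 * df["college"]
                    + 0.4 * df["hours"] * df["college"] + rng.normal(0, 3, n))

    # "hours * college" expands to hours + college + hours:college (the product term)
    model = smf.ols("salary ~ hours * college", data=df).fit()
    print(model.params)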

Intercept:

In general, the intercept, or more specifically the y-intercept, of a function is the value the function takes on when x is equal to 0. In a linear regression equation, where

Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + ... + \beta_kX_{ki} + \varepsilon_i,

the intercept, \beta_0, is the predicted value of the outcome, Y, when all of the predictors are equal to 0. Frequently this is not a meaningful value of Y or of the predictors, which is fine. Exactly what it means for all of the predictors to be equal to 0 will depend on the context.

Interquartile Range:

A numeric measure of spread which can be calculated for a numeric variable. The interquartile range (IQR) is the distance between the first and third quartiles of a distribution. The IQR is highly robust to extreme values, but is not as popular as the standard deviation as a measure of spread. When reporting the IQR, we typically report the endpoints as well as the range itself.

Intervention:

An intervention or treatment is a set of experiences to which a researcher can assign a participant. This could be something like a new curriculum, an offer of admission to a new school, cash payments, or any number of other things. Note, however, that the intervention is what the researcher (or some other actor or organization external to the participant) can control. For example, a researcher cannot assign a participant to a particular gender or race, so these are not interventions. Similarly, a researcher might be able to assign students to receive an offer to attend a new school, but will generally not be able to assign a student to attend the school, since that’s a decision that the student must make for themselves.

Kernel Density Plot:

A kernel density plot is a density plot formed by essentially stacking normal distributions on top of each other, one centered at each observed value of the variable in the dataset, and adding them up. These are smoothed histograms, which give us a sense of where values in the population might lie.

Line of Best Fit:

The line of best fit, or regression line, is the line which best captures the general trend of the association between the outcome and the predictor. Technically, it’s the line which minimizes the variance of the residuals, or the line which connects the conditional mean values of Y at each value of X. These definitions are precise but somewhat difficult to think about, but you can reasonably think of it as the line which best captures the general trend. When we have more than one predictor, the idea of a line of best fit generalizes to a plane of best fit, or even a hyperplane of best fit. When the association between the predictor and the outcome is not linear, we might prefer a curve of best fit. In general, when doing regression we’re trying to find a mathematical structure which best captures the general association between the outcome and the predictor.

Linearity:

The assumption of linearity is an assumption that the conditional mean values of the outcome at each value of the predictor can be connected with a straight line. That is, it’s an assumption that the general association between the outcome and the predictor really is a straight line.

Logarithmic Transformation:

The function f(b) = log_a(b) asks what power the base a must be raised to in order to get the value b. For example, log_2(8) = 3 because 2^3 = 8. Logarithmic transformations have the effect of stretching out distances between lower values of a variable and compressing distances at large values. They are popular with data analysts because they often fix commonly encountered non-linear associations, and they give coefficients which are fairly easy to interpret. Logarithmic transformations are commonly used when a variable has extreme right skew (and has values which are close to 0), when a variable only has positive values (we can’t take the log of 0 or a negative number), and when differences in a variable make more sense as percentages than on the raw scale (for example, when comparing salaries between people with and without college degrees, it might make more sense to say that people with college degrees have salaries which are X% higher, on average, than people without than to say that their salaries are $X higher).
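
A tiny sketch in Python (with invented income values) illustrating both the definition and why equal multiplicative jumps become equal gaps on the log scale:

    import numpy as np

    print(np.log2(8))                              # 3.0, because 2**3 = 8

    incomes = np.array([20_000, 40_000, 80_000])   # each value doubles
    print(np.diff(np.log(incomes)).round(3))       # equal gaps: doubling adds a constant on the log scale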

Main Effects:

In a model in which X_1 and X_2 interact, the coefficients of X_1 and X_2 are referred to as the main effects of the variables. The main effects are generally interpreted as typical regression coefficients, but when the other variable is fixed at 0. For example, if we were to fit the model

SALARY_i = \beta_0 + \beta_1HOURS_i + \beta_2COLLEGE_i + \beta_3HOURS_i\times COLLEGE_i + \beta_4MALE_i + \varepsilon_i,

where SALARY is a person’s salary, HOURS is the number of hours a person works, COLLEGE is an indicator for having a college degree, and MALE is an indicator for being male, then we would interpret \beta_1 as the difference in salary associated with a one-hour difference in hours worked for people without a college degree (COLLEGE_i = 0), controlling for sex. We would interpret \beta_2 as the mean difference in salaries between people with and without college degrees for people who work 0 hours (HOURS_i = 0). And finally, we would interpret \beta_4 as the mean difference in salary between males and females (technically males and non-males, but in most datasets that will be males and females), controlling for hours worked and having a college degree. Notice that the interpretation of \beta_4 is unchanged, because MALE is not involved in any interactions.

Mean:

The mean is a measure of central tendency of a numeric variable. It is the average value of a variable. We write the sample mean of X as \bar{x} or \hat{\mu}_X; we write the population mean as \mu_X. The mean is the most popular measure of central tendency in statistics. It is not as robust to unusual observations as the median (the mean can vary dramatically based on one or two unusual observations), but has many attractive mathematical properties. Computing the mean requires us to add up a number of different values, so it is only defined for numeric variables.

Median:

The median is a measure of central tendency. The median is equal to the middle value of a variable or, if there are an even number of values, halfway between the two middle values. The median is robust to extreme values, and is often used as a measure of central tendency in skewed distributions. However, the mean is a much more popular measure of central tendency for a variety of reasons. Medians can be used with variables which are numeric or ordinal, for example, highest level of education.

Mode:

The mode is a measure of central tendency. The mode of a variable is its most common value or, with numeric variables, any distinct peak in a distribution. When working with numeric variables, we typically aren’t interested in the exact value of the mode or modes, but more the general region where they lie. Note also that a “distinct” peak does not have to be the very highest peak, it just has to be clearly distinct from the rest of the distribution. The mode is easy to understand but is rarely used in statistics because it has a complex sampling distribution. The mode can be used with any type of data.

Model:

In statistics, a model is a mathematical description of the general relationship between an outcome and a predictor. It expresses the basic shape of the association (e.g., which variables are involved, and whether the association is a line, a parabola, or something else) and the probability distribution for the residuals. In this class, all of the models we consider are linear regression models, and the only decisions we need to make are which variables to include.

Model Matrix:

A model matrix is a very technical concept in regression consisting of a matrix representing all of the data about the predictors. Each row represents a single observation, and each column represents a variable. There’s also a column consisting of a vector of 1’s representing the intercept. We estimate the line of best fit by manipulating the model matrix and a column vector representing the outcome. The model matrix can be used to calculate the column space of a model, which represents the space of all possible predictions the model could make. This is also a fairly complex idea, and we won’t dwell on it.

Multiple Linear Regression:

Multiple linear regression is linear regression which involves more than one predictor. In the special case where there are exactly two predictors, multiple linear regression is finding the plane of best fit which minimizes the variance of the residuals. In cases with more than two predictors, it’s finding a hyperplane of best fit which is a perfectly sensible mathematical object, even if it’s hard to imagine. In addition to improving the accuracy of our predictions, multiple linear regression allows us to isolate the association between one variable and the outcome while holding constant, or controlling for, other variables.

Nominal:

A nominal variable is a categorical variable where the categories have no ordering. We can use these variables as predictors in regressions by transforming them into (sets of) indicator variables. They’re also used in contingency table analyses.

Normal Distribution:

The Normal distribution is a probability distribution for numeric variables. It’s a bell-shaped distribution with a central peak and tails which quickly drop off to 0. The Normal distribution is characterized by its mean and standard deviation; knowing these two things tells us everything about the distribution. The Normal distribution is extremely popular in statistics for a few reasons. First, it’s very tractable mathematically, making it easy to work with in proofs. Second, there are lots of things in the real world that are approximately Normally distributed. Normal distributions arise whenever a variable is a sum of a number of independent variables, and that describes many actual variables. And finally, Normal distributions show up all over the place in the distributions of statistics; for example, sample means and regression coefficients turn out to be formed by taking sums and means of independent variables, and so they approximately follow Normal distributions.

Null Hypothesis:

A null hypothesis is a hypothesis we make about a population, usually that there is no association between a pair of variables. We make null-hypotheses not because we believe them, but because they allow us to conduct tests which can demonstrate evidence of an association. In fact, we usually conduct null-hypothesis tests when we want to show that the null-hypothesis is false. If we find evidence against a null-hypothesis, when the p-value for the test is low, we reject it and conclude that there is an association. If we do not find evidence against a null-hypothesis, when the p-value is high, then we fail to reject it and conclude that there is not enough evidence to say that there’s an association. This can be hard to remember, but rejecting a null-hypothesis means finding that there is an association, and failing to reject a null-hypothesis means not finding an association.

Null Population:

A null-population is a hypothetical population for which a given null-hypothesis is true. A null-hypothesis test is essentially asking whether it is plausible that the actual population is a null-population. If we have a null-hypothesis that the number of hours spent studying for a test is not correlated with a student’s exam score, then the null population would be a (purely hypothetical) population of students for which the correlation was actually equal to 0.

Numeric:

A numeric variable is a variable measured with values where the distances between the values are actually meaningful. Another term for a numeric variable is an interval-level variable. Another term for this property, which is incorrect but commonly used, is continuous. We require variables to be numeric before using them in analyses which require us to add or subtract the values, since operations like addition and subtraction are only defined for numbers. Specifically, means, t-tests, correlations, and regressions require numeric variables.

Ordinal:

An ordinal variable is a categorical variable for which the categories have a proper ordering; it should be possible to order the values from smallest to largest. Medians require variables to be ordinal or numeric. Analyses sometimes treat ordinal variables as numeric, even though this isn’t quite right.

Ordinary Least Squares:

Ordinary Least Squares, or OLS, regression is an approach to fitting regression models which attempts to minimize the variance of the residuals/the sum of squared residuals. Although there are other ways to fit regression models, OLS regression is popular because it’s mathematically simple and has certain desirable properties, such as unbiased coefficient estimates. When people talk about fitting regressions, this is usually what they mean.

Outcome:

In a statistical model, the outcome is the variable that we’re trying to predict (although usually our goal is not actually prediction, we’re just using this as a framework for getting slope estimates which allow us to quantify associations). The term outcome is slightly misleading, since we are not assuming that the outcome is caused by the predictors. Other disciplines have other names for the outcome, including response variable and dependent variable.

P-Value:

The p-value is the probability of observing a test-statistic as extreme as, or more extreme than, the observed statistic if some hypothesis, H0, is true. The p-value is not the probability that the null-hypothesis is true. p-values are usually obtained by comparing an estimate to a null-hypothesized value, dividing by the estimated standard error of the estimate, and comparing the result to a t-distribution (or sometimes a \chi^2 or F distribution). Smaller p-values indicate more evidence against the null-hypothesis. In the social sciences, convention dictates that a p-value below .05 is sufficient evidence to reject a null-hypothesis, because if the null-hypothesis were true, the observed value would be a 1-in-20 event.

Parameter:

A parameter is a value in the population, and generally the thing we’re trying to estimate using our sample. For example, if we’re trying to estimate the association between years of education and income, we start by assuming that there is some real value in the population, and then use our sample to obtain an estimate of what that value is. Typically the population parameter is the thing we’re interested in, and our sample statistics are mostly useful as estimates of those parameters.

Percent:

The percent of observations falling into a category is equal to the number of observations falling into the category divided by the total number of observations (the proportion) and multiplied by 100. But I’ll bet you knew that.

Percentile:

The nth percentile is the value of a variable below which n% of the values fall. For example, the 25th percentile, also called the first quartile, is the value of a variable below which 25% of the values fall.

Permutation Distribution:

A permutation distribution is a distribution which is generated by randomly permuting (rearranging) information in a dataset (frequently a column of treatment assignments), calculating a statistic in each permuted dataset, and finding the distribution of the statistic.

Permutation Test:

A permutation test is a test which is conducted by comparing an observed statistic to its permutation distribution. These are usually used in the context of randomized controlled trials where we randomly permute treatment assignment and see if the actual estimated average treatment effect is unusual compared to the permutation distribution.
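
A minimal sketch of a permutation test in Python, using simulated data with a built-in treatment effect:

    import numpy as np

    rng = np.random.default_rng(3)
    outcome = rng.normal(50, 10, 100)
    treated = np.repeat([1, 0], 50)              # 50 treated, 50 control (hypothetical)
    outcome[treated == 1] += 4                   # build in a treatment effect of 4

    observed = outcome[treated == 1].mean() - outcome[treated == 0].mean()

    n_perm = 5000
    perm_diffs = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(treated)      # randomly re-assign treatment labels
        perm_diffs[i] = outcome[shuffled == 1].mean() - outcome[shuffled == 0].mean()

    # two-sided p-value: how often is a permuted difference as extreme as the observed one?
    p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
    print(observed, p_value)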

Population:

The population is the (sometimes hypothetical) group that we’re taking a sample from. In general in research we’re interested in learning about the population, and we use the sample to make inferences about the population. For example, the GSS is a sample of roughly 2,000 Americans; the population the GSS is trying to sample from is the set of all non-institutionalized people residing in America who are at least 18 years of age and who speak English.

Potential Outcome:

A potential outcome under a particular treatment condition is the value of that outcome that would be observed if the participant had experienced that condition. For example, suppose a student might either participate in a summer math camp or not. Their potential outcomes on a math test at the end of the summer would be whatever values they would have if they had participated in the math camp and if they had not. Although only one of these potential outcomes will be observable (since the student will either participate in the math camp or not), both of them can be defined.

Power:

The power of an analysis is the probability that we correctly reject a null-hypothesis which is, in fact, false. Power is a function of sample size, the size of the violation of the null-hypothesis, and the pre-established \alpha. The larger the sample, the larger the violation, and the larger the \alpha, the greater the power. Power is often referred to as 1 - \beta, where \beta is the probability of failing to reject a false null-hypothesis, or the type-2 error rate.

Predictor:

In a statistical model, a predictor is a variable used to predict an outcome. Other disciplines have their own terms for these, including independent variable or feature (which I think is only really used in machine learning).

Probability Distribution:

A probability distribution is a function which describes how likely different values are for a random variable. The probability that a random variable following a given probability distribution will fall between two points, a and b, is equal to the area under the curve of the distribution between those points. The higher the value of the distribution at a given value, the more likely the random variable will be equal to that value. The total area under the curve described by a probability distribution is equal to 1.

Proportion:

The proportion of observations falling into a category is equal to the number of observations falling into the category divided by the total number of observations. It is functionally equivalent to a percent, which can be obtained by multiplying the proportion by 100, but proportions are much more frequently used in statistics.

Prototypical Line:

A prototypical line shows how an outcome is associated with one predictor in a multiple regression model by holding all of the predictors but one constant at prototypical values, chosen to represent typical values in the dataset, and plotting the resulting line. A prototypical plot can consist of one or more prototypical lines; if it has several, we select multiple combinations of prototypical values for the other variables.

Quadratic:

A quadratic function is any function of the form Y = ax^2 + bx + c where a\neq0. In a regression model, a quadratic association is one where one of the variables is entered as a squared term (e.g., VARIABLE^2). In general, when we enter the square of a variable, we should also enter the variable itself, as in

Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2+\varepsilon_i.

Quadratic terms are especially useful when we see evidence of a non-monotonic association between the outcome and the predictor, where the association is, e.g., first rising and then falling, or first falling and then rising. The coefficients of a variable involved in a quadratic term tend to be hard to make sense of, although the interpretations of the other variables in the model are unchanged.

Quartile:

The quartiles of a variable are the 25th, 50th, and 75th percentiles, which generally divide the variable into four equal-sized groups.

R^2:

The R^2 for a regression model is equal to the proportion of variability in the outcome which is predicted by the model. For example, if the R^2 is .40, then the model predicts 40% of the variability in the outcome. This does not mean that the model correctly predicted 40% of the outcome values and failed for the other 60%; it means that the variance of the residuals is 40% lower than the variance of the outcome. R^2 is, surprisingly, also equal to the square of the correlation between the model predictions and the outcome. That is, R^2 = \hat{\rho}^2_{y,\hat{y}}. The sample R^2 is a biased estimator of the population R^2; on average the sample R^2 will be larger than the population R^2, though typically only slightly. Most software will report an adjusted R^2, which is a less biased version of R^2. However, conventionally we report R^2 rather than the adjusted R^2.
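
A sketch in Python (with simulated data) showing the two equivalent ways of computing R^2 described above:

    import numpy as np

    rng = np.random.default_rng(11)
    x = rng.normal(size=400)
    y = 2 + 1.5 * x + rng.normal(0, 2, size=400)

    # simple least-squares fit and predictions
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = intercept + slope * x
    residuals = y - y_hat

    r2_from_variances = 1 - residuals.var() / y.var()        # proportion of variance predicted
    r2_from_correlation = np.corrcoef(y, y_hat)[0, 1] ** 2   # squared correlation of y and y-hat
    print(r2_from_variances, r2_from_correlation)            # identical up to rounding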

Range:

The range is a measure of spread, and is defined as the distance from the largest value of a variable to the smallest. In reporting the range, we typically also report the smallest and largest values. Ranges are rarely interesting since they’re extremely sensitive to small differences in the sample; they’re computed from only two points. More interesting is the theoretical range of a variable, which tells us the values that a variable can theoretically take on. For example, we might report that a scale ranges from one to five.

Reference Category:

In a regression model with a categorical predictor, we represent all but one of the levels of the categorical predictor using indicator variables, and we leave one out of the model (e.g., if everyone in a dataset is a college grad or a non-grad, we would only enter one variable, maybe grad, into the model). The category represented by the variable which is left out is called the reference category. The coefficient of each indicator variable will represent the mean difference between the group being indicated and the reference category. The category we make into the reference category is mostly arbitrary.

Residual:

The difference between an observed value and a predicted value. The residual represents the extent to which a statistical model has “missed” for any case. Residuals appear in multiple contexts in statistics, including in contingency table analyses and regressions. Large residuals indicate that the model has missed a prediction by a substantial amount, and provide evidence against the null-hypothesis in contingency tables. Standardized residuals are obtained by taking the raw residuals and dividing by the (estimated) standard deviation. This gives them a standard deviation near 1. Residuals are used to estimate how precise our coefficient estimates are.

RMSE:

The Root Mean Squared Error, or RMSE, is the standard deviation of the residuals of an outcome. It’s measured on the same scale as the outcome itself. The smaller it is, the more accurate our predictions were. The RMSE is mostly used as one part of the formula for estimating the standard errors for a model. In general small values of the RMSE tend to lead to smaller standard errors, and therefore more precision. This is one reason people use multiple regression, since including multiple predictors tends to reduce the RMSE.

Robust:

A method is robust to an assumption violation if it continues to work correctly even in the presence of the violation. For example, OLS estimation is robust to moderate violations of the assumption of normality, as long as the sample size is reasonably large.

We can also refer to a statistic as robust to extreme values if it tends not to vary too much based on individual values. For example, the median is robust to extreme values because one extremely large or small value won’t change the median too much; in contrast, one very large value can dramatically change the mean, so it is not robust to extreme values.

Sample:

A sample is a group of units selected at random from a population. What “at random” means can vary depending on the type of statistics we’re doing, but in this class we assume that researchers had access to a list of all the units in the population and chose the sample by picking names at random. In general we take samples because we can use them to make inference about the population; things that we estimate in samples, such as means and regression coefficients, tend to be close to the population values (e.g., the mean in the population), and we can use statistics to quantify our uncertainty. Larger samples tend to produce more precise estimates of population values. The sample does not need to be large relative to the size of the population; a sample of 2,000 Americans can give us good estimates about the population of all Americans, even though 2,000 is a tiny fraction of the whole population. What’s more important is that the sample is truly taken at random; selecting 100,000 Americans by asking graduates of elite universities to participate in our research is unlikely to give us good estimates about the population of all Americans.

Sampling Distribution:

The sampling distribution of a statistic is the probability distribution of that statistic across repeated, independent random samples of a constant size from a given population. For example, if we took repeated random samples of 100 Americans and calculated the proportion of the sample with a college degree, the distribution of the proportions would be a sampling distribution.
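
A sketch in Python simulating such a sampling distribution (the population proportion of .35 is made up):

    import numpy as np

    rng = np.random.default_rng(5)
    true_proportion = 0.35       # hypothetical population proportion with a college degree
    n = 100                      # size of each sample
    n_samples = 10_000           # number of repeated samples to draw

    # each sample: n draws of 0/1, summarized as the sample proportion
    sample_props = rng.binomial(n, true_proportion, size=n_samples) / n

    # the distribution of sample_props approximates the sampling distribution;
    # its standard deviation is the standard error of the proportion
    print(sample_props.mean(), sample_props.std())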

Scale Dependence:

A statistic is said to be scale dependent if its value depends on the scale on which a variable is measured. For example, suppose we have to choose whether to measure age in years or in months. Of course, there is no substantive difference between 12 months and one year, but some statistics, like regression coefficients, will change in value. Other statistics, like correlations, do not depend on the scales on which the variables are measured and will not change. Correlations are said to be scale-invariant, or scale-independent. This can make it easier to compare correlations across different contexts, since in some sense a particular value of a correlation means the same thing regardless of context.

Scatterplot:

A scatterplot is a visual display of the relationship between a pair of variables. In a scatterplot, one variable is assigned to the x-axis and one to the y-axis. Observations are plotted at points (x, y), using their values on the x and y variables. When we construct a scatterplot, we typically speak of plotting y on (or against) x. Scatterplots are most useful when both variables are continuous, when there are no ceiling or floor effects, and when samples are somewhat small. They can be used to detect the shape of an association between a pair of variables or for checking model assumptions, and are a handy way of displaying data to most audiences.

Skew:

The skew of a variable is the extent to which one side of the distribution (positive or negative) tends to have more extreme values. Skew actually has a very precise definition but we’re omitting it here. For example, in many samples there are lots of people with incomes near a typical value, but a long tail leading off to the right. These distributions are right-skewed because the extreme positive values are much more extreme than the extreme negative values. In contrast, when asking people about their sense of well-being, we often find most people reporting a high sense of well-being with a few people reporting very low values. These distributions are left-skewed.

Simple Linear Regression:

Simple linear regression is regression with exactly one predictor.
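
A minimal sketch in Python with simulated data; np.polyfit with degree 1 finds the slope and intercept that minimize the sum of squared residuals:

    import numpy as np

    rng = np.random.default_rng(9)
    hours = rng.uniform(0, 10, 150)                    # hypothetical hours spent studying
    score = 60 + 3 * hours + rng.normal(0, 5, 150)     # hypothetical test scores

    slope, intercept = np.polyfit(hours, score, 1)     # ordinary least squares fit
    print(slope, intercept)                            # close to 3 and 60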

Slope:

The slope of a line is essentially its steepness. It’s defined as the difference in y per unit difference in x. When interpreting the slope of a regression model, we’ll typically talk about the slope as the difference in the outcome associated with a one-unit difference in the predictor. The idea of a slope is a little more complex in a plane or hyperplane, but still very similar. We can talk about the slope of a plane in one particular direction as the slope of a line which runs in that direction. In particular, we can interpret a slope coefficient of a model as the difference in the outcome associated with a unit difference in the predictor while holding constant, or controlling for, the other variable or variables.

Standardized Residuals:

Standardized residuals are basically residuals that have been standardized by taking each residual and dividing by the standard deviation of the residuals. This isn’t quite right, but it’s close enough for our purposes.

Statistical Control:

Statistical control is a technique in which we use a statistical model to estimate how differences in one variable are associated with differences in another variable while holding constant some other variable which would otherwise covary with the two variables. It’s important to note that, despite our name for the technique, we’re not actually controlling anything outside of the model, and the claims we make require us to trust our model. It’s not the same thing as the control achieved in a scientific experiment. Statistical control also does not give us the ability to make causal claims about the associations we estimate, although in some cases it can make those causal claims more credible (we won’t be discussing those sorts of models in this course).

Statistical Significance:

When we reject a null-hypothesis about an association, we say that the association is statistically significant. This generally means that we have sufficiently compelling reason to believe that the association exists in the population from which the sample was taken.

Standard Deviation:

The standard deviation is a numerical summary of the spread of a variable. The standard deviation can only be used with variables which are measured on a numeric scale. The standard deviation is defined as

\sigma_y = \sqrt{\frac{\Sigma(y_i - \bar{y})^2}{n}}

In a sample, we estimate the population standard deviation using

\hat{\sigma}_y = s_y = \sqrt{\frac{\Sigma(y_i - \bar{y})^2}{n-1}}

Higher values of the standard deviation indicate that the variable is more spread out around its mean. The standard deviation is on the scale of the variable being summarized.
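
In Python (a small sketch with made-up values), the two versions correspond to dividing by n versus n - 1:

    import numpy as np

    y = np.array([4.0, 7.0, 6.0, 9.0, 4.0])

    pop_sd = np.sqrt(np.sum((y - y.mean()) ** 2) / len(y))           # divide by n
    sample_sd = np.sqrt(np.sum((y - y.mean()) ** 2) / (len(y) - 1))  # divide by n - 1

    print(pop_sd, np.std(y))                # np.std defaults to the divide-by-n version
    print(sample_sd, np.std(y, ddof=1))     # ddof=1 gives the usual sample estimate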

Standard Error:

The standard error is the standard deviation of a sampling distribution across repeated independent samples of the same size from the same population. Essentially, we imagine that we take an infinite (or at least very large) number of samples from a population, and calculate the statistic of interest in each one. The standard error tells us how spread out these estimates are. If the standard error is small, sample estimates will tend to be close to the population values. If it’s large, sample estimates will tend to be spread out and far from the population value. Frequently an estimate divided by its estimated standard error will follow a t-distribution and can be used to do inference. With a very few technical exceptions, any statistic which can be calculated from a sample has a standard error, though we’re usually interested in standard errors for means and regression coefficients.

Standardization:

Standardization is a tool used to give different variables a consistent scale. To standardize a variable, we take each observation, subtract the mean (the sample mean or the population mean) and then divide by the standard deviation (again, sample or population). The formula is

x^{std} = \frac{x - \hat{\mu}_x}{\hat{\sigma}_x}

After standardization, the new variable will have a mean of 0 and a standard deviation of 1. This means that the new scores will measure how many standard deviations an observation is from the mean. Standardization is typically used when a variable is measured on an unfamiliar or arbitrary scale, or when we need to make regression coefficients more directly comparable.
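
A small sketch in Python (with made-up scores):

    import numpy as np

    x = np.array([52.0, 61.0, 47.0, 70.0, 58.0])
    x_std = (x - x.mean()) / x.std(ddof=1)       # subtract the mean, divide by the sample SD

    print(round(float(x_std.mean()), 10), x_std.std(ddof=1))   # mean 0, standard deviation 1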

Stochastic:

The stochastic part of a statistical model is the part of the outcome which is random, or at least random with respect to the model. In the models we cover in this class, this is the residual.

Structural:

The structural part of a statistical model is the part of the outcome which is predicted. This is typically the part of the model which is most interesting to us, and is the part which describes how the outcome and predictors are associated.

T-Distribution:

The t-distribution is the distribution followed by t-statistics. In fact, there are infinitely many t-distributions corresponding to the degrees of freedom. All t-distributions have a single central peak centered at 0 and symmetric tails going to positive and negative infinity. They’re similar to the Normal distribution, but they tend to have a sharper peak near 0 and heavier tails. As the degrees of freedom get larger, t-distributions get more and more similar to a Normal distribution. A t-distribution with about 50 degrees of freedom (generally meaning a sample size of just over 50 people) is almost identical to a Normal distribution.

Test Statistic:

A test-statistic is any value which can be calculated from a sample which has a known distribution under the null-hypothesis being tested. For example, if we assume that some variable Y follows a Normal distribution with mean 0 and variance unknown, and we take a sample of size n, then \bar{Y}/\hat{se}(\bar{Y}) (the sample mean divided by its estimated standard error) is a test-statistic, which is known to be distributed t_{n-1}. Test statistics can be used to conduct null-hypothesis tests by comparing an observed test statistic to the distribution of the test-statistic which is implied by the null-hypothesis.

Transformation:

A transformation is any function which changes the values of a variable. We distinguish between linear transformations, where f(x) = ax + b (this is like converting between Celsius and Fahrenheit, or between dollars and pesos) and non-linear transformations, where this is not true. Linear transformations only rescale the variable and don’t change it in a meaningful way, while non-linear transformations can result in a completely different way of measuring a variable. Only non-linear transformations can address violations of the assumption of linearity. The transformations we consider in this class are logarithmic transformations and polynomials.

Type I Error:

A type I error occurs when a null-hypothesis is true (there is no association) but we reject it (we conclude that there is an association). That is, when we think we found something which doesn’t exist. We control the probability of committing a type I error with \alpha. When a null-hypothesis is true, our probability of incorrectly finding an association is equal to \alpha (usually .05).

Type II Error:

A type II error occurs when a null-hypothesis is false (there is an association) but we fail to reject it (we do not find evidence of the association). That is, when we conclude that there is no evidence of an association but in fact there is one (technically that’s not really an error, since we may be correct that there is no evidence of an association even if the association really exists). When a null-hypothesis is false, the probability of committing a type II error depends on a number of factors, including the sample size (large samples decrease the error rate), the value of \alpha (high values of \alpha decrease the type II error rate), and the size of the association in the population (when the association is large, the type II error rate goes down).

Uniform Distribution:

A uniform distribution is a probability distribution where every real number between two endpoints is equally likely. The most common uniform distribution is Uniform(0, 1), which is a distribution where any real value between 0 and 1 is equally likely, and no other value is possible. In graphical form, a uniform distribution looks like a rectangle with an area of 1. The most common uniform distribution we encounter is the distribution of p-values when a null-hypothesis is true.

Univariate:

A univariate statistic (or plot) looks at a single variable at a time. Since most of our research questions are about how variables are associated with each other, we don’t do all that many univariate analyses. For example, the proportion of Americans who have a college degree or the average years of education completed by Chinese citizens would be univariate statistics.

Variance:

The variance is a numerical summary of the spread of a variable. It is the square of the standard deviation. Certain results in mathematical statistics are easier to prove using variances than standard deviations, and the variance is useful in describing R^2. However, the variance is a less intuitively interpretable statistic than the standard deviation.