library(ggplot2) # for simple plots
library(dplyr) # for pipes and data management
library(kableExtra) # for pretty tables
library(gridExtra) # for arranging plots nicely
dat <- read.csv('country_data.csv')
These course notes should be used to supplement what you learn in class. They are not intended to replace course lectures and discussions. You are only responsible for material covered in class; you are not responsible for any material covered in the course notes but not in class.
As we noted in the previous chapter, an important assumption we make when doing linear regression is that the association between the predictor and the outcome is, in fact, linear. Unfortunately, there are lots of associations out there which are not linear. =( We’d still like to be able to model them using linear regression, so we need to find a way to make non-linear associations into linear associations. The tool we’re going to use for this purpose is the transformation. This is not the only way to model non-linear associations, but it’s one that we can easily add to our toolkit, so it’s worth learning.
Because we want to demonstrate a number of different transformations, we’re not going to be able to restrict ourselves to a single dataset. Therefore, as we introduce new data we’ll give some very brief context for it. We made up most of the data (except for the first example), so don’t worry too, too much about the contexts.
A transformation is a function, f, which maps one set of values to another. We give one value to the transformation, and it gives us back a transformed value. Neat!
Generally we work with non-linear transformations, which means that the association between the original values and the transformed values is not a straight line.1 Also, we typically work with increasing functions, which means that if X_1 is greater than X_2, then f(X_1) is also greater than f(X_2). This may seem a little abstract at the moment, but we hope that this unit will help make it more concrete.
There are two basic reasons to transform a variable. First, if theory suggests that the association between a pair of variables should not be linear. For example, if we were regressing test scores on the number of hours spent preparing for a test, we might expect that the association would be non-linear; the first few hours might produce big improvements in test scores, the next few hours might produce smaller gains, and hours 10 and on might not do very much at all (of course we couldn’t infer a causal association from correlational data, but we might have a causal model in mind when we formulated our research question). Since the slope of the association is non-constant (steep for the first few hours, shallow later on), we’ve theorized a non-linear association, which we might need a transformation to model. This doesn’t mean that our transformation will magically create a linear association between number of hours studied and scores, but rather that we might be able to find a transformation to apply to hours studied such that there would actually be a linear association between this transformed variable and test scores.
Second, if we observe a non-linear association between the predictor and the outcome in the sample, we might need to transform one of the variables to correct violations of the model assumptions. If we transform the correct variable(s) using the correct transformation(s), the association between the transformed variable(s) will be linear. Remember that if we use a linear model where the association is non-linear, our estimates and predictions will be systematically wrong for at least some values of the predictor, and our inferences may be misleading.
So when possible, we want to use a transformation to create a linear association between the predictor and the outcome.
This is a crucial point which is often difficult to understand at first. Although we might observe a non-linear association between the predictor and the outcome, we suspect that there might actually be a linear association between a transformed predictor and the outcome, or between the predictor and a transformed outcome. This would make the linear model appropriate for the transformed variable because we would be modeling a linear association. Our interpretations would change, as we’ll see. As we mentioned earlier, we’re not being sneaky and creating a linear association that doesn’t exist. Rather, we’re hoping that transforming a variable can reveal a linear association between the transformed variable and the other variable.
With a very few exceptions, we should avoid transforming variables when the association is already linear. If we transform variables which already have a linear association, we will generally have a non-linear association between the transformed variables. As a result, we’ll end up creating a problem where none existed before.
We’ll start with the most severe transformation of all, dichotomization. If we don’t want to make any assumptions about the association between a pair of variables and simply want to see whether or not they’re independent of each other, we can dichotomize one or both. This entails finding some cutoff, usually the sample mean or median, but sometimes a value with substantive meaning, and assigning everyone with a value above the cutoff a 1 and everyone below a 0. For example, if we’re studying the association between math and ELA MCAS scores and we want to see if there’s a positive association between the two, we could dichotomize math and ELA scores at their respective medians and use a contingency table with a \chi^2 distribution to test a null-hypothesis that scoring above the median in math was independent of scoring above the median in ELA.2 Alternately, we could dichotomize only math and use a t-test to test a null-hypothesis that students scoring above the median in math had the same mean ELA scores as students scoring below the median in math.
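To make this concrete, here’s a minimal sketch of both approaches in R, assuming a hypothetical data frame mcas with numeric columns math and ela (these names are ours, not from an actual dataset):
# dichotomize math and ELA scores at their medians (1 = above, 0 = at or below)
mcas$math_hi <- as.integer(mcas$math > median(mcas$math))
mcas$ela_hi <- as.integer(mcas$ela > median(mcas$ela))
# chi-squared test of independence on the resulting 2x2 contingency table
chisq.test(table(mcas$math_hi, mcas$ela_hi))
# t-test comparing mean ELA scores for students above vs. at-or-below the math median
t.test(ela ~ math_hi, data = mcas)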
Dichotomization is popular in many disciplines. It has the huge advantage that it makes very few assumptions. Contingency tables really only require that the observations be independent of each other, which can be addressed through proper sample design. It’s true that t-tests have a few more assumptions, but they are usually robust to violations, especially if we use robust variants of the test. Regression models assume linearity, which can be hard to justify.
On the other hand, there are also lots and lots of reasons not to dichotomize. First, you lose a lot of power. It’s much easier to find an association using linear regression than it is using contingency tables and t-tests, because you collapse all information in the dichotomized variable into “above the cutpoint” or “below the cutpoint”. Second, linear regression estimates the strength and magnitude of an association, which is usually much more interesting than the fact that an association exists or the size of a gap. Third, the results you get from dichotomization depend on the cutoff you select. Unless you have strong theoretical reasons for preferring one particular cutoff, this decision will always be somewhat arbitrary.
In this course, we generally won’t use dichotomization as a way of correcting model violations, but you’ll see it frequently in medical and public-health research.
The transformation we’re going to spend most of our time on is the logarithmic, or log, transformation. Specifically, we’ll be talking about \log_2, or “log base 2”, transformations. To understand the log transformation, we need a quick refresher on logarithms. Suppose that we can express some number x as a power of 2, such that
x = 2^p
Then
\log_2x = p
That is, \log_2x is the power to which we need to raise 2 in order to get x. For example, 32=2^5, so \log_{2}32=5. Of course, there’s no reason we need to work with base 2; another common option is 10. For example, 1,000,000 = 10^6, so \log_{10}1,000,000=6. For our examples, however, we’re going to work almost exclusively with \log_2; it turns out that our choice of base is completely arbitrary,3 and \log_2 is fairly straightforward to interpret.
x | \log_2x |
---|---|
\sfrac{1}{8} | -3 |
\sfrac{1}{4} | -2 |
\sfrac{1}{2} | -1 |
1 | 0 |
2 | 1 |
4 | 2 |
8 | 3 |
16 | 4 |
32 | 5 |
64 | 6 |
To give you a sense of how a logarithmic transformation works, Table 1 shows some values of x and \log_2x. We only present powers of 2 because they’re easier to calculate in your head;4 you can probably see that 2^4=16, so \log_{2}16=4. However, it’s possible to take a log of any positive number. For example, you can check with a calculator (or Google) that \log_{2}2.5=1.322, and that \log_{2}5=2.322. One thing worth noting is that if \log_{2}x=y, then \log_{2}2x=y+1. A doubling of x causes a one-unit increase in \log_{2}x. This is going to be important later on in this unit.
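If you want to check these values yourself, R’s log2() function (or log() with a base argument) does the arithmetic:
log2(16) # 4
log2(32) # 5
log2(2.5) # 1.321928
log(5, base = 2) # 2.321928, the same as log2(5)
log2(2 * 7) - log2(7) # doubling x increases log2(x) by exactly 1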
Another feature of logarithms you’ll want to notice is that if x is small, then small increases in x lead to relatively large increases in \log_2x. For example, if x is \sfrac{1}{8}, then increasing x by only \sfrac{1}{8} to \sfrac{1}{4} will lead to an increase of 1 in \log_2x, from -3 to -2. In contrast, if x is 32 then increasing x by 32 to 64 will lead to the same increase of 1 in \log_2x, from 5 to 6. One way to think about this is that a logarithmic transformation expands the distances between small values of x and shrinks distances between large values of x (a \sfrac{1}{8} unit difference in x becomes a 1 unit difference in \log_2x when x is small and a 32 unit difference in x becomes a 1 unit difference in \log_2x when x is large).5
So let’s see the logarithmic transformation in action. For this section, we have data from the CIA Factbook (these data are real). Each observation is a country, and the variables are the national per capita gross domestic product with purchasing power parity, which we’ll call GDP, and national life expectancies at birth, which we’ll call LEB.
mod <- dat %>% lm(le ~ gdp, .) # regress life expectancy at birth (le) on per capita GDP
dat$preds <- predict(mod) # store predicted values
dat$res <- rstandard(mod) # store standardized residuals
p1 <- dat %>% ggplot(aes(x = gdp, y = le)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_smooth(method = 'lm', se = FALSE, col = 'red') +
labs(x = 'Per capita GDP', y = 'Life expectancy at birth')
p2 <- dat %>% ggplot(aes(x = preds, y = res)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_hline(yintercept = 0, col = 'red') +
labs(x = 'Predicted life expectancy', y = 'Standardized residual')
grid.arrange(p1, p2, ncol = 2)
Figure 1 shows the scatterplot of LEB against GDP. The LOWESS curve is overlaid in purple and the line of best fit (which doesn’t fit especially well) is overlaid in red.6 Notice that the overall trend is positive; wealthier countries have higher life expectancies. However, the trend is clearly non-linear; the association is extremely steep in poor countries, and extremely shallow in wealthy countries. The line of best fit does a decent job of summarizing the overall trend, but doesn’t correctly capture the local trend for any value of the predictor (except, maybe, for countries with a per capita GDP of around $7,000).
This is theoretically reasonable. If we think that higher GDPs cause countries to have higher life expectancies because wealthier people have access to more resources for prolonging their lives, we would expect that the law of diminishing returns would start to kick in as countries become wealthier. There’s a limit to how long we can prolong our lives, and additional medical interventions rapidly become extremely expensive. So it makes sense that the association between the variables is weaker in wealthy countries than in poor countries.
One way to think about this is the following: a $1,000 increase in GDP per person can probably really extend people’s lives in a country where the per capita GDP is only $1,000 to start with. In a country where the per capita GDP is already $60,000, an additional $1,000 doesn’t really go all that far. Instead, we might think that a doubling of per capita GDP should buy about as much additional life expectancy (we can’t infer causality from our data, but we can use it to drive our hypothesis generation) whether the country is rich or poor. This is a situation which calls for a logarithmic transformation of the predictor, as we’ll see later. In fact, income and wealth are frequently associated with other variables on a logarithmic scale; when working with these variables it’s customary to always explore the possibility of a transformation.
mod <- dat %>% lm(le ~ log(gdp, 2), .)
dat$preds <- predict(mod)
dat$res <- rstandard(mod)
p1 <- ggplot(dat, aes(x = log(gdp, base = 2), y = le)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_smooth(method = 'lm', se = FALSE, col = 'red') +
labs(x = expression(log[2]~(Per~capita~GDP)), y = 'Life expectancy at birth')
p2 <- ggplot(dat, aes(x = preds, y = res)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_hline(yintercept = 0, col = 'red') +
labs(x = 'Predicted life expectancy', y = 'Standardized residual')
grid.arrange(p1, p2, ncol = 2)
When we apply a logarithmic transformation (with a base of 2) to GDP, the result is displayed in Figure 2. The association is still not perfectly linear, but it’s a lot better. There’s also some potential heteroscedasticity, with residuals being much more variable at low values of per capita GDP than at high values. However, the line of best fit is doing a much more reasonable job of summarizing the true association between GDP and LEB. In an actual analysis, we might look for additional steps we could take to improve the fit, but for this example we’re happy enough with the result. When we fit a regression to the transformed data, we obtain the following equation:
\hat{LEB}_i = -2.92 + 5.57L2GDP
One way to interpret these results is that a one-unit difference in the log-base-2 of per capita GDP is associated with a 5.57 year difference in predicted life expectancy at birth. This is exactly how we’ve interpreted regression results before, and is perfectly acceptable, if a little hard to make sense of. It’s hard to have much intuition about what a one-unit difference in the log-base-2 of per capita GDP actually is.
But notice what we pointed out before: a one-unit difference in \log_{2}x is the same as a doubling of x. So another way to interpret our results is that a doubling of per capita GDP is associated with a 5.57 year difference in predicted life expectancy at birth. Of course, the number of dollars represented by a doubling of per capita GDP depends on how high the reference per capita GDP is. A doubling of a per capita GDP of $1,000 equates to a difference of $1,000 ($1,000 vs. $2,000); a doubling of a per capita GDP of $20,000 equates to a difference of $20,000 ($20,000 vs. $40,000). The 95% confidence interval for the slope is (4.89,6.25), and so we reject H_0:\beta_{GDP}=0.
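For reference, here’s how these numbers can be pulled from a fitted model object; we’re assuming mod is the log-transformed fit from the code chunk above:
coef(mod) # intercept and slope for the log2(GDP) model
confint(mod) # 95% confidence intervals for both coefficients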
Notice that our interpretation involves conceptualizing a doubling, or two-fold difference, in the predictor. This is often a sensible thing to consider (it’s not hard to make sense of a doubling in height, or in GDP), but not always. If one person has an IQ of 75 and another has an IQ of 150, it would not be correct to say that the second person is twice as intelligent (as measured by IQ) as the first. Similarly, there is not a two-fold difference in heat between 10 degrees and 20 (unless we’re using Kelvin). Logarithmic transformations make the most sense when a variable has a meaningful 0, where a value of 0 actually indicates none of the quantity being measured. If this is not the case, a logarithmic transformation is likely to be hard to interpret.
If we want to be careful and fit the model without assuming that we have the correct functional form for the model and the correct error distribution, we can use an approach to model fitting and estimating standard errors which is robust to violations of these assumptions. When we use these techniques, we estimate the regression line to be
\hat{LEB}_i = -2.14 + 5.56L2GDP
and the confidence interval for the slope to be (5.04,6.08). These estimates are qualitatively similar to the ones we obtained using standard OLS approaches, which is encouraging. In fact, the robust 95% confidence interval is narrower than the standard one, indicating that we may have more precision than we originally thought. Regardless, we can feel confident that the true slope of the population line of best fit is close to 5.5, and that this line does an adequate job of capturing the general association between GDP and LEB.
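The notes above don’t show the exact robust procedure, but one common option (an assumption on our part, not necessarily the approach used to produce the estimates above) is to keep the OLS fit and compute heteroscedasticity-consistent standard errors with the sandwich and lmtest packages:
library(sandwich) # heteroscedasticity-consistent ("sandwich") covariance estimators
library(lmtest) # coeftest() and coefci() accept a custom covariance matrix
coeftest(mod, vcov. = vcovHC(mod, type = 'HC3')) # coefficient tests with robust SEs
coefci(mod, vcov. = vcovHC(mod, type = 'HC3')) # robust 95% confidence intervals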
The big takeaway here is that if you observe an association which is steep (usually steep and positive) at low values of the predictor and less steep at higher values, consider log-transforming the predictor (especially if the distribution of the predictor has strong positive skew). If you use a log-base-2 transformation, interpret the regression coefficient as indicating the predicted difference in the outcome associated with a two-fold difference in the predictor.
For our second example, we’ll consider a model predicting income from years of education. In this case we’re just going to make up the data. We’re doing this for two reasons: first, it’s somewhat hard to find data which fit our needs. Second, if we had to reckon with how little money our many, many years of education are actually likely to bring us, we’d be too depressed to finish this unit. ;)
Figure 3 shows a scatterplot of yearly income on years of education with an overlaid LOWESS smoothed curve, and a similar plot for the residuals from a regression of yearly income on years of education. Each point represents one person.
set.seed(12111978) # make the simulation reproducible
N <- 200
educ <- 5 + round(15 * runif(N)) # years of education, integers from 5 to 20
loginc <- rnorm(N, log(8000) + log(4000) * educ / 40, 1) # simulate income on the log scale
inc <- exp(loginc) # back-transform to dollars
incdat <- data.frame(educ, inc, loginc)
mod <- lm(inc ~ educ, incdat)
incdat$preds <- predict(mod)
incdat$res <- rstandard(mod)
p1 <- ggplot(incdat, aes(x = educ, y = inc)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_smooth(method = 'lm', se = FALSE, col = 'red') + xlab('Years of formal education') + ylab('Income ($USD)')
p2 <- ggplot(incdat, aes(x = preds, y = res)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_hline(yintercept = 0, col = 'red') + xlab('Predicted income ($USD)') + ylab('Standardized residual')
grid.arrange(p1, p2, ncol = 2)
Notice some problems with the plot. First, the LOWESS curve doesn’t appear to be a straight line; this is especially apparent in the plot of the standardized residuals. Second, the residuals are clearly not homoscedastic; residuals at high values of EDUCATION are much more variable than those at low values. Finally, the residuals are obviously not normally distributed; there are far too many large positive values, too few large negative values, and a large positive skew. Figure 4 shows a histogram of the residuals with an overlaid normal curve with a mean 0 and standard deviation of 1.
We have problems with three of our assumptions (the residuals are definitely independent, because they were generated on my computer7). Most importantly, we can probably come up with theoretical reasons why we should expect the association between income and education to be non-linear. In terms of average income, we don’t really expect there to be much of a difference between having one and five years of formal education; in either case, we would typically predict a very low income. On the other hand, the difference between a high-school diploma and a BA probably corresponds to a rather large difference in salary.
Here’s another way to think about it: when your salary is low (which is generally true for people with few years of formal education), an additional year of education probably won’t make a huge difference. On the other hand, if your salary is extremely high to begin with (which tends to be true of people who have extensive formal educations), an additional year of education might yield large benefits.
Both of these arguments suggest that the association between salary and years of education might be non-linear; a year difference in education might actually predict a certain percent increase in predicted salary. This is a very common situation for income and certain other variables. It’s rare that a predictor has a linear association with income; when a one-unit difference in a predictor is associated with a $1,000 difference in income for people making $10,000, there’s a good chance that it’s associated with a $10,000 difference in income for people making $100,000.
When we want to transform multiplicative differences (i.e., percent differences) into additive differences (i.e., the sort assumed by a linear regression model), we need a logarithmic transformation. In this case, we’re going to take the log of the outcome, yearly income (notice that income has a meaningful 0, which is what we want in variables that we’re going to log-transform). Figure 5 shows the association between the transformed outcome and the predictor, and the standardized residuals against the predicted values. Notice how the transformation has simultaneously reduced the problems of non-linearity and heteroscedasticity. We don’t present a histogram of the residuals, but they look a lot more normal after the transformation.
mod <- lm(loginc ~ educ, incdat)
incdat$preds <- predict(mod)
incdat$res <- rstandard(mod)
p1 <- ggplot(incdat, aes(x = educ, y = loginc)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_smooth(method = 'lm', se = FALSE, col = 'red') + xlab('Years of formal education') + ylab(expression(log[2]~(Income)))
p2 <- ggplot(incdat, aes(x = preds, y = res)) + geom_point(alpha = .25) +
geom_smooth(se = FALSE, col = 'purple', lty = 2) +
geom_hline(yintercept = 0, col = 'red') + xlab(expression(Predicted~log[2]~(Income))) + ylab('Standardized residual')
grid.arrange(p1, p2, ncol = 2)
The population model we hypothesized looks like this:
\log_2SALARY_i = \beta_0 + \beta_1EDUCATION_i + \varepsilon_i
The fitted model is
\widehat{\log_2SALARY}_i = 9.12 + 0.19EDUCATION_i
Both of the coefficients are statistically significant, with fairly narrow confidence intervals.
Interpretation is a little tricky. We provide a derivation in the appendix, but for now, take our word for it that a 1-year difference in education predicts a (2^{0.19}-1)\times 100\% = 14.1\% difference in salaries. In general, if we transform the outcome using \log_2, we predict a (2^{\beta_1\times k}-1)\times 100\% difference in the outcome for each k-unit difference in the predictor. If a person with x years of education has a predicted salary of y, then we predict that a person with x+1 years of education will have a salary of 114.1\% \times y = 1.141y. Figure 6 shows the line of best fit on the original scale (Figure 5 shows the line of best fit on the transformed scale). The confidence interval for the mean is shaded in grey: notice how narrow it is at small values of EDUCATION and how wide it is at high values. The reason for this is that small differences in the parameter estimates make for huge differences in the predicted outcome when we exponentiate, especially when the logged values are large.
newdat <- data.frame(educ = seq(min(educ),max(educ),.01))
newdat$preds <- exp(predict(mod, newdat)) # back-transform predictions from the log scale to dollars
preds.ci <- exp(predict(mod,newdat,interval = 'confidence'))
newdat$lower <- preds.ci[, 2]
newdat$upper <- preds.ci[, 3]
ggplot(incdat, aes(x = educ, y = inc)) +
geom_ribbon(data = newdat, mapping = aes(x = educ, ymin = lower, ymax = upper, y = NULL), alpha = .1) +
geom_point(alpha = .25) +
geom_line(data = newdat, mapping = aes(x = educ, y = preds), col = 'red') +
xlab('Years of formal education') + ylab('Income ($USD)')
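The percent conversion described above is just arithmetic on the fitted slope; for example:
(2^0.19 - 1) * 100 # ~14.1%: predicted salary difference for a one-year difference in education
(2^(0.19 * 4) - 1) * 100 # ~69%: predicted difference for a four-year difference in education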
Sometimes one logarithmic transformation is not enough. Consider Figure 7. The top-left panel displays the (simulated) association between textbook sales of selected books in 2014 and in 2015. Each point represents a single textbook, of which there are 100. The top-right panel shows what happens when we log-transform the outcome, the bottom-left shows what happens when we log-transform the predictor, and the bottom-right shows what happens when we log-transform both the predictor and the outcome.
set.seed(9281983)
logtb1 <- rnorm(100,10,4)
logtb2 <- rnorm(100,logtb1,1)
tb1 <- 2^logtb1
tb2 <- 2^logtb2
tdat <- data.frame(tb1, tb2, logtb1, logtb2)
p1 <- ggplot(tdat, aes(x = tb1, y = tb2)) + geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, col = 'skyblue3') +
xlab('Textbook sales, 2014') + ylab('Textbook sales, 2015')
p2 <- ggplot(tdat, aes(x = tb1, y = logtb2)) + geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, col = 'springgreen3') +
xlab('Textbook sales, 2014') + ylab(expression(log[2]~(Textbook~sales~2015)))
p3 <- ggplot(tdat, aes(x = logtb1, y = tb2)) + geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, col = 'plum2') +
xlab(expression(log[2]~(Textbook~sales~2014))) + ylab('Textbook sales, 2015')
p4 <- ggplot(tdat, aes(x = logtb1, y = logtb2)) + geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, col = 'lightpink3') +
xlab(expression(log[2]~(Textbook~sales~2014))) + ylab(expression(log[2]~(Textbook~sales~2015)))
grid.arrange(p1, p2, p3, p4, ncol = 2)
The first thing you should notice is that the association between textbook sales at the two timepoints appears plausibly linear, although it’s hard to see what’s happening at the low values. Log-transforming variables already in a linear relationship breaks the linear relationship, which is why we typically avoid doing so. Notice how log-transforming the outcome or the predictor makes things much worse. From a theoretical perspective, it doesn’t make sense that additive differences in 2014 should predict multiplicative differences in 2015 (log-transforming just the outcome), nor that multiplicative differences in 2014 should predict additive differences in 2015 (log-transforming just the predictor).
On the other hand, we have to conduct a transformation of some sort, because the assumptions of homoscedasticity and normality are so badly violated. Figure 8 shows the appropriate diagnostic plots; they’re hard to read because the violations are so severe. Notice how variable the residuals are at extreme predicted values, and how absurdly large some of the residuals are. Usually these assumptions are less important than linearity, but these violations are so severe that our standard errors will be wildly wrong.
mod <- lm(tb2 ~ tb1, tdat)
tdat$preds <- predict(mod)
tdat$res <- rstandard(mod)
p1 <- ggplot(tdat, aes(x = preds, y = res)) + geom_point(alpha = .25) + geom_hline(yintercept = 0, col = 'red') +
xlab('Predicted textbook sales in 2015') + ylab('Standardized residuals')
p2 <- ggplot(tdat, aes(x = res)) + geom_density() + stat_function(fun = dnorm, col = 'red')
grid.arrange(p1, p2, ncol = 2)
A theory that makes a little more sense here is that proportional differences in 2014 sales should predict proportional differences in 2015 sales. If one book sold twice as many copies as another in 2014, it makes sense to expect it to sell somewhere around twice as many copies in 2015. It also makes sense that the variability in sales should be greater for books that sold a lot, because the numbers we’re working with are so much larger. If we only predict that a book will sell 100 copies, a residual of 100 counts as a big miss. If we predict that a book will sell 100,000 copies, a residual of 100 counts as an almost perfect prediction. A model that captures both of these intuitions is one which log-transforms both the predictor and the outcome, i.e.
\log_2SALES2015 = \beta_0 + \beta_1\log_2SALES2014 + \varepsilon
Fitting this model gives us
\widehat{\log_2SALES2015} = 0.37 + 0.96\log_2SALES2014
We interpret as follows: a 1% (percent, not percentage point) difference in sales in 2014 predicts a (1.01^{0.96} - 1)\times 100\% = 0.96\% difference in sales in 2015. The appendix shows how to derive this interpretation, but generally, regardless of the base we use, a k\% difference in the predictor predicts a (1.01^{\beta_1\times k} - 1)\times 100\% difference in the outcome.
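Again, the conversion is simple arithmetic on the fitted slope:
(1.01^0.96 - 1) * 100 # ~0.96%: predicted difference in 2015 sales for a 1% difference in 2014 sales
(1.01^(0.96 * 10) - 1) * 100 # ~10%: predicted difference for a 10% difference in 2014 sales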
Figure 9 shows diagnostic plots from the fully transformed model. Things look a lot better, and suggest to us that we can trust the standard errors and regression coefficients from this model. There’s some evidence of heteroscedasticity at moderate predicted values, where it matters least, but not enough to be concerned.8
mod <- lm(logtb2 ~ logtb1, tdat)
tdat$preds <- predict(mod)
tdat$res <- rstandard(mod)
p1 <- ggplot(tdat, aes(x = preds, y = res)) + geom_point(alpha = .25) + geom_hline(yintercept = 0, col = 'red') +
xlab('Predicted textbook sales in 2015') + ylab('Standardized residuals')
p2 <- ggplot(tdat, aes(x = res)) + geom_density() + stat_function(fun = dnorm, col = 'red')
grid.arrange(p1, p2, ncol = 2)
There are a few basic principles to keep in mind when selecting a transformation. First, select transformations based on theory. Keep in mind that log-transforming a predictor suggests that multiplicative (percent) differences in the predictor are associated with additive differences in the outcome, log-transforming the outcome suggests that additive differences in the predictor are associated with multiplicative (percent) differences in the outcome, and log-transforming both the predictor and the outcome suggests that multiplicative (percent) differences in the predictor are associated with multiplicative (percent) differences in the outcome. If theory or intuition suggests that one of these should be true, that should guide your transformation. Similarly, if models in your discipline usually transform a given variable (such as log-transformations of income), you probably should too.
Second, select a transformation which fits the association between the outcome and the predictor. Draw a LOWESS curve or just eyeball the association. Does it look like some version of y = \log x? Then log-transform the predictor. Does it look like some version of y = 2^x (i.e. \log_2 y = x)? Then log-transform the outcome. Does it look like a straight line but with far too much skew? Log-transform the predictor and the outcome. If the transformation makes the assumption of linearity more plausible, it’s probably the right choice; if not, go back and try again. We haven’t given you a lot of options, so it’s not too hard to try them all.
Third, use a transformation that’s easy to interpret, assuming you need to interpret the association. That’s why we’ve been focusing on simple log-transformations. These aren’t the only transformations out there, but they’re some of the easiest to work with, and will usually be all that we need.
Fourth, if at all possible, transform the predictor rather than the outcome. Log-transforming the outcome has a lot of consequences, especially when we come to multiple regression. Sometimes it makes sense, but you’re usually better off transforming a predictor.
We need to interject a note of caution about transforming outcomes. It turns out that the rules of thumb we’ve given you for interpreting regressions of logged outcomes on the original scale don’t actually work (they’re frequently close enough to be useful, but not always). The problem is something known as Jensen’s Inequality, which holds that if f is a non-linear transformation, such as a logarithm, then \mu_{f(X)} is not equal to f(\mu_X). What this means is that if we take the mean of a number of log-transformed values and then exponentiate that value, we will not get the mean of the untransformed values. Often it will be close, but it can also be very, very different. As an illustration, suppose that our x values are .01, .1, 1, 10, and 100. The mean of these values is 22.222. If we log them (base 10), we get -2, -1, 0, 1, and 2. The mean of the logged values is 0, and when we take 10^0, we get 1. This is rather different from 22.222. The same problems arise when we try to express the slope as percent differences in the outcome associated with unit differences in the predictor by taking (2^{\beta_1}-1)\times 100\%. Again, this approach is an approximation which will frequently be approximately correct, but occasionally be badly wrong. If at all possible, avoid transforming the outcome. A slightly more complicated approach to modeling, generalized linear models, allows us to non-linearly transform the outcome in such a way that we can interpret results on the original scale in a straightforward way, but we don’t cover these methods in our class.
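You can verify the example from this paragraph in a couple of lines of R:
x <- c(0.01, 0.1, 1, 10, 100)
mean(x) # 22.222
10^mean(log10(x)) # 1, not 22.222: exponentiating the mean of the logs does not recover the mean of x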
Just to be clear, this is true whenever we apply a non-linear transformation to the outcome. It’s not specific to logarithms.
In this section, we’ll present a mathematical derivation of the interpretation of a regression with a log-transformed predictor. This derivation and the following ones can be easily extended to other bases by replacing 2 with the base you use for your transformation.
Suppose that the following model holds:
\mu_{Y|X_i} = \beta_0 + \beta_1\log_2X_i
Then consider two cases, one with X_1 = X_0 and the other with X_2 = 2X_0. Then
\mu_{Y|X_1} = \beta_0 + \beta_1\log_2X_1,
and
\mu_{Y|X_2} = \beta_0 + \beta_1\log_2(2\times X_1) = \beta_0 + \beta_1(\log_22 + \log_2X_1) = \beta_0 + \beta_1 + \beta_1\log_2X_1,
and
\mu_{Y|X_2} - \mu_{Y|X_1} = \beta_0 + \beta_1 + \beta_1\log_2X_1 - (\beta_0 + \beta_1\log_2X_1) = \beta_1,
so a two-fold difference in the predictor is associated with a \beta_1-unit difference in the expected value of the outcome. Similarly, a d-fold difference in X predicts a \log_2d\times \beta_1 unit difference in the outcome.
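As a quick numeric check, using the slope from the GDP example earlier (5.57):
log2(4) * 5.57 # 11.14: predicted LEB difference for a four-fold difference in per capita GDP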
In this section, we’ll present a mathematical derivation of the interpretation of a regression with a log-transformed outcome. We’ll also point out where the proof goes wrong. Suppose that the following model holds:
\mu_{\log_2Y|X_i} = \beta_0 + \beta_1X_i.
Then we have
2^{\mu_{\log_2Y|X_i}} = 2^{\beta_0 + \beta_1X_i}.
We might naively expect that
2^{\mu_{\log_2Y|X_i}} = \mu_{Y|X_i},
i.e. that exponentiating the mean of \log_2Y will give us the mean of Y, but this is not right. In fact, the mean of Y cannot be calculated from the mean of \log_2Y without knowing the distribution of Y, due to a mathematical fact known as Jensen’s Inequality. Notice that it is true that 2^{\log_2Y} = Y, but it’s not true that 2^{\mu_{\log_2Y}} = \mu_{Y}. Probability: it’s weird.
However, we’re going to continue as though the previous equation were true, because it’s often close enough to be useful. Then
\mu_{Y|X_i} = 2^{\beta_0 + \beta_1X_i}.
Then consider two cases, one with X_1 = X_0 and the other with X_2 = X_0 + 1. Then
\mu_{Y|X_1} = \mu_1 = 2^{\beta_0 + \beta_1X_0},
and
\mu_{Y|X_2} = \mu_2 = 2^{\beta_0 + \beta_1(X_0+1)} = 2^{\beta_0 + \beta_1(X_0) + \beta_1},
and
\frac{\mu_2}{\mu_1} = \frac{2^{\beta_0 + \beta_1(X_0) + \beta_1}}{2^{\beta_0 + \beta_1(X_0)}} = 2^{\beta_0 + \beta_1(X_0) + \beta_1 - (\beta_0 + \beta_1(X_0))} = 2^{\beta_1}.
So the ratio of expected outcomes for cases which differ by 1 on the predictor is equal to 2^{\beta_1}. Expressed as a percent difference, \mu_2 is (2^{\beta_1} - 1)\times 100\% greater than \mu_1.
Note that when we log-transform using base e and \beta_1 is small, (e^{\beta_1} - 1)\times 100\% tends to be very close to \beta_1\%, which makes e a very useful base with which to transform. But even when using base e, you should use the formula above to get the exact difference.
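For instance:
(exp(0.05) - 1) * 100 # 5.13%, close to the naive reading of 5%
(exp(0.50) - 1) * 100 # 64.9%, nowhere near 50%, so the shortcut breaks down for larger slopes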
Furthermore, if two cases differ by d on the predictor, we predict that \mu_2 will be (2^{\beta_1\times d}-1)\times 100\% larger than \mu_1.
Assume that the following model is correct:
\mu_{\log_2Y} = \beta_0 + \beta_1\log_2X.
Once again consider two cases, one with X_1 = X_0 and the other with X_2 = 1.01X_0, or one-percent larger. Then
\mu_{\log_2Y_1} = \beta_0 + \beta_1\log_2X_0,
and
\mu_{\log_2Y_2} = \beta_0 + \beta_1\log_21.01X_0 = \beta_0 + \beta_1\log_2X_0 + \beta_1\log_21.01.
Now we assume, incorrectly, that
\mu_{\log_2Y_2} - \mu_{\log_2Y_1} = \log_2\frac{\mu_{Y_2}}{\mu_{Y_1}}.
Again, it’s true that \log_2Y_2 - \log_2Y_1 = \log_2\frac{Y_2}{Y_1}, but this is only approximately true when working with means (which is why it’s the log-transformation of the outcome rather than the predictor that causes trouble; we don’t work with mean values of the predictor).
Despite that, we’ll continue to pretend that the above equation is correct, because it’s close enough to be useful. So now we have
\mu_{\log_2Y_2} - \mu_{\log_2Y_1} = \log_2\frac{\mu_{Y_2}}{\mu_{Y_1}} = \beta_0 + \beta_1\log_2(X_0) + \beta_1\log_21.01 - (\beta_0 + \beta_1\log_2(X_0)) = \beta_1\log_21.01.
So
\log_2\frac{\mu_{Y_2}}{\mu_{Y_1}} = \log_21.01^{\beta_1},
and exponentiating both sides gives us
\frac{\mu_{Y_2}}{\mu_{Y_1}} = 1.01^{\beta_1}.
In other words, for two observations which differ by 1% on X, we predict that their ratio will be 1.01^{\beta_1}. Alternately, we predict that Y_2 will be (1.01^{\beta_1}-1)\times 100\% larger than Y_1. This is true regardless of the base we use, and tends to be close to \beta_1\% as long as \beta_1 is not too large.
Furthermore, if X_2 is d\% larger than X_1, we predict that Y_2 will be (1.01^{\beta_1d}-1)\times 100\% larger than Y_1, which will be close to \beta_1d\%, as long as \beta_1d is small.
Another solution to non-linear associations is to fit a non-linear regression model. Generalized linear models, mentioned above, are extremely useful, relatively simple models which allow us to model non-linear associations. However, we can in theory fit any model we can imagine using a technique called maximum-likelihood estimation (bootstrapping is also handy here). These techniques are only appropriate asymptotically, which means that they only provide valid inference as the sample size gets extremely large (in theory approaching infinity). They’re also well beyond the scope of this course.
\log x is undefined for x\le 0. There’s no power p such that 2^p = 0, and the same is true for negative numbers.9 If the variable we want to transform includes negative values (or, much more commonly, values equal to 0), we need to be a little tricky. One easy approach is to ‘start’ the logarithm by adding some small value, often though not always 1, to every value of the variable before taking the log. If this makes all of the values positive, then we can do the log-transformation and proceed as before. However, the interpretations we gave you above will no longer hold exactly. Instead, you’ll have to simply describe the general shape of the association as best you can.
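Here’s a minimal sketch of this approach, assuming a hypothetical data frame d with a non-negative predictor x (which includes zeroes) and an outcome y:
d$log_x <- log2(d$x + 1) # 'start' the log by adding 1, so zeroes become log2(1) = 0
mod_started <- lm(y ~ log_x, data = d) # hypothetical outcome y; describe the shape rather than using the doubling interpretation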
A linear transformation is something like a transformation between temperature in Fahrenheit and temperature in Celsius (#teamkelvin), or mass measured in kilograms and mass measured in grams. Standardizing a variable is also a linear transformation; mathematically speaking these aren’t super interesting because a linear transformation of a variable doesn’t change the model you fit in a meaningful way.↩︎
Or we could dichotomize at the level the state designated proficient, and see if students who are proficient on the math MCAS are more or less likely to be proficient on the ELA MCAS than students who are not proficient on the math MCAS.↩︎
We can’t use a negative base or base 1, and it’s usually a bad idea to use a base that’s not a whole number.↩︎
Another common base, for mathematical reasons, is Euler’s number, e. e is approximately equal to 2.71828, and makes certain proofs and derivations extremely easy; however, it’s much harder to do mental math with e than with 2, so we’ll work with base-2 logarithms.↩︎
Here small means close to 0. It’s possible to take a logarithm of a negative number, but these will be complex numbers (numbers with a real part and an imaginary part), and are not something we want to see or make sense of as data analysts.↩︎
Recall that the LOWESS curve is a non-parametric smoothed estimate of the mean value of the outcome at any value of the predictor. Unlike the line of best fit, the LOWESS curve doesn’t need to be a straight line.↩︎
This is only for the interested. Random numbers generated by a computer are actually not random at all, aside, sometimes, from the very first one generated. If you knew the starting state of the algorithm used to generate random numbers, you could precisely predict all of the subsequent numbers. However, the numbers are pseudo-random, which means that they’re deterministic in a way that’s indistinguishable from true randomness, at least as far as we’re concerned. R uses a Mersenne-Twister, which is called that because it uses Mersenne primes but … twisted. Also, because it’s awesome.↩︎
It turns out that having high variance at moderate values of the predictor actually artificially inflates our standard errors, making our inferences more conservative than they need to be.↩︎
Actually, we can take the log of a negative number, but the result is a complex number which has an imaginary component. No policy-maker wants to hear that imaginary differences in dollars spent predict imaginary differences in amount learned, so that’s not something we’ll cover.↩︎