# We introduced multiple regression using two variables from a "Big Five"
# personality inventory (Neuroticism and Agreeableness) and a Depression score.

> bigfive <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/105/data/BigFive.csv")
> head(bigfive)
  Neuroticism Agreeableness Depression
1          66            79         15
2          11            79          6
3           7            87         15
4          71            17         17
5          87            79         33
6          60            92         11

# Before we use linear methods, we should assess whether the relationships
# appear linear. (I'm not taking the time to do pretty scatterplots now, but
# we'll do an example of that next class.)

> par(pin=c(6,4))
> attach(bigfive)
> plot(Agreeableness, Neuroticism)
> plot(Depression, Neuroticism)

# The relationships appear somewhat noisy, but there is no compelling
# evidence of a nonlinear relationship.

# We could use each predictor separately to try to understand the conditional
# mean of Neuroticism:

> lm(Neuroticism~Agreeableness)

Call:
lm(formula = Neuroticism ~ Agreeableness)

Coefficients:
  (Intercept)  Agreeableness
      88.1229        -0.6496

> lm(Neuroticism~Depression)

Call:
lm(formula = Neuroticism ~ Depression)

Coefficients:
(Intercept)   Depression
     25.610        1.368

# But what we really want is to understand how the two predictors
# work in tandem. That's what multiple regression looks at. It's easy
# to estimate the multiple regression using R's lm() function: just add
# another predictor, separated by a '+' sign:

> lm(Neuroticism~Agreeableness+Depression)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# Note that the slopes are not the same as they were in the simple
# regressions. That's because the two predictors are correlated, so
# there is overlap in the part of Neuroticism that they can help
# us understand:

> cor(Agreeableness,Depression)
[1] -0.3905633

# There is only one circumstance in which the simple regression
# slopes are identical to the multiple regression slopes.
# (The demo data set used to illustrate it isn't created in this transcript.)

> attach(demo)
> lm(y~x1+x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
         10           10           -5

> lm(y~x1)

Call:
lm(formula = y ~ x1)

Coefficients:
(Intercept)           x1
      -17.5         10.0

> lm(y~x2)

Call:
lm(formula = y ~ x2)

Coefficients:
(Intercept)           x2
         60           -5

# That's when the predictors are perfectly uncorrelated (which
# will never happen except by design):

> cor(x1,x2)
[1] 0

# We talked about how multiple regression isolates what's unique about
# the relationship between each predictor and the outcome. The PowerPoint
# for today describes this process.

# First, we isolate the part of Neuroticism that Depression CAN'T help
# us understand. We do that by regressing Neuroticism on Depression and
# keeping the residuals:

> Yres <- lm(Neuroticism~Depression)$res
> Yres
          1           2           3           4           5           6 
 19.8768322 -22.8152540 -39.1231678  22.1417403  16.2610047  19.3470161 
          7           8           9          10          11          12 
 24.6285507 -30.0801621 -34.6363574  29.4496540   2.3470161 -25.3118171 
         13          14          15          16          17          18 
 24.8172000   4.2443782  -4.0801621 -19.9775242 -30.8152540   0.6451772 
         19          20          21          22          23          24 
 17.9794701 -28.0205299 -22.0801621  21.1847460   6.4496540   2.2873839 
         25          26          27          28          29          30 
 15.2443782  12.4232748  15.2443782 -20.0801621   6.8504530  16.2013725 
         31          32          33          34          35          36 
 36.2873839 -18.5503460 -25.5337195  53.6119242  -3.9012654 -40.4310815 
         37          38          39          40          41 
 19.3470161  -5.0205299  12.9794701   2.9960966 -32.3880758 
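# A quick sanity check on that step (a sketch, not run in the session above,
# so no output is shown and the lines below have no > prompt): least-squares
# residuals are, by construction, uncorrelated with the predictor they came
# from, so Yres should contain none of what Depression can explain.

round(cor(Yres, Depression), 10)   # 0 up to floating-point rounding
round(mean(Yres), 10)              # residuals from a model with an intercept also average to 0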
# Next, we isolate the part of Agreeableness that doesn't overlap with
# Depression. We do that by regressing Agreeableness on Depression and
# keeping the residuals:

> Xres <- lm(Agreeableness~Depression)$res

# The regression of the Neuroticism residuals on the Agreeableness
# residuals has a slope...

> lm(Yres~Xres)

Call:
lm(formula = Yres ~ Xres)

Coefficients:
(Intercept)         Xres
 -1.055e-16   -4.049e-01

# ...that exactly matches the slope in the multiple regression:

> lm(Neuroticism~Agreeableness+Depression)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# The plot of those residuals is called the added variable plot
# for Neuroticism and Agreeableness, controlling for Depression:

> plot(Xres,Yres)

# It looks a bit similar to the corresponding raw scatterplot:

> plot(Agreeableness,Neuroticism)

# But the relationship looks a little cleaner in the added
# variable plot. That's because some of what appears to be noise
# in the raw scatterplot is actually systematic variation associated
# with Depression.

# We can get the other multiple regression slope by repeating the
# process, swapping the roles of Agreeableness and Depression.

# First, we isolate the part of Neuroticism that Agreeableness CAN'T
# help us understand:

> Yres <- lm(Neuroticism~Agreeableness)$res

# Then we isolate the part of Depression that doesn't overlap
# with Agreeableness:

> Xres <- lm(Depression~Agreeableness)$res

# Here's the regression:

> lm(Yres~Xres)

Call:
lm(formula = Yres ~ Xres)

Coefficients:
(Intercept)         Xres
  1.916e-15    1.092e+00

# The slope of 1.092 doesn't exactly match the multiple regression output of
# 1.0918, but that's just because of rounding in the printout. If we look
# directly at the $coef component of the regression output, we see more
# decimal places, and the value DOES match the previous multiple regression
# slope:

> lm(Yres~Xres)$coef
 (Intercept)         Xres 
1.916465e-15 1.091835e+00 

# Once again, the added variable plot...

> plot(Xres,Yres)

# ...looks similar to the raw scatterplot, but with less apparent noise:

> plot(Depression,Neuroticism)
> plot(Xres,Yres)

# If we want to do inference about the slopes, it's easiest to save
# the regression object:

> regout <- lm(Neuroticism~Agreeableness+Depression)
> regout

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# In simple regression, we could do inference about the slope two
# ways: using the anova() wrapper or the summary() wrapper. For
# multiple regression, though, the anova() wrapper gives the wrong
# answer; so DON'T DO THIS:

> anova(regout)
Analysis of Variance Table

Response: Neuroticism
              Df  Sum Sq Mean Sq F value    Pr(>F)    
Agreeableness  1  8180.4  8180.4  16.368 0.0002465 ***
Depression     1  6445.2  6445.2  12.896 0.0009299 ***
Residuals     38 18991.4   499.8                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Instead, do this:

> summary(regout)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.387 -18.687   2.273  17.682  45.497 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.1270    13.9587   3.949 0.000328 ***
Agreeableness  -0.4049     0.1744  -2.322 0.025698 *  
Depression      1.0918     0.3040   3.591 0.000930 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.36 on 38 degrees of freedom
Multiple R-squared:  0.4351,    Adjusted R-squared:  0.4053 
F-statistic: 14.63 on 2 and 38 DF,  p-value: 1.941e-05

# Both slopes are significant at the .05 level.
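# One way to see why anova() is the wrong tool here (a sketch, not run in the
# session above, so no output is shown; regout2 is just a name introduced for
# this illustration): anova() on an lm object reports sequential sums of
# squares, so its test for each predictor depends on the order in which the
# predictors appear in the formula. The summary() t tests don't change with
# the order.

regout2 <- lm(Neuroticism ~ Depression + Agreeableness)   # same model, predictors reversed
anova(regout)            # Agreeableness entered first
anova(regout2)           # Depression entered first: different SS and p values per predictor
summary(regout)$coef     # same estimates, t values, and p values either way,
summary(regout2)$coef    # just with the rows in a different order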
# We can check the assumptions of equal error variability across the
# range of fitted values and normality of the errors exactly the
# same way we did for simple regression:

> regout$fit
        1         2         3         4         5         6         7 
 39.51374  29.68723  36.27417  66.80409  59.16677  29.88210  79.13215 
        8         9        10        11        12        13        14 
 27.50356  62.37041  46.04402  31.90683  75.71844  37.25820  50.16535 
       15        16        17        18        19        20        21 
 36.41238  24.22805  26.44765 108.38089  38.26297  45.14706  33.98270 
       22        23        24        25        26        27        28 
 36.16637  38.35004  28.43645  44.09115  89.73262  44.90105  24.26398 
       29        30        31        32        33        34        35 
 76.31826  48.81231  38.15516  27.41649  66.78888  51.50319  60.18674 
       36        37        38        39        40        41 
 42.45617  49.31952  40.28770  50.41136  68.72655  58.38728 

> regout$res
          1           2           3           4           5           6 
 26.4862591 -18.6872255 -29.2741700   4.1959145  27.8332283  30.1179020 
          7           8           9          10          11          12 
 14.8678525 -26.5035554 -30.3704084  19.9559817  11.0931702 -20.7184370 
         13          14          15          16          17          18 
 22.7417976  -1.1653504  -9.4123754 -17.2280503 -23.4476546 -12.3808870 
         19          20          21          22          23          24 
 21.7370324 -31.1470558 -24.9826972  18.8336327   4.6499626   3.5635478 
         25          26          27          28          29          30 
 15.9088450   0.2673786  15.0989523 -13.2639845  17.6817442  27.1876941 
         31          32          33          34          35          36 
 27.8448351  -9.4164856 -29.7888841  45.4968065  -0.1867383 -24.4561681 
         37          38          39          40          41 
 10.6804766  -3.2876994   4.5886415   2.2734539 -47.3872817 

# The plot of residuals against fitted values lets us compare variability
# as we move from left to right in the plot. This plot is exactly what we
# want to see: there is roughly the same amount of variation no matter
# what part of the X axis we look at:

> plot(regout$fit,regout$res)

# We can assess the normality of errors by using standard methods to look
# at the normality of the residuals. Both the stem-and-leaf plot and the
# normal Q-Q plot look a bit bad here:

> stem(regout$res)

  The decimal point is 1 digit(s) to the right of the |

  -4 | 7
  -3 | 100
  -2 | 975431
  -1 | 9732
  -0 | 99310
   0 | 024455
   1 | 1155689
   2 | 0236788
   3 | 0
   4 | 5

> qqnorm(regout$res); qqline(regout$res)

# But we weren't particularly close to the decision boundary when we rejected
# the null hypotheses about the slopes, and regression is relatively robust
# to violations of normality, so I'm not too concerned.
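# For completeness, two other standard checks (a sketch, not run in the session
# above, so no output is shown): R's plot() method for lm objects draws the
# residuals-vs-fitted and normal Q-Q plots (plus two others) in one step, and
# shapiro.test() gives a formal test of normality for the residuals.

par(mfrow = c(2, 2))       # put the four default diagnostic plots on one page
plot(regout)               # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))       # back to one plot per page
shapiro.test(regout$res)   # Shapiro-Wilk test; a small p value would suggest non-normal errors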