# We reviewed the idea of added variable plots.

> attach(BigFive)

# Here, we isolate the part of Neuroticism (yres) that can't
# be predicted by Depression, as well as the part of Agreeableness
# (xres) that can't be predicted by Depression:

> yres <- lm(Neuroticism~Depression)$res
> xres <- lm(Agreeableness~Depression)$res

# The slope for Agreeableness in the multiple regression...

> lm(Neuroticism~Depression+Agreeableness)

Call:
lm(formula = Neuroticism ~ Depression + Agreeableness)

Coefficients:
  (Intercept)     Depression  Agreeableness
      55.1270         1.0918        -0.4049

# ...is the same as the slope from the regression of yres on xres:

> lm(yres~xres)

Call:
lm(formula = yres ~ xres)

Coefficients:
(Intercept)         xres
 -1.055e-16   -4.049e-01

# (The intercept here is zero up to floating-point error, because
# both sets of residuals necessarily have mean zero.)

# The added variable plot gives a graphical representation of that
# relationship. Think of it as being like the raw scatterplot...

> par(pin=c(6,4))
> plot(Agreeableness,Neuroticism)

# ...but with the systematic variability associated with Depression
# removed from the picture:

> plot(xres,yres)

# Notice that the relationship in the added variable plot (yres vs.
# xres) appears a bit cleaner than in the raw scatterplot. That's
# because some of the variation that appears as noise in the raw plot
# is actually systematic variation associated with Depression that
# has been quite literally removed from the picture.

# We can do inference about the slopes in multiple regression using
# exactly the same syntax that we used for simple regression:

> summary(lm(Neuroticism~Depression+Agreeableness))

Call:
lm(formula = Neuroticism ~ Depression + Agreeableness)

Residuals:
    Min      1Q  Median      3Q     Max
-47.387 -18.687   2.273  17.682  45.497

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    55.1270    13.9587   3.949 0.000328 ***
Depression      1.0918     0.3040   3.591 0.000930 ***
Agreeableness  -0.4049     0.1744  -2.322 0.025698 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.36 on 38 degrees of freedom
Multiple R-squared:  0.4351,	Adjusted R-squared:  0.4053
F-statistic: 14.63 on 2 and 38 DF,  p-value: 1.941e-05

# We can also check assumptions for multiple regression in
# the same way we did for simple regression. The plot of
# residuals against fitted values is useful for checking
# the assumption that the variability of the errors is constant
# across the range of fitted values:

> lm(Neuroticism~Depression+Agreeableness) -> reg
> plot(reg$fit, reg$res)

# We can assess the assumption of normally distributed errors
# by applying our usual graphical methods to the residuals:

> stem(reg$res)

  The decimal point is 1 digit(s) to the right of the |

  -4 | 7
  -3 | 100
  -2 | 975431
  -1 | 9732
  -0 | 99310
   0 | 024455
   1 | 1155689
   2 | 0236788
   3 | 0
   4 | 5

> qqnorm(reg$res); qqline(reg$res)

# We might be a bit concerned about the normality assumption, but we
# were far from the decision boundary when we concluded that both
# slopes are nonzero, so we can probably appeal to robustness and not
# be too worried.
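# As an aside (a sketch we didn't run above): if the car package is
# installed, its avPlots() function draws the added variable plot for
# every predictor in one call, which is a handy check on the manual
# residual-on-residual construction:

> # install.packages("car")   # one-time install, if needed
> library(car)
> avPlots(reg)

# The Agreeableness panel should reproduce plot(xres, yres), and the
# line drawn through it should have slope -0.4049, the coefficient
# from the multiple regression.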
# Here's the math practice ANOVA example:

> math <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/math.csv")
> head(math)
  Time Practice
1   51   massed
2   96   massed
3   97   massed
4   22   massed
5   35   massed
6   36   massed

# There are massed practice, spaced practice, and no-practice
# conditions, and the outcome of interest is time to complete a set
# of math problems (faster is better).

# It's simple to calculate the ANOVA in R. The F ratio of 5.85
# (p = .0096) allows us to reject the null hypothesis that all
# three population means are the same.

> attach(math)
> anova(lm(Time~Practice))
Analysis of Variance Table

Response: Time
          Df Sum Sq Mean Sq F value   Pr(>F)
Practice   2  13772  6885.9  5.8504 0.009559 **
Residuals 21  24717  1177.0
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Now, let's think about the logic behind ANOVA. ANOVA relies
# on the idea that we can estimate variability in two ways:
# one that is valid regardless of whether the null hypothesis
# is true, and one that is valid if the null is true but otherwise
# will tend to be too large. The first method is analogous to
# the pooled variance estimate in the t test: take a weighted
# average of the variances in all groups. Here, n=8 in all groups,
# so we don't need to weight:

> tapply(Time, Practice, var)
   massed      none    spaced
 925.8393 1668.2679  936.8571

# The average of those variance estimates is...

> mean(tapply(Time, Practice, var))
[1] 1176.988

# ...which matches the residual mean square that R gave us.

# The other way of estimating variability looks at how variable
# the sample means are:

> tapply(Time, Practice, mean)
 massed    none  spaced
 55.125 112.625  94.000

# The Central Limit Theorem tells us that the variance of a sample
# mean is the population variance divided by the sample size. Turning
# that around, the variance of those means multiplied by the sample
# size on which each mean is based should be a valid estimate of the
# population variance (but only if the three means are actually
# estimating the same population mean):

> var(tapply(Time, Practice, mean)) * 8
[1] 6885.875

# Note that that result matches the Practice mean square in the
# ANOVA table.
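# To close the loop (another sketch we didn't run above; the expected
# values in the comments come straight from the ANOVA table), we can
# assemble the F ratio from those two variance estimates ourselves
# and get its p-value from the F distribution:

> msbetween <- var(tapply(Time, Practice, mean)) * 8  # 6885.875
> mswithin <- mean(tapply(Time, Practice, var))       # 1176.988
> msbetween / mswithin                                # F = 5.8504
> pf(msbetween / mswithin, 2, 21, lower.tail=FALSE)   # p = 0.009559

# The degrees of freedom are 2 (three groups minus one) and 21
# (24 observations minus the three estimated group means), matching
# the ANOVA table.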