# We spent the class working through an example of Homework Four.

# First, we need good scatterplots of Peabody against mother's age
# and against income. Here's the default scatterplot for mother's age:

> attach(JackStatlab)
> plot(MBAG,CTPEA)

# Let's change the aspect ratio:

> par(pin=c(6,4))
> plot(MBAG,CTPEA)

# Because of some weirdness with my screen geometry, that doesn't
# fit well; let's reduce the size by 20%:

> par(pin=.8*c(6,4))
> plot(MBAG,CTPEA)

# Those are horrible axis labels, and there's no title:

> plot(MBAG,CTPEA,xlab="Mother's Age at Birth", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nMother's Age")

# And we want to add a little more white space:

> xinc <- .05*(max(MBAG)-min(MBAG))
> yinc <- .05*(max(CTPEA)-min(CTPEA))
> plot(MBAG,CTPEA,xlab="Mother's Age at Birth", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nMother's Age",
+      xlim=c(min(MBAG)-xinc,max(MBAG)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))

# That looks pretty good! Let's repeat the process for income:

> plot(FIT,CTPEA)
> xinc <- .05*(max(FIT)-min(FIT))
> plot(FIT,CTPEA,xlab="Family Income (100's of Dollars)", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nFamily Income",
+      xlim=c(min(FIT)-xinc,max(FIT)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))

# Neither plot shows evidence of nonlinearity. If we wanted,
# we could do the residuals plot for each predictor, as those
# plots can sometimes reveal nonlinearity that we didn't see
# in the scatterplot. Here, I do a crude residuals plot for
# mother's age. I still don't see any problems:

> temp <- lm(CTPEA~MBAG)
> plot(temp$fit, temp$res)

# Here is the same plot for income; again, there are no
# problems. (We could take the time to make the
# plots pretty, but to save class time, I didn't.)

> temp <- lm(CTPEA~FIT)
> plot(temp$fit, temp$res)

# A student asked about adding the regression line to the plot.
# We could certainly do that. (abline() accepts the fitted lm object directly.)

> plot(FIT,CTPEA,xlab="Family Income (100's of Dollars)", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nFamily Income",
+      xlim=c(min(FIT)-xinc,max(FIT)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))
> abline(lm(CTPEA~FIT))

# The assignment asks us to interpret and test the slopes in the
# two simple regressions. Here, we tackle mother's age:

> temp <- lm(CTPEA~MBAG)
> summary(temp)

Call:
lm(formula = CTPEA ~ MBAG)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.209  -8.057   0.775   6.678  39.307 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.0518     9.8304   6.109 1.71e-07 ***
MBAG          0.8386     0.3755   2.233   0.0302 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13 on 48 degrees of freedom
Multiple R-squared:  0.09414,	Adjusted R-squared:  0.07526 
F-statistic: 4.988 on 1 and 48 DF,  p-value: 0.03022

# The slope is significantly different from zero (p=.0302,
# which is less than .05). The slope of about 0.8 means that,
# according to the model, the conditional mean of the Peabody
# distribution goes up by about 8/10ths of a point for each
# one-year increase in mother's age.

# We do the same for the simple regression of Peabody on income:

> temp <- lm(CTPEA~FIT)
> summary(temp)

Call:
lm(formula = CTPEA ~ FIT)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.3023 -10.0824  -0.6297   8.4202  29.6674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 72.57938    4.46190  16.266   <2e-16 ***
FIT          0.05753    0.02587   2.224   0.0309 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13 on 48 degrees of freedom
Multiple R-squared:  0.0934,	Adjusted R-squared:  0.07451 
F-statistic: 4.945 on 1 and 48 DF,  p-value: 0.0309

# The slope is significant at the .05 level. The value of about .06
# means that, according to the model, the conditional mean of the
# Peabody distribution goes up by about .06 points for every $100
# increase in family income.
# (This change is associated with a $100 change because income is
# reported in 100's of dollars in this data set.)

# The fact that the slope is much smaller for income than for
# mother's age might suggest that mother's age accounts for more
# change in Peabody than income does, but that's not a correct
# interpretation, because income is much more variable than mother's
# age, so a 1-unit (i.e., $100) difference is much smaller than a
# typical difference in income. Note that the standard deviation for
# income is MUCH larger than the standard deviation for age:

> sd(MBAG)
[1] 4.944756
> sd(FIT)
[1] 71.79165

# So a typical change in mother's age might be five years, where a
# typical change in income might be about 72 100's of dollars.
# That would map to a typical change in the Peabody mean of about
# 5 * .84 = 4.2 points for mother's age, and a typical change
# associated with income would be about 72 * .057 = 4.104 points
# (about the same as the change associated with age).

# We had saved our income regression as an object called 'temp':

> temp

Call:
lm(formula = CTPEA ~ FIT)

Coefficients:
(Intercept)          FIT  
   72.57938      0.05753  

# The assignment asks you to assess the assumptions of equal variance in
# errors through the range of fitted values, and normality of errors.
# Here's a residuals plot for the regression of Peabody on income:

> plot(temp$fit, temp$res)

# Let's make it pretty:

> xinc <- .05*(max(temp$fit)-min(temp$fit))
> yinc <- .05*(max(temp$res)-min(temp$res))
> plot(temp$fit, temp$res,xlab="Predicted Conditional Mean",
+      ylab="Residual",
+      main="Plot of Residuals vs. Fitted Values\nfor Regression of Peabody on Income",
+      xlim=c(min(temp$fit)-xinc,max(temp$fit)+xinc),
+      ylim=c(min(temp$res)-yinc,max(temp$res)+yinc))

# I see no problem with variability in the residuals changing as we move
# across the plot.

# Here, we do a normal Q-Q plot of the residuals.
# The plot indicates a bit of a light tail at the left side of the
# distribution, but I'm not TOO concerned, because regression is robust
# to violations of the normality assumption:

> qqnorm(temp$res,main="Normal Q-Q Plot for Income Regression");qqline(temp$res)

# (In the assignment, you would repeat that investigation for the regression
# on mother's age; we didn't take the time to do that in class.)

# The assignment asks us to compute the multiple regression and test the slopes:

> regout <- lm(CTPEA~FIT+MBAG)
> summary(regout)

Call:
lm(formula = CTPEA ~ FIT + MBAG)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.468  -8.929   1.052   6.418  29.929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.30811    9.86828   5.605 1.06e-06 ***
FIT          0.04940    0.02549   1.938   0.0587 .  
MBAG         0.72118    0.37013   1.948   0.0573 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.64 on 47 degrees of freedom
Multiple R-squared:  0.1612,	Adjusted R-squared:  0.1255 
F-statistic: 4.515 on 2 and 47 DF,  p-value: 0.01609

# According to the F statistic, the regression as a whole is significant,
# but neither of the individual slopes is. That is happening partly
# because the two predictors are correlated:

> cor(FIT,MBAG)
[1] 0.163669

# Don't make the mistake of trying to do inference about the slopes
# using the ANOVA summary: the F tests there are not correct for our question:

> anova(regout)
Analysis of Variance Table

Response: CTPEA
          Df Sum Sq Mean Sq F value  Pr(>F)  
FIT        1  835.9  835.92  5.2333 0.02670 *   # <--- Wrong result
MBAG       1  606.4  606.43  3.7966 0.05734 .
Residuals 47 7507.4  159.73                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# The assignment asks us to check the assumption of equal variability
# across the range of fitted values. Here's a crude residuals plot that
# doesn't indicate problems. Take the time to make this pretty in your homework.
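A side note on the anova() caution above, before we get to the residuals plot below (an editor's sketch, not part of the class session): the sequential F tests that anova() reports depend on the order in which the predictors enter the formula, while the partial t tests from summary() do not. Simulated data stand in for the class data, which isn't bundled with these notes:

```r
# Sequential (Type I) anova() F tests change with predictor order;
# the partial t tests from summary() do not. Simulated data, since
# the JackStatlab data frame isn't available here.
set.seed(1)
x1 <- rnorm(50)
x2 <- 0.5 * x1 + rnorm(50)        # predictors are correlated, as in class
y  <- 1 + x1 + x2 + rnorm(50)

a12 <- anova(lm(y ~ x1 + x2))     # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))     # x1 entered last
a12["x1", "F value"]              # differs from ...
a21["x1", "F value"]              # ... this one

s12 <- summary(lm(y ~ x1 + x2))$coefficients
s21 <- summary(lm(y ~ x2 + x1))$coefficients
s12["x1", "t value"]              # identical to ...
s21["x1", "t value"]              # ... this one
```

That order dependence is exactly why the FIT line in the anova(regout) table above is flagged as the wrong result for our question.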
> plot(regout$fit, regout$res)

# (We didn't evaluate normality, in the interests of time, but the
# same Q-Q approach we used for the simple regressions would apply here.)

# The assignment asks us to create an added-variable plot for each predictor.
# Here, we do the added-variable plot for mother's age, controlling for income:

> yres <- lm(CTPEA~FIT)$res
> xres <- lm(MBAG~FIT)$res
> plot(xres,yres)

# In our previous example (last class session), the AV plot looked less noisy
# than the raw scatterplot. Here, the reverse is true:

> plot(MBAG,CTPEA)
> plot(xres,yres)

# Let's prettify the AV plot:

> xinc <- .05*(max(xres)-min(xres))
> yinc <- .05*(max(yres)-min(yres))
> plot(xres,yres,
+      main="Added Variable Plot for Regression of Peabody on Mother's Age\nControlling for Family Income",
+      xlab="Mother's Age (Adjusted)",ylab="Peabody Score (Adjusted)",
+      xlim=c(min(xres)-xinc,max(xres)+xinc),
+      ylim=c(min(yres)-yinc,max(yres)+yinc))

# (Retitled with a better line break in the title:)

> plot(xres,yres,
+      main="Added Variable Plot for Regression of Peabody\non Mother's Age Controlling for Family Income",
+      xlab="Mother's Age (Adjusted)",ylab="Peabody Score (Adjusted)",
+      xlim=c(min(xres)-xinc,max(xres)+xinc),
+      ylim=c(min(yres)-yinc,max(yres)+yinc))

# We did not take the time to produce the other AV plot.

# The assignment asks for confirmation that the slopes in the added-variable
# plots match the slopes in the multiple regression.
# Here's the multiple regression:

> regout

Call:
lm(formula = CTPEA ~ FIT + MBAG)

Coefficients:
(Intercept)          FIT         MBAG  
    55.3081       0.0494       0.7212  

# Here's the slope for the AV plot we just did. Note that it matches
# the slope in the multiple regression:

> lm(yres~xres)

Call:
lm(formula = yres ~ xres)

Coefficients:
(Intercept)         xres  
  2.191e-15    7.212e-01  

> 

# (For the homework assignment, you would also confirm this for the
# other AV plot.)
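The slope-matching fact demonstrated above is completely general (it is a consequence of the Frisch-Waugh result), and it is easy to verify on any data set. Here is an editor's sketch on simulated stand-in data, since the JackStatlab data frame isn't bundled with these notes:

```r
# The slope of the added-variable plot (residuals of y on the control,
# regressed on residuals of x on the control) equals the corresponding
# slope in the multiple regression. Simulated stand-ins for the class data.
set.seed(2)
inc  <- rnorm(50, mean = 170, sd = 72)        # income, 100's of dollars
age  <- 26 + 0.01 * inc + rnorm(50, sd = 5)   # mother's age, years
peab <- 55 + 0.05 * inc + 0.72 * age + rnorm(50, sd = 13)

full <- lm(peab ~ inc + age)

# AV-plot residuals for age, controlling for income:
yres <- resid(lm(peab ~ inc))
xres <- resid(lm(age ~ inc))

coef(lm(yres ~ xres))["xres"]                 # matches ...
coef(full)["age"]                             # ... this slope exactly
```

The intercept of the AV-plot regression is zero up to rounding error (like the 2.191e-15 above), because both residual vectors have mean zero.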