# We spent the class working through an example of Homework Four.

# First, we need good scatterplots of Peabody against mother's age
# and against income. Here's the default scatterplot for mother's age:

> attach(JackStatlab)
> plot(MBAG,CTPEA)

# Let's change the aspect ratio:

> par(pin=c(6,4))
> plot(MBAG,CTPEA)

# Because of some weirdness with my screen geometry, that doesn't
# fit well; let's reduce the size by 20%:

> par(pin=.8*c(6,4))
> plot(MBAG,CTPEA)

# Those are horrible axis labels, and there's no title:

> plot(MBAG,CTPEA,xlab="Mother's Age at Birth", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nMother's Age")

# And we want to add a little more white space:

> xinc <- .05*(max(MBAG)-min(MBAG))
> yinc <- .05*(max(CTPEA)-min(CTPEA))
> plot(MBAG,CTPEA,xlab="Mother's Age at Birth", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nMother's Age",
+      xlim=c(min(MBAG)-xinc,max(MBAG)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))

# That looks pretty good! Let's repeat the process for income:

> plot(FIT,CTPEA)
> xinc <- .05*(max(FIT)-min(FIT))
> plot(FIT,CTPEA,xlab="Family Income (100's of Dollars)", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nFamily Income",
+      xlim=c(min(FIT)-xinc,max(FIT)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))

# Neither plot shows evidence of nonlinearity. If we wanted,
# we could do the residuals plot for each predictor, as those
# plots can sometimes reveal nonlinearity that we didn't see
# in the scatterplot. Here, I do a crude residuals plot for
# mother's age. I still don't see any problems:

> temp <- lm(CTPEA~MBAG)
> plot(temp$fit, temp$res)

# Here is the same plot for income; again, there are no
# problems. (We could take the time to make the
# plots pretty, but to save class time, I didn't.)

> temp <- lm(CTPEA~FIT)
> plot(temp$fit, temp$res)

# A student asked about adding the regression line to the plot.
# We could certainly do that. (abline() accepts the fitted lm object directly.)

> plot(FIT,CTPEA,xlab="Family Income (100's of Dollars)", ylab="Peabody Score",
+      main="Peabody Picture Vocabulary Score\nFamily Income",
+      xlim=c(min(FIT)-xinc,max(FIT)+xinc),
+      ylim=c(min(CTPEA)-yinc,max(CTPEA)+yinc))
> abline(lm(CTPEA~FIT))

# The assignment asks us to interpret and test the slopes in the
# two simple regressions. Here, we tackle mother's age:

> temp <- lm(CTPEA~MBAG)
> summary(temp)

Call:
lm(formula = CTPEA ~ MBAG)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.209  -8.057   0.775   6.678  39.307 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  60.0518     9.8304   6.109 1.71e-07 ***
MBAG          0.8386     0.3755   2.233   0.0302 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13 on 48 degrees of freedom
Multiple R-squared:  0.09414,	Adjusted R-squared:  0.07526 
F-statistic: 4.988 on 1 and 48 DF,  p-value: 0.03022

# The slope is significantly different from zero (p=.0302,
# which is less than .05). The slope of about 0.8 means that,
# according to the model, the conditional mean of the Peabody
# distribution goes up by about 8/10ths of a point for each
# one-year increase in mother's age.

# We do the same for the simple regression of Peabody on income:

> temp <- lm(CTPEA~FIT)
> summary(temp)

Call:
lm(formula = CTPEA ~ FIT)

Residuals:
     Min       1Q   Median       3Q      Max 
-23.3023 -10.0824  -0.6297   8.4202  29.6674 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 72.57938    4.46190  16.266   <2e-16 ***
FIT          0.05753    0.02587   2.224   0.0309 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 13 on 48 degrees of freedom
Multiple R-squared:  0.0934,	Adjusted R-squared:  0.07451 
F-statistic: 4.945 on 1 and 48 DF,  p-value: 0.0309

# The slope is significant at the .05 level. The value of about .06
# means that, according to the model, the conditional mean of the
# Peabody distribution goes up by about .06 points for every $100
# increase in family income.
# (This change is associated with a $100 change because income is
# reported in 100's of dollars in this data set.)

# The fact that the slope is much smaller for income than for
# mother's age might suggest that mother's age accounts for more
# change in Peabody than income does, but that's not a correct
# interpretation, because income is much more variable than mother's
# age, so a 1-unit (i.e., $100) difference is much smaller than a
# typical difference in income. Note that the standard deviation for
# income is MUCH larger than the standard deviation for age:

> sd(MBAG)
[1] 4.944756
> sd(FIT)
[1] 71.79165

# So a typical change in mother's age might be five years, where a
# typical change in income might be about 72 100's of dollars.
# That would map to a typical change in the Peabody mean of about
# 5 * .84 = 4.2 points for mother's age, and a typical change
# associated with income would be about 72 * .057 = 4.104 points
# (about the same as the change associated with age).

# We had saved our income regression as an object called 'temp':

> temp

Call:
lm(formula = CTPEA ~ FIT)

Coefficients:
(Intercept)          FIT  
   72.57938      0.05753  

# The assignment asks you to assess the assumptions of equal variance in
# errors through the range of fitted values, and normality of errors.
# Here's a residuals plot for the regression of Peabody on income:

> plot(temp$fit, temp$res)

# Let's make it pretty:

> xinc <- .05*(max(temp$fit)-min(temp$fit))
> yinc <- .05*(max(temp$res)-min(temp$res))
> plot(temp$fit, temp$res,xlab="Predicted Conditional Mean",
+      ylab="Residual",
+      main="Plot of Residuals vs. Fitted Values\nfor Regression of Peabody on Income",
+      xlim=c(min(temp$fit)-xinc,max(temp$fit)+xinc),
+      ylim=c(min(temp$res)-yinc,max(temp$res)+yinc))

# I see no problem with variability in the residuals changing as we move
# across the plot.

# Here, we do a normal Q-Q plot of the residuals.
# The plot indicates a bit of a light tail at the left side of the
# distribution, but I'm not TOO concerned, because regression is robust
# to violations of the normality assumption:

> qqnorm(temp$res,main="Normal Q-Q Plot for Income Regression");qqline(temp$res)

# (In the assignment, you would repeat that investigation for the regression
# on mother's age; we didn't take the time to do that in class.)

# The assignment asks us to compute the multiple regression and test the slopes:

> regout <- lm(CTPEA~FIT+MBAG)
> summary(regout)

Call:
lm(formula = CTPEA ~ FIT + MBAG)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.468  -8.929   1.052   6.418  29.929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 55.30811    9.86828   5.605 1.06e-06 ***
FIT          0.04940    0.02549   1.938   0.0587 .  
MBAG         0.72118    0.37013   1.948   0.0573 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.64 on 47 degrees of freedom
Multiple R-squared:  0.1612,	Adjusted R-squared:  0.1255 
F-statistic: 4.515 on 2 and 47 DF,  p-value: 0.01609

# According to the F statistic, the regression as a whole is significant,
# but neither of the individual slopes is. That is happening partly
# because the two predictors are correlated:

> cor(FIT,MBAG)
[1] 0.163669

# Don't make the mistake of trying to do inference about the slopes
# using the ANOVA summary: the F tests there are not correct for our question:

> anova(regout)
Analysis of Variance Table

Response: CTPEA
          Df Sum Sq Mean Sq F value  Pr(>F)  
FIT        1  835.9  835.92  5.2333 0.02670 *   # <--- Wrong result
MBAG       1  606.4  606.43  3.7966 0.05734 .
Residuals 47 7507.4  159.73                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# The assignment asks us to check the assumption of equal variability
# across the range of fitted values. Here's a crude residuals plot that
# doesn't indicate problems. Take the time to make this pretty in your homework.
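A side note on the anova() caution above, before we get to the residuals plot below (an editor's sketch, not part of the class session): the sequential F tests that anova() reports depend on the order in which the predictors enter the formula, while the partial t tests from summary() do not. Simulated data stand in for the class data, which isn't bundled with these notes:

```r
# Sequential (Type I) anova() F tests change with predictor order;
# the partial t tests from summary() do not. Simulated data, since
# the JackStatlab data frame isn't available here.
set.seed(1)
x1 <- rnorm(50)
x2 <- 0.5 * x1 + rnorm(50)        # predictors are correlated, as in class
y  <- 1 + x1 + x2 + rnorm(50)

a12 <- anova(lm(y ~ x1 + x2))     # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))     # x1 entered last
a12["x1", "F value"]              # differs from ...
a21["x1", "F value"]              # ... this one

s12 <- summary(lm(y ~ x1 + x2))$coefficients
s21 <- summary(lm(y ~ x2 + x1))$coefficients
s12["x1", "t value"]              # identical to ...
s21["x1", "t value"]              # ... this one
```

That order dependence is exactly why the FIT line in the anova(regout) table above is flagged as the wrong result for our question.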
> plot(regout$fit, regout$res)

# (We didn't evaluate normality, in the interests of time, but the
# same Q-Q approach we used for the simple regressions would apply here.)

# The assignment asks us to create an added-variable plot for each predictor.
# Here, we do the added-variable plot for mother's age, controlling for income:

> yres <- lm(CTPEA~FIT)$res
> xres <- lm(MBAG~FIT)$res
> plot(xres,yres)

# In our previous example (last class session), the AV plot looked less noisy
# than the raw scatterplot. Here, the reverse is true:

> plot(MBAG,CTPEA)
> plot(xres,yres)

# Let's prettify the AV plot:

> xinc <- .05*(max(xres)-min(xres))
> yinc <- .05*(max(yres)-min(yres))
> plot(xres,yres,
+      main="Added Variable Plot for Regression of Peabody on Mother's Age\nControlling for Family Income",
+      xlab="Mother's Age (Adjusted)",ylab="Peabody Score (Adjusted)",
+      xlim=c(min(xres)-xinc,max(xres)+xinc),
+      ylim=c(min(yres)-yinc,max(yres)+yinc))

# (Retitled with a better line break in the title:)

> plot(xres,yres,
+      main="Added Variable Plot for Regression of Peabody\non Mother's Age Controlling for Family Income",
+      xlab="Mother's Age (Adjusted)",ylab="Peabody Score (Adjusted)",
+      xlim=c(min(xres)-xinc,max(xres)+xinc),
+      ylim=c(min(yres)-yinc,max(yres)+yinc))

# We did not take the time to produce the other AV plot.

# The assignment asks for confirmation that the slopes in the added-variable
# plots match the slopes in the multiple regression.
# Here's the multiple regression:

> regout

Call:
lm(formula = CTPEA ~ FIT + MBAG)

Coefficients:
(Intercept)          FIT         MBAG  
    55.3081       0.0494       0.7212  

# Here's the slope for the AV plot we just did. Note that it matches
# the slope in the multiple regression:

> lm(yres~xres)

Call:
lm(formula = yres ~ xres)

Coefficients:
(Intercept)         xres  
  2.191e-15    7.212e-01  

> 

# (For the homework assignment, you would also confirm this for the
# other AV plot.)
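The slope-matching fact demonstrated above is completely general (it is a consequence of the Frisch-Waugh result), and it is easy to verify on any data set. Here is an editor's sketch on simulated stand-in data, since the JackStatlab data frame isn't bundled with these notes:

```r
# The slope of the added-variable plot (residuals of y on the control,
# regressed on residuals of x on the control) equals the corresponding
# slope in the multiple regression. Simulated stand-ins for the class data.
set.seed(2)
inc  <- rnorm(50, mean = 170, sd = 72)        # income, 100's of dollars
age  <- 26 + 0.01 * inc + rnorm(50, sd = 5)   # mother's age, years
peab <- 55 + 0.05 * inc + 0.72 * age + rnorm(50, sd = 13)

full <- lm(peab ~ inc + age)

# AV-plot residuals for age, controlling for income:
yres <- resid(lm(peab ~ inc))
xres <- resid(lm(age ~ inc))

coef(lm(yres ~ xres))["xres"]                 # matches ...
coef(full)["age"]                             # ... this slope exactly
```

The intercept of the AV-plot regression is zero up to rounding error (like the 2.191e-15 above), because both residual vectors have mean zero.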