# We introduced multiple regression using two variables from a "Big Five"
# personality inventory (Neuroticism and Agreeableness) and a Depression score.

> bigfive <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/105/data/BigFive.csv")
> head(bigfive)
  Neuroticism Agreeableness Depression
1          66            79         15
2          11            79          6
3           7            87         15
4          71            17         17
5          87            79         33
6          60            92         11

# Before we use linear methods, we should assess whether the relationships
# appear linear. (I'm not taking the time to do pretty scatterplots now, but
# we'll do an example of that next class.)

> par(pin=c(6,4))
> attach(bigfive)
> plot(Agreeableness, Neuroticism)
> plot(Depression, Neuroticism)

# The relationships appear somewhat noisy, but there is no compelling
# evidence of a nonlinear relationship.

# We could use each predictor separately to try to understand the conditional
# mean of Neuroticism:

> lm(Neuroticism~Agreeableness)

Call:
lm(formula = Neuroticism ~ Agreeableness)

Coefficients:
  (Intercept)  Agreeableness
      88.1229        -0.6496

> lm(Neuroticism~Depression)

Call:
lm(formula = Neuroticism ~ Depression)

Coefficients:
(Intercept)   Depression
     25.610        1.368

# But what we really want is to understand how the two predictors
# work in tandem. That's what multiple regression looks at. It's easy
# to estimate the multiple regression using R's lm() function: just add
# another predictor, separated by a '+' sign:

> lm(Neuroticism~Agreeableness+Depression)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# Note that the slopes are not the same as they were in the simple
# regressions. That's because the two predictors are correlated, so
# there is overlap in the part of Neuroticism that they can help
# us understand:

> cor(Agreeableness,Depression)
[1] -0.3905633

# There is only one circumstance in which the simple regression
# slopes are identical to the multiple regression slopes.
# (The demo data set used to illustrate it isn't created in this transcript.)

> attach(demo)
> lm(y~x1+x2)

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
(Intercept)           x1           x2
         10           10           -5

> lm(y~x1)

Call:
lm(formula = y ~ x1)

Coefficients:
(Intercept)           x1
      -17.5         10.0

> lm(y~x2)

Call:
lm(formula = y ~ x2)

Coefficients:
(Intercept)           x2
         60           -5

# That's when the predictors are perfectly uncorrelated (which
# will never happen except by design):

> cor(x1,x2)
[1] 0

# We talked about how multiple regression isolates what's unique about
# the relationship between each predictor and the outcome. The PowerPoint
# for today describes this process.

# First, we isolate the part of Neuroticism that Depression CAN'T help
# us understand. We do that by regressing Neuroticism on Depression and
# keeping the residuals:

> Yres <- lm(Neuroticism~Depression)$res
> Yres
          1           2           3           4           5           6 
 19.8768322 -22.8152540 -39.1231678  22.1417403  16.2610047  19.3470161 
          7           8           9          10          11          12 
 24.6285507 -30.0801621 -34.6363574  29.4496540   2.3470161 -25.3118171 
         13          14          15          16          17          18 
 24.8172000   4.2443782  -4.0801621 -19.9775242 -30.8152540   0.6451772 
         19          20          21          22          23          24 
 17.9794701 -28.0205299 -22.0801621  21.1847460   6.4496540   2.2873839 
         25          26          27          28          29          30 
 15.2443782  12.4232748  15.2443782 -20.0801621   6.8504530  16.2013725 
         31          32          33          34          35          36 
 36.2873839 -18.5503460 -25.5337195  53.6119242  -3.9012654 -40.4310815 
         37          38          39          40          41 
 19.3470161  -5.0205299  12.9794701   2.9960966 -32.3880758 
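# A quick sanity check on that step (a sketch, not run in the session above,
# so no output is shown and the lines below have no > prompt): least-squares
# residuals are, by construction, uncorrelated with the predictor they came
# from, so Yres should contain none of what Depression can explain.

round(cor(Yres, Depression), 10)   # 0 up to floating-point rounding
round(mean(Yres), 10)              # residuals from a model with an intercept also average to 0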
# Next, we isolate the part of Agreeableness that doesn't overlap with
# Depression. We do that by regressing Agreeableness on Depression and
# keeping the residuals:

> Xres <- lm(Agreeableness~Depression)$res

# The regression of the Neuroticism residuals on the Agreeableness
# residuals has a slope...

> lm(Yres~Xres)

Call:
lm(formula = Yres ~ Xres)

Coefficients:
(Intercept)         Xres
 -1.055e-16   -4.049e-01

# ...that exactly matches the slope in the multiple regression:

> lm(Neuroticism~Agreeableness+Depression)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# The plot of those residuals is called the added variable plot
# for Neuroticism and Agreeableness, controlling for Depression:

> plot(Xres,Yres)

# It looks a bit similar to the corresponding raw scatterplot:

> plot(Agreeableness,Neuroticism)

# But the relationship looks a little cleaner in the added
# variable plot. That's because some of what appears to be noise
# in the raw scatterplot is actually systematic variation associated
# with Depression.

# We can get the other multiple regression slope by repeating the
# process, swapping the roles of Agreeableness and Depression.

# First, we isolate the part of Neuroticism that Agreeableness CAN'T
# help us understand:

> Yres <- lm(Neuroticism~Agreeableness)$res

# Then we isolate the part of Depression that doesn't overlap
# with Agreeableness:

> Xres <- lm(Depression~Agreeableness)$res

# Here's the regression:

> lm(Yres~Xres)

Call:
lm(formula = Yres ~ Xres)

Coefficients:
(Intercept)         Xres
  1.916e-15    1.092e+00

# The slope of 1.092 doesn't exactly match the multiple regression output of
# 1.0918, but that's just because of rounding in the printout. If we look
# directly at the $coef component of the regression output, we see more
# decimal places, and the value DOES match the previous multiple regression
# slope:

> lm(Yres~Xres)$coef
 (Intercept)         Xres 
1.916465e-15 1.091835e+00 

# Once again, the added variable plot...

> plot(Xres,Yres)

# ...looks similar to the raw scatterplot, but with less apparent noise:

> plot(Depression,Neuroticism)
> plot(Xres,Yres)

# If we want to do inference about the slopes, it's easiest to save
# the regression object:

> regout <- lm(Neuroticism~Agreeableness+Depression)
> regout

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Coefficients:
  (Intercept)  Agreeableness     Depression
      55.1270        -0.4049         1.0918

# In simple regression, we could do inference about the slope two
# ways: using the anova() wrapper or the summary() wrapper. For
# multiple regression, though, the anova() wrapper gives the wrong
# answer; so DON'T DO THIS:

> anova(regout)
Analysis of Variance Table

Response: Neuroticism
              Df  Sum Sq Mean Sq F value    Pr(>F)    
Agreeableness  1  8180.4  8180.4  16.368 0.0002465 ***
Depression     1  6445.2  6445.2  12.896 0.0009299 ***
Residuals     38 18991.4   499.8                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Instead, do this:

> summary(regout)

Call:
lm(formula = Neuroticism ~ Agreeableness + Depression)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.387 -18.687   2.273  17.682  45.497 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    55.1270    13.9587   3.949 0.000328 ***
Agreeableness  -0.4049     0.1744  -2.322 0.025698 *  
Depression      1.0918     0.3040   3.591 0.000930 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 22.36 on 38 degrees of freedom
Multiple R-squared:  0.4351,    Adjusted R-squared:  0.4053 
F-statistic: 14.63 on 2 and 38 DF,  p-value: 1.941e-05

# Both slopes are significant at the .05 level.
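# One way to see why anova() is the wrong tool here (a sketch, not run in the
# session above, so no output is shown; regout2 is just a name introduced for
# this illustration): anova() on an lm object reports sequential sums of
# squares, so its test for each predictor depends on the order in which the
# predictors appear in the formula. The summary() t tests don't change with
# the order.

regout2 <- lm(Neuroticism ~ Depression + Agreeableness)   # same model, predictors reversed
anova(regout)            # Agreeableness entered first
anova(regout2)           # Depression entered first: different SS and p values per predictor
summary(regout)$coef     # same estimates, t values, and p values either way,
summary(regout2)$coef    # just with the rows in a different order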
# We can check the assumptions of equal error variability across the
# range of fitted values and normality of the errors exactly the
# same way we did for simple regression:

> regout$fit
        1         2         3         4         5         6         7 
 39.51374  29.68723  36.27417  66.80409  59.16677  29.88210  79.13215 
        8         9        10        11        12        13        14 
 27.50356  62.37041  46.04402  31.90683  75.71844  37.25820  50.16535 
       15        16        17        18        19        20        21 
 36.41238  24.22805  26.44765 108.38089  38.26297  45.14706  33.98270 
       22        23        24        25        26        27        28 
 36.16637  38.35004  28.43645  44.09115  89.73262  44.90105  24.26398 
       29        30        31        32        33        34        35 
 76.31826  48.81231  38.15516  27.41649  66.78888  51.50319  60.18674 
       36        37        38        39        40        41 
 42.45617  49.31952  40.28770  50.41136  68.72655  58.38728 

> regout$res
          1           2           3           4           5           6 
 26.4862591 -18.6872255 -29.2741700   4.1959145  27.8332283  30.1179020 
          7           8           9          10          11          12 
 14.8678525 -26.5035554 -30.3704084  19.9559817  11.0931702 -20.7184370 
         13          14          15          16          17          18 
 22.7417976  -1.1653504  -9.4123754 -17.2280503 -23.4476546 -12.3808870 
         19          20          21          22          23          24 
 21.7370324 -31.1470558 -24.9826972  18.8336327   4.6499626   3.5635478 
         25          26          27          28          29          30 
 15.9088450   0.2673786  15.0989523 -13.2639845  17.6817442  27.1876941 
         31          32          33          34          35          36 
 27.8448351  -9.4164856 -29.7888841  45.4968065  -0.1867383 -24.4561681 
         37          38          39          40          41 
 10.6804766  -3.2876994   4.5886415   2.2734539 -47.3872817 

# The plot of residuals against fitted values lets us compare variability
# as we move from left to right in the plot. This plot is exactly what we
# want to see: there is roughly the same amount of variation no matter
# what part of the X axis we look at:

> plot(regout$fit,regout$res)

# We can assess the normality of errors by using standard methods to look
# at the normality of the residuals. Both the stem-and-leaf plot and the
# normal Q-Q plot look a bit bad here:

> stem(regout$res)

  The decimal point is 1 digit(s) to the right of the |

  -4 | 7
  -3 | 100
  -2 | 975431
  -1 | 9732
  -0 | 99310
   0 | 024455
   1 | 1155689
   2 | 0236788
   3 | 0
   4 | 5

> qqnorm(regout$res); qqline(regout$res)

# But we weren't particularly close to the decision boundary when we rejected
# the null hypotheses about the slopes, and regression is relatively robust
# to violations of normality, so I'm not too concerned.
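# For completeness, two other standard checks (a sketch, not run in the session
# above, so no output is shown): R's plot() method for lm objects draws the
# residuals-vs-fitted and normal Q-Q plots (plus two others) in one step, and
# shapiro.test() gives a formal test of normality for the residuals.

par(mfrow = c(2, 2))       # put the four default diagnostic plots on one page
plot(regout)               # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))       # back to one plot per page
shapiro.test(regout$res)   # Shapiro-Wilk test; a small p value would suggest non-normal errors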