# We reviewed how to do an ANOVA in R:

> attach(math)
> anova(lm(Time~Practice))
Analysis of Variance Table

Response: Time
          Df Sum Sq Mean Sq F value   Pr(>F)   
Practice   2  13772  6885.9  5.8504 0.009559 **
Residuals 21  24717  1177.0                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# To check the assumption that variability is the same
# in all three populations, it is useful to compare the
# standard deviations of the three groups with the square
# root of the error mean square (which is analogous to the
# pooled variance estimate we used with the t test).

# Here are the sd's:

> tapply(Time, Practice, sd)
  massed     none   spaced 
30.42761 40.84443 30.60812 

# Here's the pooled sd:

> sqrt(1177)
[1] 34.30743

# We don't have to look at very many samples with n=8
# to see that our sd's don't vary more than one would
# expect them to simply as a function of sampling error:

> sd(rnorm(8, 0, 10.82))
[1] 13.42542
> sd(rnorm(8, 0, 34.31))
[1] 18.70995
> sd(rnorm(8, 0, 34.31))
[1] 36.49577
> sd(rnorm(8, 0, 34.31))
[1] 29.10514
> sd(rnorm(8, 0, 34.31))
[1] 16.91951
> sd(rnorm(8, 0, 34.31))
[1] 31.59481
> sd(rnorm(8, 0, 34.31))
[1] 40.35856

# The distributions in all three groups appear to be
# relatively normal, although it is essentially
# impossible to judge this with such a small sample size:

> qqnorm(Time[1:8]); qqline(Time[1:8])
> qqnorm(Time[9:16]); qqline(Time[9:16])
> qqnorm(Time[17:24]); qqline(Time[17:24])

# Here's one way to visualize the three groups:

> boxplot(Time~Practice)

# However, R thinks of these in alphabetical order (Massed,
# None, Spaced), whereas we prefer to think of them in the
# order Massed, Spaced, None. We can get the order we want
# by separating the three groups...

> Massed <- Time[1:8]
> Spaced <- Time[9:16]
> None <- Time[17:24]

# ...and doing the boxplots this way:

> boxplot(Massed,Spaced,None)

# We can change the labels under the plots:

> boxplot(Massed,Spaced,None,axes=FALSE)
> mtext(at=1:3,text=c("Massed","Spaced","None"),side=1,line=2)

# We can manually do a line plot of the means:

> Means <- tapply(Time,Practice,mean)
> Means
 massed    none  spaced 
 55.125 112.625  94.000 

# Note that Means is in alphabetical order (massed, none, spaced),
# so we index it to match the Massed, Spaced, None order of our labels:

> plot(c(0.5,3.5),c(45,117),type='n',xlab="",ylab="",axes=FALSE)
> box()
> axis(side=2,at=seq(45,115,10))
> axis(side=1,at=c(1,2,3),lab=c("Massed","Spaced","None"))
> lines(1:3, Means[c("massed","spaced","none")], type='b')

# We can also depict the means using a barplot:

> barplot(tapply(Time,Practice,mean))

# Again, we might want to change the labels. Here, I guess at
# the geometry of the figure, reasoning that the locations of
# the bars might be at 1, 2, and 3 on the x axis. That turns out
# to be untrue, but if we took the time, we could eventually
# guess the correct locations.

> barplot(tapply(Time,Practice,mean),axisnames=FALSE)
> axis(side=1, at=1:3, lab=c("Massed","None","Spaced"))
> abline(h=0)

# Some more trial and error after class led to the finding
# that the positions are .7, 1.9, and 3.1:

> barplot(tapply(Time,Practice,mean),axisnames=FALSE)
> axis(side=1, at=c(.7,1.9,3.1), lab=c("Massed","None","Spaced"),pos=0)
> abline(h=0)

# We turned our attention away from ANOVA for a bit to see that creative
# coding in multiple regression can lead in interesting directions.
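# (An aside added after class: both of the plotting headaches above --
# the alphabetical group order, and guessing where the bars sit -- have
# cleaner solutions. We can reorder the factor levels instead of splitting
# the groups into separate vectors, and barplot() actually returns the
# x positions of the bars. The data below are made up stand-ins so this
# sketch runs on its own; in class we would use the math data set.)

```r
# Hypothetical stand-in data: 8 cases per group, roughly like the math data
set.seed(1)
Time <- c(rnorm(8, 55, 34), rnorm(8, 94, 34), rnorm(8, 113, 34))
Practice <- rep(c("massed", "spaced", "none"), each = 8)

# Releveling the factor puts the groups in the order we prefer:
Practice <- factor(Practice, levels = c("massed", "spaced", "none"))
boxplot(Time ~ Practice)    # boxes now appear as Massed, Spaced, None

# barplot() returns the bar midpoints, so no trial and error is needed:
mids <- barplot(tapply(Time, Practice, mean), axisnames = FALSE)
axis(side = 1, at = mids, lab = c("Massed", "Spaced", "None"))
abline(h = 0)
```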
# Here's the regression of Raven on Peabody:

> attach(JackStatlab)
> lm(CTRA~CTPEA)

Call:
lm(formula = CTRA ~ CTPEA)

Coefficients:
(Intercept)        CTPEA  
     -7.589        0.496  

# Recall that the first 32 cases are girls:

> 50-sum(CBSEX)
[1] 32

# The girls' regression of Raven on Peabody looks pretty
# similar to the regression with both sexes included; the
# intercept is a bit different, but the slope is almost the same:

> lm(CTRA[1:32]~CTPEA[1:32])

Call:
lm(formula = CTRA[1:32] ~ CTPEA[1:32])

Coefficients:
(Intercept)  CTPEA[1:32]  
    -4.8595       0.4847  

# But the boys' regression has a much higher slope:

> lm(CTRA[33:50]~CTPEA[33:50])

Call:
lm(formula = CTRA[33:50] ~ CTPEA[33:50])

Coefficients:
 (Intercept)  CTPEA[33:50]  
    -16.3891        0.5631  

# This leads to the question: is the regression significantly different
# for boys and girls? A bit of clever coding can do both separate regressions
# as a single multiple regression. Recall that we have a dummy variable for sex:

> CBSEX
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

# We can create an "interaction" variable by multiplying CBSEX times the predictor:

> Int <- CBSEX*CTPEA

# Here's the result. Note that whenever it's a girl (CBSEX=0), the interaction
# variable is also zero (because it's zero times the Peabody score). But when
# we're dealing with a boy (CBSEX=1), the interaction variable is the same as
# the Peabody score:

> cbind(CBSEX,CTPEA,Int)
      CBSEX CTPEA Int
 [1,]     0    81   0
 [2,]     0    94   0
 [3,]     0    84   0
 [4,]     0    74   0
 [5,]     0    71   0
 [6,]     0    59   0
 [7,]     0    86   0
 [8,]     0    63   0
 [9,]     0    81   0
[10,]     0   100   0
[11,]     0    63   0
[12,]     0    78   0
[13,]     0    87   0
[14,]     0    75   0
[15,]     0    62   0
[16,]     0    64   0
[17,]     0    85   0
[18,]     0    76   0
[19,]     0    83   0
[20,]     0    82   0
[21,]     0    77   0
[22,]     0    76   0
[23,]     0   102   0
[24,]     0    75   0
[25,]     0    77   0
[26,]     0    68   0
[27,]     0    75   0
[28,]     0    86   0
[29,]     0    78   0
[30,]     0    88   0
[31,]     0   100   0
[32,]     0    70   0
[33,]     1    85  85
[34,]     1    74  74
[35,]     1   123 123
[36,]     1    81  81
[37,]     1    69  69
[38,]     1    94  94
[39,]     1    78  78
[40,]     1    99  99
[41,]     1    77  77
[42,]     1    93  93
[43,]     1    65  65
[44,]     1    65  65
[45,]     1    94  94
[46,]     1    69  69
[47,]     1    78  78
[48,]     1    82  82
[49,]     1    66  66
[50,]     1    89  89

# If we add CBSEX and the interaction to the regression...

> lm(CTRA~CTPEA+CBSEX+Int)

Call:
lm(formula = CTRA ~ CTPEA + CBSEX + Int)

Coefficients:
(Intercept)        CTPEA        CBSEX          Int  
   -4.85949      0.48472    -11.52963      0.07841  

# ...then there are two different classes of conditional mean,
# depending on whether CBSEX is 0 or 1. If CBSEX is zero (a girl), the
# conditional mean is
#    -4.85949 + 0.48472*Peabody - 11.52963*0 + 0.07841*0,
# which is the same as -4.85949 + 0.48472*Peabody. (Those values
# match the coefficients of the girls' regression that we saw above.)
# But when it's a boy, the conditional mean is
#    -4.85949 + 0.48472*Peabody - 11.52963*1 + 0.07841*Peabody,
# which is the same as the intercept being...

> -4.85949 - 11.52963
[1] -16.38912

# ...and the slope being:

> .48472 + .07841
[1] 0.56313

# Note that those values match the boys' regression, above.
# This is a special situation in which the simple regression of
# Raven on Peabody is said to be "nested" within this more complex
# model. We can turn the complex model into the simple regression by
# removing two coefficients: the slope for CBSEX, and the slope for Int.
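# (A side note, not from class: R's formula notation can build this same
# model without constructing Int by hand, because CTPEA*CBSEX expands to
# CTPEA + CBSEX + CTPEA:CBSEX, and the last term is exactly the product.
# The data below are hypothetical stand-ins so the sketch runs on its own;
# in class we would use the JackStatlab variables.)

```r
# Hypothetical stand-in data resembling the JackStatlab structure
set.seed(1)
CBSEX <- rep(c(0, 1), each = 25)
CTPEA <- round(rnorm(50, 80, 12))
CTRA  <- -5 + 0.5*CTPEA - 11*CBSEX + 0.08*CBSEX*CTPEA + rnorm(50, 0, 8)

Int <- CBSEX * CTPEA                       # hand-made interaction, as above
m.byhand  <- lm(CTRA ~ CTPEA + CBSEX + Int)
m.formula <- lm(CTRA ~ CTPEA * CBSEX)      # same model, built by R

# The two fits have identical coefficients; only the last coefficient's
# name differs ("Int" vs. "CTPEA:CBSEX"):
all.equal(unname(coef(m.byhand)), unname(coef(m.formula)))
```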
# When models are nested like this, an approach known as a "nested F
# test" can assess whether the more complex model helps us understand
# significantly more variation than the simple model. Here, this amounts
# to a null hypothesis that the coefficients of CBSEX and Int are BOTH zero.

# If we look at the residual sum of squares for the complex model...

> aov(lm(CTRA~CTPEA+CBSEX+Int))
Call:
   aov(formula = lm(CTRA ~ CTPEA + CBSEX + Int))

Terms:
                   CTPEA    CBSEX      Int Residuals
Sum of Squares  1878.778  307.505   11.518  3302.699
Deg. of Freedom        1        1        1        46

Residual standard error: 8.473359
Estimated effects may be unbalanced

# ...and compare it to the residual sum of squares for the simpler model...

> aov(lm(CTRA~CTPEA))
Call:
   aov(formula = lm(CTRA ~ CTPEA))

Terms:
                   CTPEA Residuals
Sum of Squares  1878.778  3621.722
Deg. of Freedom        1        48

Residual standard error: 8.686343
Estimated effects may be unbalanced

# ...we see that the residual sum of squares goes up by 319.023 when we move
# to the simpler model. In other words, the simple regression of Raven on
# Peabody helps us understand that much less of the total sum of squares:

> SSchange <- 3621.722-3302.699
> SSchange
[1] 319.023

# The degrees of freedom associated with this change is the same as the
# number of parameters we dropped when we moved from the complex model
# to the simple model. We dropped the coefficients of CBSEX and Int, so
# that's two degrees of freedom change. So we can use that to calculate
# a mean square for the change between models:

> MSchange <- SSchange / 2
> MSchange
[1] 159.5115

# The error mean square from the more complex model is:

> MSe <- 3302.699 / 46

# And the corresponding F statistic tests whether we have lost anything
# by using the simpler model (i.e., tests simultaneously that both slopes
# we've dropped are zero):

> Fchange <- MSchange / MSe
> Fchange
[1] 2.221677

# The change is non-significant...
> 1-pf(Fchange, 2, 46)
[1] 0.1199332

# ...so we can conclude that we have no evidence that the regression differs
# for boys and girls.

# We now move on to considering an outlier test. Recall that there was a highly
# leveraged point at the right side of the scatterplot that didn't seem to fit
# the overall pattern:

> plot(CTPEA,CTRA)

# It's the 35th case in the data set:

> CTPEA[35]
[1] 123

# We can create a dummy variable that has the value zero for every other case,
# but 1 for the 35th case:

> Outlier <- c(rep(0,34),1,rep(0,15))
> Outlier
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
[39] 0 0 0 0 0 0 0 0 0 0 0 0

# If we include that as a predictor in the regression, the slope for the
# new variable is equivalent to the residual for the 35th case:

> lm(CTRA~CTPEA+Outlier)

Call:
lm(formula = CTRA ~ CTPEA + Outlier)

Coefficients:
(Intercept)        CTPEA      Outlier  
   -11.3313       0.5449      -8.6949  

# The t test for the coefficient of "Outlier" represents a test of whether
# the conditional mean for the case differs from the line that all of the
# other cases point to. Note that it's non-significant (p=.395). So we conclude
# that we do NOT have evidence that the case is an outlier:

> summary(lm(CTRA~CTPEA+Outlier))

Call:
lm(formula = CTRA ~ CTPEA + Outlier)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6282  -5.3166   0.1471   4.4602  17.3718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -11.3313     9.1743  -1.235    0.223    
CTPEA         0.5449     0.1148   4.745 1.98e-05 ***
Outlier      -8.6949    10.1385  -0.858    0.395    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.71 on 47 degrees of freedom
Multiple R-squared:  0.3517,	Adjusted R-squared:  0.3241 
F-statistic: 12.75 on 2 and 47 DF,  p-value: 3.772e-05

# That t test has a formal name: the "externally Studentized residual."
# We can calculate the externally Studentized residual for every case
# in the data set using the rstudent() function:

> rstudent(lm(CTRA~CTPEA))
           1            2            3            4            5            6 
 0.047636378 -1.685174968  0.921245511 -1.063933728 -0.771515334 -0.797088463 
           7            8            9           10           11           12 
-0.122988524 -0.906651289  0.278026926  0.354107510  1.110729448  1.899348261 
          13           14           15           16           17           18 
 1.718439701  0.160254593  0.334280337  0.451319854  0.049559370  1.646047808 
          19           20           21           22           23           24 
 0.048584086  1.773655461  1.095853978  0.102987516  0.838825604  0.507400405 
          25           26           27           28           29           30 
 2.094695346 -0.954790866 -0.070329905 -1.796552670 -0.357091287 -1.423222398 
          31           32           33           34           35           36 
-0.001179159  1.527591414 -1.606530750 -0.944842132 -0.857611495 -0.413441246 
          37           38           39           40           41           42 
 0.742886770  0.699285905  0.450098775  0.175330790 -1.861546044  0.403645304 
          43           44           45           46           47           48 
-1.640009366  0.157768726 -0.237320795 -1.625690833  0.219032203 -0.355235779 
          49           50 
-0.133685494 -0.295774404 

# Note that the 35th of those matches the t test in our multiple regression
# output. Each of these externally Studentized residuals represents a t test
# for outlier status, with degrees of freedom equal to N-3. (It would be a
# very poor practice, though, to routinely drop any cases with significant
# externally Studentized residuals; doing so would systematically exclude
# 5% of perfectly legitimate data.)

# We turned back to ANOVA next, focusing on a couple of approaches for an
# after-the-fact (or post hoc) investigation of which means differ. One such
# approach is Tukey's HSD (for "honestly significant difference").
# We see here that the only pair of means that significantly differs is
# the None-vs.-Massed comparison:

> TukeyHSD(aov(lm(Time~Practice)))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = lm(Time ~ Practice))

$Practice
                 diff        lwr       upr     p adj
none-massed    57.500  14.263071 100.73693 0.0081336
spaced-massed  38.875  -4.361929  82.11193 0.0831281
spaced-none   -18.625 -61.861929  24.61193 0.5330734

# A package called "DescTools" can do a variety of post hoc
# procedures. The following code to install the package was
# generated by the "Packages" menu:

> utils:::menuInstallPkgs()
--- Please select a CRAN mirror for use in this session ---
also installing the dependencies ‘rootSolve’, ‘e1071’, ‘lmom’, ‘rstudioapi’, ‘Exact’, ‘gld’

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/rootSolve_1.8.2.1.zip'
Content type 'application/zip' length 795313 bytes (776 KB)
downloaded 776 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/e1071_1.7-4.zip'
Content type 'application/zip' length 1017997 bytes (994 KB)
downloaded 994 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/lmom_2.8.zip'
Content type 'application/zip' length 593135 bytes (579 KB)
downloaded 579 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/rstudioapi_0.11.zip'
Content type 'application/zip' length 280801 bytes (274 KB)
downloaded 274 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/Exact_2.1.zip'
Content type 'application/zip' length 108014 bytes (105 KB)
downloaded 105 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/gld_2.6.2.zip'
Content type 'application/zip' length 256539 bytes (250 KB)
downloaded 250 KB

trying URL 'https://mirror.las.iastate.edu/CRAN/bin/windows/contrib/3.6/DescTools_0.99.38.zip'
Content type 'application/zip' length 5645469 bytes (5.4 MB)
downloaded 5.4 MB

package ‘rootSolve’ successfully unpacked and MD5 sums checked
package ‘e1071’ successfully unpacked and MD5 sums checked
package ‘lmom’ successfully unpacked and MD5 sums checked
package ‘rstudioapi’ successfully unpacked and MD5 sums checked
package ‘Exact’ successfully unpacked and MD5 sums checked
package ‘gld’ successfully unpacked and MD5 sums checked
package ‘DescTools’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\jvevea\AppData\Local\Temp\RtmpqwVqAv\downloaded_packages

# Once the package has been installed, we have to load it to use it (but we
# only need to install it once). The command to load a package is library():

> library(DescTools)
Warning message:
package ‘DescTools’ was built under R version 3.6.3 

# Here's how you would do Tukey's HSD using the package. (Note that the
# results are the same as those from the native R function.)

> PostHocTest(aov(lm(Time~Practice)), method="hsd")

  Posthoc multiple comparisons of means : Tukey HSD 
    95% family-wise confidence level

$Practice
                 diff     lwr.ci    upr.ci   pval    
none-massed    57.500  14.263071 100.73693 0.0081 ** 
spaced-massed  38.875  -4.361929  82.11193 0.0831 .  
spaced-none   -18.625 -61.861929  24.61193 0.5331    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# But we can do other post hoc tests as well. Here, for example,
# are more conservative Bonferroni comparisons:

> PostHocTest(aov(lm(Time~Practice)), method="bonferroni")

  Posthoc multiple comparisons of means : Bonferroni 
    95% family-wise confidence level

$Practice
                 diff     lwr.ci    upr.ci   pval    
none-massed    57.500  12.877405 102.12259 0.0091 ** 
spaced-massed  38.875  -5.747595  83.49759 0.1024    
spaced-none   -18.625 -63.247595  25.99759 0.8697    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Duncan's new multiple range test, a less conservative procedure, can be
# useful in lots of situations. One such situation arises when we mainly
# care about comparing two or more groups against a common reference or
# control group.
# In our case, it might be reasonable to want to compare Massed
# with None and Spaced with None, so None would be the control group.
# The function doesn't know that's how we are thinking, so it is
# up to us to consider only the first and third lines of the output:

> PostHocTest(aov(lm(Time~Practice)), method="duncan")

  Posthoc multiple comparisons of means : Duncan's new multiple range test 
    95% family-wise confidence level

$Practice
                 diff     lwr.ci   upr.ci   pval    
none-massed    57.500  20.048780 94.95122 0.0041 ** 
spaced-massed  38.875   3.202076 74.54792 0.0341 *  
spaced-none   -18.625 -54.297924 17.04792 0.2899    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# We ended by foreshadowing our next class, demonstrating that
# ANOVA can be viewed as a special case of multiple regression.
# Just as, when we considered the t test, we could identify two
# groups with a single dummy variable, here we can identify
# three groups using two dummy variables. First, we create a dummy
# variable indicating membership in the Massed group:

> MassedDummy <- c(rep(1,8),rep(0,16))
> MassedDummy
 [1] 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# Then we create a dummy variable indicating membership in the
# Spaced group:

> SpacedDummy <- c(rep(0,8),rep(1,8),rep(0,8))
> SpacedDummy
 [1] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

# If you aren't in the Massed or Spaced group, you must be
# in the None group; so Massed is indicated by the pattern 1, 0;
# Spaced is indicated by the pattern 0, 1; and None is indicated
# by the pattern 0, 0:

> cbind(Practice,MassedDummy,SpacedDummy)
      Practice MassedDummy SpacedDummy
 [1,]        1           1           0
 [2,]        1           1           0
 [3,]        1           1           0
 [4,]        1           1           0
 [5,]        1           1           0
 [6,]        1           1           0
 [7,]        1           1           0
 [8,]        1           1           0
 [9,]        3           0           1
[10,]        3           0           1
[11,]        3           0           1
[12,]        3           0           1
[13,]        3           0           1
[14,]        3           0           1
[15,]        3           0           1
[16,]        3           0           1
[17,]        2           0           0
[18,]        2           0           0
[19,]        2           0           0
[20,]        2           0           0
[21,]        2           0           0
[22,]        2           0           0
[23,]        2           0           0
[24,]        2           0           0

# Here's the ANOVA once again:

> anova(lm(Time~Practice))
Analysis of Variance Table

Response: Time
          Df Sum Sq Mean Sq F value   Pr(>F)   
Practice   2  13772  6885.9  5.8504 0.009559 **
Residuals 21  24717  1177.0                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Note that the residual mean square there is the same as the
# residual mean square from the regression on the two
# dummy variables:

> anova(lm(Time~MassedDummy+SpacedDummy))
Analysis of Variance Table

Response: Time
            Df  Sum Sq Mean Sq F value   Pr(>F)   
MassedDummy  1 12384.2 12384.2 10.5219 0.003888 **
SpacedDummy  1  1387.6  1387.6  1.1789 0.289885   
Residuals   21 24716.8  1177.0                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# (We will worry next time about how to get the overall F test
# from this approach.)
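# (A sketch added after class, not covered in the session: one way to
# recover the overall F from the dummy-variable regression is a nested F
# test, just like the one we did for the boys-and-girls regression --
# compare the intercept-only model with the two-dummy model, and
# anova(simple, complex) automates the whole SSchange/MSchange computation.
# Made-up data stand in for the math data set so the sketch runs on its own.)

```r
# Hypothetical stand-in data: 8 cases per group, roughly like the math data
set.seed(2)
Time <- c(rnorm(8, 55, 34), rnorm(8, 94, 34), rnorm(8, 113, 34))
MassedDummy <- c(rep(1, 8), rep(0, 16))
SpacedDummy <- c(rep(0, 8), rep(1, 8), rep(0, 8))

m0 <- lm(Time ~ 1)                            # intercept only: no group effects
m1 <- lm(Time ~ MassedDummy + SpacedDummy)    # the dummy-variable regression

# The F in this model comparison is the overall ANOVA F, on 2 and 21 df:
anova(m0, m1)
```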