# We used the full data set of our TA sex and student sex
# example to illustrate use of the Type III sum of squares
# for dealing with unbalanced ANOVA. That is justified in
# this example because the lack of balance in the design
# occurred because of random luck rather than because of
# something that informs us about what's true in the world. 
 
> attach(kellynils2)
> length(Score)
[1] 86

# Here's the usual ANOVA (which is not a correct analysis
# because of the unbalanced design):

> anova(lm(Score~TAsex+STsex+TAsex*STsex))
Analysis of Variance Table

Response: Score
            Df  Sum Sq Mean Sq F value Pr(>F)
TAsex        1   202.9  202.88  0.9257 0.3388
STsex        1   418.1  418.06  1.9075 0.1710
TAsex:STsex  1   449.1  449.09  2.0491 0.1561
Residuals   82 17971.6  219.17               

# Here's evidence that the cells have different N's:

> table(TAsex,STsex)
     STsex
TAsex  F  M
    F 26 14
    M 39  7

# The car package is a "companion to Applied Regression." 
# "Applied Regression" is an excellent book on regression
# by Fox and Weisberg. I had previously installed car using
# the drop-down menu. Here, I use the library() command to
# make its functions available:

> library(car)
Loading required package: carData

# Here's the wrong analysis again:

> unbalanced <- lm(Score~TAsex+STsex+TAsex*STsex)
> anova(unbalanced)
Analysis of Variance Table

Response: Score
            Df  Sum Sq Mean Sq F value Pr(>F)
TAsex        1   202.9  202.88  0.9257 0.3388
STsex        1   418.1  418.06  1.9075 0.1710
TAsex:STsex  1   449.1  449.09  2.0491 0.1561
Residuals   82 17971.6  219.17               

# And here are the results using the Type III
# sums of squares:

> options(contrasts=c("contr.sum","contr.poly"))
> Anova(unbalanced, type="III")
Anova Table (Type III tests)

Response: Score
            Sum Sq Df  F value  Pr(>F)    
(Intercept) 177458  1 809.6981 < 2e-16 ***
TAsex            5  1   0.0206 0.88612    
STsex          854  1   3.8961 0.05177 .  
TAsex:STsex    449  1   2.0491 0.15610    
Residuals    17972 82                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Note that, whereas in the wrong analysis, student sex
# was highly nonsignificant, here is is marginal (p=.052).

# Remember when we were working with multiple regression, I
# told you to interpret the t tests from the summary() wrapper
# rather than F tests from the anova() wrapper, because they
# were equivalent to Type III F statistics? Here, because everything
# in the analysis has just 1 df (so that F tests and t tests are
# equivalent), we can see that in action: the p-values for the
# t tests in the summary() results are identical to those of
# the Type III F tests in the car Anova() example:

> summary(unbalanced)

Call:
lm(formula = Score ~ TAsex + STsex + TAsex * STsex)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.929  -7.481   2.654   9.923  24.071 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    82.6154     2.9033  28.455   <2e-16 ***
TAsexM         -0.5385     3.7482  -0.144   0.8861    
STsexM         -9.6868     4.9076  -1.974   0.0518 .  
TAsexM:STsexM  11.1813     7.8111   1.431   0.1561    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.8 on 82 degrees of freedom
Multiple R-squared:  0.05619,   Adjusted R-squared:  0.02166 
F-statistic: 1.627 on 3 and 82 DF,  p-value: 0.1893

# Here, we illustrate the use of Type III sums of squares
# with another data set. I have artifically created a lack
# of balance here by deleting the last three cases in the
# data set, so that every cell has 10 observations except
# for the Intentional Learning / Old People cell:

> Eysenck2 <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/Eysenck2.csv")
> unbalancedEysenck <- Eysenck2[1:97,]
> detach(kellynils2)
> attach(unbalancedEysenck)

# Here's the wrong analysis:

> anova(lm(Words~Group+Age+Group*Age))
Analysis of Variance Table

Response: Words
          Df  Sum Sq Mean Sq F value    Pr(>F)    
Group      4 1357.00  339.25 41.6254 < 2.2e-16 ***
Age        1  198.76  198.76 24.3870 3.772e-06 ***
Group:Age  4  158.96   39.74  4.8759  0.001353 ** 
Residuals 87  709.06    8.15                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# And here's the Type III analysis. (Remember that the
# Anova() wrapper works only if we have used the library()
# command to load the car package.)

> Anova(lm(Words~Group+Age+Group*Age), type="III")
Anova Table (Type III tests)

Response: Words
             Sum Sq Df   F value    Pr(>F)    
(Intercept) 12826.9  1 1573.8311 < 2.2e-16 ***
Group        1425.4  4   43.7220 < 2.2e-16 ***
Age           217.4  1   26.6746 1.509e-06 ***
Group:Age     159.0  4    4.8759  0.001353 ** 
Residuals     709.1 87                        
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


# We illustrated how to deal with a random-effects factor
# in two-way ANOVA using an artificial data set. In this
# fictitious example, we are interested in three specific
# therapies (1, 2, and 3), so Therapy is a fixed effect.
# On the other hand, we collected data from 10 different
# counselors, not because we are interested in these particular
# counselors, but because we needed that many counselors to
# get enough data. Counselor, then represents 10 arbitrary
# examples from a universe of possible counselors, and is thus
# a random-effects factor.

> random <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/random.csv")
> random
   Score Counselor Therapy
1     31         1       1
2     27         1       1
3     21         2       1
4     26         2       1
5     41         3       1
6     40         3       1
7     24         4       1
8     29         4       1
9     35         5       1
10    28         5       1
11    36         6       1
12    33         6       1
13    21         7       1
14    21         7       1
15    31         8       1
16    34         8       1
17    35         9       1
18    40         9       1
19    24        10       1
20    26        10       1
21    39         1       2
22    46         1       2
23    32         2       2
24    30         2       2
25    46         3       2
26    50         3       2
27    34         4       2
28    32         4       2
29    42         5       2
30    47         5       2
31    39         6       2
32    43         6       2
33    26         7       2
34    30         7       2
35    32         8       2
36    35         8       2
37    44         9       2
38    43         9       2
39    30        10       2
40    27        10       2
41    35         1       3
42    28         1       3
43    31         2       3
44    25         2       3
45    42         3       3
46    39         3       3
47    36         4       3
48    38         4       3
49    41         5       3
50    37         5       3
51    38         6       3
52    38         6       3
53    27         7       3
54    25         7       3
55    29         8       3
56    31         8       3
57    45         9       3
58    41         9       3
59    31        10       3
60    26        10       3
> 
> attach(random)

# We have to be careful here: the levels of the factors
# are indicated with numbers, so if we do this...

> anova(lm(Score~Counselor+Therapy+Counselor*Therapy))
Analysis of Variance Table

Response: Score
                  Df  Sum Sq Mean Sq F value  Pr(>F)  
Counselor          1   33.88  33.879  0.6562 0.42135  
Therapy            1  160.00 160.000  3.0988 0.08381 .
Counselor:Therapy  1    0.88   0.876  0.0170 0.89685  
Residuals         56 2891.43  51.633                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# ...R just does a multiple regression, rather than creating
# a coding system to identify the levels of the factors.

# Here, we force R to see these as factors and create an
# appropriate coding system. Note the dramatic change in the
# degrees of freedom:

> anova(lm(Score~factor(Counselor)+factor(Therapy)+factor(Counselor)*factor(Therapy)))
Analysis of Variance Table

Response: Score
                                  Df  Sum Sq Mean Sq F value    Pr(>F)    
factor(Counselor)                  9 2059.68 228.854  28.908 2.103e-12 ***
factor(Therapy)                    2  520.53 260.267  32.876 2.752e-08 ***
factor(Counselor):factor(Therapy) 18  268.47  14.915   1.884   0.06067 .  
Residuals                         30  237.50   7.917                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# If both factors were fixed effects, that would be the correct analysis.
# But Counselor is a random effect, so we need to use a special error
# term for the test about the Therapy main effect:

> F <- 260.267 / 14.915
> F
[1] 17.45002

# The result is still highly significant, but somewhat less so than in
# the original analysis:

> 1 - pf(F, 2, 18)
[1] 6.114332e-05
>