# We used the full data set of our TA sex and student sex # example to illustrate use of the Type III sum of squares # for dealing with unbalanced ANOVA. That is justified in # this example because the lack of balance in the design # occurred because of random luck rather than because of # something that informs us about what's true in the world. > attach(kellynils2) > length(Score) [1] 86 # Here's the usual ANOVA (which is not a correct analysis # because of the unbalanced design): > anova(lm(Score~TAsex+STsex+TAsex*STsex)) Analysis of Variance Table Response: Score Df Sum Sq Mean Sq F value Pr(>F) TAsex 1 202.9 202.88 0.9257 0.3388 STsex 1 418.1 418.06 1.9075 0.1710 TAsex:STsex 1 449.1 449.09 2.0491 0.1561 Residuals 82 17971.6 219.17 # Here's evidence that the cells have different N's: > table(TAsex,STsex) STsex TAsex F M F 26 14 M 39 7 # The car package is a "companion to Applied Regression." # "Applied Regression" is an excellent book on regression # by Fox and Weisberg. I had previously installed car using # the drop-down menu. Here, I use the library() command to # make its functions available: > library(car) Loading required package: carData # Here's the wrong analysis again: > unbalanced <- lm(Score~TAsex+STsex+TAsex*STsex) > anova(unbalanced) Analysis of Variance Table Response: Score Df Sum Sq Mean Sq F value Pr(>F) TAsex 1 202.9 202.88 0.9257 0.3388 STsex 1 418.1 418.06 1.9075 0.1710 TAsex:STsex 1 449.1 449.09 2.0491 0.1561 Residuals 82 17971.6 219.17 # And here are the results using the Type III # sums of squares: > options(contrasts=c("contr.sum","contr.poly")) > Anova(unbalanced, type="III") Anova Table (Type III tests) Response: Score Sum Sq Df F value Pr(>F) (Intercept) 177458 1 809.6981 < 2e-16 *** TAsex 5 1 0.0206 0.88612 STsex 854 1 3.8961 0.05177 . TAsex:STsex 449 1 2.0491 0.15610 Residuals 17972 82 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # Note that, whereas in the wrong analysis, student sex # was highly nonsignificant, here is is marginal (p=.052). # Remember when we were working with multiple regression, I # told you to interpret the t tests from the summary() wrapper # rather than F tests from the anova() wrapper, because they # were equivalent to Type III F statistics? Here, because everything # in the analysis has just 1 df (so that F tests and t tests are # equivalent), we can see that in action: the p-values for the # t tests in the summary() results are identical to those of # the Type III F tests in the car Anova() example: > summary(unbalanced) Call: lm(formula = Score ~ TAsex + STsex + TAsex * STsex) Residuals: Min 1Q Median 3Q Max -55.929 -7.481 2.654 9.923 24.071 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 82.6154 2.9033 28.455 <2e-16 *** TAsexM -0.5385 3.7482 -0.144 0.8861 STsexM -9.6868 4.9076 -1.974 0.0518 . TAsexM:STsexM 11.1813 7.8111 1.431 0.1561 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.8 on 82 degrees of freedom Multiple R-squared: 0.05619, Adjusted R-squared: 0.02166 F-statistic: 1.627 on 3 and 82 DF, p-value: 0.1893 # Here, we illustrate the use of Type III sums of squares # with another data set. I have artifically created a lack # of balance here by deleting the last three cases in the # data set, so that every cell has 10 observations except # for the Intentional Learning / Old People cell: > Eysenck2 <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/Eysenck2.csv") > unbalancedEysenck <- Eysenck2[1:97,] > detach(kellynils2) > attach(unbalancedEysenck) # Here's the wrong analysis: > anova(lm(Words~Group+Age+Group*Age)) Analysis of Variance Table Response: Words Df Sum Sq Mean Sq F value Pr(>F) Group 4 1357.00 339.25 41.6254 < 2.2e-16 *** Age 1 198.76 198.76 24.3870 3.772e-06 *** Group:Age 4 158.96 39.74 4.8759 0.001353 ** Residuals 87 709.06 8.15 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # And here's the Type III analysis. (Remember that the # Anova() wrapper works only if we have used the library() # command to load the car package.) > Anova(lm(Words~Group+Age+Group*Age), type="III") Anova Table (Type III tests) Response: Words Sum Sq Df F value Pr(>F) (Intercept) 12826.9 1 1573.8311 < 2.2e-16 *** Group 1425.4 4 43.7220 < 2.2e-16 *** Age 217.4 1 26.6746 1.509e-06 *** Group:Age 159.0 4 4.8759 0.001353 ** Residuals 709.1 87 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # We illustrated how to deal with a random-effects factor # in two-way ANOVA using an artificial data set. In this # fictitious example, we are interested in three specific # therapies (1, 2, and 3), so Therapy is a fixed effect. # On the other hand, we collected data from 10 different # counselors, not because we are interested in these particular # counselors, but because we needed that many counselors to # get enough data. Counselor, then represents 10 arbitrary # examples from a universe of possible counselors, and is thus # a random-effects factor. > random <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/random.csv") > random Score Counselor Therapy 1 31 1 1 2 27 1 1 3 21 2 1 4 26 2 1 5 41 3 1 6 40 3 1 7 24 4 1 8 29 4 1 9 35 5 1 10 28 5 1 11 36 6 1 12 33 6 1 13 21 7 1 14 21 7 1 15 31 8 1 16 34 8 1 17 35 9 1 18 40 9 1 19 24 10 1 20 26 10 1 21 39 1 2 22 46 1 2 23 32 2 2 24 30 2 2 25 46 3 2 26 50 3 2 27 34 4 2 28 32 4 2 29 42 5 2 30 47 5 2 31 39 6 2 32 43 6 2 33 26 7 2 34 30 7 2 35 32 8 2 36 35 8 2 37 44 9 2 38 43 9 2 39 30 10 2 40 27 10 2 41 35 1 3 42 28 1 3 43 31 2 3 44 25 2 3 45 42 3 3 46 39 3 3 47 36 4 3 48 38 4 3 49 41 5 3 50 37 5 3 51 38 6 3 52 38 6 3 53 27 7 3 54 25 7 3 55 29 8 3 56 31 8 3 57 45 9 3 58 41 9 3 59 31 10 3 60 26 10 3 > > attach(random) # We have to be careful here: the levels of the factors # are indicated with numbers, so if we do this... > anova(lm(Score~Counselor+Therapy+Counselor*Therapy)) Analysis of Variance Table Response: Score Df Sum Sq Mean Sq F value Pr(>F) Counselor 1 33.88 33.879 0.6562 0.42135 Therapy 1 160.00 160.000 3.0988 0.08381 . Counselor:Therapy 1 0.88 0.876 0.0170 0.89685 Residuals 56 2891.43 51.633 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # ...R just does a multiple regression, rather than creating # a coding system to identify the levels of the factors. # Here, we force R to see these as factors and create an # appropriate coding system. Note the dramatic change in the # degrees of freedom: > anova(lm(Score~factor(Counselor)+factor(Therapy)+factor(Counselor)*factor(Therapy))) Analysis of Variance Table Response: Score Df Sum Sq Mean Sq F value Pr(>F) factor(Counselor) 9 2059.68 228.854 28.908 2.103e-12 *** factor(Therapy) 2 520.53 260.267 32.876 2.752e-08 *** factor(Counselor):factor(Therapy) 18 268.47 14.915 1.884 0.06067 . Residuals 30 237.50 7.917 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 # If both factors were fixed effects, that would be the correct analysis. # But Counselor is a random effect, so we need to use a special error # term for the test about the Therapy main effect: > F <- 260.267 / 14.915 > F [1] 17.45002 # The result is still highly significant, but somewhat less so than in # the original analysis: > 1 - pf(F, 2, 18) [1] 6.114332e-05 >