# A student asked about evaluating the assumption that two populations
# are equally variable (needed for the two-sample t test). We looked at
# an example in which I compare the Raven scores for boys and girls.
# Their standard deviations aren't TOO far apart, but they're also not
# that close:

> attach(JackStatlab)
> tapply(CTRA, CBSEX, sd)
        0         1 
10.627975  7.408867 

# One approach that I find useful is to calculate the pooled variance estimate...

> tapply(CTRA, CBSEX, var)
       0        1 
112.9538  54.8913 
> table(CBSEX)
CBSEX
 0  1 
26 24 
> vp <- (25*var(CTRA[1:26]) + 23*var(CTRA[27:50])) / (25+23)
> vp
[1] 85.13221

# ...and use it to calculate a pooled standard deviation:

> sqrt(vp)
[1] 9.226712

# Then, if I sample from a normal distribution with that
# standard deviation, using the sample size from one of my
# groups, I look to see if we get sample sd's as far from
# the pooled value as the actual values we saw in our sample
# (10.6 and 7.4). Here are a few samples:

> sd( rnorm(24, 0, 9.23))
[1] 7.001852
> sd( rnorm(24, 0, 9.23))
[1] 10.10173
> sd( rnorm(24, 0, 9.23))
[1] 12.61102
> sd( rnorm(24, 0, 9.23))
[1] 8.928225
> sd( rnorm(24, 0, 9.23))
[1] 9.13923
> sd( rnorm(24, 0, 9.23))
[1] 9.511692
> sd( rnorm(24, 0, 9.23))
[1] 7.59977

# Note that we are frequently getting values as far from
# the pooled estimate as the actual values in our sample.
# For that reason, I would not be too concerned about the
# two populations having different variability.

# Last time, we tested the null hypothesis that the slope
# in the regression of Peabody on Raven was zero, working
# from the defining formulas in the ANOVA table. Of course,
# we don't need to do it that way; R can do the work for us.
# Here's that regression again:

> PBonRA <- lm(CTPEA~CTRA)
> PBonRA

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# Remember that in a regression output object, there is more
# information lurking behind the scenes, such as the estimated
# conditional means...

> PBonRA$fit
       1        2        3        4        5        6        7        8 
74.41837 70.02713 91.98332 84.66459 78.80961 78.07773 70.02713 75.88212 
       9       10       11       12       13       14       15       16 
68.56339 86.86021 74.41837 75.88212 87.59208 84.66459 82.46897 93.44707 
      17       18       19       20       21       22       23       24 
91.98332 89.05583 82.46897 76.61399 86.86021 77.34586 91.25145 88.32396 
      25       26       27       28       29       30       31       32 
71.49088 90.51958 74.41837 79.54148 86.86021 79.54148 81.73710 87.59208 
      33       34       35       36       37       38       39       40 
74.41837 86.12834 74.41837 75.15024 75.88212 72.22275 79.54148 87.59208 
      41       42       43       44       45       46       47       48 
89.78770 78.07773 83.20085 89.05583 83.93272 84.66459 81.00523 83.20085 
      49       50 
79.54148 89.78770 

# ...and the residuals:

> PBonRA$res
          1           2           3           4           5           6 
 -2.4183697  -1.0271318 -11.9833216   5.3354083  -1.8096077   5.9222653 
          7           8           9          10          11          12 
-11.0271318  -3.8821157  -6.5633858  -4.8602106  -8.4183697  -3.8821157 
         13          14          15          16          17          18 
  8.4079164  -4.6645917  -8.4689727 -12.4470676  -9.9833216  -8.0558296 
         19          20          21          22          23          24 
-13.4689727  -8.6139887  -1.8602106  -8.3458617  14.7485514   6.6760434 
         25          26          27          28          29          30 
-11.4908778 -26.5195756  11.5816303  -0.5414807  -3.8602106   0.4585193 
         31          32          33          34          35          36 
 -3.7370997 -23.5920836  14.5816303   2.8716624   6.5816303  10.8497573 
         37          38          39          40          41          42 
 -0.8821157  -1.2227508  42.4585193  16.4079164  13.2122974 -13.0777347 
         43          44          45          46          47          48 
  8.7991543  18.9441704   2.0672813   0.3354083   6.9947733  -5.2008457 
         49          50 
 11.4585193  13.2122974 
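# A quick sketch of what else is stored in that object (a minimal sketch;
# output not shown): names() lists all of the components, and the fitted
# values plus the residuals should add back up to the observed Peabody scores.

names(PBonRA)                           # components such as coefficients, residuals, fitted.values
all.equal(unname(PBonRA$fit + PBonRA$res), as.numeric(CTPEA))   # should be TRUE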
# R can take that information and do the inference for us. One way to
# do that is to ask for the anova table:

> anova(PBonRA)
Analysis of Variance Table

Response: CTPEA
          Df Sum Sq Mean Sq F value    Pr(>F)    
CTRA       1 2189.0 2189.00  15.541 0.0002616 ***
Residuals 48 6760.8  140.85                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Note that we got the same F statistic and p-value that we obtained
# manually in Tuesday's class.

# Another way to do inference is to ask R for a "summary" of
# the regression:

> summary(PBonRA)

Call:
lm(formula = CTPEA ~ CTRA)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.520  -8.273  -1.516   6.915  42.459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.0490     5.9663   9.897 3.55e-13 ***
CTRA          0.7319     0.1856   3.942 0.000262 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.87 on 48 degrees of freedom
Multiple R-squared:  0.2446,    Adjusted R-squared:  0.2288 
F-statistic: 15.54 on 1 and 48 DF,  p-value: 0.0002616

# This gives us the same F statistic as before on the last
# line of the output. It also gives us the coefficient of
# determination (R squared). But in addition to those results,
# we also get a t-test for the null hypothesis that the slope
# is zero. The t-test is EXACTLY equivalent to the F test; note
# that the p-values are identical. But the t-test is obtained
# the same way most t statistics are: by dividing an estimate
# by its standard error. Here, we take the output and calculate
# the t:

> .7319 / .1856
[1] 3.943427

# The t is just the square root of the F statistic (so squaring
# the t gives back the F):

> 3.942^2
[1] 15.53936

# One nice thing about having the standard error of the slope
# estimate is that we can use it to calculate a confidence interval
# (a range of values that represent null hypotheses we would NOT
# reject using this data set). First, we use the t quantile function
# to figure out how big a t statistic would need to be in order
# for us to reject the null:

> qt(.975, 48)
[1] 2.010635

# That's called the "critical value" of the t statistic. We used .975
# because we are thinking of tests at an alpha level of .05, and the
# .975th quantile puts half of that .05 in the upper tail. (We would
# also reject the null if we got a t below NEGATIVE 2.01.)

> qt(.975, 48) -> tcrit

# We can then get the range of values that would NOT lead to rejection
# by taking our estimate plus or minus the critical value times the
# standard error:

> .7319 - tcrit*0.1856
[1] 0.3587262
> .7319 + tcrit*0.1856
[1] 1.105074

# So we would say that a 95% confidence interval for the slope runs from
# 0.36 to 1.11. It's reasonable for us to act as if the slope falls in that
# range because if we did lots of independent samples and calculated confidence
# intervals this way, 95% of them would include the true slope. And, of course,
# our best guess at the true value of the slope is still .7319.
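# A shortcut worth knowing (a minimal sketch; output not shown): R's confint()
# function computes the same kind of interval directly from the fitted model,
# using the same critical value and standard error behind the scenes.

confint(PBonRA, level = 0.95)    # one row for the intercept, one for the CTRA slope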
# It's possible to do this sort of confidence interval for anything that can
# be tested with a t statistic. For example, when we use R's t.test() function,
# we get a 95% confidence interval for the difference between means (-17.373
# to -2.998 in this example):

> t.test(CTPEA~CBSEX, var.equal=TRUE)

	Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# The t.test() function can also do a one-sample t test for the null
# hypothesis that a single population mean equals a particular value
# (here, I arbitrarily chose the value 80). The output includes a
# 95% confidence interval for the mean:

> t.test(CTPEA, mu=80)

	One Sample t-test

data:  CTPEA
t = 0.8476, df = 49, p-value = 0.4008
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
 77.77915 85.46085
sample estimates:
mean of x 
    81.62 

# To see where that interval came from, we can do it manually. We need
# the standard error of the mean:

> sd(CTPEA)/sqrt(50)
[1] 1.911275

# We need the critical value of t:

> tcrit <- qt(.975,49)

# If we calculate our estimated mean plus or minus the critical
# value times the standard error, we get the same confidence
# interval that the t.test() function produced:

> 81.62 - tcrit*1.911275
[1] 77.77915
> 81.62 + tcrit*1.911275
[1] 85.46085

# We turn now to considering different ways to understand
# the correlation coefficient. Back when we first introduced correlation,
# we talked about it as the standardized covariance. Here's the
# covariance between Peabody and Raven:

> var(CTRA,CTPEA)
[1] 61.04

# If we standardize it...

> var(CTRA,CTPEA)/sd(CTRA)/sd(CTPEA)
[1] 0.4945577

# ...we get the correlation coefficient:

> cor(CTRA,CTPEA)
[1] 0.4945577

# Last class, we talked about the correlation as the square root
# of the coefficient of determination (R squared), with the sign
# of the regression slope attached:

> summary(PBonRA)

Call:
lm(formula = CTPEA ~ CTRA)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.520  -8.273  -1.516   6.915  42.459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.0490     5.9663   9.897 3.55e-13 ***
CTRA          0.7319     0.1856   3.942 0.000262 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.87 on 48 degrees of freedom
Multiple R-squared:  0.2446,    Adjusted R-squared:  0.2288 
F-statistic: 15.54 on 1 and 48 DF,  p-value: 0.0002616

> sqrt( 0.2446)
[1] 0.4945705

# We now introduce a third way of understanding the correlation
# coefficient. If we standardize both variables (that is, change
# the metric so that they both have a mean of zero and a standard
# deviation of one)...

> PBstd <- (CTPEA-mean(CTPEA))/sd(CTPEA)
> RAstd <- (CTRA-mean(CTRA))/sd(CTRA)
> mean(RAstd)
[1] 2.359874e-17
> sd(RAstd)
[1] 1

# ...the scatterplot still looks the same except for the numbers
# on the axes:

> plot(CTRA,CTPEA)
> plot(RAstd,PBstd)

# So the correlation is still the same:

> cor(RAstd,PBstd)
[1] 0.4945577

# But if we do the regression, the slope of the standardized
# regression is identical to the correlation:

> lm(PBstd~RAstd)

Call:
lm(formula = PBstd ~ RAstd)

Coefficients:
(Intercept)        RAstd  
 -2.980e-16    4.946e-01  

# And unlike most regressions, we get the same result regardless
# of which variable we regress on which:

> lm(PBstd~RAstd)$coef
  (Intercept)         RAstd 
-2.979758e-16  4.945577e-01 
> lm(RAstd~PBstd)$coef
  (Intercept)         PBstd 
 1.948089e-16  4.945577e-01 
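# A minimal sketch of the same idea (output not shown): R's scale() function
# does the centering and rescaling in one step, so the standardized slope can
# be computed without building PBstd and RAstd by hand.

coef(lm(scale(CTPEA) ~ scale(CTRA)))    # the slope should match cor(CTRA, CTPEA)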
# That gives us yet another way to understand correlation. Remember that
# the slope in a regression always indicates the expected change in the
# conditional mean associated with a one-unit change in the predictor.
# But here, "one unit" means one standard deviation. So the correlation
# between two variables represents the expected number of standard deviations
# of change in one variable associated with a one standard deviation change
# in the other.

# We turn our attention now to assessing the assumptions necessary for
# inference in regression. The first assumption is that it makes sense to
# use linear regression in the first place: the relationship must be
# linear. One device that is useful for assessing linearity is the
# scatterplot. But another approach is to plot the residuals against the
# fitted values (the estimated conditional means).

# Here, we do both plots. Neither one seems to show evidence of a curvilinear
# relationship:

> plot(CTRA, CTPEA)
> plot(PBonRA$fit, PBonRA$res)

# Why should we consider using the residuals plot to assess linearity? An
# example will make that clear. Here, I created a relationship that is
# nonlinear: y equals x to the power 1.2:

> x <- seq(1,2,.1)
> y <- x^1.2

# The scatterplot looks linear:

> plot(x,y)

# But the residuals plot clearly shows the nonlinearity we know is there
# because of how we created the variables:

> temp <- lm(y~x)
> plot(temp$fit, temp$res)

# If we go back to the scatterplot and add the regression line, we can
# see the nonlinearity, but it's much easier to see in the residuals plot:

> plot(x,y)
> abline(temp$coef)
> plot(temp$fit, temp$res)

# The residuals plot is also useful for evaluating the assumption that
# the variation of the errors (for which our residuals are a proxy) is
# the same across the range of fitted values. Here, there's a slight
# tendency for the variation in the residuals to increase as we move from
# left to right in the plot, but it's only slight, so I'm not too worried:

> plot(PBonRA$fit, PBonRA$res)

# Finally, we can evaluate the assumption that the errors are normally
# distributed by checking the normality of the residuals:

> stem(PBonRA$res)

  The decimal point is 1 digit(s) to the right of the |

  -2 | 74
  -1 | 3322110
  -0 | 98888755544442221111
   0 | 00235677789
   1 | 112335569
   2 | 
   3 | 
   4 | 2

> qqnorm(PBonRA$res); qqline(PBonRA$res)

# These plots certainly don't indicate strong evidence of non-normality.
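# For completeness, a minimal sketch of two other checks (output not shown):
# R's plot() method for lm objects will draw the residuals-versus-fitted and
# normal Q-Q plots for us, and shapiro.test() gives a formal significance test
# of normality for the residuals.

plot(PBonRA, which = 1:2)     # built-in residual and Q-Q diagnostic plots
shapiro.test(PBonRA$res)      # null hypothesis: the residuals come from a normal distribution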