# We reviewed the t.test() function:

> attach(JackStatlab)
> tapply(CTPEA, CBSEX, mean)
       0        1 
76.73077 86.91667 
> t.test(CTPEA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# Then we looked at what's happening behind the scenes with
# the t test. We look at the variability of each group:

> tapply(CTPEA, CBSEX, sd)
       0        1 
11.64666 13.61558 

# Because the standard deviations are so similar, it makes
# sense to use a combined estimate of variability. We figure
# out the sample sizes of both groups. Here, we can do that
# just by summing the dummy CBSEX variable; there are 24 boys:

> sum(CBSEX)
[1] 24
> n1 <- 26
> n2 <- 24

# The combined (or "pooled") variance estimate is a weighted
# average of the two variances, weighting by n-1 (i.e., the
# degrees of freedom) for each group:

> vp <- ((n1-1)*var(CTPEA[1:26]) + (n2-1)*var(CTPEA[27:50])) / (n1 + n2 - 2)

# We do a reality check.
# The combined estimate should be between the two groups'
# variances, and should be slightly closer to the value for
# the girls, where the sample size is a little larger:

> tapply(CTPEA, CBSEX, var)
       0        1 
135.6446 185.3841 
> vp
[1] 159.4781

# Now we use the pooled variance estimate to calculate the
# standard error of the difference between means:

> se <- sqrt(vp/n1 + vp/n2)
> se
[1] 3.57473

# Then we calculate the t statistic:

> t <- (mean(CTPEA[1:26]) - mean(CTPEA[27:50])) / se
> t
[1] -2.849417

# Note that this result matches the t.test() result:

> t.test(CTPEA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# Where did the p-value come from? The probability that a random
# draw from a t distribution with 48 degrees of freedom falls
# below our (negative) t value is:

> pt(t, 48)
[1] 0.003216964

# The t distribution is symmetric, so we can double that to include
# the probability of getting a value above the positive value of the
# statistic. Regardless of whether our result is positive or negative,
# the following approach will always work to get the probability content
# of both tails of the distribution:

> 2*pt(-abs(t), 48)
[1] 0.006433928

# If we use the t.test() function without the var.equal=TRUE argument,
# we get a slightly different value for the statistic and a fractional
# (noninteger) degrees of freedom:

> t.test(CTPEA~CBSEX)

        Welch Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8314, df = 45.476, p-value = 0.006879
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.429384  -2.942411
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# That form of the t test uses unpooled variances and does not require
# the assumption that the populations are equally variable.
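# Where does that fractional df come from? The Welch test approximates
# the degrees of freedom with the Welch-Satterthwaite formula. Here is
# a sketch of that calculation, plugging in the group variances and
# sample sizes reported above (this reproduces the formula by hand; it
# is not code the class transcript ran):

```r
# Welch-Satterthwaite approximate degrees of freedom, using the
# group variances and sample sizes shown earlier in the transcript
v1 <- 135.6446; n1 <- 26   # group 0 variance and sample size
v2 <- 185.3841; n2 <- 24   # group 1 variance and sample size

a <- v1/n1                 # each group's contribution to the
b <- v2/n2                 # squared standard error

df_welch <- (a + b)^2 / (a^2/(n1-1) + b^2/(n2-1))
df_welch                   # about 45.48, matching t.test()'s df = 45.476
```

# Note that df_welch falls between min(n1,n2) - 1 and n1 + n2 - 2;
# it only reaches the pooled value of 48 when the two per-group
# variance-over-n terms are equal.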
# How plausible is it that our two populations are equally variable?
# If we compare the standard deviations, they look pretty close:

> tapply(CTPEA, CBSEX, sd)
       0        1 
11.64666 13.61558 

# Our best guess at a common variability in the population would be
# the square root of the pooled variance:

> sqrt(vp)
[1] 12.62846

# If we repeatedly sample from a normal distribution with the same
# sample size as our girls' group, an arbitrary mean (0 here), and
# the square root of the pooled variance as the standard deviation,
# we routinely get sd estimates at least as far from 12.6 as the
# girls' and boys' sd's are:

> sd( rnorm(26, 0, sqrt(vp)))
[1] 14.76422
> sd( rnorm(26, 0, sqrt(vp)))
[1] 11.59821
> sd( rnorm(26, 0, sqrt(vp)))
[1] 12.889
> sd( rnorm(26, 0, sqrt(vp)))
[1] 12.50249
> sd( rnorm(26, 0, sqrt(vp)))
[1] 8.967739

# In addition to evaluating the equal variance assumption, we
# want to see if it seems reasonable that both groups come from
# normal populations. We can use stem-and-leaf plots:

> stem(CTPEA[1:26])

  The decimal point is 1 digit(s) to the right of the |

   5 | 9
   6 | 024
   6 | 68999
   7 | 2224
   7 | 7
   8 | 0011224
   8 | 5
   9 | 0
   9 | 56
  10 | 
  10 | 6

> stem(CTPEA[27:50])

  The decimal point is 1 digit(s) to the right of the |

   6 | 4515889
   8 | 013566689912
  10 | 3348
  12 | 2

# Both are piling up in the center. The positive tail looks
# a little long in both groups, but these are very small sample
# sizes, so I don't take that indication of positive skew too
# seriously.

# Another useful device for evaluating normality is the normal
# quantile-quantile plot. This plot does something like this:
#
#   Hey, we've got 26 observations here. If these were from a
#   normal distribution, how small would I expect the smallest
#   to be? How large would I expect the largest to be? Where
#   should the ones in between be? (That is, what are the
#   1st, 2nd, 3rd, ..., 26th quantiles?)
#   If I sort the variable and plot it against those quantiles,
#   I ought to get something like a straight line if the
#   distribution is normal.
#
# All of that is built into the qqnorm() function. (qqline() provides
# a reference line to help evaluate the linearity of the plot.)

> qqnorm(CTPEA[1:26]); qqline(CTPEA[1:26])

# There are a few points straying away from the line in the
# upper right corner, but this is a small data set. If we want
# to see if the plot would often look this bad when the distribution
# is KNOWN to be normal, we can build up some experience by
# using simulation:

> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)

# Lots of those look worse than our actual plots,
# so I would say that we probably don't need to be concerned
# that our distributions are non-normal:

> qqnorm(CTPEA[1:26]); qqline(CTPEA[1:26])
> qqnorm(CTPEA[27:50]); qqline(CTPEA[27:50])

# We now return to our consideration of regression. Here's
# the easy way to get estimates of the intercept and slope:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# And we really should satisfy ourselves that the plot looks
# linear:

> plot(CTRA, CTPEA)

# But where do those estimates of the intercept and slope
# come from? Recall that the correlation coefficient...
> cor(CTPEA, CTRA)
[1] 0.4945577

# ...has in its numerator the sum of the products of deviations
# of each variable from its respective mean (divided here by the
# two standard deviations and by n-1 = 49):

> sum( (CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)) )/sd(CTPEA)/sd(CTRA)/49
[1] 0.4945577

# If we calculate exactly that same sum, but divide it by the sum
# of squared deviations from the mean for the predictor variable,
# we get the estimated slope:

> sum( (CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)) )/(49*var(CTRA)) -> slope
> slope
[1] 0.731873

# And once we have the slope, the intercept is just

> mean(CTPEA) - slope*mean(CTRA)
[1] 59.04904

# Note that those estimates match the values we got from the lm() function:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

> 
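# The slope formula above is the same thing as scaling the correlation
# by the ratio of the standard deviations: slope = r * sd(y)/sd(x).
# Here is a sketch verifying that identity on made-up data (x and y
# below are simulated stand-ins, not the JackStatlab variables):

```r
# Verify that cov(x,y)/var(x) equals cor(x,y)*sd(y)/sd(x) and matches lm()
set.seed(1)
x <- rnorm(50, 100, 15)              # stand-in predictor
y <- 60 + 0.7*x + rnorm(50, 0, 10)   # stand-in response

# slope via the sum-of-products formula (n - 1 = 49 here, as above)
slope1 <- sum( (y-mean(y))*(x-mean(x)) ) / (49*var(x))

# slope via the correlation identity
slope2 <- cor(x, y) * sd(y) / sd(x)

# slope via lm()
slope3 <- unname(coef(lm(y ~ x))[2])

c(slope1, slope2, slope3)            # all three agree
```

# This also explains why the slope and the correlation always share
# the same sign: sd(y)/sd(x) is positive, so the sign comes entirely
# from r.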