# We reviewed the t.test() function:

> attach(JackStatlab)
> tapply(CTPEA, CBSEX, mean)
       0        1 
76.73077 86.91667 
> t.test(CTPEA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# Then we looked at what's happening behind the scenes with
# the t test. We look at the variability of each group:

> tapply(CTPEA, CBSEX, sd)
       0        1 
11.64666 13.61558 

# Because the standard deviations are so similar, it makes
# sense to use a combined estimate of variability. We figure
# out the sample sizes of both groups. Here, we can do that
# just by summing the dummy CBSEX variable; there are 24 boys:

> sum(CBSEX)
[1] 24
> n1 <- 26
> n2 <- 24

# The combined (or "pooled") variance estimate is a weighted
# average of the two variances, weighting by n-1 (i.e., the
# degrees of freedom) for each group:

> vp <- ((n1-1)*var(CTPEA[1:26]) + (n2-1)*var(CTPEA[27:50])) / (n1 + n2 - 2)

# We do a reality check.
# The combined estimate should be between the two groups'
# variances, and should be slightly closer to the value for
# the girls, where the sample size is a little larger:

> tapply(CTPEA, CBSEX, var)
       0        1 
135.6446 185.3841 
> vp
[1] 159.4781

# Now we use the pooled variance estimate to calculate the
# standard error of the difference between means:

> se <- sqrt(vp/n1 + vp/n2)
> se
[1] 3.57473

# Then we calculate the t statistic:

> t <- (mean(CTPEA[1:26]) - mean(CTPEA[27:50])) / se
> t
[1] -2.849417

# Note that this result matches the t.test() result:

> t.test(CTPEA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# Where did the p-value come from? The probability that a random
# draw from a t distribution with 48 degrees of freedom falls
# below our (negative) t value is:

> pt(t, 48)
[1] 0.003216964

# The t distribution is symmetric, so we can double that to include
# the probability of getting a value above the positive value of the
# statistic. Regardless of whether our result is positive or negative,
# the following approach will always work to get the probability content
# of both tails of the distribution:

> 2*pt(-abs(t), 48)
[1] 0.006433928

# If we use the t.test() function without the var.equal=TRUE argument,
# we get a slightly different value for the statistic and a fractional
# (noninteger) degrees of freedom:

> t.test(CTPEA~CBSEX)

        Welch Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8314, df = 45.476, p-value = 0.006879
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.429384  -2.942411
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# That form of the t test uses unpooled variances and does not require
# the assumption that the populations are equally variable.
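# Where does that fractional df come from? The Welch test approximates
# the degrees of freedom with the Welch-Satterthwaite formula. Here is
# a sketch of that calculation, plugging in the group variances and
# sample sizes reported above (this reproduces the formula by hand; it
# is not code the class transcript ran):

```r
# Welch-Satterthwaite approximate degrees of freedom, using the
# group variances and sample sizes shown earlier in the transcript
v1 <- 135.6446; n1 <- 26   # group 0 variance and sample size
v2 <- 185.3841; n2 <- 24   # group 1 variance and sample size

a <- v1/n1                 # each group's contribution to the
b <- v2/n2                 # squared standard error

df_welch <- (a + b)^2 / (a^2/(n1-1) + b^2/(n2-1))
df_welch                   # about 45.48, matching t.test()'s df = 45.476
```

# Note that df_welch falls between min(n1,n2) - 1 and n1 + n2 - 2;
# it only reaches the pooled value of 48 when the two per-group
# variance-over-n terms are equal.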
# How plausible is it that our two populations are equally variable?
# If we compare the standard deviations, they look pretty close:

> tapply(CTPEA, CBSEX, sd)
       0        1 
11.64666 13.61558 

# Our best guess at a common variability in the population would be
# the square root of the pooled variance:

> sqrt(vp)
[1] 12.62846

# If we repeatedly sample from a normal distribution with the same
# sample size as our girls' group, an arbitrary mean (0 here), and
# the square root of the pooled variance as the standard deviation,
# we routinely get sd estimates at least as far from 12.6 as the
# girls' and boys' sd's are:

> sd( rnorm(26, 0, sqrt(vp)))
[1] 14.76422
> sd( rnorm(26, 0, sqrt(vp)))
[1] 11.59821
> sd( rnorm(26, 0, sqrt(vp)))
[1] 12.889
> sd( rnorm(26, 0, sqrt(vp)))
[1] 12.50249
> sd( rnorm(26, 0, sqrt(vp)))
[1] 8.967739

# In addition to evaluating the equal variance assumption, we
# want to see if it seems reasonable that both groups come from
# normal populations. We can use stem-and-leaf plots:

> stem(CTPEA[1:26])

  The decimal point is 1 digit(s) to the right of the |

   5 | 9
   6 | 024
   6 | 68999
   7 | 2224
   7 | 7
   8 | 0011224
   8 | 5
   9 | 0
   9 | 56
  10 | 
  10 | 6

> stem(CTPEA[27:50])

  The decimal point is 1 digit(s) to the right of the |

   6 | 4515889
   8 | 013566689912
  10 | 3348
  12 | 2

# Both are piling up in the center. The positive tail looks
# a little long in both groups, but these are very small sample
# sizes, so I don't take that indication of positive skew too
# seriously.

# Another useful device for evaluating normality is the normal
# quantile-quantile plot. This plot does something like this:
#
#   Hey, we've got 26 observations here. If these were from a
#   normal distribution, how small would I expect the smallest
#   to be? How large would I expect the largest to be? Where
#   should the ones in between be? (That is, what are the
#   1st, 2nd, 3rd, ..., 26th quantiles?)
#   If I sort the variable and plot it against those quantiles,
#   I ought to get something like a straight line if the
#   distribution is normal.
#
# All of that is built into the qqnorm() function. (qqline() provides
# a reference line to help evaluate the linearity of the plot.)

> qqnorm(CTPEA[1:26]); qqline(CTPEA[1:26])

# There are a few points straying away from the line in the
# upper right corner, but this is a small data set. If we want
# to see if the plot would often look this bad when the distribution
# is KNOWN to be normal, we can build up some experience by
# using simulation:

> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)
> temp <- rnorm(26); qqnorm(temp); qqline(temp)

# Lots of those look worse than our actual plots,
# so I would say that we probably don't need to be concerned
# that our distributions are non-normal:

> qqnorm(CTPEA[1:26]); qqline(CTPEA[1:26])
> qqnorm(CTPEA[27:50]); qqline(CTPEA[27:50])

# We now return to our consideration of regression. Here's
# the easy way to get estimates of the intercept and slope:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# And we really should satisfy ourselves that the plot looks
# linear:

> plot(CTRA, CTPEA)

# But where do those estimates of the intercept and slope
# come from? Recall that the correlation coefficient...
> cor(CTPEA, CTRA)
[1] 0.4945577

# ...has in its numerator the sum of the products of deviations
# of each variable from its respective mean (divided here by the
# two standard deviations and by n-1 = 49):

> sum( (CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)) )/sd(CTPEA)/sd(CTRA)/49
[1] 0.4945577

# If we calculate exactly that same sum, but divide it by the sum
# of squared deviations from the mean for the predictor variable,
# we get the estimated slope:

> sum( (CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)) )/(49*var(CTRA)) -> slope
> slope
[1] 0.731873

# And once we have the slope, the intercept is just

> mean(CTPEA) - slope*mean(CTRA)
[1] 59.04904

# Note that those estimates match the values we got from the lm() function:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

> 
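# The slope formula above is the same thing as scaling the correlation
# by the ratio of the standard deviations: slope = r * sd(y)/sd(x).
# Here is a sketch verifying that identity on made-up data (x and y
# below are simulated stand-ins, not the JackStatlab variables):

```r
# Verify that cov(x,y)/var(x) equals cor(x,y)*sd(y)/sd(x) and matches lm()
set.seed(1)
x <- rnorm(50, 100, 15)              # stand-in predictor
y <- 60 + 0.7*x + rnorm(50, 0, 10)   # stand-in response

# slope via the sum-of-products formula (n - 1 = 49 here, as above)
slope1 <- sum( (y-mean(y))*(x-mean(x)) ) / (49*var(x))

# slope via the correlation identity
slope2 <- cor(x, y) * sd(y) / sd(x)

# slope via lm()
slope3 <- unname(coef(lm(y ~ x))[2])

c(slope1, slope2, slope3)            # all three agree
```

# This also explains why the slope and the correlation always share
# the same sign: sd(y)/sd(x) is positive, so the sign comes entirely
# from r.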