# A student asked about evaluating the assumption that two populations
# are equally variable (needed for the two-sample t test). We looked at
# an example in which I compare the Raven scores for boys and girls.
# Their standard deviations aren't TOO far apart, but they're also not
# that close:

> attach(JackStatlab)
> tapply(CTRA, CBSEX, sd)
        0         1 
10.627975  7.408867 

# One approach that I find useful is to calculate the pooled variance estimate...

> tapply(CTRA, CBSEX, var)
       0        1 
112.9538  54.8913 
> table(CBSEX)
CBSEX
 0  1 
26 24 
> vp <- (25*var(CTRA[1:26]) + 23*var(CTRA[27:50])) / (25+23)
> vp
[1] 85.13221

# ...and use it to calculate a pooled standard deviation:

> sqrt(vp)
[1] 9.226712

# Then, if I sample from a normal distribution with that
# standard deviation, using the sample size from one of my
# groups, I look to see if we get sample sd's as far from
# the pooled value as the actual values we saw in our sample
# (10.6 and 7.4). Here are a few samples:

> sd( rnorm(24, 0, 9.23))
[1] 7.001852
> sd( rnorm(24, 0, 9.23))
[1] 10.10173
> sd( rnorm(24, 0, 9.23))
[1] 12.61102
> sd( rnorm(24, 0, 9.23))
[1] 8.928225
> sd( rnorm(24, 0, 9.23))
[1] 9.13923
> sd( rnorm(24, 0, 9.23))
[1] 9.511692
> sd( rnorm(24, 0, 9.23))
[1] 7.59977

# Note that we are frequently getting values as far from
# the pooled estimate as the actual values in our sample.
# For that reason, I would not be too concerned about the
# two populations having different variability.

# Last time, we tested the null hypothesis that the slope
# in the regression of Peabody on Raven was zero, working
# from the defining formulas in the ANOVA table. Of course,
# we don't need to do it that way; R can do the work for us.
# Here's that regression again:

> PBonRA <- lm(CTPEA~CTRA)
> PBonRA

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# Remember that in a regression output object, there is more
# information lurking behind the scenes, such as the estimated
# conditional means...

> PBonRA$fit
       1        2        3        4        5        6        7        8 
74.41837 70.02713 91.98332 84.66459 78.80961 78.07773 70.02713 75.88212 
       9       10       11       12       13       14       15       16 
68.56339 86.86021 74.41837 75.88212 87.59208 84.66459 82.46897 93.44707 
      17       18       19       20       21       22       23       24 
91.98332 89.05583 82.46897 76.61399 86.86021 77.34586 91.25145 88.32396 
      25       26       27       28       29       30       31       32 
71.49088 90.51958 74.41837 79.54148 86.86021 79.54148 81.73710 87.59208 
      33       34       35       36       37       38       39       40 
74.41837 86.12834 74.41837 75.15024 75.88212 72.22275 79.54148 87.59208 
      41       42       43       44       45       46       47       48 
89.78770 78.07773 83.20085 89.05583 83.93272 84.66459 81.00523 83.20085 
      49       50 
79.54148 89.78770 

# ...and the residuals:

> PBonRA$res
          1           2           3           4           5           6 
 -2.4183697  -1.0271318 -11.9833216   5.3354083  -1.8096077   5.9222653 
          7           8           9          10          11          12 
-11.0271318  -3.8821157  -6.5633858  -4.8602106  -8.4183697  -3.8821157 
         13          14          15          16          17          18 
  8.4079164  -4.6645917  -8.4689727 -12.4470676  -9.9833216  -8.0558296 
         19          20          21          22          23          24 
-13.4689727  -8.6139887  -1.8602106  -8.3458617  14.7485514   6.6760434 
         25          26          27          28          29          30 
-11.4908778 -26.5195756  11.5816303  -0.5414807  -3.8602106   0.4585193 
         31          32          33          34          35          36 
 -3.7370997 -23.5920836  14.5816303   2.8716624   6.5816303  10.8497573 
         37          38          39          40          41          42 
 -0.8821157  -1.2227508  42.4585193  16.4079164  13.2122974 -13.0777347 
         43          44          45          46          47          48 
  8.7991543  18.9441704   2.0672813   0.3354083   6.9947733  -5.2008457 
         49          50 
 11.4585193  13.2122974 
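# A quick sketch of what else is stored in that object (a minimal sketch;
# output not shown): names() lists all of the components, and the fitted
# values plus the residuals should add back up to the observed Peabody scores.

names(PBonRA)                           # components such as coefficients, residuals, fitted.values
all.equal(unname(PBonRA$fit + PBonRA$res), as.numeric(CTPEA))   # should be TRUE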
# R can take that information and do the inference for us. One way to
# do that is to ask for the anova table:

> anova(PBonRA)
Analysis of Variance Table

Response: CTPEA
          Df Sum Sq Mean Sq F value    Pr(>F)    
CTRA       1 2189.0 2189.00  15.541 0.0002616 ***
Residuals 48 6760.8  140.85                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Note that we got the same F statistic and p-value that we obtained
# manually in Tuesday's class.

# Another way to do inference is to ask R for a "summary" of
# the regression:

> summary(PBonRA)

Call:
lm(formula = CTPEA ~ CTRA)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.520  -8.273  -1.516   6.915  42.459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.0490     5.9663   9.897 3.55e-13 ***
CTRA          0.7319     0.1856   3.942 0.000262 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.87 on 48 degrees of freedom
Multiple R-squared:  0.2446,    Adjusted R-squared:  0.2288 
F-statistic: 15.54 on 1 and 48 DF,  p-value: 0.0002616

# This gives us the same F statistic as before on the last
# line of the output. It also gives us the coefficient of
# determination (R squared). But in addition to those results,
# we also get a t-test for the null hypothesis that the slope
# is zero. The t-test is EXACTLY equivalent to the F test; note
# that the p-values are identical. But the t-test is obtained
# the same way most t statistics are: by dividing an estimate
# by its standard error. Here, we take the output and calculate
# the t:

> .7319 / .1856
[1] 3.943427

# The t is just the square root of the F statistic (so squaring
# the t gives back the F):

> 3.942^2
[1] 15.53936

# One nice thing about having the standard error of the slope
# estimate is that we can use it to calculate a confidence interval
# (a range of values that represent null hypotheses we would NOT
# reject using this data set). First, we use the t quantile function
# to figure out how big a t statistic would need to be in order
# for us to reject the null:

> qt(.975, 48)
[1] 2.010635

# That's called the "critical value" of the t statistic. We used .975
# because we are thinking of tests at an alpha level of .05, and the
# .975th quantile puts half of that .05 in the upper tail. (We would
# also reject the null if we got a t below NEGATIVE 2.01.)

> qt(.975, 48) -> tcrit

# We can then get the range of values that would NOT lead to rejection
# by taking our estimate plus or minus the critical value times the
# standard error:

> .7319 - tcrit*0.1856
[1] 0.3587262
> .7319 + tcrit*0.1856
[1] 1.105074

# So we would say that a 95% confidence interval for the slope runs from
# 0.36 to 1.11. It's reasonable for us to act as if the slope falls in that
# range because if we did lots of independent samples and calculated confidence
# intervals this way, 95% of them would include the true slope. And, of course,
# our best guess at the true value of the slope is still .7319.
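# A shortcut worth knowing (a minimal sketch; output not shown): R's confint()
# function computes the same kind of interval directly from the fitted model,
# using the same critical value and standard error behind the scenes.

confint(PBonRA, level = 0.95)    # one row for the intercept, one for the CTRA slope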
# It's possible to do this sort of confidence interval for anything that can
# be tested with a t statistic. For example, when we use R's t.test() function,
# we get a 95% confidence interval for the difference between means (-17.373
# to -2.998 in this example):

> t.test(CTPEA~CBSEX, var.equal=TRUE)

	Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# The t.test() function can also do a one-sample t test for the null
# hypothesis that a single population mean equals a particular value
# (here, I arbitrarily chose the value 80). The output includes a
# 95% confidence interval for the mean:

> t.test(CTPEA, mu=80)

	One Sample t-test

data:  CTPEA
t = 0.8476, df = 49, p-value = 0.4008
alternative hypothesis: true mean is not equal to 80
95 percent confidence interval:
 77.77915 85.46085
sample estimates:
mean of x 
    81.62 

# To see where that interval came from, we can do it manually. We need
# the standard error of the mean:

> sd(CTPEA)/sqrt(50)
[1] 1.911275

# We need the critical value of t:

> tcrit <- qt(.975,49)

# If we calculate our estimated mean plus or minus the critical
# value times the standard error, we get the same confidence
# interval that the t.test() function produced:

> 81.62 - tcrit*1.911275
[1] 77.77915
> 81.62 + tcrit*1.911275
[1] 85.46085

# We turn now to considering different ways to understand
# the correlation coefficient. Back when we first introduced correlation,
# we talked about it as the standardized covariance. Here's the
# covariance between Peabody and Raven:

> var(CTRA,CTPEA)
[1] 61.04

# If we standardize it...

> var(CTRA,CTPEA)/sd(CTRA)/sd(CTPEA)
[1] 0.4945577

# ...we get the correlation coefficient:

> cor(CTRA,CTPEA)
[1] 0.4945577

# Last class, we talked about the correlation as the square root
# of the coefficient of determination (R squared), with the sign
# of the regression slope attached:

> summary(PBonRA)

Call:
lm(formula = CTPEA ~ CTRA)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.520  -8.273  -1.516   6.915  42.459 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  59.0490     5.9663   9.897 3.55e-13 ***
CTRA          0.7319     0.1856   3.942 0.000262 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.87 on 48 degrees of freedom
Multiple R-squared:  0.2446,    Adjusted R-squared:  0.2288 
F-statistic: 15.54 on 1 and 48 DF,  p-value: 0.0002616

> sqrt( 0.2446)
[1] 0.4945705

# We now introduce a third way of understanding the correlation
# coefficient. If we standardize both variables (that is, change
# the metric so that they both have a mean of zero and a standard
# deviation of one)...

> PBstd <- (CTPEA-mean(CTPEA))/sd(CTPEA)
> RAstd <- (CTRA-mean(CTRA))/sd(CTRA)
> mean(RAstd)
[1] 2.359874e-17
> sd(RAstd)
[1] 1

# ...the scatterplot still looks the same except for the numbers
# on the axes:

> plot(CTRA,CTPEA)
> plot(RAstd,PBstd)

# So the correlation is still the same:

> cor(RAstd,PBstd)
[1] 0.4945577

# But if we do the regression, the slope of the standardized
# regression is identical to the correlation:

> lm(PBstd~RAstd)

Call:
lm(formula = PBstd ~ RAstd)

Coefficients:
(Intercept)        RAstd  
 -2.980e-16    4.946e-01  

# And unlike most regressions, we get the same result regardless
# of which variable we regress on which:

> lm(PBstd~RAstd)$coef
  (Intercept)         RAstd 
-2.979758e-16  4.945577e-01 
> lm(RAstd~PBstd)$coef
  (Intercept)         PBstd 
 1.948089e-16  4.945577e-01 
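# A minimal sketch of the same idea (output not shown): R's scale() function
# does the centering and rescaling in one step, so the standardized slope can
# be computed without building PBstd and RAstd by hand.

coef(lm(scale(CTPEA) ~ scale(CTRA)))    # the slope should match cor(CTRA, CTPEA)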
# That gives us yet another way to understand correlation. Remember that
# the slope in a regression always indicates the expected change in the
# conditional mean associated with a one-unit change in the predictor.
# But here, "one unit" means one standard deviation. So the correlation
# between two variables represents the expected number of standard deviations
# of change in one variable associated with a one standard deviation change
# in the other.

# We turn our attention now to assessing the assumptions necessary for
# inference in regression. The first assumption is that it makes sense to
# use linear regression in the first place: the relationship must be
# linear. One device that is useful for assessing linearity is the
# scatterplot. But another approach is to plot the residuals against the
# fitted values (the estimated conditional means).

# Here, we do both plots. Neither one seems to show evidence of a curvilinear
# relationship:

> plot(CTRA, CTPEA)
> plot(PBonRA$fit, PBonRA$res)

# Why should we consider using the residuals plot to assess linearity? An
# example will make that clear. Here, I created a relationship that is
# nonlinear: y equals x to the power 1.2:

> x <- seq(1,2,.1)
> y <- x^1.2

# The scatterplot looks linear:

> plot(x,y)

# But the residuals plot clearly shows the nonlinearity we know is there
# because of how we created the variables:

> temp <- lm(y~x)
> plot(temp$fit, temp$res)

# If we go back to the scatterplot and add the regression line, we can
# see the nonlinearity, but it's much easier to see in the residuals plot:

> plot(x,y)
> abline(temp$coef)
> plot(temp$fit, temp$res)

# The residuals plot is also useful for evaluating the assumption that
# the variation of the errors (for which our residuals are a proxy) is
# the same across the range of fitted values. Here, there's a slight
# tendency for the variation in the residuals to increase as we move from
# left to right in the plot, but it's only slight, so I'm not too worried:

> plot(PBonRA$fit, PBonRA$res)

# Finally, we can evaluate the assumption that the errors are normally
# distributed by checking the normality of the residuals:

> stem(PBonRA$res)

  The decimal point is 1 digit(s) to the right of the |

  -2 | 74
  -1 | 3322110
  -0 | 98888755544442221111
   0 | 00235677789
   1 | 112335569
   2 | 
   3 | 
   4 | 2

> qqnorm(PBonRA$res); qqline(PBonRA$res)

# These plots certainly don't indicate strong evidence of non-normality.
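# For completeness, a minimal sketch of two other checks (output not shown):
# R's plot() method for lm objects will draw the residuals-versus-fitted and
# normal Q-Q plots for us, and shapiro.test() gives a formal significance test
# of normality for the residuals.

plot(PBonRA, which = 1:2)     # built-in residual and Q-Q diagnostic plots
shapiro.test(PBonRA$res)      # null hypothesis: the residuals come from a normal distribution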