# We considered a data set in which there are four pairs
# of x's and y's. Examination of descriptive statistics
# and correlations for each pair suggests that the pairs
# must be quite similar to one another:

> attach(anscombe)
> mean(x1); sd(x1)
[1] 9
[1] 3.316625
> mean(x2); sd(x2)
[1] 9
[1] 3.316625
> mean(x3); sd(x3)
[1] 9
[1] 3.316625
> mean(x4); sd(x4)
[1] 9
[1] 3.316625
> mean(y1); sd(y1)
[1] 7.500909
[1] 2.031568
> mean(y2); sd(y2)
[1] 7.500909
[1] 2.031657
> mean(y3); sd(y3)
[1] 7.500909
[1] 2.029554
> mean(y4); sd(y4)
[1] 7.500909
[1] 2.030579
> cor(x1,y1)
[1] 0.8164205
> cor(x2,y2)
[1] 0.8162365
> cor(x3,y3)
[1] 0.8160423
> cor(x4,y4)
[1] 0.8165214

# The data set illustrates the importance of checking that a relationship
# is actually linear (for example, with a scatterplot) before using linear
# statistical methods like the correlation coefficient. Correlation is an
# appropriate measure of association only for the first pair of variables:

> par(pin=c(6,4))
> plot(x1,y1)
> plot(x2,y2)
> plot(x3,y3)
> plot(x4,y4)

# Here's the data set. (It is also available on the class
# data page.)

> anscombe
   x1 x2 x3 x4    y1   y2        y3    y4
1  10 10 10  8  8.04 9.14  7.460000  6.58
2   8  8  8  8  6.95 8.14  6.770000  5.76
3  13 13 13  8  7.58 8.74 12.740000  7.71
4   9  9  9  8  8.81 8.77  7.110000  8.84
5  11 11 11  8  8.33 9.26  7.810000  8.47
6  14 14 14  8  9.96 8.10  8.840000  7.04
7   6  6  6  8  7.24 6.13  6.080000  5.25
8   4  4  4 19  4.26 3.10  5.390000 12.50
9  12 12 12  8 10.84 9.13  8.150000  5.56
10  7  7  7  8  4.82 7.26  6.420000  7.91
11  5  5  5  8  5.68 4.74  5.739999  6.89

# Next, we illustrated what happens to correlation and regression
# when the range of the predictor is artificially restricted.
# Here is a small data set with a linear relationship between
# x and y:

> x <- rep(1:10,2)
> x
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
> y <- x+1
> y[11:20] <- x[11:20]-1
> plot(x,y)

# Here are the correlation and regression results for the
# full data set:

> cor(x,y)
[1] 0.9444003
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
    -1     -1      0      1      1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.383e-15  5.092e-01    0.00        1    
x           1.000e+00  8.206e-02   12.19 3.94e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.054 on 18 degrees of freedom
Multiple R-squared:  0.8919,	Adjusted R-squared:  0.8859 
F-statistic: 148.5 on 1 and 18 DF,  p-value: 3.938e-10

# When we remove the observations with the largest value of x (x = 10),
# the correlation goes down a little. The estimate of the regression slope
# stays the same, but it is slightly less significant in the test of the
# null hypothesis that the slope equals zero:

> x2 <- c(x[1:9],x[11:19])
> y2 <- c(y[1:9],y[11:19])
> plot(x2,y2)
> cor(x2,y2)
[1] 0.9325048
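# (The summary of the restricted-range regression itself is not shown in the
# session above. As a check, one could compare the slope estimates and
# standard errors directly; the lines below are a sketch, not part of the
# original session, and assume x, y, x2, and y2 are defined as above.)

coef(summary(lm(y ~ x)))["x", c("Estimate", "Std. Error")]     # full data
coef(summary(lm(y2 ~ x2)))["x2", c("Estimate", "Std. Error")]  # restricted data
# The slope estimate is 1 in both fits, but the standard error is larger for
# the restricted data, so its t statistic is smaller.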
# If we further restrict the range (keeping only x = 1 through 5), the trend
# continues: the correlation goes down, and the slope stays the same
# but becomes less significant:

> x3 <- c(x[1:5],x[11:15])
> y3 <- c(y[1:5],y[11:15])
> plot(x3,y3)
> cor(x3,y3)
[1] 0.8164966
> summary(lm(y3~x3))

Call:
lm(formula = y3 ~ x3)

Residuals:
   Min     1Q Median     3Q    Max 
    -1     -1      0      1      1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.0000     0.8292       0  1.00000   
x3            1.0000     0.2500       4  0.00395 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.118 on 8 degrees of freedom
Multiple R-squared:  0.6667,	Adjusted R-squared:  0.625 
F-statistic: 16 on 1 and 8 DF,  p-value: 0.00395

# Here we see the same trend with even greater restriction
# of range (keeping only x = 1 and 2):

> x4 <- c(x[1:2],x[11:12])
> y4 <- c(y[1:2],y[11:12])
> plot(x4,y4)
> cor(x4,y4)
[1] 0.4472136
> summary(lm(y4~x4))

Call:
lm(formula = y4 ~ x4)

Residuals:
 1  2  3  4 
 1  1 -1 -1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.441e-16  2.236e+00   0.000    1.000
x4          1.000e+00  1.414e+00   0.707    0.553

Residual standard error: 1.414 on 2 degrees of freedom
Multiple R-squared:  0.2,	Adjusted R-squared:  -0.2 
F-statistic: 0.5 on 1 and 2 DF,  p-value: 0.5528

# When range is restricted, then, we can say two things:
#  - the correlation coefficient is biased toward zero
#  - the estimated regression slope is unbiased, but it is estimated with
#    less certainty, leading to less statistical power.
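# (A small simulation, separate from the class session, makes both points
# concrete. All of the names below are made up for the illustration, and the
# true slope is set to 1.)

set.seed(1)
full.r <- rest.r <- full.b <- rest.b <- numeric(1000)
for (i in 1:1000) {
  xsim <- runif(100, 0, 10)               # predictor over its full range
  ysim <- 2 + xsim + rnorm(100, sd = 2)   # true intercept 2, true slope 1
  keep <- xsim < 5                        # artificially restrict the range of x
  full.r[i] <- cor(xsim, ysim)
  rest.r[i] <- cor(xsim[keep], ysim[keep])
  full.b[i] <- coef(lm(ysim ~ xsim))[2]
  rest.b[i] <- coef(lm(ysim[keep] ~ xsim[keep]))[2]
}
mean(full.r); mean(rest.r)   # the restricted correlations are closer to zero
mean(full.b); mean(rest.b)   # both sets of slopes average about 1 (unbiased) ...
sd(full.b); sd(rest.b)       # ... but the restricted slopes are more variable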
# We now begin to think about regression as a model for the conditional
# mean, working with a binary variable as the predictor. Our CBSEX variable
# takes on only two values: 0 for girls and 1 for boys.

> attach(JackStatlab)
> CBSEX
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

# We could imagine a regression model that says Raven (the CTRA variable)
# = b0 + b1*CBSEX (+ error). Because CBSEX takes on only two values, there are
# only two possible conditional means: b0 + b1*0 = b0 for the girls, and
# b0 + b1*1 = b0 + b1 for the boys. Because there are only two possible
# conditional means, the relationship is necessarily linear:

> plot(CBSEX,CTRA)

# Here's the regression output:

> lm(CTRA~CBSEX)

Call:
lm(formula = CTRA ~ CBSEX)

Coefficients:
(Intercept)        CBSEX  
     33.312       -3.368  

# So the two possible conditional means are 33.312 for the girls and,
# for the boys,

> 33.312 - 3.368
[1] 29.944

# Note that the regression is just giving us the means of the girls and the boys:

> tapply(CTRA, CBSEX, mean)
       0        1 
33.31250 29.94444 

# The regression slope is equal to the difference between the boys' and girls'
# means, so a test of the null hypothesis that the slope equals zero amounts
# to a test of the null hypothesis that the difference between the two population
# means is zero:

> summary(lm(CTRA~CBSEX))

Call:
lm(formula = CTRA ~ CBSEX)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.9444  -8.0625   0.3715   8.2795  17.0556 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   33.312      1.870  17.817   <2e-16 ***
CBSEX         -3.368      3.116  -1.081    0.285    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.58 on 48 degrees of freedom
Multiple R-squared:  0.02376,	Adjusted R-squared:  0.00342 
F-statistic: 1.168 on 1 and 48 DF,  p-value: 0.2852

# But that is exactly what the two-sample t test examines:

> t.test(CTRA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTRA by CBSEX
t = 1.0808, df = 48, p-value = 0.2852
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.897595  9.633706
sample estimates:
mean in group 0 mean in group 1 
       33.31250        29.94444 

# Note that the two t statistics, one from t.test() and one from the regression
# summary, are identical except for sign: t.test() works with the group 0 minus
# group 1 difference in means, while the regression slope is the group 1 minus
# group 0 difference. The p-values are identical.
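# (The same equivalence can be verified on any data set with a 0/1 grouping
# variable. The sketch below is separate from the class data and uses
# simulated scores with made-up names.)

set.seed(2)
grp   <- rep(c(0, 1), each = 25)                # dummy-coded group, like CBSEX
score <- rnorm(50, mean = 30 + 2*grp, sd = 10)  # outcome variable, like CTRA
t.test(score ~ grp, var.equal = TRUE)$statistic          # two-sample t
summary(lm(score ~ grp))$coefficients["grp", "t value"]  # regression t
# The two t statistics agree except for sign, and the p-values are identical.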