# We considered a data set in which there are four pairs
# of x's and y's. Examination of descriptive statistics
# and correlations for each pair suggests that the pairs
# must be quite similar to one another:

> attach(anscombe)
> mean(x1); sd(x1)
[1] 9
[1] 3.316625
> mean(x2); sd(x2)
[1] 9
[1] 3.316625
> mean(x3); sd(x3)
[1] 9
[1] 3.316625
> mean(x4); sd(x4)
[1] 9
[1] 3.316625
> mean(y1); sd(y1)
[1] 7.500909
[1] 2.031568
> mean(y2); sd(y2)
[1] 7.500909
[1] 2.031657
> mean(y3); sd(y3)
[1] 7.500909
[1] 2.029554
> mean(y4); sd(y4)
[1] 7.500909
[1] 2.030579
> cor(x1,y1)
[1] 0.8164205
> cor(x2,y2)
[1] 0.8162365
> cor(x3,y3)
[1] 0.8160423
> cor(x4,y4)
[1] 0.8165214

# The data set illustrates the importance of checking that a relationship
# is actually linear (for example, with a scatterplot) before using linear
# statistical methods like the correlation coefficient. Correlation is an
# appropriate measure of association only for the first pair of variables:

> par(pin=c(6,4))
> plot(x1,y1)
> plot(x2,y2)
> plot(x3,y3)
> plot(x4,y4)

# Here's the data set. (It is also available on the class
# data page.)

> anscombe
   x1 x2 x3 x4    y1   y2        y3    y4
1  10 10 10  8  8.04 9.14  7.460000  6.58
2   8  8  8  8  6.95 8.14  6.770000  5.76
3  13 13 13  8  7.58 8.74 12.740000  7.71
4   9  9  9  8  8.81 8.77  7.110000  8.84
5  11 11 11  8  8.33 9.26  7.810000  8.47
6  14 14 14  8  9.96 8.10  8.840000  7.04
7   6  6  6  8  7.24 6.13  6.080000  5.25
8   4  4  4 19  4.26 3.10  5.390000 12.50
9  12 12 12  8 10.84 9.13  8.150000  5.56
10  7  7  7  8  4.82 7.26  6.420000  7.91
11  5  5  5  8  5.68 4.74  5.739999  6.89

# Next, we illustrated what happens to correlation and regression
# when the range of the predictor is artificially restricted.
# Here is a small data set with a linear relationship between
# x and y:

> x <- rep(1:10,2)
> x
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
> y <- x+1
> y[11:20] <- x[11:20]-1
> plot(x,y)

# Here are the correlation and regression results for the
# full data set:

> cor(x,y)
[1] 0.9444003
> summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
   Min     1Q Median     3Q    Max 
    -1     -1      0      1      1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.383e-15  5.092e-01    0.00        1    
x           1.000e+00  8.206e-02   12.19 3.94e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.054 on 18 degrees of freedom
Multiple R-squared:  0.8919,	Adjusted R-squared:  0.8859 
F-statistic: 148.5 on 1 and 18 DF,  p-value: 3.938e-10

# When we remove the observations with the largest value of x (x = 10),
# the correlation goes down a little. The estimate of the regression slope
# stays the same, but it is slightly less significant in the test of the
# null hypothesis that the slope equals zero:

> x2 <- c(x[1:9],x[11:19])
> y2 <- c(y[1:9],y[11:19])
> plot(x2,y2)
> cor(x2,y2)
[1] 0.9325048
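# (The summary of the restricted-range regression itself is not shown in the
# session above. As a check, one could compare the slope estimates and
# standard errors directly; the lines below are a sketch, not part of the
# original session, and assume x, y, x2, and y2 are defined as above.)

coef(summary(lm(y ~ x)))["x", c("Estimate", "Std. Error")]     # full data
coef(summary(lm(y2 ~ x2)))["x2", c("Estimate", "Std. Error")]  # restricted data
# The slope estimate is 1 in both fits, but the standard error is larger for
# the restricted data, so its t statistic is smaller.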
# If we further restrict the range (keeping only x = 1 through 5), the trend
# continues: the correlation goes down, and the slope stays the same
# but becomes less significant:

> x3 <- c(x[1:5],x[11:15])
> y3 <- c(y[1:5],y[11:15])
> plot(x3,y3)
> cor(x3,y3)
[1] 0.8164966
> summary(lm(y3~x3))

Call:
lm(formula = y3 ~ x3)

Residuals:
   Min     1Q Median     3Q    Max 
    -1     -1      0      1      1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.0000     0.8292       0  1.00000   
x3            1.0000     0.2500       4  0.00395 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.118 on 8 degrees of freedom
Multiple R-squared:  0.6667,	Adjusted R-squared:  0.625 
F-statistic: 16 on 1 and 8 DF,  p-value: 0.00395

# Here we see the same trend with even greater restriction
# of range (keeping only x = 1 and 2):

> x4 <- c(x[1:2],x[11:12])
> y4 <- c(y[1:2],y[11:12])
> plot(x4,y4)
> cor(x4,y4)
[1] 0.4472136
> summary(lm(y4~x4))

Call:
lm(formula = y4 ~ x4)

Residuals:
 1  2  3  4 
 1  1 -1 -1 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.441e-16  2.236e+00   0.000    1.000
x4          1.000e+00  1.414e+00   0.707    0.553

Residual standard error: 1.414 on 2 degrees of freedom
Multiple R-squared:  0.2,	Adjusted R-squared:  -0.2 
F-statistic: 0.5 on 1 and 2 DF,  p-value: 0.5528

# When range is restricted, then, we can say two things:
#  - the correlation coefficient is biased toward zero
#  - the estimated regression slope is unbiased, but it is estimated with
#    less certainty, leading to less statistical power.
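# (A small simulation, separate from the class session, makes both points
# concrete. All of the names below are made up for the illustration, and the
# true slope is set to 1.)

set.seed(1)
full.r <- rest.r <- full.b <- rest.b <- numeric(1000)
for (i in 1:1000) {
  xsim <- runif(100, 0, 10)               # predictor over its full range
  ysim <- 2 + xsim + rnorm(100, sd = 2)   # true intercept 2, true slope 1
  keep <- xsim < 5                        # artificially restrict the range of x
  full.r[i] <- cor(xsim, ysim)
  rest.r[i] <- cor(xsim[keep], ysim[keep])
  full.b[i] <- coef(lm(ysim ~ xsim))[2]
  rest.b[i] <- coef(lm(ysim[keep] ~ xsim[keep]))[2]
}
mean(full.r); mean(rest.r)   # the restricted correlations are closer to zero
mean(full.b); mean(rest.b)   # both sets of slopes average about 1 (unbiased) ...
sd(full.b); sd(rest.b)       # ... but the restricted slopes are more variable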
# We now begin to think about regression as a model for the conditional
# mean, working with a binary variable as the predictor. Our CBSEX variable
# takes on only two values: 0 for girls and 1 for boys.

> attach(JackStatlab)
> CBSEX
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

# We could imagine a regression model that says Raven (the CTRA variable)
# = b0 + b1*CBSEX (+ error). Because CBSEX takes on only two values, there are
# only two possible conditional means: b0 + b1*0 = b0 for the girls, and
# b0 + b1*1 = b0 + b1 for the boys. Because there are only two possible
# conditional means, the relationship is necessarily linear:

> plot(CBSEX,CTRA)

# Here's the regression output:

> lm(CTRA~CBSEX)

Call:
lm(formula = CTRA ~ CBSEX)

Coefficients:
(Intercept)        CBSEX  
     33.312       -3.368  

# So the two possible conditional means are 33.312 for the girls and,
# for the boys,

> 33.312 - 3.368
[1] 29.944

# Note that the regression is just giving us the means of the girls and the boys:

> tapply(CTRA, CBSEX, mean)
       0        1 
33.31250 29.94444 

# The regression slope is equal to the difference between the boys' and girls'
# means, so a test of the null hypothesis that the slope equals zero amounts
# to a test of the null hypothesis that the difference between the two population
# means is zero:

> summary(lm(CTRA~CBSEX))

Call:
lm(formula = CTRA ~ CBSEX)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.9444  -8.0625   0.3715   8.2795  17.0556 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   33.312      1.870  17.817   <2e-16 ***
CBSEX         -3.368      3.116  -1.081    0.285    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.58 on 48 degrees of freedom
Multiple R-squared:  0.02376,	Adjusted R-squared:  0.00342 
F-statistic: 1.168 on 1 and 48 DF,  p-value: 0.2852

# But that is exactly what the two-sample t test examines:

> t.test(CTRA~CBSEX, var.equal=TRUE)

        Two Sample t-test

data:  CTRA by CBSEX
t = 1.0808, df = 48, p-value = 0.2852
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.897595  9.633706
sample estimates:
mean in group 0 mean in group 1 
       33.31250        29.94444 

# Note that the two t statistics, one from t.test() and one from the regression
# summary, are identical except for sign: t.test() works with the group 0 minus
# group 1 difference in means, while the regression slope is the group 1 minus
# group 0 difference. The p-values are identical.
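# (The same equivalence can be verified on any data set with a 0/1 grouping
# variable. The sketch below is separate from the class data and uses
# simulated scores with made-up names.)

set.seed(2)
grp   <- rep(c(0, 1), each = 25)                # dummy-coded group, like CBSEX
score <- rnorm(50, mean = 30 + 2*grp, sd = 10)  # outcome variable, like CTRA
t.test(score ~ grp, var.equal = TRUE)$statistic          # two-sample t
summary(lm(score ~ grp))$coefficients["grp", "t value"]  # regression t
# The two t statistics agree except for sign, and the p-values are identical.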