# Last time in class, we used rise-over-run to calculate
# that the slope of the line in the Powerpoint was 1, and
# we observed that the intercept was 0. Here are those
# points again:

> plot(X,Y)

# Here, we calculate the regression of Y on X and confirm
# what we had said about the slope and intercept:

> lm(Y~X)

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X  
 -1.256e-15    1.000e+00  

# Later in the class, a student asked about the values
# of X and Y that were used to create that plot. Here
# they are:

> X
[1] 2 2 3 3 4 4 5 5
> Y
[1] 1 3 2 4 3 5 4 6

# We now consider the regression of Peabody on Raven.
# First, we plot the relationship to confirm that it
# is basically linear. (This is a quick-and-dirty
# scatterplot without the improvements we introduced
# last time.)

> attach(JackStatlab)
> plot(CTRA,CTPEA)

# Here's the regression of Peabody on Raven:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# The intercept of 59 tells us that if we extended
# the line to include Raven=0, the conditional mean
# of the Peabody distribution in that area would be
# 59. Another way of thinking about this is to say
# that the line would cross the Y axis at 59 when
# Raven = 0.
#
# The slope of 0.73 tells us that for each one-point
# increase in Raven, the conditional mean of Peabody
# is going up by about 7/10 of a point.

# We can use the abline() function to draw the line
# on the scatterplot:

> abline(59.0490, 0.7319)

# Here's another way to add the line, without the
# need for typing in the values:

> plot(CTRA,CTPEA)
> abline(lm(CTPEA~CTRA)$coef)
> lm(CTPEA~CTRA)$coef
(Intercept)        CTRA 
  59.049037    0.731873 

# Next, we considered the subject of correlation. Recall
# that the variance...

> var(CTPEA)
[1] 182.6486

# ...is equal to the sum of squared deviations from the
# mean divided by (N-1):

> sum((CTPEA-mean(CTPEA))^2)/49
[1] 182.6486

# Here's another, more longwinded way of writing that formula:

> sum((CTPEA-mean(CTPEA))*(CTPEA-mean(CTPEA)))/49
[1] 182.6486

# Now, if we replace the second instance of Peabody with Raven,
# we get a sum of components that should tend to be positive
# if the linear relationship is positive and negative if the
# relationship is negative. This statistic is called the "covariance"
# between Peabody and Raven:

> sum((CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)))/49
[1] 61.04

# A problem with the covariance as a measure of strength of
# association is that its value changes if we rescale one or
# both of the variables. Here, for example, if we replace
# Peabody with 10*Peabody, the value of the covariance increases
# tenfold, even though the scatterplot looks the same:

> sum((10*CTPEA-mean(10*CTPEA))*(CTRA-mean(CTRA)))/49
[1] 610.4
> plot(CTRA, 10*CTPEA)

# Usually, we use the var() function to calculate a sample variance,
# but if we feed it two variables, it gives us the covariance:

> var(CTPEA,CTRA)
[1] 61.04

# If we take away the scale of the covariance by dividing by
# the standard deviation of each variable, we get a form of
# the covariance that is not affected by rescaling:

> var(CTPEA,CTRA)/sd(CTPEA)/sd(CTRA)
[1] 0.4945577
> var(10*CTPEA,CTRA)/sd(10*CTPEA)/sd(CTRA)
[1] 0.4945577

# That scaled covariance is the correlation coefficient (well,
# its full name is the "Pearson product-moment correlation
# coefficient"):

> cor(CTPEA,CTRA)
[1] 0.4945577
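# (Aside, not from the class session: a minimal sketch, using simulated
# data rather than the JackStatlab variables, of how the regression
# coefficients and the correlation are tied to the covariance. The
# variables x and y below are made up for illustration.)

x <- rnorm(50, mean = 30, sd = 10)     # hypothetical predictor scores
y <- 60 + 0.7*x + rnorm(50, sd = 10)   # hypothetical outcome scores
b1 <- var(x, y)/var(x)                 # slope = covariance / variance of the predictor
b0 <- mean(y) - b1*mean(x)             # intercept from the two means
c(b0, b1)
lm(y ~ x)$coef                         # should match the two values above
var(x, y)/(sd(x)*sd(y))                # covariance rescaled by the two sds...
cor(x, y)                              # ...is the correlation coefficient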
# Here, we illustrate the importance of confirming that a
# relationship is linear before we use correlation to describe
# the strength of the association. In this example, y is
# perfectly associated with x, but in a nonlinear way:

> x <- seq(-2,2,.1)
> y <- x^2
> plot(x,y)

# Even though the relationship is perfect, the correlation
# is zero:

> cor(x,y)
[1] 1.775033e-16

# And the regression line doesn't even come close to
# describing the relationship:

> abline(lm(y~x)$coef)

# A student asked about the intercept of 59 in the
# regression of Peabody on Raven. If we extend the
# dimensions of the scatterplot to include a
# (nonexistent) Raven score of 0, we see that the
# line crosses the Y axis at 59:

> plot(CTRA,CTPEA,xlim=c(-5,50),ylim=c(50,130))
> abline(lm(CTPEA~CTRA)$coef)
> lm(CTPEA~CTRA)$coef
(Intercept)        CTRA 
  59.049037    0.731873 

# Next, we demonstrated the central limit theorem.
# The theorem describes the distribution of a large
# number of sample means based on independent samples
# of the same size from the population. It says the
# mean of those sample means will equal the population
# mean and the standard deviation of the sample means
# will equal the population standard deviation divided
# by the square root of the sample size. The theorem
# also says that if the population is normal, the distribution
# of sample means will be normal, but if the population
# is non-normal, the distribution of sample means will
# become more and more like a normal distribution as the
# sample size increases.

# Here's the distribution of a uniform variable on
# the interval (0, 1):

> uniform<-runif(1000000)
> hist(uniform)

# We already know from previous classes that the mean of
# that population is 1/2 and the standard deviation is
# the square root of 1/12:

> mean(uniform)
[1] 0.5000398
> sd(uniform)
[1] 0.2888787
> sqrt(1/12)
[1] 0.2886751

# Here, we simulate taking a large number of N=3 samples
# from that population and calculating the mean of each
# sample:

> means <- NULL
> for(i in 1:50000) {
+ means[i] <- mean(runif(3)) }

# The mean and standard deviation of those means are
# exactly what the central limit theorem predicts:

> mean(means)
[1] 0.5000984
> sqrt(1/12)/sqrt(3)
[1] 0.1666667
> sd(means)
[1] 0.1668343

# But the distribution isn't quite normal:

> hist(means)

# If we go with an even smaller sample size (N=2),
# the mean and standard deviation of the sample means
# are still exactly as the theorem predicts:

> for(i in 1:50000) {
+ means[i] <- mean(runif(2)) }
> mean(means)
[1] 0.4990982
> sd(means)
[1] 0.2041811

# But the shape of the distribution is even less like
# a normal distribution:

> hist(means)

# On the other hand, if we increase the sample size for
# each mean to 20...

> for(i in 1:50000) {
+ means[i] <- mean(runif(20)) }

# ...the mean and sd of the means are still as predicted...

> mean(means)
[1] 0.5000626
> sqrt(1/12)/sqrt(20)
[1] 0.06454972
> sd(means)
[1] 0.06456094

# ...but now, with the larger sample size, the distribution
# of the means looks pretty much like a normal distribution:

> hist(means)

# We can see that better if we force R to use more intervals
# when it draws the histogram:

> hist(means,20)

# In Homework One, the class calculated sample means for Raven
# based on separate, individual random N=50 samples from the population.
# Here's the distribution of the means:

> stem(Rmeans)

  The decimal point is at the |

  29 | 9
  30 | 33379
  31 | 016
  32 | 037
  33 | 08

# It's difficult to evaluate whether that distribution looks normal
# as predicted by the central limit theorem because there aren't very
# many separate means, but it certainly doesn't look NON-normal.
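# (Aside, not from the class session: a minimal sketch of the same kind
# of simulation written more compactly with replicate(), with the normal
# curve predicted by the central limit theorem drawn on top. The sample
# size of 20 and the 50000 replications are arbitrary choices.)

n <- 20
means <- replicate(50000, mean(runif(n)))
hist(means, 20, freq = FALSE)      # freq=FALSE puts the histogram on a density scale
curve(dnorm(x, mean = 1/2, sd = sqrt(1/12)/sqrt(n)), add = TRUE)   # predicted normal curve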
# Recall that we are considering the full N=1296 Statlab dataset to
# be the population, so we can actually calculate the true mean
# and standard deviation:

> detach(JackStatlab)
> attach(Statlab)
> length(CTRA)
[1] 1296
> mean(CTRA)
[1] 30.95062
> sd(CTRA)
[1] 9.974362

# Here's what the theorem would expect the sd of the means to be:

> sd(CTRA)/sqrt(50)
[1] 1.410588

# Well, technically, since we are considering this a population,
# we should be using the population formula, where we divide by N
# rather than by (N-1):

> sqrt(1295*var(CTRA)/1296)
[1] 9.970514
> sqrt(1295*var(CTRA)/1296)/sqrt(50)
[1] 1.410044

# The mean of our means is pretty close to the population
# mean of 30.95:

> mean(Rmeans)
[1] 31.41

# Their standard deviation is a bit lower than predicted, but that's
# because our samples weren't truly independent (some of them may
# have included some of the same data points, and two students actually
# used the same data set). But it's in the right ballpark:

> sd(Rmeans)
[1] 1.179172

> detach(Statlab)
> attach(JackStatlab)

# We illustrated the use of the concept of sampling distributions
# in hypothesis testing. Here, we use a two-sample t test to test
# the null hypothesis that the population means of the boys' and the
# girls' Peabody scores are equal:

> tapply(CTPEA, CBSEX, mean)
       0        1 
76.73077 86.91667 

> t.test(CTPEA~CBSEX, var.equal=TRUE)

	Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# If the null hypothesis is true (i.e., if the two means actually
# ARE equal), that t statistic should have a t distribution with
# 48 degrees of freedom. The p-value of .006 indicates that if the
# null hypothesis is actually true, a value of the statistic at least
# this far from zero would occur in only about 6 of every 1000 samples.
# That's a really rare event, so we conclude that it's actually NOT
# true that the two means are equal.

# Important note: there is more work that we need to do before we
# trust this result. This will be discussed further next class.
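# (Aside, not from the class session: a minimal sketch, using made-up
# scores rather than the Peabody data, of what the equal-variance
# two-sample t statistic is actually computing. The group sizes, means,
# and sds below are invented for illustration.)

g0 <- rnorm(26, mean = 77, sd = 12)    # hypothetical scores for group 0
g1 <- rnorm(24, mean = 87, sd = 12)    # hypothetical scores for group 1
n0 <- length(g0); n1 <- length(g1)
sp2 <- ((n0-1)*var(g0) + (n1-1)*var(g1))/(n0 + n1 - 2)    # pooled variance
t_stat <- (mean(g0) - mean(g1))/sqrt(sp2*(1/n0 + 1/n1))   # t statistic by hand
t_stat
t.test(g0, g1, var.equal = TRUE)$statistic                # should match
2*pt(-abs(t_stat), df = n0 + n1 - 2)                      # two-sided p-value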