# Last time in class, we used rise-over-run to calculate
# that the slope of the line in the Powerpoint was 1, and
# we observed that the intercept was 0. Here are those
# points again:

> plot(X,Y)

# Here, we calculate the regression of Y on X and confirm
# what we had said about the slope and intercept:

> lm(Y~X)

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X  
 -1.256e-15    1.000e+00  

# Later in the class, a student asked about the values
# of X and Y that were used to create that plot. Here
# they are:

> X
[1] 2 2 3 3 4 4 5 5
> Y
[1] 1 3 2 4 3 5 4 6

# We now consider the regression of Peabody on Raven.
# First, we plot the relationship to confirm that it
# is basically linear. (This is a quick-and-dirty
# scatterplot without the improvements we introduced
# last time.)

> attach(JackStatlab)
> plot(CTRA,CTPEA)

# Here's the regression of Peabody on Raven:

> lm(CTPEA~CTRA)

Call:
lm(formula = CTPEA ~ CTRA)

Coefficients:
(Intercept)         CTRA  
    59.0490       0.7319  

# The intercept of 59 tells us that if we extended
# the line to include Raven=0, the conditional mean
# of the Peabody distribution in that area would be
# 59. Another way of thinking about this is to say
# that the line would cross the Y axis at 59 when
# Raven = 0.
#
# The slope of 0.73 tells us that for each one-point
# increase in Raven, the conditional mean of Peabody
# is going up by about 7/10 of a point.

# We can use the abline() function to draw the line
# on the scatterplot:

> abline(59.0490, 0.7319)

# Here's another way to add the line, without the
# need for typing in the values:

> plot(CTRA,CTPEA)
> abline(lm(CTPEA~CTRA)$coef)
> lm(CTPEA~CTRA)$coef
(Intercept)        CTRA 
  59.049037    0.731873 

# Next, we considered the subject of correlation. Recall
# that the variance...

> var(CTPEA)
[1] 182.6486

# ...is equal to the sum of squared deviations from the
# mean divided by (N-1):

> sum((CTPEA-mean(CTPEA))^2)/49
[1] 182.6486

# Here's another, more longwinded way of writing that formula:

> sum((CTPEA-mean(CTPEA))*(CTPEA-mean(CTPEA)))/49
[1] 182.6486

# Now, if we replace the second instance of Peabody with Raven,
# we get a sum of components that should tend to be positive
# if the linear relationship is positive and negative if the
# relationship is negative. This statistic is called the "covariance"
# between Peabody and Raven:

> sum((CTPEA-mean(CTPEA))*(CTRA-mean(CTRA)))/49
[1] 61.04

# A problem with the covariance as a measure of strength of
# association is that its value changes if we rescale one or
# both of the variables. Here, for example, if we replace
# Peabody with 10*Peabody, the value of the covariance increases
# tenfold, even though the scatterplot looks the same:

> sum((10*CTPEA-mean(10*CTPEA))*(CTRA-mean(CTRA)))/49
[1] 610.4
> plot(CTRA, 10*CTPEA)

# Usually, we use the var() function to calculate a sample variance,
# but if we feed it two variables, it gives us the covariance:

> var(CTPEA,CTRA)
[1] 61.04

# If we take away the scale of the covariance by dividing by
# the standard deviation of each variable, we get a form of
# the covariance that is not affected by rescaling:

> var(CTPEA,CTRA)/sd(CTPEA)/sd(CTRA)
[1] 0.4945577
> var(10*CTPEA,CTRA)/sd(10*CTPEA)/sd(CTRA)
[1] 0.4945577

# That scaled covariance is the correlation coefficient (well,
# its full name is the "Pearson product-moment correlation
# coefficient"):

> cor(CTPEA,CTRA)
[1] 0.4945577
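# (Aside, not from the class session: a minimal sketch, using simulated
# data rather than the JackStatlab variables, of how the regression
# coefficients and the correlation are tied to the covariance. The
# variables x and y below are made up for illustration.)

x <- rnorm(50, mean = 30, sd = 10)     # hypothetical predictor scores
y <- 60 + 0.7*x + rnorm(50, sd = 10)   # hypothetical outcome scores
b1 <- var(x, y)/var(x)                 # slope = covariance / variance of the predictor
b0 <- mean(y) - b1*mean(x)             # intercept from the two means
c(b0, b1)
lm(y ~ x)$coef                         # should match the two values above
var(x, y)/(sd(x)*sd(y))                # covariance rescaled by the two sds...
cor(x, y)                              # ...is the correlation coefficient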
# Here, we illustrate the importance of confirming that a
# relationship is linear before we use correlation to describe
# the strength of the association. In this example, y is
# perfectly associated with x, but in a nonlinear way:

> x <- seq(-2,2,.1)
> y <- x^2
> plot(x,y)

# Even though the relationship is perfect, the correlation
# is zero:

> cor(x,y)
[1] 1.775033e-16

# And the regression line doesn't even come close to
# describing the relationship:

> abline(lm(y~x)$coef)

# A student asked about the intercept of 59 in the
# regression of Peabody on Raven. If we extend the
# dimensions of the scatterplot to include a
# (nonexistent) Raven score of 0, we see that the
# line crosses the Y axis at 59:

> plot(CTRA,CTPEA,xlim=c(-5,50),ylim=c(50,130))
> abline(lm(CTPEA~CTRA)$coef)
> lm(CTPEA~CTRA)$coef
(Intercept)        CTRA 
  59.049037    0.731873 

# Next, we demonstrated the central limit theorem.
# The theorem describes the distribution of a large
# number of sample means based on independent samples
# of the same size from the population. It says the
# mean of those sample means will equal the population
# mean and the standard deviation of the sample means
# will equal the population standard deviation divided
# by the square root of the sample size. The theorem
# also says that if the population is normal, the distribution
# of sample means will be normal, but if the population
# is non-normal, the distribution of sample means will
# become more and more like a normal distribution as the
# sample size increases.

# Here's the distribution of a uniform variable on
# the interval (0, 1):

> uniform<-runif(1000000)
> hist(uniform)

# We already know from previous classes that the mean of
# that population is 1/2 and the standard deviation is
# the square root of 1/12:

> mean(uniform)
[1] 0.5000398
> sd(uniform)
[1] 0.2888787
> sqrt(1/12)
[1] 0.2886751

# Here, we simulate taking a large number of N=3 samples
# from that population and calculating the mean of each
# sample:

> means <- NULL
> for(i in 1:50000) {
+ means[i] <- mean(runif(3)) }

# The mean and standard deviation of those means are
# exactly what the central limit theorem predicts:

> mean(means)
[1] 0.5000984
> sqrt(1/12)/sqrt(3)
[1] 0.1666667
> sd(means)
[1] 0.1668343

# But the distribution isn't quite normal:

> hist(means)

# If we go with an even smaller sample size (N=2),
# the mean and standard deviation of the sample means
# are still exactly as the theorem predicts:

> for(i in 1:50000) {
+ means[i] <- mean(runif(2)) }
> mean(means)
[1] 0.4990982
> sd(means)
[1] 0.2041811

# But the shape of the distribution is even less like
# a normal distribution:

> hist(means)

# On the other hand, if we increase the sample size for
# each mean to 20...

> for(i in 1:50000) {
+ means[i] <- mean(runif(20)) }

# ...the mean and sd of the means are still as predicted...

> mean(means)
[1] 0.5000626
> sqrt(1/12)/sqrt(20)
[1] 0.06454972
> sd(means)
[1] 0.06456094

# ...but now, with the larger sample size, the distribution
# of the means looks pretty much like a normal distribution:

> hist(means)

# We can see that better if we force R to use more intervals
# when it draws the histogram:

> hist(means,20)

# In Homework One, the class calculated sample means for Raven
# based on separate, individual random N=50 samples from the population.
# Here's the distribution of the means:

> stem(Rmeans)

  The decimal point is at the |

  29 | 9
  30 | 33379
  31 | 016
  32 | 037
  33 | 08

# It's difficult to evaluate whether that distribution looks normal
# as predicted by the central limit theorem because there aren't very
# many separate means, but it certainly doesn't look NON-normal.
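# (Aside, not from the class session: a minimal sketch of the same kind
# of simulation written more compactly with replicate(), with the normal
# curve predicted by the central limit theorem drawn on top. The sample
# size of 20 and the 50000 replications are arbitrary choices.)

n <- 20
means <- replicate(50000, mean(runif(n)))
hist(means, 20, freq = FALSE)      # freq=FALSE puts the histogram on a density scale
curve(dnorm(x, mean = 1/2, sd = sqrt(1/12)/sqrt(n)), add = TRUE)   # predicted normal curve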
# Recall that we are considering the full N=1296 Statlab dataset to
# be the population, so we can actually calculate the true mean
# and standard deviation:

> detach(JackStatlab)
> attach(Statlab)
> length(CTRA)
[1] 1296
> mean(CTRA)
[1] 30.95062
> sd(CTRA)
[1] 9.974362

# Here's what the theorem would expect the sd of the means to be:

> sd(CTRA)/sqrt(50)
[1] 1.410588

# Well, technically, since we are considering this a population,
# we should be using the population formula, where we divide by N
# rather than by (N-1):

> sqrt(1295*var(CTRA)/1296)
[1] 9.970514
> sqrt(1295*var(CTRA)/1296)/sqrt(50)
[1] 1.410044

# The mean of our means is pretty close to the population
# mean of 30.95:

> mean(Rmeans)
[1] 31.41

# Their standard deviation is a bit lower than predicted, but that's
# because our samples weren't truly independent (some of them may
# have included some of the same data points, and two students actually
# used the same data set). But it's in the right ballpark:

> sd(Rmeans)
[1] 1.179172

> detach(Statlab)
> attach(JackStatlab)

# We illustrated the use of the concept of sampling distributions
# in hypothesis testing. Here, we use a two-sample t test to test
# the null hypothesis that the population means of the boys' and the
# girls' Peabody scores are equal:

> tapply(CTPEA, CBSEX, mean)
       0        1 
76.73077 86.91667 

> t.test(CTPEA~CBSEX, var.equal=TRUE)

	Two Sample t-test

data:  CTPEA by CBSEX
t = -2.8494, df = 48, p-value = 0.006434
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -17.373374  -2.998421
sample estimates:
mean in group 0 mean in group 1 
       76.73077        86.91667 

# If the null hypothesis is true (i.e., if the two means actually
# ARE equal), that t statistic should have a t distribution with
# 48 degrees of freedom. The p-value of .006 indicates that if the
# null hypothesis is actually true, a value of the statistic at least
# this far from zero would occur in only about 6 of every 1000 samples.
# That's a really rare event, so we conclude that it's actually NOT
# true that the two means are equal.

# Important note: there is more work that we need to do before we
# trust this result. This will be discussed further next class.
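# (Aside, not from the class session: a minimal sketch, using made-up
# scores rather than the Peabody data, of what the equal-variance
# two-sample t statistic is actually computing. The group sizes, means,
# and sds below are invented for illustration.)

g0 <- rnorm(26, mean = 77, sd = 12)    # hypothetical scores for group 0
g1 <- rnorm(24, mean = 87, sd = 12)    # hypothetical scores for group 1
n0 <- length(g0); n1 <- length(g1)
sp2 <- ((n0-1)*var(g0) + (n1-1)*var(g1))/(n0 + n1 - 2)    # pooled variance
t_stat <- (mean(g0) - mean(g1))/sqrt(sp2*(1/n0 + 1/n1))   # t statistic by hand
t_stat
t.test(g0, g1, var.equal = TRUE)$statistic                # should match
2*pt(-abs(t_stat), df = n0 + n1 - 2)                      # two-sided p-value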