# We continued to work with our Peabody variable:

> attach(PeabodyFrame)
> Peabody
 [1]  69  72  94  64  80  77  96  86  89  69  92  71  81  90  84  76 100  57  61
[20]  84  81  65  87  92  89  79  91  65  91  81  86  85  95  93  83  76  84  90
[39]  95  67

# Here's another graphic approach. The middle horizontal line is at the median;
# the other two are at the first and third quartiles. The whiskers extend (in this
# example) to the smallest and largest data points.

> boxplot(Peabody)

# We looked at the help for the function to get more information about how the
# whiskers are drawn.

> help(boxplot)
starting httpd help server ... done

# As with virtually any graphic in R, we can take control of some of the
# basics (although in this case, R has done a nice job):

> boxplot(Peabody, axes=FALSE)
> box()
> axis(side=2, at=c(60,80,100))
> boxplot(Peabody)

# The ordering of the mean and the median can contain information about
# skew, because a long tail tends to pull the mean in the direction of the tail.
# Here, the fact that the mean is smaller than the median suggests negative skew:

> median(Peabody)
[1] 84
> mean(Peabody)
[1] 81.675
> 81.675 - 84
[1] -2.325

# Pearson developed a statistic that captures that information but takes away
# the scale, so that the index is invariant if the scale of the variable is changed:

> 3 * (mean(Peabody)-median(Peabody)) / sd(Peabody)
[1] -0.6387249

# Note that the difference between the mean and median changes if we change the
# scale of the variable.
# For example, if we multiply the variable by 10, the difference
# increases by a factor of 10:

> junk <- 10*Peabody
> 81.675-84
[1] -2.325
> mean(junk)-median(junk)
[1] -23.25

# But Pearson's index stays the same:

> 3 * (mean(junk)-median(junk)) / sd(junk)
[1] -0.6387249

# That's going to be useful, so let's create a function and avoid all
# of that repetitive typing:

> pskew <- function(x) {
+ return( 3 * (mean(x)-median(x))/sd(x)) }
> pskew(Peabody)
[1] -0.6387249

# We can now see the function "pskew" in our R space:

> ls()
[1] "Deviations"    "iqr"           "junk"          "junk_function"
[5] "PeabodyFrame"  "pskew"         "x"

# Let's do some cleanup:

> rm(Deviations,junk,junk_function,x)
> ls()
[1] "iqr"          "PeabodyFrame" "pskew"

# By typing its name, we can see what the function does:

> iqr
function(x) {
return( IQR(x, type=2)) }
> pskew
function(x) {
return( 3 * (mean(x)-median(x))/sd(x)) }
> pskew(Peabody)
[1] -0.6387249

# Here, we write a function that implements the skew index that's based
# on cubed deviations from the mean. This statistic is very unstable in
# small data sets, but can be useful if we have a really big sample.
# (I chose the name "mskew" to remind us that this is the statistic that's
# based on the third moment.)

> mskew <- function(x) {
+ N <- length(x)
+ mn <- mean(x)
+ dev <- x - mn
+ return( N/(N-1)/(N-2) / sd(x)^3 * sum(dev^3)) }
> mskew
function(x) {
N <- length(x)
mn <- mean(x)
dev <- x - mn
return( N/(N-1)/(N-2) / sd(x)^3 * sum(dev^3)) }
>
> # Don't do this! (N is too small.)
> mskew(Peabody)
[1] -0.5223247

# Like pskew, mskew is unaffected by changes of scale:

> mskew(10*Peabody)
[1] -0.5223247

> # Here, we read in the entire N=1296 Statlab data set:

> Statlab <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/105/data/statlab (abridged).csv")

# The variable names are cryptic, but there's a codebook on the web page that explains what
# they mean.
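One more note on the two functions before we explore the Statlab data: both indices are unchanged not only by rescaling, but by any positive linear transformation (changing units, adding a constant). Here is a quick self-contained check; the functions are redefined so the snippet stands alone, and the transformation 5*x + 100 is just an arbitrary illustration:

```r
# Skew indices from the session above, redefined so this snippet stands alone
pskew <- function(x) 3 * (mean(x) - median(x)) / sd(x)
mskew <- function(x) {
  N <- length(x)
  dev <- x - mean(x)
  N / (N - 1) / (N - 2) / sd(x)^3 * sum(dev^3)
}

x <- rnorm(40)
y <- 5 * x + 100   # change of units plus a shift

all.equal(pskew(x), pskew(y))   # TRUE
all.equal(mskew(x), mskew(y))   # TRUE
```

Converting heights from inches to centimeters, say, would therefore leave both skew values untouched.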
> head(Statlab)
  CODE CBSEX CBLGTH CBWGT CTHGHT CTWGT CTPEA CTRA MBAG MBWGT MTHGHT MTWGT FBAG FTHGHT FTWGT FIB FIT
1 1111     0   20.0   6.6   55.7    85    85   34   17   119   66.0   130   19   70.1   171  33 150
2 1112     0   20.0   6.4   48.9    59    74   34   17   130   62.8   159   23   65.0   130  40 175
3 1113     0   19.8   6.1   54.9    70    64   25   18   134   66.1   138   21   70.0   175  44 116
4 1114     0   19.5   7.0   53.6    88    87   43   18   135   61.8   123   26   71.8   196  42 112
5 1115     0   19.5   7.9   53.4    68    87   40   18   130   62.8   146   21   68.0   163  50 129
6 1116     0   22.0   9.5   59.9    93    83   37   18   104   63.4   116   17   74.0   180   0 214
> mean(Statlab$CTPEA)
[1] 79.08642
> median(Statlab$CTPEA)
[1] 80

# In this case, Pearson's index and the moment-based index disagree:

> pskew(Statlab$CTPEA)
[1] -0.2593727
> mskew(Statlab$CTPEA)
[1] 0.3964694

# Visually, the Peabody in the full data set appears to have some positive skew:

> hist(Statlab$CTPEA)

# If we have the csv file on our computer, we can also browse for it:

> statlab2 <- read.csv(file.choose())
> head(statlab2)
  CODE CBSEX CBLGTH CBWGT CTHGHT CTWGT CTPEA CTRA MBAG MBWGT MTHGHT MTWGT FBAG FTHGHT FTWGT FIB FIT
1 1111     0   20.0   6.6   55.7    85    85   34   17   119   66.0   130   19   70.1   171  33 150
2 1112     0   20.0   6.4   48.9    59    74   34   17   130   62.8   159   23   65.0   130  40 175
3 1113     0   19.8   6.1   54.9    70    64   25   18   134   66.1   138   21   70.0   175  44 116
4 1114     0   19.5   7.0   53.6    88    87   43   18   135   61.8   123   26   71.8   196  42 112
5 1115     0   19.5   7.9   53.4    68    87   40   18   130   62.8   146   21   68.0   163  50 129
6 1116     0   22.0   9.5   59.9    93    83   37   18   104   63.4   116   17   74.0   180   0 214

# Simulation can be a valuable tool for guiding us in how seriously to take indications
# of skew and multimodality in small data sets.
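To preview the simulation idea in aggregate form, we can compute the Pearson index for a large number of normal samples of size 40 and see how much it varies by chance alone. This is a sketch: the seed and the 10,000 replications are arbitrary choices, and the quantiles will differ slightly from run to run.

```r
# Pearson's skew index, redefined so this snippet stands alone
pskew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

set.seed(105)   # arbitrary seed, just for reproducibility

# 10,000 samples of size 40 from a population with no skew at all
sim <- replicate(10000, pskew(rnorm(40)))

# The middle 95% of the index's values under perfect symmetry:
quantile(sim, c(0.025, 0.975))
```

In runs like this, values as extreme as the -0.64 we computed for Peabody come up regularly, even though the population is perfectly symmetric.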
# Recall that the Peabody sample appeared
# to have some negative skew and possibly three modes:

> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# But if we look at the plot for a randomly sampled variable with the
# same N that we know comes from a distribution with one mode and no
# skew (the normal distribution), we can see plenty of examples that
# appear just as skewed as our Peabody sample:

> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 996
  -1 | 211
  -0 | 9965
  -0 | 43211
   0 | 012334
   0 | 55778888999
   1 | 001112
   1 | 5
   2 | 1

[1] -0.5874527
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -3 | 1
  -2 | 5
  -1 | 65411
  -0 | 99987555533332110
   0 | 11235689
   1 | 00344
   2 | 013

[1] 0.3206788
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -2 | 21
  -1 | 87
  -1 | 42
  -0 | 87555
  -0 | 43332
   0 | 001112233444
   0 | 555678999
   1 | 234

[1] -0.5808943
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -2 | 7
  -2 |
  -1 |
  -1 | 4110
  -0 | 987655
  -0 | 443320
   0 | 11223
   0 | 556666777899
   1 | 01
   1 | 56
   2 | 22

[1] -0.1799464
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 865
  -1 | 40
  -0 | 9998887766
  -0 | 332221
   0 | 113344
   0 | 556689
   1 | 0134
   1 |
   2 | 22
   2 | 7

[1] 0.4536358
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 311
  -0 | 99987776
  -0 | 44433211
   0 | 12233344
   0 | 66778
   1 | 02244
   1 | 589

[1] -0.00182409

# And we don't have to try this very many times before we see
# something that looks even more multimodal than our sample:

> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 665
  -1 | 4000
  -0 | 8877666655
  -0 | 33331
   0 | 12233444
   0 | 9
   1 | 112444
   1 | 9
   2 |
   2 | 5
   3 | 2

[1] 0.9244788
> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# The lesson here
# is that we really shouldn't be taking very seriously
# the idea that Peabody is multimodal or skewed on the basis of our
# small sample.

# Here, we look at what it means to say that a statistic is biased.
# Basically, what that means is that if we apply the statistic to many
# separate samples from the population, the average of the values we
# get for the statistic is not equal to the true value in the population.

# Looping is a way to make R do something lots of times (here, one million times):

> results <- NULL
> results
NULL
>
> for(i in 1:1000000) {
+ x <- rnorm(10, 50, 10)
+ results[i] <- 9*var(x)/10 }

# Each time through that loop, we are taking a sample of size 10 from a
# distribution that has a variance of 100. When we calculate (and save)
# 9*var(x)/10, we are taking R's sample variance calculation (which divides
# the sum of squared deviations from the mean by N-1) and changing it into
# the formula that uses N (and is biased).

# Here are the first few of the one million variance calculations:

> head(results)
[1] 143.22792  26.18380 108.98963 141.82090 114.95902  69.28769

# They average about 90, but we know the true value is 100:

> mean(results)
[1] 89.9581

# If we change them back to the formula that divides by N-1, we see
# that they average about 100 (i.e., the N-1 statistic is unbiased):

> mean( 10*results/9 )
[1] 99.95344
>
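The loop above can also be written with replicate(), which builds the results vector directly. This is a sketch using 100,000 replications instead of one million so it runs faster; the averages will be near, not exactly, 90 and 100:

```r
set.seed(105)   # arbitrary seed so the run is reproducible

# Each replication: sample N = 10 from a population whose variance is 100,
# then convert R's (N-1)-denominator variance into the biased N-denominator form
biased <- replicate(100000, {
  x <- rnorm(10, 50, 10)
  9 * var(x) / 10
})

mean(biased)            # averages near 90: the N-denominator formula is biased low
mean(10 * biased / 9)   # converting back to the N-1 formula averages near 100
```

With N = 10 and a true variance of 100, the biased formula has expected value (N-1)/N * 100 = 90, which matches what both simulations show.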