# We reviewed the process of improving histograms in R (and added a bit
# of information about titling and axis labels)

> par(pin=c(6,4))
> hist(Peabody)

# In that figure, we got a reasonable title and axis label, because "Peabody"
# had been named in such a way as to facilitate that. Suppose, instead, we had
# called the variable something less graceful:

> CTPB <- Peabody
> hist(CTPB)

# Here's how to control the title and axis label:

> hist(CTPB, main="Peabody Picture Vocabulary Scores", xlab="Peabody Score")

# Some titles may be too long to fit well:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of 10-Year-Old Children in Oakland, CA", xlab="Peabody Score")

# But we can force a line break in the middle:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of\n 10-Year-Old Children in Oakland, CA", xlab="Peabody Score")

# Note that the midpoints of the intervals in R's default grouping don't very
# accurately represent the original values of the data. For example, if we take
# a weighted average of those midpoints to try to reconstruct the sample mean,
# the result isn't very close to the actual mean:

> (57.5 + 4*62.5 + 3*67.5 + 2*72.5 + 5*77.5 + 8*82.5 + 7*87.5 + 8*92.5 + 102.5) / 40
[1] 78.9375
> mean(Peabody)
[1] 81.675

# Here, we produce a histogram with more carefully chosen break points:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of\n 10-Year-Old Children in Oakland, CA", xlab="Peabody Score",
+ breaks=seq(54.5,104.5,5), axes=F)
> axis(side=1, at=seq(57,102,5), pos=0)
> axis(side=2, at=seq(0,8,2))
> abline(h=0)

# Now, the weighted average of the midpoints produces a result that's very
# close to the actual mean:

> (57 + 2*62 + 5*67 + 2*72 + 4*77 + 8*82 + 6*87 + 8*92 + 3*97 + 102) / 40
[1] 81.875
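# (A quick sketch, not something we ran in class: hist() invisibly returns the
# breaks, midpoints, and counts it used, so R can do this midpoint bookkeeping
# for us. With plot=FALSE it tabulates the groups without drawing a figure.)

h <- hist(Peabody, breaks=seq(54.5,104.5,5), plot=FALSE)
h$mids                                  # interval midpoints: 57, 62, ..., 102
h$counts                                # frequency in each interval
sum(h$mids * h$counts) / sum(h$counts)  # the same grouped mean we computed by hand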
# We discussed the idea that seeing things like multiple modes or skew in
# figures of small data sets can be misleading. Here, we randomly sample N=40
# draws from a normal distribution (which is, by definition, symmetric and
# unimodal in the population). Look how often samples appear asymmetric or
# multimodal when we repeat that process:

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 7
  -1 | 43322110
  -0 | 87665
  -0 | 4432
   0 | 01112224444
   0 | 666678
   1 | 2
   1 | 9
   2 | 3
   2 | 5

> stem( rnorm(40) )

  The decimal point is at the |

  -3 | 1
  -2 | 0
  -1 | 721
  -0 | 9998854432100
   0 | 001133355789
   1 | 012223457
   2 | 5

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 6
  -2 |
  -1 | 6
  -1 | 442
  -0 | 977665
  -0 | 3321100
   0 | 1223333344
   0 | 556678
   1 | 244
   1 | 58
   2 | 0

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 9
  -1 | 11000
  -0 | 7776555
  -0 | 444333221
   0 | 11223444
   0 | 5568
   1 | 0023
   1 | 8
   2 | 4

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 8
  -1 | 4221100
  -0 | 9755
  -0 | 4432
   0 | 1222233344
   0 | 566777899
   1 | 11233

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 887
  -1 | 4444322
  -0 | 65
  -0 | 44432222110
   0 | 122234
   0 | 567
   1 | 034
   1 | 5778

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 885
  -1 | 221
  -0 | 665
  -0 | 433222222000
   0 | 1124444
   0 | 567789
   1 | 0023
   1 | 89

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 2
  -1 | 975
  -1 | 22
  -0 | 988765
  -0 | 3332222
   0 | 00112244444
   0 | 7899
   1 | 000
   1 | 559

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 85
  -1 | 22220
  -0 | 98876666555
  -0 | 332211
   0 | 0111223
   0 | 567888
   1 | 01

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 830
  -1 | 86664330
  -0 | 765433322111
   0 | 012233556689
   1 | 0368
   2 | 3

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 5
  -2 | 0
  -1 | 76
  -1 | 2200
  -0 | 8888755
  -0 | 44311
   0 | 01122334
   0 | 8899
   1 | 04
   1 | 55677
   2 | 0

# We noted that our most commonly used measures of central tendency will be the
# mean and the median:

> Peabody
 [1]  69  72  94  64  80  77  96  86  89  69  92  71  81  90  84  76 100  57  61  84  81  65  87  92  89  79  91  65  91
[30]  81  86  85  95  93  83  76  84  90  95  67
> mean(Peabody)
[1] 81.675
> median(Peabody)
[1] 84

# The mean is more vulnerable to the effects of extreme observations than the median:

> Peabody[17] <- 1000
> Peabody
 [1]   69   72   94   64   80   77   96   86   89   69   92   71   81   90   84   76 1000   57   61   84   81   65   87
[24]   92   89   79   91   65   91   81   86   85   95   93   83   76   84   90   95   67
> median(Peabody)
[1] 84
> mean(Peabody)
[1] 104.175
> Peabody[17] <- 100

# The mean will also tend to be pulled toward the longer tail of a skewed
# distribution. Recall that our figure shows some suggestion of negative skew
# (which we probably shouldn't take very seriously). Consistent with that, the
# mean is lower than the median:

> mean(Peabody)
[1] 81.675
> median(Peabody)
[1] 84

# We used a stem-and-leaf plot to help us count off order statistics
# in the distribution. The median of the lower half (i.e., the first
# quartile) is the average of the 10th and 11th observations: (72+76)/2 = 74.
# The median of the upper half (the third quartile) is the average of the
# 30th and 31st observations: (90+91)/2 = 90.5.

> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# The interquartile range is the difference between those two values:

> 90.5 - 74
[1] 16.5
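# (A quick check, not something we ran in class: sort() gives us the order
# statistics directly, so we can index the same observations we counted off
# the stem-and-leaf plot by eye.)

sp <- sort(Peabody)      # sp[1] is the smallest observation, sp[40] the largest
(sp[10] + sp[11]) / 2    # first quartile by our hand count: (72+76)/2 = 74
(sp[30] + sp[31]) / 2    # third quartile by our hand count: (90+91)/2 = 90.5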
# We might guess that R would have a function called iqr(). We would be wrong:

> iqr(Peabody)
Error in iqr(Peabody) : could not find function "iqr"

# R capitalizes the function name. But it doesn't give us the same answer as our
# manual calculation:

> IQR(Peabody)
[1] 15.25

# That's because R uses a different algorithm for calculating quantiles. For
# example, it thinks the first quartile is 75, not 74:

> quantile(Peabody, .25)
25% 
 75 

# But we can force R to do things our way:

> quantile(Peabody, .25, type=2)
25% 
 74 

# That works for IQR() as well:

> IQR(Peabody, type=2)
[1] 16.5

# Here, we write our own function for iqr(), so that we don't have to keep
# typing "type=2":

> iqr <- function(x) {
+ IQR(x, type=2) }
> iqr
function(x) {
IQR(x, type=2) }
> iqr(Peabody)
[1] 16.5

# We motivated the idea of standard deviation by saying that it's sort of
# like the average distance from the mean. But if we apply that idea literally,
# we find that the average deviation from the mean is zero (up to rounding error):

> mean( Peabody - mean(Peabody) )
[1] 2.842453e-15

# That's because negative deviations offset positive ones. If we take the absolute
# values of the deviations, we can calculate something that literally is the average
# distance from the mean. (In fact, this infrequently used statistic is called the
# mean absolute deviation.)

> mean( abs(Peabody-mean(Peabody)))
[1] 8.9575

# Absolute values are really messy to deal with in mathematics, though. Mathematicians
# tend to prefer making things positive by squaring them. So here, we calculate the
# average squared distance from the mean:

> mean( (Peabody-mean(Peabody))^2)
[1] 116.2694

# And we can put that back in the metric of the data by taking the square root:

> sqrt(mean( (Peabody-mean(Peabody))^2))
[1] 10.78283

# The average squared distance from the mean is called the "population variance,"
# and its square root is called the "population standard deviation." However, in
# small samples, these statistics are biased; they tend to be a bit smaller than
# they should be. So what we actually use in data analysis is the "sample standard
# deviation," which corrects for the bias by dividing by (N-1) instead of by N:

> sqrt(sum( (Peabody-mean(Peabody))^2)/40)
[1] 10.78283
> sqrt(sum( (Peabody-mean(Peabody))^2)/39)
[1] 10.92019

# R's sd() function gives us the sample standard deviation:

> sd(Peabody)
[1] 10.92019

# And R's var() function gives the sample variance:

> var(Peabody)
[1] 119.2506
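# (A closing sketch, not something we ran in class: the divide-by-N versions
# aren't built into R, but we can write them ourselves. The names pop.var and
# pop.sd are our own, not standard R functions.)

pop.var <- function(x) { mean( (x - mean(x))^2 ) }  # average squared deviation (divide by N)
pop.sd  <- function(x) { sqrt(pop.var(x)) }         # its square root

pop.sd(Peabody)       # reproduces the divide-by-N value, 10.78283
sd(Peabody)           # R's divide-by-(N-1) version, 10.92019
var(Peabody) * 39/40  # rescaling var() by (N-1)/N recovers pop.var(Peabody)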