# We reviewed the process of improving histograms in R (and added a bit
# of information about titling and axis labels)

> par(pin=c(6,4))
> hist(Peabody)

# In that figure, we got a reasonable title and axis label, because "Peabody"
# had been named in such a way as to facilitate that. Suppose, instead, we had
# called the variable something less graceful:

> CTPB <- Peabody
> hist(CTPB)

# Here's how to control the title and axis label:

> hist(CTPB, main="Peabody Picture Vocabulary Scores", xlab="Peabody Score")

# Some titles may be too long to fit well:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of 10-Year-Old Children in Oakland, CA", xlab="Peabody Score")

# But we can force a line break in the middle:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of\n 10-Year-Old Children in Oakland, CA", xlab="Peabody Score")

# Note that the midpoints of the intervals in R's default grouping don't very
# accurately represent the original values of the data. For example, if we take
# a weighted average of those midpoints to try to reconstruct the sample mean,
# the result isn't very close to the actual mean:

> (57.5 + 4*62.5 + 3*67.5 + 2*72.5 + 5*77.5 + 8*82.5 + 7*87.5 + 8*92.5 + 102.5) / 40
[1] 78.9375
> mean(Peabody)
[1] 81.675

# Here, we produce a histogram with more carefully chosen break points:

> hist(CTPB, main="Peabody Picture Vocabulary Scores of\n 10-Year-Old Children in Oakland, CA", xlab="Peabody Score",
+ breaks=seq(54.5,104.5,5), axes=F)
> axis(side=1, at=seq(57,102,5), pos=0)
> axis(side=2, at=seq(0,8,2))
> abline(h=0)

# Now, the weighted average of the midpoints produces a result that's very
# close to the actual mean:

> (57 + 2*62 + 5*67 + 2*72 + 4*77 + 8*82 + 6*87 + 8*92 + 3*97 + 102) / 40
[1] 81.875
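# (A quick sketch, not something we ran in class: hist() invisibly returns the
# breaks, midpoints, and counts it used, so R can do this midpoint bookkeeping
# for us. With plot=FALSE it tabulates the groups without drawing a figure.)

h <- hist(Peabody, breaks=seq(54.5,104.5,5), plot=FALSE)
h$mids                                  # interval midpoints: 57, 62, ..., 102
h$counts                                # frequency in each interval
sum(h$mids * h$counts) / sum(h$counts)  # the same grouped mean we computed by hand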
# We discussed the idea that seeing things like multiple modes or skew in
# figures of small data sets can be misleading. Here, we randomly sample N=40
# draws from a normal distribution (which is, by definition, symmetric and
# unimodal in the population). Look how often samples appear asymmetric or
# multimodal when we repeat that process:

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 7
  -1 | 43322110
  -0 | 87665
  -0 | 4432
   0 | 01112224444
   0 | 666678
   1 | 2
   1 | 9
   2 | 3
   2 | 5

> stem( rnorm(40) )

  The decimal point is at the |

  -3 | 1
  -2 | 0
  -1 | 721
  -0 | 9998854432100
   0 | 001133355789
   1 | 012223457
   2 | 5

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 6
  -2 |
  -1 | 6
  -1 | 442
  -0 | 977665
  -0 | 3321100
   0 | 1223333344
   0 | 556678
   1 | 244
   1 | 58
   2 | 0

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 9
  -1 | 11000
  -0 | 7776555
  -0 | 444333221
   0 | 11223444
   0 | 5568
   1 | 0023
   1 | 8
   2 | 4

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 8
  -1 | 4221100
  -0 | 9755
  -0 | 4432
   0 | 1222233344
   0 | 566777899
   1 | 11233

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 887
  -1 | 4444322
  -0 | 65
  -0 | 44432222110
   0 | 122234
   0 | 567
   1 | 034
   1 | 5778

> stem( rnorm(40) )

  The decimal point is at the |

  -1 | 885
  -1 | 221
  -0 | 665
  -0 | 433222222000
   0 | 1124444
   0 | 567789
   1 | 0023
   1 | 89

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 2
  -1 | 975
  -1 | 22
  -0 | 988765
  -0 | 3332222
   0 | 00112244444
   0 | 7899
   1 | 000
   1 | 559

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 0
  -1 | 85
  -1 | 22220
  -0 | 98876666555
  -0 | 332211
   0 | 0111223
   0 | 567888
   1 | 01

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 830
  -1 | 86664330
  -0 | 765433322111
   0 | 012233556689
   1 | 0368
   2 | 3

> stem( rnorm(40) )

  The decimal point is at the |

  -2 | 5
  -2 | 0
  -1 | 76
  -1 | 2200
  -0 | 8888755
  -0 | 44311
   0 | 01122334
   0 | 8899
   1 | 04
   1 | 55677
   2 | 0

# We noted that our most commonly used measures of central tendency will be the
# mean and the median:

> Peabody
 [1]  69  72  94  64  80  77  96  86  89  69  92  71  81  90  84  76 100  57  61  84  81  65  87  92  89  79  91  65  91
[30]  81  86  85  95  93  83  76  84  90  95  67
> mean(Peabody)
[1] 81.675
> median(Peabody)
[1] 84

# The mean is more vulnerable to the effects of extreme observations than the median:

> Peabody[17] <- 1000
> Peabody
 [1]   69   72   94   64   80   77   96   86   89   69   92   71   81   90   84   76 1000   57   61   84   81   65   87
[24]   92   89   79   91   65   91   81   86   85   95   93   83   76   84   90   95   67
> median(Peabody)
[1] 84
> mean(Peabody)
[1] 104.175
> Peabody[17] <- 100

# The mean will also tend to be pulled toward the longer tail of a skewed
# distribution. Recall that our figure shows some suggestion of negative skew
# (which we probably shouldn't take very seriously). Consistent with that, the
# mean is lower than the median:

> mean(Peabody)
[1] 81.675
> median(Peabody)
[1] 84

# We used a stem-and-leaf plot to help us count off order statistics
# in the distribution. The median of the lower half (i.e., the first
# quartile) is the average of the 10th and 11th observations: (72+76)/2 = 74.
# The median of the upper half (the third quartile) is the average of the
# 30th and 31st observations: (90+91)/2 = 90.5.

> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# The interquartile range is the difference between those two values:

> 90.5 - 74
[1] 16.5
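# (A quick check, not something we ran in class: sort() gives us the order
# statistics directly, so we can index the same observations we counted off
# the stem-and-leaf plot by eye.)

sp <- sort(Peabody)      # sp[1] is the smallest observation, sp[40] the largest
(sp[10] + sp[11]) / 2    # first quartile by our hand count: (72+76)/2 = 74
(sp[30] + sp[31]) / 2    # third quartile by our hand count: (90+91)/2 = 90.5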
# We might guess that R would have a function called iqr(). We would be wrong:

> iqr(Peabody)
Error in iqr(Peabody) : could not find function "iqr"

# R capitalizes the function name. But it doesn't give us the same answer as our
# manual calculation:

> IQR(Peabody)
[1] 15.25

# That's because R uses a different algorithm for calculating quantiles. For
# example, it thinks the first quartile is 75, not 74:

> quantile(Peabody, .25)
25% 
 75 

# But we can force R to do things our way:

> quantile(Peabody, .25, type=2)
25% 
 74 

# That works for IQR() as well:

> IQR(Peabody, type=2)
[1] 16.5

# Here, we write our own function for iqr(), so that we don't have to keep
# typing "type=2":

> iqr <- function(x) {
+ IQR(x, type=2) }
> iqr
function(x) {
IQR(x, type=2) }
> iqr(Peabody)
[1] 16.5

# We motivated the idea of standard deviation by saying that it's sort of
# like the average distance from the mean. But if we apply that idea literally,
# we find that the average deviation from the mean is zero (up to rounding error):

> mean( Peabody - mean(Peabody) )
[1] 2.842453e-15

# That's because negative deviations offset positive ones. If we take the absolute
# values of the deviations, we can calculate something that literally is the average
# distance from the mean. (In fact, this infrequently used statistic is called the
# mean absolute deviation.)

> mean( abs(Peabody-mean(Peabody)))
[1] 8.9575

# Absolute values are really messy to deal with in mathematics, though. Mathematicians
# tend to prefer making things positive by squaring them. So here, we calculate the
# average squared distance from the mean:

> mean( (Peabody-mean(Peabody))^2)
[1] 116.2694

# And we can put that back in the metric of the data by taking the square root:

> sqrt(mean( (Peabody-mean(Peabody))^2))
[1] 10.78283

# The average squared distance from the mean is called the "population variance,"
# and its square root is called the "population standard deviation." However, in
# small samples, these statistics are biased; they tend to be a bit smaller than
# they should be. So what we actually use in data analysis is the "sample standard
# deviation," which corrects for the bias by dividing by (N-1) instead of by N:

> sqrt(sum( (Peabody-mean(Peabody))^2)/40)
[1] 10.78283
> sqrt(sum( (Peabody-mean(Peabody))^2)/39)
[1] 10.92019

# R's sd() function gives us the sample standard deviation:

> sd(Peabody)
[1] 10.92019

# And R's var() function gives the sample variance:

> var(Peabody)
[1] 119.2506
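# (A closing sketch, not something we ran in class: the divide-by-N versions
# aren't built into R, but we can write them ourselves. The names pop.var and
# pop.sd are our own, not standard R functions.)

pop.var <- function(x) { mean( (x - mean(x))^2 ) }  # average squared deviation (divide by N)
pop.sd  <- function(x) { sqrt(pop.var(x)) }         # its square root

pop.sd(Peabody)       # reproduces the divide-by-N value, 10.78283
sd(Peabody)           # R's divide-by-(N-1) version, 10.92019
var(Peabody) * 39/40  # rescaling var() by (N-1)/N recovers pop.var(Peabody)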