# We continued to work with our Peabody variable:

> attach(PeabodyFrame)
> Peabody
 [1]  69  72  94  64  80  77  96  86  89  69  92  71  81  90  84  76 100  57  61
[20]  84  81  65  87  92  89  79  91  65  91  81  86  85  95  93  83  76  84  90
[39]  95  67

# Here's another graphic approach. The middle horizontal line is at the median;
# the other two are at the first and third quartiles. The whiskers extend (in this
# example) to the smallest and largest data points.

> boxplot(Peabody)

# We looked at the help for the function to get more information about how the
# whiskers are drawn.

> help(boxplot)
starting httpd help server ... done

# As with virtually any graphic in R, we can take control of some of the
# basics (although in this case, R has done a nice job):

> boxplot(Peabody, axes=FALSE)
> box()
> axis(side=2, at=c(60,80,100))
> boxplot(Peabody)

# The ordering of the mean and the median can contain information about
# skew, because a long tail tends to pull the mean in the direction of the tail.
# Here, the fact that the mean is smaller than the median suggests negative skew:

> median(Peabody)
[1] 84
> mean(Peabody)
[1] 81.675
> 81.675 - 84
[1] -2.325

# Pearson developed a statistic that captures that information but takes away
# the scale, so that the index is invariant if the scale of the variable is changed:

> 3 * (mean(Peabody)-median(Peabody)) / sd(Peabody)
[1] -0.6387249

# Note that the difference between the mean and median changes if we change the
# scale of the variable.
# For example, if we multiply the variable by 10, the difference
# increases by a factor of 10:

> junk <- 10*Peabody
> 81.675-84
[1] -2.325
> mean(junk)-median(junk)
[1] -23.25

# But Pearson's index stays the same:

> 3 * (mean(junk)-median(junk)) / sd(junk)
[1] -0.6387249

# That's going to be useful, so let's create a function and avoid all
# of that repetitive typing:

> pskew <- function(x) {
+ return( 3 * (mean(x)-median(x))/sd(x)) }
> pskew(Peabody)
[1] -0.6387249

# We can now see the function "pskew" in our R space:

> ls()
[1] "Deviations"    "iqr"           "junk"          "junk_function"
[5] "PeabodyFrame"  "pskew"         "x"

# Let's do some cleanup:

> rm(Deviations,junk,junk_function,x)
> ls()
[1] "iqr"          "PeabodyFrame" "pskew"

# By typing its name, we can see what the function does:

> iqr
function(x) {
return( IQR(x, type=2)) }
> pskew
function(x) {
return( 3 * (mean(x)-median(x))/sd(x)) }
> pskew(Peabody)
[1] -0.6387249

# Here, we write a function that implements the skew index that's based
# on cubed deviations from the mean. This statistic is very unstable in
# small data sets, but can be useful if we have a really big sample.
# (I chose the name "mskew" to remind us that this is the statistic that's
# based on the third moment.)

> mskew <- function(x) {
+ N <- length(x)
+ mn <- mean(x)
+ dev <- x - mn
+ return( N/(N-1)/(N-2) / sd(x)^3 * sum(dev^3)) }
> mskew
function(x) {
N <- length(x)
mn <- mean(x)
dev <- x - mn
return( N/(N-1)/(N-2) / sd(x)^3 * sum(dev^3)) }
>
> # Don't do this! (N is too small.)
> mskew(Peabody)
[1] -0.5223247

# Like pskew, mskew is unaffected by changes of scale:

> mskew(10*Peabody)
[1] -0.5223247

> # Here, we read in the entire N=1296 Statlab data set:

> Statlab <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/105/data/statlab (abridged).csv")

# The variable names are cryptic, but there's a codebook on the web page that explains what
# they mean.
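One more note on the two functions before we explore the Statlab data: both indices are unchanged not only by rescaling, but by any positive linear transformation (changing units, adding a constant). Here is a quick self-contained check; the functions are redefined so the snippet stands alone, and the transformation 5*x + 100 is just an arbitrary illustration:

```r
# Skew indices from the session above, redefined so this snippet stands alone
pskew <- function(x) 3 * (mean(x) - median(x)) / sd(x)
mskew <- function(x) {
  N <- length(x)
  dev <- x - mean(x)
  N / (N - 1) / (N - 2) / sd(x)^3 * sum(dev^3)
}

x <- rnorm(40)
y <- 5 * x + 100   # change of units plus a shift

all.equal(pskew(x), pskew(y))   # TRUE
all.equal(mskew(x), mskew(y))   # TRUE
```

Converting heights from inches to centimeters, say, would therefore leave both skew values untouched.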
> head(Statlab)
  CODE CBSEX CBLGTH CBWGT CTHGHT CTWGT CTPEA CTRA MBAG MBWGT MTHGHT MTWGT FBAG FTHGHT FTWGT FIB FIT
1 1111     0   20.0   6.6   55.7    85    85   34   17   119   66.0   130   19   70.1   171  33 150
2 1112     0   20.0   6.4   48.9    59    74   34   17   130   62.8   159   23   65.0   130  40 175
3 1113     0   19.8   6.1   54.9    70    64   25   18   134   66.1   138   21   70.0   175  44 116
4 1114     0   19.5   7.0   53.6    88    87   43   18   135   61.8   123   26   71.8   196  42 112
5 1115     0   19.5   7.9   53.4    68    87   40   18   130   62.8   146   21   68.0   163  50 129
6 1116     0   22.0   9.5   59.9    93    83   37   18   104   63.4   116   17   74.0   180   0 214
> mean(Statlab$CTPEA)
[1] 79.08642
> median(Statlab$CTPEA)
[1] 80

# In this case, Pearson's index and the moment-based index disagree:

> pskew(Statlab$CTPEA)
[1] -0.2593727
> mskew(Statlab$CTPEA)
[1] 0.3964694

# Visually, the Peabody in the full data set appears to have some positive skew:

> hist(Statlab$CTPEA)

# If we have the csv file on our computer, we can also browse for it:

> statlab2 <- read.csv(file.choose())
> head(statlab2)
  CODE CBSEX CBLGTH CBWGT CTHGHT CTWGT CTPEA CTRA MBAG MBWGT MTHGHT MTWGT FBAG FTHGHT FTWGT FIB FIT
1 1111     0   20.0   6.6   55.7    85    85   34   17   119   66.0   130   19   70.1   171  33 150
2 1112     0   20.0   6.4   48.9    59    74   34   17   130   62.8   159   23   65.0   130  40 175
3 1113     0   19.8   6.1   54.9    70    64   25   18   134   66.1   138   21   70.0   175  44 116
4 1114     0   19.5   7.0   53.6    88    87   43   18   135   61.8   123   26   71.8   196  42 112
5 1115     0   19.5   7.9   53.4    68    87   40   18   130   62.8   146   21   68.0   163  50 129
6 1116     0   22.0   9.5   59.9    93    83   37   18   104   63.4   116   17   74.0   180   0 214

# Simulation can be a valuable tool for guiding us in how seriously to take indications
# of skew and multimodality in small data sets.
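To preview the simulation idea in aggregate form, we can compute the Pearson index for a large number of normal samples of size 40 and see how much it varies by chance alone. This is a sketch: the seed and the 10,000 replications are arbitrary choices, and the quantiles will differ slightly from run to run.

```r
# Pearson's skew index, redefined so this snippet stands alone
pskew <- function(x) 3 * (mean(x) - median(x)) / sd(x)

set.seed(105)   # arbitrary seed, just for reproducibility

# 10,000 samples of size 40 from a population with no skew at all
sim <- replicate(10000, pskew(rnorm(40)))

# The middle 95% of the index's values under perfect symmetry:
quantile(sim, c(0.025, 0.975))
```

In runs like this, values as extreme as the -0.64 we computed for Peabody come up regularly, even though the population is perfectly symmetric.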
# Recall that the Peabody sample appeared
# to have some negative skew and possibly three modes:

> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# But if we look at the plot for a randomly sampled variable with the
# same N that we know comes from a distribution with one mode and no
# skew (the normal distribution), we can see plenty of examples that
# appear just as skewed as our Peabody sample:

> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 996
  -1 | 211
  -0 | 9965
  -0 | 43211
   0 | 012334
   0 | 55778888999
   1 | 001112
   1 | 5
   2 | 1

[1] -0.5874527
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -3 | 1
  -2 | 5
  -1 | 65411
  -0 | 99987555533332110
   0 | 11235689
   1 | 00344
   2 | 013

[1] 0.3206788
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -2 | 21
  -1 | 87
  -1 | 42
  -0 | 87555
  -0 | 43332
   0 | 001112233444
   0 | 555678999
   1 | 234

[1] -0.5808943
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -2 | 7
  -2 |
  -1 |
  -1 | 4110
  -0 | 987655
  -0 | 443320
   0 | 11223
   0 | 556666777899
   1 | 01
   1 | 56
   2 | 22

[1] -0.1799464
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 865
  -1 | 40
  -0 | 9998887766
  -0 | 332221
   0 | 113344
   0 | 556689
   1 | 0134
   1 |
   2 | 22
   2 | 7

[1] 0.4536358
> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 311
  -0 | 99987776
  -0 | 44433211
   0 | 12233344
   0 | 66778
   1 | 02244
   1 | 589

[1] -0.00182409

# And we don't have to try this very many times before we see
# something that looks even more multimodal than our sample:

> x <- rnorm(40); stem(x); pskew(x)

  The decimal point is at the |

  -1 | 665
  -1 | 4000
  -0 | 8877666655
  -0 | 33331
   0 | 12233444
   0 | 9
   1 | 112444
   1 | 9
   2 |
   2 | 5
   3 | 2

[1] 0.9244788
> stem(Peabody)

  The decimal point is 1 digit(s) to the right of the |

   5 | 7
   6 | 14
   6 | 55799
   7 | 12
   7 | 6679
   8 | 01113444
   8 | 566799
   9 | 00112234
   9 | 556
  10 | 0

# The lesson here
# is that we really shouldn't be taking very seriously
# the idea that Peabody is multimodal or skewed on the basis of our
# small sample.

# Here, we look at what it means to say that a statistic is biased.
# Basically, what that means is that if we apply the statistic to many
# separate samples from the population, the average of the values we
# get for the statistic is not equal to the true value in the population.

# Looping is a way to make R do something lots of times (here, one million times):

> results <- NULL
> results
NULL
>
> for(i in 1:1000000) {
+ x <- rnorm(10, 50, 10)
+ results[i] <- 9*var(x)/10 }

# Each time through that loop, we are taking a sample of size 10 from a
# distribution that has a variance of 100. When we calculate (and save)
# 9*var(x)/10, we are taking R's sample variance calculation (which divides
# the sum of squared deviations from the mean by N-1) and changing it into
# the formula that uses N (and is biased).

# Here are the first few of the one million variance calculations:

> head(results)
[1] 143.22792  26.18380 108.98963 141.82090 114.95902  69.28769

# They average about 90, but we know the true value is 100:

> mean(results)
[1] 89.9581

# If we change them back to the formula that divides by N-1, we see
# that they average about 100 (i.e., the N-1 statistic is unbiased):

> mean( 10*results/9 )
[1] 99.95344
>
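The loop above can also be written with replicate(), which builds the results vector directly. This is a sketch using 100,000 replications instead of one million so it runs faster; the averages will be near, not exactly, 90 and 100:

```r
set.seed(105)   # arbitrary seed so the run is reproducible

# Each replication: sample N = 10 from a population whose variance is 100,
# then convert R's (N-1)-denominator variance into the biased N-denominator form
biased <- replicate(100000, {
  x <- rnorm(10, 50, 10)
  9 * var(x) / 10
})

mean(biased)            # averages near 90: the N-denominator formula is biased low
mean(10 * biased / 9)   # converting back to the N-1 formula averages near 100
```

With N = 10 and a true variance of 100, the biased formula has expected value (N-1)/N * 100 = 90, which matches what both simulations show.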