# The interquartile range (IQR) is the difference between the median
# of the upper half of the data (the third quartile) and the
# median of the bottom half of the data (the first quartile).
# We used the stem-and-leaf plot in the PowerPoint to calculate
# that those quartiles are 90.5 and 74, so the IQR is:

> 90.5 - 74
[1] 16.5

> attach(PeabodyFrame)
> Peabody
 [1] 69 72 94 64 80 77 96 86 89 69 92 71 81 90 84 76 100 57 61
[20] 84 81 65 87 92 89 79 91 65 91 81 86 85 95 93 83 76 84 90
[39] 95 67

# We might guess that R would use iqr() as the name for
# an IQR function, but that's actually not the case:

> iqr(Peabody)
Error in iqr(Peabody) : could not find function "iqr"

# R uses upper-case letters for the function name. But R's IQR() function
# doesn't give us the answer we want:

> IQR(Peabody)
[1] 15.25

# That's because embedded in the IQR() function is a call to the
# quantile() function, and R, by default, defines quantiles in a way
# that differs from our own way of thinking. We can force R to use
# the definition of quantile that matches what we want:

> IQR(Peabody, type=2)
[1] 16.5

# I don't want to have to remember to do that every time I want an
# IQR, so I write my own lower-case iqr() function that automatically
# changes the "type" value:

> iqr <- function(x) {
+   return( IQR(x, type=2)) }

# Now when I call iqr(some variable name), R will do everything
# that appears between the open and close curly braces in that
# function definition.
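The type=2 rule that iqr() relies on can be spelled out in a few lines outside R. Purely as an illustration, here is a hedged Python sketch of that calculation; the names quantile_type2 and peabody are my own, and the rule implemented is R's documented type=2 behavior (the inverse of the empirical CDF, averaging the two neighboring order statistics when n*p lands exactly on an observation):

```python
def quantile_type2(x, p):
    """Sketch of R's quantile(x, p, type=2) for 0 < p < 1 (my own helper)."""
    s = sorted(x)
    h = len(s) * p                     # position on a 1-based scale
    j = int(h)                         # floor(h)
    if h == j:                         # n*p lands exactly on an observation:
        return (s[j - 1] + s[j]) / 2   # average the two neighbors
    return s[j]                        # otherwise take the next order statistic

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

q1 = quantile_type2(peabody, 0.25)   # 74.0
q3 = quantile_type2(peabody, 0.75)   # 90.5
print(q3 - q1)                       # 16.5, matching IQR(Peabody, type=2)
```

With 40 observations, 40 * 0.25 = 10 falls exactly on an observation, so the first quartile averages the 10th and 11th sorted values (72 and 76), reproducing the stem-and-leaf answer of 74.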
# Now that I have created my function, it will appear in my R space,
# looking just like a variable:

> ls()
[1] "iqr" "PeabodyFrame"

# But if I type its name, instead of seeing the values of a variable,
# I will see the definition of the function:

> iqr
function(x) {
  return( IQR(x, type=2)) }

# The new iqr() function gives exactly the same result as R's
# IQR() function with type set to 2:

> IQR(Peabody, type=2)
[1] 16.5
> iqr(Peabody)
[1] 16.5

# If we look at the definition of R's IQR() function, we can see
# that it specifies type=7, which was not what we wanted:

> IQR
function (x, na.rm = FALSE, type = 7)
diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
    type = type))

# It's the quantile() function that actually uses the "type" setting.
# Here, we look at the first and third quartiles using R's default:

> quantile(Peabody, .25)
25%
 75
> quantile(Peabody, .75)
  75%
90.25

# That's why R's IQR() function gave us a different result:

> 90.25 - 75
[1] 15.25

# The different "types" of quantile give different answers, but we will
# always want to use type=2:

> quantile(Peabody, .75, type=7)
  75%
90.25
> quantile(Peabody, .75, type=6)
  75%
90.75
> quantile(Peabody, .75, type=5)
 75%
90.5
> quantile(Peabody, .75, type=4)
75%
 90
> quantile(Peabody, .75, type=3)
75%
 90
> quantile(Peabody, .75, type=2)
 75%
90.5

# To address a student question, I wrote another (kind of stupid) function
# that calculates the difference between the third observation of a variable
# and the fifth:

> junk_function <- function(qqq) {
+   return(qqq[3] - qqq[5]) }

# The third Peabody value is 94, and the fifth is 80:

> Peabody
 [1] 69 72 94 64 80 77 96 86 89 69 92 71 81 90 84 76 100 57 61
[20] 84 81 65 87 92 89 79 91 65 91 81 86 85 95 93 83 76 84 90
[39] 95 67

# So my stupid function returns 94 - 80:

> junk_function(Peabody)
[1] 14

# Of course, I didn't allow for the possibility that there would not be
# as many as five values, so if I try to apply this function to a shorter
# variable, it chokes:

> x <- 1:3
> junk_function(x)
[1] NA

# We motivated the idea of the standard deviation by talking about average
# deviation from the mean. Here, we calculate the deviations:

> Deviations <- Peabody - mean(Peabody)
> Deviations
 [1] -12.675 -9.675 12.325 -17.675 -1.675 -4.675 14.325 4.325 7.325
[10] -12.675 10.325 -10.675 -0.675 8.325 2.325 -5.675 18.325 -24.675
[19] -20.675 2.325 -0.675 -16.675 5.325 10.325 7.325 -2.675 9.325
[28] -16.675 9.325 -0.675 4.325 3.325 13.325 11.325 1.325 -5.675
[37] 2.325 8.325 13.325 -14.675

# The problem is that the average deviation from the mean is always zero,
# because the negative deviations cancel out the positive deviations.
# (R reports a tiny number like 2.8e-15 rather than exactly 0 only because
# of floating-point rounding.)

> mean(Deviations)
[1] 2.842453e-15

# We can make them all positive by taking the absolute values:

> abs(Deviations)
 [1] 12.675 9.675 12.325 17.675 1.675 4.675 14.325 4.325 7.325 12.675
[11] 10.325 10.675 0.675 8.325 2.325 5.675 18.325 24.675 20.675 2.325
[21] 0.675 16.675 5.325 10.325 7.325 2.675 9.325 16.675 9.325 0.675
[31] 4.325 3.325 13.325 11.325 1.325 5.675 2.325 8.325 13.325 14.675

# The mean of those absolute deviations from the mean gives us a perfectly
# reasonable way of talking about variability. It is, quite literally, the
# mean distance from the mean:

> mean(abs(Deviations))
[1] 8.9575

# In fact, that statistic has a name; it's the mean absolute deviation
# (sometimes abbreviated mad).

# The problem with using the mad as a measure of variability is that it's
# very difficult to work with absolute values when we do mathematical proofs
# and derivations, and much of what we do in statistics is based on such
# proofs. Mathematicians generally prefer to make things positive by squaring.
# We could calculate the mean squared deviation (known as the "variance"):

> mean(Deviations^2)
[1] 116.2694

# That doesn't look anything like an average distance from the mean, because
# it's still in the squared metric.
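The deviation arithmetic above is easy to replay outside R. Here is a minimal Python sketch, offered purely as a cross-check; the variable names are mine, and statistics.pvariance is Python's divide-by-N variance, the same quantity as mean(Deviations^2):

```python
import statistics

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

m = statistics.mean(peabody)            # 81.675
deviations = [v - m for v in peabody]

# Deviations from the mean always sum to (essentially) zero:
print(sum(deviations))                  # ~0, up to floating-point noise

# Mean absolute deviation: literally the mean distance from the mean.
mad = statistics.mean(abs(d) for d in deviations)
print(mad)                              # about 8.9575

# Mean squared deviation, i.e. the divide-by-N (population) variance:
print(statistics.pvariance(peabody))    # about 116.2694
```

The ~0 result for the summed deviations is the same floating-point residue that R shows as 2.842453e-15.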
# We can put it back in the unsquared metric by taking the square root:

> sqrt(mean(Deviations^2))
[1] 10.78283

# And we get a value that is pretty close to the value of the mad. So it is
# useful to think of the standard deviation as being "sort of like average
# distance from the mean."

# There's just one problem with that logical, average-based approach to
# calculating the variance: in small samples, it tends to give a result
# that is a bit smaller than the true value in the population. We can
# compensate for that by using (N - 1) rather than N as the denominator
# when we calculate the average squared deviation:

> sum(Deviations^2) / (length(Peabody)-1)
[1] 119.2506

# That is known as the "sample variance" (as opposed to the "population
# variance," which is the value we get when we divide by N instead of
# (N - 1)). R has a function for calculating the sample variance:

> var(Peabody)
[1] 119.2506

# Its square root is the sample standard deviation:

> sqrt(var(Peabody))
[1] 10.92019

# R can also calculate the sample standard deviation directly with the
# sd() function:

> sd(Peabody)
[1] 10.92019

# One important note... The variance is a useful mathematical concept, but
# it is not so useful as a way of describing variability (because it is in
# the squared metric). Use the standard deviation rather than the variance
# as your descriptive statistic.
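The N versus (N - 1) arithmetic can likewise be replayed outside R. A minimal Python sketch, assuming a side-by-side check is helpful (the names are mine; statistics.variance and statistics.stdev use the N - 1 denominator, like R's var() and sd()):

```python
import math
import statistics

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

n = len(peabody)
m = statistics.mean(peabody)
ss = sum((v - m) ** 2 for v in peabody)   # sum of squared deviations

pop_var = ss / n               # divide by N: the population variance
samp_var = ss / (n - 1)        # divide by N - 1: the sample variance, like var()
samp_sd = math.sqrt(samp_var)  # sample standard deviation, like sd()

print(round(samp_var, 4))      # 119.2506
print(round(samp_sd, 5))       # 10.92019

# The standard library's variance()/stdev() already use N - 1:
print(statistics.variance(peabody), statistics.stdev(peabody))
```

With only 40 observations the two denominators differ noticeably (116.27 vs. 119.25); as N grows, the distinction matters less and less.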