# The interquartile range (IQR) is the difference between the median
# of the upper half of the data (the third quartile) and the
# median of the bottom half of the data (the first quartile).
# We used the stem-and-leaf plot in the PowerPoint to calculate
# that those quartiles are 90.5 and 74, so the IQR is:

> 90.5 - 74
[1] 16.5

> attach(PeabodyFrame)
> Peabody
 [1] 69 72 94 64 80 77 96 86 89 69 92 71 81 90 84 76 100 57 61
[20] 84 81 65 87 92 89 79 91 65 91 81 86 85 95 93 83 76 84 90
[39] 95 67

# We might guess that R would use iqr() as the name for
# an IQR function, but that's actually not the case:

> iqr(Peabody)
Error in iqr(Peabody) : could not find function "iqr"

# R uses upper-case letters for the function name. But R's IQR() function
# doesn't give us the answer we want:

> IQR(Peabody)
[1] 15.25

# That's because embedded in the IQR() function is a call to the
# quantile() function, and R, by default, defines quantiles in a way
# that differs from our own way of thinking. We can force R to use
# the definition of quantile that matches what we want:

> IQR(Peabody, type=2)
[1] 16.5

# I don't want to have to remember to do that every time I want an
# IQR, so I write my own lower-case iqr() function that automatically
# changes the "type" value:

> iqr <- function(x) {
+   return( IQR(x, type=2)) }

# Now when I call iqr(some variable name), R will do everything
# that appears between the open and close curly braces in that
# function definition.
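The type=2 rule that iqr() relies on can be spelled out in a few lines outside R. Purely as an illustration, here is a hedged Python sketch of that calculation; the names quantile_type2 and peabody are my own, and the rule implemented is R's documented type=2 behavior (the inverse of the empirical CDF, averaging the two neighboring order statistics when n*p lands exactly on an observation):

```python
def quantile_type2(x, p):
    """Sketch of R's quantile(x, p, type=2) for 0 < p < 1 (my own helper)."""
    s = sorted(x)
    h = len(s) * p                     # position on a 1-based scale
    j = int(h)                         # floor(h)
    if h == j:                         # n*p lands exactly on an observation:
        return (s[j - 1] + s[j]) / 2   # average the two neighbors
    return s[j]                        # otherwise take the next order statistic

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

q1 = quantile_type2(peabody, 0.25)   # 74.0
q3 = quantile_type2(peabody, 0.75)   # 90.5
print(q3 - q1)                       # 16.5, matching IQR(Peabody, type=2)
```

With 40 observations, 40 * 0.25 = 10 falls exactly on an observation, so the first quartile averages the 10th and 11th sorted values (72 and 76), reproducing the stem-and-leaf answer of 74.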
# Now that I have created my function, it will appear in my R space,
# looking just like a variable:

> ls()
[1] "iqr" "PeabodyFrame"

# But if I type its name, instead of seeing the values of a variable,
# I will see the definition of the function:

> iqr
function(x) {
  return( IQR(x, type=2)) }

# The new iqr() function gives exactly the same result as R's
# IQR() function with type set to 2:

> IQR(Peabody, type=2)
[1] 16.5
> iqr(Peabody)
[1] 16.5

# If we look at the definition of R's IQR() function, we can see
# that it specifies type=7, which was not what we wanted:

> IQR
function (x, na.rm = FALSE, type = 7)
diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE,
    type = type))

# It's the quantile() function that actually uses the "type" setting.
# Here, we look at the first and third quartiles using R's default:

> quantile(Peabody, .25)
25%
 75
> quantile(Peabody, .75)
  75%
90.25

# That's why R's IQR() function gave us a different result:

> 90.25 - 75
[1] 15.25

# The different "types" of quantile give different answers, but we will
# always want to use type=2:

> quantile(Peabody, .75, type=7)
  75%
90.25
> quantile(Peabody, .75, type=6)
  75%
90.75
> quantile(Peabody, .75, type=5)
 75%
90.5
> quantile(Peabody, .75, type=4)
75%
 90
> quantile(Peabody, .75, type=3)
75%
 90
> quantile(Peabody, .75, type=2)
 75%
90.5

# To address a student question, I wrote another (kind of stupid) function
# that calculates the difference between the third observation of a variable
# and the fifth:

> junk_function <- function(qqq) {
+   return(qqq[3] - qqq[5]) }

# The third Peabody value is 94, and the fifth is 80:

> Peabody
 [1] 69 72 94 64 80 77 96 86 89 69 92 71 81 90 84 76 100 57 61
[20] 84 81 65 87 92 89 79 91 65 91 81 86 85 95 93 83 76 84 90
[39] 95 67

# So my stupid function returns 94 - 80:

> junk_function(Peabody)
[1] 14

# Of course, I didn't allow for the possibility that there would not be
# as many as five values, so if I try to apply this function to a shorter
# variable, it chokes:

> x <- 1:3
> junk_function(x)
[1] NA

# We motivated the idea of the standard deviation by talking about average
# deviation from the mean. Here, we calculate the deviations:

> Deviations <- Peabody - mean(Peabody)
> Deviations
 [1] -12.675 -9.675 12.325 -17.675 -1.675 -4.675 14.325 4.325 7.325
[10] -12.675 10.325 -10.675 -0.675 8.325 2.325 -5.675 18.325 -24.675
[19] -20.675 2.325 -0.675 -16.675 5.325 10.325 7.325 -2.675 9.325
[28] -16.675 9.325 -0.675 4.325 3.325 13.325 11.325 1.325 -5.675
[37] 2.325 8.325 13.325 -14.675

# The problem is that the average deviation from the mean is always zero,
# because the negative deviations cancel out the positive deviations.
# (R reports a tiny number like 2.8e-15 rather than exactly 0 only because
# of floating-point rounding.)

> mean(Deviations)
[1] 2.842453e-15

# We can make them all positive by taking the absolute values:

> abs(Deviations)
 [1] 12.675 9.675 12.325 17.675 1.675 4.675 14.325 4.325 7.325 12.675
[11] 10.325 10.675 0.675 8.325 2.325 5.675 18.325 24.675 20.675 2.325
[21] 0.675 16.675 5.325 10.325 7.325 2.675 9.325 16.675 9.325 0.675
[31] 4.325 3.325 13.325 11.325 1.325 5.675 2.325 8.325 13.325 14.675

# The mean of those absolute deviations from the mean gives us a perfectly
# reasonable way of talking about variability. It is, quite literally, the
# mean distance from the mean:

> mean(abs(Deviations))
[1] 8.9575

# In fact, that statistic has a name; it's the mean absolute deviation
# (sometimes abbreviated mad).

# The problem with using the mad as a measure of variability is that it's
# very difficult to work with absolute values when we do mathematical proofs
# and derivations, and much of what we do in statistics is based on such
# proofs. Mathematicians generally prefer to make things positive by squaring.
# We could calculate the mean squared deviation (known as the "variance"):

> mean(Deviations^2)
[1] 116.2694

# That doesn't look anything like an average distance from the mean, because
# it's still in the squared metric.
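The deviation arithmetic above is easy to replay outside R. Here is a minimal Python sketch, offered purely as a cross-check; the variable names are mine, and statistics.pvariance is Python's divide-by-N variance, the same quantity as mean(Deviations^2):

```python
import statistics

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

m = statistics.mean(peabody)            # 81.675
deviations = [v - m for v in peabody]

# Deviations from the mean always sum to (essentially) zero:
print(sum(deviations))                  # ~0, up to floating-point noise

# Mean absolute deviation: literally the mean distance from the mean.
mad = statistics.mean(abs(d) for d in deviations)
print(mad)                              # about 8.9575

# Mean squared deviation, i.e. the divide-by-N (population) variance:
print(statistics.pvariance(peabody))    # about 116.2694
```

The ~0 result for the summed deviations is the same floating-point residue that R shows as 2.842453e-15.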
# We can put it back in the unsquared metric by taking the square root:

> sqrt(mean(Deviations^2))
[1] 10.78283

# And we get a value that is pretty close to the value of the mad. So it is
# useful to think of the standard deviation as being "sort of like average
# distance from the mean."

# There's just one problem with that logical, average-based approach to
# calculating the variance: in small samples, it tends to give a result
# that is a bit smaller than the true value in the population. We can
# compensate for that by using (N - 1) rather than N as the denominator
# when we calculate the average squared deviation:

> sum(Deviations^2) / (length(Peabody)-1)
[1] 119.2506

# That is known as the "sample variance" (as opposed to the "population
# variance," which is the value we get when we divide by N instead of
# (N - 1)). R has a function for calculating the sample variance:

> var(Peabody)
[1] 119.2506

# Its square root is the sample standard deviation:

> sqrt(var(Peabody))
[1] 10.92019

# R can also calculate the sample standard deviation directly with the
# sd() function:

> sd(Peabody)
[1] 10.92019

# One important note... The variance is a useful mathematical concept, but
# it is not so useful as a way of describing variability (because it is in
# the squared metric). Use the standard deviation rather than the variance
# as your descriptive statistic.
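The N versus (N - 1) arithmetic can likewise be replayed outside R. A minimal Python sketch, assuming a side-by-side check is helpful (the names are mine; statistics.variance and statistics.stdev use the N - 1 denominator, like R's var() and sd()):

```python
import math
import statistics

# The 40 Peabody scores from the transcript:
peabody = [69, 72, 94, 64, 80, 77, 96, 86, 89, 69, 92, 71, 81, 90, 84, 76,
           100, 57, 61, 84, 81, 65, 87, 92, 89, 79, 91, 65, 91, 81, 86, 85,
           95, 93, 83, 76, 84, 90, 95, 67]

n = len(peabody)
m = statistics.mean(peabody)
ss = sum((v - m) ** 2 for v in peabody)   # sum of squared deviations

pop_var = ss / n               # divide by N: the population variance
samp_var = ss / (n - 1)        # divide by N - 1: the sample variance, like var()
samp_sd = math.sqrt(samp_var)  # sample standard deviation, like sd()

print(round(samp_var, 4))      # 119.2506
print(round(samp_sd, 5))       # 10.92019

# The standard library's variance()/stdev() already use N - 1:
print(statistics.variance(peabody), statistics.stdev(peabody))
```

With only 40 observations the two denominators differ noticeably (116.27 vs. 119.25); as N grows, the distinction matters less and less.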