# We introduced the idea of conditional distributions by thinking about
# the distribution of Peabody conditioned on sex: how do the distributions
# of Peabody scores differ for boys and girls?

> attach(JackStatlab)

# One convenience in our dataset is that girls and boys are listed in order:
# girls (0) first, and then boys (1):

> CBSEX
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

# So if we add up those values, we get the number of boys in our dataset:

> sum(CBSEX)
[1] 24

# Which means that the first 26 cases are girls:

> 50-24
[1] 26

# Here are the girls' scores:

> CTPEA[1:26]
 [1]  72  69  80  90  77  84  59  72  62  82  66  72  96  80  74  81  82  81  69
[20]  68  85  69 106  95  60  64

# And here are the boys':

> CTPEA[27:50]
 [1]  86  79  83  80  78  64  89  89  81  86  75  71 122 104 103  65  92 108  86
[20]  85  88  78  91 103

# We can compare the distributions with respect to central tendency. It looks like
# the boys tend to score higher than the girls:

> mean(CTPEA[1:26])
[1] 76.73077
> mean(CTPEA[27:50])
[1] 86.91667
> median(CTPEA[1:26])
[1] 75.5
> median(CTPEA[27:50])
[1] 86

# But the two distributions are pretty similar in variability:

> sd(CTPEA[1:26])
[1] 11.64666
> sd(CTPEA[27:50])
[1] 13.61558
> iqr(CTPEA[1:26])
[1] 13
> iqr(CTPEA[27:50])
[1] 13

# More often, the data won't be so nicely ordered. Here's a way
# to get the same information without taking advantage of the fact
# that the data are sorted. The syntax for the tapply() function
# is "tapply(variable of interest, conditioning variable, function desired)":

> tapply( CTPEA, CBSEX, mean)
       0        1 
76.73077 86.91667 
> tapply( CTPEA, CBSEX, median)
   0    1 
75.5 86.0 
> tapply( CTPEA, CBSEX, sd)
       0        1 
11.64666 13.61558 
> tapply( CTPEA, CBSEX, iqr)
 0  1 
13 13 
> tapply( CTPEA, CBSEX, pskew)
        0         1 
0.3170273 0.2019745 

# Another device that can be useful for comparing distributions is
# the construction of box-and-whisker plots on the same axis.
# Here's the unconditional box-and-whisker plot for Peabody:

> boxplot(CTPEA)

# But if we add a "~CBSEX", we get a separate boxplot for each
# child sex, side-by-side:

> boxplot(CTPEA~CBSEX)

# (Think of that tilde as meaning "conditioned on.")

# Of course, the 0 and 1 on the X axis aren't very informative
# there. Let's try to improve the plot:

> boxplot(CTPEA~CBSEX, axes=FALSE)

# I guess (fortunately, correctly) that the 0 and 1 labels are
# actually at X=0 and X=1. So if I add my own labels at those
# locations, I'll get more informative labels:

> axis(side=1,at=c(1,2), lab=c("Girls","Boys"))

# Here's some further improvement:

> boxplot(CTPEA~CBSEX, axes=FALSE, xlab="Sex of Child", ylab="Peabody Score")
> axis(side=1,at=c(1,2),lab=c("Girls","Boys"))
> axis(side=2, at=seq(60,120,10))
> box()
 
 
# Another device that can be helpful for comparing distributions
# is the construction of back-to-back stem-and-leaf plots:

# Note that R makes a different grouping decision for girls...

> stem(CTPEA[1:26])

  The decimal point is 1 digit(s) to the right of the |

   5 | 9
   6 | 024
   6 | 68999
   7 | 2224
   7 | 7
   8 | 0011224
   8 | 5
   9 | 0
   9 | 56
  10 | 
  10 | 6

# ...and for boys:

> stem(CTPEA[27:50])

  The decimal point is 1 digit(s) to the right of the |

   6 | 4515889
   8 | 013566689912
  10 | 3348
  12 | 2

# If we tell R to try to have about half as many groups as
# it wants to by default for the girls...

> stem(CTPEA[1:26], 1/2)

  The decimal point is 1 digit(s) to the right of the |

   5 | 9
   6 | 02468999
   7 | 22247
   8 | 00112245
   9 | 056
  10 | 6

# ...and about twice as many for the boys...

> stem(CTPEA[27:50], 2)

  The decimal point is 1 digit(s) to the right of the |

   6 | 45
   7 | 15889
   8 | 0135666899
   9 | 12
  10 | 3348
  11 | 
  12 | 2

# ...we end up with the same grouping for boys and girls. We can then
# (with a bit of tedium, as a student pointed out in chat) manually
# construct the back-to-back plot in a Word document. (A pdf of the
# result is posted on the "virtual whiteboard" page of the class website.)

# Next, we started thinking about conditioning on a continuous variable.
# If we want to consider the distribution of Peabody conditioned on Raven,
# we could consider categorizing Raven. A traditional way of doing that
# is to split Raven into "low" and "high" Raven scores depending on whether
# the score is above or below the median. This is called a "median split,"
# and is very poor practice because scores that are actually almost the
# same end up being called "high" and "low."

 
# We create a dummy variable for the median split. (A dummy variable is
# one that indicates membership in one or another group using 0 and 1 to
# denote the two groups.)

# The logical operator returns FALSE or TRUE:

> above <- (CTRA > median(CTRA))
> above
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
[25] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[37] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[49] FALSE  TRUE

# But behind the scenes, those are actually zeroes and ones. We can force
# R to report them that way by changing the "mode" of the variable:

> above <- as.numeric(above)
> above
 [1] 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0
[39] 0 1 1 0 1 1 1 1 0 1 0 1

# We happened to get lucky and get exactly 25 cases in each group:

> sum(above)
[1] 25

# And we can compare the distribution of Peabody for the low and high
# Raven groups using the same approach that we used when we conditioned
# on sex:

> boxplot(CTPEA~above)
> tapply(CTPEA, above, mean)
    0     1 
76.84 86.40 

# Once again, be aware that median splits are very poor practice.

# A step in the right direction might be to divide things into four
# groups by quartiles. Here, we create a variable that does that:

> quartile <- rep(1, 50)
> quartile
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1

# Right now, everything appears to be in the first quartile. But if we add 1.0 if
# the case is ABOVE the first quartile, the remaining 1's actually ARE in the first
# quartile:

> for(i in 1:50) if(CTRA[i] > quantile(CTRA, .25, type=2)) quartile[i] <- quartile[i] + 1
> quartile
 [1] 1 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 1 1 1
[39] 2 2 2 2 2 2 2 2 2 2 2 2

# If we add 1.0 again for anything above the median, the 1's and 2's accurately indicate
# Raven scores in the first and second quartiles:

> for(i in 1:50) if(CTRA[i] > quantile(CTRA, .50, type=2)) quartile[i] <- quartile[i] + 1
> quartile
 [1] 1 1 3 3 2 2 1 1 1 3 1 1 3 3 3 3 3 3 3 2 3 2 3 3 1 3 1 2 3 2 2 3 1 3 1 1 1 1
[39] 2 3 3 2 3 3 3 3 2 3 2 3

# And finally, we increment by 1.0 again if a score is above the third quartile:

> for(i in 1:50) if(CTRA[i] > quantile(CTRA, .75, type=2)) quartile[i] <- quartile[i] + 1

# Now the "quartile" variable actually does represent the Raven quartiles:

> quartile
 [1] 1 1 4 3 2 2 1 1 1 3 1 1 3 3 3 4 4 4 3 2 3 2 4 4 1 4 1 2 3 2 2 3 1 3 1 1 1 1
[39] 2 3 4 2 3 4 3 3 2 3 2 4

# There are roughly the same number of Raven scores in each group:

> table(quartile)
quartile
 1  2  3  4 
14 11 15 10 

# Again, using the same techniques as before, we can compare the
# Peabody distributions for each of the Raven quartiles. Here, for
# example, we look at boxplots. We can see that the conditional
# center of the Peabody distribution increases as the quartile of
# Raven increases. (We could also compare conditional means and sd's
# if we wanted.)

> boxplot(CTPEA~quartile)

# One might imagine going even further with that approach by dividing
# Raven into 10 groups according to what decile of the distribution the
# Raven score is in, but there wouldn't be enough data to do good boxplots.

# Ultimately, we could try to do a boxplot for every unique Raven value.
# Even though there isn't enough information to construct real boxplots in
# most cases, we can still see the tendency for the conditional center
# of Peabody to increase as Raven increases:

> boxplot(CTPEA~CTRA)

# But why even bother categorizing Raven? If we simply plot Peabody
# scores against Raven scores, we can visually assess various aspects
# of the Peabody distribution for different ranges of Raven. This type
# of plot is called a "scatterplot."

> plot(CTRA, CTPEA)
 
# In that plot, we can see that the conditional center of the Peabody
# distribution is tending to increase as Raven increases. It also appears
# that the variability of the Peabody variable increases a bit as
# Raven goes up.

# R's scatterplots can be improved. For one thing, we want a more
# horizontal aspect ratio:

> par(pin=c(6,4))
> plot(CTRA, CTPEA)

# We want a title and better axis labels:

> plot(CTRA, CTPEA, main="Peabody Conditioned on Raven",xlab="Raven",ylab="Peabody")

# I also like to add a bit more white space around the cloud of plotted
# points than R provides by default. Generally, adding about 5% of the range
# for each variable works pretty well.

# I create a variable called "yinc" (a mnemonic for "y increment") that is
# 5% of the range of Peabody:

> max(CTPEA)-min(CTPEA)
[1] 63
> min(CTPEA)
[1] 59
> max(CTPEA)
[1] 122
> yinc <- .05*63
> yinc
[1] 3.15

# I do the same for Raven:

> max(CTRA)-min(CTRA)
[1] 34
> xinc <- .05*34
> xinc
[1] 1.7

# Then I use "xlim" and "ylim" commands to extend the x and y limits
# of the scatterplot by 5% above and below the maximum and minimum
# values for each variable:

> plot(CTRA, CTPEA, main="Peabody Conditioned on Raven",xlab="Raven",ylab="Peabody",
+ xlim = c(min(CTRA)-xinc,max(CTRA)+xinc), ylim = c(min(CTPEA)-yinc,max(CTPEA)+yinc))