# We introduced the idea of conditional distributions by thinking about # the distribution of Peabody conditioned on sex: how do the distributions # of Peabody scores differ for boys and girls? > attach(JackStatlab) # One convenience in our dataset is that girls and boys are listed in order: # girls (0) first, and then boys (1): > CBSEX [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 [39] 1 1 1 1 1 1 1 1 1 1 1 1 # So if we add up those values, we get the number of boys in our dataset: > sum(CBSEX) [1] 24 # Which means that the first 26 cases are girls: > 50-24 [1] 26 # Here are the girls' scores: > CTPEA[1:26] [1] 72 69 80 90 77 84 59 72 62 82 66 72 96 80 74 81 82 81 69 [20] 68 85 69 106 95 60 64 # And here are the boys': > CTPEA[27:50] [1] 86 79 83 80 78 64 89 89 81 86 75 71 122 104 103 65 92 108 86 [20] 85 88 78 91 103 # We can compare the distributions with respect to central tendency. It looks like # the boys tend to score higher than the girls: > mean(CTPEA[1:26]) [1] 76.73077 > mean(CTPEA[27:50]) [1] 86.91667 > median(CTPEA[1:26]) [1] 75.5 > median(CTPEA[27:50]) [1] 86 # But the two distributions are pretty similar in variability: > sd(CTPEA[1:26]) [1] 11.64666 > sd(CTPEA[27:50]) [1] 13.61558 > iqr(CTPEA[1:26]) [1] 13 > iqr(CTPEA[27:50]) [1] 13 # More often, the data won't be so nicely ordered. Here's a way # to get the same information without taking advantage of the fact # that the data are sorted. The syntax for the tapply() function # is "tapply(variable of interest, conditioning variable, function desired)": > tapply( CTPEA, CBSEX, mean) 0 1 76.73077 86.91667 > tapply( CTPEA, CBSEX, median) 0 1 75.5 86.0 > tapply( CTPEA, CBSEX, sd) 0 1 11.64666 13.61558 > tapply( CTPEA, CBSEX, iqr) 0 1 13 13 > tapply( CTPEA, CBSEX, pskew) 0 1 0.3170273 0.2019745 # Another device that can be useful for comparing distributions is # the construction of box-and-whisker plots on the same axis. # Here's the unconditional box-and-whisker plot for Peabody: > boxplot(CTPEA) # But if we add a "~CBSEX", we get a separate boxplot for each # child sex, side-by-side: > boxplot(CTPEA~CBSEX) # (Think of that tilde as meaning "conditioned on.") # Of course, the 0 and 1 on the X axis aren't very informative # there. Let's try to improve the plot: > boxplot(CTPEA~CBSEX, axes=FALSE) # I guess (fortunately, correctly) that the 0 and 1 labels are # actually at X=0 and X=1. So if I add my own labels at those # locations, I'll get more informative labels: > axis(side=1,at=c(1,2), lab=c("Girls","Boys")) # Here's some further improvement: > boxplot(CTPEA~CBSEX, axes=FALSE, xlab="Sex of Child", ylab="Peabody Score") > axis(side=1,at=c(1,2),lab=c("Girls","Boys")) > axis(side=2, at=seq(60,120,10)) > box() # Another device that can be helpful for comparing distributions # is the construction of back-to-back stem-and-leaf plots: # Note that R makes a different grouping decision for girls... > stem(CTPEA[1:26]) The decimal point is 1 digit(s) to the right of the | 5 | 9 6 | 024 6 | 68999 7 | 2224 7 | 7 8 | 0011224 8 | 5 9 | 0 9 | 56 10 | 10 | 6 # ...and for boys: > stem(CTPEA[27:50]) The decimal point is 1 digit(s) to the right of the | 6 | 4515889 8 | 013566689912 10 | 3348 12 | 2 # If we tell R to try to have about half as many groups as # it wants to by default for the girls... > stem(CTPEA[1:26], 1/2) The decimal point is 1 digit(s) to the right of the | 5 | 9 6 | 02468999 7 | 22247 8 | 00112245 9 | 056 10 | 6 # ...and about twice as many for the boys... > stem(CTPEA[27:50], 2) The decimal point is 1 digit(s) to the right of the | 6 | 45 7 | 15889 8 | 0135666899 9 | 12 10 | 3348 11 | 12 | 2 # ...we end up with the same grouping for boys and girls. We can then # (with a bit of tedium, as a student pointed out in chat) manually # construct the back-to-back plot in a Word document. (A pdf of the # result is posted on the "virtual whiteboard" page of the class website.) # Next, we started thinking about conditioning on a continuous variable. # If we want to consider the distribution of Peabody conditioned on Raven, # we could consider categorizing Raven. A traditional way of doing that # is to split Raven into "low" and "high" Raven scores depending on whether # the score is above or below the median. This is called a "median split," # and is very poor practice because scores that are actually almost the # same end up being called "high" and "low." # We create a dummy variable for the median split. (A dummy variable is # one that indicates membership in one or another group using 0 and 1 to # denote the two groups.) # The logical operator returns FALSE or TRUE: > above <- (CTRA > median(CTRA)) > above [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE [25] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE [37] FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE [49] FALSE TRUE # But behind the scenes, those are actually zeroes and ones. We can force # R to report them that way by changing the "mode" of the variable: > above <- as.numeric(above) > above [1] 0 0 1 1 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0 0 0 [39] 0 1 1 0 1 1 1 1 0 1 0 1 # We happened to get lucky and get exactly 25 cases in each group: > sum(above) [1] 25 # And we can compare the distribution of Peabody for the low and high # Raven groups using the same approach that we used when we conditioned # on sex: > boxplot(CTPEA~above) > tapply(CTPEA, above, mean) 0 1 76.84 86.40 # Once again, be aware that median splits are very poor practice. # A step in the right direction might be to divide things into four # groups by quartiles. Here, we create a variable that does that: > quartile <- rep(1, 50) > quartile [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [39] 1 1 1 1 1 1 1 1 1 1 1 1 # Right now, everything appears to be in the first quartile. But if we add 1.0 if # the case is ABOVE the first quartile, the remaining 1's actually ARE in the first # quartile: > for(i in 1:50) if(CTRA[i] > quantile(CTRA, .25, type=2)) quartile[i] <- quartile[i] + 1 > quartile [1] 1 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 1 1 1 [39] 2 2 2 2 2 2 2 2 2 2 2 2 # If we add 1.0 again for anything above the median, the 1's and 2's accurately indicate # Raven scores in the first and second quartiles: > for(i in 1:50) if(CTRA[i] > quantile(CTRA, .50, type=2)) quartile[i] <- quartile[i] + 1 > quartile [1] 1 1 3 3 2 2 1 1 1 3 1 1 3 3 3 3 3 3 3 2 3 2 3 3 1 3 1 2 3 2 2 3 1 3 1 1 1 1 [39] 2 3 3 2 3 3 3 3 2 3 2 3 # And finally, we increment by 1.0 again if a score is above the third quartile: > for(i in 1:50) if(CTRA[i] > quantile(CTRA, .75, type=2)) quartile[i] <- quartile[i] + 1 # Now the "quartile" variable actually does represent the Raven quartiles: > quartile [1] 1 1 4 3 2 2 1 1 1 3 1 1 3 3 3 4 4 4 3 2 3 2 4 4 1 4 1 2 3 2 2 3 1 3 1 1 1 1 [39] 2 3 4 2 3 4 3 3 2 3 2 4 # There are roughly the same number of Raven scores in each group: > table(quartile) quartile 1 2 3 4 14 11 15 10 # Again, using the same techniques as before, we can compare the # Peabody distributions for each of the Raven quartiles. Here, for # example, we look at boxplots. We can see that the conditional # center of the Peabody distribution increases as the quartile of # Raven increases. (We could also compare conditional means and sd's # if we wanted.) > boxplot(CTPEA~quartile) # One might imagine going even further with that approach by dividing # Raven into 10 groups according to what decile of the distribution the # Raven score is in, but there wouldn't be enough data to do good boxplots. # Ultimately, we could try to do a boxplot for every unique Raven value. # Even though there isn't enough information to construct real boxplots in # most cases, we can still see the tendency for the conditional center # of Peabody to increase as Raven increases: > boxplot(CTPEA~CTRA) # But why even bother categorizing Raven? If we simply plot Peabody # scores against Raven scores, we can visually assess various aspects # of the Peabody distribution for different ranges of Raven. This type # of plot is called a "scatterplot." > plot(CTRA, CTPEA) # In that plot, we can see that the conditional center of the Peabody # distribution is tending to increase as Raven increases. It also appears # that the variability of the Peabody variable increases a bit as # Raven goes up. # R's scatterplots can be improved. For one thing, we want a more # horizontal aspect ratio: > par(pin=c(6,4)) > plot(CTRA, CTPEA) # We want a title and better axis labels: > plot(CTRA, CTPEA, main="Peabody Conditioned on Raven",xlab="Raven",ylab="Peabody") # I also like to add a bit more white space around the cloud of plotted # points than R provides by default. Generally, adding about 5% of the range # for each variable works pretty well. # I create a variable called "yinc" (a mnemonic for "y increment") that is # 5% of the range of Peabody: > max(CTPEA)-min(CTPEA) [1] 63 > min(CTPEA) [1] 59 > max(CTPEA) [1] 122 > yinc <- .05*63 > yinc [1] 3.15 # I do the same for Raven: > max(CTRA)-min(CTRA) [1] 34 > xinc <- .05*34 > xinc [1] 1.7 # Then I use "xlim" and "ylim" commands to extend the x and y limits # of the scatterplot by 5% above and below the maximum and minimum # values for each variable: > plot(CTRA, CTPEA, main="Peabody Conditioned on Raven",xlab="Raven",ylab="Peabody", + xlim = c(min(CTRA)-xinc,max(CTRA)+xinc), ylim = c(min(CTPEA)-yinc,max(CTPEA)+yinc))