# We started class with an illustration of how to approach homework one. # For those who wanted to follow along, my own personal N=50 Statlab sample # is available: > JackStatlab <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/JackStatlab.csv") # We're going to be working with the Raven variable. I attach the data set: > attach(JackStatlab) > CTRA [1] 33 25 42 20 21 15 34 16 35 45 33 47 50 31 26 28 35 44 34 48 40 31 50 34 48 18 29 20 28 24 42 40 [33] 21 21 47 29 33 45 35 43 15 42 11 26 37 13 33 30 24 34 # The values in my sample run from 11 to 50: > sort(CTRA) [1] 11 13 15 15 16 18 20 20 21 21 21 24 24 25 26 26 28 28 29 29 30 31 31 33 33 33 33 34 34 34 34 35 [33] 35 35 37 40 40 42 42 42 43 44 45 45 47 47 48 48 50 50 # Using R's stem() function often will lead to a reasonable grouping: > stem(CTRA) The decimal point is 1 digit(s) to the right of the | 1 | 13 1 | 5568 2 | 0011144 2 | 5668899 3 | 01133334444 3 | 5557 4 | 0022234 4 | 557788 5 | 00 # As we have seen before, R's default histogram leaves a lot to be desired: > hist(CTRA) # We force a more horizontal aspect ratio and improve the title and X-axis label: > par(pin=c(6,4)) > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven") # R's choice of breakpoints was really bad, so we improve it: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,54.5,5)) # In response to a question from class, I showed what happens if you specify # a range of breakpoints that doesn't go far enough to include the whole data set: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,49.5,5)) Error in hist.default(CTRA, main = "Raven's Progressive Matrices", xlab = "Raven", : some 'x' not counted; maybe 'breaks' do not span range of 'x' # Here's that histogram again: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,54.5,5)) # We find the midpoint of the first interval. (The midpoint of the last # interval will also end in 2, so it will be 52.) > (9.5+14.5)/2 [1] 12 # We redraw the histogram with the midpoints of the histobars labeled: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,54.5,5), axes=FALSE) > axis(side=1,at=seq(12,52,5)) # In response to a question, we calculate the midpoint of the final interval: > (49.5+54.5)/2 [1] 52 # We raise the X axis so that it doesn't float in mid air, and add the Y axis: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,54.5,5), axes=FALSE) > axis(side=1,at=seq(12,52,5), pos=0) > axis(side=2, seq(0,12,2)) # It doesn't look so good with the Y axis extending that high, so we correct: > hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven", + breaks=seq(9.5,54.5,5), axes=FALSE) > axis(side=1,at=seq(12,52,5), pos=0) > axis(side=2, seq(0,10,2)) # Finally, we give the histogram a base: > abline(h=0) # We calculate descriptive statistics for central tendency... > mean(CTRA) [1] 32.1 > median(CTRA) [1] 33 # ...and for variability: > sd(CTRA) [1] 10.59505 > iqr function(x) { IQR(x, type=2) } > iqr(CTRA) [1] 18 # We discussed the inadvisability of forming conclusions about modality # and skew in a data set this small. Having said that, we did observe that # Pearson's skew index is consistent with the visual appearance of negative # skew that we thought we saw in the histogram: > pskew(CTRA) [1] -0.254836 # We also noted that we would never want to use the third-moment skew index # with a data set this small: > mskew(CTRA) [1] -0.09922256 # Sometimes, skew CAN be taken seriously in a small data set. For example, # because of the nature of income, we EXPECT a positive skew: > stem(FIT) The decimal point is 2 digit(s) to the right of the | 0 | 4 0 | 5889 1 | 00000012222333344 1 | 5556677888899 2 | 000112224 2 | 668 3 | 00 3 | 4 | 1 > pskew(FIT) [1] 0.3723412 # We calculate the IQR of the normative sample from the information given # in the homework: > 41-28 [1] 13 # Our sample is somewhat more variable: > iqr(CTRA) [1] 18 > median(CTRA) [1] 33 # We starting thinking about conditional distributions by looking at the distribution # of Raven conditioned on Sex. Our first 32 cases are girls, and the final 18 are boys: > CBSEX [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [49] 1 1 # That was easy to tell, because things were already sorted by sex. But regardless of # how things are ordered, we can always get counts with the table() function: > table(CBSEX) CBSEX 0 1 32 18 # Because things are already ordered by sex, it's easy to separate the girls from the boys: > GirlsRaven <- CTRA[1:32] > BoysRaven <- CTRA[33:50] # We compare the distributions graphically: > stem(GirlsRaven) The decimal point is 1 digit(s) to the right of the | 1 | 568 2 | 0014 2 | 56889 3 | 1133444 3 | 55 4 | 00224 4 | 5788 5 | 00 > stem(BoysRaven) The decimal point is 1 digit(s) to the right of the | 1 | 135 2 | 11469 3 | 033457 4 | 2357 # The comparison will be easier if we use the same grouping: > stem(GirlsRaven) The decimal point is 1 digit(s) to the right of the | 1 | 568 2 | 0014 2 | 56889 3 | 1133444 3 | 55 4 | 00224 4 | 5788 5 | 00 > stem(BoysRaven,2) The decimal point is 1 digit(s) to the right of the | 1 | 13 1 | 5 2 | 114 2 | 69 3 | 0334 3 | 57 4 | 23 4 | 57 # We compare measures of central tendency and variability: > mean(GirlsRaven) [1] 33.3125 > mean(BoysRaven) [1] 29.94444 > median(GirlsRaven) [1] 33.5 > median(BoysRaven) [1] 31.5 > sd(GirlsRaven) [1] 10.38745 > sd(BoysRaven) [1] 10.91395 # The box-and-whisker plot is useful for comparing distributions. # The horizontal line is the median. The box extends to the first and # third quartiles, to the the height of the box represents the iqr. # By default in R, the "whiskers extend to the most extreme data point # which is no more than range times the interquartile range from the box." # (Quoted from R's help() function.) In small, fairly symmetric data sets, # this often means that they extend to the largest and smallest observation, # as they do here. > boxplot(CTRA) # Producing side-by-side boxplots on the same scale can facilitate comparison: > boxplot(GirlsRaven,BoysRaven) # Here's another way to do that: > boxplot(CTRA~CBSEX)