# We started class with an illustration of how to approach homework one.
# For those who wanted to follow along, my own personal N=50 Statlab sample
# is available:

> JackStatlab <- read.csv("http://faculty.ucmerced.edu/jvevea/classes/202a/data/JackStatlab.csv")

# We're going to be working with the Raven variable. I attach the data set:

> attach(JackStatlab)
> CTRA
 [1] 33 25 42 20 21 15 34 16 35 45 33 47 50 31 26 28 35 44 34 48 40 31 50 34 48 18 29 20 28 24 42 40
[33] 21 21 47 29 33 45 35 43 15 42 11 26 37 13 33 30 24 34

# The values in my sample run from 11 to 50:

> sort(CTRA)
 [1] 11 13 15 15 16 18 20 20 21 21 21 24 24 25 26 26 28 28 29 29 30 31 31 33 33 33 33 34 34 34 34 35
[33] 35 35 37 40 40 42 42 42 43 44 45 45 47 47 48 48 50 50

# Using R's stem() function often will lead to a reasonable grouping:

> stem(CTRA)

  The decimal point is 1 digit(s) to the right of the |

  1 | 13
  1 | 5568
  2 | 0011144
  2 | 5668899
  3 | 01133334444
  3 | 5557
  4 | 0022234
  4 | 557788
  5 | 00

# As we have seen before, R's default histogram leaves a lot to be desired:

> hist(CTRA)
 
# We force a more horizontal aspect ratio and improve the title and X-axis label:

> par(pin=c(6,4))
> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven")

# R's choice of breakpoints was really bad, so we improve it:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,54.5,5))

# In response to a question from class, I showed what happens if you specify
# a range of breakpoints that doesn't go far enough to include the whole data set:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,49.5,5))
Error in hist.default(CTRA, main = "Raven's Progressive Matrices", xlab = "Raven",  : 
  some 'x' not counted; maybe 'breaks' do not span range of 'x'

# Here's that histogram again:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,54.5,5))

# We find the midpoint of the first interval. (The midpoint of the last
# interval will also end in 2, so it will be 52.)

> (9.5+14.5)/2
[1] 12

# We redraw the histogram with the midpoints of the histobars labeled:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,54.5,5), axes=FALSE)
> axis(side=1,at=seq(12,52,5))

# In response to a question, we calculate the midpoint of the final interval:

> (49.5+54.5)/2
[1] 52

# We raise the X axis so that it doesn't float in mid air, and add the Y axis:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,54.5,5), axes=FALSE)
> axis(side=1,at=seq(12,52,5), pos=0)
> axis(side=2, seq(0,12,2))

# It doesn't look so good with the Y axis extending that high, so we correct:

> hist(CTRA, main="Raven's Progressive Matrices", xlab="Raven",
+ breaks=seq(9.5,54.5,5), axes=FALSE)
> axis(side=1,at=seq(12,52,5), pos=0)
> axis(side=2, seq(0,10,2))

# Finally, we give the histogram a base:

> abline(h=0)

# We calculate descriptive statistics for central tendency...

> mean(CTRA)
[1] 32.1
> median(CTRA)
[1] 33

# ...and for variability:

> sd(CTRA)
[1] 10.59505
> iqr
function(x) {
   IQR(x, type=2) }
<bytecode: 0x000000001b847b90>
> iqr(CTRA)
[1] 18

# We discussed the inadvisability of forming conclusions about modality
# and skew in a data set this small. Having said that, we did observe that
# Pearson's skew index is consistent with the visual appearance of negative
# skew that we thought we saw in the histogram:

> pskew(CTRA)
[1] -0.254836

# We also noted that we would never want to use the third-moment skew index
# with a data set this small:

> mskew(CTRA)
[1] -0.09922256

# Sometimes, skew CAN be taken seriously in a small data set. For example,
# because of the nature of income, we EXPECT a positive skew:

> stem(FIT)

  The decimal point is 2 digit(s) to the right of the |

  0 | 4
  0 | 5889
  1 | 00000012222333344
  1 | 5556677888899
  2 | 000112224
  2 | 668
  3 | 00
  3 | 
  4 | 1

> pskew(FIT)
[1] 0.3723412

# We calculate the IQR of the normative sample from the information given
# in the homework:

> 41-28
[1] 13

# Our sample is somewhat more variable:

> iqr(CTRA)
[1] 18
> median(CTRA)
[1] 33
 
 
# We starting thinking about conditional distributions by looking at the distribution
# of Raven conditioned on Sex. Our first 32 cases are girls, and the final 18 are boys:

> CBSEX
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[49] 1 1

# That was easy to tell, because things were already sorted by sex. But regardless of
# how things are ordered, we can always get counts with the table() function:

> table(CBSEX)
CBSEX
 0  1 
32 18 

# Because things are already ordered by sex, it's easy to separate the girls from the boys:

> GirlsRaven <- CTRA[1:32]
> BoysRaven <- CTRA[33:50]
 
# We compare the distributions graphically:

> stem(GirlsRaven)

  The decimal point is 1 digit(s) to the right of the |

  1 | 568
  2 | 0014
  2 | 56889
  3 | 1133444
  3 | 55
  4 | 00224
  4 | 5788
  5 | 00

> stem(BoysRaven)

  The decimal point is 1 digit(s) to the right of the |

  1 | 135
  2 | 11469
  3 | 033457
  4 | 2357

# The comparison will be easier if we use the same grouping:

> stem(GirlsRaven)

  The decimal point is 1 digit(s) to the right of the |

  1 | 568
  2 | 0014
  2 | 56889
  3 | 1133444
  3 | 55
  4 | 00224
  4 | 5788
  5 | 00

> stem(BoysRaven,2)

  The decimal point is 1 digit(s) to the right of the |

  1 | 13
  1 | 5
  2 | 114
  2 | 69
  3 | 0334
  3 | 57
  4 | 23
  4 | 57

# We compare measures of central tendency and variability:

> mean(GirlsRaven)
[1] 33.3125
> mean(BoysRaven)
[1] 29.94444
> median(GirlsRaven)
[1] 33.5
> median(BoysRaven)
[1] 31.5
> sd(GirlsRaven)
[1] 10.38745
> sd(BoysRaven)
[1] 10.91395

# The box-and-whisker plot is useful for comparing distributions.
# The horizontal line is the median. The box extends to the first and
# third quartiles, to the the height of the box represents the iqr.
# By default in R, the "whiskers extend to the most extreme data point 
# which is no more than range times the interquartile range from the box."
# (Quoted from R's help() function.) In small, fairly symmetric data sets,
# this often means that they extend to the largest and smallest observation,
# as they do here.

> boxplot(CTRA)

# Producing side-by-side boxplots on the same scale can facilitate comparison:

> boxplot(GirlsRaven,BoysRaven)

# Here's another way to do that:

> boxplot(CTRA~CBSEX)