# A student asked about setting seeds for random numbers in R.

# The sample() function uses random number generation. If we apply
# it without specifying a seed first, we will get a new, independent
# sequence of random draws each time:

> sample(1:10, 5)
[1] 5 1 8 7 3
> sample(1:10, 5)
[1] 3 8 10 5 9
> sample(1:10, 5)
[1] 6 10 3 1 4

# But if we set a seed (starting place) for the random number
# generator, we will always get the same sequence if we use the
# same seed:

> set.seed(587)
> sample(1:10, 5)
[1] 9 4 2 8 10
> set.seed(587)
> sample(1:10, 5)
[1] 9 4 2 8 10

# And then if we sample again without specifying a seed, we'll
# get a different draw again:

> sample(1:10, 5)
[1] 2 6 10 5 1

# So setting a seed can be useful if you want to be able to
# replicate a random sequence.

# Simulation in R can be a valuable way of understanding probability
# distributions. Last time, for example, we simulated a large number
# of coin tosses (0=tails, 1=heads):

> cointoss <- sample(0:1, 1000000, replace=TRUE)
> head(cointoss)
[1] 0 1 0 1 0 1

# We found that if we create a histogram of those simulated tosses, we
# get something that looks just like the picture we had drawn just from
# THINKING about the random coin tosses:

> hist(cointoss)

# And the mean and standard deviation of the simulated coin tosses
# matched what we had estimated based on the picture:

> mean(cointoss)
[1] 0.498767
> sd(cointoss)
[1] 0.4999987

# We can do the same thing with more complex random variables. Here,
# for example, we simulate rolling a fair six-sided die:

> dieroll <- sample(1:6, 1000000, replace=TRUE)

# We can look at the histogram of the simulated values, and it
# resembles what we might draw just using the idea that each possible
# roll is equally likely:

> hist(dieroll)

# A glance at that picture suggests that it would balance at 3.5. We
# can validate that idea by looking at the mean of the simulated rolls:

> mean(dieroll)
[1] 3.499729

# And we might guess that a typical roll is about 1.5 away from 3.5. The
# standard deviation is actually a little higher than that:

> sd(dieroll)
[1] 1.705185

# In response to a student question, here's another example of the
# effect of setting a seed. First, we don't set a seed, and we get
# a different sequence of die rolls each time we sample:

> sample(1:6, 10, replace=TRUE)
[1] 5 3 4 4 6 4 1 5 6 3
> sample(1:6, 10, replace=TRUE)
[1] 3 6 3 5 6 5 2 1 1 2
> sample(1:6, 10, replace=TRUE)
[1] 4 6 6 6 1 2 2 2 1 6

# But if we set a seed, we get exactly the same sequence each time:

> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
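# (A check added here, not part of the original session: returning to
# the die-roll simulation above, the six faces are equally likely, so
# the exact mean and sd of a single roll can be computed directly. The
# sd works out to sqrt(35/12), or about 1.71, matching the simulated
# value. The name "faces" is just illustrative.)

> faces <- 1:6
> mean(faces)
[1] 3.5
> sqrt(mean((faces - mean(faces))^2))
[1] 1.707825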
# A student asked about the indentation I use when writing functions.
# I follow the normal programming convention of indenting things that
# naturally group together, but the spacing is actually arbitrary: it
# doesn't matter to R. For example, when I wrote the pskew() function,
# I indented the second line:

> pskew
function(x) {
    return( 3 * (mean(x)-median(x))/sd(x))
    }

# But if I write it without the indentation, the function works exactly
# the same way; it's just a bit harder to read:

> pskew <- function(x){return(3*(mean(x)-median(x))/sd(x))}
> pskew
function(x){return(3*(mean(x)-median(x))/sd(x))}
> attach(JackStatlab)
> pskew(CTRA)
[1] -0.2168082

# Here, we illustrate simulating a continuous random variable that is
# uniformly distributed between 0 and 1:

> runif(10)
[1] 0.19383647 0.63690411 0.68780009 0.64019077 0.35788536 0.10258500
[7] 0.09779092 0.18288626 0.22790347 0.08052415

# By default, R assumes the range 0 to 1, but it is possible to specify
# other ranges. For example, here we sample from a U(0,360) distribution.
# (That's standard notation for a uniform distribution where values between
# 0 and 360 are all equally likely.)

> runif(10, 0, 360)
[1] 295.7826320 212.8010317 278.4200468 126.0309517 2.1820762 293.2222402
[7] 0.4245224 72.2484657 180.0381742 116.4611984

# If we don't specify limits, R will use 0 and 1:

> runif(10)
[1] 0.34678351 0.54548727 0.04049003 0.44384725 0.69098048 0.82385131
[7] 0.60678671 0.97744182 0.86409885 0.48225151

# Here, we simulate a million draws from U(0,1):

> spin <- runif(1000000)
> head(spin)
[1] 0.7741288 0.9303503 0.4763101 0.9168967 0.3652093 0.4267693

# The histogram shows frequencies by default...

> hist(spin)

# But the help function can show us how to make the histogram show
# relative frequencies instead (which, in this case, represent
# probabilities):

> help(hist)
starting httpd help server ... done
> hist(spin, freq=FALSE)

# Looking at that histogram, we would guess that the distribution
# balances at 1/2. We can validate that idea from the simulated values:

> mean(spin)
[1] 0.499939

# In the histogram, half of the area under the curve is below 1/2, and
# half is above, so the median should be 1/2:

> median(spin)
[1] 0.499727

# We guessed that the standard deviation would be near 1/4 because the
# average distance from the mean is 1/4, and sd is "sort of like" average
# distance from the mean:

> sd(spin)
[1] 0.2887233

# The true value of the sd for a U(0,1) random variable is sqrt(1/12),
# which is close to 1/4, and is almost exactly the value we see in
# our simulation:

> sqrt(1/12)
[1] 0.2886751

# From the histogram, we guessed that the IQR would be 1/2, and that turns
# out to be true in the simulation:

> iqr(spin)
[1] 0.4998338

# Visually, the distribution is symmetric, and both of our skew indices,
# applied to the simulated values, agree:

> pskew(spin)
[1] 0.002202665
> mskew(spin)
[1] 0.0003462668

# (Note that it made sense to use mskew(), the third-moment skew index,
# because our sample of one million values is very large.)

# A student asked about the difference between frequency and relative
# frequency. Relative frequency is just frequency divided by sample size.
# We can illustrate this by considering the distribution of Raven scores
# (CTRA) in my Statlab sample:

> stem(CTRA)

  The decimal point is 1 digit(s) to the right of the |

  1 | 3
  1 | 5578
  2 | 1111123334
  2 | 56678888
  3 | 0122334
  3 | 5557888999
  4 | 0112234
  4 | 557

# The sample size is 50:

> length(CTRA)
[1] 50

# The "low teens" bin in the stem-and-leaf plot has a frequency
# of 1 (the value 13), so the relative frequency is

> 1/50
[1] 0.02

# The second bin has a frequency of four (15, 15, 17, 18), so the
# relative frequency is

> 4/50
[1] 0.08

# And so on. (Here's the relative frequency for the third bin.)

> 10/50
[1] 0.2
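# (A sketch added here, not part of the original session: the same
# relative frequencies can be computed without counting by hand. A
# logical test such as CTRA >= 10 & CTRA < 15 marks the cases in a bin,
# and the mean of a logical vector is the proportion of TRUEs, an idea
# we will use again below. For the first two bins, the results match
# the stem-and-leaf counts above:)

> mean(CTRA >= 10 & CTRA < 15)
[1] 0.02
> mean(CTRA >= 15 & CTRA < 20)
[1] 0.08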
# We can use simulation to validate our application of probability laws.
# Here, we simulate a million rolls of a die:

> roll <- sample(1:6, 1000000, replace=TRUE)
> head(roll)
[1] 2 5 4 1 5 2

# Here, we simulate a smaller number of rolls to help us understand
# R's logical operators:

> junkroll <- sample(1:6, 10, replace=TRUE)
> junkroll
[1] 2 2 1 3 1 1 1 2 6 2

# "==" in R is a logical "is equal to." So "junkroll==6" returns
# a sequence of TRUE and FALSE responses indicating whether each
# value is actually 6:

> junkroll==6
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE

# But behind the scenes, those FALSE values are actually zero, and
# the TRUE values are actually one. So if we take their mean, we get
# the proportion of cases for which the value was 6:

> mean(junkroll==6)
[1] 0.1

# Now we return to the larger simulation to validate our probability
# ideas. In theory, the probability of 6 on a single roll of a fair
# die should be 1/6. The simulation validates that idea:

> mean(roll==6)
[1] 0.166685
> 1/6
[1] 0.1666667

# Similarly, the probability of a 5 should be 1/6:

> mean(roll==5)
[1] 0.166424

# A single roll of a die cannot come up both 5 and 6 (i.e., the events are
# mutually exclusive), so the simple form of the addition rule tells us that
# the probability of getting a 5 or a 6 on a single die roll is

> 1/6 + 1/6
[1] 0.3333333

# That can be validated using our simulation. (The vertical bar "|" is a
# logical OR operator, so roll==5 | roll==6 means that the roll is either
# 5 OR 6.)

> mean(roll==6 | roll==5)
[1] 0.333109

# If we go back to our simulated spins of a spinner that can come to rest
# between 0 and 1, we can see that the probability of a single spin falling
# below 1/4 is 1/4:

> mean(spin<1/4)
[1] 0.249932

# The probability that a single spin falls between 1/2 and 3/4 is also 1/4.
# (Here, the "&" is a logical AND operator.)

> mean(spin>1/2 & spin<3/4)
[1] 0.249794

# A single spin of the spinner cannot fall below 1/4 AND between 1/2 and 3/4,
# so those events are mutually exclusive. The simple form of the addition
# rule, then, says that the probability of a spin falling below 1/4 OR between
# 1/2 and 3/4 should be 1/4 + 1/4 = 1/2. The simulation confirms that idea:

> mean(spin<1/4 | (spin>1/2 & spin<3/4))
[1] 0.499726

# We may recall from PSY 10 that the probability of a single draw from a normal
# distribution with a mean of 0 and a standard deviation of 1 falling below
# -1.96 is .025, and the probability of getting a draw above 1.96 is also .025.
# We can confirm this by simulation:

> normal <- rnorm(1000000)
> mean(normal < -1.96)
[1] 0.025033
> mean(normal > 1.96)
[1] 0.02505

# Those outcomes are mutually exclusive (a single draw can't both be < -1.96
# and > 1.96), so the simple form of the addition rule tells us that the
# probability of one OR the other event occurring is .025 + .025 = .05. The
# simulation confirms this:

> mean(normal < -1.96 | normal > 1.96)
[1] 0.050083
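# (A check added here, not part of the original session: base R's
# pnorm() function gives the exact probability that a standard normal
# draw falls below a given value, so we can compare the simulated tail
# probabilities to the exact ones:)

> pnorm(-1.96)
[1] 0.0249979
> 2 * pnorm(-1.96)
[1] 0.04999579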
# We are now going to concern ourselves with a sequence of two
# independent die rolls. We've already simulated one roll, so
# I'll call that roll1 (that's a one, not an ell, at the end of
# the name):

> roll1 <- roll

# Here's a second, independent roll:

> roll2 <- sample(1:6, 1000000, replace=TRUE)

# Both rolls behave the way we would expect. For example, the
# probability of getting a 6 is still 1/6:

> mean(roll1 == 6)
[1] 0.166685
> mean(roll2 == 6)
[1] 0.166128

# These rolls are independent, so the simple form of the multiplication
# rule tells us that the probability of BOTH rolls coming up 6 is
# 1/6 * 1/6 = 1/36:

> 1/36
[1] 0.02777778

# The simulation confirms this. (Here, the "&" is a logical AND.)

> mean(roll1==6 & roll2==6)
[1] 0.028101

# We simulate two independent coin tosses. As with the dice, I reuse
# my previous coin toss simulation to be the first toss:

> toss1 <- cointoss

# Here's a second, independent toss:

> toss2 <- sample(0:1, 1000000, replace=TRUE)

# The probability of Heads (i.e., 1) on either toss is 1/2:

> mean(toss1==1)
[1] 0.498767
> mean(toss2==1)
[1] 0.499577

# The tosses are independent, so the simple form of the multiplication
# rule tells us that the probability of heads on both tosses is
# 1/2 * 1/2 = 1/4. The simulation confirms that result:

> mean(toss1==1 & toss2==1)
[1] 0.248982

# We reuse our previous simulated spin and then add a second,
# independent spin of our (0,1) spinner:

> spin1 <- spin
> spin2 <- runif(1000000)

# The probability of getting a spin above 1/4 is 3/4:

> mean(spin1 > 1/4)
[1] 0.750068

# And the probability of getting a spin below 3/4 is 3/4:

> mean(spin2 < 3/4)
[1] 0.749725

# The spins are independent, so the simple form of the multiplication
# rule tells us that the probability of the first spin being > 1/4 and
# the second spin being < 3/4 is 3/4 * 3/4 = 9/16:

> 9/16
[1] 0.5625

# The simulation confirms our application of the multiplication rule:

> mean(spin1 > 1/4 & spin2 < 3/4)
[1] 0.562005
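# (A check added here, not part of the original session: base R's
# punif() function gives the exact probability that a uniform draw
# falls below a given value. With the default U(0,1) limits, it
# confirms the last calculation exactly:)

> punif(3/4)
[1] 0.75
> (1 - punif(1/4)) * punif(3/4)
[1] 0.5625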