# A student asked about setting seeds for random numbers in R.

# The sample() function uses random number generation. If we apply
# it without specifying a seed first, we will get a new, independent
# sequence of random draws each time:

> sample(1:10, 5)
[1] 5 1 8 7 3
> sample(1:10, 5)
[1] 3 8 10 5 9
> sample(1:10, 5)
[1] 6 10 3 1 4

# But if we set a seed (starting place) for the random number
# generator, we will always get the same sequence if we use the
# same seed:

> set.seed(587)
> sample(1:10, 5)
[1] 9 4 2 8 10
> set.seed(587)
> sample(1:10, 5)
[1] 9 4 2 8 10

# And then if we sample again without specifying a seed, we'll
# get a different draw again:

> sample(1:10, 5)
[1] 2 6 10 5 1

# So setting a seed can be useful if you want to be able to
# replicate a random sequence.

# Simulation in R can be a valuable way of understanding probability
# distributions. Last time, for example, we simulated a large number
# of coin tosses (0=tails, 1=heads):

> cointoss <- sample(0:1, 1000000, replace=TRUE)
> head(cointoss)
[1] 0 1 0 1 0 1

# We found that if we create a histogram of those simulated tosses, we
# get something that looks just like the picture we had drawn just from
# THINKING about the random coin tosses:

> hist(cointoss)

# And the mean and standard deviation of the simulated coin tosses
# matched what we had estimated based on the picture:

> mean(cointoss)
[1] 0.498767
> sd(cointoss)
[1] 0.4999987

# We can do the same thing with more complex random variables. Here,
# for example, we simulate rolling a fair six-sided die:

> dieroll <- sample(1:6, 1000000, replace=TRUE)

# We can look at the histogram of the simulated values, and it
# resembles what we might draw just using the idea that each possible
# roll is equally likely:

> hist(dieroll)

# A glance at that picture suggests that it would balance at 3.5. We
# can validate that idea by looking at the mean of the simulated rolls:

> mean(dieroll)
[1] 3.499729

# And we might guess that a typical roll is about 1.5 away from 3.5. The
# standard deviation is actually a little higher than that:

> sd(dieroll)
[1] 1.705185

# In response to a student question, here's another example of the
# effect of setting a seed. First, we don't set a seed, and we get
# a different sequence of die rolls each time we sample:

> sample(1:6, 10, replace=TRUE)
[1] 5 3 4 4 6 4 1 5 6 3
> sample(1:6, 10, replace=TRUE)
[1] 3 6 3 5 6 5 2 1 1 2
> sample(1:6, 10, replace=TRUE)
[1] 4 6 6 6 1 2 2 2 1 6

# But if we set a seed, we get exactly the same sequence each time:

> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
> set.seed(99)
> sample(1:6, 10, replace=TRUE)
[1] 1 4 6 6 5 3 2 6 2 6
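# (A check added here, not part of the original session: returning to
# the die-roll simulation above, the six faces are equally likely, so
# the exact mean and sd of a single roll can be computed directly. The
# sd works out to sqrt(35/12), or about 1.71, matching the simulated
# value. The name "faces" is just illustrative.)

> faces <- 1:6
> mean(faces)
[1] 3.5
> sqrt(mean((faces - mean(faces))^2))
[1] 1.707825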
# A student asked about the indentation I use when writing functions.
# I follow the normal programming convention of indenting things that
# naturally group together, but the spacing is actually arbitrary: it
# doesn't matter to R. For example, when I wrote the pskew() function,
# I indented the second line:

> pskew
function(x) {
    return( 3 * (mean(x)-median(x))/sd(x))
    }

# But if I write it without the indentation, the function works exactly
# the same way; it's just a bit harder to read:

> pskew <- function(x){return(3*(mean(x)-median(x))/sd(x))}
> pskew
function(x){return(3*(mean(x)-median(x))/sd(x))}
> attach(JackStatlab)
> pskew(CTRA)
[1] -0.2168082

# Here, we illustrate simulating a continuous random variable that is
# uniformly distributed between 0 and 1:

> runif(10)
[1] 0.19383647 0.63690411 0.68780009 0.64019077 0.35788536 0.10258500
[7] 0.09779092 0.18288626 0.22790347 0.08052415

# By default, R assumes the range 0 to 1, but it is possible to specify
# other ranges. For example, here we sample from a U(0,360) distribution.
# (That's standard notation for a uniform distribution where values between
# 0 and 360 are all equally likely.)

> runif(10, 0, 360)
[1] 295.7826320 212.8010317 278.4200468 126.0309517 2.1820762 293.2222402
[7] 0.4245224 72.2484657 180.0381742 116.4611984

# If we don't specify limits, R will use 0 and 1:

> runif(10)
[1] 0.34678351 0.54548727 0.04049003 0.44384725 0.69098048 0.82385131
[7] 0.60678671 0.97744182 0.86409885 0.48225151

# Here, we simulate a million draws from U(0,1):

> spin <- runif(1000000)
> head(spin)
[1] 0.7741288 0.9303503 0.4763101 0.9168967 0.3652093 0.4267693

# The histogram shows frequencies by default...

> hist(spin)

# But the help function can show us how to make the histogram show
# relative frequencies instead (which, in this case, represent
# probabilities):

> help(hist)
starting httpd help server ... done
> hist(spin, freq=FALSE)

# Looking at that histogram, we would guess that the distribution
# balances at 1/2. We can validate that idea from the simulated values:

> mean(spin)
[1] 0.499939

# In the histogram, half of the area under the curve is below 1/2, and
# half is above, so the median should be 1/2:

> median(spin)
[1] 0.499727

# We guessed that the standard deviation would be near 1/4 because the
# average distance from the mean is 1/4, and sd is "sort of like" average
# distance from the mean:

> sd(spin)
[1] 0.2887233

# The true value of the sd for a U(0,1) random variable is sqrt(1/12),
# which is close to 1/4, and is almost exactly the value we see in
# our simulation:

> sqrt(1/12)
[1] 0.2886751

# From the histogram, we guessed that the IQR would be 1/2, and that turns
# out to be true in the simulation:

> iqr(spin)
[1] 0.4998338

# Visually, the distribution is symmetric, and both of our skew indices,
# applied to the simulated values, agree:

> pskew(spin)
[1] 0.002202665
> mskew(spin)
[1] 0.0003462668

# (Note that it made sense to use mskew(), the third-moment skew index,
# because our sample of one million values is very large.)

# A student asked about the difference between frequency and relative
# frequency. Relative frequency is just frequency divided by sample size.
# We can illustrate this by considering the distribution of Raven scores
# (CTRA) in my Statlab sample:

> stem(CTRA)

  The decimal point is 1 digit(s) to the right of the |

  1 | 3
  1 | 5578
  2 | 1111123334
  2 | 56678888
  3 | 0122334
  3 | 5557888999
  4 | 0112234
  4 | 557

# The sample size is 50:

> length(CTRA)
[1] 50

# The "low teens" bin in the stem-and-leaf plot has a frequency
# of 1 (the value 13), so the relative frequency is

> 1/50
[1] 0.02

# The second bin has a frequency of four (15, 15, 17, 18), so the
# relative frequency is

> 4/50
[1] 0.08

# And so on. (Here's the relative frequency for the third bin.)

> 10/50
[1] 0.2
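# (A sketch added here, not part of the original session: the same
# relative frequencies can be computed without counting by hand. A
# logical test such as CTRA >= 10 & CTRA < 15 marks the cases in a bin,
# and the mean of a logical vector is the proportion of TRUEs, an idea
# we will use again below. For the first two bins, the results match
# the stem-and-leaf counts above:)

> mean(CTRA >= 10 & CTRA < 15)
[1] 0.02
> mean(CTRA >= 15 & CTRA < 20)
[1] 0.08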
# We can use simulation to validate our application of probability laws.
# Here, we simulate a million rolls of a die:

> roll <- sample(1:6, 1000000, replace=TRUE)
> head(roll)
[1] 2 5 4 1 5 2

# Here, we simulate a smaller number of rolls to help us understand
# R's logical operators:

> junkroll <- sample(1:6, 10, replace=TRUE)
> junkroll
[1] 2 2 1 3 1 1 1 2 6 2

# "==" in R is a logical "is equal to." So "junkroll==6" returns
# a sequence of TRUE and FALSE responses indicating whether each
# value is actually 6:

> junkroll==6
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE

# But behind the scenes, those FALSE values are actually zero, and
# the TRUE values are actually one. So if we take their mean, we get
# the proportion of cases for which the value was 6:

> mean(junkroll==6)
[1] 0.1

# Now we return to the larger simulation to validate our probability
# ideas. In theory, the probability of 6 on a single roll of a fair
# die should be 1/6. The simulation validates that idea:

> mean(roll==6)
[1] 0.166685
> 1/6
[1] 0.1666667

# Similarly, the probability of a 5 should be 1/6:

> mean(roll==5)
[1] 0.166424

# A single roll of a die cannot come up both 5 and 6 (i.e., the events are
# mutually exclusive), so the simple form of the addition rule tells us that
# the probability of getting a 5 or a 6 on a single die roll is

> 1/6 + 1/6
[1] 0.3333333

# That can be validated using our simulation. (The vertical bar "|" is a
# logical OR operator, so roll==5 | roll==6 means that the roll is either
# 5 OR 6.)

> mean(roll==6 | roll==5)
[1] 0.333109

# If we go back to our simulated spins of a spinner that can come to rest
# between 0 and 1, we can see that the probability of a single spin falling
# below 1/4 is 1/4:

> mean(spin<1/4)
[1] 0.249932

# The probability that a single spin falls between 1/2 and 3/4 is also 1/4.
# (Here, the "&" is a logical AND operator.)

> mean(spin>1/2 & spin<3/4)
[1] 0.249794

# A single spin of the spinner cannot fall below 1/4 AND between 1/2 and 3/4,
# so those events are mutually exclusive. The simple form of the addition
# rule, then, says that the probability of a spin falling below 1/4 OR between
# 1/2 and 3/4 should be 1/4 + 1/4 = 1/2. The simulation confirms that idea:

> mean(spin<1/4 | (spin>1/2 & spin<3/4))
[1] 0.499726

# We may recall from PSY 10 that the probability of a single draw from a normal
# distribution with a mean of 0 and a standard deviation of 1 falling below
# -1.96 is .025, and the probability of getting a draw above 1.96 is also .025.
# We can confirm this by simulation:

> normal <- rnorm(1000000)
> mean(normal < -1.96)
[1] 0.025033
> mean(normal > 1.96)
[1] 0.02505

# Those outcomes are mutually exclusive (a single draw can't both be < -1.96
# and > 1.96), so the simple form of the addition rule tells us that the
# probability of one OR the other event occurring is .025 + .025 = .05. The
# simulation confirms this:

> mean(normal < -1.96 | normal > 1.96)
[1] 0.050083
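# (A check added here, not part of the original session: base R's
# pnorm() function gives the exact probability that a standard normal
# draw falls below a given value, so we can compare the simulated tail
# probabilities to the exact ones:)

> pnorm(-1.96)
[1] 0.0249979
> 2 * pnorm(-1.96)
[1] 0.04999579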
# We are now going to concern ourselves with a sequence of two
# independent die rolls. We've already simulated one roll, so
# I'll call that roll1 (that's a one, not an ell, at the end of
# the name):

> roll1 <- roll

# Here's a second, independent roll:

> roll2 <- sample(1:6, 1000000, replace=TRUE)

# Both rolls behave the way we would expect. For example, the
# probability of getting a 6 is still 1/6:

> mean(roll1 == 6)
[1] 0.166685
> mean(roll2 == 6)
[1] 0.166128

# These rolls are independent, so the simple form of the multiplication
# rule tells us that the probability of BOTH rolls coming up 6 is
# 1/6 * 1/6 = 1/36:

> 1/36
[1] 0.02777778

# The simulation confirms this. (Here, the "&" is a logical AND.)

> mean(roll1==6 & roll2==6)
[1] 0.028101

# We simulate two independent coin tosses. As with the dice, I reuse
# my previous coin toss simulation to be the first toss:

> toss1 <- cointoss

# Here's a second, independent toss:

> toss2 <- sample(0:1, 1000000, replace=TRUE)

# The probability of Heads (i.e., 1) on either toss is 1/2:

> mean(toss1==1)
[1] 0.498767
> mean(toss2==1)
[1] 0.499577

# The tosses are independent, so the simple form of the multiplication
# rule tells us that the probability of heads on both tosses is
# 1/2 * 1/2 = 1/4. The simulation confirms that result:

> mean(toss1==1 & toss2==1)
[1] 0.248982

# We reuse our previous simulated spin and then add a second,
# independent spin of our (0,1) spinner:

> spin1 <- spin
> spin2 <- runif(1000000)

# The probability of getting a spin above 1/4 is 3/4:

> mean(spin1 > 1/4)
[1] 0.750068

# And the probability of getting a spin below 3/4 is 3/4:

> mean(spin2 < 3/4)
[1] 0.749725

# The spins are independent, so the simple form of the multiplication
# rule tells us that the probability of the first spin being > 1/4 and
# the second spin being < 3/4 is 3/4 * 3/4 = 9/16:

> 9/16
[1] 0.5625

# The simulation confirms our application of the multiplication rule:

> mean(spin1 > 1/4 & spin2 < 3/4)
[1] 0.562005
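# (A check added here, not part of the original session: base R's
# punif() function gives the exact probability that a uniform draw
# falls below a given value. With the default U(0,1) limits, it
# confirms the last calculation exactly:)

> punif(3/4)
[1] 0.75
> (1 - punif(1/4)) * punif(3/4)
[1] 0.5625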