# We learned about the d, r, p, and q prefixes for probability
# functions in R. The d prefix (as in dunif(), dnorm(), etc.)
# returns the probability density function (the height of the curve)
# at the specified value. Note that for discrete random variables,
# this number represents a probability, but for continuous random
# variables (like the functions illustrated here), it is merely the
# height of the curve.

# For example, the height of the curve for a uniform random variable
# over the interval 0 to 1 is 1 for any number between 0 and 1:

> dunif(1/3, 0, 1)
[1] 1
> dunif(.999999999, 0, 1)
[1] 1

# But for any value outside the range, it is 0:

> dunif(1.5, 0, 1)
[1] 0

# For a uniform random variable on a different interval, the density
# will be whatever value is needed to get a rectangle with area equal
# to 1. So, for example, for a U(0, 10) random variable, the density
# is 1/10:

> dunif(1/3, 0, 10)
[1] 0.1

# Here's a manual calculation of the height of the normal
# density for a value of 45 if the random variable has a
# normal distribution with a mean of 50 and a standard
# deviation of 10:

> 1/sqrt(2*pi*100) * exp(-1/2*(45-50)^2/100)
[1] 0.03520653

# And here's the same result using the much simpler dnorm() function:

> dnorm(45, 50, 10)
[1] 0.03520653

# If you like, you can make it clear what the values 50 and 10 represent:

> dnorm(45, mean=50, sd=10)
[1] 0.03520653

# Similarly, you can make it clear what the parameters of the uniform
# distribution represent:

> dunif(1/3, min=0, max=1)
[1] 1

# The height of a normal curve is not generally interesting per se.
# However, it can be useful to be able to calculate it easily, for
# example, to facilitate graphical portrayals of the curve:

> X <- seq(20, 80, .1)
> Y <- dnorm(X, 50, 10)
> plot(X, Y, type='l')

# All of these (and other) probability functions can also work with
# an r prefix. The r probability functions generate pseudorandom
# draws from the specified distribution. For example, here is a single
# draw from a U(0, 1) distribution:

> runif(1, 0, 1)
[1] 0.3497707

# And here are 10 draws from the same distribution:

> runif(10, 0, 1)
 [1] 0.6459240 0.3496345 0.7302054 0.9668433 0.9244626 0.5066248 0.7129959
 [8] 0.1902819 0.2452199 0.1484413

# Here's a single draw from a normal distribution with mean=50, sd=10:

> rnorm(1, 50, 10)
[1] 45.42791

# And here are 10 draws from the same distribution:

> rnorm(10, 50, 10)
 [1] 47.72362 47.87990 43.33846 58.00341 65.68108 53.27824 34.05266 61.81840
 [9] 50.60804 50.38394

# Probability functions with a p prefix tell us the area under the curve
# at or below the specified value. For example, the area under a U(0, 1)
# curve that falls below 1/2 is 1/2:

> punif(1/2, 0, 1)
[1] 0.5

# And the area that falls below 1/4 is 1/4:

> punif(1/4, 0, 1)
[1] 0.25

# The area below 50 in a normal distribution with a mean of 50 and
# a standard deviation of 10 is 1/2 (because the normal distribution
# is symmetric about its mean):

> pnorm(50, 50, 10)
[1] 0.5

# The area below -1.96 in a standard normal distribution (i.e., one
# with mean 0 and sd 1) is .025:

> pnorm(-1.96, 0, 1)
[1] 0.0249979

# If we consider a normal distribution with a mean of 50 and a sd of 10,
# a value of 1000 is so huge it might as well be infinity, so the function
# will tell us that the area below it is 1 (meaning that there is a probability
# of 1.0 that a draw from that normal distribution will be below 1000):

> pnorm(1000, 50, 10)
[1] 1
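# (A side note, sketched here rather than taken from the class session:
# pnorm() also accepts a lower.tail argument. Setting lower.tail=FALSE
# returns the area ABOVE the specified value, which by symmetry should
# match the area below -1.96 shown above:)

> pnorm(1.96, 0, 1, lower.tail=FALSE)
[1] 0.0249979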
# Probability functions with a q prefix are precisely the inverse of those
# with a p prefix. The q stands for "quantile." The .5 quantile of a
# distribution, say, is the value that separates off the bottom 1/2 of the
# distribution. (You can also think of this as the 50th percentile of the
# distribution.)

> qunif(1/2, 0, 1)
[1] 0.5

# The value that sets off the bottom 1/2 of a normal distribution with mean 50
# and sd 10 is 50:

> qnorm(1/2, 50, 10)
[1] 50

# The value that separates off the bottom 2.5% of a standard normal distribution
# is -1.96:

> qnorm(.025, 0, 1)
[1] -1.959964

# We did several examples showing how R can be used to validate increasingly
# complex probability problems.

# Here, we confirm the very simple conclusion that the probability of a toss
# of a fair coin coming up tails is 1/2:

> temp <- sample(0:1, 100000, replace=TRUE)
> mean(temp==0)
[1] 0.50007

# And, of course, the probability that it comes up heads is also 1/2:

> mean(temp==1)
[1] 0.49993

# The addition rule tells us that the probability it comes up
# heads OR tails is 1/2 + 1/2 = 1. Our simulated coin tosses in R
# confirm this:

> mean(temp==0 | temp==1)
[1] 1

# Here, we simulate a single roll of a fair six-sided die:

> temp <- sample(1:6, 100000, replace=TRUE)
> table(temp)
temp
    1     2     3     4     5     6 
16607 16764 16665 16733 16447 16784 

# We confirm that the probability of a 1 is 1/6:

> mean(temp==1)
[1] 0.16607

# The probability of a 2 is also 1/6:

> mean(temp==2)
[1] 0.16764

# The addition rule tells us that the probability of a 1 OR a 2
# is 1/6 + 1/6 = 1/3, and the simulation in R confirms this:

> mean(temp==1 | temp==2)
[1] 0.33371

# A simulation tells us that the probability of a draw from a
# standard normal distribution falling below -1.96 is .025:

> temp <- rnorm(100000, 0, 1)
> mean(temp < -1.96)
[1] 0.02539

# Similarly, the probability of a draw falling above 1.96 is .025:

> mean(temp > 1.96)
[1] 0.02481

# Those outcomes are mutually exclusive, so the addition rule tells
# us that the probability of a single draw falling below -1.96 or
# above 1.96 is .025 + .025 = .05. The simulation confirms this:

> mean(temp < -1.96 | temp > 1.96)
[1] 0.0502

# We now consider slightly more complicated combinations of events.
# Here, we simulate tossing two coins a large number of times:

> toss1 <- sample(0:1, 100000, replace=TRUE)
> toss2 <- sample(0:1, 100000, replace=TRUE)
> head(cbind(toss1,toss2))
     toss1 toss2
[1,]     0     0
[2,]     0     0
[3,]     0     1
[4,]     1     0
[5,]     1     1
[6,]     1     1

# The probability of the first toss coming up heads is 1/2:

> mean(toss1==1)
[1] 0.50244

# The probability of the second toss coming up heads is 1/2:

> mean(toss2==1)
[1] 0.49861

# The two tosses are independent, so the simple form of the
# multiplication rule tells us that the probability of the first
# toss coming up heads AND the second toss coming up heads is
# 1/2 * 1/2 = 1/4. The simulation confirms this:

> mean(toss1==1 & toss2==1)
[1] 0.2515
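# (A further check, sketched here and not part of the original session:
# cross-tabulating the two tosses should put roughly 1/4 of the 100,000
# simulated pairs in each of the four cells:)

> prop.table(table(toss1, toss2))   # each cell should be close to 0.25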
# Here, we simulate two independent spins of our U(0, 1) game spinner:

> spin1 <- runif(100000)
> spin2 <- runif(100000)
> head(cbind(spin1,spin2))
          spin1     spin2
[1,] 0.97834617 0.7891538
[2,] 0.05130799 0.7434768
[3,] 0.60708347 0.4622158
[4,] 0.23562843 0.7456762
[5,] 0.27354925 0.2017080
[6,] 0.30620999 0.3683357

# The probability that the first spin falls above 1/2 is 1/2:

> mean(spin1 > 1/2)
[1] 0.49869

# The probability that the second spin falls below 3/4 is 3/4:

> mean(spin2 < 3/4)
[1] 0.7497

# The spins are independent, so the simple form of the multiplication
# rule tells us that the probability of the first spin being above 1/2
# AND the second spin being below 3/4 is

> 1/2 * 3/4
[1] 0.375

# The simulation confirms the result:

> mean(spin1 > 1/2 & spin2 < 3/4)
[1] 0.37322

# Here, we simulate observing two randomly drawn IQs for an IQ test
# with mean=100 and sd=12:

> IQ1 <- rnorm(100000, 100, 12)
> IQ2 <- rnorm(100000, 100, 12)

# The probability that a draw from that distribution falls below 136 is

> pnorm(136, 100, 12)
[1] 0.9986501

# So the probability that it falls above 136 is

> 1 - pnorm(136, 100, 12)
[1] 0.001349898

# The events of one IQ being above 136 and a second, independently observed
# IQ being above 136 are NOT mutually exclusive, so to get the probability
# that one OR the other is above 136, we would need the complex form of the
# addition rule:
# p(IQ1 > 136) + p(IQ2 > 136) - p(both are > 136)

> (1-pnorm(136,100,12)) + (1-pnorm(136,100,12)) - (1-pnorm(136,100,12))^2
[1] 0.002697974

# The simulation confirms the result:

> mean(IQ1 > 136 | IQ2 > 136)
[1] 0.00258

# Now we consider a much more complex problem. What's the probability
# of getting a roll of 2 six-sided dice with a sum of 7? This can happen
# by rolling (1,6), (6,1), (2,5), (5,2), (3,4), or (4,3). Those combinations
# are all mutually exclusive, so a repeated application of the simple form
# of the addition rule gets us the answer.

> roll1 <- sample(1:6, 100000, replace=TRUE)
> roll2 <- sample(1:6, 100000, replace=TRUE)
> (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6)
[1] 0.1666667

# But we can get the same result much more easily just by seeing how often
# the two rolls sum to 7 in our simulation:

> mean(roll1+roll2==7)
[1] 0.16682

# Calculating the probability that the two rolls sum to 7 OR 11 would
# require an even longer sequence of addition rule calculations, but
# the probability can be easily obtained from our simulation:

> mean(roll1+roll2==7 | roll1+roll2==11)
[1] 0.22243
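# (For reference, a sketch not in the original session: the exact answer can
# be computed directly. There are 6 ways to roll a 7 and 2 ways to roll an 11
# among the 36 equally likely outcomes, so the probability is 8/36, which
# agrees with the simulated 0.22243:)

> (6 + 2)/36
[1] 0.2222222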
# The rest of the R session involves Bayes' rule calculations related to
# the example at the end of today's PowerPoint. Because we have the full
# table from our medical test's validity study, we know that the probability
# of having the disease given a positive test is

> 15/19
[1] 0.7894737

# But the test manufacturer won't give us that information. Instead, they
# give us the test's sensitivity (i.e., the probability of a positive test
# if the disease is present) and its specificity (the probability of a
# negative test if the disease is absent). Here, we calculate those values
# from the table:

> sens <- 15/17
> spec <- 1479/1483

# We also know from the table that the overall prevalence of the disease
# (its base rate) is

> base <- 17/1500

# Plugging those values into the formula for Bayes' rule does, indeed, get
# us the probability that we have the disease, given a positive test:

> sens * base / (sens*base + (1-spec)*(1-base))
[1] 0.7894737

# So if you have a positive test, you probably have the disease.

# Note that Bayes' rule calculations are very sensitive to the base rate.
# Here's the base rate from the table:

> base
[1] 0.01133333

# But if the disease were much rarer...

> base <- .0001

# ...then, even with a positive test, it's quite unlikely that
# we actually have the disease:

> sens * base / (sens*base + (1-spec)*(1-base))
[1] 0.03168005
>
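# (One way to explore the base-rate sensitivity further, sketched here with
# a hypothetical helper function that was not part of the original session:
# wrap the Bayes' rule calculation in a function and call it with any base
# rate. It reproduces both results above:)

> bayes_ppv <- function(base, sens=15/17, spec=1479/1483)
+     sens * base / (sens*base + (1-spec)*(1-base))
> bayes_ppv(17/1500)   # base rate from the validity study table
[1] 0.7894737
> bayes_ppv(.0001)     # a much rarer disease
[1] 0.03168005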