# We learned about the d, r, p, and q prefixes for probability
# functions in R. The d prefix (as in dunif(), dnorm(), etc.)
# returns the probability density function (the height of the curve)
# at the specified value. Note that for discrete random variables,
# this number represents a probability, but for continuous random
# variables (like the functions illustrated here), it is merely the
# height of the curve.

# For example, the height of the curve for a uniform random variable
# over the interval 0 to 1 is 1 for any number between 0 and 1:

> dunif(1/3, 0, 1)
[1] 1
> dunif(.999999999, 0, 1)
[1] 1

# But for any value outside the range, it is 0:

> dunif(1.5, 0, 1)
[1] 0

# For a uniform random variable on a different interval, the density
# will be whatever value is needed to get a rectangle with area equal
# to 1. So, for example, for a U(0, 10) random variable, the density
# is 1/10:

> dunif(1/3, 0, 10)
[1] 0.1

# Here's a manual calculation of the height of the normal
# density for a value of 45 if the random variable has a
# normal distribution with a mean of 50 and a standard
# deviation of 10:

> 1/sqrt(2*pi*100) * exp(-1/2*(45-50)^2/100)
[1] 0.03520653

# And here's the same result using the much simpler dnorm() function:

> dnorm(45, 50, 10)
[1] 0.03520653

# If you like, you can make it clear what the values 50 and 10 represent:

> dnorm(45, mean=50, sd=10)
[1] 0.03520653

# Similarly, you can make it clear what the parameters of the uniform
# distribution represent:

> dunif(1/3, min=0, max=1)
[1] 1

# The height of a normal curve is not generally interesting per se.
# However, it can be useful to be able to calculate it easily, for
# example, to facilitate graphical portrayals of the curve:

> X <- seq(20, 80, .1)
> Y <- dnorm(X, 50, 10)
> plot(X, Y, type='l')

# All of these (and other) probability functions can also work with
# an r prefix. The r probability functions generate pseudorandom
# draws from the specified distribution. For example, here is a single
# draw from a U(0, 1) distribution:

> runif(1, 0, 1)
[1] 0.3497707

# And here are 10 draws from the same distribution:

> runif(10, 0, 1)
 [1] 0.6459240 0.3496345 0.7302054 0.9668433 0.9244626 0.5066248 0.7129959
 [8] 0.1902819 0.2452199 0.1484413

# Here's a single draw from a normal distribution with mean=50, sd=10:

> rnorm(1, 50, 10)
[1] 45.42791

# And here are 10 draws from the same distribution:

> rnorm(10, 50, 10)
 [1] 47.72362 47.87990 43.33846 58.00341 65.68108 53.27824 34.05266 61.81840
 [9] 50.60804 50.38394

# Probability functions with a p prefix tell us the area under the curve
# at or below the specified value. For example, the area under a U(0, 1)
# curve that falls below 1/2 is 1/2:

> punif(1/2, 0, 1)
[1] 0.5

# And the area that falls below 1/4 is 1/4:

> punif(1/4, 0, 1)
[1] 0.25

# The area below 50 in a normal distribution with a mean of 50 and
# a standard deviation of 10 is 1/2 (because the normal distribution
# is symmetric about its mean):

> pnorm(50, 50, 10)
[1] 0.5

# The area below -1.96 in a standard normal distribution (i.e., one
# with mean 0 and sd 1) is .025:

> pnorm(-1.96, 0, 1)
[1] 0.0249979

# If we consider a normal distribution with a mean of 50 and a sd of 10,
# a value of 1000 is so huge it might as well be infinity, so the function
# will tell us that the area below it is 1 (meaning that there is a probability
# of 1.0 that a draw from that normal distribution will be below 1000):

> pnorm(1000, 50, 10)
[1] 1
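# (A side note, sketched here rather than taken from the class session:
# pnorm() also accepts a lower.tail argument. Setting lower.tail=FALSE
# returns the area ABOVE the specified value, which by symmetry should
# match the area below -1.96 shown above:)

> pnorm(1.96, 0, 1, lower.tail=FALSE)
[1] 0.0249979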
# Probability functions with a q prefix are precisely the inverse of those
# with a p prefix. The q stands for "quantile." The .5 quantile of a
# distribution, say, is the value that separates off the bottom 1/2 of the
# distribution. (You can also think of this as the 50th percentile of the
# distribution.)

> qunif(1/2, 0, 1)
[1] 0.5

# The value that sets off the bottom 1/2 of a normal distribution with mean 50
# and sd 10 is 50:

> qnorm(1/2, 50, 10)
[1] 50

# The value that separates off the bottom 2.5% of a standard normal distribution
# is -1.96:

> qnorm(.025, 0, 1)
[1] -1.959964

# We did several examples showing how R can be used to validate increasingly
# complex probability problems.

# Here, we confirm the very simple conclusion that the probability of a toss
# of a fair coin coming up tails is 1/2:

> temp <- sample(0:1, 100000, replace=TRUE)
> mean(temp==0)
[1] 0.50007

# And, of course, the probability that it comes up heads is also 1/2:

> mean(temp==1)
[1] 0.49993

# The addition rule tells us that the probability it comes up
# heads OR tails is 1/2 + 1/2 = 1. Our simulated coin tosses in R
# confirm this:

> mean(temp==0 | temp==1)
[1] 1

# Here, we simulate a single roll of a fair six-sided die:

> temp <- sample(1:6, 100000, replace=TRUE)
> table(temp)
temp
    1     2     3     4     5     6 
16607 16764 16665 16733 16447 16784 

# We confirm that the probability of a 1 is 1/6:

> mean(temp==1)
[1] 0.16607

# The probability of a 2 is also 1/6:

> mean(temp==2)
[1] 0.16764

# The addition rule tells us that the probability of a 1 OR a 2
# is 1/6 + 1/6 = 1/3, and the simulation in R confirms this:

> mean(temp==1 | temp==2)
[1] 0.33371

# A simulation tells us that the probability of a draw from a
# standard normal distribution falling below -1.96 is .025:

> temp <- rnorm(100000, 0, 1)
> mean(temp < -1.96)
[1] 0.02539

# Similarly, the probability of a draw falling above 1.96 is .025:

> mean(temp > 1.96)
[1] 0.02481

# Those outcomes are mutually exclusive, so the addition rule tells
# us that the probability of a single draw falling below -1.96 or
# above 1.96 is .025 + .025 = .05. The simulation confirms this:

> mean(temp < -1.96 | temp > 1.96)
[1] 0.0502

# We now consider slightly more complicated combinations of events.
# Here, we simulate tossing two coins a large number of times:

> toss1 <- sample(0:1, 100000, replace=TRUE)
> toss2 <- sample(0:1, 100000, replace=TRUE)
> head(cbind(toss1,toss2))
     toss1 toss2
[1,]     0     0
[2,]     0     0
[3,]     0     1
[4,]     1     0
[5,]     1     1
[6,]     1     1

# The probability of the first toss coming up heads is 1/2:

> mean(toss1==1)
[1] 0.50244

# The probability of the second toss coming up heads is 1/2:

> mean(toss2==1)
[1] 0.49861

# The two tosses are independent, so the simple form of the
# multiplication rule tells us that the probability of the first
# toss coming up heads AND the second toss coming up heads is
# 1/2 * 1/2 = 1/4. The simulation confirms this:

> mean(toss1==1 & toss2==1)
[1] 0.2515
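# (A further check, sketched here and not part of the original session:
# cross-tabulating the two tosses should put roughly 1/4 of the 100,000
# simulated pairs in each of the four cells:)

> prop.table(table(toss1, toss2))   # each cell should be close to 0.25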
# Here, we simulate two independent spins of our U(0, 1) game spinner:

> spin1 <- runif(100000)
> spin2 <- runif(100000)
> head(cbind(spin1,spin2))
          spin1     spin2
[1,] 0.97834617 0.7891538
[2,] 0.05130799 0.7434768
[3,] 0.60708347 0.4622158
[4,] 0.23562843 0.7456762
[5,] 0.27354925 0.2017080
[6,] 0.30620999 0.3683357

# The probability that the first spin falls above 1/2 is 1/2:

> mean(spin1 > 1/2)
[1] 0.49869

# The probability that the second spin falls below 3/4 is 3/4:

> mean(spin2 < 3/4)
[1] 0.7497

# The spins are independent, so the simple form of the multiplication
# rule tells us that the probability of the first spin being above 1/2
# AND the second spin being below 3/4 is

> 1/2 * 3/4
[1] 0.375

# The simulation confirms the result:

> mean(spin1 > 1/2 & spin2 < 3/4)
[1] 0.37322

# Here, we simulate observing two randomly drawn IQs for an IQ test
# with mean=100 and sd=12:

> IQ1 <- rnorm(100000, 100, 12)
> IQ2 <- rnorm(100000, 100, 12)

# The probability that a draw from that distribution falls below 136 is

> pnorm(136, 100, 12)
[1] 0.9986501

# So the probability that it falls above 136 is

> 1 - pnorm(136, 100, 12)
[1] 0.001349898

# The events of one IQ being above 136 and a second, independently observed
# IQ being above 136 are NOT mutually exclusive, so to get the probability
# that one OR the other is above 136, we would need the complex form of the
# addition rule:
# p(IQ1 > 136) + p(IQ2 > 136) - p(both are > 136)

> (1-pnorm(136,100,12)) + (1-pnorm(136,100,12)) - (1-pnorm(136,100,12))^2
[1] 0.002697974

# The simulation confirms the result:

> mean(IQ1 > 136 | IQ2 > 136)
[1] 0.00258

# Now we consider a much more complex problem. What's the probability
# of getting a roll of 2 six-sided dice with a sum of 7? This can happen
# by rolling (1,6), (6,1), (2,5), (5,2), (3,4), or (4,3). Those combinations
# are all mutually exclusive, so a repeated application of the simple form
# of the addition rule gets us the answer.

> roll1 <- sample(1:6, 100000, replace=TRUE)
> roll2 <- sample(1:6, 100000, replace=TRUE)
> (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6) + (1/6*1/6)
[1] 0.1666667

# But we can get the same result much more easily just by seeing how often
# the two rolls sum to 7 in our simulation:

> mean(roll1+roll2==7)
[1] 0.16682

# Calculating the probability that the two rolls sum to 7 OR 11 would
# require an even longer sequence of addition rule calculations, but
# the probability can be easily obtained from our simulation:

> mean(roll1+roll2==7 | roll1+roll2==11)
[1] 0.22243
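# (For reference, a sketch not in the original session: the exact answer can
# be computed directly. There are 6 ways to roll a 7 and 2 ways to roll an 11
# among the 36 equally likely outcomes, so the probability is 8/36, which
# agrees with the simulated 0.22243:)

> (6 + 2)/36
[1] 0.2222222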
# The rest of the R session involves Bayes' rule calculations related to
# the example at the end of today's PowerPoint. Because we have the full
# table from our medical test's validity study, we know that the probability
# of having the disease given a positive test is

> 15/19
[1] 0.7894737

# But the test manufacturer won't give us that information. Instead, they
# give us the test's sensitivity (i.e., the probability of a positive test
# if the disease is present) and its specificity (the probability of a
# negative test if the disease is absent). Here, we calculate those values
# from the table:

> sens <- 15/17
> spec <- 1479/1483

# We also know from the table that the overall prevalence of the disease
# (its base rate) is

> base <- 17/1500

# Plugging those values into the formula for Bayes' rule does, indeed, get
# us the probability that we have the disease, given a positive test:

> sens * base / (sens*base + (1-spec)*(1-base))
[1] 0.7894737

# So if you have a positive test, you probably have the disease.

# Note that Bayes' rule calculations are very sensitive to the base rate.
# Here's the base rate from the table:

> base
[1] 0.01133333

# But if the disease were much rarer...

> base <- .0001

# ...then, even with a positive test, it's quite unlikely that
# we actually have the disease:

> sens * base / (sens*base + (1-spec)*(1-base))
[1] 0.03168005
>
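# (One way to explore the base-rate sensitivity further, sketched here with
# a hypothetical helper function that was not part of the original session:
# wrap the Bayes' rule calculation in a function and call it with any base
# rate. It reproduces both results above:)

> bayes_ppv <- function(base, sens=15/17, spec=1479/1483)
+     sens * base / (sens*base + (1-spec)*(1-base))
> bayes_ppv(17/1500)   # base rate from the validity study table
[1] 0.7894737
> bayes_ppv(.0001)     # a much rarer disease
[1] 0.03168005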