As you prepare for an exam, it is useful to review the syllabus and ask yourself "What was really important in each section of the class? What did we spend a lot of time on (either in class, or on homework assignments)?" With that in mind, let's review where we've been. After each major section, I have listed some questions that you can ask yourself to check your understanding. Please be aware that these questions do not represent an exhaustive list of material that might be covered on the exam.
We began the semester with a discussion of variables and their distributions. We defined those terms, and then began to think about ways of understanding the shapes of distributions. We developed two parallel systems for discussing and understanding the shapes of distributions: graphical methods and descriptive statistics. For graphical methods, we talked about how to group data appropriately (guideline: seven to fifteen intervals usually work best, unless the sample size is very large), and we discussed several graphical methods. In the context of our discussion of both graphics and descriptive statistics, we identified several aspects of shape that are worth thinking about: central tendency, variability, modality, symmetry, and peakedness or heaviness of tails (kurtosis). We identified descriptive statistics associated with most of those aspects of shape, and talked about the relationship between the descriptive statistics and what we see in the graphics (e.g., the mean as the balance point, the IQR as the range about the median that defines the central 50% of the distribution). We spent a fair amount of time discussing principles for choosing descriptive statistics in different contexts, and we developed the rationale for each descriptive statistic. Both in class and in the first homework, we practiced using graphics and statistics to describe distributions. We also considered the issue of conditional distributions and, for example, compared all of those aspects of shape for males and females. (Our discussion of conditional distributions extended to include conditioning on continuous variables, but that was mainly to set up ideas we will discuss after the midterm, and has minimal practical consequences for now.)
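If you want to check these computations for yourself, here is a minimal sketch in Python (a tool and sample data assumed purely for illustration; we did not develop this in class) that computes a statistic for each aspect of shape and groups a sample into roughly ten intervals:

```python
import numpy as np
from scipy import stats

# Hypothetical sample, assumed only for illustration
rng = np.random.default_rng(1)
heights = rng.normal(loc=170, scale=10, size=200)

# Central tendency and variability
mean = heights.mean()                        # the balance point of the distribution
median = np.median(heights)
sd = heights.std(ddof=1)                     # sample standard deviation
q1, q3 = np.percentile(heights, [25, 75])
iqr = q3 - q1                                # range about the median holding the central 50%

# Symmetry and peakedness / tail heaviness
skewness = stats.skew(heights)
excess_kurtosis = stats.kurtosis(heights)    # 0 for a normal distribution

# Grouping for a histogram: aim for roughly seven to fifteen intervals
counts, edges = np.histogram(heights, bins=10)
print(mean, median, sd, iqr, skewness, excess_kurtosis)
```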
Some questions you should ask yourself about those issues include:
Next, we introduced the concept of random variables and probability distributions. We noted that random variables are ideas rather than actual observations, but we can think about the expected behavior of those ideas or potential values using exactly the same approaches we used for understanding ordinary variables and their distributions. We introduced technical vocabulary, noting that these imaginary variables that could take on specific values are called random variables, and that the values they could take on, together with their long-run relative frequencies, are called probability distributions. We talked about what probabilities are (long-run relative frequencies), and examined the behavior of some simple random variables (Bernoulli and binomial random variables). We also noted that continuous random variables introduce a special problem, because the probability of a continuous random variable taking on any specific value has to be zero. (We define probabilities for continuous RVs in terms of ranges: What is the probability that a randomly sampled male human's height is between 5'8" and 6'2"?)
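To make these ideas concrete, here is a small sketch using scipy's distribution objects (the mean and SD of the height model, 69.5 and 2.8 inches, are assumed purely for illustration):

```python
from scipy import stats

# Bernoulli trial: one coin toss with P(heads) = 0.5
coin = stats.bernoulli(0.5)
print(coin.pmf(1))                 # P(heads) = 0.5

# Binomial random variable: number of heads in 20 tosses
n_heads = stats.binom(n=20, p=0.5)
print(n_heads.pmf(12))             # P(exactly 12 heads)

# Continuous RV: P(any exact value) = 0, so probabilities attach to ranges
height = stats.norm(loc=69.5, scale=2.8)
print(height.cdf(74) - height.cdf(68))   # P(5'8" <= height <= 6'2")
```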
We introduced some rules for combining more complex random events. Two forms of the addition rule give us the probability of one event OR another occurring (the form depending on whether the events are mutually exclusive). Two forms of the multiplication rule give us the probability of one event AND another occurring (the form depending on whether the events are independent). We introduced Bayes' theorem, which gives us a way of reversing conditional probabilities. We played around with simulations of complex events to verify that the probability rules give us the same answers we obtain by simulating those events and observing the long-run relative frequencies with which they occur.
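That simulation idea fits in a few lines. This sketch (Python, with a dice example assumed for illustration) checks the general addition rule, the multiplication rule for independent events, and Bayes' theorem against long-run relative frequencies:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# Two independent dice; events A = "first die shows 6", B = "second die shows 6"
d1 = rng.integers(1, 7, size=n)
d2 = rng.integers(1, 7, size=n)
A, B = d1 == 6, d2 == 6

# Addition rule (general form): P(A or B) = P(A) + P(B) - P(A and B)
print((A | B).mean(), 1/6 + 1/6 - 1/36)

# Multiplication rule (independent events): P(A and B) = P(A) * P(B)
print((A & B).mean(), 1/36)

# Bayes' theorem with dependent events: C = "the two dice sum to 10 or more"
C = d1 + d2 >= 10
print(A[C].mean())                            # simulated P(A | C)
print(C[A].mean() * A.mean() / C.mean())      # P(C | A) P(A) / P(C)
```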
Some useful questions to check your understanding include:
Next, we turned to the subject of sampling distributions. Sampling distributions are probability distributions for which the random variable happens to be a statistic. Sampling distributions are particularly important for us because they enable us to quantify the uncertainty about statistics that arises from the process of sampling, thus enabling us to conduct hypothesis tests. We have introduced three distributions that, in the right context, can be viewed as sampling distributions: the binomial distribution (which we used to perform inference about the probability of HEADS in a coin-tossing experiment); the normal distribution (which is the sampling distribution for the two forms of the Z test we have considered); and the t distribution (which describes the likely values of the two forms of the t test we have considered). Using those sampling distributions, we described the process of hypothesis testing. For each test statistic that we have introduced, we have discussed the assumptions required and how to assess them. We have also very carefully discussed the probabilistic interpretation of hypothesis testing results.
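As a concrete check on how each sampling distribution gets used, here is a sketch (Python/scipy assumed; the data are simulated purely for illustration) running a binomial test, a Z test computed by hand, and the two forms of the t test:

```python
import numpy as np
from scipy import stats

# Binomial sampling distribution: test H0: P(heads) = 0.5 given 60 heads in 100 tosses
print(stats.binomtest(k=60, n=100, p=0.5).pvalue)

# Simulated data, assumed only for illustration
rng = np.random.default_rng(7)
x = rng.normal(loc=103, scale=15, size=40)
y = rng.normal(loc=98, scale=15, size=40)

# Z test by hand (population SD assumed known to be 15): Z = (M - mu0) / (sigma / sqrt(n))
z = (x.mean() - 100) / (15 / np.sqrt(len(x)))
print(2 * stats.norm.sf(abs(z)))         # two-sided p from the normal sampling distribution

# One-sample t test of H0: mu = 100, and a two-sample (pooled-variance) t test
print(stats.ttest_1samp(x, popmean=100).pvalue)
print(stats.ttest_ind(x, y, equal_var=True).pvalue)
```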
Some useful questions to ask as you review are:
We introduced the idea of confidence intervals, describing them as a range of reasonable values for a parameter: the set of null hypotheses about that parameter that we would not reject, given our data (a worked illustration of that inversion appears below).
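To see the inversion in action, this sketch (Python/scipy, with simulated data assumed for illustration) computes a 95% t-based interval and confirms that null values just inside it are retained while values just outside are rejected:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.normal(loc=103, scale=15, size=40)   # simulated data, assumed for illustration

# 95% t-based interval: M +/- t_crit * s / sqrt(n)
n, M, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(0.975, df=n - 1)
lo, hi = M - t_crit * s / np.sqrt(n), M + t_crit * s / np.sqrt(n)

# Nulls inside the interval are retained at alpha = .05; nulls outside are rejected
for mu0 in (lo + 0.01, hi - 0.01, lo - 0.01, hi + 0.01):
    p = stats.ttest_1samp(x, popmean=mu0).pvalue
    print(round(mu0, 2), "retain" if p > 0.05 else "reject")
```

Some useful questions to ask as you review are: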
We discussed measures of effect size. We stressed that if the variable has an intrinsically meaningful metric (e.g., pounds, dollars, time), the effect should be specified in that metric. For variables that have an arbitrary metric, we employ standardized effect sizes. We saw two forms of Cohen's d that correspond to one- and two-sample designs. For one-sample designs, d = (M - mu0) / s. For two-group designs, d = (M1 - M2) / sp, where sp denotes the square root of the pooled variance estimate from a two-sample t test.
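Both forms of d are straightforward to compute directly. A minimal sketch (Python assumed for illustration; the function names are mine, not anything we used in class):

```python
import numpy as np

def cohens_d_one_sample(x, mu0):
    """d = (M - mu0) / s for a one-sample design."""
    return (np.mean(x) - mu0) / np.std(x, ddof=1)

def cohens_d_two_sample(x1, x2):
    """d = (M1 - M2) / sp, where sp is the square root of the pooled variance."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * np.var(x1, ddof=1) +
                  (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)
```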