Psychology 202b

Advanced Psychological Statistics II


First homework assignment, 2/3/2011 (due 2/15/2011).

Last week, you worked with a data set consisting of taste ratings of cheddar cheese, along with amounts of three components of cheese that tend to increase as the cheese ages. You will continue to work with the same data set this week. (Information about the source of the data and the identification of columns appears in last week's homework.)

Part One

In last week's assignment, you assessed the important assumption that the relationships between the predictors and the taste ratings are linear. Note that linearity is the only assumption that is required in order for multiple regression to be useful as a descriptive technique.

This week, you will assess the assumptions necessary for inference about the multiple regression to be valid. That is, use the techniques demonstrated in class on February 3 to assess the assumptions that the errors are both homoscedastic and normally distributed. Your work should include graphical output from R, as well as a commentary explaining the conclusions you reached and why you reached them.
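As a starting point, a minimal R sketch of these diagnostics appears below. It assumes the data have been read into a data frame called cheese with columns named taste, Acetic, H2S, and Lactic; substitute whatever names you used in last week's assignment.

    ## Fit the multiple regression with all three predictors
    fit <- lm(taste ~ Acetic + H2S + Lactic, data = cheese)

    ## Homoscedasticity: residuals vs. fitted values should show roughly constant spread
    plot(fitted(fit), resid(fit),
         xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)

    ## Normality: normal quantile-quantile plot of the residuals
    qqnorm(resid(fit))
    qqline(resid(fit))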

For purposes of this assignment, you should retain all three predictors of taste in the multiple regression. For extra credit, you may include a brief explanation of why it is appropriate to do so, even though acetic acid was not a significant predictor in the context of the multiple regression.

Part Two

It has been suggested that this data set may be collinear. Use R to calculate the condition number for the set of three predictors. (Don't forget that X in the formula includes a column of 1's at the beginning. So for your analysis, X will have four columns, not three.)
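As a sketch, one way to obtain the condition number in R, assuming the fitted model object from Part One is called fit, is from the singular values of the design matrix:

    ## Design matrix, including the leading column of 1's for the intercept
    X <- model.matrix(fit)

    ## Condition number: ratio of the largest to the smallest singular value
    sv <- svd(X)$d
    max(sv) / min(sv)

    ## kappa(X, exact = TRUE) returns the same quantity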

Interpret the result. Does it suggest that collinearity is an issue here?

Part Three

Examine the plot of residuals vs. predicted values that you created in Part One. Do you see any points that you suspect may be outliers?

Identify the point in the data set that has the largest residual value. (In R, you can do this by extracting the residuals from your fitted model object, for example with resid(), and finding the case whose residual is largest in absolute value.)
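A minimal sketch, again assuming the fitted model object fit and the data frame cheese from the earlier parts:

    ## Index of the case whose residual is largest in absolute value
    i <- which.max(abs(resid(fit)))

    ## Inspect that case and its residual
    cheese[i, ]
    resid(fit)[i]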

Now that you have identified the point, use the externally Studentized residual to do an outlier test. What do you conclude?
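One way to carry out such a test in R, sketched below under the same assumptions as above, is to compare the largest externally Studentized residual to a t distribution, with a Bonferroni correction for having searched over all n cases:

    ## Externally Studentized residuals
    rst <- rstudent(fit)

    n <- nrow(cheese)
    p <- length(coef(fit))    # number of estimated coefficients, including the intercept

    ## Test statistic for the suspect case (i from the previous step)
    t.max <- rst[i]

    ## Bonferroni-adjusted two-sided p-value; df = n - p - 1 for externally
    ## Studentized residuals
    min(1, n * 2 * pt(-abs(t.max), df = n - p - 1))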

Next, investigate the degree to which the point has affected the regression. You may find it useful to evaluate the point's leverage and Cook's distance. Do you think there is a problem here? Why, or why not?
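A sketch of those quantities in R, under the same assumptions:

    ## Leverage (hat values) and Cook's distance for every case
    h <- hatvalues(fit)
    d <- cooks.distance(fit)

    ## Values for the suspect case, with common reference points
    h[i]; mean(h)    # average leverage is p/n; values well above that deserve attention
    d[i]             # Cook's distance near or above 1 is often taken as influential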

Finally, delete the case with the largest residual and recalculate the regression. Comment on any changes in the intercept and slope estimates. Has the regression changed in a way that seems qualitatively important?
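A sketch of the refit, continuing with the assumed names from the previous parts:

    ## Refit the regression without the case identified above
    fit.drop <- update(fit, subset = -i)

    ## Compare the coefficient estimates side by side
    cbind(full = coef(fit), dropped = coef(fit.drop))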