Question 1. For the prostate data set, fit a model with lpsa as the response, and the other variables as predictors.
(a) Suppose a new patient with the following values arrives:
lcavol = 1.45000, lweight = 3.59801, age = 63.00000, lbph = 0.30010,
svi = 0.00000, lcp = -0.79851, gleason = 7.00000, pgg45 = 15.00000.
Predict the lpsa for this patient along with an appropriate 95% prediction interval.
(b) Repeat part (a) for a patient with the same values except that his age is 20. Explain why the prediction interval is wider.
(c) Starting from the full model fitted above, remove all predictors that are not significant at the 5% level. Using the reduced model, recompute the predictions for the x values given in parts (a) and (b). Are the new prediction intervals wider or narrower than those in parts (a) and (b)? Which predictions would you prefer? Explain.
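For reference, the standard 95% prediction interval for a new response at a predictor vector x_0 can be written as below. The notation is an assumption and is not taken from this assignment: X is the n × p design matrix (including the intercept column), \hat{\beta} is the least-squares estimate, \hat{\sigma} is the residual standard error, and t_{n-p}^{(0.025)} is the upper 2.5% point of the t distribution with n - p degrees of freedom.

```latex
\hat{y}_0 \pm t_{n-p}^{(0.025)} \, \hat{\sigma}
\sqrt{1 + x_0^{\top} (X^{\top} X)^{-1} x_0},
\qquad \hat{y}_0 = x_0^{\top} \hat{\beta}
```

In R, predict() applied to an lm fit with interval = "prediction" returns an interval of this form (the default level is 0.95).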
Question 2. Using the swiss data set, fit a model with Fertility as the response and all of the other variables as predictors. Answer the following.
(a) Produce a plot of the internally Studentized residuals r_i versus the ordinary (least squares) residuals ε̂_i. (Show R code.)
(b) The points in this plot do not exactly fall on a straight line. Briefly explain why. [ Hint: What is the formula for the internally Studentized residuals? ]
(c) List the externally Studentized residuals t_i (which are used as test statistics in the Mean Shift Test).
(d) Perform the Mean Shift Test without Bonferroni adjustment, using α = 0.05. Which provinces are identified as outliers?
(e) Perform the Mean Shift Test with Bonferroni adjustment, using α = 0.05. Which provinces are identified as outliers?
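For reference, one common way to write the Mean Shift Test statistics and cutoffs is sketched below. The notation is an assumption, since the course slides may use different symbols: n is the number of observations, p is the number of regression coefficients (including the intercept), and t_{\nu}^{(q)} denotes the upper q point of the t distribution with \nu degrees of freedom.

```latex
t_i = r_i \sqrt{\frac{n - p - 1}{n - p - r_i^{2}}}, \qquad
\text{flag observation } i \text{ if }
\begin{cases}
|t_i| > t_{n-p-1}^{(\alpha/2)} & \text{(no adjustment)} \\[4pt]
|t_i| > t_{n-p-1}^{(\alpha/(2n))} & \text{(Bonferroni adjustment)}
\end{cases}
```

In R, rstandard() and rstudent() applied to an lm fit return r_i and t_i respectively, and qt() gives the t quantiles.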
Question 3. Using the eco data set, fit a model with home as the response and all of the other variables as predictors. Answer the following parts.
In parts (b) through (g), you should draw a specific conclusion and clearly refer to the diagnostic tool(s) (plots or statistics) you used to draw your conclusion.
(a) Produce the four default diagnostic plots given by R.
(b) Using an appropriate diagnostic plot, check the functional form of the relationship between the mean response and the predictors.
(c) Check the assumption of constant variance.
(d) What is the largest (most positive) least-squares residual value? What is the smallest (most negative) least-squares residual value?
(e) Which observation has the greatest leverage value?
(f) Check for outliers. (You may use diagnostic plots; a formal test is not necessary.)
(g) Check for influential points.
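For reference, the leverage and influence measures most often used for parts (e) and (g) can be written as follows. This is a sketch in assumed notation (X is the n × p design matrix including the intercept column, r_i is the internally Studentized residual), and Cook's distance is only one of several influence measures the course may have in mind.

```latex
h_{ii} = \left[ X (X^{\top} X)^{-1} X^{\top} \right]_{ii},
\qquad \sum_{i=1}^{n} h_{ii} = p,
\qquad D_i = \frac{r_i^{2}}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}
```

In R, hatvalues() and cooks.distance() applied to an lm fit return h_{ii} and D_i.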
Question 4. Using R, produce a grid of 9 normal probability plots (qqnorm) for samples of size n = 50 simulated independently from a geometric distribution with parameter prob equal to 0.4.
(Refer to the last section of the Diagnostics in Linear Regression slides. Use the R function rgeom to simulate from the geometric distribution.)
(a) Display your plots and the R code you used to produce them.
(b) Describe two distinct ways in which these plots tend to differ in appearance from what you would expect for normally-distributed data.
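For reference, R's rgeom uses the "number of failures before the first success" parametrization of the geometric distribution, so the simulated values are non-negative integers starting at 0. Written with p standing for the prob argument (here p = 0.4):

```latex
P(X = k) = p \, (1 - p)^{k}, \quad k = 0, 1, 2, \dots,
\qquad E[X] = \frac{1 - p}{p} = 1.5,
\qquad \operatorname{Var}(X) = \frac{1 - p}{p^{2}} = 3.75
```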