Linear Regression Pt. 3 - Casewise Diagnostics
After building our simple and multiple regression models, we turn our attention to casewise diagnostics to learn which cases in the sample are outliers and which data points have undue influence on the model, since either can affect the model’s stability. We will focus on standardized residuals, Cook’s distance, and leverage (hat) values, although several other diagnostic measures exist.
Outliers and standardized residuals
Residuals help us understand how well the model fits the sample data. Standardized residuals are derived by dividing the non-standardized residuals by an estimate of their standard deviation. They can be obtained by applying the rstandard() function to a regression model object. Alternatively, we can use the augment() function from the broom package to compute the standardized residuals, Cook’s distance, and leverage (hat) values in one command. Standardized residuals primarily serve two roles. First, they facilitate interpretation across different models because the units are standard deviations rather than the units of the outcome variable. Second, they serve as an indicator of outliers that may bias the estimated regression coefficients. Two general rules are that no more than 5% of the standardized residuals should have an absolute value greater than 2, and no more than 1% should have an absolute value greater than 2.5. In our example dataset, about 5.2% of the standardized residuals (12 of 231 cases) fall beyond the +/-2 boundary, slightly above the 5% guideline. In addition, 7 of those 12 cases (about 3% of the sample) exceed 2.5, well above the 1% guideline. Together these suggest that our model may not represent the outcome data well.
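# Packages used below: broom for augment(), dplyr for %>%, select() and filter(), knitr for kable()
library(broom); library(dplyr); library(knitr)
# augment() returns the model data plus fitted values and casewise diagnostics
dx <- augment(multiple) %>% select(SALARY, AGE, YEARS, BEAUTY, .fitted, .std.resid, .cooksd, .hat)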
# Flag cases whose standardized residual is greater than 2 or less than -2
dx$large.residual <- abs(dx$.std.resid) > 2
# Sum of large standardized residuals
sum(dx$large.residual)
## [1] 12
# Percentage of standardized residuals greater than 2 or less than -2
sum(dx$large.residual)/length(dx$large.residual) * 100
## [1] 5.194805
kable(filter(dx, large.residual == TRUE) %>%
select(SALARY, AGE, YEARS, BEAUTY, .std.resid, large.residual))
SALARY | AGE | YEARS | BEAUTY | .std.resid | large.residual |
---|---|---|---|---|---|
53.72479 | 20.34707 | 5.506886 | 68.56999 | 2.214829 | TRUE |
95.33807 | 24.17183 | 8.532050 | 71.77039 | 4.696607 | TRUE |
48.86766 | 19.11451 | 4.951027 | 73.32626 | 2.241876 | TRUE |
51.02516 | 19.46200 | 5.187275 | 80.00141 | 2.420635 | TRUE |
56.83152 | 24.41146 | 8.753041 | 80.65103 | 2.099147 | TRUE |
64.79129 | 18.46839 | 4.284322 | 78.91763 | 3.440027 | TRUE |
61.31880 | 22.25275 | 7.397138 | 78.92917 | 2.778123 | TRUE |
89.98003 | 22.28899 | 7.419825 | 75.93018 | 4.717284 | TRUE |
74.86075 | 24.40682 | 8.444767 | 86.09212 | 3.319137 | TRUE |
54.56552 | 22.31422 | 6.833367 | 88.01470 | 2.200115 | TRUE |
50.65578 | 15.27406 | 2.981697 | 66.38544 | 3.177863 | TRUE |
71.32073 | 20.65061 | 5.834559 | 77.57684 | 3.531357 | TRUE |
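To make the definition of a standardized residual concrete, the following is a minimal sketch that computes it by hand and compares the result with rstandard(). It assumes the usual formula, residual / (s * sqrt(1 - h)), where s is the residual standard error and h is the case's hat value. The model `fit` and the built-in mtcars data are stand-ins chosen for illustration; they are not the article's `multiple` model or salary data.

```r
# Illustrative model on the built-in mtcars data (not the article's dataset)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

e <- residuals(fit)       # raw (non-standardized) residuals
h <- hatvalues(fit)       # leverage (hat) values
s <- summary(fit)$sigma   # residual standard error

# Standardized residual: raw residual divided by its estimated standard deviation
std_resid_manual <- e / (s * sqrt(1 - h))

# Should agree with rstandard() up to floating-point error
all.equal(std_resid_manual, rstandard(fit))

# Percentage of cases beyond the +/-2 and +/-2.5 cut-offs
pct_beyond_2   <- mean(abs(std_resid_manual) > 2) * 100
pct_beyond_2_5 <- mean(abs(std_resid_manual) > 2.5) * 100
```

Using mean() on a logical vector gives the proportion of TRUE values directly, which is equivalent to dividing the sum by the length as in the code above.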
Influential cases: Cook’s distance
One way to determine which cases within a regression model have undue influence on the model parameters is to calculate Cook’s distance. Cook’s distance has a straightforward interpretation: any value greater than 1 may be cause for concern. Cook’s distance values can be obtained by passing a regression model object to the cooks.distance() function. However, we will use the dx data frame that was created with the augment() function from the broom package. With this dataset, there are no values greater than 1. This suggests that the model is stable across the sample because none of the cases exert undue influence on the model parameters.
# Flag cases whose Cook's distance is greater than 1
dx$large.cooksd <- dx$.cooksd > 1
# Number of cases with a large Cook's distance
sum(dx$large.cooksd)
## [1] 0
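Cook’s distance combines the two other diagnostics in this post: a case is influential when it has both a large standardized residual and high leverage. The sketch below illustrates this under the standard formula D = (r^2 / p) * h / (1 - h), where r is the standardized residual, h the hat value, and p the number of estimated coefficients (predictors plus intercept). As before, `fit` and mtcars are illustrative stand-ins rather than the article's model and data.

```r
# Illustrative model on the built-in mtcars data (not the article's dataset)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

r <- rstandard(fit)       # standardized residuals
h <- hatvalues(fit)       # leverage (hat) values
p <- length(coef(fit))    # number of estimated coefficients (k + 1)

# Cook's distance from standardized residuals and leverage
cooks_manual <- (r^2 / p) * (h / (1 - h))

# Should agree with cooks.distance() up to floating-point error
all.equal(cooks_manual, cooks.distance(fit))

# Names of any cases above the rule-of-thumb threshold of 1
names(which(cooks_manual > 1))
```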
Influential cases: Leverage/hat values
Leverage (hat) values are an additional measure of influential cases. Leverage values can be obtained by passing the regression model object to the hatvalues() function, but they are already in our dx data frame. Cases with values more than 2 or 3 times the average leverage, (k + 1)/n, where k is the number of predictors and n is the sample size, may have undue influence. With these data (k = 3, n = 231), that means values higher than 0.035 or 0.052, depending on how conservative you want to be. There are 25 cases with hat values more than 2 times the average leverage value, and 3 cases with hat values more than 3 times the average leverage value.
# Cut-offs: 2 and 3 times the average leverage, (number of predictors + 1) / n
round(((3 + 1)/231) * 2, 3)
## [1] 0.035
round(((3 + 1)/231) * 3, 3)
## [1] 0.052
# Flag cases whose hat value exceeds 2 times the average leverage
dx$large.hat <- dx$.hat > ((3 + 1)/231) * 2
# Number of cases above the 2x cut-off (more conservative)
sum(dx$large.hat)
## [1] 25
# Overwrite the flag using 3 times the average leverage
dx$large.hat <- dx$.hat > ((3 + 1)/231) * 3
# Number of cases above the 3x cut-off (less conservative)
sum(dx$large.hat)
## [1] 3
# Print table
kable(filter(dx, large.hat == TRUE) %>%
select(SALARY, AGE, YEARS, BEAUTY, .hat, large.hat))
SALARY | AGE | YEARS | BEAUTY | .hat | large.hat |
---|---|---|---|---|---|
6.419431 | 18.99114 | 5.237983 | 99.22141 | 0.0623528 | TRUE |
22.681436 | 25.28966 | 9.932158 | 75.42206 | 0.0580976 | TRUE |
3.534942 | 16.04653 | 4.598695 | 83.59070 | 0.0600435 | TRUE |
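The cut-offs above hard-code the number of predictors and the sample size. As a hedged alternative, the sketch below derives the average leverage from the model object itself, relying on the fact that hat values sum to the number of estimated coefficients, so their mean equals (k + 1)/n. The model `fit` and the mtcars data are again illustrative assumptions, not the article's model.

```r
# Illustrative model on the built-in mtcars data (not the article's dataset)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

h <- hatvalues(fit)          # leverage (hat) values
k <- length(coef(fit)) - 1   # number of predictors (coefficients minus intercept)
n <- nobs(fit)               # sample size used in the fit

# Average leverage, (k + 1) / n, which equals the mean of the hat values
avg_leverage <- (k + 1) / n
all.equal(mean(h), avg_leverage)

# Flag cases above 2x and 3x the average leverage in separate vectors
flag_2x <- h > 2 * avg_leverage
flag_3x <- h > 3 * avg_leverage

# Number of flagged cases under each cut-off
c(twice = sum(flag_2x), thrice = sum(flag_3x))
```

Keeping the two flags in separate vectors avoids overwriting one result with the other, so both sets of cases stay available for inspection.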
References
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.