Linear Regression Pt. 3 - Casewise Diagnostics

After building our simple and multiple regression models, we turn to casewise diagnostics to identify outliers in the sample and data points that exert undue influence on the model, either of which could affect the model’s stability. We will focus on standardized residuals, Cook’s distance, and leverage (hat) values, although several other diagnostic measures are available.

Outliers and standardized residuals

Residuals help us understand how well the model fits the sample data. Standardized residuals are derived by dividing the raw residuals by an estimate of their standard deviation. They can be obtained by applying the rstandard() function to a regression model object. Alternatively, we can use the augment() function from the broom package to compute the standardized residuals, Cook’s distance, and leverage (hat) values in one command.

Standardized residuals serve two main roles. First, they are easier to compare across models because they are expressed in standard deviations rather than in the units of the outcome variable. Second, they flag outliers that may bias the estimated regression coefficients. As general rules, no more than 5% of the standardized residuals should have an absolute value greater than 2, and no more than 1% should have an absolute value greater than 2.5. In our example dataset, 12 of the 231 cases (about 5.2%) fall beyond the +/-2 boundary, which is slightly above the 5% guideline. Seven of those cases (about 3%) also exceed 2.5, well above the 1% guideline. Both results are evidence that our model may not represent our outcome data well.

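# augment() comes from broom; %>%, select(), and filter() from dplyr; kable() from knitr
library(broom)
library(dplyr)
library(knitr)

# Add casewise diagnostics to the data used to fit the multiple regression model
dx <- augment(multiple) %>% select(SALARY, AGE, YEARS, BEAUTY, .fitted, .std.resid, .cooksd, .hat)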

# Create a boolean vector of large residuals; greater than 2 or less than -2 
dx$large.residual <- dx$.std.resid > 2 | dx$.std.resid < -2

# Sum of large standardized residuals
sum(dx$large.residual)

## [1] 12

# Percentage of standardized residuals greater than 2 or less than -2
sum(dx$large.residual)/length(dx$large.residual) * 100

## [1] 5.194805

kable(filter(dx, large.residual == TRUE) %>%
    select(SALARY, AGE, YEARS, BEAUTY, .std.resid, large.residual))

SALARY	AGE	YEARS	BEAUTY	.std.resid	large.residual
53.72479	20.34707	5.506886	68.56999	2.214829	TRUE
95.33807	24.17183	8.532050	71.77039	4.696607	TRUE
48.86766	19.11451	4.951027	73.32626	2.241876	TRUE
51.02516	19.46200	5.187275	80.00141	2.420635	TRUE
56.83152	24.41146	8.753041	80.65103	2.099147	TRUE
64.79129	18.46839	4.284322	78.91763	3.440027	TRUE
61.31880	22.25275	7.397138	78.92917	2.778123	TRUE
89.98003	22.28899	7.419825	75.93018	4.717284	TRUE
74.86075	24.40682	8.444767	86.09212	3.319137	TRUE
54.56552	22.31422	6.833367	88.01470	2.200115	TRUE
50.65578	15.27406	2.981697	66.38544	3.177863	TRUE
71.32073	20.65061	5.834559	77.57684	3.531357	TRUE
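
To make the definition concrete, the sketch below recomputes standardized residuals by hand and compares them with rstandard(). It is an illustration only: it assumes the built-in mtcars data and an arbitrary three-predictor model (mpg on wt, hp, and qsec) as stand-ins, because the salary data and the multiple model object are not reproduced here. Each raw residual is divided by the residual standard error times sqrt(1 - leverage), which is the estimate of that residual's standard deviation.

```r
# Stand-in model on built-in data (an assumption; not the salary model from this post)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

e <- residuals(fit)   # raw residuals
h <- hatvalues(fit)   # leverage of each case
s <- sigma(fit)       # residual standard error

# Standardized residual: raw residual divided by its estimated standard deviation
r_manual <- e / (s * sqrt(1 - h))

# The hand calculation should agree with rstandard()
stopifnot(isTRUE(all.equal(r_manual, rstandard(fit))))

# Percentage of cases beyond the +/-2 and +/-2.5 guidelines
pct_beyond_2  <- mean(abs(r_manual) > 2) * 100
pct_beyond_25 <- mean(abs(r_manual) > 2.5) * 100
```

Using mean() on a logical vector gives the proportion of TRUE values directly, which avoids dividing a sum by a length.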

Influential cases: Cook’s distance

One way to determine which cases have undue influence on the model parameters is to calculate Cook’s distance. Cook’s distance has a straightforward interpretation: any value greater than 1 may be cause for concern. Cook’s distance values can be obtained by passing a regression model object to the cooks.distance() function. However, we will use the dx data frame that was created with the augment() function from the broom package. In this dataset, no value is greater than 1. This suggests that the model is stable across the sample because no single case exerts undue influence on the model parameters.

# Create a boolean vector of large Cook's distance values; greater than 1
dx$large.cooksd <- dx$.cooksd > 1

# Sum of large Cook's distance
sum(dx$large.cooksd)

## [1] 0
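
Cook’s distance combines the two other diagnostics in this post: a case scores high when it has both a large standardized residual and high leverage. The sketch below checks that relationship against cooks.distance(). As an assumption for illustration, it uses the built-in mtcars data and an arbitrary three-predictor model rather than the salary model.

```r
# Stand-in model on built-in data (an assumption; not the salary model from this post)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

r <- rstandard(fit)       # standardized residuals
h <- hatvalues(fit)       # leverage values
p <- length(coef(fit))    # number of estimated coefficients (predictors + intercept)

# Cook's distance: squared standardized residual scaled by leverage
d_manual <- (r^2 / p) * (h / (1 - h))

# The hand calculation should agree with cooks.distance()
stopifnot(isTRUE(all.equal(d_manual, cooks.distance(fit))))

# Number of cases above the rule-of-thumb cutoff of 1
n_large_cooksd <- sum(d_manual > 1)
```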

Influential cases: Leverage/hat values

Leverage (hat) values are an additional measure of influential cases. Leverage values can be obtained by passing the regression model object to the hatvalues() function, but they are already in our dx data frame. Cases with values that are 2 or 3 times as large as the average leverage, (k + 1)/n, where k = the number of predictors and n = the sample size, may have undue influence. With these data (k = 3, n = 231), the cutoffs are 0.035 and 0.052, depending on how conservative you want to be. There are 25 cases with hat values greater than 2 times the average leverage value, and 3 cases with hat values greater than 3 times the average leverage value.

# Average leverage is (k + 1)/n: the number of predictors + 1, divided by n
# Cutoffs at 2 and 3 times the average leverage
round(((3 + 1)/231) * 2, 3)

## [1] 0.035

round(((3 + 1)/231) * 3, 3)

## [1] 0.052

# Create a boolean vector of hat values greater than 2 times the average leverage
dx$large.hat <- dx$.hat > ((3 + 1)/231) * 2

# Number of cases above 2 times the average leverage (more conservative cutoff)
sum(dx$large.hat)

## [1] 25

# Overwrite the flag using 3 times the average leverage
dx$large.hat <- dx$.hat > ((3 + 1)/231) * 3

# Number of cases above 3 times the average leverage (less conservative cutoff)
sum(dx$large.hat)

## [1] 3

# Print table
kable(filter(dx, large.hat == TRUE) %>%
    select(SALARY, AGE, YEARS, BEAUTY, .hat, large.hat))

SALARY	AGE	YEARS	BEAUTY	.hat	large.hat
6.419431	18.99114	5.237983	99.22141	0.0623528	TRUE
22.681436	25.28966	9.932158	75.42206	0.0580976	TRUE
3.534942	16.04653	4.598695	83.59070	0.0600435	TRUE
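
The cutoffs above hard-code k = 3 and n = 231. A minimal sketch of deriving them from the fitted model instead is shown below, again assuming the built-in mtcars data and an arbitrary three-predictor model as stand-ins for the salary model. It also shows where hat values come from: they are the diagonal of the hat matrix H = X(X'X)^-1 X', and their mean is exactly (k + 1)/n.

```r
# Stand-in model on built-in data (an assumption; not the salary model from this post)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hat values are the diagonal of the hat matrix H = X (X'X)^-1 X'
X <- model.matrix(fit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h_manual <- diag(H)
stopifnot(isTRUE(all.equal(unname(h_manual), unname(hatvalues(fit)))))

# Average leverage, (k + 1)/n, taken from the model rather than typed in
k <- length(coef(fit)) - 1   # number of predictors (coefficients minus the intercept)
n <- nobs(fit)               # sample size
avg_leverage <- (k + 1) / n
stopifnot(isTRUE(all.equal(mean(h_manual), avg_leverage)))

# Flag cases at both cutoffs without overwriting one flag with the other
large_hat_2x <- h_manual > 2 * avg_leverage
large_hat_3x <- h_manual > 3 * avg_leverage
```

Keeping the two flags in separate vectors means the table for either cutoff can be printed later without recomputing the comparison.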

References

Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.

Last updated on Oct 8, 2023