Linear Regression Pt. 2 - Multiple Linear Regression

In the previous guide, we built a simple linear regression model to predict salary from age in a sample of supermodels. In this second part, we build a more complex model that predicts salary from age plus two additional variables: years of experience and a rating of attractiveness.

Update the model

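The update() function is a quick way to add variables to an existing lm() object. In this example, we pass our existing regression model to update() together with a formula describing the change. In that formula, the dots in . ~ . stand for the outcome and predictors already in the model, and + BEAUTY + YEARS adds the new variables. update() then refits the model and returns a new lm() object.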

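# Simple model from the previous guide
simple   <- lm(SALARY ~ AGE, data = data)
# Keep the existing outcome and predictors (. ~ .) and add two more
multiple <- update(simple, . ~ . + BEAUTY + YEARS)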

summary(multiple)
## 
## Call:
## lm(formula = SALARY ~ AGE + BEAUTY + YEARS, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.853  -7.950  -4.197   4.605  68.085 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -60.8897    16.4966  -3.691  0.00028 ***
## AGE           6.2344     1.4112   4.418 1.54e-05 ***
## BEAUTY       -0.1964     0.1524  -1.289  0.19871    
## YEARS        -5.5612     2.1222  -2.621  0.00937 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.57 on 227 degrees of freedom
## Multiple R-squared:  0.184,	Adjusted R-squared:  0.1733 
## F-statistic: 17.07 on 3 and 227 DF,  p-value: 4.973e-10
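
To make the role of update() concrete, here is a minimal sketch showing that updating a model gives the same fit as writing out the full formula in a new lm() call. The data frame sim and the objects simple_sim, multiple_sim, and direct_sim are hypothetical stand-ins built from simulated values, not the supermodel data used in this guide.

```r
# Simulated stand-in data (values are made up for illustration only)
set.seed(42)
n <- 100
sim <- data.frame(
  AGE    = rnorm(n, mean = 22, sd = 3),
  BEAUTY = rnorm(n, mean = 75, sd = 7),
  YEARS  = rnorm(n, mean = 5,  sd = 2)
)
sim$SALARY <- 5 + 2 * sim$AGE - 1 * sim$YEARS + rnorm(n, sd = 5)

# Start from the simple model, then add predictors with update()
simple_sim   <- lm(SALARY ~ AGE, data = sim)
multiple_sim <- update(simple_sim, . ~ . + BEAUTY + YEARS)

# Fit the same model by writing the full formula directly
direct_sim   <- lm(SALARY ~ AGE + BEAUTY + YEARS, data = sim)

# Compare the two sets of coefficients
all.equal(coef(multiple_sim), coef(direct_sim))
```

The main convenience of update() is that the original call (data set, outcome, existing predictors) does not need to be retyped, which reduces the chance of accidentally changing something else between models.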

Interpretation of the model summary

As a reminder, the multiple R-squared value is the proportion of variability in the outcome variable that is accounted for by the predictors. In the multiple regression model this value is .18, or 18%, leaving a substantial portion of the variance (82%) unaccounted for. Only two variables significantly predict SALARY: AGE and YEARS. Notice that the coefficient for AGE is positive, indicating a positive relationship between SALARY and AGE. The coefficient for YEARS, however, is negative, indicating a negative relationship between SALARY and YEARS. Each coefficient describes the effect of one predictor while the other predictors are held constant. For each unit increase in AGE, the model predicts a 6.23 unit increase in salary. In contrast, for each unit increase in YEARS, the model predicts a 5.56 unit drop in salary. BEAUTY does not significantly predict salary (p = .20).
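
The quantities discussed above can also be pulled out of the model object directly, which is handy when reporting them. The following is a minimal sketch on simulated data; sim, fit, and newdata are hypothetical names, and the numbers it produces will not match the output shown in this guide. It also checks the "one unit increase" reading of a coefficient by predicting for two cases that differ only by one unit of AGE.

```r
# Simulated stand-in data (values are made up for illustration only)
set.seed(42)
n <- 100
sim <- data.frame(
  AGE    = rnorm(n, mean = 22, sd = 3),
  BEAUTY = rnorm(n, mean = 75, sd = 7),
  YEARS  = rnorm(n, mean = 5,  sd = 2)
)
sim$SALARY <- 5 + 2 * sim$AGE - 1 * sim$YEARS + rnorm(n, sd = 5)

fit <- lm(SALARY ~ AGE + BEAUTY + YEARS, data = sim)
s   <- summary(fit)

# Proportion of variance in SALARY accounted for by the predictors
s$r.squared

# Estimates and p-values for each term
s$coefficients[, c("Estimate", "Pr(>|t|)")]

# Two cases that differ only by one unit of AGE
newdata <- data.frame(AGE = c(20, 21), BEAUTY = c(75, 75), YEARS = c(4, 4))
pred    <- predict(fit, newdata = newdata)

# The difference in predictions matches the AGE coefficient
diff(pred)
coef(fit)["AGE"]
```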

Interpretation of regression coefficients

Some prefer to interpret standardized coefficients because they are expressed in standard deviations rather than in the original units of each variable. This facilitates direct comparisons between coefficients, but the interpretation must then be made in standard deviation units. For example, as AGE increases by one standard deviation, SALARY is predicted to increase by .94 standard deviations, holding the other predictors constant. Finally, because the coefficients are on a common scale, we can see that AGE has the largest standardized effect and is, by this measure, the most important predictor in the model.

library(QuantPsyc)
# Print standardized regression coefficients
lm.beta(multiple)
##         AGE      BEAUTY       YEARS 
##  0.94214234 -0.08299604 -0.54779846
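
For readers who want to see where standardized coefficients come from, here is a minimal base R sketch on simulated data. It rescales each unstandardized coefficient by sd(predictor) / sd(outcome), and then confirms that refitting the model on z-scored variables gives the same values. The names sim, fit, beta_manual, sim_z, and fit_z are hypothetical, and the sketch deliberately avoids QuantPsyc so that it runs without extra packages.

```r
# Simulated stand-in data (values are made up for illustration only)
set.seed(42)
n <- 100
sim <- data.frame(
  AGE    = rnorm(n, mean = 22, sd = 3),
  BEAUTY = rnorm(n, mean = 75, sd = 7),
  YEARS  = rnorm(n, mean = 5,  sd = 2)
)
sim$SALARY <- 5 + 2 * sim$AGE - 1 * sim$YEARS + rnorm(n, sd = 5)

fit <- lm(SALARY ~ AGE + BEAUTY + YEARS, data = sim)

# Approach 1: rescale the unstandardized slopes
b  <- coef(fit)[-1]                 # drop the intercept
sx <- sapply(sim[names(b)], sd)     # SD of each predictor
sy <- sd(sim$SALARY)                # SD of the outcome
beta_manual <- b * sx / sy

# Approach 2: z-score every variable, then refit
sim_z <- as.data.frame(scale(sim))
fit_z <- lm(SALARY ~ AGE + BEAUTY + YEARS, data = sim_z)
beta_scaled <- coef(fit_z)[-1]

# Both approaches should agree
all.equal(beta_manual, beta_scaled)
```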

Confidence intervals of the regression coefficients

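A 95% confidence interval is constructed so that, if we repeatedly collected samples measuring the exact same variables and computed an interval in each one, about 95% of those intervals would contain the true regression coefficient. Notice that the confidence interval for BEAUTY contains both negative and positive values. This means the data are compatible with a negative relationship between SALARY and BEAUTY, a positive one, or none at all. Ideally, we would want a consistent relationship between outcome and predictors, with intervals that lie entirely on one side of zero, as they do for AGE and YEARS. When a confidence interval contains zero, as it does for BEAUTY, it agrees with the non-significant p-value for that predictor and indicates that its coefficient is poorly estimated, rather than showing that the model as a whole is poor.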

library(knitr)
# Print the 95% confidence intervals as a formatted table
kable(confint(multiple))
                    2.5 %       97.5 %
(Intercept)   -93.3957556  -28.3837443
AGE             3.4536617    9.0152299
BEAUTY         -0.4965992    0.1038264
YEARS          -9.7429381   -1.3795536
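
To show what confint() computes for an lm() object, here is a minimal sketch on simulated data that rebuilds the 95% intervals by hand as estimate plus or minus the critical t value times the standard error. The names sim, fit, manual_ci, and excludes_zero are hypothetical, and the simulated numbers will differ from the table above.

```r
# Simulated stand-in data (values are made up for illustration only)
set.seed(42)
n <- 100
sim <- data.frame(
  AGE    = rnorm(n, mean = 22, sd = 3),
  BEAUTY = rnorm(n, mean = 75, sd = 7),
  YEARS  = rnorm(n, mean = 5,  sd = 2)
)
sim$SALARY <- 5 + 2 * sim$AGE - 1 * sim$YEARS + rnorm(n, sd = 5)

fit <- lm(SALARY ~ AGE + BEAUTY + YEARS, data = sim)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(est - t_crit * se, est + t_crit * se)
colnames(manual_ci) <- c("2.5 %", "97.5 %")

# Compare with the built-in function
all.equal(manual_ci, confint(fit))

# Flag the terms whose interval lies entirely on one side of zero
excludes_zero <- manual_ci[, 1] > 0 | manual_ci[, 2] < 0
excludes_zero
```

A term flagged FALSE here has an interval that spans zero, which corresponds to a p-value above .05 in the model summary.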

References

Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.

Fletcher, Thomas D. 2012. QuantPsyc: Quantitative Psychology Tools. https://CRAN.R-project.org/package=QuantPsyc.
