Multinomial Logistic Regression
Multinomial regression can be thought of as an extension of logistic regression. In logistic regression, the dependent or outcome variable can take on one of two categorical values, such as whether or not someone survived a stay at a hospital. In cases where the dependent variable falls into more than two categories, multinomial regression is one possible analysis option. For this guide, we will fit a multinomial regression using the nnet package. We will also use a sample data set in which researchers determined the effect of pickup lines on whether the person delivering the line 1) got a phone number, 2) went home with the other person, or 3) got no response/was walked away from.
library(tidyverse)
library(nnet)
library(car)
library(kableExtra)
library(broom)
Load Data
chatData<-read.delim("Chat-Up Lines.dat", header = TRUE)
Inspect Data
Before building our multinomial logistic regression model, we need to check that our categorical variables, the outcome Success and the predictor Gender, are stored as factors. Simply pass each column to the is.factor() function to check.
kable(head(chatData))
| Success | Funny | Sex | Good_Mate | Gender |
|---|---|---|---|---|
| Get Phone Number | 3 | 7 | 6 | Male |
| Go Home with Person | 5 | 7 | 2 | Male |
| Get Phone Number | 4 | 6 | 6 | Male |
| Go Home with Person | 3 | 7 | 5 | Male |
| Get Phone Number | 5 | 1 | 6 | Male |
| Get Phone Number | 4 | 7 | 5 | Male |
is.factor(chatData$Success)
## [1] FALSE
is.factor(chatData$Gender)
## [1] FALSE
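Checking each column one at a time works, but as a quick alternative you can apply the check across every column at once:

sapply(chatData, is.factor)  # returns a named logical vector, one entry per column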
Prepare Data
Since neither variable is factored, we use the as.factor() function to convert them and save the results back into the chatData data frame. Next, we want to relevel our factors to set a reference level for each. The reference level is the baseline category against which the other levels are compared. The relevel() function takes two inputs: the factor column that needs to be releveled and the reference level. For this example we want to set “No response/Walk Off” as the reference for Success, and “Male” as the reference for Gender. One way to think about multinomial regression is to view it as a series of binary logistic regression models in which one level of the outcome, the reference, is compared with each of the other levels.
chatData$Gender <- as.factor(chatData$Gender)
chatData$Success <- as.factor(chatData$Success)
chatData$Success <- relevel(chatData$Success, "No response/Walk Off") # Set Reference level to no response...
chatData$Gender <- relevel(chatData$Gender, "Male") # Set Male to reference
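As a quick sanity check, printing the factor levels confirms the releveling; the first level listed is the reference.

levels(chatData$Success)  # "No response/Walk Off" should now be listed first
levels(chatData$Gender)   # "Male" should now be listed first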
Model the data
Once our data is prepared, we can move onto building a multinomial regression model. For this example, we want to know if the variables Good_Mate, Funny, Gender, Sex, the interaction between Gender and Sex, and the interaction between Funny and Gender predict the outcome, Success.
chatModel2 <- multinom(Success ~ Good_Mate + Funny + Gender + Sex + Gender:Sex + Funny:Gender, data = chatData)
## # weights: 24 (14 variable)
## initial value 1120.584534
## iter 10 value 953.525508
## iter 20 value 868.736917
## final value 868.736497
## converged
The first chunk of output from the multinom() function displays a series of values from the iterations of the fitting algorithm. The key piece of information here is whether the model converged. If your model fails to converge, reconsider the variables that have been entered into the model; perhaps there are few or no observations for certain combinations of the categorical predictors, and some predictors may need to be omitted for the model to converge.
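As an aside, if a fit does not converge, one simple option is to allow more iterations; the maxit argument is passed through to the underlying nnet() fitter, which defaults to 100 iterations. (Our model did converge, so this is shown only for illustration.)

chatModel2 <- multinom(Success ~ Good_Mate + Funny + Gender + Sex +
                         Gender:Sex + Funny:Gender,
                       data = chatData, maxit = 200)  # allow up to 200 iterations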
Model summary
Passing our newly created model to the summary() function produces output for examining the model coefficients, their standard errors, and the overall fit of the model.
summary(chatModel2)
## Call:
## multinom(formula = Success ~ Good_Mate + Funny + Gender + Sex +
## Gender:Sex + Funny:Gender, data = chatData)
##
## Coefficients:
## (Intercept) Good_Mate Funny GenderFemale Sex
## Get Phone Number -1.783070 0.1318371 0.1393846 -1.646198 0.2762074
## Go Home with Person -4.286323 0.1300003 0.3184571 -5.626258 0.4172888
## GenderFemale:Sex Funny:GenderFemale
## Get Phone Number -0.3483306 0.4924484
## Go Home with Person -0.4766576 1.1724111
##
## Std. Errors:
## (Intercept) Good_Mate Funny GenderFemale Sex
## Get Phone Number 0.6697696 0.05372607 0.1101253 0.7962448 0.08919698
## Go Home with Person 0.9413983 0.08352059 0.1253016 1.3285828 0.12208319
## GenderFemale:Sex Funny:GenderFemale
## Get Phone Number 0.1058751 0.1399918
## Go Home with Person 0.1634337 0.1992393
##
## Residual Deviance: 1737.473
## AIC: 1765.473
The interpretation of the summary() output for the multinom model is similar to that of logistic regression. First we see the “Call:”, which is a reproduction of the multinom() function call used to create the model. Next, we see the “Coefficients:”, which quantify the relationship between the predictor variables and the outcome variable. Notice that only two outcome levels are displayed, “Get Phone Number” and “Go Home with Person.” This is because we set “No response/Walk Off” as the reference level. As an example, the coefficient for Good_Mate in the Get Phone Number row is 0.1318, which indicates that for each unit increase in Good_Mate, we can expect a 0.1318 increase in the logit (i.e., the log of the odds) of “Get Phone Number” relative to “No response/Walk Off.” Similarly, the coefficient for Good_Mate in the Go Home with Person row is 0.13, indicating that each unit increase in Good_Mate raises the logit of “Go Home with Person” relative to “No response/Walk Off” by 0.13.
The “Std. Errors:” section displays the standard errors of the coefficients. These are useful for calculating z statistics and p values: dividing each coefficient by its standard error gives a z statistic, and a two-tailed p value then follows from the standard normal distribution.
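A minimal sketch of that calculation, using the coefficient and standard error matrices that summary() returns for a multinom fit:

zScores <- summary(chatModel2)$coefficients / summary(chatModel2)$standard.errors
pValues <- 2 * pnorm(abs(zScores), lower.tail = FALSE)  # two-tailed p values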
Finally, the output displays the “Residual Deviance:” and the “AIC:”. The residual deviance measures how much the fitted model’s predictions deviate from the observed data; lower values indicate better fit. The AIC, or Akaike Information Criterion, is a way to assess model fit that penalizes the model according to the number of parameters used. These two values are helpful for comparing different multinomial regression models fit to the same data.
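As a sketch of such a comparison, we could fit a hypothetical reduced model without the interaction terms and set it against the full model; anova() performs a likelihood ratio test for nested multinom fits.

chatModelReduced <- multinom(Success ~ Good_Mate + Funny + Gender + Sex,
                             data = chatData)
anova(chatModelReduced, chatModel2)  # likelihood ratio test of nested models
AIC(chatModelReduced, chatModel2)    # lower AIC indicates the better trade-off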
A better way to present the output
The model summary output may look clunky on your screen, and it lacks information, specifically z statistics and p values, that could help us interpret the model. The missing information can be filled in with the tidy() function from the broom package. Using tidy() on our model produces a tibble that contains a z statistic and a p value for each coefficient.
tidymodel <- tidy(chatModel2)               # coefficients, SEs, z statistics, p values
tidymodel$expB <- exp(tidymodel$estimate)   # exponentiate the coefficients to obtain odds ratios
kable(tidymodel)
| y.level | term | estimate | std.error | statistic | p.value | expB |
|---|---|---|---|---|---|---|
| Get Phone Number | (Intercept) | -1.7830702 | 0.6697696 | -2.662214 | 0.0077628 | 0.1681212 |
| Get Phone Number | Good_Mate | 0.1318371 | 0.0537261 | 2.453875 | 0.0141326 | 1.1409224 |
| Get Phone Number | Funny | 0.1393846 | 0.1101253 | 1.265691 | 0.2056239 | 1.1495661 |
| Get Phone Number | GenderFemale | -1.6461979 | 0.7962448 | -2.067452 | 0.0386916 | 0.1927815 |
| Get Phone Number | Sex | 0.2762074 | 0.0891970 | 3.096600 | 0.0019575 | 1.3181212 |
| Get Phone Number | GenderFemale:Sex | -0.3483306 | 0.1058751 | -3.290013 | 0.0010018 | 0.7058655 |
| Get Phone Number | Funny:GenderFemale | 0.4924484 | 0.1399918 | 3.517694 | 0.0004353 | 1.6363176 |
| Go Home with Person | (Intercept) | -4.2863226 | 0.9413983 | -4.553144 | 0.0000053 | 0.0137554 |
| Go Home with Person | Good_Mate | 0.1300003 | 0.0835206 | 1.556507 | 0.1195877 | 1.1388288 |
| Go Home with Person | Funny | 0.3184571 | 0.1253016 | 2.541525 | 0.0110370 | 1.3750046 |
| Go Home with Person | GenderFemale | -5.6262579 | 1.3285828 | -4.234782 | 0.0000229 | 0.0036020 |
| Go Home with Person | Sex | 0.4172888 | 0.1220832 | 3.418069 | 0.0006307 | 1.5178408 |
| Go Home with Person | GenderFemale:Sex | -0.4766576 | 0.1634337 | -2.916519 | 0.0035396 | 0.6208551 |
| Go Home with Person | Funny:GenderFemale | 1.1724111 | 0.1992393 | 5.884435 | 0.0000000 | 3.2297704 |
For each coefficient other than the intercept, the estimate represents the change in the log odds of the outcome for each unit increase in the predictor variable. Because log odds are not intuitive to interpret, they can be exponentiated. The exponentiated coefficients represent odds ratios, which have a much more straightforward interpretation relative to one: values greater than one indicate that as the predictor increases, the odds of the outcome occurring increase, while values less than one indicate that as the predictor increases, the odds of the outcome occurring decrease. To drive this point home, the exponentiated coefficient for Get Phone Number and Good_Mate is 1.14. This value is greater than one, which indicates that as Good_Mate scores increase, the odds of getting a phone number increase relative to “No response/Walk Off.” To be more specific, the odds of getting a phone number increase by 14% for each unit increase in the Good_Mate score.
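The arithmetic behind that last statement, using the Good_Mate coefficient from the table above:

exp(0.1318371)          # odds ratio for Good_Mate, Get Phone Number
## [1] 1.140922
(1.1409224 - 1) * 100   # percent change in the odds per unit increase
## [1] 14.09224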
Confidence intervals
Exponentiating the 95% confidence intervals from confint() puts them on the odds ratio scale; intervals that do not contain 1 correspond to predictors that are statistically significant at the .05 level.
kable(exp(confint(chatModel2)))
| | 2.5 %.Get Phone Number | 97.5 %.Get Phone Number | 2.5 %.Go Home with Person | 97.5 %.Go Home with Person |
|---|---|---|---|---|
| (Intercept) | 0.0452391 | 0.6247861 | 0.0021735 | 0.0870550 |
| Good_Mate | 1.0268911 | 1.2676163 | 0.9668644 | 1.3413783 |
| Funny | 0.9263927 | 1.4265035 | 1.0755912 | 1.7577659 |
| GenderFemale | 0.0404856 | 0.9179735 | 0.0002665 | 0.0486899 |
| Sex | 1.1067021 | 1.5699287 | 1.1948375 | 1.9281625 |
| GenderFemale:Sex | 0.5735891 | 0.8686465 | 0.4506872 | 0.8552740 |
| Funny:GenderFemale | 1.2436734 | 2.1529247 | 2.1856409 | 4.7727039 |
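If you prefer the intervals alongside the coefficients in a single table, recent versions of broom can attach and exponentiate them directly; a sketch (argument support may vary by broom version):

kable(tidy(chatModel2, conf.int = TRUE, exponentiate = TRUE))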
Test for multicollinearity
In a similar vein to linear regression, one of the assumptions of multinomial logistic regression is that the predictor variables are not highly correlated with one another. One way to test this is to build a generalized linear model with the glm() function and then pass it to the vif() function from the car package, which computes the variance inflation factor for each of the predictor variables. VIF values of 10 or more are problematic and indicate correlated predictor variables. Our VIF values all indicate that the variables are not highly correlated. In addition, we can inspect the tolerance statistics by calculating 1 divided by the VIF: anything less than 0.2 is a potential problem, and anything less than 0.1 is a serious problem. Based on our output, we have no reason to believe that we have encountered multicollinearity.
# A logistic model (reference level vs. all other outcomes) fit only so that
# vif() can assess the predictors
chatModel <- glm(Success ~ Funny + Good_Mate + Sex + Gender,
                 data = chatData, family = binomial())
vif(chatModel)
## Funny Good_Mate Sex Gender
## 1.580214 1.018981 1.006221 1.558624
1/vif(chatModel)
## Funny Good_Mate Sex Gender
## 0.6328256 0.9813729 0.9938173 0.6415915
Correlations
An additional way of testing for multicollinearity is to calculate the pairwise Pearson correlations among the numeric predictor variables. Again, we see small correlations, between 0.03 and 0.16, which are not of concern here.
kable(cor(chatData[, c("Funny", "Good_Mate", "Sex")]))
| | Funny | Good_Mate | Sex |
|---|---|---|---|
| Funny | 1.0000000 | 0.1632098 | 0.1156084 |
| Good_Mate | 0.1632098 | 1.0000000 | 0.0379461 |
| Sex | 0.1156084 | 0.0379461 | 1.0000000 |
References
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. London: Sage.
Fox, John, Sanford Weisberg, and Brad Price. 2020. Car: Companion to Applied Regression. https://CRAN.R-project.org/package=car.
Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.