Multinomial Logistic Regression

Multinomial regression can be thought of as an extension of logistic regression. In logistic regression, the dependent or outcome variable can take on one of two categorical values, such as whether or not someone survived a stay at a hospital. In cases where the dependent variable falls into more than two categories, multinomial regression is one possible analysis option. For this guide, we will fit a multinomial regression using the nnet package. We will also use a sample data set in which researchers examined the effect of pickup lines on whether the person hearing the line 1) gave out a phone number, 2) went home with the person, or 3) walked away.

library(tidyverse)
library(nnet)
library(car)
library(kableExtra)
library(broom)

Load Data

chatData <- read.delim("Chat-Up Lines.dat", header = TRUE)

Inspect Data

Before building our multinomial logistic regression model, we need to check that our categorical variables, the outcome Success and the predictor Gender, are stored as factors. Simply pass each column to the is.factor() function to check.

kable(head(chatData))
Success              Funny  Sex  Good_Mate  Gender
Get Phone Number         3    7          6  Male
Go Home with Person      5    7          2  Male
Get Phone Number         4    6          6  Male
Go Home with Person      3    7          5  Male
Get Phone Number         5    1          6  Male
Get Phone Number         4    7          5  Male
is.factor(chatData$Success)
## [1] FALSE
is.factor(chatData$Gender)
## [1] FALSE
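
Checking columns one at a time works, but it can be more convenient to check the whole data frame in a single call. A minimal sketch (any base R approach will do; here sapply() applies is.factor() to every column):

# Apply is.factor() to every column; returns a named logical vector
sapply(chatData, is.factor)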

Prepare Data

Since both of our categorical variables are not factors, we use the as.factor() function to convert them and save the result back into the chatData data frame. Next, we will want to relevel our factors to set a reference level. The reference level is the category against which all other categories are compared. The relevel() function takes two inputs: the factor that needs to be releveled and the desired reference level. For this example we want to set “No response/Walk Off” as the reference for Success, and “Male” as the reference for Gender. One way to think about multinomial regression is to view it as a series of binary logistic regression models in which each remaining level of the outcome is compared with the reference level.

chatData$Gender <- as.factor(chatData$Gender)
chatData$Success <- as.factor(chatData$Success)

chatData$Success <- relevel(chatData$Success, "No response/Walk Off") # Set Reference level to no response...
chatData$Gender <- relevel(chatData$Gender, "Male") # Set Male to reference
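
To confirm that the releveling worked, we can print the factor levels; R lists the reference level first. A quick sanity check (output not shown):

# The reference level appears first in each case
levels(chatData$Success)
levels(chatData$Gender)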

Model the data

Once our data are prepared, we can move on to building a multinomial regression model. For this example, we want to know whether the variables Good_Mate, Funny, Gender, Sex, the interaction between Gender and Sex, and the interaction between Funny and Gender predict the outcome, Success.

chatModel2 <- multinom(Success ~ Good_Mate + Funny + Gender + Sex + Gender:Sex + Funny:Gender, data = chatData)
## # weights:  24 (14 variable)
## initial  value 1120.584534 
## iter  10 value 953.525508
## iter  20 value 868.736917
## final  value 868.736497 
## converged

The first chunk of output from the multinom() function displays a series of values from iterations of the model fitting. The key piece of information to check here is that the model converged. If your model fails to converge, consider the variables that have been entered into the model. Perhaps there are few (or no) observations for certain combinations of the categorical predictor variables; some predictors may need to be omitted for the model to converge.
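
One quick way to hunt for such sparse cells is to cross-tabulate the outcome against a categorical predictor; cells with counts at or near zero are candidate culprits. A minimal sketch using base R's xtabs():

# Cross-tabulate the outcome and the categorical predictor;
# very small cell counts can prevent convergence
xtabs(~ Success + Gender, data = chatData)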

Model summary

Passing our newly created model to the summary() function produces output for examining the model coefficients, their standard errors, and the overall fit of the model.

summary(chatModel2)
## Call:
## multinom(formula = Success ~ Good_Mate + Funny + Gender + Sex + 
##     Gender:Sex + Funny:Gender, data = chatData)
## 
## Coefficients:
##                     (Intercept) Good_Mate     Funny GenderFemale       Sex
## Get Phone Number      -1.783070 0.1318371 0.1393846    -1.646198 0.2762074
## Go Home with Person   -4.286323 0.1300003 0.3184571    -5.626258 0.4172888
##                     GenderFemale:Sex Funny:GenderFemale
## Get Phone Number          -0.3483306          0.4924484
## Go Home with Person       -0.4766576          1.1724111
## 
## Std. Errors:
##                     (Intercept)  Good_Mate     Funny GenderFemale        Sex
## Get Phone Number      0.6697696 0.05372607 0.1101253    0.7962448 0.08919698
## Go Home with Person   0.9413983 0.08352059 0.1253016    1.3285828 0.12208319
##                     GenderFemale:Sex Funny:GenderFemale
## Get Phone Number           0.1058751          0.1399918
## Go Home with Person        0.1634337          0.1992393
## 
## Residual Deviance: 1737.473 
## AIC: 1765.473
  • The interpretation of the output from the summary() of the multinom model is similar to that of logistic regression. We see the “Call:”, which is a reproduction of the multinom() call used to create the model.

  • Next, we see the “Coefficients:”, which quantify the relationship between the predictor variables and the outcome. Notice that only two outcome levels are displayed, “Get Phone Number” and “Go Home with Person.” This is because we set “No response/Walk Off” as the reference level. As an example, the coefficient for Good_Mate in the Get Phone Number row is 0.1318, which indicates that for each unit increase in Good_Mate we can expect a 0.1318 increase in the logit (i.e., the log of the odds) of “Get Phone Number” relative to “No response/Walk Off.” Similarly, the coefficient for Good_Mate in the Go Home with Person row is 0.13, which indicates that for each unit increase in Good_Mate we can expect a 0.13 increase in the logit of “Go Home with Person” relative to the reference.

  • The “Std. Errors:” section displays the standard errors of the coefficients. These are useful for calculating z statistics and p values: dividing each coefficient by its standard error yields a z statistic, from which a p value can be obtained (see the sketch after this list).

  • Finally, the output displays the “Residual Deviance:” and the “AIC:.” The residual deviance measures how well the fitted model reproduces the data, with lower values indicating better fit. The AIC, or Akaike Information Criterion, is a way to assess model fit that penalizes the model according to the number of parameters estimated. These two values are helpful for comparing different multinomial regression models.
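
Because multinom() does not report p values, they can be computed by hand as described in the list above. The sketch below forms Wald z statistics from the coefficient and standard error matrices returned by summary() and converts them to two-tailed p values using the standard normal distribution:

# Wald z statistics: coefficients divided by their standard errors
z <- summary(chatModel2)$coefficients / summary(chatModel2)$standard.errors

# Two-tailed p values from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
round(p, 4)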

A better way to present the output

The model summary output may look clunky on your screen, and it lacks information, specifically z statistics and p values, that could help us interpret the model. The missing information can be filled in with the tidy() function from the broom package. Using the tidy() function on our model returns a tibble that contains a z statistic and a p value for each coefficient.

tidymodel <- tidy(chatModel2)

# Exponentiate the log-odds coefficients to obtain odds ratios
tidymodel$expB <- exp(tidymodel$estimate)

kable(tidymodel)
y.level              term                  estimate  std.error  statistic    p.value       expB
Get Phone Number     (Intercept)         -1.7830702  0.6697696  -2.6622140  0.0077628  0.1681212
Get Phone Number     Good_Mate            0.1318371  0.0537261   2.4538750  0.0141326  1.1409224
Get Phone Number     Funny                0.1393846  0.1101253   1.2656910  0.2056239  1.1495661
Get Phone Number     GenderFemale        -1.6461979  0.7962448  -2.0674520  0.0386916  0.1927815
Get Phone Number     Sex                  0.2762074  0.0891970   3.0966000  0.0019575  1.3181212
Get Phone Number     GenderFemale:Sex    -0.3483306  0.1058751  -3.2900130  0.0010018  0.7058655
Get Phone Number     Funny:GenderFemale   0.4924484  0.1399918   3.5176940  0.0004353  1.6363176
Go Home with Person  (Intercept)         -4.2863226  0.9413983  -4.5531440  0.0000053  0.0137554
Go Home with Person  Good_Mate            0.1300003  0.0835206   1.5565070  0.1195877  1.1388288
Go Home with Person  Funny                0.3184571  0.1253016   2.5415250  0.0110370  1.3750046
Go Home with Person  GenderFemale        -5.6262579  1.3285828  -4.2347820  0.0000229  0.0036020
Go Home with Person  Sex                  0.4172888  0.1220832   3.4180690  0.0006307  1.5178408
Go Home with Person  GenderFemale:Sex    -0.4766576  0.1634337  -2.9165190  0.0035396  0.6208551
Go Home with Person  Funny:GenderFemale   1.1724111  0.1992393   5.8844350  0.0000000  3.2297704
  • For each coefficient other than the intercept, the value represents the change in the log odds of the outcome, relative to the reference level, for each unit increase in the predictor variable. Because log odds are not intuitive to interpret, they can be exponentiated. The exponentiated coefficients have a much more straightforward interpretation: they are odds ratios, which are read relative to one. Values greater than one indicate that as the predictor increases, the odds of the outcome occurring increase; values less than one indicate that as the predictor increases, the odds of the outcome occurring decrease. To drive this point home, the exponentiated coefficient for Get Phone Number and Good_Mate is 1.14. This value is greater than one, which indicates that as Good_Mate scores increase, the odds of getting a phone number increase relative to “No response/Walk Off.” To be more specific, the odds of getting a phone number increase by 14% for each unit increase in the Good_Mate score (a quick arithmetic check appears below).
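
The percent change in the odds for a one-unit increase in a predictor is (odds ratio - 1) * 100. A short sketch verifying the 14% figure from the Good_Mate coefficient:

# Odds ratio for Good_Mate in the Get Phone Number equation
or_good_mate <- exp(0.1318371)

# Percent change in the odds per unit increase in Good_Mate
(or_good_mate - 1) * 100  # approximately 14.1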

Confidence intervals

Exponentiating the confidence intervals puts them on the odds ratio scale. Intervals that do not contain one correspond to significant coefficients: when an interval lies entirely above or below one, we can be reasonably confident about the direction of the effect.

kable(exp(confint(chatModel2)))
                    Get Phone Number        Go Home with Person
                    2.5 %      97.5 %       2.5 %      97.5 %
(Intercept)         0.0452391  0.6247861    0.0021735  0.0870550
Good_Mate           1.0268911  1.2676163    0.9668644  1.3413783
Funny               0.9263927  1.4265035    1.0755912  1.7577659
GenderFemale        0.0404856  0.9179735    0.0002665  0.0486899
Sex                 1.1067021  1.5699287    1.1948375  1.9281625
GenderFemale:Sex    0.5735891  0.8686465    0.4506872  0.8552740
Funny:GenderFemale  1.2436734  2.1529247    2.1856409  4.7727039
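
If you prefer the odds ratios and their confidence intervals in a single tibble, tidy() can return them directly; this sketch assumes broom's conf.int and exponentiate arguments are available for multinom models:

# Odds ratios with 95% confidence intervals in one table
tidy(chatModel2, conf.int = TRUE, exponentiate = TRUE)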

Test for multicollinearity

In a similar vein to linear regression, one of the assumptions of multinomial logistic regression is that the predictor variables are not highly correlated. One way to test this is to build a generalized linear model with the glm() function and then use the vif() function from the car package, which computes the variance inflation factor for each predictor variable. VIF values of 10 or more are problematic and indicate correlated predictor variables. Our VIF values all indicate that the predictors are not highly correlated. In addition, we can inspect the tolerance statistics by calculating 1 divided by the VIF: anything less than 0.2 is a potential problem and anything less than 0.1 is a serious problem. Based on our output, we have no reason to believe that we have encountered multicollinearity.

# Binary logistic model fit only to obtain VIF values for the predictors
chatModel <- glm(Success ~ Funny + Good_Mate + Sex + Gender, data = chatData, family = binomial())
vif(chatModel)
##     Funny Good_Mate       Sex    Gender 
##  1.580214  1.018981  1.006221  1.558624
1/vif(chatModel)
##     Funny Good_Mate       Sex    Gender 
## 0.6328256 0.9813729 0.9938173 0.6415915

Correlations

An additional way of testing for multicollinearity is to calculate the Pearson pairwise correlations for the predictor variables. Again, we see small correlations, between 0.03 and 0.16, which are not of concern here.

kable(cor(chatData[, c("Funny", "Good_Mate", "Sex")]))
           Funny      Good_Mate  Sex
Funny      1.0000000  0.1632098  0.1156084
Good_Mate  0.1632098  1.0000000  0.0379461
Sex        0.1156084  0.0379461  1.0000000

References

Field, Andy, Jeremy Miles, and Zoe Field. n.d. Discovering Statistics Using R. Sage.

Fox, John, Sanford Weisberg, and Brad Price. 2020. Car: Companion to Applied Regression. https://CRAN.R-project.org/package=car.

Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
