Displaying Race & Ethnicity Data
It is not uncommon for surveys to collect race and ethnicity data from participants. In some cases, these questions allow participants to select more than one response often under the instructions of “select all that apply.” This guide covers how to create a table and bar chart in R to display race and ethnicity responses where participants can select more than one option.
Packages
library(tidyverse)
library(gtsummary)
library(bstfun)
Create Tables
Prepare data
The first step is to create a data frame in a way that looks like the table below. In this format, each row represents a participant and each column represents one of the available responses coded as a binary variable where 1 indicates that the participant selected that category. Each participant can then endorse multiple categories.
American Indian or Alaskan Native | Asian | Black or African American | White | Hispanic or Latino | Another race | Prefer not to answer |
---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 |
Create a vector of the columns to be displayed
Once the data frame is prepared, create a character vector of the column names that contain the responses of interest. In the example data, these column names start with the string “demographic” and contain “race.” However, some of these column names end with “.factor” or contain the string “other” and need to be excluded. To accomplish this, I use the {dplyr} verb select()
nested within the names()
function. This vector will be used as an argument in a subsequent step.
race_vars <- names(data %>% select(starts_with("demographic") &
!ends_with(".factor") &
contains("race") &
!contains("other")))
## [1] "demographic_4_race___1" "demographic_4_race___2" "demographic_4_race___3"
## [4] "demographic_4_race___4" "demographic_4_race___5" "demographic_4_race___6"
## [7] "demographic_4_race___8"
Create a tbl_summary() table
To create the table, begin by piping the data to select()
and use the character vector of column names of the variables of interest. Next, pipe the output to the tbl_summary()
function from the {gtsummary} package. Finally, pipe the output table to the add_variable_grouping()
function of the {bstfun} package to indent all of the variables under a grouping named “Race/Ethnicity.” In the example data, all of the colum names are labelled. Thus, even though the columns are named something like “demographic_4_race___1”, the tbl_summary()
function will display the corresponding label “American Indian or Alaskan Native.”
data %>%
select(all_of(race_vars)) %>%
tbl_summary() %>%
add_variable_grouping("Race/Ethnicity" = race_vars)
Characteristic | N = 1501 |
---|---|
Race/Ethnicity | |
American Indian or Alaskan Native | 3 (2.0%) |
Asian | 9 (6.0%) |
Black or African American | 7 (4.7%) |
White | 92 (61%) |
Hispanic or Latino | 31 (21%) |
Another race | 2 (1.3%) |
Prefer not to answer | 10 (6.7%) |
1 n (%) |
Sort the order by frequency
So far, we have a decent looking table. However, we may want to sort the responses according to the frequency (%). Since our columns consist of binary responses, we can take the sums of each column, and then use the sums to determine the order. One way to accomplish this task, is by piping the columns of interest to the colSums()
function followed by the order()
function and assigning it to a vector named index. Next, use the index to order the columns of interest and assign it to a new vectore named ordered_race_vars. Finally, use the ordered_race_vars vector in the select()
and add_variable_grouping()
functions as above. All this does is essentially change the order in which the columns are selected and piped into the tbl_summary()
function.
index <- data %>%
select(all_of(race_vars)) %>%
colSums() %>%
order(decreasing = TRUE)
ordered_race_vars <- race_vars[index]
data %>%
select(all_of(ordered_race_vars)) %>%
tbl_summary() %>%
add_variable_grouping("Race/Ethnicity" = ordered_race_vars)
Characteristic | N = 1501 |
---|---|
Race/Ethnicity | |
White | 92 (61%) |
Hispanic or Latino | 31 (21%) |
Prefer not to answer | 10 (6.7%) |
Asian | 9 (6.0%) |
Black or African American | 7 (4.7%) |
American Indian or Alaskan Native | 3 (2.0%) |
Another race | 2 (1.3%) |
1 n (%) |
Create Bar Charts
Extract column labels
One way to visually display survey responses to race and ethnicity questions is to use the {ggplot} package. While tbl_summary()
will play nicely with columns that are labelled, the geom_*()
functions do not. Thus my strategy is to use the native column names to generate a bar chart, and then use the columns labels to label the plot. The rationale for this approach is avoid additional problems if your eventual column names contain spaces or special characters that tend to not play well with R column-naming conventions.
- One option to extract the labels in the example data is to use the
map_df()
function from the {purrr} package in combination with theattr()
function to extract the labels. The following code chunk essentially applies theattr()
function to all of the input columns.
# Alternative form using purrr
race_labels <- as.character(data %>%
select(all_of(race_vars)) %>%
map_df(., ~ attr(.x, "label")))
race_labels
## [1] "American Indian or Alaskan Native" "Asian"
## [3] "Black or African American" "White"
## [5] "Hispanic or Latino" "Another race"
## [7] "Prefer not to answer"
Prepare data
The next step is to use the race_vars and race_labels in a chain of commands that selects the columns, converts them to numeric, renames them, converts to long format, and then factors the categories. The reason for converting to numeric class here is because the labelled columns do not play well with {dplyr} functions. Converting to numeric essentially clears out any labelled meta data from the column and facilitates any subsequent {dplyr} verbs to be successfully applied to the data. The output shows an example of what the data look like at this step.
plot_data <- data %>%
select(all_of(race_vars)) %>%
mutate_at(all_of(race_vars), ~ as.numeric(.)) %>%
rename_at(all_of(race_vars), ~ race_labels) %>%
pivot_longer(all_of(race_labels),
names_to = "Category",
values_to = "Endorsed") %>%
mutate(Category = factor(Category, levels = race_labels))
## # A tibble: 1,050 × 2
## Category Endorsed
## <fct> <dbl>
## 1 American Indian or Alaskan Native 0
## 2 Asian 0
## 3 Black or African American 0
## 4 White 1
## 5 Hispanic or Latino 0
## 6 Another race 0
## 7 Prefer not to answer 0
## 8 American Indian or Alaskan Native 0
## 9 Asian 0
## 10 Black or African American 0
## # … with 1,040 more rows
- Building from here, the next step is to use the long format date to count the number of responses for each category, and then calculate the proportion of responses out of all participants. This can be accomplished with the
group_by()
,summarise()
, andmutate()
functions.
plot_data <- plot_data %>%
group_by(Category) %>%
summarise(Frequency = sum(Endorsed, na.rm = TRUE)) %>%
mutate(Proportion = Frequency/nrow(data))
## # A tibble: 7 × 3
## Category Frequency Proportion
## <fct> <dbl> <dbl>
## 1 American Indian or Alaskan Native 3 0.02
## 2 Asian 9 0.06
## 3 Black or African American 7 0.0467
## 4 White 92 0.613
## 5 Hispanic or Latino 31 0.207
## 6 Another race 2 0.0133
## 7 Prefer not to answer 10 0.0667
Plot the prepared data
At this point, we have the basic building blocks to plot the data.
plot_data %>%
ggplot(., aes(x = Category, y = Proportion)) +
geom_col()
From here, we can order the bars by Proportion by using the reorder()
function and add a color and fill and flip the coordinates to fix the overplotted x-axis tick labels.
plot_data %>%
ggplot(aes(x = reorder(Category, Proportion), y = Proportion)) +
geom_col(color = "#1565c0", fill = "#1565c0") +
coord_flip()
From here, we can apply some additional modifications by changing the theme and axis lines.
plot_data %>%
ggplot(aes(x = reorder(Category, Proportion), y = Proportion)) +
geom_col(color = "#1565c0", fill = "#1565c0") +
coord_flip() +
xlab("Category") +
theme_minimal() +
theme(axis.line = element_line(color = "grey70"))