Displaying Race & Ethnicity Data

It is not uncommon for surveys to collect race and ethnicity data from participants. In some cases, these questions allow participants to select more than one response often under the instructions of “select all that apply.” This guide covers how to create a table and bar chart in R to display race and ethnicity responses where participants can select more than one option.

Packages

library(tidyverse)
library(gtsummary)
library(bstfun)

Create Tables

Prepare data

The first step is to create a data frame in a way that looks like the table below. In this format, each row represents a participant and each column represents one of the available responses coded as a binary variable where 1 indicates that the participant selected that category. Each participant can then endorse multiple categories.

Table 1: Example data frame.
American Indian or Alaskan NativeAsianBlack or African AmericanWhiteHispanic or LatinoAnother racePrefer not to answer
0000100
0000100
0000100
0000100
0000100
0000100

Create a vector of the columns to be displayed

Once the data frame is prepared, create a character vector of the column names that contain the responses of interest. In the example data, these column names start with the string “demographic” and contain “race.” However, some of these column names end with “.factor” or contain the string “other” and need to be excluded. To accomplish this, I use the {dplyr} verb select() nested within the names() function. This vector will be used as an argument in a subsequent step.

race_vars <- names(data %>% select(starts_with("demographic") & 
                              !ends_with(".factor") & 
                              contains("race") &
                              !contains("other")))
## [1] "demographic_4_race___1" "demographic_4_race___2" "demographic_4_race___3"
## [4] "demographic_4_race___4" "demographic_4_race___5" "demographic_4_race___6"
## [7] "demographic_4_race___8"

Create a tbl_summary() table

To create the table, begin by piping the data to select() and use the character vector of column names of the variables of interest. Next, pipe the output to the tbl_summary() function from the {gtsummary} package. Finally, pipe the output table to the add_variable_grouping() function of the {bstfun} package to indent all of the variables under a grouping named “Race/Ethnicity.” In the example data, all of the colum names are labelled. Thus, even though the columns are named something like “demographic_4_race___1”, the tbl_summary() function will display the corresponding label “American Indian or Alaskan Native.”

data %>% 
  select(all_of(race_vars)) %>%
  tbl_summary() %>%
  add_variable_grouping("Race/Ethnicity" = race_vars)

CharacteristicN = 1501
Race/Ethnicity
    American Indian or Alaskan Native3 (2.0%)
    Asian9 (6.0%)
    Black or African American7 (4.7%)
    White92 (61%)
    Hispanic or Latino31 (21%)
    Another race2 (1.3%)
    Prefer not to answer10 (6.7%)
1 n (%)

Sort the order by frequency

So far, we have a decent looking table. However, we may want to sort the responses according to the frequency (%). Since our columns consist of binary responses, we can take the sums of each column, and then use the sums to determine the order. One way to accomplish this task, is by piping the columns of interest to the colSums() function followed by the order() function and assigning it to a vector named index. Next, use the index to order the columns of interest and assign it to a new vectore named ordered_race_vars. Finally, use the ordered_race_vars vector in the select() and add_variable_grouping() functions as above. All this does is essentially change the order in which the columns are selected and piped into the tbl_summary() function.

index <- data %>% 
  select(all_of(race_vars)) %>%
  colSums() %>%
  order(decreasing = TRUE)

ordered_race_vars <- race_vars[index]

data %>%
  select(all_of(ordered_race_vars)) %>% 
  tbl_summary() %>%
  add_variable_grouping("Race/Ethnicity" = ordered_race_vars)

CharacteristicN = 1501
Race/Ethnicity
    White92 (61%)
    Hispanic or Latino31 (21%)
    Prefer not to answer10 (6.7%)
    Asian9 (6.0%)
    Black or African American7 (4.7%)
    American Indian or Alaskan Native3 (2.0%)
    Another race2 (1.3%)
1 n (%)

Create Bar Charts

Extract column labels

One way to visually display survey responses to race and ethnicity questions is to use the {ggplot} package. While tbl_summary() will play nicely with columns that are labelled, the geom_*() functions do not. Thus my strategy is to use the native column names to generate a bar chart, and then use the columns labels to label the plot. The rationale for this approach is avoid additional problems if your eventual column names contain spaces or special characters that tend to not play well with R column-naming conventions.

  1. One option to extract the labels in the example data is to use the map_df() function from the {purrr} package in combination with the attr() function to extract the labels. The following code chunk essentially applies the attr() function to all of the input columns.
# Alternative form using purrr
race_labels <- as.character(data %>% 
                             select(all_of(race_vars)) %>% 
                             map_df(., ~ attr(.x, "label")))

race_labels
## [1] "American Indian or Alaskan Native" "Asian"                            
## [3] "Black or African American"         "White"                            
## [5] "Hispanic or Latino"                "Another race"                     
## [7] "Prefer not to answer"

Prepare data

The next step is to use the race_vars and race_labels in a chain of commands that selects the columns, converts them to numeric, renames them, converts to long format, and then factors the categories. The reason for converting to numeric class here is because the labelled columns do not play well with {dplyr} functions. Converting to numeric essentially clears out any labelled meta data from the column and facilitates any subsequent {dplyr} verbs to be successfully applied to the data. The output shows an example of what the data look like at this step.

plot_data <- data %>% 
  select(all_of(race_vars)) %>%
  mutate_at(all_of(race_vars), ~ as.numeric(.)) %>% 
  rename_at(all_of(race_vars), ~ race_labels) %>%
  pivot_longer(all_of(race_labels),
               names_to = "Category",
               values_to = "Endorsed") %>%
  mutate(Category = factor(Category, levels = race_labels))
## # A tibble: 1,050 × 2
##    Category                          Endorsed
##    <fct>                                <dbl>
##  1 American Indian or Alaskan Native        0
##  2 Asian                                    0
##  3 Black or African American                0
##  4 White                                    1
##  5 Hispanic or Latino                       0
##  6 Another race                             0
##  7 Prefer not to answer                     0
##  8 American Indian or Alaskan Native        0
##  9 Asian                                    0
## 10 Black or African American                0
## # … with 1,040 more rows
  1. Building from here, the next step is to use the long format date to count the number of responses for each category, and then calculate the proportion of responses out of all participants. This can be accomplished with the group_by(), summarise(), and mutate() functions.
plot_data <- plot_data %>%
  group_by(Category) %>%
  summarise(Frequency = sum(Endorsed, na.rm = TRUE)) %>%
  mutate(Proportion = Frequency/nrow(data))
## # A tibble: 7 × 3
##   Category                          Frequency Proportion
##   <fct>                                 <dbl>      <dbl>
## 1 American Indian or Alaskan Native         3     0.02  
## 2 Asian                                     9     0.06  
## 3 Black or African American                 7     0.0467
## 4 White                                    92     0.613 
## 5 Hispanic or Latino                       31     0.207 
## 6 Another race                              2     0.0133
## 7 Prefer not to answer                     10     0.0667

Plot the prepared data

At this point, we have the basic building blocks to plot the data.

plot_data %>%
  ggplot(., aes(x = Category, y = Proportion)) +
  geom_col()

From here, we can order the bars by Proportion by using the reorder() function and add a color and fill and flip the coordinates to fix the overplotted x-axis tick labels.

plot_data %>%
  ggplot(aes(x = reorder(Category, Proportion), y = Proportion)) +
  geom_col(color = "#1565c0", fill = "#1565c0") +
  coord_flip()

From here, we can apply some additional modifications by changing the theme and axis lines.

plot_data %>%
  ggplot(aes(x = reorder(Category, Proportion), y = Proportion)) +
  geom_col(color = "#1565c0", fill = "#1565c0") +
  coord_flip() +
  xlab("Category") +
  theme_minimal() +
  theme(axis.line = element_line(color = "grey70"))

Next