Histograms in ggplot2

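Histograms are a great way to display the frequency of categorical variables. They help readers understand the distribution of a variable and make visual comparisons across groups. In this guide, I focus on using the ggplot2 package to make different types of histograms. Strictly speaking, a plot of counts for a discrete variable drawn with geom_bar() is a bar chart, while geom_histogram() bins a continuous variable; this guide uses "histogram" in the looser sense. The following snippets use the built-in mtcars data set to demonstrate.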

library(tidyverse)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Traditional histogram

mtcars %>%
  ggplot(., aes(x = cyl)) +
  geom_bar() +
  theme_minimal()

Paneled histogram

Here we make separate panels to show the histogram by am (transmission type). In mtcars, 0 denotes an automatic transmission and 1 denotes a manual transmission (see ?mtcars).

mtcars %>%
  ggplot(., aes(x = cyl)) +
  geom_bar() +
  facet_wrap(~am) +
  theme_minimal()

Modify the panel labels

To modify the labels, we can convert the variable am to a factor in which 0 is labeled Automatic and 1 is labeled Manual, keeping the levels in the same order as the panels in the previous plot.

mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
  ggplot(., aes(x = cyl)) +
  geom_bar() +
  facet_wrap(~am) +
  theme_minimal()
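
An alternative that leaves the data untouched is to relabel the panels at the facet level. The sketch below assumes the coding documented in ?mtcars (0 = automatic, 1 = manual); the name am_labels is only illustrative.

```r
library(ggplot2)

# Assumption: am is coded 0 = automatic, 1 = manual, as documented in ?mtcars.
# The vector names must match the values of am; the elements are the strip labels.
am_labels <- c(`0` = "Automatic", `1` = "Manual")

p <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar() +
  # as_labeller() turns the named vector into a lookup for the panel strips
  facet_wrap(~am, labeller = as_labeller(am_labels)) +
  theme_minimal()

p
```

Recoding with factor() changes the labels everywhere the variable is used (legends, tables), whereas a labeller only affects the panel strips, so the choice depends on how far the new labels should reach.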

Clustered histogram

In this clustered histogram, separate bars are displayed at each value of cyl, one for each transmission type. The position argument is needed to display the bars side by side; otherwise the bars are stacked. Using position_dodge2(preserve = "single") keeps every bar the same width even when one group has no observations at a given value.

colors <- c("#440154FF", "#1565c0")
mtcars %>%
  mutate(am = factor(am)) %>%
  ggplot(., aes(x = cyl)) +
  geom_bar(aes(color = am, fill = am), position = position_dodge2(preserve = "single")) +
  theme_minimal() +
  scale_color_manual(values = colors) +
  scale_fill_manual(values = colors)
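
To see what the position argument changes, one option is to inspect the rectangles that ggplot2 computes for each bar. This is a minimal sketch that assumes only ggplot2 is attached; the object names cars, stacked, and dodged are illustrative.

```r
library(ggplot2)

cars <- transform(mtcars, am = factor(am))

# Default position for geom_bar() is "stack"
stacked <- ggplot(cars, aes(x = cyl, fill = am)) +
  geom_bar()

# Explicit dodging places the bars for each am level side by side
dodged <- ggplot(cars, aes(x = cyl, fill = am)) +
  geom_bar(position = position_dodge2(preserve = "single"))

# layer_data() returns the computed coordinates of every bar.
# Stacked bars share xmin/xmax and start where the previous bar ends;
# dodged bars all start at ymin = 0 and occupy different x ranges.
layer_data(stacked)[, c("xmin", "xmax", "ymin", "ymax")]
layer_data(dodged)[, c("xmin", "xmax", "ymin", "ymax")]
```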

Clustered histogram with percentages

Denominator is the grouping variable

There may be situations where a percentage instead of a count facilitates comparisons across groups. In this version, the denominator for each percentage is the total number of observations within each level of am. To produce such a plot, we first group the data by am and cyl, and then calculate frequencies and proportions. We can see that 4-cylinder vehicles make up about 62% of the manual-transmission cars (am = 1), compared with about 16% of the automatic-transmission cars (am = 0).

mtcars %>%
  mutate(am = factor(am)) %>%
  group_by(am, cyl) %>%
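  # count each am/cyl cell; the result stays grouped by am
  summarise(n = n(), .groups = "drop_last") %>%
  # so sum(n) is the total within each am group
  mutate(freq = n / sum(n)) %>%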
  ggplot(., aes(x = cyl, y = freq, fill = am)) +
  geom_bar(stat = "identity", position = position_dodge2(preserve = "single")) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  scale_fill_manual(values = colors) +
  ylab("Percentage within each am group") 

To check that the appropriate values are displayed, a quick table from gtsummary::tbl_summary() shows the actual percentages.

mtcars %>%
  mutate(am = factor(am)) %>%
  select(cyl, am) %>%
  gtsummary::tbl_summary(by = am, digits = list(everything() ~ c(0,2))) 
Characteristic    0, N = 19¹     1, N = 13¹
cyl
    4             3 (15.79%)     8 (61.54%)
    6             4 (21.05%)     3 (23.08%)
    8             12 (63.16%)    2 (15.38%)
¹ n (%)
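
The same within-group percentages can be cross-checked in base R without any extra packages. This is only a sketch of one way to do it; the names tab and within_am are illustrative.

```r
# Cross-tabulate cylinders (rows) by transmission (columns)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# margin = 2 divides each cell by its column total,
# i.e. by the number of cars in each am group
within_am <- prop.table(tab, margin = 2)

# Express as percentages with two decimals, like the table above
round(100 * within_am, 2)
```

Each column sums to 100%, which confirms that the denominator is the size of the am group rather than the whole sample.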

Denominator is the sample size

The proportion of all observations can also be displayed. In this plot we see that about 25% of all observations are 4-cylinder vehicles with a manual transmission (am = 1).

mtcars %>%
  mutate(am = factor(am)) %>%
  ggplot(., aes(cyl, fill = am)) +
  geom_bar(aes(y = after_stat(count / sum(count))),
           position = position_dodge2(preserve = "single")) +
  scale_y_continuous(labels = scales::label_percent()) +
  theme_minimal() +
  scale_fill_manual(values = colors) +
  ylab("Percent of all observations")

To double check the values, we can display a variable in which cyl has been crossed with am.

mtcars %>%
  mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
  mutate(var = str_c(cyl, am)) %>%
  select(var) %>%
  gtsummary::tbl_summary()
Characteristic    N = 32¹
var
    4Automatic    3 (9.4%)
    4Manual       8 (25%)
    6Automatic    4 (13%)
    6Manual       3 (9.4%)
    8Automatic    12 (38%)
    8Manual       2 (6.3%)
¹ n (%)
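
As a base R cross-check of the overall percentages, the cell proportions can be computed directly from a two-way table. This is a sketch under the coding documented in ?mtcars (0 = automatic, 1 = manual); the names tab and overall are illustrative.

```r
# Cross-tabulate cylinders (rows) by transmission (columns)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)

# With no margin argument, prop.table() divides every cell
# by the grand total (all 32 cars)
overall <- prop.table(tab)

# Percentages of the whole sample, one decimal place
round(100 * overall, 1)
```

The cell for cyl = 4 and am = 1 is 8 of 32 cars, or 25%, and all six cells together sum to 100%.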