Histograms in ggplot2
Histograms are a great way to display the frequency of each value of a categorical or discrete variable. They aid understanding of a variable's distribution and make visual comparisons across groups easier. In this guide, I focus on using the ggplot2 package to make different types of histograms. Because the variable plotted here, cyl, takes only three distinct values, every plot is drawn with geom_bar(), which counts the observations at each value. The following snippets use the built-in mtcars data set to demonstrate.
library(tidyverse)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Traditional histogram
mtcars %>%
ggplot(., aes(x = cyl)) +
geom_bar() +
theme_minimal()
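Because cyl is stored as a number, ggplot2 treats the x-axis as continuous, so the axis may show breaks at values such as 5 and 7 that never occur in the data. The following is a minimal sketch, assuming you would rather see only the observed values on the axis: it converts cyl to a factor before plotting. The count() call is a quick check on the bar heights. The names cyl_counts and p_cyl are mine, chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Frequency table: the bar heights should match these counts
cyl_counts <- mtcars %>% count(cyl)
cyl_counts

# Treat cyl as a factor so the x-axis shows only 4, 6, and 8
p_cyl <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%
  ggplot(aes(x = cyl)) +
  geom_bar() +
  theme_minimal()
p_cyl
```

The same mutate() step can be added to any of the later plots if the continuous axis is distracting.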
Paneled histogram
Here we make separate panels to show the histogram by am (transmission type). According to the mtcars documentation (?mtcars), am is coded 0 for an automatic transmission and 1 for a manual transmission.
mtcars %>%
ggplot(., aes(x = cyl)) +
geom_bar() +
facet_wrap(~am) +
theme_minimal()
Modify the panel labels
To modify the labels, we can convert the variable am to a factor in which 0 is labelled Automatic and 1 is labelled Manual (the coding given in ?mtcars). Listing the levels in the order 0, 1 keeps the panels in the same order as the previous plot.
mtcars %>%
mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
ggplot(., aes(x = cyl)) +
geom_bar() +
facet_wrap(~am) +
theme_minimal()
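If you would rather leave the data untouched, the panel labels can also be changed at plotting time. This is only a sketch of one alternative: it passes a named lookup vector to ggplot2's labeller() function inside facet_wrap(). The vector name am_labels is my own, and the 0 = automatic, 1 = manual coding is taken from ?mtcars.

```r
library(ggplot2)

# Lookup from the raw am codes to display labels (coding taken from ?mtcars)
am_labels <- c("0" = "Automatic", "1" = "Manual")

p_facets <- ggplot(mtcars, aes(x = cyl)) +
  geom_bar() +
  # labeller() replaces each facet value with its entry in am_labels
  facet_wrap(~am, labeller = labeller(am = am_labels)) +
  theme_minimal()
p_facets
```

Recoding the factor is usually the better choice when the labels are reused in legends or tables; the labeller approach only affects the facet strips.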
Clustered histogram
In this clustered histogram, separate bars are displayed at each value of cyl, one for automatic and another for manual. The position argument is needed to display the bars side by side; otherwise the bars are stacked. Setting preserve = "single" in position_dodge2() keeps every bar the same width, even at a value of cyl where one group has no observations.
colors <- c("#440154FF", "#1565c0")
mtcars %>%
mutate(am = factor(am)) %>%
ggplot(., aes(x = cyl)) +
geom_bar(aes(color = am, fill = am), position = position_dodge2(preserve = "single")) +
theme_minimal() +
scale_color_manual(values = colors) +
scale_fill_manual(values = colors)
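For comparison, here is a sketch of what happens when the position argument is left out: geom_bar() falls back to its default, position = "stack", and the two am groups are piled on top of each other at each value of cyl. The names fill_colors and p_stacked are mine; the colors are the same two used above.

```r
library(dplyr)
library(ggplot2)

fill_colors <- c("#440154FF", "#1565c0")

p_stacked <- mtcars %>%
  mutate(am = factor(am)) %>%
  ggplot(aes(x = cyl, fill = am)) +
  geom_bar() +  # no position argument, so the bars are stacked
  theme_minimal() +
  scale_fill_manual(values = fill_colors)
p_stacked
```

The stacked version makes the total count at each value of cyl easy to read, while the dodged version makes it easier to compare the two groups directly.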
Clustered histogram with percentages
Denominator is the grouping variable
There may be situations where a percentage instead of a count facilitates comparisons across groups. In this version, the denominator for each percentage is the total number of observations within each level of am. To produce such a plot, we first group the data by am and cyl, count the observations in each combination, and then divide each count by the total for its am group. We can see that 4-cylinder vehicles make up about 62% of the vehicles with a manual transmission (am = 1), compared with about 16% of the vehicles with an automatic transmission (am = 0).
mtcars %>%
mutate(am = factor(am)) %>%
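group_by(am, cyl) %>%
# Count each am/cyl combination; the result stays grouped by am
summarise(n = n(), .groups = "drop_last") %>%
# Divide by the am total, so the proportions sum to 1 within each am group
mutate(freq = n / sum(n)) %>%
ggplot(., aes(x = cyl, y = freq, fill = am)) +
geom_col(position = position_dodge2(preserve = "single")) +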
scale_y_continuous(labels = scales::label_percent()) +
theme_minimal() +
scale_fill_manual(values = colors) +
ylab("Percentage within each am group")
To confirm that the plotted values are correct, a quick table from gtsummary::tbl_summary() shows the actual percentages.
mtcars %>%
mutate(am = factor(am)) %>%
select(cyl, am) %>%
gtsummary::tbl_summary(by = am, digits = list(everything() ~ c(0,2)))
Characteristic | 0, N = 19¹ | 1, N = 13¹ |
---|---|---|
cyl | | |
4 | 3 (15.79%) | 8 (61.54%) |
6 | 4 (21.05%) | 3 (23.08%) |
8 | 12 (63.16%) | 2 (15.38%) |
¹ n (%)
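The same within-group percentages can be cross-checked without any extra packages. The sketch below uses base R's table() and prop.table(); the object names counts and within_am are mine. It assumes, as in the table above, that the columns are the two levels of am.

```r
# Cross-tabulate cyl (rows) by am (columns)
counts <- table(cyl = mtcars$cyl, am = mtcars$am)

# margin = 2 divides each cell by its column total,
# i.e. by the number of observations in that am group
within_am <- prop.table(counts, margin = 2)
round(100 * within_am, 2)
```

Each column of the result should sum to 100, which is a quick way to confirm that the denominator is the am group and not the whole sample.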
Denominator is the sample size
The proportion of all observations can also be displayed. In this plot we see that about 25% of all observations are 4-cylinder vehicles with a manual transmission (am = 1).
mtcars %>%
mutate(am = factor(am)) %>%
ggplot(., aes(cyl, fill = am)) +
geom_bar(aes(y = after_stat(count / sum(count))),
position = position_dodge2(preserve = "single")) +
scale_y_continuous(labels = scales::label_percent()) +
theme_minimal() +
scale_fill_manual(values = colors) +
ylab("Percent of all observations")
To double check the values, we can display a variable in which cyl has been crossed with am.
mtcars %>%
mutate(am = factor(am, levels = c(0, 1), labels = c("Automatic", "Manual"))) %>%
mutate(var = str_c(cyl, am)) %>%
select(var) %>%
gtsummary::tbl_summary()
Characteristic | N = 32¹ |
---|---|
var | |
4Automatic | 3 (9.4%) |
4Manual | 8 (25%) |
6Automatic | 4 (13%) |
6Manual | 3 (9.4%) |
8Automatic | 12 (38%) |
8Manual | 2 (6.3%) |
¹ n (%)
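As a package-free check of the whole-sample percentages, prop.table() can be called without a margin argument, in which case every cell is divided by the grand total. This is a sketch only; the names counts and overall are mine.

```r
# Cross-tabulate cyl (rows) by am (columns)
counts <- table(cyl = mtcars$cyl, am = mtcars$am)

# With no margin, each cell is divided by the total number of observations
overall <- prop.table(counts)
round(100 * overall, 1)
```

Here all six cells together sum to 100, in contrast to the within-group version, where each am group sums to 100 on its own.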