Correlation in R Pt. 1 - Pearson Product Moment Correlations
In this guide, I will walk through how to use the rstatix package to compute Pearson product-moment correlations in R. The Pearson correlation measures the strength and direction of the linear relationship between two continuous variables. I will use a sample data file from the 1st edition of “Discovering Statistics Using R” by Field, Miles, and Field.¹ The sample data contain scores from a hypothetical anxiety measure, exam scores, and the number of hours spent studying (revising) before the exam. We will run a correlation analysis to characterize the relationships between anxiety and two other variables: exam performance and hours spent studying.
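Before turning to rstatix, here is a minimal sketch of what the Pearson coefficient actually is: the covariance of the two variables divided by the product of their standard deviations. The numbers below are made up for illustration and are not taken from the Exam Anxiety file; the variable names are my own.

```r
# Made-up illustrative data (NOT the Exam Anxiety file)
hours <- c(2, 5, 8, 12, 15, 20)
score <- c(45, 50, 62, 60, 75, 82)

# Pearson r: covariance scaled by the product of the standard deviations
r_manual <- cov(hours, score) / (sd(hours) * sd(score))

# Base R's built-in equivalent
r_builtin <- cor(hours, score, method = "pearson")

# The two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```

Because the covariance is rescaled by both standard deviations, r always falls between -1 and 1 and does not depend on the units of either variable.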
Load packages²
library(tidyverse) # for data importing and visualization
library(rstatix) # for performing statistics
library(kableExtra) # for displaying tables
Load data
data <- read.table(file = "Exam Anxiety.dat", header = TRUE)
kable(head(data))
Code | Revise | Exam | Anxiety | Gender |
---|---|---|---|---|
1 | 4 | 40 | 86.298 | Male |
2 | 11 | 65 | 88.716 | Female |
3 | 27 | 80 | 70.178 | Male |
4 | 53 | 80 | 61.312 | Male |
5 | 4 | 40 | 89.522 | Male |
6 | 22 | 70 | 60.506 | Female |
Visualize the data
A scatterplot of the exam and anxiety scores suggests a negative relationship between these two variables: as pre-test anxiety goes up, exam performance tends to decline.
ggplot(data, aes(x = Anxiety, y = Exam)) +
  geom_point(alpha = 0.7, color = "#1565c0") +
  geom_smooth(method = "lm", se = FALSE, color = "#1565c0") +
  theme_minimal() +
  theme(axis.line = element_line(color = "grey70"))
When we plot anxiety against the number of hours spent studying, we see an even stronger negative relationship.
ggplot(data, aes(x = Anxiety, y = Revise)) +
  geom_point(alpha = 0.7, color = "#1565c0") +
  geom_smooth(method = "lm", se = FALSE, color = "#1565c0") +
  theme_minimal() +
  theme(axis.line = element_line(color = "grey70"))
Correlations with rstatix
To test whether correlations are statistically significant, I like to use the rstatix package and its cor_test() function. cor_test() can test several variables at once through the vars and vars2 arguments: each column listed in vars is tested against every column listed in vars2, so you can simply add more column names as your analysis requires. It supports the Pearson product-moment correlation as well as two non-parametric alternatives, Spearman’s rank correlation and Kendall’s tau. The function also takes a use = "pairwise.complete.obs" argument, which computes each correlation from all rows in which both variables of that pair are observed. This is useful when you have missing data. The output contains a correlation value, a test statistic, a p value, and a confidence interval for each combination of variables in the vars and vars2 arguments.
correlations <- cor_test(data,
                         vars = c("Anxiety"),
                         vars2 = c("Exam", "Revise"),
                         method = "pearson",
                         use = "pairwise.complete.obs")
kable(correlations)
var1 | var2 | cor | statistic | p | conf.low | conf.high | method |
---|---|---|---|---|---|---|---|
Anxiety | Exam | -0.44 | -4.938026 | 3.1e-06 | -0.5846244 | -0.2705591 | Pearson |
Anxiety | Revise | -0.71 | -10.111055 | 0.0e+00 | -0.7938168 | -0.5977733 | Pearson |
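As far as I know, cor_test() is a wrapper around base R's cor.test(), so the statistic, p, conf.low, and conf.high columns can be reproduced by hand. The sketch below does this on simulated data (not the Exam Anxiety file), assuming a 95% confidence level; all variable names are made up for illustration.

```r
# Simulated data for illustration only (NOT the Exam Anxiety file)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- -0.5 * x + rnorm(n)

r <- cor(x, y)

# Test statistic: t with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Compare against base R's cor.test()
ref <- cor.test(x, y, method = "pearson")
all.equal(unname(ref$statistic), t_stat)
all.equal(ref$p.value, p_val)
all.equal(as.numeric(ref$conf.int), ci)
```

The interval is built on the z scale and transformed back, which is why it is not symmetric around the correlation value in the table.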
To produce a correlation matrix, use the cor_mat() function: pass in a vector of variable names and specify the method to be used. The cor_pmat() function can be used in the same way to obtain the matching matrix of p-values.
matrix <- cor_mat(data,
                  vars = c("Exam", "Anxiety", "Revise"),
                  method = "pearson")
kable(matrix)
rowname | Exam | Anxiety | Revise |
---|---|---|---|
Exam | 1.00 | -0.44 | 0.40 |
Anxiety | -0.44 | 1.00 | -0.71 |
Revise | 0.40 | -0.71 | 1.00 |
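When a data set has missing values, the cells of a correlation matrix can be based on different numbers of rows under pairwise deletion. The sketch below illustrates the idea with base R's cor() rather than rstatix, on a small made-up data frame (the column names a, b, and c are hypothetical).

```r
# Made-up data with missing values (NOT the Exam Anxiety file)
scores <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, NA),
  c = c(6, 5, NA, 3, 2, 1)
)

# Listwise deletion: drop every row that has an NA in any column
cor_listwise <- cor(scores, use = "complete.obs")

# Pairwise deletion: each pair uses all rows where both columns are observed
cor_pairwise <- cor(scores, use = "pairwise.complete.obs")

# Number of rows contributing to each pair under pairwise deletion
observed <- ifelse(is.na(as.matrix(scores)), 0, 1)
n_pairs  <- crossprod(observed)

cor_listwise
cor_pairwise
n_pairs
```

Pairwise deletion keeps more data, but it also means the correlations in one matrix may not all describe the same set of cases, which is worth keeping in mind when comparing them.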
pvals <- cor_pmat(data,
                  vars = c("Exam", "Anxiety", "Revise"),
                  method = "pearson")
kable(pvals)
rowname | Exam | Anxiety | Revise |
---|---|---|---|
Exam | 0.00e+00 | 3.1e-06 | 3.34e-05 |
Anxiety | 3.10e-06 | 0.0e+00 | 0.00e+00 |
Revise | 3.34e-05 | 0.0e+00 | 0.00e+00 |
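A p-value printed as 0.0e+00 for two different variables is most likely a display artifact rather than an exact zero. My understanding is that kable() rounds numeric columns to a fixed number of decimal places (its digits argument, which normally defaults to 7), so very small p-values are flattened to zero. The sketch below mimics this in base R with a hypothetical p-value that I chose for illustration, and shows format.pval() as one way to report such values.

```r
# Hypothetical p-values for illustration only
p_moderate <- 3.1e-06
p_small    <- 4.2e-17

# Rounding to 7 decimal places keeps the first value but flattens the second
rounded <- round(c(p_moderate, p_small), 7)
rounded

# format.pval() reports anything below eps as an inequality instead of zero
p_labels <- format.pval(c(p_moderate, p_small), digits = 2, eps = 1e-10)
p_labels
```

When writing up results, a very small p-value is usually reported as an inequality such as p < .001 rather than as zero.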
Coefficients of Determination
The coefficient of determination, or R squared value, is one way to help interpret correlation values. To calculate it, we simply square the correlation. Multiplying the result by 100 gives the percentage of variance in one variable that can be accounted for by the other. To do this in R, we can use the mutate() function from dplyr, which is loaded as part of the tidyverse. The call below adds two new columns, the coefficient of determination (cod) and the percentage (percnt), to the correlations data frame.
# Square the R values and convert to percent
kable(correlations %>% mutate(cod = cor^2, percnt = cor^2*100))
var1 | var2 | cor | statistic | p | conf.low | conf.high | method | cod | percnt |
---|---|---|---|---|---|---|---|---|---|
Anxiety | Exam | -0.44 | -4.938026 | 3.1e-06 | -0.5846244 | -0.2705591 | Pearson | 0.1936 | 19.36 |
Anxiety | Revise | -0.71 | -10.111055 | 0.0e+00 | -0.7938168 | -0.5977733 | Pearson | 0.5041 | 50.41 |
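One way to see why the squared correlation is read as "variance accounted for" is that, with a single predictor, it equals the R squared of a simple linear regression. The sketch below checks this on simulated data (not the Exam Anxiety file); the variable names simply echo the example and the numbers are made up.

```r
# Simulated data for illustration only (NOT the Exam Anxiety file)
set.seed(42)
anxiety <- rnorm(40, mean = 70, sd = 10)
exam    <- 100 - 0.6 * anxiety + rnorm(40, sd = 8)

# Squared Pearson correlation
r <- cor(anxiety, exam)

# R squared from a simple linear regression of exam on anxiety
r_squared <- summary(lm(exam ~ anxiety))$r.squared

# The same value results if the roles of the two variables are swapped
r_squared_swapped <- summary(lm(anxiety ~ exam))$r.squared

all.equal(r^2, r_squared)
all.equal(r^2, r_squared_swapped)
```

Because the value is the same in both directions, the coefficient of determination describes shared variance and says nothing about which variable influences the other.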
Interpretation
The main goal of this guide was to showcase the cor_test() function from the rstatix package for computing Pearson product-moment correlations. We found a negative relationship between pre-test anxiety and exam performance (r = -0.44), and a stronger negative relationship between pre-test anxiety and the number of hours spent studying (r = -0.71). Both relationships are statistically significant, but it is also important to pay attention to the confidence intervals. Neither confidence interval includes zero, which suggests that both correlations are likely to be negative in the population as well. Pre-test anxiety accounts for about 19% of the variability in exam scores and shares about 50% of its variability with the number of hours spent studying. In this example, we did not interpret the relationship between the number of hours spent studying and exam scores (r = 0.40 in the correlation matrix), which may help explain some of the variance in exam scores that is still unaccounted for. We also did not examine how these relationships may differ by gender, which we cover in Pt. 2.
References
Kassambara, Alboukadel. 2020. Rstatix: Pipe-Friendly Framework for Basic Statistical Tests. https://CRAN.R-project.org/package=rstatix.
Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Zhu, Hao. 2021. KableExtra: Construct Complex Table with Kable and Pipe Syntax. https://CRAN.R-project.org/package=kableExtra.
Field, Andy, Jeremy Miles, and Zoe Field. 2012. Discovering Statistics Using R. Sage.
1. DSUR is an excellent introductory resource for learning more about the theory, background, and execution of several statistical analyses in R, including correlation, regression, t-tests, and analysis of variance. At the time of this writing, the second edition is slated for release in 2022 and should bring some welcome updates covering new R syntax, packages, and functions. I am definitely looking forward to getting a copy for myself when it is released.
2. I use the kableExtra package to print better-looking tables; it is not necessary for carrying out any of the analyses.