Converting wide to long data in R
Structured datasets tend to come in one of two formats, wide or long. In the wide format, each row contains data related to a particular subject or event. Each values in each column usually represent a variable taken from that participant. On the other hand, long data format is when related values of an outcome variable are placed in a column, while a different column contains values regarding the condition under which the outcome variables were gathered. Some R functions work best with wide data, while others work best with long data. As a result, it is important to know how to reshape data and go between the two types. This guide explains how to convert a wide format dataset to long format using tidyverse. The data are from the AMCP package and were created to perform a two-way repeated measures ANOVA.
library(tidyverse)
library(AMCP)
library(kableExtra)
# Load the data
data(chapter_12_table_1)
Wide Data
In the wide format data, each row represents a subject. Each column is a combination of two main factors Condition and Angle. Condition is a two level factor, while Angle is a three level factor. Thus there are six columns for all possible combinations of Condition and Angle. Each cell along the rows represent a unique dependent variable taken from the same subject. In this case, the numerical values are reaction times in milliseconds.
Absent0 | Absent4 | Absent8 | Present0 | Present4 | Present8 |
---|---|---|---|---|---|
420 | 420 | 480 | 480 | 600 | 780 |
420 | 480 | 480 | 360 | 480 | 600 |
480 | 480 | 540 | 660 | 780 | 780 |
420 | 540 | 540 | 480 | 780 | 900 |
540 | 660 | 540 | 480 | 660 | 720 |
360 | 420 | 360 | 360 | 480 | 540 |
480 | 480 | 600 | 540 | 720 | 840 |
480 | 600 | 660 | 540 | 720 | 900 |
540 | 600 | 540 | 480 | 720 | 780 |
480 | 420 | 540 | 540 | 660 | 780 |
Long Data
In the long data format, each row represents an observation. Instead of having six columns with all of the dependent variables in one row, we want all of our dependent variables in one column. In addition, we will want additional columns that will contain the information that was specified by the column names. Notice that the values in 1st row of the wide data are now the first six values in the new column “Reaction_Time.” One way to think about creating the “Reaction_Time” column is that we are going to take the values of each row, transpose them to column, and then begin to concatenate or stack all of the values. At the same time, we will want to keep information about Condition wether noise was present or absent, and the angle at wich a stimulus was presented so will need to create two additional columns to contain that information.Condition | Angle | Reaction_Time |
---|---|---|
Absent | 0 | 420 |
Absent | 4 | 420 |
Absent | 8 | 480 |
Present | 0 | 480 |
Present | 4 | 600 |
Present | 8 | 780 |
Converting Wide to Long
1. Create a subject id
To make this transformation we will use the tidyverse functions. First, we will want to keep track of our 10 subjects so we will create a column “id” with an integer to identify them, then bind that column to the wide data to create a new dataframe.
#1. Create a new data frame with a subject id
long.data <- cbind(id = c(1:10), chapter_12_table_1)
2. Gather dependent variables
Next, we will use the gather()
function which will stack each column of the wide data under a column called reaction_time and create a new column that will add the column name of the wide data as a value in a new column called condition.angle. The -id
is used to prevent getting the id number into the condition.angle column.
long.data %>% gather(key = condition.angle,
value = reaction_time, -id)
id | condition.angle | reaction_time |
---|---|---|
1 | Absent0 | 420 |
2 | Absent0 | 420 |
3 | Absent0 | 480 |
4 | Absent0 | 420 |
5 | Absent0 | 540 |
6 | Absent0 | 360 |
3. Separate the condition.angle column
So far we’re on a good track, but the column condition.angle needs to be separated into two columns. One to specify condition, the other to specify angle. The separate()
function can help us out here by splitting the condition.angle column into two columns, condition
and angle
. The sep = -1
argument tells the separate()
function to split the values 1 character from the end. This separates the last character which contains a single digit angle number into a separate column. Try substituting different numbers such as -2 or 1, to get a feel for what the function is doing here.
long.data %>%
gather(key = condition.angle,
value = reaction_time,
-id) %>%
separate(col = condition.angle,
into = c("condition", "angle"),
sep = -1)
id | condition | angle | reaction_time |
---|---|---|---|
1 | Absent | 0 | 420 |
2 | Absent | 0 | 420 |
3 | Absent | 0 | 480 |
4 | Absent | 0 | 420 |
5 | Absent | 0 | 540 |
6 | Absent | 0 | 360 |
4. Sort the data
At this point, we’ve done most of what is needed to convert the wide data to long data. The next step in the series of tidyverse functions is to sort the data. To get the output displayed above, we will want to sort the data by subject id, then by condition, then by angle. To accomplish this we will use the arrange()
function and will display the reaction_time values as a column of the transposed rows from the wide data set. We will re-save this manipulation as long.data. Notice how the number of rows is now 60 whereas we started with 10.
long.data <- long.data %>%
gather(key = condition.angle,
value = reaction_time,
-id) %>%
separate(col = condition.angle,
into = c("condition", "angle"),
sep = -1) %>%
arrange(id,
condition,
angle)
id | condition | angle | reaction_time |
---|---|---|---|
1 | Absent | 0 | 420 |
1 | Absent | 4 | 420 |
1 | Absent | 8 | 480 |
1 | Present | 0 | 480 |
1 | Present | 4 | 600 |
1 | Present | 8 | 780 |
5. Convert to factor
Now before conducting any statistical tests, we will want to convert our id, condition, and angle columns to factor. We can do this with the base R function as.factor()
.
long.data$id <- as.factor(long.data$id)
long.data$condition <- as.factor(long.data$condition)
long.data$angle <- as.factor(long.data$angle)
Bonus with other helper functions
We could also convert variable to factor with the convert_as_factor()
function from the rstatix package. We can pipe each of the steps above and add the convert_as_factor()
function at the end setting the columns we would like to convert.
library(rstatix)
# Create a new data frame with a subject id
rm_data <- cbind(id = c(1:10), chapter_12_table_1)
# Convert the data from wide to long and convert id, Condition and Angle to factor
rm_data <- rm_data %>%
gather(key = Condition.Angle,
value = Reaction_Time,
-id) %>%
separate(col = Condition.Angle,
into = c("Condition", "Angle"),
sep = -1) %>%
arrange(id,
Condition,
Angle) %>%
convert_as_factor(id, Condition, Angle)
Condition | Angle | Reaction_Time |
---|---|---|
Absent | 0 | 420 |
Absent | 4 | 420 |
Absent | 8 | 480 |
Present | 0 | 480 |
Present | 4 | 600 |
Present | 8 | 780 |
Using the pivot_wider()
function
The pivot_wider()
function from the dplyr package can also be used accomplish the same task of converting wide to long data. In this case we use the chapter 12 data as the input data frame, select columns Absent0 through Present8 to pivot wider, specify the column names of the selected columns will be placed into a new column called condition.angle, and then the values in to a new column called reaction time. Finally, we will sepatrate the condition and angle variables to be placed in separate columns, condition and angle. The length of the code is roughly the same, but it may be more intuitive to understand the mechanics.
long_data <-
pivot_longer(chapter_12_table_1,
cols = Absent0:Present8,
names_to = "condition.angle",
values_to = "reaction_time") %>%
separate(col = condition.angle,
into = c("condition", "angle"),
sep = -1)
condition | angle | reaction_time |
---|---|---|
Absent | 0 | 420 |
Absent | 4 | 420 |
Absent | 8 | 480 |
Present | 0 | 480 |
Present | 4 | 600 |
Present | 8 | 780 |
Wrap Up
Converting wide data to long and vice versa are indespensible skills when working in R. It can be tricky at first, but thankfully the tidyverse package as several functions that can simplify this process. Going from wide to long will set you up to perform many of the statistical analyses you may be interested in doing.
References
Kassambara, Alboukadel. 2020a. Rstatix: Pipe-Friendly Framework for Basic Statistical Tests. https://CRAN.R-project.org/package=rstatix.
Maxwell, Scott, Harold Delaney, and Ken Kelley. 2020. AMCP: A Model Comparison Perspective. https://CRAN.R-project.org/package=AMCP.
Wickham, Hadley. 2019. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.