Converting wide to long data in R

Structured datasets tend to come in one of two formats, wide or long. In the wide format, each row contains data related to a particular subject or event. Each values in each column usually represent a variable taken from that participant. On the other hand, long data format is when related values of an outcome variable are placed in a column, while a different column contains values regarding the condition under which the outcome variables were gathered. Some R functions work best with wide data, while others work best with long data. As a result, it is important to know how to reshape data and go between the two types. This guide explains how to convert a wide format dataset to long format using tidyverse. The data are from the AMCP package and were created to perform a two-way repeated measures ANOVA.

library(tidyverse)
library(AMCP)
library(kableExtra)

# Load the data
data(chapter_12_table_1)

Wide Data

In the wide format data, each row represents a subject. Each column is a combination of two main factors Condition and Angle. Condition is a two level factor, while Angle is a three level factor. Thus there are six columns for all possible combinations of Condition and Angle. Each cell along the rows represent a unique dependent variable taken from the same subject. In this case, the numerical values are reaction times in milliseconds.

Table 1: Wide data format.
Absent0	Absent4	Absent8	Present0	Present4	Present8
420	420	480	480	600	780
420	480	480	360	480	600
480	480	540	660	780	780
420	540	540	480	780	900
540	660	540	480	660	720
360	420	360	360	480	540
480	480	600	540	720	840
480	600	660	540	720	900
540	600	540	480	720	780
480	420	540	540	660	780

Long Data

In the long data format, each row represents an observation. Instead of having six columns with all of the dependent variables in one row, we want all of our dependent variables in one column. In addition, we will want additional columns that will contain the information that was specified by the column names. Notice that the values in 1st row of the wide data are now the first six values in the new column “Reaction_Time.” One way to think about creating the “Reaction_Time” column is that we are going to take the values of each row, transpose them to column, and then begin to concatenate or stack all of the values. At the same time, we will want to keep information about Condition wether noise was present or absent, and the angle at wich a stimulus was presented so will need to create two additional columns to contain that information.

Table 2: Long data format.
Condition	Angle	Reaction_Time
Absent	0	420
Absent	4	420
Absent	8	480
Present	0	480
Present	4	600
Present	8	780

Converting Wide to Long

1. Create a subject id

To make this transformation we will use the tidyverse functions. First, we will want to keep track of our 10 subjects so we will create a column “id” with an integer to identify them, then bind that column to the wide data to create a new dataframe.

#1. Create a new data frame with a subject id
long.data <- cbind(id = c(1:10), chapter_12_table_1)

2. Gather dependent variables

Next, we will use the gather() function which will stack each column of the wide data under a column called reaction_time and create a new column that will add the column name of the wide data as a value in a new column called condition.angle. The -id is used to prevent getting the id number into the condition.angle column.

long.data %>% gather(key = condition.angle,
                     value = reaction_time, -id)

Table 3: Results after applying the `gather()` function.
id	condition.angle	reaction_time
1	Absent0	420
2	Absent0	420
3	Absent0	480
4	Absent0	420
5	Absent0	540
6	Absent0	360

3. Separate the condition.angle column

So far we’re on a good track, but the column condition.angle needs to be separated into two columns. One to specify condition, the other to specify angle. The separate() function can help us out here by splitting the condition.angle column into two columns, condition and angle. The sep = -1 argument tells the separate() function to split the values 1 character from the end. This separates the last character which contains a single digit angle number into a separate column. Try substituting different numbers such as -2 or 1, to get a feel for what the function is doing here.

long.data %>% 
  gather(key = condition.angle,
                     value = reaction_time, 
                     -id) %>%
  separate(col = condition.angle, 
           into = c("condition", "angle"), 
           sep = -1)

Table 4: Results after separating condition.angle with the `separate()` function.
id	condition	reaction_time
1	Absent	420
2	Absent	420
3	Absent	480
4	Absent	420
5	Absent	540
6	Absent	360

4. Sort the data

At this point, we’ve done most of what is needed to convert the wide data to long data. The next step in the series of tidyverse functions is to sort the data. To get the output displayed above, we will want to sort the data by subject id, then by condition, then by angle. To accomplish this we will use the arrange() function and will display the reaction_time values as a column of the transposed rows from the wide data set. We will re-save this manipulation as long.data. Notice how the number of rows is now 60 whereas we started with 10.

long.data <- long.data %>% 
  gather(key = condition.angle,
         value = reaction_time, 
         -id) %>%
  separate(col = condition.angle, 
           into = c("condition", "angle"), 
           sep = -1) %>%
  arrange(id, 
          condition, 
          angle)

Table 5: First 6 rows of long.data after applying steps 1-3 and overwriting our original data.
id	condition	angle	reaction_time
1	Absent	0	420
1	Absent	4	420
1	Absent	8	480
1	Present	0	480
1	Present	4	600
1	Present	8	780

5. Convert to factor

Now before conducting any statistical tests, we will want to convert our id, condition, and angle columns to factor. We can do this with the base R function as.factor().

long.data$id <- as.factor(long.data$id)
long.data$condition <- as.factor(long.data$condition)
long.data$angle <- as.factor(long.data$angle)

Bonus with other helper functions

We could also convert variable to factor with the convert_as_factor() function from the rstatix package. We can pipe each of the steps above and add the convert_as_factor() function at the end setting the columns we would like to convert.

library(rstatix)
# Create a new data frame with a subject id
rm_data <- cbind(id = c(1:10), chapter_12_table_1)

# Convert the data from wide to long and convert id, Condition and Angle to factor
rm_data <-  rm_data %>% 
  gather(key = Condition.Angle, 
         value = Reaction_Time, 
         -id) %>%
  separate(col = Condition.Angle, 
           into = c("Condition", "Angle"), 
           sep = -1) %>%
  arrange(id, 
          Condition, 
          Angle) %>%
  convert_as_factor(id, Condition, Angle)

Condition	Angle	Reaction_Time
Absent	0	420
Absent	4	420
Absent	8	480
Present	0	480
Present	4	600
Present	8	780

Using the `pivot_wider()` function

The pivot_wider() function from the dplyr package can also be used accomplish the same task of converting wide to long data. In this case we use the chapter 12 data as the input data frame, select columns Absent0 through Present8 to pivot wider, specify the column names of the selected columns will be placed into a new column called condition.angle, and then the values in to a new column called reaction time. Finally, we will sepatrate the condition and angle variables to be placed in separate columns, condition and angle. The length of the code is roughly the same, but it may be more intuitive to understand the mechanics.

long_data <- 
  pivot_longer(chapter_12_table_1,
             cols = Absent0:Present8,
             names_to = "condition.angle",
             values_to = "reaction_time") %>%
  separate(col = condition.angle,
           into = c("condition", "angle"),
           sep = -1)

condition	angle	reaction_time
Absent	0	420
Absent	4	420
Absent	8	480
Present	0	480
Present	4	600
Present	8	780

Wrap Up

Converting wide data to long and vice versa are indespensible skills when working in R. It can be tricky at first, but thankfully the tidyverse package as several functions that can simplify this process. Going from wide to long will set you up to perform many of the statistical analyses you may be interested in doing.

References

Kassambara, Alboukadel. 2020a. Rstatix: Pipe-Friendly Framework for Basic Statistical Tests. https://CRAN.R-project.org/package=rstatix.

Maxwell, Scott, Harold Delaney, and Ken Kelley. 2020. AMCP: A Model Comparison Perspective. https://CRAN.R-project.org/package=AMCP.

Wickham, Hadley. 2019. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Last updated on Oct 8, 2023

Absent0	Absent4	Absent8	Present0	Present4	Present8
420	420	480	480	600	780
420	480	480	360	480	600
480	480	540	660	780	780
420	540	540	480	780	900
540	660	540	480	660	720
360	420	360	360	480	540
480	480	600	540	720	840
480	600	660	540	720	900
540	600	540	480	720	780
480	420	540	540	660	780

Absent0	Absent4	Absent8	Present0	Present4	Present8
420	420	480	480	600	780
420	480	480	360	480	600
480	480	540	660	780	780
420	540	540	480	780	900
540	660	540	480	660	720
360	420	360	360	480	540
480	480	600	540	720	840
480	600	660	540	720	900
540	600	540	480	720	780
480	420	540	540	660	780