Converting wide to long data in R

Structured datasets tend to come in one of two formats, wide or long. In the wide format, each row contains data related to a particular subject or event. Each values in each column usually represent a variable taken from that participant. On the other hand, long data format is when related values of an outcome variable are placed in a column, while a different column contains values regarding the condition under which the outcome variables were gathered. Some R functions work best with wide data, while others work best with long data. As a result, it is important to know how to reshape data and go between the two types. This guide explains how to convert a wide format dataset to long format using tidyverse. The data are from the AMCP package and were created to perform a two-way repeated measures ANOVA.

library(tidyverse)
library(AMCP)
library(kableExtra)

# Load the data
data(chapter_12_table_1)

Wide Data

In the wide format data, each row represents a subject. Each column is a combination of two main factors Condition and Angle. Condition is a two level factor, while Angle is a three level factor. Thus there are six columns for all possible combinations of Condition and Angle. Each cell along the rows represent a unique dependent variable taken from the same subject. In this case, the numerical values are reaction times in milliseconds.

Table 1: Wide data format.
Absent0Absent4Absent8Present0Present4Present8
420420480480600780
420480480360480600
480480540660780780
420540540480780900
540660540480660720
360420360360480540
480480600540720840
480600660540720900
540600540480720780
480420540540660780

Long Data

In the long data format, each row represents an observation. Instead of having six columns with all of the dependent variables in one row, we want all of our dependent variables in one column. In addition, we will want additional columns that will contain the information that was specified by the column names. Notice that the values in 1st row of the wide data are now the first six values in the new column “Reaction_Time.” One way to think about creating the “Reaction_Time” column is that we are going to take the values of each row, transpose them to column, and then begin to concatenate or stack all of the values. At the same time, we will want to keep information about Condition wether noise was present or absent, and the angle at wich a stimulus was presented so will need to create two additional columns to contain that information.
Table 2: Long data format.
ConditionAngleReaction_Time
Absent0420
Absent4420
Absent8480
Present0480
Present4600
Present8780

Converting Wide to Long

1. Create a subject id

To make this transformation we will use the tidyverse functions. First, we will want to keep track of our 10 subjects so we will create a column “id” with an integer to identify them, then bind that column to the wide data to create a new dataframe.

#1. Create a new data frame with a subject id
long.data <- cbind(id = c(1:10), chapter_12_table_1)

2. Gather dependent variables

Next, we will use the gather() function which will stack each column of the wide data under a column called reaction_time and create a new column that will add the column name of the wide data as a value in a new column called condition.angle. The -id is used to prevent getting the id number into the condition.angle column.

long.data %>% gather(key = condition.angle,
                     value = reaction_time, -id)
Table 3: Results after applying the gather() function.
idcondition.anglereaction_time
1Absent0420
2Absent0420
3Absent0480
4Absent0420
5Absent0540
6Absent0360

3. Separate the condition.angle column

So far we’re on a good track, but the column condition.angle needs to be separated into two columns. One to specify condition, the other to specify angle. The separate() function can help us out here by splitting the condition.angle column into two columns, condition and angle. The sep = -1 argument tells the separate() function to split the values 1 character from the end. This separates the last character which contains a single digit angle number into a separate column. Try substituting different numbers such as -2 or 1, to get a feel for what the function is doing here.

long.data %>% 
  gather(key = condition.angle,
                     value = reaction_time, 
                     -id) %>%
  separate(col = condition.angle, 
           into = c("condition", "angle"), 
           sep = -1)
Table 4: Results after separating condition.angle with the separate() function.
idconditionanglereaction_time
1Absent0420
2Absent0420
3Absent0480
4Absent0420
5Absent0540
6Absent0360

4. Sort the data

At this point, we’ve done most of what is needed to convert the wide data to long data. The next step in the series of tidyverse functions is to sort the data. To get the output displayed above, we will want to sort the data by subject id, then by condition, then by angle. To accomplish this we will use the arrange() function and will display the reaction_time values as a column of the transposed rows from the wide data set. We will re-save this manipulation as long.data. Notice how the number of rows is now 60 whereas we started with 10.

long.data <- long.data %>% 
  gather(key = condition.angle,
         value = reaction_time, 
         -id) %>%
  separate(col = condition.angle, 
           into = c("condition", "angle"), 
           sep = -1) %>%
  arrange(id, 
          condition, 
          angle)
Table 5: First 6 rows of long.data after applying steps 1-3 and overwriting our original data.
idconditionanglereaction_time
1Absent0420
1Absent4420
1Absent8480
1Present0480
1Present4600
1Present8780

5. Convert to factor

Now before conducting any statistical tests, we will want to convert our id, condition, and angle columns to factor. We can do this with the base R function as.factor().

long.data$id <- as.factor(long.data$id)
long.data$condition <- as.factor(long.data$condition)
long.data$angle <- as.factor(long.data$angle)

Bonus with other helper functions

We could also convert variable to factor with the convert_as_factor() function from the rstatix package. We can pipe each of the steps above and add the convert_as_factor() function at the end setting the columns we would like to convert.

library(rstatix)
# Create a new data frame with a subject id
rm_data <- cbind(id = c(1:10), chapter_12_table_1)

# Convert the data from wide to long and convert id, Condition and Angle to factor
rm_data <-  rm_data %>% 
  gather(key = Condition.Angle, 
         value = Reaction_Time, 
         -id) %>%
  separate(col = Condition.Angle, 
           into = c("Condition", "Angle"), 
           sep = -1) %>%
  arrange(id, 
          Condition, 
          Angle) %>%
  convert_as_factor(id, Condition, Angle)
ConditionAngleReaction_Time
Absent0420
Absent4420
Absent8480
Present0480
Present4600
Present8780

Using the pivot_wider() function

The pivot_wider() function from the dplyr package can also be used accomplish the same task of converting wide to long data. In this case we use the chapter 12 data as the input data frame, select columns Absent0 through Present8 to pivot wider, specify the column names of the selected columns will be placed into a new column called condition.angle, and then the values in to a new column called reaction time. Finally, we will sepatrate the condition and angle variables to be placed in separate columns, condition and angle. The length of the code is roughly the same, but it may be more intuitive to understand the mechanics.

long_data <- 
  pivot_longer(chapter_12_table_1,
             cols = Absent0:Present8,
             names_to = "condition.angle",
             values_to = "reaction_time") %>%
  separate(col = condition.angle,
           into = c("condition", "angle"),
           sep = -1)
conditionanglereaction_time
Absent0420
Absent4420
Absent8480
Present0480
Present4600
Present8780

Wrap Up

Converting wide data to long and vice versa are indespensible skills when working in R. It can be tricky at first, but thankfully the tidyverse package as several functions that can simplify this process. Going from wide to long will set you up to perform many of the statistical analyses you may be interested in doing.

References

Kassambara, Alboukadel. 2020a. Rstatix: Pipe-Friendly Framework for Basic Statistical Tests. https://CRAN.R-project.org/package=rstatix.

Maxwell, Scott, Harold Delaney, and Ken Kelley. 2020. AMCP: A Model Comparison Perspective. https://CRAN.R-project.org/package=AMCP.

Wickham, Hadley. 2019. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Previous
Next