Data Preparation

Common data preparation tasks

pacman::p_load(tidyverse,
               magrittr,
               haven,
               install = FALSE)

Dealing with NAs

replace_na()

Replaces any NA value with the desired character or integer.

  • Requires character or integer class vector
  • Does not work with factors, use fct_explicit_na()
  • Does not work with labelled variables, convert to numeric or character first

Multiple variables

If data are labelled, first convert to numeric. Useful for the preparation of binary variables.

data %>%
  mutate(across(sample_question_1.integer:sample_question_3.integer, ~ as.numeric(.))) %>% #convert to numeric or character if variables are labelled
  mutate(across(sample_question_1.integer:sample_question_3.integer, ~ replace_na(., 0)))

fct_explicit_na()

Replaces any NA value in a factored variable with “Missing”. This is useful in cases where one needs to plot “Missing” values because the default for {ggplot2} is to drop any values with NAs.

  • Requires factored variable

One variable case

data %>% 
  mutate(sample_question_1.factor = fct_explicit_na(sample_question_1.factor, "Missing"))

Multiple variables

data %>% 
  mutate(across(sample_question_1.factor:sample_question_3.factor, ~ fct_explicit_na(., "Missing")))

drop_na()

The drop_na() function eliminates any rows that contain an NA value in the input variable. Useful for filtering before passing the output to {ggplot2} functions.

One variable case

data %>%
  drop_na(sample_question_1.factor)

Multiple variables

data %>%
  drop_na(sample_question_1.factor:sample_question_3.factor)

Recoding values

recode_()

Old values are placed on the let-hand side (LHS) while new values are placed on the right-hand side (RHS) of the recode statements.

  • Useful for converting numeric class variables to character.
  • Does not work with labelled variables. Convert to numeric or character class first.
  • Integers are encased with back ticks.

One variable case

data %>%
  mutate(sample_question_1.integer = as.numeric(sample_question_1.integer)) %>% #convert to numeric or character if variables are labelled
  mutate(sample_question_1.integer = recode(sample_question_1.integer,
                                            `1` = "Strongly Disagree", 
                                            `2` = "Disagree",
                                            `3` = "Neutral",
                                            `4` = "Agree",
                                            `5` = "Strongly Agree"))            

Multiple variables In addition to the requirements of the one variable case, the recode() function requires that all variables have the same set of values.

data %>%
  mutate(across(sample_question_1.integer:sample_question_3.integer, ~ as.numeric(.))) %>%
  mutate(across(sample_question_1.integer:sample_question_3.integer, ~ recode(.,
                                            `1` = "Strongly Disagree", 
                                            `2` = "Disagree",
                                            `3` = "Neutral",
                                            `4` = "Agree",
                                            `5` = "Strongly Agree")))

Renaming columns

Renaming columns can be useful before piping data into {ggplot2} functions to avoid having to set x- and y-axis labels.

rename()

  • New values are placed the LHS, while old values are placed on the RHS which is the opposite of recode()
data %>%
  rename("New name: Q1" = sample_question_1.factor,
         "New name: Q2" = sample_question_2.factor,
         "New name: Q3" = sample_question_3.factor)

rename_at()

#set the old column names that will be renamed
col_names <- names(data %>% select(sample_question_1.factor:sample_question_3.factor))

# Set the new col names that will be used
labels <- c("Q1", "Q2", "Q3")
  
# Rename the columns by the the values in labels
data %>%
  rename_at(all_of(col_names), ~ labels)

Row-wise operations

rowSums()

Multiple variables

  • Requires integer class variables
  • Limited to sums
# In my experience rowSums() is fastest, but it's limited to sums
data %>%
  mutate(row_sum = select(., sample_question_1.integer:sample_question_3:integer)) %>% rowSums(., na.rm = TRUE)

rowwise()

Multiple variables

Functions similar to the group_by() function and modifies how other {dplyr} verbs work.

# Example using the select() verb and everything() function
data %>%
  select(sample_question_1.integer:sample_question_3:integer) %>% 
  rowwise() %>% 
  mutate(sum = sum(c_across(everything()), na.rm = TRUE))

# Example using select() verb within c_across()
data %>% 
  rowwise() %>% 
  mutate(sum = sum(c_across(sample_question_1.integer:sample_question_3:integer), na.rm = TRUE))

Strip characters

ids <- c("[2987202982]", "[2123402982]", "[2009283471]" )
ids <- as.data.frame(ids)

Remove everything to the right of the bracket

ids %>%
  mutate(ids = sub("].*", "", ids))
##           ids
## 1 [2987202982
## 2 [2123402982
## 3 [2009283471

Remove everthing to the left

ids %>%
  mutate(ids = sub(".*\\[", "", ids))
##           ids
## 1 2987202982]
## 2 2123402982]
## 3 2009283471]
Next