Data Preparation
Common data preparation tasks
pacman::p_load(tidyverse,
magrittr,
haven,
install = FALSE)
Dealing with NAs
replace_na()
Replaces any NA value with the desired character or integer.
- Requires character or integer class vector
- Does not work with factors, use
fct_explicit_na()
- Does not work with labelled variables, convert to numeric or character first
Multiple variables
If data are labelled, first convert to numeric. Useful for the preparation of binary variables.
data %>%
mutate(across(sample_question_1.integer:sample_question_3.integer, ~ as.numeric(.))) %>% #convert to numeric or character if variables are labelled
mutate(across(sample_question_1.integer:sample_question_3.integer, ~ replace_na(., 0)))
fct_explicit_na()
Replaces any NA value in a factored variable with “Missing”. This is useful in cases where one needs to plot “Missing” values because the default for {ggplot2} is to drop any values with NAs.
- Requires factored variable
One variable case
data %>%
mutate(sample_question_1.factor = fct_explicit_na(sample_question_1.factor, "Missing"))
Multiple variables
data %>%
mutate(across(sample_question_1.factor:sample_question_3.factor, ~ fct_explicit_na(., "Missing")))
drop_na()
The drop_na()
function eliminates any rows that contain an NA value in the input variable. Useful for filtering before passing the output to {ggplot2} functions.
One variable case
data %>%
drop_na(sample_question_1.factor)
Multiple variables
data %>%
drop_na(sample_question_1.factor:sample_question_3.factor)
Recoding values
recode_()
Old values are placed on the let-hand side (LHS) while new values are placed on the right-hand side (RHS) of the recode statements.
- Useful for converting numeric class variables to character.
- Does not work with labelled variables. Convert to numeric or character class first.
- Integers are encased with back ticks.
One variable case
data %>%
mutate(sample_question_1.integer = as.numeric(sample_question_1.integer)) %>% #convert to numeric or character if variables are labelled
mutate(sample_question_1.integer = recode(sample_question_1.integer,
`1` = "Strongly Disagree",
`2` = "Disagree",
`3` = "Neutral",
`4` = "Agree",
`5` = "Strongly Agree"))
Multiple variables
In addition to the requirements of the one variable case, the recode()
function requires that all variables have the same set of values.
data %>%
mutate(across(sample_question_1.integer:sample_question_3.integer, ~ as.numeric(.))) %>%
mutate(across(sample_question_1.integer:sample_question_3.integer, ~ recode(.,
`1` = "Strongly Disagree",
`2` = "Disagree",
`3` = "Neutral",
`4` = "Agree",
`5` = "Strongly Agree")))
Renaming columns
Renaming columns can be useful before piping data into {ggplot2} functions to avoid having to set x- and y-axis labels.
rename()
- New values are placed the LHS, while old values are placed on the RHS which is the opposite of
recode()
data %>%
rename("New name: Q1" = sample_question_1.factor,
"New name: Q2" = sample_question_2.factor,
"New name: Q3" = sample_question_3.factor)
rename_at()
#set the old column names that will be renamed
col_names <- names(data %>% select(sample_question_1.factor:sample_question_3.factor))
# Set the new col names that will be used
labels <- c("Q1", "Q2", "Q3")
# Rename the columns by the the values in labels
data %>%
rename_at(all_of(col_names), ~ labels)
Row-wise operations
rowSums()
Multiple variables
- Requires integer class variables
- Limited to sums
# In my experience rowSums() is fastest, but it's limited to sums
data %>%
mutate(row_sum = select(., sample_question_1.integer:sample_question_3:integer)) %>% rowSums(., na.rm = TRUE)
rowwise()
Multiple variables
Functions similar to the group_by()
function and modifies how other {dplyr} verbs work.
# Example using the select() verb and everything() function
data %>%
select(sample_question_1.integer:sample_question_3:integer) %>%
rowwise() %>%
mutate(sum = sum(c_across(everything()), na.rm = TRUE))
# Example using select() verb within c_across()
data %>%
rowwise() %>%
mutate(sum = sum(c_across(sample_question_1.integer:sample_question_3:integer), na.rm = TRUE))
Strip characters
ids <- c("[2987202982]", "[2123402982]", "[2009283471]" )
ids <- as.data.frame(ids)
Remove everything to the right of the bracket
ids %>%
mutate(ids = sub("].*", "", ids))
## ids
## 1 [2987202982
## 2 [2123402982
## 3 [2009283471
Remove everthing to the left
ids %>%
mutate(ids = sub(".*\\[", "", ids))
## ids
## 1 2987202982]
## 2 2123402982]
## 3 2009283471]