if_else() & case_when()
library(tidyverse)
Let’s work more with mutate().
Download the same metaphor_data.csv file from https://osf.io/qrc6b/
Create a new object named met.data, which is the result of calling read_csv() on the metaphor data.
met.data <- read_csv('https://www.stephenskalicky.com/r_data/metaphor_data.csv')
## Rows: 1304 Columns: 28
## ── Column specification ──────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): metaphor_id, response, met_type, sex, hand, language_group
## dbl (22): subject, conceptual, nm, trial_order, met_stim, met_RT, age, colle...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create a new object named met.data.2 from met.data. Add a select() call to your pipe so that met.data.2 only has the following columns: subject, age, englishAgeofOnset, and collegeYear. Finally, use the unique() function so that each subject only has one row.
met.data.2 <- met.data %>%
  dplyr::select(subject, age, collegeYear, englishAgeofOnset) %>%
  unique()
Let’s look at englishAgeofOnset first. This is the age at which participants began learning English. Look at a summary() of the variable. What do you notice?
# there are zeros - what could that mean?
summary(met.data.2$englishAgeofOnset)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 2.000 4.262 9.000 17.000
Now that we know what mutate() does, let’s use it to calculate a new variable. We want to get an idea of how long someone has been learning/using English, even if they are native speakers. Using the existing variables in met.data.2, how could we do that?
(Create a new object named met.data.3 from met.data.2, then use mutate() to create a new variable in met.data.3 named totalEnglish, which is a measure of the total number of years each participant has been learning/using English.)
met.data.3 <- met.data.2 %>%
  mutate(totalEnglish = age - englishAgeofOnset)
Create a new object named met.data.4 from met.data.3, and then use mutate() to create a new variable named englishPercent, which is the percentage of one’s total life spent using/learning English. The resulting variable should be represented as percentages (i.e., numbers from 0.00 to 100.00), rather than decimals (e.g., 0.1, 0.5, 1.0).
met.data.4 <- met.data.3 %>%
  mutate(englishPercent = (totalEnglish/age)*100)
Create a new object named met.data.5 from met.data.4, and then use mutate() to create a new variable named ENG_Group. Use if_else() within your mutate() call to assign participants to one of two groups: “NES” or “NNES”. You’ll need to choose which variable and condition you want to use in your if_else() function!
met.data.5 <- met.data.4 %>%
  mutate(ENG_Group = if_else(englishPercent == 100, 'NES', 'NNES'))
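if_else() takes a condition, a value to use when the condition is true, and a value to use when it is false. Here is a minimal sketch on made-up data showing that pattern, plus what could be done with rows where the condition cannot be evaluated. The toy tibble and the 'unknown' label are assumptions for illustration only and are not part of the metaphor data.

```r
library(dplyr)

# Hypothetical toy data, not the real metaphor data
toy <- tibble(
  subject = 1:4,
  englishPercent = c(100, 62.5, NA, 100)
)

toy.grouped <- toy %>%
  # TRUE -> 'NES', FALSE -> 'NNES'; a missing condition -> 'unknown'
  mutate(ENG_Group = if_else(englishPercent == 100, 'NES', 'NNES',
                             missing = 'unknown'))

print(toy.grouped)
```

Without the missing argument, a row with NA in englishPercent would simply get NA in ENG_Group.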
Create a new object named nnes.summary from met.data.5. Then use summarise() to provide some descriptive statistics about the NNES. Get the mean and SD of relevant values, as well as the max and min.
nnes.summary <- met.data.5 %>%
  group_by(ENG_Group) %>%
  summarise(mean.english = mean(englishPercent),
            sd.english = sd(englishPercent),
            min.english = min(englishPercent),
            max.english = max(englishPercent))
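The answer above groups by ENG_Group, so the summary has one row per group. If you only want the NNES row, one alternative is to filter() before summarising. This is a sketch on made-up data; the toy tibble and the name nnes.only are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not the real metaphor data
toy <- tibble(
  ENG_Group = c('NES', 'NNES', 'NNES', 'NES'),
  englishPercent = c(100, 50, 70, 100)
)

nnes.only <- toy %>%
  filter(ENG_Group == 'NNES') %>%   # keep only the NNES rows
  summarise(mean.english = mean(englishPercent),
            sd.english = sd(englishPercent),
            min.english = min(englishPercent),
            max.english = max(englishPercent))

print(nnes.only)
```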
if_else() is really handy for things like this, but it only allows for two possibilities: whether something is true or false. What if we have a lot of different values we’d like to create that depend on multiple conditions? Among the many options, we can use case_when(). This function is similar to if_else() in that it evaluates whether a cell meets a certain condition and then acts, but differs in that, unlike if_else(), case_when() only assigns a value when a condition is true; rows that match no condition are set to NA. In this way, you can list a series of conditions inside case_when() to make many different changes.

The syntax is also different. case_when() uses what is called formula notation, in the form case_when(condition ~ result). For example, if you wanted to turn all values of NA into 0, you could use case_when(is.na(variable) ~ 0, TRUE ~ variable). (Testing variable == NA does not work, because any comparison with NA returns NA rather than TRUE.) You can put multiple conditions and results inside a single case_when() function: case_when(is.na(variable) ~ 0, variable == 1 ~ NA, etc...). Conditions are checked in order, and the first one that is true is used.
Your condition can also include more than one variable:
case_when(variable1 == value & variable2 != value ~ result)
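To make the formula notation concrete, here is a small sketch on made-up data. It shows two things: testing for missing values with is.na(), and what happens to rows that match no condition. The toy tibble and the column names score.filled and band are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(score = c(NA, 1, 5, 12))

toy.recode <- toy %>%
  mutate(
    # score == NA is never TRUE, so test for missing values with is.na();
    # TRUE ~ score keeps every other value unchanged
    score.filled = case_when(is.na(score) ~ 0,
                             TRUE ~ score),
    # no catch-all here: rows matching neither condition become NA
    band = case_when(score < 3 ~ 'low',
                     score < 10 ~ 'mid')
  )

print(toy.recode)
```

In band, the rows with scores of NA and 12 match neither condition, so they come out as NA.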
Create a new object named met.data.6 from met.data.5. Then, create a new variable named age_group and assign the following values using mutate() and case_when():
lower: younger than 21
middle: 21 to 40
higher: older than 40
met.data.6 <- met.data.5 %>%
  mutate(age_group = case_when(age < 21 ~ 'lower',
                               age > 20 & age < 41 ~ 'middle',
                               age > 40 ~ 'higher'))
Calling summary() on as.factor() of the new variable is a quick way to check the group counts.
summary(as.factor(met.data.6$age_group))
## higher lower middle
## 3 20 38
Next, consider collegeYear. The numbers in collegeYear correspond to answers on a demographic survey:
1: First-year undergraduate
2: Second-year undergraduate
3: Third-year undergraduate
4: Fourth-year undergraduate
5: Fifth-year undergraduate
6: MA Student
7: PhD Student
You can see why numbers are easier to write in the data! Well, let’s imagine we want fewer, broader categories. Create a new object named met.data.7 from met.data.6 and then use mutate() with case_when() to create a new variable named studentLevel. Group the subjects into three categories: “early UG”, “late UG”, and “PG” based on their college year.
met.data.7 <- met.data.6 %>%
  mutate(studentLevel = case_when(collegeYear < 3 ~ 'early UG',
                                  collegeYear > 2 & collegeYear < 6 ~ 'late UG',
                                  collegeYear > 5 ~ 'PG'))
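Because case_when() checks its conditions in order and uses the first one that is true, the middle condition does not strictly need its lower bound. Here is a sketch of that idea on made-up data; the toy tibble is hypothetical, and this is just one alternative way to write the same grouping.

```r
library(dplyr)

# Hypothetical toy data covering several college years
toy <- tibble(collegeYear = c(1, 2, 3, 5, 6, 7))

toy.level <- toy %>%
  mutate(studentLevel = case_when(
    collegeYear < 3 ~ 'early UG',   # years 1-2
    collegeYear < 6 ~ 'late UG',    # only reached by years 3-5
    collegeYear >= 6 ~ 'PG'         # years 6-7
  ))

print(toy.level)
```

The order matters here: if the second condition came first, years 1 and 2 would be labelled 'late UG'.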
Create a new object named met.data.8 from met.data.7. Then use a pipe to create a new variable named super.status. This variable will assign participants to a category based on two features: their English group (NES or NNES) and whether they are an undergraduate (UG) or a postgraduate (PG).
There are four values for super.status: NNES-UG, NES-UG, NNES-PG, NES-PG
Use mutate(), case_when(), and your new variables.
met.data.8 <- met.data.7 %>%
  mutate(super.status = case_when(ENG_Group == 'NES' & studentLevel == 'early UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'late UG' ~ 'NES-UG',
                                  ENG_Group == 'NES' & studentLevel == 'PG' ~ 'NES-PG',
                                  ENG_Group == 'NNES' & studentLevel == 'early UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'late UG' ~ 'NNES-UG',
                                  ENG_Group == 'NNES' & studentLevel == 'PG' ~ 'NNES-PG'))
The way I did this was silly: using the collegeYear variable means you could do this in four lines instead of six. But it still works!
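Here is one way the four-line version might look, sketched on made-up data. The toy tibble is hypothetical, and the cut-off (college years 1 to 5 are UG, 6 and 7 are PG) is taken from the survey coding above; this is a guess at the shorter version, not necessarily the exact code intended.

```r
library(dplyr)

# Hypothetical toy data, not the real metaphor data
toy <- tibble(
  ENG_Group = c('NES', 'NES', 'NNES', 'NNES', 'NES'),
  collegeYear = c(1, 6, 4, 7, 5)
)

toy.status <- toy %>%
  mutate(super.status = case_when(
    ENG_Group == 'NES' & collegeYear < 6 ~ 'NES-UG',
    ENG_Group == 'NES' & collegeYear > 5 ~ 'NES-PG',
    ENG_Group == 'NNES' & collegeYear < 6 ~ 'NNES-UG',
    ENG_Group == 'NNES' & collegeYear > 5 ~ 'NNES-PG'
  ))

print(toy.status)
```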