if else case when

In this notebook we will compare methods for creating new columns using conditional tests.

The first type of test is an if/else test, where we provide an outcome if a test returns TRUE and an outcome if a test returns FALSE.

The second test is a case_when test, which allows us to specify any number of individual test -> result relationships.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load in data

We can use a data set which contains measures of linguistic features for a set of satirical (The Onion) and non-satirical (The New York Times) headlines.

dat <- read_csv('https://raw.githubusercontent.com/scskalicky/scskalicky.github.io/refs/heads/main/sample_dat/linguistic_features.csv')
Rows: 80 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): headline, filename, condition
dbl (16): conditionNum, MLC, numContenWords, numWords, numFunctionWords, MRC...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

variables of interest

  1. condition, this variable shows that headlines are in one of five conditions: atten, saturation, metaphor, negation, and control. The four conditions that are not the control condition are four different strategies for doing satire, but they are all nonetheless satirical headlines.

  2. “CS” variables. There are five variables in the data that represent ratings of the headlines for familiarity, understanding, positivity, sincerity, and funniness. The variables all end in the string “CS” because they were gathered using crowdsourcing methods (in this case, from workers on the Amazon Mechanical Turk platform).

Let’s create a dataframe that has only these variables.

dat_smol <- dat %>%
  select(condition, ends_with("CS"))

Check that we did it right:

glimpse(dat_smol)
Rows: 80
Columns: 6
$ condition       <chr> "atten", "saturation", "control", "control", "control"…
$ familiarCS      <dbl> 26.470588, 25.000000, 50.000000, 19.230769, 29.411765,…
$ understandingCS <dbl> 58.82353, 91.66667, 94.44444, 92.30769, 79.41176, 61.5…
$ positiveCS      <dbl> 17.647059, 0.000000, 86.111111, 3.846154, 2.941176, 23…
$ sincereCS       <dbl> 26.47059, 66.66667, 86.11111, 76.92308, 76.47059, 11.5…
$ funnyCS         <dbl> 17.647059, 4.166667, 0.000000, 7.692308, 2.941176, 57.…

creating new columns with if_else

What if we want to create a new column that contains a label showing whether a headline is simply satirical or non-satirical?

We know that we can create a new column using mutate(), so we just need to determine a new conditional test on the column that helps with this (i.e., condition).

Here we will use if_else to conduct a binary test.

Let’s first check to see how the function works. We first create a conditional test, then supply the value that should be returned if the test is true, then the value that should be returned if the test is false. For example…

# if the condition column is 'atten', return a 1, otherwise return a 0
if_else(dat$condition == 'atten', 1, 0)
 [1] 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[39] 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0
[77] 0 0 1 0

Notice that the function operates on the entire column and returns a vector the same length as the column. This is by design, so that we can quickly apply one test to every row of a column in our data frame. So if we want to create a new column called condition_big that contains the values satire or control, we could do this:

dat_smol <- dat_smol %>%
  mutate(condition_big = if_else(condition == 'control', 'control', 'satire'))

Check our work - looks good!

dat_smol$condition_big
 [1] "satire"  "satire"  "control" "control" "control" "satire"  "control"
 [8] "control" "control" "satire"  "satire"  "satire"  "satire"  "satire" 
[15] "satire"  "satire"  "satire"  "control" "control" "control" "satire" 
[22] "satire"  "control" "satire"  "control" "control" "satire"  "satire" 
[29] "satire"  "control" "control" "control" "control" "control" "control"
[36] "control" "satire"  "control" "satire"  "satire"  "control" "satire" 
[43] "control" "satire"  "control" "satire"  "control" "control" "control"
[50] "control" "control" "control" "satire"  "control" "satire"  "satire" 
[57] "satire"  "control" "control" "control" "satire"  "control" "satire" 
[64] "satire"  "satire"  "control" "satire"  "control" "satire"  "satire" 
[71] "control" "control" "satire"  "satire"  "control" "control" "satire" 
[78] "satire"  "satire"  "satire" 
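
One thing worth knowing about if_else() is what happens when the tested column has missing values: the test returns NA for those rows, and so does if_else(), unless we supply the optional missing argument. Here is a minimal sketch on a made-up vector (toy_condition and the label 'unknown' are illustrative, not part of the headline data):

```r
library(dplyr)

# Toy vector (hypothetical values, not taken from the headline data)
toy_condition <- c("control", "atten", NA, "metaphor")

# `missing` supplies the value to use where the test itself is NA
toy_labels <- if_else(toy_condition == "control", "control", "satire",
                      missing = "unknown")
toy_labels
```

Without the missing argument, the third element would simply stay NA.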

introduce case_when()

There is another way to do this, using case_when(). This function checks one or more conditional tests in order and, for each row, returns the value paired with the first test that is true. If no test is true for a row, it returns NA.

For case_when(): the syntax uses what is called formula notation, and is in the form of case_when(condition ~ result). For example, if you wanted to turn all values of 'cat' into 'dog', you could use case_when(variable == 'cat' ~ 'dog'). The usefulness of case_when() is seen in that you can put multiple conditions and results inside a single case_when() function: case_when(variable == 'a' ~ 0, variable == 'b' ~ 1, and so on...).
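
To see the test ~ result pairs in action, here is a small sketch on a made-up numeric vector (toy_scores and the cut-offs are hypothetical). Tests are checked from top to bottom, so a score of 95 is labelled by the first test even though it would also pass the second one.

```r
library(dplyr)

# Toy scores (hypothetical); the NA shows what happens when no test is TRUE
toy_scores <- c(95, 72, 40, NA)

toy_levels <- case_when(
  toy_scores >= 90 ~ "high",  # checked first
  toy_scores >= 50 ~ "mid",   # only reached if the first test was not TRUE
  toy_scores < 50  ~ "low"
)
toy_levels
```

The last element stays NA because a comparison with NA is never TRUE, so none of the three tests matches it.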

Let’s use case_when() to complete the same operation as above: create a new column that has the value satire if condition is not equal to control, and otherwise has the value control.

Delete the column we made above

dat_smol$condition_big <- NULL

glimpse(dat_smol)
Rows: 80
Columns: 6
$ condition       <chr> "atten", "saturation", "control", "control", "control"…
$ familiarCS      <dbl> 26.470588, 25.000000, 50.000000, 19.230769, 29.411765,…
$ understandingCS <dbl> 58.82353, 91.66667, 94.44444, 92.30769, 79.41176, 61.5…
$ positiveCS      <dbl> 17.647059, 0.000000, 86.111111, 3.846154, 2.941176, 23…
$ sincereCS       <dbl> 26.47059, 66.66667, 86.11111, 76.92308, 76.47059, 11.5…
$ funnyCS         <dbl> 17.647059, 4.166667, 0.000000, 7.692308, 2.941176, 57.…

Now add the column back using case_when(). First let’s look what happens if we only use one case_when() condition:

dat_smol <- dat_smol %>%
  mutate(condition_big = case_when(condition == 'control' ~ 'control'))

It works, but also fills NA for everything where the test did not return true.

dat_smol$condition_big
 [1] NA        NA        "control" "control" "control" NA        "control"
 [8] "control" "control" NA        NA        NA        NA        NA       
[15] NA        NA        NA        "control" "control" "control" NA       
[22] NA        "control" NA        "control" "control" NA        NA       
[29] NA        "control" "control" "control" "control" "control" "control"
[36] "control" NA        "control" NA        NA        "control" NA       
[43] "control" NA        "control" NA        "control" "control" "control"
[50] "control" "control" "control" NA        "control" NA        NA       
[57] NA        "control" "control" "control" NA        "control" NA       
[64] NA        NA        "control" NA        "control" NA        NA       
[71] "control" "control" NA        NA        "control" "control" NA       
[78] NA        NA        NA       

Add a second case for when the condition is not control (this time using the label satirical):

dat_smol <- dat_smol %>%
  mutate(condition_big = case_when(condition == 'control' ~ 'control',
                                   condition != 'control' ~ 'satirical'))

We see it works!

dat_smol$condition_big
 [1] "satirical" "satirical" "control"   "control"   "control"   "satirical"
 [7] "control"   "control"   "control"   "satirical" "satirical" "satirical"
[13] "satirical" "satirical" "satirical" "satirical" "satirical" "control"  
[19] "control"   "control"   "satirical" "satirical" "control"   "satirical"
[25] "control"   "control"   "satirical" "satirical" "satirical" "control"  
[31] "control"   "control"   "control"   "control"   "control"   "control"  
[37] "satirical" "control"   "satirical" "satirical" "control"   "satirical"
[43] "control"   "satirical" "control"   "satirical" "control"   "control"  
[49] "control"   "control"   "control"   "control"   "satirical" "control"  
[55] "satirical" "satirical" "satirical" "control"   "control"   "control"  
[61] "satirical" "control"   "satirical" "satirical" "satirical" "control"  
[67] "satirical" "control"   "satirical" "satirical" "control"   "control"  
[73] "satirical" "satirical" "control"   "control"   "satirical" "satirical"
[79] "satirical" "satirical"

why case_when?

You can see that in this example, if_else was a more concise and arguably more elegant solution than case_when(). However, what happens if we have three different outcomes we’d like to create? Then case_when() starts to be more useful. For example, among the four types of satire, two are different forms of exaggeration (saturation / attenuation), whereas the other two are negation and metaphor. Let’s say we want to classify these into three categories:

satire-exaggerate, satire-met-neg, and control

We can write a case_when() call to handle this easily.

# remove the column so we can start fresh
dat_smol$condition_big <- NULL

Create a three-step case_when()

dat_smol <- dat_smol %>%
  mutate(condition_big = case_when(condition == 'control' ~ 'control',
                                   condition == 'atten' | condition == 'saturation' ~ 'satire-exaggerate',
                                   condition == 'negation' | condition == 'metaphor' ~ 'satire-met-neg'))

And the results…

dat_smol$condition_big
 [1] "satire-exaggerate" "satire-exaggerate" "control"          
 [4] "control"           "control"           "satire-exaggerate"
 [7] "control"           "control"           "control"          
[10] "satire-met-neg"    "satire-met-neg"    "satire-exaggerate"
[13] "satire-met-neg"    "satire-exaggerate" "satire-exaggerate"
[16] "satire-met-neg"    "satire-met-neg"    "control"          
[19] "control"           "control"           "satire-met-neg"   
[22] "satire-met-neg"    "control"           "satire-met-neg"   
[25] "control"           "control"           "satire-met-neg"   
[28] "satire-exaggerate" "satire-exaggerate" "control"          
[31] "control"           "control"           "control"          
[34] "control"           "control"           "control"          
[37] "satire-met-neg"    "control"           "satire-met-neg"   
[40] "satire-exaggerate" "control"           "satire-exaggerate"
[43] "control"           "satire-met-neg"    "control"          
[46] "satire-exaggerate" "control"           "control"          
[49] "control"           "control"           "control"          
[52] "control"           "satire-met-neg"    "control"          
[55] "satire-met-neg"    "satire-met-neg"    "satire-met-neg"   
[58] "control"           "control"           "control"          
[61] "satire-met-neg"    "control"           "satire-exaggerate"
[64] "satire-exaggerate" "satire-met-neg"    "control"          
[67] "satire-exaggerate" "control"           "satire-exaggerate"
[70] "satire-exaggerate" "control"           "control"          
[73] "satire-exaggerate" "satire-exaggerate" "control"          
[76] "control"           "satire-met-neg"    "satire-met-neg"   
[79] "satire-exaggerate" "satire-exaggerate"
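
When several values should map to the same label, the chain of == tests joined by | can also be written with %in%. The sketch below shows this on a made-up vector holding one of each condition (toy_condition and toy_big are hypothetical names); it is an alternative spelling, not a different result.

```r
library(dplyr)

# One example of each condition label (toy vector, not the real column)
toy_condition <- c("control", "atten", "saturation", "negation", "metaphor")

toy_big <- case_when(
  toy_condition == "control"                   ~ "control",
  toy_condition %in% c("atten", "saturation")  ~ "satire-exaggerate",
  toy_condition %in% c("negation", "metaphor") ~ "satire-met-neg"
)
toy_big
```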

default value with case_when()

In case it is useful, you can add a final test to a case_when() call that sets a default value for all cases that don’t pass an earlier test. Because the tests are checked in order and TRUE is always true, a final TRUE ~ value line catches every remaining row. It looks like this, and in this case behaves much like if_else!

# the final TRUE sets the default value for any row not matched above
dat_smol <- dat_smol %>%
  mutate(condition_big = case_when(condition == 'control' ~ 'control',
                                   TRUE ~ 'satire'))

dat_smol$condition_big
 [1] "satire"  "satire"  "control" "control" "control" "satire"  "control"
 [8] "control" "control" "satire"  "satire"  "satire"  "satire"  "satire" 
[15] "satire"  "satire"  "satire"  "control" "control" "control" "satire" 
[22] "satire"  "control" "satire"  "control" "control" "satire"  "satire" 
[29] "satire"  "control" "control" "control" "control" "control" "control"
[36] "control" "satire"  "control" "satire"  "satire"  "control" "satire" 
[43] "control" "satire"  "control" "satire"  "control" "control" "control"
[50] "control" "control" "control" "satire"  "control" "satire"  "satire" 
[57] "satire"  "control" "control" "control" "satire"  "control" "satire" 
[64] "satire"  "satire"  "control" "satire"  "control" "satire"  "satire" 
[71] "control" "control" "satire"  "satire"  "control" "control" "satire" 
[78] "satire"  "satire"  "satire" 
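
If you are using dplyr 1.1.0 or later (the session at the top of this notebook loads 1.1.4), case_when() also has a .default argument that does the same job as a final TRUE ~ value line. A small sketch on a made-up vector (toy_condition is hypothetical):

```r
library(dplyr)

# Toy vector (hypothetical values, not the real column)
toy_condition <- c("control", "atten", "metaphor")

# .default is used for every element that matches none of the tests
toy_default <- case_when(
  toy_condition == "control" ~ "control",
  .default = "satire"
)
toy_default
```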

more complicated case_when

Let’s now try to create a variable based on values of the CS variables. First let’s look at the distributions of these variables.

Create a plot - first by pivoting the data then plotting the density.

Can you answer two questions?

  1. What is the possible scale for each rating?
  2. Which rating type(s) discriminate between satirical and non-satirical headlines?

plot_dat <- dat_smol %>%
  pivot_longer(cols = ends_with("CS"), names_to = 'type', values_to = 'rating') 


ggplot(plot_dat, aes(x = rating, fill = condition_big)) + 
  facet_wrap(. ~ type, scales = 'free') + 
  geom_density(alpha = .25) + 
  labs(title = "Distribution of crowdsourced ratings among satirical/non-satirical headlines") + 
  theme(legend.position = 'bottom', legend.title = element_blank())
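
If it is not obvious what the pivot step does, here is a small sketch on a made-up two-row tibble (toy_wide and its values are hypothetical). Each CS column becomes its own row, with the column name stored in type and the value stored in rating. The same long format also makes it easy to summarise each rating type numerically, which may help with the first question above.

```r
library(dplyr)
library(tidyr)

# Toy wide data: one row per headline, one column per rating (made-up values)
toy_wide <- tibble(
  condition_big = c("control", "satire"),
  sincereCS     = c(80, 20),
  funnyCS       = c(5, 60)
)

# Long format: one row per headline-by-rating-type combination
toy_long <- toy_wide %>%
  pivot_longer(cols = ends_with("CS"), names_to = "type", values_to = "rating")

# Observed minimum and maximum for each rating type
toy_ranges <- toy_long %>%
  group_by(type) %>%
  summarise(min_rating = min(rating), max_rating = max(rating))

toy_long
toy_ranges
```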

Let’s create a new variable named sinc_hum that takes four possible values based on the values of the CS ratings:

  1. hh: High sincerity, high funny
  2. hl: High sincerity, low funny
  3. lh: Low sincerity, high funny
  4. ll: Low sincerity, low funny

Let’s define low as any rating below 50 and high as any rating of 50 or above.

Name the new variable sinc_hum

Can you do it on your own? You probably need to use the and operator (&)

dat_smol <- dat_smol %>%
  mutate(sinc_hum = case_when(sincereCS < 50 & funnyCS < 50 ~ 'll',
                              sincereCS < 50 & funnyCS >= 50 ~ 'lh',
                              sincereCS >= 50 & funnyCS >= 50 ~ 'hh',
                              sincereCS >= 50 & funnyCS < 50 ~ 'hl'))

What do we see? Which combination are we missing? Why?

table(dat_smol$sinc_hum, dat_smol$condition_big)
    
     control satire
  hl      39      5
  lh       0     27
  ll       1      8
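
As an aside, because each label is just two letters stuck together, one could also build it from two if_else() calls instead of spelling out all four combinations. This is only a sketch of an alternative, run on made-up values (toy_ratings is hypothetical), and it assumes no rating is missing, since paste0() would turn an NA into the text "NA".

```r
library(dplyr)

# Made-up ratings covering all four combinations, plus one near the cut-off
toy_ratings <- tibble(
  sincereCS = c(10, 10, 80, 80, 49.5),
  funnyCS   = c(10, 80, 10, 80, 50)
)

# First letter codes sincerity, second letter codes funniness
toy_ratings <- toy_ratings %>%
  mutate(sinc_hum = paste0(if_else(sincereCS >= 50, "h", "l"),
                           if_else(funnyCS >= 50, "h", "l")))

toy_ratings$sinc_hum
```

The case_when() version is easier to read when the categories do not decompose this neatly, so which one to prefer is mostly a matter of taste.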

Hopefully that helps you get a grasp on how if_else() and case_when() work!