left-join

Code

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Combining data from two different sources using joins

When working with long data, where participants have multiple observations in multiple rows, you will find that a lot of data is repeated. For instance, demographic features such as participant age and sex or experimental conditions will stay the same for each observation.

Sometimes you will want to add or manipulate data and join it with this repeated measures data.

For example, imagine you have a dataframe of scores for two subjects, five rows per subject:

Long data

Code

tibble(subject = c(rep('one',5), rep('two', 5)), 
       score = (c(1:5, 10:6)))

# A tibble: 10 × 2
   subject score
   <chr>   <int>
 1 one         1
 2 one         2
 3 one         3
 4 one         4
 5 one         5
 6 two        10
 7 two         9
 8 two         8
 9 two         7
10 two         6

You might also have recorded demographic information such as age in a separate file.

Code

tibble(subject = c('one','two'), age = c(42, 79))

# A tibble: 2 × 2
  subject   age
  <chr>   <dbl>
1 one        42
2 two        79

How can you form a single tibble which has both the score and ages in long format so that it looks like this?

Code

tibble(subject = c(rep('one',5), rep('two', 5)), 
       score = (c(1:5, 10:6)),
       age = c(rep(42,5), rep(79,5)))

# A tibble: 10 × 3
   subject score   age
   <chr>   <int> <dbl>
 1 one         1    42
 2 one         2    42
 3 one         3    42
 4 one         4    42
 5 one         5    42
 6 two        10    79
 7 two         9    79
 8 two         8    79
 9 two         7    79
10 two         6    79

We could manually sort and join the two objects, but we would prefer a method more precise. This is where joins come in.

1. Using `left_join()`

There are several ways to join, but the focus here will be on using left_join(). This function joins the rows of two tibbles into one object. The arguments for left_join() are:

x - the first tibble
y - the second tibble
by - a column or column(s) shared by both tibbles, used to match values
- this last point is crucial - if the dataframes don’t have a shared columns, they can’t be joined.

left_join() means that the object on the left (x) is the final object to retain. All matches in y that are also in x will be added. For example:

Code

one <- tibble(subject = c(rep(1,5), rep(2,5)), score = 1:10)
one

# A tibble: 10 × 2
   subject score
     <dbl> <int>
 1       1     1
 2       1     2
 3       1     3
 4       1     4
 5       1     5
 6       2     6
 7       2     7
 8       2     8
 9       2     9
10       2    10

Code

two <- tibble(subject = c(1,2), age = c(38,40))
two

# A tibble: 2 × 2
  subject   age
    <dbl> <dbl>
1       1    38
2       2    40

We can use left join to add age to each row. Notice how it matches each instance of age in y to any matches in x. All rows in x are retained.

Code

three <- left_join(one,two, by = 'subject')
three

# A tibble: 10 × 3
   subject score   age
     <dbl> <int> <dbl>
 1       1     1    38
 2       1     2    38
 3       1     3    38
 4       1     4    38
 5       1     5    38
 6       2     6    40
 7       2     7    40
 8       2     8    40
 9       2     9    40
10       2    10    40

Even if we reverse the order of the data objects, we still sort of get the results we want. This is because:

“If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.”

Also, note what happens if you do not specify a by argument

Code

left_join(two,one)

Joining with `by = join_by(subject)`

# A tibble: 10 × 3
   subject   age score
     <dbl> <dbl> <int>
 1       1    38     1
 2       1    38     2
 3       1    38     3
 4       1    38     4
 5       1    38     5
 6       2    40     6
 7       2    40     7
 8       2    40     8
 9       2    40     9
10       2    40    10

2. Create a tibble named `exp.data` which has the following columns:

subject - which is the numbers 1 through 10
pre - 10 numbers randomly sampled from a normal distribution with mean = 20, sd = 3
post1 - 10 numbers randomly sampled from a normal distribution with mean = 30, sd = 1
post2 - 10 numbers randomly sampled from a normal distribution with mean = 15, sd = 5
use a set.seed of 500

You will need to use rnorm to generate these numbers:

Code

rnorm(n = 1, mean = 10, sd = 1)

[1] 8.753432

You should see this:

Code

set.seed(500)
exp.data <- tibble(subject = 1:10, pre = rnorm(10, 20, 3), 
                   post1 = rnorm(10, 30, 1), 
                   post2 = rnorm(10, 15, 5))

exp.data

# A tibble: 10 × 4
   subject   pre post1 post2
     <int> <dbl> <dbl> <dbl>
 1       1  22.9  28.6 10.7 
 2       2  25.9  28.2 19.8 
 3       3  22.7  29.4 20.3 
 4       4  20.1  30.5 16.0 
 5       5  22.8  28.2 11.4 
 6       6  18.3  28.5 11.3 
 7       7  22.2  30.3 14.6 
 8       8  21.9  29.7 10.7 
 9       9  20.1  30.9  7.96
10      10  20.8  31.5 24.0

3. Create a new tibble named `exp.data.long` from `exp.data` using `pivot_longer()`

the new column names should be score and test

You should see this:

Code

exp.data.long <- pivot_longer(exp.data, c(pre,post1,post2), values_to = 'score', names_to = 'test' )
exp.data.long

# A tibble: 30 × 3
   subject test  score
     <int> <chr> <dbl>
 1       1 pre    22.9
 2       1 post1  28.6
 3       1 post2  10.7
 4       2 pre    25.9
 5       2 post1  28.2
 6       2 post2  19.8
 7       3 pre    22.7
 8       3 post1  29.4
 9       3 post2  20.3
10       4 pre    20.1
# ℹ 20 more rows

4. Create a second tibble named `demographic.data` with the following columns

subject: the values 1-10
age: 10 random values from a normal distribution with a mean of 20, sd of .2, set.seed of 100
condition: the word “high” repeated five times, then the word “low” repeated five times

You should see this:

Code

set.seed(100)
demographic.data <- tibble(subject = 1:10, age = rnorm(10, 20, .2), condition = c(rep('high',5), rep('low', 5)))
demographic.data

# A tibble: 10 × 3
   subject   age condition
     <int> <dbl> <chr>    
 1       1  19.9 high     
 2       2  20.0 high     
 3       3  20.0 high     
 4       4  20.2 high     
 5       5  20.0 high     
 6       6  20.1 low      
 7       7  19.9 low      
 8       8  20.1 low      
 9       9  19.8 low      
10      10  19.9 low

5. Make a new tibble named `full.data` which is the result of using `left_join()` on `exp.data.long` and `demographic.data`

Code

full.data <- left_join(exp.data.long, demographic.data)

Joining with `by = join_by(subject)`

Code

full.data

# A tibble: 30 × 5
   subject test  score   age condition
     <int> <chr> <dbl> <dbl> <chr>    
 1       1 pre    22.9  19.9 high     
 2       1 post1  28.6  19.9 high     
 3       1 post2  10.7  19.9 high     
 4       2 pre    25.9  20.0 high     
 5       2 post1  28.2  20.0 high     
 6       2 post2  19.8  20.0 high     
 7       3 pre    22.7  20.0 high     
 8       3 post1  29.4  20.0 high     
 9       3 post2  20.3  20.0 high     
10       4 pre    20.1  20.2 high     
# ℹ 20 more rows

6. Create a new tibble named `output` which contains the mean and sd of `score` at each test time, separated by condition

You should see this:

Code

output <- full.data %>%
  group_by(test, condition) %>%
  summarise(mean.score = mean(score), sd.score = sd(score))

`summarise()` has grouped output by 'test'. You can override using the
`.groups` argument.

Code

output

# A tibble: 6 × 4
# Groups:   test [3]
  test  condition mean.score sd.score
  <chr> <chr>          <dbl>    <dbl>
1 post1 high            29.0    0.971
2 post1 low             30.2    1.15 
3 post2 high            15.6    4.51 
4 post2 low             13.7    6.22 
5 pre   high            22.9    2.06 
6 pre   low             20.6    1.56

7. Now let us do some simple visualization using `ggplot`

ggplot is a function to plot data and is usually what you see when people show r plots. The function is more complicated than other functions, but we can learn the basics for now.

ggplot needs to know the data it is using, as well as what to define the x and y axis with. This is done with the following syntax:

ggplot(data, aes(x = column1, y = column2))

Can you make a ggplot from output in which mean.score is the y axis and test is the x axis?

You will want to see something like this:

Code

ggplot(output, aes(y = mean.score, x = test))

What’s wrong with our plot so far? The order of “test” is wonky. This is because R will auto-sort alphabetically. Let’s change this.

8. Reordering a factor

Create a new value named test2 which is the result of calling as.factor() on output$test. Then call a summary() on test2, then call the function levels() on test2 - what does this tell us?

Code

test2 <- as.factor(output$test)
summary(test2)

post1 post2   pre 
    2     2     2

Code

levels(test2)

[1] "post1" "post2" "pre"

We can use relevel() to set a new baseline level for our factor, but we might want more control than this.

Code

levels(relevel(test2, ref = 'pre'))

[1] "pre"   "post1" "post2"

The factor() function allows us to define the levels and the labels of our factor. The levels argument should match what it already in the data, and the labels can be used to give prettier text labels (e.g., for plotting

Code

factor(test2, levels = c('pre','post1', 'post2'))

[1] post1 post1 post2 post2 pre   pre  
Levels: pre post1 post2

Code

factor(test2, levels = c('pre','post1', 'post2'), labels = c("Pre", "Post 1", "Post 2"))

[1] Post 1 Post 1 Post 2 Post 2 Pre    Pre   
Levels: Pre Post 1 Post 2

9. your turn: Please change the value of `output$test` to be a factor, with levels and labels matching `test2`. You can do this by assigning a value to itself: `output$test = factor()`

afterwards run summary() and levels() on output$test. you should see this:

Code

output$test <- factor(output$test, levels = c('pre', 'post1', 'post2'), labels = c('Pre', 'Post 1', 'Post 2'))
levels(output$test)

[1] "Pre"    "Post 1" "Post 2"

Code

summary(output$test)

   Pre Post 1 Post 2 
     2      2      2

10. Plot the same plot again, but this time save it as an object named `my.plot`. call the plot by typing the name - this time your x-axis should look better

Code

my.plot <- ggplot(output, aes(x = test, y = mean.score))
my.plot

We want to add actual points to the plot, to do so we add geoms to the ggplot, which are different geometric objects. These geoms can also take their own aes arguments, making ggplot very powerful (but also confusing at times). Also, instead of a pipe %>%, you use a + to link geoms. add a geom_col() to the plot

Code

my.plot + 
  geom_col()

Now add an aes function inside geom_col and set fill to equal condition

Code

my.plot + 
  geom_col(aes(fill = condition))

Now, let’s “dodge” the stacked columns so that they are side by side. Add the argument position inside geom_col but not inside aes and set it equal to 'dodge'

Code

my.plot + 
  geom_col(aes(fill = condition), position = 'dodge')

Combining data from two different sources using joins

1. Using left_join()

2. Create a tibble named exp.data which has the following columns:

3. Create a new tibble named exp.data.long from exp.data using pivot_longer()

4. Create a second tibble named demographic.data with the following columns

5. Make a new tibble named full.data which is the result of using left_join() on exp.data.long and demographic.data

6. Create a new tibble named output which contains the mean and sd of score at each test time, separated by condition

7. Now let us do some simple visualization using ggplot

8. Reordering a factor

9. your turn: Please change the value of output$test to be a factor, with levels and labels matching test2. You can do this by assigning a value to itself: output$test = factor()

10. Plot the same plot again, but this time save it as an object named my.plot. call the plot by typing the name - this time your x-axis should look better

1. Using `left_join()`

2. Create a tibble named `exp.data` which has the following columns:

3. Create a new tibble named `exp.data.long` from `exp.data` using `pivot_longer()`

4. Create a second tibble named `demographic.data` with the following columns

5. Make a new tibble named `full.data` which is the result of using `left_join()` on `exp.data.long` and `demographic.data`

6. Create a new tibble named `output` which contains the mean and sd of `score` at each test time, separated by condition

7. Now let us do some simple visualization using `ggplot`

9. your turn: Please change the value of `output$test` to be a factor, with levels and labels matching `test2`. You can do this by assigning a value to itself: `output$test = factor()`

10. Plot the same plot again, but this time save it as an object named `my.plot`. call the plot by typing the name - this time your x-axis should look better