This notebook introduces two new functions: group_by()
and summarise()
. The first function,
group_by()
, is used to first tell a tibble/dataframe your
set of grouping variables, i.e., what do you want to group by. The
second function, summarise()
, will allow you to ask for any
number of functions applied to your data based on that grouping
structure. To demonstrate, let’s look at some penguins!
Load tidyverse
and the palmerpenguins
data
set
penguins
We can calculate the average/mean of a variable using the base
function mean()
. This function takes a vector of numbers or
a dataframe column and returns the average/mean. Lets use it to find the
average body mass of the penguins.
## [1] NA
There is at least one NA value in the dataframe, so the function is
returning NA
. We can deal with this by using the
na.rm
argument in mean. The default value of
na.rm
is FALSE
, meaning that NA
values will not be removed. Set the argument to TRUE
in
order to remove the NA
values from the calculations
## [1] 4201.754
After removing the NA
values, we see that the average
body mass of the penguins is about 4200 grams. However, in the notebook
using filter()
, we developed a hypothesis that certain
species of penguins had higher body masses than others. Let’s confirm
this hypothesis by using mean()
on three filtered versions
of the data.
## Adelie Chinstrap Gentoo
## 152 68 124
In the line below, I wrap mean()
around a
filter()
call to penguins
, which asks for just
the "Adelie"
penguins. I also attach the $
to
the end of the filter function to call just the body_mass_g
column. I include na.rm = TRUE
to filter out
NA
.
So this line of code is saying “Give me the mean of the column
body_mass_g
in the penguins
data after
filtering all rows where species is equal to ‘Adelie’ and also remove
NA
values during the calculation of the mean”. While this
is “efficient” in being a single line, it can be a bit difficult to take
in if you are unfamiliar with the different parts of the R syntax. (This
is a pedagogical decision, we will see later how group_by()
and summarise()
improve on this.)
## [1] 3700.662
We see that the average body mass of the Adelie penguins is lower than the average body mass of all the penguins. Let’s quickly look at the average body mass of the other two species:
## [1] 3733.088
## [1] 5076.016
It seems both the Adelie and Chinstrap penguins have a lower average
body mass than the entire population average, whereas the Gentoo
penguins are heavier. To understand this, we repeated the same thing
three times, which is usually a signal that we could be more efficient.
Moreover, we only see the means one-at-a-time. It would be far more
useful to obtain all of this information at the same time. This is where
group_by()
and summarise()
come in handy!
The first thing we need to think about is what our grouping variable
will be. In this case, it should be relatively obvious that it is
species
(this is the variable we used filter()
on three times). Knowing this, we can start to build
group_by()
pipe.
It’s usually good to create a new tibble that contains your summary
data. Let’s create a new variable called bm_summary
, which
is a copy of penguins
. Then pipe into the
group_by()
function. We want to group by
species
, so add that into the group_by()
function.
Let’s start with the penguins
data and pipe into
group_by()
, like this:
Running this code doesn’t seem to do anything, but we have now create
a new tibble which contains metadata about a grouping structure imposed
on the data. You can verify this using glimpse()
, which now
contains a Groups
output.
## Rows: 344
## Columns: 8
## Groups: species [3]
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
You can also look at the tibble in the environment pane to see the grouping structure:
You can use the ungroup()
function to remove this
grouping structure.
# The Group information is gone now.
bm_summary <- penguins %>%
group_by(species) %>%
ungroup() %>%
glimpse()
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Okay, now that we can group, let’s use summarise()
to
obtain the mean of each level of our grouping variable. To do so, pipe
into a summarise()
function after the
group_by()
function.
The summarise()
function works similar to
mutate()
. We create new values, so we first give our value
a new name. Call the new value mean_body_mass
. We then set
the value to be the result of calling some sort of function on a column.
In this case, we want to call mean()
on
body_mass_g
. Don’t forget to include
na.rm = TRUE
!
bm_summary <- penguins %>%
group_by(species) %>%
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE)) %>%
glimpse()
## Rows: 3
## Columns: 2
## $ species <fct> Adelie, Chinstrap, Gentoo
## $ mean_body_mass <dbl> 3700.662, 3733.088, 5076.016
Notice what happens to the tibble, it’s now 3 rows long and 2 columns wide. If you look at the data in the data viewer, you see something like this:
So the two remaining columns are (1) the column which was included in
group_by()
and (2) the new variable that was created. We
have reduced (or summarised) our tibble into a smaller set of data. And,
moreover, we can clearly see the 3 means for the 3 species in one place.
Calling the name of the tibble prints this information straight to the
console for us, neat!
## # A tibble: 3 × 2
## species mean_body_mass
## <fct> <dbl>
## 1 Adelie 3701.
## 2 Chinstrap 3733.
## 3 Gentoo 5076.
summarise()
Just like mutate()
, we can create as many variables as
we want within a single summarise()
function. Let’s try
that now. When we report the mean/average of something, we usually also
report the standard deviation. You can obtain the standard deviation in
R using the sd()
function. With this knowledge, let’s
create a new variable called sd_body_mass
alongside the
mean. It will also require the na.rm = TRUE
command
bm_summary <- penguins %>%
group_by(species) %>%
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE)) %>%
glimpse()
## Rows: 3
## Columns: 3
## $ species <fct> Adelie, Chinstrap, Gentoo
## $ mean_body_mass <dbl> 3700.662, 3733.088, 5076.016
## $ sd_body_mass <dbl> 458.5661, 384.3351, 504.1162
Just as before, there are 3 rows (one per level of species), but now we also have 3 columns. This should make sense, since we asked for a third value (standard deviation). Your data should look something like this:
Keep going! Add the min()
and max()
values
to the data, all inside the same summarise()
call. Give the
new variables the names min_body_mass
and
max_body_mass
:
bm_summary <- penguins %>%
group_by(species) %>%
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE),
min_body_mass = min(body_mass_g, na.rm = TRUE),
max_body_mass = max(body_mass_g, na.rm = TRUE)) %>%
glimpse()
## Rows: 3
## Columns: 5
## $ species <fct> Adelie, Chinstrap, Gentoo
## $ mean_body_mass <dbl> 3700.662, 3733.088, 5076.016
## $ sd_body_mass <dbl> 458.5661, 384.3351, 504.1162
## $ min_body_mass <int> 2850, 2700, 3950
## $ max_body_mass <int> 4775, 4800, 6300
Just like mutate, we can operate on variables we create
inside the summarise()
function! So, after
creating the min/max body mass columns, we could then perform a
calculation to obtain the range()
of the body mass (i.e.,
the distance between the minimum and maximum values). Instead of using a
function, let’s just subtract the max from the min.
bm_summary <- penguins %>%
group_by(species) %>%
summarise(mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE),
min_body_mass = min(body_mass_g, na.rm = TRUE),
max_body_mass = max(body_mass_g, na.rm = TRUE),
range_body_mass = max_body_mass - min_body_mass) %>%
glimpse()
## Rows: 3
## Columns: 6
## $ species <fct> Adelie, Chinstrap, Gentoo
## $ mean_body_mass <dbl> 3700.662, 3733.088, 5076.016
## $ sd_body_mass <dbl> 458.5661, 384.3351, 504.1162
## $ min_body_mass <int> 2850, 2700, 3950
## $ max_body_mass <int> 4775, 4800, 6300
## $ range_body_mass <int> 1925, 2100, 2350
Pretty cool right - we can create as many values as we need in order
to describe our population. However, we are missing a pretty important
one…counting the members of our population! A shorthand you might see
for this in manuscripts and elsewhere is something like
n = x
. There is happily a function called n()
which gives this exact information. Let’s try it out by asking for the
n
of our penguins species. It’s as simple as adding
n = n()
. I’m choosing to call it n
, but we
could call it anything (right?)
bm_summary <- penguins %>%
group_by(species) %>%
summarise(n = n(),
mean_body_mass = mean(body_mass_g, na.rm = TRUE),
sd_body_mass = sd(body_mass_g, na.rm = TRUE),
min_body_mass = min(body_mass_g, na.rm = TRUE),
max_body_mass = max(body_mass_g, na.rm = TRUE),
range_body_mass = max_body_mass - min_body_mass) %>%
glimpse()
## Rows: 3
## Columns: 7
## $ species <fct> Adelie, Chinstrap, Gentoo
## $ n <int> 152, 68, 124
## $ mean_body_mass <dbl> 3700.662, 3733.088, 5076.016
## $ sd_body_mass <dbl> 458.5661, 384.3351, 504.1162
## $ min_body_mass <int> 2850, 2700, 3950
## $ max_body_mass <int> 4775, 4800, 6300
## $ range_body_mass <int> 1925, 2100, 2350
The beauty of group_by()
is that you can add any number
of grouping variables to make calculations with. So, right now the
summarise function has been operating across the three different types
of species in the data. but what if we wanted to also know this
information separated by year? It’s as simple as adding
year
to the group_by
call. Let’s try it out
and look at the population and mean body mass of penguins, by species,
by year:
bm_year_summary <- penguins %>%
group_by(species, year) %>%
summarise(n = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 9 × 4
## # Groups: species [3]
## species year n mean_body_mass
## <fct> <int> <int> <dbl>
## 1 Adelie 2007 50 3696.
## 2 Adelie 2008 50 3742
## 3 Adelie 2009 52 3665.
## 4 Chinstrap 2007 26 3694.
## 5 Chinstrap 2008 18 3800
## 6 Chinstrap 2009 24 3725
## 7 Gentoo 2007 34 5071.
## 8 Gentoo 2008 46 5020.
## 9 Gentoo 2009 44 5141.
Repeat the above, but also include the sex
variable in
the summary. This has introduced a few rows where sex is NA. What is
going on? (Hint, look at the full data frame and find all the NA
values.)
bm_year_sex_summary <- penguins %>%
group_by(species, year, sex) %>%
summarise(n = n(), mean_body_mass = mean(body_mass_g, na.rm = TRUE))
## `summarise()` has grouped output by 'species', 'year'. You can override using the
## `.groups` argument.
## # A tibble: 22 × 5
## # Groups: species, year [9]
## species year sex n mean_body_mass
## <fct> <int> <fct> <int> <dbl>
## 1 Adelie 2007 female 22 3390.
## 2 Adelie 2007 male 22 4039.
## 3 Adelie 2007 <NA> 6 3540
## 4 Adelie 2008 female 25 3386
## 5 Adelie 2008 male 25 4098
## 6 Adelie 2009 female 26 3335.
## 7 Adelie 2009 male 26 3995.
## 8 Chinstrap 2007 female 13 3569.
## 9 Chinstrap 2007 male 13 3819.
## 10 Chinstrap 2008 female 9 3472.
## # ℹ 12 more rows
You probably want to report this summary information in a manuscript,
likely even as a table. You could copy and paste this
information from R into somewhere else, but why do that when you could
easily output the information as a .csv file? We can do so with the
write_csv()
function. The function takes two main arguments
- the name of the tibble you want to output, and the name you would like
to give the file. So, running the cell below will output the object
bm_year_summary
and give it the name
'my_awesome_penguins_summary.csv'
If you don’t tell R where to save the file, it will be saved to
whatever the current working directory is. If you are using an R
markdown or notebook file, it should be the same directory of wherever
that file is saved. You can check the current working directory using
the getwd()
function.
And if we were to open this up in a spreadsheet program, it might look like this: