# load in tidyverse and the penguin data
library(tidyverse)
library(palmerpenguins)
data(penguins)

Correlation is a measure of covariance

A correlation is a common way to measure the strength of a relationship between two continuous variables. More precisely, a correlation measures the degree to which two continuous variables co-vary (called their covariance). Put differently, a correlation determines how much two variables will change together.
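
One way to write this down is with a standard identity (not spelled out in the code below): the Pearson correlation is just the covariance rescaled by the two standard deviations,

\[ r = \frac{\mathrm{cov}(x, y)}{s_x \, s_y} \]

where \(s_x\) and \(s_y\) are the sample standard deviations of the two variables.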

Correlation values range from -1 to 1

The most common correlation measure is the Pearson correlation coefficient, normally represented by \(r\). Values for \(r\) range from -1 to 1. Negative values mean the variables co-vary in opposite directions: as one goes up, the other goes down. Positive values mean they co-vary in the same direction: as one goes up, the other also goes up.
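
As a quick illustration with made-up vectors (not the penguin data), a perfect positive linear relationship gives \(r = 1\) and a perfect negative one gives \(r = -1\):

# a perfectly increasing pair: r is exactly 1
cor(1:10, 1:10)
## [1] 1

# a perfectly decreasing pair: r is exactly -1
cor(1:10, 10:1)
## [1] -1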

Let’s look at the Pearson correlation between penguin body mass and penguin flipper length:

Method 1: using cor

The cor function takes two numeric vectors and returns \(r\).

The argument use = 'complete.obs' tells the function to drop any observation with an NA in either variable.

cor(penguins$body_mass_g, penguins$flipper_length_mm, use = 'complete.obs')
## [1] 0.8712018
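
For comparison, without use = 'complete.obs' the default behaviour propagates missing values, so the same call returns NA (two penguins are missing these measurements):

cor(penguins$body_mass_g, penguins$flipper_length_mm)
## [1] NA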

Method 2: using cor.test

The cor.test function provides more information right away: a test statistic (t), the degrees of freedom (df), a p-value, a 95% confidence interval, and \(r\).

cor.test(penguins$body_mass_g, penguins$flipper_length_mm)
## 
##  Pearson's product-moment correlation
## 
## data:  penguins$body_mass_g and penguins$flipper_length_mm
## t = 32.722, df = 340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.843041 0.894599
## sample estimates:
##       cor 
## 0.8712018

A positive correlation of 0.87 is strong (the maximum being 1.0). It indicates a strong positive relationship between the two variables. Plotting the regression line shows this relationship: if every point fell exactly on the line, \(r\) would be 1.0.

ggplot(penguins, aes(y = body_mass_g, x = flipper_length_mm)) + 
  geom_point() + 
  geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'

How do we obtain \(r\)?

To calculate the covariance of two continuous variables, we need to put the variables on the same scale and then see how the scaled values move together. We already know how to put variables on the same scale using z-scores.

A Pearson correlation will:

  1. z-score the variables
  2. multiply the z-scores
  3. sum the products of the z-scores
  4. calculate an adjusted mean of the products (divide this sum by n-1)

The adjusted mean is the correlation coefficient.

Let’s look at a toy example first using short vectors:

# create two vectors of numbers
v1 <- c(10, 20, 30, 4, 5, 6)
v2 <- c(4, 5, 6, 10, 20, 30)

v1
## [1] 10 20 30  4  5  6
v2
## [1]  4  5  6 10 20 30

Manually z-score the vectors:

# calculate the z-scores
z_v1 <- (v1 - mean(v1)) / sd(v1)
z_v2 <- (v2 - mean(v2)) / sd(v2)

z_v1
## [1] -0.2406741  0.7220222  1.6847184 -0.8182918 -0.7220222 -0.6257526
z_v2
## [1] -0.8182918 -0.7220222 -0.6257526 -0.2406741  0.7220222  1.6847184

We then obtain their products, which R lets us do quite easily:

# calculate the products of the z-scores
products <- z_v1 * z_v2 

products
## [1]  0.1969416 -0.5213160 -1.0542169  0.1969416 -0.5213160 -1.0542169

Now that we have the products of the z-scored data, we can calculate their mean using an adjusted denominator of n - 1:

# how many observations? 
n <- length(v1)  

# divide the sum by n-1
r <- sum(products) / (n-1)

What is the correlation?

r
## [1] -0.5514365

Let’s check our work:

cor.test(v1, v2)
## 
##  Pearson's product-moment correlation
## 
## data:  v1 and v2
## t = -1.322, df = 4, p-value = 0.2567
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9416060  0.4708349
## sample estimates:
##        cor 
## -0.5514365
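
The same recipe recovers the penguin correlation from earlier. This is a sketch, assuming we keep only the rows where both measurements are present (scale() z-scores a vector for us):

# keep rows where both measurements are present
ok <- complete.cases(penguins$body_mass_g, penguins$flipper_length_mm)

# z-score, multiply, sum, and divide by n - 1
z_mass    <- scale(penguins$body_mass_g[ok])
z_flipper <- scale(penguins$flipper_length_mm[ok])
sum(z_mass * z_flipper) / (sum(ok) - 1)
## [1] 0.8712018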

A one-predictor lm is a correlation when you z-score \(y\) and \(x\)

Understanding what the correlation is doing gives us a better understanding of what a regression is doing. Indeed, a linear regression with a z-scored outcome and a single z-scored predictor will produce the correlation coefficient as its slope estimate. Let’s try it out with penguin body mass and flipper length:

First, z-score the variables:

penguins$body_mass_g_z <- (penguins$body_mass_g - 
                             mean(penguins$body_mass_g, na.rm = TRUE)) / 
  sd(penguins$body_mass_g, na.rm = TRUE)

penguins$flipper_length_mm_z <- (penguins$flipper_length_mm - 
                                   mean(penguins$flipper_length_mm, na.rm = TRUE)) /
  sd(penguins$flipper_length_mm, na.rm = TRUE)
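
An equivalent shortcut, if you prefer it, is the built-in scale() function (wrapped in as.numeric() because scale() returns a one-column matrix):

# same z-scores as above, using scale()
penguins$body_mass_g_z       <- as.numeric(scale(penguins$body_mass_g))
penguins$flipper_length_mm_z <- as.numeric(scale(penguins$flipper_length_mm))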

Then fit a linear model:

The estimate for flipper_length_mm_z is 0.871, exactly the same as the correlation coefficient.

summary(lm(body_mass_g_z ~ flipper_length_mm_z, data = penguins))
## 
## Call:
## lm(formula = body_mass_g_z ~ flipper_length_mm_z, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.32027 -0.32330 -0.03352  0.30841  1.60693 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.025e-15  2.659e-02    0.00        1    
## flipper_length_mm_z 8.712e-01  2.662e-02   32.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4916 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.759,  Adjusted R-squared:  0.7583 
## F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

Visualise this covariance

Let’s plot the two z-scored distributions over one another to visualise the covariance:
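
The distros data frame isn’t built in the code above; a minimal sketch, assuming it holds the two z-scored variables stacked into long format with variable and value columns, would be:

# stack the two z-scored variables into long format for plotting
distros <- penguins %>% 
  select(body_mass_g_z, flipper_length_mm_z) %>% 
  pivot_longer(everything(), names_to = 'variable', values_to = 'value')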

ggplot(distros, aes(x = value, fill = variable)) + 
  geom_density(alpha = .5)

Conclusion

There you have it: a correlation is a measure of the covariance between two variables once they have been standardised.

A linear regression applies the same logic to see how predictors influence an outcome variable. When you add more than one predictor, the model also has to account for the covariance among the predictors themselves, which quickly increases the complexity of the calculations.
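
As an illustration of that last point (not part of the analysis above), adding a second z-scored predictor such as bill length means the model now has to account for how the predictors co-vary with each other as well as with the outcome:

# z-score a second predictor and add it to the model
penguins$bill_length_mm_z <- as.numeric(scale(penguins$bill_length_mm))

lm(body_mass_g_z ~ flipper_length_mm_z + bill_length_mm_z, data = penguins)

The slope for flipper_length_mm_z will no longer equal the simple correlation of 0.871, because part of its covariance with body mass is shared with bill length.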