Introduction
Correlation between variables suggests a directional association such that an increase in one variable is associated with an increase or decrease in the other variable. For example, we might find that income is correlated with spending on luxury goods.
Partial correlation has a similar aim but it allows us to control for the effect of related variables in order to get a more precise understanding of the relationships between individual variables.
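To build some intuition for what "controlling for" means, here is a minimal base R sketch on simulated data. The variables x, y and z are hypothetical and not part of the sales dataset: z drives both x and y, so x and y look correlated until z is accounted for. One common way to compute a partial correlation is to correlate the residuals left over after regressing each variable on the control.

```r
# Simulated data (hypothetical): z drives both x and y
set.seed(42)
n <- 5000
z <- rnorm(n)
x <- 2 * z + rnorm(n)
y <- -1 * z + rnorm(n)

# Plain Pearson correlation: strongly negative, because both depend on z
raw_r <- cor(x, y)

# Partial correlation: correlate what is left of x and y after removing z
res_x <- resid(lm(x ~ z))
res_y <- resid(lm(y ~ z))
partial_r <- cor(res_x, res_y)

round(c(raw = raw_r, partial = partial_r), 3)
```

The raw correlation is strongly negative, while the partial correlation should sit close to zero, because x and y share nothing beyond their common dependence on z.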
For this example we will use a customer sales and insights dataset available on Kaggle. This dataset contains customers and related sales and marketing metrics.
Loading Data
The first step is to load our packages. We will use the popular ppcor package for partial correlations. We will also remove non-numeric columns, since these methods cannot be applied directly to categorical data.
Note that ppcor loads the MASS package, which masks the select function in dplyr, so we need to call dplyr::select() explicitly or it will throw an error.
library(tidyverse)
library(ppcor)
library(psych)
SalesData <- read_csv("sales_and_customer_insights.csv")
SalesData <- dplyr::select(SalesData, where(is.numeric))
Data Alterations
None of the variables in the original dataset are meaningfully correlated with each other, which would make for a boring data science article, so we will add some linear correlations to a few of the variables.
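MakeCorrelated <- function(x) {
  set.seed(123)  # fixed seed so the simulated columns are reproducible
  # Lifetime value: linear in churn probability, plus noise
  LinLV <- x$Churn_Probability * 500 + 100 + rnorm(nrow(x), mean = 100 + runif(nrow(x), -25, 25), sd = 50)
  # Average order value: linear in lifetime value, plus noise
  LinAOV <- LinLV * 0.5 + 100 + rnorm(nrow(x), mean = 50 + runif(nrow(x), -5, 5), sd = 25)
  return(
    data.frame(
      LinLV = LinLV,
      LinAOV = LinAOV
    )
  )
}
# Attach the simulated columns and drop the originals they replace
SalesData <- cbind(SalesData, MakeCorrelated(SalesData)) %>%
  dplyr::select(-Lifetime_Value, -Average_Order_Value)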
cor(SalesData$Churn_Probability, SalesData$LinLV)
We now have LinLV (lifetime value) which is based on churn probability, and LinAOV (average order value) which is in turn based on lifetime value.
Panel Charts
A good first step in looking at correlations is to run a panel chart. The psych package has a very nice implementation.
pairs.panels(SalesData)
As expected, churn probability correlates highly with lifetime value and average order value.
We can see that the two simulated variables look approximately normally distributed and that their relationships appear linear. This is important for Pearson correlation but less so for Spearman correlation, which is rank-based and can also capture monotonic non-linear relationships.
Other variables look random. They appear to have uniform distributions and no relation to other variables.
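The difference between Pearson and Spearman correlation is easy to see on a toy example. The sketch below uses made-up vectors x and y (not columns from the sales data) with a relationship that is perfectly monotonic but far from linear.

```r
# Hypothetical toy data: y grows exponentially with x
x <- seq(0.1, 10, by = 0.1)
y <- exp(x)

# Pearson measures linear association, so the curvature pulls it below 1
pearson_r <- cor(x, y, method = "pearson")

# Spearman works on ranks, so any strictly increasing relationship gives 1
spearman_r <- cor(x, y, method = "spearman")

round(c(pearson = pearson_r, spearman = spearman_r), 3)
```

Because the ranks of x and y are identical, Spearman returns 1, while Pearson understates the strength of the relationship. For the sales data the relationships look linear, so Pearson is a reasonable choice.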
Correlation
Since the relationships look linear, we will use Pearson correlation to explore them.
round(cor(SalesData, method = "pearson"),3)[,3:5]
                       Churn_Probability LinLV LinAOV
Purchase_Frequency                 0.018 0.018  0.018
Time_Between_Purchases             0.010 0.011  0.008
Churn_Probability                  1.000 0.941  0.894
LinLV                              0.941 1.000  0.950
LinAOV                             0.894 0.950  1.000
Partial Correlation
Next we will explore partial correlations between these variables with the pcor function. The output is quite large so we will extract the relevant parts for brevity.
round(pcor(SalesData)$estimate[,3:5],3)
round(pcor(SalesData)$p.value[,3:5],3)
> round(pcor(SalesData)$estimate[,3:5],3)
                       Churn_Probability  LinLV LinAOV
Purchase_Frequency                 0.004 -0.001  0.003
Time_Between_Purchases             0.000  0.009 -0.009
Churn_Probability                  1.000  0.655  0.001
LinLV                              0.655  1.000  0.717
LinAOV                             0.001  0.717  1.000
> round(pcor(SalesData)$p.value[,3:5],3)
                       Churn_Probability LinLV LinAOV
Purchase_Frequency                 0.677 0.938  0.745
Time_Between_Purchases             0.962 0.362  0.394
Churn_Probability                  0.000 0.000  0.934
LinLV                              0.000 0.000  0.000
LinAOV                             0.934 0.000  0.000
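The estimates at the top of the readout tell us much more than a simple correlation matrix. Holding the other variables constant, the correlation between lifetime value and churn probability drops from 0.941 to 0.655.
Similarly, average order value has a lower correlation with lifetime value (0.717 rather than 0.950) and almost no correlation with churn probability (0.001 rather than 0.894) once the other variables are held constant. This is a very different result from the one we obtained with a basic correlation matrix.
P-values are also conveniently available to assess whether the partial correlations are significant. Here the estimates for churn probability with lifetime value and for lifetime value with average order value are significant (their p-values round to 0.000), while the estimate for churn probability with average order value is not (p = 0.934).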
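As a rough cross-check, the partial correlations among the three related variables can be approximated by hand from the rounded Pearson matrix, using the textbook first-order partial correlation formula. This is only a sketch: partial_from_r is a hypothetical helper, and it controls for a single variable, whereas pcor(SalesData) controls for all remaining columns. Since the other two columns are nearly uncorrelated with everything, the results should still land close to the pcor estimates.

```r
# Hypothetical helper: first-order partial correlation of x and y given z,
# computed from the three pairwise Pearson correlations
partial_from_r <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Rounded pairwise correlations copied from the Pearson matrix
r_churn_lv  <- 0.941
r_churn_aov <- 0.894
r_lv_aov    <- 0.950

# Arguments are (r_xy, r_xz, r_yz) for each choice of x, y and control z
churn_lv_given_aov <- partial_from_r(r_churn_lv, r_churn_aov, r_lv_aov)
lv_aov_given_churn <- partial_from_r(r_lv_aov, r_churn_lv, r_churn_aov)
churn_aov_given_lv <- partial_from_r(r_churn_aov, r_churn_lv, r_lv_aov)

round(c(churn_lv_given_aov, lv_aov_given_churn, churn_aov_given_lv), 3)
```

The three values come out at roughly 0.655, 0.717 and 0.000, in line with the partial correlation matrix. Small differences are expected because the inputs are rounded to three decimals.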
We can also test specific pairs of variables with pcor.test. We look at the correlation of vector x with vector y, while holding a third vector z constant. This gives a more reliable, as in repeatable, estimate than we would get by controlling for many variables at the same time.
pcor.test(
x = SalesData$Churn_Probability,
y = SalesData$LinAOV,
z = SalesData$LinLV)
      estimate   p.value  statistic     n gp  Method
1 0.0008510263 0.9321916 0.08508989 10000  1 pearson
We obtain a similar result to the partial correlation matrix. The correlation between churn probability and average order value is not significant after controlling for lifetime value. In other words, the association between the two appears to run through lifetime value rather than being direct.
This makes sense because when creating the data we built lifetime value from churn probability, and then based average order value on lifetime value.
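The statistic and p.value columns in the pcor.test readout can be reproduced from the estimate, n and gp (the number of control variables). The sketch below assumes the usual t-test for a partial correlation with n - 2 - gp degrees of freedom; that this is what ppcor computes is an assumption, although the numbers match the printed output.

```r
# Values copied from the pcor.test readout
estimate <- 0.0008510263
n  <- 10000  # number of observations
gp <- 1      # number of control variables (here only LinLV)

# t statistic for a partial correlation, with n - 2 - gp degrees of freedom
df <- n - 2 - gp
statistic <- estimate * sqrt(df / (1 - estimate^2))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(statistic), df)

c(statistic = statistic, p_value = p_value)
```

With 10,000 rows even small correlations would be flagged as significant, so a p-value above 0.9 is a strong sign that nothing is left between churn probability and average order value once lifetime value is controlled for.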
Semi Partial Correlation
The ppcor package also has functions for semi-partial correlations. In partial correlation the effect of the third variable is removed from both X and Y. In semi-partial correlation it is removed from only one of the two (in ppcor, from the second variable, Y), while the other is left as is.
To investigate semi-partial correlations we can use spcor or spcor.test, which take the same arguments as their pcor counterparts.
round(spcor(SalesData)$estimate[,3:5],3)
spcor.test(SalesData$Churn_Probability, SalesData$LinAOV, SalesData$LinLV)
> round(spcor(SalesData)$estimate[,3:5],3)
                       Churn_Probability  LinLV LinAOV
Purchase_Frequency                 0.004 -0.001  0.003
Time_Between_Purchases             0.000  0.009 -0.009
Churn_Probability                  1.000  0.293  0.000
LinLV                              0.204  1.000  0.243
LinAOV                             0.000  0.321  1.000
> spcor.test(SalesData$Churn_Probability, SalesData$LinAOV, SalesData$LinLV)
      estimate   p.value  statistic     n gp  Method
1 0.0002880153 0.9770269 0.02879721 10000  1 pearson
With semi-partial correlation, the estimates involving LinLV and LinAOV are lower still.
In addition, the correlations are no longer symmetric. LinLV correlates 0.293 with churn probability, but churn probability correlates 0.204 with LinLV.
This directionality can be useful. The X variables appear in the rows and the Y variables in the columns. We can therefore see that churn probability does a better job of predicting LinLV than vice versa, which provides clues about causality.
Again, this makes sense because we originally built LinLV off churn probabilities.
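The asymmetry can be reproduced by hand from the rounded Pearson correlations. The sketch below uses the textbook semi-partial formula in which the control variable z is removed from y only. semipartial_from_r is a hypothetical helper, and treating the column variable of the spcor matrix as y is an assumption, although the printed values are consistent with it.

```r
# Hypothetical helper: semi-partial correlation of x with y, where z is
# removed from y only and x is left untouched
semipartial_from_r <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt(1 - r_yz^2)
}

# Rounded pairwise correlations copied from the Pearson matrix
r_churn_lv  <- 0.941
r_churn_aov <- 0.894
r_lv_aov    <- 0.950

# x = churn probability, y = LinLV, control z = LinAOV
churn_with_lv <- semipartial_from_r(r_churn_lv, r_churn_aov, r_lv_aov)

# x = LinLV, y = churn probability, control z = LinAOV
lv_with_churn <- semipartial_from_r(r_churn_lv, r_lv_aov, r_churn_aov)

round(c(churn_with_lv, lv_with_churn), 2)
```

This gives about 0.29 and 0.20, in line with the 0.293 and 0.204 in the spcor matrix. The numerator is the same in both directions. Only the denominator changes, depending on which variable has the control removed from it, and that is where the asymmetry comes from.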
Conclusion
Partial correlations can help unmask more precise relationships between variables and provide clues about the direction of causal effects. The ppcor package makes understanding these relationships a breeze.