Introduction
A common question retailers ask is what’s in a basket. There are almost as many approaches to this problem as there are baskets to explore, but this article focuses on a simple technique that highlights some common retail data wrangling in R.
One of the most common data operations for working with customers and transactions is to split data into smaller chunks, apply a function to each chunk, and extract the summary information needed for analysis. In R this means relying on lists, data frames, and functions.
We will use these techniques below to turn a list of transactions into a list of products that tell us something about our customers.
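As a minimal sketch of the split, apply, and combine pattern described above, the block below uses only base R on a small made-up table. The column names and the summarise_customer helper are assumptions for illustration and are not part of the article's data.

```r
# Toy transactions: one row per product line on a receipt (made-up values)
sales <- data.frame(
  customer = c(1, 1, 2, 2, 2),
  item     = c("a", "b", "a", "c", "c"),
  quantity = c(2, 1, 4, 3, 5)
)

# Split: one data frame per customer, held in a named list
by_customer <- split(sales, sales$customer)

# Apply: reduce each customer's rows to a one-row summary
# (summarise_customer is a hypothetical helper for this sketch)
summarise_customer <- function(df) {
  data.frame(
    customer = df$customer[1],
    lines    = nrow(df),
    units    = sum(df$quantity)
  )
}
summaries <- lapply(by_customer, summarise_customer)

# Combine: stack the one-row summaries back into a single data frame
result <- do.call(rbind, summaries)
print(result)
```

The same three steps appear throughout the rest of the article, only with larger data and a more involved function in the middle.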
Create Data
First we will create three weeks of random data where each product is a letter, and there is some quantity sold of each item.
To create our weeks we will use the map function from the purrr package, which is part of the tidyverse. It is similar to lapply in base R but offers more flexibility.
Map Function
The basic syntax for map is to pass a list, and then a function preceded by the “~” operator. This can be a custom function or a function from a package. Rather than naming a parameter for the function, we use “.x”, an alias that stands for the current element of the list being iterated over by map.
The map function will dutifully apply the function, in this case “MakeBasket”, to each element and return a list. Since the function returns a data frame, the result will be a list of data frames.
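To make the “~” and “.x” shorthand concrete, here is a small sketch on a three-element input. The square_row helper is hypothetical and stands in for any function that returns a data frame.

```r
library(purrr)

# A hypothetical helper that returns a one-row data frame
square_row <- function(n) {
  data.frame(n = n, squared = n^2)
}

# Formula shorthand: .x stands for the current element of the input
rows <- map(1:3, ~ square_row(.x))

# The same call, passing the function by name instead
rows_named <- map(1:3, square_row)

# The base R counterpart
rows_base <- lapply(1:3, square_row)

length(rows)   # one list element per input value
rows[[2]]      # the data frame built from the second element
```

The shorthand is most useful when the call needs extra arguments or a small inline expression. When the function takes the element as its only argument, passing it by name reads just as well.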
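options(scipen = 999)  # print large numbers such as TransID without scientific notation
library(tidyverse)
library(data.table)
Customers <- 1:1000
Products <- letters
# Build one random basket (a data frame of items) for customer x
MakeBasket <- function(x) {
  # Draw a random number of distinct products for this visit
  Items <- data.frame(Items = sample(Products, rpois(1, 6)))
  # rowwise() makes rpois() draw a separate quantity for every item
  Items <- rowwise(Items) %>% mutate(Quantity = rpois(1, 5), Customer = x)
  # One random transaction ID shared by every row in the basket
  TransID <- runif(1, min = 0, max = 1000000000)
  Items$TransID <- TransID
  return(Items)
}
# WEEK ONE
Baskets <- map(Customers, ~MakeBasket(.x))
Baskets <- rbindlist(Baskets)
# WEEK TWO
Baskets2 <- map(Customers, ~MakeBasket(.x))
Baskets2 <- rbindlist(Baskets2)
# WEEK THREE
Baskets3 <- map(Customers, ~MakeBasket(.x))
Baskets3 <- rbindlist(Baskets3)
# Stack the three weeks into one table of transactions
AllBaskets <- rbind(Baskets, Baskets2, Baskets3)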
List Binding
An efficient and easy way to transform a list of data frames back into a single data frame is to use rbindlist from the data.table package. This is a data transformation package known for having very speedy functions and an awkward but efficient syntax.
Applying rbindlist to each week’s list of baskets gives one data frame per week.
We can then combine these into a single table of all baskets using rbind, which simply binds similar data frames together row-wise.
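The difference between the two binding functions can be sketched on two tiny made-up weeks. The data and the “Week” label are assumptions for illustration. The idcol argument of rbindlist is an optional extra that the article's code does not use.

```r
library(data.table)

# Two small made-up weeks of sales
week1 <- data.frame(Items = c("a", "b"), Quantity = c(3, 5))
week2 <- data.frame(Items = c("a", "c"), Quantity = c(2, 7))

# rbindlist takes a list of data frames and returns one data.table
stacked <- rbindlist(list(week1, week2))

# With a named list, idcol records which element each row came from
labelled <- rbindlist(list(week1 = week1, week2 = week2), idcol = "Week")

# rbind takes the data frames as separate arguments
stacked_base <- rbind(week1, week2)

print(labelled)
```

Keeping a week label is not needed for the analysis in this article, but it can be handy if the weekly totals need to be compared later.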
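The preview below shows AllBaskets after it has been summarised by customer and item, which is done in the next section.
head(AllBaskets)
# Groups: Customer [1,000]
  Customer Items Quantity Purchases
     <int> <chr>    <int>     <int>
1        1 a           10         2
2        1 c            7         1
3        1 d            3         1
4        1 e            6         1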
Grouping Data
Now that we have a single data frame of baskets, we have to consider what’s important. In this case we’re interested in popular items bought by customers. Popularity can be measured by quantity, but sometimes it’s useful to understand how many times a customer purchased an item across all of their visits.
For example, if a customer buys eggs 12 weeks out of 15, that could be a more important item to focus on than 12 chocolate bars purchased a single time on December 31 to binge on before a New Year’s resolution.
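# One row per customer and item: total units and number of baskets containing the item
AllBaskets <- group_by(AllBaskets, Customer, Items) %>%
  summarize(Quantity = sum(Quantity), Purchases = n(), .groups = "drop_last")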
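The eggs and chocolate comparison can be sketched with the same group_by and summarize pattern. The rows below are made up for a single customer and only illustrate how quantity and purchase counts can point to different items.

```r
library(dplyr)

# Made-up transactions: eggs on three visits, chocolate on one visit
toy <- data.frame(
  Customer = c(1, 1, 1, 1),
  Items    = c("eggs", "eggs", "eggs", "chocolate"),
  Quantity = c(1, 1, 1, 12)
)

# Total units and number of purchases per customer and item
toy_summary <- toy %>%
  group_by(Customer, Items) %>%
  summarize(Quantity = sum(Quantity), Purchases = n(), .groups = "drop")

print(toy_summary)
```

Chocolate leads on Quantity while eggs lead on Purchases, which is why both measures are worth keeping.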
In order to get some useful information out of our customer baskets we can engineer a few measures in a custom function.
Our function below takes a single parameter x, which is a data frame for one customer. It first counts the number of unique items the customer has bought over all visits, ignoring quantities, then finds the top item by number of purchases and the top two items by units.
Next the function takes the items that exist in both lists, that is, items that are both high in purchase frequency and high in quantity for the customer.
Finally, we divide the number of times an item was purchased by the total number of unique items purchased to get the share of items.
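FindBest <- function(x) {
  # Number of unique items the customer bought (one row per item)
  TotalPurchase <- nrow(x)
  # Item bought on the most visits (slice_max keeps ties by default)
  TopVisits <- slice_max(x, n = 1, order_by = Purchases)
  # Top two items by total units
  TopUnits <- slice_max(x, n = 2, order_by = Quantity)
  # Items that appear in both lists
  ItemList <- intersect(TopVisits$Items, TopUnits$Items)
  Result <- filter(TopVisits, Items %in% ItemList) %>%
    mutate(Prop = Purchases / TotalPurchase)
  return(Result)
}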
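To see what the function returns, here is a worked sketch on two made-up customers. FindBest is restated from the article so the block runs on its own, and the input values are assumptions chosen to show one match and one empty result.

```r
library(dplyr)

# FindBest as defined in the article, restated so this block is self-contained
FindBest <- function(x) {
  TotalPurchase <- nrow(x)
  TopVisits <- slice_max(x, n = 1, order_by = Purchases)
  TopUnits <- slice_max(x, n = 2, order_by = Quantity)
  ItemList <- intersect(TopVisits$Items, TopUnits$Items)
  filter(TopVisits, Items %in% ItemList) %>%
    mutate(Prop = Purchases / TotalPurchase)
}

# Customer 1: item "a" is top by purchases and second by units
one_customer <- data.frame(
  Customer  = 1,
  Items     = c("a", "b", "c", "d"),
  Quantity  = c(9, 12, 4, 2),
  Purchases = c(3, 1, 2, 1)
)
best <- FindBest(one_customer)
print(best)   # item "a" with Prop = 3 purchases / 4 unique items

# Customer 2: the most frequent item is not in the top two by units
no_overlap <- data.frame(
  Customer  = 2,
  Items     = c("a", "b", "c"),
  Quantity  = c(3, 12, 10),
  Purchases = c(3, 1, 1)
)
empty <- FindBest(no_overlap)
nrow(empty)   # no item qualifies for this customer
```

A customer whose most frequent item is not among their top two by units contributes no row, so the final summary can cover fewer customers than the input.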
Next we need to split our large data frame by customer so we can run each customer through our function. This is accomplished with split in base R.
We then use lapply, a base R function similar to map. It takes a list and a function, in this case “FindBest”. When finished, lapply will return a list of data frames, one for each customer.
BasketSplit <- split(AllBaskets, AllBaskets$Customer)
BasketList <- lapply(BasketSplit, FindBest)
BasketDF <- do.call(rbind, BasketList)
Finally we need to combine our list of data frames back into a single data frame. We could use rbindlist, which is simpler and faster than base R, but to avoid having to load a package for a single operation it can be more convenient to use rbind.
Since rbind does not accept a list of data frames directly, we need to wrap it in do.call, passing the function along with the target list.
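A short sketch of what do.call does here, using three made-up one-row data frames: it calls rbind with each list element as a separate argument.

```r
# A list of small made-up data frames with matching columns
pieces <- list(
  data.frame(Items = "a", Count = 2),
  data.frame(Items = "b", Count = 5),
  data.frame(Items = "c", Count = 1)
)

# do.call builds the call rbind(pieces[[1]], pieces[[2]], pieces[[3]])
combined <- do.call(rbind, pieces)

# The same call written out by hand
by_hand <- rbind(pieces[[1]], pieces[[2]], pieces[[3]])

print(combined)
```

Calling rbind(pieces) instead would treat the whole list as a single argument and would not stack the rows, which is why the wrapper is needed.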
BasketDF <- group_by(BasketDF, Items) %>%
  summarize(Count = n(), Prop = mean(Prop), Quantity = sum(Quantity)) %>%
  arrange(desc(Count))
BasketDF
  Items Count  Prop Quantity
  <chr> <int> <dbl>    <int>
1 x        80 0.148     1015
2 c        74 0.156      891
3 j        74 0.158      919
4 f        71 0.147      835
5 z        71 0.143      840
Conclusion
Working with customer baskets is a fundamental skill in retail data.
We started with a list of weekly transactions and now have a data frame of popular items, based on how often each item appears in a customer’s favourites list, along with the average product share and total quantities for those items.