Introduction

A common question for retailers is what’s in a customer’s basket. There are almost as many approaches to this problem as there are baskets to explore, but this article focuses on a simple technique that highlights some common retail data wrangling in R.

One of the most common data operations when working with customers and transactions is to split data into smaller chunks, apply a function to each chunk, and extract the summary information needed for analysis. In R this means relying on lists, data frames, and functions.

We will use these techniques below to turn a list of transactions into a list of products that tell us something about our customers.

Create Data

First we will create three weeks of random data where each product is a letter, and there is some quantity sold of each item.

To create our weeks we will use the map function from the purrr package, which is part of the tidyverse. It is similar to lapply in base R but has more flexibility.

Map Function

The basic syntax for map is to pass a list, and then a function preceded by the “~” operator. This can be a custom function or a function from a package. Inside the function call we pass “.x”, a placeholder that stands for the current element of the list being iterated over by map.

map will dutifully apply the function, in this case “MakeBasket”, to each element and return a list. Since the function returns a data frame, the result is a list of data frames.
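
As a small illustration of the formula syntax, the sketch below applies a made-up helper, Square, to three numbers with both map and lapply. The helper and the input values are assumptions for this example only and are not part of the basket code.

```r
library(purrr)

# Hypothetical helper used only for this illustration
Square <- function(x) {
  x^2
}

# Formula style: .x stands for the current element of the input
Squares <- map(1:3, ~Square(.x))

# Base R equivalent: lapply takes the function directly
SquaresBase <- lapply(1:3, Square)

# Both calls return a list with one result per input element
identical(Squares, SquaresBase)
```

The formula style is most useful when the function needs extra arguments or the element has to appear somewhere other than the first argument.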

options(scipen = 999)
library(tidyverse)
library(data.table)

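Customers <- 1:1000

Products <- letters

# Build one random basket (a data frame of items) for customer x
MakeBasket <- function(x) {
  # Draw a random number of distinct products for this basket
  Items <- data.frame(Items = sample(Products, rpois(1, 6)))
  # rowwise() makes rpois() draw a separate quantity for each item
  Items <- rowwise(Items) %>% mutate(Quantity = rpois(1, 5), Customer = x)
  # Random number used as a stand-in transaction ID
  TransID <- runif(1, min = 0, max = 1000000000)
  Items$TransID <- TransID
  return(Items)
}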

#WEEK ONE
Baskets <- map(Customers, ~MakeBasket(.x))
Baskets <- rbindlist(Baskets)

#WEEK TWO
Baskets2 <- map(Customers, ~MakeBasket(.x))
Baskets2 <- rbindlist(Baskets2)

#WEEK THREE
Baskets3 <- map(Customers, ~MakeBasket(.x))
Baskets3 <- rbindlist(Baskets3)

AllBaskets <- rbind(Baskets, Baskets2, Baskets3)

List Binding

An efficient and easy way to transform a list of data frames back into a single data frame is to use rbindlist from the data.table package. This is a data transformation package known for having very speedy functions and an awkward but efficient syntax.

Applying rbindlist to the list of baskets results in a data frame for each of our three weeks.

We can further combine these into a single table of all baskets using rbind, which simply binds similar data frames together in a row-wise manner.
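
To make the two binding steps concrete, here is a minimal sketch on hand-made data. The two tiny "weeks" and their values are invented for illustration; only rbindlist and rbind come from the article.

```r
library(data.table)

# Two tiny made-up weeks of transactions (illustrative values only)
WeekA <- list(
  data.frame(Customer = 1, Items = c("a", "b"), Quantity = c(2, 5)),
  data.frame(Customer = 2, Items = "c", Quantity = 1)
)
WeekB <- list(
  data.frame(Customer = 1, Items = "a", Quantity = 3)
)

# rbindlist collapses each list of data frames into one table
WeekA <- rbindlist(WeekA)
WeekB <- rbindlist(WeekB)

# rbind stacks the weekly tables row-wise
Both <- rbind(WeekA, WeekB)
nrow(Both)
```

Both steps assume the data frames share the same column names, which holds here because every basket is built the same way.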

Grouping Data

Now that we have a single data frame of transactions, we have to consider what’s important. In this case we’re interested in popular items bought by customers. That can be measured in terms of quantity, but sometimes it’s useful to understand how many times a customer purchased an item across all of their visits.

For example, if a customer buys eggs almost every week out of 12, that could be a more important item to focus on than 12 chocolate bars purchased a single time on December 31 to binge on before a New Year’s resolution.

AllBaskets <- group_by(AllBaskets, Customer, Items) %>% 
  summarize(Quantity = sum(Quantity), Purchases = n())

head(AllBaskets)
# Groups:   Customer [1,000]
   Customer Items Quantity Purchases
      <int> <chr>    <int>     <int>
 1        1 a           10         2
 2        1 c            7         1
 3        1 d            3         1
 4        1 e            6         1
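
The difference between quantity and purchase count is easier to see on a toy example. The sketch below uses invented transactions for one customer, echoing the eggs and chocolate example; the `.groups = "drop"` argument is an addition here that returns an ungrouped result and is not used in the article's code.

```r
library(dplyr)

# Made-up transactions for one customer (illustrative values only)
Toy <- data.frame(
  Customer = c(1, 1, 1, 1),
  Items    = c("eggs", "eggs", "eggs", "chocolate"),
  Quantity = c(1, 1, 1, 12)
)

# One row per customer and item: total units and number of purchases
ToySummary <- group_by(Toy, Customer, Items) %>%
  summarize(Quantity = sum(Quantity), Purchases = n(), .groups = "drop")

ToySummary
```

Here chocolate wins on Quantity (12 units in one purchase) while eggs win on Purchases (three separate purchases), which is why the article tracks both measures.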

In order to get some useful information out of our customer baskets we can engineer a few measures in a custom function.

Our function below takes a single parameter x, which is the data frame for one customer. It first counts the number of distinct items the customer has bought over all visits, ignoring quantities, then finds the top item by number of purchases and the top two items by units.

Next the function keeps the items that appear in both lists, that is, items that are high in both purchase frequency and quantity for the customer.

Finally, we divide the number of times an item was purchased by the total number of unique items purchased to get the item’s share.

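FindBest <- function(x) {
  # Number of distinct items the customer bought (one row per item)
  TotalPurchase <- nrow(x)
  # Most frequently purchased item; slice_max keeps ties by default
  TopVisits <- slice_max(x, n = 1, order_by = Purchases)
  # Top two items by total units
  TopUnits <- slice_max(x, n = 2, order_by = Quantity)
  # Items that appear in both lists
  ItemList <- intersect(TopVisits$Items, TopUnits$Items)

  Result <- filter(TopVisits, Items %in% ItemList) %>%
    mutate(Prop = Purchases / TotalPurchase)

  return(Result)
}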
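
To see what these steps produce, the sketch below walks through the same logic by hand for one invented customer. The data frame OneCustomer and its values are assumptions made up for this illustration.

```r
library(dplyr)

# Made-up summary for a single customer (illustrative values only)
OneCustomer <- data.frame(
  Items     = c("a", "b", "c", "d"),
  Quantity  = c(10, 4, 12, 2),
  Purchases = c(3, 1, 2, 1)
)

TotalPurchase <- nrow(OneCustomer)                                # 4 distinct items
TopVisits <- slice_max(OneCustomer, n = 1, order_by = Purchases)  # item a
TopUnits  <- slice_max(OneCustomer, n = 2, order_by = Quantity)   # items c and a
ItemList  <- intersect(TopVisits$Items, TopUnits$Items)           # "a"

# Keep the overlapping item and compute its share: 3 / 4
Result <- filter(TopVisits, Items %in% ItemList) %>%
  mutate(Prop = Purchases / TotalPurchase)
```

Two edge cases follow from this logic. If the most frequently purchased item is not among the top two by units, the result has zero rows for that customer. Because slice_max keeps ties by default, a customer can also contribute more than one row.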

Next we need to split our large data frame by customer so we can run each customer through our function. This is accomplished with split in base R.

We then use lapply, a base R function similar to map. It takes a list and a function, in this case “FindBest.” When finished, lapply returns a list of data frames, one for each customer.

BasketSplit <- split(AllBaskets, AllBaskets$Customer)
BasketList <- lapply(BasketSplit, FindBest)
BasketDF <- do.call(rbind, BasketList)

Finally, we need to combine our list of data frames back into a single data frame. We could use rbindlist, which is simpler and faster than base R, but to avoid having to load a package for a single operation it can be more convenient to use rbind.

Since rbind does not accept a list directly, we need to wrap it in do.call and pass the function along with the target list.
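
The whole split, apply, and combine pattern can be sketched in base R on a few rows. The data and the helper TopRow below are hypothetical and stand in for the real baskets and FindBest.

```r
# Made-up purchases for two customers (illustrative values only)
Toy <- data.frame(
  Customer = c(1, 1, 2),
  Items    = c("a", "b", "a"),
  Quantity = c(3, 1, 5)
)

# Hypothetical per-customer summary: keep the row with the largest quantity
TopRow <- function(x) {
  x[which.max(x$Quantity), ]
}

Pieces   <- split(Toy, Toy$Customer)  # list with one data frame per customer
TopList  <- lapply(Pieces, TopRow)    # list of one-row data frames
Combined <- do.call(rbind, TopList)   # same as rbind(TopList[[1]], TopList[[2]])
```

do.call passes each element of the list to rbind as a separate argument, which is why it works where calling rbind on the list itself does not.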

BasketDF <- group_by(BasketDF, Items) %>% 
  summarize(Count = n(), Prop = mean(Prop), Quantity = sum(Quantity)) %>%
  arrange(desc(Count))
  
BasketDF
   Items Count  Prop Quantity
   <chr> <int> <dbl>    <int>
 1 x        80 0.148     1015
 2 c        74 0.156      891
 3 j        74 0.158      919
 4 f        71 0.147      835
 5 z        71 0.143      840

Conclusion

Working with customer baskets is a fundamental skill in retail data.

We started with a list of weekly transactions and now have a data frame of popular items, based on how often each item appears in a customer’s favourites list, along with the average product share and total quantities for those items.
