Customer clustering

Challenge

One of the most successful online groceries in Spain, which makes over 1 million in monthly revenue and claims extraordinary customer satisfaction, currently serves the main cities with fresh food and the rest of peninsular Spain with non-perishable items.

Against this background, the company had historical transaction data for one city and wanted to know how it could support marketing: getting to know its end customers better and understanding the different segments, so that campaigns could address them more precisely and deliver a higher ROI.

Another possible application would be to feed this information into Supply Chain Management¹, capturing end customers' buying behavior (days, hours) to better digitize inventory and delivery planning. The data contained 30,000 orders for 10,238 customers. It was superbly well crafted, with nothing left to do on the data-wrangling side. Kudos to the data team!

Solution

So we have here a well-known segmentation problem (i.e., unsupervised learning, because there’s no target variable, or label in data-science language). Keep in mind that clustering is not deterministic: the algorithm involves some randomness, so the composition of the clusters, and how they are numbered, may differ from one run to the next.
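As a minimal sketch of that point, the snippet below runs base R's kmeans() twice on synthetic data (not the grocery data) with the same seed. Fixing the seed is one common way to make a run repeatable, and nstart asks for several random starts so the best one is kept. The data and the choice of 3 clusters are assumptions made only for illustration.

```r
# Toy data: 300 synthetic points in 2D around three different means
set.seed(1)
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 3), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# kmeans() starts from randomly chosen centers. Fixing the seed makes a run
# repeatable; nstart tries several random starts and keeps the solution with
# the lowest total within-cluster sum of squares.
set.seed(42)
fit_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(42)
fit_b <- kmeans(x, centers = 3, nstart = 25)

# Same seed, same cluster assignment for every point
identical(fit_a$cluster, fit_b$cluster)
```

Without a fixed seed, the same groups are often recovered but may be numbered differently, which is worth remembering when comparing cluster #1 or #3 across runs.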

Let’s start the work by loading the basic libraries first and connecting to the data.world API.

Now we can start playing around with the data…
Let’s take a quick look at the transaction orders, so we can see how frequently customers buy and define a criterion for loyal customers (do you think that 4 transactions in one month will be okay?). Let’s see…

Figure showing loyal customers, those who buy regularly from the grocery (4 or more times during the observed period), versus occasional buyers.

Yes, we can see that the transactions are split roughly 50/50 between loyal and sporadic customers, which gives a clear line of sight on the value that loyal customers bring to the organization.

With this first slice of the customer data, we can say that 1,174 customers (12% of the total) made 14,645 purchases. The remaining 15,355 transactions (51% of the total) were made by 8,465 customers. We won’t be looking at the value of the basket at this stage, but it is of course another variable to take into account.
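A split like this can be sketched in a few lines of base R. The example below uses a synthetic orders table, and both the column names (order_id, customer_id) and the loyalty rule of 4 or more orders are assumptions for illustration, not the actual data or the exact rule used in the analysis.

```r
# Synthetic orders table, one row per order (hypothetical column names)
set.seed(7)
orders <- data.frame(
  order_id    = 1:1000,
  customer_id = sample(1:300, size = 1000, replace = TRUE, prob = (300:1)^2)
)

# Number of orders per customer
per_customer <- as.data.frame(table(customer_id = orders$customer_id),
                              responseName = "n_orders")

# Assumed loyalty rule: 4 or more orders in the observed period
loyal_threshold <- 4
per_customer$segment <- ifelse(per_customer$n_orders >= loyal_threshold,
                               "loyal", "sporadic")

# Customers and orders per segment
customers_by_segment <- table(per_customer$segment)
orders_by_segment    <- tapply(per_customer$n_orders, per_customer$segment, sum)

# Shares in %, the same kind of figures quoted in the text
round(100 * prop.table(customers_by_segment), 1)
round(100 * orders_by_segment / sum(orders_by_segment), 1)
```

Changing loyal_threshold is a quick way to test how sensitive the split is to the definition of a loyal customer.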

Loyal customers

Now, we can start clustering loyal customers, so we can understand who buys what by looking into their baskets²:
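The clustering step itself is not shown in this post. As a rough sketch, assuming the sc object comes from base R's kmeans() applied to each customer's basket expressed as a percentage per category, it could look like the following. The data here is synthetic and the name basket_pct is hypothetical.

```r
set.seed(123)
categories <- c("food", "fresh", "drinks", "home",
                "beauty", "health", "baby", "pets")

# Synthetic spend per customer and category (NOT the real grocery data)
n_customers <- 500
spend <- matrix(rexp(n_customers * length(categories), rate = 1 / 20),
                nrow = n_customers,
                dimnames = list(NULL, categories))

# Express each basket as % of the customer's total spend, so rows sum to 100
basket_pct <- 100 * spend / rowSums(spend)

# k-means with 5 clusters; nstart reduces the dependence on the random start
sc <- kmeans(basket_pct, centers = 5, nstart = 25, iter.max = 50)

# One row per cluster, one column per category (average basket composition)
round(sc$centers, 2)
```

Working with percentages rather than raw spend keeps big spenders from dominating the distance calculation, so clusters reflect what customers buy rather than how much.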

Let’s see how each cluster is characterized by categories like Food, Fresh, Drinks, … From here, once you know the % of customers in each cluster, you can identify and group your customers to target them with personalised marketing campaigns, right?

sc$centers

##   food fresh drinks  home beauty health  baby   pets
## 1 29.3  8.18  15.71 22.21   5.44  1.995 16.55  0.538
## 2 28.6 36.59  16.93  8.69   3.93  1.099  3.14  0.925
## 3 15.7 12.22   7.66  6.51   2.66  1.869  2.69 50.680
## 4 14.4  8.65  64.07  7.99   2.10  0.728  1.67  0.354
## 5 14.3 66.44   9.20  5.24   2.15  0.425  1.66  0.528

We can see the same information in a much more convenient way after some quick rearrangement of the data:

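# Plot the cluster centers as a heatmap: one tile per (category, cluster) pair
library(ggplot2)

ggplot(data = center_reshape, aes(x = features, y = cluster, fill = values)) +
  scale_y_continuous(breaks = seq(1, 7, by = 1)) +
  geom_tile() +
  coord_equal() +
  scale_fill_gradient(low = "brown", high = "yellow") +
  # theme_secun() is not part of ggplot2; presumably a custom theme defined elsewhere
  theme_secun()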

Figure showing clustering of loyal customers. Color indicates the degree to which each category is being selected.
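The rearrangement behind center_reshape is not shown in the post. One possible way to build it, sketched here in base R from the cluster centers printed above, is to stack the matrix into a long table with one row per cluster and category. The column names cluster, features and values are taken from the plotting code; everything else is an assumption.

```r
# Cluster centers as printed by sc$centers for the loyal customers
centers <- rbind(
  c(29.3,  8.18, 15.71, 22.21, 5.44, 1.995, 16.55,  0.538),
  c(28.6, 36.59, 16.93,  8.69, 3.93, 1.099,  3.14,  0.925),
  c(15.7, 12.22,  7.66,  6.51, 2.66, 1.869,  2.69, 50.680),
  c(14.4,  8.65, 64.07,  7.99, 2.10, 0.728,  1.67,  0.354),
  c(14.3, 66.44,  9.20,  5.24, 2.15, 0.425,  1.66,  0.528)
)
colnames(centers) <- c("food", "fresh", "drinks", "home",
                       "beauty", "health", "baby", "pets")

# Long format: as.vector() walks the matrix column by column, so the cluster
# index repeats 1..5 within each category
center_reshape <- data.frame(
  cluster  = rep(seq_len(nrow(centers)), times = ncol(centers)),
  features = factor(rep(colnames(centers), each = nrow(centers)),
                    levels = colnames(centers)),
  values   = as.vector(centers)
)

head(center_reshape)
```

Keeping features as a factor with explicit levels preserves the original category order on the x axis instead of sorting it alphabetically.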

It’s easy from here to identify #1 as the customers with babies, who also buy the most food and home products (because they are large families maybe?), #3 as a tiny group dominated by pet products, #4 as those who mostly buy drinks, … This is always a talking point with people in the business, and it will help to better qualify this data set or even prompt a request for a different breakdown of its main categories. There seem to be no significant differences in health, and pets only stands out in the very small cluster #3. It is interesting to note that the online grocery is very popular with loyal customers for fresh food, which was the initial target of the supermarket.

How many customers do we have in each cluster?

round(sc$size / sum(sc$size) * 100, 2)

## [1] 23.31 35.29  0.58 13.18 27.65

We could go much further in describing the unique characteristics of each cluster: what makes one cluster different from the others?
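One simple way to approach that question, offered here only as a sketch and not as the method used in the original analysis, is to compare each cluster center with the average basket across all loyal customers. The average can be recovered from the printed centers and cluster shares, and the ratio between the two (called lift below, a name chosen for this example) shows which category is most over-represented in each cluster.

```r
# Cluster centers and cluster shares (%) as printed above for loyal customers
centers <- rbind(
  c(29.3,  8.18, 15.71, 22.21, 5.44, 1.995, 16.55,  0.538),
  c(28.6, 36.59, 16.93,  8.69, 3.93, 1.099,  3.14,  0.925),
  c(15.7, 12.22,  7.66,  6.51, 2.66, 1.869,  2.69, 50.680),
  c(14.4,  8.65, 64.07,  7.99, 2.10, 0.728,  1.67,  0.354),
  c(14.3, 66.44,  9.20,  5.24, 2.15, 0.425,  1.66,  0.528)
)
colnames(centers) <- c("food", "fresh", "drinks", "home",
                       "beauty", "health", "baby", "pets")
size_pct <- c(23.31, 35.29, 0.58, 13.18, 27.65)

# Average basket over all customers = size-weighted mean of the centers
weights <- size_pct / sum(size_pct)
overall <- colSums(centers * weights)

# Lift: how over- or under-represented a category is in each cluster
lift <- sweep(centers, 2, overall, FUN = "/")
round(lift, 2)

# Most over-represented category per cluster
top_category <- colnames(centers)[apply(lift, 1, which.max)]
top_category
```

A lift well above 1 marks a category that defines the cluster, even when its absolute share of the basket is modest.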

Non-loyal customers

Now, let’s cluster the baskets from sporadic customers to see what insights we can gather from them.

Let’s see the characterization of each cluster by categories like Food, Fresh, Drinks, …

sc$centers

##    food fresh drinks  home beauty health  baby  pets
## 1  5.47  2.65   9.02  5.38   2.24  0.765 74.00 0.300
## 2 15.77  5.97  61.76  8.99   3.25  0.761  2.64 0.705
## 3 61.86  6.41  17.04  7.52   3.55  0.901  1.93 0.648
## 4 21.45 47.23  15.61  7.44   3.94  0.999  2.43 0.814
## 5 15.84  5.92  17.42 35.24  15.99  1.986  4.93 2.344

In a graphical way:

Figure showing clustering of sporadic customers. Color indicates the degree to which each category is being selected.

Let’s see the allocation in % of customers to each of the 5 clusters:

## [1] 13.1 18.5 16.7 27.7 24.1

We now see a different type of cluster: customers focused mainly on baby products, or on home and beauty. These categories weren’t the original focus of the grocery, but customers love its products there too. It’s true that these clusters represent a minority of the customer base, because the majority are again in the food, drinks and fresh categories.

Action

What have we discovered through a few lines of code?
About 50% of the transactions are made by 12% of the total customer base, and all customers can be segmented into different clusters, some of which can easily be labelled as singles, families, recent parents, … though not all of them buy regularly from the online grocery.
Powerful, huh? And we can confirm that the fresh food categories are popular with loyal customers, which was the grocery’s original intent, while baby and beauty are popular with sporadic customers, though they are a smaller part of the customer base. Health and pets attract little interest from either type of customer, so it might be an opportunity to revisit these categories.

This is definitely a massive insight for testing existing marketing techniques with the aim of increasing the number of loyal customers.

Does this analysis ring a bell? More information is available should you want to know more. Let’s get in touch!

¹ This part has not been covered here, though it’s another application of this data to improve business processes.
² In this example we have deployed the k-means algorithm with 5 clusters (very detailed example on how this algorithm works here). To keep this simple, we haven’t discussed why we chose 5 clusters and not 4 or 10, although using the sum of squares method I found that a reasonable number would be 7 clusters.
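For readers curious about the sum of squares check mentioned in the second note, a common version of it (often called the elbow method) fits k-means for a range of k and looks at where the total within-cluster sum of squares stops dropping quickly. The sketch below uses synthetic basket percentages, so it will not reproduce the 7 clusters found on the real data. The names basket_pct, k_values and wss are hypothetical.

```r
set.seed(2024)

# Synthetic basket composition in % per category (NOT the real grocery data)
spend <- matrix(rexp(400 * 8, rate = 1 / 20), nrow = 400)
basket_pct <- 100 * spend / rowSums(spend)

# Total within-cluster sum of squares for k = 1..10
k_values <- 1:10
wss <- sapply(k_values, function(k) {
  kmeans(basket_pct, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# The "elbow" is the k after which adding clusters brings little improvement
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is read by eye, so it is a guide rather than a rule, and a business-friendly number of segments can be a perfectly good reason to pick a nearby k.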