Customer segmentation — Part II. Segmentation of online customers by… | by Dimitris Panagopoulos | Jan, 2021
Customer segmentation is one of the most common uses of data analysis and data science. This is the second part of a two-post series in which we work through an example of customer segmentation. The dataset we use is the Online Retail II data set, which contains the transactions of a UK-based online retailer between 1 December 2009 and 9 December 2011. It contains 1,067,371 rows describing the purchases of 5,943 customers.
In the first part, we created a customer segmentation based on the product categories. In this second part, we are going to perform a clustering based on Recency, Frequency, Monetary Value (RFM), and country of origin. Then, we are going to combine the results with the segmentation of part I.
Recency, Frequency, Monetary Value (RFM) is a way of analyzing customer value. The name comes from the three aspects it examines for each customer:
- Recency: how recent the customer's last purchase was. A customer who bought something recently has more value than a customer whose last purchase was long ago.
- Frequency: how frequently a customer makes a purchase. The more frequently a customer buys, the better.
- Monetary Value: how much a customer spends. The higher, the better.
There are several ways to define Recency, Frequency, and Monetary Value. We are going to use the following definitions:
- Recency is the number of days from the customer's last InvoiceDate to the most recent InvoiceDate of the dataset (2011-12-09),
- Frequency is the number of days from the customer's first InvoiceDate to their last InvoiceDate, divided by the number of invoices,
- Monetary value is the average cost, where cost is the product of Quantity and Price.
The code that calculates RFM is listed below.
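As an illustration, the three definitions above could be implemented in pandas roughly as follows. This is a minimal sketch under stated assumptions: the column names `Customer ID` and `Invoice` are assumed (only `InvoiceDate`, `Quantity`, and `Price` are named in the text), the function name `compute_rfm` is hypothetical, and the average cost is taken per invoice, which is one possible reading of "average".

```python
import pandas as pd


def compute_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per customer with Recency, Frequency, MonetaryValue.

    Assumed columns: 'Customer ID', 'Invoice', 'InvoiceDate',
    'Quantity', 'Price'.
    """
    df = df.copy()
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    df["Cost"] = df["Quantity"] * df["Price"]

    # Most recent InvoiceDate in the whole dataset
    snapshot = df["InvoiceDate"].max()

    grouped = df.groupby("Customer ID")
    first = grouped["InvoiceDate"].min()
    last = grouped["InvoiceDate"].max()
    n_invoices = grouped["Invoice"].nunique()

    return pd.DataFrame(
        {
            # Days between the customer's last invoice and the snapshot date
            "Recency": (snapshot - last).dt.days,
            # Days between first and last invoice, divided by invoice count
            "Frequency": (last - first).dt.days / n_invoices,
            # Assumption: average cost per invoice
            "MonetaryValue": grouped["Cost"].sum() / n_invoices,
        }
    )
```

With these definitions a lower Frequency value means a more frequent buyer, and a customer with a single invoice gets a Frequency of 0.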
The Online Retail II dataset records the country each item was shipped to. Using each country’s GDP, we get a proxy for the financial strength of each customer. GDP data from Wikipedia are used to create an xlsx file, which you can also find on GitHub. A few customers are associated with two countries. To deal with this, for every customer we weight each country’s GDP by the number of invoices.
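One way to read the weighting step is as an invoice-weighted average of GDP per customer. The sketch below assumes columns named `Customer ID`, `Invoice`, and `Country`, and a hypothetical `gdp` Series that maps country names to GDP figures (for example, loaded from the xlsx file). The function name `weighted_gdp` is also illustrative.

```python
import pandas as pd


def weighted_gdp(df: pd.DataFrame, gdp: pd.Series) -> pd.Series:
    """Invoice-weighted average GDP per customer.

    gdp: hypothetical Series indexed by country name.
    """
    # Distinct invoices per (customer, country) pair
    counts = (
        df.groupby(["Customer ID", "Country"])["Invoice"]
        .nunique()
        .rename("n_invoices")
        .reset_index()
    )
    counts["GDP"] = counts["Country"].map(gdp)
    # Drop countries without a GDP figure so they do not distort the weights
    counts = counts.dropna(subset=["GDP"])
    counts["weighted"] = counts["GDP"] * counts["n_invoices"]

    totals = counts.groupby("Customer ID")[["weighted", "n_invoices"]].sum()
    return (totals["weighted"] / totals["n_invoices"]).rename("WeightedGDP")
```

For a customer linked to a single country, the result is simply that country's GDP.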
The final preprocessing step is to combine RFM analysis with GDP data.
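Assuming the RFM table and the weighted GDP values are both indexed by customer, the combination can be a plain index join. The names `combine_features` and `WeightedGDP` are hypothetical.

```python
import pandas as pd


def combine_features(rfm: pd.DataFrame, gdp_per_customer: pd.Series) -> pd.DataFrame:
    """Join RFM features with weighted GDP on the customer index."""
    # An inner join keeps only customers present in both inputs
    return rfm.join(gdp_per_customer.rename("WeightedGDP"), how="inner")
```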
For clustering, we will use the k-Means algorithm. After scaling the input data we perform k-Means clustering for a range of k (= number of clusters created). This will allow us to create a plot of the total sum of squared distances per k and use the elbow method to select the best value for k.
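The elbow procedure can be sketched with scikit-learn as follows. The text does not say which scaler is used, so `StandardScaler` is an assumption here, as are the function name `elbow_inertias` and the `n_init` and `random_state` settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def elbow_inertias(X, k_values, random_state=0):
    """Return {k: inertia} for k-Means fitted on standardized X."""
    # Assumption: zero-mean, unit-variance scaling of every feature
    X_scaled = StandardScaler().fit_transform(X)
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X_scaled)
        # inertia_ is the total within-cluster sum of squared distances
        inertias[k] = model.inertia_
    return inertias
```

Plotting the returned values against k (for example with matplotlib) gives the elbow curve. Candidate values of k are those where the curve bends and further clusters reduce the sum of squared distances only slightly.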
Based on the elbow method, we will examine clustering into 3 or into 7 clusters. For profiling the clusters, we create a custom function named cluster_profile_RFM_country, which
- calculates the median of recency, frequency, monetary value, and weighted GDP for each cluster, then
- sums the medians of recency, frequency, monetary value, and weighted GDP, and
- divides each median by the corresponding sum.
This way, the function cluster_profile_RFM_country calculates the percentage of each variable (recency, frequency, monetary value, and weighted GDP) that corresponds to each cluster. There is also an option to exclude clusters with few items from the profiling.
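The three profiling steps can be sketched as below. The function is deliberately given a different name, `cluster_profile_sketch`, because its signature and the `min_size` parameter are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def cluster_profile_sketch(features: pd.DataFrame, labels, min_size: int = 0) -> pd.DataFrame:
    """Share of each variable's per-cluster median, per cluster.

    features: one row per customer (e.g. recency, frequency,
              monetary value, weighted GDP).
    labels:   cluster label for each row, in the same order.
    min_size: clusters with fewer members are excluded (assumed option).
    """
    df = features.copy()
    df["cluster"] = np.asarray(labels)

    # Optionally drop small clusters before profiling
    sizes = df["cluster"].value_counts()
    df = df[df["cluster"].isin(sizes[sizes >= min_size].index)]

    # Step 1: median of every variable within each cluster
    medians = df.groupby("cluster").median()
    # Steps 2 and 3: divide each median by the column-wise sum of medians
    return medians / medians.sum()
```

Each column of the result sums to 1 over the retained clusters, so the values can be compared across variables with different units.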
If we cluster into 3 groups, we end up with clusters of 4,225, 1,708, and 9 customers. The small size of the third cluster is an indication of outliers. If we use the first two clusters for profiling, we see that the first cluster contains customers
- that have made a transaction more recently (in the last 9 days vs. the last 91),
- that purchase more frequently (every 15 days vs. every 85), and
- that have a higher monetary value per transaction (243 vs. 197)
than the customers of the second cluster.
Selecting 7 clusters, we obtain clusters with 3,549, 1,547, 572, 247, 17, 9, and 1 customers. Profiling the clusters with more than 100 customers, we have
- cluster 4 with 572 customers. Cluster 4 contains customers with the lowest RFM score (i.e., customers that shop less frequently, with a lower monetary value per transaction, and that shopped longer ago than those in the other clusters).
- cluster 0 with 1,547 customers. Cluster 0 contains customers with the second lowest RFM score.
- cluster 2 with 3,549 customers and cluster 5 with 247 customers. These contain customers with the best RFM score. The difference between the two is that cluster 5 has customers that are related to countries with lower GDP than the rest.
We could say that these four clusters can be ranked by RFM-Country score (from worst to best) as:
cluster 4 < cluster 0 < cluster 5 < cluster 2. (Cluster 5 has a higher monetary value than cluster 2 but contains customers from countries with lower GDP, so we prioritize cluster 2.)
Using hierarchical clustering we can gain a better understanding of the possible number of clusters.
We see that hierarchical clustering supports the choice of splitting into 3 or 7 clusters. Profiling the major clusters in both cases, we see that the results are similar to those of k-Means.
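A hierarchical clustering of the same scaled features could be sketched with SciPy as follows. The linkage method is not stated in the text, so Ward linkage is an assumption, as are `StandardScaler` and the function name `hierarchical_labels`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler


def hierarchical_labels(X, n_clusters, method="ward"):
    """Return the linkage matrix and flat cluster labels for X."""
    X_scaled = StandardScaler().fit_transform(X)
    # Each row of Z records one merge: the two clusters joined,
    # their distance, and the size of the new cluster
    Z = linkage(X_scaled, method=method)
    # Cut the tree so that at most n_clusters flat clusters remain
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree. Large vertical gaps between successive merges suggest natural numbers of clusters, which can then be compared with the values picked by the elbow method.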
We will use the 7-cluster solution obtained with k-Means. The relevant information is exported with pickle.
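The text does not specify exactly what is exported. As a minimal sketch, assuming the exported object is a mapping from customer ID to cluster label, the round trip with pickle could look like this (the function names and the payload layout are hypothetical):

```python
import pickle


def save_clusters(path, customer_ids, labels):
    """Pickle a {customer_id: cluster_label} dictionary to path."""
    # Assumed payload: plain dict with integer cluster labels
    payload = {cid: int(label) for cid, label in zip(customer_ids, labels)}
    with open(path, "wb") as fh:
        pickle.dump(payload, fh)


def load_clusters(path):
    """Load the dictionary written by save_clusters."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.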