Customer Segmentation in Online Retail | by Rahul Khandelwal | Jan, 2021
RFM stands for Recency, Frequency, and Monetary value. RFM analysis is a commonly used technique that assigns a score to each customer based on how recent their last transaction was (Recency), how many transactions they have made in the last year (Frequency), and what the monetary value of their transactions was (Monetary).
RFM analysis helps to answer the following questions: Who was our most recent customer? How many times have they purchased items from our shop? And what is the total value of their purchases? All this information can be critical to understanding how valuable a customer is to the company.
After getting the RFM values, a common practice is to create ‘quartiles’ on each of the metrics and assign scores in the required order. For example, suppose that we divide each metric into 4 cuts. For the recency metric, the highest score, 4, is assigned to the customers with the lowest recency values (since they are the most recent customers). For the frequency and monetary metrics, the highest score, 4, is assigned to the customers in the top 25% of frequency and monetary values, respectively. After dividing the metrics into quartiles, we can collate the three scores into a single column (as a string of characters such as ‘213’) to create classes of RFM values for our customers. We can divide the RFM metrics into fewer or more cuts depending on our requirements.
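The quartile scoring described above can be sketched with `pd.qcut`. This is only an illustration: the customer values are invented, and the short column names `R`, `F`, and `M` are assumptions rather than names taken from the analysis.

```python
import pandas as pd

# Hypothetical per-customer metrics, invented for illustration only.
rfm = pd.DataFrame(
    {
        "Recency": [2, 10, 35, 60, 120, 200, 300, 360],
        "Frequency": [50, 40, 30, 22, 15, 9, 4, 1],
        "MonetaryValue": [900.0, 700.0, 450.0, 300.0, 150.0, 80.0, 40.0, 10.0],
    }
)

# Recency: smaller is better, so the labels run from 4 down to 1.
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Frequency and monetary value: larger is better, so the labels run from 1 up to 4.
rfm["F"] = pd.qcut(rfm["Frequency"], q=4, labels=[1, 2, 3, 4]).astype(int)
rfm["M"] = pd.qcut(rfm["MonetaryValue"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(rfm)
```

On real data with many tied values (frequency is a typical case), `pd.qcut` can raise an error about non-unique bin edges. Ranking the values before cutting is one common workaround.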
Let’s get down to RFM analysis on our data now.
Firstly, we need to create a column to get the monetary value of each transaction. This can be done by multiplying the UnitValue column with the Quantity column. Let’s call this the TotalSum. Calling the .describe() method on this column, we get:
This gives us an idea of how consumer spending is distributed in our data. The mean value is 20.86 and the standard deviation is 328.40, yet the maximum value is 168,469, which is very large by comparison. In other words, the distribution is heavily right-skewed: within the top 25% of our data, the TotalSum values climb steeply from 17.85 to 168,469.
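As a minimal sketch of how the TotalSum column might be built and summarised, assuming a DataFrame with the Quantity and UnitValue columns mentioned above (the three rows here are made up and will not reproduce the figures quoted in the text):

```python
import pandas as pd

# Hypothetical transaction lines; the values are invented for illustration.
data = pd.DataFrame(
    {
        "Quantity": [6, 2, 12],
        "UnitValue": [2.55, 3.39, 0.85],
    }
)

# Monetary value of each transaction line.
data["TotalSum"] = data["Quantity"] * data["UnitValue"]

# count, mean, std, min, quartiles and max of the new column.
summary = data["TotalSum"].describe()
print(summary)
```

Comparing the `75%` and `max` entries of the summary is a quick way to spot the kind of long right tail discussed above.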
Now, for RFM analysis, we need to define a ‘snapshot date’, which is the day on which we are conducting this analysis. Here, I have taken the snapshot date to be the latest date in the data plus one day (the day after the date up to which the data was updated). This gives the date 2011-12-10 (YYYY-MM-DD).
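One way the snapshot date could be derived is shown below. The timestamps are hypothetical, chosen only so that the latest one falls on 2011-12-09, and the variable name `invoice_dates` is an assumption.

```python
import pandas as pd

# Hypothetical invoice timestamps; only the latest one matters here.
invoice_dates = pd.to_datetime(
    pd.Series(["2011-06-01 09:00", "2011-11-30 10:15", "2011-12-09 12:50"])
)

# Snapshot date = latest timestamp in the data plus one day.
snapshot_date = invoice_dates.max() + pd.Timedelta(days=1)

print(snapshot_date.date())
```

Because the time of day is kept, the most recent purchase ends up exactly one day before the snapshot, so the smallest possible recency is 1 rather than 0.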
Next, we confine the data to a period of one year, which limits the recency value to a maximum of 365. We then aggregate the data at the customer level and calculate the RFM metrics for each customer.
# Aggregate data on a customer level
data = data_rfm.groupby(['CustomerID'], as_index=False).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # days since last purchase
    'InvoiceNo': 'count',                                     # non-null InvoiceNo rows per customer
    'TotalSum': 'sum',                                        # total spend
}).rename(columns={
    'InvoiceDate': 'Recency',
    'InvoiceNo': 'Frequency',
    'TotalSum': 'MonetaryValue',
})
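The one-year restriction is not part of the snippet above. A hedged sketch of how it might be applied before the aggregation follows, using invented rows and a hypothetical `window_start` variable. The aggregation is written with pandas named aggregation, which should give the same columns as the dict-plus-rename form.

```python
import pandas as pd

# Hypothetical transaction lines, invented for illustration.
data = pd.DataFrame(
    {
        "CustomerID": [1, 1, 2, 3],
        "InvoiceNo": ["A1", "A2", "B1", "C1"],
        "InvoiceDate": pd.to_datetime(
            ["2011-12-01", "2011-12-08", "2011-03-15", "2010-12-01"]
        ),
        "TotalSum": [20.0, 35.5, 12.0, 99.0],
    }
)
snapshot_date = pd.Timestamp("2011-12-10")

# Keep only rows from the 365 days before the snapshot date (assumed window).
window_start = snapshot_date - pd.Timedelta(days=365)
data_rfm = data[data["InvoiceDate"] >= window_start]

# Aggregate to one row per customer.
rfm = data_rfm.groupby("CustomerID", as_index=False).agg(
    Recency=("InvoiceDate", lambda x: (snapshot_date - x.max()).days),
    Frequency=("InvoiceNo", "count"),
    MonetaryValue=("TotalSum", "sum"),
)

print(rfm)
```

In this toy data, customer 3 last purchased more than a year before the snapshot date, so that customer drops out of the result entirely.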
As the next step, we create quartiles on this data as described above and collate these scores into an RFM_Segment column. The RFM_Score is calculated by summing up the RFM quartile metrics.
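Assuming the three quartile scores are already stored as integer columns (named `R`, `F`, and `M` here purely for illustration), the two derived columns could be assembled like this:

```python
import pandas as pd

# Hypothetical quartile scores for three customers.
scores = pd.DataFrame(
    {
        "CustomerID": [1, 2, 3],
        "R": [4, 2, 1],
        "F": [4, 1, 1],
        "M": [3, 3, 1],
    }
)

# Concatenate the three scores into a string such as '213'.
scores["RFM_Segment"] = (
    scores["R"].astype(str) + scores["F"].astype(str) + scores["M"].astype(str)
)

# Sum the three scores into a single number.
scores["RFM_Score"] = scores[["R", "F", "M"]].sum(axis=1)

print(scores)
```

The string keeps the three dimensions separate, while the sum collapses them: ‘413’ and ‘134’ are different segments but share the same score of 8.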
We are now in a position to analyze our results. The RFM_Score values will range from 3 (1+1+1) to 12 (4+4+4). So, we can group by the RFM scores and check the mean values of recency, frequency, and monetary corresponding to each score.
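A sketch of that group-by, run on invented customers. The `Customers` count column is an addition for illustration and is not mentioned in the text.

```python
import pandas as pd

# Hypothetical customer-level table with an RFM_Score already attached.
rfm = pd.DataFrame(
    {
        "Recency": [300, 250, 40, 60, 5, 3],
        "Frequency": [1, 2, 10, 12, 80, 95],
        "MonetaryValue": [15.0, 25.0, 200.0, 260.0, 1500.0, 2100.0],
        "RFM_Score": [3, 3, 7, 7, 12, 12],
    }
)

# Mean of each metric per score, plus the number of customers behind each mean.
summary = rfm.groupby("RFM_Score").agg(
    Recency=("Recency", "mean"),
    Frequency=("Frequency", "mean"),
    MonetaryValue=("MonetaryValue", "mean"),
    Customers=("Recency", "count"),
).round(1)

print(summary)
```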
As expected, customers with the lowest RFM scores have the highest recency values and the lowest frequency and monetary values, and vice versa. Finally, we can create segments within the RFM_Score range of 3 to 12 by manually defining categories in our data: customers with an RFM_Score greater than or equal to 9 can be put in the ‘Top’ category, customers with an RFM_Score of at least 5 but below 9 can be put in the ‘Middle’ category, and the rest can be put in the ‘Low’ category. Let us call this column ‘General_Segment’. Analyzing the mean values of recency, frequency, and monetary value for each segment, we get:
Note that we had to create the logic for distributing customers into the ‘Top’, ‘Middle’, and ‘Low’ categories manually. In many scenarios, this would be okay. But if we want to find segments in our RFM values in a more data-driven way, we can use a clustering algorithm like K-means.
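The manual rule for ‘General_Segment’ can be sketched as a small function. The helper name `general_segment` is hypothetical, and treating ‘Middle’ as scores 5 through 8 is an assumption about where the boundaries fall.

```python
import pandas as pd


def general_segment(score: int) -> str:
    """Map an RFM_Score (3 to 12) to a coarse label; the cut-offs are assumptions."""
    if score >= 9:
        return "Top"
    if score >= 5:
        return "Middle"
    return "Low"


# Hypothetical scores covering each branch of the rule.
rfm = pd.DataFrame({"RFM_Score": [3, 4, 5, 8, 9, 12]})
rfm["General_Segment"] = rfm["RFM_Score"].apply(general_segment)

print(rfm)
```

The thresholds are the hand-picked part of this approach. They are what a clustering algorithm would replace with boundaries learned from the data.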
In the next section, we are going to preprocess the data for K-means clustering.