Project: Modeling & Predicting of Churning Customers (in R) | by Jarar Zaidi | Jan, 2021

In Part 1 of 3 of Data Wrangling, we load the libraries/packages required for our project & read in our data file. We also check the dataset for parsing problems & find none.

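```{r DataWrangling1}
library(tidyverse) # attaches readr, dplyr and tidyr, among others
# plyr is not attached: nothing below needs it, and loading it after dplyr
# masks dplyr verbs such as rename(), summarize() and count()
library(readr)
library(dplyr)

cc1 <- read_csv("BankChurners.csv") # original dataset

problems(cc1) # no problems with cc1
head(cc1)
dim(cc1) # returns dimensions: 10127 rows, 23 columns
cc1 %>% filter(!is.na(Income_Category)) # rows with a non-missing Income_Category
colSums(is.na(cc1)) # number of NA values in each column
glimpse(cc1)
```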
output of head(cc1)
output of glimpse(cc1)

In Part 2 of 3 of Data Wrangling, we keep only the columns we want & remove rows with NA or Unknown values. We also examine the dimensions of the data & the unique values of our discrete variables.

6 distinct discrete types for Income_Category: $60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown
4 distinct discrete types for Marital_Status: Married, Single, Divorced, Unknown
4 distinct discrete types for Card_Category: Blue, Gold, Silver, Platinum

Note: We will also remove any rows/entries with an “Unknown” or NA value.

We see here that we initially have 10,127 rows & 23 columns, but we reduce that to 8,348 rows by 9 columns.
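
Because missing information in this dataset is stored as the text "Unknown" rather than as NA, it can help to count those entries before dropping them. The following is a minimal sketch on a small made-up data frame; `toy`, `unknown_counts` and `toy_clean` are hypothetical names, and the values are invented for illustration rather than taken from BankChurners.csv.

```r
library(dplyr)
library(tidyr)

# Made-up rows that mimic the structure of the real columns (assumption)
toy <- tibble(
  Marital_Status  = c("Married", "Unknown", "Single", "Divorced"),
  Income_Category = c("$60K - $80K", "Less than $40K", "Unknown", "$120K +"),
  Credit_Limit    = c(12691, 8256, 3418, NA)
)

# Count the "Unknown" entries in every character column
unknown_counts <- toy %>%
  summarise(across(where(is.character), ~ sum(.x == "Unknown", na.rm = TRUE)))

# Turn "Unknown" into NA, then drop every row that contains an NA
toy_clean <- toy %>%
  mutate(across(where(is.character), ~ na_if(.x, "Unknown"))) %>%
  drop_na()

unknown_counts
toy_clean
```

Converting "Unknown" to NA first means a single `drop_na()` call handles both kinds of missing value, instead of one filter condition per column.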

```{r DataWrangling2}
# select the columns we care about and drop rows containing NA
cc2 <- cc1 %>%
  select(Customer_Age, Gender, Dependent_count, Education_Level, Marital_Status,
         Income_Category, Card_Category, Credit_Limit, Attrition_Flag) %>%
  na.omit() # base R; replaces filter(!is.na(.)), which hands a whole matrix to filter()
# see the head of it
head(cc2)
dim(cc2) # dimensions: 10127 rows, 9 columns

cc2 %>% group_by(Income_Category, Marital_Status)

# Let's see which distinct types there are
distinct(cc2, Income_Category) # 6 types: $60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown
distinct(cc2, Marital_Status) # 4 types: Married, Single, Divorced, Unknown
distinct(cc2, Card_Category) # 4 types: Blue, Gold, Silver, Platinum

# Drop all the "Unknown" rows from Marital_Status, Income_Category & Education_Level
# 82x9, 82 rows must remove these rows
cc3 <- cc2 %>%
  filter(Marital_Status != "Unknown", Income_Category != "Unknown", Education_Level != "Unknown")

head(cc3)
dim(cc3) # 8348 rows by 9 columns
```
dimensions of original & new data frame
output of (distinct(cc2, Income_Category)) # 6 types: $60K - $80K, Less than $40K, $80K - $120K, $40K - $60K, $120K +, Unknown
output of (distinct(cc2, Marital_Status)) # 4 types: Married, Single, Divorced, Unknown
output of (distinct(cc2, Card_Category)) # 4 types: Blue, Gold, Silver, Platinum

In Part 3 of 3 of Data Wrangling, we rename our label column Attrition_Flag to Exited_Flag. Note that the chunk then reassigns dataCC4 from cc3, so the summaries that follow still refer to the column as Attrition_Flag. We also rename the binary values of this label from Existing Customer/Attrited Customer to Current/Exited, respectively. Lastly, we look at the count of each discrete feature against our discrete label.
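
As a hedged sketch of the two renaming steps just described, the column rename and the value recode can be chained in one pipeline so that neither is lost along the way. The data frame `toy_labels` and its values are made up for illustration; only the column name Attrition_Flag and the label strings come from the dataset.

```r
library(dplyr)

# Made-up rows; only the column name and label strings match the real data
toy_labels <- tibble(
  Customer_Age   = c(45, 52, 38),
  Attrition_Flag = c("Existing Customer", "Attrited Customer", "Existing Customer")
)

toy_renamed <- toy_labels %>%
  # new_name = old_name
  rename(Exited_Flag = Attrition_Flag) %>%
  # map the two known labels; any other value is left unchanged
  mutate(Exited_Flag = case_when(
    Exited_Flag == "Existing Customer" ~ "Current",
    Exited_Flag == "Attrited Customer" ~ "Exited",
    TRUE ~ Exited_Flag
  ))

toy_renamed
```

Keeping both steps in a single assignment avoids a later reassignment silently undoing the rename.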

```{r DataWrangling3}
#install.packages("dplyr")
library(dplyr)

# Rename label column to Exited_Flag
# (dplyr:: is spelled out because plyr, if attached, masks rename(), summarize() and count())
dataCC4 <- cc3 %>% dplyr::rename(Exited_Flag = Attrition_Flag)

# This reassignment discards the rename above, so the label column
# is still called Attrition_Flag in everything that follows
dataCC4 <- cc3

# Rename values
dataCC4$Attrition_Flag[dataCC4$Attrition_Flag == "Existing Customer"] <- "Current"
dataCC4$Attrition_Flag[dataCC4$Attrition_Flag == "Attrited Customer"] <- "Exited"

# Mean age, dependent count & credit limit for each label value
dataCC4 %>%
  group_by(Attrition_Flag) %>%
  dplyr::summarize(meanAge = mean(Customer_Age),
                   meanDepdent = mean(Dependent_count),
                   meanCreditLim = mean(Credit_Limit))

# AKA: a reusable helper returning the group size & the mean of the selected columns
summarise_mean <- function(data, vars) {
  data %>% dplyr::summarise(n = n(), across({{ vars }}, mean))
}
# dataCC4 %>%
#   group_by(Attrition_Flag) %>%
#   summarise_mean(where(is.numeric))

# see the count of each label value within each discrete feature
dataCC4 %>% select(Gender, Attrition_Flag) %>% group_by(Gender) %>% dplyr::count(Attrition_Flag)
dataCC4 %>% group_by(Education_Level) %>% dplyr::count(Attrition_Flag)
dataCC4 %>% group_by(Marital_Status) %>% dplyr::count(Attrition_Flag)
dataCC4 %>% group_by(Income_Category) %>% dplyr::count(Attrition_Flag)
dataCC4 %>% group_by(Card_Category) %>% dplyr::count(Attrition_Flag)
summary(dataCC4)
```
output of (dataCC4 %>% group_by(Attrition_Flag) %>% summarize(meanAge= mean(Customer_Age), meanDepdent= mean(Dependent_count), meanCreditLim= mean(Credit_Limit)))

Above, we can see that current customers had a higher mean credit limit than customers who churned.

output of (dataCC4 %>% group_by(Education_Level) %>% count(Attrition_Flag) )
output of (dataCC4 %>% group_by(Marital_Status) %>% count(Attrition_Flag) )
output of (dataCC4 %>% group_by(Income_Category) %>% count(Attrition_Flag) )
output of (dataCC4 %>% group_by(Card_Category) %>% count(Attrition_Flag) )
output of summary(dataCC4)
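
Raw counts are hard to compare across categories of very different sizes, so one possible follow-up is to turn each count into a share within its category. This is only a sketch on invented rows; `toy_churn` and `churn_share` are hypothetical names, and the numbers say nothing about the real data.

```r
library(dplyr)

# Made-up rows with already recoded label values (assumption)
toy_churn <- tibble(
  Card_Category  = c("Blue", "Blue", "Blue", "Blue", "Gold", "Gold"),
  Attrition_Flag = c("Current", "Current", "Current", "Exited", "Current", "Exited")
)

churn_share <- toy_churn %>%
  count(Card_Category, Attrition_Flag) %>% # one row per category/label pair
  group_by(Card_Category) %>%
  mutate(share = n / sum(n)) %>%           # share of each label within its category
  ungroup()

churn_share
```

The same pattern would apply to Gender, Education_Level, Marital_Status or Income_Category by swapping the first grouping column.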
