Clustering & Visualization of Clusters using PCA

Customer Segmentation Based on Credit Card Usage Behavior

The dataset for this notebook describes the credit card usage behavior of 8,950 customers in 18 columns: a customer ID plus 17 behavioral features. Segmenting customers this way can help define marketing strategies.

Content of this Kernel:

  • Data Preprocessing
  • Clustering using KMeans
  • Interpretation of Clusters
  • Visualization of Clusters using PCA
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings(action="ignore")
data= pd.read_csv("../input/CC GENERAL.csv")
print(data.shape)
data.head()
(8950, 18)

First five rows (shown transposed, one row per column of the dataset):

                                              0            1            2            3            4
CUST_ID                                  C10001       C10002       C10003       C10004       C10005
BALANCE                               40.900749  3202.467416  2495.148862  1666.670542   817.714335
BALANCE_FREQUENCY                      0.818182     0.909091     1.000000     0.636364     1.000000
PURCHASES                                 95.40         0.00       773.17      1499.00        16.00
ONEOFF_PURCHASES                           0.00         0.00       773.17      1499.00        16.00
INSTALLMENTS_PURCHASES                     95.4          0.0          0.0          0.0          0.0
CASH_ADVANCE                           0.000000  6442.945483     0.000000   205.788017     0.000000
PURCHASES_FREQUENCY                    0.166667     0.000000     1.000000     0.083333     0.083333
ONEOFF_PURCHASES_FREQUENCY             0.000000     0.000000     1.000000     0.083333     0.083333
PURCHASES_INSTALLMENTS_FREQUENCY       0.083333     0.000000     0.000000     0.000000     0.000000
CASH_ADVANCE_FREQUENCY                 0.000000     0.250000     0.000000     0.083333     0.000000
CASH_ADVANCE_TRX                              0            4            0            1            0
PURCHASES_TRX                                 2            0           12            1            1
CREDIT_LIMIT                             1000.0       7000.0       7500.0       7500.0       1200.0
PAYMENTS                             201.802084  4103.032597   622.066742     0.000000   678.334763
MINIMUM_PAYMENTS                     139.509787  1072.340217   627.284787          NaN   244.791237
PRC_FULL_PAYMENT                       0.000000     0.222222     0.000000     0.000000     0.000000
TENURE                                       12           12           12           12           12

Data Preprocessing

Descriptive Statistics of Data

data.describe()

Summary statistics (shown transposed, one row per feature):

                                           count          mean           std           min           25%           50%           75%           max
BALANCE                              8950.000000   1564.474828   2081.531879      0.000000    128.281915    873.385231   2054.140036  19043.138560
BALANCE_FREQUENCY                    8950.000000      0.877271      0.236904      0.000000      0.888889      1.000000      1.000000      1.000000
PURCHASES                            8950.000000   1003.204834   2136.634782      0.000000     39.635000    361.280000   1110.130000  49039.570000
ONEOFF_PURCHASES                     8950.000000    592.437371   1659.887917      0.000000      0.000000     38.000000    577.405000  40761.250000
INSTALLMENTS_PURCHASES               8950.000000    411.067645    904.338115      0.000000      0.000000     89.000000    468.637500  22500.000000
CASH_ADVANCE                         8950.000000    978.871112   2097.163877      0.000000      0.000000      0.000000   1113.821139  47137.211760
PURCHASES_FREQUENCY                  8950.000000      0.490351      0.401371      0.000000      0.083333      0.500000      0.916667      1.000000
ONEOFF_PURCHASES_FREQUENCY           8950.000000      0.202458      0.298336      0.000000      0.000000      0.083333      0.300000      1.000000
PURCHASES_INSTALLMENTS_FREQUENCY     8950.000000      0.364437      0.397448      0.000000      0.000000      0.166667      0.750000      1.000000
CASH_ADVANCE_FREQUENCY               8950.000000      0.135144      0.200121      0.000000      0.000000      0.000000      0.222222      1.500000
CASH_ADVANCE_TRX                     8950.000000      3.248827      6.824647      0.000000      0.000000      0.000000      4.000000    123.000000
PURCHASES_TRX                        8950.000000     14.709832     24.857649      0.000000      1.000000      7.000000     17.000000    358.000000
CREDIT_LIMIT                         8949.000000   4494.449450   3638.815725     50.000000   1600.000000   3000.000000   6500.000000  30000.000000
PAYMENTS                             8950.000000   1733.143852   2895.063757      0.000000    383.276166    856.901546   1901.134317  50721.483360
MINIMUM_PAYMENTS                     8637.000000    864.206542   2372.446607      0.019163    169.123707    312.343947    825.485459  76406.207520
PRC_FULL_PAYMENT                     8950.000000      0.153715      0.292499      0.000000      0.000000      0.000000      0.142857      1.000000
TENURE                               8950.000000     11.517318      1.338331      6.000000     12.000000     12.000000     12.000000     12.000000

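The data seem to contain many outliers: for several features the maximum sits far above the 75th percentile (PURCHASES, for example, has a 75th percentile of 1,110.13 but a maximum of 49,039.57).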
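
One rough way to back up this impression is Tukey's 1.5 × IQR rule, which flags values above Q3 + 1.5 × IQR. The sketch below applies it to a small made-up column; the helper name `count_upper_outliers` and the sample values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def count_upper_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values above Q3 + k * IQR (Tukey's rule, upper side only)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return int((series > upper_fence).sum())

# Made-up values: six ordinary amounts and one extreme one
toy = pd.Series([10, 12, 11, 13, 12, 11, 500])
n_outliers = count_upper_outliers(toy)
print(n_outliers)
```

The same helper could be applied to each numeric column of `data` (for instance `data['PURCHASES']`) to see how many rows a drop-the-outliers strategy would cost.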

Dealing with Missing Values
data.isnull().sum().sort_values(ascending=False).head()
MINIMUM_PAYMENTS       313
CREDIT_LIMIT             1
TENURE                   0
PURCHASES_FREQUENCY      0
BALANCE                  0
dtype: int64

Imputing these missing values with the column mean

# fillna replaces only the missing entries; the mean is computed from the non-missing values
data['MINIMUM_PAYMENTS']=data['MINIMUM_PAYMENTS'].fillna(data['MINIMUM_PAYMENTS'].mean())
data['CREDIT_LIMIT']=data['CREDIT_LIMIT'].fillna(data['CREDIT_LIMIT'].mean())
data.isnull().sum().sort_values(ascending=False).head()
TENURE               0
PRC_FULL_PAYMENT     0
BALANCE              0
BALANCE_FREQUENCY    0
PURCHASES            0
dtype: int64

Dealing with Outliers

Dropping the outliers would cost many rows, because the dataset contains so many of them. Instead, each feature is binned into ranges, so that extreme values simply land in the top range.

columns=['BALANCE', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'CREDIT_LIMIT',
        'PAYMENTS', 'MINIMUM_PAYMENTS']

for c in columns:
    
    Range=c+'_RANGE'
    data[Range]=0        
    data.loc[((data[c]>0)&(data[c]<=500)),Range]=1
    data.loc[((data[c]>500)&(data[c]<=1000)),Range]=2
    data.loc[((data[c]>1000)&(data[c]<=3000)),Range]=3
    data.loc[((data[c]>3000)&(data[c]<=5000)),Range]=4
    data.loc[((data[c]>5000)&(data[c]<=10000)),Range]=5
    data.loc[((data[c]>10000)),Range]=6
columns=['BALANCE_FREQUENCY', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 
         'CASH_ADVANCE_FREQUENCY', 'PRC_FULL_PAYMENT']

for c in columns:
    
    Range=c+'_RANGE'
    data[Range]=0
    data.loc[((data[c]>0)&(data[c]<=0.1)),Range]=1
    data.loc[((data[c]>0.1)&(data[c]<=0.2)),Range]=2
    data.loc[((data[c]>0.2)&(data[c]<=0.3)),Range]=3
    data.loc[((data[c]>0.3)&(data[c]<=0.4)),Range]=4
    data.loc[((data[c]>0.4)&(data[c]<=0.5)),Range]=5
    data.loc[((data[c]>0.5)&(data[c]<=0.6)),Range]=6
    data.loc[((data[c]>0.6)&(data[c]<=0.7)),Range]=7
    data.loc[((data[c]>0.7)&(data[c]<=0.8)),Range]=8
    data.loc[((data[c]>0.8)&(data[c]<=0.9)),Range]=9
    # Open-ended top range: CASH_ADVANCE_FREQUENCY reaches 1.5, which an upper bound of 1.0 would leave at 0
    data.loc[(data[c]>0.9),Range]=10
    
columns=['PURCHASES_TRX', 'CASH_ADVANCE_TRX']  

for c in columns:
    
    Range=c+'_RANGE'
    data[Range]=0
    data.loc[((data[c]>0)&(data[c]<=5)),Range]=1
    data.loc[((data[c]>5)&(data[c]<=10)),Range]=2
    data.loc[((data[c]>10)&(data[c]<=15)),Range]=3
    data.loc[((data[c]>15)&(data[c]<=20)),Range]=4
    data.loc[((data[c]>20)&(data[c]<=30)),Range]=5
    data.loc[((data[c]>30)&(data[c]<=50)),Range]=6
    data.loc[((data[c]>50)&(data[c]<=100)),Range]=7
    data.loc[((data[c]>100)),Range]=8
data.drop(['CUST_ID', 'BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES',
       'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE',
       'PURCHASES_FREQUENCY',  'ONEOFF_PURCHASES_FREQUENCY',
       'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY',
       'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS',
       'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT' ], axis=1, inplace=True)

X= np.asarray(data)
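
The chained `data.loc` assignments in the binning loops above can also be written with `pd.cut`. The sketch below is one possible equivalent for the amount columns, run on a small made-up Series; the helper name `to_range`, the constant `AMOUNT_EDGES`, and the sample values are assumptions for illustration. `right=True` reproduces the `(low, high]` intervals, and zeros are mapped to 0 separately because the loops leave them at 0.

```python
import numpy as np
import pandas as pd

# Same edges as the amount loop: (0, 500], (500, 1000], ..., (10000, inf]
AMOUNT_EDGES = [0, 500, 1000, 3000, 5000, 10000, np.inf]

def to_range(series: pd.Series, edges) -> pd.Series:
    """Map each value to 1..len(edges)-1 by right-closed interval.

    Values that fall in no interval (here: zero or below, or missing) map to 0.
    """
    codes = pd.cut(series, bins=edges, labels=False, right=True)
    return codes.add(1).fillna(0).astype(int)

amounts = pd.Series([0.0, 40.9, 500.0, 773.17, 3202.47, 6442.95, 49039.57])
ranges = to_range(amounts, AMOUNT_EDGES)
print(ranges.tolist())
```

Keeping the edges in one list makes it easier to see, and to change, where each range starts and ends.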

Standardizing the input values (zero mean and unit variance per feature).

scale = StandardScaler()
X = scale.fit_transform(X)
X.shape
(8950, 17)
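
`StandardScaler` rescales each column as z = (x - mean) / std, using the population standard deviation. A minimal check on a made-up array (the numbers are illustrative and not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy)

# Manual version: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Without this step, features on larger scales (such as TENURE, which runs from 6 to 12, versus ranges that run from 0 to 6) would weigh more heavily in the Euclidean distances KMeans relies on.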

MODELING

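Clustering using KMeans

n_clusters=30
cost=[]
# Fit KMeans for k = 1..29 and record the inertia (within-cluster sum of squares)
for i in range(1,n_clusters):
    kmean= KMeans(i)
    kmean.fit(X)
    cost.append(kmean.inertia_)
# Plot against the actual k values; plotting cost alone would start the x-axis at 0
plt.plot(range(1,n_clusters), cost, 'bx-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()

[Plot: KMeans inertia versus number of clusters, k = 1 to 29]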
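
A cost curve like the one above is usually read with the elbow heuristic: look for the k after which adding another cluster lowers the inertia only slightly. The sketch below illustrates the idea on synthetic data with three well-separated groups; the data, the variable names, and the use of the relative drop in inertia are assumptions for illustration, not the notebook's own procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (an assumption for illustration)
X_toy, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=0)

ks = list(range(1, 8))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy).inertia_
            for k in ks]

# Relative drop in inertia when going from k-1 to k clusters
rel_drop = {k: 1 - inertias[i] / inertias[i - 1] for i, k in enumerate(ks) if i > 0}
for k, drop in rel_drop.items():
    print(f"k={k}: inertia falls by {drop:.1%}")
```

On data like this the drop is large up to k=3 and small afterwards. On real data the elbow is often much less sharp, so the choice of k remains a judgment call.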

Choosing 6 as the number of clusters

kmean= KMeans(6)
kmean.fit(X)
labels=kmean.labels_
clusters=pd.concat([data, pd.DataFrame({'cluster':labels})], axis=1)
clusters.head()

First five rows of clusters (shown transposed, one row per column):

                                          0   1   2   3   4
TENURE                                   12  12  12  12  12
BALANCE_RANGE                             1   4   3   3   2
PURCHASES_RANGE                           1   0   2   3   1
ONEOFF_PURCHASES_RANGE                    0   0   2   3   1
INSTALLMENTS_PURCHASES_RANGE              1   0   0   0   0
CASH_ADVANCE_RANGE                        0   5   0   1   0
CREDIT_LIMIT_RANGE                        2   5   5   5   3
PAYMENTS_RANGE                            1   4   2   0   2
MINIMUM_PAYMENTS_RANGE                    1   3   2   2   1
BALANCE_FREQUENCY_RANGE                   9  10  10   7  10
PURCHASES_FREQUENCY_RANGE                 2   0  10   1   1
ONEOFF_PURCHASES_FREQUENCY_RANGE          0   0  10   1   1
PURCHASES_INSTALLMENTS_FREQUENCY_RANGE    1   0   0   0   0
CASH_ADVANCE_FREQUENCY_RANGE              0   3   0   1   0
PRC_FULL_PAYMENT_RANGE                    0   3   0   0   0
PURCHASES_TRX_RANGE                       1   0   3   1   1
CASH_ADVANCE_TRX_RANGE                    0   1   0   1   0
cluster                                   5   1   3   5   5

Interpretation of Clusters

for c in clusters:
    grid= sns.FacetGrid(clusters, col='cluster')
    grid.map(plt.hist, c)
  • Cluster 0: People with average to high credit limits who make all types of purchases
  • Cluster 1: A group with more people who have payments due and who take cash advances more often
  • Cluster 2: Low spenders with average to high credit limits who purchase mostly in installments
  • Cluster 3: People with high credit limits who take more cash in advance
  • Cluster 4: High spenders with high credit limits who make expensive purchases
  • Cluster 5: People who don't spend much money and who have average to high credit limits
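
Descriptions like these can be cross-checked against a numeric profile of each cluster, in addition to the histograms. The sketch below computes per-cluster feature means and cluster sizes on a made-up stand-in for the `clusters` frame; the frame `toy_clusters` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the `clusters` frame: two binned features plus a label
toy_clusters = pd.DataFrame({
    "CREDIT_LIMIT_RANGE": [5, 5, 2, 3, 6, 6],
    "PURCHASES_RANGE":    [3, 4, 0, 1, 6, 5],
    "cluster":            [0, 0, 1, 1, 2, 2],
})

# Mean of each feature per cluster, plus the number of customers in each cluster
profile = toy_clusters.groupby("cluster").mean()
profile["n_customers"] = toy_clusters.groupby("cluster").size()
print(profile)
```

Applied to the real frame, `clusters.groupby('cluster').mean()` would give one row per cluster, which makes statements such as "high credit limits" easier to verify.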

(Cluster numbers change between runs, because KMeans starts from a random initialization and no random_state is set.)
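
If stable cluster numbers are wanted, one option is to pass a fixed `random_state` to `KMeans`. The sketch below shows this on synthetic data; the data and the seed values are assumptions for illustration, and the original notebook does not set a seed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration)
X_toy, _ = make_blobs(n_samples=200, centers=4, random_state=42)

# Same seed and same data: the two fits assign identical labels
labels_a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_toy)
labels_b = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_toy)
print(np.array_equal(labels_a, labels_b))
```

With a fixed seed, a hand-written mapping from cluster number to description stays valid across re-runs on the same data.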

Visualization of Clusters

Using PCA to project the customers' pairwise cosine-distance matrix onto 2 dimensions for visualization

dist = 1 - cosine_similarity(X)

pca = PCA(2)
pca.fit(dist)
X_PCA = pca.transform(dist)
X_PCA.shape
(8950, 2)
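
The notebook fits PCA on the cosine-distance matrix rather than on X itself. In either case it can help to check how much variance the two retained components explain, using `explained_variance_ratio_`. A minimal sketch on made-up data (the array `X_toy` and its column scales are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up data: 5 features, with most of the variance in the first two
X_toy = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 0.5, 0.5, 0.5])

pca_toy = PCA(n_components=2)
X_2d = pca_toy.fit_transform(X_toy)

print(X_2d.shape)                                   # (200, 2)
print(pca_toy.explained_variance_ratio_.sum())      # share of variance kept by 2 components
```

A low ratio would suggest that the 2-D scatter plot hides much of the structure the clustering actually used.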
x, y = X_PCA[:, 0], X_PCA[:, 1]

colors = {0: 'red',
          1: 'blue',
          2: 'green', 
          3: 'yellow', 
          4: 'orange',  
          5:'purple'}

names = {0: 'who make all type of purchases', 
         1: 'more people with due payments', 
         2: 'who purchases mostly in installments', 
         3: 'who take more cash in advance', 
         4: 'who make expensive purchases',
         5:'who don\'t spend much money'}
  
df = pd.DataFrame({'x': x, 'y':y, 'label':labels}) 
groups = df.groupby('label')

fig, ax = plt.subplots(figsize=(20, 13)) 

for name, group in groups:
    ax.plot(group.x, group.y, marker='o', linestyle='', ms=5,
            color=colors[name],label=names[name], mec='none')

# Hide ticks and tick labels once, outside the loop.
# Recent matplotlib versions expect booleans here rather than the strings 'on'/'off'.
ax.set_aspect('auto')
ax.tick_params(axis='x',which='both',bottom=False,top=False,labelbottom=False)
ax.tick_params(axis='y',which='both',left=False,right=False,labelleft=False)

ax.legend()
ax.set_title("Customer Segmentation based on their Credit Card usage behavior.")
plt.show()

And it’s done!

By geekycodesco
