Customer Segmentation

A report for Arvato Financial Solutions — Data Scientist Nanodegree, Udacity

Antonio Lúcio
9 min read · Apr 13, 2021

This report is part of the Capstone Project of the Udacity Data Scientist Nanodegree, for which Arvato Financial Solutions kindly provided data on a real problem so that students could apply what they learned to a real-world scenario.

Arvato Financial Solutions has been providing services that help other companies navigate the complexity of credit management since 1961. Today, around 7,000 experts deliver efficient credit management solutions in about 15 countries around the globe.

Introduction

The objective of this project was to analyze the demographic data of customers of a direct mail-order company in Germany, comparing them with demographic information on the general population. To do so, we carry out an exploratory analysis, segment the customers, and build a supervised learning model able to decide whether or not it is worthwhile to include a person in a campaign.

The project is divided into 4 parts:

1. Knowing the data (data cleaning and some exploratory data analysis): many problems of missing and incomparable values will be faced.
2. Customer segmentation report (using unsupervised learning techniques) to better understand the general population and the customers.
3. Supervised learning model, so that we can decide whether or not to include a person in a campaign.
4. Kaggle competition, where I use the predictions of the models built here in a competition with the other students.

Data Analysis

We have 4 data files to be used in the project:

  • Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
  • Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
  • Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
  • Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Below we can see the first 5 rows of the Udacity_AZDIAS and Udacity_CUSTOMERS data:

Udacity_AZDIAS_052018 data frame
Udacity_CUSTOMERS_052018 data frame

We can see that the data structure is similar, but the ‘Udacity_CUSTOMERS_052018’ database contains 3 extra columns: ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’ and ‘PRODUCT_GROUP’, which provide broad information about the customers.

NaN values

There are 891,221 rows in the general population data frame and 191,652 in the customers data frame, so there are 699,569 potential new clients! We can also see that there are a lot of NaN values in both data frames.

For the customer database, we have:

mean    0.196049
std     0.151437
min     0.000000
25%     0.000000
50%     0.267574
75%     0.267574
max     0.998769

As we can see, 75% of the columns have less than 27% NaN values.
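These summary statistics describe the fraction of NaN values per column. A minimal sketch of how they can be reproduced, assuming the customers CSV is loaded with pandas (the separator is an assumption about the file format):

```python
import pandas as pd

# Load the customers file; the ';' separator is an assumption here.
customers = pd.read_csv('Udacity_CUSTOMERS_052018.csv', sep=';')

# Fraction of NaN values in each column, then summary statistics
# over those per-column fractions.
nan_fraction = customers.isnull().mean()
print(nan_fraction.describe())
```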

For the general population database, we have:

mean    0.102680
std     0.121640
min     0.000000
25%     0.000000
50%     0.118714
75%     0.120230
max     0.998648

As we can see, 75% of the columns have at most 12% NaN values.

Looking at the results above, we find columns with more than 40% NaN values: ALTER_KIND4, ALTER_KIND3, ALTER_KIND2, ALTER_KIND1, KK_KUNDENTYP and EXTSEL992. Of these, only KK_KUNDENTYP has a description, so the others were discarded.

Lacking a more robust imputation setup, I decided to replace the remaining NaN values with -1, which was the default code for unknown values in the data.
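A minimal sketch of this cleaning step, assuming the two DataFrames are already loaded as `azdias` and `customers` (the variable names are placeholders):

```python
# Columns with more than 40% NaN values and no description.
high_nan_cols = ['ALTER_KIND4', 'ALTER_KIND3', 'ALTER_KIND2',
                 'ALTER_KIND1', 'EXTSEL992']

for df in (azdias, customers):
    df.drop(columns=high_nan_cols, inplace=True, errors='ignore')
    df.fillna(-1, inplace=True)  # -1 is the dataset's code for "unknown"
```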

Comparing the General Population against the Customers

How does the general population differ from the customers?

1: < 30 years | 2: 30~45 years | 3: 46~60 years | 4: > 60 years | 9: uniformly distributed

As we can see, more than 40% of customers are over 60 years old, while this share in the general population is 25%.

Is the gender distribution the same between the general population and the customers?

1: Men | 2: Women

There are more men in the customers database, while in the general population database there are more women.

Is the income distribution the same between the general population and the customers?

-1: unknown | 1: highest income | 2: very high income | 3: high income | 4: average income | 5: lower income | 6: very low income

We were able to observe that most of the clients are men over 60 years old with a very high income, while the majority of the general population is between 46 and 60 years old, female, and has a very low income.

Scaling and PCA

Since some columns are of type object, I used LabelEncoder to transform all the objects into integers. LabelEncoder works by mapping each unique value to an integer from 0 to n-1, where n is the number of unique values.

With both data frames fully numeric, it was time to scale the data. I chose MinMaxScaler, which scales every feature to values between zero and one.

This step is important to make the algorithm converge faster and to prevent the scale of each feature from acting as an implicit weight.
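A minimal sketch of this encoding and scaling step, assuming `azdias` is the cleaned general population DataFrame (in a full pipeline the same fitted encoder and scaler should also be applied to the customers data):

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Encode each object (string) column as integers 0..n-1.
for col in azdias.select_dtypes(include='object').columns:
    azdias[col] = LabelEncoder().fit_transform(azdias[col].astype(str))

# Scale every feature to the [0, 1] range.
scaler = MinMaxScaler()
azdias_scaled = scaler.fit_transform(azdias)
```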

Considering that the data has many features (366), I used Principal Component Analysis (PCA) to reduce the dimensionality and keep only the most informative components.

For this part I fit the PCA with its n_components parameter set to 0.9, so that by the end of the process the resulting components would explain 90% of the data's variability.

The result was a 72% reduction in dimensionality: from 366 features before PCA to 103 components after.
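This behavior comes from scikit-learn itself: with 0 < n_components < 1, PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)  # keep 90% of the explained variance
azdias_pca = pca.fit_transform(azdias_scaled)
print(azdias_pca.shape[1])   # here: 103 components
```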

Unsupervised Learning

The Unsupervised Learning algorithm chosen was K-means.

K-means clustering is a method of vector quantization that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

The most important hyperparameter for K-means is the value of k. The two most common approaches to find it are the elbow method and the silhouette score.

I used the elbow method, computing the within-cluster sum of squares (WSS) for k ranging from 1 to 10, as you can see in the plot below.

Within-cluster sum of squares for each value of k
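A minimal sketch of the elbow computation behind this plot (`inertia_` is scikit-learn's name for the WSS; the random seed is an assumption):

```python
from sklearn.cluster import KMeans

wss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(azdias_pca)
    wss.append(kmeans.inertia_)  # within-cluster sum of squares for this k
```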

We see an elbow at k = 2, but analyzing each WSS value I noticed a big decrease from k = 2 (2390038.4060712038) to k = 10 (596970.7616676809), a reduction of approximately 75%, so I set k = 10 and trained the K-means model.

The result of the K-means model: each row was assigned a cluster label ranging from 0 to 9.

With this information, I generated graphs showing the proportion of people in each cluster for the general population and for the customers, with the intention of finding clusters in which people from the general population are more likely to convert into customers (the computation is sketched below the chart).

% of the General and Customers data for each cluster
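A minimal sketch of how these proportions can be computed, assuming `customers_pca` is the customers data transformed with the same scaler and PCA fitted on the general population:

```python
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=42).fit(azdias_pca)
general_clusters = kmeans.predict(azdias_pca)
customer_clusters = kmeans.predict(customers_pca)

# Share of each cluster (0..9) in the two datasets.
general_share = np.bincount(general_clusters, minlength=10) / len(general_clusters)
customer_share = np.bincount(customer_clusters, minlength=10) / len(customer_clusters)
```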

Looking at this graph, we can conclude that the clusters where the general population share is closest to the customer share are 0, 1 and 5. These are therefore the clusters with lower potential for customer conversion, given the balance between population and customers.

However, we see that clusters 3, 6 and 9 have very few customers relative to the population, making them clusters with great conversion potential given the low number of existing customers in them.

So I took all the people in cluster 9 and redid the same three graphs from before to check their behavior:

1: highest income | 2: very high income | 3: high income | 4: average income | 5: lower income | 6: very low income
1: < 30 years | 2: 30~45 years | 3: 46~60 years | 4: > 60 years | 9: uniformly distributed
1: Men | 2: Women

We can see that the majority of people in cluster 9 have a very low income, are 60 years old or younger and are men.

Supervised Learning

The goal of this section is to build a prediction model that, using the demographic information of each individual, can decide whether or not it is worth including that person in the campaign.

Data

Udacity_MAILOUT_052018_TRAIN will be used to train the models and Udacity_MAILOUT_052018_TEST to test them.

The train data has the same columns as the general population data plus the target RESPONSE column, a binary variable with a value of 1 if that person responded to the campaign and 0 otherwise.

However, in the whole data frame there are only 532 rows containing a 1 (1.2%), so this is an imbalanced problem. Accuracy should not be used to measure the results, because a model that always predicts 0 would reach 98.8% accuracy on the train data.

The metric used instead is the Area Under the ROC Curve (AUC). The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies.
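Since the AUC is computed from continuous scores rather than hard 0/1 predictions, regressor outputs can be evaluated directly. A toy example with made-up numbers:

```python
from sklearn.metrics import roc_auc_score

# Four individuals, one responder, with continuous model scores.
y_true = [0, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.3]
print(roc_auc_score(y_true, y_score))  # 1.0: the responder outranks everyone
```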

Data wrangling and cleaning

The first step of this process is again data wrangling, starting by fixing some columns that mixed numeric values with strings: the columns CAMEO_DEUG_2015 and CAMEO_INTL_2015 contained ‘X’ and ‘XX’ entries, which were replaced with -1.
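A minimal sketch of that replacement, with `train` as a placeholder name for the MAILOUT train DataFrame:

```python
# Replace the stray string codes with the numeric "unknown" value and
# cast the columns back to a numeric type.
for col in ('CAMEO_DEUG_2015', 'CAMEO_INTL_2015'):
    train[col] = train[col].replace({'X': -1, 'XX': -1}).astype(float)
```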

The distribution of NaN values in the train data is similar to the distribution in the general data frame, as can be seen in the plot below.

Histogram plot of NaN values for Train data

Dealing with these NaN values is an important step that can influence the outcome of the model, so a few approaches will be tested (the mode strategy is sketched after the list):

  • Dropping or not the columns ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’ and ‘ALTER_KIND1’, which have more than 40% NaN values and no descriptions.
  • Replacing the NaN values with -1.
  • Replacing the NaN values with the statistical mode of each column.
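A minimal sketch of the mode strategy from the list above (the other two strategies are a plain `fillna(-1)` and a column drop):

```python
# Fill each column's NaNs with that column's most frequent value;
# train.mode() returns the modes, and .iloc[0] takes the first one per column.
train = train.fillna(train.mode().iloc[0])
```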

LabelEncoder was used to transform the object columns into numeric ones, and RobustScaler, which scales features using statistics that are robust to outliers, was used to scale the data.
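A minimal sketch of this step, continuing with the placeholder `train` DataFrame:

```python
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Encode the remaining object columns as integers, then scale with
# median/IQR statistics, which are less sensitive to outliers than
# min/max scaling.
for col in train.select_dtypes(include='object').columns:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))

X = RobustScaler().fit_transform(train.drop(columns='RESPONSE'))
y = train['RESPONSE']
```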

Algorithms

Since this is an imbalanced problem, I chose algorithms that are commonly used in that scenario: XGBRegressor and AdaBoostRegressor.

I started with AdaBoostRegressor, replacing all NaN values with -1 and keeping the columns ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’ and ‘ALTER_KIND1’; the AUC obtained was 0.77.

The second experiment used XGBRegressor, replacing all NaN values with -1 and removing the columns ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’ and ‘ALTER_KIND1’; the AUC obtained was 0.61.

The third experiment was based on the second, extending it with a RandomizedSearchCV over the hyperparameters, and the AUC obtained was 0.75 (fitting 3 folds for each of 7 candidates, totalling 21 fits).
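A minimal sketch of such a search, matching the 7 candidates x 3 folds quoted above; the parameter ranges are illustrative, not the ones from the original experiment:

```python
from scipy.stats import randint, uniform
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions={
        'n_estimators': randint(100, 500),
        'max_depth': randint(2, 8),
        'learning_rate': uniform(0.01, 0.3),
    },
    n_iter=7,                            # 7 candidates
    cv=3,                                # 3 folds -> 21 fits in total
    scoring=make_scorer(roc_auc_score),  # score the continuous predictions
    verbose=1,
)
search.fit(X, y)
```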

In the fourth and final experiment, I used AdaBoostRegressor in a stacked model combining XGBRegressor with AdaBoost. The final AUC obtained was 0.82, removing the columns ‘ALTER_KIND4’, ‘ALTER_KIND3’, ‘ALTER_KIND2’ and ‘ALTER_KIND1’ and replacing the NaN values with the statistical mode of each column.
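The post does not detail the exact stacking setup, so the sketch below is one plausible way to combine the two regressors, using scikit-learn's StackingRegressor with an assumed linear meta-model (`X_test` stands for the preprocessed MAILOUT test data):

```python
from sklearn.ensemble import AdaBoostRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

stacked = StackingRegressor(
    estimators=[('ada', AdaBoostRegressor()), ('xgb', XGBRegressor())],
    final_estimator=LinearRegression(),  # the meta-model is an assumption
)
stacked.fit(X, y)
scores = stacked.predict(X_test)  # continuous scores for roc_auc_score
```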

As we could see, the combination of AdaBoost + XGBoost, together with removing the four ALTER_KIND columns and replacing the NaN values with the statistical mode of each column, produced the best score observed.

Conclusion

This was an interesting problem to be solved, applying several learned concepts, and the coolest thing was that it was a real problem.

I had to deal with a huge database containing 366 columns, many NaN values, and columns mixing strings and integers, so data cleaning was essential.

I used K-Means and did experiments with different models and hyperparameters, looking for the best result. It took a lot of work.

Next Steps and Considerations

Some actions could be taken to try to improve the results. Other clustering techniques such as DBSCAN could be used and the results compared with the K-means approach. For the supervised learning part, a random search over the parameters of the stacked model could improve the results, and using KNNImputer to handle the NaN values could also improve the AUC.
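As a pointer for that last idea, a minimal KNNImputer sketch, assuming a fully numeric `train` DataFrame:

```python
from sklearn.impute import KNNImputer

# Impute each missing value from its 5 nearest neighbours instead of
# using a constant or the per-column mode.
train_imputed = KNNImputer(n_neighbors=5).fit_transform(train)
```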

To accomplish these results, I read many tutorials, articles and documentation from https://machinelearningmastery.com, https://www.kaggle.com and, of course, https://stackoverflow.com to get insights.

All the code used in this project can be found in this repository.


Antonio Lúcio

Bachelor's degree in Computer Science from UFCG. Senior Data Scientist. Passionate about cryptocurrencies since 2015.