GOAL: predict customer churn behavior in order to retain customers.
The features:
- gender (Male, Female)
- SeniorCitizen (Yes, No)
- Partner (Yes, No)
- Dependents (Yes, No)
- tenure (number of months the customer has stayed with the company)
- PhoneService (Yes, No)
- MultipleLines (Yes, No, No phone service)
- InternetService (DSL, No, Fiber optic)
- OnlineSecurity (Yes, No, No internet service)
- OnlineBackup (Yes, No, No internet service)
- DeviceProtection (Yes, No, No internet service)
- TechSupport (Yes, No, No internet service)
- StreamingTV (Yes, No, No internet service)
- StreamingMovies (Yes, No, No internet service)
- Contract (Month-to-month, One year, Two year)
- PaperlessBilling (Yes, No)
- PaymentMethod (Bank transfer (automatic), Mailed check, Electronic check, Credit card (automatic))
- MonthlyCharges
- TotalCharges
- Churn (target: Yes, No)
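The modelling code below assumes the data has already been loaded and split into X_train/X_test. A minimal preparation sketch, assuming the standard Kaggle Telco churn CSV; the file name, seed value, and split parameters are assumptions, not taken from the original notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # value assumed here; the notebook defines it elsewhere

# File name is an assumption (the standard Kaggle Telco churn CSV).
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges loads as strings because brand-new customers have blank values.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

X = df.drop(columns=["customerID", "Churn"])
y = (df["Churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_SEED)
```

If the raw string columns are kept, their names must also be passed to CatBoost via cat_features when fitting; alternatively they can be one-hot encoded first.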
gender
- gender does not affect the client's decision

SeniorCitizen
- older clients cancel the service more often

Partner & Dependents
- clients with a partner, as well as clients with children, cancel less often; perhaps the company could offer favorable family tariffs

InternetService
- fiber-optic customers cancel more often; customers who do not use the Internet cancel very rarely

OnlineSecurity, OnlineBackup & DeviceProtection
- clients who use protection services, as well as those who use cloud backup, cancel more often; competitors also have attractive package offers with such additional services

TechSupport
- customers who do not contact technical support are more likely to cancel

Contract
- logically, clients on a short-term (month-to-month) contract leave more often

PaperlessBilling & PaymentMethod
- customers who receive and pay bills in a conservative way (paper bills, mailed checks) are less likely to change providers

- Clients with about 3-6 connected services churn most often (see the sketch after this list).
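A sketch of how the per-category churn rates behind these observations can be computed with pandas (df is the raw DataFrame from the loading sketch above; the notebook's actual plots are not reproduced here):

```python
# Churn rate per category for a few of the features discussed above.
for col in ["Contract", "InternetService", "PaymentMethod", "TechSupport"]:
    print(df.groupby(col)["Churn"].apply(lambda s: (s == "Yes").mean()))

# Number of connected services per client, to check the 3-6 services claim.
service_cols = ["PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
                "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies"]
n_services = (df[service_cols] == "Yes").sum(axis=1)
print(df.assign(n_services=n_services)
        .groupby("n_services")["Churn"]
        .apply(lambda s: (s == "Yes").mean()))
```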
Plot created with:

```python
from catboost import CatBoostClassifier

# RANDOM_SEED is defined earlier in the notebook.
cat_model = CatBoostClassifier(
    verbose=False,
    random_state=RANDOM_SEED,
    custom_loss=['AUC', 'Accuracy', 'Precision', 'Recall', 'F1'])

cat_model.fit(X_train, y_train,
              eval_set=(X_test, y_test),
              plot=True)  # renders the interactive metric plot in a notebook
```
- After the 92nd iteration overfitting begins; the logloss value there is 0.42.
- The best AUC, 0.8373, is reached at the 108th iteration; after about the 300th iteration the AUC drops below 0.8.
- Accuracy peaks at the 330th iteration (0.79); between roughly iterations 50 and 500 it stays around 0.79, then decreases slightly.
- Precision peaks at the 7th iteration (0.66); during the first 500 iterations it hovers around 0.63, then drops rapidly.
- Recall reaches 0.52 at the 450th iteration.
- The F1-score reaches 0.57 at the 479th iteration.
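Instead of reading these points off the interactive plot, the per-iteration metric values can be pulled from the trained model; a short sketch using CatBoost's get_evals_result and get_best_iteration:

```python
import numpy as np

evals = cat_model.get_evals_result()  # per-iteration metric curves
val_auc = evals["validation"]["AUC"]  # AUC on the eval_set
print("best AUC:", max(val_auc), "at iteration", int(np.argmax(val_auc)))

# Best iteration according to the eval metric (Logloss here).
print("best iteration:", cat_model.get_best_iteration())
```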
Default values:
- learning_rate: 0.04697500169277191
- subsample: 0.800000011920929
- depth: 6
- min_data_in_leaf: 1
- max_leaves: 64
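These defaults can be read back from the fitted model; a sketch using get_all_params, which returns the resolved training parameters:

```python
params = cat_model.get_all_params()  # resolved parameters, defaults included
for key in ["learning_rate", "subsample", "depth",
            "min_data_in_leaf", "max_leaves", "iterations"]:
    print(key, "=", params.get(key))  # .get: some keys depend on grow policy
```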
Clearly, the default of 1000 iterations is too many for this data; the optimal number seems to be no more than 500.
Since the model starts to overfit, a learning_rate in the range [0.01, 0.02, 0.03, 0.04, 0.05] looks optimal for tuning.
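A minimal sketch of the Optuna search behind the "tuned" column below, using the iteration cap and learning_rate range argued above; the depth range, trial count, and choice of F1 as the objective are assumptions, not necessarily what the notebook used:

```python
import optuna
from catboost import CatBoostClassifier
from sklearn.metrics import f1_score

def objective(trial):
    params = {
        # learning_rate range from the analysis above.
        "learning_rate": trial.suggest_categorical(
            "learning_rate", [0.01, 0.02, 0.03, 0.04, 0.05]),
        # Cap iterations at 500, per the overfitting observation above.
        "iterations": trial.suggest_int("iterations", 100, 500),
        "depth": trial.suggest_int("depth", 4, 8),  # assumed range
        "random_state": RANDOM_SEED,
        "verbose": False,
    }
    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train, eval_set=(X_test, y_test))
    return f1_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # trial count assumed
print(study.best_params)
```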
Total metrics:
| Metric | Default CatBoost | CatBoost tuned with Optuna |
|---|---|---|
| Accuracy | 0.7886 | 0.7929 |
| Precision | 0.4938 | 0.4991 |
| Recall | 0.6310 | 0.6422 |
| F1-score | 0.5540 | 0.5617 |
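These numbers come from predictions on the test set; a sketch of how such a table is computed with scikit-learn (tuned_model is a hypothetical name for the Optuna-tuned classifier):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# tuned_model: hypothetical name for the model refit with study.best_params.
for name, model in [("Default CatBoost", cat_model),
                    ("CatBoost tuned with Optuna", tuned_model)]:
    pred = model.predict(X_test)
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, pred):.4f}, "
          f"precision={precision_score(y_test, pred):.4f}, "
          f"recall={recall_score(y_test, pred):.4f}, "
          f"F1={f1_score(y_test, pred):.4f}")
```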