- Discover driving features affecting churn
- Use drivers to develop a machine learning model to predict churn
- Use these predictions to inform preemptive decisions aimed at alleviating future churn
- The telco_churn data from the Codeup SQL database was used for this project.
- The data was initially pulled on 26-OCT.
- The initial DataFrame contained 7043 records with 44 features (44 columns and 7043 rows) before cleaning & preparation.
- Each row represents a customer record, both current & historical.
- Each column represents a feature provided by telco or an informational element about the customer.
Prepare Actions
- DROP: Removed 4 index_id columns, 18 duplicate columns, and 1 corrupted data column
- RENAME: Initially did not need to Rename any original columns
- REFORMAT: 2 columns contained inappropriate data types that needed to be reformatted
- REPLACE: 7 columns had a third value that could be determined by another feature; the third value in each column was replaced with the appropriate yes/no value. 1 column had empty non-null values that were replaced with 0
- ENCODED: 14 categorical columns were encoded from their original values to boolean numeric values
- MELT: No melts needed
- PIVOT: 3 columns with more than two values were pivoted
- FEATURE ENGINEER: No new features were added
- DROP2: 16 Columns duplicated by Encoded and Pivot Columns were dropped
- RENAME2: 13 encoded columns were renamed after original columns were dropped
- NaN/Null: Only one column contained NaN/nulls in the data (it was in the corrupted field that was removed)
- OUTLIERS: No outliers have been removed or altered
- IMPUTE: No data was imputed
- SPLIT: train, validate and test (approx. 60/20/20), stratifying on target of 'churn'
- SCALED: no scaling was conducted
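
The REPLACE, ENCODED, PIVOT, and SPLIT steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual prepare code: the column names (`tech_support`, `contract_type`, `churn`), the `"No internet service"` value, the `contract` prefix, and the `random_state` are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "tech_support": ["Yes", "No", "No internet service", "No", "Yes"] * 20,
    "contract_type": ["Month-to-month", "One year", "Two year",
                      "Month-to-month", "One year"] * 20,
    "churn": ["Yes", "No", "No", "No", "No"] * 20,
})

# REPLACE: a third value implied by another feature collapses to a plain "No".
df["tech_support"] = df["tech_support"].replace("No internet service", "No")

# ENCODED: map yes/no columns to 1/0.
for col in ["tech_support", "churn"]:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# PIVOT: one indicator column per value of a multi-valued column,
# then drop the original column that the indicators duplicate.
dummies = pd.get_dummies(df["contract_type"], prefix="contract").astype(int)
df = pd.concat([df.drop(columns="contract_type"), dummies], axis=1)

# SPLIT: roughly 60/20/20, stratified on the target.
train_validate, test = train_test_split(
    df, test_size=0.2, stratify=df["churn"], random_state=42
)
train, validate = train_test_split(
    train_validate, test_size=0.25, stratify=train_validate["churn"], random_state=42
)
```

Taking 20% for test and then 25% of the remainder for validate is one common way to land on approximately 60/20/20.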
Most features have a min of 0 and a max of 1, so the mean represents the percentage of True values.
Printing nunique for all columns shows a count of True and False for each feature, giving a quick glance at the variance between feature values and allowing a quick inference of approximate percentages.
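
As a small illustration of why the mean of a 0/1 column reads as a percentage, the sketch below uses a made-up encoded column (the name `tech_support` and its values are assumptions). It also shows `value_counts`, which gives the per-value row counts, next to `nunique`, which gives the number of distinct values.

```python
import pandas as pd

# Made-up encoded column; 1 = True, 0 = False.
df = pd.DataFrame({"tech_support": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]})

# min / max / mean per column, one row per feature.
summary = df.describe().T[["min", "max", "mean"]]

# For a 0/1 column the mean is the fraction of rows equal to 1.
share_true = df["tech_support"].mean()

# Rows per value, and number of distinct values per column.
counts = df["tech_support"].value_counts()
n_distinct = df.nunique()
```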
Each of the following three features was tested for a relationship or difference against Churn.
- Tenure
- Monthly Charges
- Tech Support
All three comparison features showed a significant relationship with the target feature Churn.
There were four feature specific questions asked across three features all compared against our Target Feature of Churn.
- 1.1 Is the average Tenure of Active customers greater than the average Tenure of Churned customers?
- 2.1 Are the average monthly charges of customers that Churn higher than the average monthly charges of Active customers?
- 3.1 Is the average of customer Churn without Tech Support greater than the average of customer Churn with Tech Support?
- 3.2 Is the average of customer Churn without Tech Support greater than the average of Active customers without tech support?
Three statistical tests were used to test these questions.
- T-Test
- Pearson's R
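
A question such as 1.1 can be framed as a one-sided two-sample t-test, and a numeric feature can be checked against the 0/1 target with Pearson's r. The sketch below is only an illustration on synthetic numbers, not the project's data or results: the group means, sample sizes, the use of Welch's variant (`equal_var=False`), and `alpha = 0.05` are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for tenure in months; NOT the project's data.
tenure_active = rng.normal(loc=38, scale=10, size=200)
tenure_churned = rng.normal(loc=18, scale=10, size=200)

alpha = 0.05  # assumed significance level

# H0: mean tenure of active customers <= mean tenure of churned customers.
# Ha: mean tenure of active customers >  mean tenure of churned customers.
t_stat, p_value = stats.ttest_ind(
    tenure_active, tenure_churned, equal_var=False, alternative="greater"
)
reject_null = p_value < alpha

# Pearson's r between the numeric feature and the 0/1 churn flag.
tenure = np.concatenate([tenure_active, tenure_churned])
churn = np.concatenate([np.zeros(200), np.ones(200)])
r, p_r = stats.pearsonr(tenure, churn)
```

The `alternative="greater"` argument makes the test one-sided, which matches questions phrased as "is A greater than B".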
The first two questions, 1.1 and 2.1, did not test positively against our stated question.
The remaining two questions, 3.1 and 3.2, both involving Tech Support, tested positively against our stated question.
- 30% of all customers without tech support churn
- 82% of all churn is attributed to customers that do NOT have tech support
- Only 17% of customers with tech support churn
- Only 18% of all churn can be attributed to customers with tech support
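
The figures above mix two different kinds of percentage: the churn rate within each tech support group, and the share of all churn that each group accounts for. A minimal sketch of the distinction, using made-up counts rather than the project's data (the column names are assumptions):

```python
import pandas as pd

# Made-up records; the counts are illustrative only.
df = pd.DataFrame({
    "tech_support": [0] * 10 + [1] * 10,
    "churn":        [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9,
})

# Churn rate within each tech-support group (each row sums to 1).
rate_within_group = pd.crosstab(df["tech_support"], df["churn"], normalize="index")

# Share of each churn outcome attributable to each group (each column sums to 1).
share_of_outcome = pd.crosstab(df["tech_support"], df["churn"], normalize="columns")
```

In this toy frame, `rate_within_group.loc[0, 1]` is the churn rate among customers without tech support, while `share_of_outcome.loc[0, 1]` is the fraction of all churn coming from that group.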
- Churn is incredibly important as our target feature
- Tenure
- Monthly Charges
- Tech Support
Accuracy is our evaluation metric
Our target feature, Churn, splits the data 27% Churn, 73% Active
Simply guessing Active for every customer, we could achieve an accuracy of 73%
Therefore 73% will be the baseline accuracy used for this project
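
The baseline reasoning can be written out in a few lines: always predicting the most common class scores an accuracy equal to that class's share of the data. The 100 toy labels below simply mirror the 27% / 73% split stated above and are not the project's records.

```python
from collections import Counter

# Toy labels mirroring the 27% / 73% split reported above (100 made-up rows).
y = ["Churn"] * 27 + ["Active"] * 73

# The baseline "model" always predicts the most frequent class.
majority_class, majority_count = Counter(y).most_common(1)[0]
baseline_accuracy = majority_count / len(y)

print(majority_class, baseline_accuracy)
```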
Models will be developed and evaluated using three different model types and various hyperparameter configurations
- Decision Tree
- Random Forest
- KNN
Models will be evaluated on train and validate data
The model that performs the best will ultimately be the one and only model evaluated on our test data
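
The evaluation workflow described here could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not the telco data, and the hyperparameter values, the `random_state`, and the rule of picking the highest validate accuracy are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 73/27 class balance; NOT the telco data.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.73], random_state=42)
X_tv, X_test, y_tv, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=42
)

# Hyperparameter values are placeholders, not the configurations used in the project.
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42),
    "knn": KNeighborsClassifier(n_neighbors=10),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    # A large positive gap between train and validate hints at overfitting.
    scores[name] = {"train": train_acc, "validate": val_acc, "gap": train_acc - val_acc}

# Select on validate accuracy; only the selected model is scored on test.
best_name = max(scores, key=lambda n: scores[n]["validate"])
test_acc = models[best_name].score(X_test, y_test)
```

Keeping the test split untouched until a single model is chosen is what makes the final test accuracy an honest estimate.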
Decision Tree, Random Forest, and KNN models all performed above the Baseline of 73%
The KNN model performed slightly better on train data than it did on the validate data which may be a sign of overfit.
Because the results of the Decision Tree, Random Forest, and KNN models were all very similar and above Baseline, we could proceed to test with any of these models.
Random Forest, however, is the best model: it retained high performance across both train and validate data and will likely perform well above Baseline on the test data.
- Consider implementing incentives for increased Tech Support
- Decision Tree focused on other driving features above Tech Support
- Investigate further into these features
- Try running models with fewer features to isolate the cause of predictions
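
One possible way to follow up on the last two points is to inspect a fitted tree's `feature_importances_` and refit on a reduced feature set. The sketch below uses synthetic data and placeholder feature names (`feature_0`, `feature_1`, ...), and keeping the top 3 features is an arbitrary assumption, not a recommendation from the project.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the feature names are placeholders, not the telco columns.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X, y)

# Impurity-based importances, largest first; they sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

# Refit on only the top features to see how much of the signal they carry.
top_features = importances.head(3).index.tolist()
X_small = pd.DataFrame(X, columns=feature_names)[top_features]
small_tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_small, y)
```

Comparing the accuracy of `small_tree` against the full model on held-out data would show whether the dropped features were contributing to the predictions.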