Skip to content

Machine learning prediction on whether a lead will convert into an opportunity or not

Notifications You must be signed in to change notification settings

Gabya06/LeadConversion_Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Lead Conversion Prediction

Project Overview

The goal of this project is to predict whether a lead will convert into an opportunity or not.

Data Processing and Cleaning

Data was collected from the Lead table in SalesForce MSSQL server by connecting to the database using pymssql and querying the data table. There were about 52000 total rows and 44 columns. Some of the lead data collected included First Name, Last Name, Title, Email, Country, Company size, Industry, Phone, Website, Created Date, Lead Source and Lead Owner.

Data cleaning involved grouping industries together, assiging a title score, changing data types, mapping the number of employees to categorical values, removing columns with too many missing values, and adding fiscal quarters based on created date.

Data Exploration & Visualization

Plotting the target variables made it clear that there are more unconverted leads than converted leads.

target

In looking at the lead source, we can see that Marketing generates more leads, and a higher percentage of Marketing leads convert, whereas there are fewer converted Sales sourced leads.

source

When digging deeper and considering industry and source, we can see that Marketing leads with 'Other' industry tend to convert despite not being in one of the dominant industries whereas Sales 'Other' lead dont quite convert as much. This looks like it can be attributed to a data quality issue and not that we dont necessarily know the lead industry. Lastly, I noticed that company size seems to have a larger impact on Sales leads converting than on Marketing leads.

mktg_source

Machine Learning models: Decision Tree, Random Forest and Logistic Regression

Decision Tree

I split the data into training and test sets using 85% of the data for model training and 15% for model evaluation. I chose to start by using a decision tree model because it is easy to interpret. As a first pass, I explored feature importance without doing much parameter tuning, but instead initially set the maximum tree depth to 5 because I didn't want to build a very deep tree and overfit the training data. And, I chose to set the minimum sample split to 10 (there needs to be a minimum of 10 observations for there to be a split on a node).

X, y = shuffle(df_transformed, df_us.IsConverted_2 , random_state=23)

'''
FEATURE IMPORTANCE USING DECISION TREE
'''
# split data into training and test sets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.15, random_state=1)


# create decision tree model and fit to training data set to obtain feature importance
tree_model = tree.DecisionTreeClassifier(criterion='gini', max_depth=5, min_samples_split=10,min_samples_leaf=5)
#tree_model = tree.DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
# predict test 
tree_predicted = tree_model.predict(X_test)

In looking at the feature importance, the initial tree indicated that title score, lead source, marketing score, and owner were among the important features.

Feature Importance Feature
0.931174 Title Score
0.027602 LeadSource
0.013535 Marketing Score
0.007659 Source
0.007086 OwnerId
0.004813 Type
0.003364 Owner

Next, I used GridSearchCV to perform cross validation to do parameter tuning, to search for best tree depth and minimum sample split. The default cross validation sets k to 3, found minimum samples per split to be 5 and max depth of 3, and the best score was 82%.

'''
    Cross validation using Grid Search - for best parameters
'''
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':range(3,20), 'min_samples_split':range(5, 20), 'min_samples_leaf':range(5,15) }
clf = GridSearchCV(tree.DecisionTreeClassifier(), parameters, n_jobs=4)
clf.fit(X=x, y=y)
tree_model = clf.best_estimator_

print (clf.best_score_, clf.best_params_)

(0.82215029041626331, {'min_samples_split': 5, 'max_depth': 3, 'min_samples_leaf': 5})

The features with highest feature importance were:

Feature Importance Feature
0.956501 Title Score
0.028352 LeadSource
0.007278 OwnerID
'''
FEATURE IMPORTANCE USING DECISION TREE
'''

X, y = shuffle(df_transformed, df_us.IsConverted_2 , random_state=23)
# split data into training and test sets
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.20, random_state=1)


# create decision tree model and fit to training data set to obtain feature importance
tree_model = tree.DecisionTreeClassifier(max_depth = clf.best_params_['max_depth'], 
                                         min_samples_split = clf.best_params_['min_samples_split'],
                                        min_samples_leaf= clf.best_params_['min_samples_leaf'])
#tree_model = tree.DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
# predict test 
tree_predicted = tree_model.predict(X_test)

After training the model with the best parameters, the model scored 93% on the test set.

# Decision Tree performance
print "Decision Tree score is: ", tree_model.score(X_test, y_test)

# Classification Report
print "\nClassification Report:\n"
print (classification_report(y_test, tree_predicted))

# Confustion Matrix
print "Confusion Matrix:" 
print(get_confusion_matrix(y_test, tree_predicted))

print confusion_matrix(y_true=y_test,y_pred=tree_predicted)

Decision Tree score is: 0.931034482759

Classification Report:

Class Precision Recall F1-Score Support
0 0.92 0.98 0.95 3365
1 0.95 0.83 0.89 1594
avg/total 0.93 0.93 0.93 4959

Confusion Matrix:

NotConverted-Predicted Converted-Predicted
NotConverted-Actual 3297 68
Converted-Actual 274 1320

In comparison to the Random Forest and Logistic Regression, the tuned Decision Tree performed the worse.

Random Forest

Using 80% training and 20% test set and 1500 estimators, the Random Forest model scored 98.6% accuracy

'''
Random Forest Classifier 
'''
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.20, random_state=1)

rf = RandomForestClassifier(n_estimators = 1500)
rf.fit(X_train, y_train)
rf_predicted = rf.predict(X_test)

# Random Forest performance
print "Random Forest score is: ", rf.score(X_test, y_test)

# Classification Report
print "\nClassification Report:\n"
print (classification_report(y_test, rf_predicted))

# Confustion Matrix
print "Confusion Matrix:" 
print(get_confusion_matrix(y_test, rf_predicted))

Random Forest score is: 0.986988847584

Classification Report:

Class Precision Recall F1-Score Support
0 0.99 0.99 0.99 4401
1 0.99 0.97 0.98 2055
avg/total 0.99 0.99 0.99 6456

Confusion Matrix:

NotConverted-Predicted Converted-Predicted
NotConverted-Actual 4378 23
Converted-Actual 61 1994

Logistic Regression

Using 80% training and 20% test set, the Logistic Regression model scored 97.3% accuracy

'''
    Logistic Regression
'''
# training and testing
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2, random_state=1)

logreg = LogisticRegression(random_state=123, verbose=1)

#Cs = np.logspace(-7, 5, num=15)
logreg = LogisticRegression(C=10.0)
logreg.fit(X_train,y_train)
logreg_predicted = logreg.predict(X_test)
# Logistic Regression performance
print "\nC is ", c, "Logistic Regression score is: ", logreg.score(X_test, y_test)

# Classification Report
print "\nClassification Report:\n"
print (classification_report(y_test, logreg_predicted))

# Confustion Matrix
print "Confusion Matrix:" 
print(get_confusion_matrix(y_test, logreg_predicted))

C is 100000.0 Logistic Regression score is: 0.972893432466

Classification Report:

Class Precision Recall F1-Score Support
0 0.98 0.99 0.98 4432
1 0.97 0.95 0.96 2024
avg/total 0.97 0.97 0.97 6456

Confusion Matrix:

NotConverted-Predicted Converted-Predicted
NotConverted-Actual 4368 64
Converted-Actual 111 1913

Conclusion - Model Summary

While the tuned Decision Tree attributed more information gain to fewer features, the score was 93% and underperformed in comparison to the average scores of Random Forest (98.7%) and Logistic Regression (97.3%). Random Forests beat Logistic Regression by quite a bit, so I think this is the model we should implement for predicting whether a lead will convert to an opportunity or not.