This project is part of the Udacity Azure ML Nanodegree.
For this project, our main tasks were to build and optimize an Azure ML pipeline using Scikit-learn Logistic Regression model. The hyperparameters of this model are optimized using HyperDrive. This model is then compared to Azure AutoML so that the results obtained by both models are compared.
A diagram illustrating the steps of this project is shown below:
source: Nanodegree Program Machine Learning Engineer with Microsoft Azure
This dataset contains information related to direct marketing campaigns (via phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (yes/no).
Input variables:
- `age`: (numeric)
- `job`: type of job (categorical)
- `marital`: marital status (categorical)
- `education`: (categorical)
- `default`: has credit in default? (categorical)
- `housing`: has housing loan? (categorical: 'no', 'yes', 'unknown')
- `loan`: has personal loan? (categorical)
- `contact`: contact communication type (categorical)
- `month`: last contact month of year (categorical)
- `day_of_week`: last contact day of the week (categorical)
- `duration`: last contact duration, in seconds (numeric)
- `campaign`: number of contacts performed during this campaign and for this client (numeric)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric)
- `previous`: number of contacts performed before this campaign and for this client (numeric)
- `poutcome`: outcome of the previous marketing campaign (categorical)
- `emp.var.rate`: employment variation rate - quarterly indicator (numeric)
- `cons.price.idx`: consumer price index - monthly indicator (numeric)
- `cons.conf.idx`: consumer confidence index - monthly indicator (numeric)
- `euribor3m`: euribor 3 month rate - daily indicator (numeric)
- `nr.employed`: number of employees - quarterly indicator (numeric)
Output variable (desired target):
- `y`: has the client subscribed to a term deposit? (binary: 'yes', 'no')
Original source of the data: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014 (https://repositorio.iscte-iul.pt/bitstream/10071/9499/5/dss_v3.pdf)
As mentioned above, we proceeded with the two following approaches:
- Scikit-learn Logistic Regression model: two of its hyperparameters were optimized using HyperDrive.
- Azure AutoML: AzureML iterated through different models and selected the single model that best optimized the metric of interest (i.e., `accuracy`).
Using HyperDrive, we were able to obtain an accuracy of 90.74%. Using AutoML, the best performing model was `Voting Ensemble`, which achieved an accuracy of 91.67%.
The script `train.py` includes a few strategic steps:
- Loading the dataset.
- Cleaning and transforming the data (e.g., dropping NaN values, one-hot encoding, etc.).
- Calling the Sklearn Logistic Regression model with the following parameters (a minimal sketch of the script is shown after this list):
  - `--C` (float): inverse of regularization strength
  - `--max_iter` (int): maximum number of iterations taken for the solvers to converge
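The sketch below illustrates what the core of `train.py` might look like. The real loading/cleaning helpers and dataset URL are omitted; a synthetic dataset stands in for the cleaned bank-marketing data purely so the sketch runs, and the metric name `Accuracy` is an assumption for illustration.

```python
import argparse
import os

import joblib
from azureml.core.run import Run
from sklearn.linear_model import LogisticRegression

run = Run.get_context()

parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0,
                    help="Inverse of regularization strength")
parser.add_argument("--max_iter", type=int, default=100,
                    help="Maximum number of iterations for the solver to converge")
args = parser.parse_args()

# Placeholder for the real loading/cleaning steps (drop NaNs, one-hot encode):
# a synthetic dataset is used here only to make the sketch self-contained.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the Logistic Regression model with the HyperDrive-supplied hyperparameters
model = LogisticRegression(C=args.C, max_iter=args.max_iter)
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)

# Log the metric that HyperDrive optimizes on
run.log("Accuracy", float(accuracy))

# Persist the trained model so the best run can be registered later
os.makedirs("outputs", exist_ok=True)
joblib.dump(model, "outputs/model.joblib")
```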
The following steps were run from the main notebook:
- Initialize our `Workspace`.
- Create an `Experiment`.
- Define resources, i.e., create `AmlCompute` as the training compute resource.
- Perform hyperparameter tuning, i.e., define the parameters to be used by HyperDrive, which involved specifying a parameter sampler, a policy for early termination, and creating an estimator for the `train.py` script.
- Submit the `HyperDriveConfig` to run the experiment.
- Use `get_best_run_by_primary_metric()` on the run to select the best combination of hyperparameters for the Sklearn Logistic Regression model.
- Save the best model.
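A condensed sketch of these notebook steps is shown below. The experiment name, cluster name, VM size, search ranges, and run limits are illustrative assumptions; the `SKLearn` estimator mirrors the "estimator" mentioned above (newer SDK versions use `ScriptRunConfig` instead).

```python
from azureml.core import Experiment, Workspace
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig, PrimaryMetricGoal,
                                      RandomParameterSampling, choice, uniform)
from azureml.train.sklearn import SKLearn

# 1. Workspace and experiment
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="bank-marketing-hyperdrive")

# 2. Provision the training cluster (name and VM size are illustrative)
compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
aml_compute = ComputeTarget.create(ws, "cpu-cluster", compute_config)
aml_compute.wait_for_completion(show_output=True)

# 3. Parameter sampler, early termination policy, and estimator for train.py
param_sampling = RandomParameterSampling({"--C": uniform(0.01, 1.0),
                                          "--max_iter": choice(50, 100, 200)})
early_termination = BanditPolicy(slack_factor=0.1, evaluation_interval=2)
estimator = SKLearn(source_directory=".", entry_script="train.py", compute_target=aml_compute)

hyperdrive_config = HyperDriveConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=4)

# 4. Submit, pick the best child run, and register its model
hyperdrive_run = experiment.submit(hyperdrive_config)
hyperdrive_run.wait_for_completion(show_output=True)
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run.register_model(model_name="hyperdrive-logreg", model_path="outputs/model.joblib")
```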
In the random sampling algorithm used in this project, parameter values are chosen from a set of discrete values (`choice`) or randomly selected over a uniform distribution.
The other two available techniques (Grid Sampling and Bayesian Sampling) are better suited when you have the budget to search the space more exhaustively. In addition, Bayesian Sampling does not support early termination.
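For example, a random sampler over the two `train.py` arguments could look like the following (the ranges shown are illustrative assumptions, not the project's exact values):

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

# --C is drawn from a continuous uniform distribution,
# --max_iter from a discrete set of choices
param_sampling = RandomParameterSampling({
    "--C": uniform(0.01, 1.0),
    "--max_iter": choice(50, 100, 200),
})
```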
An early stopping policy automatically terminates poorly performing runs.
The early termination policy we used is `BanditPolicy`. This policy is based on a slack factor/slack amount and an evaluation interval. Bandit terminates runs whose primary metric is not within the specified slack factor/slack amount of the best performing run. This allows more aggressive savings than the Median Stopping policy if we apply a smaller allowable slack.
The parameter `slack_factor`, which is the slack allowed with respect to the best performing training run, needs to be defined, while `evaluation_interval` and `delay_evaluation` are optional.
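A small example of such a policy is sketched below (the values are illustrative, not the project's exact settings):

```python
from azureml.train.hyperdrive import BanditPolicy

# Terminate runs whose primary metric falls outside 10% slack of the best run,
# checking every 2 reported intervals after skipping the first 5
early_termination_policy = BanditPolicy(slack_factor=0.1,
                                        evaluation_interval=2,
                                        delay_evaluation=5)
```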
AutoML tries different models and algorithms during the automation and tuning process within a short period of time. The best performing model was `Voting Ensemble`, with an accuracy of 91.67%.
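A minimal sketch of an AutoML configuration for this kind of run is shown below; the timeout, cross-validation count, and the `train_ds` dataset variable are assumptions for illustration rather than the project's exact settings.

```python
from azureml.train.automl import AutoMLConfig

# train_ds is assumed to be a TabularDataset holding the cleaned bank-marketing data
automl_config = AutoMLConfig(task="classification",
                             primary_metric="accuracy",
                             training_data=train_ds,
                             label_column_name="y",
                             n_cross_validations=5,
                             experiment_timeout_minutes=30,
                             compute_target=aml_compute)

# Submit the AutoML run and retrieve the best model found
automl_run = experiment.submit(automl_config, show_output=True)
best_automl_run, fitted_model = automl_run.get_output()
```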
Although the performance of AutoML (`Voting Ensemble`) was slightly better than that of HyperDrive, it did not demonstrate a significant improvement (less than 2%).
AutoML is better than HyperDrive in terms of architecture, since we can create hundreds of models a day, obtain better model accuracy, and deploy models faster.
The first point to consider is that the data is highly imbalanced (88.80% labeled 'no' and 11.20% labeled 'yes'). This imbalance could be handled by using a technique like the Synthetic Minority Oversampling Technique (SMOTE) during the data preparation step.
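A minimal sketch of how SMOTE could be applied to the training split, assuming the `imbalanced-learn` package is added as a dependency; the synthetic data below is only a stand-in for the cleaned features and labels.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the cleaned, imbalanced training split (~11% positive class)
x_train, y_train = make_classification(n_samples=2000, weights=[0.89, 0.11], random_state=42)

# Oversample only the training split so the test set stays representative
smote = SMOTE(random_state=42)
x_train_bal, y_train_bal = smote.fit_resample(x_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_train_bal))
```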
In the future, we could include additional Sklearn Logistic Regression hyperparameters in the search space to achieve better results. Using different parameter sampling techniques and tuning the arguments of the `BanditPolicy` could also prove fruitful.
For AutoML, we would like to tune more configuration parameters; increasing `experiment_timeout_minutes` would enable us to test more models and thus improve performance.
We ran the following command to delete the compute cluster:

```python
aml_compute.delete()
```