
havurquijo/BixTechnology



Introduction

This README.md serves as a complement to the responses provided for each activity in the challenge. The main ML model required is located in the file NaiveBayes.ipynb, while the files cleaning.ipynb and exploratory.ipynb document the data cleaning process and dimensionality reduction, respectively. Other models tested are found in LogisticRegression.ipynb, DecisionTreeClassifier.ipynb, and RandomForest.ipynb.

The file data_address.txt is referenced by the notebook files but does not contain crucial information about the ML model.

The files scoreValidator.py, localFilter.py, accuracyMeassure.py, and efficiencyCalculator.py contain functions used across the notebook files.

The Data folder contains both the original and pre-processed databases.

The Assets folder includes figures generated by the notebooks.

The Final-Model folder houses the final model saved as a .pkl file.

The Flask-app folder contains files for the deployed webpage with the functional ML model.

Challenge Activities

  1. What steps would you take to solve this problem?
  2. Which technical data science metric would you use to solve this challenge?
  3. Which business metric would you use to solve the challenge?
  4. How do technical metrics relate to the business metrics?
  5. What types of analyses would you like to perform on the customer database?
  6. What techniques would you use to reduce the dimensionality of the problem?
  7. What techniques would you use to select variables for your predictive model?
  8. What predictive models would you use or test for this problem? Please indicate at least 3.
  9. How would you rate which of the trained models is the best?
  10. How would you explain the result of your model? Is it possible to know which variables are most important?
  11. How would you assess the financial impact of the proposed model?
  12. What techniques would you use to perform the hyperparameter optimization of the chosen model?
  13. What risks or precautions would you present to the customer before putting this model into production?
  14. If your predictive model is approved, how would you put it into production?
  15. If the model is in production, how would you monitor it?
  16. If the model is in production, how would you know when to retrain it?

Activity 1

What steps would you take to solve this problem?

To solve this problem, we'll follow these steps:

Environment set-up

First, we need to set up the environment. For example, we can use Visual Studio Code and install extensions such as Python and Jupyter to improve the workflow. Then, we create a virtual environment on Windows:

py -m venv .venv

With the environment created, we activate it by running:

.\.venv\Scripts\activate

You should see something like (.venv) PS C:\Users.... After that, we can install all the required packages with specific versions, ensuring there are no conflicts.

Data Cleaning

As previously mentioned, the data had issues with missing values, which we needed to filter out. In the database, missing values were represented as the string "na". We converted these to Python's NaN value using the .replace() method of the DataFrame object.
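The replacement step might look like the following sketch; the column names and values here are made up for illustration, not taken from the actual database:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real database; column names are hypothetical.
df = pd.DataFrame({"aa_000": ["12", "na", "7"], "ab_000": ["na", "0", "3"]})

# Missing values arrive as the literal string "na"; map them to NaN
# so that pandas' missing-data machinery (isna, dropna, ...) can see them.
df = df.replace("na", np.nan)

# With the strings gone, the columns can be cast to numeric dtypes.
df = df.apply(pd.to_numeric)
print(df.isna().sum().sum())  # number of missing cells
```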

Another concern was columns containing a large number of zeros. While zeros are valid data, an abundance of them (or of any constant value) provides no useful information. Therefore, we removed the columns that were mostly zeros or NaN values, which reduced the number of columns from 171 to 108. For this purpose, we created a function named isRemovable() to check whether a column predominantly contains bad data.

At this point, we used .dropna() to eliminate all rows with NaN values, which left us with approximately one-third of the total rows.
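A hypothetical sketch of how isRemovable() and the row-level cleanup could fit together; the actual implementation and cutoff live in the repository, so the 70% threshold below is an assumption:

```python
import numpy as np
import pandas as pd

def is_removable(col: pd.Series, threshold: float = 0.7) -> bool:
    """Hypothetical sketch of the repo's isRemovable(): flag a column whose
    share of zeros plus NaNs exceeds `threshold` (the real cutoff may differ)."""
    bad = col.isna().sum() + (col == 0).sum()
    return bad / len(col) > threshold

df = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 0, 1],
    "useful":      [1.2, 3.4, np.nan, 5.6, 7.8],
})

# Drop the predominantly bad columns, then the remaining rows with NaN.
keep = [c for c in df.columns if not is_removable(df[c])]
df = df[keep].dropna()
print(keep)
```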

Dimensionality reduction

What has been done so far is already a form of dimensionality reduction, but it was necessary because bad data is worse than noise in machine learning. At this stage, we remove columns that can be considered dispensable, based on a condition we will analyze shortly.

To reduce the number of columns, we need to identify the most important ones for our problem. We can look for two types of issues:

  • Information repetition: When a pair of columns have high correlation, it means they are influenced by similar variables. Having both does not improve our model and can even make it worse.
  • Low variance variables: Low variance variables provide less information. A variable with an order of magnitude lower variance than others appears as noise or a constant value, similar to the zeros in the previous section.

Having a DataFrame object with the numerical columns, we can apply .corr().abs() to obtain a 108 by 108 matrix with all the correlations between columns quantified as positive numbers. Using a mask, we can identify columns with high correlation. In this model, we remove columns that have an absolute correlation value of 0.7 or higher.

After this process, we ended up with 41 largely independent columns, fewer than half of the original number.
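The correlation filter can be sketched as follows, on a toy frame where one column nearly duplicates another; the 0.7 cutoff is the one stated above, while the data is synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                      # independent column
})

# Absolute pairwise correlations; keep only the upper triangle so each
# pair is inspected once and a column is never compared with itself.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column correlated >= 0.7 with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] >= 0.7).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)
```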

To continue decreasing dimensionality, we can use two additional filters, and we use both. However, before applying the variance filter, we need to eliminate potential outliers. Since outliers have very low probability, we filter out the data above a certain quantile, keeping only the values below it. We chose the 99.99% quantile, which proved sufficient. Before-and-after comparison images are saved in Assets/figures/<column-name>.
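The quantile-based outlier filter might be sketched like this; the data is synthetic and the column name is made up, but the 99.99% quantile is the one chosen above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=10_000)})
df.loc[0, "x"] = 1e6  # plant one extreme outlier

# Keep only rows below the 99.99% quantile of the column, so values in
# the far upper tail (very low probability) are discarded.
q = df["x"].quantile(0.9999)
filtered = df[df["x"] < q]
print(len(df) - len(filtered))  # rows removed
```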

All the programming from the previous section up to this point is in the file cleaning.ipynb.

We can select columns using an object from the ensemble module: the ExtraTreesClassifier(). It ranks the features by importance and lets us choose which ones to keep. Using this method, we selected the 15 most significant columns.
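Feature selection with ExtraTreesClassifier can be sketched as follows on synthetic data; keeping the top 15 mirrors the count above, while the dataset shape and estimator settings are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for the cleaned dataset: 41 features, few informative.
X, y = make_classification(n_samples=500, n_features=41, n_informative=5,
                           random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Rank features by impurity-based importance and keep the top 15.
top15 = np.argsort(clf.feature_importances_)[::-1][:15]
print(sorted(top15.tolist()))
```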

The other method is the low variance filter, as previously explained. As the variance threshold, we used the third quartile, keeping all columns with variance greater than that of 75% of the others. This left us with only 11 features.
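The variance filter described above could be sketched like this; the frame is synthetic, and which columns are made nearly constant is arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(300, 8)),
                  columns=[f"f{i}" for i in range(8)])
df["f0"] *= 0.01  # make two columns nearly constant
df["f1"] *= 0.05

# Keep only columns whose variance exceeds the third quartile (75th
# percentile) of the per-column variances, i.e. the top 25% of columns.
variances = df.var()
threshold = variances.quantile(0.75)
kept = variances[variances > threshold].index.tolist()
df_filtered = df[kept]
print(kept)
```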

Finally, we saved all three files: one with 41 features, one with 15, and one with 11. This last filtering step is in a file named exploratory.ipynb.

Understanding the problem

With our current knowledge of the problem, we cannot predict why the trucks' air systems are breaking down. However, we can predict when a truck will need maintenance on this system based on the provided database. The client did not provide details on how expenses were calculated in previous years, but they did inform us how to calculate our model's expenses and its tendency over the previous years. We must also consider that the database only includes information about trucks that went to air system maintenance and whether a problem was found or not; it does not indicate the order in which trucks were sent for maintenance.

Based on this information, we can create a new class with two values ["send", "not send"] for maintenance, which will correspond to the class given in the database, making the isomorphism ["pos", "neg"] -> ["send", "not send"]. This means that if we can predict the class ["pos", "neg"], we can determine when it is time to send a truck for maintenance and when it is not necessary.

Since the class to be predicted is binary, the confusion matrix will certainly be very useful.

Based on the information given by the client, each component of the confusion matrix has a cost:

  • True negative $(p_{--})$ --> costs $0
  • True positive $(p_{++})$ --> costs $25
  • False positive $(p_{-+})$ --> costs $10
  • False negative $(p_{+-})$ --> costs $500

So we need a machine learning model that minimizes the function: $$f(p_{++},p_{-+},p_{+-}) = p_{++}\cdot25+p_{-+}\cdot10+p_{+-}\cdot500$$
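This cost function translates directly into code; the helper name and the example counts below are illustrative, not taken from scoreValidator.py:

```python
def maintenance_cost(tp: int, fp: int, fn: int) -> int:
    """Cost f(p++, p-+, p+-) from the confusion matrix entries:
    $25 per true positive, $10 per false positive, $500 per false
    negative; true negatives cost nothing."""
    return tp * 25 + fp * 10 + fn * 500

# e.g. 100 trucks correctly sent, 30 sent needlessly, 4 missed failures:
print(maintenance_cost(tp=100, fp=30, fn=4))  # 100*25 + 30*10 + 4*500 = 4800
```

Because a false negative costs 20 times a true positive, the model should err on the side of sending trucks to maintenance.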

Testing some machine learning algorithms

We proposed several machine learning algorithms given that the class we're predicting is binary:

  • Logistic Regression
  • Naive Bayes
  • Decision Tree Classifier
  • Random Forest

The files containing the programming of these models are respectively:

  • LogisticRegression.ipynb
  • NaiveBayes.ipynb
  • DecisionTreeClassifier.ipynb
  • RandomForest.ipynb

We tested each algorithm with three datasets:

  • The dataset with 41 features
  • The dataset with 15 features
  • The dataset with 11 features

In each case, we printed the confusion matrix and the value of $$f(p_{++},p_{-+},p_{+-})$$, which represents the cost of maintenance for the selected data.

Choosing a machine learning model

Firstly, we trained the machine learning model by dividing the data into training and test sets. We then selected the best-performing combination of model and dataset based on the lowest $$f(p_{++},p_{-+},p_{+-})$$ value. The Naive Bayes model with 11 features performed the best.

After that, we retrained the model using the entire filtered and cleaned data from previous years for each set of features. We then applied the model to the whole present year data after cleaning it as well.

Finally, we optimized the model using the GridSearchCV class from the model_selection module of the sklearn package, which performs cross-validation to minimize overfitting. We obtained the same result as before, which indicates a robust result. The Naive Bayes model has essentially no hyperparameters to tune, so its optimization ended at this point. We did tune the hyperparameters of the other algorithms, but none performed as well as Naive Bayes.
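A minimal sketch of grid search with cross-validation, shown here on a Random Forest (one of the tuned models) over synthetic data; the parameter grid, scoring choice, and dataset are assumptions, not the grids actually searched in the repository:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 11-feature dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)

# Small, illustrative grid of hyperparameters to search over.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_)
```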

Deploying the model

We created a simple webpage using the Python Flask framework and the Bootstrap HTML framework. This allows the client to input the values of eleven features (still encrypted) and receive a prediction.

We deployed the machine learning algorithm in an AWS Elastic Compute Cloud (EC2) instance and served it through HTTP. Additionally, we included a section on the webpage for inputting new data to retrain the model, as well as a contact form for client feedback in case of issues.

Activity 2

Which technical data science metric would you use to solve this challenge?

We might want to use the confusion matrix for this problem because of the binary nature of the class to be predicted. Due to the imbalance between the classes "pos" and "neg", we may be tempted to use the Matthews correlation coefficient (MCC) or the F2 score, both available in the sklearn.metrics module, but the best metric for this problem is a weighted score, because the problem assigns costs to true positive, false negative, and false positive outcomes. The goal is to minimize this weighted score.

Activity 3

Which business metric would you use to solve the challenge?

The business metric should be the reduction in maintenance cost of the truck's air system, which could be measured in dollars per year.

Activity 4

How do technical metrics relate to the business metrics?

They are related through the function $$f(p_{++},p_{-+},p_{+-}) = p_{++}\cdot25+p_{-+}\cdot10+p_{+-}\cdot500,$$ where $p_{--}$ denotes a true negative, $p_{++}$ a true positive, $p_{-+}$ a false positive, and $p_{+-}$ a false negative. The function returns the calculated expenses for the yearly data. The function score() can be found in the file scoreValidator.py.

Activity 5

What types of analyses would you like to perform on the customer database?

  • Identify non-numeric data and convert it to numeric.
  • Search for NaN values in the file and delete them from the database.
  • Identify and remove columns that are mostly zeros or NaN values.
  • Detect and handle outliers.

Activity 6

What techniques would you use to reduce the dimensionality of the problem?

We'll be using the low variance filter and the correlation filter.

Activity 7

What techniques would you use to select variables for your predictive model?

We might use the ExtraTreesClassifier() class to select the most important features.

Activity 8

What predictive models would you use or test for this problem?

Because of the binary nature of the class, we should use the following algorithms:

  • Logistic Regression
  • Naive Bayes
  • Decision Tree Classifier
  • Random Forest

Activity 9

How would you rate which of the trained models is the best?

We'll use three metrics that complement each other:

  • The confusion matrix
  • The predicted calculated expenses from the confusion matrix
  • The weighted score calculated from the confusion matrix

Activity 10

How would you explain the result of your model? Is it possible to know which variables are most important?

Fortunately, the metric we created, which calculates the yearly predicted expenses for air system maintenance, is very intuitive and can even be compared with the expected expenses without the machine learning algorithm.

To explain the result of the model, I'll show how much money will be saved using the machine learning algorithm in relation to the expected value without using the algorithm.

Using the ExtraTreesClassifier(), it is possible to determine which variables are most important.

Activity 11

How would you assess the financial impact of the proposed model?

With the function relating technical metrics and business metrics, it would be possible to show how much money would be saved if we had used the machine learning model. We can also make a conservative prediction over time.

Activity 12

What techniques would you use to perform the hyperparameter optimization of the chosen model?

The technique we will use for parameter optimization is grid search with cross-validation.

Activity 13

What risks or precautions would you present to the customer before putting this model into production?

  • Degradation of the model over time: Each machine learning model faces the challenge of performance degradation, so the model needs to be monitored periodically, and its accuracy, in this case its MCC value, calculated regularly.
  • Handling of bad data: This model utilizes data from multiple years, but missing data can adversely affect its performance. It's crucial that all required features are consistently available for the model to function properly.
  • Data protection: The data used for retraining the model must be trusted and handled only by authorized personnel to ensure data integrity and security.
  • Black box impression and time scale: The model may give the impression of being unpredictable, potentially impacting trust. However, it's important to note that the model operates on a yearly scale, and conclusions should also be drawn within that timeframe.

Activity 14

If your predictive model is approved, how would you put it into production?

There are several ways to deploy a model into production. One straightforward method is to create an API with the trained model that accepts GET or POST HTTP requests and returns the predicted value. This can be achieved, for example, using cloud computing, either on the customer's server or on a cloud server.

For instance, an API built with Flask or a simple webpage can be created and served to the customer. This option is one of the most cost-effective methods I can think of.

Activity 15

If the model is in production, how would you monitor it?

Building on the previous idea, we can also create an API to retrieve the tested data and periodically compare it with the customer database of truck air system maintenance.

Activity 16

If the model is in production, how would you know when to retrain it?

Building on the idea from the previous question, we can use our weighted metric periodically to assess the machine's performance. By analyzing the trend, we can establish a threshold that determines when the machine needs to be retrained.
