Automated Machine Learning Pipeline Class

Session 1 - From Modelling to Production
Session 2 - Software Engineering for ML
Session 3 - Toolkit: Git
Session 4 - Toolkit: Colab & Python
Session 5 - Toolkit: Python Environments & Visual Studio & Health Informatics
Session 6 - Data Set, Data Sample, Data Issues
Session 7 - Create Fake Data (is Fun!)
Session 8 - Linear Regression with Fake Data
Session 9 - Deeplearning Scenario, intro to tensorflow, data Preparation
Session 10 - Data Augmenting
Session 11 - Data Balancing
Session 12 - Data balancing & training effect
Session 13 - Collecting Data From Storage SQL
Session 14 - SQL & Python
Session 15 - Fake Data Creation Part 2
Session 16 - Data Cleansing.p1
Session 17 - Data Cleansing.p2
Session 18 - Data Cleansing.p3
Session 19 - Data Cleansing.p4
Session 20 - Tensorflow Data Validation
Session 21 - Scaling
Session 22 - Scaling Part 2
Session 23 - Quantile Transformer
Session 24 - Recap - scaling
Session 25 - Linear Regression Tool Kit
Session 26 - Python Classes
Session 27 - Coefficients
Session 28 - Feature Engineering

Session 1 - From Modelling to Production

Intro to ML modelling

DS Modelling
DS Life Cycle
DS principles
ML pipeline
Production of ML - ML development
Production of ML - tasks for apply
ML Pipeline - Target
Directed Acyclic Graph
ML Pipeline - Production ML Infrastructure
Orchestration

References:
- Executive Data Science: A Guide to Training and Managing the Best Data Scientists (Brian Caffo, Jeff Leek, Roger Peng)
- The Practical Guide to Managing Data Science at Scale (Domino)
- Executive Data Science: Coursera-Johns Hopkins University
- Building Machine Learning Pipelines by Hannes Hapke, Catherine Nelson

💠 From Modelling to Production Video

Session 2 - Software Engineering for ML

Application Life Cycle
- Software Development Life Cycle
Data Science Life Cycle

💠 Software Engineering for ML Video

Session 3 - Toolkit: Git

Github
Gitbash

💠 Toolkit Git Video

Session 4 - Toolkit: Colab & Python

Google Colab
Install Python

💠 Toolkit: Colab & Python Video

Session 5 - Toolkit: Python Environments

In this session Thom Ives will explain how to build a Python virtual environment ...

Python 3.x
Virtual environment wrapper
System Variables
Health Informatics Intro (starts 36:14)

💠 Toolkit: Python Environments & Health Informatics Intro (starts 36:14) Video

Ghaith Sankari will show one example of integrating a Python project with a .NET Core Web API project using Visual Studio.

VS Video 1 | VS Video 2 | VS Video 3

Session 6 - Data Set, Data Sample, Data Issues

Why data matters in the ML process, what sampling is, why issues appear, and which issues are the most important.

Feature Space
Data Samples
Data Issues
Data Drift
Concept Drift

Assignment (explanation only): You take random samples of the same size from a large population, compute the mean of each sample, and plot the distribution of those means. What shape does that distribution take?

Central Limit Theorem
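The assignment above can be checked with a small simulation. This is a minimal sketch using only the standard library; the exponential population, the sample size of 50, and the function name `sample_means` are assumptions chosen for illustration, not part of the class material.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples of equal size and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# A deliberately non-normal, right-skewed population (exponential with mean 1).
population_rng = random.Random(42)
population = [population_rng.expovariate(1.0) for _ in range(100_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)

# The sample means cluster around the population mean, and their spread
# is close to sigma / sqrt(n), as the Central Limit Theorem predicts.
print("population mean      :", round(statistics.mean(population), 3))
print("mean of sample means :", round(statistics.mean(means), 3))
print("std of sample means  :", round(statistics.stdev(means), 3))
print("sigma / sqrt(n)      :", round(statistics.pstdev(population) / 50 ** 0.5, 3))
```

Plotting a histogram of `means` shows the roughly bell-shaped curve, even though the population itself is strongly skewed.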

Resource: ML Data and Concept Drift

💠 Data Issues Video

Session 7 - Create Fake Data (is Fun!)

How to create fake data with Python.

Assignment: What is heteroskedasticity? Why is it a challenge? Illustrate in a notebook.

Send a DM to Thom; correct answers can be shared with the group.

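import random

import matplotlib.pyplot as plt

# Feature values: 0.0, 0.1, ..., 9.9
X = [x / 10.0 for x in range(100)]

# Linear model y = 2x + 5 plus uniform noise in [-0.5, 0.5) scaled by an amplitude.
# The amplitude is 0 here, so every point falls exactly on the line; raise it to add noise.
noise_amplitude = 0
Y = [2.0 * x + (random.random() - 0.5) * noise_amplitude + 5 for x in X]

plt.scatter(X, Y)
plt.title('This Is The Title')
plt.xlabel('These Are The X Values')
plt.ylabel('These Are The Y Values')
plt.show()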

Added a Colab workbook for heteroskedasticity here
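As one possible illustration of the heteroskedasticity assignment, the sketch below builds two fake data sets on the same line y = 2x + 5: one with constant noise and one where the noise grows with x. The noise amplitudes and the helper `residual_spread` are assumptions for illustration; this is not the content of the Colab workbook.

```python
import random
import statistics

random.seed(0)

X = [x / 10.0 for x in range(100)]

# Homoskedastic: the noise amplitude is the same for every x.
Y_homo = [2.0 * x + 5 + (random.random() - 0.5) * 2.0 for x in X]

# Heteroskedastic: the noise amplitude grows in proportion to x.
Y_hetero = [2.0 * x + 5 + (random.random() - 0.5) * 2.0 * x for x in X]


def residual_spread(xs, ys, lo, hi):
    """Standard deviation of residuals around the true line for lo <= x < hi."""
    residuals = [y - (2.0 * x + 5) for x, y in zip(xs, ys) if lo <= x < hi]
    return statistics.pstdev(residuals)


for name, Y in (("homoskedastic", Y_homo), ("heteroskedastic", Y_hetero)):
    low = residual_spread(X, Y, 0, 5)
    high = residual_spread(X, Y, 5, 10)
    print(f"{name:16s} spread for x<5: {low:.2f}   spread for x>=5: {high:.2f}")
```

The unequal spread is the challenge: ordinary least squares assumes constant error variance, so standard errors and confidence intervals become unreliable when the variance changes with x.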

💠 Fake Data is Fun Video

Session 8 - Linear Regression with Fake Data

Assignment: Play with the models. ❗ (Please re-pull the repo.)

First run the fake data creation .py files:
1. Fake_Single_Feature_Linear_Data.py
2. Fake_Single_Feature_NonLinear_Data.py
3. Fake_Double_Feature_Linear_Data.py
4. Fake_Double_Feature_NonLinear_Data_with_Functional_Noise.py
These will create 5 different .csv files of data.
Next, run each of the files in the folder Intro_to_Regression_Modeling, then explore, play with, and understand the functionality of each script. Look at the fake data creation as well.

💡 You can import sys and call sys.exit() in the script to force it to stop, so you do not run the complete script.

General_Toolls.py: this file is a module that you can call from within your script. It has functions to calculate:
1. Mean Square Error --> MSE
2. Root Mean Square Error --> RMSE
3. Mean Absolute Error --> MAE
4. Median Absolute Error --> MeDAE
5. R^2 --> r2
6. Adjusted R^2 --> r2_adj
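For reference, the six metrics listed above can be computed as follows. This is a minimal standard-library sketch, not the code in General_Toolls.py; the function name `regression_metrics` and its `n_features` parameter are hypothetical.

```python
import math
import statistics


def regression_metrics(y_true, y_pred, n_features=1):
    """Return MSE, RMSE, MAE, MeDAE, R^2 and adjusted R^2 as a dict."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    medae = statistics.median(abs(e) for e in errors)

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R^2 penalizes extra features: 1 - (1 - R^2)(n - 1)/(n - p - 1)
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

    return {"MSE": mse, "RMSE": rmse, "MAE": mae,
            "MeDAE": medae, "r2": r2, "r2_adj": r2_adj}


metrics = regression_metrics([1, 2, 3, 4, 5], [1.5, 2, 2.5, 4, 5.5])
for name, value in metrics.items():
    print(f"{name:7s} {value:.4f}")
```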

Regression Analysis

Regression Statistics

💠 Linear Regression with Fake Data Video

Session 9 - Deeplearning Scenario, intro to tensorflow, data Preparation

Convolutional neural networks (CNN)

Summary of session

Convolutional Layer
Effect of Filter Size (Kernel Size)
Max Pool
Average Pool
Batch Sizing
Padding
Epochs
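To make the effect of filter size, padding, and pooling concrete, here is a small pure-Python sketch. It uses the standard output-size formula floor((n + 2p - k) / s) + 1; the function names `conv_output_size` and `pool_2d` are hypothetical and are not TensorFlow APIs.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def pool_2d(image, size=2, mode="max"):
    """Non-overlapping max or average pooling over a 2D list of numbers."""
    reducer = max if mode == "max" else (lambda window: sum(window) / len(window))
    pooled = []
    for r in range(0, len(image) - size + 1, size):
        row = []
        for c in range(0, len(image[0]) - size + 1, size):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


# A larger kernel shrinks the output more; padding can preserve the size.
print(conv_output_size(28, kernel_size=3))              # 26
print(conv_output_size(28, kernel_size=5))              # 24
print(conv_output_size(28, kernel_size=3, padding=1))   # 28
print(conv_output_size(28, kernel_size=2, stride=2))    # 14 (2x2 pooling)

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(pool_2d(image, mode="max"))      # [[6, 8], [14, 16]]
print(pool_2d(image, mode="average"))  # [[3.5, 5.5], [11.5, 13.5]]
```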

👇 Here are some links that have some visual explanations and a playground to experiment.

RegEx: Regular Expressions (import re). What is regex?

💠 Deeplearning Scenario, intro to tensorflow, data Preparation Video

Session 10 - Data Augmenting

Data Augmenting Techniques

Mirroring
- Flip Horizontal / Vertical
- Flip Random
Cropping
Rotate
Recolor
PCA, Principal Component Analysis (topic for later lesson)
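The mirroring, cropping, and rotation techniques above can be sketched with NumPy on a tiny array standing in for an image. This is an illustrative sketch only: the helper names `random_flip` and `center_crop` are hypothetical, and the session itself may have used TensorFlow utilities instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 4x4 "image" whose pixel values make the transformations easy to follow.
image = np.arange(16).reshape(4, 4)

flipped_h = np.fliplr(image)   # mirror left <-> right
flipped_v = np.flipud(image)   # mirror top <-> bottom
rotated = np.rot90(image)      # rotate 90 degrees counter-clockwise


def random_flip(img, rng):
    """Flip horizontally with probability 0.5, otherwise return the image unchanged."""
    return np.fliplr(img) if rng.random() < 0.5 else img


def center_crop(img, height, width):
    """Cut a height x width window out of the middle of the image."""
    top = (img.shape[0] - height) // 2
    left = (img.shape[1] - width) // 2
    return img[top:top + height, left:left + width]


print(flipped_h)
print(flipped_v)
print(rotated)
print(center_crop(image, 2, 2))
print(random_flip(image, rng))
```

`np.rot90` only handles multiples of 90 degrees; rotation by an arbitrary angle needs interpolation and is left to the assignment below.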

Ghaith

Here are some notes about the data augmenting session:

Data augmentation techniques are used in deep learning, but they are still part of data preparation. For that reason, data augmentation mechanisms will be customized to form an important part of the ML pipeline.

I wanted to start by solving data quantity issues; then we will come back to the fancier and more fun part related to data quality. The assignment for next week is to answer the following questions:

How do you perform a customized rotation (any angle in degrees, not only 90)? Code in Python is required, and I hope to find volunteers to present. This task can be performed in many ways; cooperation with other family members to cover multiple solutions is allowed and appreciated.
Is it possible to re-color a grayscale image, and how? For this question we are not looking for coding examples, only for an explanation and proof of the answer. You can consider it a research task; brave presenters are also highly appreciated.

TensorFlow data augmentation tutorial

💠 Data Augmenting Video

Session 11 - Data Imbalance

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

Created an imbalanced data set by taking a sample from each set (infected, uninfected)
- using .sample from the Pandas library
Data set: name of image, folder name, label
- label: 1 = infected, 0 = uninfected
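A minimal sketch of building such an imbalanced set with `DataFrame.sample` follows. The column names, file names, the 50/450 split, and the choice of "infected" as the minority class are assumptions for illustration, not the values used in the session notebook.

```python
import pandas as pd

# Hypothetical catalogue of images: 500 infected and 500 uninfected.
df = pd.DataFrame({
    "image_name": [f"img_{i:04d}.png" for i in range(1000)],
    "folder": ["infected"] * 500 + ["uninfected"] * 500,
    "label": [1] * 500 + [0] * 500,
})

# Take a small sample of one class and a large sample of the other.
infected = df[df["label"] == 1].sample(n=50, random_state=0)
uninfected = df[df["label"] == 0].sample(n=450, random_state=0)

# Combine and shuffle so the classes are not grouped together.
imbalanced = (
    pd.concat([infected, uninfected])
    .sample(frac=1.0, random_state=0)
    .reset_index(drop=True)
)

print(imbalanced["label"].value_counts(normalize=True))
```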

MONAI is good to use on medical data sets, with predefined tools for 2D and 3D images. Explanation in the video at 14:00 minutes.

Example of batches of size 10 showing the class imbalance.

Assignment

See the impact of oversampling and how the distribution of the data changes.

Experiment with batch sizes
Experiment with import sample sizes
What other techniques are there to solve imbalance by changing the number of samples in each batch?

Always good to see other tools and share our findings in our pipeline_class_chat

MORE TOGETHER!

💠 Notebook 💠 Data Imbalance Video

Session 12 - Data balancing & training effect

Weight Computation for Oversampling & Penalization
Use of Pre-Trained Models
- Trained on labeled samples
- ImageNet (1 million images, split into 1000 categories)
- Uses ResNet18; other models and varieties can be used.
Training and Validation
- Training uses RandFlipd, RandRotate90d, RandGaussianNoised
- Validation uses no transformations
Training vs. Test Accuracy
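One common way to compute the weights mentioned in the first item is inverse class frequency. The sketch below assumes that scheme (the session may have used a different one) and uses a hypothetical 90/10 label split; `class_weights` is an illustrative name. The same weights can drive oversampling (as sampling probabilities) or penalization (as per-class loss weights).

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10          # 90% majority, 10% minority
weights = class_weights(labels)
print(weights)                        # the minority class gets the larger weight

# Oversampling: draw with replacement, using each sample's class weight
# as its sampling weight, so both classes appear about equally often.
rng = random.Random(0)
resampled = rng.choices(labels, weights=[weights[l] for l in labels], k=10_000)
minority_share = sum(resampled) / len(resampled)
print(f"minority share after weighted sampling: {minority_share:.3f}")
```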

Confusion Matrix

Predicted \ Actual	Positive (1)	Negative (0)
Positive (1)	TP	FP
Negative (0)	FN	TN

TP: True Positive, TN: True Negative, FP: False Positive (Type 1 Error), FN: False Negative (Type 2 Error)

Recall = TP / (TP + FN)
Precision = TP / (TP+FP)
F-Score = 2 * Recall * Precision / (Recall + Precision) (used to compare models)
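A short worked example of these formulas, with made-up confusion-matrix counts (the numbers and the function name `classification_scores` are for illustration only):

```python
def classification_scores(tp, fp, fn):
    """Recall, precision and F-score from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives.
recall, precision, f_score = classification_scores(tp=40, fp=10, fn=20)
print(f"Recall    = {recall:.4f}")     # 40 / 60
print(f"Precision = {precision:.4f}")  # 40 / 50
print(f"F-Score   = {f_score:.4f}")    # 80 / 110
```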

💠 Data balancing & training effect Video

Session 13 - Collecting Data From Storage

Creating Data with SQL, Microsoft SQL Server Management Studio (SSMS)

Collection of data for Timeseries Analysis
Randomize data collection
Using While < 10000 to collect 10000 samples
Using Date to randomize patient transactions for collection
Create a Procedure that can be called, for example, from Python
Example of creating the ERD (Entity Relationship Diagram) in SSMS

💠 Collecting Data From Storage Video

Session 14 - SQL & Python

sqlalchemy
sqlalchemy engine
Define functions for server and db connection
Functions for
- Checking table exists
- Create_table
- Drop_table
- Insert Dataframes as Table
- Update DB

Examples of:

SQL query pull and convert to Pandas DF.
Pandas DF to SQL Table.
Checksum, for detecting errors
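The helper functions and the checksum idea listed above can be sketched as follows. To keep the example runnable without a database server, it swaps in Python's built-in `sqlite3` with an in-memory database in place of sqlalchemy and SQL Server; the `patients` table, its columns, and all function names are hypothetical.

```python
import hashlib
import sqlite3


def table_exists(conn, name):
    """Check the SQLite catalogue for a table with the given name."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS patients "
        "(id INTEGER PRIMARY KEY, age INTEGER, visits INTEGER)"
    )


def drop_table(conn):
    conn.execute("DROP TABLE IF EXISTS patients")


def insert_rows(conn, rows):
    conn.executemany("INSERT INTO patients (id, age, visits) VALUES (?, ?, ?)", rows)
    conn.commit()


def rows_checksum(rows):
    """SHA-256 over the sorted rows; equal checksums mean identical content."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
rows = [(1, 34, 2), (2, 58, 5), (3, 41, 1)]

create_table(conn)
insert_rows(conn, rows)
pulled = conn.execute("SELECT id, age, visits FROM patients").fetchall()

print("table exists:", table_exists(conn, "patients"))
print("round trip intact:", rows_checksum(rows) == rows_checksum(pulled))
```

Comparing the checksum before the insert and after the pull is a cheap way to detect rows that were altered or lost in transit.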

💠 More on Data SQL & Python Video

Session 15 - Fake Data Creation Part 2

Reference to Khuyen Tran, Faker Article
Fake Data for Regression.
- Functions to define features / Noise / Model
- Plotting Model
Create Fake Classification Data
- Functions for Clusters and Labels
- Plot Model

Libraries: pandas, numpy, json, matplotlib.pyplot

💠 Fake Data Creation Part 2

Session 16 - Data Cleansing (DF completeness)

Example in Visual Studio 19
Data Cleansing with Pandas & Numpy
Reusable Code
Class and Functions
Find an Index Column (Unique / Non-Unique)
Quickly find NaNs and % of NaNs per column
Find and drop columns with only 1 unique value (no variation)
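The three checks at the end of this list can be sketched with pandas as below. The DataFrame and its column names are invented for illustration, and the session's reusable class is not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a unique id, missing values and a constant column.
df = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "age": [34, np.nan, 58, 41],
    "site": ["A", "A", "A", "A"],
    "score": [np.nan, np.nan, 0.7, 0.9],
})

# Percentage of NaNs per column.
nan_percent = df.isna().mean() * 100

# Index candidates: columns whose values are all present and all unique.
index_candidates = [
    col for col in df.columns if df[col].is_unique and df[col].notna().all()
]

# Columns with a single unique value carry no information and can be dropped.
constant_columns = [col for col in df.columns if df[col].nunique(dropna=False) <= 1]
cleaned = df.drop(columns=constant_columns)

print(nan_percent)
print("index candidates:", index_candidates)
print("dropped:", constant_columns)
print(cleaned.columns.tolist())
```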

💠 Data Cleansing Part 1

Session 17 - Data Cleansing (Formatting)

Using functions to determine which data type a DataFrame column can be.

Assignment:

Improve the code
Generalise datetime conversion
How to process NaN values

Notebook examples: https://github.com/nishamathi/KT/blob/main/DataTypeChk.py

💠 Data Formatting

Session 18 - Data Cleansing (Missing Values)

Lesson discussion on how to impute missing data.
Mean / Median
Model prediction to fill in missing data
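Mean and median imputation can be sketched in a few lines. This is a minimal standard-library version with a hypothetical `impute` function; it treats both `None` and NaN as missing. The example values include an outlier to show why the median is often the safer fill value.

```python
import math
import statistics


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or math.isnan(value)


def impute(values, strategy="mean"):
    """Replace missing entries with the mean or median of the observed ones."""
    observed = [v for v in values if not is_missing(v)]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if is_missing(v) else v for v in values]


values = [1.0, None, 3.0, float("nan"), 100.0]

print(impute(values, "mean"))    # missing entries become 104 / 3
print(impute(values, "median"))  # missing entries become 3.0
```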

💠 Data Imputation

Session 19 - Data Cleansing (Missing Values)

MCAR: Missing Completely at Random
MAR: Missing at Random
MNAR: Missing not at Random
EDA Correlations
- Example with Pearsons Correlation
How to handle different types of missing data

Assignment

Write a function that takes x, y and calculates Pearson's correlation
Write a function that takes x, y and calculates Spearman's correlation
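One possible solution sketch for the assignment, in pure Python. Spearman's correlation is computed here as Pearson's correlation of the ranks, with tied values sharing their average rank; the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranked data."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]          # monotonic but not linear in x

print(f"Pearson : {pearson(x, y):.4f}")
print(f"Spearman: {spearman(x, y):.4f}")
```

Because y increases monotonically with x, Spearman's correlation is 1 while Pearson's is slightly lower, which shows the difference between measuring monotonic and linear association.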

💠 We Care

Session 20 - Tensorflow Data Validation

Data and Concept Drifts
Schema Skew
Feature Skew
Distribution Skew
Generate Statistics.
If a feature is missing a large percentage of its data, get away from the chair: data validation is not about 100% screen time.
Check with the Business Analyst whether the feature is required.
If the feature should have values, go back to the Data Engineer to understand why they are missing and get them added.

💠 Tensorflow Data Validation

Session 21 - Scaling Data

Scaling data using various techniques with different libraries

Numpy
Pandas
Tensorflow

💠 Scaling

Session 22 - Scaling Data Part2

Scaling data using various techniques with pure Python code
and comparing against library results
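Two of the most common techniques, min-max scaling and standardization, look like this in pure Python. This is a sketch with illustrative function names, not the session's code.

```python
import statistics


def min_max_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


data = [2, 4, 6, 10]

print(min_max_scale(data))   # [0.0, 0.25, 0.5, 1.0]
z = standardize(data)
print([round(v, 4) for v in z])
```

When comparing against library results, note that `statistics.pstdev` divides by n, which matches the default of `numpy.std` (`ddof=0`); using the sample standard deviation instead gives slightly different values.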

💠 Scaling Part2

Session 23 - Quantile Transformer.

Beta Distribution
cumulative probability
Skewed Data Generation
Quantile transformation of skewed data.
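The idea behind the quantile transformation can be sketched in pure Python: generate skewed data from a Beta distribution, then replace each value with its empirical cumulative probability. This is a simplified uniform-output version under assumed parameters (Beta(2, 8), 5000 points), not the scikit-learn `QuantileTransformer` implementation.

```python
import random
import statistics
from bisect import bisect_right


def quantile_transform(values):
    """Map each value to its empirical cumulative probability in (0, 1]."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


# Right-skewed data: Beta(2, 8) has mean 0.2 and a long tail to the right.
rng = random.Random(0)
skewed = [rng.betavariate(2, 8) for _ in range(5000)]

uniform = quantile_transform(skewed)

print("skewed  mean / median:", round(statistics.mean(skewed), 3),
      round(statistics.median(skewed), 3))
print("uniform mean / median:", round(statistics.mean(uniform), 3),
      round(statistics.median(uniform), 3))
```

After the transformation the values are spread evenly over (0, 1], so the mean and median both sit near 0.5, while the ordering of the original data is preserved.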

💠 Quantile Transformer

Session 24 - Scaling recap

💠

Session 25 - Linear Regression Tool Kit

Linear Relationships Between Features and Labels
Model Coefficients
Errors
Homoscedastic Errors with Zero Means
Observations of Errors
Errors are normally distributed

💠 Linear Regression Toolkit

Session 26 - Python Classes

Summary of using classes in Python

💠 Linear Regression Toolkit

Session 27 - Coefficients

feature relationships

💠 Linear Regression Toolkit

Session 28 - Feature Engineering

Basic Feature Engineering
Fake Data
Visualization
Modeling without Feature Engineering
Linear
Polynomial Features
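The difference between modeling without feature engineering and modeling with a polynomial feature can be sketched with NumPy least squares. The quadratic ground truth, the noise level, and the helper `fit_and_score` are assumptions for illustration, not the session's data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data from a quadratic relationship plus Gaussian noise.
x = np.linspace(-3, 3, 200)
y = 1.5 * x**2 - 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)


def fit_and_score(features, y):
    """Least-squares fit on the given feature columns; returns (coefficients, R^2)."""
    X = np.column_stack([np.ones(len(y))] + features)   # prepend intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    r2 = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return coef, r2


# Without feature engineering: a straight line in x.
_, r2_linear = fit_and_score([x], y)

# With a polynomial feature: add x^2 as a second column.
coef_poly, r2_poly = fit_and_score([x, x**2], y)

print(f"R^2 linear only     : {r2_linear:.3f}")
print(f"R^2 with x^2 feature: {r2_poly:.3f}")
print("coefficients (intercept, x, x^2):", np.round(coef_poly, 2))
```

The model stays linear in its coefficients; only the input features change, which is why the same least-squares solver handles both cases.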

💠 Linear Regression Toolkit

jonathan-pap/ML_Pipeline
