
jonathan-pap/ML_Pipeline


Automated Machine Learning Pipeline Class


Session 1 - From Modelling to Production

Intro to ML modelling

  • DS Modelling

  • DS Life Cycle

  • DS principles

  • ML pipeline

  • Production of ML - ML development

  • Production of ML - tasks for apply

  • ML Pipeline - Target

  • Directed Acyclic Graph

  • ML Pipeline - Production ML Infrastructure

  • Orchestration

References:

    • Executive Data Science: A Guide to Training and Managing the Best Data Scientists (Brian Caffo, Jeff Leek, Roger Peng)
    • The Practical Guide to Managing Data Science at Scale (Domino)
    • Executive Data Science: Coursera-Johns Hopkins University
    • Building Machine Learning Pipelines by Hannes Hapke, Catherine Nelson

💠 From Modelling to Production Video


Session 2 - Software Engineering for ML

  • Application Life Cycle
    • Software Development Life Cycle
  • Data Science Life Cycle

💠 Software Engineering for ML Video


Session 3 - Toolkit: Git

  • Github
  • Gitbash

💠 Toolkit Git Video


Session 4 - Toolkit: Colab & Python

  • Google Colab
  • Install Python

💠 Toolkit: Colab & Python Video


Session 5 - Toolkit: Python Environments

In this session Thom Ives explains how to build a Python virtual environment ...

  • Python 3.x
  • Virtual environment wrapper
  • System Variables
  • Health Informatics Intro (starts 36:14)

💠 Toolkit: Python Environments & Health Informatics Intro (starts 36:14) Video

Ghaith Sankari shows an example of integrating a Python project with a .NET Core Web API project using Visual Studio.

VS Video 1 | VS Video 2 | VS Video 3


Session 6 - Data Set, Data Sample, Data Issues

Why data matters in the ML process, what sampling is, why data issues appear, and which issues are the most important.

  • Feature Space
  • Data Samples
  • Data Issues
  • Data Drift
  • Concept Drift

Assignment (explanation only): You take random samples of the same size from a large population, compute the mean of each sample, and plot the distribution of those means. What shape will that distribution take?

Central Limit Theorem
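
The assignment question can also be explored numerically. The following is a minimal sketch, not material from the session: it assumes an exponential population (chosen only because it is clearly non-normal), and the names `population` and `sample_means` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# A clearly non-normal "population": exponential values with mean 1.0
population = [random.expovariate(1.0) for _ in range(100_000)]

def sample_means(data, sample_size, n_samples):
    """Draw n_samples random samples of equal size and return their means."""
    return [statistics.fmean(random.sample(data, sample_size))
            for _ in range(n_samples)]

means = sample_means(population, sample_size=50, n_samples=2_000)

# The sample means cluster around the population mean, and their spread
# is close to sigma / sqrt(n), as the Central Limit Theorem predicts.
print("population mean      :", round(statistics.fmean(population), 3))
print("mean of sample means :", round(statistics.fmean(means), 3))
print("std of sample means  :", round(statistics.stdev(means), 3))
print("sigma / sqrt(n)      :", round(statistics.pstdev(population) / 50 ** 0.5, 3))
```

Plotting a histogram of `means` should show a roughly bell-shaped curve even though the population itself is strongly skewed.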

Resource: ML Data and Concept Drift

💠 Data Issues Video


Session 7 - Create Fake Data (is Fun!)

How to create fake data with Python.

Assignment: What is heteroskedasticity? Why is it a challenge? Illustrate it in a notebook.

  • Send a DM to Thom; correct answers can be shared with the group.

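import matplotlib.pyplot as plt
import random

# 100 evenly spaced x values from 0.0 to 9.9
X = [x/10.0 for x in range(100)]

# Linear model y = 2x + 5. The noise term is multiplied by 0, so it is switched off;
# raise that multiplier to add uniform noise centred on zero.
Y = [2.0 * x + (random.random() - 0.5) * 0 + 5 for x in X]

plt.scatter(X, Y)
plt.title('This Is The Title')
plt.xlabel('These Are The X Values')
plt.ylabel('These Are The Y Values')
plt.show()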

A Colab workbook for heteroskedasticity has been added here.
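
For readers without the workbook at hand, here is a small sketch of what heteroskedastic fake data can look like. It is an illustration under stated assumptions, not the workbook's code: the noise model (standard deviation proportional to x) and the helper name `residual_spread` are hypothetical.

```python
import random
import statistics

random.seed(1)

X = [x / 10.0 for x in range(1, 101)]  # 0.1 .. 10.0

# Homoskedastic: the noise has the same spread everywhere
Y_homo = [2.0 * x + 5 + random.gauss(0, 1.0) for x in X]
# Heteroskedastic: the noise spread grows with x (assumed form: sigma = 0.5 * x)
Y_hetero = [2.0 * x + 5 + random.gauss(0, 0.5 * x) for x in X]

def residual_spread(xs, ys):
    """Std. dev. of residuals around the known line y = 2x + 5,
    returned separately for the low-x half and the high-x half."""
    res = [y - (2.0 * x + 5) for x, y in zip(xs, ys)]
    half = len(res) // 2
    return statistics.pstdev(res[:half]), statistics.pstdev(res[half:])

print("homoskedastic   (low x, high x):", residual_spread(X, Y_homo))
print("heteroskedastic (low x, high x):", residual_spread(X, Y_hetero))
```

In the heteroskedastic case the residual spread in the high-x half is clearly larger, which is what breaks the constant-variance assumption of ordinary linear regression.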

💠 Fake Data is Fun Video


Session 8 - Linear Regression with Fake Data

Assignment: Play with the models. ❗ (Please re-pull the repo.)

  1. First run the fake data creation .py scripts:
    1. Fake_Single_Feature_Linear_Data.py
    2. Fake_Single_Feature_NonLinear_Data.py
    3. Fake_Double_Feature_Linear_Data.py
    4. Fake_Double_Feature_NonLinear_Data_with_Functional_Noise.py
  2. These will create 5 different .csv files of data.
  3. Next, run each of the files in the folder Intro_to_Regression_Modeling; explore, play, and understand what each script does. Also look at how the fake data is created.

💡 You can import sys and add sys.exit() in the script to force it to stop at that point, so you are not running the complete script.

  1. General_Toolls.py: this file is a module that you can call from within your script. It has functions to calculate:
    1. Mean Square Error --> MSE
    2. Root Mean Square Error --> RMSE
    3. Mean Absolute Error --> MAE
    4. Median Absolute Error --> MeDAE
    5. R^2 --> r2
    6. Adjusted R^2 --> r2_adj
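
The six metrics listed above can be written in a few lines of plain Python. This is a sketch using the standard textbook definitions, not the contents of the repository module; the function name `regression_metrics` and its `n_features` parameter are hypothetical.

```python
import math
import statistics

def regression_metrics(y_true, y_pred, n_features=1):
    """Return the six regression metrics as a dict (standard definitions)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    medae = statistics.median(abs(e) for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - (mse * n) / ss_tot                        # 1 - SS_res / SS_tot
    # Adjusted R^2 penalises extra features; needs n > n_features + 1
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "MeDAE": medae, "r2": r2, "r2_adj": r2_adj}

metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
for name, value in metrics.items():
    print(f"{name:7s} {value:.4f}")
```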

Regression Analysis

Regression Statistics

💠 Linear Regression with Fake Data Video


Session 9 - Deep Learning Scenario, Intro to TensorFlow, Data Preparation

Convolutional neural networks (CNN)

Summary of session
  • Convolutional Layer
  • Effect of Filter Size (Kernel Size)
  • Max Pool
  • Average Pool
  • Batch Sizing
  • Padding
  • Epochs
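
To make the effect of filter size, padding, and pooling concrete, here is a small pure-Python sketch. It uses the usual output-size formula for a convolution along one dimension and a 2x2 pooling window with stride 2; the function names are hypothetical and no TensorFlow is involved.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one dimension:
    floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def pool_2x2(image, reducer):
    """Apply a 2x2 pooling window with stride 2 to a list-of-lists image.
    `reducer` turns the four window values into one (e.g. max or average)."""
    return [[reducer([image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]])
             for c in range(0, len(image[0]) - 1, 2)]
            for r in range(0, len(image) - 1, 2)]

def average(values):
    return sum(values) / len(values)

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

print(conv_output_size(28, 3))             # 3x3 filter, no padding
print(conv_output_size(28, 5))             # larger filter shrinks the output more
print(conv_output_size(28, 3, padding=1))  # padding of 1 keeps the size
print(pool_2x2(image, max))                # max pool
print(pool_2x2(image, average))            # average pool
```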

👇 Here are some links with visual explanations and a playground to experiment.

Regex: regular expressions (import re). What is regex?

💠 Deep Learning Scenario, Intro to TensorFlow, Data Preparation Video


Session 10 - Data Augmenting

Data Augmenting Techniques

  • Mirroring

    • Flip Horizontal / Vertical
    • Flip Random
  • Cropping

  • Rotate

  • Recolor

  • PCA, Principal Component Analysis (topic for later lesson)
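
The mirroring and cropping techniques above can be sketched on a tiny image stored as a list of rows. This is an illustration only, with hypothetical function names; the session itself works with image libraries rather than nested lists.

```python
import random

def flip_horizontal(img):
    """Mirror left-to-right: reverse every row."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror top-to-bottom: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def flip_random(img, rng=random):
    """Pick one of: no flip, horizontal flip, vertical flip."""
    return rng.choice([lambda i: i, flip_horizontal, flip_vertical])(img)

def crop(img, top, left, height, width):
    """Cut a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

img = [[1, 2, 3],
       [4, 5, 6]]

print(flip_horizontal(img))
print(flip_vertical(img))
print(crop(img, 0, 1, 2, 2))
```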

Ghaith

Here are some notes about the data augmenting session:

Data augmentation techniques are used in deep learning, but they are still part of data preparation. For that reason, data augmentation mechanisms will be customized to form an important part of the ML pipeline.

I wanted to start by solving data quantity issues; then we will come back to the more fancy and fun part related to data quality. The assignment for next week is to answer the following questions:

  • How do you perform a customized rotation (any angle in degrees, not only 90)? Python code is required, and I hope to find volunteers to present. This task can be solved in many ways, and cooperating with other family members to cover several of them is allowed and appreciated.

  • Is it possible to re-color a grayscale image, and how? For this question we are not looking for coding examples, only an explanation and proof of the answer. Consider it a research task; brave presenters are highly appreciated.
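
One of the many possible ways to approach the rotation question is inverse mapping with nearest-neighbour sampling. The sketch below is only one such approach, written in pure Python on a list-of-lists image; the function name `rotate_image` and the `fill` parameter are hypothetical, and pixels that fall outside the source image are filled with a constant.

```python
import math

def rotate_image(img, degrees, fill=0):
    """Rotate a 2D list-of-lists image about its centre by any angle.
    Positive angles rotate clockwise as displayed (row 0 at the top).
    The output keeps the input size, so corners may be cut off."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: for each output pixel, find its source pixel
            y, x = r - cy, c - cx
            src_x = cos_t * x + sin_t * y + cx
            src_y = -sin_t * x + cos_t * y + cy
            sr, sc = round(src_y), round(src_x)
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

for row in rotate_image(img, 90):
    print(row)
for row in rotate_image(img, 45):
    print(row)
```

Nearest-neighbour sampling is the simplest choice; bilinear interpolation would give smoother results on real images at the cost of a little more code.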

TensorFlow data augmentation tutorial

💠 Data Augmenting Video


Session 11 - Data Imbalance

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

  • Created an imbalanced data set by taking a sample from each set (infected, uninfected)

    • using .sample from the pandas library
  • Data set: name of image, folder name, label

    • label: 1 = infected, 0 = uninfected

MONAI is good to use on medical data sets, with predefined tools for 2D and 3D images. Explanation in the video at 14:00 minutes.

  • Example of batches of size 10 showing the class imbalance.
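
The steps above can be sketched without any image data. The session uses the pandas `.sample` method; this stand-in uses only the standard library, and the file names, sample sizes (50 infected, 500 uninfected), and variable names are all hypothetical.

```python
import random
from collections import Counter

random.seed(42)

# Hypothetical (image name, label) records: 1 = infected, 0 = uninfected
infected = [(f"infected_{i}.png", 1) for i in range(1000)]
uninfected = [(f"uninfected_{i}.png", 0) for i in range(1000)]

# Take a small sample of one class and a large sample of the other
dataset = random.sample(infected, 50) + random.sample(uninfected, 500)
random.shuffle(dataset)

batch_size = 10
batches = [dataset[i:i + batch_size] for i in range(0, len(dataset), batch_size)]

# Most batches contain very few (often zero) minority-class samples
for batch in batches[:5]:
    print(Counter(label for _, label in batch))
```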

Assignment

To see the impact of oversampling and how the distribution of the data will change.

  • Experiment with batch sizes
  • Experiment with import sample sizes
  • What other techniques are there to solve imbalance by changing the number of samples in each batch?

It is always good to see other tools and share our findings in our pipeline_class_chat

MORE TOGETHER!

💠 Notebook 💠 Data Imbalance Video


Session 12 - Data balancing & training effect

  • Weight Computation for Oversampling & Penalization
  • Use of Pre-Trained Models.
    • Trained on labelled samples
    • ImageNet (1 million images, split into 1000 categories)
    • Uses ResNet18; there are others, and different varieties can be used.
  • Training and Validation
    • Training uses RandFlipd, RandRotate90d, RandGaussianNoised
    • Validation: no transformations
  • Training vs Test Accuracy
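
A common way to compute class weights is inverse class frequency. The sketch below assumes that scheme (the session may have used a different formula) and shows the same weights used for oversampling; all names are hypothetical.

```python
import random
from collections import Counter

random.seed(0)

labels = [1] * 50 + [0] * 500          # imbalanced: 50 infected, 500 uninfected
counts = Counter(labels)

# Inverse-frequency class weights: n_samples / (n_classes * class_count)
class_weight = {cls: len(labels) / (len(counts) * n) for cls, n in counts.items()}
print(class_weight)

# Oversampling: draw with replacement, each sample weighted by its class weight
sample_weights = [class_weight[y] for y in labels]
resampled = random.choices(labels, weights=sample_weights, k=len(labels))
print(Counter(resampled))
```

The same `class_weight` values can instead be passed to a loss function to penalise mistakes on the minority class more heavily, which leaves the data untouched.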

Confusion Matrix

| Predicted \ Actual | Positive (1) | Negative (0) |
| --- | --- | --- |
| Positive (1) | TP | FP |
| Negative (0) | FN | TN |

True Positive (TP), True Negative (TN), False Positive (FP, Type 1 Error), False Negative (FN, Type 2 Error)

  • Recall = TP / (TP + FN)
  • Precision = TP / (TP+FP)
  • F-Score = 2 * Recall * Precision / (Recall + Precision) (used to compare models)
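
The three formulas can be checked on a small example. A minimal sketch with hypothetical names and made-up labels:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(precision_recall_f(y_true, y_pred))
```

This sketch does not guard against division by zero, which occurs when there are no positive predictions or no positive labels.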

💠 Data balancing & training effect Video


Session 13 - Collecting Data From Storage

Creating data with SQL in Microsoft SQL Server Management Studio (SSMS)

  • Collection of data for time series analysis
  • Randomize data collection
  • Using While < 10000 to collect 10000 samples
  • Using the date to randomize patient transactions for collection
  • Create a procedure that can be called, for example, from Python
  • Example of creating the ERD (Entity Relationship Diagram) in SSMS

💠 Collecting Data From Storage Video


Session 14 - SQL & Python

  • sqlalchemy
  • sqlalchemy engine
  • Define functions for server and db connection
  • Functions for
    • Checking table exists
    • Create_table
    • Drop_table
    • Insert Dataframes as Table
    • Update DB

Examples of:

  • SQL query pull and convert to Pandas DF.
  • Pandas DF to SQL Table.
  • Checksum, for detecting errors

💠 More on Data SQL & Python Video


Session 15 - Fake Data Creation Part 2

  • Reference to Khuyen Tran, Faker Article

  • Fake Data for Regression.

    • Functions to define features / noise / model
    • Plotting Model
  • Create Fake Classification Data

    • Functions for Clusters, and Labels
    • Plot Model
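
As an illustration of the "functions for clusters and labels" idea, here is a minimal sketch that draws Gaussian clusters around given centres and labels each point with its cluster index. The function name `make_clusters`, the centres, and the spread are hypothetical; the session's own code uses numpy and pandas.

```python
import random
import statistics

random.seed(7)

def make_clusters(centers, n_per_cluster, spread=0.5):
    """Return (points, labels): Gaussian blobs around each (x, y) centre."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((random.gauss(cx, spread), random.gauss(cy, spread)))
            labels.append(label)
    return points, labels

points, labels = make_clusters([(0.0, 0.0), (5.0, 5.0)], n_per_cluster=100)

# Mean x coordinate per class should sit close to the chosen centres
for cls in (0, 1):
    xs = [p[0] for p, y in zip(points, labels) if y == cls]
    print(cls, round(statistics.fmean(xs), 2))
```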

Libraries: pandas, numpy, json, matplotlib.pyplot

💠 Fake Data Creation Part 2


Session 16 - Data Cleansing (DF completeness)

  • Example in Visual Studio 19
  • Data cleansing with pandas & NumPy
  • Reusable code
  • Classes and functions
  • Find an index column (unique / non-unique)
  • Quickly find NaNs and the % of NaNs per column
  • Find and drop columns with only 1 unique value (constant columns)
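
The completeness checks above can be sketched on a table held as a dict of columns. The session does this with pandas and NumPy; this stand-in uses only the standard library, and the function names and sample table are hypothetical.

```python
import math

def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))

def nan_percentages(table):
    """Percentage of missing values per column."""
    return {col: 100.0 * sum(is_missing(v) for v in values) / len(values)
            for col, values in table.items()}

def single_value_columns(table):
    """Columns whose non-missing values are all identical (candidates to drop)."""
    return [col for col, values in table.items()
            if len({v for v in values if not is_missing(v)}) <= 1]

def unique_index_candidates(table):
    """Columns with no missing values and no repeats (usable as an index)."""
    return [col for col, values in table.items()
            if not any(is_missing(v) for v in values)
            and len(set(values)) == len(values)]

table = {
    "id": [1, 2, 3, 4],
    "site": ["A", "A", "A", "A"],
    "age": [34, float("nan"), 58, None],
}

print(nan_percentages(table))
print(single_value_columns(table))
print(unique_index_candidates(table))
```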

💠 Data Cleansing Part 1


Session 17 - Data Cleansing (Formatting)

Using functions to determine which data type a DataFrame column can be.
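
A rough sketch of such a type check, assuming the column values arrive as strings (as they would from a CSV file). The function name `infer_column_type` and the list of date formats are assumptions for illustration, not the code from the linked notebook.

```python
from datetime import datetime

# Assumed date formats; extend this tuple to generalise datetime conversion
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%d %H:%M:%S")

def _all_parse(values, parser):
    """True if parser accepts every value without raising."""
    try:
        for v in values:
            parser(v)
        return True
    except (ValueError, TypeError):
        return False

def infer_column_type(values):
    """Return 'int', 'float', 'datetime', 'string', or 'empty' for a column
    of string values. None and '' are treated as missing and ignored."""
    present = [v for v in values if v not in (None, "")]
    if not present:
        return "empty"
    if _all_parse(present, int):
        return "int"
    if _all_parse(present, float):
        return "float"
    for fmt in DATE_FORMATS:
        if _all_parse(present, lambda v: datetime.strptime(v, fmt)):
            return "datetime"
    return "string"

print(infer_column_type(["1", "2", "3"]))
print(infer_column_type(["1.5", "2", ""]))
print(infer_column_type(["2021-01-31", "2021-02-01"]))
print(infer_column_type(["a", "1"]))
```

The checks run from the strictest type to the loosest, so a column is only reported as a string when nothing narrower fits.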

Assignment:

  • Improve the code
  • Generalise datetime conversion
  • How to process NaN values

Notebook examples: https://github.com/nishamathi/KT/blob/main/DataTypeChk.py


💠 Data Formatting

Session 18 - Data Cleansing (Missing Values)

  • Lesson discussion on how to impute missing data.
  • Mean / Median
  • Model prediction to fill in missing data
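
Mean and median imputation, the two simplest options above, can be sketched as follows. The function name `impute` is hypothetical; model-based imputation is not shown.

```python
import math
import statistics

def impute(values, strategy="mean"):
    """Replace None / NaN entries with the mean or median of the present values."""
    def missing(v):
        return v is None or math.isnan(v)

    present = [v for v in values if not missing(v)]
    if strategy == "mean":
        fill = statistics.fmean(present)
    elif strategy == "median":
        fill = statistics.median(present)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if missing(v) else v for v in values]

column = [1.0, None, 3.0, float("nan"), 100.0]
print(impute(column, "mean"))
print(impute(column, "median"))
```

With an outlier such as 100.0 in the column, the mean fill value is pulled far above the typical values, while the median fill is not.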

💠 Data Imputation


Session 19 - Data Cleansing (Missing Values)

  • MCAR: Missing Completely at Random

  • MAR: Missing at Random

  • MNAR: Missing not at Random

  • EDA Correlations

    • Example with Pearson's correlation
  • How to handle different types of missing data

Assignment

  • Write a function that takes x, y and calculates Pearson's correlation
  • Write a function that takes x, y and calculates Spearman's correlation
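
One possible shape for the two functions, using the standard definitions: Pearson's correlation is the covariance divided by the product of the standard deviations, and Spearman's correlation is Pearson's correlation applied to the ranks (ties get their average rank). The helper name `ranks` is hypothetical.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y_cubic = [v ** 3 for v in x]      # monotonic but not linear
print(pearson(x, y_cubic))         # below 1: the relationship is not linear
print(spearman(x, y_cubic))        # the ranks agree perfectly
```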

💠 We Care


Session 20 - TensorFlow Data Validation

  • Data and Concept Drifts

  • Schema Skew

  • Feature Skew

  • Distribution Skew

  • Generate Statistics.

  • If a large percentage of a feature's data is missing, get up from the chair: data validation is not about 100% screen time.

  • Check with the Business Analyst if the feature is required.

  • If there should be values for the feature, go back to the Data Engineer, understand why they are missing, and get them added.

💠 TensorFlow Data Validation


Session 21 - Scaling Data

Scaling data using various techniques with different libraries

  • Numpy
  • Pandas
  • TensorFlow

💠 Scaling


Session 22 - Scaling Data Part2

Scaling data using various techniques with pure Python code and comparing the results against library results
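
Two of the most common techniques, min-max scaling and standardization, written in pure Python. This is a sketch with hypothetical function names, not the session's notebook code.

```python
import statistics

def min_max_scale(values):
    """Rescale to the range [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score: (x - mean) / std, using the population std (ddof = 0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

data = [2.0, 4.0, 6.0, 10.0]
print(min_max_scale(data))
print(standardize(data))
```

When comparing with libraries, watch the standard deviation convention: NumPy's `np.std` and scikit-learn's `StandardScaler` use the population standard deviation by default, while pandas `.std()` defaults to the sample standard deviation (ddof = 1), so the results differ slightly.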

💠 Scaling Part2


Session 23 - Quantile Transformer.

  • Beta Distribution
  • cumulative probability
  • Skewed Data Generation
  • Quantile transformation of skewed data.
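
The four topics fit together in a few lines: generate skewed data from a Beta distribution, then map each value to its empirical cumulative probability, which spreads the data evenly over (0, 1). This is a simplified sketch; the Beta parameters (2, 8) and the function name `quantile_transform` are assumptions, and a library quantile transformer would also handle ties and unseen values.

```python
import random
import statistics

random.seed(3)

# Skewed data: Beta(2, 8) has mean 0.2 and a long right tail
skewed = [random.betavariate(2, 8) for _ in range(1000)]

def quantile_transform(values):
    """Map each value to its empirical cumulative probability in (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = (rank + 0.5) / len(values)
    return out

uniform = quantile_transform(skewed)

print("skewed mean / median     :", round(statistics.fmean(skewed), 3),
      round(statistics.median(skewed), 3))
print("transformed mean / median:", round(statistics.fmean(uniform), 3),
      round(statistics.median(uniform), 3))
```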

💠 Quantile Transformer


Session 24 - Scaling recap

💠


Session 25 - Linear Regression Toolkit

  • Linear Relationships Between Features and Labels
  • Model Coefficients
  • Errors
  • Homoscedastic Errors with Zero Means
  • Observations of Errors
  • Errors are normally distributed
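
These assumptions can be checked on fake data. The sketch below fits a single-feature line by ordinary least squares and inspects the errors (residuals); the data-generating model (y = 2x + 5 with Gaussian noise) and all names are hypothetical.

```python
import random
import statistics

random.seed(5)

X = [x / 10.0 for x in range(200)]
Y = [2.0 * x + 5 + random.gauss(0, 1.0) for x in X]   # linear model + noise

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_line(X, Y)
residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]

print("coefficients :", round(slope, 3), round(intercept, 3))
print("error mean   :", round(statistics.fmean(residuals), 6))
# Homoscedasticity check: the spread should be similar in both halves
print("error spread :", round(statistics.pstdev(residuals[:100]), 3),
      round(statistics.pstdev(residuals[100:]), 3))
```

A histogram of `residuals` gives a quick visual check of the normality assumption.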

💠 Linear Regression Toolkit


Session 26 - Linear Regression Toolkit

  • summary of using Classes in Python

💠 Linear Regression Toolkit


Session 27 - Coefficients

  • feature relationships

💠 Linear Regression Toolkit


Session 28 - Feature Engineering

  • Basic Feature Engineering
  • Fake Data
  • Visualization
  • Modeling without Feature Engineering
  • Linear
  • Polynomial Features
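
A small sketch of why polynomial features help, using hypothetical fake data where y depends on x squared. A straight line fitted on x explains almost nothing, while the same linear fit on the engineered feature x^2 explains nearly everything. All names are hypothetical.

```python
import random

random.seed(11)

def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree] for one value."""
    return [x ** d for d in range(1, degree + 1)]

X = [x / 10.0 for x in range(-50, 51)]
Y = [x ** 2 + random.gauss(0, 0.5) for x in X]   # fake non-linear data

def r2_single_feature(feature, ys):
    """Fit y = slope * feature + intercept by least squares, return R^2."""
    n = len(ys)
    mf, my = sum(feature) / n, sum(ys) / n
    slope = (sum((f - mf) * (y - my) for f, y in zip(feature, ys))
             / sum((f - mf) ** 2 for f in feature))
    intercept = my - slope * mf
    ss_res = sum((y - (slope * f + intercept)) ** 2 for f, y in zip(feature, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

r2_linear = r2_single_feature(X, Y)                                   # feature: x
r2_squared = r2_single_feature([polynomial_features(x, 2)[1] for x in X], Y)  # feature: x^2

print("R^2 without feature engineering:", round(r2_linear, 3))
print("R^2 with the x^2 feature       :", round(r2_squared, 3))
```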

💠 Linear Regression Toolkit


Session 29 - Feature Engineering

About

Summary of weekly ML Pipeline sessions
