- Session 1 - From Modelling to Production
- Session 2 - Software Engineering for ML
- Session 3 - Toolkit: Git
- Session 4 - Toolkit: Colab & Python
- Session 5 - Toolkit: Python Environments & Visual Studio & Health Informatics
- Session 6 - Data Set, Data Sample, Data Issues
- Session 7 - Create Fake Data (is Fun!)
- Session 8 - Linear Regression with Fake Data
- Session 9 - Deep Learning Scenario, Intro to TensorFlow, Data Preparation
- Session 10 - Data Augmenting
- Session 11 - Data Balancing
- Session 12 - Data balancing & training effect
- Session 13 - Collecting Data From Storage SQL
- Session 14 - SQL & Python
- Session 15 - Fake Data Creation Part 2
- Session 16 - Data Cleansing.p1
- Session 17 - Data Cleansing.p2
- Session 18 - Data Cleansing.p3
- Session 19 - Data Cleansing.p4
- Session 20 - Tensorflow Data Validation
- Session 21 - Scaling
- Session 22 - Scaling Part 2
- Session 23 - Quantile Transformer
- Session 24 - Recap - scaling
- Session 25 - Linear Regression Tool Kit
- Session 26 - Python Classes
- Session 27 - Coefficients
- Session 28 - Feature Engineering
- Intro to ML modelling
- DS Modelling
- DS Life Cycle
- DS principles
- ML pipeline
- Production of ML - ML development
- Production of ML - tasks for apply
- ML Pipeline - Target
- Directed Acyclic Graph
- ML Pipeline - Production ML Infrastructure
- Orchestration

References:
- Executive Data Science: A Guide to Training and Managing the Best Data Scientists (Brian Caffo, Jeff Leek, Roger Peng)
- The Practical Guide to Managing Data Science at Scale (Domino)
- Executive Data Science: Coursera-Johns Hopkins University
- Building Machine Learning Pipelines by Hannes Hapke, Catherine Nelson
From Modelling to Production Video
- Application Life Cycle
- Software Development Life Cycle
- Data Science Life Cycle
Software Engineering for ML Video
- Github
- Gitbash
- Google Colab
- Install Python
Toolkit: Colab & Python Video
In this session Thom Ives explains how to build a Python virtual environment ...
- Python 3.x
- Virtual environment wrapper
- System Variables
- Health Informatics Intro (starts 36:14)
Toolkit: Python Environments & Health Informatics Intro (starts 36:14) Video
Ghaith Sankari shows an example of integrating a Python project with a .NET Core Web API project using Visual Studio.
VS Video 1 | VS Video 2 | VS Video 3
Why is data important in the ML process, what is sampling, why do data issues appear, and which issues matter most?
- Feature Space
- Data Samples
- Data Issues
- Data Drift
- Concept Drift
Assignment (explanation only): You take random samples of the same size from a large population, compute the mean of each sample, and plot the distribution of those means. What shape will that distribution form?
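One way to explore the assignment question empirically is to simulate it. The sketch below is only an illustration: the skewed (exponential) population, the sample size of 50, and the helper name `sample_means` are assumptions, not part of the session material.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

# A deliberately skewed population (exponential), chosen only for illustration.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(100_000)]

means = sample_means(population, sample_size=50, n_samples=2_000)

print("population mean:", round(statistics.fmean(population), 3))
print("mean of sample means:", round(statistics.fmean(means), 3))
print("std of sample means:", round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` (for example with `plt.hist(means, bins=40)`) shows the shape to discuss in the assignment.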
Resource: ML Data and Concept Drift
How to create fake data with Python.
Assignment: What is heteroskedasticity? Why is it a challenge? Illustrate in a notebook.
- Send a DM to Thom; correct answers can be shared with the group.
import matplotlib.pyplot as plt
import random
X = [x/10.0 for x in range(100)]
Y = [2.0 * x + (random.random() - 0.5) * 0 + 5 for x in X]
plt.scatter(X, Y)
plt.title('This Is The Title')
plt.xlabel('These Are The X Values')
plt.ylabel('These Are The Y Values')
plt.show()
Added a Colab workbook for heteroskedasticity here
Assignment: Play with the models. (Please re-pull the repo.)
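As a starting point for the heteroskedasticity assignment, here is a minimal sketch that reuses the line `y = 2x + 5` from the snippet above. The noise scales (`1.0` and `0.5 * x`) and the helper `residual_spread` are arbitrary choices made for illustration.

```python
import random

random.seed(0)

X = [x / 10.0 for x in range(1, 101)]

# Homoskedastic: the noise has the same spread for every x.
Y_homo = [2.0 * x + 5 + random.gauss(0, 1.0) for x in X]

# Heteroskedastic: the noise spread grows with x (0.5 * x is an arbitrary choice).
Y_hetero = [2.0 * x + 5 + random.gauss(0, 0.5 * x) for x in X]

def residual_spread(xs, ys, lo, hi):
    """Standard deviation of residuals around the true line for lo <= x < hi."""
    res = [y - (2.0 * x + 5) for x, y in zip(xs, ys) if lo <= x < hi]
    mean = sum(res) / len(res)
    return (sum((r - mean) ** 2 for r in res) / (len(res) - 1)) ** 0.5

print("homo   low x / high x:", residual_spread(X, Y_homo, 0, 5), residual_spread(X, Y_homo, 5, 11))
print("hetero low x / high x:", residual_spread(X, Y_hetero, 0, 5), residual_spread(X, Y_hetero, 5, 11))
```

Passing `X, Y_hetero` to `plt.scatter` shows the spread widening as x increases, while `X, Y_homo` keeps a band of roughly constant width.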
- First run the Fake Data Creations .py.
- Fake_Single_Feature_Linear_Data.py
- Fake_Single_Feature_NonLinear_Data.py
- Fake_Double_Feature_Linear_Data.py
- Fake_Double_Feature_NonLinear_Data_with_Functional_Noise.py
- These will create 5 different .csv files of data
- Next, run each of the files in the folder Intro_to_Regression_Modeling; explore, play with, and understand the functionality of each script. Look at the fake data creation.
Tip: you can `import sys` and add `sys.exit()` in a script to force it to stop at that point, so you are not running the complete script.
- General_Toolls.py: this file is a module that you can call from within your script; it has functions to calculate:
**Linear Regression with Fake Data Video**
- Convolutional Layer
- Effect of Filter Size (Kernel Size)
- Max Pool
- Average Pool
- Batch Sizing
- Padding
- Epochs
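The way kernel size, padding, stride, batch size, and epochs interact can be checked with a little arithmetic. The sketch below uses the standard output-size formula `(n + 2p - k) // s + 1`; the function names are hypothetical helpers, not part of any library.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def batches_per_epoch(n_samples, batch_size):
    """Number of batches (iterations) needed to see every sample once."""
    return -(-n_samples // batch_size)  # ceiling division

print(conv_output_size(28, kernel_size=3))             # 26: a 3x3 filter shrinks the image
print(conv_output_size(28, kernel_size=3, padding=1))  # 28: padding of 1 keeps the size
print(conv_output_size(28, kernel_size=2, stride=2))   # 14: 2x2 pooling halves the size
print(batches_per_epoch(1000, batch_size=32))          # 32 iterations make one epoch
```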
Here are some links with visual explanations and a playground to experiment with.
- Tinker With a Neural Network
- Convolution visualizer
- What is a Neural Network
- Convolutional Neural Network Python
- Convolutional Neural Networks cheatsheet
- Convolutional Neural Networks (CNNs) explained Video
- A deeper understanding of NNets (Part 1) - CNNs
- Difference Between a Batch and an Epoch in a Neural Network
- Epoch vs Iterations vs Batch size
- Padding and Stride
- Avinash repo; opencv tutorials
RegEx: Regular Expression, `import re`
What is regex?
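A regular expression is a pattern that describes a set of strings. As a small illustration, the sketch below uses Python's built-in `re` module; the sample text and the `P-` patient ID format are made up for the example.

```python
import re

text = "Patient P-1023 visited on 2021-03-14; patient P-0007 on 2021-04-02."

# \d{4}-\d{2}-\d{2} matches ISO-style dates such as 2021-03-14.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# A capture group returns only the part inside the parentheses.
patient_ids = re.findall(r"P-(\d+)", text)

# re.sub replaces every match; here the IDs are masked.
masked = re.sub(r"P-\d+", "P-XXXX", text)

print(dates)        # ['2021-03-14', '2021-04-02']
print(patient_ids)  # ['1023', '0007']
print(masked)
```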
Deep Learning Scenario, Intro to TensorFlow, Data Preparation Video
Data Augmenting Techniques
- Mirroring
  - Flip Horizontal / Vertical
  - Flip Random
- Cropping
- Rotate
- Recolor
- PCA, Principal Component Analysis (topic for later lesson)
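The basic techniques in this list can be sketched without any image library by treating an image as a list of pixel rows. This is a simplified illustration with hypothetical helper names; real pipelines would normally use a library's transforms.

```python
import random

def flip_horizontal(img):
    """Mirror left-right: reverse each row."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror top-bottom: reverse the order of the rows."""
    return [row[:] for row in img[::-1]]

def rotate_90(img):
    """Rotate 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def random_crop(img, height, width, rng=random):
    """Cut a height x width window at a random position."""
    top = rng.randrange(len(img) - height + 1)
    left = rng.randrange(len(img[0]) - width + 1)
    return [row[left:left + width] for row in img[top:top + height]]

img = [[1, 2, 3],
       [4, 5, 6]]

print(flip_horizontal(img))  # [[3, 2, 1], [6, 5, 4]]
print(flip_vertical(img))    # [[4, 5, 6], [1, 2, 3]]
print(rotate_90(img))        # [[4, 1], [5, 2], [6, 3]]
print(random_crop(img, 1, 2))
```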
Ghaith
Here are some notes about the data augmenting session:
Data augmentation techniques are used in deep learning, but they are still part of data preparation. For that reason, data augmentation mechanisms are customized to form an important part of the ML pipeline.
I wanted to start by solving data quantity issues; we will then come back to the more fancy and fun part related to data quality. The assignment for next week is to answer the following questions:
- How do you perform a customized rotation (any angle in degrees, not only 90)? Python code is required, and I hope to find volunteers to present. This task can be solved in many ways; cooperating with other family members to cover several of them is allowed and appreciated.
- Is it possible to re-color a grayscale image, and how? For this question we are not looking for coding examples, only for an explanation and proof of the answer. You can treat it as a research task. Brave presenters are also highly appreciated.
TensorFlow data augmentation tutorial
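For the custom-rotation question, one of many possible approaches is inverse mapping with nearest-neighbour sampling: for every output pixel, compute which source pixel would land there. The sketch below is a pure-Python illustration under that assumption; the function name `rotate` and the `fill` parameter are hypothetical, and libraries such as `scipy.ndimage.rotate` or Pillow's `Image.rotate` offer proper interpolation.

```python
import math

def rotate(img, degrees, fill=0):
    """Rotate a 2D image (list of rows) about its centre, keeping the same size.

    Positive angles turn the image clockwise as displayed (row 0 at the top).
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel that lands on (x, y).
            dx, dy = x - cx, y - cy
            src_x = round(cos_t * dx + sin_t * dy + cx)
            src_y = round(-sin_t * dx + cos_t * dy + cy)
            if 0 <= src_x < w and 0 <= src_y < h:
                out[y][x] = img[src_y][src_x]
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

print(rotate(img, 90))  # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
for row in rotate(img, 45):
    print(row)
```

Pixels whose source falls outside the image take the `fill` value, which is why corners can turn blank at angles that are not multiples of 90.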
A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes
- Created an imbalanced data set by taking a sample from each set (infected, uninfected)
  - using `.sample` from the Pandas library
- Data set: name of image, folder name, label
  - label: 1 = infected, 0 = uninfected
MONAI is good to use on medical data sets, with predefined tools for 2D and 3D images. Explanation in the video at 14:00 minutes.
- Example of batches of size 10 showing the class imbalance.
Assignment
See the impact of oversampling and how the distribution of the data changes.
- Experiment with batch sizes
- Experiment with import sample sizes
- What other techniques are there to solve the imbalance by changing the number of samples in each batch?
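To get a feel for the assignment, here is a minimal sketch of random oversampling: minority-class samples are duplicated at random until every class is as large as the biggest one. The file names, the 90/10 split, and the function `oversample` are made up for illustration; the session notebook may do this differently.

```python
import random
from collections import Counter

def oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out = []
    for y, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out.extend((s, y) for s in items + extra)
    rng.shuffle(out)
    return out

# Hypothetical file names: 90 uninfected (0) and 10 infected (1) images.
samples = [f"img_{i}.png" for i in range(100)]
labels = [0] * 90 + [1] * 10

balanced = oversample(samples, labels)
print(Counter(labels))                  # Counter({0: 90, 1: 10})
print(Counter(y for _, y in balanced))  # 90 of each class

# Class mix in the first few batches of size 10.
for start in range(0, 30, 10):
    print(Counter(y for _, y in balanced[start:start + 10]))
```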
Always good to see other tools and share our findings in our pipeline_class_chat
MORE TOGETHER!
Notebook | Data Imbalance Video
- Weight Computation for Oversampling & Penalization
- Use of Pre-Trained Models
  - Trained on labelled samples
  - ImageNet (1 million images, split into 1000 categories)
  - Uses ResNet18; other models and varieties can be used.
- Training and Validation
  - Train using `RandFlipd`, `RandRotate90d`, `RandGaussianNoised`
  - Validation: no transformations
- Training vs Test Accuracy
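For the weight computation item, a common heuristic is inverse-frequency weighting, `n_samples / (n_classes * class_count)`. The sketch below assumes that heuristic; it is not necessarily the exact computation used in the session, and `class_weights` is a hypothetical helper.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

labels = [0] * 90 + [1] * 10
weights = class_weights(labels)
print(weights)  # class 0 gets about 0.556, class 1 gets 5.0

# Per-sample weights can serve as sampling weights for oversampling,
# while the per-class weights can scale (penalize) the loss of each class.
sample_weights = [weights[y] for y in labels]
print(sum(sample_weights))
```

With these weights each class contributes the same total weight, so the rare class is no longer drowned out by the majority class.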
Confusion Matrix
Predicted (rows) / Actual (columns) | Positive (1) | Negative (0) |
---|---|---|
Positive (1) | TP | FP |
Negative (0) | FN | TN |
TP: True Positive, TN: True Negative, FP: False Positive (Type 1 Error), FN: False Negative (Type 2 Error)
- Recall = TP / (TP + FN)
- Precision = TP / (TP + FP)
- F-Score = 2 * Recall * Precision / (Recall + Precision) (used to compare models)
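The formulas above translate directly into code. In this sketch the counts are made up for illustration, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, F-score and accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * recall * precision / (recall + precision)
               if recall + precision else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"recall": recall, "precision": precision,
            "f_score": f_score, "accuracy": accuracy}

# Made-up counts for illustration.
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The parentheses around `recall + precision` matter: without them the division binds first and the result is no longer the harmonic mean.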
Data balancing & training effect Video
Creating Data with SQL, Microsoft SQL Server Management Studio (SSMS)
- Collection of data for time series analysis
- Randomize data collection
  - Using a `While` loop (< 10000) to collect 10000 samples
  - Using Date to randomize patient transactions for collection
- Create a procedure that can be called, for example, from Python
- Example of creating the ERD (Entity Relationship Diagram) in SSMS
Collecting Data From Storage Video
- sqlalchemy
- sqlalchemy engine
- Define functions for server and db connection
- Functions for
  - Checking table exists
  - Create_table
  - Drop_table
  - Insert Dataframes as Table
  - Update DB
Examples of:
- SQL query pull and convert to Pandas DF.
- Pandas DF to SQL Table.
- Checksum, for detecting errors
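The same ideas (check that a table exists, create it, insert rows, pull them back, and compare checksums) can be tried without a database server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database instead of sqlalchemy and SQL Server; the table name, columns, and helper functions are assumptions made for the example.

```python
import hashlib
import sqlite3

def table_exists(conn, name):
    """True if a table called `name` exists in this SQLite database."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchone()
    return row is not None

def rows_checksum(rows):
    """MD5 checksum of the rows, for detecting changes or transfer errors."""
    return hashlib.md5(repr(sorted(rows)).encode("utf-8")).hexdigest()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk

if not table_exists(conn, "patients"):
    conn.execute(
        "CREATE TABLE patients (id INTEGER PRIMARY KEY, age INTEGER, infected INTEGER)"
    )

source_rows = [(1, 34, 0), (2, 51, 1), (3, 47, 0)]
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", source_rows)
conn.commit()

pulled = conn.execute("SELECT id, age, infected FROM patients").fetchall()
print(pulled)
print(rows_checksum(source_rows) == rows_checksum(pulled))  # True

conn.execute("DROP TABLE patients")
dropped = not table_exists(conn, "patients")
print(dropped)  # True
conn.close()
```

With pandas, `pd.read_sql_query` and `DataFrame.to_sql` cover the two conversion directions listed above.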
More on Data SQL & Python Video
- Reference to Khuyen Tran, Faker Article
- Fake Data for Regression
  - Functions to define features / noise / model
  - Plotting model
- Create Fake Classification Data
  - Functions for clusters and labels
  - Plot model
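A minimal sketch of fake classification data, assuming Gaussian clusters around chosen centres with the cluster index as the label. The function name `make_clusters` and its parameters are hypothetical, not the ones used in the session.

```python
import random

def make_clusters(centers, n_per_cluster, spread=1.0, seed=0):
    """Return (points, labels): Gaussian blobs around each centre, labelled by cluster index."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels

points, labels = make_clusters(centers=[(0, 0), (5, 5)], n_per_cluster=100, spread=1.0)
print(points[:3])
print(labels[:3])
```

To plot the result, split the points into `xs` and `ys` and call `plt.scatter(xs, ys, c=labels)`; increasing `spread` makes the clusters overlap and the classification task harder.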
Libraries: pandas, numpy, json, matplotlib.pyplot
- Example in Visual Studio 2019
- Data Cleansing with Pandas & Numpy
- Reusable Code
- Class and Functions
- Find an Index Column (Unique / Non-Unique)
- Quickly find NaNs and % of NaNs per column
- Find and drop columns with only 1 unique value (non repeating)
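The checks in this list can be sketched as small reusable pandas functions. The function names and the toy dataframe below are assumptions for illustration, not the code from the session.

```python
import numpy as np
import pandas as pd

def nan_report(df):
    """Count and percentage of NaNs per column."""
    return pd.DataFrame({
        "n_nan": df.isna().sum(),
        "pct_nan": df.isna().mean() * 100,
    })

def drop_single_value_columns(df):
    """Drop columns that hold only one unique value (NaN counts as a value here)."""
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep]

def unique_columns(df):
    """Columns whose values are all distinct, i.e. candidates for an index column."""
    return [col for col in df.columns if df[col].is_unique]

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34.0, np.nan, 47.0, 34.0],
    "site": ["A", "A", "A", "A"],
})

print(nan_report(df))
print(unique_columns(df))                               # ['id']
print(drop_single_value_columns(df).columns.tolist())   # ['id', 'age']
```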
Using functions to determine what datatype a column in a dataframe can be.
Assignment:
- Improvement of code
- Generalise datetime conversion
- How to process NaN values
Notebook examples: https://github.com/nishamathi/KT/blob/main/DataTypeChk.py
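One way to decide what datatype a column can hold is to try progressively looser parsers on its values. The sketch below is independent of the linked script; it assumes the values arrive as strings (or `None`) and a single date format, which is exactly the part the assignment asks to generalise.

```python
from datetime import datetime

def infer_type(values, date_format="%Y-%m-%d"):
    """Return the narrowest of 'int', 'float', 'datetime', 'str' that fits every non-missing value."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "empty"

    def all_parse(parser):
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, date_format)):
        return "datetime"
    return "str"

print(infer_type(["1", "2", "", "40"]))          # int
print(infer_type(["1.5", "2", None]))            # float
print(infer_type(["2021-03-14", "2021-04-02"]))  # datetime
print(infer_type(["A12", "7"]))                  # str
```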
Data Formatting
- Lesson discussion on how to impute missing data.
- Mean / Median
- Model prediction to fill in missing data
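Mean and median imputation are simple enough to sketch in plain Python. The function `impute` and the sample ages are made up for illustration; model-based imputation is not shown.

```python
import math
import statistics

def impute(values, strategy="mean"):
    """Replace None/NaN entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    fill = statistics.fmean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None or math.isnan(v) else v for v in values]

ages = [30, None, 40, float("nan"), 100]
print(impute(ages, "mean"))    # gaps filled with about 56.67
print(impute(ages, "median"))  # [30, 40, 40, 40, 100]
```

The example shows why the choice matters: the outlier 100 pulls the mean well above the median.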
Data Imputation
- MCAR: Missing Completely at Random
- MAR: Missing at Random
- MNAR: Missing not at Random
- EDA Correlations
  - Example with Pearson's Correlation
- How to handle different types of missing data
Assignment
- Write a function that takes x, y and calculates Pearson's correlation
- Write a function that takes x, y and calculates Spearman's correlation
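One possible shape for these two functions is sketched below, using the textbook definitions: Pearson as covariance over the product of standard deviations, and Spearman as the Pearson correlation of the ranks (ties share their average rank). Treat it as a reference to compare your own solution against, not as the expected answer.

```python
def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic but not linear
print(round(pearson(x, y), 4))   # 0.9811
print(round(spearman(x, y), 4))  # 1.0
```

The gap between the two results shows the difference: Pearson measures linear association, Spearman only monotonic association.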
We Care
- Data and Concept Drifts
- Schema Skew
- Feature Skew
- Distribution Skew
- Generate Statistics
- If a large percentage of a feature's data is missing, get away from the chair; data validation is not about 100% screen time.
- Check with the Business Analyst whether the feature is required.
- If there should be values for the feature, go back to the Data Engineer to understand why they are missing and get them added.
TensorFlow Data Validation
Scaling data using various techniques with different libraries
- Numpy
- Pandas
- Tensorflow
Scaling
Scaling data using various techniques with pure Python code and comparing against library results
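As an illustration of the pure-Python approach, the sketch below implements min-max scaling and standardization by hand and cross-checks the result against the standard library's `statistics` module. The function names and sample data are assumptions for the example.

```python
import statistics

def min_max_scale(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score: subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

data = [2.0, 4.0, 6.0, 8.0, 10.0]

print(min_max_scale(data))  # [0.0, 0.25, 0.5, 0.75, 1.0]
z = standardize(data)
print([round(v, 4) for v in z])

# Cross-check the hand-written maths against the standard library.
print(abs(statistics.pstdev(z) - 1.0) < 1e-9)  # True
print(abs(statistics.fmean(z)) < 1e-9)         # True
```

When comparing against libraries, check which standard deviation they use: NumPy's `np.std` and scikit-learn's `StandardScaler` default to the population version (as here), while pandas' `Series.std` defaults to the sample version, so results can differ slightly.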
Scaling Part 2
- Beta Distribution
- cumulative probability
- Skewed Data Generation
- Quantile transformation of skewed data.
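The steps above can be sketched in plain Python: generate skewed data from a Beta distribution, then map every value to its empirical cumulative probability, which yields an approximately uniform output. The Beta(2, 8) parameters and the helper `fit_quantile_transformer` are assumptions for illustration; scikit-learn's `QuantileTransformer` is a library implementation of the same idea.

```python
import random
from bisect import bisect_right

def fit_quantile_transformer(train):
    """Return a function mapping a value to its empirical cumulative probability in `train`."""
    ordered = sorted(train)
    n = len(ordered)

    def transform(value):
        # Fraction of training values <= value, i.e. the empirical CDF.
        return bisect_right(ordered, value) / n

    return transform

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)
# Beta(2, 8) is right-skewed: most values are small, with a tail towards 1.
skewed = [rng.betavariate(2, 8) for _ in range(10_000)]

to_uniform = fit_quantile_transformer(skewed)
uniform = [to_uniform(v) for v in skewed]

print("skewed mean:", round(mean(skewed), 3))        # close to 2 / (2 + 8) = 0.2
print("transformed mean:", round(mean(uniform), 3))  # close to 0.5 for a uniform output
```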
- Linear Relationships Between Features and Labels
- Model Coefficients
- Errors
- Homoscedastic Errors with Zero Means
- Observations of Errors
- Errors are normally distributed
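These assumptions can be checked on fake data. The sketch below fits a single-feature line with the closed-form least-squares solution and inspects the residuals; the true coefficients (2 and 5), the noise level, and the function `fit_line` are choices made for the example.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(1)
X = [i / 10.0 for i in range(200)]
# Fake data: y = 2x + 5 plus homoscedastic, zero-mean, normally distributed noise.
Y = [2.0 * x + 5.0 + rng.gauss(0, 1.0) for x in X]

slope, intercept = fit_line(X, Y)
residuals = [y - (slope * x + intercept) for x, y in zip(X, Y)]

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("mean residual:", round(statistics.fmean(residuals), 6))  # ~0 for a least-squares fit
print("residual std:", round(statistics.stdev(residuals), 3))   # near the noise level of 1.0
```

A histogram of `residuals` should look roughly bell-shaped, and a scatter of `residuals` against `X` should show no trend or widening if the errors are homoscedastic.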
- summary of using Classes in Python
- feature relationships
- Basic Feature Engineering
- Fake Data
- Visualization
- Modeling without Feature Engineering
- Linear
- Polynomial Features