- Central Limit Theorem
- Confidence Interval
- p-value
This repository includes materials on various machine learning topics and algorithms.
- All clusterings explained
- Hashing in ML
- Target Encoding Right Way
- Bayesian Encoding in IMDb Rating
- Different Types of Encoders in Python
- Smarter ways to encode
- Entity Embeddings for Categorical Variables
- Metrics for Linear Regression
- Adaptive Boosting
- XGBoost
- Random Forest
- Why Random Forest has 63% unique rows in each tree
- Selecting features from 1000 columns - Video
- Selecting features from 1000 columns - Pydata Deck
- NLP with Deep Learning
- Word Embedding - Analytics Vidhya
- Word Embedding - Xin Rong
- CNN
- Weight of Evidence and Information Value
- Kmeans for Beginners
- AdaBoost - Special Case of Gradient Boost
- Gradient Boost
- Gradient Boost and XGBoost - Understanding Maths
- GBM Vs XGB
- Decision Trees Explanation
- ACF and PACF : Good Explanation
- Clearing the air around Boosting
- XGboost explained by StatQuest
- Time Series - FbProphet
- Time Series - TBATS and BATS
- Basics of Linear Algebra by Applied AI
- Basics of Image Processing with Numpy
- LDA explained by Luis Serrano
This section contains notes/summaries/questions on some Machine Learning topics.
- Machine Learning Interview Question Collections
- Data Science Interview Preparation
- KMeans
- Decorrelating Trees
- Shallow and Bushy Trees
- Random Forest
- AdaBoost
- Gradient Boost
- XGBoost
- Logistic Regression
Imagine a bunch of random points on an X-Y plane. Suppose we try to find 3 clusters within this data. The algorithm first chooses 3 random points from the data as the 3 centroids and calculates the Euclidean distance between each point and each centroid. Based on these distances, 3 clusters are formed such that every point is assigned to its closest centroid. Finally, the algorithm recalculates the centroids by taking the mean of all the points within each cluster. The process is repeated iteratively until the centroids stop changing.
As the number of clusters increases, the variation (the sum-of-squares criterion) in the data decreases. The variation is zero when the number of clusters equals the number of data points. If we plot the reduction in variation against the number of clusters, we see a point beyond which adding more clusters yields only small decreases in variation. This plot is called the "elbow curve". link
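A minimal sketch of this elbow method with scikit-learn's KMeans (the toy data, the range of k, and the plotting choices are assumptions for illustration):

```python
# Elbow-curve sketch (illustrative data; assumes scikit-learn and matplotlib are installed).
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy 2-D data with 3 true clusters.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = range(1, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster sum of squares)")
plt.title("Elbow curve")
plt.show()
```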
Inertia, or the within-cluster sum of squares criterion, can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:
- Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
- Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.
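As a rough illustration of the PCA-before-k-means suggestion above, a hedged sketch using scikit-learn (the digits dataset and the number of components are arbitrary choices):

```python
# Sketch: reduce dimensionality with PCA before k-means (assumed setup, not from the original text).
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)          # 64-dimensional features
model = make_pipeline(
    PCA(n_components=10),                    # compress to 10 components first
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = model.fit_predict(X)                # cluster labels in the reduced space
```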
The decision tree algorithm tries to unmix the data at each node by asking the question that makes the two resulting datasets most homogeneous. The homogeneity of the data is quantified using Gini impurity, and at each node the question is chosen so that the decrease in Gini impurity is the maximum. This decrease in impurity is called information gain. Gini impurity is essentially the chance of being incorrect if we randomly assign a label to an example in the same set.
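A tiny worked example of Gini impurity (the label sets are made up):

```python
# Gini impurity: the chance of mislabelling a random example if we assign it
# a label drawn at random from the same set.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5  (maximally mixed for two classes)
print(gini(["yes"] * 10))              # 0.0  (pure node)
```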
- criterion: determines how the impurity of a split will be measured. The default value is "gini", but you can also use "entropy" as the impurity metric.
- max_depth: the maximum depth of the tree. The default is None, which will often result in over-fitted decision trees. The depth parameter is one of the ways to regularize the tree, or limit the way it grows to prevent over-fitting.
- min_samples_split: the minimum number of samples a node must contain in order to consider splitting. The default value is two. You can use this parameter to regularize your tree.
- min_samples_leaf: the minimum number of samples needed to be considered a leaf node. The default value is one. Use this parameter to limit the growth of the tree.
- max_features: the number of features to consider when looking for the best split. If this value is not set, the decision tree considers all available features to make the best split. Depending on your application, it is often a good idea to tune this parameter.
- splitter: how the decision tree searches the features for a split. The default value is "best": for each node, the algorithm considers all the features and chooses the best split. If you set the splitter parameter to "random", a random subset of features is considered, and the split is then made by the best feature within that subset.
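These parameters map onto scikit-learn's DecisionTreeClassifier; a hedged sketch with arbitrary illustrative values:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="gini",        # or "entropy"
    max_depth=4,             # limit depth to regularize
    min_samples_split=10,    # a node needs at least 10 samples to be split
    min_samples_leaf=5,      # each leaf keeps at least 5 samples
    max_features="sqrt",     # consider only a subset of features per split
    splitter="best",         # or "random"
    random_state=0,
).fit(X, y)
```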
Building decision trees requires an algorithm capable of determining an optimal choice at each node. One popular algorithm is Hunt's algorithm. This is a greedy model, meaning it makes the most optimal decision at each step but does not take into account the global optimum. What does this mean? At each step the algorithm chooses the best result. However, choosing the best result at a given step does not ensure you will head down the route that leads to the optimal decision when you reach the final node of the tree, called the leaf node. Decision trees are prone to overfitting, especially when a tree is particularly deep, because the increasing specificity leads to smaller samples of events meeting the previous assumptions, and such small samples can lead to unsound conclusions. One way to combat this issue is by setting a max depth. This will limit our risk of overfitting, but as always at the expense of error due to bias: the result is a simpler model with less variance from sample to sample, but ultimately a weaker predictive model.
Ideally, we would like to minimize both error due to bias and error due to variance. Enter random forests. Random forests mitigate this problem well. A random forest is simply a collection of decision trees whose results are aggregated into one final result. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models. One way random forests reduce variance is by training each tree on a different bootstrap sample of the data; this technique is known as bagging. A second way is by considering only a random subset of features at each split.
No. It uses the bagging technique, which builds several decision trees in parallel, also known as base learners. Data sampled with replacement is fed to these learners for training. The final prediction is the averaged output from all the learners. Individual trees might have high variance, but the bagging method eventually reduces the variance.
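A minimal scikit-learn sketch of these two variance-reduction ideas (the dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=200,     # many decorrelated trees, built independently
    bootstrap=True,       # each tree sees a bootstrap sample of rows (bagging)
    max_features="sqrt",  # random feature subset considered at each split
    oob_score=True,       # evaluate on the ~37% of rows each tree never saw
    random_state=0,
).fit(X, y)
print(rf.oob_score_)      # out-of-bag accuracy estimate
```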
In ensemble techniques (RF, GBM, GBR), trees are decorrelated to reduce variance. Random Forest uses bagging (bootstrap row sampling) and, in addition, selects a random subset of features at each split from which the splitting criterion is decided. In this way, every tree ends up quite different from the others.
In boosting, the same effect is achieved by giving higher weights to misclassified rows.
In boosting, depending on the problem statement, one might get shallower trees compared to those in RF. Boosting can afford to grow shallow trees because later trees can go deeper in the regions where earlier trees have not predicted well. In Random Forest the trees are independent and identically distributed, so each tree has to grow to a much larger depth to identify patterns. This causes high variance, which is then reduced by averaging.
Source : https://www.youtube.com/watch?v=wPqtzj5VZus @42:05
AdaBoost works by giving higher weight to misclassified rows and lower weight to correctly classified rows.
For example:
- total number of rows = 1000
- initial weight of each row = 1/1000
- correctly classified rows = 800
- misclassified rows = 200
- learning rate = 0.1
- new weight of the correctly classified rows ∝ e^-0.1, i.e. roughly 800 × e^-0.1 ≈ 720 effective rows
- new weight of the misclassified rows ∝ e^0.1, i.e. roughly 200 × e^0.1 ≈ 220 effective rows

Hence, roughly 20 of the 200 misclassified rows are duplicated to get about 220 rows. This might result in overfitting.
The next model is then built on the reweighted data.
At the end of n trees, a weighted sum of the predictions is taken.
The higher the error rate of a tree, the lower the weight given to that tree in the final prediction.
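A small numpy sketch of the reweighting arithmetic in the example above (the split of correct/misclassified rows is assumed; a full AdaBoost implementation also derives the tree weight alpha from the error rate rather than fixing it):

```python
import numpy as np

n_rows = 1000
weights = np.full(n_rows, 1 / n_rows)          # initial weight of each row
misclassified = np.zeros(n_rows, dtype=bool)
misclassified[:200] = True                     # assume the first 200 rows were misclassified
alpha = 0.1                                    # weight of this tree ("learning rate" in the example)

# Increase weights of misclassified rows, decrease weights of correct ones.
weights *= np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()                       # normalize so weights sum to 1

# Expected composition of a 1000-row resample drawn under the new weights:
print(f"misclassified rows in resample: {weights[misclassified].sum() * n_rows:.0f}")   # ~234
print(f"correct rows in resample:       {weights[~misclassified].sum() * n_rows:.0f}")  # ~766
```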
This algorithm boosts weak classifiers in a different way: it uses the gradient of the loss function to reduce the errors of the previous models.
An initial prediction is made and its residual (actual − prediction) is calculated. The residual is nothing but the (negative) gradient of the squared-error loss. For the next model, this residual becomes the target variable. The way GBM differs from an algorithm like logistic regression is that GBM computes a gradient for every row and fits a new model to those gradients, rather than applying gradient descent to a fixed set of parameters at the end of each iteration. This makes the algorithm prone to outliers.
Gradient boosting works well when there is a clear separation between classes. However, if the data is noisy it may try to fit every pattern and overfit.
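A from-scratch sketch of the residual-fitting idea for squared-error loss (the synthetic data, shrinkage of 0.1, and tree depth are assumptions):

```python
# Minimal gradient boosting for regression with squared-error loss:
# each new tree is fit to the residuals (negative gradient) of the current prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # initial prediction: the mean
trees = []

for _ in range(100):
    residuals = y - prediction                          # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)       # take a small step toward the residuals
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training MSE after boosting
```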
Regularization: XGBoost penalizes the model for its complexity, e.g. for the number of trees or leaves, via weights in the objective. It also works well on sparse data (e.g. TF-IDF features).
Also, GBM and XGBoost use a greedy search to decide the splitting criterion.
Objective Function or Gain = F(l, Ω, λ, g, h) − γ
F(l, Ω, λ): function that calculates the weight (prediction score) at every terminal node. This depends on the loss function and on Ω, λ.
l: a differentiable convex loss function that measures the difference between the prediction ŷ_i and the target y_i.
One can define one's own loss function, provided its second-order derivative is defined.
g, h: first-order and second-order gradients of the loss function.
Ω: penalizes the complexity of the model (i.e., the regression tree functions).
λ: an additional regularization term that helps smooth the final learnt weights to avoid over-fitting.
γ (gamma): controls the number of leaves in the trees.
Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.
When the regularization parameter is set to zero, the objective falls back to the traditional gradient tree boosting.
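For reference, the standard form of this regularized objective, the optimal leaf weight, and the split gain as given in the "Introduction to Boosted Trees" material cited below (T is the number of leaves, w the leaf weights, and G_j, H_j the sums of g_i and h_i over leaf j):

```latex
\text{Obj} = \sum_i l\bigl(y_i, \hat{y}_i\bigr) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
G_j = \sum_{i \in \text{leaf } j} g_i,\quad H_j = \sum_{i \in \text{leaf } j} h_i

\text{Gain} = \tfrac{1}{2}\left[
  \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
  - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```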
Method:
- Sort the data.
- Find the best candidate split according to the objective function (the gain), not the Gini index or entropy.
- XGB and GBM use a greedy algorithm to decide the best split.
- Grow the tree to the maximum depth and then prune the leaves that have negative gain.
Treating missing values:
Rows with a missing value for the split feature are guided to
- the left node, and the gain is calculated;
- the right node, and the gain is calculated.
The direction with the maximum gain is eventually selected.
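A short sketch showing that XGBoost learns this default direction automatically when NaNs are present (synthetic data; assumes a reasonably recent xgboost with the scikit-learn wrapper):

```python
# XGBoost handles missing values natively: at each split it learns a default
# direction (left or right) for rows whose feature value is missing.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan   # knock out 20% of the values

model = XGBClassifier(n_estimators=50, max_depth=3, eval_metric="logloss")
model.fit(X, y)                          # no imputation needed
print(model.predict(X[:5]))
```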
Source : Kaggle Winning Solution Xgboost Algorithm - Learn from Its Author, Tong He
Introduction to Boosted Trees
-> K-Means is going to partition the n documents into k disjoint clusters whereas LDA tries to assign a document to a mixture of k different topics.
LDA
-> Captures the semantic sense of the document. For similar words, dot product would be on the higher side.
-> CBOW (Continuous Bag of Words): use the context words to predict the middle word.
Skip-gram: use the middle word to predict the context.
-> It is basically a one-layer neural network. In the CBOW method, for a sentence like "My name is Aniket", the words "My", "is", "Aniket" are used as input and "name" is kept as the output, and the neural network is trained this way. One-hot encoding is used to feed the words to the model. Finally, the hidden layer of the neural network gives an embedding matrix containing a word embedding for every word in the corpus.
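A minimal sketch with gensim's Word2Vec (assumes gensim 4.x, where the embedding size parameter is vector_size; the toy corpus is made up):

```python
from gensim.models import Word2Vec

sentences = [
    ["my", "name", "is", "aniket"],
    ["my", "name", "is", "john"],
    ["word", "embeddings", "capture", "semantic", "similarity"],
]

# sg=0 -> CBOW (context words predict the middle word); sg=1 -> skip-gram (middle word predicts context).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

vector = model.wv["name"]              # embedding for one word
print(model.wv.most_similar("name"))   # nearest words by cosine similarity
```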
Logistic regression is considered a linear model because the decision boundary is a linear function of x, i.e. the predictions can be written as a linear combination of x.
If p = 1/(1 + e^(-z)) and p = 0.5 is the threshold, then z = 0 is the decision boundary. Links: Why is logistic regression a linear classifier; Why is logistic regression considered a linear model.
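Writing out the log-odds makes the linearity explicit (standard derivation with z = wᵀx + b):

```latex
p = \frac{1}{1 + e^{-(w^\top x + b)}}
\;\;\Longrightarrow\;\;
\log\frac{p}{1-p} = w^\top x + b

p = 0.5 \;\Longleftrightarrow\; w^\top x + b = 0
\quad \text{(a hyperplane, i.e. a linear decision boundary)}
```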
If we try to use the cost function of linear regression in logistic regression it is of no use, as it ends up being a non-convex function with many local minima, in which it would be very difficult to minimize the cost and find the global minimum. So we define the log cost function for logistic regression, which is convex. Below is a short explanation of it: "In case y=1, the output (i.e. the cost to pay) approaches 0 as y_pred approaches 1. Conversely, the cost to pay grows to infinity as y_pred approaches 0. This is a desirable property: we want a bigger penalty as the algorithm predicts something far away from the actual value. If the label is y=1 but the algorithm predicts y_pred=0, the outcome is completely wrong." Sources: Cost Function of Logistic Regression; Intro to Logistic Regression.
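The log cost function described above, for m training examples with h_θ(x) the sigmoid output:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\Bigl[\, y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \Bigr]
```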
Yes, using one-vs-all classification. Suppose there are 3 different classes we want to predict. We train 3 different classifiers, one per class i, to predict the probability that y = i, and at prediction time we take the class with the maximum probability.
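A hedged scikit-learn sketch of one-vs-all (OneVsRestClassifier fits one binary logistic regression per class; the iris dataset is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One binary classifier per class; prediction picks the class with the highest probability.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
print(ovr.predict_proba(X[:5]))     # one column of probabilities per class
```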
Standardization isn't required for logistic regression. The main goal of standardizing features is to help convergence of the technique used for optimization. It's just that standardizing the features makes the convergence faster.
Regularization is a technique to discourage the complexity of the model. It does this by penalizing the loss function, which helps to solve the overfitting problem. In L1 regularization we add a penalty equal to the sum of the absolute values of the weights (λ · Σ|w|) to the loss function.
L1 regularization performs feature selection: it assigns insignificant input features a weight of zero and useful features a non-zero weight.
L2 regularization forces the weights to be small but does not make them exactly zero, giving a non-sparse solution. L2 is not robust to outliers, as the squared terms blow up the error differences of the outliers, and the regularization term then tries to fix this by penalizing the weights.
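A small sketch of the sparsity difference, assuming scikit-learn's Lasso (L1) and Ridge (L2) on a toy regression problem:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Toy data: 50 features, only 5 of which are actually informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("L1 (Lasso) zero weights:", np.sum(lasso.coef_ == 0))   # many coefficients driven exactly to 0
print("L2 (Ridge) zero weights:", np.sum(ridge.coef_ == 0))   # typically none; weights are just small
```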
L1 vs L2 : Medium
L1 vs L2 : Analytics Vidhya
ElasticNet
- Complete workflow of a data science project.
- Doc to vec models
- Precision, Recall, lift charts, ROC curves.
- When to use a particular model.
Top 10 Data Science Blogs📈
- Analytics Vidhya
- Data Science Central
- KDnuggets
- R-Bloggers
- Revolution Analytics
- Data Camp
- Codementor
- Data Plus Science
- Data Science 101
- DataRobot
🧮Learn Statistics and Probability for free📚
- Khan Academy
- OpenIntro
- Exam Solutions
- Seeing Theory
- Towardsdatascience
- Elitedatascience
- OLI
- Class Central
- Alison
- Guru99
🔏Sites with Free Data Sets🖇
- Data.world
- Kaggle
- FiveThirtyEight
- BuzzFeed
- Socrata OpenData
- Data.gov
- Quandl
- Reddit or r/datasets
- UCI Repository
- Academic Torrents
📇Sites to Learn Python📕
- Code Academy
- TutorialsPoint
- Python org
- Python for Beginners
- Pythonspot
- Interactive Python
- Python Tutor
- Full Stack Python
- Awesome-Python
- CheckiO
📊Sites for Visualization📉
- Storytelling with Data
- Information is Beautiful
- Flowing Data
- Visualising Data
- Junk Charts
- The Pudding
- The Atlas
- Graphic Detail
- US Census and FEMA
- Tableau Blog
📍Best Data Science Courses Offered Online🔖
Coursera
- IBM
- University of Michigan
- DeepLearning.ai
- Stanford University
EdX
- Harvard University
- MIT
- UC San Diego