From 82d3090f4b82ef8dbc7ef201d4ae3bb528e212bb Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Thu, 6 Jun 2019 22:54:20 +0900
Subject: [PATCH 01/11] vi translating

---
 vi/cheatsheet-deep-learning.md                | 321 ++++++++
 ...tsheet-machine-learning-tips-and-tricks.md | 285 +++++++
 vi/cheatsheet-supervised-learning.md          | 567 ++++++++++++++
 vi/cheatsheet-unsupervised-learning.md        | 340 +++++++++
 vi/convolutional-neural-networks.md           | 716 ++++++++++++++++++
 vi/deep-learning-tips-and-tricks.md           | 457 +++++++++++
 vi/recurrent-neural-networks.md               | 677 +++++++++++++++++
 vi/refresher-linear-algebra.md                | 339 +++++++++
 vi/refresher-probability.md                   | 381 ++++++++++
 9 files changed, 4083 insertions(+)
 create mode 100644 vi/cheatsheet-deep-learning.md
 create mode 100644 vi/cheatsheet-machine-learning-tips-and-tricks.md
 create mode 100644 vi/cheatsheet-supervised-learning.md
 create mode 100644 vi/cheatsheet-unsupervised-learning.md
 create mode 100644 vi/convolutional-neural-networks.md
 create mode 100644 vi/deep-learning-tips-and-tricks.md
 create mode 100644 vi/recurrent-neural-networks.md
 create mode 100644 vi/refresher-linear-algebra.md
 create mode 100644 vi/refresher-probability.md
diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
new file mode 100644
index 000000000..ff3b3c508
--- /dev/null
+++ b/vi/cheatsheet-deep-learning.md
@@ -0,0 +1,321 @@
+**1. Deep Learning cheatsheet**
+
+&#10230; Deep Learning cheatsheet
+
+<br>
+
+**2. Neural Networks**
+
+&#10230; Mạng Neural
+
+<br>
+
+**3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
+
+&#10230; Mạng neural là một lớp các mô hình (models) được xây dựng từ các tầng (layers). Các loại mạng neural thường được sử dụng bao gồm: mạng neural tích chập (Convolutional Neural Networks) và mạng neural hồi quy (Recurrent Neural Networks).
+
+<br>
+
+**4. Architecture ― The vocabulary around neural networks architectures is described in the figure below:**
+
+&#10230; Kiến trúc - Các thuật ngữ xoay quanh kiến trúc của mạng neural được mô tả như hình phía dưới:
+
+<br>
+
+**5. [Input layer, hidden layer, output layer]**
+
+&#10230; [Tầng đầu vào, tầng ẩn, tầng đầu ra]
+
+<br>
+
+**6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
+
+&#10230; Bằng việc kí hiệu i là tầng thứ i của mạng, j là đơn vị ẩn (hidden unit) thứ j của tầng, ta có:
+
+<br>
+
+**7. where we note w, b, z the weight, bias and output respectively.**
+
+&#10230; trong đó ta kí hiệu w, b, z lần lượt là trọng số (weight), bias và đầu ra.
+
+<br>
+
+**8. Activation function ― Activation functions are used at the end of a hidden unit to introduce non-linear complexities to the model. Here are the most common ones:**
+
+&#10230; Hàm kích hoạt (Activation function) - Hàm kích hoạt được sử dụng ở phần cuối của đơn vị ẩn để đưa độ phức tạp phi tuyến tính (non-linear) vào mô hình (model). Dưới đây là những hàm phổ biến nhất:
+
+<br>
+
+**9. [Sigmoid, Tanh, ReLU, Leaky ReLU]**
+
+&#10230; [Sigmoid, Tanh, ReLU, Leaky ReLU]
+
+<br>
+
+**10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+&#10230; Mất mát (loss) Cross-entropy - Trong bối cảnh của mạng neural, mất mát cross-entropy L(z, y) thường được sử dụng và định nghĩa như sau:
+
+<br>
+
+**11. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. This can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+&#10230; Tốc độ học (Learning rate) - Tốc độ học, thường được kí hiệu bởi α hoặc đôi khi là η, chỉ ra tốc độ mà trọng số được cập nhật. Thông số này có thể là cố định hoặc được thay đổi một cách thích ứng. Phương thức (method) phổ biến nhất hiện tại là Adam, đây là phương thức tự điều chỉnh tốc độ học.
+
+<br>
+
+**12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
+
+&#10230; Backpropagation (Lan truyền ngược) - Backpropagation là phương thức dùng để cập nhật trọng số trong mạng neural bằng cách tính đến đầu ra thực tế và đầu ra mong muốn. Đạo hàm theo trọng số w được tính bằng cách sử dụng quy tắc chuỗi (chain rule) và có dạng như sau:
+
+<br>
+
+**13. As a result, the weight is updated as follows:**
+
+&#10230; Kết quả là, trọng số được cập nhật như sau:
+
+<br>
+
+**14. Updating weights ― In a neural network, weights are updated as follows:**
+
+&#10230; Cập nhật trọng số - Trong mạng neural, trọng số được cập nhật như sau:
+
+<br>
+
+**15. Step 1: Take a batch of training data.**
+
+&#10230; Bước 1: Lấy một mẻ (batch) dữ liệu huấn luyện (training data).
+
+<br>
+
+**16. Step 2: Perform forward propagation to obtain the corresponding loss.**
+
+&#10230; Bước 2: Thực thi lan truyền xuôi (forward propagation) để lấy được mất mát (loss) tương ứng.
+
+<br>
+
+**17. Step 3: Backpropagate the loss to get the gradients.**
+
+&#10230; Bước 3: Lan truyền ngược mất mát để lấy được gradients (độ dốc).
+
+<br>
+
+**18. Step 4: Use the gradients to update the weights of the network.**
+
+&#10230; Bước 4: Sử dụng gradients để cập nhật trọng số của mạng (network).
+
+<br>
+
+**19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
+
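+&#10230; Dropout - Dropout là một kĩ thuật nhằm tránh overfitting trên tập dữ liệu huấn luyện bằng cách bỏ đi các đơn vị trong mạng neural. Trong thực tế, các neurons hoặc bị bỏ đi với xác suất p hoặc được giữ lại với xác suất 1−p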
+
+<br>
+
+**20. Convolutional Neural Networks**
+
+&#10230; Mạng neural tích chập (Convolutional Neural Networks)
+
+<br>
+
+**21. Convolutional layer requirement ― By noting W the input volume size, F the size of the convolutional layer neurons, P the amount of zero padding, then the number of neurons N that fit in a given volume is such that:**
+
+&#10230; Yêu cầu của tầng tích chập (Convolutional layer) - Bằng việc kí hiệu W là kích cỡ của volume đầu vào, F là kích cỡ của neurons thuộc convolutional layer, P là số lượng zero padding, khi đó số lượng neurons N phù hợp với volume cho trước sẽ như sau:
+
+<br>
+
+**22. Batch normalization ― It is a step of hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+&#10230; Batch normalization (chuẩn hoá) - Đây là bước mà các hyperparameter γ,β chuẩn hoá batch (mẻ) {xi}. Bằng việc kí hiệu μB,σ2B là giá trị trung bình và phương sai của batch mà ta muốn hiệu chỉnh, nó được thực hiện như sau:
+
+<br>
+
+**23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+&#10230; Nó thường được thực hiện sau fully connected/convolutional layer và trước non-linearity layer, với mục tiêu là cho phép tốc độ học cao hơn cũng như giảm đi sự phụ thuộc mạnh mẽ vào việc khởi tạo.
+
+<br>
+
+**24. Recurrent Neural Networks**
+
+&#10230; Mạng neural hồi quy (Recurrent Neural Networks)
+
+<br>
+
+**25. Types of gates ― Here are the different types of gates that we encounter in a typical recurrent neural network:**
+
+&#10230; Các loại cổng - Đây là các loại cổng (gate) khác nhau mà chúng ta sẽ gặp ở một mạng neural hồi quy điển hình:
+
+<br>
+
+**26. [Input gate, forget gate, gate, output gate]**
+
+&#10230; [Cổng đầu vào, cổng quên, cổng, cổng đầu ra]
+
+<br>
+
+**27. [Write to cell or not?, Erase a cell or not?, How much to write to cell?, How much to reveal cell?]**
+
+&#10230; [Ghi vào cell hay không?, Xoá cell hay không?, Ghi bao nhiêu vào cell?, Cần tiết lộ bao nhiêu về cell?]
+
+<br>
+
+**28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
+
+&#10230; LSTM - Mạng bộ nhớ dài-ngắn hạn (long short-term memory, LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (độ dốc biến mất) bằng cách thêm vào cổng 'quên' ('forget' gates).
+
+<br>
+
+**29. Reinforcement Learning and Control**
+
+&#10230; Reinforcement Learning và Control
+
+<br>
+
+**30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
+
+&#10230; Mục tiêu của reinforcement learning là giúp tác tử (agent) học cách phát triển trong một môi trường.
+
+<br>
+
+**31. Definitions**
+
+&#10230; Định nghĩa
+
+<br>
+
+**32. Markov decision processes ― A Markov decision process (MDP) is a 5-tuple (S,A,{Psa},γ,R) where:**
+
+&#10230; Tiến trình quyết định Markov (Markov decision processes) - Tiến trình quyết định Markov (MDP) là một dạng 5-tuple (S,A,{Psa},γ,R) mà ở đó:
+
+<br>
+
+**33. S is the set of states**
+
+&#10230; S là tập hợp các trạng thái (states)
+
+<br>
+
+**34. A is the set of actions**
+
+&#10230; A là tập hợp các hành động (actions)
+
+<br>
+
+**35. {Psa} are the state transition probabilities for s∈S and a∈A**
+
+&#10230; {Psa} là xác suất chuyển tiếp trạng thái cho s∈S và a∈A
+
+<br>
+
+**36. γ∈[0,1[ is the discount factor**
+
+&#10230; γ∈[0,1[ là discount factor
+
+<br>
+
+**37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
+
+&#10230; R:S×A⟶R hoặc R:S⟶R là reward function (hàm reward) mà giải thuật muốn tối đa hoá.
+
+<br>
+
+**38. Policy ― A policy π is a function π:S⟶A that maps states to actions.**
+
+&#10230; Policy - Policy π là 1 hàm π:S⟶A có nhiệm vụ ánh xạ states tới actions.
+
+<br>
+
+**39. Remark: we say that we execute a given policy π if given a state s we take the action a=π(s).**
+
+&#10230; Chú ý: Ta nói rằng ta thực thi policy π cho trước nếu với state s cho trước, ta thực hiện action a=π(s).
+
+<br>
+
+**40. Value function ― For a given policy π and a given state s, we define the value function Vπ as follows:**
+
+&#10230; Hàm giá trị (Value function) - Với policy cho trước π và state s, ta định nghĩa value function Vπ như sau:
+
+<br>
+
+**41. Bellman equation ― The optimal Bellman equation characterizes the value function Vπ∗ of the optimal policy π∗:**
+
+&#10230; Phương trình Bellman - Phương trình tối ưu Bellman đặc trưng hoá value function Vπ∗ của policy tối ưu (optimal policy) π∗:
+
+<br>
+
+**42. Remark: we note that the optimal policy π∗ for a given state s is such that:**
+
+&#10230; Chú ý: ta lưu ý rằng optimal policy π∗ đối với state s cho trước thoả mãn:
+
+<br>
+
+**43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
+
+&#10230; Giải thuật duyệt giá trị (Value iteration) - Giải thuật duyệt giá trị gồm 2 bước:
+
+<br>
+
+**44. 1) We initialize the value:**
+
+&#10230; 1) Ta khởi tạo giá trị (value):
+
+<br>
+
+**45. 2) We iterate the value based on the values before:**
+
+&#10230; 2) Ta lặp lại việc cập nhật giá trị dựa theo các giá trị trước đó:
+
+<br>
+
+**46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
+
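+&#10230; Ước lượng hợp lý cực đại (Maximum likelihood estimate) - Các ước lượng hợp lý cực đại cho xác suất chuyển tiếp trạng thái như sau: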
+
+<br>
+
+**47. times took action a in state s and got to s′**
+
+&#10230; số lần thực hiện action a ở state s và chuyển tới s′
+
+<br>
+
+**48. times took action a in state s**
+
+&#10230; số lần thực hiện action a ở state s
+
+<br>
+
+**49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
+
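+&#10230; Q-learning - Q-learning là một phương pháp ước lượng Q không cần mô hình (model-free), được thực hiện như sau: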
+
+<br>
+
+**50. View PDF version on GitHub**
+
+&#10230; Xem bản PDF trên GitHub
+
+<br>
+
+**51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
+
+&#10230; [Mạng Neural, Kiến trúc, Hàm kích hoạt, Backpropagation (Lan truyền ngược), Dropout]
+
+<br>
+
+**52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
+
+&#10230; [Mạng neural tích chập, Tầng tích chập, Batch normalization (chuẩn hoá)]
+
+<br>
+
+**53. [Recurrent Neural Networks, Gates, LSTM]**
+
+&#10230; [Mạng neural hồi quy, Các loại cổng, LSTM]
+
+<br>
+
+**54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
+
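+&#10230; [Reinforcement learning, Tiến trình quyết định Markov, Duyệt giá trị/policy, Approximate dynamic programming, Policy search]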
diff --git a/vi/cheatsheet-machine-learning-tips-and-tricks.md b/vi/cheatsheet-machine-learning-tips-and-tricks.md
new file mode 100644
index 000000000..9712297b8
--- /dev/null
+++ b/vi/cheatsheet-machine-learning-tips-and-tricks.md
@@ -0,0 +1,285 @@
+**1. Machine Learning tips and tricks cheatsheet**
+
+&#10230;
+
+<br>
+
+**2. Classification metrics**
+
+&#10230;
+
+<br>
+
+**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
+
+&#10230;
+
+<br>
+
+**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+**5. [Predicted class, Actual class]**
+
+&#10230;
+
+<br>
+
+**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
+
+&#10230;
+
+<br>
+
+**7. [Metric, Formula, Interpretation]**
+
+&#10230;
+
+<br>
+
+**8. Overall performance of model**
+
+&#10230;
+
+<br>
+
+**9. How accurate the positive predictions are**
+
+&#10230;
+
+<br>
+
+**10. Coverage of actual positive sample**
+
+&#10230;
+
+<br>
+
+**11. Coverage of actual negative sample**
+
+&#10230;
+
+<br>
+
+**12. Hybrid metric useful for unbalanced classes**
+
+&#10230;
+
+<br>
+
+**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**14. [Metric, Formula, Equivalent]**
+
+&#10230;
+
+<br>
+
+**15. AUC ― The area under the receiver operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
+
+&#10230;
+
+<br>
+
+**16. [Actual, Predicted]**
+
+&#10230;
+
+<br>
+
+**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
+
+&#10230;
+
+<br>
+
+**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
+
+&#10230;
+
+<br>
+
+**19. Coefficient of determination ― The coefficient of determination, often noted R2 or r2, provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
+
+&#10230;
+
+<br>
+
+**21. where L is the likelihood and ˆσ2 is an estimate of the variance associated with each response.**
+
+&#10230;
+
+<br>
+
+**22. Model selection**
+
+&#10230;
+
+<br>
+
+**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
+
+&#10230;
+
+<br>
+
+**24. [Training set, Validation set, Testing set]**
+
+&#10230;
+
+<br>
+
+**25. [Model is trained, Model is assessed, Model gives predictions]**
+
+&#10230;
+
+<br>
+
+**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
+
+&#10230;
+
+<br>
+
+**27. [Also called hold-out or development set, Unseen data]**
+
+&#10230;
+
+<br>
+
+**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
+
+&#10230;
+
+<br>
+
+**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
+
+&#10230;
+
+<br>
+
+**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
+
+&#10230;
+
+<br>
+
+**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
+
+&#10230;
+
+<br>
+
+**33. Regularization ― The regularization procedure aims at preventing the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
+
+&#10230;
+
+<br>
+
+**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+&#10230;
+
+<br>
+
+**35. Diagnostics**
+
+&#10230;
+
+<br>
+
+**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
+
+&#10230;
+
+<br>
+
+**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
+
+&#10230;
+
+<br>
+
+**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
+
+&#10230;
+
+<br>
+
+**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
+
+&#10230;
+
+<br>
+
+**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
+
+&#10230;
+
+<br>
+
+**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
+
+&#10230;
+
+<br>
+
+**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
+
+&#10230;
+
+<br>
+
+**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
+
+&#10230;
+
+<br>
+
+**44. Regression metrics**
+
+&#10230;
+
+<br>
+
+**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
+
+&#10230;
+
+<br>
+
+**46. [Regression metrics, R squared, Mallow's CP, AIC, BIC]**
+
+&#10230;
+
+<br>
+
+**47. [Model selection, cross-validation, regularization]**
+
+&#10230;
+
+<br>
+
+**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
+
+&#10230;
diff --git a/vi/cheatsheet-supervised-learning.md b/vi/cheatsheet-supervised-learning.md
new file mode 100644
index 000000000..a6b19ea1c
--- /dev/null
+++ b/vi/cheatsheet-supervised-learning.md
@@ -0,0 +1,567 @@
+**1. Supervised Learning cheatsheet**
+
+&#10230;
+
+<br>
+
+**2. Introduction to Supervised Learning**
+
+&#10230;
+
+<br>
+
+**3. Given a set of data points {x(1),...,x(m)} associated to a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
+
+&#10230;
+
+<br>
+
+**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**5. [Regression, Classifier, Outcome, Examples]**
+
+&#10230;
+
+<br>
+
+**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
+
+&#10230;
+
+<br>
+
+**7. Type of model ― The different models are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
+
+&#10230;
+
+<br>
+
+**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
+
+&#10230;
+
+<br>
+
+**10. Notations and general concepts**
+
+&#10230;
+
+<br>
+
+**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
+
+&#10230;
+
+<br>
+
+**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
+
+&#10230;
+
+<br>
+
+**14. [Linear regression, Logistic regression, SVM, Neural Network]**
+
+&#10230;
+
+<br>
+
+**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
+
+&#10230;
+
+<br>
+
+**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
+
+&#10230;
+
+<br>
+
+**17. Remark: Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples.**
+
+&#10230;
+
+<br>
+
+**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
+
+&#10230;
+
+<br>
+
+**19. Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
+
+&#10230;
+
+<br>
+
+**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
+
+&#10230;
+
+<br>
+
+**21. Linear models**
+
+&#10230;
+
+<br>
+
+**22. Linear regression**
+
+&#10230;
+
+<br>
+
+**23. We assume here that y|x;θ∼N(μ,σ2)**
+
+&#10230;
+
+<br>
+
+**24. Normal equations ― By noting X the design matrix, the value of θ that minimizes the cost function is a closed-form solution such that:**
+
+&#10230;
+
+<br>
+
+**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
+
+&#10230;
+
+<br>
+
+**26. Remark: the update rule is a particular case of the gradient ascent.**
+
+&#10230;
+
+<br>
+
+**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
+
+&#10230;
+
+<br>
+
+**28. Classification and logistic regression**
+
+&#10230;
+
+<br>
+
+**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
+
+&#10230;
+
+<br>
+
+**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
+
+&#10230;
+
+<br>
+
+**31. Remark: there is no closed form solution for the case of logistic regressions.**
+
+&#10230;
+
+<br>
+
+**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
+
+&#10230;
+
+<br>
+
+**33. Generalized Linear Models**
+
+&#10230;
+
+<br>
+
+**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
+
+&#10230;
+
+<br>
+
+**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
+
+&#10230;
+
+<br>
+
+**36. Here are the most common exponential distributions summed up in the following table:**
+
+&#10230;
+
+<br>
+
+**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
+
+&#10230;
+
+<br>
+
+**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈Rn+1 and rely on the following 3 assumptions:**
+
+&#10230;
+
+<br>
+
+**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
+
+&#10230;
+
+<br>
+
+**40. Support Vector Machines**
+
+&#10230;
+
+<br>
+
+**41. The goal of support vector machines is to find the line that maximizes the minimum distance of the data points to the line.**
+
+&#10230;
+
+<br>
+
+**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
+
+&#10230;
+
+<br>
+
+**43. where (w,b)∈Rn×R is the solution of the following optimization problem:**
+
+&#10230;
+
+<br>
+
+**44. such that**
+
+&#10230;
+
+<br>
+
+**45. support vectors**
+
+&#10230;
+
+<br>
+
+**46. Remark: the line is defined as wTx−b=0.**
+
+&#10230;
+
+<br>
+
+**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**48. Kernel ― Given a feature mapping ϕ, we define the kernel K as:**
+
+&#10230;
+
+<br>
+
+**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||²/(2σ²)) is called the Gaussian kernel and is commonly used.**
+
+&#10230;
+
+<br>
+
+**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
+
+&#10230;
+
+<br>
+
+**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
+
+&#10230;
+
+<br>
+
+**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
+
+&#10230;
+
+<br>
+
+**53. Remark: the coefficients βi are called the Lagrange multipliers.**
+
+&#10230;
+
+<br>
+
+**54. Generative Learning**
+
+&#10230;
+
+<br>
+
+**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
+
+&#10230;
+
+<br>
+
+**56. Gaussian Discriminant Analysis**
+
+&#10230;
+
+<br>
+
+**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
+
+&#10230;
+
+<br>
+
+**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
+
+&#10230;
+
+<br>
+
+**59. Naive Bayes**
+
+&#10230;
+
+<br>
+
+**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
+
+&#10230;
+
+<br>
+
+**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
+
+&#10230;
+
+<br>
+
+**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
+
+&#10230;
+
+<br>
+
+**63. Tree-based and ensemble methods**
+
+&#10230;
+
+<br>
+
+**64. These methods can be used for both regression and classification problems.**
+
+&#10230;
+
+<br>
+
+**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.**
+
+&#10230;
+
+<br>
+
+**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
+
+&#10230;
+
+<br>
+
+**67. Remark: random forests are a type of ensemble method.**
+
+&#10230;
+
+<br>
+
+**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**69. [Adaptive boosting, Gradient boosting]**
+
+&#10230;
+
+<br>
+
+**70. High weights are put on errors to improve at the next boosting step**
+
+&#10230;
+
+<br>
+
+**71. Weak learners trained on remaining errors**
+
+&#10230;
+
+<br>
+
+**72. Other non-parametric approaches**
+
+&#10230;
+
+<br>
+
+**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
+
+&#10230;
+
+<br>
+
+**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
+
+&#10230;
+
+<br>
+
+**75. Learning Theory**
+
+&#10230;
+
+<br>
+
+**76. Union bound ― Let A1,...,Ak be k events. We have:**
+
+&#10230;
+
+<br>
+
+**77. Hoeffding inequality ― Let Z1,..,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ˆϕ be their sample mean and γ>0 fixed. We have:**
+
+&#10230;
+
+<br>
+
+**78. Remark: this inequality is also known as the Chernoff bound.**
+
+&#10230;
+
+<br>
+
+**79. Training error ― For a given classifier h, we define the training error ˆϵ(h), also known as the empirical risk or empirical error, to be as follows:**
+
+&#10230;
+
+<br>
+
+**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
+
+&#10230;
+
+<br>
+
+**81. the training and testing sets follow the same distribution**
+
+&#10230;
+
+<br>
+
+**82. the training examples are drawn independently**
+
+&#10230;
+
+<br>
+
+**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
+
+&#10230;
+
+<br>
+
+**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
+
+&#10230;
+
+<br>
+
+**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
+
+&#10230;
+
+<br>
+
+**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
+
+&#10230;
+
+<br>
+
+**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
+
+&#10230;
+
+<br>
+
+**88. [Introduction, Type of prediction, Type of model]**
+
+&#10230;
+
+<br>
+
+**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
+
+&#10230;
+
+<br>
+
+**90. [Linear models, linear regression, logistic regression, generalized linear models]**
+
+&#10230;
+
+<br>
+
+**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
+
+&#10230;
+
+<br>
+
+**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
+
+&#10230;
+
+<br>
+
+**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
+
+&#10230;
+
+<br>
+
+**94. [Other methods, k-NN]**
+
+&#10230;
+
+<br>
+
+**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
+
+&#10230;
diff --git a/vi/cheatsheet-unsupervised-learning.md b/vi/cheatsheet-unsupervised-learning.md
new file mode 100644
index 000000000..6daab3b21
--- /dev/null
+++ b/vi/cheatsheet-unsupervised-learning.md
@@ -0,0 +1,340 @@
+**1. Unsupervised Learning cheatsheet**
+
+&#10230;
+
+<br>
+
+**2. Introduction to Unsupervised Learning**
+
+&#10230;
+
+<br>
+
+**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
+
+&#10230;
+
+<br>
+
+**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
+
+&#10230;
+
+<br>
+
+**5. Clustering**
+
+&#10230;
+
+<br>
+
+**6. Expectation-Maximization**
+
+&#10230;
+
+<br>
+
+**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
+
+&#10230;
+
+<br>
+
+**8. [Setting, Latent variable z, Comments]**
+
+&#10230;
+
+<br>
+
+**9. [Mixture of k Gaussians, Factor analysis]**
+
+&#10230;
+
+<br>
+
+**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower-bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
+
+&#10230;
+
+<br>
+
+**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
+
+&#10230;
+
+<br>
+
+**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
+
+&#10230;
+
+<br>
+
+**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
+
+&#10230;
+
+<br>
+
+**14. k-means clustering**
+
+&#10230;
+
+<br>
+
+**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
+
+&#10230;
+
+<br>
+
+**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
+
+&#10230;
+
+<br>
+
+**17. [Means initialization, Cluster assignment, Means update, Convergence]**
+
+&#10230;
+
+<br>
+
+**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
+
+&#10230;
+
+<br>
+
+**19. Hierarchical clustering**
+
+&#10230;
+
+<br>
+
+**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
+
+&#10230;
+
+<br>
+
+**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**22. [Ward linkage, Average linkage, Complete linkage]**
+
+&#10230;
+
+<br>
+
+**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
+
+&#10230;
+
+<br>
+
+**24. Clustering assessment metrics**
+
+&#10230;
+
+<br>
+
+**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
+
+&#10230;
+
+<br>
+
+**26. Silhouette coefficient ― By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
+
+&#10230;
+
+<br>
+
+**27. Calinski-Harabaz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
+
+&#10230;
+
+<br>
+
+**28. the Calinski-Harabaz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+**29. Dimension reduction**
+
+&#10230;
+
+<br>
+
+**30. Principal component analysis**
+
+&#10230;
+
+<br>
+
+**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
+
+&#10230;
+
+<br>
+
+**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+&#10230;
+
+<br>
+
+**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+&#10230;
+
+<br>
+
+**34. diagonal**
+
+&#10230;
+
+<br>
+
+**35. Remark: the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.**
+
+&#10230;
+
+<br>
+
+**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
+dimensions by maximizing the variance of the data as follows:**
+
+&#10230;
+
+<br>
+
+**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
+
+&#10230;
+
+<br>
+
+**38. Step 2: Compute $\Sigma=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}{x^{(i)}}^T\in\mathbb{R}^{n\times n}$, which is symmetric with real eigenvalues.**
+
+&#10230;
+
+<br>
+
+**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
+
+&#10230;
+
+<br>
+
+**40. Step 4: Project the data on spanR(u1,...,uk).**
+
+&#10230;
+
+<br>
+
+**41. This procedure maximizes the variance among all k-dimensional spaces.**
+
+&#10230;
+
+<br>
+
+**42. [Data in feature space, Find principal components, Data in principal components space]**
+
+&#10230;
+
+<br>
+
+**43. Independent component analysis**
+
+&#10230;
+
+<br>
+
+**44. It is a technique meant to find the underlying generating sources.**
+
+&#10230;
+
+<br>
+
+**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
+
+&#10230;
+
+<br>
+
+**46. The goal is to find the unmixing matrix W=A−1.**
+
+&#10230;
+
+<br>
+
+**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
+
+&#10230;
+
+<br>
+
+**48. Write the probability of x=As=W−1s as:**
+
+&#10230;
+
+<br>
+
+**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
+
+&#10230;
+
+<br>
+
+**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
+
+&#10230;
+
+<br>
+
+**51. The Machine Learning cheatsheets are now available in [target language].**
+
+&#10230;
+
+<br>
+
+**52. Original authors**
+
+&#10230;
+
+<br>
+
+**53. Translated by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**54. Reviewed by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**55. [Introduction, Motivation, Jensen's inequality]**
+
+&#10230;
+
+<br>
+
+**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
+
+&#10230;
+
+<br>
+
+**57. [Dimension reduction, PCA, ICA]**
+
+&#10230;
diff --git a/vi/convolutional-neural-networks.md b/vi/convolutional-neural-networks.md
new file mode 100644
index 000000000..cb7e676ca
--- /dev/null
+++ b/vi/convolutional-neural-networks.md
@@ -0,0 +1,716 @@
+**Convolutional Neural Networks translation**
+
+<br>
+
+**1. Convolutional Neural Networks cheatsheet**
+
+&#10230; Convolutional Neural Networks cheatsheet
+
+<br>
+
+
+**2. CS 230 - Deep Learning**
+
+&#10230; CS 230 - Deep Learning
+
+<br>
+
+
+**3. [Overview, Architecture structure]**
+
+&#10230; [Tổng quan, Kiến trúc]
+
+<br>
+
+
+**4. [Types of layer, Convolution, Pooling, Fully connected]**
+
+&#10230; [Loại tầng (layer), Convolution (Tích chập), Pooling, Fully connected]
+
+<br>
+
+
+**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
+
+&#10230;
+
+<br>
+
+
+**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
+
+&#10230;
+
+<br>
+
+
+**7. [Activation functions, Rectified Linear Unit, Softmax]**
+
+&#10230;
+
+<br>
+
+
+**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
+
+&#10230;
+
+<br>
+
+
+**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
+
+&#10230;
+
+<br>
+
+
+**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
+
+&#10230;
+
+<br>
+
+
+**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
+
+&#10230;
+
+<br>
+
+
+**12. Overview**
+
+&#10230;
+
+<br>
+
+
+**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
+
+&#10230;
+
+<br>
+
+
+**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
+
+&#10230;
+
+<br>
+
+
+**15. Types of layer**
+
+&#10230;
+
+<br>
+
+
+**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
+
+&#10230;
+
+<br>
+
+
+**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
+
+&#10230;
+
+<br>
+
+
+**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.**
+
+&#10230;
+
+<br>
+
+
+**19. [Type, Purpose, Illustration, Comments]**
+
+&#10230;
+
+<br>
+
+
+**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
+
+&#10230;
+
+<br>
+
+
+**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
+
+&#10230;
+
+<br>
+
+
+**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
+
+&#10230;
+
+<br>
+
+
+**23. Filter hyperparameters**
+
+&#10230;
+
+<br>
+
+
+**24. The convolution layer contains filters for which it is important to know the meaning behind their hyperparameters.**
+
+&#10230;
+
+<br>
+
+
+**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is an F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
+
+&#10230;
+
+<br>
+
+
+**26. Filter**
+
+&#10230;
+
+<br>
+
+
+**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
+
+&#10230;
+
+<br>
+
+
+**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
+
+&#10230;
+
+<br>
+
+
+**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
+
+&#10230;
+
+<br>
+
+
+**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
+
+&#10230;
+
+<br>
+
+
+**31. [No padding, Drops last convolution if dimensions do not match, Padding such that feature map size has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
+
+&#10230;
+
+<br>
+
+
+**32. Tuning hyperparameters**
+
+&#10230;
+
+<br>
+
+
+**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
+
+&#10230;
+
+<br>
+
+
+**34. [Input, Filter, Output]**
+
+&#10230;
+
+<br>
+
+
+**35. Remark: often times, Pstart=Pend≜P, in which case we can replace Pstart+Pend by 2P in the formula above.**
+
+&#10230;
+
+<br>
+
+
+**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
+
+&#10230;
+
+<br>
+
+
+**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
+
+&#10230;
+
+<br>
+
+
+**38. [One bias parameter per filter, In most cases, S<F, A common choice for K is 2C]**
+
+&#10230;
+
+<br>
+
+
+**39. [Pooling operation done channel-wise, In most cases, S=F]**
+
+&#10230;
+
+<br>
+
+
+**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
+
+&#10230;
+
+<br>
+
+
+**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
+
+&#10230;
+
+<br>
+
+
+**42. In the example below, we have F1=F2=3 and S1=S2=1, which gives R2=1+2⋅1+2⋅1=5.**
+
+&#10230;
+
+<br>
+
+
+**43. Commonly used activation functions**
+
+&#10230;
+
+<br>
+
+
+**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
+
+&#10230;
+
+<br>
+
+
+**45. [ReLU, Leaky ReLU, ELU, with]**
+
+&#10230;
+
+<br>
+
+
+**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
+
+&#10230;
+
+<br>
+
+
+**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈Rn and outputs a vector of output probability p∈Rn through a softmax function at the end of the architecture. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**48. where**
+
+&#10230;
+
+<br>
+
+
+**49. Object detection**
+
+&#10230;
+
+<br>
+
+
+**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
+
+&#10230;
+
+<br>
+
+
+**51. [Image classification, Classification w. localization, Detection]**
+
+&#10230;
+
+<br>
+
+
+**52. [Teddy bear, Book]**
+
+&#10230;
+
+<br>
+
+
+**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
+
+&#10230;
+
+<br>
+
+
+**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
+
+&#10230;
+
+<br>
+
+
+**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**56. [Bounding box detection, Landmark detection]**
+
+&#10230;
+
+<br>
+
+
+**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
+
+&#10230;
+
+<br>
+
+
+**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
+
+&#10230;
+
+<br>
+
+
+**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
+
+&#10230;
+
+<br>
+
+
+**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
+
+&#10230;
+
+<br>
+
+
+**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
+
+&#10230;
+
+<br>
+
+
+**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of the same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
+
+&#10230;
+
+<br>
+
+
+**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
+
+&#10230;
+
+<br>
+
+
+**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
+
+&#10230;
+
+<br>
+
+
+**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
+
+&#10230;
+
+<br>
+
+
+**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
+
+&#10230;
+
+<br>
+
+
+**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
+
+&#10230;
+
+<br>
+
+
+**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
+
+&#10230;
+
+<br>
+
+
+**69. [Original image, Division in G×G grid, Bounding box prediction, Non-max suppression]**
+
+&#10230;
+
+<br>
+
+
+**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
+
+&#10230;
+
+<br>
+
+
+**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potential relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
+
+&#10230;
+
+<br>
+
+
+**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
+
+&#10230;
+
+<br>
+
+
+**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
+
+&#10230;
+
+<br>
+
+
+**74. Face verification and recognition**
+
+&#10230;
+
+<br>
+
+
+**75. Types of models ― Two main types of model are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**76. [Face verification, Face recognition, Query, Reference, Database]**
+
+&#10230;
+
+<br>
+
+
+**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
+
+&#10230;
+
+<br>
+
+
+**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
+
+&#10230;
+
+<br>
+
+
+**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
+
+&#10230;
+
+<br>
+
+
+**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to the same class, while the negative example belongs to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**81. Neural style transfer**
+
+&#10230;
+
+<br>
+
+
+**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
+
+&#10230;
+
+<br>
+
+
+**83. [Content C, Style S, Generated image G]**
+
+&#10230;
+
+<br>
+
+
+**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
+
+&#10230;
+
+<br>
+
+
+**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
+
+&#10230;
+
+<br>
+
+
+**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
+
+&#10230;
+
+<br>
+
+
+**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
+
+&#10230;
+
+<br>
+
+
+**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
+
+&#10230;
+
+<br>
+
+
+**91. Architectures using computational tricks**
+
+&#10230;
+
+<br>
+
+
+**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output, which is fed into the discriminative model, which in turn aims at differentiating the generated image from the true one.**
+
+&#10230;
+
+<br>
+
+
+**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
+
+&#10230;
+
+<br>
+
+
+**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
+
+&#10230;
+
+<br>
+
+
+**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
+
+&#10230;
+
+<br>
+
+
+**96. Inception Network ― This architecture uses inception modules and aims at trying different convolutions in order to increase its performance through feature diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
+
+&#10230;
+
+<br>
+
+
+**97. The Deep Learning cheatsheets are now available in [target language].**
+
+&#10230;
+
+<br>
+
+
+**98. Original authors**
+
+&#10230;
+
+<br>
+
+
+**99. Translated by X, Y and Z**
+
+&#10230;
+
+<br>
+
+
+**100. Reviewed by X, Y and Z**
+
+&#10230;
+
+<br>
+
+
+**101. View PDF version on GitHub**
+
+&#10230;
+
+<br>
+
+
+**102. By X and Y**
+
+&#10230;
+
+<br>
diff --git a/vi/deep-learning-tips-and-tricks.md b/vi/deep-learning-tips-and-tricks.md
new file mode 100644
index 000000000..347234ec2
--- /dev/null
+++ b/vi/deep-learning-tips-and-tricks.md
@@ -0,0 +1,457 @@
+**Deep Learning Tips and Tricks translation**
+
+<br>
+
+**1. Deep Learning Tips and Tricks cheatsheet**
+
+&#10230;
+
+<br>
+
+
+**2. CS 230 - Deep Learning**
+
+&#10230;
+
+<br>
+
+
+**3. Tips and tricks**
+
+&#10230;
+
+<br>
+
+
+**4. [Data processing, Data augmentation, Batch normalization]**
+
+&#10230;
+
+<br>
+
+
+**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
+
+&#10230;
+
+<br>
+
+
+**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
+
+&#10230;
+
+<br>
+
+
+**7. [Regularization, Dropout, Weight regularization, Early stopping]**
+
+&#10230;
+
+<br>
+
+
+**8. [Good practices, Overfitting small batch, Gradient checking]**
+
+&#10230;
+
+<br>
+
+
+**9. View PDF version on GitHub**
+
+&#10230;
+
+<br>
+
+
+**10. Data processing**
+
+&#10230;
+
+<br>
+
+
+**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
+
+&#10230;
+
+<br>
+
+
+**12. [Original, Flip, Rotation, Random crop]**
+
+&#10230;
+
+<br>
+
+
+**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
+
+&#10230;
+
+<br>
+
+
+**14. [Color shift, Noise addition, Information loss, Contrast change]**
+
+&#10230;
+
+<br>
+
+
+**15. [Nuances of RGB are slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposure due to time of day]**
+
+&#10230;
+
+<br>
+
+
+**16. Remark: data is usually augmented on the fly during training.**
+
+&#10230;
+
+<br>
+
+
+**17. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
+
+&#10230;
+
+<br>
+
+
+**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
+
+&#10230;
+
+<br>
+
+
+**19. Training a neural network**
+
+&#10230;
+
+<br>
+
+
+**20. Definitions**
+
+&#10230;
+
+<br>
+
+
+**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
+
+&#10230;
+
+<br>
+
+
+**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once, due to computational complexity, nor on a single data point, due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
+
+&#10230;
+
+<br>
+
+
+**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
+
+&#10230;
+
+<br>
+
+
+**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**25. Finding optimal weights**
+
+&#10230;
+
+<br>
+
+
+**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
+
+&#10230;
+
+<br>
+
+
+**27. Using this method, each weight is updated with the rule:**
+
+&#10230;
+
+<br>
+
+
+**28. Updating weights ― In a neural network, weights are updated as follows:**
+
+&#10230;
+
+<br>
+
+
+**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
+
+&#10230;
+
+<br>
+
+
+**30. [Forward propagation, Backpropagation, Weights update]**
+
+&#10230;
+
+<br>
+
+
+**31. Parameter tuning**
+
+&#10230;
+
+<br>
+
+
+**32. Weights initialization**
+
+&#10230;
+
+<br>
+
+
+**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.**
+
+&#10230;
+
+<br>
+
+
+**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage them towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
+
+&#10230;
+
+<br>
+
+
+**35. [Training size, Illustration, Explanation]**
+
+&#10230;
+
+<br>
+
+
+**36. [Small, Medium, Large]**
+
+&#10230;
+
+<br>
+
+
+**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
+
+&#10230;
+
+<br>
+
+
+**38. Optimizing convergence**
+
+&#10230;
+
+<br>
+
+
+**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.**
+
+
+&#10230;
+
+<br>
+
+
+**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**41. [Method, Explanation, Update of w, Update of b]**
+
+&#10230;
+
+<br>
+
+
+**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
+
+&#10230;
+
+<br>
+
+
+**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
+
+&#10230;
+
+<br>
+
+
+**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
+
+&#10230;
+
+<br>
+
+
+**45. Remark: other methods include Adadelta, Adagrad and SGD.**
+
+&#10230;
+
+<br>
+
+
+**46. Regularization**
+
+&#10230;
+
+<br>
+
+
+**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
+
+&#10230;
+
+<br>
+
+
+**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
+
+&#10230;
+
+<br>
+
+
+**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**50. [LASSO, Ridge, Elastic Net]**
+
+&#10230;
+
+<br>
+
+**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
+
+&#10230;
+
+<br>
+
+**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
+
+&#10230;
+
+<br>
+
+
+**52. [Error, Validation, Training, early stopping, Epochs]**
+
+&#10230;
+
+<br>
+
+
+**53. Good practices**
+
+&#10230;
+
+<br>
+
+
+**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
+
+&#10230;
+
+<br>
+
+
+**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
+
+&#10230;
+
+<br>
+
+
+**56. [Type, Numerical gradient, Analytical gradient]**
+
+&#10230;
+
+<br>
+
+
+**57. [Formula, Comments]**
+
+&#10230;
+
+<br>
+
+
+**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
+
+&#10230;
+
+<br>
+
+
+**59. ['Exact' result, Direct computation, Used in the final implementation]**
+
+&#10230;
+
+<br>
+
+
+**60. The Deep Learning cheatsheets are now available in [target language].**
+
+&#10230;
+
+
+**61. Original authors**
+
+&#10230;
+
+<br>
+
+**62. Translated by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**63. Reviewed by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**64. View PDF version on GitHub**
+
+&#10230;
+
+<br>
+
+**65. By X and Y**
+
+&#10230;
+
+<br>
diff --git a/vi/recurrent-neural-networks.md b/vi/recurrent-neural-networks.md
new file mode 100644
index 000000000..191e400a1
--- /dev/null
+++ b/vi/recurrent-neural-networks.md
@@ -0,0 +1,677 @@
+**Recurrent Neural Networks translation**
+
+<br>
+
+**1. Recurrent Neural Networks cheatsheet**
+
+&#10230;
+
+<br>
+
+
+**2. CS 230 - Deep Learning**
+
+&#10230;
+
+<br>
+
+
+**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
+
+&#10230;
+
+<br>
+
+
+**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
+
+&#10230;
+
+<br>
+
+
+**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
+
+&#10230;
+
+<br>
+
+
+**6. [Comparing words, Cosine similarity, t-SNE]**
+
+&#10230;
+
+<br>
+
+
+**7. [Language model, n-gram, Perplexity]**
+
+&#10230;
+
+<br>
+
+
+**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
+
+&#10230;
+
+<br>
+
+
+**9. [Attention, Attention model, Attention weights]**
+
+&#10230;
+
+<br>
+
+
+**10. Overview**
+
+&#10230;
+
+<br>
+
+
+**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
+
+&#10230;
+
+<br>
+
+
+**12. For each timestep t, the activation a<t> and the output y<t> are expressed as follows:**
+
+&#10230;
+
+<br>
+
+
+**13. and**
+
+&#10230;
+
+<br>
+
+
+**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
+
+&#10230;
+
+<br>
+
+
+**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
+
+&#10230;
+
+<br>
+
+
+**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
+
+&#10230;
+
+<br>
+
+
+**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**19. [Type of RNN, Illustration, Example]**
+
+&#10230;
+
+<br>
+
+
+**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
+
+&#10230;
+
+<br>
+
+
+**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
+
+&#10230;
+
+<br>
+
+
+**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
+
+&#10230;
+
+<br>
+
+
+**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
+
+&#10230;
+
+<br>
+
+
+**24. Handling long term dependencies**
+
+&#10230;
+
+<br>
+
+
+**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
+
+&#10230;
+
+<br>
+
+
+**26. [Sigmoid, Tanh, ReLU]**
+
+&#10230;
+
+<br>
+
+
+**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of a multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
+
+&#10230;
+
+<br>
+
+
+**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
+
+&#10230;
+
+<br>
+
+
+**29. clipped**
+
+&#10230;
+
+<br>
+
+
+**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
+
+&#10230;
+
+<br>
+
+
+**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**32. [Type of gate, Role, Used in]**
+
+&#10230;
+
+<br>
+
+
+**33. [Update gate, Relevance gate, Forget gate, Output gate]**
+
+&#10230;
+
+<br>
+
+
+**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
+
+&#10230;
+
+<br>
+
+
+**35. [LSTM, GRU]**
+
+&#10230;
+
+<br>
+
+
+**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
+
+&#10230;
+
+<br>
+
+
+**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
+
+&#10230;
+
+<br>
+
+
+**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
+
+&#10230;
+
+<br>
+
+
+**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
+
+&#10230;
+
+<br>
+
+
+**40. [Bidirectional (BRNN), Deep (DRNN)]**
+
+&#10230;
+
+<br>
+
+
+**41. Learning word representation**
+
+&#10230;
+
+<br>
+
+
+**42. In this section, we note V the vocabulary and |V| its size.**
+
+&#10230;
+
+<br>
+
+
+**43. Motivation and notations**
+
+&#10230;
+
+<br>
+
+
+**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+
+**45. [1-hot representation, Word embedding]**
+
+&#10230;
+
+<br>
+
+
+**46. [teddy bear, book, soft]**
+
+&#10230;
+
+<br>
+
+
+**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account word similarity]**
+
+&#10230;
+
+<br>
+
+
+**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
+
+&#10230;
+
+<br>
+
+
+**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
+
+&#10230;
+
+<br>
+
+
+**50. Word embeddings**
+
+&#10230;
+
+<br>
+
+
+**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
+
+&#10230;
+
+<br>
+
+
+**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
+
+&#10230;
+
+<br>
+
+
+**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
+
+&#10230;
+
+<br>
+
+
+**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
+
+&#10230;
+
+<br>
+
+
+**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
+
+&#10230;
+
+<br>
+
+
+**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how likely a given context word and a given target word are to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
+
+&#10230;
+
+<br>
+
+
+**57. Remark: this method is less computationally expensive than the skip-gram model.**
+
+&#10230;
+
+<br>
+
+
+**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
+
+&#10230;
+
+<br>
+
+
+**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
+Given the symmetric roles that e and θ play in this model, the final word embedding e(final)w is given by:**
+
+&#10230;
+
+<br>
+
+
+**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
+
+&#10230;
+
+<br>
+
+
+**60. Comparing words**
+
+&#10230;
+
+<br>
+
+
+**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
+
+&#10230;
+
+<br>
+
+
+**62. Remark: θ is the angle between words w1 and w2.**
+
+&#10230;
+
+<br>
+
+
+**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
+
+&#10230;
+
+<br>
+
+
+**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
+
+&#10230;
+
+<br>
+
+
+**65. Language model**
+
+&#10230;
+
+<br>
+
+
+**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
+
+&#10230;
+
+<br>
+
+
+**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting its number of appearances in the training data.**
+
+&#10230;
+
+<br>
+
+
+**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**69. Remark: PP is commonly used in t-SNE.**
+
+&#10230;
+
+<br>
+
+
+**70. Machine translation**
+
+&#10230;
+
+<br>
+
+
+**71. Overview ― A machine translation model is similar to a language model except it has an encoder network placed before. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
+
+&#10230;
+
+<br>
+
+
+**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
+
+&#10230;
+
+<br>
+
+
+**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
+
+&#10230;
+
+<br>
+
+
+**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
+
+&#10230;
+
+<br>
+
+
+**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory use. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
+
+&#10230;
+
+<br>
+
+
+**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
+
+&#10230;
+
+<br>
+
+
+**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
+
+&#10230;
+
+<br>
+
+
+**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
+
+&#10230;
+
+<br>
+
+
+**79. [Case, Root cause, Remedies]**
+
+&#10230;
+
+<br>
+
+
+**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
+
+&#10230;
+
+<br>
+
+
+**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**82. where pn is the bleu score on n-grams only, defined as follows:**
+
+&#10230;
+
+<br>
+
+
+**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
+
+&#10230;
+
+<br>
+
+
+**84. Attention**
+
+&#10230;
+
+<br>
+
+
+**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting α<t,t′> the amount of attention that the output y<t> should pay to the activation a<t′> and c<t> the context at time t, we have:**
+
+&#10230;
+
+<br>
+
+
+**86. with**
+
+&#10230;
+
+<br>
+
+
+**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
+
+&#10230;
+
+<br>
+
+
+**88. A cute teddy bear is reading Persian literature.**
+
+&#10230;
+
+<br>
+
+
+**89. Attention weight ― The amount of attention that the output y<t> should pay to the activation a<t′> is given by α<t,t′> computed as follows:**
+
+&#10230;
+
+<br>
+
+
+**90. Remark: computational complexity is quadratic with respect to Tx.**
+
+&#10230;
+
+<br>
+
+
+**91. The Deep Learning cheatsheets are now available in [target language].**
+
+&#10230;
+
+<br>
+
+**92. Original authors**
+
+&#10230;
+
+<br>
+
+**93. Translated by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**94. Reviewed by X, Y and Z**
+
+&#10230;
+
+<br>
+
+**95. View PDF version on GitHub**
+
+&#10230;
+
+<br>
+
+**96. By X and Y**
+
+&#10230;
+
+<br>
diff --git a/vi/refresher-linear-algebra.md b/vi/refresher-linear-algebra.md
new file mode 100644
index 000000000..a6b440d1e
--- /dev/null
+++ b/vi/refresher-linear-algebra.md
@@ -0,0 +1,339 @@
+**1. Linear Algebra and Calculus refresher**
+
+&#10230;
+
+<br>
+
+**2. General notations**
+
+&#10230;
+
+<br>
+
+**3. Definitions**
+
+&#10230;
+
+<br>
+
+**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
+
+&#10230;
+
+<br>
+
+**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
+
+&#10230;
+
+<br>
+
+**6. Remark: the vector x defined above can be viewed as an n×1 matrix and is more particularly called a column-vector.**
+
+&#10230;
+
+<br>
+
+**7. Main matrices**
+
+&#10230;
+
+<br>
+
+**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones on its diagonal and zeros everywhere else:**
+
+&#10230;
+
+<br>
+
+**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
+
+&#10230;
+
+<br>
+
+**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values on its diagonal and zeros everywhere else:**
+
+&#10230;
+
+<br>
+
+**11. Remark: we also note D as diag(d1,...,dn).**
+
+&#10230;
+
+<br>
+
+**12. Matrix operations**
+
+&#10230;
+
+<br>
+
+**13. Multiplication**
+
+&#10230;
+
+<br>
+
+**14. Vector-vector ― There are two types of vector-vector products:**
+
+&#10230;
+
+<br>
+
+**15. inner product: for x,y∈Rn, we have:**
+
+&#10230;
+
+<br>
+
+**16. outer product: for x∈Rm,y∈Rn, we have:**
+
+&#10230;
+
+<br>
+
+**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rm, such that:**
+
+&#10230;
+
+<br>
+
+**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
+
+&#10230;
+
+<br>
+
+**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rm×p, such that:**
+
+&#10230;
+
+<br>
+
+**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
+
+&#10230;
+
+<br>
+
+**21. Other operations**
+
+&#10230;
+
+<br>
+
+**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
+
+&#10230;
+
+<br>
+
+**23. Remark: for matrices A,B, we have (AB)T=BTAT**
+
+&#10230;
+
+<br>
+
+**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
+
+&#10230;
+
+<br>
+
+**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
+
+&#10230;
+
+<br>
+
+**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
+
+&#10230;
+
+<br>
+
+**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
+
+&#10230;
+
+<br>
+
+**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
+
+&#10230;
+
+<br>
+
+**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
+
+&#10230;
+
+<br>
+
+**30. Matrix properties**
+
+&#10230;
+
+<br>
+
+**31. Definitions**
+
+&#10230;
+
+<br>
+
+**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
+
+&#10230;
+
+<br>
+
+**33. [Symmetric, Antisymmetric]**
+
+&#10230;
+
+<br>
+
+**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
+
+&#10230;
+
+<br>
+
+**35. N(ax)=|a|N(x) for a scalar**
+
+&#10230;
+
+<br>
+
+**36. if N(x)=0, then x=0**
+
+&#10230;
+
+<br>
+
+**37. For x∈V, the most commonly used norms are summed up in the table below:**
+
+&#10230;
+
+<br>
+
+**38. [Norm, Notation, Definition, Use case]**
+
+&#10230;
+
+<br>
+
+**39. Linear dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
+
+&#10230;
+
+<br>
+
+**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
+
+&#10230;
+
+<br>
+
+**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
+
+&#10230;
+
+<br>
+
+**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
+
+&#10230;
+
+<br>
+
+**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies xTAx>0 for every non-zero vector x.**
+
+&#10230;
+
+<br>
+
+**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
+
+&#10230;
+
+<br>
+
+**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
+
+&#10230;
+
+<br>
+
+**46. diagonal**
+
+&#10230;
+
+<br>
+
+**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
+
+&#10230;
+
+<br>
+
+**48. Matrix calculus**
+
+&#10230;
+
+<br>
+
+**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is an m×n matrix, noted ∇Af(A), such that:**
+
+&#10230;
+
+<br>
+
+**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
+
+&#10230;
+
+<br>
+
+**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is an n×n symmetric matrix, noted ∇2xf(x), such that:**
+
+&#10230;
+
+<br>
+
+**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
+
+&#10230;
+
+<br>
+
+**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
+
+&#10230;
+
+<br>
+
+**54. [General notations, Definitions, Main matrices]**
+
+&#10230;
+
+<br>
+
+**55. [Matrix operations, Multiplication, Other operations]**
+
+&#10230;
+
+<br>
+
+**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
+
+&#10230;
+
+<br>
+
+**57. [Matrix calculus, Gradient, Hessian, Operations]**
+
+&#10230;
diff --git a/vi/refresher-probability.md b/vi/refresher-probability.md
new file mode 100644
index 000000000..5c9b34656
--- /dev/null
+++ b/vi/refresher-probability.md
@@ -0,0 +1,381 @@
+**1. Probabilities and Statistics refresher**
+
+&#10230;
+
+<br>
+
+**2. Introduction to Probability and Combinatorics**
+
+&#10230;
+
+<br>
+
+**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
+
+&#10230;
+
+<br>
+
+**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
+
+&#10230;
+
+<br>
+
+**5. Axioms of probability ― For each event E, we denote P(E) as the probability of event E occurring.**
+
+&#10230;
+
+<br>
+
+**6. Axiom 1 ― Every probability is between 0 and 1 inclusive, i.e.:**
+
+&#10230;
+
+<br>
+
+**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e.:**
+
+&#10230;
+
+<br>
+
+**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
+
+&#10230;
+
+<br>
+
+**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
+
+&#10230;
+
+<br>
+
+**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
+
+&#10230;
+
+<br>
+
+**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
+
+&#10230;
+
+<br>
+
+**12. Conditional Probability**
+
+&#10230;
+
+<br>
+
+**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
+
+&#10230;
+
+<br>
+
+**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
+
+&#10230;
+
+<br>
+
+**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
+
+&#10230;
+
+<br>
+
+**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
+
+&#10230;
+
+<br>
+
+**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
+
+&#10230;
+
+<br>
+
+**18. Independence ― Two events A and B are independent if and only if we have:**
+
+&#10230;
+
+<br>
+
+**19. Random Variables**
+
+&#10230;
+
+<br>
+
+**20. Definitions**
+
+&#10230;
+
+<br>
+
+**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to the real line.**
+
+&#10230;
+
+<br>
+
+**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
+
+&#10230;
+
+<br>
+
+**23. Remark: we have P(a<X⩽b)=F(b)−F(a).**
+
+&#10230;
+
+<br>
+
+**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
+
+&#10230;
+
+<br>
+
+**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
+
+&#10230;
+
+<br>
+
+**26. [Case, CDF F, PDF f, Properties of PDF]**
+
+&#10230;
+
+<br>
+
+**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
+
+&#10230;
+
+<br>
+
+**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
+
+&#10230;
+
+<br>
+
+**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
+
+&#10230;
+
+<br>
+
+**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
+
+&#10230;
+
+<br>
+
+**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
+
+&#10230;
+
+<br>
+
+**32. Probability Distributions**
+
+&#10230;
+
+<br>
+
+**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
+
+&#10230;
+
+<br>
+
+**34. Main distributions ― Here are the main distributions to have in mind:**
+
+&#10230;
+
+<br>
+
+**35. [Type, Distribution]**
+
+&#10230;
+
+<br>
+
+**36. Jointly Distributed Random Variables**
+
+&#10230;
+
+<br>
+
+**37. Marginal density and cumulative distribution ― From the joint probability density function fXY, we have:**
+
+&#10230;
+
+<br>
+
+**38. [Case, Marginal density, Cumulative function]**
+
+&#10230;
+
+<br>
+
+**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
+
+&#10230;
+
+<br>
+
+**40. Independence ― Two random variables X and Y are said to be independent if we have:**
+
+&#10230;
+
+<br>
+
+**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
+
+&#10230;
+
+<br>
+
+**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
+
+&#10230;
+
+<br>
+
+**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
+
+&#10230;
+
+<br>
+
+**44. Remark 2: If X and Y are independent, then ρXY=0.**
+
+&#10230;
+
+<br>
+
+**45. Parameter estimation**
+
+&#10230;
+
+<br>
+
+**46. Definitions**
+
+&#10230;
+
+<br>
+
+**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
+
+&#10230;
+
+<br>
+
+**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
+
+&#10230;
+
+<br>
+
+**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
+
+&#10230;
+
+<br>
+
+**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
+
+&#10230;
+
+<br>
+
+**51. Estimating the mean**
+
+&#10230;
+
+<br>
+
+**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**53. Remark: the sample mean is unbiased, i.e. E[¯¯¯¯¯X]=μ.**
+
+&#10230;
+
+<br>
+
+**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
+
+&#10230;
+
+<br>
+
+**55. Estimating the variance**
+
+&#10230;
+
+<br>
+
+**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
+
+&#10230;
+
+<br>
+
+**57. Remark: the sample variance is unbiased, i.e. E[s2]=σ2.**
+
+&#10230;
+
+<br>
+
+**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
+
+&#10230;
+
+<br>
+
+**59. [Introduction, Sample space, Event, Permutation]**
+
+&#10230;
+
+<br>
+
+**60. [Conditional probability, Bayes' rule, Independence]**
+
+&#10230;
+
+<br>
+
+**61. [Random variables, Definitions, Expectation, Variance]**
+
+&#10230;
+
+<br>
+
+**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
+
+&#10230;
+
+<br>
+
+**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
+
+&#10230;
+
+<br>
+
+**64. [Parameter estimation, Mean, Variance]**
+
+&#10230;

From 65b4de28126e8fade55503df6774a8c3d90a5957 Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Thu, 6 Jun 2019 22:58:37 +0900
Subject: [PATCH 02/11] vi translating for cheatsheet-deep-learning

---
 ...tsheet-machine-learning-tips-and-tricks.md | 285 -------
 vi/cheatsheet-supervised-learning.md          | 567 --------------
 vi/cheatsheet-unsupervised-learning.md        | 340 ---------
 vi/convolutional-neural-networks.md           | 716 ------------------
 vi/deep-learning-tips-and-tricks.md           | 457 -----------
 vi/recurrent-neural-networks.md               | 677 -----------------
 vi/refresher-linear-algebra.md                | 339 ---------
 vi/refresher-probability.md                   | 381 ----------
 8 files changed, 3762 deletions(-)
 delete mode 100644 vi/cheatsheet-machine-learning-tips-and-tricks.md
 delete mode 100644 vi/cheatsheet-supervised-learning.md
 delete mode 100644 vi/cheatsheet-unsupervised-learning.md
 delete mode 100644 vi/convolutional-neural-networks.md
 delete mode 100644 vi/deep-learning-tips-and-tricks.md
 delete mode 100644 vi/recurrent-neural-networks.md
 delete mode 100644 vi/refresher-linear-algebra.md
 delete mode 100644 vi/refresher-probability.md

diff --git a/vi/cheatsheet-machine-learning-tips-and-tricks.md b/vi/cheatsheet-machine-learning-tips-and-tricks.md
deleted file mode 100644
index 9712297b8..000000000
--- a/vi/cheatsheet-machine-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,285 +0,0 @@
-**1. Machine Learning tips and tricks cheatsheet**
-
-&#10230;
-
-<br>
-
-**2. Classification metrics**
-
-&#10230;
-
-<br>
-
-**3. In a context of a binary classification, here are the main metrics that are important to track in order to assess the performance of the model.**
-
-&#10230;
-
-<br>
-
-**4. Confusion matrix ― The confusion matrix is used to have a more complete picture when assessing the performance of a model. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-**5. [Predicted class, Actual class]**
-
-&#10230;
-
-<br>
-
-**6. Main metrics ― The following metrics are commonly used to assess the performance of classification models:**
-
-&#10230;
-
-<br>
-
-**7. [Metric, Formula, Interpretation]**
-
-&#10230;
-
-<br>
-
-**8. Overall performance of model**
-
-&#10230;
-
-<br>
-
-**9. How accurate the positive predictions are**
-
-&#10230;
-
-<br>
-
-**10. Coverage of actual positive sample**
-
-&#10230;
-
-<br>
-
-**11. Coverage of actual negative sample**
-
-&#10230;
-
-<br>
-
-**12. Hybrid metric useful for unbalanced classes**
-
-&#10230;
-
-<br>
-
-**13. ROC ― The receiver operating curve, also noted ROC, is the plot of TPR versus FPR by varying the threshold. These metrics are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**14. [Metric, Formula, Equivalent]**
-
-&#10230;
-
-<br>
-
-**15. AUC ― The area under the receiving operating curve, also noted AUC or AUROC, is the area below the ROC as shown in the following figure:**
-
-&#10230;
-
-<br>
-
-**16. [Actual, Predicted]**
-
-&#10230;
-
-<br>
-
-**17. Basic metrics ― Given a regression model f, the following metrics are commonly used to assess the performance of the model:**
-
-&#10230;
-
-<br>
-
-**18. [Total sum of squares, Explained sum of squares, Residual sum of squares]**
-
-&#10230;
-
-<br>
-
-**19. Coefficient of determination ― The coefficient of determination, often denoted R² or r², provides a measure of how well the observed outcomes are replicated by the model and is defined as follows:**
-
-&#10230;
-
-<br>
-
-**20. Main metrics ― The following metrics are commonly used to assess the performance of regression models, by taking into account the number of variables n that they take into consideration:**
-
-&#10230;
-
-<br>
-
-**21. where L is the likelihood and σ̂² is an estimate of the variance associated with each response.**
-
-&#10230;
-
-<br>
-
-**22. Model selection**
-
-&#10230;
-
-<br>
-
-**23. Vocabulary ― When selecting a model, we distinguish 3 different parts of the data that we have as follows:**
-
-&#10230;
-
-<br>
-
-**24. [Training set, Validation set, Testing set]**
-
-&#10230;
-
-<br>
-
-**25. [Model is trained, Model is assessed, Model gives predictions]**
-
-&#10230;
-
-<br>
-
-**26. [Usually 80% of the dataset, Usually 20% of the dataset]**
-
-&#10230;
-
-<br>
-
-**27. [Also called hold-out or development set, Unseen data]**
-
-&#10230;
-
-<br>
-
-**28. Once the model has been chosen, it is trained on the entire dataset and tested on the unseen test set. These are represented in the figure below:**
-
-&#10230;
-
-<br>
-
-**29. Cross-validation ― Cross-validation, also noted CV, is a method that is used to select a model that does not rely too much on the initial training set. The different types are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**30. [Training on k−1 folds and assessment on the remaining one, Training on n−p observations and assessment on the p remaining ones]**
-
-&#10230;
-
-<br>
-
-**31. [Generally k=5 or 10, Case p=1 is called leave-one-out]**
-
-&#10230;
-
-<br>
-
-**32. The most commonly used method is called k-fold cross-validation and splits the training data into k folds to validate the model on one fold while training the model on the k−1 other folds, all of this k times. The error is then averaged over the k folds and is named cross-validation error.**
-
-&#10230;
-
-<br>
-
-**33. Regularization ― The regularization procedure aims to keep the model from overfitting the data and thus deals with high variance issues. The following table sums up the different types of commonly used regularization techniques:**
-
-&#10230;
-
-<br>
-
-**34. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-&#10230;
-
-<br>
-
-**35. Diagnostics**
-
-&#10230;
-
-<br>
-
-**36. Bias ― The bias of a model is the difference between the expected prediction and the correct model that we try to predict for given data points.**
-
-&#10230;
-
-<br>
-
-**37. Variance ― The variance of a model is the variability of the model prediction for given data points.**
-
-&#10230;
-
-<br>
-
-**38. Bias/variance tradeoff ― The simpler the model, the higher the bias, and the more complex the model, the higher the variance.**
-
-&#10230;
-
-<br>
-
-**39. [Symptoms, Regression illustration, classification illustration, deep learning illustration, possible remedies]**
-
-&#10230;
-
-<br>
-
-**40. [High training error, Training error close to test error, High bias, Training error slightly lower than test error, Very low training error, Training error much lower than test error, High variance]**
-
-&#10230;
-
-<br>
-
-**41. [Complexify model, Add more features, Train longer, Perform regularization, Get more data]**
-
-&#10230;
-
-<br>
-
-**42. Error analysis ― Error analysis is analyzing the root cause of the difference in performance between the current and the perfect models.**
-
-&#10230;
-
-<br>
-
-**43. Ablative analysis ― Ablative analysis is analyzing the root cause of the difference in performance between the current and the baseline models.**
-
-&#10230;
-
-<br>
-
-**44. Regression metrics**
-
-&#10230;
-
-<br>
-
-**45. [Classification metrics, confusion matrix, accuracy, precision, recall, F1 score, ROC]**
-
-&#10230;
-
-<br>
-
-**46. [Regression metrics, R squared, Mallows's Cp, AIC, BIC]**
-
-&#10230;
-
-<br>
-
-**47. [Model selection, cross-validation, regularization]**
-
-&#10230;
-
-<br>
-
-**48. [Diagnostics, Bias/variance tradeoff, error/ablative analysis]**
-
-&#10230;
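
Note on items 8 to 12 of the file above: the template names the classification metrics but leaves their formulas to the rendered table. As a minimal sketch, assuming the standard textbook definitions in terms of the confusion-matrix counts TP, FP, FN and TN, they can be computed as follows. The function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the main binary-classification metrics from confusion-matrix counts.

    Assumes the standard definitions; every denominator must be non-zero
    (no guard is included in this sketch).
    """
    precision = tp / (tp + fp)   # how accurate the positive predictions are
    recall = tp / (tp + fn)      # coverage of actual positive samples
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),  # overall performance
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),                # coverage of actual negative samples
        "f1": 2 * precision * recall / (precision + recall),  # hybrid metric
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

The F1 score is the harmonic mean of precision and recall, which is why the template describes it as useful for unbalanced classes: it stays low unless both quantities are high.
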
diff --git a/vi/cheatsheet-supervised-learning.md b/vi/cheatsheet-supervised-learning.md
deleted file mode 100644
index a6b19ea1c..000000000
--- a/vi/cheatsheet-supervised-learning.md
+++ /dev/null
@@ -1,567 +0,0 @@
-**1. Supervised Learning cheatsheet**
-
-&#10230;
-
-<br>
-
-**2. Introduction to Supervised Learning**
-
-&#10230;
-
-<br>
-
-**3. Given a set of data points {x(1),...,x(m)} associated with a set of outcomes {y(1),...,y(m)}, we want to build a classifier that learns how to predict y from x.**
-
-&#10230;
-
-<br>
-
-**4. Type of prediction ― The different types of predictive models are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**5. [Regression, Classifier, Outcome, Examples]**
-
-&#10230;
-
-<br>
-
-**6. [Continuous, Class, Linear regression, Logistic regression, SVM, Naive Bayes]**
-
-&#10230;
-
-<br>
-
-**7. Type of model ― The different models are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**8. [Discriminative model, Generative model, Goal, What's learned, Illustration, Examples]**
-
-&#10230;
-
-<br>
-
-**9. [Directly estimate P(y|x), Estimate P(x|y) to then deduce P(y|x), Decision boundary, Probability distributions of the data, Regressions, SVMs, GDA, Naive Bayes]**
-
-&#10230;
-
-<br>
-
-**10. Notations and general concepts**
-
-&#10230;
-
-<br>
-
-**11. Hypothesis ― The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).**
-
-&#10230;
-
-<br>
-
-**12. Loss function ― A loss function is a function L:(z,y)∈R×Y⟼L(z,y)∈R that takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The common loss functions are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**13. [Least squared error, Logistic loss, Hinge loss, Cross-entropy]**
-
-&#10230;
-
-<br>
-
-**14. [Linear regression, Logistic regression, SVM, Neural Network]**
-
-&#10230;
-
-<br>
-
-**15. Cost function ― The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:**
-
-&#10230;
-
-<br>
-
-**16. Gradient descent ― By noting α∈R the learning rate, the update rule for gradient descent is expressed with the learning rate and the cost function J as follows:**
-
-&#10230;
-
-<br>
-
-**17. Remark: Stochastic gradient descent (SGD) updates the parameters based on each training example, whereas batch gradient descent updates them based on a batch of training examples.**
-
-&#10230;
-
-<br>
-
-**18. Likelihood ― The likelihood of a model L(θ) given parameters θ is used to find the optimal parameters θ through maximizing the likelihood. In practice, we use the log-likelihood ℓ(θ)=log(L(θ)) which is easier to optimize. We have:**
-
-&#10230;
-
-<br>
-
-**19. Newton's algorithm ― Newton's algorithm is a numerical method that finds θ such that ℓ′(θ)=0. Its update rule is as follows:**
-
-&#10230;
-
-<br>
-
-**20. Remark: the multidimensional generalization, also known as the Newton-Raphson method, has the following update rule:**
-
-&#10230;
-
-<br>
-
-**21. Linear models**
-
-&#10230;
-
-<br>
-
-**22. Linear regression**
-
-&#10230;
-
-<br>
-
-**23. We assume here that y|x;θ∼N(μ,σ2)**
-
-&#10230;
-
-<br>
-
-**24. Normal equations ― Denoting by X the design matrix, the value of θ that minimizes the cost function has a closed-form solution such that:**
-
-&#10230;
-
-<br>
-
-**25. LMS algorithm ― By noting α the learning rate, the update rule of the Least Mean Squares (LMS) algorithm for a training set of m data points, which is also known as the Widrow-Hoff learning rule, is as follows:**
-
-&#10230;
-
-<br>
-
-**26. Remark: the update rule is a particular case of the gradient ascent.**
-
-&#10230;
-
-<br>
-
-**27. LWR ― Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i)(x), which is defined with parameter τ∈R as:**
-
-&#10230;
-
-<br>
-
-**28. Classification and logistic regression**
-
-&#10230;
-
-<br>
-
-**29. Sigmoid function ― The sigmoid function g, also known as the logistic function, is defined as follows:**
-
-&#10230;
-
-<br>
-
-**30. Logistic regression ― We assume here that y|x;θ∼Bernoulli(ϕ). We have the following form:**
-
-&#10230;
-
-<br>
-
-**31. Remark: there is no closed form solution for the case of logistic regressions.**
-
-&#10230;
-
-<br>
-
-**32. Softmax regression ― A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. By convention, we set θK=0, which makes the Bernoulli parameter ϕi of each class i equal to:**
-
-&#10230;
-
-<br>
-
-**33. Generalized Linear Models**
-
-&#10230;
-
-<br>
-
-**34. Exponential family ― A class of distributions is said to be in the exponential family if it can be written in terms of a natural parameter, also called the canonical parameter or link function, η, a sufficient statistic T(y) and a log-partition function a(η) as follows:**
-
-&#10230;
-
-<br>
-
-**35. Remark: we will often have T(y)=y. Also, exp(−a(η)) can be seen as a normalization parameter that will make sure that the probabilities sum to one.**
-
-&#10230;
-
-<br>
-
-**36. The most common exponential family distributions are summed up in the following table:**
-
-&#10230;
-
-<br>
-
-**37. [Distribution, Bernoulli, Gaussian, Poisson, Geometric]**
-
-&#10230;
-
-<br>
-
-**38. Assumptions of GLMs ― Generalized Linear Models (GLM) aim at predicting a random variable y as a function of x∈R^(n+1) and rely on the following 3 assumptions:**
-
-&#10230;
-
-<br>
-
-**39. Remark: ordinary least squares and logistic regression are special cases of generalized linear models.**
-
-&#10230;
-
-<br>
-
-**40. Support Vector Machines**
-
-&#10230;
-
-<br>
-
-**41. The goal of support vector machines is to find the line that maximizes the minimum distance from the data points to that line.**
-
-&#10230;
-
-<br>
-
-**42. Optimal margin classifier ― The optimal margin classifier h is such that:**
-
-&#10230;
-
-<br>
-
-**43. where (w,b)∈R^n×R is the solution of the following optimization problem:**
-
-&#10230;
-
-<br>
-
-**44. such that**
-
-&#10230;
-
-<br>
-
-**45. support vectors**
-
-&#10230;
-
-<br>
-
-**46. Remark: the line is defined as w^T x − b = 0.**
-
-&#10230;
-
-<br>
-
-**47. Hinge loss ― The hinge loss is used in the setting of SVMs and is defined as follows:**
-
-&#10230;
-
-<br>
-
-**48. Kernel ― Given a feature mapping ϕ, we define the kernel K as:**
-
-&#10230;
-
-<br>
-
-**49. In practice, the kernel K defined by K(x,z)=exp(−||x−z||^2/(2σ^2)) is called the Gaussian kernel and is commonly used.**
-
-&#10230;
-
-<br>
-
-**50. [Non-linear separability, Use of a kernel mapping, Decision boundary in the original space]**
-
-&#10230;
-
-<br>
-
-**51. Remark: we say that we use the "kernel trick" to compute the cost function using the kernel because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x,z) are needed.**
-
-&#10230;
-
-<br>
-
-**52. Lagrangian ― We define the Lagrangian L(w,b) as follows:**
-
-&#10230;
-
-<br>
-
-**53. Remark: the coefficients βi are called the Lagrange multipliers.**
-
-&#10230;
-
-<br>
-
-**54. Generative Learning**
-
-&#10230;
-
-<br>
-
-**55. A generative model first tries to learn how the data is generated by estimating P(x|y), which we can then use to estimate P(y|x) by using Bayes' rule.**
-
-&#10230;
-
-<br>
-
-**56. Gaussian Discriminant Analysis**
-
-&#10230;
-
-<br>
-
-**57. Setting ― The Gaussian Discriminant Analysis assumes that y and x|y=0 and x|y=1 are such that:**
-
-&#10230;
-
-<br>
-
-**58. Estimation ― The following table sums up the estimates that we find when maximizing the likelihood:**
-
-&#10230;
-
-<br>
-
-**59. Naive Bayes**
-
-&#10230;
-
-<br>
-
-**60. Assumption ― The Naive Bayes model supposes that the features of each data point are all independent:**
-
-&#10230;
-
-<br>
-
-**61. Solutions ― Maximizing the log-likelihood gives the following solutions, with k∈{0,1},l∈[[1,L]]**
-
-&#10230;
-
-<br>
-
-**62. Remark: Naive Bayes is widely used for text classification and spam detection.**
-
-&#10230;
-
-<br>
-
-**63. Tree-based and ensemble methods**
-
-&#10230;
-
-<br>
-
-**64. These methods can be used for both regression and classification problems.**
-
-&#10230;
-
-<br>
-
-**65. CART ― Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary trees. They have the advantage of being very interpretable.**
-
-&#10230;
-
-<br>
-
-**66. Random forest ― It is a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm.**
-
-&#10230;
-
-<br>
-
-**67. Remark: random forests are a type of ensemble method.**
-
-&#10230;
-
-<br>
-
-**68. Boosting ― The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**69. [Adaptive boosting, Gradient boosting]**
-
-&#10230;
-
-<br>
-
-**70. High weights are put on errors to improve at the next boosting step**
-
-&#10230;
-
-<br>
-
-**71. Weak learners trained on remaining errors**
-
-&#10230;
-
-<br>
-
-**72. Other non-parametric approaches**
-
-&#10230;
-
-<br>
-
-**73. k-nearest neighbors ― The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings.**
-
-&#10230;
-
-<br>
-
-**74. Remark: The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance.**
-
-&#10230;
-
-<br>
-
-**75. Learning Theory**
-
-&#10230;
-
-<br>
-
-**76. Union bound ― Let A1,...,Ak be k events. We have:**
-
-&#10230;
-
-<br>
-
-**77. Hoeffding inequality ― Let Z1,...,Zm be m iid variables drawn from a Bernoulli distribution of parameter ϕ. Let ϕ̂ be their sample mean and γ>0 fixed. We have:**
-
-&#10230;
-
-<br>
-
-**78. Remark: this inequality is also known as the Chernoff bound.**
-
-&#10230;
-
-<br>
-
-**79. Training error ― For a given classifier h, we define the training error ϵ̂(h), also known as the empirical risk or empirical error, to be as follows:**
-
-&#10230;
-
-<br>
-
-**80. Probably Approximately Correct (PAC) ― PAC is a framework under which numerous results on learning theory were proved, and has the following set of assumptions:**
-
-&#10230;
-
-<br>
-
-**81. the training and testing sets follow the same distribution**
-
-&#10230;
-
-<br>
-
-**82. the training examples are drawn independently**
-
-&#10230;
-
-<br>
-
-**83. Shattering ― Given a set S={x(1),...,x(d)}, and a set of classifiers H, we say that H shatters S if for any set of labels {y(1),...,y(d)}, we have:**
-
-&#10230;
-
-<br>
-
-**84. Upper bound theorem ― Let H be a finite hypothesis class such that |H|=k and let δ and the sample size m be fixed. Then, with probability of at least 1−δ, we have:**
-
-&#10230;
-
-<br>
-
-**85. VC dimension ― The Vapnik-Chervonenkis (VC) dimension of a given infinite hypothesis class H, noted VC(H) is the size of the largest set that is shattered by H.**
-
-&#10230;
-
-<br>
-
-**86. Remark: the VC dimension of H={set of linear classifiers in 2 dimensions} is 3.**
-
-&#10230;
-
-<br>
-
-**87. Theorem (Vapnik) ― Let H be given, with VC(H)=d and m the number of training examples. With probability at least 1−δ, we have:**
-
-&#10230;
-
-<br>
-
-**88. [Introduction, Type of prediction, Type of model]**
-
-&#10230;
-
-<br>
-
-**89. [Notations and general concepts, loss function, gradient descent, likelihood]**
-
-&#10230;
-
-<br>
-
-**90. [Linear models, linear regression, logistic regression, generalized linear models]**
-
-&#10230;
-
-<br>
-
-**91. [Support vector machines, Optimal margin classifier, Hinge loss, Kernel]**
-
-&#10230;
-
-<br>
-
-**92. [Generative learning, Gaussian Discriminant Analysis, Naive Bayes]**
-
-&#10230;
-
-<br>
-
-**93. [Trees and ensemble methods, CART, Random forest, Boosting]**
-
-&#10230;
-
-<br>
-
-**94. [Other methods, k-NN]**
-
-&#10230;
-
-<br>
-
-**95. [Learning theory, Hoeffding inequality, PAC, VC dimension]**
-
-&#10230;
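
Note on items 16 and 25 of the file above: the gradient descent update rule moves θ against the gradient of the cost J, scaled by the learning rate α. The sketch below applies it to a one-feature linear regression with the least squared error loss. It is an illustration under stated assumptions, not the cheatsheet's own code: the function name `gradient_descent`, the toy data, the default α and step count are all hypothetical, and the gradient is averaged over the m examples (an assumption made here to keep the step size stable).

```python
def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent.

    Assumption: the cost is the mean of 1/2 * (h(x) - y)^2 over the m examples.
    """
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Prediction errors h(x_i) - y_i for the current parameters.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update rule: theta <- theta - alpha * gradient.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x.
theta0, theta1 = gradient_descent([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 5.0, 7.0, 9.0])
```

Using every example in each update makes this batch gradient descent; computing the gradient from a single example per step would give the stochastic variant mentioned in item 17.
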
diff --git a/vi/cheatsheet-unsupervised-learning.md b/vi/cheatsheet-unsupervised-learning.md
deleted file mode 100644
index 6daab3b21..000000000
--- a/vi/cheatsheet-unsupervised-learning.md
+++ /dev/null
@@ -1,340 +0,0 @@
-**1. Unsupervised Learning cheatsheet**
-
-&#10230;
-
-<br>
-
-**2. Introduction to Unsupervised Learning**
-
-&#10230;
-
-<br>
-
-**3. Motivation ― The goal of unsupervised learning is to find hidden patterns in unlabeled data {x(1),...,x(m)}.**
-
-&#10230;
-
-<br>
-
-**4. Jensen's inequality ― Let f be a convex function and X a random variable. We have the following inequality:**
-
-&#10230;
-
-<br>
-
-**5. Clustering**
-
-&#10230;
-
-<br>
-
-**6. Expectation-Maximization**
-
-&#10230;
-
-<br>
-
-**7. Latent variables ― Latent variables are hidden/unobserved variables that make estimation problems difficult, and are often denoted z. Here are the most common settings where there are latent variables:**
-
-&#10230;
-
-<br>
-
-**8. [Setting, Latent variable z, Comments]**
-
-&#10230;
-
-<br>
-
-**9. [Mixture of k Gaussians, Factor analysis]**
-
-&#10230;
-
-<br>
-
-**10. Algorithm ― The Expectation-Maximization (EM) algorithm gives an efficient method for estimating the parameter θ through maximum likelihood estimation by repeatedly constructing a lower bound on the likelihood (E-step) and optimizing that lower bound (M-step) as follows:**
-
-&#10230;
-
-<br>
-
-**11. E-step: Evaluate the posterior probability Qi(z(i)) that each data point x(i) came from a particular cluster z(i) as follows:**
-
-&#10230;
-
-<br>
-
-**12. M-step: Use the posterior probabilities Qi(z(i)) as cluster specific weights on data points x(i) to separately re-estimate each cluster model as follows:**
-
-&#10230;
-
-<br>
-
-**13. [Gaussians initialization, Expectation step, Maximization step, Convergence]**
-
-&#10230;
-
-<br>
-
-**14. k-means clustering**
-
-&#10230;
-
-<br>
-
-**15. We note c(i) the cluster of data point i and μj the center of cluster j.**
-
-&#10230;
-
-<br>
-
-**16. Algorithm ― After randomly initializing the cluster centroids μ1,μ2,...,μk∈Rn, the k-means algorithm repeats the following step until convergence:**
-
-&#10230;
-
-<br>
-
-**17. [Means initialization, Cluster assignment, Means update, Convergence]**
-
-&#10230;
-
-<br>
-
-**18. Distortion function ― In order to see if the algorithm converges, we look at the distortion function defined as follows:**
-
-&#10230;
-
-<br>
-
-**19. Hierarchical clustering**
-
-&#10230;
-
-<br>
-
-**20. Algorithm ― It is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner.**
-
-&#10230;
-
-<br>
-
-**21. Types ― There are different sorts of hierarchical clustering algorithms that aim at optimizing different objective functions, which are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**22. [Ward linkage, Average linkage, Complete linkage]**
-
-&#10230;
-
-<br>
-
-**23. [Minimize within cluster distance, Minimize average distance between cluster pairs, Minimize maximum distance between cluster pairs]**
-
-&#10230;
-
-<br>
-
-**24. Clustering assessment metrics**
-
-&#10230;
-
-<br>
-
-**25. In an unsupervised learning setting, it is often hard to assess the performance of a model since we don't have the ground truth labels as was the case in the supervised learning setting.**
-
-&#10230;
-
-<br>
-
-**26. Silhouette coefficient ― Denoting by a the mean distance between a sample and all other points in the same cluster, and by b the mean distance between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows:**
-
-&#10230;
-
-<br>
-
-**27. Calinski-Harabasz index ― By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matrices respectively defined as**
-
-&#10230;
-
-<br>
-
-**28. the Calinski-Harabasz index s(k) indicates how well a clustering model defines its clusters, such that the higher the score, the more dense and well separated the clusters are. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-**29. Dimension reduction**
-
-&#10230;
-
-<br>
-
-**30. Principal component analysis**
-
-&#10230;
-
-<br>
-
-**31. It is a dimension reduction technique that finds the variance maximizing directions onto which to project the data.**
-
-&#10230;
-
-<br>
-
-**32. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-&#10230;
-
-<br>
-
-**33. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-&#10230;
-
-<br>
-
-**34. diagonal**
-
-&#10230;
-
-<br>
-
-**35. Remark: the eigenvector associated with the largest eigenvalue is called principal eigenvector of matrix A.**
-
-&#10230;
-
-<br>
-
-**36. Algorithm ― The Principal Component Analysis (PCA) procedure is a dimension reduction technique that projects the data on k
-dimensions by maximizing the variance of the data as follows:**
-
-&#10230;
-
-<br>
-
-**37. Step 1: Normalize the data to have a mean of 0 and standard deviation of 1.**
-
-&#10230;
-
-<br>
-
-**38. Step 2: Compute Σ = (1/m) ∑_{i=1}^{m} x(i) x(i)^T ∈ R^(n×n), which is symmetric with real eigenvalues.**
-
-&#10230;
-
-<br>
-
-**39. Step 3: Compute u1,...,uk∈Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues.**
-
-&#10230;
-
-<br>
-
-**40. Step 4: Project the data on spanR(u1,...,uk).**
-
-&#10230;
-
-<br>
-
-**41. This procedure maximizes the variance among all k-dimensional spaces.**
-
-&#10230;
-
-<br>
-
-**42. [Data in feature space, Find principal components, Data in principal components space]**
-
-&#10230;
-
-<br>
-
-**43. Independent component analysis**
-
-&#10230;
-
-<br>
-
-**44. It is a technique meant to find the underlying generating sources.**
-
-&#10230;
-
-<br>
-
-**45. Assumptions ― We assume that our data x has been generated by the n-dimensional source vector s=(s1,...,sn), where si are independent random variables, via a mixing and non-singular matrix A as follows:**
-
-&#10230;
-
-<br>
-
-**46. The goal is to find the unmixing matrix W=A^(−1).**
-
-&#10230;
-
-<br>
-
-**47. Bell and Sejnowski ICA algorithm ― This algorithm finds the unmixing matrix W by following the steps below:**
-
-&#10230;
-
-<br>
-
-**48. Write the probability of x=As=W^(−1)s as:**
-
-&#10230;
-
-<br>
-
-**49. Write the log likelihood given our training data {x(i),i∈[[1,m]]} and by noting g the sigmoid function as:**
-
-&#10230;
-
-<br>
-
-**50. Therefore, the stochastic gradient ascent learning rule is such that for each training example x(i), we update W as follows:**
-
-&#10230;
-
-<br>
-
-**51. The Machine Learning cheatsheets are now available in [target language].**
-
-&#10230;
-
-<br>
-
-**52. Original authors**
-
-&#10230;
-
-<br>
-
-**53. Translated by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**54. Reviewed by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**55. [Introduction, Motivation, Jensen's inequality]**
-
-&#10230;
-
-<br>
-
-**56. [Clustering, Expectation-Maximization, k-means, Hierarchical clustering, Metrics]**
-
-&#10230;
-
-<br>
-
-**57. [Dimension reduction, PCA, ICA]**
-
-&#10230;
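
Note on items 16 and 17 of the file above: k-means alternates a cluster assignment step and a means update step until the centroids stop moving. The sketch below shows that loop on one-dimensional data. It makes several assumptions that are not in the template: the function name `k_means_1d` is hypothetical, the initial centroids are passed in explicitly instead of being drawn at random (so that the run is reproducible), and an empty cluster simply keeps its previous centroid.

```python
def k_means_1d(points, centroids, max_iters=100):
    """Run k-means on scalar points, starting from the given centroids."""
    centroids = list(centroids)
    assignments = []
    for _ in range(max_iters):
        # Cluster assignment: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        # Means update: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [x for x, c in zip(points, assignments) if c == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # Convergence: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments


# Hypothetical toy data with two well-separated groups.
centers, labels = k_means_1d([1.0, 1.2, 0.8, 5.0, 5.2, 4.8], centroids=[0.0, 6.0])
```

With random initialization, as the template describes, different runs can converge to different clusterings, which is one reason the distortion function of item 18 is tracked.
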
diff --git a/vi/convolutional-neural-networks.md b/vi/convolutional-neural-networks.md
deleted file mode 100644
index cb7e676ca..000000000
--- a/vi/convolutional-neural-networks.md
+++ /dev/null
@@ -1,716 +0,0 @@
-**Convolutional Neural Networks translation**
-
-<br>
-
-**1. Convolutional Neural Networks cheatsheet**
-
-&#10230; Convolutional Neural Networks cheatsheet
-
-<br>
-
-
-**2. CS 230 - Deep Learning**
-
-&#10230; CS 230 - Deep Learning
-
-<br>
-
-
-**3. [Overview, Architecture structure]**
-
-&#10230; [Tổng quan, Kiến trúc]
-
-<br>
-
-
-**4. [Types of layer, Convolution, Pooling, Fully connected]**
-
-&#10230; [Loại tầng (layer), Convolution (Tích chập), Pooling, Fully connected]
-
-<br>
-
-
-**5. [Filter hyperparameters, Dimensions, Stride, Padding]**
-
-&#10230;
-
-<br>
-
-
-**6. [Tuning hyperparameters, Parameter compatibility, Model complexity, Receptive field]**
-
-&#10230;
-
-<br>
-
-
-**7. [Activation functions, Rectified Linear Unit, Softmax]**
-
-&#10230;
-
-<br>
-
-
-**8. [Object detection, Types of models, Detection, Intersection over Union, Non-max suppression, YOLO, R-CNN]**
-
-&#10230;
-
-<br>
-
-
-**9. [Face verification/recognition, One shot learning, Siamese network, Triplet loss]**
-
-&#10230;
-
-<br>
-
-
-**10. [Neural style transfer, Activation, Style matrix, Style/content cost function]**
-
-&#10230;
-
-<br>
-
-
-**11. [Computational trick architectures, Generative Adversarial Net, ResNet, Inception Network]**
-
-&#10230;
-
-<br>
-
-
-**12. Overview**
-
-&#10230;
-
-<br>
-
-
-**13. Architecture of a traditional CNN ― Convolutional neural networks, also known as CNNs, are a specific type of neural networks that are generally composed of the following layers:**
-
-&#10230;
-
-<br>
-
-
-**14. The convolution layer and the pooling layer can be fine-tuned with respect to hyperparameters that are described in the next sections.**
-
-&#10230;
-
-<br>
-
-
-**15. Types of layer**
-
-&#10230;
-
-<br>
-
-
-**16. Convolution layer (CONV) ― The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input I with respect to its dimensions. Its hyperparameters include the filter size F and stride S. The resulting output O is called feature map or activation map.**
-
-&#10230;
-
-<br>
-
-
-**17. Remark: the convolution step can be generalized to the 1D and 3D cases as well.**
-
-&#10230;
-
-<br>
-
-
-**18. Pooling (POOL) ― The pooling layer (POOL) is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. In particular, max and average pooling are special kinds of pooling where the maximum and average values are taken, respectively.**
-
-&#10230;
-
-<br>
-
-
-**19. [Type, Purpose, Illustration, Comments]**
-
-&#10230;
-
-<br>
-
-
-**20. [Max pooling, Average pooling, Each pooling operation selects the maximum value of the current view, Each pooling operation averages the values of the current view]**
-
-&#10230;
-
-<br>
-
-
-**21. [Preserves detected features, Most commonly used, Downsamples feature map, Used in LeNet]**
-
-&#10230;
-
-<br>
-
-
-**22. Fully Connected (FC) ― The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. If present, FC layers are usually found towards the end of CNN architectures and can be used to optimize objectives such as class scores.**
-
-&#10230;
-
-<br>
-
-
-**23. Filter hyperparameters**
-
-&#10230;
-
-<br>
-
-
-**24. The convolution layer contains filters, and it is important to know the meaning behind their hyperparameters.**
-
-&#10230;
-
-<br>
-
-
-**25. Dimensions of a filter ― A filter of size F×F applied to an input containing C channels is an F×F×C volume that performs convolutions on an input of size I×I×C and produces an output feature map (also called activation map) of size O×O×1.**
-
-&#10230;
-
-<br>
-
-
-**26. Filter**
-
-&#10230;
-
-<br>
-
-
-**27. Remark: the application of K filters of size F×F results in an output feature map of size O×O×K.**
-
-&#10230;
-
-<br>
-
-
-**28. Stride ― For a convolutional or a pooling operation, the stride S denotes the number of pixels by which the window moves after each operation.**
-
-&#10230;
-
-<br>
-
-
-**29. Zero-padding ― Zero-padding denotes the process of adding P zeroes to each side of the boundaries of the input. This value can either be manually specified or automatically set through one of the three modes detailed below:**
-
-&#10230;
-
-<br>
-
-
-**30. [Mode, Value, Illustration, Purpose, Valid, Same, Full]**
-
-&#10230;
-
-<br>
-
-
-**31. [No padding, Drops last convolution if dimensions do not match, Padding such that the feature map has size ⌈I/S⌉, Output size is mathematically convenient, Also called 'half' padding, Maximum padding such that end convolutions are applied on the limits of the input, Filter 'sees' the input end-to-end]**
-
-&#10230;
-
-<br>
-
-
-**32. Tuning hyperparameters**
-
-&#10230;
-
-<br>
-
-
-**33. Parameter compatibility in convolution layer ― By noting I the length of the input volume size, F the length of the filter, P the amount of zero padding, S the stride, then the output size O of the feature map along that dimension is given by:**
-
-&#10230;
-
-<br>
-
-
-**34. [Input, Filter, Output]**
-
-&#10230;
-
-<br>
-
-
-**35. Remark: oftentimes, P_start=P_end≜P, in which case we can replace P_start+P_end by 2P in the formula above.**
-
-&#10230;
-
-<br>
-
-
-**36. Understanding the complexity of the model ― In order to assess the complexity of a model, it is often useful to determine the number of parameters that its architecture will have. In a given layer of a convolutional neural network, it is done as follows:**
-
-&#10230;
-
-<br>
-
-
-**37. [Illustration, Input size, Output size, Number of parameters, Remarks]**
-
-&#10230;
-
-<br>
-
-
-**38. [One bias parameter per filter, In most cases, S<F, A common choice for K is 2C]**
-
-&#10230;
-
-<br>
-
-
-**39. [Pooling operation done channel-wise, In most cases, S=F]**
-
-&#10230;
-
-<br>
-
-
-**40. [Input is flattened, One bias parameter per neuron, The number of FC neurons is free of structural constraints]**
-
-&#10230;
-
-<br>
-
-
-**41. Receptive field ― The receptive field at layer k is the area denoted Rk×Rk of the input that each pixel of the k-th activation map can 'see'. By calling Fj the filter size of layer j and Si the stride value of layer i and with the convention S0=1, the receptive field at layer k can be computed with the formula:**
-
-&#10230;
-
-<br>
-
-
-**42. In the example below, we have F_1=F_2=3 and S_1=S_2=1, which gives R_2=1+2⋅1+2⋅1=5.**
-
-&#10230;
-
-<br>
-
-
-**43. Commonly used activation functions**
-
-&#10230;
-
-<br>
-
-
-**44. Rectified Linear Unit ― The rectified linear unit layer (ReLU) is an activation function g that is used on all elements of the volume. It aims at introducing non-linearities to the network. Its variants are summarized in the table below:**
-
-&#10230;
-
-<br>
-
-
-**45. [ReLU, Leaky ReLU, ELU, with]**
-
-&#10230;
-
-<br>
-
-
-**46. [Non-linearity complexities biologically interpretable, Addresses dying ReLU issue for negative values, Differentiable everywhere]**
-
-&#10230;
-
-<br>
-
-
-**47. Softmax ― The softmax step can be seen as a generalized logistic function that takes as input a vector of scores x∈R^n and outputs a vector of probabilities p∈R^n through a softmax function at the end of the architecture. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**48. where**
-
-&#10230;
-
-<br>
-
-
-**49. Object detection**
-
-&#10230;
-
-<br>
-
-
-**50. Types of models ― There are 3 main types of object recognition algorithms, for which the nature of what is predicted is different. They are described in the table below:**
-
-&#10230;
-
-<br>
-
-
-**51. [Image classification, Classification w. localization, Detection]**
-
-&#10230;
-
-<br>
-
-
-**52. [Teddy bear, Book]**
-
-&#10230;
-
-<br>
-
-
-**53. [Classifies a picture, Predicts probability of object, Detects an object in a picture, Predicts probability of object and where it is located, Detects up to several objects in a picture, Predicts probabilities of objects and where they are located]**
-
-&#10230;
-
-<br>
-
-
-**54. [Traditional CNN, Simplified YOLO, R-CNN, YOLO, R-CNN]**
-
-&#10230;
-
-<br>
-
-
-**55. Detection ― In the context of object detection, different methods are used depending on whether we just want to locate the object or detect a more complex shape in the image. The two main ones are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**56. [Bounding box detection, Landmark detection]**
-
-&#10230;
-
-<br>
-
-
-**57. [Detects the part of the image where the object is located, Detects a shape or characteristics of an object (e.g. eyes), More granular]**
-
-&#10230;
-
-<br>
-
-
-**58. [Box of center (bx,by), height bh and width bw, Reference points (l1x,l1y), ..., (lnx,lny)]**
-
-&#10230;
-
-<br>
-
-
-**59. Intersection over Union ― Intersection over Union, also known as IoU, is a function that quantifies how correctly positioned a predicted bounding box Bp is over the actual bounding box Ba. It is defined as:**
-
-&#10230;
-
-<br>
-
-
-**60. Remark: we always have IoU∈[0,1]. By convention, a predicted bounding box Bp is considered as being reasonably good if IoU(Bp,Ba)⩾0.5.**
-
-&#10230;
-
-<br>
-
-
-**61. Anchor boxes ― Anchor boxing is a technique used to predict overlapping bounding boxes. In practice, the network is allowed to predict more than one box simultaneously, where each box prediction is constrained to have a given set of geometrical properties. For instance, the first prediction can potentially be a rectangular box of a given form, while the second will be another rectangular box of a different geometrical form.**
-
-&#10230;
-
-<br>
-
-
-**62. Non-max suppression ― The non-max suppression technique aims at removing duplicate overlapping bounding boxes of a same object by selecting the most representative ones. After having removed all boxes having a probability prediction lower than 0.6, the following steps are repeated while there are boxes remaining:**
-
-&#10230;
-
-<br>
-
-
-**63. [For a given class, Step 1: Pick the box with the largest prediction probability., Step 2: Discard any box having an IoU⩾0.5 with the previous box.]**
-
-&#10230;
-
-<br>
-
-
-**64. [Box predictions, Box selection of maximum probability, Overlap removal of same class, Final bounding boxes]**
-
-&#10230;
-
-<br>
-
-
-**65. YOLO ― You Only Look Once (YOLO) is an object detection algorithm that performs the following steps:**
-
-&#10230;
-
-<br>
-
-
-**66. [Step 1: Divide the input image into a G×G grid., Step 2: For each grid cell, run a CNN that predicts y of the following form:, repeated k times]**
-
-&#10230;
-
-<br>
-
-
-**67. where pc is the probability of detecting an object, bx,by,bh,bw are the properties of the detected bounding box, c1,...,cp is a one-hot representation of which of the p classes were detected, and k is the number of anchor boxes.**
-
-&#10230;
-
-<br>
-
-
-**68. Step 3: Run the non-max suppression algorithm to remove any potential duplicate overlapping bounding boxes.**
-
-&#10230;
-
-<br>
-
-
-**69. [Original image, Division in GxG grid, Bounding box prediction, Non-max suppression]**
-
-&#10230;
-
-<br>
-
-
-**70. Remark: when pc=0, then the network does not detect any object. In that case, the corresponding predictions bx,...,cp have to be ignored.**
-
-&#10230;
-
-<br>
-
-
-**71. R-CNN ― Region with Convolutional Neural Networks (R-CNN) is an object detection algorithm that first segments the image to find potentially relevant bounding boxes and then runs the detection algorithm to find the most probable objects in those bounding boxes.**
-
-&#10230;
-
-<br>
-
-
-**72. [Original image, Segmentation, Bounding box prediction, Non-max suppression]**
-
-&#10230;
-
-<br>
-
-
-**73. Remark: although the original algorithm is computationally expensive and slow, newer architectures enabled the algorithm to run faster, such as Fast R-CNN and Faster R-CNN.**
-
-&#10230;
-
-<br>
-
-
-**74. Face verification and recognition**
-
-&#10230;
-
-<br>
-
-
-**75. Types of models ― The two main types of models are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**76. [Face verification, Face recognition, Query, Reference, Database]**
-
-&#10230;
-
-<br>
-
-
-**77. [Is this the correct person?, One-to-one lookup, Is this one of the K persons in the database?, One-to-many lookup]**
-
-&#10230;
-
-<br>
-
-
-**78. One Shot Learning ― One Shot Learning is a face verification algorithm that uses a limited training set to learn a similarity function that quantifies how different two given images are. The similarity function applied to two images is often noted d(image 1,image 2).**
-
-&#10230;
-
-<br>
-
-
-**79. Siamese Network ― Siamese Networks aim at learning how to encode images to then quantify how different two images are. For a given input image x(i), the encoded output is often noted as f(x(i)).**
-
-&#10230;
-
-<br>
-
-
-**80. Triplet loss ― The triplet loss ℓ is a loss function computed on the embedding representation of a triplet of images A (anchor), P (positive) and N (negative). The anchor and the positive example belong to a same class, while the negative example to another one. By calling α∈R+ the margin parameter, this loss is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**81. Neural style transfer**
-
-&#10230;
-
-<br>
-
-
-**82. Motivation ― The goal of neural style transfer is to generate an image G based on a given content C and a given style S.**
-
-&#10230;
-
-<br>
-
-
-**83. [Content C, Style S, Generated image G]**
-
-&#10230;
-
-<br>
-
-
-**84. Activation ― In a given layer l, the activation is noted a[l] and is of dimensions nH×nw×nc**
-
-&#10230;
-
-<br>
-
-
-**85. Content cost function ― The content cost function Jcontent(C,G) is used to determine how the generated image G differs from the original content image C. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**86. Style matrix ― The style matrix G[l] of a given layer l is a Gram matrix where each of its elements G[l]kk′ quantifies how correlated the channels k and k′ are. It is defined with respect to activations a[l] as follows:**
-
-&#10230;
-
-<br>
-
-
-**87. Remark: the style matrix for the style image and the generated image are noted G[l] (S) and G[l] (G) respectively.**
-
-&#10230;
-
-<br>
-
-
-**88. Style cost function ― The style cost function Jstyle(S,G) is used to determine how the generated image G differs from the style S. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**89. Overall cost function ― The overall cost function is defined as being a combination of the content and style cost functions, weighted by parameters α,β, as follows:**
-
-&#10230;
-
-<br>
-
-
-**90. Remark: a higher value of α will make the model care more about the content while a higher value of β will make it care more about the style.**
-
-&#10230;
-
-<br>
-
-
-**91. Architectures using computational tricks**
-
-&#10230;
-
-<br>
-
-
-**92. Generative Adversarial Network ― Generative adversarial networks, also known as GANs, are composed of a generative and a discriminative model, where the generative model aims at generating the most truthful output, which is fed into the discriminative model, which in turn aims at differentiating the generated image from the true image.**
-
-&#10230;
-
-<br>
-
-
-**93. [Training, Noise, Real-world image, Generator, Discriminator, Real Fake]**
-
-&#10230;
-
-<br>
-
-
-**94. Remark: use cases using variants of GANs include text to image, music generation and synthesis.**
-
-&#10230;
-
-<br>
-
-
-**95. ResNet ― The Residual Network architecture (also called ResNet) uses residual blocks with a high number of layers meant to decrease the training error. The residual block has the following characterizing equation:**
-
-&#10230;
-
-<br>
-
-
-**96. Inception Network ― This architecture uses inception modules and aims at giving a try at different convolutions in order to increase its performance through features diversification. In particular, it uses the 1×1 convolution trick to limit the computational burden.**
-
-&#10230;
-
-<br>
-
-
-**97. The Deep Learning cheatsheets are now available in [target language].**
-
-&#10230;
-
-<br>
-
-
-**98. Original authors**
-
-&#10230;
-
-<br>
-
-
-**99. Translated by X, Y and Z**
-
-&#10230;
-
-<br>
-
-
-**100. Reviewed by X, Y and Z**
-
-&#10230;
-
-<br>
-
-
-**101. View PDF version on GitHub**
-
-&#10230;
-
-<br>
-
-
-**102. By X and Y**
-
-&#10230;
-
-<br>
diff --git a/vi/deep-learning-tips-and-tricks.md b/vi/deep-learning-tips-and-tricks.md
deleted file mode 100644
index 347234ec2..000000000
--- a/vi/deep-learning-tips-and-tricks.md
+++ /dev/null
@@ -1,457 +0,0 @@
-**Deep Learning Tips and Tricks translation**
-
-<br>
-
-**1. Deep Learning Tips and Tricks cheatsheet**
-
-&#10230;
-
-<br>
-
-
-**2. CS 230 - Deep Learning**
-
-&#10230;
-
-<br>
-
-
-**3. Tips and tricks**
-
-&#10230;
-
-<br>
-
-
-**4. [Data processing, Data augmentation, Batch normalization]**
-
-&#10230;
-
-<br>
-
-
-**5. [Training a neural network, Epoch, Mini-batch, Cross-entropy loss, Backpropagation, Gradient descent, Updating weights, Gradient checking]**
-
-&#10230;
-
-<br>
-
-
-**6. [Parameter tuning, Xavier initialization, Transfer learning, Learning rate, Adaptive learning rates]**
-
-&#10230;
-
-<br>
-
-
-**7. [Regularization, Dropout, Weight regularization, Early stopping]**
-
-&#10230;
-
-<br>
-
-
-**8. [Good practices, Overfitting small batch, Gradient checking]**
-
-&#10230;
-
-<br>
-
-
-**9. View PDF version on GitHub**
-
-&#10230;
-
-<br>
-
-
-**10. Data processing**
-
-&#10230;
-
-<br>
-
-
-**11. Data augmentation ― Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. The main ones are summed up in the table below. More precisely, given the following input image, here are the techniques that we can apply:**
-
-&#10230;
-
-<br>
-
-
-**12. [Original, Flip, Rotation, Random crop]**
-
-&#10230;
-
-<br>
-
-
-**13. [Image without any modification, Flipped with respect to an axis for which the meaning of the image is preserved, Rotation with a slight angle, Simulates incorrect horizon calibration, Random focus on one part of the image, Several random crops can be done in a row]**
-
-&#10230;
-
-<br>
-
-
-**14. [Color shift, Noise addition, Information loss, Contrast change]**
-
-&#10230;
-
-<br>
-
-
-**15. [Nuances of RGB is slightly changed, Captures noise that can occur with light exposure, Addition of noise, More tolerance to quality variation of inputs, Parts of image ignored, Mimics potential loss of parts of image, Luminosity changes, Controls difference in exposition due to time of day]**
-
-&#10230;
-
-<br>
-
-
-**16. Remark: data is usually augmented on the fly during training.**
-
-&#10230;
-
-<br>
-
-
-**17. Batch normalization ― It is a step with hyperparameters γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of the batch that we want to correct, it is done as follows:**
-
-&#10230;
-
-<br>
-
-
-**18. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
-
-&#10230;
-
-<br>
-
-
-**19. Training a neural network**
-
-&#10230;
-
-<br>
-
-
-**20. Definitions**
-
-&#10230;
-
-<br>
-
-
-**21. Epoch ― In the context of training a model, epoch is a term used to refer to one iteration where the model sees the whole training set to update its weights.**
-
-&#10230;
-
-<br>
-
-
-**22. Mini-batch gradient descent ― During the training phase, updating weights is usually not based on the whole training set at once due to computation complexities or one data point due to noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.**
-
-&#10230;
-
-<br>
-
-
-**23. Loss function ― In order to quantify how a given model performs, the loss function L is usually used to evaluate to what extent the actual outputs y are correctly predicted by the model outputs z.**
-
-&#10230;
-
-<br>
-
-
-**24. Cross-entropy loss ― In the context of binary classification in neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**25. Finding optimal weights**
-
-&#10230;
-
-<br>
-
-
-**26. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to each weight w is computed using the chain rule.**
-
-&#10230;
-
-<br>
-
-
-**27. Using this method, each weight is updated with the rule:**
-
-&#10230;
-
-<br>
-
-
-**28. Updating weights ― In a neural network, weights are updated as follows:**
-
-&#10230;
-
-<br>
-
-
-**29. [Step 1: Take a batch of training data and perform forward propagation to compute the loss, Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight, Step 3: Use the gradients to update the weights of the network.]**
-
-&#10230;
-
-<br>
-
-
-**30. [Forward propagation, Backpropagation, Weights update]**
-
-&#10230;
-
-<br>
-
-
-**31. Parameter tuning**
-
-&#10230;
-
-<br>
-
-
-**32. Weights initialization**
-
-&#10230;
-
-<br>
-
-
-**33. Xavier initialization ― Instead of initializing the weights in a purely random manner, Xavier initialization yields initial weights that take into account characteristics that are unique to the architecture.**
-
-&#10230;
-
-<br>
-
-
-**34. Transfer learning ― Training a deep learning model requires a lot of data and more importantly a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days/weeks to train, and leverage it towards our use case. Depending on how much data we have at hand, here are the different ways to leverage this:**
-
-&#10230;
-
-<br>
-
-
-**35. [Training size, Illustration, Explanation]**
-
-&#10230;
-
-<br>
-
-
-**36. [Small, Medium, Large]**
-
-&#10230;
-
-<br>
-
-
-**37. [Freezes all layers, trains weights on softmax, Freezes most layers, trains weights on last layers and softmax, Trains weights on layers and softmax by initializing weights on pre-trained ones]**
-
-&#10230;
-
-<br>
-
-
-**38. Optimizing convergence**
-
-&#10230;
-
-<br>
-
-
-**39. Learning rate ― The learning rate, often noted α or sometimes η, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method is called Adam, which is a method that adapts the learning rate.
-**
-
-&#10230;
-
-<br>
-
-
-**40. Adaptive learning rates ― Letting the learning rate vary when training a model can reduce the training time and improve the numerical optimal solution. While Adam optimizer is the most commonly used technique, others can also be useful. They are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**41. [Method, Explanation, Update of w, Update of b]**
-
-&#10230;
-
-<br>
-
-
-**42. [Momentum, Dampens oscillations, Improvement to SGD, 2 parameters to tune]**
-
-&#10230;
-
-<br>
-
-
-**43. [RMSprop, Root Mean Square propagation, Speeds up learning algorithm by controlling oscillations]**
-
-&#10230;
-
-<br>
-
-
-**44. [Adam, Adaptive Moment estimation, Most popular method, 4 parameters to tune]**
-
-&#10230;
-
-<br>
-
-
-**45. Remark: other methods include Adadelta, Adagrad and SGD.**
-
-&#10230;
-
-<br>
-
-
-**46. Regularization**
-
-&#10230;
-
-<br>
-
-
-**47. Dropout ― Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability p>0. It forces the model to avoid relying too much on particular sets of features.**
-
-&#10230;
-
-<br>
-
-
-**48. Remark: most deep learning frameworks parametrize dropout through the 'keep' parameter 1−p.**
-
-&#10230;
-
-<br>
-
-
-**49. Weight regularization ― In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**50. [LASSO, Ridge, Elastic Net]**
-
-&#10230;
-
-<br>
-
-**50 bis. [Shrinks coefficients to 0, Good for variable selection, Makes coefficients smaller, Tradeoff between variable selection and small coefficients]**
-
-&#10230;
-
-<br>
-
-**51. Early stopping ― This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.**
-
-&#10230;
-
-<br>
-
-
-**52. [Error, Validation, Training, early stopping, Epochs]**
-
-&#10230;
-
-<br>
-
-
-**53. Good practices**
-
-&#10230;
-
-<br>
-
-
-**54. Overfitting small batch ― When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a mini-batch is passed inside the network to see if it can overfit on it. If it cannot, it means that the model is either too complex or not complex enough to even overfit on a small batch, let alone a normal-sized training set.**
-
-&#10230;
-
-<br>
-
-
-**55. Gradient checking ― Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity-check for correctness.**
-
-&#10230;
-
-<br>
-
-
-**56. [Type, Numerical gradient, Analytical gradient]**
-
-&#10230;
-
-<br>
-
-
-**57. [Formula, Comments]**
-
-&#10230;
-
-<br>
-
-
-**58. [Expensive; loss has to be computed two times per dimension, Used to verify correctness of analytical implementation, Trade-off in choosing h not too small (numerical instability) nor too large (poor gradient approximation)]**
-
-&#10230;
-
-<br>
-
-
-**59. ['Exact' result, Direct computation, Used in the final implementation]**
-
-&#10230;
-
-<br>
-
-
-**60. The Deep Learning cheatsheets are now available in [target language].**
-
-&#10230;
-
-
-**61. Original authors**
-
-&#10230;
-
-<br>
-
-**62. Translated by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**63. Reviewed by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**64. View PDF version on GitHub**
-
-&#10230;
-
-<br>
-
-**65. By X and Y**
-
-&#10230;
-
-<br>
diff --git a/vi/recurrent-neural-networks.md b/vi/recurrent-neural-networks.md
deleted file mode 100644
index 191e400a1..000000000
--- a/vi/recurrent-neural-networks.md
+++ /dev/null
@@ -1,677 +0,0 @@
-**Recurrent Neural Networks translation**
-
-<br>
-
-**1. Recurrent Neural Networks cheatsheet**
-
-&#10230;
-
-<br>
-
-
-**2. CS 230 - Deep Learning**
-
-&#10230;
-
-<br>
-
-
-**3. [Overview, Architecture structure, Applications of RNNs, Loss function, Backpropagation]**
-
-&#10230;
-
-<br>
-
-
-**4. [Handling long term dependencies, Common activation functions, Vanishing/exploding gradient, Gradient clipping, GRU/LSTM, Types of gates, Bidirectional RNN, Deep RNN]**
-
-&#10230;
-
-<br>
-
-
-**5. [Learning word representation, Notations, Embedding matrix, Word2vec, Skip-gram, Negative sampling, GloVe]**
-
-&#10230;
-
-<br>
-
-
-**6. [Comparing words, Cosine similarity, t-SNE]**
-
-&#10230;
-
-<br>
-
-
-**7. [Language model, n-gram, Perplexity]**
-
-&#10230;
-
-<br>
-
-
-**8. [Machine translation, Beam search, Length normalization, Error analysis, Bleu score]**
-
-&#10230;
-
-<br>
-
-
-**9. [Attention, Attention model, Attention weights]**
-
-&#10230;
-
-<br>
-
-
-**10. Overview**
-
-&#10230;
-
-<br>
-
-
-**11. Architecture of a traditional RNN ― Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:**
-
-&#10230;
-
-<br>
-
-
-**12. For each timestep t, the activation a<t> and the output y<t> are expressed as follows:**
-
-&#10230;
-
-<br>
-
-
-**13. and**
-
-&#10230;
-
-<br>
-
-
-**14. where Wax,Waa,Wya,ba,by are coefficients that are shared temporally and g1,g2 activation functions.**
-
-&#10230;
-
-<br>
-
-
-**15. The pros and cons of a typical RNN architecture are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**16. [Advantages, Possibility of processing input of any length, Model size not increasing with size of input, Computation takes into account historical information, Weights are shared across time]**
-
-&#10230;
-
-<br>
-
-
-**17. [Drawbacks, Computation being slow, Difficulty of accessing information from a long time ago, Cannot consider any future input for the current state]**
-
-&#10230;
-
-<br>
-
-
-**18. Applications of RNNs ― RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**19. [Type of RNN, Illustration, Example]**
-
-&#10230;
-
-<br>
-
-
-**20. [One-to-one, One-to-many, Many-to-one, Many-to-many]**
-
-&#10230;
-
-<br>
-
-
-**21. [Traditional neural network, Music generation, Sentiment classification, Named entity recognition, Machine translation]**
-
-&#10230;
-
-<br>
-
-
-**22. Loss function ― In the case of a recurrent neural network, the loss function L of all time steps is defined based on the loss at every time step as follows:**
-
-&#10230;
-
-<br>
-
-
-**23. Backpropagation through time ― Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to weight matrix W is expressed as follows:**
-
-&#10230;
-
-<br>
-
-
-**24. Handling long term dependencies**
-
-&#10230;
-
-<br>
-
-
-**25. Commonly used activation functions ― The most common activation functions used in RNN modules are described below:**
-
-&#10230;
-
-<br>
-
-
-**26. [Sigmoid, Tanh, ReLU]**
-
-&#10230;
-
-<br>
-
-
-**27. Vanishing/exploding gradient ― The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.**
-
-&#10230;
-
-<br>
-
-
-**28. Gradient clipping ― It is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice.**
-
-&#10230;
-
-<br>
-
-
-**29. clipped**
-
-&#10230;
-
-<br>
-
-
-**30. Types of gates ― In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs and usually have a well-defined purpose. They are usually noted Γ and are equal to:**
-
-&#10230;
-
-<br>
-
-
-**31. where W,U,b are coefficients specific to the gate and σ is the sigmoid function. The main ones are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**32. [Type of gate, Role, Used in]**
-
-&#10230;
-
-<br>
-
-
-**33. [Update gate, Relevance gate, Forget gate, Output gate]**
-
-&#10230;
-
-<br>
-
-
-**34. [How much past should matter now?, Drop previous information?, Erase a cell or not?, How much to reveal of a cell?]**
-
-&#10230;
-
-<br>
-
-
-**35. [LSTM, GRU]**
-
-&#10230;
-
-<br>
-
-
-**36. GRU/LSTM ― Gated Recurrent Unit (GRU) and Long Short-Term Memory units (LSTM) deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Below is a table summing up the characterizing equations of each architecture:**
-
-&#10230;
-
-<br>
-
-
-**37. [Characterization, Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), Dependencies]**
-
-&#10230;
-
-<br>
-
-
-**38. Remark: the sign ⋆ denotes the element-wise multiplication between two vectors.**
-
-&#10230;
-
-<br>
-
-
-**39. Variants of RNNs ― The table below sums up the other commonly used RNN architectures:**
-
-&#10230;
-
-<br>
-
-
-**40. [Bidirectional (BRNN), Deep (DRNN)]**
-
-&#10230;
-
-<br>
-
-
-**41. Learning word representation**
-
-&#10230;
-
-<br>
-
-
-**42. In this section, we note V the vocabulary and |V| its size.**
-
-&#10230;
-
-<br>
-
-
-**43. Motivation and notations**
-
-&#10230;
-
-<br>
-
-
-**44. Representation techniques ― The two main ways of representing words are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-
-**45. [1-hot representation, Word embedding]**
-
-&#10230;
-
-<br>
-
-
-**46. [teddy bear, book, soft]**
-
-&#10230;
-
-<br>
-
-
-**47. [Noted ow, Naive approach, no similarity information, Noted ew, Takes into account words similarity]**
-
-&#10230;
-
-<br>
-
-
-**48. Embedding matrix ― For a given word w, the embedding matrix E is a matrix that maps its 1-hot representation ow to its embedding ew as follows:**
-
-&#10230;
-
-<br>
-
-
-**49. Remark: learning the embedding matrix can be done using target/context likelihood models.**
-
-&#10230;
-
-<br>
-
-
-**50. Word embeddings**
-
-&#10230;
-
-<br>
-
-
-**51. Word2vec ― Word2vec is a framework aimed at learning word embeddings by estimating the likelihood that a given word is surrounded by other words. Popular models include skip-gram, negative sampling and CBOW.**
-
-&#10230;
-
-<br>
-
-
-**52. [A cute teddy bear is reading, teddy bear, soft, Persian poetry, art]**
-
-&#10230;
-
-<br>
-
-
-**53. [Train network on proxy task, Extract high-level representation, Compute word embeddings]**
-
-&#10230;
-
-<br>
-
-
-**54. Skip-gram ― The skip-gram word2vec model is a supervised learning task that learns word embeddings by assessing the likelihood of any given target word t happening with a context word c. By noting θt a parameter associated with t, the probability P(t|c) is given by:**
-
-&#10230;
-
-<br>
-
-
-**55. Remark: summing over the whole vocabulary in the denominator of the softmax part makes this model computationally expensive. CBOW is another word2vec model using the surrounding words to predict a given word.**
-
-&#10230;
-
-<br>
-
-
-**56. Negative sampling ― It is a set of binary classifiers using logistic regressions that aim at assessing how likely a given context word and a given target word are to appear simultaneously, with the models being trained on sets of k negative examples and 1 positive example. Given a context word c and a target word t, the prediction is expressed by:**
-
-&#10230;
-
-<br>
-
-
-**57. Remark: this method is less computationally expensive than the skip-gram model.**
-
-&#10230;
-
-<br>
-
-
-**57bis. GloVe ― The GloVe model, short for global vectors for word representation, is a word embedding technique that uses a co-occurrence matrix X where each Xi,j denotes the number of times that a target i occurred with a context j. Its cost function J is as follows:**
-
-&#10230;
-
-<br>
-
-
-**58. where f is a weighting function such that Xi,j=0⟹f(Xi,j)=0.
-Given the symmetry that e and θ play in this model, the final word embedding e(final)w is given by:**
-
-&#10230;
-
-<br>
-
-
-**59. Remark: the individual components of the learned word embeddings are not necessarily interpretable.**
-
-&#10230;
-
-<br>
-
-
-**60. Comparing words**
-
-&#10230;
-
-<br>
-
-
-**61. Cosine similarity ― The cosine similarity between words w1 and w2 is expressed as follows:**
-
-&#10230;
-
-<br>
-
-
-**62. Remark: θ is the angle between words w1 and w2.**
-
-&#10230;
-
-<br>
-
-
-**63. t-SNE ― t-SNE (t-distributed Stochastic Neighbor Embedding) is a technique aimed at reducing high-dimensional embeddings into a lower dimensional space. In practice, it is commonly used to visualize word vectors in the 2D space.**
-
-&#10230;
-
-<br>
-
-
-**64. [literature, art, book, culture, poem, reading, knowledge, entertaining, loveable, childhood, kind, teddy bear, soft, hug, cute, adorable]**
-
-&#10230;
-
-<br>
-
-
-**65. Language model**
-
-&#10230;
-
-<br>
-
-
-**66. Overview ― A language model aims at estimating the probability of a sentence P(y).**
-
-&#10230;
-
-<br>
-
-
-**67. n-gram model ― This model is a naive approach aiming at quantifying the probability that an expression appears in a corpus by counting the number of times it appears in the training data.**
-
-&#10230;
-
-<br>
-
-
-**68. Perplexity ― Language models are commonly assessed using the perplexity metric, also known as PP, which can be interpreted as the inverse probability of the dataset normalized by the number of words T. The perplexity is such that the lower, the better and is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**69. Remark: PP is commonly used in t-SNE.**
-
-&#10230;
-
-<br>
-
-
-**70. Machine translation**
-
-&#10230;
-
-<br>
-
-
-**71. Overview ― A machine translation model is similar to a language model except that it has an encoder network placed before it. For this reason, it is sometimes referred to as a conditional language model. The goal is to find a sentence y such that:**
-
-&#10230;
-
-<br>
-
-
-**72. Beam search ― It is a heuristic search algorithm used in machine translation and speech recognition to find the likeliest sentence y given an input x.**
-
-&#10230;
-
-<br>
-
-
-**73. [Step 1: Find top B likely words y<1>, Step 2: Compute conditional probabilities y<k>|x,y<1>,...,y<k−1>, Step 3: Keep top B combinations x,y<1>,...,y<k>, End process at a stop word]**
-
-&#10230;
-
-<br>
-
-
-**74. Remark: if the beam width is set to 1, then this is equivalent to a naive greedy search.**
-
-&#10230;
-
-<br>
-
-
-**75. Beam width ― The beam width B is a parameter for beam search. Large values of B yield better results but with slower performance and increased memory usage. Small values of B lead to worse results but are less computationally intensive. A standard value for B is around 10.**
-
-&#10230;
-
-<br>
-
-
-**76. Length normalization ― In order to improve numerical stability, beam search is usually applied on the following normalized objective, often called the normalized log-likelihood objective, defined as:**
-
-&#10230;
-
-<br>
-
-
-**77. Remark: the parameter α can be seen as a softener, and its value is usually between 0.5 and 1.**
-
-&#10230;
-
-<br>
-
-
-**78. Error analysis ― When obtaining a predicted translation ˆy that is bad, one can wonder why we did not get a good translation y∗ by performing the following error analysis:**
-
-&#10230;
-
-<br>
-
-
-**79. [Case, Root cause, Remedies]**
-
-&#10230;
-
-<br>
-
-
-**80. [Beam search faulty, RNN faulty, Increase beam width, Try different architecture, Regularize, Get more data]**
-
-&#10230;
-
-<br>
-
-
-**81. Bleu score ― The bilingual evaluation understudy (bleu) score quantifies how good a machine translation is by computing a similarity score based on n-gram precision. It is defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**82. where pn is the bleu score on n-gram only defined as follows:**
-
-&#10230;
-
-<br>
-
-
-**83. Remark: a brevity penalty may be applied to short predicted translations to prevent an artificially inflated bleu score.**
-
-&#10230;
-
-<br>
-
-
-**84. Attention**
-
-&#10230;
-
-<br>
-
-
-**85. Attention model ― This model allows an RNN to pay attention to specific parts of the input that are considered important, which improves the performance of the resulting model in practice. By noting α<t,t′> the amount of attention that the output y<t> should pay to the activation a<t′> and c<t> the context at time t, we have:**
-
-&#10230;
-
-<br>
-
-
-**86. with**
-
-&#10230;
-
-<br>
-
-
-**87. Remark: the attention scores are commonly used in image captioning and machine translation.**
-
-&#10230;
-
-<br>
-
-
-**88. A cute teddy bear is reading Persian literature.**
-
-&#10230;
-
-<br>
-
-
-**89. Attention weight ― The amount of attention that the output y<t> should pay to the activation a<t′> is given by α<t,t′> computed as follows:**
-
-&#10230;
-
-<br>
-
-
-**90. Remark: computation complexity is quadratic with respect to Tx.**
-
-&#10230;
-
-<br>
-
-
-**91. The Deep Learning cheatsheets are now available in [target language].**
-
-&#10230;
-
-<br>
-
-**92. Original authors**
-
-&#10230;
-
-<br>
-
-**93. Translated by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**94. Reviewed by X, Y and Z**
-
-&#10230;
-
-<br>
-
-**95. View PDF version on GitHub**
-
-&#10230;
-
-<br>
-
-**96. By X and Y**
-
-&#10230;
-
-<br>
diff --git a/vi/refresher-linear-algebra.md b/vi/refresher-linear-algebra.md
deleted file mode 100644
index a6b440d1e..000000000
--- a/vi/refresher-linear-algebra.md
+++ /dev/null
@@ -1,339 +0,0 @@
-**1. Linear Algebra and Calculus refresher**
-
-&#10230;
-
-<br>
-
-**2. General notations**
-
-&#10230;
-
-<br>
-
-**3. Definitions**
-
-&#10230;
-
-<br>
-
-**4. Vector ― We note x∈Rn a vector with n entries, where xi∈R is the ith entry:**
-
-&#10230;
-
-<br>
-
-**5. Matrix ― We note A∈Rm×n a matrix with m rows and n columns, where Ai,j∈R is the entry located in the ith row and jth column:**
-
-&#10230;
-
-<br>
-
-**6. Remark: the vector x defined above can be viewed as a n×1 matrix and is more particularly called a column-vector.**
-
-&#10230;
-
-<br>
-
-**7. Main matrices**
-
-&#10230;
-
-<br>
-
-**8. Identity matrix ― The identity matrix I∈Rn×n is a square matrix with ones in its diagonal and zero everywhere else:**
-
-&#10230;
-
-<br>
-
-**9. Remark: for all matrices A∈Rn×n, we have A×I=I×A=A.**
-
-&#10230;
-
-<br>
-
-**10. Diagonal matrix ― A diagonal matrix D∈Rn×n is a square matrix with nonzero values in its diagonal and zero everywhere else:**
-
-&#10230;
-
-<br>
-
-**11. Remark: we also note D as diag(d1,...,dn).**
-
-&#10230;
-
-<br>
-
-**12. Matrix operations**
-
-&#10230;
-
-<br>
-
-**13. Multiplication**
-
-&#10230;
-
-<br>
-
-**14. Vector-vector ― There are two types of vector-vector products:**
-
-&#10230;
-
-<br>
-
-**15. inner product: for x,y∈Rn, we have:**
-
-&#10230;
-
-<br>
-
-**16. outer product: for x∈Rm,y∈Rn, we have:**
-
-&#10230;
-
-<br>
-
-**17. Matrix-vector ― The product of matrix A∈Rm×n and vector x∈Rn is a vector of size Rn, such that:**
-
-&#10230;
-
-<br>
-
-**18. where aTr,i are the vector rows and ac,j are the vector columns of A, and xi are the entries of x.**
-
-&#10230;
-
-<br>
-
-**19. Matrix-matrix ― The product of matrices A∈Rm×n and B∈Rn×p is a matrix of size Rn×p, such that:**
-
-&#10230;
-
-<br>
-
-**20. where aTr,i,bTr,i are the vector rows and ac,j,bc,j are the vector columns of A and B respectively**
-
-&#10230;
-
-<br>
-
-**21. Other operations**
-
-&#10230;
-
-<br>
-
-**22. Transpose ― The transpose of a matrix A∈Rm×n, noted AT, is such that its entries are flipped:**
-
-&#10230;
-
-<br>
-
-**23. Remark: for matrices A,B, we have (AB)T=BTAT**
-
-&#10230;
-
-<br>
-
-**24. Inverse ― The inverse of an invertible square matrix A is noted A−1 and is the only matrix such that:**
-
-&#10230;
-
-<br>
-
-**25. Remark: not all square matrices are invertible. Also, for matrices A,B, we have (AB)−1=B−1A−1**
-
-&#10230;
-
-<br>
-
-**26. Trace ― The trace of a square matrix A, noted tr(A), is the sum of its diagonal entries:**
-
-&#10230;
-
-<br>
-
-**27. Remark: for matrices A,B, we have tr(AT)=tr(A) and tr(AB)=tr(BA)**
-
-&#10230;
-
-<br>
-
-**28. Determinant ― The determinant of a square matrix A∈Rn×n, noted |A| or det(A) is expressed recursively in terms of A∖i,∖j, which is the matrix A without its ith row and jth column, as follows:**
-
-&#10230;
-
-<br>
-
-**29. Remark: A is invertible if and only if |A|≠0. Also, |AB|=|A||B| and |AT|=|A|.**
-
-&#10230;
-
-<br>
-
-**30. Matrix properties**
-
-&#10230;
-
-<br>
-
-**31. Definitions**
-
-&#10230;
-
-<br>
-
-**32. Symmetric decomposition ― A given matrix A can be expressed in terms of its symmetric and antisymmetric parts as follows:**
-
-&#10230;
-
-<br>
-
-**33. [Symmetric, Antisymmetric]**
-
-&#10230;
-
-<br>
-
-**34. Norm ― A norm is a function N:V⟶[0,+∞[ where V is a vector space, and such that for all x,y∈V, we have:**
-
-&#10230;
-
-<br>
-
-**35. N(ax)=|a|N(x) for a scalar**
-
-&#10230;
-
-<br>
-
-**36. if N(x)=0, then x=0**
-
-&#10230;
-
-<br>
-
-**37. For x∈V, the most commonly used norms are summed up in the table below:**
-
-&#10230;
-
-<br>
-
-**38. [Norm, Notation, Definition, Use case]**
-
-&#10230;
-
-<br>
-
-**39. Linearly dependence ― A set of vectors is said to be linearly dependent if one of the vectors in the set can be defined as a linear combination of the others.**
-
-&#10230;
-
-<br>
-
-**40. Remark: if no vector can be written this way, then the vectors are said to be linearly independent**
-
-&#10230;
-
-<br>
-
-**41. Matrix rank ― The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A.**
-
-&#10230;
-
-<br>
-
-**42. Positive semi-definite matrix ― A matrix A∈Rn×n is positive semi-definite (PSD) and is noted A⪰0 if we have:**
-
-&#10230;
-
-<br>
-
-**43. Remark: similarly, a matrix A is said to be positive definite, and is noted A≻0, if it is a PSD matrix which satisfies for all non-zero vector x, xTAx>0.**
-
-&#10230;
-
-<br>
-
-**44. Eigenvalue, eigenvector ― Given a matrix A∈Rn×n, λ is said to be an eigenvalue of A if there exists a vector z∈Rn∖{0}, called eigenvector, such that we have:**
-
-&#10230;
-
-<br>
-
-**45. Spectral theorem ― Let A∈Rn×n. If A is symmetric, then A is diagonalizable by a real orthogonal matrix U∈Rn×n. By noting Λ=diag(λ1,...,λn), we have:**
-
-&#10230;
-
-<br>
-
-**46. diagonal**
-
-&#10230;
-
-<br>
-
-**47. Singular-value decomposition ― For a given matrix A of dimensions m×n, the singular-value decomposition (SVD) is a factorization technique that guarantees the existence of U m×m unitary, Σ m×n diagonal and V n×n unitary matrices, such that:**
-
-&#10230;
-
-<br>
-
-**48. Matrix calculus**
-
-&#10230;
-
-<br>
-
-**49. Gradient ― Let f:Rm×n→R be a function and A∈Rm×n be a matrix. The gradient of f with respect to A is a m×n matrix, noted ∇Af(A), such that:**
-
-&#10230;
-
-<br>
-
-**50. Remark: the gradient of f is only defined when f is a function that returns a scalar.**
-
-&#10230;
-
-<br>
-
-**51. Hessian ― Let f:Rn→R be a function and x∈Rn be a vector. The hessian of f with respect to x is a n×n symmetric matrix, noted ∇2xf(x), such that:**
-
-&#10230;
-
-<br>
-
-**52. Remark: the hessian of f is only defined when f is a function that returns a scalar**
-
-&#10230;
-
-<br>
-
-**53. Gradient operations ― For matrices A,B,C, the following gradient properties are worth having in mind:**
-
-&#10230;
-
-<br>
-
-**54. [General notations, Definitions, Main matrices]**
-
-&#10230;
-
-<br>
-
-**55. [Matrix operations, Multiplication, Other operations]**
-
-&#10230;
-
-<br>
-
-**56. [Matrix properties, Norm, Eigenvalue/Eigenvector, Singular-value decomposition]**
-
-&#10230;
-
-<br>
-
-**57. [Matrix calculus, Gradient, Hessian, Operations]**
-
-&#10230;
diff --git a/vi/refresher-probability.md b/vi/refresher-probability.md
deleted file mode 100644
index 5c9b34656..000000000
--- a/vi/refresher-probability.md
+++ /dev/null
@@ -1,381 +0,0 @@
-**1. Probabilities and Statistics refresher**
-
-&#10230;
-
-<br>
-
-**2. Introduction to Probability and Combinatorics**
-
-&#10230;
-
-<br>
-
-**3. Sample space ― The set of all possible outcomes of an experiment is known as the sample space of the experiment and is denoted by S.**
-
-&#10230;
-
-<br>
-
-**4. Event ― Any subset E of the sample space is known as an event. That is, an event is a set consisting of possible outcomes of the experiment. If the outcome of the experiment is contained in E, then we say that E has occurred.**
-
-&#10230;
-
-<br>
-
-**5. Axioms of probability For each event E, we denote P(E) as the probability of event E occuring.**
-
-&#10230;
-
-<br>
-
-**6. Axiom 1 ― Every probability is between 0 and 1 included, i.e:**
-
-&#10230;
-
-<br>
-
-**7. Axiom 2 ― The probability that at least one of the elementary events in the entire sample space will occur is 1, i.e:**
-
-&#10230;
-
-<br>
-
-**8. Axiom 3 ― For any sequence of mutually exclusive events E1,...,En, we have:**
-
-&#10230;
-
-<br>
-
-**9. Permutation ― A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P(n,r), defined as:**
-
-&#10230;
-
-<br>
-
-**10. Combination ― A combination is an arrangement of r objects from a pool of n objects, where the order does not matter. The number of such arrangements is given by C(n,r), defined as:**
-
-&#10230;
-
-<br>
-
-**11. Remark: we note that for 0⩽r⩽n, we have P(n,r)⩾C(n,r)**
-
-&#10230;
-
-<br>
-
-**12. Conditional Probability**
-
-&#10230;
-
-<br>
-
-**13. Bayes' rule ― For events A and B such that P(B)>0, we have:**
-
-&#10230;
-
-<br>
-
-**14. Remark: we have P(A∩B)=P(A)P(B|A)=P(A|B)P(B)**
-
-&#10230;
-
-<br>
-
-**15. Partition ― Let {Ai,i∈[[1,n]]} be such that for all i, Ai≠∅. We say that {Ai} is a partition if we have:**
-
-&#10230;
-
-<br>
-
-**16. Remark: for any event B in the sample space, we have P(B)=n∑i=1P(B|Ai)P(Ai).**
-
-&#10230;
-
-<br>
-
-**17. Extended form of Bayes' rule ― Let {Ai,i∈[[1,n]]} be a partition of the sample space. We have:**
-
-&#10230;
-
-<br>
-
-**18. Independence ― Two events A and B are independent if and only if we have:**
-
-&#10230;
-
-<br>
-
-**19. Random Variables**
-
-&#10230;
-
-<br>
-
-**20. Definitions**
-
-&#10230;
-
-<br>
-
-**21. Random variable ― A random variable, often noted X, is a function that maps every element in a sample space to a real line.**
-
-&#10230;
-
-<br>
-
-**22. Cumulative distribution function (CDF) ― The cumulative distribution function F, which is monotonically non-decreasing and is such that limx→−∞F(x)=0 and limx→+∞F(x)=1, is defined as:**
-
-&#10230;
-
-<br>
-
-**23. Remark: we have P(a<X⩽B)=F(b)−F(a).**
-
-&#10230;
-
-<br>
-
-**24. Probability density function (PDF) ― The probability density function f is the probability that X takes on values between two adjacent realizations of the random variable.**
-
-&#10230;
-
-<br>
-
-**25. Relationships involving the PDF and CDF ― Here are the important properties to know in the discrete (D) and the continuous (C) cases.**
-
-&#10230;
-
-<br>
-
-**26. [Case, CDF F, PDF f, Properties of PDF]**
-
-&#10230;
-
-<br>
-
-**27. Expectation and Moments of the Distribution ― Here are the expressions of the expected value E[X], generalized expected value E[g(X)], kth moment E[Xk] and characteristic function ψ(ω) for the discrete and continuous cases:**
-
-&#10230;
-
-<br>
-
-**28. Variance ― The variance of a random variable, often noted Var(X) or σ2, is a measure of the spread of its distribution function. It is determined as follows:**
-
-&#10230;
-
-<br>
-
-**29. Standard deviation ― The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:**
-
-&#10230;
-
-<br>
-
-**30. Transformation of random variables ― Let the variables X and Y be linked by some function. By noting fX and fY the distribution function of X and Y respectively, we have:**
-
-&#10230;
-
-<br>
-
-**31. Leibniz integral rule ― Let g be a function of x and potentially c, and a,b boundaries that may depend on c. We have:**
-
-&#10230;
-
-<br>
-
-**32. Probability Distributions**
-
-&#10230;
-
-<br>
-
-**33. Chebyshev's inequality ― Let X be a random variable with expected value μ. For k,σ>0, we have the following inequality:**
-
-&#10230;
-
-<br>
-
-**34. Main distributions ― Here are the main distributions to have in mind:**
-
-&#10230;
-
-<br>
-
-**35. [Type, Distribution]**
-
-&#10230;
-
-<br>
-
-**36. Jointly Distributed Random Variables**
-
-&#10230;
-
-<br>
-
-**37. Marginal density and cumulative distribution ― From the joint density probability function fXY , we have**
-
-&#10230;
-
-<br>
-
-**38. [Case, Marginal density, Cumulative function]**
-
-&#10230;
-
-<br>
-
-**39. Conditional density ― The conditional density of X with respect to Y, often noted fX|Y, is defined as follows:**
-
-&#10230;
-
-<br>
-
-**40. Independence ― Two random variables X and Y are said to be independent if we have:**
-
-&#10230;
-
-<br>
-
-**41. Covariance ― We define the covariance of two random variables X and Y, that we note σ2XY or more commonly Cov(X,Y), as follows:**
-
-&#10230;
-
-<br>
-
-**42. Correlation ― By noting σX,σY the standard deviations of X and Y, we define the correlation between the random variables X and Y, noted ρXY, as follows:**
-
-&#10230;
-
-<br>
-
-**43. Remark 1: we note that for any random variables X,Y, we have ρXY∈[−1,1].**
-
-&#10230;
-
-<br>
-
-**44. Remark 2: If X and Y are independent, then ρXY=0.**
-
-&#10230;
-
-<br>
-
-**45. Parameter estimation**
-
-&#10230;
-
-<br>
-
-**46. Definitions**
-
-&#10230;
-
-<br>
-
-**47. Random sample ― A random sample is a collection of n random variables X1,...,Xn that are independent and identically distributed with X.**
-
-&#10230;
-
-<br>
-
-**48. Estimator ― An estimator is a function of the data that is used to infer the value of an unknown parameter in a statistical model.**
-
-&#10230;
-
-<br>
-
-**49. Bias ― The bias of an estimator ^θ is defined as being the difference between the expected value of the distribution of ^θ and the true value, i.e.:**
-
-&#10230;
-
-<br>
-
-**50. Remark: an estimator is said to be unbiased when we have E[^θ]=θ.**
-
-&#10230;
-
-<br>
-
-**51. Estimating the mean**
-
-&#10230;
-
-<br>
-
-**52. Sample mean ― The sample mean of a random sample is used to estimate the true mean μ of a distribution, is often noted ¯¯¯¯¯X and is defined as follows:**
-
-&#10230;
-
-<br>
-
-**53. Remark: the sample mean is unbiased, i.e E[¯¯¯¯¯X]=μ.**
-
-&#10230;
-
-<br>
-
-**54. Central Limit Theorem ― Let us have a random sample X1,...,Xn following a given distribution with mean μ and variance σ2, then we have:**
-
-&#10230;
-
-<br>
-
-**55. Estimating the variance**
-
-&#10230;
-
-<br>
-
-**56. Sample variance ― The sample variance of a random sample is used to estimate the true variance σ2 of a distribution, is often noted s2 or ^σ2 and is defined as follows:**
-
-&#10230;
-
-<br>
-
-**57. Remark: the sample variance is unbiased, i.e E[s2]=σ2.**
-
-&#10230;
-
-<br>
-
-**58. Chi-Squared relation with sample variance ― Let s2 be the sample variance of a random sample. We have:**
-
-&#10230;
-
-<br>
-
-**59. [Introduction, Sample space, Event, Permutation]**
-
-&#10230;
-
-<br>
-
-**60. [Conditional probability, Bayes' rule, Independence]**
-
-&#10230;
-
-<br>
-
-**61. [Random variables, Definitions, Expectation, Variance]**
-
-&#10230;
-
-<br>
-
-**62. [Probability distributions, Chebyshev's inequality, Main distributions]**
-
-&#10230;
-
-<br>
-
-**63. [Jointly distributed random variables, Density, Covariance, Correlation]**
-
-&#10230;
-
-<br>
-
-**64. [Parameter estimation, Mean, Variance]**
-
-&#10230;

From d4d48e1c9da305140b36564118edb5e92025292b Mon Sep 17 00:00:00 2001
From: tt-anh-eole <tt-anh@eole.co.jp>
Date: Fri, 7 Jun 2019 18:04:17 +0900
Subject: [PATCH 03/11] vi translating

---
 vi/cheatsheet-deep-learning.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index ff3b3c508..f0d8b477d 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -270,7 +270,7 @@
 
 **46. Maximum likelihood estimate ― The maximum likelihood estimates for the state transition probabilities are as follows:**
 
-&#10230; 
+&#10230; Ước lượng khả năng tối đa (Maximum likelihood estimate) - Ước lượng khả năng tối đa cho xác suất chuyển tiếp trạng thái (state) sẽ như sau:
 
 <br>
 
@@ -288,31 +288,31 @@
 
 **49. Q-learning ― Q-learning is a model-free estimation of Q, which is done as follows:**
 
-&#10230;
+&#10230; Q-learning ― Q-learning là 1 dạng ước lượng phi mô hình (model-free) của Q, được thực hiện như sau:
 
 <br>
 
 **50. View PDF version on GitHub**
 
-&#10230;
+&#10230; Xem bản PDF trên GitHub
 
 <br>
 
 **51. [Neural Networks, Architecture, Activation function, Backpropagation, Dropout]**
 
-&#10230;
+&#10230; [Mạng neural, Kiến trúc, Hàm kích hoạt, Lan truyền ngược, Dropout]
 
 <br>
 
 **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
 
-&#10230;
+&#10230; [Mạng neural tích chập, Tầng chập, Chuẩn hoá lô (batch)]
 
 <br>
 
 **53. [Recurrent Neural Networks, Gates, LSTM]**
 
-&#10230;
+&#10230; [Mạng neural hồi quy, Gates, LSTM]
 
 <br>
 

From 06f1ad415ee02a22e571529700c362b66d490267 Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Fri, 7 Jun 2019 22:19:34 +0900
Subject: [PATCH 04/11] vi translating for cheatsheet-deep-learning

---
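Reviewer note (not part of the commit): items 47 and 48 are the numerator and denominator of the maximum likelihood estimate in item 46, so "times" there means a number of occurrences, not a duration. A minimal sketch of that count-based estimate, assuming experience is logged as `(s, a, s_next)` triples; the function name, the sample data and the state/action labels are hypothetical and only illustrate the idea:

```python
from collections import Counter


def estimate_transitions(experience):
    """Count-based estimate of P(s_next | s, a) from (s, a, s_next) triples."""
    taken = Counter()    # times took action a in state s
    reached = Counter()  # times took action a in state s and got to s_next
    for s, a, s_next in experience:
        taken[(s, a)] += 1
        reached[(s, a, s_next)] += 1
    # Ratio of the two counts for every observed (s, a, s_next).
    return {
        (s, a, s_next): n / taken[(s, a)]
        for (s, a, s_next), n in reached.items()
    }


# Hypothetical log of four transitions.
experience = [
    ("s0", "go", "s1"),
    ("s0", "go", "s1"),
    ("s0", "go", "s2"),
    ("s1", "stay", "s1"),
]
P_sa = estimate_transitions(experience)
```

Pairs `(s, a)` that never occur in the log get no estimate in this sketch; how to handle them (for example a uniform default) is a separate design choice.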
 vi/cheatsheet-deep-learning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index f0d8b477d..0a795fcf8 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -276,13 +276,13 @@
 
 **47. times took action a in state s and got to s′**
 
-&#10230;
+&#10230; số lần thực hiện hành động a ở trạng thái (state) s và chuyển tới s′
 
 <br>
 
 **48. times took action a in state s**
 
-&#10230;
+&#10230; số lần thực hiện hành động a ở trạng thái (state) s
 
 <br>
 
@@ -318,4 +318,4 @@
 
 **54. [Reinforcement learning, Markov decision processes, Value/policy iteration, Approximate dynamic programming, Policy search]**
 
-&#10230;
+&#10230; [Học tăng cường (Reinforcement learning), Tiến trình quyết định Markov, Lặp Giá trị/policy, Lập trình động xấp xỉ, Tìm kiếm Policy]

From b8d82ffaa6b6faaf7823772b255b51c04408d633 Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Sat, 8 Jun 2019 23:28:35 +0900
Subject: [PATCH 05/11] vi translating for cheatsheet-deep-learning

---
 vi/cheatsheet-deep-learning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index 0a795fcf8..da05af89b 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -108,7 +108,7 @@
 
 **19. Dropout ― Dropout is a technique meant at preventing overfitting the training data by dropping out units in a neural network. In practice, neurons are either dropped with probability p or kept with probability 1−p**
 
-&#10230; Dropout - Dropout là thuật ngữ kĩ thuật dùng trong việc tránh overfitting tập dữ liệu huấn luyện
+&#10230; Dropout - Dropout là kĩ thuật nhằm tránh overfitting tập dữ liệu huấn luyện bằng việc bỏ đi các đơn vị trong mạng neural. Trong thực tế, các neuron hoặc bị bỏ đi với xác suất p, hoặc được giữ lại với xác suất 1-p
 
 <br>
 
@@ -174,7 +174,7 @@
 
 **30. The goal of reinforcement learning is for an agent to learn how to evolve in an environment.**
 
-&#10230; Mục tiêu của reinforcement learning đó là cho tác tử (agent) học cách làm sao để phát triển trong một môi trường
+&#10230; Mục tiêu của reinforcement learning đó là cho tác tử (agent) học cách làm sao để tối ưu hoá trong một môi trường.
 
 <br>
 
@@ -252,7 +252,7 @@
 
 **43. Value iteration algorithm ― The value iteration algorithm is in two steps:**
 
-&#10230; Giải thuật duyệt giá trị (Value iteration) - Giải thuật duyệt giá trị có 2 loại:
+&#10230; Giải thuật duyệt giá trị (Value iteration) - Giải thuật duyệt giá trị gồm 2 bước:
 
 <br>
 

From c28e3216a6326c3be1e1454a7151c73045bc43e3 Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Thu, 13 Jun 2019 21:16:48 +0900
Subject: [PATCH 06/11] vi translation for deep learning

---
 vi/cheatsheet-deep-learning.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index da05af89b..45dad28fa 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -12,7 +12,7 @@
 
 **3. Neural networks are a class of models that are built with layers. Commonly used types of neural networks include convolutional and recurrent neural networks.**
 
-&#10230; Mạng Neural là 1 lớp của các models được xây dựng với các tầng (layers). Các loại mạng Neural thường được sử dụng bao gồm: Mạng Neural tích chập (Convolutional Neural Networks) và Mạng Neural hồi quy (Recurrent Neural Networks).
+&#10230; Mạng Neural là 1 lớp của các mô hình (models) được xây dựng với các tầng (layers). Các loại mạng Neural thường được sử dụng bao gồm: Mạng Neural tích chập (Convolutional Neural Networks) và Mạng Neural hồi quy (Recurrent Neural Networks).
 
 <br>
 
@@ -30,7 +30,7 @@
 
 **6. By noting i the ith layer of the network and j the jth hidden unit of the layer, we have:**
 
-&#10230; Bằng việc kí hiệu i là tầng thứ i của mạng, j là đơn vị ẩn (hidden unit) thứ j của tầng, ta có:
+&#10230; Bằng việc kí hiệu i là tầng thứ i của mạng, j là hidden unit (đơn vị ẩn) thứ j của tầng, ta có:
 
 <br>
 
@@ -54,7 +54,7 @@
 
 **10. Cross-entropy loss ― In the context of neural networks, the cross-entropy loss L(z,y) is commonly used and is defined as follows:**
 
-&#10230; Mất mát (loss) Cross-entropy - Trong bối cảnh của mạng neural, mất mát cross-entropy L(z, y) thường được sử dụng và định nghĩa như sau:
+&#10230; Lỗi (loss) Cross-entropy - Trong bối cảnh của mạng neural, hàm lỗi cross-entropy L(z, y) thường được sử dụng và định nghĩa như sau:
 
 <br>
 
@@ -90,13 +90,13 @@
 
 **16. Step 2: Perform forward propagation to obtain the corresponding loss.**
 
-&#10230; Bước 2: Thực thi lan truyền xuôi (forward propagation) để lấy được mất mát (loss) tương ứng.
+&#10230; Bước 2: Thực thi lan truyền tiến (forward propagation) để lấy được lỗi (loss) tương ứng.
 
 <br>
 
 **17. Step 3: Backpropagate the loss to get the gradients.**
 
-&#10230; Bước 3: Lan truyền ngược mất mát để lấy được gradients (độ dốc).
+&#10230; Bước 3: Lan truyền ngược lỗi để lấy được gradients (độ dốc).
 
 <br>
 
@@ -126,13 +126,13 @@
 
 **22. Batch normalization ― It is a step of hyperparameter γ,β that normalizes the batch {xi}. By noting μB,σ2B the mean and variance of that we want to correct to the batch, it is done as follows:**
 
-&#10230; Batch normalization (chuẩn hoá) - Đây là bước mà các hyperparameter γ,β chuẩn hoá batch (mẻ) {xi}. Bằng việc kí hiệu μB,σ2B là giá trị trung bình, phương sai mà ta muốn gán cho batch, nó được thực hiện như sau:
+&#10230; Batch normalization (chuẩn hoá) - Đây là bước mà các hyperparameter γ,β chuẩn hoá batch {xi}. Bằng việc kí hiệu μB,σ2B là giá trị trung bình, phương sai mà ta muốn gán cho batch, nó được thực hiện như sau:
 
 <br>
 
 **23. It is usually done after a fully connected/convolutional layer and before a non-linearity layer and aims at allowing higher learning rates and reducing the strong dependence on initialization.**
 
-&#10230; Nó thường được hoàn thành sau fully connected/convolutional layer và trước non-linearity layer và mục tiêu là cho phép tốc độ học cao hơn cũng như giảm đi sự phụ thuộc mạnh mẽ vào việc khởi tạo.
+&#10230; Nó thường được tính sau fully connected/convolutional layer và trước non-linearity layer và mục tiêu là cho phép tốc độ học cao hơn cũng như giảm đi sự phụ thuộc mạnh mẽ vào việc khởi tạo.
 
 <br>
 
@@ -162,13 +162,13 @@
 
 **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
 
-&#10230; LSTM - Mạng bộ nhớ ngắn dài (LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (độ dốc biến mất đột ngột) bằng cách thêm vào cổng 'quên' ('forget' gates).
+&#10230; LSTM - Mạng bộ nhớ ngắn dài (LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (gradient biến mất đột ngột) bằng cách thêm vào cổng 'quên' ('forget' gates).
 
 <br>
 
 **29. Reinforcement Learning and Control**
 
-&#10230; Reinforcement Learning và Control
+&#10230; Reinforcement Learning (Học tăng cường) và điều khiển
 
 <br>
 
@@ -216,7 +216,7 @@
 
 **37. R:S×A⟶R or R:S⟶R is the reward function that the algorithm wants to maximize**
 
-&#10230; R:S×A⟶R hoặc R:S⟶R là reward function (hàm reward) mà giải thuật muốn tối đa hoá.
+&#10230; R:S×A⟶R hoặc R:S⟶R là reward function (hàm định nghĩa phần thưởng) mà giải thuật muốn tối đa hoá.
 
 <br>
 
@@ -306,7 +306,7 @@
 
 **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
 
-&#10230; [Mạng neural tích chập, Tầng chập, Chuẩn hoá lô (batch)]
+&#10230; [Mạng neural tích chập, Tầng chập, Chuẩn hoá batch]
 
 <br>
 

From 0377fd5fd1ff1b04787a1c7dcc866f3c108bf586 Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Sun, 29 Sep 2019 22:27:28 +0900
Subject: [PATCH 07/11] vi translating for cheatsheet-deep-learning

---
 vi/cheatsheet-deep-learning.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index 45dad28fa..e34e5eb70 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -66,7 +66,7 @@
 
 **12. Backpropagation ― Backpropagation is a method to update the weights in the neural network by taking into account the actual output and the desired output. The derivative with respect to weight w is computed using chain rule and is of the following form:**
 
-&#10230; Backpropagation (Lan truyền ngược) - Backpropagation là phương thức dùng để cập nhật trọng số trong mạng neural bằng cách tính toán đầu ra thực sự và đầu ra mong muốn. Đạo hàm liên quan tới trọng số w được tính bằng cách sử dụng quy tắc chuỗi (chain rule) theo như cách dưới đây:
+&#10230; Backpropagation (Lan truyền ngược) - Backpropagation là phương thức dùng để cập nhật trọng số trong mạng neural bằng cách tính toán đầu ra thực sự và đầu ra mong muốn. Đạo hàm theo trọng số w được tính bằng cách sử dụng quy tắc chuỗi (chain rule) theo như cách dưới đây:
 
 <br>
 
@@ -162,7 +162,7 @@
 
 **28. LSTM ― A long short-term memory (LSTM) network is a type of RNN model that avoids the vanishing gradient problem by adding 'forget' gates.**
 
-&#10230; LSTM - Mạng bộ nhớ ngắn dài (LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (gradient biến mất đột ngột) bằng cách thêm vào cổng 'quên' ('forget' gates).
+&#10230; LSTM - Mạng bộ nhớ dài-ngắn (LSTM) là 1 loại RNN model tránh vấn đề vanishing gradient (gradient biến mất đột ngột) bằng cách thêm vào cổng 'quên' ('forget' gates).
 
 <br>
 
@@ -258,7 +258,7 @@
 
 **44. 1) We initialize the value:**
 
-&#10230; 1) Ta khởi tạo gái trị (value):
+&#10230; 1) Ta khởi tạo giá trị (value):
 
 <br>
 

From f8345c1dcac3835ccb7135eba6cc0e172eed28dc Mon Sep 17 00:00:00 2001
From: tuananhhedspibk <tuananhhedspibk1@gmail.com>
Date: Wed, 16 Oct 2019 22:47:43 +0900
Subject: [PATCH 08/11] fix line 309 of cheatsheet-deep-learning

---
 vi/cheatsheet-deep-learning.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cheatsheet-deep-learning.md
index e34e5eb70..e03a3f0ca 100644
--- a/vi/cheatsheet-deep-learning.md
+++ b/vi/cheatsheet-deep-learning.md
@@ -306,7 +306,7 @@
 
 **52. [Convolutional Neural Networks, Convolutional layer, Batch normalization]**
 
-&#10230; [Mạng neural tích chập, Tầng chập, Chuẩn hoá batch]
+&#10230; [Mạng neural tích chập, Tầng tích chập, Chuẩn hoá batch]
 
 <br>
 

From 8fead90809e0b620c789fcc2f0c5f8b2943b7bb4 Mon Sep 17 00:00:00 2001
From: Shervine Amidi <shervinea@users.noreply.github.com>
Date: Sat, 11 Apr 2020 22:42:36 -0700
Subject: [PATCH 09/11] Rename cheatsheet-deep-learning.md to
 cs-229-deep-learning.md

---
 vi/{cheatsheet-deep-learning.md => cs-229-deep-learning.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename vi/{cheatsheet-deep-learning.md => cs-229-deep-learning.md} (100%)

diff --git a/vi/cheatsheet-deep-learning.md b/vi/cs-229-deep-learning.md
similarity index 100%
rename from vi/cheatsheet-deep-learning.md
rename to vi/cs-229-deep-learning.md

From a106cf145935b85280a48b4f2f46c4b66ea63515 Mon Sep 17 00:00:00 2001
From: Shervine Amidi <shervinea@users.noreply.github.com>
Date: Sat, 11 Apr 2020 22:51:01 -0700
Subject: [PATCH 10/11] Update README.md

---
 README.md | 126 +++++++++++++++++++++++++++++-------------------------
 1 file changed, 68 insertions(+), 58 deletions(-)

diff --git a/README.md b/README.md
index dd151c4c8..24a88de72 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # Translation of VIP Cheatsheets
 ## Goal
-This repository aims at collaboratively translating our [Machine Learning](https://github.com/afshinea/stanford-cs-229-machine-learning) and [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) cheatsheets into a ton of languages, so that this content can be enjoyed by anyone from any part of the world!
+This repository aims at collaboratively translating our [Machine Learning](https://github.com/afshinea/stanford-cs-229-machine-learning), [Deep Learning](https://github.com/afshinea/stanford-cs-230-deep-learning) and [Artificial Intelligence](https://github.com/afshinea/stanford-cs-221-artificial-intelligence) cheatsheets into a ton of languages, so that this content can be enjoyed by anyone from any part of the world!
 
 ## Contribution guidelines
 The translation process of each cheatsheet contains two steps:
@@ -33,65 +33,75 @@ The translation process of each cheatsheet contains two steps:
 ### Important note
 Please make sure to propose the translation of **only one** cheatsheet per pull request -- it simplifies a lot the review process.
 
-
-## Progression for CS 230 (Deep Learning)
-|Cheatsheet topic|Español|فارسی|Français|日本語|Português|中文|
-|:---|:---:|:---:|:---:|:---:|:---:|:---:|
-|Convolutional Neural Nets|not started|done|done|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/128)|not started|
-|Recurrent Neural Nets|not started|done|done|not started|not started|not started|
-|DL tips and tricks|not started|done|done|not started|not started|not started|
-
-|Cheatsheet topic|العَرَبِيَّة|עִבְרִית|हिन्दी|Türkçe|Русский|Italiano|
+## Progression
+### CS 221 (Artificial Intelligence)
+| |[Reflex models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-reflex-models.md)|[States models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-states-models.md)|[Variables models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-variables-models.md)|[Logic models](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-221-logic-models.md)|
+|:---|:---:|:---:|:---:|:---:|
+|**Deutsch**|not started|not started|not started|not started|
+|**Español**|not started|not started|not started|not started|
+|**فارسی**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/200)|not started|not started|not started|
+|**Français**|done|done|done|done|
+|**עִבְרִית**|not started|not started|not started|not started|
+|**Italiano**|not started|not started|not started|not started|
+|**日本語**|not started|not started|not started|not started|
+|**한국어**|not started|not started|not started|not started|
+|**Português**|not started|not started|not started|not started|
+|**Türkçe**|done|done|done|done|
+|**Tiếng Việt**|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/179)|
+|**简体中文**|not started|not started|not started|not started|
+|**繁體中文**|not started|not started|not started|not started|
+
+### CS 229 (Machine Learning)
+| |[Deep learning](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-deep-learning.md)|[Supervised](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-supervised-learning.md)|[Unsupervised](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-unsupervised-learning.md)|[ML tips](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-machine-learning-tips-and-tricks.md)|[Probabilities](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-probability.md)|[Algebra](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-229-linear-algebra.md)|
 |:---|:---:|:---:|:---:|:---:|:---:|:---:|
-|Convolutional Neural Nets|not started|not started|not started|done|not started|not started|
-|Recurrent Neural Nets|not started|not started|not started|done|not started|not started|
-|DL tips and tricks|not started|not started|not started|done|not started|not started|
-
-|Cheatsheet topic|Polski|Suomi|Català|Українська|한국어|
-|:---|:---:|:---:|:---:|:---:|:---:|
-|Convolutional Neural Nets|not started|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/109)|
-|Recurrent Neural Nets|not started|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/107)|
-|DL tips and tricks|not started|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/108)|
-
-## Progression for CS 229 (Machine Learning)
-|Cheatsheet topic|Español|فارسی|Français|日本語|Português|中文|
-|:---|:---:|:---:|:---:|:---:|:---:|:---:|
-|Deep learning|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/96)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/12)|
-|Supervised learning|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/144)|done|done|
-|Unsupervised learning|done|done|done|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/48)|
-|ML tips and tricks|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/99)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/7)|
-|Probabilities and Statistics|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/142)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/73)|
-|Linear algebra|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/140)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/72)|
-
-|Cheatsheet topic|العَرَبِيَّة|עִבְרִית|हिन्दी|Türkçe|Русский|Italiano|
-|:---|:---:|:---:|:---:|:---:|:---:|:---:|
-|Deep learning|done|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/37)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/78)|
-|Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/87)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/46)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|not started|
-|Unsupervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/88)|not started|not started|done|not started|not started|
-|ML tips and tricks|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/83)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/40)|done|not started|not started|
-|Probabilities and Statistics|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/89)|not started|not started|done|not started|not started|
-|Linear algebra|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/85)|not started|not started|done|not started|not started|
-
-
-|Cheatsheet topic|Polski|Suomi|Català|Українська|한국어|
-|:---|:---:|:---:|:---:|:---:|:---:|
-|Deep learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/34)|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/80)|
-|Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/90)|
-|Unsupervised learning|not started|not started|not started|not started|done|
-|ML tips and tricks|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|not started|done|
-|Probabilities and Statistics|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|done|done|
-|Linear algebra|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/95)|done|
-
-
-|Cheatsheet topic|Magyar|Deutsch|Bahasa Indonesia|
+|**العَرَبِيَّة**|done|done|done|done|done|done|
+|**Català**|not started|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/47)|
+|**Deutsch**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/106)|not started|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/135)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/136)|
+|**Español**|done|done|done|done|done|done|
+|**فارسی**|done|done|done|done|done|done|
+|**Suomi**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/34)|not started|not started|not started|not started|not started|
+|**Français**|done|done|done|done|done|done|
+|**עִבְרִית**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/156)|not started|not started|not started|not started|not started|
+|**हिन्दी**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/37)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/46)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/40)|not started|not started|
+|**Magyar**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124)|
+|**Bahasa Indonesia**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/154)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/139)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/151)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/150)|
+|**Italiano**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/78)|not started|not started|not started|done|done|
+|**日本語**|done|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/99)|done|done|
+|**한국어**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/80)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/90)|done|done|done|done|
+|**Polski**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/8)|not started|not started|
+|**Português**|done|done|done|done|done|done|
+|**Русский**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/21)|not started|not started|not started|not started|
+|**Türkçe**|done|done|done|done|done|done|
+|**Українська**|not started|not started|not started|not started|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/95)|
+|**Tiếng Việt**|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/199)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/175)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/176)|
+|**简体中文**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/12)|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/48)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/7)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/73)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/72)|
+|**繁體中文**|done|done|done|done|done|done|
+
+### CS 230 (Deep Learning)
+| |[Convolutional Neural Networks](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-convolutional-neural-networks.md)|[Recurrent Neural Networks](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-recurrent-neural-networks.md)|[Deep Learning tips](https://github.com/shervinea/cheatsheet-translation/blob/master/template/cs-230-deep-learning-tips-and-tricks.md)|
 |:---|:---:|:---:|:---:|
-|Deep learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/106)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/143)|
-|Supervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started|not started|
-|Unsupervised learning|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/139)|
-|ML tips and tricks|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/135)|not started|
-|Probabilities and Statistics|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/143)|
-|Linear algebra|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/124/files#diff-5e2ba65ef08acd57024e82d0ae94b923)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/136)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/143)|
-
+|**العَرَبِيَّة**|not started|not started|not started|
+|**Català**|not started|not started|not started|
+|**Deutsch**|not started|not started|not started|
+|**Español**|not started|not started|not started|
+|**فارسی**|done|done|done|
+|**Suomi**|not started|not started|not started|
+|**Français**|done|done|done|
+|**עִבְרִית**|not started|not started|not started|
+|**हिन्दी**|not started|not started|not started|
+|**Magyar**|not started|not started|not started|
+|**Bahasa Indonesia**|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/152)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/153)|
+|**Italiano**|not started|not started|not started|
+|**日本語**|done|done|done|
+|**한국어**|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/109)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/107)|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/108)|
+|**Polski**|not started|not started|not started|
+|**Português**|done|not started|not started|
+|**Русский**|not started|not started|not started|
+|**Türkçe**|done|done|done|
+|**Українська**|not started|not started|not started|
+|**Tiếng Việt**|done|done|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/178)|
+|**简体中文**|not started|[in progress](https://github.com/shervinea/cheatsheet-translation/pull/181)|not started|
+|**繁體中文**|done|not started|not started|
 
 ## Acknowledgements
 Thank you everyone for your help! Please do not forget to add your name to the `CONTRIBUTORS` file so that we can give you proper credit in the cheatsheets' [official website](https://stanford.edu/~shervine/teaching).

From b4a9219a948dd2916f64f6d2b1e16bf1a1a36b97 Mon Sep 17 00:00:00 2001
From: Shervine Amidi <shervinea@users.noreply.github.com>
Date: Sat, 11 Apr 2020 23:02:26 -0700
Subject: [PATCH 11/11] Add contributors + miscellaneous name corrections

---
 CONTRIBUTORS | 119 ++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 108 insertions(+), 11 deletions(-)

diff --git a/CONTRIBUTORS b/CONTRIBUTORS
index dc4167fc2..19ffde67f 100644
--- a/CONTRIBUTORS
+++ b/CONTRIBUTORS
@@ -1,14 +1,26 @@
 --ar
   Amjad Khatabi (translation of deep learning)
   Zaid Alyafeai (review of deep learning)
-  
+
   Zaid Alyafeai (translation of linear algebra)
   Amjad Khatabi (review of linear algebra)
   Mazen Melibari (review of linear algebra)
+
+  Fares Al-Qunaieer (translation of machine learning tips and tricks)
+  Zaid Alyafeai (review of machine learning tips and tricks)
   
+  Mahmoud Aslan (translation of probabilities and statistics)
+  Fares Al-Qunaieer (review of probabilities and statistics)
+
+  Fares Al-Qunaieer (translation of supervised learning)
+  Zaid Alyafeai (review of supervised learning)
+
+  Redouane Lguensat (translation of unsupervised learning)
+  Fares Al-Qunaieer (review of unsupervised learning)
+
 --de
 
---es 
+--es
   Erick Gabriel Mendoza Flores (translation of deep learning)
   Fernando Diaz (review of deep learning)
   Fernando González-Herrera (review of deep learning)
@@ -17,12 +29,12 @@
   Alonso Melgar López (review of deep learning)
   Gustavo Velasco-Hernández (review of deep learning)
   Juan Manuel Nava Zamudio (review of deep learning)
-  
+
   Fernando González-Herrera (translation of linear algebra)
   Fernando Diaz (review of linear algebra)
   Gustavo Velasco-Hernández (review of linear algebra)
   Juan P. Chavat (review of linear algebra)
-  
+
   David Jiménez Paredes (translation of machine learning tips and tricks)
   Fernando Diaz (translation of machine learning tips and tricks)
   Gustavo Velasco-Hernández (review of machine learning tips and tricks)
@@ -40,7 +52,7 @@
   Jaime Noel Alvarez Luna (translation of unsupervised learning)
   Alonso Melgar López (review of unsupervised learning)
   Fernando Diaz (review of unsupervised learning)
-  
+
 --fa
   AlisterTA (translation of convolutional neural networks)
   Ehsan Kermani (translation of convolutional neural networks)
@@ -55,7 +67,7 @@
 
   Erfan Noury (translation of linear algebra)
   Mohammad Karimi (review of linear algebra)
-  
+
   AlisterTA (translation of machine learning tips and tricks)
   Mohammad Reza (translation of machine learning tips and tricks)
   Erfan Noury (review of machine learning tips and tricks)
@@ -70,10 +82,10 @@
   Amirhosein Kazemnejad (translation of supervised learning)
   Erfan Noury (review of supervised learning)
   Mohammad Karimi (review of supervised learning)
-  
+
   Erfan Noury (translation of unsupervised learning)
   Mohammad Karimi (review of unsupervised learning)
-  
+
 --fr
   Original authors
 
@@ -81,21 +93,62 @@
 
 --hi
 
+--id
+  Prasetia Utama Putra (translation of convolutional neural networks)
+  Gunawan Tri (review of convolutional neural networks)
+
+--it
+  Alessandro Piotti (translation of linear algebra)
+  Nicola Dall'Asen (review of linear algebra)
+
+  Nicola Dall'Asen (translation of probabilities and statistics)
+  Alessandro Piotti (review of probabilities and statistics)
+
 --ko
   Wooil Jeong (translation of machine learning tips and tricks)
   
   Wooil Jeong (translation of probabilities and statistics)
   
-  Kwang Hyeok Ahn (translation of Unsupervised Learning)
+  Kwang Hyeok Ahn (translation of unsupervised learning)
 
 --ja
-
+  Tran Tuan Anh (translation of convolutional neural networks)
+  Yoshiyuki Nakai (review of convolutional neural networks)
+  Linh Dang (review of convolutional neural networks)
+
+  Taichi Kato (translation of deep learning)
+  Dan Lillrank (review of deep learning)
+  Yoshiyuki Nakai (review of deep learning)
+  Yuki Tokyo (review of deep learning)
+
+  Kamuela Lau (translation of deep learning tips and tricks)
+  Yoshiyuki Nakai (review of deep learning tips and tricks)
+  Hiroki Mori (review of deep learning tips and tricks)
+
+  Robert Altena (translation of linear algebra)
+  Kamuela Lau (review of linear algebra)
+
+  Takatoshi Nao (translation of probabilities and statistics)
+  Yuta Kanzawa (review of probabilities and statistics)
+
+  H. Hamano (translation of recurrent neural networks)
+  Yoshiyuki Nakai (review of recurrent neural networks)
+
+  Yuta Kanzawa (translation of supervised learning)
+  Tran Tuan Anh (review of supervised learning)
+
+  Tran Tuan Anh (translation of unsupervised learning)
+  Yoshiyuki Nakai (review of unsupervised learning)
+  Yuta Kanzawa (review of unsupervised learning)
+  Dan Lillrank (review of unsupervised learning)
+
 --pt
   Leticia Portella (translation of convolutional neural networks)
   Gabriel Aparecido Fonseca (review of convolutional neural networks)
 
   Gabriel Fonseca (translation of deep learning)
   Leticia Portella (review of deep learning)
+  Renato Kano (review of deep learning)
 
   Gabriel Fonseca (translation of linear algebra)
   Leticia Portella (review of linear algebra)
@@ -110,7 +163,7 @@
   Leticia Portella (translation of supervised learning)
   Gabriel Fonseca (review of supervised learning)
   Flavio Clesio (review of supervised learning)
-  
+
   Gabriel Fonseca (translation of unsupervised learning)
   Tiago Danin (review of unsupervised learning)
 
@@ -127,6 +180,9 @@
   Kadir Tekeli (translation of linear algebra)
   Ekrem Çetinkaya (review of linear algebra)
   
+  Ayyüce Kızrak (translation of logic-based models)
+  Başak Buluz (review of logic-based models)
+
   Seray Beşer (translation of machine learning tips and tricks)
   Ayyüce Kızrak (review of machine learning tips and tricks)
   Yavuz Kömeçoğlu (review of machine learning tips and tricks)
@@ -137,22 +193,60 @@
   Başak Buluz (translation of recurrent neural networks)
   Yavuz Kömeçoğlu (review of recurrent neural networks)
   
+  Yavuz Kömeçoğlu (translation of reflex-based models)
+  Ayyüce Kızrak (review of reflex-based models)
+
+  Cemal Gurpinar (translation of states-based models)
+  Başak Buluz (review of states-based models)
+
   Başak Buluz (translation of supervised learning)
   Ayyüce Kızrak (review of supervised learning)
   
   Yavuz Kömeçoğlu (translation of unsupervised learning)
   Başak Buluz (review of unsupervised learning)
   
+  Başak Buluz (translation of variables-based models)
+  Ayyüce Kızrak (review of variables-based models)
+
 --uk
   Gregory Reshetniak (translation of probabilities and statistics)
   Denys (review of probabilities and statistics)
   
+--vi
+  Phạm Hồng Vinh (translation of convolutional neural networks)
+  Đàm Minh Tiến (review of convolutional neural networks)
+
+  Trần Tuấn Anh (translation of deep learning)
+  Phạm Hồng Vinh (review of deep learning)
+  Đàm Minh Tiến (review of deep learning)
+  Nguyễn Khánh Hưng (review of deep learning)
+  Hoàng Vũ Đạt (review of deep learning)
+  Nguyễn Trí Minh (review of deep learning)
+
+  Trần Tuấn Anh (translation of machine learning tips and tricks)
+  Nguyễn Trí Minh (review of machine learning tips and tricks)
+  Vinh Pham (review of machine learning tips and tricks)
+  Đàm Minh Tiến (review of machine learning tips and tricks)
+
+  Trần Tuấn Anh (translation of recurrent neural networks)
+  Đàm Minh Tiến (review of recurrent neural networks)
+  Hung Nguyễn (review of recurrent neural networks)
+  Nguyễn Trí Minh (review of recurrent neural networks)
+
+  Trần Tuấn Anh (translation of supervised learning)
+  Đàm Minh Tiến (review of supervised learning)
+  Hung Nguyễn (review of supervised learning)
+  Nguyễn Trí Minh (review of supervised learning)
+
 --zh
   Wang Hongnian (translation of supervised learning)
   Xiaohu Zhu (朱小虎) (review of supervised learning)
   Chaoying Xue (review of supervised learning)
 
 --zh-tw
+  kentropy (translation of convolutional neural networks)
+  kevingo (review of convolutional neural networks)
+
   kevingo (translation of deep learning)
   TobyOoO (review of deep learning)
 
@@ -168,3 +262,6 @@
   kevingo (translation of unsupervised learning)
   imironhead (review of unsupervised learning)
   johnnychhsu (review of unsupervised learning)
+
+  kevingo (translation of machine learning tips and tricks)
+  kentropy (review of machine learning tips and tricks)