Fix small glitches, typos #12

Open
wants to merge 8 commits into master
36 changes: 18 additions & 18 deletions Deep Learning for Natural Language Processing with Pytorch.ipynb
@@ -43,14 +43,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Introduction to Torch's tensor library"
"# 1. Introduction to Pytorch's tensor library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of deep learning is computations on tensors, which are generalizations of a matrix that can be indexed in more than 2 dimensions. We will see exactly what this means in-depth later. First, lets look what we can do with tensors."
"All of deep learning is computations on tensors, which are generalizations of a matrix that can be indexed in more than 2 dimensions. We will see exactly what this means in-depth later. First, let's look what we can do with tensors."
]
},
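A minimal sketch of the claim in this cell, that tensors can be indexed in more than two dimensions, assuming only the torch import used throughout the notebook:

import torch

V = torch.randn(3)        # 1-D: a vector, indexed by one integer
M = torch.randn(3, 4)     # 2-D: a matrix, indexed by two integers
T = torch.randn(3, 4, 5)  # 3-D: indexed by three integers

print(T[0])        # a 4x5 matrix (a slice along the first dimension)
print(T[0][1])     # a 5-element vector
print(T[0][1][2])  # a single entry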
{
@@ -369,12 +369,12 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The concept of a computation graph is essential to efficient deep learning programming, because it allows you to not have to write the back propagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output. Since the graph totally specifies what parameters were involved with which operations, it contains enough information to compute derivatives. This probably sounds vague, so lets see what is going on using the fundamental class of Pytorch: autograd.Variable.\n",
"The concept of a computation graph is essential to efficient deep learning programming, because it allows you to not have to write the back propagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output. Since the graph totally specifies what parameters were involved with which operations, it contains enough information to compute derivatives. This probably sounds vague, so let's see what is going on using the fundamental class of Pytorch: autograd.Variable.\n",
"\n",
"First, think from a programmers perspective. What is stored in the torch.Tensor objects we were creating above?\n",
"Obviously the data and the shape, and maybe a few other things. But when we added two tensors together, we got an output tensor. All this output tensor knows is its data and shape. It has no idea that it was the sum of two other tensors (it could have been read in from a file, it could be the result of some other operation, etc.)\n",
"\n",
"The Variable class keeps track of how it was created. Lets see it in action."
"The Variable class keeps track of how it was created. Let's see it in action."
]
},
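A minimal sketch of the Variable chain discussed in this cell, in the notebook's old-style autograd API; the values are illustrative:

import torch
import torch.autograd as autograd

x = autograd.Variable(torch.Tensor([1., 2., 3.]), requires_grad=True)
y = autograd.Variable(torch.Tensor([4., 5., 6.]), requires_grad=True)
z = x + y          # z remembers that it was produced by an addition of x and y
print(z.data)      # the data looks like any other tensor
print(z.grad_fn)   # but z also knows the operation that created it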
{
@@ -444,7 +444,7 @@
}
],
"source": [
"# Lets sum up all the entries in z\n",
"# Let's sum up all the entries in z\n",
"s = z.sum()\n",
"print s\n",
"print s.grad_fn"
@@ -468,7 +468,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets have Pytorch compute the gradient, and see that we were right: (note if you run this block multiple times, the gradient will increment. That is because Pytorch *accumulates* the gradient into the .grad property, since for many models this is very convenient.)"
"Let's have Pytorch compute the gradient, and see that we were right: (note if you run this block multiple times, the gradient will increment. That is because Pytorch *accumulates* the gradient into the .grad property, since for many models this is very convenient.)"
]
},
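A small sketch of the accumulation behaviour noted in this cell: two backward() calls without clearing .grad add up, which is why training code calls zero_grad(). Names are illustrative:

import torch
import torch.autograd as autograd

x = autograd.Variable(torch.ones(2, 2), requires_grad=True)
first = (x + 2).sum()
first.backward()
print(x.grad)           # all ones
second = (x + 2).sum()  # the same computation, built as a fresh graph
second.backward()       # gradients are *added* into .grad
print(x.grad)           # now all twos
x.grad.data.zero_()     # manual reset, which optimizers expose as zero_grad()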
{
@@ -543,7 +543,7 @@
"source": [
"Here is the basic, extremely important rule for computing with autograd.Variables (note this is more general than Pytorch. There is an equivalent object in every major deep learning toolkit):\n",
"\n",
"** If you want the error from your loss function to backpropogate to a component of your network, you MUST NOT break the Variable chain from that component to your loss Variable. If you do, the loss will have no idea your component exists, and its parameters can't be updated. **\n",
"** If you want the error from your loss function to backpropagate to a component of your network, you MUST NOT break the Variable chain from that component to your loss Variable. If you do, the loss will have no idea your component exists, and its parameters can't be updated. **\n",
"\n",
"I say this in bold, because this error can creep up on you in very subtle ways (I will show some such ways below), and it will not cause your code to crash or complain, so you must be careful."
]
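A hedged sketch of one way the chain can break, as warned above: re-wrapping a Variable's .data in a fresh Variable keeps the numbers but discards the graph history. The names are illustrative:

import torch
import torch.autograd as autograd

w = autograd.Variable(torch.randn(3), requires_grad=True)
loss = (w * 2).sum()
loss.backward()
print(w.grad)                             # gradients flow back to w

detached = autograd.Variable(w.data * 2)  # chain broken: only the raw data survives
new_loss = detached.sum()
# new_loss.backward() would now fail (or leave w.grad untouched), because
# new_loss has no recorded connection back to w.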
@@ -647,7 +647,7 @@
}
],
"source": [
"# In pytorch, most non-linearities are in torch.functional (we have it imported as F)\n",
"# In pytorch, most non-linearities are in torch.nn.functional (we have it imported as F)\n",
"# Note that non-linearites typically don't have parameters like affine maps do.\n",
"# That is, they don't have weights that are updated during training.\n",
"data = autograd.Variable( torch.randn(2, 2) )\n",
@@ -708,7 +708,7 @@
}
],
"source": [
"# Softmax is also in torch.functional\n",
"# Softmax is also in torch.nn.functional\n",
"data = autograd.Variable( torch.randn(5) )\n",
"print data\n",
"print F.softmax(data)\n",
@@ -744,15 +744,15 @@
"\n",
"$$ \\theta^{(t+1)} = \\theta^{(t)} - \\eta \\nabla_\\theta L(\\theta) $$\n",
"\n",
"There are a huge collection of algorithms and active research in attempting to do something more than just this vanilla gradient update. Many attempt to vary the learning rate based on what is happening at train time. You don't need to worry about what specifically these algorithms are doing unless you are really interested. Torch provies many in the torch.optim package, and they are all completely transparent. Using the simplest gradient update is the same as the more complicated algorithms. Trying different update algorithms and different parameters for the update algorithms (like different initial learning rates) is important in optimizing your network's performance. Often, just replacing vanilla SGD with an optimizer like Adam or RMSProp will boost performance noticably."
"There are a huge collection of algorithms and active research in attempting to do something more than just this vanilla gradient update. Many attempt to vary the learning rate based on what is happening at training time. You don't need to worry about what specifically these algorithms are doing unless you are really interested. Pytorch provides many in the torch.optim package, and they are all completely transparent. Using the simplest gradient update is the same as the more complicated algorithms. Trying different update algorithms and different parameters for the update algorithms (like different initial learning rates) is important in optimizing your network's performance. Often, just replacing vanilla SGD with an optimizer like Adam or RMSProp will boost performance noticeably."
]
},
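A short sketch of swapping optimizers as this cell suggests; the tiny nn.Linear model, sizes, and learning rates below are placeholders, and only the optim.* constructor changes between variants:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 2)                            # stand-in for any network
optimizer = optim.SGD(model.parameters(), lr=0.1)
# optimizer = optim.Adam(model.parameters(), lr=0.001)    # drop-in replacement
# optimizer = optim.RMSprop(model.parameters(), lr=0.01)  # likewise

x = autograd.Variable(torch.randn(3, 5))
loss = model(x).sum()      # a stand-in loss, just to have something to minimize
optimizer.zero_grad()      # clear gradients accumulated from earlier steps
loss.backward()            # compute fresh gradients
optimizer.step()           # theta <- theta - eta * grad (for plain SGD)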
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Creating Network Components in Pytorch\n",
"Before we move on to our focus on NLP, lets do an annotated example of building a network in Pytorch using only affine maps and non-linearities. We will also see how to compute a loss function, using Pytorch's built in negative log likelihood, and update parameters by backpropagation.\n",
"Before we move on to our focus on NLP, let's do an annotated example of building a network in Pytorch using only affine maps and non-linearities. We will also see how to compute a loss function, using Pytorch's built-in negative log likelihood, and update parameters by backpropagation.\n",
"\n",
"All network components should inherit from nn.Module and override the forward() method. That is about it, as far as the boilerplate is concerned. Inheriting from nn.Module provides functionality to your component. For example, it makes it keep track of its trainable parameters, you can swap it between CPU and GPU with the .cuda() or .cpu() functions, etc.\n",
"\n",
@@ -763,7 +763,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example: Logistic Regression Bag-of-Words classifier\n",
"### Example: Logistic Regression Bag-of-Words (BOW) classifier\n",
"Our model will map a sparse BOW representation to log probabilities over labels. We assign each word in the vocab an index. For example, say our entire vocab is two words \"hello\" and \"world\", with indices 0 and 1 respectively.\n",
"The BoW vector for the sentence \"hello hello hello hello\" is\n",
"$$ \\left[ 4, 0 \\right] $$\n",
@@ -776,7 +776,7 @@
"Denote this BOW vector as $x$.\n",
"The output of our network is:\n",
"$$ \\log \\text{Softmax}(Ax + b) $$\n",
"That is, we pass the input through an affine map and then do log softmax."
"That is, we pass the input through an affine map and then compute log softmax."
]
},
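A small sketch of building the BoW vector defined above, using the two-word vocab from this cell; make_bow_vector is an illustrative helper, not necessarily the notebook's own:

import torch

word_to_ix = {"hello": 0, "world": 1}

def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence.split():
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)    # a 1 x |V| row vector, ready for the affine map A

print(make_bow_vector("hello hello hello hello", word_to_ix))  # [[4, 0]]
print(make_bow_vector("hello world hello world", word_to_ix))  # [[2, 2]]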
{
@@ -955,7 +955,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"So lets train! To do this, we pass instances through to get log probabilities, compute a loss function, compute the gradient of the loss function, and then update the parameters with a gradient step. Loss functions are provided by Torch in the nn package. nn.NLLLoss() is the negative log likelihood loss we want. It also defines optimization functions in torch.optim. Here, we will just use SGD.\n",
"So let's train! To do this, we pass instances through to get log probabilities, compute a loss function, compute the gradient of the loss function, and then update the parameters with a gradient step. Loss functions are provided by Pytorch in the nn package. nn.NLLLoss() is the negative log likelihood loss we want. It also defines optimization functions in torch.optim. Here, we will just use SGD.\n",
"\n",
"Note that the *input* to NLLLoss is a vector of log probabilities, and a target label. It doesn't compute the log probabilities for us. This is why the last layer of our network is log softmax.\n",
"The loss function nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you."
@@ -1148,7 +1148,7 @@
"source": [
"You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0, and we gave each word some unique semantic attribute. These new vectors are *dense*, which is to say their entries are (typically) non-zero.\n",
"\n",
"But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some *latent semantic attributes* that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicisits have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us."
"But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some *latent semantic attributes* that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us."
]
},
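A minimal sketch of the idea in this cell that embeddings are just learnable parameters, using nn.Embedding (vocab size x embedding dimension); the two-word vocab is illustrative:

import torch
import torch.autograd as autograd
import torch.nn as nn

word_to_ix = {"mathematician": 0, "physicist": 1}
embeds = nn.Embedding(2, 5)   # 2 words in the vocab, 5-dimensional embeddings

lookup = autograd.Variable(torch.LongTensor([word_to_ix["mathematician"]]))
print(embeds(lookup))         # a 1 x 5 dense vector, updated like any other parameter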
{
@@ -1474,7 +1474,7 @@
"The classical example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field.\n",
"\n",
"A recurrent neural network is a network that maintains some kind of state.\n",
"For example, its output could be used as part of the next input, so that information can propogate along as the network passes over the sequence.\n",
"For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence.\n",
"In the case of an LSTM, for each element in the sequence, there is a corresponding *hidden state* $h_t$, which in principle can contain information from arbitrary points earlier in the sequence.\n",
"We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things."
]
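A hand-rolled sketch of "maintains some kind of state": each step's hidden state is combined with the next input, so information propagates along the sequence. Sizes and names are made up; the notebook's own LSTM example follows below:

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F

step = nn.Linear(3 + 4, 4)    # maps [x_t, h_{t-1}] -> h_t
h = autograd.Variable(torch.zeros(1, 4))
sequence = [autograd.Variable(torch.randn(1, 3)) for _ in range(5)]

for x_t in sequence:
    h = F.tanh(step(torch.cat([x_t, h], 1)))   # the state carries information forward
print(h)   # the final hidden state summarizes the whole sequence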
@@ -1489,7 +1489,7 @@
"Pytorch's LSTM expects all of its inputs to be 3D tensors.\n",
"The semantics of the axes of these tensors is important.\n",
"The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input.\n",
"We haven't discussed mini-batching, so lets just ignore that and assume we will always have just 1 dimension on the second axis.\n",
"We haven't discussed mini-batching, so let's just ignore that and assume we will always have just 1 dimension on the second axis.\n",
"If we want to run the sequence model over the sentence \"The cow jumped\", our input should look like\n",
"$$ \n",
"\\begin{bmatrix}\n",
@@ -1560,7 +1560,7 @@
"# they are the same)\n",
"# The reason for this is that:\n",
"# \"out\" will give you access to all hidden states in the sequence\n",
"# \"hidden\" will allow you to continue the sequence and backpropogate, by passing it as an argument\n",
"# \"hidden\" will allow you to continue the sequence and backpropagate, by passing it as an argument\n",
"# to the lstm at a later time\n",
"inputs = torch.cat(inputs).view(len(inputs), 1, -1) # Add the extra 2nd dimension\n",
"hidden = (autograd.Variable(torch.randn(1,1,3)), autograd.Variable(torch.randn((1,1,3)))) # clean out hidden state\n",