
Commit

Added post on word2vec
TSoli committed Nov 5, 2024
1 parent 91de4a0 commit f930f9d
Showing 4 changed files with 338 additions and 0 deletions.
181 changes: 181 additions & 0 deletions _layouts/posts.html
@@ -0,0 +1,181 @@
---
layout: default
refactor: true
panel_includes:
- toc
tail_includes:
- related-posts
- post-nav
- comments
---

{% include lang.html %} {% include toc-status.html %}

<article class="px-1" data-toc="{{ enable_toc }}">
<header>
<h1 data-toc-skip>{{ page.title }}</h1>
{% if page.description %}
<p class="post-desc fw-light mb-4">{{ page.description }}</p>
{% endif %}

<div class="post-meta text-muted">
<!-- published date -->
<span>
{{ site.data.locales[lang].post.posted }} {% include datetime.html
date=page.date tooltip=true lang=lang %}
</span>

<!-- lastmod date -->
{% if page.last_modified_at and page.last_modified_at != page.date %}
<span>
{{ site.data.locales[lang].post.updated }} {% include datetime.html
date=page.last_modified_at tooltip=true lang=lang %}
</span>
{% endif %} {% if page.image %} {% capture src %}src="{{ page.image.path |
default: page.image }}"{% endcapture %} {% capture class
%}class="preview-img{% if page.image.no_bg %}{{ ' no-bg' }}{% endif %}"{%
endcapture %} {% capture alt %}alt="{{ page.image.alt | xml_escape |
default: "Preview Image" }}"{% endcapture %} {% if page.image.lqip %} {%-
capture lqip -%}lqip="{{ page.image.lqip }}"{%- endcapture -%} {% endif %}

<div class="mt-3 mb-3">
<img {{ src }} {{ class }} {{ alt }} w="1200" h="630" {{ lqip }} />
{%- if page.image.alt -%}
<figcaption class="text-center pt-2 pb-2">
{{ page.image.alt }}
</figcaption>
{%- endif -%}
</div>
{% endif %}

<div class="d-flex justify-content-between">
<!-- author(s) -->
<span>
{% if page.author %} {% assign authors = page.author %} {% elsif
page.authors %} {% assign authors = page.authors %} {% endif %} {{
site.data.locales[lang].post.written_by }}

<em>
{% if authors %} {% for author in authors %} {% if
site.data.authors[author].url -%}
<a href="{{ site.data.authors[author].url }}"
>{{ site.data.authors[author].name }}</a
>
{%- else -%} {{ site.data.authors[author].name }} {%- endif %} {%
unless forloop.last %}{{ '</em
>,
<em
>' }}{% endunless %} {% endfor %} {% else %}
<a href="{{ site.social.links[0] }}">{{ site.social.name }}</a>
{% endif %}
</em>
</span>

<div>
<!-- pageviews -->
{% if site.pageviews.provider and
site.analytics[site.pageviews.provider].id %}
<span>
<em id="pageviews">
<i class="fas fa-spinner fa-spin small"></i>
</em>
{{ site.data.locales[lang].post.pageview_measure }}
</span>
{% endif %}

<!-- read time -->
{% include read-time.html content=content prompt=true lang=lang %}
</div>
</div>
</div>
</header>

{% if enable_toc %}
<div
id="toc-bar"
class="d-flex align-items-center justify-content-between invisible"
>
<span class="label text-truncate">{{ page.title }}</span>
<button type="button" class="toc-trigger btn me-1">
<i class="fa-solid fa-list-ul fa-fw"></i>
</button>
</div>

<button
id="toc-solo-trigger"
type="button"
class="toc-trigger btn btn-outline-secondary btn-sm"
>
<span class="label ps-2 pe-1"
>{{- site.data.locales[lang].panel.toc -}}</span
>
<i class="fa-solid fa-angle-right fa-fw"></i>
</button>

<dialog id="toc-popup" class="p-0">
<div
class="header d-flex flex-row align-items-center justify-content-between"
>
<div class="label text-truncate py-2 ms-4">{{- page.title -}}</div>
<button
id="toc-popup-close"
type="button"
class="btn mx-1 my-1 opacity-75"
>
<i class="fas fa-close"></i>
</button>
</div>
<div id="toc-popup-content" class="px-4 py-3 pb-4"></div>
</dialog>
{% endif %}

<div class="content">{{ content }}</div>

<div class="post-tail-wrapper text-muted">
<!-- categories -->
{% if page.categories.size > 0 %}
<div class="post-meta mb-3">
<i class="far fa-folder-open fa-fw me-1"></i>
{% for category in page.categories %}
<a
href="{{ site.baseurl }}/categories/{{ category | slugify | url_encode }}/"
>{{ category }}</a
>
{%- unless forloop.last -%},{%- endunless -%} {% endfor %}
</div>
{% endif %}

<!-- tags -->
{% if page.tags.size > 0 %}
<div class="post-tags">
<i class="fa fa-tags fa-fw me-1"></i>
{% for tag in page.tags %}
<a
href="{{ site.baseurl }}/tags/{{ tag | slugify | url_encode }}/"
class="post-tag no-text-decoration"
>
{{- tag -}}
</a>
{% endfor %}
</div>
{% endif %}

<div
class="post-tail-bottom d-flex justify-content-between align-items-center mt-5 pb-2"
>
<div class="license-wrapper">
{% if site.data.locales[lang].copyright.license.template %} {% capture
_replacement %}
<a href="{{ site.data.locales[lang].copyright.license.link }}">
{{ site.data.locales[lang].copyright.license.name }}
</a>
{% endcapture %} {{ site.data.locales[lang].copyright.license.template |
replace: ':LICENSE_NAME', _replacement }} {% endif %}
</div>

{% include post-sharing.html lang=lang %}
</div>
<!-- .post-tail-bottom -->
</div>
<!-- div.post-tail-wrapper -->
</article>
157 changes: 157 additions & 0 deletions _posts/2024-11-04-word2vec.md
@@ -0,0 +1,157 @@
---
title: "Word2vec: Encoding Meaning with Vectors"
date: 2024-11-04
categories: [AI, Natural Language Processing]
tags: [study, natural language processing, machine learning, artificial intelligence, word2vec]
math: true
toc: true
---

<!-- prettier-ignore -->
* TOC
{:toc}

# Word2vec: Encoding Meaning with Vectors

## Why Do We Care?

Encoding the meaning of natural language in a computer is difficult, and by meaning I want something more useful than a dictionary of definitions: the goal is a representation of semantics that algorithms can work with directly. Example tasks include finding relevant documents from a natural language query (as a search engine does), evaluating the sentiment of reviews, or even generating new text and images as popular chatbots can. A good semantic representation of natural language is extremely useful for all of these tasks.

## Encoding Words

To begin, let's consider encoding the meaning of individual words. Of course, words and letters are already represented as numbers in a computer, but the semantic relationships are not obvious in this form. For example, it is not clear that _hippopotamus_ and _animal_ are related just by looking at the letters. Ideally, our representations should be similar for words with similar meanings and different for words with different meanings.

## It's All About Relationships

As a child, you were probably taught that you can reason about the meaning of an unknown word by considering the words around it. For example, in the sentence _I ate a juicy, \_\_\_ apple_, you know the missing word is an adjective, and one usually associated with apples. A good guess would be _red_, and other reasonable guesses might be _green_ or _big_. You probably wouldn't guess _hippopotamus_. If I gave you more and more examples where the answer was _red_, you would gradually learn the meaning of _red_ from the things it appears alongside: fire can be red, cars can be red, clothes can be red, blood is red, and so on. In this sense, all of these objects are similar. But how does this help us encode words on a computer?

This is where machine learning comes in: we can turn the guessing game into a training task for a model. First, we define a vocabulary (all of the words the model knows) and map each word to a one-hot encoding (sometimes referred to as 1-of-V). We can then define a simple neural network with an input layer the size of our vocabulary, one hidden layer and an output layer. The [original paper](https://doi.org/10.48550/arXiv.1301.3781) presented two methods: continuous bag-of-words (CBOW), which uses the surrounding words to predict the middle word, and Skip-gram, which uses the middle word to predict the surrounding words.
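
As a rough sketch of that first step (not code from the paper; the toy corpus and helper names are my own), mapping a small vocabulary to one-hot vectors might look like this:

```python
import numpy as np

# A toy corpus stands in for real training text.
corpus = "i ate a juicy red apple and a juicy green apple".split()

# Build the vocabulary: one index per unique word.
vocab = sorted(set(corpus))
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the 1-of-V (one-hot) vector for a word."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("apple"))  # all zeros except a single 1 at apple's index
```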

![The model architectures for word2vec](/images/word2vec_arch.png)_The model architectures for
word2vec_

## Continuous Bag-of-words (CBOW)

The continuous bag-of-words method uses the context around a word to guess the missing word. Each of the surrounding words is embedded using the weights from the input to the hidden layer, and the average of these embeddings in the hidden layer is then used to predict the missing word. For efficiency, the prediction uses a hierarchical softmax structure, similar to a Huffman tree, that turns predicting the word into a sequence of binary decisions. Since this post is not focused on this optimisation technique I won't describe it any further and instead refer you to
[this paper](https://proceedings.neurips.cc/paper_files/paper/2008/file/1e056d2b0ebd5c878c550da6ac5d3724-Paper.pdf)
on it.

The most important part is that the hidden layer can then be treated as a high-dimensional vector representation that takes on some semantic meaning. This is because predictions for the missing word are more likely to be correct when words with similar meanings are mapped to similar representations. Since the input word embeddings are averaged, the order of the context words does not matter, so it is classified as a bag-of-words method. Furthermore, the embeddings are dense, real-valued vectors rather than sparse one-hot encodings, so they can be thought of as a continuous representation. Hence the technique is named continuous bag-of-words (CBOW).
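
To make the data flow concrete, here is a minimal CBOW forward pass sketched in plain NumPy. The sizes, weight matrices and function name are illustrative only, and it uses an ordinary softmax over the whole vocabulary rather than the hierarchical softmax described above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 5                      # toy sizes, not the paper's
W_in = rng.normal(size=(vocab_size, embed_dim))    # input -> hidden weights (the word embeddings)
W_out = rng.normal(size=(embed_dim, vocab_size))   # hidden -> output weights

def cbow_predict(context_ids: list[int]) -> np.ndarray:
    """Average the context word embeddings and score every word in the vocabulary."""
    hidden = W_in[context_ids].mean(axis=0)        # the shared hidden representation
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())            # ordinary softmax; the paper uses
    return exp / exp.sum()                         # hierarchical softmax for speed

probs = cbow_predict([1, 4, 7, 2])                 # e.g. two words either side of the target
print(probs.argmax())                              # index of the most likely centre word
```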

## Skip-gram

The skip-gram method performs the opposite task: given an input word, it predicts the surrounding context words. In the [first implementation](https://doi.org/10.48550/arXiv.1301.3781), the model is given an input word and the label is sampled from the context words around it, with words that appear closer to the input word sampled more often. Once again, hierarchical softmax is used for efficiency.
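
A simple way to approximate this sampling scheme (my own illustrative sketch, not the paper's code) is to shrink the context window to a random size at each position, so that nearby words are chosen as labels more often than distant ones:

```python
import random

def skipgram_pairs(tokens: list[str], max_window: int = 5):
    """Yield (input_word, context_word) training pairs.

    The effective window is sampled uniformly from 1..max_window at each
    position, so words close to the input appear as labels more often
    than distant ones.
    """
    for i, centre in enumerate(tokens):
        window = random.randint(1, max_window)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield centre, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(sentence, max_window=2))[:5])
```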

## Noise Contrastive Estimation (Negative Sampling)

Additionally, there is a [follow-up](https://doi.org/10.48550/arXiv.1310.4546) to the skip-gram paper which instead trains the model on pairs made up of real context words and randomly drawn noise words. The model's objective is then to distinguish genuine context words from the noise samples. The dot product of two vectors acts as a similarity measure, since a higher dot product indicates that the vectors point in a similar direction (it is the magnitude of the projection of one onto the other, scaled by the other's length). The loss function is therefore tied directly to this similarity measure: context words should be similar to the input word, with large positive dot products, while noise words should be dissimilar, with large negative dot products. Further details can be found in
[the paper](https://doi.org/10.48550/arXiv.1310.4546).
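
As a sketch of how such a loss could be computed (illustrative NumPy with made-up toy dimensions; see the paper for the exact objective), the context word is pushed towards a large positive dot product with the input word while each noise word is pushed towards a large negative one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_context, v_noise):
    """Loss for one (input, context) pair plus k noise words.

    v_in and v_context are embedding vectors; v_noise has shape (k, dim).
    The real context word should have a large positive dot product with
    the input word, while each noise word should have a large negative one.
    """
    positive = np.log(sigmoid(v_in @ v_context))
    negative = np.log(sigmoid(-(v_noise @ v_in))).sum()
    return -(positive + negative)

rng = np.random.default_rng(0)
dim, k = 8, 5                                      # toy embedding size and noise count
loss = negative_sampling_loss(rng.normal(size=dim),
                              rng.normal(size=dim),
                              rng.normal(size=(k, dim)))
print(loss)
```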

## Did it Work?

So, the idea was to encode the meaning of words into vectors. In the
[follow-up paper for the skip-gram model](https://doi.org/10.48550/arXiv.1310.4546), the authors investigate this by defining an analogy task. The questions have a format similar to _Germany is to Berlin as France is to \_\_\_?_, where the correct answer would be _Paris_ (since it is the capital city of France). To answer such a question, the embeddings of the input words are looked up and the answer is taken to be the word in the vocabulary whose embedding has the highest cosine similarity to the vector,

$$
\text{vec("Germany") - vec("Berlin") + vec("France")}
$$

with the cosine similarity defined based on the dot product of two vectors,

$$
\text{cosine similarity} = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert\mathbf{A}\rVert\lVert\mathbf{B}\rVert}
$$

This can be interpreted as defining a direction that captures the capital-city relationship (vec("Berlin") - vec("Germany")) and then adding that meaning to the vector for a specific country, _France_, so that the result should be close to the vector for the capital city of _France_, _Paris_ (or at least point in the most similar direction).
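
In code, answering such an analogy question reduces to one vector operation and a nearest-neighbour search by cosine similarity. The following sketch assumes a hypothetical `embeddings` dictionary mapping words to trained vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(embeddings: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
    """Return the word closest (by cosine similarity) to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(candidates[w], target))

# With trained word2vec vectors loaded into `embeddings`, the call would be
# solve_analogy(embeddings, "Germany", "Berlin", "France")  # ideally "Paris"
```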

The authors visualised this relationship in the following figure,

![A figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions
relating countries to their capital cities are similar.](/images/word2vec_capitals.png)_A
figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions
relating countries to their capital cities are similar_

Now, the training data never explicitly labelled capital-city relationships, so this encoding, and the linear structure of meaning in the vectors, was learned purely from the surrounding words. It appears, then, that semantics really have been encoded into the vectors.

Additionally, due to this linearity in the structure, sentences or paragraphs can be encoded simply
by adding the embeddings of each of the words!
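
For example, with the same hypothetical `embeddings` dictionary as above, a whole sentence could be encoded as a single vector like this:

```python
import numpy as np

def sentence_vector(embeddings: dict[str, np.ndarray], sentence: str) -> np.ndarray:
    """Encode a sentence as the sum of its word embeddings."""
    return np.sum([embeddings[word] for word in sentence.lower().split()], axis=0)

# Sentences with similar meanings should then map to nearby vectors,
# which can be compared with the cosine similarity defined above.
```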

## Some Important Takeaways

While word2vec has largely been superseded by Large Language Models (LLMs) based on the Transformer architecture, dense vector representations of words and sentences that capture semantic meaning are still very much in use. In fact, Transformer architectures appear to learn similar structures in their embeddings as well. The main difference is that the architecture combines the words in a sentence much more cleverly than simply adding their vectors together. The details are a topic for another day, but suffice it to say that the ideas in these papers are still extremely relevant.

Additionally, I think it's important to highlight that what made this model work was not a fancy architecture but a clever training method. Similar techniques have more recently been used to train powerful Transformer models such as
[BERT](https://doi.org/10.48550/arXiv.1810.04805) and even
[vision models](https://doi.org/10.48550/arXiv.2111.06377).

Finally, I want to link a video that makes a very good case for why representing meaning in these high-dimensional vector spaces works so well. It feels surprising that there could be enough distinct directions to represent so many unrelated meanings, and the video (particularly towards the end) explains why high-dimensional vector spaces can hold this much information.

<center><iframe width="560" height="315" src="https://www.youtube.com/embed/9-Jl0dxWQs8?si=r3e9_zgJP1pfANhh" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></center>
Binary file added images/word2vec_arch.png
Binary file added images/word2vec_capitals.png
