
Commit

Added post on word2vec
TSoli committed Nov 5, 2024
1 parent 91de4a0 commit f930f9d
Showing 4 changed files with 338 additions and 0 deletions.
181 changes: 181 additions & 0 deletions _layouts/posts.html
@@ -0,0 +1,181 @@
---
layout: default
refactor: true
panel_includes:
- toc
tail_includes:
- related-posts
- post-nav
- comments
---

{% include lang.html %} {% include toc-status.html %}

<article class="px-1" data-toc="{{ enable_toc }}">
<header>
<h1 data-toc-skip>{{ page.title }}</h1>
{% if page.description %}
<p class="post-desc fw-light mb-4">{{ page.description }}</p>
{% endif %}

<div class="post-meta text-muted">
<!-- published date -->
<span>
{{ site.data.locales[lang].post.posted }} {% include datetime.html
date=page.date tooltip=true lang=lang %}
</span>

<!-- lastmod date -->
{% if page.last_modified_at and page.last_modified_at != page.date %}
<span>
{{ site.data.locales[lang].post.updated }} {% include datetime.html
date=page.last_modified_at tooltip=true lang=lang %}
</span>
{% endif %} {% if page.image %} {% capture src %}src="{{ page.image.path |
default: page.image }}"{% endcapture %} {% capture class
%}class="preview-img{% if page.image.no_bg %}{{ ' no-bg' }}{% endif %}"{%
endcapture %} {% capture alt %}alt="{{ page.image.alt | xml_escape |
default: "Preview Image" }}"{% endcapture %} {% if page.image.lqip %} {%-
capture lqip -%}lqip="{{ page.image.lqip }}"{%- endcapture -%} {% endif %}

<div class="mt-3 mb-3">
<img {{ src }} {{ class }} {{ alt }} w="1200" h="630" {{ lqip }} />
{%- if page.image.alt -%}
<figcaption class="text-center pt-2 pb-2">
{{ page.image.alt }}
</figcaption>
{%- endif -%}
</div>
{% endif %}

<div class="d-flex justify-content-between">
<!-- author(s) -->
<span>
{% if page.author %} {% assign authors = page.author %} {% elsif
page.authors %} {% assign authors = page.authors %} {% endif %} {{
site.data.locales[lang].post.written_by }}

<em>
{% if authors %} {% for author in authors %} {% if
site.data.authors[author].url -%}
<a href="{{ site.data.authors[author].url }}"
>{{ site.data.authors[author].name }}</a
>
{%- else -%} {{ site.data.authors[author].name }} {%- endif %} {%
unless forloop.last %}{{ '</em
>,
<em
>' }}{% endunless %} {% endfor %} {% else %}
<a href="{{ site.social.links[0] }}">{{ site.social.name }}</a>
{% endif %}
</em>
</span>

<div>
<!-- pageviews -->
{% if site.pageviews.provider and
site.analytics[site.pageviews.provider].id %}
<span>
<em id="pageviews">
<i class="fas fa-spinner fa-spin small"></i>
</em>
{{ site.data.locales[lang].post.pageview_measure }}
</span>
{% endif %}

<!-- read time -->
{% include read-time.html content=content prompt=true lang=lang %}
</div>
</div>
</div>
</header>

{% if enable_toc %}
<div
id="toc-bar"
class="d-flex align-items-center justify-content-between invisible"
>
<span class="label text-truncate">{{ page.title }}</span>
<button type="button" class="toc-trigger btn me-1">
<i class="fa-solid fa-list-ul fa-fw"></i>
</button>
</div>

<button
id="toc-solo-trigger"
type="button"
class="toc-trigger btn btn-outline-secondary btn-sm"
>
<span class="label ps-2 pe-1"
>{{- site.data.locales[lang].panel.toc -}}</span
>
<i class="fa-solid fa-angle-right fa-fw"></i>
</button>

<dialog id="toc-popup" class="p-0">
<div
class="header d-flex flex-row align-items-center justify-content-between"
>
<div class="label text-truncate py-2 ms-4">{{- page.title -}}</div>
<button
id="toc-popup-close"
type="button"
class="btn mx-1 my-1 opacity-75"
>
<i class="fas fa-close"></i>
</button>
</div>
<div id="toc-popup-content" class="px-4 py-3 pb-4"></div>
</dialog>
{% endif %}

<div class="content">{{ content }}</div>

<div class="post-tail-wrapper text-muted">
<!-- categories -->
{% if page.categories.size > 0 %}
<div class="post-meta mb-3">
<i class="far fa-folder-open fa-fw me-1"></i>
{% for category in page.categories %}
<a
href="{{ site.baseurl }}/categories/{{ category | slugify | url_encode }}/"
>{{ category }}</a
>
{%- unless forloop.last -%},{%- endunless -%} {% endfor %}
</div>
{% endif %}

<!-- tags -->
{% if page.tags.size > 0 %}
<div class="post-tags">
<i class="fa fa-tags fa-fw me-1"></i>
{% for tag in page.tags %}
<a
href="{{ site.baseurl }}/tags/{{ tag | slugify | url_encode }}/"
class="post-tag no-text-decoration"
>
{{- tag -}}
</a>
{% endfor %}
</div>
{% endif %}

<div
class="post-tail-bottom d-flex justify-content-between align-items-center mt-5 pb-2"
>
<div class="license-wrapper">
{% if site.data.locales[lang].copyright.license.template %} {% capture
_replacement %}
<a href="{{ site.data.locales[lang].copyright.license.link }}">
{{ site.data.locales[lang].copyright.license.name }}
</a>
{% endcapture %} {{ site.data.locales[lang].copyright.license.template |
replace: ':LICENSE_NAME', _replacement }} {% endif %}
</div>

{% include post-sharing.html lang=lang %}
</div>
<!-- .post-tail-bottom -->
</div>
<!-- div.post-tail-wrapper -->
</article>
157 changes: 157 additions & 0 deletions _posts/2024-11-04-word2vec.md
@@ -0,0 +1,157 @@
---
title: "Word2vec: Encoding Meaning with Vectors"
date: 2024-11-04
categories: [AI, Natural Language Processing]
tags: [study, natural language processing, machine learning, artificial intelligence, word2vec]
math: true
toc: true
---

<!-- prettier-ignore -->
* TOC
{:toc}

# Word2vec: Encoding Meaning with Vectors

## Why Do We Care?

Encoding the meaning of natural language in a computer is difficult, and by meaning I want something more useful than a dictionary of definitions: the goal is a representation of semantics that algorithms can work with directly. Example tasks include finding relevant documents from a natural language query (as a search engine does), evaluating the sentiment of reviews, or even generating new text and images as popular chatbots can. A good semantic representation of natural language is extremely useful for all of these tasks.

## Encoding Words

To begin, let's consider encoding the meaning of individual words. Of course, words and letters are already represented as numbers in a computer, but the semantic relationships are not obvious in this form. For example, it is not clear that _hippopotamus_ and _animal_ are related just by looking at the letters. Ideally, our representations should be similar for words with similar meanings and different for words with different meanings.

## It's All About Relationships

As a child, you were probably taught that you can reason about the meaning of an unknown word by considering the words around it. For example, in the sentence _I ate a juicy, \_\_\_ apple_, you know the missing word is an adjective, and one usually associated with apples. A good guess would be _red_, and other reasonable guesses might be _green_ or _big_. You probably wouldn't guess _hippopotamus_. If I gave you more and more examples where the answer was _red_, you would gradually learn the meaning of _red_ from the things it appears alongside: fire can be red, cars can be red, clothes can be red, blood is red, and so on. In this sense, all of these objects are similar. But how does this help us encode words on a computer?

This is where machine learning comes in: we can turn the guessing game into a training task for a model. First, we define a vocabulary (all of the words the model knows) and map each word to a one-hot encoding (sometimes referred to as 1-of-V). We can then define a simple neural network with an input layer the size of our vocabulary, one hidden layer and an output layer. The [original paper](https://doi.org/10.48550/arXiv.1301.3781) presented two methods: continuous bag-of-words (CBOW), which uses the surrounding words to predict the middle word, and Skip-gram, which uses the middle word to predict the surrounding words.
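
As a rough sketch of that first step (not code from the paper; the toy corpus and helper names are my own), mapping a small vocabulary to one-hot vectors might look like this:

```python
import numpy as np

# A toy corpus stands in for real training text.
corpus = "i ate a juicy red apple and a juicy green apple".split()

# Build the vocabulary: one index per unique word.
vocab = sorted(set(corpus))
word_to_idx = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the 1-of-V (one-hot) vector for a word."""
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("apple"))  # all zeros except a single 1 at apple's index
```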

![The model architectures for word2vec](/images/word2vec_arch.png)_The model architectures for
word2vec_

## Continuous Bag-of-words (CBOW)

The continuous bag-of-words method uses the context around a word to guess the missing word. Each of the surrounding words is embedded using the weights from the input to the hidden layer, and the average of these embeddings in the hidden layer is then used to predict the missing word. For efficiency, the prediction uses a hierarchical softmax structure, similar to a Huffman tree, that turns predicting the word into a sequence of binary decisions. Since this post is not focused on this optimisation technique I won't describe it any further and instead refer you to
[this paper](https://proceedings.neurips.cc/paper_files/paper/2008/file/1e056d2b0ebd5c878c550da6ac5d3724-Paper.pdf)
on it.

The most important part is that the hidden layer can then be treated as a high-dimensional vector representation that takes on some semantic meaning. This is because predictions for the missing word are more likely to be correct when words with similar meanings are mapped to similar representations. Since the input word embeddings are averaged, the order of the context words does not matter, so it is classified as a bag-of-words method. Furthermore, the embeddings are dense, real-valued vectors rather than sparse one-hot encodings, so they can be thought of as a continuous representation. Hence the technique is named continuous bag-of-words (CBOW).
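
To make the data flow concrete, here is a minimal CBOW forward pass sketched in plain NumPy. The sizes, weight matrices and function name are illustrative only, and it uses an ordinary softmax over the whole vocabulary rather than the hierarchical softmax described above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 5                      # toy sizes, not the paper's
W_in = rng.normal(size=(vocab_size, embed_dim))    # input -> hidden weights (the word embeddings)
W_out = rng.normal(size=(embed_dim, vocab_size))   # hidden -> output weights

def cbow_predict(context_ids: list[int]) -> np.ndarray:
    """Average the context word embeddings and score every word in the vocabulary."""
    hidden = W_in[context_ids].mean(axis=0)        # the shared hidden representation
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())            # ordinary softmax; the paper uses
    return exp / exp.sum()                         # hierarchical softmax for speed

probs = cbow_predict([1, 4, 7, 2])                 # e.g. two words either side of the target
print(probs.argmax())                              # index of the most likely centre word
```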

## Skip-gram

The skip-gram method performs the opposite task: given an input word, it predicts the surrounding context words. In the [first implementation](https://doi.org/10.48550/arXiv.1301.3781), the model is given an input word and the label is sampled from the context words around it, with words that appear closer to the input word sampled more often. Once again, hierarchical softmax is used for efficiency.
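
A simple way to approximate this sampling scheme (my own illustrative sketch, not the paper's code) is to shrink the context window to a random size at each position, so that nearby words are chosen as labels more often than distant ones:

```python
import random

def skipgram_pairs(tokens: list[str], max_window: int = 5):
    """Yield (input_word, context_word) training pairs.

    The effective window is sampled uniformly from 1..max_window at each
    position, so words close to the input appear as labels more often
    than distant ones.
    """
    for i, centre in enumerate(tokens):
        window = random.randint(1, max_window)
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield centre, tokens[j]

sentence = "the quick brown fox jumps over the lazy dog".split()
print(list(skipgram_pairs(sentence, max_window=2))[:5])
```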

## Noise Contrastive Estimation (Negative Sampling)

Additionally, there is a [follow-up](https://doi.org/10.48550/arXiv.1310.4546) to the skip-gram paper which instead trains the model on pairs made up of real context words and randomly drawn noise words. The model's objective is then to distinguish genuine context words from the noise samples. The dot product of two vectors acts as a similarity measure, since a higher dot product indicates that the vectors point in a similar direction (it is the magnitude of the projection of one onto the other, scaled by the other's length). The loss function is therefore tied directly to this similarity measure: context words should be similar to the input word, with large positive dot products, while noise words should be dissimilar, with large negative dot products. Further details can be found in
[the paper](https://doi.org/10.48550/arXiv.1310.4546).
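
As a sketch of how such a loss could be computed (illustrative NumPy with made-up toy dimensions; see the paper for the exact objective), the context word is pushed towards a large positive dot product with the input word while each noise word is pushed towards a large negative one:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_in, v_context, v_noise):
    """Loss for one (input, context) pair plus k noise words.

    v_in and v_context are embedding vectors; v_noise has shape (k, dim).
    The real context word should have a large positive dot product with
    the input word, while each noise word should have a large negative one.
    """
    positive = np.log(sigmoid(v_in @ v_context))
    negative = np.log(sigmoid(-(v_noise @ v_in))).sum()
    return -(positive + negative)

rng = np.random.default_rng(0)
dim, k = 8, 5                                      # toy embedding size and noise count
loss = negative_sampling_loss(rng.normal(size=dim),
                              rng.normal(size=dim),
                              rng.normal(size=(k, dim)))
print(loss)
```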

## Did it Work?

So, the idea was to encode the meaning of words into vectors. In the
[follow-up paper for the skip-gram model](https://doi.org/10.48550/arXiv.1310.4546), the authors investigate this by defining an analogy task. The questions have a format similar to _Germany is to Berlin as France is to \_\_\_?_, where the correct answer would be _Paris_ (since it is the capital city of France). To answer such a question, the embeddings of the input words are looked up and the answer is taken to be the word in the vocabulary whose embedding has the highest cosine similarity to the vector,

$$
\text{vec("Germany") - vec("Berlin") + vec("France")}
$$

with the cosine similarity defined based on the dot product of two vectors,

$$
\text{cosine similarity} = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert\mathbf{A}\rVert\lVert\mathbf{B}\rVert}
$$

This can be interpreted as defining a direction that captures the capital-city relationship (vec("Berlin") - vec("Germany")) and then adding that meaning to the vector for a specific country, _France_, so that the result should be close to the vector for the capital city of _France_, _Paris_ (or at least point in the most similar direction).
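
In code, answering such an analogy question reduces to one vector operation and a nearest-neighbour search by cosine similarity. The following sketch assumes a hypothetical `embeddings` dictionary mapping words to trained vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(embeddings: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
    """Return the word closest (by cosine similarity) to vec(b) - vec(a) + vec(c)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(candidates[w], target))

# With trained word2vec vectors loaded into `embeddings`, the call would be
# solve_analogy(embeddings, "Germany", "Berlin", "France")  # ideally "Paris"
```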

The authors visualised this relationship in the following figure,

![A figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions
relating countries to their capital cities are similar.](/images/word2vec_capitals.png)_A
figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions
relating countries to their capital cities are similar_

Now, the training data never explicitly labelled capital-city relationships, so this encoding, and the linear structure of meaning in the vectors, was learned purely from the surrounding words. It appears, then, that semantics really have been encoded into the vectors.

Additionally, due to this linearity in the structure, sentences or paragraphs can be encoded simply
by adding the embeddings of each of the words!
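
For example, with the same hypothetical `embeddings` dictionary as above, a whole sentence could be encoded as a single vector like this:

```python
import numpy as np

def sentence_vector(embeddings: dict[str, np.ndarray], sentence: str) -> np.ndarray:
    """Encode a sentence as the sum of its word embeddings."""
    return np.sum([embeddings[word] for word in sentence.lower().split()], axis=0)

# Sentences with similar meanings should then map to nearby vectors,
# which can be compared with the cosine similarity defined above.
```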

## Some Important Takeaways

While word2vec has largely been superseded by Large Language Models (LLMs) based on the Transformer architecture, dense vector representations of words and sentences that capture semantic meaning are still very much in use. In fact, Transformer architectures appear to learn similar structures in their embeddings as well. The main difference is that the architecture combines the words in a sentence much more cleverly than simply adding their vectors together. The details are a topic for another day, but suffice it to say that the ideas in these papers are still extremely relevant.

Additionally, I think it's important to highlight that what made this model work was not a fancy architecture but a clever training method. Similar techniques have more recently been used to train powerful Transformer models such as
[BERT](https://doi.org/10.48550/arXiv.1810.04805) and even
[vision models](https://doi.org/10.48550/arXiv.2111.06377).

Finally, I want to link a video that makes a very good case for why representing meaning in these high-dimensional vector spaces works so well. It feels surprising that there could be enough distinct directions to represent so many unrelated meanings, and the video (particularly towards the end) explains why high-dimensional vector spaces can hold this much information.

<center><iframe width="560" height="315" src="https://www.youtube.com/embed/9-Jl0dxWQs8?si=r3e9_zgJP1pfANhh" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></center>
Binary file added images/word2vec_arch.png
Binary file added images/word2vec_capitals.png
