Showing 4 changed files with 338 additions and 0 deletions.
@@ -0,0 +1,181 @@
---
layout: default
refactor: true
panel_includes:
  - toc
tail_includes:
  - related-posts
  - post-nav
  - comments
---

{% include lang.html %} {% include toc-status.html %}

<article class="px-1" data-toc="{{ enable_toc }}">
  <header>
    <h1 data-toc-skip>{{ page.title }}</h1>
    {% if page.description %}
    <p class="post-desc fw-light mb-4">{{ page.description }}</p>
    {% endif %}

    <div class="post-meta text-muted">
      <!-- published date -->
      <span>
        {{ site.data.locales[lang].post.posted }} {% include datetime.html
        date=page.date tooltip=true lang=lang %}
      </span>

      <!-- lastmod date -->
      {% if page.last_modified_at and page.last_modified_at != page.date %}
      <span>
        {{ site.data.locales[lang].post.updated }} {% include datetime.html
        date=page.last_modified_at tooltip=true lang=lang %}
      </span>
      {% endif %} {% if page.image %} {% capture src %}src="{{ page.image.path |
      default: page.image }}"{% endcapture %} {% capture class
      %}class="preview-img{% if page.image.no_bg %}{{ ' no-bg' }}{% endif %}"{%
      endcapture %} {% capture alt %}alt="{{ page.image.alt | xml_escape |
      default: "Preview Image" }}"{% endcapture %} {% if page.image.lqip %} {%-
      capture lqip -%}lqip="{{ page.image.lqip }}"{%- endcapture -%} {% endif %}

      <div class="mt-3 mb-3">
        <img {{ src }} {{ class }} {{ alt }} w="1200" h="630" {{ lqip }} />
        {%- if page.image.alt -%}
        <figcaption class="text-center pt-2 pb-2">
          {{ page.image.alt }}
        </figcaption>
        {%- endif -%}
      </div>
      {% endif %}

      <div class="d-flex justify-content-between">
        <!-- author(s) -->
        <span>
          {% if page.author %} {% assign authors = page.author %} {% elsif
          page.authors %} {% assign authors = page.authors %} {% endif %} {{
          site.data.locales[lang].post.written_by }}

          <em>
            {% if authors %} {% for author in authors %} {% if
            site.data.authors[author].url -%}
            <a href="{{ site.data.authors[author].url }}"
              >{{ site.data.authors[author].name }}</a
            >
            {%- else -%} {{ site.data.authors[author].name }} {%- endif %} {%
            unless forloop.last %}{{ '</em
            >,
            <em
            >' }}{% endunless %} {% endfor %} {% else %}
            <a href="{{ site.social.links[0] }}">{{ site.social.name }}</a>
            {% endif %}
          </em>
        </span>

        <div>
          <!-- pageviews -->
          {% if site.pageviews.provider and
          site.analytics[site.pageviews.provider].id %}
          <span>
            <em id="pageviews">
              <i class="fas fa-spinner fa-spin small"></i>
            </em>
            {{ site.data.locales[lang].post.pageview_measure }}
          </span>
          {% endif %}

          <!-- read time -->
          {% include read-time.html content=content prompt=true lang=lang %}
        </div>
      </div>
    </div>
  </header>

  {% if enable_toc %}
  <div
    id="toc-bar"
    class="d-flex align-items-center justify-content-between invisible"
  >
    <span class="label text-truncate">{{ page.title }}</span>
    <button type="button" class="toc-trigger btn me-1">
      <i class="fa-solid fa-list-ul fa-fw"></i>
    </button>
  </div>

  <button
    id="toc-solo-trigger"
    type="button"
    class="toc-trigger btn btn-outline-secondary btn-sm"
  >
    <span class="label ps-2 pe-1"
      >{{- site.data.locales[lang].panel.toc -}}</span
    >
    <i class="fa-solid fa-angle-right fa-fw"></i>
  </button>

  <dialog id="toc-popup" class="p-0">
    <div
      class="header d-flex flex-row align-items-center justify-content-between"
    >
      <div class="label text-truncate py-2 ms-4">{{- page.title -}}</div>
      <button
        id="toc-popup-close"
        type="button"
        class="btn mx-1 my-1 opacity-75"
      >
        <i class="fas fa-close"></i>
      </button>
    </div>
    <div id="toc-popup-content" class="px-4 py-3 pb-4"></div>
  </dialog>
  {% endif %}

  <div class="content">{{ content }}</div>

  <div class="post-tail-wrapper text-muted">
    <!-- categories -->
    {% if page.categories.size > 0 %}
    <div class="post-meta mb-3">
      <i class="far fa-folder-open fa-fw me-1"></i>
      {% for category in page.categories %}
      <a
        href="{{ site.baseurl }}/categories/{{ category | slugify | url_encode }}/"
        >{{ category }}</a
      >
      {%- unless forloop.last -%},{%- endunless -%} {% endfor %}
    </div>
    {% endif %}

    <!-- tags -->
    {% if page.tags.size > 0 %}
    <div class="post-tags">
      <i class="fa fa-tags fa-fw me-1"></i>
      {% for tag in page.tags %}
      <a
        href="{{ site.baseurl }}/tags/{{ tag | slugify | url_encode }}/"
        class="post-tag no-text-decoration"
      >
        {{- tag -}}
      </a>
      {% endfor %}
    </div>
    {% endif %}

    <div
      class="post-tail-bottom d-flex justify-content-between align-items-center mt-5 pb-2"
    >
      <div class="license-wrapper">
        {% if site.data.locales[lang].copyright.license.template %} {% capture
        _replacement %}
        <a href="{{ site.data.locales[lang].copyright.license.link }}">
          {{ site.data.locales[lang].copyright.license.name }}
        </a>
        {% endcapture %} {{ site.data.locales[lang].copyright.license.template |
        replace: ':LICENSE_NAME', _replacement }} {% endif %}
      </div>

      {% include post-sharing.html lang=lang %}
    </div>
    <!-- .post-tail-bottom -->
  </div>
  <!-- div.post-tail-wrapper -->
</article>
@@ -0,0 +1,157 @@
---
title: "Word2vec: Encoding Meaning with Vectors"
date: 2024-11-04
categories: [AI, Natural Language Processing]
tags: [study, natural language processing, machine learning, artificial intelligence, word2vec]
math: true
toc: true
---

<!-- prettier-ignore -->
* TOC
{:toc}

# Word2vec: Encoding Meaning with Vectors

## Why Do We Care?

There are a few things that make encoding the meaning of natural language in computers difficult. Of
course, by this I mean something more useful than a dictionary of definitions: we want the meaning
in a format that algorithms on a computer can easily work with. Example tasks might include finding
relevant documents from a natural language query (as a search engine does), evaluating the sentiment
of reviews, or even generating new text or images as popular chatbots can. Having a semantic
representation of natural language is extremely useful for all of these tasks.

## Encoding Words

To begin, let's consider encoding the meaning of individual words. Of course, words and letters are
already represented as numbers in a computer, but the semantic relationships are not obvious in this
form. For example, it is not clear that _hippopotamus_ and _animal_ are somehow related just by
looking at the letters. Our representations should somehow be similar for words with similar
meanings and different for words with different meanings.

## It's All About Relationships

As a child, you were probably taught that you can reason about the meaning of an unknown word by
considering the words around it. For example, in the sentence _I ate a juicy, \_\_\_ apple_, you
know the missing word is an adjective, and one that is usually associated with apples. Therefore, a
good guess would be _red_, and other reasonable guesses might be _green_ or _big_. You probably
wouldn't guess _hippopotamus_. And if I gave you more and more examples where the answer was _red_,
you could pin down the meaning of _red_ from the things it appears with: fire can be red, cars can
be red, clothes can be red, blood is red, etc. In fact, in this sense, all of these objects are
similar. But how does this help us encode words on a computer?

Well, this is where machine learning comes in: we can create a guessing game for a machine learning
model. First, we define a vocabulary (all of the possible words the model knows) by mapping each
word to a one-hot encoding (sometimes referred to as 1-of-V). Then we define a simple neural network
with an input layer the size of our vocabulary, one hidden layer and an output layer. In the
[original paper](https://doi.org/10.48550/arXiv.1301.3781), two methods were presented: the first
uses the surrounding words to predict the middle word and is known as continuous bag-of-words
(CBOW), while the other uses the middle word to predict the surrounding words and is known as
Skip-gram.

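To make the 1-of-V setup concrete, here is a minimal sketch in Python/NumPy. This is my own toy
example rather than code from the paper, and the tiny corpus and variable names are made up purely
for illustration:

```python
import numpy as np

# Toy corpus and vocabulary; both are made up for illustration.
corpus = "i ate a juicy red apple".split()
vocab = sorted(set(corpus))
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return the 1-of-V encoding of `word` (all zeros except a single 1)."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(vocab)             # ['a', 'apple', 'ate', 'i', 'juicy', 'red']
print(one_hot("apple"))  # [0. 1. 0. 0. 0. 0.]
```
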
![The model architectures for word2vec](/images/word2vec_arch.png)_The model architectures for word2vec_

## Continuous Bag-of-words (CBOW)

The continuous bag-of-words method uses the context around a word to guess the missing word. Each of
the surrounding words is embedded using the weights from the input layer to the hidden layer, and
the average of these embeddings in the hidden layer is then used to predict the missing word. For
efficiency, the prediction uses a hierarchical softmax structure, similar to a Huffman tree, which
turns predicting a word into a sequence of binary decisions. Since this post is not focused on this
optimisation technique, I won't describe it any further and instead refer you to
[this paper](https://proceedings.neurips.cc/paper_files/paper/2008/file/1e056d2b0ebd5c878c550da6ac5d3724-Paper.pdf)
on it.

The most important part is that the hidden layer can then be treated as a high-dimensional vector
representation that takes on some semantic meaning: predictions for the missing word are more likely
to be correct for words with similar meanings. Since the input word embeddings are averaged, the
order of the context words does not matter, so it is classified as a bag-of-words method.
Furthermore, the embeddings are dense, high-dimensional vectors rather than sparse one-hot
encodings, and so they can be thought of as a continuous representation. Hence the technique is
named continuous bag-of-words (CBOW).

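As a rough illustration of the CBOW forward pass, here is a toy NumPy sketch of my own. It averages
the context embeddings and scores the whole vocabulary with a plain softmax rather than the
hierarchical softmax the paper actually uses, and the sizes and weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 6, 8                                # toy vocabulary size and hidden (embedding) size
W_in = rng.normal(scale=0.1, size=(V, H))  # input -> hidden weights (the word embeddings)
W_out = rng.normal(scale=0.1, size=(H, V)) # hidden -> output weights

def cbow_forward(context_ids):
    """Average the context word embeddings, then score every vocabulary word."""
    h = W_in[context_ids].mean(axis=0)     # hidden layer = mean of the context embeddings
    scores = h @ W_out                     # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()             # plain softmax (the paper uses hierarchical softmax)

# Predict the word sitting in the middle of context word ids 0, 2, 4 and 5.
probs = cbow_forward([0, 2, 4, 5])
print("Predicted centre word id:", int(probs.argmax()))
```
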
## Skip-gram

The skip-gram method performs the opposite task: given an input word, it predicts the surrounding
context words. In the [first implementation](https://doi.org/10.48550/arXiv.1301.3781), the model is
given an input word and the label word is sampled from the context words around it, with a higher
sampling rate given to words that appear closer to the input word. Once again, it uses the
hierarchical softmax technique for efficiency.

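Here is a sketch of how such training pairs could be generated. It is my own simplification of the
paper's scheme: the window size is re-sampled for each position, so closer words end up in more
pairs than distant ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def skipgram_pairs(token_ids, max_window=5):
    """Return (input_word, context_word) training pairs for a list of token ids.

    The effective window is sampled uniformly from 1..max_window at each position,
    so nearby words are used more often than distant ones.
    """
    pairs = []
    for i, centre in enumerate(token_ids):
        window = rng.integers(1, max_window + 1)  # randomly shrink the window
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((centre, token_ids[j]))
    return pairs

print(skipgram_pairs([0, 1, 2, 3, 4], max_window=2))
```
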
## Noise Contrastive Estimation (Negative Sampling)

Additionally, there is a [follow-up](https://doi.org/10.48550/arXiv.1310.4546) to the skip-gram
paper which instead uses pairs of words genuinely drawn from the context together with randomly
drawn noise samples. The objective of the model is then to distinguish which words are context words
and which are noise samples. The dot product of two vectors is used as a similarity measurement,
since a higher dot product indicates that the two vectors point in a similar direction (it is
proportional to the magnitude of the projection of one onto the other). Therefore, the loss function
is directly tied to this measure of similarity: context words should be similar to the input word
(large positive dot products) and noise words should be dissimilar (large negative dot products).
Further details can be found in [the paper](https://doi.org/10.48550/arXiv.1310.4546).

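Here is a toy sketch of that objective, again my own simplification: the embeddings are random
placeholders and the noise words are drawn uniformly at random, whereas the paper draws them from a
unigram distribution raised to the 3/4 power:

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 1000, 100
W_in = rng.normal(scale=0.1, size=(V, H))   # embeddings for input (centre) words
W_out = rng.normal(scale=0.1, size=(V, H))  # embeddings for output (context/noise) words

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(centre_id, context_id, num_noise=5):
    """Distinguish the true context word from random noise words via dot products."""
    centre = W_in[centre_id]
    positive = sigmoid(centre @ W_out[context_id])  # want this probability close to 1
    # Noise words drawn uniformly here; the paper uses a unigram^(3/4) distribution.
    noise_ids = rng.integers(0, V, size=num_noise)
    negative = sigmoid(-centre @ W_out[noise_ids].T) # want these close to 1 as well
    return -(np.log(positive) + np.log(negative).sum())

print(negative_sampling_loss(centre_id=3, context_id=7))
```
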
## Did it Work?

So the idea was to encode the meaning of words into vectors. In the
[follow-up paper for the skip-gram model](https://doi.org/10.48550/arXiv.1310.4546), the authors
investigate this by defining an analogy task. The questions have a format similar to _Germany is to
Berlin as France is to \_\_\_?_, where the correct answer is _Paris_ (since it is the capital city
of France). To answer such a question, the embeddings were looked up for each of the input words and
the output word was then found as the word in the vocabulary with the highest cosine similarity to
the vector,

$$
\mathrm{vec}(\text{"Berlin"}) - \mathrm{vec}(\text{"Germany"}) + \mathrm{vec}(\text{"France"})
$$

with the cosine similarity defined in terms of the dot product of two vectors,

$$
\text{cosine similarity} = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\lVert\mathbf{A}\rVert\lVert\mathbf{B}\rVert}
$$

This can be interpreted as defining a vector that points in a direction meaning something like
_capital city of_ (vec("Berlin") - vec("Germany")) and then adding that meaning to the vector for a
specific country, _France_, so that the result should be close to the capital city of _France_,
_Paris_ (or at least point in the most similar direction).

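As a sketch, the analogy lookup might look like the following. The vectors below are random
placeholders (so this toy version will not actually answer _Paris_), but with vectors from a trained
word2vec model the same arithmetic applies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: in practice these would come from a trained word2vec model.
words = ["germany", "berlin", "france", "paris", "red", "apple"]
vectors = {w: rng.normal(size=50) for w in words}

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the word closest to vec(b) - vec(a) + vec(c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in words if w not in (a, b, c)]  # exclude the question words
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

# With real trained vectors this should answer "paris".
print(analogy("germany", "berlin", "france"))
```
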
The authors visualised this relationship in the following figure,

![A figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions relating countries to their capital cities are similar.](/images/word2vec_capitals.png)_A figure that uses PCA to project the 1000 dimensional embedding vectors into 2D. The directions relating countries to their capital cities are similar_

Now, the training data never explicitly mentioned anything about capital cities, so this encoding
and the linear structure of meaning in the vectors were learned purely from the surrounding-word
prediction task. It appears that semantics really have been encoded into the word vectors.

Additionally, thanks to this linear structure, sentences or paragraphs can be encoded simply by
adding together the embeddings of their words!

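For example, a sentence vector could be built like this (again with random placeholder vectors
standing in for trained embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word vectors; real ones would come from a trained word2vec model.
vectors = {w: rng.normal(size=50) for w in "i ate a juicy red apple today".split()}

def sentence_vector(sentence):
    """Encode a sentence by summing the vectors of its words."""
    return np.sum([vectors[w] for w in sentence.split()], axis=0)

s1 = sentence_vector("i ate a red apple")
s2 = sentence_vector("i ate a juicy apple today")
similarity = s1 @ s2 / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(f"cosine similarity between the two sentences: {similarity:.2f}")
```
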
## Some Important Takeaways

While word2vec has been mostly superseded by Large Language Models (LLMs) based on the Transformer
architecture, dense vector representations of words and sentences that capture semantic meaning are
still relevant. In fact, Transformer architectures appear to learn similar structures in their
embeddings as well. The main difference is that the architecture combines the words in a sentence
much more cleverly than simply adding the vectors together. The details of this are a topic for
another day, but suffice it to say that the ideas in this paper are still extremely relevant.

Additionally, I think it's important to highlight that what made this model work was not a fancy
architecture but rather a clever training method. Similar methods have more recently been used for
LLMs: in particular, similar techniques (predicting masked-out parts of the input) were used to
train powerful Transformer models such as [BERT](https://doi.org/10.48550/arXiv.1810.04805) or even
[vision models](https://doi.org/10.48550/arXiv.2111.06377).

Finally, I want to link to a video that I thought made a very good case for why representing meaning
in these high-dimensional vector spaces works so well. It feels surprising that there could be
enough different directions to represent so many unrelated meanings. The video explains why
high-dimensional vector spaces can hold this much information (particularly towards the end).

<center><iframe width="560" height="315" src="https://www.youtube.com/embed/9-Jl0dxWQs8?si=r3e9_zgJP1pfANhh" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></center>