GH-1400: onehot documentation #1446

Merged
merged 2 commits into from
Feb 24, 2020
10 changes: 8 additions & 2 deletions flair/embeddings.py
@@ -462,7 +462,7 @@ def extra_repr(self):


class OneHotEmbeddings(TokenEmbeddings):
"""One-hot encoded embeddings."""
"""One-hot encoded embeddings. """

def __init__(
self,
@@ -471,7 +471,13 @@ def __init__(
embedding_length: int = 300,
min_freq: int = 3,
):

"""
Initializes one-hot encoded word embeddings and a trainable embedding layer.
:param corpus: you need to pass a Corpus in order to construct the vocabulary
:param field: by default, the 'text' of tokens is embedded, but you can also embed tags such as 'pos'
:param embedding_length: dimensionality of the trainable embedding layer
:param min_freq: minimum frequency of a word to become part of the vocabulary
"""
super().__init__()
self.name = "one-hot"
self.static_embeddings = False
1 change: 1 addition & 0 deletions resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
@@ -19,6 +19,7 @@ The following word embeddings are currently supported:
| [`FastTextEmbeddings`](/resources/docs/embeddings/FASTTEXT_EMBEDDINGS.md) | Word embeddings with subword features | [Bojanowski et al. (2017)](https://aclweb.org/anthology/Q17-1010) |
| [`FlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Contextualized character-level embeddings | [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/) |
| [`PooledFlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Pooled variant of `FlairEmbeddings` | [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) |
| [`OneHotEmbeddings`](/resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md) | Standard one-hot embeddings of text or tags | - |
| [`OpenAIGPTEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) and [`OpenAIGPT2Embeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained OpenAIGPT models | [Radford et al. (2018)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) |
| [`RoBERTaEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from RoBERTa | [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) |
| [`TransformerXLEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained transformer-XL | [Dai et al. (2019)](https://arxiv.org/abs/1901.02860) |
70 changes: 70 additions & 0 deletions resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md
@@ -0,0 +1,70 @@
# One-Hot Embeddings

`OneHotEmbeddings` are embeddings that encode each word in a vocabulary as a one-hot vector, followed by an embedding
layer. Unlike most other embeddings, they therefore encode no prior knowledge. They also differ in that they need to
see a `Corpus` during instantiation, so that they can build a vocabulary of the most common words in the corpus, plus
an UNK token for all rare words.

You initialize these embeddings like this:

```python
from flair.embeddings import OneHotEmbeddings
from flair.datasets import UD_ENGLISH
from flair.data import Sentence

# load a corpus
corpus = UD_ENGLISH()

# init embedding
embeddings = OneHotEmbeddings(corpus=corpus)

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embeddings.embed(sentence)
```

By default, the 'text' of a token (i.e. its lexical value) is one-hot encoded and the embedding layer has a dimensionality
of 300. However, this layer is randomly initialized, meaning that these embeddings do not make sense unless they are
[trained in a task](/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).
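
Since the embedding layer is trainable, these embeddings are typically used inside a downstream model. As a rough
sketch (not from this tutorial, and assuming a classification corpus such as `TREC_6` plus the usual
`DocumentRNNEmbeddings` / `TextClassifier` / `ModelTrainer` setup, whose exact signatures may vary between Flair
versions):

```python
from flair.datasets import TREC_6
from flair.embeddings import OneHotEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# load a classification corpus (TREC_6 is only an example)
corpus = TREC_6()

# build one-hot word embeddings from this corpus
embeddings = OneHotEmbeddings(corpus=corpus)

# pool the token embeddings into a single document embedding
document_embeddings = DocumentRNNEmbeddings([embeddings], hidden_size=256)

# create a classifier and train it; the embedding layer is updated during training
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary())
trainer = ModelTrainer(classifier, corpus)
trainer.train('resources/classifiers/onehot-trec', max_epochs=10)
```

During training, the randomly initialized embedding layer is optimized together with the rest of the model, which is
what gives these embeddings their meaning.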

### Vocabulary size

By default, all words that occur in the corpus at least 3 times are part of the vocabulary. You can change
this using the `min_freq` parameter. For instance, if your corpus is very large you might want to set a
higher `min_freq`:

```python
embeddings = OneHotEmbeddings(corpus=corpus, min_freq=10)
```

### Embedding dimensionality

By default, the embeddings have a dimensionality of 300. If you want to try higher or lower values, you can use the
`embedding_length` parameter:

```python
embeddings = OneHotEmbeddings(corpus=corpus, embedding_length=100)
```


## Embedding other tags

Sometimes you want to embed something other than text. For instance, you may have part-of-speech tags or named entity
annotations available that you want to use. If such a field exists in your corpus, you can embed it by passing the
`field` parameter. The UD corpora, for example, have a universal part-of-speech tag for each
token ('upos'). Embed it like so:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings

# load corpus
corpus = UD_ENGLISH()

# embed POS tags
embeddings = OneHotEmbeddings(corpus=corpus, field='upos')
```

This should print a vocabulary of size 18 consisting of universal part-of-speech tags.
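
As a rough sketch (reusing the `corpus` and `embeddings` from the snippet above), you can then embed a sentence that
already carries 'upos' tags, for instance one taken directly from the corpus, and inspect the resulting vectors:

```python
# take a sentence from the corpus so its tokens already carry 'upos' tags
sentence = corpus.test[0]

# embed each token based on its 'upos' tag rather than its text
embeddings.embed(sentence)

# every token now has a vector derived from its part-of-speech tag
for token in sentence:
    print(token.text, token.embedding.shape)
```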