Merge pull request #1446 from flairNLP/GH-1400-onehot-documentation

GH-1400: onehot documentation

alanakbik authored Feb 24, 2020
2 parents 4ce32c7 + 2a04603 commit a35727f

Showing 3 changed files with 79 additions and 2 deletions.
10 changes: 8 additions & 2 deletions flair/embeddings.py
@@ -462,7 +462,7 @@ def extra_repr(self):


class OneHotEmbeddings(TokenEmbeddings):
"""One-hot encoded embeddings."""
"""One-hot encoded embeddings. """

def __init__(
self,
@@ -471,7 +471,13 @@ def __init__(
embedding_length: int = 300,
min_freq: int = 3,
):

"""
Initializes one-hot encoded word embeddings and a trainable embedding layer
:param corpus: you need to pass a Corpus in order to construct the vocabulary
:param field: by default, the 'text' of tokens is embedded, but you can also embed tags such as 'pos'
:param embedding_length: dimensionality of the trainable embedding layer
:param min_freq: minimum frequency of a word to become part of the vocabulary
"""
super().__init__()
self.name = "one-hot"
self.static_embeddings = False
1 change: 1 addition & 0 deletions resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
@@ -19,6 +19,7 @@ The following word embeddings are currently supported:
| [`FastTextEmbeddings`](/resources/docs/embeddings/FASTTEXT_EMBEDDINGS.md) | Word embeddings with subword features | [Bojanowski et al. (2017)](https://aclweb.org/anthology/Q17-1010) |
| [`FlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Contextualized character-level embeddings | [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/) |
| [`PooledFlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Pooled variant of `FlairEmbeddings` | [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) |
| [`OneHotEmbeddings`](/resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md) | Standard one-hot embeddings of text or tags | - |
| [`OpenAIGPTEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) and [`OpenAIGPT2Embeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained OpenAIGPT models | [Radford et al. (2018)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) |
| [`RoBERTaEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from RoBERTa | [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) |
| [`TransformerXLEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained transformer-XL | [Dai et al. (2019)](https://arxiv.org/abs/1901.02860) |
70 changes: 70 additions & 0 deletions resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md
@@ -0,0 +1,70 @@
# One-Hot Embeddings

`OneHotEmbeddings` encode each word in a vocabulary as a one-hot vector that is fed into a trainable embedding layer.
Unlike most other embeddings, they therefore encode no prior knowledge. They also differ in that they need to see a
`Corpus` during instantiation, from which they build a vocabulary of the most common words in the corpus plus an UNK
token for all rare words.
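
To make the idea concrete, here is a minimal, self-contained sketch of that mechanism (illustrative only, not flair's
actual implementation), assuming a toy corpus and a frequency threshold:

```python
from collections import Counter

import torch
import torch.nn as nn

# toy "corpus" of tokenized sentences and a frequency threshold
toy_corpus = [["the", "grass", "is", "green"], ["the", "sky", "is", "blue"]]
min_freq = 2

# build the vocabulary: frequent words get an index, everything else maps to UNK
counts = Counter(token for sentence in toy_corpus for token in sentence)
vocab = {"<unk>": 0}
for token, freq in counts.items():
    if freq >= min_freq:
        vocab[token] = len(vocab)

# one-hot encoding followed by a linear layer amounts to an index lookup
# in a trainable embedding table
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

tokens = ["the", "dragon", "is", "green"]
indices = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
vectors = embedding_layer(indices)  # shape (4, 300); rare words fall back to <unk>
```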

You initialize these embeddings like this:

```python
from flair.embeddings import OneHotEmbeddings
from flair.datasets import UD_ENGLISH
from flair.data import Sentence

# load a corpus
corpus = UD_ENGLISH()

# init embedding
embeddings = OneHotEmbeddings(corpus=corpus)

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embeddings.embed(sentence)
```

By default, the 'text' of a token (i.e. its lexical value) is one-hot encoded and the embedding layer has a dimensionality
of 300. Note that this layer is randomly initialized, so these embeddings only become meaningful once they are
[trained in a downstream task](/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).
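
For example, you could train them as part of a part-of-speech tagger. The snippet below is only a rough sketch along
the lines of the training tutorial; the output directory and hyperparameters are placeholder values:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus and embeddings as above
corpus = UD_ENGLISH()
embeddings = OneHotEmbeddings(corpus=corpus)

# dictionary of tags we want to predict
tag_type = 'upos'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# a simple sequence tagger on top of the (trainable) one-hot embeddings
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# training is what turns the randomly initialized embedding layer into useful vectors
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-upos', max_epochs=10)
```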

### Vocabulary size

By default, all words that occur in the corpus at least 3 times are part of the vocabulary. You can change
this using the `min_freq` parameter. For instance, if your corpus is very large, you might want to set a
higher `min_freq`:

```python
embeddings = OneHotEmbeddings(corpus=corpus, min_freq=10)
```

### Embedding dimensionality

By default, the embeddings have a dimensionality of 300. If you want to try higher or lower values, you can use the
`embedding_length` parameter:

```python
embeddings = OneHotEmbeddings(corpus=corpus, embedding_length=100)
```


## Embedding other tags

Sometimes you want to embed something other than text. For instance, part-of-speech tags or named entity
annotations may be available that you want to use. If such a field exists in your corpus, you can embed
it by passing the `field` parameter. For example, the UD corpora provide a universal part-of-speech tag for each
token ('upos'). Embed it like so:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings

# load corpus
corpus = UD_ENGLISH()

# init embeddings over the 'upos' tag field
embeddings = OneHotEmbeddings(corpus=corpus, field='upos')
```

During initialization, this should print a vocabulary of size 18, consisting of the universal part-of-speech tags.
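
In practice, tag embeddings like these are usually combined with word-level embeddings rather than used on their own.
A possible sketch using `StackedEmbeddings` (the 'glove' embeddings here are just one example choice):

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings, StackedEmbeddings, WordEmbeddings

corpus = UD_ENGLISH()

# stack classic word embeddings with one-hot embeddings of the 'upos' field
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    OneHotEmbeddings(corpus=corpus, field='upos'),
])
```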
