Merge pull request #1446 from flairNLP/GH-1400-onehot-documentation

GH-1400: onehot documentation

alanakbik authored Feb 24, 2020
2 parents 4ce32c7 + 2a04603 commit a35727f

Showing 3 changed files with 79 additions and 2 deletions.
10 changes: 8 additions & 2 deletions flair/embeddings.py
@@ -462,7 +462,7 @@ def extra_repr(self):


class OneHotEmbeddings(TokenEmbeddings):
"""One-hot encoded embeddings."""
"""One-hot encoded embeddings. """

def __init__(
self,
@@ -471,7 +471,13 @@ def __init__(
embedding_length: int = 300,
min_freq: int = 3,
):

"""
Initializes one-hot encoded word embeddings and a trainable embedding layer
:param corpus: you need to pass a Corpus in order to construct the vocabulary
:param field: by default, the 'text' of tokens is embedded, but you can also embed tags such as 'pos'
:param embedding_length: dimensionality of the trainable embedding layer
:param min_freq: minimum frequency of a word to become part of the vocabulary
"""
super().__init__()
self.name = "one-hot"
self.static_embeddings = False
1 change: 1 addition & 0 deletions resources/docs/TUTORIAL_4_ELMO_BERT_FLAIR_EMBEDDING.md
@@ -19,6 +19,7 @@ The following word embeddings are currently supported:
| [`FastTextEmbeddings`](/resources/docs/embeddings/FASTTEXT_EMBEDDINGS.md) | Word embeddings with subword features | [Bojanowski et al. (2017)](https://aclweb.org/anthology/Q17-1010) |
| [`FlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Contextualized character-level embeddings | [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/) |
| [`PooledFlairEmbeddings`](/resources/docs/embeddings/FLAIR_EMBEDDINGS.md) | Pooled variant of `FlairEmbeddings` | [Akbik et al. (2019)](https://www.aclweb.org/anthology/N19-1078/) |
| [`OneHotEmbeddings`](/resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md) | Standard one-hot embeddings of text or tags | - |
| [`OpenAIGPTEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) and [`OpenAIGPT2Embeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained OpenAIGPT models | [Radford et al. (2018)](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) |
| [`RoBERTaEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from RoBERTa | [Liu et al. (2019)](https://arxiv.org/abs/1907.11692) |
| [`TransformerXLEmbeddings`](/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md) | Embeddings from pretrained transformer-XL | [Dai et al. (2019)](https://arxiv.org/abs/1901.02860) |
70 changes: 70 additions & 0 deletions resources/docs/embeddings/ONE_HOT_EMBEDDINGS.md
@@ -0,0 +1,70 @@
# One-Hot Embeddings

`OneHotEmbeddings` encode each word in a vocabulary as a one-hot vector that is fed into a trainable embedding layer.
Unlike most other embeddings, they therefore encode no prior knowledge. They also differ in that they need to see a
`Corpus` during instantiation, from which they build a vocabulary of the most common words in the corpus plus an UNK
token for all rare words.
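
To make the idea concrete, here is a minimal, self-contained sketch of that mechanism (illustrative only, not flair's
actual implementation), assuming a toy corpus and a frequency threshold:

```python
from collections import Counter

import torch
import torch.nn as nn

# toy "corpus" of tokenized sentences and a frequency threshold
toy_corpus = [["the", "grass", "is", "green"], ["the", "sky", "is", "blue"]]
min_freq = 2

# build the vocabulary: frequent words get an index, everything else maps to UNK
counts = Counter(token for sentence in toy_corpus for token in sentence)
vocab = {"<unk>": 0}
for token, freq in counts.items():
    if freq >= min_freq:
        vocab[token] = len(vocab)

# one-hot encoding followed by a linear layer amounts to an index lookup
# in a trainable embedding table
embedding_layer = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300)

tokens = ["the", "dragon", "is", "green"]
indices = torch.tensor([vocab.get(t, vocab["<unk>"]) for t in tokens])
vectors = embedding_layer(indices)  # shape (4, 300); rare words fall back to <unk>
```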

You initialize these embeddings like this:

```python
from flair.embeddings import OneHotEmbeddings
from flair.datasets import UD_ENGLISH
from flair.data import Sentence

# load a corpus
corpus = UD_ENGLISH()

# init embedding
embeddings = OneHotEmbeddings(corpus=corpus)

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embeddings.embed(sentence)
```

By default, the 'text' of a token (i.e. its lexical value) is one-hot encoded and the embedding layer has a dimensionality
of 300. Note that this layer is randomly initialized, so these embeddings only become meaningful once they are
[trained in a downstream task](/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).
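
For example, you could train them as part of a part-of-speech tagger. The snippet below is only a rough sketch along
the lines of the training tutorial; the output directory and hyperparameters are placeholder values:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# corpus and embeddings as above
corpus = UD_ENGLISH()
embeddings = OneHotEmbeddings(corpus=corpus)

# dictionary of tags we want to predict
tag_type = 'upos'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# a simple sequence tagger on top of the (trainable) one-hot embeddings
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

# training is what turns the randomly initialized embedding layer into useful vectors
trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/example-upos', max_epochs=10)
```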

### Vocabulary size

By default, all words that occur in the corpus at least 3 times are part of the vocabulary. You can change
this using the `min_freq` parameter. For instance, if your corpus is very large, you might want to set a
higher `min_freq`:

```python
embeddings = OneHotEmbeddings(corpus=corpus, min_freq=10)
```

### Embedding dimensionality

By default, the embeddings have a dimensionality of 300. If you want to try higher or lower values, you can use the
`embedding_length` parameter:

```python
embeddings = OneHotEmbeddings(corpus=corpus, embedding_length=100)
```


## Embedding other tags

Sometimes you want to embed something other than text. For instance, part-of-speech tags or named entity
annotations may be available that you want to use. If such a field exists in your corpus, you can embed
it by passing the `field` parameter. For example, the UD corpora provide a universal part-of-speech tag for each
token ('upos'). Embed it like so:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings

# load corpus
corpus = UD_ENGLISH()

# init embeddings over the 'upos' tag field
embeddings = OneHotEmbeddings(corpus=corpus, field='upos')
```

During initialization, this should print a vocabulary of size 18, consisting of the universal part-of-speech tags.
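
In practice, tag embeddings like these are usually combined with word-level embeddings rather than used on their own.
A possible sketch using `StackedEmbeddings` (the 'glove' embeddings here are just one example choice):

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings, StackedEmbeddings, WordEmbeddings

corpus = UD_ENGLISH()

# stack classic word embeddings with one-hot embeddings of the 'upos' field
stacked_embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    OneHotEmbeddings(corpus=corpus, field='upos'),
])
```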
