GH-1400: onehot documentation

# One-Hot Embeddings

`OneHotEmbeddings` are embeddings that encode each word in a vocabulary as a one-hot vector, followed by an embedding
layer. Unlike most other embeddings, they thus do not encode any prior knowledge. They also differ in that they need to
see a `Corpus` during instantiation, so they can build up a vocabulary consisting
of the most common words seen in the corpus, plus an UNK token for all rare words.

You initialize these embeddings like this:

```python
from flair.data import Sentence
from flair.embeddings import OneHotEmbeddings
from flair.datasets import UD_ENGLISH

# load a corpus
corpus = UD_ENGLISH()

# init embedding
embeddings = OneHotEmbeddings(corpus=corpus)

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embeddings.embed(sentence)
```

By default, the 'text' of a token (i.e. its lexical value) is one-hot encoded and the embedding layer has a
dimensionality of 300. However, this layer is randomly initialized, meaning that these embeddings do not make sense
unless they are [trained in a task](/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md).
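
To see why the random initialization matters, note that a one-hot encoding followed by an embedding layer is equivalent to selecting one row of a randomly initialized matrix. A minimal sketch of that mechanic, using a toy vocabulary and plain NumPy rather than Flair's internals:

```python
import numpy as np

# toy vocabulary (hypothetical; Flair builds its own from the corpus)
vocab = {'<unk>': 0, 'the': 1, 'grass': 2, 'is': 3, 'green': 4}
embedding_length = 300

# randomly initialized embedding layer: one row per vocabulary entry
rng = np.random.default_rng(42)
embedding_matrix = rng.standard_normal((len(vocab), embedding_length))

def embed(word):
    # map rare/unseen words to the UNK index
    index = vocab.get(word.lower(), vocab['<unk>'])
    # one-hot vector for the word
    one_hot = np.zeros(len(vocab))
    one_hot[index] = 1.0
    # multiplying by the matrix just picks out row `index`
    return one_hot @ embedding_matrix

vec = embed('grass')
print(vec.shape)  # (300,)
```

Until the matrix is updated by training, these row vectors are pure noise, which is why the embeddings only become meaningful after training on a downstream task.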

### Vocabulary size

By default, all words that occur in the corpus at least 3 times are part of the vocabulary. You can change
this using the `min_freq` parameter. For instance, if your corpus is very large you might want to set a
higher `min_freq`:

```python
embeddings = OneHotEmbeddings(corpus=corpus, min_freq=10)
```
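
The effect of `min_freq` can be illustrated with a plain-Python frequency cutoff (a toy token list, not Flair's actual vocabulary-building code):

```python
from collections import Counter

# toy corpus tokens (hypothetical)
tokens = ['the', 'the', 'the', 'grass', 'grass', 'is', 'is', 'is', 'green']
min_freq = 3

counts = Counter(tokens)
# keep words seen at least min_freq times, plus an UNK entry for the rest
vocab = ['<unk>'] + sorted(word for word, count in counts.items() if count >= min_freq)
print(vocab)  # ['<unk>', 'is', 'the']
```

Raising `min_freq` shrinks the vocabulary, so more of the rare words collapse into the UNK token.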

### Embedding dimensionality

By default, the embeddings have a dimensionality of 300. If you want to try higher or lower values, you can use the
`embedding_length` parameter:

```python
embeddings = OneHotEmbeddings(corpus=corpus, embedding_length=100)
```

## Embedding other tags

Sometimes you want to embed something other than text. For instance, you may have part-of-speech tags or
named entity annotations available that you want to use. If this field exists in your corpus, you can embed
it by passing the `field` parameter. For instance, the UD corpora have a universal part-of-speech tag for each
token ('upos'). Embed it like so:

```python
from flair.datasets import UD_ENGLISH
from flair.embeddings import OneHotEmbeddings

# load corpus
corpus = UD_ENGLISH()

# embed POS tags
embeddings = OneHotEmbeddings(corpus=corpus, field='upos')
```

Instantiating these embeddings should print a vocabulary of size 18, consisting of universal part-of-speech tags.
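
With `field='upos'`, the vocabulary is built over tag values rather than token texts; since Universal Dependencies defines 17 universal POS tags, adding the UNK entry yields a vocabulary of size 18. A rough sketch of the idea, on hypothetical data rather than Flair's implementation:

```python
# toy tagged tokens standing in for a UD corpus (hypothetical data)
tagged = [('The', 'DET'), ('grass', 'NOUN'), ('is', 'AUX'), ('green', 'ADJ')]

# vocabulary over the tag field instead of the token text, plus UNK
tag_vocab = ['<unk>'] + sorted({tag for _, tag in tagged})
print(tag_vocab)  # ['<unk>', 'ADJ', 'AUX', 'DET', 'NOUN']
```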