Add gensim.models.BaseKeyedVectors.add_entity method for fill KeyedVectors in manual way. Fix #1942 #1957
Merged: menshikh-iv merged 13 commits into piskvorky:develop from persiyanov:feature/add-word-method-to-keyed-vectors on Mar 20, 2018
Commits
99bcf44 Introduce BaseKeyedVectors.add(...) method
06955c4 make default count=1
089d346 add test on add_word method
f428571 Merge branch 'develop' into feature/add-word-method-to-keyed-vectors
0aff584 address @menshikh-iv comments
f6e5e79 fix test_keyedvectors after removing add_word alias
d4b0ffe add __setitem__, add bulk entities processing + some tests on new fun…
912d462 addressing @menshikh-iv comments on docstrings
3611320 Merge branch 'develop' into feature/add-word-method-to-keyed-vectors
437a142 addressing @gojomo comments
737cd36 adrressing nitpicks
070fbed make self.vectors = np.zeros((0, vector_size)) by default
2294c07 fix pep8
@@ -73,7 +73,7 @@
     PYEMD_EXT = False
 
 from numpy import dot, zeros, float32 as REAL, empty, memmap as np_memmap, \
-    double, array, vstack, sqrt, newaxis, integer, \
+    double, array, zeros, vstack, sqrt, newaxis, integer, \
     ndarray, sum as np_sum, prod, argmax, divide as np_divide
 import numpy as np
 from gensim import utils, matutils  # utility fnc for pickling, common scipy operations etc
@@ -109,7 +109,7 @@ def __str__(self):
 class BaseKeyedVectors(utils.SaveLoad):
 
     def __init__(self, vector_size):
-        self.vectors = []
+        self.vectors = zeros((0, vector_size))
         self.vocab = {}
         self.vector_size = vector_size
         self.index2entity = []
@@ -154,6 +154,65 @@ def get_vector(self, entity):
         else:
             raise KeyError("'%s' not in vocabulary" % entity)
 
+    def add(self, entities, weights, replace=False):
+        """Add entities and theirs vectors in a manual way.
+        If some entity is already in the vocabulary, old vector is keeped unless `replace` flag is True.
+
+        Parameters
+        ----------
+        entities : list of str
+            Entities specified by string tags.
+        weights: {list of numpy.ndarray, numpy.ndarray}
+            List of 1D np.array vectors or 2D np.array of vectors.
+        replace: bool, optional
+            Flag indicating whether to replace vectors for entities which are already in the vocabulary,
+            if True - replace vectors, otherwise - keep old vectors.
+
+        """
[Review comment] nitpick: multiline docstring should ends with empty line, i.e.
+        if isinstance(entities, string_types):
+            entities = [entities]
+            weights = np.array(weights).reshape(1, -1)
+        elif isinstance(weights, list):
+            weights = np.array(weights)
+
+        in_vocab_mask = np.zeros(len(entities), dtype=np.bool)
+        for idx, entity in enumerate(entities):
+            if entity in self.vocab:
+                in_vocab_mask[idx] = True
+
+        # add new entities to the vocab
+        for idx in np.nonzero(~in_vocab_mask)[0]:
+            entity = entities[idx]
+            self.vocab[entity] = Vocab(index=len(self.vocab), count=1)
+            self.index2entity.append(entity)
+
+        # add vectors for new entities
+        self.vectors = vstack((self.vectors, weights[~in_vocab_mask]))
+
+        # change vectors for in_vocab entities if `replace` flag is specified
+        if replace:
+            in_vocab_idxs = [self.vocab[entities[idx]].index for idx in np.nonzero(in_vocab_mask)[0]]
+            self.vectors[in_vocab_idxs] = weights[in_vocab_mask]
+
+    def __setitem__(self, entities, weights):
+        """Add entities and theirs vectors in a manual way.
+        If some entity is already in the vocabulary, old vector is replaced with the new one.
+        This method is alias for `add` with `replace=True`.
+
+        Parameters
+        ----------
+        entities : {str, list of str}
+            Entities specified by string tags.
+        weights: {list of numpy.ndarray, numpy.ndarray}
+            List of 1D np.array vectors or 2D np.array of vectors.
+
+        """
+        if not isinstance(entities, list):
+            entities = [entities]
+            weights = weights.reshape(1, -1)
+
+        self.add(entities, weights, replace=True)
+
     def __getitem__(self, entities):
         """
         Accept a single entity (string tag) or list of entities as input.
@@ -163,6 +222,7 @@ def __getitem__(self, entities):
 
         If a list, return designated tags' vector representations as a
         2D numpy array: #tags x #vector_size.
+
         """
         if isinstance(entities, string_types):
             # allow calls like trained_model['office'], as a shorthand for trained_model[['office']]
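To make the keep-versus-replace behaviour of `add` and `__setitem__` in the diff above easier to follow, here is a minimal standalone sketch. It is not gensim's class: `TinyKeyedVectors` is a hypothetical stand-in, the vocabulary maps entities straight to row indices instead of `Vocab` objects, `str` stands in for `string_types`, and the builtin `bool` is used as the mask dtype (the `np.bool` alias used in the diff was deprecated in NumPy 1.20).

```python
import numpy as np


class TinyKeyedVectors:
    """Hypothetical stand-in that mirrors the logic of the diff; not gensim's class."""

    def __init__(self, vector_size):
        self.vector_size = vector_size
        self.vectors = np.zeros((0, vector_size))  # empty 2D array, as in the diff
        self.vocab = {}  # entity -> row index (the diff stores Vocab objects here)
        self.index2entity = []

    def add(self, entities, weights, replace=False):
        if isinstance(entities, str):
            # A single entity is treated as a one-element batch.
            entities = [entities]
            weights = np.array(weights).reshape(1, -1)
        else:
            weights = np.asarray(weights)

        # True where the entity is already known.
        in_vocab_mask = np.array([e in self.vocab for e in entities], dtype=bool)

        # Register the new entities in the vocabulary.
        for idx in np.nonzero(~in_vocab_mask)[0]:
            entity = entities[idx]
            self.vocab[entity] = len(self.vocab)
            self.index2entity.append(entity)

        # Append the vectors of the new entities.
        self.vectors = np.vstack((self.vectors, weights[~in_vocab_mask]))

        # Overwrite the vectors of known entities only when asked to.
        if replace and in_vocab_mask.any():
            rows = [self.vocab[entities[idx]] for idx in np.nonzero(in_vocab_mask)[0]]
            self.vectors[rows] = weights[in_vocab_mask]

    def __setitem__(self, entities, weights):
        self.add(entities, weights, replace=True)


kv = TinyKeyedVectors(vector_size=2)
kv.add(["cat", "dog"], [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
kv.add("cat", [5.0, 5.0])         # already present, replace=False: old vector is kept
kv["dog"] = np.array([9.0, 9.0])  # __setitem__ always replaces
```

In this sketch, the second `add` call leaves the `cat` row untouched, while the item assignment overwrites the `dog` row in place without growing the matrix.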
zeros imported twice
@gojomo yeah, i've fixed it
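The `zeros` import discussed here backs the new `self.vectors = zeros((0, vector_size))` default from commit 070fbed. The thread does not state the motivation, so the following is an assumption: an empty 2D array with the right number of columns can be passed straight to `vstack`, whereas the previous `[]` default cannot be stacked with rows of non-zero width. A small sketch with made-up sizes:

```python
import numpy as np

vector_size = 3  # illustrative value, not taken from the PR
new_rows = np.ones((2, vector_size))

# The (0, vector_size) array stacks cleanly with the first batch of rows.
empty_2d = np.zeros((0, vector_size))
stacked = np.vstack((empty_2d, new_rows))

# An empty list is promoted to shape (1, 0), so the column counts do not match.
try:
    np.vstack(([], new_rows))
    list_default_works = True
except ValueError:
    list_default_works = False
```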