-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gensim.models.BaseKeyedVectors.add_entity
method for fill KeyedVectors
in manual way. Fix #1942
#1957
Add gensim.models.BaseKeyedVectors.add_entity
method for fill KeyedVectors
in manual way. Fix #1942
#1957
Conversation
Hello @persiyanov
|
BaseKeyedVectors.add
for fill kv in manual way. Fix #1942
gensim/models/keyedvectors.py
Outdated
@@ -154,6 +154,19 @@ def get_vector(self, entity): | |||
else: | |||
raise KeyError("'%s' not in vocabulary" % entity) | |||
|
|||
def add(self, entity, weights): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's about re-using this function in (this is duplication right now from https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L182)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, need to add tests to check this functionality:
- load some kv, add more vectors in a manual way and check that this added fine
- create empty kv, fill it manually and check that all fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About reusing this function:
It's a bit difficult because in this function vstack
is used to append new word vector to self.vectors
, while add_word
in utils_any2vec creates vectors
array at first (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L180) and then it just inserts vectors into it (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/utils_any2vec.py#L197).
While it's possible to follow DRY here, the interface of BaseKeyedVectors.add()
method will be more complicated (or I can change the logic in utils_any2vec
-- not to create vectors = np.zeros(...)
but append each word to the array, but it could decrease the performance of load_word2vec_format
function).
If some of these two options is okay, I'll implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, thanks for suggestion, let's stay it as is.
gensim/test/test_keyedvectors.py
Outdated
@@ -153,6 +153,27 @@ def test_wv_property(self): | |||
"""Test that the deprecated `wv` property returns `self`. To be removed in v4.0.0.""" | |||
self.assertTrue(self.vectors is self.vectors.wv) | |||
|
|||
def test_add_word(self): | |||
"""Test that adding word in a manual way works correctly.""" | |||
from numpy.random import randn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove import and use np.random.randn
gensim/models/keyedvectors.py
Outdated
@@ -154,6 +154,28 @@ def get_vector(self, entity): | |||
else: | |||
raise KeyError("'%s' not in vocabulary" % entity) | |||
|
|||
def add_entity(self, entity, weights): | |||
"""Accept an entity specified by string tag and vector weights as 1D numpy array with shape (`vector_size`,). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use numpy-style docstrings
gensim/models/keyedvectors.py
Outdated
self.vocab[entity] = Vocab(index=entity_id, count=1) | ||
self.index2entity.append(entity) | ||
|
||
def add_word(self, entity, weights): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this doesn't need?
CC: @gojomo wdyt? |
This sort of functionality is a natural and useful addition; ideally it'd also be joined with other new APIs to assist initial building of a There should be a bulk addition option: otherwise doing lots of single-adds in a loop requires many wasteful reallocations of 1-vector-larger-arrays. Should perhaps use Longer-term, a re: @persiyanov your questions - (1) I'd avoid 'word' in any method names, to stay loyal to intent that this class accepts keys other than just words. (2) if the |
BaseKeyedVectors.add
for fill kv in manual way. Fix #1942gensim.models.BaseKeyedVectors.add_entity
method for fill KeyedVectors
in manual way. Fix #1942
@menshikh-iv @gojomo please, take a look. I've implemented |
Ping @gojomo @menshikh-iv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me @persiyanov: +1:
I have only several nitpicks about docstrings + not sure about add_entity
.
gensim/models/keyedvectors.py
Outdated
---------- | ||
entities : list of str | ||
Entities specified by string tags. | ||
weights: list of np.array or np.array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{list of numpy.ndarray, numpy.ndarray}
gensim/models/keyedvectors.py
Outdated
@@ -154,6 +154,75 @@ def get_vector(self, entity): | |||
else: | |||
raise KeyError("'%s' not in vocabulary" % entity) | |||
|
|||
def add_entity(self, entity, weights, replace=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe remove this method (add_entities
looks enough, wdyt @gojomo?)
gensim/models/keyedvectors.py
Outdated
List of 1D np.array vectors or 2D np.array of vectors. | ||
replace: bool, optional | ||
Boolean flag indicating whether to replace vectors for entities which are already in the vocabulary. | ||
Default, False, means that old vectors for those entities are keeped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to duplicate "default" value for trivial case in docstring, maybe better to write something like
Flag indicating whether to replace vectors for entities which are already in the vocabulary, if True - replace vectors, otherwise - keep old vectors.
gensim/models/keyedvectors.py
Outdated
self.vectors[in_vocab_idxs] = weights[in_vocab_mask] | ||
|
||
def __setitem__(self, entities, weights): | ||
"""Idiomatic way to call `add_entities` with `replace=True`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to write full docstring
@gojomo please have a look (for me LGTM, except |
ping @gojomo |
gensim/models/keyedvectors.py
Outdated
in_vocab_idxs = [] | ||
out_vocab_entities = [] | ||
|
||
for idx, entity in zip(range(len(entities)), entities): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enumerate()
more idiomatic.
gensim/models/keyedvectors.py
Outdated
if len(self.vectors) == 0: | ||
self.vectors = weights[~in_vocab_mask] | ||
else: | ||
self.vectors = vstack((self.vectors, weights[~in_vocab_mask])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might this line work even in the case where len(self.vectors)==0
, making the check/branch unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's not obvious how to do that, because when empty KeyedVectors
object is created, self.vectors = []
is true. In that case, we can't use vstack(([], weights[~in_vocab_mask]))
and ValueError: all the input array dimensions except for the concatenation axis must match exactly
is raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for an empty KeyedVectors
to have a self.vectors
that is already a proper-dimensioned (0, vector_size)
empty ndarray? (Not sure myself, but would simplify things in later places like this.)
gensim/models/keyedvectors.py
Outdated
|
||
in_vocab_mask = np.zeros(len(entities), dtype=np.bool) | ||
in_vocab_idxs = [] | ||
out_vocab_entities = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method might be simpler without separate in_vocab_idxs
and out_vocab_entities
– just driving those ops from the mask, using options like where()
or nonzero()
.
Functionality seems good. I think class has a preexisting terminology issue with the use of word 'entity' where often 'key' would be more-consistent/more-specific. Also, as the 'items' inside this are definitionally 'vectors', generally |
@gojomo @menshikh-iv please, take a look. I've removed I also think that operating with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, great work @persiyanov 👍
gensim/models/keyedvectors.py
Outdated
replace: bool, optional | ||
Flag indicating whether to replace vectors for entities which are already in the vocabulary, | ||
if True - replace vectors, otherwise - keep old vectors. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: multiline docstring should ends with empty line, i.e.
"""
...
last text
"""
gensim/models/keyedvectors.py
Outdated
""" | ||
if isinstance(entities, string_types): | ||
entities = [entities] | ||
weights = weights.reshape(1, -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably, should be weights = np.array(weights).reshape(1, -1)
for case if weights, for example, list of floats
@menshikh-iv Fixed |
gensim/models/keyedvectors.py
Outdated
@@ -73,7 +73,7 @@ | |||
PYEMD_EXT = False | |||
|
|||
from numpy import dot, zeros, float32 as REAL, empty, memmap as np_memmap, \ | |||
double, array, vstack, sqrt, newaxis, integer, \ | |||
double, array, zeros, vstack, sqrt, newaxis, integer, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zeros
imported twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gojomo yeah, i've fixed it
Congratz with first time contribution @persiyanov 👍 🥇 |
@persiyanov Yes, and thanks for your patience through all the subtle refinements! |
Is the word "Entity" has an established meaning in NLP, it comes with certain connotations. Introducing a new concept into gensim like this muddles the waters. Especially since the concept is not really introduced at all in the PR AFAICS—what is an "entity" here? I only skimmed the docstrings and they only mention the word, never explain it. This is confusing and inconsistent with our other docs. |
@piskvorky this modification for Check out other code from this class, for example
|
I agree with @gojomo , the code desperately needs a proper refactoring. The current post-#1777 situation seems untenable. @gojomo any chance you could take this up? @menshikh-iv your snippet doesn't explain |
@piskvorky I don't explain it, I just show that the current code mimics the already existing and doesn't introduce any new concepts. |
Understood. It's an issue with #1777 , not this PR. |
Pull request related to this issue.
With these changes, it would be possible to add new word vectors into a
KeyedVectors
object.@menshikh-iv please, take a look. I will write tests on it after addressing your comments.
Moreover, I have some questions/doubts:
add_entity/add_word
?weights
and raise an exception in a bad case?self.vectors
contiguity: herevectors
list is casted to C-contiguous array. Does numpy preserve C-contiguity after operations such asnp.vstack
? If so, I think I should addnumpy.ascontiguousarray
cast in my code.