Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

rock420 · 2021-07-06T17:39:22Z

This PR provides gensim support for get_sentence_vector() method. It also implements get_mean_vector() under keyedVectors as a general method to get the average wordvecs from a list of keys.

Tasks to complete -

Implement get_mean_vector() under keyedVectors.
Implement get_sentence_vector() under FastTextKeyedVectors.
Refactor any existing method which performs an average itself.

rock420 · 2021-07-09T10:34:28Z

all_keys, mean = set(), []
for key, weight in positive + negative:
   if isinstance(key, ndarray):
      mean.append(weight * key)
   else:
      mean.append(weight * self.get_vector(key, norm=True))
      if self.has_index_for(key):
          all_keys.add(self.get_index(key))

@gojomo, @piskvorky I am trying to refactor the most_similar function. The key in this function can be a numpy array or str or int but in the current implementation of get_mean_vector, it can support only str or int.
Do you also want to support a list of numpy arrays in get_mean_vector() or use the function only in the case of str, int key in most_similar()?

piskvorky · 2021-07-09T10:48:35Z

For existing functions, let's try to keep backward compatibility = support the same input types.

For new functions, we have freedom. I have no strong preference either way. Supporting the same input types makes sense (consistency) but if it makes the code too ugly / too much work, we can simplify the function signature.

rock420 · 2021-07-10T05:30:15Z

Supporting the same input types makes sense (consistency)

Yeah, we can do this by some simple changes in get_mean_vector.

rock420 · 2021-07-12T06:21:16Z

@piskvorky, I have updated the PR.
But some checks are failing because of an error while updating sbt in Linux.

piskvorky · 2021-07-12T07:37:47Z

Something to do with sbt. I've seen the error before, but not sure what it's about. @mpenkov do you know why we need scala at all?

I restarted the jobs now, maybe that helps. EDIT: nope :(

mpenkov · 2021-08-12T01:46:23Z

Not sure. I've merged develop into this PR, let's see if that helps.

gensim/models/fasttext.py

gensim/models/keyedvectors.py

piskvorky · 2022-02-19T18:59:27Z

@rock420 the related PRs have been merged. We're planning a new release soon, I'd love to get this merged – can you please merge develop into your PR & resolve conflicts?

@gojomo OK to merge from your side?

Resolve merge conflicts in most_similar

rock420 · 2022-02-20T12:42:50Z

@piskvorky, I have rebased the branch with develop and resolved all the conflicts.

piskvorky · 2022-02-25T12:45:12Z

LGTM – @mpenkov did you figure out why some tests are failing? I'd like to get a "green light" before merging. Thanks.

codecov · 2022-03-18T15:37:52Z

Codecov Report

Merging #3188 (68109cf) into develop (ac3bbcd) will decrease coverage by 0.04%.
The diff coverage is 90.90%.

@@             Coverage Diff             @@
##           develop    #3188      +/-   ##
===========================================
- Coverage    81.43%   81.38%   -0.05%     
===========================================
  Files          122      122              
  Lines        21052    21090      +38     
===========================================
+ Hits         17144    17165      +21     
- Misses        3908     3925      +17

Impacted Files	Coverage Δ
gensim/models/fasttext.py	`89.73% <80.00%> (-0.23%)`	⬇️
gensim/models/keyedvectors.py	`82.43% <90.24%> (+0.19%)`	⬆️
gensim/test/test_keyedvectors.py	`99.29% <100.00%> (+0.02%)`	⬆️
gensim/scripts/segment_wiki.py	`83.33% <0.00%> (-3.79%)`	⬇️
gensim/downloader.py	`79.67% <0.00%> (-1.65%)`	⬇️
gensim/matutils.py	`77.07% <0.00%> (-1.07%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac3bbcd...68109cf. Read the comment docs.

mpenkov · 2022-03-22T02:16:26Z

Merging. Thank you for your contribution and your patience @rock420 !

stelmath · 2022-05-26T11:56:54Z

If I may make an observation: as a user of Facebook's FastText library, I was looking for the equivalent of Facebook's get_sentence_vector() in Gensim. This brought me to this PR. However, I spotted a difference in the type of input between Gensim's and Facebook's implementation that might be lurking and confusing for users that are primed with using Facebook's implementation.

Here is what I mean:
Facebook's get_sentence_vector() docstring:

    Given a string, get a single vector represenation. This function
    assumes to be given a single line of text. We split words on
    whitespace (space, newline, tab, vertical tab) and the control
    characters carriage return, formfeed and the null character.

Gensim's get_mean_vector() docstring (get_sentece_vector() is based on get_mean_vector()):

Get the mean vector for a given list of keys.
Parameters
----------
keys : list of (str or int or ndarray)
Keys specified by string or int ids or numpy array.

So, when I first tried out Gensim's get_sentence_vector() I did not dive into the function definition, and I naively assumed it would expect the same input as Facebook's get_sentence_vector(), since Gensim's implementation was inspired by Facebook's implementation. So, I passed something like "A sample sentence" into Gensim's get_sentence_vector() which results in calling self.get_vector() on each character of my input string, instead of each token of it. Perhaps some type hinting at the function definition would help catch this inconsistency. Thank you!

gojomo · 2022-05-27T17:10:26Z

That's a good point about likely confusion.

However, Gensim typically lets users choose their tokenization, and even use tokens with internal whitespace if desired (including in FastText). Thus, for generality Gensim's model APIs usually work with lists-of-tokens rather than plain-strings-still-expecting-whitespace-tokenization.

That creates an especially tricky choice here: do we match the Facebook FastText operation, or Gensim conventions? Given the exact-same name, it might've made sense to exactly-mimic the string-style invocation in get_sentence_vector(), then add some other method for tokens-style calling.

But given we shipped tokens-style already, probably the most-helpful things we should do would be to add a loud warning (or possibly even error) with explanatory text whenever a plain-string is provided, reminding the user to perform their own .split() if they want to match Facebook's native fasttext behavior.

piskvorky · 2022-05-27T18:14:32Z

We definitely want to accept tokens – like everywhere else in Gensim.

+1 on throwing an exception when users provide a single string where a sequence of strings was expected. IIRC we already do such "input sanity checks" in other places in Gensim, so should be straightforward to extend them to get_sentence_vector() also.

mpenkov marked this pull request as draft July 6, 2021 21:59

rock420 changed the title ~~[WIP] Fasttext like get_sentence_vector()~~ Fasttext like get_sentence_vector() Jul 10, 2021

rock420 marked this pull request as ready for review July 10, 2021 08:40

piskvorky changed the title ~~Fasttext like get_sentence_vector()~~ Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors Aug 12, 2021

gojomo reviewed Aug 12, 2021

View reviewed changes

gensim/models/fasttext.py Show resolved Hide resolved

gojomo reviewed Aug 12, 2021

View reviewed changes

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved

gojomo reviewed Aug 12, 2021

View reviewed changes

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved

gojomo reviewed Aug 12, 2021

View reviewed changes

gensim/models/keyedvectors.py Outdated Show resolved Hide resolved

gojomo mentioned this pull request Aug 12, 2021

Tidy up KeyedVectors.most_similar() API #3000

Merged

piskvorky added this to the Next release milestone Feb 19, 2022

rock420 added 10 commits February 20, 2022 16:38

Implement get_mean_vector() for keyedvectors

0ca84d8

Add test cases for get_mean_vector()

e1a2147

Implement get_sentence_vector()

91c8eda

Fix __contains__ method to consider no ngrams case

8e2d740

Add support for ndarrray in get_mean_vector

849add6

Resolve merge conflicts

3a80009

Refactor rank_by_centrality & n_similarity

147e53e

Add post-normalization in get_mean_vector()

9f7adf2

Refactor iteration and Improve doc comment

2c041d0

Resolve merge conflicts in most_similar

Convert 2D ndarray to list in most_similar

330e5c2

rock420 force-pushed the sentence_vector branch from 2acc5b0 to 330e5c2 Compare February 20, 2022 12:35

piskvorky requested a review from gojomo February 20, 2022 13:00

gojomo approved these changes Feb 21, 2022

View reviewed changes

piskvorky assigned mpenkov Feb 25, 2022

Merge remote-tracking branch 'upstream/develop' into sentence_vector

cc934d8

Merge remote-tracking branch 'upstream/develop' into sentence_vector

68109cf

mpenkov merged commit c19f223 into piskvorky:develop Mar 22, 2022

piskvorky mentioned this pull request Jun 21, 2022

Upgraded from version 4.0.1. to 4.2.0, now KeyedVectors.most_similar() throws error #3358

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

rock420 commented Jul 6, 2021 •

edited

Loading

rock420 commented Jul 9, 2021

piskvorky commented Jul 9, 2021 •

edited

Loading

rock420 commented Jul 10, 2021

rock420 commented Jul 12, 2021

piskvorky commented Jul 12, 2021 •

edited

Loading

mpenkov commented Aug 12, 2021

piskvorky commented Feb 19, 2022

rock420 commented Feb 20, 2022

piskvorky commented Feb 25, 2022

codecov bot commented Mar 18, 2022 •

edited

Loading

mpenkov commented Mar 22, 2022

stelmath commented May 26, 2022

gojomo commented May 27, 2022

piskvorky commented May 27, 2022 •

edited

Loading

Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

Conversation

rock420 commented Jul 6, 2021 • edited Loading

rock420 commented Jul 9, 2021

piskvorky commented Jul 9, 2021 • edited Loading

rock420 commented Jul 10, 2021

rock420 commented Jul 12, 2021

piskvorky commented Jul 12, 2021 • edited Loading

mpenkov commented Aug 12, 2021

piskvorky commented Feb 19, 2022

rock420 commented Feb 20, 2022

piskvorky commented Feb 25, 2022

codecov bot commented Mar 18, 2022 • edited Loading

Codecov Report

mpenkov commented Mar 22, 2022

stelmath commented May 26, 2022

gojomo commented May 27, 2022

piskvorky commented May 27, 2022 • edited Loading

rock420 commented Jul 6, 2021 •

edited

Loading

piskvorky commented Jul 9, 2021 •

edited

Loading

piskvorky commented Jul 12, 2021 •

edited

Loading

codecov bot commented Mar 18, 2022 •

edited

Loading

piskvorky commented May 27, 2022 •

edited

Loading