Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors #3188

Merged
merged 12 commits into from
Mar 22, 2022

Conversation

rock420
Copy link
Contributor

@rock420 rock420 commented Jul 6, 2021

Fixes: #3015

This PR provides gensim support for get_sentence_vector() method. It also implements get_mean_vector() under keyedVectors as a general method to get the average wordvecs from a list of keys.

Tasks to complete -

  • Implement get_mean_vector() under keyedVectors.
  • Implement get_sentence_vector() under FastTextKeyedVectors.
  • Refactor any existing method which performs an average itself.

@mpenkov mpenkov marked this pull request as draft July 6, 2021 21:59
@rock420
Copy link
Contributor Author

rock420 commented Jul 9, 2021

all_keys, mean = set(), []
for key, weight in positive + negative:
   if isinstance(key, ndarray):
      mean.append(weight * key)
   else:
      mean.append(weight * self.get_vector(key, norm=True))
      if self.has_index_for(key):
          all_keys.add(self.get_index(key))

@gojomo, @piskvorky I am trying to refactor the most_similar function. The key in this function can be a numpy array or str or int but in the current implementation of get_mean_vector, it can support only str or int.
Do you also want to support a list of numpy arrays in get_mean_vector() or use the function only in the case of str, int key in most_similar()?

@piskvorky
Copy link
Owner

piskvorky commented Jul 9, 2021

For existing functions, let's try to keep backward compatibility = support the same input types.

For new functions, we have freedom. I have no strong preference either way. Supporting the same input types makes sense (consistency) but if it makes the code too ugly / too much work, we can simplify the function signature.

@rock420
Copy link
Contributor Author

rock420 commented Jul 10, 2021

Supporting the same input types makes sense (consistency)

Yeah, we can do this by some simple changes in get_mean_vector.

@rock420 rock420 changed the title [WIP] Fasttext like get_sentence_vector() Fasttext like get_sentence_vector() Jul 10, 2021
@rock420 rock420 marked this pull request as ready for review July 10, 2021 08:40
@rock420
Copy link
Contributor Author

rock420 commented Jul 12, 2021

@piskvorky, I have updated the PR.
But some checks are failing because of an error while updating sbt in Linux.

@piskvorky
Copy link
Owner

piskvorky commented Jul 12, 2021

Something to do with sbt. I've seen the error before, but not sure what it's about. @mpenkov do you know why we need scala at all?

I restarted the jobs now, maybe that helps. EDIT: nope :(

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 12, 2021

Not sure. I've merged develop into this PR, let's see if that helps.

@piskvorky piskvorky changed the title Fasttext like get_sentence_vector() Add get_sentence_vector() to FastText and get_mean_vector() to KeyedVectors Aug 12, 2021
@piskvorky
Copy link
Owner

@rock420 the related PRs have been merged. We're planning a new release soon, I'd love to get this merged – can you please merge develop into your PR & resolve conflicts?

@gojomo OK to merge from your side?

@piskvorky piskvorky added this to the Next release milestone Feb 19, 2022
@rock420
Copy link
Contributor Author

rock420 commented Feb 20, 2022

@piskvorky, I have rebased the branch with develop and resolved all the conflicts.

@piskvorky piskvorky requested a review from gojomo February 20, 2022 13:00
@piskvorky
Copy link
Owner

LGTM – @mpenkov did you figure out why some tests are failing? I'd like to get a "green light" before merging. Thanks.

@codecov
Copy link

codecov bot commented Mar 18, 2022

Codecov Report

Merging #3188 (68109cf) into develop (ac3bbcd) will decrease coverage by 0.04%.
The diff coverage is 90.90%.

@@             Coverage Diff             @@
##           develop    #3188      +/-   ##
===========================================
- Coverage    81.43%   81.38%   -0.05%     
===========================================
  Files          122      122              
  Lines        21052    21090      +38     
===========================================
+ Hits         17144    17165      +21     
- Misses        3908     3925      +17     
Impacted Files Coverage Δ
gensim/models/fasttext.py 89.73% <80.00%> (-0.23%) ⬇️
gensim/models/keyedvectors.py 82.43% <90.24%> (+0.19%) ⬆️
gensim/test/test_keyedvectors.py 99.29% <100.00%> (+0.02%) ⬆️
gensim/scripts/segment_wiki.py 83.33% <0.00%> (-3.79%) ⬇️
gensim/downloader.py 79.67% <0.00%> (-1.65%) ⬇️
gensim/matutils.py 77.07% <0.00%> (-1.07%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ac3bbcd...68109cf. Read the comment docs.

@mpenkov
Copy link
Collaborator

mpenkov commented Mar 22, 2022

Merging. Thank you for your contribution and your patience @rock420 !

@mpenkov mpenkov merged commit c19f223 into piskvorky:develop Mar 22, 2022
@stelmath
Copy link

If I may make an observation: as a user of Facebook's FastText library, I was looking for the equivalent of Facebook's get_sentence_vector() in Gensim. This brought me to this PR. However, I spotted a difference in the type of input between Gensim's and Facebook's implementation that might be lurking and confusing for users that are primed with using Facebook's implementation.

Here is what I mean:
Facebook's get_sentence_vector() docstring:

    Given a string, get a single vector represenation. This function
    assumes to be given a single line of text. We split words on
    whitespace (space, newline, tab, vertical tab) and the control
    characters carriage return, formfeed and the null character.

Gensim's get_mean_vector() docstring (get_sentece_vector() is based on get_mean_vector()):

Get the mean vector for a given list of keys.
Parameters
----------
keys : list of (str or int or ndarray)
Keys specified by string or int ids or numpy array.

So, when I first tried out Gensim's get_sentence_vector() I did not dive into the function definition, and I naively assumed it would expect the same input as Facebook's get_sentence_vector(), since Gensim's implementation was inspired by Facebook's implementation. So, I passed something like "A sample sentence" into Gensim's get_sentence_vector() which results in calling self.get_vector() on each character of my input string, instead of each token of it. Perhaps some type hinting at the function definition would help catch this inconsistency. Thank you!

@gojomo
Copy link
Collaborator

gojomo commented May 27, 2022

That's a good point about likely confusion.

However, Gensim typically lets users choose their tokenization, and even use tokens with internal whitespace if desired (including in FastText). Thus, for generality Gensim's model APIs usually work with lists-of-tokens rather than plain-strings-still-expecting-whitespace-tokenization.

That creates an especially tricky choice here: do we match the Facebook FastText operation, or Gensim conventions? Given the exact-same name, it might've made sense to exactly-mimic the string-style invocation in get_sentence_vector(), then add some other method for tokens-style calling.

But given we shipped tokens-style already, probably the most-helpful things we should do would be to add a loud warning (or possibly even error) with explanatory text whenever a plain-string is provided, reminding the user to perform their own .split() if they want to match Facebook's native fasttext behavior.

@piskvorky
Copy link
Owner

piskvorky commented May 27, 2022

We definitely want to accept tokens – like everywhere else in Gensim.

+1 on throwing an exception when users provide a single string where a sequence of strings was expected. IIRC we already do such "input sanity checks" in other places in Gensim, so should be straightforward to extend them to get_sentence_vector() also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add convenience get_sentence_vector()-like methods for FastText, other models
5 participants