Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tidy up KeyedVectors.most_similar() API #3000

Merged
merged 14 commits into from
Aug 13, 2021
Merged

Conversation

simonwiles
Copy link
Contributor

Fixes #2998.

This PR:

  • allows passing a single string-key as the negative argument
  • allows passing a single vector as the positive or negative argument
  • allows passing single values (string-keys or vectors) for both positive and negative arguments in the same call
  • modifies most_similar_cosmul() to match this behaviour as well.

The API is now, imo, a little more straightforward in that it simply wraps single values in a list where appropriate.

There's a comment in most_similar_cosmul at line ~981:

# TODO: Update to better match & share code with most_similar()

but I'm not sure whether the changes in the last commit completely address this or not, so I've left the comment in place.

I don't think these changes necessitate an update to the documentation, which has always specified that lists should be passed to these methods -- this is just a defensive pattern that preserves the shortcut that's in production and makes the API more resistant to violating the principle of least astonishment :)

@simonwiles
Copy link
Contributor Author

simonwiles commented Nov 17, 2020

w/r/t the remove of and not negative in 475b013: It doesn't affect the ability to pass positive as a positional parameter (the most_similar('dog') case), which presumably is invoked in the wild all over the place and so oughtn't to be broken, and it enables calling, e.g. most_similar(positive='dog', negative='cat') (or the same with single vectors), which seems appropriate.

I presume the original intention was to guard against someone calling most_similar('word1', 'word2') with the expectation that they would both be interpreted as positive? Perhaps? Not sure. In any event, the only result of and not negative was that

most_similar('word1', 'word2')

was parsed as

most_similar(positive=['w', 'o', 'r', 'd', '1'], negative=['w', 'o', 'r', 'd', 2'])

instead of

most_similar(positive=['word1'], negative=['w', 'o', 'r', 'd', '2'])

so I figure this is an improvement.

(In fact, in the case of the tweet I cited, it was only the good fortune that the key 'h' wasn't in the model that caused the error and prompted the call for help -- later in the thread the researcher remarks that they'd been using negative='word' for some time and never seen a problem before; results were being returned based on the parsing as negative=['w', 'o', 'r', 'd'] and they'd been taken to be legitimate.)

@gojomo
Copy link
Collaborator

gojomo commented Feb 5, 2021

Thanks for this, sorry it's taken so long to review. A few test-cases that verify/demo the new possibilities (and ensure that the line-comment I made about a possible lost-capability isn't the case) would make this an easy merge.

@mpenkov
Copy link
Collaborator

mpenkov commented Jun 22, 2021

@simonwiles Ping. Are you able to complete this PR?

@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 29, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Jun 29, 2021

Note to self:

  • Before merging, this PR needs tests
  • Minor code cleanup possible (see suggestions above)

@mpenkov mpenkov added the breaks backward-compatibility Change breaks backward compatibility label Jul 22, 2021
@mpenkov mpenkov removed the stale Waiting for author to complete contribution, no recent effort label Jul 23, 2021
@gojomo
Copy link
Collaborator

gojomo commented Aug 12, 2021

Slightly-larger #3188 may make sense to apply 1st, then this.

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 13, 2021

@piskvorky Is #3188 in scope for the current release? It isn't mentioned anywhere in the associated milestone or project. If not, my preference is to:

  1. Merge this PR
  2. Finally release 4.1
  3. Move on with other PRs

Please let me know your thoughts.

@piskvorky
Copy link
Owner

piskvorky commented Aug 13, 2021

My preference is to get in all API-breaking changes ASAP. Looking at #3188, it seems backward-compatible, so can wait potentially.

Of course, merging both this and #3188 would be ideal, but I see @gojomo added new good review comments to #3188. So if #3188 would mean blocking the release, we can release without. If trivial let's do both.

@mpenkov mpenkov merged commit 247da47 into piskvorky:develop Aug 13, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Aug 13, 2021

Merged. Thank you @simonwiles !

tbbharaj pushed a commit to tbbharaj/gensim that referenced this pull request Aug 19, 2021
* Allow supplying a string-key as the negative arg. to most_similar()

* Allow a single vector as a positive or negative arg. to most_similar()

* Update comments

* Accept single arguments when positive and negative are both supplied

* Update most_similar_cosmul to match most_similar

I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line piskvorky#981 or not, so I've left it in.

* minor code cleanup

* add unit tests

* Update CHANGELOG.md

* remove redundant variable declaration

* enforce consistency

* respond to review feedback

* Update keyedvectors.py

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
Co-authored-by: Michael Penkov <m@penkov.dev>
tbbharaj pushed a commit to tbbharaj/gensim that referenced this pull request Aug 19, 2021
* Allow supplying a string-key as the negative arg. to most_similar()

* Allow a single vector as a positive or negative arg. to most_similar()

* Update comments

* Accept single arguments when positive and negative are both supplied

* Update most_similar_cosmul to match most_similar

I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line piskvorky#981 or not, so I've left it in.

* minor code cleanup

* add unit tests

* Update CHANGELOG.md

* remove redundant variable declaration

* enforce consistency

* respond to review feedback

* Update keyedvectors.py

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
Co-authored-by: Michael Penkov <m@penkov.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaks backward-compatibility Change breaks backward compatibility
Projects
None yet
Development

Successfully merging this pull request may close these issues.

KeyedVectors.most_similar() API has a confusing wrinkle
4 participants