Further focus/slim keyedvectors.py module #2873

gojomo · 2020-07-06T20:00:39Z

Pre-#2698, keyedvectors.py was 2500+ lines, including functionality over-specific to other models, & redundant classes. Post-#2698, with some added generic functionality, it's still over 1800 lines.

It should shed some other grab-bag utility functions that have accumulated, & don't logically fit inside the KeyedVectors class.

In particular, the evaluation (analogies, word_ranks) helpers could move to their own module that takes a KV instance as an argument. (If other more-sophisticated evaluations can be contributed, as would be welcome, they should also live alongside those, rather than bloating KeyedVectors.)
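To make the "takes a KV instance as an argument" idea concrete, here is a minimal sketch rather than gensim's actual code. `ToyVectors` and `analogy_accuracy` are hypothetical names invented for illustration. The only thing the evaluation function assumes about its argument is a `most_similar(positive, negative, topn)` method.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class ToyVectors:
    """Hypothetical stand-in for a KeyedVectors-like object (not gensim code)."""

    def __init__(self, vectors):
        self.vectors = dict(vectors)  # key -> list of floats

    def most_similar(self, positive, negative, topn=1):
        # Build the query as sum(positive) - sum(negative), then rank by cosine.
        dim = len(next(iter(self.vectors.values())))
        query = [0.0] * dim
        for key in positive:
            query = [q + x for q, x in zip(query, self.vectors[key])]
        for key in negative:
            query = [q - x for q, x in zip(query, self.vectors[key])]
        exclude = set(positive) | set(negative)
        scored = [
            (key, _cosine(query, vec))
            for key, vec in self.vectors.items()
            if key not in exclude
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:topn]


def analogy_accuracy(kv, analogies):
    """Standalone evaluation helper: fraction of (a, b, c, expected) solved.

    Lives outside the vectors class and only reads from `kv`.
    """
    if not analogies:
        raise ValueError("need at least one analogy")
    correct = 0
    for a, b, c, expected in analogies:
        predicted, _score = kv.most_similar(positive=[b, c], negative=[a], topn=1)[0]
        correct += predicted == expected
    return correct / len(analogies)


# Made-up toy vectors, purely for illustration.
kv = ToyVectors({
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.5, -1.0],
})
analogies = [("man", "woman", "king", "queen"), ("king", "queen", "man", "woman")]
accuracy = analogy_accuracy(kv, analogies)
```

Because the helper depends only on one method of its argument, further evaluations could be added next to it without touching the vectors class.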

The get_keras_embedding method, as its utility is narrow to very specific uses, and is conditional on a not-necessarily-installed package, could go elsewhere too – either a keras-focused utilities module, or even just documentation/example code about how to convert to/from keras from `KeyedVectors`.

Some of the more advanced word-vector-using calculations, like 'Word Mover's Distance' or 'Soft Cosine Similarity', could move to method-specific modules that are then better documented/self-contained/optimized, without bloating the generic 'set of vectors' module. (They might be more discoverable, there, as well.)

And finally, some of the existing calculations could be unified/streamlined (especially the two variants of most_similar(), and some of the steps shared by multiple operations). My hope would be the module is eventually <1000 lines.


piskvorky · 2020-09-28T12:26:04Z

@gojomo do you see this as essential for 4.0.0 = API breaking?

Or can we leave it for a later release?

gojomo · 2020-10-06T20:42:14Z

This is low-risk and not-hard, but also low-priority.

Making the decision that some things should be relocated would be nicer to do in 4.0.0, along with other "update your imports/code/function-names" changes, but could wait.

piskvorky · 2020-10-06T21:46:09Z

@mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

I already resolved the get_keras_embedding part in #2937.

gojomo · 2020-10-06T22:31:58Z

My 1st thoughts would be:

- move evaluation code to its own module
- move WMD calcs to its own module (even if a passthrough method remains here indefinitely); the options/optimizations for WMD might improve more beneficially in its own space (and evolve in parallel with other extensions like 'soft cosine')

Next but lower-priority, some of the other utility methods may be amenable to more reuse. (Perhaps, the cosmul as an option to most_similar, or some methods being redefined in terms of fewer more-central operations.)
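As a rough sketch of what "cosmul as an option to most_similar" could look like, the function below folds the additive and multiplicative scoring into one code path. This is an assumption-laden illustration, not gensim's implementation. The `method` parameter and the plain-dict `vectors` argument are hypothetical. The additive branch sums per-word cosines, which is not necessarily the exact score gensim reports. The multiplicative branch shifts cosines into [0, 1] and adds a small epsilon to the denominator, as in the 3CosMul formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, positive, negative=(), topn=3, method="add"):
    """Rank keys against positive/negative examples.

    `vectors` is a plain dict (key -> list of floats) standing in for a
    KeyedVectors instance; `method` is a hypothetical option name.
    """
    if method not in ("add", "mul"):
        raise ValueError("method must be 'add' or 'mul'")
    exclude = set(positive) | set(negative)
    scores = []
    for key, vec in vectors.items():
        if key in exclude:
            continue
        # Shared step: cosine of the candidate against every example word.
        pos = [cosine(vec, vectors[p]) for p in positive]
        neg = [cosine(vec, vectors[n]) for n in negative]
        if method == "add":
            score = sum(pos) - sum(neg)
        else:
            # Shift cosines from [-1, 1] to [0, 1]; epsilon avoids division by zero.
            numerator = math.prod((1 + s) / 2 for s in pos)
            denominator = math.prod((1 + s) / 2 for s in neg) + 1e-6
            score = numerator / denominator
        scores.append((key, score))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:topn]


# Made-up toy vectors, purely for illustration.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.5, -1.0],
}
top_add = most_similar(toy, ["woman", "king"], ["man"], topn=1, method="add")
top_mul = most_similar(toy, ["woman", "king"], ["man"], topn=1, method="mul")
```

Only the scoring step differs between the two variants, so filtering, sorting and truncation are written once.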

Finally, after everything else has settled, the methods should be reordered by importance & grouped by role, so the autogenerated documentation has the most-used stuff up top, and a casual top-to-bottom read makes more sense.

mpenkov · 2020-10-17T10:23:06Z

> @mpenkov do you feel like taking this up, reshuffling / organizing KeyedVectors?

@piskvorky Sorry I missed this. I'm in the middle of a house move, so I'd rather not get involved until things have settled down.

If this can wait a couple of weeks, then I'd be happy to pick it up then. It looks like my sort of thing, and I've done it a couple of times with gensim already.

piskvorky · 2020-10-17T13:22:21Z

> If this can wait a couple of weeks

Yes, it can. In order of urgency:

1. Release 4.0.0beta – hopefully this coming week (only TODO left are release/migration docs = my task). I'd like you to do the release process though (or together).
2. A few weeks to tie up the other loose ends in https://github.com/RaRe-Technologies/gensim/milestone/3 – hopefully including this one.
3. Release 4.0.0rc1 / full 4.0.0.

mpenkov · 2021-02-27T09:43:45Z

How about moving the high-level methods from keyedvectors.py to a separate wordtasks.py submodule? They could be pure functions there. For example:

keyedvectors.KeyedVectors.most_similar(self, ...) -> wordtasks.most_similar(model, ...)
keyedvectors.KeyedVectors.similar_by_word(self, ...) -> wordtasks.similar_by_word(model, ...)
keyedvectors.KeyedVectors.similar_by_key -> and so on...
keyedvectors.KeyedVectors.similar_by_vector
keyedvectors.KeyedVectors.wmdistance
keyedvectors.KeyedVectors.most_similar_cosmul
keyedvectors.KeyedVectors.rank_by_centrality
keyedvectors.KeyedVectors.doesnt_match
keyedvectors.KeyedVectors.cosine_similarities
keyedvectors.KeyedVectors.distances
keyedvectors.KeyedVectors.evaluate_word_pairs
keyedvectors.KeyedVectors.evaluate_word_analogies

None of the above operations modify the keyedvectors model; they are all read-only.

This would leave the lower-level IO stuff in keyedvectors.py, so serialization (loading/saving) shouldn't be affected, as far as I understand.
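A minimal sketch of the split, under stated assumptions: `ToyKeyedVectors` is a hypothetical stand-in for the class (storage and lookup only), and the module-level `doesnt_match` is what might live in the proposed wordtasks module. The stand-in's `get_vector(key, norm=False)` mirrors the signature I believe `KeyedVectors` has in 4.x. Everything else is invented for illustration, including the thin passthrough method that keeps existing call sites working.

```python
import math


class ToyKeyedVectors:
    """Hypothetical stand-in for KeyedVectors: storage and lookup only."""

    def __init__(self, vectors):
        self._vectors = dict(vectors)  # key -> list of floats

    def get_vector(self, key, norm=False):
        vec = self._vectors[key]
        if norm:
            length = math.sqrt(sum(x * x for x in vec))
            return [x / length for x in vec]
        return list(vec)

    def doesnt_match(self, words):
        """Optional passthrough to the standalone function below."""
        return doesnt_match(self, words)


# --- what could live in the proposed wordtasks module ---
def doesnt_match(model, words):
    """Return the word whose unit vector is least aligned with the group mean.

    Pure function: it only reads from `model` via get_vector().
    """
    unit = [model.get_vector(word, norm=True) for word in words]
    dim = len(unit[0])
    mean = [sum(vec[i] for vec in unit) / len(unit) for i in range(dim)]
    scores = [sum(a * b for a, b in zip(vec, mean)) for vec in unit]
    return words[scores.index(min(scores))]


# Made-up toy vectors, purely for illustration.
model = ToyKeyedVectors({
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "apple": [0.0, 0.5, -1.0],
})
words = ["man", "woman", "king", "apple"]
odd_one_out = doesnt_match(model, words)
via_method = model.doesnt_match(words)
```

Since the function touches the model only through `get_vector()`, nothing about how vectors are stored or serialized has to change.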

The name wordtasks comes from the keyedvectors docstring:

> What can I do with word vectors? You can perform various syntactic/semantic NLP word tasks with the trained vectors.

WDYT?

mpenkov · 2021-03-09T07:13:27Z

@gojomo Let's remove this from the 4.0 milestone and deal with it later

gojomo mentioned this issue Jul 7, 2020

KeyedVectors & *2Vec API streamlining, consistency #2698

Merged

piskvorky added this to the *2vec aftermath milestone Jul 26, 2020

piskvorky mentioned this issue Jul 26, 2020

save_facebook_model() - AssertionError #2853

Closed

This was referenced Jul 30, 2020

[MRG] Fix similarity bug in NMSLIB indexer + documentation fixes #2899

Merged

Added "tensorflow.keras" compatibility in "KeyedVectors.get_keras_embedding" function. #2911

Closed

piskvorky modified the milestones: *2vec aftermath, 4.0.0 Sep 24, 2020

piskvorky mentioned this issue Oct 4, 2020

Adopting a (narrow) backward-compatibility standard; implications for 4.0.0 #2967

Open

piskvorky assigned gojomo Oct 16, 2020

piskvorky assigned mpenkov Feb 25, 2021

piskvorky removed this from the 4.0.0 milestone Mar 9, 2021
