
Add KeyedVectors support to AnnoyIndexer #1318

Merged: 14 commits merged into piskvorky:develop on Jun 12, 2017

Conversation

@pengowray (Contributor) commented May 13, 2017

Currently the AnnoyIndexer can be run on a Word2Vec or Doc2Vec object, which can be created, for example, by training on sentences, e.g. gensim.models.Word2Vec(sentences).

However, I find myself with a KeyedVectors instance. This object type comes from loading pre-trained data (e.g. Google's News dataset or Stanford's GloVe[1] Twitter dataset), which is loaded using KeyedVectors.load_word2vec_format(). The KeyedVectors class appears to have everything needed to create an AnnoyIndex, so I've added support for it to gensim.

I'm fairly new to gensim, so maybe I'm missing some part of the usual workflow that would make this patch redundant. I have only tested this pull request via monkey patching (overriding the class methods after they were loaded) and have not tested this particular form of the patch, so you might want to double-check it, but I don't see any reason it would cause problems [edit: except for a missing import, oops; fixed].

Thanks

[1] Note: the Stanford GloVe datasets are missing the header line (which normally contains the number of items and the vector dimensionality), so this had to be inserted manually, although gensim could easily have inferred it.
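For illustration (the numbers here are placeholders), the word2vec text format expects a first line with the vocabulary size and vector dimensionality, which plain GloVe files lack; each following line is a word and its vector:

400000 25
word1 0.123 -0.456 ... (25 floats)
word2 0.789 0.012 ...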

@pengowray (Contributor Author) commented May 13, 2017

Below is the "monkey patched" version, in case it's useful to someone. It adds KeyedVectors support to the AnnoyIndexer without directly modifying gensim.

from gensim.models import Word2Vec, Doc2Vec
from gensim.models.keyedvectors import KeyedVectors
from gensim.similarities.index import AnnoyIndexer

def new_init(self, model=None, num_trees=None):
    self.index = None
    self.labels = None
    self.model = model
    self.num_trees = num_trees

    if model and num_trees:
        if isinstance(self.model, Doc2Vec):
            self.build_from_doc2vec()
        elif isinstance(self.model, Word2Vec):
            self.build_from_word2vec()
        elif isinstance(self.model, KeyedVectors):
            self.build_from_keyedvectors()
        else:
            raise ValueError("Only a Word2Vec, Doc2Vec or KeyedVectors instance can be used")

def build_from_keyedvectors(self):
    """Build an Annoy index using word vectors from a KeyedVectors model."""
    # Skip normalization:
    # return self._build_from_model(self.model.syn0, self.model.index2word, self.model.vector_size)

    # With normalization:
    self.model.init_sims()
    return self._build_from_model(self.model.syn0norm, self.model.index2word, self.model.vector_size)

# Patch the existing AnnoyIndexer class in place
AnnoyIndexer.__init__ = new_init
AnnoyIndexer.build_from_keyedvectors = build_from_keyedvectors
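For example, once the patch above is applied, pre-trained vectors can be indexed directly. This is a rough sketch; the file path, query word and tree count are placeholders:

# Hypothetical usage: the path and tree count are placeholders
wv = KeyedVectors.load_word2vec_format('/tmp/GoogleNews-vectors-negative300.bin', binary=True)
indexer = AnnoyIndexer(wv, 100)  # dispatches to build_from_keyedvectors() via the patched __init__
print(wv.most_similar('science', topn=5, indexer=indexer))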

@piskvorky (Owner) commented

Thanks!

I actually remember some discussion about supporting the GloVe format in the past. @tmylk didn't we add a script to convert GloVe=>word2vec format?

I agree we could (should) easily infer it directly from the file. It looks easy enough.

@tmylk (Contributor) commented May 15, 2017

@Quole This is a useful PR. Could you please add a quick unit test for it? Just loading and indexing a KeyedVectors file would do.

GloVe vectors can be loaded into gensim after running the conversion script https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/scripts/glove2word2vec.py

We would also appreciate a PR that loads GloVe into KeyedVectors directly, without any command-line conversion.
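For reference, a minimal sketch of driving that conversion from Python rather than from the command line (the file names below are placeholders):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models.keyedvectors import KeyedVectors

# Prepend the word2vec-style header to a plain GloVe file, then load it as KeyedVectors
glove2word2vec('/tmp/glove.twitter.27B.25d.txt', '/tmp/glove_word2vec.txt')
wv = KeyedVectors.load_word2vec_format('/tmp/glove_word2vec.txt')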

@pengowray (Contributor Author) commented

I've had a quick stab in the dark at writing a test, but I never got around to getting gensim running from source, and getting test code running via monkey patching is fairly awkward, so I didn't get it working.

I think the setup of the test works; it just needs some assert statements worked out. I probably also need to spend more time learning how the test code is organized. I'll have another go at it when I get a chance, but it probably wouldn't be hard for someone more familiar with the codebase to fix up the broken code, if anyone wants to.

@tmylk (Contributor) commented May 18, 2017

Sorry you had problems setting up the gensim development environment; we try to make it easy. What kind of problems did you encounter?

Even without a local environment, you can see your test errors in our Travis build environment.

@menshikh-iv (Contributor) commented

@Quole You can see the error in the Travis log. Also, if you want to run the tests in the current class locally, you can use this line (in your repo directory):
python gensim/test/test_similarities.py TestWord2VecAnnoyIndexer

Please fix this problem:

======================================================================
ERROR: testAnnoyIndexingOfKeyedVectors (__main__.TestWord2VecAnnoyIndexer)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "gensim/test/test_similarities.py", line 489, in testAnnoyIndexingOfKeyedVectors
    self.assertVectorIsSimilarToItself(model, index)
  File "gensim/test/test_similarities.py", line 499, in assertVectorIsSimilarToItself
    vector = model.wv.syn0norm[0]
AttributeError: 'KeyedVectors' object has no attribute 'wv'

----------------------------------------------------------------------
Ran 4 tests in 0.060s

@tmylk added the "difficulty easy" (Easy issue: required small fix) label on May 29, 2017
@tmylk (Contributor) commented May 29, 2017

This PR is now abandoned.
Anyone, especially someone looking for a first contribution, is welcome to pick it up.
It just needs a minor change to assertVectorIsSimilarToItself.
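For anyone picking this up, one possible shape of that change (a sketch only; the real test body may differ) is to fall back to the model itself when it has no .wv attribute, since a bare KeyedVectors object is already the vector store:

def assertVectorIsSimilarToItself(self, model, index):
    # KeyedVectors has no .wv; Word2Vec/Doc2Vec models do
    wv = getattr(model, 'wv', model)
    vector = wv.syn0norm[0]
    label = wv.index2word[0]
    approx_neighbors = index.most_similar(vector, 1)
    word, similarity = approx_neighbors[0]

    self.assertEqual(word, label)
    self.assertAlmostEqual(similarity, 1.0, places=2)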

@pengowray (Contributor Author) commented

I've fixed up the tests. If they're acceptable, would it be better if I combined these commits and/or put them in a branch with a different name, or anything like that?

@tmylk It wasn't actually difficult to set up a dev environment. I just had an awkward setup inside a Docker container that I needed to find time to make more usable for development (which took less time than I expected), though in retrospect I probably could have just relied on the Travis checks and made the changes in Notepad.

@tmylk (Contributor) commented May 30, 2017

Could you please add the new use case to the ipynb tutorial?
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/annoytutorial.ipynb

@pengowray (Contributor Author) commented

Is there a magic way of editing a notebook that doesn't make a mess of its change log?

@tmylk (Contributor) commented May 30, 2017

Nbdime helps a bit

@menshikh-iv (Contributor) commented

@Quole You can also use your favorite text editor.

@pengowray (Contributor Author) commented

Would anyone be able to test if my updated version of annoytutorial.ipynb runs correctly with this fork?

@menshikh-iv (Contributor) commented

@Quole I ran annoytutorial.ipynb and got an exception after this code:

%%time

model.save('/tmp/mymodel')

def f(process_id):
    print('Process Id: ', os.getpid())
    process = psutil.Process(os.getpid())
    new_model = Word2Vec.load('/tmp/mymodel')
    vector = new_model["science"]
    annoy_index = AnnoyIndexer()
    annoy_index.load('index')
    annoy_index.model = new_model
    approximate_neighbors = new_model.most_similar([vector], topn=5, indexer=annoy_index)
    print('\nMemory used by process {}: '.format(os.getpid()), process.memory_info(), "\n---")

# Creating and running two parallel process to share the same index file.
p1 = Process(target=f, args=('1',))
p1.start()
p1.join()
p2 = Process(target=f, args=('2',))
p2.start()
p2.join()

output:

('Process Id: ', 13330)
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "<timed exec>", line 7, in f
  File "/home/ivan/.virtualenvs/clean2/local/lib/python2.7/site-packages/gensim-2.1.0-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1386, in load
    if model.negative and hasattr(model.wv, 'index2word'):
AttributeError: 'KeyedVectors' object has no attribute 'negative'
('Process Id: ', 13353)
Process Process-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "<timed exec>", line 7, in f
  File "/home/ivan/.virtualenvs/clean2/local/lib/python2.7/site-packages/gensim-2.1.0-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 1386, in load
    if model.negative and hasattr(model.wv, 'index2word'):
AttributeError: 'KeyedVectors' object has no attribute 'negative'
CPU times: user 544 ms, sys: 52 ms, total: 596 ms
Wall time: 1.21 s

@pengowray (Contributor Author) commented Jun 4, 2017

OK, the notebook file works now. I've added a section titled "Work with Google's word2vec C formats". The part that is only possible with this PR is this bit:

# To create and save Annoy Index from a loaded `KeyedVectors` object (with 100 trees)
annoy_index = AnnoyIndexer(wv, 100)
annoy_index.save('/tmp/mymodel.annoyindex')

I've also added file extensions to all the saved tmp files, as it was getting confusing.
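As a brief follow-on sketch (the path and query word are placeholders), the saved index can later be loaded back and re-attached to the vectors:

annoy_index2 = AnnoyIndexer()
annoy_index2.load('/tmp/mymodel.annoyindex')
annoy_index2.model = wv  # re-attach the KeyedVectors the index was built from
print(wv.most_similar('cat', topn=5, indexer=annoy_index2))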

@menshikh-iv (Contributor) commented

I see some problems in your PR:

  • print('\nMemory used by process {}: '.format(os.getpid()), process.memory_info(), "\n---") and print('Process Id: ', os.getpid()) print a tuple (not a string) under Python 2.

  • This loop takes a lot of time; I think it would be a good idea to log progress:

   x_values.append(x)
   start_time = time.time()
   ...

  • This code is incorrect in Python 2 (because there print isn't a function, it's a statement); you can rewrite it as a simple loop:

# The first line should have the total number of entries and the vector size
with open('/tmp/vectors.txt') as myfile:
    [print(next(myfile)) for x in range(3)]

@pengowray (Contributor Author) commented Jun 4, 2017

@menshikh-iv Thanks for reviewing the code. The first two points are from existing code. (Sorry, I haven't tried to clean up the diff yet.)

Yes, there's a bunch of stuff that is very slow and makes editing and testing this notebook quite tedious, especially section 5 ("Evaluate relationship of num_trees to initialization time and accuracy"). It should probably be moved to another notebook file. The accuracy graph could also be built much more quickly if the index were built only once with 300 trees and each iteration then chose the number of trees to consider (instead of rebuilding the entire index hundreds of times). This would probably require a small API change to add num_trees (aka n_trees) as a parameter of model.most_similar(). The initialization-time graph still requires rebuilding the index, but it is very linear, so perhaps it could skip all but every 50th iteration. There should also be a graph evaluating accuracy vs search_k (see the Annoy README). But all these changes (and changes to logging) should probably go in a different PR.

(3rd point) I will rewrite my code to also work with Python 2. Thanks for spotting this.

@piskvorky (Owner) commented

-1 on that [print(next(myfile)) for x in range(3)] construct: just write a normal loop.

Or why is this obfuscation needed?

@pengowray (Contributor Author) commented

I've changed the offending code as follows:

with open('/tmp/vectors.txt') as myfile:
    for i in range(3):
        print(myfile.readline().strip())

@menshikh-iv (Contributor) commented

Thank you @Quole

@menshikh-iv merged commit 686e975 into piskvorky:develop on Jun 12, 2017
@piskvorky (Owner) commented

@menshikh-iv open => smart_open please (we try to use that consistently, for simpler reading from S3/compressed/HDFS etc).
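For example, a minimal sketch of that substitution (assuming the smart_open package is installed; the path is a placeholder):

from smart_open import smart_open

# smart_open reads local, compressed and remote (S3/HDFS) files through the same call
with smart_open('/tmp/vectors.txt', 'rb') as myfile:
    for i in range(3):
        print(myfile.readline().strip())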
