
Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 #1916

Merged (22 commits) on Mar 1, 2018

Conversation

@jbaiter (Contributor) commented Feb 19, 2018

This PR is an attempt to optimize the memory usage of the FastText model and to provide a more accurate FastText.estimate_memory method.

Specifically, it implements the following improvements:

  • Cythonize the ft_hash function
  • Cythonize the compute_ngrams function
  • Do not pre-compute and store the ngrams with the model; calculate them on the fly when needed
  • Do not store the ngrams with the model at all; rely entirely on the ngram hashes (a minimal sketch of the resulting lookup follows this list)
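
For illustration, a minimal sketch of the hash-based subword lookup this enables; the names ft_hash, compute_ngrams, hash2index and bucket mirror identifiers discussed later in this thread, and this is a sketch rather than the exact PR code:

def subword_indices(word, wv, bucket, min_n=3, max_n=6):
    # Map a word's character ngrams to vector rows via hashing, without storing the ngrams.
    indices = []
    for ngram in compute_ngrams(word, min_n, max_n):    # ngrams generated on the fly
        ngram_hash = ft_hash(ngram) % bucket            # bucketed hash replaces the stored ngram dict
        if ngram_hash in wv.hash2index:                 # only buckets seen during vocab building
            indices.append(wv.hash2index[ngram_hash])
    return indices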

In its current state this PR does not merge cleanly and does not pass the test suite, due to these issues:

  1. The improvements were done before the refactoring of the word embedding models, so some code will likely have to be moved around
  2. There are some Python 2/3 issues with the cythonized compute_ngrams function
  3. The tests currently check for OOV ngrams. In the old code this relied on the ngrams attribute of the model, but in the optimized model I use the ngram hash. Since these hashes are bucketed, an OOV is usually no longer possible (i.e. once all buckets are occupied), so these tests would have to be removed. Is this okay?

I think I can do 1) and 2) on my own when I find the time, but for 3) I'd need some help, since I'm not that familiar with the intentions behind the old code.

@menshikh-iv (Contributor):

@jbaiter Wow! Please resolve the merge conflicts (this is critical right now). It will probably be simpler for you to create a new branch (based on a fresh develop) and apply your changes there than to resolve the conflicts here.

When you have resolved the conflicts, please ping me for a review or any help.

@jbaiter force-pushed the fasttext-optimization branch from 7c6afb2 to 08c464a on February 21, 2018 23:46
@jbaiter force-pushed the fasttext-optimization branch from 08c464a to 51a1a6e on February 22, 2018 00:18
@piskvorky requested a review from manneshiva on February 22, 2018 07:10
@jbaiter (Contributor, Author) commented Feb 22, 2018

@menshikh-iv So I managed to rebase my changes on the latest develop branch.
The Python 2 issue with _compute_ngrams has also been fixed.
The test suite now also passes all of the tests on Python 3.6.
On Python 2.7 and 3.5, however, the OOV vector in testPersistenceForOldVersions does not match the fixture. I'm currently at a loss as to what could cause this; it works fine on Python 3.6.

out_expected_vec = numpy.array([-1.34948218, -0.8686831, -1.51483142, -1.0164026, 0.56272298,
0.66228276, 1.06477463, 1.1355902, -0.80972326, -0.39845538])
out_expected_vec = numpy.array([-0.33959097, -0.21121596, -0.37212455, -0.25057459, 0.11222091,
0.17517674, 0.26949012, 0.29352987, -0.1930912, -0.09438948])
jbaiter (Contributor, Author):

This is probably not correct, the vector seems to differ between different Python versions (the above was with 3.6)

@jbaiter (Contributor, Author) commented Feb 22, 2018

I now reverted all changes to the deprecated fasttext_wrapper and the test suite is now passing in all Linux environments. I don't think it's too bad if it doesn't get the optimizations for now, given that it's deprecated.

On Windows there seems to be a memory-related issue: the allocation for the ngram vectors in test_estimate_memory causes a MemoryError.

@menshikh-iv (Contributor) commented Feb 22, 2018

@jbaiter thanks. Regarding the memory issue: this is a known problem on AppVeyor :( I'll look into it later, thanks for your patience.

@menshikh-iv (Contributor) left a review comment:

CC: @manneshiva can you review this one, please?

@@ -317,7 +317,8 @@ def train_batch_sg(model, sentences, alpha, _work, _l1):
continue
indexes[effective_words] = word.index

subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]]
subwords = [model.wv.hash2index[ft_hash(subword) % model.bucket]
Contributor:

Please use only hanging indents (no vertical).

jbaiter (Contributor, Author):

That is, remove the newline and keep the comprehension on a single line?

Contributor:

single line (if line <= 120 characters) or something like

subwords = [
    ....
    ....
]

@@ -0,0 +1,21 @@
#!/usr/bin/env cython
Contributor:

This relates only to the n-gram model; I think it's better to move it into fasttext_inner.pyx

jbaiter (Contributor, Author):

I don't think this would work, since both functions are used by both models.fasttext and models.keyedvectors. Moving the functions into models.fasttext would break, since this would cause a circular import.

Contributor:

Aha. In that case it's better to name it _utils_any2vec.pyx (similar to _mmreader.pyx and _matutils.pyx)

# cython: cdivision=True
# coding: utf-8

def ft_hash(unicode string):
Contributor:

why not cdef with nogil?

jbaiter (Contributor, Author):

Both functions need to be called from both Python and Cython, so cdef won't work, but cpdef should. Will be fixed.

jbaiter (Contributor, Author):

nogil doesn't work, since the function returns a Python object.
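
For reference, a minimal sketch of what such a cpdef hash function can look like; it follows the FNV-1a scheme that fastText uses for bucketing ngrams, and is an illustration rather than necessarily the exact code merged here:

cpdef ft_hash(unicode string):
    # cpdef keeps the function callable from both Python and Cython;
    # it returns a Python int, which is why nogil is not applicable.
    cdef unsigned int h = 2166136261
    for c in string:
        h ^= ord(c)      # FNV-1a: XOR the character in first...
        h *= 16777619    # ...then multiply by the FNV prime (wraps at 32 bits)
    return h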

return h


cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):
Contributor:

why not nogil? You can fix the type for word.

jbaiter (Contributor, Author):

Fixing the type for word doesn't really work, since the token might be a str on Python 2.7. nogil won't work either, since the function returns a Python object.
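
For context, a minimal sketch of such a function with the word parameter left untyped, so that both Python 2 str and unicode are accepted; again an illustration of the shape, not necessarily the exact PR code:

cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):
    # `word` stays untyped on purpose (str on Python 2, unicode elsewhere);
    # the result is a Python list of ngram strings, hence no nogil.
    cdef unicode extended_word = f'<{word}>'
    ngrams = []
    for ngram_length in range(min_n, min(len(extended_word), max_n) + 1):
        for i in range(len(extended_word) - ngram_length + 1):
            ngrams.append(extended_word[i:i + ngram_length])
    return ngrams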



cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):
cdef unicode extended_word = f'<{word}>'
Contributor:

f-strings are supported only in 3.6 (and we maintain 2.7, 3.5, 3.6); please use simple concatenation (or any alternative) here.

@jbaiter (Contributor, Author) commented Feb 23, 2018:

This is in Cython, which, as far as I understand, automatically generates cross-compatible C code for f-strings since 0.24. It works fine under 2.7.

Contributor:

wow, really? I didn't know about it, thanks for the information!


model_neg.build_vocab(new_sentences, update=True) # update vocab
model_neg.train(new_sentences, total_examples=model_neg.corpus_count, epochs=model_neg.iter)
self.assertEqual(len(model_neg.wv.vocab), 14)
self.assertTrue(len(model_neg.wv.ngrams), 271)
Contributor:

Why did you remove part of the tests (I mean all the removed lines in the tests)?

jbaiter (Contributor, Author):

As I mentioned in the PR, one optimization was to remove the storage of ngrams on the model and rely solely on the hashes. This is why any tests that assert the number of ngrams in the model are no longer necessary.

Since the ngrams are no longer stored, the __contains__ check now also relies only on the hashed and bucketed ngrams, which is why a 'real' OOV is a lot rarer (it can only happen if not all buckets are occupied and none of the ngrams in the token map to an occupied bucket).
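
For illustration, a rough sketch of what the hash-based membership check amounts to (names as used elsewhere in this thread, not the literal PR code):

def __contains__(self, word):
    if word in self.vocab:
        return True                      # in-vocabulary word
    # OOV word: it is representable if any of its ngrams hashes into an occupied bucket
    for ngram in compute_ngrams(word, self.min_n, self.max_n):
        if ft_hash(ngram) % self.bucket in self.hash2index:
            return True
    return False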

@menshikh-iv (Contributor):

@jbaiter don't forget to resolve merge-conflict too

@manneshiva (Contributor) left a review comment:

@jbaiter I went through your PR and it looks good to me. Great job!
Just one more small deletion required in /gensim/models/deprecated/fasttext.py. Please delete,

new_model.wv.ngrams_word = old_model.wv.ngrams_word
new_model.wv.ngrams = old_model.wv.ngrams

and the corresponding asserts in test_fasttext.py, here.

I also feel we should evaluate the effect of the changes in this PR on the quality of the learnt vectors. Maybe compare the old and new code by training a FastText model on text8 and looking at the accuracies (using accuracy) of the learnt vectors on question-answers.txt. It would also be interesting to see the memory consumption in both cases.

cc: @menshikh-iv

This removes the expensive calls to `compute_ngrams` and `ft_hash`
during training and uses a simple lookup in an int -> int[] mapping
instead, resulting in a dramatic increase in training performance.
@jbaiter (Contributor, Author) commented Feb 26, 2018

I ran some benchmarks with my optimized version and the current gensim implementation of FastText.

Initially the performance was about 10x slower, but I implemented an optimization that pre-generates the ngram buckets for each word to avoid calling compute_ngrams and ft_hash in the training loop. This actually improved performance to 2x that of the original implementation.

I trained on the text8 corpus with the default settings on a Xeon E5-1620 with 8 cores.
The test script can be found at https://gist.github.com/3d781a311e536b471b24fb4a46c952a4
All measurements were done with GNU time (/usr/bin/time).

Metric                 | original        | optimized
Training time          | 585.5 s         | 299.3 s
Training words/sec     | 106,804 words/s | 208,911 words/s
Training peak memory   | 1,409.92 MiB    | 1,181.35 MiB
Evaluation time        | 60.66 s         | 59.25 s
Evaluation peak memory | 980.46 MiB      | 768.42 MiB

As you can see, the goal of reducing memory consumption was achieved and we additionally almost doubled the training speed, while maintaining evaluation speed.

The quality of the vectors seems to suffer a bit; however, I think this might be due to the different random initializations of the two models.

Benchmark                   | original           | optimized
capital-common-countries    | 8.5% (43/506)      | 7.1% (36/506)
capital-world               | 3.6% (52/1452)     | 3.6% (52/1452)
currency                    | 0.0% (0/268)       | 0.0% (0/268)
city-in-state               | 4.3% (68/1571)     | 4.1% (64/1571)
family                      | 40.8% (125/306)    | 40.8% (125/306)
gram1-adjective-to-adverb   | 96.0% (726/756)    | 95.1% (719/756)
gram2-opposite              | 91.8% (281/306)    | 91.2% (279/306)
gram3-comparative           | 81.4% (1026/1260)  | 82.2% (1036/1260)
gram4-superlative           | 85.2% (431/506)    | 84.6% (428/506)
gram5-present-participle    | 76.3% (757/992)    | 75.9% (753/992)
gram6-nationality-adjective | 71.3% (978/1371)   | 68.6% (940/1371)
gram7-past-tense            | 29.1% (387/1332)   | 28.4% (378/1332)
gram8-plural                | 80.9% (803/992)    | 80.5% (799/992)
gram9-plural-verbs          | 82.8% (538/650)    | 82.9% (539/650)
total                       | 50.7% (6215/12268) | 50.1% (6148/12268)

@menshikh-iv (Contributor) commented Feb 27, 2018

@jbaiter the first table looks awesome: almost 2x faster + reduced memory usage, fantastic 🔥!

About the second table & random init: can you train/evaluate several times & average the results (to exclude random effects), please?

Also, please have a look at Appveyor, I see a MemoryError in test_estimate_memory.
Besides, please add a backward-compatibility test (train FastText with the old code and try to load it with the current code).

@manneshiva (Contributor) commented Feb 27, 2018

@jbaiter The speedup looks great! Thanks for this contribution.
Just a few comments:

  1. You would also want to include the size of wv.buckets_word in the estimate_memory() method.
  2. The initial setup time before training (calculating hashes and storing ngrams) should definitely be faster now that you have Cythonized compute_ngrams and ft_hash. But I am not sure what is causing the speedup in training (in terms of words/sec). As far as I know, the initial code (Gensim 3.3.0) did not call compute_ngrams or ft_hash during training (from fasttext_inner.pyx), so what do you think is the reason for the 2x increase in the number of words processed per second? (I am assuming you pasted these words/sec numbers from the logs.) The table values compare your current implementation (optimized) with the implementation from Gensim 3.3.0 (original), am I right?

@menshikh-iv (Contributor):

@manneshiva @jbaiter maybe try to compare on a bigger corpus (something ~1GB, not text8; this is a more "fair" performance comparison)

@jbaiter (Contributor, Author) commented Feb 27, 2018

@menshikh-iv

About second table & random init: can you train/evaluate several times & average results (to exclude random effect) please?
Besides, please make backward-compatibility test (train FastText with old code, and try to load it with current code).

Will do, I'll also do a run each with a fixed seed.

Also, please have a look at Appveyor, I see MemoryError in test_estimate_memory.

Do you have any idea what could be causing this? estimate_memory is called in the training tests as well, but doesn't cause an error there. You mentioned that this tends to happen on AppVeyor with other tests as well?

@manneshiva @jbaiter maybe try to compare on bigger corpus (something ~1GB, not text8, this is more "fair" performance comparison)

Can you recommend a suitable dataset that works with the tests in questions-answers.txt? The corpora I currently have on hand are all historical German and not really suitable for these tests.

@manneshiva

You would also want to include the size of wv.buckets_word to estimate_memory() method.

Will do!
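
A hypothetical sketch of what that accounting could look like, assuming buckets_word is a dict mapping word index to a tuple of bucket indices (the helper name and estimation approach here are illustrative assumptions, not the merged code):

from sys import getsizeof

def estimate_buckets_word_memory(wv):
    # Rough, illustrative estimate of the memory held by the per-word bucket lists.
    if not getattr(wv, 'buckets_word', None):
        return 0
    return getsizeof(wv.buckets_word) + sum(
        getsizeof(buckets) for buckets in wv.buckets_word.values()
    )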

The initial setup time before training (calculating hashes and storing ngrams) should definitely be faster now that you have Cythonized compute_ngrams and ft_hash. But I am not sure what is causing the speedup in training (in terms of words/sec). As far as I know, the initial code (Gensim 3.3.0) did not call compute_ngrams or ft_hash during training (from fasttext_inner.pyx), what do you think is the reason for 2x increase in the number of words processed per second (I am assuming you have pasted these numbers (words/sec) from the logs)? The table values compare your current implementation (optimized) with the implementation from Gensim 3.3.0 (original), am I right?

Yes, the values for original are from the latest commit on the develop branch, optimized is the code of this PR.
I think the reason might be that the new code does not use a list comprehension but instead directly looks up the tuple of ngram buckets:

subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]]

vs

subwords = model.wv.buckets_word[word.index]

The current version does num_ngrams + 2 lookups, while the new code always just uses a single lookup. It could be that the previous code was causing a lot of cache evictions/misses and the new one doesn't? I could try to look at cache usage patterns with perf if I find the time.
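
For context, a minimal sketch of the pre-computation that makes this single lookup possible; it runs once before training, not per batch (names as used above, not the literal PR code):

def precompute_buckets_word(wv, bucket, min_n, max_n):
    # Build an index -> tuple-of-bucket-indices map once, so the training loop
    # only ever does a single lookup per word.
    wv.buckets_word = {}
    for word, vocab_item in wv.vocab.items():
        buckets = []
        for ngram in compute_ngrams(word, min_n, max_n):
            ngram_hash = ft_hash(ngram) % bucket
            if ngram_hash in wv.hash2index:
                buckets.append(wv.hash2index[ngram_hash])
        wv.buckets_word[vocab_item.index] = tuple(buckets)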

@menshikh-iv (Contributor):

Do you have any idea what could be causing this? estimate_memory is called in the training tests as well but doesn't cause an error there. You mentioned that this tends to happen on AppVeyor with other tests as well?

This happens sometimes with AppVeyor (due to memory limits), so you can try to use a smaller model in this case.

Can you recommend a suitable dataset that works with the tests in questions-answers.txt? The corpora I currently have on hand are all historical German and not really suitable for these tests.

A sample from https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 should be a good idea (you should pick the first 1M articles).

@menshikh-iv (Contributor) commented Feb 28, 2018

I'll run a benchmark myself too (update: finished).

Code: https://gist.github.com/menshikh-iv/ba8cba26744c668e73b59d5972dabbf8
Evaluation dataset: https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt
Input: the first 0.5M articles from the wiki dump ("title" + "sect_title" + "sect_content" + ...)
Preprocessing: very simple (without stemming)

from functools import partial

from gensim.parsing.preprocessing import (
    preprocess_string, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short
)

prc = partial(
    preprocess_string,
    filters=[strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short]
)

Model parameters: almost default, only iter=1 (because the corpus is large enough).

Metric                  | original                | optimized               | improvement (x)
Training time (1 epoch) | 4823.4s (80.38 minutes) | 1873.6s (31.22 minutes) | 2.57
Training time (full)    | 1h 26min 13s            | 36min 43s               | 2.35
Training words/sec      | 72781                   | 187366                  | 2.57
Training peak memory    | 5,173 MB                | 3,671 MB                | 1.4

Benchmark                   | original           | optimized
capital-common-countries    | 60.3% (305/506)    | 63.4% (321/506)
capital-world               | 35.9% (655/1826)   | 36.6% (669/1826)
currency                    | 0.8% (1/128)       | 0.8% (1/128)
city-in-state               | 24.5% (539/2203)   | 23.4% (515/2203)
family                      | 69.9% (214/306)    | 72.2% (221/306)
gram1-adjective-to-adverb   | 85.7% (514/600)    | 85.7% (514/600)
gram2-opposite              | 70.0% (147/210)    | 72.4% (152/210)
gram3-comparative           | 84.0% (1000/1190)  | 84.4% (1004/1190)
gram4-superlative           | 80.1% (442/552)    | 78.8% (435/552)
gram5-present-participle    | 71.5% (465/650)    | 71.7% (466/650)
gram6-nationality-adjective | 90.5% (1175/1299)  | 91.3% (1186/1299)
gram7-past-tense            | 49.8% (664/1332)   | 49.7% (662/1332)
gram8-plural                | 88.0% (873/992)    | 87.1% (864/992)
gram9-plural-verbs          | 83.8% (387/462)    | 86.1% (398/462)
total                       | 60.2% (7381/12256) | 60.4% (7408/12256)

@jbaiter (Contributor, Author) commented Feb 28, 2018

So I ran the text8 tests 100 times with both the current implementation and the optimized one; the total scores are now 50.38% (original) and 50.29% (optimized), so a pretty small difference.
The averaged tests with the first 1M documents from the wiki corpus are currently running; I will be able to report on the outcome tomorrow.

As for the causes of the speedup, it seems that my changes somehow result in a significant increase in parallelism, as can be gathered from these performance counters:

Performance counters for training + evaluation on current

    1404012.973981      task-clock (msec)         #    2.145 CPUs utilized
         1,272,950      context-switches          #    0.907 K/sec
            27,026      cpu-migrations            #    0.019 K/sec
           667,005      page-faults               #    0.475 K/sec
 4,603,142,310,724      cycles                    #    3.279 GHz                      (48.93%)
 5,168,917,813,028      instructions              #    1.12  insn per cycle           (61.42%)
 1,032,270,301,259      branches                  #  735.228 M/sec                    (61.21%)
     8,170,082,299      branch-misses             #    0.79% of all branches          (60.97%)
 1,732,560,787,151      L1-dcache-loads           # 1234.006 M/sec                    (45.44%)
   164,838,925,283      L1-dcache-load-misses     #    9.51% of all L1-dcache hits    (28.92%)
    36,178,103,172      LLC-loads                 #   25.768 M/sec                    (25.83%)
    10,871,658,490      LLC-load-misses           #   30.05% of all LL-cache hits     (36.92%)

     654.542443124 seconds time elapsed

Performance counters for training + evaluation on optimized

    1032653.771199      task-clock (msec)         #    3.002 CPUs utilized
         1,721,311      context-switches          #    0.002 M/sec
             9,352      cpu-migrations            #    0.009 K/sec
           623,004      page-faults               #    0.603 K/sec
 3,411,762,671,386      cycles                    #    3.304 GHz                      (48.48%)
 4,492,762,293,396      instructions              #    1.32  insn per cycle           (60.97%)
   877,973,650,752      branches                  #  850.211 M/sec                    (60.70%)
     6,791,892,588      branch-misses             #    0.77% of all branches          (60.63%)
 1,521,323,964,974      L1-dcache-loads           # 1473.218 M/sec                    (43.68%)
   126,194,704,440      L1-dcache-load-misses     #    8.30% of all L1-dcache hits    (29.18%)
    21,929,347,031      LLC-loads                 #   21.236 M/sec                    (26.23%)
     5,608,723,856      LLC-load-misses           #   25.58% of all LL-cache hits     (36.70%)

     343.979779366 seconds time elapsed

In addition to using a full extra core, the IPC seems to be higher (1.32 instructions/cycle vs. 1.12 before) and the overall number of instructions is lower (could that be because of the lower number of lookups?).

@menshikh-iv (Contributor):

I checked the current PR with 500,000 articles from the wiki, my results are in #1916 (comment); this looks exciting 🌟 🔥 impressive work @jbaiter!

I also checked that the model from 3.3.0 loads fine; only 3.2.0 still needs to be checked now.

@menshikh-iv (Contributor) commented Feb 28, 2018

@jbaiter the last things still missing here:

  • Check that old FastText models load correctly with the new code
  • Fix the AppVeyor memory issue (reduce the model size)

@menshikh-iv (Contributor):

These are the files from the profiler (current release version + PR version).

Script:

import gensim.downloader as api
from gensim.models import FastText

data = api.load("text8")
model = FastText(data)

master.txt
optimized.txt

@manneshiva this is for you

@menshikh-iv changed the title from "Optimizations for FastText" to "Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824" on Feb 28, 2018
@jayantj (Contributor) left a review comment:

Hi @jbaiter thanks a lot for the PR! This looks really great, and that's some serious speedup. Would really appreciate if you could address the comments in my review.

wv.hash2index[ngram_hash] = new_hash_count
wv.ngrams[ngram] = wv.hash2index[ngram_hash]
new_hash_count = new_hash_count + 1
wv.num_ngram_vectors = 0
Contributor:

We could probably reduce some variables here - there seems to be some redundancy, if I understand correctly. wv.num_ngram_vectors, new_hash_count and len(ngram_indices) serve effectively the same purpose.
Maybe we could use len(ngram_indices) within the loop and set wv.num_ngram_vectors at the end of the loop?
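
For illustration, the suggested shape could look something like this sketch (not a drop-in patch; `ngrams` stands for the iterable of new ngrams being registered, and the attribute placement is schematic):

ngram_indices = []
for ngram in ngrams:
    ngram_hash = _ft_hash(ngram) % self.bucket
    if ngram_hash in wv.hash2index:
        continue                                    # bucket already claimed by an earlier ngram
    wv.hash2index[ngram_hash] = len(ngram_indices)
    ngram_indices.append(len(wv.vocab) + ngram_hash)
wv.num_ngram_vectors = len(ngram_indices)           # set once, after the loop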

new_ngrams = list(set(new_ngrams))
wv.num_ngram_vectors += len(new_ngrams)
logger.info("Number of new ngrams is %d", len(new_ngrams))
if not wv.buckets_word:
Contributor:

What is the purpose of this?

new_hash_count = new_hash_count + 1
else:
wv.ngrams[ngram] = wv.hash2index[ngram_hash]
num_new_ngrams = 0
Contributor:

There seems to be some redundancy again with new_hash_count, num_new_ngrams.

continue
ngram_indices.append(len(wv.vocab) + ngram_hash)
wv.hash2index[ngram_hash] = wv.num_ngram_vectors
wv.num_ngram_vectors += 1
Contributor:

Can be set to len(ngram_indices) at the end instead (sorry for nitpicking, but we already have very long code for some of these methods)

word_vec += ngram_weights[self.ngrams[ngram]]
ngram_hash = _ft_hash(ngram) % self.bucket
if ngram_hash in self.hash2index:
word_vec += ngram_weights[self.hash2index[ngram_hash]]
if word_vec.any():
return word_vec / len(ngrams)
Contributor:

This probably needs to be updated to only take into account the ngrams for which hashes were present in self.hash2index
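
For illustration, the suggested change amounts to something like this sketch (the ngram_hits counter name is made up here):

ngram_hits = 0
for ngram in ngrams:
    ngram_hash = _ft_hash(ngram) % self.bucket
    if ngram_hash in self.hash2index:
        word_vec += ngram_weights[self.hash2index[ngram_hash]]
        ngram_hits += 1
if word_vec.any():
    return word_vec / ngram_hits      # average only over the ngrams that matched a bucket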

0.58818,
0.57828,
0.75801
-0.21929,
Contributor:

What is the reason for this change? Could it be because of the len(ngrams) issue mentioned in a comment above?

@jbaiter (Contributor, Author) commented Feb 28, 2018:

That's really a crucial question, but it's not because of the len(ngrams) issue.
The change was honestly simply made because the numbers were pretty similar and I thought that the vectors just changed a bit since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).

But it looks like there might be a bug in the old code that has something to do with this: there are a lot more ngram vectors in the loaded model (17004) than there are in the model on disk (2762). This is probably because wv.vectors_ngram = wv.vectors_ngrams.take(ngram_indices, axis=0) in init_ngrams_post_load will result in a (num_ngrams_total, ngram_vec_len) matrix. Shouldn't vectors_ngram have a shape of (num_buckets, ngram_vec_len)? At least that's the case in the new code, and it follows from my (not necessarily correct) understanding of how the bucketing in this implementation works.

This sounds similar to what was reported in #1779

Contributor:

...since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).

Ahh right, that makes sense, thanks for explaining.

Re: the number of ngram vectors being greater than num_buckets (or the number of vectors on disk) - I see why that might have been happening. With an ngram vocab larger than the number of buckets, a lot of ngrams will be mapped to the same indices. And when .take is passed a list that contains multiple occurrences of the same index, the vector at that index is "taken" multiple times.
For example:

>>> import numpy as np
>>> all_vectors = np.array([[0.1, 0.3], [0.3, 0.1]])
>>> taken_vectors = all_vectors.take([0, 1, 0], axis=0)
>>> taken_vectors.shape
(3, 2)

So it wouldn't produce incorrect results, but yeah, it would result in unexpectedly high memory usage (and kind of blow the whole idea of keeping memory usage constant even with increasing ngram vocabs out of the water). Thanks for investigating this and explaining it!

@menshikh-iv we're good to merge from my side

0.18025,
-0.14128,
0.22508
-0.49111,
Contributor:

Ditto:
What is the reason for this change? Could it be because of the len(ngrams) issue mentioned in a comment above?

@menshikh-iv (Contributor) commented Feb 28, 2018

Some missed stuff
