Loading the English wikipedia model hangs indefinitely when low on RAM #2372

Closed
mpenkov opened this issue Feb 5, 2019 · 19 comments
Labels: fasttext (Issues related to the FastText model)


mpenkov commented Feb 5, 2019

Initially reported here by @akutuzov.

This is yet another regression after the fastText code refactoring in Gensim 3.7 (another one was fixed in #2341).
Indeed, Gensim 3.6 loads pre-trained fastText models without any trouble. Below are examples with the Wikipedia model from https://fasttext.cc/, but the same thing happens with any model trained using native fastText.

import gensim
gensim.__version__
'3.6.0'
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:23:47,740 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:29:54,820 : INFO : loading weights for 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:37:43,068 : INFO : loaded (2519370, 300) weight matrix for fastText model from wiki.en.bin 
model
<gensim.models.fasttext.FastText at 0x7f8e98e2c320>

However, Gensim 3.7 is doing weird things here (retraining the model instead of loading it?):

import gensim
gensim.__version__
'3.7.0'
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
2019-01-24 16:25:50,816 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-01-24 16:30:14,701 : INFO : resetting layer weights
2019-01-24 16:30:14,702 : INFO : Total number of ngrams is 0
2019-01-24 16:30:14,702 : INFO : Updating model with new vocabulary
2019-01-24 16:30:40,839 : INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)
2019-01-24 16:31:02,325 : INFO : deleting the raw counts dictionary of 2519370 items
2019-01-24 16:31:02,325 : INFO : sample=0.0001 downsamples 650 most-common words
2019-01-24 16:31:02,326 : INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)

After it went like this for an hour, I killed the process.

Gensim 3.7.1 does the same, nothing changed. I'm sorry, but it seems the fastText refactoring in 3.7 was very poorly tested, with so many things broken :-(

@mpenkov added the bug and fasttext labels on Feb 5, 2019
@mpenkov self-assigned this on Feb 5, 2019

mpenkov commented Feb 5, 2019

@akutuzov

Gensim 3.7.1 does the same, nothing changed.

I tried to reproduce this with the current develop branch (commit hash 21376a4; the FB I/O code is identical to the 3.7.1 release), and couldn't:

(devel.env) mpenkov@hetrad2:~$ ls -l wiki.en.bin
-rw-r--r-- 1 mpenkov mpenkov 8493673445 Oct 19  2017 wiki.en.bin
(devel.env) mpenkov@hetrad2:~$ md5sum wiki.en.bin
92341f7f94c1801db19bb6b114fddff9  wiki.en.bin
(devel.env) mpenkov@hetrad2:~$ cat 2372.py 
import gensim
print(gensim.__version__)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.fasttext.FastText.load_fasttext_format('wiki.en')
print(model)
(devel.env) mpenkov@hetrad2:~$ python --version
Python 3.5.2
(devel.env) mpenkov@hetrad2:~$ time python 2372.py
2019-02-05 03:13:50,019 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-02-05 03:14:18,547 : INFO : resetting layer weights
2019-02-05 03:14:18,547 : INFO : Total number of ngrams is 0
2019-02-05 03:14:18,548 : INFO : Updating model with new vocabulary
2019-02-05 03:14:28,589 : INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)
2019-02-05 03:14:38,003 : INFO : deleting the raw counts dictionary of 2519370 items
2019-02-05 03:14:38,003 : INFO : sample=0.0001 downsamples 650 most-common words
2019-02-05 03:14:38,003 : INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)
2019-02-05 03:18:55,583 : INFO : loaded (4519370, 300) weight matrix for fastText model from wiki.en.bin
FastText(vocab=2519370, size=300, alpha=0.025)

real    5m13.520s
user    4m15.348s
sys     0m56.488s
(devel.env) mpenkov@hetrad2:~$

Perhaps the model you are using is different? Please check the byte size and md5 hash above and let me know if your results differ.

Furthermore, how much memory do you have available when loading the model? What OS and Python version are you using?

@mpenkov added the need info label (not enough information to reproduce the issue; more info needed from the author) on Feb 5, 2019

akutuzov commented Feb 5, 2019

Thanks for researching this.
However, the title of the issue is wrong: it doesn't have anything to do with Wikipedia models. It happens with any fastText model in *.bin format.
I checked now, and indeed a smaller model (with a vocabulary of about 200K words) does load correctly. However, with Gensim 3.6 it takes only 3 GBytes of RAM and 37 seconds to load this model, while with Gensim 3.7.1 it takes 6 (!) GBytes of RAM and 50 seconds.
Thus, loading a fastText model now takes twice as much memory and almost twice as long. No wonder many users experience freezes and errors on models which previously worked without any problems. In my initial report, I used the Wikipedia model with a very large vocabulary (2.5M words), so it was simply taking forever to load. What is the reason for such a drastic increase in resource consumption?

I have Gensim 3.6 installed on Python 3.5, and Gensim 3.7.1 installed on Python 3.6. Both on the same machine with Linux Mint and 8 GBytes of RAM.


akutuzov commented Feb 5, 2019

Also, why do the log messages when loading a fastText model now look like the model is being trained rather than loaded ('Updating model with new vocabulary', 'downsampling', etc.)?

Gensim: 3.7.1
2019-02-05 10:33:30,931 : INFO : loading 228671 words for fastText model from parameters.bin
2019-02-05 10:33:39,311 : INFO : resetting layer weights
2019-02-05 10:33:39,311 : INFO : Total number of ngrams is 0
2019-02-05 10:33:39,312 : INFO : Updating model with new vocabulary
2019-02-05 10:33:39,859 : INFO : New added 228671 unique words (50% of original 457342) and increased the count of 228671 pre-existing words (50% of original 457342)
2019-02-05 10:33:40,850 : INFO : deleting the raw counts dictionary of 228671 items
2019-02-05 10:33:40,850 : INFO : sample=0.0001 downsamples 1400 most-common words
2019-02-05 10:33:40,850 : INFO : downsampling leaves estimated 1521803161 word corpus (157.9% of prior 963672019)
2019-02-05 10:34:12,676 : INFO : loaded (2228671, 300) weight matrix for fastText model from parameters.bin
FastText(vocab=228671, size=300, alpha=0.025)

What does that '(50% of original 457342)' mean? There are 228671 words in the vocabulary of the model being loaded; where does this doubled number come from?
This is very confusing; compare with Gensim 3.6 and the same model:

Gensim: 3.6.0
2019-02-05 10:32:09,784 : INFO : loading 228671 words for fastText model from parameters.bin
2019-02-05 10:32:23,655 : INFO : loading weights for 228671 words for fastText model from parameters.bin
2019-02-05 10:32:45,777 : INFO : loaded (228671, 300) weight matrix for fastText model from parameters.bin
FastText(vocab=228671, size=300, alpha=0.025)


mpenkov commented Feb 5, 2019

However, the title of the issue is wrong: it doesn't have anything to do with Wikipedia models.

I don't think it's necessarily wrong, but now that you've provided more information, I can improve the title.

It happens with any fastText model in *.bin format.
I checked now, and indeed a smaller model (with a vocabulary of about 200K words) does load correctly.

There's a contradiction in there somewhere ;)

However, with Gensim 3.6 it takes only 3 GBytes of RAM and 37 seconds to load this model, while with Gensim 3.7.1 it takes 6 (!) GBytes of RAM and 50 seconds.
Thus, loading a fastText model now takes twice as much memory and almost twice as long. No wonder many users experience freezes and errors on models which previously worked without any problems. In my initial report, I used the Wikipedia model with a very large vocabulary (2.5M words), so it was simply taking forever to load. What is the reason for such a drastic increase in resource consumption?

The previous implementation of the load_fasttext_format function was broken - it did not load the complete model. This prevented training continuation.

As part of the refactoring, we resolved the above problem by loading the full model, including the neural network that is necessary for training continuation. This change means the model now occupies more memory. If you want to use the previous, broken behavior, you can pass the full_model=False parameter to the load_fasttext_format function. For more info, please see the documentation.
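
For illustration, something like this (a minimal sketch; full_model is the keyword argument introduced in 3.7):

import logging
from gensim.models.fasttext import FastText

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load only what is needed for vector lookup, skipping the hidden layers
# required for training continuation -- roughly the pre-3.7 behavior.
model = FastText.load_fasttext_format('wiki.en', full_model=False)
print(model.wv['hello'])  # lookup still works, including OOV words via char n-grams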

In the future, we will make a better effort to clarify changes like the above in the change log.

I have Gensim 3.6 installed on Python 3.5, and Gensim 3.7.1 installed on Python 3.6. Both on the same machine with Linux Mint and 8 GBytes of RAM.

I think you're running out of RAM. The load is taking a prohibitively long time because the system is paging.

@mpenkov changed the title from "Loading the English wikipedia model hangs indefinitely" to "Loading the English wikipedia model hangs indefinitely when low on RAM" on Feb 5, 2019

akutuzov commented Feb 5, 2019

As part of the refactoring, we resolved the above problem by loading the full model, including the neural network that is necessary for training continuation. This change means the model now occupies more memory

OK, this makes sense in itself. However, see below:

If you want to use the previous, broken behavior, you can pass the full_model=False parameter to the load_fasttext_format function.

I would say that full_model=True probably should not be the default behavior. First, it makes existing code consume twice the resources compared to the previous behavior. Second, arguably most users don't actually want to train loaded fastText models further. This is an important option to have, but it is required by only some (advanced) users. It hardly makes sense to waste the memory of those who simply want to look up static vectors, not continue training. I believe the best way to handle this is to load models the old way by default, but show a warning message saying something like 'if you want to continue training, set this parameter to True'. This should also be mentioned in the Gensim fastText tutorial notebook.

Anyway, I think if we introduce changes that double the consumption of resources, this should be stated in the changelog as clearly as possible.

And there is still the issue of these weird log messages showing inconsistent numbers. Once again, these are Gensim 3.7.1 logs when loading the previously mentioned large Wikipedia model (this time on a machine with more RAM):

2019-02-05 15:45:27,129 : INFO : loading 2519370 words for fastText model from wiki.en.bin
2019-02-05 15:47:29,576 : INFO : resetting layer weights
2019-02-05 15:47:29,576 : INFO : Total number of ngrams is 0
2019-02-05 15:47:29,577 : INFO : Updating model with new vocabulary
2019-02-05 15:47:50,002 : INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)
2019-02-05 15:48:15,712 : INFO : deleting the raw counts dictionary of 2519370 items
2019-02-05 15:48:15,712 : INFO : sample=0.0001 downsamples 650 most-common words
2019-02-05 15:48:15,712 : INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)
2019-02-05 15:58:29,511 : INFO : loaded (4519370, 300) weight matrix for fastText model from wiki.en.bin
FastText(vocab=2519370, size=300, alpha=0.025)

The vocabulary size of this model is 2519370; why is the weight matrix (4519370, 300)? Are the 2 million 'extra' rows the vectors for n-grams? If so, they should probably be reported as such.
And why are we 'downsampling most-common words' and 'updating model with new vocabulary' when we were supposed to just load a pre-trained model?


mpenkov commented Feb 6, 2019

I would say that full_model=True probably should not be the default behavior.

You make valid arguments. When we were discussing the best default, @menshikh-iv and I decided on the current behavior for two reasons:

  1. It is correct
  2. It is consistent with the rest of Gensim. If you can load a model, you can continue training it.

This should also be mentioned in the Gensim fastText tutorial notebook.

That tutorial doesn't mention anything about Facebook I/O and it never did. The Jupyter notebooks aren't a good place for this, because we don't have automatic checks for their correctness (unlike, for example, the docstrings).

The docstring for the fasttext module already contains detailed documentation about Facebook I/O functionality.

Anyway, I think if we introduce changes that double the consumption of resources, this should be stated in the changelog as clearly as possible.

Yes, I agree, and regret that we were not able to do this for the 3.7.1 release.

The vocabulary size of this model is 2519370; why is the weight matrix (4519370, 300)? Are the 2 million 'extra' rows the vectors for n-grams?

Yes.

And why are we 'downsampling most-common words' and 'updating model with new vocabulary' when we were supposed to just load a pre-trained model?

This is an implementation detail. From your perspective, we are "just loading a pre-trained model", but there is more than just I/O happening here. As you can see, the code is converting the vocabulary from Facebook's plain dictionary to Gensim's OO design, among other things.

The log messages you are seeing are there because the refactoring re-used some of the training code when loading a native model. Previously, this was done by duplicated code, with different logging output (almost none).

@mpenkov removed the bug and need info labels on Feb 6, 2019

piskvorky commented Feb 6, 2019

I agree the current logging could be better. I find it confusing too when I put myself in a user's shoes:

"INFO : Total number of ngrams is 0" -- What? I just asked you (Gensim) to load a FT model, why are there 0 ngrams?
"INFO : New added 2519370 unique words (50% of original 5038740) and increased the count of 2519370 pre-existing words (50% of original 5038740)" -- "New added X words" => "Added X new words"? Or was it a method called "New" that added them? Where do these "pre-existing words" come from, if we just loaded and updated a model from scratch, from 0?
"INFO : deleting the raw counts dictionary of 2519370 items" -- Why are you deleting my dictionary?
"INFO : sample=0.0001 downsamples 650 most-common words" -- I asked you (Gensim) to load a model, why are you downsampling anything?
"INFO : downsampling leaves estimated 4076481917 word corpus (103.2% of prior 3949186974)" -- Nooo, leave my model alone! Also, how can downsampling increase the size to 103%?!

INFO logs should tell a story to the user. Their purpose is to illuminate what's going on and show progress, not confuse the user with non-actionable trivia.

DEBUG logs should also tell a story, but one useful for debugging and developers: more internal detail than workflow progress. It seems most of these logs should be either rephrased or demoted to DEBUG; there is not enough context for them to be useful to users without knowledge of internal implementation details. The Gensim 3.6.0 log posted by @akutuzov looked significantly clearer to me, as a user.


akutuzov commented Feb 6, 2019

You make valid arguments. When we were discussing the best default, @menshikh-iv and I decided on the current behavior for two reasons:
1. It is correct

Well, I think both modes (with or without the matrices necessary to continue training) are correct, for different tasks.

2. It is consistent with the rest of Gensim. If you can load a model, you can continue training it.

This is not entirely correct. For example, we have a (widely used) load_word2vec_format() method. It returns an embedding model which can't be trained further, since only one vector array is present. But it is still immensely useful.
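
For example (standard usage; the file name is just a placeholder):

from gensim.models import KeyedVectors

# Returns a static embedding: lookups and similarity queries work,
# but there are no hidden layers, so no further training is possible.
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(wv.most_similar('king'))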

The log messages you are seeing are there because the refactoring re-used some of the training code when loading a native model. Previously, this was done by duplicated code, with different logging output (almost none).

Yes, I remember seeing these (or very similar) log messages when continuing training of word2vec models in vocabulary-update mode. And this makes them even more confusing when a user sees them in the context of loading a pre-trained model. I agree with Radim: it doesn't matter what code paths run under the hood, the INFO log messages should tell the user what is happening conceptually.


akutuzov commented Feb 6, 2019

So, in the end this issue turned out to be about improving log messages and about whether loading 'full' fastText models should be the default behavior :-)
Does it make sense to start separate issues for these?


mpenkov commented Feb 6, 2019

For example, we have a (widely used) load_word2vec_format() method. It returns an embedding model which can't be trained further, since only one vector array is present. But it is still immensely useful.

From the POV of Gensim's FastText design, what you get from calling load_fasttext_format(..., full_model=False) isn't really a model. It's a dummy model wrapping a working KeyedVectors instance (for the difference between a FastText model and KeyedVectors, see this part of the docs).

It may be clearer to introduce a separate function that loads just the KeyedVectors from an FB binary. Users who don't want to continue training can use the new function.
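
A rough sketch of what I mean (the function name is hypothetical, not an existing API):

from gensim.models.fasttext import FastText

def load_fasttext_vectors(path):
    # Hypothetical helper: load the embedding only and discard the dummy model wrapper.
    model = FastText.load_fasttext_format(path, full_model=False)
    return model.wv  # FastTextKeyedVectors; supports OOV lookup via char n-grams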

@piskvorky WDYT about the default value for the full_model flag? I think we've addressed both sides of the argument above, so we should make a call: keep it as is (full_model=True) or roll back to the previous behavior as default (full_model=False)?

@akutuzov I think better logging should be a separate issue. It's relatively simple, and someone who has never contributed to the repo could do it.


piskvorky commented Feb 6, 2019

No matter what default we choose, there will be some people confused.

I like the idea of two separate functions (which may be thin wrappers over an internal function that accepts the parameter). And then have the docs promote whichever function is appropriate, to drive the message home, especially for the (numerous) copy-paste-happy crowd.

If that's not possible, full_model=False makes more sense to me. Both for backward compatibility and because it's the more common use-case (what users "typically" want).

Having good logging is deceptively simple. It's like good documentation: while anyone can do it, to do it well you need to understand the workflows and concepts, to know what's important and tell the story well. Not sure that's a 100% newcomer job.


mpenkov commented Feb 6, 2019

Given that we're already thinking about simplifying the API, I think it's better to introduce a separate function as part of that effort.


akutuzov commented Feb 6, 2019

I believe that then there should be a separate new function to load full models, not vice versa.

Thus, the load_fasttext_format() function will continue to behave the same way it behaved before Gensim 3.7: loading only the parts of fastText models needed to look up word vectors (including inference from char n-grams for OOV words). This will be good for backwards compatibility, as @piskvorky has said. This will also be consistent with the similarly named load_word2vec_format() function, where users also don't expect the loaded model to be ready for further training.

Since the fastText *.bin format (unlike word2vec binary format) does contain the information necessary to continue training, we can provide a new 'advanced' function: load_fasttext_full() or something like this.

I also agree with @piskvorky that a random new contributor will arguably not be able to come up with useful and meaningful log messages, since it requires deep understanding of both the fastText algorithm itself and its particular implementation in Gensim. As a quick fix, I would suggest moving all these confusing messages into the DEBUG category and returning the INFO messages to their Gensim 3.6 state.


mpenkov commented Feb 6, 2019

@akutuzov

I believe that then there should be a separate new function to load full models, not vice versa.

This will also be consistent with the similarly named load_word2vec_format() function, where users also don't expect the loaded model to be ready for further training.

Actually, if you have a look at the load_word2vec_format function, you'll see that it returns an embedding (KeyedVectors subclass), not a model.

This means the behavior of the old load_fasttext_format function was already inconsistent with load_word2vec_format, because the former returns a model that wraps an embedding (where the model is essentially useless) whereas the latter returns an embedding.

If we really wanted to make things consistent, then we'd make a new KeyedVectors.load_fasttext_format class method. That method would return FastTextKeyedVectors, which is a word embedding that you can use to calculate vectors. This would be the same embedding you get with full_model=False, minus the model wrapper (which is useless). In this scenario, the FastText.load_fasttext_format would continue to load full models.
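
Conceptually (a hypothetical sketch; the KeyedVectors class method does not exist yet):

wv = KeyedVectors.load_fasttext_format('wiki.en')    # embedding only: FastTextKeyedVectors
model = FastText.load_fasttext_format('wiki.en')     # full, trainable model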

So we have to make a tradeoff between backwards compatibility and correctness/consistency here:

  1. Go back to the old behavior. It was broken and inconsistent, but people were used to it.
  2. Refactor the API to make things consistent: loading models gives working models, loading embeddings gives working embeddings, and make it clear which one is which. We'd have an easier time documenting our API, too, since things would be less confusing.

Given that we're trying to improve Gensim's FastText implementation, my preference is to refactor, but I'll let @piskvorky make the call on how to proceed here.


piskvorky commented Feb 6, 2019

I also agree with @piskvorky that a random new contributor will arguably not be able to come up with useful and meaningful log messages, since it requires deep understanding of both the fastText algorithm itself and its particular implementation in Gensim.

Do I hear a volunteering call? :) Can you improve the INFO logging messages @akutuzov (not just DEBUG)?

@mpenkov I'm +1 for consistency. 3.7.0 was big and recent, and we can still afford a bug-fix release while the dust is settling (we also need those NMF fixes from #2371).


akutuzov commented Feb 7, 2019

Actually, if you have a look at the load_word2vec_format function, you'll see that it returns an embedding (KeyedVectors subclass), not a model.
This means the behavior of the old load_fasttext_format function was already inconsistent with load_word2vec_format, because the former returns a model that wraps an embedding (where the model is essentially useless) whereas the latter returns an embedding.

Yes, that's technically true, but this is, as you said earlier, 'an implementation detail' :-) From the user's point of view, a KeyedVectors object is a model: an array of vectors mapped to words. Whether this embedding model is 'trainable' or 'static' is another issue. When I mention 'models', I mean models as a general notion, not Gensim classes.

But I agree it would be great if things were consistent both on the level of user experience and on the level of the 'under the hood' implementation. Clearly separating loading trainable models from loading static vector lookup tables (the usual meaning of 'pre-trained embeddings') would also be great.


akutuzov commented Feb 7, 2019

Do I hear a volunteering call? :) Can you improve the INFO logging messages @akutuzov (not just DEBUG)?

I'd be glad to do this, but I'm not sure I understand the fastText implementation in Gensim well enough (especially after the recent refactoring). I can of course spend some time finding out why loading fastText models triggers the same logging messages as incrementally updating word2vec models (with seemingly meaningless numbers), and what messages should be produced instead. But my impression is that the designers of the current fastText code would do it much faster.

That being said, I can at least try. But my first step will still be moving the current messages to DEBUG, since in their current state they are absolutely cryptic to anyone except the authors of the code :)


mpenkov commented Feb 7, 2019

@akutuzov

I think you're in a good position to work on the logs. You bring the perspective of an experienced gensim user - that would help in writing informative log messages.

As for reading and understanding the code, I'd encourage you to have a look and see how far you can get. It's not as ugly as it may seem (and definitely less ugly after the most recent refactoring). Demoting the current messages to DEBUG would be a decent start, in my opinion.


akutuzov commented Feb 7, 2019

As for reading and understanding the code, I'd encourage you to have a look and see how far you can get.

OK, will do this.

mpenkov added a commit that referenced this issue Mar 7, 2019
Introduced two new pure functions to the gensim.models.fasttext module:

1. load_facebook_vectors: loads embeddings from binaries in FB's fastText .bin format
2. load_facebook_model: loads the full model from binaries in FB's fastText .bin format

The existing FastText.load_fasttext_format method loads full models only. I've placed a deprecation warning around it. The full_model parameter is gone - it was only introduced in 3.7.1, so it's not too late to just rip it out, IMHO.

When releasing 3.7.2, we should include the above in the change log, as it changes the behavior with respect to 3.6.0.
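
Usage would then look roughly like this (a sketch against the new functions):

from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# Full model: supports training continuation, at the cost of extra RAM.
model = load_facebook_model('wiki.en.bin')

# Embedding only: vector lookup (including OOV words via char n-grams), smaller footprint.
wv = load_facebook_vectors('wiki.en.bin')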

Fixes #2372