Loading the English wikipedia model hangs indefinitely when low on RAM #2372
I tried to reproduce with the current develop branch (commit hash 21376a4, FB I/O is identical to the 3.7.1 release), and couldn't.
Perhaps the model you are using is different? Please check the byte size and MD5 hash above and let me know if your results differ. Also, how much memory do you have available when loading the model? What OS and Python version?
Thanks for researching it. I have Gensim 3.6 installed on Python 3.5, and Gensim 3.7.1 installed on Python 3.6. Both on the same machine with Linux Mint and 8 GB of RAM.
Also, why do the log messages when loading a fastText model now look like the model is being trained rather than loaded ('Updating model with new vocabulary', 'downsampling', etc.)? What does that output mean?
I don't think it's necessarily wrong, but now that you've provided more information, I can improve the title.
There's a contradiction in there somewhere ;)
The previous implementation of the load_fasttext_format function was broken - it did not load the complete model. This prevented training continuation. As part of the refactoring, we resolved the above problem by loading the full model, including the neural network that is necessary for training continuation. This change means the model now occupies more memory. If you want to use the previous, broken behavior, you can pass the full_model=False parameter to the load_fasttext_format function. For more info, please see the documentation. In the future, we will make a better effort to clarify changes like the above in the change log.
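For illustration, a minimal sketch of the two loading modes described above (the file path is a placeholder; this assumes the 3.7.1 API):

```python
from gensim.models import FastText

# Default in 3.7.1: load the complete model, including the hidden weights
# needed for training continuation; this roughly doubles the memory footprint.
model = FastText.load_fasttext_format('wiki.en.bin')

# Previous (lighter) behavior: skip the training-related matrices. The result
# supports vector lookups, but not continued training.
model_light = FastText.load_fasttext_format('wiki.en.bin', full_model=False)
```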
I think you're running out of RAM. The load is taking a prohibitively long time because the system is performing paging.
OK, this makes sense in itself. However, see below.
Anyway, I think that if we introduce changes that double the consumption of resources, this should be stated in the changelog as clearly as possible. And there is still the issue of these weird log messages showing inconsistent numbers. Once again, these are Gensim 3.7.1 logs when loading the previously mentioned large Wikipedia model (I now used a machine with more RAM):
The vocabulary size of this model is 2519370; why is the weight matrix (4519370, 300)? Are these 2 million 'extra' rows the vectors for ngrams? Then they should probably be reported as such.
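For what it's worth, the numbers line up with fastText's default number of ngram hash buckets (a sketch, assuming the model was trained with the default `-bucket` setting):

```python
vocab_size = 2_519_370     # words reported in the model's vocabulary
ngram_buckets = 2_000_000  # fastText's default `-bucket` value

# Each row of the input matrix holds either a word vector or an
# ngram-bucket vector, which would explain the reported shape:
assert vocab_size + ngram_buckets == 4_519_370
```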
You make valid arguments. When we were discussing the best default, @menshikh-iv and I decided on the current behavior for two reasons.
That tutorial doesn't mention anything about Facebook I/O and it never did. The Jupyter notebooks aren't a good place for this, because we don't have automatic checks for their correctness (unlike, for example, the docstrings). The docstring for the fasttext module already contains detailed documentation about Facebook I/O functionality.
Yes, I agree, and regret that we were not able to do this for the 3.7.1 release.
Yes.
This is an implementation detail. From your perspective, we are "just loading a pre-trained model", but there is more than just I/O happening here. As you can see, the code is converting the vocabulary from Facebook's plain dictionary to Gensim's OO design, among other things. The log messages you are seeing are there because the refactoring re-used some of the training code when loading a native model. Previously, this was done by duplicated code, with different logging output (almost none).
I agree the current logging could be better. I find it confusing too, when I put myself in a user's shoes.
Well, I think both modes (with or without the matrices necessary to continue training) are correct, for different tasks.
This is not entirely correct. For example, we have a (widely used) load_word2vec_format function that returns plain word vectors rather than a trainable model.
Yes, I remember seeing these (or very similar) log messages when continuing training of word2vec models.
So, in the end this issue turned out to be about improving log messages and about whether loading 'full' fastText models should be the default behavior :-)
From the POV of Gensim's FastText design, what you get from calling load_fasttext_format(..., full_model=False) isn't really a model. It's a dummy model wrapping a working KeyedVectors instance (for the difference between a FastText model and KeyedVectors, see this part of the docs). It may be clearer to introduce a separate function that loads just the KeyedVectors from a FB binary. Users who don't want to continue training can use the new function.

@piskvorky WDYT about the default value for the full_model flag? I think we've addressed both sides of the argument above, so we should make a call: keep it as is (full_model=True) or roll back to the previous behavior as the default (full_model=False)?

@akutuzov I think better logging should be a separate issue. It's relatively simple, and someone who has never contributed to the repo could do it.
No matter what default we choose, some people will be confused. I like the idea of two separate functions (which may be thin wrappers over an internal function that accepts the parameter), and then having the docs promote whichever function is appropriate, to drive the message home, especially for the (numerous) copy-paste-happy crowd.

Having good logging is deceptively simple. It's like good documentation: while anyone can do it, doing it well requires understanding the workflows and concepts, knowing what's important, and telling the story well. Not sure that's a 100% newcomer job.
Given that we're already thinking about simplifying the API, I think it's better to introduce a separate function as part of that effort.
I believe that then there should be a separate new function to load full models, not vice versa. Thus, the default load_fasttext_format would keep its previous lightweight behavior. Since the fastText *.bin format (unlike the word2vec binary format) does contain the information necessary to continue training, we can provide a new 'advanced' function for loading full, trainable models.

I also agree with @piskvorky that a random new contributor will arguably not be able to come up with useful and meaningful log messages, since this requires a deep understanding of both the fastText algorithm itself and its particular implementation in Gensim. As a quick fix, I would suggest demoting all these confusing messages to the DEBUG level and returning the INFO messages to their Gensim 3.6 state.
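In the meantime, a user-side workaround along those lines is plain Python logging, nothing Gensim-specific (a sketch; the exact logger name carrying the load-time messages is an assumption, though Gensim loggers follow the module path):

```python
import logging

# Show Gensim's INFO-level progress messages in general...
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

# ...but raise the threshold for the module emitting the confusing
# load-time messages, so only warnings and errors get through.
logging.getLogger('gensim.models.fasttext').setLevel(logging.WARNING)
```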
Actually, if you have a look at the load_word2vec_format function, you'll see that it returns an embedding (a KeyedVectors subclass), not a model. This means the behavior of the old load_fasttext_format, which returned a model, was already inconsistent with it.

If we really wanted to make things consistent, then we'd add a new KeyedVectors.load_fasttext_format class method. That method would return FastTextKeyedVectors, which is a word embedding you can use to calculate vectors. This would be the same embedding you get with full_model=False, minus the model wrapper (which is useless). In this scenario, FastText.load_fasttext_format would continue to load full models.

So we have to make a tradeoff between backwards compatibility and correctness/consistency here.
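To make the comparison concrete, this is roughly how the word2vec path behaves (the file name is a placeholder):

```python
from gensim.models import KeyedVectors

# Returns a KeyedVectors instance: an embedding lookup table, not a
# trainable model. Training cannot be continued from this format.
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
print(wv.most_similar('king', topn=3))
```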
Given that we're trying to improve Gensim's FastText implementation, my preference is to refactor, but I'll let @piskvorky make the call on how to proceed here.
Do I hear a volunteering call? :) Can you improve the INFO logging messages @akutuzov (not just DEBUG)? @mpenkov I'm +1 for consistency. 3.7.0 was big and recent and we can still afford a bug-fix release, while the dust is settling (also need those NMF fixes from #2371).
Yes, that's technically true, but this is, as you said earlier, 'an implementation detail' :-) From the user's point of view, a KeyedVectors object is a model: an array of vectors mapped to words. Whether this embedding model is 'trainable' or 'static' is another issue. When I mention 'models', I mean models as a general notion, not Gensim classes. But I agree it would be great if things were consistent both on the level of user experience and on the level of the 'under the hood' implementation. Clearly separating the loading of trainable models from the loading of static vector lookup tables (the usual meaning of 'pre-trained embeddings') would also be great.
Would be glad to do this, but I'm not sure I understand the fastText implementation in Gensim well enough (especially after the recent refactoring). I can of course spend some time finding out why loading fastText models triggers the same logging messages as incrementally updating word2vec models (with seemingly meaningless numbers), and what messages should be produced instead, but my impression is that the designers of the current fastText code would do it much faster. That being said, I can at least try. My first step will still be moving the current messages to DEBUG, since in their current state they are absolutely cryptic to anyone except the authors of the code :)
I think you're in a good position to work on the logs. You bring the perspective of an experienced gensim user - that would help in writing informative log messages. As for reading and understanding the code, I'd encourage you to have a look and see how far you can get. It's not as ugly as it may seem (and definitely less ugly after the most recent refactoring). Demoting the current messages to DEBUG would be a decent start, in my opinion.
OK, will do this.
Introduced two new pure functions in the gensim.models.fasttext module:

1. load_facebook_vectors: loads embeddings from binaries in FB's fastText .bin format
2. load_facebook_model: loads the full model from binaries in FB's fastText .bin format

The existing FastText.load_fasttext_format method loads full models only. I've placed a deprecation warning around it. The full_model parameter is gone: it was only introduced in 3.7.1, so it's not too late to just rip it out, IMHO.

When releasing 3.7.2, we should include the above in the change log, as it changes the behavior with respect to 3.6.0.

Fixes #2372
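A usage sketch of the two new functions (the paths and the extra corpus are placeholders):

```python
from gensim.models.fasttext import load_facebook_model, load_facebook_vectors

# Lightweight: just the embeddings (FastTextKeyedVectors), no training state.
wv = load_facebook_vectors('wiki.en.bin')
print(wv['hello'])  # works for in-vocabulary and out-of-vocabulary words alike

# Full model, including everything needed to continue training.
model = load_facebook_model('wiki.en.bin')
new_sentences = [['hello', 'world']]  # placeholder corpus
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences),
            epochs=model.epochs)
```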
Initially reported here by @akutuzov.
This is yet another regression after the fastText code refactoring in Gensim 3.7 (another one was fixed in #2341).
Indeed, Gensim 3.6 loads pre-trained fastText models without any trouble. Below is an example with the English Wikipedia model from https://fasttext.cc/, but the same happens with any model trained using native fastText.
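The original code snippet did not survive extraction; a minimal sketch of the 3.6-era call, assuming the standard import path and a locally downloaded wiki.en.bin:

```python
import logging
logging.basicConfig(level=logging.INFO)

from gensim.models import FastText

# Under Gensim 3.6 this finishes normally on an 8 GB machine.
model = FastText.load_fasttext_format('wiki.en.bin')
print(model.wv['example'])
```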
However, Gensim 3.7 does weird things here: judging by the log output, it appears to be retraining the model instead of loading it.
After it went like this for an hour, I killed the process.
Gensim 3.7.1 does the same; nothing changed. I'm sorry, but it seems the fastText refactoring in 3.7 was very poorly tested, with so many things broken :-(