
High RAM usage when loading FastText Model on Google Colab #2502

Closed
rianrajagede opened this issue May 28, 2019 · 14 comments
Labels
Hacktoberfest Issues marked for hacktoberfest help wanted


@rianrajagede

Problem description

I want to load a pre-trained FastText model using Gensim. I run this script in Google Colab with ~12 GB RAM, but it always crashes, with Colab's message: "Your session crashed after using all available RAM."

Steps/code/corpus to reproduce

# Download and unzip the model
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
!gunzip -k cc.en.300.bin.gz

# Install / Upgrade Gensim
!pip install --upgrade gensim

# Load model, method 1
from gensim.models.fasttext import load_facebook_vectors
model = load_facebook_vectors("cc.en.300.bin.gz")

# Load model, method 2
from gensim.models.fasttext import FastText
model = FastText.load_fasttext_format('cc.en.300.bin')

I didn't run both methods at the same time, only one of them, and I restart the runtime to clear the memory before trying the other. I use method 2 to avoid issue #2378. Both methods crash Colab by using all available RAM. At first I thought the problem was the model itself, but each file's size is far below 12 GB:

cc.en.300.bin.gz     4.19 GB
cc.en.300.bin        6.74 GB

and if I load it using the fastText Python module, it works:

# Install fastText
!git clone https://github.com/facebookresearch/fastText.git
!pip install fastText/.
# Load model
import fastText
model = fastText.load_model("cc.en.300.bin")

Versions

Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0]
NumPy 1.16.3
SciPy 1.3.0
gensim 3.7.3
FAST_VERSION 1

@gojomo
Collaborator

gojomo commented May 29, 2019

Does it crash on the load, or shortly thereafter when you start using the vectors? Because: doing common operations like most_similar() requires the creation of a cache of unit-normalized vectors, which roughly doubles the required RAM – which would easily explain exhausting 12G RAM.
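A back-of-envelope estimate illustrates why the norm cache matters. This is only a sketch, assuming round figures for cc.en.300.bin (roughly 2M vocabulary words, a default of 2M subword ngram buckets, 300 dimensions, float32); it counts only the raw vector arrays and ignores the extra per-version overheads discussed later in this thread:

```python
# Rough RAM estimate for a FastText model like cc.en.300.bin.
# Assumed figures: ~2M vocab words, ~2M ngram buckets, 300 dims, float32.
DIMS = 300
FLOAT_BYTES = 4
VOCAB = 2_000_000
BUCKETS = 2_000_000

def gib(n_bytes):
    return n_bytes / 2**30

word_vectors = VOCAB * DIMS * FLOAT_BYTES    # full-word vectors
ngram_vectors = BUCKETS * DIMS * FLOAT_BYTES # subword ngram buckets
base = word_vectors + ngram_vectors          # raw vector storage
norm_cache = word_vectors                    # unit-normalized copy built for most_similar()

print(f"raw vectors:     {gib(base):.2f} GiB")
print(f"with norm cache: {gib(base + norm_cache):.2f} GiB")
```

Even under these optimistic assumptions the arrays alone approach 7 GiB once the normalized cache exists, before counting Python overhead or anything else resident in the 12 GB Colab VM.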

@rianrajagede
Author

It crashes during the loading process, at the code above. I haven't used the model at all.

@ptiwari2407

@gojomo, then how can I get around using most_similar() on Google Colab?

@gojomo
Collaborator

gojomo commented Aug 12, 2020

If most_similar() is the specific operation you need, there's no getting around it. You'd need to use a smaller model or a machine with more memory.

There are a number of major memory inefficiencies and unnecessary over-allocations in gensim's FastText support, up through the current released version, 3.8.3. They'll be fixed in the eventual gensim-4.0.0 release, so those FB models might be more usable within 12GB. But those and other changes are still being tested and further improved, and there's no firm date yet for a 4.0.0 release. An advanced user capable of running in-development code checked out from GitHub and built locally could use that fixed code now and help test it, but I'm not sure Google Colab would support that.
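For what it's worth, Colab cells do accept pip installs straight from a Git repository, so something like the following could pull in-development code. This is a sketch, not an endorsed procedure: the `develop` branch name is an assumption, so check the repository for the current default branch before running.

```shell
# In a Colab cell: install in-development gensim directly from GitHub.
# ('develop' is an assumed branch name; verify it against the repo.)
!pip install --upgrade git+https://github.com/RaRe-Technologies/gensim.git@develop
```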

@gojomo
Collaborator

gojomo commented Oct 20, 2020

This has been much-improved by #2698, #2944, & other recent work that will be available in the 4.0.0 release, so closing this issue.

@italodamato

I'm getting the same error. The colab runtime crashes loading the model:

model = gensim.models.fasttext.load_facebook_model('.../crawl-300d-2M-subword/crawl-300d-2M-subword.bin')

@piskvorky
Owner

@italodamato please post your versions, like @rianrajagede did above (or open a new ticket).

@italodamato

numpy==1.18.5
scipy==1.4.1
gensim==3.8.3

gcc 7.5.0
python 3.6.9
Ubuntu 18.04

@piskvorky

@piskvorky
Owner

piskvorky commented Dec 7, 2020

In that case see @gojomo's answer above.

The 4.0 beta release is here: https://github.com/RaRe-Technologies/gensim/releases/tag/4.0.0beta

@italodamato

I upgraded to 4.0, but it keeps crashing.

@gojomo
Collaborator

gojomo commented Dec 7, 2020

It looks like you're using an even larger model (crawl-300d-2M-subword.bin, 7.24GB) than the original report.

As the gensim-4.0.0 beta has removed the major sources of unnecessary memory usage in Gensim's implementation, if you are still getting "crashed after using all available RAM" errors, your main ways forward are likely to be: (1) moving to a system with more RAM, at Colab or elsewhere; (2) if other uses of RAM might be contributing to the usage, reducing those.
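To see how much headroom the current VM actually has before attempting a load, a stdlib-only check works in a Colab cell (a sketch: the `SC_PHYS_PAGES`/`SC_AVPHYS_PAGES` sysconf names are Linux-specific and won't exist on all platforms):

```python
# Report total and currently-available RAM on a Linux host (e.g. a Colab VM),
# using only the standard library.
import os

page = os.sysconf("SC_PAGE_SIZE")
total_gib = os.sysconf("SC_PHYS_PAGES") * page / 2**30
avail_gib = os.sysconf("SC_AVPHYS_PAGES") * page / 2**30

print(f"total RAM:     {total_gib:.1f} GiB")
print(f"available RAM: {avail_gib:.1f} GiB")
```

If the available figure is already well below the rough model-size estimates discussed above, the load is going to fail regardless of which loader is used.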

@italodamato

I'll try with his model. I'm not sure what the difference is between the two, though.

@italodamato

It crashed again. I don't have anything else in memory when I do it.

@gojomo
Collaborator

gojomo commented Dec 8, 2020

On a VM with 32G RAM, with Python 3.6 under Ubuntu 18.04 & gensim-4.0.0b, I used gensim.models.fasttext.load_facebook_model in a Jupyter notebook to load crawl-300d-2M-subword.bin.

It took almost 4 minutes of wall-clock time (!) but completed without error. top reported the process's virtual-memory usage as about 10.5GB.

Gensim in Python likely has more overhead than Facebook's C++ fastText code for loading the same models, so in constrained environments there will always be some on-the-margin models that can be loaded in one but not the other. Other than that, I don't see anything broken here with regard to memory usage, and can only recommend working in an environment with more memory if you need to use such large models.

(Note, though, that other serious issues may remain in loading full native FastText models, such as #2969.)
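The wall-clock and peak-memory figures above can be reproduced with a small stdlib helper. This is a hypothetical sketch, not code from the thread: `measure` is an illustrative name, `ru_maxrss` is reported in KiB on Linux, and in practice you would pass the gensim load call (e.g. `lambda: gensim.models.fasttext.load_facebook_model("crawl-300d-2M-subword.bin")`) as `fn`; here the helper is only exercised on a cheap stand-in.

```python
# Hypothetical helper: time a callable and report the process's peak
# resident memory afterwards, using only the standard library.
import resource
import time

def measure(fn):
    """Run fn(), returning (result, elapsed_seconds, peak_rss_gib)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    # On Linux, ru_maxrss is the peak resident set size in KiB.
    peak_kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_kib / 2**20  # KiB -> GiB

# Stand-in workload; replace with the model-loading call to profile it.
result, secs, peak = measure(lambda: sum(range(1_000_000)))
print(f"elapsed: {secs:.3f}s, peak RSS: {peak:.2f} GiB")
```

Because `ru_maxrss` is a high-water mark for the whole process, running this in a fresh runtime gives the cleanest reading of what the load itself costs.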
