Fix fasttext model loading from gzip files #2476

Merged: 9 commits, May 6, 2019

Conversation

mpenkov
Collaborator

@mpenkov mpenkov commented May 5, 2019

Due to a bug in numpy (numpy/numpy#13470), we are unable to load models from gzip files. @menshikh-iv came up with a workaround: bypass the numpy.fromfile function, do the file I/O ourselves, and pass an iterator to numpy.fromiter.
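A minimal sketch of that idea, not the code merged in this PR: the names `iter_values`, `fromfile_workaround` and `gzip_roundtrip` and the `batch_size` parameter are hypothetical, chosen for illustration. It reads raw bytes from the file object in batches, decodes them, and feeds the values to `numpy.fromiter`, so `numpy.fromfile` never touches the gzip handle.

```python
import gzip
import io

import numpy as np


def iter_values(fin, dtype, count, batch_size=4096):
    """Yield `count` scalars of `dtype`, reading the file object in batches."""
    dtype = np.dtype(dtype)
    remaining = count
    while remaining > 0:
        n = min(remaining, batch_size)
        buf = fin.read(n * dtype.itemsize)
        if len(buf) != n * dtype.itemsize:
            raise EOFError("file ended before %d values were read" % count)
        # Decode the batch, then hand the values out one at a time.
        for value in np.frombuffer(buf, dtype=dtype):
            yield value
        remaining -= n


def fromfile_workaround(fin, dtype, count):
    """Hypothetical stand-in for np.fromfile that works on any file-like object."""
    return np.fromiter(iter_values(fin, dtype, count), dtype=dtype, count=count)


def gzip_roundtrip(values):
    """Write `values` to an in-memory gzip stream and read them back."""
    raw = io.BytesIO()
    with gzip.GzipFile(fileobj=raw, mode="wb") as fout:
        fout.write(values.tobytes())
    raw.seek(0)
    with gzip.GzipFile(fileobj=raw, mode="rb") as fin:
        return fromfile_workaround(fin, values.dtype, values.size)
```

Going through a Python generator is likely slower than `numpy.fromfile` on an uncompressed file, which is the trade-off discussed in the review comments.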

@mpenkov mpenkov requested review from piskvorky and menshikh-iv May 5, 2019 01:58

- matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
+ count = num_vectors * dim
+ matrix = _fromfile(fin, None, count)
Contributor

@menshikh-iv menshikh-iv May 5, 2019


It's a bit better to have something like

if fname.endswith("gz"):
    # keep this workaround until https://github.com/numpy/numpy/issues/13470 is fixed
    <new_custom_code_same_as_in_pr>
else:
    <old_code>

This lets us:

  • avoid reducing performance (time & RAM) when the file is uncompressed, because that path works exactly as before
  • avoid the current numpy issue with gz file pointers
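The suggested branching could look roughly like the sketch below. The names `load_matrix` and `_iter_values` are hypothetical and the slow path is only one possible stand-in for the custom code in the PR; the point is that uncompressed files keep using `numpy.fromfile` unchanged.

```python
import numpy as np


def _iter_values(fin, dtype, count):
    """Yield `count` scalars of `dtype`, read one item at a time (slow path)."""
    for _ in range(count):
        chunk = fin.read(dtype.itemsize)
        if len(chunk) != dtype.itemsize:
            raise EOFError("file ended before %d values were read" % count)
        yield np.frombuffer(chunk, dtype=dtype)[0]


def load_matrix(fin, fname, dtype, count):
    """Read `count` values from `fin`, avoiding np.fromfile only for gzip files."""
    dtype = np.dtype(dtype)
    if fname.endswith(".gz"):
        # Workaround path: needed only while numpy/numpy#13470 is open.
        return np.fromiter(_iter_values(fin, dtype, count), dtype=dtype, count=count)
    # Uncompressed path: same call as before, so no extra time or RAM cost.
    return np.fromfile(fin, dtype=dtype, count=count)
```

Dispatching on the file name is an assumption carried over from the comment above; checking the type of the file object would be an alternative.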

gensim/models/_fasttext_bin.py (resolved)
@piskvorky
Owner

The workaround in this PR looks clean and simple, I like it. What are its performance implications?

IIRC, loading FT models is already slower than Facebook's tool, so we have to be careful not to introduce extra penalties. Otherwise we should be upfront about the trade-offs, perhaps even suggesting that users decompress their model first (if the implications are too dire).

@mpenkov
Collaborator Author

mpenkov commented May 5, 2019

It takes 2 min 23 s to load the uncompressed model (commoncrawl, 4GB). It takes 4 min 30 s to load the uncompressed model (7GB).

gensim/models/_fasttext_bin.py (resolved)
gensim/models/_fasttext_bin.py (outdated, resolved)
piskvorky and others added 2 commits May 5, 2019 18:55
@mpenkov mpenkov merged commit 790b9a7 into piskvorky:develop May 6, 2019
@mpenkov mpenkov deleted the numpy-weirdness branch May 6, 2019 02:08
Successfully merging this pull request may close these issues.

Using the load_facebook_model method produces ValueError on array reshaping
3 participants