Fix fasttext model loading from gzip files #2476
Conversation
gensim/models/_fasttext_bin.py (outdated diff)

    -    matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
    +    count = num_vectors * dim
    +    matrix = _fromfile(fin, None, count)
It's a bit better to have something like:

    if fname.endswith("gz"):
        # this will be here until https://github.com/numpy/numpy/issues/13470 is fixed
        <new_custom_code_same_as_in_pr>
    else:
        <old_code>

This allows us to:
- keep performance (time & RAM) unchanged when the file is uncompressed (that path works exactly as before), and
- avoid the current numpy issue with gzip file pointers.

See the sketch right after this comment for what that dispatch could look like.
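A minimal sketch of that dispatch at the call site, assuming a helper named `_fromfile` like the one this PR introduces (its body is sketched after the PR description below); the function name `_load_matrix` and its arguments are illustrative, not gensim's actual API:

```python
import numpy as np


def _load_matrix(fin, fname, num_vectors, dim, dtype=np.float32):
    """Read a (num_vectors, dim) matrix from the open binary stream `fin`."""
    count = num_vectors * dim
    if fname.endswith('.gz'):
        # Workaround for https://github.com/numpy/numpy/issues/13470:
        # np.fromfile cannot read from a gzip stream, so fall back to a
        # slower element-wise reader (see the _fromfile sketch below).
        matrix = _fromfile(fin, dtype, count)
    else:
        # Uncompressed files keep the fast np.fromfile path, unchanged.
        matrix = np.fromfile(fin, dtype=dtype, count=count)
    return matrix.reshape(num_vectors, dim)
```

This keeps the fast path untouched for uncompressed models and confines the workaround to gzip input.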
The workaround in this PR looks clean and simple, I like it. What are its performance implications? IIRC, loading FT models is already slower than Facebook's tool, so we have to be careful not to introduce extra penalties. Or be upfront about the trade-offs, perhaps even suggesting to the user to decompress their model instead (if the implications are too dire).
It takes 2min 23s to load the uncompressed model and 4min 30s to load the gzip-compressed one (commoncrawl model, roughly 7GB uncompressed / 4GB compressed).
Co-Authored-By: mpenkov <m@penkov.dev>
Due to a bug in numpy (numpy/numpy#13470), we are unable to load models from gzip files. @menshikh-iv came up with a workaround: bypass the numpy.fromfile function, do the file I/O ourselves, and pass an iterator to numpy.fromiter.
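As a rough illustration of that idea (not necessarily the exact code merged into `_fasttext_bin.py`), a fallback reader can unpack scalars from the stream itself and hand them to `numpy.fromiter`; the struct-based loop below is a hedged sketch, and the real helper in the PR may read in batches or differ in other details:

```python
import struct

import numpy as np


def _fromfile(fin, dtype, count):
    """Read `count` scalars from a file-like object without np.fromfile.

    Works for any stream that supports read(), including gzip files,
    because it never relies on an OS-level file descriptor.
    """
    dtype = np.dtype(np.float32 if dtype is None else dtype)
    fmt = '=' + dtype.char   # e.g. '=f' for float32; assumes the struct
                             # character matches the dtype (true for floats)
    itemsize = dtype.itemsize

    def _scalars():
        for _ in range(count):
            chunk = fin.read(itemsize)
            if len(chunk) != itemsize:
                raise EOFError('unexpected end of stream')
            yield struct.unpack(fmt, chunk)[0]

    # np.fromiter consumes the generator and builds the array in one pass.
    return np.fromiter(_scalars(), dtype=dtype, count=count)
```

The per-element unpacking is what makes the gzip path slower than plain `np.fromfile`, which is consistent with the load times reported above; once the upstream numpy issue is fixed, the fallback can be dropped and both paths can use `np.fromfile` again.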