Fix fasttext model loading from gzip files #2476
Conversation
gensim/models/_fasttext_bin.py (outdated diff)

    -    matrix = np.fromfile(fin, dtype=dtype, count=num_vectors * dim)
    +    count = num_vectors * dim
    +    matrix = _fromfile(fin, None, count)
It's a bit better to have something like:

    if fname.endswith("gz"):
        # this will be here until https://github.com/numpy/numpy/issues/13470 is fixed
        <new_custom_code_same_as_in_pr>
    else:
        <old_code>

This allows us to:
- keep performance (time & RAM) unchanged when the file is uncompressed (that path works exactly as before), and
- avoid the current numpy issue with gzip file pointers.

See the sketch right after this comment for what that dispatch could look like.
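A minimal sketch of that dispatch at the call site, assuming a helper named `_fromfile` like the one this PR introduces (its body is sketched after the PR description below); the function name `_load_matrix` and its arguments are illustrative, not gensim's actual API:

```python
import numpy as np


def _load_matrix(fin, fname, num_vectors, dim, dtype=np.float32):
    """Read a (num_vectors, dim) matrix from the open binary stream `fin`."""
    count = num_vectors * dim
    if fname.endswith('.gz'):
        # Workaround for https://github.com/numpy/numpy/issues/13470:
        # np.fromfile cannot read from a gzip stream, so fall back to a
        # slower element-wise reader (see the _fromfile sketch below).
        matrix = _fromfile(fin, dtype, count)
    else:
        # Uncompressed files keep the fast np.fromfile path, unchanged.
        matrix = np.fromfile(fin, dtype=dtype, count=count)
    return matrix.reshape(num_vectors, dim)
```

This keeps the fast path untouched for uncompressed models and confines the workaround to gzip input.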
The workaround in this PR looks clean and simple, I like it. What are its performance implications? IIRC, loading FT models is already slower than Facebook's tool, so we have to be careful not to introduce extra penalties. Or be upfront about the trade-offs, perhaps even suggesting to the user to decompress their model instead (if the implications are too dire).
It takes 2min 23s to load the uncompressed model and 4min 30s to load the gzip-compressed one (commoncrawl model, roughly 7GB uncompressed / 4GB compressed).
Co-Authored-By: mpenkov <m@penkov.dev>
Due to a bug in numpy (numpy/numpy#13470), we are unable to load models from gzip files. @menshikh-iv came up with a workaround: bypass the numpy.fromfile function, do the file I/O ourselves, and pass an iterator to numpy.fromiter.
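As a rough illustration of that idea (not necessarily the exact code merged into `_fasttext_bin.py`), a fallback reader can unpack scalars from the stream itself and hand them to `numpy.fromiter`; the struct-based loop below is a hedged sketch, and the real helper in the PR may read in batches or differ in other details:

```python
import struct

import numpy as np


def _fromfile(fin, dtype, count):
    """Read `count` scalars from a file-like object without np.fromfile.

    Works for any stream that supports read(), including gzip files,
    because it never relies on an OS-level file descriptor.
    """
    dtype = np.dtype(np.float32 if dtype is None else dtype)
    fmt = '=' + dtype.char   # e.g. '=f' for float32; assumes the struct
                             # character matches the dtype (true for floats)
    itemsize = dtype.itemsize

    def _scalars():
        for _ in range(count):
            chunk = fin.read(itemsize)
            if len(chunk) != itemsize:
                raise EOFError('unexpected end of stream')
            yield struct.unpack(fmt, chunk)[0]

    # np.fromiter consumes the generator and builds the array in one pass.
    return np.fromiter(_scalars(), dtype=dtype, count=count)
```

The per-element unpacking is what makes the gzip path slower than plain `np.fromfile`, which is consistent with the load times reported above; once the upstream numpy issue is fixed, the fallback can be dropped and both paths can use `np.fromfile` again.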