Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different behaviour with open #207

Closed
menshikh-iv opened this issue Jul 5, 2018 · 12 comments
Closed

Different behaviour with open #207

menshikh-iv opened this issue Jul 5, 2018 · 12 comments
Assignees
Labels

Comments

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Jul 5, 2018

Problem: smart_open==1.6.0 doesn't work correctly with numpy.frombuffer function (numpy < 1.13.0, preferable to test with numpy==1.11.3, python2.7).

Example:
env:

  • numpy==1.11.3
  • gensim==3.4.0
  • smart_open==1.6.0
from gensim.models import FastText
from gensim.test.utils import datapath

m = FastText.load_fasttext_format(datapath("lee_fasttext"))

Trace

IOError                                   Traceback (most recent call last)
<ipython-input-1-afdc56d6c4b5> in <module>()
      2 from gensim.test.utils import datapath
      3 
----> 4 m = FastText.load_fasttext_format(datapath("lee_fasttext"))

/home/ivan/release/gensim/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding)
    697             model_file += '.bin'
    698         model.file_name = model_file
--> 699         model.load_binary_data(encoding=encoding)
    700         return model
    701 

/home/ivan/release/gensim/gensim/models/fasttext.py in load_binary_data(self, encoding)
    712             self._load_model_params(f)
    713             self._load_dict(f, encoding=encoding)
--> 714             self._load_vectors(f)
    715 
    716     def _load_model_params(self, file_handle):

/home/ivan/release/gensim/gensim/models/fasttext.py in _load_vectors(self, file_handle)
    819 
    820         self.num_original_vectors = num_vectors
--> 821         self.wv.vectors_ngrams = np.fromfile(file_handle, dtype=dtype, count=num_vectors * dim)
    822         self.wv.vectors_ngrams = self.wv.vectors_ngrams.reshape((num_vectors, dim))
    823         assert self.wv.vectors_ngrams.shape == (

IOError: first argument must be an open file

If I replace

with utils.smart_open(self.file_name, 'rb') as f:

from https://github.com/RaRe-Technologies/gensim/blob/875b028ccd009fe2fa7665e177b8ab0b5e2dc40d/gensim/models/fasttext.py#L711
to

with open(self.file_name, 'rb') as f:

this will works correctly (also, it works correctly with smart_open==1.5.6, i.e. bug from 1.6.0).

@menshikh-iv menshikh-iv added the bug label Jul 5, 2018
@piskvorky
Copy link
Owner

piskvorky commented Jul 5, 2018

Let me add that the bug is the violation of the smart_open contract:

"For local uncompressed files, smart_open() is always a drop-in replacement for open(). There is never a reason not to use smart_open() instead of open()."

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 6, 2018 via email

@menshikh-iv
Copy link
Contributor Author

menshikh-iv commented Jul 6, 2018

@mpenkov "quack" like a python3 always maybe (this still "contradiction", but looks like best solution)?

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 6, 2018 via email

menshikh-iv added a commit to piskvorky/gensim that referenced this issue Jul 6, 2018
@piskvorky
Copy link
Owner

piskvorky commented Jul 6, 2018

smart_open should work like open for local uncompressed files (no matter which Python).

If Python 2 doesn't support some parameters, we don't have to support them either. We have no contract or ambition to provide some "compatibility layer" between python 2 and 3. We want minimum surprises.

@menshikh-iv
Copy link
Contributor Author

@piskvorky open from python2 or python3? I agree with @mpenkov, this isn't clear: for example, we'll simply use standard python open - we'll have no encoding argument for python2, that's not our goal.

Probably, this should be affected only for trivial variants, i.e. if user pass parameters matched to
https://docs.python.org/2/library/functions.html#open (on python2) or
https://docs.python.org/3/library/functions.html#open (on python3) - we should return open(...), otherwise (non-trivial case) - use current "mechanism"

@piskvorky
Copy link
Owner

piskvorky commented Jul 6, 2018

Not sure I understand. What I'm saying is that if native open doesn't support a parameter, neither do we.

I mean, if we support something extra, that's nice. But definitely not at the cost of breaking the "drop-in contract".

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 6, 2018 via email

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 6, 2018 via email

@piskvorky
Copy link
Owner

piskvorky commented Jul 7, 2018

Both 1) and 3) are OK. 1) doesn't really violate the "drop-in" contract, since there is no original to violate (open wouldn't even work with the unknown parameter).

Note however that this is not the case that triggered this issue. There was no encoding parameter -- the original open worked fine.

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 7, 2018

(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ pip install numpy==1.11.3
Collecting numpy==1.11.3
  Downloading https://files.pythonhosted.org/packages/68/e8/d68e1c47a2bd3f03cb81f1e635a55b776e51e4746f7204d541f6e3e9ffab/numpy-1.11.3-cp27-cp27mu-manylinux1_x86_64.whl (15.4MB)
    100% |████████████████████████████████| 15.4MB 705kB/s
Installing collected packages: numpy
  Found existing installation: numpy 1.14.5
    Uninstalling numpy-1.14.5:
      Successfully uninstalled numpy-1.14.5
Successfully installed numpy-1.11.3
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ python integration-tests/test_208.py
Traceback (most recent call last):
  File "integration-tests/test_208.py", line 25, in <module>
    loaded = np.fromfile(fin)
IOError: first argument must be an open file
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ pip install numpy==1.14.5
Collecting numpy==1.14.5
  Using cached https://files.pythonhosted.org/packages/6a/a9/c01a2d5f7b045f508c8cefef3b079fe8c413d05498ca0ae877cffa230564/numpy-1.14.5-cp27-cp27mu-manylinux1_x86_64.whl
Installing collected packages: numpy
  Found existing installation: numpy 1.11.3
    Uninstalling numpy-1.11.3:
      Successfully uninstalled numpy-1.11.3
Successfully installed numpy-1.14.5
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$ python integration-tests/test_208.py
OK
(so27) ubuntu@LAPTOP-S7M90K9N:~/git/smart_open$

I can reproduce this problem with numpy-1.11.3. With newer numpy, it's no longer a problem.

clrpackages pushed a commit to clearlinux-pkgs/gensim that referenced this issue Aug 6, 2018
… of `min_count` for `gensim.models.Word2Vec`. Fix #465 (#1915)

Andrey Kutuzov (2):
      Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs` (#1934)
      Add `evaluate_word_analogies` (will replace `accuracy`) method for `gensim.models.KeyedVectors` (#1935)

Aneesh Joshi (3):
      Add windows venv activate command to `CONTRIBUTING.md` (#1880)
      Fix deprecation warning from `inspect.getargspec`. Fix #1878 (#1887)
      Allow initialization with `max_final_vocab` in lieu of `min_count` for `gensim.models.Word2Vec`. Fix #465 (#1915)

Chaitali Saini (1):
      Update rules for removing table markup from Wikipedia dumps. Fix #1710 (#1954)

Dennis.Chen (1):
      Fix inheritance chain for `load_word2vec_format` (return correct class in case when you create an child class based on kv) (#1968)

Dmitry (5):
      Refactor API reference `gensim.corpora`. Partial fix #1671 (#1835)
      Refactor documentation for `gensim.similarities.docsim` and `MmCorpus-related`. (#1910)
      Refactor documentation for `gensim.models.coherencemodel` (#1933)
      Refactor documentation for `gensim.models.phrases` (#1950)
      Fix format & links for `gensim.similarities.docsim` (#2030)

Dmitry Persiyanov (1):
      Add `gensim.models.BaseKeyedVectors.add_entity` method for fill `KeyedVectors` in manual way. Fix #1942 (#1957)

Fernando Camargo (1):
      Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models. Fix #2090 (#2093)

Gordon Mohr (1):
      Fix `Doc2Vec.infer_vector`, notebook cleanup (#2103)

Gyanesh Malhotra (1):
      Fix docstrings for`gensim.models.hdpmodel`, `gensim.models.lda_worker` & `gensim.models.lda_dispatcher` (#1912)

Ibrahim Sharaf ElDen (1):
      Store images from `README.md` directly in repository. Fix #1849 (#1861)

Ivan Menshikh (2):
      Fix PEP8 in `HashDictionary`
      Disable google-style docstring support. Fix #1663 (#2106)

Jayant Jain (1):
      Fix negative sampling floating-point error for `gensim.models.PoincareModel`. Fix #1917 (#1959)

Johannes Baiter (1):
      Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824 (#1916)

Jonathan Hourany (1):
      Fixed Typo and increased performance in analyze_sentence (#2070)

Kento NOZAWA (1):
      Fix example block for `gensim.models.Word2Vec` (#1876)

Kumar Akshay (1):
      Fix documentation for `gensim.models.wrappers` (#1859)

Menshikh Ivan (5):
      Fix `test_similarities.py` (#1928)
      Add flag for skip network-related tests (#1930)
      Fix encoding in Lee corpus reader (#1931)
      Fix Keras version (avoid bug from `keras==2.1.5`) (#1963)
      Fix quoting that break `doc2vec-IMDB` notebook

Mohit Rathore (1):
      Add Pivot Normalization for `gensim.models.TfidfModel`. Fix #220 (#1780)

Mritunjay Mohitesh (1):
      Fix deprecated parameters in `D2VTransformer` and `W2VTransformer`. Fix #1937 (#1945)

Nils Werner (1):
      Add license field to `setup.py` (#1909)

Oliver Price (1):
      Fix return dtype for `matutils.unitvec` according to input dtype. Fix #1722 (#1992)

Orion Montoya (1):
      Fix parameter description of  `sg` parameter for `gensim.models.word2vec` (#1919)

Pete Bleackley (1):
      Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 (#2021)

Pushpankar Kumar Pushp (1):
      Fix datatype parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 (#1819)

Radim Řehůřek (8):
      fix logging formatting in downloader
      fixes to HashDictionary
      more fixes to broken formatting
      minor wording change
      Merge pull request #2073 from RaRe-Technologies/hashdictionary_docs
      Fix documentation for `*2vec` models (#2087)
      Fix documentation for various modules (#2096)
      Update non-API docs (about, intro, etc) (#2101)

Rob Malouf (1):
      Fix `_is_single` from `Phrases` for case when corpus is numpy array (#1987)

Samyak Jain (2):
      Fix empty output bug in `Phrases`. Fix #1401 (#1853)
      Fix file-like closing bug from `gensim.corpora.MmCorpus`. Fix #1869 (#1911)

Sharan Yalburgi (3):
      Add anaconda-cloud badge. Partial fix #1901 (#1905)
      Add method that show base installation info of Gensim & related packages. Fix #1902 (#1903)
      Replace open() with smart_open() in notebooks. Fix #1789 (#1812)

Shiva Manne (4):
      Add `wv` property to KeyedVectors (for backward compatibility). Fix #1882 (#1884)
      Adds `LabeledSentence` to `gensim.models.doc2vec` (for backward compatibility). Fix #1886 (#1891)
      Fix `Doc2Vec.infer_vector` after loading old `Doc2Vec` (`gensim<=3.2`). Fix #1952 (#1974)
      Fixes issues while loading `word2vec` and `doc2vec` models saved using old Gensim versions. Fix #2000, #1977 (#2012)

Sourav Singh (1):
      Fix docstrings for `gensim.models.AuthorTopicModel` (#1907)

Stamenov (1):
      Add inference for new unseen author for `gensim.models.AuthorTopicModel` (#1766)

Stergiadis Manos (4):
      Fix docstrings for lsi-related code (#1892)
      Fix docstrings for `gensim.sklearn_api`. Fix #1667 (#1895)
      Document LDA-related models (#2026)
      Allow pass empty dictionary to `gensim.corpora.WikiCorpus`. Fix #2052 (#2042)

TheFlash10 (1):
      Fix deprecated parameters in doc2vec-lee notebook (#1918)

Umang Varma (1):
      Fix linear decay for learning rate in `Doc2Vec.infer_vector`. Fix #2061 (#2063)

Utkarsh Mishra (1):
      Fix `D2VTransformer.fit_transform`. Fix #1834 (#1845)

Vít Novotný (4):
      Implement Soft Cosine Measure (#1827)
      Fix misinformation in docstring of `gensim.models.KeyedVectors.similarity_matrix`. Fix #1960 (#1971)
      Fix `SoftCosineSimilarity.get_similarities` on corpora. Fix #1955 (#1972)
      Fix tests for `EuclideanKeyedVectors.similarity_matrix`. Fix #1961 (#1984)

Yuri Isakov (3):
      Refactor docstrings for `gensim.scripts`. Partial fix #1665 (#1792)
      Fix docstrings for `gensim.test.utils` (#1904)
      Fix docstrings for `gensim.interfaces` (#1913)

arlenk (3):
      Add Cython version of `MmReader` (#1825)
      Add cython version for "hot" functions from `gensim.models.LdaModel` (#1767)
      Fix OverflowError when loading a large term-document matrix in MatrixMarket format. Fix #1998 (#2001)

bohea (1):
      Fix bug in `Similarity.query_shards` in multiprocessing case (#2044)

darindf (3):
      Fix python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` (#1875)
      Remove duplication of class documentation for `IndexedCorpus` (#2033)
      Add `dtype` argument for `chunkize_serial`, fix `LdaModel` (#2027)

ivan (12):
      Merge branch 'master' into develop
      bump version to 3.4.0
      regenerate C files with cython==0.27
      Merge branch 'release-3.4.0'
      Merge branch 'master' into develop
      apply fixes for distributed mode lda/lsi from @piskvorky #2102
      remove smart_open limitation from setup.py, replace smart_open -> open until piskvorky/smart_open#207 will be fixed
      fix PEP8 issues
      bump version to 3.5.0
      bump changelos to 3.5.0 + add missing changelog for 3.4.0
      regenerated C files with Cython
      Merge branch 'release-3.5.0'

numericlee (1):
      Fix `doc2vec-lee` notebook (#1870)
menshikh-iv pushed a commit that referenced this issue Aug 13, 2018
* added integration test

* Resolving issue #207: use built-in open whenever possible

* rename integration test using correct issue number

* update unit tests

* isolate failing test case

* Updated unit tests

Changed the mock target to keep in step with the shortcut_open function.
Split tests into separate test cases.

* respond to review comments

- use tempfile instead of hard-coded file
- integrate new test into travis.yml so that Travis runs it

* added support for buffering parameter
@menshikh-iv
Copy link
Contributor Author

Fixed by #208

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants