Different behaviour with open
#207
Comments
Let me add that the bug is a violation of the contract "For local uncompressed files, smart_open() is always a drop-in replacement for open(). There is never a reason not to use smart_open() instead of open()."
We have two important contracts:
1. Quack like open (as you mentioned above)
2. Support both Py2 and Py3
Unfortunately, these two contracts conflict because the open functions
differ between the major Py versions. The Py3 open accepts many more
parameters and supports encodings.
How should we manage this conflict?
@mpenkov maybe always "quack" like Python 3 (this is still a "contradiction", but it looks like the best solution)?
That’s going to be hard. Look at this example: open a local file with
koi8-r encoding.
Under Py3, you just call the built-in open and return the file object it
gives you. This satisfies the contract that Radim mentioned. Everyone is happy.
Under Py2, the built-in open does not support encodings, so you have to
deal with it yourself. You _could_ wrap it in a codec layer, but then you’d
violate the contract: the return value would not be a true file object. What
should we do in this particular case?
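To make the difference concrete, here is a minimal sketch of the two ways of reading a koi8-r file. It assumes `io.open` as a stand-in for the Py3 built-in open (it is the same function on Py3 and is also available on Py2) and uses `codecs.getreader` as one possible "codec layer"; the sample text and the temporary file are made up for illustration.

```python
import codecs
import io
import os
import tempfile

text = u"привет"

# Write a small koi8-r encoded file to a temporary location.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as fout:
    fout.write(text.encode("koi8-r"))

# Native route: open handles the decoding and returns a regular text file object.
with io.open(path, "r", encoding="koi8-r") as fin:
    builtin_type = type(fin)
    builtin_text = fin.read()

# Codec-layer route: open in binary mode and wrap the result in a decoder.
raw = io.open(path, "rb")
reader = codecs.getreader("koi8-r")(raw)
wrapped_type = type(reader)
wrapped_text = reader.read()
reader.close()  # delegates to the underlying binary file

os.remove(path)

# Both routes decode the same text, but the returned objects have different types.
print(builtin_type, wrapped_type)
```

Both reads produce the same string, but the second object is a `codecs.StreamReader` subclass rather than whatever the built-in open would return, which is the mismatch described above.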
If Python 2 doesn't support some parameters, we don't have to support them either. We have no contract or ambition to provide some "compatibility layer" between Python 2 and 3. We want minimum surprises.
@piskvorky open from Probably, this should apply only to the trivial variants, i.e. if the user passes parameters matched to
Not sure I understand. What I'm saying is that if native open doesn't support a parameter, neither do we.
I mean, if we support something extra, that's nice. But definitely not at the cost of breaking the "drop-in contract".
We _do_ support an encoding parameter, regardless of what the built-in open
supports. That’s part of the problem.
Going back to the example I gave previously, if we receive an encoding
parameter when the native open doesn’t support one (Py2), we can:
1) perform our own decoding, violating the contract
2) ignore the parameter, but keep to the contract
3) explode with an error
4) ... anything else?
How do you think we should handle the above situation?
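The three options can be sketched side by side. This is only an illustration: `open_for_reading`, its `policy` argument and the `native_supports_encoding` flag are hypothetical names invented here, not part of smart_open, and the helper only covers reading in text mode.

```python
import codecs
import sys

PY3 = sys.version_info[0] >= 3


def open_for_reading(path, encoding=None, policy="decode",
                     native_supports_encoding=PY3):
    """Hypothetical helper showing options 1) to 3); not smart_open's API."""
    if encoding is None:
        # Nothing special requested: behave exactly like the built-in open.
        return open(path, "r")
    if native_supports_encoding:
        # Py3 case: the built-in open handles the encoding itself.
        return open(path, "r", encoding=encoding)
    if policy == "decode":
        # Option 1: decode ourselves; the result is a codec wrapper,
        # not the object the built-in open would have returned.
        return codecs.getreader(encoding)(open(path, "rb"))
    if policy == "ignore":
        # Option 2: drop the parameter and return a plain file object.
        return open(path, "r")
    if policy == "error":
        # Option 3: refuse loudly.
        raise ValueError("the built-in open does not support encoding=%r" % encoding)
    raise ValueError("unknown policy: %r" % policy)
```

Passing `native_supports_encoding=False` on Py3 simulates the Py2 situation, which makes the three branches easy to compare.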
Both 1) and 3) are OK. 1) doesn't really violate the "drop-in" contract, since there is no original to violate ( Note however that this is not the case that triggered this issue. There was no
I can reproduce this problem with numpy-1.11.3. With newer numpy, it's no longer a problem.
* added integration test
* Resolving issue #207: use built-in open whenever possible
* rename integration test using correct issue number
* update unit tests
* isolate failing test case
* Updated unit tests
  Changed the mock target to keep in step with the shortcut_open function.
  Split tests into separate test cases.
* respond to review comments
  - use tempfile instead of hard-coded file
  - integrate new test into travis.yml so that Travis runs it
* added support for buffering parameter
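The idea of "use built-in open whenever possible" can be sketched roughly as follows. This is not the actual `shortcut_open` from the fix; `try_builtin_open`, the scheme check and the list of compressed extensions are assumptions made for illustration only.

```python
import io
import os

# Assumption: extensions that would need a decompression layer.
COMPRESSED_EXTENSIONS = (".gz", ".bz2")


def try_builtin_open(uri, mode="r", buffering=-1, **kwargs):
    """Hypothetical shortcut: return a plain file object, or None to fall through."""
    if "://" in uri:
        # Looks like a remote URI (for example s3://...), so no shortcut.
        return None
    extension = os.path.splitext(uri)[1].lower()
    if extension in COMPRESSED_EXTENSIONS:
        # Compressed local file: the caller has to add a decompression layer.
        return None
    # Local uncompressed file: hand everything to the native open.
    # io.open is the built-in open on Py3 and is also available on Py2.
    return io.open(uri, mode, buffering=buffering, **kwargs)
```

A caller would try the shortcut first and only continue with its own wrapping logic when it gets `None` back, so local uncompressed files always yield the same kind of object as the native open.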
Fixed by #208
Problem:
smart_open==1.6.0 doesn't work correctly with the numpy.frombuffer function (numpy < 1.13.0; preferably test with numpy==1.11.3, python2.7).
Example:
env:
numpy==1.11.3
gensim==3.4.0
smart_open==1.6.0
Trace
If I replace
from https://github.com/RaRe-Technologies/gensim/blob/875b028ccd009fe2fa7665e177b8ab0b5e2dc40d/gensim/models/fasttext.py#L711
to
this works correctly (it also works correctly with smart_open==1.5.6, i.e. the bug was introduced in 1.6.0).