Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace open() with smart_open() in notebooks. Fix #1789 #1812

Merged
merged 4 commits into from
Mar 10, 2018

Conversation

sharanry
Copy link
Contributor

Replace open() with smart_open() in notebooks to make it compatible with Python 3.x

Issue: #1789

@menshikh-iv
Copy link
Contributor

Thanks for PR @sharanry, two questions:

  • This replaces all open calls?
  • Are you test than notebooks works correctly (after replace)?

@sharanry
Copy link
Contributor Author

sharanry commented Dec 26, 2017

i have replaced all open calls in docs/notebooks directory.
What is the testing procedure? Should I run the notebooks locally?

@menshikh-iv
Copy link
Contributor

@sharanry yes

@sharanry
Copy link
Contributor Author

sharanry commented Jan 8, 2018

@menshikh-iv Could you review this?

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sharanry, are you already test it (i.e. re-run all notebooks locally with your changes)?

"class MyCorpus(object):\n",
" def __iter__(self):\n",
" for line in open('datasets/mycorpus.txt'):\n",
" for line in smart_open('datasets/mycorpus.txt'):\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to explicitly add mode, i.e. 'rb', 'r', etc.

@sharanry
Copy link
Contributor Author

sharanry commented Jan 15, 2018

@menshikh-iv I made the changes which you had asked.
I have been trying to run the notebooks for the last couple of days. Some of the dependencies like nltk seem to be failing with the notebooks. I made sure they are all at the latest versions. Any suggestions?

I will be trying with docker image provided next.

@menshikh-iv
Copy link
Contributor

@sharanry Also please merge develop from upstream to your branch, thanks.
That's all, let me know when you finished - I'll merge your PR.

…into smart_open

* 'develop' of https://github.com/RaRe-Technologies/gensim:
  Fix positional params used for `gensim.models.CoherenceModel` in `gensim.models.callbacks` (piskvorky#1823)
  Fix parameter setting for `FastText.train`. Fix piskvorky#1818 (piskvorky#1837)
  Refactor tests for `gensim.corpora.WikiCorpus`(piskvorky#1821)
  Fix formula in `gensim.summarization.bm25`. Fix piskvorky#1828 (piskvorky#1833)
  Fix docstrings for `gensim.matutils` (piskvorky#1804)
  Fix docstrings for `gensim.models.logentropy_model` (piskvorky#1803)
  Fix docstrings for `gensim.models.normmodel` (piskvorky#1805)
  Refactor API reference `gensim.topic_coherence`. Fix piskvorky#1669 (piskvorky#1714)
  Add CircleCI for build documentation. Fix piskvorky#1807 (piskvorky#1822)
  Fix docstrings for `gensim.models.translation_matrix` (piskvorky#1806)
  Fix docstrings for `gensim.models.rpmodel` (piskvorky#1802)
  Fix docstrings for `gensim.utils` (piskvorky#1797)
  Fix tox.ini/setup.cfg configuration (piskvorky#1815)
  Add wordnet mammal train file for Poincare notebook (piskvorky#1781)
@menshikh-iv
Copy link
Contributor

Hey @sharanry, how about the run of notebooks?

@sharanry
Copy link
Contributor Author

hey @menshikh-iv, sorry for the late reply, I havent been able to build the docker image, i tried docker build -t gensim . as given here. I am getting this error.

gensim

@menshikh-iv
Copy link
Contributor

@sharanry replace locales=2.23-0ubuntu9 from https://github.com/RaRe-Technologies/gensim/blob/09672b369b2ec94a06581efd060be20662d7006c/docker/Dockerfile#L28 to locales=2.23-0ubuntu10

@menshikh-iv menshikh-iv changed the title Replace open() with smart_open() in notebooks Replace open() with smart_open() in notebooks. Fix #1789 Feb 1, 2018
@menshikh-iv
Copy link
Contributor

@sharanry how is going? are you finished with running?

@sharanry
Copy link
Contributor Author

sharanry commented Feb 5, 2018

@menshikh-iv Apologies for the delay. I am incurring multiple errors with the docker image. I am unable to set it up.

Is there any alternative way I could run the notebooks?

@menshikh-iv
Copy link
Contributor

@sharanry what's kind of errors, can you show it here?

alternative way - locally on your machine.

@sharanry
Copy link
Contributor Author

sharanry commented Feb 5, 2018

gensim

I even tried with different network connections.

@sharanry
Copy link
Contributor Author

sharanry commented Feb 6, 2018

Finally got it to work somehow.
Got a bunch of errors in Poincare Evaluations
poincareeval1
poincareeval2
poincareeval3
poincareeval4
There errors are causing even more errors

@sharanry
Copy link
Contributor Author

sharanry commented Feb 6, 2018

Getting this error even after manually downloading movie_plots.csv from the link mentioned. Must be a problem with docker.

tboardvisual1

@menshikh-iv
Copy link
Contributor

Big thanks @sharanry, need to fix notebooks first (unfortunately, this doesn't work correctly).

@sharanry
Copy link
Contributor Author

sharanry commented Feb 6, 2018

@menshikh-iv I have yet to check all the notebooks. I can help fixing them if possible.

@menshikh-iv
Copy link
Contributor

@sharanry yes, it will be really nice, feel free to fix problems.

@sharanry sharanry mentioned this pull request Feb 10, 2018
@sharanry
Copy link
Contributor Author

@menshikh-iv Can we merge this?

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 15, 2018

@sharanry I want, but we need to run all of this notebooks first and check that all fine :(

UPD: let me check it myself

@menshikh-iv
Copy link
Contributor

@sharanry it was much harder than I thought

  • Some notebooks need very long time for running
  • I see many py3/py2 incompatibilities -> need to run notebook 2 or 3 times (27, 35, 36).
  • Some problems with using (like - "let's go to download dataset from X" or using non-trivial dependencies).

I'll try to run, fix, merge this stuff when I have enough time for it (probably, not soon), sorry for waiting.

Also, please move commits from #1893 to current PR (anyway, I will not merge this until we fix it), also, this allows us to avoid painful merge-conflict with notebooks.

@menshikh-iv
Copy link
Contributor

By discussion in #1964 (we'll rework most of the notebooks and provide small clear examples instead of most "long-running" stuff) + really hard to check (I spend a lot of time for it, but did not even check half, because of the need to test on 3 different python versions).

For this reason, I close current PR, sorry @sharanry, thanks for the work!

@menshikh-iv menshikh-iv closed this Mar 9, 2018
@piskvorky
Copy link
Owner

piskvorky commented Mar 9, 2018

I see no reason to wait for some undefined future tests or refactorings just to change open to smart_open. These are valid but orthogonal concerns.

I propose merging this (useful) PR.

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Mar 9, 2018

@piskvorky I close it because this needs a really long time to check that we didn't break something (and all works as expected) with this change.

P/S this thing is not so useful, without this modification everything works.

@piskvorky
Copy link
Owner

piskvorky commented Mar 10, 2018

smart_open is a drop-in replacement for open, so your checking is more checking of the rest of the notebooks, not the changes in this PR itself.

I agree it may not be a critical feature, but it is a welcome one -- please merge if the changes themselves are OK.

@menshikh-iv menshikh-iv reopened this Mar 10, 2018
@menshikh-iv menshikh-iv merged commit 17f3719 into piskvorky:develop Mar 10, 2018
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these changes in binary/text semantics?

@@ -67,7 +68,7 @@
" files = os.listdir(data_dir + yr_dir)\n",
" for filen in files:\n",
" # Note: ignoring characters that cause encoding errors.\n",
" with open(data_dir + yr_dir + '/' + filen, errors='ignore') as fid:\n",
" with smart_open(data_dir + yr_dir + '/' + filen, 'rb') as fid:\n",
Copy link
Owner

@piskvorky piskvorky Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (not in this PR, notebook already bad): use os.path.join for joining filesystem paths.

Also, why the change in errors='ignore'?

@@ -114,7 +115,7 @@
" # as well as pages about a single year.\n",
" # As a result, this preprocessing differs from the paper.\n",
" \n",
" with open(os.path.join(data_dir, fname)) as f:\n",
" with smart_open(os.path.join(data_dir, fname), 'rb') as f:\n",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a bug. If you open the file in binary mode, operations like split and lower have a different meaning, compared to text.

I see the same problem in many places in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants