
Unnecessary workaround in LDA tutorial #3074

Closed
jonaschn opened this issue Mar 15, 2021 · 14 comments · Fixed by #3082

Comments

@jonaschn
Contributor

The documentation in the LDA tutorial mentions an already fixed issue (piskvorky/smart_open#331).
What is now the proper way to read and untar the file on the fly?

https://github.com/RaRe-Technologies/gensim/blob/338ef330dea97c90c3180a9b570be9d0c9cef302/docs/src/auto_examples/tutorials/run_lda.py#L68-L91

@mpenkov
Collaborator

mpenkov commented Mar 15, 2021

You could try something like:

with smart_open.open(fname, "r") as fin:
    with tarfile.open(fin) as fout:
        ...

But YMMV.

@jonaschn
Contributor Author

I tried the following:

    with smart_open.open('https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz', "r") as fin:
        with tarfile.open(fin, mode='r:gz') as tar:
            # Ignore directory entries, as well as files like README, etc.
            files = [
                m for m in tar.getmembers()
                if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
            ]
            for member in sorted(files, key=lambda x: x.name):
                member_bytes = tar.extractfile(member).read()
                yield member_bytes.decode('utf-8', errors='replace')

and received the following error:
TypeError: expected str, bytes or os.PathLike object, not StreamReader
It seems like tarfile does not support this way of streaming, or did I misunderstand you?

@mpenkov
Collaborator

mpenkov commented Mar 16, 2021

My example was incomplete. If you look at the reference for the tarfile module, you'll find that the open function accepts a fileobj parameter. So, the working code would be:

$ python
Python 3.8.7 (default, Feb  3 2021, 07:09:08)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import smart_open
>>> f = smart_open.open("https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz", "rb")
>>> import tarfile
>>> tf = tarfile.open(fileobj=f)
>>> tf.getmembers()[:5]
[<TarInfo 'nipstxt' at 0x111da1880>, <TarInfo 'nipstxt/nips00' at 0x111da17c0>, <TarInfo 'nipstxt/nips00/0387.txt' at 0x111da1700>, <TarInfo 'nipstxt/nips00/0001.txt' at 0x111da1640>, <TarInfo 'nipstxt/nips00/0009.txt' at 0x111da1a00>]

The other important point is to open the outer file object in binary mode ("rb") rather than text mode.
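
To make that concrete, here is a minimal sketch of the two modes (assuming the same NIPS archive URL; the "r" branch is what produced the StreamReader error above):

import smart_open
import tarfile

url = "https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz"

# Text mode ("r") returns a str-based reader that tarfile cannot use as fileobj.
# Binary mode ("rb") returns a bytes file object, which is what tarfile expects.
with smart_open.open(url, "rb") as f:
    with tarfile.open(fileobj=f) as tf:
        print(tf.next().name)  # first entry in the archive, e.g. 'nipstxt'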

@jonaschn
Contributor Author

Thanks, I didn't know about the fileobj parameter.
Your suggested solution works, but it is slower than the proposed workaround (downloading the archive first):

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            # Ignore directory entries, as well as files like README, etc.
            files = [
                m for m in tar.getmembers()
                if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
            ]
            for member in sorted(files, key=lambda x: x.name):
                member_bytes = tar.extractfile(member).read()
                yield member_bytes.decode('utf-8', errors='replace')

On my local machine the workaround takes only:
CPU times: user 4.27 s, sys: 197 ms, total: 4.47 s
Wall time: 7.22 s

The file-streaming approach is ~10x slower:
CPU times: user 8.81 s, sys: 940 ms, total: 9.75 s
Wall time: 1min 9s

I assume the sorting causes this slowdown. Is there any rationale behind sorting the extracted files?
LDA doesn't care about the order of documents 😂

@mpenkov
Collaborator

mpenkov commented Mar 17, 2021

Perhaps... can you try disabling the sorting and re-measuring the time?

It's likely the sorting is there to make the processing order predictable for anybody eyeballing the intermediate results (e.g. in log statements). If it does not affect the final result, it isn't strictly necessary.
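
For illustration, a minimal sketch of what the unsorted variant might look like (assuming the same extract_documents generator and imports as above, with the getmembers()/sort pass replaced by a single streaming iteration; not necessarily the final PR code):

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            # Iterate members in archive order; no getmembers() pass, no sorting.
            for member in tar:
                # Ignore directory entries, as well as files like README, etc.
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')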

jonaschn added a commit to jonaschn/gensim that referenced this issue Mar 19, 2021
@jonaschn
Contributor Author

My guess was correct: without sorting, it also takes around 7 seconds.
I opened a PR. How do I generate the other documentation files (.ipynb, etc.)?

@mpenkov
Collaborator

mpenkov commented Mar 19, 2021

Please see the READMEs in docs/src/gallery.

@piskvorky
Owner

piskvorky commented Mar 19, 2021

@jonaschn please see the "How to author documentation" wiki page, especially the Technical section at the bottom.

@jonaschn
Contributor Author

I pushed the other doc build artifacts.
Do I always need to run and build the whole documentation?
Actually, some dependencies are missing, e.g. nmslib.
It is not very convenient to contribute such fixes if building the whole documentation takes so much effort and time.

@piskvorky
Owner

piskvorky commented Mar 19, 2021

AFAIK the build shouldn't do anything for files that didn't change. Only the artifacts for files you actually changed (or created) should be regenerated.

@mpenkov is that accurate?

And we can definitely do this step on our end, don't worry about it too much. As long as your core Python code is fine, that's the main thing.

@mpenkov
Collaborator

mpenkov commented Mar 20, 2021

AFAIK the build shouldn't do anything for files that didn't change. Only the artifacts for files you actually changed (or created) should be regenerated. @mpenkov is that accurate?

It's in the ballpark, but not entirely. Yes, only files that changed will be regenerated. However, regenerating them requires the entire documentation build to be triggered (so e.g. make -C docs/src html). The build is then clever enough to skip over the parts that haven't changed.

This is not ideal for contributions like this, as @jonaschn has pointed out. It would be more convenient if they could regenerate the parts they changed without having to run the entire build (so in this case, only the LDA tutorial). Unfortunately, we haven't optimized for this use case, and I don't know if it's something that can be easily achieved with Sphinx.

I can see several ways forward:

  1. Do nothing, and expect documentation contributors to carry the burden. This may put people off contributing, because the workload for trivial things like fixing a typo is high.
  2. Investigate the method I described above (targeted regeneration of changed files only).
  3. Do the documentation builds ourselves, as @piskvorky has suggested. After all, as long as their Python code runs, we can generate the documentation from it. Contributors can test the Python code directly (by running it through the interpreter) and let us do the rest.
  4. Do additional documentation builds in CI using a trigger. So, for PRs like Make LDA tutorial read NIPS data on the fly #3082, if the contributor is unable to build the documentation themselves, we could trigger a CI build using a comment (GitHub Actions does this well).

I think 1) then 3) is probably the path of least resistance. If it becomes too much of a burden, then we can look at the other solutions, but it will require additional effort on our part.

@piskvorky
Owner

piskvorky commented Mar 20, 2021

Wait, I see two separate concerns:

A. Changing one tutorial caused rebuilding of that tutorial + all other tutorials & how-tos (takes forever)
B. Changing one tutorial caused rebuilding of that tutorial + generating docs (takes ~ a minute or two extra)

I thought we were talking about A, where @jonaschn had to wait hours. Is the complaint actually about B?

If so, I'm definitely for your option 1), @mpenkov.

@jonaschn
Contributor Author

Actually, the final build of the docs didn't take forever, but it was still pretty confusing and time-consuming for me to figure out how to contribute such an easy fix.
Most surprising to me was that I needed so many more dependencies, and one of them (nmslib) was actually missing from requirements_docs.txt. Installing these dependencies is mentioned neither in the contributor guide nor in the wiki.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2021

OK, so it sounds like we could improve our developer documentation a little bit and go with option 1, then.

mpenkov pushed a commit that referenced this issue Mar 22, 2021
* Read NIPS data on the fly

fix #3074

* Simplify download of NIPS data

* Add nmslib to requirements_docs.txt