
Unnecessary workaround in LDA tutorial #3074

Closed
jonaschn opened this issue Mar 15, 2021 · 14 comments · Fixed by #3082

Comments

@jonaschn
Contributor

The documentation in the LDA tutorial mentions an already fixed issue (piskvorky/smart_open#331).
What is now the proper way to read and untar the file on the fly?

https://github.com/RaRe-Technologies/gensim/blob/338ef330dea97c90c3180a9b570be9d0c9cef302/docs/src/auto_examples/tutorials/run_lda.py#L68-L91

@mpenkov
Collaborator

mpenkov commented Mar 15, 2021

You could try something like:

with smart_open.open(fname, "r") as fin:
    with tarfile.open(fin) as fout:
        ...

But YMMV.

@jonaschn
Contributor Author

I tried the following:

    with smart_open.open('https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz', "r") as fin:
        with tarfile.open(fin, mode='r:gz') as tar:
            # Ignore directory entries, as well as files like README, etc.
            files = [
                m for m in tar.getmembers()
                if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
            ]
            for member in sorted(files, key=lambda x: x.name):
                member_bytes = tar.extractfile(member).read()
                yield member_bytes.decode('utf-8', errors='replace')

and received the following error:
TypeError: expected str, bytes or os.PathLike object, not StreamReader
It seems like tarfile does not support this way of streaming, or did I misunderstand you?

@mpenkov
Collaborator

mpenkov commented Mar 16, 2021

My example was incomplete. If you look at the reference for the tarfile module, you'll find that the open function accepts a fileobj parameter. So, the working code would be:

$ python
Python 3.8.7 (default, Feb  3 2021, 07:09:08)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import smart_open
>>> f = smart_open.open("https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz", "rb")
>>> import tarfile
>>> tf = tarfile.open(fileobj=f)
>>> tf.getmembers()[:5]
[<TarInfo 'nipstxt' at 0x111da1880>, <TarInfo 'nipstxt/nips00' at 0x111da17c0>, <TarInfo 'nipstxt/nips00/0387.txt' at 0x111da1700>, <TarInfo 'nipstxt/nips00/0001.txt' at 0x111da1640>, <TarInfo 'nipstxt/nips00/0009.txt' at 0x111da1a00>]

The other important point is to open the outer file object in binary mode ("rb") rather than text mode.
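
To make that concrete, here is a minimal sketch of the two modes (assuming the same NIPS archive URL; the "r" branch is what produced the StreamReader error above):

import smart_open
import tarfile

url = "https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz"

# Text mode ("r") returns a str-based reader that tarfile cannot use as fileobj.
# Binary mode ("rb") returns a bytes file object, which is what tarfile expects.
with smart_open.open(url, "rb") as f:
    with tarfile.open(fileobj=f) as tf:
        print(tf.next().name)  # first entry in the archive, e.g. 'nipstxt'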

@jonaschn
Contributor Author

Thanks, I didn't know about the fileobj parameter.
Your suggested solution works, but it is slower than the proposed workaround (downloading the archive first):

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            # Ignore directory entries, as well as files like README, etc.
            files = [
                m for m in tar.getmembers()
                if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
            ]
            for member in sorted(files, key=lambda x: x.name):
                member_bytes = tar.extractfile(member).read()
                yield member_bytes.decode('utf-8', errors='replace')

On my local machine the workaround takes only:
CPU times: user 4.27 s, sys: 197 ms, total: 4.47 s
Wall time: 7.22 s

The file-streaming approach is ~10x slower:
CPU times: user 8.81 s, sys: 940 ms, total: 9.75 s
Wall time: 1min 9s

I assume the sorting causes this slowdown. Is there any rationale behind sorting the extracted files?
LDA doesn't care about the order of documents 😂

@mpenkov
Collaborator

mpenkov commented Mar 17, 2021

Perhaps... can you try disabling the sorting and re-measuring the time?

It's likely the sorting is there to make the processing order predictable for anybody eyeballing the intermediate results (e.g. in log statements). If it does not affect the final result, it isn't strictly necessary.
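
For illustration, a minimal sketch of what the unsorted variant might look like (assuming the same extract_documents generator and imports as above, with the getmembers()/sort pass replaced by a single streaming iteration; not necessarily the final PR code):

def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            # Iterate members in archive order; no getmembers() pass, no sorting.
            for member in tar:
                # Ignore directory entries, as well as files like README, etc.
                if member.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', member.name):
                    member_bytes = tar.extractfile(member).read()
                    yield member_bytes.decode('utf-8', errors='replace')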

jonaschn added a commit to jonaschn/gensim that referenced this issue Mar 19, 2021
@jonaschn
Contributor Author

My guess was correct: without sorting, it also takes around 7 seconds.
I opened a PR. How do I generate the other documentation files (.ipynb, etc.)?

@mpenkov
Collaborator

mpenkov commented Mar 19, 2021

Please see the READMEs in docs/src/gallery.

@piskvorky
Owner

piskvorky commented Mar 19, 2021

@jonaschn please see the "How to author documentation" wiki page, especially the Technical section at the bottom.

@jonaschn
Contributor Author

I pushed the other doc build artifacts.
Do I always need to run and build the whole documentation?
Actually, some dependencies are missing, e.g. nmslib.
It is not very convenient to contribute such fixes if building the whole documentation takes so much effort and time.

@piskvorky
Owner

piskvorky commented Mar 19, 2021

AFAIK the build shouldn't do anything for files that didn't change. Only the artifacts for files you actually changed (or created) should be regenerated.

@mpenkov is that accurate?

And we can definitely do this step on our end, don't worry about it too much. As long as your core Python code is fine, that's the main thing.

@mpenkov
Collaborator

mpenkov commented Mar 20, 2021

AFAIK the build shouldn't do anything for files that didn't change. Only the artifacts for files you actually changed (or created) should be regenerated. @mpenkov is that accurate?

It's in the ballpark, but not entirely. Yes, only files that changed will be regenerated. However, regenerating them requires the entire documentation build to be triggered (so e.g. make -C docs/src html). The build is then clever enough to skip over the parts that haven't changed.

This is not ideal for contributions like this, as @jonaschn has pointed out. It would be more convenient if they could regenerate the parts they changed without having to run the entire build (so in this case, only the LDA tutorial). Unfortunately, we haven't optimized for this use case, and I don't know if it's something that can be easily achieved with Sphinx.

I can see several ways forward:

  1. Do nothing, and expect documentation contributors to carry the burden. This may put people off contributing, because the workload for trivial things like fixing a typo is high.
  2. Investigate the method I described above (targeted regeneration of changed files only).
  3. Do the documentation builds ourselves, as @piskvorky has suggested. After all, as long as their Python code runs, we can generate the documentation from it. Contributors can test the Python code directly (by running it through the interpreter) and let us do the rest.
  4. Do additional documentation builds in CI using a trigger. So, for PRs like Make LDA tutorial read NIPS data on the fly #3082, if the contributor is unable to build the documentation themselves, we could trigger a CI build using a comment (GitHub Actions does this well).

I think 1) then 3) is probably the path of least resistance. If it becomes too much of a burden, then we can look at the other solutions, but it will require additional effort on our part.

@piskvorky
Owner

piskvorky commented Mar 20, 2021

Wait, I see two separate concerns:

A. Changing one tutorial caused rebuilding of that tutorial + all other tutorials & how-tos (takes forever)
B. Changing one tutorial caused rebuilding of that tutorial + generating docs (takes ~ a minute or two extra)

I thought we were talking about A, where @jonaschn had to wait hours. Is the complaint actually about B?

If so, I'm definitely for your option 1), @mpenkov.

@jonaschn
Contributor Author

Actually, the final build of the docs didn't take forever, but it was still pretty confusing and time-consuming for me to figure out how to contribute such an easy fix.
Most surprising to me was that I needed so many more dependencies, and one of them (nmslib) was actually missing from requirements_docs.txt. Installing these dependencies is mentioned neither in the contributor guide nor in the wiki.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2021

OK, so it sounds like we could improve our developer documentation a little bit and go with option 1, then.

mpenkov pushed a commit that referenced this issue Mar 22, 2021
* Read NIPS data on the fly

fix #3074

* Simplify download of NIPS data

* Add nmslib to requirements_docs.txt