Unnecessary workaround in LDA tutorial #3074
You could try something like:

```python
with smart_open.open(fname, "r") as fin:
    with tarfile.open(fin) as fout:
        ...
```

But YMMV.
I tried the following:

```python
with smart_open.open('https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz', "r") as fin:
    with tarfile.open(fin, mode='r:gz') as tar:
        # Ignore directory entries, as well as files like README, etc.
        files = [
            m for m in tar.getmembers()
            if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
        ]
        for member in sorted(files, key=lambda x: x.name):
            member_bytes = tar.extractfile(member).read()
            yield member_bytes.decode('utf-8', errors='replace')
```

and received the following error:
My example was incomplete. If you look at the reference for the tarfile submodule, you'll find that the open function accepts a fileobj parameter. So, the working code would be:

```pycon
$ python
Python 3.8.7 (default, Feb 3 2021, 07:09:08)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import smart_open
>>> f = smart_open.open("https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz", "rb")
>>> import tarfile
>>> tf = tarfile.open(fileobj=f)
>>> tf.getmembers()[:5]
[<TarInfo 'nipstxt' at 0x111da1880>, <TarInfo 'nipstxt/nips00' at 0x111da17c0>, <TarInfo 'nipstxt/nips00/0387.txt' at 0x111da1700>, <TarInfo 'nipstxt/nips00/0001.txt' at 0x111da1640>, <TarInfo 'nipstxt/nips00/0009.txt' at 0x111da1a00>]
```

The other important point is to open the outer file object for reading in binary mode as opposed to text.
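To see why the binary mode matters without re-downloading anything, here is a minimal self-contained sketch using an in-memory archive (the member name `doc.txt` is made up for illustration): `tarfile`, and the gzip layer underneath it, operate on bytes, so a text-mode stream fails while a binary one works.

```python
import io
import tarfile

# Build a tiny .tar.gz archive in memory (contents made up for illustration).
raw = io.BytesIO()
with tarfile.open(fileobj=raw, mode="w:gz") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="doc.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
raw.seek(0)

# A binary file object works with the fileobj parameter:
with tarfile.open(fileobj=raw) as tar:
    names = tar.getnames()
print(names)  # ['doc.txt']

# A text-mode stream does not: tarfile (and gzip underneath) expect bytes.
try:
    tarfile.open(fileobj=io.StringIO("not bytes"))
    text_mode_failed = False
except Exception:
    text_mode_failed = True
print(text_mode_failed)  # True
```

The same applies to the remote case: `smart_open.open(url, "rb")` hands `tarfile` a binary stream, while `"r"` would wrap it in a text decoder and break header parsing.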
Thanks, I didn't know about the fileobj parameter.

```python
def extract_documents(url='https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz'):
    with smart_open.open(url, "rb") as file:
        with tarfile.open(fileobj=file) as tar:
            # Ignore directory entries, as well as files like README, etc.
            files = [
                m for m in tar.getmembers()
                if m.isfile() and re.search(r'nipstxt/nips\d+/\d+\.txt', m.name)
            ]
            for member in sorted(files, key=lambda x: x.name):
                member_bytes = tar.extractfile(member).read()
                yield member_bytes.decode('utf-8', errors='replace')
```

On my local machine the workaround takes only:

The filestream method is ~10x slower:

I assume the sorting leads to this behavior. Is there any rationale behind sorting the extracted files?
Perhaps... can you try disabling the sorting and re-measure the time? It's likely the sorting is there to make the processing order predictable for anybody eyeballing the intermediate result (e.g. log statements, etc.). If it does not affect the final result, it isn't strictly necessary.
My guess was correct: without sorting it also takes around 7 seconds.
Please see the READMEs in `docs/src/gallery`.
@jonaschn please see the How to author documentation Wiki, especially the Technical section at the bottom.
I pushed the other doc build artifacts. |
AFAIK the build shouldn't do anything for files that didn't change. Only the artifacts for files you actually changed (or created) should be regenerated. @mpenkov is that accurate? And we can definitely do this step on our end, don't worry about it too much. As long as your core Python code is fine, that's the main thing.
It's in the ballpark, but not entirely. Yes, only files that changed will be regenerated. However, regenerating them requires the entire documentation build to be triggered (e.g. `make -C docs/src html`). The build is then clever enough to skip over the parts that haven't changed.

This is not ideal for contributions like this, as @jonaschn has pointed out. It would be more convenient if they could regenerate the parts they changed without having to run the entire build (so in this case, only the LDA tutorial). Unfortunately, we haven't optimized for this use case, and I don't know if it's something that can be easily achieved with Sphinx. I can see several ways forward:
I think 1) then 3) is probably the path of least resistance. If it becomes too much of a burden, then we can look at the other solutions, but it will require additional effort on our part.
Wait, I see two separate concerns:

A. Changing one tutorial caused rebuilding of that tutorial + all other tutorials & how-tos (takes forever)

I thought we're talking about A, where @jonaschn had to wait hours. Is the complaint actually about B? If so, I'm definitely for your option 1) @mpenkov.
Actually, the final build of the docs didn't take forever, but it was pretty confusing and time-consuming for me to figure out how to contribute such an easy fix.
OK, so it sounds like we could improve our developer documentation a little bit and go with option 1, then.
* Read NIPS data on the fly (fix #3074)
* Simplify download of NIPS data
* Add nmslib to requirements_docs.txt
The documentation in the LDA tutorial mentions an already fixed issue (piskvorky/smart_open#331).
What is now the proper way to read and untar the file on the fly?
https://github.com/RaRe-Technologies/gensim/blob/338ef330dea97c90c3180a9b570be9d0c9cef302/docs/src/auto_examples/tutorials/run_lda.py#L68-L91