-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Data/model storage. Fix 1453 #1632
Conversation
gensim/downloader.py
Outdated
|
||
|
||
def _create_base_dir(): | ||
r"""Create the gensim-data directory in home directory, if it has not been already created. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it really needed to add r
for all docstrings ? What's a reason?
gensim/downloader.py
Outdated
sys.stdout.flush() | ||
|
||
|
||
def _create_base_dir(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe use __
instead of _
will be better (for hiding from import), here and everywhere?
gensim/downloader.py
Outdated
|
||
if __name__ == '__main__': | ||
logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO) | ||
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to pass custom "usage" string here (argparse will generate it automatically)
gensim/downloader.py
Outdated
logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO) | ||
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]") | ||
group = parser.add_mutually_exclusive_group() | ||
group.add_argument("-d", "--download", metavar="data__name", nargs=1, help="To download a corpus/model : python -m gensim.downloader -d corpus/model name") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange names for metavar
, why metavar
is needed here?
gensim/downloader.py
Outdated
logger.info("%s downloaded", name) | ||
else: | ||
rmtree(tmp_dir) | ||
raise Exception("There was a problem in downloading the data. We recommend you to re-try.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add info about checksums (concrete filename, expected checksum, real checksum, expected size, real size).
Great job @chaitaliSaini, now your code is more readable and clear (and works stable) 🔥 👍 @anotherbugmaster will review your docstrings today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, thank you! Fix the minor issues and check out this styleguide (in case you haven't yet), it will help you write consistent documentation:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard
gensim/downloader.py
Outdated
|
||
|
||
def progress(chunks_downloaded, chunk_size, total_size): | ||
r"""Create and update the progress bar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is r necessary?
gensim/downloader.py
Outdated
filled_len = int(math.floor((bar_len * size_downloaded) / total_size)) | ||
percent_downloaded = round((size_downloaded * 100) / total_size, 1) | ||
bar = '=' * filled_len + '-' * (bar_len - filled_len) | ||
sys.stdout.write('[%s] %s%s %s/%sMB downloaded\r' % (bar, percent_downloaded, "%", round(size_downloaded / (1024 * 1024), 1), round(float(total_size) / (1024 * 1024), 1))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too long
gensim/downloader.py
Outdated
|
||
|
||
def _calculate_md5_checksum(tar_file): | ||
r"""Calculate the checksum of the given tar.gz file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r again :)
gensim/downloader.py
Outdated
def info(name=None): | ||
r"""Return the information related to model/dataset. | ||
|
||
If name is supplied, then information related to the given dataset/model will be returned. Otherwise detailed information of all model/datasets will be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too long, split it
gensim/downloader.py
Outdated
Returns | ||
------- | ||
dict | ||
Return detailed information about all models/datasets if name is not provided. Otherwise return detailed informtiona of the specific model/dataset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too long
gensim/downloader.py
Outdated
data: | ||
load model to memory | ||
data_dir: str | ||
return path of dataset/model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No new line after last section
gensim/downloader.py
Outdated
|
||
Parameters | ||
---------- | ||
name : {None, data name}, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name : str or None, optional
is the right way. Also try to write a description after every parameter.
gensim/downloader.py
Outdated
Parameters | ||
---------- | ||
name: str | ||
dataset/model name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capital letters
gensim/downloader.py
Outdated
Parameters | ||
---------- | ||
name: str | ||
dataset/model name which has to be downloaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also capital letters
import numpy as np | ||
|
||
|
||
class TestApi(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add test for multipart
gensim/downloader.py
Outdated
import math | ||
import shutil | ||
import tempfile | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One try/catch is enough here.
gensim/downloader.py
Outdated
Parameters | ||
---------- | ||
chunks_downloaded : int | ||
Number of chunks of data that have been downloaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.
at the end of sentence (here and anywhere)
gensim/downloader.py
Outdated
|
||
def _create_base_dir(): | ||
"""Create the gensim-data directory in home directory, if it has not been already created. | ||
Raises |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing newline before section title
gensim/downloader.py
Outdated
"""Create the gensim-data directory in home directory, if it has not been already created. | ||
Raises | ||
------ | ||
File Exists Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raises
---------
Exception
Two possible reasons: ...
gensim/downloader.py
Outdated
return data['models'][name]["checksum"] | ||
else: | ||
if name in corpora: | ||
return data['corpora'][name]["checksum-" + str(part)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"cheksum-{}".format(part)
instead
gensim/downloader.py
Outdated
tmp_dir = tempfile.mkdtemp() | ||
tmp_load_file_path = os.path.join(tmp_dir, "__init__.py") | ||
urllib.urlretrieve(url_load_file, tmp_load_file_path) | ||
no_parts = int(_get_parts(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store it as int, don't cast
gensim/downloader.py
Outdated
compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part)) | ||
tmp_data_file_dir = os.path.join(tmp_dir, compressed_folder_name) | ||
logger.info("Downloading Part %s/%s", part, no_parts) | ||
urllib.urlretrieve(url_data, tmp_data_file_dir, reporthook=_progress) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Show part on progressbar
gensim/downloader.py
Outdated
concatenated_folder_dir = os.path.join(tmp_dir, concatenated_folder_name) | ||
for part in range(1, no_parts + 1): | ||
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part)) | ||
compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use numeric suffixes
gensim/downloader.py
Outdated
os.remove(concatenated_folder_dir) | ||
os.rename(tmp_dir, data_folder_dir) | ||
else: | ||
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz".format(f=name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make distinct function
gensim/downloader.py
Outdated
logger.info("%s \n", json.dumps(data['corpora'][name], indent=4)) | ||
return data['corpora'][name] | ||
elif name in models: | ||
logger.info("%s \n", json.dumps(data['corpora'][name], indent=4)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug data['corpora'][name]
-> data['models'][name]
Finished in #1705 |
API for dataset/model storage (old PR #1492).