Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Data/model storage. Fix 1453 #1632

Closed
wants to merge 24 commits into from
Closed

[WIP] Data/model storage. Fix 1453 #1632

wants to merge 24 commits into from

Conversation

chaitaliSaini
Copy link
Contributor

@chaitaliSaini chaitaliSaini commented Oct 17, 2017

API for dataset/model storage (old PR #1492).

@menshikh-iv menshikh-iv changed the title [WIP]Data/model storage [WIP] Data/model storage Oct 17, 2017
@menshikh-iv menshikh-iv changed the title [WIP] Data/model storage [WIP] Data/model storage. Fix 1453 Oct 17, 2017
@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Oct 17, 2017


def _create_base_dir():
r"""Create the gensim-data directory in home directory, if it has not been already created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really needed to add r for all docstrings ? What's a reason?

sys.stdout.flush()


def _create_base_dir():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use __ instead of _ will be better (for hiding from import), here and everywhere?


if __name__ == '__main__':
logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to pass custom "usage" string here (argparse will generate it automatically)

logging.basicConfig(format='%(asctime)s :%(name)s :%(levelname)s :%(message)s', stream=sys.stdout, level=logging.INFO)
parser = argparse.ArgumentParser(description="Gensim console API", usage="python -m gensim.api.downloader [-h] [-d data__name | -i data__name | -c]")
group = parser.add_mutually_exclusive_group()
group.add_argument("-d", "--download", metavar="data__name", nargs=1, help="To download a corpus/model : python -m gensim.downloader -d corpus/model name")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange names for metavar, why metavar is needed here?

logger.info("%s downloaded", name)
else:
rmtree(tmp_dir)
raise Exception("There was a problem in downloading the data. We recommend you to re-try.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add info about checksums (concrete filename, expected checksum, real checksum, expected size, real size).

@menshikh-iv
Copy link
Contributor

Great job @chaitaliSaini, now your code is more readable and clear (and works stable) 🔥 👍 @anotherbugmaster will review your docstrings today.

@menshikh-iv menshikh-iv requested review from menshikh-iv and removed request for menshikh-iv October 26, 2017 13:05
Copy link
Contributor

@anotherbugmaster anotherbugmaster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, thank you! Fix the minor issues and check out this styleguide (in case you haven't yet), it will help you write consistent documentation:

https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt#docstring-standard



def progress(chunks_downloaded, chunk_size, total_size):
r"""Create and update the progress bar.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is r necessary?

filled_len = int(math.floor((bar_len * size_downloaded) / total_size))
percent_downloaded = round((size_downloaded * 100) / total_size, 1)
bar = '=' * filled_len + '-' * (bar_len - filled_len)
sys.stdout.write('[%s] %s%s %s/%sMB downloaded\r' % (bar, percent_downloaded, "%", round(size_downloaded / (1024 * 1024), 1), round(float(total_size) / (1024 * 1024), 1)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long



def _calculate_md5_checksum(tar_file):
r"""Calculate the checksum of the given tar.gz file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r again :)

def info(name=None):
r"""Return the information related to model/dataset.

If name is supplied, then information related to the given dataset/model will be returned. Otherwise detailed information of all model/datasets will be returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long, split it

Returns
-------
dict
Return detailed information about all models/datasets if name is not provided. Otherwise return detailed informtiona of the specific model/dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too long

data:
load model to memory
data_dir: str
return path of dataset/model.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No new line after last section


Parameters
----------
name : {None, data name}, optional
Copy link
Contributor

@anotherbugmaster anotherbugmaster Oct 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name : str or None, optional is the right way. Also try to write a description after every parameter.

Parameters
----------
name: str
dataset/model name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capital letters

Parameters
----------
name: str
dataset/model name which has to be downloaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also capital letters

import numpy as np


class TestApi(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add test for multipart

import math
import shutil
import tempfile
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One try/catch is enough here.

Parameters
----------
chunks_downloaded : int
Number of chunks of data that have been downloaded
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. at the end of sentence (here and anywhere)


def _create_base_dir():
"""Create the gensim-data directory in home directory, if it has not been already created.
Raises
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing newline before section title

"""Create the gensim-data directory in home directory, if it has not been already created.
Raises
------
File Exists Error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raises
---------
Exception
    Two possible reasons: ...

return data['models'][name]["checksum"]
else:
if name in corpora:
return data['corpora'][name]["checksum-" + str(part)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"cheksum-{}".format(part) instead

tmp_dir = tempfile.mkdtemp()
tmp_load_file_path = os.path.join(tmp_dir, "__init__.py")
urllib.urlretrieve(url_load_file, tmp_load_file_path)
no_parts = int(_get_parts(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store it as int, don't cast

compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
tmp_data_file_dir = os.path.join(tmp_dir, compressed_folder_name)
logger.info("Downloading Part %s/%s", part, no_parts)
urllib.urlretrieve(url_data, tmp_data_file_dir, reporthook=_progress)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show part on progressbar

concatenated_folder_dir = os.path.join(tmp_dir, concatenated_folder_name)
for part in range(1, no_parts + 1):
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
compressed_folder_name = "{f}.tar.gz_a{p}".format(f=name, p=chr(96 + part))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use numeric suffixes

os.remove(concatenated_folder_dir)
os.rename(tmp_dir, data_folder_dir)
else:
url_data = "https://github.com/chaitaliSaini/gensim-data/releases/download/{f}/{f}.tar.gz".format(f=name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make distinct function

logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
return data['corpora'][name]
elif name in models:
logger.info("%s \n", json.dumps(data['corpora'][name], indent=4))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug data['corpora'][name] -> data['models'][name]

@menshikh-iv
Copy link
Contributor

Finished in #1705

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants