[WIP] Data/Model storage #1492

chaitaliSaini · 2017-07-20T14:29:34Z

API for model/data storage

piskvorky · 2017-07-22T04:07:53Z

Looks nice! Is this your original code?

If not, you have to attribute it properly.

piskvorky · 2017-07-28T14:57:42Z

gensim/__main__.py

@@ -1,50 +1,57 @@
+"""Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH,
+2015 Matthew Honnibal
+https://github.com/explosion/spaCy/blob/master/LICENSE


List the explicit license, not a link (a link can change in time, go dead etc).

Also, if you're making changes, be sure to include yourself in the authors and copyright (say "adapted from ..., original author ..., license ...").

piskvorky · 2017-07-28T15:04:25Z

Before merging, we'll have to clean up the code a little:

drop the dependencies on ftfy, plac etc (no need for those)
simplify the module structure (drop compat, about, message wrapping etc), make the logic simpler and more compact
code style consistent with gensim (no vertical indent etc)

piskvorky · 2017-07-28T15:14:18Z

In fact, drawing inspiration for the API but writing it from scratch, in a few concise functions, is preferable to this mass copy&paste from spaCy. It looks straightforward enough, and will avoid much of the complexity here as well as any copyright issues.

piskvorky · 2017-07-28T15:15:02Z

gensim/console_api/link.py

+    dictionary. If require is set to True, raise an error if no meta.json
+    found.
+    """
+    location = package_path / package / 'meta.json'


What does this / operator do? Where does it come from?

/ is similar to os.path.join()
https://docs.python.org/3/library/pathlib.html#operators

Ah, interesting, I didn't know about that. Does that work in Python 2 too?
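
For reference, a minimal sketch of how the `/` operator on `pathlib.Path` relates to `os.path.join()`. The directory and package names are made up for illustration. On the Python 2 question: `pathlib` ships with the standard library only from Python 3.4 onward, so Python 2 would need a backport such as the `pathlib2` package.

```python
import os
from pathlib import Path  # standard library since Python 3.4

# Hypothetical values, for illustration only.
package_path = Path("data")
package = "some_model"

# Path overloads "/" (via __truediv__) to join path components.
location = package_path / package / "meta.json"

# The same path built the classic way.
joined = os.path.join("data", "some_model", "meta.json")
```

Both spellings produce the same path string; the `Path` version additionally gives access to helpers such as `location.name` or `location.exists()`.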

chaitaliSaini · 2017-07-28T16:57:44Z

OK, I will start writing my own code. Should I first make a template and share it, so that there are neither more nor fewer functions than required?

chaitaliSaini · 2017-07-28T17:30:23Z

I have two questions:

1) Are we going to have different versions of models or corpora compatible with different versions of gensim?
2) Are we going to provide shortcuts for downloading models/corpora, or does the user have to write the full name? We will provide functionality to list all the models anyway.

piskvorky · 2017-07-29T02:22:16Z

Good questions!

I'd say 1) no (except in the repo history, which I think is still downloadable? we could add a little how-to to our FAQ or something, but I don't think we need to maintain a full-blown automated dependency resolution packaging system, sounds like a headache) 2) no (just one way to do it -- the fewer moving pieces, the better).

CC @menshikh-iv @gojomo

piskvorky · 2017-08-04T12:13:09Z

gensim/api/__init__.py

-            [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', url],
-            env=os.environ.copy())
+    url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
+    url = url+"download/"+file+"/"+file+".tar.gz"


file is a built-in name in Python 2; please don't shadow it.

Also, shouldn't it be url-encoded if used like this? Or is the expectation that the argument is already url-encoded?

Oh, I will rename the variable file.
No, I have not encoded the URL, as I am currently not using any special characters or spaces in corpus/model names, but if the plan is to allow those in model/corpus names, I'll add it.
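
A hedged sketch of what URL-encoding the name could look like. `build_url` and `RELEASE_URL` are hypothetical names introduced here; the template simply mirrors the temporary release URL from the diff above.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

# Mirrors the temporary release URL used in the diff above.
RELEASE_URL = (
    "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
    "download/{name}/{name}.tar.gz"
)


def build_url(name):
    """Return the download URL for `name`, percent-encoding unsafe characters."""
    # safe="" also encodes "/", so a name cannot add extra path segments.
    return RELEASE_URL.format(name=quote(name, safe=""))
```

With this, a name containing a space or a slash still yields a well-formed URL, and names without special characters are unchanged.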

piskvorky · 2017-08-04T12:15:29Z

gensim/api/__init__.py

+    base_dir = os.path.join(user_dir, 'gensim-data')
+    extracted_folder_dir = os.path.join(base_dir, file)
+    if not os.path.exists(base_dir):
+        os.makedirs(base_dir)


Needs a clear log message (INFO or even WARNING).

In fact, what is the strategy for communicating information to the user in this PR? Do we use logging (incl. timestamps, log level etc), or just print stuff to stdout?

Currently, I am just printing to stdout.

Is that what we want? What are the pros/cons vs logging?

I just read about it, and logging is much better than using print. So I'll add logging.

piskvorky

Code style comments.

piskvorky · 2017-08-07T06:00:43Z

gensim/api/__init__.py

@@ -14,23 +16,52 @@
 except ImportError:
    from urllib2 import urlopen

+logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', filename="api.log", level=logging.INFO)


Logging configuration belongs only to __main__.

piskvorky · 2017-08-07T06:02:14Z

gensim/api/__init__.py

    if not os.path.exists(base_dir):
+        logging.info("Creating {}".format(base_dir))


Module should create a logger and use that for logging (not these global logging wrappers). See other gensim modules for an example.
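
The two logging comments above can be sketched together as follows. This is an illustration under assumptions, not gensim's actual code: `create_message` and `configure_logging` are hypothetical helpers, and the path is made up. The module only creates a named logger; the root logger is configured solely by the user-invoked entry point.

```python
import logging

# Library side: one logger per module, no handlers, no basicConfig().
logger = logging.getLogger(__name__)


def create_message(base_dir):
    """Log that `base_dir` is being created (stand-in for library code)."""
    logger.info("Creating %s", base_dir)


def configure_logging():
    """Script side: configure the root logger once, from the entry point."""
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )


if __name__ == "__main__":
    # Only a user-invoked script decides where log records go.
    configure_logging()
    create_message("/tmp/gensim-data")  # hypothetical path
```

When the module is imported as a library, nothing is configured, so the application that imports it keeps full control over handlers, levels and formats.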

piskvorky · 2017-08-07T06:02:42Z

gensim/api/__init__.py

+            tar.close()
+            logging.info("{} installed".format(file_name))
+        else:
+            logging.error("Not able to create {d}. Make sure you have the "


No vertical indent (please use hanging indent; here and everywhere else).

piskvorky · 2017-08-07T06:03:27Z

gensim/api/__init__.py

 import tarfile
+import shutil
+from ..utils import SaveLoad


Better to use absolute imports, e.g. from gensim.utils import SaveLoad, for clarity and readability.

menshikh-iv · 2017-08-21T07:21:48Z

gensim/__main__.py

+    parser = argparse.ArgumentParser(description="Gensim console API")
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument(
+        "-d", "--download", nargs=1,


No need for line breaks here (the line is not too long); same below.

menshikh-iv · 2017-08-21T07:22:47Z

gensim/api/__init__.py

+
+def download(file_name):
+    url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
+    url = url + "download/" + file_name + "/" + file_name + ".tar.gz"


Please use the format method (instead of string concatenation with +).

menshikh-iv · 2017-08-22T07:01:09Z

gensim/__main__.py

+        "-c", "--catalogue", help="To get the list of all models/corpus stored"
+        " : python -m gensim -c", action="store_true")
+    args = parser.parse_args()
+    if sys.argv[1] == "-d" or sys.argv[1] == "--download":


Looks suspicious: you already use argparse above, so there is no need to look into sys.argv. Please use only argparse.
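
A minimal sketch of dispatching purely on the parsed arguments. The help strings and the `dispatch` helper are placeholders invented for this example. Note that indexing `sys.argv[1]` directly would also raise an `IndexError` when the script is run without arguments, which this approach avoids.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Gensim console API")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-d", "--download", nargs=1, help="Name of the model/corpus to download")
    group.add_argument("-c", "--catalogue", action="store_true", help="List all models/corpora")
    return parser


def dispatch(argv):
    """Return a (command, argument) pair decided purely from argparse output."""
    args = build_parser().parse_args(argv)
    if args.download is not None:
        return ("download", args.download[0])
    if args.catalogue:
        return ("catalogue", None)
    return ("help", None)
```

Passing `argv` explicitly also makes the command-line handling easy to unit-test.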

menshikh-iv · 2017-08-22T07:03:52Z

gensim/api/__init__.py

+    user_dir = os.path.expanduser('~')
+    base_dir = os.path.join(user_dir, 'gensim-data')
+    extracted_folder_dir = os.path.join(base_dir, file_name)
+    if not os.path.exists(base_dir):


If I have a regular file named ~/gensim-data, this will not work correctly: os.path.exists() is also true for files, so the directory is never created.
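
One possible way to handle that case, sketched with a hypothetical `ensure_base_dir` helper: distinguish "is a directory" from "exists", and fail loudly when something else occupies the path.

```python
import os


def ensure_base_dir(base_dir):
    """Create `base_dir` if missing; fail loudly if a non-directory is in the way."""
    if os.path.isdir(base_dir):
        return base_dir
    if os.path.exists(base_dir):
        # os.path.exists() is also True for regular files, which is the
        # situation described in the review comment.
        raise OSError("%s exists but is not a directory" % base_dir)
    os.makedirs(base_dir)
    return base_dir
```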

menshikh-iv · 2017-08-22T07:05:34Z

gensim/api/__init__.py

+    base_dir = os.path.join(user_dir, 'gensim-data')
+    extracted_folder_dir = os.path.join(base_dir, file_name)
+    if not os.path.exists(base_dir):
+        logger.info("Creating {}".format(base_dir))


Please use correct logging formatting: logger.info("Creating %s", base_dir) (here and everywhere). Look at this example.
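
A small demonstration of why passing arguments to the logger differs from pre-formatting the string. The `Tracked` class and the logger name are invented for this sketch; the point is that `str.format()` always runs, while `%s` arguments are only interpolated when a record is actually emitted.

```python
import logging

logger = logging.getLogger("lazy_formatting_demo")
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.WARNING)  # INFO records are dropped
logger.propagate = False


class Tracked(object):
    """Count how many times the object is converted to a string."""
    calls = 0

    def __str__(self):
        Tracked.calls += 1
        return "base_dir"


value = Tracked()

# Eager: str.format() runs before logging decides to drop the record.
logger.info("Creating {}".format(value))
eager_calls = Tracked.calls

# Lazy: the argument is never converted because the record is discarded.
logger.info("Creating %s", value)
lazy_calls = Tracked.calls - eager_calls
```

Here the eager call converts the object once even though nothing is logged, while the lazy call does no formatting work at all.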

menshikh-iv · 2017-08-22T07:08:20Z

gensim/api/__init__.py

+
+def catalogue(print_list=True):
+    url = "https://raw.githubusercontent.com/chaitaliSaini/Corpus_and_models/"
+    url = url + "master/list.json"


You should store the full URL as a constant (without concatenation).

menshikh-iv · 2017-08-22T07:09:15Z

gensim/api/__init__.py

+        print("Models available : ")
+        for model in models:
+            print(model)
+    else:


No need for the else branch; always return the data.

menshikh-iv · 2017-08-22T07:09:47Z

gensim/api/__init__.py

+        print("{} has already been installed".format(file_name))
+
+
+def catalogue(print_list=True):


Change the default to False.

menshikh-iv · 2017-08-22T07:26:05Z

gensim/api/__init__.py

+    elif file_name in models:
+        print(data['gensim']['model'][file_name])
+    else:
+        print("Incorrect model/corpus name.")


Raise an exception that lists the correct names.
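
A sketch of what such an exception could look like, assuming the catalogue is available as two collections of names. `check_name` is a hypothetical helper, and the choice of `ValueError` is an assumption.

```python
def check_name(name, corpora, models):
    """Return `name` if it is known; otherwise raise ValueError naming the valid choices."""
    if name in corpora or name in models:
        return name
    available = sorted(set(corpora) | set(models))
    raise ValueError(
        "Incorrect model/corpus name %r. Available: %s" % (name, ", ".join(available))
    )
```

The caller gets the valid names inside the exception message itself, so nothing has to be printed to stdout beforehand.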

menshikh-iv · 2017-08-22T07:26:19Z

gensim/api/__init__.py

+    base_dir = os.path.join(user_dir, 'gensim-data')
+    folder_dir = os.path.join(base_dir, file_name)
+    if not os.path.exists(folder_dir):
+        print(


Raise exception

menshikh-iv

Some additional changes:

Use something from gensim.corpora for corpora
Test that everything loads correctly; this behaviour is unacceptable:

>>> q = api.load("glove_common_crawl_42B")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gensim/api/__init__.py", line 164, in load
    data = module.load_data()
  File "/home/ivan/gensim-data/glove_common_crawl_42B/__init__.py", line 19, in load_data
    model = KeyedVectors.load_word2vec_format(output_file_dir)
  File "gensim/models/keyedvectors.py", line 255, in load_word2vec_format
    raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 78864 (is this really the text format?)

Add more datasets
Add instructions to the original repo on how to add a new model

The interface should be very simple: only load and info functions, that's enough (this relates to the "merge" comments).

menshikh-iv · 2017-09-12T09:22:22Z

gensim/api/__init__.py

+            logger.info("%s installed", dataset)
+
+
+def catalogue(print_list=False):


I tested it: print_list is useless, please remove this argument (and the printing code).

Also, please merge the catalogue and info functions into a new info function (add an argument for a concrete model/dataset; by default it should be None).
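
A sketch of the merged function under stated assumptions: the catalogue is shown as an in-memory dict with top-level "corpus" and "model" keys, and the entries are invented. In the PR the data comes from a JSON file in the storage repository.

```python
# Hypothetical catalogue; the real one is fetched as JSON.
CATALOGUE = {
    "corpus": {"example_corpus": {"desc": "An example corpus."}},
    "model": {"example_model": {"desc": "An example model."}},
}


def info(name=None, catalogue=CATALOGUE):
    """Return the whole catalogue, or the entry for one model/corpus."""
    if name is None:
        return catalogue
    for section in ("corpus", "model"):
        if name in catalogue[section]:
            return catalogue[section][name]
    raise ValueError("Incorrect model/corpus name: %r" % name)
```

The function always returns data and never prints, so the command-line script can decide how to display it.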

menshikh-iv · 2017-09-12T09:27:11Z

gensim/api/__init__.py

+    else:
+        catalogue(print_list=True)
+        raise Exception(
+            "Incorrect model/corpus name. Choose the model/corpus from the list "


Put the list of available models/datasets into the exception message (without printing it to stdout first).

menshikh-iv · 2017-09-12T09:28:04Z

gensim/api/__init__.py

+    corpuses = data['gensim']['corpus']
+    models = data['gensim']['model']
+    if dataset in corpuses:
+        print(data['gensim']['corpus'][dataset]["desc"])


Need to add more detailed descriptions (in storage repo)

menshikh-iv · 2017-09-12T09:30:35Z

gensim/api/__init__.py

+        return data['gensim']['model'][dataset]["filename"]
+
+
+def load(dataset, return_path=False):


Please add docstrings to all functions in Google style (here and everywhere).

Also, please merge load and download into a new load function (if I don't have the requested model on my PC, I want it downloaded, not to see an exception).
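
Both requests can be sketched in one function. This is a simplified illustration: the `download` callable is a hypothetical hook, and the sketch only returns the local directory rather than loading a model into memory.

```python
import os


def load(name, base_dir, download):
    """Return the local directory for `name`, downloading it first if needed.

    Args:
        name (str): Name of the model/corpus.
        base_dir (str): Directory that holds one sub-directory per item.
        download (callable): Called as download(name, target_dir) when the
            item is not present locally. Hypothetical hook for this sketch.

    Returns:
        str: Path to the directory containing the item.
    """
    target_dir = os.path.join(base_dir, name)
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
        download(name, target_dir)
    return target_dir
```

A second call for the same name finds the directory and skips the download.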

menshikh-iv · 2017-09-12T09:32:25Z

gensim/api/__init__.py

+            "above.")
+
+
+def get_filename(dataset):


Is this function only for internal purposes?

menshikh-iv · 2017-09-12T09:37:59Z

gensim/api/__init__.py

+            f = f.readlines()
+        for line in f:
+            if installed_message in line:
+                print("{} has already been installed".format(dataset))


Replace all print calls with logger calls.

menshikh-iv · 2017-09-12T09:38:45Z

gensim/api/__init__.py

+    corpuses = data['gensim']['corpus']
+    models = data['gensim']['model']
+    if dataset not in corpuses and dataset not in models:
+        logger.error(


Raise an exception (not a logger.error + sys.exit)

'corpuses' => 'corpora'

menshikh-iv · 2017-09-12T09:39:21Z

gensim/api/__init__.py

+    if dataset not in corpuses and dataset not in models:
+        logger.error(
+            "Incorect Model/corpus name. Use catalogue(print_list=TRUE) or"
+            " python -m gensim -c to get a list of models/corpuses"


Not working now, please update

menshikh-iv · 2017-09-12T09:40:30Z

gensim/api/__init__.py

@@ -0,0 +1,174 @@
+from __future__ import print_function


Rename this file to downloader.py and move it to the library root (gensim/downloader.py); also merge the CLI file into it.

menshikh-iv · 2017-09-12T09:48:51Z

gensim/api/__init__.py

+    downloaded_message = "{f} downloaded".format(f=dataset)
+    if os.path.exists(data_folder_dir):
+        log_file_dir = os.path.join(base_dir, 'api.log')
+        with open(log_file_dir) as f:


Looks very suspicious: you write everything to this file (not only model statuses). Please do several things:

Use this file only for models

Add a checksum (in the file and in the original repo) and compare them; if something is wrong (file broken), log a warning and download the file again.

If the file doesn't exist, log a warning and create it

Plain text isn't the best format for this file; please use JSON, pickle, etc.
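
The checksum comparison could be sketched like this. MD5 is chosen because the later revision of the PR uses `hash_md5`; `md5_of_file` and `needs_download` are hypothetical helper names, and where the expected checksum is stored is left open.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    hash_md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat even for multi-gigabyte models.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def needs_download(path, expected_md5):
    """Return True if the file is missing or its checksum does not match."""
    return not os.path.isfile(path) or md5_of_file(path) != expected_md5
```

A caller would log a warning and download the file again whenever `needs_download` returns True.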

menshikh-iv · 2017-09-12T10:03:09Z

gensim/api/__init__.py

+            urllib.urlretrieve(data_url, data_dir)
+        logger.info("%s downloaded", dataset)
+    if not is_installed:
+            tar = tarfile.open(compressed_folder_dir)


Not all of your files are tar or gz; for example, you also have uncompressed fasttext, a big glove in zip, etc.
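
One way to cope with mixed formats is to detect the archive type instead of assuming `.tar.gz`. The `unpack` helper below is a hypothetical sketch using only the standard library; it does not reflect how the storage repository actually packages its files.

```python
import tarfile
import zipfile


def unpack(path, target_dir):
    """Extract tar(.gz) or zip archives into `target_dir`; leave other files alone.

    Returns "tar", "zip" or "plain" depending on what was detected.
    """
    # Note: extractall() should only be used on archives from a trusted source.
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tar:
            tar.extractall(target_dir)
        return "tar"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(target_dir)
        return "zip"
    return "plain"
```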

piskvorky · 2017-09-20T11:18:40Z

gensim/downloader.py

+    args = parser.parse_args()
+    if args.download is not None:
+        data_path = load(args.download[0], return_path=True)
+        logger.info("Data has been installed and data path is %s", data_path)


I don't see any logging setup here in __main__. Where will this logging go?

We don't want any printing inside a library, but in a user-invoked top-level script, printing is fine.

piskvorky · 2017-09-20T11:19:36Z

gensim/downloader.py

+    Args:
+        dataset(string): Name of the corpus/model.
+    """
+    url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/download/{f}/{f}.tar.gz".format(f=dataset)


Please add a FIXME comment so we don't merge such temporary constructs by accident.

menshikh-iv · 2017-09-26T06:56:38Z

gensim/downloader.py

+base_dir = os.path.join(user_dir, 'gensim-data')
+data_log_file_dir = os.path.join(base_dir, 'data.json')
+
+logging.basicConfig(


Please set it up in the __main__ section, because we need logging in the console, but not in programmatic use without explicit configuration from the user's side.

menshikh-iv · 2017-09-26T08:03:41Z

gensim/downloader.py

+    data = response.read().decode("utf-8")
+    data = json.loads(data)
+    if dataset is not None:
+        corpora = data['gensim']['corpus']


gensim is a useless key in your JSON; you only need the model and dataset keys at the top level.

menshikh-iv · 2017-09-26T08:05:37Z

gensim/downloader.py

+        corpora = data['gensim']['corpus']
+        models = data['gensim']['model']
+        if dataset in corpora:
+            logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"])


missing return

menshikh-iv · 2017-09-26T08:05:43Z

gensim/downloader.py

+        if dataset in corpora:
+            logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"])
+        elif dataset in models:
+            logger.info("%s \n", data['gensim']['model'][dataset]["desc"])


missing return

menshikh-iv · 2017-09-26T08:08:58Z

gensim/downloader.py

+    return hash_md5.hexdigest()
+
+
+def info(dataset=None):


This is not necessarily a dataset; it can be a model too (same thing everywhere).

menshikh-iv · 2017-09-26T08:13:45Z

gensim/downloader.py

+
+user_dir = os.path.expanduser('~')
+base_dir = os.path.join(user_dir, 'gensim-data')
+data_log_file_dir = os.path.join(base_dir, 'data.json')


It's a path to a file, not a dir.

menshikh-iv · 2017-09-26T08:14:39Z

gensim/downloader.py

+                .format(base_dir))
+
+
+def initialize_data_log_file():


Please pass data_log_file explicitly and open it here with the with statement.

menshikh-iv · 2017-09-26T08:15:07Z

gensim/downloader.py

+
+
+def get_data_status(dataset):
+    """Function for finding the status of the dataset.


dataset and model

menshikh-iv · 2017-09-26T08:15:47Z

gensim/downloader.py

+        dataset(string): Name of the corpus/model.
+        status(string): Status to be updates to i.e downloaded or installed.
+    """
+    jdata = json.loads(open(data_log_file_dir).read())


Please use with to open the file.
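
A minimal sketch of the suggested change. `read_status_file` is a hypothetical name, and the path is passed in explicitly rather than read from a module-level global.

```python
import json


def read_status_file(path):
    """Return the parsed JSON content of the status file at `path`."""
    # The with statement closes the file even if json.load() raises.
    with open(path) as f:
        return json.load(f)
```

Compared with `json.loads(open(path).read())`, the file handle is closed deterministically instead of whenever the garbage collector gets to it.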

menshikh-iv · 2017-09-26T08:38:12Z

gensim/downloader.py

+                logger.info("%s installed", dataset)
+            else:
+                logger.error("There was a problem in installing the file. Retrying.")
+                _download(dataset)


Recursion here

2017-09-26 13:22:03,650 :gensim.api :INFO : Creating /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:22:03,652 :gensim.api :INFO : Creation of /home/ivan/gensim-data/glove_common_crawl_42B successful.
2017-09-26 13:22:03,654 :gensim.api :INFO : Downloading glove_common_crawl_42B
2017-09-26 13:33:58,964 :gensim.api :INFO : glove_common_crawl_42B downloaded
2017-09-26 13:33:58,967 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:03,320 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:03,590 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:07,628 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:07,910 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:12,020 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:12,306 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:16,379 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
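
One way to avoid the endless retry shown in the log is a bounded loop instead of a recursive call. The `install_with_retries` helper and the limit of three attempts are assumptions made for this sketch.

```python
def install_with_retries(install, max_attempts=3):
    """Call install() until it returns True, at most `max_attempts` times.

    Returns the number of the successful attempt, or raises RuntimeError.
    """
    for attempt in range(1, max_attempts + 1):
        if install():
            return attempt
    raise RuntimeError("Installation failed after %d attempts" % max_attempts)
```

If the archive is permanently broken, the caller gets an exception after a fixed number of tries rather than a loop that never ends.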

menshikh-iv · 2017-10-09T08:32:40Z

gensim/downloader.py

+
+
+def get_data_list():
+    """Function getting the list of all datasets/models.


Please rewrite your docstrings in numpy style (here and everywhere), add the missing .rst to docs/src, and update apiref.rst.

menshikh-iv · 2017-10-09T09:11:56Z

gensim/downloader.py

+    return data_names
+
+
+def get_data_name(data_):


It would be a good idea to hide functions that users don't need (prefix them with an underscore).

menshikh-iv · 2017-10-09T09:55:25Z

gensim/downloader.py

+    models = data['model']
+    json_list = []
+    for corpus in corpora:
+        json_object = {"name": corpus, "status": "None"}


Why do we need to store anything with "None" status if we check file with data/model from data repo each time?

This is for initialising the JSON log file that stores the status, i.e. whether the model has been downloaded or installed on the user's computer.

menshikh-iv · 2017-10-09T10:03:00Z

gensim/downloader.py

+    compressed_folder_name = "{f}.tar.gz".format(f=data_)
+    compressed_folder_dir = os.path.join(base_dir, compressed_folder_name)
+    if get_data_status(data_) != "downloaded":
+        if not os.path.exists(data_folder_dir):


If it exists, remove all the broken files first, then create everything that is needed.
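
A sketch of the cleanup step, using a hypothetical `reset_dir` helper: whatever is left over from an earlier, possibly broken attempt is removed before the directory is recreated.

```python
import os
import shutil


def reset_dir(path):
    """Remove `path` if present (file or directory), then recreate it as an empty directory."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)
    os.makedirs(path)
    return path
```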

menshikh-iv · 2017-10-09T10:03:48Z

gensim/downloader.py

+            _download(data_)
+
+    if get_data_status(data_) != "installed":
+            tar = tarfile.open(compressed_folder_dir)


Indentation is too deep; use 4 spaces instead of 8.

piskvorky · 2017-10-09T11:38:48Z

gensim/downloader.py

+
+
+def load(data_, return_path=False):
+    """Loads the corpus/model to the memory, if return_path is False.


Code style: Python docstrings use imperative mode ("Load X", not "Loads X"). Here and elsewhere.
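
A sketch of a docstring written in the imperative mood and laid out in the numpy style requested earlier in the review. The function body and the catalogue shape (top-level "corpus" and "model" keys) are assumptions made for illustration.

```python
def get_data_list(catalogue):
    """Get the names of all available corpora and models.

    Parameters
    ----------
    catalogue : dict
        Mapping with top-level "corpus" and "model" keys (an assumed shape).

    Returns
    -------
    list of str
        Sorted names of all corpora and models.

    """
    return sorted(list(catalogue["corpus"]) + list(catalogue["model"]))
```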

piskvorky requested changes Jul 28, 2017

piskvorky reviewed Jul 28, 2017

chaitaliSaini added 4 commits July 30, 2017 04:59

added download and catalogue functions

ec8c016

added link and info

636bfff

modeified link and info functions

fffe203

Updated download function

f567dee

piskvorky reviewed Aug 4, 2017

Added logging

61ba3d6

piskvorky requested changes Aug 7, 2017

chaitaliSaini added 2 commits August 11, 2017 11:19

Added load function

d8257a3

Removed unused imports

5571469

menshikh-iv suggested changes Aug 22, 2017

chaitaliSaini added 4 commits August 24, 2017 00:17

added check for installed models

cabf173

updated download function

5d509fc

Improved help for terminal

551f54e

load returns model path

ff5509f

menshikh-iv suggested changes Sep 12, 2017

menshikh-iv reviewed Sep 12, 2017

menshikh-iv added the incubator project label Sep 14, 2017

added jupyter notebook and merged code

e654070

piskvorky reviewed Sep 20, 2017

menshikh-iv suggested changes Sep 26, 2017

menshikh-iv reviewed Sep 26, 2017

menshikh-iv mentioned this pull request Oct 2, 2017

Link to common datasets #746

Closed

chaitaliSaini added 3 commits October 3, 2017 18:27

alternate names for load

b0d1110

corrected formatting

498b32b

added checksum after download

03649b0

menshikh-iv suggested changes Oct 9, 2017

piskvorky requested changes Oct 9, 2017

chaitaliSaini closed this Oct 17, 2017

chaitaliSaini mentioned this pull request Oct 17, 2017

[WIP] Data/model storage. Fix 1453 #1632

Closed
