[WIP] Data/Model storage #1492

Closed
wants to merge 15 commits into from

chaitaliSaini
Contributor

API for model/data storage.

@piskvorky
Owner

Looks nice! Is this your original code?

If not, you have to attribute it properly.

@@ -1,50 +1,57 @@
"""Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH,
2015 Matthew Honnibal
https://github.com/explosion/spaCy/blob/master/LICENSE
Owner

List the explicit license, not a link (a link can change in time, go dead etc).

Also, if you're making changes, be sure to include yourself in the authors and copyright (say "adapted from ..., original author ..., license ...").

@piskvorky
Owner

piskvorky commented Jul 28, 2017

Before merging, we'll have to clean up the code a little:

  • drop the dependencies on ftfy, plac etc (no need for those)
  • simplify the module structure (drop compat, about, message wrapping etc), make the logic simpler and more compact
  • code style consistent with gensim (no vertical indent etc)

@piskvorky
Owner

piskvorky commented Jul 28, 2017

In fact, drawing inspiration for the API but writing it from scratch, in a few concise functions, is preferable to this mass copy&paste from spaCy. It looks straightforward enough, and will avoid much of the complexity here as well as any copyright issues.

dictionary. If require is set to True, raise an error if no meta.json
found.
"""
location = package_path / package / 'meta.json'
Owner

@piskvorky piskvorky Jul 28, 2017

What does this / operator do? Where does it come from?

Contributor Author

Owner

@piskvorky piskvorky Jul 29, 2017

Ah, interesting, I didn't know about that. Does that work in Python 2 too?

Contributor Author

Yes
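For readers unfamiliar with the / operator questioned above: it most likely comes from pathlib, whose path objects overload / to join path components. The following is a minimal sketch, assuming package_path is a pathlib path; the directory and package names are made up for illustration. Note that pathlib is part of the standard library only from Python 3.4 onwards; on Python 2 it needs a backport package such as pathlib2.

```python
from pathlib import PurePosixPath

# Hypothetical values, for illustration only (not taken from the PR).
package_path = PurePosixPath("/opt/data")
package = "en_model"

# Path objects overload "/" (via __truediv__) to join path components,
# so no os.path.join call is needed.
location = package_path / package / "meta.json"
print(location)
```

PurePosixPath is used here only so the result is the same on every platform; code that touches the file system would normally use pathlib.Path.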

@chaitaliSaini
Contributor Author

OK, I will start writing my own code. Should I first draft a template and share it, so that there are no more and no fewer functions than required?

@chaitaliSaini
Contributor Author

I have two questions:

  1. Are we going to have different versions of models or corpora compatible with different versions of gensim?
  2. Are we going to provide shortcuts for downloading models/corpora, or will the user have to write the full name? We will provide functionality to list all the models anyway.

@piskvorky
Owner

piskvorky commented Jul 29, 2017

Good questions!

I'd say:

  1. No (except in the repo history, which I think is still downloadable? We could add a little how-to to our FAQ or something, but I don't think we need to maintain a full-blown automated dependency-resolution packaging system; that sounds like a headache).
  2. No (just one way to do it: the fewer moving pieces, the better).

CC @menshikh-iv @gojomo

[sys.executable, '-m', 'pip', 'install', '--no-cache-dir', url],
env=os.environ.copy())
url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
url = url+"download/"+file+"/"+file+".tar.gz"
Owner

file is a built-in name in Python 2 (not a reserved keyword, but still best not shadowed).

Also, shouldn't it be url-encoded if used like this? Or is the expectation that the argument is already url-encoded?

Contributor Author

Oh, I will rename the file variable.
No, I have not encoded the URL, since I am currently not using any special characters or spaces in corpus/model names; if the plan is to allow those in model/corpus names, I'll add it.
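The two points raised here (a variable name that does not shadow a built-in, and URL-encoding the name) can be sketched together. This is only an illustration under stated assumptions: the helper name build_release_url and the RELEASE_URL constant are hypothetical, the URL pattern is the temporary one from the snippet under review, and the import is the Python 3 location of quote (on Python 2 it lives in urllib).

```python
from urllib.parse import quote

# URL pattern copied from the snippet under review; the constant name is made up.
RELEASE_URL = (
    "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
    "download/{name}/{name}.tar.gz"
)


def build_release_url(name):
    """Return the download URL for `name`, percent-encoding unsafe characters."""
    # safe="" also encodes "/", so a name can never add extra path segments.
    encoded = quote(name, safe="")
    return RELEASE_URL.format(name=encoded)


print(build_release_url("glove common crawl"))
```

Encoding unconditionally costs nothing for names that contain only safe characters, so the function does not need to know whether such names are planned.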

base_dir = os.path.join(user_dir, 'gensim-data')
extracted_folder_dir = os.path.join(base_dir, file)
if not os.path.exists(base_dir):
    os.makedirs(base_dir)
Owner

@piskvorky piskvorky Aug 4, 2017

Needs a clear log message (INFO or even WARNING).

In fact, what is the strategy for communicating information to the user in this PR? Do we use logging (incl. timestamps, log level etc), or just print stuff to stdout?

Contributor Author

Currently, I am just printing to stdout.

Owner

Is that what we want? What are the pros/cons vs logging?

Contributor Author

I just read about it, and logging is far better than using print, so I'll add logging.

Owner

@piskvorky piskvorky left a comment

Code style comments.

@@ -14,23 +16,52 @@
except ImportError:
from urllib2 import urlopen

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', filename="api.log", level=logging.INFO)
Owner

Logging configuration belongs only to __main__.

if not os.path.exists(base_dir):
    logging.info("Creating {}".format(base_dir))
Owner

Module should create a logger and use that for logging (not these global logging wrappers). See other gensim modules for an example.
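The two logging comments above (configuration only in __main__, and a module-level logger instead of the global logging functions) might look roughly like this. It is a sketch, not the code that was eventually written: the helper announce_creation is hypothetical, and the logger name gensim.api is spelled out only so the example runs standalone; inside a real module it would normally be logging.getLogger(__name__).

```python
import logging

# One logger per module. No handlers, files or levels are set here, so the
# importing application stays in control of where log records go.
logger = logging.getLogger("gensim.api")


def announce_creation(base_dir):
    """Hypothetical helper: log that base_dir is about to be created."""
    logger.info("Creating %s", base_dir)


if __name__ == "__main__":
    # Configuration belongs to the entry point only.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO
    )
    announce_creation("/tmp/gensim-data")
```

When the module is merely imported, nothing is configured and the log call is silently filtered by whatever the application has (or has not) set up.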

tar.close()
logging.info("{} installed".format(file_name))
else:
logging.error("Not able to create {d}. Make sure you have the "
Owner

No vertical indent (please use hanging indent; here and everywhere else).

import tarfile
import shutil
from ..utils import SaveLoad
Owner

Better to use absolute imports (e.g. from gensim.utils import SaveLoad), for clarity and readability.

parser = argparse.ArgumentParser(description="Gensim console API")
group = parser.add_mutually_exclusive_group()
group.add_argument(
"-d", "--download", nargs=1,
Contributor

No need for line breaks (the string is not too long), here and below.


def download(file_name):
    url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/"
    url = url + "download/" + file_name + "/" + file_name + ".tar.gz"
Contributor

Please use the format method (instead of string concatenation with +).

"-c", "--catalogue", help="To get the list of all models/corpus stored"
" : python -m gensim -c", action="store_true")
args = parser.parse_args()
if sys.argv[1] == "-d" or sys.argv[1] == "--download":
Contributor

Looks suspicious: you already use argparse above, so there is no need to look into sys.argv. Please use only argparse.
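One way to act on this is to branch only on the parsed namespace. The sketch below reuses the -d/--download and -c/--catalogue flags from the snippet under review; the build_parser and dispatch helpers, the help strings, and the returned tuples are assumptions made for illustration. A side benefit is that running with no arguments no longer raises IndexError, which indexing sys.argv[1] would.

```python
import argparse


def build_parser():
    """Build the parser with the two mutually exclusive flags from the PR."""
    parser = argparse.ArgumentParser(description="Gensim console API")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-d", "--download", nargs=1, help="Name of the model/corpus to download")
    group.add_argument("-c", "--catalogue", action="store_true", help="List all stored models/corpora")
    return parser


def dispatch(argv):
    """Decide what to do from the parsed arguments alone, never from sys.argv."""
    args = build_parser().parse_args(argv)
    if args.download is not None:
        return ("download", args.download[0])
    if args.catalogue:
        return ("catalogue", None)
    return ("help", None)


print(dispatch(["-d", "glove_common_crawl_42B"]))
```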

user_dir = os.path.expanduser('~')
base_dir = os.path.join(user_dir, 'gensim-data')
extracted_folder_dir = os.path.join(base_dir, file_name)
if not os.path.exists(base_dir):
Contributor

If I have a file named ~/gensim-data, this will not work correctly.
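The problem is that os.path.exists is also true for a regular file, so the directory is never created and later paths built under it cannot work. A minimal sketch of a stricter check follows; the helper name ensure_base_dir and the choice of OSError are assumptions, not something the thread prescribes.

```python
import os
import tempfile


def ensure_base_dir(base_dir):
    """Create base_dir if needed; fail loudly if a non-directory occupies the path."""
    if os.path.isdir(base_dir):
        return base_dir
    if os.path.exists(base_dir):
        # Something that is not a directory (e.g. a plain file) is in the way.
        raise OSError("{} exists but is not a directory".format(base_dir))
    os.makedirs(base_dir)
    return base_dir


# Demonstration inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    blocker = os.path.join(tmp, "gensim-data")
    open(blocker, "w").close()  # a regular file where the directory should be
    try:
        ensure_base_dir(blocker)
    except OSError as err:
        print(err)
```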

base_dir = os.path.join(user_dir, 'gensim-data')
extracted_folder_dir = os.path.join(base_dir, file_name)
if not os.path.exists(base_dir):
    logger.info("Creating {}".format(base_dir))
Contributor

Please use the correct logging formatting, logger.info("Creating %s", base_dir) (here and everywhere). Look at this example.
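The practical difference is that arguments passed separately are only formatted if the record is actually emitted, whereas str.format runs before the logger is even consulted. The small experiment below illustrates this; the Expensive class and the logger name are made up for the demonstration.

```python
import logging


class Expensive:
    """Counts how often it is rendered to a string."""

    calls = 0

    def __str__(self):
        Expensive.calls += 1
        return "/home/user/gensim-data"


logger = logging.getLogger("demo.lazy_formatting")
logger.setLevel(logging.WARNING)  # INFO records are discarded

# Lazy: the argument is never converted to a string, because INFO is disabled.
logger.info("Creating %s", Expensive())

# Eager: the message is built first, then thrown away.
logger.info("Creating {}".format(Expensive()))

print(Expensive.calls)
```

Only the eager call increments the counter, which is the cost the reviewer's preferred form avoids.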


def catalogue(print_list=True):
    url = "https://raw.githubusercontent.com/chaitaliSaini/Corpus_and_models/"
    url = url + "master/list.json"
Contributor

You should store the full URL as a constant (without concatenation).

print("Models available : ")
for model in models:
print(model)
else:
Contributor

No need for the else section; always return the data.

print("{} has already been installed".format(file_name))


def catalogue(print_list=True):
Contributor

Change the default to False.

elif file_name in models:
print(data['gensim']['model'][file_name])
else:
print("Incorrect model/corpus name.")
Contributor

Raise an exception that lists the correct names.

base_dir = os.path.join(user_dir, 'gensim-data')
folder_dir = os.path.join(base_dir, file_name)
if not os.path.exists(folder_dir):
print(
Contributor

Raise an exception.

Contributor

@menshikh-iv menshikh-iv left a comment

Some additional changes:

  • Use something from gensim.corpora for corpora
  • Test that everything loads correctly; this behaviour is unacceptable:
>>> q = api.load("glove_common_crawl_42B")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "gensim/api/__init__.py", line 164, in load
    data = module.load_data()
  File "/home/ivan/gensim-data/glove_common_crawl_42B/__init__.py", line 19, in load_data
    model = KeyedVectors.load_word2vec_format(output_file_dir)
  File "gensim/models/keyedvectors.py", line 255, in load_word2vec_format
    raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 78864 (is this really the text format?)

  • Add more datasets
  • Add instructions to the original repo on how to add a new model

The interface should be very simple: only load and info functions are enough (this relates to the "merge" comments).

logger.info("%s installed", dataset)


def catalogue(print_list=False):
Contributor

I tested it: print_list is useless. Please remove this argument (and the printing code).

Contributor

Also, please merge the catalogue and info functions into a new info function (add an argument for a concrete model/dataset; by default, it should be None).
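A merged function along these lines could return the whole listing when called without arguments and a single entry otherwise. The sketch below is only one possible shape: the CATALOGUE dictionary stands in for the parsed list.json and its entries are invented, and the optional catalogue parameter exists purely to keep the example free of network access.

```python
# Hypothetical catalogue, standing in for the parsed list.json.
CATALOGUE = {
    "corpus": {"demo_corpus": {"desc": "A made-up corpus entry."}},
    "model": {"demo_model": {"desc": "A made-up model entry."}},
}


def info(name=None, catalogue=CATALOGUE):
    """Return the whole catalogue, or the entry for one model/corpus."""
    if name is None:
        return catalogue
    for section in ("corpus", "model"):
        if name in catalogue[section]:
            return catalogue[section][name]
    raise ValueError("Incorrect model/corpus name: {!r}".format(name))


print(info("demo_model")["desc"])
```

Because the function always returns data and never prints, callers (including a command-line wrapper) decide how to display it.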

else:
catalogue(print_list=True)
raise Exception(
"Incorrect model/corpus name. Choose the model/corpus from the list "
Contributor

Put the list of available models/datasets into the exception message (without printing to stdout first).
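A minimal sketch of such an exception, assuming the available names are the keys of two dictionaries as in the snippet under review; the helper name check_name, the choice of ValueError, and the example names are hypothetical.

```python
def check_name(name, corpora, models):
    """Return `name` if it is known; otherwise raise ValueError listing valid names."""
    available = sorted(corpora) + sorted(models)
    if name not in available:
        raise ValueError(
            "Incorrect model/corpus name {!r}. Available: {}".format(
                name, ", ".join(available)
            )
        )
    return name


try:
    check_name("nope", {"demo_corpus": {}}, {"demo_model": {}})
except ValueError as err:
    print(err)
```

The caller gets everything needed to fix the mistake from the exception itself, with no side effects on stdout.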

corpuses = data['gensim']['corpus']
models = data['gensim']['model']
if dataset in corpuses:
print(data['gensim']['corpus'][dataset]["desc"])
Contributor

@menshikh-iv menshikh-iv Sep 12, 2017

We need more detailed descriptions (in the storage repo).

return data['gensim']['model'][dataset]["filename"]


def load(dataset, return_path=False):
Contributor

Please add docstrings to all functions in Google-style format (here and everywhere).

Contributor

Also, please merge load and download into a new load function (if I don't have the needed model on my PC, I want it to be downloaded, not to see an exception).

"above.")


def get_filename(dataset):
Contributor

Is this function only for internal purposes?

f = f.readlines()
for line in f:
if installed_message in line:
print("{} has already been installed".format(dataset))
Contributor

Replace all print calls with logger calls.

corpuses = data['gensim']['corpus']
models = data['gensim']['model']
if dataset not in corpuses and dataset not in models:
logger.error(
Contributor

Raise an exception (not a logger.error + sys.exit)

Contributor

'corpuses' => 'corpora'

if dataset not in corpuses and dataset not in models:
logger.error(
"Incorect Model/corpus name. Use catalogue(print_list=TRUE) or"
" python -m gensim -c to get a list of models/corpuses"
Contributor

This does not work now; please update.

@@ -0,0 +1,174 @@
from __future__ import print_function
Contributor

Rename this file to downloader.py and move it to the library root (gensim/downloader.py); also merge the CLI file into this one.

downloaded_message = "{f} downloaded".format(f=dataset)
if os.path.exists(data_folder_dir):
log_file_dir = os.path.join(base_dir, 'api.log')
with open(log_file_dir) as f:
Contributor

This looks very suspicious: you write everything to this file (not only models). Please do several things:

  • Use this file only for models
  • Add a checksum (in the file and in the original repo) and compare them; if something is wrong (the file is broken), log a warning and download the file again
  • If the file doesn't exist, log a warning and create it
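The checksum comparison requested in the list above could be sketched as follows. MD5 is assumed because a later revision of the PR calls hash_md5.hexdigest(); the helper names file_md5 and is_intact, and the idea that the expected digest comes from the data repository, are assumptions for illustration.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    hash_md5 = hashlib.md5()
    with open(path, "rb") as fin:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fin.read(chunk_size), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def is_intact(path, expected_md5):
    """True if the file exists and its digest matches the published one."""
    return os.path.isfile(path) and file_md5(path) == expected_md5
```

A caller would log a warning and download again whenever is_intact returns False, which covers both a missing and a corrupted file. Reading in chunks keeps memory use flat even for multi-gigabyte archives.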

Contributor

Plaintext isn't the best format for this file; please use json, pickle, etc.

urllib.urlretrieve(data_url, data_dir)
logger.info("%s downloaded", dataset)
if not is_installed:
tar = tarfile.open(compressed_folder_dir)
Contributor

Not all of your files are tar or gz; for example, you also have a non-zipped fasttext, a big glove in zip, etc.
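One way to cope with mixed formats is to detect the archive type instead of assuming tar.gz. The sketch below uses the standard-library detectors tarfile.is_tarfile and zipfile.is_zipfile; the helper name extract_any and its return values are hypothetical, and it assumes the archives come from a trusted source, since extractall does not protect against malicious member paths on older Python versions.

```python
import tarfile
import zipfile


def extract_any(archive_path, target_dir):
    """Extract tar(.gz) or zip archives; report other files as not extracted."""
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as tar:
            tar.extractall(target_dir)  # trusted archives only
        return "tar"
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(target_dir)  # trusted archives only
        return "zip"
    # Not an archive (e.g. an uncompressed model file): leave it as it is.
    return "plain"
```

The caller can then decide what to do with a "plain" file, for example moving it into the data folder unchanged.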

@menshikh-iv menshikh-iv added the "incubator project" label (PR is RaRe incubator project) Sep 14, 2017

args = parser.parse_args()
if args.download is not None:
data_path = load(args.download[0], return_path=True)
logger.info("Data has been installed and data path is %s", data_path)
Owner

@piskvorky piskvorky Sep 20, 2017

I don't see any logging setup here in __main__. Where will this logging go?

We don't want any printing inside a library, but in a user-invoked top-level script, printing is fine.

Args:
dataset(string): Name of the corpus/model.
"""
url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/download/{f}/{f}.tar.gz".format(f=dataset)
Owner

Please add a FIXME comment so we don't merge such temporary constructs by accident.

base_dir = os.path.join(user_dir, 'gensim-data')
data_log_file_dir = os.path.join(base_dir, 'data.json')

logging.basicConfig(
Contributor

Please set it up in the __main__ section (we need logging in the console, but not in programmatic use without explicit configuration from the user's side).

data = response.read().decode("utf-8")
data = json.loads(data)
if dataset is not None:
corpora = data['gensim']['corpus']
Contributor

gensim is a useless key in your JSON; you only need the model and dataset keys at the top level.

corpora = data['gensim']['corpus']
models = data['gensim']['model']
if dataset in corpora:
logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"])
Contributor

missing return

if dataset in corpora:
logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"])
elif dataset in models:
logger.info("%s \n", data['gensim']['model'][dataset]["desc"])
Contributor

missing return

return hash_md5.hexdigest()


def info(dataset=None):
Contributor

This is not necessarily a dataset; it can be a model too (same thing everywhere).


user_dir = os.path.expanduser('~')
base_dir = os.path.join(user_dir, 'gensim-data')
data_log_file_dir = os.path.join(base_dir, 'data.json')
Contributor

It's a path to a file, not a directory.

.format(base_dir))


def initialize_data_log_file():
Contributor

Please pass data_log_file explicitly and open it here with a with statement.



def get_data_status(dataset):
"""Function for finding the status of the dataset.
Contributor

dataset and model

dataset(string): Name of the corpus/model.
status(string): Status to be updates to i.e downloaded or installed.
"""
jdata = json.loads(open(data_log_file_dir).read())
Contributor

Please use a with statement to open the file.
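With json.loads(open(path).read()), closing the file is left to the interpreter; a with block closes it deterministically, even if parsing fails. A minimal sketch follows, with hypothetical helper names and the path passed in explicitly rather than read from a module-level variable.

```python
import json


def read_status_file(path):
    """Read and parse a JSON file, closing it even if parsing fails."""
    with open(path) as fin:
        return json.load(fin)


def write_status_file(path, data):
    """Serialise `data` as JSON to `path`."""
    with open(path, "w") as fout:
        json.dump(data, fout)
```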

logger.info("%s installed", dataset)
else:
logger.error("There was a problem in installing the file. Retrying.")
_download(dataset)
Contributor

Unbounded recursion here; it keeps retrying:

2017-09-26 13:22:03,650 :gensim.api :INFO : Creating /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:22:03,652 :gensim.api :INFO : Creation of /home/ivan/gensim-data/glove_common_crawl_42B successful.
2017-09-26 13:22:03,654 :gensim.api :INFO : Downloading glove_common_crawl_42B
2017-09-26 13:33:58,964 :gensim.api :INFO : glove_common_crawl_42B downloaded
2017-09-26 13:33:58,967 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:03,320 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:03,590 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:07,628 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:07,910 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:12,020 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:12,306 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:16,379 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
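The log above shows the retry never terminating when extraction keeps failing. One common alternative is a loop with a fixed number of attempts that ends in an exception. The sketch below is an assumption about how that could look, not the fix that was adopted: install_with_retries is a hypothetical name, and download and install are passed in as callables so the example has no network or file-system dependencies.

```python
def install_with_retries(download, install, max_attempts=3):
    """Run download() then install() until install() succeeds or attempts run out.

    Returns the number of the attempt that succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        download()
        if install():
            return attempt
    raise RuntimeError("Installation failed after {} attempts".format(max_attempts))


# Example: installation succeeds on the second try.
outcomes = iter([False, True])
print(install_with_retries(lambda: None, lambda: next(outcomes)))
```

A persistent failure, such as an archive in an unexpected format, then surfaces as a single clear error instead of an endless loop.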



def get_data_list():
"""Function getting the list of all datasets/models.
Contributor

Please rewrite your docstrings in numpy style (here and everywhere), add the missing .rst file to docs/src, and update apiref.rst.
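For reference, a numpy-style docstring has underlined section headers such as Parameters, Returns and Raises, and starts with a summary line in the imperative mood. The function below only illustrates the layout: its signature is borrowed from the load function in this PR, but the body is a placeholder written for this example, not the PR's implementation.

```python
import os


def load(name, return_path=False):
    """Load a corpus/model, or return the path to its local folder.

    Parameters
    ----------
    name : str
        Name of the corpus/model.
    return_path : bool, optional
        If True, return the folder path instead of loading the data.

    Returns
    -------
    str
        Path to the data folder (only this branch is sketched here).

    Raises
    ------
    NotImplementedError
        If `return_path` is False; loading is out of scope for this sketch.

    """
    folder = os.path.join(os.path.expanduser("~"), "gensim-data", name)
    if return_path:
        return folder
    raise NotImplementedError("loading is not part of this docstring sketch")
```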

return data_names


def get_data_name(data_):
Contributor

It would be a good idea to hide functions that the user does not need (with a leading underscore).

models = data['model']
json_list = []
for corpus in corpora:
json_object = {"name": corpus, "status": "None"}
Contributor

Why do we need to store anything with a "None" status if we check the file with data/models from the data repo each time?

Contributor Author

@chaitaliSaini chaitaliSaini Oct 9, 2017

This is for initialising the JSON log file that stores the status, i.e. whether the model has been downloaded or installed on the user's computer.

compressed_folder_name = "{f}.tar.gz".format(f=data_)
compressed_folder_dir = os.path.join(base_dir, compressed_folder_name)
if get_data_status(data_) != "downloaded":
if not os.path.exists(data_folder_dir):
Contributor

If it exists, we need to remove all broken files and then create everything that is needed.
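A minimal sketch of that clean-up step, assuming the whole data folder can simply be discarded and recreated before a fresh download; the helper name reset_data_dir is hypothetical.

```python
import os
import shutil


def reset_data_dir(path):
    """Remove a possibly half-written folder and recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)  # drops leftovers from a broken download
    os.makedirs(path)
    return path
```

Starting from an empty folder means a later status check cannot be confused by partial files from an earlier failed attempt.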

_download(data_)

if get_data_status(data_) != "installed":
tar = tarfile.open(compressed_folder_dir)
Contributor

Very long indentation; use 4 spaces instead of 8.



def load(data_, return_path=False):
"""Loads the corpus/model to the memory, if return_path is False.
Owner

Code style: Python docstrings use imperative mode ("Load X", not "Loads X"). Here and elsewhere.

Labels
incubator project PR is RaRe incubator project

4 participants