-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Data/Model storage #1492
[WIP] Data/Model storage #1492
Conversation
Looks nice! Is this your original code? If not, you have to attribute it properly. |
gensim/__main__.py
Outdated
@@ -1,50 +1,57 @@ | |||
"""Copyright (C) 2016 ExplosionAI UG (haftungsbeschränkt), 2016 spaCy GmbH, | |||
2015 Matthew Honnibal | |||
https://github.com/explosion/spaCy/blob/master/LICENSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List the explicit license, not a link (a link can change in time, go dead etc).
Also, if you're making changes, be sure to include yourself in the authors and copyright (say "adapted from ..., original author ..., license ...").
Before merging, we'll have to clean up the code a little:
|
In fact, drawing inspiration for the API but writing it from scratch, in a few concise functions, is preferable to this mass copy&paste from spaCy. It looks straightforward enough, and will avoid much of the complexity here as well as any copyright issues. |
gensim/console_api/link.py
Outdated
dictionary. If require is set to True, raise an error if no meta.json | ||
found. | ||
""" | ||
location = package_path / package / 'meta.json' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this /
operator do? Where does it come from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ is similar to os.path.join()
https://docs.python.org/3/library/pathlib.html#operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting, I didn't know about that. Does that work in Python 2 too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
Ok, i will start writing my code. Should I first make a template and share it, so that there is no extra or less functions than required? |
I have two questions
|
Good questions! I'd say 1) no (except in the repo history, which I think is still downloadable? we could add a little how-to to our FAQ or something, but I don't think we need to maintain a full-blown automated dependency resolution packaging system, sounds like a headache) 2) no (just one way to do it -- the fewer moving pieces, the better). |
gensim/api/__init__.py
Outdated
[sys.executable, '-m', 'pip', 'install', '--no-cache-dir', url], | ||
env=os.environ.copy()) | ||
url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/" | ||
url = url+"download/"+file+"/"+file+".tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file
is a reserved keyword in Python.
Also, shouldn't it be url-encoded if used like this? Or is the expectation that the argument is already url-encoded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh,I will change the variable name from file.
No, i have not encoded the url, as currently I am not using any special characters or spaces in corpus/model names, but if the plan is to have those in model/corpus names, i'll add it.
gensim/api/__init__.py
Outdated
base_dir = os.path.join(user_dir, 'gensim-data') | ||
extracted_folder_dir = os.path.join(base_dir, file) | ||
if not os.path.exists(base_dir): | ||
os.makedirs(base_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a clear log message (INFO or even WARNING).
In fact, what is the strategy for communicating information to the user in this PR? Do we use logging (incl. timestamps, log level etc), or just print stuff to stdout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, i am just printing out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what we want? What are the pros/cons vs logging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just read about it and logging is way better than using print. So i'll add logging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style comments.
gensim/api/__init__.py
Outdated
@@ -14,23 +16,52 @@ | |||
except ImportError: | |||
from urllib2 import urlopen | |||
|
|||
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', filename="api.log", level=logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging configuration belongs only to __main__
.
gensim/api/__init__.py
Outdated
if not os.path.exists(base_dir): | ||
logging.info("Creating {}".format(base_dir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Module should create a logger
and use that for logging (not these global logging
wrappers). See other gensim modules for an example.
gensim/api/__init__.py
Outdated
tar.close() | ||
logging.info("{} installed".format(file_name)) | ||
else: | ||
logging.error("Not able to create {d}. Make sure you have the " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No vertical indent (please use hanging indent; here and everywhere else).
gensim/api/__init__.py
Outdated
import tarfile | ||
import shutil | ||
from ..utils import SaveLoad |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use absolute paths, for clarity and readability.
gensim/__main__.py
Outdated
parser = argparse.ArgumentParser(description="Gensim console API") | ||
group = parser.add_mutually_exclusive_group() | ||
group.add_argument( | ||
"-d", "--download", nargs=1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No needed line breaks (the string is not too long), here and below
gensim/api/__init__.py
Outdated
|
||
def download(file_name): | ||
url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/" | ||
url = url + "download/" + file_name + "/" + file_name + ".tar.gz" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use format
method (instead of string concatenation through +
)
gensim/__main__.py
Outdated
"-c", "--catalogue", help="To get the list of all models/corpus stored" | ||
" : python -m gensim -c", action="store_true") | ||
args = parser.parse_args() | ||
if sys.argv[1] == "-d" or sys.argv[1] == "--download": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks suspicious: you already use argparse
before it -> you no need to look into sys.argv
. Please use only argparse.
gensim/api/__init__.py
Outdated
user_dir = os.path.expanduser('~') | ||
base_dir = os.path.join(user_dir, 'gensim-data') | ||
extracted_folder_dir = os.path.join(base_dir, file_name) | ||
if not os.path.exists(base_dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I have file ~/gensim-data
, this will not work correctly.
gensim/api/__init__.py
Outdated
base_dir = os.path.join(user_dir, 'gensim-data') | ||
extracted_folder_dir = os.path.join(base_dir, file_name) | ||
if not os.path.exists(base_dir): | ||
logger.info("Creating {}".format(base_dir)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use logging correct formatting logger.info("Creating %s", base_dir)
(here and anywhere). Look at this example.
gensim/api/__init__.py
Outdated
|
||
def catalogue(print_list=True): | ||
url = "https://raw.githubusercontent.com/chaitaliSaini/Corpus_and_models/" | ||
url = url + "master/list.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should store full URL as constant (without concatenations)
gensim/api/__init__.py
Outdated
print("Models available : ") | ||
for model in models: | ||
print(model) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need else section, return data always.
gensim/api/__init__.py
Outdated
print("{} has already been installed".format(file_name)) | ||
|
||
|
||
def catalogue(print_list=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change default to False
gensim/api/__init__.py
Outdated
elif file_name in models: | ||
print(data['gensim']['model'][file_name]) | ||
else: | ||
print("Incorrect model/corpus name.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise exception with lists (what's correct)
gensim/api/__init__.py
Outdated
base_dir = os.path.join(user_dir, 'gensim-data') | ||
folder_dir = os.path.join(base_dir, file_name) | ||
if not os.path.exists(folder_dir): | ||
print( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some addition changes:
- Use something from
gensim.corpora
for corpuses - Test that all loaded correctly, this behaviour unacceptable
>>> q = api.load("glove_common_crawl_42B")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "gensim/api/__init__.py", line 164, in load
data = module.load_data()
File "/home/ivan/gensim-data/glove_common_crawl_42B/__init__.py", line 19, in load_data
model = KeyedVectors.load_word2vec_format(output_file_dir)
File "gensim/models/keyedvectors.py", line 255, in load_word2vec_format
raise ValueError("invalid vector on line %s (is this really the text format?)" % (line_no))
ValueError: invalid vector on line 78864 (is this really the text format?)
- Add more datasets
- Add instruction to orig repo "how to add new model"
The interface should be very simple, only load
and info
functions, that's enough (it's about "merge comments")
gensim/api/__init__.py
Outdated
logger.info("%s installed", dataset) | ||
|
||
|
||
def catalogue(print_list=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it, print_list
is useless, please remove this arg (and printing code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also please merge catalogue
and info
function to new info
function (add arguent for concrete model/dataset, by default, it should be None)
gensim/api/__init__.py
Outdated
else: | ||
catalogue(print_list=True) | ||
raise Exception( | ||
"Incorrect model/corpus name. Choose the model/corpus from the list " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substitute list of available models/datasets to exception message (without printing to stdout before)
gensim/api/__init__.py
Outdated
corpuses = data['gensim']['corpus'] | ||
models = data['gensim']['model'] | ||
if dataset in corpuses: | ||
print(data['gensim']['corpus'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add more detailed descriptions (in storage repo)
gensim/api/__init__.py
Outdated
return data['gensim']['model'][dataset]["filename"] | ||
|
||
|
||
def load(dataset, return_path=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add doc-strings to all functions in google-style format (here and anywhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, please merge load
and download
to new load
function (If I have no needed model on my PC, I want to download it, not see an exception).
gensim/api/__init__.py
Outdated
"above.") | ||
|
||
|
||
def get_filename(dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function only for inner proposes?
gensim/api/__init__.py
Outdated
f = f.readlines() | ||
for line in f: | ||
if installed_message in line: | ||
print("{} has already been installed".format(dataset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace all print
to logger
gensim/api/__init__.py
Outdated
corpuses = data['gensim']['corpus'] | ||
models = data['gensim']['model'] | ||
if dataset not in corpuses and dataset not in models: | ||
logger.error( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise an exception (not a logger.error + sys.exit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'corpuses' => 'corpora'
gensim/api/__init__.py
Outdated
if dataset not in corpuses and dataset not in models: | ||
logger.error( | ||
"Incorect Model/corpus name. Use catalogue(print_list=TRUE) or" | ||
" python -m gensim -c to get a list of models/corpuses" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not working now, please update
gensim/api/__init__.py
Outdated
@@ -0,0 +1,174 @@ | |||
from __future__ import print_function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this file as downloader.py
and move to library root gensim/downloader.py
, also merge file with cli and this
gensim/api/__init__.py
Outdated
downloaded_message = "{f} downloaded".format(f=dataset) | ||
if os.path.exists(data_folder_dir): | ||
log_file_dir = os.path.join(base_dir, 'api.log') | ||
with open(log_file_dir) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very suspicious, you write all to this file (not only models), please do several things
- Use this file only for models
- Add checksum (in file and in original repo) and compare it, if something happens (file broken) - warning + download file.
- If file doesn't exists - warning + create this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plaintext isn't best format for this file, please use json/pickle/etc
gensim/api/__init__.py
Outdated
urllib.urlretrieve(data_url, data_dir) | ||
logger.info("%s downloaded", dataset) | ||
if not is_installed: | ||
tar = tarfile.open(compressed_folder_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all of your files is tar
or gz
, for example, you also have non-zipped fasttext, big glove in zip, etc ....
args = parser.parse_args() | ||
if args.download is not None: | ||
data_path = load(args.download[0], return_path=True) | ||
logger.info("Data has been installed and data path is %s", data_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any logging setup here in __main__
. Where will this logging go?
We don't want any printing inside a library, but in a user-invoked top-level script, printing is fine.
gensim/downloader.py
Outdated
Args: | ||
dataset(string): Name of the corpus/model. | ||
""" | ||
url = "https://github.com/chaitaliSaini/Corpus_and_models/releases/download/{f}/{f}.tar.gz".format(f=dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a FIXME comment so we don't merge such temporary constructs by accident.
gensim/downloader.py
Outdated
base_dir = os.path.join(user_dir, 'gensim-data') | ||
data_log_file_dir = os.path.join(base_dir, 'data.json') | ||
|
||
logging.basicConfig( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please setup it in __main__
section (because we need logging in console, but no need it in programming version without explicit configuration from user side
gensim/downloader.py
Outdated
data = response.read().decode("utf-8") | ||
data = json.loads(data) | ||
if dataset is not None: | ||
corpora = data['gensim']['corpus'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gensim
is useless key in your json, you need only model
and dataset
keys for upper level
gensim/downloader.py
Outdated
corpora = data['gensim']['corpus'] | ||
models = data['gensim']['model'] | ||
if dataset in corpora: | ||
logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return
gensim/downloader.py
Outdated
if dataset in corpora: | ||
logger.info("%s \n", data['gensim']['corpus'][dataset]["desc"]) | ||
elif dataset in models: | ||
logger.info("%s \n", data['gensim']['model'][dataset]["desc"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing return
gensim/downloader.py
Outdated
return hash_md5.hexdigest() | ||
|
||
|
||
def info(dataset=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not dataset, this can be a model too (and same thing everywhere).
gensim/downloader.py
Outdated
|
||
user_dir = os.path.expanduser('~') | ||
base_dir = os.path.join(user_dir, 'gensim-data') | ||
data_log_file_dir = os.path.join(base_dir, 'data.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's path to file, not a dir
gensim/downloader.py
Outdated
.format(base_dir)) | ||
|
||
|
||
def initialize_data_log_file(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pass data_log_file
explicitly and open it here with with
operator
gensim/downloader.py
Outdated
|
||
|
||
def get_data_status(dataset): | ||
"""Function for finding the status of the dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataset and model
gensim/downloader.py
Outdated
dataset(string): Name of the corpus/model. | ||
status(string): Status to be updates to i.e downloaded or installed. | ||
""" | ||
jdata = json.loads(open(data_log_file_dir).read()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use with
for file open.
gensim/downloader.py
Outdated
logger.info("%s installed", dataset) | ||
else: | ||
logger.error("There was a problem in installing the file. Retrying.") | ||
_download(dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursion here
2017-09-26 13:22:03,650 :gensim.api :INFO : Creating /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:22:03,652 :gensim.api :INFO : Creation of /home/ivan/gensim-data/glove_common_crawl_42B successful.
2017-09-26 13:22:03,654 :gensim.api :INFO : Downloading glove_common_crawl_42B
2017-09-26 13:33:58,964 :gensim.api :INFO : glove_common_crawl_42B downloaded
2017-09-26 13:33:58,967 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:03,320 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:03,590 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:07,628 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:07,910 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:12,020 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
2017-09-26 13:34:12,306 :gensim.api :INFO : Extracting files from /home/ivan/gensim-data/glove_common_crawl_42B
2017-09-26 13:34:16,379 :gensim.api :ERROR : There was a problem in installing the file. Retrying.
|
||
|
||
def get_data_list(): | ||
"""Function getting the list of all datasets/models. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please re-write your docstrings according to numpy style (here and anywhere) + add missing .rst
to docs/src
+ change apiref.rst
return data_names | ||
|
||
|
||
def get_data_name(data_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be a good idea to hide functions that no needed to user (with underscores)
models = data['model'] | ||
json_list = [] | ||
for corpus in corpora: | ||
json_object = {"name": corpus, "status": "None"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to store anything with "None" status if we check file with data/model from data repo each time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for initialising the json log file that stores the status if the model has been downloaded or installed on the users computer.
compressed_folder_name = "{f}.tar.gz".format(f=data_) | ||
compressed_folder_dir = os.path.join(base_dir, compressed_folder_name) | ||
if get_data_status(data_) != "downloaded": | ||
if not os.path.exists(data_folder_dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If exists - need to remove all broken files and after it creates all that needed.
_download(data_) | ||
|
||
if get_data_status(data_) != "installed": | ||
tar = tarfile.open(compressed_folder_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very long indentation, need 4 spaces instead of 8
|
||
|
||
def load(data_, return_path=False): | ||
"""Loads the corpus/model to the memory, if return_path is False. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: Python docstrings use imperative mode ("Load X", not "Loads X"). Here and elsewhere.
Api for model/data storage