bAbI refactored #97
base: develop
Conversation
Release 0.3
Conf and setup - updated version to 0.3.0
This pull request introduces 3 alerts when merging ddb720b into 68ad04d - view the new alerts on LGTM.com.
Comment posted by LGTM.com
- Rename class
- Introduce abstract class and move methods from SeqToSeq to that class
- Change param names and default values
- Enhance the returned DataDict
- Is the configuration file ok, i.e. is the LSTM model (somehow) converging for single task/many tasks/all tasks with all samples/tenK samples?
tf.write(f.read())
return self.parse(file_data, add_punctuation)

def download(self, root, check=None):
Ok, so that whole part should at some point be integrated with Emre's ProblemInitializer.
Why 2 download methods?
One downloads a single file from a given URL; the second one downloads the file(s) (using the first one), unpacks them and returns a path to the result. IMO this is really 100% ProblemInitializer functionality.
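For reference, a minimal sketch of the split described above, shown as methods of the problem class. The method names match the diff, but the url/filename/dirname attributes, the urllib download and the tarfile extraction are illustrative assumptions, not the code from this PR:

import os
import tarfile
import urllib.request

def download_from_url(self, url, path):
    # First method: fetch a single file from the given URL into a local path.
    if not os.path.isfile(path):
        urllib.request.urlretrieve(url, path)
    return path

def download(self, root, check=None):
    # Second method: download the archive (via download_from_url),
    # unpack it and return the path to the extracted dataset.
    os.makedirs(root, exist_ok=True)
    archive = self.download_from_url(self.url, os.path.join(root, self.filename))
    with tarfile.open(archive) as tar:
        tar.extractall(root)
    return os.path.join(root, self.dirname)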
Ok I need to know more about that. Will check with Emre
def download_from_url(self, url, path):
This code is responsible for downloading; we are moving that functionality to ProblemInitializer and integrating it with ProblemFactory.
Ok, I will need to know more about that. Will check with Emre.
babi_tasks = list(range(1, 21))

params = {'directory': '/', 'tasks': babi_tasks, 'data_type': 'train', 'batch_size': 10,
          'embedding_type': 'glove.6B.100d', 'ten_thousand_examples': True,
          'one_hot_embedding': True, 'truncation_length': 50}
Generally we are standardizing parameter names (see #73).
In the case of data, the name of the variable should be "data_dir" with the default value set to "~/data/babi/".
Please change that according to the convention used e.g. in MNIST in __init__:
self.params.add_default_params({'data_folder': '~/data/mnist',
and then when "simulating" the config file:
params.add_config_params({'data_folder': '~/data/mnist',
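Applied to bAbI, the convention might look roughly like this. A sketch only: data_folder and its default come from this thread, and the other keys simply mirror the params dict quoted from the test snippet above:

# In the problem's __init__, register the defaults:
self.params.add_default_params({'data_folder': '~/data/babi/',
                                'tasks': list(range(1, 21)),
                                'data_type': 'train'})

# And when "simulating" a configuration file in the test section of the file:
params.add_config_params({'data_folder': '~/data/babi/',
                          'tasks': [1, 2, 3],
                          'ten_thousand_examples': True})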
got it
@@ -75,6 +77,86 @@ def evaluate_loss(self, data_dict, logits):
    return loss


def to_dictionary_indexes(self, dictionary, sentence):
All those methods look to me as related to text. SeqToSeq problems abstract away from text.
I guess it would be good to move them to a class higher in the hierarchy (out of bAbI), as they will probably be reused by the other, multi-question version.
I propose to introduce a new class, inheriting from SeqToSeq, that will contain those methods, and maybe other methods shared with the second version of bAbI...
Do you think those methods can be useful in e.g. machine translation?
These should be moved to TextToText or another class related to NLP.
yes ok let's go for TextToText
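A rough sketch of the hierarchy agreed on here; the class names follow the discussion, while the import path and the body of to_dictionary_indexes are only illustrative guesses, not the code from the diff:

import torch
from miprometheus.problems.seq_to_seq_problem import SeqToSeqProblem  # assumed import path

class TextToTextProblem(SeqToSeqProblem):
    """Abstract base for NLP problems; receives the text helpers moved out of bAbI."""

    def to_dictionary_indexes(self, dictionary, sentence):
        # Map each token of a sentence to its index in the dictionary (guessed behaviour).
        return torch.tensor([dictionary[word] for word in sentence], dtype=torch.long)

class bAbIQASingleQuestion(TextToTextProblem):
    """Single-question-per-story bAbI QA problem (the renamed BABI class)."""
    pass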
class BABI(SeqToSeqProblem):
The name of the dataset is bAbI (AI with capital letters), and as this is the version with a single question per story (if I understand it correctly), I suggest renaming it to something that reflects that fact, e.g. bAbIQASingleQuestion.
Ok done
self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
                         'targets': {'size': [-1], 'type': [torch.Tensor]},
                         'current_question': {'size': [-1, 1], 'type': [list, str]},
What is current_question, and why is it singular while the other names are plural?
They are all the string questions, so I corrected it to the plural.
                         'targets': {'size': [-1], 'type': [torch.Tensor]},
                         'current_question': {'size': [-1, 1], 'type': [list, str]},
                         'masks': {'size': [-1], 'type': [torch.Tensor]},
                         }
Aside from that, we have twenty tasks in bAbI. I guess it would be good to store which task a given sample belongs to.
Similarly to the CLEVR case, we will need that to analyze how good we are e.g. on stories with two supporting facts. Please notice that when people report results on bAbI, they provide tables with separate accuracies on all 20 tasks and then one accuracy when trained jointly on all tasks...
Yes ok
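One hedged way to carry the task identity would be an extra entry in data_definitions (and a matching field in every returned DataDict); the 'tasks' key and its shape below are an illustrative assumption, not something already in the PR:

self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
                         'targets': {'size': [-1], 'type': [torch.Tensor]},
                         'questions': {'size': [-1, 1], 'type': [list, str]},
                         'masks': {'size': [-1], 'type': [torch.Tensor]},
                         # Hypothetical: the bAbI task (1-20) each sample comes from,
                         # so per-task accuracies can be reported as for CLEVR.
                         'tasks': {'size': [-1], 'type': [torch.Tensor]},
                         }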
self.embedding_type = params['embedding_type']

self.embedding_size = 38
The size of the embedding is hardcoded, whereas, if I understand it correctly, you can use different ones by setting 'embedding_type'. Is that ok?
Yes, I put it in the YAML as a parameter but the parser couldn't find it. Will try again.
self.batch_size = params['batch_size']

self.memory_size = params['truncation_length']
What is that? What is truncated?
I think this was the padding size for Ryan. I have removed it.
self.default_values = {'input_item_size': self.embedding_size, 'output_item_size': self.embedding_size}

self.data_definitions = {'sequences': {'size': [-1, -1, self.memory_size], 'type': [torch.Tensor]},
We discussed that and I do not really understand why you plugged self.memory_size in as the third dimension...
The shape should be BATCH_SIZE x SEQ_LEN x ITEM_SIZE, where ITEM_SIZE is in this case the size (number of bits) returned by the used embedding of a single word. Right?
If not, what is really stored in "sequences"? Is it the whole story?
Agreed, it would be best to separate the story (or context) from the actual question.
But then a simple LSTM model won't work...
So probably there should be a switch, like "separate_question", that could be parametrized from the configuration file.
If true:
    sequences = story
    question = question
else:
    sequences = story + question
Right now sequences = story + question
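A small sketch of how the proposed switch could look in the sample-building code; the separate_question name comes from the suggestion above, while the sample keys and the concatenation are illustrative assumptions:

# In __init__, after registering a default for the new flag:
self.separate_question = params['separate_question']

# When building a single sample:
if self.separate_question:
    sample['sequences'] = story_tokens
    sample['questions'] = question_tokens
else:
    # current behaviour: sequences = story + question
    sample['sequences'] = story_tokens + question_tokens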
def build_dictionaries_one_hot(self):
The description is inconsistent with the name of the function. Is it really creating word embeddings OR is it building dictionaries from all the words present in the stories/questions?
Why do we need two methods for it (one with one-hot and one without)?
We could keep only one. There is a boolean variable, self.embedding_type, that decides if you want one-hot or regular embeddings. I thought this was a nice feature so I kept it.
Finally, can we agree that we will use the right name? It should be bAbI, not Babi, BABI, etc. And it is for a reason (https://research.fb.com/downloads/babi/): bAbI comes from "baby AI". Besides, this is only bAbI QA, not the whole bAbI, which contains more tasks...
I understand bAbI is a hard problem to implement because of the padding etc.
And this is exactly why this class needs to be clearer, so it should contain documentation and comments, and variable names should reflect their use.
Have a look at CLEVR and COG for the constraints to respect in a Problem class.
Perhaps we should all agree together on what the content of a sample in bAbI should be...
batch_size: &b 1
data_type: train
embedding_type: glove.6B.100d
embedding_size: 50
glove.6B.100d already uses an embedding dimension of size 100, so I do not understand embedding_size: 50.
embedding_size: 50
use_mask: false
joint_all: true
one_hot_embedding: true
How is this related to embedding_type: glove.6B.100d and embedding_size: 50?
So I just checked. You are right, glove.6B.100d embeddings are of size 100; when it is one-hot, it is the size of the dictionary, so I will store the size of the dictionary instead.
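A hedged sketch of that fix (the self.dictionary attribute is an assumption; the rest follows the reply above):

if self.one_hot_embedding:
    # One-hot vectors are as wide as the dictionary.
    self.embedding_size = len(self.dictionary)
else:
    # glove.6B.100d vectors have dimension 100.
    self.embedding_size = 100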
tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length: 50
directory: ./
As @tkornut indicated, we should change that to data_folder: "~/data/babi/"
ok
data_type: valid
embedding_type: glove.6B.100d
joint_all: true
one_hot_embedding: true
Same remark here as above
So I just checked. You are right, glove.6B.100d embeddings are of size 100; when it is one-hot, it is the size of the dictionary, so I will store the size of the dictionary instead.
one_hot_embedding: true
tasks: [1, 2, 3]
ten_thousand_examples: true
truncation_length: 50
Same question as @tkornut, is this related to the maximum size of a question? Or is it related to padding?
data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='valid',
                             outmod="embedding")
data = data + self.load_data(tasks=tasks, tenK=self.tenK, add_punctuation=True, data_type='test',
                             outmod="embedding")
So data contains all the train, valid and test samples? If so, what is the point of data_type: train in the config file?
It is used somewhere else to load the right data. It is not useful for building the dictionaries though, you are right.
shutil.copyfileobj(gz, uncompressed)

# Return path to extracted dataset
return os.path.join(path, self.dirname)
This method looks quite generic, which is nice, so I would move it up in the class hierarchy.
ok
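If it is moved up, the generic part could live in the base Problem class (or, eventually, in ProblemInitializer) roughly as follows. The method name and placement are assumptions; the gzip/shutil body mirrors the diff lines quoted above:

import gzip
import os
import shutil

class Problem:  # stand-in for miprometheus.problems.problem.Problem
    def extract_gz(self, gz_path, out_path, data_path, dirname):
        # Decompress a .gz archive and return the directory of the extracted dataset,
        # mirroring the shutil.copyfileobj / os.path.join lines from the diff.
        with gzip.open(gz_path, 'rb') as gz, open(out_path, 'wb') as uncompressed:
            shutil.copyfileobj(gz, uncompressed)
        return os.path.join(data_path, dirname)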
from miprometheus.problems.problem import Problem
import torch
from miprometheus.utils.app_state import AppState
No need, because Problem already has
self.app_state = AppState()
Ok, but some classes like TextToTextProblem do not see it somehow.
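A likely but unconfirmed explanation: self.app_state only exists in a subclass if the parent constructor actually runs. A minimal sketch, assuming Problem's constructor takes the params object as the other problems in this repo do:

from miprometheus.problems.problem import Problem

class TextToTextProblem(Problem):
    def __init__(self, params):
        # Problem.__init__ is what sets self.app_state = AppState(); a subclass
        # only "does not see" it if this call is missing or happens too late.
        super().__init__(params)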
""" build embeddings from the chosen database / Example: glove.6B.100d """

self.language.build_pretrained_vocab(text, vectors=self.embedding_type, tokenize=self.tokenize)
Regarding the problem with the size of the embeddings...
I think that language (btw, I think this is a bad name, it means exactly nothing; if I understand it correctly it is really a... TextEncoder?) should set and return that value (i.e. the size of the embeddings)...
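One hedged option, shown here with torchtext's GloVe loader directly; whether the project's language wrapper exposes the same information is an assumption:

from torchtext.vocab import GloVe

# Load the pretrained vectors once and read the embedding size from them
# instead of hardcoding it in the problem class (downloads the vectors on first use).
vectors = GloVe(name='6B', dim=100)
embedding_size = vectors.dim   # 100 for glove.6B.100d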
This pull request introduces 4 alerts when merging 0c052d4 into 3d4f868 - view the new alerts on LGTM.com.
Comment posted by LGTM.com
Ok guys, I'm opening this pull request if you want to have a look; I need feedback. For now it contains a refactored version of Ryan's work. It contains:
I think the problem class needs a lot of improvement. I noticed Ryan didn't use any padding, so I am working on a new version with pack_padded_sequence as @vmarois suggested. Also, the story parsing method was not commented, so I had a hard time understanding what he did, and I am not 100% sure the parsing is correct.
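For reference, a minimal, self-contained sketch of the pack_padded_sequence approach mentioned above (toy tensors only, not this PR's data pipeline):

import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three "stories" of different lengths, already embedded (length x embedding_size).
embedding_size = 100
stories = [torch.randn(12, embedding_size),
           torch.randn(7, embedding_size),
           torch.randn(9, embedding_size)]
lengths = torch.tensor([s.size(0) for s in stories])

# Pad to a common length, then pack so the LSTM skips the padded positions.
padded = pad_sequence(stories, batch_first=True)                  # (3, 12, 100)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(input_size=embedding_size, hidden_size=64, batch_first=True)
packed_output, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape)   # torch.Size([3, 12, 64])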