Feature request: some mechanism for global data information to be passed to the model #1809

In the semantic parsing code, there are things about the data that you need to read in the dataset reader and pass to the model, but only need to be read once. This includes, for example, the set of actions in the global grammar. The global grammar depends on your dataset, but not on each instance. Currently, the only way to get global information about the data from the `DatasetReader` to the `Model` is through the vocabulary, and the only way to influence the `Vocabulary` is through individual instances. There's no mechanism to get, e.g., a global grammar that we read from a database or a text file once, and pass it to the model. Instead, we have to either do something hacky with a `Field` that we just use for vocab creation and somehow make it not produce any tensors, or we have to duplicate the work of creating a bunch of tensors that we really only need one copy of.

I can imagine some ways to build this into the library, but everything feels hacky to me. This is one of those cases where you really want to break outside of the abstractions... Like, you could add `DatasetReader.get_global_model_inputs()`, or something, which returns a dictionary that gets passed as additional kwargs to the `Model` here. But I'm concerned about adding so many different interaction points between the abstractions that the library just gets unmanageable.
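To make the `get_global_model_inputs()` idea concrete, here is a hypothetical sketch; this hook does not exist in allennlp, and `read_grammar` and the reader class are made-up names:

```python
from typing import Any, Dict, List

from allennlp.data.dataset_readers import DatasetReader


def read_grammar(path: str) -> List[str]:
    # Stand-in for whatever loads the global grammar from disk.
    with open(path) as f:
        return [line.strip() for line in f]


class SemanticParsingReader(DatasetReader):
    # Hypothetical hook: the training code would call this once and pass the
    # result as additional kwargs to the Model constructor, instead of
    # routing the grammar through per-instance fields.
    def get_global_model_inputs(self) -> Dict[str, Any]:
        return {"global_grammar": read_grammar("grammar.txt")}
```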
Comments

@acohan, this is a feature that would address the idf calculation. We don't have a great solution yet, but please feel free to track this and offer suggestions.

This would be very helpful to have.
One way to do this would be the following variation on mattg's suggestion: insist that people who do this are "breaking out of the library", so that e.g. they have to stick their global variable as a static property on their DatasetReader, and then the model has to take the registered name of the DatasetReader as a constructor param and then do something like the sketch below.
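A minimal sketch of that suggestion, with made-up names (`GrammarReader`, `GrammarModel`, `GLOBAL_GRAMMAR`); `DatasetReader.by_name` and the `register` decorators are real allennlp `Registrable` machinery:

```python
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import DatasetReader
from allennlp.models import Model


@DatasetReader.register("grammar-reader")
class GrammarReader(DatasetReader):
    # The global variable, stashed as a static property on the reader.
    # Whoever constructs the reader is responsible for setting it once.
    GLOBAL_GRAMMAR = None


@Model.register("grammar-model")
class GrammarModel(Model):
    def __init__(self, vocab: Vocabulary, dataset_reader_name: str) -> None:
        super().__init__(vocab)
        # Look up the reader class by its registered name and read the static.
        reader_class = DatasetReader.by_name(dataset_reader_name)
        self.global_grammar = reader_class.GLOBAL_GRAMMAR
```

Note that the model never sees a reader instance, only the class, which is what makes this feel like breaking out of the library.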
The trouble with sticking it in each tensor dict is that you're again duplicating it for every instance. Maybe it's just a pointer to the same object, and so it's not as bad, but it's still a global parameter that's pretending it's an instance parameter.
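For concreteness, that pointer-per-instance variant could use allennlp's `MetadataField`, which passes arbitrary objects through untensorized; the grammar contents and helper below are made up:

```python
from allennlp.data import Instance
from allennlp.data.fields import MetadataField

# One shared object; every instance's field is a reference to it, not a copy.
GLOBAL_GRAMMAR = ["S -> NP VP", "NP -> Det N"]


def add_global_grammar(instance_fields: dict) -> Instance:
    # Hypothetical helper: attach the shared grammar to an instance's fields.
    instance_fields["global_grammar"] = MetadataField(GLOBAL_GRAMMAR)
    return Instance(instance_fields)
```

In `Model.forward` the batched field still arrives as a list with one entry per instance, which is exactly the duplication being complained about here.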
We should think about this as a Q1 goal @schmmd (we should look for other issues like this too).

@matt-gardner @joelgrus has there been any update on this issue? I have a similar problem in that I have some metadata for a classification problem that tells me how classes are grouped together, and I use that information to compute metrics at the group level instead of at the individual class level. The way I circumvented this issue is by reading in a separate file with this information.
@asaluja, unfortunately, no, there hasn't been any work on this issue. It would be really nice to have, though. Is the information you have in the file small enough to just put directly into the config? That's one way to get around having the separate file, though it's not particularly appealing...
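If the grouping data is small, "directly into the config" just means making it an ordinary constructor parameter; a sketch with made-up names (`grouped-classifier`, `class_groups`):

```python
from typing import Dict, List

from allennlp.data import Vocabulary
from allennlp.models import Model


@Model.register("grouped-classifier")
class GroupedClassifier(Model):
    def __init__(self, vocab: Vocabulary,
                 class_groups: Dict[str, List[str]]) -> None:
        super().__init__(vocab)
        # Populated straight from the config, e.g.
        # "model": {
        #     "type": "grouped-classifier",
        #     "class_groups": {"mammals": ["cat", "dog"], "plants": ["oak"]}
        # }
        self.class_groups = class_groups
```

Because the values live in the config, which is archived with the trained model, they are also available when the model is loaded at test or demo time.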
One issue with this idea: if you get this side information from the dataset reader, how do you access it at test time? At demo time? It not only has to be computed from the training data, it has to be serialized somehow and available when loading the model. I'm not sure how to do that.

This may not need to be a breaking change, so we might not need to leave this in the 1.0 milestone.
We should not do this; instead, if you need this behavior, you should use allennlp as a library.
By "as a library" you mean "write your own training loop"? "As a library" also applies to the case where I'm just writing a dataset reader and a model and using a config file. |
It's easy if you write your own script: you just construct it separately, however it needs to be constructed, and set it as an attribute on whatever thing you have pre-constructed, or manually add it into a config you are loading. The point is, it's not going to be possible to do this in a general way, but it's very easy to do for a single project.
Yes, I think I agree with that. This is a case where we should recommend writing your own equivalent of our train command. We need to make sure that's easy to do.
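A minimal sketch of such a project-specific train script, assuming the 0.x-era allennlp API; `load_global_grammar` and the file names are made up:

```python
from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.data.dataset_readers import DatasetReader
from allennlp.models import Model

params = Params.from_file("experiment.jsonnet")
reader = DatasetReader.from_params(params.pop("dataset_reader"))
train_instances = reader.read(params.pop("train_data_path"))
vocab = Vocabulary.from_instances(train_instances)
model = Model.from_params(vocab=vocab, params=params.pop("model"))

# Construct the global object however it needs to be constructed,
# then set it as an attribute on the pre-constructed model.
model.global_grammar = load_global_grammar("grammar.txt")  # hypothetical helper

# From here, build the iterator and trainer and run training,
# mirroring what allennlp's train command does internally.
```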
Great, I think we can close this.