
Feature request: some mechanism for global data information to be passed to the model #1809

Closed
matt-gardner opened this issue Sep 21, 2018 · 14 comments

Comments

@matt-gardner
Contributor

In the semantic parsing code, there are things about the data that you need to read in the dataset reader and pass to the model, but only need to be read once. This includes, for example, the set of actions in the global grammar. The global grammar depends on your dataset, but not on each instance. Currently, the only way to get global information about the data from the DatasetReader to the Model is through the vocabulary, and the only way to influence the Vocabulary is through individual instances. There's no mechanism to get, e.g., a global grammar that we read from a database or a text file once, and pass it to the model. Instead, we have to either do something hacky with a Field that we just use for vocab creation and somehow make it not produce any tensors, or we have to duplicate the work of creating a bunch of tensors that we really only need one copy of.

I can imagine some ways to build this into the library, but everything feels hacky to me. This is one of those cases where you really want to break outside of the abstractions... Like, you could add DatasetReader.get_global_model_inputs(), or something, which returns a dictionary that gets passed as additional kwargs to the Model constructor. But I'm concerned about adding so many different interaction points between the abstractions that the library just gets unmanageable.
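For concreteness, here's a rough sketch of what that hypothetical hook might look like (nothing named get_global_model_inputs exists in the library today; the reader, the grammar file, and its format are all made up):

```python
from typing import Any, Dict, Iterable

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader


class MyParsingReader(DatasetReader):
    def __init__(self, grammar_path: str) -> None:
        super().__init__()
        self._grammar_path = grammar_path

    def _read(self, file_path: str) -> Iterable[Instance]:
        yield from ()  # per-instance reading elided in this sketch

    def get_global_model_inputs(self) -> Dict[str, Any]:
        # Hypothetical hook: read the global grammar exactly once, and let
        # the training loop pass the result to the Model as extra kwargs.
        with open(self._grammar_path) as grammar_file:
            global_actions = [line.strip() for line in grammar_file]
        return {"global_actions": global_actions}
```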

@brendan-ai2
Contributor

@acohan, this is a feature that would address the IDF calculation. We don't have a great solution yet, but please feel free to track this and offer suggestions.

@armancohan

This would be very helpful to have.

@joelgrus
Contributor

joelgrus commented Oct 3, 2018

one way to do this would be the following variation on mattg's suggestion:

(1) give DatasetReader an extras() -> Dict[str, Any] method whose default implementation returns an empty dict
(2) allow DataIterator to take an optional extras: Dict[str, Any] parameter
(3) after DataIterator calls batch.as_tensor_dict, it adds all the "extras" to the tensor dict (sketched below)
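A minimal sketch of how those three steps could fit together (the extras() method and the merging helper are hypothetical, not part of allennlp):

```python
from typing import Any, Dict, Iterable


def merge_extras(tensor_dicts: Iterable[Dict[str, Any]],
                 extras: Dict[str, Any]) -> Iterable[Dict[str, Any]]:
    # Step (3): after the iterator yields each tensor dict, fold in the
    # extras.  Every batch ends up pointing at the same shared objects.
    for tensor_dict in tensor_dicts:
        tensor_dict.update(extras)
        yield tensor_dict


# Steps (1) and (2): the reader computes the extras once, and the training
# loop threads them through to every batch, e.g.
#     for batch in merge_extras(iterator(instances, num_epochs=1),
#                               reader.extras()):
#         output_dict = model(**batch)
```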

another alternative is just to insist that people who do this are "breaking out of the library": e.g., they have to stick their global variable as a static property on their DatasetReader, and the model has to take the registered name of the DatasetReader as a constructor param and do something like

self.global_grammar = getattr(DatasetReader.by_name(name), 'global_grammar', None)
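Spelled out a bit more (DatasetReader.by_name is real; the reader, the model, and the load_grammar helper are illustrative):

```python
from typing import Iterable, List

from allennlp.data import Instance, Vocabulary
from allennlp.data.dataset_readers import DatasetReader
from allennlp.models import Model


def load_grammar(file_path: str) -> List[str]:
    # Hypothetical helper; stands in for however you parse your grammar.
    with open(file_path) as grammar_file:
        return [line.strip() for line in grammar_file]


@DatasetReader.register("my_parsing_reader")
class MyParsingReader(DatasetReader):
    global_grammar = None  # static property, set once during reading

    def _read(self, file_path: str) -> Iterable[Instance]:
        MyParsingReader.global_grammar = load_grammar(file_path)
        yield from ()  # per-instance reading elided in this sketch


@Model.register("my_parser")
class MyParser(Model):
    def __init__(self, vocab: Vocabulary, reader_name: str) -> None:
        super().__init__(vocab)
        # The model reaches back to the registered reader class by name.
        reader_class = DatasetReader.by_name(reader_name)
        self.global_grammar = getattr(reader_class, "global_grammar", None)
```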

@matt-gardner
Contributor Author

The trouble with sticking it in each tensor dict is that you're again duplicating it for every instance. Maybe it's just a pointer to the same object, and so it's not as bad, but it's still a global parameter that's pretending to be an instance parameter.

@joelgrus
Contributor

we should think about this as a Q1 goal @schmmd

(we should look for other issues like this too)

@asaluja

asaluja commented Apr 5, 2019

@matt-gardner @joelgrus has there been any update on this issue? I have a similar problem: I have some metadata for a classification problem that tells me how classes are grouped together, and I use that information to compute metrics at the group level instead of at the individual class level. The way I circumvented this was by reading a file with this information in the Model class, but now, when productionizing this model, I have a weird dependency on this external file even when running the Predictor, since I also group the classes when returning predictions. Any thoughts on how to do this cleanly?

@matt-gardner
Contributor Author

@asaluja, unfortunately, no, there hasn't been any work on this issue. It would be really nice to have, though.

Is the information you have in the file small enough to just put directly into the config? That's one way to get around having the separate file, though it's not particularly appealing...
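For example, the grouping metadata could be a constructor parameter that allennlp's from_params machinery fills in from the config (the model name and parameter below are made up):

```python
from typing import Dict, List

from allennlp.data import Vocabulary
from allennlp.models import Model


@Model.register("grouped_classifier")
class GroupedClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 class_groups: Dict[str, List[str]]) -> None:
        super().__init__(vocab)
        # The groups now live in the archived config, so the Predictor
        # needs no external file at prediction time.
        self._class_groups = class_groups
```

The config would then carry something like "class_groups": {"sports": ["soccer", "tennis"], "politics": ["elections"]} under the model params.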

@matt-gardner
Contributor Author

One issue with this idea: if you get this side information from the dataset reader, how do you access it at test time? At demo time? It not only has to be computed from the training data, it has to be serialized somehow and available when loading the model. I'm not sure how to do that.

@matt-gardner matt-gardner added this to the 1.0.0 milestone Dec 17, 2019
@schmmd
Member

schmmd commented Jan 6, 2020

This may not need to be a breaking change, so we might not need to leave this in the 1.0 milestone.

@DeNeutoy
Contributor

We should not do this; instead, if you need this behavior, you should use allennlp as a library.

@matt-gardner
Contributor Author

By "as a library" you mean "write your own training loop"? "As a library" also applies to the case where I'm just writing a dataset reader and a model and using a config file.

@DeNeutoy
Contributor

It's easy if you write your own script: you just construct the object separately, however it needs to be constructed, and set it as an attribute on whatever thing you have pre-constructed, or manually add it into a config you are loading. The point is, it's not going to be possible to do this in a general way, but it's very easy to do for a single project.

@matt-gardner
Contributor Author

Yes, I think I agree with that. This is a case where we should recommend writing your own equivalent of our train command. We need to make sure that's easy to do.
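For reference, a minimal sketch of such a script against the pre-1.0 API this thread is about; MyParsingReader, a MyParser that accepts the grammar as a constructor kwarg, the sorting key, and the file paths are all placeholders:

```python
import torch

from allennlp.data import Vocabulary
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

from my_project import MyParser, MyParsingReader  # hypothetical module

# Construct the global object however it needs to be constructed ...
reader = MyParsingReader(grammar_path="/path/to/grammar.txt")
train_instances = reader.read("/path/to/train.json")
vocab = Vocabulary.from_instances(train_instances)

# ... and hand it straight to the model, bypassing the config machinery.
model = MyParser(vocab, global_grammar=reader.get_global_model_inputs())

iterator = BucketIterator(batch_size=32,
                          sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model,
                  optimizer=torch.optim.Adam(model.parameters()),
                  iterator=iterator,
                  train_dataset=train_instances,
                  num_epochs=10)
trainer.train()
```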

@DeNeutoy
Contributor

Great, I think we can close this.
