
Feature request: some mechanism for global data information to be passed to the model #1809

Closed
matt-gardner opened this issue Sep 21, 2018 · 14 comments

Comments

@matt-gardner
Contributor

In the semantic parsing code, there are things about the data that you need to read in the dataset reader and pass to the model, but only need to be read once. This includes, for example, the set of actions in the global grammar. The global grammar depends on your dataset, but not on each instance. Currently, the only way to get global information about the data from the DatasetReader to the Model is through the vocabulary, and the only way to influence the Vocabulary is through individual instances. There's no mechanism to get, e.g., a global grammar that we read from a database or a text file once, and pass it to the model. Instead, we have to either do something hacky with a Field that we just use for vocab creation and somehow make it not produce any tensors, or we have to duplicate the work of creating a bunch of tensors that we really only need one copy of.

I can imagine some ways to build this into the library, but everything feels hacky to me. This is one of those cases where you really want to break outside of the abstractions... Like, you could add DatasetReader.get_global_model_inputs(), or something, which returns a dictionary that gets passed as additional kwargs to the Model constructor. But I'm concerned about adding so many different interaction points between the abstractions that the library just gets unmanageable.
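For concreteness, here's a rough sketch of what that hypothetical hook might look like (nothing named get_global_model_inputs exists in the library today; the reader, the grammar file, and its format are all made up):

```python
from typing import Any, Dict, Iterable

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader


class MyParsingReader(DatasetReader):
    def __init__(self, grammar_path: str) -> None:
        super().__init__()
        self._grammar_path = grammar_path

    def _read(self, file_path: str) -> Iterable[Instance]:
        yield from ()  # per-instance reading elided in this sketch

    def get_global_model_inputs(self) -> Dict[str, Any]:
        # Hypothetical hook: read the global grammar exactly once, and let
        # the training loop pass the result to the Model as extra kwargs.
        with open(self._grammar_path) as grammar_file:
            global_actions = [line.strip() for line in grammar_file]
        return {"global_actions": global_actions}
```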

@brendan-ai2
Contributor

@acohan, this is a feature that would address the IDF calculation. We don't have a great solution yet, but please feel free to track this and offer suggestions.

@armancohan

This would be very helpful to have.

@joelgrus
Contributor

joelgrus commented Oct 3, 2018

one way to do this would be the following variation on mattg's suggestion:

(1) give DatasetReader an extras() -> Dict[str, Any] method whose default implementation returns an empty dict
(2) allow DataIterator to take an optional extras: Dict[str, Any] parameter
(3) after DataIterator calls batch.as_tensor_dict, it adds all the "extras" to the tensor dict (sketched below)
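A minimal sketch of how those three steps could fit together (the extras() method and the merging helper are hypothetical, not part of allennlp):

```python
from typing import Any, Dict, Iterable


def merge_extras(tensor_dicts: Iterable[Dict[str, Any]],
                 extras: Dict[str, Any]) -> Iterable[Dict[str, Any]]:
    # Step (3): after the iterator yields each tensor dict, fold in the
    # extras.  Every batch ends up pointing at the same shared objects.
    for tensor_dict in tensor_dicts:
        tensor_dict.update(extras)
        yield tensor_dict


# Steps (1) and (2): the reader computes the extras once, and the training
# loop threads them through to every batch, e.g.
#     for batch in merge_extras(iterator(instances, num_epochs=1),
#                               reader.extras()):
#         output_dict = model(**batch)
```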

another alternative is just to insist that people who do this are "breaking out of the library": e.g., they have to stick their global variable as a static property on their DatasetReader, and the model has to take the registered name of the DatasetReader as a constructor param and do something like

self.global_grammar = getattr(DatasetReader.by_name(name), 'global_grammar', None)
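Spelled out a bit more (DatasetReader.by_name is real; the reader, the model, and the load_grammar helper are illustrative):

```python
from typing import Iterable, List

from allennlp.data import Instance, Vocabulary
from allennlp.data.dataset_readers import DatasetReader
from allennlp.models import Model


def load_grammar(file_path: str) -> List[str]:
    # Hypothetical helper; stands in for however you parse your grammar.
    with open(file_path) as grammar_file:
        return [line.strip() for line in grammar_file]


@DatasetReader.register("my_parsing_reader")
class MyParsingReader(DatasetReader):
    global_grammar = None  # static property, set once during reading

    def _read(self, file_path: str) -> Iterable[Instance]:
        MyParsingReader.global_grammar = load_grammar(file_path)
        yield from ()  # per-instance reading elided in this sketch


@Model.register("my_parser")
class MyParser(Model):
    def __init__(self, vocab: Vocabulary, reader_name: str) -> None:
        super().__init__(vocab)
        # The model reaches back to the registered reader class by name.
        reader_class = DatasetReader.by_name(reader_name)
        self.global_grammar = getattr(reader_class, "global_grammar", None)
```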

@matt-gardner
Contributor Author

The trouble with sticking it in each tensor dict is that you're again duplicating it for every instance. Maybe it's just a pointer to the same object, and so it's not as bad, but it's still a global parameter that's pretending to be an instance parameter.

@joelgrus
Contributor

we should think about this as a Q1 goal @schmmd

(we should look for other issues like this too)

@asaluja

asaluja commented Apr 5, 2019

@matt-gardner @joelgrus has there been any update on this issue? I have a similar problem: I have some metadata for a classification problem that tells me how classes are grouped together, and I use that information to compute metrics at the group level instead of at the individual class level. The way I circumvented this was by reading a file with this information in the Model class, but now, when productionizing this model, I have a weird dependency on this external file even when running the Predictor, since I also group the classes when returning predictions. Any thoughts on how to do this cleanly?

@matt-gardner
Contributor Author

@asaluja, unfortunately, no, there hasn't been any work on this issue. It would be really nice to have, though.

Is the information you have in the file small enough to just put directly into the config? That's one way to get around having the separate file, though it's not particularly appealing...
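For example, the grouping metadata could be a constructor parameter that allennlp's from_params machinery fills in from the config (the model name and parameter below are made up):

```python
from typing import Dict, List

from allennlp.data import Vocabulary
from allennlp.models import Model


@Model.register("grouped_classifier")
class GroupedClassifier(Model):
    def __init__(self,
                 vocab: Vocabulary,
                 class_groups: Dict[str, List[str]]) -> None:
        super().__init__(vocab)
        # The groups now live in the archived config, so the Predictor
        # needs no external file at prediction time.
        self._class_groups = class_groups
```

The config would then carry something like "class_groups": {"sports": ["soccer", "tennis"], "politics": ["elections"]} under the model params.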

@matt-gardner
Contributor Author

One issue with this idea: if you get this side information from the dataset reader, how do you access it at test time? At demo time? It not only has to be computed from the training data, it has to be serialized somehow and available when loading the model. I'm not sure how to do that.

@matt-gardner matt-gardner added this to the 1.0.0 milestone Dec 17, 2019
@schmmd
Member

schmmd commented Jan 6, 2020

This may not need to be a breaking change, so we might not need to leave this in the 1.0 milestone.

@DeNeutoy
Contributor

We should not do this; instead, if you need this behavior, you should use allennlp as a library.

@matt-gardner
Contributor Author

By "as a library" you mean "write your own training loop"? "As a library" also applies to the case where I'm just writing a dataset reader and a model and using a config file.

@DeNeutoy
Contributor

It's easy if you write your own script: you just construct the object separately, however it needs to be constructed, and set it as an attribute on whatever thing you have pre-constructed, or manually add it into a config you are loading. The point is, it's not going to be possible to do this in a general way, but it's very easy to do for a single project.

@matt-gardner
Contributor Author

Yes, I think I agree with that. This is a case where we should recommend writing your own equivalent of our train command. We need to make sure that's easy to do.
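For reference, a minimal sketch of such a script against the pre-1.0 API this thread is about; MyParsingReader, a MyParser that accepts the grammar as a constructor kwarg, the sorting key, and the file paths are all placeholders:

```python
import torch

from allennlp.data import Vocabulary
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer

from my_project import MyParser, MyParsingReader  # hypothetical module

# Construct the global object however it needs to be constructed ...
reader = MyParsingReader(grammar_path="/path/to/grammar.txt")
train_instances = reader.read("/path/to/train.json")
vocab = Vocabulary.from_instances(train_instances)

# ... and hand it straight to the model, bypassing the config machinery.
model = MyParser(vocab, global_grammar=reader.get_global_model_inputs())

iterator = BucketIterator(batch_size=32,
                          sorting_keys=[("tokens", "num_tokens")])
iterator.index_with(vocab)

trainer = Trainer(model=model,
                  optimizer=torch.optim.Adam(model.parameters()),
                  iterator=iterator,
                  train_dataset=train_instances,
                  num_epochs=10)
trainer.train()
```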

@DeNeutoy
Contributor

Great, I think we can close this.
