Differentiate between multiple sources for a same variable #63

louisPoulain · 2024-10-18T12:53:45Z

When working with several models that have different resolutions, each unavailable for some set of lead times, we currently end up passing a lot of fill_value entries to the network, which may hinder performance.

First solution: lambda layer

Use a lambda layer to select the source that "has the best resolution". Usually, this should be the model that is available for the fewest lead times.
This will require some changes to the way data is handled in mlpp-lib, as we want the network to "automatically" know which sources it should aggregate.
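To make the idea concrete, here is a minimal NumPy sketch of the selection such a layer would perform. It is only an illustration: `FILL_VALUE`, `coalesce_sources` and the column layout are hypothetical and not part of mlpp-lib, and an actual lambda layer would need the equivalent tensor operations.

```python
import numpy as np

FILL_VALUE = -5.0  # hypothetical sentinel; use whatever the pipeline actually fills with


def coalesce_sources(x, priority_cols, fill_value=FILL_VALUE):
    """Pick, per sample, the first non-fill value among columns of one variable.

    x: array of shape (n_samples, n_features)
    priority_cols: column indices of the same variable, best source first
    """
    out = np.full(x.shape[0], fill_value, dtype=x.dtype)
    # Go from lowest to highest priority so better sources overwrite worse ones.
    for col in reversed(priority_cols):
        available = x[:, col] != fill_value
        out[available] = x[available, col]
    return out


# Column 0: high-resolution source (few lead times), column 1: coarser source.
x = np.array([
    [1.0, 1.5],
    [2.0, 2.5],
    [FILL_VALUE, 3.5],
    [FILL_VALUE, FILL_VALUE],
])
merged = coalesce_sources(x, priority_cols=[0, 1])
```

Note that the caller has to supply `priority_cols`, which is exactly the information the network does not have when it only receives a matrix.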

Pros

The code is clean, automatic and efficient, and performance hopefully improves. The user remains free to use the lambda layer or not.

Cons

Currently the network receives a matrix of data (e.g. features x number of data points). A priori, the network has no way of knowing the variable names (introducing that functionality would require too many changes), so it will be hard to ensure that the network aggregates the correct sources.
Moreover, we need to ensure that this lambda routine is performed only if a variable is "proposed" by two or more different sources.

Another solution, which I think is better suited, is to introduce a new routine in the datamodule.

Second solution: remove/aggregate directly in the datamodule

The goal of this solution is to handle the data directly in the datamodule.

Pros

The code is still clean, and we can always implement a routine that allows the user to enable or disable this feature (False by default to ensure backward compatibility).
In the datamodule we have access to the variable names, so it's easy to see which variables are "duplicated".
The model is created after the datamodule has been set up, so we don't need to change the number of input variables.
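As a rough sketch of how the duplicate detection and the opt-in flag could look, assuming feature names of the form `"<source>:<variable>"`. This naming convention, the function names and the `aggregate_sources` flag are all hypothetical and do not reflect the current mlpp-lib API.

```python
from collections import defaultdict


def find_duplicated_variables(feature_names, sep=":"):
    """Group feature names by variable and keep variables with 2+ sources.

    Assumes hypothetical names of the form "<source><sep><variable>";
    names without the separator are treated as their own variable.
    """
    groups = defaultdict(list)
    for name in feature_names:
        _source, found, variable = name.partition(sep)
        groups[variable if found else name].append(name)
    return {var: names for var, names in groups.items() if len(names) > 1}


def select_features(feature_names, aggregate_sources=False):
    """Return the feature list the model would see.

    With aggregate_sources=False (the default) nothing changes.
    """
    if not aggregate_sources:
        return list(feature_names)
    duplicated = find_duplicated_variables(feature_names)
    merged = {n for names in duplicated.values() for n in names}
    # Keep single-source features, then add one entry per merged variable.
    kept = [n for n in feature_names if n not in merged]
    return kept + sorted(duplicated)


names = ["modelA:temperature", "modelB:temperature", "modelA:wind_speed", "elevation"]
duplicated = find_duplicated_variables(names)
features = select_features(names, aggregate_sources=True)
```

Since the model is built after the datamodule is set up, it would simply be created with `len(features)` inputs.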

Cons

It adds a routine to the datamodule, which can slow things down a bit at this point of the code.

Cons for both solutions

The distribution of a variable will not be the same across sources (e.g. simply because of the resolution), so we end up aggregating into one variable something whose underlying distribution "jumps" at the lead times where the source changes.
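To see where such jumps would occur, one could locate the lead times at which the preferred available source changes. The helper below is a hypothetical sketch on toy availability flags, not an existing routine.

```python
def source_switch_points(availability):
    """Return lead-time indices where the preferred available source changes.

    availability: list of per-lead-time lists of booleans, best source first.
    """
    chosen = []
    for flags in availability:
        # Index of the first available source, or None if every source is missing.
        chosen.append(next((i for i, ok in enumerate(flags) if ok), None))
    return [t for t in range(1, len(chosen)) if chosen[t] != chosen[t - 1]]


# Toy example: the best source covers only the first two lead times.
availability = [[True, True], [True, True], [False, True], [False, True]]
switches = source_switch_points(availability)
```

The returned indices are the lead times at which the merged variable changes source, so that is where a shift in its distribution should be expected.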

@dnerini feel free to comment on this as we need to choose one way to proceed before implementing anything.


louisPoulain added the enhancement New feature or request label Oct 18, 2024

louisPoulain self-assigned this Oct 18, 2024
