Differentiate between multiple sources for a same variable #63

Open
louisPoulain opened this issue Oct 18, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@louisPoulain
Collaborator

When working with different models that use different resolutions, each with a set of lead times for which it is not available, we currently end up passing a lot of fill_value to the network, possibly hindering performance.

First solution: lambda layer

Use a lambda layer to select the source that "has the best resolution". Usually, this should be the model that is available for the fewest lead times.
This will require changing slightly how data is handled in mlpp-lib, as we want the network to "automatically" know which sources it should aggregate.
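To make the intended selection concrete, here is a minimal NumPy sketch of the logic such a layer would have to perform. It is not mlpp-lib code: the function name `select_best_source`, the `(n_sources, n_samples)` layout, the priority ordering of the rows, and the use of NaN as the default missing-value marker are all assumptions made for illustration.

```python
import numpy as np


def select_best_source(stacked, fill_value=np.nan):
    """Pick, for each sample, the first available value along the source axis.

    stacked: array-like of shape (n_sources, n_samples). Rows are assumed to
    be ordered from highest to lowest priority (e.g. best resolution first).
    fill_value: marker used where a source is not available (assumed).
    """
    stacked = np.asarray(stacked, dtype=float)
    if np.isnan(fill_value):
        available = ~np.isnan(stacked)
    else:
        available = stacked != fill_value
    # argmax on a boolean array returns the index of the first True,
    # i.e. the highest-priority source that is available for each sample.
    first = available.argmax(axis=0)
    picked = np.take_along_axis(stacked, first[None, :], axis=0)[0]
    # Where no source is available at all, keep the fill value.
    return np.where(available.any(axis=0), picked, fill_value)


# Hypothetical example: the high-resolution source covers only the first
# two lead times, the low-resolution source covers three.
high_res = [1.0, 2.0, np.nan, np.nan]
low_res = [10.0, 20.0, 30.0, np.nan]
merged = select_best_source([high_res, low_res])
# merged holds 1.0, 2.0, 30.0 and then NaN
```

The last sample stays missing because neither source provides it, so some fill_value entries can still reach the network.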

Pros

The code is clean, automatic and efficient. Hopefully performance improves. The user keeps the flexibility to use the lambda layer or not.

Cons

Currently the network receives a matrix of data (e.g. features x number of data points). A priori, there is no way for the network to know the variable names (introducing that functionality would require too many changes). So it will be hard to ensure that the network aggregates the correct sources.
Moreover, we need to ensure that this lambda routine runs only if a variable is "proposed" by two or more different sources.

Another solution, which I think is better suited, is to introduce a new routine in the datamodule.

Second solution: remove/aggregate directly in the datamodule

The goal of this solution is to handle the data directly in the datamodule.

Pros

The code is still clean, and we can always implement an option that lets the user enable or disable this feature (False by default to ensure backward compatibility).
In the datamodule we have access to the variable names, so it's easy to see which variables are "duplicated".
The model is created after the datamodule has been set up, so we don't need to change the number of input variables.
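As a rough illustration of how duplicated variables could be detected from their names, the sketch below groups feature names by variable. The `source:variable` naming convention, the function name `find_duplicated_variables` and the example names are hypothetical and would need to be adapted to however the datamodule actually labels its features.

```python
from collections import defaultdict


def find_duplicated_variables(feature_names, sep=":"):
    """Return the variables provided by two or more sources.

    Assumes (hypothetically) that names look like "<source><sep><variable>".
    Names without the separator are treated as source-independent and skipped.
    """
    groups = defaultdict(list)
    for name in feature_names:
        _source, _, variable = name.partition(sep)
        if not variable:
            # No separator found: not tied to a specific source.
            continue
        groups[variable].append(name)
    # Only variables offered by at least two sources need aggregation.
    return {var: names for var, names in groups.items() if len(names) > 1}


# Hypothetical feature names, for illustration only.
features = [
    "model_a:air_temperature",
    "model_b:air_temperature",
    "model_a:wind_speed",
    "elevation",
]
duplicated = find_duplicated_variables(features)
# duplicated == {"air_temperature": ["model_a:air_temperature", "model_b:air_temperature"]}
```

Restricting the result to variables with at least two sources also means the aggregation routine would only run where it is actually needed.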

Cons

It adds a routine to the datamodule, which can slow down this part of the code a bit.

Cons for both solutions

The distribution of a variable is not going to be the same for different sources (e.g. simply because of the resolution), so we end up aggregating into one variable something that will have "jumps" in its underlying distribution at the lead times where the source changes.

@dnerini feel free to comment on this as we need to choose one way to proceed before implementing anything.

@louisPoulain louisPoulain added the enhancement New feature or request label Oct 18, 2024
@louisPoulain louisPoulain self-assigned this Oct 18, 2024