When working with different models that use different resolutions, each unavailable for some set of lead times, we currently end up passing a lot of `fill_value` to the network, possibly hindering performance.
First solution: lambda layer
Use a lambda layer to pick the source that has the best resolution. Usually, this is the model that is available for the smallest number of lead times.
This will require changing somewhat how data is handled in mlpp-lib, as we want the network to "automatically" know which sources it should aggregate.
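To make the idea concrete, here is a minimal sketch of such a merging layer in Keras; the input names, the fill value, and the `prefer_best_source` helper are all illustrative, not existing mlpp-lib code:

```python
import tensorflow as tf
from tensorflow.keras import layers

FILL_VALUE = -999.0  # placeholder for whatever fill value the pipeline uses

def prefer_best_source(tensors):
    """Keep the high-resolution source where it is available,
    fall back to the low-resolution source elsewhere."""
    hi, lo = tensors
    return tf.where(tf.not_equal(hi, FILL_VALUE), hi, lo)

# one input per source for the same physical variable (names are made up)
source_hi = layers.Input(shape=(1,), name="temperature_hi_res")
source_lo = layers.Input(shape=(1,), name="temperature_lo_res")
merged = layers.Lambda(prefer_best_source, name="merge_temperature")(
    [source_hi, source_lo]
)
# `merged` then feeds the rest of the network instead of the two raw inputs
```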
Pros
Code is clean, automatic, and efficient.
Performance should hopefully improve.
The user keeps the flexibility to use the lambda layer or not.
Cons
Currently the network receives a matrix of data (e.g., features × number of data points). A priori, the network has no way to know the variable names (introducing that functionality would require too many changes), so it will be hard to ensure that the network aggregates the correct sources.
Moreover, we need to ensure that this lambda routine is applied only when a variable is "proposed" by two or more different sources.
Another solution, which I think is better suited, is to introduce a new routine in the datamodule.
Second solution: remove/aggregate directly in the datamodule
The goal of this solution is to handle the data directly in the datamodule.
Pros
Code is still clean, and we can expose an option that lets the user enable or disable this feature (False by default to ensure backward compatibility).
In the datamodule we have access to the variable names, so it is easy to see which variables are "duplicated" (see the sketch below).
The model is created after the datamodule has been set up, so we don't need to change the number of input variables.
Cons
It adds a routine to the datamodule, which could slow down this stage of the code a bit.
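For illustration, here is a rough sketch of what such a routine could look like, assuming the features live in an xarray Dataset and that variable names encode their source with a separator (e.g. "air_temperature:model_hi"); both the naming scheme and the function are hypothetical, not existing mlpp-lib API:

```python
from collections import defaultdict

import xarray as xr

def aggregate_duplicate_sources(ds: xr.Dataset, sep: str = ":") -> xr.Dataset:
    """Merge variables "proposed" by two or more sources into one,
    filling gaps (NaNs) in the preferred source with the fallback."""
    groups = defaultdict(list)
    for name in ds.data_vars:
        groups[str(name).split(sep)[0]].append(name)
    out = {}
    for base, names in groups.items():
        merged = ds[names[0]]  # assume names are ordered by preference
        for name in names[1:]:
            merged = merged.combine_first(ds[name])
        out[base] = merged
    return xr.Dataset(out)
```

Since this runs before the model is built, the reduced variable set directly defines the network's input size.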
Cons for both solutions
The distribution of a variable will not be the same across sources (e.g., simply because of the resolution), so we end up aggregating into one variable something whose underlying distribution "jumps" at the lead times where the source changes.
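A toy illustration of this effect (the numbers and the lead-time split are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical: the high-res model covers the first 33 lead times, the low-res the rest
hi_res = rng.normal(loc=10.0, scale=1.0, size=33)
lo_res = rng.normal(loc=11.0, scale=2.5, size=88)
merged = np.concatenate([hi_res, lo_res])
# the merged series changes mean and spread right where the source switches,
# which the network sees as a discontinuity in the feature's distribution
```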
@dnerini feel free to comment on this as we need to choose one way to proceed before implementing anything.