
Introduce common input section that will be applied to all inputs #31

Open · leifdenby opened this issue Oct 30, 2024 · 4 comments
Labels: enhancement (New feature or request)

@leifdenby (Member)

The problem

In working on the datastores functionality in neural-lam I needed the ability to define the projection information for the grid-coordinates in the zarr datasets created by mllam-data-prep. In doing this I had a bit of a conundrum: where in the config should this info be stored? In the situation where all input datasets are given on the same projection, it would be ideal to define this information only once in the config, while making it clear that it applies to every single source. Currently, it isn't possible to set config parameters "globally" for all inputs.

The proposition here is to add such a common section.

For context, I have currently made (on a separate branch, #18, which I would like to replace with a better implementation) what I feel is a bit of a hack: adding an extra section to the config which the user can use to set things that mllam-data-prep should ignore but that can be used by other tools reading the same config. I don't really like this because the projection information is then neither used nor even validated by mllam-data-prep. For the domain-cropping that @TomasLandelius is planning to implement, having this projection information understood by mllam-data-prep will also be key, I think, so that mllam-data-prep knows how to get the latitude and longitude of each grid-point for each input dataset used in a config.

Proposition

Add a new, optional, specially-named input called __common__, the contents of which will be copied into all the other inputs.

So instead of writing:

...
inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
      grid_index:
        method: stack
        dims: [x, y]
    target_output_variable: state

  danra_surface:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/single_levels.zarr
    dims: [time, x, y]
    variables:
      # use surface incoming shortwave radiation as forcing
      - swavr0m
    dim_mapping:
      time:
        method: rename
        dim: time
      grid_index:
        method: stack
        dims: [x, y]
      forcing_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_output_variable: forcing

one could instead write:

...
inputs:
  __common__:
    dim_mapping:
      grid_index:
        method: stack
        dims: [x, y]

  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
    variables:
      u:
        altitude:
          values: [100,]
          units: m
      v:
        altitude:
          values: [100, ]
          units: m
    dim_mapping:
      time:
        method: rename
        dim: time
      state_feature:
        method: stack_variables_by_var_name
        dims: [altitude]
        name_format: f"{var_name}{altitude}m"
    target_output_variable: state

  danra_lsm:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/lsm.zarr
    dims: [x, y]
    variables:
      - lsm
    dim_mapping:
      static_feature:
        method: stack_variables_by_var_name
        name_format: f"{var_name}"
    target_output_variable: static

This would then enable me to add an optional projection config parameter, which I will do in a separate PR.

To keep this as simple as possible, it will be implemented so that collisions between keys in the common config and the config for a specific input aren't allowed. Merging the common config into the config of a specific input is in effect a merge of two tree structures: the leaf-nodes are the values in the input dictionary and the branch-nodes are the nested keys. The merge is allowed to introduce new keys, but not to overwrite existing values. A sketch of these semantics is given below.
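
To make the proposed merge semantics concrete, here is a minimal sketch in Python (the function and variable names are mine for illustration, not from the mllam-data-prep codebase), assuming the parsed config is a plain nested dict as loaded from YAML:

def merge_common(common, specific, path=""):
    """Merge the __common__ tree into one input's config, disallowing overwrites."""
    merged = dict(specific)
    for key, value in common.items():
        key_path = f"{path}.{key}" if path else key
        if key not in merged:
            # a new branch or leaf introduced by __common__: copy it in
            merged[key] = value
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            # both are branch-nodes: merge recursively
            merged[key] = merge_common(value, merged[key], key_path)
        else:
            # a value already set by the specific input: overwriting is not allowed
            raise ValueError(f"key collision at '{key_path}'")
    return merged

common = {"dim_mapping": {"grid_index": {"method": "stack", "dims": ["x", "y"]}}}
danra_surface = {"dim_mapping": {"time": {"method": "rename", "dim": "time"}}}
merged = merge_common(common, danra_surface)
# merged["dim_mapping"] now contains both "time" (from the input itself)
# and "grid_index" (copied in from __common__)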

@leifdenby added the enhancement label on Oct 30, 2024
@leifdenby (Member Author)

If you have time, @observingClouds, I would appreciate your thoughts on this :)

@observingClouds (Contributor)

Is there a specific reason other than trying to avoid duplication of entries? I feel like the config files are quite short so far, with only a few file definitions, and adding duplicate information wouldn't really reduce the readability.

In case we do include this common key, I would argue for having it outside of the inputs section, so as not to mix it with the dataset names, and to make validation of the config files easier, e.g.

...
global:
  inputs:
    dim_mapping:
      grid_index:
        method: stack
        dims: [x, y]
inputs:
  danra_height_levels:
    path: https://mllam-test-data.s3.eu-north-1.amazonaws.com/height_levels.zarr
    dims: [time, x, y, altitude]
...

@leifdenby (Member Author) commented Nov 1, 2024

> Is there a specific reason other than trying to avoid duplication of entries? I feel like the config files are quite short so far, with only a few file definitions, and adding duplicate information wouldn't really reduce the readability.

Yes, the main reason is that having repetitions makes errors hard to spot (e.g. typos where parts should be the same, but aren't), and sometimes it might even be unclear that some sections actually share parts of their config; this is basically the DRY principle :) This feature isn't a necessity though, but I thought it might be a nice way to keep the config tidier.

> In case we do include this common key, I would argue for having it outside of the inputs section, so as not to mix it with the dataset names, and to make validation of the config files easier.

Ok, interesting. I don't think this would make the config-file validation easier though, since my idea is simply to copy the values of the __common__ input to all the other inputs. That means __common__ has to allow the exact same key/value pairs and pass the same validation as the other inputs. The __ prefix/suffix follows the Python idea that "this is a special item that won't generally be used, but is a meta function/attribute". I was thinking one could, for example, also add a __default__ input whose values could be overwritten by individual inputs.

Having a "global" (as in the root of the config tree) global section suggests to me that we might want to make global overrides for other sections too. As in also for the output section. Do you think we might want to do that some day? Maybe there is a use for that...

@observingClouds (Contributor) commented Nov 1, 2024

Having a "global" (as in the root of the config tree) global section suggests to me that we might want to make global overrides for other sections too. As in also for the output section. Do you think we might want to do that some day? Maybe there is a use for that...

Yes, I could imagine that there might be use cases for that too.

> The __ prefix/suffix follows the Python idea that "this is a special item that won't generally be used, but is a meta function/attribute".

Yes, BUT currently you can, for example, just loop through the datasets (inputs); if you implement __common__ at the same level, you always need to check whether an entry is actually a dataset and not the __common__ section. An extra global attribute removes this ambiguity.
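
As a hypothetical snippet (the dataset names are just taken from the examples above), the guard every consumer would need looks like this:

inputs = {
    "__common__": {"dim_mapping": {"grid_index": {"method": "stack"}}},
    "danra_height_levels": {"path": "height_levels.zarr"},
    "danra_lsm": {"path": "lsm.zarr"},
}
for name, input_config in inputs.items():
    if name == "__common__":
        continue  # meta entry, not a dataset: easy to forget this check
    print(f"processing dataset: {name}")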
