
Processing Dataset Definition #118

Closed · kkappler opened this issue Oct 9, 2021 · 9 comments

kkappler commented Oct 9, 2021

We have a processing config. The other part of the TFKernel is the dataset that gets fed into the pipeline along with the config.

We need a standard for dataset-specification.

I suggest a table (dataframe) as the container, with a csv file as a first-cut user interface.

Dataset Specification can be one of two flavors (with others to be added in the future):

  1. Single Station
  2. Remote Reference

In both cases we need to know:
  • Local station ID: the location at which we are going to estimate the EMTF (sample the earth's conductivity)
  • Local station time intervals of data to be provided for analysis

For Remote Reference we also need:
  • Reference station ID (this can be None, in which case perhaps the two flavors do not need separate definitions?)
  • Reference station time intervals of data to be provided for analysis

Specifications:
  • The time intervals for any given station must be disjoint

I would specifically like to push the logic that validates the time intervals out of the first cut of this class, i.e. the checks that:

  • the data exist,
  • the data location is known,
  • RR data are available for all intervals in dataset_definition.csv,
  • etc.

Those tools can be built separately, and indeed Tim is already making good headway on these validations.
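
To make the first-cut csv interface concrete, here is a minimal sketch of what dataset_definition.csv might contain for the remote reference flavor. The column names and timestamps are hypothetical placeholders, not a settled convention:

import pandas as pd

# Hypothetical layout: one row per station time interval; "remote" flags the
# reference station.  None of these column names are final.
dataset_definition = pd.DataFrame(
    {
        "station_id": ["CAS04", "NVR11"],
        "remote": [False, True],
        "start_time": ["2020-01-01T00:00:00", "2020-01-01T00:00:00"],
        "end_time": ["2020-01-10T00:00:00", "2020-01-10T00:00:00"],
    }
)
dataset_definition.to_csv("dataset_definition.csv", index=False)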

kkappler commented:

relates to issue #80

kkappler self-assigned this Feb 19, 2022

kkappler commented Feb 19, 2022

After looking more carefully at this, it becomes clear that the designation of remote vs local is not really part of the dataset definition. That belongs in the config. The dataset specification concerns station-runs and any sub-intervals of that data to ignore/process.

The mth5 container already provides a list of stations and acquisition runs. This corresponds to station-intervals. We would like to have the option of specifying sub-intervals of data within a run to allow a user to ignore some chunks of data if desired.

In all cases, we wind up with lists of time series intervals, and these need to be merged somehow. There will be a list (of one or more runs, possibly clipped) associated with the primary station, and a second list (of zero or more runs, possibly clipped) associated with the reference station.

These can be stored in a dict keyed by station_id, so that we may be able to extend this model to multiple station processing without too much modification.

dataset_defn = {}
dataset_defn["CAS04"] = {}
dataset_defn["CAS04"]["a"] = []
dataset_defn["CAS04"]["b"] = [(t_start_1,  t_end_1), (t_start_2, t_end_2)]
dataset_defn["CAS04"]["c"] = []
dataset_defn["NVR11"] = {}
dataset_defn["NVR11"]["a"] = []
dataset_defn["NVR11"]["b"] = []
dataset_defn["NVR11"]["c"] = [(t0, t1), (t2, t3)]
dataset_defn["NVR11"]["d"] = []

This dictionary also has a flat, tabular representation:


station_id  run_id  start_time  end_time
CAS04       a
CAS04       b       t_start_1   t_end_1
CAS04       b       t_start_2   t_end_2
CAS04       c
NVR11       a
NVR11       b
NVR11       c       t0          t1
NVR11       c       t2          t3
NVR11       d
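
Just as an illustration of going between the two representations (the function name is made up; nothing like it exists in the code yet), the nested dict flattens to the table above like this:

import pandas as pd

def dataset_defn_to_df(dataset_defn):
    """Flatten the nested dict into the tabular form above.  An empty interval
    list means 'use the whole run', so such runs get one row with null times."""
    rows = []
    for station_id, runs in dataset_defn.items():
        for run_id, intervals in runs.items():
            if not intervals:
                rows.append((station_id, run_id, None, None))
            else:
                for t_start, t_end in intervals:
                    rows.append((station_id, run_id, t_start, t_end))
    return pd.DataFrame(rows, columns=["station_id", "run_id", "start_time", "end_time"])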

It is important to note that we do not need to worry at this point about how these lists are generated. They can be created by a person via a spreadsheet, machine generated, or produced by some other scheme.

We basically have a list of intervals.

Expressing the list like this seems completely general, and allows us to develop workflows where we choose to merge (or not merge) runs in the time or frequency domain, as well as to process with or without a "clock zero".

We could probably concatenate the time series from all runs for each station into a timestamped xarray, and by tracking the discontinuities in time (or similar bookkeeping) work with the runs individually. This would basically constitute the merged-run class.
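
As a rough sketch of that merged-run idea, assuming each run has already been loaded as an xarray Dataset indexed by time (how the Datasets are obtained from the mth5 is left out here):

import xarray as xr

def merge_station_runs(run_datasets):
    """Concatenate one station's runs along time, remembering where each run
    starts and ends so the runs can still be treated individually."""
    run_boundaries = [(ds.time.values[0], ds.time.values[-1]) for ds in run_datasets]
    merged = xr.concat(run_datasets, dim="time")
    return merged, run_boundaries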

kkappler commented:

One more note about the run handling. It would be good to consider the case where we may have different time intervals per channel. This shouldn't be an issue for the current code, but it will come up if we wind up doing full multiple station processing (like MMT), where we compute a spectral density matrix (SDM) that considers cross-powers of all channel pairs. It is possible that particular channels at a station may need to be suppressed for some time intervals, or alternatively that one channel may have several time intervals that are "good" where other channels do not.

This basically amounts to a channel-by-channel approach to processing, rather than station-by-station. It could actually be handled by suppressing various channel-intervals in a separate data structure within the TFKernel. We will not implement this now, but will keep it in mind during development.
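
Purely as an illustration of that bookkeeping (station, channel, and interval values are hypothetical), the suppressed channel-intervals could live in something as simple as a dict keyed by (station_id, channel_id):

channel_exclusions = {
    ("CAS04", "ex"): [("2020-01-03T00:00:00", "2020-01-04T00:00:00")],
    ("CAS04", "hy"): [],
}

def channel_is_excluded(station_id, channel_id, t_start, t_end):
    """True if (t_start, t_end) overlaps any excluded interval for this channel.
    Times are same-format ISO strings here, so lexicographic comparison works."""
    for bad_start, bad_end in channel_exclusions.get((station_id, channel_id), []):
        if (t_start < bad_end) and (t_end > bad_start):
            return True
    return False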

kkappler added a commit that referenced this issue Feb 26, 2022
Modified process_mth5 by creating process_mth5_runs, nearly a copy of process_mth5_run.
This method will take lists of runs to process and eventually supersede process_mth5_run.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 4, 2022
Modified process_mth5 to loop over runs and create merged FC object.
Testing ok on decimation level zero.  Now need to add looping over
decimation levels.

[Issue(s): #31, #118, #80]

kkappler commented Mar 8, 2022

Note that run_ids will, in general, be different at the local and remote stations. This data structure allows for that, but the current code in process_mth5.py does not: it takes only a run_id, and is being modified to take a list of run_ids, but the run_ids for the reference station are not explicitly addressed.

A sensible next step would be to have a method of the ProcessingDatasetDefinition class that fills in the time intervals explicitly, even before process_mth5.process_mth5_runs() is called.

It seems like ProcessingDatasetDefinition should have an attribute to designate the local_station. This is important because once the local station is defined, then any time intervals for which a remote reference station might be processed are restricted to the time intervals for which local_station has data. Basically local_station_coverage.intersect(remote_station_coverage).

By filling in the default values (when an acquisition run is declared but a time interval is not), this will force us to address/test the sub-interval run data extraction, and to solidify a convention (such as half-open, lower-closed) for the time_interval. Incidentally, this time interval should be constructed with pd.Interval if we are ever going to address issue #134.
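
For example, both the half-open (lower-closed) convention and the coverage intersection can be expressed directly with pd.Interval; the timestamps below are placeholders:

import pandas as pd

local_coverage = pd.Interval(
    pd.Timestamp("2020-01-01T00:00:00"),
    pd.Timestamp("2020-01-10T00:00:00"),
    closed="left",  # half-open: start included, end excluded
)
remote_coverage = pd.Interval(
    pd.Timestamp("2019-12-28T00:00:00"),
    pd.Timestamp("2020-01-07T00:00:00"),
    closed="left",
)

# Remote reference processing is only possible where both stations have data.
if local_coverage.overlaps(remote_coverage):
    usable = pd.Interval(
        max(local_coverage.left, remote_coverage.left),
        min(local_coverage.right, remote_coverage.right),
        closed="left",
    )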

Formalizing this class and implementing it in process_mth5.process_mth5_runs() and process_mth5.process_mth5_run() is probably the cleanest way to move forward on the issue31 branch (issue #31).


kujaku11 commented Mar 9, 2022

@kkappler What about something like this, which provides places to hold all the information? How the information gets in there could be your dataframe. This comes from aurora.config.Stations on branch issue_153.

{
    "stations": {
        "local.id": null,
        "local.mth5_path": null,
        "local.remote": false,
        "local.runs": [
            {
                "run": {
                    "id": [
                        "None"
                    ],
                    "input_channels": [
                        {
                            "channel": {
                                "id": "hx",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "hy",
                                "scale_factor": 1.0
                            }
                        }
                    ],
                    "output_channels": [
                        {
                            "channel": {
                                "id": "hz",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "ex",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "ey",
                                "scale_factor": 1.0
                            }
                        }
                    ],
                    "sample_rate": -1.0,
                    "time_periods": []
                }
            }
        ],
        "remote": [
            {
                "station": {
                    "id": "rr",
                    "mth5_path": null,
                    "remote": true,
                    "runs": [
                        {
                    "run": {
                        "id": [
                            "None"
                        ],
                        "input_channels": [
                            {
                                "channel": {
                                    "id": "hx",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "hy",
                                    "scale_factor": 1.0
                                }
                            }
                        ],
                        "output_channels": [
                            {
                                "channel": {
                                    "id": "hz",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "ex",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "ey",
                                    "scale_factor": 1.0
                                }
                            }
                        ],
                        "sample_rate": -1.0,
                        "time_periods": []
                    }
                }
                ]
                }
            }
        ]
    }
}

kkappler commented:

@kujaku11 this data structure looks good. Let's discuss how to implement it at the next tag-up.

One other thing to bear in mind here: DatasetDefinition tells us about the available data that we will load and process, but it could also be used to describe the intervals in between, together with instructions for an in-fill process, such as interpolation or some other technique of synthetic data generation.
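
For instance, given the sorted, disjoint intervals of one station, the in-between spans that such an in-fill step would have to describe can be enumerated like this (just a sketch, not part of any class yet):

def gaps_between(intervals):
    """intervals: sorted, disjoint list of (start, end) tuples.
    Returns the spans between consecutive intervals."""
    return [
        (end_a, start_b)
        for (_, end_a), (start_b, _) in zip(intervals[:-1], intervals[1:])
        if start_b > end_a
    ]

# e.g. gaps_between([(0, 10), (15, 20), (20, 30)]) -> [(10, 15)]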

kkappler commented:

Also, DatasetDefinition may need a column that points to the mth5 file associated with the data ... might this cause issues if runs are somehow split across multiple mth5s?

kkappler added a commit that referenced this issue Mar 14, 2022
Have basic run merging for single station working
-Need to move all single station processing using process_mth5_run
to process_mth5_from_dataset_definition
-Then make it work for RR
-Test on Parkfield (single run) and then test on CAS04 with multiple runs

[Issue(s): #118, #80, #132]
kkappler added a commit that referenced this issue Apr 17, 2022
No longer need test1.h5 for test_compare_aurora_vs_archived_emtf.py
Also, using the dataset df that comes from the mth5 to define the
dataset before calling the config creator in make_processing_configs

[Issue(s): #165, #118, #132]
kkappler mentioned this issue Apr 22, 2022

kkappler commented May 30, 2022

See note 3 in process_mth5.populate_dataset_df about trimming the local and remote time series so that only overlapping data are kept:

3.  Dataset_df should be easy to generate from the local_station_id,
    remote_station_id, local_run_list, remote_run_list, but allows specification of
    time_intervals.  This is important in the case where acquisition_runs are
    non-overlapping between local and remote.  Although, theoretically, merging on
    the FCs should produce NaNs in the places where there is no overlapping data,
    and these should be dropped in the TF portion of the code.  However,
    time intervals where the data do not have coverage at both stations can be
    identified in a method before GET TIME SERIES in a future version.
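
A toy illustration of the merge behavior described above (not the actual FC objects): merging local and remote arrays on a shared time axis leaves NaN wherever only one station has data, and dropping those times keeps only the jointly covered windows.

import numpy as np
import xarray as xr

local_fc = xr.DataArray(np.arange(4.0), coords={"time": [0, 1, 2, 3]}, dims="time", name="local")
remote_fc = xr.DataArray(np.arange(4.0), coords={"time": [2, 3, 4, 5]}, dims="time", name="remote")

merged = xr.merge([local_fc, remote_fc])   # outer join on time -> NaN where coverage differs
overlap_only = merged.dropna(dim="time")   # keeps only time = 2, 3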

Also note 2 in process_mth5.process_mth5:

2.  ToDo: Based on the run durations and sampling rates, determine which runs
    are valid for which decimation levels, or for which effective sample rates.  This
    action should be taken before we get here.  The dataset_definition should already
    be trimmed to exactly what will be processed.
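
A hedged sketch of that pre-check; the window length and the exact criterion are illustrative, not aurora's actual rule:

def run_valid_for_decimation(run_duration_s, sample_rate, decimation_factor, window_num_samples=128):
    """A run is only worth keeping at a decimation level if it still spans at
    least one STFT window after decimation."""
    effective_sample_rate = sample_rate / decimation_factor
    return run_duration_s * effective_sample_rate >= window_num_samples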

Also see note 1 in process_mth5.make_stft_objects():

Note 1: CHECK DATA COVERAGE IS THE SAME IN BOTH LOCAL AND RR
    This should be pushed into a validator that runs before the pipeline starts.
    # if config.reference_station_id:
    #     local_run_xrts = local_run_xrts.where(
    #         local_run_xrts.time <= remote_run_xrts.time[-1]
    #     ).dropna(dim="time")

kkappler added a commit that referenced this issue Jun 24, 2022
This is just a stage commit because all tests are passing currently.
operate_aurora is not yet working.

Need to decide where to put the RunSummary wrangling.

[Issue(s): #80, #118, #132]
kkappler commented:

This is now working in KernelDataset.
