
Processing Dataset Definition #118

Closed · kkappler opened this issue Oct 9, 2021 · 9 comments

kkappler commented Oct 9, 2021

We have a processing config. The other part of the TFKernel is the dataset that gets fed into the pipeline along with the config.

We need a standard for dataset-specification.

I suggest a table (dataframe) as the container, with a csv file as a first-cut user interface.

Dataset Specification can be one of two flavors (with others to be added in the future):

  1. Single Station
  2. Remote Reference

In both cases we need to know:
  • Local station ID: the location at which we are going to estimate the EMTF (sample the earth's conductivity)
  • Local station time intervals of data to be provided for analysis

For Remote Reference we also need:
  • Reference station ID (this can be None, in which case perhaps the two flavors do not need separate definitions?)
  • Reference station time intervals of data to be provided for analysis

Specifications:
  • The time intervals for any given station must be disjoint

I would specifically like to push the logic that validates the time intervals out of the first cut of this class, i.e. the checks that:

  • the data exist,
  • the data location is known,
  • RR data are available for all intervals in dataset_definition.csv,
  • etc.

Those tools can be built separately, and indeed Tim is already making good headway on these validations.
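
To make the first-cut csv interface concrete, here is a minimal sketch of what dataset_definition.csv might contain for the remote reference flavor. The column names and timestamps are hypothetical placeholders, not a settled convention:

import pandas as pd

# Hypothetical layout: one row per station time interval; "remote" flags the
# reference station.  None of these column names are final.
dataset_definition = pd.DataFrame(
    {
        "station_id": ["CAS04", "NVR11"],
        "remote": [False, True],
        "start_time": ["2020-01-01T00:00:00", "2020-01-01T00:00:00"],
        "end_time": ["2020-01-10T00:00:00", "2020-01-10T00:00:00"],
    }
)
dataset_definition.to_csv("dataset_definition.csv", index=False)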

kkappler commented:

relates to issue #80

kkappler self-assigned this Feb 19, 2022

kkappler commented Feb 19, 2022

After looking more carefully at this, it becomes clear that the designation of remote vs local is not really part of the dataset definition. That belongs in the config. The dataset specification concerns station-runs and any sub-intervals of that data to ignore/process.

The mth5 container already provides a list of stations and acquisition runs. This corresponds to station-intervals. We would like to have the option of specifying sub-intervals of data within a run to allow a user to ignore some chunks of data if desired.

In all cases, we wind up with lists of time series intervals, and these need to be merged somehow. There will be a list (of one or more runs, possibly clipped) associated with the primary station, and a second list (of zero or more runs, possibly clipped) associated with the reference station.

These can be stored in a dict keyed by station_id, so that we may be able to extend this model to multiple station processing without too much modification.

dataset_defn = {}
dataset_defn["CAS04"] = {}
dataset_defn["CAS04"]["a"] = []
dataset_defn["CAS04"]["b"] = [(t_start_1,  t_end_1), (t_start_2, t_end_2)]
dataset_defn["CAS04"]["c"] = []
dataset_defn["NVR11"] = {}
dataset_defn["NVR11"]["a"] = []
dataset_defn["NVR11"]["b"] = []
dataset_defn["NVR11"]["c"] = [(t0, t1), (t2, t3)]
dataset_defn["NVR11"]["d"] = []

This dictionary also has a flat, tabular representation:


station_id  run_id  start_time  end_time
CAS04       a
CAS04       b       t_start_1   t_end_1
CAS04       b       t_start_2   t_end_2
CAS04       c
NVR11       a
NVR11       b
NVR11       c       t0          t1
NVR11       c       t2          t3
NVR11       d
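
Just as an illustration of going between the two representations (the function name is made up; nothing like it exists in the code yet), the nested dict flattens to the table above like this:

import pandas as pd

def dataset_defn_to_df(dataset_defn):
    """Flatten the nested dict into the tabular form above.  An empty interval
    list means 'use the whole run', so such runs get one row with null times."""
    rows = []
    for station_id, runs in dataset_defn.items():
        for run_id, intervals in runs.items():
            if not intervals:
                rows.append((station_id, run_id, None, None))
            else:
                for t_start, t_end in intervals:
                    rows.append((station_id, run_id, t_start, t_end))
    return pd.DataFrame(rows, columns=["station_id", "run_id", "start_time", "end_time"])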

It is important to note that we do not need to worry at this point about how these lists are generated. They can be created by a person via a spreadsheet, machine generated, or produced by some other scheme.

We basically have a list of intervals.

Expressing the list like this seems completely general, and allows us to develop workflows where we choose to merge (or not merge) runs in the time or frequency domain, as well as to process with or without a "clock zero".

We could probably concatenate the time series from all runs for each station into a timestamped xarray, and by tracking the discontinuities in time (or similar bookkeeping) work with the runs individually. This would basically constitute the merged-run class.
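
As a rough sketch of that merged-run idea, assuming each run has already been loaded as an xarray Dataset indexed by time (how the Datasets are obtained from the mth5 is left out here):

import xarray as xr

def merge_station_runs(run_datasets):
    """Concatenate one station's runs along time, remembering where each run
    starts and ends so the runs can still be treated individually."""
    run_boundaries = [(ds.time.values[0], ds.time.values[-1]) for ds in run_datasets]
    merged = xr.concat(run_datasets, dim="time")
    return merged, run_boundaries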

kkappler commented:

One more note about the run handling. It would be good to consider the case where we may have different time intervals per channel. This shouldn't be an issue for the current code, but it will come up if we wind up doing full multiple station processing (like MMT), where we compute a spectral density matrix (SDM) that considers cross-powers of all channel pairs. It is possible that particular channels at a station may need to be suppressed for some time intervals, or alternatively that one channel may have several time intervals that are "good" where other channels do not.

This basically amounts to a channel-by-channel approach to processing, rather than station-by-station. It could actually be handled by suppressing various channel-intervals in a separate data structure within the TFKernel. We will not implement this now, but will keep it in mind during development.
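
Purely as an illustration of that bookkeeping (station, channel, and interval values are hypothetical), the suppressed channel-intervals could live in something as simple as a dict keyed by (station_id, channel_id):

channel_exclusions = {
    ("CAS04", "ex"): [("2020-01-03T00:00:00", "2020-01-04T00:00:00")],
    ("CAS04", "hy"): [],
}

def channel_is_excluded(station_id, channel_id, t_start, t_end):
    """True if (t_start, t_end) overlaps any excluded interval for this channel.
    Times are same-format ISO strings here, so lexicographic comparison works."""
    for bad_start, bad_end in channel_exclusions.get((station_id, channel_id), []):
        if (t_start < bad_end) and (t_end > bad_start):
            return True
    return False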

kkappler added a commit that referenced this issue Feb 26, 2022
Modified process_mth5 by creating process_mth5_runs, nearly a copy of process_mth5_run.
This method will take lists of runs to process and eventually supersede process_mth5_run.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 4, 2022
Modified process_mth5 to loop over runs and create merged FC object.
Testing ok on decimation level zero.  Now need to add looping over
decimation levels.

[Issue(s): #31, #118, #80]

kkappler commented Mar 8, 2022

Note that run_ids will, in general, be different at the local and remote stations. This data structure allows for that, but the current code in process_mth5.py does not: it takes only a run_id, and is being modified to take a list of run_ids, but the run_ids for the reference station are not explicitly addressed.

A sensible next step would be to have a method of the ProcessingDatasetDefinition class that fills in the time intervals explicitly, even before process_mth5.process_mth5_runs() is called.

It seems like ProcessingDatasetDefinition should have an attribute to designate the local_station. This is important because once the local station is defined, then any time intervals for which a remote reference station might be processed are restricted to the time intervals for which local_station has data. Basically local_station_coverage.intersect(remote_station_coverage).

By filling in the default values (when an acquisition run is declared but a time interval is not), this will force us to address/test the sub-interval run data extraction, and to solidify a convention (such as half-open, lower-closed) for the time_interval. Incidentally, this time interval should be constructed with pd.Interval if we are ever going to address issue #134.
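
For example, both the half-open (lower-closed) convention and the coverage intersection can be expressed directly with pd.Interval; the timestamps below are placeholders:

import pandas as pd

local_coverage = pd.Interval(
    pd.Timestamp("2020-01-01T00:00:00"),
    pd.Timestamp("2020-01-10T00:00:00"),
    closed="left",  # half-open: start included, end excluded
)
remote_coverage = pd.Interval(
    pd.Timestamp("2019-12-28T00:00:00"),
    pd.Timestamp("2020-01-07T00:00:00"),
    closed="left",
)

# Remote reference processing is only possible where both stations have data.
if local_coverage.overlaps(remote_coverage):
    usable = pd.Interval(
        max(local_coverage.left, remote_coverage.left),
        min(local_coverage.right, remote_coverage.right),
        closed="left",
    )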

Formalizing this class and implementing it in process_mth5.process_mth5_runs() and process_mth5.process_mth5_run() is probably the cleanest way to move forward on the issue31 branch (issue #31).


kujaku11 commented Mar 9, 2022

@kkappler What about something like this, which provides places to hold all the information? How the information gets in there could be your dataframe. This comes from aurora.config.Stations on branch issue_153.

{
    "stations": {
        "local.id": null,
        "local.mth5_path": null,
        "local.remote": false,
        "local.runs": [
            {
                "run": {
                    "id": [
                        "None"
                    ],
                    "input_channels": [
                        {
                            "channel": {
                                "id": "hx",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "hy",
                                "scale_factor": 1.0
                            }
                        }
                    ],
                    "output_channels": [
                        {
                            "channel": {
                                "id": "hz",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "ex",
                                "scale_factor": 1.0
                            }
                        },
                        {
                            "channel": {
                                "id": "ey",
                                "scale_factor": 1.0
                            }
                        }
                    ],
                    "sample_rate": -1.0,
                    "time_periods": []
                }
            }
        ],
        "remote": [
            {
                "station": {
                    "id": "rr",
                    "mth5_path": null,
                    "remote": true,
                    "runs": [
                        {
                    "run": {
                        "id": [
                            "None"
                        ],
                        "input_channels": [
                            {
                                "channel": {
                                    "id": "hx",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "hy",
                                    "scale_factor": 1.0
                                }
                            }
                        ],
                        "output_channels": [
                            {
                                "channel": {
                                    "id": "hz",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "ex",
                                    "scale_factor": 1.0
                                }
                            },
                            {
                                "channel": {
                                    "id": "ey",
                                    "scale_factor": 1.0
                                }
                            }
                        ],
                        "sample_rate": -1.0,
                        "time_periods": []
                    }
                }
                ]
                }
            }
        ]
    }
}

kkappler commented:

@kujaku11 this data structure looks good. Let's discuss how to implement it at the next tag-up.

One other thing to bear in mind here: DatasetDefinition tells us about the available data that we will load and process, but it could also be used to describe the intervals in between, together with instructions for an in-fill process, such as interpolation or some other technique of synthetic data generation.
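
For instance, given the sorted, disjoint intervals of one station, the in-between spans that such an in-fill step would have to describe can be enumerated like this (just a sketch, not part of any class yet):

def gaps_between(intervals):
    """intervals: sorted, disjoint list of (start, end) tuples.
    Returns the spans between consecutive intervals."""
    return [
        (end_a, start_b)
        for (_, end_a), (start_b, _) in zip(intervals[:-1], intervals[1:])
        if start_b > end_a
    ]

# e.g. gaps_between([(0, 10), (15, 20), (20, 30)]) -> [(10, 15)]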

kkappler commented:

Also, DatasetDefinition may need a column that points to the mth5 file associated with the data ... might this cause issues if runs are somehow split across multiple mth5s?

kkappler added a commit that referenced this issue Mar 14, 2022
Have basic run merging for single station working
-Need to move all single station processing using process_mth5_run
to process_mth5_from_dataset_definition
-Then make it work for RR
-Test on Parkfield (single run) and then test on CAS04 with multiple runs

[Issue(s): #118, #80, #132]
kkappler added a commit that referenced this issue Apr 17, 2022
No longer need test1.h5 for test_compare_aurora_vs_archived_emtf.py
Also, using the dataset df that comes from the mth5 to define the
dataset before calling the config creator in make_processing_configs

[Issue(s): #165, #118, #132]
kkappler mentioned this issue Apr 22, 2022

kkappler commented May 30, 2022

See note 3 in process_mth5.populate_dataset_df about trimming the local and remote time series so that only overlapping data are kept:

3.  Dataset_df should be easy to generate from the local_station_id,
    remote_station_id, local_run_list, remote_run_list, but allows specification of
    time_intervals.  This is important in the case where acquisition_runs are
    non-overlapping between local and remote.  Although, theoretically, merging on
    the FCs should produce NaNs in the places where there is no overlapping data,
    and these should be dropped in the TF portion of the code.  However,
    time intervals where the data do not have coverage at both stations can be
    identified in a method before GET TIME SERIES in a future version.
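
A toy illustration of the merge behavior described above (not the actual FC objects): merging local and remote arrays on a shared time axis leaves NaN wherever only one station has data, and dropping those times keeps only the jointly covered windows.

import numpy as np
import xarray as xr

local_fc = xr.DataArray(np.arange(4.0), coords={"time": [0, 1, 2, 3]}, dims="time", name="local")
remote_fc = xr.DataArray(np.arange(4.0), coords={"time": [2, 3, 4, 5]}, dims="time", name="remote")

merged = xr.merge([local_fc, remote_fc])   # outer join on time -> NaN where coverage differs
overlap_only = merged.dropna(dim="time")   # keeps only time = 2, 3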

Also note 2 in process_mth5.process_mth5:

2.  ToDo: Based on the run durations and sampling rates, determine which runs
    are valid for which decimation levels, or for which effective sample rates.  This
    action should be taken before we get here.  The dataset_definition should already
    be trimmed to exactly what will be processed.
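
A hedged sketch of that pre-check; the window length and the exact criterion are illustrative, not aurora's actual rule:

def run_valid_for_decimation(run_duration_s, sample_rate, decimation_factor, window_num_samples=128):
    """A run is only worth keeping at a decimation level if it still spans at
    least one STFT window after decimation."""
    effective_sample_rate = sample_rate / decimation_factor
    return run_duration_s * effective_sample_rate >= window_num_samples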

Also see note 1 in process_mth5.make_stft_objects():

Note 1: CHECK DATA COVERAGE IS THE SAME IN BOTH LOCAL AND RR
    This should be pushed into a validator that runs before the pipeline starts.
    # if config.reference_station_id:
    #     local_run_xrts = local_run_xrts.where(
    #         local_run_xrts.time <= remote_run_xrts.time[-1]
    #     ).dropna(dim="time")

kkappler added a commit that referenced this issue Jun 24, 2022
This is just a stage commit because all tests are passing currently.
operate_aurora is not yet working.

Need to decide where to put the RunSummary wrangling.

[Issue(s): #80, #118, #132]
kkappler commented:

This is now working in KernelDataset.
