
Handling of multiple acquisition runs as a single processing run #80

Closed

kkappler opened this issue Aug 27, 2021 · 4 comments
kkappler (Collaborator) commented Aug 27, 2021
Use Cases:

  1. Several long-period runs, possibly broken up by a power outage, for example
  2. Regular high-frequency, short-duration acquisitions, e.g. ZEN

Case 2 can be handled by breaking process_mth5_decimation_level into:
stft_agg = []
for run in run_list:
    stft_obj = make_stft_decimation_level(run)
    stft_agg.append(stft_obj)
tf_obj = process_stft_decimation_level(stft_agg)
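
A slightly fleshed-out sketch of that factoring, assuming the per-run STFTs come back as xarray objects sharing a time dimension (make_stft_decimation_level and process_stft_decimation_level are the proposed, not-yet-existing helpers):

import xarray as xr

def process_runs_at_decimation_level(run_list, decimation_level):
    """Sketch: STFT each acquisition run separately, then merge the
    spectrograms before a single TF-estimation pass."""
    stft_agg = []
    for run in run_list:
        # Proposed helper: returns an xarray of Fourier coefficients
        # for one run at one decimation level.
        stft_agg.append(make_stft_decimation_level(run, decimation_level))
    # Concatenating along time turns the disjoint runs into one
    # collection of spectral observations.
    merged = xr.concat(stft_agg, dim="time")
    # Proposed helper: runs the TF estimation on the merged spectrogram.
    return process_stft_decimation_level(merged)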

kkappler (Collaborator, Author) commented Oct 3, 2021

A ProcessingRunTS class could be used to manage, or at least reference, the data. This would act to merge runs / process mixed runs.

Desired elements:

  • The ability to slice mth5 runs based on time interval (see the sketch after this list). A workaround was used to put this functionality into the VA Tech analysis, but it will be needed in general for trimming remote references. This could be done by trimming an existing run ts or by loading with a time-interval argument.
  • A method for handling changes in metadata from run to run (this should actually be handled by a new station instance). In any case, if an instrument is swapped out, we want the previous and future runs to relate to the same location. It is not clear whether we would want to mix runs from different instruments in a single processing job; if we did, using the FC class as an interface would be useful.
  • Metadata standards changes: generally, how do we handle multiple runs in the new vs. the old metadata? When the time intervals are explicitly enumerated it is pretty straightforward, but when the stream comes back broken up we need to detect and build runs on the fly, or repair/mend the time series.
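
A rough sketch of the interval slicing, assuming the run data sit in an xarray Dataset with a "time" coordinate (the function and argument names are illustrative, not the mth5 API):

import pandas as pd

def slice_run_by_interval(run_xrds, start, end):
    """Trim an xarray Dataset of run data to [start, end].

    run_xrds: xarray.Dataset with a "time" coordinate (illustrative).
    start, end: anything pandas can parse into a Timestamp.
    """
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    # Label-based slicing on the time coordinate; endpoints inclusive.
    return run_xrds.sel(time=slice(start, end))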

Run merging requirements:
  • must handle an arbitrary number of runs
  • must handle decimation

NaN filling is a general solution, with two potential complications:
  (a) big gaps could overload RAM
  (b) filtering edge effects
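
A minimal sketch of that NaN-fill approach, assuming the runs are held as pandas Series with DatetimeIndex (the function name and representation are illustrative, not the mth5 API):

import pandas as pd

def merge_runs_with_nan_fill(runs, sample_rate):
    """Concatenate runs and reindex onto a regular time axis,
    filling the gaps between runs with NaN.

    runs: list of non-overlapping pandas Series with DatetimeIndex,
        whose samples fall on the regular grid implied by sample_rate.
    sample_rate: samples per second.
    Caveat (complication a): a long gap inflates the reindexed array,
    so very large gaps could overload RAM.
    """
    merged = pd.concat(runs).sort_index()
    dt = pd.Timedelta(seconds=1.0 / sample_rate)
    full_axis = pd.date_range(merged.index[0], merged.index[-1], freq=dt)
    return merged.reindex(full_axis)  # missing samples become NaN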

kkappler (Collaborator, Author) commented Nov 11, 2021

Ideally we want a model that is completely general (for disjoint time series). The TSCollection is associated with a list of time intervals ℐ0 = { (a,b)_i } such that all data to be processed lie in the union ∪_i (a,b)_i. The individual elements of ℐ0 are normally acquisition runs, or intervals properly contained in acquisition runs, but we want to be careful not to exclude the case of joining acquisition runs via some gap-fill technique. For example, a few long acquisition runs with only a short gap in between may need to be processed for very long periods (longer than either acquisition run can yield alone).

A companion set of intervals, where synthetic data (interpolation, etc.) can be overlain on the original set, basically specifies intervals that should be infilled so that runs can be treated as continuous. This is particularly useful in the case of a few missing samples, but could have wider application.
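
As a sketch, the interval bookkeeping described in the last two paragraphs could look like this (the class and field names are hypothetical):

from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]  # (a, b) as epoch seconds, for illustration

@dataclass
class TSCollectionIntervals:
    """Hypothetical sketch of the interval bookkeeping above.

    data_intervals: the elements of ℐ0 -- acquisition runs, or
        sub-intervals of them; all data to process lie in their union.
    infill_intervals: the companion set, where synthetic data may be
        overlain so runs can be treated as continuous.
    """
    data_intervals: List[Interval] = field(default_factory=list)
    infill_intervals: List[Interval] = field(default_factory=list)

    def covers(self, t: float) -> bool:
        """True if time t lies in the union of data or infill intervals."""
        return any(a <= t <= b
                   for a, b in self.data_intervals + self.infill_intervals)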

@kkappler changed the title from "Handling of multiple acqusition runs as a single processing run" to "Handling of multiple acquisition runs as a single processing run" Nov 13, 2021
@kkappler self-assigned this Feb 13, 2022
kkappler (Collaborator, Author) commented Feb 13, 2022

The place where this will be implemented in the code is the function process_mth5_run in aurora/pipelines/process_mth5.py

The current function structure is:

def process_mth5_run(
    run_cfg,
    run_id,
    units="MT",
    show_plot=False,
    z_file_path=None,
    return_collection=True,
    **kwargs,
):

To support multiple runs we could optionally replace run_id (currently a string) with a list of strings, each specifying a run. Implementing this change does not look too complicated, and the function structure would stay very similar. Instead of extracting a single run, computing its STFT, and processing, we would extract each run in the list and STFT each individually. The STFTs would then be merged into one xarray of spectral measurements, and that array would be passed to the TF estimation method.
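
A hedged sketch of that change, normalizing run_id to a list and merging the per-run STFTs (extract_run, compute_stft, and estimate_tf are hypothetical stand-ins, not the actual aurora helpers):

import xarray as xr

def process_mth5_run(run_cfg, run_id, units="MT", **kwargs):
    # Accept either a single run label or a list of labels.
    run_ids = [run_id] if isinstance(run_id, str) else list(run_id)

    stft_list = []
    for rid in run_ids:
        run_obj = extract_run(run_cfg, rid)      # hypothetical helper
        stft_list.append(compute_stft(run_obj))  # hypothetical helper
    # Merge the per-run spectral measurements into one xarray and hand
    # it to the (single) TF estimation step.
    merged_stft = xr.concat(stft_list, dim="time")
    return estimate_tf(merged_stft, units=units)  # hypothetical helper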

This solution should work in general for single station processing.

For multiple-station processing there is one more layer to consider. The run labels will not, in general, be the same for different stations, so we would need an iterable of runs for the station of interest and another for the remote reference station. The determination of which runs will be processed is currently not supported.
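
The bookkeeping that implies could be as simple as a per-station mapping (purely illustrative; the remote station label below is made up):

# Per-station run lists for a remote-reference job; the labels differ
# between the local and remote stations, so each needs its own iterable.
processing_runs = {
    "CAS04": ["c", "d"],          # local station runs
    "REMOTE01": ["a", "b", "c"],  # hypothetical remote reference runs
}

for station_id, run_ids in processing_runs.items():
    for rid in run_ids:
        ...  # extract, STFT, and tag each spectrogram with station_id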

When there are many stations (MMT) we would need to handle many subcases; we might need another version of process_mth5_run for MMT.

kkappler added a commit that referenced this issue Feb 26, 2022
Modified process_mth5 by creating process_mth5_runs, nearly a copy of process_mth5_run.
This method will take lists of runs to process and eventually supersede process_mth5_run.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 4, 2022
Modified process_mth5 to loop over runs and create merged FC object.
Testing ok on decimation level zero.  Now need to add looping over
decimation levels.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 14, 2022
Have basic run merging for single station working.
- Need to move all single-station processing from process_mth5_run
to process_mth5_from_dataset_definition
- Then make it work for RR
- Test on Parkfield (single run) and then test on CAS04 with multiple runs

[Issue(s): #118, #80, #132]
kkappler added a commit that referenced this issue Mar 26, 2022
Replaced config with expected_sample_rate in validate_sample_rate,
and added an import from mt_metadata in pipelines/helpers.py
in anticipation of the merge.

[Issue(s): #153, #80]
@kkappler mentioned this issue in a merged pull request, Apr 22, 2022
kkappler added a commit that referenced this issue Jun 7, 2022
While working on issue #80 and PR #184, have noticed that the processing
config defaults to estimator.engine = "RME_RR".  This is fine, but
I find I need to specify "RME" explicitly when there is only one
station.  So a couple of fixes were added:
1. The Processing class now has a validate() method.  If there is no RR
station, _and_ the estimator.engine is "RME_RR", it gets reset to
"RME".
2. Added the ability to pass a kwarg called estimator to a ConfigCreator
instance.  The kwarg is a dict, and if "engine" is a key, it will
overwrite the estimator engine with the corresponding value.

The parkfield SS run test was updated to use the config_creator method.
The cas04 test is using validate().

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 18, 2022
Using the updated method in mth5 locally (see mth5 issue #105),
am now able to process runs c and d for CAS04 as a single station.

Working on getting a similar h5 built in tests/cas04

[Issue(s): #31, #80]
kkappler added a commit that referenced this issue Jun 19, 2022
Allow the request list to have multiple stations and modify
channel_summary_to_make_mth5 to group by (station, run) rather
than just run.  Add tests of making a multistation mth5 to the cas04 tests.

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 19, 2022
- Replace DatasetDefinition with Dataset, imported as TFKDataset
- Replace dataset_definition with tfk_dataset

[Issue(s): #80, #132]
kkappler added a commit that referenced this issue Jun 24, 2022
This is just a stage commit because all tests are passing currently.
operate_aurora is not yet working.

Need to decide where to put the RunSummary wrangling.

[Issue(s): #80, #118, #132]
kkappler added a commit that referenced this issue Jun 25, 2022
Replaced dicts with classes; now have a SyntheticRun and a
SyntheticStation.
These will be used to create an example synthetic case with many runs.

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 26, 2022
Change from timedelta.seconds to timedelta.total_seconds().
Remove run_id from sort_by; it should be only (station, starttime).

[Issue(s): #80]
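
For reference, the reason that change matters: timedelta.seconds is only the seconds field of the duration (0–86399), while total_seconds() is the whole duration:

from datetime import timedelta

dt = timedelta(days=2, seconds=30)
print(dt.seconds)          # 30 -- just the seconds field
print(dt.total_seconds())  # 172830.0 -- the full duration in seconds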
kkappler (Collaborator, Author) commented
This is now done in the frequency domain. Issue #152 is still open about doing this in the time domain.
