
Handling of multiple acquisition runs as a single processing run #80

Closed

kkappler opened this issue Aug 27, 2021 · 4 comments
kkappler (Collaborator) commented Aug 27, 2021
Use Cases:

  1. Several long-period runs, possibly broken up by a power outage, for example
  2. Regular high-frequency, short-duration acquisitions, e.g. ZEN

Case 2 can be handled by breaking process_mth5_decimation_level into:
stft_agg = []
for run in run_list:
    stft_obj = make_stft_decimation_level(run)
    stft_agg.append(stft_obj)
tf_obj = process_stft_decimation_level(stft_agg)
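
A slightly fleshed-out sketch of that factoring, assuming the per-run STFTs come back as xarray objects sharing a time dimension (make_stft_decimation_level and process_stft_decimation_level are the proposed, not-yet-existing helpers):

import xarray as xr

def process_runs_at_decimation_level(run_list, decimation_level):
    """Sketch: STFT each acquisition run separately, then merge the
    spectrograms before a single TF-estimation pass."""
    stft_agg = []
    for run in run_list:
        # Proposed helper: returns an xarray of Fourier coefficients
        # for one run at one decimation level.
        stft_agg.append(make_stft_decimation_level(run, decimation_level))
    # Concatenating along time turns the disjoint runs into one
    # collection of spectral observations.
    merged = xr.concat(stft_agg, dim="time")
    # Proposed helper: runs the TF estimation on the merged spectrogram.
    return process_stft_decimation_level(merged)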

kkappler (Collaborator, Author) commented Oct 3, 2021

A ProcessingRunTS class could be used to manage, or at least reference, the data. This would act to merge runs / process mixed runs.

Desired elements:

  • The ability to slice mth5 runs based on time interval (see the sketch after this list). A workaround was used to put this functionality into the VA Tech analysis, but it will be needed in general for trimming remote references. This could be done by trimming an existing run ts or by loading with a time-interval argument.
  • A method for handling changes in metadata from run to run (this should actually be handled by a new station instance). In any case, if an instrument is swapped out, we want the previous and future runs to relate to the same location. It is not clear whether we would want to mix runs from different instruments in a single processing job; if we did, using the FC class as an interface would be useful.
  • Metadata standards changes: generally, how do we handle multiple runs in the new vs. the old metadata? When the time intervals are explicitly enumerated it is pretty straightforward, but when the stream comes back broken up we need to detect and build runs on the fly, or repair/mend the time series.
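
A rough sketch of the interval slicing, assuming the run data sit in an xarray Dataset with a "time" coordinate (the function and argument names are illustrative, not the mth5 API):

import pandas as pd

def slice_run_by_interval(run_xrds, start, end):
    """Trim an xarray Dataset of run data to [start, end].

    run_xrds: xarray.Dataset with a "time" coordinate (illustrative).
    start, end: anything pandas can parse into a Timestamp.
    """
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    # Label-based slicing on the time coordinate; endpoints inclusive.
    return run_xrds.sel(time=slice(start, end))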

Run merging requirements:
  • must handle an arbitrary number of runs
  • must handle decimation

NaN filling is a general solution, with two potential complications:
  (a) big gaps could overload RAM
  (b) filtering edge effects
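
A minimal sketch of that NaN-fill approach, assuming the runs are held as pandas Series with DatetimeIndex (the function name and representation are illustrative, not the mth5 API):

import pandas as pd

def merge_runs_with_nan_fill(runs, sample_rate):
    """Concatenate runs and reindex onto a regular time axis,
    filling the gaps between runs with NaN.

    runs: list of non-overlapping pandas Series with DatetimeIndex,
        whose samples fall on the regular grid implied by sample_rate.
    sample_rate: samples per second.
    Caveat (complication a): a long gap inflates the reindexed array,
    so very large gaps could overload RAM.
    """
    merged = pd.concat(runs).sort_index()
    dt = pd.Timedelta(seconds=1.0 / sample_rate)
    full_axis = pd.date_range(merged.index[0], merged.index[-1], freq=dt)
    return merged.reindex(full_axis)  # missing samples become NaN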

kkappler (Collaborator, Author) commented Nov 11, 2021

Ideally we want a model that is completely general (for disjoint time series). The TSCollection is associated with a list of time intervals ℐ0 = { (a,b)_i } such that all data to be processed lie in the union ∪_i (a,b)_i. The individual elements of ℐ0 are normally acquisition runs, or intervals properly contained in acquisition runs, but we want to be careful not to exclude the case of joining acquisition runs via some gap-fill technique. For example, a few long acquisition runs with only a short gap in between may need to be processed for very long periods (longer than either acquisition run can yield alone).

A companion set of intervals, where synthetic data (interpolation, etc.) can be overlain on the original set, basically specifies intervals that should be infilled so that runs can be treated as continuous. This is particularly useful in the case of a few missing samples, but could have wider application.
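
As a sketch, the interval bookkeeping described in the last two paragraphs could look like this (the class and field names are hypothetical):

from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]  # (a, b) as epoch seconds, for illustration

@dataclass
class TSCollectionIntervals:
    """Hypothetical sketch of the interval bookkeeping above.

    data_intervals: the elements of ℐ0 -- acquisition runs, or
        sub-intervals of them; all data to process lie in their union.
    infill_intervals: the companion set, where synthetic data may be
        overlain so runs can be treated as continuous.
    """
    data_intervals: List[Interval] = field(default_factory=list)
    infill_intervals: List[Interval] = field(default_factory=list)

    def covers(self, t: float) -> bool:
        """True if time t lies in the union of data or infill intervals."""
        return any(a <= t <= b
                   for a, b in self.data_intervals + self.infill_intervals)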

@kkappler changed the title from "Handling of multiple acqusition runs as a single processing run" to "Handling of multiple acquisition runs as a single processing run" Nov 13, 2021
@kkappler self-assigned this Feb 13, 2022
kkappler (Collaborator, Author) commented Feb 13, 2022

The place where this will be implemented in the code is the function process_mth5_run in aurora/pipelines/process_mth5.py

The current function structure is:

def process_mth5_run(
    run_cfg,
    run_id,
    units="MT",
    show_plot=False,
    z_file_path=None,
    return_collection=True,
    **kwargs,
):

To support multiple runs we could optionally replace run_id (currently a string) with a list of strings, each specifying a run. Implementing this change does not look too complicated, and the function structure would stay very similar. Instead of extracting a single run, computing its STFT, and processing, we would extract each run in the list and STFT each individually. The STFTs would then be merged into one xarray of spectral measurements, and that array would be passed to the TF estimation method.
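
A hedged sketch of that change, normalizing run_id to a list and merging the per-run STFTs (extract_run, compute_stft, and estimate_tf are hypothetical stand-ins, not the actual aurora helpers):

import xarray as xr

def process_mth5_run(run_cfg, run_id, units="MT", **kwargs):
    # Accept either a single run label or a list of labels.
    run_ids = [run_id] if isinstance(run_id, str) else list(run_id)

    stft_list = []
    for rid in run_ids:
        run_obj = extract_run(run_cfg, rid)      # hypothetical helper
        stft_list.append(compute_stft(run_obj))  # hypothetical helper
    # Merge the per-run spectral measurements into one xarray and hand
    # it to the (single) TF estimation step.
    merged_stft = xr.concat(stft_list, dim="time")
    return estimate_tf(merged_stft, units=units)  # hypothetical helper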

This solution should work in general for single station processing.

For multiple-station processing there is one more layer to consider. The run labels will not, in general, be the same for different stations, so we would need an iterable of runs for the station of interest and another for the remote reference station. The determination of which runs will be processed is currently not supported.
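
The bookkeeping that implies could be as simple as a per-station mapping (purely illustrative; the remote station label below is made up):

# Per-station run lists for a remote-reference job; the labels differ
# between the local and remote stations, so each needs its own iterable.
processing_runs = {
    "CAS04": ["c", "d"],          # local station runs
    "REMOTE01": ["a", "b", "c"],  # hypothetical remote reference runs
}

for station_id, run_ids in processing_runs.items():
    for rid in run_ids:
        ...  # extract, STFT, and tag each spectrogram with station_id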

When there are many stations (MMT) we would need to handle many subcases; we might need another version of process_mth5_run for MMT.

kkappler added a commit that referenced this issue Feb 26, 2022
Modified process_mth5 by creating process_mth5_runs, nearly a copy of process_mth5_run.
This method will take lists of runs to process and eventually supersede process_mth5_run.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 4, 2022
Modified process_mth5 to loop over runs and create merged FC object.
Testing ok on decimation level zero.  Now need to add looping over
decimation levels.

[Issue(s): #31, #118, #80]
kkappler added a commit that referenced this issue Mar 14, 2022
Have basic run merging for single station working.
- Need to move all single-station processing from process_mth5_run
to process_mth5_from_dataset_definition
- Then make it work for RR
- Test on Parkfield (single run) and then test on CAS04 with multiple runs

[Issue(s): #118, #80, #132]
kkappler added a commit that referenced this issue Mar 26, 2022
Replaced config with expected_sample_rate in validate_sample_rate,
and added an import from mt_metadata in pipelines/helpers.py
in anticipation of the merge.

[Issue(s): #153, #80]
@kkappler mentioned this issue in a merged pull request, Apr 22, 2022
kkappler added a commit that referenced this issue Jun 7, 2022
While working on issue #80 and PR #184, have noticed that the processing
config defaults to estimator.engine = "RME_RR".  This is fine, but
I find I need to specify "RME" explicitly when there is only one
station.  So a couple of fixes were added:
1. The Processing class now has a validate() method.  If there is no RR
station, _and_ the estimator.engine is "RME_RR", it gets reset to
"RME".
2. Added the ability to pass a kwarg called estimator to a ConfigCreator
instance.  The kwarg is a dict, and if "engine" is a key, it will
overwrite the estimator engine with the corresponding value.

The parkfield SS run test was updated to use the config_creator method.
The cas04 test is using validate().

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 18, 2022
Using the updated method in mth5 locally (see mth5 issue #105),
am now able to process runs c and d for CAS04 as a single station.

Working on getting a similar h5 built in tests/cas04

[Issue(s): #31, #80]
kkappler added a commit that referenced this issue Jun 19, 2022
Allow the request list to have multiple stations and modify
channel_summary_to_make_mth5 to group by (station, run) rather
than just run.  Add tests of making a multistation mth5 to the cas04 tests.

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 19, 2022
- Replace DatasetDefinition with Dataset, imported as TFKDataset
- Replace dataset_definition with tfk_dataset

[Issue(s): #80, #132]
kkappler added a commit that referenced this issue Jun 24, 2022
This is just a stage commit because all tests are passing currently.
operate_aurora is not yet working.

Need to decide where to put the RunSummary wrangling.

[Issue(s): #80, #118, #132]
kkappler added a commit that referenced this issue Jun 25, 2022
Replaced dicts with classes; now have a SyntheticRun and a
SyntheticStation.
These will be used to create an example synthetic case with many runs.

[Issue(s): #80]
kkappler added a commit that referenced this issue Jun 26, 2022
Change from timedelta.seconds to timedelta.total_seconds().
Remove run_id from sort_by; it should be only (station, starttime).

[Issue(s): #80]
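
For reference, the reason that change matters: timedelta.seconds is only the seconds field of the duration (0–86399), while total_seconds() is the whole duration:

from datetime import timedelta

dt = timedelta(days=2, seconds=30)
print(dt.seconds)          # 30 -- just the seconds field
print(dt.total_seconds())  # 172830.0 -- the full duration in seconds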
kkappler (Collaborator, Author) commented
This is now done in the frequency domain. Issue #152 is still open about doing this in the time domain.
