Processing Dataset Definition #118
Relates to issue #80.
After looking more carefully at this, it becomes clear that the designation of remote vs. local is not really part of the dataset definition; that belongs in the config. The dataset specification concerns station-runs and any sub-intervals of that data to ignore or process. The mth5 container already provides a list of stations and acquisition runs, which corresponds to station-intervals. We would like the option of specifying sub-intervals of data within a run, so that a user can ignore some chunks of data if desired.

In all cases, we wind up with lists of time series intervals, and these need to be merged somehow. There will be a list of one or more (possibly clipped) runs associated with the primary station, and a second list of zero or more (possibly clipped) runs associated with the reference station. These can be stored in a dict keyed by station_id, so that we may be able to extend this model to multiple-station processing without too much modification.
This dictionary also has a list representation ... a sketch of both follows below.
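A minimal sketch of what both views might look like, assuming plain dicts and a pandas dataframe; the station IDs, run ids, and timestamps are all illustrative:

```python
import pandas as pd

# Dict keyed by station_id; each value is a list of (possibly clipped) runs.
# Station IDs, run ids, and timestamps are illustrative.
dataset = {
    "PKD": [  # the primary (local) station
        {"run_id": "001",
         "start": pd.Timestamp("2004-09-28 00:00:00"),
         "end": pd.Timestamp("2004-09-28 01:59:59")},
        {"run_id": "002",
         "start": pd.Timestamp("2004-09-28 02:00:00"),
         "end": pd.Timestamp("2004-09-28 03:59:59")},
    ],
    "SAO": [  # the reference station (list may be empty for single-station)
        {"run_id": "001",
         "start": pd.Timestamp("2004-09-28 00:00:00"),
         "end": pd.Timestamp("2004-09-28 03:59:59")},
    ],
}

# The equivalent flat "list representation": one row per station-run-interval.
rows = [
    {"station_id": station_id, **run_interval}
    for station_id, run_intervals in dataset.items()
    for run_interval in run_intervals
]
df = pd.DataFrame(rows)  # columns: station_id, run_id, start, end
```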
It is important to note that we do not need to worry at this point about how these lists are generated. They can be created by a person via a spreadsheet, they can be machine generated, or some other scheme; we basically have a list of intervals. Expressing the list like this seems completely general, and allows us to develop workflows where we choose to merge (or not merge) runs in the time or frequency domain, as well as to process with or without a "clock zero". We could probably concatenate the time series from all runs for each station in a timestamped xarray, and, by tracking the discontinuities in time (or similar bookkeeping), work with the runs individually. This would basically constitute the merged run class.
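A hedged sketch of that bookkeeping, assuming each run is available as an xarray Dataset with a "time" coordinate (the coordinate name and the helper function are assumptions, not existing API):

```python
import numpy as np
import xarray as xr

def merge_runs(runs):
    """Concatenate per-run datasets along time, tagging every sample with
    its run_id so the discontinuities between runs stay recoverable.

    `runs` maps run_id -> xr.Dataset; each dataset is assumed to have a
    monotonic "time" coordinate.
    """
    tagged = [
        ds.assign_coords(
            run_id=("time", np.full(ds.sizes["time"], run_id, dtype=object))
        )
        for run_id, ds in runs.items()
    ]
    return xr.concat(tagged, dim="time")

# Individual runs remain addressable after merging, e.g.:
#   merged = merge_runs(runs)
#   run_001 = merged.where(merged.run_id == "001", drop=True)
```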
One more note about the run handling: it would be good to consider the case where we may have different time intervals per channel. This shouldn't be an issue for the current code, but it will come up if we wind up doing full multiple-station processing (like MMT), where we compute a spectral density matrix (SDM) that considers cross powers of all channel pairs. It is possible that particular channels at a station may need to be suppressed for some time intervals, or alternatively that one channel may have several time intervals that are "good" where other channels do not. This basically amounts to a channel-by-channel approach to processing, not station-by-station. It could be handled by suppressing various channel-intervals in a different data structure within the TFKernel. We will not implement this now, but should keep it in mind during development.
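Purely as a placeholder for that idea, one hedged way to record channel-interval suppressions (all names and values illustrative) is a small table with a channel column, living in its own structure inside the TFKernel rather than in the dataset definition:

```python
import pandas as pd

# One row per channel-interval to suppress; station, channel, and times
# are illustrative.
channel_suppressions = pd.DataFrame(
    [
        {"station_id": "PKD", "channel": "hx",
         "start": pd.Timestamp("2004-09-28 00:30:00"),
         "end": pd.Timestamp("2004-09-28 00:45:00")},
    ],
    columns=["station_id", "channel", "start", "end"],
)
```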
Note that run_ids will in general be different at the local and remote stations. This data structure allows for that, but the current code does not yet exploit it. A sensible next step would be a method that fills in default values when an acquisition run is declared but a time interval is not; this will force us to address and test the sub-interval run data extraction, and to solidify a convention (such as half-open, lower-closed) for the time_interval. This time interval should incidentally be constructed with pd.Interval if we are ever going to address issue #134. Formalizing this class and implementing it would then follow.
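As a hedged illustration, pd.Interval expresses the half-open, lower-closed convention directly (the timestamps are illustrative):

```python
import pandas as pd

# Half-open, lower-closed: the start is included, the end is excluded.
time_interval = pd.Interval(
    pd.Timestamp("2004-09-28 00:00:00"),
    pd.Timestamp("2004-09-28 02:00:00"),
    closed="left",
)

assert pd.Timestamp("2004-09-28 00:00:00") in time_interval
assert pd.Timestamp("2004-09-28 02:00:00") not in time_interval
```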
@kkappler What about something like this, which provides places to hold all the information? How the information gets in there could be your dataframe.
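A minimal sketch of one way such a container could look, assuming a plain-dataclass layout; all class and attribute names here are illustrative, and the remote flag is an assumption (per the earlier comment, remote vs. local may instead live in the config):

```python
from dataclasses import dataclass, field
from typing import List, Optional

import pandas as pd


@dataclass
class RunInterval:
    """One (possibly clipped) acquisition run at a station."""
    run_id: str
    start: Optional[pd.Timestamp] = None  # None means "use the whole run"
    end: Optional[pd.Timestamp] = None


@dataclass
class StationEntry:
    station_id: str
    remote: bool = False  # assumption: a flag mirroring the config's remote/local role
    run_intervals: List[RunInterval] = field(default_factory=list)


@dataclass
class DatasetDefinition:
    stations: List[StationEntry] = field(default_factory=list)

    def to_dataframe(self) -> pd.DataFrame:
        """Flat list representation: one row per station-run-interval."""
        rows = [
            {"station_id": s.station_id, "remote": s.remote,
             "run_id": r.run_id, "start": r.start, "end": r.end}
            for s in self.stations for r in s.run_intervals
        ]
        return pd.DataFrame(
            rows, columns=["station_id", "remote", "run_id", "start", "end"]
        )
```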
@kujaku11 this data structure looks good. Let's discuss how to implement it at the next tag-up. One other thing to bear in mind here: DatasetDefinition tells us about the available data that we will load and process, but it could also be used to describe the intervals in between, together with instructions for an in-fill process, such as interpolation or another technique of synthetic data generation.
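As a hedged illustration of the in-fill idea only (not a proposed implementation): merged runs reindexed onto a continuous time axis expose the inter-run gap as NaN, which xarray's interpolate_na can then fill. The 1 Hz sample rate and toy values are assumptions:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two short runs separated by a 5 s gap.
t1 = pd.date_range("2004-09-28 00:00:00", periods=5, freq="s")
t2 = pd.date_range("2004-09-28 00:00:10", periods=5, freq="s")
da = xr.DataArray(np.arange(10.0), coords={"time": t1.append(t2)}, dims="time")

# Reindex onto a continuous axis (gap becomes NaN), then interpolate.
continuous = pd.date_range(t1[0], t2[-1], freq="s")
filled = da.reindex(time=continuous).interpolate_na(dim="time", method="linear")
```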
Also, DatasetDefinition may need a column that points to the mth5 that is associated with the data ... might this make for issues if runs are somehow split across multiple mth5s?
See note 3 in process_mth5.populate_dataset_df about trimming the local and remote time series to only their overlap. Also see note 2 in process_mth5.process_mth5, and note 1 in process_mth5.make_stft_objects().
This is now working in KernelDataset.
We have a processing config. The other part of the TFKernel is the dataset that gets fed into the pipeline along with the config.
We need a standard for dataset specification.
I suggest a table (dataframe) as the container, with a csv file as a first-cut user interface (a sketch follows the lists below).
Dataset specification can be one of two flavors (with others added in future):
- Single station
- Remote reference
In both cases we need to know:
- Local station ID: the location at which we are going to estimate the EMTF (sample the earth's conductivity)
- Local station time intervals of data to be provided for analysis

For the remote station we also need:
- Reference station ID (can be None, and then there are not two cases of definition?)
- Reference station time intervals of data to be provided for analysis
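A hedged sketch of such a first-cut csv (station IDs, run ids, and timestamps are illustrative, and the remote column is an assumption about how the two flavors might be distinguished):

```python
import io

import pandas as pd

# Illustrative first-cut csv: one row per station-run-interval.
csv_text = """station_id,remote,run_id,start,end
PKD,False,001,2004-09-28 00:00:00,2004-09-28 01:59:59
SAO,True,001,2004-09-28 00:00:00,2004-09-28 03:59:59
"""
dataset_df = pd.read_csv(
    io.StringIO(csv_text), dtype={"run_id": str}, parse_dates=["start", "end"]
)
```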
Specifications:
- The time intervals for any given station must be disjoint

I would specifically like to push the logic that validates the time intervals out of the first cut of this class; those tools can be built separately (a sketch of the disjointness check follows below), and indeed Tim is already making good headway on these validations.
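As a hedged sketch of one such standalone validator (the function name is hypothetical, and the column names are assumed to match the csv sketch above):

```python
import pandas as pd

def intervals_are_disjoint(dataset_df: pd.DataFrame, station_id: str) -> bool:
    """True if the time intervals for the given station do not overlap.

    Assumes half-open, lower-closed intervals, so an end that equals the
    next start does not count as an overlap.
    """
    sub = dataset_df[dataset_df.station_id == station_id].sort_values("start")
    starts = sub["start"].iloc[1:].to_numpy()
    ends = sub["end"].iloc[:-1].to_numpy()
    return bool((starts >= ends).all())
```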