QC on GPS data + removing dlr/ulr for bad t_rad + recalculating RH from averaged vapour pressures #241
Conversation
- p_vap and p_vap_cor are the same
- use .to_series() instead of .to_dataframe()
- FRE station had wrongly defined limits; now set manually
- removing msg_i
@@ -66,24 +66,39 @@ def toL2(
        except Exception:
            logger.exception('Flagging and fixing failed:')

        if ds.attrs['format'] == 'TX':
Consider how the "period" parameter is interpreted by persistence_qc. I am not sure that sample rates are addressed correctly. This means that a 10-minute raw dataset will be processed differently than hourly tx data.
Although the docstring of persistence_qc says that period is how many hours a value can stay the same without being set to NaN, it was, as you suspected, the number of persistent values that was counted.
I adapted the code to handle the duration of persistent periods:
pypromice/src/pypromice/qc/persistence.py
Lines 168 to 195 in eee1cbd
from typing import Union

import numpy as np
import pandas as pd


def duration_consecutive_true(
    series: Union[pd.Series, pd.DataFrame]
) -> Union[pd.Series, pd.DataFrame]:
    """
    From a boolean series, calculates the duration, in hours, of the periods with consecutive true values.

    Examples
    --------
    >>> duration_consecutive_true(pd.Series([False, True, False, False, True, True, True, False, True]))
    pd.Series([0, 1, 0, 0, 1, 2, 3, 0, 1])

    Parameters
    ----------
    series
        Boolean pandas Series or DataFrame

    Returns
    -------
    consecutive_true_duration
        Pandas Series or DataFrame with values representing the duration, in hours, of the consecutive true values.
    """
    # assert series.dtype == bool
    cumsum = ((series.index - series.index[0]).total_seconds() / 3600).to_series(index=series.index)
    is_first = series.astype("int").diff() == 1
    offset = (is_first * cumsum).replace(0, np.nan).fillna(method="ffill").fillna(0)
    return (cumsum - offset) * series
I updated the default threshold values to reflect this (but they are still up for discussion):
# period is given in hours, 2 persistent 10 min values will be flagged if period < 0.333
DEFAULT_VARIABLE_THRESHOLDS = {
"t": {"max_diff": 0.0001, "period": 0.3},
"p": {"max_diff": 0.0001, "period": 0.3},
'gps_lat_lon':{"max_diff": 0.000001, "period": 6}, # gets special handling to remove simultaneously constant gps_lat and gps_lon
'gps_alt':{"max_diff": 0.0001, "period": 6},
't_rad':{"max_diff": 0.0001, "period": 0.3},
"rh": {"max_diff": 0.0001, "period": 0.3}, # gets special handling to allow constant 100%
"wspd": {"max_diff": 0.0001, "period": 6},
}
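To make the new semantics concrete, here is a minimal, hypothetical usage sketch (the mask construction is my assumption for illustration, not the actual persistence_qc implementation; it assumes duration_consecutive_true from above is in scope):

import pandas as pd

index = pd.date_range('2024-01-01', periods=6, freq='10min')
t = pd.Series([1.0, 1.0, 1.0, 1.0, 2.0, 3.0], index=index)

# mark samples that repeat the previous value within max_diff
is_static = t.diff().abs() < 0.0001

# hours each sample has spent inside a static run
hours_static = duration_consecutive_true(is_static)  # [0, 0, 0.17, 0.33, 0, 0]

# with period = 0.3 h, samples static for longer than 0.3 h are removed;
# here only the fourth sample (0.33 h into the run) is masked
t_clean = t.mask(hours_static > 0.3)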
@@ -224,7 +224,7 @@ def loadL0(self):

        except pd.errors.ParserError as e:
            # ParserError: Too many columns specified: expected 40 and found 38
-           logger.info(f'-----> No msg_lat or msg_lon for {k}')
+           # logger.info(f'-----> No msg_lat or msg_lon for {k}')
Why did you remove this log message? Is it a common case?
msg_lat and msg_lon are no longer printed in the output files because they were deemed unreliable (#199).
They are still extracted from the SBD messages and appended to some, but not all, transmission files. In the future we should stop extracting them, but in the meantime there is no need to warn anyone when they are not found in transmission files.
src/pypromice/process/L1toL2.py
Outdated
@@ -103,6 +118,28 @@ def toL2(
        lat = ds['gps_lat'].mean()
        lon = ds['gps_lon'].mean()

        # smoothing tilt
It looks like you are doing a 72-hour standard-deviation filter with different thresholds for tilt and rotation, and in addition a subsequent smoothing on the rotation data. Correct?
I find this part difficult to read. Consider splitting the functionality out into helper functions for readability and testability.
I've asked ChatGPT for a suggestion 😄 :
import numpy as np
import pandas as pd
import xarray as xr

def apply_std_threshold_filter(data: xr.DataArray, resample_period: str, rolling_window_size: int,
                               std_threshold: float, time_dim: str = 'time') -> xr.DataArray:
    """Apply a standard deviation threshold filter to time series data."""
    # Convert to pandas Series, resample, and calculate rolling standard deviation
    series: pd.Series = data.to_series()
    medians: pd.Series = series.resample(resample_period).median()
    std_dev: pd.Series = medians.rolling(window=rolling_window_size, center=True, min_periods=2).std()
    reindexed_std: np.ndarray = std_dev.reindex(data[time_dim].values, method='bfill').values
    # Filter data based on standard deviation threshold
    filtered_data: xr.DataArray = data.where(reindexed_std < std_threshold)
    return filtered_data.ffill(dim=time_dim).bfill(dim=time_dim)

def process_tilt(ds: xr.Dataset, variables: list[str], std_threshold: float = 0.2) -> None:
    """Process tilt data variables using a standard deviation threshold."""
    for variable in variables:
        ds[variable] = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)

def process_rotation(ds: xr.Dataset, variable: str = 'rot', std_threshold: float = 4,
                     smoother_window_size: int = 14) -> None:
    """Process rotation data with a harsher treatment and additional smoothing."""
    # Apply harsher std threshold filtering
    filtered_data: xr.DataArray = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)
    # Additional smoothing by resampling to daily and applying median rolling
    daily_medians: pd.Series = filtered_data.to_series().resample('D').median()
    smoothed_daily: pd.Series = daily_medians.rolling(window=smoother_window_size, center=True, min_periods=2).median()
    reindexed_smoothed: np.ndarray = smoothed_daily.reindex(ds.time.values, method='bfill').values
    # Update the dataset
    ds[variable] = ('time', reindexed_smoothed)

# Example usage:
ds = xr.Dataset({
    'tilt_x': ('time', pd.Series(range(100), dtype=float)),
    'tilt_y': ('time', pd.Series(range(100), dtype=float)),
    'rot': ('time', pd.Series(range(100), dtype=float))
})
ds['time'] = pd.date_range(start='2020-01-01', periods=100, freq='T')
process_tilt(ds, ['tilt_x', 'tilt_y'])
process_rotation(ds, 'rot')
I guess it is a matter of taste.
I don't think a two-line function like process_tilt makes sense.
I also think that declaring the variable type at each operation (e.g. daily_medians: pd.Series = ...) decreases readability.
I have moved these lines into two separate functions:
pypromice/src/pypromice/process/L1toL2.py
Lines 115 to 118 in 00666bb
# smoothing tilt and rot
ds['tilt_x'] = smoothTilt(ds['tilt_x'])
ds['tilt_y'] = smoothTilt(ds['tilt_y'])
ds['rot'] = smoothRot(ds['rot'])
pypromice/src/pypromice/process/L1toL2.py
Lines 260 to 316 in 00666bb
def smoothTilt(da: xr.DataArray, threshold=0.2):
    '''Smooth the station tilt

    Parameters
    ----------
    da : xarray.DataArray
        either X or Y tilt inclinometer measurements
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        either X or Y smoothed tilt inclinometer measurements
    '''
    # we calculate the moving standard deviation over a 3-day sliding window
    # hourly resampling is necessary to make sure the same threshold can be used
    # for 10 min and hourly data
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # we select the good timestamps and gapfill assuming that
    # - when tilt goes missing the last available value is used
    # - when tilt is not available for the very first time steps, the first
    #   good value is used for backfill
    return da.where(
        moving_std_gap_filled < threshold
    ).ffill(dim='time').bfill(dim='time')


def smoothRot(da: xr.DataArray, threshold=4):
    '''Smooth the station rotation

    Parameters
    ----------
    da : xarray.DataArray
        rotation measurements from inclinometer
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        smoothed rotation measurements from inclinometer
    '''
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # same as for tilt with, in addition:
    # - a resampling to daily values
    # - a two week median smoothing
    # - a resampling from these daily values to the original temporal resolution
    return ('time', (da.where(moving_std_gap_filled < threshold).ffill(dim='time')
                     .to_series().resample('D').median()
                     .rolling(7*2, center=True, min_periods=2).median()
                     .reindex(da.time, method='bfill').values
                     ))
Is that better?
-    ds['usr_cor'] = ds['usr_cor'].interpolate_na(dim='time', use_coordinate=False)
+    ds['dsr_cor'] = ds.dsr_cor.where(ds.dsr.notnull())
What is the intention of this line? Is it to ensure we are having NaNs instead of Nones?
This was just correcting an issue I found.
When calculating dsr_cor and usr_cor, some additional checks are performed and some timestamps are set to NaN.
The original code was meant to linearly interpolate the timestamps that were filtered out, to increase data coverage.
But what I saw is that sometimes days were set to NaN while the previous and following nights (when usr = dsr = 0) remained. The interpolation then bridged the two nights and set the whole day to dsr_cor = usr_cor = 0.
Another way of putting it is that we should not interpolate over gaps without a thorough check, and even less so as early as L2. So I removed the interpolation and made sure that no values were added in dsr_cor compared to dsr.
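A small synthetic illustration of the bridging problem and of the where guard (all values made up):

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2024-06-01', periods=5, freq='6H')
# night (0), a filtered-out day (NaNs), night (0)
dsr_cor = xr.DataArray([0.0, np.nan, np.nan, np.nan, 0.0],
                       dims='time', coords={'time': time})

# the old interpolation bridges the two nights and zeroes out the whole day
print(dsr_cor.interpolate_na(dim='time', use_coordinate=False).values)
# [0. 0. 0. 0. 0.]

# the fix: keep dsr_cor only where dsr itself has data, so no values are added
dsr = xr.DataArray([0.0, 350.0, np.nan, 300.0, 0.0],
                   dims='time', coords={'time': time})
print(dsr_cor.where(dsr.notnull()).values)
# [ 0. nan nan nan  0.]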
Next time:
- better doc for calculateSaturationVaporPressure
- default threshold values converted to hours in persistence_qc
- removed unnecessary comment regarding cloud coverage calculation for bedrock stations
src/pypromice/qc/persistence.py
Outdated
'gps_alt':{"max_diff": 0.0001, "period": 12},
't_rad':{"max_diff": 0.0001, "period": 2},
"rh": {"max_diff": 0.0001, "period": 2},  # gets special handling to allow constant 100%
"t": {"max_diff": 0.0001, "period": 0.3},
I think the period length for t and p was thought to be 2 hours and not 20 minutes.
Ah, yes, because it was meant to handle transmissions.
But now that we are talking about it, I suggest that we keep it at period = 0.3 hours, because we know that the loggers were, for a time, wrongly programmed to report the last valid number instead of NaN. So we cannot differentiate a good measurement with the same value as the previous one (rather unlikely) from a missing observation filled with the last valid value.
Also, when period = 2 hours, 12 persistent 10-minute values would remain at the start of a long static period. I would like those to be removed, which would be the case if period = 0.3 hours.
Maybe it is necessary to define configurations specifically for tx, raw, hourly and 10-minute data.
p_u and t_u only have a single decimal of precision in the tx data files, hence I think there is a significant risk of filtering out valid data.
> p_u and t_u only have a single decimal of precision in the tx data files, hence I think there is a significant risk of filtering out valid data.
I didn't know that, good point! I switched back to period = 2 hours until we have specific handling for raw data.
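For later reference, one possible shape for such format-specific handling (names and structure are hypothetical, not the current pypromice API):

# Hypothetical sketch; pypromice currently uses a single DEFAULT_VARIABLE_THRESHOLDS dict.
VARIABLE_THRESHOLDS_BY_FORMAT = {
    # tx data: single-decimal precision, so require longer static runs
    'TX': {
        't': {'max_diff': 0.0001, 'period': 2},
        'p': {'max_diff': 0.0001, 'period': 2},
    },
    # raw 10-minute data: full precision, flag short static runs
    'raw': {
        't': {'max_diff': 0.0001, 'period': 0.3},
        'p': {'max_diff': 0.0001, 'period': 0.3},
    },
}

def get_variable_thresholds(data_format: str) -> dict:
    """Pick the threshold set matching the dataset's format attribute."""
    return VARIABLE_THRESHOLDS_BY_FORMAT.get(data_format, VARIABLE_THRESHOLDS_BY_FORMAT['TX'])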
Persistence QC is done on both raw and tx data, and extended to rh and wspd
pypromice/src/pypromice/process/L1toL2.py
Lines 69 to 73 in 304a270
pypromice/src/pypromice/qc/persistence.py
Lines 23 to 24 in 304a270
QC on GPS data (#238)
Done through:
- a persistence QC on gps_lat and gps_lon
- a persistence QC on gps_alt
pypromice/src/pypromice/qc/persistence.py
Lines 20 to 21 in 304a270
If the variable 'gps_lat_lon' is given, then only times when both gps_lat and gps_lon are static are removed.
pypromice/src/pypromice/qc/persistence.py
Lines 79 to 100 in 304a270
- removing all the elevations that deviate from an elevation baseline (gap-filled monthly median) by more than 100 m (see the sketch below)
- removing gps_lat and gps_lon when gps_alt is missing or has been removed by the steps above
pypromice/src/pypromice/process/L1toL2.py
Lines 75 to 81 in 304a270
See for example here. All the pink points have missing or bad gps_alt. The baseline elevation is the black dashed line.
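For reference, a minimal sketch of such a baseline filter (an assumed reconstruction for illustration; the actual implementation is in the L1toL2.py lines cited above):

import numpy as np
import xarray as xr

def filter_gps_alt_by_baseline(gps_alt: xr.DataArray, max_deviation: float = 100.0) -> xr.DataArray:
    """Remove elevations deviating more than max_deviation metres from a
    gap-filled monthly-median baseline."""
    alt = gps_alt.to_series()
    # monthly medians, gap-filled by interpolation, brought back to the
    # original time resolution
    baseline = (alt.resample('M').median()
                   .interpolate()
                   .reindex(alt.index, method='nearest'))
    return gps_alt.where(np.abs((alt - baseline).values) < max_deviation)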
Removing dlr and ulr when t_rad is bad or not available (#240)
pypromice/src/pypromice/qc/persistence.py
Line 22 in 304a270
pypromice/src/pypromice/process/L1toL2.py
Lines 83 to 86 in 304a270
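In essence the effect is the following masking (a sketch with a hypothetical helper name; the actual lines are cited above):

import xarray as xr

def mask_longwave_on_bad_t_rad(ds: xr.Dataset) -> xr.Dataset:
    """Drop dlr and ulr wherever the radiometer temperature t_rad is missing,
    since both longwave fluxes depend on t_rad."""
    ds = ds.copy()
    ds['dlr'] = ds.dlr.where(ds.t_rad.notnull())
    ds['ulr'] = ds.ulr.where(ds.t_rad.notnull())
    return ds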
Re-calculating rh from time-averaged vapour pressures (#193)
RH is defined as the ratio between partial vapour pressure and saturation vapour pressure. When calculating hourly, daily or monthly averages, the ratio of the mean pressures should be taken instead of the mean of the ratios (as currently done).
pypromice/src/pypromice/process/aws.py
Lines 749 to 759 in 304a270
pypromice/src/pypromice/process/aws.py
Lines 767 to 805 in 304a270
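In code, the difference is the following (a sketch with assumed names; the actual implementation is in the aws.py lines cited above):

import pandas as pd

def rh_from_averaged_pressures(p_vap: pd.Series, p_sat: pd.Series, freq: str = 'D') -> pd.Series:
    """Correct averaging: average the vapour pressures first, then take the ratio."""
    return 100 * p_vap.resample(freq).mean() / p_sat.resample(freq).mean()

def rh_mean_of_ratios(p_vap: pd.Series, p_sat: pd.Series, freq: str = 'D') -> pd.Series:
    """Previous behaviour: averaging the ratios directly, which biases RH
    whenever the saturation pressure varies within the averaging window."""
    return (100 * p_vap / p_sat).resample(freq).mean()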
removing percentile QC (see #242)
smoothing and gap-filling tilt (#176, #123)
This is necessary for the calculation of dsr_cor and usr_cor. It is done here:
pypromice/src/pypromice/process/L1toL2.py
Lines 121 to 142 in 304a270