QC on GPS data + removing dlr/ulr for bad t_rad + recalculating RH from averaged vapour pressures #241
Conversation
- p_vap and p_vap_cor are the same
- use .to_series() instead of .to_dataframe()
- FRE station had wrongly defined limits; now set manually
- removing msg_i
@@ -66,24 +66,39 @@ def toL2(
        except Exception:
            logger.exception('Flagging and fixing failed:')

        if ds.attrs['format'] == 'TX':
Consider how the "period" parameter is interpreted by persistence_qc. I am not sure that sample rates are addressed correctly. This means that a 10-minute raw dataset will be processed differently than hourly tx data.
Although the docstring of persistence_qc says that period is how many hours a value can stay the same without being set to NaN, it was, as you suspected, the number of persistent values that was counted.
I adapted the code to handle the duration of persistent periods:
pypromice/src/pypromice/qc/persistence.py
Lines 168 to 195 in eee1cbd
from typing import Union

import numpy as np
import pandas as pd


def duration_consecutive_true(
    series: Union[pd.Series, pd.DataFrame]
) -> Union[pd.Series, pd.DataFrame]:
    """
    From a boolean series, calculates the duration, in hours, of the periods with consecutive true values.

    Examples
    --------
    >>> duration_consecutive_true(pd.Series([False, True, False, False, True, True, True, False, True]))
    pd.Series([0, 1, 0, 0, 1, 2, 3, 0, 1])

    Parameters
    ----------
    series
        Boolean pandas Series or DataFrame

    Returns
    -------
    consecutive_true_duration
        Pandas Series or DataFrame with values representing the duration, in hours, of the consecutive true values.
    """
    # assert series.dtype == bool
    cumsum = ((series.index - series.index[0]).total_seconds() / 3600).to_series(index=series.index)
    is_first = series.astype("int").diff() == 1
    offset = (is_first * cumsum).replace(0, np.nan).fillna(method="ffill").fillna(0)
    return (cumsum - offset) * series
I updated the default threshold values to reflect this (but they are still up for discussion):
# period is given in hours, 2 persistent 10 min values will be flagged if period < 0.333
DEFAULT_VARIABLE_THRESHOLDS = {
"t": {"max_diff": 0.0001, "period": 0.3},
"p": {"max_diff": 0.0001, "period": 0.3},
'gps_lat_lon':{"max_diff": 0.000001, "period": 6}, # gets special handling to remove simultaneously constant gps_lat and gps_lon
'gps_alt':{"max_diff": 0.0001, "period": 6},
't_rad':{"max_diff": 0.0001, "period": 0.3},
"rh": {"max_diff": 0.0001, "period": 0.3}, # gets special handling to allow constant 100%
"wspd": {"max_diff": 0.0001, "period": 6},
}
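To make the new semantics concrete, here is a minimal, hypothetical usage sketch (the mask construction is my assumption for illustration, not the actual persistence_qc implementation; it assumes duration_consecutive_true from above is in scope):

import pandas as pd

index = pd.date_range('2024-01-01', periods=6, freq='10min')
t = pd.Series([1.0, 1.0, 1.0, 1.0, 2.0, 3.0], index=index)

# mark samples that repeat the previous value within max_diff
is_static = t.diff().abs() < 0.0001

# hours each sample has spent inside a static run
hours_static = duration_consecutive_true(is_static)  # [0, 0, 0.17, 0.33, 0, 0]

# with period = 0.3 h, samples static for longer than 0.3 h are removed;
# here only the fourth sample (0.33 h into the run) is masked
t_clean = t.mask(hours_static > 0.3)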
@@ -224,7 +224,7 @@ def loadL0(self):

        except pd.errors.ParserError as e:
            # ParserError: Too many columns specified: expected 40 and found 38
-           logger.info(f'-----> No msg_lat or msg_lon for {k}')
+           # logger.info(f'-----> No msg_lat or msg_lon for {k}')
Why did you remove this log message? Is it a common case?
msg_lat and msg_lon are no longer printed in the output files because they were deemed unreliable (#199).
They are still extracted from the SBD messages and appended to some, but not all, transmission files. In the future we should stop extracting them, but in the meantime there is no need to warn anyone when they are not found in transmission files.
src/pypromice/process/L1toL2.py
Outdated
@@ -103,6 +118,28 @@ def toL2(
        lat = ds['gps_lat'].mean()
        lon = ds['gps_lon'].mean()

        # smoothing tilt
It looks like you are doing a 72-hour standard-deviation filter with different thresholds for tilt and rotation, and in addition a subsequent smoothing on the rotation data. Correct?
I find this part difficult to read. Consider splitting the functionality out into helper functions for readability and testability.
I've asked ChatGPT for a suggestion 😄 :
import numpy as np
import pandas as pd
import xarray as xr

def apply_std_threshold_filter(data: xr.DataArray, resample_period: str, rolling_window_size: int,
                               std_threshold: float, time_dim: str = 'time') -> xr.DataArray:
    """Apply a standard deviation threshold filter to time series data."""
    # Convert to pandas Series, resample, and calculate rolling standard deviation
    series: pd.Series = data.to_series()
    medians: pd.Series = series.resample(resample_period).median()
    std_dev: pd.Series = medians.rolling(window=rolling_window_size, center=True, min_periods=2).std()
    reindexed_std: np.ndarray = std_dev.reindex(data[time_dim].values, method='bfill').values
    # Filter data based on standard deviation threshold
    filtered_data: xr.DataArray = data.where(reindexed_std < std_threshold)
    return filtered_data.ffill(dim=time_dim).bfill(dim=time_dim)

def process_tilt(ds: xr.Dataset, variables: list[str], std_threshold: float = 0.2) -> None:
    """Process tilt data variables using a standard deviation threshold."""
    for variable in variables:
        ds[variable] = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)

def process_rotation(ds: xr.Dataset, variable: str = 'rot', std_threshold: float = 4,
                     smoother_window_size: int = 14) -> None:
    """Process rotation data with a harsher treatment and additional smoothing."""
    # Apply harsher std threshold filtering
    filtered_data: xr.DataArray = apply_std_threshold_filter(ds[variable], 'H', 3*24, std_threshold)
    # Additional smoothing by resampling to daily and applying median rolling
    daily_medians: pd.Series = filtered_data.to_series().resample('D').median()
    smoothed_daily: pd.Series = daily_medians.rolling(window=smoother_window_size, center=True, min_periods=2).median()
    reindexed_smoothed: np.ndarray = smoothed_daily.reindex(ds.time.values, method='bfill').values
    # Update the dataset
    ds[variable] = ('time', reindexed_smoothed)

# Example usage:
ds = xr.Dataset({
    'tilt_x': ('time', pd.Series(range(100), dtype=float)),
    'tilt_y': ('time', pd.Series(range(100), dtype=float)),
    'rot': ('time', pd.Series(range(100), dtype=float))
})
ds['time'] = pd.date_range(start='2020-01-01', periods=100, freq='T')
process_tilt(ds, ['tilt_x', 'tilt_y'])
process_rotation(ds, 'rot')
I guess it is a matter of taste.
I don't think a two-line function like process_tilt makes sense.
I also think that declaring the variable type at each operation (e.g. daily_medians: pd.Series = ...) decreases readability.
I have moved these lines into two separate functions:
pypromice/src/pypromice/process/L1toL2.py
Lines 115 to 118 in 00666bb
# smoothing tilt and rot
ds['tilt_x'] = smoothTilt(ds['tilt_x'])
ds['tilt_y'] = smoothTilt(ds['tilt_y'])
ds['rot'] = smoothRot(ds['rot'])
pypromice/src/pypromice/process/L1toL2.py
Lines 260 to 316 in 00666bb
def smoothTilt(da: xr.DataArray, threshold=0.2):
    '''Smooth the station tilt

    Parameters
    ----------
    da : xarray.DataArray
        either X or Y tilt inclinometer measurements
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        either X or Y smoothed tilt inclinometer measurements
    '''
    # we calculate the moving standard deviation over a 3-day sliding window
    # hourly resampling is necessary to make sure the same threshold can be used
    # for 10 min and hourly data
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # we select the good timestamps and gapfill assuming that
    # - when tilt goes missing the last available value is used
    # - when tilt is not available for the very first time steps, the first
    #   good value is used for backfill
    return da.where(
        moving_std_gap_filled < threshold
    ).ffill(dim='time').bfill(dim='time')


def smoothRot(da: xr.DataArray, threshold=4):
    '''Smooth the station rotation

    Parameters
    ----------
    da : xarray.DataArray
        rotation measurements from inclinometer
    threshold : float
        threshold used in a standard-deviation based filter

    Returns
    -------
    xarray.DataArray
        smoothed rotation measurements from inclinometer
    '''
    moving_std_gap_filled = da.to_series().resample('H').median().rolling(
        3*24, center=True, min_periods=2
    ).std().reindex(da.time, method='bfill').values
    # same as for tilt with, in addition:
    # - a resampling to daily values
    # - a two week median smoothing
    # - a resampling from these daily values to the original temporal resolution
    return ('time', (da.where(moving_std_gap_filled < threshold).ffill(dim='time')
                     .to_series().resample('D').median()
                     .rolling(7*2, center=True, min_periods=2).median()
                     .reindex(da.time, method='bfill').values
                     ))
Is that better?
-    ds['usr_cor'] = ds['usr_cor'].interpolate_na(dim='time', use_coordinate=False)
+    ds['dsr_cor'] = ds.dsr_cor.where(ds.dsr.notnull())
What is the intention of this line? Is it to ensure we are having NaNs instead of Nones?
This was just correcting an issue I found.
When calculating dsr_cor and usr_cor, some additional checks are performed and some timestamps are set to NaN.
The original code was meant to linearly interpolate the timestamps that were filtered out, to increase data coverage.
But what I saw is that sometimes days were set to NaN while the previous and following nights (when usr = dsr = 0) remained. The interpolation then bridged the two nights and set the whole day to dsr_cor = usr_cor = 0.
Another way of putting it is that we should not interpolate over gaps without a thorough check, and even less so as early as L2. So I removed the interpolation and made sure that no values were added in dsr_cor compared to dsr.
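A small synthetic illustration of the bridging problem and of the where guard (all values made up):

import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2024-06-01', periods=5, freq='6H')
# night (0), a filtered-out day (NaNs), night (0)
dsr_cor = xr.DataArray([0.0, np.nan, np.nan, np.nan, 0.0],
                       dims='time', coords={'time': time})

# the old interpolation bridges the two nights and zeroes out the whole day
print(dsr_cor.interpolate_na(dim='time', use_coordinate=False).values)
# [0. 0. 0. 0. 0.]

# the fix: keep dsr_cor only where dsr itself has data, so no values are added
dsr = xr.DataArray([0.0, 350.0, np.nan, 300.0, 0.0],
                   dims='time', coords={'time': time})
print(dsr_cor.where(dsr.notnull()).values)
# [ 0. nan nan nan  0.]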
Next time:
- better doc for calculateSaturationVaporPressure
- default threshold values converted to hours in persistence_qc
- removed unnecessary comment regarding cloud coverage calculation for bedrock stations
src/pypromice/qc/persistence.py
Outdated
'gps_alt':{"max_diff": 0.0001, "period": 12},
't_rad':{"max_diff": 0.0001, "period": 2},
"rh": {"max_diff": 0.0001, "period": 2},  # gets special handling to allow constant 100%
"t": {"max_diff": 0.0001, "period": 0.3},
I think the period length for t and p was thought to be 2 hours and not 20 minutes.
Ah, yes, because it was meant to handle transmissions.
But now that we are talking about it, I suggest that we keep it at period = 0.3 hours, because we know that the loggers were, for a time, wrongly programmed to report the last valid number instead of NaN. So we cannot differentiate a good measurement with the same value as the previous one (rather unlikely) from a missing observation filled with the last valid value.
Also, when period = 2 hours, 12 persistent 10-minute values would remain at the start of a long static period. I would like those to be removed, which would be the case if period = 0.3 hours.
Maybe it is necessary to define configurations specifically for tx, raw, hourly and 10-minute data.
p_u and t_u only have a single decimal of precision in the tx data files, hence I think there is a significant risk of filtering out valid data.
> p_u and t_u only have a single decimal of precision in the tx data files, hence I think there is a significant risk of filtering out valid data.
I didn't know that, good point! I switched back to period = 2 hours until we have specific handling for raw data.
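For later reference, one possible shape for such format-specific handling (names and structure are hypothetical, not the current pypromice API):

# Hypothetical sketch; pypromice currently uses a single DEFAULT_VARIABLE_THRESHOLDS dict.
VARIABLE_THRESHOLDS_BY_FORMAT = {
    # tx data: single-decimal precision, so require longer static runs
    'TX': {
        't': {'max_diff': 0.0001, 'period': 2},
        'p': {'max_diff': 0.0001, 'period': 2},
    },
    # raw 10-minute data: full precision, flag short static runs
    'raw': {
        't': {'max_diff': 0.0001, 'period': 0.3},
        'p': {'max_diff': 0.0001, 'period': 0.3},
    },
}

def get_variable_thresholds(data_format: str) -> dict:
    """Pick the threshold set matching the dataset's format attribute."""
    return VARIABLE_THRESHOLDS_BY_FORMAT.get(data_format, VARIABLE_THRESHOLDS_BY_FORMAT['TX'])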
Persistence QC is done on both raw and tx data, and extended to rh and wspd
pypromice/src/pypromice/process/L1toL2.py
Lines 69 to 73 in 304a270
pypromice/src/pypromice/qc/persistence.py
Lines 23 to 24 in 304a270
QC on GPS data (#238)
Done through:
- a persistence QC on gps_lat and gps_lon
- a persistence QC on gps_alt
pypromice/src/pypromice/qc/persistence.py
Lines 20 to 21 in 304a270
If the variable 'gps_lat_lon' is given, then only times when both gps_lat and gps_lon are static are removed.
pypromice/src/pypromice/qc/persistence.py
Lines 79 to 100 in 304a270
- removing all the elevations that deviate from an elevation baseline (gap-filled monthly median) by more than 100 m (see the sketch below)
- removing gps_lat and gps_lon when gps_alt is missing or has been removed by the steps above
pypromice/src/pypromice/process/L1toL2.py
Lines 75 to 81 in 304a270
See for example here. All the pink points have missing or bad gps_alt. The baseline elevation is the black dashed line.
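For reference, a minimal sketch of such a baseline filter (an assumed reconstruction for illustration; the actual implementation is in the L1toL2.py lines cited above):

import numpy as np
import xarray as xr

def filter_gps_alt_by_baseline(gps_alt: xr.DataArray, max_deviation: float = 100.0) -> xr.DataArray:
    """Remove elevations deviating more than max_deviation metres from a
    gap-filled monthly-median baseline."""
    alt = gps_alt.to_series()
    # monthly medians, gap-filled by interpolation, brought back to the
    # original time resolution
    baseline = (alt.resample('M').median()
                   .interpolate()
                   .reindex(alt.index, method='nearest'))
    return gps_alt.where(np.abs((alt - baseline).values) < max_deviation)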
Removing dlr and ulr when t_rad is bad or not available (#240)
pypromice/src/pypromice/qc/persistence.py
Line 22 in 304a270
pypromice/src/pypromice/process/L1toL2.py
Lines 83 to 86 in 304a270
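In essence the effect is the following masking (a sketch with a hypothetical helper name; the actual lines are cited above):

import xarray as xr

def mask_longwave_on_bad_t_rad(ds: xr.Dataset) -> xr.Dataset:
    """Drop dlr and ulr wherever the radiometer temperature t_rad is missing,
    since both longwave fluxes depend on t_rad."""
    ds = ds.copy()
    ds['dlr'] = ds.dlr.where(ds.t_rad.notnull())
    ds['ulr'] = ds.ulr.where(ds.t_rad.notnull())
    return ds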
Re-calculating rh from time-averaged vapour pressures (#193)
RH is defined as the ratio between partial vapour pressure and saturation vapour pressure. When calculating hourly, daily or monthly averages, the ratio of the mean pressures should be taken instead of the mean of the ratios (as currently done).
pypromice/src/pypromice/process/aws.py
Lines 749 to 759 in 304a270
pypromice/src/pypromice/process/aws.py
Lines 767 to 805 in 304a270
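In code, the difference is the following (a sketch with assumed names; the actual implementation is in the aws.py lines cited above):

import pandas as pd

def rh_from_averaged_pressures(p_vap: pd.Series, p_sat: pd.Series, freq: str = 'D') -> pd.Series:
    """Correct averaging: average the vapour pressures first, then take the ratio."""
    return 100 * p_vap.resample(freq).mean() / p_sat.resample(freq).mean()

def rh_mean_of_ratios(p_vap: pd.Series, p_sat: pd.Series, freq: str = 'D') -> pd.Series:
    """Previous behaviour: averaging the ratios directly, which biases RH
    whenever the saturation pressure varies within the averaging window."""
    return (100 * p_vap / p_sat).resample(freq).mean()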
removing percentile QC (see #242)
smoothing and gap-filling tilt (#176, #123)
This is necessary for the calculation of dsr_cor and usr_cor. It is done here:
pypromice/src/pypromice/process/L1toL2.py
Lines 121 to 142 in 304a270