Efficient Application of Bias Correction #1255
Hi @kckornelsen! There are multiple levels to answer your question.

1. Vectorize the computation

xclim is based on xarray, and it already knows how to apply an operation over a given dimension while handling all the other dimensions at the same time. You don't need to explicitly iterate over them all. Here's a rewrite of your code that drops the explicit for loops. I did not test this and I might have overlooked specificities of your data, but I can try to improve the example if you have specific errors. I added comments prefixed with # PB:.

def process_each_lead_time(i_ltime, leadtime_length, ini_time, geps_fcst, path_eqm_par_folder, path_output_geps, realizations, stations):
    # i_ltime: index of the lead time being processed
    # define the time slot for this lead time
    ltime_start = ini_time + np.timedelta64(leadtime_length*i_ltime, 'h') + np.timedelta64(1, 'h')
    ltime_end = ini_time + np.timedelta64(leadtime_length*(i_ltime+1), 'h')
    # select geps dataset within the time slot
    geps_fcst_ltime = geps_fcst.sel(time=slice(ltime_start, ltime_end)).copy(deep=True)
    # read eqm parameters for this lead time
    # PB: I am assuming the trained data has the same shape as the simulated data (except for the time dimension of course)
    eqm_pars = xr.open_dataset(
        os.path.join(path_eqm_par_folder, "Leadtime_" + str(i_ltime + 1) + "_EQM.nc"),
    )
    geps_uncorrect = geps_fcst_ltime["PC_NWP"]
    # create EQM model based on these parameters
    QM_model = sdba.EmpiricalQuantileMapping.from_dataset(eqm_pars)
    # correct geps forecast
    geps_corrected = QM_model.adjust(geps_uncorrect, extrapolation="constant", interp="linear")
    geps_corrected = geps_corrected.fillna(0)
    geps_corrected = xr.where((geps_corrected > 1000), 0, geps_corrected)
    # assign forecast to the dataset
    geps_fcst["PC_NWP_CORR"] = geps_corrected
    return geps_fcst

The issue could be RAM usage. You said you have 150 stations and 20 members. GEPS is usually 2 weeks long, isn't it? With 3-hourly data, that's 112 timesteps. 112 x 150 x 20 x float64 ~ 2.5 MB. That's extremely small. Xclim has been used on datasets larger than 10 GB, with success. I also see that your trained data doesn't have the same shape as what I expected. Was it generated with xclim? As you can see, this had the side effect of removing the logging. Hopefully, it will be fast enough and you won't need it?

2. Parallelism

A second answer to your question is: best performance is achieved using dask. Quickly: dask operates transparently with xarray and numpy. The idea is to divide the data into "chunks". Each time an operation is requested on a chunked dataset, it is applied to each chunk, and the chunks can be processed in parallel. I invite you to read some tutorials about using dask with xarray. In your case, using dask might not even be useful, as the data is so small. I usually recommend using chunks between 10 and 20 MB. Here's another version using dask:

import dask
from dask.diagnostics import ProgressBar
# PB: set dask's configuration, 4 processes.
# PB: The default method uses as many _threads_ as there are cpu cores on the machine
# PB: This is a bit annoying, but _threads_ are not optimal for some of the operations sdba uses, so we switch to _processes_
dask.config.set(num_workers=4, scheduler='processes')
# PB: For the example, I am chunking the data along the stations dimension.
# PB: Choosing the optimal chunks is not easy. For sdba you should never chunk along "time".
geps_fcst = xr.open_dataset(
    path_to_geps,
    chunks={'stations': 1},
)

def process_each_lead_time(i_ltime, leadtime_length, ini_time, geps_fcst, path_eqm_par_folder, path_output_geps, realizations, stations):
    # i_ltime: index of the lead time being processed
    # define the time slot for this lead time
    ltime_start = ini_time + np.timedelta64(leadtime_length*i_ltime, 'h') + np.timedelta64(1, 'h')
    ltime_end = ini_time + np.timedelta64(leadtime_length*(i_ltime+1), 'h')
    # select geps dataset within the time slot
    geps_fcst_ltime = geps_fcst.sel(time=slice(ltime_start, ltime_end)).copy(deep=True)
    # read eqm parameters for this lead time
    # PB: I am assuming the trained data has the same shape as the simulated data (except for the time dimension of course)
    eqm_pars = xr.open_dataset(
        os.path.join(path_eqm_par_folder, "Leadtime_" + str(i_ltime + 1) + "_EQM.nc"),
        chunks={'stations': 1},  # PB: so the chunks fit the ones on "geps".
    )
    geps_uncorrect = geps_fcst_ltime["PC_NWP"]
    # create EQM model based on these parameters
    QM_model = sdba.EmpiricalQuantileMapping.from_dataset(eqm_pars)
    # correct geps forecast
    geps_corrected = QM_model.adjust(geps_uncorrect, extrapolation="constant", interp="linear")
    geps_corrected = geps_corrected.fillna(0)
    geps_corrected = xr.where((geps_corrected > 1000), 0, geps_corrected)
    with ProgressBar():
        # PB: the computation will be done here.
        geps_corrected.load()
    # assign forecast to the dataset
    geps_fcst["PC_NWP_CORR"] = geps_corrected
    return geps_fcst

3. Vectorize along lead times too

If solution 1 works well, you might be able to vectorize along the lead times as well, instead of iterating. I mean we will extract all lead times and stack them over a new dimension before adjusting, and then we'll unstack each lead time. We have helpers in xclim for this kind of data restructuring, but they were meant for applying adjustments to daily data on moving multi-year windows, so they might not be ideal. It also uses more xarray magic, so it loses points on the "readability" side.

def extract_ltime(ds, leadtime_length, ini_time, i_ltime):
    # define the time slot for this lead time
    ltime_start = ini_time + np.timedelta64(leadtime_length*i_ltime, 'h') + np.timedelta64(1, 'h')
    ltime_end = ini_time + np.timedelta64(leadtime_length*(i_ltime+1), 'h')
    # select within the time slot
    # Add a new dimension with the lead number
    return ds.sel(time=slice(ltime_start, ltime_end)).expand_dims(lead=[i_ltime])
# PB: We do this to get a reference time coordinate
geps_ltime_0 = extract_ltime(geps_fcst, leadtime_length, ini_time, 0)
geps_stacked = xr.concat(
    [
        # For each lead time, extract the series, drop the time coordinate and add a new dimension with the lead number
        extract_ltime(geps_fcst, leadtime_length, ini_time, i).drop_vars('time')
        for i in range(n_leadtimes)
    ],
    'lead'
)
# sdba will want a normal time coordinate, so we assign a reference one.
geps_stacked = geps_stacked.assign_coords(time=geps_ltime_0.time)
# read eqm parameters for each lead time
# PB: I am assuming the trained data has the same shape as the simulated data (except for the time dimension of course)
# PB: Same idea as above, we open each dataset and assign a new "lead time" coord:
eqm_pars = xr.concat(
    [
        xr.open_dataset(
            os.path.join(path_eqm_par_folder, "Leadtime_" + str(i_ltime + 1) + "_EQM.nc")
        ).expand_dims(lead=[i_ltime])
        for i_ltime in range(n_leadtimes)
    ],
    'lead'
)
geps_uncorrect = geps_stacked["PC_NWP"]
# create EQM model based on these parameters
QM_model = sdba.EmpiricalQuantileMapping.from_dataset(eqm_pars)
# correct geps forecast
geps_corrected = QM_model.adjust(geps_uncorrect, extrapolation="constant", interp="linear")
geps_corrected = geps_corrected.fillna(0)
geps_corrected = xr.where((geps_corrected > 1000), 0, geps_corrected)
# Unstack the lead times into a dictionary; keys will be the start time of each lead time as a string:
out = {}
for i_ltime in geps_corrected.lead.values:
    ltime_wrong = geps_corrected.sel(lead=i_ltime)
    ltime_fixed = ltime_wrong.assign_coords(
        # shift the reference time coordinate back to this lead time's real timestamps
        time=ltime_wrong.time + np.timedelta64(leadtime_length * i_ltime, 'h')
    )
    timestamp = ltime_fixed.time[0].dt.strftime('%Y-%m-%dT%H').item()  # any unambiguous format works here
    out[timestamp] = ltime_fixed

I know this last suggestion is a bit complicated, but if (1) works well, this might yield better performance than using dask, considering the small size of your data.
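As an illustrative follow-up (not part of the original comment), once the `out` dictionary is built, each corrected lead time could be written to its own file; the naming pattern below is an assumption:

for timestamp, ltime_fixed in out.items():
    # hypothetical output file naming; path_output_geps comes from the original setup
    fname = os.path.join(path_output_geps, "geps_corrected_" + timestamp + ".nc")
    ltime_fixed.to_dataset(name="PC_NWP_CORR").to_netcdf(fname)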
Hi @aulemahal @kckornelsen, this is super interesting! We obtained EQM parameters for each station and realization one by one and then combined them into one nc file. This is why you see the parameter file has a different structure from the GEPS forecast data. We want to follow your instructions and apply the code in (1. Vectorize the computation). @aulemahal Would you please provide a simple example that creates the EQM parameters across different dimensions? Thank you in advance! PS:
Here is how I create the EQM parameter file for each lead time.
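(The attached script is not reproduced in this excerpt. Purely as an illustration, a training step using xclim's sdba API could look like the sketch below; the variable names, number of quantiles, multiplicative kind, and file-naming pattern are assumptions, not the actual code.)

from xclim import sdba

# capa_obs and geps_hist are assumed DataArrays of precipitation for one lead time,
# with dims ("time", "stations", "realizations"); xclim trains over "time" and
# broadcasts over the other dimensions, so no per-station loop is needed.
EQM = sdba.EmpiricalQuantileMapping.train(
    ref=capa_obs,
    hist=geps_hist,
    nquantiles=50,
    group="time",
    kind="*",  # multiplicative correction, a common choice for precipitation
)
# The trained parameters live in EQM.ds; writing them out gives a file that
# EmpiricalQuantileMapping.from_dataset() can reload, as in the code above.
EQM.ds.to_netcdf("Leadtime_" + str(i_ltime + 1) + "_EQM.nc")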
Hi @aulemahal, I modified the existing xclim examples and tested the vectorized approach from (1). It seems to have worked. Here is the example code. Would you please let me know if you notice anything wrong with it?
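(That example code is not included in this excerpt. As a generic, hypothetical check, not taken from the thread, the output of such a vectorized run can be sanity-checked along these lines:)

import xarray as xr

def sanity_check(original: xr.DataArray, corrected: xr.DataArray) -> None:
    # The adjustment should only change values, never the shape of the data.
    assert dict(corrected.sizes) == dict(original.sizes), "dimension sizes changed during adjustment"
    # Precipitation should stay non-negative after the correction.
    n_neg = int((corrected < 0).sum())
    print(n_neg, "negative values after correction")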
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes / features)
  - This PR will help #1255
- [x] Tests for the changes have been added (for bug fixes / features)
- [ ] (If applicable) Documentation has been added / updated (for bug fixes / features)
- [x] CHANGES.rst has been updated (with summary of main changes)
  - [x] Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added

### What kind of change does this PR introduce?
* `nbutils.quantile` has a speed-up of more than 2.5x, by a combination of changes in `nbutils.quantile` and `nbutils._quantile`
* This does not cover `nbutils.vec_quantiles` (used for `adapt_freq`), but similar principles could be used
* It adds the possibility of using the `fastnanquantile` module, which is very fast

### Does this PR introduce a breaking change?
No

### Other information:
* The new low-level function to compute quantiles, `nbutils._quantile`, is a 1d jitted version of `xclim.core.utils._nan_quantile`
* Manual benchmarking can be performed in the notebook `benchmarks/sdba_quantile.ipynb`, attached to this PR.
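(The benchmark notebook itself is not included here. As a stand-alone illustration, not the notebook's code, a rough baseline timing of a nan-aware quantile computation on data shaped like this use case could look like:)

import time
import numpy as np

# Synthetic precipitation-like data: stations x members x time, with some missing values.
rng = np.random.default_rng(0)
data = rng.gamma(shape=0.5, scale=5.0, size=(150, 20, 112))
data[rng.random(data.shape) < 0.1] = np.nan

start = time.perf_counter()
q = np.nanquantile(data, np.linspace(0.01, 0.99, 50), axis=-1)
print("np.nanquantile:", round(time.perf_counter() - start, 3), "s, result shape", q.shape)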
Description
I am trying to use xclim to bias-correct precipitation from ECCC's GEPS product (ensemble NWP), correcting it against CaPA data. The NWP model has 20 ensemble members and 16 lead times in the forecast. The data is applied to N different watersheds (each with a single time series, not gridded data). Through some research it was determined that each GEPS lead time has a slightly different bias. The EQM method was selected and the parameters for EQM are pre-computed. The current code takes quite a bit of time to run and I would like to know if there are ways to speed it up, particularly around the nested for loops. Is there a built-in xarray or xclim function?
What I Did
Assume we start with a bit of housekeeping, imports, identifying folders, etc.
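(The actual setup code was not included in this excerpt; the sketch below is a hypothetical version of that housekeeping step, with placeholder paths and values.)

import os

import numpy as np
import xarray as xr
from xclim import sdba

path_eqm_par_folder = "/path/to/eqm_parameters"   # per-lead-time EQM parameter files (placeholder)
path_output_geps = "/path/to/output"              # where corrected forecasts are written (placeholder)
path_to_geps = "/path/to/geps_forecast.nc"        # raw GEPS forecast, stations x members x time (placeholder)

leadtime_length = 24                              # hours per lead-time block (placeholder value)
n_leadtimes = 16
ini_time = np.datetime64("2023-06-01T00")         # forecast initialization time (example value)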
What I Received
This function works and provides the expected results. With the parallel code modified as below, it takes ~25 min to run. This process is applied to ~150 sites with only a 14-day forecast lead time.
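(The parallel driver itself was not included in this excerpt; the sketch below is a guess at that kind of driver, not the original code. It assumes the housekeeping variables above plus `geps_fcst`, `realizations`, and `stations` from the data loading step, and runs one lead time per worker process with the function defined earlier in the thread.)

from concurrent.futures import ProcessPoolExecutor
from functools import partial

# Bind everything except the lead-time index, then map over lead times in separate processes.
worker = partial(
    process_each_lead_time,
    leadtime_length=leadtime_length,
    ini_time=ini_time,
    geps_fcst=geps_fcst,
    path_eqm_par_folder=path_eqm_par_folder,
    path_output_geps=path_output_geps,
    realizations=realizations,
    stations=stations,
)

with ProcessPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(worker, range(n_leadtimes)))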