CDAT Migration: Refactor annual_cycle_zonal_mean set #798
Conversation
Force-pushed 7bc2657 to ebe73f1.
Basic driver and plotting scripts are working, though only with one variable so far.

Current results with one variable: https://portal.nersc.gov/cfs/e3sm/cdat-migration-fy24/669-annual_cycle_zonal_mean/viewer/

Other TODO items:
- refine axis config for plot
- fix viewer
- Verify all variable runs
This issue seems to be related to these:

I'm currently debugging and will push fixes.
Force-pushed ebe73f1 to 629b8e3.
This commit fixes the multiprocessing=True TimeoutError issue in this comment.
RE: #798 (comment)
Current results with one variable: portal.nersc.gov/cfs/e3sm/cdat-migration-fy24/669-annual_cycle_zonal_mean/viewer
Other TODO items:
- refine axis config for plot
- fix viewer
- Verify all variable runs
I think the only remaining items are the last two bullets.
# NOTE: This GitHub issue explains why the "coords" and "compat" args
# are defined as they are below: https://github.com/xCDAT/xcdat/issues/641
args = {
    "paths": filepath,
    "decode_times": False,
    "add_bounds": ["X", "Y"],
    "coords": "minimal",
    "compat": "override",
    "chunks": "auto",
}
Notable change. I am going to remove "chunks": "auto" because we end up loading the dataset into memory. This means downstream computational operations are all serial within the single process.
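For reference, a rough sketch of the resulting pattern (the glob path here is a placeholder, not the actual e3sm_diags code): open the multi-file climatology without Dask auto-chunking, then load it into memory before downstream operations.

import xcdat as xc

# Placeholder glob for the 12 monthly climatology files.
filepath = "/path/to/climo/*_climo.nc"

ds = xc.open_mfdataset(
    paths=filepath,
    decode_times=False,
    add_bounds=["X", "Y"],
    coords="minimal",
    compat="override",
    # "chunks": "auto" removed -- the dataset is loaded into memory anyway,
    # so auto-chunking only adds Dask overhead here.
)

# Load with the synchronous scheduler to avoid conflicts with the
# multiprocessing scheduler used by e3sm_diags.
ds.load(scheduler="sync")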
# NOTE: There seems to be an issue with `open_mfdataset()` and
# using the multiprocessing scheduler defined in e3sm_diags,
# resulting in timeouts and resource locking.
# To avoid this, we load the multi-file dataset into memory before
# performing downstream operations.
# Related GH issue: https://github.com/pydata/xarray/issues/3781
ds.load(scheduler="sync")
Notable change.
# --------------------------------------------------------------------------
plt.xticks(time, X_TICKS)
lat_formatter = LatitudeFormatter()  # type: ignore
ax.yaxis.set_major_formatter(lat_formatter)
ax.tick_params(labelsize=8.0, direction="out", width=1)
ax.xaxis.set_ticks_position("bottom")
ax.yaxis.set_ticks_position("left")
I added this block of code from the old plotter because it was missing here.
I think it fixes the "refine axis config for plot" todo item in this comment.
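For context, a minimal standalone sketch of this axis configuration on a plain matplotlib axis (made-up data; LatitudeFormatter comes from cartopy):

import matplotlib.pyplot as plt
import numpy as np
from cartopy.mpl.ticker import LatitudeFormatter

fig, ax = plt.subplots()
ax.plot(np.arange(1, 13), np.linspace(-90, 90, 12))

# Format y-axis ticks as latitudes (with N/S suffixes) and match the old
# plotter's tick styling.
ax.yaxis.set_major_formatter(LatitudeFormatter())
ax.tick_params(labelsize=8.0, direction="out", width=1)
ax.xaxis.set_ticks_position("bottom")
ax.yaxis.set_ticks_position("left")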
 _save_data_metrics_and_plots(
     parameter,
     plot_func,
     var_key,
     test_zonal_mean.to_dataset(),
     ref_zonal_mean.to_dataset(),
     diff,
-    metrics_dict={},
+    metrics_dict=None,
metrics_dict can be set to None after removing the metrics_dict arg from the plot function.
@@ -36,7 +34,6 @@ def plot(
     da_test: xr.DataArray,
     da_ref: xr.DataArray,
     da_diff: xr.DataArray,
-    metrics_dict: MetricsDict,
Removed unused metrics_dict arg.
def _open_annual_cycle_climo_dataset(self, filepath: str) -> xr.Dataset:
    """Open 12 monthly mean climatology dataset.

    Parameters
    ----------
    filepath : str
        The path to the climatology datasets.
    """
    args = {"paths": filepath, "decode_times": False, "add_bounds": ["X", "Y"]}
    ds = xc.open_mfdataset(**args)

    return ds
I also replaced _open_annual_cycle_climo_dataset() with an updated version of _open_climo_dataset() that supports multi-file datasets.
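As a rough, hypothetical sketch of what multi-file support can look like (not the actual _open_climo_dataset() implementation): a single matching file opens as before, while a glob of monthly files goes through open_mfdataset().

import glob

import xarray as xr
import xcdat as xc


def _open_climo_dataset_sketch(filepath: str) -> xr.Dataset:
    """Open a climatology dataset from a single file or a multi-file glob."""
    args = {"decode_times": False, "add_bounds": ["X", "Y"]}

    # More than one matching file: concatenate the monthly climatology files.
    if len(glob.glob(filepath)) > 1:
        return xc.open_mfdataset(filepath, **args)

    return xc.open_dataset(filepath, **args)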
Force-pushed 1de9b95 to ceba30c.
Force-pushed f6c4fdf to 1e1ab90.
@chengzhuzhang you can pick this set back up. I did not make any progress since our last meeting on 4/15/24 (notes). Specifically, there is still a problem: when multiprocessing=True is set, an error occurs.
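For reproducing the failure, a minimal run-configuration sketch (assuming the standard e3sm_diags runner API; the data paths are placeholders):

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.run import runner

param = CoreParameter()
param.test_data_path = "/path/to/test/climo"      # placeholder
param.reference_data_path = "/path/to/ref/climo"  # placeholder
param.results_dir = "/path/to/results"            # placeholder

# The failing configuration: run the set with multiprocessing enabled.
param.multiprocessing = True
param.num_workers = 4

runner.sets_to_run = ["annual_cycle_zonal_mean"]
runner.run_diags([param])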
@@ -109,7 +109,8 @@ def _add_colormap(
     # Add the contour plot
     # --------------------------------------------------------------------------
     ax = fig.add_axes(DEFAULT_PANEL_CFG[subplot_num], projection=None)
-    var = var.transpose("lat", "time")
+    # var = var.transpose("lat", "time")
+    var = var.transpose(var.dims[1], var.dims[0])
One SST dataset has "latitude" instead of "lat" as its dimension name. This code change avoids referencing the dimension name explicitly.
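A small illustration of why transposing by position works regardless of the latitude dimension's name (made-up data):

import numpy as np
import xarray as xr

# One dataset may use "latitude" instead of "lat" as the dimension name.
var = xr.DataArray(np.zeros((12, 180)), dims=("time", "latitude"))

# Transpose by dimension position rather than by name, so "lat" vs.
# "latitude" no longer matters.
var_t = var.transpose(var.dims[1], var.dims[0])
print(var_t.dims)  # ('latitude', 'time')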
Further update: the TimeoutError came from driver/utils/regrid.py.
@@ -413,6 +413,14 @@ def _get_climo_dataset(self, season: str) -> xr.Dataset:
     # ds = ds[[self.var, 'lat_bnds', 'lon_bnds']]
     ds = ds[[self.var] + keep_bnds]
This line keeps only the variable (after derivation) and the bounds-related data variables. It helps remove excess data to alleviate memory usage. This change in dataset_xr.py may affect other sets.
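A sketch of the idea with placeholder names (not the exact dataset_xr.py code): keep only the target variable plus any bounds variables and drop the rest.

import xcdat as xc

ds = xc.open_mfdataset("/path/to/climo/*.nc", decode_times=False)  # placeholder path

var_key = "PRECT"  # placeholder variable key
keep_bnds = [v for v in ds.data_vars if str(v).endswith("_bnds")]

# Drop everything except the derived variable and its bounds (e.g., lat_bnds,
# lon_bnds, time_bnds) to keep memory usage down.
ds = ds[[var_key] + keep_bnds]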
# To avoid this, we load the multi-file dataset into memory before
# performing downstream operations.
# Related GH issue: https://github.com/pydata/xarray/issues/3781
ds.load(scheduler="sync")
This is needed to resolve conflicts between multiprocessing and Dask. However, it needs to be placed here so that only the needed variables are loaded; otherwise, out-of-memory errors still occur, as @tomvothecoder noted.
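As an illustrative sketch of the interaction (the structure is hypothetical, not the actual e3sm_diags driver): the outer pool uses multiprocessing, while each worker subsets and then loads its data with Dask's synchronous scheduler.

import multiprocessing as mp

import xcdat as xc


def _run_one_variable(var_key: str) -> float:
    ds = xc.open_mfdataset("/path/to/climo/*.nc", decode_times=False)  # placeholder
    keep_bnds = [v for v in ds.data_vars if str(v).endswith("_bnds")]

    # Subset first, then load with the synchronous scheduler so Dask does not
    # conflict with the outer multiprocessing pool or blow up memory.
    ds_sub = ds[[var_key] + keep_bnds].load(scheduler="sync")

    return float(ds_sub[var_key].mean())


if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        print(pool.map(_run_one_variable, ["PRECT", "TREFHT"]))  # placeholder keys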
Force-pushed f1dc8eb to 2cf65c4.
Force-pushed bf28fe2 to 8b90a38.
Force-pushed 2c188cd to befef87.
Actually, the easier thing to do is to ignore the decoded time values since they aren't used, and to assume the order is 1-12 (Jan to Dec) like the CDAT code does. The main caveat is that the time coordinates must be in ascending order, which they are when opening the datasets in Xarray/xCDAT. The only change needed is to update e3sm_diags/e3sm_diags/plot/annual_cycle_zonal_mean_plot.py (lines 103 to 108 in 784404b).
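A sketch of what assuming month integers 1-12 in the plotter can look like (illustrative names and data, not the actual annual_cycle_zonal_mean_plot.py code):

import matplotlib.pyplot as plt
import numpy as np

# Ignore decoded time values and assume the 12 climatology steps are Jan-Dec
# in ascending order; plot against month integers 1-12.
time = np.arange(1, 13)
X_TICKS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

fig, ax = plt.subplots()
ax.plot(time, np.random.rand(12))
ax.set_xticks(time)
ax.set_xticklabels(X_TICKS)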
@tomvothecoder I was searching for a code example that reads data using open_mfdataset while specifying the order of files: https://stackoverflow.com/questions/75241585/using-xarrays-open-mfdataset-to-open-a-series-of-nc-files

Though I think your solution actually works okay, given that the time coordinates are already in ascending order.
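For reference, one way to make the file order explicit with open_mfdataset (placeholder paths; this mirrors the approach discussed in that thread):

import glob

import xarray as xr

# Placeholder pattern for 12 monthly climatology files (e.g., *_01_*.nc .. *_12_*.nc).
paths = sorted(glob.glob("/path/to/climo/*_??_*climo.nc"))

# combine="nested" concatenates the files in the order given, so sorting the
# list controls the month order explicitly.
ds = xr.open_mfdataset(paths, combine="nested", concat_dim="time", decode_times=False)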
@tomvothecoder I'm retesting this set with all variables and realized that the memory issue came back. I then tested again with the commit that resolved the memory issue (15811b8): no errors. Some change between f2c3568 and 15811b8 brought back the issue. I suspected the open_mfdataset args change.
Besides the recent plotter update,

To change back
I'm not sure of the best way to continue troubleshooting after ruling out the args change for open_mfdataset. What I did was swap the dataset_xr.py from commit 15811b8 into the latest code (I did need to edit it slightly to make it work, i.e. change CLIMO_FREQ to Climo_Freq). No memory issue. At least this narrows down the problematic file, and I suspect some changes merged in other PRs introduced the memory problem. I'm stepping through the diffs to see what might be the cause. The file diff for dataset_xr.py is here: https://www.diffchecker.com/mTw8AWif/
- Due to incorrectly updating `keep_bnds` logic
- Add `_encode_time_coords()` to work around cftime issue `ValueError: "months since" units only allowed for "360_day" calendar`
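A rough sketch of the kind of workaround `_encode_time_coords()` points at (hypothetical helper, not the actual implementation): replace a "months since" time coordinate with plain month integers so cftime never has to decode those units.

import numpy as np
import xarray as xr


def _encode_time_coords_sketch(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical workaround for the 'months since' calendar restriction.

    Assumes the 12 climatology time steps are already ordered Jan-Dec and
    replaces the time coordinate with month integers 1-12.
    """
    ds = ds.copy()
    ds = ds.assign_coords(time=np.arange(1, ds.sizes["time"] + 1))
    ds["time"].attrs = {"units": "month", "long_name": "month of year"}

    return ds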
I was actually in the middle of debugging here with my comment. I resolved the multiprocessing issue; it was my fault :( Issues I resolved in
I re-ran the regression test notebook with the latest commit. I am still getting the following diffs:

AODVIS
Comparing:
    /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc
    /global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc
AODVIS
var_key AODVIS
Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
 x: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...
 y: array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],...

ALBEDO -- It just looks like
@tomvothecoder this is a big relief! I skimmed through the file several times and noticed the changed line
I added a debug script for AODVIS. I think the max relative diff is large because the values are close to 0.

import numpy as np
import xcdat as xc
dev_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/669-annual_cycle_zonal_mean-debug/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-ANNUALCYCLE-global_ref.nc"
main_path = "/global/cfs/cdirs/e3sm/www/cdat-migration-fy24/main/annual_cycle_zonal_mean/AOD_550/AOD_550-AODVIS-Annual-Cycle_test.nc"
var_a = xc.open_dataset(dev_path)["AODVIS"]
var_b = xc.open_dataset(main_path)["AODVIS"]
"""
Floating point comparison
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0
Mismatched elements: 1808 / 2160 (83.7%)
Max absolute difference: 0.12250582
Max relative difference: 91.14554689
x: array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
"""
np.testing.assert_allclose(var_a, var_b)
# Get the max of all values
# -------------------------
# 0.28664299845695496
print(var_a.max().item())
# 0.2866430557436412
print(var_b.max().item())
# Get the min of all values
# -------------------------
# 0.0
print(var_a.min().item())
# 0.0
print(var_b.min().item())
# Get the sum of all values
# -------------------------
# 224.2569122314453
print(var_a.sum().item())
# 224.25691348856003
print(var_b.sum().item())
# Get the mean of all values
# -------------------------
# 0.10382264107465744
print(var_a.mean().item())
# 0.1038226451335926
print(var_b.mean().item())
# %%
# Get the max absolute diff
# -------------------------
# 0.12250582128763199
print((var_a - var_b).max().item())
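As a side note on the large relative difference: with made-up numbers, the relative difference blows up wherever the reference value is close to zero, even when the absolute difference is small, which matches the near-zero AODVIS values here.

import numpy as np

# Illustrative (made-up) values: the same absolute difference gives a very
# different relative difference depending on how close the reference is to zero.
a = np.array([0.0920, 0.5910])
b = np.array([0.0010, 0.5000])
print(np.abs(a - b))              # [0.091 0.091]
print(np.abs(a - b) / np.abs(b))  # [91.    0.182]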
Yeah, the values and metrics all look very close. Based on the plots I saw earlier, the months were off.
@tomvothecoder I think we can merge after the CI/CD tests are completed!
* Refactor `annual_cycle_zonal_mean` set
* Address PR review comments
* Add lat lon regression testing
* Add debugging scripts
* Update `_open_climo_dataset()` to decode times as workaround to misaligned time coords
  - Update `annual_cycle_zonal_mean_plot.py` to convert time coordinates to month integers
* Fix unit tests
* Remove old plotter
* Add script to debug decode_times=True and ncclimo file
* Update plotter time values to month integers
* Fix slow `.load()` and multiprocessing issue
  - Due to incorrectly updating `keep_bnds` logic
  - Add `_encode_time_coords()` to workaround cftime issue `ValueError: "months since" units only allowed for "360_day" calendar`
* Update `_encode_time_coords()` docstring
* Add AODVIS debug script
* update AODVIS obs datasets; regression test results

---------

Co-authored-by: Tom Vo <tomvothecoder@gmail.com>
Description

Refactor `annual_cycle_zonal_mean` with xarray/xcdat
- Driver is pretty short and has a unique `_create_annual_cycle` function
- Refactor `annual_cycle_zonal_mean` set #669

Checklist
If applicable: