Land-use data upload #123
I am attempting to upload the files to ftp.llnl.gov. The instructions say to upload files to the "incoming" folder, but when I connect to the FTP server (using FileZilla on my Mac) there is just an empty root directory and no folder named "incoming". Should I just upload to that location, or am I not connected correctly?
Good question @lchini, we've had a similar query from @mjevanmarle this morning. How about trying to connect explicitly using the IP address 198.128.250.1? See below (screenshot elided); it seems to work for me currently. Once in, you should be able to navigate to the "incoming" folder.
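If FileZilla keeps playing up, the same workaround can be tested from Python using only the standard library. A sketch (the anonymous-login convention and the email placeholder are assumptions, following the upload scripts below):

```python
import ftplib

# Connect by IP to dodge the DNS issue, then log in anonymously
# (by convention, the password for anonymous FTP is your email address).
ftp = ftplib.FTP("198.128.250.1")
ftp.login(user="anonymous", passwd="your_email@example.com")
print(ftp.nlst())      # the "incoming" folder should be listed here
ftp.cwd("incoming")    # move into it, ready for uploads
ftp.quit()
```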
Option b: here's a python script that should work. You'll need to install input4mips-validation first.

Python script:

```python
import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"
# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"
# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"
FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)
    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)
        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")
        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")
```
@lchini @mjevanmarle it seems like there is a DNS/network issue that is causing problems for me when I am not connected to the LLNL institutional network. Weirdly, this isn't an issue for @znichollscr, so it might be something that will just work itself out, or will need a nudge within the LLNL network. This is what I see (screenshot elided), which looks similar to @mjevanmarle's issue, and @lchini probably your issue too. I'll raise a ticket with the LLNL network folks to see if someone can check.
Yes, that is the same issue that I'm experiencing. I've been trying to configure settings on my end, but it sounds like I might need to wait for the LLNL server update.
@lchini can you try the python script and post the output here if it fails please?
The Python script is the same as the one in my comment above.
I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly, because when I tried to run the python script I received the error: No module named 'input4mips_validation'
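For what it's worth, a common culprit here is pip installing into a different Python environment than the one running the script. A quick check (a sketch; installing via `python -m pip install input4mips-validation` ties the install to the interpreter you actually run):

```python
import sys

# Which interpreter is actually running? pip must install into this one.
print(sys.executable)

# If the install landed in the right environment, this import succeeds
# and shows where the package lives.
import input4mips_validation
print(input4mips_validation.__file__)
```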
I've tried to engage with LLNL comp support folks and had a tepid response, so if this doesn't begin to work next time we try, let's seek alternative paths to getting these data into the publication queues.
Hmmm, that's unfortunate. Here's a version of the script without any dependencies that aren't in the standard library, so it should just work with any Python >= 3.9. Can you try that please?

```python
import ftplib
import os
import traceback
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Optional

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"
# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"
# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-4"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"
FTP_DIR_ROOT = "/incoming"


@contextmanager
def login_to_ftp(
    ftp_server: str, username: str, password: str, dry_run: bool
) -> Iterator[Optional[ftplib.FTP]]:
    """
    Create a connection to an FTP server.

    When the context block is exited, the connection is closed.

    If we are doing a dry run, `None` is returned instead
    to signal that no connection was actually made.
    We do, however, log messages to indicate what would have happened.

    Parameters
    ----------
    ftp_server
        FTP server to login to

    username
        Username

    password
        Password

    dry_run
        Is this a dry run?

        If `True`, we won't actually login to the FTP server.

    Yields
    ------
    :
        Connection to the FTP server.

        If it is a dry run, we simply return `None`.
    """
    if dry_run:
        print(f"Dry run. Would log in to {ftp_server} using {username=}")
        ftp = None
    else:
        ftp = ftplib.FTP(ftp_server, passwd=password, user=username)  # noqa: S321
        print(f"Logged into {ftp_server} using {username=}")

    yield ftp

    if ftp is None:
        if not dry_run:  # pragma: no cover
            raise AssertionError

        print(f"Dry run. Would close connection to {ftp_server}")
    else:
        ftp.quit()
        print(f"Closed connection to {ftp_server}")


def cd_v(dir_to_move_to: str, ftp: ftplib.FTP) -> ftplib.FTP:
    """
    Change directory verbosely

    Parameters
    ----------
    dir_to_move_to
        Directory to move to on the server

    ftp
        FTP connection

    Returns
    -------
    :
        The FTP connection
    """
    ftp.cwd(dir_to_move_to)
    print(f"Now in {ftp.pwd()} on FTP server")

    return ftp


def mkdir_v(dir_to_make: str, ftp: ftplib.FTP) -> None:
    """
    Make directory verbosely

    Also, don't fail if the directory already exists

    Parameters
    ----------
    dir_to_make
        Directory to make

    ftp
        FTP connection
    """
    try:
        print(f"Attempting to make {dir_to_make} on {ftp.host=}")
        ftp.mkd(dir_to_make)
        print(f"Made {dir_to_make} on {ftp.host=}")
    except ftplib.error_perm:
        print(f"{dir_to_make} already exists on {ftp.host=}")


def upload_file(
    file: Path,
    strip_pre_upload: Path,
    ftp_dir_upload_in: str,
    ftp: Optional[ftplib.FTP],
) -> Optional[ftplib.FTP]:
    """
    Upload a file to an FTP server

    Parameters
    ----------
    file
        File to upload.

        The full path of the file relative to `strip_pre_upload` will be uploaded.
        In other words, any directories in `file` will be made on the
        FTP server before uploading.

    strip_pre_upload
        The parts of the path that should be stripped before the file is uploaded.

        For example, if `file` is `/path/to/a/file/somewhere/file.nc`
        and `strip_pre_upload` is `/path/to/a`,
        then we will upload the file to `file/somewhere/file.nc` on the FTP server
        (relative to whatever directory the FTP server is in
        when we enter this function).

    ftp_dir_upload_in
        Directory on the FTP server in which to upload `file`
        (after removing `strip_pre_upload`).

    ftp
        FTP connection to use for the upload.

        If this is `None`, we assume this is a dry run.

    Returns
    -------
    :
        The FTP connection.

        If it is a dry run, this can simply be `None`.
    """
    print(f"Uploading {file}")
    if ftp is None:
        print(f"Dry run. Would cd on the FTP server to {ftp_dir_upload_in}")
    else:
        cd_v(ftp_dir_upload_in, ftp=ftp)

    filepath_upload = file.relative_to(strip_pre_upload)
    print(
        f"Relative to {ftp_dir_upload_in} on the FTP server, "
        f"will upload {file} to {filepath_upload}",
    )
    for parent in list(filepath_upload.parents)[::-1]:
        if parent == Path("."):
            continue

        to_make = parent.parts[-1]
        if ftp is None:
            print(
                "Dry run. "
                "Would ensure existence of "
                f"and cd on the FTP server to {to_make}"
            )
        else:
            mkdir_v(to_make, ftp=ftp)
            cd_v(to_make, ftp=ftp)

    if ftp is None:
        print(f"Dry run. Would upload {file}")
        return ftp

    with open(file, "rb") as fh:
        upload_command = f"STOR {file.name}"
        print(f"Upload command: {upload_command}")
        try:
            print(f"Initiating upload of {file}")
            ftp.storbinary(upload_command, fh)
            print(f"Successfully uploaded {file}")
        except ftplib.error_perm:
            print(
                f"{file.name} already exists on the server in {ftp.pwd()}. "
                "Use a different directory on the receiving server "
                "if you really wish to upload again."
            )
            raise

    return ftp


with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)
    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)
        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")
        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")
```
Thanks for the new script Zeb. I'm running it now and it appears to be working, although it's hard to gauge progress on the other end. The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.
The script completed. Can someone else confirm that it was successful? I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?
Ah yes, it uploads every `.nc` file under the given directory (the search is recursive).
Hopefully @durack1 can take a look. Can you tell us which directory you uploaded into (i.e. the value of `FTP_DIR_REL_TO_ROOT` you used)?
Sounds good. We'll take a look and get back to you asap.
@lchini great! We're off; I could see the below (screenshot elided), so if that looks right to you, mint a new upload dir and give us the lot. If you can also indicate what we're to expect (number of files), then I can double-check these and drop them into the publication queue, where we can run @znichollscr's validator to double check.
Alrighty, looks like Paul found it, so don't worry: we don't need any more info for now. I'll take a look and get back to you asap.
Yep, we'll re-write that as part of putting the file into the DRS.
@znichollscr the 2 files are in the normal place.
Alrighty: I'm assuming that […]. For the states file: […]

Other than that, looks good I think.
@znichollscr yep, looks like you're right - we have a bigger version of that file now, AND another transitions file. @lchini so I can wait until it's all up: what should we be expecting, how many files, and their filenames/sizes? I might wait until I've heard back from you, and until the complete set is down, before I pull these across.
The management4.nc file was uploaded in error when I didn't realize that the python script would upload all files in the given directory. So please delete that one. There are 4 files that I'll be uploading for states, transitions, and management, as well as a staticData file. The issues that Zeb pointed out with the states file will be issues in the transitions and management files too. I've already uploaded the transitions so will have to fix and re-upload that one as well as the states file, and I'll try to update the management file before I upload it.
For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g. 0, 1, 2, 3, ... Should I revert to the original plan, or switch to days since 850 as you suggested? We have 1175 years of data, so a simple multiplication by 365 will end up missing quite a few days due to leap years.
Unfortunately I can't do anything about deleting/moving/etc. on this system; it's simply a dropbox. So, good to know - I'll purge it in our cop(y/ies) once I pull the complete file list down. When you have the new data generated, upload it to a new directory, e.g. […]. Also, we have a standard template for the filenames (and directory structure, which I can impose once the files are down and their metadata matches what we expect), so this should be something like […]
UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through the vernal equinox; yes, pedantic). So if we are mapping into "days since", then we'd have to be careful about @znichollscr's suggested multiplication, as this will lead to problems toward the end of the record. In addition, as you span the Gregorian calendar hop (1582-10-04 followed by 1582-10-15 the next day), this is going to get a little weird. @lchini how are you writing these files, with what software? The python datetime library and cftime could help here.
Our model that generates the data and writes the files is written in C++. I am doing some post-processing on the files in MATLAB (just to add in the new variables that don't have computed data yet), and then doing more post-processing (modifying the time dimension, writing global attributes, etc.) using NCO command-line tools. I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?
To be honest, your file looks pretty good to me. Here is the time axis as reported by ncdump:

```
variables:
        double time(time) ;
                time:axis = "T" ;
                time:calendar = "noleap" ;
                time:long_name = "time" ;
                time:realtopology = "linear" ;
                time:standard_name = "time" ;
                time:units = "years since 850-01-01 0:0:0" ;
...
data:

 time = "0850-01-01", "0851-01-01", "0852-01-01", "0853-01-01", "0854-01-01",
    "0855-01-01", "0856-01-01", "0857-01-01", "0858-01-01", "0859-01-01",
    ...
    "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01",
    "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01" ;
```

The xarray warning is the time-decoding error shown in full further down. A quick tweak to the time units (switching them from "years since" to "days since") should fix it.
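For what it's worth, a sketch of that tweak using the netCDF4 Python library (untested; the filename is assumed from the ncdump above, and NCO's ncap2/ncatted could make the same in-place edit):

```python
import netCDF4

# In a noleap calendar every year is exactly 365 days, so converting
# "years since 850" to "days since 850" is a straight multiplication.
with netCDF4.Dataset("states_new_vars2.nc", "a") as ds:
    time = ds.variables["time"]
    time[:] = time[:] * 365
    time.units = "days since 850-01-01 0:0:0"
    # apply the same scaling to time_bnds if it shares the time units
```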
There's also a couple of inconsistencies in the file metadata vs what we are expecting (the `##` annotations mark the values we'd expect instead):

```
// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :variable_id = "multiple" ;
                :grid_label = "gn" ;
                :mip_era = "CMIP6" ; ## CMIP6Plus
                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \"Share Alike\" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;
                :activity_id = "CMIP7" ; ### input4MIPs
                :dataset_version_number = "LUH3 V0" ;
                :source_id = "UofMD-landState-LUH3" ; ## UofMD-landState-3-0
                :target_mip = "CMIP7" ; ### CMIP
                :references = "Hurtt et al. 2020, Chini et al. 2021" ; ## Want to expand these with DOIs?
                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;
```
(This is completely non-obvious unless you love the CF conventions.) Because you're using a 'noleap' calendar, every year in your calendar has exactly 365 days. Hence, you can do the multiplication by 365 without an issue (just don't change the calendar attribute of your time variable!).
See above. Because of the calendar attribute, UDUNITS doesn't come into it and just multiplying by 365 is fine (again, this statement only applies because of the "noleap" calendar).
As above, because of the calendar, converting to days is trivial. The reason I would (strongly) recommend doing this is that the data doesn't load properly with xarray if the time units are "years since" rather than "days since". This is a bug in xarray, but given it is such a widely used tool, I would recommend making this tweak (particularly given how trivial it is).
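If you want to sanity-check the converted axis, cftime (the library xarray calls under the hood) decodes it directly. A tiny sketch:

```python
import cftime

# 0, 365 and 730 days in a noleap calendar land exactly on 1 Jan
# of consecutive years, with no leap-day drift.
print(cftime.num2date([0, 365, 730], "days since 850-01-01", calendar="noleap"))
```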
Note here @durack1 that you've loaded with […]. The full xarray error:

```
>>> import xarray as xr
>>> xr.open_dataset("states_new_vars2.nc", use_cftime=True)
Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 218, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 349, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 242, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 587, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 105, in cftime._cftime._dateparse
ValueError: In general, units must be one of 'microseconds', 'milliseconds', 'seconds', 'minutes', 'hours', or 'days' (or select abbreviated versions of these). For the '360_day' calendar, 'months' can also be used, or for the 'noleap' calendar 'common_years' can also be used. Got 'years' instead, which are not recognized.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 450, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
                  ^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 291, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 228, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    ds = store_entrypoint.open_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/store.py", line 46, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 461, in decode_cf_variables
    raise type(e)(f"Failed to decode variable {k!r}: {e}") from e
ValueError: Failed to decode variable 'time': unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
```
I don't think this matters for us though, does it Paul? We'll just re-write it with the correct name and save @lchini the headache. If you do want to write it yourself, the current DRS suggests the filename should start with "multiple-" (e.g. multiple-transitions, multiple-states) because there are multiple variables in the file.
Thanks for this info. I think most of these changes will be easy to implement and I will get started on it right away. The issue of the time variable and time bounds should also be OK, but I just wanted to make sure I get this right. As I understand it, the plan is the following: […]

Does this sound correct? Questions: […]
All correct. (The 1582 calendar change also doesn't matter, as all you're really saying with your data is "this is the start-of-year state", which is what the approach you're taking will do.)
Spot on. I think the variable is meant to be called […]
I don't think it matters, but I don't think it will hurt either, and it will make it easier for tools that expect 4 digits in their years, so I would do this if it were me (I'm assuming it is a very easy change).
Given the info you have provided, I would leave as is.
@lchini this looks good to me, the issues highlighted above (#123 (comment)) are fixed. It seems you've hardcoded the creation_date, though. For a python example (which may or may not be useful), see below:

```
$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-23T21:33:04Z
```

@znichollscr the single file is on nimbus. Also, a question for you: did you want to rename these files so that what you are producing is consistent with what will be downloaded from ESGF? This is optional, but we will confuse folks if we have inconsistent filenames from differing sources, even if their content is identical. @znichollscr highlighted the renaming above (#123 (comment)).
And just adding another note: looks like the time axis fix has solved the python/xarray load issue.

```
$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240923/states_new_vars3.nc")
>>> fh
<xarray.Dataset>
Dimensions:    (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time       (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat        (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon        (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/16)
    primf      (time, lat, lon) float32 ...
    primn      (time, lat, lon) float32 ...
    secdf      (time, lat, lon) float32 ...
    secdn      (time, lat, lon) float32 ...
    urban      (time, lat, lon) float32 ...
    c3ann      (time, lat, lon) float32 ...
    ...         ...
    pastr      (time, lat, lon) float32 ...
    range      (time, lat, lon) float32 ...
    secmb      (time, lat, lon) float32 ...
    secma      (time, lat, lon) float32 ...
    pltns      (time, lat, lon) float32 ...
    time_bnds  (nbnd, time) object ...
Attributes: (12/25)
    host:              UMD College Park
    creation_date:     2024-07-18T14:51:36Z
    Conventions:       CF-1.6
    data_structure:    grid
    dataset_category:  landState
    variable_id:       multiple
    ...                ...
    source_id:         UofMD-landState-3-0
    target_mip:        CMIP
    mip_era:           CMIP6Plus
    references:        Hurtt et al. 2020 (https://doi.org/10.5194/gmd-1...
    history:           Mon Sep 23 13:31:22 2024: ncrename -a ._Fillvalu...
    NCO:               netCDF Operators version 5.0.0 (Homepage = HTTP:...
>>>
```
Hi @lchini, looking good. Tweaks from this round below: […]

Thanks!
Thanks for these additional modifications Zeb. I've updated the states file and uploaded it to the FTP server. I'm working on the transitions and management files now to implement the same changes. The management file has a couple of variables where the standard_name attribute is listed as 'biomass_fraction', which I now realize is not a standard name. So, I'm assuming I should just remove standard_name for those variables, like I did with secma in the states file?

Also, the creation_date attribute is generated automatically when I create the data. The files that I'm uploading are based on the data that I created on July 18, 2024. Since then I have just been modifying the files with these metadata corrections etc., and I did also add in some placeholder variables that we will fill with actual data in the next release. I did not change the creation_date attribute when I made those changes. But moving forward the creation_date will update based on the date when the new data gets generated.
Yep, for these cases: a) remove "standard_name" and b) make sure that there is at least a value for "long_name".
Ah ok. We normally use that for when the file is created, rather than the data, so we can tell the difference between files more easily (even if they have the same name, the creation date helps us differentiate). It's probably not essential to change though (although @durack1 can correct me).

Speaking of identifying files, the other attribute we're missing is "tracking_id". This should be file-specific and generated following the UUID4 protocol (re-generated every time you write a new file). In Python, it can be generated with code like the below:

```python
import uuid

tracking_id = "hdl:21.14100/" + str(uuid.uuid4())
```

In Matlab, it's a bit less clear to me, but that's also because I'm worse at reading Matlab docs I think.
(Although, to be honest, I would be ok with skipping tracking_id for this first set of files and just picking it up next time we go round...)
Hi folks, I'm sorry, but the […]. In Matlab (R2023a):

```
>> disp(join(["hdl:21.14100",char(java.util.UUID.randomUUID)],"/"))
hdl:21.14100/df3a5513-ee63-4969-aff4-5efc4e71f4bc
```

Which matches the format of the python UUID4:

```
:tracking_id = "hdl:21.14100/c0045041-73e0-4e75-b36d-38a962fb813c" ;
```

The above example is from the PCMDI-AMIP-1-1-6 example here. And a matlab example of creating a creation_date-style timestamp:

```
>> disp(join([replace(char(datetime('now','Format','yyyy-MM-dd_HH:mm:ss','Timezone','Z')),'_','T'),'Z']))
2024-09-25T17:42:47Z
```

Matching the python:

```
>>> import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-25T17:44:04Z
```

@znichollscr the latest transitions file is now in […]
That settles that then :)
Having looked at it now, it looks like most of your variables don't have a true standard name. Standard names never contain whitespace, so anytime there is whitespace in a "standard_name", that information should either move to "long_name" or, if there's already a "long_name", you can just delete the "standard_name" information entirely. Looking closer at the values, I would be surprised if any of your variables had standard names (the full list is here: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html). What suggests that to me is that lots of your variables have the same standard name, but I don't think two different variables can have the same standard name (so it seems like the standard names are wrong to me). I could be mistaken of course.
I 100% agree with @znichollscr: if there is not a very definitive mapping to a CF standard name that has been approved and listed in v86 of the CF Standard Name table, then let's remove "standard_name" and rather go with the descriptive "long_name" attribute alone. If we want to jump through hoops to get a standard name assigned, we can do that on the second go-around.
OK, sounds good. I'll remove "standard_name" for all variables. I assume it's OK/preferable to keep the existing standard_name for time, lat, and lon? I can also add the tracking_id. Do I need to do anything about creation_date at this stage or leave it as is for now?
Yep, these standard_names and all other attributes are registered standards that you are using correctly
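For what it's worth, a minimal netCDF4 sketch of that clean-up (untested; the filename is assumed), keeping the coordinate standard names:

```python
import netCDF4

KEEP = {"time", "lat", "lon"}  # registered CF standard names, keep these

with netCDF4.Dataset("states_new_vars3.nc", "a") as ds:
    for name, var in ds.variables.items():
        if name not in KEEP and "standard_name" in var.ncattrs():
            var.delncattr("standard_name")  # long_name carries the description
```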
The creation_date is meant to indicate the date that a file was generated, and this (and other files) was not generated on 18 July 2024, so I would prefer we update this, and preferably update it automatically as files are written, just so we don't create this inconsistency again. As @znichollscr notes, the creation_date is one of the attributes that, if used correctly, leaves a breadcrumb trail of when a file was generated, so in most cases the latest-dated file is presumably the preferred one.

In the CMIP6 example file (here), the following attributes are listed as "absolutely essential": Conventions, activity_id, contact, creation_date, dataset_category, frequency, further_info_url, grid_label, institution, institution_id, mip_era, nominal_resolution, realm, source, source_id, source_version, target_mip, title, tracking_id, variable_id.

Looking at the below, we're now all good, aside from:
- nominal_resolution = "25 km"
- rename "dataset_version_number" -> source_version = "3.0"
- tracking_id (matlab code as above, #123 (comment))
- the update to creation_date, also above

```
ncdump -ct multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
...
// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :grid_label = "gn" ;
                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \"Share Alike\" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;
                :dataset_version_number = "LUH3 V0" ;
                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;
                :activity_id = "input4MIPs" ;
                :source_id = "UofMD-landState-3-0" ;
                :target_mip = "CMIP" ;
                :mip_era = "CMIP6Plus" ;
                :references = "Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-5425-2020), Chini et al. 2021 (https://doi.org/10.5194/essd-13-4175-2021)" ;
                :history = "Wed Sep 25 09:16:37 2024: ncrename -a ._Fillvalue,_FillValue transitions_new_vars3.nc" ;
                :NCO = "netCDF Operators version 5.0.0 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)" ;
                :variable_id = "multiple-transitions" ;
...
```
Hi @lchini, thanks again for your patience with this. I found one more thing. I realise that @durack1 and I have now thrown quite a lot at you, so I've tried to summarise below too.

The extra thing:

The time bounds values are still not coming through as expected. For example, if I look at the time bounds, the values are:

```
>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(855, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)
```

What this is basically saying is that the first time step goes from 850-01-01 to 851-01-01; that's all good. However, it then says that the second timestep goes from 852-01-01 to 853-01-01, i.e. one year too far forward. For the third timestep, the bounds are 854-01-01 to 855-01-01, now two years too far forward. This looks like some sort of stacking issue. If I look in the middle of the bounds, I see that the bounds effectively restart:
```
>>> tmp["time_bnds"].values[585:589, :]
array([[cftime.DatetimeNoLeap(2020, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)
```

I think this should be an easy fix. In pseudo-code, what you want is something like:
The first few values should then look like:

```
>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)
```

or, in raw values:

```
>>> tmp["time_bnds"].values[:3, :]
array([[0, 365],
       [365, 730],
       [730, 1095]],
      dtype=object)
```
dtype=object) Summary of things to fix (as I see them):
Then I think we're golden (or, at least, very close) |
OK, I think I've taken care of that list now (it took me a while to figure out why the time bounds weren't working as expected!). I've uploaded a new set of files to the FTP server. Let me know how they look.
@lchini this is great, I can confirm valid […] in all files.
A query about the time: these now look great, spanning the 850-2023 period (with 2024-01-01 as the final bound), but the filename suggests we have coverage from 850 to 2024. We need to rename the file I think, as our last time entry is 2023 - @lchini can you confirm? See below:

```
(xcd061nctax) bash-4.2$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240927/multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc")
>>> fh
<xarray.Dataset>
Dimensions:      (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time         (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat          (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon          (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/36)
    fertl_c3ann  (time, lat, lon) float32 ...
    irrig_c3ann  (time, lat, lon) float32 ...
    cpbf1_c3ann  (time, lat, lon) float32 ...
    fertl_c4ann  (time, lat, lon) float32 ...
    irrig_c4ann  (time, lat, lon) float32 ...
    cpbf1_c4ann  (time, lat, lon) float32 ...
    ...           ...
    prtct_primn  (time, lat, lon) float32 ...
    prtct_secdf  (time, lat, lon) float32 ...
    prtct_secdn  (time, lat, lon) float32 ...
    prtct_pltns  (time, lat, lon) float32 ...
    addtc        (time, lat, lon) float32 ...
    time_bnds    (time, nbnd) object ...
Attributes: (12/27)
    host:                UMD College Park
    creation_date:       2024-09-27T17:30:27Z
    Conventions:         CF-1.6
    data_structure:      grid
    dataset_category:    landState
    grid_label:          gn
    ...                  ...
    references:          Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-54...
    history:             Fri Sep 27 13:31:20 2024: ncrename -a ._Fillvalue,_F...
    variable_id:         multiple-management
    nominal_resolution:  25 km
    source_version:      3.0
    tracking_id:         hdl:21.14100/d444819c-035b-4663-999d-eff2ce8170ac
>>> fh["time_bnds"].values[:-1,:]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       ...,
       [cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2024, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)
>>> fh["time"].values[:-1]
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)
```

@znichollscr files in the path above, all 4.
Thanks Paul. Almost there @lchini! Now that I've seen all four files, there are a few more questions.

Overall questions: […]

File by file: […]
@lchini I'd also note that I'd have to rename […]. @znichollscr has already noted some other tweaks above (#123 (comment)) - these are very, very close!
Regarding the years that these datasets represent: […]

So, I would prefer to keep the filenames (and the years of data) as they are now. This is the way we have provided this data for many years. Does this seem like a reasonable plan?

For the fertilizer units, I think we can remove the "crop season" part. In theory we are providing the amount of fertilizer applied to the land per ha per year and per crop season, but since we don't actually represent double cropping in the dataset, and I don't think we have full consistency between the historical data and future scenarios on this point, I think we can remove the crop season part from the units. If we did end up feeling like that was a necessary part of the fertilizer units, is there another way that we should represent that in these files?
Makes sense.

If you mean that the filenames would be: […]

Yes.

3 options I see: […]

As a note: if it's per year, shouldn't the number of crop seasons already be included (e.g. if there were 2 crop seasons in 1876, then the total application in the year would be twice as much as in a year in which there was only 1 crop season)? Or are models meant to multiply this by the number of crop seasons in their model to get total application in a year?
At this point, I wonder whether we're good enough for the v0 land use change dataset. @vnaik60, are these files usable for the NOAA-GFDL team? I note that for any variable you can always add a per-variable comment (any attribute could be added), which provides some context for folks using these data. So we'll need to rename the […]. @lchini would you prefer to make a couple more tweaks to target @znichollscr's questions, or are you good for publication to begin? As an FYI, this likely wouldn't start until Thursday this week anyway, as @sashakames is travelling.
Thanks for the feedback! Since we have a couple of days before the publication would begin, why don't I try to make those last few tweaks, and then we should hopefully be all set!
@lchini ok great, if you're happy to catch the final tweaks then let's wait for that. If the files are on the FTP server mid-to-late week, I can pull these across and then get them in the publication queue, hopefully for a Thursday release! Woo hoo!
The new files have been uploaded to the server.
@lchini wonderful! Just to confirm, the files were uploaded into […]? Woo hoo!
Good to publish I think, @durack1!
Excellent, checks out for me, so I have moved this into the publication queue - hopefully these files are live tomorrow! Nice work @lchini
Fixed by #127, closing.
Issue for tracking the progress and any issues related to the land-use data.
cc @lchini @durack1 @vnaik60