
Land-use data upload #123

Closed
znichollscr opened this issue Sep 18, 2024 · 58 comments

@znichollscr
Collaborator

Issue for tracking the progress and any issues related to the land-use data.

cc @lchini @durack1 @vnaik60

@znichollscr znichollscr added the dataset-publication Issue related to publication of a dataset label Sep 18, 2024
@lchini

lchini commented Sep 18, 2024

I am attempting to upload the files to ftp.llnl.gov. The instructions indicate to upload files to the "incoming" folder, but when I connect to the FTP server (using FileZilla on my Mac) there is just an empty root directory and no folder named "incoming". Should I just upload to that location or am I not connected correctly?

@durack1
Contributor

durack1 commented Sep 18, 2024

Good question @lchini, we've had a similar query from @mjevanmarle this morning.

How about trying to connect explicitly using the IP address 198.128.250.1? See below; it seems to work for me currently
[Screenshot 2024-09-18 at 7 50 51 AM]

Once in, you should be able to navigate to the incoming subdirectory, create a new directory for yourself (e.g., UofMD-landState-3-0_240918; we might have to try a couple of times, hence the datestamp on the end), upload, and voila

@znichollscr
Collaborator Author

znichollscr commented Sep 18, 2024

Option B: here's a Python script that should work. You'll need to install input4mips-validation first.

Python script
import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

@durack1
Contributor

durack1 commented Sep 18, 2024

@lchini @mjevanmarle it seems like there is a DNS/network issue that is causing problems for me when I am not connected to the LLNL institutional network. Weirdly, this isn't an issue for @znichollscr, so it might be something that will just work itself out, or it might need a nudge within the LLNL network.

This is what I see, which looks similar to @mjevanmarle's issue, and @lchini probably your issue too
[Screenshot 2024-09-18 at 7 58 34 AM]

I'll raise a ticket with the LLNL network folks to see if someone can check.

@lchini

lchini commented Sep 18, 2024

Yes that is the same issue that I'm experiencing. I've been trying to configure settings on my end but it sounds like I might need to wait for the LLNL server update.

@znichollscr
Collaborator Author

@lchini can you try the python script and post the output here if it fails please?

@znichollscr
Collaborator Author

Python script is here

import ftplib
import os
import traceback
from pathlib import Path

from input4mips_validation.upload_ftp import cd_v, login_to_ftp, mkdir_v, upload_file

# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-2"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"

with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

@lchini

lchini commented Sep 18, 2024

I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'

@durack1
Contributor

durack1 commented Sep 18, 2024

I've tried to engage with LLNL comp support folks and had a tepid response, so if this doesn't begin to work next time we try, let's seek alternative paths to getting these data in the publication queues

@znichollscr
Collaborator Author

I tried to install input4mips-validation so that I could run the python script. I used pip since I didn't have mamba installed. Although pip did not return an error, I don't think the installation worked correctly because when I tried to run the python script I received the error: No module named 'input4mips_validation'

Hmmm that's unfortunate. Here's a version of the script without any dependencies that aren't in the standard library so it should just work with any Python >= 3.9. Can you try that please?

import ftplib
import os
import traceback
from collections.abc import Iterator
from contextlib import contextmanager
from pathlib import Path
from typing import Optional


# Point this at the path which contains the files you want to upload
# PATH_TO_DIRECTORY_TO_UPLOAD = (
#     "output-bundles/v0.4.0/data/processed/esgf-ready/input4MIPs/CMIP6Plus/CMIP/CR/CR-CMIP-0-4-0/atmos/yr/cf4"
# )
PATH_TO_DIRECTORY_TO_UPLOAD = "path/to/somewhere"

# Use your email here
# EMAIL = "zebedee.nicholls@climate-resource.com"
EMAIL = "your_email"

# Use a unique value here
# FTP_DIR_REL_TO_ROOT = "cr-junk-4"
FTP_DIR_REL_TO_ROOT = "UofMD-landState-3-0_240918_1"

FTP_DIR_ROOT = "/incoming"


@contextmanager
def login_to_ftp(
    ftp_server: str, username: str, password: str, dry_run: bool
) -> Iterator[Optional[ftplib.FTP]]:
    """
    Create a connection to an FTP server.

    When the context block is exited, the connection is closed.

    If we are doing a dry run, `None` is returned instead
    to signal that no connection was actually made.
    We do, however, log messages to indicate what would have happened.

    Parameters
    ----------
    ftp_server
        FTP server to login to

    username
        Username

    password
        Password

    dry_run
        Is this a dry run?

        If `True`, we won't actually login to the FTP server.

    Yields
    ------
    :
        Connection to the FTP server.

        If it is a dry run, we simply return `None`.
    """
    if dry_run:
        print(f"Dry run. Would log in to {ftp_server} using {username=}")
        ftp = None

    else:
        ftp = ftplib.FTP(ftp_server, passwd=password, user=username)  # noqa: S321
        print(f"Logged into {ftp_server} using {username=}")

    yield ftp

    if ftp is None:
        if not dry_run:  # pragma: no cover
            raise AssertionError
        print(f"Dry run. Would close connection to {ftp_server}")

    else:
        ftp.quit()
        print(f"Closed connection to {ftp_server}")


def cd_v(dir_to_move_to: str, ftp: ftplib.FTP) -> ftplib.FTP:
    """
    Change directory verbosely

    Parameters
    ----------
    dir_to_move_to
        Directory to move to on the server

    ftp
        FTP connection

    Returns
    -------
    :
        The FTP connection
    """
    ftp.cwd(dir_to_move_to)
    print(f"Now in {ftp.pwd()} on FTP server")

    return ftp


def mkdir_v(dir_to_make: str, ftp: ftplib.FTP) -> None:
    """
    Make directory verbosely

    Also, don't fail if the directory already exists

    Parameters
    ----------
    dir_to_make
        Directory to make

    ftp
        FTP connection
    """
    try:
        print(f"Attempting to make {dir_to_make} on {ftp.host=}")
        ftp.mkd(dir_to_make)
        print(f"Made {dir_to_make} on {ftp.host=}")
    except ftplib.error_perm:
        print(f"{dir_to_make} already exists on {ftp.host=}")


def upload_file(
    file: Path,
    strip_pre_upload: Path,
    ftp_dir_upload_in: str,
    ftp: Optional[ftplib.FTP],
) -> Optional[ftplib.FTP]:
    """
    Upload a file to an FTP server

    Parameters
    ----------
    file
        File to upload.

        The full path of the file relative to `strip_pre_upload` will be uploaded.
        In other words, any directories in `file` will be made on the
        FTP server before uploading.

    strip_pre_upload
        The parts of the path that should be stripped before the file is uploaded.

        For example, if `file` is `/path/to/a/file/somewhere/file.nc`
        and `strip_pre_upload` is `/path/to/a`,
        then we will upload the file to `file/somewhere/file.nc` on the FTP server
        (relative to whatever directory the FTP server is in
        when we enter this function).

    ftp_dir_upload_in
        Directory on the FTP server in which to upload `file`
        (after removing `strip_pre_upload`).

    ftp
        FTP connection to use for the upload.

        If this is `None`, we assume this is a dry run.

    Returns
    -------
    :
        The FTP connection.

        If it is a dry run, this can simply be `None`.
    """
    print(f"Uploading {file}")
    if ftp is None:
        print(f"Dry run. Would cd on the FTP server to {ftp_dir_upload_in}")

    else:
        cd_v(ftp_dir_upload_in, ftp=ftp)

    filepath_upload = file.relative_to(strip_pre_upload)
    print(
        f"Relative to {ftp_dir_upload_in} on the FTP server, " f"will upload {file} to {filepath_upload}",
    )

    for parent in list(filepath_upload.parents)[::-1]:
        if parent == Path("."):
            continue

        to_make = parent.parts[-1]

        if ftp is None:
            print("Dry run. " "Would ensure existence of " f"and cd on the FTP server to {to_make}")

        else:
            mkdir_v(to_make, ftp=ftp)
            cd_v(to_make, ftp=ftp)

    if ftp is None:
        print(f"Dry run. Would upload {file}")

        return ftp

    with open(file, "rb") as fh:
        upload_command = f"STOR {file.name}"
        print(f"Upload command: {upload_command}")

        try:
            print(f"Initiating upload of {file}")
            ftp.storbinary(upload_command, fh)

            print(f"Successfully uploaded {file}")
        except ftplib.error_perm:
            print(
                f"{file.name} already exists on the server in {ftp.pwd()}. "
                "Use a different directory on the receiving server "
                "if you really wish to upload again."
            )
            raise

    return ftp


with login_to_ftp(
    ftp_server="ftp.llnl.gov",
    username="anonymous",
    password=EMAIL,
    dry_run=False,
) as ftp:
    print("Opened FTP connection")
    print()

    cd_v(FTP_DIR_ROOT, ftp=ftp)

    mkdir_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)
    cd_v(FTP_DIR_REL_TO_ROOT, ftp=ftp)

    n_errors = 0
    n_total = 0
    for file in Path(PATH_TO_DIRECTORY_TO_UPLOAD).rglob("*.nc"):
        file_stats = os.stat(file)
        file_size_mb = file_stats.st_size / (1024 * 1024)
        file_size_gb = file_stats.st_size / (1024 * 1024 * 1024)

        print(f"{file=}")
        print(f"{file_size_mb=:.3f}")
        print(f"{file_size_gb=:.3f}")

        try:
            upload_file(
                file,
                strip_pre_upload=file.parent,
                ftp_dir_upload_in=f"{FTP_DIR_ROOT}/{FTP_DIR_REL_TO_ROOT}",
                ftp=ftp,
            )
            print(f"Uploaded {file=}")

        except ftplib.error_perm:
            print(f"Failed to upload {file=}")
            traceback.print_exc()
            n_errors += 1

        n_total += 1
        print()

print(f"Finished: {n_errors=}, {n_total=}")

@lchini

lchini commented Sep 19, 2024

Thanks for the new script Zeb. I'm running it now and it appears to be working, although it's hard to gauge progress on the other end. The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

@lchini

lchini commented Sep 19, 2024

The script completed. Can someone else confirm that it was successful? I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

@znichollscr
Collaborator Author

The first time I ran the script it appeared to be uploading ALL files within the given directory, so I canceled that and moved to a different folder. So there might be some half-uploaded files from that first run.

Ah yes it uploads every .nc file it can find, should have warned you about that probably :)

Can someone else confirm that it was successful?

Hopefully @durack1 can take a look. Can you tell us which directory you uploaded in (i.e. the value of FTP_DIR_REL_TO_ROOT in the script)?

I just uploaded a single file because I wanted to make sure everything looks OK with that one before sending the others. Let me know if there is anything I need to change with the format or metadata in the file. Also, I assume the filename will be changed from the name of the uploaded file?

Sounds good. We'll take a look and get back to you asap

@durack1
Contributor

durack1 commented Sep 19, 2024

@lchini great! We're off, I could see the below, so if that looks right to you - mint a new upload dir and give us the lot.
[Screenshot 2024-09-19 at 6 51 34 AM]

If you can also indicate what we should expect (number of files), then I can double-check these and drop them into the publication queue, where we can run @znichollscr's validator to double-check

@znichollscr
Collaborator Author

Alrighty looks like Paul found it so don't worry, we don't need any more info for now. I'll take a look and get back to you asap.

Also, I assume the filename will be changed from the name of the uploaded file?

Yep we'll re-write that as part of putting the file in the DRS

@durack1
Contributor

durack1 commented Sep 19, 2024

@znichollscr the 2 files are in the normal place - ../LouiseChini-landUseChange/20240919

@znichollscr
Collaborator Author

znichollscr commented Sep 19, 2024

Alrighty:

I'm assuming that management4.nc is a half uploaded file because I couldn't even read it with ncdump...

For states_new_vars2.nc:

  • the source_id attribute in the file should be "UofMD-landState-3-0"
  • the time units, "years since 850-01-01 0:0:0", cause xarray to explode, which isn't ideal. Could these be updated to "days since 850-01-01" and you update your time axis accordingly (just multiply all the values by 365)?
  • there are no time bounds. For example, we have a "time_bnds" variable in our datasets where we specify the bounds of each timestep. The variable should be 2D, the first dimension being time and the second being the bounds (first value for each timestep is the start of the timestep, second value is the end of the timestep). So, if you have a time axis like [0, 1, 2] then time bounds would be something like [[0, 1], [1, 2], [2, 3]].
  • on all variables, "_Fillvalue" should be renamed to "_FillValue"
  • For the secma variable, either put a value in the "standard_name" or just remove it I think
  • the missing value of pltns should be a float I think, at the moment it is a string
  • wherever you have cell methods, there is a space missing after the colon (it should be "time: mean" not "time:mean"). (This bug causes iris to explode, which is not ideal)

Other than that, looks good I think
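For anyone scripting the same fixes, here's a minimal pure-Python sketch of the attribute and bounds changes above. The helper names are hypothetical (not part of input4mips-validation); in practice you'd apply the results with NCO or a netCDF library.

```python
# Illustrative helpers only; dicts stand in for per-variable netCDF attributes.

def fix_var_attrs(attrs):
    """Apply the attribute fixes listed above to a dict of variable attributes."""
    fixed = dict(attrs)
    # netCDF tools only recognise the exact spelling "_FillValue"
    if "_Fillvalue" in fixed:
        fixed["_FillValue"] = fixed.pop("_Fillvalue")
    # CF requires a space after the colon in cell_methods ("time: mean")
    if "cell_methods" in fixed:
        fixed["cell_methods"] = fixed["cell_methods"].replace(":mean", ": mean")
    return fixed


def make_time_bounds(time_values, step):
    """2-D time bounds: each timestep spans [value, value + step]."""
    return [[t, t + step] for t in time_values]
```

For example, `make_time_bounds([0, 1, 2], 1)` gives `[[0, 1], [1, 2], [2, 3]]`, matching the example above.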

@durack1
Contributor

durack1 commented Sep 19, 2024

@znichollscr yep, looks like you're right: we have a bigger version of that file now, AND another transitions file. @lchini, while I wait until it's all up, what should we be expecting? How many files, and what are their filenames/sizes?
[Screenshot 2024-09-19 at 10 36 32 AM]

I'll wait until I've heard back from you and the complete set is there before I pull these across

@lchini

lchini commented Sep 19, 2024

The management4.nc file was uploaded in error when I didn't realize that the python script would upload all files in the given directory. So please delete that one. There are 4 files that I'll be uploading for states, transitions, and management, as well as a staticData file. The issues that Zeb pointed out with the states file will be issues in the transitions and management files too. I've already uploaded the transitions so will have to fix and re-upload that one as well as the states file, and I'll try to update the management file before I upload it.

@lchini

lchini commented Sep 19, 2024

For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g. 0, 1, 2, 3 .... Should I revert to the original plan or switch to days since 850 as you suggested? We have 1175 years of data, so a simple multiplication by 365 will end up missing quite a few days due to leap years.

@durack1
Contributor

durack1 commented Sep 19, 2024

So please delete that one

Unfortunately I can't do anything about deleting/moving/etc on this system, it's simply a dropbox. So, good to know; I'll purge it in our copy(ies) once I pull the complete file list down.

When you have the new data generated, upload this to a new directory e.g., UofMD-landState-3-0_240919_1 and that way we won't have problems with attempts to overwrite files etc, which likely won't work.

Also we have a standard template for the filenames (and directory structure, which I can impose once the files are down and their metadata matches what we expect), so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc
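For reference, that filename template can be assembled mechanically; a sketch, with the field order inferred from the example above (so treat it as illustrative rather than the definitive DRS spec):

```python
def drs_filename(variable, activity, dataset_category, target_mip,
                 source_id, grid_label, time_range):
    """Assemble a filename following the template quoted above (field order assumed)."""
    return "_".join(
        [variable, activity, dataset_category, target_mip,
         source_id, grid_label, time_range]
    ) + ".nc"
```

So `drs_filename("transitions", "input4MIPs", "landState", "CMIP", "UofMD-landState-3-0", "gn", "0850-2023")` reproduces the example filename.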

@durack1
Contributor

durack1 commented Sep 19, 2024

For the time units, the product is annual. We originally created time units that give the actual year, e.g. 850, 851, 852, etc. I post-processed that to give years since 850, e.g.: 0,1,2,3 .... Should I revert to the original plan or switch to days since 850 as you suggested. We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through the vernal equinox, yes pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr's suggested multiplication, as this will lead to problems toward the end of the record. In addition, as you span the Gregorian hop (1582-10-04 to 1582-10-15 the next day), this is going to get a little weird. @lchini, how are you writing these files, what software? The Python datetime library and cftime could help here

@lchini

lchini commented Sep 19, 2024

Our model that generates the data and writes the files is written in C++. I am doing some post-processing on the files in MATLAB (just to add in the new variables that don't have computed data yet), and then doing more post-processing (modifying the time dimension, writing global attributes, etc) using NCO command-line tools.

I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

@durack1
Contributor

durack1 commented Sep 19, 2024

I guess my question is: since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

To be honest, your file looks pretty good to me (ncdump -ct $file.nc):

variables:
        double time(time) ;
                time:axis = "T" ;
                time:calendar = "noleap" ;
                time:long_name = "time" ;
                time:realtopology = "linear" ;
                time:standard_name = "time" ;
                time:units = "years since 850-01-01 0:0:0" ;

...

data:

 time = "0850-01-01", "0851-01-01", "0852-01-01", "0853-01-01", "0854-01-01", 
    "0855-01-01", "0856-01-01", "0857-01-01", "0858-01-01", "0859-01-01", 
...
    "2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01", "2019-01-01", 
    "2020-01-01", "2021-01-01", "2022-01-01", "2023-01-01" ;

The xarray warning is

>>> fh = xcdat.open_dataset("transitions_new_vars2.nc")
../lib/python3.11/site-packages/xarray/coding/times.py:167: SerializationWarning: Ambiguous reference
date string: 850-01-01 0:0:0. The first value is assumed to be the year hence will be padded with zeros
to remove the ambiguity (the padded reference date string is: 0850-01-01 0:0:0). To remove this
message, remove the ambiguity by padding your reference date strings with zeros.
  warnings.warn(warning_msg, SerializationWarning)
>>> fh.time.data
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

A quick tweak of 850-1-1 0:0:0 to 0850-01-01 0:0:0 might get you most of the way there. Adding a time_bnds variable would be ideal too, so that we satisfy CF requirements: this would mean that the first year is bounded by 0850-01-01 0:0:0 and 0851-01-01 0:0:0. It would also be cleaner if the time axis value were the central value within the annual time period, so 0850-07-02
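The reference-date padding is easy to script; a pure-Python sketch (the helper name is made up for illustration):

```python
def pad_reference_date(units):
    """Zero-pad the reference date in a CF units string.

    E.g. "years since 850-1-1 0:0:0" -> "years since 0850-01-01 0:0:0",
    which removes the ambiguity that xarray warns about.
    """
    prefix, _, rest = units.partition(" since ")
    date, _, time = rest.partition(" ")
    year, month, day = (int(p) for p in date.split("-"))
    padded = f"{year:04d}-{month:02d}-{day:02d}"
    return f"{prefix} since {padded}" + (f" {time}" if time else "")
```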

@durack1
Contributor

durack1 commented Sep 19, 2024

There are also a couple of inconsistencies in the file metadata vs what we are expecting:

// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :variable_id = "multiple" ;
                :grid_label = "gn" ;

                :mip_era = "CMIP6" ;  ## CMIP6Plus

                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \\\"Share Alike\\\" 4.0 International License (http://creativecommons.org/licenses
/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the s
upply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;

                :activity_id = "CMIP7" ;  ### input4MIPs

                :dataset_version_number = "LUH3 V0" ;

                :source_id = "UofMD-landState-LUH3" ;  ## UofMD-landState-3-0

                :target_mip = "CMIP7" ;  ### CMIP

                :references = "Hurtt et al. 2020, Chini et al. 2021" ; ## Want to expand these with DOIs?

                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;

@znichollscr
Collaborator Author

znichollscr commented Sep 19, 2024

Should I revert to the original plan or switch to days since 850 as you suggested. We have 1175 years of data so a simple multiplication by 365 will end up missing quite a few days due to leap years.

(This is completely non-obvious unless you love the CF conventions.) Because you're using a 'noleap' calendar, every year in your calendar has exactly 365 days. Hence, you can do the multiplication by 365 without an issue (just don't change the calendar attribute of your time variable!).

UDUNITS defines a year to be exactly 365.242198781 days (the interval between 2 successive passages of the sun through vernal equinox, yes pedantic). So if we are mapping into days since, then we'd have to be careful about @znichollscr suggested multiplication, as this will lead to problems toward the end of the record. In addition as you span the Gregorian (1582-10-04 to 1582-10-15 the next day) hop, this is going to get a little weird..

See above. Because of the calendar attribute, UDUNITS doesn't come into it and just multiplying by 365 is fine (again, this statement only applies because of the "noleap" calendar).

since converting to days is tricky, is it really necessary? Especially since our data is an annual product?

As above, because of the calendar, converting to days is trivial. The reason I would (strongly) recommend doing this is that the data doesn't load properly with xarray if the time units are "years since" rather than "days since". This is a bug in xarray, but given it is such a widely used tool, I would recommend making this tweak (particularly given how trivial it is).
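To make the noleap point concrete, here's the mapping (illustrative function names, nothing library-specific). It round-trips exactly over the full 850-2023 record precisely because every noleap year has 365 days; in a real (Gregorian/UDUNITS) calendar the 365.242198781-day year would drift over 1175 years.

```python
def years_to_days_noleap(years_since_850):
    """'years since 850' -> 'days since 0850-01-01' in a noleap calendar."""
    return [y * 365 for y in years_since_850]


def days_to_years_noleap(days_since_850):
    """Inverse mapping; exact for values that fall on 1 January."""
    return [d // 365 for d in days_since_850]
```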

The xarray warning is

Note here @durack1 that you've loaded with xcdat, not xarray (I assume xcdat is better up to speed with CF-conventions than xarray/cftime, which is what raises the original error). If you try to load with xarray you get:

click me to see the full xarray error
>>> import xarray as xr
>>> xr.open_dataset("states_new_vars2.nc", use_cftime=True)
Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 218, in _decode_cf_datetime_dtype
    result = decode_cf_datetime(example_value, units, calendar, use_cftime)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 349, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 242, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 587, in cftime._cftime.num2date
  File "src/cftime/_cftime.pyx", line 105, in cftime._cftime._dateparse
ValueError: In general, units must be one of 'microseconds', 'milliseconds', 'seconds', 'minutes', 'hours', or 'days' (or select abbreviated versions of these).  For the '360_day' calendar, 'months' can also be used, or for the 'noleap' calendar 'common_years' can also be used. Got 'years' instead, which are not recognized.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 450, in decode_cf_variables
    new_vars[k] = decode_cf_variable(
                  ^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 291, in decode_cf_variable
    var = times.CFDatetimeCoder(use_cftime=use_cftime).decode(var, name=name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 992, in decode
    dtype = _decode_cf_datetime_dtype(data, units, calendar, self.use_cftime)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/coding/times.py", line 228, in _decode_cf_datetime_dtype
    raise ValueError(msg)
ValueError: unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/api.py", line 588, in open_dataset
    backend_ds = backend.open_dataset(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/netCDF4_.py", line 659, in open_dataset
    ds = store_entrypoint.open_dataset(
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/backends/store.py", line 46, in open_dataset
    vars, attrs, coord_names = conventions.decode_cf_variables(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/shared/input4mips-validation-v0.11.3/lib/python3.11/site-packages/xarray/conventions.py", line 461, in decode_cf_variables
    raise type(e)(f"Failed to decode variable {k!r}: {e}") from e
ValueError: Failed to decode variable 'time': unable to decode time units 'years since 850-01-01 0:0:0' with "calendar 'noleap'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.

so this should be something like transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc

I don't think this matters for us though does it Paul? We'll just re-write with the correct name and save @lchini the headache? If you do want to write it yourself, the current DRS suggests the filename should start with "multiple-*" (e.g. multiple-transitions, multiple-states) because there are multiple variables in the file.

@lchini

lchini commented Sep 20, 2024

Thanks for this info. I think most of these changes will be easy to implement and I will get started on it right away. The issue of the time variable and time bounds should also be OK but I just wanted to make sure I get this right. As I understand it, the plan is the following:

  1. change the time variable to be days since 850-1-1 0:0:0 and multiply the current values by 365 to convert the existing values (this is OK since my calendar does not include leap years, and I assume it is not impacted by the 1582 calendar jump). So we will have time values of 0, 365, 730, ...
  2. add a time bounds variable that is 2 dimensional - one dimension will be the same size as the time variable and the other dimension will have a length of 2. Since the time variable will be days since 850, ie 0, 365, 730, 1095 ..., the time bounds variable will be [[0,365],[365,730],[730,1095],...]

Does this sound correct?

Questions:

  1. Do I also need to change 850 to 0850?
  2. Regarding the idea of making the time variable the central value for the year, i.e. 0850-07-02: we actually consider the land-use states in each year to be the states on Jan 1, so I would prefer not to make that change. The transitions in the year 850 are actually the transitions that occur during that year (from Jan 1 850 to Jan 1 851), so they are not really tied to a specific date but span that time period. So, do I need to change anything here to reflect that, or leave it as it is?

@znichollscr
Collaborator Author

znichollscr commented Sep 20, 2024

  1. change the time ...

All correct. (The 1582 calendar change also doesn't matter as all you're really saying with your data is, "this is the start of year" state, which is what the approach you're taking will do.)

2. add a time bounds variable...

Spot on. I think the variable is meant to be called time_bnds according to CF-conventions. When you do this, it's also recommended (perhaps required) to add a "bounds" attribute to the "time" variable that has the value "time_bnds". (In Python that would be something like ds["time"].setncattr("bounds", "time_bnds"). I know you don't use Python, but that might help make things clearer.)

Do I also need to change 850 to 0850?

I don't think it matters, but I don't think it will hurt either and it will make it easier for tools that expect 4 digits in their year so I would do this if it were me (I'm assuming it is a very easy change).

2. So, do I need to change anything here to reflect that or leave it as it is?

Given the info you have provided, I would leave as is.
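
As a cross-check of the arithmetic agreed above, here is a minimal plain-Python sketch (mine, purely illustrative; the actual files are written outside Python):

```python
# 365-day ("noleap") calendar, units "days since 0850-01-01 00:00:00",
# one time step per year with bounds spanning start-of-year to
# start-of-next-year.
years = list(range(850, 2025))            # 850 .. 2024 inclusive
time = [(y - 850) * 365 for y in years]   # 0, 365, 730, ...
time_bnds = [[t, t + 365] for t in time]  # [[0, 365], [365, 730], ...]

print(time[:3])       # [0, 365, 730]
print(time_bnds[:2])  # [[0, 365], [365, 730]]
```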

@durack1
Contributor

durack1 commented Sep 23, 2024

@lchini this looks good to me, the issues highlighted above (#123 (comment)) are fixed.

It seems you've hardcoded the :creation_date = "2024-07-18T14:51:36Z", you might want to generate this automatically, so we don't have old info lurking around in files.

For a python example (which may or may not be useful), see below

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datetime
>>> print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-23T21:33:04Z

@znichollscr the single files is on nimbus ../LouiseChini-landUseChange/20240923

Also a question to you, did you want to rename these files so what you are producing is consistent with what will be downloaded from ESGF? This is optional, but we will confuse folks if we have inconsistent filenames from differing sources, even if their content is identical. @znichollscr highlighted the renaming above (#123 (comment))

@durack1
Contributor

durack1 commented Sep 23, 2024

And just adding another note, looks like the time axis fix has solved the xarray read problems, at least for me

$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240923/states_new_vars3.nc")
>>> fh
<xarray.Dataset>
Dimensions:    (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time       (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat        (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon        (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/16)
    primf      (time, lat, lon) float32 ...
    primn      (time, lat, lon) float32 ...
    secdf      (time, lat, lon) float32 ...
    secdn      (time, lat, lon) float32 ...
    urban      (time, lat, lon) float32 ...
    c3ann      (time, lat, lon) float32 ...
    ...         ...
    pastr      (time, lat, lon) float32 ...
    range      (time, lat, lon) float32 ...
    secmb      (time, lat, lon) float32 ...
    secma      (time, lat, lon) float32 ...
    pltns      (time, lat, lon) float32 ...
    time_bnds  (nbnd, time) object ...
Attributes: (12/25)
    host:                    UMD College Park
    creation_date:           2024-07-18T14:51:36Z
    Conventions:             CF-1.6
    data_structure:          grid
    dataset_category:        landState
    variable_id:             multiple
    ...                      ...
    source_id:               UofMD-landState-3-0
    target_mip:              CMIP
    mip_era:                 CMIP6Plus
    references:              Hurtt et al. 2020 (https://doi.org/10.5194/gmd-1...
    history:                 Mon Sep 23 13:31:22 2024: ncrename -a ._Fillvalu...
    NCO:                     netCDF Operators version 5.0.0 (Homepage = HTTP:...
>>>

@znichollscr
Collaborator Author

Hi @lchini looking good. Tweaks from this round below:

  • the variable time_bnds shouldn't have any attributes (the convention, as I understand it, is that its attributes are all assumed to be the same as time)
  • there shouldn't be any "cell_methods" attribute for time, lat or lon (these variables aren't the time mean of something else)
  • I don't think there is a standard name for secma (at least, searching here didn't show anything obvious https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html) so I would delete the "standard_name" attribute from secma, the long_name alone is enough
  • it turns out the conventions with time bounds are trickier than I realised. The time dimension has to come first so its dimensions should be (time, bnds) not (bnds, time) as you currently have. With that tweak, I think it should be pretty easy to write. Something like time_bnds[:, 0] = time, time_bnds[:, 1] = time + 365.
  • As Paul mentioned, to be safe I would change the time units attribute from "days since 0850-01-01 0:0:0" to "days since 0850-01-01 00:00:00" i.e. write "00:00:00" instead of "0:0:0" in the time units.
  • As Paul says, if you want to rename your file, the filename for this example would be multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc. The transition file would be multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc. Others would be of the form multiple-<other-id>_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc.
    • The "variable_id" attribute should then match the "multiple-<other-id>" part of the filename, so for your states file the variable ID should be "multiple-states", for the transitions file it should be "multiple-transitions", and for others it would be "multiple-<other-id>".

Thanks!
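
If it helps, the naming pattern above can be captured in a tiny helper (hypothetical, not part of any existing tool; the source_id is hard-coded for this dataset):

```python
def drs_filename(dataset_id, start_year, end_year):
    """Assemble a filename of the form described above:
    multiple-<other-id>_input4MIPs_landState_CMIP_<source_id>_gn_<years>.nc
    """
    variable_id = f"multiple-{dataset_id}"  # also the variable_id attribute
    return (
        f"{variable_id}_input4MIPs_landState_CMIP_"
        f"UofMD-landState-3-0_gn_{start_year:04d}-{end_year:04d}.nc"
    )

print(drs_filename("states", 850, 2024))
# multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
```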

@lchini

lchini commented Sep 25, 2024

Thanks for these additional modifications Zeb. I've updated the states file and uploaded it to the FTP server. I'm working on the transitions and management files now to implement the same changes. The management file has a couple of variables where the standard_name attribute is listed as 'biomass_fraction', which I now realize is not a standard name. So, I'm assuming I should just remove standard_name for those variables, like I did with secma in the states file?

Also, the creation_date attribute is generated automatically when I create the data. The files that I'm uploading are based on the data that I created on July 18, 2024. Since then I have just been modifying the files with these metadata corrections etc, and I did also add in some placeholder variables that we will fill with actual data in the next release. I did not change the creation_date attribute when I made those changes. But moving forward the creation_date will update based on the date when the new data gets generated.

@znichollscr
Collaborator Author

The management file has a couple of variables where the standard_name attribute is listed as 'biomass_fraction', which I now realize is not a standard name. So, I'm assuming I should just remove standard_name for those variables, like I did with secma in the states file?

Yep, for these cases: a) remove "standard_name" and b) make sure that there is at least a value for "long_name".

Also, the creation_date attribute is generated automatically when I create the data....

Ah ok. We normally use that for when the file is created, rather than the data, so we can tell the difference between files more easily (even if they have the same name, the creation date helps us differentiate). It's probably not essential to change though (although @durack1 can correct me).

Speaking of identifying files, the other attribute we're missing is "tracking_id". This should be file specific and generated following the UUID4 protocol (re-generated every time you write a new file). In Python, it can be generated with code like the below

import uuid
tracking_id = "hdl:21.14100/" + str(uuid.uuid4())

In Matlab, it's a bit less clear to me but that's also because I'm worse at reading matlab docs I think.
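
For what it's worth, the resulting string can be sanity-checked against the expected shape (the regex is my own sketch, not an ESGF requirement):

```python
import re
import uuid

tracking_id = "hdl:21.14100/" + str(uuid.uuid4())

# Python renders a UUID4 as lowercase hex with version digit 4 and
# variant digit 8/9/a/b.
pattern = (
    r"hdl:21\.14100/"
    r"[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}"
)
print(bool(re.fullmatch(pattern, tracking_id)))  # True
```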

@znichollscr
Collaborator Author

(Although, to be honest, I would be ok with skipping tracking_id for this first set of files and just picking it up next time we go round...)

@durack1
Contributor

durack1 commented Sep 25, 2024

Hi folks, I'm sorry, but the tracking_id is an ESGF dependency that needs to be unique per file. Apologies for omitting that check. Below is a matlab example of generating a compatible UUID4.

>> disp(join(["hdl:21.14100",char(java.util.UUID.randomUUID)],"/")) % Matlab R2023a
hdl:21.14100/df3a5513-ee63-4969-aff4-5efc4e71f4bc
Which matches the format of the python UUID4
:tracking_id = "hdl:21.14100/c0045041-73e0-4e75-b36d-38a962fb813c" ; 

Above example is from the PCMDI-AMIP-1-1-6 example here

And a matlab example of creating a creation_date which aligns with the ESGF expectation

>> disp(join([replace(char(datetime('now','Format','yyyy-MM-dd_HH:mm:ss','Timezone','Z')),'_','T'),'Z']))
2024-09-25T17:42:47Z
Matching the python
>>> import datetime; print(datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ"))
2024-09-25T17:44:04Z
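
(A small aside on the Python snippet: datetime.utcnow() is deprecated as of Python 3.12; a timezone-aware equivalent producing the same string is below.)

```python
import datetime

# Same "YYYY-MM-DDTHH:MM:SSZ" stamp, without the deprecated utcnow().
stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(stamp)  # e.g. 2024-09-25T17:44:04Z
```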

@znichollscr the latest transitions file is now in ../LouiseChini-landUseChange/20240925

@znichollscr
Collaborator Author

Hi folks, I'm sorry, but the tracking_id is an ESGF dependency that needs to be unique per file

That settles that then :)

@znichollscr the latest transitions file is now in ../LouiseChini-landUseChange/20240925

Having looked at it now, it looks like most of your variables don't have a true standard name. Standard names never contain whitespace, so anytime there is whitespace in a standard name, that information should either be in "long_name" or, if there's already a "long_name", you can just delete the "standard_name" information entirely.

Looking closer at the values, I would say that I would be surprised if any of your variables had standard names (the full list is here: https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html). The thing that is suggesting that to me is that lots of your variables have the same standard name, but I don't think two different variables can have the same standard name (so it seems like the standard names are wrong to me). I could be mistaken of course.
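
A crude filter for flagging the suspect entries might look like this (my own sketch; it only catches formatting problems and does not consult the official CF table):

```python
def plausible_standard_name(name):
    # Real CF standard names are lowercase words joined by underscores
    # and never contain whitespace. Passing this check does NOT prove a
    # name is in the CF table; failing it proves it is not.
    return bool(name) and " " not in name and name == name.lower()

print(plausible_standard_name("area_fraction"))     # True
print(plausible_standard_name("biomass fraction"))  # False
```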

@durack1
Contributor

durack1 commented Sep 25, 2024

most of your variables don't have a true standard name

I 100% agree with @znichollscr if there is not a very definitive mapping to a CF standard name that has been approved and listed on the v86 of the CF Standard Name table then let's remove "standard_name" and rather go with the descriptive "long_name" attribute alone. If we want to jump through hoops to get a standard name assigned, we can do that on the second go around

@lchini

lchini commented Sep 25, 2024

OK, sounds good. I'll remove "standard_name" for all variables. I assume it's OK/preferable to keep the existing standard_name for time, lat, and lon? I can also add the tracking_id. Do I need to do anything about creation_date at this stage or leave it as is for now?
BTW, I've now uploaded a full set of files: states, transitions, management, and staticData. I know that most of your comments so far have been for the states file, so if you want to take a look at those other ones as well and let me know whether they are compliant, that would be helpful.

@durack1
Contributor

durack1 commented Sep 25, 2024

I assume it's OK/preferable to keep the existing standard_name for time, lat, and lon?

Yep, these standard_names and all other attributes are registered standards that you are using correctly

Do I need to do anything about creation_date at this stage or leave it as is for now?

The creation_date is meant to indicate the date that a file was generated, and this (and other files) was not generated on 18 July 2024, so I would prefer we update this, and preferably update it automatically as files are written - just so we don't create this inconsistency again. As @znichollscr notes, the creation_date, if used correctly, leaves a breadcrumb trail of when each file was generated, so in most cases the latest-dated file is presumably the preferred one.

In the CMIP6 example file (here), this lists the following attributes as "absolutely essential": Conventions, activity_id, contact, creation_date, dataset_category, frequency, further_info_url, grid_label, institution, institution_id, mip_era, nominal_resolution, realm, source, source_id, source_version, target_mip, title, tracking_id, variable_id.

Looking at the below, we're now all good aside from the bolded entries above: add nominal_resolution = "25 km", rename "dataset_version_number" -> source_version = "3.0", add a tracking_id (matlab code as above, #123 (comment)), and update creation_date as also noted above.

ncdump -ct multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
...
// global attributes:
                :host = "UMD College Park" ;
                :creation_date = "2024-07-18T14:51:50Z" ;
                :Conventions = "CF-1.6" ;
                :data_structure = "grid" ;
                :dataset_category = "landState" ;
                :grid_label = "gn" ;
                :license = "Land-Use Harmonization data produced by the University of Maryland is licensed under a Creative Commons Attribution \\\"Share Alike\\\" 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The data producers and data providers make no warranty, either express or implied, including but not limited to, warranties of merchantability and fitness for a particular purpose. All liabilities arising from the supply of the information (including any liability arising in negligence) are excluded to the fullest extent permitted by law." ;
                :further_info_url = "http://luh.umd.edu" ;
                :frequency = "yr" ;
                :institution_id = "UofMD" ;
                :institution = "University of Maryland (UofMD), College Park, MD 20742, USA" ;
                :realm = "land" ;
                :source = "LUH3 V0: Land-Use Harmonization Data Set for CMIP7" ;
                :comment = "LUH3 V0" ;
                :title = "UofMD LUH3 V0 dataset prepared for CMIP7" ;
                :dataset_version_number = "LUH3 V0" ;
                :contact = "lchini@umd.edu, gchurtt@umd.edu" ;
                :activity_id = "input4MIPs" ;
                :source_id = "UofMD-landState-3-0" ;
                :target_mip = "CMIP" ;
                :mip_era = "CMIP6Plus" ;
                :references = "Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-5425-2020), Chini et al. 2021 (https://doi.org/10.5194/essd-13-4175-2021)" ;
                :history = "Wed Sep 25 09:16:37 2024: ncrename -a ._Fillvalue,_FillValue transitions_new_vars3.nc" ;
                :NCO = "netCDF Operators version 5.0.0 (Homepage = http://nco.sf.net, Code = http://github.com/nco/nco)" ;
                :variable_id = "multiple-transitions" ;
...

@znichollscr
Collaborator Author

Hi @lchini, thanks again for your patience with this. I found one more thing. I realise that @durack1 and I have thrown quite a lot at you now, so I've tried to summarise below too.

The extra thing

The time bounds values are still not coming through as expected. For example, if I look at the time bounds, the values are

>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(855, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

What this is basically saying is that the first time step goes from 850-01-01 to 851-01-01, that's all good. However, it then says that the second timestep goes from 852-01-01 to 853-01-01 i.e. one year too far forward. For the third timestep, the bounds are 854-01-01 to 855-01-01, now two years too far forward.

This looks like some sort of stacking issue. If I look in the middle of the bounds, I see that the bounds effectively restart:

>>> tmp["time_bnds"].values[585:589, :]
array([[cftime.DatetimeNoLeap(2020, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(854, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

I think this should be an easy fix. In pseudo-code, what you want is

time_bounds = [
    time,  # start of each bound is start of the timestep
    time + 365,  # end of each bound is 365 days after the start of the timestep
].T  # then transpose it all so that time is the first axis and the bound is the second

The first few values should then look like

>>> tmp["time_bnds"].values[:3, :]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

or, in raw values

>>> tmp["time_bnds"].values[:3, :]
array([[0, 365],
       [365, 730],
       [730, 1095]],
      dtype=object)
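
In numpy terms the whole construction is two lines (a sketch assuming the 1175 annual steps of this dataset):

```python
import numpy as np

time = np.arange(1175) * 365                     # days since 0850-01-01
time_bnds = np.column_stack([time, time + 365])  # shape (time, bnds)

print(time_bnds[:3].tolist())  # [[0, 365], [365, 730], [730, 1095]]
```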

Summary of things to fix (as I see them):

  • the time bounds values
  • remove standard_name for everything except time, lon and lat
  • make the creation date the creation date of the file
    • matlab code: join([replace(char(datetime('now','Format','yyyy-MM-dd_HH:mm:ss','Timezone','Z')),'_','T'),'Z'])
  • add a tracking ID to each file
    • matlab code: join(["hdl:21.14100",char(java.util.UUID.randomUUID)],"/")
  • add the attribute "nominal_resolution" with value "25 km"
  • add the attribute "source_version" with value "3.0"

Then I think we're golden (or, at least, very close)

@lchini

lchini commented Sep 28, 2024

OK, I think I've taken care of that list now (it took me a while to figure out why the time bounds weren't working as expected!). I've uploaded a new set of files to the FTP server. Let me know how they look.

@durack1
Contributor

durack1 commented Sep 28, 2024

@lchini this is great, I can confirm that the following are valid in all files:

  • creation_date
  • nominal_resolution
  • source_version
  • tracking_id

A query about the time axis: these now look great, spanning the 850-2023 period (with 2024-01-01 as a bound), but the filename suggests we have coverage from 850 to 2024. I think we need to rename the file, as our last time entry is 2023 - @lchini can you confirm? See below

(xcd061nctax) bash-4.2$ python
Python 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:43:09) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import xarray as xr
>>> fh = xr.open_dataset("../LouiseChini-landUseChange/20240927/multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc")
>>> fh
<xarray.Dataset>
Dimensions:      (time: 1175, lat: 720, lon: 1440, nbnd: 2)
Coordinates:
  * time         (time) object 0850-01-01 00:00:00 ... 2024-01-01 00:00:00
  * lat          (lat) float64 89.88 89.62 89.38 89.12 ... -89.38 -89.62 -89.88
  * lon          (lon) float64 -179.9 -179.6 -179.4 -179.1 ... 179.4 179.6 179.9
Dimensions without coordinates: nbnd
Data variables: (12/36)
    fertl_c3ann  (time, lat, lon) float32 ...
    irrig_c3ann  (time, lat, lon) float32 ...
    cpbf1_c3ann  (time, lat, lon) float32 ...
    fertl_c4ann  (time, lat, lon) float32 ...
    irrig_c4ann  (time, lat, lon) float32 ...
    cpbf1_c4ann  (time, lat, lon) float32 ...
    ...           ...
    prtct_primn  (time, lat, lon) float32 ...
    prtct_secdf  (time, lat, lon) float32 ...
    prtct_secdn  (time, lat, lon) float32 ...
    prtct_pltns  (time, lat, lon) float32 ...
    addtc        (time, lat, lon) float32 ...
    time_bnds    (time, nbnd) object ...
Attributes: (12/27)
    host:                UMD College Park
    creation_date:       2024-09-27T17:30:27Z
    Conventions:         CF-1.6
    data_structure:      grid
    dataset_category:    landState
    grid_label:          gn
    ...                  ...
    references:          Hurtt et al. 2020 (https://doi.org/10.5194/gmd-13-54...
    history:             Fri Sep 27 13:31:20 2024: ncrename -a ._Fillvalue,_F...
    variable_id:         multiple-management
    nominal_resolution:  25 km
    source_version:      3.0
    tracking_id:         hdl:21.14100/d444819c-035b-4663-999d-eff2ce8170ac

>>> fh["time_bnds"].values[:-1,:]
array([[cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(853, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       ...,
       [cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
       [cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True),
        cftime.DatetimeNoLeap(2024, 1, 1, 0, 0, 0, 0, has_year_zero=True)]],
      dtype=object)

>>> fh["time"].values[:-1]
array([cftime.DatetimeNoLeap(850, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(851, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(852, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       ...,
       cftime.DatetimeNoLeap(2021, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2022, 1, 1, 0, 0, 0, 0, has_year_zero=True),
       cftime.DatetimeNoLeap(2023, 1, 1, 0, 0, 0, 0, has_year_zero=True)],
      dtype=object)

@znichollscr files in the path above, all 4.

@znichollscr
Collaborator Author

Thanks Paul. Almost there @lchini! Now that I've seen all four files, there are a few more questions.

Overall questions

  • is it expected that some files have data to 2024 while others only go to 2023?

File by file

multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

  • all looks good from the checks I have run

multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

  • the data in this file only goes up to 2023, hence the filename should have '2024' -> '2023'. However, this raises the question above, is it expected that the data only goes to 2023 or is there some bug here? (Maybe it makes sense, these are transitions so while we have states for 2024, we don't have the transition over the period 2024 to 2025 yet?)
  • otherwise all looks good from the checks I have run

multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc

  • the units for some variables are "kg ha-1 yr-1 (crop season)". Having (crop season) in the units doesn't seem correct and the cf-checker raises an error about this. Are these normal units, or should the units just be "kg ha-1 yr-1"?
  • otherwise all looks good from the checks I have run

staticData_quarterdeg.nc

  • the value of "_FillValue" for lon and lat is float when the data has a type of double, which the cf-checker complains about. Options for fixing: a) delete the "_FillValue" attribute (you have "missing_value" anyway) b) update the value of "_FillValue" so it's a double not a float.
  • lat needs an attribute, "bounds", with value "lat_bounds"
    • optional: rename "lat_bounds" to "lat_bnds"
  • lon needs an attribute, "bounds", with value "lon_bounds"
    • optional: rename "lon_bounds" to "lon_bnds"
  • the frequency attribute should have a value of "fx" (currently it is "yr")
  • the institution_id attribute should have a value of "UofMD" (currently it is missing from this file)
  • the variable_id attribute should have a value of "multiple-static" (or something like that, currently it is missing from this file)
  • the grid_label attribute should have a value of "gn" (currently it is missing from this file)
  • the dataset_category attribute should have a value of "landState" (currently it is missing from this file)
  • otherwise all looks good from the checks I have run
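
For convenience, the global-attribute fixes for the static file collapse to a small dict (values as listed above; how it gets applied, e.g. via netCDF4's setncatts, depends on the writing tool):

```python
# Requested global-attribute values for the static file. The
# variable_id value is the suggestion above and may still change.
static_attr_fixes = {
    "frequency": "fx",
    "institution_id": "UofMD",
    "variable_id": "multiple-static",
    "grid_label": "gn",
    "dataset_category": "landState",
}
print(sorted(static_attr_fixes))
```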

@durack1
Contributor

durack1 commented Sep 28, 2024

@lchini I'd also note that I'd have to rename staticData_quarterdeg.nc, so if you're making any changes, a filename update to multiple-fixed_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn.nc, and then the variable_id = "multiple" noted above.

@znichollscr has already noted some other tweaks above #123 (comment) - these are very, very close!

@lchini

lchini commented Sep 30, 2024

Regarding the years that these datasets represent:

  • the states file provides the land-use state on January 1 each year from 850 to 2024. So, it does extend all the way to the year 2024.
  • the transitions file provides the land-use transition that occurs during the year that begins on January 1. So this file includes data for the years 850 to 2023, and does not have a time point for 2024
  • the management file provides land management information. Strictly speaking it is the management of the land on January 1 each year (for example the area being irrigated), but in reality some of these management variables describe processes that would occur during the year (similar to the transitions) such as the fertilizer usage, or the fraction of wood harvest used for fuel or products. So, this file has data for the years 850 to 2024 but some of the 2024 data is probably not used by models.

So, I would prefer to keep the filenames (and the years of data) as they are now. This is the way we have provided this data for many years now. Does this seem like a reasonable plan?

For the fertilizer units I think we can remove the "crop season" part. In theory we are providing the amount of fertilizer applied to the land per ha per year and per crop season, but since we don't actually represent double cropping in the dataset, and I don't think we have full consistency between the historical data and future scenarios on this point, I think we can remove the crop season part from the units. If we did end up feeling like that was a necessary part of the fertilizer units, is there another way that we should represent that in these files?

@znichollscr
Collaborator Author

  • the transitions file provides the land-use transition that occurs during the year that begins on January 1. So this file includes data for the years 850 to 2023, and does not have a time point for 2024

Makes sense.

So, I would prefer to keep the filenames (and the years of data) as they are now. This is the way we have provided this data for many years now. Does this seem like a reasonable plan?

If you mean that the filenames would be:

  • multiple-states_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
  • multiple-management_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2024.nc
  • multiple-transitions_input4MIPs_landState_CMIP_UofMD-landState-3-0_gn_0850-2023.nc

Yes

If we did end up feeling like that was a necessary part of the fertilizer units, is there another way that we should represent that in these files?

3 options I see:

  • put this quasi-unit info in the long_name
  • find a unit that does represent this in the CF-conventions/udunits (might already exist, who knows)
  • just go back to the units we have and ignore the warnings about a non-standard unit

As a note: if it's per year, shouldn't the number of crop seasons already be included (e.g. if there were 2 crop seasons in 1876, then the total application in that year would be twice as much as in a year with only 1 crop season)? Or are models meant to multiply this by the number of crop seasons in their model to get total application in a year?

@durack1
Contributor

durack1 commented Sep 30, 2024

At this point, I wonder whether we're good enough for the v0 land use change dataset. @vnaik60, are these files usable for the NOAA-GFDL team?

I note that for any variable you can always add a per-variable "comment" attribute (any attribute could be added), which provides some context for folks using these data.

So we'll need to rename the multiple-transitions* file to 2023, otherwise we're good to go, no?

@lchini would you prefer to make a couple more tweaks to target the questions of @znichollscr or are you good for publication to begin? As an FYI, this likely wouldn't start until Thursday this week anyway, as @sashakames is travelling

@lchini

lchini commented Oct 1, 2024

Thanks for the feedback! Since we have a couple of days before the publication would begin, why don't I try to make those last few tweaks and then we should hopefully be all set!

@durack1
Contributor

durack1 commented Oct 1, 2024

@lchini ok great, if you're happy to catch the final tweaks then let's wait for that. If the files are on the FTP server mid to late week, I can pull these across and then get them in the publication queue, hopefully for a Thursday release!

Woo hoo!

@lchini

lchini commented Oct 2, 2024

The new files have been uploaded to the server.

@durack1
Contributor

durack1 commented Oct 2, 2024

@lchini wonderful!

Just to confirm the files uploaded into UofMD-landState-3-0_241001_1 are now found locally at ../LouiseChini-landUseChange/20241001 - @znichollscr have at them!

Woo hoo!

@znichollscr
Collaborator Author

Good to publish I think @durack1 !

@durack1
Contributor

durack1 commented Oct 2, 2024

Excellent, checks out for me, so I have moved this into the publication queue - hopefully these files are live tomorrow!

Nice work @lchini

@durack1
Contributor

durack1 commented Oct 3, 2024

Good news, @lchini you're live! See here.

Please take a peek to check it's all right, looks great to me, and files are downloading quickly (on-site here at LLNL)

@durack1
Contributor

durack1 commented Oct 4, 2024

Fixed by #127, closing
