diff --git a/appa.adoc b/appa.adoc index d59818f3..1eaa8ee9 100644 --- a/appa.adoc +++ b/appa.adoc @@ -9,7 +9,7 @@ See <> for the grid mapping attributes, and <> for the distinction between **BI** and **BO**), and **-** for variables with some other purpose. +For variable attributes, the possible values of "Use" are: **C** for variables containing coordinate data, **D** for data variables, **M** for geometry container variables, **Q** for quantization container variables, **Do** for domain variables, **BI** and **BO** for boundary variables (see <> for the distinction between **BI** and **BO**), and **-** for variables with some other purpose. CF does not prohibit any of these attributes from being attached to variables of different kinds from those listed as their "Use" in this table, but their meanings are not defined by CF if they are used in these other ways. "Links" indicates the location of the attribute's original definition (first link) and sections where the attribute is discussed in this document (additional links as necessary). @@ -38,6 +38,12 @@ Attribute If both **`scale_factor`** and **`add_offset`** attributes are present, the data are first scaled before the offset is added. In cases where there is a strong constraint on dataset size, it is allowed to pack the coordinate variables (using add_offset and/or scale_factor), but this is not recommended in general. +| **`algorithm`** +| S +| Q +| <>, and <> +| Name of the quantization algorithm employed. + | **`ancillary_variables`** | S | D @@ -200,6 +206,12 @@ Use in conjunction with **`flag_meanings`**. | link:$$https://www.unidata.ucar.edu/software/netcdf/docs/attribute_conventions.html$$[NUG Appendix A, "Attribute Conventions"] | List of the applications that have modified the original data. +| **`implementation`** +| S +| Q +| <>, and <> +| The name and version of the library or client software that performed the quantization with **`algorithm`**.
+ | **`instance_dimension`** | S | - @@ -300,6 +312,26 @@ Allowed for auxiliary coordinate variables but not allowed for coordinate variab | <> | Direction of increasing vertical coordinate value. +| **`quantization`** +| S +| D +| <> +| Identifies a variable that defines a quantization algorithm and its provenance. + +| **`quantization_nsb`** +| N +| D +| <>, and <> +| Specifies the number of significant bits retained in the IEEE mantissa of data quantized with the BitRound algorithm. +Use in conjunction with **`quantization`**. + +| **`quantization_nsd`** +| N +| D +| <>, and <> +| Specifies the number of significant base-10 digits retained in the IEEE mantissa of data quantized with base-10 quantization algorithms. +Use in conjunction with **`quantization`**. + | **`references`** | S | G, D diff --git a/bibliography.adoc b/bibliography.adoc index 5673cc8d..3e0b5f9b 100644 --- a/bibliography.adoc +++ b/bibliography.adoc @@ -3,11 +3,15 @@ [bibliography] === References +- [[[CFDM]]] link:$$https://doi.org/10.5194/gmd-10-4619-2017$$[A data model of the Climate and Forecast metadata conventions (CF-1.6) with a software implementation (cf-python v2.1)]. Hassell, D., Gregory, J., Blower, J., Lawrence, B. N., and Taylor, K. E.: _Geosci. Model Dev._, 10, 4619-4646, 2017. - [[[COARDS]]] link:$$https://ferret.pmel.noaa.gov/Ferret/documentation/coards-netcdf-conventions$$[Conventions for the standardization of NetCDF Files]. Sponsored by the "Cooperative Ocean/Atmosphere Research Data Service," a NOAA/university cooperative for the sharing and distribution of global atmospheric and oceanographic research data sets. May 1995. +- [[[DCG19]]] link:$$https://doi.org/10.5194/gmd-12-4099-2019$$[Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files]. Delaunay, X., A. Courtois, and F. Gouillon: _Geosci. Model Dev._, 12, 4099-4113, 2019. 
- [[[FGDC]]] link:$$https://www.fgdc.gov/standards/projects/FGDC-standards-projects/metadata/base-metadata/v2_0698.pdf$$[Content Standard for Digital Geospatial Metadata]. Federal Geographic Data Committee, FGDC-STD-001-1998. - [[[IEEE_754]]] link:$$https://doi.org/10.1109/IEEESTD.2019.8766229$$[IEEE Standard for Floating-Point Arithmetic], in _IEEE Std 754-2019 (Revision of IEEE 754-2008)_, 22 July 2019. +- [[[Kou21]]] link:$$https://doi.org/10.5194/gmd-14-377-2021$$[A note on precision-preserving compression of scientific data]. Kouznetsov, R.: _Geosci. Model Dev._, 14, 377-389, 2021. +- [[[KRD21]]] link:$$https://doi.org/10.1038/s43588-021-00156-2$$[Compressing atmospheric data into its real information content]. Klöwer, M., Razinger, M., Dominguez, J. J., Düben, P., and Palmer, T. N.: _Nat. Comput. Sci._, 1, 713-724, 2021. - [[[NetCDF]]] link:$$https://doi.org/10.5065/D6H70CW6$$[NetCDF Software Package]. UNIDATA Program Center of the University Corporation for Atmospheric Research. - [[[NUG]]] link:$$https://docs.unidata.ucar.edu/nug/current/index.html$$[The NetCDF User's Guide]. - [[[OGC_WKT-CRS]]] link:$$https://www.opengeospatial.org/standards/wkt-crs$$[OGC Well-known text representation of coordinate reference systems]. @@ -16,7 +20,7 @@ OGC document 12-063. 1st May 2015. - [[[SCH02]]] link:$$https://doi.org/10.1175/1520-0493(2002)130<2459:ANTFVC>2.0.CO;2$$[A new terrain-following vertical coordinate formulation for atmospheric prediction models]. C Schaer, D Leuenberger, and O Fuhrer. 2002. _Monthly Weather Review_. 130. 2459-2480. - [[[Snyder]]] link:$$https://doi.org/10.3133/pp1395$$[Map Projections: A Working Manual]. USGS Professional Paper 1395. - [[[UDUNITS]]] link:$$https://doi.org/10.5065/D6KD1WN0$$[UDUNITS Software Package]. UNIDATA Program Center of the University Corporation for Atmospheric Research. 
+- [[[UGRID]]] link:$$https://ugrid-conventions.github.io/ugrid-conventions$$[UGRID Conventions for storing unstructured (or flexible mesh) data in netCDF files] - [[[W3C]]] link:$$https://www.w3.org/$$[World Wide Web Consortium (W3C)]. - [[[XML]]] link:$$https://www.w3.org/TR/1998/REC-xml-19980210$$[Extensible Markup Language (XML) 1.0]. T. Bray, J. Paoli, and C.M. Sperberg-McQueen. 10 February 1998. -- [[[CFDM]]] link:$$https://doi.org/10.5194/gmd-10-4619-2017$$[A data model of the Climate and Forecast metadata conventions (CF-1.6) with a software implementation (cf-python v2.1)]. Hassell, D., Gregory, J., Blower, J., Lawrence, B. N., and Taylor, K. E.: _Geosci. Model Dev._, 10, 4619-4646, 2017. -- [[[UGRID]]] link:$$https://ugrid-conventions.github.io/ugrid-conventions$$[UGRID Conventions for storing unstructured (or flexible mesh) data in netCDF files] +- [[[Zen16]]] link:$$https://doi.org/10.5194/gmd-9-3199-2016$$[Bit Grooming: Statistically accurate precision-preserving quantization with compression, evaluated in the netCDF Operators (NCO, v4.4.8+)]. Zender, C. S.: _Geosci. Model Dev._, 9, 3199-3211, 2016. diff --git a/ch01.adoc b/ch01.adoc index c80c7fd0..7fa20950 100644 --- a/ch01.adoc +++ b/ch01.adoc @@ -104,6 +104,8 @@ out-of-group reference:: A reference to a variable or dimension that is not cont path:: Paths must follow the UNIX style path convention and may begin with either a '/', '..', or a word. +quantization variable:: A variable used as a container for attributes that define a specific quantization algorithm. The type of the variable is arbitrary since it contains no data. + recommendation:: Recommendations in this convention are meant to provide advice that may be helpful for reducing common mistakes. In some cases we have recommended rather than required particular attributes in order to maintain backwards compatibility with COARDS. An application must not depend on a dataset's adherence to recommendations. 
@@ -226,4 +228,4 @@ The UGRID conventions description is referenced from, rather than rewritten into A summary indicating how UGRID relates to other parts of the CF conventions, and which features of UGRID are excluded from CF, can be found in <>. To reduce the chance of ambiguities arising from their accidental re-use, all of the UGRID standardized attributes are specified in <> and <>. -The UGRID conventions have their own conformance document, which should be used in conjunction with the CF conformance document when checking the validity of datasets. \ No newline at end of file +The UGRID conventions have their own conformance document, which should be used in conjunction with the CF conformance document when checking the validity of datasets. diff --git a/ch08.adoc b/ch08.adoc index b5b8dc5f..6c387ad2 100644 --- a/ch08.adoc +++ b/ch08.adoc @@ -5,7 +5,7 @@ There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. -By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy. +By lossy compression we mean techniques that either store the data more efficiently and retain its precision but result in some loss in accuracy, or intentionally reduce data precision to improve the efficiency of subsequent lossless compression. Lossless compression only works in certain circumstances, e.g., when a variable contains a significant amount of missing or repeated data values. In this case it is possible to make use of standard utilities, e.g., UNIX **`compress`** or GNU **`gzip`**, to compress the entire file after it has been written.
@@ -675,3 +675,133 @@ The data creator shall specify the floating-point arithmetic precision used duri Using the given computational precision in the interpolation computations is a necessary, but not sufficient, condition for the data user to be able to reconstitute the coordinates to an accuracy comparable to that intended by the data creator. For instance, a **`computational_precision`** value of **`"64"`** would specify that, using the same implementation and hardware as the creator of the compressed dataset, sufficient accuracy could not be reached when using a floating-point precision lower than 64-bit floating-point arithmetic in the interpolation computations required to reconstitute the coordinates. +[[lossy-compression-via-quantization, Section 8.4, "Lossy Compression via Quantization"]] +=== Lossy Compression via Quantization + +Geoscientific models and measurements generate false floating-point precision (scientifically meaningless data bits) that wastes storage space. +False precision can mislead (by implying noise is signal) and is scientifically pointless. +Quantization algorithms can eliminate false precision, usually by rounding the least significant bits of <> floating-point mantissas to zeros. +(Quantization of integer types, although theoretically allowed, is not covered by this convention.) +The quantized results are valid <> values---no special software or decoder is necessary to read them. +Importantly, the quantized bits compress more efficiently than random bits. +Thus quantization is sometimes referred to as a form of lossy compression although, strictly speaking, quantization only pre-conditions data for more efficient compression by a subsequent compressor. + +The CF conventions of this section define a metadata framework to record quantization properties alongside quantized floating-point data variables. +The goals are twofold.
+First, to inform interested users how, and to what degree, the quantized data differ from the original unquantized data, which are not stored in the dataset and may no longer exist. +Second, to provide the necessary provenance metadata for users to reproduce the data transformations on the same or other raw data. +These conventions also allow users to better understand the precision that data producers expect from source models or measurements. + +These conventions must not be used with data variables of integer type. +They must not be used with any variable, even if it is also a data variable, that serves as a coordinate variable, or is named by a **`coordinates`**, **`formula_terms`** or **`cell_measures`** attribute of any other variable. +This is because variables that provide metadata or are used in computation of domain metrics are often known to the highest precision possible, and degrading the precision of metadata properties may have unintended side effects on the accuracy of subsequent operations such as regridding, interpolation, and conservation checks. +These variables can include spatial and temporal coordinate variables (e.g., **`latitude`**, **`longitude`**, **`level`**, **`time`**), properties derived from these coordinates (e.g., **`area`**, **`volume`**), and variables referenced by the **`formula_terms`** attribute of a coordinate variable. + +[[quantization-variables, Section 8.4.1, "Quantization Variables"]] +==== Quantization variables + +A quantization variable describes a quantization algorithm via a collection of attached attributes. +It is of arbitrary type since it contains no data. +Its purpose is to act as a container for the generic attributes of a quantization algorithm. +Quantization variables are required to have at least two attributes: **`algorithm`** and **`implementation`**. + +The **`algorithm`** attribute names a specific quantization algorithm. 
+Four quantization algorithms are currently recognized: BitRound, BitGroom, DigitRound, and Granular BitRound. +The controlled vocabulary for these algorithms thus consists of **`bitround`**, **`bitgroom`**, **`digitround`**, and **`granular_bitround`**. +See <> for a brief summary of these algorithms. + +The second attribute required in a quantization variable is **`implementation`**. +This attribute contains unstandardized text that concisely conveys the algorithm provenance including the name of the library or client that performed the quantization, the software version, and any other information required to disambiguate the source of the algorithm employed. +The text must take the form "_software-name_ version _version-string_ [( _optional-information_ )]" such as +**`libnetcdf version 4.9.2`** in <>. + +[[per-variable-quantization-attributes, Section 8.4.2, "Per-variable Quantization Attributes"]] +==== Per-variable quantization attributes + +Each data variable that has been quantized must include at least two attributes to describe the quantization. +First, all such data variables must have a **`quantization`** attribute containing the name of the quantization variable describing the algorithm. +Second, all such variables must record the specific parameter value used in the quantization algorithm. +The input parameter for all quantization algorithms determines the precision preserved by the algorithm. + +BitRound retains the specified number of significant bits (NSB) in the IEEE mantissa, and quantizes the trailing bits. +All data variables quantized by BitRound must record the NSB in the **`quantization_nsb`** attribute. +Note that BitRound __counts only explicitly represented mantissa bits__. +It does not include the most-significant-bit with value 1 that implicitly begins all <> mantissas. 
+Thus **`quantization_nsb`** is an integer type attribute with **`1 \<= NSB \<= 23`** for data type **`float`** or **`real`**, and **`1 \<= NSB \<= 52`** for data type **`double`**. + +The BitGroom, Granular BitRound, and DigitRound algorithms guarantee preservation of a specified number of significant digits (NSD) in base 10 representation. +The actual number of mantissa bits quantized depends on the algorithm. +Thus all data variables quantized by BitGroom, Granular BitRound, or DigitRound must have a corresponding attribute **`quantization_nsd`**. +The value of **`quantization_nsd`** is an integer with **`1 \<= NSD \<= 7`** for data type **`float`** or **`real`**, and **`1 \<= NSD \<= 15`** for data type **`double`**. + +[[example-quantization-nsb-libnetcdf]] +[caption="Example 8.8. "] +.Quantization performed by BitRound algorithm in libnetcdf +==== +---- + variables: + char quantization_info ; + quantization_info:algorithm = "bitround" ; + quantization_info:implementation = "libnetcdf version 4.9.2" ; + + float ps(time,lat,lon) ; + ps:_QuantizeBitRoundNumberOfSignificantBits = 9 ; + ps:quantization = "quantization_info" ; + ps:quantization_nsb = 9 ; + ps:standard_name = "surface_air_pressure" ; + ps:units = "Pa" ; +---- +Note how the same NSB is reported in two attributes of the data variable **`ps`**. +The quantization variable (**`quantization_info`**) **`implementation`** attribute reveals that the netCDF library applied the BitRound algorithm. +The netCDF library wrote the system-defined **`_QuantizeBitRoundNumberOfSignificantBits`** attribute <> which contains the same parameter value as the CF **`quantization_nsb`** attribute (see the main text for further details). +==== + +[[example-quantization-nsd-multiple-variables-nco]] +[caption="Example 8.9. "] +.Quantization performed by Granular BitRound algorithm in NCO +==== +Quantization of different variables to different levels often makes good scientific sense. 
Here the pressure variable **`ps`** has four significant digits of precision while the temperature variable **`ts`** retains only three significant digits. +---- + variables: + char quantization_info ; + quantization_info:algorithm = "granular_bitround" ; + quantization_info:implementation = "NCO version 5.2.7" ; + + float ps(time,lat,lon) ; + ps:standard_name = "surface_air_pressure" ; + ps:units = "Pa" ; + ps:quantization = "quantization_info" ; + ps:quantization_nsd = 4 ; + + float ts(time) ; + ts:standard_name = "surface_temperature" ; + ts:units = "K" ; + ts:quantization = "quantization_info" ; + ts:quantization_nsd = 3 ; +---- +Both variables were quantized by the same algorithm and so utilize the same quantization variable. +**`quantization_info`** reveals that the Granular BitRound algorithm in NCO performed the quantization. +Since the netCDF library did not perform the quantization, there is no system-defined underscored quantization attribute. +==== + +[[quantization-algorithms-description, Section 8.4.3, "Description of Quantization Algorithms"]] +==== Description of quantization algorithms + +This section briefly describes and contrasts each recognized **`quantize`** algorithm and points to further documentation. +BitRound is also called the "round-to-nearest" method <> and the "half-to-even" method <>. +This is the default <> rounding method and is bias-free and conservative for random distributions of numbers. +BitRound is preferred when the number of significant bits (NSB) to retain is known. + +The other **`quantize`** algorithms guarantee to preserve a given number of significant (base-10 representation) digits (NSD). +Their quantization errors never exceed half of the unit value at the NSD decimal place <>. +BitGroom <> appeared first, though is now known to be suboptimal in accuracy <> and in compressibility compared to later methods. +DigitRound <> has superior compressibility for a given NSD compared to BitGroom. 
+Granular BitRound combines the DigitRound approach for compressibility with the BitRound approach for quantization. +Granular BitRound and DigitRound are both good choices when the NSD to retain is known. + +The netCDF C and Fortran libraries can directly invoke BitRound, BitGroom, and Granular BitRound [<>]. +The netCDF library attaches a long, system-defined attribute to every data variable that it quantizes, such as +**`_QuantizeBitRoundNumberOfSignificantBits = 9`** in <>. +The leading underscore indicates that the netCDF library wrote this attribute <>. +Any variable that has the library-defined attribute must, in addition, contain the corresponding CF metadata. +Example 8.9 shows how the CF metadata might appear for other (non-netCDF library) implementations of **`quantize`** algorithms. diff --git a/conformance.adoc b/conformance.adoc index ddaff5be..4e490ab9 100644 --- a/conformance.adoc +++ b/conformance.adoc @@ -536,7 +536,23 @@ The requirements on all other bounds tie point variable attributes are the same * An interpolation variable should have 0 dimensions. * The recommendations on bounds tie point variable attributes are the same as for bounds variables described in <>. -  +[[lossy-compression-via-quantization]] +=== 8.4 Lossy Compression via Quantization + +*Requirements:* + +* Quantization container variables must have two string-valued attributes, **`algorithm`** and **`implementation`**. +* The value of **`algorithm`** must be one of the values permitted by this section. +* Only floating-point type variables can be quantized. Quantized variables are identified by having a string-valued attribute named **`quantization`**. +* The value of **`quantization`** must be the name of the quantization container variable which exists in the file. 
+* Variables that were quantized must have an integer type attribute named either **`quantization_nsb`** (if the corresponding quantization variable has the **`algorithm`** attribute value **`bitround`**) or **`quantization_nsd`** (if the corresponding quantization variable has one of the **`algorithm`** attribute values **`bitgroom`**, **`digitround`**, or **`granular_bitround`**). +* The value of **`quantization_nsb`** must be in the range **`1 \<= NSB \<= 23`** for data type **`float`** or **`real`**, and **`1 \<= NSB \<= 52`** for data type **`double`**. +* The value of **`quantization_nsd`** must be in the range **`1 \<= NSD \<= 7`** for data type **`float`** or **`real`**, and **`1 \<= NSD \<= 15`** for data type **`double`**. +* Variables that serve as a coordinate variable, or are named by a **`coordinates`**, **`formula_terms`**, or **`cell_measures`** attribute of any other variable must not have a **`quantization`** attribute. +* The value of **`implementation`** must take the form +"_software-name_ version _version-string_ [( _optional-information_ )]", +where brackets indicate optional words. + [[parametric-vertical-coordinates]] === Appendix D Parametric Vertical Coordinates diff --git a/history.adoc b/history.adoc index 2cc228be..8ef18ae7 100644 --- a/history.adoc +++ b/history.adoc @@ -7,6 +7,7 @@ === Working version (most recent first) +* {issues}403[Issue #403]: Metadata to encode quantization properties. * {issues}530[Issue #530]: Define "the most rapidly varying dimension", and use this phrase consistently with the clarification "(the last dimension in CDL order)". * {issues}163[Issue #163]: Provide a convention for boundary variables for grids whose cells do not all have the same number of sides. * {issues}174[Issue #174]: A one-dimensional string-valued variable must not have the same name as its dimension, in order to avoid its being mistaken for a coordinate variable.
diff --git a/toc-extra.adoc b/toc-extra.adoc index eac9bb14..e73da712 100644 --- a/toc-extra.adoc +++ b/toc-extra.adoc @@ -96,6 +96,8 @@ J.5. <> 8.5. <> 8.6. <> 8.7. <> +8.8. <> +8.9. <> B.1. <> H.1. <> H.2. <> @@ -119,4 +121,4 @@ H.19. <> H.20. <> H.21. <> H.22. <> -I.1. <> \ No newline at end of file +I.1. <>
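
Reviewer note: the BitRound description added in Section 8.4.2 (retain NSB explicitly represented mantissa bits and quantize the trailing bits, with half-to-even rounding) can be illustrated with a short sketch. The Python below is only an illustration under stated assumptions: the function name `bitround_float32` is hypothetical, it handles a single 32-bit value using the standard library, and it is not the libnetcdf or NCO implementation.

```python
import math
import struct


def bitround_float32(value: float, nsb: int) -> float:
    """Illustrative BitRound for one float32 value (hypothetical helper).

    Keeps `nsb` explicitly stored mantissa bits and rounds the remaining
    trailing bits to zero using round-to-nearest, ties-to-even.
    """
    # Range taken from the text: 1 <= NSB <= 23 for 32-bit floats.
    if not 1 <= nsb <= 23:
        raise ValueError("nsb must satisfy 1 <= NSB <= 23 for float32")
    if not math.isfinite(value):
        return value  # assumption: NaN and infinities are passed through

    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))

    drop = 23 - nsb  # number of trailing mantissa bits to zero
    if drop > 0:
        keep_mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
        half = 1 << (drop - 1)
        lsb = (bits >> drop) & 1  # lowest retained bit, used to break ties
        # Adding (half - 1 + lsb) before masking rounds half to even.
        # A carry out of the mantissa correctly increments the exponent.
        bits = (bits + half - 1 + lsb) & keep_mask

    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    # With NSB = 9, as in Example 8.8, pi is stored as 3.140625.
    print(bitround_float32(math.pi, 9))
```

With NSB = 9 the 14 trailing mantissa bits of each value become zero, which is what lets a subsequent lossless compressor work more efficiently. The result is still an ordinary floating-point number, so no decoder is needed to read it.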