Lossy Compression by Coordinate Sampling #327
It might be better to continue the conversation from #37 on the precision of interpolation calculations (the comment thread starting at cf-convention/discuss#37 (comment)) here in this issue, as this is now the main place for discussing the PR containing the details of this proposal, of which this precision question is one. I hope that's alright, thanks,
Hi @taylor13 Thank you very much for your comments. We did have a flaw, or a weakness, in the algorithm, which we have corrected following your comments. To briefly explain: the method of the proposal stores coordinates at a set of tie points, from which the coordinates in the target domain may then be reconstituted by interpolation. The source of the problem was the computation of the squared distance between two such tie points. The distance will never be zero and could, for example, be of the order of a few kilometres. As the line between the two tie points forms a right triangle with two other lines of known length, the fastest way to compute the squared distance is to use Pythagoras's theorem. However, as the two other sides are both of a length significantly larger than the one we wish to calculate, the result was very sensitive to rounding in 32-bit floating-point calculations and occasionally returned zero. We have now changed the algorithm to compute the squared distance in a less rounding-sensitive way.

In terms of how accurately the method reconstitutes the original coordinates, the change improved the performance of the internal calculations when carried out in 32-bit floating-point, although the errors are still a couple of times larger than when using 64-bit floating-point calculations. I would therefore support the proposal put forward by @davidhassell. The proposal avoids setting a general rule, which, as you point out, may not cover all cases. It permits setting a requirement when needed to reconstitute data with the accuracy intended by the data creator.

Once again, thank you very much for your comments – further comments from your side on the proposal would be highly welcome! Cheers
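As a generic, hypothetical illustration of the rounding problem described above (this is not the proposal's corrected formula, which is given in the PR), squaring and subtracting two nearly equal values in 32-bit floating point cancels most of the significant digits, whereas the algebraically identical factored form does not:

```python
import numpy as np

# Hypothetical side lengths; both values are exactly representable in float32.
a = np.float32(10_000_001.0)
b = np.float32(10_000_000.0)

naive = a * a - b * b        # squares first, then subtracts: severe cancellation
stable = (a - b) * (a + b)   # algebraically identical, far less rounding-sensitive

print(naive)   # 16777216.0  (about 16% below the true value 20000001)
print(stable)  # 20000000.0  (essentially exact in float32)
```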
For convenience, here is the proposal for specifying the precision to be used for the interpolation calculations (slightly robustified):
Do you think that this might work, @taylor13? Thanks, |
Thanks @AndersMS for the care taken to address my concern, and thanks @davidhassell for the proposed revision. A few minor comments:
In the example, then, "0D" would be replaced by "decimal64". |
Hi @taylor13, 1: I agree that higher precisions should be allowed. A modified description (which could do with some rewording, but the intent is clear for now, I hope):
2: [...] 3: A controlled vocabulary is certainly clearer than my original proposal, both in terms of defining the concept and the encoding, and the IEEE standard does indeed provide what we need. I wonder if it might be good to define the (subset of) IEEE terms ourselves in a table (I'm reminded of https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#table-supported-units) rather than relying on the contents of the external standard, to avoid the potential governance issues we always have when standards outside of CF's influence are brought in. Would the "binary" terms be valid, as well as the "decimal" ones?
yes, |
Thank you @taylor13 for the proposals and @davidhassell for the implementation details. I fully agree with your points 1, 2 and 3. There is possibly one situation that might need attention. If the coordinates subject to compression are stored in decimal64, typically we would require the computations to be decimal64 too, rather than decimal32. We could deal with that either by: A. Using the scheme proposed above, requiring the data creator to set the computational_precision attribute accordingly; or B. [...]. Probably A would be the cleanest, what do you think?
Thanks, @taylor13 and @AndersMS, I, too, would favour A (using the scheme proposed above, requiring the data creator to set the computational_precision accordingly). I'm starting to think that we need to be clear about what terms such as decimal64 and binary64 actually mean. Could the answer be to define our own vocabulary of computational_precision values? Or am I over-complicating things?
I don't understand the difference between decimal64 and binary64 or what they precisely mean. If these terms specify things beyond precision, it's probably not appropriate to use them here, so I would support defining our own vocabulary, which would not confuse precision with anything else. And I too would favor (or favour) A over B. |
Hi @taylor13 and @davidhassell, I am not fully up to date on the data types, but following the links that David sent, it appears that decimal64 is a base-10 floating-point number representation intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. I think we can disregard that for now. binary32 and binary64 are the new official IEEE 754 names for what used to be called single- and double-precision floating-point numbers respectively, and are what most of us are familiar with.

I would suggest that we do not require a specific floating-point arithmetic standard to be used, but rather a level of precision. If we adopt the naming convention proposed by David, it could look like: By default, the user may use any floating-point arithmetic precision they like for the interpolation calculations. If the computational_precision attribute is set, the specified precision shall be used for the interpolation calculations. The allowed values of computational_precision are:

(table)

I think that would achieve what we are after, while leaving the implementers the freedom to use what their programming language and computing platform offers. What do you think?
looks good to me. Can we omit "base-2" from the descriptions, or is that essential? Might even reduce description to, for example:
Leaving out "base-2" is fine. Shortening the description further as you suggest would also be fine with me. I am wondering if we could change the wording to: "The floating-point arithmetic precision should match or exceed the precision specified by the computational_precision attribute."

(table)

If the computational_precision attribute is not set, a default precision would be assumed. That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for the 16-bit floating-point variables (CPU design).
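As a rough sketch of how a reader might honour this "match or exceed" rule (the attribute name and the allowed values follow the discussion above; the function and everything else here are illustrative assumptions, not part of the proposal):

```python
import numpy as np

# Hypothetical mapping from computational_precision values to minimum NumPy dtypes.
_MIN_DTYPE = {"32": np.float32, "64": np.float64}

def working_dtype(tie_points, computational_precision=None):
    """Choose a dtype for the interpolation arithmetic.

    If computational_precision is given, match or exceed that precision;
    if it is absent, fall back to the tie points' own dtype (the reader
    is then free to choose whatever precision it likes).
    """
    if computational_precision is None:
        return tie_points.dtype
    minimum = _MIN_DTYPE[computational_precision]
    return np.promote_types(tie_points.dtype, minimum)

tie_points = np.array([1.5, 2.5, 3.5], dtype=np.float32)
print(working_dtype(tie_points, "64"))  # float64: exceeds the stored float32
print(working_dtype(tie_points))        # float32: no attribute, reader's choice
```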
@taylor13 by the way I'm still on the prowl for a moderator for this discussion. As I see you've taken an interest, would you be willing to take on that role? I'd be able to do it as well, but as I've been involved in this proposal for quite some time it would be nice to have a fresh set of eyes on it. |
Hi Anders,
This is good for me.
I'm not so sure about having a default value. In the absence of guidance from the creator, I'd probably prefer that the user is free to use whatever precision they would like. Thanks, David |
Hi David, Fine, I take your advice regarding not having a default value. That is probably also simpler - one rule less. Anders |
Hi Anders - thanks, it sounds like we're currently in agreement - do you want to update the PR? |
Hi David, Yes, I would be happy to update the PR. However, I still have one concern regarding the computational_precision attribute. In the introduction to Lossy Compression by Coordinate Sampling in chapter 8, I am planning to change the last sentence from
to
where section X will be a new short section in chapter 8 describing the computational_precision attribute. Recalling that we also write in the introduction to Lossy Compression by Coordinate Sampling in chapter 8 that
I think it would be more consistent if we make the computational_precision attribute mandatory. Would that be agreeable?
Hi Anders,
That's certainly agreeable to me, as is your outline of how to change chapter 8. Thanks, |
Wouldn't the statement be correct as is (perhaps rewritten slightly; see below), if we indicated that if the computational_precision attribute is not specified, a default precision of "32" should be assumed? I would think that almost always the default precision would suffice, so for most data writers, it would be simpler if we didn't require this attribute. (But I don't feel strongly about this.) Not sure how to word this precisely. Perhaps:
Hi @taylor13 and @davidhassell, Regarding the computational_precision attribute, I have written two versions of the new section 8.3.8, one for each of the two proposals (optional and mandatory). I hope that will help with the decision! Anders

Optional attribute version:

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied. To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's computational_precision attribute.

(table)

For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the computational_precision attribute.

Mandatory attribute version:

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied. To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset must specify the floating-point arithmetic precision by setting the interpolation variable's computational_precision attribute.

(table)

For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the computational_precision attribute.
I have a preference for "optional" because I suspect in most cases 32-bit will be sufficient and this would relieve data writers from including this attribute. There may be good reasons for making it mandatory; what are they? Not sure about this, but I think "should" rather than "shall" is better. |
Dear all, I've studied the text of proposed changes to Sect 8, as someone not at all involved in writing it or using these kinds of technique. (It's easier to read the files in Daniel's repo than the pull request in order to see the diagrams in place.) I think it all makes sense. It's well designed and consistent with the rest of CF. Thanks for working it out so thoughtfully and carefully. The diagrams are very good as well. I have not yet reviewed Appendix J or the conformance document. I'm going to be on leave next week, so I thought I'd contribute just this part before going. Best wishes, Jonathan

There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the [...]. Also, I have some suggestions for naming:
In the first paragraph of Sect 8 we distinguish three methods of reduction of dataset size. I would suggest minor clarifications:
Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now. There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow:
Dear @JonathanGregory Thank you very much for your rich and detailed comments and suggestions, very appreciated. The team behind the proposal met today and discussed all the points you raised. We have prepared or are in the process of preparing replies to each of the points. However, before sharing these here, we would like to update the proposal text accordingly via pull requests, in order to see if the changes have other effects on the overall proposal, which we have not yet identified. Best regards, |
Dear All, Following a discussion yesterday in the team behind the proposal, we propose the `computational_precision` attribute to be optional. Here is the proposed text, which now has a reference to [IEEE_754]. Feel free to comment. Anders

8.3.8 Computational Precision

The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied. To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's computational_precision attribute.

(table)

For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the computational_precision attribute.

Bibliography

[IEEE_754] "IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
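Purely for illustration, a writer recording this attribute with the netCDF4-python library might look like the sketch below; the variable names, the interpolation_name attribute and the method name are assumptions for the sketch, not text from the proposal:

```python
import numpy as np
from netCDF4 import Dataset

# Hypothetical writer-side sketch: store tie points and record the requested
# computational precision on a scalar interpolation container variable.
with Dataset("subsampled_example.nc", "w") as nc:
    nc.createDimension("tp_dimension", 65)

    lat_tp = nc.createVariable("lat_tie_points", "f4", ("tp_dimension",))
    lat_tp.units = "degrees_north"
    lat_tp[:] = np.linspace(-60.0, 60.0, 65)

    interp = nc.createVariable("interpolation", "i4", ())   # container variable
    interp.interpolation_name = "linear"        # assumed standardized method name
    interp.computational_precision = "64"       # request 64-bit arithmetic on read
```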
Thanks, Sylvain, I support this suggestion |
I too like the idea to recommend that data producers report positional errors (and I guess other coordinate value errors) between the original data coordinates and the reconstituted coordinates in the comment attribute of the coordinate variables. Regarding the specification of the computational precision, which is required as input for the method to achieve an accuracy within the errors reported in the comment of the coordinate variable, my preference would still be the computational_precision attribute.

The reason for this preference is that I have tried out different selections of interpolation method, degree of subsampling (4x4, 8x8, 16x16, 64x16) and computational precision (64-bit, 32-bit floating-point arithmetic) on a test data set. All three components can have a comparable effect on the positional error between the original and the uncompressed file, which I think justifies specifying the computational precision in the same way as we specify the interpolation method and the degree of subsampling.

@erget: It is true that the Conventions do not address computational precision, but I guess there are a number of undocumented and implicit assumptions. Say, if you have specified a grid mapping for coordinates represented as 64-bit floating-point, one would assume that the conversion between the two reference frames has been performed using 64-bit floating-point arithmetic, otherwise significant errors would be introduced. Considering the complexity of what we are doing, I think that stating the computational precision explicitly would be the safest. Best regards,
Dear team, Following our meeting this afternoon, I propose the following new paragraph at the end of the section "Tie Points and Interpolation Subareas":
Please let me know if you have comments. Anders Done: 0c5b732 |
Dear @JonathanGregory, Just an update regarding the Lossy Compression by Coordinate Subsampling. We have completed the implementation of the 16 changes in response to your comments on chapter 8. I have edited the comment above to include a link to the related commit(s) for each of the changes. Generally we are very happy with the outcome and in particular the renaming of terms and attributes that you proposed has made the text easier to read. You might wish to take a look at the rewritten section "Interpolation of Cell Boundaries". In response to your proposed change 15, we have had several discussions and meetings, resulting in a new concept for bounds interpolation. You will find the new section as the last in f3de508. We will still do one more iteration on the section on Computational Precision, we will publish it here within the next days. Regarding the Appendix J, we have nearly completed the changes required to reflect the changes in Chapter 8. We expect to complete the update tomorrow or Thursday and I think it would make sense for you to wait for that before reading the appendix J. Best regards, |
Dear @AndersMS Thanks for the update and your hard work on this. I will read the section again in conjunction with Appendix J, once you announce that the latter is ready. Best wishes Jonathan |
Dear All, Just to let you know that, as agreed during the discussion of the new "Interpolation of Cell Boundaries" section (f3de508), I have added the following sentence to the "Interpolation Parameters" section:
Anders |
Dear @JonathanGregory, Appendix J is now ready for your review. The only remaining open issue is that we will do one more iteration on the section on Computational Precision for Chapter 8 - we will publish it here within the next few days. Best regards,
Dear @AndersMS et al. Thanks for the new version. Can you tell me where to find versions of Ch 8 and App J with the figures in place? That would make it easier to follow. I've just read the text of Ch 8, which I found much clearer than before. I don't recall reading about bounds last time. Is that new, or was I asleep? Best wishes Jonathan |
Dear @JonathanGregory, I am still a bit new to documents on GitHub, but these two links do the job in my browser: I got these links by going to #326, then selecting the Files changed tab, then scrolling down to ch08.adoc or appj.adoc and then selecting View File in the "..." pull-down menu on the right-hand side, opposite the file name. Hope this will work at your end. We had a section on boundary interpolation in the first version you read, but it was short and didn't do the job we would like it to do. For example, it did not guarantee to reconstitute contiguous bounds as contiguous bounds. The new section is our consolidated version, which does all that we wanted it to do. Best regards,
Great, thanks, @AndersMS. I am still learning about GitHub. I was using the Diff, which doesn't show the diagrams, rather than Viewing the file, which works fine. Jonathan |
Dear @AndersMS and colleagues Thanks again for the new version. I find it very clear and comprehensive. I have a few comments.

Chapter 8

"Tie point mapping attribute" mentions "target dimension", which is not a phrase used elsewhere. Should this be "interpolated dimension"?

You say, "For the purpose of bounds interpolation, a single bounds tie point is created for each coordinate tie point, and is selected as the vertex of the tie point cell that is the closest to the boundary of the interpolation subarea with respect to each interpolated dimension." I don't understand why there is a choice of bounds tie points, because there's no index variable for them. Doesn't the tie point index variable dictate the choice of tie points for bounds?

Appendix J

The title says Appendix A. Presumably that's something to do with automatic numbering.

All of the subsections listed at the start (Common Definitions and Notation, Common conversions and formulas, Interpolation Methods, Coordinate Compression Steps, Coordinate Uncompression Steps) should have subsection headings, I think. They will be Sections J.1 etc. At the moment the last two are labelled as Tables J.1 and J.2 rather than subsections, but they're never referenced as tables.

Fig 1ff. You say, "When an interpolation method is referred to as linear or quadratic, it means that the method is linear or quadratic in the indices of the interpolated dimensions." Linear also means that the coordinates of the interpolated points are evenly spaced, doesn't it; if so, that would be helpful to state.

You say, "In the case of two dimensional interpolation, the two variables are equivalently computed as ...". I would say "similarly", not "equivalently", which I would understand to mean that they give the same result.
Please put the "Common conversions and formulae" table before the interpolation methods, or at least refer to it. Otherwise the reader encounters [...] before it is defined.

Where is [...] defined?

A couple of times, you write, "For each of the interpolated dimension". There should be an -s.

Conformance

For "Each [...]" [...]

Regarding "The legal values for the [...]" [...]

Best wishes Jonathan
Dear all, @AndersMS and colleagues have proposed a large addition to Chapter 8 and an accompanying new appendix to the CF convention, defining methods for storing subsampled coordinate variables and the descriptions of the interpolation methods that should be used to reconstruct the entire (uncompressed) coordinate variables. I've reviewed this in detail and it makes sense and seems clear to me, as someone who's never used these methods. Those who wrote this proposal are the experts. Enough support has been expressed for this proposal to be adopted, after allowing the time prescribed by the rules for further comments, and there are no objections expressed. Therefore this proposal is on course for adoption in the next release of the CF convention as things stand. If anyone else who wasn't involved in preparing it has the time and interest to review it, that would no doubt be helpful, and now is the time to do that, in order not to delay its approval. It definitely requires careful reading and thinking, but it's logical and well-illustrated. Best wishes Jonathan
Dear @JonathanGregory, Thank you for your rich set of comments and suggestions. I have provided replies below, in the same format we used for the first set of comments. Several of the replies I have already implemented in the document, indicating the corresponding commit. For others, the reply is not conclusive, and if you find time, your feedback on it would be valuable. I would prefer that @davidhassell look at the comments on the conformance chapter when he is available again. Best regards,

Comment/Proposed Change 17
Reply to Comment/Proposed Change 17

Comment/Proposed Change 18

Reply to Comment/Proposed Change 18

Comment/Proposed Change 19

Reply to Comment/Proposed Change 19

Comment/Proposed Change 20

Reply to Comment/Proposed Change 20

Comment/Proposed Change 21

Reply to Comment/Proposed Change 21

Comment/Proposed Change 22

Reply to Comment/Proposed Change 22

Comment/Proposed Change 23
Reply to Comment/Proposed Change 23

Actually, the best of our current methods to generate evenly spaced coordinate points is the "quadratic_remote_sensing" method. It can utilize its quadratic terms to counteract the distorting effect of the latitude/longitude coordinates.

Commit(s) related to Comment/Proposed Change 23

Comment/Proposed Change 24
Reply to Comment/Proposed Change 24

Comment/Proposed Change 25

Reply to Comment/Proposed Change 25

Comment/Proposed Change 26

Reply to Comment/Proposed Change 26

Comment/Proposed Change 27
Reply to Comment/Proposed Change 27

You are right, it is like a projection plane, but we are using 3D Cartesian coordinates. The problem we are addressing is that interpolating directly in latitude/longitude is inadequate when we are close to the poles. So we temporarily convert the four tie points from lat/lon to xyz, do the interpolation and then convert the result back from xyz to lat/lon. Another common way to address this problem is to project the lat/lon point onto the xy plane, do the interpolation and project the point back to lat/lon. However, by using xyz, we can also solve the problem that arises when our interpolation subarea crosses +/-180 deg longitude.

Let me try to support the above with a simple example (hoping that I am not upsetting anybody with such a simple example...). Think of a hypothetical remote sensing instrument that scans the Earth in a way that can be approximated as arcs of a great circle on the Earth's surface. So, if the instrument scans from point A to point B, then the points it scanned between A and B will be on the great circle between A and B. It will follow this simple principle for any location on Earth.

If you are near the Equator and A = (0W, 0N) and B = (4W, 4N), then you can generate three points between A and B by interpolating in longitude and latitude separately and will get (1W, 1N), (2W, 2N) and (3W, 3N), which are approximately aligned with the great circle arc between A and B. If you are near the North Pole and A = (0W, 88N) and B = (180W, 88N) and do the interpolation in longitude and latitude separately, you will get (45W, 88N), (90W, 88N) and (135W, 88N), which lie on an arc of a small circle, which is the wrong result. By first converting to Cartesian coordinates, then interpolating and then converting back to longitude/latitude, you will get the correct result: (0W, 89N), (0W, 90N) and (180W, 89N), which are on a great circle. That was also why we suggested the name with [bi_]quadratic_remote_sensing.

Commit(s) related to Comment/Proposed Change 27
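As a rough numerical sketch of the example above (assuming plain linear interpolation of unit vectors in three-dimensional Cartesian space, an illustration rather than the actual Appendix J method), the following reproduces the near-pole result:

```python
import numpy as np

def ll_to_xyz(lat, lon):
    """Latitude/longitude in degrees -> unit vector in Cartesian coordinates."""
    lat, lon = np.radians(lat), np.radians(lon)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def xyz_to_ll(v):
    """Cartesian vector -> latitude/longitude in degrees (no normalisation needed)."""
    x, y, z = v
    return np.degrees(np.arctan2(z, np.hypot(x, y))), np.degrees(np.arctan2(y, x))

A = ll_to_xyz(88.0, 0.0)      # tie point A near the North Pole
B = ll_to_xyz(88.0, 180.0)    # tie point B on the opposite meridian

for s in (0.25, 0.5, 0.75):
    v = (1.0 - s) * A + s * B                  # interpolate in Cartesian space
    print(f"s={s}: lat/lon = {xyz_to_ll(v)}")  # ~ (89, 0), (90, 0), (89, 180)
```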
Comment/Proposed Change 28

Reply to Comment/Proposed Change 28

Commit(s) related to Comment/Proposed Change 28

Comment/Proposed Change 29

Reply to Comment/Proposed Change 29
Dear All, Here are the links to the easy-to-read versions including all the above changes: Anders |
Dear @JonathanGregory, Just to let you know that I just updated my reply to Reply to Comment/Proposed Change 23 above. Anders |
Dear @AndersMS Thanks for your detailed replies. I think there are only two outstanding points in those you have answered.

18: Now I understand what you mean, thanks. To make this clearer to myself, I would say something like this: Bounds interpolation uses the same tie point index variables and therefore the same tie point cells as coordinate interpolation. One of the vertices of each coordinate tie point cell is chosen as the bounds tie point for the cell. For 1D bounds, the vertex chosen is the one which is on the side closer to the boundary of the interpolation subarea. For 2D bounds, the vertex chosen is the one which is closest to the boundary of the interpolation subarea, considering all the interpolated coordinates together, or in other words, the one closest to the corner of the interpolation subarea. Are you restricting the consideration of 2D bounds to rectangular cells, or are polygons of n vertices allowed?

27: I think the key point is that you mean three-dimensional Cartesian interpolation. I didn't think of that. If you could clarify this, it would be fine. Cheers Jonathan
Dear @JonathanGregory, @AndersMS, and all,
Addressed in AndersMS@8b8c185 I have also added some conformance requirements and recommendations for bounds tie point variables: AndersMS@bdac108 Thanks, |
Dear @JonathanGregory et al., Due to the heroic contributions primarily of @AndersMS and @davidhassell, as well as the expert review of @oceandatalab and friends, we can present to you the now-finalised version of the pull request associated with this issue. To see all points listed and addressed one by one you can check #327 (comment); hopefully that is traceable. We have completed our proposal, finalising the section regarding computational precision - this is now found at the end of chapter 8.3. #326 contains the documents in their latest state, which I have also attached in compiled form for your perusal. Note that before finalisation of this version of the Conventions the following items will need to be addressed; these are, however, of a purely editorial nature, so in the interest of time we are not correcting them for the 3-week freeze:
A clever idea here would be to name e.g. the first figure in chapter 7 "Figure 7.1" so that the figures are always numbered correctly independently of previous chapters. I leave this to future minds to solve. I therefore thank all contributors again for the loads of precise and hard work, and motion that the 3 week period start for this proposal so that we are on time to get it adopted into CF-1.9. I look forward to hearing hopefully a resounding silence in response to the finalised proposal! |
Dear @AndersMS @davidhassell @erget @oceandatalab and collaborators Thanks for the enormous amount of hard and thorough work you have put into this, and for answering all my questions and comments. I have no more concerns. Looking through the rendered PDF of App J, I see boxes, probably indicating some character which Chrome can't print, in "Common Conversions and Formulas", after sin and cos. If anyone else would like to review and comment, they are welcome to do so. If no further concerns are raised, the proposal will be accepted on 24th August. Cheers Jonathan |
Dear @JonathanGregory , Regarding the interpolation of bounds, you asked:
We are restricting the interpolation of bounds to contiguous cell bounds. I think that the consequence of this is that we are restricting the consideration of 2D bounds to rectangular cells. Possibly @davidhassell can confirm. What we do support is interpolation of 1D, 2D, etc. bounds. Hence the sentence:
that applies for any number of interpolated dimensions. Cheers Anders |
Dear @JonathanGregory, Once again, thank you very much for your thorough review and valuable comments, which significantly improved the proposal. Cheers Anders |
Dear @JonathanGregory, We have just discussed the matter of the cell bounds interpolation and the question you raised. To make the conditions for bounds interpolation clearer, we have changed (b10fb67) the first part of the first paragraph in the section on bounds interpolation to:
We hope you are fine with that change. Best regards, |
Dear @AndersMS Thanks for the clarification. That's fine. The proposal will be approved next Tuesday 24th if no further concern is raised. [edited twice - I was accidentally reading the calendar for next month] Best wishes Jonathan |
@JonathanGregory @AndersMS et al., chanting is all done and the merge is complete. Thanks all for your many varied contributions - this was a lot of work on all sides and my hope is that it proves useful to both data producers and consumers moving forward! |
Congratulations and thanks to all who contributed to this successful piece of work. |
Title
Lossy Compression by Coordinate Sampling
Moderator
@JonathanGregory
Requirement Summary
The spatiotemporal, spectral, and thematic resolution of Earth science data is increasing rapidly. This presents a challenge for all types of Earth science data, whether derived from models, in-situ measurements, or remote sensing observations.
In particular, when coordinate information varies with time, the domain definition can be many times larger than the (potentially already very large) data which it describes. This is often the case for remote sensing products, such as swath measurements from a polar-orbiting satellite (e.g. slide 4 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
Such datasets are often prohibitively expensive to store, and so some form of compression is required. However, native compression, such as is available in the HDF5 library, does not generally provide enough of a saving, due to the nature of the values being compressed (e.g. few missing or repeated values).
An alternative form of compression-by-convention amounts to storing only a small subsample of the coordinate values, alongside an interpolation algorithm that describes how the subsample can be used to generate the original, unsampled set of coordinates. This form of compression has been shown to out-perform native compression by "orders of magnitude" (e.g. slide 6 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
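As a minimal, self-contained sketch of this idea (synthetic one-dimensional data and plain linear interpolation, not the CF encoding defined in the PR), storing every 16th coordinate value and reconstituting the rest might look like this:

```python
import numpy as np

n = 1025
idx = np.arange(n)
lat_full = 30.0 + 10.0 * np.sin(2.0 * np.pi * idx / (n - 1))  # hypothetical smooth coordinate

tie_idx = idx[::16]                  # tie point indices: 0, 16, ..., 1024
lat_tie = lat_full[tie_idx]          # the only coordinate values actually stored

lat_recon = np.interp(idx, tie_idx, lat_tie)   # reconstitution by linear interpolation

print(f"stored {lat_tie.size} of {n} values, "
      f"max reconstruction error = {np.abs(lat_recon - lat_full).max():.1e} degrees")
```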
Various implementations following this broad methodology are currently in use (see cf-convention/discuss#37 (comment) for examples); however, the steps that are required to reconstitute the full-resolution coordinates are not necessarily well defined within a dataset.
This proposal offers a standardized approach covering the complete end-to-end process, including a detailed description of the required steps. At the same time it is a framework where new methods can be added or existing methods can be extended.
Unlike compression by gathering, this form of compression is lossy due to rounding and approximation errors in the required interpolation calculations. However, the loss in accuracy is a function of the degree to which the coordinates are subsampled, and the choice of interpolation algorithm (of which there are configurable standardized and non-standardized options), and so may be determined by the data creator to be within acceptable limits. For example, in one application with cell sizes of approximately 750 metres by 750 metres, interpolation of a stored subsample comprising every 16th value in each dimension was able to recreate the original coordinate values to a mean accuracy of ~1 metre. (Details of this test are available.)
Whilst remote sensing applications are the motivating concern for this proposal, the approach presented has been designed to be fully general, and so can be applied to structured coordinates describing any domain, such as one describing model outputs.
Technical Proposal Summary
See PR #326 for details. In summary:
The approach and encoding is fully described in the new section 8.3 "Lossy Compression by Coordinate Sampling" to Chapter 8: Reduction of Dataset Size.
A new appendix J describes the standardized interpolation algorithms, and includes guidance for data creators.
Appendix A has been updated for a new data and domain variable attribute.
The conformance document has new checks for all of the new content.
The new "interpolation variable" has been included in the Terminology in Chapter 1.
The list of examples in toc-extra.adoc has been updated for the new examples in section 8.3.
Benefits
Anyone who has prohibitively large domain descriptions, and for whom absolute accuracy of cell locations is not an issue, may benefit.
Status Quo
The storage of large, structured domain descriptions is either prohibitively expensive or is handled in non-standardized ways.
Associated pull request
PR #326
Detailed Proposal
PR #326
Authors
This proposal has been put together by (in alphabetic order)
Aleksandar Jelenak
Anders Meier Soerensen
Daniel Lee
David Hassell
Lucile Gaultier
Sylvain Herlédan
Thomas Lavergne