Should we recommend EPSG:4326 or something else? #52
As GeoParquet will follow the WKB axis order, and therefore the axis order defined by any CRS will always be overridden, I agree with you that we should include something like what you are proposing. I wrote this similar text to include at the end of the crs section in the spec:
We can recommend EPSG:4326 for the sake of simplicity for most users (who have low accuracy requirements), but we definitely need to include a warning about the fact that it uses a datum ensemble and about the important accuracy problems related to that. I like the recommendations about how to use this ensemble CRS in one of the latest technical documents from IOGP, "EPSG null and copy transformations to WGS 84" (https://www.iogp.org/bookstore/product/epsg-null-and-copy-transformations-to-wgs-84/). Take a look at the section about using ensemble datums for high accuracy applications: We could include something like this but more in the short style of the QGIS warning (we can also point to the GDAL docs, where it's explained very well). |
I think that using What I think needs to be included is the ability to document the use of any CRS, so not like GeoJSON, which requires you to break the standard to use and document the use of a different CRS. The broader question of what is a good, global CRS that is accurate and stays accurate through time is really a whole new issue for a far broader set of projects! |
That is already the case, see the "crs" field in the column metadata: https://github.com/opengeospatial/geoparquet/blob/3a58590ce21ddefc1aff819534e13625a9fb969e/format-specs/geoparquet.md#column-metadata |
I agree @alexgleith , that's the point. People, in general, are not yet concerned with this important issue. The problem is that it's not easy to understand it from a practical point of view (how to use the epoch with my data and what CRS should I choose). The community is starting to include notes and recommendations in libraries but the road is still long. |
I would advise against recommending a CRS that defines the axis order in the opposite way to how the axes actually are, as |
I think it uses the same ensemble datum as EPSG:4326? (at least according to PROJ) $ projinfo "OGC:CRS84"
|
You are right. I was looking at the output of
for PROJ 8.2.0 and 9.0.0. Strangely enough, it does give the ensembles for |
Hi @edzer , for this reason, we are proposing a text clarifying it and including both CRS. See my comment here: #52 (comment) |
As long as the CRS is required, it seems like the recommended value might have the following impact
As long as CRS is mandatory, it doesn't strike me as a big interoperability hit to use the WKT for OGC:CRS84 over EPSG:4326 (for example). Requiring CRS essentially requires that clients (etc.) be capable of working with an arbitrary CRS. GeoJSON chose to go with a single CRS in hopes of increasing interoperability. I think it is fair to say that this worked (at the expense of flexibility). Perhaps a balance of interoperability and flexibility could be achieved by saying that the CRS is not mandatory, and if absent, the data is in OGC:CRS84 (or some other). The risk here I guess is that you pick a default that turns out to be wrong in a couple years (solved by issuing a new version of the spec). But I'm gathering that making CRS mandatory has already been decided. |
I wouldn't say that making CRS mandatory has already been decided. I was the one who proposed it, but I was really hoping we'd be able to say to people who don't care about CRS that they could just put an 'XXXX' string in there for long/lat and everything would be fine. Turns out I had lots more to learn about the current state of CRSs, and there's not an easy answer for what 'XXXX' should be. So I've recently been leaning towards what you're suggesting: that we say if there is no CRS then your data should be in long/lat. You can leave out the CRS, and we advise that leaving it out implies CRS=YYYY, and that any implementation that is CRS aware should just use that YYYY. This seems to be what GeoJSON did - I had looked at it to see what CRS definition they used, but they didn't - it just said it's long/lat. I'm leaning towards YYYY being the WKT2 of OGC:CRS84. We could make it EPSG:4326, with the caveat that we're overriding it to be long/lat, but that still just seems a bit weird to me - seems like the CRS should actually describe what is in there. It's then probably still an open question whether we include the coordinate axis order section that requires '(x, y) where x is easting or longitude and y is northing or latitude.' From my latest learnings I lean towards 'yes' - instead of forcing libraries to look up axis order in the CRS we instead say that they can expect it's x,y/easting,northing, and the crs information is used for projection but not for defining the order of things. That's probably worth its own issue. I can try to take a crack at a PR that does the above - may make it easier to look at concrete text changes instead of talking in the abstract. |
I am not fully sure how it helps to make this less explicit (I am personally fine with making crs an optional field, or a required field but with a value that can be null, but I would then not attach any meaning to "not defined", and have a strong recommendation to always define the CRS). If we would say that leaving it out implies YYYY, why then not simply have libraries specify YYYY in the metadata? Is it to avoid that people need to be able to write (and recognize) the WKT of YYYY?
Personally, I don't have a strong opinion about OGC:CRS84 vs EPSG:4326, but I also don't think it matters that much: they are both based on the same (ensemble) datum, and given that we currently override the coordinate axis order anyway, both are essentially equivalent for our purpose. I think it is also an option to not recommend a specific CRS, and only strongly recommend that you specify a CRS. Or have a more elaborate recommendation (that gives some general guidelines instead of recommending one specific CRS) like the example given above at #52 (comment) by @cayetanobv |
I think that for a user (or software writer) it is most useful if a file at least reveals whether coordinates are geographic or Cartesian, because it matters for computing distances, buffers, finding intersections and so on. If it is not there the software will have to make an assumption, and may make the wrong one. It might make sense to look at what GPKG does: when writing a GPKG without specifying the CRS, a default CRS corresponding to
is added (having the axis order we don't want). For that reason, when writing an object without CRS in R with
is being substituted, as the (historically grown) assumption is that data with missing CRS imply some Cartesian CRS. Assigning an "Undefined geographic SRS" when writing data that has no CRS specified has the advantages that (i) there is no doubt about coordinates being geographic or not when reading the data, (ii) software writers know they need to do something when writing data with Cartesian coordinates, and (iii) it doesn't assign/assume a potentially wrong datum. A simple boolean metadata flag, indicating coordinates are geographic or Cartesian would of course reach the same goals, and avoid the need to parse a CRS in WKT2 form. |
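To make the point about geographic vs Cartesian coordinates concrete, here is a minimal sketch of how a reader might branch on such a flag when computing a distance. The `is_geographic` parameter is a hypothetical stand-in for the boolean metadata flag suggested above (no such field exists in the spec), and the haversine formula on a sphere of mean Earth radius is only an approximation chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; a spherical approximation


def distance(p, q, is_geographic):
    """Distance between two (x, y) points.

    `is_geographic` stands in for the hypothetical boolean metadata flag:
    True  -> coordinates are (lon, lat) in degrees, use the haversine formula
    False -> coordinates are Cartesian, use the Euclidean distance
    """
    if not is_geographic:
        return math.hypot(q[0] - p[0], q[1] - p[1])
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


p, q = (0.0, 60.0), (1.0, 60.0)
planar = distance(p, q, is_geographic=False)   # 1.0, in coordinate units
geodesic = distance(p, q, is_geographic=True)  # roughly 55.6 km, in metres
```

The same pair of points gives answers in different units and of very different magnitude, which is why software that has to guess may guess wrong.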
Yes. The reason for this was to do as much as possible to eliminate the implementation burden of CRS on what we perceived to be as the dominant audience of consumers – web ones using GeoJSON as the wire format. Responsibility for consuming, interpreting, and reflecting all of the possible CRSs is a big lift for geospatial folks who care about it. For web people who really don't, they are apt to do their best to ignore what they can, and make up stuff when they can't. GeoJSON said, "ok, fine, we'll do the naive approach too", and in exchange for that, interoperability of the standard was very high because it could be implemented in your own software in an afternoon using copypasta from the specification itself. I don't think GeoParquet's audience is the same as GeoJSON's, and if the intended dominant use of GeoParquet is as a memory layout and serialization format instead of a wire format, CRS interoperability is really important. @edzer's point about the specification providing a responsibility gradation by denoting cartesian or geographic coordinates is a good one, and it probably meets the needs of many software implementations for their interpretation of the data when they are using it, but a full CRS definition is still needed to specifically define where/when the data are if GeoParquet is to sit as a blob for a decade. I kind of like GPKG's solution, but I don't have a recommendation for a default CRS. One thing I think is the specification should provide clear consensus on whether or not software implementations are responsible for interpretation and consumption of any possible CRS, and the definition of those CRSs be in only a single format. ASPRS LAS, for example, provides backward compatibility using both WKT and GeoTIFF keys, and because the expressibility of each is not equivalent, the software implementation gets to figure out what to do. If you pick WKTv2, make a statement about supporting all possible WKT future specifications, and how to do it. |
Thanks a lot for the input, Howard. One follow-up question: can you go in a bit more detail what the "GPKG solution" is? |
Does https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#edges handle this? Or is this something slightly different? |
That is slightly different, because we explicitly included this field since there is not necessarily a 1:1 mapping between geographic vs cartesian and spherical edges or not. There exists lots of data using geographic coordinates but that are not valid (or fixed) when being interpreted with spherical edges (one good example is GeoJSON data, which explicitly mentions it uses straight edges) |
It's related, but slightly different: if you'd convert a GeoJSON to a geoparquet file, it would have geographical coordinates but (according to the GeoJSON specs) assume edges to be planar (straight in a flat, 2D, Cartesian space). |
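As a rough illustration of why geographic coordinates and spherical edges are separate questions, the sketch below compares the midpoint of one edge under both interpretations. The function names are made up for this example, and the spherical case uses the standard great-circle midpoint formula on a perfect sphere, which is an assumption rather than anything the spec prescribes.

```python
import math


def planar_midpoint(p, q):
    """Midpoint of a straight edge in lon/lat space (GeoJSON-style planar edge)."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def spherical_midpoint(p, q):
    """Midpoint of the great-circle arc between two (lon, lat) points in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    dlon = lon2 - lon1
    bx = math.cos(lat2) * math.cos(dlon)
    by = math.cos(lat2) * math.sin(dlon)
    lat_m = math.atan2(math.sin(lat1) + math.sin(lat2),
                       math.hypot(math.cos(lat1) + bx, by))
    lon_m = lon1 + math.atan2(by, math.cos(lat1) + bx)
    return (math.degrees(lon_m), math.degrees(lat_m))


a, b = (0.0, 60.0), (90.0, 60.0)
planar = planar_midpoint(a, b)        # (45.0, 60.0)
spherical = spherical_midpoint(a, b)  # same longitude, latitude near 67.8
```

Both midpoints are expressed in the same geographic coordinates, yet they are several degrees of latitude apart, so a polygon that is valid with planar edges need not be valid with spherical ones.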
I think there is a significant difference between saying CRS is optional (if not present, the CRS is YYYY) and CRS is mandatory (and you are strongly encouraged to use YYYY). If the CRS field is optional (and the spec describes the default), CRS-naive applications can happily work with data that doesn't have a CRS field. They can parse it, plot it, transform it, etc. If the CRS field is required, all applications must at least be able to answer the question This may not be as trivial as it sounds. For example, here are two equivalent CRS represented with WKT2_2019:
and
Keywords are case-insensitive. Space outside of quoted strings is insignificant. Delimiters can be If, on the other hand, CRS is not mandatory, consumers who encounter data without a CRS field can work with it (knowing what CRS it is in because the spec describes it). |
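A small sketch of why plain string comparison is not enough, covering only the two rules stated above (case-insensitive keywords, insignificant whitespace outside quoted strings). `normalize_wkt` is a hypothetical helper, not part of any library, and the WKT fragments are heavily shortened placeholders rather than complete CRS definitions. Textual normalization like this still says nothing about alternative delimiters, number formatting, or optional elements, so it is not a test of semantic equivalence.

```python
def normalize_wkt(wkt):
    """Uppercase keywords and drop whitespace outside double-quoted strings.

    Quoted strings are copied verbatim. In WKT a doubled quote ("") inside a
    string is an escaped quote; toggling on every quote handles that case too,
    because the two quotes toggle the state off and straight back on.
    """
    out = []
    in_quotes = False
    for ch in wkt:
        if ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes:
            out.append(ch)
        elif not ch.isspace():
            out.append(ch.upper())
    return "".join(out)


# Hypothetical, heavily shortened fragments; not complete CRS definitions.
a = 'GEOGCRS["WGS 84", ID["EPSG", 4326]]'
b = 'geogcrs [ "WGS 84",\n    id["EPSG",4326] ]'
same_text = normalize_wkt(a) == normalize_wkt(b)  # True
raw_equal = a == b                                # False
```

Even this toy version needs a small state machine, which hints at how much work a real answer to the question would take.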
Thanks a lot everybody for all the comments. One thing that is clear is that this is not obvious, given that we're having multiple discussions about it. Reading all the comments, I think the approach that best satisfies all of our concerns is making CRS optional and setting the default to OGC:84 when it's not specified. I'm going to open a draft PR with this today, so we can see if it's clearer that way. |
The complete name (including authority) of the crs is OGC:CRS84. I've included comments in PR #60 |
I'm +1 on CRS optional, assuming Long/Lat as the default, and pointing CRS aware readers at the OGC:84 definition for the default. |
PR merged. I think we can close this one |
As I mentioned above, I don't have a strong opinion on the change from EPSG:4326 to OGC:CRS84 as the default / recommended CRS. But, for background, I wanted to share how this will probably work in practice in GeoPandas. There is one other case: when the GeoDataFrame has no CRS information. But also in that case I don't think GeoPandas should write OGC:CRS84 as the crs, because we have no basis to assume that this is correct for that data (it can be anything, because GeoPandas supports coordinates "without crs" specified). In summary, from GeoPandas point of view, it could be useful to be able to specify that the CRS is "unknown" (in the original geopandas version of this spec, we allowed a |
Personally, I'm in favor of Given arbitrary input data that is missing a CRS, we cannot claim to know what CRS it should be (without other info or guessing) nor should we imply what it is via defaulting to It seems unnecessarily burdensome on writers - and also not particularly helpful to readers, that would then need to parse WKT2 - to instead translate an unknown CRS to WKT (e.g., If you are a toolkit that knows you are working entirely with But if you are a toolkit where the CRS of the inputs matters, there simply is no avoiding the need to parse WKT - unless you control all your inputs, and thus choose not to care. Given that GeoParquet doesn't have a single mandatory CRS (and shouldn't!), I don't see another way. It's a known issue that WKT is not friendly to parse, but that is not for the spec to solve via implied defaults; that feels dangerous for reasons stated in comments above. If adding |
The purpose of having a default if This is not in conflict with making it possible for a provider to say that the CRS is unknown. Though I'm not sure if it will be difficult for consumers to distinguish between |
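One way a consumer could tell these cases apart, sketched under the assumption that "unknown" is written as an explicit JSON `null` while the default applies only when the key is absent. That encoding, the `resolve_crs` helper, and the example metadata are illustrative assumptions; `OGC:CRS84` is the default proposed in this thread.

```python
import json

DEFAULT_CRS = "OGC:CRS84"  # the default proposed in this thread
_MISSING = object()        # sentinel: distinguishes "no key" from "key is null"


def resolve_crs(column_meta):
    """Return (state, crs) for one geometry column's metadata dict.

    state is "default"  when the "crs" key is absent,
             "unknown"  when it is present but null (an assumed encoding),
             "explicit" when the writer supplied a value.
    """
    value = column_meta.get("crs", _MISSING)
    if value is _MISSING:
        return ("default", DEFAULT_CRS)
    if value is None:
        return ("unknown", None)
    return ("explicit", value)


# Illustrative column metadata, not taken from a real file.
absent = resolve_crs(json.loads('{"encoding": "WKB"}'))
unknown = resolve_crs(json.loads('{"encoding": "WKB", "crs": null}'))
explicit = resolve_crs(json.loads('{"encoding": "WKB", "crs": "EPSG:4326"}'))
```

The sentinel matters because `dict.get("crs")` alone returns `None` both for a missing key and for an explicit null, which would collapse the two cases.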
I understand the goal, but I think this only works where there is also only one supported CRS for GeoParquet; you can ignore CRS because there can be no variation in CRS. But if the common case in practice is that there will be a wide variation in CRS provided via WKT, including those writers that choose to be explicit (based on what they knew about the data) by writing some variant of If we are encouraging writers to always include CRS as WKT - which we should - then the cases where the implied default of How is I think the convenience case for readers is only where:
Otherwise, in an ecosystem of mixed CRS, where you consume what you do not produce, you have to deal with complexities of CRS, right? |
Yes, this would make the world a nicer place. |
It takes away the ambiguity for missing CRS about whether coordinates are geodetic or Cartesian. GPKG left the datum unspecified, but removing that ambiguity is good IMO. |
If it's EPSG:4326 Do you write that out as long/lat? Or lat/long? Like in the WKB? It seems to me that if you are writing it out as long/lat then the ideal would be to just leave off the CRS and use the default in geoparquet. I do think our goal should be to have as much data as possible in the default long/lat. And I agree it feels like a step back if most writers just take EPSG:4326 data and explicitly write out that CRS. Like could we make the recommendation to 'please write CRS except if it is |
I'll preface this by saying I think some of the challenges (and tension) we're having here is between trying to standardize the metadata that goes along with the data and transport of the data (i.e., structure w/in parquet file) versus standardizing the geospatial data being transported. The spec emerged from the former, and while the latter is noble, it's also a tough sell given the wide variability of geo data that can be transported via WKB (for now; arrow spec later) and variability of data we're trying to shove into this format. Thus concerns about gatekeeping what is allowed into the format (i.e., data representations that are not valid according to spec) and the implementation or performance impact of standardizing the data and not just the transport.
This seems counterproductive to good metadata, right? It seems like the best practice is for writers to document what they know, but document it in a well-defined manner. Thus it seems reasonable that the spec would prescribe how you document CRS information. Further, if future versions of the spec relax that default, or want to switch to a different default, now readers and writers have to do more strict version checking in order to do the right thing. Re: ambiguity of geodetic vs cartesian. In the case outlined above where a dataset simply has no CRS defined, a toolkit still cannot automatically with 100% confidence determine which of those two it is, which means that emitting a WKT with unknown cartesian or unknown geodetic is still potentially not correct. Which then means a 3rd unknown variant, or that the writer simply should prevent writing a dataset with an unknown CRS. Such gatekeeping maybe is OK if the strict goal is for interoperability, but it seems possibly counterproductive for internal use where CRS is irrelevant. From the reader perspective, readers are free to reject outright a dataset where Another consideration is round-tripping the data through GeoParquet. If we make the assumption that the writer can produce WKT that accurately matches the data, then it is reasonable to assume that we can write to GeoParquet, read from GeoParquet, and assert that the CRS is the same. If a dataset has no CRS set, and we backfill that with WKT that states the CRS is unknown (geopackage example), when we read that we now have a non-empty CRS and can no longer assert (without additional processing and / or assumptions) that the CRS that was read matches the CRS that was written. Instead, if we allow unset (= Likewise, if the CRS can be taken to be close enough to
This seems like it would be a good recommendation rather than element of the spec. E.g., for better interoperability, we recommend that your coordinates are in
But in this case, if the writer knows with 100% accuracy the CRS, it is very easy to simply hardcode the CRS into the writer, right? And likewise if they read then write data in exactly that CRS, they can simply copy the CRS from input to output. Then it is written directly within the metadata alongside the data following an established format (WKT). I'm not seeing a lot of burden here on the writer to write with an effectively hard-coded or copied value, but I could be wrong. I think if we're trying to make the argument of convenience to readers that they can ignore the I think this would be a different situation if we were inventing a format where data are represented in only one coordinate representation, and we have to decide how to document what that CRS is (i.e., the GeoJSON problem). Instead, we have a format that supports and documents various CRS's, so you have to be able to work with CRS if you interoperate with data outside your control. All that said, my suggestion is that the default for |
We probably should try to have this discussion in a synchronous meeting at some point, to really get into it. But getting all these points of view written down is great, appreciate the time we're taking to try to sort this out to hopefully get to a solution that balances all the various concerns. I don't have the time to respond in full, but one point:
So I think I'd agree with you that this is counterproductive to good metadata, except that in most cases adding the CRS for EPSG:4326 is actually bad metadata, since most systems override that it says lat, long and use it for data that is long, lat. So the metadata (EPSG:4326) indicates that it's latitude, longitude, but is actually inaccurate in many/most cases, relying on some often used common knowledge - so it's not actually 'good' metadata. I'm definitely not sure that the right answer for the The important thing to me is that long, lat is the recommended 'default' for people who don't know anything about CRS's, and that we encourage data to be written as that for maximum interoperability. But that we also do enable those who prefer other projections to use geoparquet. The tricky part does seem to be striking the right balance with defaults / recommendations / etc. relative to the mess that is CRS's and axis order for longitude & latitude. And I'd prefer we attempt to take one step to 'help' that situation and make it work both for readers and writers that are CRS aware and also make it easy for those that are not CRS aware. Like not forcing non-CRS aware readers to try to look for every possible WKT2 string that might mean long, lat to know that they can read it. Isn't that what we'd need to do if we don't have a default CRS? Provide a list of all the WKT2 strings that could potentially correspond to long/lat? I'm definitely open to other answers, but I'd say that's my main concern: how we help non-crs-aware readers to be able to read all the geoparquet data that is in long/lat. |
I don't think we are saying to them that the data is in one and only one well-defined CRS. I think we're saying to them 'the data is stored in longitude, latitude. Readers who are CRS aware can use But I agree to make that work well we need writers on board to try to identify when they are aware that they are writing out long/lat that they don't include the CRS. |
I agree that a synchronous meeting would be helpful, and I hope that my critical comments here are helpful to advancing this effort and not just being a wet blanket. I think the specification extension (or whatever it should be called) idea I outlined in #89 would address your concerns around giving non CRS-aware readers the ability to safely opt-out of parsing CRS, without getting into awkward territory around how the I agree that |
Something that contributes to making this a difficult discussion (apart from the inherent difficulty of coordinate reference systems ;)), is that there are different, but intertwined questions:
Partly, I would have preferred to try to keep those as separate discussions to keep it structured. But they are also clearly affecting each other, because as far as I understand from the discussions above, one of the reasons for making CRS optional (with a default value) is because WKT is inconvenient and requires a parser to answer even the basic question of "are these geographic lon/lat coords". So that makes me wonder, while we are discussing again the optional-ness of the Depending on how we improve this CRS information, that might actually go in the direction of the extension idea @brendan-ward is proposing in the comment above (#89) Some possible ideas:
EDIT: while I was writing this, it seems @kylebarron opened a very related discussion about using PROJJSON ;) -> Thoughts on PROJJSON for CRS encoding? #90 |
Sidenote, I would like to disagree with the following:
Yes, it is unfortunate that there is a need for the whole "authority compliant axis order" vs "traditional GIS order" concept. But I think you either say "we follow the axis order as specified by the CRS" (and then actually follow that) or "we explicitly overrule this and always use the same axis order, regardless of the CRS" (in which case the axis order of the CRS doesn't matter anymore). geoparquet/format-specs/geoparquet.md Lines 126 to 128 in a3c1212
At that moment, I don't think it is "bad" to use a CRS that has a different axis order, since we explicitly specified that this is OK and we ignore it. For EPSG:4326 there is an alternative, i.e. OGC:CRS84 which is exactly equivalent except for the axis order. But there are many other geographic CRS options that define lat/lon axis order (eg NAD83, one of the specific realizations of the WGS84 ensemble, etc), for which there is not such an equivalent alternative CRS. |
Giving my GDAL perspective,
|
In my mind, the argument for having CRS default to OGC:CRS84 was succinctly stated by @tschaub above:
On an issue in geo-arrow-spec, @rouault pointed out
Given the issues with WKT, a function to assert Would switching to PROJJSON nullify these concerns? Testing with Pyproj, the
In [1]: from pyproj import CRS
In [2]: CRS('epsg:4326').to_json_dict()['id']
Out[2]: {'authority': 'EPSG', 'code': 4326}
In [3]: CRS('crs84').to_json_dict()['id']
Out[3]: {'authority': 'OGC', 'code': 'CRS84'}
For my own needs on the web, if I can check the CRS of the data as easily as suggested above only via JSON, I would not be opposed to having the spec always require CRS metadata (or having |
that would be simpler indeed. Note that potentially, you can have several id for an object, and then the "ids" member is used to have an array of id. But that's mostly a theoretical concern as this isn't much used in practice. |
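The check discussed here could look roughly like the sketch below, which needs no CRS library, only the parsed JSON. The helper names and the `RECOGNISED` set are hypothetical; the dictionaries mirror the `id` values shown in the pyproj session above, and both the single `id` member and the `ids` array are handled. A CRS that carries no identifier at all is simply not recognised.

```python
# Identifiers this sketch treats as "OGC:CRS84"; the set is an assumption.
RECOGNISED = {("OGC", "CRS84")}


def crs_ids(projjson):
    """Collect (authority, code) pairs from the optional "id" / "ids" members."""
    found = []
    if "id" in projjson:
        found.append(projjson["id"])
    found.extend(projjson.get("ids", []))
    # Codes may be integers (4326) or strings ("CRS84"); compare as strings.
    return {(i["authority"], str(i["code"])) for i in found}


def is_crs84(projjson):
    """True if any identifier on the PROJJSON object matches OGC:CRS84."""
    return bool(crs_ids(projjson) & RECOGNISED)


crs84 = {"id": {"authority": "OGC", "code": "CRS84"}}
epsg4326 = {"id": {"authority": "EPSG", "code": 4326}}
multi = {"ids": [{"authority": "EPSG", "code": 4326},
                 {"authority": "OGC", "code": "CRS84"}]}
no_id = {"name": "some custom CRS"}

results = [is_crs84(c) for c in (crs84, epsg4326, multi, no_id)]
```

Here `results` is `[True, False, True, False]`. Note that this only compares identifiers; it does not prove that two definitions with different identifiers describe different coordinates.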
Just getting around to rewriting some example files to catch up with the latest version of this spec, and I'm still not sure what the best way is to handle the case where the CRS is unspecified by the user (in R this happens frequently for reasons that are sometimes good and sometimes bad). I've read (I'm also +10000 to PROJJSON being in the CRS field!) |
I fully understand that concern. But so I think that there are other ways to address this issue than omitting the CRS. For example, we could add another (optional) field that indicates whether data are WGS84 lon/lat (see some ideas in my comment above at #52 (comment)). Or by switching to PROJJSON as you proposed! Based on the latest discussions, I think there are several ideas floating around:
Others? I think the 3rd item to allow the CRS to be explicitly "unknown" might be relatively uncontroversial? (it's backwards compatible). In that case I could already open a PR for that item. I would go for |
+1 for |
Is the top-level The identifier keyword ( I think JSON or more structured data representing the CRS sounds great. Just wanting to get clarification on when a consumer can rely on |
no, it is optional, as in WKT. Typically a custom CRS will likely lack a top-level id. |
General notice: we are having a synchronous meeting tomorrow (Tuesday) at 15:00 UTC (8am Pacific, 5pm central Europe). Everybody is welcome, so if you would like to join, let me know and I will send you an invite with the meeting details. |
Sorry for not responding for a while - got too busy with other stuff. Will attempt to weigh in on a few things in this comment.
A big +1, and I do think we should try to structure the synchronous conversation around this tomorrow. And I agree they all are intertwined, so the synchronous conversation should help.
I think projjson does solve some of the original concerns that I had, so am definitely psyched to explore this. Big question to me is how much support there is in non-GDAL geo toolchains, and if support isn't great then how hard is it to implement? One very concrete example would be ESRI - it'd be a big win for them to support geoparquet, but they likely don't support projjson yet, and I don't think they use proj under the hood.
This one doesn't seem like it'd help the case where someone has data as long,lat but calls it 4326, and the writer just writes that out? Like in this case the writer wouldn't write that the crs_name is OGC:CRS84, right? But I suppose this helps if we don't adopt projjson, as it means that wkt parsing isn't required for everything.
I like this one. Those who do CRS can do their thing, but naive readers who don't want to worry about that can count on it being lon/lat. I think the one small downside is it introduces corner cases where things don't agree with one another - like if someone used web mercator CRS / coordinates and didn't know what they were doing and just set this to 'true'. That's clearly a violation, but what do we do if they set crs to 'null' (which I'm +1 on)? I think this can all be solved by good validation tools. But an original reason to just do one CRS field was so that things couldn't get out of sync - there's just one source of truth. |
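The kind of validation mentioned here might be sketched as follows. Everything about it is an assumption for illustration: the `is_lonlat` key is a hypothetical name for the proposed flag, `crs` is assumed to hold a PROJJSON object (whose top-level `type` member is, for example, `GeographicCRS` or `ProjectedCRS`), and treating a null CRS as "cannot verify" is just one possible policy, since the thread leaves that question open.

```python
def check_lonlat_flag(column_meta):
    """Return a list of human-readable findings; empty means nothing to report.

    "is_lonlat" is a hypothetical flag name, and "crs" is assumed to be a
    PROJJSON object, null, or absent.
    """
    findings = []
    if not column_meta.get("is_lonlat", False):
        return findings  # flag not set: nothing to cross-check
    if "crs" not in column_meta:
        return findings  # no CRS given: the flag is the only statement made
    crs = column_meta["crs"]
    if crs is None:
        findings.append("is_lonlat is true but crs is null; cannot verify")
    elif crs.get("type") == "ProjectedCRS":
        findings.append("is_lonlat is true but crs is a projected CRS")
    return findings


# Illustrative metadata; names are shortened and not full PROJJSON.
web_mercator = {"is_lonlat": True,
                "crs": {"type": "ProjectedCRS", "name": "WGS 84 / Pseudo-Mercator"}}
geographic = {"is_lonlat": True,
              "crs": {"type": "GeographicCRS", "name": "WGS 84"}}
null_crs = {"is_lonlat": True, "crs": None}

mercator_findings = check_lonlat_flag(web_mercator)
geographic_findings = check_lonlat_flag(geographic)
null_findings = check_lonlat_flag(null_crs)
```

A check like this catches the Web Mercator mistake described above, but it also shows the cost of two fields: the validator has to encode a policy for every combination that a single source of truth would rule out by construction.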
If we allow a non-WKT string value for |
Regarding PROJJSON, +1. This is not a new thing at all; it's been here since 2019 (https://github.com/OSGeo/PROJ/releases/tag/6.2.0), so compatibility is high because most open-source software (and an important part of non-open-source software) is using PROJ and can handle it. And indeed it makes developers' lives easier. But @cholmes's concern about other important software not supporting this is something to take into account. Also +1 for |
Closing this, as we shipped 1.0 and made this decision a while ago. |
This was discussed extensively in #25, but it feels worth revisiting. I think everyone feels good that the core recommendation is to use longitude, latitude in the WKB as the interoperability recommendation. The main question here is how we 'describe' that - use 4326 but then rely on the 'override' in our spec to put longitude first, or use something like OGC:84 that is less popular but actually describes things right.
Other points that were originally in #35: