-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow crs to be WKT or WKID #8
Comments
WKID is a concept that only works when you have the equivalent of a spatial_ref_sys column to point to. Often people make the confusion WKID = EPSG code, but that is not always true. |
Indeed, if we would allow IDs, it should certainly include the authority as well (either as pair or as string "authority:code"). @rouault compared to requiring a full WKT (in modern format), do you see major downsides in allowing authority/code pairs ? |
that's a convenient way of communicating a CRS when all parties have the database available. Orthogonal to that issue, there's the usual and recurring issue with coordinate order used in geometry encoding vs axis order of the CRS definition. EPSG:4326 says first axis is latitude, second is longitude... |
I think the best we could do to enforce axis order is warn on EPSG:4326 (with a note to please use OGC:CRS84, with apologies if I got that wrong). That would have to be done by whatever implements the read_ or write_ to Parquet/Feather/IPC if PROJ was available. I'd love to put it in a spec that CRS values will be interpreted as authority compliant but I have a feeling it would be ignored.
I am in favour of allowing AUTHORITY:CODE (e.g., EPSG:32620, OGC:CRS84) or WKT2 as the CRS field of the metadata. I like WKT2 for completeness; I like AUTHORITY:CODE because it's easier to test for equality and is more human-readable. Both can be put as is into |
I think we should indeed enforce axis order (and thus explicitly not follow the CRS axis order). In practice, with the current encoding of WKB, we already do that implicitly, since WKB specifies (x, y).
I think that's up to every implementation to decide whether they want to warn about it or not? (although we can of course have a recommendation about it) In practice, in GeoPandas, people mostly use EPSG:4326 and not OGC:CRS84 (I assume), also because this currently roundtrips fine to formats like shapefiles and geopackages. |
There is some related discussion on CRS handling at opengeospatial/geoparquet#3 |
I'm going to be too pedantic, but 'x' and 'y' doesn't mean a lot. If you look at the name of the axis of coordinate systems used by EPSG CRS, you'll see that X might be sometimes used as the abbrevation for a northing instead of being an easting, and when using that odd naming, the axis order itself might be different e.g EPSG:3787 is funny with the first axis being AXIS["easting (Y)",east] and the second one AXIS["northing (X)",north] (so it is actually "x", "y" in usual language)
To be compared to EPSG:3833 which is "northing (X)",north first, and "easting (Y)",east, so using an inverted order...
That said, no need to be more royalist than the king as one says in good French, and the language of the GeoPackage spec could be reused. See the note in the 2.1.3.1.1. BLOB Format paragraph of http://www.geopackage.org/spec130/ : "The axis order in WKB stored in a GeoPackage follows the de facto standard for axis order in WKB and is therefore always (x,y{,z}{,m}) where x is easting or longitude, y is northing or latitude, z is optional elevation, and m is optional measure. This ordering explicitly overrides the axis order as specified in the SRS metadata, applying Case 4 from OGC 08-038r7, Revision to Axis Order Policy and Recommendations[K11]. This was done to maintain consistency with previous implementations of WKB that predated the OGC policy." (I would note that the WKB specification is far from being explicit that x = easting/longitude, and the 'de facto standard' comes from usage of WKB, not the WKB spec itself, and the GeoPackage spec isn't super explicit about what to do for a few Namibian or Czech CRS that use westings and southings...) |
I like the idea of deferring to the GeoPackage wording. In the unlikely event that users are begging for the ability to specify an authority compliant CRS, that could be added as another field in the metadata (crs_authority_compliant = true). |
Everything could be packaged like the geometry format in geopackage in spec 2.1.3.1.1. BLOB Format. Where the Srid is in the header itself. It could remove the requirement of storing it in the geo metadata . This would allow users to store mixed coordinate systems. Just a thought. |
Warning: as I pointed in #8 (comment), a SRID is a pointer to an entry in a spatial_ref_sys table. SRID != EPSG code. This works for GeoPackage as such table exists. |
Yeah, this is why I leaned towards WKT2. I don't think we want to ship a full SRID database like geopackage does... Though I suppose we could consider it? Maybe there's a way to have a default that we don't need to ship around, but then let users specify their own?
Is there a use case for this? Seems like it'd add a lot more complexity to readers to have to check for differing coordinate systems. In general I think we should have just one way to refer to the CRS. I do also like borrowing geopackage wording, following an existing spec. |
@cholmes if you want to parse geojson, you can have mixed types. Feature classes, like shapefiles, only allow single geometries. The geopackage WKB with additional headers is very attractive because it would not require the use of meta data on the column. |
I think allowing well known Ids and WKT is a benefit. If you are using WGS1984, a very common well known coordinate system, why do I need to enter in: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]] I know the ID, 4326, off the top of my head. The WKT I had to look it up and use a library to get the syntax. |
That are mixed geometry types, not mixed coordinate reference systems (all the features in a geojson file share a single CRS). But currently I think we made a clear choice for a single CRS per file / batch of data.
Being able to use those authority/code pairs like EPSG:4326 is certainly a convenience as a user. But generally I assume that the WKT will be handled by the software, and requiring WKT for the file metadata doesn't necessarily require that you, as a user, need to use it (depending on the software you are using to read/write the parquet files). |
@jorisvandenbossche that is true that geopandas and other software will handle it, but the whole point is why do I need pyproj if I know my data is EPSG:4326? It adds another layer of complexity/dependency. I think adding CRS TYPE : WKID/WKT solves all issues. |
I've come around to mandating WKT2 as the If we mandate that, I think it gives us the responsibility of providing highly accessible code (i.e., zero dependency header-only C++, zero dependency R, zero-dependency Python) to generate and interpret that value. We do have to make sure that our "mandate" of WKT2 only isn't going to be completely ignored, and perhaps a "crs_type" field is a good compromise. I think we really just want to avoid a situation where people are discarding CRS information or providing uninterpretable CRS information. |
Everyone knows what A related question then: Are WKT2 strings guaranteed to stay the same for a given EPSG code, say for
If this is guaranteed to stay the same in the future, then @achapkowski 's critique could be answered by checking against that WKT string specifically, no? const WGS84_WKT = 'GEOGCRS["WGS 84", ...'
function isWGS84(wktString) {
// Maybe add some minimal string processing to remove newlines and spaces
return wktString === WGS84_WKT;
} Also, is there a concise reference for the differences in WKT vs WKT2 syntax, apart from the 132 page WKT2 spec? Proj4.js currently only supports WKT1, and I'm curious how much work it would take to add WKT2 support, but I don't know enough about what differs between the two specs. Edit: a presentation with some WKT2 vs WKT1 examples. |
no, that one will typically change over time as new members of the WGS84 datum ensemble are added in the EPSG dataset. The values of scope or area could also be amended, etc. PROJ has a proj_is_equivalent() function that can be used to test for strict or laxer equivalence. proj_identify() can also be used to identify an EPSG code from a CRS object instanciated from it (if it lacks an ID[] node typically) |
A better string based approach to identify EPSG:4326 would be to check that the string ends with (not just contains) |
In support of a That information can be communicated outside the Or maybe it's too early to consider that example (we could always add the |
The WKT in a Shapefile .prj is never (the non-deprecated syntax of) WKT2, but the ancient ESRI WKT, which is very close to/an implementation of the > 20 year old spec OGC 99-049 (https://portal.ogc.org/files/?artifact_id=829). Very few formats natively speak WKT2. Typically you'd want to use PROJ to morph ESRI WKT to WKT2. The term WKT2 can also be ambiguous. It could refer to the older spec of 2015 : http://docs.opengeospatial.org/is/12-063r5/12-063r5.html . See https://proj.org/development/reference/cpp/cpp_general.html#general_doc_1WKT2 for some historic of those successive revisions |
Can we just do something like this data class?
or as JSON:
Then let the reader figure out what each code means? I disagree that we need to bundle the table. It's on the user to know how to use that spatial reference. This way it handles all past and future cases of spatial reference. So when WKT35 is invented, you just place in the type with the associated emoji string. Basically this format keeps it simple and universal. |
Since the last discussion here, GeoParquet has moved to PROJJSON to represent CRS and can use a GeoArrow is different from GeoParquet in that it is an in-memory form and the arrays which it encodes may represent ephemeral or intermediary steps in a computation, and so in #22, I proposed the following:
Also happy to update that based on feedback! |
There is more than FOSS libraries and if you want it to be standard across the ecosystem, the specification needs to be flexible enough to be used for all GIS users. Having a {
crs : '<spatial reference information>' # string
crs_encoding : "ESRI:WKID" or "PYPROJ:WKT2" # string
} That way a user would knows that this spatial reference was created via Just thinking outloud, what is everyone's thoughts? |
I think a disconnect here (at least for me) is on how to define CRS in a way that Arrow readers don't need to have a separate contract with writers. One of the goals of the Arrow project is to enable zero-copy reading of data between systems in different languages and frameworks. For non-geospatial data, if a reader implements the minimal C Data Interface, then that reader has everything it needs to know to interpret that memory buffer. The question is how to achieve a similar outcome with geospatial Arrow memory: where a reader that implements the C Data Interface has full knowledge of what the table contains. This is where I disagree with your premise of let the "reader figure out what each code means", at least while also saying "I disagree that we need to bundle the table". For a geospatial library that implements GeoArrow memory, it could receive that memory from anywhere. It doesn't have a contract with the software or tool that produces that memory. But if a library receives GeoArrow memory with table metadata saying Above you mentioned
This is one reason why GeoParquet moved to PROJJSON. With PROJJSON this is easy to assert without having access to pyproj! It's just: proj_dict = json.loads(PROJJSON_STRING)
assert proj_dict['id']['authority'] == 'EPSG'
assert proj_dict['id']['code'] == '4326' Non-CRS-aware implementations can check for a specific CRS quickly by only needing to parse JSON, while CRS-aware implementations have the entire geodetic definition. |
My hesitation on requiring PROJJSON in the CRS field is that it then requires that every producer must be CRS aware. It would mean that you can't write a standalone GeoPackage reader that outputs an ArrowArrayStream, because GeoPackage stores CRS information as WKT2 (to my knowledge). For me that's the most compelling reason for allowing flexibility in the I think any producer that can serialize the CRS as PROJJSON should serialize the CRS as PROJJSON. As you mentioned, this enables Arrow-native kernels implementing things like geodesic length to exist without a lookup table. Worth noting that there are huge swaths of the geospatial world that don't have the capability to write PROJJSON (or even WKT2), though (from the last GeoParquet call it seemed like this was an issue in Java, and it sounds like it might be an issue in parts of the ESRI universe).
Many operations don't need to understand the CRS to do their thing, they just pass on that information in whatever form the received it. Subset, arrange, extract vertices, convex hull, just to name a few. I managed to wrap the whole GEOS C API in R without needing to interpret any CRS values (with the exception of the binary operators, which have to check equality, which is a rather thorny issue that PROJJSON would help with).
I think I'd prefer a known list in the spec (like... |
@achapkowski in the proposed encoding in #22 could you set the |
That would work for me.
This information would be stored in the Geometry column then? |
Yes, it would be metadata on the geometry column (and I will add some concrete examples regarding what the CRS would look like in the case of projjson, wkt2, and authority_code to #22). |
I'm not an authority on GeoPackage but it doesn't appear to be that simple. In the WKT reference in the spec, it says
So the latest version of the GeoPackage spec defines a new column to be used for WKT:
I think these refer to WKT2:2015 and WKT2:2019 respectively, but that means that some older GeoPackage files won't have that column and will only contain WKT1 strings. I don't know if there's a good solution here, but this is why I worry about interoperability. If the column metadata supports WKT1, WKT2:2015, WKT2:2019, PROJJSON, and authority code, then that seems like a huge API surface for implementors to cover. |
From #8 (comment):
From #8 (comment):
Given this fact, I don't think that "writing a standalone format reader" is a good enough criteria for how we should handle CRS information in geo-arrow. I.e., if a requirement was supporting a standalone reader from Shapefile, we'd also have to support the 20-year-old ESRI WKT. I think this plainly leads into "GeoArrow supports every CRS format that has existed in the last 20 years" which seems like a strictly bad outcome. |
That's the general nature of a factured world of Geo. If adopt one format then you exclude everyone else. Wide adoption should include all major players. |
I don't think GeoArrow has to support any CRS format...it's merely a vessel that can transport a CRS from one place to another. I think it's reasonable to give an opinion on what that format should be for runtimes and languages with access to the one library (can anything other than PROJ do this?) that can write and validate PROJJSON; I don't think it's reasonable to restrict GeoArrow's usage to those languages and runtimes. In addition to low-level readers (which realistically is pretty much just me thinking about writing them) I think the excluded environments would be (1) Java/Android, (2) JavaScript/Web, and (3) the ESRI universe (I don't have definitive proof of this, so feel free to correct me). I think all of those should be able to produce GeoArrow arrays and should be able to provide CRS information in a form most easily parsed by a consumer, just as those with access to PROJ should provide CRS information in a form most easily parsed by a consumer (i.e., PROJJSON).
I think it matters quite a bit what exactly you are implementing. Many operations just pass this value through. Those with access to PROJ will probably just call I don't know the answers to all of the questions here and perhaps their answers will invalidate some of these arguments. I still think the approach of "please give me the CRS information you have and tell me what you think the encoding is" will yield a more inclusive and less error-prone format than "don't use this format if you can't encode your CRS in this one relatively new way". |
In part of that universe, there's ArcGIS 9.2 and later which are known to embed GDAL. And PROJ >= 6 is a required dependency of GDAL >= 3. PROJ could be used to convert PROJJSON to ESRI WKT, if the later is what is required to make their Projection Engine (https://support.esri.com/en/technical-article/000002817) happy.
I've a one-file no-dependency Python lib converting (most common CRS formulations from) PROJJSON to WKT2:2019 or WKT1 in https://github.com/rouault/projjson_to_wkt. Could definitely be ported in a few hours in whatever language you want. |
Projjson is not supported in the Esriverse as far as I know. Here is a full list: https://developers.arcgis.com/rest/services-reference/enterprise/using-spatial-references.htm It's quite large. |
Awesome! This will help on the consuming end (I'd love something like it in geoarrow-cpp eventually).
I don't think this will be a problem on the consuming end because of tools like the one Even has written (and for a CRS with an authority/code identifier, it's trivial to extract given a JSON parser). PROJJSON by far the best CRS representation currently available and I plan to encode all CRSes in any tool that I write using it (in R or Python). None of those solve the producing end, where a producer from (say) ESRI already has the CRS in some format. If it sends it a computational tool that (say) computes a buffer, I think it's perfectly reasonable to provide a way to pass that CRS to the tool creating the buffer and back to the original caller without either having to re-encode the CRS into something different to comply with the specification. The best zero-dependency tool I can think of to help here would be a lookup table (the last time I did this it compressed to less than 1 MB for every auth/code combo in the PROJ db) or API endpoint (perhaps both exist already). Workarounds to this are much worse for interoperability than not forbidding a flexible CRS encoding: the (non-proj aware) producer can write the CRS to some other (non-extension type) metadata field (maybe With the |
Acknowledging that this is a draft specification and we're not sure exactly how everybody can/will use it, I'd propose that we follow the GeoParquet spec and mandate the Just as the GeoPackage standard mandates a CRS representation (or tried/tries to), so shall GeoArrow (until enough evidence is accumulated to suggest otherwise). As a draft specification, I think it's perhaps easier to be strict at first and relax the restriction later rather than the other way around. For producers and consumers with access to GDAL or PROJ there shouldn't be any problems. For producers without access to PROJ that want to include valid CRS information, it's rather trivial to create a lookup table between authority:code and PROJJSON (my last attempt was here: https://github.com/paleolimbot/geocrs/blob/master/data-raw/update-proj-db.R#L2-L48 ). As I understand it, WKID can be represented by For consumers without access to PROJ, all one needs is a JSON parser. Even's offer to rewrite a converter to WKT2/WKT1 in some other languages is generous, although it may be more direct to go from JSON directly to some data structure. As a workaround if you really need to pass on CRS information and don't want to re-encode it, I'd suggest using other (non-extension type) metadata fields for now. |
Typically WKID <= 32767 belong to EPSG namespace (the PROJ database doesn't duplicate them in the ESRI namespace when they are identical to EPSG), and WKID > 32767 to ESRI namespace
(for the sake of clarity: I meant this was left as an exercise for others, potentially taking inspiration on my Python POC 😁) |
Given it is only a first iteration, I am fine with for now being conservative and only allowing PROJJSON, since it is always easier to become more flexible in the future than the other way around. That said, I personally would also argue for being more permissive on this front, following the arguments that @paleolimbot made before. In contrast to GeoParquet, which as a file format we should standardize as much as possible / practical, the GeoArrow in memory encoding can be more flexible IMO, given that it has different use cases. Requiring PROJJSON means that any producer of arrow-encoded memory (custom file reader, Flight service, database connector, ...) needs to be able to construct the proper PROJJSON, while most of the time the source information that this producer has access to will not be PROJJSON (and thus needs access to something PROJ-like). |
Closed by #40 |
I mentioned this in a PR, but I want it listed here.
The metadata column information should allow for WKT and WKID crs values.
The text was updated successfully, but these errors were encountered: