Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow crs to be WKT or WKID #8

Closed
achapkowski opened this issue Mar 17, 2021 · 40 comments
Closed

Allow crs to be WKT or WKID #8

achapkowski opened this issue Mar 17, 2021 · 40 comments

Comments

@achapkowski
Copy link

I mentioned this in a PR, but I want it listed here.

The metadata column information should allow for WKT and WKID crs values.

@rouault
Copy link

rouault commented Feb 1, 2022

WKID crs values.

WKID is a concept that only works when you have the equivalent of a spatial_ref_sys column to point to. Often people make the confusion WKID = EPSG code, but that is not always true.
A (authority_name, code ) pair would be better. eg (EPSG,4326), (IGNF,RGF93), (ESRI,53014)

@jorisvandenbossche
Copy link
Contributor

Indeed, if we would allow IDs, it should certainly include the authority as well (either as pair or as string "authority:code").

@rouault compared to requiring a full WKT (in modern format), do you see major downsides in allowing authority/code pairs ?

@rouault
Copy link

rouault commented Feb 8, 2022

do you see major downsides in allowing authority/code pairs ?

that's a convenient way of communicating a CRS when all parties have the database available.

Orthogonal to that issue, there's the usual and recurring issue with coordinate order used in geometry encoding vs axis order of the CRS definition. EPSG:4326 says first axis is latitude, second is longitude...

@paleolimbot
Copy link
Contributor

Orthogonal to that issue, there's the usual and recurring issue with coordinate order used in geometry encoding vs axis order of the CRS definition. EPSG:4326 says first axis is latitude, second is longitude...

I think the best we could do to enforce axis order is warn on EPSG:4326 (with a note to please use OGC:CRS84, with apologies if I got that wrong). That would have to be done by whatever implements the read_ or write_ to Parquet/Feather/IPC if PROJ was available. I'd love to put it in a spec that CRS values will be interpreted as authority compliant but I have a feeling it would be ignored.

The metadata column information should allow for WKT and WKID crs values.

I am in favour of allowing AUTHORITY:CODE (e.g., EPSG:32620, OGC:CRS84) or WKT2 as the CRS field of the metadata. I like WKT2 for completeness; I like AUTHORITY:CODE because it's easier to test for equality and is more human-readable. Both can be put as is into proj_create(), which I think is also important. The R prototype currently writes WKT2 by default, and the most useful part of using AUTHORITY:CODE is when writing tests.

@jorisvandenbossche
Copy link
Contributor

I think we should indeed enforce axis order (and thus explicitly not follow the CRS axis order). In practice, with the current encoding of WKB, we already do that implicitly, since WKB specifies (x, y).
With the geo-arrow encoding (#12), we could in principle be flexible as long as it is clear from the field name (eg "xy" vs "yx"), but I would rather not go there. I think that ideally the data should be correctly interpretable even if you loose the "metadata" such as the field names.

warn on EPSG:4326 (with a note to please use OGC:CRS84, with apologies if I got that wrong). That would have to be done by whatever implements the read_ or write_ to Parquet/Feather/IPC if PROJ was available.

I think that's up to every implementation to decide whether they want to warn about it or not? (although we can of course have a recommendation about it)

In practice, in GeoPandas, people mostly use EPSG:4326 and not OGC:CRS84 (I assume), also because this currently roundtrips fine to formats like shapefiles and geopackages.

@jorisvandenbossche
Copy link
Contributor

There is some related discussion on CRS handling at opengeospatial/geoparquet#3

@rouault
Copy link

rouault commented Feb 18, 2022

since WKB specifies (x, y).

I'm going to be too pedantic, but 'x' and 'y' doesn't mean a lot. If you look at the name of the axis of coordinate systems used by EPSG CRS, you'll see that X might be sometimes used as the abbrevation for a northing instead of being an easting, and when using that odd naming, the axis order itself might be different

e.g EPSG:3787 is funny with the first axis being AXIS["easting (Y)",east] and the second one AXIS["northing (X)",north] (so it is actually "x", "y" in usual language)

PROJCRS["MGI / Slovene National Grid",
    BASEGEOGCRS["MGI",
        DATUM["Militar-Geographische Institut",
            ELLIPSOID["Bessel 1841",6377397.155,299.1528128,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4312]],
    CONVERSION["Slovene National Grid",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",15,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9999,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",-5000000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (Y)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (X)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Slovenia - onshore and offshore."],
        BBOX[45.42,13.38,46.88,16.61]],
    ID["EPSG",3787]]

To be compared to EPSG:3833 which is "northing (X)",north first, and "easting (Y)",east, so using an inverted order...

PROJCRS["Pulkovo 1942(58) / Gauss-Kruger zone 2",
    BASEGEOGCRS["Pulkovo 1942(58)",
        DATUM["Pulkovo 1942(58)",
            ELLIPSOID["Krassowsky 1940",6378245,298.3,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4179]],
    CONVERSION["6-degree Gauss-Kruger zone 2",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",9,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",2500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (X)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (Y)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Military survey."],
        AREA["Germany - states of former East Germany - west of 12°E."],
        BBOX[50.2,9.92,54.23,12]],
    ID["EPSG",3833]]

That said, no need to be more royalist than the king as one says in good French, and the language of the GeoPackage spec could be reused. See the note in the 2.1.3.1.1. BLOB Format paragraph of http://www.geopackage.org/spec130/ : "The axis order in WKB stored in a GeoPackage follows the de facto standard for axis order in WKB and is therefore always (x,y{,z}{,m}) where x is easting or longitude, y is northing or latitude, z is optional elevation, and m is optional measure. This ordering explicitly overrides the axis order as specified in the SRS metadata, applying Case 4 from OGC 08-038r7, Revision to Axis Order Policy and Recommendations[K11]. This was done to maintain consistency with previous implementations of WKB that predated the OGC policy." (I would note that the WKB specification is far from being explicit that x = easting/longitude, and the 'de facto standard' comes from usage of WKB, not the WKB spec itself, and the GeoPackage spec isn't super explicit about what to do for a few Namibian or Czech CRS that use westings and southings...)

@paleolimbot
Copy link
Contributor

I like the idea of deferring to the GeoPackage wording. In the unlikely event that users are begging for the ability to specify an authority compliant CRS, that could be added as another field in the metadata (crs_authority_compliant = true).

@achapkowski
Copy link
Author

Everything could be packaged like the geometry format in geopackage in spec 2.1.3.1.1. BLOB Format. Where the Srid is in the header itself. It could remove the requirement of storing it in the geo metadata . This would allow users to store mixed coordinate systems. Just a thought.

@rouault
Copy link

rouault commented Feb 19, 2022

here the Srid is in the header itself.

Warning: as I pointed in #8 (comment), a SRID is a pointer to an entry in a spatial_ref_sys table. SRID != EPSG code. This works for GeoPackage as such table exists.

@cholmes
Copy link

cholmes commented Feb 20, 2022

Warning: as I pointed in #8 (comment), a SRID is a pointer to an entry in a spatial_ref_sys table. SRID != EPSG code. This works for GeoPackage as such table exists.

Yeah, this is why I leaned towards WKT2. I don't think we want to ship a full SRID database like geopackage does... Though I suppose we could consider it? Maybe there's a way to have a default that we don't need to ship around, but then let users specify their own?

This would allow users to store mixed coordinate systems.

Is there a use case for this? Seems like it'd add a lot more complexity to readers to have to check for differing coordinate systems.

In general I think we should have just one way to refer to the CRS. I do also like borrowing geopackage wording, following an existing spec.

@achapkowski
Copy link
Author

@cholmes if you want to parse geojson, you can have mixed types. Feature classes, like shapefiles, only allow single geometries.

The geopackage WKB with additional headers is very attractive because it would not require the use of meta data on the column.

@achapkowski
Copy link
Author

achapkowski commented Mar 11, 2022

I think allowing well known Ids and WKT is a benefit. If you are using WGS1984, a very common well known coordinate system, why do I need to enter in: GEOGCS["WGS 84",DATUM["WGS_1984",SPHEROID["WGS 84",6378137,298.257223563,AUTHORITY["EPSG","7030"]],AUTHORITY["EPSG","6326"]],PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],UNIT["degree",0.0174532925199433,AUTHORITY["EPSG","9122"]],AUTHORITY["EPSG","4326"]]

I know the ID, 4326, off the top of my head. The WKT I had to look it up and use a library to get the syntax.

@jorisvandenbossche
Copy link
Contributor

if you want to parse geojson, you can have mixed types.

That are mixed geometry types, not mixed coordinate reference systems (all the features in a geojson file share a single CRS).
(and mixed geometry types are already supported by using WKB for the geometries)

But currently I think we made a clear choice for a single CRS per file / batch of data.

I think allowing well known Ids and WKT is a benefit. ... I know the ID, 4326, off the top of my head. The WKT I had to look it up and use a library to get the syntax.

Being able to use those authority/code pairs like EPSG:4326 is certainly a convenience as a user. But generally I assume that the WKT will be handled by the software, and requiring WKT for the file metadata doesn't necessarily require that you, as a user, need to use it (depending on the software you are using to read/write the parquet files).
For example, in GeoPandas, you can specify the CRS as EPSG:4326, and geopandas.to_parquet will handle the conversion to WKT for you (actually by using pyproj under the hood).

@achapkowski
Copy link
Author

@jorisvandenbossche that is true that geopandas and other software will handle it, but the whole point is why do I need pyproj if I know my data is EPSG:4326? It adds another layer of complexity/dependency. I think adding CRS TYPE : WKID/WKT solves all issues.

@paleolimbot
Copy link
Contributor

I've come around to mandating WKT2 as the crs field, because it makes it possible to write compute kernels like "ellipsoidal length" (as long as the reader can read WKT2).

If we mandate that, I think it gives us the responsibility of providing highly accessible code (i.e., zero dependency header-only C++, zero dependency R, zero-dependency Python) to generate and interpret that value.

We do have to make sure that our "mandate" of WKT2 only isn't going to be completely ignored, and perhaps a "crs_type" field is a good compromise. I think we really just want to avoid a situation where people are discarding CRS information or providing uninterpretable CRS information.

@kylebarron
Copy link
Member

kylebarron commented Mar 17, 2022

but the whole point is why do I need pyproj if I know my data is EPSG:4326?

Everyone knows what EPSG:4326 means, but I have no idea what IGNF:RGF93 or ESRI:53014 means, to use Even's above example, without looking them up. Therefore, to support WKID in the spec, we'd need a lookup table to be shipped to every environment, which seems like a significant downside in applications such as the web, when a WKT2 string would fully describe the projection without needing a lookup (right?).

A related question then: Are WKT2 strings guaranteed to stay the same for a given EPSG code, say for 4326? Using projinfo gives:

> projinfo EPSG:4326 -o 'WKT2:2019' -q
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]

If this is guaranteed to stay the same in the future, then @achapkowski 's critique could be answered by checking against that WKT string specifically, no?

const WGS84_WKT = 'GEOGCRS["WGS 84", ...'
function isWGS84(wktString) {
  // Maybe add some minimal string processing to remove newlines and spaces
  return wktString === WGS84_WKT;
}

Also, is there a concise reference for the differences in WKT vs WKT2 syntax, apart from the 132 page WKT2 spec? Proj4.js currently only supports WKT1, and I'm curious how much work it would take to add WKT2 support, but I don't know enough about what differs between the two specs. Edit: a presentation with some WKT2 vs WKT1 examples.

@rouault
Copy link

rouault commented Mar 17, 2022

Are WKT2 strings guaranteed to stay the same for a given EPSG code, say for 4326?

no, that one will typically change over time as new members of the WGS84 datum ensemble are added in the EPSG dataset. The values of scope or area could also be amended, etc. PROJ has a proj_is_equivalent() function that can be used to test for strict or laxer equivalence. proj_identify() can also be used to identify an EPSG code from a CRS object instanciated from it (if it lacks an ID[] node typically)

@rouault
Copy link

rouault commented Mar 17, 2022

A better string based approach to identify EPSG:4326 would be to check that the string ends with (not just contains) ID["EPSG",4326]] (however one cannot exclude that the formatting will be slightly different with extra spaces and identation, so one would have first to strip all space, tabulation, newline).
ID["EPSG",4326,URI["urn:ogc:def:crs:EPSG::4326"]]] would also be valid in WKT2 (cf example at end of §7.3.3 "Identifier" of http://docs.opengeospatial.org/is/18-010r7/18-010r7.html#37)

@paleolimbot
Copy link
Contributor

In support of a crs_type field, I wonder if this could be useful to communicate CRS information that was provided to be properly formatted as WKT2 elsewhere. The example that I'm thinking of is writing a shapefile reader that outputs a struct ArrowArrayStream...we have some "crs information" from the shapefile but with no guarantee that it's formatted as WKT2 (I think).

That information can be communicated outside the struct ArrowArrayStream, too, but it seems like that might introduce some complexity (the example I'm thinking of is if we're using the Arrow Compute engine to do a select/filter on non-spatial columns...all of a sudden the library (maybe geopandas) collecting the results from the compute engine has to know about the shapefile reader.

Or maybe it's too early to consider that example (we could always add the crs_type field at the point where that use-case is compelling).

@rouault
Copy link

rouault commented Mar 31, 2022

we have some "crs information" from the shapefile but with no guarantee that it's formatted as WKT2 (I think).

The WKT in a Shapefile .prj is never (the non-deprecated syntax of) WKT2, but the ancient ESRI WKT, which is very close to/an implementation of the > 20 year old spec OGC 99-049 (https://portal.ogc.org/files/?artifact_id=829). Very few formats natively speak WKT2. Typically you'd want to use PROJ to morph ESRI WKT to WKT2.
The WKT2:2019 spec makes references to backward compatibility in http://docs.opengeospatial.org/is/18-010r7/18-010r7.html#20 , but it is not clear to me (reading 'Backward compatibility means that an implementation of the text strings in this document would be able to read CRS WKT strings conforming to the old (ISO 19125-1:2004) syntax') if a WKT2:2019 reader is mandated to accept the older syntax (the previous sentence says "would" and not "should", and there's no formal requirements).

The term WKT2 can also be ambiguous. It could refer to the older spec of 2015 : http://docs.opengeospatial.org/is/12-063r5/12-063r5.html . See https://proj.org/development/reference/cpp/cpp_general.html#general_doc_1WKT2 for some historic of those successive revisions

@achapkowski
Copy link
Author

Can we just do something like this data class?

class SpatialReference:
   type:str # WKT, WKID, WKT2, ESRIWKT, ESRIWKID, etc.... 
   value: str # WKT/WKT2 String or WKID

or as JSON:

{
   "type" : "str # WKT, WKID, WKT2"
   "value" : "str # WKT/WKT2 String or WKID"
}

Then let the reader figure out what each code means? I disagree that we need to bundle the table. It's on the user to know how to use that spatial reference.

This way it handles all past and future cases of spatial reference. So when WKT35 is invented, you just place in the type with the associated emoji string.

Basically this format keeps it simple and universal.

@paleolimbot
Copy link
Contributor

Since the last discussion here, GeoParquet has moved to PROJJSON to represent CRS and can use a null value for the CRS field to indicate that the writer has no information about the CRS.

GeoArrow is different from GeoParquet in that it is an in-memory form and the arrays which it encodes may represent ephemeral or intermediary steps in a computation, and so in #22, I proposed the following:

  • CRS information is written as two fields: crs and crs_encoding. The default crs_encoding should be PROJJSON if producers are able to provide it (e.g., an R or Python language binding with access to a GDAL or PROJ package)
  • Producers can also pass on CRS information in whatever form they receive it, which could mean setting crs_encoding to something else (like WKT2) or omitting it if the producer does not know. I think that is appropriate for the scope of an in-memory encoding: if a user types in "EPSG:26920", the GeoArrow encoding should be able to transport that value to a library that can look that up and convert it to PROJJSON before sending it on to the next library.
  • If the producer has no information about the CRS, they can omit the crs field (the key/value encoding is potentially not JSON, so it's a bit hard to replicate the GeoParquet null vs. missing behaviour here).

Also happy to update that based on feedback!

@achapkowski
Copy link
Author

There is more than FOSS libraries and if you want it to be standard across the ecosystem, the specification needs to be flexible enough to be used for all GIS users.

Having a crs_encoding would be beneficial, but maybe they should be identified like this:

{
crs : '<spatial reference information>'  # string
crs_encoding : "ESRI:WKID" or "PYPROJ:WKT2" # string
}

That way a user would knows that this spatial reference was created via <product> and the <fomat> .

Just thinking outloud, what is everyone's thoughts?

@kylebarron
Copy link
Member

I think a disconnect here (at least for me) is on how to define CRS in a way that Arrow readers don't need to have a separate contract with writers.

One of the goals of the Arrow project is to enable zero-copy reading of data between systems in different languages and frameworks. For non-geospatial data, if a reader implements the minimal C Data Interface, then that reader has everything it needs to know to interpret that memory buffer.

The question is how to achieve a similar outcome with geospatial Arrow memory: where a reader that implements the C Data Interface has full knowledge of what the table contains. This is where I disagree with your premise of let the "reader figure out what each code means", at least while also saying "I disagree that we need to bundle the table". For a geospatial library that implements GeoArrow memory, it could receive that memory from anywhere. It doesn't have a contract with the software or tool that produces that memory.

But if a library receives GeoArrow memory with table metadata saying crs_encoding: ESRI:WKID and crs: 1234, then it seems there's no way to reliably understand that data contained in the table without also having the full lookup table that ESRI:WKID defines.

Above you mentioned

but the whole point is why do I need pyproj if I know my data is EPSG:4326

This is one reason why GeoParquet moved to PROJJSON. With PROJJSON this is easy to assert without having access to pyproj! It's just:

proj_dict = json.loads(PROJJSON_STRING)
assert proj_dict['id']['authority'] == 'EPSG'
assert proj_dict['id']['code'] == '4326'

Non-CRS-aware implementations can check for a specific CRS quickly by only needing to parse JSON, while CRS-aware implementations have the entire geodetic definition.

@paleolimbot
Copy link
Contributor

Non-CRS-aware implementations can check for a specific CRS quickly by only needing to parse JSON, while CRS-aware implementations have the entire geodetic definition.

My hesitation on requiring PROJJSON in the CRS field is that it then requires that every producer must be CRS aware. It would mean that you can't write a standalone GeoPackage reader that outputs an ArrowArrayStream, because GeoPackage stores CRS information as WKT2 (to my knowledge). For me that's the most compelling reason for allowing flexibility in the crs field...it allows the concern of CRS interpretation to be separated from the concern of coordinate shuffling.

I think any producer that can serialize the CRS as PROJJSON should serialize the CRS as PROJJSON. As you mentioned, this enables Arrow-native kernels implementing things like geodesic length to exist without a lookup table.

Worth noting that there are huge swaths of the geospatial world that don't have the capability to write PROJJSON (or even WKT2), though (from the last GeoParquet call it seemed like this was an issue in Java, and it sounds like it might be an issue in parts of the ESRI universe).

it seems there's no way to reliably understand that data contained in the table without also having the full lookup table that ESRI:WKID defines.

Many operations don't need to understand the CRS to do their thing, they just pass on that information in whatever form the received it. Subset, arrange, extract vertices, convex hull, just to name a few. I managed to wrap the whole GEOS C API in R without needing to interpret any CRS values (with the exception of the binary operators, which have to check equality, which is a rather thorny issue that PROJJSON would help with).

That way a user would knows that this spatial reference was created via and the .

I think I'd prefer a known list in the spec (like...crs_encoding: "PROJJSON" refers to the spec at this link; crs_encoding: "WKT2" refers to the spec at this link, etc.). The spec applies when data leaves your application, though, so there's nothing that should keep you from using crs_encoding: "myapp:my_encoding" internally (the worst that would happen is that a consumer will error if it tries to interpret the CRS).

@paleolimbot
Copy link
Contributor

@achapkowski in the proposed encoding in #22 could you set the crs field to ESRI:2948330 (or whatever the ID is)? Perhaps we could formalize a crs_encoding value for AUTHORITY:CODE since it seems that it will get some use?

@achapkowski
Copy link
Author

That would work for me.
So I would do something like:

{
'crs': <product>: <value> as a string
'crs_encoding': <wkid, wkt, wkt2, etc... As a string
}

This information would be stored in the Geometry column then?

@paleolimbot
Copy link
Contributor

Yes, it would be metadata on the geometry column (and I will add some concrete examples regarding what the CRS would look like in the case of projjson, wkt2, and authority_code to #22).

@kylebarron
Copy link
Member

GeoPackage stores CRS information as WKT2

I'm not an authority on GeoPackage but it doesn't appear to be that simple. In the WKT reference in the spec, it says

The OGC GeoPackage standard was adopted prior to the adoption of "12-063r5 OGC Well known text representation of Coordinate Reference Systems" [I34], in 13 August, 2014. As a result, the OGC GeoPackage standard references an older document [I32] which has known ambiguities that are being encountered in the field

So the latest version of the GeoPackage spec defines a new column to be used for WKT:

Values of the definition_12_063 column SHALL be constructed per the WKT syntax in [I34] and [I36].

I think these refer to WKT2:2015 and WKT2:2019 respectively, but that means that some older GeoPackage files won't have that column and will only contain WKT1 strings.

I don't know if there's a good solution here, but this is why I worry about interoperability. If the column metadata supports WKT1, WKT2:2015, WKT2:2019, PROJJSON, and authority code, then that seems like a huge API surface for implementors to cover.

@kylebarron
Copy link
Member

From #8 (comment):

My hesitation on requiring PROJJSON in the CRS field is that it then requires that every producer must be CRS aware. It would mean that you can't write a standalone GeoPackage reader that outputs an ArrowArrayStream, because GeoPackage stores CRS information as WKT2 (to my knowledge).

From #8 (comment):

The WKT in a Shapefile .prj is never (the non-deprecated syntax of) WKT2, but the ancient ESRI WKT, which is very close to/an implementation of the > 20 year old spec OGC 99-049 (portal.ogc.org/files/?artifact_id=829). Very few formats natively speak WKT2. Typically you'd want to use PROJ to morph ESRI WKT to WKT2.

Given this fact, I don't think that "writing a standalone format reader" is a good enough criteria for how we should handle CRS information in geo-arrow. I.e., if a requirement was supporting a standalone reader from Shapefile, we'd also have to support the 20-year-old ESRI WKT. I think this plainly leads into "GeoArrow supports every CRS format that has existed in the last 20 years" which seems like a strictly bad outcome.

@achapkowski
Copy link
Author

That's the general nature of a factured world of Geo. If adopt one format then you exclude everyone else.

Wide adoption should include all major players.

@paleolimbot
Copy link
Contributor

I think this plainly leads into "GeoArrow supports every CRS format that has existed in the last 20 years" which seems like a strictly bad outcome.

I don't think GeoArrow has to support any CRS format...it's merely a vessel that can transport a CRS from one place to another. I think it's reasonable to give an opinion on what that format should be for runtimes and languages with access to the one library (can anything other than PROJ do this?) that can write and validate PROJJSON; I don't think it's reasonable to restrict GeoArrow's usage to those languages and runtimes.

In addition to low-level readers (which realistically is pretty much just me thinking about writing them) I think the excluded environments would be (1) Java/Android, (2) JavaScript/Web, and (3) the ESRI universe (I don't have definitive proof of this, so feel free to correct me). I think all of those should be able to produce GeoArrow arrays and should be able to provide CRS information in a form most easily parsed by a consumer, just as those with access to PROJ should provide CRS information in a form most easily parsed by a consumer (i.e., PROJJSON).

I don't know if there's a good solution here, but this is why I worry about interoperability. If the column metadata supports WKT1, WKT2:2015, WKT2:2019, PROJJSON, and authority code, then that seems like a huge API surface for implementors to cover.

I think it matters quite a bit what exactly you are implementing. Many operations just pass this value through. Those with access to PROJ will probably just call proj_create(), which supports all of those to my knowledge. Those without access to PROJ might benefit from PROJJSON, but also might not...can the ESRI universe ingest PROJJSON? How difficult is it to get a library to do that in Java or JavaScript? The list of accepted values for crs_encoding is a problem...I'm inclined to leave it at just "projjson" with the ability for a user to namespace an encoding (as Andrew suggested).

I don't know the answers to all of the questions here and perhaps their answers will invalidate some of these arguments. I still think the approach of "please give me the CRS information you have and tell me what you think the encoding is" will yield a more inclusive and less error-prone format than "don't use this format if you can't encode your CRS in this one relatively new way".

@rouault
Copy link

rouault commented Jun 11, 2022

the ESRI universe

In part of that universe, there's ArcGIS 9.2 and later which are known to embed GDAL. And PROJ >= 6 is a required dependency of GDAL >= 3. PROJ could be used to convert PROJJSON to ESRI WKT, if the later is what is required to make their Projection Engine (https://support.esri.com/en/technical-article/000002817) happy.

How difficult is it to get a library to do that in Java or JavaScript?

I've a one-file no-dependency Python lib converting (most common CRS formulations from) PROJJSON to WKT2:2019 or WKT1 in https://github.com/rouault/projjson_to_wkt. Could definitely be ported in a few hours in whatever language you want.

@achapkowski
Copy link
Author

Projjson is not supported in the Esriverse as far as I know. Here is a full list: https://developers.arcgis.com/rest/services-reference/enterprise/using-spatial-references.htm

It's quite large.

@paleolimbot
Copy link
Contributor

I've a one-file no-dependency Python lib

Awesome! This will help on the consuming end (I'd love something like it in geoarrow-cpp eventually).

Projjson is not supported in the Esriverse

I don't think this will be a problem on the consuming end because of tools like the one Even has written (and for a CRS with an authority/code identifier, it's trivial to extract given a JSON parser). PROJJSON by far the best CRS representation currently available and I plan to encode all CRSes in any tool that I write using it (in R or Python).

None of those solve the producing end, where a producer from (say) ESRI already has the CRS in some format. If it sends it a computational tool that (say) computes a buffer, I think it's perfectly reasonable to provide a way to pass that CRS to the tool creating the buffer and back to the original caller without either having to re-encode the CRS into something different to comply with the specification. The best zero-dependency tool I can think of to help here would be a lookup table (the last time I did this it compressed to less than 1 MB for every auth/code combo in the PROJ db) or API endpoint (perhaps both exist already).

Workarounds to this are much worse for interoperability than not forbidding a flexible CRS encoding: the (non-proj aware) producer can write the CRS to some other (non-extension type) metadata field (maybe geoarrowshp:prj_file_dump for the shapefile reader I keep threatening to write), the producer can attempt to construct PROJJSON themselves without the ability to validate it, or it can ignore the spec and write whatever it can to the crs field under the (probably good) assumption that whoever is going to read it next will be doing it with proj_create(). (Perhaps I'm missing some other workarounds).

With the crs_encoding + crs field, at least the consumer knows where to look (and can error with a reasonable message if it doesn't support crs_encoding).

@paleolimbot
Copy link
Contributor

Acknowledging that this is a draft specification and we're not sure exactly how everybody can/will use it, I'd propose that we follow the GeoParquet spec and mandate the crs field, if set, to be PROJJSON. I'll update the draft spec shortly with some specific wording.

Just as the GeoPackage standard mandates a CRS representation (or tried/tries to), so shall GeoArrow (until enough evidence is accumulated to suggest otherwise). As a draft specification, I think it's perhaps easier to be strict at first and relax the restriction later rather than the other way around.

For producers and consumers with access to GDAL or PROJ there shouldn't be any problems.

For producers without access to PROJ that want to include valid CRS information, it's rather trivial to create a lookup table between authority:code and PROJJSON (my last attempt was here: https://github.com/paleolimbot/geocrs/blob/master/data-raw/update-proj-db.R#L2-L48 ). As I understand it, WKID can be represented by ESRI:<WKID> in authority:code syntax.

For consumers without access to PROJ, all one needs is a JSON parser. Even's offer to rewrite a converter to WKT2/WKT1 in some other languages is generous, although it may be more direct to go from JSON directly to some data structure.

As a workaround if you really need to pass on CRS information and don't want to re-encode it, I'd suggest using other (non-extension type) metadata fields for now.

@rouault
Copy link

rouault commented Jun 22, 2022

As I understand it, WKID can be represented by ESRI: in authority:code syntax.

Typically WKID <= 32767 belong to EPSG namespace (the PROJ database doesn't duplicate them in the ESRI namespace when they are identical to EPSG), and WKID > 32767 to ESRI namespace

Even's offer to rewrite a converter to WKT2/WKT1 in some other languages is generous

(for the sake of clarity: I meant this was left as an exercise for others, potentially taking inspiration on my Python POC 😁)

@jorisvandenbossche
Copy link
Contributor

Acknowledging that this is a draft specification and we're not sure exactly how everybody can/will use it, I'd propose that we follow the GeoParquet spec and mandate the crs field, if set, to be PROJJSON. I'll update the draft spec shortly with some specific wording.

Given it is only a first iteration, I am fine with for now being conservative and only allowing PROJJSON, since it is always easier to become more flexible in the future than the other way around.

That said, I personally would also argue for being more permissive on this front, following the arguments that @paleolimbot made before. In contrast to GeoParquet, which as a file format we should standardize as much as possible / practical, the GeoArrow in memory encoding can be more flexible IMO, given that it has different use cases. Requiring PROJJSON means that any producer of arrow-encoded memory (custom file reader, Flight service, database connector, ...) needs to be able to construct the proper PROJJSON, while most of the time the source information that this producer has access to will not be PROJJSON (and thus needs access to something PROJ-like).
My feeling is that this won't always be easily possible or desirable, but it is of course hard to guess in advance, and we can reconsider relaxing this later if we get more concrete data points about this.

@kylebarron
Copy link
Member

Closed by #40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants