-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add geometric data types and functions #543
Conversation
extensions/functions_geometry.yaml
Outdated
value: fp32 | ||
return: point | ||
- | ||
name: "st_makeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the point creation function name st_point but this function is st_makeline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the names are from postgis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we correct the naming here to avoid perpetuating their mistake? st_makeline seems to be the problematic one as it's actually performing a join on two arbitrary geometries instead of two points. Or is it making a line between the two geometries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ST_ (Spatial Type) prefix isn't really necessary in Substrait as we're already associating all of the geometry functions within the geometry YAML file. I believe we should probably use MakePoint and MakeLine to be compatible with what PostGIS supports. Some Substrait implementations will only support the 2 argument version of MakePoint which would be the same as Point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, ST_Point()
is in the simple features access standard (which is probably why it is not ST_MakePoint()
). If you want to correct the name it would probably be ST_Line()
but PostGIS names are probably the option that is most likely to get this spec taken seriously by anybody who will actually implement it. You could also just omit any functions you're not sure about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for getting this started!
For now, I wonder if what you want is just:
types:
- name: geometry
# ... is constraining the storage type mandatory?
Getting into the details of exactly how coordinates are stored and which dimensions are stored (e.g., sometimes there's a Z coordinate for elevation or an M coordinate with time) is a rather large can of worms that geospatial people have a lot of opinions about. If you must specify some constraint on the type, you might want to rename the type after the encoding (e.g., call it wkb and specify that the encoding is well-known binary).
Having just one type simplifies the function definitions: many are "just" geometry in, geometry out. I don't know if you want to get into the weeds on list
and structure
with point
, linestring
, polygon
, multipoint
, multielinestring
, and multipolygon
...every function treats those differently and I'm not sure that any two GIS engine people would give you the same answer if you asked them to come up with that list. Even then you still need a generic geometry
(e.g., st_intersection(polygon, polygon)
might give you any combination of points, lines, and/or polygons). You can see how the GeoArrow memory layout spec deals with this if you really want to be specific about types.
One choice you may have to make is whether to add a string parameter "crs" to the type. This is sort of like "timezone"...geospatial people care quite a lot about it. PostGIS handles this by making it datum specific (every value has its own uint32 SRID, which refers to a row in an SRID table); the GeoArrow metadata specification makes it part of the type. You could also punt and say that if it's missing, the CRS is "OGC:CRS84", which is GIS speak for "longitude/latitude". This is what the GeoParquet specification does.
Another term you may see is "GEOGRAPHY"...PostGIS has this type and so does BigQuery and Snowflake. GeoArrow handles this as a type parameter (edges: spherical | planar
, with spherical being the one that corresponds to GEOGRAPHY). Geography is always longitude/latitude and never has Z or M values.
dc03bbb
to
ac6f82a
Compare
Thanks for the feedback @paleolimbot! For now i'll stick with just adding a new geometry extension type you suggested. That also makes the new function specs easier. |
extensions/functions_geometry.yaml
Outdated
value: fp32 | ||
return: point | ||
- | ||
name: "st_makeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we correct the naming here to avoid perpetuating their mistake? st_makeline seems to be the problematic one as it's actually performing a join on two arbitrary geometries instead of two points. Or is it making a line between the two geometries?
@EpsilonPrime i'll update the description for |
extensions/functions_geometry.yaml
Outdated
value: fp32 | ||
return: point | ||
- | ||
name: "st_makeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ST_ (Spatial Type) prefix isn't really necessary in Substrait as we're already associating all of the geometry functions within the geometry YAML file. I believe we should probably use MakePoint and MakeLine to be compatible with what PostGIS supports. Some Substrait implementations will only support the 2 argument version of MakePoint which would be the same as Point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments!
You may also be interested in the Simple Features Access standard for SQL which defines these functions and their return types (link and a few relevant sections pasted below). The functions that never made it to PostGIS are probably not worth including.
Link: https://portal.ogc.org/files/?artifact_id=25354
Methods on abstract geometry class:
METHOD ST_Dimension()
RETURNS SMALLINT
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_GeometryType()
RETURNS CHARACTER VARYING(ST_MaxTypeNameLength)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_AsText()
RETURNS CHARACTER LARGE OBJECT(ST_MaxGeometryAsText)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT
RETURNS NULL ON NULL INPUT,
METHOD ST_AsBinary()
RETURNS BINARY LARGE OBJECT(ST_MaxGeometryAsBinary)
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_IsEmpty()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_IsSimple()
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Boundary()
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Envelope()
RETURNS ST_Polygon
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Equals(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Disjoint(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Intersects (ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Touches(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Crosses(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Within (ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Contains(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Overlaps(ageometry ST_Geometry)
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Relate(ageometry ST_Geometry, amatrix CHARACTER(9))
RETURNS INTEGER
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Distance(ageometry ST_Geometry)
RETURNS DOUBLE PRECISION
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Intersection(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Difference(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Union(ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_SymDifference (ageometry ST_Geometry)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Buffer (adistance DOUBLE PRECISION)
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_Buffer ( adistance DOUBLE PRECISION,
aunit CHARACTER VARYING(ST_MaxUnitNameLength))
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT,
METHOD ST_ConvexHull()
RETURNS ST_Geometry
LANGUAGE SQL
DETERMINISTIC
CONTAINS SQL
RETURNS NULL ON NULL INPUT
...there are more but you probably don't need all of them?
extensions/functions_geometry.yaml
Outdated
value: fp32 | ||
return: point | ||
- | ||
name: "st_makeline" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, ST_Point()
is in the simple features access standard (which is probably why it is not ST_MakePoint()
). If you want to correct the name it would probably be ST_Line()
but PostGIS names are probably the option that is most likely to get this spec taken seriously by anybody who will actually implement it. You could also just omit any functions you're not sure about.
0196131
to
e750368
Compare
extensions/functions_geometry.yaml
Outdated
value: fp64 | ||
- name: y | ||
value: fp64 | ||
return: geometry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
geometry
as a type would only be valid if it was a core substrait type (e.g i64
). To reference a user-defined type, the syntax would be u!geometry
.
Currently though, there's a limitation where types from one extension file can't be referenced from a different extension file: #496
This means that you can't reference the geometry
type from extension_types.yaml
(or really any of the types) in this file.
One way to work around this for now would be to define geometry
and the other types we need directly in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Made some updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this should work nicely. Thanks for handling all the iterations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working through all of these details! Geometry is hard but I am excited to this addition (+ future additions that it enables!) to the spec!
- name: x | ||
value: u!geometry | ||
- name: y | ||
value: u!geometry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick (and I also don't know if substrait has something like positional args where the actual name doesn't matter, in which case it is only for documentation): you might want to use different argument names than x
and y
, since naively that would easily be interpreted as x and y coordinates (like in point()
), while here it's rather "point 1" and "point 2".
So I would rather go with something like geom1
/geom2
, or if preferring something generic, eg a
/b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh...I think I missed this when we decided to change to using the opaque geometry type. Thanks for catching this! I'm addressing it in the PR: https://github.com/substrait-io/substrait/pull/554/files
PR to add new geometric data types and functions.