Hello, I was wondering how the geometry column of a GeoParquet file (example) can be read in Spark. Since Spark supports the Parquet format, which approach would be good for making the BYTE_ARRAY (WKB-encoded) column readable via DataFrames? Thanks.
Replies: 1 comment
Hey @ashar236, there should be no problems: it will be read as `BinaryType` by Spark. You'd need UDFs or Catalyst expressions to interpret it as a `Geometry` if needed (you can use JTS for these purposes); the usage looks something like this (via a simple UDF):

```scala
import org.apache.spark.sql.functions.udf
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io._
import spark.implicits._ // for the $"..." column syntax (pre-imported in spark-shell)

// I feel like you'd need a Geometry ExpressionEncoder to keep it in this shape;
// another option is to add a Spark UDT to make it a typed column
val geomFromWKB = udf { arr: Array[Byte] =>
  // can be improved to reuse the WKBReader instead of creating a new one
  // every time (see the sketch below)
  new WKBReader().read(arr)
}

val df = spark.read.parquet("/path/to/parquet/file").select(geomFromWKB($"geom"))
```
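Both comments above can be sketched concretely. A minimal sketch, assuming the column is named `geom` as above: a Kryo `Encoder` stands in generically for a dedicated ExpressionEncoder/UDT (it keeps `Geometry` as a typed `Dataset` column, though stored as opaque binary that SQL functions can't look into), and `mapPartitions` reuses one `WKBReader` per partition instead of allocating one per row. The names `geometryEncoder` and `geoms` are illustrative.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKBReader
import spark.implicits._

// generic stand-in for a dedicated Geometry ExpressionEncoder/UDT:
// values stay typed but are stored as opaque Kryo-serialized binary
implicit val geometryEncoder: Encoder[Geometry] = Encoders.kryo[Geometry]

val geoms = spark.read.parquet("/path/to/parquet/file")
  .select($"geom").as[Array[Byte]]
  // one WKBReader per partition instead of one per row (WKBReader is not thread-safe)
  .mapPartitions { bytesIter =>
    val reader = new WKBReader()
    bytesIter.map(bytes => reader.read(bytes))
  }
```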
To add JTS types you may use GeoMesa (the geomesa-spark-jts package): it adds JTS encoders as well as other functions. Internally it stores geometries in a slightly different shape, but it could be good to use as an example; a sketch of how that might look follows.
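A hedged sketch of that usage, assuming the geomesa-spark-jts artifact is on the classpath, that `withJTS` and the `st_*` DataFrame functions come in through its package import, and that the column is named `geom` as above:

```scala
import org.locationtech.geomesa.spark.jts._
import spark.implicits._

// registers the JTS user-defined types and st_* functions on the session
spark.withJTS

val typed = spark.read.parquet("/path/to/parquet/file")
  .select(st_geomFromWKB($"geom") as "geom") // typed Geometry column, no hand-rolled UDF
```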
We also have Hive UDFs (supported by Spark) for these purposes, that live here:
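Before the SQL below can run, those functions have to be registered on the session. A hedged sketch using Spark's standard Hive UDF registration syntax; the JAR path and implementation class names are placeholders rather than the real coordinates of the UDFs mentioned above, and a Hive-enabled Spark build is assumed:

```scala
// placeholder JAR and class names; substitute the real ones from the UDF project
spark.sql("ADD JAR /path/to/geo-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION ST_GeomFromWKB AS 'com.example.udf.STGeomFromWKB'")
spark.sql("CREATE TEMPORARY FUNCTION ST_AsText AS 'com.example.udf.STAsText'")
```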
And with these UDFs it will look the following way:

```scala
spark
  .read
  .parquet("/path/to/parquet/file")
  .createOrReplaceTempView("parquet_table")

// a SQL query example; you can stay in the SQL world for as long as you need,
// with all UDFs available, and extract the result in whatever form you need
spark.sql("SELECT ST_AsText(ST_GeomFromWKB(geom)) FROM parquet_table")
```

I hope it answers your question.