Hello, I was wondering how the geometry column of a GeoParquet file (example) can be read in Spark. Since Spark supports the Parquet format, which approach would be good for making the BYTE_ARRAY (WKB-encoded) column readable via DataFrames? Thanks.
Replies: 1 comment
Hey @ashar236, there should be no problems: it will be read as `BinaryType` by Spark. You'd need UDFs or Catalyst expressions to interpret it as a `Geometry` if needed (you can use JTS for these purposes); the usage looks something like this (via a simple UDF):

```scala
import org.apache.spark.sql.functions.udf
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io._
import spark.implicits._ // for the $"..." column syntax (pre-imported in spark-shell)

// I feel like you'd need a Geometry ExpressionEncoder to keep it in this shape;
// another option is to add a Spark UDT to make it a typed column
val geomFromWKB = udf { arr: Array[Byte] =>
  // can be improved to reuse the WKBReader instead of creating a new one
  // every time (see the sketch below)
  new WKBReader().read(arr)
}

val df = spark.read.parquet("/path/to/parquet/file").select(geomFromWKB($"geom"))
```
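Both comments above can be sketched concretely. A minimal sketch, assuming the column is named `geom` as above: a Kryo `Encoder` stands in generically for a dedicated ExpressionEncoder/UDT (it keeps `Geometry` as a typed `Dataset` column, though stored as opaque binary that SQL functions can't look into), and `mapPartitions` reuses one `WKBReader` per partition instead of allocating one per row. The names `geometryEncoder` and `geoms` are illustrative.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.locationtech.jts.geom.Geometry
import org.locationtech.jts.io.WKBReader
import spark.implicits._

// generic stand-in for a dedicated Geometry ExpressionEncoder/UDT:
// values stay typed but are stored as opaque Kryo-serialized binary
implicit val geometryEncoder: Encoder[Geometry] = Encoders.kryo[Geometry]

val geoms = spark.read.parquet("/path/to/parquet/file")
  .select($"geom").as[Array[Byte]]
  // one WKBReader per partition instead of one per row (WKBReader is not thread-safe)
  .mapPartitions { bytesIter =>
    val reader = new WKBReader()
    bytesIter.map(bytes => reader.read(bytes))
  }
```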
To add JTS types you may use GeoMesa (the geomesa-spark-jts package): it adds JTS encoders as well as other functions. Internally it stores geometries in a slightly different shape, but it could be good to use as an example; a sketch of how that might look follows.
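A hedged sketch of that usage, assuming the geomesa-spark-jts artifact is on the classpath, that `withJTS` and the `st_*` DataFrame functions come in through its package import, and that the column is named `geom` as above:

```scala
import org.locationtech.geomesa.spark.jts._
import spark.implicits._

// registers the JTS user-defined types and st_* functions on the session
spark.withJTS

val typed = spark.read.parquet("/path/to/parquet/file")
  .select(st_geomFromWKB($"geom") as "geom") // typed Geometry column, no hand-rolled UDF
```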
We also have Hive UDFs (supported by Spark) for these purposes, that live here:
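Before the SQL below can run, those functions have to be registered on the session. A hedged sketch using Spark's standard Hive UDF registration syntax; the JAR path and implementation class names are placeholders rather than the real coordinates of the UDFs mentioned above, and a Hive-enabled Spark build is assumed:

```scala
// placeholder JAR and class names; substitute the real ones from the UDF project
spark.sql("ADD JAR /path/to/geo-udfs.jar")
spark.sql("CREATE TEMPORARY FUNCTION ST_GeomFromWKB AS 'com.example.udf.STGeomFromWKB'")
spark.sql("CREATE TEMPORARY FUNCTION ST_AsText AS 'com.example.udf.STAsText'")
```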
And with these UDFs it will look the following way:

```scala
spark
  .read
  .parquet("/path/to/parquet/file")
  .createOrReplaceTempView("parquet_table")

// a SQL query example; you can stay in the SQL world for as long as you need,
// with all UDFs available, and extract the result in whatever form you need
spark.sql("SELECT ST_AsText(ST_GeomFromWKB(geom)) FROM parquet_table")
```

I hope it answers your question.