Magellan "index" column in parquet not reused for joins #203

Closed
zebehringer opened this issue Mar 12, 2018 · 1 comment

I wanted to pre-generate the index for a very large set of polygons (loaded from a Shapefile) and store it as Parquet so that I can reuse it in frequent production processes, but the ZOrderCurve-typed column named "index" seems to be ignored when joining the Parquet data with a list of points.

import org.apache.spark.sql.types._
import magellan.{Point, Polygon}
import org.apache.spark.sql.magellan.dsl.expressions._

val schema = new StructType(Array(
    StructField("latitude",           DoubleType,       false),
    StructField("longitude",          DoubleType,       false)
))

val sample = spark.read.schema(schema).option("header",true).csv("./sample.csv.gz")

magellan.Utils.injectRules(spark)

// Run once beforehand to persist the indexed polygons as a Parquet-backed table:
//spark.read.format("magellan").load("s3://myBucket/my_shapefile_folder")
//    .withColumn("index", $"polygon" index 15)
//    .selectExpr("polygon", "index", "metadata.ID AS id")
//    .write.saveAsTable("shapes")

sample.join(spark.table("shapes"), point($"longitude",$"latitude") within $"polygon").explain()

Here's the plan:

== Physical Plan ==
*Project [id#7, longitude#1, latitude#0, polygon#5, index#6]
+- *BroadcastHashJoin [curve#245], [curve#247], Inner, BuildLeft, ((relation#248 = Within) || Within(pointconverter(longitude#1, latitude#0), polygon#5))
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[2, struct<xmin:double,ymin:double,xmax:double,ymax:double,precision:int,bits:bigint>, true]))
   :  +- Generate inline(indexer(pointconverter(longitude#1, latitude#0), 30)), true, false, [curve#245, relation#246]
   :     +- *FileScan csv [latitude#0,longitude#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/ec2-user/sample.csv.gz], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<latitude:double,longitude:double>
   +- Generate inline(indexer(polygon#5, 30)), true, false, [curve#247, relation#248]
      +- *FileScan parquet default.df3[polygon#5,index#6,id#7] Batched: false, Format: Parquet, Location: InMemoryFileIndex[file:/home/ec2-user/spark-warehouse/df3], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<polygon:struct<type:int,xmin:double,ymin:double,xmax:double,ymax:double,indices:array<int>...
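
Note that the Parquet scan does project index#6, yet the join still generates inline(indexer(polygon#5, 30)) over it, i.e. the polygons are re-indexed at query time instead of the stored curves being reused. To double-check what was actually persisted, something like this should do (untested sketch):

// Inspect the persisted table: the precomputed "index" column should appear
// here with whatever type and nullability Parquet recorded for it.
spark.table("shapes").printSchema()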
harsha2010 (Owner) commented Mar 12, 2018

@zebehringer can you give this PR a try? I think the issue is that the column's nullability is reset (a bug in Spark SQL) when Spark SQL writes to Parquet, and when we read it back this causes a schema mismatch.
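
A quick way to check that hypothesis with standard Spark APIs (untested sketch; table and column names follow the snippet above):

// Compare the nullability Parquet recorded for the stored "index" column with
// that of a freshly computed index; a mismatch would explain the schema mismatch.
import org.apache.spark.sql.magellan.dsl.expressions._
import spark.implicits._

val stored = spark.table("shapes").schema("index")
val fresh = spark.table("shapes")
  .withColumn("index_fresh", $"polygon" index 15)
  .schema("index_fresh")

println(s"stored: nullable = ${stored.nullable}, type = ${stored.dataType.simpleString}")
println(s"fresh:  nullable = ${fresh.nullable}, type = ${fresh.dataType.simpleString}")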

harsha2010 added this to the 1.0.6 milestone Mar 16, 2018